GDP Executive Summary

This study explored selected economic factors that may influence Gross Domestic Product (GDP) per capita at the county level in the U.S. GDP data for 2018 were available from the Bureau of Economic Analysis (BEA), and 2017 data on business patterns, net migration, and population were available from the U.S. Census Bureau. Analyzing data at the county level rather than the state level was important because it allows us to examine the economic diversity within each state that state-level figures would hide. In total, six models were tested. The Gradient Boosting Machine (GBM) model performed best across the performance measures, with an R-squared value of approximately 55%.

The main drivers of a county’s GDP per capita in this analysis were median household income, population, and the domestic migration rate. One interesting finding was that domestic migration and GDP are negatively correlated when examined outside the models: counties with higher domestic migration rates tended to have lower GDP per capita. Other variables, such as the number of retail stores and construction sites, were important and were negatively correlated with GDP according to the correlation analysis. Important variables that were positively correlated with GDP included the international immigration rate and the number of wholesale trade establishments in a given county. In addition, the GBM model ranked population estimate, median household income, and domestic migration rate as its top three variables, in that order. The R-squared value of 55% indicates that, while some important drivers of county-level GDP were identified, more work is needed to identify the remaining factors.

U.S. County GDP data

The top 10 counties by GDP per capita in the U.S. are all in Texas. However, these counties all have population estimates under 8,000, so their GDP per capita is driven largely by their small populations and does not necessarily indicate economic prosperity. For example, while the GDP per capita of Loving County is very high, its economy depends on oil and gas production, a natural resource-intensive industry not available to many U.S. counties.

To get a better indication of GDP values for typical U.S. counties, the top 122 counties (4.37% of the data) were removed from the data set for this initial examination. The complete data set was still used for modeling; removing the outliers only made the distribution closer to normal for viewing. While Texas is still represented in the top 10 of the reduced data, the group of states is more diverse and includes Oregon, Kansas, and Massachusetts. Also, Middlesex County in Massachusetts and Hennepin County in Minnesota both have populations over one million. These counties are part of the Boston and Minneapolis metro areas, respectively, which suggests that these cities could be seeing prosperous growth.

The following is a list of the bottom ten counties by GDP per capita. The results support what we saw in the map, since the states represented are all in the southeastern or midwestern part of the country, with Kentucky accounting for six of the bottom ten.

The histogram of the data after the outliers were removed is shown below. It is skewed right; however, using only counties with a GDP per capita under 100 allows us to see where a typical U.S. county falls on the GDP scale. With the outliers removed, the mean GDP per capita decreased from 861.44 to 39.78.
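The effect of trimming on the mean can be illustrated with a small sketch. This is not the study's code (the models were fit in R), and the values below are hypothetical; the cutoff of 100 is taken from the text, while the function and variable names are assumptions made for illustration.

```python
from statistics import mean

def trim_above(values, cutoff):
    """Keep only values strictly below the cutoff."""
    return [v for v in values if v < cutoff]

# Hypothetical GDP per capita values; NOT the study's data.
gdp_per_capita = [28.0, 35.5, 41.2, 52.9, 97.0, 450.0, 12000.0]

trimmed = trim_above(gdp_per_capita, 100)
removed_share = 1 - len(trimmed) / len(gdp_per_capita)

# A few extreme values dominate the untrimmed mean.
mean_before = mean(gdp_per_capita)
mean_after = mean(trimmed)
```

Because the mean is sensitive to extreme values, removing a small share of very high observations can lower it substantially, which is the pattern described above.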

Statistical Modeling of U.S. County GDP data

Six models of increasing complexity were used to examine the GDP data in this analysis. The three linear models were linear regression (LM), partial least squares (PLS), and elastic net (ENET). The three non-linear models were support vector machines (SVM), multivariate adaptive regression splines (MARS), and a gradient boosting machine (GBM).

To test each model, the data were split into 70% training and 30% testing sets. Within the training set, 5-fold cross-validation was used to select tuning parameters for every model except LM. The response variable (2018 per capita GDP) was log10 transformed. Predictors were removed if they had near-zero variance or had a correlation greater than 0.8 with other variables, in order to reduce multicollinearity. In addition, each variable was centered and scaled to a mean of 0 and a standard deviation of 1, which eliminates scale differences across the variables. Finally, the Yeo-Johnson transformation was applied to make the data more nearly normal.
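A minimal sketch of three of these steps (the 70/30 split, the log10 response transform, and centering and scaling) follows. It is written in plain Python for illustration and is not the study's code. The simple random split, the helper names, and the toy data are all assumptions, and the near-zero variance filter, correlation filter, and Yeo-Johnson step are omitted.

```python
import math
import random
from statistics import mean, stdev

def train_test_split(rows, train_frac=0.7, seed=0):
    """Shuffle row indices and split rows into training and test sets.
    A plain random split is an assumption made for this sketch."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    cut = round(train_frac * len(rows))
    return [rows[i] for i in idx[:cut]], [rows[i] for i in idx[cut:]]

def fit_scaler(column):
    """Return the mean and sample standard deviation of a training column."""
    return mean(column), stdev(column)

def apply_scaler(column, mu, sd):
    """Center and scale a column using statistics learned on the training set."""
    return [(x - mu) / sd for x in column]

# Hypothetical (predictor, GDP per capita) pairs; NOT the study's data.
rows = [(float(i), 10.0 * (i + 1)) for i in range(10)]

train, test = train_test_split(rows)
y_train = [math.log10(y) for _, y in train]  # log10-transformed response

mu, sd = fit_scaler([x for x, _ in train])
x_train = apply_scaler([x for x, _ in train], mu, sd)
x_test = apply_scaler([x for x, _ in test], mu, sd)  # reuse training mean and sd
```

Applying the training-set mean and standard deviation to the test set keeps information from the test data out of the preprocessing.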

Models were fit using the caret package in R.

Cross-validation Performance Results on the Training Set

After the best tuning parameters were selected for each model, each cross-validation (CV) run gave performance measures for its hold-out fold. The average of the five RMSE and R-squared values, plus or minus one standard deviation, appears below for each model. GBM performed best (R-squared of about 55% and RMSE of 0.39). The results also indicate that the three non-linear models (GBM, MARS, and SVM) were the best three models for this data, which suggests that a non-linear model is needed to fit this data well.
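The fold-level summary described here can be sketched as follows. This is an illustration in plain Python, not the study's code. R-squared is computed as 1 - SS_res / SS_tot, which is one common definition and may differ from the one used by the original tooling, and the fold values are hypothetical.

```python
import math
from statistics import mean, stdev

def rmse(observed, predicted):
    """Root mean squared error between observed and predicted values."""
    return math.sqrt(mean([(o - p) ** 2 for o, p in zip(observed, predicted)]))

def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    ybar = mean(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - ybar) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

def summarize(fold_values):
    """Mean and sample standard deviation of a metric across CV folds."""
    return mean(fold_values), stdev(fold_values)

# Hypothetical per-fold RMSE values for one model; NOT the study's results.
fold_rmse = [0.38, 0.40, 0.39, 0.41, 0.37]
rmse_mean, rmse_sd = summarize(fold_rmse)
```

Reporting the mean plus or minus one standard deviation gives a rough sense of how much a model's performance varies from fold to fold.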

Performance Results on the Test Set

The following table shows the results of running each model on the test set. As with the training data, the GBM produced the lowest RMSE and the highest R-squared. Its test set RMSE and R-squared were both within one standard deviation of the corresponding cross-validation values on the training data, so there doesn’t appear to be over-fitting.
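The over-fitting check described above amounts to comparing a test set metric with the cross-validation mean and standard deviation. A minimal sketch, with hypothetical numbers and an assumed helper name:

```python
def within_one_sd(test_value, cv_mean, cv_sd):
    """True if the test metric lies within one CV standard deviation of the CV mean."""
    return abs(test_value - cv_mean) <= cv_sd

# Hypothetical values for illustration; NOT the study's reported results.
cv_rmse_mean, cv_rmse_sd = 0.40, 0.02
test_rmse = 0.41

rmse_ok = within_one_sd(test_rmse, cv_rmse_mean, cv_rmse_sd)
```

This is an informal heuristic rather than a formal test: a test metric far outside the cross-validation range would be a warning sign, while one inside it is consistent with the model generalizing as expected.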

This plot shows the individual results of each observed test set GDP point vs. its model-predicted value for the GBM model. The observed vs. predicted plots for the rest of the models are in the appendix.

Variable Importance

Variable importance is a measure that indicates how influential each predictor is in determining the model’s results. The most important variable is always given a score of 100, and the rest of the variables are assigned a score between 0 and 100 that is relative to the first one. For each model we tested, we produced a variable importance ranking to tell us which variables contributed the most to GDP per capita in that model. To combine the results across models, we considered each variable’s average rank, how often it appeared in the top 10 across models, and its average importance score. These dimensions are displayed in the plot below.

Domestic migration rate and median household income were in the top ten for each model, and had an average ranking of 2. Population was important, indicating that there are population effects beyond those adjusted for by using per capita values. Other variables such as the number of retail stores, wholesale trade establishments and construction sites, and international immigration rates appeared to be important in determining county per capita GDP.
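The way importance scores can be rescaled and combined across models is sketched below. This is an illustration, not the study's code. Dividing by the maximum is an assumed rescaling (other rescalings, such as min-max, also give the top variable a score of 100), and the model names, variable names, and raw scores are hypothetical.

```python
from statistics import mean

def scale_importance(raw):
    """Rescale raw importances so the top variable scores 100 (assumed rescaling)."""
    top = max(raw.values())
    return {name: 100.0 * value / top for name, value in raw.items()}

def rank_variables(scores):
    """Rank variables from 1 (most important) downward."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: pos for pos, name in enumerate(ordered, start=1)}

def combine_across_models(per_model, top_n=10):
    """Average rank, top-n frequency, and average score for each variable.
    Assumes every model reports the same set of variables."""
    variables = sorted(next(iter(per_model.values())))
    summary = {}
    for var in variables:
        var_ranks = [rank_variables(scores)[var] for scores in per_model.values()]
        summary[var] = {
            "avg_rank": mean(var_ranks),
            "top_n_count": sum(r <= top_n for r in var_ranks),
            "avg_score": mean([scores[var] for scores in per_model.values()]),
        }
    return summary

# Hypothetical raw importances for two models; NOT the study's results.
per_model = {
    "model_a": scale_importance({"population": 8.0, "income": 6.0, "migration": 2.0}),
    "model_b": scale_importance({"population": 3.0, "income": 5.0, "migration": 4.0}),
}
summary = combine_across_models(per_model, top_n=2)
```

Looking at average rank, top-n frequency, and average score together guards against a variable that ranks highly in only one model dominating the combined picture.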

Correlation Matrix

One limitation of the variable importance analysis is that it doesn’t indicate the direction of a variable’s effect on the response. The correlation plot below shows the direction in which each variable is associated with GDP, at least in a univariate sense. For example, median household income and the number of wholesale trade establishments are positively correlated with GDP. On the other hand, the number of retail establishments and the domestic migration rate are negatively correlated with GDP. For the retail sector, this could reflect a rise in e-commerce that has caused certain retail stores to struggle and even go out of business. The negative domestic migration relationship could be due to saturation in local labor or housing markets, where supply may be too great for demand in both cases. Viewing data from a single time period allows us to analyze what happened during 2018 and to look at what events could have increased or decreased GDP based on how a variable performed.
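The sign of a univariate relationship can be read from the Pearson correlation coefficient. The sketch below is illustrative only: the data are hypothetical and the variable names are assumptions, chosen to show one positive and one negative relationship.

```python
import math
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical county-level values; NOT the study's data.
log_gdp = [1.4, 1.5, 1.7, 1.8]
income = [40.0, 50.0, 60.0, 70.0]   # rises with log GDP
migration = [5.0, 2.0, -1.0, -3.0]  # falls as log GDP rises

r_income = pearson(income, log_gdp)        # positive sign expected
r_migration = pearson(migration, log_gdp)  # negative sign expected
```

A correlation of this kind describes only a pairwise linear association. It does not account for the other predictors, which is why it complements the model-based importance scores rather than replacing them.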

Appendix

The following shows the mean absolute error (MAE) of each model on the training set. Even though the SVM has a wider margin of error than the GBM, it still performed slightly better on this measure. However, for this study, we focused more on RMSE and R-squared.

The following is a ranked list of the important variables for the GBM model. The most important variable was automatically given a value of 100, and the closer a variable’s value is to 100, the more important it is. Population estimate was the most important variable in the GBM, with median household income not far behind at 96.14. The importance levels off somewhat after those two, but net domestic migration rate and the number of construction establishments were also ranked highly.

The following five plots show the observed vs. predicted values on the test set for the LM, PLS, ENET, SVM, and MARS models.

The following were the tuning parameters for the GBM model on the training set. The lowest RMSE was achieved with a shrinkage of 0.01 and a maximum tree depth of seven. The RMSE appears to level off after about 600 boosting iterations.