A n application o f Bayesian variable selection to international econ om ic data
By
X ian g Tian, B .A .
A P ro je ct S ubm itted in Partial Fulfillment o f the Requirem ents
for the Degree o f
M aster o f science in
Statistics
U niversity o f Alaska Fairbanks
June 2017
A P P R O V E D :
Scott G od d a rd , C om m ittee Chair R on Barry, C om m ittee M em ber M argaret Short, C om m ittee M em ber Julie M cIntyre, C om m ittee M em ber Leah Berm an, Chair
A b s t r a c t
G D P plays an im portant role in p eop le's lives. For exam ple, w hen G D P increases, the un em ploym ent rate will frequently decrease. In this p roject, we will use four different Bayesian variable selection m ethods to verify econ om ic theory regarding im portant predictors to G D P. T he four m ethods are: g-prior variable selection w ith credible intervals, local em pirical Bayes w ith credible intervals, variable selection by indicator function, and hyper-g prior variable selection. A lso, we will use four measures to com pare the results o f the various Bayesian variable selection m ethods: A IC , B IC , A d ju ste d -R squared and cross-validation.
K eyw ords: G D P , Bayesian statistical m ethods, M arkov chain M on te C arlo, Bayesian variable selection, g-prior.
Ta b l e o f Co n t e n t s T IT L E P A G E ... 1 A B S T R A C T ... 2 T A B L E O F C O N T E N T S ...3 S E C T IO N 1: In troduction ... 4 S E C T IO N 2: D a t a ... 5 S E C T IO N 3: M e t h o d s ...6
3.1 M ultiple linear regression ...6
3.2 D ata transform ation and m odel diagnostics ... 9
3.3 Bayesian linear r e g r e s s i o n ... 14
3.4 Bayesian variable selection ... 16
3.4.1 Basic g-prior w ith credible interval ... 16
3.4.2 T h e local Em pirical Bayes approach ...17
3.4.3 Indicator variable selection m e t h o d ... 17
3.4.4 H yper-g prior w ith credible intervals ...18
3.5 Measures for com paring the m e th o d s’ results ... 19
3.5.1 Akaike inform ation criterion (A IC ) ... 19
3.5.2 Bayesian inform ation criterion ( B I C ) ... 20
3.5.3 A d ju sted R -squared ... 20
3.5.4 C ross-validation ... 21
S E C T IO N 4: Results ... 22
S E C T IO N 5: Discussion and future work ... 23
A cknow ledgem ents ... 24
R E F E R E N C E S ... 25
1
Introduction
In this paper, we will select m odels for predictin g a n a tio n ’s gross dom estic p rod u ct (G D P ) using a set o f candidate predictors. G D P plays an im portant role in p e o p le ’s lives. For exam ple, standard econ om ic theory predicts that when G D P increases, the unem ploy m ent rate will decrease. As m any econom ists have n oted, G D P is a flawed measure o f econ om ic welfare. Leisure, inequality, m ortality, m orbidity, crime, and the natural environ m ent are just som e o f the m a jor factors affecting living standards w ithin a country, and these factors are in corporated im perfectly, if at all, in G D P. It is nevertheless w orth m od el ing because G D P is so im portant.
Since G D P is so im portant, m any econom ists have built m odels to predict it. Huerta and Freitas L opes (2000) analyzed the Brazilian industrial p rod u ction index using a Bayesian tim e series m eth od to fit a Bayesian m odel and make short-term forecasts. In our paper, posterior estim ates and predictors are obtain ed using M arkov chain M onte C arlo (M C M C ) m ethods based on the G ibbs sampler. There are som e differences between our p ro je ct and the H uerta e ta l’s paper. First, we fit a m odel using data from m any nations instead o f using just one country's industrial p rod u ction index. Secondly, whereas Huerta and Freitas Lopes collected the Brazilian's industrial p rod u ction index from m ultiple years, 1980 to 1998, we use only the G D P o f 2010. A s another exam ple, Fang and M iller (2008) analyzed the volatil ity o f real G D P grow th for Japan using a tim e series m odel.
In addition to m odelin g G D P, this paper is concerned w ith com paring several Bayesian variable selection m ethods. Bayesian variable selection (B V S ) has a long history (Zellner 1971; Leam er 1978; M itchell and B eaucham p 1988). T h e advent o f M arkov chain M onte Carlo m eth ods catalyzed the developm ent o f Bayesian m odel selection and Bayesian m odel averaging in regression m odels (G eorge and M cC u lloch 1993; Sm ith and K oh n 1996; Raftery, M adigan, and H oeting 1997; H oeting, M adigan, Raftery, and Volinsky 1999; C lyde and G eorge 2004). Bayesian variable selection is a class o f m ethods w ithin the Bayesian paradigm that is used for selecting im portant predictors from am ong a set o f candidate predictors. One advantage o f B V S over frequentist m ethods like L A SSO and S C A D is having p osterior dis tributions on param eters, which gives us the ability to quantify our uncertainty abou t them.
A n oth er advantage is having posterior predictive distributions, which gives us the ability to simulate data for pred iction and quantify uncertainty in them . T h e general types o f B V S are posterior-based m ethods, Bayes factors-based m ethods, and inform ation criteria. In this p roject, we will use four different posterior-based m ethods to m odel the relationship between our response variable (G D P ) and ten candidate predictors. T he four variable selection m eth ods are: g-priors w ith credible intervals, em pirical Bayesian g-priors w ith credible intervals, in dicator variable selection, and hyper g-priors w ith credible intervals. These variable selec tion m ethods will b e com pared in order to determ ine which perform s the best.
W e have also chosen to use four ways to com pare the perform ance o f our selected sets o f variables: A IC , B IC , adjusted R -squared, and cross validation.
T h e rem ainder o f the paper is organized as follows: in Section 2, we introduce the orig inal data; in Section 3, we introduce the four Bayesian variable selection m eth ods and four measures to com pare the perform ance o f the various Bayesian variable selection m ethods; in Section 4, we provide the results o f the Bayesian variable selection m ethods applied to our data set; the last section contains discussion and future work.
2
Data
W e obtain ed our data from the W orld Bank website (2017). T h e data set contains in form ation on 217 countries; however, because o f m issing data, we used the data from only 79 countries. T h ou gh several years were included in the original data set, we decided to use the data from just 2010 to con du ct our analysis. This year is recent enough to b e relevant but also old enough to in corporate data revisions.
In our data set, we have one response variable (G D P ) and ten predictors. In Table 1 we introduce the candidate predictors used in our analysis. T he units o f the response vari able (G D P ) are U.S. dollars; the first p red ictor is expenditures on education, expressed as a percentage o f total governm ent expenditures; the second p red ictor is exports o f g ood s and services, expressed as a percentage o f G D P ; the third pred ictor is fertility rate, w hich is the average num ber o f births per wom an; the fourth pred ictor is general governm ent final
con sum ption expenditures, expressed in U.S. dollars; the fifth p red ictor is gross savings, expressed as a percentage o f G D P ; the sixth p red ictor is household final con su m ption ex penditures, expressed in U.S. dollars; the seventh pred ictor is im ports o f g ood s and services, expressed as a percentage o f G D P ; the eighth pred ictor is inflation rate, expressed as a percentage; the ninth pred ictor is m ilitary expenditures, expressed as a percentage o f g ov ernment expenditures; the last pred ictor is p op u lation ages 15-64, expressed as a percentage o f total p op u lation count.
T h e choice to use these specific candidate predictors in this study is rooted in econ om ic theory. A ccord in g to econom ists, there are tw o ways to partition the total value o f G D P (M acroecon om ics: A G row th T h eory A pproach , A lejan dro and M ark). One way to obtain G D P is by expenditure (the different ways that our ou tpu t is “b ou g h t” ), using the follow ing formula:
G D P = co n su m p tion + investment + governm ent p ro d u ctio n + net exports. T h e other way is by incom e:
G D P = w a g e s+ p ro fits.
T herefore, accordin g to econ om ic theory, som e variables should have a significant rela tionship w ith G D P. A m on g the ten candidate predictors we used were som e which w ould alm ost certainly have a significant relationship w ith G D P, such as consum ption, exports, im ports, and investment. W e also included som e predictors which we did not exp ect to have an effect on G D P in order to test the specificity o f the various variable selection m ethods. W h en we perform the variable selection, those variables ought to be elim inated from our m odel.
3
Methods
3.1
M ultiple linear regression
In this section we describe linear regression m ethods and bu ild up to the Bayesian variable selection m ethods. Part o f the reason for doing this is to describe the transform ations we
Table 1: Candidate predictors used in our analysis
Variable V ariable’s name
X i E xpenditures on education
X2 E xports
X3 Fertility rate
X4 G overnm ent con sum ption expenditures
X5 Gross savings
X e H ousehold con sum ption expenditures
X 7 Im ports
X8 Inflation
X9 M ilitary expenditures
X 1 0 P opu lation ages 15-64 (% total)
selected for the data.
In statistics, linear regression is an approach for m odeling the relationship betw een a dependent variable Y and p explanatory variables (or independent variables) denoted X ^ X 2, ■ ■ ■, X p. Specifically, we are trying to m odel E (Y | X ) using the linear function X
fl
, where X is a m atrix containing colum n vectors for X 1, X 2, ■ ■ ■ , X p;fl
is a colum n vector o f coefficients; and Y is a colum n vector o f responses. A n oth er way to write this is Y = Xfl
+ e , where we assume that the errors e have a m ean o f 0 and constant variance and are un correlated. O rdinary least squares is a m eth od for estim ating the unknow n param eters in a linear regression m odel. T h e goal o f the ordinary least squares m eth od is to m inim ize the sum o f the squares o f differences betw een the observed responses in the given data set and those pred icted by a linear function o f a set o f explanatory variables. T h at is, we m inim ize e Te = ( Y - X
fl
) T ( Y - Xfl
), wherefl
is the estim ate o ffl
.T o begin our analysis, we fit the m ultiple linear regression m odel,
E (Y | X ) = 1 / 0 + X i/3i + X2 / ? 2 + ■ ■ ■ + X 10/3l0
Here, 1 is a colum n vector o f 1s. B y fitting a m ultiple linear regression m odel, we h op ed to begin to see the relationships betw een our response variable and the explanatory variables. Table 2 summarizes the m ultiple linear regression m odel fit. Figure 1 provides a scatterplot m atrix that depicts the estim ated effects from the m ultiple linear regression m odel. A
c-Table 2: M ultiple linear regression ou tpu t o f the ordinary least squares fit Coefficients Estim ate Std. Error t value p-value
Intercept 1.492e+11 2.739e+11 0.545 0.5878
X i 2.825e+ 07 1.815e+09 0.016 0.9876 X2 1.012e+08 7.691e+ 08 0.132 0.8957 X3 -1.152e+ 10 1.643e+10 -0.701 0.4855 X4 1.920e+00 1.548e-01 12.401 < 2e - 16 *** X5 3.516e+ 09 1.088e+09 3.231 0.0019 ** X e 1.0 0 2e+ 0 0 4.038e-02 24.822 < 2e - 16 *** X 7 -3.640e+ 08 8.555e+ 08 -0.425 0.6719 X 8 8.741e+ 08 2.379e+ 09 0.367 0.7144 X 9 1.049e+08 1.563e+09 0.067 0.9467 X 1 0 -2.520e+ 09 3.667e+ 09 -0.687 0.4944
cording to this graph, only a few predictors have a linear relationship w ith our response variable.
It is clear that in the OLS fit, assum ing the m o d e l’s assum ptions are satisfied, only three predictors are significant, since the p-values o f those predictors are less than 0.05. W e will now investigate the m odel diagnostics to verify the m o d e l’s assum ptions.
3.2
D ata transformations and m odel diagnostics
T h e nonlinear relationship depicted in the norm al p robability plot in Figure 2 indicates that the residuals for the m ultiple linear regression are not norm ally distributed. Thus, we should use at least one transform ation on the data.
T h e scatterplot m atrix in Figure 1 and the curvature plots included in A ppendices A. 1 and A. 2 allow us to assess the linearity o f the relationship betw een the response and pre
dictors. A hypothesis test is also perform ed to form ally test the null hypothesis that a linear m ean fu n ction is appropriate. T h e alternative hypothesis states that the m ean fu n ction is instead curvilinear. W e found that the relationship betw een variable X4 and our response
variable is curvilinear and the relationship betw een variable X 6 and our response variable is also curvilinear. T o see this better, Figure 3 shows the histogram o f variable X 6; it shows
If) <D C CD =5 a a> a .
E
CD CO2
1
0
1
2
Theoretical QuantilesFigure 2: N orm al p robability plot o f residuals before transform ation
that this variable has a highly-skewed distribution, which is not g o o d for regression analysis. X4 has a similar problem . In order to make those tw o variables’ distributions less skewed,
we used a log transform ation. W e also used a log transform ation o f the response Y in order to im prove the norm ality diagnostics. Finally, we consider the constant variance assum ption inherent in m ultiple linear regression. Figure 4 shows a residual vs. fitted value plot; it suggests that the constant variance assum ption is violated. A form al test o f non-constant variance also indicates that the variance is non-constant. This problem provides us another ju stification for the use o f log transform ations on X 4, X 6, and Y .
W e also use C o o k ’s D istance to assess w hether the any o f the observations have dis prop ortion ate influence on the fitted regression m odel. C o o k ’s distance or C o o k ’s D is a com m on ly-u sed estim ate o f the influence o f a data point when perform ing a least-squares regression analysis. T h e form ula o f C o o k ’s distance is:
D ( Y - Y i)2 hi
• p x M S E ( 1 - h i)2 ’
where h is leverage o f i-th observation, Y is the fitted value o f the ith observation, and M SE is the m ean squared error o f the fit.
Figure 3: H istogram o f the non-standardized variable X6
There are different opinions regarding what cu t-off values to use for spottin g highly influential points. A simple operational guideline o f D > 1 has been suggested (K im 1996). For our data, the largest C o o k ’s distance is 0.491, w hich is less than 1. W e can arrive at the conclusion that none o f the observations exert undue influence on our regression analysis. Finally, in order to give the m odel-fittin g algorithm s m ore num erical stability and to put the coefficients on the same scale, we standardized the predictors X i through X i 0. In other words, we subtracted out the m ean o f each pred ictor and divided by its sam ple standard deviation.
Figure 5 and Table 3 depict a scatterplot m atrix and m ultiple linear regression ou tput for the standardized transform ed data. Figure 5 indicates that the m odeling assum ptions for linear regression are better satisfied. Table 3 indicates that all o f the predictors except for X 1 (E d u cation expenditures) and X8 (Inflation) appear to b e significant predictors o f G D P,
Table data
Figure 4: Residuals vs. F itted values plot using raw, non-transform ed data
M ultiple linear regression ou tpu t o f the ordinary least squares fit o f transform ed
Coefficients Estim ate Std. Error t value p-value Intercept 25.211551 0.005083 4959.605 < 2e - 16 *** X i -0.005885 0.006011 -0.979 0.331045 X2 0.208712 0.016196 1 2 . 8 8 6 < 2e - 16 *** X3 0.063403 0.017051 3.718 0.000408 *** X4 0.375603 0.032482 11.563 < 2e - 16 *** X5 0.067602 0.007125 9.489 4.46e - 14 *** X e 1.580155 0.031692 49.859 < 2e - 16 *** X 7 -0.165789 0.015659 -10.587 5.04e - 16 *** X8 0.001230 0.005900 0.208 0.835501 X9 -0.013717 0.006668 -2.057 0.043512 * X 10 0.053378 0.016576 3.220 0.001965 **
3.3
Bayesian linear regression
Having applied appropriate transform ations to the data set, we now turn to a Bayesian linear m odel for it. In p robability theory and statistics, B ayes’ theorem describes the probability o f an event, conditional on prior knowledge abou t associated events.
A prior probability distribution, often sim ply called the prior, o f an uncertain quantity 9 or vector o f uncertain quantities
0
is the p robability distribution that expresses o n e ’s beliefs abou t this quantity before som e given data set is taken into account. T h e prior distribution o f0
is denoted by n (0
).T h e p osterior distribution n (
0
|Y), is the conditional p robability distribution for0
, con ditional on som e data. It is con stru cted by com bining the likelihood fu n ction and the prior distribution.In statistics, a likelihood function (often sim ply the likelihood) is a fu n ction o f the param eters o f a statistical m odel given data. L (Y |9) is the n otation for the likelihood fu n c tion.
T h e posterior distribution is prop ortion al to likelihood * prior:
(
0
|Y) = n(param eter(s)|data) a n (0
) * L (Y |0
)For this p ro je ct, we will utilize a Bayesian linear regression m odel, Y = X
fl
+ e, where e ~ N ( 0 , a2I). Y is a 7 9 x 1 response vector, X is the 79 x 11 design m atrix,fl
is a11 x 1 vector o f coefficients. Thus
0
= (fl
,a
2)T , and the likelihood fu n ction becom es L (Y |fl
, a 2) = (2 n )- 8 ( a 2) - 8 e x p ( - ( V - W l v - x g ) ).In Bayesian linear regression, priors must be set for a 2 and
fl
. T h e g-prior is a certain class o f priors for the regression coefficients o f a Bayesian m ultiple regression. T h e join t prior distribution o f a2 andfl
factors aswhich is the prod u ct o f the prior distribution o f a 2 tim es the prior distribution o f
fl
given a 2. It is com m on to use an inverse-gam m a distribution on a2 because it is a conjugate prior.W e will prefer the im proper prior,
n ( a 2) a
a 2
which is the limit o f an inverse-gam m a p robability density fu n ction param eterized by a and b as a and b approach 0 from above in such a way that | is constant. This prior is also conjugate for a 2. It can b e shown that
a 2|Y - I G ( n , I Y T( I - - ^ P x) Y
2 2 I + g
where Px = X ( X TX ) - 1X T . T h e g-prior for
fl
, conditional on a2, is a norm al distributionw ith m ean 0 and variance g a 2( X TX ) - 1. g is a hyperparam eter. In other words,
fl
|a2 - N ( 0, g a 2( X TX ) -1).W e n ote that
fl
OLS, the ordinary least squares estim ator o ffl
, isfl
OLS= ( X TX ) -1X TY andthat var(
fl
OLS) = a 2( X TX ) - 1, w hich serves as partial explanation for why the g-prior isdefined the way it is. This prior for
fl
is conditionally conjugate, m eaning that, conditional on a2, b o th the prior and the posterior are o f the same fam ily o f density functions. T he join tposterior density is likewise the prod u ct
n ( a2,
fl
|Y) = n ( a2|Y) * n (fl
|a2, Y ) ,which is the p rod u ct o f the posterior distribution o f a2 and the posterior distribution o f
fl
,given a2. It can b e shown that the conditional posterior distribution o f
fl
isfl
|a2, Y - N (fl
o l s , Ig+ 2- ( X TX ) - ‘ jV I + g I + g )
T h at is, the posterior distribution o f
fl
given a2 is norm al w ith m ean f l OLS and variance3.4
Bayesian variable selection
T he selection o f variables in Bayesian linear regression m odel is related to the prior as sum ptions m ade on the m o d e l’s param eters. Some general strategies for Bayesian variable selection are posterior-based m ethods, Bayes factor based m ethods, and inform ation criteria. In this p ro je ct, we focus on posterior-based m ethods and consider four such m eth ods which em ploy the Bayesian linear regression m odel. In particular, three o f the four m ethods utilize g-priors.
3 .4 .1 B a s ic g - p r i o r w it h c r e d i b l e in te r v a ls
For this m eth od g must be set to som e value in order to have a w ell-defined prior dis tribu tion for
fl
. Various values o f g have been prop osed . A ccord in g to the risk inflation criterion (Foster and G eorge I994), g should be set equal to p 2, where p is the num ber o f predictors; accordin g to the unit inform ation prior (Kass and W asserm an I995), g should be set equal to n, where n is sample size. Initially we tried 5 values to com pare their effects: I0, 20, 50, I50, 200. Finally we decided to choose g = 2 0 , because it gave the lowest value o f m ean square pred iction error in a pilot study.Because o f the con ju gacy o f the prior distributions, we are able to simulate samples from the join t posterior distribution using a simple lo o p in R as follows. For each iteration, first sample
o f
fl
using p osterior samples o ffl
, w hich we obtain ed as described above. If 0 is inside the interval, it means that 0 is a plausible value for the com pon en t, and thus we can elim inatethis variable from our m odel. O nce we have selected significant variables, we will fit our then
regression m odel by taking the posterior m ean o f each selected variable's coefficient as an estimate.
Regression diagnostic plots for the g-prior variable selection fitted m odel can be seen in A ppen d ices A .3 and A.4.
3 .4 .2 T h e l o c a l E m p i r i c a l B a y e s a p p r o a c h
In the second variable selection m eth od we em ployed, the local Em pirical Bayes approach can be view ed as estim ating a separate g for each candidate m odel. Using the marginal likelihood after integrating out all param eters, an em pirical Bayes value o f g is the m axi m um (m arginal) likelihood estim ate constrained to b e nonnegative, w hich turns out to be gEBL = m a x (F — I, 0), where F = (1-R2y/P-1 -p ) is the usual F statistic for testing ^ 1, . . . , ^ p
all equal 0.
O nce g is set as per above, the variable selection m eth od proceeds just as the g-prior w ith credible intervals m eth od does.
Regression diagnostic plots for a local em pirical Bayes variable selection fitted m odel can be seen in A ppen d ices A .5 and A .6.
3 .4 .3 I n d i c a t o r v a r ia b le s e l e c t i o n m e t h o d
A t present, the com pu tation al m eth od m ost com m on ly used for fitting Bayesian m odels w ith intractable posteriors is the M arkov chain M onte Carlo (M C M C ) technique (R ob ert and Casella 2004). Variable selection m ethods can take advantage o f the M C M C framework. Indicator m odel selection does not use a conjugate prior as do the g-prior based m odel selection m ethods. Thus M C M C is required to estim ate the p osterior distributions o f the parameters.
For the purpose o f our p ro je ct, indicators are functions which can either take a value o f I or 0 in order to indicate whether a p red ictor variable belongs in the m odel. W e use I as our in dicator for i = I , 2, . . . , I0. B y including the in dicator in our m odel, we can decide which
variables should b e elim inated from our regression m odel. T he resulting regression m odel is:
Y = + X 1(^1 1 1) + X2(^2 1 2) + ' ' ' + X 1 0(^1 0 1 1 0) +
where
0 if otherwise
1 if variable i belongs in the m odel
In order to give each variable an equal chance o f bein g elim inated from our m odel, we as sum ed that I ~ B ernoulli(0.5) for i= 1 , 2, ■ ■ ■, 10. A lso, we assumed that ^ ~ N orm al(0,T i), and Tj ~ G a m m a (1 ,1 ). T h e m od el was fitted using O pen B U G S , w hich was called from R using R 2O p en B U G S (Sturtz et al. 2010). W e perform ed 10,000 iterations w ith 100 m ore iterations for burn-in. W e elim inated variables for which the posterior m ean o f the corre sponding indicator was less than 0.5.
Regression diagnostic plots for the variable selection by in dicator function m eth od and traceplots o f ^0, 11 and U1 can be seen in A ppen d ices A .7 through A .1 1.
3 .4 .4 H y p e r - g p r i o r w it h c r e d i b l e in te r v a ls
This m eth od is discussed in (Liang et al.2008); we sum m arize it here. T h e shrinkage factor
1+g in the conditional p osterior o f
fl
given a 2 is a factor which adjusts the m axim um likelih o o d estim ator f l OLS. It pulls the m axim um likelihood estim ator tow ard the prior m ean o f 0. Instead o f requiring us to pick a value o f g, the hyper-g prior m eth od allows the data to pick g by placing a prior distribution on either g or on the shrinkage factor. W e assum ed that the prior distribution o f g is n ( g ) = a -2 ( 1 + g ) - “ /2 , g >0, which is prop er distribution for a>2.
In this case, we assumed that a = 3 in deference to the aforem entioned au th ors’ suggestion. H yper-g priors are equivalent to the specification o f a B eta prior on the shrinkage fa ctor 1+ g ;
that is 1+g ~ B eta (1 ,| -1 ), which is a B eta distribution w ith m ean 2.
O nce again we must use R 2O p en B U G S to call O penB U G S and estim ate the posterior distribution o f
fl
using M C M C , because the in corp oration o f g as a param eter to b e sam pled results in a n on -con ju gate m odel. In using a M arkov chain M onte C arlo algorithm , we perform ed I0,000 iterations to simulate the posterior distribution o f
fl
and to o k the posterior m ean o f the coefficients o f significant variables. Variable selection was again perform ed using credible intervals from the posterior distributions o f regression coefficients.Regression diagnostic plot for the hyper-g prior variable selection m eth od and the trace- plots o f ,%, ^ 1 and ft2 can b e seen in A ppen d ices A .I2 through A . I6.
3.5
Measures for comparing the m eth od s’ results
W e w ould like to be able to com pare the results o f the four variable selection m eth ods and determ ine which does best for this data set. T o do this, we will use a cross-validation routine to measure the predictive perform ance o f each variable selection m ethod. In addition, we will also calculate three measures o f m odel quality in order to show what variables w ould have been selected if these had been applied instead. W e describe these measures first follow ed by the cross-validation.
3 .5 .1 A k a ik e i n f o r m a t i o n c r i t e r i o n ( A I C )
T he Akaike inform ation criterion (A IC ) is a measure o f the relative fitness o f various statis tical m odels for a given set o f data. G iven a collection o f m odels for the data, A IC estim ates the quality o f the fit o f each m odel based on the m axim um o f the m o d e l’s likelihood function. It provides a means for m odel selection. T h e m odel w ith the lowest A IC is preferred.
For som e candidate m odel o f a given data set, let L be the log-likelih ood fu n ction for the m odel and let p b e the num ber o f estim ated param eters in the m odel. T h en the A IC value o f the m odel is:
A I C = 2p — 2L
where L is the log-likelihood fu n ction evaluated at the m axim um likelihood estim ate o f 9. W e n ote that small values o f 2p correspond to parsim onious m odels, and that small values o f — 2L correspond to m odels w ith g o o d fit (L is correspondingly large). This is why
3 .5 .2 B a y e s ia n i n f o r m a t i o n c r i t e r i o n ( B I C )
T h e Bayesian inform ation criterion (B IC ) is another criterion for com paring m odels. The m odel w ith the lowest BIC is preferred; it differs from A IC in that it has a different penalty for nonparsim onious m odels:
B I C = lo g (n )p — 2L
3 .5 .3 A d j u s t e d R - s q u a r e d
R -squared is another measure used for m odel com parison. It is a statistical measure o f how close the data are to the fitted regression line. It is also known as the coefficient o f determ i nation. T h e bigger the R -squared value is, the better the m odel fits the data. T h e form ula for R -squared is:
R 2 = I — S S T '
where SSE is the sum o f squared errors o f the regression m odel and SST is the sum o f squares total for the m odel.
Every tim e a pred ictor in regression analysis is added, R 2 increases. T herefore, the m ore predictors that are added, the better the regression will seem to “fit” the data. Even the addition o f predictors which are insignificant will nevertheless increase the value o f R 2.
T h e adjusted R2 can instead b e used to include a m ore appropriate num ber o f vari
ables, thw arting the tem p tation to keep on adding variables to a data set. T h e adjusted R2
will increase only if a new pred ictor im proves the regression m ore than w ould be exp ected by chance. So, when adding a new pred ictor into a regression analysis, we w ould like to use adjusted R -squared to help us make decisions. T h e adjusted R -squared is a m odified version o f R -squared that has been adjusted for the num ber o f predictors in the m odel. T he bigger the adjusted R -squared value is, the better the m odel will fit. T h e form ula for adjusted R -squared is
S S E
R ldj = 1 —
(I — R 2) (n — p — I)
where n is the sam ple size and p is the total num ber o f explanatory variables in the m odel.
3 .5 .4 C r o s s - v a l i d a t i o n
O ur chosen measure o f predictive perform ance is cross-validation. M any o f the m odel fit statistics are not a g o o d guide to how well a m odel will predict: high R 2 does not necessarily m ean the m odel makes g o o d predictions. It is easy to over-fit the data by including t o o m any predictors and thereby inflate R2 and other fit statistics. For exam ple, in a simple
p olyn om ial regression we can just keep adding higher order term s and get better and better fits to the data. But the predictions from the m odel on new data will usually get worse as higher order term s are added.
One way to measure the predictive ability o f a m odel is to test it on a set o f data not used in fitting the m odel, called a “test set” . This is known as cross-validation. T h e data used for estim ation is the “training set” . In each cross-validation iteration, we random ly divided our data set into tw o parts: a testing data set and a training data set. T h e testing data contained about one third o f our original data set (29 observations); the training data set contained about tw o thirds o f our original data set (50 observations).
In each iteration, and for each variable selection m eth od , we used the training data set to perform variable selection. O nce the variables were selected, we to o k the p osterior means o f the coefficients corresponding to the selected variables, resulting in a fitted m odel. T hen we used these fitted m odels to predict the test data set responses Y . Finally, when we differ enced the observed test variable Yj and the pred icted response Yj, we obtain ed the prediction error, Yj-Yj . Using the m ean o f the squared pred iction errors (M S P E ), which is averaged over all squared pred iction errors in each cross-validation iteration and then averaged over 100 iterations, we can com pare the variable selection m ethods. If M S P E is large, the m eth od has a p o o r predictive perform ance. Likewise, if M S P E is low, the m eth od has a g o o d pre dictive perform ance.
4
Results
W e now present the results o f our study, starting w ith the variables selected by each m ethod. Table 4 gives the ou tpu t o f four Bayesian variable selection m ethods. It shows that the g-prior w ith credible intervals selected exactly one predictor, X 6, to stay in the m odel. This is concerning, given that Figure 5 shows that X4 is also strongly linearly associated
w ith the response. W e see that the em pirical Bayes m eth od selected X2, X 3, X 4, X 5, X 6, X 7, and X 10. Indicator variable selection chose X4 and X6 only, while hyper-g prior variable
selection chose X2 and X 6. T h ou gh every m eth od selects X 6, there is w ide disagreement
regarding several o f the other candidate predictors.
A ccord in g to econ om ic theory, at least four predictors (consum ption , exports, im ports, and investm ent) should have a significant im pact on G D P ; and accordin g to the ou tpu t o f Table 4 , only the local em pirical Bayes m eth od selected at least four predictors. This indicates this m eth od works best accordin g to econ om ic theory.
B y calculating the m ean squared pred iction error o f the four m ethods, over the I00 cross-validation iterations, we can decide w hich variable selection m eth od perform s best em pirically for this data set. Table 5 gives a com parison o f the four m ethods in terms o f m ean square pred iction error (M S P E ). It is evident that the local em pirical Bayes m eth od per form s best, since it gives the lowest value o f m ean squared pred iction error. B y this measure, g-prior variable selection perform s worst. T he discrepancy betw een these tw o m ethods may be due to the values o f g used by each. In the local em pirical Bayes m eth od, g is set to be nearly I0,000; in the g-prior variable selection w ith credible intervals m eth od, g is set to be
2 0.
Table 5 also provides the values o f A IC , B IC , and adjusted R -squared for the m odels selected by each m ethod. These are provided m erely for reference; we do not im ply that the Bayesian variable selection m ethods can b e assessed using such criteria. It is evident that A IC and B IC b o th favor the same m odel selected by local em pirical Bayes. B y the same token, the m odel selected by g-prior variable selection is not favored by any o f these measures.
Table 4: Variable selection by four m eth ods on the full data set g-prior VS EB VS Indicator VS H yper-g prior VS T h eory
X i No N N N X2 N Y N N Y X3 N Y N N M X4 N Y Y N Y X5 N Y N N X e Y Y Y Y Y X 7 N Y N N Y X 8 N N N N X 9 N N N N X 10 N Y N N M
Y = Y e s (in clu d e ), N = N o (o m it), M = M a y b e
Table 5: C om parison o f the four m ethods o f Bayesian varia ble selection
M odel Selection M ean M S P E A IC B IC adjusted R2
g-prior variable selection 0.2951141 -36.77993 -29.67159 0.9913
L ocal em pirical Bayes 0 .1 0 6 4 9 6 7 -2 5 3 .9 4 5 3 -2 3 2 .6 2 0 3 0 .9 9 9 5 Variable selection by in dicator function 0.1104306 -84.98479 -75.50699 0.9953 H yper-g prior variable selection 0.3195414 -79.64559 -70.1678 0.9950
that the local em pirical Bayes m eth od perform s best in this analysis.
5
Discussion and future work
In our p roject, we use four Bayesian variable selection m ethods to verify econ om ic theory regarding im portant predictors to G D P. A lso, we use four measures to com pare the results o f the various Bayesian selection m ethods.
A ccord in g to classical econom ics theory, consum ption, investm ent, exports, and im p orts have a significant im pact on G D P. A ccord in g to the ou tput o f Table 4 , the local em pirical Bayes m ethods similarly finds that in 2010, consum ption, investm ent, exports and
im ports do have a significant im pact on G D P. Furtherm ore, and interestingly, the local em pir ical Bayes m eth od also finds that fertility rate and popu lation ages 15-64 (% total) also have a significant im pact on G D P. A ccord in g to Barro (2001), econ om ic grow th is significantly negatively related to the total fertility rate. A ccord in g to A begu n de (2007), p op u lation ages
15-64 (% ) has a positive significant relationship w ith the grow th o f G D P.
A s m easured by the cross-validation routine and econ om ic theory, we believe that the local em pirical Bayes selection m eth od works best for this data set, and the g-prior variable selection m eth od works worst. In term s o f com pu tation al tim e, g-prior variable selection and local em pirical Bayes selection run faster than the others. H yper-g prior variable selection runs the slowest. T h e easiest m eth od to im plem ent is local em pirical Bayes selection and the hardest m eth od to im plem ent is hyper-g prior variable selection, which requires the use o f vector-valued fu n ction and m ultivariate distributions in the O pen B U G S m od el program . T he R -co d e can b e seen in A ppendices A .17 through A.20.
There are a num ber o f potential ways to extend this work. W e cou ld refit the m odel after doing the variable selection. B y doin g this, we might ob tain a better-fittin g m odel since we elim inate som e insignificant predictors from our m odel. A lso, we cou ld com pare m ethods using the deviance inform ation criterion (D IC ) in addition to A IC and BIC . M oreover, we cou ld use Bayes factors instead o f posterior distributions w ith credible intervals to do the variable selection. This is a m ore natural way for g-prior and hyper-g prior variable selection. Furtherm ore, we might do a sensitivity analysis for our priors and the hyperparam eter in the hyper-g prior. Finally, we could try to simulate prediction s from the p osterior predictive distribution in order to predict training data in cross validation.
Acknowledgements
I w ould like to thank m y com m ittee chair, Scott G od d a rd o f the U niversity o f Alaska Fairbanks for a large am ount o f help for this p roject, as well as m y com m ittee members. R on Barry, M argaret Short, and Julie M cIntyre for cod in g and course work. T h e data in this p ro je ct can b e dow nloaded from h t tp ://d a ta .w o r ld b a n k .o r g /.
References
[1] A begu n de, Dele O ., C olin D. M athers, Taghreed A dam , M on ica O rtegon, and Kathleen Strong. “T h e burden and costs o f chronic diseases in low -incom e and m iddle-incom e countries.” T h e Lancet 370, no. 9603 (2007): 1929-1938.
[2] Barro, R ob ert J. “Hum an capital and grow th .” T h e A m erican E con om ic R eview 91, no. 2 (2001): 12-17.
[3] C lyde, Merlise, and E dw ard I. George. “M odel uncertainty.” Statistical science (2004): 81-94.
[4] Fang, W enShwo, and Stephen M. Miller. “M odelin g the volatility o f real G D P growth: T h e case o f Japan revisited.” Japan and the W orld E con om y 21, no. 3 (2009): 312-324. [5] Foster, D ean P., and Edw ard I. G eorge. “T h e risk inflation criterion for m ultiple
regression.” T h e Annals o f Statistics (1994): 1947-1975.
[6] G eorge, Edw ard I., and R ob ert E. M cC u lloch . “Variable selection via G ibbs sam pling.”
Journal o f the A m erican Statistical A ssociation 8 8, no. 423 (1993): 881-889.
[7] H oeting, Jennifer A ., D avid M adigan, A drian E. Raftery, and Chris T . Volinsky. “Bayesian m odel averaging: a tu torial.” Statistical science (1999): 382-401.
[8] Huerta, Gabriel, and H edibert Freitas Lopes. “Bayesian forecasting and inference in
latent structure for the brazilian industrial p rod u ction in d ex.” Brazilian R eview o f E conom etrics 20, no. 1 (2000): 1-26.
[9] Kass, R ob ert E., and Larry W asserm an. “A reference Bayesian test for nested h yp oth e ses and its relationship to the Schwarz criterion.” Journal o f the am erican statistical association 90, no. 431 (1995): 928-934.
[10] K im , C hoongrak. “C o o k ’s distance in spline sm ooth in g.” Statistics and probability letters 31, no. 2 (1996): 139-144.
[11] Leamer, Edw ard E. “Specification searches: A d h oc inference w ith nonexperim ental data” . Vol. 53. John W iley and Sons Incorporated, 1978.
[12] Liang, Feng, Rui Paulo, G erm an M olina, Merlise A . C lyde, and Jim O. Berger. “ M ix tures o f g priors for Bayesian variable selection.” Journal o f the A m erican Statistical A ssociation 103, no. 481 (2008): 410-423.
[13] M itchell, T oby J., and John J. Beaucham p. “Bayesian variable selection in linear regression.” Journal o f the A m erican Statistical A ssociation 83, no. 404 (1988): 1023
1032.
[14] O ’Hara, R ob ert B., and M ikko J. Sillanpaa. “A review o f Bayesian variable selection m ethods: what, how and w hich.” Bayesian analysis 4, no. 1 (2009): 85-117.
[15] Raftery, A drian E., D avid M adigan, and Jennifer A . H oeting. “Bayesian m odel av eraging for linear regression m od els.” Journal o f the A m erican Statistical A ssociation 92, no. 437 (1997): 179-191.
[16] Smith, M ichael, and R ob ert K ohn. “N onparam etric regression using Bayesian variable selection.” Journal o f E conom etrics 75, no. 2 (1996): 317-343.
[17] Sturtz, Sibylle, Uwe Ligges, and A ndrew Gelm an. “ R 2O p en B U G S : a package for running O pen B U G S from R .”
U R L h t t p ://c r a n . rp roject. o rg /w e b /p a ck a g e s /R 2 O p e n B U G S /v ig n e tte s /R 2 O p e n B U G S . p d f (2 0 1 0).
[18] W orld Bank O pen D ata h ttp ://d a ta .w o r ld b a n k .o r g / (accessed A pril 10, 2016).
[19] Zellner, A rnold, and Claude M ontm arquette. “A study o f som e aspects o f tem poral ag gregation problem s in econ om etric analyses.” T h e R eview o f E conom ics and Statistics
Appendix A
A .1 n cvT est results (Test for curvature)
Test stat Pr(>|t|) xlnew -0. 203 0. 839 x2new 0..485 0. 629 x3new 0. 503 0. 617 x4new -7. 241 0. 000 x5new 1. 131 0. 262 x6new -8. 126 0. 000 x7new 0. 598 0. 552 x8new 1. 307 0. 196 x9new -0. 014 0. 989 x10new -0. 267 0. 790 Tukey test -7. 892 0. 000
A ccord in g to the ou tput o f ncvT est, we determ ine that there is som e curvature in the relationship betw een the response and variables X 4 and X 6, because the p-values are
A .2 R esid u al P lo ts (Curvature plots)
These are m arginal residual plots for sim ple linear regression fit. A . 3 N o r m a l Q - Q p l o t f o r g - p r i o r v a r ia b le s e l e c t io n
Normal Q-Q Plot
2
-
1
0
1
2
Theoretical Quantiles
A .4 B ayesian R esid u als vs. F itte d values plot using d a ta for g-prior variable selection 22 24 26 28 30 Fitted value A . 5 N o r m a l p r o b a b i l i t y p l o t o f B a y e s ia n r e s id u a ls f o r L o c a l e m p i r i c a l B a y e s
Normal Q-Q Plot
2
1
0
1
2
Theoretical QuantilesA. 6 B a y e s ia n R e s id u a ls v s . F i t t e d v a lu e s p l o t f o r L o c a l e m p ir ic a l B a y e s
22 24 26 28 30
Fitted value
This is Bayesian Residuals vs. F itted values plot for local em pirical Bayes. A . 7 T r a c e p l o t o f p0 f o r v a r ia b le s e l e c t i o n b y i n d i c a t o r f u n c t i o n
CM
0 1000 2000 3000 4000 5000
1:5000
W e did 10,000 iterations and used 1 chain to get this plot. T h e estim ated m ean o f fi0 is abou t 25.1.
A .8 T raceplot o f / i for variable selection b y indicator function
W e did 10,000 iterations and used 1 chain to get this plot.
A . 9 T r a c e p l o t o f ^ 1 f o r v a r ia b le s e l e c t i o n b y i n d i c a t o r f u n c t i o n
W e did 10,000 iterations and used 1 chain to get this plot. T h e extrem e ju m ps in m agnitude o f ^ 1 are caused by the m e th o d ’s difficulty in estim ating ^ 1 on iterations
where the in dicator I 1 is at 0. A .1 0 N o r m a l p r o b a b i l i t y p l o t o f B a y e s ia n r e s id u a ls f o r v a r ia b le s e l e c t i o n b y i n d i c a t o r f u n c t i o n
Normal Q-Q Plot
2
-
1
0
1
2
Theoretical QuantilesThis plot indicates that the Bayesian residuals are essentially norm ally distributed.
A .1 1 B a y e s ia n R e s id u a ls v s . F i t t e d v a lu e s p l o t u s in g d a t a f o r V a r ia b le s e l e c t i o n b y i n d i c a t o r fu n c t i o n
22 24 26 28 30
This is Residuals vs. F itted values plot for Variable selection by in dicator function. A .1 2 T r a c e p l o t o f f o r H y p e r - g p r i o r v a r ia b le s e l e c t io n
W e did 10,000 iterations and used 1 chain to get this plot. A .1 3 T r a c e p l o t o f ^1 f o r H y p e r - g p r i o r v a r ia b le s e l e c t io n
CM
0 1000 2000 3000 4000 5000
1:5000
A .1 4 T raceplot o f for H y p e r -g prior variable selection
W e did 10,000 iterations and used 1 chain to get this plot.
A .1 5 N o r m a l p r o b a b i l i t y p l o t o f B a y e s ia n r e s id u a ls f o r h y p e r - g p r i o r v a r ia b le s e l e c t io n
Normal Q-Q Plot
2
-
1
0
1
2
Theoretical Quantiles
A .1 6 B a y e s ia n R e s id u a ls v s . F i t t e d v a lu e s p l o t f o r h y p e r - g p r i o r v a r ia b le s e l e c t io n
22 24 26 28 30
Fitted value
A .1 7 R -c o d e for g-prior and local em pirical B ayes variable selection
x=model.matrix(trainreg)
P=x%*%solve((t(x)%*%x))%*%t(x) y=newtry
I=diag(50) # 50=size of training set msg=0.5*t(y)%*%(I-(g/(g+1))*P)%*%y umsg=msg/((50/2)-1) Bmse=(g/(g+1))*as.numeric(umsg)*solve(t(x)%*%x) mse=diag(Bmse)~0.5 ebeta=trainreg$coefficients meanbeta=(g/(1+g))*trainreg$coefficients meanb<-matrix(NA,11,10000) for(i in 1:10000){ sigam<-1/rgamma(1,25,rate=msg) var=(g/(1+g)*sigam*solve((t(x)%*%x))) mean=(g/(1+g)*ebeta) meanb[,i]<-rmvnorm(1,mean,var) }
library(HDInterval) # calculate credible interval hdi(meanb[1,])
hdi(meanb[2,])
A .1 8 R -c o d e for indicator function variable selection inits <- function(){ list(y.precision=1,beta0=0,I1=0,I2=0,I3=0,I4=0, I5=0,I6=0,I7=0,I8=0,I9=0,I10=0,u1=0,u2=0,u3=0, u4=0,u5=0,u6=0,u7=0,u8=0,u9=0,u10=0,t0=1, t1=1, t2=1, t3=1, t4=1, t5=1, t6=1, t7=1, t8=1, t9=1, t10=1) }
sdata.sim <- bugs(txtdata, inits, model.file = "150.txt",
parameters = c("y.precision","beta0","I1","I2","I3","I4","I5","I6","I7", "I8","I9","I10","t0","t1","t2","t3","t4","t5","t6", "t7","t8","t9","t10","u1","u2","u3","u4","u5","u6","u7","u8","u9","u10"), n.chains = 1, n.iter = 10000) A .1 9 O p e n B U G S c o d e f o r i n d i c a t o r f u n c t i o n v a r ia b le s e l e c t io n model{ for (i in 1:50){ y[i]~dnorm(n[i],y.precision) n[i]<-beta0+x1[i]*beta1+x2[i]*beta2+x3[i]*beta3+x4[i]*beta4+x5[i]*beta5 +x6[i]*beta6+x7[i]*beta7+x8[i]*beta8+x9[i]*beta9+x10[i]*beta10 } beta0~dnorm(0,t0) t0~dgamma(1,10) y.precision~dgamma(1,10)
I1~dbern(0.5) u1~dnorm(0,t1) t1~dgamma(1,1) beta1<-I1*u1 I2~dbern(0.5) u2~dnorm(0,t2) t2~dgamma(1,1) beta2<-I2*u2 I10~dbern(0.5) u10~dnorm(0,t10) t10~dgamma(1,1) beta10<-I10*u10 } A .2 0 R - c o d e f o r h y p e r - g p r i o r v a r ia b le s e l e c t io n xtr=model.matrix(trainreg) z=t(xtr)%*%xtr muo=as.vector(rep(0,11)) y <- matrix(y,50,1) zzz<-round(z[1:11,1:11],2) muo=as.vector(rep(0,11)) inits <- function(){ list(beta=c(0,0,0,0,0,0,0,0,0,0,0),y.precision=1,tau=1,lambda=0.5) }
model{ for(i in 1:50){ for(j in 1:1){ y[i,j]~dnorm(ea[i,j],y.precision) ea[i,j] <- beta[1]+beta[2]*x1[i]+beta[3]*x2[i] + beta[4]*x3[i]+beta[5]*x4[i]+beta[6]*x5[i]+beta[7]*x6[i] + beta[8]*x7[i]+beta[9]*x8[i]+beta[10]*x9[i]+beta[11]*x10[i] } } y.precision~dgamma(1,1) beta[1:11]~dmnorm(muo[],precision[,]) tau~dgamma(1,1) lambda~dbeta(1,0.5) g<-lambda/(1-lambda) for(k in 1:11){ for(l in 1:11){ precision[k,l]<-(1/g)*tau*z[k,l] } } }
sdata2.sim <- bugs(txt2data, inits, model.file = "3-8 test.txt", parameters = c("beta","y.precision","tau","lambda"), n.chains = 1, n.iter = 5000) as.integer(hdi(sdata2.sim)[,1][2]*hdi(sdata2.sim)[,1][1]>0) *sdata2.sim$mean$beta[1]+as.integer(hdi(sdata2.sim)[,2][2] *hdi(sdata2.sim)[,2][1]>0) *sdata2.sim$mean$beta[2]*x1te