An example - Some perspectives on the problem of model selection


3.1.5 An example

We now demonstrate the POPMOS for predictive variable selection in linear regression analysis, with BMA used as the reference. We adopt the Bayesian framework of Raftery et al. [1997]. Each model M under consideration is of the form

$$Y = \beta_0 + \beta_1 X_{i_1} + \cdots + \beta_k X_{i_k} + \epsilon, \qquad \epsilon \sim N(0, \sigma^2),$$

where $\{X_{i_1}, \ldots, X_{i_k}\}$ is a subset of the set $\{X_1, \ldots, X_p\}$ of all potential covariates. Let $y$ and $X$ (w.r.t. M) be the response vector and the corresponding design matrix, respectively. It is reasonable to assign a uniform prior over the possible combinations of covariates, i.e., the prior information is “objective” between models. For model parameters, we assume priors
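The uniform prior over covariate subsets can be made concrete with a small sketch. It assumes that every subset of the p potential covariates, including the empty (intercept-only) subset, counts as a model, so that there are 2^p models; the function name `enumerate_models` is hypothetical and used only for illustration.

```python
from itertools import combinations

def enumerate_models(p):
    """Yield every subset of covariate indices 1..p as a tuple.

    ASSUMPTION: the empty subset (intercept-only model) is included.
    """
    indices = range(1, p + 1)
    for size in range(p + 1):
        for subset in combinations(indices, size):
            yield subset

p = 15                      # number of potential covariates in the crime data
n_models = 2 ** p           # number of subsets of p covariates
uniform_prior = 1.0 / n_models  # prior probability of each model

# Enumerate a small case rather than all 2**15 subsets.
small = list(enumerate_models(3))
print(len(small))     # number of subsets for p = 3
print(n_models)
print(uniform_prior)
```

With 15 potential covariates the model space is already in the tens of thousands, which is why the text later restricts attention to a smaller model set.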

$$\beta \mid \sigma^2 \sim N_{k+1}(\mu, \sigma^2 V), \qquad \frac{\nu\lambda}{\sigma^2} \sim \chi^2_{\nu}.$$

The hyperparameters $\mu, V, \nu, \lambda$ are chosen as follows (see Raftery et al. [1997] for the details): $\nu = 2.58$, $\lambda = .28$, $\mu = (\hat\beta_0, 0, \ldots, 0)$ and $V = \mathrm{diag}(s_Y^2, \phi^2/s_{i_1}^2, \ldots, \phi^2/s_{i_k}^2)$, where $\hat\beta_0$ is the OLS estimate of $\beta_0$, $s_Y^2, s_{i_1}^2, \ldots, s_{i_k}^2$ are the sample variances of $Y, X_{i_1}, \ldots, X_{i_k}$, respectively, and $\phi = 2.85$. In our experience, results are typically relatively insensitive to changes in the values of the hyperparameters.
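As an illustration of how these hyperparameters could be assembled for one model, here is a minimal NumPy sketch. It assumes that the slope entries of V are φ² divided by the sample variance of the corresponding covariate (the reading that matches Raftery et al. [1997]), that sample variances use the n − 1 divisor, and that the intercept estimate comes from an OLS fit of the model at hand. The function name `prior_hyperparameters` and the synthetic data are hypothetical.

```python
import numpy as np

NU, LAM, PHI = 2.58, 0.28, 2.85  # values quoted in the text

def prior_hyperparameters(y, X_cov, phi=PHI):
    """Return (mu, V) for one model.

    y     : response vector, shape (n,)
    X_cov : covariates of the model WITHOUT the intercept column, shape (n, k)
    """
    n, k = X_cov.shape
    X = np.column_stack([np.ones(n), X_cov])        # design matrix with intercept
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    mu = np.zeros(k + 1)
    mu[0] = beta_ols[0]                              # OLS estimate of the intercept
    s2_y = np.var(y, ddof=1)                         # sample variance of Y
    s2_x = np.var(X_cov, axis=0, ddof=1)             # sample variances of covariates
    # ASSUMPTION: slope prior variances are phi^2 / s^2 for each covariate.
    V = np.diag(np.concatenate([[s2_y], phi ** 2 / s2_x]))
    return mu, V

# Synthetic data, for illustration only (not the crime data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(10, 2))
y_demo = rng.normal(size=10)
mu_demo, V_demo = prior_hyperparameters(y_demo, X_demo)
print(mu_demo.shape, V_demo.shape)
```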

Then the marginal likelihood (3.6) under model M is

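$$p(D \mid M) = \frac{\Gamma\left(\frac{\nu+n}{2}\right)(\nu\lambda)^{\nu/2}\left[\lambda\nu + (y - X\mu)^\top (I + XVX^\top)^{-1}(y - X\mu)\right]^{-(\nu+n)/2}}{\pi^{n/2}\,\Gamma(\nu/2)\,|I + XVX^\top|^{1/2}} \tag{3.13}$$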
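The marginal likelihood (3.13) is most safely evaluated on the log scale. The following is a minimal sketch, assuming NumPy is available and reading the symbol in front of XVX' as the n × n identity matrix; the function name `log_marginal_likelihood` and the synthetic inputs are hypothetical.

```python
import math
import numpy as np

def log_marginal_likelihood(y, X, mu, V, nu=2.58, lam=0.28):
    """Log of the marginal likelihood (3.13) for one model."""
    n = y.shape[0]
    resid = y - X @ mu
    S = np.eye(n) + X @ V @ X.T                 # I + X V X'
    _, logdet = np.linalg.slogdet(S)            # log |I + X V X'|
    quad = resid @ np.linalg.solve(S, resid)    # (y - X mu)' S^{-1} (y - X mu)
    return (math.lgamma((nu + n) / 2)
            + (nu / 2) * math.log(nu * lam)
            - ((nu + n) / 2) * math.log(lam * nu + quad)
            - (n / 2) * math.log(math.pi)
            - math.lgamma(nu / 2)
            - 0.5 * logdet)

# Synthetic inputs, for illustration only.
rng = np.random.default_rng(0)
X_ml = np.column_stack([np.ones(6), rng.normal(size=(6, 2))])
y_ml = rng.normal(size=6)
mu_ml = np.array([0.3, 0.0, 0.0])
V_ml = np.diag([1.0, 2.0, 0.5])
print(log_marginal_likelihood(y_ml, X_ml, mu_ml, V_ml))
```

Under a uniform prior over models, posterior model probabilities are proportional to the exponentials of these values, so differences of log marginal likelihoods are all that is needed to compare two models.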

and the posterior predictive distribution (3.7) is

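$$p(\Delta \mid D, M) = \frac{\Gamma\left(\frac{\nu+n+1}{2}\right)}{\sqrt{\pi}\,\Gamma\left(\frac{n+\nu}{2}\right)} \cdot \frac{1}{\left(1 + x^\top (X^\top X + V^{-1})^{-1} x\right)^{1/2}} \cdot \frac{A^{(n+\nu)/2}}{B^{(n+\nu+1)/2}} \tag{3.14}$$

where

$$A = \lambda\nu + \|y\|^2 + \mu^\top V^{-1}\mu - (X^\top y + V^{-1}\mu)^\top (X^\top X + V^{-1})^{-1} (X^\top y + V^{-1}\mu)$$

and

$$B = \lambda\nu + \|y\|^2 + y^2 + \mu^\top V^{-1}\mu - (xy + X^\top y + V^{-1}\mu)^\top (xx^\top + X^\top X + V^{-1})^{-1} (xy + X^\top y + V^{-1}\mu).$$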
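The predictive density (3.14) can be evaluated in the same way. The sketch below assumes that the scalar y appearing in B (in the y² and xy terms) is the future response at the new covariate vector x, while ‖y‖² and X'y refer to the training response vector; this reading, the function name `log_predictive_density` and the synthetic inputs are assumptions made for illustration.

```python
import math
import numpy as np

def log_predictive_density(y_new, x, y, X, mu, V, nu=2.58, lam=0.28):
    """Log of the posterior predictive density (3.14) at a future response y_new.

    x is the covariate vector of the future observation (including the
    leading 1 for the intercept); y and X are the training data.
    """
    n = y.shape[0]
    V_inv = np.linalg.inv(V)
    P = X.T @ X + V_inv                        # X'X + V^{-1}
    m = X.T @ y + V_inv @ mu                   # X'y + V^{-1} mu
    const = lam * nu + y @ y + mu @ V_inv @ mu
    A = const - m @ np.linalg.solve(P, m)
    P_x = np.outer(x, x) + P                   # xx' + X'X + V^{-1}
    m_x = x * y_new + m                        # x*y_new + X'y + V^{-1} mu
    B = const + y_new ** 2 - m_x @ np.linalg.solve(P_x, m_x)
    scale = 1.0 + x @ np.linalg.solve(P, x)    # 1 + x'(X'X + V^{-1})^{-1} x
    return (math.lgamma((nu + n + 1) / 2)
            - 0.5 * math.log(math.pi)
            - math.lgamma((n + nu) / 2)
            - 0.5 * math.log(scale)
            + ((n + nu) / 2) * math.log(A)
            - ((n + nu + 1) / 2) * math.log(B))

# Synthetic inputs, for illustration only.
rng = np.random.default_rng(1)
X_pd = np.column_stack([np.ones(6), rng.normal(size=(6, 2))])
y_pd = rng.normal(size=6)
mu_pd = np.array([0.2, 0.0, 0.0])
V_pd = np.diag([1.0, 2.0, 0.5])
x_pd = np.array([1.0, 0.3, -0.4])
print(log_predictive_density(0.0, x_pd, y_pd, X_pd, mu_pd, V_pd))
```

A quick sanity check on any implementation of (3.14) is that the density integrates to one over the future response, which can be verified numerically on a fine grid.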

Analysis of the crime data

Criminal behavior has been argued to be strongly related to criminal activity’s costs and benefits and to other legitimate opportunities. Ehrlich [1973] used the data from 47 U.S. states in 1960 to test this argument. The dependent variable was the crime rate. The costs of crime were measured by probability of imprisonment and average time served in prison. The benefits were related to wealth and income inequality in the community. The investigation also included other variables such as sex ratio, percentage of young males, etc. In summary, 15 potential covariates (Table 3.1) were considered.

This benchmark dataset has been analyzed by many authors. Previous diagnostic checks (see, e.g., Draper and Smith [1981]) did not show any violation of the linearity assumption. Ehrlich [1973] used the stepwise method to select significant variables. However, Raftery et al. [1997] reported evidence against Ehrlich’s results and suggested using posterior probabilities for variable selection. We now use this dataset to demonstrate the POPMOS and compare it to other model selection rules.

Table 3.1 summarizes the experimental results using the whole dataset. Models selected by different methods are listed in the corresponding columns. The third column is the overall posterior probability that the j-th covariate is in a model, i.e., $P(\beta_j \neq 0 \mid D)$, calculated for $j = 1, 2, \ldots, 15$. The POPMOS selected the predictors with the highest posterior probabilities (≥ .5). Raftery et al. recommended (from an empirical analysis) using posterior probabilities rather than p-values for variable selection. The last column presents the so-called median probability model (MP) introduced by Barbieri and Berger [2004]. The MP model is defined as the model consisting of those covariates which have overall posterior probability $P(\beta_j \neq 0 \mid D) \geq .5$. In the framework of normal linear regression and under some conditions, the MP model has optimal performance in terms of predictive expected squared loss (see Barbieri and Berger [2004] for the full definition). As shown in Table 3.1, the OP model is the same as the MP model. Table 3.1 also shows the models selected by AIC and BIC (which were found by exhaustive search using the branch-and-bound algorithm [Miller, 2002]). AIC, BIC and POPMOS produced three different models. This is not a surprise because these criteria have different goals. As we may expect, the AIC model is the “biggest” among the selected models: it contains 9 covariates versus 7 covariates for OP and BIC. As we will see next, AIC models sometimes have poor predictive performance.

Table 3.1: Crime data: Overall posterior probabilities and selected models

| Number | Covariate | $P(\beta_j \neq 0 \mid D)$ | AIC | BIC | OP | MP |
|---|---|---|---|---|---|---|
| 1 | % of males age 14-24 | .78 | ✓ | ✓ | ✓ | ✓ |
| 2 | Indicator for southern state | .18 | | | | |
| 3 | Mean years of schooling | .97 | ✓ | ✓ | ✓ | ✓ |
| 4 | Police expenditure in 1960 | .72 | ✓ | ✓ | ✓ | ✓ |
| 5 | Police expenditure in 1959 | .50 | | | ✓ | ✓ |
| 6 | Labor force participation rate | .08 | | | | |
| 7 | No. males per 1000 females | .08 | | | | |
| 8 | State population | .24 | | | | |
| 9 | No. nonwhites per 1000 people | .61 | ✓ | ✓ | ✓ | ✓ |
| 10 | Unemployment rate age 14-24 | .11 | | | | |
| 11 | Unemployment rate age 35-39 | .45 | ✓ | ✓ | | |
| 12 | Wealth | .31 | ✓ | | | |
| 13 | Income inequality | 1.00 | ✓ | ✓ | ✓ | ✓ |
| 14 | Probability of imprisonment | .82 | ✓ | ✓ | ✓ | ✓ |
| 15 | Ave. time in state prisons | .23 | ✓ | | | |

MUI=.71 suggests that there is high model uncertainty.
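The median probability rule described above is easy to reproduce from the posterior inclusion probabilities reported in Table 3.1. This is a small sketch; the function name `median_probability_model` is hypothetical, and the probabilities are the rounded values printed in the table (so the .50 for covariate 5 is treated as meeting the threshold, consistent with the 7 covariates reported for the OP and MP models).

```python
# Posterior inclusion probabilities P(beta_j != 0 | D) as printed in Table 3.1.
inclusion_prob = {
    1: 0.78, 2: 0.18, 3: 0.97, 4: 0.72, 5: 0.50,
    6: 0.08, 7: 0.08, 8: 0.24, 9: 0.61, 10: 0.11,
    11: 0.45, 12: 0.31, 13: 1.00, 14: 0.82, 15: 0.23,
}

def median_probability_model(probs, threshold=0.5):
    """Return the covariates whose inclusion probability is at least the threshold."""
    return sorted(j for j, p in probs.items() if p >= threshold)

mp_model = median_probability_model(inclusion_prob)
print(mp_model)        # [1, 3, 4, 5, 9, 13, 14]
print(len(mp_model))   # 7
```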

We now use the crime data to assess the predictive ability of the selection rules. To this end, the dataset was randomly split into two parts: one with 24 observations was used as the training set, and the other with 23 observations was used as the prediction set. Other splits can be adopted. Table 3.2 shows the PPS and PC of the selected models. With C = 20, model set A contains 29 models. The model uncertainty indicator MUI=.61 suggests that there is moderate model uncertainty. As shown, the OP model has better predictive performance than the AIC and BIC models. The AIC model has poor predictive performance.
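A random split of the 47 observations into 24 training and 23 prediction cases can be sketched as follows. The seed and the function name `split_indices` are arbitrary choices made here for reproducibility; the text does not say how the original split was generated.

```python
import random

def split_indices(n_obs=47, n_train=24, seed=0):
    """Randomly partition indices 0..n_obs-1 into training and prediction sets."""
    rng = random.Random(seed)        # ASSUMPTION: arbitrary seed, for reproducibility
    idx = list(range(n_obs))
    rng.shuffle(idx)
    return sorted(idx[:n_train]), sorted(idx[n_train:])

train_idx, test_idx = split_indices()
print(len(train_idx), len(test_idx))   # 24 23
```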

Note that the models selected using half of the data are slightly different from the models selected using the full data (however, they both contain the most important covariates). This is not a surprise because of the small size of the dataset. If we had a large enough dataset, using either the full data or half of it would lead to the same results. The selected models summarized in Table 3.2 are used only to examine the methods; they are not the final chosen models.

Table 3.2: Crime data: Assessment of predictive ability

| Method | Model | PPS | PC |
|---|---|---|---|
| AIC | 1 3 4 5 9 13 14 | .18 | 82.61% |
| BIC | 3 4 9 13 14 | .16 | 82.61% |
| MP | 1 3 4 5 9 13 | .12 | 86.96% |
| OP | 1 3 4 5 9 13 | .12 | 86.96% |
| BMA | all | .06 | 91.30% |

MUI=.61 suggests that there is moderate model uncertainty.
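As a small arithmetic check on the PC column, the reported percentages are consistent with whole-number counts out of the 23 prediction-set observations. The sketch below assumes that PC is a proportion of those 23 observations; PC itself is not defined in this section, so the interpretation of the counts is left open.

```python
n_pred = 23  # size of the prediction set

# PC values as reported in Table 3.2 (in percent).
reported_pc = {"AIC": 82.61, "BIC": 82.61, "MP": 86.96, "OP": 86.96, "BMA": 91.30}

# Nearest whole-number count out of n_pred implied by each percentage.
implied_counts = {m: round(pc / 100 * n_pred) for m, pc in reported_pc.items()}

for method, count in implied_counts.items():
    print(method, count, f"{100 * count / n_pred:.2f}%")
```

On this reading, the gap between the AIC or BIC model and BMA corresponds to two observations out of 23, which is worth keeping in mind given the small size of the prediction set.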
