Applied Bootstrap Resampling Procedure - Stereotype Logit Models for High Dimensional Data

As this is a novel procedure, a bootstrap resampling method is used to provide insight into the distributions of the parameter estimates, as well as their significance. A thorough definition of the bootstrap procedure was provided in Chapter 2. In this chapter we present the application of a bootstrap procedure used in our proposed model framework aimed at determining significance of estimated parameters.

A variety of models are fit along the $\lambda$ trace. Model selection is carried out via AIC, BIC, and a procedure aimed at reducing prediction error using CV or independent data generated from the same population; the application of these procedures is covered in a later section. Once we have selected our candidate model, we must determine the distribution of the parameters of interest, their correlations, and their significance, along with the corresponding confidence intervals.

Traditional methods make asymptotic assumptions and use the standard errors to develop 95% confidence intervals. Previous work has shown the problems inherent in deriving closed-form estimates for the variances of parameters from penalized models (Osbourne, 2000). As a result, a bootstrap resampling procedure is used. This procedure has shown great potential in correctly and dynamically describing the distributions of estimates from novel procedures (Horowitz, 2001). Since our new method is a suitable candidate, a bootstrap resampling procedure is applied.

For the purposes of this dissertation, the bootstrapping pairs design is used. Let $B$ denote the number of bootstrap resamples. $B$ is set to 200, based on the statement that $B$ ranging from 50 to 200 is sufficient (Efron and Tibshirani, 1986). For a covariate matrix $X$ and an ordinal outcome vector $y$, which are viewed as the population, define the tuple $(y_i, X_{i\cdot})$, which denotes the $i$th entry of $y$ and the $i$th row of $X$ respectively, $i = 1, 2, \ldots, n$. For each bootstrap resample we resample $n$ tuples with replacement from the original data, giving rise to a new data set $X_b$ and $y_b$, $b = 1, 2, \ldots, B$. Once we have the $B$ samples, the corresponding model is fit to each data set.

For the original data, once the model is selected based on the elastic net penalty, the corresponding value of $\lambda$ is used in model fitting for all bootstrap resamples. This value is held fixed because allowing it to vary may introduce additional variation into our model (Osbourne, 2000), and it is desirable that the variances be correctly attributed to the parameters and their interactions with each other. Once the $B$ models are fit, the corresponding parameter estimates are obtained. Denote the $b$th bootstrap parameter estimates as $(\hat{\alpha}, \hat{\beta}, \hat{\phi})_b$. Having these $B$ parameter estimates allows us to plot a histogram of the values and gain insight into their distributions; they can also be used to examine interactions among the estimates. As an example, for a given coefficient $\beta_p$ included in our final model, once the bootstrap procedure is run and we have the $\beta_{p_b}$ estimates, we can plot their histogram, as in Figure 3.1.

Figure 3.1. Example histogram: example bootstrap distribution for beta (horizontal axis from -10 to 10; vertical axis: frequency, 0 to 150). This figure is an example of a bootstrap distribution for $\beta_p$. The benefit of using this technique is that the distribution does not have to conform to a known form.

The advantage of using a bootstrap resampling procedure to obtain the distribution for a given parameter is that the distribution no longer has to conform to a known form; it is no longer bounded. In addition, the corresponding confidence intervals can be developed, and they need not be symmetric. The resulting $B$ estimates for a given parameter can also be used to assess significance through the proportion of them that are non-zero. A covariance matrix will also be calculated from the bootstrap resamples. In the construction of the confidence intervals, the bootstrap-t confidence interval method is used. Some of the information in Chapter 2 is restated here as a reminder. In short, the bootstrap-t confidence intervals are of the form

$$\left(\hat{\beta}_p - \hat{t}^{(1-\alpha)} \cdot \hat{se},\ \ \hat{\beta}_p - \hat{t}^{(\alpha)} \cdot \hat{se}\right), \tag{3.22}$$

where $\hat{se}(\hat{\beta}_j^*) = \sqrt{\hat{V}(\hat{\beta}_j)/B}$, with $\hat{V}(\hat{\beta}_j)$ being defined in equation (2.29), and $\hat{t}^{(\alpha)}$ is chosen from the standard normal distribution such that $\#\{Z^*(b) \le \hat{t}^{(\alpha)}\}/B = \alpha$, where $Z^*(b)$ is defined as

$$Z^*(b) = \frac{\hat{\beta}_j^*(b) - \hat{\beta}_j}{\hat{se}_j^*(b)}. \tag{3.23}$$

In addition, $\hat{\beta}_p^*$ is the average of the $B$ bootstrap resampling based estimates. The bootstrap resampling technique is formally applied as follows:

1. For a given value of lambda, $\lambda_k$, using the corresponding parameter estimates $(\hat{\alpha}, \hat{\beta}, \hat{\phi})_k$, bootstrap resample from the tuples $(y_i, X_{i\cdot})$ $B = 200$ times. Place optional lower and upper bounds on the parameter estimates.

2. For each bootstrap resample, use nonlinear programming to find the solution to equation (3.8).
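
The two steps above can be sketched in code. This is a minimal illustration under stated assumptions, not the dissertation's implementation: the penalized stereotype logit fit of equation (3.8) is replaced by a small elastic net linear regression solved by coordinate descent, and the names `fit_elastic_net` and `pairs_bootstrap`, the mixing weight `alpha`, and the toy data are all hypothetical. What it shows is the pairs design itself: rows of (y, X) are resampled together with replacement, and the same penalty value `lam` is reused for every resample.

```python
import numpy as np

def fit_elastic_net(X, y, lam, alpha=0.5, n_iter=100):
    """Stand-in estimator (assumption): elastic net linear regression by
    coordinate descent. The penalized stereotype logit fit would go here."""
    n, p = X.shape
    beta = np.zeros(p)
    z = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # partial residual with coordinate j removed from the fit
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j / n
            # soft-threshold (L1 part), then shrink (L2 part)
            shrunk = np.sign(rho) * max(abs(rho) - lam * alpha, 0.0)
            beta[j] = shrunk / (z[j] + lam * (1.0 - alpha))
    return beta

def pairs_bootstrap(X, y, fit, lam, B=200, seed=0):
    """Resample (y_i, X_i) pairs with replacement B times and refit,
    holding the penalty value lam fixed across resamples."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    estimates = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)  # n row indices, with replacement
        estimates.append(fit(X[idx], y[idx], lam))
    return np.asarray(estimates)  # shape (B, p)

# Toy data (made up for illustration only)
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.5, 0.0, -2.0]) + rng.normal(size=60)

boot = pairs_bootstrap(X, y, fit_elastic_net, lam=0.2, B=200)

boot_mean = boot.mean(axis=0)           # average of the B estimates
boot_cov = np.cov(boot, rowvar=False)   # covariance matrix of the estimates
prop_nonzero = (boot != 0).mean(axis=0) # share of resamples with a non-zero estimate
```

Any estimator with the signature `fit(X, y, lam)` could be passed in place of the stand-in. Each column of `boot` can then be plotted as a histogram, and `prop_nonzero` and `boot_cov` correspond to the non-zero proportion and covariance matrix described in the text.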

The empirical distribution function can be derived from the bootstrap distribution. The empirical distribution can be defined as follows (Rohatgi and Saleh, 2001).

Definition 4. Let $F_n^*(x) = n^{-1} \sum_{j=1}^{n} \varepsilon(x - X_j)$. Then $nF_n^*(x)$ is the number of $X_k$'s ($1 \le k \le n$) that are $\le x$. $F_n^*(x)$ is called the sample (or empirical) distribution function.

Define the indicator function ε as follows:

$$\varepsilon(a) = \begin{cases} 0 & \text{if } a < 0 \\ 1 & \text{if } a \ge 0 \end{cases}. \tag{3.24}$$

In addition we state the Corollary without proof (Rohatgi and Saleh, 2001):

Corollary 1. For each $x \in \mathbb{R}$, $F_n^*(x) \xrightarrow{P} F(x)$.

The stated corollary provides assurance that the bootstrap based empirical distribution will approach the true distribution in probability.
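
Definition 4 translates directly into code. The sketch below is illustrative only; the function name `ecdf` and the sample values are assumptions introduced for the example. It evaluates the empirical distribution function as the average of the indicator in (3.24) applied to x minus each sample value, which is the fraction of sample values less than or equal to x.

```python
import numpy as np

def ecdf(sample, x):
    """Empirical distribution function of `sample`, evaluated at each point in x."""
    sample = np.asarray(sample, dtype=float)
    x = np.atleast_1d(np.asarray(x, dtype=float))
    # indicator(x - X_j >= 0) for every (x, X_j) pair, averaged over j
    return (x[:, None] - sample[None, :] >= 0).mean(axis=1)

# Made-up values standing in for bootstrap estimates of one parameter
boot_estimates = np.array([0.3, -1.2, 0.8, 0.3])
values = ecdf(boot_estimates, [-2.0, 0.3, 1.0])
print(values)
```

For these four values the function returns 0 at x = -2, 0.75 at x = 0.3 (three of the four values are at or below 0.3), and 1 at x = 1.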
