LASSO - Machine Learning Methods for Conditional Expectations

2.3 Machine Learning Methods for Conditional Expectations

The idea of the machine learning methods is to shrink the LS estimator $\hat{\gamma}_{ls}$ towards zero using regularization. The usual way to regularize is to add a penalty term to the criterion function (2.3.1).

2.3.1 LASSO

Assume we have a high-dimensional sparse linear model, that is, the number of regressors, $p$, can be larger or even much larger than the sample size $n$, but only a small number $s < n$ of the regressors are of substantial importance for capturing the conditional expectation (Belloni and Chernozhukov, 2011).

$$s = \|\gamma_0\|_0 := |\{j : \gamma_{0,j} \neq 0\}| \ll n.$$

The classic AIC/BIC estimator (Akaike, 1974; Schwarz, 1978) solves the following oracle problem:

$$\hat{\gamma}_o = \arg\min_{\gamma} \sum_{i=1}^{N} (Y_i - X_i'\gamma)^2 + \lambda \|\gamma\|_0, \tag{2.3.2}$$

where $\|\gamma_0\|_0 := |\{j : \gamma_{0,j} \neq 0\}|$ denotes the $\ell_0$ norm and $\lambda$ is the penalty parameter. Then $\hat{\gamma}_o$ defined in (2.3.2) can achieve the oracle convergence rate $O_P(\sqrt{s/n})$. However, the criterion function (2.3.2) is non-convex, and solving the minimization problem requires $\sum_{k \le n} \binom{p}{k}$ least squares estimations, which is an NP-hard problem (Natarajan, 1995).

The LASSO estimator instead uses the convex $\ell_1$ norm as the penalty term in the criterion function

$$\hat{\gamma}_{lasso} = \arg\min_{\gamma} \sum_{i=1}^{N} (Y_i - X_i'\gamma)^2 + \lambda \|\gamma\|_1, \tag{2.3.3}$$

where $\|\gamma\|_1 = \sum_{j=1}^{p} |\gamma_j|$ denotes the $\ell_1$ norm. With the $\ell_1$ penalty, LASSO coefficient estimates can be driven exactly to zero during the regularization process, so the estimator can be used for variable selection. Moreover, the criterion function (2.3.3) is convex, so the LASSO estimator is efficient to compute. The penalty parameter $\lambda$ controls both the shrinkage of the estimates and the variable selection. We review the choice of $\lambda$ based on theoretical rules and on cross-validation in practice.
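One common way to minimize a criterion of the form (2.3.3) is cyclic coordinate descent with soft-thresholding. The text does not prescribe a particular algorithm, so this is only an illustrative sketch. The names `soft_threshold` and `lasso_cd` and the fixed iteration count are assumptions. For the unnormalized criterion $\sum_i (Y_i - X_i'\gamma)^2 + \lambda\|\gamma\|_1$, the update for coordinate $j$ is $\gamma_j = S(\rho_j, \lambda/2) / \sum_i X_{ij}^2$, where $\rho_j$ is the inner product of column $j$ with the partial residual.

```python
def soft_threshold(z: float, t: float) -> float:
    """S(z, t) = sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Minimise sum_i (y_i - x_i'g)^2 + lam * ||g||_1 by coordinate descent.

    X is a list of rows, y a list of responses. A fixed number of sweeps
    is used instead of a convergence check (a simplification).
    """
    n, p = len(X), len(X[0])
    gamma = [0.0] * p
    resid = list(y)  # residual y - X gamma, with gamma = 0 at the start
    col_ss = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_ss[j] == 0.0:
                continue
            # Inner product of column j with the residual that excludes j.
            rho = sum(X[i][j] * resid[i] for i in range(n)) + col_ss[j] * gamma[j]
            new = soft_threshold(rho, lam / 2.0) / col_ss[j]
            delta = new - gamma[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                gamma[j] = new
    return gamma


# Toy orthogonal design: the least squares fit would be [3.0, 0.2].
X = [[1, 0], [0, 1], [1, 0], [0, 1]]
y = [3.0, 0.2, 3.0, 0.2]
print(lasso_cd(X, y, lam=2.0))  # [2.5, 0.0]
```

In this toy example the small coefficient is set exactly to zero (variable selection) while the large one is shrunk from 3.0 to 2.5 (shrinkage bias).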

Theoretically, $\lambda$ should be large enough to dominate the noise with high probability, $\lambda > 2\|n^{-1}\sum_{i=1}^{n} X_i\varepsilon_i\|_\infty$. At the same time, $\lambda$ should be as small as possible to reduce the bias induced by shrinkage. In practice, Bickel, Ritov, and Tsybakov (2009) suggest setting

$$\lambda = 2c\sigma\sqrt{2n\log(2p/\alpha)}, \tag{2.3.4}$$

where $c > 1$ and $\alpha \in (0, 1)$ are constants and $\sigma$ is the standard deviation of the residual $\varepsilon$. Typically $\sigma$ is unknown and needs to be estimated from the data using an iterative method. Belloni and Chernozhukov (2013) also propose the choice

$$\lambda = 2c\sigma\sqrt{n}\,\Phi^{-1}\big(1 - \alpha/(2p)\big), \tag{2.3.5}$$

where $\Phi^{-1}(\cdot)$ is the inverse of the cumulative distribution function of the standard normal distribution. As shown in Bickel, Ritov, and Tsybakov (2009) and Belloni and Chernozhukov (2013), the choices of $\lambda$ in (2.3.4) and (2.3.5) lead to near-oracle rates of convergence for the estimator $\hat{\gamma}_{lasso}$ under general conditions:

$$\|\hat{\gamma}(\lambda) - \gamma\|_2 = O_P\left(\sqrt{\frac{s\log p}{n}}\right).$$
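The two penalty rules (2.3.4) and (2.3.5) are simple to evaluate once $\sigma$ is available. The sketch below treats $\sigma$ as known, whereas in practice it would be estimated iteratively. The function names are hypothetical, and the defaults $c = 1.1$ and $\alpha = 0.05$ are illustrative assumptions, not values taken from the text.

```python
from math import log, sqrt
from statistics import NormalDist


def lambda_brt(n, p, sigma, c=1.1, alpha=0.05):
    """Penalty (2.3.4): 2 c sigma sqrt(2 n log(2p / alpha))."""
    return 2.0 * c * sigma * sqrt(2.0 * n * log(2.0 * p / alpha))


def lambda_bc(n, p, sigma, c=1.1, alpha=0.05):
    """Penalty (2.3.5): 2 c sigma sqrt(n) Phi^{-1}(1 - alpha / (2p))."""
    quantile = NormalDist().inv_cdf(1.0 - alpha / (2.0 * p))
    return 2.0 * c * sigma * sqrt(n) * quantile


# Illustrative values only: n = 100 observations, p = 200 regressors.
print(lambda_brt(n=100, p=200, sigma=1.0))
print(lambda_bc(n=100, p=200, sigma=1.0))
```

Both rules scale linearly in $c$ and $\sigma$, so the arbitrary choice of $c$ and $\alpha$ feeds directly into the amount of shrinkage.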

With the $\ell_1$ penalty, some LASSO coefficient estimates can be driven exactly to zero during the regularization process. Hence this technique can be used for variable selection and for generating a more parsimonious model. However, LASSO perfectly selects the oracle model only in special cases. In general, Belloni and Chernozhukov (2013) show that the LASSO estimator $\hat{\gamma}(\lambda)$ with $\lambda$ defined in (2.3.5) obtains sparsity results. Specifically, let $T = \{j : \gamma_j \neq 0\}$ and $\hat{T} = \{j : \hat{\gamma}_j(\lambda) \neq 0\}$. Then $|\hat{T} \setminus T| \le Cs$ with high probability, where $C$ is a constant, which indicates that the number of irrelevant regressors selected by LASSO is at most of the same order as the true sparsity. The result also implies that $\hat{s} := |\hat{T}| \le s + |\hat{T} \setminus T| \le \tilde{C}s$ with high probability. Thus, the LASSO estimator with penalty choice (2.3.5) has the sparsity property.

The LASSO estimator can drive some parameters exactly to zero, but it also shrinks all the non-zero parameters towards zero, which leads to estimation bias. In order to eliminate this bias, Belloni and Chernozhukov (2013) suggest applying the Post-LASSO estimator, which minimizes the least squares criterion (1) over the non-zero components selected by the LASSO estimator:

$$\tilde{\gamma} \in \arg\min_{\gamma \in \mathbb{R}^p} \left\{ \sum_{i=1}^{N} (Y_i - X_i'\gamma)^2 : \gamma_j = 0 \text{ for each } j \in \hat{T}^c \right\}, \tag{2.3.6}$$

where $\hat{T}^c = \{j : \hat{\gamma}_j = 0\}$. If the variables are perfectly selected, then the Post-LASSO estimator is exactly the oracle estimator for $\gamma$. But even if the model selection is not perfect, Belloni and Chernozhukov (2013) prove that the Post-LASSO estimator achieves the same near-oracle convergence rate as LASSO, and a strictly faster rate in certain cases. Also, by construction, the Post-LASSO estimator has smaller shrinkage bias.
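The refitting step in (2.3.6) amounts to ordinary least squares on the columns with non-zero LASSO coefficients. The following is a minimal sketch under that reading. The helper names `solve_linear` and `post_lasso` are hypothetical, the selected regressors are assumed not to be collinear, and the LASSO coefficients passed in the example are made up for illustration.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    m = len(b)
    M = [list(A[r]) + [b[r]] for r in range(m)]
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(M[r][col]))
        if abs(M[piv][col]) < 1e-12:
            raise ValueError("selected regressors are collinear")
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, m):
            f = M[r][col] / M[col][col]
            for k in range(col, m + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * m
    for r in range(m - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, m))
        x[r] = (M[r][m] - s) / M[r][r]
    return x


def post_lasso(X, y, gamma_lasso):
    """Least squares on the LASSO-selected support; other coefficients stay 0."""
    n, p = len(X), len(gamma_lasso)
    support = [j for j in range(p) if gamma_lasso[j] != 0.0]
    gamma = [0.0] * p
    if not support:
        return gamma
    # Normal equations restricted to the selected columns.
    XtX = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in support]
           for a in support]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in support]
    for j, coef in zip(support, solve_linear(XtX, Xty)):
        gamma[j] = coef
    return gamma


# Hypothetical LASSO fit [2.5, 0.0]: only the first regressor is selected.
X = [[1, 0], [0, 1], [1, 0], [0, 1]]
y = [3.0, 0.2, 3.0, 0.2]
print(post_lasso(X, y, [2.5, 0.0]))  # [3.0, 0.0]
```

The refit restores the selected coefficient to its unpenalized least squares value while keeping the excluded coefficient at zero.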

The LASSO estimator based on the theoretical penalty choice (2.3.4) or (2.3.5) has good theoretical properties in terms of both convergence rate and variable selection. However, the choice of the parameters $c$ and $\alpha$ in (2.3.4) and (2.3.5) is arbitrary in practice and might affect the performance of the estimators. In practice, researchers often prefer to use cross-validation to choose the penalty parameter for the LASSO estimator (Chetverikov, Liao, and Chernozhukov, 2019). Consider $K$-fold cross-validation: the sample is partitioned into $K$ subsets $I_1, \dots, I_K$, where $I_k$ contains $n_k$ observations, and $\hat{\gamma}_{-k}(\lambda)$ denotes the LASSO estimator computed with the $k$-th subset removed, given any penalty level $\lambda$:

$$\hat{\gamma}_{-k}(\lambda) = \arg\min_{\gamma \in \mathbb{R}^p} \left\{ \frac{1}{n - n_k} \sum_{i \notin I_k} (Y_i - X_i'\gamma)^2 + \lambda\|\gamma\|_1 \right\}.$$

Then the cross-validated penalty parameter $\hat{\lambda}$ is chosen by minimizing the sum of prediction errors on the validation sets,

$$\hat{\lambda} = \arg\min_{\lambda} \sum_{k=1}^{K} \sum_{i \in I_k} \big(Y_i - X_i'\hat{\gamma}_{-k}(\lambda)\big)^2.$$
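The cross-validation procedure above can be sketched as a search over a finite grid of candidate penalties. Everything below is an illustrative assumption rather than the procedure used in the cited papers: the coordinate descent solver, the interleaved fold assignment, the grid of candidate values, and the simulated data.

```python
import random


def soft_threshold(z, t):
    return z - t if z > t else z + t if z < -t else 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Minimise (1/m) sum_i (y_i - x_i'g)^2 + lam * ||g||_1, with m = len(y)."""
    m, p = len(X), len(X[0])
    gamma, resid = [0.0] * p, list(y)
    col_ss = [sum(row[j] ** 2 for row in X) for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_ss[j] == 0.0:
                continue
            rho = sum(X[i][j] * resid[i] for i in range(m)) + col_ss[j] * gamma[j]
            # The 1/m scaling of the loss turns the threshold lam/2 into m*lam/2.
            new = soft_threshold(rho, m * lam / 2.0) / col_ss[j]
            delta = new - gamma[j]
            for i in range(m):
                resid[i] -= X[i][j] * delta
            gamma[j] = new
    return gamma


def cv_lasso_lambda(X, y, lambdas, K=5):
    """Return the candidate penalty with the smallest summed validation error."""
    n = len(y)
    folds = [list(range(k, n, K)) for k in range(K)]  # I_1, ..., I_K
    best_lam, best_err = None, float("inf")
    for lam in lambdas:
        err = 0.0
        for I_k in folds:
            held_out = set(I_k)
            train = [i for i in range(n) if i not in held_out]
            g = lasso_cd([X[i] for i in train], [y[i] for i in train], lam)
            err += sum((y[i] - sum(x * c for x, c in zip(X[i], g))) ** 2
                       for i in I_k)
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam


# Simulated sparse design: only the first of five regressors matters.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(40)]
y = [2.0 * row[0] + rng.gauss(0, 0.5) for row in X]
print(cv_lasso_lambda(X, y, lambdas=[0.01, 0.1, 1.0, 10.0]))
```

The grid is searched exhaustively here; the selected value depends on the grid and on the fold assignment, which is one reason the choice of $K$ matters in practice.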

Chetverikov, Liao, and Chernozhukov (2019) show that the $K$-fold cross-validated LASSO estimator $\hat{\gamma}(\hat{\lambda})$ can attain the optimal rate of convergence up to certain logarithmic factors:

$$\|\hat{\gamma}(\hat{\lambda}) - \gamma\|_2 = O_P\left(\sqrt{\frac{s\log p}{n}} \times \sqrt{\log(pn)}\right).$$

Their simulation results show that the cross-validated LASSO estimator has a much smaller estimation error than the LASSO estimator with $\lambda$ chosen by (2.3.5).

Chetverikov, Liao, and Chernozhukov (2019) also discuss the sparsity bound for the cross-validated LASSO estimator. Theoretically, they show that the number of non-zero components in the cross-validated LASSO estimator $\hat{\gamma}(\hat{\lambda})$ may exceed $s$ only by the small factor $(\log^2 p)(\log n)(\log(pn) + s^{-1}\log^{r} n)$. However, the simulation results suggest that cross-validation typically yields a small value of $\lambda$ and thus tends to select too many covariates. Also, for the cross-validated LASSO, to the best of my knowledge, there are still no theoretical results for the performance of the Post-LASSO estimator.

In practice, the choice of $K$ for cross-validation involves a bias-variance trade-off. If $K$ is large, for example $K = N$ (leave-one-out), the cross-validation estimator has small bias but high variance; the reverse holds when $K$ is small. Overall, $K = 5$ or $K = 10$ is recommended in practice as a good balance between bias and variance (Hastie, Tibshirani, and Friedman, 2009).

In our Monte Carlo simulation, we use cross-validation to choose the penalty parameter $\lambda$, which avoids the arbitrary choice of parameters in the theoretical rules. We also consider the performance of the Post-LASSO estimator based on the cross-validated variable selection, although there are no theoretical results for that case.

In document Identification and Estimation in Semiparametric Social Interaction Models (Page 84-88)