In this section, we turn to regularization methods for the logistic regression model with high-dimensional predictors. We first provide the statistical framework for logistic regression in the traditional setting. Consider a logistic regression of the form
\[ \log \frac{P(Y_i = 1 \mid x_i)}{P(Y_i = 0 \mid x_i)} = \log \frac{\pi(x_i)}{1 - \pi(x_i)} = \alpha + x_i^T \beta, \tag{2.2.1} \]
where $Y_i = 1$ indicates a success and $Y_i = 0$ indicates a failure. For each observation $i$, $Y_i$ is the number of successes and $1 - Y_i$ is the number of failures in a single trial. Therefore, $Y_i$ follows a binomial distribution with mass function
\[ f(Y_i; 1, \pi_i) = \binom{1}{Y_i} \pi_i^{Y_i} (1 - \pi_i)^{1 - Y_i}, \]
where $\pi_i = \pi(x_i)$. Here $x_i^T = (x_{i1}, \dots, x_{ip})$ is a vector of $p$ covariates, $\alpha$ represents the intercept, and $\beta = (\beta_1, \dots, \beta_p)$ is a vector of coefficients associated with the covariates. The maximum likelihood approach is used to obtain estimates of the unknown parameters in the model. The likelihood function for the logistic regression model can be written as
\[ L(\alpha, \beta \mid x_i) = \prod_{i=1}^{N} \pi(x_i)^{Y_i} (1 - \pi(x_i))^{1 - Y_i} = \prod_{i=1}^{N} \left[ \frac{\exp(\alpha + x_i^T \beta)}{1 + \exp(\alpha + x_i^T \beta)} \right]^{Y_i} \left[ \frac{1}{1 + \exp(\alpha + x_i^T \beta)} \right]^{1 - Y_i}. \tag{2.2.2} \]
Correspondingly, the log-likelihood can be written as
\[ \log L(\alpha, \beta \mid x_i) = \sum_{i=1}^{N} \left[ Y_i \log \pi(x_i) + (1 - Y_i) \log(1 - \pi(x_i)) \right]. \tag{2.2.3} \]
It is not hard to show that $\log L(\alpha, \beta \mid x_i)$ in (2.2.3) is a concave function, so its negative, $-\log L(\alpha, \beta \mid x_i)$, is a convex function.
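As an illustration of (2.2.3) and of the convexity claim, the following is a minimal NumPy sketch, not the authors' implementation. The function names `log_likelihood` and `neg_log_lik_hessian` and the toy data are assumptions made for this example. The sketch uses the identity $Y_i \log \pi(x_i) + (1 - Y_i) \log(1 - \pi(x_i)) = Y_i \eta_i - \log(1 + e^{\eta_i})$ with $\eta_i = \alpha + x_i^T \beta$, and checks numerically that the Hessian of the negative log-likelihood has no negative eigenvalues.

```python
import numpy as np

def log_likelihood(alpha, beta, X, y):
    """Log-likelihood (2.2.3) for a binary response y in {0, 1}."""
    eta = alpha + X @ beta  # linear predictor alpha + x_i^T beta
    # y*log(pi) + (1-y)*log(1-pi) simplifies to y*eta - log(1 + exp(eta));
    # logaddexp(0, eta) evaluates log(1 + exp(eta)) without overflow.
    return np.sum(y * eta - np.logaddexp(0.0, eta))

def neg_log_lik_hessian(alpha, beta, X):
    """Hessian of -log L with respect to (alpha, beta)."""
    eta = alpha + X @ beta
    pi = 1.0 / (1.0 + np.exp(-eta))
    Z = np.column_stack([np.ones(X.shape[0]), X])  # column of ones for alpha
    w = pi * (1.0 - pi)  # Bernoulli variances, always nonnegative
    return Z.T @ (w[:, None] * Z)  # sum_i w_i z_i z_i^T

# Toy data for illustration only (not from the source).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.integers(0, 2, size=50)
beta = np.array([0.5, -1.0, 0.25])

H = neg_log_lik_hessian(0.1, beta, X)
min_eig = np.linalg.eigvalsh(H).min()  # nonnegative up to rounding error
```

Because the Hessian is a sum of outer products with nonnegative weights, it is positive semidefinite for any $(\alpha, \beta)$, which is what makes $-\log L$ convex.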
2.2.1 LASSO for Logistic Regression
In a high-dimensional setting where $N \ll p$, maximum likelihood estimation is no longer feasible due to singularities in the Hessian matrix. A penalized model is again useful in this case. Recall that for the linear regression model with high-dimensional covariates, the penalized estimator is the one that optimizes the penalized least squares criterion (2.1.3), which consists of the residual sum of squares and the penalty function. Note that the residual sum of squares is the kernel of the likelihood in the linear model case. When the response is discrete, minimizing the residual sum of squares is not appropriate, so a modified penalized model that maximizes the penalized log-likelihood function was proposed [Wu et al., 2009]. The penalized log-likelihood consists of the log-likelihood and the penalty function:
\[ \log Q(\alpha, \beta \mid x_i) = \sum_{i=1}^{N} \left[ Y_i \log \pi(x_i) + (1 - Y_i) \log(1 - \pi(x_i)) \right] - \lambda \sum_{j=1}^{p} |\beta_j|. \tag{2.2.4} \]
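The penalized criterion (2.2.4) can be evaluated with a few lines of NumPy. This is a hedged sketch rather than the implementation used in the source: the function name `penalized_log_likelihood`, the argument name `lam` (standing for $\lambda$), and the toy inputs are assumptions for illustration. As in (2.2.4), the intercept $\alpha$ is left unpenalized.

```python
import numpy as np

def penalized_log_likelihood(alpha, beta, X, y, lam):
    """Evaluate log Q in (2.2.4): log-likelihood minus the L1 penalty."""
    eta = alpha + X @ beta  # linear predictor
    # Stable form of sum_i [y_i log pi_i + (1 - y_i) log(1 - pi_i)].
    log_lik = np.sum(y * eta - np.logaddexp(0.0, eta))
    penalty = lam * np.sum(np.abs(beta))  # alpha is not penalized
    return log_lik - penalty

# Toy inputs for illustration only (not from the source).
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]])
y = np.array([1.0, 0.0, 1.0, 0.0])
beta = np.array([1.0, -2.0])

unpenalized = penalized_log_likelihood(0.0, beta, X, y, lam=0.0)
penalized = penalized_log_likelihood(0.0, beta, X, y, lam=0.5)
```

With these inputs the two values differ by exactly the penalty $\lambda \sum_j |\beta_j| = 0.5 \times 3 = 1.5$.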
2.2.2 Forward Stagewise for Logistic Regression
The forward stagewise method for logistic regression inherits its mechanism from the linear model: it updates one coefficient at a time by a small increment and thereby obtains the penalized estimate. Recall that in the forward stagewise algorithm for the linear regression model, the coefficient to be updated at each iteration is selected based on the correlation between the covariates and the current residual. As mentioned previously, the residual is no longer an appropriate measure for a discrete response, so a modified forward stagewise method, in which the coefficient to be updated is selected based on the gradient of the likelihood function, is proposed for models with a discrete response. In addition, determining the direction in which to update the coefficient requires the second-order derivatives of the likelihood function at every step, which is computationally expensive given the hundreds of thousands of iterations needed before the likelihood function reaches its optimum.
Hastie et al. [2007] showed that the expanded representation of the LASSO problem produces an efficient version of forward stagewise that avoids the cumbersome computation of second-order derivatives of the likelihood function. In the expanded setting, a negative copy of the covariates is created, so the new predictor space is $\tilde{X} = (X, -X)$. Correspondingly, the coefficients are expanded to $\beta = (\beta_1, \dots, \beta_p, \beta_{p+1}, \dots, \beta_{2p})$. At each iteration, only one $\beta_j$ is updated by a small increment, and the Karush-Kuhn-Tucker condition (a proof can be found in the Appendix of [Hastie et al., 2007]) ensures that $\beta_j$ and $\beta_{j+p}$, which represent the same covariate $x_j$, cannot be selected simultaneously in the regularization path. Therefore, the penalized estimate of $\beta$ is obtained by subtracting the coefficient associated with $-x_j$ from that associated with $x_j$; that is, $\beta = (\beta_1 - \beta_{p+1}, \beta_2 - \beta_{p+2}, \dots, \beta_p - \beta_{2p})$ is the final solution.
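The bookkeeping of the expanded representation can be illustrated with a short NumPy sketch. The helper names `expand_design` and `collapse_coefficients` and the toy numbers are assumptions introduced for this example. The sketch checks that a nonnegative expanded coefficient vector gives the same linear predictor as the collapsed vector on the original covariates.

```python
import numpy as np

def expand_design(X):
    """Append a negated copy of the covariates: X_tilde = (X, -X)."""
    return np.hstack([X, -X])

def collapse_coefficients(beta_expanded):
    """Map the 2p expanded coefficients back to p signed coefficients."""
    p = beta_expanded.size // 2
    # Coefficient on x_j minus coefficient on -x_j.
    return beta_expanded[:p] - beta_expanded[p:]

# Toy numbers for illustration only (not from the source).
X = np.array([[1.0, 2.0], [3.0, -1.0], [0.5, 0.0]])
# Nonnegative expanded coefficients: x_1 and -x_2 are active.
beta_expanded = np.array([0.3, 0.0, 0.0, 0.2])

beta = collapse_coefficients(beta_expanded)
same_fit = np.allclose(expand_design(X) @ beta_expanded, X @ beta)
```

Here `beta` is $(0.3, -0.2)$: a negative coefficient on $x_2$ is represented by a positive coefficient on its negated copy, so every update in the expanded space can be a positive increment.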
Forward stagewise regression for the logistic model works as follows:
1. Let $\tilde{X} = (X, -X)$ and standardize the covariates so that each has mean 0 and unit norm. In the initial step, set $\beta = (\beta_1, \dots, \beta_p, \beta_{p+1}, \dots, \beta_{2p}) = 0$.
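2. Calculate the first-order derivative of $-\log L(\alpha, \beta \mid x_i)$ with respect to each $\beta_j$, evaluated at the current estimate $\beta = \beta^{(s)}$. Find the predictor $x_j$ with the most negative gradient element, that is, the negative element that is largest in magnitude.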
3. Update the corresponding coefficient $\beta_j$ with $\beta_j \leftarrow \beta_j + \epsilon$ to yield the new estimate $\beta = \beta^{(s+1)}$, where $\epsilon$ is a small amount, e.g. $\epsilon = 10^{-4}$.
4. Repeat steps 2 and 3 many times.
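The four steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' code. The function name `forward_stagewise_logistic` and its parameters `eps` and `n_steps` are hypothetical. The steps do not say how the intercept $\alpha$ is handled, so this sketch assumes it is held fixed at the null-model log-odds. It also runs a fixed number of iterations rather than applying a stopping rule, and it returns coefficients on the standardized covariate scale.

```python
import numpy as np

def forward_stagewise_logistic(X, y, eps=1e-4, n_steps=5000):
    """Forward stagewise for logistic regression on the expanded design."""
    # Step 1: standardize each covariate to mean 0 and unit Euclidean norm,
    # then append the negated copy to form X_tilde = (X, -X).
    Xc = X - X.mean(axis=0)
    Xs = Xc / np.linalg.norm(Xc, axis=0)
    X_exp = np.hstack([Xs, -Xs])
    p = X.shape[1]
    beta_exp = np.zeros(2 * p)
    # Assumption of this sketch: alpha is fixed at log(ybar / (1 - ybar)),
    # which requires both classes to be present in y.
    ybar = y.mean()
    alpha = np.log(ybar / (1.0 - ybar))
    for _ in range(n_steps):  # Step 4: repeat steps 2 and 3
        # Step 2: gradient of -log L with respect to the expanded coefficients.
        pi = 1.0 / (1.0 + np.exp(-(alpha + X_exp @ beta_exp)))
        grad = -X_exp.T @ (y - pi)
        j = np.argmin(grad)  # most negative gradient element
        # Step 3: small positive increment on the selected coefficient.
        beta_exp[j] += eps
    # Collapse: coefficient on x_j minus coefficient on -x_j.
    return alpha, beta_exp[:p] - beta_exp[p:]

# Toy data for illustration only: y depends on the first covariate alone.
x0 = np.linspace(-1.0, 1.0, 20)
x1 = np.tile([1.0, -1.0], 10)
X = np.column_stack([x0, x1])
y = (x0 > 0).astype(float)
alpha, beta = forward_stagewise_logistic(X, y, eps=0.01, n_steps=500)
```

Only first-order derivatives are computed in each iteration, and because every covariate also appears negated, each update is a positive increment regardless of the sign of the final coefficient.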
There is no standard stopping criterion in the forward stagewise method for logistic regression. We stop the iteration when the difference between successive log-likelihood values is smaller than a given value, that is | log L(α,β|xi)|β=β(s)−log L(α, β|xi)|β=β(s+1)| <