2.2 Technical Background: Regression Methods
2.2.3 Ridge Regression
Ridge regression [83,84] utilises the first approach to dealing with the undesirable
inflation of the length of the OLS estimator $\hat{\beta}_{OLS}$ due to the ill conditioning of $X^{T}X$. The derivation of the OLS
regression coefficient vector follows from identifying the vector $\beta$ which minimises
the squared error of the OLS relationship. Ridge regression instead identifies the vector which minimises the following relationship.
$$(y - X\beta)^{T}(y - X\beta) + \lambda\beta^{T}\beta \tag{13}$$
Here $\lambda$ is introduced as a parameter to control the length of $\beta$; as such, the second
term is a penalty based on the Euclidean or l2-norm of $\beta$. The derivation of $\hat{\beta}_{Ridge}$ is as follows.
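$$SE_{Ridge} = (y - X\beta)^{T}(y - X\beta) + \lambda\beta^{T}\beta \tag{14}$$
$$\frac{\partial SE_{Ridge}}{\partial\beta} = \frac{\partial}{\partial\beta}\left[y^{T}y - y^{T}X\beta - \beta^{T}X^{T}y + \beta^{T}X^{T}X\beta + \lambda\beta^{T}\beta\right] = -2X^{T}y + 2X^{T}X\beta + 2\lambda\beta = 0$$
$$X^{T}y = X^{T}X\beta + \lambda\beta = (X^{T}X + \lambda I)\beta$$
$$\hat{\beta}_{Ridge} = (X^{T}X + \lambda I)^{-1}X^{T}y \tag{15}$$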
When $X^{T}X$ is ill conditioned, this estimator will not experience the same length deformation as the OLS
estimator; the Ridge estimator achieves
this at the cost of some bias. The larger $\lambda$ is, the smaller the Euclidean length of the resulting estimator vector. A simplified interpretation of the process of Ridge regression can be seen in Figure 2.2.1.
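The relationship between $\lambda$ and the length of the Ridge estimator can be sketched numerically. The following is a minimal illustration rather than a definitive implementation: it assumes the closed form $(X^{T}X + \lambda I)^{-1}X^{T}y$, and the function name `ridge_estimator`, the synthetic near-collinear data and the chosen values of $\lambda$ are all hypothetical.

```python
import numpy as np

def ridge_estimator(X, y, lam):
    """Closed-form Ridge estimate: solve (X^T X + lam * I) beta = X^T y."""
    n_features = X.shape[1]
    gram = X.T @ X + lam * np.eye(n_features)
    # Solving the linear system avoids forming the inverse explicitly.
    return np.linalg.solve(gram, X.T @ y)

# Hypothetical ill-conditioned data: the second column is almost a copy of the first.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
X = np.column_stack([x1, x1 + 1e-4 * rng.normal(size=100)])
y = x1 + 0.1 * rng.normal(size=100)

lambdas = [1e-3, 1e-1, 1e1, 1e3]
lengths = [float(np.linalg.norm(ridge_estimator(X, y, lam))) for lam in lambdas]
for lam, length in zip(lambdas, lengths):
    print(f"lambda = {lam:g}: Euclidean length of estimate = {length:.4f}")
```

The printed lengths should decrease as $\lambda$ grows, which is the shrinkage behaviour described above.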
In Figure 2.2.1, the OLS error function is depicted as a simple parabola, and the figure shows how the penalty term affects this very simple squared error function. Adding the squared penalty term to the function shifts its minimum closer to the origin along the $\beta$ axis than that of the original function; as a consequence of
controlling the length, the squared error has increased. Figure 2.2.2 shows the projections of the squared error
parabolas onto a two-dimensional space that represents two elements of the $\beta$ vector.
Figure 2.2.1: Simplified example of how the Ridge penalty would influence the squared error function.
Figure 2.2.2: Illustration of how the Ridge penalty affects elements of $\beta$.
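The shift of the minimum described for Figure 2.2.1 can be checked with a one-dimensional worked example. This is a sketch under the assumption of a hypothetical squared error function $(\beta - b)^{2}$ whose unpenalised minimum lies at $b$; the symbol $b$ is introduced here for illustration only.

```latex
\begin{align*}
f(\beta) &= (\beta - b)^{2} + \lambda\beta^{2} \\
\frac{\mathrm{d}f}{\mathrm{d}\beta} &= 2(\beta - b) + 2\lambda\beta = 0
  \quad\Longrightarrow\quad \hat{\beta} = \frac{b}{1 + \lambda} \\
(\hat{\beta} - b)^{2} &= \left(\frac{\lambda b}{1 + \lambda}\right)^{2} > 0
  \quad \text{for } \lambda > 0,\ b \neq 0
\end{align*}
```

For any $\lambda > 0$ the minimiser lies closer to the origin than $b$, while the squared error at the minimiser is no longer zero.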
The arc that connects the two axes represents another approach to formulating this
problem. Rather than finding the minimum value of the Ridge parabola (or the central point of the Ridge contours), an equivalent approach is minimisation of the squared error
subject to the Euclidean length of the $\beta$ vector being restricted to lie within the arc
illustrated. This second approach identifies a vector that lies on the edge of the arc and touches the innermost contour of the OLS squared error function. Both approaches yield the same solution; however, they go about it in slightly different ways.
It is worth noting at this stage that a downside of Ridge regression is that it is very unlikely to
produce sparse solutions. A sparse solution is one possessing zero contribution from
many of the components. This can be interpreted visually in Figure 2.2.2: because of the smooth nature of the constraint region, there is a far greater likelihood that the Ridge vector will occur at a location not exactly overlapping one of the element vectors
or axes.
Mention has been made of the l2-norm; generally, an lp-norm is defined by
the following equation.
$$\|\beta\|_{p} = \left(\sum_{j}|\beta_{j}|^{p}\right)^{1/p} \tag{16}$$
Whilst every choice of $p$ in equation 16 still yields a measure of the magnitude of $\beta$, the various norms differ in their mathematical properties. The difference between
the geometry of the l2 and l1 norms may be seen in Figure 2.2.3. The key difference
between the two is that the l1-norm possesses corners whilst the l2-norm does not; it
will become apparent shortly why this is important.
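The norm definition can be illustrated with a short numerical sketch. It assumes the standard lp-norm with the $1/p$ root; the function name `lp_norm` and the example vector are hypothetical.

```python
import numpy as np

def lp_norm(beta, p):
    """lp-norm of a vector: (sum_j |beta_j|^p)^(1/p), for p >= 1."""
    beta = np.asarray(beta, dtype=float)
    return float(np.sum(np.abs(beta) ** p) ** (1.0 / p))

beta = np.array([3.0, -4.0])   # hypothetical two-element coefficient vector
l1 = lp_norm(beta, 1)          # sum of absolute values: |3| + |-4|
l2 = lp_norm(beta, 2)          # Euclidean length: sqrt(9 + 16)
print(l1, l2)
```

The same vector has a different length under each norm, which is why the l1 and l2 penalties constrain the coefficient vector differently.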
2.2.4 Least Absolute Shrinkage and Selection Operator
The second method to incorporate regression vector length control is known as least
absolute shrinkage and selection operator [85] (LASSO). LASSO controls the length of the regression vector in a fashion very similar to Ridge with the key difference being
that the length is accounted for using the l1-norm rather than the l2-norm. This is a
subtle but important difference, as will be illustrated. It has been mentioned that Ridge does not successfully induce sparse solutions; as a consequence, all elements of the estimator vector are retained in the penalised solution, but they are all scaled to fit. The scaling affects large magnitude elements more than their smaller counterparts, the result of which is a smoothing out of the contributions of the variables and reduced interpretability of the final solution. Alongside reduced interpretability, it may also outright reduce the performance of the solution.
Figure 2.2.3: Difference between l2 and l1 norm geometries in two dimensions.
The LASSO squared error function may be seen below, followed by the derivation of the regression coefficient vector $\hat{\beta}_{LASSO}$.
$$SE_{LASSO} = (y - X\beta)^{T}(y - X\beta) + \lambda\|\beta\|_{1} \tag{17}$$
$$\frac{\partial SE_{LASSO}}{\partial\beta} = -2X^{T}y + 2X^{T}X\beta + \frac{\partial}{\partial\beta}\left(\lambda\|\beta\|_{1}\right) = -2X^{T}y + 2X^{T}X\beta + \lambda s = 0$$
$$s_{j} = \begin{cases} -1, & \beta_{j} < 0 \\ 1, & \beta_{j} > 0 \end{cases}$$
$$\hat{\beta}_{LASSO} = (X^{T}X)^{-1}\left(X^{T}y - \frac{\lambda}{2}s\right) \tag{18}$$
As $\lambda$ increases, the LASSO estimator vector shrinks. In the simplest case, where the columns of $X$ are orthonormal, if
$$|x_{j}^{T}y| < \frac{\lambda}{2}$$
for the $j$th column $x_{j}$ of $X$, the corresponding coefficient will shrink to zero; larger coefficients will be more resistant and will be
retained longer. It was mentioned earlier that the geometry of the l1-norm is such that
it possesses corners. These corners increase the likelihood of a sparse solution, as a contour of the squared error function has a greater chance of touching a corner than a straight edge of the region. Visually this may be seen in the simple example of Figure 2.2.4.
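The shrink-to-zero behaviour can be sketched for the special case of orthonormal columns, where the LASSO solution reduces to soft-thresholding of $X^{T}y$ at $\lambda/2$. This is an illustration under that assumption only, not a general LASSO solver; the function names and the synthetic data are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink each element of z towards zero by t; elements with |z| <= t become zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_orthonormal(X, y, lam):
    """LASSO estimate when X^T X = I, for the objective
    (y - X b)^T (y - X b) + lam * sum_j |b_j|."""
    return soft_threshold(X.T @ y, lam / 2.0)

# Hypothetical data: orthonormal columns obtained from a QR decomposition.
rng = np.random.default_rng(1)
X, _ = np.linalg.qr(rng.normal(size=(50, 3)))
true_beta = np.array([2.0, 0.3, 0.0])
y = X @ true_beta + 0.01 * rng.normal(size=50)

for lam in [0.0, 1.0, 5.0]:
    print(lam, lasso_orthonormal(X, y, lam))
```

With `lam = 0.0` the estimate coincides with the OLS solution $X^{T}y$; as `lam` grows, the small coefficients are set exactly to zero first, while the large coefficient is retained longer.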
The sparsity of the LASSO estimator is a desirable trait as it allows for the identification and retention of important features of the data set, whilst removing those of little contribution. The results of the LASSO regression method then provide greater insight into the relevant information of the initial data set. However, the LASSO
algorithm selects single variables [90] within a highly correlated region of variables,
and does so indiscriminately. The indiscriminate nature of the variable selection in LASSO is undesirable as it reduces the insight one can gain upon inspecting the selected variables, and may lead to poor selection within a correlated group.
Figure 2.2.4: Influence of norm geometry upon solution selection (the two panels, a and b, compare the l2-norm and l1-norm constraint regions), showing greater likelihood of a sparse solution for the l1-norm.
2.2.5 Elastic Net
Somewhat recently, a linear combination of Ridge regression and LASSO has been
suggested. This technique is known as Elastic
Net [90] and combines the LASSO and Ridge penalties by introducing a tuning parameter $\alpha$, which
ranges from 0 (LASSO) to 1 (Ridge). Elastic Net boasts the retention of desirable features of both models, incorporating shrinkage, selection and a smoother variable
selection process than the discrete, indiscriminate process LASSO utilises. The
Elastic Net squared error and estimator vector may be seen below.
$$SE_{EN} = (y - X\beta)^{T}(y - X\beta) + \lambda\left[(1 - \alpha)\|\beta\|_{1} + \alpha\beta^{T}\beta\right] \tag{19}$$
$$\hat{\beta}_{EN} = (X^{T}X + \alpha\lambda I)^{-1}\left(X^{T}y - \frac{(1 - \alpha)\lambda}{2}s\right) \tag{20}$$
Here $s$ denotes the vector of signs of the elements of $\beta$.
Whilst Elastic Net possesses several desirable properties, there is no guarantee that it will perform the best of the three methods mentioned.
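Because the sign vector in the estimator depends on the solution itself, the Elastic Net estimate is usually obtained iteratively. The following is a minimal coordinate descent sketch, not the method used in this work. It assumes a squared error with separate l1 and l2 penalty weights, for which the hypothetical names `lam_l1` and `lam_l2` stand for the products of $\lambda$ with the two mixing weights; the synthetic data are also hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Scalar soft-thresholding: shrink z towards zero by t."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def elastic_net_cd(X, y, lam_l1, lam_l2, n_sweeps=500):
    """Coordinate descent for the objective
    (y - X b)^T (y - X b) + lam_l1 * sum_j |b_j| + lam_l2 * b^T b.
    lam_l1 and lam_l2 are hypothetical names for the l1 and l2 weights."""
    n_features = X.shape[1]
    beta = np.zeros(n_features)
    col_sq = np.sum(X ** 2, axis=0)  # x_j^T x_j for every column
    for _ in range(n_sweeps):
        for j in range(n_features):
            # Residual with the contribution of feature j removed.
            partial_resid = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_resid
            beta[j] = soft_threshold(rho, lam_l1 / 2.0) / (col_sq[j] + lam_l2)
    return beta

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 4))
y = X @ np.array([1.5, 0.0, -2.0, 0.0]) + 0.1 * rng.normal(size=60)

beta_ridge_like = elastic_net_cd(X, y, lam_l1=0.0, lam_l2=5.0)   # l2 weight only
beta_lasso_like = elastic_net_cd(X, y, lam_l1=20.0, lam_l2=0.0)  # l1 weight only
beta_mixed = elastic_net_cd(X, y, lam_l1=10.0, lam_l2=2.5)
print(beta_ridge_like, beta_lasso_like, beta_mixed)
```

Setting the l1 weight to zero recovers the Ridge closed form, and setting the l2 weight to zero gives a LASSO fit, so the two earlier methods appear as the limiting cases of the same update.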