
2.2 Technical Background: Regression Methods

2.2.3 Ridge Regression

Ridge regression [83,84] utilises the first approach to dealing with the undesirable inflation of the OLS estimator $\hat{\boldsymbol{\beta}}$ due to the ill-conditioning of $\mathbf{X}^T\mathbf{X}$. The derivation of the OLS regression coefficient vector follows from identifying the vector which minimises the squared error of the OLS relationship. Ridge regression instead identifies the vector which minimises the following relationship.

$$\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda\|\boldsymbol{\beta}\|_2^2 \tag{13}$$

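where $\lambda$ is introduced as a parameter to control the length of $\boldsymbol{\beta}$; the second term is therefore a penalty based on the Euclidean norm, or l2-norm, of $\boldsymbol{\beta}$. The derivation of the Ridge estimator $\hat{\boldsymbol{\beta}}_{\text{ridge}}$ is as follows.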

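$$S(\boldsymbol{\beta}) = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + \lambda\boldsymbol{\beta}^T\boldsymbol{\beta} \tag{14}$$

$$\begin{aligned}
\frac{\partial S}{\partial \boldsymbol{\beta}} &= \frac{\partial}{\partial \boldsymbol{\beta}}\left[\mathbf{y}^T\mathbf{y} - \mathbf{y}^T\mathbf{X}\boldsymbol{\beta} - \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y} + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} + \lambda\boldsymbol{\beta}^T\boldsymbol{\beta}\right] \\
&= -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} + 2\lambda\boldsymbol{\beta} = \mathbf{0} \\
\mathbf{X}^T\mathbf{y} &= \mathbf{X}^T\mathbf{X}\boldsymbol{\beta} + \lambda\boldsymbol{\beta} \\
&= (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})\boldsymbol{\beta} \\
\hat{\boldsymbol{\beta}}_{\text{ridge}} &= (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}
\end{aligned} \tag{15}$$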

This estimator does not experience the same length inflation as the OLS estimator when $\mathbf{X}^T\mathbf{X}$ is ill-conditioned, but it achieves this at the cost of some bias. The larger $\lambda$ is, the smaller the Euclidean length of the resulting estimator vector. A simplified interpretation of the process of Ridge regression can be seen in Figure 2.2.1.
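The shrinking effect described above can be sketched numerically. The following is a minimal illustration, assuming the standard closed form $(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$ for the Ridge estimator; the function name `ridge_estimator` and the synthetic, nearly collinear data are hypothetical and chosen only to make $\mathbf{X}^T\mathbf{X}$ ill-conditioned.

```python
import numpy as np

def ridge_estimator(X, y, lam):
    """Closed-form Ridge estimate: solve (X^T X + lam * I) beta = X^T y."""
    n_features = X.shape[1]
    # Solve the linear system instead of forming the inverse explicitly.
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Hypothetical data: the second column is almost a copy of the first,
# so X^T X is ill-conditioned.
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = x1 + 1e-3 * rng.normal(size=50)
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.normal(size=50)

# lam = 0 corresponds to the OLS estimator.
lambdas = [0.0, 0.1, 1.0, 10.0]
norms = [np.linalg.norm(ridge_estimator(X, y, lam)) for lam in lambdas]
for lam, norm in zip(lambdas, norms):
    print(f"lambda={lam:5.1f}  Euclidean length of estimator={norm:.3f}")
```

In this sketch the Euclidean length of the estimator decreases as the penalty parameter grows, which is the behaviour the text attributes to the Ridge penalty. The amount of bias introduced is not measured here.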

In the figure, the OLS error function is depicted as a simple parabola. Figure 2.2.1 shows how the penalty term affects a very simple squared error function. Adding the squared penalty term shifts the minimum of the function closer to the origin along the $\beta$ axis than that of the original function; as a consequence of controlling the length, the squared error has increased.

Figure 2.2.1: Simplified example of how the Ridge penalty would influence the squared error function.

Figure 2.2.2: Illustration of how the Ridge penalty affects elements of $\boldsymbol{\beta}$.

Figure 2.2.2 shows the projections of the squared error parabolas onto a two-dimensional space that represents two elements of the $\boldsymbol{\beta}$ vector.

The arc connecting the two axes represents another approach to formulating this problem. Rather than finding the minimum of the Ridge parabola (or the central point of the Ridge contours), an equivalent approach is to minimise the squared error subject to the Euclidean length of the $\boldsymbol{\beta}$ vector being restricted to lie within the arc illustrated. This second approach identifies a vector that lies on the edge of the arc and touches the innermost contour of the OLS squared error function. Both approaches yield the same solution; however, they go about it in slightly different ways.

It is worth noting at this stage that a downside of Ridge regression is that it is very unlikely to produce sparse solutions. A sparse solution is one in which many of the components contribute nothing, that is, have coefficients of exactly zero. This can be interpreted visually in Figure 2.2.2: because the constraint region is smooth, the Ridge vector is far more likely to occur at a location that does not lie exactly on one of the axes.

Mention has been made of the l2-norm; in general, an lp-norm is defined by the following equation.

$$\|\boldsymbol{\beta}\|_p = \left(\sum_j |\beta_j|^p\right)^{1/p} \tag{16}$$

Whilst any value of $p$ in equation 16 yields a non-negative measure of the size of $\boldsymbol{\beta}$, the various norms differ in their mathematical properties. The difference between the geometry of the l2 and l1 norms may be seen in Figure 2.2.3. The key difference between the two is that the l1-norm possesses corners whilst the l2-norm does not; it will become apparent shortly why this is important.
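As a small numerical illustration of the general norm of equation 16, the sketch below evaluates the l1 and l2 norms of the same vector. It assumes the standard definition with the $1/p$ root, which may differ from the exact notation of the original equation; the helper name `lp_norm` and the example vector are hypothetical.

```python
import numpy as np

def lp_norm(beta, p):
    """l_p norm of a vector: (sum_j |beta_j|^p)^(1/p), for p >= 1."""
    beta = np.asarray(beta, dtype=float)
    return float(np.sum(np.abs(beta) ** p) ** (1.0 / p))

beta = np.array([3.0, -4.0])   # hypothetical coefficient vector
l1 = lp_norm(beta, 1)          # |3| + |-4| = 7
l2 = lp_norm(beta, 2)          # sqrt(3^2 + 4^2) = 5
print(l1, l2)
```

The same vector has a different length under each norm, which is why the l1 and l2 constraint regions of equal radius have different shapes.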

2.2.4 Least Absolute Shrinkage and Selection Operator

The second method to incorporate regression vector length control is known as the least absolute shrinkage and selection operator (LASSO) [85]. LASSO controls the length of the regression vector in a fashion very similar to Ridge, with the key difference being

that the length is accounted for using the l1-norm rather than the l2-norm. This is a

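subtle but important difference, as will be illustrated. It has been mentioned that Ridge does not successfully induce sparse solutions; as a consequence, all elements of the estimator vector are retained in the penalised solution, but they are all scaled to fit. The scaling affects large-magnitude elements more than their smaller counterparts, which smooths out the contributions of the variables and reduces the interpretability of the final solution. Alongside reduced interpretability, it may also outright reduce the performance of the solution.

Figure 2.2.3: Difference between l2 and l1 norm geometries in two dimensions.

The LASSO squared error function may be seen below, followed by the derivation of the regression coefficient vector $\hat{\boldsymbol{\beta}}_{\text{lasso}}$.

$$S(\boldsymbol{\beta}) = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + \lambda|\boldsymbol{\beta}| \tag{17}$$

$$\begin{aligned}
\frac{\partial S}{\partial \boldsymbol{\beta}} &= -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} + \frac{\partial}{\partial \boldsymbol{\beta}}\left(\lambda|\boldsymbol{\beta}|\right) \\
&= -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} + \lambda\mathbf{s} = \mathbf{0}, \qquad s_j = \operatorname{sign}(\beta_j) = \begin{cases} -1, & \beta_j < 0 \\ \phantom{-}1, & \beta_j > 0 \end{cases} \\
\hat{\boldsymbol{\beta}}_{\text{lasso}} &= (\mathbf{X}^T\mathbf{X})^{-1}\left(\mathbf{X}^T\mathbf{y} - \frac{\lambda}{2}\mathbf{s}\right)
\end{aligned} \tag{18}$$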

As $\lambda$ increases the LASSO estimator vector shrinks. In the simplest case of an orthonormal design ($\mathbf{X}^T\mathbf{X} = \mathbf{I}$), if

$$|\mathbf{x}_j^T\mathbf{y}| < \frac{\lambda}{2}$$

the corresponding coefficient shrinks to zero; larger coefficients are more resistant and are retained longer. It was mentioned earlier that the geometry of the l1-norm is such that it possesses corners. These corners increase the likelihood of a sparse solution, as a contour of the squared error function has a greater chance of touching a corner than a straight edge of the region. Visually, this may be seen in the simple example of Figure 2.2.4.
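The way small coefficients are set exactly to zero can be sketched for the special case of an orthonormal design, where the LASSO solution reduces to soft thresholding. This is a minimal illustration rather than a general LASSO solver: it assumes $\mathbf{X}^T\mathbf{X} = \mathbf{I}$ and a penalty of the form $\lambda|\boldsymbol{\beta}|$ added to the unscaled squared error, so that the threshold is $\lambda/2$. The function name `lasso_orthonormal` and the data are hypothetical.

```python
import numpy as np

def lasso_orthonormal(X, y, lam):
    """LASSO estimate for a design with orthonormal columns (X^T X = I).

    Assumes the objective (y - X b)^T (y - X b) + lam * sum_j |b_j|,
    for which each coefficient is soft-thresholded at lam / 2.
    """
    z = X.T @ y   # equals the OLS coefficients when X^T X = I
    return np.sign(z) * np.maximum(np.abs(z) - lam / 2.0, 0.0)

# Hypothetical orthonormal design: the first three columns of an identity matrix.
X = np.eye(5)[:, :3]
y = np.array([3.0, -0.4, 1.0, 0.2, -0.1])

beta_small = lasso_orthonormal(X, y, lam=0.5)   # threshold 0.25
beta_large = lasso_orthonormal(X, y, lam=1.0)   # threshold 0.5
print(beta_small)
print(beta_large)
```

With the larger penalty the middle coefficient, whose magnitude of 0.4 lies below the threshold of 0.5, is removed entirely, whilst the two larger coefficients are only reduced in magnitude.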

The sparsity of the LASSO estimator is a desirable trait, as it allows for the identification and retention of important features of the data set whilst removing those of little contribution. The results of the LASSO regression method then provide greater insight into the relevant information of the initial data set. However, the LASSO algorithm selects single variables within a highly correlated region of variables [90], and does so indiscriminately. This indiscriminate variable selection is undesirable, as it reduces the insight one can gain from inspecting the selected variables and may lead to poor selection within a correlated group.

Figure 2.2.4: Influence of norm geometry (panels a and b compare the l2-norm and the l1-norm) upon solution selection, showing the greater likelihood of a sparse solution for the l1-norm.

2.2.5 Elastic Net

Somewhat recently, a linear combination of Ridge regression and LASSO has been suggested. This technique is known as Elastic Net [90]; it introduces a tuning parameter, $\alpha$, which ranges from 0 (LASSO) to 1 (Ridge) in the formulation of equation 19. Elastic Net boasts the retention of desirable features of both models, incorporating shrinkage, selection, and a smoother variable selection process than the discrete, indiscriminate process LASSO utilises. The Elastic Net squared error and estimator vector may be seen below.

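$$S(\boldsymbol{\beta}) = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + \lambda\left[(1 - \alpha)|\boldsymbol{\beta}| + \alpha\boldsymbol{\beta}^T\boldsymbol{\beta}\right] \tag{19}$$

$$\hat{\boldsymbol{\beta}}_{\text{EN}} = (\mathbf{X}^T\mathbf{X} + \lambda\alpha\mathbf{I})^{-1}\left(\mathbf{X}^T\mathbf{y} - \frac{\lambda(1 - \alpha)}{2}\mathbf{s}\right) \tag{20}$$

where $\mathbf{s}$ denotes the vector of signs of the elements of $\boldsymbol{\beta}$.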
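The interplay of the two penalties can be sketched for an orthonormal design, where the Elastic Net solution has a simple closed form. The sketch follows the placement of the tuning parameter in equations 19 and 20, in which the factor (1 − α) multiplies the l1 term, so that α = 0 reproduces the LASSO solution and α = 1 the Ridge solution here. The function name `elastic_net_orthonormal` and the values of `z` are hypothetical, and the orthonormality of the design is an assumption made only to keep the example short.

```python
import numpy as np

def elastic_net_orthonormal(z, lam, alpha):
    """Elastic Net estimate for an orthonormal design, with z = X^T y.

    Assumed penalty: lam * ((1 - alpha) * sum_j |b_j| + alpha * b^T b).
    The l1 part soft-thresholds z at lam * (1 - alpha) / 2 and the
    l2 part rescales the result by 1 / (1 + lam * alpha).
    """
    threshold = lam * (1.0 - alpha) / 2.0
    shrunk = np.sign(z) * np.maximum(np.abs(z) - threshold, 0.0)
    return shrunk / (1.0 + lam * alpha)

z = np.array([3.0, -0.4, 1.0])   # hypothetical values of X^T y
lam = 1.0
lasso_like = elastic_net_orthonormal(z, lam, alpha=0.0)   # pure l1 penalty
ridge_like = elastic_net_orthonormal(z, lam, alpha=1.0)   # pure l2 penalty
mixed = elastic_net_orthonormal(z, lam, alpha=0.5)        # equal mixture
print(lasso_like, ridge_like, mixed)
```

At the LASSO end the small coefficient is removed, at the Ridge end every coefficient is retained but scaled, and the mixture both thresholds and rescales.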

Whilst Elastic Net possesses several desirable properties, there is no guarantee that it will perform the best of the three methods mentioned.

In document ATR FTIR chemometrics for biological samples : a thesis presented in partial fulfilment of the requirements for the degree of Master of Science in Nanoscience at Massey University, Manawatū, New Zealand (Page 61-68)