Ridge regression - Shrinking coefficients to understand our data

Predicting numeric values: regression

8.4 Shrinking coefficients to understand our data

8.4.1 Ridge regression

Ridge regression adds an additional matrix $\lambda I$ to the matrix $X^T X$ so that it’s non-singular, and we can take the inverse of the whole thing: $X^T X + \lambda I$. The matrix $I$ is an n×n identity matrix, where n is the number of features (the number of columns of $X$), with 1s on the diagonal and 0s elsewhere. The symbol $\lambda$ is a user-defined scalar value, which we’ll discuss shortly. The formula for estimating our coefficients is now

$$\hat{w} = (X^T X + \lambda I)^{-1} X^T y$$

Ridge regression was originally developed to deal with the problem of having more features than data points. But it can also be used to add bias into our estimations, giving us a better estimate. We can use the λ value to impose a maximum value on the sum of the squares of all our ws. By imposing this penalty, we can decrease unimportant parameters. This decreasing is known as shrinkage in statistics.
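To make the "more features than data points" case concrete, here is a minimal sketch of the estimate $(X^T X + \lambda I)^{-1} X^T y$ using plain NumPy arrays. The function name ridge_weights and the random data are assumptions made for illustration; they are not part of regression.py.

```python
import numpy as np

def ridge_weights(X, y, lam=0.2):
    """Ridge estimate: solve (X^T X + lam * I) w = X^T y for w."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

# Illustrative data (an assumption): 3 data points but 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))
y = rng.normal(size=3)

# X^T X is 5x5 but its rank cannot exceed 3, so it has no inverse.
rank_xtx = np.linalg.matrix_rank(X.T @ X)

# Adding lam * I makes the system solvable anyway.
w = ridge_weights(X, y, lam=0.2)
```

With λ = 0 the same system would have no unique solution, which is why ordinary least squares breaks down here.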

Shrinkage methods allow us to throw out unimportant parameters so that we can get a better feel and human understanding of the data. Additionally, shrinkage can give us a better prediction value than linear regression.

We choose λ to minimize prediction error. This is similar to other parameter-selection methods we used in the chapters on classification. We take some of our data, set it aside for testing, and then use the remaining data to determine the ws. We then test this model against our test data and measure its performance. This is repeated with different λ values until we find a λ that minimizes prediction error.
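The selection procedure just described can be sketched as follows. This is an illustration under stated assumptions, not the book's code: the helper names (ridge_weights, holdout_error, pick_lambda), the synthetic data, and the 40/20 split are all hypothetical, and the data is generated without an intercept so no centering is needed.

```python
import numpy as np

def ridge_weights(X, y, lam):
    """Ridge estimate: solve (X^T X + lam * I) w = X^T y for w."""
    A = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)

def holdout_error(X_train, y_train, X_test, y_test, lam):
    """Fit on the training split, return mean squared error on the test split."""
    w = ridge_weights(X_train, y_train, lam)
    residuals = X_test @ w - y_test
    return float(np.mean(residuals ** 2))

def pick_lambda(X_train, y_train, X_test, y_test, lambdas):
    """Try every candidate lambda and keep the one with the lowest test error."""
    errors = [holdout_error(X_train, y_train, X_test, y_test, lam)
              for lam in lambdas]
    best = int(np.argmin(errors))
    return lambdas[best], errors[best]

# Synthetic data (an assumption): two of the four features are irrelevant.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 4))
y = X @ np.array([2.0, 0.0, -1.0, 0.0]) + rng.normal(scale=0.5, size=60)

# Candidate values spaced exponentially.
lambdas = [float(np.exp(i - 10)) for i in range(30)]
best_lam, best_err = pick_lambda(X[:40], y[:40], X[40:], y[40:], lambdas)
```

A single held-out split gives a noisy estimate; repeating the split several times and averaging the errors is the usual refinement.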

Let’s see this in action. First, open regression.py and add the code from the following listing.

Listing 8.3 Ridge regression

    # Assumes regression.py already begins with: from numpy import *

    def ridgeRegres(xMat, yMat, lam=0.2):
        xTx = xMat.T * xMat
        # Add the ridge term lam*I; I has one row per feature
        denom = xTx + eye(shape(xMat)[1]) * lam
        if linalg.det(denom) == 0.0:
            print("This matrix is singular, cannot do inverse")
            return
        ws = denom.I * (xMat.T * yMat)
        return ws

    def ridgeTest(xArr, yArr):
        xMat = mat(xArr); yMat = mat(yArr).T
        # Normalization code: center y, then center and scale each feature
        yMean = mean(yMat, 0)
        yMat = yMat - yMean
        xMeans = mean(xMat, 0)
        xVar = var(xMat, 0)
        xMat = (xMat - xMeans) / xVar
        # Fit once for each of 30 exponentially spaced lambda values
        numTestPts = 30
        wMat = zeros((numTestPts, shape(xMat)[1]))
        for i in range(numTestPts):
            ws = ridgeRegres(xMat, yMat, exp(i - 10))
            wMat[i, :] = ws.T
        return wMat

What is the ridge in ridge regression?

Ridge regression uses the identity matrix multiplied by some constant λ. If you look at I (the identity matrix), you’ll see that there are 1s across the diagonal and 0s elsewhere. This ridge of 1s in a plane of 0s gives you the ridge in ridge regression.

The code in listing 8.3 contains two functions: one to calculate weights, ridgeRegres(), and one to test this over a number of lambda values, ridgeTest().

The first function, ridgeRegres(), implements ridge regression for any given value of lambda. If no value is given, lambda defaults to 0.2. Because lambda is a reserved keyword in Python, the code uses the variable lam instead. You first construct the matrix $X^T X$. Next, you add on the ridge term: the identity matrix multiplied by our scalar lam. The identity matrix is created by the NumPy function eye(). Ridge regression should work on datasets that would give an error with regular regression, so you shouldn’t need to check to see if the determinant is zero, right? Someone could enter 0 for lambda and you’d have a problem, so you put in a check. If the matrix isn’t singular, the last thing the code does is calculate the weights and return them.
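The check in ridgeRegres() compares the determinant with exactly 0.0, which floating-point rounding can defeat for matrices that are singular only up to rounding error. One possible alternative, offered here as a sketch and not as the book's method, is to test the condition number instead. The function name is_safely_invertible and the threshold 1e12 are arbitrary assumptions.

```python
import numpy as np

def is_safely_invertible(X, lam, max_cond=1e12):
    """Return True if X^T X + lam * I has a condition number below max_cond."""
    denom = X.T @ X + lam * np.eye(X.shape[1])
    return bool(np.linalg.cond(denom) < max_cond)

# Two identical columns make X^T X singular when lam is 0.
X = np.array([[1.0, 1.0],
              [2.0, 2.0],
              [3.0, 3.0]])

ok_without_ridge = is_safely_invertible(X, 0.0)  # lam = 0: no ridge term
ok_with_ridge = is_safely_invertible(X, 0.2)     # default lam from the listing
```

The second call succeeds because adding 0.2 to the diagonal lifts the smallest eigenvalue from 0 to 0.2.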

To use ridge regression and all shrinkage methods, you need to first normalize your features. If you read chapter 2, you’ll remember that we normalized our data to give each feature equal importance regardless of the units it was measured in. The second function in listing 8.3, ridgeTest(), shows an example of how to normalize the data. This is done by subtracting off the mean from each feature and dividing by the variance.
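As a small illustration of the normalization step, the sketch below applies the listing's scaling (divide by the variance) next to the more common standardization (divide by the standard deviation). The toy array is an assumption chosen so the two features sit on very different scales.

```python
import numpy as np

# Toy data (an assumption): the second feature is 200 times the first.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0],
              [4.0, 800.0]])

centered = X - X.mean(axis=0)

# What ridgeTest() in the listing does: divide by the variance.
by_var = centered / X.var(axis=0)

# A common alternative: divide by the standard deviation,
# which leaves every feature with variance exactly 1.
by_std = centered / X.std(axis=0)
```

Both versions center each feature at zero. Only the standard-deviation version gives unit variance; dividing by the variance leaves each feature with variance 1/var, so features are still on different scales, though much closer than before.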

After the normalization is done, you call ridgeRegres() with 30 different lambda values. The values vary exponentially, from exp(-10) to exp(19), so that you can see how very small values of lambda and very large values impact your results. The weights are packed into a matrix, one row per lambda value, and returned.

Let’s see this in action on our abalone dataset.

>>> reload(regression)

>>> abX,abY=regression.loadDataSet('abalone.txt')

>>> ridgeWeights=regression.ridgeTest(abX,abY)

We now have the weights for 30 different values of lambda. Let’s see what these look like. To plot them out, enter the following commands in your Python shell:

>>> import matplotlib.pyplot as plt

>>> fig = plt.figure()

>>> ax = fig.add_subplot(111)

>>> ax.plot(ridgeWeights)

>>> plt.show()

You should see a plot similar to figure 8.6. In figure 8.6 you can see the regression coefficients plotted versus log(λ). Because ridgeTest() uses λ = exp(i-10) for row i of the weight matrix, the horizontal axis produced by ax.plot(ridgeWeights) is the row index i, which equals log(λ) + 10. On the very left, where λ is the smallest, you have the full values of our coefficients, which are the same as linear regression. On the right side, the coefficients have all shrunk to nearly zero. Somewhere in the middle, you have some coefficient values that will give you better prediction results. To find satisfactory answers, you’d need to do cross-validation testing. A plot like the one in figure 8.6 also tells you which variables are most descriptive in predicting your output, by the magnitude of these coefficients.

Figure 8.6 Regression coefficient values while using ridge regression. For very small values of λ the coefficients are the same as regular regression, whereas for very large values of λ the regression coefficients shrink to 0. Somewhere in between these two extremes, you can find values that allow you to make better predictions.
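The two extremes described for figure 8.6 can also be checked numerically without plotting. The sketch below uses synthetic data and a hypothetical helper, ridge_weights, both assumptions for illustration; it does not use the abalone dataset.

```python
import numpy as np

def ridge_weights(X, y, lam):
    """Ridge estimate: solve (X^T X + lam * I) w = X^T y for w."""
    A = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)

# Synthetic data (an assumption): three features with known weights.
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

# Same 30 exponentially spaced lambda values as ridgeTest().
lambdas = np.exp(np.arange(30) - 10)
W = np.array([ridge_weights(X, y, lam) for lam in lambdas])

# Length of the weight vector for each lambda, smallest lambda first.
norms = np.linalg.norm(W, axis=1)

# Ordinary least squares solution for comparison.
ols = np.linalg.lstsq(X, y, rcond=None)[0]
```

The first row of W should match ols closely, the last row should be close to zero, and norms should never increase from one row to the next.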

There are other shrinkage methods such as the lasso, LAR, PCA regression, and subset selection. Like ridge regression, these methods can be used to improve prediction accuracy and your ability to interpret regression coefficients. We’ll now talk about a method called the lasso.

In document Machine Learning in Action (Page 191-194)