FIGURE 3.13 Linear regression in two and three dimensions

for examples in the class and 0 for all of the others. Since classification can be replaced by regression using these methods, we’ll think about regression here.

The only real difference between the Perceptron and more statistical approaches is in the way that the problem is set up. For regression we are making a prediction about an unknown value y (such as the indicator variable for classes or a future value of some data) by computing some function of known values xi. We are thinking about straight lines, so the output y is going to be a sum of the xi values, each multiplied by a constant parameter:

y =PM

i=0β_ix_i. The βi define a straight line (plane in 3D, hyperplane in higher dimensions) that goes through (or at least near) the datapoints. Figure 3.13 shows this in two and three dimensions.

The question is how we define the line (plane or hyperplane in higher dimensions) that best fits the data. The most common solution is to try to minimise the distance between each datapoint and the line that we fit. We can measure the distance between a point and a line by defining another line that goes through the point and hits the line. School geometry tells us that this second line will be shortest when it hits the line at right angles, and then we can use Pythagoras’ theorem to know the distance. Now, we can try to minimise an error function that measures the sum of all these distances. If we ignore the square roots, and just minimise the sum-of-squares of the errors, then we get the most common minimisation, which is known as least-squares optimisation. What we are doing is choosing the parameters in order to minimise the squared difference between the prediction and the actual data value, summed over all of the datapoints. That is, we have:

j=0

t_j−

i=0

β_ix_ij

!²

. (3.21)

This can be written in matrix form as:

(t − Xβ)^T(t − Xβ), (3.22)

where t is a column vector containing the targets and X is the matrix of input values (even including the bias inputs), just as for the Perceptron. Computing the smallest value of this means differentiating it with respect to the (column) parameter vector β and setting the derivative to 0, which means that X^T(t − Xβ) = 0 (to see this, expand out the brackets, remembering that AB^T = B^TA and note that the term β^TX^tt = t^TXβ since they are

both a scalar term), which has the solution β = (X^TX)⁻¹X^Tt (assuming that the matrix X^TX can be inverted). Now, for a given input vector z, the prediction is zβ. The inverse of a matrix X is the matrix that satisfies XX⁻¹= I, where I is the identity matrix, the matrix that has 1s on the leading diagonal and 0s everywhere else. The inverse of a matrix only exists if the matrix is square (has the same number of rows as columns) and its determinant is non-zero.

Computing this is very simple in Python, using the np.linalg.inv() function in NumPy. In fact, the entire function can be written as (where the 'symbol denotes a linebreak in the text, so that the command continues on the next line):

def linreg(inputs,targets):

inputs = np.concatenate((inputs,-np.ones((np.shape(inputs)[0],1))),' axis=1)

beta = np.dot(np.dot(np.linalg.inv(np.dot(np.transpose(inputs),' inputs)),np.transpose(inputs)),targets)

outputs = np.dot(inputs,beta)

3.5.1 Linear Regression Examples

Using the linear regressor on the logical OR function seems a rather strange thing to do, since we are performing classification using a method designed explicitly for regression, trying to fit a surface to a set of 0 and 1 points. Worse, we will view it as an error if we get say 1.25 and the output should be 1, so points that are in some sense too correct will receive a penalty! However, we can do it, and it gives the following outputs:

[[ 0.25]

[ 0.75]

[ 1.25]]

It might not be clear what this means, but if we threshold the outputs by setting every value less than 0.5 to 0 and every value above 0.5 to 1, then we get the correct answer. Using it on the XOR function shows that this is still a linear method:

[[ 0.5]

[ 0.5]

[ 0.5]]

A better test of linear regression is to find a real regression dataset. The UCI database is useful here, as well. We will look at the auto-mpg dataset. This consists of a collection of a number of datapoints about certain cars (weight, horsepower, etc.), with the aim being to predict the fuel efficiency in miles per gallon (mpg). This dataset has one problem. There are

missing values in it (labelled with question marks ‘?’). The np.loadtxt() method doesn’t like these, and we don’t know what to do with them, anyway, so after downloading the dataset, manually edit the file and delete all lines where there is a ? in that line. The linear regressor can’t do much with the names of the cars either, but since they appear in quotes (") we will tell np.loadtxt that they are comments, using:

auto = np.loadtxt(’/Users/srmarsla/Book/Datasets/auto-mpg/auto-mpg.data.txt’,' comments=’"’)

You should now separate the data into training and testing sets, and then use the training set to recover the β vector. Then you use that to get the predicted values on the test set.

However, the confusion matrix isn’t much use now, since there are no classes to enable us to analyse the results. Instead, we will use the sum-of-squares error, which consists of computing the difference between the prediction and the true value, squaring them so that they are all positive, and then adding them up, as is used in the definition of the linear regressor. Obviously, small values of this measure are good. It can be computed using:

beta = linreg.linreg(trainin,traintgt)

testin = np.concatenate((testin,-np.ones((np.shape(testin)[0],1))),axis=1) testout = np.dot(testin,beta)

error = np.sum((testout - testtgt)**2)

Now you can test out whether normalising the data helps, and perform feature selection as we did for the Perceptron. There are other more advanced linear statistical methods.

One of them, Linear Discriminant Analysis, will be considered in Section 6.1 once we have built up the understanding we need.

PRACTICE QUESTIONS

Problem 3.1 Consider a neuron with 2 inputs, 1 output, and a threshold activation func-tion. If the two weights are w1= 1 and w2= 1, and the bias is b = −1.5, then what is the output for input (0, 0)? What about for inputs (1, 0), (0, 1), and (1, 1)?

Draw the discriminant function for this function, and write down its equation. Does it correspond to any particular logic gate?

Problem 3.2 Work out the Perceptrons that construct logical NOT, NAND, and NOR of their inputs.

Problem 3.3 The parity problem returns 1 if the number of inputs that are 1 is even, and 0 otherwise. Can a Perceptron learn this problem for 3 inputs? Design the network and try it.

Problem 3.4 Test out both the Perceptron and linear regressor code from the website on the parity problem.

Problem 3.5 The Perceptron code on the website is a batch update algorithm, where the whole of the dataset is fed in to find the errors, and then the weights are updated afterwards, as is discussed in Section 3.3.5. Convert the code to run as sequential updates and then compare the results of using the two versions.

Problem 3.6 Try to think of some interesting image processing tasks that cannot be per-formed by a Perceptron. (Hint: You need to think of tasks where looking at individual pixels isn’t enough to allow classification.)

Problem 3.7 The decision boundary hyperplane found by the Perceptron has equation y(x) = w^Tx + b = 0. For a point x⁰, minimise kx − x⁰k² to show that the shortest distance from the point to the hyperplane is |y(x⁰)|/kwk.

Problem 3.8 There is a link to a very large dataset of handwritten figures on the book website (the MNIST dataset). Download it and use a Perceptron to learn about the dataset.

Problem 3.9 For the prostate data available via the website, use both the Perceptron and logistic regressor and compare the results.

Problem 3.10 In the Perceptron Convergence Theorem proof we assumed that kxk ≤ 1.

Modify the proof so that it only assumes that kxk ≤ R for some constant R.

4

In document Machine Learning.pdf (Page 86-92)

FIGURE 3.13 Linear regression in two and three dimensions

3.5.1 Linear Regression Examples

FURTHER READING

PRACTICE QUESTIONS

4