FIGURE 5.8 Using cubic functions to connect the points gives an even better approximation, and the curve is also continuous at the points where the sections join up (known as knotpoints).

Of course, there is no reason why the functions should be linear at all—if we use cubic functions (i.e., polynomials with x³, x², x and constant components) to approximate each piece of data, then we can get results like those shown in Figure 5.8. We can continue to make the functions more complicated, with the important point being how many degrees of continuity we require at the boundaries between the points. These functions are known as splines, and the most common one to use is the cubic spline. To reach the stage where we can understand it, we need to go back and think about some theory.

5.3.1 Bases and Basis Expansion

Radial basis functions and several other machine learning algorithms can be written in this form:

$$f(x) = \sum_{i=1}^{n} \alpha_i \Phi_i(x),$$

where each of the n functions Φᵢ(x) is some function of the input value x and the αᵢ are the parameters we can solve for in order to make the model fit the data. We will consider scalar inputs x rather than vector inputs in what follows. The Φᵢ(x) are known as basis functions; they are the part of the model that we choose in advance, rather than parameters that are fitted. The first thing we need to think about is where each Φᵢ is defined. Looking at the third graph of Figure 5.5 we see that the first function should only be defined between 0 and x₁, the next between x₁ and x₂, and so on. These points xᵢ are called knotpoints and they are generally evenly spaced, but choosing how many of them there should be is not necessarily easy. The more knotpoints there are, the more complex the model can be, in which case the model is more likely to overfit and needs more training data, just like the neural networks that we have seen.
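As a minimal sketch of the weighted-sum form described here, the following evaluates a basis expansion for scalar inputs. The helper name `basis_expansion` and the particular basis (a constant and a linear function) are illustrative assumptions, not anything defined in the text.

```python
import numpy as np

def basis_expansion(x, alphas, basis_functions):
    """Evaluate f(x) = sum_i alpha_i * Phi_i(x) for an array of scalar inputs."""
    x = np.asarray(x, dtype=float)
    # One column per basis function, one row per input value
    Phi = np.column_stack([phi(x) for phi in basis_functions])
    # The model output is the weighted sum of the columns
    return Phi @ np.asarray(alphas, dtype=float)

# Hypothetical basis: Phi_1(x) = 1 and Phi_2(x) = x
basis = [lambda x: np.ones_like(x), lambda x: x]
f = basis_expansion([0.0, 1.0, 2.0], alphas=[1.0, 0.5], basis_functions=basis)
print(f)  # 1 + 0.5 * x at each input
```

Because the output is a matrix-vector product, the model is linear in the αᵢ whatever shape the individual basis functions take.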

We can choose the Φᵢ in any way we like. Suppose that we simply use a constant function Φ(x) = 1. Now the model would have value α₁ to the left of x₁, value α₂ between x₁ and x₂, etc. So depending upon how we fit the spline model to the data, the model will have different values, but it will certainly be constant in each region. This is sufficient to make the straight line approximation shown at the bottom of Figure 5.5. However, we might decide that a constant value is not enough, and we use a function that varies linearly (a linear function that has value Φ(x) = x within the region). In this case, we can make Figure 5.6, where each point is represented by a straight line that is not necessarily horizontal. This represents the line close to each point fairly well, but looks messy because the line segments do not meet up.
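The piecewise-constant and piecewise-linear models can be sketched with one indicator column per region. The data (a noisy sine), the two knotpoints, and the helper `piecewise_design` are all assumptions made for illustration; the fit uses `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 3.0, 90)
y = np.sin(2.0 * x) + rng.normal(0.0, 0.1, x.size)  # hypothetical noisy data

knots = np.array([1.0, 2.0])      # two knotpoints give three regions
region = np.digitize(x, knots)    # region index (0, 1 or 2) of each input

def piecewise_design(x, region, n_regions, linear=False):
    """One constant (and optionally one linear) basis function per region."""
    indicators = (region[:, None] == np.arange(n_regions)).astype(float)
    if not linear:
        return indicators
    # Constant columns first, then the linear columns, region by region
    return np.hstack([indicators, indicators * x[:, None]])

A_const = piecewise_design(x, region, 3)
A_lin = piecewise_design(x, region, 3, linear=True)

alpha_const, *_ = np.linalg.lstsq(A_const, y, rcond=None)
alpha_lin, *_ = np.linalg.lstsq(A_lin, y, rcond=None)

# Value of the piecewise-linear fit on either side of the first knotpoint
k = knots[0]
left = alpha_lin[0] + alpha_lin[3] * k
right = alpha_lin[1] + alpha_lin[4] * k
print(left, right)  # nothing forces these two values to agree
```

With constant basis functions, each fitted αᵢ is simply the mean of the targets in its region. With the linear ones, each region is fitted independently, which is why the segments need not meet.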

The question then is how to extend the model to include matching at the knotpoints, where one line segment stops and the next one starts. In fact, this is easy. We just insist that the αᵢ have to be chosen so that at the knotpoint the value of f(x₁) is the same whether we come from the left of x₁ or the right. These are often written as f(x₁⁻) and f(x₁⁺). Now we just need to work out which α values are involved in the x₁ knotpoint from each side. There are going to be four of them: two for the constant part, and two for the linear part. The ones connected with the constant are obvious: α₁ and α₂. Now suppose that the linear ones are α₁₁ and α₁₂ (which would mean that there were 10 regions and therefore 9 knotpoints, since then α₁ … α₁₀ correspond to the constant functions for each region). In that case, f(x₁⁻) = α₁ + x₁α₁₁ and f(x₁⁺) = α₂ + x₁α₁₂. Requiring these two values to be equal is an extra constraint that we will need to include when we solve for the values of the αᵢ.
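Written out with the same indexing as in the paragraph above, the matching condition at x₁ is a single linear equation in the α values. The following only collects the terms on one side:

```latex
\begin{gather*}
f(x_1^-) = \alpha_1 + x_1\,\alpha_{11}, \qquad
f(x_1^+) = \alpha_2 + x_1\,\alpha_{12}, \\
f(x_1^-) = f(x_1^+)
\;\Longleftrightarrow\;
(\alpha_1 - \alpha_2) + x_1\,(\alpha_{11} - \alpha_{12}) = 0 .
\end{gather*}
```

There is one such equation per knotpoint, so the fit becomes a least-squares problem with linear equality constraints.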

There is a simpler way to encode this, which is to add some extra basis functions. As well as Φ₁(x) = 1, Φ₂(x) = x, we add some basis functions that insist that the value is 0 at the boundary with x₁: Φ₃(x) = (x − x₁)₊, and the next with the boundary at x₂: Φ₄(x) = (x − x₂)₊, etc., where (x)₊ = x if x > 0 and 0 otherwise. These functions are sufficient to insist that the knotpoint values are enforced, since one is defined on each knotpoint. This is then enough for us to construct the approximation shown in Figure 5.7.
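A minimal sketch of this basis, assuming hypothetical noisy sine data and two knotpoints. The helper `linear_spline_design` is an illustrative name; it builds the columns 1, x, and (x − xₖ)₊ and fits them with `np.linalg.lstsq`.

```python
import numpy as np

def linear_spline_design(x, knots):
    """Columns: 1, x, then (x - x_k)_+ for each knotpoint x_k."""
    x = np.asarray(x, dtype=float)
    knots = np.asarray(knots, dtype=float)
    # (x)_+ = x if x > 0 and 0 otherwise, applied to x - x_k
    hinge = np.maximum(x[:, None] - knots[None, :], 0.0)
    return np.column_stack([np.ones_like(x), x, hinge])

rng = np.random.default_rng(1)
x = np.linspace(0.0, 3.0, 90)
y = np.sin(2.0 * x) + rng.normal(0.0, 0.1, x.size)  # hypothetical noisy data
knots = [1.0, 2.0]

A = linear_spline_design(x, knots)
alpha, *_ = np.linalg.lstsq(A, y, rcond=None)

# Evaluate the fit just to the left and right of the first knotpoint
eps = 1e-9
f_left, f_right = linear_spline_design([1.0 - eps, 1.0 + eps], knots) @ alpha
print(f_left, f_right)
```

Each extra basis function is zero up to its knotpoint and linear afterwards, so it can change the slope there without introducing a jump. No separate continuity constraint is needed.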

5.3.2 The Cubic Spline

We can carry on adding extra powers of x, but it turns out that the cubic spline is generally sufficient. This starts with four basis functions (Φ₁(x) = 1, Φ₂(x) = x, Φ₃(x) = x², Φ₄(x) = x³), and then adds as many extras as there are knotpoints, each of the form Φ₄₊ᵢ(x) = (x − xᵢ)³₊. This form forces the fitted function, and also its first two derivatives, to be continuous at each knotpoint. Notice that while the Φs are not linear, we are simply adding up a weighted sum of them, and so the model is linear in them. We can then produce curves like Figure 5.8, which represent the data very well.
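The cubic spline basis can be sketched in the same way. The data, the knotpoints, and the helper name `cubic_spline_design` are assumptions for illustration only.

```python
import numpy as np

def cubic_spline_design(x, knots):
    """Columns: 1, x, x^2, x^3, then (x - x_i)^3_+ for each knotpoint x_i."""
    x = np.asarray(x, dtype=float)
    knots = np.asarray(knots, dtype=float)
    powers = np.column_stack([np.ones_like(x), x, x**2, x**3])
    truncated = np.maximum(x[:, None] - knots[None, :], 0.0) ** 3
    return np.hstack([powers, truncated])

rng = np.random.default_rng(2)
x = np.linspace(0.0, 3.0, 90)
y = np.sin(2.0 * x) + rng.normal(0.0, 0.1, x.size)  # hypothetical noisy data
knots = [1.0, 2.0]

A = cubic_spline_design(x, knots)   # 4 + len(knots) columns
alpha, *_ = np.linalg.lstsq(A, y, rcond=None)
fit = A @ alpha
rms = np.sqrt(np.mean((y - fit) ** 2))
print(A.shape, rms)
```

The truncated cubic (x − xᵢ)³₊ and its first two derivatives are all zero at xᵢ, so adding it changes only the third derivative there. This is what keeps the curve smooth at the knotpoints.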

5.3.3 Fitting the Spline to the Data

Having defined the functions, we need to work out how to choose the αᵢ in order to make the model fit the data. We will continue to use the sum-of-squares error and minimise it, which is known in the statistical literature as least-squares fitting, and will be described in more detail in Section 9.2. The important point is that everything is linear in the basis functions, so computing the least-squares fit is a linear problem. As with the MLP, the error that we are trying to minimise is:

$$E(y, f(x)) = \sum_{i=1}^{N} \bigl(y_i - f(x_i)\bigr)^2. \tag{5.4}$$

NumPy already has a function for computing linear least-squares fits: np.linalg.lstsq(). As a simple example of how to use it we will make some noisy data from a couple of Gaussians and then fit the model parameters, whose true values are 2.5 and 3.2. The final line gives the result, which isn't too far from the correct one, and Figure 5.9 shows the results.

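import numpy as np
import matplotlib.pyplot as pl

# x, y (noisy targets) and A (one column per Gaussian) are assumed to be defined already
(p, residuals, rank, s) = np.linalg.lstsq(A, y, rcond=None)
pl.plot(x, y, '.')                            # the noisy data
pl.plot(x, p[0]*A[:,0] + p[1]*A[:,1], 'x')    # the fitted model
p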
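The fragment above relies on x, y and A having been built already. A self-contained sketch of the same idea follows. The Gaussian centres and widths, the noise level, and the random seed are assumptions chosen for illustration, not the values behind Figure 5.9; only the true parameters 2.5 and 3.2 come from the text.

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.arange(-3.0, 10.0, 0.05)

def gaussian(x, centre, width):
    """Unnormalised Gaussian bump; centre and width are illustrative choices."""
    return np.exp(-((x - centre) ** 2) / width)

# Design matrix: one column per Gaussian basis function (assumed shapes)
A = np.column_stack([gaussian(x, 0.0, 9.0), gaussian(x, 5.0, 4.0)])

true_p = np.array([2.5, 3.2])                       # parameters to recover
y = A @ true_p + rng.normal(0.0, 0.2, x.size)       # noisy targets

p, residuals, rank, s = np.linalg.lstsq(A, y, rcond=None)
print(p)  # estimates of the two weights
```

The estimates should land close to 2.5 and 3.2. How close depends on the noise level and on how much the two Gaussians overlap.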

FIGURE 5.9 Using linear least-squares to fit parameters for two Gaussians produces the
