subset each time. The excluded subsets are used as the validation set and the union of all the remaining subsets as the training set.

For each set of parameters you want to validate, train all k models and calculate the mean error across all k models. Finally, you choose the set of parameters giving you the smallest average error.
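The procedure described above can be sketched in plain Scala, without Spark. This is only an illustration under stated assumptions: the object name KFoldSketch, the round-robin fold assignment, and the trainAndValidate callback (which stands in for training a model and returning its validation error) are all hypothetical and are not taken from the book's repository.

```scala
object KFoldSketch {
  // Splits the indices 0 until n into k folds of near-equal size (round-robin assignment).
  def folds(n: Int, k: Int): Seq[Seq[Int]] = {
    require(k >= 2 && k <= n, "need 2 <= k <= n")
    (0 until k).map(f => (0 until n).filter(_ % k == f))
  }

  // Mean validation error across all k folds for one parameter setting.
  // trainAndValidate is a hypothetical callback:
  // (training indices, validation indices) => validation error.
  def meanError(n: Int, k: Int)(trainAndValidate: (Seq[Int], Seq[Int]) => Double): Double = {
    val fs = folds(n, k)
    val errors = fs.indices.map { i =>
      val valid = fs(i)                                        // the excluded fold
      val train = fs.indices.filter(_ != i).flatMap(j => fs(j)) // union of the remaining folds
      trainAndValidate(train, valid)
    }
    errors.sum / k
  }

  // Chooses the parameter setting with the smallest mean error.
  def best[P](params: Seq[P], n: Int, k: Int)(errorFor: P => (Seq[Int], Seq[Int]) => Double): P =
    params.minBy(p => meanError(n, k)(errorFor(p)))
}
```

In a real run, the callback would train a model on the training indices with the given parameters and return its error on the validation indices.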

Why is this important? Because fitting a model depends very much on the training and validation sets used. If you take our housing data set, split it randomly into training and validation sets again, and then go through all the actions we did in this chapter, you will notice that the results and parameters will be different, maybe even dramatically so. K-fold cross-validation can help you decide which of the parameter combinations to choose.

We will have more to say about k-fold cross-validation when we talk about Spark’s new ML Pipeline API in the next chapter.

7.7 Optimizing linear regression

We have a couple more things to say about linear regression optimization. As you saw in earlier examples, LinearRegressionWithSGD (and its parent class GeneralizedLinearAlgorithm) has an optimizer member object you can configure. We previously used the default GradientDescent optimizer and configured it with the number of iterations and the step size.

There are two additional methods you can employ to make linear regression find the minimum of the cost function faster. The first is to configure the GradientDescent optimizer as a mini-batch stochastic gradient descent. The second is to use Spark’s LBFGS optimizer (see section 7.7.2).

7.7.1 Mini-batch stochastic gradient descent

As explained in section 7.4.2, gradient descent updates the weights in each step by going through the whole data set. If you recall, the formula used for updating each weight parameter is this:

$$ w_j := w_j - \gamma \frac{1}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right) x_j^{(i)} $$

This is also called batch gradient descent (or BGD, for short). In contrast, mini-batch stochastic gradient descent uses only a subset of the data in each step: instead of i going from 1 to m (the whole data set), it goes only from 1 to k (where k is some fraction of m). If k is equal to 1, which means the algorithm considers only one example in each step, the optimizer is simply called stochastic gradient descent (SGD).

Mini-batch SGD is much less computationally expensive per step, especially when parallelized, but it needs more iterations to compensate. It has more difficulty converging, but it gets close enough to the minimum (except in some rare cases). If the mini-batch size (k) is small, the algorithm is more stochastic, meaning it takes a more random route toward the cost function minimum. If k is larger, the algorithm is more stable. In both cases, though, it reaches the minimum and can get very close to BGD results.
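To make the difference between BGD and mini-batch SGD concrete, here is a minimal plain-Scala sketch of the update rule. It is not Spark's implementation. The names MiniBatchSketch, Example, step, sample, and train are hypothetical, and the sketch assumes each mini-batch is drawn by keeping every example with probability fraction and that the gradient is averaged over the batch size.

```scala
import scala.util.Random

object MiniBatchSketch {
  // One labeled example: features x (with x(0) = 1.0 for the intercept) and target y.
  final case class Example(x: Array[Double], y: Double)

  // Linear hypothesis h(x) = w . x
  def h(w: Array[Double], x: Array[Double]): Double =
    w.indices.map(j => w(j) * x(j)).sum

  // One update computed on the given batch of size k:
  // w_j := w_j - gamma * (1/k) * sum_i (h(x_i) - y_i) * x_ij
  def step(w: Array[Double], batch: Seq[Example], gamma: Double): Array[Double] = {
    val k = batch.size
    w.indices.map { j =>
      val grad = batch.map(e => (h(w, e.x) - e.y) * e.x(j)).sum / k
      w(j) - gamma * grad
    }.toArray
  }

  // Keeps each example with probability `fraction` (an assumed sampling scheme).
  // With fraction = 1.0 every example is kept, so each step is a BGD step.
  def sample(data: Seq[Example], fraction: Double, rnd: Random): Seq[Example] =
    data.filter(_ => rnd.nextDouble() < fraction)

  def train(data: Seq[Example], iterations: Int, gamma: Double,
            fraction: Double, seed: Long): Array[Double] = {
    val rnd = new Random(seed)
    var w = Array.fill(data.head.x.length)(0.0)
    for (_ <- 1 to iterations) {
      val batch = sample(data, fraction, rnd)
      if (batch.nonEmpty) w = step(w, batch, gamma) // skip empty draws
    }
    w
  }
}
```

Calling train with fraction = 1.0 performs batch gradient descent. A smaller fraction makes each step cheaper but noisier, which is the trade-off described above.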

Let us now see how to use mini-batch SGD in Spark. The same GradientDescent optimizer we used before is used for mini-batch SGD, but you need to specify an additional parameter (miniBatchFraction). miniBatchFraction takes a value between 0 and 1. If it’s equal to 1 (which is the default), mini-batch SGD becomes BGD because the whole data set is considered in each step.

Parameters for mini-batch SGD can be chosen much as we did previously, only now there is one more parameter to configure. A step size that worked for BGD will not necessarily work for mini-batch SGD, so its value has to be chosen in the same way we did before or, preferably, by using k-fold cross-validation.

A good starting point for the mini-batch fraction parameter is 0.1, but it will probably have to be fine-tuned further. The number of iterations can be chosen so that the data set as a whole is iterated about 100 times in total (and sometimes even fewer). For example, if the fraction parameter is 0.1, specifying 1,000 iterations means that each element in the data set is taken into account about 100 times (on average). For performance reasons, in order to balance computation and communication between nodes in the cluster, the mini-batch size (absolute size, not the fraction parameter) must typically be at least two orders of magnitude larger than the number of machines in the cluster [12].
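The arithmetic behind these rules of thumb is simple enough to check in a few lines of Scala. The helper names below (BatchSizing, expectedPasses, expectedBatchSize, batchLargeEnough) are hypothetical, and "two orders of magnitude" is read here as a factor of 100.

```scala
object BatchSizing {
  // Average number of times each example is used: iterations * fraction.
  def expectedPasses(numIterations: Int, miniBatchFraction: Double): Double =
    numIterations * miniBatchFraction

  // Expected absolute mini-batch size for a data set with m examples.
  def expectedBatchSize(m: Long, miniBatchFraction: Double): Double =
    m * miniBatchFraction

  // Rule of thumb from the text: the absolute batch size should be at least
  // 100 times (two orders of magnitude) the number of machines.
  def batchLargeEnough(m: Long, miniBatchFraction: Double, machines: Int): Boolean =
    expectedBatchSize(m, miniBatchFraction) >= 100.0 * machines
}
```

For instance, expectedPasses(1000, 0.1) gives about 100 passes, matching the example in the text. With a made-up data set of 100,000 examples on 10 machines, a fraction of 0.1 yields batches of about 10,000 examples, which satisfies the rule.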

In our online repository, you’ll find the method iterateLRwSGDBatch, which is a variation of iterateLRwSGD with one additional line:

alg.optimizer.setMiniBatchFraction(miniBFraction)

The method’s signature is also different: it takes three arrays. Besides the numbers of iterations and the step sizes, it also takes an array of mini-batch fractions. The method tries all combinations of the three values and prints the results (training and testing RMSE). You can try it out on our data set expanded with feature squares (the trainHPScaled and validHPScaled RDDs). First, to get a feeling for the step size parameter in the context of the other two, execute this command:

iterateLRwSGDBatch(Array(400, 1000), Array(0.05, 0.09, 0.1, 0.15, 0.2, 0.3, 0.35, 0.4, 0.5, 1), Array(0.01, 0.1), trainHPScaled, validHPScaled)
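To see how many models a call like this trains, the combination search can be sketched in plain Scala. This is not the code of iterateLRwSGDBatch. The names GridSketch, combinations, and bestBy are hypothetical, the nesting order is an assumption, and the error function stands in for training a model and measuring its validation error.

```scala
object GridSketch {
  // All (iterations, step size, mini-batch fraction) combinations,
  // in nested-loop order (the order is an assumption).
  def combinations(iters: Array[Int], steps: Array[Double],
                   fractions: Array[Double]): Seq[(Int, Double, Double)] =
    for {
      n <- iters.toSeq
      s <- steps.toSeq
      f <- fractions.toSeq
    } yield (n, s, f)

  // Evaluates a hypothetical error function on every combination
  // and returns the combination with the smallest error.
  def bestBy(combos: Seq[(Int, Double, Double)])
            (error: (Int, Double, Double) => Double): (Int, Double, Double) =
    combos.minBy { case (n, s, f) => error(n, s, f) }
}
```

With the arrays from the command above (2 iteration counts, 10 step sizes, 2 fractions), combinations returns 2 × 10 × 2 = 40 settings, so 40 models are trained and evaluated.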

The results (available online) show that the step size of 0.4 works best. Now let’s use that value and see how the algorithm behaves when we change other parameters:

iterateLRwSGDBatch(Array(400, 1000, 2000, 3000, 5000, 10000), Array(0.4), Array(0.1, 0.2, 0.4, 0.5, 0.6, 0.8), trainHPScaled, validHPScaled)

The results (again, available online) show that 2,000 iterations are enough to get the best RMSE of 3.965, which is even slightly better than our previous best RMSE of 3.966

[12] Chenxin Ma et al., “Adding vs. Averaging in Distributed Primal-Dual Optimization,” www.cs.berkeley.edu/
