Analyzing and preparing the data


column, with methods normL1 and normL2. Variance of each column can be obtained with the variance method.

DEFINITION Variance is a measure of dispersion of a data set and is equal to the average of squared deviations of values from their mean value. Standard deviation is calculated as the square root of variance. Covariance is a measure of how much two variables change relative to each other.

All of this can be very useful when examining data for the first time, especially when deciding whether feature scaling is necessary (described shortly).
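The definitions above can be checked by hand on a small column of numbers. The following is a minimal sketch in plain Scala (no Spark required); the object name DispersionSketch and the sample values are made up for illustration. It divides by the number of values, exactly as the definition states. Libraries often report the sample variance instead (dividing by n – 1), so their results can differ slightly from these.

```scala
object DispersionSketch {
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Variance as defined above: the average of squared deviations from the mean.
  def variance(xs: Seq[Double]): Double = {
    val m = mean(xs)
    xs.map(x => (x - m) * (x - m)).sum / xs.size
  }

  // Standard deviation is the square root of variance.
  def stdDev(xs: Seq[Double]): Double = math.sqrt(variance(xs))

  // Covariance: the average product of the two columns' deviations from their means.
  def covariance(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.size == ys.size, "columns must have the same length")
    val mx = mean(xs)
    val my = mean(ys)
    xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum / xs.size
  }

  def main(args: Array[String]): Unit = {
    val a = Seq(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0) // made-up values, mean 5
    val b = a.map(_ * 2) // moves in the same direction as a
    val c = a.map(-_)    // moves in the opposite direction
    println(variance(a))      // 4.0
    println(stdDev(a))        // 2.0
    println(covariance(a, b)) // 8.0
    println(covariance(a, c)) // -4.0
  }
}
```

Columns whose variances differ by orders of magnitude are the usual sign that feature scaling is worth considering.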

7.4.2 Analyzing column cosine similarities

Understanding column cosine similarities also helps when analyzing data. A column cosine similarity is the cosine of the angle between two columns, viewed as vectors. A similar procedure can be used for other purposes as well (for example, for finding similar products or similar articles).

You obtain column cosine similarities from the RowMatrix object:

val housingColSims = housingMat.columnSimilarities()

PYTHON The columnSimilarities method is not available in Python.

The resulting object is a distributed CoordinateMatrix containing an upper-triangular matrix (upper-triangular matrices contain data only above their diagonal). The value at the i-th row and j-th column of the resulting housingColSims matrix gives a measure of similarity between the i-th column and the j-th column of the housingMat matrix. The values in the housingColSims matrix range from –1 to 1. A value of –1 means the two columns have completely opposite orientations (directions), a value of 0 means they are orthogonal to one another, and a value of 1 means the two columns (vectors) have the same orientation.

The easiest way to see the contents of this matrix is to convert it to a Breeze matrix using our toBreezeD method and then print the output with our utility method printMat, which you can find in our repository listing and which we omit here for brevity. To do this, first paste the printMat method definition into your shell and execute the following:

printMat(toBreezeD(housingColSims))

This will pretty-print the contents of the matrix (you can also find the expected output in our online repository). If you look at the last column of the result, it gives you a measure of how well each dimension in the data set corresponds to the target variable (average price). These are the contents of the last column: 0.224, 0.528, 0.693, 0.307, 0.873, 0.949, 0.803, 0.856, 0.588, 0.789, 0.897, 0.928, 0.670, 0.000. The biggest value here is the sixth value (0.949), which corresponds to the column containing the average number of rooms. Now you can see that it was no coincidence that we chose that exact column for our previous simple linear regression example: it has the strongest similarity with the target value and thus represents the most appropriate candidate for simple linear regression.
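To make the –1, 0, and 1 cases described in this section concrete, here is a minimal sketch of cosine similarity between two small vectors in plain Scala. It illustrates the formula only (the dot product divided by the product of the two vector lengths) and is not how Spark computes columnSimilarities internally; the object name CosineSketch and the vectors are made up for illustration.

```scala
object CosineSketch {
  def dot(a: Seq[Double], b: Seq[Double]): Double = {
    require(a.size == b.size, "vectors must have the same length")
    a.zip(b).map { case (x, y) => x * y }.sum
  }

  // Euclidean length (L2 norm) of a vector.
  def norm(a: Seq[Double]): Double = math.sqrt(dot(a, a))

  // Cosine of the angle between a and b; undefined (NaN) for a zero-length vector.
  def cosine(a: Seq[Double], b: Seq[Double]): Double =
    dot(a, b) / (norm(a) * norm(b))

  def main(args: Array[String]): Unit = {
    println(cosine(Seq(1.0, 2.0, 3.0), Seq(2.0, 4.0, 6.0)))    // same orientation: approximately 1
    println(cosine(Seq(1.0, 0.0), Seq(0.0, 1.0)))              // orthogonal: 0.0
    println(cosine(Seq(1.0, 2.0, 3.0), Seq(-1.0, -2.0, -3.0))) // opposite orientation: approximately -1
  }
}
```

Because the measure depends only on direction, scaling a column by a positive constant doesn't change its similarity to any other column.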

7.4.3 Computing the covariance matrix

Another method for examining similarities between different columns (dimensions) of the input set is the covariance matrix. It’s important in statistics for modeling linear correspondence between variables. In Spark, you compute the covariance matrix similarly to column statistics and column similarities, using the RowMatrix object:

val housingCovar = housingMat.computeCovariance()
printMat(toBreezeM(housingCovar))

PYTHON The computeCovariance method is not available in Python.

The expected output is also available in our online repository. If you spend a moment studying it, you’ll notice that there is a large range of values in the matrix and that some of them are negative and some are positive. You’ll also probably notice that the matrix is symmetric (that is, each (i, j) element is the same as the (j, i) element).

This is because the variance-covariance matrix contains the variance of each column on its diagonal and the covariance of the two corresponding columns in all other positions. If the covariance of two columns is zero, there is no linear relationship between them. Negative values mean that the values in the two columns move in opposite directions from their averages, whereas positive values mean that they move in the same direction.

Spark also offers two other methods for examining correlations between series of data: Spearman’s and Pearson’s methods. An explanation of those methods is beyond the scope of this book, but you can access them through the org.apache.spark.mllib.stat.Statistics object.
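As a rough orientation only, Pearson's coefficient can be seen as covariance rescaled by the two standard deviations, which removes the large range of values seen in the covariance matrix and always yields a number between –1 and 1. Spearman's coefficient applies the same calculation to the ranks of the values. The following plain-Scala sketch illustrates both on made-up data; it isn't Spark's implementation, the object name CorrelationSketch is hypothetical, and the ranking helper assumes there are no tied values.

```scala
object CorrelationSketch {
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Pearson: covariance divided by the product of standard deviations.
  // The 1/n factors cancel, so plain sums of deviations are enough.
  def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.size == ys.size, "series must have the same length")
    val mx = mean(xs)
    val my = mean(ys)
    val cov = xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum
    val sx = math.sqrt(xs.map(x => (x - mx) * (x - mx)).sum)
    val sy = math.sqrt(ys.map(y => (y - my) * (y - my)).sum)
    cov / (sx * sy)
  }

  // Rank of each value (1 = smallest). Simplification: assumes no ties.
  def ranks(xs: Seq[Double]): Seq[Double] = {
    val sorted = xs.sorted
    xs.map(x => sorted.indexOf(x) + 1.0)
  }

  // Spearman: Pearson's coefficient computed on ranks instead of raw values.
  def spearman(xs: Seq[Double], ys: Seq[Double]): Double =
    pearson(ranks(xs), ranks(ys))

  def main(args: Array[String]): Unit = {
    val x = Seq(1.0, 2.0, 3.0, 4.0, 5.0)
    val cubes = x.map(v => v * v * v) // increases with x, but not linearly
    println(pearson(x, cubes))  // below 1: the relationship isn't linear
    println(spearman(x, cubes)) // approximately 1: the ordering is identical
  }
}
```

In Spark itself, the corr method of the Statistics object takes the method name ("pearson" or "spearman") as a string argument.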

7.4.4 Transforming to labeled points

Now that we’ve examined the data set, we can get on to preparing the data for linear regression. First, we have to put each example in the data set in a structure called LabeledPoint, which is used in most of Spark’s machine learning algorithms. It contains the target value and the vector with the features. The housingVals RDD, containing Vector objects with all variables, and the equivalent housingMat RowMatrix object were useful when we were examining the data set as a whole (in the previous sections), but now we need to separate the target variable (the label) from the features.

To do that, we can just transform the housingVals RDD (the target variable is in the last column):

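import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
val housingData = housingVals.map(x => {
  val a = x.toArray
  // the last element is the label; everything before it is the feature vector
  LabeledPoint(a(a.length-1), Vectors.dense(a.slice(0, a.length-1)))
})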
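To see the slicing logic in isolation, here is a minimal sketch that performs the same split on a plain array, without Spark. The Example case class is a hypothetical stand-in for LabeledPoint, used only so the snippet runs on its own, and the row values are made up.

```scala
object LabelSplitSketch {
  // Hypothetical stand-in for Spark's LabeledPoint (a label plus its features).
  case class Example(label: Double, features: Array[Double])

  // The last element becomes the label; all preceding elements become the features.
  def split(row: Array[Double]): Example = {
    require(row.nonEmpty, "row must contain at least the label")
    Example(row(row.length - 1), row.slice(0, row.length - 1))
  }

  def main(args: Array[String]): Unit = {
    val row = Array(0.5, 18.0, 2.5, 24.0) // made-up row: three features, then the label
    val ex = split(row)
    println(ex.label)                  // 24.0
    println(ex.features.mkString(",")) // 0.5,18.0,2.5
  }
}
```

Note that slice(0, a.length-1) excludes its upper bound, so the label never leaks into the feature vector.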
