Working with Nearest Neighbors - Tensorflow Machine Learning Cookbook

We start this chapter by implementing nearest neighbors to predict housing values. This is a great way to start with nearest neighbors because we will be dealing with numerical features and continuous targets.

Getting ready

To illustrate how making predictions with nearest neighbors works in TensorFlow, we will use the Boston housing dataset. Here we will be predicting the median neighborhood housing value as a function of several features.

Because the training set itself serves as the trained model, we find the k nearest neighbors of each prediction point and take a weighted average of their target values.

How to do it…

1. First, we will start by loading the required libraries and starting a graph session. We will use the requests module to load the necessary Boston housing data from the UCI machine learning repository:

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
import requests

sess = tf.Session()

2. Next, we will load the data using the requests module:

housing_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data'

housing_header = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']

cols_used = ['CRIM', 'INDUS', 'NOX', 'RM', 'AGE', 'DIS', 'TAX', 'PTRATIO', 'B', 'LSTAT']

num_features = len(cols_used)

# Request data

housing_file = requests.get(housing_url)

# Parse Data

housing_data = [[float(x) for x in y.split(' ') if len(x)>=1] for y in housing_file.text.split('\n') if len(y)>=1]
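To see what the parsing comprehension does, here is a minimal sketch run on two made-up rows (the sample text is illustrative and is not taken from the dataset). Splitting on a single space yields empty strings wherever columns are padded with several spaces; the `len(x) >= 1` filter drops them, and the `len(y) >= 1` filter drops the empty string left after the final newline.

```python
# Two made-up, space-padded rows laid out like a fixed-width data file
# (illustrative values only, not real records).
sample_text = ' 0.10  18.00   2.50\n 0.20   0.00   7.00\n'

# Same comprehension as in the recipe, applied to the sample text.
rows = [[float(x) for x in y.split(' ') if len(x) >= 1]
        for y in sample_text.split('\n') if len(y) >= 1]

print(rows)
```

Each inner list is one row of floats, which is the shape the later `np.array(...)` calls expect.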

3. Next, we separate the data into our dependent and independent features. We will be predicting the last variable, MEDV, which is the median value for the group of houses.

We will also not use the features ZN, CHAS, and RAD because of their uninformative or binary nature:

y_vals = np.transpose([np.array([y[13] for y in housing_data])])
x_vals = np.array([[x for i, x in enumerate(y) if housing_header[i] in cols_used] for y in housing_data])
# Min-max scale each feature column to the range [0, 1]
x_vals = (x_vals - x_vals.min(0)) / np.ptp(x_vals, 0)
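The last line of this step rescales every feature column to the range 0 to 1. This matters for nearest neighbors because distances are summed across features, so a column with large raw values would otherwise dominate. A small sketch on made-up numbers (the matrix `x` below is an assumption for illustration):

```python
import numpy as np

# Made-up feature matrix: 3 rows, 2 columns on very different scales.
x = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [4.0, 500.0]])

# Column-wise min-max scaling: subtract each column's minimum and divide
# by its range (max - min), so every column ends up in [0, 1].
x_scaled = (x - x.min(axis=0)) / np.ptp(x, axis=0)

print(x_scaled)
```

Note that a column with a constant value has a range of zero, so this formula would divide by zero for it.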

4. Now we split the x and y values into the train and test sets. We will create the training set by selecting about 80% of the rows at random, and leave the remaining 20% for the test set:

train_indices = np.random.choice(len(x_vals), round(len(x_vals)*0.8), replace=False)
test_indices = np.array(list(set(range(len(x_vals))) - set(train_indices)))

x_vals_train = x_vals[train_indices]

x_vals_test = x_vals[test_indices]

y_vals_train = y_vals[train_indices]

y_vals_test = y_vals[test_indices]
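The split can be checked on a small index range. The sketch below assumes a fixed seed (not part of the recipe) so that the result is repeatable, and uses `np.setdiff1d` as an alternative to the set difference; `n_rows` is a stand-in for `len(x_vals)`.

```python
import numpy as np

rng = np.random.RandomState(0)  # fixed seed: an assumption added for repeatability
n_rows = 10                     # stand-in for len(x_vals)

# Draw 80% of the row indices without replacement for training.
train_idx = rng.choice(n_rows, int(round(n_rows * 0.8)), replace=False)
# Every index that was not drawn goes to the test set.
test_idx = np.setdiff1d(np.arange(n_rows), train_idx)

print(len(train_idx), len(test_idx))
```

Because sampling is done without replacement, the two index sets are disjoint and together cover every row.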

5. Next, we declare our k value and batch size:

k = 4

batch_size=len(x_vals_test)

6. We will declare our placeholders next. Remember that there are no model variables to train, as the model is determined exactly by our training set:

x_data_train = tf.placeholder(shape=[None, num_features], dtype=tf.float32)
x_data_test = tf.placeholder(shape=[None, num_features], dtype=tf.float32)
y_target_train = tf.placeholder(shape=[None, 1], dtype=tf.float32)
y_target_test = tf.placeholder(shape=[None, 1], dtype=tf.float32)

7. Next, we create our distance function for a batch of test points. Here, we illustrate the use of the L1 distance:

distance = tf.reduce_sum(tf.abs(tf.subtract(x_data_train, tf.expand_dims(x_data_test, 1))), reduction_indices=2)

Note that the L2 distance function can be used as well. We would change the distance formula to the following:

distance = tf.sqrt(tf.reduce_sum(tf.square(tf.subtract(x_data_train, tf.expand_dims(x_data_test, 1))), reduction_indices=2))
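The distance expressions rely on broadcasting: expanding the test batch to shape (n_test, 1, features) and subtracting it from the training matrix of shape (n_train, features) gives one difference vector per (test, train) pair. The feature axis is then the last axis (index 2), which is the one to sum over. The same arithmetic can be sketched in NumPy on made-up points:

```python
import numpy as np

train = np.array([[0.0, 0.0],
                  [1.0, 1.0],
                  [3.0, 4.0]])   # 3 training points, 2 features
test = np.array([[0.0, 0.0],
                 [1.0, 0.0]])    # 2 test points

# test[:, None, :] has shape (2, 1, 2); subtracting it from train (3, 2)
# broadcasts to (2, 3, 2): one difference vector per (test, train) pair.
diff = train - test[:, None, :]

l1 = np.abs(diff).sum(axis=2)            # L1 distances, shape (2, 3)
l2 = np.sqrt((diff ** 2).sum(axis=2))    # L2 distances, shape (2, 3)

print(l1)
print(l2)
```

Row i of either result holds the distances from test point i to every training point.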

8. Now we create our prediction function. To do this, we use the top_k() function, which returns the values and indices of the largest values in a tensor. Since we want the indices of the smallest distances, we instead find the k largest negated distances. We also declare the predictions and the mean squared error (MSE) of the target values:

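# Negate the distances so that top_k() returns the k smallest ones
top_k_xvals, top_k_indices = tf.nn.top_k(tf.negative(distance), k=k)
# Normalize so that the k weights of each test point sum to 1
x_sums = tf.expand_dims(tf.reduce_sum(top_k_xvals, 1), 1)
x_sums_repeated = tf.matmul(x_sums, tf.ones([1, k], tf.float32))
x_val_weights = tf.expand_dims(tf.div(top_k_xvals, x_sums_repeated), 1)
# Assumption: the source text is cut off from here to the start of the loop body;
# these lines are inferred from how prediction, mse, i, and min_index are used below
top_k_yvals = tf.gather(y_target_train, top_k_indices)
prediction = tf.squeeze(tf.matmul(x_val_weights, top_k_yvals), [1])
mse = tf.div(tf.reduce_sum(tf.square(tf.subtract(prediction, y_target_test))), batch_size)

9. Now we loop through the test data in batches and compute the predictions and the MSE for each batch:

num_loops = int(np.ceil(len(x_vals_test)/batch_size))
for i in range(num_loops):
    min_index = i*batch_size
    max_index = min((i+1)*batch_size, len(x_vals_test))
    x_batch = x_vals_test[min_index:max_index]
    y_batch = y_vals_test[min_index:max_index]
    predictions = sess.run(prediction, feed_dict={x_data_train: x_vals_train, x_data_test: x_batch, y_target_train: y_vals_train, y_target_test: y_batch})
    batch_mse = sess.run(mse, feed_dict={x_data_train: x_vals_train, x_data_test: x_batch, y_target_train: y_vals_train, y_target_test: y_batch})
    print('Batch #' + str(i+1) + ' MSE: ' + str(np.round(batch_mse,3)))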

Batch #1 MSE: 23.153
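The prediction step can be mirrored in plain NumPy, which is a convenient way to check the arithmetic without a session. This is a sketch rather than the recipe's code: `knn_predict` is a hypothetical helper name and the toy data is made up. It selects the k smallest L1 distances and weights each neighbor by its distance divided by the sum of the k distances, which is what dividing the negated distances by their sum amounts to.

```python
import numpy as np

def knn_predict(x_train, y_train, x_test, k):
    """Distance-weighted k-NN regression (hypothetical helper, NumPy only)."""
    # Pairwise L1 distances, shape (n_test, n_train).
    dist = np.abs(x_train - x_test[:, None, :]).sum(axis=2)
    # Indices of the k smallest distances for each test point.
    idx = np.argsort(dist, axis=1)[:, :k]
    rows = np.arange(len(x_test))[:, None]
    top_dist = dist[rows, idx]
    # Weights proportional to distance, normalized to sum to 1 per row.
    weights = top_dist / top_dist.sum(axis=1, keepdims=True)
    # Weighted average of the neighbors' targets, shape (n_test, 1).
    return (weights * y_train[idx, 0]).sum(axis=1, keepdims=True)

# Made-up one-feature data where y = 10 * x.
x_train = np.array([[0.0], [1.0], [2.0], [10.0]])
y_train = np.array([[0.0], [10.0], [20.0], [100.0]])
x_test = np.array([[1.25]])
y_test = np.array([[12.5]])

pred = knn_predict(x_train, y_train, x_test, k=2)
mse = np.mean((pred - y_test) ** 2)
print(pred, mse)
```

In this toy case the two nearest neighbors lie at distances 0.25 and 0.75, so the weights are 0.25 and 0.75 and the farther of the two neighbors contributes more to the average. The sketch divides by the sum of the distances, so it assumes that sum is not zero.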

10. We can also look at a histogram of the actual target values compared with the predicted values. One reason to do so is to see that an averaging method has trouble predicting the extreme ends of the targets:

bins = np.linspace(5, 50, 45)

plt.hist(predictions, bins, alpha=0.5, label='Prediction')
plt.hist(y_batch, bins, alpha=0.5, label='Actual')
plt.title('Histogram of Predicted and Actual Values')
plt.xlabel('Med Home Value in $1,000s')
plt.ylabel('Frequency')
plt.legend(loc='upper right')
plt.show()

Figure 1: A histogram of the predicted values and actual target values for k-NN (k=4).

11. One hard thing to determine is the best value of k. For the preceding figure and predictions, we used k=4 because cross-validation across multiple values of k shows that it gives the lowest MSE, as the following figure shows. It is also worthwhile to plot the variance of the predicted values, which decreases as we average over more neighbors:

Figure 2: The MSE for k-NN predictions for various values of k. We also plot the variance of the predicted values on the test set. Note that the variance decreases as k increases.
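One way to run such a comparison is k-fold cross-validation over a range of k values. The sketch below is an assumption-laden illustration, not the procedure behind the figure: it uses synthetic data, unweighted averaging of the k nearest targets, five folds, and the hypothetical helper names `knn_mean_predict` and `cv_mse`.

```python
import numpy as np

def knn_mean_predict(x_train, y_train, x_test, k):
    # Unweighted k-NN: average the targets of the k nearest training points (L1).
    dist = np.abs(x_train - x_test[:, None, :]).sum(axis=2)
    idx = np.argsort(dist, axis=1)[:, :k]
    return y_train[idx].mean(axis=1)

def cv_mse(x, y, k, n_folds=5):
    # Mean squared error of k-NN, averaged over n_folds held-out folds.
    folds = np.array_split(np.arange(len(x)), n_folds)
    errors = []
    for fold in folds:
        mask = np.ones(len(x), dtype=bool)
        mask[fold] = False  # hold this fold out of the training rows
        pred = knn_mean_predict(x[mask], y[mask], x[fold], k)
        errors.append(np.mean((pred - y[fold]) ** 2))
    return float(np.mean(errors))

# Synthetic data (made up for illustration, not the housing data).
rng = np.random.RandomState(0)
x = rng.uniform(0.0, 1.0, size=(100, 3))
y = x.sum(axis=1) + rng.normal(0.0, 0.1, size=100)

k_values = list(range(1, 11))
mse_by_k = {k: cv_mse(x, y, k) for k in k_values}
best_k = min(mse_by_k, key=mse_by_k.get)
print(best_k, mse_by_k[best_k])
```

The value of k that minimizes the cross-validated MSE depends on the data, so the result on this synthetic set says nothing about the housing data.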

How it works…

With the nearest neighbors algorithm, the model is the training set. Because of this, we do not have to train any variables in our model. The only parameter, k, was chosen via cross-validation to minimize the MSE.

There's more…

For the weighting of the k-NN, we chose to weight directly by the distance. There are other options that we could consider as well. Another common method is to weight by the inverse squared distance.
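A minimal NumPy sketch of inverse squared distance weighting follows. The helper name `inverse_square_weights` and the `eps` constant are assumptions introduced here; `eps` guards against division by zero when a test point coincides with a training point.

```python
import numpy as np

def inverse_square_weights(top_dist, eps=1e-8):
    # top_dist: distances to the k nearest neighbors, shape (n_test, k).
    # eps is an assumed small constant that avoids division by zero.
    raw = 1.0 / (top_dist ** 2 + eps)
    # Normalize so that the weights of each test point sum to 1.
    return raw / raw.sum(axis=1, keepdims=True)

# One test point whose two nearest neighbors are at distances 1 and 2.
top_dist = np.array([[1.0, 2.0]])
weights = inverse_square_weights(top_dist, eps=0.0)
print(weights)
```

With this scheme the closer neighbor gets the larger weight: here the neighbor at distance 1 receives four times the weight of the neighbor at distance 2.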
