Classifying with k-Nearest Neighbors
2.3 Example: a handwriting recognition system
We’re going to work through an example of handwriting recognition with our kNN
classifier. We’ll be working only with the digits 0–9. Some examples are shown in fig- ure 2.6. These digits were processed through image-processing software to make them all the same size and color.1 They’re all 32x32 black and white. The binary images were converted to text format to make this example easier, although it isn’t the most efficient use of memory.
2.3.1 Prepare: converting images into test vectors
The images are stored in two directories in the chapter 2 source code. The training- Digits directory contains about 2,000 examples similar to those in figure 2.6. There are roughly 200 samples from each digit. The testDigits directory contains about 900 examples. We’ll use the trainingDigits directory to train our classifier and testDigits to
1 The dataset is a modified version of the “Optical Recognition of Handwritten Digits Data Set” by E. Alpaydin, C. Kaynak, Department of Computer Engineering at Bogazici University, 80815 Istanbul Turkey, retrieved from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml) on October 3, 2010.
Example: using kNN on a handwriting recognition system
1. Collect: Text file provided.
2. Prepare: Write a function to convert from the image format to the list format used in our classifier, classify0().
3. Analyze: We’ll look at the prepared data in the Python shell to make sure it’s correct.
4. Train: Doesn’t apply to the kNN algorithm.
5. Test: Write a function to use some portion of the data as test examples. The test examples are classified against the non-test examples. If the predicted class doesn’t match the real class, you’ll count that as an error.
6. Use: Not performed in this example. You could build a complete program to extract digits from an image, such a system used to sort the mail in the United States.
test it. There’ll be no overlap between the two groups. Feel free to take a look at the files in those folders.
We’d like to use the same classifier that we used in the previous two examples, so we’re going to need to reformat the images to a single vector. We’ll take the 32x32 matrix that is each binary image and make it a 1x1024 vector. After we do this, we can apply it to our existing classifier.
The following code is a small function called img2vector, which converts the image to a vector. The function creates a 1x1024 NumPy array, then opens the given file, loops over the first 32 lines in the file, and stores the integer value of the first 32 characters on each line in the NumPy array. This array is finally returned.
def img2vector(filename): returnVect = zeros((1,1024)) fr = open(filename) for i in range(32): lineStr = fr.readline() for j in range(32): returnVect[0,32*i+j] = int(lineStr[j]) return returnVect
Try out the img2vector code with the following commands in the Python shell, and compare the results to a file opened with a text editor:
>>> testVector = kNN.img2vector('testDigits/0_13.txt') >>> testVector[0,0:31] array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]) >>> testVector[0,32:63] array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
35
Example: a handwriting recognition system
2.3.2 Test: kNN on handwritten digits
Now that you have the data in a format that you can plug into our classifier, you’re ready to test out this idea and see how well it works. The function shown in listing 2.6, handwritingClassTest(), is a self-contained function that tests out our classifier. You can add it to kNN.py. Before you add it, make sure to add from os import listdir to the top of the file. This imports one function, listdir, from the os module, so that you can see the names of files in a given directory.
def handwritingClassTest(): hwLabels = [] trainingFileList = listdir('trainingDigits') m = len(trainingFileList) trainingMat = zeros((m,1024)) for i in range(m): fileNameStr = trainingFileList[i] fileStr = fileNameStr.split('.')[0] classNumStr = int(fileStr.split('_')[0]) hwLabels.append(classNumStr)
trainingMat[i,:] = img2vector('trainingDigits/%s' % fileNameStr) testFileList = listdir('testDigits') errorCount = 0.0 mTest = len(testFileList) for i in range(mTest): fileNameStr = testFileList[i] fileStr = fileNameStr.split('.')[0] classNumStr = int(fileStr.split('_')[0])
vectorUnderTest = img2vector('testDigits/%s' % fileNameStr) classifierResult = classify0(vectorUnderTest, \
trainingMat, hwLabels, 3)
print "the classifier came back with: %d, the real answer is: %d"\ % (classifierResult, classNumStr)
if (classifierResult != classNumStr): errorCount += 1.0 print "\nthe total number of errors is: %d" % errorCount
print "\nthe total error rate is: %f" % (errorCount/float(mTest))
In listing 2.6, you get the contents for the trainingDigits directory
B
as a list. Then you see how many files are in that directory and call this m. Next, you create a training matrix with m rows and 1024 columns to hold each image as a single row. You parse out the class number from the filename.C
The filename is something like 9_45.txt, where 9 is the class number and it is the 45th instance of the digit 9. You then put this class number in the hwLabels vector and load the image with the function img2vector discussed previously. Next, you do something similar for all the files in the testDigits directory, but instead of loading them into a big matrix, you test each vector individu- ally with our classify0 function. You didn’t use the autoNorm() function from sec- tion 2.2 because all of the values were already between 0 and 1.To execute this from the Python shell, type kNN.handwritingClassTest() at the Python prompt. It’s quite interesting to watch. Depending on your machine’s speed, it
Listing 2.6 Handwritten digits testing code
Get contents of directory
B
Process class num from filename
will take some time to load the dataset. Then, when the function begins testing, you can see the results as they come back. You should have output that’s similar to the fol- lowing example:
>>> kNN.handwritingClassTest()
the classifier came back with: 0, the real answer is: 0 the classifier came back with: 0, the real answer is: 0 .
.
the classifier came back with: 7, the real answer is: 7 the classifier came back with: 7, the real answer is: 7 the classifier came back with: 8, the real answer is: 8 the classifier came back with: 8, the real answer is: 8 the classifier came back with: 8, the real answer is: 8 the classifier came back with: 6, the real answer is: 8 .
.
the classifier came back with: 9, the real answer is: 9 the total number of errors is: 11
the total error rate is: 0.011628
Using the kNN algorithm on this dataset, you were able to achieve an error rate of 1.2%. You can vary k to see how this changes. You can also modify the handwritingClassTest function to randomly select training examples. That way, you can vary the number of training examples and see how that impacts the error rate.
Depending on your computer’s speed, you may think this algorithm is slow, and you’d be right. For each of our 900 test cases, you had to do 2,000 distance calcula- tions on a 1024-entry floating point vector. Additionally, our test dataset size was 2 MB. Is there a way to make this smaller and take fewer computations? One modification to kNN, called kD-trees, allows you to reduce the number of calculations.
2.4
Summary
The k-Nearest Neighbors algorithm is a simple and effective way to classify data. The examples in this chapter should be evidence of how powerful a classifier it is. kNN is an example of instance-based learning, where you need to have instances of data close at hand to perform the machine learning algorithm. The algorithm has to carry around the full dataset; for large datasets, this implies a large amount of storage. In addition, you need to calculate the distance measurement for every piece of data in the database, and this can be cumbersome.
An additional drawback is that kNN doesn’t give you any idea of the underlying structure of the data; you have no idea what an “average” or “exemplar” instance from each class looks like. In the next chapter, we’ll address this issue by exploring ways in which probability measurements can help you do classification.
37