8 Data Visualisation and Plotting
9.5 Classification
In Sect. 9.4 we analysed data that was unlabelled: we did not know to which class a sample belonged (known as unsupervised learning). In contrast, a supervised problem deals with labelled data, where we are aware of the discrete classes to which each sample belongs. When we wish to predict which class a sample belongs to, we call this a classification problem. SciKit-Learn has a number of algorithms for classification; in this section we will look at the Support Vector Machine.
We will work on the Wisconsin breast cancer dataset, split it into a training set and a test set, train a Support Vector Machine with a linear kernel, and test the trained model on the held-out test set. The Support Vector Machine model should be able to predict whether a new, unseen sample is malignant or benign based on its features:
1 >>> from sklearn import cross_validation
2 >>> from sklearn import datasets
3 >>> from sklearn.svm import SVC
4 >>> from sklearn.metrics import classification_report
5 >>> X = datasets.load_breast_cancer().data
6 >>> y = datasets.load_breast_cancer().target
7 >>> X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.2)
8 >>> svm = SVC(kernel="linear")
9 >>> svm.fit(X_train, y_train)
10 SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
       decision_function_shape=None, degree=3, gamma="auto",
       kernel="linear", max_iter=-1, probability=False,
       random_state=None, shrinking=True, tol=0.001, verbose=False)
11 >>> svm.score(X_test, y_test)
12 0.95614035087719296
13 >>> y_pred = svm.predict(X_test)
14 >>> classification_report(y_test, y_pred)
15
16              precision    recall  f1-score   support
17
18   malignant       1.00      0.89      0.94        44
19      benign       0.93      1.00      0.97        70
20
21 avg / total       0.96      0.96      0.96       114
Listing 56. Training a Support Vector Machine to classify between malignant and benign breast cancer samples.
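Listing 56 imports train_test_split from the cross_validation module, which was removed from later SciKit-Learn releases in favour of sklearn.model_selection. The following is a minimal sketch of the same workflow for a recent SciKit-Learn installation, not a replacement for the listing. The function name train_and_evaluate and the fixed random_state are assumptions added here for illustration and reproducibility; the scores it produces depend on the split and will not necessarily match the values printed above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def train_and_evaluate(test_size=0.2, random_state=0):
    """Train a linear-kernel SVM on the breast cancer data and score it.

    `random_state` is an assumption added here so the split is repeatable.
    """
    data = load_breast_cancer()

    # Hold back a fraction of the samples as an unseen test set.
    X_train, X_test, y_train, y_test = train_test_split(
        data.data, data.target, test_size=test_size, random_state=random_state
    )

    svm = SVC(kernel="linear")
    svm.fit(X_train, y_train)

    # Mean accuracy on the held-out samples.
    accuracy = svm.score(X_test, y_test)

    # Per-class precision, recall and F1, labelled with the class names.
    report = classification_report(
        y_test, svm.predict(X_test), target_names=data.target_names
    )
    return accuracy, report


if __name__ == "__main__":
    accuracy, report = train_and_evaluate()
    print(accuracy)
    print(report)
```

Passing the report to print, rather than evaluating it at the prompt, displays it as a table instead of a string containing newline escapes.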
You will notice that the SVM model performed very well at predicting the malignancy of new, unseen samples from the test set. This can be quantified by printing a number of metrics using the classification report function, shown on Lines 14–21. Here, the precision, recall, and F1 score, where F1 = 2 · (precision · recall) / (precision + recall), are shown for each class. The support column is a count of the number of samples in each class.
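To make the three metrics concrete, the sketch below computes them from raw counts. The helper precision_recall_f1 is a hypothetical name, not part of SciKit-Learn. The counts in the usage example are inferred rather than taken from the original run: they are the values consistent with the supports (44 and 70) and the accuracy of 109/114 reported in Listing 56, namely 5 malignant samples predicted as benign and no other errors.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Return (precision, recall, F1) for one class from raw counts.

    Assumes the denominators are non-zero, i.e. the class was both
    present in the data and predicted at least once.
    """
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Inferred counts, consistent with the rounded rows of the report.
    malignant = precision_recall_f1(
        true_positives=39, false_positives=0, false_negatives=5
    )
    benign = precision_recall_f1(
        true_positives=70, false_positives=5, false_negatives=0
    )
    print([round(value, 2) for value in malignant])  # [1.0, 0.89, 0.94]
    print([round(value, 2) for value in benign])     # [0.93, 1.0, 0.97]
```

Precision penalises false positives and recall penalises false negatives, which is why the malignant class has perfect precision but lower recall under these counts.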
Support Vector Machines are a very powerful tool for classification. They work well in high-dimensional spaces, even when the number of features is higher than the number of samples. However, their training time is quadratic in the number of samples, so training on large datasets can become slow. Quadratic means that if you increase the size of a dataset by a factor of 10, training will take roughly 100 times longer.
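The scaling argument can be written as a one-line estimate. This is an idealised sketch that assumes training time grows exactly with the square of the sample count and ignores constant factors; the function name scaled_training_time and the timings in the example are hypothetical.

```python
def scaled_training_time(base_seconds, base_samples, new_samples, exponent=2):
    """Estimate training time after changing the number of samples.

    Assumes time grows as samples ** exponent; exponent=2 models the
    quadratic behaviour described in the text.
    """
    return base_seconds * (new_samples / base_samples) ** exponent


if __name__ == "__main__":
    # Hypothetical: 1 second for 1,000 samples, scaled up to 10,000 samples.
    print(scaled_training_time(1.0, 1000, 10000))  # 100.0
```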
Lastly, you will notice that the breast cancer dataset consists of 30 features. This makes it difficult to visualise or plot the data. To aid in the visualisation of high-dimensional data, we can apply a technique called dimensionality reduction. This is covered in Sect. 9.6, below.