
Chapter Four

4.1 Random Forest Model

The random forest model is an ensemble machine learning method for classification or regression tasks. It constructs several decision trees and outputs the most frequently occurring class (the mode) among the trees' predictions for classification, or the mean prediction for regression. In this section we focus on random forests for classification. Random forest models use a random selection of features when splitting the decision trees, so the resulting classifier is made up of a set of tree-structured classifiers. We can represent the random forest model by equation (12) below:

π‘†π‘π‘Žπ‘π‘’ = {𝐹(𝑋, 𝛼𝑖); 𝑖 = 1,2,3,4, . . . , π‘›π‘œπ‘  π‘œπ‘“ π‘‘π‘Ÿπ‘’π‘’π‘ } (12)

In equation (12), each $\alpha_i$ is one of a set of independent and identically distributed random vectors, and every tree casts a vote for the most popular class. To build the model, we pick k data points at random from the training set and build a decision tree on these k data points. We then choose the number of trees (ntrees) we want to build and repeat the previous step for each tree, starting with one tree and building further trees on other subsets of the data. For a new data point, each of the ntrees trees predicts the category to which the point belongs, and the point is assigned to the class that wins the majority vote.
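The building procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the model fitted in this thesis: each tree is reduced to a single-split "stump", the k points are drawn with replacement, one random feature subset is drawn per tree, and all function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def fit_stump(X, y, feature_ids):
    """Find the best single split (lowest weighted Gini) over feature_ids."""
    majority = Counter(y).most_common(1)[0][0]
    best_score = None
    # Fallback when no valid split exists: always predict the majority class.
    best_stump = (feature_ids[0], float("inf"), majority, majority)
    for f in feature_ids:
        for threshold in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= threshold]
            right = [lab for row, lab in zip(X, y) if row[f] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best_score is None or score < best_score:
                best_score = score
                best_stump = (f, threshold,
                              Counter(left).most_common(1)[0][0],
                              Counter(right).most_common(1)[0][0])
    return best_stump


def predict_stump(stump, row):
    f, threshold, left_label, right_label = stump
    return left_label if row[f] <= threshold else right_label


def fit_forest(X, y, ntrees, k, n_features, seed=0):
    """Build ntrees stumps, each on k random points and n_features random features."""
    rng = random.Random(seed)
    forest = []
    for _ in range(ntrees):
        idx = [rng.randrange(len(X)) for _ in range(k)]    # k points, with replacement
        feats = rng.sample(range(len(X[0])), n_features)   # random feature subset
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest


def predict_forest(forest, row):
    """Each tree votes; the class with the most votes wins."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): class "A" has small feature values, class "B" large ones.
X = [[1, 1], [2, 1], [1, 2], [2, 2], [8, 8], [9, 8], [8, 9], [9, 9]]
y = ["A", "A", "A", "A", "B", "B", "B", "B"]
forest = fit_forest(X, y, ntrees=25, k=8, n_features=1)
print(predict_forest(forest, [0, 0]), predict_forest(forest, [10, 10]))
```

A full random forest grows deeper trees and redraws the feature subset at every split; for a one-split stump the two coincide, which keeps the sketch short.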

A major advantage of the random forest is that it can be used to judge variable importance by ranking the contribution of each variable. The model does this by estimating the predictive value of each variable: it scrambles (permutes) the values of a variable and examines how much the performance of the model drops.
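The scrambling idea can be illustrated with a short Python sketch. It is a simplified version that assumes a generic `predict` function and a single evaluation set, rather than the per-tree out-of-bag data used by a real random forest; the model, data and function names are hypothetical.

```python
import random


def accuracy(predict, X, y):
    """Fraction of rows for which predict(row) equals the true label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy when each column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    drops = []
    for col in range(len(X[0])):
        total = 0.0
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the data with only this one column scrambled.
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            total += baseline - accuracy(predict, X_perm, y)
        drops.append(total / n_repeats)
    return drops


# Hypothetical model: only column 0 matters; column 1 is noise.
def predict(row):
    return "high" if row[0] > 0.5 else "low"


data_rng = random.Random(1)
X = [[i % 2, data_rng.random()] for i in range(20)]
y = [predict(row) for row in X]
importance = permutation_importance(predict, X, y)
print(importance)
```

Scrambling the noise column leaves the accuracy unchanged, so its importance is zero, while scrambling the informative column lowers the accuracy.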


In applying the random forest model to our dataset, a random sample of one million observations was used to build a random forest model with all twelve predictors, using the R package "randomForest". Exactly 60% of the data was used as the training set and the remaining 40% as the test set. The package also provided the variable importance ranking shown in figure 4.1. Variable importance tells us which variable has the highest impact on the prediction, using two metrics: mean decrease in accuracy (MDA) and mean decrease in Gini. Mean decrease in accuracy is the percentage or proportion of observations that are incorrectly classified when a particular variable is excluded from the model. The MDA is computed for each tree by permuting the variable in the out-of-bag (OOB) data and recording the prediction error. The difference in error between the permuted and unpermuted data is then averaged over all trees and normalized by the standard deviation. The mean decrease in Gini, on the other hand, measures the average increase in node purity achieved by the splits on a variable. If a variable is important in the model, its splits will turn mixed-class nodes into single-class nodes.
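To make the Gini criterion concrete, the following Python sketch computes the decrease in Gini impurity for a single split. The four-observation node is a made-up example, and the function names are ours, not those of the randomForest package.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Impurity of the parent node minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
    return gini(parent) - weighted


parent = ["Default", "Default", "Paying", "Paying"]
pure_split = gini_decrease(parent, parent[:2], parent[2:])
useless_split = gini_decrease(parent, ["Default", "Paying"], ["Default", "Paying"])
print(pure_split, useless_split)  # 0.5 0.0
```

A split that separates the mixed node into two single-class nodes removes all of the impurity (0.5), whereas a split that leaves both children as mixed as the parent removes none.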


RANDOM FOREST

ntree  formula0  formula1  formula2  formula3  formula4  formula5  formula6  formula7  formula8  formula9
10     0.949     0.953645  0.95374   0.953633  0.952528  0.953763  0.954635  0.95335   0.92575   0.937253
20     0.952543  0.954593  0.95584   0.955373  0.95422   0.95492   0.955118  0.953768  0.934913  0.940028
30     0.95312   0.95569   0.956413  0.955758  0.954903  0.955235  0.95541   0.954043  0.934383  0.939333
40     0.953885  0.955878  0.95622   0.956015  0.954905  0.955578  0.95541   0.954095  0.93602   0.941858
50     0.954158  0.955665  0.95659   0.956003  0.955078  0.955503  0.955318  0.954165  0.935023  0.941095
60     0.954448  0.955958  0.956688  0.9562    0.95515   0.95548   0.955513  0.954328  0.9354    0.94106
70     0.954478  0.956185  0.956645  0.95624   0.955175  0.955863  0.955715  0.95422   0.93407   0.941063
80     0.954353  0.956205  0.95653   0.956185  0.955125  0.955595  0.955503  0.95415   0.935788  0.942295
90     0.954878  0.956163  0.956593  0.95641   0.955208  0.955603  0.955568  0.954248  0.935575  0.940908
100    0.954573  0.956245  0.956575  0.95635   0.955275  0.955758  0.9556    0.954363  0.935958  0.94218
110    0.955008  0.95617   0.956708  0.956465  0.95528   0.955735  0.955713  0.95438   0.93584   0.94163
120    0.954898  0.956313  0.9566    0.956425  0.955488  0.955648  0.95562   0.954335  0.933968  0.942425
130    0.954903  0.95628   0.956745  0.95649   0.955348  0.955715  0.95575   0.95431   0.936173  0.94127
140    0.954958  0.956253  0.95678   0.956375  0.955405  0.955658  0.955713  0.954313  0.934788  0.94249
150    0.955058  0.956368  0.956728  0.956533  0.955453  0.955725  0.955565  0.95435   0.936423  0.941585

Table 4.1. Accuracies of the Random Forest model for 10 to 150 trees and 10 different permutations of variables.


Figure 4.1. Random Forest variable importance/ ranking using mean decrease in accuracy and mean decrease in gini.

From the variable importance chart, Loan Age has the highest impact under both metrics. State has the least impact on prediction under the mean decrease in accuracy metric, while the first-time-homebuyer indicator has the least impact under the mean decrease in Gini metric. For example, we see that over 120,000 observations will be misclassified if we drop the variable 'Loan Age' from our model, while dropping the first-time-homebuyer indicator will result in no change in the accuracy of our model. One hundred and fifty different random forest models were created using fifteen different numbers of trees (10, 20, ..., 150) and 10 different formulas. We permuted the variables by removing the least important variable from the formula at each step. In the accuracy table (table 4.1), 'formula0' represents the inclusion of all twelve variables, 'formula1' consists of 11 variables (the State variable was dropped), and so on. The highest accuracy is 0.95678, produced by formula2, which consists of 10 variables (the first-time-homebuyer and State variables were dropped), at 140 trees. The confusion matrix and overall statistics of this model are presented below.
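The grid of 150 models can be enumerated as in the Python sketch below. Only the positions of Loan Age, the first-time-homebuyer indicator and State in the ranking are taken from the text; the remaining variable names (x1 to x9) are placeholders, and the snake_case identifiers are our own.

```python
# Hypothetical ranking from most to least important. Only "loan_age",
# "first_time_homebuyer" and "state" come from the text; x1..x9 are placeholders.
ranked = ["loan_age"] + [f"x{i}" for i in range(1, 10)] + ["first_time_homebuyer", "state"]

# formula0 keeps all twelve variables; each later formula drops one more
# variable from the bottom of the ranking.
formulas = {f"formula{d}": ranked[:len(ranked) - d] for d in range(10)}

# Fifteen forest sizes: 10, 20, ..., 150 trees.
ntrees_grid = list(range(10, 151, 10))

# One model per (number of trees, formula) pair.
grid = [(n, name) for n in ntrees_grid for name in formulas]
print(len(grid))  # 150
```

Each pair in `grid` corresponds to one cell of the accuracy table: a forest size and the set of predictors used.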


Accuracy               : 0.9568
95% CI                 : (0.9557, 0.9569)
No Information Rate    : 0.5283
P-Value [Acc > NIR]    : < 2.2e-16
Kappa                  : 0.9173
Mcnemar's Test P-Value : < 2.2e-16

Statistics by Class:

                      Class: Default  Class: Paying  Class: Prepay
Sensitivity           0.53570         0.9773         0.9686
Specificity           0.98996         0.9501         0.9877
Pos Pred Value        0.67298         0.9377         0.9888
Neg Pred Value        0.98223         0.9819         0.9656
Prevalence            0.03715         0.4346         0.5283
Detection Rate        0.01990         0.4247         0.5117
Detection Prevalence  0.02957         0.4529         0.5175
Balanced Accuracy     0.76283         0.9637         0.9782

                        True Class
Predicted     Default    Paying    Prepay
Default          7960      3497       321
Paying           5031    169877      6260
Prepay           1868       453    204683

Overall Accuracy: 95.68%


From the model statistics, the Kappa value of 0.9173 suggests that our model performs very well, while the p-value [Acc > NIR] < 2.2e-16 indicates that the accuracy of the model is significantly higher than the no-information rate at the 1% significance level. In predicting the paying and prepaid classes the model performed extremely well, with sensitivity and specificity for both classes exceeding 90%. However, the model performed only moderately in predicting the default class, with sensitivity just above 50%. Overall, the model performed very well, with accuracy above 95%.

4.2 KNN Model

The K Nearest Neighbor classifier (also known as KNN) is an example of a non-parametric statistical model, hence it makes no explicit assumptions about the functional form or the distribution of the data. KNN is a distance-based algorithm that takes a majority vote among the K closest observations. Distance metrics employed in KNN models include, for example, the Euclidean, Manhattan, Chebyshev and Hamming distances. In this work we apply only the Euclidean distance measure. The KNN algorithm can be summarized as follows: given a positive integer K, a distance metric d and an unknown observation x, the model performs the steps below:

1) First it goes through the entire training set, calculating the distance d between x and each data point in the training set. The K points closest to x are taken as the set W, with K chosen to be an odd number to help prevent a tie.

2) Next, we compute the proportion of points in W associated with a given class label. This is called the conditional probability of each class and is given by equation (13) below:

$$P(y = i \mid X = x) = \frac{1}{K} \sum_{j \in W} I\left(y^{(j)} = i\right) \qquad (13)$$

In equation (13), I is an indicator function which evaluates to 1 when its argument is true and to 0 when it is false. Lastly, we classify x to the class with the highest probability.
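The steps of the algorithm translate directly into a short Python sketch. It assumes Euclidean distance (via `math.dist`, available from Python 3.8) and uses a made-up five-point training set; the function names are hypothetical.

```python
import math
from collections import Counter


def knn_class_probabilities(train_X, train_y, x, k):
    """Equation (13): share of each class among the k nearest neighbours of x."""
    by_distance = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], x))
    neighbours = [label for _, label in by_distance[:k]]   # labels of the set W
    return {label: count / k for label, count in Counter(neighbours).items()}


def knn_predict(train_X, train_y, x, k):
    """Assign x to the class with the highest conditional probability."""
    probs = knn_class_probabilities(train_X, train_y, x, k)
    return max(probs, key=probs.get)


# Toy training set (hypothetical): three points of class "a", two of class "b".
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6)]
train_y = ["a", "a", "a", "b", "b"]
probs = knn_class_probabilities(train_X, train_y, (5, 5.5), k=3)
print(probs)
print(knn_predict(train_X, train_y, (5, 5.5), k=3))  # b
```

For the query point (5, 5.5), two of the three nearest neighbours belong to class "b", so the estimated probability of "b" is 2/3 and the point is classified as "b".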


The choice of K is of great importance, because in KNN K is a hyperparameter that controls the shape of the decision boundary and must be set properly to attain the best possible fit for the data set. A small K restricts the prediction region and thus leads to high variance with low bias. Conversely, a larger K accommodates more voters in the prediction region, leading to a smoother decision boundary, which implies lower variance but increased bias. It should be noted that KNN comes with both a memory cost and a computational cost. The memory cost arises because we have to store a huge data set: the algorithm simply memorizes the training observations, which are used as "experience or knowledge" for the classification phase. The implication is that the algorithm uses only the stored training observations to give predictions when a query is passed to it. Since predicting the class of a single observation requires going through the entire data set, computational cost is also a factor to be considered.

In applying the KNN classifier to our data set, a randomized set of 120,000 data points was selected, of which 80,000 observations were used as the training set and 40,000 observations as the test set. Fifty different KNN models were created with K values varying from 1 to 50, and the accuracy of each model was tested by making predictions on the test data. A plot of these accuracies against the K values was obtained (see figure 4.2).
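The search over K can be sketched in Python as follows. The data here are synthetic (two Gaussian clusters of our own making, split 2:1 into training and test sets to mirror the 80,000/40,000 split), so the resulting accuracies say nothing about the loan data; the function names are hypothetical.

```python
import math
import random
from collections import Counter


def knn_predict(train, x, k):
    """Majority label among the k training points closest to x (Euclidean)."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy_by_k(train, test, k_values):
    """Test-set accuracy of a KNN classifier for each candidate K."""
    results = {}
    for k in k_values:
        hits = sum(knn_predict(train, x, k) == label for x, label in test)
        results[k] = hits / len(test)
    return results


# Synthetic two-class data (hypothetical): 60 points per class.
rng = random.Random(0)
points = [((rng.gauss(c, 1.0), rng.gauss(c, 1.0)), label)
          for c, label in [(0.0, "paying"), (3.0, "prepay")] for _ in range(60)]
rng.shuffle(points)
train, test = points[:80], points[80:]

scores = accuracy_by_k(train, test, range(1, 51))
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```

Plotting `scores` against K gives the kind of accuracy curve referred to in the text.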


                        True Class
Predicted     Default    Paying    Prepay
Default           828       355       184
Paying            438     16654      4908
Prepay            223       638     15772

Overall Accuracy: 0.83135

Table 4.3. Confusion matrix of K-Nearest Neighbors with K = 15
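The overall accuracy and the per-class measures can be derived from a confusion matrix as in the Python sketch below, which uses the counts from the table above with rows as predicted classes and columns as true classes. The per-class figures it prints are computed from this table alone, so they need not coincide with statistics reported from a separate run.

```python
classes = ["Default", "Paying", "Prepay"]
# Rows are predicted classes, columns are true classes.
matrix = [
    [828, 355, 184],
    [438, 16654, 4908],
    [223, 638, 15772],
]

total = sum(sum(row) for row in matrix)
# Correct predictions lie on the diagonal.
accuracy = sum(matrix[i][i] for i in range(len(classes))) / total

for i, name in enumerate(classes):
    predicted_as_i = sum(matrix[i])             # row total
    truly_i = sum(row[i] for row in matrix)     # column total
    sensitivity = matrix[i][i] / truly_i        # share of true class i that was found
    ppv = matrix[i][i] / predicted_as_i         # share of class-i predictions that were right
    print(f"{name}: sensitivity={sensitivity:.4f}, PPV={ppv:.4f}")

print(f"accuracy={accuracy:.5f}")  # accuracy=0.83135
```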

From figure 4.2 we can see that the highest accuracy was achieved at K = 15, with an accuracy of approximately 83%. K values higher than 15 did not yield any increase in accuracy. The lowest accuracy occurred at K = 3, with an accuracy of less than 40%. A careful look at figure 4.2 reveals an interesting situation in which, at smaller values of K, odd K values produced lower accuracies than even K values (ordinarily, odd values of K should yield higher accuracies), as seen in the case of K = 2 with an accuracy of 79% and K = 3 with an accuracy of 38%. This scenario is usually the result of 'ties' in the KNN model when allocating votes to different classes. A tie means that two or more classes have an equal chance, or probability, of being predicted as the class of a new input; that is, two or more classes have equal numbers of nearest neighbors (or neighbors at equal distances) for the predicted data point. Recall that the output variable has 3 classes, so at K = 3 there is a high chance of each class receiving an equal number of votes (for K = 5 a tie may occur if two classes have equal votes). Ties are natural occurrences in KNN models, especially with a huge dataset like the one used in this thesis, because the probability of a tie occurring increases with the size of the data. One way to break ties is to apply a different selection criterion, estimating partial sums of distances to predict each class. Another way is to decrease K by 1 until the tie is eventually resolved. However, the R software used in this thesis breaks ties randomly.

Table 4.3 shows the confusion matrix of the KNN model at K = 15, with an overall accuracy of 83%. The model performed very well in predicting the 'paying' and 'prepaid' classes, with positive predictive values of 76% and 94% respectively. However, the model did not perform so well in predicting the default class: its positive predictive value is 56%, which is only slightly above 50%.
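The second tie-breaking strategy mentioned above, decreasing K by 1 until the tie is resolved, can be sketched in Python as follows. This is an illustration of that strategy only, not of the random tie-breaking used by the R software; the function name and the neighbour labels are hypothetical.

```python
from collections import Counter


def vote_with_tie_break(neighbour_labels, k):
    """Majority vote over the k nearest labels; on a tie, retry with k - 1.

    neighbour_labels must be ordered from closest to farthest.
    Returns the winning label and the value of k at which the vote was decided.
    """
    k = min(k, len(neighbour_labels))
    while k >= 1:
        counts = Counter(neighbour_labels[:k]).most_common()
        # A clear winner: only one class present, or the top count is strictly larger.
        if len(counts) == 1 or counts[0][1] > counts[1][1]:
            return counts[0][0], k
        k -= 1
    raise ValueError("no neighbours supplied")


# Labels of the nearest neighbours, ordered from closest to farthest (hypothetical).
labels = ["Paying", "Prepay", "Default", "Prepay", "Paying"]
print(vote_with_tie_break(labels, 3))  # ('Paying', 1)
print(vote_with_tie_break(labels, 5))  # ('Prepay', 4)
```

With K = 3 each of the three classes receives one vote, so K is reduced until the single nearest neighbour decides; with K = 5 the two-way tie between 'Paying' and 'Prepay' is resolved once the fifth neighbour is dropped.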

Accuracy               : 0.8282
95% CI                 : (0.8244, 0.8319)
No Information Rate    : 0.5279
P-Value [Acc > NIR]    : < 2.2e-16
Kappa                  : 0.6797
Mcnemar's Test P-Value : < 2.2e-16

Statistics by Class:

                      Class: Default  Class: Paying  Class: Prepay
Sensitivity           0.49573         0.9328         0.7637
Specificity           0.98611         0.7647         0.9451
Pos Pred Value        0.56529         0.7547         0.9396
Neg Pred Value        0.98171         0.9362         0.7815
Prevalence            0.03515         0.4370         0.5279
Detection Rate        0.01742         0.4076         0.4032
Detection Prevalence  0.03083         0.5401         0.4291
