
4.3 Web Page Classification Performance Evaluation

4.3.3 Classification Results

In this section, we present the results of our performance evaluation in three steps. First, we use the micro F-score and macro F-score metrics to determine which classification method performs best. Second, we compare the performance of the best-performing classification method across the different labeling schemes to determine how the choice of labeling scheme affects classification performance. Finally, we use the precision and recall metrics to analyze the performance of the best-performing classification method in more detail.
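The difference between the two F-score aggregates used in the first step can be illustrated with a minimal sketch. It assumes single-label predictions and the common pooled-count definition of the micro F-score; the function name `f_scores` and the toy labels "video" and "other" are hypothetical and are not taken from the evaluation itself.

```python
from collections import Counter


def f_scores(y_true, y_pred):
    """Return (micro_f, macro_f) for single-label, multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # the predicted label gains a false positive
            fn[t] += 1  # the true label gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool the counts over all labels, then compute one F-score.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute one F-score per label, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Hypothetical, imbalanced toy data: one of two "video" pages is missed.
y_true = ["video"] * 2 + ["other"] * 8
y_pred = ["other", "video"] + ["other"] * 8
micro, macro = f_scores(y_true, y_pred)
print(f"micro F = {micro:.4f}, macro F = {macro:.4f}")
```

In this toy case the single error costs the rare label far more than the frequent one, so the macro F-score falls well below the micro F-score. Under this particular definition, the micro F-score of a single-label problem equals plain accuracy; whether the evaluation computes it in exactly this way is not stated in this section.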

For the stable tcpdump-derived features, we find that:

• Table 4.3 summarizes the mean micro F-score for each labeling scheme and classification method tested. The micro F-scores of the non-parametric KNN and Classification Trees models are comparable (they usually differ by less than .08), where KNN with the City Block

8 Thus, the expected accuracy of random guessing is 1/L, where L is the number of classes for a given labeling scheme.

TABLE 4.3: Web Page Classification Performance: Micro F-score

Classification Model                  VSL     TDL     AGL     WNL

Stable Tcpdump features
K Nearest Neighbors (KNN)
  KNN - Euclidean (K=1)               .9969   .8931   .7195   .8137
  KNN - City Block/L1 distance (K=1)  .9976   .9020   .7371   .8298
  KNN - Cosine (K=1)                  .9938   .8921   .7188   .8131
  KNN - Correlation (K=1)             .9962   .8922   .7176   .8100
Classification Trees (CT)
  CT - Gini Diversity Index           .9957   .8526   .6009   .7441
  CT - Twoing Rule                    .9962   .8529   .6071   .7479
  CT - Deviance                       .9978   .8698   .5951   .7671
Naive Bayes (NB)
  NB - Normal                         .9182   .3626   .2212   .3216
  NB - Kernel: Normal                 .9804   .4418   .2018   .3808
  NB - Kernel: Triangle               .9727   .4678   .2106   .3835
  NB - Kernel: Box                    .9688   .4983   .2273   .3930
  NB - Kernel: Epanechnikov           .9728   .4735   .2077   .3844
  NB - Multinomial/Histogram          .9884   .7340   .3198   .4264
Linear Discriminant Analysis (LDA)
  LDA - Linear                        .9878   .5457   .4362   .4297
  LDA - Quadratic                     .9760   .3907   .3845   .3679
  LDA - Mahalanobis                   .9436   .6645   .1047   .4340
Random Guessing (RG)                  .5000   .5061   .0613   .3414

TABLE 4.4: Web Page Classification Performance: Macro F-score

Classification Model                  VSL     TDL     AGL     WNL

Stable Tcpdump features
K Nearest Neighbors (KNN)
  KNN - Euclidean (K=1)               .9908   .8637   .6624   .8210
  KNN - City Block/L1 distance (K=1)  .9928   .8679   .6785   .8367
  KNN - Cosine (K=1)                  .9893   .8625   .6614   .8131
  KNN - Correlation (K=1)             .9887   .8629   .6624   .8173
Classification Trees (CT)
  CT - Gini Diversity Index           .9877   .8100   .5040   .7468
  CT - Twoing Rule                    .9887   .8105   .4855   .7515
  CT - Deviance                       .9934   .8325   .4989   .7722
Naive Bayes (NB)
  NB - Normal                         .8465   .5823   .2788   .4719
  NB - Kernel: Normal                 .9477   .6278   .2789   .5325
  NB - Kernel: Triangle               .9280   .6365   .2859   .5311
  NB - Kernel: Box                    .9182   .6466   .2793   .5335
  NB - Kernel: Epanechnikov           .9267   .6396   .2817   .4607
  NB - Multinomial/Histogram          .9514   .8128   .3040   .5827
Linear Discriminant Analysis (LDA)
  LDA - Linear                        .9631   .6241   .0815   .4498
  LDA - Quadratic                     .9366   .5999   .1021   .4715

TABLE 4.5: Web Page Classification Performance: Accuracy

Classification Model                  VSL     TDL     AGL     WNL

Stable Tcpdump features
K Nearest Neighbors (KNN)
  KNN - Euclidean (K=1)               .9969   .8927   .7198   .8140
  KNN - City Block/L1 distance (K=1)  .9976   .8959   .7374   .8300
  KNN - Cosine (K=1)                  .9964   .8625   .7190   .8132
  KNN - Correlation (K=1)             .9962   .8919   .7177   .8102
Classification Trees (CT)
  CT - Gini Diversity Index           .9959   .8557   .6010   .7398
  CT - Twoing Rule                    .9960   .8557   .6050   .7477
  CT - Deviance                       .9975   .8662   .5993   .7703
Naive Bayes (NB)
  NB - Normal                         .9183   .3628   .2005   .3213
  NB - Kernel: Normal                 .9779   .4403   .2245   .3805
  NB - Kernel: Triangle               .9624   .4614   .2072   .3744
  NB - Kernel: Box                    .9557   .4872   .2245   .3855
  NB - Kernel: Epanechnikov           .9615   .4666   .2050   .3784
  NB - Multinomial/Histogram          .9788   .4581   .2340   .4014
Linear Discriminant Analysis (LDA)
  LDA - Linear                        .9877   .5460   .0432   .4296
  LDA - Quadratic                     .9760   .3902   .0384   .3684

distance metric and with K=1 performs the best for all labeling schemes. For the KNN methods, Table 4.3 shows only the best-performing K value; please refer to Appendix 10 for results with other values of K. The different distance functions for the non-parametric methods do not change the micro F-scores by more than .02. These non-parametric methods perform much better than the parametric Naive Bayes and LDA models in all cases; in fact, the Naive Bayes and LDA classifiers sometimes perform worse than a priori guessing (though all methods outperform random guessing in most cases). This is likely because the non-parametric methods do not rely on assumptions about the distribution of the features and can therefore handle arbitrary feature distributions. Parametric methods, by contrast, assume that the features follow specific theoretical distributions (e.g., the Normal distribution), which is typically not the case for network traffic [Lim et al., 2010, Barford and Crovella, 1998]. This result is also consistent with the observations made in [Lim et al., 2010].

While data transformation and other preprocessing techniques may help improve the performance of parametric machine learning methods, for example by transforming features so that they follow a theoretical distribution, we do not perform such an analysis in this work. Instead, we treat the different machine learning methods as black boxes and do not perform any extra preprocessing to make the features more suitable for a particular method (see footnote 9). As noted before, under this approach KNN is the best-performing method and Classification Trees is the second best. We obtained similar results with the macro F-score and accuracy metrics; these are shown in Table 4.4 and Table 4.5, respectively. Thus, for the rest of the analysis in this section, we use KNN as the classification method for evaluation.

• The performance of the best-performing method, KNN with the City Block distance, differs greatly across the labeling schemes. Table 4.3 shows that the VSL labeling scheme has the highest micro F-score at .9976; the TDL labeling scheme is second at .9020, the WNL labeling scheme is third at .8298, and the genre-based (AGL) labeling scheme is last at .7371. Table 4.4 and Table 4.5 show very similar results for the macro F-score and accuracy metrics; we do not discuss these metrics further since they do not contribute additional insight into the performance of KNN.

9 We do perform a simple standardization procedure to normalize our features (that is, each feature in a feature set is subtracted by

• With respect to the distance functions used by KNN, we find that the City Block distance function performs best and the Euclidean distance function second best, although the choice of distance function does not significantly affect the performance of KNN. We also find that, for the VSL labeling scheme, the best variant of each classification method achieves a micro F-score above .980 and the methods are fairly similar (within .018 of KNN). However, we observe much larger differences between the classification methods for the other labeling schemes. For instance, for the AGL labeling scheme, the micro F-score of the best classification method is .7371, while the micro F-score of the NB-Normal method is .2212 (Table 4.3). As described before, we believe that much of the performance difference between these methods is due to the distributional assumptions of the parametric methods.

It is also important to note that KNN performs better than classification trees even though both methods are non-parametric and both significantly outperform the parametric approaches. More specifically, the micro F-scores for KNN are over .05 higher than the micro F-scores for classification trees for the WNL and AGL labeling schemes. This is likely due to known issues with the optimization of classification tree models, which may produce locally optimal models instead of the preferred globally optimal models. This issue usually worsens as classification/decision trees become larger and/or the number of classes considered increases [Rokach and Maimon, 2014]. KNN does not have this issue [Mikolajczyk and Schmid, 2005].

• We next discuss the precision and recall metrics, shown in Table 4.6, for the KNN method with the City Block distance function. We do this to provide more detail on web page classification performance than the micro and macro F-scores alone offer. For the video streaming labeling scheme, the precision and recall are higher than .99 for all labels, except for the recall of the video label, which is .9834. This result shows that while the video streaming labeling scheme can be used to classify pages at a high rate, there is a slightly higher false negative rate for the video label itself; that is, the recall is smaller for the video label because the number of false negatives for this label is higher (see footnote 10). We believe this occurs because the classification method may, at times, confuse a video streaming page with other types of bandwidth-intensive web pages that are included in our dataset (such as audio streaming web pages).

For the targeted device labeling scheme, we find (i) that the precision and recall are above .79 and (ii) that the precision is always higher than the recall. We also find that the precision for the traditional web page label is over .10 higher than the precision for the mobile optimized web page label. This result shows that the traditional web page label has a much lower false positive rate than the mobile optimized web page label; that is, traditional web pages are more likely to be misclassified as mobile optimized web pages than the reverse. This is likely because many traditional web pages are efficient across all devices despite not being labeled as mobile optimized web pages (see footnote 11). This result also implies that there are more efficient traditional web pages than inefficient mobile web pages.

Observations for the navigation-based labeling scheme are fairly similar to those for the video streaming and targeted device labeling schemes, with precision and recall that tend to be higher than .80. In this case, the landing page label has higher precision and recall than the search result and clickable content labels, which shows that the landing page label can be classified slightly more effectively than the others. Overall, the precision and recall are fairly high (i.e., consistently above .79) and relatively consistent across the video streaming, targeted device, and navigation-based labeling schemes. This result shows that the KNN classification method is able to classify web pages according to these labeling schemes without being excessively biased towards particular classes.

10 Please refer to the definition of precision and recall.

Within the Alexa genre labeling scheme, some labels, particularly the games and adult labels, have precision and recall values above .80, while others, particularly the sports and health labels, have precision and recall values of approximately .55. Most of the other labels within the Alexa genre labeling scheme (e.g., business, computers, science, etc.) have precision and recall values between .68 and .80. These results imply that while each label can be classified at a rate higher than random guessing (i.e., precision and recall above .50), some web page categories can be classified much more effectively than others (i.e., precision and recall above .75).
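The KNN configuration and the per-label metrics discussed above can be illustrated with a minimal sketch: a K=1 nearest-neighbor classifier using the City Block (L1) distance, followed by per-label precision and recall. This is not the implementation used in the evaluation; the two-dimensional feature vectors, the labels "traditional" and "mobile", and all function names are hypothetical and chosen only for illustration.

```python
def city_block(a, b):
    """City Block (L1, Manhattan) distance between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def predict_1nn(train_x, train_y, query):
    """Return the label of the single nearest training vector (K=1)."""
    best = min(range(len(train_x)), key=lambda i: city_block(train_x[i], query))
    return train_y[best]


def precision_recall(y_true, y_pred, label):
    """Precision and recall for one label, treating all others as negative."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical toy data (assumed to be already standardized).
train_x = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (0.9, 1.2)]
train_y = ["traditional", "traditional", "mobile", "mobile"]
test_x = [(0.1, 0.0), (0.6, 0.7), (1.1, 1.0), (0.4, 0.3)]
test_y = ["traditional", "traditional", "mobile", "traditional"]

y_pred = [predict_1nn(train_x, train_y, q) for q in test_x]
for label in ("traditional", "mobile"):
    p, r = precision_recall(test_y, y_pred, label)
    print(f"{label}: precision={p:.2f} recall={r:.2f}")
```

In this toy run, the second test page is a traditional page that lies closer to the mobile examples, so it is predicted as mobile. That single error lowers the recall of the traditional label and the precision of the mobile label while leaving the other two values at 1.0, which is the kind of asymmetry that per-label precision and recall expose and an aggregate F-score hides.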
