Sensitivity Analysis - Web Page Classification Performance Evaluation

4.3 Web Page Classification Performance Evaluation

4.3.4 Sensitivity Analysis

Importance of Feature Stability. We next study whether the features selected in Section 4.2.1 actually outperform features that are less robust. Specifically, we compare classification accuracy when the most unstable (over time) features are selected from each of the 10 feature groups in Section 4.2.1, instead of the most stable ones; the most unstable features correspond to the last feature listed in each group in Appendix 9. Recall that all features in each group are fairly informative (for classification) and are highly correlated with each other. These results are summarized in Table 4.7, which shows that the micro F-score obtained when using unstable features can be more than .10 lower than when using stable features (.6140 versus .7380 for AGL). Thus, we conclude that it is important not just to include informative features for classification (as most prior work on traffic classification does), but also to consider the stability of features.

TABLE 4.6: Precision and Recall (KNN - City Block Distance, Stable Tcpdump Features)

Labeling Scheme       Class Names          Precision   Recall
Video Streaming       Video                .9913       .9834
Video Streaming       Non-video            .9984       .9992
Targeted Device       Traditional          .9342       .7933
Targeted Device       Mobile optimized     .8200       .7993
Web page Navigation   Clickable content    .8440       .8084
Web page Navigation   Search result        .7992       .8211
Web page Navigation   Landing              .8579       .9054
Alexa Genres          Computers            .7710       .7750
Alexa Genres          Business             .7690       .7939
Alexa Genres          Shopping             .5592       .6788
Alexa Genres          News                 .7214       .7283
Alexa Genres          Games                .9089       .8335
Alexa Genres          Adult                .9051       .8373
Alexa Genres          Arts                 .6914       .7088
Alexa Genres          Health               .5355       .5541
Alexa Genres          Home                 .6891       .7235
Alexa Genres          Kids and Teens       .5467       .5054
Alexa Genres          Recreation           .6855       .5459
Alexa Genres          Reference            .5785       .6395
Alexa Genres          Regional             .6714       .6295
Alexa Genres          Science              .7526       .7409
Alexa Genres          Society              .5509       .6799
Alexa Genres          Sports               .5712       .5121
Alexa Genres          World                .6864       .6370

[11] Please refer to Section 4.1 for our definition of a mobile optimized page.

TABLE 4.7: KNN - City Block Distance Classification Performance for Different Features and Training Data Sets: Micro F-score

Features Used                         VSL     TDL     AGL     WNL
Stable Tcpdump features               .9977   .9100   .7380   .8355
Unstable Tcpdump features             .9969   .8410   .6140   .7680
Stable Netflow features               .9945   .8816   .6782   .7805
Stable Tcpdump: different browsers    .9837   .8468   .5690   .7280
Stable Tcpdump: different time        .9940   .8920   .7056   .8051
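For readers less familiar with the metric reported in Table 4.7, the following is a minimal sketch of a micro-averaged F-score, assuming the standard definition (true positives, false positives, and false negatives are pooled over all classes before precision and recall are computed). The function name and the toy labels are illustrative and are not taken from this work.

```python
from collections import Counter


def micro_f_score(y_true, y_pred):
    """Micro-averaged F-score: pool TP, FP, and FN over all classes first."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1  # the predicted class gains a false positive
            fn[truth] += 1  # the true class gains a false negative
    total_tp = sum(tp.values())
    total_fp = sum(fp.values())
    total_fn = sum(fn.values())
    if total_tp == 0:
        return 0.0
    precision = total_tp / (total_tp + total_fp)
    recall = total_tp / (total_tp + total_fn)
    return 2 * precision * recall / (precision + recall)


# Toy example with made-up labels (not data from this study).
truth = ["video", "video", "non-video", "non-video"]
guess = ["video", "non-video", "non-video", "non-video"]
score = micro_f_score(truth, guess)
```

When every sample carries exactly one label, each misclassification counts once as a false positive and once as a false negative, so the micro F-score coincides with plain accuracy.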

Classification with NetFlow-based Features. Our results above are obtained with classification based on fine-grained features derived from per-packet TCP/IP headers. Sometimes, access to such packet traces may be infeasible or costly. We next ask: what accuracy can be achieved if only coarse-grained features that are obtainable from NetFlow logs are used for classification? For this, we consider those (stable) features from each group that can be derived from NetFlow logs. For instance, instead of the maximum number of PUSH segments sent by the client (Group 2), we include the maximum number of bytes sent by the client per TCP connection. None of the features in Groups 6 and 7 qualify, though.
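As an illustration of how such a coarse-grained feature might be derived, the sketch below computes the maximum number of bytes sent by the client per TCP connection from a list of flow records. The record layout (the keys src_ip, src_port, dst_ip, dst_port, and bytes) and the function name are hypothetical, and this is not the extraction code used in this work. It also assumes that one connection may be split across several flow records, which are therefore summed per address and port pair.

```python
from collections import defaultdict


def max_client_bytes_per_connection(flows, client_ip):
    """Largest client-to-server byte count over all TCP connections.

    flows: iterable of dicts with hypothetical keys
           src_ip, src_port, dst_ip, dst_port, bytes.
    """
    per_connection = defaultdict(int)
    for flow in flows:
        if flow["src_ip"] != client_ip:
            continue  # keep only the client-to-server direction
        key = (flow["src_ip"], flow["src_port"], flow["dst_ip"], flow["dst_port"])
        per_connection[key] += flow["bytes"]  # merge records of one connection
    return max(per_connection.values(), default=0)


# Toy flow records (made-up addresses and byte counts).
example_flows = [
    {"src_ip": "10.0.0.1", "src_port": 50000, "dst_ip": "192.0.2.7", "dst_port": 80, "bytes": 500},
    {"src_ip": "10.0.0.1", "src_port": 50000, "dst_ip": "192.0.2.7", "dst_port": 80, "bytes": 300},
    {"src_ip": "10.0.0.1", "src_port": 50001, "dst_ip": "192.0.2.7", "dst_port": 80, "bytes": 1200},
    {"src_ip": "192.0.2.7", "src_port": 80, "dst_ip": "10.0.0.1", "dst_port": 50000, "bytes": 9000},
]
feature_value = max_client_bytes_per_connection(example_flows, "10.0.0.1")
```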

Table 4.7 shows that while video streams can still be identified with high accuracy, NetFlow-derived features lower the micro F-score by up to .06 for the other labeling schemes. It is important to note that the performance with even coarse-grained NetFlow features is better than with unstable tcpdump features for these schemes; this further underscores the importance of considering stability in selecting fine-grained features.
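The comparisons above can be checked directly against Table 4.7. The short sketch below transcribes three rows of that table and computes, for each labeling scheme, the drop in micro F-score relative to the stable tcpdump features. The dictionary and function names are illustrative only.

```python
# Micro F-scores transcribed from Table 4.7 (columns: VSL, TDL, AGL, WNL).
SCHEMES = ("VSL", "TDL", "AGL", "WNL")
TABLE_4_7 = {
    "Stable Tcpdump features": (0.9977, 0.9100, 0.7380, 0.8355),
    "Unstable Tcpdump features": (0.9969, 0.8410, 0.6140, 0.7680),
    "Stable Netflow features": (0.9945, 0.8816, 0.6782, 0.7805),
}


def drop_from_stable(row_name):
    """Per-scheme decrease in micro F-score relative to the stable tcpdump row."""
    baseline = TABLE_4_7["Stable Tcpdump features"]
    row = TABLE_4_7[row_name]
    return {s: round(b - r, 4) for s, b, r in zip(SCHEMES, baseline, row)}


unstable_drop = drop_from_stable("Unstable Tcpdump features")
netflow_drop = drop_from_stable("Stable Netflow features")

# Schemes for which NetFlow features beat unstable tcpdump features.
netflow_wins = [s for s in SCHEMES if netflow_drop[s] < unstable_drop[s]]
```

Under this reading of the table, the largest drops occur for AGL in both cases, and the NetFlow row scores higher than the unstable tcpdump row for TDL, AGL, and WNL, while the two are nearly identical for VSL.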

Sensitivity to Time and Browser. Our dataset includes 6 repeated downloads of each web page, using 5 different browsers for each. While we have explicitly identified features that are the most robust across time and browsers, it is important to understand the impact of training on one portion of a dataset and testing on another. We first consider the impact of time on classification performance, controlling for browser. This is done by training our classifier using data obtained at one point in time, say the first measurement taken for each web page (which includes measurements across all browsers), and testing on data obtained at a later time, say the first repeated measurement.[12] We repeat this process by training on the data obtained in each repeated measurement and testing on all others (6 times in total, once for each repeated measurement), and we average the results over all train and test runs. Table 4.7 shows that, on average, this hardly impacts classification performance. This result is promising, because it implies that classifiers do not have to be retrained every day. In fact, our dataset includes measurements spaced out over a period of nearly 20 weeks.

We next consider the impact of browser on classification performance, controlling for time. This is done by training our classifier using data obtained with a single browser, say Firefox, and testing on data obtained with all other browsers. We repeat this process by training on the data obtained with each browser and testing on all other browsers (5 times in total, once for each browser), and we average the results over all train and test runs. Table 4.7 shows that while video streams can still be identified at nearly the same rate, the micro F-scores for the targeted-device and navigation labels drop by about .06-.10. The most significant impact, however, is on the genre-based labels, which have a micro F-score of .57 as compared to .74. These results imply that traffic classification performance is much more browser-dependent than time-dependent; our analysis of repeatability and consistency of traffic features in Section 4.2.1 supports this observation. We conclude that it is important to train models on data that is representative of browser mixes found in real-world traces.
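The train-on-one, test-on-the-rest procedure used for both the time and the browser experiments can be sketched as follows. This is a minimal illustration, not the code used in this work: the sample layout, the function names, the toy data, and the choice of k are all assumptions (the number of neighbors is not stated in this section). The classifier is a plain k-nearest-neighbor vote under the city block distance.

```python
from collections import Counter


def city_block(a, b):
    """City block (Manhattan) distance between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(train, query, k):
    """Majority vote among the k training samples nearest to query."""
    nearest = sorted(train, key=lambda s: city_block(s[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def train_on_one_group_test_on_rest(samples, group_key, k=1):
    """Train on each group in turn, test on all others, and average the scores.

    samples: list of dicts with 'features', 'label', and group_key
             (for example 'browser' or 'round'); a hypothetical layout.
    With one label per sample, accuracy equals the micro F-score.
    """
    scores = []
    for group in sorted({s[group_key] for s in samples}):
        train = [(s["features"], s["label"]) for s in samples if s[group_key] == group]
        test = [s for s in samples if s[group_key] != group]
        correct = sum(knn_predict(train, s["features"], k) == s["label"] for s in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)


# Toy data with made-up feature values.
toy_samples = [
    {"features": (1.0, 1.0), "label": "video", "browser": "firefox"},
    {"features": (9.0, 9.0), "label": "non-video", "browser": "firefox"},
    {"features": (1.5, 0.5), "label": "video", "browser": "chrome"},
    {"features": (8.0, 9.5), "label": "non-video", "browser": "chrome"},
]
average_score = train_on_one_group_test_on_rest(toy_samples, "browser", k=1)
```

Passing a different group_key, such as a hypothetical 'round' field identifying the repeated measurement, would give the time-based variant of the same experiment.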

Miscellaneous Comments on Web Page Classification Results. Traffic classification studies from recent literature report classification accuracies exceeding 94% [Kim et al., 2008, Schatzmann et al., 2010]. In comparing our results from this section to those, it is important to keep several fundamental differences in perspective:

• Our classification framework is subject to the web page design decisions of developers. Standards do not exist that ensure that web pages of a particular category yield similar traffic. Nevertheless, we find that many web sites that host similar content tend to follow similar web page design trends. For example, mobile web sites tend to design their pages to be more resource conscious than traditional web sites.[13] Modern web sites also use web page templates and content management systems. Thus, many of their corresponding web pages follow a predefined structure that can be observed in traffic. We stress the need to strategically sample web sites that are likely to be included in a real data set; we focus on popular web sites in this work. It is also necessary to keep the training data set up to date, since web pages evolve over time.

[12] Recall, our data includes 6 repeated measurements.

• The Alexa genre classification may be particularly noisy. Web page designers and third-party analytics services have the freedom to arbitrarily assign labels to web pages. Even when a web page serves a particular purpose, its label may be unclear. For example, what type of web site is the gaming review web site www.ign.com: a news web site, an entertainment web site, or a game web site? Should social networking sites be considered news sites? These factors dramatically increase the variance and noise in each of the class labels.

• There is also legitimate room for improvement in performance by incorporating prior information (about how often a particular class appears “in the wild”) into the classification models. We elect not to use prior information because our dataset is synthetic; any prior information would give an inappropriate and non-representative increase in performance. A real-world dataset collected “in the wild” would be able to benefit from such information.

• Our feature selection methodology identified a few temporal traffic features that were particularly informative for the video labeling scheme. However, our analysis in Chapter 3 shows that temporal traffic features are affected by time and significantly affected by vantage point. Thus, we recommend training web page classification methods using data collected from the vantage points at which they will be used.
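To make the earlier point about prior information concrete, the sketch below shows one common way class priors could be folded into a classifier's per-class scores: each score is rescaled by the ratio of the deployment prior to the training prior and the result is renormalized. This is a generic prior-shift reweighting offered purely as an illustration. It is not used in this work, and the function name, class names, and numbers are hypothetical.

```python
def apply_class_priors(scores, train_priors, deploy_priors):
    """Rescale per-class scores by deploy_prior / train_prior, then renormalize.

    scores: dict mapping class name to a score in [0, 1], for example the
            fraction of nearest-neighbor votes each class received.
    """
    weighted = {
        cls: scores[cls] * deploy_priors[cls] / train_priors[cls]
        for cls in scores
    }
    total = sum(weighted.values())
    if total == 0:
        return dict(scores)  # nothing to renormalize; leave scores unchanged
    return {cls: value / total for cls, value in weighted.items()}


# Toy example: a balanced training set, but "Games" is assumed to be four
# times as common as "News" where the classifier is deployed (made-up values).
votes = {"News": 0.6, "Games": 0.4}
balanced = {"News": 0.5, "Games": 0.5}
in_the_wild = {"News": 0.2, "Games": 0.8}
adjusted = apply_class_priors(votes, balanced, in_the_wild)
predicted = max(adjusted, key=adjusted.get)
```

In this toy case the prior reverses the decision from "News" to "Games", which also illustrates why priors estimated from a synthetic dataset would distort the reported performance.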
