4.4 Applicability of Web Page Classification
4.4.1 Approximating Labeled Traffic Distributions
We want to study if our classification results are useful in extracting the true distributions of features within a given class — that is, for a given class, do the feature distributions observed across classified web pages match those observed across ground-truth labels? If they do, then our classification methodology can be used to derive ground-truth feature distributions from real traces — which can then be used for traffic modeling and simulation studies. They can also be used to monitor and understand the general usage profile of a user population.
To study this issue, we first divide traffic into two groups according to (i) the ground-truth labels and (ii) the classified labels. We then compare the distributions of features obtained from these two groups of traffic. We first present our analysis using two hypothesis tests (and later visually). The first, the Wilcoxon rank-sum test, tests the hypothesis that the medians of the two distributions in question are the same. We use the Wilcoxon rank-sum test because it does not rely on strong distributional assumptions such as normality. The second, the Kolmogorov-Smirnov test, tests whether the two empirical distributions are the same. Recall that a p-value larger than 0.05 means the test does not reject the null hypothesis at the 5% significance level; we treat such a result as consistent with the two groups sharing the same median or distribution.
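The two tests described above can be illustrated with a minimal sketch. It is not the code used to produce Table 4.8: the function names, the sample values, and the use of large-sample approximations (a normal approximation without tie correction for the rank-sum test, and the asymptotic Kolmogorov series for the Kolmogorov-Smirnov test) are assumptions made for illustration. In practice a statistics library would normally be used, for example `scipy.stats.ranksums` and `scipy.stats.ks_2samp`.

```python
import math
from bisect import bisect_right


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test via the normal approximation.

    Tied values receive their average rank; the variance is not
    corrected for ties (a simplification for this sketch).
    Returns (z, p_value).
    """
    n1, n2 = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    mean = n1 * (n1 + n2 + 1) / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (rank_sum_x - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


def ks_two_sample(x, y):
    """Two-sample Kolmogorov-Smirnov test with an asymptotic p-value.

    Returns (d, p_value), where d is the largest gap between the two
    empirical cumulative distribution functions.
    """
    xs, ys = sorted(x), sorted(y)
    n1, n2 = len(xs), len(ys)
    d = 0.0
    for v in xs + ys:
        gap = abs(bisect_right(xs, v) / n1 - bisect_right(ys, v) / n2)
        d = max(d, gap)
    lam = math.sqrt(n1 * n2 / (n1 + n2)) * d
    if lam < 0.2:
        # The series below converges slowly here; the p-value is ~1.
        return d, 1.0
    series = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                 for k in range(1, 101))
    return d, min(max(2.0 * series, 0.0), 1.0)


# Hypothetical feature values (e.g., TCP connections per page) for one class.
ground_truth = list(range(100))
classified_close = [v + 0.5 for v in range(100)]    # nearly identical
classified_shifted = [v + 50 for v in range(100)]   # clearly shifted

for name, sample in [("close", classified_close), ("shifted", classified_shifted)]:
    _, p_rank = rank_sum_test(ground_truth, sample)
    _, p_ks = ks_two_sample(ground_truth, sample)
    print(name, "rank-sum p =", p_rank, "KS p =", p_ks)
```

With the 0.05 threshold used in this section, the nearly identical sample is not rejected by either test, while the shifted sample is rejected by both.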
Table 4.8 shows the p-values for two traffic features, the number of TCP connections and the number of bytes transmitted, when the traffic is classified using either KNN or LDA. We show these two features because they are important when modeling and simulating TCP/IP traffic [Weigle et al., 2006, Barford and Crovella, 1998]; results for other features, such as the number of segments and objects (epochs) transferred, are provided in Appendix 11. We also show results only for KNN and LDA in Table 4.8 because they were the best and worst performing classification methods, respectively; Appendix 11 provides results for all classification methods considered in this chapter. We find that:
• With the KNN classifier, the Wilcoxon rank-sum and Kolmogorov-Smirnov tests yield p-values indicating that both the medians and the empirical distributions of these two features are the same whether classes are identified using classified labels or ground-truth labels. More importantly, the KNN p-values are larger than 0.05 for all classes for each feature shown; this is true even for the AGL labeling scheme (results shown only for the 4 most common genres).
• This result does not hold for all classification methods. In fact, LDA usually yields p-values that favor the alternative hypothesis, in which the distributions of the classified traffic differ from the ground truth; these p-values are generally much lower than 10^-10. There are exceptions, where the p-values for LDA favor the null hypothesis for some classes, but this is not the case for the overwhelming majority of classes in each labeling scheme.
We have also analyzed other features that are relevant for traffic generation and simulation modeling, including the number of servers contacted, the number of objects transferred, and the number of segments transferred, and arrived at the same statistical conclusions. A list of the p-values for these features, for each labeling scheme and classification method considered in this chapter, is provided in Appendix 11.
Figures 4.7 and 4.8 plot the cumulative distributions of the number of TCP connections feature, as yielded by the KNN and LDA classification methods; these plots give a visual representation of the results in Table 4.8. Figure 4.7 shows the results for the navigation-based labeling scheme. We observe that in most cases, KNN is able to classify web pages into classes that closely match the distribution of the ground truth dataset. Consistent with the hypothesis testing results, LDA is not able to achieve this consistently. In particular, LDA exaggerates the number of TCP connections required for the search result and clickable content pages. These results are clearer in Figure 4.8, which shows distributions corresponding to the targeted device-based labels. LDA essentially classifies the data such that it maximizes the separation between the two classes; this behavior is similar to that of a clustering method [Erman et al., 2006]. This explains the large shift in the number of TCP connections for the traditional web page class. This shift is unrealistic and does not align with the ground truth distribution. The nonparametric KNN method does not have this problem.

[Figure 4.8: Distribution of the number of TCP connections across 2 target device-based classes. Panels: (a) Mobile Optimized, (b) Traditional. x-axis: Number of TCP Connections (0 to 200); y-axis: Fraction of Web Pages (0 to 1); curves: LDA, KNN, GT.]
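The visual comparison in Figures 4.7 and 4.8 amounts to computing, for each class, one empirical cumulative distribution from the pages grouped by ground-truth label and one from the pages grouped by classified label. The following is a minimal sketch of that computation, not the plotting code behind the figures; the record layout, the label names, the connection counts, and the helper names `group_by_label` and `ecdf` are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def group_by_label(pages, label_index):
    """Collect each page's feature value (field 2) under the chosen label field."""
    groups = defaultdict(list)
    for page in pages:
        groups[page[label_index]].append(page[2])
    return groups


def ecdf(values, grid):
    """Fraction of values that are <= g, for each grid point g."""
    ordered = sorted(values)
    return [bisect_right(ordered, g) / len(ordered) for g in grid]


# Hypothetical records: (ground-truth label, classified label, TCP connections).
pages = [
    ("mobile", "mobile", 4),
    ("mobile", "mobile", 6),
    ("mobile", "traditional", 9),
    ("traditional", "traditional", 20),
    ("traditional", "traditional", 35),
    ("traditional", "mobile", 12),
]
grid = [0, 5, 10, 20, 50]  # x-axis points: number of TCP connections

by_truth = group_by_label(pages, 0)       # groups under ground-truth labels
by_classified = group_by_label(pages, 1)  # groups under classified labels

# Largest vertical gap between the two curves of each class on the grid.
gaps = {}
for label in by_truth:
    truth_curve = ecdf(by_truth[label], grid)
    classified_curve = ecdf(by_classified[label], grid)
    gaps[label] = max(abs(a - b) for a, b in zip(truth_curve, classified_curve))
    print(label, truth_curve, classified_curve, gaps[label])
```

A small gap for every class corresponds to the behavior reported for KNN, where the classified curve tracks the ground-truth curve; a large gap corresponds to the shift reported for LDA.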
We conclude that classification methods that perform well, such as KNN, can be used to extract true distributions of traffic features within classes (matching the trained distribution), while other methods, such as LDA, have difficulty doing so.