Study 2 Investigating the Web Page Classification Problem

1.2 Overview of Dissertation

1.2.2 Study 2 Investigating the Web Page Classification Problem

Overview of Traffic Classification: One common technique for the analysis of anonymized TCP/IP headers (or other coarse information, such as NetFlow logs) is traffic classification. Traffic classification research has in the past focused primarily on two different types of classification problems: application layer protocol classification and web page identification.

1. Application layer protocol classification: Most traffic classification work using anonymized TCP/IP headers and flow-level data has focused on application protocol classification. Application categories that have been of particular interest in the past include HTTP, FTP, P2P, video streaming, and mail [Lim et al., 2010, Kim et al., 2008, Erman et al., 2007b]. While distinguishing between some types of applications can be useful, determining whether an application is Web (i.e., HTTP) or not is less informative; the Web itself includes a wide variety of diverse applications, including video streaming, gaming, social networking, mail, and file hosting. This diversity in web applications is only expected to grow as the Web becomes the standard front-end for emerging services and as existing services continue to migrate to the Web [Labovitz et al., 2010]. In the near future, there will be little utility in classifying something as Web traffic, since most observed traffic will be Web traffic. Thus, it is important to ask whether we can classify HTTP traffic using more fine-grained labels.

2. Web page identification: The security and privacy research community has focused on another type of traffic classification problem, here referred to as web page identification. Prior studies have shown that it is possible to identify the exact web page that the traffic corresponds to with flow-level and/or TCP/IP packet-level data, using existing techniques such as similarity metrics (e.g., the Jaccard index) or learning-based classification (e.g., Naive Bayes) [Dyer et al., 2012, Herrmann et al., 2009]. Web page identification is thus possible, but “in the wild” it is useful for little besides showing that traffic analysis attacks are a threat, because it does not scale well with increasing numbers of web pages [Dyer et al., 2012]. There are too many web pages (on the order of millions) to measure, fingerprint, store, and reliably identify. Web page identification is therefore typically used only when trying to identify a small set of targeted web pages.
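To make the similarity-metric approach concrete, the following is a minimal sketch of Jaccard-based web page identification. It is an illustration only, not the method of the cited studies: the choice of feature (a set of observed object sizes in bytes), the function names `jaccard` and `identify`, and the fingerprint values are all assumptions made for this example.

```python
def jaccard(a, b):
    """Jaccard index |A & B| / |A | B| of two sets (1.0 when both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def identify(trace, fingerprints):
    """Return the label of the stored fingerprint most similar to the trace."""
    return max(fingerprints, key=lambda page: jaccard(trace, fingerprints[page]))


# Hypothetical fingerprints: one set of object sizes (bytes) per known page.
fingerprints = {
    "page_a": {512, 1460, 2920, 8760},
    "page_b": {300, 1460, 4096},
}

# Hypothetical observed trace, reduced to the same kind of feature set.
observed = {512, 1460, 2920, 9000}

best = identify(observed, fingerprints)
score_a = jaccard(observed, fingerprints["page_a"])  # 3 shared / 5 total
```

Every page to be identified needs its own stored fingerprint, and each observed trace is compared against all of them, which illustrates why this approach becomes impractical as the number of candidate pages grows.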

Application layer protocol classification is too coarse a classification for modern traffic traces, and web page identification is too fine-grained. We propose investigating “web page classification,” which we believe can serve as a middle ground between these two traffic classification frameworks.

Problem Addressed: Web Page Classification. Web page classification gives traffic a label more fine-grained than just “web traffic,” but avoids the scaling problems of web page identification by allowing multiple web pages to share the same label, which reduces the number of classes of web pages that must be characterized and labeled. Labels for web applications are already commonly used for traffic trace analysis when available [Xu et al., 2011, Rao et al., 2011, Schneider et al., 2008, Butkiewicz et al., 2011]. The added information provided by labels on traffic gives additional insight into how the network is being used without compromising user privacy. The classification labels used depend on the intended purpose of the classification. Some specific examples of web page classification for different applications are provided below.

• Profiling video streaming (application type) usage: Video streaming is now reported to occupy nearly 50% of network bandwidth, and consumption is expected to grow [Sandvine, Bump]. The ability to distinguish between bandwidth-hungry video and non-video streams at critical traffic aggregation points can facilitate better network planning and control. For instance, a campus network manager may be able to prevent network abuse and/or rate-limit video streams destined for student dorms; researchers may want to build profiles of enterprise video traffic to facilitate traffic modeling and forecasting studies; and ISPs may want to limit resources per business interests [Brodkin].

• Profiling mobile device usage: By 2017, it is estimated that the average number of devices per Internet user will grow to 5 [Bort], most of them mobile devices. The ability to identify downloads of web pages targeted for mobile devices can help to build profiles of mobile web usage within an enterprise (for capacity planning, modeling, and forecasting purposes) and can be used to deliver personalized content and advertisements customized for the constrained displays, power, and connectivity of mobile devices.

• Profiling web browsing navigation styles: The way users navigate through web pages can be classified by what they access: a landing page (homepage), clickable content (non-landing pages), or a search result. This kind of navigation-based classification can be useful for identifying network misuse, such as web crawlers being misused for the purposes of web page scraping [Jacob et al., 2012]. Recent studies have shown that malicious bots can be identified by the pattern of web page navigation from a given end-point [Tegeler et al., 2012, Wang et al., 2013].

• Profiling the content type of a web page: The content of web pages can typically be categorized into genres: Games, Shopping, News, Education, Business, etc. [Inc., Xu et al., 2011]. Knowing the genre of web pages downloaded by a given user may be used to gauge user interest, which is invaluable for delivering personalized content and targeted advertisements [Yan et al., 2009]. Service providers currently rely on deep-packet inspection to assess what content consumers are interested in [Corporation]. This classification will also be useful for some types of measurement studies, which perform content-based analysis of web traffic to better understand network use [Xu et al., 2011, Butkiewicz et al., 2011].

Basic Approach: The basic approach used for this analysis is provided below.

• Diverse sample: As in the previous study, we collect a sample of web pages from the top 250 web sites in the world.

• Ground-truth labels using multiple labeling schemes: For classification methods to work, each web page must be labeled. We use four orthogonal labeling schemes: 1) Target device-based; 2) Video streaming-based; 3) Navigation-based; and 4) Genre-based. As previously noted, the labeling scheme used directly impacts the applicability of web page classification.

• Multiple client platforms: We apply the results from our previous study to this study’s data collection and feature selection methodology. We explicitly consider traffic features that are relatively stable over time and consistent across client platforms, because selecting robust traffic features is likely to improve web page classification performance.

• Evaluate classifications: We empirically evaluate web page classification performance while using both parametric and non-parametric methods. We do this performance evaluation for each of the labeling schemes considered.

• Investigate the applicability of web page classification: Lastly, we study whether web page classification methods can be used for different web-related application domains. We conduct two application-specific case studies, one investigating web page classification’s usefulness for applications in traffic forecasting and simulation modeling, and the other examining whether web page classification can be used to build and approximate user browsing profiles to gauge user interest.
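The evaluation step above can be illustrated with a small sketch of a non-parametric classifier (K-Nearest Neighbors) scored with precision, recall, and F-score. This is a toy example under stated assumptions, not the study's pipeline: the two features (megabytes transferred and number of TCP connections), the training and test values, the video/non-video labels, and the helper names `knn_predict` and `precision_recall_f1` are hypothetical. Features are left unscaled for brevity; a real evaluation would normally normalize them first.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Majority label among the k training points closest to query (Euclidean)."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F-score for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical labeled downloads: ((megabytes, tcp_connections), label).
train = [
    ((45.0, 4), "video"), ((60.0, 6), "video"), ((38.0, 5), "video"),
    ((1.2, 12), "non-video"), ((0.8, 20), "non-video"), ((2.5, 15), "non-video"),
]
test = [
    ((50.0, 5), "video"), ((1.0, 18), "non-video"),
    ((3.0, 10), "non-video"), ((30.0, 8), "video"),
]

y_true = [label for _, label in test]
y_pred = [knn_predict(train, features) for features, _ in test]
precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive="video")
```

The same scoring function can be reused for each labeling scheme by changing the label set and the class treated as positive; a parametric method would replace only `knn_predict`.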

Summary of Results: The primary results and contributions of the study are given below.

• As previously mentioned, we classify web page download traffic according to four orthogonal labeling schemes, which are not usually considered in the traffic classification literature [Lim et al., 2010, Kim et al., 2008, Erman et al., 2007b, Dyer et al., 2012, Herrmann et al., 2009, Miller et al., 2014, Schatzmann et al., 2010].

• We find that non-parametric methods, such as K-Nearest Neighbors (KNN), outperform parametric methods, such as Linear Discriminant Analysis (LDA), on numerous metrics, including F-score, precision, recall, and accuracy. The performance difference between these methods ranges from 8% to 50% for all major metrics. We believe this is because theoretical distributions are not able to approximate the empirical distributions of the different traffic features. Prior traffic classification studies that use supervised machine-learning techniques also found that KNN and classification trees outperform other classification methods [Lim et al., 2010, Kim et al., 2008].

• Features that are identified as being stable over time and consistent across client platforms achieve higher classification performance (up to an increase of 12% in F-score). This analysis of whether different traffic features are effective across client platforms has not been explicitly considered in past work [Lim et al., 2010, Kim et al., 2008, Erman et al., 2007b, Dyer et al., 2012, Herrmann et al., 2009, Miller et al., 2014]. The study by Yen et al. [2009] is, to our knowledge, the only work in traffic classification that considers that traffic features may differ across client platforms. However, Yen et al. [2009] address this problem by filtering web page traffic according to browser before applying a web page identification technique. This browser-based filtering improves the performance of the classification technique, showing that considering client platform-specific differences in web traffic may improve classification performance.

• We find that the distributions of well-separated web page download traffic classified using our approach are statistically indistinguishable (p > 0.05) from distributions derived using ground-truth labels. This result indicates that web page classification can be used for web-related application domains such as traffic modeling and user profiling. While prior work in traffic classification states that the results of classification can be used for traffic modeling, it does not provide results from a simulation study or conduct a statistical analysis to demonstrate this [Lim et al., 2010, Kim et al., 2008, Erman et al., 2007b].
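As an illustration of what “statistically indistinguishable (p > 0.05)” can mean in practice, the sketch below compares two samples with a two-sample Kolmogorov-Smirnov test. The text above does not name the test that was used, so the choice of test is an assumption made here, as are the function name `ks_two_sample` and the sample values. The p-value uses the standard asymptotic approximation, which is rough for small samples.

```python
import math
from bisect import bisect_right


def ks_two_sample(x, y):
    """Two-sample Kolmogorov-Smirnov statistic and asymptotic p-value."""
    x, y = sorted(x), sorted(y)
    n, m = len(x), len(y)
    # Largest gap between the two empirical CDFs, checked at every sample point.
    d = max(abs(bisect_right(x, v) / n - bisect_right(y, v) / m) for v in x + y)
    if d == 0.0:
        return d, 1.0
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    # Truncated alternating series for the Kolmogorov distribution tail.
    series = sum((-1) ** (j - 1) * math.exp(-2.0 * (j * lam) ** 2)
                 for j in range(1, 101))
    return d, min(max(2.0 * series, 0.0), 1.0)


# Hypothetical feature values (e.g., bytes per download, arbitrary units).
ground_truth = [float(v) for v in range(100)]      # ground-truth labels
classified = [v + 0.5 for v in range(100)]         # close to ground truth
shifted = [v + 50.0 for v in range(100)]           # clearly different

d_close, p_close = ks_two_sample(ground_truth, classified)
d_far, p_far = ks_two_sample(ground_truth, shifted)

# p > 0.05: no evidence that the two samples come from different distributions.
indistinguishable = p_close > 0.05
```

A p-value above 0.05 means the test fails to reject the hypothesis that both samples share a distribution; it does not prove that the distributions are identical.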
