1.2 Overview of Dissertation
1.2.3 Study 3: Comprehensively Evaluating the State of the Art in Web Page Segmentation
Overview of Web Page Segmentation: The problem of grouping individual TCP/IP segments into the web pages that they collectively represent is called web page segmentation. Web pages must be segmented to perform any type of page-level analysis on traffic trace data, including general-purpose web traffic characterization and web page classification. These applications are described below.
6 Supervised machine-learning methods are the current state of the art in traffic classification, outperforming unsupervised and heuristic methods. Please refer to Chapter 2 for a more complete overview of traffic classification methods.
• Web Page Characterization: Measurement and characterization studies of web traffic have been conducted for a number of different applications, including traffic forecasting and web usage modeling [Hernández-Campos et al., 2003a, Choi and Limb, 1999, Butkiewicz et al., 2011, Newton et al., 2013, Mah, 1997, Ihm and Pai, 2011, Barford and Crovella, 1998]. Many of these studies attempt web page segmentation, but do not make it their primary focus and do not evaluate it. It is therefore not clear whether their segmentation methods work well for modern web traffic [Newton et al., 2013, Ihm and Pai, 2011], or whether these studies’ conclusions were affected in unexpected ways by the segmentation methods they used.
• Web Page Classification: While there have been significant recent advances in the design of classification and fingerprinting approaches, the vast majority of analysis techniques designed over the past two decades assume the availability of perfectly segmented web page traffic [Miller et al., 2014]. Although many recognize the critical need to understand how well web page segmentation can be performed in practice [Miller et al., 2014], few studies have tested actual web page segmentation in an applied setting.
While web page segmentation methods have been used for numerous traffic analyses in the past [Hernández-Campos et al., 2003a, Choi and Limb, 1999, Butkiewicz et al., 2011, Newton et al., 2013, Mah, 1997, Ihm and Pai, 2011, Barford and Crovella, 1998], it is unclear whether existing techniques are effective for traffic analysis on modern web page traffic, which introduces new challenges that make segmentation more difficult. Some of these challenges are described below.
• Increase in Automatically Generated Traffic: Modern web pages may use technologies such as AJAX, which generate traffic without downloading a new page. For example, Facebook web pages load new content on the same page as the user scrolls, while YouTube web pages auto-play additional videos on the same page as the original video. This automatically generated traffic, which may be used to update the content of a web page, presents a problem for web page segmentation methods, which may treat it as a new page download.
• Increase in Overlapping Traffic: Modern browsers support multi-tab browsing, which allows multiple browser tabs to be open simultaneously. This is a problem for web page segmentation approaches, because increasing the number of tabs increases the amount of overlapping traffic [Miller et al., 2014]. Overlapping traffic (traffic that consists of multiple web pages) is harder to demultiplex, or segment, than traffic that includes only a single web page. This problem is further exacerbated because modern web pages can automatically generate traffic in the background while the user is not directly consuming the content. Environments in which multiple users share a single IP address (e.g., behind NATs) also increase the amount of overlapping traffic [Guha and Francis, 2005, Tsuchiya and Eng, 1993].
Problem Addressed in this Work: Evaluation of Web Page Segmentation Approaches. Previous literature has studied web page segmentation using TCP/IP headers as a time series analysis problem and has approached this problem using idle time-based approaches. These approaches estimate the beginning of a web page download by detecting whether the network activity level (e.g., the number of bytes observed) exceeds pre-defined thresholds after a certain amount of idle time [Maciá-Fernández et al., 2010, Newton et al., 2013, Ihm and Pai, 2011, Mah, 1997, Barford and Crovella, 1998]. Another class of segmentation approaches relies on change point detection, which identifies if and when a time series exhibits a substantial increase in network activity. Change point detection methods, such as fused lasso regression and hidden Markov models, have been applied in other fields, including computational biology and speech recognition [Rabiner, 1989, Tibshirani and Wang, 2008, Bleakley and Vert, 2011]. Neither idle time-based approaches nor change point detection approaches have been comprehensively evaluated on modern web page traffic, which is noisier and poses more difficult segmentation problems than older web traffic.
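The idle time-based idea described above can be sketched as follows. This is a minimal illustration, not the implementation evaluated in this work: the function name idle_time_segment, the use of fixed-width time bins, and the parameters idle_bins and activity_threshold are all assumptions introduced here for the example.

```python
from typing import List, Sequence


def idle_time_segment(
    activity: Sequence[int],
    idle_bins: int = 2,
    activity_threshold: int = 0,
) -> List[int]:
    """Return indices of time bins estimated to start a web page download.

    activity[i] is the activity level (e.g., number of SYNs or bytes) observed
    in time bin i. A bin is flagged as a page start when its activity exceeds
    activity_threshold and at least idle_bins consecutive bins immediately
    before it did not exceed the threshold. All names are hypothetical.
    """
    starts: List[int] = []
    idle_run = idle_bins  # assumption: the time before the trace counts as idle
    for i, level in enumerate(activity):
        if level > activity_threshold:
            if idle_run >= idle_bins:
                starts.append(i)
            idle_run = 0  # activity resets the idle counter
        else:
            idle_run += 1
    return starts


# Hypothetical per-second SYN counts for one client.
syn_counts = [5, 3, 0, 0, 0, 4, 1, 0, 2, 0, 0, 6]
print(idle_time_segment(syn_counts, idle_bins=2))  # [0, 5, 11]
```

In this toy trace, the burst at index 8 is not flagged because only one idle bin precedes it. This is the kind of closely spaced download that a fixed idle-time threshold can miss.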
Through a comprehensive empirical evaluation of idle time-based and change point detection approaches using both synthetic and real browsing data, we will determine whether web page segmentation approaches can be used to analyze modern web page traffic using only anonymized TCP/IP headers. A successful web page segmentation approach should be able to approximate some of the statistical properties of real user browsing behavior, including the number of web page downloads a user requests and the average inter-arrival time between these downloads, metrics that can be used to model user behavior. A successful web page segmentation approach can also be used to enable and/or facilitate web page classification “in the wild,” rather than in an isolated test environment.
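The two statistical properties named above can be computed from a list of estimated page-start times. This is a small sketch only: the function name browsing_stats and the input format (start times in seconds) are assumptions made for illustration.

```python
from statistics import mean
from typing import Optional, Sequence, Tuple


def browsing_stats(start_times: Sequence[float]) -> Tuple[int, Optional[float]]:
    """Return (number of page downloads, mean inter-arrival time in seconds).

    start_times holds the estimated start time of each page download.
    The mean is None when fewer than two downloads are present.
    """
    times = sorted(start_times)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return len(times), (mean(gaps) if gaps else None)


print(browsing_stats([0.0, 4.0, 16.0]))  # (3, 8.0)
```

Comparing these values for segmented output against the same values for ground-truth page starts gives one simple view of how well a segmentation method preserves user behavior statistics.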
• Browsing stream generation: Our data collection methodology considers both synthetically generated and real user browsing data. The synthetic data generation explicitly incorporates different inter-arrival time distributions, client platforms (browsers), and numbers of open tabs. The web pages browsed in the synthetic data are the same sample of web pages used in the previous two studies. The real user browsing data was collected by recruiting 40 users in an IRB-approved study [Sanders and Kaur, 2015c] in order to study personalized pages and pages with user-interactive content, which may not be represented in the synthetic data.
• Web page segmentation: We evaluate the performance of two types of web page segmentation methods: idle time-based methods and change point detection methods. We apply these methods to browsing streams derived from two traffic features, the number of bytes and the number of SYNs, both of which have previously been used to segment web pages using TCP/IP headers [Maciá-Fernández et al., 2010, Newton et al., 2013, Ihm and Pai, 2011, Mah, 1997].
• Applicability of web page segmentation: We study whether the web page segmentation method used has a measurable impact on different web traffic analysis domains. We do this by conducting case studies for the application domains of (i) user behavior modeling and (ii) web page classification.
Summary of Results: A summary of the results of this empirical evaluation is provided below.
• For web page segmentation, the number of SYNs is a more robust and informative feature than the number of bytes. Using the number of SYNs rather than the number of bytes improves the true positive rate, or recall, of the best web page segmentation methods by approximately 5%, and some methods, such as the basic idle time-based method, improve by over 40%. A recent study by Newton et al. [2013] also found that the number of SYNs was more effective for web page segmentation than the number of bytes.
• We find that the best-performing change point detection methods tested (fused lasso and a heuristic method) yield F-scores that are 20-30% higher than those of idle time-based methods. Thus, we find that change point detection methods can more robustly segment modern web traffic than existing idle time-based approaches. This study is the first work to consider change point detection methods for web page segmentation, although in a recent study, Ihm and Pai [2011] showed that idle time-based approaches may perform poorly on modern web traffic. However, Ihm and Pai [2011] propose that web pages should be segmented using content-based methods, which do not work on anonymized TCP/IP headers; we recommend change point detection methods, which do.
• We find that the inter-arrival time between web page downloads has a significant impact on web page segmentation performance. Our results show that web pages with small inter-arrival times (<5s) are detected less than 30% of the time, while web pages with large inter-arrival times (>10s) are detected approximately 85% of the time. We find that browser choice has only a small impact on segmentation performance, and that the number of tabs open in a browser begins to significantly impact segmentation performance only when the number of tabs increases from 4 to 8 (true positive rate and other metrics decrease by over 10%). These factors have not previously been considered in the literature on web page segmentation [Maciá-Fernández et al., 2010, Newton et al., 2013, Ihm and Pai, 2011, Mah, 1997, Xie et al., 2013, Neasbitt et al., 2014].
• We find that the performance of web page segmentation methods impacts the performance of different application domains that leverage segmentation, such as user behavior modeling and web page classification. Higher-performing web page segmentation methods are beneficial for applications where web page segmentation is critical, with both synthetically generated and real user browsing data. This work is the first to evaluate the performance of multiple web page segmentation methods on traffic classification when only anonymized TCP/IP headers are available [Lim et al., 2010, Kim et al., 2008, Erman et al., 2007b, Dyer et al., 2012, Newton et al., 2013, Ihm and Pai, 2011].
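The results above are reported in terms of true positive rate (recall) and F-score. One way such metrics could be computed for page-start detection is sketched below. The matching rule is an assumption made for this example: a detection counts as a true positive if it lies within a tolerance of a not-yet-matched true page start. So are the function name score_boundaries and the choice of the balanced F-score. This is not necessarily the scoring procedure used in this work.

```python
from typing import List, Sequence, Tuple


def score_boundaries(
    true_starts: Sequence[float],
    detected_starts: Sequence[float],
    tolerance: float = 1.0,
) -> Tuple[float, float, float]:
    """Return (precision, recall, F-score) for detected page-start times.

    Each true start is greedily matched to the closest unmatched detection
    within `tolerance` seconds (hypothetical matching rule).
    """
    unmatched: List[float] = sorted(detected_starts)
    true_positives = 0
    for t in sorted(true_starts):
        candidates = [d for d in unmatched if abs(d - t) <= tolerance]
        if candidates:
            best = min(candidates, key=lambda d: abs(d - t))
            unmatched.remove(best)  # each detection may match only once
            true_positives += 1
    precision = true_positives / len(detected_starts) if len(detected_starts) else 0.0
    recall = true_positives / len(true_starts) if len(true_starts) else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Hypothetical ground truth and detections, in seconds.
p, r, f = score_boundaries([0, 10, 20, 30], [0.5, 10.2, 25.0], tolerance=1.0)
print(f"precision={p:.3f} recall={r:.3f} f_score={f:.3f}")
```

Here two of the three detections fall within one second of a true page start, so precision is 2/3 and recall is 2/4.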
1.3 Outline of Dissertation
The rest of this dissertation is organized as follows. Background and related work in the fields of Internet measurement, traffic classification, and web page segmentation are described in Chapter 2. The first contribution of this dissertation, a comprehensive measurement study that provides insight into the characteristics and diversity of modern web page traffic, is presented in Chapter 3. Web page classification and web page segmentation are investigated in Chapter 4 and Chapter 5, respectively. Concluding remarks and possible directions for future work are provided in Chapter 6.
CHAPTER 2: BACKGROUND AND RELATED WORK
The Web, the most popular application on the Internet, is highly complex because it combines multiple technologies, including web browsers (e.g., Google Chrome and Firefox), web pages (e.g., HTML and JavaScript), web servers (e.g., Apache), and communication protocols (e.g., HTTP, TCP/IP, and DNS). This complexity makes the discussion of the Web difficult because the term “Web” can be used to describe many related, yet different, technologies. Thus, it is important for any work that involves web technologies to clearly define the context for the aspect of “the Web” that is being considered. This dissertation focuses on the development and evaluation of techniques that analyze “the Web” using only anonymized TCP/IP headers. This focus touches upon multiple related technologies and research areas. This chapter provides the context for the aspects of “the Web” that this dissertation considers and is divided into three parts:
• Background: The first part of this chapter provides a brief background on the Web. Here, we clearly define key terms such as the Web, a web page, and web page traffic. We also provide background on web technologies including Hypertext Markup Language (HTML), JavaScript, and Hypertext Transfer Protocol (HTTP), as well as other protocols used by web technologies, including Domain Name System (DNS), Transmission Control Protocol (TCP), and Internet Protocol (IP). Readers familiar with this background material may skip ahead to Section 2.3.
• State of the Art in Web Measurement Methods and Tools: The second part of this chapter discusses the methods and tools used for Web measurement. This discussion provides additional background on how related studies collect data to study the Web.
• Related Work: The last part of this chapter discusses the literature on the research areas that are related to this dissertation. These include Web measurement studies, traffic classification methods, and web page segmentation methods.