Data Analysis Methodology - Sanders_unc_0153D

A traffic feature is a measurable property of web page traffic that can be derived from our tcpdump traffic traces. These traffic traces include the first 60 seconds of web page download traffic and consist of HTTP, TCP/IP, and DNS headers, and source HTML files. Some details of the traffic features that we derive are provided below.

Footnote 6: Please note, however, that the Firefox browser on the PlanetLab nodes did not update.

DNS-based Features: We developed scripts to extract DNS-based traffic features from tcpdump logs. We use port numbers (destination port 53) to identify DNS messages. From these messages, we are able to compute traffic features such as the number of DNS requests, the number of DNS responses, and the TTL of DNS records. We are also able to compute temporal metrics such as the response time of DNS requests and the inter-arrival times of DNS requests.
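The DNS temporal metrics can be illustrated with a minimal sketch. It assumes the DNS messages have already been parsed out of the trace into `(timestamp, transaction_id, is_response)` tuples; that record layout, the function name, and the matching of responses to requests by transaction ID are assumptions made for illustration, not a description of the scripts used in this study.

```python
def dns_temporal_features(messages):
    """Compute DNS response times and request inter-arrival times.

    messages: iterable of (timestamp, transaction_id, is_response) tuples
    (a hypothetical, already-parsed representation of the DNS messages).
    """
    pending = {}          # transaction_id -> timestamp of the open request
    request_times = []    # timestamps of all requests, in time order
    response_times = []   # response timestamp minus request timestamp

    for ts, txid, is_response in sorted(messages):
        if not is_response:
            pending[txid] = ts
            request_times.append(ts)
        elif txid in pending:
            response_times.append(ts - pending.pop(txid))

    inter_arrivals = [b - a for a, b in zip(request_times, request_times[1:])]
    return response_times, inter_arrivals


# Timestamps in milliseconds; request 3 never receives a response.
example = [(0, 1, False), (50, 1, True), (100, 2, False),
           (400, 2, True), (300, 3, False)]
response_times, inter_arrivals = dns_temporal_features(example)
```

Requests that never receive a response contribute to the inter-arrival times but not to the response times in this sketch.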

HTTP-based Features: HTTP(S) messages are also identified using a port-based approach (ports 80 and 443). We use pcap2har to convert .pcap files (i.e., tcpdump traffic traces) to .har logs (i.e., JSON-formatted HTTP archive logs), and we developed scripts to extract traffic features from these logs [pcap2har]. We process the .har log files to extract fields from HTTP headers, including the MIME type, hostname, and length of each web object. We are unable to process encrypted HTTP messages (i.e., HTTPS; see footnote 7).
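As an illustration of the .har processing step, the sketch below pulls the hostname, MIME type, and object length out of each entry. It assumes the log follows the standard HAR layout (`log.entries[].request.url` and `log.entries[].response.content`); the function name and the choice to strip parameters such as `charset` from the MIME type are assumptions, and the study's own scripts may differ.

```python
import json
from urllib.parse import urlsplit


def har_object_features(har_text):
    """Return one dict per web object with hostname, MIME type, and length."""
    har = json.loads(har_text)
    rows = []
    for entry in har["log"]["entries"]:
        content = entry["response"]["content"]
        # Keep only the media type, e.g. "text/html; charset=utf-8" -> "text/html".
        mime_type = content.get("mimeType", "").split(";")[0].strip()
        rows.append({
            "hostname": urlsplit(entry["request"]["url"]).hostname,
            "mime_type": mime_type,
            "length": content.get("size", 0),
        })
    return rows


# Hypothetical two-entry log used only to exercise the function.
example_har = json.dumps({"log": {"entries": [
    {"request": {"url": "http://www.example.com/img/logo.png"},
     "response": {"content": {"mimeType": "image/png", "size": 2048}}},
    {"request": {"url": "http://cdn.example.net/app.js"},
     "response": {"content": {"mimeType": "application/javascript; charset=utf-8",
                              "size": 512}}},
]}})
rows = har_object_features(example_har)
```

Count features such as the number of objects per MIME type can then be derived by grouping these rows.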

TCP/IP-based Features: We developed scripts to extract TCP/IP-based traffic features from tcpdump logs. We use the 5-tuple heuristic to identify TCP connections, and we classify a flow as a TCP connection only if we observe the three-way handshake during the web traffic trace. We use TCP header fields to compute the number of TCP flags observed (i.e., PUSH, RESET, FIN, SYN, and ACK) and the number of bytes transferred within each segment. In addition to these statistics, we compute temporal metrics such as the duration of TCP connections and the round-trip time (RTT). RTT is defined as the time between the initial SYN segment sent by the client and the arrival of the SYN-ACK response segment from the server. We also measure the inter-connection arrival time: the amount of time between the start of consecutive TCP connections, where the sending of the initial SYN segment is taken to be the start of a TCP connection. The last temporal traffic feature that we measure is TCP connection duration: the amount of time between the first and last segments transmitted during a connection.

Heuristic for web page download times: Web page download time is approximated by taking the amount of time between the first DNS request sent by the client and the last payload byte sent by the server [Huang et al., 2010]. To compute robust statistics of this metric, we also compute several measures of the time until P% (50%, 90%) of the bytes in a given web page are downloaded. Although this metric does not correspond exactly to user-perceived load times of web pages, it is correlated with them [Yahoo, Huang et al., 2010].
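The temporal definitions above can be sketched as follows. The `Segment` record, the function names, and the pre-parsed input are hypothetical; as a simplification, the sketch accepts a connection once the client SYN and the server SYN-ACK are seen, and it does not check the final ACK of the handshake.

```python
from collections import namedtuple

# conn: normalized 5-tuple key; flags: set of flag names; payload: bytes carried.
Segment = namedtuple("Segment", "ts conn flags payload from_server")


def connection_metrics(segments):
    """Return (rtt, duration, inter_arrival) for connections with a handshake."""
    syn, synack, first, last = {}, {}, {}, {}
    for s in sorted(segments, key=lambda seg: seg.ts):
        first.setdefault(s.conn, s.ts)
        last[s.conn] = s.ts
        if "SYN" in s.flags and "ACK" not in s.flags and not s.from_server:
            syn.setdefault(s.conn, s.ts)       # initial SYN from the client
        elif {"SYN", "ACK"} <= s.flags and s.from_server:
            synack.setdefault(s.conn, s.ts)    # SYN-ACK from the server
    rtt = {c: synack[c] - syn[c] for c in syn if c in synack}
    duration = {c: last[c] - first[c] for c in rtt}
    starts = sorted(syn[c] for c in rtt)
    inter_arrival = [b - a for a, b in zip(starts, starts[1:])]
    return rtt, duration, inter_arrival


def time_to_fraction(first_dns_ts, server_payloads, fraction):
    """Time from the first DNS request until `fraction` of all bytes arrived.

    server_payloads: list of (timestamp, payload_bytes) sent by servers.
    fraction=1.0 gives the download-time heuristic itself.
    """
    payloads = sorted(p for p in server_payloads if p[1] > 0)
    total = sum(n for _, n in payloads)
    seen = 0
    for ts, n in payloads:
        seen += n
        if seen >= fraction * total:
            return ts - first_dns_ts
    return None  # no payload bytes observed


# Timestamps in milliseconds; connection "C" has no observed handshake.
trace = [
    Segment(0, "A", {"SYN"}, 0, False), Segment(40, "A", {"SYN", "ACK"}, 0, True),
    Segment(41, "A", {"ACK"}, 0, False), Segment(100, "A", {"PUSH", "ACK"}, 600, True),
    Segment(500, "A", {"FIN", "ACK"}, 0, False),
    Segment(120, "B", {"SYN"}, 0, False), Segment(150, "B", {"SYN", "ACK"}, 0, True),
    Segment(300, "B", {"FIN", "ACK"}, 0, True),
    Segment(10, "C", {"ACK"}, 0, False), Segment(400, "C", {"ACK"}, 0, True),
]
rtt, duration, inter_arrival = connection_metrics(trace)
payloads = [(100, 600), (200, 350), (900, 50)]
```

Measuring the time to 50% and 90% of the bytes alongside the full download time makes the statistic less sensitive to a single late object that arrives long after the rest of the page.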

Footnote 7: 18% of our observed TCP connections transfer HTTPS traffic.

HTML-based Features: We use the BeautifulSoup Python library to extract features from the source HTML of the web pages [Richardson, 2015]. Count statistics derived from tags that represent (i) hyperlink-level information (e.g., “a” and “link” tags) and (ii) the extensions of embedded objects that are referenced by a page (e.g., .jpeg, .gif, and .png extensions for embedded image objects) are used for object-based features and analysis; these have commonly been used in other HTML-based analysis [HTM, Canali et al., 2011]. We also derive a feature to analyze the textual differences between HTML documents. We use a simple bag-of-words model to count the frequency of all the words that are present in a document; bag-of-words models are commonly used in natural language processing, machine learning, and computer vision [Weinberger et al., 2009]. A word in this model is defined as any sequence of characters in an HTML document that is delimited by >, <, ", newline, or whitespace characters. This model allows us to derive features that measure the overall text-related differences between two documents. To represent these differences compactly, we derive the number of different words feature, which is computed as the number of words that differ between two documents (that is, a baseline document and a test document). We use this feature simply as a measure to flag significant differences in text for further analysis.
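A minimal sketch of the bag-of-words model and the number of different words feature follows. The text does not state how differing words are counted, so the sketch assumes the feature is the sum of absolute per-word count differences between the two documents; that choice, the function names, and the use of a straight double quote as the quote delimiter are assumptions.

```python
import re
from collections import Counter

# Delimiters from the word definition: >, <, double quote, newline, whitespace.
WORD_DELIMITERS = re.compile(r'[<>"\s]+')


def bag_of_words(html):
    """Count how often each delimited word occurs in an HTML document."""
    return Counter(tok for tok in WORD_DELIMITERS.split(html) if tok)


def number_of_different_words(baseline_html, test_html):
    """Sum of absolute per-word count differences (an assumed definition)."""
    base, test = bag_of_words(baseline_html), bag_of_words(test_html)
    # Counter subtraction keeps only positive counts, so the two one-sided
    # differences together give the absolute difference for every word.
    diff = (base - test) + (test - base)
    return sum(diff.values())


baseline = '<a href="x.html">home page</a>'
variant = '<a href="y.html">home</a>'
n_diff = number_of_different_words(baseline, variant)
```

In this example the baseline contributes `x.html` and `page`, and the variant contributes `y.html`, so three words differ.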

Please note that our study is focused primarily on the analysis of web page traffic. We analyze source HTML primarily to assist in understanding the traffic differences that we observe. In particular, we want to understand whether any difference that we observe in web page traffic across client platforms is the result of differences in the source HTML or is due to the choice of client platform.

3.2.2 Feature Selection Procedure

In total, we derive 26 primary traffic features, 549 secondary traffic features, and 127 HTML-based features (702 features total). Here, primary traffic features are traffic features that represent coarse traffic information, such as the number of bytes transferred, number of PUSH flags, number of objects, and number of TCP connections. Secondary traffic features are more fine-grained traffic features, such as the number of JavaScript objects or the number of HTTP responses with status code 404, and derived multi-flow features such as the average number of bytes sent per TCP connection or the maximum object size observed. Lastly, HTML-based features are features that are derived from the source HTML. A complete list of the features used in this study is provided in Appendix 3.

We also use statistical tests in our analysis, which focuses on determining the influence that client platforms have on web page download traffic. In particular, we use a standard non-parametric statistical test, the Kruskal-Wallis test, to determine which traffic features differ significantly across client platforms. The Kruskal-Wallis test yields a p-value for each feature that represents the statistical significance of its differences across client platforms. Here, lower p-values correspond to results that have greater statistical significance. We deem results statistically significant if the p-value is less than 0.05. We then use these results to dig deeper into our dataset to determine the source of any statistically significant difference.
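The selection step can be sketched in pure Python as below. This is an illustrative implementation of the Kruskal-Wallis H statistic with the usual tie correction and a chi-square approximation for the p-value; the function names and the sample values are hypothetical, and the study may well have used a statistics package instead.

```python
import math
from collections import Counter


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= (x / 2) / i
            total += term
        return math.exp(-x / 2) * total
    term, total = 1.0, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    tail = math.sqrt(2 * x / math.pi) * math.exp(-x / 2) * total
    return math.erfc(math.sqrt(x / 2)) + tail


def kruskal_wallis(groups):
    """Return (H, p) for a list of samples, one sample per client platform."""
    pooled = sorted(v for g in groups for v in g)
    n = len(pooled)
    rank_of, i = {}, 0
    while i < n:                      # assign average ranks to tied values
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2
        i = j
    rank_term = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    h = 12 / (n * (n + 1)) * rank_term - 3 * (n + 1)
    ties = Counter(pooled).values()
    correction = 1 - sum(t ** 3 - t for t in ties) / (n ** 3 - n)
    if correction == 0:               # every observation is identical
        return 0.0, 1.0
    h /= correction
    return h, chi2_sf(h, len(groups) - 1)


# Hypothetical per-platform samples for two features.
samples = {
    "total_bytes": [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    "dns_requests": [[5, 6, 7], [5, 6, 7], [5, 6, 7]],
}
significant = [name for name, groups in samples.items()
               if kruskal_wallis(groups)[1] < 0.05]
```

Only the features whose p-value falls below 0.05 are retained in `significant` for closer inspection.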
