
2.3 Related Work

2.3.2 Traffic Classification

2.3.2.4 Web Page Identification

The security and privacy research community has focused on a problem similar to traffic classification that is best referred to here as web page identification. Web page identification is the problem of identifying the exact web page given an encrypted traffic trace. In other words, successfully solving the web page identification problem means that fine-grained information, such as the web page visited by a user, can still be inferred from traffic despite encryption. Because the traffic is encrypted, only TCP/IP headers are available for web page identification.

Table 2.5: Summary of Prior Work on Traffic Classification

| Author(s) | Classification Labels | Features Used | Methods |
|---|---|---|---|
| Moore and Papagiannaki [2005] | Application Type (Mail, Web, P2P, Games, Multimedia, Services, Interactive, Bulk, Database, Malicious) | Packet payloads + Coarse flow-level features | Signature Detection + Heuristics |
| Sen et al. [2004] | Application Type (P2P vs Non-P2P) | TCP Payloads (string matching on payloads) | Signature Detection |
| Aceto et al. [2010] | Application Type (P2P, Web, Unknown, Services, Encryption, Network management, Mail, Multimedia, Tunneling, Filesystem, Bulk, Games, Interactive) | Packet Payloads (restricted to first 32 bytes of payload) + TCP Protocol Type | Signature Detection |
| Alcock and Nelson [2012] | Application Type (ESP over UDP, Web, BitTorrent, Razor, Garena, Skype, RTMP, Xbox Live) | Packet Payload (first 4 bytes) + IP Address + Port number | Signature Detection |
| Karagiannis et al. [2004] | Application Type (P2P vs Non-P2P) | TCP, UDP, and IP Headers (non-temporal features such as packet size and IP address) | Heuristics |
| Karagiannis et al. [2005] | Application Type (P2P, Web, Mail, Chat, FTP, Network management, Games) | TCP, UDP, and IP Headers (non-temporal features such as packet size and IP address) | Heuristics |
| Dewes et al. [2003] | Application Type (Chat vs Non-Chat) | Packet Payloads + TCP, IP, UDP, and HTTP Headers (specifically, temporal features) | Heuristics |
| Crotti et al. [2007] | Application Type (Web, SMTP, POP3, Other) | TCP and IP Headers (e.g., port numbers, temporal features, and segment size) | Heuristics |
| Moore and Zuev [2005] | Application Type (Bulk, Database, Interactive, Mail, Services, Web, P2P, Attack, Games, Multimedia) | TCP/IP Headers (e.g., port numbers, temporal features, packet size, and TCP flags) | Supervised Machine Learning (NB) |
| Roughan et al. [2004] | Application Type (Domain, FTP, HTTP/Web, P2P, Telnet, HTTPS) | TCP and IP Headers (e.g., port numbers, temporal features, packet size, and TCP flags) | Supervised Machine Learning (KNN and LDA) |
| Schatzmann et al. [2010] | Application Type (Webmail vs Non-webmail) | Coarse flow-level features (specifically, temporal features) | Supervised Machine Learning (SVM) |
| Kim et al. [2008] | Application Type (Web, DNS, Mail, Chat, FTP, P2P, Streaming, Game) | TCP, UDP, and IP Headers (e.g., port numbers, packet size, and TCP flags) | Supervised Machine Learning (NB, CT, KNN, SVM) |
| Lim et al. [2010] | Application Type (Web, P2P, Attack, FTP, DNS, Mail, Streaming, Network Operation, Games, Encryption, Chat, Unknown) | TCP, UDP, and IP Headers (non-temporal features such as port numbers, packet size, and TCP flags) | Supervised Machine Learning (NB, LDA, CT, KNN) |
| McGregor et al. [2004] | Application Type (Web, ICMP, SMTP, IMAP, NTP, FTP) | TCP, UDP, and IP Headers (e.g., port numbers, packet size, temporal features, and TCP flags) | Clustering (Expectation Maximization) |
| Erman et al. [2006] | Application Type (Web, POP3, Database, P2P, Other, FTP, limewire) | TCP, UDP, and IP Headers (e.g., port numbers, packet size, and TCP flags) | Clustering (K-means, AutoClass, DBScan) |
| Hernández-Campos et al. [2003b] | Application Type (Web, HTTPS, POP3, Gnutella, Telnet, POP, FTP, SMTP, NNTP, Database) | TCP, UDP, and IP Headers (e.g., port numbers, packet size, and TCP flags) | Clustering (Hierarchical Clustering) |

Footnote 25: Erman et al. [2006] also published another similar study that only considered the K-means method [Erman et al., 2007b].

Sun et al. [2002] showed that exact web pages can be identified using information derived from network traffic. One primary metric used for this problem is the size of a web object. Here, the size of a web object, or of any unit of web traffic (say, transport-layer segments), is the number of bytes it contains. Sun et al. [2002] first derived web object sizes using TCP/IP headers, and then used the Jaccard similarity classifier to compare known traffic signatures with those present in the traffic. The primary assumption of this approach is that the sizes of web objects that are referenced by different web pages are different. Sun et al. [2002] went on to propose mechanisms that make their method less effective, hence improving network security. These mechanisms include padding transport-layer segments (increasing the size of web objects to be larger than expected) and adding extra background traffic (increasing the number of web objects referenced by a web page to be higher than expected).
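To make the Jaccard similarity comparison attributed to Sun et al. [2002] concrete, the following is a minimal sketch, not the authors' implementation. It assumes each page signature is simply the set of web object sizes (in bytes) that the page references; the page names, the sizes, and the block-padding function `pad` are hypothetical and chosen only for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def identify(trace_sizes, signatures):
    """Return (page, score) for the known signature most similar to the trace."""
    scored = {page: jaccard(trace_sizes, sizes) for page, sizes in signatures.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]


def pad(sizes, block=1024):
    """Assumed countermeasure: round every object size up to a multiple of `block`."""
    return {-(-s // block) * block for s in sizes}


# Hypothetical signatures: object sizes (bytes) referenced by each known page.
signatures = {
    "news": {5120, 20331, 742, 18004},
    "mail": {5120, 3310, 960},
}

# Observed sizes recovered from TCP/IP headers; one object differs from the profile.
observed = {5120, 20331, 742, 17990}
page, score = identify(observed, signatures)

# Padding makes the two signatures overlap more, so they are harder to tell apart.
before = jaccard(signatures["news"], signatures["mail"])
after = jaccard(pad(signatures["news"]), pad(signatures["mail"]))
```

In this toy setting the observed trace is matched to the "news" signature, and padding raises the similarity between the two signatures, which is the effect the padding countermeasure relies on.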

Liberatore and Levine [2006] performed a similar study, except that they used transport-layer segment size as the primary feature instead of web object size; the direction of the communication was also used as a feature for classification. They used segment sizes because some encryption technologies hide the notion of a web object in a manner that cannot be recovered using TCP/IP headers alone, and they showed that web page identification is still possible with segment size as the primary feature. They also showed that the Naive Bayes classifier performed better than the Jaccard similarity classifier, primarily because Jaccard similarity does not account for some traffic features, such as the number of segments transferred. Herrmann et al. [2009] and Panchenko et al. [2011] conducted similar studies, except that Herrmann et al. [2009] used a multinomial Naive Bayes model while Panchenko et al. [2011] used an SVM model for web page identification. The method proposed by Panchenko et al. [2011] differs from most prior methods in web page identification in that it uses coarse traffic features (i.e., features derived from multiple segments), such as TCP connection duration and bandwidth, in addition to fine-grained features (i.e., features that use each segment), such as the distribution of segment sizes observed in traffic.
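The point that a count-aware classifier can separate pages that a set-based Jaccard comparison cannot may be illustrated with a small multinomial Naive Bayes sketch. This is an illustration in the spirit of the classifiers discussed above, not the exact model of any cited study. It assumes that a trace is a list of signed segment sizes (positive for client-to-server, negative for server-to-client), that priors are uniform, and that Laplace smoothing is used; the page names and traces are hypothetical.

```python
import math
from collections import Counter


def train(labelled_traces, alpha=1.0):
    """Accumulate per-page counts of signed segment sizes (direction + size)."""
    counts, vocab = {}, set()
    for page, trace in labelled_traces:
        counts.setdefault(page, Counter()).update(trace)
        vocab.update(trace)
    return {"counts": counts, "vocab": vocab, "alpha": alpha}


def log_likelihood(model, page, trace):
    """log P(trace | page) under a multinomial model with Laplace smoothing."""
    page_counts = model["counts"][page]
    alpha = model["alpha"]
    # One extra vocabulary slot is reserved for feature values never seen in training.
    total = sum(page_counts.values()) + alpha * (len(model["vocab"]) + 1)
    return sum(
        n * math.log((page_counts.get(feat, 0) + alpha) / total)
        for feat, n in Counter(trace).items()
    )


def classify(model, trace):
    """Pick the page with the highest likelihood (uniform prior assumed)."""
    return max(model["counts"], key=lambda p: log_likelihood(model, p, trace))


# Two hypothetical pages with the same distinct segment sizes but different counts.
training = [
    ("gallery", [-1500] * 8 + [300] * 2),
    ("article", [-1500] * 2 + [300] * 2),
]
model = train(training)

trace_a = [-1500] * 7 + [300] * 2   # many large downstream segments
trace_b = [-1500] * 2 + [300] * 3   # few large downstream segments
label_a = classify(model, trace_a)
label_b = classify(model, trace_b)
```

Both test traces contain the same set of distinct values, so a Jaccard comparison on sets would score them identically against either page, whereas the count-based likelihood assigns them to different pages.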

Dyer et al. [2012] compared many of these web page identification approaches to determine which one works best. They found that the classification method mattered less than the types of features used for classification. In particular, Dyer et al. [2012] highlighted that the coarse features used by Panchenko et al. [2011] were particularly informative for web page identification. Coarse features are especially useful for identifying web pages when countermeasures are used because they are more robust to the noise that these countermeasures add to the traffic. Cai et al. [2012] conducted a similar study showing that web page identification can still work without the segment size feature and despite the use of countermeasures. Yen et al. [2009] and Coull et al. [2007] also use coarse features for web page identification. However, these two studies consider only coarse features, whereas the study by Panchenko et al. [2011] considers both coarse and fine-grained features. Both show that coarse features can be used to effectively identify web pages without fine-grained features.

A key result from the study by Yen et al. [2009] is that browsers have an impact on web page identification, and that detecting the browser can help in building specialized per-browser classifiers that work better than a single classifier. Miller et al. [2014] investigated whether browser-specific features such as browser caching can impact web page identification performance. Their results show that the browser cache has an impact on web page traffic and consequently on web page identification performance. Miller et al. [2014] also found that a hidden Markov model can be used to provide navigation information that helps identify the web page a user visited. The assumption of this approach is that the navigation patterns of a user can be used to predict the domain of the web page (i.e., the web site) that was visited, thus reducing the set of possible web pages that the user visited. Cai et al. [2012] and Danezis also showed that the navigation patterns of users can be used to supplement web page identification techniques.
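The distinction between coarse and fine-grained features drawn earlier in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the feature set of any cited study: a trace is assumed to be a list of `(timestamp_seconds, signed_size)` tuples with positive sizes for upstream segments, and `pad_to_mtu` is a hypothetical countermeasure that pads every segment to a fixed size.

```python
from collections import Counter


def coarse_features(trace):
    """Features summarizing many segments at once (duration, volume, bandwidth)."""
    times = [t for t, _ in trace]
    duration = max(times) - min(times)
    bytes_up = sum(s for _, s in trace if s > 0)
    bytes_down = sum(-s for _, s in trace if s < 0)
    total = bytes_up + bytes_down
    return {
        "duration_s": duration,
        "bytes_up": bytes_up,
        "bytes_down": bytes_down,
        "bandwidth_Bps": total / duration if duration > 0 else 0.0,
    }


def fine_features(trace):
    """Per-segment feature: histogram of signed segment sizes."""
    return Counter(s for _, s in trace)


def pad_to_mtu(trace, mtu=1500):
    """Hypothetical countermeasure: pad every segment to `mtu` bytes."""
    return [(t, mtu if s > 0 else -mtu) for t, s in trace]


# Hypothetical trace: (seconds since start, signed segment size in bytes).
trace = [(0.0, 300), (0.1, -1500), (0.2, -1500), (0.5, -700), (2.0, 200)]

plain_coarse = coarse_features(trace)
padded_coarse = coarse_features(pad_to_mtu(trace))
plain_fine = fine_features(trace)
padded_fine = fine_features(pad_to_mtu(trace))
```

In this toy trace, padding collapses the segment size histogram to two values, while the connection duration is unchanged. Byte counts and bandwidth do change under this particular padding, so the sketch illustrates only one way in which a coarse feature can survive a countermeasure.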

The majority of the work in web page identification assumes a closed world model [Cai et al., 2012, Yen et al., 2009, Dyer et al., 2012, Liberatore and Levine, 2006, Herrmann et al., 2009, Miller et al., 2014]. The closed world model assumes that the web page traffic trace includes only the traffic resulting from downloading a finite set of k known web pages. In a real-world scenario, however, one will observe web page traffic that includes downloads of unknown web pages. The open world model considers such a scenario. Panchenko et al. [2011], Coull et al. [2007], and Sun et al. [2002] conducted some of the few studies that consider the open world model.
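The difference between the two models can be illustrated with a small sketch. It assumes, purely for illustration, a similarity-based classifier over sets of object sizes and a fixed rejection threshold; the threshold rule, page names, and sizes are hypothetical and are not taken from the cited studies.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 1.0


def classify_closed(trace, signatures):
    """Closed world: always answer with one of the k known pages."""
    return max(signatures, key=lambda p: jaccard(trace, signatures[p]))


def classify_open(trace, signatures, threshold=0.5):
    """Open world: return None when no known page is similar enough."""
    best = classify_closed(trace, signatures)
    return best if jaccard(trace, signatures[best]) >= threshold else None


# Hypothetical signatures for k = 2 known pages.
signatures = {
    "news": {5120, 20331, 742, 18004},
    "mail": {5120, 3310, 960},
}

known_trace = {5120, 3310, 960}      # a download of a known page
unknown_trace = {1111, 2222, 5120}   # a download of a page outside the known set

closed_answer = classify_closed(unknown_trace, signatures)
open_answer = classify_open(unknown_trace, signatures)
known_answer = classify_open(known_trace, signatures)
```

Under the closed world assumption the unknown trace is forced onto the nearest known page, whereas the open world variant can decline to label it.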

A summary of the prior work described in this section is provided in Table 2.6. The body of work presented in this section is related to this dissertation because (i) it is related to the general problem of traffic classification or (ii) it involves classifying web page traffic using TCP/IP headers.
