We now describe several threads of ongoing and planned work to explore related aspects of privacy issues and the "mashed-up" web.
Linking different devices of the same user. Our cookie linking attack as described does not allow the adversary to link two different browsers or devices of the same user, except by attaching real-world identities to both clusters. However, there could be other mechanisms: if two sets of cookies are repeatedly seen together on a variety of different networks, this probably represents the same user who is switching IPs (for example, traveling with a smartphone and a laptop). Another possibility is that the set of websites that the user tends to visit frequently could serve as a fingerprint.
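The co-occurrence idea can be illustrated with a minimal sketch. It assumes the eavesdropper has already reduced its logs to (network, cookie cluster) observations; the record layout, the function name `candidate_device_links`, and the threshold of three networks are all hypothetical choices for illustration, not part of our measurement.

```python
from collections import defaultdict
from itertools import combinations


def candidate_device_links(observations, min_networks=3):
    """Return pairs of cookie clusters seen together on at least
    `min_networks` distinct networks.

    `observations` is an iterable of (network, cluster_id) pairs; both
    fields and the threshold are hypothetical, not taken from the paper.
    """
    # Collect the set of cookie clusters observed on each network.
    clusters_by_network = defaultdict(set)
    for network, cluster_id in observations:
        clusters_by_network[network].add(cluster_id)

    # For every pair of clusters, record the networks they share.
    shared_networks = defaultdict(set)
    for network, clusters in clusters_by_network.items():
        for pair in combinations(sorted(clusters), 2):
            shared_networks[pair].add(network)

    # Keep only pairs that co-occur on enough distinct networks.
    return {pair: len(nets) for pair, nets in shared_networks.items()
            if len(nets) >= min_networks}


observations = [
    ("home", "phone"), ("home", "laptop"),
    ("cafe", "phone"), ("cafe", "laptop"), ("cafe", "stranger"),
    ("hotel", "phone"), ("hotel", "laptop"),
]
links = candidate_device_links(observations)
```

A realistic version would also require the two clusters to appear within the same time window on each network, since clusters that merely share a busy network on different days are weak evidence.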
International cookie linking. Future research into cookie linking should study the surveillance method in a more international context. Though our first goal was to study the technique in more general terms, we should explore how the NSA’s “one-end foreign” policy could affect how well cookie linking works, as it would constrain the amount of data available to the eavesdropper. This could be studied by using a proxy to direct crawls of American sites from a foreign computer, or by having American computers crawl foreign sites served from overseas servers. The findings of [14] suggest that the results may vary by country.
Effectiveness of user mitigation mechanisms. While using the Tor browser and blocking all third-party cookies are good mitigation mechanisms, they come at a cost in terms of usability and functionality. There are other browser privacy measures that are more palatable for users to deploy: (1) an extension such as “HTTPS Everywhere”, which makes requests over HTTPS whenever the server supports it, and (2) Safari-style cookie blocking, which is a limited form of third-party cookie blocking that breaks less functionality. Since these are essentially modifications to browser behavior, they can be studied using our approach.
Leakage of sensitive attributes in HTTP traffic. As mentioned in Section 1, one of the adversary’s goals might be to infer sensitive attributes about the user that are transmitted between the browser and websites in plaintext, such as preferences, purchase history, address, etc. We plan to develop heuristics to identify such data in HTTP traffic and measure the prevalence of such leakage. A further step would be to extend the measurement of unencrypted transmission of sensitive information to the domain of mobile apps.
6. Related Work
Our work expands on the previous work done in the growing field of web privacy measurement (WPM). Much of that work takes a descriptive approach to privacy issues, highlighting the activities of third parties already extant on the web. Our work, however, is primarily hypothetical. We define ways in which the web makes users vulnerable to novel privacy threats from many different angles. We do, however, draw on methods used in other studies in this field.
On identity leakage. Past studies in the area of identity leakage research have looked at the ways in which sites can, and already do, share personally identifiable information (PII) with third parties. Krishnamurthy et al. show that information leakage on popular Alexa sites often occurs when a first-party site directs the browser to make a special request to third-party servers [17]. These requests may include sensitive personal attributes like name, email, location, age, or gender as URL parameters in the resource URL of an HTTP request. From there, a third party may be able to use different bits of information collected from a variety of sources to construct a profile of a user. [17] found that as many as 75% of the sites studied leaked private information to third parties through HTTP requests.
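To make the URL-parameter leakage concrete, the following is a minimal sketch of how such a leak might be flagged in a crawl log. It is not the method of [17]; the function name `leaked_attributes`, the example hosts, and the naive hostname comparison are assumptions made for illustration (a real study would likely compare registered domains using the Public Suffix List).

```python
from urllib.parse import urlsplit, parse_qsl


def leaked_attributes(request_url, first_party_host, known_pii):
    """Return (parameter, attribute) pairs for known PII values that
    appear in the query string of a third-party request.

    `known_pii` maps attribute names to the values entered on the site,
    e.g. {"email": "alice@example.com"}. Hosts are compared naively by
    suffix, which is an assumption of this sketch.
    """
    parts = urlsplit(request_url)
    host = parts.hostname or ""
    if host == first_party_host or host.endswith("." + first_party_host):
        return []  # first-party request, not a leak to a third party

    leaks = []
    # parse_qsl percent-decodes values, so "%40" is compared as "@".
    for param, value in parse_qsl(parts.query):
        for attribute, pii_value in known_pii.items():
            if pii_value.lower() in value.lower():
                leaks.append((param, attribute))
    return leaks


known = {"email": "alice@example.com"}
third_party = leaked_attributes(
    "http://tracker.example.net/pixel?uid=42&em=alice%40example.com",
    "shop.example.com", known)
first_party = leaked_attributes(
    "http://shop.example.com/account?em=alice%40example.com",
    "shop.example.com", known)
```

Plain substring matching only catches values sent in the clear; hashed or otherwise encoded identifiers would need additional checks.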
Further studies have expanded on [17] by automating data collection via browser automation frameworks like Selenium [21]. This allows researchers to scale up web crawls to include many sites, but also makes the task of automatically logging in on websites more challenging; [32, 22] opt instead to use a manual approach in their studies of how advertisers use the information they are given. The researchers in [12] augment their data by relying on the HTTP logs of their local laboratory network. Other studies limit their scope by focusing on leaks originating only from Online Social Networks (OSNs) like Facebook [11, 19]. Many of these OSN studies have already concluded that PII leakage on OSNs and related apps (such as Facebook apps) is very common.
Despite the varied approaches, many of the studies share the same goals: defining how sites leak private information and discovering how third parties might be using that data. They tend to focus on information leakage that happens via request URLs [17, 19, 11] to third parties or HTTP 302 redirects [21]. In that sense these studies are only descriptive: they do not define how sites might be unwittingly leaving the door open to the leakage of private information to third parties.
On trackers. Studies like [27, 18, 28, 14] seek a general understanding of the ‘tracking’ ecosystem. This is not limited to third-party providers of JavaScript or third-party cookies. [7, 28, 24] look at other forms of locally-stored objects such as Flash cookies and HTML local storage. But all seek to define the many ways in which a user browsing first-party sites enters into relationships with various third parties.
[27] defines the different types of trackers seen on the web and takes a census of trackers classified by behavior on popular and non-popular sites. It also simulates real users’ browsing profiles by feeding search queries from the 2006 AOL search query logs to a search engine and using the top search result as the visited site, a technique also employed by [20]. Relevant to the present study, [27] finds that based on these simulated user profiles, ad service Doubleclick can only track users across 39% of the pages they visit on average.
[18] seeks to show the organizational relationships between trackers and account for the growth of large tracker families (Google, Microsoft, Adobe, Yahoo, and AOL) through their acquisition of smaller web tracking companies. It also studies the “depth of third-party penetration” on popular first-party sites. [18]’s findings show that in the period between 2005 and 2008, top-10 third-party tracker families increased their presence on first-party servers from 40% to 70%. This includes trackers that do not necessarily set third-party cookies or embed third-party JavaScript; google-analytics.com, for instance, only sets first-party cookies that cannot be used to connect activity across first-party sites.
On browser fingerprinting. Browser fingerprinting involves the use of JavaScript objects, ActionScript queries (Flash’s scripting language) [25], HTTP data, and other methods to collect information that can be used to develop a unique signature of a user’s browser. Examples of commonly-used sources of fingerprinting data include JavaScript’s navigator and window objects. These can give information about a browser’s installed plugins, screen resolution, and version information. While recent browser fingerprinting surveys [25, 6] reveal that fingerprinting is not frequently found even on the top million most popular websites, the question remains of how much access third-party JavaScript ought to have for browser fingerprinting.
[30] performs a similar survey of JavaScript on the top million most popular websites, but goes further than [25, 6] in its analysis. It takes a census of common third-party trackers and counts what user behaviors tracking scripts record. It finds that many third-party scripts track JavaScript events like mousedown and keydown, or attempt to ‘sniff’ a user’s browser history. It does not, however, analyze the full potential that these scripts have in violating a user’s privacy. It only looks at what scripts actually do, not what they could do.
On privacy-preserving tools. There are various client-side tools to block, limit, or visualize third-party tracking, built by researchers, hobbyists, or companies. These are too numerous to list exhaustively, but a sampling includes Adblock Plus, Ghostery, ShareMeNot [1], Lightbeam, and TrackingObserver [3].
7. Conclusion
Taken together, our findings on both cookie linking and unwelcome JavaScript suggest that there are many as-yet-unseen consequences to the way third-party resources are included on websites. Further, they are not limited to any one threat model in particular. The “mash-up” model of the web offers everyone from eavesdroppers to active attackers a means by which they can compromise a user’s privacy. By highlighting the vulnerabilities, we hope that steps can be taken to mitigate the threat presented by the diverse ecosystem of third-party resource providers.
Our “cookie linking” technique is perhaps the most difficult to remedy. We studied what can be inferred from the surveillance of web traffic logs and established that utilizing third-party tracking cookies enables an adversary to attribute traffic to users much more effectively than methods such as considering IP address alone. We also found that the technique is robust enough to render many mitigation tactics largely ineffective. The best course of action a user can take to protect herself is to block third-party cookies and trackers, but the user then runs the risk of compromising site functionality.
We found that unwelcome JavaScript also presents a sinister threat to a user’s privacy. A significant number of popular US websites grant third-party JavaScript a great deal of access to sensitive information, the worst of which is user passwords via key logging. This particular aspect of the threat could be mitigated if the vulnerable sites took more care in how they isolate logins from third-party JavaScript, or JavaScript from the rest of the page.
We hope that these findings will inform the policy debate on both surveillance and the third-party ecosystem. We also hope that they will raise awareness of privacy breaches via subtle inference techniques and other side-channels, and spur web developers to act more thoughtfully in how they integrate third-party resources into their websites.
References
[1] ShareMeNot: Protecting against tracking from third-party social media buttons while still allowing you to use them. https://sharemenot.cs.washington.edu. Accessed: 2014.
[2] Tracking mouse movements instead of clicks.
[3] TrackingObserver: A browser-based web tracking detection platform. http://trackingobserver.cs.washington.edu. Accessed: 2014.
[4] Databases in WRDS - comScore. http://wrds-web.wharton.upenn.edu/wrds/about/databaselist.cfm, December 2013.
[5] ‘tor stinks’ presentation - read the full document. http://www.theguardian.com/world/interactive/2013/oct/04/tor-stinks-nsa-presentation-document, October 2013.
[6] Gunes Acar, Marc Juarez, Nick Nikiforakis, Claudia Diaz, Seda Gürses, Frank Piessens, and Bart Preneel. FPDetective: dusting the web for fingerprinters. In Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security, pages 1129–1140. ACM, 2013.
[7] Mika Ayenson, Dietrich J. Wambach, Ashkan Soltani, Nathan Good, and Chris J. Hoofnagle. Flash cookies and privacy II: Now with HTML5 and ETag respawning. World Wide Web Internet And Web Information Systems, 2011.
[8] Mahesh Balakrishnan, Iqbal Mohomed, and Venugopalan Ramasubramanian. Where’s that phone?: geolocating IP addresses on 3G networks. In Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference, pages 294–300. ACM, 2009.
[9] Paul E. Black. Ratcliff/Obershelp pattern recognition. http://xlinux.nist.gov/dads/HTML/ratcliffObershelp.html, December 2004.
[10] Elie Bursztein. Tracking users that block cookies with a HTTP redirect. http://www.elie.net/blog/security/tracking-users-that-block-cookies-with-a-http-redirect, 2011.
[11] Abdelberi Chaabane, Yuan Ding, Ratan Dey, Mohamed Ali Kaafar, Keith Ross, et al. A closer look at third-party OSN applications: Are they leaking your personal information? In Passive and Active Measurement conference, 2014.
[12] Abdelberi Chaabane, Mohamed Ali Kaafar, and Roksana Boreli. Big friend is watching you: Analyzing online social networks tracking capabilities. In Proceedings of the 2012 ACM Workshop on Workshop on Online Social Networks, WOSN ’12, pages 7–12, New York, NY, USA, 2012. ACM.
[13] Gabriel Chen. Convenience over safety: How authentication cookies compromise user account security on the web, 2014, in prep.
[14] Marjan Falahrastegar, Hamed Haddadi, Steve Uhlig, and Richard Mortier. The rise of panopticons: Examining region-specific third-party web tracking. In Alberto Dainotti, Anirban Mahanti, and Steve Uhlig, editors, Traffic Monitoring and Analysis, volume 8406 of Lecture Notes in Computer Science, pages 104–114. Springer Berlin Heidelberg, 2014.
[15] Sharad Goel, Jake M. Hofman, and M. Irmak Sirer. Who does what on the web: A large-scale study of browsing behavior. In ICWSM, 2012.
[16] Manoj Hastak and Mary J. Culnan. Persistent and unblockable cookies using HTTP headers. http://www.nikcub.com/posts/persistant-and-unblockable-cookies-using-http-headers, 2011.
[17] Balachander Krishnamurthy, Konstantin Naryshkin, and Craig Wills. Privacy leakage vs. protection measures: the growing disconnect. In Proceedings of the Web, volume 2, pages 1–10, 2011.
[18] Balachander Krishnamurthy and Craig Wills. Privacy diffusion on the web: a longitudinal perspective. In Proceedings of the 18th International Conference on World Wide Web, pages 541–550. ACM, 2009.
[19] Balachander Krishnamurthy and Craig E. Wills. On the leakage of personally identifiable information via online social networks. In Proceedings of the 2nd ACM workshop on Online social networks, pages 7–12. ACM, 2009.
[20] Bin Liu, Anmol Sheth, Udi Weinsberg, Jaideep Chandrashekar, and Ramesh Govindan. AdReveal: improving transparency into online targeted advertising. In Proceedings of the Twelfth ACM Workshop on Hot Topics in Networks, page 12. ACM, 2013.
[21] Delfina Malandrino, Andrea Petta, Vittorio Scarano, Luigi Serra, Raffaele Spinelli, and Balachander Krishnamurthy. Privacy awareness about information leakage: Who knows what about me? In Proceedings of the 12th ACM workshop on Workshop on privacy in the electronic society, pages 279–284. ACM, 2013.
[22] Jonathan Mayer. Tracking the trackers: Where everybody knows your username. https://cyberlaw.stanford.edu/blog/2011/10/tracking-trackers-where-everybody-knows-your-username, October 2011.
[23] Jonathan R. Mayer and John C. Mitchell. Third-party web tracking: Policy and technology. In Security and Privacy (SP), 2012 IEEE Symposium on, pages 413–427. IEEE, 2012.
[24] Aleecia M McDonald and Lorrie Faith Cranor. Survey of the use of Adobe Flash local shared objects to respawn
[25] Nick Nikiforakis, Alexandros Kapravelos, Wouter Joosen, Christopher Kruegel, Frank Piessens, and Giovanni Vigna. Cookieless monster: Exploring the ecosystem of web-based device fingerprinting. In Security and Privacy (SP), 2013 IEEE Symposium on, pages 541–555. IEEE, 2013.
[26] Mike Perry, Erinn Clark, and Steven Murdoch. The design and implementation of the Tor browser [draft]. https://www.torproject.org/projects/torbrowser/design, March 2013.
[27] Franziska Roesner, Tadayoshi Kohno, and David Wetherall. Detecting and defending against third-party tracking on the web. In 9th USENIX Symposium on Networked Systems Design and Implementation, 2012.
[28] Ashkan Soltani, Shannon Canty, Quentin Mayo, Lauren Thomas, and Chris Jay Hoofnagle. Flash cookies and privacy. In AAAI Spring Symposium: Intelligent Information Privacy Management, 2010.
[29] Ashkan Soltani, Andrea Peterson, and Barton Gellman. NSA uses Google cookies to pinpoint targets for hacking. http://www.washingtonpost.com/blogs/the-switch/wp/2013/12/10/nsa-uses-google-cookies-to-pinpoint-targets-for-hacking, December 2013.
[30] Minh Tran, Xinshu Dong, Zhenkai Liang, and Xuxian Jiang. Tracking the trackers: Fast and scalable dynamic analysis of web content for privacy violations. In Feng Bao, Pierangela Samarati, and Jianying Zhou, editors, Applied Cryptography and Network Security, volume 7341 of Lecture Notes in Computer Science, pages 418–435. Springer Berlin Heidelberg, 2012.
[31] Jennifer Valentino-Devries, Jeremy Singer-Vine, and Ashkan Soltani. What they know. http://online.wsj.com/public/page/what-they-know-digital-privacy.html, 2012.
[32] Craig E. Wills and Can Tatar. Understanding what they do with what they know. In Proceedings of the 2012 ACM Workshop on Privacy in the Electronic Society, pages 13–18. ACM, 2012.
[33] Michal Zalewski. Rapid history extraction through non-destructive cache timing (v8). http://lcamtuf.coredump.cx/cachetime/. Accessed: 2014.
[34] Yuchen Zhou and David Evans. Why aren’t http-only cookies more widely deployed. In Proceedings of 4th Web 2.0 Security and Privacy Workshop, 2010.
A. Appendices
A.1. Modeling cookie expiration time
We limited our analysis to cookies with a three-month-plus lifespan as an efficient way to ensure that tracking cookies survived for the duration of the user’s browsing. But to more accurately model cookie expiration and its effect on the growth of the GCC, we use the AOL dataset to map the dates on which we collected data to the original timestamps spanning the actual three months’ worth of search queries from 2006. We consider the timestamps of the original visits as if we were visiting those pages in real time, and from there attempt to model the effect of the cycle of a cookie being set and reset on the growth of the giant connected component.
When we first encounter a cookie under this model, we use the lifespan of the cookie to determine how far in the future the cookie will persist. Any web page having this cookie that is encountered before the end of the cookie’s lifespan will be connected to other sites having that same cookie in the same lifespan. If eventually the simulated time of the crawl progresses past the end of the cookie’s life and the same cookie is encountered again, we must presume that the unique identifier in the cookie would be reset, preventing us from connecting this new cookie to past instances of the same cookie.
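The set-and-reset cycle described above can be sketched as follows. This is one reading of the model, not our actual analysis code: the record layout (timestamp, page, cookie key, lifespan, all in seconds) and the function name `assign_cookie_instances` are hypothetical, and the sketch assumes the expiry is fixed when an identifier is first seen rather than extended on later visits.

```python
def assign_cookie_instances(visits):
    """Label each visit with the cookie instance observed on it.

    `visits` is an iterable of (timestamp, page, cookie_key, lifespan)
    tuples in seconds; this record layout is hypothetical. Visits that
    share a (cookie_key, instance) label can be linked to one another.
    """
    current = {}  # cookie_key -> (instance number, expiry time)
    labelled = []
    for timestamp, page, cookie_key, lifespan in sorted(visits):
        instance, expiry = current.get(cookie_key, (0, None))
        if expiry is None or timestamp > expiry:
            # First sighting, or the previous identifier has expired:
            # assume a fresh identifier that cannot be tied to old ones.
            instance += 1
            expiry = timestamp + lifespan
            current[cookie_key] = (instance, expiry)
        labelled.append((page, cookie_key, instance))
    return labelled


DAY = 86400
visits = [
    (0 * DAY, "a.com", "tracker.net/id", 10 * DAY),
    (5 * DAY, "b.com", "tracker.net/id", 10 * DAY),
    (20 * DAY, "c.com", "tracker.net/id", 10 * DAY),
]
labels = assign_cookie_instances(visits)
```

In this example a.com and b.com share instance 1 and can be connected, while the visit to c.com falls after the expiry and starts instance 2, so it contributes no edge to the earlier pages.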