The “Means likely to be used” test - The Limits of Anonymisation

Chapter 4. Informational Privacy in EU Law: Challenges in Data Protection and Privacy

A. Challenges to EU Data Protection Law

I. The Limits of Anonymisation

1. The “Means likely to be used” test

The GDPR addresses anonymisation in Recital 26, which excludes anonymised data from its scope:

“To determine whether a natural person is identifiable, account should be taken of all the means reasonably likely to be used, such as singling out, either by the controller or by another person to identify the natural person directly or indirectly.

To ascertain whether means are reasonably likely to be used to identify the natural person, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments.

The principles of data protection should therefore not apply to anonymous information, namely information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable.” [425]

[423] GDPR, Recital 26.

[424] Article 29 Working Party (2014), Opinion 05/2014 on Anonymisation Techniques, WP216, adopted on 10 April 2014.

The additional test that Recital 26 of the GDPR sets for identifiability, “the means reasonably likely to be used”, shows that the regulators are aware of an ever-present technical risk: anonymisation is not a single barrier that stops all possible identification attempts, but rather a set of technical practices that protect data from identification to a certain (or uncertain) extent. [426] If an entity resorts to “unreasonable” efforts to achieve identification, the failure to prevent it is not a violation under the GDPR.

Finding the point at which anonymisation becomes sufficiently protective can be challenging. For the Article 29 Working Party, even de-identified data remains personal if it can be attributed to one person, whether or not any other characteristics apply; the only way to make that type of data anonymous would be to aggregate it to the point where no identification is possible. [427] Essentially, for the Article 29 Working Party, there is no such thing as a non-personal dataset created from only one data subject. [428] This means that as long as the data set is derived from a single individual, it constitutes personal data, no matter how anonymised or innocuous it appears.

Even when data is anonymised, there is still a possibility of de-anonymisation, or of re-identifying the individual from other, less obvious pieces of data. This makes it very difficult to turn personal data into “non-personal” data falling outside the scope of the GDPR.

Anonymisation is only effective at limiting identification when the effort needed to de-anonymise data is prohibitively time- or resource-intensive compared to the effort needed to anonymise that data in the first place. However, over the last couple of decades, research into anonymisation has shown that true anonymisation is impossible, and the means of re-identification are becoming easier to obtain. [429] This puts anonymisation in a difficult situation, as the practice is widely used to protect personal information, especially in the medical and research contexts, [430] but cannot be completely trusted.

[425] GDPR, Recital 26.

[426] Bayardo, R. J., & Agrawal, R. (2005, April). Data privacy through optimal k-anonymization. In Data Engineering, 2005. ICDE 2005. IEEE.

[427] Article 29 Working Party, Opinion 05/2014 on Anonymisation Techniques, WP216.

[428] Article 29 Working Party, Opinion 05/2014 on Anonymisation Techniques, WP216.

[429] Ohm, P. (2010), Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization. UCLA Law Review, Vol. 57, p. 1701; U of Colorado Law Legal Studies Research Paper No. 9-12. Available at SSRN: http://ssrn.com/abstract=1450006

The test based on the “means reasonably likely to be used” provides some flexibility, since anonymisation only needs to protect data up to a certain threshold. However, the shift in the balance of power between anonymisation and de-anonymisation means that, under this test, applying anonymisation technology does not guarantee that the data can no longer be linked to an individual. As such, it does not change the legal nature of the data and is simply a security measure.

“Anonymisation” is not a single process which, when applied, “increases anonymity”. Instead, it is a wide array of different techniques, each with its own advantages, flaws, and methods. For example, “k-anonymity” aims to ensure that the information of each person in the data set cannot be distinguished from that of at least k-1 other individuals in that data set, but achieving that (or coming as close as possible, as no anonymisation is perfect) can be done in multiple ways.
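To make the idea of k-anonymity more concrete, the following is a minimal Python sketch, not a production anonymisation tool. The records, the choice of quasi-identifiers (ZIP code, birth year, gender) and the function names `k_of`, `exact`, `by_decade` and `by_region` are all hypothetical and chosen only for illustration. It measures k as the size of the smallest group of records that share the same quasi-identifier values, and shows two different generalisations that both raise k from 1 to 2.

```python
from collections import Counter

# Hypothetical toy records: (zip_code, birth_year, gender, diagnosis).
# The first three fields are treated as quasi-identifiers.
RECORDS = [
    ("30047", 1944, "F", "asthma"),
    ("30047", 1946, "F", "diabetes"),
    ("30044", 1971, "M", "flu"),
    ("30044", 1975, "M", "asthma"),
    ("30043", 1988, "F", "flu"),
    ("30043", 1983, "F", "diabetes"),
]


def k_of(records, quasi_identifier):
    """Return k: the size of the smallest group sharing one quasi-identifier."""
    groups = Counter(quasi_identifier(record) for record in records)
    return min(groups.values())


def exact(record):
    """No generalisation: ZIP code, birth year and gender as recorded."""
    zip_code, birth_year, gender, _ = record
    return (zip_code, birth_year, gender)


def by_decade(record):
    """One route: keep the ZIP code and gender, round birth year to its decade."""
    zip_code, birth_year, gender, _ = record
    return (zip_code, birth_year // 10 * 10, gender)


def by_region(record):
    """Another route: keep a 3-digit ZIP prefix and gender, drop birth year."""
    zip_code, _, gender, _ = record
    return (zip_code[:3], gender)


for name, view in [("exact", exact), ("by_decade", by_decade), ("by_region", by_region)]:
    print(name, "k =", k_of(RECORDS, view))
```

Both generalisations reach the same k on this toy data, but they discard different information, which illustrates why the choice of technique is a trade-off rather than a single dial.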

Because of this, re-identification can also be carried out in various ways. The main difficulty is that, as long as the data has any relation to the individual (which is always the case, since the data is generated by the individual’s behaviour), aggregating enough other pieces of data will surface some that may allow re-identification. Because that process is unpredictable, efforts to ensure true anonymity are never perfect.

An example of how unpredictable the “means likely to be used” can be is the AOL search data incident. [431] In 2006, AOL released twenty million search queries for researchers to mine. At that point, AOL considered the data fully and irreversibly anonymised, meaning that it did not have the “means likely to be used” to re-identify it. [432] However, as the company quickly realised, this did not amount to anonymisation: reporters from the New York Times showed that they could easily re-identify parts of the information. They were able to identify an individual based on her searches:

“[S]earch by search, click by click, the identity of AOL user No. 4417749 became easier to discern. There are queries for "landscapers in Lilburn, Ga," several people with the last name Arnold and "homes sold in shadow lake subdivision gwinnett county georgia."

It did not take much investigating to follow that data trail to Thelma Arnold, a 62-year-old widow who lives in Lilburn, Ga., frequently researches her friends' medical ailments and loves her three dogs. "Those are my searches," she said, after a reporter read part of the list to her.” [433]

[430] Ibid.

[431] Schwartz (n.17).

[432] Ibid.

Another example involved Netflix, now a popular streaming service, in a study by Arvind Narayanan and Vitaly Shmatikov. [434] At the time, Netflix was still a DVD rental business. Their research demonstrated that some people in a supposedly anonymised data set could be identified through their ratings on the website IMDb. In the researchers' words: “Given a user's public IMDb ratings, which the user posted voluntarily to selectively reveal some of his […] movie likes and dislikes, we discover all the ratings that he entered privately into the Netflix system, presumably expecting that they will remain private.” [435]
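The general shape of such a linkage attack can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the algorithm used by Narayanan and Shmatikov: the data, the pseudonymous IDs, the scoring rule and the parameters `tolerance` and `min_margin` are all assumptions made for the example. The idea is only that a handful of publicly posted ratings can single out one record in a released data set, which then exposes the ratings that were never made public.

```python
# Hypothetical "anonymised" release: pseudonymous ID -> {title: rating}.
RELEASED = {
    "user_001": {"Film A": 5, "Film B": 2, "Film C": 4, "Film D": 1},
    "user_002": {"Film A": 3, "Film B": 3, "Film E": 5},
    "user_003": {"Film C": 4, "Film D": 5, "Film E": 2, "Film F": 4},
}

# Hypothetical public profile: ratings a person posted under their own name.
PUBLIC_PROFILE = {"Film A": 5, "Film C": 4, "Film D": 1}


def overlap_score(public, private, tolerance=1):
    """Count titles rated in both profiles with ratings within `tolerance`."""
    return sum(
        1
        for title, rating in public.items()
        if title in private and abs(private[title] - rating) <= tolerance
    )


def best_match(public, released, min_margin=2):
    """Return the best-matching pseudonymous ID, or None if the match is ambiguous.

    Assumes `released` holds at least two records, so that a runner-up exists.
    """
    scored = sorted(
        ((overlap_score(public, ratings), uid) for uid, ratings in released.items()),
        reverse=True,
    )
    (top, uid), (runner_up, _) = scored[0], scored[1]
    # Only claim a match when the best candidate clearly beats the next one.
    return uid if top - runner_up >= min_margin else None


match = best_match(PUBLIC_PROFILE, RELEASED)
if match is not None:
    # Ratings present in the release but absent from the public profile.
    exposed = {t: r for t, r in RELEASED[match].items() if t not in PUBLIC_PROFILE}
    print(match, exposed)
```

Requiring a margin over the runner-up is one simple way to avoid claiming a match when the public profile is too thin to distinguish between records.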

According to a study by computer science professor Latanya Sweeney, a ZIP code, birth date, and gender are enough to uniquely identify 87% of Americans. [436] Given that unexpected correlations can emerge from unexpected places, and that no algorithm or program can take into account every piece of data and everything that can be deduced from it, the idea that any data is truly “anonymous” becomes doubtful.

As we have seen, anonymisation is a set of techniques aimed at ensuring that individuals cannot be identified while still allowing value to be extracted from the data. It is not perfect, however, and it is getting weaker as re-identification technology improves.

In document The information / guarantees balance - protecting informational privacy interests within the European data protection framework. (Page 116-119)