Automating social graph de-anonymization
3.1 Introduction
A number of the serious privacy cases already discussed, such as the Netflix scandal, have shown that anonymization of social networks is much harder than it looks. Rich datasets are often published for research purposes with only casual attempts to anonymize them. Research in de-anonymization has also seen an upswing [4, 87, 117, 120], leading to high-profile data releases being followed by high-profile privacy breaches. These developments have forced organizations to make some effort to better anonymize the released data. However, distorting data to achieve this contradicts the very purpose of a release, since it damages utility. So how hard can it be to re-identify users? In this chapter we present a generic and automated approach to re-identifying nodes in anonymized social networks, which enables novel anonymization techniques to be evaluated quickly. It uses a machine-learning model to match pairs of nodes in disparate anonymized subgraphs. Social network graphs in particular are high-dimensional and feature-rich data sets, and it is extremely hard to preserve their anonymity. Thus, any anonymization scheme has to be evaluated in detail, including those with a sound theoretical basis [61]. As discussed in §§2.4 to 2.6, many techniques have been proposed to resist de-anonymization; however, Dwork and Naor have shown [121] that preserving the privacy of an individual whose data is released cannot be achieved in general. The resulting uncertainty makes mass data release a very tricky proposition, especially from the perspective of data subjects.
Ad-hoc vs generic. It has been conclusively demonstrated that merely removing identifiers in social network datasets is not sufficient to guarantee privacy. Despite these results, data practitioners continue to propose anonymization strategies in the hope that they can resist de-anonymization “in practice”, such as the ones used to protect datasets from the Data for Development (D4D) challenge. This has led to a cat-and-mouse game: data practitioners devise new anonymization variants by tweaking simple building blocks like sampling, deleting nodes or edges, or injecting random ones. Then privacy researchers devise ad-hoc de-anonymization attacks to break the new variants.
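To make these building blocks concrete, the following is a minimal sketch of how such an anonymization variant might be assembled: identifiers are replaced by random pseudonyms, a fraction of edges is deleted, and random edges are injected. The function names (`normalise`, `pseudonymise`, `perturb`) and the parameters `delete_frac` and `inject_frac` are hypothetical and chosen only for illustration; this is not the scheme used for any particular dataset.

```python
import random


def normalise(edges):
    """Store undirected edges as sorted tuples, dropping self-loops and duplicates."""
    return {tuple(sorted(e)) for e in edges if e[0] != e[1]}


def pseudonymise(edges, nodes, seed=0):
    """Replace node identifiers with a random permutation of integer labels."""
    rng = random.Random(seed)
    labels = list(range(len(nodes)))
    rng.shuffle(labels)
    mapping = dict(zip(sorted(nodes), labels))
    relabelled = {tuple(sorted((mapping[u], mapping[v]))) for u, v in normalise(edges)}
    return relabelled, mapping


def perturb(edges, nodes, delete_frac=0.1, inject_frac=0.1, seed=0):
    """Delete a fraction of the edges, then inject random edges absent from the original."""
    rng = random.Random(seed)
    original = sorted(normalise(edges))
    original_set = set(original)
    node_list = sorted(nodes)

    # Edge deletion: keep a uniformly random subset of the original edges.
    n_delete = int(delete_frac * len(original))
    kept = set(rng.sample(original, len(original) - n_delete))

    # Edge injection: never ask for more new edges than there are non-edges.
    max_edges = len(node_list) * (len(node_list) - 1) // 2
    n_inject = min(int(inject_frac * len(original)), max_edges - len(original))
    injected = set()
    while len(injected) < n_inject:
        u, v = rng.sample(node_list, 2)
        candidate = tuple(sorted((u, v)))
        if candidate not in original_set:
            injected.add(candidate)

    return kept | injected
```

Each tweak to `delete_frac`, `inject_frac`, or the order of the steps yields a nominally "new" scheme, which is why evaluating every variant by hand does not scale.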
Unraveling each anonymization technique manually requires considerable effort and time, and each attack can be defeated by a small tweak to the anonymization strategy, often by destroying the specific features the attack exploited. Tailoring attacks to specific scenarios [3] highlights the problem of anonymization. However, it does little to deter future attempts to formulate “novel” anonymization techniques. Additionally, the expense involved in evaluating each new scheme cannot be amortized.
We need generic de-anonymization techniques that will allow cheap and timely evaluation of novel anonymization schemes, and eliminate large classes of weak ones. In this chapter, we demonstrate the efficacy of automated de-anonymization attacks on real-world anonymization schemes. They automatically uncover artefacts remaining after anonymization that enable re-identification of nodes in social networks. Our automated attacks can be used quickly and cheaply to screen “novel” anonymization schemes.
Our contributions. The key contribution of this work is to cast the problem of de-anonymization in social networks as a learning problem, and show that an automated learning algorithm can be used to evaluate a variety of social network anonymization strategies. Specifically, we:
• Formulate the problem of de-anonymization in social networks as a learning task. From a set of examples of known correspondences between nodes (training data) we wish to learn a good de-anonymization model (§ 3.3.1).
• Describe a non-parametric learning algorithm tailored to the de-anonymization learning problem in social graphs. The algorithm is based on random decision forests, with custom features that match social network nodes and a granular graph structure-based metric to capture the likelihood of node re-identification (§§ 3.3.2 to 3.3.4).
• Evaluate the learning algorithm on a real-world de-anonymization task from the D4D challenge, and compare it with an ad-hoc approach (§ 3.4.4).
• Show that the algorithm and model learn sufficient information about the anonymization algorithm, rather than the specific dataset anonymized, to be useful when de-anonymizing social networks of a different nature than the ones used for training (§ 3.4.3).
• Apply the automated learning algorithm to a standard problem [5] of de-anonymizing nodes across social networks. It performs well, even when only a very small number of examples are used to train it (§§ 3.4.4 and 3.4.5).
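The learning-task formulation in the first contribution can be illustrated with a small sketch that turns known node correspondences into labelled training examples for a pairwise classifier. The features used here (node degree plus a coarse histogram of neighbour degrees) and the helper names `adjacency`, `node_features` and `pair_examples` are assumptions made for illustration only; the features actually used are described in §§ 3.3.2 to 3.3.4.

```python
from itertools import product


def adjacency(edges):
    """Build an undirected adjacency map from a list of (u, v) edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def node_features(adj, node, bins=(1, 2, 4, 8)):
    """Illustrative features: the node's degree and a histogram of its neighbours' degrees."""
    hist = [0] * (len(bins) + 1)
    for neighbour in adj[node]:
        d = len(adj[neighbour])
        hist[sum(d > b for b in bins)] += 1  # index of the bucket that d falls into
    return [len(adj[node])] + hist


def pair_examples(adj_a, adj_b, known):
    """Build (X, y) from known correspondences.

    `known` maps a node in graph A to its true counterpart in graph B.
    Every combination of a known A-node and a known B-node becomes one
    example, labelled 1 if the two are the same individual and 0 otherwise.
    """
    X, y = [], []
    for a, b in product(known, known.values()):
        fa = node_features(adj_a, a)
        fb = node_features(adj_b, b)
        X.append([abs(p - q) for p, q in zip(fa, fb)])
        y.append(1 if known[a] == b else 0)
    return X, y


# Usage: two copies of a star graph under different labels.
adj_a = adjacency([(0, 1), (0, 2), (0, 3)])
adj_b = adjacency([("c", "x"), ("c", "y"), ("c", "z")])
X, y = pair_examples(adj_a, adj_b, {0: "c", 1: "x"})
```

The resulting `X` and `y` could then be passed to any off-the-shelf classifier, for example a random forest, which is trained to predict whether a pair of nodes refers to the same individual.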
The work presented in this chapter is based on the paper titled An Automated Social Graph De-anonymization Technique [122], published in the Proceedings of the 13th Workshop on Privacy in the Electronic Society (WPES 2014), and was carried out in collaboration with George Danezis. I did all the day-to-day work while George acted as a senior academic advisor. George proposed the idea of using machine learning to de-anonymize social networks. The design of the system evolved through our discussions. I wrote the code and conducted all the experiments. The first draft of the paper was written by me and was subsequently reviewed by George.