Non-Parallel Methods for Scalable Data Anonymization

used to achieve the privacy constraints specified by most of the other transaction data anonymization models, such as k^m-anonymity [114], (h, k, p)-coherence [132], complete k-anonymity [51], and ρ-uncertainty [12]. For example, specifying privacy constraints that consist of all itemsets of size m using PS-rules, and setting RBAT to guard against identity disclosure only, achieves k^m-anonymity. We therefore consider the parallelization of RBAT in this work.
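To make the k^m-anonymity requirement mentioned above concrete, the following is a minimal sketch of a checker. It assumes the usual reading of the model: every itemset of up to m items that occurs in some transaction must be contained in at least k transactions. The function name `is_km_anonymous` and the toy transactions are hypothetical illustrations, not part of RBAT or of any cited method.

```python
from collections import Counter
from itertools import combinations


def is_km_anonymous(transactions, k, m):
    """Return True if every itemset of size <= m that occurs in some
    transaction is contained in at least k transactions."""
    support = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        # Count every itemset of size 1..m contained in this transaction.
        for size in range(1, m + 1):
            for itemset in combinations(items, size):
                support[itemset] += 1
    return all(count >= k for count in support.values())


# Hypothetical toy data: item "c" appears in only one transaction.
raw = [{"a", "b"}, {"a", "b"}, {"a", "c"}]
# The same data after generalizing "b" and "c" to the item "(b,c)".
generalized = [{"a", "(b,c)"}, {"a", "(b,c)"}, {"a", "(b,c)"}]

print(is_km_anonymous(raw, k=2, m=1))          # False
print(is_km_anonymous(generalized, k=2, m=2))  # True
```

The check enumerates all itemsets explicitly, so it is only practical for small m and short transactions; it is meant to illustrate the constraint, not to scale.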

2.2 Non-Parallel Methods for Scalable Data Anonymization

The scalable anonymization of large transaction data has been considered in centralised settings, where data indexing and sampling are two commonly used techniques. One approach taken by existing methods is disk-based processing with external indexing. Iwuchukwu et al. [55] proposed applying spatial indexing to k-anonymize data. The proposed method uses a multi-dimensional R-tree in which each node holds a generalized representation of the records beneath it. A path from the root node to a leaf node yields the set of records in the dataset that satisfy the generalization constraints imposed along that path. For example, consider 3-anonymizing the relational dataset shown in Table 2.5. A possible R-tree for this dataset is shown in Figure 2.2.
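The idea of reading generalizations off index nodes can be sketched with a single level of leaf nodes. The code below is a simplified sort-and-pack illustration, not the R-tree algorithm of Iwuchukwu et al. [55]: it assumes records are ordered by Age, packed into leaves of at least k records, and that each leaf's bounding box serves as the generalized representation. The helper names `pack_leaves` and `bounding_box` are hypothetical, and the records are those of Table 2.5.

```python
# Records of Table 2.5.
TABLE_2_5 = [
    {"ID": 1, "Age": 21, "Sex": "F"},
    {"ID": 2, "Age": 22, "Sex": "F"},
    {"ID": 3, "Age": 35, "Sex": "F"},
    {"ID": 4, "Age": 36, "Sex": "M"},
    {"ID": 5, "Age": 45, "Sex": "M"},
    {"ID": 6, "Age": 55, "Sex": "M"},
]


def pack_leaves(records, k):
    """Sort records by Age and pack them into leaves of k records each.
    An undersized final leaf is merged into its predecessor, so every
    leaf holds at least k records whenever len(records) >= k."""
    ordered = sorted(records, key=lambda r: r["Age"])
    leaves = [ordered[i:i + k] for i in range(0, len(ordered), k)]
    if len(leaves) > 1 and len(leaves[-1]) < k:
        last = leaves.pop()
        leaves[-1].extend(last)
    return leaves


def bounding_box(leaf):
    """Generalized representation of a leaf: Age range and set of Sex values."""
    ages = [r["Age"] for r in leaf]
    return {"Age": (min(ages), max(ages)), "Sex": sorted({r["Sex"] for r in leaf})}


leaves = pack_leaves(TABLE_2_5, k=3)
boxes = [bounding_box(leaf) for leaf in leaves]
for box in boxes:
    print(box)
# {'Age': (21, 35), 'Sex': ['F']}
# {'Age': (36, 55), 'Sex': ['M']}
```

Each record would be published with the bounding box of its leaf, so every published combination of Age range and Sex is shared by at least three records. A real R-tree adds internal nodes and supports incremental inserts, which this sketch omits.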

LeFevre et al. [63] also proposed a technique for scaling an existing generalization method, Mondrian [64], to datasets larger than the available memory. The algorithm is based on the decision tree construction method RainForest [37]. Starting with all attribute values generalised to the root, the algorithm scans the input dataset D to collect statistics (which depend on the split criteria) and creates a frequency group. It then chooses an allowable split attribute based on this frequency group. D is scanned again to create m partitions based on the split attribute. Partitions that are larger than the available memory are written to disk. The process is recursively repeated


| Method | Search method | Techniques | Privacy model | Attacks | Hierarchy-based | Privacy assumptions |
|---|---|---|---|---|---|---|
| RBAT [72] | Top-down | Global generalization | PS-rules | Both | no | User-specified constraints |
| AA [114] | Bottom-up | Global generalization | k^m-anonymity | Identity disclosure | yes | All m-sized itemsets need to be protected |
| Greedy [132] | Bottom-up | Local generalization | k^m-anonymity | Identity disclosure | yes | All m-itemsets are to be protected |
| Anonymize [51] | Top-down | Local generalization | k-anonymity | Identity disclosure | yes | All itemsets need to be protected |
| LRA [115] | Bottom-up | Local generalization | k^m-anonymity | Identity disclosure | yes | All m-itemsets are to be protected |
| mHgHs [70] | Top-down | Global generalization + suppression | k^m-anonymity | Identity disclosure | yes | All m-itemsets are to be protected |
| TDControl [12] | Top-down | Global generalization + suppression | ρ-uncertainty | Sensitive itemset disclosure | yes | All sensitive association rules must be protected |
| COAT [73] | Iterative | Global generalization + suppression | Privacy and utility specification | Identity disclosure | no | Utility requirements specification |


Figure 2.2: An example R-tree for k-anonymous data

| ID | Age | Sex |
|---|---|---|
| 1 | 21 | F |
| 2 | 22 | F |
| 3 | 35 | F |
| 4 | 36 | M |
| 5 | 45 | M |
| 6 | 55 | M |

Table 2.5: An example dataset T

in a depth-first manner until no split is possible. Using such disk-based approaches with high-speed disks may address the problem of anonymizing data larger than main memory, but their performance is bounded by the capabilities of current hardware, such as disk I/O speed. For example, frequently accessing large amounts of data from disk may lead to a performance bottleneck. Resource use also remains limited to the processing and storage capacity of a single machine. Therefore, disk-based methods may not scale well as data volumes continue to grow.
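The recursive, depth-first partitioning described above can be sketched in memory as follows. This is a simplified Mondrian-style illustration under stated assumptions, not the out-of-core algorithm of LeFevre et al. [63]: it handles numeric attributes only, uses a median split, prefers the attribute with the widest value range, and does not write partitions to disk. The function name `mondrian_partitions` is hypothetical; the ages are those of Table 2.5.

```python
def mondrian_partitions(records, k, attrs):
    """Recursively split records depth-first and return the final partitions.
    A split is allowable only if both sides keep at least k records."""
    def value_range(attr):
        values = [r[attr] for r in records]
        return max(values) - min(values)

    # Try attributes in order of decreasing value range (an assumed heuristic).
    for attr in sorted(attrs, key=value_range, reverse=True):
        values = sorted(r[attr] for r in records)
        median = values[(len(values) - 1) // 2]
        left = [r for r in records if r[attr] <= median]
        right = [r for r in records if r[attr] > median]
        if len(left) >= k and len(right) >= k:
            # Depth-first: fully partition the left side, then the right side.
            return (mondrian_partitions(left, k, attrs)
                    + mondrian_partitions(right, k, attrs))
    return [records]  # no allowable split remains


ages = [{"Age": a} for a in (21, 22, 35, 36, 45, 55)]
partitions = mondrian_partitions(ages, k=2, attrs=["Age"])
ranges = [(min(r["Age"] for r in p), max(r["Age"] for r in p)) for p in partitions]
print(ranges)  # [(21, 35), (36, 55)]
```

With k = 2 the first median split yields two groups of three records; neither group can be split again, because a further median split would leave a single record on one side. In the disk-based variant, the counting pass and the partitioning pass would each be a scan over the dataset rather than an in-memory list comprehension.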

Sampling techniques have also been considered by some works [63, 65, 75]. LeFevre et al. [63] proposed the use of random sampling for scalable k-anonymization


Figure 2.3: An example of 2-anonymous data using Mondrian

of relational data. The method scans the input data, generates a random sample that fits in the available memory, and applies the Mondrian method to that sample. The sample is used to construct the partition tree by choosing the allowable splits. For example, consider a sample of six records used to create a partition tree as shown in Figure 2.3. The data sample is first partitioned vertically across the sex attribute and then horizontally across the age attribute. This creates three different partitions satisfying 2-anonymity. The partition tree built from the sample is then used to anonymize the original data, and any splits that violate the privacy constraints are undone.

Most closely related to our work is the sampling-based method proposed by Loukides et al. [75] to anonymize transaction data. The algorithm uses top-down specialization. It selects a random sample of pre-determined size. Starting with all items mapped to the most generalized item, it recursively performs specializations until no specialization can be made without violating a privacy constraint. Each time a specialization operation is performed, the privacy constraints are checked using the sample. The set of generalizations obtained from the sample is then revised using top-down and bottom-up cut-revision phases. The top-down revision phase attempts to further specialize the generalized items in order to find a solution with the same privacy level but better utility. The bottom-up cut-revision phase ensures the privacy protection

In document Anonymizing large transaction data using MapReduce (Page 35-39)