Prevention of Attribute Disclosure Using Privacy – Preserving Publishing of Microdata


P.Anusha, IJRIT 156

IJRIT International Journal of Research in Information Technology, Volume 1, Issue 8, August, 2013, Pg. 156-163

International Journal of Research in Information Technology (IJRIT)

www.ijrit.com

ISSN 2001-5569


¹P. Anusha, ²K. Sudheer Kumar, ³Dr. R.V. Krishnaiah

¹M.Tech Student, Dept. of CSE, DRK College of Engineering & Technology, Hyderabad, AP, India

²Associate Professor, Dept. of CSE, DRK College of Engineering & Technology, Hyderabad, AP, India

³Principal, Dept. of CSE, DRK Group of Institutions, Hyderabad, Andhra Pradesh, India

¹[email protected], ²[email protected], ³[email protected]

Abstract

Privacy-preserving publishing of microdata that contains sensitive information is a challenging task, and anonymization techniques are used to achieve it. Existing anonymization techniques such as l-diversity, k-anonymity, generalization and bucketization make privacy-preserving publishing of microdata possible. However, these techniques either lose information or are exposed to membership disclosure. Recently, Tiancheng Li et al. proposed a technique known as slicing, which preserves better data utility and can handle high-dimensional data. It can also be used to prevent attribute disclosure. In this paper we implement the slicing approach and build a prototype application that demonstrates the proof of concept. The empirical results show that the prototype can be used for privacy-preserving data publishing.

Index Terms – Data mining, anonymization, privacy-preserving publishing, and data security

1. Introduction

Microdata are data whose records contain information about individuals, organizations and other entities. Publishing such data in the public domain can disclose membership information. For this reason, privacy-preserving data publishing is essential and has recently become an important research topic. Many microdata anonymization techniques have been proposed. Some of the popular ones include l-diversity [1], bucketization [2], [3], [4], k-anonymity [5], generalization [5], [6], and slicing [7]. In all of these approaches, the attributes of a dataset are classified into three types. Some attributes are identifiers: they uniquely identify records and are therefore very sensitive in nature.

Examples of such attributes include social security number and employee ID. Other attributes are known as quasi-identifiers (QIs); they are not sensitive themselves, but they can be used to establish sensitive information. QIs might already be known to adversaries through other published data. Examples of QIs include date of birth, gender and zip code. The third category of attributes cannot identify records, but these attributes are sensitive in nature. They can reveal private information about individuals and cause privacy problems when disclosed. Examples of this third category, the sensitive attributes (SAs), include the salary of an employee and the disease of a patient. Protection against attribute disclosure is an important research area and the main focus of this paper.

In anonymization techniques such as bucketization and generalization, identifiers are removed from the data and the data is partitioned into buckets in the first phase. The techniques differ in the second phase. Generalization converts the QI values in each bucket into less specific but still meaningful values. Bucketization, on the other hand, clearly separates SAs from QIs by randomly permuting the SA values within each bucket. Both techniques thereby aim to prevent disclosure; as noted below, however, bucketization remains open to membership disclosure.
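The two-phase scheme described above can be sketched as follows. This is a minimal Python illustration, not the prototype's Java code; the toy records, the fixed-size bucketing, and the function names `partition`, `generalize` and `bucketize` are all assumptions made for the example.

```python
import random

# Toy records: (age, sex, zipcode, disease). Identifiers such as names are
# assumed to have been removed already. All values are illustrative.
RECORDS = [
    (22, "M", "47906", "dyspepsia"),
    (22, "F", "47906", "flu"),
    (33, "F", "47905", "flu"),
    (52, "F", "47905", "bronchitis"),
]


def partition(records, bucket_size):
    """Phase 1 (shared): split the table into buckets of consecutive tuples."""
    return [records[i:i + bucket_size] for i in range(0, len(records), bucket_size)]


def generalize(bucket):
    """Phase 2a: replace each QI value by a less specific value for the bucket."""
    ages = [r[0] for r in bucket]
    sexes = {r[1] for r in bucket}
    zips = [r[2] for r in bucket]
    age_range = f"[{min(ages)}-{max(ages)}]"
    sex = sexes.pop() if len(sexes) == 1 else "*"
    # Keep the longest common zipcode prefix and mask the remaining digits.
    prefix = 0
    while prefix < 5 and len({z[:prefix + 1] for z in zips}) == 1:
        prefix += 1
    zipcode = zips[0][:prefix] + "*" * (5 - prefix)
    return [(age_range, sex, zipcode, r[3]) for r in bucket]


def bucketize(bucket, rng):
    """Phase 2b: keep QI values exact but permute the SA values in the bucket."""
    sensitive = [r[3] for r in bucket]
    rng.shuffle(sensitive)
    return [(r[0], r[1], r[2], s) for r, s in zip(bucket, sensitive)]
```

With a single bucket of four tuples, `generalize` turns every QI triple into `("[22-52]", "*", "4790*")`, whereas `bucketize` leaves the QI values untouched and only shuffles which disease appears on which row.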

Experiments on generalization [8], [9] revealed that it loses a considerable amount of information when high-dimensional data is anonymized. The reasons are the curse of dimensionality, the uniform-distribution assumption, and the loss of correlations between attributes. Compared with generalization, bucketization [3], [4] has better data utility. However, it has several drawbacks: it does not prevent membership disclosure; it requires a clear separation between SAs and QIs, which is not easy to establish in many datasets; and it loses the correlation between attributes.

Tiancheng Li et al. [7] introduced a new data anonymization technique known as slicing, which improves the anonymization process for privacy-preserving data publishing. Slicing partitions the given data both vertically and horizontally. Vertical partitioning groups attributes into columns based on the correlations among the attributes, so that each column contains a set of highly correlated attributes. Horizontal partitioning, in the same fashion, groups tuples into buckets. Within each bucket, the values in each column are randomly permuted, which breaks the link between columns and helps guard against membership disclosure. Breaking the association across columns is the main focus of the slicing technique. This reduces the dimensionality of the data and improves data utility. Because slicing groups highly correlated attributes together, it preserves their correlations while still achieving privacy-preserving data publishing.
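To make the partitioning steps concrete, the following Python sketch slices a toy table. It is an illustration under stated assumptions rather than the algorithm of [7]: the column grouping is supplied by hand instead of being derived from attribute correlations, buckets are formed from consecutive tuples, and the names `slice_table`, `ATTRIBUTES` and `RECORDS` are hypothetical.

```python
import random

ATTRIBUTES = ("age", "sex", "zipcode", "disease")

# Illustrative records only; not taken from the experiments in this paper.
RECORDS = [
    (22, "M", "47906", "dyspepsia"),
    (22, "F", "47906", "flu"),
    (33, "F", "47905", "flu"),
    (52, "F", "47905", "bronchitis"),
]


def slice_table(records, columns, bucket_size, rng):
    """Return a list of buckets; each bucket holds one value list per column."""
    index = {name: i for i, name in enumerate(ATTRIBUTES)}
    sliced = []
    for start in range(0, len(records), bucket_size):
        bucket = records[start:start + bucket_size]      # horizontal partitioning
        bucket_columns = []
        for column in columns:                           # vertical partitioning
            values = [tuple(rec[index[a]] for a in column) for rec in bucket]
            rng.shuffle(values)                          # permute within the bucket
            bucket_columns.append(values)
        sliced.append(bucket_columns)
    return sliced


sliced = slice_table(
    RECORDS,
    columns=[("age", "sex"), ("zipcode", "disease")],
    bucket_size=2,
    rng=random.Random(1),
)
```

Each column is shuffled independently, so within a bucket the pairing between an (age, sex) value and a (zipcode, disease) value is no longer recoverable, while the values inside one column stay together.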

In this paper we implement slicing by developing a custom Java simulator that demonstrates the proof of concept. The remainder of this paper is structured as follows. Section 2 reviews the literature. Section 3 briefly describes the slicing technique. Section 4 presents the prototype implementation. Section 5 provides experimental results, and Section 6 concludes the paper.

2. Related Work

Two popular anonymization techniques in the literature are generalization and bucketization. Generalization [5], [6], [10] replaces an attribute value with a less specific value that is semantically consistent with it. Three types of encoding schemes exist for generalization: global recoding, regional recoding and local recoding. In global recoding [11], all occurrences of the same value are replaced by the same generalized value. Regional recoding [13] partitions the domain space into regions and represents each data point by the region that contains it. Local recoding [12] allows different occurrences of the same value to be generalized differently. Generalization has the following drawbacks. It fails on high-dimensional data because of the curse of dimensionality [8], and it involves too much information loss [2].
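The difference between the global and local schemes can be illustrated with a short Python sketch. The age hierarchy, the group assignment, and the function names `global_recode` and `local_recode` are assumptions chosen for the example; they are not taken from [11] or [12].

```python
def global_recode(values, hierarchy):
    """Global scheme: every occurrence of a value maps to the same general value."""
    return [hierarchy[v] for v in values]


def local_recode(values, group_ids):
    """Local scheme: each value becomes the [min-max] range of its own group,
    so two occurrences of the same value may be generalized differently."""
    ranges = {}
    for v, g in zip(values, group_ids):
        lo, hi = ranges.get(g, (v, v))
        ranges[g] = (min(lo, v), max(hi, v))
    return [f"[{ranges[g][0]}-{ranges[g][1]}]" for g in group_ids]


ages = [22, 33, 33, 52]

# Hypothetical fixed hierarchy used by the global scheme.
hierarchy = {22: "[20-39]", 33: "[20-39]", 52: "[40-59]"}
global_result = global_recode(ages, hierarchy)

# Hypothetical grouping of the four tuples into two groups of two.
local_result = local_recode(ages, group_ids=[0, 0, 1, 1])
```

Under the global scheme both occurrences of 33 become `[20-39]`; under the local scheme the first becomes `[22-33]` and the second `[33-52]`, because they fall into different groups.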

Bucketization [2], [3] is another anonymization technique; it partitions the records into buckets and then separates the SAs from the QIs. The SA values within each bucket are then randomly permuted to anonymize the data. This technique is also useful for high-dimensional data [14]. Its drawbacks are the assumed separation of SAs from QIs and the membership disclosure problem. In [7] the most recent anonymization technique, known as slicing, is proposed. Slicing differs from generalization and bucketization in that it partitions the data both horizontally and vertically. By keeping highly correlated attributes together in the same column, it preserves correlations among the attributes and achieves better anonymization and data utility.

Other recent anonymization techniques target transaction data, such as the k^m-anonymity model [15] proposed by Terrovitis et al. This model requires that, for any set of m or fewer items, the published data contain at least k transactions containing that set of items. This prevents adversaries from learning exact information from the published data. However, the model suffers from several drawbacks: it cannot prevent an adversary from learning additional information from the data; the absence of an item may also give information to the adversary; and it is not easy to set the value of m. In [16] k-anonymity is used with local recoding. Xu et al. [17] combined l-diversity and k-anonymity; however, this approach also assumes a clear separation of QIs and SAs.

Differential privacy [18], [19] is another technique that has recently attracted considerable research attention. It is concerned mainly with answering queries on statistical data rather than with privacy-preserving data publishing. As noted earlier, slicing [7] is the approach implemented in this paper.
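The transaction-data requirement of Terrovitis et al. [15] discussed above (called k^m-anonymity in that work, where m bounds the size of the item sets considered) can be checked with a brute-force Python sketch. The function name `is_km_anonymous` and the sample transactions are hypothetical, and the sketch assumes that item sets occurring in no transaction are ignored; it is meant only to illustrate the condition, not to reproduce the algorithms of [15].

```python
from collections import Counter
from itertools import combinations


def is_km_anonymous(transactions, k, m):
    """True if every item set of size <= m that occurs in some transaction
    occurs in at least k transactions."""
    support = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for size in range(1, m + 1):
            for itemset in combinations(items, size):
                support[itemset] += 1
    return all(count >= k for count in support.values())


# Illustrative market-basket style data.
transactions = [
    {"a", "b"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
]

single_items_ok = is_km_anonymous(transactions, k=2, m=1)
pairs_ok = is_km_anonymous(transactions, k=2, m=2)
```

Every single item appears in at least two transactions, so the data passes for m = 1, but the pairs (a, c) and (b, c) each appear only once, so it fails for m = 2. This also shows why choosing m matters in practice.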


3. Slicing

As discussed in the introduction, the slicing technique proposed in [7] overcomes the drawbacks of the generalization and bucketization techniques. Slicing partitions data both horizontally and vertically. In the process it preserves attribute correlations while ensuring better data utility than generalization and bucketization. More details on the technique can be found in [7]; this paper focuses on its implementation. Input and output datasets of the slicing technique are presented in Fig. 1 for illustration.

Fig. 1 – Illustrates input and output pertaining to slicing technique (excerpts from [7])

As can be seen in Fig. 1 (a), the input dataset has the attributes age, sex, zipcode and disease. The output table is shown in Fig. 1 (b), where the horizontal and vertical partitioning performed by slicing is visible. The permutation of values protects against attribute disclosure, while the grouping of attributes into columns preserves attribute correlations and data utility.

4. Prototype Implementation

The prototype application is a custom simulator, written in Java, that demonstrates the proof of concept of slicing. The environment used to build it comprises the Java programming language, the NetBeans IDE (Integrated Development Environment) and a PC with 4 GB of RAM and a Core 2 Duo processor running the Windows XP operating system. The prototype is a web-based application. The algorithms used can be found in [7]. The main interface of the application is shown in Fig. 2.

Fig. 2 – The main UI of the application.

As can be seen in Fig. 2, the main application UI lets end users view the original data, which is not anonymized; view generalized data; view bucketized data; view data subjected to multiset-based generalization; perform one-attribute-per-column slicing; and view sliced data, besides general authentication-related operations. The original data before anonymization is shown in Fig. 3.

Fig. 3 – Original data

As can be seen in Fig. 3, the original data is shown in tabular format. When generalization is applied, the values in the columns are changed to less specific values. The result of generalization is shown in Fig. 4.

Fig. 4 – Illustrates generalized data

Fig. 4 shows the generalized data. Generalization helps in anonymization. The results of bucketization and the other techniques are not presented here due to space constraints.

Fig. 5 shows the result of slicing with a single attribute column.

Fig. 5 – Results of slicing (single attribute column)

As can be seen in Fig. 5, single-attribute-column slicing produces more semantically meaningful data with higher data utility while preserving the privacy of the published microdata. The result of the slicing algorithm proposed in [7] is shown in Fig. 6.

Fig. 6 – Results of Slicing Technique

As can be seen in Fig. 6, the slicing algorithm produces more semantically meaningful data with higher data utility while preserving the privacy of the published microdata.

5. Results

Several experiments were carried out on the generalization, bucketization and slicing techniques. All of these are anonymization techniques used for privacy-preserving data publishing, and they make the publishing of microdata possible. The experiments on the three techniques, using the OCC-7 and OCC-15 datasets, are summarized in the series of graphs below.

Fig. 7 – Learning the sensitive attribute (OCC-7 dataset)

As can be seen in Fig. 7, the horizontal axis represents the l value and the vertical axis represents classification accuracy. The figure gives the performance of the techniques in learning the sensitive attribute on the OCC-7 dataset.

[Chart: vertical axis Classification Accuracy (%), horizontal axis l value; series: Generalization, Bucketization, Slicing]

Fig. 8 – Learning the sensitive attribute (OCC-15 dataset)

As can be seen in Fig. 8, the horizontal axis represents the l value and the vertical axis represents classification accuracy. The figure gives the performance of the techniques in learning the sensitive attribute on the OCC-15 dataset.

Fig. 9 – Computational Efficiency (Cardinality)

As can be seen in Fig. 9, the horizontal axis represents cardinality and the vertical axis represents computational efficiency in seconds. The figure compares the computational efficiency of the techniques as the cardinality of the data varies.

Fig. 10 – Computational Efficiency (Dimensionality)

As can be seen in Fig. 10, the horizontal axis represents dimensionality and the vertical axis represents computational efficiency in seconds. The figure compares the computational efficiency of the techniques as the dimensionality of the data varies.

[Chart: vertical axis Classification Accuracy (%), horizontal axis l value; series: Generalization]

[Chart: vertical axis in seconds (printed label: Classification Accuracy(sec)), horizontal axis Cardinality; series: Generalization, Bucketization, Slicing]

[Chart: vertical axis Computational Efficiency (sec), horizontal axis Dimensionality; series: Generalization, Bucketization, Slicing]


6. Discussions and Future Work

This paper presents a new approach called slicing to privacy-preserving microdata publishing. Slicing overcomes the limitations of generalization and bucketization and preserves better utility while protecting against privacy threats. We illustrate how to use slicing to prevent attribute disclosure and membership disclosure. Our experiments show that slicing preserves better data utility than generalization and is more effective than bucketization in workloads involving the sensitive attribute. The general methodology proposed by this work is that, before anonymizing the data, one can analyze the data characteristics and use these characteristics in data anonymization.

The rationale is that one can design better data anonymization techniques when one knows the data better. In [20], [21], it is shown that attribute correlations can be used for privacy attacks. This work motivates several directions for future research. First, in this paper we consider slicing in which each attribute is in exactly one column. An extension is the notion of overlapping slicing, which duplicates an attribute in more than one column. This releases more attribute correlations. For example, in Table 1(f) of [7], one could choose to include the Disease attribute in the first column as well, so that the two columns are {Age, Sex, Disease} and {Zipcode, Disease}. This could provide better data utility, but the privacy implications need to be carefully studied and understood. It is interesting to study the tradeoff between privacy and utility [22]. Second, we plan to study membership disclosure protection in more detail. Our experiments show that random grouping is not very effective, and we plan to design more effective tuple grouping algorithms.

Third, slicing is a promising technique for handling high-dimensional data. By partitioning attributes into columns, we protect privacy by breaking the association of uncorrelated attributes and preserve data utility by preserving the association between highly correlated attributes. For example, slicing can be used for anonymizing transaction databases, which have been studied recently in [16], [18], [17]. Finally, while a number of anonymization techniques have been designed, how to use the anonymized data remains an open problem. In our experiments, we randomly generate the associations between the column values of a bucket, which may lose data utility. Another direction is to design data mining tasks that use the anonymized data [23] computed by various anonymization techniques.
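The random association of column values mentioned above can be sketched in Python as follows. This is an illustrative reading of that step, not the code used in the experiments; the function name `reconstruct` and the sample bucket are assumptions.

```python
import random


def reconstruct(sliced_bucket, rng):
    """Build full tuples from one sliced bucket by randomly pairing column values."""
    shuffled = []
    for column_values in sliced_bucket:
        values = list(column_values)   # copy so the published bucket is unchanged
        rng.shuffle(values)
        shuffled.append(values)
    # zip pairs the i-th value of every column; the pieces are then concatenated.
    return [sum(parts, ()) for parts in zip(*shuffled)]


# One bucket with two columns: (age, sex) and (zipcode, disease).
bucket = [
    [(22, "M"), (22, "F")],
    [("47906", "dyspepsia"), ("47906", "flu")],
]

tuples = reconstruct(bucket, random.Random(0))
```

Each reconstructed tuple is only one of the possible matchings inside the bucket, which is why a data mining task run on such tuples may see weaker cross-column correlations than the original table contained.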

7. References

[1] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam, “ℓ-Diversity: Privacy Beyond k-Anonymity,” Proc. Int’l Conf. Data Eng. (ICDE), p. 24, 2006.

[2] X. Xiao and Y. Tao, “Anatomy: Simple and Effective Privacy Preservation,” Proc. Int’l Conf. Very Large Data Bases (VLDB), pp. 139-150, 2006.

[3] D.J. Martin, D. Kifer, A. Machanavajjhala, J. Gehrke, and J.Y. Halpern, “Worst-Case Background Knowledge for Privacy-Preserving Data Publishing,” Proc. IEEE 23rd Int’l Conf. Data Eng. (ICDE), pp. 126-135, 2007.

[4] N. Koudas, D. Srivastava, T. Yu, and Q. Zhang, “Aggregate Query Answering on Anonymized Tables,” Proc. IEEE 23rd Int’l Conf. Data Eng. (ICDE), pp. 116-125, 2007.

[5] L. Sweeney, “k-Anonymity: A Model for Protecting Privacy,” Int’l J. Uncertainty Fuzziness and Knowledge-Based Systems, vol. 10, no. 5, pp. 557-570, 2002.

[6] P. Samarati, “Protecting Respondent’s Privacy in Microdata Release,” IEEE Trans. Knowledge and Data Eng., vol. 13, no. 6, pp. 1010-1027, Nov./Dec. 2001.

[7] T. Li, N. Li, J. Zhang, and I. Molloy, “Slicing: A New Approach for Privacy Preserving Data Publishing,” IEEE Trans. Knowledge and Data Eng., vol. 24, no. 3, Mar. 2012.

[8] C. Aggarwal, “On k-Anonymity and the Curse of Dimensionality,” Proc. Int’l Conf. Very Large Data Bases (VLDB), pp. 901-909, 2005.

[9] D. Kifer and J. Gehrke, “Injecting Utility into Anonymized Data Sets,” Proc. ACM SIGMOD Int’l Conf. Management of Data (SIGMOD), pp. 217-228, 2006.

[10] L. Sweeney, “Achieving k-Anonymity Privacy Protection Using Generalization and Suppression,” Int’l J. Uncertainty Fuzziness and Knowledge-Based Systems, vol. 10, no. 6, pp. 571-588, 2002.

[11] K. LeFevre, D. DeWitt, and R. Ramakrishnan, “Incognito: Efficient Full-Domain k-Anonymity,” Proc. ACM SIGMOD Int’l Conf. Management of Data (SIGMOD), pp. 49-60, 2005.

[12] J. Xu, W. Wang, J. Pei, X. Wang, B. Shi, and A.W.-C. Fu, “Utility-Based Anonymization Using Local Recoding,” Proc. 12th ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining (KDD), pp. 785-790, 2006.

[13] K. LeFevre, D. DeWitt, and R. Ramakrishnan, “Mondrian Multidimensional k-Anonymity,” Proc. Int’l Conf. Data Eng. (ICDE), p. 25, 2006.

[14] G. Ghinita, Y. Tao, and P. Kalnis, “On the Anonymization of Sparse High-Dimensional Data,” Proc. IEEE 24th Int’l Conf. Data Eng. (ICDE), pp. 715-724, 2008.

[15] M. Terrovitis, N. Mamoulis, and P. Kalnis, “Privacy-Preserving Anonymization of Set-Valued Data,” Proc. Int’l Conf. Very Large Data Bases (VLDB), pp. 115-125, 2008.

[16] Y. He and J. Naughton, “Anonymization of Set-Valued Data via Top-Down, Local Generalization,” Proc. Int’l Conf. Very Large Data Bases (VLDB), pp. 934-945, 2009.

[17] Y. Xu, K. Wang, A.W.-C. Fu, and P.S. Yu, “Anonymizing Transaction Databases for Publication,” Proc. ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining (KDD), pp. 767-775, 2008.

[18] I. Dinur and K. Nissim, “Revealing Information while Preserving Privacy,” Proc. ACM Symp. Principles of Database Systems (PODS), pp. 202-210, 2003.

[19] C. Dwork, F. McSherry, K. Nissim, and A. Smith, “Calibrating Noise to Sensitivity in Private Data Analysis,” Proc. Theory of Cryptography Conf. (TCC), pp. 265-284, 2006.

[20] T. Li and N. Li, “Injector: Mining Background Knowledge for Data Anonymization,” Proc. IEEE 24th Int’l Conf. Data Eng. (ICDE), pp. 446-455, 2008.

[21] T. Li, N. Li, and J. Zhang, “Modeling and Integrating Background Knowledge in Data Anonymization,” Proc. IEEE 25th Int’l Conf. Data Eng. (ICDE), pp. 6-17, 2009.

[22] T. Li and N. Li, “On the Tradeoff between Privacy and Utility in Data Publishing,” Proc. ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining (KDD), pp. 517-526, 2009.

[23] A. Inan, M. Kantarcioglu, and E. Bertino, “Using Anonymized Data for Classification,” Proc. IEEE 25th Int’l Conf. Data Eng. (ICDE), pp. 429-440, 2009.

8. Authors Biography

P. Anusha completed her B.Tech (IT) at Nagarjuna Institute of Technology Sciences and is pursuing an M.Tech (C.S.E) at DRK College of Engineering and Technology, JNTUH, Hyderabad. Her main research interests include data mining and databases.

K. Sudheer Kumar is working as an Assistant Professor in DRK College of Engineering and Technology, JNTU, Hyderabad, Andhra Pradesh, India. He has received an M.Tech (CSE) degree and an M.Sc (Systems Theory & Computer Modeling). His main research interests include operating systems, computer modeling and advanced computer architecture.

Dr. R.V. Krishnaiah received an M.Tech (EIE) from NIT Warangal, an M.Tech (CSE) from JNTU, and a Ph.D. from JNTU Ananthapur. He is a member of the professional bodies MIE, MIETE and MISTE. His main research interests include image processing, security systems, sensors, intelligent systems, computer networks, data mining, software engineering, and network protection and security control. He has published many papers and is an editorial member and reviewer for several national and international journals.