Adaptive Utility-based Anonymization Model: Performance Evaluation on Big Data Sets


Procedia Computer Science 50 ( 2015 ) 347 – 352

Peer-review under responsibility of scientific committee of 2nd International Symposium on Big Data and Cloud Computing (ISBCC’15) doi: 10.1016/j.procs.2015.04.037


2nd International Symposium on Big Data and Cloud Computing (ISBCC’15)


Jisha Jose Panackal^a,*, Anitha S Pillai^b

^a Research Scholar, School of Computing Sciences, Hindustan University, Chennai - 603 103, India
^b Professor and Head, School of Computing Sciences, Hindustan University, Chennai - 603 103, India

Abstract

Data anonymization is one of the globally accepted mechanisms for protecting the privacy of individuals in data publishing. Anonymization normally degrades the quality of the data, which is critical to the success of knowledge-based applications. An intelligent approach based on association mining, namely Adaptive Utility-based Anonymization (AUA), has been proposed to deal with this issue. The model was initially tested with sample instances of the original National Family Health Survey (NFHS-3) data set. This paper presents a performance evaluation of the AUA model on big data sets and shows that data anonymization can be done without compromising the quality of data mining results.


Keywords: Anonymization; Classification; Data Mining; Privacy

1. Introduction

Privacy Preserving Data Mining (PPDM) is an active research area with wide scope in this electronic era, since most individuals as well as organizations are concerned about privacy-related threats. To protect individuals from identity disclosure, some sort of data distortion is required, especially in the healthcare and financial sectors. Data anonymization is one of the promising techniques in data publishing, and the proper integration of such techniques will improve the acceptance of knowledge-based systems. Disclosing the identity of individuals will demotivate people from sharing such important information with society. Removing explicit identifiers such as the Name and Personal ID of a person is not enough to protect the person from disclosure risk.

* Jisha Jose Panackal. Tel.: +91-9400860285.
E-mail address: [email protected]

Therefore, effective storing and sharing of huge amounts of data, referred to as big data, is crucial for carrying out various knowledge-based studies. Several notable PPDM studies discuss the problem and suggest mechanisms to achieve privacy protection. Achieving privacy without compromising the content of the data or data mining accuracy is always a challenge in PPDM. Our previous study [1] discussed the need to develop a flexible model for Privacy Preserving Data Mining through a detailed analysis of actual data related to the Indian population, namely the National Family Health Survey (NFHS-3). Later, we developed an algorithm called AUA (Adaptive Utility-based Anonymization) [2], which is considered an intelligent model for solving the problem of disclosure risk.

As an extension of the previous studies, the present study compares data mining accuracy using classification on two data sets to show that the proposed model is well suited for big data anonymization. The rest of the paper is organized as follows. Section 2 discusses the background of the study, including important literature related to the privacy of individuals. Section 3 provides a brief description of the proposed AUA model and Section 4 describes the data sets. In Section 5, we present the experimental results, which show the effectiveness of our model by comparing classification accuracy. Finally, the paper is concluded in Section 6.

2. Related Work

In the centralized scenario, the data is published so that it can be used for a variety of purposes. Studies [3-11] show that, in the data publishing scenario, privacy is addressed by various techniques including anonymization, perturbation, condensation, randomization and fuzzy-based methods. Even though the data is not encrypted as in the distributed scenario, some precaution should be taken regarding the privacy of individuals before publishing the data. Techniques such as data hiding, compression, generalization, swapping and permutation are commonly applied to protect privacy in this scenario. The studies [12] and [13] provide some notable frameworks that describe privacy models. K-anonymity, as described in [14], [15] and [16], is the classical approach in this category and is a widely accepted mechanism for protecting privacy to some extent. Recent studies [17], [18] and [19] strengthen the process of anonymization by extending the capability of the k-anonymity model.
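To make the k-anonymity notion above concrete, the following is a minimal sketch of how the k of a table can be measured over a chosen set of quasi-identifiers: every combination of quasi-identifier values must be shared by at least k records. The function names and the toy records are hypothetical and are not taken from any of the cited studies or data sets.

```python
from collections import Counter

def qi_key(record, quasi_identifiers):
    """Project a record onto its quasi-identifier values."""
    return tuple(record[q] for q in quasi_identifiers)

def k_of(records, quasi_identifiers):
    """Largest k for which the records are k-anonymous over the given QIs."""
    groups = Counter(qi_key(r, quasi_identifiers) for r in records)
    return min(groups.values()) if groups else 0

def records_below_k(records, quasi_identifiers, k):
    """Records whose QI combination is shared by fewer than k records."""
    groups = Counter(qi_key(r, quasi_identifiers) for r in records)
    return [r for r in records if groups[qi_key(r, quasi_identifiers)] < k]

# Hypothetical toy table (illustrative values only).
toy = [
    {"age": "20-29", "state": "Kerala", "disease": "TB"},
    {"age": "20-29", "state": "Kerala", "disease": "Anemia"},
    {"age": "30-39", "state": "Kerala", "disease": "Malaria"},
    {"age": "30-39", "state": "Kerala", "disease": "TB"},
    {"age": "40-49", "state": "Goa", "disease": "Anemia"},
]

print(k_of(toy, ["age", "state"]))                # 1: one group has a single record
print(records_below_k(toy, ["age", "state"], 2))  # the lone Goa record
```

In this toy table the single record from Goa is the only one that prevents 2-anonymity, which is the kind of record an anonymization step would need to generalize or suppress.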

3. AUA Model

The proposed framework was developed to solve the problem of disclosure risk [1]. Initially, we developed a framework to solve the problem in NFHS-3; later, we generalized the framework so that it adapts to any data set. The framework is implemented using the Adaptive Utility-based Anonymization (AUA) model [20]. The highlights of the AUA algorithm are included in this section.

The AUA approach is a two-step iterative process based on association mining. Using support and confidence, we perform association mining on the given data set, either to filter risky instances or to retain maximum information based on the utility of the data. The first step is based on quasi-sensitive associations among all instances of the given data set, and the second step is based on quasi-quasi associations among instances of the non-frequent set and among certain instances of the frequent set. From this iterative process, multiple versions of anonymized data sets can be provided according to the user's needs.
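The support and confidence measures mentioned above can be illustrated with a small sketch. For an association from a quasi-identifier value combination A to a sensitive value B, support is the fraction of all records containing both A and B, and confidence is the fraction of records with A that also have B. The code below computes these for quasi-sensitive associations and splits records into a frequent and a non-frequent set using a support threshold. The threshold, the function names and the toy records are assumptions made for illustration; this is not the AUA algorithm itself, whose details are given in [2] and [20].

```python
from collections import Counter

def rule_stats(records, antecedent_attrs, consequent_attr):
    """Return {(antecedent_values, consequent_value): (support, confidence)}."""
    n = len(records)
    antecedent_counts = Counter()
    joint_counts = Counter()
    for r in records:
        a = tuple(r[x] for x in antecedent_attrs)
        antecedent_counts[a] += 1
        joint_counts[(a, r[consequent_attr])] += 1
    return {
        (a, c): (count / n, count / antecedent_counts[a])
        for (a, c), count in joint_counts.items()
    }

def split_by_support(records, antecedent_attrs, consequent_attr, min_support):
    """Split records by whether their association meets the support threshold."""
    stats = rule_stats(records, antecedent_attrs, consequent_attr)
    frequent, non_frequent = [], []
    for r in records:
        key = (tuple(r[x] for x in antecedent_attrs), r[consequent_attr])
        (frequent if stats[key][0] >= min_support else non_frequent).append(r)
    return frequent, non_frequent

# Hypothetical toy records: one quasi-identifier (age) and one sensitive attribute.
toy = [
    {"age": "20-29", "disease": "TB"},
    {"age": "20-29", "disease": "TB"},
    {"age": "20-29", "disease": "Anemia"},
    {"age": "30-39", "disease": "TB"},
]

stats = rule_stats(toy, ["age"], "disease")
frequent, non_frequent = split_by_support(toy, ["age"], "disease", min_support=0.5)
for rule, (support, confidence) in sorted(stats.items()):
    print(rule, round(support, 2), round(confidence, 2))
```

Note that the association from age 30-39 to TB has low support but confidence 1.0 in this toy table: a rare quasi-identifier value that fully determines the sensitive value is the typical pattern behind disclosure risk.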

4. Data Sets

4.1 Adult data set

The Adult data set from the UCI machine learning repository, consisting of 32,561 instances and 10 attributes, is used for testing the effectiveness of the new AUA model. Among the ten attributes, Country, Age, Education and Race are considered the quasi-identifying (QI) attributes. The model generates multiple versions of anonymized data sets according to the user-preferred attribute among the QI attributes. That is, if the user-preferred attribute is Country, the model generates an anonymized version of the Adult data named AdultAT-Country-wise. Similarly, for the other QI attributes Age, Education and Race, the respective anonymized versions are generated, namely AdultAT-Age-wise, AdultAT-Education-wise and AdultAT-Race-wise.

4.2 NFHS-3 data set

The National Family Health Survey (NFHS) programme is a nationally important source of data on population, health, and nutrition for India and its states. The survey was initiated in the early 1990s, and three rounds have been conducted: NFHS-1 in 1992-93, NFHS-2 in 1998-99 and NFHS-3 in 2005-06. NFHS-3 consists of 124,385 records, which include healthcare attributes related to HIV, TB, anemia and malaria. Initially the model was tested with 15,563 records from the NFHS-3 data set; the present study considers 46,593 records for testing the accuracy.

5. Experiment and Results

5.1 Classification on Adult Data set

To test the efficiency of the new model, the Adult data set was anonymized as per the AUA model, and each version of the data set, including the original, was subjected to the data mining operation of classification.

Figure 1. Classification Accuracy on Adult data set

Classification on the original data set is performed using three well-known classifiers, namely ZeroR, Naive Bayes and Random Forest. From the results shown in Fig. 1, it is observed that Random Forest provides better classification accuracy for the original data set than the other classifier models, while Naive Bayes shows the smallest change between the original data set and the different anonymized versions. The results also show that accuracy is only slightly affected on the anonymized data sets. Here, the anonymized version with the user-preferred quasi-identifier attribute “Race” provides the maximum accuracy among the anonymized versions, i.e., 81.66% with Random Forest and 82.21% with Naive Bayes. Naive Bayes shows an insignificant (< 0.5) change in accuracy on all data sets. Therefore, Naive Bayes is considered the most suitable classifier for the Adult data set. The results on other parameters that test the effectiveness of the new model, including privacy and information, are shown in Fig. 2.
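The "< 0.5" observation can be checked with simple arithmetic on the accuracy values plotted in Figure 1. The sketch below assumes that the first value in each series belongs to the original data set and the remaining four to the anonymized versions; the variable and function names are illustrative only.

```python
# Accuracy (%) as plotted in Figure 1: original data set first,
# followed by the four anonymized versions (ordering is an assumption).
accuracy = {
    "ZeroR": [66.92, 66.92, 66.92, 66.92, 66.92],
    "Naive Bayes": [82.5, 82.21, 82.16, 82.11, 82.21],
    "Random Forest": [82.64, 81.46, 81.63, 81.42, 81.66],
}

def max_change(values):
    """Largest absolute difference between the original and any anonymized version."""
    original, anonymized = values[0], values[1:]
    return round(max(abs(original - v) for v in anonymized), 2)

changes = {name: max_change(values) for name, values in accuracy.items()}
for name, change in changes.items():
    print(f"{name}: {change}")
# ZeroR: 0.0
# Naive Bayes: 0.39
# Random Forest: 1.22
```

Under this reading, Naive Bayes changes by at most 0.39 percentage points, which is consistent with the stated bound, while Random Forest loses up to 1.22 points.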

Figure 1 data (classification accuracy in %; original data set first, followed by the four anonymized versions in plotted order):
ZeroR Classifier: 66.92, 66.92, 66.92, 66.92, 66.92
Naïve Bayes Classifier: 82.5, 82.21, 82.16, 82.11, 82.21
Random Forest Classifier: 82.64, 81.46, 81.63, 81.42, 81.66

Figure 2. Comparison using various parameters on Adult data set
(Panels: Classification Accuracy on Percent_Correct; Classification Accuracy on Entropy; Classification Accuracy on IR-Precision.)

5.2 Classification on NFHS-3 Data set

Figure 3. Classification Accuracy on NFHS data set

Figure 3 data (classification accuracy in %; original data set first, followed by the four anonymized versions in plotted order):
ZeroR Classifier: 97.06, 97.06, 97.06, 97.06, 97.06
Naïve Bayes Classifier: 96.89, 96.93, 96.93, 96.93, 96.92
Random Forest Classifier: 97.05, 97.05, 97.06, 97.05, 97.06

From the results shown in Fig. 3, it is observed that ZeroR provides slightly better classification accuracy for the original data set than the other classifier models. The Naive Bayes and Random Forest classifiers also show only minimal changes between the original data set and the different anonymized versions. The Random Forest classifier provides more consistent values for all the parameters considered. The results on other parameters that test the effectiveness of the new model, including privacy and information, are shown in Fig. 4.
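When reading these numbers it helps to recall what ZeroR does: it ignores every attribute and always predicts the most frequent class, so its accuracy equals the share of the majority class. A high ZeroR score therefore indicates a strongly imbalanced class attribute, and the score cannot change when only quasi-identifier values are altered. The sketch below uses a hypothetical class distribution, not the NFHS-3 data, and the function names are illustrative.

```python
from collections import Counter

def zero_r_fit(labels):
    """ZeroR 'training': remember the most frequent class label."""
    return Counter(labels).most_common(1)[0][0]

def accuracy_percent(predicted_label, labels):
    """Accuracy (%) when the same label is predicted for every record."""
    hits = sum(1 for y in labels if y == predicted_label)
    return 100.0 * hits / len(labels)

# Hypothetical, strongly imbalanced class attribute (illustrative only).
labels = ["negative"] * 97 + ["positive"] * 3

majority = zero_r_fit(labels)
baseline = accuracy_percent(majority, labels)
print(majority, baseline)  # negative 97.0
```

Against such a baseline, classifiers such as Naive Bayes and Random Forest can add little in terms of percent correct, which is one reason additional parameters are reported alongside accuracy.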

Figure 4. Comparison using various parameters on NFHS-3 data set
(Panels: Classification Accuracy on Percent_Correct; Classification Accuracy on Entropy; Classification Accuracy on IR-Precision.)

6. Conclusions

The AUA model was tested initially with 15,563 instances of the real data set NFHS-3 and generated multiple versions of anonymized data sets, one for each quasi-identifier (QI) attribute. Since the data contain four QIs, i.e. {State, Age, Education and Religion}, it generated an anonymized version for each as per the user’s preference. Both the original data set and the anonymized data sets were tested for classification accuracy. The results show that the anonymization process does not cause any significant degradation in the accuracy of data mining classification. This shows that the privacy of individuals can be ensured without compromising the quality of the data set.

Later, the study extends the testing using additional data sets, namely the Adult data set and the enhanced NFHS data set, which include a large number of instances. The results indicate that the privacy of individuals can be ensured in big data without compromising the quality of the data set. The mechanism used in AUA provides sufficient intelligence to balance the two contradictory goals of PPDM, namely privacy protection and information preservation. The model is implemented to accommodate any data set, and the data provider has flexibility in setting up the QI attributes and a provision to include the user’s preference for any QI attribute. That is, if the anonymized data is to be used for a study based on a particular QI attribute, that user will get an anonymized version in which that attribute is unaffected. The model retains the benefits of k-anonymity in terms of privacy protection and provides maximum information to the users.

REFERENCES

1. Jisha Jose Panackal, Anitha S Pillai, V N Krishnachandran, “Disclosure Risk of Individuals: a k-anonymity Study on Health Care Data Related to Indian Population”, IEEE International Conference on Data Science and Engineering (ICDSE), 2014.

2. Jisha Jose Panackal, Anitha S Pillai, “Adaptive Utility-based Anonymization Model for Privacy Preserving Data Mining”, International Conference on Data Mining and Warehousing (ICDMW), Data Mining Algorithms, Elsevier, 2014.

3. Li Liu, Murat Kantarcioglu and Bhavani Thuraisingham, “The applicability of the perturbation based privacy preserving data mining for real-world data”, Data and Knowledge Engineering, 2008.

4. Samarati P, “Protecting respondents privacy in Microdata release”, IEEE Transactions on Knowledge and Data Engineering, 13:1010-1027.

5. Tiancheng Li, Ninghui Li, “Towards Optimal k-anonymization”, Data and Knowledge Engineering, Elsevier, 2008.

6. Marina Blanton, “Achieving Full Security in Privacy-Preserving Data Mining”, IEEE International Conference on Privacy, Security, Risk, and Trust, and IEEE International Conference on Social Computing, 2011.

7. R. Mukkamala and V.G. Ashok, “Fuzzy-based Methods for Privacy-Preserving Data Mining”, Eighth International Conference on Information Technology: New Generations, 2011.

8. Yan Zhu and Lin Peng, “Study on K-anonymity Models of Sharing Medical Information”, IEEE, 2007.

9. Yun Ding and Karsten Klein, “Model-Driven Application-Level Encryption for the Privacy of E-Health Data”, International Conference on Availability, Reliability and Security, 2010.

10. Jinfei Liu, Jun Luo and Joshua Zhexue Huang, “Rating: Privacy Preservation for Multiple Attributes with Different Sensitivity Requirements”, 11th IEEE International Conference on Data Mining Workshops, 2011.

11. José Luis Fernández-Alemán, Inmaculada Carrión Señor, Pedro Ángel Oliver Lozoya, Ambrosio Toval, “Methodological Review - Security and Privacy in electronic health records: A systematic literature review”, Journal of Biomedical Informatics, 2013.

12. Jitao Zhao and Ting Wang, “A General Framework for Medical Data Mining”, International Conference on Future Information Technology and Management Engineering, 2010.

13. Sergio Martínez, David Sánchez, Aida Valls, “A semantic framework to protect the privacy of electronic health records with non-numerical attributes”, Journal of Biomedical Informatics 46, pp. 294-303, 2013.

14. L. Sweeney, “Achieving k-anonymity privacy protection using generalization and suppression”, International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, pp. 571-588, 2002.

15. L. Sweeney, “k-anonymity: A model for Protecting Privacy”, International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 2002.

16. Yan Zhu, Lin Peng, “Study on k-anonymity Models of Sharing Medical Information”, IEEE, 2007.

17. Dan Zhu, Xiao-Bai Li, Shuning Wu, “Identity disclosure protection: A data reconstruction approach for privacy preserving data mining”, Decision Support Systems, Elsevier, pp. 133-140, 2009.

18. Weijia Yang, Sanzheng, “A novel anonymization algorithm: Privacy Protection and Knowledge Preservation”, Expert Systems with Applications, Elsevier, pp. 756-766, 2010.

19. Xiaoxun Sun, Lili Sun, Hua Wang, “Extended k-anonymity models against sensitive attribute disclosure”, Computer Communications, Elsevier, pp. 526-535, 2011.

20. Jisha Jose Panackal and Anitha S Pillai, “An Intelligent Framework for Protecting Privacy of Individuals Empirical Evaluations on Data Mining Classification”, Proc. of International Conference on Hybrid Intelligence Systems (HIS), IEEE, 2014.