Secure Discovery of Actionable Knowledge from Horizontally Distributed Databases


IJRIT International Journal of Research in Information Technology, Volume 2, Issue 11, November 2014, Pg. 09-15

International Journal of Research in Information Technology (IJRIT)

www.ijrit.com ISSN 2001-5569


Srinivas Rao M¹, V. Santosh Kumar², Sirisha K L³

Student, Department of CSE, Sreyas Institute of Engineering and Technology, Hyderabad, India¹

Associate Professor, Department of CSE, Sreyas Institute of Engineering and Technology, Hyderabad, India²

Assistant Professor, Department of CSE, Keshav Memorial Institute of Technology, Hyderabad, India³

________________________________________________________________________________________________________

Abstract

Enterprises use horizontally distributed databases to manage data in a distributed fashion and for the convenience of local businesses. An enterprise as a whole, however, needs to mine useful information from such a database. Because the data is distributed across many sites, mining the whole database is essential for discovering actionable knowledge. Moreover, secure mining is a concern. In this paper we propose an algorithm that securely mines association rules which can be used to make well informed decisions. Our algorithm not only mines association rules from horizontally distributed databases but also does so securely. We built a prototype application that demonstrates the proof of concept. Our empirical results show that the application is useful and can be extended in future to incorporate more mining operations.

Index Terms – Data mining, distributed databases, association rule mining, security

1. Introduction

Data mining has been around for many years. Recently, however, it has become indispensable in the real world, where organizations rely on it for growth and sustenance. Knowledge Discovery in Databases (KDD) has become a popular and unavoidable part of modern organizations. Because data mining extracts trends or patterns, it reveals interesting information that can lead to well informed decisions. For instance, a sales company may find that two products are bought together by 80% of customers. A fact of this kind is known as a trend or pattern. Domain experts interpret such patterns and convert them into business intelligence. The knowledge discovered through data mining can thus support expert decisions that lead to revenue and growth for enterprises. Nowadays companies in many fields, including banking, insurance and healthcare, depend on data mining to make appropriate decisions. Such decisions are well grounded because they are drawn from large volumes of data rather than from individual bias. In this paper our focus is on generating association rules from horizontally distributed databases. A horizontally distributed database is split horizontally across many locations, and the combined data reflects the data of the whole organization. Figure 1 shows decentralized data collection.


Figure 1 – Data is distributed across many servers

As Figure 1 shows, the data is stored in various locations and can be integrated through a central information server. An enterprise can have multiple servers, each providing locally required information, integrated with a central server. Mining, however, depends on the consolidated data obtained from the various locations.

In association rule mining, rules are generated from frequent item sets. Apriori is one of the algorithms that can find frequent item sets, which are extracted according to a given minimum support and confidence. Support and confidence are statistical measures of the quality of frequent item sets and of the rules derived from them, and domain experts use them to estimate the usefulness of the generated trends or association rules. An association rule reflects an association among objects based on how frequently they occur together in the given dataset.

Support and confidence are therefore central to judging the usefulness of frequent item sets. They are computed as follows.
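For reference, the standard textbook definitions over a transaction database are written out below. The symbols D (the set of all transactions), T (a single transaction) and X (an arbitrary item set) are notation introduced here for illustration.

```latex
\[
\mathrm{support}(X) = \frac{\lvert \{\, T \in D : X \subseteq T \,\} \rvert}{\lvert D \rvert}
\]
\[
\mathrm{support}(A \Rightarrow B) = \mathrm{support}(\{A, B\}),
\qquad
\mathrm{confidence}(A \Rightarrow B) = \frac{\mathrm{support}(\{A, B\})}{\mathrm{support}(\{A\})}
\]
```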

Here A and B are two items. The association between these two items and its frequency help in making appropriate decisions that support business growth. Consider the dataset given in Figure 2.


The transactions in Figure 2 can be analyzed using the support and confidence computations. The items are A, B, C and D, and the item sets occur with different frequencies. The item set AB has 40% support, ABC has 20% and BC has 60%. The corresponding confidence values are 66%, 50% and 75%, respectively. These statistics support quantitative data analysis and help domain experts make well informed decisions that lead to organizational growth.
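The arithmetic behind these percentages can be sketched in a few lines of Python. The five transactions below are hypothetical: they are not the data of Figure 2, only one dataset that is consistent with the quoted values. The rule directions (A ⇒ B, AB ⇒ C, B ⇒ C) are likewise an assumption, and the function names are illustrative.

```python
# Hypothetical transactions (NOT the paper's Figure 2), chosen only so that
# the resulting supports and confidences agree with the values in the text.
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"B", "C"},
    {"B", "C", "D"},
    {"A", "D"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Fraction of all transactions that contain `itemset`."""
    return count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """count(antecedent and consequent together) / count(antecedent)."""
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / count(antecedent, transactions)

if __name__ == "__main__":
    for items in ({"A", "B"}, {"A", "B", "C"}, {"B", "C"}):
        print(sorted(items), support(items, transactions))
    # Assumed rule directions: A => B, AB => C, B => C
    print(confidence({"A"}, {"B"}, transactions))
    print(confidence({"A", "B"}, {"C"}, transactions))
    print(confidence({"B"}, {"C"}, transactions))
```

On this dataset AB, ABC and BC occur in 2, 1 and 3 of the 5 transactions, and the three confidences are 2/3, 1/2 and 3/4.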

In this paper we propose an algorithm for mining association rules from horizontally distributed databases in a secure fashion. We also build a prototype application that demonstrates the proof of concept. The remainder of the paper is structured as follows. Section 2 reviews the literature on secure association rule mining. Section 3 presents the proposed architecture and the algorithm. Section 4 presents experimental results, and Section 5 concludes the paper.

2. Related Work

Privacy preserving data mining has been an active research area for some years. In many scenarios the data owner and the data miner are not the same party, and in such cases security plays an important role. Several anonymization techniques have been proposed, including [1] and [2], as has the idea of perturbing the data. Performing data mining securely when multiple parties are involved is essential, and the usual approach is cryptographic. In [3], the ID3 algorithm was used to build decision trees securely. Secure clustering with the Expectation Maximization (EM) algorithm on horizontally distributed databases was explored in [4]. Association rule mining from distributed databases was addressed in [5], [6] and [7], and experiments with vertically and horizontally distributed datasets were reported in [8] and [9].

When multiple parties are involved in the data mining process in a distributed environment, secure multiparty communications play an important role. Research toward this end was carried out in [10], [11] and [12]. A privacy preserving protocol was presented in [11], and the concept of polynomials was used in [10].

Homomorphic encryption is used in [10] and [11], and commutative encryption is used in [8]. These techniques incur a high computational cost. The computational cost of the approach in [12], however, is lower.
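The defining property of commutative encryption is that applying two keys in either order yields the same ciphertext. The sketch below illustrates that property with one common construction, exponentiation modulo a prime. It is a toy illustration and is not claimed to be the scheme used in [8]. The prime `P` and the function names are assumptions made here, and the parameters are not vetted for real security.

```python
import math
import secrets

# Toy parameter (assumption): the Mersenne prime 2**127 - 1.
P = 2**127 - 1

def keygen(p=P):
    """Return (e, d): an exponent e coprime to p - 1 and its inverse d mod p - 1."""
    while True:
        e = secrets.randbelow(p - 3) + 2      # e in [2, p - 2]
        if math.gcd(e, p - 1) == 1:
            return e, pow(e, -1, p - 1)       # modular inverse (Python 3.8+)

def encrypt(x, e, p=P):
    """Encrypt x (with 1 <= x < p) as x**e mod p."""
    return pow(x, e, p)

def decrypt(c, d, p=P):
    """Undo encrypt(): (x**e)**d == x (mod p) because e*d == 1 (mod p - 1)."""
    return pow(c, d, p)

if __name__ == "__main__":
    e1, d1 = keygen()
    e2, d2 = keygen()
    x = 123456789
    # Encrypting with key 1 then key 2 equals key 2 then key 1.
    print(encrypt(encrypt(x, e1), e2) == encrypt(encrypt(x, e2), e1))
    # The two layers can also be removed in either order.
    print(decrypt(decrypt(encrypt(encrypt(x, e1), e2), d1), d2) == x)
```

Because doubly encrypted values compare equal regardless of key order, two parties can check whether they hold the same item without revealing the item itself, which is what makes this property useful for combining item sets across sites.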

Many of these experiments involved two players. Another solution for secure computation and distributed data mining is given in [13]. The recently proposed set inclusion problem was explored in [14] using polynomial evaluation.

3. Proposed Architecture and Algorithm for Secure ARM

In this paper we propose a distributed architecture that gives an overview of the distributed sites involved in the data mining activities and of how the consolidated rules are generated in a secure environment. Security is provided by standard cryptographic mechanisms, while mining takes place at the individual sites and the results are then merged to obtain complete patterns that can be used to make well informed decisions.

3.1 Architectural Overview


Figure 3 – Overview of the architecture

As the architecture shows, data exists at multiple sites: the data of an enterprise is distributed horizontally across them. Frequent item set mining takes place locally at each site. The results are then merged to obtain the complete set of association rules, which domain experts can interpret to make expert decisions.

3.2 Proposed Algorithm

The algorithm is meant for secure mining of association rules from horizontally distributed databases. The algorithm is as presented in Listing 1.

Listing 1 – Outline of the proposed algorithm

As Listing 1 shows, the algorithm takes a dataset, a support threshold and a confidence threshold as input and returns association rules. Its main steps are initialization of security keys, candidate set generation, local pruning, unification of candidate results, computation of local supports, merging of mining results and returning the association rules.
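The steps named above can be sketched as follows. This is a minimal single-process simulation under stated assumptions, not the algorithm of Listing 1. Each site is modeled as a list of transactions, all function names are hypothetical, and a simple masked sum stands in for the cryptographic step, whose details (including key initialization) are not specified here. Passing thresholds as `fractions.Fraction` keeps the comparisons exact.

```python
import random
from fractions import Fraction
from itertools import combinations

MODULUS = 2**61 - 1  # assumed bound, larger than any possible global count

def local_counts(site, candidates):
    """At one site, count the local transactions containing each candidate."""
    return {c: sum(1 for t in site if c <= t) for c in candidates}

def masked_sum(values):
    """Stand-in for the secure step (assumption): an initiator adds a random
    mask, every site adds its private count to the running total, and the
    initiator removes the mask, so only the total is revealed."""
    mask = random.randrange(MODULUS)
    running = mask
    for v in values:
        running = (running + v) % MODULUS
    return (running - mask) % MODULUS

def next_candidates(frequent_k, k):
    """Apriori step: (k+1)-item sets whose k-item subsets are all frequent."""
    items = sorted({i for s in frequent_k for i in s})
    return {frozenset(c) for c in combinations(items, k + 1)
            if all(frozenset(s) in frequent_k for s in combinations(c, k))}

def mine_frequent(sites, min_support):
    """Return {item set: global count} for all globally frequent item sets."""
    total = sum(len(site) for site in sites)
    candidates = {frozenset([i]) for site in sites for t in site for i in t}
    frequent, k = {}, 1
    while candidates:
        # Local pruning: each site keeps candidates frequent in its own data.
        kept = [{c for c, n in local_counts(site, candidates).items()
                 if n >= min_support * len(site)} for site in sites]
        unified = set().union(*kept)              # unify candidate results
        level = {}
        for c in unified:                         # local supports, then merge
            n = masked_sum([local_counts(site, [c])[c] for site in sites])
            if n >= min_support * total:
                level[c] = n
        frequent.update(level)
        candidates = next_candidates(set(level), k)
        k += 1
    return frequent

def rules(frequent, min_confidence):
    """Return (antecedent, consequent, confidence) for rules above the threshold."""
    out = []
    for itemset, n in frequent.items():
        for r in range(1, len(itemset)):
            for a in map(frozenset, combinations(sorted(itemset), r)):
                conf = n / frequent[a]            # subsets of frequent sets are frequent
                if conf >= min_confidence:
                    out.append((a, itemset - a, conf))
    return out

if __name__ == "__main__":
    site1 = [{"A", "B", "C"}, {"A", "B"}, {"B", "C"}]   # hypothetical data
    site2 = [{"B", "C", "D"}, {"A", "D"}]
    found = mine_frequent([site1, site2], Fraction(2, 5))
    for a, c, conf in rules(found, Fraction(7, 10)):
        print(sorted(a), "=>", sorted(c), conf)
```

Local pruning is safe in this sketch because an item set that meets the global support threshold must meet the same threshold at one or more sites, so it always survives into the unified candidate set.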


4. Experimental Results

We built a prototype application that demonstrates the proof of concept. The application was built in Visual Studio 2012, the IDE for the .NET environment. The experiments were run on a PC with 2 GB of RAM and a dual-core processor, running the Windows 7 operating system. The measure used in the experiments is the total time taken for the mining task.

Figure 4 – Computational time comparison

In Figure 4, the horizontal axis shows N, the number of rows and columns, which ranges from 100,000 to 500,000, and the vertical axis shows the total computation time in seconds. The proposed Secure ARM algorithm is compared with a baseline algorithm, and it performs considerably better than the baseline in terms of computation time.

5. Conclusion

In this paper we studied secure mining of association rules from horizontally distributed databases. Many solutions exist for the security problem that arises when the data owner and the mining party are different. Here we assumed that the data belongs to a single organization but resides in multiple locations. We built an architecture and proposed an algorithm for secure computation of association rules from horizontally distributed databases. The algorithm is distributed: it is executed on multiple servers and the results are merged at the main server. This helps the head office of an enterprise obtain consolidated data mining results for effective decision making. We also built a prototype application that demonstrates the proof of concept.

The prototype can be improved further, both with respect to secure mining and by incorporating multiple mining methods so that it can become part of a decision support system.

REFERENCES

[1] A.V. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke. Privacy preserving mining of association rules. In KDD, pages 217–228, 2002.

[2] R. Agrawal and R. Srikant. Privacy-preserving data mining. In SIGMOD Conference, pages 439–450, 2000.

[3] Y. Lindell and B. Pinkas. Privacy preserving data mining. In Crypto, pages 36–54, 2000.

[4] X. Lin, C. Clifton, and M.Y. Zhu. Privacy-preserving clustering with distributed EM mixture modeling. Knowl. Inf. Syst., 8:68–81, 2005.

[5] J. Zhan, S. Matwin, and L. Chang. Privacy preserving collaborative association rule mining. In Data and Applications Security, pages 153–165, 2005.


[6] J. Vaidya and C. Clifton. Privacy preserving association rule mining in vertically partitioned data. In KDD, pages 639–644, 2002.

[7] M. Kantarcioglu, R. Nix, and J. Vaidya. An efficient approximate protocol for privacy-preserving association rule mining. In PAKDD, pages 515–524, 2009.

[8] M. Kantarcioglu and C. Clifton. Privacy-preserving distributed mining of association rules on horizontally partitioned data. IEEE Transactions on Knowledge and Data Engineering, 16:1026–1037, 2004.

[9] A. Schuster, R. Wolff, and B. Gilburd. Privacy-preserving association rule mining in large-scale distributed systems. In CCGRID, pages 411–418, 2004.

[10] L. Kissner and D.X. Song. Privacy-preserving set operations. In CRYPTO, pages 241–257, 2005.

[11] M.J. Freedman, K. Nissim, and B. Pinkas. Efficient private matching and set intersection. In EUROCRYPT, pages 1–19, 2004.

[12] J. Brickell and V. Shmatikov. Privacy-preserving graph algorithms in the semi-honest model. In ASIACRYPT, pages 236–252, 2005.

[13] M. Freedman, Y. Ishai, B. Pinkas, and O. Reingold. Keyword search and oblivious pseudorandom functions. In TCC, pages 303–324, 2005.

[14] T. Tassa, A. Jarrous, and J. Ben-Ya’akov. Oblivious evaluation of multivariate polynomials. Submitted.

AUTHORS

Srinivas Rao Mukka received the MCA degree in computer science and technology from Indira Gandhi National University, New Delhi, India, in 2011. He is currently working towards his M.Tech degree at Sreyas Institute of Engineering and Technology, Hyderabad, India. His research interests include data mining and cloud computing.


Vennu Santosh Kumar received the Masters degree in Computer Science and Engineering in 2010. He is a Microsoft Certified System Engineer and a CISCO Certified Network Administrator, and he worked as a System Engineer at WIPRO Technologies (India). In 2011 he joined the Computer Science Department of Sreyas Institute of Engineering and Technology as an Associate Professor. He has been involved in several tutorials, workshops and technical paper presentations. His research interests are Computer Networks, Network Security and Mobile Computing.


Sirisha K L S received the Masters degree in Computer Science and Engineering in the year 2010. In 2010 she joined as an Assistant Professor at Keshav Memorial Institute of Technology in Computer Science and Engineering Department. Her research interests are focused on Network Security, Information Retrieval & Machine learning.