COVER FEATURE

Privacy-Preserving Data Mining Systems

Although successful in many applications, data mining poses special concerns for private data. An integrated architecture takes a systemic view of the problem, implementing established protocols for data collection, inference control, and information sharing.

Nan Zhang

University of Texas at Arlington

Wei Zhao

Rensselaer Polytechnic Institute

Data mining successfully extracts knowledge to support a variety of domains (marketing, weather forecasting, medical diagnosis, and national security), but it is still a challenge to mine certain kinds of data without violating the data owners’ privacy.1 How to mine patients’ private data, for example, is an ongoing problem in healthcare applications. In recognition of the growing privacy concern, directives such as the US Health Insurance Portability and Accountability Act (HIPAA) and the European Union Privacy Directive mandate privacy protection for data management and analysis systems.

As data mining becomes more pervasive, such concerns are increasing. Online data collection systems are an example of new applications that threaten individual privacy. Already companies are sharing data mining models to obtain a richer set of data about mutual customers and their buying habits.

The computing community must address data mining privacy before data mining techniques become widespread and the threat to private information spirals out of control. The sticking point is how to protect privacy while preserving the usefulness of data mining results.

Much research is under way to address obstacles, but practical privacy-preserving data mining systems are largely in the research and prototyping stages. Many techniques for privacy-preserving data mining concentrate on algorithmic solutions and underlying mathematical tools,2,3 rather than focusing on system issues.

Our goal in investigating privacy preservation issues was to take a systemic view of architectural requirements and design principles and explore possible solutions that would lead to guidelines for building practical privacy-preserving data mining systems.

FOUNDATIONAL DESIGN

As Figure 1 shows, privacy-preserving data mining usually has multiple steps that translate to a three-tiered architecture: At the bottom tier are the data providers, the data owners, which are often physically distributed. The data providers submit their private data to the data warehouse server. This server, which constitutes the middle tier, supports online analytical data processing to facilitate data mining by translating raw data from the data providers into aggregate data that the data mining servers can more quickly process.

The data warehouse server stores the data collected in disciplined physical structures, such as a multidimensional data cube, and aggregates and precomputes the data in various forms, such as sum, average, max, and min. In an online survey system, for example, the survey respondents would be data providers who submit their data to the survey analyzer’s data warehouse server; an aggregated data point might be the average age of all survey respondents. The aggregated data is more efficient to process than raw data from the providers.

At the top tier are the data mining servers, which perform the actual data mining. In a privacy-preserving data mining system, these servers do not have free access to all data in the data warehouse. In a hospital system, the accounting department can mine patients’ financial data, for example, but cannot access patients’ medical records. Developing and validating effective rules for the data mining servers’ access to the data warehouse is an open research problem.4

Besides constructing data mining models on its local data warehouse server, a data mining server might share information with data mining servers from other systems. The motivation for this sharing is to build data mining models that span systems. For example, several retail companies might opt to share their local data mining models on customer records to build a global data mining model about consumer behavior that would benefit all the companies. As Figure 1 shows, sharing occurs in the top tier, where each data mining server holds the data mining model of its own system. Thus, “sharing” means sharing local data mining models rather than raw data.

“Minimum necessary” design principle

Any design of a privacy-preserving data mining system requires a clear definition of privacy. The common interpretation is that a data point is private if its owner has the right to choose whether or not, to what extent, and for what purpose to disclose the data point to others. In privacy-preserving data mining literature, most authors assume (either implicitly or explicitly) that a data owner generally chooses not to disclose its private data unless data mining requires it. This assumption and the accepted information-privacy definition form the basis of the “minimum necessary” design principle:

In a data mining system, disclosed private information (from one entity to another) should be the minimum necessary for data mining.

Minimum in this context is a qualitative, not a quantitative, measure. Since the quantitative measure of privacy disclosure varies among systems, minimum captures the idea that all unnecessary private information (unnecessary in the context of how accurate the data mining results must be) should not be disclosed.

Minimum thus means that privacy disclosure is on a need-to-know basis. Many privacy regulations, including HIPAA, mandate this minimum necessary rule.

Privacy protocols

On the basis of the architecture in Figure 1 and the minimum necessary design principle, we have evolved a basic strategy for building a privacy-preserving data mining system. Central to the strategy are three protocols that govern privacy disclosure among entities:

• Data collection protects privacy during data transmission from the data providers to the data warehouse server.

• Inference control manages privacy protection between the data warehouse server and data mining servers.

• Information sharing controls information shared among the data mining servers in different systems.

Given the minimum necessary rule, a common goal of these protocols is to transmit the minimum private information necessary for data mining from one entity to another to build accurate data mining models. In reality, it is often difficult to build an efficient system that protects private information perfectly. Consequently, there are always tradeoffs between data privacy and data mining model accuracy. These protocols are based on established methods that the system designer can tailor to particular requirements, choosing the most beneficial tradeoffs. The data collection protocol, for example, can draw from one of two established collection methods, each with its advantages and drawbacks.

[Figure 1 diagram: Data Mining System 1 and Data Mining System 2, each with data providers, a data warehouse server, and data mining servers; information sharing links the data mining servers of the two systems.]

Figure 1. Basic architecture for privacy-preserving data mining. The architecture typically has three tiers: data providers, which are the data owners; the data warehouse server, which supports online analytical processing; and the data mining servers that perform data mining tasks and share information. The challenge is to control private information transmitted among entities without impeding data mining.

DATA COLLECTION PROTOCOL

The data collection protocol lets data providers identify the minimum necessary part of private information (what must be disclosed to build accurate data mining models) and ensures that they transmit only that part of the information to the data warehouse server.

Several requirements shape the data collection protocol. First, it must be scalable, since a data warehouse server can deal with as many as hundreds of thousands of data providers, as in an online survey system. Second, the computational cost to data providers must be small because they have considerably lower computational power than the data warehouse server, and a higher cost could discourage them from participating in data mining. Finally, the protocol must be robust; it must deliver relatively accurate data mining results while protecting data providers’ privacy, even if data providers behave erratically. For example, if some data providers in an online survey system deviate from the protocol or submit meaningless data, the data collection protocol must control the influence of such erroneous behavior and ensure that global data mining results remain sufficiently accurate.

Figure 2 shows a data collection protocol taxonomy based on two data collection methods.

Value-based method

With the value-based method,5 a data provider manipulates the value of each data attribute or item independently using one of two approaches. The perturbation-based approach3 adds noise directly to the original data values, such as changing age 23 to 30 or Texas to California. The aggregation-based approach generalizes data according to the relevant domain hierarchy, such as changing age 23 to age range 21-25 or Texas to the US.
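The two approaches can be illustrated with a minimal Python sketch. The noise range, the five-year age bands, and the `STATE_TO_COUNTRY` lookup are assumptions chosen to mirror the examples in the text, not parameters taken from the cited work.

```python
import random

# Hypothetical one-level domain hierarchy for the aggregation-based approach.
STATE_TO_COUNTRY = {"Texas": "US", "California": "US"}


def perturb_age(age, spread=10, rng=random):
    """Perturbation-based: add zero-mean integer noise to a numeric value."""
    return age + rng.randint(-spread, spread)


def generalize_age(age, width=5):
    """Aggregation-based: replace an age with its enclosing range (21-25, 26-30, ...)."""
    low = ((age - 1) // width) * width + 1
    return (low, low + width - 1)


def generalize_state(state):
    """Aggregation-based: move one level up the (assumed) location hierarchy."""
    return STATE_TO_COUNTRY.get(state, "unknown")
```

A provider would apply one of these functions locally, before anything leaves its machine, so the data warehouse server only ever sees the manipulated value.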

The perturbation-based approach is highly suitable for arbitrary data, while the aggregation-based approach relies on knowledge of the domain hierarchy but can be effective in guaranteeing the data’s anonymity.6 For example, k-anonymity means that each perturbed data record is indistinguishable from the perturbed values of at least k - 1 other data records.
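The k-anonymity condition can be checked mechanically. The sketch below assumes records are dictionaries and that the caller names the attributes an adversary could use to single out a record (called `quasi_identifiers` here, a term not used in the text above).

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    is shared by at least k records."""
    groups = Counter(
        tuple(record[attr] for attr in quasi_identifiers) for record in records
    )
    return all(count >= k for count in groups.values())
```

For instance, three records generalized to the same age range and country satisfy 3-anonymity on those two attributes, while adding a single record in a different age range breaks it.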

The value-based method assumes that it would be difficult, if not impossible, for the data warehouse server to rediscover the original private data from the manipulated values but that the server would still be able to recover the original data distribution from the perturbed data, thereby supporting the construction of accurate data mining models.5
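A simplified illustration of this assumption: if the noise distribution is public, the server can estimate aggregate statistics of the original data without seeing any original value. The sketch below recovers only the mean and variance under uniform noise; it is a stand-in for, not a reproduction of, the full distribution reconstruction in the cited work, and `spread` is an assumed public parameter.

```python
import random
import statistics


def perturb(values, spread, rng):
    """Each provider adds independent Uniform(-spread, spread) noise."""
    return [v + rng.uniform(-spread, spread) for v in values]


def estimate_mean_and_variance(perturbed, spread):
    """Server side: correct the observed statistics for the known noise.

    Zero-mean noise leaves the mean unbiased; the variance of
    Uniform(-spread, spread) is spread**2 / 3 and is subtracted out.
    """
    noise_variance = spread ** 2 / 3
    mean = statistics.fmean(perturbed)
    variance = max(statistics.pvariance(perturbed) - noise_variance, 0.0)
    return mean, variance
```

With many providers, the estimates approach the statistics of the original data even though each individual value may be off by as much as `spread`.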

Dimension-based method

The dimension-based method is so called because the data to be mined usually has many attributes, or dimensions. The basic idea is to remove part of the private information from the original data by reducing the number of dimensions. The blocking-based approach3 accomplishes this by truncating some private attributes without releasing them to the data warehouse server. However, this approach could result in information loss, preventing data mining servers from constructing accurate data mining models. The more complicated projection-based approach7 overcomes this problem by projecting the original data into a carefully designed, low-dimensional subspace in a way that retains only the minimum information necessary to construct accurate data mining models.
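The mechanical difference between the two approaches can be sketched as follows. How to design the subspace is the hard part and is not shown; the `basis` argument below is simply assumed to be given.

```python
def block(record, private_attributes):
    """Blocking-based: drop the listed attributes before submission."""
    return {k: v for k, v in record.items() if k not in private_attributes}


def project(vector, basis):
    """Projection-based: map a d-dimensional vector to its coordinates
    along each row of `basis` (one row per retained dimension)."""
    return [sum(b * x for b, x in zip(row, vector)) for row in basis]
```

Blocking removes whole attributes, whereas projection can mix attributes, so a provider submitting `project(v, basis)` with a two-row basis discloses two numbers regardless of how many attributes `v` has.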

Advantages and drawbacks

Each method and attendant approach has pluses and minuses. The value-based method is independent of the data mining task, which makes it suitable for applications involving multiple data mining tasks or tasks unknown at data collection. In contrast, the dimension-based method fits better with individual data mining tasks because the information to be retained after dimension reduction usually depends on the particular task.

So far, research has not defined an effective and universally applicable projection-based approach. Even so, the projection-based approach promises strong advantages over value-based methods in terms of the tradeoff between accuracy and privacy protection.

[Figure 2 diagram: the data collection protocol branches into the value-based method (perturbation-based approach, aggregation-based approach) and the dimension-based method (blocking-based approach, projection-based approach).]

Figure 2. Data collection protocol taxonomy. A designer can choose which of two methods (value- or dimension-based) and its attendant approaches best serve the design.

Most value-based approaches treat different attributes independently and separately, so at least some attributes that are less necessary for data mining are always disclosed to the data warehouse server to the same extent as other attributes. Indeed a recent study revealed that, with the perturbation-based randomization approach, the data warehouse server could use privacy intrusion techniques to filter noise from the perturbed data, thereby rediscovering part of the original private data.8

The projection-based approach avoids this problem by exploiting the relationship among attributes and disclosing only those necessary for data mining.

Guiding data submission can also reduce unnecessary privacy disclosure, enhancing the performance of data perturbation. In earlier work,7 we and colleague Shengquan Wang proposed a guidance-based dimension reduction scheme for dynamic systems, such as online survey systems, in which data providers (survey respondents and so on) join the system and submit their data asynchronously. To guide data providers that have not yet submitted data, the scheme analyzes the data already collected and estimates the attributes necessary for data mining. The system then sends the estimated useful attributes to data providers as guidance. Our work shows that this guidance-based scheme is more effective than approaches without such guidance.
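The guidance loop can be pictured with a toy sketch. This is not the scheme from the cited work: the usefulness score below (gap between per-class means, scaled by overall spread) is a hypothetical stand-in for whatever estimate the server actually computes, and the record layout with a `label_key` field is assumed.

```python
import statistics


def estimate_useful_attributes(collected, label_key, top_n):
    """Server side: rank numeric attributes by a hypothetical usefulness
    score computed from the records collected so far."""
    attributes = [k for k in collected[0] if k != label_key]
    labels = sorted({record[label_key] for record in collected})
    scores = {}
    for attr in attributes:
        values = [record[attr] for record in collected]
        spread = statistics.pstdev(values) or 1.0  # avoid dividing by zero
        class_means = [
            statistics.fmean([r[attr] for r in collected if r[label_key] == lab])
            for lab in labels
        ]
        scores[attr] = (max(class_means) - min(class_means)) / spread
    return sorted(attributes, key=lambda a: scores[a], reverse=True)[:top_n]


def submit_with_guidance(record, guidance, label_key):
    """Provider side: disclose only the attributes named in the guidance."""
    keep = set(guidance) | {label_key}
    return {k: v for k, v in record.items() if k in keep}
```

Providers that join later receive the current guidance list and withhold everything else, which is how the guidance reduces disclosure of attributes that contribute little to the mining task.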

INFERENCE CONTROL PROTOCOL

Protecting private data in the data warehouse server requires controlling the information disclosed to the data mining servers—which is the aim of the inference control protocol. Following the minimum necessary rule, the inference control protocol ensures that the data warehouse server answers the queries necessary for data mining yet minimizes privacy disclosure.

Several requirements drive the inference control protocol’s design and implementation. One is the need to block inferences. If a data mining server becomes an adversary, it will try to infer private information from the query answers it has already received. Figure 3 gives an example.
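The arithmetic behind the Figure 3 example can be written out directly. In the sketch below, the answers are the values shown in the figure; the `known_correction` parameter is an assumption that stands for the net contribution of the cells the adversary already knows, which the figure's own arithmetic treats as zero.

```python
# Query answers taken from Figure 3: Q1 and Q3 are the April and June
# column sums, Q5 and Q6 are the Book and CD row sums.
ANSWERS = {"Q1": 30, "Q3": 58, "Q5": 25, "Q6": 47}


def infer_dvd_june(answers, known_correction=0):
    """Combine four aggregate answers to isolate one private cell.

    Q1 + Q3 covers every item in April and June; Q5 + Q6 covers Book and CD
    in every month. Their difference leaves the DVD/June cell plus cells the
    adversary already knows, which are removed via `known_correction`.
    """
    total = answers["Q1"] + answers["Q3"] - (answers["Q5"] + answers["Q6"])
    return total - known_correction


print(infer_dvd_june(ANSWERS))  # 88 - 72 = 16, the private DVD/June value
```

No single answer reveals the private cell; the disclosure comes from combining answers, which is why inference control has to reason about sets of queries rather than individual ones.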

Further, the inference control protocol must be efficient enough to satisfy the data warehouse server’s required online response time (the time between issuing a query and answering it). The time that an inference control protocol uses is part of that response time. It must be controlled so that the data warehouse server can maintain its reduced response time.

To meet these requirements, inference control protocols must restrict the information included in the query answers so that the data mining server cannot infer private data from received query answers.

Figure 4 shows an inference control protocol taxonomy based on two inference control methods.

Query-oriented method

The query-oriented method4 is centered on the concept of a safe query set, which says that query set <Q1, Q2, …, Qn> is safe if a data mining server cannot infer private data from the answers to Q1, Q2, …, Qn. Thus, query-oriented inference control means that when the data warehouse server receives a query, it will answer the query only if the union set of query history (the set of all queries already answered) and the recently received query are safe. Otherwise, it will reject the query. Relative to query-oriented inference control in statistical databases, inference control in data warehouses involves significantly more data. Consequently, the burden is on inference control protocols to process queries more efficiently.
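The answer-or-reject loop can be sketched for one narrow case: every query is a sum over a set of private cells, and "unsafe" means that some single cell can be computed exactly from the answers. Both restrictions are assumptions made for illustration; the safety test here (row reduction over exact fractions) is one simple choice, not the method of the cited work, and `QueryGate` is a hypothetical name.

```python
from fractions import Fraction


def rref(rows):
    """Reduced row echelon form using exact rational arithmetic."""
    m = [[Fraction(x) for x in row] for row in rows]
    pivot_row = 0
    n_cols = len(m[0]) if m else 0
    for col in range(n_cols):
        # Find a row at or below pivot_row with a nonzero entry in this column.
        pr = next((r for r in range(pivot_row, len(m)) if m[r][col] != 0), None)
        if pr is None:
            continue
        m[pivot_row], m[pr] = m[pr], m[pivot_row]
        pivot = m[pivot_row][col]
        m[pivot_row] = [x / pivot for x in m[pivot_row]]
        for r in range(len(m)):
            if r != pivot_row and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[pivot_row])]
        pivot_row += 1
        if pivot_row == len(m):
            break
    return m


def is_safe(queries, n_cells):
    """queries: list of sets of private-cell indices, each answered as a sum.
    Unsafe if some combination of answers isolates a single cell."""
    if not queries:
        return True
    rows = [[1 if i in q else 0 for i in range(n_cells)] for q in queries]
    return all(sum(1 for x in row if x != 0) != 1 for row in rref(rows))


class QueryGate:
    """Answers a query only if history plus the new query stays safe."""

    def __init__(self, n_cells):
        self.n_cells = n_cells
        self.history = []

    def submit(self, query):
        if is_safe(self.history + [query], self.n_cells):
            self.history.append(query)
            return True   # the server would answer the query
        return False      # the server rejects the query
```

With three private cells, the gate accepts the sums over cells {0, 1} and {1, 2} but rejects the sum over {0, 1, 2}, because subtracting either earlier answer from it would expose a single cell. The cost of re-running the check on a growing history is what makes the online variant expensive.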

Because dynamically determining a query set’s safety (online query history check) can be time-consuming, a static version of the query-oriented method might be more suitable. The static version determines a safe set of queries offline (before any query is actually received). If a query set is safe, then any one of its subsets is also safe. At runtime, when the data warehouse server receives the query, it answers only if the query is in the predetermined safe set. Otherwise, it will reject the query. On the downside, the static method is conservative in selecting a safe set, which might cause it to reject some queries unnecessarily.

Item    April     May       June      July      Sum
Book    10        Known     15        Known     Q5 = 25
CD      20        Known     27        Known     Q6 = 47
DVD     Known     35        16        36        Q7 = 87
Game    Known     25        Known     14        Q8 = 39
Sum     Q1 = 30   Q2 = 60   Q3 = 58   Q4 = 50

Figure 3. Inference that discloses private information. If the data mining server becomes an adversary, it might be able to infer from the query answers and certain cells (Known) the number of DVDs a data provider sold in June (which is private and should not be disclosed) by computing Q1 + Q3 - (Q5 + Q6) = 88 - 72 = 16, where Q1 to Q8 are query answers.

[Figure 4 diagram: the inference control protocol branches into the query-oriented method (classify safe and unsafe sets offline; check query history online) and the data-oriented method (do perturbation by data collection; do perturbation online when query received).]

Figure 4. An inference control protocol taxonomy. A designer can choose which of two methods (query- or data-oriented) best serves the design.

Data-oriented method

With the data-oriented method of inference control,9 the data warehouse server perturbs the stored raw data and estimates the query answers as accurately as possible on the basis of the perturbed data. As Figure 4 shows, the data collection protocol can handle perturbation unless the application requires storing original data in the data warehouse server. In that case, the data warehouse server might have to perturb the data when processing the query.

The data-oriented method assumes that perturbation can protect private information from being disclosed, enabling the data warehouse server to answer all queries freely on the basis of the perturbed data. Research has shown that the query answers estimated from the perturbed data can still support the construction of accurate data mining models.5

Advantages and disadvantages

The two methods have unique performance considerations. The data-oriented method offers query responsiveness, since the data warehouse server will answer all queries. The query-oriented method, in contrast, normally rejects a substantial number of queries,9 which means that some data mining servers might be unable to complete their data mining tasks.

On the plus side, the query-oriented method can provide more accurate answers than the data-oriented method. When the data warehouse server answers a query, its answer will always be precise. The data-oriented method, in contrast, answers queries with estimation, so it might not be accurate enough to support data mining, particularly when the construction of data mining models requires highly accurate query answers.

Efficiency is an important advantage for the static version of the query-oriented method, which has the shortest response time because most of its computational cost is offline. The dynamic version must trade off efficiency and query responsiveness: To answer more queries, the data warehouse server must spend more time analyzing the query history. The data-oriented method also suffers from low efficiency, since the computational overhead for query estimation can be several orders of magnitude higher than for query answering.

One way to enhance inference control protocol performance is to integrate query- and data-oriented methods. Introducing the query answer-or-reject scheme to the data-oriented method would let the data warehouse server reject some privacy-divulging queries (such as Q3 in Figure 3). This, in turn, would effectively downgrade the data perturbation level yet retain the same degree of privacy protection. Because the data is perturbed, the server would have to reject far fewer queries and could thus answer most queries fairly accurately while continuing to protect private information.

INFORMATION SHARING PROTOCOL

Because each data mining server constructs local data mining models in its own system, these servers are likely to share their local data mining models rather than the raw data in the data warehouses. Local data mining models can be sensitive, especially when the local models are not globally valid.

To protect the privacy of individual data mining systems, some mechanism must control the disclosure of private information in local data mining models. This mechanism is the information sharing protocol, which again follows the minimum necessary rule. The protocol’s objective is to enable data mining servers across multiple systems to construct global data mining models while disclosing only the minimum private information about local data mining models necessary for information sharing.

Many information sharing protocols exist for applications other than data mining, such as database interoperation or data integration.10 Information sharing is necessary for most distributed data mining systems, and much work has focused on designing specific information sharing protocols for data mining tasks.

A major design concern of the information sharing protocol is defending against adversaries that behave arbitrarily within the capability allocated to them. The defense strategy depends on the adversary model, that is, the set of assumptions about an adversary’s intent and behavior. Two of the more popular adversary models are semihonest10 and beyond semihonest.

Semihonest adversaries

An adversary is semihonest if it properly follows the designated protocol but records all intermediate computation and communication, thereby providing a way to derive private information.

Cryptographic encryption has proved effective in defending against semihonest adversaries.2,10,11 In this method, each data mining server encrypts its local data mining model and exchanges the encrypted model with other data mining servers.

Some encryption scheme properties, such as the Rivest-Shamir-Adleman (RSA) cryptosystem’s commutative encryption property, make it possible to design algorithms for data mining servers to perform certain data mining tasks and set operations without knowing the private keys of other entities.2,10,11 Tasks include classification, association rule mining, clustering, and collaborative filtering; set operations include set intersection, set union, and element reduction.

Because it is not possible to recover the original (local) data mining models from their encrypted values without knowing the private keys, this method is a secure defense against semihonest adversaries. Researchers have already evolved a detailed taxonomy and cryptographic encryption methods for various system settings.2,3
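To make the commutative property concrete, the toy sketch below computes a set intersection between two servers using modular exponentiation with a shared prime, where encrypting with key a and then key b gives the same result as the reverse order. This is a swapped-in illustration, not RSA itself and not any protocol from the cited references; the modulus, the hashing step, and the function names are assumptions, and both parties run inside one function purely for brevity. It is not secure as written.

```python
import hashlib
import math
import secrets

P = 2**127 - 1  # a Mersenne prime used as a toy-sized shared modulus


def keygen():
    """Pick a secret exponent that is invertible modulo P - 1."""
    while True:
        k = secrets.randbelow(P - 3) + 2
        if math.gcd(k, P - 1) == 1:
            return k


def to_group(item):
    """Hash an item to an integer in 1..P-1."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1


def encrypt(value, key):
    """Commutative: encrypt(encrypt(v, a), b) == encrypt(encrypt(v, b), a)."""
    return pow(value, key, P)


def private_intersection(items_a, items_b):
    """Server A learns which of its items server B also holds."""
    items_a = list(items_a)
    key_a, key_b = keygen(), keygen()  # each key stays with its owner
    a_once = [encrypt(to_group(x), key_a) for x in items_a]   # A -> B
    a_twice = [encrypt(c, key_b) for c in a_once]             # B -> A, same order
    b_once = [encrypt(to_group(y), key_b) for y in items_b]   # B -> A
    b_twice = {encrypt(c, key_a) for c in b_once}             # computed by A
    return {x for x, c in zip(items_a, a_twice) if c in b_twice}
```

Each server only ever sees the other's items under at least one layer of encryption it cannot remove, yet equal items collide after both layers are applied, which is what lets the intersection be computed.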

Beyond semihonest adversaries

An adversary is considered beyond semihonest if it deviates from the designated protocol, changes its input data, or both.

Because it is difficult if not impossible to defend against an adversary that is behaving arbitrarily, dealing with beyond semihonest adversaries requires more refined models. One such model is the intent-based adversary model,12 which formulates an adversary’s intent as combining the intent to obtain accurate data mining results with compromising other entities’ private information. A game-theoretic method is then developed to defend against adversaries that weigh the accuracy of data mining results over compromising other parties’ privacy.12

The basic idea is to design the information sharing protocol in a way that no adversary can both obtain accurate data mining results and intrude on other servers’ privacy. Adversaries that are more concerned with the accuracy of data mining results will be forced not to intrude on the privacy of others to get that accuracy.

OPEN RESEARCH ISSUES

Several issues require additional research to ensure the optimum performance of the techniques described.

Protocol integration

Many systems need a seamless integration of the three protocols, yet little research has addressed this need. Our proposed integrated architecture could serve as a platform for studying protocol interaction. Such insights can pave the way for effective and efficient integration.

Heterogeneous privacy requirements

Privacy-preserving data mining techniques depend on respecting the privacy protection levels that data providers require. Most existing studies assume homogeneous privacy requirements, that is, that all data owners need the same privacy level for all their data and its attributes. This assumption is unrealistic in practice and could even degrade system performance unnecessarily. Designing and implementing techniques that exploit heterogeneous privacy requirements is a challenge with much potential return.

Privacy measurements

The accuracy versus protection tradeoff inherent in privacy-preserving data mining means that some mechanism must accurately measure the degree of privacy protection. Although extensive work has focused on privacy measurement, as yet no one has proposed a commonly accepted measurement technique for generic privacy-preserving data mining systems. Proper privacy protection measurement has three criteria: It must

• reflect system settings (adversaries might have different levels of interest in different data values, such as being more concerned with patients that have contagious diseases than other diseases),

• account for data providers’ diverse privacy concerns (some might consider age as private information, while others are willing to disclose it publicly), and

• satisfy the minimum necessary rule.

A comprehensive study of privacy measurement for all three protocols would be a huge step toward improving the performance of privacy-preserving data mining techniques.

Anomaly detection

A common application of data mining is to detect data-set anomalies, as in mining log file data to detect intrusions. However, few researchers have considered privacy protection in detecting anomalies.

Research on anomaly detection is an important part of data mining and can contribute to multiple disciplines, such as security, biology, and finance. Thoroughly investigating issues related to the design of privacy-preserving data mining techniques for anomaly detection would be extremely beneficial.

Multiple protection levels

In some cases, multiple levels of private information must be protected. The first level might be a data point value, and the second level, the data point sensitivity (knowledge of whether or not a data point is private). Most existing studies focus on protecting the first level and assume that all entities already know the second level. Research has yet to answer how to protect the second level (and higher levels) of private information.

Our work is an important first step in addressing the critical systemic issues of privacy preservation in data mining. Much research remains to realize the potential of the architecture and design principles we have described. Much literature already addresses privacy-preserving data mining, but clearly the ideas must cross considerable ground to become practical systems. Studies are needed for the design of privacy-preserving data mining techniques in real-world scenarios, in which data owners can freely address their individual privacy concerns without the data miner’s consent. Also critical is work that more closely incorporates designs with specialized applications such as healthcare, market analysis, and finance. Our hope is that others will continue efforts in this important area.

References

1. J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.

2. C. Clifton et al., “Tools for Privacy Preserving Distributed Data Mining,” SIGKDD Explorations, vol. 4, no. 2, 2003, pp. 28-34.

3. V.S. Verykios et al., “State-of-the-Art in Privacy Preserving Data Mining,” SIGMOD Record, vol. 33, no. 1, 2004, pp. 50-57.

4. L. Wang, S. Jajodia, and D. Wijesekera, “Securing OLAP Data Cubes against Privacy Breaches,” Proc. 25th IEEE Symp. Security and Privacy, IEEE Press, 2004, pp. 161-175.

5. R. Agrawal and R. Srikant, “Privacy-Preserving Data Mining,” Proc. 19th ACM SIGMOD Int’l Conf. Management of Data, ACM Press, 2000, pp. 439-450.

6. R.J. Bayardo and R. Agrawal, “Data Privacy through Optimal k-Anonymization,” Proc. 21st Int’l Conf. Data Eng., IEEE Press, 2005, pp. 217-228.

7. N. Zhang, S. Wang, and W. Zhao, “A New Scheme on Privacy-Preserving Data Classification,” Proc. 11th ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining, ACM Press, 2005, pp. 374-383.

8. Z. Huang, W. Du, and B. Chen, “Deriving Private Information from Randomized Data,” Proc. 24th ACM SIGMOD Int’l Conf. Management of Data, ACM Press, 2005, pp. 37-48.

9. R. Agrawal, R. Srikant, and D. Thomas, “Privacy-Preserving OLAP,” Proc. 25th ACM SIGMOD Int’l Conf. Management of Data, ACM Press, 2005, pp. 251-262.

10. R. Agrawal, A. Evfimievski, and R. Srikant, “Information Sharing across Private Databases,” Proc. 22nd ACM SIGMOD Int’l Conf. Management of Data, ACM Press, 2003, pp. 86-97.

11. Y. Lindell and B. Pinkas, “Privacy Preserving Data Mining,” Proc. 12th Ann. Int’l Conf. Advances in Cryptology, Springer-Verlag, 2000, pp. 36-54.

12. N. Zhang and W. Zhao, “Distributed Privacy Preserving Information Sharing,” Proc. 31st Int’l Conf. Very Large Data Bases, ACM Press, 2005, pp. 889-900.

Nan Zhang is an assistant professor of computer science and engineering at the University of Texas at Arlington. His research interests include databases and data mining, information security and privacy, and distributed systems. Zhang received a PhD in computer science from Texas A&M University. He is a member of the IEEE. Contact him at [email protected].

Wei Zhao is a professor of computer science and the dean for the School of Science at Rensselaer Polytechnic Institute. His research interests include distributed computing, real-time systems, computer networks, and cyberspace security. Zhao received a PhD in computer and information sciences from the University of Massachusetts, Amherst. He is a Fellow of the IEEE and a member of the IEEE Computer Society and the ACM. Contact him at [email protected].
