3.3 Miscellaneous Methods
3.3.1 Data Mining
More and more data is collected and stored in huge databases. Stored information (about people) is mostly used by companies and governments to provide services. However, the information could also be used to predict trends or even the behavior and habits of people (customers). These predictions are created by a technique called data mining, which businesses often use to gain advantages over their competitors.
Unfortunately, the huge amount of data saved about individual people poses a threat to privacy, because the information collected is increasingly shared with and sold to other companies.
Furthermore, it is becoming more difficult for people to keep track of who knows what about them, an ability that is an important part of the German Grundrecht auf informationelle Selbstbestimmung (basic right to informational self-determination).
Basically, data mining is used to discover interesting and useful patterns within (huge) databases in order to predict future trends or behaviors. The collection of data as well as data analysis and the creation of profiles can be supported by automatic methods. During the process, the collected data is processed and new assumptions are generated from the results. The rules that generate these assumptions (new data) are often written by experts who are familiar with the topic.
In [Sei03] data mining is described as software-supported, automated prediction based on known patterns of behavior and the calculation of as yet unknown relations, patterns and trends within huge databases. Data mining should discover and, of course, use new information and knowledge. Other objectives of DM include segmentation, classification, forecasting, concept description, anomaly detection and dependency analysis.
While some phases of DM (from the collection of data to the actual use of newly derived data) can indeed be automated, others still have to be conducted manually. Human beings are still an important part of data mining (cf. DAT03), especially when it comes to protecting people's privacy (e.g. reviewing DM rules).
Data Mining Models
There are two different DM models (cf. [Sei03]):
Verification Model: Questions and hypotheses created by experts are verified or discarded on the basis of analyses performed with numerous tools.
Discovery Model: Hypotheses are created automatically and then validated against the data collected.
Data Mining Phases
The process of data mining consists of four different phases (cf. [Sei03]):
Planning Phase: Definition of objectives and choice of experts.
Preparation Phase: Collection and processing of data, including editing and deleting obviously false or misleading data. Creation of a “Mining Base” for the following procedures.
Mining Phase: Actual search for interesting patterns. Sometimes it is necessary to repeat steps from earlier phases to add new data to the Mining Base.
Analysis: The results of the search are analyzed and processed in order to make them understandable for non-experts. This objective is achieved by interpreting and visualizing the results gained.
Data derived by Data Mining
Data and patterns derived by data mining should possess certain attributes (cf. [Sei03]):
understandable: patterns have to be easy to understand and visualized graphically
valid: patterns should also hold for data collected in the future
useful: patterns have to be useful and relevant for the particular objective
non-trivial: only interesting patterns are presented
interesting: discovered knowledge has to be interesting as well
Most importantly, the data should be described in a way that is easily understood by people. Common patterns in mined data are listed below:
Rules: Data in databases is divided into classes. Attribute classes describe attributes of objects and regression rules predict numeric values.
Cluster: Data is divided into groups by means of statistical procedures. A description is added to each group to make it easier to understand.
Dependency Patterns: These patterns determine dependencies between variables of relevant data. Which groups and attributes turn up together?
Connection Patterns: Determine regularities among different objects within one database or among different databases.
Sequence Pattern: Search for regular occurrences of event sequences.
Data Mining Techniques
Common Data Mining techniques are listed below (cf. [Sei03]):
Cart Analysis: tries to find products that are often bought together. When combined with customer data, it is possible to derive what certain customers will buy in the future. This analysis is also used for direct marketing, by remembering customer preferences and notifying customers of matching special offers. Cart analysis belongs to the cluster analysis group.
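As an illustration of the cart analysis idea, the following minimal sketch counts how often two products occur in the same basket and keeps the pairs that reach a minimum count. The function name `frequent_pairs`, the `min_support` threshold and the sample baskets are assumptions made for this example; they are not taken from [Sei03].

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Count how often each pair of products appears in the same basket."""
    pair_counts = Counter()
    for basket in baskets:
        # set() ignores duplicate items, sorted() gives each pair a fixed order
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1
    # keep only the pairs bought together at least min_support times
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}


# Hypothetical shopping baskets, invented for illustration only
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "cereal"],
    ["bread", "butter", "cereal"],
]
pairs = frequent_pairs(baskets, min_support=2)
print(pairs)
```

Linking such pair counts to customer records would be the step that turns co-occurrence into a prediction about individual customers.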
Case-based Reasoning: By remembering past experiences, this method tries to derive future decisions. Attributes and parameters are stored in a database, and when a decision has to be made, this database is searched for the attributes and parameters that resemble the current decision the most. The more closely the stored attributes resemble it, the more accurate the prediction will be.
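The retrieval step of case-based reasoning can be sketched as a similarity search over stored cases. The similarity measure used here (the fraction of matching attribute values) is an assumption chosen for simplicity, and the case base and attribute names are hypothetical.

```python
def similarity(case_attrs, query):
    """Fraction of query attributes whose values match the stored case."""
    matches = sum(1 for key, value in query.items() if case_attrs.get(key) == value)
    return matches / len(query)


def most_similar_case(case_base, query):
    """Return the stored case that resembles the query the most."""
    return max(case_base, key=lambda case: similarity(case["attributes"], query))


# Hypothetical case base, invented for illustration only
case_base = [
    {"attributes": {"age_group": "young", "income": "low", "region": "north", "owns_car": "no"},
     "decision": "reject"},
    {"attributes": {"age_group": "young", "income": "high", "region": "south", "owns_car": "yes"},
     "decision": "accept"},
    {"attributes": {"age_group": "old", "income": "high", "region": "north", "owns_car": "yes"},
     "decision": "accept"},
]
query = {"age_group": "old", "income": "high", "region": "north", "owns_car": "no"}

best = most_similar_case(case_base, query)
print(best["decision"], similarity(best["attributes"], query))
```

The decision of the retrieved case is reused for the new situation; the similarity score gives a rough indication of how much that reuse can be trusted.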
Decision Tree: More complex decisions consist of numerous smaller decisions. A decision tree is built of nodes, and at each of these nodes attributes and parameters are queried and a decision is made, until a node is reached where it isn't possible to make further decisions. The technique looks for the best-fitting attribute and parameter at every single node that is reached during decision making.
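The following sketch shows only how a finished decision tree is applied: one attribute is queried at each node until a leaf is reached. The tree is built by hand here, whereas a data mining tool would derive it from data; the attribute names and decisions are hypothetical.

```python
def decide(node, record):
    """Walk the tree from the root until a leaf (no further test) is reached."""
    while "attribute" in node:
        # query one attribute of the record and follow the matching branch
        value = record[node["attribute"]]
        node = node["branches"][value]
    return node["decision"]


# Hypothetical hand-built tree, invented for illustration only
tree = {
    "attribute": "income",
    "branches": {
        "high": {"decision": "offer premium product"},
        "low": {
            "attribute": "existing_customer",
            "branches": {
                "yes": {"decision": "offer discount"},
                "no": {"decision": "no offer"},
            },
        },
    },
}

result = decide(tree, {"income": "low", "existing_customer": "yes"})
print(result)
```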
Neural Network: A neuron processes an input to produce some kind of output and can be connected with other neurons to share information. Input as well as output travels between neurons - hence a network is created that is capable of processing information. A particular input produces a particular output, but like in a black box²² the process that creates the solution isn't visible.
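A minimal sketch of how input travels through connected neurons is given below. It assumes a sigmoid activation and a single hidden layer, and the weights are chosen by hand purely for illustration; a real network would learn them from training data.

```python
import math


def neuron(inputs, weights, bias):
    """One neuron: weighted sum of its inputs passed through a sigmoid."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))


def forward(inputs, hidden_layer, output_neuron):
    """Feed the inputs through one hidden layer and a single output neuron."""
    hidden_outputs = [neuron(inputs, w, b) for w, b in hidden_layer]
    weights, bias = output_neuron
    return neuron(hidden_outputs, weights, bias)


# Hypothetical weights and biases, invented for illustration only
hidden_layer = [([0.5, -0.4], 0.1), ([-0.3, 0.8], 0.0)]
output_neuron = ([1.2, -0.7], 0.05)

output = forward([1.0, 0.0], hidden_layer, output_neuron)
print(output)
```

The same input always yields the same output, yet the weights alone say little about why, which is the black-box character mentioned above.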
Genetic Algorithms: try to derive an optimal solution from randomly picked base solutions. The best-fitting candidate solutions are picked and recombined, and the result forms the new base of solutions. These solutions are sometimes changed randomly to resemble something like mutation²³.
Automatic Cluster Analysis: This technique is mostly the first step when mining huge databases, in order to build groups of information. These groups are further processed with other DM techniques.
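The genetic algorithm entry above names three steps: picking the best-fitting solutions, recombining them, and occasionally mutating the result. The sketch below shows these steps on a toy problem. The fitness function (number of 1-bits), the population size, the mutation rate and all names are assumptions made for this example.

```python
import random


def fitness(solution):
    """Toy objective (an assumption): the number of 1-bits in the solution."""
    return sum(solution)


def evolve(population, generations, rng, mutation_rate=0.05):
    for _ in range(generations):
        # selection: keep the better half of the current solutions
        population = sorted(population, key=fitness, reverse=True)
        parents = population[: len(population) // 2]
        children = []
        while len(parents) + len(children) < len(population):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(a))
            child = a[:cut] + b[cut:]  # recombination of two parents
            # mutation: occasionally flip a bit
            child = [1 - bit if rng.random() < mutation_rate else bit for bit in child]
            children.append(child)
        population = parents + children  # the new base of solutions
    return max(population, key=fitness)


rng = random.Random(0)  # fixed seed so the run is repeatable
start = [[rng.randint(0, 1) for _ in range(20)] for _ in range(30)]
best = evolve(start, generations=40, rng=rng)
print(fitness(best), "of", len(best))
```

Because the selected parents are carried over unchanged, the best fitness in the population never decreases from one generation to the next.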
Relation Analysis: tries to associate individual elements of information in a database. This analysis is only able to process structured information that has been specifically prepared.
Data Mining and Privacy
The main objective of data mining is to create new knowledge from already existing knowledge. This ability could be of great benefit to society, but it could also be abused. After 9/11, data mining became even more important in the search for potential terrorists.
Often non-anonymised data is needed, and it is impossible to know beforehand the nature of the results that will be delivered. Hence the fact that the already existing knowledge does not violate privacy does not necessarily mean that the newly gained information preserves privacy as well (e.g. when public data is used to derive sensitive data of a private or ethical nature). Numerous studies and surveys (cf. [Vai02], [Ful04] and [Kan04]) have covered the issue of data mining and (the protection of) privacy.
To determine whether a data mining method violates privacy, it is necessary to know (cf. [Kan04]):
• Which particular information is considered sensitive?
• To whom is it sensitive? Whose privacy is at risk?
• What else is known?
• What is an acceptable trade-off between privacy and benefit of the result? How is this trade-off measured?
22 system, where input and output are visible, but the (mostly complex) procedure in between isn't
23 spontaneous changes in a system - in this case in order to investigate new possibilities
[Kan04] also attempts to build a so-called classifier that keeps privacy violations to a minimum during data mining. This classifier conducts the actual data mining process and is designed as a black box in order to hide the procedure of deriving new data. When the procedure is visible (as in neural networks or decision trees), the sensitive data used could easily be exposed. Through the use of cryptography, someone who wants data mining to be done has access only to the classifier itself and gains no further information about it (e.g. how it actually works).
However, the question of what is considered private information and what is not is not that easy to answer. Many privacy laws include a so-called trade-off between cost and benefit for the use of private information. Some of these laws include an ‘in the public interest’ provision covering the use of sensitive information and the resulting loss of privacy.
The study conducted in [Kan04] identified the number of data examples given to the classifier as the most important factor in the threat to privacy: the more examples are offered as input, the more difficult it becomes to trace information back to its origin. Whenever sensitive data and unknown data are strongly associated with one another, privacy is threatened.
The survey in [Ful04] found that the rules used to derive information during data mining pose a threat and should be monitored (also according to their interestingness²⁴). These rules are usually created by domain experts - people who have profound knowledge of the particular field in which data mining will be used. Additional measures that could protect privacy are:
• limit access to data
• restrict scope of queries
• hide/delete data
• alert users when rules are used that could violate privacy
Privacy preservation is needed in many different situations, including:
• secure sharing of data between companies
• guarantee confidentiality of data available to the public, so people can’t be identified by aggregated data
• anonymisation of private data by mutating or randomising data
• access control, not just to databases
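The anonymisation-by-randomising item in the list above can be illustrated with a simple perturbation sketch: each value receives zero-mean random noise, so individual values are distorted while aggregates such as the mean stay roughly intact. The uniform noise, the `spread` parameter and the sample data are assumptions made for this example and are not taken from the cited studies.

```python
import random


def add_noise(values, spread, rng):
    """Perturb each value with zero-mean uniform noise in [-spread, spread]."""
    return [v + rng.uniform(-spread, spread) for v in values]


rng = random.Random(42)  # fixed seed so the run is repeatable
# Hypothetical ages, invented for illustration only
ages = [rng.randint(18, 90) for _ in range(1000)]
noisy_ages = add_noise(ages, spread=5.0, rng=rng)

true_mean = sum(ages) / len(ages)
noisy_mean = sum(noisy_ages) / len(noisy_ages)
print(round(true_mean, 2), round(noisy_mean, 2))
```

A larger `spread` hides individual values better but distorts results more, which is the trade-off between privacy and accuracy discussed in this section.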
Techniques for preserving privacy are listed below (cf. [Ful04]):
Authority Control and Cryptography: data is hidden from unauthorized access, but inappropriate use by authorized users isn't prevented
Anonymisation: any attributes that enable identification are removed from the source data
Query Restriction: tries to detect threats to privacy caused by combinations of queries
Dynamic Sampling: chooses a different data set for each query and reduces the size of the data set
Noise Addition: changes results to protect privacy, while still keeping accuracy
Multiparty Computation: data mining is conducted at numerous sites and the results are combined
24 measures whether a rule is useful for accurate results or in fact misleading
To protect privacy during the process of data mining, the main objective is to find a DM model that doesn't reveal the data used. This could be achieved by altering the data before use (and accepting a loss of accuracy in the results) or by secure multiparty computation.
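One common building block of secure multiparty computation is a secure sum based on additive secret sharing, sketched below. It is given only as an illustration of the general idea and is not necessarily the protocol used in [Kan04] or [Ful04]. The sketch assumes honest sites, simulates all sites in one process instead of using a network, and the modulus and sample counts are hypothetical.

```python
import random

MODULUS = 2**61 - 1  # public modulus for all arithmetic (an assumption)


def make_shares(secret, parties, rng):
    """Split a secret into random shares that add up to it modulo MODULUS."""
    shares = [rng.randrange(MODULUS) for _ in range(parties - 1)]
    shares.append((secret - sum(shares)) % MODULUS)
    return shares


def secure_sum(secrets, rng):
    n = len(secrets)
    # each site splits its private value and sends one share to every site
    all_shares = [make_shares(s, n, rng) for s in secrets]
    # each site adds up the shares it received and publishes only that sum
    partial_sums = [sum(all_shares[i][j] for i in range(n)) % MODULUS for j in range(n)]
    # combining the partial sums yields the total, not the individual inputs
    return sum(partial_sums) % MODULUS


# Hypothetical local counts held by three sites, invented for illustration only
local_counts = [120, 75, 310]
total = secure_sum(local_counts, random.SystemRandom())
print(total)
```

Each individual share looks like a random number, so a site learns the combined result without seeing the values held by the other sites.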