Background

While DM itself is automated, many of the pre- and post-processing tasks require some human input and domain knowledge in order for the KDD process to be successful. For example, domain knowledge is often required for effective feature selection. Similarly, the output of data mining algorithms cannot simply be assumed to be valid. The additional steps in the KDD process are essential to ensure that useful and valid knowledge is derived from the data. Blind application of data-mining methods is a dangerous activity and easily leads to the discovery of meaningless or outright invalid patterns [35]. The reason for this is that one can find, for example, statistically significant patterns in any data set, even a randomly generated one, if one searches long enough. Other problems such as noise, incorrect or missing normalization, careless selection of features or failing to consider outliers can also lead to invalid patterns being found. These difficulties typically lead to an iterative KDD process, illustrated in figure 2.1 by the dotted arrows, where the results of all three steps in the KDD process feed back into the decisions made in the previous stages.
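The claim that statistically significant patterns can be found even in randomly generated data can be illustrated with a small experiment. The following is a minimal sketch and is not taken from [35]: the function names, the data sizes and the large-sample significance threshold are all assumptions chosen for illustration. It generates columns of independent noise and counts how many column pairs nevertheless pass an approximate 5% significance test for correlation.

```python
import math
import random
from itertools import combinations


def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_spurious_correlations(n_rows=200, n_features=40, seed=0):
    """Count feature pairs in pure noise that pass a roughly 5% test."""
    rng = random.Random(seed)
    # Every column is independent Gaussian noise: there is no real structure.
    columns = [[rng.gauss(0.0, 1.0) for _ in range(n_rows)]
               for _ in range(n_features)]
    # Assumption: large-sample approximation. Under independence,
    # r * sqrt(n) is roughly standard normal, so |r| > 1.96 / sqrt(n)
    # corresponds to a two-sided test at about the 5% level.
    threshold = 1.96 / math.sqrt(n_rows)
    pairs = list(combinations(range(n_features), 2))
    hits = [(i, j) for i, j in pairs
            if abs(pearson(columns[i], columns[j])) > threshold]
    return len(hits), len(pairs)


if __name__ == "__main__":
    hits, total = count_spurious_correlations()
    print(f"{hits} of {total} pairs look 'significant' in random data")
```

With 40 features there are 780 pairs to test, so roughly one in twenty of them is expected to cross the threshold by chance alone. Searching more pairs only produces more such accidental "patterns", which is why validation steps around the mining step matter.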

The interested reader is directed to [35] for a more in-depth introduction to the high-level KDD process. While as of this writing the article is over 13 years old and the field has progressed considerably, it provides a good overview and definitions of KDD, the classic KDD process, its relationship to longer established fields like artificial intelligence, machine learning, statistics and databases, as well as some early real-world applications. For more concrete information about the various tasks and algorithms in the KDD process, the reader is directed to [88] and [46], which provide excellent introductions.

2.2 Data Mining

Data Mining (DM) is the most important and complex component of the KDD process. There are a number of definitions of data mining. The following is most closely related to that of [88], with the primary difference being the use of the term pattern instead of information¹.

¹ This is for two reasons: First, the author prefers the more concrete term pattern since, as shall hopefully become clear in the text, this leads to a natural way of explaining and understanding the data mining process. Secondly, the author doesn't think that patterns (or to be more specific, pattern instances in the terminology introduced here) necessarily qualify as information or knowledge. Patterns are the output of an automated process which together, perhaps after post-processing, visualization, validation or human interaction, may become information. It should be noted that pattern is usually used in the literature in the context of association rules and itemset mining, for example frequent pattern mining, but that patterns are not restricted to this sub-field. Pattern is also used in the definition of data mining by [35].

Definition 2.1. Data mining is the automated process of extracting useful patterns from typically large quantities of data.

Different types of patterns capture different types of structures in the data: A pattern may be in the form of a rule, cluster, set, sequence, graph, tree, etc., and each of these pattern types is able to express different structures and relationships present in the data. For example, a rule may tell a marketer about strong relationships between purchased goods or services, predict customer `churn' or be used as the basis of recommender systems. A set can indicate products that customers are purchasing together and a cluster might tell him or her about groups of customers that have similar purchasing patterns. These may be used as the basis for a marketing scheme that aims to maximize response rates and sales for a given investment. A graph may tell a biologist about strong and previously unknown gene or protein interactions present in their experiments, or a security agent about suspicious communication structures between potential criminals. A tree or a set of rules may describe the decision structure that can be used to accurately predict medical conditions, perhaps based on patient records or medical imaging data. As suggested by these examples, patterns can be descriptive or predictive (or both). That is, they can be used to describe, model and help better understand a process or phenomenon, or to predict future events. In this work, many new types of patterns are introduced. Some are purely descriptive and some are explicitly used for prediction purposes.
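To make the marketing example concrete, the following minimal sketch treats a rule between purchased goods as a pattern that is both descriptive and predictive. The toy baskets, the function name `rule_confidence` and the choice of confidence as the strength measure are assumptions for illustration only; they are not taken from this text.

```python
def rule_confidence(antecedent, consequent, transactions):
    """Estimate P(consequent | antecedent) from a list of transactions.

    Returns 0.0 when the antecedent never occurs, so the rule is simply
    treated as unsupported rather than raising a division error.
    """
    antecedent = frozenset(antecedent)
    both = antecedent | frozenset(consequent)
    n_antecedent = sum(antecedent <= t for t in transactions)
    n_both = sum(both <= t for t in transactions)
    return n_both / n_antecedent if n_antecedent else 0.0


# Hypothetical purchase data: each basket is the set of goods bought together.
baskets = [frozenset(b) for b in (
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"milk", "eggs"},
    {"bread", "butter", "milk"},
)]

if __name__ == "__main__":
    # Descriptive use: how strongly is butter associated with bread?
    conf = rule_confidence({"bread"}, {"butter"}, baskets)
    print(f"bread -> butter holds in {conf:.0%} of baskets containing bread")
    # Predictive use: recommend butter to a new customer who bought bread.
    new_basket = {"bread"}
    if {"bread"} <= new_basket and conf >= 0.7:
        print("recommend: butter")
```

Here three of the four baskets containing bread also contain butter, so the rule describes the existing data and can also serve as the basis of a simple recommendation.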

Once a pattern type has been defined based on the problem at hand and the structure of the information sought, the goal and challenge is to automatically find those pattern instances² in the data that are interesting to the end user. That is, pattern instances providing both useful and previously unknown (novel) information. This is done by evaluating pattern instances for interestingness according to some measure that, ideally, should model the value that the user obtains from being made aware of the pattern. An interestingness measure measures how interesting a pattern instance is expected to be. These measures must balance three important characteristics:

1. The utility that a user is expected to receive by exploiting the pattern instance. For example: the value, financial gain, increase in accuracy or efficiency or [...]

² In the terminology used here, a pattern type describes the structure or schema of the pattern. For example, an itemset can be defined as a non-empty subset of all items that may be purchased in a supermarket. This specifies its type. A pattern instance on the other hand is a particular instantiation of that type. For example, {bread, butter, jam}. A pattern therefore has one type but many instances, and these instances are found in the data set. This distinction between type and instance is rarely made explicit but helps to describe the data mining process in terms of the definition used in this thesis. In the literature and common usage, the term pattern is inherently ambiguous and refers to both or either the type or instance, depending on the context. Outside this section, the term will typically refer to the pattern instance of the type being discussed.
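The distinction between a pattern type and its instances, and the idea of scoring instances with a measure, can be sketched in a few lines. This is only an illustration under stated assumptions: the toy baskets and the names `itemset_instances` and `rank_by_support` are hypothetical, and raw support (the fraction of transactions containing the instance) is used as a deliberately simplistic stand-in for an interestingness measure.

```python
from collections import Counter
from itertools import combinations


def itemset_instances(transaction, max_size=3):
    """Yield every non-empty itemset (up to max_size items) in a transaction.

    The pattern *type* is "non-empty subset of the items"; each yielded
    frozenset is one pattern *instance* of that type.
    """
    items = sorted(transaction)
    for size in range(1, min(max_size, len(items)) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def rank_by_support(transactions, max_size=3):
    """Score each instance found in the data by its support, highest first."""
    counts = Counter()
    for transaction in transactions:
        counts.update(itemset_instances(transaction, max_size))
    n = len(transactions)
    scored = [(count / n, sorted(instance))
              for instance, count in counts.items()]
    # Sort by descending support, breaking ties alphabetically.
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


# Hypothetical supermarket baskets.
baskets = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk"},
]

if __name__ == "__main__":
    for score, instance in rank_by_support(baskets):
        print(f"{score:.2f}  {{{', '.join(instance)}}}")
```

In this toy data the single item {bread} scores highest, while {bread, butter, jam} appears in only one basket. Raw support therefore tends to favour small, obvious instances, which hints at why a practical interestingness measure has to weigh more than one characteristic.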

In document Verhein, Florian (2010): Generalised Interaction Mining: Probabilistic, Statistical and Vectorised Methods in High Dimensional or Uncertain Databases. Dissertation, LMU München: Fakultät für Mathematik, Informatik und Statistik (Page 55-57)