Although data preparation accounts for the largest part of the costs of KDD projects, research has mainly focused on the more central step of KDD, viz. data mining algorithms. The need for data preparation, though, is well known and has already led to many tools. These are also included in commercial environments for KDD, to be applied by knowledgeable data analysts.
As Pyle puts it [101], the task of data preparation for data mining is two-fold: the data have to be transformed such that data mining algorithms can be applied with high prospects for success, and the analyst has to become informed for mining and for the evaluation and application of the results.
In a multi-relational scenario, e. g. with data from a relational database to be analyzed, a number of proposals and systems have been put forward to help the analyst. Among them are suggestions for combining and modifying data sets [114], ultimately carried out by the user with the help of database query languages.
Systems such as MiningMart [30, 87] or Xelopes [125] further support the user in multi-relational data preparation with means for the easy application of operators, up to opportunities to archive successful data preprocessing procedures for later access in similar projects. There is also a tendency towards the usage of standardized languages such as the Predictive Model Markup Language (PMML).
In the following, we focus on aspects of data preparation that are of special relevance for the subsequent chapters.
2.4.1 Feature Construction
For KDD with a single table input for the data mining algorithm, feature construction means the creation of new columns for that single table.
Algorithms for conventional feature construction also take a single table as input and compute new attributes from one or more of the attributes given in that table.
For instance, from two attributes that describe the length and width of an object, its area may be computed.
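The area example can be sketched in a few lines. This is only an illustration: the table layout and the names `rows`, `add_area`, `length`, `width` and `area` are our assumptions, not part of any particular system.

```python
# Hypothetical single-table input: each row describes one object.
rows = [
    {"id": 1, "length": 2.0, "width": 3.0},
    {"id": 2, "length": 4.0, "width": 0.5},
]


def add_area(table):
    """Return a copy of the table with a new column 'area' = length * width."""
    return [{**row, "area": row["length"] * row["width"]} for row in table]


# The input table is left unchanged; the new column exists only in the copy.
extended = add_area(rows)
```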
In a broader sense, manipulations of single existing attributes can also be counted as conventional feature construction.
An example would be discretization, where a numeric attribute could be replaced by a nominal attribute that symbolizes ranges of the former numeric values with the help of names.
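A minimal sketch of such a discretization follows. The cut points and range names are chosen by hand here and are purely illustrative; how to choose them is a separate question that this sketch does not address.

```python
import bisect


def discretize(values, cut_points, labels):
    """Map each numeric value to the name of the range it falls into.

    cut_points must be sorted, and there must be exactly one more label
    than cut points. A value equal to a cut point goes to the upper range.
    """
    if len(labels) != len(cut_points) + 1:
        raise ValueError("need exactly one label more than cut points")
    return [labels[bisect.bisect_right(cut_points, v)] for v in values]


# Hypothetical length values and hand-picked range boundaries.
lengths = [0.4, 1.0, 2.5, 7.0]
names = discretize(lengths, cut_points=[1.0, 5.0],
                   labels=["short", "medium", "long"])
```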
Another example would be range normalization, e. g. by dividing the length values of all target objects by their maximum in order to arrive at an attribute for length with values between 0 and 1.
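As a sketch, division by the maximum can be written as below. The function name is ours, and the code assumes non-negative values with a positive maximum, which is what makes the result fall between 0 and 1.

```python
def normalize_by_max(values):
    """Divide every value by the maximum of the attribute.

    Assumes non-negative values and a positive maximum, so that the
    results lie in the interval [0, 1].
    """
    maximum = max(values)
    if maximum <= 0:
        raise ValueError("maximum must be positive")
    return [v / maximum for v in values]


# Hypothetical length values of three target objects.
normalized = normalize_by_max([2.0, 4.0, 8.0])
```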
A final example here would be a coding of nominal attributes with n possible values by n binary attributes that indicate the occurrence of the possible nominal values.
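The binary coding can be sketched as follows. The function name and the alphabetical ordering of the generated columns are our choices for illustration only.

```python
def binary_code(values):
    """Encode a nominal attribute with n distinct values as n binary attributes.

    Returns the list of distinct values (one per new column, sorted
    alphabetically) and one 0/1 row per input value.
    """
    categories = sorted(set(values))
    coded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, coded


# Hypothetical nominal attribute with three possible values.
categories, coded = binary_code(["red", "green", "red", "blue"])
```

Each coded row contains exactly one 1, in the column of the value that occurred.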
Propositionalization is also an approach for feature construction. However, an algorithm for propositionalization takes multiple relations as input and usually concerns more complex structures than conventional feature construction. Here, new attributes are computed from specifics of several objects related to a target object. More details can be found in the following chapters.
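To give a first impression of the idea, the following sketch derives new attributes for each target object from the objects related to it in a second relation. It is a deliberately simple illustration under our own assumptions (two hypothetical relations `customers` and `orders` joined on `cust_id`, and three hand-picked summaries) and not the algorithm discussed in the following chapters.

```python
from collections import defaultdict

# Hypothetical target relation and one related relation.
customers = [{"cust_id": 1}, {"cust_id": 2}, {"cust_id": 3}]
orders = [
    {"cust_id": 1, "amount": 10.0},
    {"cust_id": 1, "amount": 30.0},
    {"cust_id": 2, "amount": 5.0},
]


def propositionalize(targets, related, key, attr):
    """Add one row per target object with summaries of its related objects."""
    grouped = defaultdict(list)
    for row in related:
        grouped[row[key]].append(row[attr])
    result = []
    for target in targets:
        vals = grouped.get(target[key], [])
        result.append({
            **target,
            f"{attr}_count": len(vals),
            f"{attr}_sum": sum(vals),
            # No related objects: the maximum is undefined, so store None.
            f"{attr}_max": max(vals) if vals else None,
        })
    return result


table = propositionalize(customers, orders, key="cust_id", attr="amount")
```

The result is again a single table with one row per target object, which is the input format conventional data mining algorithms expect.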
2.4.2 Feature Selection
Considering again the conventional case of data mining with a single table input, it is usually good to have a larger number of rows in such a table. With a growing number of learning examples as represented by those rows, the statistics and heuristics that form the basis for learning get more reliable, as a rule.
The situation is different w. r. t. the number of columns, though. Here, larger numbers mean growing hypothesis spaces, which endanger not only the efficiency of search but also its effectiveness, e. g. through a growing risk of arriving at only locally optimal solutions or of overfitting.
Perhaps even more counterintuitive are findings such as the following. For classification tasks, not only features without a correlation with the target attribute can have negative effects on learning, but also features with a certain predictive potential, as demonstrated by John [51], among others. Approaches to feature (subset) selection can improve the situation; for an overview, see the book by Liu and Motoda [79].
Feature selection methods are often classified into filters and wrappers [79, 132]. While filters choose attributes based on general properties of the data before learning takes place, wrappers intermingle feature selection and learning.
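The difference between the two families can be sketched as follows. Both functions are illustrations under our own assumptions: the filter ranks attributes by their absolute Pearson correlation with the target, and the wrapper performs a greedy forward search guided by the leave-one-out accuracy of a 1-nearest-neighbour learner. Other criteria, learners and search strategies are equally possible.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if one of the inputs is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return 0.0 if vx == 0 or vy == 0 else cov / math.sqrt(vx * vy)


def filter_select(X, y, k):
    """Filter: score attributes from the data alone, before any learning."""
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(len(X[0]))]
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def loo_accuracy(X, y, subset):
    """Leave-one-out accuracy of 1-nearest-neighbour on the given attributes."""
    hits = 0
    for i, row in enumerate(X):
        nearest = min((j for j in range(len(X)) if j != i),
                      key=lambda j: sum((row[a] - X[j][a]) ** 2 for a in subset))
        hits += y[nearest] == y[i]
    return hits / len(X)


def wrapper_select(X, y):
    """Wrapper: greedy forward selection driven by the learner's accuracy."""
    selected, best_acc = [], 0.0
    remaining = set(range(len(X[0])))
    while remaining:
        candidate = max(sorted(remaining),
                        key=lambda j: loo_accuracy(X, y, selected + [j]))
        acc = loo_accuracy(X, y, selected + [candidate])
        if acc <= best_acc:  # stop as soon as no attribute improves accuracy
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best_acc = acc
    return selected


# Hypothetical data: attribute 0 separates the classes, attribute 1 does not.
X = [[0.0, 5.0], [0.1, 1.0], [0.2, 4.0], [1.0, 1.5], [1.1, 4.5], [1.2, 0.5]]
y = [0, 0, 0, 1, 1, 1]
```

The filter never runs the learner, which makes it cheap. The wrapper calls the learner for every candidate subset it considers, so its cost grows with the number of attributes and the cost of learning.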
The methods for feature selection are also often subdivided into those that judge only single attributes at a time and those that evaluate and compare whole sets of attributes. The former are also called univariate methods, the latter multivariate methods. Furthermore, different selection criteria and search strategies can be applied.
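Why the distinction between judging single attributes and judging sets matters can be seen on an XOR-like toy problem. The sketch below uses information gain as the selection criterion, which is our choice for illustration; the data and all names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def info_gain(rows, labels, attrs):
    """Information gain of the class given the joint values of attrs."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[tuple(row[a] for a in attrs)].append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# The class is the exclusive-or of the two binary attributes.
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 1, 1, 0]

gain_first = info_gain(rows, labels, [0])    # single attribute, judged alone
gain_second = info_gain(rows, labels, [1])   # single attribute, judged alone
gain_pair = info_gain(rows, labels, [0, 1])  # both attributes, judged jointly
```

Judged one at a time, neither attribute carries any information about the class, whereas the pair determines it completely. A method that evaluates sets of attributes can detect this, at the price of a much larger search space.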
Approaches to dimensionality reduction have also been developed within ILP, e. g. by Alphonse and Matwin [3]. Especially in the context of propositionalization, where unsupervised feature construction may lead to many redundant or otherwise irrelevant attributes, a selection of the good features seems advisable.
This was in fact investigated on several occasions, e. g. by Lavrač and Flach [77] and by ourselves [72, 73], see Chapter 5.
2.4.3 Aggregation
Cabibbo and Torlone [21] state that aggregate functions have always been considered an important feature of practical database query languages, but a systematic study of those has evolved only slowly. In many cases, the aggregate functions as provided by SQL were in the focus of the investigations. In fact, the same holds for large parts of our investigations as presented in this thesis.
The authors [21] let {{N}} denote the class of finite multisets of values from a countably infinite domain N and define an aggregate function over N as a total function from {{N}} to N, mapping each multiset of values to a value. Our view largely corresponds to that definition, although N may be a finite set, and the function values may also come from a set of values different from N, for instance when counting a certain value of a nominal attribute.
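This view can be made concrete with a small sketch in which a multiset is represented by Python's `collections.Counter`, mapping each value to its multiplicity. The function names are ours. The last function illustrates the relaxation mentioned above: its argument is a multiset of nominal values, while its result is a number.

```python
from collections import Counter


def agg_count(multiset):
    """Number of elements in the multiset, counting duplicates."""
    return sum(multiset.values())


def agg_sum(multiset):
    """Sum of all elements, each weighted by its multiplicity."""
    return sum(value * mult for value, mult in multiset.items())


def count_value(multiset, value):
    """How often one particular value occurs; 0 if it does not occur."""
    return multiset[value]


# Hypothetical multisets over a numeric and a nominal domain.
numbers = Counter([3, 3, 5])
colours = Counter(["red", "blue", "red"])
```

All three functions are defined for every finite multiset, including the empty one, so they are total in the sense of the definition. A function such as the maximum would need an extra convention for the empty multiset.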
Aggregate functions are often used in statistics to describe properties of samples of populations, e. g. averages or standard deviations. Categories of such measures are described by Fahrmeir and colleagues [31] or Hand and colleagues [38], among others. Properties of aggregate operators are investigated by Detyniecki [29]. In our work, we focus on aggregate functions with close relationships to SQL as mentioned above, but also on computational complexity, as investigated by Körnig [57] and further discussed in Chapter 5.
Aggregate functions are widely applied within KDD and related areas, as exemplified in the following. During data preparation, analysts often investigate statistical properties such as histograms of nominal attributes, in order to make decisions about which attributes to use, for instance.
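A histogram of a nominal attribute, for example, amounts to counting each of its values. The following sketch reports absolute and relative frequencies; the function name and the sample values are hypothetical.

```python
from collections import Counter


def histogram(values):
    """Absolute and relative frequency of each value, most frequent first."""
    counts = Counter(values)
    total = len(values)
    return [(value, count, count / total)
            for value, count in counts.most_common()]


# Hypothetical nominal attribute of four objects.
colours = ["red", "blue", "red", "red"]
hist = histogram(colours)
```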
Outlier detection and missing value replacement often rely on aggregate functions as well. Tools for these steps of data preparation can be found in many KDD environments. Aggregate functions may also be used to integrate [117] or compress [45] data.
Last but not least, domain experts often apply aggregate functions when manually transforming multi-relational data into inputs for conventional data mining systems.
In data warehousing and online analytical processing (OLAP), aggregate functions are also typical. Here, users investigate large volumes of data by the interactive use of special operators for navigation, which often involve the computation
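As one simple illustration of both uses, the sketch below replaces missing values by the mean of the observed values and flags values that lie more than k standard deviations from that mean. Both rules, the threshold k and the representation of missing values as `None` are our assumptions; many other schemes exist.

```python
import statistics


def impute_mean(values):
    """Replace missing values (None) by the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]


def flag_outliers(values, k):
    """Return observed values farther than k standard deviations from the mean."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    spread = statistics.pstdev(observed)
    return [v for v in observed if abs(v - mean) > k * spread]


# Hypothetical numeric attribute with one missing and one extreme value.
raw = [10.0, 11.0, 9.0, 10.0, 50.0, None]
completed = impute_mean(raw)
suspicious = flag_outliers(raw, k=1.5)
```

Note that the extreme value itself inflates both the mean and the standard deviation, so in practice outliers are often handled before missing values are replaced, or more robust aggregates such as the median are used.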