The KDD Process - Website boundary detection via machine learning

The phrase Knowledge Discovery in Databases (KDD) was first used in [150]. KDD is described as “the process of finding useful information and patterns in data” [77]. KDD is a complete process involving a set of specific steps, and various models may be used to describe it. The most notable is the process-centred view model, shown in figure 2.3, which will be used to provide an overview of the KDD process with respect to this thesis. Other KDD models include: the Johns process model [102], the Human-centred process model [42] and the CRISP-DM process model [51].

Each of the steps of the KDD process shown in figure 2.3 is described in more detail below. The discussion is primarily founded on the review of KDD presented in [71, 95], but with additional information garnered from [78, 79, 151, 42, 124]. With reference to figure 2.3, the first thing to note is that KDD can be viewed as an iterative process: when following each step the user can return to any of the previous steps if needed. The flow of information in the process-centred view model is indicated by the direction of the arrows in figure 2.3.

Figure 2.3: The Process-Centred view model of the KDD process [77]

The initial step of the KDD process, not shown in Figure 2.3, is to develop an understanding of the application domain [78]. Understanding the task that the user wants to accomplish is very important; this can be even more significant when the data is to be imported from multiple or complex data sources [151]. As the problem is investigated the goals of the data discovery task will emerge [42]. To understand and learn the application domain, the goals of the application and relevant prior knowledge need to be identified. For example, to apply the KDD process to some data on the WWW, the goals of doing so should be understood. These could be user traffic analysis from server logs, or analysis of the hyperlink structure to identify authorities or hubs. The data from this domain must also be considered; this includes its format and any particular techniques needed to process or gather it. Once the application domain is understood the KDD process can commence. Thus:

1. Selection. In this step we gather the “target data” needed for the knowledge mining [151]. This could involve the task of selecting the sources which are needed for the desired data discovery. The data may come from many different origins such as databases, flat files, the WWW and non-electronic sources [71]. The selection step may also include the identification of a target data set by focusing on a subset of the different sources of data [78, 79].

2. Pre-processing. The data gathered in the previous step may be noisy. For example, the data may include incorrect or missing values [95, 71], or anomalies resulting from the use of different data sources, different data types and different metrics [95]. A strategy has to be adopted to deal with these problems [78, 79]: erroneous data can be corrected or removed [95], and missing data can be supplied or predicted. Care must be taken during this step as “occasionally what looks like an anomaly to be scrubbed away is in reality the most crucial indicator of an interesting domain phenomenon” [42].

3. Transformation. The data that has been cleaned during the previous step typically needs to be converted to a common or more usable format [95]. Methods for “data reduction” may be used to reduce the dimensionality of data [95, 78, 79]. The transformation step can also involve finding useful features in the data to best represent it according to the intended goals of the KDD task [78, 79].

4. Data Mining. The data mining step is where the actual knowledge discovery takes place. The typical aim is to extract useful patterns from the given data set using some particular data mining algorithm(s). This step thus commences with the identification of the most appropriate data mining algorithm given the desired KDD task, which is then applied to the pre-processed data generated by the foregoing steps.

5. Interpretation/Evaluation. Testing and verification of the discovered knowledge is important so that the end user can gain confidence in the generated results. Different evaluation techniques are applicable to different types of data mining (KDD). How the patterns that are found in the data mining step are presented to the user is very important [71]. The usefulness of patterns, and the new knowledge that they represent, depends heavily on the interpretation placed on them by the end user. To assist interpretation, various visualization techniques have been proposed, founded on the use of graphs, charts and GUIs [71].

6. Knowledge. This last step comprises the integration of the discovered knowledge with existing domain knowledge [151]. Methods for integration include simple documentation of the knowledge discovered, or specific action taken as a direct result.
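The six steps above can be sketched as a small pipeline. The following is a minimal illustration only, not the method used in this thesis: the toy records, the source names, the function names, and the choice of record removal, min-max scaling, a two-cluster k-means and a purity score are all assumptions introduced for the example.

```python
from collections import Counter
from math import dist

# Hypothetical toy data: (url, page size in KB, number of outgoing links).
# None marks a missing value. All values are invented for illustration.
RAW_SOURCES = {
    "crawl_a": [("a.org/1", 12.0, 4), ("a.org/2", None, 6), ("a.org/3", 11.0, 5)],
    "crawl_b": [("b.org/1", 250.0, 40), ("b.org/2", 240.0, None), ("b.org/3", 260.0, 42)],
    "logs_c": [("c.org/1", 1.0, 1)],
}

def select(sources, wanted):
    """Step 1 (Selection): gather the target data from the chosen sources."""
    return [rec for name in wanted for rec in sources[name]]

def preprocess(records):
    """Step 2 (Pre-processing): here, simply remove records with missing values."""
    return [rec for rec in records if None not in rec]

def transform(records):
    """Step 3 (Transformation): min-max scale each numeric column to [0, 1]."""
    columns = list(zip(*[rec[1:] for rec in records]))
    lows = [min(col) for col in columns]
    spans = [(max(col) - min(col)) or 1.0 for col in columns]  # avoid division by zero
    return [tuple((v - lo) / sp for v, lo, sp in zip(rec[1:], lows, spans))
            for rec in records]

def mine(points, iterations=10):
    """Step 4 (Data Mining): a tiny two-cluster k-means with a deterministic start."""
    first = points[0]
    centroids = [first, max(points, key=lambda p: dist(p, first))]
    labels = []
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        labels = [min((0, 1), key=lambda c: dist(p, centroids[c])) for p in points]
        # Move each centroid to the mean of its members.
        for c in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels

def evaluate(records, labels):
    """Step 5 (Interpretation/Evaluation): cluster purity with respect to the host name."""
    hosts = [rec[0].split("/")[0] for rec in records]
    correct = 0
    for c in set(labels):
        in_cluster = [h for h, lab in zip(hosts, labels) if lab == c]
        correct += Counter(in_cluster).most_common(1)[0][1]
    return correct / len(records)

def integrate(records, labels):
    """Step 6 (Knowledge): document the result as a cluster-to-URLs mapping."""
    knowledge = {}
    for rec, lab in zip(records, labels):
        knowledge.setdefault(lab, []).append(rec[0])
    return knowledge

target = select(RAW_SOURCES, ["crawl_a", "crawl_b"])
clean = preprocess(target)
features = transform(clean)
labels = mine(features)
purity = evaluate(clean, labels)
knowledge = integrate(clean, labels)
```

With this toy data the two records containing missing values are removed during pre-processing, and the remaining four pages are grouped into one cluster per host. In a real task each function would be replaced by whatever technique suits the data and the KDD goal.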

As mentioned previously, the KDD process is iterative; this means that it can require interaction with the user between steps, thus incorporating flexibility into the model [42]. There is no single KDD solution that fits all KDD tasks. The diversity of KDD goals, and their related complexity in terms of data types and knowledge interpretation, requires a tailored approach [53]. KDD solutions are typically constructed and developed for a particular type of problem or data set [53]. A discussion of how the KDD process may be applied to the WBD problem is presented in chapter 3.3.2.
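The iterative nature of the process can be illustrated with a short loop in which a failed evaluation sends the user back to the pre-processing step with a different setting. This is a hedged sketch under stated assumptions: the readings, the candidate limits, the acceptance rule and all function names are hypothetical and chosen only to make the feedback loop visible.

```python
from statistics import mean, pstdev

# Hypothetical response times in milliseconds; the last value is treated
# here as a logging error. The data are invented for illustration.
readings = [102, 98, 101, 99, 100, 5000]

def preprocess(values, max_allowed):
    """Pre-processing: discard values above a chosen limit."""
    return [v for v in values if v <= max_allowed]

def mine(values):
    """Data mining stand-in: summarise the data by its mean."""
    return mean(values)

def evaluate(values, pattern):
    """Evaluation: accept the pattern only if the spread is under 10% of it."""
    return pstdev(values) / pattern < 0.1

def kdd_loop(values, candidate_limits):
    """Try each pre-processing setting in turn until the evaluation passes."""
    for limit in candidate_limits:
        cleaned = preprocess(values, limit)
        pattern = mine(cleaned)
        if evaluate(cleaned, pattern):
            return limit, pattern
    return None  # no setting gave an acceptable result

result = kdd_loop(readings, [10000, 1000])
```

Here the first limit keeps the extreme value, the evaluation fails, and the loop returns to pre-processing with the stricter limit. In practice the decision to revisit a step would normally involve the user, and an apparent anomaly should not be discarded without checking whether it is meaningful.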

In this section the KDD process has been described. In the following section the specific discipline of web mining will be considered.
