Knowledge Discovery in Database - A data mining approach to improve the automated quality of da

Knowledge discovery in databases (KDD) is a continuum of steps that depend on each other interactively and iteratively. In real-life applications, extracting useful knowledge to support business needs is a difficult task. The rapid growth in data size and technologies, together with the ability to store and access different data types (structured data, web text, images and videos), increases the necessity of adopting the KDD process shown in Figure 2.5, taken from Fayyad et al. [1996].

2.3.1 Knowledge Discovery Processes

The knowledge discovery process is generally divided into two stages: pre-processing and post-processing. The pre-processing stage includes data cleaning, data selection and data transformation, whereas the post-processing stage consists of data mining, pattern evaluation and knowledge representation. The steps in both stages are closely related to each other. Figure 2.5 shows the phases of the knowledge discovery process, which are briefly described in the following points:

• Data Cleaning: This step concerns data quality in the database and the data warehouse. Data must be checked and cleaned before it moves forward in the KDD process. Many quality problems are handled at this stage, including outliers or noisy data, missing fields and inaccurate data Fayyad et al. [1996].

• Data Selection: This phase is very useful for reducing the dimensionality of the dataset. In the data selection stage, users select the features that are useful for representing the data. The choice of features depends on the goal of the data mining task.

• Data Transformation: In this stage, the data is transformed and consolidated according to the specified data mining tasks. Transformation methods include normalisation, aggregation, generalisation and attribute redesign.

• Data Mining: This stage refers to the data mining tasks that users adopt in a given KDD project. A number of data mining tasks may be involved: pattern summarisation, classification, clustering and association rule mining. Depending on the task, a number of techniques and algorithms can be used to identify patterns in the data. This usually produces a huge number of patterns, many of which are meaningless.

• Pattern Evaluation (interpretation): Data mining tasks often produce an overwhelming number of meaningless patterns. Users need to evaluate and interpret these patterns to identify the interesting ones that are relevant to the targeted application.

• Knowledge Representation: After locating interesting patterns, users need to encapsulate these patterns as knowledge. This knowledge can be incorporated and represented by users or by the system so that it can be applied to unseen data.
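The three pre-processing steps listed above can be illustrated with a minimal sketch. It is an assumption-laden example rather than a method taken from the cited works: the function names (`clean`, `select`, `normalise`), the record layout and the choice of min-max normalisation as the transformation are all hypothetical and serve only to make the steps concrete.

```python
# Illustrative sketch only: names and data are hypothetical, not from the source.

def clean(records, required):
    """Data cleaning: drop records with a missing (None) required field."""
    return [r for r in records if all(r.get(f) is not None for f in required)]

def select(records, features):
    """Data selection: keep only the features relevant to the mining goal."""
    return [{f: r[f] for f in features} for r in records]

def normalise(records, feature):
    """Data transformation: min-max normalisation of one feature to [0, 1]."""
    values = [r[feature] for r in records]
    lo, hi = min(values), max(values)
    span = hi - lo
    # A constant feature carries no information, so map it to 0.0.
    return [
        {**r, feature: ((r[feature] - lo) / span) if span else 0.0}
        for r in records
    ]

raw = [
    {"id": 1, "age": 20, "income": 1000, "note": "a"},
    {"id": 2, "age": None, "income": 3000, "note": "b"},  # missing field
    {"id": 3, "age": 40, "income": 5000, "note": "c"},
    {"id": 4, "age": 30, "income": 3000, "note": "d"},
]

features = ["age", "income"]
prepared = select(clean(raw, features), features)
for f in features:
    prepared = normalise(prepared, f)

print(prepared)
```

The record with the missing `age` is removed during cleaning, the `id` and `note` fields are dropped during selection, and the remaining numeric features are rescaled so that they are comparable in later mining steps.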

2.3.2 Data Mining Tasks

Data mining is defined as a process involving the extraction of useful and interesting information from the underlying data Han and Kamber [2001]. Depending on the specific application, users can deploy a single data mining task or combine more than one data mining task in order to extract useful and interesting information. Data mining tasks can be described as follows:

• Pattern summarisation: The main problem in data mining is that the total number of patterns is very large. Even after filtering to keep only the patterns whose frequency exceeds the specified minimum threshold, the number of patterns remains huge. Manual examination of the patterns by domain experts is therefore difficult to achieve. For this reason, it is essential to adopt pattern summarisation methods, such as the profile-based approach presented in Yan et al. [2005], which allow a significant reduction in the number of patterns.

• Classification: a supervised data mining technique. It aims to correctly assign instances, each described by a set of features, to one of a set of classes. The function or model learned from the features and classes in the training data can then be used to predict the classes for new data in the testing set. The accuracy of the model is determined by how correctly it assigns sets of features or objects to their classes Han and Kamber [2001].

• Clustering: an unsupervised data mining technique. In clustering, instances are divided and grouped into a number of clusters based on the resemblance between instances, so that instances belonging to the same cluster share many characteristics. In the classic K-means technique, the user first specifies the desired number of clusters, K. Then, based on the ordinary Euclidean distance metric, each instance is assigned to the closest cluster Han and Kamber [2001].

• Association rule mining: one of the most powerful data mining techniques. Association rule mining was first presented in Agrawal et al. [1993] for mining frequent itemsets in transaction databases, and has since been developed for mining frequent itemsets at multiple levels Han and Fu [1995, 1999], intertransactional itemsets Feng et al. [2002]; Tung et al. [2003], and correlations between itemsets Shichao et al. [2006]; Tsumoto and Hirano [2003]. Association rule mining includes two phases. The first phase, called pattern mining, involves the discovery of frequent patterns. The second phase, called rule generation, involves the discovery of interesting and useful association rules among the discovered patterns. An association rule is useful for measuring the association between itemsets.
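The two phases of association rule mining described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm of Agrawal et al. [1993]: phase one uses brute-force enumeration rather than an optimised candidate-generation scheme, and the function names, the thresholds and the toy transactions are hypothetical. Support is taken here as the fraction of transactions containing an itemset, and confidence of a rule A -> B as support(A and B together) divided by support(A).

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Phase 1 (pattern mining): map each frequent itemset to its support.

    Brute-force enumeration, suitable only for tiny illustrative inputs.
    """
    items = sorted(set().union(*transactions))
    n = len(transactions)
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            cset = frozenset(candidate)
            support = sum(cset <= t for t in transactions) / n
            if support >= min_support:
                result[cset] = support
                found = True
        if not found:
            break  # no frequent itemset of this size, so none larger either
    return result

def generate_rules(itemsets, min_confidence):
    """Phase 2 (rule generation): derive rules A -> B from frequent itemsets."""
    rules = []
    for itemset, support in itemsets.items():
        for size in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), size):
                a = frozenset(antecedent)
                # Every subset of a frequent itemset is frequent, so a is present.
                confidence = support / itemsets[a]
                if confidence >= min_confidence:
                    rules.append((set(a), set(itemset - a), confidence))
    return rules

# Hypothetical toy transaction database.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

itemsets = frequent_itemsets(transactions, min_support=0.5)
rules = generate_rules(itemsets, min_confidence=0.8)
for lhs, rhs, conf in rules:
    print(sorted(lhs), "->", sorted(rhs), f"(confidence {conf:.2f})")
```

With these toy thresholds, five itemsets are frequent but only one rule (butter implies bread) passes the confidence threshold, which illustrates how rule generation narrows the discovered patterns down to the stronger associations.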

In document A data mining approach to improve the automated quality of data (Page 47-51)