2.2 Accounting information system and data mining background
2.2.6 Data mining tasks and functionalities
Several DM problem types or analysis tasks are typically encountered during a DM project. Depending on the desired outcome, several data analysis techniques with different goals may be applied successively to achieve the desired result. In general, DM tasks can be grouped into two categories: descriptive and predictive. Descriptive mining tasks characterise the general properties of the data in the database. Predictive mining tasks perform inference on the current data in order to make predictions (Han & Kamber 2006).
Based on the different mining tasks, DM functionalities (methods) can be categorised as classification, clustering, regression, association rules, sequence discovery, prediction and so on (Dunham 2006). DM functionalities are used to specify the kinds of patterns to be found in DM tasks (Han & Kamber 2006). DM functionalities are shown in Figure 2.3.
Figure 2.3: Data mining functionalities, adapted from Dunham (2006)
According to Berry & Linoff (2004), the basic DM functionalities are: classification, estimation, prediction, affinity grouping or association rules, clustering, and description and visualisation. The first three are examples of directed DM, where the aim is to find the value of a particular target variable. Affinity grouping and clustering are undirected tasks, where the aim is to uncover structure in data without respect to a particular target variable. Description, or profiling, is a descriptive task that may be either directed or undirected.
Classification (Supervised learning): Classification maps data into predefined groups or classes. Classification algorithms require that the classes be defined based on data attribute values. They often describe these classes by looking at the characteristics of data that are already known to belong to the classes. Classification techniques include: decision trees (CART, C4.5); Bayesian classification, which consists of two types, naive Bayesian classification and Bayesian belief networks; neural networks; support vector machines; associative classification; and lazy learners, or learning from your neighbours (k-nearest neighbour classifiers and case-based reasoning). Other classification methods include genetic algorithms, the rough set approach and the fuzzy set approach (Berry & Linoff 2004). Examples of classification are: classifying credit applicants as low, medium, or high risk, choosing content to be presented on a Web page and determining which phone numbers correspond to fax machines.
In these examples, there is a limited number of classes, and any record is expected to be assignable to one or another of them (Berry & Linoff 2004).
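The credit-risk example above can be illustrated with a k-nearest neighbour classifier, one of the lazy learners listed earlier. The following is only a minimal sketch: the training records, the two attributes (income and existing debt, both in thousands) and the function name `knn_classify` are hypothetical and do not come from the cited sources.

```python
import math
from collections import Counter

# Hypothetical labelled records: (income, existing debt) -> known risk class.
TRAINING = [
    ((80.0, 5.0), "low"),
    ((75.0, 10.0), "low"),
    ((90.0, 8.0), "low"),
    ((50.0, 20.0), "medium"),
    ((55.0, 25.0), "medium"),
    ((45.0, 22.0), "medium"),
    ((20.0, 40.0), "high"),
    ((25.0, 45.0), "high"),
    ((30.0, 50.0), "high"),
]


def knn_classify(record, training=TRAINING, k=3):
    """Assign `record` to the majority class among its k nearest training records."""
    # Sort the labelled records by Euclidean distance to the new record.
    by_distance = sorted(training, key=lambda item: math.dist(item[0], record))
    # Count the class labels of the k closest records and return the most common.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    print(knn_classify((85.0, 6.0)))
    print(knn_classify((22.0, 42.0)))
```

Because the classes are predefined and the training records already carry a class label, this is a directed (supervised) task: every new applicant is assigned to exactly one of the three known classes.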
Estimation: Estimation deals with continuously valued outcomes. Given some input data, estimation is used to assign a value to some unknown continuous variable such as income, height, credit balance or donation amount. Classification and estimation are often used together, as when DM is used to predict who is likely to respond to the fundraising campaigns of a charity organisation and to estimate the amount of money donated by each supporter (Berry & Linoff 2004). Examples of estimation tasks (Berry & Linoff 2004) are: estimating the number of children in a family, estimating a family’s total household income and estimating the lifetime value of a customer.
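As an illustration of estimating a continuous value such as a donation amount, the sketch below fits a straight line by ordinary least squares. Linear regression is used here only as one possible estimation technique; the income and donation figures and the function names `fit_line` and `estimate` are invented for the example.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (unnormalised).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate(x, slope, intercept):
    """Estimate the continuous target value for input x."""
    return slope * x + intercept


# Hypothetical supporters: household income (thousands) and past donation amount.
INCOMES = [20.0, 40.0, 60.0, 80.0]
DONATIONS = [10.0, 20.0, 30.0, 40.0]

if __name__ == "__main__":
    slope, intercept = fit_line(INCOMES, DONATIONS)
    print(estimate(50.0, slope, intercept))
```

Unlike classification, the output is not one of a limited number of classes but a value on a continuous scale.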
Prediction: Many real-world DM applications can be seen as predicting future data states based on past and current data. Prediction is viewed as a type of classification. The difference is that prediction concerns a future state rather than a current state. In fact, the difference is one of emphasis, since in predictive tasks the records are classified according to some predicted future behaviour or estimated future value. With prediction, the only way to check the accuracy of the classification or the estimation is to apply the model and then evaluate whether it performed as desired. For example, if the predictive task was to identify the customers who will respond to the next marketing campaign and buy the new product, the only effective way to evaluate the performance of the model is to wait until after the campaign and count how many of the target customers actually bought the product. Prediction applications include speech recognition, machine learning and pattern recognition (Berry & Linoff 2004). Examples of prediction tasks (Berry & Linoff 2004) are: predicting which customers will leave within the next 12 months; and predicting which telephone subscribers will order a new service such as voice mail.
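The marketing campaign example can be sketched as follows. A simple threshold rule is learned from past campaign data, applied to current customers, and evaluated only once the actual purchases are known. The threshold rule, the attribute (months since last purchase), the customer identifiers and all function names are assumptions made for illustration, not a method described in the cited sources.

```python
def fit_threshold(history):
    """Choose the recency threshold that best separates past responders.

    `history` holds (months_since_last_purchase, responded) pairs from a
    previous campaign. A customer is predicted to respond if recency <= threshold.
    """
    candidates = sorted({recency for recency, _ in history})

    def accuracy(threshold):
        hits = sum((recency <= threshold) == responded for recency, responded in history)
        return hits / len(history)

    return max(candidates, key=accuracy)


def predict_responders(customers, threshold):
    """Return the ids of current customers predicted to respond to the next campaign."""
    return {cid for cid, recency in customers.items() if recency <= threshold}


def hit_rate(predicted, actual_buyers):
    """Share of targeted customers who actually bought; only known after the campaign."""
    return len(predicted & actual_buyers) / len(predicted) if predicted else 0.0


# Hypothetical past campaign and current customer base.
HISTORY = [(1, True), (2, True), (3, True), (6, False), (8, False), (12, False)]
CUSTOMERS = {"A": 1, "B": 2, "C": 5, "D": 3, "E": 10}

if __name__ == "__main__":
    threshold = fit_threshold(HISTORY)
    targeted = predict_responders(CUSTOMERS, threshold)
    # The set of actual buyers becomes available only after the campaign has run.
    actual_buyers = {"A", "C", "D"}
    print(sorted(targeted), hit_rate(targeted, actual_buyers))
```

The last step makes the point of the paragraph concrete: `hit_rate` cannot be computed at the time the prediction is made, because `actual_buyers` does not yet exist.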
Affinity grouping or association rules: Association rule mining is alternatively referred to as affinity analysis. An association rule is a model used to identify specific types of data association. Association rules are commonly used in the retail sales community to identify items that are often purchased together. The task of affinity grouping is to determine which things go together (e.g., what usually goes together in a shopping cart at the supermarket). Affinity grouping can also be used to identify cross-selling opportunities and to design attractive packages or groupings of products and services (Berry & Linoff 2004).
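A minimal sketch of the shopping cart example is given below. It counts how often pairs of items occur together and reports two standard rule measures, support (the share of baskets containing both items) and confidence (the share of baskets containing the first item that also contain the second). These measures are not defined in the text above and are introduced here as an assumption; the baskets and the function name `pair_rules` are invented.

```python
from collections import Counter
from itertools import combinations

# Hypothetical supermarket baskets.
BASKETS = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]


def pair_rules(baskets, min_support=0.4):
    """Return {(antecedent, consequent): (support, confidence)} for item pairs."""
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in basket)
    # Sorting each basket gives every pair a canonical (alphabetical) order.
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # the pair is too rare to be of interest
        rules[(a, b)] = (support, count / item_counts[a])  # rule a -> b
        rules[(b, a)] = (support, count / item_counts[b])  # rule b -> a
    return rules


if __name__ == "__main__":
    for rule, (support, confidence) in sorted(pair_rules(BASKETS).items()):
        print(rule, support, confidence)
```

In this toy data, every basket containing butter also contains bread, which is the kind of pattern that could suggest a cross-selling opportunity.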
Clustering (Unsupervised learning): Clustering is the task of segmenting a diverse group into a number of more similar sub-groups or clusters. What distinguishes clustering from classification is that clustering does not rely on predefined classes, examples, or target concepts. Clustering analyses data without consulting a known class label. In general, the class labels are not present in the training data simply because they are not known to begin with. Clustering can be used to generate such labels. The objects are clustered or grouped based on the principle of maximising the intra-class similarity and minimising the inter-class similarity. That is, clusters of objects are created so that objects within a cluster are highly similar to one another, but very dissimilar to objects in other clusters. Each cluster that is created can be viewed as a class of objects from which rules can be derived. Clustering is often done as a prelude to some other form of DM or modelling. Clustering techniques are: partitioning (K-means and K-medians), hierarchical, density-based and model-based (Berry & Linoff 2004).
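The K-means partitioning technique mentioned above can be sketched as follows. This is a simplified illustration, not a reference implementation: the initial centroids are simply the first k points (an assumption made to keep the example deterministic), and the data points and the function name `k_means` are invented.

```python
import math


def k_means(points, k, iterations=20):
    """Group `points` into k clusters; return (centroids, clusters)."""
    # Assumption: the first k points serve as the initial centroids.
    centroids = [points[i] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments are stable, so the algorithm has converged
        centroids = new_centroids
    return centroids, clusters


# Hypothetical unlabelled records with two numeric attributes.
POINTS = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 1.5), (9.5, 8.0)]

if __name__ == "__main__":
    centroids, clusters = k_means(POINTS, k=2)
    print(centroids)
    print(clusters)
```

No class labels are supplied to the function. The two groups emerge from the distances between the records alone, and the resulting cluster index could afterwards serve as a generated label.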
Description and visualisation: Sometimes the purpose of DM is simply to describe what is going on in a complex database, in a way that increases our understanding of the people, the products or the processes that produced the data in the first place. A good enough description of a behaviour will often suggest an explanation for it as well or, at least, where to start looking for it (Berry & Linoff 2004).