Combination Approaches - Incremental knowledge based system for schema mapping

4.4 Methodology

4.4.2 Combination Approaches

The combination approaches - CPR based RDR and machine learning are described below:

i. CPR based RDR Approach

In order to validate a large number of cases, rules of RDR have been modied as ordered CPR (Kim et al., 2012) where censored conditions are added to create censor rules. In the CPR based RDR approach, a rule is red as long as none of its censor rules are red. This approach is shown in Figure 4.1.

This approach is described below according to Kim et al. (2012):

KB of CPR based RDR: The KB of CPR based RDR is designed as an n-ary tree. An example of KB is described in Figure 4.2.

Figure 4.1: CPR based RDR approach

Each node of the tree is a rule and each rule consists of IF [condition] THEN [conclusion] UNLESS [censor-condition]. A conclusion is a classication or an inter- pretation. The value of the conclusion part of a rule can be NULL. This rule called stop/censor/exception rule is added to stop the classication of the currently red rule. Many rules in CPR based RDR are added as censor rules of other rules to classify the incorrectly classied cases as NULL classications. Other rules called alternative rules are added to make these NULL classications as correct classications.

In Figure 4.2, the KB is initially empty. R0 (rule 0) is dened as a root rule. It is an entry point of the inference process, and it is always true. It performs default checking for checking the validity of the case processed. The rst level rules are denoted by R1, R2, R3 and R4, where R2, R3 and R4 are called the alternative rules of R1. The censored rules are denoted by C1, C2, C3, C4, C5 and C6. These censor rules are added as censor nodes/children rules to the KB, and the alternative rules are added as parent rules to the KB. The rules are created using attributes, operators (<,>,==,!=,<=,>=) and values.

Inference of CPR based RDR: The inference process is based on searching the KB represented as a decision list with each decision possibly rened again by another decision list (Kang et al., 1995). Once a rule is satised by any case, this process evaluates whether or not the censor conditions are matched to the given case. If any censor rule is not satised, the process then stops with one path and one conclusion. However, if any censor rule is satised, other rules below the rule that was satised at the top level is evaluated. This process stops when none of the rules can be satised by the case in hand.

Knowledge Acquisition: Knowledge acquisition is a process that transfers knowledge from human experts to Knowledge-Based Systems (KBSs) (Kang et al., 1995). Knowledge acquisition of CPR based RDR is required to handle the incorrect or missing classications. The knowledge acquisition process has been divided into three parts. Firstly, the system acquires the correct classications from the expert. Secondly, the system decides on the new rules' locations. Thirdly, the system acquires new rules from the expert and adds them to correct the KB. If the current KB suggests incorrect classication, it is necessary to add a censor rule for making the classication NULL.

If the current KB suggests no classication for any case, a new rule is added as an alternative rule. This rule is added as a child rule of the root node of the KB. The incremental knowledge acquisition algorithm (Compton and Cao, 2006) is described in Algorithm 1:

Algorithm 1: Algorithm for Incremental Knowledge Acquisition 1. Start with an empty KB;

2. Accept a new data case;

3. Evaluate the case against the KB ;

4. If the result is not correct, an expert is consulted to rene the KB;

5. If the overall performance of the KB is satisfactory, then terminate, otherwise go to Step 2;

Cornerstone Case (CB): The cases used to create rules are called cornerstone cases, and these cases are used in consequent knowledge acquisition (Compton and Jansen, 1990). The only dierence between conventional RDR and CPR based RDR in managing cornerstone cases is that the CPR based RDR approach maintains non- conforming cases as well as conforming cases for creating a new rule while the conventional RDR approach only maintains conforming cases for a new rule.

ii. Machine Learning Approaches

In this research, I use 11 machine learning approaches: decision trees such as J48, Ran- dom Forest, REPTree, BFTree, ADTree, Functional Tree, SimpleCart, and NBTree, rules such as Decision Table, meta such as AdaBootM1, and Bayesian Network, Naive Bayes. These are supervised classication approaches. These approaches input a col- lection of records (training set) where each record contains a set of attributes, and one of the attributes is the class. The purpose of training a dataset is to build up a model for the class attribute as a function of the values of other attributes. A test set is used to determine the accuracy of the model.

• J48. J48 classier is a simple C4.5 (Quinlan, 1993) decision tree for classication.

It divides a dataset into smaller subsets and creates a binary decision tree with decision nodes and leaf nodes. A decision node has more than one branch, and

the top most decision node corresponds to the best predictor called the root node. A leaf node represents a classication or a decision. For a given dataset, one or more decision rules are created that describe the relationships between inputs and targets. These rules can predict the value of new or unseen cases if the cases match with the inputs.

• Random Forest. Random Forest (Breiman, 2001) consists of many individual

trees to operate quickly over large datasets. In order to build each tree in the forest, the forest can be varied by using random samples. A tree is constructed as (Pater, 2005): (1) bagging takes samples from datasets with replacement for a training set and selects a small amount of data to be used for tree construction, (2) a random number of attributes is chosen from the bagging data and the one with the most information gain is selected to comprise each node, (3) all the nodes of the tree are traversed until no more nodes can be created due to information loss, and (4) bagging error is estimated by running a dataset through tree and measuring its correctness.

• ADTree. ADTree (Alternating decision tree) (Freund and Mason, 1999) is a

generalization of decision trees, voted decision trees and voted decision stumps. It consists of decision nodes and prediction nodes, where a decision node species a predicate condition and a prediction node contains a real-valued number. It always has prediction nodes as both root and leaves. In ADTree, an instance denes a set of paths. The instance is classied by following all paths for which all decision nodes are true and by summing any prediction nodes that are traversed.

• REPTree. REPTree (Reduced-Error Pruning tree) is a fast tree learner that

uses information gain/variance reduction for building a decision or regression tree and prunes it using reduced-error pruning (Witten and Frank, 2005). Like C4.5, it can split instances into pieces in order to handle missing values. In this approach, it is possible to set some parameters such as the minimum number of instances per leaf, maximum tree depth, minimum proportion of training set variance for a split (numeric classes only), and a number of folds for pruning.

• SimpleCart. SimpleCart is a decision tree learner for classication that uses

minimal cost complexity pruning strategy of CART (classication and regression tree) (Loh, 2011). For classication, it is possible to set some parameters such

as the minimum number of instances per leaf, the percentage of training data used to construct the tree, and the number of cross-validation folds used in the pruning procedure (Witten and Frank, 2005).

• NBTree. NBTree (Kohavi, 1996) is a hybrid classication approach that com-

bines decision trees and Naive Bayes. It creates trees with leaf nodes that are Naive Bayes classiers. It selects a threshold-like decision tree using a standard entropy minimization technique. It uses cross-validation for estimating accuracy using Naive Bayes.

• FT. FT (Functional Tree) (Gama, 2004) is a supervised approach for classica-

tion and regression problems. For prediction problems, it uses functional nodes as a bias reduction process and functional leaves as a variance reduction method. It uses linear functions both at the decision nodes and leaves, that facilitate large datasets.

• BFTree. BFTree (Bloom Filter Tree) (Athanassoulis and Ailamaki, 2014) is

produced by employing probabilistic data structures to trade accuracy for size. It consists of a root node and an internal node. For oering competitive search performance, it exploits pre-existing data ordering or partitioning.

• Decision Table. Decision table represents a conditional probability table, and

it stores input data based on a selected set of attributes (Hall and Frank, 2008) where each element is associated with a class probability. In learning decision table, attributes are selected based on the maximum performance of cross- validation. In cross-validation, the structure of a decision table does not change if some instances are inserted or deleted, only the class is changed that is associated with elements (Hall and Frank, 2008).

• Naive Bayes. The Naive Bayes approach is a simple probabilistic classier

applied to a classication task (Sahami et al., 1998). It uses probability based on Bayes Theorem, and inputs a node Fi for each of the feature and a class variable

C. It assumes all attributes to be independent given a value of a class variable.

• AdaBoost. AdaBoost (Freund and Schapire, 1996) is a boosting approach that

can reduce an error of any learning approach that generates classiers whose performance is better compared to random guessing. It constructs an initial

classier from the original dataset where every sample has an equal distribution ratio of 1. The performance of boosting is better than bagging, and boosting can be used with very simple rules to construct classiers that are similar to the C4.5 algorithm.

In document Incremental knowledge based system for schema mapping (Page 99-105)