Ensembles of Classifiers Based on Rough Sets Theory and Set-oriented Database Operations

Xiaohua Hu

Abstract: In this paper we present a new approach to constructing a good ensemble of classifiers for data mining applications, based on rough set theory and database set operations. We borrow the main ideas of rough set theory and redefine them in terms of database theory to take advantage of very efficient set-oriented database operations. Our method first computes a set of reducts that together include all the attributes necessary for the decision categories. For each reduct, a reduct table is generated by removing the attributes that are not in the reduct. Next, a novel rule induction algorithm computes the maximal generalized rules for each reduct table, and a set of reduct classifiers is formed from the corresponding reducts. Our rule induction algorithm adopts a “conquer-without-separating” strategy to generate a set of globally best rules from the data set. The experimental results indicate that the rough set based approach is very promising for ensembles of classifiers.

Index Terms: rough set theory, ensemble of classifiers, database

I. INTRODUCTION

The procedure of constructing an ensemble of classifiers generates a set of classifiers, instead of a single classifier, for the classification of new objects, in the hope that combining the answers of multiple classifiers results in better performance. Much previous research [1,3,10,18,31,32] has focused on two fundamental questions: first, how should each distribution be chosen on each round, and second, how should the weak rules be combined into a single rule? Many methods for constructing ensembles of classifiers have been developed, such as bagging and boosting; some are general and some are specific to particular algorithms [14]. Boosting builds a series of classifiers, where each classifier in the series pays more attention to the examples misclassified by its predecessor. Bagging creates bootstrap data sets by sampling from the training data with replacement, so each bootstrapped data set has some of the original objects repeated and some of them missing. The learning algorithm is run several times, each time with a different subset of the training examples. This technique works especially well for unstable learning algorithms such as decision trees and neural networks. Another technique for generating multiple classifiers is to manipulate the set of input features available to the learning algorithm. Other methods, such as manipulating the output targets, injecting randomness, and algorithm-specific methods for generating ensembles, have also been studied [3,32]. Most studies of such methods use a substantial number of classifiers: Freund and Schapire [11] use 100 classifiers, and [32] extends this to 1000.

This work is supported partially by NSF CCF 0514679, the NSF Career grant (IIS-0448023) and PA Dept of Health Grants (#239667).

Xiaohua Hu is with the College of Information Science & Technology of Drexel University, Philadelphia, PA 19104 (e-mail: thu@ cis.drexel.edu).

Theoretically, a “weak” learning algorithm that performs just slightly better than random guessing can be “boosted” into an arbitrarily accurate “strong” learning algorithm [3,32]. However, in many real data mining applications, with millions of records and data sizes in the range of hundreds of gigabytes or even terabytes, practical challenges must be addressed before this approach can be applied. Running an ensemble with many classifiers is infeasible because it is too time consuming and requires a great deal of CPU time and memory on the huge data sets that are typical of such applications. Boosting will also overfit if run for too many rounds [3,32]. We therefore need to construct an ensemble with a limited number of effective classifiers to improve classification accuracy. To do that, some important questions need to be addressed, such as “how many classifiers are enough to improve the accuracy?” and “how can we generate classifiers that are less correlated with each other?” Too many classifiers take too much CPU time and memory and hurt the comprehensibility of the rule sets; too few classifiers do not improve classification accuracy. A similar phenomenon is observed in [11,28]. This also seems plausible in real life: adding a novice to a team of experts is probably counterproductive, and adding an expert whose knowledge is too similar to that of other members only gives more weight to the existing experts. The focus of this paper is to explore a theoretical model that explains the mechanism of multiple classifiers and to efficiently construct a good ensemble of classifiers for data mining applications. We propose a rough set approach that constructs multiple classifiers by using database operations. Our approach manipulates the set of input features, but with significant advantages over previous methods.

The rest of the paper is organized as follows. In Section 2 we give an overview of rough set theory with some examples and redefine the main concepts and methods of rough sets based on database set operations. In Section 3 we describe an algorithm to generate multiple reducts from the data set. In Section 4 we present a novel rule induction algorithm to construct a reduct classifier for each reduct, and in Section 5 we compare our ensemble of classifiers, REDEnsemble, with Bagged C4.5 and Boosted C4.5. We conclude with a discussion of our method and future research plans in Section 6.

II. ROUGH SET THEORY

A. Overview of rough set theory

In rough set theory, the data is collected in a table called a decision table (or simply a table, in database terms). We assume that our data set is stored in a relational table of the form Table(condition-attributes, decision-attributes). C denotes the condition attributes and D the decision attributes, with C ∩ D = ∅; tj denotes the jth tuple. (It is trivial to rearrange the order of the attributes with a SELECT statement without changing the semantic meaning of the database tuples.)

Rough set theory defines three regions based on the equivalence classes induced by the attribute values: the lower approximation, the upper approximation and the boundary. The lower approximation contains all the objects that can be classified with certainty based on the data collected. The upper approximation contains all the objects that can possibly be classified, while the boundary is the difference between the upper approximation and the lower approximation. Below we give the formal definitions.

Fig. 1 A Rough Set Model

Suppose T = {C, D} is a database table. We say that two tuples ti and tj are in the same equivalence class induced by attributes S (where S is a subset of C or D) if ti(S) = tj(S); that is, the tuples in the same equivalence class have the same value for every attribute in S. Let [D] = {D1, …, Dk} denote the equivalence classes induced by D and, for any A ⊆ C, let [A] = {A1, …, Am} denote the equivalence classes induced by A. (Each Dj or Ai is called an equivalence class or elementary set.)

Definition 1: For a set Dj, the lower approximation Lower[A]/Dj of Dj under A ⊆ C is the union of all those equivalence classes Ai that are contained in Dj:

Lower[A]/Dj = ∪{Ai | Ai ⊆ Dj, i = 1, …, m}

Any object ti ∈ Lower[A]/Dj can be classified with certainty to Dj. Over all decision classes,

Lower[A]/[D] = ∪{Lower[A]/Dj | Dj ∈ [D], j = 1, …, k}

Definition 2: For a set Dj, the upper approximation Upper[A]/Dj of Dj under A ⊆ C is the union of those equivalence classes Ai that have a non-empty intersection with Dj:

Upper[A]/Dj = ∪{Ai | Ai ∩ Dj ≠ ∅, i = 1, …, m}

Any object ti ∈ Upper[A]/Dj can possibly be classified to Dj. Over all decision classes,

Upper[A]/[D] = ∪{Upper[A]/Dj | Dj ∈ [D], j = 1, …, k}

Definition 3: The boundary is Boundary[A]/[D] = Upper[A]/[D] − Lower[A]/[D].

Example 1: Suppose we have a collection of 8 cars (t1 to t8) with information about the Door, Size, Cylinder and Mileage. Door, Size and Cylinder are the condition attributes and Mileage is the decision attribute. (The Tuple_id is included only for explanation purposes.)

TABLE I
8 CARS WITH {DOOR, SIZE, CYLINDER, MILEAGE}

TUPLE_ID  DOOR  SIZE     CYLINDER  MILEAGE
T1        2     COMPACT  4         HIGH
T2        4     SUB      6         LOW
T3        4     COMPACT  4         HIGH
T4        2     COMPACT  6         LOW
T5        4     COMPACT  4         LOW
T6        4     COMPACT  4         HIGH
T7        4     SUB      6         LOW
T8        2     SUB      6         LOW

[Mileage] = {[Mileage=high], [Mileage=low]}

[Mileage=low] = {t2, t4, t5, t7, t8}

[Mileage=high] = {t1, t3, t6}

[Door Size Cylinder] = {{t1}, {t2, t7}, {t3, t5, t6}, {t4}, {t8}}

Lower[Door Size Cylinder]/[Mileage] = {t1, t2, t4, t7, t8}

Upper[Door Size Cylinder]/[Mileage] = {t1, t2, t3, t4, t5, t6, t7, t8}

Boundary[Door Size Cylinder]/[Mileage] = {t3, t5, t6}
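The approximations of Example 1 can be reproduced with a few lines of set manipulation. The following is a minimal Python sketch, not part of the original method; the names `CARS`, `equivalence_classes` and `approximations` are illustrative assumptions, and the data is copied from Table 1.

```python
from collections import defaultdict

# Table 1 from the text: tuple id -> (Door, Size, Cylinder, Mileage)
COLUMNS = ("Door", "Size", "Cylinder", "Mileage")
CARS = {
    "t1": ("2", "compact", "4", "high"),
    "t2": ("4", "sub", "6", "low"),
    "t3": ("4", "compact", "4", "high"),
    "t4": ("2", "compact", "6", "low"),
    "t5": ("4", "compact", "4", "low"),
    "t6": ("4", "compact", "4", "high"),
    "t7": ("4", "sub", "6", "low"),
    "t8": ("2", "sub", "6", "low"),
}


def equivalence_classes(table, attrs):
    """Group tuple ids that share the same values on attrs."""
    idx = [COLUMNS.index(a) for a in attrs]
    groups = defaultdict(set)
    for tid, row in table.items():
        groups[tuple(row[i] for i in idx)].add(tid)
    return list(groups.values())


def approximations(table, cond_attrs, dec_attrs):
    """Return (lower, upper, boundary) over all decision classes."""
    cond_classes = equivalence_classes(table, cond_attrs)
    dec_classes = equivalence_classes(table, dec_attrs)
    lower, upper = set(), set()
    for d in dec_classes:
        for a in cond_classes:
            if a <= d:      # Ai is contained in Dj
                lower |= a
            if a & d:       # Ai intersects Dj
                upper |= a
    return lower, upper, upper - lower


lower, upper, boundary = approximations(
    CARS, ["Door", "Size", "Cylinder"], ["Mileage"]
)
print(sorted(lower))     # ['t1', 't2', 't4', 't7', 't8']
print(sorted(boundary))  # ['t3', 't5', 't6']
```

This in-memory version builds every equivalence class explicitly, which is the cost that the database-based definitions in the next subsection are meant to avoid.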

Five cars, t1, t2, t4, t7 and t8, belong to the lower approximation Lower[Door Size Cylinder]/[Mileage]. This means that the data collected on Door, Size and Cylinder are not complete: they are only good enough to build a classification model for these 5 cars. To classify t3, t5 and t6 (which belong to the boundary region), more information needs to be collected about the cars. Suppose we add the Weight of each car; the new data is presented in Table 2.

TABLE II
8 CARS WITH {WEIGHT, DOOR, SIZE, CYLINDER, MILEAGE}

TUPLE_ID  WEIGHT  DOOR  SIZE     CYLINDER  MILEAGE
T1        LOW     2     COMPACT  4         HIGH
T2        LOW     4     SUB      6         LOW
T3        MED     4     COMPACT  4         HIGH
T4        HIGH    2     COMPACT  6         LOW
T5        HIGH    4     COMPACT  4         LOW
T6        LOW     4     COMPACT  4         HIGH
T7        HIGH    4     SUB      6         LOW
T8        LOW     2     SUB      6         LOW

Based on the new data, we get the lower approximation, upper approximation and boundary as below:

[Door Weight Size Cylinder] = {{t1}, {t2}, {t3}, {t4}, {t5}, {t6}, {t7}, {t8}}

Lower[Door Weight Size Cylinder]/[Mileage] = {t1, t2, t3, t4, t5, t6, t7, t8}

Upper[Door Weight Size Cylinder]/[Mileage] = {t1, t2, t3, t4, t5, t6, t7, t8}

Boundary[Door Weight Size Cylinder]/[Mileage] = ∅

After the Weight information is added, a classification model for all 8 cars can be built. Based on the data itself, rough set theory can tell whether the data is complete or not. If the data is incomplete, it suggests that more information about the objects needs to be collected in order to build a good classification model. On the other hand, if the data is complete, rough set theory can also determine whether the data contains redundant information and find the minimum data needed for a classification model. Furthermore, rough set theory classifies all the attributes into three categories: core attributes, reduct attributes and dispensable attributes. Core attributes carry the essential information needed to classify the data set correctly and should be retained; dispensable attributes are redundant and should be eliminated; reduct attributes lie in between. Depending on the combination of attributes, a reduct attribute is unnecessary in some combinations and essential in others. Below we give the formal definitions.

Definition 4: An attribute Cj ∈ C is a dispensable attribute in C with respect to D if Lower[C]/[D] = Lower[C−Cj]/[D].

In Table 2, Lower[Door Weight Size Cylinder]/[Mileage] = Lower[Weight Size Cylinder]/[Mileage], so Door is a dispensable attribute in C with respect to Mileage.

Definition 5: An attribute Cj ∈ C is a core attribute in C with respect to D if Lower[C]/[D] ≠ Lower[C−Cj]/[D].

In Table 2, Lower[Door Weight Size Cylinder]/[Mileage] ≠ Lower[Door Size Cylinder]/[Mileage], so Weight is a core attribute in C with respect to Mileage.

Definition 6: An attribute Cj ∈ C is a reduct attribute if Cj is part of a reduct.

B. Redefine Rough Set Key Concepts Based on Database Set Operations

A major drawback of rough set theory is its computational inefficiency, which limits its suitability for large data sets. To find the reducts, core and dispensable attributes, rough set methods need to construct all the equivalence classes based on the values of the condition and decision attributes. This is a very time-consuming process that does not scale to the large data sets common in data mining applications.

Our investigation of this inefficiency finds that the rough set model is not integrated with relational database systems: many of the basic operations in these computations are performed on flat files rather than with high-performance database set operations. In view of this, and influenced by [19], we borrow the main ideas of rough set theory and redefine them in terms of database theory to utilize very efficient set-oriented database operations. Almost all the rough set computations used in our method can be performed with database operations such as Count and Projection. (In this paper, we use Card to denote the Count operation and Π the Projection operation.) Our definitions are self-contained and do not rely on knowledge of rough set theory. Below we first give our new definitions of core, dispensable and reduct attributes based on database operations, and then present the algorithm to find the core attributes.

Definition 7: An attribute Cj is a core attribute if it satisfies the condition

Card(Π(C−Cj+D)) ≠ Card(Π(C−Cj)).

In Table 2, Card(Π(Door, Size, Cylinder, Mileage)) = 6 and Card(Π(Door, Size, Cylinder)) = 5, so Weight is a core attribute in C with respect to Mileage.

We can check whether attribute Cj is a core attribute by using SQL operations. We only need to take two projections of the table: one on the attribute set C−Cj+D, and the other on C−Cj. If the two projections have the same cardinality, then no information is lost by removing attribute Cj; otherwise, Cj is a core attribute. More formally, in database terms, the cardinalities of the two projections differ iff there exist at least two tuples tl and tk such that tl.q = tk.q for every q ∈ C−Cj, while tl.Cj ≠ tk.Cj and tl.D ≠ tk.D. In this case, the projection on C−Cj has at least one row fewer than the projection on C−Cj+D, because tl and tk, being identical on C−Cj, are combined in the former projection. In the projection on C−Cj+D, however, tl and tk are still distinguishable. Eliminating attribute Cj therefore loses the ability to distinguish tuples tl and tk. Intuitively, this means that some classification information is lost after Cj is eliminated.

For example, in Table 2, t5 and t6 have the same values on all the condition attributes except Weight; the two tuples can be assigned to their different classes only because they differ on Weight. If Weight is eliminated, then t5 and t6 are indistinguishable, so Weight is a core attribute for the table. The core attributes are an indispensable part of every reduct, so an efficient way to find all the core attributes is very important for computing a reduct, a minimum subset of the attributes. In traditional rough set models, a popular method for finding the core attributes is to construct a decision matrix first and then search it for the entries that contain only one attribute. If an entry in the decision matrix contains only one attribute, that attribute is a core attribute [4,14,26]. The complexity of this method is O(N²), where N is the number of tuples in the table.

This method is very inefficient, and it is not realistic to construct a decision matrix for millions of tuples, which is a typical situation in data mining applications. As an example, the decision matrix for Table 2 is shown in Table 3.

TABLE III
DECISION MATRIX FOR TABLE 2

    T2                      T4                      T5            T7                            T8
T1  DOOR, SIZE, CYLINDER    WEIGHT, CYLINDER        DOOR, WEIGHT  DOOR, WEIGHT, SIZE, CYLINDER  SIZE, CYLINDER
T3  WEIGHT, SIZE, CYLINDER  DOOR, WEIGHT, CYLINDER  WEIGHT        WEIGHT, SIZE, CYLINDER        DOOR, WEIGHT, SIZE, CYLINDER
T6  SIZE, CYLINDER          DOOR, WEIGHT, CYLINDER  WEIGHT        WEIGHT, SIZE, CYLINDER        DOOR, SIZE, CYLINDER

The decision matrix in Table 3 lists, for each pair of objects drawn from the two groups [Mileage=low] = {t2, t4, t5, t7, t8} and [Mileage=high] = {t1, t3, t6}, all the attributes on which the two objects differ. The only attribute that appears alone in an entry is Weight (for the pairs (t3, t5) and (t6, t5)), so the core attribute for the data in Table 2 is Weight.

We propose a new algorithm based on database operations to find the core attributes of a decision table. Compared with the original rough set approach, our algorithm is efficient and scalable.

Algorithm 1: Core Attributes Algorithm
Input: a decision table T(C, D)
Output: Core, the core attributes of table T
Method:
  Set Core = ∅
  For each attribute Ci ∈ C {
    If Card(Π(C−Ci+D)) ≠ Card(Π(C−Ci))
    Then Core = Core ∪ {Ci}
  }
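Algorithm 1 maps directly onto SQL: Card(Π(X)) is a COUNT over a SELECT DISTINCT on the columns X. The following is a minimal sketch using Python's standard `sqlite3` module as a stand-in for a production database; the table name `cars` and the helper names `card` and `core_attributes` are assumptions made for illustration, and the t8 row is taken from Tables 1 and 4.

```python
import sqlite3

CONDITION = ["Weight", "Door", "Size", "Cylinder"]
DECISION = ["Mileage"]
ROWS = [
    ("low", 2, "compact", 4, "high"),   # t1
    ("low", 4, "sub", 6, "low"),        # t2
    ("med", 4, "compact", 4, "high"),   # t3
    ("high", 2, "compact", 6, "low"),   # t4
    ("high", 4, "compact", 4, "low"),   # t5
    ("low", 4, "compact", 4, "high"),   # t6
    ("high", 4, "sub", 6, "low"),       # t7
    ("low", 2, "sub", 6, "low"),        # t8
]


def card(conn, attrs):
    """Card(Π(attrs)): number of distinct rows in the projection onto attrs."""
    cols = ", ".join(attrs)  # attribute names are trusted constants here
    sql = f"SELECT COUNT(*) FROM (SELECT DISTINCT {cols} FROM cars)"
    return conn.execute(sql).fetchone()[0]


def core_attributes(conn, condition, decision):
    """Algorithm 1: keep Ci when Card(Π(C-Ci+D)) != Card(Π(C-Ci))."""
    core = []
    for attr in condition:
        rest = [a for a in condition if a != attr]
        if card(conn, rest + decision) != card(conn, rest):
            core.append(attr)
    return core


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cars (Weight TEXT, Door INTEGER, Size TEXT, "
    "Cylinder INTEGER, Mileage TEXT)"
)
conn.executemany("INSERT INTO cars VALUES (?, ?, ?, ?, ?)", ROWS)

core = core_attributes(conn, CONDITION, DECISION)
print(core)  # ['Weight']
```

Each attribute costs two aggregate queries, so the database engine does all the grouping and no decision matrix is ever materialized.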

Definition 8: An attribute Cj ∈ C is a dispensable attribute with respect to D if the classification result of each tuple is not affected when Cj is not used. In database terms, this means

Card(Π(C−Cj+D)) = Card(Π(C−Cj)).

This definition means that an attribute is dispensable if each tuple can be classified in the same way whether or not the attribute is present. We can check whether attribute Cj is dispensable by using SQL operations. We only need to take two projections of the table: one on the attribute set C−Cj+D, and the other on C−Cj. If the two projections have the same cardinality, then no information is lost by removing attribute Cj; otherwise, Cj is relevant and should be reinstated. For example, in Table 2, Card(Π(Weight, Size, Cylinder, Mileage)) = 6 and Card(Π(Weight, Size, Cylinder)) = 6, so Door is a dispensable attribute in C with respect to Mileage.

Definition 9: The degree of dependency K(REDU, D) between the attributes REDU ⊆ C and the attributes D in decision table T(C, D) is

K(REDU, D) = Card(Π(REDU+D)) / Card(Π(C+D)).

The value K(REDU, D) is the proportion of the tuples in the decision table that can be classified. It characterizes the ability to predict the class D and the complement class ¬D from tuples in the decision table.

Definition 10: A subset of attributes RED (RED ⊆ C) is a reduct of attributes C with respect to D if it is a minimum subset of attributes that has the same classification power as the entire collection of condition attributes:

K(RED, D) = K(C, D)

K(RED, D) ≠ K(R′, D) for all R′ ⊂ RED

For example, Table 2 has two reducts: {Weight, Size} and {Weight, Cylinder} (in the next section we present the algorithm to find a reduct). For each reduct, we can derive a reduct table from the original table. For example, the reduct table in Table 4, based on the reduct {Weight, Size}, is created by projecting Table 2 onto the attributes Weight and Size and the decision attribute Mileage; it still supports a correct classification model. {Weight, Size} is a minimum subset and cannot be reduced further without sacrificing the accuracy of the classification model. Suppose we create another table, Table 5, from Table 4 by removing Size. It can no longer distinguish tuples t1, t6 from tuples t2, t8, because these tuples have the same Weight value but belong to different classes; they were distinguishable in the reduct table of Table 4.

TABLE IV
REDUCT TABLE FOR {WEIGHT, SIZE}

TUPLE_ID  WEIGHT  SIZE     MILEAGE
T1, T6    LOW     COMPACT  HIGH
T2, T8    LOW     SUB      LOW
T3        MED     COMPACT  HIGH
T4, T5    HIGH    COMPACT  LOW
T7        HIGH    SUB      LOW

TABLE V
REDUCT TABLE FOR {WEIGHT}

TUPLE_ID    WEIGHT  MILEAGE
T1, T6      LOW     HIGH
T2, T8      LOW     LOW
T3          MED     HIGH
T4, T5, T7  HIGH    LOW
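Tables 4 and 5 are plain projections with duplicate rows merged, so they are easy to regenerate and check. The sketch below is an illustrative Python version under that reading; `TABLE2`, `reduct_table` and `is_consistent` are hypothetical names, and the t8 row is taken from Tables 1 and 4.

```python
from collections import defaultdict

COLUMNS = ("Weight", "Door", "Size", "Cylinder", "Mileage")
TABLE2 = {
    "t1": ("low", "2", "compact", "4", "high"),
    "t2": ("low", "4", "sub", "6", "low"),
    "t3": ("med", "4", "compact", "4", "high"),
    "t4": ("high", "2", "compact", "6", "low"),
    "t5": ("high", "4", "compact", "4", "low"),
    "t6": ("low", "4", "compact", "4", "high"),
    "t7": ("high", "4", "sub", "6", "low"),
    "t8": ("low", "2", "sub", "6", "low"),
}


def reduct_table(table, attrs, decision="Mileage"):
    """Project onto attrs + decision, merging duplicates and keeping tuple ids."""
    idx = [COLUMNS.index(a) for a in list(attrs) + [decision]]
    rows = defaultdict(list)
    for tid, row in table.items():
        rows[tuple(row[i] for i in idx)].append(tid)
    return dict(rows)


def is_consistent(rtable):
    """True if no two rows share all condition values but differ on the decision."""
    seen = {}
    for row in rtable:
        cond, dec = row[:-1], row[-1]
        if seen.setdefault(cond, dec) != dec:
            return False
    return True


table4 = reduct_table(TABLE2, ["Weight", "Size"])
table5 = reduct_table(TABLE2, ["Weight"])

for row, ids in table4.items():
    print(", ".join(ids), *row)
print(is_consistent(table4), is_consistent(table5))  # True False
```

The inconsistency of the Weight-only table comes from the rows (low, high) and (low, low), which is exactly the t1, t6 versus t2, t8 conflict described in the text.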

Definition 11: The merit value of an attribute Cj in C is defined as

Merit(Cj, C, D) = 1 − Card(Π(C−Cj+D)) / Card(Π(C+D)).

Merit(Cj, C, D) reflects the degree of contribution made by the attribute Cj alone to the dependency between C and D. For example, in Table 2, Card(Π(Door, Size, Cylinder, Mileage)) = 6 and Card(Π(Door, Weight, Size, Cylinder, Mileage)) = 8, so

Merit(Weight, {Door, Weight, Size, Cylinder}, Mileage) = 1 − 6/8 = 0.25.
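The merit value needs only two projection counts per attribute. The following is a small in-memory Python sketch of Definition 11 (sets stand in for the database projections); `ROWS`, `card` and `merit` are illustrative names, and the t8 row is taken from Tables 1 and 4.

```python
COLUMNS = ("Weight", "Door", "Size", "Cylinder", "Mileage")
CONDITION = ["Weight", "Door", "Size", "Cylinder"]
DECISION = ["Mileage"]
ROWS = [
    ("low", "2", "compact", "4", "high"),   # t1
    ("low", "4", "sub", "6", "low"),        # t2
    ("med", "4", "compact", "4", "high"),   # t3
    ("high", "2", "compact", "6", "low"),   # t4
    ("high", "4", "compact", "4", "low"),   # t5
    ("low", "4", "compact", "4", "high"),   # t6
    ("high", "4", "sub", "6", "low"),       # t7
    ("low", "2", "sub", "6", "low"),        # t8
]


def card(rows, attrs):
    """Card(Π(attrs)): number of distinct projected rows."""
    idx = [COLUMNS.index(a) for a in attrs]
    return len({tuple(r[i] for i in idx) for r in rows})


def merit(rows, attr, condition, decision):
    """Merit(Cj, C, D) = 1 - Card(Π(C-Cj+D)) / Card(Π(C+D))."""
    rest = [a for a in condition if a != attr]
    return 1 - card(rows, rest + decision) / card(rows, condition + decision)


merits = {a: merit(ROWS, a, CONDITION, DECISION) for a in CONDITION}
print(merits["Weight"])  # 0.25
```

On this table the sketch also yields 0.25 for Door and 0 for Size and Cylinder.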

III. GENERATING MULTIPLE REDUCTS

Empirical tests indicate that ensembles of classifiers are more helpful if each member is as accurate and reliable as possible and, at the same time, as different as possible from the other classifiers. The concept of multiple classifiers matches the concept of reducts in rough set theory. From a reduct we can generate a reduct table, and from a reduct table we can construct a reduct classifier that consists of the corresponding decision rules. A reduct uses a minimum number of attributes and represents a minimal and complete set of rules to classify the objects in the decision table as perceived from “one angle” (see Figure 2). To classify unseen objects, it is desirable that (1) different reducts use different attributes as much as possible, (2) the union of the reducts includes all the core and reduct attributes, and (3) the number of reducts used for classification is minimal. Here we propose a greedy algorithm to compute a set of reducts that partially satisfies these requirements. Our algorithm is sub-optimal because it cannot guarantee that the number of reducts is kept to a minimum (it may be conjectured that this problem is computationally intractable). Our algorithm starts with the core attributes (CORE); then, through backtracking, a set of reducts is constructed. Each reduct is computed by forward stepwise selection and backward stepwise elimination based on the merit values of the attributes and the dependency between the condition attributes and the decision attributes. The algorithm terminates when the union of the reducts includes all the attributes required for the decision categories.

Fig.2 Diagram of Multiple Reduct Classifiers

A reduct classifier employs only the information necessary to represent the given data set without losing essential information. Our approach is to learn a reliable model for each classifier. We believe that each classifier has a particular subdomain for which it is most accurate, so it is very hard to say which one is better than the others in a real application. Depending on the subdomain, one reduct classifier can be more useful than another. All the reduct classifiers are combined to improve the classification accuracy of the learning algorithm. For more details of the algorithm, please refer to []. For example, for Table 2 we can find two reducts, {Weight, Size} and {Weight, Cylinder}, which together contain all three attributes required for the decision table.
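For a table as small as Table 2, the two reducts can be confirmed by exhaustive search. The sketch below is not the greedy, backtracking algorithm described above, whose details are not given here; it simply enumerates attribute subsets by increasing size. It also assumes, in the spirit of Definition 7, that a subset R preserves the classification when Card(Π(R+D)) = Card(Π(R)), rather than using the degree of dependency K. All names are illustrative, the t8 row is taken from Tables 1 and 4, and the search is exponential in the number of attributes.

```python
from itertools import combinations

COLUMNS = ("Weight", "Door", "Size", "Cylinder", "Mileage")
CONDITION = ["Weight", "Door", "Size", "Cylinder"]
DECISION = ["Mileage"]
ROWS = [
    ("low", "2", "compact", "4", "high"),   # t1
    ("low", "4", "sub", "6", "low"),        # t2
    ("med", "4", "compact", "4", "high"),   # t3
    ("high", "2", "compact", "6", "low"),   # t4
    ("high", "4", "compact", "4", "low"),   # t5
    ("low", "4", "compact", "4", "high"),   # t6
    ("high", "4", "sub", "6", "low"),       # t7
    ("low", "2", "sub", "6", "low"),        # t8
]


def card(rows, attrs):
    """Card(Π(attrs)): number of distinct projected rows."""
    idx = [COLUMNS.index(a) for a in attrs]
    return len({tuple(r[i] for i in idx) for r in rows})


def preserves_classification(rows, attrs):
    """Assumed test: projecting onto attrs never merges tuples of different classes."""
    attrs = list(attrs)
    return card(rows, attrs + DECISION) == card(rows, attrs)


def minimal_reducts(rows, condition):
    """Brute-force search for minimal attribute subsets that pass the test."""
    found = []
    for size in range(1, len(condition) + 1):
        for subset in combinations(condition, size):
            if any(set(r) <= set(subset) for r in found):
                continue  # a smaller reduct is already contained in this subset
            if preserves_classification(rows, subset):
                found.append(subset)
    return found


reducts = minimal_reducts(ROWS, CONDITION)
print(reducts)  # [('Weight', 'Size'), ('Weight', 'Cylinder')]
```

Both subsets contain the core attribute Weight, as expected, and no subset without Weight passes the test.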

IV. MAXIMAL GENERALIZED RULES INDUCTION ALGORITHM

A classification rule is a combination of values of some condition attributes such that the set of all examples matching it is contained in the set of examples labeled with the same class. A rule is denoted as an implication:

R: Ci1 = vi1 ∧ Ci2 = vi2 ∧ … ∧ Cik = vik → D = di

(Below we use cond(r) to denote the left-hand side and dec(r) the right-hand side of the rule r.) Our aim is to produce rules that are maximally generalized, by removing the maximum number of condition attribute values without decreasing the classification accuracy of the rule. Computing such rules is especially important in data mining applications, since they represent the most general patterns existing in the data. Before describing our rule generation algorithm, we introduce two notions: rule redundancy and rule inconsistency.

1) Rule redundancy: (1) if ri and rj are valid rules where cond(ri) = cond(rj) and dec(ri) = dec(rj), then ri and rj are logically equivalent rules; (2) if ri and rj are valid rules where cond(rj) ⊃ cond(ri) and dec(ri) = dec(rj), then rj is logically included in ri.

2) Rule inconsistency: if ri and rj are valid rules where cond(rj) ⊆ cond(ri) and dec(ri) ≠ dec(rj), then ri and rj are inconsistent.

Data classification assigns a set of objects to classes, based on their values on the condition attributes, according to a classification model. A classification model can be a set of classification rules, a neural network, a decision tree, a Bayesian network and so on. Most data mining approaches to the induction of rules fall into two categories: “divide-and-conquer” (the classification tree family) and “separate-and-conquer”. The former recursively partitions the instance space until each remaining small region belongs roughly to a single concept. The latter induces one rule at a time and removes the instances covered by this rule until no more rules can be generated. Such methods suffer from the “splitting problem” [9,13,30]: as the size of the available sample dwindles, classifications are made with less and less statistical support. Statistical anomalies become harder to weed out and noise sensitivity increases. As a result, overfitted and incorrect rules may be generated, which decreases prediction accuracy.

To alleviate the splitting problem, we propose a new method that adopts a “conquer-without-separating” strategy to induce rules from the data set. It can make effective use of statistical measures to combat noise, because each rule is generated by taking the entire data set into consideration in a “specific-to-general” fashion. In our method, we consider a reduct table as a set of specific rules RULES = {r1, r2, …, rn}. Each rule ri corresponds to exactly one tuple in the reduct table. Such rules can be generalized further by dropping conditions. The process by which the maximum number of condition attribute values is removed without losing essential information is called value reduction in rough set theory, and the resulting rule is called a maximal generalized rule. The concept of a maximal generalized rule is similar to that of the kernel rule [30]. Maximal generalized rules minimize the number of rule conditions and are optimal in the sense that their conditions are non-redundant. A condition is dropped from a rule ri, and then ri is checked for consistency with the other rules in the rule set RULES. If ri is inconsistent, the dropped condition is restored. This step is repeated until every condition of ri has been tested. The resulting rule is a maximal generalized rule. After all the rules in RULES have been processed, a maximal generalized rule set is obtained.

The order in which we process the attributes determines which maximal generalized rule is generated. Thus a maximal generalized rule may not turn out to be the best with respect to the conciseness or the coverage of the rule. Given a rule with a conditions, we could evaluate all 2^a − 1 possible subsets of conditions on the database and select the best rule, but this is, in general, impractical. For a near-optimal solution, each condition of the rule is assigned a significance value by an evaluation function before the condition-dropping process starts. The significance value indicates the relevance of the condition for this particular case; higher significance values indicate more relevance. The process should first drop the conditions with lower significance, as described in [9,13,5]. The evaluation function for a condition cik=vik is defined as

SIG(cik=vik) = P(cik=vik) ∗ (P(D=di|cik=vik) − P(D=di)),

where P(cik=vik) is the probability of occurrence of the condition cik=vik, P(D=di|cik=vik) is the conditional probability of the class D=di given the occurrence of the condition cik=vik, and P(D=di) is the proportion of the class D=di in the database.
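As a worked instance of the evaluation function, the sketch below computes SIG for the two conditions of the rule (Weight=low)(Size=compact) → (Mileage=high), estimating each probability as a relative frequency over the 8 tuples of Table 2. This estimation choice, the function name `sig`, and the use of exact fractions are assumptions made for illustration; the t8 row is taken from Tables 1 and 4.

```python
from fractions import Fraction

COLUMNS = ("Weight", "Door", "Size", "Cylinder", "Mileage")
ROWS = [
    ("low", "2", "compact", "4", "high"),   # t1
    ("low", "4", "sub", "6", "low"),        # t2
    ("med", "4", "compact", "4", "high"),   # t3
    ("high", "2", "compact", "6", "low"),   # t4
    ("high", "4", "compact", "4", "low"),   # t5
    ("low", "4", "compact", "4", "high"),   # t6
    ("high", "4", "sub", "6", "low"),       # t7
    ("low", "2", "sub", "6", "low"),        # t8
]


def sig(rows, attr, value, decision_value, decision_attr="Mileage"):
    """SIG(c=v) = P(c=v) * (P(D=d | c=v) - P(D=d)), with frequency estimates."""
    a, d = COLUMNS.index(attr), COLUMNS.index(decision_attr)
    n = len(rows)
    with_cond = [r for r in rows if r[a] == value]
    if not with_cond:
        return Fraction(0)  # condition never occurs, so it carries no weight
    p_cond = Fraction(len(with_cond), n)
    p_class = Fraction(sum(r[d] == decision_value for r in rows), n)
    p_class_given_cond = Fraction(
        sum(r[d] == decision_value for r in with_cond), len(with_cond)
    )
    return p_cond * (p_class_given_cond - p_class)


sig_weight = sig(ROWS, "Weight", "low", "high")
sig_size = sig(ROWS, "Size", "compact", "high")
print(sig_weight, sig_size)  # 1/16 9/64
```

Under these estimates Weight=low has the lower significance, so it would be the first condition tried for removal from this rule.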


To avoid generating highly specific rules, our algorithm DBClass uses the Laplace test, a special case of the m-probability-estimate developed by Cestnik [3], to ensure that the distribution of examples among the classes covered by a rule is significantly different from that which would occur by chance. In this way, many rules covering only a few examples are eliminated, because their apparently high accuracy is likely to be due simply to chance. The Laplace test avoids the undesirable “downward bias” of other measures, and general rules tend to be favored. The Laplace measure is sufficient on its own to bias the search towards general rules with higher predictive accuracy (and thus also high significance). (The effect of the Laplace test on our method is similar to the combined effect of minimum support and minimum confidence in association rule algorithms.)

Laplace = (nc + 1) / (ntotal + k)

where k is the number of classes in the domain, nc is the number of tuples in the predicted class c covered by the rule, and ntotal is the total number of tuples covered by the rule.

For example, consider two rules r1 and r2, where r1 covers 450 tuples of class high and 5 of class low, and r2 covers 5 tuples of class high and 0 of class low.

Here the algorithm should prefer r1, as its accuracy on new test data is likely to be better. r2 covers only a few tuples, and its apparent accuracy of 100% is unlikely to carry over to new test data. For our example, the Laplace accuracy estimates for predicting the class with the most covered tuples are 98.7% for r1 and 85.7% for r2.
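The two estimates follow directly from the Laplace formula with k = 2 classes (high and low). A short Python check, with `laplace` as an illustrative function name:

```python
def laplace(n_class: int, n_total: int, n_classes: int) -> float:
    """Laplace accuracy estimate (nc + 1) / (ntotal + k)."""
    return (n_class + 1) / (n_total + n_classes)


r1 = laplace(450, 455, 2)  # 450 high + 5 low covered, predicting high
r2 = laplace(5, 5, 2)      # 5 high + 0 low covered, predicting high
print(f"{r1:.1%}  {r2:.1%}")  # 98.7%  85.7%
```

That is, r1 scores 451/457 and r2 scores 6/7, so the rule with broad coverage ranks higher even though its raw accuracy is slightly below 100%.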

Algorithm 2: Maximal Generalized Rules (DBClass)
Input: (1) a decision table T(C, D), (2) a reduct REDUCT
Output: a reduct classifier
Method:
  1. Generate the reduct table RULES by projecting T(C, D) onto REDUCT and D
  2. MG_Rules = ∅
  3. Compute the significance value SIG for each condition of the rules in RULES
  4. Simplify each tuple in the reduct table RULES
  5. Calculate the Laplace measure for each tuple
  6. Transform the tuples with a Laplace measure greater than the threshold value into classification rules

Step 4:
  For each tuple ri ∈ RULES Do {
    Sort the set of conditions of the rule ri based on the significance values
    For each condition value (cik = vik) ∈ ri Do {
      Remove (cik = vik) from ri
      If ri is inconsistent with any rule r ∈ RULES
      Then put (cik = vik) back to ri
    }
    Remove any rule r′ ∈ MG_Rules that is logically included in rule ri
    If rule ri is not logically included in a rule r′ ∈ MG_Rules
    Then MG_Rules ∪ ri → MG_Rules
  }
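The condition-dropping loop of Step 4 can be sketched in Python as follows. This is a simplified illustration, not the DBClass implementation: conditions are tried in a fixed attribute order supplied by the caller instead of being sorted by SIG, the Laplace filtering of Steps 5 and 6 is omitted, and the consistency check is done in memory. The names `RULES`, `is_consistent`, `included_in` and `maximal_generalized_rules` are assumptions, and the input is the reduct table for {Weight, Size}.

```python
# Reduct table for {Weight, Size}: one specific rule per row
RULES = [
    ({"Weight": "low", "Size": "compact"}, "high"),   # t1, t6
    ({"Weight": "low", "Size": "sub"}, "low"),        # t2, t8
    ({"Weight": "med", "Size": "compact"}, "high"),   # t3
    ({"Weight": "high", "Size": "compact"}, "low"),   # t4, t5
    ({"Weight": "high", "Size": "sub"}, "low"),       # t7
]


def is_consistent(conds, decision, rules):
    """No original rule matches conds while carrying a different decision."""
    for other_conds, other_decision in rules:
        matched = all(other_conds.get(a) == v for a, v in conds.items())
        if matched and other_decision != decision:
            return False
    return True


def included_in(a, b):
    """Rule a is logically included in b: same decision, cond(b) within cond(a)."""
    return a[1] == b[1] and b[0].items() <= a[0].items()


def maximal_generalized_rules(rules, order):
    mg_rules = []
    for conds, decision in rules:
        current = dict(conds)
        for attr in order:
            if attr not in current:
                continue
            value = current.pop(attr)          # tentatively drop the condition
            if not is_consistent(current, decision, rules):
                current[attr] = value          # restore it on inconsistency
        candidate = (current, decision)
        mg_rules = [r for r in mg_rules if not included_in(r, candidate)]
        if not any(included_in(candidate, r) for r in mg_rules):
            mg_rules.append(candidate)
    return mg_rules


mg_rules = maximal_generalized_rules(RULES, order=["Weight", "Size"])
for conds, decision in mg_rules:
    print(conds, "->", decision)
```

On this reduct table the loop leaves four rules: (Weight=low)(Size=compact) → high, (Weight=med) → high, (Weight=high) → low and (Size=sub) → low. For this small table the same rule set results if Size is tried before Weight.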

Suppose there are N tuples (rules) with A attributes. The computation of the significance values requires O(AN) and the process of dropping conditions on the rules requires O(AN log N), so finding all maximal generalized rules for N tuples is O(AN log N).

V. ENSEMBLE OF CLASSIFIERS

Our approach uses reducts to construct an ensemble of classifiers. We first construct a set of reducts that contains all the essential attributes. Then, using the induction algorithm of Section 4, we construct a reduct classifier for each reduct from the corresponding reduct table. A reduct classifier is a set of maximal generalized classification rules without any redundant attributes or attribute values. A reduct classifier corresponding to a reduct is a minimal set of classification rules that is fully covered by the attributes of the reduct; that is, all the condition attributes used by the decision rules are also attributes of the reduct table. Using different reducts of a decision table, we can derive different reduct classifiers and thus construct an ensemble of classifiers. Below is the algorithm.

TABLE VI
REDUCT TABLE FOR {WEIGHT, CYLINDER}

TUPLE_ID  WEIGHT  CYLINDER  MILEAGE
T1, T6    LOW     4         HIGH
T2, T8    LOW     6         LOW
T3        MED     4         HIGH
T4, T7    HIGH    6         LOW
T5        HIGH    4         LOW

The reduct classifier for the reduct {Weight, Size} is:

1. (Weight=low)(Size=compact) → (Mileage=high)
2. (Weight=med) → (Mileage=high)
3. (Size=sub) → (Mileage=low)
4. (Weight=high) → (Mileage=low)

The reduct classifier for the reduct {Weight, Cylinder} is:

1. (Weight=low)(Cylinder=4) → (Mileage=high)
2. (Weight=med) → (Mileage=high)
3. (Cylinder=6) → (Mileage=low)
4. (Weight=high) → (Mileage=low)

To evaluate the classification accuracy of REDEnsemble, we ran REDEnsemble, C4.5, Bagged C4.5 and Boosted C4.5 on data sets from the UCI machine learning repository [25]. All the numeric attributes were discretized using the algorithm DBChiMerge [9]. If the number of reducts generated for a data set by Algorithm 2 is greater than 10, we choose the first 10 reduct classifiers. The results are shown in Table 7. Fifteen data sets from the UCI repository were chosen, and ten complete 10-fold cross-validations were carried out for each data set. Each column’s average is shown at the bottom of the table. REDEnsemble achieves a significant improvement in prediction accuracy over C4.5, and its accuracy is also higher than that of Bagged C4.5 and Boosted C4.5. In our tests, the final classifier REDEnsemble is formed by equal voting.
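Equal voting over reduct classifiers can be illustrated with the two example classifiers for the car data. The sketch below is only one plausible reading of the scheme: the first matching rule of each classifier casts one vote, a classifier with no matching rule abstains, and ties go to the class that received its first vote earliest. These details, and the names `classify` and `vote`, are assumptions and are not specified in the text.

```python
from collections import Counter

# Each reduct classifier is a list of (conditions, decision) rules
WEIGHT_SIZE = [
    ({"Weight": "low", "Size": "compact"}, "high"),
    ({"Weight": "med"}, "high"),
    ({"Size": "sub"}, "low"),
    ({"Weight": "high"}, "low"),
]
WEIGHT_CYLINDER = [
    ({"Weight": "low", "Cylinder": "4"}, "high"),
    ({"Weight": "med"}, "high"),
    ({"Cylinder": "6"}, "low"),
    ({"Weight": "high"}, "low"),
]


def classify(rules, obj):
    """Return the decision of the first matching rule, or None if none fires."""
    for conds, decision in rules:
        if all(obj.get(a) == v for a, v in conds.items()):
            return decision
    return None


def vote(classifiers, obj):
    """Equal voting: every classifier whose rules fire casts one vote."""
    decisions = (classify(c, obj) for c in classifiers)
    votes = Counter(d for d in decisions if d is not None)
    return votes.most_common(1)[0][0] if votes else None


ensemble = [WEIGHT_SIZE, WEIGHT_CYLINDER]
car = {"Weight": "low", "Door": "4", "Size": "compact", "Cylinder": "4"}
print(vote(ensemble, car))  # high
```

Because each classifier only looks at the attributes of its own reduct, an object with a missing or unusual value on one attribute can still be classified by the other member of the ensemble.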

TABLE VII

VI. CONCLUSION

We present a database operation based rough set approach for constructing an ensemble of classifiers. Most rough set based systems are not integrated with database systems: computationally intensive operations such as generating the core and reducts and inducing rules are performed on flat files, which limits their applicability to the large data sets found in data mining applications. We borrow the main ideas of rough set theory and redefine them based on database theory to take advantage of very efficient set-oriented database operations. We propose a novel context-sensitive measure for feature ranking and present a new set of algorithms for computing the core, reducts and rules based on our new database-based rough set model. Almost all the operations used in generating the core, reducts, etc. in our method can be performed using database set operations such as Count and Projection. Because our approach is designed around database set operations, it is efficient and scalable compared with the traditional rough set based data mining approach.

There are other open problems in ensembles of classifiers, such as how to understand and interpret the decision made by an ensemble, because an ensemble provides little insight into how it makes its decision. For learning tasks such as data mining applications, comprehensibility is crucial, yet voting methods normally result in incomprehensible classifiers that cannot easily be understood by end users. These are research topics we are currently working on, and we hope to report our findings in the near future.

REFERENCES

[1] Breiman L., Arcing Classifiers, The Annals of Statistics, 26(3), 801-849

[2] Cestnik B., Estimating Probabilities: A Crucial Task in Machine Learning, European Conference on Artificial Intelligence, 1990

[3] Clark P., Niblett T., The CN2 Induction Algorithm, Machine Learning, 3:261-283

[4] Dietterich T.G., An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting and

[5] Freund Y., Schapire R., Experiments with a New Boosting Algorithm, Proc. of the ICML-1996

[6] Hu X., Construction of An Ensemble of Classifiers based on Rough Sets Theory and Database Operations, Proc. of the 2001 IEEE International Conference on Data Mining (ICDM2001)

[7] Joshi M., Agarwal R., Kumar V., Predicting Rare Classes: Can Boosting Make Any Weak Learner Strong?, Proc. of the 9th SIGKDD 2002

[8] Krieger A., Wyner A., Long C., Boosting Noisy Data, Proc. of the ICML 2002

[9] Lin T.Y., Cercone N., Rough Sets and Data Mining: Analysis of Imprecise Data, Kluwer Academic Publishers, 1997

[10] Lin T.Y., Yao Y.Y., Zadeh L. (eds), Data Mining, Rough Sets and Granular Computing, Physica-Verlag, 2002

[11] Michalski R.S., A Theory and Methodology of Inductive Learning, Machine Learning: An Artificial Intelligence Approach, Vol. 1, Morgan Kaufmann,

[12] Murphy P.M., Aha D.W., UCI Repository of Machine Learning Databases, http://www.ics.uci.edu/~mlearn/MLRepository.html, 1996

[13] Pawlak Z., Rough Sets, International Journal of Information and Computer Science, 11(5), 1982

[14] Pawlak Z., Rough Sets: Theoretical Aspects of Reasoning about Data, Kluwer Academic Publishers, 1992

[15] Quinlan J.R., Bagging, Boosting and C4.5, Proc. of the AAAI-96

[16] Rymon R., An SE-Tree Based Characterization of the Induction Problem, Proc. of ICML 1993

[17] Schapire R., The Boosting Approach to Machine Learning: An Overview, MSRI Workshop on Nonlinear Estimation and Classification, 2002

[18] Schapire R.E., Freund Y., Bartlett P., Explanation for the Effectiveness of Voting Methods, Machine Learning, 1998

[19] Ziarko W., Variable Precision Rough Set Model, Journal of Computer and System Sciences, Vol. 46, No. 1, 1993