The Process of Conducting Rough Sets Analysis

4. Reduct Generation

Reduct generation is where the approximations of sets are constructed. As mentioned before, a rough set can be represented by a pair of crisp sets, called the lower and the upper approximation. The lower approximation consists of all objects that certainly belong to the set, and the upper approximation contains all objects that possibly belong to the set.
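The two approximations can be illustrated with a minimal sketch. It assumes that objects are grouped into indiscernibility classes by their attribute values; the function name `approximations`, the toy table and the target set `flu` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def approximations(table, attributes, target):
    """Return (lower, upper) approximations of `target` with respect to `attributes`.

    table  : dict mapping object id -> dict of attribute values
    target : set of object ids (the set to approximate)
    """
    # Objects with identical values on `attributes` are indiscernible.
    classes = defaultdict(set)
    for obj, row in table.items():
        classes[tuple(row[a] for a in attributes)].add(obj)

    lower, upper = set(), set()
    for block in classes.values():
        if block <= target:   # whole class inside the target: certainly belongs
            lower |= block
        if block & target:    # class overlaps the target: possibly belongs
            upper |= block
    return lower, upper

# Hypothetical toy data: o1 and o2 are indiscernible but only o1 is in the target.
table = {
    "o1": {"temp": "high", "cough": "yes"},
    "o2": {"temp": "high", "cough": "yes"},
    "o3": {"temp": "low", "cough": "no"},
    "o4": {"temp": "high", "cough": "no"},
}
flu = {"o1", "o4"}
lower, upper = approximations(table, ["temp", "cough"], flu)
print(sorted(lower), sorted(upper))
```

Here o4 lands in the lower approximation, while o1 and o2 appear only in the upper approximation because they cannot be told apart.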

An important issue in rough set analysis is attribute reduction, which is performed in such a way that the reduced set of attributes provides the same quality of classification as the complete set. The input to a Reducer algorithm is a decision table, and a set of reducts is returned. The returned reduct set may have a set of rules attached to it as a child. The two main types of discernibility methods are:

• Full: Computes reducts relative to the system as a whole, i.e., minimal attribute subsets that preserve our ability to discern all relevant objects from each other.

• Object-based: Computes reducts relative to a fixed object, i.e., minimal attribute subsets that preserve our ability to discern that object from the other relevant objects.
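The difference between the two discernibility types can be sketched by brute force on a tiny table. This is an illustrative sketch only: it assumes every object counts as "relevant", ignores the decision attribute, and uses hypothetical helper names (`discerning_sets`, `minimal_hitting_sets`). It enumerates all attribute subsets, so it is suitable only for very small attribute sets.

```python
from itertools import combinations

def discerning_sets(table, attributes, anchor=None):
    """For each relevant object pair, collect the attributes that tell the two apart.

    anchor=None : all pairs (full discernibility)
    anchor=obj  : only pairs involving `obj` (object-based discernibility)
    """
    objs = sorted(table)
    if anchor is None:
        pairs = combinations(objs, 2)
    else:
        pairs = ((anchor, o) for o in objs if o != anchor)
    sets = []
    for a, b in pairs:
        diff = frozenset(x for x in attributes if table[a][x] != table[b][x])
        if diff:  # identical objects cannot be discerned by any attribute
            sets.append(diff)
    return sets

def minimal_hitting_sets(sets, attributes):
    """Enumerate minimal attribute subsets that intersect every set in `sets`."""
    found = []
    for size in range(len(attributes) + 1):  # smallest candidates first
        for cand in combinations(attributes, size):
            c = set(cand)
            hits_all = all(c & s for s in sets)
            if hits_all and not any(f <= c for f in found):
                found.append(c)
    return found

# Hypothetical toy table with three attributes.
table = {
    "o1": {"a": 1, "b": 0, "c": 0},
    "o2": {"a": 1, "b": 1, "c": 0},
    "o3": {"a": 0, "b": 1, "c": 1},
}
attrs = ["a", "b", "c"]
full_reducts = minimal_hitting_sets(discerning_sets(table, attrs), attrs)
o3_reducts = minimal_hitting_sets(discerning_sets(table, attrs, anchor="o3"), attrs)
print(full_reducts, o3_reducts)
```

In this toy table two attributes are needed to discern all objects from each other, whereas a single attribute is enough to discern o3 from the rest.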

Modulo Decision: A table can be interpreted either as a decision system or as a general Pawlak information system. If the option to compute reducts modulo the decision attribute is checked, the table is interpreted as a decision system. In a decision system, two objects that are indiscernible with respect to attributes A may belong to different decision classes. The decision system is said to be inconsistent with respect to A if this is the case, and consistent otherwise. If the decision system contains inconsistencies, boundary region thinning should be considered; for consistent systems, there are no boundary regions to thin. With boundary region thinning, we look at the distribution of decision values within each indiscernibility set, and exclude from the generalized decision those decision values that occur with a frequency below some threshold. Low-probability decision values are thus treated as “noise”.
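Boundary region thinning can be sketched as follows. This is a minimal illustration, assuming that frequency means the relative frequency of a decision value within its indiscernibility set; the function name `generalized_decisions`, the `threshold` parameter and the toy rows are hypothetical.

```python
from collections import Counter, defaultdict

def generalized_decisions(rows, condition_attrs, decision_attr, threshold=0.0):
    """Map each indiscernibility set to its (optionally thinned) generalized decision.

    Decision values whose relative frequency inside an indiscernibility set
    is below `threshold` are dropped, i.e. treated as noise.
    """
    groups = defaultdict(Counter)
    for row in rows:
        key = tuple(row[a] for a in condition_attrs)
        groups[key][row[decision_attr]] += 1

    result = {}
    for key, counts in groups.items():
        total = sum(counts.values())
        result[key] = {d for d, n in counts.items() if n / total >= threshold}
    return result

# Hypothetical data: ten "sunny" objects, one of which has a deviating decision.
rows = (
    [{"outlook": "sunny", "play": "yes"}] * 9
    + [{"outlook": "sunny", "play": "no"}]
    + [{"outlook": "rainy", "play": "no"}] * 2
)
raw = generalized_decisions(rows, ["outlook"], "play")
thinned = generalized_decisions(rows, ["outlook"], "play", threshold=0.2)

# The system is inconsistent if some indiscernibility set has several decisions.
inconsistent = any(len(d) > 1 for d in raw.values())
print(raw, thinned, inconsistent)
```

With a threshold of 0.2, the single deviating "no" among the sunny objects (relative frequency 0.1) is removed from the generalized decision.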

Use of IDG: If the algorithm supports it, a set of IDGs can be supplied that enables the notion of discernibility to be overloaded on a per attribute basis. If no IDG file is specified, strict inequality is used.

• Genetic algorithm: Implements a genetic algorithm for computing minimal hitting sets. The algorithm has support for both cost information and approximate solutions. Each reduct in the returned reduct set has a support count associated with it. The support count is a measure of the “strength” of the reduct.

• Johnson’s algorithm: Invokes a variation of a simple greedy algorithm to compute a single reduct only. The algorithm has a natural bias towards finding a single prime implicant of minimal length. For example, let S = {{cat, dog, fish}, {cat, man}, {dog, man}, {cat, fish}} and, for simplicity, let w be the constant function that assigns 1 to every set in S. Step 2 in the algorithm then amounts to selecting the attribute that occurs in the most sets in S. Initially, B = ∅. Since cat is the most frequently occurring attribute in S, we update B to include cat. We then remove all sets from S that contain cat, and obtain S = {{dog, man}}. Repeating the process, we arrive at a tie in the occurrence counts of dog and man, and arbitrarily select dog. We add dog to B, and remove all sets from S that contain dog. Now, S = ∅, so we’re done. Our computed answer is thus B = {cat, dog}.

• Holte’s 1R: Returns all singleton attribute sets. The set of all 1R rules, i.e., univariate decision rules, is indirectly returned as a child of the returned set of singleton reducts.

• Manual reducer: Enables the user to manually specify an attribute subset that can be used as a reduct in subsequent computations.

• Dynamic reducts (RSES): A number of subtables are randomly sampled from the input table, and proper reducts are computed from each of these using some algorithm. The reducts that occur the most often across subtables are in some sense the most “stable”.

• Exhaustive calculation (RSES): Computes all reducts by brute force. No support is provided for IDGs, boundary region thinning or approximate solutions. This algorithm does not scale up well, and is only suitable for tables of moderate size. Computing all reducts is NP-hard.

• Johnson’s algorithm (RSES): Invokes the RSES implementation of the greedy algorithm of Johnson.

• Genetic algorithm (RSES): Implements a variation of the algorithm described by Wróblewski. Uses a genetic algorithm to search for reducts, either until the search space is exhausted or until a given maximum number of reducts have been found. One of three predefined parameter settings can be chosen to control the thoroughness and speed of the genetic search procedure. No support is provided for IDGs, boundary region thinning or approximate solutions.
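The greedy selection walked through in the cat/dog example for Johnson’s algorithm can be sketched as below. This is an illustrative sketch, not the implementation used by any particular tool: it assumes unit weights (w assigns 1 to every set), and it breaks ties alphabetically, whereas the text breaks them arbitrarily. The function name `johnson_greedy` is hypothetical.

```python
from collections import Counter

def johnson_greedy(sets):
    """Greedily pick attributes until every set contains a picked attribute."""
    # Empty sets can never be hit, so they are skipped (an assumption of this sketch).
    remaining = [set(s) for s in sets if s]
    chosen = []
    while remaining:
        # Count in how many remaining sets each attribute occurs.
        counts = Counter(a for s in remaining for a in s)
        best = max(counts.values())
        # Assumed tie-break: alphabetical order among the most frequent attributes.
        pick = min(a for a, n in counts.items() if n == best)
        chosen.append(pick)
        # Remove every set that the picked attribute already hits.
        remaining = [s for s in remaining if pick not in s]
    return chosen

S = [{"cat", "dog", "fish"}, {"cat", "man"}, {"dog", "man"}, {"cat", "fish"}]
B = johnson_greedy(S)
print(B)
```

On this input the sketch first picks cat (three occurrences), leaving only {dog, man}, and then resolves the dog/man tie in favour of dog, matching the answer B = {cat, dog} from the example.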

Filter Reducts: The next step is to use a set of algorithms that remove elements from reduct sets according to different evaluation criteria. Unless explicitly stated otherwise, algorithms in this family modify their input directly. The following are examples of algorithms used for this purpose:

• Basic filtering: Removes individual reducts from a reduct set. Possible removal criteria include the length of the reduct or the support for that reduct.

• Cost filtering: Removes reducts from a reduct set according to their “cost”. The function cost specifies the cost of attribute subset B. If a reduct’s cost exceeds some specified threshold, the reduct is scheduled for removal.

• Performance filtering: Each reduct in the reduct set is evaluated according to the classificatory performance of the rules generated from that reduct alone. The reduct is removed if the performance score does not exceed a specified threshold.
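Basic and cost filtering can be sketched as simple in-place list filters. The representation below is an assumption made for illustration: each reduct is a dictionary with an attribute set and a support count, and the cost of an attribute subset is taken to be the sum of hypothetical per-attribute costs. The function names and the example values are likewise hypothetical.

```python
def basic_filter(reducts, max_length=None, min_support=None):
    """Remove reducts that are too long or too weakly supported (modifies the list)."""
    def keep(r):
        if max_length is not None and len(r["attrs"]) > max_length:
            return False
        if min_support is not None and r["support"] < min_support:
            return False
        return True
    reducts[:] = [r for r in reducts if keep(r)]

def cost_filter(reducts, attribute_cost, threshold):
    """Remove reducts whose cost exceeds `threshold` (modifies the list)."""
    def cost(attrs):
        # Assumed cost function: sum of the costs of the individual attributes.
        return sum(attribute_cost[a] for a in attrs)
    reducts[:] = [r for r in reducts if cost(r["attrs"]) <= threshold]

# Hypothetical reduct set and attribute costs.
reducts = [
    {"attrs": frozenset({"age"}), "support": 40},
    {"attrs": frozenset({"age", "mri"}), "support": 75},
    {"attrs": frozenset({"bp", "ecg", "mri"}), "support": 90},
]
attribute_cost = {"age": 1, "bp": 2, "ecg": 5, "mri": 50}

basic_filter(reducts, max_length=2)                   # drops the three-attribute reduct
cost_filter(reducts, attribute_cost, threshold=10)    # drops the reduct that needs "mri"
remaining = [sorted(r["attrs"]) for r in reducts]
print(remaining)
```

Both filters use slice assignment so that the caller’s list is modified directly, mirroring the statement that algorithms in this family modify their input.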

The following process is performed for each reduct.