Chapter 2 Literature Review
2.5 Filters for Feature Selection
2.5.1 Basic Consistency Measure
The earliest idea of consistency based feature selection was quite simple: find out a subset of features that is consistent with the target feature. FOCUS (Almuallim and Dietterich, 1991) was first introduced by Almuallim and Dietterich in 1991 and then improved it to FOCUS2 in (Almuallim and Dietterich, 1994). This algorithm stops the search in the first set of features that fully consistent with the class label. Although it can guarantee finding the smallest consistent feature subset, there are also some limitations. First, FOCUS can only be applied on discrete features. Discretization algorithms must be applied on continuous features before measuring. Several improvements of FOCUS are proposed, such as CFOCUS and F (Arauzo et al., 2003), to adapt to continuous features. Second, noise data may greatly hinder the performance of FOCUS. The algorithm has to add another feature just because of one inconsistent data, and that may be redundant. Another limitation is the low searching speed, as a random searching strategy cannot guide the search.
Two consistency based feature selection are widely used. One is created by Liu, Motoda and Dash (Liu et al., 1998) as Consistency based Feature Selection. This is a feature subset selection method based on the idea of FOCUS but with more efficient searching strategy. It is a feature subset selection method designed for nominal feature selection.
Another one is Rough Set Feature Selection (RSFS). It is a group of filters based on rough set theory that is also based on the concept of consistency.
2.5.1.1 LVF Consistency Based Feature Selection
In (Liu et al., 1998) Liu, Motoda and Dash proposed a probabilistic approach β Las Vegas Algorithm for feature subset selection (LVF). LVF makes probabilistic choices to help guide them more quickly to a correct solution (Liu et al., 1998). The LVF algorithm is described as follows:
Input: MAX-TRIES D: dataset
N: number of attributes
Ξ³: allowable inconsistency rate
Output: sets of M features satisfying the inconsistency criterion Cbest = N
for i=1 to MAX-TRIES
S = randomSet(seed); C = numOfFeatures(S); if(C<Cbest) if(InconCheck(S, D)<Ξ³) Sbest = S; Cbest = C; print_Current_Best(S); else if((C=Cbest)and(InconCheck(S, D)<Ξ³))
print_Current_Best(S); end
The algorithm starts with a random selected feature subset S in each round. It will do nothing with the situation C>Cbest, which means the number of selected attributes must decrease if not equal to the current best subset.
The inconsistent criterion is defined as follows:
(1) Two instances are considered inconsistent if they match except for their class labels;
(2) For all the matching instances (without considering their class labels), the inconsistency count is the number of the instances minus the largest number of instances of class labels
(3) The inconsistency rate is the sum of all the inconsistency counts divided by the total number of instances.
In their later researches, the inconsistency rate is also introduced to simplify the algorithm. The inconsistency rate and consistency rate are defined as follows:
πΌπΆπππΆππππππΆππ¦π πππ= πΆππππππΆπππππππππΆπππΆππππππΆπππππΆππππΆπππππΆππππΆπππ
(2.9)
πΆππΆππππππΆππ¦π πππ= 1β πΌπΆπππΆππππππΆππ¦π πππ
(2.10) In Liuβs algorithm, the inconsistency is compared with a pre-specified rate Ξ³. Unless specified, the default value of Ξ³ is 0. If the inconsistency rate is below Ξ³, then it means the subset of features is acceptable.
The goal of this algorithm is to find a feature subset that contains no inconsistent. That feature subset may not the smallest subset, and algorithm stops when it finds the first consistent subset. Hash table is used to accelerate the algorithm. To make it adapts to continuous features, datasets contain continuous features must be discretised before applying consistency based feature selection.
Antonio (Arauzo et al., 2003) pointed out that the computation can be very fast by using hash table. Through a direct search, all instances are grouped according to their value of attributes. Then the number of inconsistency is the number of instances that do not belong to the majority class. In the average case, of this process is in O(n).
2.5.1.2 Rough Set Feature Selection
According to Antonioβs paper about consistency measures (Arauzo et al., 2003), Rough Set Feature Selection can also be regarded as one kind of consistency based feature selection.
Rough Sets was first defined by ZdzisΕaw I. Pawlak in this paper (Pawlak, 1982) and later book (Pawlak, 1991). Given a data table S=(U, A) where the universe U is a finite nonempty set of objects and A is the set of attributes. Every attribute aβA associates a
set Va, of its values, called the domain of a. A data pattern of S is any feature value vector v=(v1,β¦vn) where viβvai for i=1,β¦n such that v=a(x) for some xβU (Parthalain et al., 2010).
Each nonempty subset BβA determines an indiscernibility relation, which is defined as:
π π΅ = {(x, y)βU Γ U | π(π₯) = a(y),βaβB}. The relation RB partitions U into some equivalence classes given by U Rβ π΅= {[π₯]π΅|π₯ β π} , or just U/B, where [π₯]π΅ denotes the equivalence class determined by x with respect to B, i.e., [π₯]π΅ = {π¦ β π |(π₯,π¦) β
π π΅} (Jiye et al., 2014).
Given an equivalence relation R on the universe U and XβU, the lower approximation and upper approximation of X are defined (Wang and Zhou, 2009) by
π π=βͺ{π₯ β π|[π₯]π βX} πππ π= {π₯ β π|[π₯]π΅β X}
(2.11) and
π π=βͺ{π₯ β π|[π₯]π β©X β β } πππ π= {π₯ β π|[π₯]π΅β©Xβ β }
(2.12)
The tuple < π π,π π> composed of the lower approximation and upper approximation is called a rough sets.
A decision table is a data table S = (U, CβͺD) with Cβ©D = β , where C is called a condition set and D is called a decision set. Given Pβ C and U/D={D1, D2,β¦Dr}, the positive, negative, and boundary region of D with respect to the condition attribute set P is defined by
POSπ(π·) =βͺπ=1π π ππ·π (2.13)
NEGπ(π·) =π ββͺπ=1π π ππ·π (2.14)
To achieve feature selections, the attribute dependency is an important issue that can discover which attributes are strongly related to which other variables. In rough set theory, dependency is defined in the following way:
For P, Qβ A, the attribute dependency Ξ³π(π) is defined as:
Ξ³π(π) =|POS|ππ|(π)|β€1
(2.16)
If Ξ³π(π) = 1, that means Q totally depends on P, and if Ξ³π(π) = 0, then Q does not depend on P.
The Rough Set Feature Selection is based on the idea of removing attributes without affecting the dependency degree. That means, the selected feature sets R have the same dependency Ξ³π (π·) as the original condition attributes. A quick feature selection algorithm based on Rough Set theory includes two parts: finding the core, and then get the reduction.
A reduct is a minimal subset of attributes that can fully characterize the knowledge in the datasets. Formally, in the data table S = (U, CβͺD), a subset R is a reduct of C if:
(1) Ξ³π (π·) =Ξ³πΆ(π·) and
(2) βaβ R, Ξ³π β{π}(π·)β Ξ³π (π·)
It should be mentioned that a reduct may not be the subset with smallest cardinality. Also the reduct of a data table is not unique. The interaction of all reducts is called the core. All attributes in the core cannot be removed without causing a decrease of dependency.
The RSFS algorithm is described as follows.
RSFS(C, D)
C is the set of all conditional attributes D is the set of decision attributes Set R =β
if (Ξ³πΆβ{ππ}(π·) <Ξ³πΆ(π·)) then R = Rβͺ{ππ} while (Ξ³π (π·) <Ξ³πΆ(π·)) C=C-R for (i=1;i<|C|;i++) if (Ξ³π βͺ{ππ}(π·) >Ξ³π (π·)) then R = Rβͺ{ππ} end for
return R as the reduct
In the RSFS algorithm, the core is found firstly by checking the dependency of all attribute sets with only one attribute missing. If the dependency changed, then it means the missing feature is essential to keep the dependency, in other words it is in the core. Then starting from the core, other attributes are added in if they can increase the dependency. The whole algorithm stops if the dependency of R is equal to the original condition set C, which means R can completely represent C. It should be noticed that this method cannot find the βglobalβ minimum reduct, but can save computing time at some extent regarding to generate a proper reduct.