
3.3 Research experiment

3.3.3 Redundancy elimination

This is a pre-processing step in which redundant or highly correlated features are removed from the data. Redundant features supply information that already exists in other attributes and thereby reduce predictive performance (Xu et al. 2016:370-381). In this study, a hybrid algorithm based on the fast correlation-based filter (FCBF) was used to eliminate redundant features in all sets selected by the feature selection algorithms.

Symmetric uncertainty (SU)

In information theory, SU is a normalised measure that evaluates the dependencies between features using entropy and conditional entropy. For a random variable X whose value $x_i$ occurs with probability $P(x_i)$, the entropy of X is defined as:

$$H(X) = -\sum_{i} P(x_i) \log_2 P(x_i) \tag{3.1}$$

The conditional entropy, also known as the conditional uncertainty, of X given the values of an attribute Y is:

$$H(X \mid Y) = -\sum_{j} P(y_j) \sum_{i} P(x_i \mid y_j) \log_2 P(x_i \mid y_j) \tag{3.2}$$

An SU value of 0 implies that two features are completely independent, while an SU value of 1 signifies that the value of one feature completely predicts the value of the other.
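As a minimal sketch of how these quantities can be computed for discrete features, the Python snippet below estimates equations (3.1) and (3.2) from observed frequencies and combines them using the standard definition SU(X, Y) = 2[H(X) - H(X|Y)] / [H(X) + H(Y)] given by Yu and Liu (2003). The function names, the toy data and the choice to return 0 when both features are constant are assumptions made for illustration only.

```python
from collections import Counter
from math import log2


def entropy(xs):
    """H(X) as in equation (3.1), estimated from observed frequencies."""
    n = len(xs)
    return -sum((c / n) * log2(c / n) for c in Counter(xs).values())


def conditional_entropy(xs, ys):
    """H(X|Y) as in equation (3.2): entropy of X within each group of equal Y."""
    n = len(xs)
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return sum((len(g) / n) * entropy(g) for g in groups.values())


def symmetric_uncertainty(xs, ys):
    """SU(X, Y) = 2 * [H(X) - H(X|Y)] / [H(X) + H(Y)], a value in [0, 1]."""
    hx, hy = entropy(xs), entropy(ys)
    if hx + hy == 0:
        return 0.0  # assumption: two constant features are treated as SU = 0
    return 2.0 * (hx - conditional_entropy(xs, ys)) / (hx + hy)


# Toy data (illustrative only).
a = [0, 0, 1, 1]
b = [1, 1, 0, 0]  # a relabelled copy of a, so it predicts a completely
c = [0, 1, 0, 1]  # independent of a in this sample

print(symmetric_uncertainty(a, b))  # 1.0
print(symmetric_uncertainty(a, c))  # 0.0
```

The two printed values correspond to the two extremes described above: a feature that fully determines another, and two features that share no information.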

According to Yu and Liu (2003:1-11), an attribute that has a certain degree of correlation with a concept, for example a class, may also have the same or an even higher degree of correlation with other concepts. Thus, even if the attribute and the target concept are correlated at a level greater than a specific threshold $\delta$, which makes the attribute relevant to the class concept, this correlation is by no means necessarily predominant in determining the target concept. The concept of predominant correlation is defined as follows (Singh, Kushwaha & Vyas 2014:95-105):

Definition 1 – Predominant correlation

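The correlation between a feature $F_i$ ($F_i \in S$) and the class $C$ is predominant iff $SU_{i,c} \ge \delta$ and, $\forall F_j \in S'$ ($j \ne i$), there exists no $F_j$ such that $SU_{j,i} \ge SU_{i,c}$.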

Definition 2 – Predominant Feature

A feature is predominant to the class iff its correlation to the class is predominant or can become predominant after its redundant peers are eliminated.

As the preceding definitions state, an attribute is good if it is predominant in predicting the class concept. Feature selection for classification is therefore a procedure that identifies all attributes that are predominant to the class and eliminates the rest.
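Definition 1 can be illustrated with a short check over precomputed SU values. The sketch below is only an illustration: the feature names and all SU values are hypothetical, and the dictionary `su_class` is assumed to contain exactly the features in S'.

```python
def has_predominant_correlation(i, su_class, su_pair, delta):
    """Definition 1: SU_{i,c} >= delta and no other F_j has SU_{j,i} >= SU_{i,c}."""
    if su_class[i] < delta:
        return False
    others = (j for j in su_class if j != i)
    return all(su_pair[frozenset((i, j))] < su_class[i] for j in others)


# Hypothetical SU values between each feature and the class.
su_class = {"loc": 0.60, "cc": 0.50, "fan_out": 0.30}
# Hypothetical SU values between pairs of features.
su_pair = {
    frozenset(("loc", "cc")): 0.55,
    frozenset(("loc", "fan_out")): 0.10,
    frozenset(("cc", "fan_out")): 0.20,
}

for name in su_class:
    print(name, has_predominant_correlation(name, su_class, su_pair, delta=0.25))
# loc True
# cc False
# fan_out True
```

In this toy setting the correlation of `cc` with the class is not predominant, because `cc` is more strongly correlated with `loc` (0.55) than with the class (0.50).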

The following three heuristics can efficiently identify predominant attributes and eliminate redundancy among all relevant attributes without testing each attribute in $S'$ for redundant peers, thereby avoiding the analysis of correlations between all pairs of relevant attributes. If two attributes are found to be redundant to each other, eliminating the one that is less relevant to the class retains more information for class prediction while reducing redundancy.

Heuristic 1

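Let $S_{P_i}$ be the set of all redundant peers of $F_i$. When $S_{P_i} \ne \emptyset$, it is divided into $S_{P_i}^{+}$, the peers that are more strongly correlated with the class than $F_i$ ($SU_{j,c} > SU_{i,c}$), and $S_{P_i}^{-}$, the peers for which $SU_{j,c} \le SU_{i,c}$ (Yu & Liu 2003).

If $S_{P_i}^{+} = \emptyset$, consider $F_i$ a predominant feature, eliminate all attributes in $S_{P_i}^{-}$, and skip identifying redundant peers for them (Yu & Liu 2003).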

Heuristic 2

If $S_{P_i}^{+} \ne \emptyset$, process all attributes in $S_{P_i}^{+}$ prior to deciding on $F_i$. If none of them becomes predominant, follow Heuristic 1; otherwise, eliminate only $F_i$ and decide whether or not to eliminate any attributes in $S_{P_i}^{-}$ based on other attributes in $S'$.

Heuristic 3

The attribute with the largest $SU_{i,c}$ value is always a predominant attribute and can be used as the starting point to eliminate other attributes.

3.3.3.1 Fast correlation-based filter

The FCBF is a fast algorithm, with less than quadratic time complexity, that selects good features based on predominant correlation (Yu & Liu 2003). It uses SU as the correlation measure (Wu et al. 2006) and is composed of two stages. The first stage selects relevant attributes: an attribute p is relevant to the target attribute C iff SU(p,c) ≥ δ, where δ is a predefined threshold.

In the second stage, redundant features are identified among the relevant ones and removed, according to the FCBF definition of redundancy: a feature q is said to be redundant iff there is a predominant feature p such that SU(p,c) > SU(q,c) and SU(p,q) ≥ SU(q,c). The inequalities imply that p is a better predictor of the class c and that q is more similar to p than to c (Yang et al. 2016).

The steps for identifying redundant features consist of: (1) choosing a predominant attribute, (2) removing all attributes for which it forms an approximate Markov blanket, and (3) iterating steps (1) and (2) until no more predominant attributes can be found. An optimal feature subset can therefore be approximated by a set of predominant features without redundancy.
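The two stages and the three steps above can be sketched in Python as follows. This is a simplified illustration for discrete features, not the implementation used in this study: the function names and the toy data are assumptions, SU is computed through the identity H(X) - H(X|Y) = H(X) + H(Y) - H(X, Y), and ties in relevance are resolved by input order.

```python
from collections import Counter
from math import log2


def entropy(xs):
    """Entropy estimated from observed frequencies."""
    n = len(xs)
    return -sum((c / n) * log2(c / n) for c in Counter(xs).values())


def su(xs, ys):
    """Symmetric uncertainty, using gain = H(X) + H(Y) - H(X, Y)."""
    hx, hy = entropy(xs), entropy(ys)
    if hx + hy == 0:
        return 0.0  # assumption: constant features carry no information
    gain = hx + hy - entropy(list(zip(xs, ys)))
    return 2.0 * gain / (hx + hy)


def fcbf(features, target, delta):
    """features: dict of name -> list of discrete values; returns kept names."""
    # Stage 1: keep features with SU(p, c) >= delta, ranked by relevance.
    relevance = {name: su(col, target) for name, col in features.items()}
    ranked = sorted((n for n in relevance if relevance[n] >= delta),
                    key=lambda n: relevance[n], reverse=True)
    # Stage 2: the top remaining feature p is predominant; it removes every
    # lower-ranked q with SU(p, q) >= SU(q, c). Repeat with the next survivor.
    selected = []
    while ranked:
        p = ranked.pop(0)
        selected.append(p)
        ranked = [q for q in ranked
                  if su(features[p], features[q]) < relevance[q]]
    return selected


# Toy data (illustrative only).
target = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "f1": [0, 0, 0, 0, 1, 1, 1, 0],  # relevant
    "f2": [1, 1, 1, 1, 0, 0, 0, 1],  # relabelled copy of f1, hence redundant
    "f3": [0, 1, 0, 1, 0, 1, 0, 1],  # independent of the target
    "f4": [0, 0, 1, 0, 1, 1, 0, 1],  # weakly relevant, not redundant to f1
}
print(fcbf(features, target, delta=0.1))  # ['f1', 'f4']
```

Because each surviving feature is only compared with the features ranked below it, the number of pairwise SU computations shrinks as redundant features are removed.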

3.3.3.2 FastCR

The proposed Fast Correlation-based Redundancy elimination (FastCR) algorithm is based on the maximal information coefficient (MIC) and the FCBF method and is implemented in Java. The original FCBF code both selects relevant features and removes redundant attributes. The FastCR algorithm used in this research only removes redundant attributes; relevant attributes are selected using the MIC algorithm, resulting in a hybrid MICFastCR algorithm.

Table 3.2 lists the algorithm. In the first part (lines 2-4), the algorithm calculates the MIC values for all the features and saves them in a list. In the second part (lines 6-20), a feature is removed as redundant when it is identical to a higher-ranked feature, that is, when the SU value of the two features equals 1 (line 13).

As stated in Heuristic 1, a feature $F_p$ that has been ascertained to be a predominant attribute can always be used to eliminate other attributes that are ranked lower than $F_p$ and have $F_p$ as one of their redundant peers (Yu & Liu 2003:1-11).

The loop begins from the first element (Heuristic 3) in $S'_{list}$ (line 7) and runs as detailed below.

For all the remaining attributes (from the one right next to $F_p$ to the last one in $S'_{list}$), if $F_p$ turns out to be a redundant peer of a feature $F_q$, $F_q$ is eliminated from $S'_{list}$ (Heuristic 2). After one round of selection based on $F_p$, the algorithm takes the remaining feature right next to $F_p$ as the new reference (line 18) and repeats the selection. The algorithm ends when no more attributes can be eliminated from $S'_{list}$.

Table 3.2 MIC and FastCR Algorithm (Zhao, Deng & Shi 2013:70-79; Yu & Liu 2003:1-8)

Input:  S(F_1, F_2, …, F_N, C)   // training dataset
        δ                         // a predefined threshold
Output: S_best                    // an optimal subset

1   begin
2     for all f_i, f_j ∈ D, i ≠ j do begin
3       calculate MIC values and set M_i,j = MIC(f_i, f_j);
4     end for
5     sort distinct values of M_i,j in descending order as S'_list;
7     F_p = getFirstElement(S'_list);
8     do begin
9       F_q = getNextElement(S'_list, F_p);
11      do begin
12        F'_q = F_q;
13        if (SU_p,q = 1)   // if features are identical
14          remove F_q from S'_list;
15          F_q = getNextElement(S'_list, F'_q);
16        else F_q = getNextElement(S'_list, F_q);
17      end until (F_q == NULL);
18      F_p = getNextElement(S'_list, F_p);
19    end until (F_p == NULL);
20    S_best = S'_list;
21  end;
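The redundancy-removal loop of Table 3.2 can be sketched in Python as shown below. This is an illustrative approximation, not the Java implementation used in this study. It assumes that a relevance score per feature (standing in for the MIC values) has already been computed and is passed in as a dictionary, that features are discrete, and that "SU = 1" is tested with a small tolerance `tol`; the function names, feature names and scores are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(xs):
    """Entropy estimated from observed frequencies."""
    n = len(xs)
    return -sum((c / n) * log2(c / n) for c in Counter(xs).values())


def su(xs, ys):
    """Symmetric uncertainty, using gain = H(X) + H(Y) - H(X, Y)."""
    hx, hy = entropy(xs), entropy(ys)
    if hx + hy == 0:
        return 0.0  # assumption: constant features carry no information
    gain = hx + hy - entropy(list(zip(xs, ys)))
    return 2.0 * gain / (hx + hy)


def fast_cr(features, relevance, tol=1e-9):
    """Remove features that are identical (SU = 1) to a higher-ranked feature."""
    # Rank features by the supplied relevance score, highest first.
    ranked = sorted(features, key=lambda n: relevance[n], reverse=True)
    p_idx = 0
    while p_idx < len(ranked):
        p = ranked[p_idx]  # current reference feature F_p
        # Keep F_p and everything ranked above it; among the lower-ranked
        # features F_q, keep only those that are not identical to F_p.
        ranked = ranked[:p_idx + 1] + [
            q for q in ranked[p_idx + 1:]
            if su(features[p], features[q]) < 1.0 - tol
        ]
        p_idx += 1  # the next remaining feature becomes the new reference
    return ranked


# Toy data and hypothetical relevance scores (illustrative only).
features = {
    "loc": [1, 1, 2, 2, 3, 3],
    "loc_copy": [10, 10, 20, 20, 30, 30],  # same information as loc
    "cc": [1, 2, 1, 2, 1, 2],
}
relevance = {"loc": 0.8, "loc_copy": 0.8, "cc": 0.6}
print(fast_cr(features, relevance))  # ['loc', 'cc']
```

Compared with the FCBF redundancy condition SU(p,q) ≥ SU(q,c), the test SU = 1 is much stricter, so this loop only discards features that duplicate the information of a higher-ranked feature.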

In document Software defect prediction using maximal information coefficient and fast correlation-based filter feature selection (Page 89-94)