Classification and Dynamic Domains
2.13. Feature Selection Methods
Feature selection is an important step when creating any classification model. As shown in section 2.4, for domains where a virtual concept drift might occur, the set of input features changes over time. Therefore, it is important to identify the set of significant features before updating the classification model or creating a new one. In general, any classification model aims to approximate the functional link f() between a set of input attributes N = {n1, n2, ... , nW} and a set of output attributes T = {t1, t2, ... , tH}. Sometimes the output attributes can be determined from only a subset of the input attributes {n(1), n(2), ... , n(w)}, where w < W. Hence, it might be reasonable not to use the whole set of input attributes. Yet, when sufficient resources are available, it might be acceptable to use all input attributes, even the ones that are redundant
or irrelevant. Two techniques can be used to select the most effective set of features:
A) Filter method
In this approach, the selection process is independent of the data mining algorithm that will be applied to the chosen features. Normally, filter methods evaluate the significance of each feature by examining the inherent characteristics of the data. Once the relevance score of each feature has been calculated, the features that do not pass a pre-determined threshold are deleted. The remaining features are then presented to the data mining algorithm as its input features. This technique is computationally fast and simple (Sánchez-Maroño et al., 2007). In addition, the feature selection phase needs to be completed only once (Sánchez-Maroño et al., 2007).
Several feature selection algorithms can be used, for instance, Information Gain (Shannon, 1948), Chi-Square (Greenwood & Nikulin, 1996) and Gain Ratio (Quinlan, 1993).
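The thresholding step of a filter method can be sketched as follows. This is a minimal illustration rather than a definitive implementation: the feature names, the scores and the threshold value are hypothetical, and the scores are assumed to have been produced beforehand by one of the measures listed above.

```python
def filter_features(scores, threshold):
    """Keep only the features whose relevance score reaches the threshold."""
    return [name for name, score in scores.items() if score >= threshold]


# Hypothetical relevance scores (for example, Information Gain values).
scores = {"f1": 0.42, "f2": 0.31, "f3": 0.02}

# Features scoring below the (assumed) threshold of 0.05 are deleted.
selected = filter_features(scores, threshold=0.05)
print(selected)
```

Because the scores do not depend on the classifier, this step runs once and its result can be reused with any data mining algorithm.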
1- Information Gain (IG): the most frequently used algorithm in filter methods (Bramer, 2013; Dash & Liu, 1997; Yu & Liu, 2004). Information Gain employs an information-theoretic measure called entropy, which assesses the uncertainty in a data set associated with a particular variable (normally the class variable). The entropy is calculated as per equation 2.2.
$E(D) = \sum_{i=1}^{N} -p_i \log_2(p_i)$ (2.2)
where p_i is the relative frequency of class i in data set D, which comprises N classes. In the case of binary classification, the entropy equation simplifies to equation 2.3.
$E(D) = -P_l \log_2(P_l) - P_p \log_2(P_p)$ (2.3)
where P_l is the probability that a sample belongs to the first class and P_p is the probability that it belongs to the second class. After calculating the entropy, the IG of each feature is examined. The feature that yields the largest reduction in entropy is added to the minimal data set. The IG is calculated as per equation 2.4.
$IG(D, F) = E(D) - \sum_{v \in values(F)} \frac{|D_v|}{|D|} E(D_v)$ (2.4)
where E(D) is the entropy of the whole data set, F is the feature for which the IG is assessed, D_v is the subset of D whose samples have the value v for the feature F, |D_v| and |D| are the numbers of samples in D_v and D respectively, and E(D_v) is the entropy of that subset. The higher the IG value, the more helpful the feature will be for classification.
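Equations 2.2 and 2.4 can be illustrated with a short Python sketch. It assumes categorical features and class labels held in plain lists; the function names and the toy data are illustrative assumptions, not part of the original method description.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Equation 2.2: entropy of the class distribution in `labels`."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Equation 2.4: E(D) minus the weighted entropy of each subset D_v."""
    n = len(labels)
    subsets = {}
    for v, y in zip(feature_values, labels):
        subsets.setdefault(v, []).append(y)  # group the labels by feature value
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


# Toy data set with two balanced classes (hypothetical values).
labels = ["a", "a", "b", "b"]
informative = ["x", "x", "y", "y"]  # separates the classes perfectly
uninformative = ["x", "y", "x", "y"]  # carries no information about the class

ig_informative = information_gain(informative, labels)
ig_uninformative = information_gain(uninformative, labels)
```

In this toy case the informative feature removes all of the uncertainty in the class variable, whereas the uninformative feature removes none of it.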
2- Chi-Square: used to measure whether the occurrence of a specific class and the occurrence of a specific input attribute (feature) are independent. A high Chi-Square value means that the class attribute and the input attribute are dependent; hence, the input feature is added to the selected feature set. A low Chi-Square value, on the other hand, means that the input feature is independent of the class and is therefore considered irrelevant for classification (Witten et al., 2011). Chi-Square is calculated as per equation 2.5.
$\chi^2(f, t) = \frac{N (AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)}$ (2.5)
where f is an input feature, t is a class variable, A is the number of times that t and f co-occur, B is the number of times that f occurs without t, C is the number of times that t occurs without f, D is the number of times that neither t nor f occurs, and N is the total number of samples (N = A + B + C + D).
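Equation 2.5 can be sketched in Python from the four contingency counts. The sketch assumes that N is the total number of samples, N = A + B + C + D, which is the usual convention for this statistic; the function name and the zero-denominator guard are illustrative choices.

```python
def chi_square(a, b, c, d):
    """Equation 2.5 for one feature/class pair.

    a: t and f co-occur        b: f occurs without t
    c: t occurs without f      d: neither occurs
    """
    n = a + b + c + d  # assumed definition of N: total number of samples
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0  # a row or column of the table is empty: no evidence of dependence
    return n * (a * d - c * b) ** 2 / denominator


dependent = chi_square(10, 0, 0, 10)   # feature and class always agree
independent = chi_square(5, 5, 5, 5)   # feature tells nothing about the class
```

With these hypothetical counts the fully dependent pair scores 20.0 and the independent pair scores 0.0, which matches the interpretation given in the text.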
3- Gain Ratio: uses an iterative process for feature selection. The iterations terminate when only a predefined number of features remains. The higher the Gain Ratio of a specific feature, the more useful the feature is for classification. The Gain Ratio uses split information to normalize the Information Gain score. The split information value represents the potential information generated by partitioning the training data set D into V partitions, corresponding to the V outcomes of attribute A. Split information is calculated as per equation 2.6.
$\mathrm{SplitInfo}_A(D) = -\sum_{j=1}^{V} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)$ (2.6)
The Gain Ratio is calculated as per equation 2.7.
$\mathrm{GainRatio}(A) = \frac{\mathrm{InformationGain}(A)}{\mathrm{SplitInfo}_A(D)}$ (2.7)
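The normalization in equations 2.6 and 2.7 can be illustrated with the following sketch. It assumes categorical attributes stored in plain lists and uses the fact that the split information of equation 2.6 is the entropy of the attribute's own value distribution; the function names, the toy data and the handling of a zero split information are illustrative assumptions.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Entropy of the empirical distribution of `values`."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def gain_ratio(attribute_values, labels):
    """Equation 2.7: Information Gain divided by split information."""
    n = len(labels)
    partitions = {}
    for v, y in zip(attribute_values, labels):
        partitions.setdefault(v, []).append(y)  # D_j for each outcome of A
    info_gain = entropy(labels) - sum(
        len(p) / n * entropy(p) for p in partitions.values()
    )
    split_info = entropy(attribute_values)  # equation 2.6
    if split_info == 0:
        return 0.0  # attribute has a single value, so it cannot split D
    return info_gain / split_info


labels = ["a", "a", "b", "b"]
two_valued = ["x", "x", "y", "y"]   # two partitions, each pure
identifier = ["1", "2", "3", "4"]   # one partition per sample

ratio_two_valued = gain_ratio(two_valued, labels)
ratio_identifier = gain_ratio(identifier, labels)
```

Both attributes have the same Information Gain on this toy data, but the identifier-like attribute receives a lower Gain Ratio because its split information is larger.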
B) Wrapper method
This technique uses the results of the data mining algorithm to assess how good a given feature subset is. The main advantage of this method is that the quality of a feature subset is assessed by the performance of the data mining algorithm applied to that subset. However, this technique is much slower and more computationally expensive than the filter method (Sánchez-Maroño et al., 2007), because the data mining algorithm must be trained and evaluated for every candidate subset.
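A wrapper method can be sketched as a search over feature subsets in which each candidate is scored by the learning algorithm itself. The sketch below is only one possible instance: the greedy forward search, the 1-nearest-neighbour classifier and the leave-one-out accuracy estimate are all assumptions chosen to keep the example self-contained, and the data set is hypothetical.

```python
def loo_accuracy(rows, labels, features):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier
    that only looks at the columns listed in `features`."""
    correct = 0
    for i, row in enumerate(rows):
        nearest = min(
            (j for j in range(len(rows)) if j != i),
            key=lambda j: sum((row[f] - rows[j][f]) ** 2 for f in features),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(rows)


def forward_selection(rows, labels, n_features, evaluate=loo_accuracy):
    """Greedily add the feature that most improves the classifier's score."""
    selected, best_score = [], 0.0
    remaining = list(range(n_features))
    while remaining:
        scored = [(evaluate(rows, labels, selected + [f]), f) for f in remaining]
        score, feature = max(scored, key=lambda pair: pair[0])
        if score <= best_score:
            break  # no remaining feature improves the classifier
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score


# Hypothetical data: column 0 separates the classes, column 1 is noise.
rows = [(0.0, 5.0), (0.1, 1.0), (0.2, 9.0), (1.0, 5.1), (1.1, 1.1), (1.2, 9.1)]
labels = [0, 0, 0, 1, 1, 1]

selected, best_score = forward_selection(rows, labels, n_features=2)
```

Every call to `evaluate` reruns the classifier, which illustrates why the wrapper approach costs far more than scoring each feature once with a filter measure.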
2.14. Chapter Summary
This chapter provides an overview of the main obstacles when creating any classification model. Concept drift, catastrophic forgetting, and the stability-plasticity dilemma have been discussed as the main issues to be considered when creating any classification model for dynamic domains. Two possible approaches that can be used to provide a balance between stability and plasticity when
creating any classification model for dynamic domains have been discussed, i.e. the single-classifier approach and the ensemble-based approach. In addition, we discussed the main issues that should be taken into account when creating NN-based classification models. Several classification algorithms have also been briefly described in this chapter.
Chapter 3 Page 35 PHISHING WEBSITES AND CONTEMPORARY ANTI-PHISHING TECHNIQUES