Relevance Scoring - Multivariate Correlation Analysis for Supervised Feature Selection in High-

The transformed feature is based on the pattern set s drawn by a Monte-Carlo iteration and its relevance for classification is necessary to be evaluated. With our defined transformation function, a na¨ıve solution is to convert all patterns into numeric features and perform feature selection. As such an approach is computa- tionally expensive, it is necessary to evaluate the relevance of an ordinal pattern set right after the transformation. By estimating the misclassification rate of a classifier trained for each transformed feature, it is possible to evaluate the feature relevance. However, we aim to efficiently score the relevance and redundancy of a transformed feature without training a classifier. Hence, we estimate the mis- classification rate of a feature f by applying principles of Chebyshev’s inequality [KS66].

Let us consider a simple binary classification task with classes {ca, cb} and

feature f generated using the pattern set s. Using the theory of Chebychev in- equality [KS66], the misclassification of feature f is represented using the variance

V ar[f | ci] and expected value E[f | ci] as,

error(f ) = V ar[f |ca] + V ar[f |cb]

2 · (|E[f |cb] − E[f |ca]|)2

. (5.2)

The Equation 5.2 has statistical properties similar to a two-sample t-test. Its detailed proof is provided in Section 5.5.1 and we explain the intuition behind the equation with an example.

Example 5.3. Assume two multivariate ordinal patterns s₁ and s₂, where s₁ is relevant and s₂ is irrelevant for the classification.

Each ordinal pattern set is transformed into numeric features f₁ and f₂ respec- tively (c.f. Definition 5.2). As a relevant pattern, s₁ has a higher discriminative power, i.e., it occurs in every time series of one class (e.g., ca) with a high prob-

ability and never occurs for the other class. Therefore, the distributions of the transformed feature f₁ for each class, exhibits a minimal variance, i.e., V ar[f₁ | ca]

and V ar[f₁ | cb]. On contrary, an irrelevant multivariate ordinal pattern set s2,

without any discriminative power to classify, occurs in different time series ran- domly. Hence, the distribution of the transformed numeric feature f₂ | ca and

Figure 5.6: Illustration of relevant and irrelevant feature based on Equation 5.2. Where, length of the colored blocks denote the variance of distributions,

inverted triangles denote the expected values of the distributions and colors denote the class

f₂ | cb has random peaks and lows. This leads to a larger variance in the respective

distributions V ar[f₂ | ca] and V ar[f2 | cb]. This means, the classification error is

high when the sum of the variances are large.

In real world applications, due to factors such as noise, it is possible that s₁ (which has high occurrence for class ca) occurs in a few samples of class cb, i.e.,

V ar[f₁ | cb] is not exactly equal to zero. Hence, in addition to the variance,

the distance between the expected values of the distributions is estimated, i.e., |E[f |cb] − E[f |ca]|. As we aim to extract the most distinguishing pattern set

between two classes, the expected values of their distributions under each class will have a larger difference, i.e., E[f | ca] >> E[f | cb]. This large difference in

the expected values helps the classification boundaries to be well-separated. This means, the classification error is large if the difference between the expected values are small.

Using Figure 5.6, we illustrate the property of a relevant feature f₁ and an irrelevant feature f₂. Our transformation function is dependent on the frequency of an ordinal pattern (c.f. Definition 5.2). Hence, for a relevant feature, the variance of its distribution under each class is smaller in comparison to that of an irrelevant feature. The variance is denoted by blocks of different lengths on the number line, whereas, the color denotes the class, i.e., red denotes V ar[f | ca] and green denotes

V ar[f | cb]. On contrary, for a relevant feature, its expected value under each class

is well separated in comparison to an irrelevant feature. We denote the expected value of the distributions, i.e., E[f | ca] and E[f | cb], as inverted triangle in Figure

Definition. 5.3: Relevance scoring

For a classification task with C = {c₁, · · · , ck} classes, the lower bound of

the ability to distinguish any pair of classes ca ∈ C and cb ∈ C using the

transformed feature f is,

disca,cb(f ) = 1 − error(f )

and we define its relevance as the lowest value of all pairwise dis scores, i.e.,

rel(f ) = min{disca,cb(f ) | ca6= cb}.

Assume a classification task with classes ca, cb, cc for which the disca,cb(f ),

disca,cc(f ) and discb,cc(f ) are computed. The three values denote the accuracy

of each class. The relevance of f is defined as the minimum of the three dis scores in Definition 5.3. Intuitively, it means that feature relevance is the lowest accuracy of all pairwise scores. Hence, maximizing rel(f ) implies maximizing the lowest accuracy of all pairs of classes.

5.5.1 Theoretical foundations of feature relevance score based

on Chebychev’s inequality

Consider f as a feature extracted using a multivariate ordinal pattern set s (c.f. Definition 5.1) to classify between ca and cb. We denote its distribution for class ca

as f |ca. The expected value and the variance of the distribution are represented as

E[f |ca] and V ar[f |ca] respectively. Similarly, for class cb, we define the distribution

f |cb, expected value E[f |cb] and variance V ar[f |cb]. Without loss of generality, we

assume E[f |ca] < E[f |cb]. The upper bound of mis-classification for feature f with

arbitrary distribution is strongly founded by the principles of Chebyshev-inequality, Chebychev’s inequality defines the upper bound of the fraction of samples that can lie beyond a threshold a > 0. For any feature f , no more than 1/a2 of the values can be greater than a standard deviations away from the mean [KS66],

P (|f − E[f ]| ≥ a) ≤ V ar[f ]

a2 , (5.3)

Figure 5.7: Example: A number line with limits [0,1]

Given the expected value E[f ] and variance V ar[f ] of the feature, Equation 5.3 represents the probability of a sample being greater than a. The approach is commonly used for finding outliers, i.e, instances with a high probability of being greater than E[f ] + a are outliers. Applying the rule of Chebychev’s inequality [KS66] for classification problems, an instance of feature f is classified as ca or cb

based on the arbitrary threshold value k | 0 < k < |E[f |cb] − E[f |ca]| (c.f. Figure

5.7).

We denote P (Mca,cb

k ) as the probability that ca is mis-classified as cb or cb is

misclassified as ca. Under the assumption that f |ca and f |cb are symmetrically

distributed around their expected values, a sample is classified as cb when its

expected value is greater than E[f |ca] + k. Hence, to estimate P (Mkca,cb) we need

to quantify the maximum number of ca that exceed the threshold and likewise for

cb.

Lemma 5.1. The upper bound of mis-classification is represented as,

P (Mca,cb k ) ≤ V ar[f |ca] 2k2 + V ar[f |cb] 2(|E[f |cb] − E[f |ca]| − k)2 .

Proof: For a classification task, using our threshold E[f |ca] + k, we aim to esti-

mate:

Mcb|ca

k : a sample with class ca is misclassified as cb based on threshold k.

Mca|cb

k : a sample with class cb is misclassified as ca based on threshold k.

Figure 5.7 visualizes both cases on a simple number-line, by applying the Cheby- chev inequality,

P (Mcb|ca

k ) = P ((f |ca) ≥ E[f |ca] + k)

= P ((f |ca) − E[f |ca] ≥ k).

We assumed that f |cais distributed symmetrically around its expected value. Thus

P (Mcb|ca

k ) is half the probability that the value of f |ca has at least a distance of k

to its expected value,

P (Mcb|ca

k ) =

2P (|(f |ca) − E[f |ca]| ≥ k) .

Using the Chebyshev-Inequality and setting a = k in Equation 5.3, we can estimate an upper bound of P (Mcb|ca k ) as, P (Mcb|ca k ) ≤ V ar[f |ca] 2k2 . (5.4)

On applying the same symmetric assumption on (f |cb), M ca|cb

k is half of the prob-

expected value.

P (Mca|cb

k ) = P ((f |cb) ≤ E[f |ca] + k)

= P ((f |cb) − E[f |cb] ≤ E[f |ca] − E[f |cb] + k)

= P (E[f |cb] − (f |cb) ≥ E[f |cb] − E[f |ca] − k)

= 1

2P (|(f |cb) − E[f |cb]|≥ |E[f |cb] − E[f |ca]|−k)

Comparing the above result with Equation 5.3, we derive, a = |E[f |cb] − E[f |ca]|−

k. To estimate an upper bound

P (Mca|cb

k ) ≤

V ar[f |cb]

2(|E[f |cb] − E[f |ca]| − k)2

(5.5)

From Equation 5.4 and 5.5 this, we derive an upper bound for the total mis- classification probability for ca and cb as,

This means, given an optimal value of the threshold E[f |ca]+k, we can calculate

an upper bound for the minimal misclassification of each pair of classes. However, we cannot assume that all classifiers find such an optimal k based on the data. Moreover, finding this bound costs additional computation time. In order to be independent of the classifier and have a better efficiency, we use the fact, that 0 < k < |E[f |cb] − E[f |ca]|. Due to that, the upper bound of the mis-classification

grows approximately as fast as

V ar[f |ca] + V ar[f |cb]

2(|E[f |cb] − E[f |ca]|)2

In document Multivariate Correlation Analysis for Supervised Feature Selection in High-Dimensional Data (Page 134-139)