Average Distance Feature Selection - Detection of microsleeps from the eeg via optimized classi

Average distance feature selection was originally implemented as a component of GA, where it was intended as a method of measuring the fitness of individual features. For each class (events and non-events), every feature is averaged across all observations of that class. The two resultant mean vectors are then subtracted from each other, and the features with the largest absolute differences are identified. These features are retained and the rest discarded; the number of features retained can be adjusted. The process is known as the Average Distance between Events and Non-events (ADEN). Many tests were performed utilizing ADEN on artificial data, and a handful of tests investigated increasing the number of features beyond one with ADEN, as shown in the Appendix. Four variations of ADEN were developed.

4.2.6.1 ADEN

ADEN required the user to define 𝑈 features to retain. The training data 𝐗 consisted of 𝐹 features and 𝑀 observations. Observations corresponding to events and non-events were separated into 𝐗_e and 𝐗_n. Each was averaged across observations to form a mean feature vector of length 𝐹, 𝑥_e and 𝑥_n. The absolute difference between the two formed a single vector, ∆𝑥_𝑓.

∆𝑥_𝑓 = |𝑥_{e,𝑓} − 𝑥_{n,𝑓}|    (6)

The difference between classes was normalized by dividing vector ∆𝑥_𝑓 by Cohen’s d (effect size), so that within-group variances in the training data could be accounted for. Training data 𝐗 were then reduced to a matrix of 𝑈 features and 𝑀 observations, retaining the features at the indices 𝑓 of the 𝑈 largest terms in ∆𝑥_𝑓. The testing data were likewise reduced to 𝑈 features, using the same indices 𝑓 selected from the training data.

ADEN Calculation

1) Take training data matrix 𝐗, with dimensions features 𝐹 by observations 𝑀.
2) Calculate Cohen’s d.
3) Move all observations of non-events from 𝐗 into matrix 𝐗_𝐧.
4) Move all observations of events from 𝐗 into matrix 𝐗_𝐞.
5) Average 𝐗_𝐞 and 𝐗_𝐧 to form a mean feature vector (𝐹 long), 𝑥_e and 𝑥_n.
6) Calculate the absolute value of the difference between 𝑥_e and 𝑥_n in vector ∆𝑥_𝑓.
7) Divide vector ∆𝑥_𝑓 by Cohen’s d.
8) Arrange values in vector ∆𝑥_𝑓 in descending order.
9) Reduce training data 𝐗 to features corresponding to 𝑈 highest differences for training data.
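The ranking steps above can be sketched in a few lines of Python. This is a minimal illustration, not the thesis implementation. The function names `aden_select` and `reduce_features` are hypothetical, and the normalization is an assumption: the text says ∆𝑥_𝑓 is divided by Cohen’s d, and this sketch reads that as scaling the absolute mean difference by the pooled within-group standard deviation, which gives the magnitude of Cohen’s d as the ranking score.

```python
from statistics import mean, variance


def aden_select(X, is_event, U):
    """Return the indices of the U top-ranked features.

    X        : list of F rows, each a list of M observations
               (features x observations, as in the text)
    is_event : list of M booleans, True where the observation is an event
    U        : number of features to retain
    Each class needs at least two observations for the variance estimate.
    """
    scores = []
    for row in X:
        events = [v for v, e in zip(row, is_event) if e]
        nonevents = [v for v, e in zip(row, is_event) if not e]
        # Absolute difference between the class means (Equation 6).
        delta = abs(mean(events) - mean(nonevents))
        # Pooled within-group variance (ASSUMED normalizer, see lead-in).
        n_e, n_n = len(events), len(nonevents)
        pooled_var = ((n_e - 1) * variance(events)
                      + (n_n - 1) * variance(nonevents)) / (n_e + n_n - 2)
        scores.append(delta / pooled_var ** 0.5 if pooled_var > 0 else 0.0)
    # Rank features by score, largest first, and keep the first U indices.
    ranked = sorted(range(len(X)), key=lambda f: scores[f], reverse=True)
    return ranked[:U]


def reduce_features(X, keep):
    """Keep only the rows (features) of X whose indices are listed in keep."""
    return [X[f] for f in keep]
```

The indices returned for the training data can be passed unchanged to `reduce_features` for the testing data, so that both sets retain the same 𝑈 features.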

4.2.6.2 ADENZ

A second variation of ADEN was dubbed Average Distance Between Events and Non-events by Z-score transform (ADENZ). The z-score transformation involved subtracting the mean of each variable, followed by dividing by the variable’s standard deviation. In contrast with ADEN, ADENZ applied independent z-score transformations to the training and testing data, and omitted Cohen’s d (effect size).

ADENZ Calculation

1) Take training data matrix 𝐗, with dimensions features 𝐹 by observations 𝑀.
2) Perform z-score transformation on 𝐗.
3) Move all observations of non-events from 𝐗 into matrix 𝐗_𝐧.
4) Move all observations of events from 𝐗 into matrix 𝐗_𝐞.
5) Average 𝐗_𝐞 and 𝐗_𝐧 to form a mean feature vector (𝐹 long), 𝑥_e and 𝑥_n.
6) Calculate the absolute value of the difference between 𝑥_e and 𝑥_n in vector ∆𝑥_𝑓.
7) Arrange values in vector ∆𝑥_𝑓 in descending order.
8) Reduce training data 𝐗 to features corresponding to 𝑈 highest differences for training data.
9) Reduce the testing data to the same feature subset.
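As a rough illustration of the ADENZ steps, the following Python sketch z-scores each feature and then ranks features by the absolute difference of the class means. The names `zscore_rows` and `adenz_select` are hypothetical. The use of the population standard deviation and the handling of constant features are assumptions, since the text does not specify either.

```python
from statistics import mean, pstdev


def zscore_rows(X):
    """Z-score each feature (row of X) across its own observations."""
    out = []
    for row in X:
        mu, sd = mean(row), pstdev(row)  # population SD: an ASSUMPTION
        # A constant feature has zero spread; map it to zeros (ASSUMPTION).
        out.append([(v - mu) / sd if sd > 0 else 0.0 for v in row])
    return out


def adenz_select(X_train, is_event, U):
    """Return the indices of the U features with the largest
    absolute difference between z-scored class means."""
    Z = zscore_rows(X_train)
    scores = []
    for row in Z:
        events = [v for v, e in zip(row, is_event) if e]
        nonevents = [v for v, e in zip(row, is_event) if not e]
        scores.append(abs(mean(events) - mean(nonevents)))
    ranked = sorted(range(len(Z)), key=lambda f: scores[f], reverse=True)
    return ranked[:U]
```

Because the text describes independent transformations, the testing data would be passed through `zscore_rows` separately before being reduced to the indices chosen on the training data.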

4.2.6.3 GADEN

A further development of ADEN was the incorporation of aspects of GA, resulting in Genetic Average Distance between Events and Non-events (GADEN). ADEN’s primary role in GADEN was as a bottleneck for ranked features: GA would be performed upon random combinations of the features that ADEN retained. The user was required to designate a pool of 𝑉 features to select as a bottleneck. A total of 𝑈 features would be selected at random from the “gene pool” of 𝑉 features. Approximately half of the training data would be randomly selected for training, and the other half used for testing, using only the selected 𝑈 features.

GADEN Calculation

1) Take training data matrix 𝐗, with dimensions features 𝐹 by observations 𝑀.
2) Calculate Cohen’s d.
3) Move all observations of non-events from 𝐗 into matrix 𝐗_𝐧.
4) Move all observations of events from 𝐗 into matrix 𝐗_𝐞.
5) Average 𝐗_𝐞 and 𝐗_𝐧 to form a mean feature vector (𝐹 long), 𝑥_e and 𝑥_n.
6) Calculate the absolute value of the difference between 𝑥_e and 𝑥_n in vector ∆𝑥_𝑓.
7) Divide vector ∆𝑥_𝑓 by Cohen’s d.
8) Arrange values in vector ∆𝑥_𝑓 in descending order.
9) Reduce the training data 𝐗 to the features corresponding to the 𝑉 highest differences for training data.
10) Select a random subset of 𝑈 features in 𝐗.
11) Estimate the “fitness” of each feature subset from phi correlation of training and testing on only each 𝑈-sized feature subset with LDA.
12) Use the feature subset corresponding to highest phi correlation as the basis for other subsets.
13) Repeat (3) through (12) until error goal or number of iterations met.
14) Reduce the training data 𝐗 to retain only the feature subset of size 𝑈 corresponding to the highest “fitness.”
15) Reduce the testing data to same feature subset.
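The “fitness” measure in the steps above is the phi correlation between predicted and true labels. For a binary classifier, phi can be computed from the four cells of the confusion matrix, as in this small sketch. The function name `phi_correlation` is hypothetical, and returning 0.0 when the denominator is zero (for example, when every prediction is the same class) is an assumed convention rather than something the text specifies.

```python
from math import sqrt


def phi_correlation(predicted, actual):
    """Phi correlation between two equal-length lists of booleans."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)          # true positives
    tn = sum(1 for p, a in pairs if not p and not a)  # true negatives
    fp = sum(1 for p, a in pairs if p and not a)      # false positives
    fn = sum(1 for p, a in pairs if not p and a)      # false negatives
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # ASSUMED convention: an undefined phi is reported as 0.0.
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

A value of 1 indicates perfect agreement, 0 indicates no association, and −1 indicates complete disagreement.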

The random combination of features with the highest phi correlation would “reproduce” a new set of variants (e.g., new “genotypes” of a new “generation”) that would be tested against the parent. The standard configuration of GADEN utilized a total of three generations with a constant number of “offspring” for each generation. GADEN took much more time and processing power than ADEN or ADENZ, but was able to overcome the potential issue of selecting collinear features present in the other two methods.
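The generational search described above might look something like the following Python sketch. It is illustrative only: the function name `gaden_search`, the default of ten offspring, and the way offspring are produced (swapping one feature of the parent for an unused feature from the pool) are all assumptions, since the text does not say how new variants are generated. The fitness function is passed in as a callable; in the thesis it would be the phi correlation of an LDA classifier trained on roughly half of the training data and tested on the other half.

```python
import random


def gaden_search(pool, U, fitness, generations=3, offspring=10, seed=None):
    """Search the pool of V ranked features for a U-sized subset.

    pool    : list of feature indices retained by the ADEN ranking (U <= V)
    fitness : callable taking a tuple of feature indices, returning a score
    Returns (best_subset, best_score).
    """
    rng = random.Random(seed)
    # Start from a random U-sized subset of the gene pool.
    parent = tuple(sorted(rng.sample(pool, U)))
    parent_score = fitness(parent)
    for _ in range(generations):
        candidates = [(parent_score, parent)]
        unused = [f for f in pool if f not in parent]
        for _ in range(offspring):
            child = list(parent)
            if unused:
                # ASSUMED mutation: replace one feature with an unused one.
                child[rng.randrange(U)] = rng.choice(unused)
            child = tuple(sorted(child))
            candidates.append((fitness(child), child))
        # The fittest of parent and offspring seeds the next generation;
        # on ties the parent is kept because it is listed first.
        parent_score, parent = max(candidates, key=lambda c: c[0])
    return parent, parent_score
```

Because the parent competes with its own offspring, the best score never decreases from one generation to the next.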

4.2.6.4 GADENZ

A further version of GADEN was developed, based upon ADENZ.

GADENZ Calculation

1) Take training data matrix 𝐗, with dimensions features 𝐹 by observations 𝑀.
2) Perform z-score transformation on 𝐗.
3) Move all observations of non-events from 𝐗 into matrix 𝐗_𝐧.
4) Move all observations of events from 𝐗 into matrix 𝐗_𝐞.
5) Average 𝐗_𝐞 and 𝐗_𝐧 to form a mean feature vector (𝐹 long), 𝑥_e and 𝑥_n.
6) Calculate the absolute value of the difference between 𝑥_e and 𝑥_n in vector ∆𝑥_𝑓.
7) Arrange values in vector ∆𝑥_𝑓 in descending order.
8) Reduce the training data 𝐗 to features corresponding to 𝑉 highest differences for training data.
9) Select a random subset of 𝑈 features in 𝐗.
10) Estimate the “fitness” of each feature subset from phi correlation of training and testing on only each 𝑈-sized feature subset with LDA.
11) Use the feature subset corresponding to highest phi correlation as the basis for other subsets.
12) Repeat (3) and (4) until error goal or number of iterations met.
13) Reduce the training data 𝐗 to retain only the feature subset of size 𝑈 corresponding to the highest “fitness.”
14) Reduce the testing data to same feature subset.

The primary difference with GADENZ was the use of a z-score transform to normalize the training data. Cohen’s d was not used. The different method of normalization was thought to potentially result in a different set of features than GADEN.

4.3 Pattern Recognition

Pattern recognition is the automatic sorting of data into categories or the assignment of labels. Pattern recognition techniques are often grouped by the learning technique utilized. A successful pattern recognition algorithm can correctly identify the label of testing data the majority of the time. Two common categories are supervised and unsupervised learning. In supervised learning, a classifier is given a labelled set of training data and generalizes from each group. In unsupervised learning, the data carry no labels, and the focus is on discerning patterns or groupings in the data without reference to class labels.

Three supervised learning approaches to pattern recognition were investigated: linear discriminant analysis (LDA), support vector machines (SVMs), and radial basis functions (RBFs).

4.3.1 Linear Discriminant Analysis

Linear Discriminant Analysis (LDA) is a simple pattern classification algorithm. LDA calculates the within-group and between-group variances of the training data and draws a linear boundary between the groups. A new observation is assigned a group label according to which side of the boundary it falls on. LDA was used previously in microsleep detection (Peiris et al., 2011) and has the advantages of being a robust and simple classifier.

LDA Calculation

1) Take the matrix of training data 𝐗 and target vector 𝑥_t.
2) Move all observations of non-events from 𝐗 into matrix 𝐗_𝐧.
3) Move all observations of events from 𝐗 into matrix 𝐗_𝐞.
4) Calculate group mean and variance for 𝐗_𝐞 and 𝐗_𝐧.
5) Compute class separation to set threshold.
6) Expose the classifier to testing data.

LDA was the baseline that other pattern recognition algorithms were compared with.
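To make the LDA steps concrete, here is a deliberately simplified sketch for a single feature, where the decision boundary reduces to one threshold. It is not the multi-feature classifier used in the thesis. The function names are hypothetical, and two modelling choices are assumptions: both classes share a pooled variance, and the class priors are estimated from the class proportions in the training data.

```python
from math import log
from statistics import mean, variance


def fit_lda_1d(values, is_event):
    """Fit a two-class, one-feature LDA.

    Returns (threshold, events_above), where events_above is True when
    events lie on the high side of the threshold.
    """
    events = [v for v, e in zip(values, is_event) if e]
    nonevents = [v for v, e in zip(values, is_event) if not e]
    mu_e, mu_n = mean(events), mean(nonevents)
    if mu_e == mu_n:
        raise ValueError("class means are identical; no boundary exists")
    n_e, n_n = len(events), len(nonevents)
    # Pooled within-group variance shared by both classes (ASSUMPTION).
    pooled_var = ((n_e - 1) * variance(events)
                  + (n_n - 1) * variance(nonevents)) / (n_e + n_n - 2)
    # Midpoint of the class means, shifted toward the rarer class by the
    # log ratio of the priors (priors from class counts: ASSUMPTION).
    threshold = (mu_e + mu_n) / 2 + pooled_var * log(n_n / n_e) / (mu_e - mu_n)
    return threshold, mu_e > mu_n


def predict_lda_1d(values, threshold, events_above):
    """Label each value as an event (True) or non-event (False)."""
    return [(v > threshold) == events_above for v in values]
```

With equal class sizes the log term vanishes and the threshold sits midway between the two class means.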

In document Detection of microsleeps from the eeg via optimized classification techniques. (Page 49-54)