Average distance feature selection was originally implemented as a component of GA, intended as a method of measuring the fitness of individual features. For each class (events and non-events), every feature is averaged across all observations. The resulting mean vectors are then subtracted from each other, and the features with the largest absolute differences are retained; the rest are discarded. The number of features retained can be adjusted. The process is known as the Average Distance between Events and Non-events (ADEN). Many tests were performed utilizing ADEN on artificial data, and a handful of tests investigated increasing the number of features beyond one with ADEN, as shown in the Appendix. Four variations of ADEN were developed.
4.2.6.1 ADEN
ADEN required the user to define $N$ features to retain. The training data $X$ consisted of $F$ features and $M$ observations. The observations corresponding to events and non-events were then separated into $X_e$ and $X_n$. Each was averaged across observations to form a mean feature vector ($F$ long), $x_e$ and $x_n$. The absolute difference between them formed a single vector, $\Delta x$, with elements

$$\Delta x_i = \left| x_{e,i} - x_{n,i} \right| \tag{6}$$

The difference between classes was normalized by dividing vector $\Delta x$ by Cohen's d (effect size), such that within-group variances in the training data could be accounted for. Training data $X$ were reduced to a matrix of $N$ features and $M$ observations, retaining the features at the indices $i$ of the $N$ largest terms in $\Delta x$. The testing data would likewise be reduced to $N$ features, selected using the same indices as for the training data.
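Cohen's d is not defined in this section. A common per-feature form, assuming the pooled within-group standard deviation as the scale, is sketched below. The symbols $M_e$ and $M_n$ (numbers of event and non-event observations) and $s_{e,i}$, $s_{n,i}$ (sample standard deviations of feature $i$ within each class) are introduced here for illustration only.

```latex
d_i = \frac{x_{e,i} - x_{n,i}}{s_{p,i}},
\qquad
s_{p,i} = \sqrt{\frac{(M_e - 1)\, s_{e,i}^{2} + (M_n - 1)\, s_{n,i}^{2}}{M_e + M_n - 2}}
```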
ADEN Calculation
1) Take training data matrix $X$, with dimensions features $F$ by observations $M$.
2) Calculate Cohen's d.
3) Move all observations of non-events from $X$ into matrix $X_n$.
4) Move all observations of events from $X$ into matrix $X_e$.
5) Average $X_e$ and $X_n$ to form mean feature vectors ($F$ long), $x_e$ and $x_n$.
6) Calculate the absolute value of the difference between $x_e$ and $x_n$ in vector $\Delta x$.
7) Divide vector $\Delta x$ by Cohen's d.
8) Arrange values in vector $\Delta x$ in descending order.
9) Reduce training data $X$ to the features corresponding to the $N$ highest differences.
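The ADEN steps above can be sketched in plain Python as follows. This is a minimal illustration, not the original implementation. It interprets the normalization step as scaling each absolute difference by the pooled within-group standard deviation, which gives the magnitude of Cohen's d per feature. That reading is an assumption. The function names `aden_rank` and `select_features`, the label coding (1 = event, 0 = non-event), and the toy data are all hypothetical.

```python
import math
from statistics import mean, variance


def aden_rank(X, labels, n_keep):
    """Rank features by normalized class-mean distance and keep the top n_keep.

    X      : list of F feature rows, each holding M observations
    labels : list of M class labels (assumed coding: 1 = event, 0 = non-event)
    Each class needs at least two observations for the sample variance.
    """
    scores = []
    for row in X:
        events = [v for v, y in zip(row, labels) if y == 1]
        nonevents = [v for v, y in zip(row, labels) if y == 0]
        # Absolute difference of the class means, as in Equation (6).
        diff = abs(mean(events) - mean(nonevents))
        # Pooled within-group variance (assumed normalization, see lead-in).
        pooled_var = (
            (len(events) - 1) * variance(events)
            + (len(nonevents) - 1) * variance(nonevents)
        ) / (len(events) + len(nonevents) - 2)
        pooled_sd = math.sqrt(pooled_var)
        if pooled_sd > 0:
            scores.append(diff / pooled_sd)
        else:
            # No within-group spread: treat any nonzero difference as maximal.
            scores.append(math.inf if diff > 0 else 0.0)
    # Feature indices sorted by descending score.
    order = sorted(range(len(X)), key=lambda i: scores[i], reverse=True)
    return order[:n_keep], scores


def select_features(X, indices):
    """Reduce a feature-by-observation matrix to the chosen feature rows."""
    return [X[i] for i in indices]


# Toy data: 3 features by 4 observations, first two observations are non-events.
X_train = [
    [1.0, 1.1, 0.9, 1.0],
    [0.0, 0.2, 5.0, 5.2],
    [2.0, 3.0, 2.5, 2.4],
]
y_train = [0, 0, 1, 1]
kept, scores = aden_rank(X_train, y_train, n_keep=2)
X_reduced = select_features(X_train, kept)
```

The same `kept` indices would then be passed to `select_features` for the testing data, so that training and testing use an identical feature subset.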
4.2.6.2 ADENZ
A second variation of ADEN was dubbed Average Distance Between Events and Non-events by Z-score transform (ADENZ). The z-score transformation involved subtracting the mean of each variable, followed by dividing by the variable's standard deviation. In contrast with ADEN, ADENZ applied independent z-score transformations to the training and testing data and omitted Cohen's d (effect size).
ADENZ Calculation
1) Take training data matrix $X$, with dimensions features $F$ by observations $M$.
2) Perform z-score transformation on $X$.
3) Move all observations of non-events from $X$ into matrix $X_n$.
4) Move all observations of events from $X$ into matrix $X_e$.
5) Average $X_e$ and $X_n$ to form mean feature vectors ($F$ long), $x_e$ and $x_n$.
6) Calculate the absolute value of the difference between $x_e$ and $x_n$ in vector $\Delta x$.
7) Arrange values in vector $\Delta x$ in descending order.
8) Reduce training data $X$ to the features corresponding to the $N$ highest differences.
9) Reduce the testing data to the same feature subset.
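A minimal sketch of the ADENZ steps above, in plain Python, might look like the following. It is illustrative only. The function names `zscore_rows` and `adenz_select`, the use of the sample standard deviation, the handling of constant features, and the label coding (1 = event, 0 = non-event) are assumptions rather than details taken from the original implementation.

```python
from statistics import mean, stdev


def zscore_rows(X):
    """Z-score each feature row: subtract its mean, divide by its standard deviation."""
    Z = []
    for row in X:
        mu, sd = mean(row), stdev(row)
        # Constant features have zero spread; map them to zeros (assumption).
        Z.append([(v - mu) / sd if sd > 0 else 0.0 for v in row])
    return Z


def adenz_select(X_train, labels, X_test, n_keep):
    """Select the n_keep features with the largest class-mean distance after z-scoring.

    X_train, X_test : lists of F feature rows (features by observations)
    labels          : training labels (assumed coding: 1 = event, 0 = non-event)
    """
    # Training and testing data are transformed independently.
    Z_train = zscore_rows(X_train)
    Z_test = zscore_rows(X_test)
    diffs = []
    for row in Z_train:
        events = [v for v, y in zip(row, labels) if y == 1]
        nonevents = [v for v, y in zip(row, labels) if y == 0]
        diffs.append(abs(mean(events) - mean(nonevents)))
    # Feature indices in descending order of class-mean distance.
    order = sorted(range(len(diffs)), key=lambda i: diffs[i], reverse=True)
    kept = order[:n_keep]
    return kept, [Z_train[i] for i in kept], [Z_test[i] for i in kept]


# Toy data: 3 features; 4 training and 3 testing observations.
X_train = [
    [1.0, 1.1, 0.9, 1.0],
    [0.0, 0.2, 5.0, 5.2],
    [2.0, 3.0, 2.5, 2.4],
]
y_train = [0, 0, 1, 1]
X_test = [
    [1.0, 0.9, 1.2],
    [0.1, 5.1, 4.8],
    [2.2, 2.6, 2.4],
]
kept, train_sel, test_sel = adenz_select(X_train, y_train, X_test, n_keep=2)
```

Because each feature is rescaled to unit variance before ranking, no separate effect-size normalization is applied in this sketch.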
4.2.6.3 GADEN
A further development of ADEN was the incorporation of aspects of GA, resulting in Genetic Average Distance between Events and Non-events (GADEN). ADEN's primary role in GADEN was as a bottleneck for ranked features, as GA would be performed upon random combinations of the features that ADEN retained. The user was required to designate a pool of $N$ features to select as a bottleneck. A fixed number of features would then be selected at random from this "gene pool" of $N$ features. Approximately half of the training data would be randomly selected for training, and the other half used for testing, using only the selected features.
GADEN Calculation
1) Take training data matrix $X$, with dimensions features $F$ by observations $M$.
2) Calculate Cohen's d.
3) Move all observations of non-events from $X$ into matrix $X_n$.
4) Move all observations of events from $X$ into matrix $X_e$.
5) Average $X_e$ and $X_n$ to form mean feature vectors ($F$ long), $x_e$ and $x_n$.
6) Calculate the absolute value of the difference between $x_e$ and $x_n$ in vector $\Delta x$.
7) Divide vector $\Delta x$ by Cohen's d.
8) Arrange values in vector $\Delta x$ in descending order.
9) Reduce the training data $X$ to the features corresponding to the $N$ highest differences.
10) Select a random, fixed-size subset of features in $X$.
11) Estimate the "fitness" of each feature subset from the phi correlation obtained by training and testing LDA on only that feature subset.
12) Use the feature subset corresponding to the highest phi correlation as the basis for other subsets.
13) Repeat (3) through (12) until the error goal or the number of iterations is met.
14) Reduce the training data $X$ to retain only the feature subset corresponding to the highest "fitness."
15) Reduce the testing data to the same feature subset.
The random combination of features with the highest phi correlation would "reproduce" a new set of variants (i.e., new "genotypes" of a new "generation") that would be tested against the parent. The standard configuration of GADEN used three generations with a constant number of "offspring" in each generation. GADEN took much more time and processing power than ADEN or ADENZ, but was able to overcome the potential issue, present in the other two methods, of selecting collinear features.
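The generational search described above can be sketched as follows. This is a hedged illustration, not the original algorithm. The text does not state how offspring are derived from the parent, so the mutation rule here (swap one feature for another from the pool) is an assumption, as are the names `gaden_search` and `mutate` and the default of 10 offspring. The `fitness` argument is a hypothetical callable. In the method described, it would train LDA on roughly half of the training data using only the given features, test on the other half, and return the phi correlation, for which `phi_coefficient` is one possible implementation.

```python
import math
import random


def phi_coefficient(y_true, y_pred):
    """Phi correlation between two binary label sequences (1 = event, 0 = non-event)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # An empty row or column of the confusion matrix leaves phi undefined; use 0.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def mutate(parent, pool, rng):
    """Create one offspring by swapping a single feature (assumed mutation rule)."""
    outside = [f for f in pool if f not in parent]
    if not outside:
        return list(parent)
    child = list(parent)
    child[rng.randrange(len(child))] = rng.choice(outside)
    return child


def gaden_search(pool, subset_size, fitness, generations=3, offspring=10, seed=0):
    """Search the ADEN-ranked pool for the feature subset with the highest fitness."""
    rng = random.Random(seed)
    parent = rng.sample(pool, subset_size)  # initial random subset
    parent_fit = fitness(parent)
    for _ in range(generations):
        # A constant number of offspring is produced from the current parent.
        children = [mutate(parent, pool, rng) for _ in range(offspring)]
        scored = [(fitness(child), child) for child in children]
        best_fit, best_child = max(scored, key=lambda fc: fc[0])
        # Offspring replace the parent only if they score higher.
        if best_fit > parent_fit:
            parent, parent_fit = best_child, best_fit
    return parent, parent_fit
```

A caller would pass the indices of the features retained by the ADEN ranking as `pool` and then reduce both the training and testing data to the returned subset.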
4.2.6.4 GADENZ
A further version of GADEN was developed, based upon ADENZ.
GADENZ Calculation
1) Take training data matrix $X$, with dimensions features $F$ by observations $M$.
2) Perform z-score transformation on $X$.
3) Move all observations of non-events from $X$ into matrix $X_n$.
4) Move all observations of events from $X$ into matrix $X_e$.
5) Average $X_e$ and $X_n$ to form mean feature vectors ($F$ long), $x_e$ and $x_n$.
6) Calculate the absolute value of the difference between $x_e$ and $x_n$ in vector $\Delta x$.
7) Arrange values in vector $\Delta x$ in descending order.
8) Reduce the training data $X$ to the features corresponding to the $N$ highest differences.
9) Select a random, fixed-size subset of features in $X$.
10) Estimate the "fitness" of each feature subset from the phi correlation obtained by training and testing LDA on only that feature subset.
11) Use the feature subset corresponding to the highest phi correlation as the basis for other subsets.
12) Repeat (3) through (11) until the error goal or the number of iterations is met.
13) Reduce the training data $X$ to retain only the feature subset corresponding to the highest "fitness."
14) Reduce the testing data to the same feature subset.
The primary difference with GADENZ was the use of a z-score transform to normalize the training data. Cohen's d was not used. It was thought that the different method of normalization could yield a different set of features than GADEN.
4.3 Pattern Recognition
Pattern recognition is the automatic sorting of data into categories or assignment of labels. Pattern recognition techniques are often grouped by the learning technique utilized. A successful pattern recognition algorithm can correctly identify the label of testing data the majority of the time. Two common categories are supervised and unsupervised learning. In supervised learning, a classifier is given a labelled set of training data and generalizes from each group. Unsupervised learning does not use labels, and is focused on discerning patterns between groups, irrespective of class labels.
Three supervised learning approaches to pattern recognition were investigated: linear discriminant analysis (LDA), support vector machines (SVMs), and radial basis functions (RBFs).
4.3.1 Linear Discriminant Analysis
Linear Discriminant Analysis (LDA) is a simple type of pattern classification algorithm. LDA calculates the within-group and between-group variances of the training data and draws a boundary between the groups. A new observation is assigned a group label according to the side of the boundary on which it falls. LDA was used previously in microsleep detection (Peiris et al., 2011) and has the advantages of being a robust and simple classifier.
LDA Calculation
1) Take the matrix of training data $X$ and target vector $x_t$.
2) Move all observations of non-events from $X$ into matrix $X_n$.
3) Move all observations of events from $X$ into matrix $X_e$.
4) Calculate group mean and variance for $X_e$ and $X_n$.
5) Compute class separation to set threshold.
6) Expose the classifier to testing data.
LDA was the baseline that other pattern recognition algorithms were compared with.
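The LDA steps listed above can be illustrated with a deliberately simplified two-class sketch. It is not full LDA. It assumes a diagonal pooled covariance, so covariances between features are ignored, and equal class priors, so the threshold sits midway between the projected class means. The function names `fit_diagonal_lda` and `predict`, the label coding (1 = event, 0 = non-event), and the toy data are hypothetical.

```python
from statistics import mean, variance


def fit_diagonal_lda(X, labels):
    """Fit a two-class linear discriminant with a diagonal pooled covariance.

    X      : list of F feature rows, each holding M observations
    labels : list of M class labels (assumed coding: 1 = event, 0 = non-event)
    Assumes every feature has nonzero pooled within-group variance and each
    class has at least two observations.
    """
    weights, midpoints = [], []
    for row in X:
        events = [v for v, y in zip(row, labels) if y == 1]
        nonevents = [v for v, y in zip(row, labels) if y == 0]
        # Pooled within-group variance for this feature.
        pooled_var = (
            (len(events) - 1) * variance(events)
            + (len(nonevents) - 1) * variance(nonevents)
        ) / (len(events) + len(nonevents) - 2)
        # Weight: class-mean separation scaled by within-group variance.
        weights.append((mean(events) - mean(nonevents)) / pooled_var)
        midpoints.append((mean(events) + mean(nonevents)) / 2)
    # Threshold: projection of the midpoint between the two class means.
    threshold = sum(w * m for w, m in zip(weights, midpoints))
    return weights, threshold


def predict(X, weights, threshold):
    """Label each observation (column of X) as event (1) or non-event (0)."""
    n_obs = len(X[0])
    return [
        1 if sum(w * row[j] for w, row in zip(weights, X)) > threshold else 0
        for j in range(n_obs)
    ]


# Toy data: 2 features by 4 training observations, then 2 testing observations.
X_train = [
    [0.0, 0.2, 5.0, 5.2],
    [1.0, 1.1, 0.9, 1.0],
]
y_train = [0, 0, 1, 1]
weights, threshold = fit_diagonal_lda(X_train, y_train)
X_test = [
    [0.1, 4.9],
    [1.0, 1.0],
]
predictions = predict(X_test, weights, threshold)
```

A full LDA would replace the per-feature variances with the pooled covariance matrix and solve a linear system for the weights. The diagonal form is used here only to keep the sketch short.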