Classification Based on Multilayer Extreme Learning Machine for Motor Imagery Task from EEG Signals

(1)

Classification Based on Multilayer Extreme Learning

Machine for Motor Imagery Task from EEG signals

Lijuan Duan

1, 2

_{, Menghu Bao}

1, 2

_{, Jun Miao}

3*

_{, Yanhui Xu}

1, 2

_and

Juncheng Chen

4

1_{College of Computer Science and Technology, Beijing University of Technology,} Beijing 100124, China.

2_{Beijing Key Laboratory of Trusted Computing, National Engineering Laboratory for Critical} Technologies of Information Security Classified Protection, Beijing 100124, China. 3_{Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS),}

Institute of Computing Technology, CAS, Beijing 100190, China

4_{Beijing Key Laboratory on Integration and Analysis of Large-scale Stream Data,} Beijing 100124, China

[email protected], [email protected], [email protected], [email protected], [email protected]

Abstract

Classification of motor imagery electroencephalogram (EEG) is one of the most important technologies for BCI. To improve the accuracy, this paper introduces a classification system based on Multilayer Extreme Learning Machine (ML-ELM). In the system, the combination of PCA and LDA is chosen as the method of feature extraction and the ML-ELM is used to classify. The ML-ELM has not only the advantage which ELM has but also better performance than ELM. In the experiment, our method is compared with the methods based on ELM, such as kernel-ELM, Constrained-ELM and V-ELM, and some state-of–the-art methods on the same dataset. The experimental results show that ML-ELM is much more suitable for motor imagery EEG data and has better performance than the others.

Keywords: Electroencephalogram Classification, Motor Imagery, Extreme Learning Machine, Multilayer Extreme Learning Machine

1 Introduction

Human brains are natural intelligent computing systems that have powerful ability to control all of human neural activities, such as human behavior, thoughts and emotions. Discovering and simulating the computing mechanisms of human brains is always the goal of artificial intelligence. EEG (electroencephalogram) [1] is the change of event related potentials in cerebral cortex caused by the

*_{Corresponding author}

Volume 88, 2016, Pages 176–184

7th Annual International Conference on Biologically Inspired Cognitive Architectures, BICA 2016

(2)

interaction of numerous interconnected neurons, which is one of the important means to obtain information from human brain. It can be collected by relatively simple and inexpensive external devices and it is much more convenient. Brain-Computer Interface (BCI) [2] [3] [4] technology, which allows users to control computers and other external devices by brain activities, is an effective mean to utilize the information of the brain. Such technology does not depend on the brain's traditional output such as peripheral nerves and muscles. Motor imagery is a kind of simulation of muscular activity and can be captured by various acquisition techniques.

EEG signal recognition is the key technology of BCI, which includes feature extraction and classification. 7here are three kinds of methods to represent EEG signals, which are based on time, frequency and time-frequency, such as amplitude values [5] , band powers (BP) [6] , power spectral density (PSD) values [7] and wavelet package (WP) [8] . And for the classification methods, the k-nearest neighbor [5] , support vector machines (SVM) [9] , neural networks [10] and naive Bayes [11] are widely used. Taking high accuracy and fast learning speed into consideration, this paper use LDA-after-PCA to obtain low dimensional feature and Multilayer Extreme Learning Machine (ML-ELM) [12] to classify. ML-ELM is a method based on Extreme Learning Machine (ELM), which is for single hidden layer feed forward neural networks (SLFNs) proposed by Huang et al. [13] [14] [15] [16] [17] . Meanwhile ML-ELM is a kind of deep neural network, which not only can approximate complicated function but also does not need iteration during the training process. It has much better generalization performance and speed. Therefore, in order to demonstrate its superiority, three other methods which are also based on ELM are implemented. One family of those includes Kernel ELM [18] [19] [20] , which is like SVM. The second family is Constrained Extreme Learning Machine (CELM) [21] , whose weights between input layer and hidden layer are not chosen randomly. The last family is V-ELM [22] , voting with more than one ELMs.

The remainder of this paper is organized as follows. In section 2, the whole system for classification is presented, including feature extraction and classification. In section 3, the dataset is demonstrated, results discussion on parameter selection of feature extraction and ML-ELM and evaluation of the system are described in details. The conclusion is given in section 4.

2

0 ethodology

In this section, the whole system is introduced in details. Firstly, feature vectors of original EEG signals are extracted by the method based on the combination of PCA and LDA. Then, the feature vectors are sent to the classification part based on ML-ELM. The result of classification is regarded as the criteria to evaluate the system.

2.1 Feature Extraction

As the dimension of EEG data is high, we firstly reduce the dimension of the data with the method of PCA in order to be more suitable for discriminative tasks. PCA is the linear projection from an original s-dimensional space to a w-dimensional space (s > w) relying on maximization of the total scatter matrix of projected samples [23] . However, as PCA is unsupervised learning, it does not take into account any difference in class. Then, LDA is used to search the best projection direction which makes the set of projective feature vectors of the training samples has the maximum between-class scatter and minimum within-class scatter simultaneously and reduce dimensions. For a binary-class problem, all the dimension is reduced to 1[24] .

The method of feature extraction is the combination of PCA and LDA. The procedure to select features is as follow:

1). Perform PCA on Channel A1 and A2 of the data set respectively to obtain compact PCA features and determine the dimension reduced according to Accuracy Contribution Rate (ACR) of new space’s

(3)

first few dimensions. Experiments show when the dimension is close to 16, ACR is more than 99%. Therefore, the dimension of the dataset is reduced to 16.

2). Perform LDA on the 16-dimensioned features. As the dataset is binary class, the 16-dimensioned features are converted to 1-dimension.

3). Concatenate LDA features of Channel A1 and A2 to form a new feature of 18 dimensions as the input of ML-ELM.

2.2 Classification Based on ML-ELM

2.2.1 Review of ELM

ELM as emergent technology which overcomes some challenges such as slow learning speed, trivial human intervene faced by other techniques recently attracted the attention from more and more researchers [25] . Compared with other techniques, ELM provides better generalization performance at a much faster learning speed and can obtain the global optimal solution [26] .

ELM is originally developed for the Single-hidden layer feedforward networks (SLFNs) and then extended to the ‘‘generalized’’ SLFNs. Structure of the SLFNs is shown as Figure 1.

For ELM, the input, output and procedure are listed as followed:

Input: the training sampleܰ ൌ ൛൫ݔ௝ǡ ݕ௝൯ൟ_௝ିଵ ே

א ܴௗ_{ൈ ܴ}௠_{, where}_ݔ

௝ is the input vector and ݕ௝ is the

class label, hidden node number L and the activation function ܩ൫ܽ௝ǡ ܾ௝ǡ ݔ൯ where ܽ௝ is the associated

connection weight vector and ܾ_௝ is the bias.

Output: the optimal weight ߚ̱ of the network. Procedure:

Step 1: Generate the parameters ൛൫ܽ௝ǡ ܾ௝൯ȁ݆ ൌ ͳǡ ǥ ǡ ܮൟ of the hidden nodes randomly.

Step 2: Compute output matrix H of hidden layer:

ܪ ൌ ൥݄ሺݔڭଵሻ ݄ሺݔேሻ

൩ ൌ ൥ܩሺܽଵǡ ܾڭଵǡ ݔଵሻ ڮ ܩሺܽڰ ௅ǡ ܾڭ௅ǡ ݔଵሻ ܩሺܽଵǡ ܾଵǡ ݔேሻ ڮ ܩሺܽ௅ǡ ܾ௅ǡ ݔேሻ

൩

, where ܩ൫ܽ_௜ǡ ܾ_௜ǡ ݔ_௝൯ is the activation function of ݅௧௛ hidden node for ݆௧௛ sample.

Step 3: Output the optimal weight ߚ̱ of the network. At first, calculate the hidden layer’s output connection weight ߚ by solving the least squares problemߚ ൌ ܪறܻ, where ܪற is the Moore-Penrose

(4)

Inverse matrix of H, ܻ ൌ ൥ ݕଵ் ڭ ݕே் ൩ ேൈ௠

. Then, compute the optimal weight ߚ̱ fromߚ, where ߚ̱ is the least-squares solution ofߚ.

2.2.2 Multilayer Extreme Learning Machine

Multilayer ELM is a multiplayer neural network based on ELM auto-encoder. [12] mentioned Multilayer neural networks perform poorly when trained with back propagation (BP) only, so the hidden layer weights in a deep network are initialized by using layer-wise unsupervised training and the whole neural network is fine-tuned with BP. Similar to deep networks, ML-ELM hidden layer weights are initialized with ELM-AE, which performs layer-wise unsupervised training. However, in contrast to deep networks, ML-ELM doesn’t require fine tuning.

ML-ELM hidden layer activation functions can be either linear or nonlinear piecewise. If the number of nodes Lk in the kth_{hidden layer is equal to the number of nodes}_Lk

−1in the (k −1)th hidden layer, g is

chosen as linear; otherwise, g is chosen as nonlinear piecewise, such as a sigmoidal function:

ܪ௞ _{ൌ ݃ሺሺߚ}௞_ሻ்_ܪ௞ିଵ_ሻ₍₁₎

where ܪ௞is the kth_{hidden layer output matrix. The input layer}_x_{can be considered as the}₀th_{hidden layer,}

where k =0. The output of the connections between the last hidden layer and the output node t is analytically calculated using regularized least squares.

The structure of Multilayer ELM can be seen in Figure 2.

In Figure 3: (a) ELM-AE output weights β1_{with respect to input data}_x_{are the first-layer weights of}

ML-ELM. (b) The output weights βi+1_{of ELM-AE, with respect to}_ith_{hidden layer output}_hi_of

ML-ELM are the (i +1)th_{layer weights of ML-ELM. (c) The ML-ELM output layer weights are calculated}

using regularized least squares.

(5)

Experiments

In this section, we first introduce the EEG data set for motor imagery. And then discuss the results of the whole system, including parameter selection and accuracy comparison with other methods.

3.1 Data Set Description

The data used in this paper comes from the BCI competition 2003 data set Ia, which is a batch of high quality of the data set provided by University of Tübingen, Institute of Medical Psychology and Behavioral Neurobiology, Niels Birbaumer. The dataset was taken from a healthy subject. The subject was asked to move a cursor up and down on a computer screen, while his cortical potentials were taken. During the recording, the subject received visual feedback of his slow cortical potentials (Cz-Mastoids). Cortical positivity leads to a downward movement of the cursor on the screen. Cortical negativity leads to an upward movement of the cursor. All the trails are composed of training set (268 trials, 135 for class 0, 133 for class 1) and testing set (293 trials, 147 for class 0, 146 for class 1). Each trial lasted 6s. During every trial, the task was visually presented by a highlighted goal at either the top or bottom of the screen to indicate negativity or positivity from second 0.5 until the end of the trial. The visual feedback was presented from second 2 to second 5.5. Only this 3.5 second interval of every trial was provided for training and testing. The sampling rate of 256 Hz and the recording length of 3.5s resulted in 896 samples per channel for every trial.

Six EEG electrodes were located according to the International 10–20 system as shown in Figure 3 [21] and referenced to the vertex electrode Cz as follows: Channel 1: A1 (left mastoid); Channel 2: A2 (right mastoid); Channel 3: F3 (2 cm frontal of C3); Channel 4: P3 (2 cm parietal of C3); Channel 5: C4 (2 cm frontal of C4); and Channel 6: P4 (2 cm parietal of C4).

The dataset is processed as follows:

Firstly, only signals of channel 1 and channel 2 (A1 and A2) were left to yield discriminative feature sets based on our previous research [27] . Then partition the continuous recording samples to sub-epochs of 500ms with 125ms overlapped. Therefore, the raw data is split into 9 segments for each channel, and each segment has 128 dimensions data. Thereby, there are 18 segments in the end.

3.2 Results Discussion

The discussion includes parameter selection of feature extraction and ML-ELM and classification accuracy comparison with other methods.

3.2.1 Parameter Selection of Feature Extraction

The combination of PCA and LDA is chosen to extract features for motor imagery EEG signals. First, PCA is used for dimension reduction. As each sample is partitioned to 9 segments, we set 16 dimensions, which have an ACR of 99% for all the segments. Then, LDA makes the 16-dimensioned features to 1-dimension. For the all 18 segments of channel A1 and A2, we combine the final features to be the input of the ML-ELM.

(6)

3.2.2 Parameter Selection of ML-ELM

For ML-ELM, the number of layers is an important parameters. From many experiments, finally we choose the ML-ELM with three hidden layers as it can get better performance with less parameters. The random weights between the input layer and hidden layer have effect on the performance, so it may not be the best choice with much more layers, seen from Figure 4. Sigmoid is be set as the activation function. The number of hidden nodes is set to be 20 and 5000.

As the structure is set, the parameters of the networks are in discuss. Here, each hidden layer has one parameter C for the regularized least mean square calculation, so that there are three parameters to tune with range from -5 to 10. We let every selection iterate 20 times. In about 81920 times cycles, finally we choose 20 optimal values. All the accuracies are more than 92.5%, and the mean of them is 92.95%. Under the set of (7, 7, -3), the result can reach 94.20%. The Figure 5 shows how the accuracy changing with C1, C2 and C3.

3.2.3 Evaluation of the Classification System Based on ML-ELM

In order to demonstrate the effectiveness of the classification system, firstly, we compare ML-ELM with the other classifier based on ELM, with the same features. And some state-of–the-art methods on the same dataset are also collected as contrast.

Figure 4˖˖Selection for the number of ML-ELM hidden layers

(7)

We compare our method with avg-ELM, Kernel-ELM, CELM and V-ELM. Avg-ELM is the average of 50 ELMs, Kernel-ELM is similar to SVM, CELM is ELM with constraint between the layer of input and hidden, and the V-ELM is 50 ELMs to vote. As shown in Table 1, the accuracy of ML-ELM is 94.20, 5.31% higher than avg-ELM, 2.05% higher than Kernel-ELM, 1.42% higher than CELM and 0.68% higher than V-ELM. Although the V-ELM is several ELMs, the ML-ELM is true deep networks. So we can get better performance than V-ELM.

:e also compare our method with some state-of–the-art methods on the same dataset. The comparison is showed in Table 2. From Table 2, we can see the method we proposed can be found to have the best performance with the increase of at least 2.05%. The best performance demonstrates that the classification system is suitable for motor imagery EEG data. Also, the best generalization performance can be verified.

4 Conclusion

This paper applied a classification system based on ML-ELM to BCI EEG signal for identifying motor imagery. The method improved classification performance of EEG signal in contrast to the other methods. The main conclusions are summarized as follows:

1. According to the rule of the BCI competition that the performance measure encompasses the classification accuracy, our method achieved promising results compared over other classification methods. In the case of dataset, the proposed method is suitable for binary-class signals of motor imagery.

2. The generalization performance of our method is with least human intervene. Just like ELM, the parameters of the number of hidden nodes and activation function can be set directly as mentioned in [26] .

Table 1: Comparison of other methods based on ELM

Feature Extraction avg-ELM Kernel-ELM CELM V-ELM ML-ELM

PCA+LDA 88.89 92.15 92.78 93.52 94.20

Table 2: Comparison of related various methods

Method Number of

electrode Feature extraction

Accuracy (%)

Linear (2004) [28] 4 gamma band power

combined with SCP 88.70

Bayes (2005)[29] 6 combing SCP with the

spectral centroid 90.44 Neural Network (2005)[30] 2 SCP and beta band

specific energy 91.47

Neural Network (2008) [31] 6 Wavelet package 90.80

k-NN (2010)[1] 1 Coefficients of the second

order polynomial 92.15

(8)

In this paper, the method is just applied to the binary-class EEG data. However, we will improve the method for multi-classification of EEG. What’s more, the EEG data is offline processing. With the high learning speed of ELM, the application of online processing can be considered in the future.

Acknowledgement

This research is partially sponsored by Natural Science Foundation of China (Nos. 61175115, 61370113 and 61272320), Beijing Municipal Natural Science Foundation (4152005, 4152006 and 4162058), the Importation and Development of High-Caliber Talents Project of Beijing Municipal Institutions (CIT&TCD201304035), Jing-Hua Talents Project of Beijing University of Technology (2014-JH-L06), and Ri-Xin Talents Project of Beijing University of Technology (2014-RX-L06), the Research Fund of Beijing Municipal Commission of Education (PXM2015_014204_500221) and the International Communication Ability Development Plan for Young Teachers of Beijing University of Technology (No. 2014-16).

References

[1] Kayikcioglu T, Aydemir O. A polynomial fitting and k-NN based approach for improving classification of motor imagery BCI data[J]. Pattern Recognition Letters, 2010, 31(11):1207-1215. [2] Vaughan T M, Wolpaw J R. The Third International Meeting on Brain-Computer Interface Technology: making a difference[J]. IEEE Transactions on Neural Systems & Rehabilitation Engineering, 2006, 14(2):126-7.

[3] Wolpaw J R, Birbaumer N, Heetderks W J, et al. Brain-computer interface technology: a review of the first international meeting.[J]. IEEE Transactions on Rehabilitation Engineering A Publication of the IEEE Engineering in Medicine & Biology Society, 2000, 8(2):164-173.

[4] Vaughan T M, Wolpaw J R, Donchin E. EEG-based communication: prospects and problems.[J]. IEEE Transactions on Rehabilitation Engineering A Publication of the IEEE Engineering in Medicine & Biology Society, 1997, 4(4):425-30.

[5] Kaper M, Meinicke P, Grossekathoefer U, et al. BCI competition 2003-data set IIb: support vector machines for the P300 speller paradigm. IEEE Trans Biomed Eng[J]. IEEE Transactions on Biomedical Engineering, 2004, 51(6):1073-1076.

[6] Pfurtscheller G, Neuper C, Flotzinger D, Pregenzer M. EEGbased discrimination between imagination of right and left handmovement. Electroencephalogr Clin Neurophysiol. 1997;103(6):642–51.

[7] Chiappa, S, Bengio S. HMM and IOHMM modeling of EEGrhythms for asynchronous BCI systems. In: ESANN. 2004.p. 193–204.

[8] Gao C F, Bao Liang L V, College W M, et al. A New Method of Extracting Vigilant Feature from Electrooculography Using Wavelet Packet Transform[J]. Chinese Journal of Biomedical Engineering, 2012.

[9] Camps-Valls G, Bruzzone L. Kernel-based methods for hyperspectral image classification. IEEE Trans Geosci Remote Sens.2005;43(6):1351–62.

[10] Subasi A, Erc¸elebi E. Classification of EEG signals using neuralnetwork and logistic regression. Comput Methods ProgramsBiomed. 2005;78(2):87–99.

[11] LeVan P, Urrestarazu E, Gotman J. A system for automaticartifact removal in ictal scalp EEG based on independent component analysis and Bayesian classification. Clin Neurophysiol. 2006;117(4):912–27.

[12] Kasun L L C, Zhou H, Huang G B, et al. Representational Learning with ELMs for Big Data[J]. IEEE Intelligent Systems, 2013.

(9)

[13] Huang G B, Zhu Q Y, Siew C K. Extreme learning machine: a new learning scheme of feedforward neural networks[J]. Proc.int.joint Conf.neural Netw, 2004, 2:985-990 vol.2..

[14] Huang G B, Wang D. Advances in Extreme Learning Machines (ELM2011)[J]. Neurocomputing, 2011, 74(16):2411-2412.

[15] Zong W, Huang G B, Chen Y. Weighted extreme learning machine for imbalance learning[J]. Neurocomputing, 2013, 101(3):229-242.

[16] Lendasse A, He Q, Miche Y, et al. Editorial: Advances in extreme learning machines (ELM2012)[J]. Neurocomputing, 2014, 128(5):1-3.

[17] Deng C W, Huang G B, Xu J, et al. Extreme learning machines: new trends and applications[J]. Sciece China Information Sciences, 2015, 58.

[18] Huang G B. “An insight into extreme learning machines: random neurons, random features and kernels,” Cognitive Computation, 2014 : 1-15.

[19] Huang G B, Siew C K. Extreme Learning Machine with Randomly Assigned RBF Kernels [J]. International Journal of Information Technology. 2005, 11(1):16-24.

[20] Huang G B, Zhou H, Ding X, et al. Extreme learning machine for regression and multiclass classification.[J]. IEEE Transactions on Systems Man & Cybernetics Part B Cybernetics A Publication of the IEEE Systems Man & Cybernetics Society, 2012, 42(42):513-29.

[21] Zhu W, Miao J, Qing L. Constrained Extreme Learning Machine: A novel highly discriminative random feedforward neural network[C]// International Joint Conference on Neural Networks (IJCNN), 2014. 2014:800-807.

[22] Duan L, Zhong H, Miao J, et al. A Voting Optimized Strategy Based on ELM for Improving Classification of Motor Imagery BCI Data[J]. Cognitive Computation, 2014, 6(3):477-483. [23] Jolliffe I T. “Principal component analysis,” New York: Springer-Verlag, 1986.

[24] Panahi N, Shayesteh M G, Mihandoost S, et al. “Recognition of different datasets using PCA, LDA, and various classifiers” Application of Information and Communication Technologies (AICT), 2011 5th International Conference on. IEEE, 2011: 1-5.

[25] Huang G B, Wang D H, Lan Y. Extreme learning machines: a survey[J]. International Journal of Machine Learning and Cybernetics, 2011, 2(2): 107-122.

[26] Huang G B, Zhu Q Y, Siew C K. “Extreme learning machine: theory and applications,” Neurocomputing, 70(1): 489-501, 2006.

[27] Duan L J, Zhang Q, Yang Z .et al. “Research on HeuristicFeature and Classification of EEG Signal Based on BCI Data Set,” ResearchJournal of Applied Sciences, Engineering and Technology, 2013, 5(3):1008-1014.

[28] Mensh B D, Werfel J, Seung H S. BCI Competition 2003—Data Set Ia: Combining Gamma-Band Power With Slow Cortical Potentials to Improve Single-Trial Classification of Electroencephalographic Signals[J]. IEEE Transactions on Biomedical Engineering, 2004, 51(6):1052-6.

[29] Sun S, Zhang C. Assessing features for electroencephalographic signal categorization[C]// Acoustics, Speech, and Signal Processing, 2005. Proceedings. (ICASSP '05). IEEE International Conference on. IEEE, 2005:v/417-v/420 Vol. 5.

[30] Wang B, Jun L, Bai J, et al. EEG recognition based on multiple types of information by using wavelet packet transform and neural networks.[C]// International Conference of the Engineering in Medicine & Biology Society. Conf Proc IEEE Eng Med Biol Soc, 2005:5377 - 5380.

[31] Wu T, Yan G Z, Yang B H, et al. EEG feature extraction based on wavelet packet decomposition for brain computer interface[J]. Measurement, 2008, 41(6):618–625.