Classification - Processing brain signals

2.4 Processing brain signals

2.4.3 Classification

Classification, or pattern recognition, is the practice of discriminating between different groups of data. In our case, we use classification to translate the feature vector, which carries the characteristics of the EEG, to an associated group of data. There are three main approaches to modeling:

Fixed models are used when the exact input-output relation is known, for instance a known threshold value.

Parametric models use training data or a priori information to tune the parameters of the model to fit the data.

Non-parametric models, or clustering techniques, are suited when the relationship between data and classes is not well understood.

A BCI typically uses a parametric model, or a mix of parametric and non-parametric models, and relies on user-specific training to tune the classifier.

General linear classifiers

Consider a two-class case with classes $\omega_1$ and $\omega_2$ and a feature vector $z = [z_1, z_2, \ldots, z_l]^T$. A linear classifier maps $z$ to either $\omega_1$ or $\omega_2$ by a linear discriminant function $g : \mathbb{R}^l \to \mathbb{R}$ defined as

$$g(z) = w^T \cdot z + w_0 \qquad (2.7)$$

where $w = [w_1, w_2, \ldots, w_l]^T$ is a weight vector and $w_0$ is a threshold. The decision rule becomes:

Choose $\omega_1$ if $g(z) > 0$, else $\omega_2$.
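The two-class decision rule can be illustrated with a minimal sketch. The weights below are hypothetical values chosen for illustration (they place the decision line at $z_1 + z_2 = 1$ for $l = 2$) and are not taken from any trained classifier.

```python
def g(z, w, w0):
    """Linear discriminant g(z) = w^T z + w0."""
    return sum(wi * zi for wi, zi in zip(w, z)) + w0


def classify(z, w, w0):
    """Return 1 for class omega_1 if g(z) > 0, otherwise 2 for omega_2."""
    return 1 if g(z, w, w0) > 0 else 2


# Hypothetical weight vector and threshold (assumed for illustration only).
w = [1.0, 1.0]
w0 = -1.0

print(classify([2.0, 0.5], w, w0))  # g = 1.5, so omega_1
print(classify([0.2, 0.3], w, w0))  # g = -0.5, so omega_2
```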

The left part of Figure 2.8 shows the decision line in the case where $l = 2$. In a multi-class case with $c$ classes, we need one discriminant function for each class:

$$g_i(z) = w_i^T \cdot z + w_{i0} \qquad (2.8)$$

where $i = 1, 2, \ldots, c$, and we get the decision rule:

Choose $\omega_m$ if $g_m(z) = \max_i \{ g_i(z) \}$.
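A sketch of the multi-class rule, under the assumption of three classes with hand-picked weights (the values of `W` and `w0` are hypothetical), evaluates one discriminant per class as in (2.8) and picks the largest:

```python
def discriminants(z, W, w0):
    """Evaluate g_i(z) = w_i^T z + w_i0 for every class i."""
    return [sum(wi * zi for wi, zi in zip(w_i, z)) + b for w_i, b in zip(W, w0)]


def classify_multi(z, W, w0):
    """Return the 1-based index m of the class with the largest g_m(z)."""
    scores = discriminants(z, W, w0)
    return scores.index(max(scores)) + 1


# Hypothetical weights for c = 3 classes and l = 2 features.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
w0 = [0.0, 0.0, 0.0]

print(classify_multi([2.0, 1.0], W, w0))    # scores [2, 1, -3]
print(classify_multi([0.5, 3.0], W, w0))    # scores [0.5, 3, -3.5]
print(classify_multi([-1.0, -2.0], W, w0))  # scores [-1, -2, 3]
```

Ties are resolved in favour of the lowest class index here, which is a choice of this sketch rather than part of the rule.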

Furthermore, there are broadly two types of methods for determining the weight vector $w$ of the linear discriminant.

Probabilistic linear models

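Probabilistic models estimate the conditional density functions $P(z|\omega_i)$ with known probability distributions. The naive Bayes classifier is based on Bayes' rule

$$P(\omega_i|z) = \frac{P(z|\omega_i) P(\omega_i)}{P(z)}. \qquad (2.9)$$

Multiplying both sides by $P(z)$, summing over all $c$ classes and solving for $P(z)$, we get

$$P(z) = \frac{\sum_{i=1}^{c} P(z|\omega_i) P(\omega_i)}{\sum_{i=1}^{c} P(\omega_i|z)}. \qquad (2.10)$$

Since $\sum_{i=1}^{c} P(\omega_i|z) = 1$, this reduces to

$$P(z) = \sum_{i=1}^{c} P(z|\omega_i) P(\omega_i) \qquad (2.11)$$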

and so the probability density function of the feature vector $z$ can be estimated according to assumptions on the distribution for each class. For discrete features, the multinomial and Bernoulli distributions are popular, while a Gaussian distribution is a fair assumption for continuous data like EEG.
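As a minimal sketch of how (2.9) and (2.11) combine, the following computes class posteriors under a Gaussian assumption with independent features. The class means, variances and priors are hypothetical values, not estimates from EEG data.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def likelihood(z, means, variances):
    """P(z | omega_i) under the naive assumption of independent features."""
    p = 1.0
    for x, m, v in zip(z, means, variances):
        p *= gaussian_pdf(x, m, v)
    return p


def posteriors(z, params, priors):
    """Return [P(omega_i | z)] via Bayes' rule (2.9), with P(z) from (2.11)."""
    joint = [likelihood(z, m, v) * p for (m, v), p in zip(params, priors)]
    evidence = sum(joint)  # P(z), equation (2.11)
    return [j / evidence for j in joint]


# Hypothetical per-class (means, variances) and priors for two classes.
params = [([0.0, 0.0], [1.0, 1.0]), ([2.0, 2.0], [1.0, 1.0])]
priors = [0.5, 0.5]

post = posteriors([0.2, -0.1], params, priors)
print(post)  # the first class is the more probable one for this point
```

A point that lies exactly halfway between the two class means, such as `[1.0, 1.0]`, receives equal posteriors because the priors and variances are equal.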

Linear discriminant analysis (LDA) is another probabilistic model that is very popular in BCIs. LDA looks for the best linear combination of variables by assuming that $P(z|\omega_i)$ is normally distributed for all classes $i$ with a shared covariance matrix ($\Sigma_i = \Sigma$), and estimates the mean parameters $\mu_i$ and the covariance $\Sigma$ from the training data. The resulting decision boundary is a linear hyperplane. In the more general case where each class has its own covariance matrix $\Sigma_i$, the decision boundary becomes quadratic and the method is called quadratic discriminant analysis (QDA).
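The sketch below illustrates LDA for $l = 2$ under two stated assumptions: all classes share one pooled covariance matrix (which is what makes the boundary linear) and the class priors are equal. With those assumptions the discriminant takes the form of (2.8) with $w_i = \Sigma^{-1} \mu_i$ and $w_{i0} = -\frac{1}{2} \mu_i^T \Sigma^{-1} \mu_i$. The training points are made up for illustration.

```python
def mean(points):
    """Mean vector of a list of 2-D points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(2)]


def pooled_covariance(groups, means):
    """Shared 2x2 covariance matrix estimated from all classes."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    n = 0
    for pts, m in zip(groups, means):
        for p in pts:
            d = [p[0] - m[0], p[1] - m[1]]
            for a in range(2):
                for b in range(2):
                    s[a][b] += d[a] * d[b]
        n += len(pts)
    dof = n - len(groups)
    return [[s[a][b] / dof for b in range(2)] for a in range(2)]


def inverse_2x2(m):
    """Inverse of a 2x2 matrix (assumes a nonzero determinant)."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]


def lda_fit(groups):
    """Return one (w_i, w_i0) pair per class, assuming equal priors."""
    means = [mean(g) for g in groups]
    inv = inverse_2x2(pooled_covariance(groups, means))
    params = []
    for m in means:
        w = [inv[0][0] * m[0] + inv[0][1] * m[1],
             inv[1][0] * m[0] + inv[1][1] * m[1]]
        w0 = -0.5 * (w[0] * m[0] + w[1] * m[1])
        params.append((w, w0))
    return params


def lda_predict(z, params):
    """Return the 1-based index of the class with the largest discriminant."""
    scores = [w[0] * z[0] + w[1] * z[1] + w0 for w, w0 in params]
    return scores.index(max(scores)) + 1


# Made-up training data for two classes.
class_1 = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
class_2 = [[3.0, 3.0], [4.0, 3.0], [3.0, 4.0], [4.0, 4.0]]
params = lda_fit([class_1, class_2])

print(lda_predict([1.0, 1.0], params))
print(lda_predict([3.0, 2.5], params))
```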

Discriminative linear models

Discriminative models estimate the vector $w$ by optimizing a given criterion on the training data. Linear regression is an example of this; in its least-squares form, it attempts to minimize the total squared distance from the regression line to each point.
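As a small illustration, least-squares regression can be used as a two-class discriminant by regressing the labels on the features and thresholding at zero. The sketch assumes a single feature ($l = 1$), labels coded as $+1$ for $\omega_1$ and $-1$ for $\omega_2$, and made-up training values.

```python
def fit_least_squares(zs, ys):
    """Closed-form least-squares fit of y = w * z + w0 for one feature."""
    n = len(zs)
    z_mean = sum(zs) / n
    y_mean = sum(ys) / n
    cov = sum((z - z_mean) * (y - y_mean) for z, y in zip(zs, ys))
    var = sum((z - z_mean) ** 2 for z in zs)
    w = cov / var
    w0 = y_mean - w * z_mean
    return w, w0


def predict(z, w, w0):
    """Return 1 (omega_1) if g(z) = w * z + w0 > 0, otherwise 2 (omega_2)."""
    return 1 if w * z + w0 > 0 else 2


# Made-up one-dimensional training data with labels +1 / -1.
zs = [0.0, 1.0, 2.0, 5.0, 6.0, 7.0]
ys = [-1, -1, -1, 1, 1, 1]
w, w0 = fit_least_squares(zs, ys)

print(predict(6.0, w, w0))  # lies on the omega_1 side
print(predict(1.0, w, w0))  # lies on the omega_2 side
```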

Figure 2.9: Principles of the SVM.

The support vector machine (SVM) compares only two classes at a time. Given training data with each point $z_i$ belonging to either $\omega_i = 1$ or $\omega_i = -1$, the SVM formulates the training procedure as an optimization problem:

Minimize $\|w\|^2$,

such that for any $i = 1, \ldots, n$

$$\omega_i (w^T \cdot z_i - w_0) \geq 1. \qquad (2.12)$$

Figure 2.9 shows the resulting maximum-margin hyperplane, given by $w^T \cdot z - w_0 = 0$, which separates the blue and red dots, and the margin hyperplanes given by $w^T \cdot z - w_0 = \pm 1$. The green samples on the margin are called support vectors. SVM-based classifiers are very popular in BCI applications, and have performed well with high accuracy rates (Cinar and Sahin (2010);
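The sketch below does not solve the optimization problem. It only checks a candidate hyperplane against the constraints in (2.12), lists the points that lie on the margin hyperplanes, and computes the margin width $2/\|w\|$. The data points and the candidate `w`, `w0` are assumptions made for illustration.

```python
import math


def decision_value(z, w, w0):
    """g(z) = w^T z - w0, using the sign convention of (2.12)."""
    return sum(wi * zi for wi, zi in zip(w, z)) - w0


def satisfies_constraints(data, w, w0, tol=1e-9):
    """True if every labelled point fulfils label * g(z) >= 1."""
    return all(label * decision_value(z, w, w0) >= 1 - tol for z, label in data)


def support_vectors(data, w, w0, tol=1e-9):
    """Points lying on the margin hyperplanes, where label * g(z) = 1."""
    return [z for z, label in data
            if abs(label * decision_value(z, w, w0) - 1) <= tol]


def margin_width(w):
    """Distance between the two margin hyperplanes, 2 / ||w||."""
    return 2 / math.sqrt(sum(wi * wi for wi in w))


# Made-up training points with labels +1 / -1 and a candidate hyperplane.
data = [([2.0, 2.0], 1), ([3.0, 3.0], 1), ([0.0, 0.0], -1), ([-1.0, 0.0], -1)]
w, w0 = [0.5, 0.5], 1.0

print(satisfies_constraints(data, w, w0))
print(support_vectors(data, w, w0))
print(margin_width(w))
```

Minimizing $\|w\|^2$ is equivalent to maximizing this width, which is why the solution is called the maximum-margin hyperplane.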

Nonlinear parametric models

In nonlinear parametric classifiers, the discriminant function is not linear, as shown in Figure 2.8. These often build upon expanding linear cases. For instance, the perceptron, which is a discriminative linear classifier similar to SVMs, acts as an artificial neuron in artificial neural networks (ANN), and may also be expanded to the nonlinear multilayer perceptron. SVMs can also be generalized to a nonlinear model.
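To illustrate how stacking perceptrons gives a nonlinear discriminant, the following sketch uses a two-layer network with hand-picked weights (an assumption of this example, not the result of training) to compute XOR, a labelling that no single linear discriminant can reproduce.

```python
def step(x):
    """Threshold activation: 1 if x > 0, otherwise 0."""
    return 1 if x > 0 else 0


def perceptron(z, w, w0):
    """Single perceptron: a linear discriminant followed by a threshold."""
    return step(sum(wi * zi for wi, zi in zip(w, z)) + w0)


def two_layer_xor(z):
    """Two hidden perceptrons feeding one output perceptron."""
    h_or = perceptron(z, [1.0, 1.0], -0.5)    # fires if at least one input is 1
    h_and = perceptron(z, [1.0, 1.0], -1.5)   # fires only if both inputs are 1
    return perceptron([h_or, h_and], [1.0, -1.0], -0.5)


for z in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(z, two_layer_xor(z))
```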

Non-parametric models

Commonly referred to as clustering algorithms, these types of models find hidden structures in unlabeled data. The most popular is the k-means method, which aims to partition $n$ observations into $k$ classes or clusters. Given training data $z_1, \ldots, z_n$ where each $z_j \in \mathbb{R}^l$, cluster the data into $k$ sets $S = \{S_1, S_2, \ldots, S_k\}$. Then minimize the cumulative squared $l_2$-norm with respect to $S$:

$$\min_S \sum_{i=1}^{k} \sum_{z_j \in S_i} \|z_j - m_i\|^2 \qquad (2.13)$$

where $m_i$ is the mean of all points in $S_i$.
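The objective in (2.13) is commonly attacked with Lloyd's algorithm, which alternates between assigning points to the nearest mean and recomputing the means. The sketch below assumes this heuristic, made-up data and user-supplied initial centers; it finds a local minimum, not necessarily the global one.

```python
def squared_distance(a, b):
    """Squared l2-norm of a - b."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Mean of a non-empty list of points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def k_means(data, centers, max_iter=100):
    """Lloyd's algorithm: returns (clusters, centers) after convergence."""
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for z in data:
            nearest = min(range(len(centers)),
                          key=lambda i: squared_distance(z, centers[i]))
            clusters[nearest].append(z)
        # Update step: each center becomes the mean of its cluster.
        new_centers = [centroid(c) if c else m for c, m in zip(clusters, centers)]
        if new_centers == centers:
            break
        centers = new_centers
    return clusters, centers


def objective(clusters, centers):
    """Value of the sum in (2.13) for a given partition."""
    return sum(squared_distance(z, m)
               for c, m in zip(clusters, centers) for z in c)


# Made-up data with two obvious groups, and assumed initial centers.
data = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
clusters, centers = k_means(data, [[0.0, 0.0], [5.0, 5.0]])

print(clusters)
print(objective(clusters, centers))
```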

Sequential models

In many real-world cases, the data can be expressed as a function of time and contains temporal dependencies. Sequence models do not only learn a mapping between features and labels, but also model the transitions between features. These models may rely on the presentation of correct input-output pairs (parametric methods) as well as finding hidden structure in the data (non-parametric methods). This is the case for one of the most popular sequence models, namely the hidden Markov model (HMM).

Suppose the features $z$ and classes $\omega$ now are represented as stochastic time series, such that $z = \{z_t\}$, $z_t \in Z$ and $\omega = \{\omega_t\}$, $\omega_t \in \Omega$. Then the HMM includes an initial distribution $P(\omega_1)$, a transition distribution $P(\omega_t|\omega_{t-1})$ and an observation distribution $P(z_t|\omega_t)$. Together, they define the joint probability distribution

$$P(\omega, z) = \prod_{t=1}^{T} P(z_t|\omega_t) P(\omega_t|\omega_{t-1}) \qquad (2.14)$$

where the initial state distribution $P(\omega_1)$ is written as $P(\omega_1|\omega_0)$. Sequence models will be more closely examined in chapter 3.
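A minimal sketch of (2.14) for discrete states and observations follows. The state names, observation symbols and all probabilities are hypothetical; the first factor of the product uses the initial distribution in place of a transition.

```python
def joint_probability(states, observations, initial, transition, emission):
    """P(omega, z) as the product in (2.14) for one state/observation sequence."""
    p = 1.0
    previous = None
    for state, obs in zip(states, observations):
        if previous is None:
            p *= initial[state]               # initial distribution at t = 1
        else:
            p *= transition[previous][state]  # P(omega_t | omega_{t-1})
        p *= emission[state][obs]             # P(z_t | omega_t)
        previous = state
    return p


# Hypothetical two-state model with two observation symbols.
initial = {"rest": 0.6, "move": 0.4}
transition = {"rest": {"rest": 0.7, "move": 0.3},
              "move": {"rest": 0.4, "move": 0.6}}
emission = {"rest": {"low": 0.9, "high": 0.1},
            "move": {"low": 0.2, "high": 0.8}}

p = joint_probability(["rest", "rest", "move"], ["low", "low", "high"],
                      initial, transition, emission)
print(p)  # product 0.6*0.9 * 0.7*0.9 * 0.3*0.8
```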

Reinforcement learning

Reinforcement learning models are a mix of parametric and non-parametric models. They differ from standard parametric models because they do not only rely on a priori information to tune the classifier, but also on a posteriori information to update the classifier during operation. That is, these models are presented frequently with so-called reward-based signals, which are used as a measure for the input-output relationship. Thus reinforcement learning also relies on finding structures in the data that are hidden during training, and optimizing (learning) these structures better over time. Reinforcement learning algorithms often rely on exploration techniques in order to learn and adapt to the task at hand.

These characteristics make them very general, and they may incorporate many aspects and methods from the previously presented classifiers. Some of them even incorporate feature extraction. Reinforcement learning techniques will also be revisited in chapter 3.
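As one simple illustration of reward-driven updating with exploration, the sketch below uses an epsilon-greedy strategy on a two-action problem. The choice of epsilon-greedy, the reward probabilities and the parameter values are all assumptions of this example and are not taken from the text.

```python
import random


def epsilon_greedy(reward_probs, steps=2000, epsilon=0.1, seed=0):
    """Estimate action values online from binary rewards.

    reward_probs[a] is the (hidden) probability that action a is rewarded.
    Returns the estimated values and how often each action was chosen.
    """
    rng = random.Random(seed)
    counts = [0] * len(reward_probs)
    values = [0.0] * len(reward_probs)
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(reward_probs))  # explore
        else:
            action = values.index(max(values))         # exploit best estimate
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        counts[action] += 1
        # Incremental mean: move the estimate towards the observed reward.
        values[action] += (reward - values[action]) / counts[action]
    return values, counts


values, counts = epsilon_greedy([0.2, 0.8])
print(values, counts)
```

The estimates are updated during operation from the reward signal alone, and the occasional random action is what allows the better action to be discovered.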

In document Brain-Computer Interface In Control Systems (Page 31-35)