

5.1.3 Machine Learning Algorithms

In the following, we detail several machine learning classification algorithms that are used by the contributions in this chapter. The description of the algorithms is ordered by implementation complexity.

5.1.3.1 k-Nearest Neighbors

One of the simplest machine learning recognizers is k-Nearest Neighbors (kNN). This algorithm is purely data-driven: a sample point $s$ is classified by comparing it to the $k$ closest labeled points in a training set and assigning $s$ the class held by the majority of these $k$ nearest neighbors. The usual metric for determining the proximity of $s$ to other data points is the Euclidean distance ($L_2$ norm):

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2} \tag{5.8}$$

5.1 Machine Learning Foundations 129

[Figure 5.1: two panels, (a) and (b). Axis labels: Sequence 1, Sequence 2, Time. Annotations in (a): optimal path; DTW distance at (i,j); optimal predecessor at smallest DTW distance.]

Figure 5.1: (a) Dynamic Time Warping calculates the optimal alignment between two temporally skewed sequences. The DTW update rule ensures that a point $(i,j)$ on the optimal DTW path (green) is always at the minimum distance from its predecessor. (b) The result of DTW is the cost-optimal alignment between two time sequences.

Although it is simple to implement, kNN has a number of disadvantages. Since for classification a new sample has to be compared with all other points in the data set, naïve kNN implementations do not scale well in terms of computational performance for large data sets. In the case of large data sets, k-d trees can be used to reduce the nearest neighbor search time to $O(\log n)$ (Marsland, 2009), where $n$ is the number of data points in the training set. A further weakness of kNN is that it is not very robust towards noise in the training data. Due to the simple majority-based classification scheme, even relatively few spurious data points will seriously degrade the classifier's accuracy.

Both the $3 Gesture Recognizer (Section 5.2.2.7) and Protractor3D (Section 5.3) use a kNN strategy for classification.
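The kNN scheme described above can be sketched in a few lines of Python. This is a minimal illustration of the naïve variant (every training point is compared to the sample), not the implementation used by the recognizers in this chapter. The names `euclidean`, `knn_classify` and the toy training set are assumptions made for this example.

```python
import math
from collections import Counter


def euclidean(p, q):
    """Euclidean (L2) distance between two equal-length points (Equation 5.8)."""
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))


def knn_classify(sample, training_set, k=3):
    """Return the majority label among the k training points closest to sample.

    training_set is a list of (point, label) pairs. All points are compared
    to the sample (naive search, no k-d tree). Ties between labels are
    resolved in favour of the label encountered first among the neighbors.
    """
    neighbors = sorted(training_set, key=lambda pl: euclidean(sample, pl[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters labeled "a" and "b".
training = [((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((0.2, 0.1), "a"),
            ((1.0, 1.0), "b"), ((0.9, 1.1), "b"), ((1.1, 0.9), "b")]
label = knn_classify((0.15, 0.15), training, k=3)  # "a"
```

Because the whole training set is scanned for every query, the cost of a single classification grows with the size of the training set, which is the scaling problem noted above.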

5.1.3.2 Dynamic Time Warping

When dealing with temporal data, such as gestures, it is important to use a technique that is robust towards temporal variations between input samples and training data. Dynamic Time Warping (DTW) is a technique that is used to calculate the minimal-cost alignment between two time sequences (Figure 5.1).

130 5 Motion Gestures

DTW works by finding the optimal path through an $N \times M$ cost matrix $D$, which is initialized to infinity. The entries $D_{i,j}$ of the cost matrix are generated using the following DTW update function:

$$D_{i,j} = \begin{cases} 0 & i = j = 0 \\ \min(D_{i-1,j-1},\, D_{i-1,j},\, D_{i,j-1}) + d_{i,j} & i > 0,\ j > 0 \\ \infty & \text{otherwise} \end{cases} \tag{5.9}$$

Usually, the Euclidean distance (Equation 5.8) is chosen to calculate the distance $d_{i,j}$. Starting at $D_{N,M}$ and repeatedly stepping to the minimal neighboring predecessor entry yields the optimal alignment path through the matrix. The DTW cost is the matrix entry $D_{N,M}$.
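The update rule and the backtracking step can be illustrated with a short Python sketch. It assumes one-dimensional sequences and uses the absolute difference as the local distance (which equals the Euclidean distance in one dimension). The function name `dtw` and the `dist` parameter are choices made for this example.

```python
import math


def dtw(seq1, seq2, dist=lambda a, b: abs(a - b)):
    """Return (cost, path) of the DTW alignment between two sequences.

    path is a list of (index into seq1, index into seq2) pairs.
    """
    n, m = len(seq1), len(seq2)
    # Extra row and column: D[0][0] = 0 is the only finite border entry,
    # all other entries start at infinity (Equation 5.9).
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq1[i - 1], seq2[j - 1])
            D[i][j] = min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1]) + d
    # Backtrack from the final entry along the minimal predecessors.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((D[i - 1][j - 1], i - 1, j - 1),
                      (D[i - 1][j], i - 1, j),
                      (D[i][j - 1], i, j - 1))
    path.reverse()
    return D[n][m], path


# The second sequence repeats its first sample, i.e. it is temporally skewed.
cost, path = dtw([0, 1, 2, 3], [0, 0, 1, 2, 3])
```

For this input the alignment cost is 0.0, because the repeated sample in the second sequence is absorbed by matching the first element of the first sequence twice.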

DTW has been used widely in previous work, both in gesture recognition for HCI and also in speech recognition (Sakoe and Chiba, 1978), the field from which the algorithm originated.

Because the number of entries in the DTW matrix rises quadratically in relation to the sequence lengths, DTW does not scale well to long sequences. There have been a number of approaches to compensate for this problem. One possible solution is to window (or "envelope") the search space. There are two very common envelopes called the Sakoe-Chiba Band and the Itakura Parallelogram (Sakoe and Chiba, 1978; Itakura, 1975). Another approach, called FastDTW (Salvador and Chan, 2004), is to approximate the full DTW calculations by subsampling the original DTW matrix.
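As a rough illustration of windowing, the sketch below restricts the DTW computation to cells with $|i - j| \le w$, which is the basic idea behind a Sakoe-Chiba band. The function name `dtw_banded` and the `window` parameter are assumptions for this example. For simplicity the sketch still allocates the full matrix and only skips the computation outside the band; a memory-efficient version would store the band alone.

```python
import math


def dtw_banded(seq1, seq2, window, dist=lambda a, b: abs(a - b)):
    """DTW cost restricted to matrix cells with |i - j| <= window."""
    n, m = len(seq1), len(seq2)
    # The band must be at least |n - m| wide, otherwise the final
    # cell (n, m) cannot be reached.
    w = max(window, abs(n - m))
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        # Only columns inside the band around the diagonal are visited.
        for j in range(max(1, i - w), min(m, i + w) + 1):
            d = dist(seq1[i - 1], seq2[j - 1])
            D[i][j] = min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1]) + d
    return D[n][m]


narrow = dtw_banded([0, 0, 1], [0, 1, 1], window=0)  # diagonal only
wide = dtw_banded([0, 0, 1], [0, 1, 1], window=2)    # full matrix
```

With `window=0` only the diagonal is evaluated, so no warping is possible and the cost is 1; with `window=2` the band covers the whole matrix and the cost drops to 0. A narrow band therefore trades alignment flexibility for fewer evaluated cells.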

Our work on gesture-based authentication in Section 5.4 uses DTW as one of the main machine learning algorithms. In Section 5.5 the performance of DTW is evaluated with combined accelerometer and gyroscope data.

5.1.3.3 Logistic Regression

Logistic regression is a simple but powerful supervised learning classifier that is very popular in the machine learning community. Logistic Regression, in essence, represents the operation of a single neuron of an Artificial Neural Network.

Hypothesis

The hypothesis of Logistic Regression is given by

$$h_\theta(x) = g(\theta^T x) \tag{5.10}$$

where $x$ is a feature vector and the model, $\theta$, is a parameter vector the length of which corresponds to the number of features. The outputs of $h_\theta(x)$ are constrained to $0 \le h_\theta(x) \le 1$, in order to classify two distinct classes, 0 and 1. This behavior is obtained by choosing the Sigmoid (also known as Logistic) function for $g$:

$$g(z) = \frac{1}{1 + e^{-z}} \tag{5.11}$$

[Figure 5.2: plot titled "The Sigmoid Function"; horizontal axis from −8 to 8, vertical axis from 0.0 to 1.0.]

Figure 5.2: Plot of the Sigmoid function. Values for $x > 0$ asymptote at 1, values for $x < 0$ asymptote at 0.

Figure 5.2 shows a plot of the Sigmoid function. The intuition behind using the Sigmoid function is that we would like the classifier to output the probability of class 1 given an input $x$ and the parameter vector $\theta$, i.e. $h_\theta(x) = P(y = 1 \mid x; \theta)$. Thus, we can predict 1 if $h_\theta(x) \ge 0.5$ and 0 if $h_\theta(x) < 0.5$, because $h_\theta(x) \ge 0.5$ whenever $\theta^T x \ge 0$ and $h_\theta(x) < 0.5$ whenever $\theta^T x < 0$.
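The hypothesis and the 0.5 threshold can be written down directly. The following is a small sketch in plain Python; the function names and the example parameter values are assumptions chosen for illustration. The sigmoid is evaluated in a numerically stable form so that large negative inputs do not overflow `math.exp`.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)) (Equation 5.11)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # Equivalent form for negative z that avoids overflow in exp(-z).
    ez = math.exp(z)
    return ez / (1.0 + ez)


def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x) (Equation 5.10)."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def predict(theta, x):
    """Predict class 1 if h_theta(x) >= 0.5, otherwise class 0."""
    return 1 if hypothesis(theta, x) >= 0.5 else 0


# Hypothetical parameters; the first component of x is the constant input 1.
theta = [-1.0, 2.0]
p = hypothesis(theta, [1.0, 1.0])   # theta^T x = 1.0, so p > 0.5
```

Since the sigmoid is monotonic and $g(0) = 0.5$, thresholding $h_\theta(x)$ at 0.5 is the same as checking the sign of $\theta^T x$.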

Training the Classifier

In order to train the model parameters $\theta$ using input data vectors $X$ and labels $y \in \{0, 1\}$, we need to minimize the following cost function:

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Cost}(h_\theta(x^{(i)}), y^{(i)}) \tag{5.12}$$

where

$$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases} \tag{5.13}$$

Using the logarithm guarantees that $J(\theta)$ remains convex, i.e. the global minimum is guaranteed to be found using gradient descent. Also, for $y = 1$, $-\log(h_\theta(x))$ captures the intuition that there should be a near infinite penalty when $h_\theta(x) \to 0$ and a near zero penalty when $h_\theta(x) \to 1$. The reverse is the case with $-\log(1 - h_\theta(x))$ for $y = 0$. The cost function can be further simplified as follows:

$$J(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^{m} y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right] \tag{5.14}$$

The minimum of the cost function can be found using gradient descent with the gradient of $J(\theta)$:

$$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})\, x_j^{(i)} \tag{5.15}$$

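The gradient descent update function can then be applied repeatedly, up to a given convergence criterion, to solve $\min_\theta J(\theta)$. The update is applied to all components $j$ of $\theta$:

$$\theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \tag{5.16}$$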

$\alpha$ is a parameter that controls the speed of gradient descent. Since $\alpha$ can be difficult to choose correctly, more advanced optimization algorithms, such as the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm¹ (Fletcher, 1987), perform calculations on the input data in order to determine useful values for $\alpha$.

¹For the application of regularized Logistic Regression to motion gesture recognition featured in Section 5.5, we used the BFGS (Fletcher, 1987) implementation from the scipy.optimize Python module to minimize the cost function. The specific function call we used was scipy.optimize.fmin_bfgs(...).
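To make the training loop concrete, the sketch below implements the cost function (Equation 5.14), its gradient (Equation 5.15) and the update rule (Equation 5.16) in plain Python. It deliberately uses fixed-step gradient descent rather than BFGS, so it is not the setup described in the footnote. The toy data set, the value of `alpha` and the iteration count are assumptions picked for this example.

```python
import math


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def h(theta, x):
    """Hypothesis h_theta(x) = g(theta^T x)."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def cost(theta, X, y):
    """Unregularized cost J(theta) from Equation 5.14."""
    eps = 1e-12  # keeps log() away from 0 for saturated predictions
    total = 0.0
    for xi, yi in zip(X, y):
        p = min(max(h(theta, xi), eps), 1.0 - eps)
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return -total / len(X)


def gradient(theta, X, y):
    """Gradient of J(theta) from Equation 5.15."""
    m = len(X)
    grad = [0.0] * len(theta)
    for xi, yi in zip(X, y):
        err = h(theta, xi) - yi
        for j, xij in enumerate(xi):
            grad[j] += err * xij / m
    return grad


def gradient_descent(X, y, alpha=0.5, iterations=2000):
    """Apply the update rule of Equation 5.16 a fixed number of times."""
    theta = [0.0] * len(X[0])
    for _ in range(iterations):
        grad = gradient(theta, X, y)
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta


# Hypothetical toy data: each x is (1, feature); the leading 1 is the bias input.
X = [(1.0, 0.5), (1.0, 1.0), (1.0, 1.5), (1.0, 3.0), (1.0, 3.5), (1.0, 4.0)]
y = [0, 0, 0, 1, 1, 1]
theta = gradient_descent(X, y)
predictions = [1 if h(theta, xi) >= 0.5 else 0 for xi in X]
```

A fixed iteration count stands in for a convergence criterion here; in practice one would stop once the change in $J(\theta)$ falls below a tolerance. A step size that is too large makes the iteration diverge, which is the difficulty in choosing $\alpha$ mentioned above.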

Predictions

To make a prediction given a new input vector $x$, we simply calculate

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}} \tag{5.17}$$

to obtain $p(y = 1 \mid x; \theta)$.

Multi-Class Classification

LR classifiers can be used for multi-class classification. Multi-class classification can be achieved using a one-vs-all approach. For this, we train one logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$. In order to make a prediction for a new input $x$, we pick the class $i$ that maximizes

$$\max_i h_\theta^{(i)}(x)$$
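The one-vs-all decision can be sketched as follows. The parameter vectors below are hand-picked, hypothetical values for three made-up gesture classes, not trained models; in practice each vector would come from training one binary classifier per class.

```python
import math


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_one_vs_all(thetas, x):
    """Return (best_label, scores) for an input vector x.

    thetas maps each class label i to the parameter vector of the
    classifier trained to separate class i from all other classes.
    """
    scores = {label: sigmoid(sum(t * xi for t, xi in zip(theta, x)))
              for label, theta in thetas.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical parameters; the first component of x is the constant input 1.
thetas = {
    "circle": [0.0, 2.0, -1.0],
    "swipe": [0.0, -1.0, 2.0],
    "shake": [-1.0, -1.0, -1.0],
}
label, scores = predict_one_vs_all(thetas, [1.0, 3.0, 0.5])  # "circle"
```

Note that the per-class outputs are independent probabilities and do not need to sum to 1; only their ranking is used for the decision.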

Decision Boundary and Regularization

The version of Logistic Regression used in this section only determines a linear decision boundary. To obtain more complex decision boundaries, additional, higher-order features can be generated. For instance, if we have the features $x_1$ and $x_2$, a more complex boundary can be calculated by generating higher-order features such as $x_1 x_2^2$, $x_1^2 x_2$, $x_1^2$, $x_2^2$. This will also result in a higher dimensionality for $\theta$.
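One common way to generate such features is to enumerate all monomials of the two inputs up to a chosen degree. The helper below is a hypothetical sketch of that idea; the name `polynomial_features` and the default degree of 3 are assumptions for this example.

```python
def polynomial_features(x1, x2, degree=3):
    """Return all monomials x1^a * x2^b with 0 <= a + b <= degree.

    The first entry (a = b = 0) is the constant 1 used as bias input.
    """
    features = []
    for total in range(degree + 1):
        for a in range(total, -1, -1):
            b = total - a
            features.append((x1 ** a) * (x2 ** b))
    return features


# Order: 1, x1, x2, x1^2, x1*x2, x2^2, x1^3, x1^2*x2, x1*x2^2, x2^3
expanded = polynomial_features(2, 3)
```

Two input features grow to ten with degree 3, so $\theta$ needs ten components as well. The number of monomials rises quickly with the degree, which is why overfitting becomes a concern.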

In order to avoid overfitting due to too many features, regularization can be applied to the cost function in order to "weaken" the effects of $\theta$. This will lead to a more "general" hypothesis and reduce the problem of overfitting. The regularized cost function for logistic regression is thus defined as:

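$$J(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^{m} y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2 \tag{5.18}$$

The gradient for regularized Logistic Regression is defined as²:

$$\frac{\partial}{\partial \theta_j} J(\theta) = \begin{cases} \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})\, x_j^{(i)} & \text{if } j = 0 \\ \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})\, x_j^{(i)} + \frac{\lambda}{m} \theta_j & \text{if } j > 0 \end{cases} \tag{5.19}$$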

²The distinction for $j = 0$ is made because, by convention, every input vector $x$ is padded such that $x_0 = 1$. The corresponding parameter $\theta_0$ acts as the bias term and is not regularized.
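The regularized cost and gradient can be checked against each other numerically. The sketch below implements Equations 5.18 and 5.19 in plain Python and leaves $\theta_0$ out of the penalty, as described in the footnote. The function names, the parameter name `lam` for $\lambda$ and the sample data are assumptions for this example.

```python
import math


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def h(theta, x):
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def regularized_cost(theta, X, y, lam):
    """Regularized cost J(theta) from Equation 5.18."""
    m = len(X)
    eps = 1e-12  # keeps log() away from 0 for saturated predictions
    total = 0.0
    for xi, yi in zip(X, y):
        p = min(max(h(theta, xi), eps), 1.0 - eps)
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    # The penalty starts at j = 1: the bias parameter theta_0 is excluded.
    penalty = lam / (2.0 * m) * sum(t * t for t in theta[1:])
    return -total / m + penalty


def regularized_gradient(theta, X, y, lam):
    """Gradient of the regularized cost from Equation 5.19."""
    m = len(X)
    grad = [0.0] * len(theta)
    for xi, yi in zip(X, y):
        err = h(theta, xi) - yi
        for j, xij in enumerate(xi):
            grad[j] += err * xij / m
    for j in range(1, len(theta)):  # j = 0 receives no penalty term
        grad[j] += lam / m * theta[j]
    return grad


# Hypothetical sample data; every x is padded with x_0 = 1.
X = [(1.0, 0.5, 1.5), (1.0, 2.0, -1.0), (1.0, -1.0, 0.5), (1.0, 1.0, 1.0)]
y = [0, 1, 0, 1]
theta = [0.1, -0.2, 0.3]
grad = regularized_gradient(theta, X, y, lam=1.0)
```

Comparing `grad` with a finite-difference approximation of `regularized_cost` is a quick way to catch sign or indexing mistakes before handing both functions to an optimizer.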

5.1.3.4 Hidden Markov Models

Hidden Markov Models (HMMs) are a statistical machine learning technique designed for the classification of time-series data. Developed originally for speech recognition applications (Rabiner, 1990), HMMs have become one of the most used machine learning techniques in research.

The basic premise of HMMs is that by looking at a sequence of inputs, or observations, we can calculate the probability of the model being in a certain state, e.g., "gesture recognized" or "no gesture recognized". Thus, observations are not uniquely tied to a specific state (Marsland, 2009). This makes HMMs very robust towards variations in the time-series data. More formally, HMMs are composed of:

• an observation alphabet $V$.

• an underlying (hidden) transition system with a set of states $S = \{q^{(0)}, \ldots, q^{(N)}\}$.

• a probability distribution matrix $A$, where the entries $a_{i,j}$ describe the probability of a transition from state $q^{(i)}$ at time step $n-1$ to state $q^{(j)}$ at time step $n$:

$$p(q_n^{(j)} \mid q_{n-1}^{(i)}) \equiv a_{i,j} \tag{5.20}$$

• an emission probability distribution $B = \{b_i(v)\}$ which tells us the probability of observing a symbol $v \in V$ at state $q^{(i)}$:

$$p(v \mid q^{(i)}) \equiv b_i(v) \tag{5.21}$$

• an initial state distribution $\pi = \{\pi(i)\}$ with $p(q_0^{(i)}) \equiv \pi(i)$.

The compact notation for an HMM is therefore:

$$\lambda = (A, B, \pi) \tag{5.22}$$

There are three fundamental problems for HMMs. The first problem is called the Evaluation Problem. The question is how to efficiently calculate the probability of an observation sequence $O = o_1, o_2, \ldots, o_N$ given the model $\lambda$. An efficient solution to this problem is given by the Forward Algorithm (Marsland, 2009).
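A compact sketch of the Forward Algorithm for a discrete HMM is given below. It assumes that states and observation symbols are encoded as integer indices and that $A$, $B$ and $\pi$ are given as nested lists. The function name `forward` and the two-state example model are assumptions for illustration. No scaling is applied, so very long sequences would underflow.

```python
def forward(observations, A, B, pi):
    """Return P(O | lambda) for a discrete HMM lambda = (A, B, pi).

    A[i][j]: probability of a transition from state i to state j
    B[i][v]: probability of emitting symbol v in state i
    pi[i]:   probability of starting in state i
    """
    n_states = len(pi)
    if not observations:
        return 1.0
    # Initialization: alpha_1(i) = pi(i) * b_i(o_1)
    alpha = [pi[i] * B[i][observations[0]] for i in range(n_states)]
    # Induction: alpha_t(j) = (sum_i alpha_{t-1}(i) * a_ij) * b_j(o_t)
    for o in observations[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n_states)) * B[j][o]
                 for j in range(n_states)]
    # Termination: sum over all states at the final time step.
    return sum(alpha)


# Hypothetical two-state model with a two-symbol alphabet {0, 1}.
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.6, 0.4]
prob = forward([0, 1], A, B, pi)  # approximately 0.209
```

The algorithm keeps one value per state and time step, so its cost grows linearly with the sequence length instead of enumerating every possible state sequence.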

The second problem of interest is the Decoding Problem. Given an HMM $\lambda$, what is the optimal state sequence $Q = q_1, q_2, \ldots, q_N$ given an ob-