
Real-Time Translation of American Sign Language using Wearable Technology

Jackson Taylor

Honors Thesis∗

Department of Mathematics & Computer Science, University of Richmond


The signatures below, by the thesis advisor, the departmental reader, and the honors coordinator for computer science, certify that this thesis, prepared by Jackson Taylor, has been approved as to style and content.

(Dr. Barry Lawson, thesis advisor)

(Dr. Doug Szajda, departmental reader)


Abstract

The goal of this work is to implement a real-time system using wearable technology for translating American Sign Language (ASL) gestures into audible form. This system could be used to facilitate conversations between individuals who do and do not communicate using ASL. We use as our source of input the Myo armband, an affordable commercially available wearable technology equipped with on-board accelerometer, gyroscope, and electromyography sensors. We investigate the performance of two different classification algorithms in this context: linear discriminant analysis and k-Nearest Neighbors (k-NN) using various distance metrics. Using the k-NN classifier and windowed dynamic time warping as the distance metric, our working prototype is able to obtain accuracies between 94% and 98% when classifying up to 20 different ASL gestures.


Contents

1 Introduction
2 American Sign Language
3 Background and Related Work
4 The Myo Device
   4.1 Accelerometer
   4.2 Gyroscope
   4.3 Roll, Pitch, Yaw
   4.4 Electromyography
   4.5 Construction of Complete Data Signal
5 Classification Algorithms
   5.1 Linear Discriminant Analysis
   5.2 k-Nearest Neighbors (k-NN)
       5.2.1 Euclidean Distance
       5.2.2 Manhattan Distance
       5.2.3 Dynamic Time Warping
       5.2.4 Dynamic Time Warping with Windowing
6 Implementation of the Prototype
   6.1 Capturing an ASL Gesture
       6.1.1 Static Windowing Capture
       6.1.2 Dynamic Windowing Capture
   6.2 Signal Sampling
   6.3 Classification Process
       6.3.1 Classification using LDA
       6.3.2 Classification using k-NN
   6.4 Swift Programming Language
7 Experiments and Results
   7.1 Comparison of Classification Algorithms
   7.2 Comparison of Classification Algorithms: Synthetic Data
   7.3 Increasing the Vocabulary Size
   7.4 DTW Window Size
8 Future Work
9 Conclusion
References


1 Introduction

American Sign Language (ASL) is one of the primary forms of communication used by those with hearing loss. The most recent estimates of ASL’s user base place the number of users around one-half to two million people in the United States [12]. Still, this number is small compared to the current three hundred million United States citizens, indicating a large proportion of citizens who cannot communicate effectively via ASL. Some people may take it upon themselves to learn American Sign Language, but the process can be time-consuming, equivalent to learning another language entirely. Providing a mechanism to automate translation of ASL into audible form can help remove communicative barriers between those who do and do not regularly use ASL.

In general, ASL comprises about 6000 words, letters, and phrases [17]. For example, the most commonly used ASL words include gestures for home, mom, pizza, and dog. Before ASL and other forms of sign language were widely adopted, the process of spelling out words was commonly used [16], but this process quickly became cumbersome. The introduction of word- and phrase-level abstractions, in ASL and similar standards, facilitated fast-paced, efficient communication similar to that of spoken languages.

The goal of the work here is to identify, via classification, select word-level ASL abstractions. The problem is similar to speech recognition in many ways, but presents unique challenges. Two possible approaches for speech recognition are analysis of the corresponding signal as a time series or analysis of features extracted from the signal. As an example of the latter, features can be extracted in the form of phonemes, units of sound used to differentiate one word from another [21]. For example, the word pit is differentiated from bit by exchanging the phoneme /b/ for /p/. A word such as bit, when pronounced, registers the phonemes /b/, /I/, and /t/ in that sequential order. Although sign language can be modeled using the orientation and position of the arm and the shape of the hand as phonemes, the recognition process is not as streamlined as for speech. In a single gesture, arm orientation and hand shape, i.e., sign language phonemes, can change simultaneously and sequentially, complicating the analysis of those features. In this work, we focus our analysis on signals as time series rather than on phonemes of ASL.

To date, the most successful approaches to automated translation of ASL involve analyzing video of hand gestures. While this technique is very accurate in translating sign language, video analysis does not suit the everyday, face-to-face conversations that motivate this work: the requirement of having a camera present is constraining at best and prohibitive at worst. Video analysis is a very effective approach when communicating with someone, say, over a video conferencing system, but the motivation of the work presented here is to develop a much less constrained mechanism for translating ASL.

In this work, we propose use of a wearable device called the Myo, a small band worn on the forearm, which can measure changes in the orientation, position, and movement of the forearm, hand, and fingers. The Myo is very sleek and relatively unobtrusive, facilitating automated translation of ASL with only minor hardware requirements. The ultimate goal is to pair the Myo with a mobile device (e.g., a smart phone) executing software that will translate the Myo signals into a corresponding ASL word which can then be reproduced audibly.

During movement of the user’s arm and/or hand, the Myo device transmits 15 different streams of data to the mobile device. Machine learning techniques are used to classify these data into ASL words, which can then be emitted in audible form by the mobile device. Moreover, the hardware required is extremely portable and relatively low cost. This approach, pairing a small wearable device with a mobile device easily carried by the user, shows promise for effective communication between persons who do and do not understand ASL.

The remainder of this paper is organized as follows. In Section 2, we provide a brief overview of American sign language. In Section 3, we discuss background and related work. In Section 4, we discuss the Myo device and the signals it produces. In Section 5, we discuss the machine learning algorithms we use to classify signals into ASL words. In Section 6, we discuss details of the prototype implementation, and in Section 7 we present results from experiments using our approach. Finally, Section 8 suggests future directions.


2 American Sign Language

American Sign Language (ASL) is a form of nonverbal communication used by millions of Americans [12]. ASL contains hand-shape gestures that correspond to individual letters as well as hand and arm gestures that represent words or full phrases. Figure 1 shows all letters and digits in the ASL finger-spelled dictionary and their corresponding hand gestures.

Figure 1: ASL alphabet and digits [19].

Other than finger-spelled letters, ASL includes word-level abstractions such as signs for “hello” and “goodbye”. Some word gestures in ASL stem from commonly practiced signals (such as the gesture for “hello”) while others are original (such as the gesture for the word “no”). Forming a sentence in ASL using gestures follows a different grammar from that of spoken English. For example, consider the sentence “My grandma has a green coat.” To construct the sentence in ASL, one would sign “my grandma have green coat”, where each of those words is a separate gesture. For some commonly used phrases, ASL has corresponding gestures, such as “brushing teeth” and “look at” (see Figure 2).


Figure 2: Example ASL word and phrase gestures [4].

3 Background and Related Work

There have been multiple studies investigating automated classification of American Sign Language movements. Most recent work focuses primarily on the analysis of video [11, 17, 20, 22]. These methods are generally very accurate in classifying ASL using the relatively large amount of data a camera collects (compared to that obtained by most wearable devices), but a user would need to be constantly facing a camera in order to communicate with others. However, a camera-based approach would be applicable for video conferencing and visual communication over the Internet. To increase classifier accuracy, some studies have participants wear gloves on each hand so that the hands are easily distinguished from the background [17, 22]. Though accuracy increases as a result, the efficacy of using this method in a real-world context is suspect. Other efforts use three different camera positions in order to obtain a full three-dimensional mapping of the user's movements [20].

Of those projects that do use wearable devices, the focus is mainly on finger-spelling recognition. Some of these attempts include users wearing special gloves with sensors which transmit gesture data to a computer for recognition [9, 14]. Though providing benefits for recognizing the ASL alphabet, this type of implementation would prove cumbersome in the real world. One recent master's thesis uses a sensor called the Shimmer as the wearable device [2]. The Shimmer device is able to produce medical-grade data analysis for EMG signals, as well as data from the on-board accelerometer and gyroscope, in order to determine the location and movement of the wearer's arm. A more recent project combines both video analysis and wearable accelerometers, achieving high accuracy in ASL movement classification [3]. However, the wearables in these works are generally cumbersome and/or are not commercially available.

Because of the computation and battery constraints of the intended mobile platform, scalability is an important consideration in this context. Speech recognition uses phonemes to recognize each word, but with ASL this method is quite difficult because the arm-orientation and hand-shape phonemes can change simultaneously. One study in particular constructed a set of phonemes for ASL consisting of 4 categories: 30 hand shapes, 8 hand orientations, 20 major body locations, and 40 movements [20]. Though there are many more ASL phonemes than speech phonemes, the phonetic makeup of sign language (a total of 4(30 + 8 + 20 + 40) = 392 classes to identify in parallel) is still far smaller than the roughly 6000-word ASL vocabulary. Other projects do not use a phonetic breakdown of ASL movements, instead working with an aggregate signal of movements from their choice of device. Most of these aggregate-signal projects have a limited vocabulary of fewer than 50 ASL movements [2, 17, 21]. One study in particular includes 262 different ASL movements, producing very accurate classification results of up to 94% [7].

Most of the projects using video analysis apply a hidden Markov model (HMM) as the algorithm of choice for classification. Following the popularity of HMMs for speech recognition, these projects use sub-sampled images from video cameras as feature vectors and obtain accurate results [7, 17, 21]. HMMs allow for fast recognition of ASL movements, but as the vocabulary increases, the efficiency and accuracy of the model decrease [20]. Other research investigates parallel HMMs, increasing the speed and accuracy of training the models and improving the real-time response of the system [20]. Other work focusing on wearable devices uses linear discriminant analysis (LDA) as the classification algorithm of choice [2], and is able to achieve very high accuracy.

Because the data gathered from an ASL movement form a time series, other work investigates dynamic time warping (DTW) as a distance metric to compare new samples to samples previously recorded [11]. DTW is an algorithm used to align (by “warping”) two different time series to allow comparison of the two series. Although HMMs have been shown to be generally superior to DTW in aligning two offset series, the authors introduce statistical dynamic time warping (SDTW), a modified version of DTW, along with a modified nearest-neighbor algorithm, achieving classification accuracies of up to 91% in a video-based setup [11].

4 The Myo Device

In our work, we use a wearable device called Myo for collecting and transmitting signals related to ASL gestures. The Myo [10] is a consumer electronic device released in March 2015 by Thalmic Labs. Myo’s software development kit (SDK) allows developers to access the raw data transmitted from the Myo to a Bluetooth-connected computer. This raw data consists of the output of the on-board accelerometer, gyroscope, orientation, and EMG sensors. A schematic of the Myo device is given in Figure 3. In order to function properly, the Myo device should be positioned on the forearm of the user with the USB cable port pointed toward the wearer’s hand. Depictions of the Myo and its orientations relative to the user’s arm are given in Figures 4 and 5.


Figure 4: A representation of the Myo coordinate system for accelerometer and gyroscope [5].

4.1 Accelerometer

The Myo contains an accelerometer which estimates acceleration of the arm along the x, y, and z axes, as shown in Figure 4. (Note that our specific Myo returns the same value for the accelerometer y and z axes, so we omit the duplicate. Omitting this third dimension is mitigated by the three dimensions of roll, pitch, and yaw, described in Section 4.3.) The device transmits accelerometer data in units of gravity (g), approximately 9.8 m/s².

4.2 Gyroscope

Along with accelerometer data, the Myo device also transmits gyroscope data, measuring rotation of the band (and therefore the arm). Gyroscope data is captured along all three of the x, y, and z axes shown in Figure 4. (Note that our specific Myo returns the same value for the gyroscope y and z axes, so, similar to the accelerometer, we omit the duplicate. Again, omitting this third dimension is mitigated by the three dimensions of roll, pitch, and yaw.) The Myo transmits gyroscope data in units of angular velocity (degrees per second).

4.3 Roll, Pitch, Yaw

Unlike the gyroscope and accelerometer, values for roll, pitch, and yaw transmitted by the Myo are all calculated in real-time using other sensor data. The Myo documentation does not report how these values are calculated but documents the particular axes in a way analogous to airplanes. Figure 5 depicts how roll, pitch, and yaw change as the user’s arm moves while wearing the Myo.

Figure 5: Roll, pitch, and yaw for the Myo device [5].

4.4 Electromyography

Electromyography (EMG) devices read electrical activity from muscles, and have even been used to help control motor-powered prosthetic limbs [13]. In a simple sense, on either contraction or relaxation of a muscle, an EMG device detects and reports an electrical change in that particular muscle. The Myo device has eight EMG sensors on board, corresponding to an additional eight signal components reported on movement. Because the Myo is a small band built specifically for use on the forearm, its EMG sensors are not able to pinpoint exact muscle movements. However, collectively the eight components can be used to identify certain finger gestures. This ability to identify finger movement is critical for determining ASL gestures that differ only in hand/finger positioning.


4.5 Construction of Complete Data Signal

Since Myo data is recorded from four different sensors, our classification system uses one signal consisting of data concatenated from all four sensors. Consider an ASL gesture's signal containing $t$ data points. Let $a_x = (a_{x,1}, a_{x,2}, \ldots, a_{x,t})$ and $a_y = (a_{y,1}, a_{y,2}, \ldots, a_{y,t})$ correspond to data transmitted by the accelerometer. Let $g_x = (g_{x,1}, g_{x,2}, \ldots, g_{x,t})$ and $g_y = (g_{y,1}, g_{y,2}, \ldots, g_{y,t})$ correspond to data transmitted by the gyroscope. Let the transmitted roll, pitch, and yaw signals be represented by $o_r = (o_{r,1}, o_{r,2}, \ldots, o_{r,t})$, $o_p = (o_{p,1}, o_{p,2}, \ldots, o_{p,t})$, and $o_y = (o_{y,1}, o_{y,2}, \ldots, o_{y,t})$, respectively. Finally, let $e_i = (e_{i,1}, e_{i,2}, \ldots, e_{i,t})$, where $i \in \{1, 2, \ldots, 8\}$, correspond to the eight EMG signals. For our work, we construct an aggregate signal $s$ as follows:

\[
s = (a_x, a_y, g_x, g_y, o_r, o_p, o_y, e_1, e_2, \ldots, e_8)
\]

Recall that identical $y$ and $z$ values are reported by both the accelerometer and the gyroscope. Accordingly, $a_z$ and $g_z$ are omitted from our signal $s$. We normalize the values in each of the sub-vectors $a_x, a_y, \ldots, e_8$ using the mean and standard deviation of the raw values from the corresponding sub-vector.
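As a concrete illustration of this construction, the following is a minimal Swift sketch (the struct and function names are illustrative, not taken from the thesis prototype): each sub-vector is z-normalized with its own mean and standard deviation, and the results are concatenated in the order given above.

/// One captured gesture, with each sensor component stored as its own time series,
/// in the order (a_x, a_y, g_x, g_y, o_r, o_p, o_y, e_1, ..., e_8).
struct GestureCapture {
    var components: [[Double]]
}

/// Z-normalize a single sub-vector using its own mean and standard deviation.
func normalize(_ v: [Double]) -> [Double] {
    guard !v.isEmpty else { return v }
    let mean = v.reduce(0, +) / Double(v.count)
    let variance = v.map { ($0 - mean) * ($0 - mean) }.reduce(0, +) / Double(v.count)
    let sd = variance.squareRoot()
    guard sd > 0 else { return v.map { _ in 0 } }
    return v.map { ($0 - mean) / sd }
}

/// Build the aggregate signal s by normalizing each sub-vector independently
/// and concatenating the results.
func aggregateSignal(from capture: GestureCapture) -> [Double] {
    capture.components.flatMap(normalize)
}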

5 Classification Algorithms

In this section, we discuss the machine-learning algorithms used in this work to classify ASL gestures: linear discriminant analysis, and k-nearest neighbors using Euclidean distance, Manhattan distance, and dynamic time warping.

5.1 Linear Discriminant Analysis

Linear discriminant analysis (LDA) [15] is a supervised classification algorithm that determines a linear combination of features to separate two or more classes of data. LDA’s primary objective is to perform dimensionality reduction while preserving as much of the class specific information as possible [6].

Consider the case of classifying vectors of dimension m ≥ 2 into C = 2 classes. Given m-dimensional training vectors

\[
x_i = (x_{i1}, x_{i2}, \ldots, x_{im}), \quad i \in \{1, 2, \ldots, N\},
\]

where each x_i belongs to either known class ω_1 or ω_2, such that |ω_1| = N_1, |ω_2| = N_2, and N = N_1 + N_2, we want to choose a projection vector

\[
w = [w_1, w_2, \ldots, w_m]^T
\]

such that the projections y_i = w^T x_i maximize the separability of the projected classes. Define the mean of each class of vectors to be

\[
\mu_1 = \frac{1}{N_1} \sum_{x_i \in \omega_1} x_i
\quad \text{and} \quad
\mu_2 = \frac{1}{N_2} \sum_{x_i \in \omega_2} x_i
\]

and the mean of the projection of each class to be

\[
\tilde{\mu}_1 = \frac{1}{N_1} \sum_{y_i \in \omega_1} y_i = \frac{1}{N_1} \sum_{x_i \in \omega_1} w^T x_i = w^T \mu_1
\quad \text{and} \quad
\tilde{\mu}_2 = \frac{1}{N_2} \sum_{y_i \in \omega_2} y_i = w^T \mu_2 .
\]

For each class, denote the scatter, which measures within-class variability after projection, as

\[
\tilde{s}_1^2 = \sum_{y_i \in \omega_1} (y_i - \tilde{\mu}_1)^2
\quad \text{and} \quad
\tilde{s}_2^2 = \sum_{y_i \in \omega_2} (y_i - \tilde{\mu}_2)^2
\]

so that \tilde{s}_1^2 + \tilde{s}_2^2 is a measure of the within-class scatter.

The goal of LDA is to identify a projection vector w that maximizes the criterion function

\[
J(w) = \frac{|\tilde{\mu}_1 - \tilde{\mu}_2|^2}{\tilde{s}_1^2 + \tilde{s}_2^2} .
\]

Intuitively, the numerator indicates that we want large separation between the means of the projections of each class, while the denominator indicates that we want small variance within each class. This, in turn, will maximize the separability of the classes.

Let the covariance matrices of each class be denoted as

\[
S_1 = \sum_{x_i \in \omega_1} (x_i - \mu_1)(x_i - \mu_1)^T
\quad \text{and} \quad
S_2 = \sum_{x_i \in \omega_2} (x_i - \mu_2)(x_i - \mu_2)^T
\]

and let S_W = S_1 + S_2 be the within-class scatter matrix. Similarly, let S_B = (μ_1 − μ_2)(μ_1 − μ_2)^T be the between-class scatter matrix. The criterion function can then be written as [6]

\[
J(w) = \frac{|\tilde{\mu}_1 - \tilde{\mu}_2|^2}{\tilde{s}_1^2 + \tilde{s}_2^2} = \frac{w^T S_B w}{w^T S_W w} .
\]

Maximizing the criterion function equates to solving the generalized eigenvalue problem

\[
S_W^{-1} S_B w = \lambda w ,
\]

so that the solution will be the eigenvectors of S_W^{-1} S_B. (A derivation for this claim is provided in [6].) The optimal projection vector w is the eigenvector corresponding to the largest eigenvalue of S_W^{-1} S_B.

Figure 6: LDA projection onto eigenvectors for three classes of vectors. (a) Points projected onto the second eigenvector. (b) Points projected onto the first eigenvector.

Consider an example for C = 3 classes. Let x_i, i ∈ {1, 2, . . . , 60}, be two-dimensional vectors where each component of a 2-D vector is randomly generated from a standard normal distribution. Assign 20 vectors to each of the three classes, and then, via transformation using an arbitrary covariance matrix, shift the center and orientation of each of the three classes in 2-D space. Let w_1 be the projection vector (eigenvector) that corresponds to the largest eigenvalue from the solution described earlier, and w_2 be the eigenvector corresponding to the second largest eigenvalue. Figure 6 shows two graphs with the vectors projected onto the eigenvectors w_1 and w_2. In Figure 6a the vectors are projected onto w_2, corresponding to the second largest eigenvalue, with limited separation between classes 2 and 3. However, Figure 6b depicts the projection of the vectors onto w_1, corresponding to the largest eigenvalue, with distinguishable separation among all three classes.

Now consider classification using LDA. Let s be a data signal with an unknown class, and M be the set of training signals with class identification. LDA will compute the optimal projections for M and the mean of the projections of each class; the signal s is then assigned to the class with minimum distance between the projection of s and the class projection means.
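Stated compactly (this is a restatement of the rule above, not notation used in the thesis), the LDA classifier assigns an unknown signal s to

\[
\hat{c} = \arg\min_{c \in \{1, \ldots, C\}} \left\| W^T s - W^T \mu_c \right\| ,
\]

where the columns of W are the leading eigenvectors of $S_W^{-1} S_B$ and $\mu_c$ is the mean of the training signals in class c.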

5.2 k-Nearest Neighbors (k-NN)

k-Nearest Neighbors (k-NN) [1] is an alternative (supervised) classification algorithm that classifies a vector of unknown class by finding the k already-classified vectors having minimum distance to the unclassified vector, where distance is some quantifiable metric. The training process of k-NN is straightforward: k-NN simply stores an array of training data vectors and their respective classes. Because k-NN is a supervised algorithm, the class of each training vector must be indicated.

Now consider classification using k-NN. Let s be a data signal with an unknown class, and M be the set of training signals with class identification. k-NN will compare s to every record in the set M, and classify s as belonging to the class most common among the k records in M closest to s.

As mentioned above, k-NN requires a means of quantifying the distance, i.e., similarity, of two different samples. The three measures of similarity we consider for k-NN in this work are Euclidean distance, Manhattan distance, and dynamic time warping distance.
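The classification rule itself is short. Below is a minimal Swift sketch of k-NN with a pluggable distance function; the names are illustrative, and the training signals are assumed to be stored as plain arrays with their class labels.

/// A labeled training signal.
struct TrainingSignal {
    let label: String
    let values: [Double]
}

/// Classify `sample` by majority vote among its k nearest training signals,
/// where "nearest" is defined by the supplied distance function.
func knnClassify(_ sample: [Double],
                 training: [TrainingSignal],
                 k: Int,
                 distance: ([Double], [Double]) -> Double) -> String? {
    let nearest = training
        .map { (label: $0.label, d: distance(sample, $0.values)) }
        .sorted { $0.d < $1.d }
        .prefix(k)
    // Majority vote over the k nearest labels.
    let votes = Dictionary(grouping: nearest, by: { $0.label }).mapValues { $0.count }
    return votes.max { $0.value < $1.value }?.key
}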

5.2.1 Euclidean Distance

The Euclidean distance is a measure of the distance between two points in a given coordinate space. Let a = (a_1, a_2, a_3, . . . , a_n) and b = (b_1, b_2, b_3, . . . , b_n) be points in n-space. The Euclidean distance d(a, b) between points a and b is defined by

\[
d(a, b) = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + (a_3 - b_3)^2 + \cdots + (a_n - b_n)^2} .
\]

In the context of k-NN, if m is an arbitrary signal from the training set M and s is a yet-unseen signal, d(s, m) quantifies the similarity of the two n-dimensional signals.

5.2.2 Manhattan Distance

The Manhattan, or taxicab, distance is the sum of the vertical and horizontal distances between two points in a given coordinate space. Let a = (a_1, a_2, a_3, . . . , a_n) and b = (b_1, b_2, b_3, . . . , b_n) be points in n-space. The Manhattan distance d(a, b) between points a and b is defined by

\[
d(a, b) = |a_1 - b_1| + |a_2 - b_2| + |a_3 - b_3| + \cdots + |a_n - b_n| .
\]

Similar to Euclidean distance, if s and m are both arbitrary signals, d(s, m) can be used to quantify the similarity of the two n-dimensional signals.
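Both metrics are one-liners in practice. A minimal Swift sketch of the two definitions above (function names are illustrative):

/// Euclidean distance between two equal-length signals.
func euclideanDistance(_ a: [Double], _ b: [Double]) -> Double {
    precondition(a.count == b.count, "signals must have the same length")
    var sum = 0.0
    for i in 0..<a.count {
        let diff = a[i] - b[i]
        sum += diff * diff
    }
    return sum.squareRoot()
}

/// Manhattan (taxicab) distance between two equal-length signals.
func manhattanDistance(_ a: [Double], _ b: [Double]) -> Double {
    precondition(a.count == b.count, "signals must have the same length")
    var sum = 0.0
    for i in 0..<a.count {
        sum += abs(a[i] - b[i])
    }
    return sum
}

Either function can be passed directly as the distance argument of the k-NN sketch in Section 5.2.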

5.2.3 Dynamic Time Warping

Dynamic time warping (DTW) is a distance metric for measuring the similarity between two time series. The primary benefit of DTW is its ability to equate two time series that differ in length but are similar in overall shape. DTW was once widely used in speech recognition applications, but has generally been supplanted by HMMs in that context.

DTW attempts to determine an optimal alignment between two time series using a dynamic programming approach. Consider the comparison of two vectors a = (a_1, a_2, . . . , a_n) and b = (b_1, b_2, . . . , b_m); unlike Euclidean distance, the vectors need not be of the same size. Construct an (n + 1) × (m + 1) distance matrix where entry (i, j) represents the distance between a_i and b_j. Finding the optimal alignment is equivalent to finding the shortest path from the bottom-left (0, 0) to the top-right (n, m) of the distance matrix. The resulting distance between the two vectors is the sum of the entries along the shortest path. Pseudocode for the O(nm) DTW algorithm is given below.

input: a = (a_1, a_2, a_3, . . . , a_n) and b = (b_1, b_2, b_3, . . . , b_m)

 1  dtw ← array[0..n][0..m]
 2  set all cells in dtw to ∞
 3  dtw[0][0] ← 0.0
 4  i, j ← 1
 5  while i ≤ n do
 6      j ← 1
 7      while j ≤ m do
 8          c ← |a_i − b_j|
 9          dtw[i][j] ← c + min(dtw[i−1][j], dtw[i][j−1], dtw[i−1][j−1])
10          j ← j + 1
11      end
12      i ← i + 1
13  end
14  return dtw[n][m]

Figure 7 provides an example of the shortest path between two ASL gesture signals as determined by DTW. For brevity, the data used here include only the accelerometer x values produced by the Myo (see Section 4.1). Figure 7a shows the comparison of signals from two different gestures (“apple” and “dog”). Note that the optimal alignment path deviates from the diagonal, indicating that the two signals are not especially similar. Figure 7b shows the comparison of two different signals of the same gesture (“apple”). In this case, the optimal alignment path closely follows the diagonal, indicating that the two signals are similar, and the overall distance computed is correspondingly smaller than for the mismatched pair.

Figure 7: Optimal alignment paths between two signals as determined by DTW. (a) “apple” signal vs. “dog” signal. (b) Two different signals of “apple”.

5.2.4 Dynamic Time Warping with Windowing

The O(nm) run-time of DTW can be improved using windowing. Rather than computing distances for all (n + 1)(m + 1) entries in the distance matrix, only those entries in a window around the optimal path are computed. An appropriate choice of window size will improve efficiency without affecting accuracy. Unfortunately, trial and error must be used to select the window size. Pseudocode for DTW with windowing is given below. Compared to the previous DTW algorithm, note the additional input parameter w and the changes to incorporate w on lines 6 and 7.

input: a = (a_1, a_2, a_3, . . . , a_n), b = (b_1, b_2, b_3, . . . , b_m), and w = window size

 1  dtw ← array[0..n][0..m]
 2  set all cells in dtw to ∞
 3  dtw[0][0] ← 0.0
 4  i, j ← 1
 5  while i ≤ n do
 6      j ← max(1, i − w)
 7      while j ≤ min(m, i + w) do
 8          c ← |a_i − b_j|
 9          dtw[i][j] ← c + min(dtw[i−1][j], dtw[i][j−1], dtw[i−1][j−1])
10          j ← j + 1
11      end
12      i ← i + 1
13  end
14  return dtw[n][m]

Figure 8 provides an example. The black squares represent a portion of the optimal path between two signals, similar to the paths shown in Figure 7. The gray squares represent the window of entries computed in determining the optimal alignment path. Uncolored squares represent entries whose values are not computed.
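For reference, here is a minimal Swift sketch of the windowed DTW distance just described; the function name is illustrative, and the window is widened to at least |n − m| so that a valid warping path to the final cell always exists.

/// Windowed dynamic time warping distance between two signals.
/// Passing a window of max(a.count, b.count) reproduces unwindowed DTW.
func dtwDistance(_ a: [Double], _ b: [Double], window: Int) -> Double {
    let n = a.count, m = b.count
    guard n > 0, m > 0 else { return .infinity }
    // Widen the band so a monotone path from (0,0) to (n,m) is always reachable.
    let w = max(window, abs(n - m))
    var dtw = Array(repeating: Array(repeating: Double.infinity, count: m + 1),
                    count: n + 1)
    dtw[0][0] = 0.0
    for i in 1...n {
        for j in max(1, i - w)...min(m, i + w) {
            let cost = abs(a[i - 1] - b[j - 1])   // signals are 0-indexed in Swift
            dtw[i][j] = cost + min(dtw[i - 1][j], dtw[i][j - 1], dtw[i - 1][j - 1])
        }
    }
    return dtw[n][m]
}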

6 Implementation of the Prototype

In this section, we describe implementation details of our prototype, including capturing gestures from the Myo signals, sampling of signals, details of the classification process, and choice of programming platform.

6.1 Capturing an ASL Gesture

To recognize the signal corresponding to an ASL movement, our system must know when the movement starts and ends — a similar process is needed for recognizing speech. Unlike speech, ASL movements do not have pauses between words. Most movements by native ASL speakers never have the same starting position but instead mesh together to create one long movement potentially consisting of multiple words.

To ease the problem of gesture recognition, for our prototype we add a constraint that the user must place their arms at their side at the start and end of each signed word. This gives the classifier a base starting point from which to determine the start and end of any gesture. Given this constraint, we used two different approaches for recognizing an ASL gesture by attempting to automatically recognize the transition from and to that base resting position.

6.1.1 Static Windowing Capture

To recognize the starting and ending times of an ASL gesture, our first approach uses a fixed window consisting of all data points that occur within a particular range. To define this range, various data signals are collected with the user's arm at their side. The window range is then defined to be 10% above and below the average of the accelerometer x values from the resting arm position. Because the accelerometer x values are approximately zero whenever the arm is positioned at the user's side, these values are a convenient indicator of the resting position.

Figure 9: Static windowing capture of the “mom” ASL movement. Dashed horizontal lines depict the static window presumed to be the resting position. Solid vertical lines indicate the start and end of the gesture as determined by our system. Only accelerometer x values are shown.

In Figure 9, the accelerometer x values reported by the Myo for the ASL gesture “mom” show relatively flat portions of signal at the start and end. These correspond to the user having their arms at their side, transitioning from the previous or to the next gesture. Once the signal begins to shift away from the flat portion, those data are considered to be the start of a gesture and are collected by the classifier. In Figure 9, the gray dashed horizontal lines depict a window around the signal that the system considers to be a resting arm. Once the signal leaves the window for a specified amount of time, a gesture is considered to have started. Once the signal reenters and remains within the window for the same specified amount of time, the gesture is considered to have stopped. These start and stop locations are depicted by solid vertical lines in Figure 9. The data between the two vertical lines are then sent for classification.
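A minimal Swift sketch of this static rule follows; `band` stands for the fixed resting window and `dwell` for the “specified amount of time” expressed as a number of consecutive samples (both names are illustrative).

/// Return index ranges of gestures in `xs` (accelerometer x values): a gesture starts
/// after `dwell` consecutive samples outside the fixed resting `band`, and ends after
/// `dwell` consecutive samples back inside it.
func segmentGestures(_ xs: [Double], band: ClosedRange<Double>, dwell: Int) -> [Range<Int>] {
    var gestures: [Range<Int>] = []
    var start: Int? = nil          // first index of the current candidate gesture
    var run = 0                    // consecutive samples supporting a state change
    for (i, x) in xs.enumerated() {
        let resting = band.contains(x)
        if start == nil {
            run = resting ? 0 : run + 1
            if run >= dwell { start = i - run + 1; run = 0 }
        } else {
            run = resting ? run + 1 : 0
            if run >= dwell {
                gestures.append(start! ..< (i - run + 1))
                start = nil
                run = 0
            }
        }
    }
    return gestures
}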


Figure 10: Static windowing capture of multiple movements. Only accelerometer x values are shown.

The static window approach works for the majority of isolated test cases (gestures), but fails when applied in real-time tests. Figure 10 highlights two issues encountered when using this approach in real time: (1) in some cases too much resting signal is included as part of a legitimate gesture, and (2) in some cases resting signal alone is considered to be an entire gesture. For the first issue, some of the data corresponding to a resting position are not in the predefined window. Adjusting the size or position of the static window does not alleviate these issues. Because the resting position is not consistent between gestures, a window determined dynamically is more appropriate.

6.1.2 Dynamic Windowing Capture

A more effective approach for determining the start and end of an ASL gesture uses a dynamically determined window for the resting position. While the static window approach attempts to identify long periods of data inside a fixed vertical range, the dynamic window approach identifies data within a similar vertical range regardless of the location of that range. In this way, the dynamic window approach allows the resting window to shift to accurately capture different resting positions.

Our approach for dynamically identifying a resting-position window proceeds as follows. Let the range (height) of the resting-position window be 2ε, where ε > 0. Let data point r, with value v(r), be the reference point for the window. Let α ≥ 1 be the window's current length (i.e., the number of successive data points in the window), and let β ≥ 1 be the minimum length of a window considered to be a resting window.

Initially, the first data point p_1 encountered becomes the reference, so that r = p_1, the window has range [v(r) − ε, v(r) + ε], and the window length is α = 1. If data point p_2 has value v(p_2) ∈ [v(r) − ε, v(r) + ε], then p_2 is added to the window, now of length α = 2. If instead p_2 has value v(p_2) ∉ [v(r) − ε, v(r) + ε], then a new window is constructed with p_2 as the reference point, i.e., r = p_2, redefining the window range, and with window length now α = 1.

At any time, let the reference point be denoted by r = p_i, where i = 1, 2, . . . Any subsequent data point p_{i+m}, where m = 1, 2, . . ., is either added to the existing window or used to create a new window as follows:

• If v(p_{i+m}) ∈ [v(r) − ε, v(r) + ε], then p_{i+m} is added to the window, increasing the window length α by 1.

• If v(p_{i+m}) ∉ [v(r) − ε, v(r) + ε], then the smallest integer j is chosen to satisfy v(p_{i+k}) ∈ [v(p_{i+j}) − ε, v(p_{i+j}) + ε], where j = 1, 2, . . . , m and k = j, j + 1, . . . , m. In other words, the earliest point p_{i+j} is chosen so that each of p_{i+j}, p_{i+j+1}, . . . , p_{i+m} is within the range [v(p_{i+j}) − ε, v(p_{i+j}) + ε]. Then r = p_{i+j} becomes the new reference point, defining a new window with range [v(r) − ε, v(r) + ε] and resulting window length α = m − j + 1.

Whenever the window length α ≥ β, any subsequent data point outside the window is considered to be the start of an ASL gesture. The gesture then ends at the next point in sequence that occurs just before the reference point of the next (resting) window having length α ≥ β. (We assume that the user does not pause during execution of a gesture.)

Figure 11: Dynamic windowing example.

Figure 11 provides an illustration of our dynamic window approach for determining a resting window. In this example, the window half-width ε = 0.2 and the minimum-length parameter β = 6.

Step 1: The first data point p_1 is added to the window, becoming the window's reference point. Since the value of p_1 is 0.5, the window range is [0.3, 0.7], i.e., 0.5 ± ε, and the window length is 1.

Step 2: The second point p_2 has value 0.6, within the window's range [0.3, 0.7]. Thus p_2 is added to the window, which now has length 2.

Step 3: Point p_3 also has value 0.6, within the window's range [0.3, 0.7]. Thus p_3 is added to the window, which now has length 3.

Step 4: Point p_4 has value 0.5, and so is added to the window, now of length 4.

Step 5: Point p_5 has value 0.7, and so is added to the window, now of length 5.

Step 6: Point p_6 has value 0.8, outside the window's range [0.3, 0.7], so a new window must be found. Back-tracking for the earliest possible reference point, we require that the new window contain p_6. p_2 is the earliest point in the sequence that, when considered the reference point for the window, results in a window containing points p_2, p_3, . . . , p_6. Thus p_2, with value 0.6, becomes the new reference point, the window's range becomes [0.4, 0.8], and the length of the window is 5.

Step 7: Point p_7 has value 0.9, outside the window's range [0.4, 0.8], so a new window must again be found. Back-tracking to find the earliest possible reference point results in p_5 being selected as the new window's reference point. Thus the new window length is 3, with range [0.5, 0.9].

Step 8: Point p_8 has value 0.7, within the window's range [0.5, 0.9], so p_8 is added to the window, now of length 4.

The process above is repeated until (a) the window length is at least β and (b) the subsequent data point lies outside the window's range. This new point is considered to be the starting point of the ASL gesture. (Note that if in this example the minimum-length parameter were β = 5, the point at step 7 would be considered the first point of the ASL gesture.) The ending point of the gesture is considered to be the next point that occurs just before the next resting window is identified. Once the ending point of the gesture is determined, the data for the ASL gesture are sent for classification.
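The back-tracking bookkeeping above is compact in code. Below is a minimal Swift sketch of the resting-window tracking under the definitions of this section (half-width ε, minimum length β); the type and property names are illustrative, and the surrounding gesture-boundary logic of the prototype is omitted.

/// Tracks the current candidate resting window over a stream of values
/// (e.g., accelerometer x readings), following the rules described above.
struct RestingWindowTracker {
    let epsilon: Double            // window half-width (ε)
    let beta: Int                  // minimum length of a resting window (β)
    var points: [Double] = []      // all points seen so far
    var refIndex = 0               // index of the current reference point r

    /// Number of successive points in the current window (α).
    var length: Int { points.count - refIndex }

    /// True once the current window is long enough to count as resting.
    var isResting: Bool { length >= beta }

    /// Feed the next data point; returns true if it joined the current window,
    /// false if a new window had to be started (the point fell outside the range).
    mutating func add(_ value: Double) -> Bool {
        points.append(value)
        if abs(value - points[refIndex]) <= epsilon { return true }
        // Back-track: choose the earliest candidate reference point such that every
        // point from it through the new point lies within ±ε of the candidate.
        let last = points.count - 1
        for candidate in (refIndex + 1)...last {
            let ref = points[candidate]
            if points[candidate...last].allSatisfy({ abs($0 - ref) <= epsilon }) {
                refIndex = candidate
                break
            }
        }
        return false
    }
}

Under the scheme described above, a gesture would begin at the first point whose add call returns false while isResting is true, and would end just before the next window again reaches isResting.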

This dynamic window approach improves the reliability for real-time capturing of ASL movements. Figure 12 shows the successful capture of different ASL gestures in sequence using this approach. The approach is effective at isolating gestures even when the user does not always position their arm in exactly the same resting position.

Figure 12: Dynamic windowing capture of multiple ASL gestures. Only accelerometer x values are shown. Note that different resting positions (horizontal portions of signal) are correctly omitted.

6.2 Signal Sampling

Once an ASL gesture has been appropriately “trimmed” of resting-position signal, we take a fixed number of evenly spaced samples across the entire signal, regardless of signal duration. This sampling process improves computational efficiency in classification by reducing the number of data points in a signal passed to the classifier. More importantly, the sampling allows us to meaningfully compare two signals that are similar in shape but vary in duration.
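A minimal Swift sketch of this fixed-count resampling follows; the thesis does not specify an interpolation scheme, so the sketch simply picks the nearest original index for each evenly spaced position.

/// Reduce a signal to `count` evenly spaced samples, regardless of its duration.
func resample(_ signal: [Double], to count: Int) -> [Double] {
    precondition(count >= 2, "need at least two output samples")
    guard signal.count > 1 else {
        return Array(repeating: signal.first ?? 0, count: count)
    }
    let step = Double(signal.count - 1) / Double(count - 1)
    return (0..<count).map { signal[Int((Double($0) * step).rounded())] }
}

In the experiments of Section 7, each captured signal is reduced to 60 samples.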

An example is given in Figure 13. On the left side, Figures 13a and 13c show two signals of different duration for the ASL gesture “mom” prior to sampling. The vertical lines represent the fixed number of samples being chosen. Notice how the two signals are very similar in shape, but the signal in Figure 13c was completed in roughly half the time of that in Figure 13a. The resulting sampled versions of each signal, shown in Figures 13b and 13d, are very similar both in shape and in duration, facilitating more accurate classification.

Figure 13: Examples of sampling two signals of the same ASL gesture that vary in duration. (a) ASL movement “mom” (slow), unsampled. (b) ASL movement “mom” (slow), sampled. (c) ASL movement “mom” (fast), unsampled. (d) ASL movement “mom” (fast), sampled.

6.3 Classification Process

Consider a user wearing the Myo who begins to perform an ASL gesture from one of the C classes of gestures. Our system will capture the ASL gesture from the Myo device using dynamic windowing and signal sampling, as discussed in Sections 6.1.2 and 6.2. The result is a data signal s of unknown class where each component is a time series (as described in Section 4.5).

The classification algorithm (either LDA or k-NN) takes the data signal s and determines the signal class that most closely matches according to the algorithm's similarity metric.

Training signals are used by each of the classification algorithms. To this end, given a collection of C different ASL gestures to classify, in advance we capture n signals of each gesture. Let m_{i,j} be a captured signal, where i = 1, 2, . . . , C represents the class (ASL gesture) and j = 1, 2, . . . , n represents a particular trimmed, sampled signal of that class. Then the set

\[
M = \{m_{1,1}, m_{1,2}, \ldots, m_{1,n},\; m_{2,1}, m_{2,2}, \ldots, m_{2,n},\; \ldots,\; m_{C,1}, m_{C,2}, \ldots, m_{C,n}\}
\]

defines the set of training signals.

6.3.1 Classification using LDA

The process of classification using LDA proceeds similarly to the discussion given in Section 5.1. LDA accepts the set of training signals M, computes optimal projections for each class of signals, and then computes the means of each class of projections. Then, given a signal s of unknown class, LDA classifies s by determining the class having minimum distance between the class projection mean and the projection of s.

6.3.2 Classification usingk-NN

As discussed in Section 5.2, the k-NN algorithm classifies using a majority vote of its k nearest neighbors. To balance computational efficiency with classification accuracy, we use k-NN with k = 1 along with a set of reference signals defined for each class as follows. Given the training set M, the Cn signals are grouped based on their corresponding gesture, as shown below:

\[
\underbrace{m_{1,1}\; m_{1,2}\; \cdots\; m_{1,n}}_{\bar{m}_1} \qquad
\underbrace{m_{2,1}\; m_{2,2}\; \cdots\; m_{2,n}}_{\bar{m}_2} \qquad \cdots \qquad
\underbrace{m_{C,1}\; m_{C,2}\; \cdots\; m_{C,n}}_{\bar{m}_C}
\]

For each class i = 1, 2, . . . , C, we compute the average signal m̄_i for that class using

\[
\bar{m}_i = \frac{m_{i,1} + m_{i,2} + \cdots + m_{i,n}}{n} .
\]

Let the set M̄ = {m̄_1, m̄_2, . . . , m̄_C} be the collection of reference signals for the C classes, each corresponding to a different ASL gesture.

For k-NN with either Euclidean or Manhattan distance, the corresponding distance function can be applied directly to the signal s and each of the reference signals in M̄. For k-NN with DTW distance, a direct comparison between s and a reference signal is not appropriate because of the “warping”. That is, the DTW algorithm should not compare values from different sensors on the Myo, particularly those with different units. Therefore, we compute the distance DTW*(s, m̄_i) between signal s and a reference signal m̄_i, i = 1, 2, . . . , C, by summing the DTW distances computed between the individual components:

\[
DTW^*(s, \bar{m}_i) = DTW(s{:}a_x,\ \bar{m}_i{:}a_x) + DTW(s{:}a_y,\ \bar{m}_i{:}a_y) + \cdots + DTW(s{:}e_8,\ \bar{m}_i{:}e_8)
\]

where the notation s:a_x indicates use of only the a_x component of the signal s.
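Putting the pieces together, a minimal Swift sketch of this per-component, nearest-reference classification follows. It reuses the dtwDistance function from the sketch in Section 5.2.4; the type and function names are illustrative, and the default window of 13 is the size chosen in Section 7.4.

/// The averaged reference signal for one gesture class, stored per component
/// (same component order as the aggregate signal of Section 4.5).
struct ReferenceSignal {
    let label: String            // gesture name, e.g. "apple"
    let components: [[Double]]   // averaged sub-vectors, one per sensor component
}

/// DTW*-style distance: sum the windowed DTW distances of corresponding components,
/// so values from different sensors (and different units) are never compared.
func dtwStar(_ sample: [[Double]], _ reference: ReferenceSignal, window: Int) -> Double {
    precondition(sample.count == reference.components.count)
    return zip(sample, reference.components)
        .map { dtwDistance($0.0, $0.1, window: window) }
        .reduce(0, +)
}

/// 1-NN over the class references: return the label of the nearest reference signal.
func classifyGesture(_ sample: [[Double]],
                     references: [ReferenceSignal],
                     window: Int = 13) -> String? {
    references.min { dtwStar(sample, $0, window: window) < dtwStar(sample, $1, window: window) }?.label
}

Each ReferenceSignal is obtained by averaging the n training signals of a class element-wise, as in the formula for m̄_i above; this works because every trimmed signal has been resampled to the same length.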

DTW windowing reduces the run-time of the DTW distance metric by limiting the number of computations needed for the final distance calculation (see Section 5.2.4). Unfortunately, there is no known algorithm to optimally select a window size, because of the window's dependence on the type of data. Instead, multiple window sizes must be tested and compared. (For this comparison, see Section 7.4.)

6.4 Swift Programming Language

Because of the prevalence of iOS devices in today's society, an iOS mobile device acts as an appropriate hardware platform for our application. Therefore, Swift, an open source, general purpose programming language built and released by Apple [8], was chosen as the base language for our implementation. Swift offers comprehensive built-in functionality, executes efficiently even on mobile platforms, and can be easily installed on either a desktop computer or a mobile device. This portability allows for careful testing of the prototype in a desktop setting and, when appropriate, deployment on a mobile device. In addition, the Myo SDK was built for the Swift language, making Swift a natural choice.

7 Experiments and Results

In this section, we present the results of experiments comparing the accuracy of the LDA and k-NN algorithms for classifying Myo-generated ASL gesture signals. We also present results from experiments that consider computational efficiency.

7.1 Comparison of Classification Algorithms

The purpose of this experiment is to compare the accuracy achieved by each of the classification algorithms under the same training and test sets of ASL gestures. We include 10 different ASL gestures: “apple”, “cat”, “dog”, “home”, “hot”, “like”, “mom”, “please”, “shirt”, and “thank you” [18]. For each gesture, 20 different signals of the gesture were captured, giving 200 signals total. Each signal is captured as the user performs the gesture while wearing the Myo device, with their arm resting at their side before and after the gesture. Each signal is captured using dynamic windowing (see Section 6.1.2) and sampled 60 times (see Section 6.2). To construct the training set, for each of the 10 gestures we select at random 10 signals, giving a training set of size 100. The test set then consists of the remaining 100 signals. We then compare the accuracy of each classification algorithm on the test set.

For this experiment, we use the lda(...) function implemented in the MASS library in R. For k-NN, we implemented the classifier in Swift using three separate distance metrics: (1) Euclidean distance, (2) Manhattan distance, and (3) dynamic time warping (DTW). Collectively, these result in four different classifiers: LDA, k-NN Euclidean, k-NN Manhattan, and k-NN DTW.

Figure 14 and Table 1 show the accuracy of each classifier in correctly classifying signals from the test set. The first three values (solid bars) of each classifier section represent three different replications of the experiment using different training and test sets selected at random. The fourth value (bar with pattern) is the average accuracy of 30 replications, with 95% confidence intervals superimposed. Figure 14 and Table 1 demonstrate that k-NN is consistently more accurate than LDA. k-NN DTW consistently outperforms k-NN with either the Euclidean or Manhattan distance metric.

Figure 14: Classification accuracy comparison of LDA, k-NN DTW, k-NN Euclidean, and k-NN Manhattan.


Classifier        Replication 1   Replication 2   Replication 3   Average of 30 Replications
LDA                    89.0            91.0            92.0            86.9
k-NN DTW               94.0            95.0            97.0            94.2
k-NN Euclidean         88.0            90.0            91.0            86.9
k-NN Manhattan         92.0            90.0            88.0            90.3

Table 1: Comparison of classification accuracies (%) for four classifiers.

7.2 Comparison of Classification Algorithms: Synthetic Data

To compare classifier accuracy under a large data set, we generated synthetic data by randomly perturbing signals from the previous experiment. Since k-NN DTW consistently outperformed the other classification algorithms, k-NN DTW was the algorithm of choice for this experiment (see Section 7.1).

As in the previous experiment, 10 ASL gestures are used: “apple”, “cat”, “dog”, “home”, “hot”, “like”, “mom”, “please”, “shirt”, and “thank you” [18]. The training set again consists of 10 original signals selected at random from each of the 10 gestures. The test set then consists of 500 perturbed signals, where each of the 100 original signals is perturbed 5 times.

We perturb a signal using an offset o ranging from 0.01 to 0.20. For example, let s = (s_1, s_2, . . . , s_n) be a signal to be perturbed. A perturbed signal s* = (s*_1, s*_2, . . . , s*_n) is generated using

\[
s^*_i = s_i + \big(\mathrm{uniform}(-o, o) \times s_i\big),
\]

where i = 1, 2, . . . , n and uniform(−o, o) randomly selects a number in the range (−o, o) such that all real numbers between −o and o are equally likely.
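A minimal Swift sketch of this perturbation; Double.random(in:) stands in for uniform(−o, o), the closed end-points making no practical difference.

/// Perturb each sample of a signal by a uniformly random relative offset in (-o, o).
func perturb(_ signal: [Double], offset o: Double) -> [Double] {
    signal.map { $0 + Double.random(in: -o...o) * $0 }
}

// Five perturbed copies of one original signal, as used to build the test set:
// let copies = (0..<5).map { _ in perturb(original, offset: 0.10) }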

Figure 15 shows the classification accuracy of k-NN DTW versus increasing perturbation offset. Each value is the average of 30 replications, with a 95% confidence interval superimposed. k-NN DTW again performs consistently well, achieving approximately 98% accuracy across the replications.

Figure 15: k-NN classification accuracy versus increasing perturbation of synthetic signals.

Perturb Offset   Accuracy (%)     Perturb Offset   Accuracy (%)     Perturb Offset   Accuracy (%)
0.01             98.2 ± 0.4       0.08             98.4 ± 0.6       0.15             98.3 ± 0.5
0.02             98.6 ± 0.5       0.09             98.4 ± 0.5       0.16             98.6 ± 0.4
0.03             98.4 ± 0.5       0.10             98.1 ± 0.7       0.17             98.6 ± 0.4
0.04             98.3 ± 0.4       0.11             98.5 ± 0.4       0.18             98.3 ± 0.5
0.05             98.4 ± 0.3       0.12             98.2 ± 0.5       0.19             98.4 ± 0.5
0.06             98.8 ± 0.4       0.13             98.6 ± 0.5       0.20             98.4 ± 0.5
0.07             98.3 ± 0.5       0.14             98.2 ± 0.5

Table 2: k-NN DTW classification accuracy versus perturbation offset, with 95% confidence intervals.


7.3 Increasing the Vocabulary Size

The purpose of this experiment is to test the k-NN DTW classifier as the number of ASL gestures in the vocabulary increases. For the setup, 20 ASL gestures are recorded with 20 different signals of each gesture. The 20 ASL gestures include: “apple”, “bathroom”, “brushing teeth”, “candy”, “cat”, “cereal”, “cow”, “drink”, “home”, “horse”, “hot”, “hungry”, “milk”, “mom”, “please”, “shirt”, “sleep”, “sorry”, “thank you”, and “water” [18]. Each signal is captured using dynamic windowing (see Section 6.1.2) and sampled 60 times (see Section 6.2).

In this experiment, for each C = 4, 5, 6, . . . , 20, we randomly select C classes (gestures) from the 20 available and use the C selected classes as the new vocabulary. The training set then consists of 10C signals selected at random, with the remaining 10C signals used for the testing set.

Table 3 and Figure 16 show the average accuracy of k-NN DTW as the vocabulary size increases. Each value is the average of 30 replications of the experiment for that value of C, with 95% confidence intervals provided.

# Gestures   Accuracy (%)     # Gestures   Accuracy (%)     # Gestures   Accuracy (%)
4             96.5 ± 1.2       10           96.8 ± 1.0       16           95.7 ± 1.4
5             97.0 ± 1.1       11           96.4 ± 1.2       17           95.9 ± 1.2
6             94.6 ± 1.3       12           96.5 ± 1.4       18           95.6 ± 1.4
7             95.9 ± 1.3       13           97.1 ± 1.1       19           96.0 ± 1.2
8             96.6 ± 1.2       14           96.6 ± 1.1       20           96.0 ± 1.0
9             96.7 ± 1.2       15           95.0 ± 1.8

Table 3: Average accuracy of k-NN DTW versus vocabulary size, with 95% confidence intervals.


Figure 16: Accuracy of k-NN DTW versus vocabulary size.

7.4 DTW Window Size

As explained in Section 5.2.4, the efficiency of DTW can be improved by only computing values within a window around the shortest path. However, there is no known algorithm for immediately selecting the ideal window size. Instead, the process involves trial and error, testing each window size. The purpose of this experiment is to investigate this computational efficiency, as a function of window size, versus classification accuracy.

This experiment has a similar setup to that in Section 7.1. There are 10 ASL gestures with 20 captured signals per gesture; thus there are 200 signals that can be used for training. For each gesture, we select 10 signals at random for training, with the remaining 10 for testing. Here, for the purpose of consistently comparing replications across window sizes, we use the same 30 randomly selected test sets across all window sizes.

For this experiment we use a range of [1, 50] for the DTW window size. Figure 17 depicts the classification accuracy and the computation time of k-NN DTW versus DTW window size. Timing results were obtained by executing the prototype on a 3.2 GHz MacBook Pro. Figure 17 shows that execution time increases with increasing window size, as expected. Although omitted from the figure, there is no change in accuracy beyond a window size of 13, but computation time continues to increase. For this reason, we use 13 as our implemented window size. This ensures that DTW with windowing will not affect accuracy while being more computationally efficient than traditional DTW.

Figure 17: Classification accuracy and computation time of k-NN DTW versus DTW window size.


8 Future Work

The work in this thesis focuses on two classification algorithms: LDA and k-NN. The DTW-based k-NN, once widely used for speech recognition, provides high classification accuracy for our work. However, because HMMs are now generally the classification algorithm of choice in speech recognition, future work will investigate HMMs in an effort to further improve the accuracy of our system. As noted in Section 3, HMMs seem to be the most common classification algorithm used for recognizing ASL gestures, particularly for video-based systems.

Increasing the vocabulary size of our system would increase the efficacy of using the classifier in the real world. Our prototype uses only 20 ASL gestures for proof of concept, but the system should be expanded to better represent the large ASL dictionary. As discussed in Section 3, other research has investigated use of phonemes to construct ASL gestures for classification. Our work can also be extended to use phonetic construction for gestures.

Finally, we are interested in using syntax identification for improving classification of ASL words and phrases. This approach would use rules for constructing sentences in the English language, along with context-specific information, to choose words most likely to fit properly in a sentence. This feature could prove very beneficial in increasing the overall accuracy, and therefore usability, of our classification system.

9 Conclusion

The goal of this work was to develop a practical real-life application to help remove barriers in communicating via ASL. A key aspect of our work was to develop a system that works effectively in real time on mobile systems, and that requires only streamlined, readily available hardware. Previous work in this area is often restricted by the need for bulky hardware, probes, or cameras. The Myo device, a small portable consumer electronic device, acts as the input source for our classifier because of its sleek design, wide availability, and relatively low cost. The Myo has on-board accelerometer, gyroscope, orientation, and EMG sensors, providing the data needed for accurately classifying ASL gestures.

These different sensors are used to form one aggregate signal consisting of data concatenated from all four sensors, which is then sent to our system for classification. Using the dynamic windowing approach, we are able to use accelerometer x values to identify the start and end of ASL gestures. To facilitate this identification process, we require the user to place their arm in a resting position at their side between gestures.

Our work focuses on two classification algorithms: linear discriminant analysis and k-NN. In our experiments, k-Nearest Neighbors using dynamic time warping produces accurate results using a fast training process. Thus k-NN with DTW was our choice of classification algorithm for the prototype. The computationally expensive DTW distance metric is improved by using a windowing system, facilitating real-time use for classification.

In experimentation, our system achieves classification accuracies of 94–98% in identifying ASL gestures using real and synthetic data. Further research is necessary to improve overall classification accuracy while increasing the number of ASL gestures recognizable by our system.


References

[1] Naomi S. Altman. “An introduction to kernel and nearest-neighbor nonparametric regression”. In: The American Statistician 46.3 (1992), pp. 175–185.

[2] Alex vijay raj Amalaraj. Real-time hand gesture recognition using sEMG and accelerometer for gesture to speech conversion. Master's Thesis. San Francisco State University, 2015. URL: http://sfsu-dspace.calstate.edu/bitstream/handle/10211.3/141234/AS362015ENGRA43.pdf.

[3] Helene Brashear et al. Using multiple sensors for mobile sign language recognition. Georgia Institute of Technology, 2003. URL: http://www.cc.gatech.edu/~thad/p/031_10_SL/iswc2003-sign.pdf.

[4] National Institute on Deafness and Other Communication Disorders. American Sign Language. June 2015. URL: https://www.nidcd.nih.gov/health/american-sign-language.

[5] Mark Difranco. GUI without the G: Going Beyond the Screen with the Myo Armband. Mar. 2014. URL: http://developerblog.myo.com/gui-without-g-going-beyond-screen-myotm-armband/.

[6] Aly A. Farag and Shireen Y. Elhabian. A Tutorial on Data Reduction: Linear Discriminant Analysis. Oct. 2008. URL: http://www.di.univr.it/documenti/OccorrenzaIns/matdid/matdid437773.pdf.

[7] Kirsti Grobel and Marcell Assan. “Isolated sign language recognition using hidden Markov models”. In: IEEE International Conference on Systems, Man, and Cybernetics. Computational Cybernetics and Simulation. Vol. 1. IEEE, 1997, pp. 162–167.

[8] Apple Inc. Swift. A modern programming language that is safe, fast, and interactive. 2016. URL: https://developer.apple.com/swift/.

[9] Waleed Kadous et al. GRASP: Recognition of Australian sign language using instrumented gloves.

[10] Thalmic Labs. Myo Gesture Controlled Armband. Sept. 2013. URL: https://www.myo.com/.

[11] Jeroen F. Lichtenauer, Emile A. Hendriks, and Marcel J.T. Reinders. “Sign language recognition by combining statistical DTW and independent classification”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 30.11 (2008), pp. 2040–2046.

[12] Ross E. Mitchell et al. “How many people use ASL in the United States? Why estimates need updating”. In: Sign Language Studies 6.3 (2006), pp. 306–335.

[13] Daisuke Nishikawa et al. “EMG prosthetic hand controller using real-time learning method”. In: IEEE International Conference on Systems, Man, and Cybernetics. Vol. 1. IEEE, 1999, pp. 153–158.

[14] Cemil Oz and Ming C. Leu. “American Sign Language word recognition with a sensory glove using artificial neural networks”. In: Engineering Applications of Artificial Intelligence 24.7 (2011), pp. 1204–1213.

[15] Zhihua Qiao, Lan Zhou, and Jianhua Z. Huang. “Effective linear discriminant analysis for high dimensional, low sample size data”. In: Proceeding of the World Congress on Engineering. Vol. 2. Citeseer, 2008, pp. 2–4.

[16] The Deaf Society. INSIGHTS INTO AUSLAN. July 2015. URL: http://deafsocietynsw.org.au/documents/SignLanguage1Handouts.pdf.

[17] Thad Starner, Joshua Weaver, and Alex Pentland. “Real-time American Sign Language recognition using desk and wearable computer based video”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 20.12 (1998), pp. 1371–1375.

[18] ASL University. Basic ASL: 100 First Signs. URL: http://www.lifeprint.com/asl101/pages-layout/concepts.htm.

[19] William Vicars. Sign Language Alphabet. 2007. URL: http://www.lifeprint.com/asl101/topics/wallpaper1.htm.

[20] C. Vogler and D. Metaxas. “Parallel hidden Markov models for American Sign Language recognition”. In: The Proceedings of the Seventh IEEE International Conference on Computer Vision.

[21] Christian Vogler and Dimitris Metaxas. “A framework for recognizing the simultaneous aspects of American Sign Language”. In: Computer Vision and Image Understanding 81.3 (2001), pp. 358–384.

[22] Zahoor Zafrulla et al. “American Sign Language Recognition with the Kinect”. In: Proceedings of the 13th International Conference on Multimodal Interfaces. ICMI '11. Alicante, Spain: ACM, 2011.
