Nonverbal Vocal Interface

(1)

Rochester Institute of Technology

RIT Scholar Works

Theses

Thesis/Dissertation Collections

2006

Nonverbal Vocal Interface

Michael J. Murdoch

Follow this and additional works at:

http://scholarworks.rit.edu/theses

This Thesis is brought to you for free and open access by the Thesis/Dissertation Collections at RIT Scholar Works. It has been accepted for inclusion in Theses by an authorized administrator of RIT Scholar Works. For more information, please [email protected].

Recommended Citation

(2)

G0GGOOOOQ G080Q

OG0000080

Michael J. Murdoch

June

29, 2006

Master

's thesis submitted to the faculty of the B. Thomas

Golisano College of Computing and Information Sciences in

partial fulfillment of the requirements for the degree of

Master of Science in Computer Science.

Chair

Joe Geigel

Reader

Warren Carithers

Observer Carl Reynolds

(3)

(4)

Thesis/Dissertatio Author Permission Statement

Title of thesis or dissertation: _ _

-L...I'--'

tJ.

e.LL=--

~

--'---"

6

""

bJ

'-"

(

'---"--

~

-'="

t}(

'---""-=--

I

_

~

"--'---~---'---~'-""=

~

_ _ _ _ _ _

Name of

authO"

~

&h

E

.

~d.J,

Degree:

J1

b

~=--

-Program: v <:':: ~ L.L.

College:

<1

Cc I

3

I understand that I must submit a print copy of my thesis or dissertation to the RIT Archives, per current RIT guidelines for the completion of my degree. I hereby grant to the Rochester Institute of Technology and its agents the non-exclusive license to archive and make accessible my thesis or dissertation in whole or in part in all forms of media in perpetuity. I retain all other ownership rights to the copyright of the thesis or dissertation. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation.

Print Reproduction Permission Granted:

I,

Michael J. Murdoch

, hereby grant permission to the Rochester Institute

Technology to reproduce my print thesis or dissertation in whole or in part. Any reproduction will not be for commercial use or profit.

Signature of Author: _ _

....:.M..:..:.:.:ic::..:.h.:..:a::..::e:..:.I..::.J.:...:. Mc..:...:..:::u::..:.r..:::d...:::o...:::c.:...:h _ _ _ _ _ _ _

Date: _ _ _ _ _

Print Reproduction Permission Denied:

I, , hereby deny permission to the RIT Library of the Rochester Institute of Technology to reproduce my print thesis or dissertation in whole or in part.

Signature of Author: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ Date: _ _ _ _ _ _

Inclusion in the

RIT

Digital Media Library Electronic Thesis & Dissertation (ETD) Archive

I,

Michael

J

Murdoch

, additionally grant to the Rochester Institute of Technology

Digital Media Library (RIT DML) the non-exclusive license to archive and provide electronic access to my thesis or dissertation in whole or in part in all forms of media in perpetuity.

I understand that my work, in addition to its bibliographic record and abstract, will be available to the world-wide community of scholars and researchers through the RIT DML. I retain all other ownership rights to the copyright of the thesis or dissertation. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation. I am aware that the Rochester Institute of Technology does not require registration of copyright for ETDs.

(5)

ooooooeo

Nonverbal vocal _interface, _meaning the use of non-speech vocal sounds such as "oooh"

and

"ahhh"

asinputtoa_computer,provides an

interesting

and usefulinput modality for a graphical user interface. Nonverbal vocal interface is a novel improvementover speech-basedsolutionsbecausevoicedsounds_maybe_smoothly

modulated, meaningtheyarewellsuitedto control ofcontinuousvariables such as cursor_position, while spoken commands are

inherently

discrete. Agraphicaluser interface isan excellent environmentforvocal inputbecause instantaneous visual feedback is crucial to usability, enabling users to see the results of their vocalizations and learn the interface _very quickly. _Continuously voiced sounds maybe easilyand

independently

modulatedindimensions such as_{volume, pitch,} and vowel. These_{dimensions may be}usedtoaugment afamiliar input devicesuch asthe mouse,addinganotherdegreeoffreedomto theinteraction. For_example, a mouse-based_paintingprogram_maybe improved

by

_usingvocal volumetocontrol brush size while painting. Vocal input _{may alternatively} be used without other input_devices,forexampleto controlthecursorintwodimensions. Thisoffers an opportunity to improve access to _{computing for}users unableto operate a mouse. In this _thesis, the use of nonverbal vocal interface for graphical interaction is explored. Vocal dimensionsof_volume, _pitch, andvowelare detected inrealtime

(6)

(7)

OGOOOOOOOOGOOOOO

Iwould liketo thankProfessor_{Joe Geigel for supporting}thisworkand _giving me thefreedomtoexplore a crossroads offieldsof study. I_additionallythankWarren

Carithers and Carl Reynolds fortheir service on _my committee. Without

listing

names, Iexpress _mygratitude to the volunteers whosubmitted _{themselves to my} vocal pitch/vowel experiment. Theirtimeandfeedbackwereinvaluable.

Most_importantly, Ithank _my_wife, _Meredith, for her patience with the extended

(8)

(9)

oooooooo

l _INTRODUCTION ₁₃

1.1 User Interface ₁₃

1.2 Vocal Input ₁₄

2 DISCUSSION 15

2.1 _{Multimedia Art} ₁₅

2.1.1 Audiovisual Relationships ₁₅

2.1.2 Audio Interactions 16

2.2 Human-Computer Interaction ₁₇

2.2.1 Vocal Interface 18

2.2.2 _EvaluatingCursor Control 18

2.3 PitchandVoice Analysis ₁₉

2.3.1 Volume Detection ₁₉

2.3.2 PitchDetection 20

2.3.3 VowelDetection 22

2.4

Summary

23

3 HYPOTHESIS 24

4 EXPERIMENTS 26

4.1 Experiment1 26

4.1.1 Description 26

4.1.2 Implementation ₂₇

4.2 Experiment2 ₂₇

4.2.1 Description ₂₇

4.2.2 Pitch Detection 28

4.2.3 VowelDetection ₂₉

4.2.4 UserTest 30

5 RESULTS 33

5.1 Experiment1 ₃₃

5.2 Experiment2 ₃₄

6 CONCLUSION ₃₉

6.1 Experiments 39

6.2 Future Work 40

6.3 FinalThoughts 40

7 BIBLIOGRAPHY 43

A IMPLEMENTATION ₄₅

A.i Experiment1 ₄₅

(10)

A.2.1 Pitch Detection 46

A.2.2 Vowel Detection 48

A.2.3 User Test 49

A.2.4 NotesonSpeed 5

B CODE 51

B.i Experiment1: soundVolDraw2 51

B.2 Experiment2: sound_analysis2 53

B.3 Experiment2: cursorTest 58

(11)

ooooooo

o

oooooo

Figurel: Analysisof_{medium-frequency}

"oooh."

28 Figure 2: Analysisof

low-frequency

"oooh."

29 Figure3: Analysisof_{medium-frequency}

"ahhh."

30

Figure4: Usertestpractice screenfromcursorTest program 31

Figure5: Example

drawing

_usingvocal volumetocontrolbrushwidth ₃₃ Figure 6: Example

drawing

_usingvocalvolumetocontrolbrushwidth ₃₄ Figure7: Usertesttimesover a series oftests_showingthe

learning

curve 36 Figure 8: _{User category} averages for the first two _test, _excluding extremes,

shownas afunctionof spatialdirection ₃₇

(12)

(13)

O OOOOOOOGOOOO

1.1

User Interface

User interface is a critical component of modern computer _systems; it is the

windowthroughwhich a user _{simultaneously}expressescommandsto the machine

and receives feedback about how those commands are

being

interpreted.

Commonly, user interface combines mechanical input with visual feedback, as is

the case with mouse pointer input. The metaphor ofthe computer screen as a

planar_desktop,which_{may be} navigatedintwodimensionswith a mouse or other

pointingdevice, isubiquitous, easilyunderstood,and effective. Inputdevicessuch

as a mouse, trackpad, thinkpad,joystick, or

drawing

_tablet, are all _extremelywell suitedfor controllinga pointerinatwo-dimensional environment. _{However, they}

arelimited

by

theirtwo-dimensionality.

Often,auser needs morethantwodimensionsofcontrol, forexample, tocontrola

third spatialdimension. _Or, athirdorfourth degree offreedom may beuseful to

control other aspects of the user interface. _Navigating a _map, for _example, is

primarilyatwo-dimensional_task, which_maybe _{easily done using}a mousetopan in the cardinal _directions, but more control is useful. Otherdegrees of_freedom,

such as zoom or_rotation, require additionaldimensionsof userinputthatare not

typicallyprovided

by

themouse, exceptwith modifierkeys, scroll_wheels, or other controls. _Drawing programs provide another example suited to two dimensions but improved

by

more. The mouse pointer canlocate abrushtoolon the_screen,

and the mouse button provide

binary

input to draw or not draw, but another degree offreedom to controlbrush pressure,line width, or _opacitywould be very

beneficial.

Some users _{unfortunately} _maybe _physicallyunable touse a mouse at _all,

forcing

them to struggle to find alternative means to command a computer. Whether

adding degrees of freedom to those provided

by

the mouse or _presenting the

primarymeans of controlto animpaired_{user, the}needforadditionaldimensions

(14)

1.2

Vocal Input

Perhapsthisneedforalternativeoradditionaldimensionsof userinterfacecontrol

may be met through the user's voice. Speech recognition has a _{long history} in computer_science,and itoffersthewonderful promise of aconversational, natural

language userinterface. _However,thisdreamremainsdistant, andcurrent speech

recognition systems often have limited _vocabulary or require calibration to the

voice of agiven user. Most_importantly, regardlessifthesesignificanthurdlesare

overcome, speech recognitionisnot well suitedto thelow-leveluserinterfacetasks

described above. Speech recognition lends itself to dictation and discrete

commands, notcontinuous variableslikepointernavigationandbrushsize.

Amore

interesting

proposal is touse _{nonverbal, meaning}non-speech, vocalinput

for user interface. The human voice is _wonderfully controllable and, just as

important, continuously variable. Voice volume _may be modulated, up or down,

slowly or _quickly, and a vocalized tone can be interrupted

by

_tonguing or other staccato sounds. Consider a lengthened voiced vocalization, a sustained sound

like "oooh"

or

"ahhh."

The vocal pitch, or fundamental frequency, is _easily

modulated_upanddown,held_constant,or wavered with vibrato. _{Alternatively,}the

vowel sound _maybe _modified, _among

"oooh," "ahhh," "eeee," "ohhh"

and others.

All ofthese modulations canbe considered degreesof_freedom, and_{significantly,}

they are for the most part each _independent, _meaning, for _example, that a vocalization can change pitch regardless of whether it is _changing vowel at the

(15)

oooooooooo

Considering

the context of nonverbal vocal interface requires review of work in

several different fields. Voice and pitch _detection, human-computer _interaction,

and multimedia _art, at _least, are relevant to the _study of nonverbal vocal user

interface_{design, implementation,} andtesting.

2.1

Multimedia

Art

Usingnonverbal vocal cues and inputfor computerinteraction is not an _entirely new _concept; _however, most of the _existing examples of such work lie in the

conceptual and artistic realm. _{Here, they} are _effective, _interesting, and fun, but often not as _{pragmatically} useful as _they _potentially could be. _Nonetheless, a reviewis informative.

2.1.1

Audio Visual Relationships

A fun example to begin with is a performance art piece

by

[LEVI 03], further

described in [LEVI 04]. _Using a combination of computer vision and

speech/sound _analysis, _they provided an audiovisual performance in which vocalizations are visualized in real time on a projection screen behind the

performer. This allows a series ofinteractive segments in which theirvoices are

"shown"

invarious ways. Mostoftheperformance segments provide visualization

of the

performers'

vocalizations, i.e., showing on the screen an emanation from their mouths as _they proceed to speak or sing. Sound is visualized as waves, clouds,bubbles, and othersimiles. In one portion of_{the performance,}vocal pitch

is used to control the direction of a paintbrush on _{the screen,} with

descending

pitch_turningthe brushclockwise, and_ascending counter-clockwise.The result is rather_chaotic, butan

interesting

tidbitinthissegmentis the use ofa_sharp

"shh"

toerasethepainting. Theauthors speakof synesthesia and_{phonesthesia,}inwhich

phonemes,orbasicelementsof_speech,are perceivedas

having

color or_shape, asa

basis fortheirworkandmotivationforthe_studyof speechvisualization.

Other metaphors and _sound/vision relationships are discussed in [EDMO _04], which presents a_variety of artists and examples who have combined synthesized

music and computer graphics in various ways. _They present some

interesting

(16)

visual, such as betweencolors and sounds andbetweentone intervalsand colors.

Theycite_Jewanski,whostudiedperceptualdifferences betweencolorsandsounds. For example, "dissonance in musicis more unpleasant than dissonance in color; two colors creates a new unit, while two tones does not create a new _tone;

perception of sound is normally _relative, while perception of color is more absolute."

There ismention oftimbreas

being

relatedtocolor,butnoinformation

about specific relationships. The authors list all the ways that audio and visual

parameters can be or have been linked: mathematically, metaphorically, intuitively, or randomly. _{Unfortunately,} this leaves the entire world open and doesn't provide _any true insight as to which works well and whichworks poorly. Thefocusis_reallyon_performance, _art,andinteraction.

Several audiovisual artists and examples oftheir work are described

by

[EDMO

04]. Adriano Abbado uses graphical visualizationforhissynthesized sounds, and uses filtered noise to synthesize sounds. _Also, he produced an interactive

arrangementwherein

users'

motions relativetofloor-mountedsensors cancontrol

independentlyan aural stream and a projected visual stream. JackOx,while part of _COSTART, created a timbre-based vowel/color system for vocal changes. YasunaoTonesynthesized soundrepresentationsofChinesecharacters, and set_up a softboardinstrumentthatdetectscharacters

being

drawnandproduces sound.

2. 1

.2

Audio

Interactions

Not all artistic examples combine the visual with the audio. An _interesting exampleof an_entirelyauralinteractionisprovided

by

[BOHL_04] and[BOHL05]. The authors implement a rather_wacky _{system, the} Universal _Whistling Machine

(UWM), that responds to human whistles and replies with similar sounds, as if

engaging in conversation. _They mention el_Silbo, a whistled _vocabulary used on Islade laGomera, inthe_Canary_Islands,as well as_Kuskoy,usedinTurkey. Each is a "reduced

language,"

simpler thantrue speech but still _containing _vocabulary and grammar: with _prosody but without phonemes. This is

interesting

background, but intruththeUWMdoes notimplement anykindoflanguagewith its_whistlinginteraction.

The responses ofthe UWM to humanwhistles are described as _entirely _random,

drawing

froma

long

list oftransformationsofthe human's input. The longerthe

back-and-forth _interaction, the more complex the transformations become. In

termsofimplementationofthe_UWM,audio_input,sampled at 44.1_kHz, isfedtoa FFT-based pitch tracker, then it is analyzed in statistical moments

-standard deviation, kurtosis,and median

-to determineif it isawhistle

deserving

response. Whistle synthesisis basedonsubtraction,startingwithwhite _{noise, then}low-pass

filtering followed

by

tuned band-pass

filtering

to select the desired _frequency,

(17)

Reactive transformations of human whistles include pitch _translation, contour

inversion,time stretching, compression,andinversion.

The UWM system uses basic computer vision to recognize the presence of a

person, andthenwhistles toattract attention. Once the person responds withhis

own _whistle, the game _begins, andthesystem samplesand modifies therecorded

whistle and replies. This continues until the person becomes _bored, which is

apparentlya_surprisingly

long

_{time, but}what appearstobe a conversationis _truly muchlessthanthat. Theauthorsdescribe whistlingas a_low-level,low-bandwidth form of_{communication,} and call ahuman interactionwiththe UWM a"precursor

to languaging." _However, _nothing is conveyed to either _party, and _nothing productive, beyond the person's amusement, comes from the interaction. The UWM is an

interesting

_toy with which to _study human-machine _interaction, especially

by

watching people _try to "fool the

system"

and see what _they get in response.

Amorefocused _interaction, albeit still without much productivity, is suggested

by

[OLIV 97], inwhich a vocalinputresultsinauralfeedback.The authorsdescribea

singing interface foraninteractive environment, the_Singing_Tree, whichis part of a larger effort called the Brain Opera. The system responds to vocal input

-positively ifthe singer holdsa constant_pitch, and _{negatively if}the singer wavers. Therewardforconstant pitchisauralfeedback in

harmony

with angelic voices and

synthesized _instruments, while the_{penalty for}_wandering pitchis dissonance and

ugly rhythm. The system uses a real time DSP tool kit for pitch detection and vowel formant _analysis,

focusing

on

detecting

changes in either rather than

absolute pitch or vowel. _Velocity,_thus, ismoreimportantthanabsolute value. No detail isprovidedonthepitch or vowel_detection,unfortunately.

The authors state that nonverbal input and aural feedback will be important to

future "smart"

application _interfaces, yet their

lofty

goals are _only _meekly

approached. _Having such a simple goal of constant pitch makes for a rather

narrow user _interface, though perhaps in the context ofthe entire Brain Opera it becomesinteresting.

2.2

Human-Computer Interaction

Therehasbeensome worktoemployvocalinputtomore productiveendsthanart

provides. Ofcourse, speech recognition is awell-establishedfield, andit is atthe

edges ofthe speech recognition literature that appear an

interesting

non-speech

example and an analysis of how appropriate speech-based solutions _may be to

(18)

2.2. 1

Vocal Interface

[IGAR_01]describesan

interesting

add-onto aspeechrecognition systeminwhich

nonverbal sounds are used to control variable parameters. _They suggest that nonverbal vocal cues can be used for three types of control: first, "control

by

continuous

voice,"

in which theuser makes a continuous sound foras longas he

wants to continue _activating a control. Their example forthis is avolume knob,

where theuser says "volumeup,"

then says

"ahhh"

whiletheknob turns, stopping

whenthe knob has turned as faras is desired. Second is "rate-basedparameter

control

by

pitch,"

which is _essentiallyan extension ofthe first interaction. They

provide an example of _scrolling, in which a user says, "move

up,"

then makes a

continuous

"ahhh"

whose pitch determinesthe speed ofthe resulting motion. As

in theformer continuous voice _control, the interaction stops whenthe userstops

vocalizing. The third interaction proposed is "discrete control

by

tonguing," in

which a parameteris changed indiscretesteps via hard-tonguedvocalizations. A

user who says "channel up, ta, ta,

ta,"

will see the channel change three steps

upward. _Theymention"breathed

sound"

as aless _tiringandlessannoying wayto interface,but don'texplain what_theymean.

The authors discuss several implementation details. The input audio signal is

digitized to 16 bits at 22 _kHz, and the FFT is computed _using a 2024-sample

Hanning window. Frequencies below _{375 Hz} are removed to avoidbackground

noise,and athresholdamplitudeisusedtodetectvoicing. Pitchchangeis detected

using dot products of12 msec _{time-shifted,}

+/-43 Hz frequency-shifted spectra.

They comment that absolute pitch is not computed, and that some

users'

pitch

changesare not_robustlydetected.

This paper is_veryimportant to thepresentdiscussion. Theauthors recognize the

immediacyand _simplicityof_interpreting nonverbal_{vocalizations,} buttheirvision

islimitedsomewhat,

focusing

on_{the complementary}use of speech recognition and

nonverbal vocal interface. _They mention

having

implemented their controls in

video games as well as the potential applications for

hands-busy

immersive

interactionssuchascar navigation while

driving

as well as handicappedaccess. In

all ofthese cases, especiallyinthe examples _they_describe, real-time audio/visual feedback is crucial to the success of the vocal interface. _{For example,} in the

volume-control _{example, the} user _simplystops _saying

"ahhhh"

when the volume

reacheshissatisfaction. While providingagood_start,_theydon'tmakethe

jump

to

eliminate speechinputentirely.

2.2.2

Evaluating

Cursor

Control

There is _plenty of evidence that graphical user interface tasks are not _especially

(19)

israther_{difficult; however,}an

interesting

attempt at

doing

so is describedin [DAI

04]. In their _{implementation,} the authors divide an interface screen into a _3x3

grid of_boxes, numbered one through nine in row-major order. The user, whose

goal istoselect anicon on_{the screen,} speaks the number oftheboxthat contains

the_icon,then that_{box is recursively}dividedintoa_3x3 grid. Forexample, theuser

says, "Six,"

andthelinesare redrawntosubdivide what wasbox6. Thiscontinues

twoor three timesto reachtheproper spatial _resolution, and the user _eventually

says "Select

five,"

or _something, which results in a mouse-click on the

disambiguated icon inbox 5. Additional commands move the whole grid in four

directionsorback_upa recursionlevel iftheusermakes a mistake.

Interestingly, theauthorsofthegrid approachdescribea_{usability study}performed

with ₂₄testsubjects familiarwith mouse_and/orkeyboard based navigation. The

grid-based speech recognition program was implemented _{using Visual} Basic and

IBM'sViaVoiceprogramfor speech recognition. The subjects were given_training

tasksand

finally

testedforspeedin_selectingon-screentargets. Their testinvolves

navigating fromthe center ofthe screento each of a _ring oftarget icons. Lotsof

results are_provided,_but,_{notably, the}averagetime toselect atarget_usingthe

grid-based program was _nearly 8 seconds. This is cited to be 33% faster than other

speech-based navigation methods but is obviously much slower than _using a

mouse. _However, it is exceedinglyusefulifamouseis not an option. Theauthors

conclude

by

_recognizingthat_{"it is generally}acceptedthatspeechisnottheoptimal solutionfornavigation-oriented activitiesifother modalities canbe

used."

2.3

Pitch

and

Voice Analysis

Alternative modalities, such as nonverbal vocal _interface, require the tools to

analyze audiodataand computerelevant characteristics. Volume,pitch, and vowel

have been identified as vocal dimensions that _may be modulated

by

a user and detected

by

a computer system. Priorworkdescribeseach ofthese_veryneatly.

2.3. 1

Volume Detection

The amplitudeof a _periodic, or_wavelike, signal _maybe estimatedbased on either

its peak value or its root-mean-square (RMS) value over a given interval.

Generally,it is more robusttocompute RMSamplitude,asthistakes intoaccount

all ofthe samples rather than _{simply the} extremes. Peak estimation, _especially

with sparse samples, is error-prone and

likely

to fluctuate with noise or other

variables. It is assumed that RMS amplitude is a well-known concept, and no

researchwas doneto extendthisbasiccomputation. Infact, most audiolibraries

includeamplitude orvolume-detectionfunctionality,_presumably_RMS-based, as a

(20)

23.2

Pitch

Detection

Pitch is a perceptual _quantity that is _generally simplified and equated with f-mdarrierital frequency-. Detectio:: of fundamental frequency, the inverse of fundaniental period, is a topic with a rich

history

in speech recognition and computer music

Two

interesting

methods are strmmarized

by

[MJDD 03]. The first, harmonic

product spectrum _(HPSj, works withthe Fourier_frequencyspectrum of an audio signal. The FFT signal is multiplied

by

downsampledversions ofitself, andthe resultis asinglepeak

indicating

thefuncLamental frequency. Thismethod is not discussedmuch_elsewhere,perhapsbecauseofthelimitationslisted

by

the author,

induding

asensith.it;.*

toshortwindowsizesduetoquantization ofthefrequencies.

The second _method, or

family

of _methods, discussed

by

[MfDD 03], is autocorrelation, which _{roughly involves comparing} a signal to variously time-shiftedversions ofitself The time-sriifted _copyof an audio signal and thesignal

itselfare compared

by

_computing theirproduct as afunction of shift The time

shiftat whichthe autocorrelationfunction_(ACF) isrniiximized is indicativeofthe period ofthesample, winchis _inverselyrelatedtoitsfrequency. Thefundamental

periodis_typicallythefirstmaximum ofthe autocorrelationfunction Themethod

is rather _brute-force, and

detecting

the extremes relies on differentiation and

searching for sign changes. The author describes several shortcuts to autocorrelation Forexample, pitch is _generallydetectedformultiple windows of samplesthatare _temporally_correlated, inwhich casethe entire windowneed not

besearchedfrom leftto rightbuttheperioddetectedintheprevious window_may be used as a _starting _guess, and the search _{proceeding from} there. _Also, he suggests_quittingafter

finding

thefirstminimum, or_mayimnm, asthatis_typically indicativeofthefundamentalfrequency.

Autocorrelation_{is ubiquitously}mentioned asthesimplest_waytodetect_periodicity

in audio signals. _Manyvariations on the_(ACF) arefoundinthe literature. Inan

early paper, [ROSS 74] introduces the average magnitude _{difference function} (AMDF),whichistheabsolute value ofthe differenceoftheoriginal signal andits

time-shiftedself. Of_{course, the}AMDF _generallyapproaches zerowhere theACF shows apeak. Pitchdetectionfor humanvoices_usingeitherdetectionmethod_may

be simplified through intelligent constraints. Typical values for _{time shift,} or period, fall between3and _15ms,whichcorrespondtofrequenciesbetween70 and 300Hz. _Further, the shifted signal _{may be} shortened even _further, _{to two} pitch periodsforthelowestexpectedfrequency,forexample 100Hzfor amale_speaking

voice. These constraints, though dated and perhaps less crucial with current

(21)

speech, theypresumably are _similarly useful with otherhuman _utterances,verbal

or nonverbal.

Some more unusual pitch-detection algorithms are presented _{in [SOOD 04]} and

[TERE 02]. The lattersuggests arather

interesting

twistonACF-liketimeshifting,

in which three _dimensions, or "state

variables," are formed from time-shifted

copies ofthe original signal. For example, dimension one is the original _signal,

dimensiontwoistimedelayed 12 _samples,anddimensionthreeis timedelayed₂₄

samples. The method for _elucidating fundamental period is rather unclear but

involves_computingaEuclidean distance inthe3-Dstate variablespace.

A much better explained fundamental

frequency

detection algorithm based on a

variant of AMDF is shown in [CHEV 02]. In this paper, de Cheveigne and

Kawahara describe alist of simple but effective improvements beyond basicACF

andpresent afull methodcalledYIN. _{First, they}address theproblems withACF.

A distinction is made between actual _{autocorrelation,} which uses a fixed

rectangular window _size, and "modified autocorrelation,"

or _covariance, which

shrinks the window as the time shift increases. The ACF is prone to _choosing a

too-high _period, while the covariance function is prone to _{choosing the zero-lag}

peak when not constrained

by

a threshold. As a baseline for their later

improvements, the authors found a 10% error rate for pitch detection using the

ACF with a speech database.

Concluding

that the ACF is too error-prone, they

movetoa differencefunction inwhichthesum ofthe squared absolutedifference

(SSADF) is computed. This function has minima wherethe ACFhasmaxima, but

the_{relationship is}not simpleinversion. Thedifferencefunction islesssensitiveto

signal amplitude changes than is the _ACF, _providing a 1.95% error rate with the

speech_database,whichisquite animprovementover10%.

Several further incremental improvements are offered. First is the use of the

cumulative mean normalized difference function _(CMNDF), which divides the

SSADF

by

the cumulative average ofitselfat shorter

lag

values. This function is

unity at zero _lag, which prevents period errors at this point. A further improvement isto use parabolic interpolation when

detecting

the minima ofthe

CMNDF. Thisgets aroundtheintegerquantization thatshift computations allow,

providing accuracygreater thanhalfthe_samplingperiod. This _may allow coarser

sampling without loss of _accuracy, which means less computation. A final

improvement uses a _two-step estimation process, which is similar to median

smoothing the result, presumably on the time _{axis, using} period estimates from

neighboring timevaluestohonetheperiod estimate atthepresenttime.

The YINalgorithm is presented with practicalconstraints and ranges in mind. It

requires no upperlimitforthe fundamental _frequency,which _maybe more useful

(22)

must be at least as

long

as the longest expected period. _Further, fundamental

frequency

estimation requires enough signal duration: two times the longest expected period. Authors used a ₂₅ msec window with ₂₅ msec period search

range,

implying

40 Hz lower

frequency

bound. A low pass filter was used to remove high

frequency

noise: convolutionwith a 1 msec squarewindow, implying

zero signal at 1 kHz. In all, this discussion of pitch detection, failure modes, and

smartimprovementsprovides anexcellentplatformfor furtherwork.

2.3.3

Vowel Detection

Voweldetectionisamorenuancedchallenge thanpitchdetection. The differences

between "ahhh"

and

"oooh"

are_veryapparent inthe mouthofthe vocalist,butthe

sampled sounds _may have the same pitch and the same volume. The difference

between thevowelslies in the pattern ofthe waveform. Viewed in the_frequency

domain, the relative spectral heights beyond the fundamental _frequency

distinguishthem.

Anonline_resource,[CHIT _03] describesthe detectionofformants,or peaks inthe

Fourier

frequency

spectrum, and how _they _may be used to _identify vowels

by

comparing the measuredformants to prototypical vowelformants. Formants are

caused

by

physical vibrationsin thevocal_tract, anddifferentvowels show_clearly different

frequency

distributions. Formant analysis _may be used for the

identification of _phonemes, which are the atoms of vocal _speech, and of course

vowel sounds comprise a subset of phonemes. Detected formant characteristics

are comparedtothose offiverecorded ground-truth_vowels, andthrough a series

of_thresholds, assignedto one or more ofthem. Due to the number ofthresholds

involved and the ground-truth requirement ofthe sampled _{vowels, this} seems a

rather tenuous method. _However, thebasic idea of

identifying

vowels based on theirspectral characteristicsissound.

Another vowel recognition _method, shown in [SENA 05], again compares

measured vowel data to prototypical data, but the transformation involved is

rather unique. The authors point out that vowel sounds made

by

different speakers and at different frequencies have similar _waveforms, but different

frequency

distributions due inparttopitchdifference. Thewaveformsaresimilar

but not

linearly

_{related, meaning that} the difference is notdescribed

by

a simple timescaling. The authorsbegintheiranalysisintheFourierdomainandcompute

the spectralenvelope _usinga cepstrum_{transformation,}which_essentiallytakes the

inverse Fouriertransformofthe

log

ofthe realpart ofthe Fourier_spectrum,_using

alow-pass filtertosmooththeresult. Thespectral envelopeisthenmodified_using

the scale _transform, an evolution of the Mellin _transform, which

basically

introduces scale invariance, _meaning to the problem at hand that it removes

(23)

Once in the scale-invariant _space, the transformed spectral envelopes are

comparedwiththose of_modeled, prototypical_vowels,and adecisionis made asto

whichvowelbest fitsthemeasureddata. The authorsusea modeledvocal tract to

create thevowel_prototypes, afactthatisn'tcrucial to the algorithm andwhich_is,

say, intractablewithout some speech-synthesis infrastructure. The time involved

inthe multipleFFTtransformationsto reachthescale-invariant spectral envelope

is almost _certainly too great for a real-time _{implementation; however,} the basic

ideaofFFTspectralshapecomparisonissound.

2.4

Summary

The work described in these disparate fields comprises the basic components

necessarytocreateand evaluate a nonverbal vocalinterface. Volume detectioncan

beapproached via simple andwell-known RMScomputations. Pitchdetectionhas beenstudiedinvarious_ways,andthemethoddescribedin [CHEV_02] showsgreat

promise for its robustness and accuracy. Vowel _detection, while somewhat less

robust that pitch _detection, has been shown in a few different implementations. Theconcept ofFourierspectral shapeanalysis seemsquite promising.

Thus, whilethe use of nonverbal vocalinput in a graphical user interface hasnot

beenshown_before,_{the necessary}elementsfor

detecting

vocal variableshave been

developed_separatelyand_essentiallyneedtobe pulledtogetherand adaptedto the

task. Oncethisis done,evaluation ofthe_usabilityand usefulness ofthenonverbal

vocalinterface _maybe made

by implementing

a cursor control system and_usinga

(24)

O 000000000

Based on the _array oftechniques that have been developed for analyzing speech

and_song _signals, it isworth_askingthequestion ofhowthese _{may be}extendedto

respondto nonverbal vocalinput and usedto provide a novel user interface. To

review, there are at least three degrees of freedom in a continuously voiced

vocalization; inorder of

increasing

detection _{complexity, these} are volume, pitch,

and vowel shape. _One, _two, or all of these together should be usable as independent dimensionsof user-controlledinputtoa userinterface.

Optimal mapping ofthese vocal dimensions to graphical user interface controls will depend on the specifics of a given application or environment, but many

possibilities appear

intuitively

useful. Volume has connotations of _intensity,

strength, size,and_pressure, andthereforeshouldbemost useful tocontrol similar

aspects, such as pen size or brush pressure in a _painting program. Absolute

volume is

fairly

_easy forpeopleto distinguish and_produce, and_maybe adequate for _controlling the interface. _{Alternatively,} volume _change, crescendo and

diminuendo, could be used for control. The derivative of amplitude can be computedtodescribethedirectionof volume change.

Pitchhasconnotations of_{height, distance,} and_size,andthusshouldbewell suited

to control a vertical spatial dimension orlocation, a zoom _factor, or perhaps pen

size or brush pressure. People are _generally good at

detecting

or _producing

changes in _pitch, and _they are less capable of_producing or

identifying

absolute

levelsof_pitch,soitmakes intuitivesensefortheuserinterfacetorespondtopitch

change ratherthanpitchlevel.

Vowel _{sound, the} difference between "oooh" and

"ahhh,"

for example, does not

intuitively

correlate with spatialdimensionsor other variables. Becauseof_{the way}

the mouth moves to produce these sounds, opening and _{closing, respectively,}

perhaps a zoom factor would make sense as a user interface response.

Interestingly, however,vowel soundis bothacontinuous anddiscrete variable. A

user can modulate _smoothly between

"oooh" and

"ahhh,"

making a sound like

"oowahh,"

or _simply vocalize one or _{the other,} _meaning that as a user _input, it

couldbeusedtocontrola variable or selectfromaset. Acontinuous variablelike

zoomfactor_maybe controlled_usingsmoothtransitionsbetweenvowels. Discrete

choices, suchas_{selecting among}

drawing

toolslike_{pen, pencil,} or

brush,

couldbe

(25)

to _map vowel shape to a non-intuitively-matched dimensions of _input, except

perhapsthat theuser will requiremore instructionto learnthecorrelation. _Thus, vowelshapecanbeusedto controlthe "other"dimensioninvarious situations: for

example,whilepitch controlsvertical_motion,vowel can controlhorizontalmotion.

Feedback isanimportantfeatureof_anyinterface design. Auser of a conventional

mouse would notbe_verygood at_selectingan absolute position without_seeingthe

cursor on the screen and_{watching it} respond to the action ofthe mouse. _Seeing

the response allows real-time corrections to be applied. _Likewise, pitch change,

when visualized

by

cursor movement or other _feedback, can be calibrated and

controlled

by

the user in a relative sense. While this is taken for granted for

traditional devices likea _mouse, evidence ofthe importanceoffeedback forvocal input is shown in [OLIV_97], inwhich aural feedback is givento the user. _Thus,

tyingthe inputto theeffect it _produces, withoutlatency, seems_veryimportant to thesuccess ofinputmodalities.

It is hypothesizedthat a userinterface canbe made to detect one or more ofthe

user input dimensions of vocal volume, pitch, and vowel. These can then be mapped to a _variety of control dimensions, _singly or in combination, to either

augment aconventional

(26)

O

OOGOOOOOOO

Totest thehypothesisthatnonverbalvocalinputcanbeusedtocontrolsome or all

dimensions of a user _interface, a pair of experimental user interface

implementations has been designed. Each provides insight into the validity and

robustness of aspects of the hypothesis. Experiment 1 aims to augment a two-dimensional mouse-driven input ina _painting_{program, using the} mouse to paint

and_usingvocal volumetocontrolthe size ofthebrush. Experiment2 is designed

toeliminatethemousealtogetherand replaceitwith vocalinputforcursor control.

Vocal pitch is mapped to vertical direction and vocal vowel is used to control

horizontal direction. Eachexperimenthas been implementedandwillbeanalyzed

indepth.

4.1

Experiment

1

4.1.1

Description

The first experiment is designed to showthat a low _complexity vocal dimension

can be used in tandem with the most familiar two-dimensional input _device, the

mouse. _{Specifically,} vocal volume is mapped to brush size in the context of a

simple _paintingprogram. This meansthat thebrushposition is controlledintwo

spatial dimensions _using a conventional mouse while the simultaneous vocal

volume controls the size or thickness of the _brush, _thereby _adding a third

dimensionof control.

The painting program example lends itself to nonverbal vocal control for two distinct reasons. _First, the variable

being

_controlled, brush size, must be varied

continuously, ratherthandiscretely. _{For this reason,} acontinuousvocaldegreeof

freedom is suitable. Discrete speech commands would be rather unwieldy.

Second, real-time visual response providesthe userinstantaneousvisual feedback

to guide his or her use of the vocal volume _input, _enabling simple and intuitive

interaction that is _easily learned

by

the user. The user can _quickly

develop

a

calibration ofhisorhervocalinputtothevisual results seenonthescreen.

Experiment l is meant to be a proof of concept for nonverbal vocal _interface,

meaningitis designedtoshowthat theidea isrobust andthat thesystem responds

inaqualitativelycorrectfashion. Inother_words,it isa_toy

(27)

usefulfor_provingtheconceptissound

-sotospeak.

Quantifying

theperformance

or_accuracyofa_paintingprogramis unnecessarygiventhe_satisfying

immediacy

of

visualfeedback.

4. 1

.2

Implementation

Implemented inthelanguage _Processing [FRY06], theexperimenttakes the form

ofa simple

drawing

programcalled soundVolDraw2. Itbeginswith ablankwhite

windowand_continuouslycomputesvocal volumefrom themicrophone input. To aidtheuser's _{understanding}oftheimpactofhervocal_volume,a

dummy

brushin the upper-left corner of the window reacts to the volume _input, regardless of

whether the mouse button is pressed and the pointer-driven _{brush is actually}

painting,toshowthebrushsize.

When themouse button is_pushed, a randomcoloris selected, meaningthateach

new stroke has a different color. While the mouse is _dragged, a line is drawn whose width is scaled

by

the computed volume value. Further details on the implementation_maybefoundin AppendixA;code_{may be}foundin Appendix B.

4.2

Experiment

2

4.2. 1

Description

In thesecond_{experiment, the}mouse isreplaced_entirely

by

vocal input. Asimple

graphical userinterface was implemented with a cursor whose vertical positionis

controlled

by

vocal pitch and whose horizontal position is controlled

by

vowel

sound. The_variables, pitchand_vowel, _maybe_{independently}_{modulated, allowing}

formotionin any directionwhen changedintandem. Visualfeedback is

inherently

part of _{this experiment,} as the input audio is processed _real-time, and cursor motionis synchronizedto theuser'sinput.

Rather than absolute pitch and _vowel, changes in either vocal variable result in

cursor movement. In an intuitive _fashion, upward pitch change causes upward cursor_motion, anddownwardpitch change causesdownwardmotion. Aseries of

upward-moving vocalizations (perhaps the same sound pattern _repeated) will

sequentiallymovethecursorfurther_upthescreen.

Vowel

directionality

is assigned somewhat _arbitrarily

-"oooh"

is designated _left,

and

"ahhh"

is designated right

-and _again, change in vowel sound results in

cursor movement. _Thus, an _opening sound like

"oowah"

will movethe cursorto theright,while a_closing soundlike

"aaoooh"

(28)

with _{pitch, the} rate of change of vowel affects the speed of the cursor motion,

allowingprecise control.

In orderto evaluate the

feasibility

ofthe cursor control in Experiment 2, a short

user test was implemented in which test subjects move the cursor to a series of

icons onthe screen. It was hypothesized that vocalists, or singers, who are more

adept at fine vocal control and familiar with pitch modulation, would perform

betterthannonvocalists. _Further,itwas assumedthatusers wouldlearnthevocal interface_{relatively quickly}

-thateveniftheirfirst_trywas

fairly

dismal, itwouldn't

takemuchpracticetoimprovetheirperformance significantly.

Experiment 2 was implemented using the language Processing. _Volume, _pitch,

pitch _{velocity, vowel,} and vowel _velocity are computed continuously from the

microphone_input, andtheresults are usedtodirectthecursor onthescreen. The

user _{test program,} called _cursorTest, wraps this

functionality

with a control

schemetopresenttargeticonsonthescreen and recordhow

long

ittakeseach user

toreachthem.

4.2.2

Pitch Detection

Thepitchdetection methoddraws

heavily

on[CHEV 02]. Thealgorithm worksin the time domain and_maybe categorized as an autocorrelation _{method, meaning}

that it compares a section ofthe audio signal to a time-shifted _copy ofitself to

determine what time shift value x provides the best match. _{The periodicity} of

voiced _{vocalizations,} _generally with a fundamental period and a _variety of

harmonics, makesthispossible. _Usually,the fundamentalperiodis thelongest of

the possible _solutions, _meaning the lowest in _frequency,

implying

the shorter

period solutions are

higher-frequency

harmonics.

Figure1: _{Analysis of}_{medium-frequency} "oooh."

Figure 1 shows three plots relatedto an input medium-frequency "oooh" sound.

The _top plot shows the sound signal on a _{time axis,} _showing a characteristic

periodicitythat, fora simple

"oooh,"

[image:28.502.54.448.437.569.2]

(29)

Thesecond plot showsthepitch detection functionon a shiftxaxis. The firstxat

whichthisfunction hasa minimumisthefundamentalperiod,whichistheinverse

of fundamental

frequency,

_{or pitch.} _Similar _plots _for _a

lower-frequency

"oooh"

sound are given_{in Figure 2,} anditisclearthat thefundamentalperiodindicatedis

longerthan thatin Figure_1,_{showing qualitatively}correctbehavior.

Figure2: Analysis_of

low-frequency

"oooh."

Fundamental

frequency,

or _pitch, is computed as the inverse of fundamental

period, and pitch _{velocity is} computed

by

_comparing the current pitch value to previous values. The pitch_{velocity is}usedtomovethecursorinthe_ydirection

by

aproportional amount. More detailonthepitch_{computations,}

including

relevant

formulae,maybe foundin AppendixA; code_{may be}foundin Appendix B.

4.2.3

Vowel Detection

Avowel detection andvowel_velocity algorithm was implemented to differentiate

clearlybetweenthe "oooh" and

"ahhh"

sounds and provide a continuous range of values betweenthose two end points. This is usedto control the cursor inthex

direction.

The vowel detection algorithm was implemented in the

frequency

domain. A

simple algorithmwasdevelopedthatdraws_thematicallyonthat in [SENA05]. It

wasobservedthatthe overallshape ofthe Fourier

frequency

spectrumisdifferent

for various vowel sounds. _{Specifically,} the _nearly pure sine wave of

"oooh"

provides a single_strongpeakinthe

frequency

_domain,whilethemore open

"ahhh"

sound results _{in strong} peaks at the fundamental pitch as well as several

harmonics. The vowel detection method analyzes the relative heights of the fundamental

frequency

andthenextfour harmonics. Thebottomplotin Figure 3

showsthe Fourier

frequency

histogram ofa vocalized

"ahhh,"

overlaid with lines

[image:29.502.50.453.126.264.2]

(30)

The five computed harmonic strength _values, represented

by

a 5-D vector, are comparedwithprototypical vowels. Thetwovowels ofinterest inimplementation are

"oooh"

and

"ahhh,"

which provide_very different relative harmonic strengths, as can be seen

by

_comparing the bottom plot in Figure 1 with that in Figure 3.

Basedontheappearance ofplotslike these,thevowel

"oooh"

isrepresented

by

the

prototypevector_p., _(1, _{o, o, o, o),} andthevowel

"ahhh"

isrepresented

by

p_, (1, 1, 1, 1, 1).

Figure3: _{Analysis of medium-frequency}

"ahhh."

The vowel detection algorithm compares relative heights ofthe harmonics ofthe measured sound withptandp2todeterminehowsimilarit istoeach. Ofcourse,it can computeintermediatevaluesbetweenthese two end pointsbasedon which is more similar.

Similarto thepitch_velocity_calculation,vowel_velocityiscomputed_usingprevious

vowel values. Thevowel_velocitydetection algorithmis rather _noisyascompared with pitch detection so some temporal _smoothing is employed to add some robustness atthe expense of some responsiveness. As a _result, left-right motion

feels slower compared to up-down _motion, and the user must _repetitively voice

something like "oowahh,

oowahh"

to move the cursor across the screen to the

right.

4.2.4

User Test

To evaluate the effectiveness of the implementation of Experiment _2, a user test was constructedto time how

long

ittakeseach usertomove_{the cursor,}_{using only}

his or hervoice, to a _variety of on-screen icons. [DAI _04] presents an excellent

summary of a user test constructed to evaluate a speech-based cursor control

scheme. While theircursor control method was

discarded,

the core oftheir user testwas used as abasis forthepresent usertest.

The test involves a set of eight target icons drawn in a _ring, and the user was

[image:30.502.59.449.156.290.2]

(31)

arrangement is shown in Figure 4, which is a screen-capture from the program

cursorTest. The radius ofthe _ring oficons is 200 _pixels, or about 2 inches on a

typical monitor. Inthe upper-left cornerofthescreen, five plots _appear, _showing

computed values as a function oftime. These are _volume, pitch, pitch _velocity,

vowel, andvowel velocity. These are present for diagnostic _reasons, notbecause the user is expected to understand them or use them for navigation. It is worth

noting that the four lower plots turn red when the volume is below _threshold,

indicating

that theprogramis

ignoring

too-quietinputand not_movingthecursor.

Volume

phi s

-iWij1-iWk1V1ooftr III 1 I

A. -vowerveloisity..

?

The testconsistsof_movingthecursortoeach olthe boxes. Pressspacetobegin1

Figure4: Usertestpractice screenfromcursorTest program.

The usertestprocedure is quite simple. The screen in Figure 4appears whenthe

[image:31.502.53.449.166.564.2]

(32)

vowel change movesthe cursorleftandright. Vocal examples areprovided, with

the testproctor

demonstrating

soundsthatmove thecursorineach ofthe cardinal

directions and _showing the results on the screen. _Further, the proctor explains

that diagonals and curves are obtained when pitch and vowel changes are

combined, and some examples shown. The user is then given command ofthe

microphoneandallowedtopracticeforafewminutes, untilcomfortable.

Several more hints are provided as the _training session continues. _First, the important _thing is change. _Simply _vocalizing different pitches will not result in

motion,buta smooth_{sweep from}alowpitchtoahighpitch will. _Second,because oftemporal smoothing, a longervocalization is alwaysbetter than a shorter one.

Thismeansthata_long,slow pitch changeis betterthanashort,quick_{one, to}move

the cursor a short distance. _Third, _moving straight _up and down _using pitch modulation alone is _{relatively easy,} but _moving straight left and right is more

difficult due to people's _tendency to change pitch while _vocalizing a sound like "oowahh."

Itis helpfulto practice a constant-pitch vowel changeto move straight fromsidetoside.

Oncetheuseris satisfied withhisorherpractice, the testbegins. Theuser moves

thecursortoatarget _icon, andthe timernotes whenthe user_successfullygetsthe

cross-hair intothebox. This continuesfor each oftheeight_icons, and thetest is

(33)

O

G0O0OOO

5.1

Experiment

1

Experiment 1 was _extremely successful. While no attempt was made to collect

quantitative _data, _{qualitatively it} worked perfectly. The simultaneous input of position_using themouse andbrushsize _usingthemicrophone provedsimple and

intuitive, and _very little practice was required before the visual results were

satisfactoryto theuser. Beloware screen-captured examples of paintings created

usingsoundVolDraw2. Figure5shows astylizedhibiscus flower

drawing

inwhich

line width was modulated to accentuate the petals. The line width variation has

some unintended _{fluctuations,} but overall the results are representative of the

intendedeffect.

Figure5: Example

drawing

using vocal volumetocontrolbrushwidth.

Figure 6 illustrates thelevel ofcontrol and _{repeatability} that is possible to create

quickly with the program. The line weights in the text were varied to mimic a

classic serif_typeface, and whiletheend result isnosubstitutefora renderedfont,

it is

fairly

good, and much more interesting that it would be with constant line [image:33.502.125.380.313.508.2]

(34)

[image:34.502.123.377.47.241.2]

Sam!

Figure 6: Example

drawing

_usingvocal volumetocontrolbrushwidth.

Several people volunteered to spend some time

fiddling

with this program, and

they all were pleased and amused with the results. As simple as the implementation_is,thevolunteerswere_verypleasedtoseethevisualresult oftheir

voice ina _waythatwas newandentertaining. Some people were methodical and

experimental with their trial, probing the response of the new input modality.

Otherswere_unabashedly_squawking and

hooting

atthe computer,

laughing

atthe

results,

drawing

scribbled_pictures,and_simply_{enjoying it like}atoy.

5.2

Experiment 2

The realization of Experiment 2 met all expectations. The pitch detection

algorithmisrobust and_accurate, andthevowel_detection,while alittle lessrobust

than _ideal, works _amazingly well given its simplicity. The user test proved _very

successful and provided qualitativeinsightas well asdataon_usabilityand

learning

curve.

Twelve users took the _test, in addition to the author. Of the _twelve, one was

discarded because his voice was_{too raspy} and nasalto provide good control over

the cursor,

leading

tofrustrationand scores an order ofmagnitudeworsethan the

rest. The other eleven were categorized

by

gender as well as in vocalist or

nonvocalist categories. To qualify as a_vocalist, a user had some sort of _singing

experience or _training, _preferably in a choral _setting that requires vowel _quality

and clarity. Three users were categorized as vocalists and eight as nonvocalists. Six ofthe users are _female, and five are male. Two users were tested multiple

(35)

Each user was _{tested twice successively,} as eachtest consists of_only eight target

icons and there was some variability. Because some users"got lost" alittle bit or otherwise performed _{significantly} worse on one of their _icons, and others "got

lucky"

occasionally

by fortuitously

nailing a bullseye with an unintended

vocalization,thehighestandlowestofthe eight scoresforeachtestweredropped.

The average time for the _remaining twelve target icons (six for each of the two

tests) isshownin Table l.

Table1: Averageusertimesforthefirsttwo testswith extremes excluded.

Male

Voc.

Female

Vocalist Male Nonvocalist Female Nonvocalist

User 1 2 3 4 5 6 7 8 9 10 11

Time 10.7 17.0 23.9 18.8 25.9 22.8 24.6 19.2 27.8 30.6 35.6

The average time over all eleven users on their first two tests is _23.4 _seconds,

which is somewhat

disappointingly

_long, but not abad start. Itshould be noted

that user ₅ took the _{test only} once due to an _oversight, sohis initial testresult is reported where everyone else's first two test results are averaged. User performancedidnot_{vary widely}betweentheirfirstand second_tests, sothismild

inconsistency

maybe ignored. The data may be broken down inseveral_ways, for

example in the vocalist vs. nonvocalist and male vs. female categories mentioned

above,as showninTable2.

Table2: User_categoryaveragesforthefirsttwo testswith extremes excluded.

Vocalist Nonvocalist

Male 10.7 23.0

Female 20.4 28.3

Here, the hypothesis that vocalists would outperform nonvocalists is verified

by

their significantly better performance. _{Qualitatively,} this difference was clear whenthe testswereadministered; thevocalists

invariably

hadmuchmore comfort withthevocalcontrol_necessaryforthe taskand requiredless_trainingtimebefore

theyfeltreadyto takethetest.

Additionally, in each _category the male users outperformed their female

counterparts. Thiswas unanticipated and_mayindicatetheexistence of somesort

ofbias built in unintentionally as the pitchand vowel detection algorithms were

implementedand optimized _using themale author'svoice. Optimized is a_strong

word, because there really aren't _any empirical thresholds or other optimized variables

likely

tocausethis.

However, a couple ofdetails may introduce accuracybiases. The quantizationof

[image:35.502.53.452.165.223.2]

(36)

detection algorithm leads to coarser pitch computation for short fundamental

periods, which correspond to higher frequencies. It is perhaps possible that the

modulation of

higher-frequency

female voices is not as well discerned as that of

lower-frequency

malevoices. _Swaying the opposite_direction, thequantization in

the

frequency

domain in the FFT computation means that low _frequencyvowels

are more_coarselycomputed. Thiscouldleadtoless discernment intheharmonic

peak-height computations for

lower-frequency

voices than for higher-frequency

voices. Neitherofthese tendencieswere studied well enoughto _saydefinitivelyif

theyare_truly

introducing

_bias,butthepotentialfor differenceexists.

25

\

\ -Expert

\

- -

-Female Nonvoc

Male Nonvoc

\

.^

/ "^ -^

%

/

tes!2 tesl'J test4 tests

Figure7: Usertesttimesovera series_oftests_showingthe

learning

curve.

The users'

practice sessions and first tries at the test were _successful, but

sometimes

frustratingly

slow. _Fortunately,withimmediatevisual_feedback, people

learn _quickly and the users who took the test multiple times showed rapid

improvement. Users 7 and _8, one male and one _female, both _{nonvocalists,} were

testednine consecutivetimes inordertocharacterizetheir

learning

curves. Figure

7shows the _resulting data,

indicating

a quickimprovement overthe first four or

fivetestsandthen

leveling

off. Also shownis datafromthe _author, referredto as

theexpert userbecauseofhis

familiarity

withthe testand extensive practice while

writing the algorithms. The expert _baseline, representative of well-practiced

competence, averagesjustover₄secondsper targeticon. Overthecourse ofnine

tests taken in a single half-hour, the male user reached a plateau around ₁₃

[image:36.502.61.445.185.423.2]

(37)

each user's

learning

curve would continueto trend towardfastertimesifgiventhe time topractice.

Another

interesting

_waytolookattheusertestdata is toconsiderthedependence

on direction.

Moving

_in _the cardinal _directions, _up, _{down, left,} and _right, seems

intuitively

simpler than _{moving in} diagonal directions. The cardinals can be

reached with pitch or vowel changes on their own, while the diagonals require a

composite vocalization with both pitch and vowel changes. Most users did not

attempttocombine pitch changeand vowel changeinasingle_{vocalization,}instead

stair-stepping in the cardinal directions to reach the diagonal target icons.

However, some users mastered combined vocalizations _quickly, and one of the

female vocalistswas much more comfortable with _{gently-curving} diagonals than

with straight left and right movements. In the author's _experience, J shapes

proved comfortable and convenient with a little practice. Directional data are

summarized in Figure _8, _using compass directions in a logical _map sense where

northisup.

Expert

MaleVocalist

-FemaleVocalist

N _MaleNonvoc.

40 _FemaleNonvoc.

35

NW , s

30

NE

, \"

<''-. 25 -

-\

\V-v"-- '

\

*

VI * \

f-

?

r. /

. / .

v.

w

15

10'

^.<-"i '/ /

V

}tZ

1

%.

-A:

SW SE

Figure8: _{User category}averagesforthefirsttwotest,excludingextremes,

[image:37.502.106.409.284.588.2]

(38)

The general trend of diagonals _taking more time to reach than the cardinals is

clear, though it is not _exactly consistent across user categories. The expert user data makethebestapproximation ofacircle, meaningthat proficiencywas gained

nearly equivalently in all directions. The most inconsistentusers fall inthe male

nonvocalists _category, for whom diagonals took two to three times more time to

reach than cardinals. Both vocalist categories are more _even, cardinal directions

compared to

diagonals,

than the nonvocalistcategories. Complete user testdata

maybe foundin Appendix C.

User reaction to the nonverbal vocal interface implementation varied, but was

generallypositive andinsome casesquiteenthusiastic. Somewere embarrassedto

be singingand

howling

toa computerwith another personinthe room, and others

were too

busy

_entertaining themselves to notice another person's presence.

Several users suggested a game implementation _using vocal input for control,

whichis certainlyafeasibleidea. _Manyusers envisioned with amusement an office full of _{cubicle-dwelling} _employees, all _{cooing in cacophony} to their computers.

One user, after

demonstrating

_relatively adept control ofher voice to move the

cursor, pushed herchair back and_said, "That was the weirdest _thing I have ever

done."

(39)

O 0OO0O0OOOO

Thenonverbalvocalinterfaceconceptisan

interesting

confluence and extension of

prior work in signal _{processing, graphics,} and human-computer interaction. _By

combining existing ideasand algorithms with some modification andoptimization,

a novel method of user interface is created. It is shown that the concept of

nonverbalvocalinterface_maybeutilizedintandemwith atraditionalinputdevice suchasamouseorusedtoreplacethe mouse altogether. _Ordinaryusers aswell as

those _physically unable to manipulate a traditional input device canbenefit from

thisfoundation.

6.1

Experiments

The success ofthe two simple experiments is quite gratifying. _{Experiment l,} in which vocal volume controlsbrush size ina _paintingprogram, shows thata vocal variable canbe simultaneouslytied to mouse-driven cursor control_variables,and

itshowsthatthisextension of user control wasintuitiveand_easilylearnedwiththe

visual feedback provided. _{Additionally,} and most _interesting, it previews the usefulnessof such_{mixed-modality}_input,both for_precise,productive controlas an artist or engineer might _employ as well as for entertainment as a game or _toy

interface. _Applying this crude experimental interface to _any of these end uses

wouldbesimple and effective.

Experiment2 proves the hypothesisthata _completelyvocal interface canbe used

to supplant a mouse for two-dimensional control of a graphical environment.

Despite the _simplicity and small subject pool ofthe user _test, valuable data were collected _showing the ease of use of the implementation of and _illustrating the

rapid impro