autocorrelation analysis

(1)

on Resource Constrained Devices (Extended Abstract) Fabian Monrose Michael K. Reiter y QiLi Daniel P. Lopresti Chilin Shih Abstract

Programmable mobile phones and personal digital

assistants(PDAs) with microphones permit

voice-driven user interfaces in which a user provides

in-putbyspeaking. Inthispaper,weshowhowto

ex-ploitthiscapabilityto generatecryptographickeys

on such devices. Specically, we detail our

im-plementation of a technique to generate a

repeat-able cryptographic key on a PDA from a spoken

passphrase.Ratherthanderivingthecryptographic

keyfrom merelythepassphrasethat wasspoken|

whichwouldconstitutelittlemorethananexercise

inautomaticspeechrecognition|westriveto

gener-ateasubstantiallystrongercryptographickeywith

entropydrawnbothfromthepassphrasespokenand

howtheuserspeaksit. Moreover,thecryptographic

key is designed to resist cryptanalysis even by an

attackerwhocapturesandreverse-engineersthe

de-vice on which this key is generated. We describe

the major hurdles of achieving this on an

o-the-shelfPDAbearinga206MHzStrongArmCPUand

an inexpensive microphone. We also evaluate our

approach usingmultiple datasets, onerecordedon

the device itself, to clarify the eectiveness of our

implementationagainstvariousattackers.

1 Introduction

Futuristic mobile computing platforms will oer,

andinsomecaseswillrequire,methodsofuserinput

other than a keyboard, mouse orjoystick. This is

especiallytrueforhead-mounteddisplaysandother

BellLabs,LucentTechnologies, MurrayHill,NJ,USA;

ffabian,qli,dpl,clsg@re sear ch. bell -la bs.c om

y

DepartmentofElectricalandComputerEngineeringand

DepartmentofComputerScience,CarnegieMellon

Univer-sity,Pittsburgh,PA,USA;[email protected]

wearable computers (e.g., see [22]). For such

fu-turisticdevices,andevenfornext-generationPDAs

andprogrammablemobilephones,voiceisaleading

contenderforthedominantuserinputmedium.

We argue that if voice prevails in this sense, then

this poses a challenge for securing data on these

devices. On the one hand, if our experience with

laptopcomputersandmobile phonesisany

indica-tion, then these devices will be stolen frequently:

Laptop theft is already the second leading

quan-tiablecost to enterprises from IT-related security

threats[19]. Similarly,mobilephonesaretheobject

oftheftinfourofeverytenpersonalrobberiesin

sev-eralcities in theUnited Kingdom, andthese areas

loggedafourfoldincrease in personal robberies

in-volvingmobile phones between1998{99and2000{

01. 1

These trends suggest that encryption of any

sensitive data on such devices is prudent. On the

otherhand,presumingthatthesedeviceswillnotbe

tamper-resistant,thecryptographickeywithwhich

such encryption can be performed would need to

be derived from the voice input of the user,

pre-sumablysomeformofspokenpassphrase.

Unfortu-nately,spokenpassphrasesarelikelytohavefarless

entropy than typed ones, due to their need to be

pronouncibleandduetootherformsofinformation

lossin aspokenrepresentation(e.g., capitalization

andpunctuation).

Inthis paperwedescribean implementation of an

approach to derivea repeatable cryptographickey

from spoken user input, in which the entropy of

thekeyis drawnfrom both thepassphrase that is

spoken and the speech patterns of the user while

speaking it. In this way, even if the entropy of

thepassphrasespaceis small,thevariabilityacross

users'vocal tractswill pose an additionalobstacle

to the cryptanalysisof thekey. Moreover,our

ap-proachusestechniquesdesignedtowithstandan

at-1

Seehttp://news.bbc.co.uk/hi/english/ uk /ne wsid1440000/

1440863.stm.

(2)

vice on which the key generation takes place (but

not while it is taking place), i.e., an attackerwho

has full access to the stable storage of the device.

Our general approach for achieving this was

dis-cussedin[16],thoughasonlyaninitialsteptoward

thisgoal,thatworkevaluatedtheapproachonlyfor

thegenerationof46-bitkeys,usingonlyutterances

recorded over phone calls, and without regard for

thediÆculties facedin implementingtheapproach

onresource-constraineddevices. Here,weprovidea

somewhatmorerealisticevaluationofthisapproach

usingafullimplementationonano-the-shelfPDA

(the Compaq IPAQ), using data recorded on that

PDA, andtargeting 60-bitcryptographickeys. We

detailnumerouschangesandrenementsweneeded

to make theapproach viableonthis platform. We

willalsogiveevidencetosuggestthattheadversary

gains little by knowing the user's passphrase, and

thattheadvantagetheadversarygainsby

addition-allyrecordingtheusersayingphrasesotherthanthe

passphraseislessthanonemightexpect.

We caution the reader, however, of several

limi-tations of our analysis and our approach. First,

thoughwedemonstratethereliablere-generationof

akeyusing an60-bit characterizationof theuser's

utterance, we do not claim to necessarily achieve

keyswith 60bitsofentropy. Indeed, onecandraw

fewconclusionsregardingkeyentropyfromthe

lim-iteduserstudies[16, 17]thatwehavebeenableto

perform. That said, our studies suggest that the

technique we have implemented does draw

signif-icant entropy from passphrase utterances, and at

theveryleastshould oergreaterentropythanthe

passphrase space alone. Second, as our approach

strivestore-generateacryptographickeywhenever

thelegitimateuserutters herpassphrase,itis

nec-essarily vulnerable to an attacker who can obtain

boththeuser'sdevice andahigh-qualityrecording

oftheuserutteringherpassphrase;thisexposesall

theuser'skeyingmaterial,afterall. Whileany

bio-metricisvulnerabletosuchanattack,weraisethis

point heretoemphasize theprimary attackerwith

whichweareconcerned: theattackerwhocaptures

thedevicebutthatdoesnothaveaccesstotheuser.

Thatsaid,wedoshow,somewhatsurprisingly,that

knowledgeoftheuser'spassphraseseemstohelpthe

attackerlittle,andthatevenrecordingsoftheuser

sayingotherphraseshelpsonlymarginally.

In this section we give the background necessary

to describe our contributions in this paper. Our

generalapproachtogeneratingacryptographickey

fromaspokenutteranceofapassphraseisdescribed

in [16]and isbased onanapproach initially

devel-oped for generating a key from a typed password

andtheuser'skeystroketimingsduringtheentryof

thatpassword[15]. Howthispaperrelatestothese

priorpublicationsisdescribedbelow.

In our approach, the system is initialized by rst

generatingacryptographickeyK,andthen

gener-atingacollectionof2msharesofKusinga

general-izedsecretsharingscheme(e.g.,see[24]forasurvey

of such schemes), where m is asystem parameter.

These shares are aligned in a m2 table that is

storedonstablestorage. Thesharesmustbe

gener-atedand placedwithin thetable sothat K canbe

reconstructedfromanysetofmsharesconsistingof

onesharefrom eachrowofthetable.

Upon entry of the passphrase, the system

mea-suresmbiometricfeaturesoftheuser'sentryofthe

passphrase. Wedenote thei-th feature by

i ,and

denotethevalueoffeature

i

onthe`-th

(success-ful orunsuccessful) attempt to log into this user's

account(i.e., generatethis user'skey) by

i (`). In

thecaseofaspokenpassphrase,thefeatures

i are

features extractedfrom the user'sutterance as

de-scribedin [16]. Forthe`-thloginattempt,the

sys-tem then generates a bit stringb

` 2 f0;1g m from 0 (`);:::; m 1 (`);b `

iscalled afeaturedescriptor,

and thei-th bit of b

`

, denoted b

`

(i), is determined

from the i-th feature

i

(`). Algorithms for

gener-ating b ` from 0 (`);:::; m 1

(`) are proposed and

evaluatedin[16],butforthepurposesofthispaper,

thereadercanthinkofb

` beingdeterminedby b ` (i) 0 if i (`)< i 1 otherwise (1) where i

denotes some xed threshold value. The

system then attempts to reconstruct K using the

table elements at positions hi;b

` (i)i

0i<m

. When

theloginindex `isnotnecessaryin thediscussion,

wewilltypicallyomitit.

Afterahistoryof feature descriptorsfrom

success-ful loginsis observed (i.e., logins in which the key

was successfully reconstructed), those elements of

(3)

measurements are such that b(4)= 1consistently,

thentheh4;0i elementof thetable israndomly

al-tered. If b(i)is suÆcientlyconsistent that element

hi;1 b(i)iinthetableisperturbedinthisway,then

b(i) is called adistinguishing feature. We will give

aprecise characterizationofdistinguishingfeatures

inSection 5.

Thecorrectuser,wheninducingfeaturedescriptors

consistent with those she has induced in the past,

shouldnotencounteranyofthealteredelementsin

thetable. Wehavefound,however,thatinpractice

itisnecessaryforthesystemtoattempt

reconstruc-tionusingfeaturedescriptorswithinsomeHamming

distance c

max

of the inducedfeature descriptor, to

correct for up to c

max

\errors" by the user (e.g.,

see[15]). Thisresultsinupto

cmax X i=0 m i (2)

secret sharing reconstructions per key

(re)generationattempt.

Securityofthistechniquerequiresthatanadversary

whocapturesthedevicebeunabletoeÆciently

dif-ferentiatearandomtableelementfromavalidshare

ofK. Ifthis isthecase,thenanadversarymaybe

forced to simply guess feature descriptors until he

ndsK. Ifd tableelementswere randomized(i.e.,

there areddistinguishingfeatures),then 2 m d

out

ofthe2 m

possiblefeature descriptorsyieldK, and

sotheprobabilitythatarandomlychosendescriptor

willreconstructthesecretis2 d

.

Specic secret sharingschemes for populating this

table were investigated in [15]. That paper also

included an evaluation of this approach with

fea-ture descriptors of length m = 15 derived from

the keystroketimings of auser while typing an

8-character password. There, weevaluatedan

imple-mentation in which the table was additionally

en-crypted with the password; in this way, the

tech-nique serves to render adictionary attack against

the passwordup to 2 15

times more diÆcult. Our

subsequentworkonvoicefeatures[16]described

al-gorithmsforgeneratingfeaturedescriptorsfromthe

user'svoicewhilespeakingapassphrase. Itfurther

evaluatedthesecurityandreliabilityoftheresulting

techniquewithfeaturedescriptorsoflengthm=46

derivedfrom preexisting recordings ofusersovera

phone line. However, in contrast to the keystroke

case,hereourevaluationpresumedatablethatwas

passphrase (to decrypt the table). In this case,

m = 46 does not provide nearly enough security

forimportantapplications.

In this paper we address the computational

chal-lenges of performing key reconstruction on a

resource-constrained PDA with more realistic

pa-rameters than our previous voice study explored.

Specically,weevaluateourimplementationofthis

approach for featuredescriptors of lengthm =60,

andarguethat regeneratingthekeyK canbe

reli-ablyachievedona206MHzStrongARMprocessor

bycorrectingforuptoc

max

5errors(inthesense

described above). The challenges in achievingthis

are the front-end signalprocessing needed to keep

c

max

smallso that expression (2) remains

manage-able, and in devising a secret sharing scheme and

corresponding reconstruction algorithm that

per-mits this reconstruction to occur in a reasonable

amountoftimeonthisplatform. Consequently,we

focusonthesecontributionsinthispaper,andrefer

the reader to [16] for the algorithmic details

com-prisingotherstepsofthekey(re)generationprocess.

3 Front-end Signal Processing

Asdescribedin Section2,thefront-endsignal

pro-cessing performed by the device is critical for

eÆ-ciently generating a key from the user's utterance

of herpassphrase. Intuitively, the goalof this

sig-nalprocessingis to translatethesound uttered by

the user|which is received in the device as a

se-riesofamplitudesamplesfrom itsmicrophoneand

analog-to-digital(A/D)converter|intoa

represen-tationsuitedforthegenerationofafeature

descrip-tor(usingthealgorithmsof[16]). Ideally,thissignal

processing should yield a representationthat is as

\clean" as possible, in that inter-word silence and

backgroundnoiseaectthis representationaslittle

aspossible. Thelesssilenceandbackgroundnoisein

therepresentationaftersignalprocessing,themore

consistenttheuser'sutteranceswillbe(thereby

in-creasingd and the securityof the scheme) and/or

thelesserrorcorrectionwillbenecessaryinthelater

stagesof keygeneration (i.e., thesmallerc

max and

expression(2)canbe).

Ofcourse,thebenetsofsignalprocessinginterms

(4)

ut-tondtherightbalanceofeliminating

environmen-taleectsearlyviasignalprocessing,versusrelying

on the error correction in the key generation step

tocompensatefortheeectsofnoiseandsilencein

theuser'sutterance. Inthissectionwedescribethe

seriesofsignalprocessingstepsthatwebelievebest

achieves this balance. These steps are pictured in

Figure1anddescribedtextuallybelow.

A/DC

down sampling

autocorrelation

analysis

LPC

analysis

silence

removal

cepstrum

mean-subtract

frames

voice-only

endpoint

detection

energy

Figure1: Outlineofthefront-endmodulesusedfor

capturing the speech and processing the signal to

generateasequenceofframescomprisingthe

voice-onlyportionsin theutterance.

As thespeakerutters her passphrase,the signalis

sampledatapredenedsampling rate,whichisthe

number of times the amplitude of the analog

sig-nalisrecordedpersecond. Theminimumsampling

ratesupportedbyourtargetplatform,theIPAQ TM

(seeSection5.1),is32kHz;i.e.,32;000samplesare

taken per second. 2

Each sample is representedby

a xed number of bits. Obviously, the more bits

therearepersample,thebetteristheresolutionof

thereconstructedsignal,butthemorestorageis

re-quired for savingand processing theutterance. In

our implementation, we represent the signal using

16bits persample. Therefore,theamountof

stor-agerequiredpersecond ofrecordedspeech is

32000 samples second 2 bytes sample = 64 kilobytes second (3)

Since theutteranceof oneof ourtested passwords

caneasilybe6secondsormore,thestorage

require-mentsforprocessingevenasingleutterancecanbe

2

TheIPAQisclaimed to support lowersamplingrates,

butwehavebeenunabletogetthemtoworkcorrectlyunder

isespeciallytruesince,aswehavefoundby

experi-ence,writingtostablestoragewhiletherecordingis

ongoingcan introduce noiseinto therecording. In

ourcase,thisparticularlyposedanissueforour

ex-perimentalevaluationinwhichweneededtoacquire

manysamplesfrom thespeaker;seeSection 5.1.

To make subsequent processing on the device

eÆ-cient,ourimplementationrstmakesseveral

modi-cationstotherecordedspeechtoreducethe

num-ber of samples. For example, we down-sample the

32;000 samples per second to only 8;000 samples

persecond,eectivelyachievingan8kHzsampling

rate. For most voice-related applications, a

sam-pling rate of 8 kHzis suÆcient to reconstruct the

speech signal. Infact, nearly allphone companies

inNorthAmericauseasamplingrateof8kHz[21].

Downsamplingmustbeperformedwithsomecare,

however, due to the sampling theorem [20]. The

sampling theorem tells us that the sampling rate

mustexceedtwicethesignalfrequencytoguarantee

anaccurateanduniquerepresentationofthesignal.

Failuretoobeythisrulecanresultinaneectcalled

aliasing, in which higher frequenciesare falsely

re-constructed as lower frequencies. Down sampling

to8kHzthereforeimpliesthatonlyfrequenciesup

to4kHzcanbeaccuratelyrepresentedbythe

sam-ples. Therefore,when downsampling to 8kHz we

usealow-pass digital lter withcuto at 4kHzto

strip higher frequencies from the signal. That is,

this lter takes sound of any frequencies as input

andpassesonlythefrequenciesof4kHzorless.

After down sampling, the signal is broken into

30millisecond (ms) windows, each overlappingthe

nextby10ms. Withineachwindoware240samples

(since 8;000samples/second0:03seconds=240

samples). Overlapping windows by 10 ms avoids

discontinuities from one window to the next, and

additionalsmoothingisperformedwithineach

win-dowtoyieldas smoothasignalaspossible.

The goal of the next signal processing steps is to

derive a frame from each window. A frame is a

12-dimensional vector of real numbers called

cep-stralcoeÆcients[20],whichhavebeenshowntobe

averyrobustandreliablefeaturesetforspeechand

speakerrecognition. ThesecepstralcoeÆcientsare

obtainedusingaatechniquecalled autocorrelation

analysis. Thebasicpremisebehindautocorrelation

analysisis thateachspeechsamplecanbe

(5)

sam-cientsusingautocorrelationanalysisinvolveshighly

specialized algorithms that we do not detail here,

but that are standard in speech processing (linear

predictivecoding[10]).

Asideeectofgeneratingframesisacalculationof

theenergy ofthesignalperwindow. Theenergyof

awindowisproportionaltotheaverageamplitudes

ofthesamplesin thewindow,measuredin decibels

(dB). Energycanbe used to removeframes

repre-sentingsilence(whichhasverylowenergy)fromthe

framesequence,viaaprocesscalledendpoint

detec-tion[13]. Thesilenceportionsofthefeatureframes

are then removed and the voice portions

concate-nated.

Finalmodicationstotheframesequencearemade

via a technique called cepstral mean subtraction.

In this technique, the component-wise mean over

all frames is computed and subtracted from every

frame in the sequence. Intuitively, if the mean

vectorisrepresentativeof thebackgroundnoiseor

thechannelcharacteristicsintherecording

environ-ment, then subtracting that mean vector from all

theframesyieldsaframesequencethatismore

ro-bustin representingtheuser'svoice.

After this, the speech data is segmented and

con-vertedtoasequenceofbits(afeaturedescriptor)as

describedin [16]. Thisfeaturedescriptorisusedto

regeneratethesecretkeyfromthepreviouslystored

tableofshares, asdescribedin Section2.

4 Encoding Scheme

We review the encoding scheme of Bleichenbacher

presentedin[1]. Thatschemefocusesonthe

partic-ular secret sharingscheme we use to populate the

m2tabledescribedin Section2andfrom which

thekeyKisreconstructed.Wequicklyfoundinthe

implementation of this technique on the

resource-constrained IPAQ that the secret sharing schemes

suggestedin[15]wouldnotsuÆce. Thatpaper

sug-geststhreedierentschemes. OneissuÆciently

re-source eÆcient for our purposes but also has

po-tential security weaknesses (see [15, Sections 5.1{

5.2],[2]),andwhiletheothertwoaddressthis

weak-ness[2],theyaresimplytoocomputationally

inten-siveduring reconstruction to permit the degree of

error correction we require. Therefore, to achieve

cretsharingschemethatpermitsfastreconstruction

andthatappearstobesecureforourpurposes.

Weemphasizethat thetypeof security werequire

for our secretsharing scheme is dierent from the

typicalsecuritydenitionofasecretsharingscheme.

The latter denition is, informally, that an

adver-sarynotpossessingasuÆcientset ofsharesbe

un-able to reconstruct the secret. In our case,

how-ever, the adversary who captures the device is in

possession ofall sharesin thetable,and so clearly

possesses enough shares to reconstruct the secret.

Oursecurityrequirementis rather that the

adver-sary beunable to eÆciently nd a suÆcientset of

validsharesinthetable,i.e.,asetcontainingavalid

sharefromeachrowofthetableandnoinvalid

(ran-dom)elements. Ideally,thebesttheadversarycould

do would beto repeatedly tryreconstruction with

arandomlychosenset containingoneelementfrom

each row. However, becausethe invalid shares are

placedaccordingtoanunknowndistribution

deter-mined by the biometric features of the user|and

not uniformly at random|it is impossible to

for-mallyreducethesecurityofsuchaschemetoa

well-knowncryptographicproblem. (Obviouslythereare

distributions that would leavethe scheme trivially

breakable.) Assuch, untilwend abetterwayto

model security, we are stuck with heuristically

se-cure schemes; the approach described in this

sec-tionisone. Nevertheless,wewillcommentindetail

aboutourcurrentknowledgeofthesecurityofthis

schemeinSection 4.4.

4.1 Initialization

Here wedo notdescribethesecretsharing scheme

is its full generality, but rather describe how it is

instantiated for our particular purposes. When a

deviceisinitializedforauser,thedevicerstselects

arandom(m 1)-elementcolumnvectors2Z m 1

p

forpaprime. Here,pcanbesmall; p=2 31

1is

asuitablechoice, so that arithmetic on32-bit

pro-cessorsisveryfast. Thevectorsisasecretvector,

therecoveryofwhichisnecessarytoobtainthekey

K. Forexample,K canbedenedasK=h(s) for

haone-wayfunction, orK canbe encrypted with

h(s).

Thesecondstepininitializationistogeneratea

ran-dom2m(m 1)matrixU =(u

ij )

0i<2m;0j<m 1

(6)

generatedpseudorandomly,thenstoringtheseedof

thepseudorandom process is suÆcientfor U to be

regenerated when needed. The 2m-element table

t=(t 0 ;t 1 ;:::;t 2m 1 ) T 2Z 2m p isthen generatedas t p Us (where \ p

" denotes equivalence modulo

p). Thatis, tisthetableasdescribedin Section2;

intuitively, theelement in the i-th \row" and j-th

\column"for0i<mandj2f0;1gist

2i+j .

Tocomplete initialization, s is deleted, and U (or

the seedneeded to regenerate it), thetable t, and

primeparestoredforthenextkeyregeneration

at-tempt. In addition, y = h 0

(s ) is stored for some

(dierent) one-way and collision-resistant function

h 0

6=h, sothat when s is reconstructed,it can be

recognizedascorrect.

After a suÆcient number of successful key

recon-structions (see Section 4.2), the table t is

\hard-ened"asdescribedinSection2: ifoveranumberof

successfulkeyreconstructions,eachinducedfeature

descriptor b is consistent on the i-th feature (i.e.,

b(i)isusually thesame,asspeciedmoreprecisely

in Section 5), then element t

2i+(1 b(i))

is assigned

tobearandomelementofZ

p

. Thisshould usually

notaect reconstructionfor thecorrect user,since

thatusertypicallyselectst

2i+b(i) .

4.2 Key reconstruction

Asdescribedin Section2,theinputfrom theuser,

in this case an utterance, is used to generate an

m-bit feature descriptor b. The key regeneration

program constructs a vector t

b 2 Z

m

p

by selecting

thecorrespondingelementsofthetable,i.e.,(t

b ) T = (t 2i+b(i) ) 0i<m . Itsimilarlyconstructsam(m 1) matrixU b =(u 2i+b(i);j ) 0i<m;0j<m 1 . Notethat solving U b s 0 p t b (4) for s 0 would yield s=s 0

if b containedno user

er-rors,andthefactthats 0 =scouldbeconrmedby testing whether y =h 0 (s 0

). However, since

instan-tiating and solving(4) for all the dierentfeature

descriptors b 0

within Hammingdistance c

max from

b would require too much time on the target

de-vice,ourerror correctionstrategy takesadierent

approach.

This faster approach derives from the observation

thattheequationU

b 0 s 0 p t b 0

forafeature

descrip-0

b 0

to be rejected very quickly if this equation has

nosolution. Specically, let ~

U

b 0

bethemm

ma-trixwhoserstm 1columnsarethesameasU

b 0

and whoselast column is t

b 0

. Then, asolution

ex-ists onlyif det( ~ U b 0) p 0, and so if det( ~ U b 0) 6 p 0 thenb 0

canbediscardedfromfurtherconsideration.

Usinga recursivealgorithm, it ispossibleto check

det( ~ U b 0 ) p

0forfeaturedescriptorsb 0

withinc

max

errorsofbusingonlyoneadditionalGaussian

elim-ination step per new b 0

. Due to this feature, our

implementation can test over 610 6

feature

de-scriptorsb 0

persecond whenm=60.

Two further observations are worth noting here.

First, the determinant of even a randomly chosen

matrix is 0 mod p with probability p 1

, which is

notnegligiblesincepischosento besmall.

There-fore, if det( ~ U b 0) p

0 and sowe proceed to solve

U b 0s 0 p t b 0 for s 0

, itremains necessaryto conrm

s 0 by checkingy =h 0 (s 0

). Second, there isasmall

chance that (4) is not solvable even if b is

consis-tent withall theuser's distinguishingfeatures,

be-cause U

b

might nothave full rank. Under the

as-sumption that U

b

is chosen uniformly at random,

itfollowsthat U

b

hasrankm 1withprobability

Q m j=2 (1 p j )=1 p 2 +O(p 3 ). Thisprobability

is much smallerthan the probability that the

sys-temcannotbesolvedbecausethefeaturedescriptor

binducedbytheuser'sutterancecontainsmorethan

c

max

errors,andthus canbeignored.

4.3 Performance

Forperformancemeasurementswechooseto

bench-markkeyreconstructionona206MHz StrongArm

and a 500 MHz Ultra II. The rst processor is

that currently available in the IPAQ TM

, on which

ourcurrent implementation runs. Thesecond

pro-cessor is in line with the current trends 3

in the

hand-held market, and thus, allows us to forecast

expected performance on future PDAs. Our

per-formance benchmarks consists of a collection of

C/C++modules cross-compiled for the ARM,

com-prisingof signalprocessing code, an enhanced

ver-sion of cryptolib1.2[11] updatedby Daniel

Ble-ichenbacher, and a matrix manipulation package

newmatv9.0[5]forimplementingthe segmentation

3

The Intel 80200 processor based on the Arm

compli-antXScale TM

microarchitecturethatsupports400,600,and

(7)

Totest theperformance of ourkey reconstruction

routineswedevisedthefollowingexperiment. First,

anm2table ofsharesof thekeyK is generated

asoutlinedin Section 2,i.e., as auser'skey

regen-erationtablewouldbeinitializedinpractice. Then,

d rowsof the table (features) are selected at

ran-dom to bedistinguishing, andone elementof each

of these d rows is perturbed randomly|as if the

user were consistent in utilizing the other element

of this row (see Section 2). Finally, a feature

de-scriptorbischosenwiththepropertythatcc

max

oftheddistinguishingfeatures(chosenatrandom),

sayb(i

1

);:::;b(i

c

),are\errors",i.e.,aresettoselect

the randomly perturbedelements of rowsi

1 ;:::;i

c .

Thekeyreconstructionprocess performs anumber

ofreconstructionsthatdependsoncinitsattempts

tocorrectforsucherrors;thenumberof

reconstruc-tionsperformedintheworstcaseisshownin

expres-sion(2)ofSection2. Ourbenchmarkistheamount

oftime requiredto reconstruct thekeyK on

aver-age,whichisaroughmeasureof thetimerequired

toperform m c 2 + c 1 X i=0 m i (5)

secretsharingreconstructionsandtesteachfor

cor-rectness. Notethatforc=c

max ,(5)islessthan(2) by m c

=2 sinceonaverage,reconstruction succeeds

after searching half of the feature descriptors that

correctbonexactlyclocations.

Ourresultsforthisbenchmark, shownin Figure2,

are signicantly less than multiplying (5) by the

timeto performand testasinglereconstructionin

our secret sharing scheme. The reason is due to

the signicant optimizations that can be achieved

asdescribedin Section4.2.

4.4 Security

One potential security weakness of this scheme is

thefactthat anadversarywhocapturesthedevice

canconceivablyreconstructs fromnotjust one

el-ement of thetable perrow, but insteadusing any

melementsofthetable. Forexample,ifthe

adver-sary had reasonto believe that the rstm=2 rows

contained no distinguishing features|and thus all

table elements in these rows were valid|then the

adversary could set t 0

[i] = t[i] for 0 i <m and

U 0

=(u)

0i<m

andusethesetosolvefors.

There-morelikelythannottobedistinguishing.

Wenowdescribethefastestattackonourschemeof

which weareaware. An attackerwhocapturesthe

deviceonwhichthekeyisgeneratedbutwhohasno

informationabouttheuser'sdistinguishingfeatures

mayattackthesystembyrepeatedlyguessinga

fea-turedescriptorbatrandomandsolving(4). Ifthere

areddistinguishingfeaturestheneachguesswillbe

successfulwithprobability2 d

,butwillrequirethe

attacker to solve a system of m linear equations,

which is quitetime consuming. A faster approach

isto choose feature descriptorsb

0 ;b

1

;::: such that

eachdiersfromthelastinonebit. Then,

comput-ingU 1

b

i

requiresonlyonenewGaussianelimination

stepperb

i

, and the further optimizations outlined

inSection4.2 canalsobeapplied inthiscase.

Theexpectedtimeforthisattacktosucceedcanbe

computedasfollows. Assumethatb

i andb

i+1 dier

in exactly one position that is chosen at random.

LetG(c) for c 2denote theexpectednumber of

Gaussianeliminationstepsperformeduntilreaching

ab

i

withnoerrors(i.e.,thatisconsistentwithallof

theuser'sdistinguishingfeatures),assumingthatb

0

hasc errors. Notethatb

i+1

hasadierentnumber

oferrorsthanb

i

withprobabilityd=m,andifithas

adierent numberof errors, then it decreases the

numberoferrors(byone) withprobabilityc=dand

increases it with probability(d c)=d. Hence, we

getthefollowingequationsforG(c).

G(2) = 0 G(c) = m d + c d G(c 1)+ d c d G(c+1) for2<cd

Solvingforthese linearequations atc=d=2yields

theexpectednumberofGaussianeliminationsteps

beforerecoveringthekey,sincearandomfeature

de-scriptorcontainsanexpectedc=d=2errors.

More-over,aftertheGaussianeliminationstepforeachb

i ,

m(m 1) multiplicationsarerequiredto testb

i on

average.So,thetotalcostofthisattackisasshown

in Figure 3. (Actually, thisis aconservativelower

bound, sincethecostof each Gaussian elimination

stepitself isnotcounted.)

Intheempiricalevaluationthat weperformin

Sec-tion5,weevaluateourapproachat m=60,which

isthesmallestvalueofmshowninFigure3.Here,if

70%ofthefeaturesaredistinguishing,thenthis

at-tack conservativelyrequires anexpected 2 44

(8)

0

20

40

60

80

100

120

2

3

4

5

6 time (secs)

corrections

206 MHz StrongArm

m = 60

m = 70

m = 80

0

20

40

60

80

100

120

2

3

4

5

6 time (secs)

corrections

500 MHz Ultra II

m = 60

m = 70

m = 80

Figure 2: Key reconstruction performance for m 2 f60;70;80g. Depicted are average (of 50 executions)

elapsedtimesforkeyreconstruction ontheIPAQ(left)and a500Mhzprocessorwhich re ectsupcoming

PDAprocessortrends(right).

30

40

50

60

70

80

90

100

110

60

65

70

75

80

85

90

95

100 floor(log

2

(# mults))

m

d/m = 0.5

d/m = 0.6

d/m = 0.7

d/m = 0.8

d/m = 0.9

d/m = 1.0

Figure 3: A lowerbound on the expected number

ofmultiplicationstoexhaustivelysearchforthekey,

assumingthattheddistinguishingfeaturesare

uni-formlydistributed;seeSection4.4.

thenthisattackrequiresatleast2 50

multiplications

on average. Obviously the security of the scheme

improvesasmandd=mareincreased,andthisisa

goalof ourongoingwork.

5 Empirical Results

Inthis sectionweempiricallyevaluatethesecurity

of our technique using two dierent data sets. In

acterizethesecurityofourtechniqueagainstan

at-tacker who captures the device. It is clear from

Section4.4thatthenumberdofdistinguishing

fea-tures is central to the security of our scheme, in

that if d is small, then our scheme is vulnerable

to akeyrecoveryattackviaexhaustivesearch (see

Figure3). Therefore,in orderto demonstratethat

ourapproachis plausibly secure, itis necessaryto

demonstrate that a high number of distinguishing

features can be achieved using our techniques. In

addition,wealsoattempttocharacterizethedegree

to which additional knowledge aids the attacker's

questforthekey,intheformofeitherknowingthe

passphrasesaidbytheuserorhavingrecordingsof

theusersayingphrasesotherthanherpassphrase.

Weremindthereader that largedis notsuÆcient

for strong security. For example, even if all

fea-turesare distinguishing (d =m) for all users, but

allusers' feature descriptorsare identical (and the

attacker knows this), then an attacker who

cap-turesauser'sdevicecantriviallydeterminethekey.

Therefore, it is equally important that users'

fea-turedescriptorsvarywidely|ormoreprecisely,are

drawn from a distribution with high entropy. An

entropyevaluation of user'sutterances from phone

recordings of users saying the same passphrase is

describedin[16,17],andthesestudiessuggestthat

theentropyavailablein userutterancesit

substan-tial even when userssay the samepassphrase. As

already noted, however, since that study involves

onlyrecordingsofuserstakenoverphonelines,and

sincethatstudyis limitedtom=46features,it is

(9)

sets withwhich weare presently working (see

Sec-tions 5.1 and 5.2) include too few users to enable

meaningful measurements of the entropy of users'

feature descriptors, and so here we report results

fordistinguishingfeatures only.

Inorder to calculatethe average numberof

distin-guishingfeatures peruser,itis ofcourse necessary

to dene when a feature is distinguishing. Let

i

and

i

denote themeanand standarddeviation of

feature

i

over the recent history of successful

lo-gins. 4

Then we say that the i-th feature is

distin-guishingifj i i j>k i

forsomeparameterk>0.

Note that iffeature i is distinguishing, theneither

i > i +k i

and sousually b(i) =0 forthe user

(see (1)), or i < i k i

and sousually b(i) =1

fortheuser. Intuitively,theparameterk tunesthe

\sensitivity" of the scheme, in that a small k

im-pliesmoredistinguishingfeatures,andalargek

im-plies fewer. Obviouslyk mustbetuned to balance

achievinga high numberof distinguishing features

withenablingtheusertosuccessfullyregeneratehis

keyreliably, since a higher number of

distinguish-ing features is advantageous for security but also

requires increasingly similar utterancesto

regener-atethekey. Theparameterkwillplayacentralrole

inourevaluation.

Thefeatures

i

thatweuseinthebalanceofthis

pa-peraredescribedin[16,Section3.2].Eachisdened

bycomparingthepositionofavectorcharacterizing

a segment of the utterance to a xed plane. This

planeisaparameterofourscheme,andthoughwe

willrarelymentionitbelow,itisimportantforthe

readertobeawarethatthedatawepresentisbased

onaplaneselected,basedonourdata, tooptimize

ourmeasuresincertainways. Ontheonehand,this

meansthatourdatapresentswhatcouldbeachieved

withagoodselectionofthis plane,and isthus

op-timistic in this regard. On the other hand, since

this plane isselected bysearchingthrougha small

setofcandidateplanes,(innitely)manyplanesare

omittedfrom thissearch. Consequently, itislikely

thatplanesyieldingbettermeasuresexist. The

ex-perimentationwehaveconductedthusfardoesnot

permit us to conclude how to select this plane in

general,andthiscontinuestobeanareaofour

on-goingwork.

4

The\recenthistory"isdenedtobethelasthsuccessful

loginsforsomeparameterh.Recordsofthelasthsuccessful

loginscanbestoredinaleencryptedwiththekeythatis

reconstructedonasuccessfullogin.Theparameterhwillnot

5.1 Evaluation of IPAQ recordings

Ourrstdatasetwascollectedonadevicemuchlike

thoseweenvisionforuse withour scheme, namely

a Compaq IPAQ TM

H3600 series. The IPAQ is a

personal digital assistant with a 206 MHz

Stron-gArmCPU,atouchscreenLCD,16MB ashROM,

16MB SDRAM,aPhilipsaudiocodecUDA1341T

(i.e., an A/D with data compression), built-in

mi-crophones, a compact ash type-II expansionslot,

and a speaker output jack. The audio codec has

anintegratedanalogfront-endincludingautomatic

gaincontrol,whichadjuststhesignalstrengthbased

on the level of the spoken utterance. The sound

driverisfullduplexandOpenSoundSystem(OSS)

compliant[29],thoughweencounteredsomesignal

problemswhenrecordingsamplesat8kHz. Forthis

reasonwerecordedoursamplesat32kHzandthen

downsampled to8kHzoine;seeSection3.

As illustrated in (3) of Section 3, the storage

re-quirements for merely saving the utterance of a

passphrasecanbesignicant. Toovercomethe

stor-age limitations of this particular IPAQ in light of

thisrequirement|andin particular,topermit

sav-ingmultipleutterancesinourusertesting|weused

a1GBIBMMicrodrive TM

(inthecompact ash

ex-pansion slot) as astable store. However, to avoid

recordingnoise from theMicrodriveondisk seeks,

the recordings are rst written to a primary

par-tition in volatile memory. When the memory

ca-pacity is reached, the Microdrive is automatically

mounted,thedata ushedtoanext2lesystemon

thedrive, and then unmounted. Inthe eventthat

awirelessconnectioncanbeestablished,the

Micro-drivecanbereplacedwithawirelessnetworkcard,

andthedatawrittentoaremotemountpoint.

TheIPAQwas used to recordutterances from ten

users. All ten users were recorded saying the

samepassphrasemultiple times, whichin this case

was the address of Carnegie Mellon University:

\Carnegie Mellon University, 5000 ForbesAvenue,

Pittsburgh,Pennsylvania15213". Theuserwas

ap-proximatelyone footawayfrom theIPAQ's

micro-phone. Theuserwasrequiredtowaitforatleastone

secondbetweenpressingthe\record"buttonofour

recordingapplicationandspeaking,soastonot

in-terleavethevoicesignalwiththedevice'sattempts

toperformautomaticgaincontrol. (Sincethe

auto-maticgaincontrolconvergeswithin 0:5seconds on

(10)

acousticenvironmentinwhichtheseutteranceswere

recordedwasastandardoÆceenvironment,andas

such,backgroundnoisewassignicant. Six

record-ingsof each user|the \training"utterances|were

used to determine the distinguishing features for

thatuser. Theremainingrecordingsforeachuser|

the \testing" utterances, of which there were six

from eachuseronaverage|wereeach usedto

gen-erateafeaturedescriptor. Comparingeachfeature

descriptortothesameuser'sdistinguishingfeatures,

to determinethenumberof distinguishingfeatures

with which the feature descriptor was consistent,

counted asa\truespeaker"trial. Comparingeach

feature descriptor to another user's distinguishing

featurescountedas an\imposter"trial.

Theresultsofthisanalysisareshownintheleftside

of Figure 4. This graph demonstratesthe average

numberofdistinguishingfeaturesperuserasa

func-tionofk,andtheaveragenumberofthesethatthe

feature descriptorof atruespeakeroran imposter

matched. 5

Theerrorbarson thetrue speakerand

impostercurvesshowonestandarddeviationabove

andbelowtheaverage.

There are several pointsworthnoting about these

results. First, thegap betweenthe \distinguishing

features" and \true speaker" points indicates the

number c

max

of error corrections that would need

to be performed during the key regeneration

pro-cess to achieve a reasonably low false reject rate.

For example, if k = 0:6, then c

max

5 should

achieve a reasonablefalse reject rate, and

correct-ing 5errorsis feasibleon today's devices (see

Sec-tion 4.3). Unfortunately, this data suggests that

choosingk=0:6yieldsfewerdistinguishingfeatures

than we would like for security (d=m 0:5 only).

Asecondpointworthnotingisthatthehuman

im-posters, even when sayingthe same passphrase as

thetrueuser,didnotmatchsignicantlymoreofthe

trueuser'sdistinguishingfeatures thanif theyhad

simplyguessedarandomfeaturedescriptor(shown

bythe\randomguessing"lineinFigure4,whichis

simplyhalfthe\distinguishingfeatures"line).

5

Thepointsforagivenvalueofkweregeneratedusinga

planechosentomaximizethenumberofdistinguishing

fea-turesand, among allsuch planes, to maximizea weighted

average of the number of features matched by the \true

speaker"andmissedbythe \imposter". Seethe last

para-Inorderto evaluateadierentmodel of

imperson-ation, i.e,onewhere theattackerhasknowledge of

thespeakerbeingimpersonated,weexploreda

sec-ond dataset. Our second dataset is onecollected

withinthecontextofadierentresearcheort,and

so consequentlywehad lesscontrol overthe

avail-ability of phrases said by the same user multiple

times. This data set consists of recordings of a

professionalspeaker,herecalled\Speaker-J",taken

in aprofessionalstudio (i.e.,aroomwith virtually

no background noise) using a high-quality

micro-phone. The same microphone was used

through-outdatacollectiontoensure atandconsistent

fre-quencyresponse. Consequently, this data set is of

amuch higher quality than thedata set described

inSection 5.1. Thisdataconsistsofrecordings

col-lected fortwoseparateexperiments, but both

spo-kenbythesamespeaker,Speaker-J.PartI consists

ofapproximately1600sentences(roughly1hourof

speech)andincludesthespeakerreading(with

con-sistent voice quality and head-to-microphone

dis-tance) both standard newswire text and a

pre-paredscriptthatcoversrarecombinationsofspeech

sounds. PartIIconsistsof38minutesofspeech

con-sistingofagroupofsentencesrepeated7times,all

recordedin asingleday. Theelapsedtimebetween

thetwodatasetswasapproximately1year.

Toevaluateourtechnique,eachofthesephrases(in

partII) were used as apassphrase in our scheme:

verecordingswere used to generatethe speaker's

distinguishingfeatures fora givenphrase, andtwo

recordings were used to simulate the speaker

at-temptingtoregeneratehiskey. Specics aboutthe

chosenpassphrasescanbefoundin Table1.

TherightsideofFigure4showstheresulting

\dis-tinguishing features" and \true speakers" curves.

These curves are analogous to the curves in the

leftside of Figure4with thesamelabels: therst

characterizesthenumberofdistinguishingfeatures

forthis user basedonthe trainingutterances,and

thesecondgivestheaveragenumberoffeatures on

whichafeaturedescriptorgeneratedfromatest

ut-terancematched the distinguishingfeatures. If we

look atk=0:6, weagainseethat c

max

5should

approach a reasonable false negative rate (and is

plausible by Section 4.3). Moreover, according to

thisdataset, thedistinguishingfeatures atk=0:6

(11)

0

10

20

30

40

50

60

0.2

0.4

0.6

0.8

1 features

k

IPAQ

Distinguishing features

True speaker

Human imposter

Random guessing

0

10

20

30

40

50

60

0.2

0.4

0.6

0.8

1 features

k

Speaker-J

Distinguishing features

True speaker

TTS imposter

Cut-and-paste imposter

Random guessing

Figure4: Evaluationoftwodatasets; seeSections5.1and5.2.

of these recordings may provide amore optimistic

picturethanwouldberealizedin practice.

Part I of the Speaker-J data set is very rich, and

moreover, the research eort that generated this

data set carefully annotated the speech,

identify-ing the beginning, ending and midpoint of each

phoneme itcontains. Informally,phonemes arethe

basic units in the sound system of a language; in

the case of English,there are about 50 phonemes.

With these annotations,diphones canbeextracted

from therecorded speech. A diphone is a portion

of speech beginning in the middleof one phoneme

and ending in the middle of the next phoneme; a

diphonethusisanexampleofhowtheuser'sspeech

transitionsfromonephonemetoanother. Diphones

reveal much about the voice patterns of the user

whouttered them. So,partIoftheSpeaker-Jdata

set provides the opportunity to attempt a

dier-entform of attackagainstoursystem, namelyone

that wouldsimulate an attackerwhohad acorpus

of recordingsof theuser sayingmany things other

than thepassphrase itself. Aquestion weattempt

toansweriswhether theserecordingsassistthe

at-tackersignicantlyinndingtheuser'skey.

Specically, consider an attack in which the

at-tackerwishestotestacandidatepassphrase(inour

case, selected from part II) but does not how the

user speaks it. The attacker uses a text analysis

module fromatext-to-speech(TTS) system[28]to

translate the text of the passphrase into a string

ofphonemesthatrealizethepassphrase(i.e.,a

pro-nunciationforthetext),alongwithotherimportant

informationthatistypicallyusedwhensynthesizing

speech(e.g.,thedurationandthepitch contourfor

each phoneme). Of course, any of these features

maynotmatchexactlywhattheusersayswhenshe

speaks the passphrase. For example,a given word

canbepronouncedanumberofdierentways. So,

evengiventhecorrectpassphraseasinput,there is

no guarantee the text analysis module will yield a

stringof phonemes that matchesthe waythe user

speaksthepassphrase. Moreover,thedurationand

pitch predictions made by the text analysis might

dier signicantly from what the real user sounds

like.

Nevertheless, suppose theattackerpossessesa

cor-pus of recordings of the user speaking various

phrases other than the passphrase (in our

ex-periment, part I of the Speaker-J data),

anno-tated to identify phonemes and diphones. The

at-tackercanthenattempt to constructhow theuser

would saythepassphrase,usingtechniquesderived

from a concatenative text-to-speech synthesis

sys-tem(e.g.,[12]),in oneofthefollowingways:

Cut-and-pasteimposter Concatenate the raw

speechsamples (diphones,orlongersegments)

as-is from the inventory. There are various

forms of this. On the one hand, the

at-tackermaymakenomodicationsto duration

or pitch of the resulting speech. This yields

speechthat can soundverymuchlikethetrue

speaker, though there can be severe

disconti-nuities at the concatenation boundaries. In

addition, such an approach can yield

notice-able dierences in the recording levels within

(12)

at-discontinuities.

TTS imposter Useatraditional TTSsignal

pro-cessingback-endto synthesizethe passphrase.

Note that this is designed to produce nice

sounding speech, but that it also makes use

of theduration and pitch predictions that are

outputfromthetextanalysis module. Ifthese

predictions do not correspond to the way the

user actually speaks, this step might impede

theattack. Forexample,theusermayhavean

idiosyncratic way of saying a particular word,

either in her passphrase orin the instance in

theattacker'srecordingsoftheuser.

We experimentedwith four types of cut-and-paste

attacksandtwotypesofTTS attacks. Theresults

ofthesetestsareshownintherightsideofFigure4.

Thecurveslabeled\TTSimposter"and

\Cut-and-paste imposter" capture the best attacks of each

typethatwediscovered. Asthecurvesdemonstrate,

theseattacksbothperformedsimilarly,and

outper-formed random guessing in some cases. However,

it appears that the attacks as we conducted them

would fallshortofbreakingourscheme.

ThoughpartI oftheSpeaker-Jdataset consistsof

1600 sentences, it is not the case that an attacker

wouldneedtoassembleofcorpusofuserrecordings

ofthisextenttoattackatypicalpassphrase. Table1

approximates theaveragenumberofsentencesand

their cumulativeduration that the attackerwould

needtorecordtoobtainthediphonesineachofthe

vepassphrasesweexamined. Thesenumberswere

obtainedbyrandomlyselectingsentencesfrompart

IoftheSpeaker-Jdatasetuntiltheneededdiphones

were obtained.

passphrase passphrase sentences

number phonemes needed

0 24 340(805secs)

1 52 455(1071secs)

2 29 1297(3104.88secs)

3 27 152(367.320secs)

4 18 415(951.421secs)

Table1: Approximatenumberofsentencesattacker

would need torecord to obtaindiphones necessary

toreconstructeachpassphrasetested.

As speech synthesis technology improves, the size

decrease. However,TTSandcut-and-pasteattacks

ofthetypesweperformedrequireanannotated

cor-pus,andachievingthisannotationis avery

manu-allyintensiveprocessthatistypicallyconductedby

speech experts. Inthe case of the Speaker-J data

set, it is estimated that 200expert-hours of eort

was invested in achieving the annotated data set.

(It takesaboutonehour to manually segmentone

minuteofspeech.) Thisisalreadyasignicant

bar-rierto anattackerwishingto utilize these avenues

ofattack. Thoughautomaticlabellersareavailable

(e.g.,[30]),theirperformanceispoor,andweexpect

it would substantially increase the error rates for

theattacksoutlinedherein. Wedoexpect,however,

thatthesuccessofsuchattackswillincreaseevenfor

ourowndatasets,asweexploreinmoredetailways

toimprovetheeectivenessoftheseattacks. Inthe

fullversionofthispaperwewillprovideamore

de-tailedanalysisofthese threatsonaper-passhprase

basis. Wehopethatthis analysiswill beusefulfor

designingeectivecountermeasures.

6 Related Work

Theonly priorworkof which we areawareon the

topic of generating cryptographic keys from voice

utterancesisthatin whichthecryptographickeyis

merelythetextofwhatisspoken,recognizedusing

standard techniques in automatic speech

recogni-tion. (Actually,weareunawareof specic systems

thatevenjustdothis, butsinceitis animmediate

extension of using a typed password as a

crypto-graphickey, wetreatitasobviouspriorwork.) To

our knowledge, our work is the only research

to-ward determining a repeatable cryptographic key

that draws entropy from how the user speaks the

passphrase. How this paper contributes over our

own priorpublications on this topic wasdescribed

inSection2.

Thatsaid,therehasbeenworkongenerating

cryp-tographic keys from biometrics other than voice.

The rst such work of which we are aware is due

to Soutaret al. [25,26, 27], whodescribemethods

forgeneratingarepeatablecryptographickeyfroma

ngerprintusing opticalcomputingandimage

pro-cessingtechniques. Thesetechniquesgenerateakey

from a two-dimensionalimage (a ngerprintbeing

(13)

well-http://www.bioscrypt.com/).

Adierentapproachtogeneratingarepeatablekey

basedonbiometricdataisdue toDavida,Frankel,

and Matt [4]. In this scheme, a user carries a

portablestoragedevicecontaining(i)error

correct-ingparameterstodecodereadingsofthebiometric

(e.g., an irisscan) with alimitednumber oferrors

to a \canonical" reading for that user, and (ii) a

one-wayhashofthat canonicalreadingfor

verica-tionpurposes. This canonicalreading,once

gener-ated,canbeusedasacryptographickey,orcan be

hashedtogether withapassword(using adierent

hash function) to obtainakey. Juelsand

Watten-berg[9]generalizedandimprovedtheDavidaetal.

scheme through anovelmodicationin the use of

error-correcting codes, thereby shrinking the code

size and achieving higher resilience. These

tech-niquesareadierentapproachforgenerating

cryp-tographickeysfrombiometricreadings,andreacha

correspondingly dierentset oftradeos. Notably,

whereasourtechniquespermitausertoreconstruct

her key even if she is inconsistent on a majority

ofherfeaturedescriptorbits(notuncommon when

usingvoiceasabiometric[6]), these techniques do

not.

More distantly related work is that of Ellison et

al. for generating a cryptographic key based on

answers to questions posed to a user [7]. The

work is premised on the assumption that

ques-tionscanbeposedthat thelegitimateuserwill

an-swer one way but others attempting to

imperson-ate the user will answeranother way. Their

con-struction resemblesoneinstance ofourtechniques,

namely that of [15, Sections 5.1{5.2], and in this

waytheir scheme achievesa degreeof resilienceto

forgotten answers. However, Bleichenbacher and

Nguyen[2]haveshownthattheEllisonetal.scheme

isinsecure,whereasourconstructionsappeartobe

muchstronger. Anotherconstructionsimilartothat

in[15,Sections5.1{5.2]wasusedin thedesignofa

forensicdatabase, where aperson's medicalrecord

can be decrypted only once aDNA sample of the

personisobtained(e.g.,atacrimescene)[3].

How-ever,thisschemeisalsoinsuÆcientforourpurposes,

due tothesameinadequaciesastheschemeof [15,

Sections5.1{5.2].

Impersonation attacks using recordingsof theuser

speakingphrasesother than her passphrase,aswe

exploredin Section 5.2,havebeenpreviously

stud-ied for the purpose of fooling speaker verication

inthese worksare somewhatdierentfromour

ex-ploration here, however. Notably, in [14, 23], the

authors describesynthesizing apassphrase using a

speaker-independentmodel, andthenadapting the

pitch and duration of the synthesized passphrase

basedonrelativelyfew recordingsof theuser. The

authorsgiveevidencethateventhesesimpleattacks

can make it diÆcult to set acceptance thresholds

foraspeakervericationsystem. Infutureworkwe

hopetoexplorehowthesetechniquescanbeapplied

inthecontextofourwork.

7 Conclusion

Theviabilityof(re)generatingstrongcryptographic

keysfromvoiceutterancesremainsunproven. While

webelievethattheworkpresentedinthispaper

of-fers steps toward achieving this goal and evidence

that it can be reached, somecritical advances are

still required. Notably, our analysesusing m =46

in[16] andm=60hereareinsuÆcientforsecurity

asrequiredincommercial applications,andour

fu-tureworkwillcontinuetofocusonreachingm=80

or higher. More extensive user trials to evaluate

theentropyof users'distinguishingfeatures is also

needed. And, there remain numerous open

ques-tions asto how to tune an implementation of our

approach in practice. The analysis of Section 5

hidesmanyparametersfromview,andthe

relation-shipbetweentheseparameters,securityand

usabil-ityneedtobefurtherexplored.

Acknowledgements

WeareespeciallygratefultoDanielBleichenbacher

forthedesign,analysis,and implementationof the

encoding scheme in Section 4. Additional analysis

ofthe keyreconstruction schemepresentedtherein

will appear in [1]. Wealso thank SapeMullender,

Peter Bosch and Qiru Zhou for their assistance in

analyzinganomaliesinthekernelsoundmoduleson

thevariousplatformswehaveexperimentedwithin

thiseort. Weare thankfultoGreg Kochanski for

numerous insightful discussions, and Minkyu Lee,

for providing us with signal processing code that

attempts to correct the discontinuities in the

(14)

[1] D. Bleichenbacher. Encoding schemes for generating

keysfromalow-entropynoisysource,2002.

[2] D. Bleichenbacher and P. Nguyen. Noisy polynomial

interpolation and noisy Chinese remaindering.In

Ad-vances in Cryptology { Proceedings of EUROCRYPT

'2000(LNCS1807),Springer-Verlag,pages53{69,2000.

[3] P.Bohannon, M.Jakobsson, and S.Srikwan.

Crypto-graphic approaches to privacy in DNA databases. In

Proceedings of the 2000 International Workshop on

PracticeandTheoryinPublicKeyCryptography,

Jan-uary2000.

[4] G.I.Davida,Y.Frankel,andB.J.Matt. Onenabling

secureapplicationsthrougho-linebiometric

identica-tion. InProceedings of the 1998 IEEESymposium on

SecurityandPrivacy,pages148{157,May1998.

[5] R.B.Davies.AmatrixlibraryinC++.September1997.

http://www.webnx.com/r ober t.

[6] G.R.Doddington,W.Liggett,A.F.Martin,M.

Przy-bocki, and D. A. Reynolds. Sheep, goats, lambs and

wolves: Astatisticalanalysisofspeakerperformancein

the NIST1998 speakerrecognitionevaluation.In

Pro-ceedingsofthe5 th

InternationalConferenceonSpoken

LanguageProcessing,November1998.

[7] C.Ellison,C.Hall,R.Milbert,andB.Schneier.

Protect-ingsecretkeyswithpersonalentropy.FutureGeneration

ComputerSystems16:311{318,2000.

[8] D. Genoud and G. Chollet. Speech pre-processing

againstintentionalimpostureinspeakerrecognition.In

Proceedings of the 1998 International Conference on

SpokenLanguageProcessing,pages105{108,December

1998.

[9] A. Juels and M. Wattenberg. A fuzzy commitment

scheme.InProceedingsofthe6 th

ACMConferenceon

Computer and Communication Security, pages 28{36,

November1999.

[10] FrederickJelinek.StatisticalMethodsforSpeech

Recog-nition.MITPress,1997.

[11] J. B.Lacy, D. P. Mitchell, and W. M. Schell.

Cryp-toLib: cryptographyinsoftware.InProceedingsofthe

4 th

UNIXSecuritySymposium,Oct.1993.

[12] M.Lee,D.P.LoprestiandJ.P.Olive.Atext-to-speech

platformforvariablelength optimalunitsearching

us-ingperceptualcostfunctions.InProceedingsofthe4 th

ISCAResearchWorkshoponSpeechSynthesis,August{

September2001.

[13] Q.Liand A.Tsai.Amatchedlterapproachto

end-pointdetection forrobustspeakerverication.In

Pro-ceedingsofIEEEWorkshoponAutomaticIdentication

AdvancedTechnologies(AutoID'99),pages35{38,

Octo-ber1999.

[14] T. Masuko, T. Hitotsumatsu, K. Tokuda and

T.Kobayashi.On thesecurityofHMM-basedspeaker

verication systemsagainst imposture using synthetic

speech. In Proceedings of the European Conference

on Speech Communication and Technology, Budapest,

Hungary,vol.3,pages1223{1226,September1999.

[15] F. Monrose, M. K. Reiter, and S. Wetzel. Password

hardening basedon keystroke dynamics.International

Journal of Information Security 1(2):69{83, February

graphickeygenerationfromvoice(extended abstract).

InProceeedingsof the2001IEEESymposiumon

Secu-rityandPrivacy,May2001.

[17] F.Monrose,M.K.Reiter,Q.Liand S.Wetzel.Using

voicetogeneratecryptographickeys: Apositionpaper.

InProceedingsof Odyssey2001,TheSpeaker

Verica-tionWorkshop,June2001.

[18] B. L. Pellom and J. H. L. Hansen. An experimental

study of speaker verication sensitivity to computer

voicealteredimposters.InProceedingsof the1999

In-ternationalConferenceonAcoustics,Speech,andSignal

Processing,March1999.

[19] R. Power. 2001 CSI/FBI computer crime and

secu-ritysurvey.ComputerSecurityIssues&TrendsVII(1),

2001.

[20] L. Rabiner and B.H. Juang. Fundamentals of Speech

Recognition,PrenticeHall,1993.

[21] R. D. Rodman.Computer Speech Technology. Artech

House,Norwood,MA,1999.

[22] A.I.Rudnicky,S.D.Reed,andE.H.Thayer.

Speech-Wear: Amobilespeech system.In Proceedings of the

4 th

InternationalConferenceonSpokenLanguage

Pro-cessing,pages538{541,October1996.

[23] T. Satoh, T. Masuko, T. Kobayashi, and K.Tokuda.

Arobustspeakervericationsystemagainstimposture

usinganHMM-basedspeechsynthesissystem.In

Pro-ceedingsoftheEuropeanConferenceonSpeech

Commu-nicationandTechnology,vol.2,pages759{762,Aalborg,

Denmark,September2001.

[24] G.J.Simmons.Anintroductiontosharedsecretand/or

shared controlschemesand their application.In

Con-temporaryCryptology: TheScienceofInformation

In-tegrity,IEEEPress,1992.

[25] C.SoutarandG. J.Tomko.Secureprivatekey

gener-ationusingangerprint.InCardtech/Securetech

Con-ferenceProceedings,vol.1,pages245{252,May1996.

[26] C. Soutar, D. Roberge, A. Stoianov, R. Gilroy, and

B.V.K. Vijaya Kumar.Biometric encryption TM

using

imageprocessing.InOptical Security and Counterfeit

DeterrenceTechniquesII(Proc.SPIE3314),pages178{

188,1998.

[27] C. Soutar, D. Roberge, A. Stoianov, R. Gilroy, and

B.V.K.Vijaya Kumar.Biometric encryption TM

{

En-rollmentandvericationprocedures.InOpticalPattern

RecognitionIX(Proc.SPIE3386),pages24{35,1998.

[28] R.W.Sproat.Multilingualtext-to-speechsynthesis:The

BellLabsapproach.KluwerAcademicPulishers,1998.

[29] J.Tranter.Linux Multimedia Guide.O'Reillyand

As-sociates,Inc.September,1996.

[30] C. W. Wightman and D. T. Talkin. The aligner:

Text-to-speech alignment usign Markov models. In

Progress in Speech Synthesis, Chapter 25, pages 313{