on Resource Constrained Devices (Extended Abstract)
Fabian Monrose*, Michael K. Reiter†, Qi Li*, Daniel P. Lopresti*, Chilin Shih*
Abstract
Programmable mobile phones and personal digital assistants (PDAs) with microphones permit voice-driven user interfaces in which a user provides input by speaking. In this paper, we show how to exploit this capability to generate cryptographic keys on such devices. Specifically, we detail our implementation of a technique to generate a repeatable cryptographic key on a PDA from a spoken passphrase. Rather than deriving the cryptographic key from merely the passphrase that was spoken (which would constitute little more than an exercise in automatic speech recognition), we strive to generate a substantially stronger cryptographic key with entropy drawn both from the passphrase spoken and how the user speaks it. Moreover, the cryptographic key is designed to resist cryptanalysis even by an attacker who captures and reverse-engineers the device on which this key is generated. We describe the major hurdles of achieving this on an off-the-shelf PDA bearing a 206 MHz StrongArm CPU and an inexpensive microphone. We also evaluate our approach using multiple data sets, one recorded on the device itself, to clarify the effectiveness of our implementation against various attackers.
1 Introduction
Futuristic mobile computing platforms will offer, and in some cases will require, methods of user input other than a keyboard, mouse or joystick. This is especially true for head-mounted displays and other wearable computers (e.g., see [22]). For such futuristic devices, and even for next-generation PDAs and programmable mobile phones, voice is a leading contender for the dominant user input medium.
* Bell Labs, Lucent Technologies, Murray Hill, NJ, USA; {fabian,qli,dpl,cls}@research.bell-labs.com
† Department of Electrical and Computer Engineering and Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA; [email protected]
We argue that if voice prevails in this sense, then this poses a challenge for securing data on these devices. On the one hand, if our experience with laptop computers and mobile phones is any indication, then these devices will be stolen frequently: laptop theft is already the second leading quantifiable cost to enterprises from IT-related security threats [19]. Similarly, mobile phones are the object of theft in four of every ten personal robberies in several cities in the United Kingdom, and these areas logged a fourfold increase in personal robberies involving mobile phones between 1998-99 and 2000-01.¹ These trends suggest that encryption of any sensitive data on such devices is prudent. On the other hand, presuming that these devices will not be tamper-resistant, the cryptographic key with which such encryption can be performed would need to be derived from the voice input of the user, presumably some form of spoken passphrase. Unfortunately, spoken passphrases are likely to have far less entropy than typed ones, due to their need to be pronounceable and due to other forms of information loss in a spoken representation (e.g., capitalization and punctuation).
In this paper we describe an implementation of an approach to derive a repeatable cryptographic key from spoken user input, in which the entropy of the key is drawn from both the passphrase that is spoken and the speech patterns of the user while speaking it. In this way, even if the entropy of the passphrase space is small, the variability across users' vocal tracts will pose an additional obstacle to the cryptanalysis of the key. Moreover, our approach uses techniques designed to withstand an attacker who captures the device on which the key generation takes place (but not while it is taking place), i.e., an attacker who has full access to the stable storage of the device.
¹ See http://news.bbc.co.uk/hi/english/uk/newsid_1440000/1440863.stm.
Our general approach for achieving this was discussed in [16], though as only an initial step toward this goal, that work evaluated the approach only for the generation of 46-bit keys, using only utterances recorded over phone calls, and without regard for the difficulties faced in implementing the approach on resource-constrained devices. Here, we provide a somewhat more realistic evaluation of this approach using a full implementation on an off-the-shelf PDA (the Compaq IPAQ), using data recorded on that PDA, and targeting 60-bit cryptographic keys. We detail numerous changes and refinements we needed to make the approach viable on this platform. We will also give evidence to suggest that the adversary gains little by knowing the user's passphrase, and that the advantage the adversary gains by additionally recording the user saying phrases other than the passphrase is less than one might expect.
We caution the reader, however, of several limitations of our analysis and our approach. First, though we demonstrate the reliable re-generation of a key using a 60-bit characterization of the user's utterance, we do not claim to necessarily achieve keys with 60 bits of entropy. Indeed, one can draw few conclusions regarding key entropy from the limited user studies [16, 17] that we have been able to perform. That said, our studies suggest that the technique we have implemented does draw significant entropy from passphrase utterances, and at the very least should offer greater entropy than the passphrase space alone. Second, as our approach strives to re-generate a cryptographic key whenever the legitimate user utters her passphrase, it is necessarily vulnerable to an attacker who can obtain both the user's device and a high-quality recording of the user uttering her passphrase; this exposes all the user's keying material, after all. While any biometric is vulnerable to such an attack, we raise this point here to emphasize the primary attacker with which we are concerned: the attacker who captures the device but who does not have access to the user. That said, we do show, somewhat surprisingly, that knowledge of the user's passphrase seems to help the attacker little, and that even recordings of the user saying other phrases help only marginally.
2 Background
In this section we give the background necessary to describe our contributions in this paper. Our general approach to generating a cryptographic key from a spoken utterance of a passphrase is described in [16] and is based on an approach initially developed for generating a key from a typed password and the user's keystroke timings during the entry of that password [15]. How this paper relates to these prior publications is described below.
In our approach, the system is initialized by first generating a cryptographic key $K$, and then generating a collection of $2m$ shares of $K$ using a generalized secret sharing scheme (e.g., see [24] for a survey of such schemes), where $m$ is a system parameter. These shares are arranged in an $m \times 2$ table that is stored on stable storage. The shares must be generated and placed within the table so that $K$ can be reconstructed from any set of $m$ shares consisting of one share from each row of the table.
Upon entry of the passphrase, the system measures $m$ biometric features of the user's entry of the passphrase. We denote the $i$-th feature by $\phi_i$, and denote the value of feature $\phi_i$ on the $\ell$-th (successful or unsuccessful) attempt to log into this user's account (i.e., generate this user's key) by $\phi_i(\ell)$. In the case of a spoken passphrase, the features $\phi_i$ are features extracted from the user's utterance as described in [16]. For the $\ell$-th login attempt, the system then generates a bit string $b_\ell \in \{0,1\}^m$ from $\phi_0(\ell), \ldots, \phi_{m-1}(\ell)$; $b_\ell$ is called a feature descriptor, and the $i$-th bit of $b_\ell$, denoted $b_\ell(i)$, is determined from the $i$-th feature $\phi_i(\ell)$. Algorithms for generating $b_\ell$ from $\phi_0(\ell), \ldots, \phi_{m-1}(\ell)$ are proposed and evaluated in [16], but for the purposes of this paper, the reader can think of $b_\ell$ being determined by
\[ b_\ell(i) = \begin{cases} 0 & \text{if } \phi_i(\ell) < \tau_i \\ 1 & \text{otherwise} \end{cases} \tag{1} \]
where $\tau_i$ denotes some fixed threshold value. The system then attempts to reconstruct $K$ using the table elements at positions $\{\langle i, b_\ell(i) \rangle\}_{0 \le i < m}$. When the login index $\ell$ is not necessary in the discussion, we will typically omit it.
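The thresholding rule in (1) and the resulting table lookups can be sketched in a few lines of Python. This is only an illustration of the simplified rule above, not the descriptor-generation algorithms of [16]; the function names and the list-based representation of features and thresholds are assumptions made for the example.

```python
def feature_descriptor(features, thresholds):
    """Return the m-bit descriptor b, with b[i] = 0 if features[i] < thresholds[i], else 1.

    `features` and `thresholds` are hypothetical lists of m numbers each.
    """
    if len(features) != len(thresholds):
        raise ValueError("need exactly one threshold per feature")
    return [0 if value < threshold else 1
            for value, threshold in zip(features, thresholds)]


def table_positions(b):
    """Positions <i, b(i)> of the table elements used to attempt reconstruction."""
    return [(i, bit) for i, bit in enumerate(b)]
```

For example, `table_positions(feature_descriptor(x, thresholds))` selects one element from each of the m rows of the table.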
After a history of feature descriptors from successful logins is observed (i.e., logins in which the key was successfully reconstructed), those elements of [...] measurements are such that $b(4) = 1$ consistently, then the $\langle 4, 0 \rangle$ element of the table is randomly altered. If $b(i)$ is sufficiently consistent that element $\langle i, 1 - b(i) \rangle$ in the table is perturbed in this way, then $b(i)$ is called a distinguishing feature. We will give a precise characterization of distinguishing features in Section 5.
The correct user, when inducing feature descriptors consistent with those she has induced in the past, should not encounter any of the altered elements in the table. We have found, however, that in practice it is necessary for the system to attempt reconstruction using feature descriptors within some Hamming distance $c_{\max}$ of the induced feature descriptor, to correct for up to $c_{\max}$ "errors" by the user (e.g., see [15]). This results in up to
\[ \sum_{i=0}^{c_{\max}} \binom{m}{i} \tag{2} \]
secret sharing reconstructions per key (re)generation attempt.
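To give a feel for the size of expression (2), the following Python sketch evaluates it and enumerates the candidate descriptors within a given Hamming distance. The function names and the nearest-first enumeration order are assumptions for illustration; the actual search strategy of the implementation is the optimized one described in Section 4.2.

```python
from itertools import combinations
from math import comb


def num_reconstructions(m, c_max):
    """Expression (2): number of descriptors within Hamming distance c_max."""
    return sum(comb(m, i) for i in range(c_max + 1))


def descriptors_within(b, c_max):
    """Yield every descriptor within Hamming distance c_max of b, nearest first."""
    m = len(b)
    for c in range(c_max + 1):
        for positions in combinations(range(m), c):
            candidate = list(b)
            for i in positions:
                candidate[i] ^= 1  # flip the bit at an assumed error position
            yield candidate
```

With m = 60 and a budget of 5 corrections, `num_reconstructions(60, 5)` is 5,985,198, which is why keeping the number of corrections small matters on a constrained device.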
Security of this technique requires that an adversary who captures the device be unable to efficiently differentiate a random table element from a valid share of $K$. If this is the case, then an adversary may be forced to simply guess feature descriptors until he finds $K$. If $d$ table elements were randomized (i.e., there are $d$ distinguishing features), then $2^{m-d}$ out of the $2^m$ possible feature descriptors yield $K$, and so the probability that a randomly chosen descriptor will reconstruct the secret is $2^{-d}$.
Specific secret sharing schemes for populating this table were investigated in [15]. That paper also included an evaluation of this approach with feature descriptors of length $m = 15$ derived from the keystroke timings of a user while typing an 8-character password. There, we evaluated an implementation in which the table was additionally encrypted with the password; in this way, the technique serves to render a dictionary attack against the password up to $2^{15}$ times more difficult. Our subsequent work on voice features [16] described algorithms for generating feature descriptors from the user's voice while speaking a passphrase. It further evaluated the security and reliability of the resulting technique with feature descriptors of length $m = 46$ derived from preexisting recordings of users over a phone line. However, in contrast to the keystroke case, here our evaluation presumed a table that was [...] passphrase (to decrypt the table). In this case, $m = 46$ does not provide nearly enough security for important applications.
In this paper we address the computational challenges of performing key reconstruction on a resource-constrained PDA with more realistic parameters than our previous voice study explored. Specifically, we evaluate our implementation of this approach for feature descriptors of length $m = 60$, and argue that regenerating the key $K$ can be reliably achieved on a 206 MHz StrongARM processor by correcting for up to $c_{\max} \approx 5$ errors (in the sense described above). The challenges in achieving this are the front-end signal processing needed to keep $c_{\max}$ small so that expression (2) remains manageable, and in devising a secret sharing scheme and corresponding reconstruction algorithm that permits this reconstruction to occur in a reasonable amount of time on this platform. Consequently, we focus on these contributions in this paper, and refer the reader to [16] for the algorithmic details comprising other steps of the key (re)generation process.
3 Front-end Signal Processing
As described in Section 2, the front-end signal processing performed by the device is critical for efficiently generating a key from the user's utterance of her passphrase. Intuitively, the goal of this signal processing is to translate the sound uttered by the user (which is received in the device as a series of amplitude samples from its microphone and analog-to-digital (A/D) converter) into a representation suited for the generation of a feature descriptor (using the algorithms of [16]). Ideally, this signal processing should yield a representation that is as "clean" as possible, in that inter-word silence and background noise affect this representation as little as possible. The less silence and background noise in the representation after signal processing, the more consistent the user's utterances will be (thereby increasing $d$ and the security of the scheme) and/or the less error correction will be necessary in the later stages of key generation (i.e., the smaller $c_{\max}$ and expression (2) can be).
Of course, the benefits of signal processing in terms [...] to find the right balance of eliminating environmental effects early via signal processing, versus relying on the error correction in the key generation step to compensate for the effects of noise and silence in the user's utterance. In this section we describe the series of signal processing steps that we believe best achieves this balance. These steps are pictured in Figure 1 and described textually below.
[Figure 1: block diagram. Module labels: A/DC, down sampling, autocorrelation analysis, LPC analysis, silence removal, cepstrum mean-subtract, voice-only frames, endpoint detection, energy.]
Figure 1: Outline of the front-end modules used for capturing the speech and processing the signal to generate a sequence of frames comprising the voice-only portions in the utterance.
As the speaker utters her passphrase, the signal is sampled at a predefined sampling rate, which is the number of times the amplitude of the analog signal is recorded per second. The minimum sampling rate supported by our target platform, the IPAQ™ (see Section 5.1), is 32 kHz; i.e., 32,000 samples are taken per second.² Each sample is represented by a fixed number of bits. Obviously, the more bits there are per sample, the better is the resolution of the reconstructed signal, but the more storage is required for saving and processing the utterance. In our implementation, we represent the signal using 16 bits per sample. Therefore, the amount of storage required per second of recorded speech is
\[ 32000\ \frac{\text{samples}}{\text{second}} \times 2\ \frac{\text{bytes}}{\text{sample}} = 64\ \frac{\text{kilobytes}}{\text{second}} \tag{3} \]
Since the utterance of one of our tested passwords can easily be 6 seconds or more, the storage requirements for processing even a single utterance can be [...] is especially true since, as we have found by experience, writing to stable storage while the recording is ongoing can introduce noise into the recording. In our case, this particularly posed an issue for our experimental evaluation in which we needed to acquire many samples from the speaker; see Section 5.1.
² The IPAQ is claimed to support lower sampling rates, but we have been unable to get them to work correctly under [...]
To make subsequent processing on the device efficient, our implementation first makes several modifications to the recorded speech to reduce the number of samples. For example, we down-sample the 32,000 samples per second to only 8,000 samples per second, effectively achieving an 8 kHz sampling rate. For most voice-related applications, a sampling rate of 8 kHz is sufficient to reconstruct the speech signal. In fact, nearly all phone companies in North America use a sampling rate of 8 kHz [21].
Down sampling must be performed with some care, however, due to the sampling theorem [20]. The sampling theorem tells us that the sampling rate must exceed twice the signal frequency to guarantee an accurate and unique representation of the signal. Failure to obey this rule can result in an effect called aliasing, in which higher frequencies are falsely reconstructed as lower frequencies. Down sampling to 8 kHz therefore implies that only frequencies up to 4 kHz can be accurately represented by the samples. Therefore, when down sampling to 8 kHz we use a low-pass digital filter with cutoff at 4 kHz to strip higher frequencies from the signal. That is, this filter takes sound of any frequencies as input and passes only the frequencies of 4 kHz or less.
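A minimal sketch of this filter-then-decimate step is shown below in pure Python. It is not the filter used in the implementation, whose design is not given here: the Hamming-windowed sinc design, the 101-tap length, and the 3.4 kHz design cutoff (chosen so that the filter's transition band ends just below the 4 kHz limit) are all assumptions for illustration.

```python
import math


def lowpass_taps(cutoff_hz, sample_rate_hz, num_taps=101):
    """Hamming-windowed sinc low-pass FIR filter, normalized to unit gain at DC."""
    fc = cutoff_hz / sample_rate_hz  # cutoff in cycles per sample
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        ideal = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [tap / total for tap in taps]


def downsample(samples, factor=4, cutoff_hz=3400.0, sample_rate_hz=32_000):
    """Low-pass filter `samples`, keeping only every `factor`-th filtered output."""
    taps = lowpass_taps(cutoff_hz, sample_rate_hz)
    half = len(taps) // 2
    out = []
    for center in range(0, len(samples), factor):
        acc = 0.0
        for k, tap in enumerate(taps):
            j = center + k - half
            if 0 <= j < len(samples):  # treat samples outside the recording as zero
                acc += tap * samples[j]
        out.append(acc)
    return out
```

Only the outputs that survive decimation are computed, which cuts the filtering work by the decimation factor; on a constrained device this kind of saving is worth having.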
After down sampling, the signal is broken into 30 millisecond (ms) windows, each overlapping the next by 10 ms. Within each window are 240 samples (since 8,000 samples/second × 0.03 seconds = 240 samples). Overlapping windows by 10 ms avoids discontinuities from one window to the next, and additional smoothing is performed within each window to yield as smooth a signal as possible.
The goal of the next signal processing steps is to derive a frame from each window. A frame is a 12-dimensional vector of real numbers called cepstral coefficients [20], which have been shown to be a very robust and reliable feature set for speech and speaker recognition. These cepstral coefficients are obtained using a technique called autocorrelation analysis. The basic premise behind autocorrelation analysis is that each speech sample can be [...] coefficients using autocorrelation analysis involves highly specialized algorithms that we do not detail here, but that are standard in speech processing (linear predictive coding [10]).
A side effect of generating frames is a calculation of the energy of the signal per window. The energy of a window is proportional to the average amplitude of the samples in the window, measured in decibels (dB). Energy can be used to remove frames representing silence (which has very low energy) from the frame sequence, via a process called endpoint detection [13]. The silence portions of the feature frames are then removed and the voice portions concatenated.
Final modifications to the frame sequence are made via a technique called cepstral mean subtraction. In this technique, the component-wise mean over all frames is computed and subtracted from every frame in the sequence. Intuitively, if the mean vector is representative of the background noise or the channel characteristics in the recording environment, then subtracting that mean vector from all the frames yields a frame sequence that is more robust in representing the user's voice.
After this, the speech data is segmented and converted to a sequence of bits (a feature descriptor) as described in [16]. This feature descriptor is used to regenerate the secret key from the previously stored table of shares, as described in Section 2.
4 Encoding Scheme
We review the encoding scheme of Bleichenbacher presented in [1]. That scheme focuses on the particular secret sharing scheme we use to populate the $m \times 2$ table described in Section 2 and from which the key $K$ is reconstructed. We quickly found in the implementation of this technique on the resource-constrained IPAQ that the secret sharing schemes suggested in [15] would not suffice. That paper suggests three different schemes. One is sufficiently resource efficient for our purposes but also has potential security weaknesses (see [15, Sections 5.1-5.2], [2]), and while the other two address this weakness [2], they are simply too computationally intensive during reconstruction to permit the degree of error correction we require. Therefore, to achieve [...] secret sharing scheme that permits fast reconstruction and that appears to be secure for our purposes.
We emphasize that the type of security we require for our secret sharing scheme is different from the typical security definition of a secret sharing scheme. The latter definition is, informally, that an adversary not possessing a sufficient set of shares be unable to reconstruct the secret. In our case, however, the adversary who captures the device is in possession of all shares in the table, and so clearly possesses enough shares to reconstruct the secret. Our security requirement is rather that the adversary be unable to efficiently find a sufficient set of valid shares in the table, i.e., a set containing a valid share from each row of the table and no invalid (random) elements. Ideally, the best the adversary could do would be to repeatedly try reconstruction with a randomly chosen set containing one element from each row. However, because the invalid shares are placed according to an unknown distribution determined by the biometric features of the user (and not uniformly at random), it is impossible to formally reduce the security of such a scheme to a well-known cryptographic problem. (Obviously there are distributions that would leave the scheme trivially breakable.) As such, until we find a better way to model security, we are stuck with heuristically secure schemes; the approach described in this section is one. Nevertheless, we will comment in detail about our current knowledge of the security of this scheme in Section 4.4.
4.1 Initialization
Here we do not describe the secret sharing scheme in its full generality, but rather describe how it is instantiated for our particular purposes. When a device is initialized for a user, the device first selects a random $(m-1)$-element column vector $s \in Z_p^{m-1}$ for $p$ a prime. Here, $p$ can be small; $p = 2^{31} - 1$ is a suitable choice, so that arithmetic on 32-bit processors is very fast. The vector $s$ is a secret vector, the recovery of which is necessary to obtain the key $K$. For example, $K$ can be defined as $K = h(s)$ for $h$ a one-way function, or $K$ can be encrypted with $h(s)$.
The second step in initialization is to generate a random $2m \times (m-1)$ matrix $U = (u_{ij})_{0 \le i < 2m,\ 0 \le j < m-1}$ [...] generated pseudorandomly, then storing the seed of the pseudorandom process is sufficient for $U$ to be regenerated when needed. The $2m$-element table $t = (t_0, t_1, \ldots, t_{2m-1})^T \in Z_p^{2m}$ is then generated as
\[ t \equiv_p U s \]
(where "$\equiv_p$" denotes equivalence modulo $p$). That is, $t$ is the table as described in Section 2; intuitively, the element in the $i$-th "row" and $j$-th "column" for $0 \le i < m$ and $j \in \{0, 1\}$ is $t_{2i+j}$.
To complete initialization, $s$ is deleted, and $U$ (or the seed needed to regenerate it), the table $t$, and prime $p$ are stored for the next key regeneration attempt. In addition, $y = h'(s)$ is stored for some (different) one-way and collision-resistant function $h' \ne h$, so that when $s$ is reconstructed, it can be recognized as correct.
After a sufficient number of successful key reconstructions (see Section 4.2), the table $t$ is "hardened" as described in Section 2: if over a number of successful key reconstructions, each induced feature descriptor $b$ is consistent on the $i$-th feature (i.e., $b(i)$ is usually the same, as specified more precisely in Section 5), then element $t_{2i+(1-b(i))}$ is assigned to be a random element of $Z_p$. This should usually not affect reconstruction for the correct user, since that user typically selects $t_{2i+b(i)}$.
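The initialization and hardening steps can be sketched in Python as follows. This is an illustration of the steps as described in this section, not the C/C++ implementation: the use of SHA-256 as a stand-in for the check function, the `secrets` module as the randomness source, and all function names are assumptions.

```python
import hashlib
import secrets

P = 2**31 - 1  # the prime suggested in the text


def h_prime(s):
    """Stand-in for the one-way check function (SHA-256 is an assumption)."""
    data = b"check" + b"".join(x.to_bytes(4, "big") for x in s)
    return hashlib.sha256(data).hexdigest()


def initialize(m, p=P):
    """Return (s, U, t, y); a real device would delete s after computing t and y."""
    s = [secrets.randbelow(p) for _ in range(m - 1)]                      # secret vector
    U = [[secrets.randbelow(p) for _ in range(m - 1)] for _ in range(2 * m)]
    t = [sum(u * x for u, x in zip(row, s)) % p for row in U]             # t = U s mod p
    return s, U, t, h_prime(s)


def harden(t, i, usual_bit, p=P):
    """Randomize the element of row i that the user does not normally select."""
    t[2 * i + (1 - usual_bit)] = secrets.randbelow(p)
```

Here `s` is returned only so the example can be checked; per the description above, only the matrix (or its seed), the table, the prime, and the check value would be kept on the device.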
4.2 Key reconstruction
As described in Section 2, the input from the user, in this case an utterance, is used to generate an $m$-bit feature descriptor $b$. The key regeneration program constructs a vector $t_b \in Z_p^m$ by selecting the corresponding elements of the table, i.e., $(t_b)^T = (t_{2i+b(i)})_{0 \le i < m}$. It similarly constructs an $m \times (m-1)$ matrix $U_b = (u_{2i+b(i),j})_{0 \le i < m,\ 0 \le j < m-1}$. Note that solving
\[ U_b s' \equiv_p t_b \tag{4} \]
for $s'$ would yield $s = s'$ if $b$ contained no user errors, and the fact that $s' = s$ could be confirmed by testing whether $y = h'(s')$. However, since instantiating and solving (4) for all the different feature descriptors $b'$ within Hamming distance $c_{\max}$ from $b$ would require too much time on the target device, our error correction strategy takes a different approach.
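For reference, the straightforward (slow) way to solve system (4) for a single descriptor is Gaussian elimination modulo p, sketched below in Python. This is the baseline that the optimized strategy avoids repeating for every candidate; the function names are hypothetical, and `pow(x, -1, p)` requires Python 3.8 or later.

```python
def select_rows(U, t, b):
    """Pick row 2i + b[i] of U and element 2i + b[i] of t for each feature i."""
    indices = [2 * i + bit for i, bit in enumerate(b)]
    return [U[k] for k in indices], [t[k] for k in indices]


def solve_mod_p(A, rhs, p):
    """Solve A x = rhs (mod p) for A with at least as many rows as columns.

    Returns x as a list, or None if A is rank-deficient or the system is inconsistent.
    """
    rows, cols = len(A), len(A[0])
    M = [[v % p for v in row] + [r % p] for row, r in zip(A, rhs)]
    for c in range(cols):
        pivot = next((r for r in range(c, rows) if M[r][c]), None)
        if pivot is None:
            return None  # no pivot in this column: rank-deficient
        M[c], M[pivot] = M[pivot], M[c]
        inv = pow(M[c][c], -1, p)  # modular inverse of the pivot
        M[c] = [(v * inv) % p for v in M[c]]
        for r in range(rows):
            if r != c and M[r][c]:
                f = M[r][c]
                M[r] = [(a - f * b_) % p for a, b_ in zip(M[r], M[c])]
    if any(M[r][cols] for r in range(cols, rows)):
        return None  # a leftover equation reads 0 = nonzero: inconsistent
    return [M[c][cols] for c in range(cols)]
```

A candidate solution returned by `solve_mod_p` would still be confirmed against the stored check value before being accepted.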
This faster approach derives from the observation that the equation $U_{b'} s' \equiv_p t_{b'}$ for a feature descriptor $b'$ [...] $b'$ to be rejected very quickly if this equation has no solution. Specifically, let $\tilde{U}_{b'}$ be the $m \times m$ matrix whose first $m-1$ columns are the same as $U_{b'}$ and whose last column is $t_{b'}$. Then, a solution exists only if $\det(\tilde{U}_{b'}) \equiv_p 0$, and so if $\det(\tilde{U}_{b'}) \not\equiv_p 0$ then $b'$ can be discarded from further consideration.
Using a recursive algorithm, it is possible to check $\det(\tilde{U}_{b'}) \equiv_p 0$ for feature descriptors $b'$ within $c_{\max}$ errors of $b$ using only one additional Gaussian elimination step per new $b'$. Due to this feature, our implementation can test over $6 \times 10^6$ feature descriptors $b'$ per second when $m = 60$.
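The determinant test itself can be illustrated with a naive Python sketch that recomputes the determinant from scratch for one candidate. It deliberately does not reproduce the recursive, one-elimination-step-per-candidate algorithm mentioned above, whose details are not given here; function names are hypothetical.

```python
def det_mod_p(M, p):
    """Determinant of a square matrix modulo a prime p, by Gaussian elimination."""
    M = [[v % p for v in row] for row in M]
    n = len(M)
    det = 1
    for c in range(n):
        pivot = next((r for r in range(c, n) if M[r][c]), None)
        if pivot is None:
            return 0  # singular modulo p
        if pivot != c:
            M[c], M[pivot] = M[pivot], M[c]
            det = -det  # a row swap flips the sign
        det = det * M[c][c] % p
        inv = pow(M[c][c], -1, p)
        for r in range(c + 1, n):
            f = M[r][c] * inv % p
            if f:
                M[r] = [(a - f * b_) % p for a, b_ in zip(M[r], M[c])]
    return det % p


def passes_determinant_test(U_b, t_b, p):
    """True if the m x m matrix [U_b | t_b] is singular mod p (a solution may exist)."""
    augmented = [list(row) + [v] for row, v in zip(U_b, t_b)]
    return det_mod_p(augmented, p) == 0
```

A candidate that fails this test can be discarded immediately; one that passes still has to be solved and confirmed, since a zero determinant is necessary but not sufficient.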
Two further observations are worth noting here. First, the determinant of even a randomly chosen matrix is 0 mod $p$ with probability $p^{-1}$, which is not negligible since $p$ is chosen to be small. Therefore, if $\det(\tilde{U}_{b'}) \equiv_p 0$ and so we proceed to solve $U_{b'} s' \equiv_p t_{b'}$ for $s'$, it remains necessary to confirm $s'$ by checking $y = h'(s')$. Second, there is a small chance that (4) is not solvable even if $b$ is consistent with all the user's distinguishing features, because $U_b$ might not have full rank. Under the assumption that $U_b$ is chosen uniformly at random, it follows that $U_b$ has rank $m-1$ with probability $\prod_{j=2}^{m} (1 - p^{-j}) = 1 - p^{-2} + O(p^{-3})$. The probability of rank deficiency is therefore about $p^{-2}$, which is much smaller than the probability that the system cannot be solved because the feature descriptor $b$ induced by the user's utterance contains more than $c_{\max}$ errors, and thus can be ignored.
4.3 Performance
For performance measurements we choose to benchmark key reconstruction on a 206 MHz StrongArm and a 500 MHz Ultra II. The first processor is that currently available in the IPAQ™, on which our current implementation runs. The second processor is in line with the current trends³ in the hand-held market, and thus allows us to forecast expected performance on future PDAs. Our performance benchmarks consist of a collection of C/C++ modules cross-compiled for the ARM, comprising signal processing code, an enhanced version of cryptolib 1.2 [11] updated by Daniel Bleichenbacher, and a matrix manipulation package newmat v9.0 [5] for implementing the segmentation [...]
³ The Intel 80200 processor based on the Arm compliant XScale™ microarchitecture that supports 400, 600, and [...]
To test the performance of our key reconstruction routines we devised the following experiment. First, an $m \times 2$ table of shares of the key $K$ is generated as outlined in Section 2, i.e., as a user's key regeneration table would be initialized in practice. Then, $d$ rows of the table (features) are selected at random to be distinguishing, and one element of each of these $d$ rows is perturbed randomly, as if the user were consistent in utilizing the other element of this row (see Section 2). Finally, a feature descriptor $b$ is chosen with the property that $c \le c_{\max}$ of the $d$ distinguishing features (chosen at random), say $b(i_1), \ldots, b(i_c)$, are "errors", i.e., are set to select the randomly perturbed elements of rows $i_1, \ldots, i_c$. The key reconstruction process performs a number of reconstructions that depends on $c$ in its attempts to correct for such errors; the number of reconstructions performed in the worst case is shown in expression (2) of Section 2. Our benchmark is the amount of time required to reconstruct the key $K$ on average, which is a rough measure of the time required to perform
\[ \frac{1}{2}\binom{m}{c} + \sum_{i=0}^{c-1} \binom{m}{i} \tag{5} \]
secret sharing reconstructions and test each for correctness. Note that for $c = c_{\max}$, (5) is less than (2) by $\binom{m}{c}/2$ since on average, reconstruction succeeds after searching half of the feature descriptors that correct $b$ on exactly $c$ locations.
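As a quick numeric check of how expression (5) relates to the worst case in expression (2), the following Python sketch evaluates both. The function names are hypothetical.

```python
from math import comb


def worst_case_reconstructions(m, c_max):
    """Expression (2): all descriptors within Hamming distance c_max."""
    return sum(comb(m, i) for i in range(c_max + 1))


def expected_reconstructions(m, c):
    """Expression (5): half of the distance-c candidates plus all nearer ones."""
    return comb(m, c) / 2 + sum(comb(m, i) for i in range(c))
```

For m = 60 and c = 5, the expected count is 3,254,442, compared with a worst case of 5,985,198; the difference is exactly half of the 5,461,512 descriptors at distance 5.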
Our results for this benchmark, shown in Figure 2, are significantly less than multiplying (5) by the time to perform and test a single reconstruction in our secret sharing scheme. The reason is the significant optimizations that can be achieved as described in Section 4.2.
4.4 Security
One potential security weakness of this scheme is the fact that an adversary who captures the device can conceivably reconstruct $s$ from not just one element of the table per row, but instead using any $m$ elements of the table. For example, if the adversary had reason to believe that the first $m/2$ rows contained no distinguishing features (and thus all table elements in these rows were valid), then the adversary could set $t'[i] = t[i]$ for $0 \le i < m$ and $U' = (u_{ij})_{0 \le i < m}$ and use these to solve for $s$. There[...] more likely than not to be distinguishing.
We now describe the fastest attack on our scheme of which we are aware. An attacker who captures the device on which the key is generated but who has no information about the user's distinguishing features may attack the system by repeatedly guessing a feature descriptor $b$ at random and solving (4). If there are $d$ distinguishing features then each guess will be successful with probability $2^{-d}$, but will require the attacker to solve a system of $m$ linear equations, which is quite time consuming. A faster approach is to choose feature descriptors $b_0, b_1, \ldots$ such that each differs from the last in one bit. Then, computing $U_{b_i}^{-1}$ requires only one new Gaussian elimination step per $b_i$, and the further optimizations outlined in Section 4.2 can also be applied in this case.
The expected time for this attack to succeed can be computed as follows. Assume that $b_i$ and $b_{i+1}$ differ in exactly one position that is chosen at random. Let $G(c)$ for $c \ge 2$ denote the expected number of Gaussian elimination steps performed until reaching a $b_i$ with no errors (i.e., that is consistent with all of the user's distinguishing features), assuming that $b_0$ has $c$ errors. Note that $b_{i+1}$ has a different number of errors than $b_i$ with probability $d/m$, and if it has a different number of errors, then it decreases the number of errors (by one) with probability $c/d$ and increases it with probability $(d-c)/d$. Hence, we get the following equations for $G(c)$.
\[ G(2) = 0, \qquad G(c) = \frac{m}{d} + \frac{c}{d}\, G(c-1) + \frac{d-c}{d}\, G(c+1) \quad \text{for } 2 < c \le d \]
Solving these linear equations at $c = d/2$ yields the expected number of Gaussian elimination steps before recovering the key, since a random feature descriptor contains an expected $c = d/2$ errors. Moreover, after the Gaussian elimination step for each $b_i$, $m(m-1)$ multiplications are required to test $b_i$ on average. So, the total cost of this attack is as shown in Figure 3. (Actually, this is a conservative lower bound, since the cost of each Gaussian elimination step itself is not counted.)
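The recurrence for G(c) can be solved exactly with a short Python sketch. Rearranging the equation for G(c) gives c·D(c) = m + (d − c)·D(c+1), where D(c) = G(c) − G(c−1), so the differences can be computed backward from c = d and then summed. The function names, the use of exact fractions, and rounding d/2 down for odd d are assumptions of this sketch.

```python
import math
from fractions import Fraction


def expected_elimination_steps(m, d):
    """Return {c: G(c)} for 2 <= c <= d, with G(2) = 0, as exact fractions."""
    # D[c] = G(c) - G(c-1); the recurrence gives c * D[c] = m + (d - c) * D[c+1].
    D = {d + 1: Fraction(0)}
    for c in range(d, 2, -1):
        D[c] = (m + (d - c) * D[c + 1]) / Fraction(c)
    G = {2: Fraction(0)}
    for c in range(3, d + 1):
        G[c] = G[c - 1] + D[c]
    return G


def attack_multiplications(m, d):
    """Lower bound on multiplications: G(d/2) steps times m(m-1) per tested descriptor."""
    G = expected_elimination_steps(m, d)
    return G[max(2, d // 2)] * m * (m - 1)


def attack_cost_log2(m, d):
    """Base-2 logarithm of the multiplication count, for comparison with Figure 3."""
    return math.log2(float(attack_multiplications(m, d)))
```

Calling `attack_cost_log2(m, d)` for a chosen m and fraction d/m gives a value that can be compared against the curves summarized in Figure 3.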
In the empirical evaluation that we perform in Section 5, we evaluate our approach at $m = 60$, which is the smallest value of $m$ shown in Figure 3. Here, if 70% of the features are distinguishing, then this attack conservatively requires an expected $2^{44}$ [...]
[Figure 2: two plots, titled "206 MHz StrongArm" and "500 MHz Ultra II". x-axis: corrections (2 to 6); y-axis: time (secs), 0 to 120; one curve each for m = 60, m = 70, and m = 80.]
Figure 2: Key reconstruction performance for $m \in \{60, 70, 80\}$. Depicted are average (of 50 executions) elapsed times for key reconstruction on the IPAQ (left) and a 500 MHz processor which reflects upcoming PDA processor trends (right).
[Figure 3: plot. x-axis: m (60 to 100); y-axis: floor(log2(# mults)), 30 to 110; one curve each for d/m = 0.5, 0.6, 0.7, 0.8, 0.9, and 1.0.]
Figure 3: A lower bound on the expected number of multiplications to exhaustively search for the key, assuming that the $d$ distinguishing features are uniformly distributed; see Section 4.4.
[...] then this attack requires at least $2^{50}$ multiplications on average. Obviously the security of the scheme improves as $m$ and $d/m$ are increased, and this is a goal of our ongoing work.
5 Empirical Results
In this section we empirically evaluate the security of our technique using two different data sets. In [...] characterize the security of our technique against an attacker who captures the device. It is clear from Section 4.4 that the number $d$ of distinguishing features is central to the security of our scheme, in that if $d$ is small, then our scheme is vulnerable to a key recovery attack via exhaustive search (see Figure 3). Therefore, in order to demonstrate that our approach is plausibly secure, it is necessary to demonstrate that a high number of distinguishing features can be achieved using our techniques. In addition, we also attempt to characterize the degree to which additional knowledge aids the attacker's quest for the key, in the form of either knowing the passphrase said by the user or having recordings of the user saying phrases other than her passphrase.
We remind the reader that large $d$ is not sufficient for strong security. For example, even if all features are distinguishing ($d = m$) for all users, but all users' feature descriptors are identical (and the attacker knows this), then an attacker who captures a user's device can trivially determine the key. Therefore, it is equally important that users' feature descriptors vary widely, or more precisely, are drawn from a distribution with high entropy. An entropy evaluation of users' utterances from phone recordings of users saying the same passphrase is described in [16, 17], and these studies suggest that the entropy available in user utterances is substantial even when users say the same passphrase. As already noted, however, since that study involves only recordings of users taken over phone lines, and since that study is limited to $m = 46$ features, it is [...] sets with which we are presently working (see Sections 5.1 and 5.2) include too few users to enable meaningful measurements of the entropy of users' feature descriptors, and so here we report results for distinguishing features only.
In order to calculate the average number of distinguishing features per user, it is of course necessary to define when a feature is distinguishing. Let $\mu_i$ and $\sigma_i$ denote the mean and standard deviation of feature $\phi_i$ over the recent history of successful logins.⁴ Then we say that the $i$-th feature is distinguishing if $|\mu_i - \tau_i| > k\sigma_i$ for some parameter $k > 0$. Note that if feature $i$ is distinguishing, then either $\tau_i > \mu_i + k\sigma_i$ and so usually $b(i) = 0$ for the user (see (1)), or $\tau_i < \mu_i - k\sigma_i$ and so usually $b(i) = 1$ for the user. Intuitively, the parameter $k$ tunes the "sensitivity" of the scheme, in that a small $k$ implies more distinguishing features, and a large $k$ implies fewer. Obviously $k$ must be tuned to balance achieving a high number of distinguishing features with enabling the user to successfully regenerate his key reliably, since a higher number of distinguishing features is advantageous for security but also requires increasingly similar utterances to regenerate the key. The parameter $k$ will play a central role in our evaluation.
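The distinguishing-feature test can be sketched in Python as follows. The sketch assumes the history of a feature is a list of its values over recent successful logins and uses the population standard deviation; the text does not say which estimator is used, so that choice, like the function names, is an assumption.

```python
from statistics import mean, pstdev


def is_distinguishing(history, threshold, k):
    """True if the mean of the history is more than k standard deviations from the threshold."""
    mu = mean(history)
    sigma = pstdev(history)  # population standard deviation (an assumption)
    return abs(mu - threshold) > k * sigma


def usual_bit(history, threshold):
    """The bit the user usually induces: 0 if the mean lies below the threshold, else 1."""
    return 0 if mean(history) < threshold else 1
```

Raising `k` makes the test stricter, so fewer features qualify as distinguishing, which matches the role of the sensitivity parameter described above.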
The features $\phi_i$ that we use in the balance of this paper are described in [16, Section 3.2]. Each is defined by comparing the position of a vector characterizing a segment of the utterance to a fixed plane. This plane is a parameter of our scheme, and though we will rarely mention it below, it is important for the reader to be aware that the data we present is based on a plane selected, based on our data, to optimize our measures in certain ways. On the one hand, this means that our data presents what could be achieved with a good selection of this plane, and is thus optimistic in this regard. On the other hand, since this plane is selected by searching through a small set of candidate planes, (infinitely) many planes are omitted from this search. Consequently, it is likely that planes yielding better measures exist. The experimentation we have conducted thus far does not permit us to conclude how to select this plane in general, and this continues to be an area of our ongoing work.
Footnote 4: The "recent history" is defined to be the last h
successful logins for some parameter h. Records of the last h
successful logins can be stored in a file encrypted with the key
that is reconstructed on a successful login. The parameter h will not
5.1 Evaluation of IPAQ recordings
Our first data set was collected on a device much like those we
envision for use with our scheme, namely a Compaq IPAQ(TM) H3600
series. The IPAQ is a personal digital assistant with a 206 MHz
StrongArm CPU, a touchscreen LCD, 16 MB flash ROM, 16 MB SDRAM, a
Philips audio codec UDA1341T (i.e., an A/D with data compression),
built-in microphones, a compact flash type-II expansion slot, and a
speaker output jack. The audio codec has an integrated analog
front-end including automatic gain control, which adjusts the signal
strength based on the level of the spoken utterance. The sound
driver is full duplex and Open Sound System (OSS) compliant [29],
though we encountered some signal problems when recording samples at
8 kHz. For this reason we recorded our samples at 32 kHz and then
downsampled to 8 kHz offline; see Section 3.
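Offline downsampling from 32 kHz to 8 kHz amounts to low-pass filtering followed by keeping every fourth sample. The sketch below is one common way to do this and is not taken from the paper: the Hamming-windowed sinc filter, the tap count and the function names are all assumptions made for illustration.

```python
import math


def lowpass_taps(num_taps, cutoff):
    """Hamming-windowed sinc low-pass filter.

    cutoff is expressed as a fraction of the input sample rate.
    """
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        if x == 0:
            ideal = 2 * cutoff
        else:
            ideal = math.sin(2 * math.pi * cutoff * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]  # normalize to unity gain at DC


def downsample(samples, factor=4, num_taps=63):
    """Low-pass filter the signal, then keep every `factor`-th sample."""
    taps = lowpass_taps(num_taps, 0.5 / factor)  # cutoff at the new Nyquist rate
    half = num_taps // 2
    out = []
    for center in range(0, len(samples), factor):
        acc = 0.0
        for j, tap in enumerate(taps):
            idx = center + j - half
            if 0 <= idx < len(samples):  # treat samples outside the signal as zero
                acc += tap * samples[idx]
        out.append(acc)
    return out


if __name__ == "__main__":
    one_tenth_second = [math.sin(2 * math.pi * 440 * n / 32000) for n in range(3200)]
    print(len(downsample(one_tenth_second)))  # 800 samples at 8 kHz
```

Filtering before discarding samples matters because any energy above 4 kHz in the 32 kHz recording would otherwise alias into the 8 kHz signal.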
As illustrated in (3) of Section 3, the storage requirements for
merely saving the utterance of a passphrase can be significant. To
overcome the storage limitations of this particular IPAQ in light of
this requirement (and in particular, to permit saving multiple
utterances in our user testing), we used a 1 GB IBM Microdrive(TM)
(in the compact flash expansion slot) as a stable store. However, to
avoid recording noise from the Microdrive on disk seeks, the
recordings are first written to a primary partition in volatile
memory. When the memory capacity is reached, the Microdrive is
automatically mounted, the data flushed to an ext2 file system on
the drive, and then unmounted. In the event that a wireless
connection can be established, the Microdrive can be replaced with a
wireless network card, and the data written to a remote mount point.
The IPAQ was used to record utterances from ten users. All ten users
were recorded saying the same passphrase multiple times, which in
this case was the address of Carnegie Mellon University: "Carnegie
Mellon University, 5000 Forbes Avenue, Pittsburgh, Pennsylvania
15213". The user was approximately one foot away from the IPAQ's
microphone. The user was required to wait for at least one second
between pressing the "record" button of our recording application
and speaking, so as to not interleave the voice signal with the
device's attempts to perform automatic gain control. (Since the
automatic gain control converges within 0.5 seconds on
acoustic environment in which these utterances were recorded was a
standard office environment, and as such, background noise was
significant. Six recordings of each user (the "training" utterances)
were used to determine the distinguishing features for that user.
The remaining recordings for each user (the "testing" utterances, of
which there were six from each user on average) were each used to
generate a feature descriptor. Comparing each feature descriptor to
the same user's distinguishing features, to determine the number of
distinguishing features with which the feature descriptor was
consistent, counted as a "true speaker" trial. Comparing each
feature descriptor to another user's distinguishing features counted
as an "imposter" trial.
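The trial bookkeeping described above can be sketched as follows. The data layout is an assumption made for illustration: each user's distinguishing features are held as a hypothetical mapping from feature index to the bit the user usually produces, and each test utterance is already reduced to a feature descriptor (a sequence of bits).

```python
def matched_features(descriptor, distinguishing):
    """Count the distinguishing features with which a descriptor is consistent.

    descriptor     -- sequence of bits derived from one test utterance
    distinguishing -- {feature index: usual bit} learned from training utterances
    """
    return sum(1 for i, bit in distinguishing.items() if descriptor[i] == bit)


def run_trials(test_descriptors, distinguishing_by_user):
    """Compare every test descriptor against every user's distinguishing features.

    Returns two lists of match counts: true-speaker trials and imposter trials.
    """
    true_speaker, imposter = [], []
    for speaker, descriptors in test_descriptors.items():
        for descriptor in descriptors:
            for target, distinguishing in distinguishing_by_user.items():
                count = matched_features(descriptor, distinguishing)
                if target == speaker:
                    true_speaker.append(count)
                else:
                    imposter.append(count)
    return true_speaker, imposter


if __name__ == "__main__":
    features = {"alice": {0: 1, 2: 0}, "bob": {1: 1, 2: 1}}
    tests = {"alice": [[1, 0, 0]], "bob": [[0, 1, 1]]}
    print(run_trials(tests, features))
```

Averaging each returned list, and taking its standard deviation, gives the kind of per-curve summary that is plotted against k.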
The results of this analysis are shown in the left side of Figure 4.
This graph shows the average number of distinguishing features per
user as a function of k, and the average number of these that the
feature descriptor of a true speaker or an imposter matched
(footnote 5). The error bars on the true speaker and imposter curves
show one standard deviation above and below the average.
There are several points worth noting about these results. First,
the gap between the "distinguishing features" and "true speaker"
points indicates the number $c_{\max}$ of error corrections that
would need to be performed during the key regeneration process to
achieve a reasonably low false reject rate. For example, if
$k = 0.6$, then $c_{\max} \approx 5$ should achieve a reasonable
false reject rate, and correcting 5 errors is feasible on today's
devices (see Section 4.3). Unfortunately, this data suggests that
choosing $k = 0.6$ yields fewer distinguishing features than we
would like for security ($d/m \approx 0.5$ only). A second point
worth noting is that the human imposters, even when saying the same
passphrase as the true user, did not match significantly more of the
true user's distinguishing features than if they had simply guessed
a random feature descriptor (shown by the "random guessing" line in
Figure 4, which is simply half the "distinguishing features" line).
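The "random guessing" baseline can be made explicit with a short calculation. This is a sketch under the assumption that each guessed bit is independent and uniformly random; D, the set of distinguishing features with d = |D|, is notation introduced here for illustration only.

```latex
\mathbb{E}[\text{matches}]
  = \sum_{i \in D} \Pr[\text{guessed bit } i = b(i)]
  = \sum_{i \in D} \frac{1}{2}
  = \frac{d}{2}
```

For instance, with m = 60 features and d/m of about 0.5 at k = 0.6, d is about 30, so a random guess is expected to match about 15 distinguishing features.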
Footnote 5: The points for a given value of k were generated using a
plane chosen to maximize the number of distinguishing features and,
among all such planes, to maximize a weighted average of the number
of features matched by the "true speaker" and missed by the
"imposter". See the last para-

In order to evaluate a different model of impersonation, i.e., one
where the attacker has knowledge of the speaker being impersonated,
we explored a second data set. Our second data set is one collected
within the context of a different research effort, and consequently
we had less control over the availability of phrases said by the
same user multiple times. This data set consists of recordings of a
professional speaker, here called "Speaker-J", taken in a
professional studio (i.e., a room with virtually no background
noise) using a high-quality microphone. The same microphone was used
throughout data collection to ensure a flat and consistent frequency
response. Consequently, this data set is of a much higher quality
than the data set described in Section 5.1. This data consists of
recordings collected for two separate experiments, but both spoken
by the same speaker, Speaker-J. Part I consists of approximately
1600 sentences (roughly 1 hour of speech) and includes the speaker
reading (with consistent voice quality and head-to-microphone
distance) both standard newswire text and a prepared script that
covers rare combinations of speech sounds. Part II consists of 38
minutes of speech consisting of a group of sentences repeated 7
times, all recorded in a single day. The elapsed time between the
two data sets was approximately 1 year.
To evaluate our technique, each of these phrases (in part II) was
used as a passphrase in our scheme: five recordings were used to
generate the speaker's distinguishing features for a given phrase,
and two recordings were used to simulate the speaker attempting to
regenerate his key. Specifics about the chosen passphrases can be
found in Table 1.
The right side of Figure 4 shows the resulting "distinguishing
features" and "true speaker" curves. These curves are analogous to
the curves in the left side of Figure 4 with the same labels: the
first characterizes the number of distinguishing features for this
user based on the training utterances, and the second gives the
average number of features on which a feature descriptor generated
from a test utterance matched the distinguishing features. If we
look at $k = 0.6$, we again see that $c_{\max} \approx 5$ should
approach a reasonable false negative rate (and is plausible by
Section 4.3). Moreover, according to this data set, the
distinguishing features at $k = 0.6$
[Figure 4: two panels plotting number of features (0 to 60) against
k (0.2 to 1). Left panel, "IPAQ": distinguishing features, true
speaker, human imposter, random guessing. Right panel, "Speaker-J":
distinguishing features, true speaker, TTS imposter, cut-and-paste
imposter, random guessing.]
Figure 4: Evaluation of two data sets; see Sections 5.1 and 5.2.
of these recordings may provide a more optimistic picture than would
be realized in practice.
Part I of the Speaker-J data set is very rich, and moreover, the
research effort that generated this data set carefully annotated the
speech, identifying the beginning, ending and midpoint of each
phoneme it contains. Informally, phonemes are the basic units in the
sound system of a language; in the case of English, there are about
50 phonemes. With these annotations, diphones can be extracted from
the recorded speech. A diphone is a portion of speech beginning in
the middle of one phoneme and ending in the middle of the next
phoneme; a diphone thus is an example of how the user's speech
transitions from one phoneme to another. Diphones reveal much about
the voice patterns of the user who uttered them. So, part I of the
Speaker-J data set provides the opportunity to attempt a different
form of attack against our system, namely one that would simulate an
attacker who had a corpus of recordings of the user saying many
things other than the passphrase itself. A question we attempt to
answer is whether these recordings assist the attacker significantly
in finding the user's key.
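Given annotations of this kind, diphone boundaries follow directly from consecutive phoneme midpoints. The sketch below illustrates the idea; the `Phoneme` record, its field names and the label format are hypothetical and do not reflect the annotation format of the Speaker-J data.

```python
from dataclasses import dataclass


@dataclass
class Phoneme:
    label: str
    start: float  # seconds from the beginning of the recording
    mid: float
    end: float


def diphones(phonemes):
    """Yield (name, start_time, end_time) for each adjacent phoneme pair.

    A diphone runs from the midpoint of one phoneme to the midpoint of the next.
    """
    for left, right in zip(phonemes, phonemes[1:]):
        yield (f"{left.label}-{right.label}", left.mid, right.mid)


def cut(samples, rate, start, end):
    """Return the samples between two times (in seconds) at the given sample rate."""
    return samples[round(start * rate):round(end * rate)]


if __name__ == "__main__":
    word = [
        Phoneme("k", 0.00, 0.05, 0.10),
        Phoneme("ae", 0.10, 0.20, 0.30),
        Phoneme("t", 0.30, 0.35, 0.40),
    ]
    for name, start, end in diphones(word):
        print(name, start, end)
```

An inventory built this way maps each diphone name to one or more sample segments, which is the raw material a concatenative attack would draw on.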
Specifically, consider an attack in which the attacker wishes to
test a candidate passphrase (in our case, selected from part II) but
does not know how the user speaks it. The attacker uses a text
analysis module from a text-to-speech (TTS) system [28] to translate
the text of the passphrase into a string of phonemes that realize
the passphrase (i.e., a pronunciation for the text), along with
other important information that is typically used when synthesizing
speech (e.g., the duration and the pitch contour for each phoneme).
Of course, any of these features may not match exactly what the user
says when she speaks the passphrase. For example, a given word can
be pronounced a number of different ways. So, even given the correct
passphrase as input, there is no guarantee the text analysis module
will yield a string of phonemes that matches the way the user speaks
the passphrase. Moreover, the duration and pitch predictions made by
the text analysis might differ significantly from what the real user
sounds like.
Nevertheless, suppose the attacker possesses a corpus of recordings
of the user speaking various phrases other than the passphrase (in
our experiment, part I of the Speaker-J data), annotated to identify
phonemes and diphones. The attacker can then attempt to construct
how the user would say the passphrase, using techniques derived from
a concatenative text-to-speech synthesis system (e.g., [12]), in one
of the following ways:
Cut-and-paste imposter: Concatenate the raw speech samples
(diphones, or longer segments) as-is from the inventory. There are
various forms of this. On the one hand, the attacker may make no
modifications to duration or pitch of the resulting speech. This
yields speech that can sound very much like the true speaker, though
there can be severe discontinuities at the concatenation boundaries.
In addition, such an approach can yield noticeable differences in
the recording levels within
at-discontinuities.
TTS imposter: Use a traditional TTS signal processing back-end to
synthesize the passphrase. Note that this is designed to produce
nice sounding speech, but that it also makes use of the duration and
pitch predictions that are output from the text analysis module. If
these predictions do not correspond to the way the user actually
speaks, this step might impede the attack. For example, the user may
have an idiosyncratic way of saying a particular word, either in her
passphrase or in the instance in the attacker's recordings of the
user.
We experimented with four types of cut-and-paste attacks and two
types of TTS attacks. The results of these tests are shown in the
right side of Figure 4. The curves labeled "TTS imposter" and
"Cut-and-paste imposter" capture the best attacks of each type that
we discovered. As the curves demonstrate, these attacks both
performed similarly, and outperformed random guessing in some cases.
However, it appears that the attacks as we conducted them would fall
short of breaking our scheme.
Though part I of the Speaker-J data set consists of 1600 sentences,
it is not the case that an attacker would need to assemble a corpus
of user recordings of this extent to attack a typical passphrase.
Table 1 approximates the average number of sentences and their
cumulative duration that the attacker would need to record to obtain
the diphones in each of the five passphrases we examined. These
numbers were obtained by randomly selecting sentences from part I of
the Speaker-J data set until the needed diphones were obtained.
passphrase   passphrase   sentences needed
number       phonemes     (cumulative duration)
0            24           340  (805 secs)
1            52           455  (1071 secs)
2            29           1297 (3104.88 secs)
3            27           152  (367.320 secs)
4            18           415  (951.421 secs)
Table 1: Approximate number of sentences attacker would need to
record to obtain diphones necessary to reconstruct each passphrase
tested.
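The sampling procedure behind Table 1 can be sketched as a small simulation. The details here are assumptions, not the authors' code: sentences are drawn without replacement, each corpus entry is a hypothetical (duration, set of diphones) pair, and averages are taken over a fixed number of random trials.

```python
import random


def sentences_needed(corpus, needed, rng):
    """Draw sentences at random until every needed diphone has been seen.

    corpus -- list of (duration_seconds, set_of_diphones), one entry per sentence
    needed -- set of diphones occurring in the target passphrase
    Returns (sentence count, cumulative seconds), or None if the corpus
    never covers all needed diphones.
    """
    remaining = set(needed)
    if not remaining:
        return 0, 0.0
    order = list(range(len(corpus)))
    rng.shuffle(order)  # assumption: sampling without replacement
    seconds = 0.0
    for count, idx in enumerate(order, start=1):
        duration, found = corpus[idx]
        seconds += duration
        remaining -= found
        if not remaining:
            return count, seconds
    return None


def average_needed(corpus, needed, trials=100, seed=0):
    """Average sentence count and duration over repeated random draws."""
    rng = random.Random(seed)
    results = [sentences_needed(corpus, needed, rng) for _ in range(trials)]
    results = [r for r in results if r is not None]
    if not results:
        return None
    n = len(results)
    return sum(c for c, _ in results) / n, sum(s for _, s in results) / n


if __name__ == "__main__":
    toy_corpus = [(2.0, {"a-b"}), (3.0, {"b-c"}), (1.0, {"x-y"})]
    print(average_needed(toy_corpus, {"a-b", "b-c"}))
```

Rare diphones dominate the result in a simulation of this kind: a passphrase containing a single uncommon transition can require many more sentences than a longer passphrase made of common ones.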
As speech synthesis technology improves, the size
decrease. However, TTS and cut-and-paste attacks of the types we
performed require an annotated corpus, and achieving this annotation
is a very manually intensive process that is typically conducted by
speech experts. In the case of the Speaker-J data set, it is
estimated that 200 expert-hours of effort were invested in achieving
the annotated data set. (It takes about one hour to manually segment
one minute of speech.) This is already a significant barrier to an
attacker wishing to utilize these avenues of attack. Though
automatic labellers are available (e.g., [30]), their performance is
poor, and we expect it would substantially increase the error rates
for the attacks outlined herein. We do expect, however, that the
success of such attacks will increase even for our own data sets, as
we explore in more detail ways to improve the effectiveness of these
attacks. In the full version of this paper we will provide a more
detailed analysis of these threats on a per-passphrase basis. We
hope that this analysis will be useful for designing effective
countermeasures.
6 Related Work
The only prior work of which we are aware on the topic of generating
cryptographic keys from voice utterances is that in which the
cryptographic key is merely the text of what is spoken, recognized
using standard techniques in automatic speech recognition.
(Actually, we are unaware of specific systems that even just do
this, but since it is an immediate extension of using a typed
password as a cryptographic key, we treat it as obvious prior work.)
To our knowledge, our work is the only research toward determining a
repeatable cryptographic key that draws entropy from how the user
speaks the passphrase. How this paper contributes over our own prior
publications on this topic was described in Section 2.
That said, there has been work on generating cryptographic keys from
biometrics other than voice. The first such work of which we are
aware is due to Soutar et al. [25, 26, 27], who describe methods for
generating a repeatable cryptographic key from a fingerprint using
optical computing and image processing techniques. These techniques
generate a key from a two-dimensional image (a fingerprint being
well-http://www.bioscrypt.com/).
A different approach to generating a repeatable key based on
biometric data is due to Davida, Frankel, and Matt [4]. In this
scheme, a user carries a portable storage device containing (i)
error correcting parameters to decode readings of the biometric
(e.g., an iris scan) with a limited number of errors to a
"canonical" reading for that user, and (ii) a one-way hash of that
canonical reading for verification purposes. This canonical reading,
once generated, can be used as a cryptographic key, or can be hashed
together with a password (using a different hash function) to obtain
a key. Juels and Wattenberg [9] generalized and improved the Davida
et al. scheme through a novel modification in the use of
error-correcting codes, thereby shrinking the code size and
achieving higher resilience. These techniques are a different
approach for generating cryptographic keys from biometric readings,
and reach a correspondingly different set of tradeoffs. Notably,
whereas our techniques permit a user to reconstruct her key even if
she is inconsistent on a majority of her feature descriptor bits
(not uncommon when using voice as a biometric [6]), these techniques
do not.
More distantly related work is that of Ellison et al. for generating
a cryptographic key based on answers to questions posed to a user
[7]. The work is premised on the assumption that questions can be
posed that the legitimate user will answer one way but others
attempting to impersonate the user will answer another way. Their
construction resembles one instance of our techniques, namely that
of [15, Sections 5.1-5.2], and in this way their scheme achieves a
degree of resilience to forgotten answers. However, Bleichenbacher
and Nguyen [2] have shown that the Ellison et al. scheme is
insecure, whereas our constructions appear to be much stronger.
Another construction similar to that in [15, Sections 5.1-5.2] was
used in the design of a forensic database, where a person's medical
record can be decrypted only once a DNA sample of the person is
obtained (e.g., at a crime scene) [3]. However, this scheme is also
insufficient for our purposes, due to the same inadequacies as the
scheme of [15, Sections 5.1-5.2].
Impersonation attacks using recordings of the user speaking phrases
other than her passphrase, as we explored in Section 5.2, have been
previously studied for the purpose of fooling speaker verification
in these works are somewhat different from our exploration here,
however. Notably, in [14, 23], the authors describe synthesizing a
passphrase using a speaker-independent model, and then adapting the
pitch and duration of the synthesized passphrase based on relatively
few recordings of the user. The authors give evidence that even
these simple attacks can make it difficult to set acceptance
thresholds for a speaker verification system. In future work we hope
to explore how these techniques can be applied in the context of our
work.
7 Conclusion
The viability of (re)generating strong cryptographic keys from voice
utterances remains unproven. While we believe that the work
presented in this paper offers steps toward achieving this goal and
evidence that it can be reached, some critical advances are still
required. Notably, our analyses using m = 46 in [16] and m = 60 here
are insufficient for security as required in commercial
applications, and our future work will continue to focus on reaching
m = 80 or higher. More extensive user trials to evaluate the entropy
of users' distinguishing features are also needed. And, there remain
numerous open questions as to how to tune an implementation of our
approach in practice. The analysis of Section 5 hides many
parameters from view, and the relationship between these parameters,
security and usability needs to be further explored.
Acknowledgements
We are especially grateful to Daniel Bleichenbacher for the design,
analysis, and implementation of the encoding scheme in Section 4.
Additional analysis of the key reconstruction scheme presented
therein will appear in [1]. We also thank Sape Mullender, Peter
Bosch and Qiru Zhou for their assistance in analyzing anomalies in
the kernel sound modules on the various platforms we have
experimented with in this effort. We are thankful to Greg Kochanski
for numerous insightful discussions, and Minkyu Lee, for providing
us with signal processing code that attempts to correct the
discontinuities in the
[1] D. Bleichenbacher. Encoding schemes for generating keys from a
low-entropy noisy source, 2002.
[2] D. Bleichenbacher and P. Nguyen. Noisy polynomial interpolation
and noisy Chinese remaindering. In Advances in Cryptology -
Proceedings of EUROCRYPT '2000 (LNCS 1807), Springer-Verlag, pages
53-69, 2000.
[3] P. Bohannon, M. Jakobsson, and S. Srikwan. Cryptographic
approaches to privacy in DNA databases. In Proceedings of the 2000
International Workshop on Practice and Theory in Public Key
Cryptography, January 2000.
[4] G. I. Davida, Y. Frankel, and B. J. Matt. On enabling secure
applications through off-line biometric identification. In
Proceedings of the 1998 IEEE Symposium on Security and Privacy,
pages 148-157, May 1998.
[5] R. B. Davies. A matrix library in C++. September 1997.
http://www.webnx.com/robert.
[6] G. R. Doddington, W. Liggett, A. F. Martin, M. Przybocki, and
D. A. Reynolds. Sheep, goats, lambs and wolves: A statistical
analysis of speaker performance in the NIST 1998 speaker recognition
evaluation. In Proceedings of the 5th International Conference on
Spoken Language Processing, November 1998.
[7] C. Ellison, C. Hall, R. Milbert, and B. Schneier. Protecting
secret keys with personal entropy. Future Generation Computer
Systems 16:311-318, 2000.
[8] D. Genoud and G. Chollet. Speech pre-processing against
intentional imposture in speaker recognition. In Proceedings of the
1998 International Conference on Spoken Language Processing, pages
105-108, December 1998.
[9] A. Juels and M. Wattenberg. A fuzzy commitment scheme. In
Proceedings of the 6th ACM Conference on Computer and Communication
Security, pages 28-36, November 1999.
[10] Frederick Jelinek. Statistical Methods for Speech Recognition.
MIT Press, 1997.
[11] J. B. Lacy, D. P. Mitchell, and W. M. Schell. CryptoLib:
cryptography in software. In Proceedings of the 4th UNIX Security
Symposium, Oct. 1993.
[12] M. Lee, D. P. Lopresti and J. P. Olive. A text-to-speech
platform for variable length optimal unit searching using perceptual
cost functions. In Proceedings of the 4th ISCA Research Workshop on
Speech Synthesis, August-September 2001.
[13] Q. Li and A. Tsai. A matched filter approach to endpoint
detection for robust speaker verification. In Proceedings of IEEE
Workshop on Automatic Identification Advanced Technologies
(AutoID'99), pages 35-38, October 1999.
[14] T. Masuko, T. Hitotsumatsu, K. Tokuda and T. Kobayashi. On the
security of HMM-based speaker verification systems against imposture
using synthetic speech. In Proceedings of the European Conference on
Speech Communication and Technology, Budapest, Hungary, vol. 3,
pages 1223-1226, September 1999.
[15] F. Monrose, M. K. Reiter, and S. Wetzel. Password hardening
based on keystroke dynamics. International Journal of Information
Security 1(2):69-83, February
graphic key generation from voice (extended abstract). In
Proceedings of the 2001 IEEE Symposium on Security and Privacy, May
2001.
[17] F. Monrose, M. K. Reiter, Q. Li and S. Wetzel. Using voice to
generate cryptographic keys: A position paper. In Proceedings of
Odyssey 2001, The Speaker Verification Workshop, June 2001.
[18] B. L. Pellom and J. H. L. Hansen. An experimental study of
speaker verification sensitivity to computer voice altered
imposters. In Proceedings of the 1999 International Conference on
Acoustics, Speech, and Signal Processing, March 1999.
[19] R. Power. 2001 CSI/FBI computer crime and security survey.
Computer Security Issues & Trends VII(1), 2001.
[20] L. Rabiner and B. H. Juang. Fundamentals of Speech Recognition,
Prentice Hall, 1993.
[21] R. D. Rodman. Computer Speech Technology. Artech House,
Norwood, MA, 1999.
[22] A. I. Rudnicky, S. D. Reed, and E. H. Thayer. SpeechWear: A
mobile speech system. In Proceedings of the 4th International
Conference on Spoken Language Processing, pages 538-541, October
1996.
[23] T. Satoh, T. Masuko, T. Kobayashi, and K. Tokuda. A robust
speaker verification system against imposture using an HMM-based
speech synthesis system. In Proceedings of the European Conference
on Speech Communication and Technology, vol. 2, pages 759-762,
Aalborg, Denmark, September 2001.
[24] G. J. Simmons. An introduction to shared secret and/or shared
control schemes and their application. In Contemporary Cryptology:
The Science of Information Integrity, IEEE Press, 1992.
[25] C. Soutar and G. J. Tomko. Secure private key generation using
a fingerprint. In Cardtech/Securetech Conference Proceedings,
vol. 1, pages 245-252, May 1996.
[26] C. Soutar, D. Roberge, A. Stoianov, R. Gilroy, and B. V. K.
Vijaya Kumar. Biometric encryption(TM) using image processing. In
Optical Security and Counterfeit Deterrence Techniques II (Proc.
SPIE 3314), pages 178-188, 1998.
[27] C. Soutar, D. Roberge, A. Stoianov, R. Gilroy, and B. V. K.
Vijaya Kumar. Biometric encryption(TM) - Enrollment and verification
procedures. In Optical Pattern Recognition IX (Proc. SPIE 3386),
pages 24-35, 1998.
[28] R. W. Sproat. Multilingual text-to-speech synthesis: The Bell
Labs approach. Kluwer Academic Publishers, 1998.
[29] J. Tranter. Linux Multimedia Guide. O'Reilly and Associates,
Inc. September, 1996.
[30] C. W. Wightman and D. T. Talkin. The aligner: Text-to-speech
alignment using Markov models. In Progress in Speech Synthesis,
Chapter 25, pages 313-