Rochester Institute of Technology
RIT Scholar Works
Theses
Thesis/Dissertation Collections
1995
Speech discrimination using wavelets and zero crossing counting
Craig Martin
Speech Discrimination using Wavelets and Zero Crossing Counting
by
Craig W. Martin
M.S. Rochester Institute of Technology

A Thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in the School of Computer Science in the College of Applied Science and Technology of the Rochester Institute of Technology.

College of Applied Science and Technology
Rochester Institute of Technology
Rochester, New York
CERTIFICATE OF APPROVAL
M. S. DEGREE THESIS

The M. S. Thesis of Craig W. Martin has been examined and approved by the thesis committee as satisfactory for the thesis requirement for the Master of Science Degree.

Dr. P. G. Anderson, Thesis Advisor
Dr. J. A. Biles
THESIS RELEASE PERMISSION FORM
ROCHESTER INSTITUTE OF TECHNOLOGY
COLLEGE OF APPLIED SCIENCE AND TECHNOLOGY

Title of Thesis: Speech Discrimination using Wavelets and Zero Crossing Counting

I, Craig Martin, prefer to be contacted each time a request for reproduction is made. I can be reached at the following address:

Eastman Kodak Company
Research Laboratories, B-59
Rochester, New York, 14560
SPEECH DISCRIMINATION USING WAVELETS AND ZERO CROSSING COUNTING
by
Craig W. Martin

Submitted to the Computer Science Department in partial fulfillment of the requirements for the Master of Science degree at the Rochester Institute of Technology
ABSTRACT
Traditional speech discrimination techniques have focused primarily on the frequency domain analysis of speech signals. In this paper the author presents a new technique using wavelet transforms and zero crossing counting to facilitate a time domain speech discrimination
ACKNOWLEDGMENTS
This work would not have been possible without the significant contributions made by a number of kind individuals, whom I gratefully acknowledge:

Dr. P. G. Anderson, for his patience, time and helpful suggestions in serving as my thesis advisor.
Dr. J. A. Biles, for his time and suggestions in reviewing this thesis.
Dr. M. Raghuveer, for his time and suggestions in reviewing this thesis.
Dr. B. W. Keelan, for his assistance in applying the SAS discrimination capabilities to speech recognition.
and the members of Eastman Kodak department 791 for helping me to

DEDICATION
This thesis is dedicated to my lovely wife, Nancy, and my two children,

CONTENTS
Introduction
Defining and Populating the Vocabulary Database
The Macintosh Sound System
Speech Signal Preparation
Wavelet Theory and Practice
Exemplar Extraction
Petersen/Barney Vowel Discrimination
Summary
Appendices
INTRODUCTION
Many techniques are currently used in pursuit of the perfect speech recognition system. These techniques encompass various levels of recognition, signal processing, and comprehension. Most of these systems begin with digitally sampled speech followed by some sort of signal processing. The signal processing may encompass spectral analysis, LPC analysis, Fourier and Gabor transforms, and various other techniques [1-7]. In addition, other systems attempt to understand speech by applying artificial intelligence techniques to the speech signal.

Most of the more traditional speech recognition techniques have their foundations rooted in the frequency domain analysis of speech signals. Overviews of this type of system have been widely published [1] and generally follow these basic steps:

(1) Find the beginning and end of the utterance.
(2) Filter the raw signal into frequency bands (FFT, Gabor, etc.).
(3) Cut the utterance into a fixed number of segments.
(4) Average the data for each band in each segment.
(5) Store this pattern with its name.
(6) Collect a training set of about 3 repetitions for each pattern.
(7) Recognize the unknown by comparing its pattern against all patterns in the training set and returning the name of the pattern closest to the unknown.

Much success has been had with different variations of this simple

The success of the frequency domain approach to speech recognition is in no small part due to the fact that this type of analysis eliminates the need for phase detection algorithms. These algorithms, which are generally quite complex [8], are required when comparing speech or other signals in the time domain. Phase detection is the best way to obtain a common reference point for the comparison. The phase detection issue has in part been responsible for the lack of speech recognition work using time domain techniques. It is not surprising then that the wavelet transform, which generally produces coefficients with heavy dependence on temporal positioning (at higher frequencies), has been used sparingly in speech recognition work.

In spite of this, recent research [9,13] has shown the following advantages in using wavelets to solve the speech recognition problem:

(1) Contained in the wavelet transform is information pertaining to both frequency content (harmonic) and the regions of energy concentration that constitute the frequency spectrum of a speech signal (formant).
(2) The analysis process used by the wavelet transform is very similar to that performed by the human ear.
(3) Due to the time localization properties of the wavelet transform, it is quite valuable for locating discontinuities in speech signals. This is particularly helpful for locating the onset and offset of vowels.
(4) The wavelet transform has the property of compressing most of the target signal properties into a few of the largest coefficients. This minimizes the amount of data associated with the speech signal exemplars.

This thesis is based on the hope that the aforementioned wavelet properties can be exploited to produce a simple vowel discrimination system based on information extracted from the wavelet transforms. As an additional goal, the system should not be so computationally complex as to require supercomputer capabilities to run the algorithm. The feature extraction algorithm should be able to run, with minimal delay, on standard PC/Mac systems.

Tasks not pursued in this thesis are complete word/sentence recognition, vowel discrimination outside the nine Petersen/Barney vowels, or foreign pronunciations or accent modifications of the nine Petersen/Barney vowels. These projects would be logical extensions

Defining and Populating the Vocabulary Database
Before any work could begin on this project it was incumbent on us to define and populate a database of speech on which the wavelet based recognition system would operate. Early on in this project the decision was made to use the Petersen/Barney group of vowels to populate the database. These are a group of nine internationally recognized vowels which, for the most part, do not overlap in frequency space. This decision was based on two factors:

(1) The wavelet transform has been shown to be especially well suited to detecting the onset and offset of vowels [9,13].
(2) Because this project was entered a priori, this choice limited the scope of the database in terms of the expectations placed on the speech recognition system.

Once the scope of the database was defined, and before any real work could commence, a plan for populating the speech database with real speech samples was devised and implemented.

The first step in populating the database was to find a willing group of speakers who would provide a wide number of variations of each of the Petersen/Barney vowels. Fortunately, I was able to solicit volunteers for this purpose from a number of people working with me at Eastman Kodak. This group was made up of approximately twelve people who formed the foundation of the database from the beginning of the project. I am deeply grateful for their help.

The next step was to devise a system which would aid in both correct vowel pronunciation by the participants and efficient software access to the vowels contained in the database. After careful examination of the problem, a series of nine words was chosen, each beginning with one of the vowels in the Petersen/Barney group. This vocabulary is shown in the table below:

Figure 0
Sound #   Pronunciation Word   Petersen/Barney Vowel
1         eat                  "e
2         it                   l
3         any                  e
4         at                   a
5         up                   u
6         Amish                a
7         oomph                OO
8         ooze                 OO
9         auto                 A 0

In having each participant say each of these nine words, we gain the advantage of aiding in proper pronunciation, and allowing the vowels to be easily parsed from the front of each word by a simple software algorithm (the parsing algorithm is discussed later in this paper). Thus the thesis vocabulary database consists of pronunciations of each

The final step was to find a dependable and consistent vehicle for recording and storing the sounds for the vocabulary database. Towards this end a Macintosh Quadra 650 computer was used. All database sounds were recorded on a single Quadra system using the same microphone (Sun SPARC 10 Microphone) each time. The Quadra 650 system was extremely easy to use. Some of the features which were helpful in the database effort were:

(1) The Macintosh presents a user friendly interface via the Sound Control Panel which is part of the Mac operating system. The Sound Control Panel contains simple buttons for controlling all aspects of recording and playback. This simplicity allowed sounds to be recorded and verified in very little time.
(2) All aspects of the resulting sound files are well documented by Apple Computer [11,12] (see the next section). This made writing software to operate on the sound data straightforward.
(3) The resulting sound files are compatible with any other Macintosh computer, allowing the feature extraction software to be debugged and tested on other Macintosh systems. This includes systems with no sound recording capabilities.

The Macintosh Sound System
The purpose of this section is not to provide an exhaustive explanation of the Macintosh sound system. Apple Computer [11,12] has already addressed this issue. Rather, this section is intended to highlight the areas that have some direct influence on this thesis work. The main question that needs to be resolved is whether or not speech recorded on the Macintosh meets the minimum requirements for a speech discrimination system.

The first area that needs to be examined is that of the sampling rate used in the speech recording process. In general, it is agreed that a minimum sampling rate of 8 kHz is required [1]. This allows speech with harmonics up to about 4 kHz to be accurately represented. This assumes that other problems, such as aliasing, are not a major factor. As the sampling rate is increased, the upper frequency limit also increases. In general, the maximum frequency response is approximately equal to one half the sampling rate. The sampling rate used by the Macintosh computers in this thesis work is 22 kHz. Thus, we have exceeded the minimum frequency response recommendation

The second important issue is to determine the number of bits required to represent each sampled point in the sound. Current research shows that a minimum of 8 bits per sample is generally recommended for intelligible speech [1]. By increasing the number of bits per sample, an improvement in the signal to noise ratio can be realized. The Macintosh systems used in this work employ an 8 bit representation where the data is stored as numbers ranging from 0 to 255. The value zero represents the largest negative amplitude and 255 represents the largest positive amplitude. A value of 127 is interpreted as an amplitude of 0. In this format, we have met the minimum required number of bits per sample. Most of the newer Macintosh computers now use a 16 bit representation which could yield improved results in future speech work.

The third and final issue is that of the frequency response of the microphone. The microphone shipped by Sun Microsystems with the Sun SPARCstation 10 has a frequency response of 50 Hz to 8 kHz [14]. This would suggest that the microphone is not a limiting factor since we determined previously that an upper frequency response limit of 4 kHz was sufficient.

Speech Signal Preparation
In most speech recognition applications some form of pre-processing is done prior to the actual speech discrimination process. This signal preparation stage usually includes signal detection (lead and trailing edge), amplitude normalization, and phase normalization. This project is no different. In this case both lead/trail signal detection and amplitude normalization techniques were employed prior to doing any wavelet based feature extraction work. In addition, a simple algorithm was used to provide a crude estimation of the pitch. Phase normalization, which is required because of the time domain nature of wavelets, was not done in the preparation stage. This is due to the fact that a form of phase detection is built into the wavelet stage. This phase detection technique is one of the unique features of this work and will be discussed later in this paper.

As part of this thesis, a Macintosh program was written which handles the job of extracting exemplars from each sound and storing them in a file. In the first portion of this task, the program had to detect the lead edge of each vowel from the group of nine words discussed

Signal detection algorithms are well documented [1,8] and can be quite complex. In the case of this work, I was able to use the fact that each of the nine words begins with the vowel in question, and exploit this to simplify the algorithm. Thus, in keeping with the originally stated goals of minimizing computational complexity, a simple vowel detection algorithm was developed by empirical means and verified. The main function of the algorithm was to throw away any noise spikes (such as lip smacks or breaths) occurring at the start of the vowel, and then begin buffering the data until a 1024 byte buffer was full. The algorithm operates directly on the 8 bit data discussed in a previous section and was implemented as follows:

Step 1: Begin reading sound data until the difference between the data and zero amplitude (127 in this case) is greater than 3.
Step 2: Read a data point and place it in the 1024 byte buffer.
Step 3: Determine if 10 consecutive points have been read with an absolute amplitude less than 4. If the answer is yes, go to step 1 and restart from the current location.
Step 4: Determine if 1024 bytes have been read and buffered. If the answer is no, go to step 2.
Step 5: Exit the signal detection algorithm.

Two important assumptions were made here in order to simplify the algorithm. Firstly, the occurrence of ten values very close to zero indicates the end of a noise spike. Secondly, 1024 bytes of data are

The empirical process by which these two assumptions were verified was really quite straightforward. A menu item called "Play Sound Segment" was added to the exemplar extraction program written for the Macintosh. Using this feature in conjunction with adjustments to the number of "noise zeros" and the buffer size, the vowel extractor was optimized. The optimization process was performed on all nine vowels for 12 different speakers. By making adjustments and then playing the extracted vowel, the values of 10 "noise zeros" and the 1024 byte buffer size were derived.
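
As an illustration only, the five detection steps listed earlier in this section might be sketched in Python as shown here. This is not the original Macintosh program: the function name `extract_vowel` and its parameter names are assumptions introduced for this sketch, as are two details the steps leave open, namely that the sample which triggers Step 1 is itself buffered and that a rejected noise spike empties the buffer before the search restarts.

```python
ZERO_LEVEL = 127  # raw 8 bit value that represents zero amplitude


def extract_vowel(samples, onset_threshold=3, quiet_level=4,
                  quiet_run=10, buffer_size=1024):
    """Return the first buffer_size raw samples of the vowel, or None.

    Hypothetical sketch of Steps 1-5; names and defaults are assumptions
    chosen to match the values quoted in the text.
    """
    buffer = []
    quiet_count = 0
    searching = True  # True while waiting for an onset (Step 1)
    for sample in samples:
        amplitude = abs(sample - ZERO_LEVEL)
        if searching:
            # Step 1: wait until the signal departs from zero by more than 3.
            if amplitude > onset_threshold:
                searching = False
                buffer = [sample]  # assumption: the onset sample is kept
                quiet_count = 0
        else:
            # Step 2: buffer the data point.
            buffer.append(sample)
            # Step 3: ten consecutive near-zero points mark a noise spike.
            quiet_count = quiet_count + 1 if amplitude < quiet_level else 0
            if quiet_count >= quiet_run:
                searching = True
                buffer = []  # assumption: the spike is discarded entirely
        # Steps 4 and 5: stop once the buffer is full.
        if not searching and len(buffer) >= buffer_size:
            return buffer
    return None  # the input ended before a full buffer was collected
```

Returning `None` for a recording that never fills the buffer is a design choice of this sketch; the text does not say how the original program handled that case.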
Pitch is very helpful in speech discrimination in that it can provide an additional piece of information which may aid in the comparison of multiple speech signals. Pitch algorithms, like speech signal detection, can be very complex [1]. But there are short cuts that can be used to provide "rough" estimates of pitch at a low cost in CPU cycles. One such short cut is to simply count the number of zero crossings in the utterance. Toward this end a trivial subroutine was written to count the number of zero crossings in the parsed vowel. This was done by incrementing a counter if one of the following conditions was met:

(1) A point that is less than or equal to zero is followed by a point greater than zero.
(2) A point that is greater than or equal to zero is followed by a point less than zero.

The importance of the pitch estimate will become clear later in this
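
A minimal sketch of the two counting conditions above, written in Python rather than in the language of the original subroutine, is given here for illustration. The function name `count_zero_crossings` is an assumption, as is the choice to subtract the raw zero level of 127 before testing the signs.

```python
def count_zero_crossings(samples, zero_level=127):
    """Count zero crossings in raw 8 bit samples (hypothetical helper)."""
    # Assumption: "zero" in the two conditions means zero amplitude,
    # so the raw data are first centered on the 127 level.
    centered = [s - zero_level for s in samples]
    count = 0
    for previous, current in zip(centered, centered[1:]):
        rising = previous <= 0 and current > 0    # condition (1)
        falling = previous >= 0 and current < 0   # condition (2)
        if rising or falling:
            count += 1
    return count
```

Under these conditions a signal that only touches zero and returns to the same side still adds one count, which is consistent with the "rough" nature of the estimate described in the text.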
The last step before the wavelet stage was to perform amplitude normalization on the parsed vowel. These algorithms are fairly basic and also readily available [1,8]. In short, normalization is nothing more than multiplying every Fourier coefficient (frequency domain) or every sampled data point (time domain) by a pre-determined constant. In this case, all time domain data points were divided by the largest data point in the 1024 byte vowel buffer. This has the consequence of producing a signal with a maximum amplitude of one and a minimum amplitude of zero.
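
The division described above can be sketched in a few lines of Python. This is an illustrative reading, not the original subroutine: the name `normalize` is an assumption, and so is the handling of an all-zero buffer, which the text does not discuss.

```python
def normalize(buffer):
    """Divide every raw sample by the largest sample in the buffer."""
    peak = max(buffer)
    if peak == 0:
        # Assumption: an all-zero buffer is returned unchanged as floats.
        return [0.0 for _ in buffer]
    return [sample / peak for sample in buffer]
```

Because the raw samples lie between 0 and 255, the result lies between 0.0 and 1.0, with the largest sample mapped to exactly 1.0.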
There are several advantages resulting from this action:

(1) Since the coefficients resulting from wavelet transforms are somewhat dependent on input signal amplitude, it is desirable that all input data cover the same amplitude range. This forces the resulting coefficients to be as close as possible when trying to compare similar sounds.
(2) For the purpose of memory conservation and execution speed, four-byte floating point variables were used in the wavelet algorithm. Normalizing the input data to one prevented floating point overflow during these calculations.
(3) Dividing all data points by a constant has the possible effect of suppressing any noise component remaining in the signal. Multiplying by a constant greater than one may have resulted in exaggerating the noise and making speech discrimination more difficult.
(4) Normalization is a simple procedure in keeping with the original goal of avoiding mathematical procedures which are overly complex.

A simple subroutine was written to perform the above normalization algorithm. After parsing and normalization, the vowel is ready for the wavelet processing stage. Before getting into the specific details of how this was applied to this project, some of the theory behind the

Wavelet Theory and Practice
In its basic form the wavelet transform is defined as:

$$W(\tau, a) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} h\left(\frac{t - \tau}{a}\right) x(t)\, dt$$   [8,10,13]

where the integration runs from minus infinity to plus infinity. The wavelet, $h((t - \tau)/a)$, defines a group of functions with variable temporal and frequency localization dependent on the scaling factor $a$. Each of these functions is actually a bandpass filter which is sensitive to events occurring around time $\tau$ and at a scale (frequency range) of $a$. The time and scale parameters can be varied to describe a family of wavelet functions described as a class. The shape of every member of the class is identical to every other member except for the scale change. These members are all extracted from one mother function by using the scale factor to expand or compress the end points.

The wavelet transform can be described as a constant Q analysis method. The ratio of bandwidth to center frequency remains constant, which means that it enhances frequency resolution at lower frequencies and time resolution at higher frequencies. Thus, the inherent beauty of the wavelet is its ability to focus its high frequency analysis at a specific point in time, much like the analysis performed by the human ear. As mentioned in the introduction, this characteristic makes the wavelet well adapted to detecting the onset and offset of vowels and other speech discontinuities.

Even though a multitude of wavelet functions (covering the entire frequency spectrum) can be derived from a suitable mother wavelet, this is of limited value in the real world. The speech discrimination algorithm, whether it be linear or non-linear in nature, can only process exemplars which have a limited number of data points. Thus, we would like to adjust the values of $\tau$ and $a$ in order to limit the continuous domain of wavelets to a sparse set of functions. One such possibility is to choose a specialized set of values for $\tau$ and $a$ which approximate the Q of the human ear. This approach, which has been investigated by P. Basile, F. Cutugno, P. Maturi, and A. Piccialli [13], was beyond the limits of computational complexity (CPU cycles) imposed on this project, but showed definite promise. In addition, an investigation into adaptive wavelets by Harold H. Szu, Brian Telfer, Shubha Kadambe, and Pramila Srinivasan [16,17] was shown to produce very low speech classification errors. In this effort, Daubechies wavelet parameters were iterated in conjunction with a neural network to produce a minimum energy function in each band. This iteration was performed on a single pitch period for a given vowel. The adaptive wavelet approximation was then treated as a super wavelet which can be dilated and contracted to handle pitch and speaking rate changes. Once again, this was beyond the CPU cycle

In
the case of this thesis work, an existing wavelet algorithm was sought, which would provide a good tradeoff between CPU overhead and the ability to extract meaningful exemplars from the Petersen/Barney vowels. Through the course of this thesis, the Daubechies class of wavelet filters [9,10,18] was found (by empirical means) to provide an acceptable tradeoff between CPU execution time, exemplar size presented to the discrimination algorithm, and the resulting discrimination rate. This judgment was made based on target exemplar extraction times in seconds (instead of minutes) with expected minimum discrimination rates of approximately 75%. These tests were run on an 8 MHz Macintosh SE with the assumption that performance on today's improved desktop systems would be orders of magnitude better.

The members of the Daubechies class vary from being highly localized in frequency scale to highly smooth depending on the size of the wavelet filter convolution matrix. The simplest and most compact member of this class is DAUB4, which uses only 4 of these matrix values. In spite of the highly localized nature of this member, it was used in this thesis as the main vehicle for exemplar extraction. The advantage of this decision was found in the straightforwardness of the algorithm and the knowledge that success would bode well for results using more complex wavelets.

Numerical Recipes in C [9,10] presents algorithms in C for the discrete Daubechies wavelet transforms. These are given for convolution matrix sizes of 4, 12, and 20 values. Again, these impact the accuracy of the filtering. This C code was used with minor modification for this project.
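
For readers without access to that C code, a discrete DAUB4 transform of the general kind described above might be sketched in Python as follows. This is an independent illustration, not the Numerical Recipes routine and not the modified version used for this project. It uses the standard four DAUB4 filter values, wraps around at the end of the vector, and repeatedly filters the smooth half down to two values; the names `daub4_step`, `daub4_transform`, and `H0` to `H3` are assumptions introduced here.

```python
import math

# Standard DAUB4 filter values.
_SQRT3 = math.sqrt(3.0)
_NORM = 4.0 * math.sqrt(2.0)
H0 = (1.0 + _SQRT3) / _NORM
H1 = (3.0 + _SQRT3) / _NORM
H2 = (3.0 - _SQRT3) / _NORM
H3 = (1.0 - _SQRT3) / _NORM


def daub4_step(a, n):
    """Filter the first n entries of list a in place (one pyramid stage)."""
    half = n // 2
    work = [0.0] * n
    for i in range(half):
        j = 2 * i
        x0, x1 = a[j], a[j + 1]
        x2, x3 = a[(j + 2) % n], a[(j + 3) % n]  # wrap around at the end
        work[i] = H0 * x0 + H1 * x1 + H2 * x2 + H3 * x3          # smooth
        work[i + half] = H3 * x0 - H2 * x1 + H1 * x2 - H0 * x3   # detail
    a[:n] = work


def daub4_transform(samples):
    """Return the DAUB4 coefficients of a power-of-two length vector."""
    n = len(samples)
    if n < 4 or n & (n - 1):
        raise ValueError("length must be a power of two and at least 4")
    a = [float(s) for s in samples]
    size = n
    while size >= 4:
        daub4_step(a, size)  # detail coefficients land in the upper half
        size //= 2           # the smooth half is filtered again
    return a
```

With this layout the two smoothest values end up in locations 0 and 1, followed by groups of 2, 4, 8, and so on, so a 512 point input yields nine groups ending at location 511.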
The transform operates on vectors which are a power of two in length (128, 256, 512, etc.) and in this case contain the discrete sound samples after signal processing (parsing, normalization, etc.). After the wavelet transform, the vector is returned as a series of coefficients, with emphasis on temporal placement of these coefficients increasing with frequency. The resulting coefficients are grouped by powers of two starting with the lowest frequencies. Thus, after transforming a 512 byte input vector, the first two coefficients (vector locations 0 and 1) represent analysis of the lowest frequencies, followed by the next 2, the next 4, the next 8, the next 16 with the

It should be obvious from this progression that there is increasing resolution, with respect to time, in the placement of these output coefficients as the target frequency band increases. This is because the bandpass filter represented by each successive member of the wavelet class has a wider scope than its predecessor while retaining the shape specified by $\tau$ and $a$ (shape specific to the Daubechies class). Furthermore, the speech signal can be quite accurately reconstructed after zeroing out all but the few largest coefficients in each band, and performing an inverse wavelet transform. The location of these coefficients in each band must also be retained. It is this ability to "compress" most of the relevant information into a few of the resulting coefficients that makes the wavelet transform valuable in many of today's image compression schemes. This capability for data compression has also been exploited here, to minimize the size of the exemplars extracted from the Petersen/Barney vowel set. More detail on this process will be provided in the next section.

Exemplar Extraction
The
goal in the exemplar extraction process is to pull out a few key pieces of information that will demonstrate the unique features of the item in question. If possible it is best to keep the quantity of data in each exemplar to a minimum in order to lessen the burden on the discrimination algorithm. In the case of the Petersen/Barney vowels used in this work, the exemplars will consist of the number of zero crossings in a 512 byte section of the vowel, plus 3 coefficients derived from the DAUB4 wavelet transform of the same 512 byte section. The method used to capture the wavelet coefficients is the subject of this section.

Many signals which we may wish to analyze (including speech) are not defined only by frequency content, but also by the positions of one event with respect to another in time. This is common in the power generation industry where three phase generators produce output with a set spatial relationship. Many such examples can also be cited in the medical profession, with one of the best being the output of heart monitoring devices. Wavelets, with their ability to focus on patterns in different frequency bands, are well suited for

Since most of the relevant information is contained in the largest wavelet coefficients, speech exemplars can be extracted by using wavelets to record the relative positions of the largest wavelet coefficients in each frequency band. This approach must overcome two sources of variability to be successful. Firstly, phase differences in the vowels must be considered when performing the wavelet transforms. As mentioned earlier, the issue of phase correction has spawned some very complicated algorithms. These are required so that transformations performed on different vowels have a common reference point. Phase correction was not performed in the signal preparation stage because during the course of this work it became evident that it could be performed as part of the wavelet stage. This method capitalizes on the positional sensing capabilities of the wavelet in order to perform phase synchronization.

In this implementation a series of ten 512 byte wavelet transforms was performed on the 1024 byte vowel buffer in an attempt to locate a reference point. This was done by using a moving window transform whose starting point incremented by 16 bytes each time. Figure 1 shows the positions (relative to time) of all wavelet coefficients in each frequency band after the transform is performed.

Figure 1
Band 1   Band 2   Band 3   Band 4   Band 5   Band 6   Band 7   Band 8    Band 9
0-1      2-3      4-7      8-15     16-31    32-63    64-127   128-255   256-511
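
The band layout of Figure 1 and the ten window positions can be generated with a short Python sketch, offered here as an illustration under stated assumptions. The helper names `band_ranges`, `max_positions`, and `window_starts` are hypothetical, and "maximum coefficient" is taken to mean the coefficient of largest absolute value, which the text does not state explicitly.

```python
def band_ranges(n=512):
    """Return inclusive (first, last) locations for each frequency band."""
    ranges = [(0, 1)]  # band 1 holds the two smoothest coefficients
    start = 2
    while start < n:
        ranges.append((start, 2 * start - 1))  # each band doubles in width
        start *= 2
    return ranges


def max_positions(coeffs):
    """Return the location of the largest coefficient in every band."""
    positions = []
    for first, last in band_ranges(len(coeffs)):
        # Assumption: "largest" is judged by absolute value; on a tie the
        # lowest location wins.
        positions.append(max(range(first, last + 1),
                             key=lambda k: abs(coeffs[k])))
    return positions


def window_starts(count=10, step=16):
    """Return the starting offsets of the moving window in the buffer."""
    return [i * step for i in range(count)]
```

For a 512 point transform `band_ranges` reproduces the nine ranges of Figure 1, and `window_starts` gives offsets 0 through 144, so the last 512 byte window ends well inside the 1024 byte vowel buffer.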
In essence the moving window series yielded an array of 10 buffers of the type shown in Figure 1. These transformed buffers were then scanned to find the one having its maximum coefficient located at position 100 (or as close to 100 as possible) in frequency band 7. This provided a common reference point between all vowels. The choice of band 7, while somewhat arbitrary, was due to its placement in the mid-frequency range and its acceptable temporal resolution. This would increase the chances of having frequency activity in this band for all vowels, thus providing a stable reference point.

Once one of the ten transformed buffers was selected, the exemplars for the vowel were chosen as the positions of the maximum coefficients in bands 8 and 9 along with the sum of all the positions of maximum coefficients in all 9 bands. These three exemplars will hereafter be referred to as C1, C2, and C3 respectively. An example of actual data used in this process is shown in Figure 2. Again, these are the temporal locations of the maximum coefficients in each frequency band for the selected transform buffer.

Figure 2
Band 1   Band 2   Band 3   Band 4   Band 5   Band 6   Band 7   Band 8   Band 9
0        2        4        10       28       55       101      170      343
In this case we have C1 = 170 (band 8), C2 = 343 (band 9), and C3 = 0 + 2 + 4 + 10 + 28 + 55 + 101 + 170 + 343 = 713.
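
The selection rule and the three exemplar definitions given above can be checked against the Figure 2 positions with a small Python sketch. It is an illustration only: the function names `select_reference` and `exemplars` are assumptions, and so is the tie-breaking rule (the earliest window wins), which the text does not specify.

```python
def select_reference(windows, band=7, target=100):
    """Return the index of the window whose band 7 maximum is nearest 100.

    Each window is a list of nine positions, one per frequency band.
    Assumption: on a tie the earliest window is chosen.
    """
    return min(range(len(windows)),
               key=lambda i: abs(windows[i][band - 1] - target))


def exemplars(positions):
    """Return (C1, C2, C3) for one window of nine maximum positions."""
    c1 = positions[7]      # position of the maximum in band 8
    c2 = positions[8]      # position of the maximum in band 9
    c3 = sum(positions)    # sum of the positions in all nine bands
    return c1, c2, c3


figure2 = [0, 2, 4, 10, 28, 55, 101, 170, 343]
print(exemplars(figure2))  # prints (170, 343, 713)
```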
The resulting exemplars C1, C2, and C3 are a snapshot in time of the behavior of the vowel relative to the mid-frequency band reference. Below, in Figure 3, a listing of the pseudo code has been provided to demonstrate more clearly how the exemplar extraction algorithm was implemented. Steps 1 through 6 were performed as part of the Macintosh exemplar extractor immediately after its initial signal preparation stage. Steps 7-13 are discussed in the next section.

Figure 3
(1) Place the preprocessed vowel in a 1024 byte buffer.
(2) Set the vowel buffer pointer to the beginning of the buffer.
(3) Perform a wavelet transform on the 512 byte section starting at the current pointer.
(4) Increment the buffer pointer by 16 bytes.
(5) If fewer than 10 transforms have been performed, go to step 3.
(6) Save the locations of all maximum coefficients for all ten of the wavelet transforms in a file along with the ten zero crossing counts.
(7) Locate frequency band 7 in each of the ten 512 byte transforms.
(8) Choose the transform whose maximum coefficient is placed closest to 100 in frequency band 7 (see Figure 2).
(9) Save the placement of all maximum coefficients in the other frequency bands associated with this particular transformation.
(10) Set C1 = the position of maximum coefficient in band 8.
(11) Set C2 = the position of maximum coefficient in band 9.
(12) Set C3 = the sum of the positions of maximum coefficients in all nine frequency bands.
(13) Set F0 = the zero crossing count which is the pitch estimate.

In addition to the phase correction issue, we also need to remember that the positions of maximum coefficients relative to reference band 7 will be dependent on the characteristic pitch of each individual's voice. Thus we should not expect to get similar values of C1, C2, and C3 for the same vowel unless both speakers have a similar pitch profile. This is because the entire speech signal is in essence modulated by this fundamental frequency, causing shifting in the positions of coefficients among the different frequency bands. Thus, as noted in step 6 of Figure 3, each of the ten transforms is also saved with its respective pitch estimate (zero crossing count). We will see that this value, hereafter referred to as F0, will be critical to the success of the discrimination algorithm.

Figure 4
shows an actual block of ten sets of nine maximum wavelet coefficient positions along with the ten associated zero crossing counts (pitch estimate). The fifth set, in this case, would be selected for exemplar extraction because of the value of 101 in band 7. This set is also shown, minus the zero crossing value, in Figure 2. Figure 4, which is shown in the file format written by the Macintosh feature extractor (Figure 3, Step 6), contains all the information required to determine C1, C2, C3, and F0. We will see in the next section that this four part exemplar will be sufficient to distinguish the nine vowels one from

Figure 4
Band 1   Band 2   Band 3   Band 4   Band 5   Band 6   Band 7   Band 8   Band 9   Pitch
0        2        4        12       22       57       117      202      407      56
1        2        6        12       30       61       113      194      391      61
0        2        4        11       30       59       109      186      375      64
1        2        5        11       29       57       105      178      359      65
0        2        4        10       28       55       101      170      343      59
1        3        5        10       27       62       97       162      511      62
0        2        7        14       26       51       123      154      311      59
1        2        5        13       25       49       119      146      295      58
0        3        7        13       24       62       115      138      279      59
0        2        4        12       23       62       111      130      263      60

Petersen/Barney Vowel Discrimination
As mentioned in the previous section, the Macintosh exemplar extractor generates an output file containing data of the form shown in Figure 4. Data of this type is stored for each spoken instance of each vowel in the Petersen/Barney set. The Macintosh program does not, however, perform steps 7 through 13 shown in the pseudo code listing of Figure 3. These steps were performed manually due to the fact that: (1) the discrimination system was only available on a Sun SPARC 10, thus requiring the data to be transferred via disk; and (2) the act of scanning all ten transformations and selecting the one with a maximum near position 100 (band 7) is a simple operation having no implications on the viability of this work. Thus the reader should assume that all sets of C1, C2, C3, and F0 presented to the discriminator were derived by using this hybrid scheme.

Before discussing the actual discrimination system, it is important to note that with the Petersen/Barney vowels we gain a major advantage in knowing the pitch. By definition, these vowels are fairly localized in frequency space with respect to each other. This means that before even using C1, C2, or C3 in the discriminator, we can break the vowels down into smaller groups according to their zero crossing counts (F0). The vowels shown in Figure 0 were found to reside in three major pitch groups. Group A, which contains vowels 1 through 4, were found
Group B, which contains vowels 5, 6, and 9, on average had counts between 30 and 50 per 512 bytes. Group C contains vowels 7 and 8 and generally had zero crossings below 30 counts per 512 bytes.
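
The pre-grouping by zero crossing count might be sketched in Python as shown here, purely as an illustration. The function name `pitch_group` is hypothetical. The text gives the ranges for groups B and C only, so two points are assumptions of this sketch: that group A corresponds to counts above 50, and that the boundary values 30 and 50 themselves fall in group B.

```python
def pitch_group(f0):
    """Map a zero crossing count per 512 bytes to a pitch group label."""
    if f0 < 30:
        return "C"   # vowels 7 and 8
    if f0 <= 50:
        return "B"   # vowels 5, 6, and 9 (assumed to include 30 and 50)
    return "A"       # vowels 1 through 4 (assumed: counts above 50)
```

Only the vowels inside the returned group then need to be separated by the discriminator.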
The immediate advantage of this technique is that each vowel can be classified as having membership in one of three groups before the discrimination algorithm is even applied. This has the net effect of increasing the discrimination rate because of the reduction in the number of vowels that must be separated one from another.

The discriminant analysis system used in this project was the SAS DISCRIM Procedure [15] running on a Sun Microsystems SPARC 10. When presented with the input, the DISCRIM Procedure generates quadratic discriminant functions for classifying observations into two or more groups based on exemplars containing one or more variables. The resulting discriminant function is derived from the generalized pairwise squared distance and would be easily ported to other computer systems.
In this case the prior probabilities of occurrence are assumed to be equal for all inputs. Upon completion, SAS produces a summary file showing implementation details and the percentage of time the vowels were properly correlated to their number (1 through 9) based on their exemplars. The vowel exemplar input and resulting SAS output are shown in Appendix A for group A, Appendix B for group B, and Appendix C for group C.
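To make the classification rule concrete, the sketch below evaluates the generalized squared distance and the equal-prior posterior probability that appear in the SAS listings of the appendices. It is not the SAS implementation: the function names, the two-variable example, the class means, and the identity inverse covariance are all hypothetical, and the inverse covariance matrix is assumed to be supplied by the caller.

```python
import math


def squared_distance(x, mean, cov_inv):
    """Generalized squared distance (x - mean)' * cov_inv * (x - mean)."""
    d = [xi - mi for xi, mi in zip(x, mean)]
    n = len(d)
    return sum(d[r] * cov_inv[r][c] * d[c] for r in range(n) for c in range(n))


def posteriors(distances):
    """Posterior probability of each class, assuming equal priors."""
    smallest = min(distances.values())   # shift exponents for numerical stability
    weights = {k: math.exp(-0.5 * (d - smallest)) for k, d in distances.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


def classify(x, class_means, cov_inv):
    """Return the most probable class label and all posterior probabilities."""
    distances = {k: squared_distance(x, m, cov_inv) for k, m in class_means.items()}
    post = posteriors(distances)
    return max(post, key=post.get), post


# Hypothetical two-variable example with two classes.
means = {7: [0.0, 0.0], 8: [4.0, 0.0]}
identity = [[1.0, 0.0], [0.0, 1.0]]
label, post = classify([1.0, 0.0], means, identity)
```

Because the priors are equal, choosing the largest posterior probability is the same as choosing the class with the smallest generalized squared distance.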
While it is possible to run the SAS discriminator directly on the four part exemplars shown in the appendices, it is generally advantageous to create a new set of inputs based on some intuition about the data. In this case, a set of 16 new C and F exemplars were created inside the SAS program after reading C1, C2, C3, and F0 from disk. The expanded set of exemplars were generated based on the knowledge that there were relationships between these values that could be expressed as ratios. These ratios, shown below in Figure 5, were used to accentuate the interdependencies between pitch and the temporal placement of wavelet coefficients. Note that some exemplar values have been redefined, but still retain the influence of the original value. All vowel discrimination was performed using this new set of exemplars.

Calculate First
C7 = (C2 * C1) / C3
C6 = (C1 * C3) / C2
C5 = (C2 + C1) / C3
C4 = (C2 - C1) / C3
C3 = C3
C0 = (C1 + C2 + C3) / C3
C1 = C3 / (C3 + C1)
C2 = C3 / (C3 + C2)

Calculate Second
F0 = log10(F0)
F1 = C0 / F0
F2 = C1 / F0
F3 = C2 / F0
F4 = C4 / F0
F5 = C5 / F0
F6 = C6 / F0

Figure 5
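The Figure 5 calculations can be expressed in a few lines of code. The sketch below assumes that the assignments are evaluated in the listed order, so that the "Calculate Second" ratios use the redefined C1 and C2 and the logarithm of F0. The function name and the dictionary layout are hypothetical, and only the assignments listed in Figure 5 are reproduced.

```python
import math


def expand_exemplar(c1, c2, c3, f0):
    """Derive the Figure 5 ratio exemplars from one (C1, C2, C3, F0) reading."""
    if f0 <= 1:
        raise ValueError("F0 must exceed 1 so that log10(F0) is positive")
    # "Calculate First": all right-hand sides use the original C1, C2, C3.
    c = {
        "C7": (c2 * c1) / c3,
        "C6": (c1 * c3) / c2,
        "C5": (c2 + c1) / c3,
        "C4": (c2 - c1) / c3,
        "C3": c3,
        "C0": (c1 + c2 + c3) / c3,
        "C1": c3 / (c3 + c1),   # redefined from the original C1
        "C2": c3 / (c3 + c2),   # redefined from the original C2
    }
    # "Calculate Second": F0 is replaced by its base-10 logarithm first.
    log_f0 = math.log10(f0)
    f = {"F0": log_f0}
    pairs = [("F1", "C0"), ("F2", "C1"), ("F3", "C2"),
             ("F4", "C4"), ("F5", "C5"), ("F6", "C6")]
    for f_name, c_name in pairs:
        f[f_name] = c[c_name] / log_f0
    return {**c, **f}


# First vowel 1 exemplar from Appendix A: F0 = 50, C1 = 247, C2 = 495, C3 = 955.
expanded = expand_exemplar(247, 495, 955, 50)
```

For this exemplar, C5 evaluates to 742/955 and F0 becomes log10(50), so every F ratio is a C value scaled by the logarithm of the pitch estimate.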
After running the SAS DISCRIM procedure on the expanded exemplar data, the goal of a 75% minimum discrimination rate was achieved. These results, which are shown in Figure 6, are based on twelve repetitions of the vowels; one from each of the twelve different speakers.
Figure 6

Vowel # (derived from)    Discrimination Rate
1 (EAT)                   83.33 %
2 (IT)                    75.00 %
3 (ANY)                   75.00 %
4 (AT)                    75.00 %
5 (UP)                    83.33 %
6 (AMISH)                 91.67 %
7 (OOMPH)                 91.67 %
8 (OOZE)                  83.33 %
9 (AUTO)                  75.00 %
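The rates in Figure 6 follow directly from the classification counts in the appendices. As a worked check, the sketch below recomputes the Group A rates and the total error estimate from the counts listed in Appendix A; the function name and the nested dictionary layout are assumptions made for illustration.

```python
def discrimination_rates(confusion):
    """Per-vowel percent correct and overall error rate from classification counts."""
    rates = {}
    errors = 0
    total = 0
    for label, row in confusion.items():
        n = sum(row.values())
        correct = row.get(label, 0)
        rates[label] = 100.0 * correct / n
        errors += n - correct
        total += n
    return rates, errors / total


# Group A counts from Appendix A (outer key: spoken vowel, inner key: assigned vowel).
group_a = {
    1: {1: 10, 2: 1, 3: 1, 4: 0},
    2: {1: 0, 2: 9, 3: 1, 4: 2},
    3: {1: 1, 2: 1, 3: 9, 4: 1},
    4: {1: 0, 2: 2, 3: 1, 4: 9},
}
rates, error = discrimination_rates(group_a)
```

Vowel 1 is classified correctly 10 times out of 12 (83.33 %), and 11 of the 48 Group A observations are misclassified, which gives the total error estimate of 0.2292 reported in the listing.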
The encouraging part of these results is that reasonable discrimination rates were achieved in spite of the emphasis on conserving CPU cycles. This indicates that much better rates could be achieved with signal processing and wavelet algorithms that are optimized for this task. Of course, the penalty would be increased pressure on CPU execution time.
Summary

In general, the results of this work were quite encouraging. In addition to achieving the minimum goal of a 75% discrimination rate, the following observations were made:

The ability of the wavelet transform to detect the placement of events in time can be used to synchronize the phases of multiple signals by looking for predetermined signatures.

After phase synchronization, it is also possible to distinguish one vowel from another based on the localization of events in time.

A pitch estimation algorithm can be implemented by simply calculating the number of zero crossings in the utterance.

For speech with specific frequency localization (such as the Petersen/Barney vowels), the pitch estimate can be used to subdivide the speech prior to the discrimination algorithm.

A simple discrimination system, employing the techniques described herein, could be implemented to respond to single vowel commands.

The wavelet phase detection algorithm could be used to look for the start of an event in any signal (speech, image, medical, 3 phase power, etc.), by simply searching for the placement (in time) of a given wavelet coefficient.

Where CPU power is available, a continuous moving window wavelet transform could be used to differentiate vowels in real-time. This would involve a continuous comparison of the positions of the wavelet coefficients in each frequency band after each transform.
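The zero crossing pitch estimate mentioned in the observations above can be sketched in a few lines. The sketch assumes unsigned 8-bit samples centred on 128, which the text does not state (it only speaks of counts per 512 bytes); the function name and the test signal are likewise hypothetical.

```python
def count_zero_crossings(samples, midpoint=128):
    """Count sign changes of the signal about its midpoint."""
    crossings = 0
    previous = 0                      # sign of the last sample that was off the axis
    for s in samples:
        centered = s - midpoint
        if centered == 0:
            continue                  # ignore samples that sit exactly on the axis
        sign = 1 if centered > 0 else -1
        if previous and sign != previous:
            crossings += 1
        previous = sign
    return crossings


# Hypothetical 512-byte buffer: a square wave with a period of 16 samples.
buffer = ([138] * 8 + [118] * 8) * 32
crossings = count_zero_crossings(buffer)
```

The square wave changes sign 63 times within the buffer, so the estimate rises with the fundamental frequency of the signal, which is the property used to split the vowels into pitch groups.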
In light of the success achieved with the simplified pitch estimation and wavelet algorithms, much better results could be expected with more complex algorithms and increased computing power.

Appendix A
Group A                              10:38 Saturday, February 4, 1995

S#   F0   C1   C2   C3        S#   F0   C1   C2   C3
 1   50  247  495  955         2   50  160  511  889
 1   51  175  357  726         2   50  248  408  857
 1   59  170  343  713         2   51  204  404  814
 1   65  251  507  950         2   51  189  374  756
 1   72  207  417  834         2   56  212  413  830
 1   75  209  456  860         2   62  215  431  844
 1   79  160  491  863         2   65  204  428  842
 1   87  174  348  714         2   67  212  426  832
 1   89  157  315  658         2   71  233  466  908
 1   92  170  438  828         2   74  243  329  763
 1  102  153  366  730         2   78  246  495  945
 1  113  138  269  624         2   81  248  497  937

S#   F0   C1   C2   C3        S#   F0   C1   C2   C3
 3   56  207  472  859         4   50  194  274  666
 3   58  181  452  826         4   50  199  400  796
 3   66  205  416  820         4   50  197  393  774
 3   72  133  511  851         4   54  199  400  788
 3   75  213  429  829         4   61  200  400  793
 3   76  240  483  924         4   62  252  418  872
 3   85  250  400  843         4   64  254  419  883
 3   86  159  324  671         4   64  202  403  797
 3   88  240  486  932         4   66  199  400  807
 3   89  192  386  764         4   70  207  415  808
 3   97  250  501  954         4   75  202  406  817
 3  103  183  424  818         4   95  152  366  707
Discriminant Analysis

48 Observations    47 DF Total
16 Variables       44 DF Within Classes
 4 Classes          3 DF Between Classes

Class Level Information

SOUND   Frequency    Weight   Proportion   Prior Probability
1          12       12.0000    0.250000        0.250000
2          12       12.0000    0.250000        0.250000
3          12       12.0000    0.250000        0.250000
4          12       12.0000    0.250000        0.250000

Pooled Covariance Matrix Information

Covariance Matrix Rank: 16
Natural Log of the Determinant of the Covariance Matrix: -148.15458

Pairwise Generalized Squared Distances Between Groups

$D^2(i|j) = (\bar{X}_i - \bar{X}_j)' \, COV^{-1} \, (\bar{X}_i - \bar{X}_j)$

Generalized Squared Distance to SOUND
From SOUND    1    2    3    4
Linear Discriminant Function

$\text{Constant} = -.5 \, \bar{X}_j' \, COV^{-1} \, \bar{X}_j$
$\text{Coefficient Vector} = COV^{-1} \, \bar{X}_j$

                                 SOUND
                   1             2             3             4
CONSTANT     -28387112     -28378857     -28375796     -28373005
C0           -76226930     -76224439     -76203559     -76214671
C1            -3034464      -3035846      -3045365      -3016966
C2           381278534     381282879     381180965     381216076
C3               -3759         -3764         -3757         -3763
C4            85470437      85476876      85451823      85456237
C6               44983         45024         44971         45020
C5           -65617659     -65625900     -65597579     -65616804
C7               -3916         -3916         -3915         -3915
F0               40796         40594         40970         40500
F1           143298970     143292692     143253258     143273612
F2            79965435      79963699      79968494      79922862
F3          -702184950    -702195605    -701994757    -702070326
F4          -188766574    -188777425    -188722670    -188736005
F5           148777224     148788222     148732455     148767770
F6                8883          8893          8880          8891
Classification Summary for Calibration Data: WORK.AA
Resubstitution Summary using Linear Discriminant Function

Generalized Squared Distance Function:
$D_j^2(X) = (X - \bar{X}_j)' \, COV^{-1} \, (X - \bar{X}_j)$

Posterior Probability of Membership in each SOUND:
$\Pr(j|X) = \exp(-.5 \, D_j^2(X)) \, / \, \sum_k \exp(-.5 \, D_k^2(X))$

Number of Observations and Percent Classified into SOUND:

From SOUND            1             2             3             4         Total
1                10  83.33      1   8.33      1   8.33      0   0.00     12 100.00
2                 0   0.00      9  75.00      1   8.33      2  16.67     12 100.00
3                 1   8.33      1   8.33      9  75.00      1   8.33     12 100.00
4                 0   0.00      2  16.67      1   8.33      9  75.00     12 100.00
Total Percent    11  22.92     13  27.08     12  25.00     12  25.00     48 100.00
Priors             0.2500        0.2500        0.2500        0.2500

Error Count Estimates for SOUND:

              1         2         3         4     Total
Rate     0.1667    0.2500    0.2500    0.2500    0.2292
Priors   0.2500    0.2500    0.2500    0.2500
Appendix B

Group B                              09:53 Saturday, February 4, 1995

S#   F0   C1   C2   C3        S#   F0   C1   C2   C3        S#   F0   C1   C2   C3
 5   41  136  511  852         6   49  166  511  855         9   45  255  511  961
 5   35  246  492  947         6   39  192  511  903         9   40  166  511  866
 5   49  130  258  560         6   37  255  511  972         9   30  205  511  919
 5   45  135  365  675         6   47  255  511  965         9   43  234  460  910
 5   48  232  467  896         6   40  186  446  836         9   35  236  479  909
 5   39  212  415  828         6   45  199  511  901         9   33  208  418  813
 5   45  234  464  895         6   49  172  511  859         9   49  231  465  895
 5   42  197  395  774         6   47  255  450  858         9   35  173  353  737
 5   30  219  511  933         6   49  251  511  952         9   35  243  511  952
 5   35  226  454  855         6   42  255  511  968         9   37  209  499  903
 5   41  199  400  777         6   44  201  511  901         9   32  255  511  947
 5   41  238  295  731         6   44  165  501  860         9   44  202  405  807

Exemplars presented as input to SAS for vowels 5, 6, and 9.
Vowels were derived from 12 individual speakers.
Discriminant Analysis

36 Observations    35 DF Total
16 Variables       33 DF Within Classes
 3 Classes          2 DF Between Classes

Class Level Information

SOUND   Frequency    Weight   Proportion   Prior Probability
5          12       12.0000    0.333333        0.333333
6          12       12.0000    0.333333        0.333333
9          12       12.0000    0.333333        0.333333

Pooled Covariance Matrix Information

Covariance Matrix Rank: 16
Natural Log of the Determinant of the Covariance Matrix: -154.26124

Pairwise Generalized Squared Distances Between Groups

$D^2(i|j) = (\bar{X}_i - \bar{X}_j)' \, COV^{-1} \, (\bar{X}_i - \bar{X}_j)$

Generalized Squared Distance to SOUND
From SOUND    5    6    9
Linear Discriminant Function

$\text{Constant} = -.5 \, \bar{X}_j' \, COV^{-1} \, \bar{X}_j$
$\text{Coefficient Vector} = COV^{-1} \, \bar{X}_j$

                          SOUND
                   5             6             9
CONSTANT     -32277681     -32281120     -32272681
C0          -125348089    -125214528    -125169474
C1           684614944     684658333     684466760
C2          -190942609    -191604568    -191613466
C3              -16645        -16619        -16604
C4          -248049881    -248228165    -248182120
C6              140905        140715        140610
C5           -71428008     -71271549     -71227840
C7               -6928         -6928         -6927
F0               45865         46073         46501
F1           176836947     176621691     176550619
F2          -1.01563E9    -1.01571E9    -1.01541E9
F3           412291135     413378023     413384539
F4           385210673     385506957     385436056
F5           116039107     115786043     115712962
F6               27172         27129         27105
F7             -188962       -188651       -188483
Classification Summary for Calibration Data: WORK.AA
Resubstitution Summary using Linear Discriminant Function

Generalized Squared Distance Function:
$D_j^2(X) = (X - \bar{X}_j)' \, COV^{-1} \, (X - \bar{X}_j)$

Posterior Probability of Membership in each SOUND:
$\Pr(j|X) = \exp(-.5 \, D_j^2(X)) \, / \, \sum_k \exp(-.5 \, D_k^2(X))$

Number of Observations and Percent Classified into SOUND:

From SOUND            5             6             9         Total
5                10  83.33      0   0.00      2  16.67     12 100.00
6                 0   0.00     11  91.67      1   8.33     12 100.00
9                 1   8.33      2  16.67      9  75.00     12 100.00
Total Percent    11  30.56     13  36.11     12  33.33     36 100.00
Priors             0.3333        0.3333        0.3333

Error Count Estimates for SOUND:
Rate
Priors
Appendix C

Group C                              09:52 Saturday, February 4, 1995

S#   F0   C1   C2   C3        S#   F0   C1   C2   C3
 7   28  255  511  974         8   20  162  314  699
 7   23  255  511  960         8   13  255  511  983
 7   29  157  511  869         8   19  255  511  968
 7   23  159  316  673         8   28  250  258  710
 7   29  238  511  946         8   19  235  462  880
 7   21  202  511  916         8   25  215  414  830
 7   29  255  511  982         8   21  255  511  954
 7   29  165  492  849         8   21  209  511  921
 7   19  255  511  976         8   18  219  511  908
 7   27  227  351  782         8   29  180  287  658
 7   20  255  511  954         8   21  207  511  907
 7   26  204  511  925         8   29  169  511  877

Exemplars presented as input to SAS for vowels 7 and 8.
Vowels were derived from 12 individual speakers.
Discriminant Analysis

24 Observations    23 DF Total
16 Variables       22 DF Within Classes
 2 Classes          1 DF Between Classes

Class Level Information

SOUND   Frequency    Weight   Proportion   Prior Probability
7          12       12.0000    0.500000        0.500000
8          12       12.0000    0.500000        0.500000

Pooled Covariance Matrix Information

Covariance Matrix Rank: 16
Natural Log of the Determinant of the Covariance Matrix: -149.7567

Pairwise Generalized Squared Distances Between Groups

$D^2(i|j) = (\bar{X}_i - \bar{X}_j)' \, COV^{-1} \, (\bar{X}_i - \bar{X}_j)$

Generalized Squared Distance to SOUND
From SOUND          7          8
7                   0    4.73158
8             4.73158          0
Linear Discriminant Function

$\text{Constant} = -.5 \, \bar{X}_j' \, COV^{-1} \, \bar{X}_j$
$\text{Coefficient Vector} = COV^{-1} \, \bar{X}_j$

                   SOUND
                   7             8
CONSTANT    -116920497    -116919457
C0          -178032482    -178116795
C1          2191695849    2191836091
C2          -1.33626E9    -1.33607E9
C3              -82773        -82780
C4          -917554648    -917551644
C6              696802        696867
C5           -89169658     -89244292
C7              -18294        -18296
F0            -1264658      -1264670
F1           174968199     175073034
F2          -2.84637E9    -2.84657E9
F3          2241564904    2241342867
F4          1273630932    1273634051
F5           129206960     129298509
F6              114369        114379
F7             -869557       -869635
Classification Summary for Calibration Data: WORK.AA
Resubstitution Summary using Linear Discriminant Function

Generalized Squared Distance Function:
$D_j^2(X) = (X - \bar{X}_j)' \, COV^{-1} \, (X - \bar{X}_j)$

Posterior Probability of Membership in each SOUND:
$\Pr(j|X) = \exp(-.5 \, D_j^2(X)) \, / \, \sum_k \exp(-.5 \, D_k^2(X))$

Number of Observations and Percent Classified into SOUND:

From SOUND            7             8         Total
7                11  91.67      1   8.33     12 100.00
8                 2  16.67     10  83.33     12 100.00
Total Percent    13  54.17     11  45.83     24 100.00
Priors             0.5000        0.5000

Error Count Estimates for SOUND:

              7         8     Total
Rate     0.0833    0.1667
Priors   0.5000    0.5000
Appendix D

Wavelet Based Phase Alignment

[Figure: speech signal; sampled speech buffer (1024 bytes)]
REFERENCES

1. Speech Frequently Asked Questions (FAQ); Andrew Hunt, comp.speech on the internet, 1994.
2. Fundamentals of Speech Recognition; Lawrence Rabiner & Biing-Hwang Juang. Englewood Cliffs NJ: PTR Prentice Hall (Signal Processing Series), c1993. ISBN 0-13-015157-2.
3. Speech recognition by machine; W. A. Ainsworth. London: Peregrinus for the Institute for Electrical Engineers, c1988.
4. Speech synthesis and recognition; J. N. Holmes. Wokingham: Van Nostrand Reinhold, c1988.
5.