Towards Improved Assessment of Phonotactic Information for Automatic Language Identification
Terrence Martin, Eddie Wong, Sridha Sridharan
Speech and Audio Research Laboratory, Queensland University of Technology, GPO Box 2434, 2 George St, Brisbane, Australia, QLD 4001.
tl.martin, ee.wong, [email protected]
Abstract
Phonotactic modelling, typically in the form of a PPRLM system, forms a key component in state-of-the-art Language Identification (LID) systems. Given that the objective of PPRLM systems is to capture as accurately as possible the phonotactics which characterise a language, it is assumed that the minimisation of Phone Error Rate (PER) is a precursor to achieving this effectively. In this paper we examine the relevance of PER as a metric for determining eventual LID performance. In order to conduct this investigation we make use of the CallHome corpus, based on the premise that it provides a better representation of the style of discourse and channel conditions encountered in Conversational Telephone Speech (CTS), which is now the focus of current NIST LID evaluations. Using CallHome instead of the OGI-MLTS corpus to train phone recognisers, we obtained significantly improved results, with an average improvement of approximately 6% absolute across the 30, 10 and 3 second tasks for the NIST 1996 and 2003 evaluations. We also examine the impact of tuning the individual front-end recognisers, on both the resultant PER of other languages and the resultant LID performance. We find that PER has a number of limitations in indicating both the degree and direction of changes to LID performance. Accordingly, we propose a new metric which is better suited for forecasting the impact on LID performance when the phone recogniser front-end is modified.
1. Introduction
Phonotactic modelling, typically in the form of a PPRLM system [1], forms a key component in state-of-the-art Language Identification (LID) systems. The objective of PPRLM systems is to capture as accurately as possible the phonotactics which characterise a language. However, the number of available metrics which reflect how well, and how consistently, this information is captured is limited. The availability of reliable metrics is important for evaluation purposes, but also for gaining insight into what information provides the most important contribution to the LID task. Whilst LID rate is ultimately the most relevant metric, the PER of the front-end recogniser can also be used as a proxy for how well this information is captured, and offers the additional benefit of being more easily obtained.
The utility of this benefit is best illustrated in PPRLM system optimisation. If LID rate is to be used to examine the impact of any changes to the system, such as to the front-end recognisers, both the training and test data must be decoded. Following this, n-gram models need to be trained and then tested. Given 12 languages, this can be extremely time consuming, especially for the CallFriend corpus [2]. Alternatively, the PER of the front-end can be obtained reliably from a much smaller set of data, in a much shorter time frame.
However, little research has been published which examines the relationship between PER and LID, and the PERs which are quoted are of limited use. To illustrate, most quoted PERs have been based on performance for the OGI-MLTS corpus [3], despite the fact that LID evaluations are conducted on the significantly more difficult recognition task of CallFriend. Accordingly, no meaningful relationship can be, or at least should be, inferred between the error rates quoted and subsequent LID performance.
In order to provide a better means of evaluating phone recognition performance we use the CallHome corpus [2], which contains speech recorded under essentially the same conditions as CallFriend, and includes both transcriptions and lexical resources. The use of this corpus provides two benefits. The first is a means to evaluate the performance of the front-end recognisers in a task which more closely reflects the expected style of discourse and channel conditions. The second is that acoustic models can also be built, which were expected to provide more reliable front-end decoders in the PPRLM system.
As an alternative to PER, we propose the use of a new metric for phonotactic information based on a Phone Alignment Cost (PAC). This technique stems from the idea that phone recognition errors are not all the same. For example, is an error where /p/ is replaced by /b/ on a consistent basis more desirable than, say, /p/ being replaced by a more erroneous representative such as /s/? The fact that PERs for ASR systems are quite high, yet PPRLM systems perform quite admirably, highlights that useful information is still present in these error-ridden phone streams. Based on this concept, the PAC incorporates a linguistically intuitive hierarchy for establishing a cost for the various types of phone recognition errors. The cumulative cost can then be used to gauge the effect of any changes. In contrast, PER penalises any errors absolutely; they are either right or wrong, despite the fact that meaningful information may still be available.
The contents of this paper are as follows. In Section 2, a brief description of the PPRLM system is provided. A discussion of the relative merits of using PER and PAC for extracting available phonotactic information is then provided in Section 3. Details regarding the development of the baseline front-end recognisers are then outlined in Section 4. The results of experimentation which examined the relationship between PER, PAC and LID accuracy are then presented in Section 5. Conclusions are subsequently drawn in Section 6.
2. PPRLM System Description
"Parallel Phone Recognition followed by Language Modelling" (PPRLM) [1] comprises a bank of identical "Phone Recognition followed by Language Modelling" (PRLM) sub-systems running in parallel, as depicted in Figure 1. Each of the sub-systems performs the same LID function; however, the front-end phone recognisers are trained individually with speech data from different languages. As the name implies, this system works by first decoding the speech data into a phone stream. Likelihood scores are then obtained by comparing the phone stream to n-gram Language Models (bi-gram LMs are employed in this paper). In an attempt to enrich the phonetic description, a duration tag of "-Short" or "-Long" is appended to each phone label when its phone duration is shorter or longer than its average phone duration. One of the more important features of PPRLM is that it does not require transcribed speech data for the languages which are targeted for identification. Instead, their phonotactics are described in terms of the phonemic inventory of the front-end phone recogniser's language. In essence, the front-end phone recognisers are employed to decode the speech data of all the different languages.
Figure 1: Block Diagram of PPRLM System.
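The scoring performed by a single PRLM sub-system can be sketched as follows. This is a minimal illustration under stated assumptions, not the system used in this paper: the names `tag_durations`, `BigramLM` and `identify` are hypothetical, add-one smoothing stands in for a back-off scheme, and a phone whose duration equals its mean is arbitrarily tagged "-Long".

```python
import math
from collections import Counter


def tag_durations(phones, mean_duration):
    """Append a -Short/-Long tag to each (label, duration) pair.

    Assumption: a phone lasting exactly its mean duration is tagged -Long.
    """
    tagged = []
    for label, duration in phones:
        suffix = "-Short" if duration < mean_duration[label] else "-Long"
        tagged.append(label + suffix)
    return tagged


class BigramLM:
    """Add-one smoothed bigram model (a simple stand-in for back-off)."""

    def __init__(self, sequences):
        self.bigrams = Counter()
        self.contexts = Counter()
        self.vocab = {"</s>"}
        for seq in sequences:
            padded = ["<s>"] + list(seq) + ["</s>"]
            self.vocab.update(seq)
            for prev, cur in zip(padded, padded[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1

    def log_likelihood(self, seq):
        padded = ["<s>"] + list(seq) + ["</s>"]
        v = len(self.vocab) + 1  # +1 reserves mass for unseen symbols
        total = 0.0
        for prev, cur in zip(padded, padded[1:]):
            num = self.bigrams[(prev, cur)] + 1
            den = self.contexts[prev] + v
            total += math.log(num / den)
        return total


def identify(seq, models):
    """Return the language whose LM gives the phone stream the best score."""
    return max(models, key=lambda lang: models[lang].log_likelihood(seq))
```

In a PPRLM system, one such set of language models would be trained per front-end recogniser, with the per-recogniser scores combined afterwards.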
3. Phonotactic Information Metrics
Phonotactics can be defined as the frequency and possible order of occurrence of a sequence of phonetic events. It has proven to be an effective information source for accurate identification of languages. The PPRLM LID system outlined in this paper is based on extracting phonotactic information by decoding the speech data with a front-end phone recogniser. However, phone recognition systems produce a significant number of errors, typically in the range of 40-60% PER, and accordingly corrupt the phonotactic information contained in the original utterance. Despite these errors, PPRLM systems perform quite well, although system performance does degrade when the length of the test utterance is decreased.
Given the amount of inaccuracy in the phone stream, it is somewhat surprising that the level of LID performance achieved is so high. This raises the question of whether the meaningful information extracted is in fact phonotactic, or simply a result of the efficiency with which the modelling system is able to exploit pattern differences across different languages. However, for the remainder of the discussion in this paper, it is assumed that this meaningful information is phonotactic.
The limited understanding of how information contained in the phone stream is used in turn means that the impact of changes to the system can only be evaluated empirically. This can be a time consuming process and highlights the potential benefits of a suitable metric capable of representing available phonotactic information for LID. In addition, it is likely that a suitable tool will provide qualitative benefits in understanding language characteristics. Accordingly, we discuss and contrast the attributes of two metrics in this section. The first is PER, which represents the most commonly used approach. Second, we suggest an alternative metric, based on an alignment technique originally proposed in [4] for use in pronunciation modelling. We have adapted this technique for our purposes to provide a better alternative than PER for gauging the amount of available phonotactic information.
3.1. Phone Error Rate
Most PPRLM systems utilise acoustic models trained from OGI. The main reason for this is that the time frame required to produce models using the OGI data is relatively short; the transcriptions contain phone-based alignments and model training from that point is relatively straightforward. However, whilst the models work effectively in PPRLM systems, little credence should be paid to reported PERs on OGI evaluations, given that the eventual recognition task is decoding speech from CallFriend. It is expected that the use of OGI models for decoding CallFriend will produce less than optimal PERs, and in turn degrade the amount of information available in the hypothesised phonetic stream.
Aside from errors which result from a mismatch between tasks, consideration also needs to be given to the fact that language differences also degrade decoder accuracy. It is well known in cross-lingual and multilingual studies [5], [6], [7] that using acoustic models from one language to decode speech from another leads to degraded performance. Given the already high error rate, the level of accuracy obtained by front-end decoders on other languages is likely to be very poor indeed. Despite these sources of degradation, the success of LID systems based on PPRLM illustrates that useful information still exists in these inaccurate phone streams. Our interest lies in deciphering the extent and usefulness of the information contained in these phone streams, and identifying the relationship between PER and LID was considered an important step to achieving this goal.
However, determining the information contained in these phone streams is problematic, especially if PER is to be used as the metric. As mentioned, in order to determine PERs, a suitable set of reference transcriptions is required. If the relationship between PER and LID is to be valid, it is also preferable that the evaluation in which PER is obtained mirrors that of the eventual LID task. This is obviously not the case for OGI, and accordingly a more suitable corpus was sought.
The transcriptions available in CallHome were considered a more appropriate representation of the speech which occurs in CallFriend and, as such, more suitable for examining the relationship between PER and LID. In addition, CallHome has transcriptions for a number of languages, making it possible to evaluate the phone recognition performance of each front-end recogniser on other languages.
However, even with a suitable evaluation set, obtaining PER across multiple languages is problematic, as differences between phonemic inventories exist. In order to obtain PER, the reference transcription needs to be compared with that produced by the phone recogniser. Slight differences in the articulatory realisation of the same sound mean that they are quite often labelled differently across languages. For instance, in English the "d" in dog is labelled in Worldbet as "d", whereas in Japanese the closest approximation is labelled "d{". Accordingly, some form of mapping is required so that an equitable comparison can be made. Complicating this problem is that some languages have only one label for a particular sound, such as Spanish where the vowels are pure, whereas in English there are many phonemic variants of the same basic sound. Thus, the mapping process can become quite involved, requiring knowledge of the various properties of sounds across many languages. Of course, data-driven mappings can be derived, although this approach has a shortcoming when there are differences in channel conditions between the corpora the models are trained on and the development set used to derive rules.
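The role of the mapping in a cross-language PER calculation can be sketched as follows. This is an illustrative sketch only: the function names and the toy transcriptions are hypothetical, PER is computed here as unit-cost edit distance divided by reference length, and the mapping is assumed to translate hypothesised labels into the reference inventory.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance with unit substitution/insertion/deletion costs."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]


def phone_error_rate(ref, hyp, mapping=None):
    """PER (%) after mapping hypothesised labels to the reference inventory."""
    if not ref:
        raise ValueError("reference transcription must not be empty")
    mapping = mapping or {}
    mapped = [mapping.get(p, p) for p in hyp]
    return 100.0 * edit_distance(ref, mapped) / len(ref)


# Toy example: a Japanese reference decoded with English-style labels.
reference = ["d{", "a", "k"]
hypothesis = ["d", "a", "t"]
print(phone_error_rate(reference, hypothesis))                 # no mapping
print(phone_error_rate(reference, hypothesis, {"d": "d{"}))    # with mapping
```

Without the mapping, the "d" versus "d{" labelling difference is counted as an error even though the two sounds are close approximations of one another.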
Regardless of which mapping technique is used, the use of PER has another limitation. We assert that when it comes to modelling the phonotactics which characterise a language, the degree to which recognition errors corrupt the phonotactic information is not uniform. For instance, if the phone /p/ is interchangeably recognised as either /b/ or /p/, and rarely as anything else, then it is likely that usable n-gram statistics are gathered. In contrast, if inappropriate modelling or mismatch between the train and test domain leads to inconsistencies in decoding /b/, then less information is probably derived. Given this, intuitively the cost of an error can vary according to linguistic similarity. Unfortunately, PER is absolute: either the recognised phone is correct, or it is not. In Section 5, experimental results highlight how ineffective PER is as an indicator for predicting LID performance. Based on this assertion, the next section outlines an alternative metric which seeks to overcome these deficiencies.
3.2. Phone Alignment Cost
In order to establish the cost of each recognition error made by the front-end decoder, we adopted an approach first introduced in [4]. The focus of this work was improving the modelling of pronunciation variation for Mandarin. A key aspect of pronunciation modelling is obtaining reliable estimates for the frequency of alternate pronunciations which differ from the lexical representation and cause transcription errors. One method for establishing the frequency of these pronunciation variations is to decode a transcription and subsequently compare it with the reference. This requires that the reference and hypothesised transcriptions are aligned. Generally this alignment is achieved via dynamic programming using a simple edit distance as a cost function. Unfortunately, simple edit cost functions provide inadequate alignments, and in turn impact on the reliability of derived pronunciation rules. To combat this, Fung introduced a flexible alignment tool which incorporates a hierarchy of costs for inter-symbol alignments. This tool is available at [8]. A by-product of this alignment process is the cost of aligning the utterance.
The relevance of this alignment cost is that the cost assigned to each inter-symbol alignment is related to how linguistically similar the symbols are. The general idea is that aligning /d/ with /t/ is less expensive than aligning it with, say, /s/. At a cruder level, the cost of aligning /i/ with any other vowel is less than aligning it with a consonant. From a global perspective, if a Spanish recogniser is inappropriately trained, and produces inconsistent transcriptions for, say, the Japanese language, the process of alignment becomes more difficult and hence this will be reflected in the overall cost of aligning the reference utterance with the hypothesised one. Ideally the cost function should allow for a graceful degradation in recogniser performance, by incorporating a hierarchical structure based on linguistic similarity. Additionally, it should incorporate a means of comparing phonemes from different languages which are similar in articulatory realisation, but annotated differently. Based on this idea, the average cost per symbol alignment should, in principle, provide a better guide to the amount of information which is preserved when changes are made to front-end recognisers.
To obtain alignment costs we adapted the toolkit introduced by Fung, to enable it to cater for the phoneme inventories of multiple languages, and expanded the class hierarchy. A formalisation of the qualitative explanation given in [4] is as follows.
The cost of aligning phones from the reference and hypothesised transcriptions is denoted $C(b, s)$, where $b$ denotes the reference phone and $s$ the hypothesised phone. Let $X = (x_1, x_2, \ldots, x_t)$ represent the total set of articulatory classes designed to provide coverage for the phonetic inventory of both the source and target languages. The classes used are arranged in a hierarchical manner, similar to the question set used in training context-dependent models, so that classes range from broad categories, such as whether the phone is a vowel or consonant, through to exact descriptions of articulation. A subset of $X$ exists which is defined by:
$$X_{BS} = B \cup S \qquad (1)$$
where $B = (b_1, b_2, \ldots, b_n)$ defines the subset of $n$ classes in which $b$ exists, and similarly $S = (s_1, s_2, \ldots, s_m)$ defines the subset of $m$ classes in which $s$ exists. Using the cardinality operator $|\cdot|$ to reflect the number of distinct elements, the cost of aligning the phone pairing is given by:
$$C(b, s) = |B \cup S| - |B \cap S| \qquad (2)$$
Essentially this equates to incrementing the cost each time the phones $b$ and $s$ do not co-exist in one of the classes contained in $X_{BS}$.
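The substitution cost can be illustrated with a short sketch. It assumes that Equation (2) counts the classes in the union of the two class sets which contain only one of the two phones; the class memberships listed below are hypothetical and far coarser than any real class hierarchy.

```python
# Hypothetical articulatory class memberships, for illustration only.
PHONE_CLASSES = {
    "p": {"consonant", "stop", "bilabial", "unvoiced"},
    "b": {"consonant", "stop", "bilabial", "voiced"},
    "s": {"consonant", "fricative", "alveolar", "unvoiced"},
    "i": {"vowel", "high", "front"},
}


def substitution_cost(b, s, classes=PHONE_CLASSES):
    """Number of classes in (B union S) that b and s do not share."""
    B, S = classes[b], classes[s]
    return len(B | S) - len(B & S)


for hyp in ("p", "b", "s", "i"):
    print("C(p, %s) = %d" % (hyp, substitution_cost("p", hyp)))
```

With these toy classes, replacing /p/ with /b/ is cheaper than replacing it with /s/, which in turn is cheaper than replacing it with the vowel /i/, reflecting the graded penalty that the cost is intended to provide.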
The cost outlined above is that associated with substitutions. However, in many cases the lengths of the reference and hypothesised transcriptions differ. Accordingly, a separate set of rules is also necessary to define the costs associated with insertions and deletions. The rules governing insertions and deletions were cruder than those used for substitutions. We expanded and adjusted the rule set originally made available by [8], which was designed to cover the set of Mandarin sounds. This expansion was required to cover the phonemic inventories of English, Spanish, German, Japanese and Mandarin. The guiding rule in our adaptation was that the insertion of vowels was more likely than that of consonants, and vice versa for deletions. In addition, deletions of the phones /r/, /l/ and /h/ were afforded a smaller cost, as these were deleted quite often, especially in the case of /r, l/ when they occurred syllable-finally.
As mentioned in the previous section, the use of PER as a metric is difficult when comparing phone streams across languages. For each language pair under consideration, an appropriate mapping must be produced. Thus, if German needs to be aligned with Japanese, a mapping must be produced. If German then needs to be aligned with Mandarin, another mapping is required, and so on. Using the alignment cost, the phonemic inventory for each language only needs to be incorporated into the class list once and, accordingly, it offers a more expedient means of aligning across languages.
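The way substitution, insertion and deletion costs combine into an average cost per reference phone can be sketched with a standard dynamic programming alignment. This is not the toolkit of [8]: the class table and every numeric cost below are assumptions chosen only so that their ordering follows the rules described in this section.

```python
# Hypothetical articulatory classes, for illustration only.
CLASSES = {
    "p": {"consonant", "stop", "bilabial", "unvoiced"},
    "b": {"consonant", "stop", "bilabial", "voiced"},
    "t": {"consonant", "stop", "alveolar", "unvoiced"},
    "d": {"consonant", "stop", "alveolar", "voiced"},
    "s": {"consonant", "fricative", "alveolar", "unvoiced"},
    "r": {"consonant", "liquid", "alveolar", "voiced"},
    "l": {"consonant", "lateral", "alveolar", "voiced"},
    "h": {"consonant", "fricative", "glottal", "unvoiced"},
    "i": {"vowel", "high", "front"},
    "a": {"vowel", "low", "central"},
}

# Assumed values: vowel insertions and consonant deletions are cheaper.
INSERTION = {"vowel": 2.0, "consonant": 3.0}
DELETION = {"vowel": 3.0, "consonant": 2.0}
CHEAP_DELETION = {"r": 1.0, "l": 1.0, "h": 1.0}  # frequently deleted phones


def broad_class(phone):
    return "vowel" if "vowel" in CLASSES[phone] else "consonant"


def sub_cost(b, s):
    return float(len(CLASSES[b] | CLASSES[s]) - len(CLASSES[b] & CLASSES[s]))


def ins_cost(s):
    return INSERTION[broad_class(s)]


def del_cost(b):
    return CHEAP_DELETION.get(b, DELETION[broad_class(b)])


def alignment_cost(ref, hyp):
    """Minimum total cost of aligning ref with hyp."""
    n, m = len(ref), len(hyp)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + del_cost(ref[i - 1])
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + ins_cost(hyp[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + del_cost(ref[i - 1]),
                d[i][j - 1] + ins_cost(hyp[j - 1]),
                d[i - 1][j - 1] + sub_cost(ref[i - 1], hyp[j - 1]),
            )
    return d[n][m]


def phone_alignment_cost(ref, hyp):
    """Average alignment cost per reference phone."""
    if not ref:
        raise ValueError("reference transcription must not be empty")
    return alignment_cost(ref, hyp) / len(ref)
```

Under these assumed costs, a hypothesis such as /b a d/ for the reference /p a t/ scores lower than /s i s/, whereas a unit-cost edit distance would rank the two hypotheses much more coarsely.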
4. CallHome
The motivation for using models trained on CallHome, in lieu of those trained on OGI, is that it represents a closer match with the style of discourse and recording conditions contained in CallFriend. As such, it is expected that the subsequent models will produce more accurate transcriptions and corresponding improvements in LID performance. For this research we used the resources contained in the CallHome corpus to produce baseline Automatic Speech Recognition (ASR) systems for Spanish, Mandarin, German and Japanese.
The CallHome corpus includes a collection of telephone speech recordings, transcripts and lexical resources for six languages: those already mentioned, as well as American English and Egyptian Arabic. The corpus contains recordings of unscripted conversations between native speakers of the specific language. All calls, which lasted up to 30 minutes, originated in North America. Participants typically called family members or close friends [2]. There is considerably more training data in CallHome when compared to OGI. Table 1 details the total amount of data available in the two corpora after removing undesirable utterances, and highlights the difference in the amount of available data. Whilst statistics for OGI are not shown, there are also considerably more speakers in the CallHome corpora.
At present, ASR systems based on the available English and Arabic data have not been produced by the authors. In an attempt to expand the number of available CTS recognisers, a recogniser based on transcriptions from SwitchBoard was used to represent the English decoder in our PPRLM system. Note that we have incorporated all available transcriptions for CallHome, including those released in Spanish and Mandarin NIST evaluations. This data was then segregated into separate train, test and development sets, according to an 80/10/10 split. No utterances from any speaker occurred in more than one set. Further details regarding the breakdown of data and the number of speakers for the CallHome data are provided in Table 2.
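A speaker-disjoint 80/10/10 split of this kind can be sketched as follows. The procedure actually used is not described here, so this is only one possible approach: the function name, the greedy assignment of whole speakers, and the use of utterance counts to measure the split are all assumptions.

```python
import random
from collections import defaultdict


def speaker_disjoint_split(utterances, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split (utt_id, speaker_id) pairs into train/test/dev by whole speakers.

    Each speaker is assigned to the set that is currently furthest below
    its target size, so no speaker appears in more than one set.
    """
    by_speaker = defaultdict(list)
    for utt_id, speaker in utterances:
        by_speaker[speaker].append(utt_id)

    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)

    total = sum(len(utts) for utts in by_speaker.values())
    names = ("train", "test", "dev")
    targets = dict(zip(names, (f * total for f in fractions)))
    splits = {name: [] for name in names}

    for speaker in speakers:
        # Choose the set with the largest remaining deficit.
        name = max(names, key=lambda n: targets[n] - len(splits[n]))
        splits[name].extend(by_speaker[speaker])
    return splits
```

Because whole speakers are assigned at once, the resulting proportions are only approximately 80/10/10 when speakers contribute very different amounts of data.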
Whilst this corpus has been freely available for some time, very few studies have reported its use in ASR development. The difficulty of the task is perhaps one reason; recognition of CallHome speech is a difficult task, with work outlined in [9] suggesting that it is significantly more difficult than SwitchBoard English. As such, the development of ASR systems across four languages (five if English is included) is a significant undertaking. Complicating matters is that the orthographic representation of each language contains its own peculiarities which require attention.
The Spanish and German transcripts introduce very few surprises. For Spanish these include the use of acute accents and the diaeresis, whilst German includes the umlauts, namely ä, ö and ü. Both of these are encoded using ISO 8859-1, and can be seamlessly incorporated in most computer-based applications. However, both Japanese and Mandarin orthographies require a little more attention. The Japanese transcripts contain a mix of Kanji, Hiragana and Katakana, encoded using the EUC standard. Similarly, Mandarin is encoded using GB mainland conventions.

Language    OGI (hrs)    CallHome (hrs)
Mandarin    1.3          24.0
Spanish     1.7          46.8
German      1.5          10.1
Japanese    1.1          10.6
English     3.5          164.0
Table 1: Comparison of Total Available Data - OGI vs CallHome

Mandarin    [illegible]    [illegible]    [illegible]
Spanish     61821/397      8097/50        7747/45
German      14744/191      1865/22        1644/27
Japanese    20546/187      2670/26        2660/27
English     187753/4389    6554/247       8426/243
Table 2: Details of CallHome datasets

Whilst the lexicons provided for each of these four languages provide reasonable coverage for the words contained in the transcripts, Grapheme-to-Phoneme (G2P) rules were built to reduce the Out-of-Vocabulary rate to zero. Classification and Regression Trees (CART), using the Wagon-CART toolkit [10], were used to produce G2P rules for both Spanish and German. In the case of Mandarin and Japanese, the character-based orthographies were first converted to Romanised forms (Pinyin and Romaji) using the conversion tools available at [11] and [12] respectively. The subsequent derivation of letter-to-phone rules then proved to be a trivial exercise, with an almost one-to-one mapping from letter to sound.
Using the processed transcripts, two sets of models were produced. The first set of models, which are subsequently used as front-end recognisers in the LID system, are based on context-independent acoustic models, with 32 mixture components used to model the state-emission probability density functions. This model set was used to obtain the phone error rates provided in Table 3. It should be noted that the results presented in Table 3 were achieved after tuning the insertion rate on a separate development set.
The second set of models, which were used to obtain the Word Error Rates (WER) shown in Table 3, are based on decision-tree clustered, cross-word context-dependent phone HMMs. A bi-gram language model was trained for each language, using the appropriate training transcripts. To prevent problems with Out-of-Vocabulary (OOV) words, those words in the test set vocabulary which did not appear in the training data were assigned a small probability in the language model. The models used to obtain the WER shown in Table 3 are simply for informational purposes, and were not used further in the LID experiments outlined in this paper.
Parameterisation of speech was achieved via 12 static PLPs plus normalized energy, 1st and 2nd order derivatives, and a [...] was employed to reduce speaker and channel mismatch. Each phone is modelled via a three-state, left-to-right HMM with no skip transitions, except for the silence and pause models. An ergodic silence model is used, allowing transitions back to preceding states. The pause model is a "tee" model, which is tied to the centre state of the silence model. Additionally, an ergodic "laugh" model was created based on its frequency of occurrence across all languages. To cater for the wide array of speech noises and background noise, two additional left-to-right models were created for speech noise and background noise.

Model       Set Size    Phone Error Rate (%)    Word Error Rate (%)
Mandarin    42          61.9                    48.2
Spanish     31          52.9                    44.5
German      42          64.6                    39.2
Japanese    37          54.3                    42.6
English     36          63.9                    33.1
Table 3: Recognition Performance for CallHome models

5. Experiments
The LID results presented in this paper represent those obtained according to the NIST-1996 evaluation (1996-Test), the 1996 development set (1996-Dev) and the NIST-2003 evaluation data set (2003-Test). There are 12 different languages (Arabic, English, Farsi, French, German, Hindi, Japanese, Korean, Mandarin, Spanish, Tamil and Vietnamese), and 3 of them (English, Mandarin and Spanish) have a second dialect, thereby containing double the amount of training data of the others. Each evaluation has test utterances with durations of 3, 10 and 30 seconds.
Before outlining the experiments conducted, further details on the OGI acoustic models are required. Our previous PPRLM system, based on the use of OGI models, incorporated 6 languages. However, the development of CallHome models across multiple languages is still a work in progress. As such, models have only been completed for the 5 languages mentioned earlier. Accordingly, the Hindi language from OGI was excluded from the OGI system to ensure the results presented are comparable.
The same HMM state topology was used for both the CallHome and OGI acoustic models, although the number of mixture components used to model the state pdf for OGI was only 8. As mentioned, the availability of considerably more training data in CallHome allowed us to increase the mixtures to 32. Parameterisation for the OGI system mirrors that described for CallHome. Each phone recogniser produces a phone sequence for each of the 12 languages. The phonotactic information contained in the individual phone sequences is modelled via a backed-off bigram Language Model (LM), with duration information appended. In testing, these LMs are used independently to score the phone sequences of each recogniser, and the results are fused at the score level.
5.1. PPRLM LID Performance
The first experiment outlined is a comparison of overall PPRLM LID performance, across all 12 languages, using the two acoustic model sets. It was expected that the CallHome models would outperform those from OGI and, as can be seen from Table 4, this expectation is confirmed. The LID results presented are those obtained after fusion of scores from the individual classifiers. The terms unoptimised and optimised are used to distinguish the models as trained from those tuned to extract maximum phone recognition performance on a held-out development set. Details outlining the rationale for this experimentation are deferred until later in the section.
The CallHome models obtain superior LID results when compared to those obtained using the OGI-based front-end recognisers, across all evaluations and durations, with an average difference of 5.96%. However, the range of improvements varied. For example, on the 1996 test the average difference was 3.1%, whereas for the 2003 evaluation the difference was in excess of 9%. Based on this result alone, the utilisation of the CallHome corpus seems vindicated.
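The averages quoted above can be checked directly against the Table 4 figures. The short sketch below assumes that the differences are taken between the OGI and unoptimised CallHome columns, which appears to be the comparison that reproduces the quoted values; the dictionary layout and function name are hypothetical.

```python
# (task, duration in seconds) -> (OGI %, unoptimised CallHome %), from Table 4.
RESULTS = {
    ("1996 Dev", 30): (77.86, 84.39),
    ("1996 Dev", 10): (63.05, 68.34),
    ("1996 Dev", 3): (39.27, 43.95),
    ("1996 Test", 30): (85.25, 88.00),
    ("1996 Test", 10): (74.57, 78.43),
    ("1996 Test", 3): (54.22, 56.89),
    ("2003 Test", 30): (76.98, 86.35),
    ("2003 Test", 10): (62.60, 71.98),
    ("2003 Test", 3): (44.69, 53.85),
}


def mean_gain(task=None):
    """Mean absolute LID gain of CallHome over OGI, optionally for one task."""
    gains = [callhome - ogi
             for (name, _), (ogi, callhome) in RESULTS.items()
             if task is None or name == task]
    return sum(gains) / len(gains)


print("all tasks: %.2f" % mean_gain())
print("1996 Test: %.2f" % mean_gain("1996 Test"))
print("2003 Test: %.2f" % mean_gain("2003 Test"))
```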
5.2. Investigating the Relationship Between PER and LID Performance
In previous versions of the PPRLM implementation at QUT, no attempt has been made to optimise individual recogniser performance, for a number of reasons. A lack of suitable transcriptions for determining PER is one reason. More importantly, it was uncertain whether tuning a recogniser to increase performance on one language may bias the resultant phone stream to reflect the phonotactics of the language on which it was tuned, and subsequently degrade the accuracy on other languages. Conversely, it is also possible that inaccuracies which result from an "untuned" system manifest themselves globally across all languages, in turn reducing the information content of the phonetic stream.
Given this, investigations were conducted to evaluate the effect of "tuning" the recogniser on both PER and LID. Front-end tuning was done for each of the 5 languages outlined earlier. Each of the recognisers was tuned to maximise recognition performance on its base language, by adjusting the insertion penalty. These tuned models are referred to as the optimised CallHome models. The OGI models were also tuned to improve their performance on CallHome; however, the level of performance still lagged that achieved with the untuned CallHome models and so results are not shown.
Table 5 includes PERs for each of the 5 front-end languages when tested on its own language. Results are included for OGI as well as the unoptimised and optimised versions of CallHome. This is also contrasted against the global LID performance obtained using each of the individual PRLM systems.
To illustrate, using a Spanish OGI PRLM front end, the PER when used to decode CallHome Spanish is 70.22%, whilst the LID rate across all languages using the Spanish PRLM system is 67.57%. In contrast, when unoptimised models are used to decode the Spanish CallHome transcripts, a PER of 58.1% was achieved whilst LID was 77.33%. The LID results shown are for the 30 second task in the 1996 development set. The selection of the 1996 evaluation was based on the fact that the results achieved in this evaluation align more closely with the overall averages, as shown previously in Table 4.
It can be observed that the change in PER when progressing from OGI to CallHome without tuning is quite large, ranging from 10.3% to 17.5% absolute. Quite alarmingly, the phone recognition rate, even with the CallHome models, is quite poor. The tuning process serves to make the PER more respectable, producing further improvements ranging from 4.82% to just over 9%.
The progression of improvement in LID provides some interesting observations. When progressing from OGI to unoptimised CallHome, the LID rate obtained coincides reasonably well with the changes in PER. For example, an absolute change in PER of 12.12% for the Spanish language produces a corresponding change of 9.76% in LID rate. Similarly, a change in German PER of 15.68% leads to an improvement in the LID rate of 7.32%. In all cases the decrease in PER led to an increase in LID, although for Japanese the improvement was less substantial. Further improvement in PER was achieved via tuning, ranging from 4.82% for English to just over 9% for Japanese. Unfortunately, these did not necessarily coincide with LID performance, producing a decrease for all PRLM systems except Japanese.

NIST Test Task    Duration (s)    # Utt    OGI (%)    CallHome Unoptimised (%)    CallHome Optimised (%)
1996 Dev. Test    30              1147     77.86      84.39                       84.74
1996 Dev. Test    10              1172     63.05      68.34                       68.00
1996 Dev. Test    3               1174     39.27      43.95                       44.38
1996 Test         30              1492     85.25      88.00                       89.81
1996 Test         10              1502     74.57      78.43                       78.03
1996 Test         3               1503     54.22      56.89                       56.95
2003 Test         30              960      76.98      86.35                       86.56
2003 Test         10              960      62.60      71.98                       73.54
2003 Test         3               960      44.69      53.85                       53.96
Table 4: PPRLM LID Results (% accuracy) using different acoustic model sets including recognition optimisation.

Language    PER using OGI models (%)
Spanish     70.22
Japanese    75.06
English     86.22
Mandarin    77.43
German      88.04
Table 5: PER for CallHome and LID results (% accuracy) for each of the PRLM sub-systems using the 30 second 1996 NIST Development set.

This indicates that tuning the individual recognisers for each PRLM system, to improve performance on its base language, has a detrimental effect on overall LID. This reinforces the idea that whilst tuning may improve the recognition performance for the base language, it degrades the global phone recognition by imposing its own phonotactic constraints. However, without observing the actual PERs for each language alongside the achieved identification rate, this idea is still speculative. Accordingly, additional experiments examining how improving PERs for a particular language correlates with identification rates for the same language, rather than global LID rates, were conducted.
To do this, a Spanish decoder was used to decode all of the languages for which transcripts were available. As highlighted earlier, in order to obtain PER values when decoding speech from other languages, a mapping between the phonemic inventories of each language is required. There are several possible mapping techniques, each with its own benefits. We trialled knowledge-driven mapping, where phones are mapped according to linguistic similarity, as well as confusion-based, data-driven mapping. A separate subset of the CallHome development data, 45 minutes long, was used to derive the confusion-based mappings. Confusion-based mappings generally produce better results when derived in the same domain. This was the case for the CallHome models; however, due to channel differences, the best results for the OGI models were obtained using a knowledge-based mapping. Accordingly, the results shown in Table 6 are based on using the most suitable mapping technique for each of the corpora.
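The idea behind a confusion-based, data-driven mapping can be sketched as follows. This is only a minimal illustration, assuming that aligned pairs of (decoder phone, reference phone) have already been obtained from held-out data; the function name, the pair format and the toy phone symbols are hypothetical and are not taken from the experiments reported here.

```python
from collections import Counter, defaultdict


def derive_confusion_mapping(aligned_pairs):
    """Map each decoder phone to the reference phone it is aligned with most often.

    aligned_pairs: iterable of (decoder_phone, reference_phone) tuples, assumed
    to come from aligning decoder output against held-out transcripts.
    """
    counts = defaultdict(Counter)
    for decoder_phone, reference_phone in aligned_pairs:
        counts[decoder_phone][reference_phone] += 1
    # Choose the most frequent partner; ties are broken alphabetically
    # so that the mapping is deterministic.
    return {
        src: min(partners.items(), key=lambda kv: (-kv[1], kv[0]))[0]
        for src, partners in counts.items()
    }


# Toy alignment data (hypothetical symbols).
pairs = [("a", "a"), ("a", "a"), ("a", "e"), ("rr", "r"), ("rr", "l"), ("rr", "r")]
mapping = derive_confusion_mapping(pairs)
print(mapping)
```

A mapping derived this way depends entirely on the data used for the alignment, which is consistent with the observation above that such mappings work best when derived in the same domain.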
The mapping across languages proved to be a tedious and time-consuming task. As mentioned earlier, this is one of the disadvantages of using PER as a guide for gauging LID performance. Accordingly, we also used the Phone Alignment Cost (PAC) to gauge the amount of information in the phone stream produced by the Spanish recogniser for each of the languages, including Spanish. The use of this tool is much simpler, as the phonemic inventory only needs to be included in the set of linguistic classes once. The PAC scores are also provided in Table 6. The PAC score represents the average cost of aligning each phone in the reference transcript with that hypothesised by the front end decoder. It should also be highlighted that a decrease in PAC score corresponds to a better alignment, and relative decreases should ideally lead to an increase in LID performance.
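To make the notion of an average alignment cost concrete, the sketch below computes a weighted edit-distance alignment in which substituting linguistically similar phones costs less than substituting dissimilar ones, and divides the total cost by the number of reference phones. It is an illustration under stated assumptions only: the Jaccard-style substitution cost, the insertion/deletion cost of 1.0 and the toy class table are placeholders, not the cost function or linguistic classes used to produce the PAC scores in Table 6, so the scale of the output is not comparable with those scores.

```python
def phone_alignment_cost(reference, hypothesis, classes, ins_del_cost=1.0):
    """Average weighted alignment cost per reference phone (hypothetical costs)."""

    def sub_cost(ref_phone, hyp_phone):
        if ref_phone == hyp_phone:
            return 0.0
        ref_cls = classes.get(ref_phone, frozenset())
        hyp_cls = classes.get(hyp_phone, frozenset())
        union = ref_cls | hyp_cls
        if not union:
            return 1.0
        # Fewer shared linguistic classes means a higher cost (Jaccard distance).
        return 1.0 - len(ref_cls & hyp_cls) / len(union)

    n, m = len(reference), len(hypothesis)
    if n == 0:
        raise ValueError("reference must contain at least one phone")
    # dp[i][j] = minimum cost of aligning reference[:i] with hypothesis[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * ins_del_cost
    for j in range(1, m + 1):
        dp[0][j] = j * ins_del_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(
                dp[i - 1][j] + ins_del_cost,  # reference phone deleted
                dp[i][j - 1] + ins_del_cost,  # hypothesis phone inserted
                dp[i - 1][j - 1] + sub_cost(reference[i - 1], hypothesis[j - 1]),
            )
    return dp[n][m] / n


# Toy class table (assumed for illustration only).
classes = {
    "p": frozenset({"consonant", "stop", "bilabial", "unvoiced"}),
    "b": frozenset({"consonant", "stop", "bilabial", "voiced"}),
    "a": frozenset({"vowel", "open"}),
}
near = phone_alignment_cost(["p", "a"], ["b", "a"], classes)  # p/b differ only in voicing
far = phone_alignment_cost(["p", "a"], ["a", "a"], classes)   # p/a share no class
print(near, far)
```

In this toy case the confusion of "p" with "b" is penalised less than the confusion of "p" with "a", whereas PER would count both as one substitution error.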
Table 6 also includes LID rates. However, in contrast to previous tables, the LID rate shown represents the identification rate for that particular language, and NOT the global LID rate. As such, the languages in the first column of Table 6 represent the languages processed by the Spanish PRLM system. For example, using the Spanish decoder and OGI models, the Japanese language was correctly identified 58.97% of the time.
One glaring result to emerge from Table 6 is how poor the phone recognition accuracy is for languages other than Spanish. Thus, it is quite surprising that the LID rate is so high. Further examination of Table 6 reveals that progressing from OGI models to CallHome, and subsequently to the optimised models, still produces significant improvements in PER for the other languages. As before, the progression from OGI to CallHome seems to produce a similar correspondence between PER and LID. This is also reflected in the PAC costs. For example, for
Language     OGI                      CallHome Unoptimised     CallHome Optimised
             PER%   PAC    LID%       PER%   PAC    LID%       PER%   PAC    LID%
Spanish      70.30  7.09   74.83      58.10  5.45   83.44      52.90  5.34   84.11
Japanese     86.40  7.86   58.97      74.65  6.79   60.26      63.05  6.58   57.69
Mandarin     83.90  8.44   71.61      76.23  7.99   78.06      69.45  7.99   78.71
German       85.51  8.14   60.76      78.21  7.28   81.01      69.67  7.20   77.22
English      87.70  8.01   81.13      83.37  7.23   84.91      73.26  7.09   87.42
Table 6: Individual target language PER/PAC and LID results (% of accuracy) achieved using Spanish Front End.
[Plot not recoverable. Horizontal axis: Evaluation Metric; vertical axis range: -15 to 160; series: CallHome-Unoptimised, CallHome-Optimised.]
Figure 2: Relative Improvements in PRA/PAC and LID Rates in relationship to OGI baseline
Spanish on Spanish, the PER, PAC and LID improvements are 12.12%, 1.64% and 16.73% respectively. This trend is repeated across all languages, although the scale of improvements is variable.
In the previous section, it was found that tuning each of the front end recognisers did not necessarily provide a boost to the overall LID performance of each PRLM system. We suggested that one explanation for this was that tuning may adversely affect PER for the other target languages. However, the results in Table 6 indicate that PERs improve for all languages evaluated when the Spanish PRLM system was tuned to improve performance on Spanish. This suggests that the subsequent decrease in LID cannot be attributed to tuning having a detrimental impact on the accuracy of the phone stream of other languages. Of course, it may be that the tuning impacts on languages outside those for which PER rates were obtained, such as Vietnamese. This may be why the global LID rate decreased after tuning, as shown in Table 5. Unfortunately, this is impossible to evaluate without suitable transcriptions. However, the fact that improvements to PER for each of the languages in Table 6 do not always result in improvements in LID for those languages indicates that the usefulness of PER as a metric for gauging eventual LID performance is limited.
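The comparison made in this paragraph can be checked directly from the CallHome columns of Table 6. The short sketch below computes, for each language, the absolute change in PER and in per-language LID rate when moving from the unoptimised to the optimised CallHome models; the dictionary layout and function name are illustrative choices, while the numbers are copied from Table 6.

```python
# (PER %, LID %) from Table 6, Spanish front end.
table6 = {
    "Spanish":  {"unopt": (58.10, 83.44), "opt": (52.90, 84.11)},
    "Japanese": {"unopt": (74.65, 60.26), "opt": (63.05, 57.69)},
    "Mandarin": {"unopt": (76.23, 78.06), "opt": (69.45, 78.71)},
    "German":   {"unopt": (78.21, 81.01), "opt": (69.67, 77.22)},
    "English":  {"unopt": (83.37, 84.91), "opt": (73.26, 87.42)},
}


def tuning_effect(results):
    """Return per-language (PER change, LID change) in absolute percentage points."""
    effects = {}
    for lang, r in results.items():
        per_change = round(r["opt"][0] - r["unopt"][0], 2)
        lid_change = round(r["opt"][1] - r["unopt"][1], 2)
        effects[lang] = (per_change, lid_change)
    return effects


effects = tuning_effect(table6)
for lang, (per_change, lid_change) in effects.items():
    # "agree" is True when a drop in PER coincides with a rise in LID rate.
    agree = (per_change < 0) == (lid_change > 0)
    print(f"{lang:9s} PER {per_change:+6.2f}  LID {lid_change:+6.2f}  agree={agree}")
```

With the Table 6 values, PER falls for all five languages after tuning, while the per-language LID rate falls for Japanese and German.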
5.3. PAC versus PER
In an attempt to rationalise why PER does not correspond to LID rates, we examined the trends in PAC scores. As with PER, progression to optimised models leads to improvements in PAC scores, but not LID. However, the different scales used by PAC, PER and LID make it difficult to infer relationships between metrics. Accordingly, we examine the relative changes in metrics to gain further insight. When examined from this perspective, the relative changes in PAC scores are much smaller than their PER counterparts.
To illustrate, consider Figure 2, which plots the relative performance improvements for each of the metrics using the original performance of the OGI system as the starting point. Thus, the improvements plotted are for CallHome and optimised CallHome, relative to the original performance of the OGI system. Note that because LID is expressed in terms of correct predictions, phone recognition performance is calculated in terms of accuracy (PRA), to preserve the relative directions of improvement. Thus relative improvements for PRA, PAC and LID rates are plotted for each language based on the results shown in Table 6.

This plot highlights a number of things. The first is that PRA (or its counterpart PER) is inadequate for gauging the influence of changes to the decoder on eventual LID performance. In contrast, the magnitude of the impact which changes to the front end recogniser impart on LID performance is better reflected by the PAC metric. For example, the change from the OGI model set to the CallHome model set, and subsequently to the optimised CallHome model set, produced relatively large improvements in PRA. In German, an initial improvement of around 50%, followed by an improvement to more than 110% better than OGI, was obtained by progressing through the unoptimised and optimised model sets. This represents a 60% relative change which results from the optimisation. However, the LID performance only improved by about 35% and then actually dropped off to 30% relative to OGI, which means performance went back by 5%. Thus the margins of change are relatively large for PER, but not LID. In contrast, the changes for German using the PAC metric are around 1%: still different, but at least more indicative. Thus the PAC measure seems to at least be better suited for gaining an overall indication of potential improvements, if not the actual direction. One explanation for these smaller differences between PAC and LID rate may be that the PAC is calculated on CallHome, whereas the LID rate is based on CallFriend.
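The relative improvements discussed for German can be reproduced from the German row of Table 6. The sketch below is a worked calculation rather than the procedure used to draw Figure 2: it assumes PRA is simply 100 minus PER, and it treats PAC as a lower-is-better quantity so that all three metrics improve in the same direction. The function name and variable layout are illustrative.

```python
def relative_improvement(baseline, new, lower_is_better=False):
    """Relative improvement over the baseline, in percent."""
    change = (baseline - new) if lower_is_better else (new - baseline)
    return 100.0 * change / baseline


# German row of Table 6 (Spanish front end):
# OGI, CallHome unoptimised, CallHome optimised.
per = (85.51, 78.21, 69.67)
pac = (8.14, 7.28, 7.20)
lid = (60.76, 81.01, 77.22)

# Assumption: phone recognition accuracy is taken as 100 - PER.
pra = tuple(100.0 - p for p in per)

results = {}
for name, values, lower in (("PRA", pra, False), ("PAC", pac, True), ("LID", lid, False)):
    unopt = relative_improvement(values[0], values[1], lower)
    opt = relative_improvement(values[0], values[2], lower)
    results[name] = (unopt, opt)
    print(f"{name}: unoptimised {unopt:.1f}%, optimised {opt:.1f}%")
```

Under these assumptions the German PRA improves by roughly 50% and then 109% relative to OGI, the LID rate by roughly 33% and then 27%, and the PAC by roughly 10.6% and then 11.5%, which is close to the approximate values quoted in the discussion.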
Certainly the results shown can only be used as a guide. However, with further refinement we believe this measure is more suitable for estimating the potential impact of changes to the front end on subsequent LID performance.
6. Conclusion
In this paper a more detailed examination of the relationship between phone recognition, phonotactic information and LID rates was undertaken. The utility of PER as a proxy for phonotactic information was discussed and examined. It was highlighted that the use of OGI models in PPRLM systems produces phone streams that contain significant errors. Models produced using CallHome transcriptions were subsequently compared to OGI models, producing more accurate transcriptions and, importantly, an average absolute improvement in LID of approximately 6% across the 30, 10 and 3 second tasks for the NIST 1996 and 2003 evaluations.
In addition, a new metric, the Phone Alignment Cost (PAC), was proposed. This metric is based on the principle that errors in the phone stream are not all equally significant, and that useful information is still present in phone streams corrupted by phone recognition errors. This method overcomes many of the shortcomings that PER exhibits, including difficulty of use when evaluating front end performance across multiple languages, as well as an inability to grade the significance of errors according to linguistic similarity. Comparisons between PER and PAC were conducted and contrasted with LID performance. Whilst the PAC technique still requires refinement, early indications are that relative changes in PAC scores are more closely aligned to LID performance than changes in PER. Thus the PAC metric is better suited to estimating the impact of changes to the front end on subsequent LID performance and should be useful for extracting improved performance from PPRLM systems.
7. References
[1] M. A. Zissman and E. Singer, "Automatic Language Identification of Telephone Speech Messages using Phoneme Recognition and N-Gram Modelling," in International Conference on Acoustics, Speech and Signal Processing, 1994, vol. 1, pp. 305-308.
[2] LDC, "Linguistic Data Consortium," http://www.ldc.upenn.edu/, 2004.
[3] Y. K. Muthusamy, R. A. Cole, and B. T. Oshika, "The OGI Multi-Language Telephone Speech Corpus," in International Conference on Spoken Language Processing, 1992, vol. 2, pp. 895-898.
[4] P. Fung, W. Byrne, Z. F. Thomas, T. Kamm, L. Yi, S. Zhanjiang, V. Venkataramani, and U. Ruhi, "Pronunciation modeling of Mandarin casual speech," Johns Hopkins Summer Workshop, 2000.
[5] T. Schultz and A. Waibel, "Polyphone decision tree specialization for language adaptation," in Proc. of ICASSP, Istanbul, 2000.
[6] C. Nieuwoudt and E. C. Botha, "Cross-language use of acoustic information for automatic speech recognition," Speech Commun., vol. 38, no. 1, pp. 101-113, 2002.
[7] J. Kohler, "Language adaptation of multilingual phone models for vocabulary independent speech recognition tasks," in Proc. ICASSP, Washington, U.S.A, 1998, vol. 1, pp. 417-420.
[8] "Flexible Alignment Toolkit," http://www.clsp.jhu.edu/ws2000/groups/mcs/Tools/README.html, 2000.
[9] G. Zavaliagkos and T. Colthurst, "Utilizing untranscribed training data to improve performance," in Proc. broadcast news transcription and understanding workshop, 1998, pp. 301-307.
[10] "WAGON - Classification and Regression Tree (CART) toolkit," http://www.ims.uni-stuttgart.de/phonetik/synthesis/festival/festdoc-1.4.0.1/speechtoois/x3475.htm.
[11] "Hanzi to Pinyin (C2T) conversion tool," http://www.ibiblio.org/pub/packages/ccic/software/unix/convert/.
[12] "KAKASI, Kanji-Kana-to-Romaji conversion toolkit," http://kakasi.namazu.org/.