Towards Improved Assessment of Phonotactic Information for Automatic Language Identification


Terrence Martin, Eddie Wong, Sridha Sridharan

Speech and Audio Research Laboratory, Queensland University of Technology, GPO Box 2434, 2 George St, Brisbane, Australia, QLD 4001.

tl.martin, ee.wong, [email protected]

Abstract

Phonotactic modelling, typically in the form of a PPRLM system, forms a key component in state-of-the-art Language Identification (LID) systems. Given the objective of PPRLM systems is to capture as accurately as possible the phonotactics which characterise a language, it is assumed that the minimisation of Phone Error Rate (PER) is a precursor to achieving this effectively. In this paper we examine the relevance of PER as a metric for determining eventual LID performance. In order to conduct this investigation we make use of the CallHome corpus, based on the premise that it provides a better representation for the style of discourse and channel conditions encountered in Conversational Telephone Speech (CTS), which is now the focus of current NIST LID evaluations. Using CallHome instead of the OGI-MLTS corpus to train phone recognisers, we obtained significantly improved results, with an average improvement of approximately 6% absolute across the 30, 10 and 3 second tasks for the NIST 1996 and 2003 evaluations. We also examine the impact of tuning the individual front-end recognisers, on both the resultant PER of other languages and against the resultant LID performance. We find that PER has a number of limitations in indicating both the degree and direction of changes to LID performance. Accordingly, we propose a new metric which is better suited for forecasting the impact on LID performance when the phone recogniser front-end is modified.

1. Introduction

Phonotactic modelling, typically in the form of a PPRLM system [1], forms a key component in state-of-the-art Language Identification (LID) systems. The objective of PPRLM systems is to capture as accurately as possible the phonotactics which characterise a language. However, the number of available metrics which reflect how well, and how consistently, this information is captured is limited. The availability of reliable metrics is important for evaluation purposes but also for gaining insight into what information provides the most important contribution to the LID task. Whilst LID rate is ultimately the most relevant metric, the PER of the front-end recogniser can also be used as a proxy for how well this information is captured and offers the additional benefit of being more easily obtained.

The utility of this benefit is best illustrated in PPRLM system optimisation. If LID rate is to be used to examine the impact of any changes to the system, such as the front-end recognisers, both the training and test data must be decoded. Following this, n-gram models need to be trained and then tested. Given 12 languages, this can be extremely time consuming, especially for the CallFriend corpus [2]. Alternatively, the PER of the front-end can be obtained reliably from a much smaller set of data, in a much smaller time frame.

However, little research has been published which examines the relationship between PER and LID, and the PERs which are quoted are of limited relevance. To illustrate, most quoted PERs have been based on performance for the OGI-MLTS corpus [3], despite the fact that LID evaluations are conducted on the significantly more difficult recognition task of CallFriend. Accordingly, no meaningful relationship can be, or at least should be, inferred between the error rates quoted and subsequent LID performance.

In order to provide a better means of evaluating phone recognition performance we use the CallHome corpus [2], which contains speech recorded under essentially the same conditions as that contained in CallFriend, and contains both transcriptions and lexical resources. The use of this corpus provides two benefits. The first is a means to evaluate the performance of the front-end recognisers in a task which more closely reflects the expected style of discourse and channel conditions. The second is that acoustic models can also be built, which were expected to provide more reliable front-end decoders in the PPRLM system.

As an alternative to PER, we propose the use of a new metric for phonotactic information based on a Phone Alignment Cost (PAC). This technique stems from the idea that phone recognition errors are not all the same. For example, is an error where /p/ is replaced by /b/ on a consistent basis more desirable than, say, /p/ being replaced by a more erroneous representative such as /s/? The fact that PERs for ASR systems are quite high, yet PPRLM systems perform quite admirably, highlights that useful information is still present in these error-ridden phone streams. Based on this concept, the PAC incorporates a linguistically intuitive hierarchy for establishing a cost for the various types of phone recognition errors. The cumulative cost can then be used to gauge the effect of any changes. In contrast, PER penalises any errors absolutely; they are either right or wrong, despite the fact that meaningful information may still be available.

The contents of this paper are as follows. In Section 2, a brief description of the PPRLM system is provided. A discussion of the relative merits of using PER and PAC for extracting available phonotactic information is then provided in Section 3. Details regarding the development of the baseline front-end recognisers are then outlined in Section 4. The results of experimentation which examined the relationship between PER, PAC and LID accuracy are then presented in Section 5. Conclusions are subsequently drawn in Section 6.

2. PPRLM System Description

"Parallel Phone Recognition followed by Language Modelling" (PPRLM) [1] comprises a bank of identical "Phone Recognition followed by Language Modelling" (PRLM) sub-systems running in parallel, as depicted in Figure 1. Each of the sub-systems performs the same LID function; however, the front-end phone recognisers are trained individually with speech data from different languages. As the name implies, this system works by first decoding the speech data into a phone stream. Likelihood scores are then obtained by comparing the phone stream to n-gram Language Models (bi-gram LMs are employed in this paper). In an attempt to enrich the phonetic description, a duration tag of "-Short" or "-Long" is appended to each phone label when its corresponding duration is shorter or longer than the average duration for that phone. One of the more important features of PPRLM is that it does not require transcribed speech data for the languages which are targeted for identification. Instead, their phonotactics are described in terms of the phonemic inventory of the front-end phone recogniser's language. In essence, the front-end phone recognisers are employed to decode the speech data of all the different languages.

Figure 1: Block Diagram of PPRLM System.
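The scoring performed by a single PRLM sub-system can be sketched as follows. This is a minimal illustration under stated assumptions, not the system used in this paper: the add-one smoothed bigram model is a simple stand-in for the language models described, and the names `tag_durations`, `BigramLM` and `identify`, the `<s>` start symbol and the toy data are all introduced here for illustration. A phone whose duration exactly equals its average is arbitrarily tagged "-Long".

```python
import math
from collections import Counter


def tag_durations(phones, mean_duration):
    """Append -Short / -Long to each (label, duration) pair.

    mean_duration maps a phone label to its average duration (assumed known).
    """
    tagged = []
    for label, duration in phones:
        suffix = "-Short" if duration < mean_duration[label] else "-Long"
        tagged.append(label + suffix)
    return tagged


class BigramLM:
    """Bigram model with add-k smoothing (an assumption made for brevity)."""

    def __init__(self, sequences, vocab, k=1.0):
        self.k = k
        self.vocab = set(vocab)
        self.bigrams = Counter()
        self.contexts = Counter()
        for seq in sequences:
            prev = "<s>"  # hypothetical sentence-start symbol
            for sym in seq:
                self.bigrams[(prev, sym)] += 1
                self.contexts[prev] += 1
                prev = sym

    def log_likelihood(self, seq):
        total, prev = 0.0, "<s>"
        for sym in seq:
            num = self.bigrams[(prev, sym)] + self.k
            den = self.contexts[prev] + self.k * len(self.vocab)
            total += math.log(num / den)
            prev = sym
        return total


def identify(seq, models):
    """Score a tagged phone stream against every language model."""
    scores = {lang: lm.log_likelihood(seq) for lang, lm in models.items()}
    return max(scores, key=scores.get), scores
```

In a full PPRLM system one such set of language models would exist per front-end recogniser, with the resulting scores combined across recognisers.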

3. Phonotactic Information Metrics

Phonotactics can be defined as the frequency and possible order of occurrence of a sequence of phonetic events. It has proven to be an effective information source for accurate identification of languages. The PPRLM LID system outlined in this paper is based on extracting phonotactic information by decoding the speech data with a front-end phone recogniser. However, phone recognition systems produce a significant number of errors, typically in the range of 40-60% PER, and accordingly corrupt the phonotactic information contained in the original utterance. Despite these errors, PPRLM systems perform quite well, although system performance does degrade when the length of the test utterance is decreased.

Given the amount of inaccuracy in the phone stream, it is somewhat surprising that the level of LID performance achieved is so high. This raises the question of whether the meaningful information extracted is in fact phonotactic, or simply a result of the efficiency with which the modelling system is able to exploit pattern differences across different languages. However, for the remainder of the discussion in this paper, it is assumed that this meaningful information is phonotactic.

The limited understanding of how information contained in the phone stream is used in turn means that the impact of changes to the system can only be evaluated empirically. This can be a time consuming process and highlights the potential benefits of a suitable metric, capable of representing available phonotactic information for LID. In addition, it is likely that a suitable tool will provide qualitative benefits in understanding language characteristics. Accordingly we discuss and contrast the attributes of two metrics in this section. The first is the PER, which represents the most commonly used approach. Second, we suggest an alternative metric, based on an alignment technique originally proposed by [4] for use in pronunciation modelling. We have adapted this technique for our purposes to provide a better alternative than PER for gauging the amount of available phonotactic information.

3.1. Phone Error Rate

Most PPRLM systems utilise acoustic models trained from OGI. The main reason for this is that the time frame required to produce models using the OGI data is relatively short; the transcriptions contain phone-based alignments and model training from that point is relatively straightforward. However, whilst the models work effectively in PPRLM systems, little credence should be paid to reported PERs on OGI evaluations, given the eventual recognition task is decoding speech from CallFriend. It is expected that the use of OGI models for decoding CallFriend will produce less than optimal PERs, and in turn degrade the amount of information available in the hypothesised phonetic stream.

Aside from errors which result because of a mismatch between tasks, consideration also needs to be given to the fact that language differences also degrade decoder accuracy. It is well known in cross-lingual and multilingual studies [5], [6], [7] that using acoustic models from one language to decode speech from another leads to degraded performance. Given the already high error rate, the level of accuracy obtained by front-end decoders on other languages is likely to be very poor indeed. Despite these sources of degradation, the success of LID systems based on PPRLM illustrates that useful information still exists in these inaccurate phone streams. Our interest lies in deciphering the extent and usefulness of the information contained in these phone streams, and identifying the relationship between PER and LID was considered an important step to achieving this goal.

However, determining the information contained in these phone streams is problematic, especially if PER is to be used as the metric. As mentioned, in order to determine PERs, a suitable set of reference transcriptions is required. If the relationship between PER and LID is to be valid, it is also preferable that the evaluation in which PER is obtained mirrors that of the eventual LID task. This is obviously not the case for OGI, and accordingly a more suitable corpus was sought.


The transcriptions available in CallHome were considered a more appropriate representation for the speech which occurs in CallFriend, and as such, more suitable for examining the relationship between PER and LID. In addition, CallHome has transcriptions for a number of languages, making it possible to evaluate the phone recognition performance of each front-end recogniser on other languages.

However, even with a suitable evaluation set, obtaining PER across multiple languages is problematic, as differences between phonemic inventories exist. In order to obtain PER, the reference transcription needs to be compared with that produced by the phone recogniser. Slight differences in the articulatory realisation of the same sound mean that they are quite often labelled differently across languages. For instance, in English the "d" in dog is labelled in Worldbet as "d", whereas in Japanese the closest approximate is labelled "d{". Accordingly some form of mapping is required so that an equitable comparison can be made. Complicating this problem is that some languages have only one label for a particular sound, such as Spanish where the vowels are pure, whereas in English there are many phonemic variants of the same basic sound. Thus, the mapping process can become quite involved, requiring knowledge of the various properties of sounds across many languages. Of course, data-driven mappings can be derived, although this has a shortcoming when there are differences in channel conditions between the corpora the models are trained on and the development set used to derive rules.

Regardless of which mapping technique is used, the use of PER has another limitation. We assert that when it comes to modelling the phonotactics which characterise a language, the degree to which recognition errors corrupt the phonotactic information is not equitable. For instance, if the phone /p/ is interchangeably recognised as either /b/ or /p/, and rarely as anything else, then it is likely that usable n-gram statistics are gathered. In contrast, if inappropriate modelling or mismatch between the train and test domain leads to inconsistencies in decoding /b/, then less information is probably derived. Given this, intuitively the cost of an error can vary according to linguistic similarity. Unfortunately, PER is absolute: either the recognised phone is correct, or it is not. In Section 5, experimental results highlight how ineffective PER is as an indicator for predicting LID performance. Based on this assertion, the next section outlines an alternative metric which seeks to overcome these deficiencies.
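The absolute nature of PER can be illustrated with a short sketch. It assumes the conventional definition of PER as the minimum number of substitutions, insertions and deletions divided by the number of reference phones; the function names and the optional `mapping` argument (a hypothetical label-to-label dictionary standing in for the cross-language mapping discussed above) are introduced here for illustration only.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, insertions and deletions (unit costs)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference phone
                curr[j - 1] + 1,         # insertion of a hypothesised phone
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def phone_error_rate(ref, hyp, mapping=None):
    """PER in percent; hypothesised labels are optionally mapped first."""
    if not ref:
        raise ValueError("reference transcription must not be empty")
    if mapping:
        hyp = [mapping.get(p, p) for p in hyp]
    return 100.0 * edit_distance(ref, hyp) / len(ref)
```

Under this definition, hypothesising /b/ for /p/ is penalised exactly as heavily as hypothesising /s/ for /p/, which is the limitation discussed above.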

3.2. Phone Alignment Cost

In order to establish the cost of each recognition error made by the front-end decoder, we adopted an approach first introduced in [4]. The focus of this work was improving the modelling of pronunciation variation for Mandarin. A key aspect of pronunciation modelling is obtaining reliable estimates for the frequency of alternate pronunciations which differ from the lexical representation and cause transcription errors. One method for establishing the frequency of these pronunciation variations is to decode a transcription and subsequently compare it with the reference. This requires that the reference and hypothesised transcriptions are aligned. Generally this alignment is achieved via dynamic programming using a simple edit distance as a cost function. Unfortunately, simple edit cost functions provide inadequate alignments, and in turn impact on the reliability of derived pronunciation rules. To combat this, Fung introduced a flexible alignment tool which incorporates a hierarchy of costs for inter-symbol alignments. This tool is available at [8]. A by-product of this alignment process is the cost of aligning the utterance.

The relevance of this alignment cost is that the cost assigned to each inter-symbol alignment is related to how linguistically similar the symbols are. The general idea is that aligning /d/ with /t/ is less expensive than aligning it with, say, /s/. At a cruder level, the cost of aligning /i/ with any other vowel is less than aligning it with a consonant. From a global perspective, if a Spanish recogniser is inappropriately trained, and produces inconsistent transcriptions for, say, the Japanese language, the process of alignment becomes more difficult and hence this will be reflected in the overall cost of aligning the reference utterance with the hypothesised. Ideally the cost function should allow for a graceful degradation in recogniser performance, by incorporating a hierarchical structure based on linguistic similarity. Additionally, it should incorporate a means of comparing phonemes from different languages which are similar in articulatory realisation, but annotated differently. Based on this idea, the average cost per symbol alignment should, in principle, provide a better guide to the amount of information which is preserved when changes are made to front-end recognisers. To obtain alignment costs we adapted the toolkit introduced by Fung, to enable it to cater for the phoneme inventories of multiple languages, and expanded the class hierarchy. A formalisation of the qualitative explanation given in [4] is as follows.

The cost of aligning phones from the reference and hypothesised transcriptions is denoted $C(b, s)$, where $b$ denotes the reference phone and $s$ the hypothesised phone. Let $X = (X_1, X_2, \ldots, X_t)$ represent the total set of articulatory classes designed to provide coverage for the phonetic inventory of both the source and target languages. The classes used are arranged in a hierarchical manner, similar to the question set used in training context-dependent models, so that classes range from broad categories, such as whether the phone is a vowel or consonant, through to exact descriptions of articulation. A subset of $X$ exists which is defined by:

$$X_{BS} = B \cup S \qquad (1)$$

where $B = (b_1, b_2, \ldots, b_n)$ defines the set of $n$ classes in which $b$ exists and similarly $S = (s_1, s_2, \ldots, s_m)$ defines the subset of $m$ classes in which $s$ exists. Using the cardinality operator to reflect the number of distinct elements, the cost of aligning the phone pairing is given by:

$$C(b, s) = |B \cup S| - |B \cap S| \qquad (2)$$

Essentially this equates to incrementing the cost each time the phones $b$ and $s$ do not co-exist in each of the classes contained in $X_{BS}$.
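As a small worked illustration of this substitution cost, the sketch below counts the classes in which only one of the two phones occurs, i.e. the size of the union of their class sets minus the size of the intersection. The class inventory `CLASSES` is a toy assumption made purely for illustration; it is not the hierarchy used in this work or in the toolkit of [8].

```python
# Hypothetical articulatory class memberships for four phones.
CLASSES = {
    "p": {"consonant", "stop", "bilabial", "unvoiced"},
    "b": {"consonant", "stop", "bilabial", "voiced"},
    "s": {"consonant", "fricative", "alveolar", "unvoiced"},
    "i": {"vowel", "high", "front"},
}


def substitution_cost(b, s, classes=CLASSES):
    """Number of classes containing exactly one of the phones b and s."""
    B, S = classes[b], classes[s]
    return len(B | S) - len(B & S)


costs = {hyp: substitution_cost("p", hyp) for hyp in CLASSES}
# With this toy inventory: /p/ vs /p/ costs 0, /b/ costs 2, /s/ costs 4, /i/ costs 7.
```

With these assumed classes, confusing /p/ with /b/ is cheaper than confusing it with /s/, which in turn is cheaper than confusing it with a vowel, mirroring the graded penalties the metric is intended to provide.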

The cost outlined above represents that associated with substitutions. However, in many cases the lengths of the reference and hypothesised transcriptions differ. Accordingly a separate set of rules is also necessary to define costs associated with insertions and deletions. The rules governing insertions and deletions were cruder than those used for substitutions. We expanded and adjusted the rule set originally made available by [8], which was designed to cover the set of Mandarin sounds. This expansion was required to cover the phonemic inventory across English, Spanish, German, Japanese, and Mandarin. The guiding rule in our adaptation was that the insertion of vowels was more likely than consonants, and vice versa for deletions. In addition the deletion of the phones /r/, /l/, /h/ was afforded a smaller cost as these were deleted quite often, especially in the case of /r, l/ when they occurred syllable finally.
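A dynamic programming alignment with separate substitution, insertion and deletion costs can be sketched as below. All numeric costs, the vowel set, and the placeholder `toy_substitution_cost` are assumptions chosen only to respect the ordering described above (vowel insertions cheaper than consonant insertions, consonant deletions cheaper than vowel deletions, and /r/, /l/, /h/ deletions cheapest); they are not the values used in the adapted toolkit.

```python
VOWELS = {"a", "e", "i", "o", "u"}  # toy vowel inventory (assumption)
WEAK = {"r", "l", "h"}              # phones that are frequently deleted


def insertion_cost(phone):
    # Vowel insertions are treated as more likely, hence cheaper.
    return 1.0 if phone in VOWELS else 2.0


def deletion_cost(phone):
    # /r/, /l/, /h/ deletions are cheapest; otherwise consonants are cheaper.
    if phone in WEAK:
        return 0.5
    return 2.0 if phone in VOWELS else 1.0


def toy_substitution_cost(ref_phone, hyp_phone):
    # Placeholder: a class-based cost would normally be supplied instead.
    return 0.0 if ref_phone == hyp_phone else 3.0


def alignment_cost(ref, hyp, sub_cost=toy_substitution_cost):
    """Minimum total cost of aligning ref with hyp."""
    rows, cols = len(ref) + 1, len(hyp) + 1
    cost = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        cost[i][0] = cost[i - 1][0] + deletion_cost(ref[i - 1])
    for j in range(1, cols):
        cost[0][j] = cost[0][j - 1] + insertion_cost(hyp[j - 1])
    for i in range(1, rows):
        for j in range(1, cols):
            cost[i][j] = min(
                cost[i - 1][j] + deletion_cost(ref[i - 1]),
                cost[i][j - 1] + insertion_cost(hyp[j - 1]),
                cost[i - 1][j - 1] + sub_cost(ref[i - 1], hyp[j - 1]),
            )
    return cost[-1][-1]


def phone_alignment_cost(ref, hyp, sub_cost=toy_substitution_cost):
    """Average alignment cost per reference phone."""
    if not ref:
        raise ValueError("reference transcription must not be empty")
    return alignment_cost(ref, hyp, sub_cost) / len(ref)
```

Passing a class-based substitution function in place of `toy_substitution_cost` yields an average cost per reference phone in which linguistically close confusions are penalised less than distant ones.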


As mentioned in the previous section, the use of PER as a metric is difficult when comparing phone streams across languages. For each language under consideration, an appropriate mapping must be conducted. Thus, if German needs to be aligned with Japanese, a mapping must be produced. If German then needs to be aligned with Mandarin, another mapping is required and so on. Using the alignment cost, the phonemic inventory for each language only needs to be incorporated into the class list once and accordingly, it offers a more expedient means of aligning across languages.

4. CallHome

The motivation for using models trained on CallHome, in lieu of those trained on OGI, is that it represents a closer match with the style of discourse and recording conditions contained in CallFriend. As such it is expected that the subsequent models will produce more accurate transcriptions and corresponding improvements in LID performance. For this research we used the resources contained in the CallHome corpus to produce baseline Automatic Speech Recognition (ASR) systems for Spanish, Mandarin, German, and Japanese.

The CallHome corpus includes a collection of telephone speech recordings, transcripts and lexical resources for six languages: those already mentioned as well as American English and Egyptian Arabic. The corpus contains recordings of unscripted conversations between native speakers of the specific language. All calls, which lasted up to 30 minutes, originated in North America. Participants typically called family members or close friends [2]. There is considerably more training data in CallHome when compared to OGI. Table 1 details the total amount of data available in the two corpora, after removing undesirable utterances, and highlights the difference in the amount of available data. Whilst statistics for OGI are not shown, there are also considerably more speakers in the CallHome corpora.

At present, ASR systems based on the available English and Arabic data have not been produced by the authors. In an attempt to expand the number of available CTS recognisers, a recogniser based on transcriptions from SwitchBoard was used to represent the English decoder in our PPRLM system. Note that we have incorporated all available transcriptions for CallHome, including those released in Spanish and Mandarin NIST evaluations. This data was then segregated into separate train, test and development sets, according to an 80/10/10 split. No utterances from any speaker occurred in any other set. Further details regarding the breakdown of data and number of speakers for the CallHome data are provided in Table 2.
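One way to obtain a speaker-disjoint split of this kind is sketched below. The procedure is an assumption made for illustration: the paper states only the 80/10/10 proportions and that no speaker appears in more than one set, so the greedy assignment by utterance count, the function name and the fixed random seed are all hypothetical.

```python
import random
from collections import defaultdict


def speaker_disjoint_split(utterances, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split (utt_id, speaker_id) pairs so that no speaker spans two sets.

    Whole speakers are assigned greedily to train, then test, then dev,
    moving on once a set reaches its target share of utterances.
    """
    by_speaker = defaultdict(list)
    for utt_id, speaker in utterances:
        by_speaker[speaker].append(utt_id)

    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)

    total = sum(len(utts) for utts in by_speaker.values())
    names = ["train", "test", "dev"]
    targets = [fraction * total for fraction in fractions]
    splits = {name: [] for name in names}

    idx = 0
    for speaker in speakers:
        # Advance to the next set once the current one has met its target.
        while idx < len(names) - 1 and len(splits[names[idx]]) >= targets[idx]:
            idx += 1
        splits[names[idx]].extend(by_speaker[speaker])
    return splits
```

Because whole speakers are assigned at once, the realised proportions are only approximately 80/10/10 when speakers contribute unequal numbers of utterances.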

Whilst this corpus has been freely available for some time, very few studies have reported its use in ASR development. The difficulty of the task is perhaps one reason; recognition of CallHome speech is a difficult task, with work outlined in [9] suggesting that the task is significantly more difficult than SwitchBoard English. As such, the development of ASR systems across four languages (5 if English is included) is a significant undertaking. Complicating matters is that the orthographic representation of each language contains its own peculiarities which require attention.

The Spanish and German transcripts introduce very few surprises. For Spanish these include the use of acute accents and diaeresis, whilst in German the inclusion of umlauts, namely ä, ö, ü. Both of these are encoded using ISO 8859-1, and can be seamlessly incorporated in most computer-based applications. However, both Japanese and Mandarin orthographies require a little more attention. The Japanese transcripts contain a mix of Kanji, Hiragana and Katakana, encoded using the EUC standard. Similarly Mandarin is encoded using GB mainland conventions.

Language    OGI (hours)    CallHome (hours)
Mandarin    1.3            24.0
Spanish     1.7            46.8
German      1.5            10.1
Japanese    1.1            10.6
English     3.5            164.0

Table 1: Comparison of Total Available Data - OGI vs CallHome

Language    Train          Test and development sets
Mandarin    [illegible]    [illegible]    [illegible]
Spanish     61821/397      8097/50        7747/45
German      14744/191      1865/22        1644/27
Japanese    20546/187      2670/26        2660/27
English     187753/4389    6554/247       8426/243

Table 2: Details of CallHome datasets

Whilst the lexicons provided for each of these four languages provide reasonable coverage for the words contained in the transcripts, Grapheme-to-Phoneme (G2P) rules were built to reduce the Out-of-Vocabulary rate to zero. Classification and Regression Trees (CART), using the Wagon-CART toolkit [10], were used to produce G2P rules for both Spanish and German. In the case of Mandarin and Japanese, the character-based orthographies were first converted to Romanised forms (Pinyin and Romaji) using the conversion tools available at [11] and [12] respectively. The subsequent derivation of letter-to-phone rules then proved to be a trivial exercise, with an almost one-to-one mapping from letter to sound.

Using the processed transcripts, two sets of models were produced. The first set of models, which are subsequently used as front-end recognisers in the LID system, are based on context-independent acoustic models, with 32 mixture components used to model the state-emission probability density functions. This model set was used to obtain the phone error rates provided in Table 3. It should be noted that the results presented in Table 3 were achieved after tuning the insertion rate on a separate development set.

The second set of models, which were used to obtain the Word Error Rates (WER) shown in Table 3, are based on decision-tree clustered, cross-word context-dependent phone HMMs. A bi-gram language model was trained for each language, using the appropriate training transcripts. To prevent problems with Out-of-Vocabulary (OOV) words, those words in the test set vocabulary which did not appear in the training data were assigned a small probability in the language model. The models used to obtain the WER shown in Table 3 are simply for informational purposes, and were not used further in the LID experiments outlined in this paper.

Parameterisation of speech was achieved via 12 static PLPs plus normalized energy, 1st and 2nd order derivatives, and a [...] was employed to reduce speaker and channel mismatch. Each phone model is a three-state, left-to-right HMM, with no skip transitions, except for silence and pause models. An ergodic silence model is used, allowing transitions back to preceding states. The pause model is a "tee" model, which is tied to the centre state of the silence model. Additionally, an ergodic "laugh" model was created based on its frequency of occurrence across all languages. To cater for the array of speech noises and background noise, two additional left-to-right models were created for speech noise and background noise.

Model       Set    Phone Error Rate (%)    Word Error Rate (%)
Mandarin    42     61.9                    48.2
Spanish     31     52.9                    44.5
German      42     64.6                    39.2
Japanese    37     54.3                    42.6
English     36     63.9                    33.1

Table 3: Recognition Performance for CallHome models

5. Experiments

The LID results presented in this paper represent those obtained according to the NIST-1996 evaluation (1996-Test), the 1996 development set (1996-Dev) and the NIST-2003 evaluation data set (2003-Test). There are 12 different languages (Arabic, English, Farsi, French, German, Hindi, Japanese, Korean, Mandarin, Spanish, Tamil and Vietnamese) and 3 of them have a second dialect (English, Mandarin and Spanish), thereby containing double the amount of training data of the others. Each evaluation has test utterances with durations of 3, 10, and 30 seconds.

Before outlining the experiments conducted, further details on the OGI acoustic models are required. Our previous PPRLM system based on the use of OGI models incorporated 6 languages. However, the development of CallHome models across multiple languages is still a work in progress. As such, models have only been completed for the 5 languages mentioned earlier. Accordingly, the Hindi language from OGI was excluded from the OGI system to ensure the results presented are comparable.

The same HMM state topology was used for both the CallHome and OGI acoustic models, although the number of mixture components used to model the state pdf for OGI was only 8. As mentioned, the availability of considerably more training data in CallHome allowed us to increase the mixtures to 32. Parameterisation for the OGI system mirrors that described for CallHome. Each phone recogniser produces a phone sequence for each of the 12 languages. The phonotactic information contained in the individual phone sequence is modelled via a backed-off bigram Language Model (LM), with duration information appended. In testing, these LMs are used independently to score the phone sequences of each recogniser, and fused at the score level.

5.1. PPRLM LID Performance

The first experiment outlined is a comparison of overall PPRLM LID performance, across all 12 languages, using the two acoustic model sets. It was expected that the CallHome models would outperform those from OGI, and as can be seen from Table 4, this expectation is confirmed. The LID results presented are those obtained after fusion of scores from individual classifiers. The terms unoptimised and optimised are used to distinguish models which have not been tuned from those tuned to extract maximum phone recognition performance on a held-out development set. Details outlining the rationale for this experimentation are deferred until later in the section.

The CallHome models obtain superior LID results when compared to those obtained using the OGI-based front-end recognisers, across all evaluations and durations, with an average difference of 5.96%. However, the range of improvements varied. For example, on the 1996 test the average difference was 3.1%, whereas for the 2003 evaluation the difference was in excess of 9%. Based on this result alone, the utilisation of the CallHome corpus seems vindicated.
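The averages quoted above can be reproduced from the Table 4 figures with a few lines of arithmetic. The sketch assumes the comparison is between the OGI column and the unoptimised CallHome column; the text does not state which CallHome column was used, but this choice matches the quoted values, giving approximately 5.97% overall, 3.09% for the 1996 test and 9.30% for the 2003 test.

```python
# (OGI, unoptimised CallHome) LID accuracy in %, keyed by test duration in
# seconds, copied from Table 4.
TABLE4 = {
    "1996 Dev": {30: (77.86, 84.39), 10: (63.05, 68.34), 3: (39.27, 43.95)},
    "1996 Test": {30: (85.25, 88.00), 10: (74.57, 78.43), 3: (54.22, 56.89)},
    "2003 Test": {30: (76.98, 86.35), 10: (62.60, 71.98), 3: (44.69, 53.85)},
}


def mean_gain(tasks):
    """Average absolute improvement of CallHome over OGI for the given tasks."""
    gains = [
        callhome - ogi
        for task in tasks
        for ogi, callhome in TABLE4[task].values()
    ]
    return sum(gains) / len(gains)


overall = mean_gain(TABLE4)            # all nine task/duration conditions
test_1996 = mean_gain(["1996 Test"])   # NIST 1996 evaluation only
test_2003 = mean_gain(["2003 Test"])   # NIST 2003 evaluation only
```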

5.2. Investigating the Relationship Between PER and LID Performance

In previous versions of the PPRLM implementation at QUT, no attempt has been made to optimise individual recogniser performance for a number of reasons. A lack of suitable transcriptions for determining PER is one reason. More importantly, it was uncertain whether tuning a recogniser to increase performance on one language may bias the resultant phone stream to reflect the phonotactics of the language on which it was tuned, and subsequently degrade the accuracy on other languages. Conversely, it is also possible that inaccuracies which result from an "untuned" system manifest themselves globally across all languages, in turn reducing the information content of the phonetic stream.

Given this, investigations were conducted to evaluate the effect of "tuning" the recogniser on both PER and LID. Front-end tuning was done for each of the 5 languages outlined earlier. Each of the recognisers was tuned to maximise recognition performance on its base language, by adjusting the insertion penalty. These tuned models are referred to as the optimised CallHome models. The OGI models were also tuned to improve their performance on CallHome; however, the level of performance still lagged that achieved with the untuned CallHome models and so results are not shown.

Table 5 includes PERs for each of the 5 front-end languages when tested on its own language. Results are included for OGI as well as the unoptimised and optimised versions of CallHome. This is also contrasted against the global LID performance obtained using each of the individual PRLM systems.

To illustrate, using a Spanish OGI PRLM front end, the PER when used to decode CallHome Spanish is 70.22%, whilst the LID rate across all languages using the Spanish PRLM system is 67.57%. In contrast, when unoptimised models are used to decode the Spanish CallHome transcripts, a PER of 58.1% was achieved whilst the LID rate was 77.33%. The LID results shown are for the 30 second task in the 1996 development set. The selection of the 1996 evaluation was based on the fact that the results achieved in this evaluation align more closely with the overall averages, as shown previously in Table 4.

It can be observed that the change in PER when progressing from OGI to CallHome without tuning is quite large, ranging from 10.3 to 17.5% absolute. Quite alarmingly, the phone recognition rate, even with the CallHome models, is quite poor. The tuning process serves to make the PER more respectable, producing further improvements ranging from 4.82% to just over 9%.

The progression of improvement in LID provides some interesting observations. When progressing from OGI to unoptimised CallHome, the LID rate obtained coincides reasonably well with the changes in PER. For example, an absolute change in PER of 12.12% for the Spanish language produces a corresponding change of 9.76% in LID rate. Similarly a change in German PER of 15.68% leads to an improvement in the LID rate of 7.32%. In all cases the decrease in PER led to an increase in LID, although for Japanese the improvement was less substantial. Further improvement in PER was achieved via tuning, ranging from 4.82% for English to just over 9% for Japanese. Unfortunately, these did not necessarily coincide with LID performance, producing a decrease for all PRLM systems except Japanese.

NIST Task         Test Duration (s)    # Utt    OGI (%)    CallHome Unoptimised (%)    CallHome Optimised (%)
1996 Dev. Test    30                   1147     77.86      84.39                       84.74
1996 Dev. Test    10                   1172     63.05      68.34                       68.00
1996 Dev. Test    3                    1174     39.27      43.95                       44.38
1996 Test         30                   1492     85.25      88.00                       89.81
1996 Test         10                   1502     74.57      78.43                       78.03
1996 Test         3                    1503     54.22      56.89                       56.95
2003 Test         30                   960      76.98      86.35                       86.56
2003 Test         10                   960      62.60      71.98                       73.54
2003 Test         3                    960      44.69      53.85                       53.96

Table 4: PPRLM LID Results (% of accuracy) using different acoustic model sets including recognition optimisation.

Language    OGI PER (%)
Spanish     70.22
Japanese    75.06
English     86.22
Mandarin    77.43
German      88.04

Table 5: PER for CallHome and LID results (% of accuracy) for each of the PRLM sub-systems using 30 second 1996 NIST Development set.

This indicates that tuning the individual recognisers for each PRLM system, to improve performance on its base language, has a detrimental effect on overall LID. This reinforces the idea that whilst tuning may improve the recognition performance for the base language, it degrades the global phone recognition by imposing its own phonotactic constraints. However, without observing the actual PERs for each language alongside the achieved identification rate, this idea is still speculative. Accordingly, additional experiments examining how improving PERs for a particular language correlates with identification rates for the same language, rather than global LID rates, were conducted.

To do this, a Spanish decoder was used to decode all of the languages for which transcripts were available. As highlighted earlier, in order to obtain PER values when decoding speech from other languages, a mapping between the phonemic inventories of each language is required. There are several possible mapping techniques, each with its own benefits. We trialled knowledge driven mapping, where phones are mapped according to linguistic similarity, as well as confusion based, data-driven mapping. A separate subset of the CallHome development data, 45 minutes long, was used to derive the confusion based mappings. Confusion based mapping generally produces better results when derived in the same domain. This was the case for the CallHome models; however, due to channel differences, the best results were obtained using a knowledge based mapping for the OGI models. Accordingly, the results shown in Table 6 are based on using the most suitable mapping technique for each of the corpora.
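As a rough illustration of the confusion based mapping idea, the sketch below maps each decoder phone to the target-language phone it is most often aligned with in development data. The input format (a list of already aligned phone pairs), the function name and the tie-breaking rule are assumptions made for this example, not details taken from the experiments reported here.

```python
from collections import Counter, defaultdict


def derive_confusion_mapping(aligned_pairs):
    """Map each decoder phone to its most frequently co-aligned reference phone.

    aligned_pairs: iterable of (decoder_phone, reference_phone) tuples obtained
    by aligning decoder output with reference transcripts (hypothetical format).
    """
    counts = defaultdict(Counter)
    for decoder_phone, reference_phone in aligned_pairs:
        counts[decoder_phone][reference_phone] += 1

    mapping = {}
    for decoder_phone, ref_counts in counts.items():
        # Highest count wins; ties are broken alphabetically (an assumption).
        best = min(ref_counts.items(), key=lambda kv: (-kv[1], kv[0]))
        mapping[decoder_phone] = best[0]
    return mapping


# Toy aligned data: decoder phone "rr" is mostly confused with reference "r".
pairs = [("a", "a"), ("a", "a"), ("a", "e"),
         ("rr", "r"), ("rr", "r"), ("rr", "l")]
mapping = derive_confusion_mapping(pairs)
```

A mapping of this kind depends on the domain of the alignment data, which is consistent with the observation that it worked best when derived from CallHome data for the CallHome models.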

The mapping across languages proved to be a tedious and time consuming task. As mentioned earlier, this is one of the disadvantages of using PER as a guide for gauging LID performance. Accordingly, we also used the Phone Alignment Cost to gauge the amount of information in the phone stream produced by the Spanish recogniser for each of the languages, including Spanish. The use of this tool is much simpler, as the phonemic inventory only needs to be included in the set of linguistic classes once. The PAC scores are also provided in Table 6. The PAC score represents the average cost of aligning each phone in the reference transcript with that hypothesised by the front end decoder. It should also be highlighted that a decrease in PAC score corresponds to a better alignment, and relative decreases should ideally lead to an increase in LID performance.
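The description above can be made concrete with a minimal sketch of an alignment cost averaged over the reference phones. The substitution cost used here (Jaccard distance between sets of linguistic classes), the fixed insertion and deletion cost, and the toy class inventory are all assumptions chosen for illustration; they are not the cost function used to produce the PAC scores in Table 6.

```python
def substitution_cost(ref_phone, hyp_phone, classes):
    """Assumed cost: Jaccard distance between the phones' linguistic class sets."""
    a, b = classes[ref_phone], classes[hyp_phone]
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def phone_alignment_cost(reference, hypothesis, classes, indel_cost=1.0):
    """Minimum alignment cost divided by the number of reference phones."""
    n, m = len(reference), len(hypothesis)
    if n == 0:
        raise ValueError("reference transcript must not be empty")
    # dp[i][j] = minimum cost of aligning reference[:i] with hypothesis[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * indel_cost
    for j in range(1, m + 1):
        dp[0][j] = j * indel_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + substitution_cost(
                reference[i - 1], hypothesis[j - 1], classes)
            dp[i][j] = min(sub,
                           dp[i - 1][j] + indel_cost,   # deletion
                           dp[i][j - 1] + indel_cost)   # insertion
    return dp[n][m] / n


# Toy linguistic classes (hypothetical, for illustration only).
classes = {
    "p": {"consonant", "plosive", "bilabial", "unvoiced"},
    "b": {"consonant", "plosive", "bilabial", "voiced"},
    "s": {"consonant", "fricative", "alveolar", "unvoiced"},
    "a": {"vowel", "open"},
}
reference = ["p", "a", "s"]
cost_exact = phone_alignment_cost(reference, ["p", "a", "s"], classes)
cost_close = phone_alignment_cost(reference, ["b", "a", "s"], classes)
cost_far = phone_alignment_cost(reference, ["a", "a", "s"], classes)
```

Under these assumptions, confusing "p" with the linguistically similar "b" costs less than confusing it with the vowel "a", whereas PER would count both as a single substitution error.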

Table 6 also includes LID rates. However, in contrast to previous tables, the LID rate shown represents the identification rate for that particular language, and NOT the global LID rate. As such, the languages in Column 1 of Table 6 represent the languages processed by the Spanish PRLM system. For example, using the Spanish decoder and OGI models, the Japanese language was correctly identified 58.97% of the time.

One glaring result to emerge from Table 6 is how poor the phone recognition accuracy is for languages other than Spanish. Thus, it is quite surprising that the LID rate is so high. Further examination of Table 6 reveals that progressing from OGI models to CallHome, and subsequently to the optimised models, still produces significant improvements in PER for the other languages. As before, the progression from OGI to CallHome seems to produce similar correspondence between PER and LID. This is also reflected in the PAC costs. For example, for


Language    OGI                     CallHome Unoptimised    CallHome Optimised
            PER%    PAC    LID%     PER%    PAC    LID%     PER%    PAC    LID%
Spanish     70.30   7.09   74.83    58.10   5.45   83.44    52.90   5.34   84.11
Japanese    86.40   7.86   58.97    74.65   6.79   60.26    63.05   6.58   57.69
Mandarin    83.90   8.44   71.61    76.23   7.99   78.06    69.45   7.99   78.71
German      85.51   8.14   60.76    78.21   7.28   81.01    69.67   7.20   77.22
English     87.70   8.01   81.13    83.37   7.23   84.91    73.26   7.09   87.42

Table 6: Individual target language PER/PAC and LID results (% of accuracy) achieved using Spanish Front End.

[Plot not reproduced. Horizontal axis: Evaluation Metric; series: CallHome-Unoptimised and CallHome-Optimised.]

Figure 2: Relative Improvements in PRA/PAC and LID Rates in relationship to OGI baseline

Spanish on Spanish, the PER, PAC and LID improvements are 12.12%, 1.64% and 16.73% respectively. This trend is repeated across all languages, although the scale of improvements is variable.

In the previous section, it was found that tuning each of the front end recognisers did not necessarily provide a boost to the overall LID performance of each PRLM system. We suggested that one explanation for this was that tuning may adversely affect PER for the other target languages. However, the results in Table 6 indicate that PERs improve for all languages evaluated when the Spanish PRLM system was tuned to improve performance on Spanish. This suggests that the subsequent decrease in LID cannot be attributed to tuning having a detrimental impact on the accuracy of the phone stream of other languages. Of course, it may be that the tuning impacts on languages outside those for which PER rates were obtained, such as Vietnamese. This may be why the global LID rate decreased after tuning, as shown in Table 5. Unfortunately, this is impossible to evaluate without suitable transcriptions. However, the fact that improvements to PER for each of the languages in Table 6 do not subsequently result in improvements in LID for those languages indicates that the usefulness of PER as a metric for gauging eventual LID performance is limited.
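The mismatch between PER and LID after tuning can be checked directly against the Table 6 figures. The short sketch below uses the unoptimised and optimised CallHome values from that table. The variable and function names are introduced here for illustration only.

```python
# (PER %, LID %) for the Spanish front end, taken from Table 6.
# First tuple: CallHome unoptimised; second tuple: CallHome optimised.
table6 = {
    "Spanish":  ((58.10, 83.44), (52.90, 84.11)),
    "Japanese": ((74.65, 60.26), (63.05, 57.69)),
    "Mandarin": ((76.23, 78.06), (69.45, 78.71)),
    "German":   ((78.21, 81.01), (69.67, 77.22)),
    "English":  ((83.37, 84.91), (73.26, 87.42)),
}


def tuning_effect(unoptimised, optimised):
    """Return (absolute PER reduction, absolute LID gain) due to tuning."""
    per_reduction = unoptimised[0] - optimised[0]
    lid_gain = optimised[1] - unoptimised[1]
    return per_reduction, lid_gain


effects = {lang: tuning_effect(u, o) for lang, (u, o) in table6.items()}

# Languages where PER improved but the per-language LID rate fell.
disagreeing = sorted(lang for lang, (per_red, lid_gain) in effects.items()
                     if per_red > 0 and lid_gain < 0)

for lang, (per_red, lid_gain) in effects.items():
    print(f"{lang:9s} PER reduction {per_red:6.2f}  LID change {lid_gain:+6.2f}")
```

With these figures, PER falls for every language, yet the LID rate falls for Japanese and German, and the gains for Spanish and Mandarin are small relative to the PER reductions.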

5.3. PAC versus PER

In an attempt to rationalise why the PER rate does not correspond to LID rates, we examined the trends in PAC scores. As with PER, progression to optimised models leads to improvements in PAC scores, but not LID. However, the different scales used by PAC, PER and LID make it difficult to infer relationships between metrics. Accordingly, we examine the relative changes in metrics to gain further insight. When examined from this perspective, the relative changes in PAC scores are much smaller than their PER counterparts.

To illustrate, consider Figure 2, which plots the relative performance improvements for each of the metrics using the original performance of the OGI system as the starting point. Thus, the improvements plotted are CallHome and optimised CallHome, relative to the original performance of the OGI system. Note that because the LID is in terms of correct predictions, phone recognition performance is calculated in terms of accuracy (PRA), to preserve the relative directions of improvement. Thus relative improvements for PRA, PAC and LID rates are plotted for each language based on the results shown in Table 6.

This plot highlights a number of things. The first is that PRA (or its counterpart PER) is inadequate for gauging the influence of changes to the decoder on eventual LID performance. In contrast, the magnitude of the impact which changes to the front end recogniser impart on LID performance is better reflected by the PAC metric. For example, the change from the OGI model set to the CallHome model set, and subsequently to the optimised CallHome model set, produced relatively large improvements in PRA. In German, an initial improvement of around 50%, followed by improvement to more than 110% better than OGI, was obtained by progressing through the unoptimised and optimised model sets. This represents a 60% relative change which results from the optimisation. However, the LID performance only improved by about 35% and then actually dropped off to 30% relative to OGI, which means performance went back by 5%. Thus the margins of change are relatively large for PER, but not LID. In contrast, the changes for German using the PAC metric are around 1%, still different, but at least more indicative. Thus the PAC measure seems to at least be better suited for gaining an overall indication of potential improvements, if not the actual direction. One explanation for these smaller differences between PAC and LID rate may be that the PAC is calculated on CallHome, whereas the LID rate is based on CallFriend.
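The relative improvements discussed for German can be approximately recomputed from the rounded Table 6 entries, as in the sketch below. It assumes that PRA is simply 100 minus PER and that a relative PAC improvement is the relative decrease in cost. Both are assumptions made for this example, so the results may differ slightly from the values read off Figure 2.

```python
def relative_improvement(baseline, new, lower_is_better=False):
    """Relative improvement over the baseline, in percent."""
    change = (baseline - new) if lower_is_better else (new - baseline)
    return 100.0 * change / baseline


# German rows of Table 6: (PER %, PAC, LID %) using the Spanish front end.
german = {
    "OGI":         (85.51, 8.14, 60.76),
    "unoptimised": (78.21, 7.28, 81.01),
    "optimised":   (69.67, 7.20, 77.22),
}

base_per, base_pac, base_lid = german["OGI"]
results = {}
for name in ("unoptimised", "optimised"):
    per, pac, lid = german[name]
    results[name] = {
        # Assumed definition: phone recognition accuracy = 100 - PER.
        "PRA": relative_improvement(100.0 - base_per, 100.0 - per),
        # Assumed definition: a lower alignment cost is an improvement.
        "PAC": relative_improvement(base_pac, pac, lower_is_better=True),
        "LID": relative_improvement(base_lid, lid),
    }

for name, metrics in results.items():
    print(name, {k: round(v, 1) for k, v in metrics.items()})
```

Under these assumptions the relative PRA improvement roughly doubles after optimisation, while the relative PAC improvement moves by about one percentage point and the relative LID improvement falls, which matches the qualitative pattern described above.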

Certainly the results shown can only be used as a guide. However, with further refinement we believe this measure is more suitable for estimating the potential impact of changes to the front end on subsequent LID performance.


6. Conclusion

In this paper a more detailed examination of the relationship between phone recognition, phonotactic information and LID rates was undertaken. The utility of PER as a proxy for phonotactic information was discussed and examined. It was highlighted that the use of OGI models in PPRLM systems produces phone streams that contain significant errors. Models produced using CallHome transcriptions were subsequently compared to OGI models, producing more accurate transcriptions and, importantly, an average absolute improvement in LID of approximately 6% across the 30, 10 and 3 second tasks for the NIST 1996 and 2003 evaluations.

In addition, a new metric, the Phone Alignment Cost (PAC), was proposed. This metric was based on the principle that errors in the phone stream are not equitable, and that useful information is still present in phone streams corrupted by phone recognition errors. This method overcomes many of the shortcomings that PER exhibits, including difficulty in use for evaluating front end performance across multiple languages, as well as an inability to grade the significance of errors according to linguistic similarity. Comparisons between the PER and PAC were conducted and contrasted with LID performance. Whilst the PAC technique still requires refinement, early indications are that relative changes in PAC scores are more closely aligned to LID performance than changes in PER. Thus the PAC metric is better suited to estimating the impact of changes to the front end on subsequent LID performance and should be useful for extracting improved performance from PPRLM systems.

7. References

[1] M. A. Zissman and E. Singer, "Automatic Language Identification of Telephone Speech Messages using Phoneme Recognition and N-Gram Modelling," in International Conference on Acoustics, Speech and Signal Processing, 1994, vol. 1, pp. 305-308.

[2] LDC, "Linguistic Data Consortium," http://www.ldc.upenn.edu/, 2004.

[3] Y. K. Muthusamy, R. A. Cole, and B. T. Oshika, "The OGI Multi-Language Telephone Speech Corpus," in International Conference on Spoken Language Processing, 1992, vol. 2, pp. 895-898.

[4] P. Fung, W. Byrne, Z. F. Thomas, T. Kamm, L. Yi, S. Zhanjiang, V. Venkataramani, and U. Ruhi, "Pronunciation modeling of Mandarin casual speech," Johns Hopkins Summer Workshop, 2000.

[5] T. Schultz and A. Waibel, "Polyphone decision tree specialization for language adaptation," in Proc. of ICASSP, Istanbul, 2000.

[6] C. Nieuwoudt and E. C. Botha, "Cross-language use of acoustic information for automatic speech recognition," Speech Commun., vol. 38, no. 1, pp. 101-113, 2002.

[7] J. Kohler, "Language adaptation of multilingual phone models for vocabulary independent speech recognition tasks," in Proc. ICASSP, Washington, U.S.A, 1998, vol. 1, pp. 417-420.

[8] "Flexible Alignment Toolkit," http://www.clsp.jhu.edu/ws2000/groups/mcs/Tools/README.html, 2000.

[9] G. Zavaliagkos and T. Colthurst, "Utilizing untranscribed training data to improve performance," in Proc. of the broadcast news transcription and understanding workshop, 1998, pp. 301-307.

[10] "WAGON - Classification and Regression Tree (CART) toolkit," http://www.ims.uni-stuttgart.de/phonetik/synthesis/festival/festdoc-1.4.0.1/speechtoois/x3475.htm.

[11] "Hanzi to Pintin (C2T) conversion tool," http://www.ibiblio.org/pub/packages/ccic/software/unix/convert/.

[12] "KAKASI, Kanji-Kana-to-Romaji conversion toolkit," http://kakasi.namazu.org/.