Information Extraction Using the Structured Language Model

Ciprian Chelba and Milind Mahajan

Microsoft Research

Microsoft Corporation

One Microsoft Way, Redmond, WA 98052

{chelba,milindm}@microsoft.com

Abstract

The paper presents a data-driven approach to information extraction (viewed as template filling) using the structured language model (SLM) as a statistical parser. The task of template filling is cast as constrained parsing using the SLM. The model is automatically trained from a set of sentences annotated with frame/slot labels and spans. Training proceeds in stages: first a constrained syntactic parser is trained such that the parses on training data meet the specified semantic spans, then the non-terminal labels are enriched to contain semantic information and finally a constrained syntactic+semantic parser is trained on the parse trees resulting from the previous stage. Despite the small amount of training data used, the model is shown to outperform the slot level accuracy of a simple semantic grammar authored manually for the MiPad (personal information management) task.

1 Introduction

Information extraction from text can be characterized as template filling (Jurafsky and Martin, 2000): a given template or frame contains a certain number of slots that need to be filled in with segments of text. Typically not all the words in text are relevant to a particular frame. Assuming that the segments of text relevant to filling in the slots are non-overlapping contiguous strings of words, one can represent the semantic frame as a simple semantic parse tree for the sentence to be processed. The tree has two levels: the root node is tagged with the frame label and spans the entire sentence; the leaf nodes are tagged with the slot labels and span the strings of words relevant to the corresponding slot.

Consider the semantic parse S for a sentence W presented in Fig. 1. CalendarTask is the frame tag, spanning the entire sentence; the remaining ones are slot tags with their corresponding spans.

(CalendarTask schedule meeting with
  (ByFullName*Person megan hokins) about
  (SubjectByWildCard*Subject internal lecture)
  at (PreciseTime*Time two thirty p.m.))

Figure 1: Sample sentence and semantic parse
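As an illustration, the two-level semantic parse of Figure 1 can be written down as a small data structure. The sketch below is not from the paper: the class names `Slot` and `SemanticParse`, the zero-based inclusive word indices and the whitespace tokenization are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Slot:
    label: str  # slot tag, e.g. "ByFullName*Person"
    start: int  # index of the first word covered by the slot (assumed 0-based)
    end: int    # index of the last word covered by the slot (inclusive)


@dataclass(frozen=True)
class SemanticParse:
    frame: str               # frame tag, spans the whole sentence (root level)
    words: Tuple[str, ...]   # the word sequence W
    slots: Tuple[Slot, ...]  # non-overlapping contiguous spans (leaf level)

    def slot_text(self, slot: Slot) -> str:
        """Return the words covered by a slot as a single string."""
        return " ".join(self.words[slot.start:slot.end + 1])


# The sentence of Figure 1, split on whitespace.
words = tuple(
    "schedule meeting with megan hokins about internal lecture "
    "at two thirty p.m.".split()
)
parse = SemanticParse(
    frame="CalendarTask",
    words=words,
    slots=(
        Slot("ByFullName*Person", 3, 4),
        Slot("SubjectByWildCard*Subject", 6, 7),
        Slot("PreciseTime*Time", 9, 11),
    ),
)

for slot in parse.slots:
    print(slot.label, "->", parse.slot_text(slot))
```

Words outside every slot span ("schedule", "meeting", "with", "about", "at") belong only to the frame-level constituent, which mirrors the observation that not all words are relevant to filling the slots.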

In the MiPad scenario (Huang et al., 2000), essentially a personal information management (PIM) task, there is a module that is able to convert the information extracted according to the semantic parse into specific actions. In this case the action is to schedule a calendar appointment.

We view the problem of information extraction as the recovery of the two-level semantic parse S for a given word sequence W.

We propose a data-driven approach to information extraction that uses the structured language model (SLM) (Chelba and Jelinek, 2000) as an automatic parser. The parser is constrained to explore only parses that contain pre-set constituents, each spanning a given word string and bearing a tag in a given set of semantic tags. The constraints available during training and test are different, the test case constraints being more relaxed as explained in Section 4.

The main advantage of the approach is that it doesn't require any grammar authoring expertise. The approach is fully automatic once the annotated training data is provided; it does assume that an application schema (i.e. frame and slot structure) has been defined but does not require semantic grammars that identify word-sequence to slot or frame mapping. However, the process of converting the word sequence corresponding to a slot into actionable canonical forms (e.g. converting "half past two in the afternoon" into "2:30 p.m.") may require grammars. The design of the frames (what information is relevant for taking a certain action, what slot/frame tags are to be used, see (Wang, 1999)) is a delicate task that we will not be concerned with for the purposes of this paper.

The remainder of the paper is organized as follows: Section 2 reviews the structured language model (SLM), followed by Section 3 which describes in detail the training procedure and Section 4 which defines the operation of the SLM as a constrained parser and presents the necessary modifications to [...] have carried out. We conclude with Section 7.

2 Structured Language Model

We proceed with a brief review of the structured language model (SLM); an extensive presentation of the SLM can be found in (Chelba and Jelinek, 2000). The model assigns a probability P(W, T) to every sentence W and its every possible binary parse T. The terminals of T are the words of W with POStags, and the nodes of T are annotated with phrase headwords and non-terminal labels. Let W be a sentence of length n words to which we have prepended the sentence beginning marker <s> and appended the sentence end marker </s> so that w_0 = <s> and w_{n+1} = </s>. Let W_k = w_0 ... w_k be the word k-prefix of the sentence (the words from the beginning of the sentence up to the current position k) and W_k T_k the word-parse k-prefix. Figure 2 shows a word-parse k-prefix; h_0 ... h_{-m} are the exposed heads, each head being a pair (headword, non-terminal label), or (word, POStag) in the case of a root-only tree. The exposed heads at a given position k in the input sentence are a function of the word-parse k-prefix.

(<s>, SB) ... (w_p, t_p) (w_{p+1}, t_{p+1}) ... (w_k, t_k) w_{k+1} ... </s>
exposed heads: h_{-m} = (<s>, SB), ..., h_{-1}, h_0 = (h_0.word, h_0.tag)

Figure 2: A word-parse k-prefix

2.1 Probabilistic Model

The joint probability P(W, T) of a word sequence W and a complete parse T can be broken into:

$$P(W,T) = \prod_{k=1}^{n+1} \Big[ P(w_k \mid W_{k-1}T_{k-1}) \cdot P(t_k \mid W_{k-1}T_{k-1}, w_k) \cdot \prod_{i=1}^{N_k} P(p_i^k \mid W_{k-1}T_{k-1}, w_k, t_k, p_1^k \ldots p_{i-1}^k) \Big]$$

where:

- W_{k-1} T_{k-1} is the word-parse (k-1)-prefix
- w_k is the word predicted by the WORD-PREDICTOR
- t_k is the tag assigned to w_k by the TAGGER
- N_k - 1 is the number of operations the PARSER executes at sentence position k before passing control to the WORD-PREDICTOR (the N_k-th operation at position k is the null transition); N_k is a function of T
- p_i^k denotes the i-th PARSER operation carried out at position k in the word string; the operations [...] binary branching parses with all possible headword and non-terminal label assignments for the w_1 ... w_k word sequence can be generated. The p_1^k ... p_{N_k}^k sequence of PARSER operations at position k grows the word-parse (k-1)-prefix into a word-parse k-prefix.

Figure 3: Result of adjoin-left under NTlabel. The new exposed heads are h'_0 = (h_{-1}.word, NTlabel) and h'_{-1} = h_{-2}.

Figure 4: Result of adjoin-right under NTlabel. The new exposed heads are h'_0 = (h_0.word, NTlabel) and h'_{-1} = h_{-2}.

Our model is based on three probabilities, each estimated using deleted interpolation and parameterized (approximated) as follows:

$$P(w_k \mid W_{k-1}T_{k-1}) \doteq P(w_k \mid h_0, h_{-1})$$
$$P(t_k \mid w_k, W_{k-1}T_{k-1}) \doteq P(t_k \mid w_k, h_0, h_{-1})$$
$$P(p_i^k \mid W_k T_k) \doteq P(p_i^k \mid h_0, h_{-1})$$

It is worth noting that if the binary branching structure developed by the parser were always right-branching and we mapped the POStag and non-terminal label vocabularies to a single type then our model would be equivalent to a trigram language model. Since the number of parses for a given word prefix W_k grows exponentially with k, |{T_k}| ~ O(2^k), the state space of our model is huge even for relatively short sentences, so we had to use a search strategy that prunes it. Our choice was a synchronous multi-stack search algorithm which is very similar to a beam search.

The language model probability assignment for the word at position k+1 in the input sentence is made using:

$$P(w_{k+1} \mid W_k) = \sum_{T_k \in S_k} P(w_{k+1} \mid W_k T_k) \cdot \rho(W_k T_k), \qquad \rho(W_k T_k) = \frac{P(W_k T_k)}{\sum_{T_k \in S_k} P(W_k T_k)} \qquad (1)$$

which ensures a proper probability over strings W*, where S_k [...]

Each model component (WORD-PREDICTOR, TAGGER, PARSER) is initialized from a set of parsed sentences after undergoing headword percolation and binarization. Separately for each model component we:

- gather counts from "main" data, about 90% of the training data
- estimate the interpolation coefficients on counts gathered from "check" data, the remaining 10% of the training data.

An N-best EM (Dempster et al., 1977) variant is then employed to jointly re-estimate the model parameters such that the likelihood of the training data under our model is increased.
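The probability assignment of Eq. (1) can be sketched in a few lines. This is a minimal illustration, assuming each parse retained by the search is summarized by its joint probability P(W_k T_k) and by a next-word distribution; the function name, the dictionary representation and the numbers are hypothetical and not taken from the paper.

```python
from typing import Dict, List, Tuple

# One entry per parse T_k kept by the search at position k:
# (P(W_k T_k), {word: P(word | W_k T_k)}).
StackEntry = Tuple[float, Dict[str, float]]


def next_word_probability(stack: List[StackEntry], word: str) -> float:
    """Interpolate the per-parse predictions with weights rho(W_k T_k)."""
    # Normalizer of rho: sum of P(W_k T_k) over the retained parses.
    total = sum(joint for joint, _ in stack)
    return sum(
        (joint / total) * predictor.get(word, 0.0)
        for joint, predictor in stack
    )


# Two hypothetical parses of the same word prefix.
stack = [
    (0.006, {"meeting": 0.5, "lunch": 0.5}),
    (0.002, {"meeting": 0.1, "lunch": 0.9}),
]
p_meeting = next_word_probability(stack, "meeting")
p_lunch = next_word_probability(stack, "lunch")
print(p_meeting, p_lunch)
```

Because the weights rho sum to one over the retained parses and each per-parse distribution sums to one over the vocabulary, the interpolated probabilities also sum to one, which is the normalization property the text attributes to Eq. (1).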

3 Training Procedure

This section describes the training procedure for the SLM when applied to information extraction and introduces the modifications that need to be made to the SLM operation.

The training of the model proceeds in four stages:

1. Initialize the SLM as a syntactic parser for the domain we are interested in. A general purpose parser (such as NLPwin (Heidorn, 1999)) can be used to generate a syntactic treebank from which the SLM parameters can be initialized. Another possibility for initializing the SLM is to use a treebank for out-of-domain data (such as the UPenn Treebank (Marcus et al., 1993)); see Section 6.1.

2. Train the SLM as a matched constrained parser. At this step the parser is going to propose a set of N syntactic binary parses for a given word string (N-best parsing), all matching the constituent boundaries specified by the semantic parse: a parse T is said to match the semantic parse S, denoted T ∋ S, if and only if the set of unlabeled constituents that define S is included in the set of constituents that define T. At this time only the constituent span information in S is taken into account.

3. Enrich the non-terminal and pre-terminal labels of the resulting parses with the semantic tags (frame and slot) present in the semantic parse, thus expanding the vocabulary of non-terminal and pre-terminal tags used by the syntactic parser to include semantic information alongside the usual syntactic tags.

4. Train the SLM as an L(abel)-matched constrained [...] constituent labels are taken into account too, which means that a parse P, containing both syntactic and semantic information, is said to L(abeled)-match S if and only if the set of labeled semantic constituents that defines S is identical to the set of semantic constituents that defines P. If we let SEM(P) denote the function that maps a tree P containing both syntactic and semantic information to the tree containing only semantic information, referred to as the semantic projection of P, then all the parses P_i, ∀i < N, proposed by the SLM for a given sentence W, L-match S and thus satisfy SEM(P_i) = S, ∀i < N.

   The semantic tree S has a two-level structure, so the above requirement can be satisfied only if the parses SEM(P) proposed by the SLM are also on two levels, frame and slot level respectively. We have incorporated this constraint into the operation of the SLM; see Section 4.2.
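The match and L-match relations used in stages 2 and 4 can be sketched as set operations over constituents. The representation below is an assumption made for illustration: a parse is a set of `Constituent` tuples with zero-based inclusive spans, a syntactic tag and an optional semantic tag, and a semantic parse S is a set of `(start, end, semantic_tag)` triples. The tag names in the example are hypothetical.

```python
from typing import Iterable, NamedTuple, Optional, Set, Tuple


class Constituent(NamedTuple):
    start: int                 # first word index (assumed 0-based)
    end: int                   # last word index, inclusive
    syn: str                   # syntactic tag
    sem: Optional[str] = None  # semantic tag, None when void


SemTriple = Tuple[int, int, str]


def sem_projection(parse: Iterable[Constituent]) -> Set[SemTriple]:
    """SEM(P): keep only the constituents that carry a semantic tag."""
    return {(c.start, c.end, c.sem) for c in parse if c.sem is not None}


def matches(parse: Iterable[Constituent], semantic: Set[SemTriple]) -> bool:
    """T matches S: the unlabeled spans of S are all constituents of T."""
    parse_spans = {(c.start, c.end) for c in parse}
    return {(s, e) for s, e, _ in semantic} <= parse_spans


def l_matches(parse: Iterable[Constituent], semantic: Set[SemTriple]) -> bool:
    """P L-matches S: the labeled semantic constituents are identical."""
    return sem_projection(parse) == set(semantic)


# Toy sentence "meeting with megan hokins" (word indices 0..3).
S = {(0, 3, "CalendarTask"), (2, 3, "ByFullName*Person")}
T = {Constituent(0, 3, "S"), Constituent(1, 3, "PP"), Constituent(2, 3, "NP")}
P = {
    Constituent(0, 3, "S", "CalendarTask"),
    Constituent(1, 3, "PP"),
    Constituent(2, 3, "NP", "ByFullName*Person"),
}
crossing = {Constituent(0, 3, "S"), Constituent(0, 2, "VP")}

print(matches(T, S), l_matches(T, S), l_matches(P, S), matches(crossing, S))
```

The purely syntactic parse `T` matches S but cannot L-match it, since its semantic projection is empty; enriching its labels as in stage 3 yields `P`, which satisfies SEM(P) = S.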

The model thus trained is then used to parse test sentences and recover the semantic parse using

$$S = SEM\big(\arg\max_{P_i} P(P_i, W)\big).$$

In principle, one should sum over all the parses P that yield the same semantic parse S and then choose

$$S = \arg\max_{S} \sum_{P_i \ \mathrm{s.t.}\ SEM(P_i) = S} P(P_i, W).$$
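The difference between the two decision rules above shows up whenever several parses share one semantic projection. The sketch below assumes an N-best list given as `(semantic_projection, joint_probability)` pairs; the function names and the numbers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple

NBest = List[Tuple[Hashable, float]]  # (SEM(P_i), P(P_i, W)) per parse


def semantic_parse_by_max(nbest: NBest) -> Hashable:
    """S = SEM(argmax_i P(P_i, W)): projection of the single best parse."""
    sem, _ = max(nbest, key=lambda item: item[1])
    return sem


def semantic_parse_by_sum(nbest: NBest) -> Hashable:
    """S = argmax_S of the summed probability of parses projecting to S."""
    totals: Dict[Hashable, float] = defaultdict(float)
    for sem, prob in nbest:
        totals[sem] += prob
    return max(totals, key=lambda sem: totals[sem])


# Two parses project to "S_b"; individually each is less likely than "S_a".
nbest = [("S_a", 0.30), ("S_b", 0.25), ("S_b", 0.20)]
print(semantic_parse_by_max(nbest))  # S_a
print(semantic_parse_by_sum(nbest))  # S_b
```

In this toy list the single best parse projects to `S_a`, while the summed mass (0.45 versus 0.30) favors `S_b`, so the two rules can disagree.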

A few iterations of the N-best EM variant (see Section 2) were run at each of the second and fourth steps in the training procedure. The constrained parser operation makes this an EM variant where the hidden space (the possible parse trees for a given sentence) is a priori limited by the semantic constraints to a subset of the hidden space of the unrestricted model. At test time we wish to recover the most likely subset of the hidden space consistent with the constraints imposed on the sentence.

To be more specific, during the second training stage, the E-step of the reestimation procedure will only explore syntactic trees (hidden events) that match the semantic parse; the fourth stage E-steps will consider hidden events that are constrained even further to L-match the semantic parse. We have no proof that this procedure should lead to better performance in terms of slot/frame accuracy but intuitively one expects it to place more and more probability mass on the desirable trees, that is, the trees that are consistent with the semantic annotation. This is confirmed experimentally by the fact that the likelihood of the training word sequence (observable), calculated by Eq. (1) where the sum runs over the parse trees that match/L-match the semantic constraints, does increase at every training step, as presented in Section 6, Table 1. However, [...] with a decrease in error rate on the training data, see Tables 2 and 3 in Section 6.

4 Constrained Parsing Using the Structured Language Model

We now detail the constrained operation of the SLM (matched and L-matched parsing) used at the second and fourth steps of the training procedure described in the previous section.

A semantic parse S for a given sentence W consists of a set of constituent boundaries along with semantic tags. When parsing the sentence using the standard formulation of the SLM, one obtains binary parses that are not guaranteed to match the semantic parse S, i.e. the constituents proposed by the SLM may cross semantic constituent boundaries; for the constituents matching the semantic constituent boundaries, the labels proposed may not be the desired ones.

To fix terminology, we define a constrained constituent, or simply a constraint, c to be a span together with a set [2] of allowable tags for the span: c = <l, r, Q> where l is the left boundary of the constraint, r is the right boundary of the constraint and Q is the set of allowable non-terminal tags for the constraint.

A semantic parse can be viewed as a set of constraints; for each constraint the set of allowable non-terminal tags Q contains a single element, namely the semantic tag of the corresponding constituent. An additional fact to be kept in mind is that the semantic parse tree consists of exactly two levels: the frame level (root semantic tag) and the slot level (leaf semantic tags).
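Viewing a semantic parse as a set of constraints c = <l, r, Q> can be sketched as follows. The `Constraint` class, the helper name and the zero-based inclusive indexing (which ignores the sentence boundary markers) are assumptions of this example, not definitions from the paper.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Sequence, Tuple


@dataclass(frozen=True)
class Constraint:
    l: int             # left boundary (assumed 0-based word index)
    r: int             # right boundary, inclusive
    Q: FrozenSet[str]  # allowable non-terminal (semantic) tags


def constraints_from_semantic_parse(
    frame: str,
    n_words: int,
    slots: Sequence[Tuple[str, int, int]],
) -> List[Constraint]:
    """One frame-level constraint over the sentence, one per slot span."""
    constraints = [Constraint(0, n_words - 1, frozenset({frame}))]
    constraints += [Constraint(l, r, frozenset({tag})) for tag, l, r in slots]
    return constraints


# The annotated sentence of Figure 1 (12 words after whitespace splitting).
training_constraints = constraints_from_semantic_parse(
    "CalendarTask",
    12,
    [
        ("ByFullName*Person", 3, 4),
        ("SubjectByWildCard*Subject", 6, 7),
        ("PreciseTime*Time", 9, 11),
    ],
)
for c in training_constraints:
    print(c)
```

Every Q built this way holds exactly one tag, as stated above for annotated sentences; a set with several tags stays representable with the same structure.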

During training, we wish to constrain the SLM operation such that it considers only parses that match the constraints c_i, i = 1 ... C as it proceeds left to right through a given sentence W. In light of the training procedure sketched in the introduction, we consider two flavors of constrained parsing, one in which we only generate parses that match the constraint boundaries and another in which we also enforce that the proposed tag for every matching constituent is among the constrained set of non-terminal tags c_i.Q: L(abeled)-match constrained parsing.

The only constraints available for the test sentences are:

- the semantic tag of the root node, which spans the entire sentence, must be in the set of frame tags. If it were a test sentence the example in Figure 1 would have the following semantic parse (constraints):
  ({CalendarTask, ContactsTask, MailTask} [...] internal lecture to two thirty p.m.)
- the semantic projection of the trees proposed by the SLM must have exactly two levels; this constraint is built into the operation of the L-match parser.

[2] The set of allowable tags must contain at least one element [...]

The next section will describe the constrained parsing algorithm. Section 4.2 will describe further changes that the algorithm uses to produce only parses P whose semantic projection SEM(P) has exactly two levels, frame (root) and slot (leaf) level, respectively; this applies only in the L-match case. We conclude with Section 4.3 explaining how the constrained parsing algorithm interacts with the pruning of the SLM search space for the most likely parse.

4.1 Match and L-match SLM Parsing

The trees produced by the SLM are binary trees. The tags annotating the nodes of the tree are purely syntactic (during the second training stage) or syntactic+semantic (during the last training stage or at test time). It can be proved that satisfying the following two conditions at each position k in the input sentence ensures that all the binary trees generated by the SLM parsing algorithm match the pre-set constraints c_i, i = 1 ... C as it proceeds left to right through the input sentence W = w_0 ... w_{n+1}:

- for a given word-parse k-prefix W_k T_k (see Section 2) accept an adjoin transition if and only if:
  1. the resulting constituent does not violate [3] any of the constraints c_i, i = 1 ... C
  2. L-match parsing only: if the semantic projection of the non-terminal tag SEM(NTtag) proposed by the adjoin operation is non-void then the newly created constituent must L-match an existing constraint, ∃ c_i s.t. SEM(NTtag) ∈ c_i.Q.
- for a given word-parse k-prefix W_k T_k (see Section 2) accept the null transition if and only if all the constraints c_i whose right boundary is equal to the current word index k, c_i.r = k, have been matched. If these constraints remain unmatched they will be broken at a later time during the process of completing the parse for the current sentence W: there will be an adjoin operation involving a constituent to the right of the current position that will break all the constraints ending at the current position k.
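The two acceptance tests above can be sketched as predicates over spans. Several points are assumptions of this example rather than statements from the paper: a constituent is taken to violate a constraint when the two spans cross (overlap without nesting), L-matching a constraint is read as "same span and tag in Q", and the caller is assumed to track the spans already built in the current word-parse prefix. All names are hypothetical.

```python
from typing import FrozenSet, Iterable, NamedTuple, Optional, Set, Tuple


class Constraint(NamedTuple):
    l: int             # left boundary (assumed 0-based word index)
    r: int             # right boundary, inclusive
    Q: FrozenSet[str]  # allowable semantic tags


def crosses(a: int, b: int, c: Constraint) -> bool:
    """True if span [a, b] partially overlaps [c.l, c.r] without nesting."""
    return (a < c.l <= b < c.r) or (c.l < a <= c.r < b)


def accept_adjoin(a: int, b: int, sem_tag: Optional[str],
                  constraints: Iterable[Constraint],
                  l_match: bool = False) -> bool:
    """Condition 1 (no violated constraint) and, if l_match, condition 2."""
    constraints = list(constraints)
    if any(crosses(a, b, c) for c in constraints):
        return False
    if l_match and sem_tag is not None:
        return any((c.l, c.r) == (a, b) and sem_tag in c.Q
                   for c in constraints)
    return True


def accept_null(k: int, constraints: Iterable[Constraint],
                built_spans: Set[Tuple[int, int]]) -> bool:
    """Every constraint ending at position k must already be matched."""
    return all((c.l, c.r) in built_spans for c in constraints if c.r == k)


constraints = [
    Constraint(0, 11, frozenset({"CalendarTask"})),
    Constraint(3, 4, frozenset({"ByFullName*Person"})),
]
print(accept_adjoin(2, 3, None, constraints))                     # crosses
print(accept_adjoin(3, 4, "ByFullName*Person", constraints, True))
print(accept_null(4, constraints, {(3, 3), (4, 4)}))              # unmatched
print(accept_null(4, constraints, {(3, 3), (4, 4), (3, 4)}))
```

Rejecting the null transition while the span (3, 4) is still unbuilt reflects the argument in the text: moving past position 4 would force a later adjoin to break that constraint.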

4.2 Semantic Tag Layering

The two-layer structure of the semantic trees need not be enforced during training, simply L-matching [...] constraint. As explained above, for test sentences we can only specify the frame level constraint, leaving open the possibility of generating a tree whose semantic projection would contain more than two levels, i.e. nested slot level constituents. In order to avoid this, each tree in a given word-parse has two bits that describe whether the tree already contains a constituent whose semantic projection is a frame/slot level tag, respectively. An adjoin operation proposing a tag that violates the correct layering of frame/slot level tags can now be detected and discarded.
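The two-bit bookkeeping can be sketched as follows. The exact layering rules are not spelled out in the text, so the ones used here are assumptions: a slot-level tag may not be placed over a subtree that already contains a slot-level or frame-level constituent, and a frame-level tag may not be placed over a subtree that already contains a frame-level constituent. The names `Layer` and `adjoin_layers` are hypothetical.

```python
from typing import NamedTuple, Optional


class Layer(NamedTuple):
    has_frame: bool = False  # tree contains a frame-level constituent
    has_slot: bool = False   # tree contains a slot-level constituent


def adjoin_layers(left: Layer, right: Layer,
                  level: Optional[str]) -> Optional[Layer]:
    """Bits of the adjoined tree, or None if the proposed tag is illegal.

    level is None for a tag with void semantic projection, otherwise
    "slot" or "frame".
    """
    has_frame = left.has_frame or right.has_frame
    has_slot = left.has_slot or right.has_slot
    if level == "slot" and (has_slot or has_frame):
        return None  # would nest a slot or a frame under a slot
    if level == "frame" and has_frame:
        return None  # would nest a frame under a frame
    return Layer(has_frame or level == "frame", has_slot or level == "slot")


plain = Layer()
slot_tree = adjoin_layers(plain, plain, "slot")
print(slot_tree)                                   # one slot-level node
print(adjoin_layers(slot_tree, plain, "slot"))     # None: nested slots
print(adjoin_layers(slot_tree, slot_tree, "frame"))
```

Keeping only two bits per tree makes the check constant-time per adjoin operation, at the cost of not recording which constituents carry the tags.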

4.3 Interaction with Pruning

In the absence of pruning the search for the most likely parse satisfying the constraints for a given sentence becomes computationally intractable [4]. In practice, we are forced to use pruning techniques in order to limit the size of the search space. However, it is possible that during the left to right traversal of the sentence, the pruning scheme will keep alive only parses whose continuation cannot meet constraints that we have not encountered yet and no complete parse for the current sentence can be returned. In such cases, we back off to unconstrained parsing, i.e. regular SLM usage. In our experiments, we noticed that this was necessary for very few training sentences (1 out of 2,239) and relatively few test sentences (31 out of 1,101).

5 Comparison with Previous Work

The use of a syntactic parser augmented with semantic tags for information extraction from text is not a novel idea. The basic approach we described is very similar to the one presented in (Miller et al., 2000); however, there are a few major differences:

- in our approach the augmentation of the syntactic tags with semantic tags is straightforward due to the fact that the semantic constituents are matched exactly [5]. The approach in (Miller et al., 2000) needs to insert additional nodes in the syntactic tree to account for the semantic constituents that do not have a corresponding syntactic one. We believe our approach ensures tighter coupling between the syntactic and the semantic information in the final augmented trees.
- our constraint definition allows for a set of semantic tags to be matched on a given span.
- [...] trees is a structural constraint that is embedded in the operation of the SLM and thus can be guaranteed on test sentences.

[4] It is assumed that the constraints for a given sentence are consistent, namely there exists at least one parse that meets all of them.

[5] This is a consequence of the fact that the SLM generates [...]

The semantic annotation required by our task is much simpler than that employed by (Miller et al., 2000). One possibly beneficial extension of our work suggested by (Miller et al., 2000) would be to add semantic tags describing relations between entities (slots), in which case the semantic constraints would not be structured strictly on the two levels used in the current approach, respectively frame and slot level. However, this would complicate the task of data annotation, making it more expensive.

The same constrained EM variant employed for reestimating the model parameters has been used by (Pereira and Schabes, 1992) for training a purely syntactic parser, showing an increase in likelihood but no improvement in parsing accuracy.

6 Experiments

We have evaluated the model on manually annotated data for the MiPad (Huang et al., 2000) task. We have used 2,239 sentences (27,119 words) for training and 1,101 sentences (8,652 words) for test. There were 2,239/5,431 semantic frames/slots in the training data and 1,101/1,698 in the test data, respectively.

The word vocabulary size was 1,035, closed over the test data. The slot and frame vocabulary sizes were 79 and 3, respectively. The pre-terminal (POStag) vocabulary sizes were 64 and 144 for training steps 2 and 4 (see Section 3), respectively; the non-terminal (NTtag) vocabulary sizes were 61 and 540 for training steps 2 and 4 (see Section 3), respectively. We have used the NLPwin (Heidorn, 1999) parser to obtain the MiPad syntactic treebank needed for initializing the SLM at training step 1.

                      Perplexity
Training Stage   It   Training Set   Test Set
2 (matched)      0    9.27           34.81
2 (matched)      1    5.81           31.25
2 (matched)      2    5.51           31.41
4 (L-matched)    0    4.71           24.39
4 (L-matched)    1    4.61           24.73
4 (L-matched)    2    4.56           24.88

Table 1: Likelihood Evolution during Training

Although not guaranteed theoretically, the N-best EM variant used for the SLM parameter reestimation increases the likelihood of the training data with each iteration when the parser is run in both [...] evolution of the training and test data perplexities (calculated using the probability assignment in Eq. 1) during the constrained training steps 2 and 4.

The training data perplexity decreases monotonically during both training steps whereas the test data perplexity doesn't decrease monotonically in either case. We attribute this discrepancy between the evolution of the likelihood on the training and test corpora to the different constrained settings for the SLM.

The most important performance measure is the slot/frame error rate. To measure it, we use manually created parses which consist of frame-level labels and slot-level labels and spans as reference. A frame-level error is caused by a frame label of the hypothesis parse which is different from the frame label of the reference. In order to calculate the slot-level errors, we create a set of slot label and slot span pairs for the reference and hypothesis parse, respectively. The number of slot errors is then the minimum edit distance between these two sets using the substitution, insertion and deletion operations on the elements of the set.
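The slot error count described above can be sketched with a standard edit distance. One detail is an assumption of this example: the (label, span) pairs are ordered by span position before the distance is computed, and every substitution, insertion and deletion costs 1. The function name and the toy data are hypothetical.

```python
from typing import List, Sequence, Tuple

SlotPair = Tuple[str, Tuple[int, int]]  # (slot label, (start, end))


def slot_errors(reference: Sequence[SlotPair],
                hypothesis: Sequence[SlotPair]) -> int:
    """Minimum number of substitutions, insertions and deletions."""
    ref = sorted(reference, key=lambda pair: pair[1])
    hyp = sorted(hypothesis, key=lambda pair: pair[1])
    # prev[j] is the distance between the processed reference prefix
    # and the first j hypothesis elements.
    prev: List[int] = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference slot
                cur[j - 1] + 1,          # insertion of a hypothesis slot
                prev[j - 1] + (r != h),  # substitution, free on exact match
            ))
        prev = cur
    return prev[-1]


reference = [
    ("ByFullName*Person", (3, 4)),
    ("SubjectByWildCard*Subject", (6, 7)),
    ("PreciseTime*Time", (9, 11)),
]
hypothesis = [
    ("ByFullName*Person", (3, 4)),
    ("PreciseTime*Time", (9, 10)),  # right label, wrong span
]
print(slot_errors(reference, hypothesis))
```

A slot counts as correct only when both label and span agree, so the hypothesis above incurs one deletion (the missing subject slot) and one substitution (the time slot with a shortened span).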

Table 2 shows the error rate on training and test data at different stages during training. The last column of test data results (Test-L1) shows the results obtained by assuming that the user has specified the identity of the frame, and thus the frame level constraint contains only the correct semantic tag. This is a plausible scenario if the user has the possibility to choose the frame using a different input modality such as a stylus. The error rates on the training data were calculated by running the model with the same constraint as on the test data, i.e. constraining the set of allowable tags at the frame level. This could be seen as an upper bound on the performance of the model (since the model parameters were estimated on the same data).

Our model significantly outperforms the baseline model (a simple semantic context free grammar authored manually for the MiPad task) in terms of slot error rate (about 35% relative reduction in slot error rate) but it is outperformed by the latter in terms of frame error rate. When running the models from training step 2 on test data one cannot add any constraints; only frame level constraints can be used when evaluating the models from training step 4 on test data. N-best reestimation at either training stage (2 or 4) doesn't improve the accuracy of the system, although the results obtained by initializing the model using the reestimated stage 2 model (the iteration 2-{0,1,2} models) tend to be slightly better than their 0-{0,1,2} counterparts. Constraining the frame level tag to have the correct value doesn't significantly reduce the slot error rate in either [...] results.

6.1 Out-of-domain Initial Statistics

Recent results (Chelba, 2001) on the portability of syntactic structure within the SLM framework show that it is possible to initialize the SLM parameters from a treebank for out-of-domain text and maintain the same language modeling performance. We have repeated the experiment in the context of information extraction.

Similar to the approach in (Miller et al., 2000) we initialized the SLM statistics from the UPenn Treebank parse trees (about 1M words of training data) at the first training stage, see Section 3. The remaining part of the training procedure was the same as in the previous set of experiments.

The word, slot and frame vocabularies were the same as in the previous set of experiments. The pre-terminal (POStag) vocabulary sizes were 40 and 204 for training steps 2 and 4 (see Section 3), respectively; the non-terminal (NTtag) vocabulary sizes were 52 and 434 for training steps 2 and 4 (see Section 3), respectively.

The results are presented in Table 3, showing improved performance over the model initialized from in-domain parse trees. The frame accuracy increases substantially, almost matching that of the baseline model, while the slot accuracy is just slightly increased. We attribute the improved performance of the model initialized from the UPenn Treebank to the fact that the model explores a more diverse set of trees for a given sentence than the model initialized from the MiPad automatic treebank generated using the NLPwin parser.

6.2 Impact of Training Data Size on Performance

We have also evaluated the impact of the training data size on the model performance. The results are presented in Table 4, showing a strong dependence of both the slot and frame error rates on the amount of training data used. This, together with the high accuracy of the model on training data (see Table 3), suggests that we are far from saturation in performance and that more training data is very likely to improve the model performance substantially.

6.3 Error Trends

As a summary error analysis, we have investigated the correlation between the semantic frame/slot error rate and the number of semantic slots in a sentence. We have binned the sentences in the test set according to the number of slots in the manual annotation and evaluated the frame/slot error rate in each bin. The results are shown in Table 5.

The frame/slot accuracy increases with the number of slots per sentence (except for the 5+ bin where the frame error rate increases), showing that slot co-occurrence statistics improve performance; sentences containing more semantic slots tend to be less ambiguous from an information extraction point of view.

[6] The frame error rate in this column should be 0; in practice this doesn't happen because some test sentences could [...]

Training It         Error Rate (%)
                    Training        Test            Test-L1
Stage 2   Stage 4   Slot    Frame   Slot    Frame   Slot    Frame
Baseline            43.41   7.20    57.36   14.90   57.30   6.90
0         0         9.78    1.65    37.87   21.62   37.46   0.64
0         1         10.36   1.20    39.16   21.80   38.28   0.64
0         2         9.42    1.05    39.75   22.25   38.63   0.82
2         0         8.92    1.25    38.04   22.07   37.81   0.91
2         1         9.01    0.95    37.51   21.89   37.28   0.91
2         2         9.47    0.90    38.99   21.89   38.57   0.82

Table 2: Training and Test Data Slot/Frame Error Rates

Training It                 Error Rate (%)
                            Training        Test            Test-L1
Stage 2           Stage 4   Slot    Frame   Slot    Frame   Slot    Frame
Baseline                    43.41   7.20    57.36   14.90   57.30   6.90
0, MiPad/NLPwin   0         9.78    1.65    37.87   21.62   37.46   0.64
1, UPenn Trbnk    0         8.44    2.10    36.93   16.08   36.34   0.91
1, UPenn Trbnk    1         7.82    1.70    36.98   16.80   36.22   0.82
1, UPenn Trbnk    2         7.69    1.50    36.98   16.80   36.22   1.00

Table 3: Training and Test Data Slot/Frame Error Rates, UPenn Treebank initial statistics

Training  Training It                 Error Rate (%)
Corpus                                Training        Test            Test-L1
Size      Stage 2           Stage 4   Slot    Frame   Slot    Frame   Slot    Frame
Baseline                              43.41   7.20    57.36   14.90   57.30   6.90
all       1, UPenn Trbnk    0         8.44    2.10    36.93   16.08   36.34   0.91
1/2 all   1, UPenn Trbnk    0         -       -       43.76   18.44   43.40   0.45
1/4 all   1, UPenn Trbnk    0         -       -       49.47   22.98   49.53   1.82

Table 4: Performance Degradation with Training Data Size

                 Error Rate (%)
No. slots/sent   Slot    Frame   No. Sent
1                43.97   18.01   755
2                39.23   16.27   209
3                26.44   5.17    58
4                26.50   4.00    50
5+               21.19   6.90    29

Table 5: Frame/Slot Error Rate versus Slot Density
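As a quick consistency check on Table 5, the per-bin frame error rates can be recombined into an overall test frame error rate by weighting each bin with its number of sentences (each sentence carries exactly one frame). The snippet only re-uses the numbers printed in the table; the variable names are hypothetical.

```python
# (slots per sentence, slot error %, frame error %, number of sentences)
table5 = [
    ("1", 43.97, 18.01, 755),
    ("2", 39.23, 16.27, 209),
    ("3", 26.44, 5.17, 58),
    ("4", 26.50, 4.00, 50),
    ("5+", 21.19, 6.90, 29),
]

total_sentences = sum(n for _, _, _, n in table5)
frame_error = sum(frame * n for _, _, frame, n in table5) / total_sentences

print(total_sentences)
print(round(frame_error, 2))
```

The bins add up to the 1,101 test sentences, and the weighted frame error rate comes out within rounding of the 16.08% test frame error reported in Table 3 for the UPenn-initialized model at stage 4, iteration 0. This suggests, although the text does not say so explicitly, that the bins were computed with that model.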

7 Conclusions and Future Directions

We have presented a data-driven approach to information extraction that, despite the small amount of training data used, is shown to outperform the slot level accuracy of a simple semantic grammar authored manually for the MiPad (personal information management) task.

The performance of the baseline model could be improved with more authoring effort, although this is expensive.

The big difference in performance between training and test, and the fact that we are using so little training data, make improvements by using more training data very likely, although this may be expensive. A framework which utilizes the vast amounts of text data collected once such a system is deployed would be desirable. Statistical modeling techniques that make more effective use of the training data should be used in the SLM, maximum entropy (Berger et al., 1996) being a good candidate.

[...] impact of incorporating the semantic constraints on the word-level accuracy of the system. Another possible research direction is to modify the framework such that it finds the most likely semantic parse given the acoustics, thus treating the word sequence as a hidden variable.

References

A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39-72, March.

Ciprian Chelba and Frederick Jelinek. 2000. Structured language modeling. Computer Speech and Language, 14(4):283-332, October.

Ciprian Chelba. 2001. Portability of syntactic structure for language modeling. In Proceedings of ICASSP, page to appear. Salt Lake City, Utah.

A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. In Journal of the Royal Statistical Society, volume 39 of B, pages 1-38.

George Heidorn. 1999. Intelligent writing assistance. In R. Dale, H. Moisl, and H. Somers, editors, Handbook of Natural Language Processing. Marcel Dekker, New York.

X. Huang, A. Acero, C. Chelba, L. Deng, D. Duchene, J. Goodman, H. Hon, D. Jacoby, L. Jiang, R. Loynd, M. Mahajan, P. Mau, S. Meredith, S. Mughal, S. Neto, M. Plumpe, K. Wang, and Y. Wang. 2000. MiPad: A next generation PDA prototype. In ICSLP'00, Proceedings, Beijing, China.

Daniel Jurafsky and James H. Martin. 2000. An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, pages 577-583. Prentice Hall.

M. Marcus, B. Santorini, and M. Marcinkiewicz. 1993. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313-330.

Scott Miller, Heidi Fox, Lance Ramshaw, and Ralph Weischedel. 2000. A novel use of statistical parsing to extract information from text. In Proceedings of ANLP-NAACL, pages 226-233. Seattle, Washington.

Fernando Pereira and Yves Schabes. 1992. Inside-outside reestimation from partially bracketed corpora. ACL, 30:128-135.

Y.-Y. Wang. 1999. A robust parser for spoken language understanding. In Eurospeech'99