Ciprian Chelba and Milind Mahajan
Microsoft Research
Microsoft Corporation
One Microsoft Way, Redmond, WA 98052
{chelba,milindm}@microsoft.com
Abstract
The paper presents a data-driven approach to information extraction (viewed
as template filling) using the structured language model (SLM) as a
statistical parser. The task of template filling is cast as constrained
parsing using the SLM. The model is automatically trained from a set of
sentences annotated with frame/slot labels and spans. Training proceeds in
stages: first a constrained syntactic parser is trained such that the parses
on training data meet the specified semantic spans, then the non-terminal
labels are enriched to contain semantic information and finally a constrained
syntactic+semantic parser is trained on the parse trees resulting from the
previous stage. Despite the small amount of training data used, the model is
shown to outperform the slot level accuracy of a simple semantic grammar
authored manually for the MiPad (personal information management) task.
1 Introduction
Information extraction from text can be characterized as template filling
(Jurafsky and Martin, 2000): a given template or frame contains a certain
number of slots that need to be filled in with segments of text. Typically
not all the words in text are relevant to a particular frame. Assuming that
the segments of text relevant to filling in the slots are non-overlapping
contiguous strings of words, one can represent the semantic frame as a simple
semantic parse tree for the sentence to be processed. The tree has two
levels: the root node is tagged with the frame label and spans the entire
sentence; the leaf nodes are tagged with the slot labels and span the strings
of words relevant to the corresponding slot.
Consider the semantic parse S for a sentence W presented in Fig. 1.
CalendarTask is the frame tag, spanning the entire sentence; the remaining
ones are slot tags with their corresponding spans.

(CalendarTask schedule meeting with
  (ByFullName*Person megan hokins) about
  (SubjectByWildCard*Subject internal lecture)
  at (PreciseTime*Time two thirty p.m.))
Figure 1: Sample sentence and semantic parse
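
The two-level semantic parse of Figure 1 can be held in a very small data structure. The sketch below is only an illustration: the class names `Slot` and `SemanticParse`, the `slot_text` helper and the 0-based inclusive word indices are assumptions made here, not part of the paper.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Slot:
    label: str  # slot tag, e.g. "ByFullName*Person"
    start: int  # index of the first word of the span (inclusive)
    end: int    # index of the last word of the span (inclusive)


@dataclass(frozen=True)
class SemanticParse:
    frame: str               # frame tag; spans the entire sentence
    words: Tuple[str, ...]   # the word sequence W
    slots: Tuple[Slot, ...]  # non-overlapping contiguous spans

    def slot_text(self, slot: Slot) -> str:
        """Return the words covered by a slot."""
        return " ".join(self.words[slot.start:slot.end + 1])


# The sentence of Figure 1, with 0-based word indices (an assumption).
words = tuple(
    "schedule meeting with megan hokins about internal lecture "
    "at two thirty p.m.".split()
)
parse = SemanticParse(
    frame="CalendarTask",
    words=words,
    slots=(
        Slot("ByFullName*Person", 3, 4),
        Slot("SubjectByWildCard*Subject", 6, 7),
        Slot("PreciseTime*Time", 9, 11),
    ),
)

for slot in parse.slots:
    print(slot.label, "->", parse.slot_text(slot))
```

Words that fall outside every slot span ("schedule", "meeting", "with", "about", "at") belong to the frame only, which is what makes the tree exactly two levels deep.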
In the MiPad scenario (Huang et al., 2000), essentially a personal
information management (PIM) task, there is a module that is able to convert
the information extracted according to the semantic parse into specific
actions. In this case the action is to schedule a calendar appointment.
We view the problem of information extraction as the recovery of the
two-level semantic parse S for a given word sequence W.
We propose a data-driven approach to information extraction that uses the
structured language model (SLM) (Chelba and Jelinek, 2000) as an automatic
parser. The parser is constrained to explore only parses that contain pre-set
constituents, each spanning a given word string and bearing a tag in a given
set of semantic tags. The constraints available during training and test are
different, the test case constraints being more relaxed as explained in
Section 4.
The main advantage of the approach is that it doesn't require any grammar
authoring expertise. The approach is fully automatic once the annotated
training data is provided; it does assume that an application schema (i.e.
frame and slot structure) has been defined but does not require semantic
grammars that identify word-sequence to slot or frame mapping. However, the
process of converting the word sequence corresponding to a slot into
actionable canonical forms (e.g. converting "half past two in the afternoon"
into "2:30 p.m.") may require grammars. The design of the frames (what
information is relevant for taking a certain action, what slot/frame tags are
to be used; see (Wang, 1999)) is a delicate task that we will not be
concerned with for the purposes of this paper.
The remainder of the paper is organized as follows: Section 2 reviews the
structured language model (SLM), followed by Section 3, which describes in
detail the training procedure, and Section 4, which defines the operation of
the SLM as a constrained parser and presents the necessary modifications to
[...] have carried out. We conclude with Section 7.
2 Structured Language Model
We proceed with a brief review of the structured language model (SLM); an
extensive presentation of the SLM can be found in (Chelba and Jelinek, 2000).
The model assigns a probability $P(W,T)$ to every sentence W and its every
possible binary parse T. The terminals of T are the words of W with POStags,
and the nodes of T are annotated with phrase headwords and non-terminal
labels. Let W be a sentence of length n words to which we have prepended the
sentence beginning marker <s> and appended the sentence end marker </s> so
that $w_0 =$ <s> and $w_{n+1} =$ </s>. Let $W_k = w_0 \ldots w_k$ be the word
k-prefix of the sentence (the words from the beginning of the sentence up to
the current position k) and $W_k T_k$ the word-parse k-prefix. Figure 2 shows
a word-parse k-prefix; $h_0 \ldots h_{-m}$ are the exposed heads, each head
being a pair (headword, non-terminal label), or (word, POStag) in the case of
a root-only tree. The exposed heads at a given position k in the input
sentence are a function of the word-parse k-prefix.

Figure 2: A word-parse k-prefix. The word-level sequence is
(<s>, SB) ... (w_p, t_p) (w_{p+1}, t_{p+1}) ... (w_k, t_k) w_{k+1} ... </s>;
the exposed heads run from h_0 = (h_0.word, h_0.tag) through h_{-1} to
h_{-m} = (<s>, SB).
2.1 Probabilistic Model
The joint probability $P(W,T)$ of a word sequence W and a complete parse T
can be broken into:

$$P(W,T) = \prod_{k=1}^{n+1} \Big[ P(w_k \mid W_{k-1}T_{k-1}) \cdot P(t_k \mid W_{k-1}T_{k-1}, w_k) \cdot \prod_{i=1}^{N_k} P(p_i^k \mid W_{k-1}T_{k-1}, w_k, t_k, p_1^k \ldots p_{i-1}^k) \Big]$$

where:
- $W_{k-1}T_{k-1}$ is the word-parse (k-1)-prefix
- $w_k$ is the word predicted by WORD-PREDICTOR
- $t_k$ is the tag assigned to $w_k$ by the TAGGER
- $N_k - 1$ is the number of operations the PARSER executes at sentence
  position k before passing control to the WORD-PREDICTOR (the $N_k$-th
  operation at position k is the null transition); $N_k$ is a function of T
- $p_i^k$ denotes the i-th PARSER operation carried out at position k in the
  word string; the opera[...] binary branching parses with all possible
  headword and non-terminal label assignments for the $w_1 \ldots w_k$ word
  sequence can be generated. The $p_1^k \ldots p_{N_k}^k$ sequence of PARSER
  operations at position k grows the word-parse (k-1)-prefix into a
  word-parse k-prefix.
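Figure 3: Result of adjoin-left under NTlabel. The two rightmost trees T_{-1}
and T_0 are joined; the new exposed head is h'_0 = (h_{-1}.word, NTlabel),
h'_{-1} = h_{-2}, and the remaining trees shift accordingly
(T'_{-1} <- T_{-2}, ..., T'_{-m+1} <- <s>).
Figure 4: Result of adjoin-right under NTlabel. The two rightmost trees T_{-1}
and T_0 are joined; the new exposed head is h'_0 = (h_0.word, NTlabel),
h'_{-1} = h_{-2}, and the remaining trees shift accordingly
(T'_{-1} <- T_{-2}, ..., T'_{-m+1} <- <s>).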
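
The effect of the two adjoin operations on the exposed heads, as labeled in Figures 3 and 4, can be sketched as follows. This is a minimal illustration that tracks only the list of exposed heads, not the trees themselves; the function names and the list representation (last element is h_0) are assumptions made here.

```python
from typing import List, Tuple

Head = Tuple[str, str]  # (headword, tag)


def adjoin_left(heads: List[Head], nt_label: str) -> List[Head]:
    """Join the two rightmost trees; the headword percolates from h_{-1}.

    heads[-1] is h_0 and heads[-2] is h_{-1}.
    """
    *rest, h_minus1, _h0 = heads
    return rest + [(h_minus1[0], nt_label)]


def adjoin_right(heads: List[Head], nt_label: str) -> List[Head]:
    """Join the two rightmost trees; the headword percolates from h_0."""
    *rest, _h_minus1, h0 = heads
    return rest + [(h0[0], nt_label)]


heads = [("<s>", "SB"), ("internal", "JJ"), ("lecture", "NN")]
print(adjoin_right(heads, "NP"))
print(adjoin_left(heads, "NP"))
```

In both cases the list of exposed heads shrinks by one, so the old h_{-2} becomes the new h_{-1}, as in the figures.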
Our model is based on three probabilities, each estimated using deleted
interpolation and parameterized (approximated) as follows:

$$P(w_k \mid W_{k-1}T_{k-1}) \doteq P(w_k \mid h_0, h_{-1})$$
$$P(t_k \mid w_k, W_{k-1}T_{k-1}) \doteq P(t_k \mid w_k, h_0, h_{-1})$$
$$P(p_i^k \mid W_k T_k) \doteq P(p_i^k \mid h_0, h_{-1})$$
It is worth noting that if the binary branching structure developed by the
parser were always right-branching and we mapped the POStag and non-terminal
label vocabularies to a single type then our model would be equivalent to a
trigram language model. Since the number of parses for a given word prefix
$W_k$ grows exponentially with k, $|\{T_k\}| \sim O(2^k)$, the state space of
our model is huge even for relatively short sentences, so we had to use a
search strategy that prunes it. Our choice was a synchronous multi-stack
search algorithm which is very similar to a beam search.
The language model probability assignment for the word at position k+1 in the
input sentence is made using:

$$P(w_{k+1} \mid W_k) = \sum_{T_k \in S_k} P(w_{k+1} \mid W_k T_k) \cdot \rho(W_k T_k), \qquad \rho(W_k T_k) = P(W_k T_k) \Big/ \sum_{T_k \in S_k} P(W_k T_k) \qquad (1)$$

which ensures a proper probability over strings $W^*$, where $S_k$ [...]
Each model component (WORD-PREDICTOR, TAGGER, PARSER) is initialized from a
set of parsed sentences after undergoing headword percolation and
binarization. Separately for each model component we:
- gather counts from "main" data (about 90% of the training data)
- estimate the interpolation coefficients on counts gathered from "check"
  data (the remaining 10% of the training data).
An N-best EM (Dempster et al., 1977) variant is then employed to jointly
re-estimate the model parameters such that the likelihood of the training
data under our model is increased.
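
The word-level probability assignment of Eq. (1) amounts to interpolating the per-parse word predictions with weights proportional to the probability of each retained word-parse prefix. The sketch below illustrates this on toy numbers; the function name, the dictionary representation of the set S_k and all probability values are hypothetical.

```python
from typing import Callable, Dict, Hashable


def next_word_prob(
    word: str,
    prefix_probs: Dict[Hashable, float],
    word_given_parse: Callable[[str, Hashable], float],
) -> float:
    """Eq. (1): sum over T_k in S_k of P(w | W_k T_k) * rho(W_k T_k).

    prefix_probs maps each retained parse T_k to P(W_k T_k); rho is that
    probability normalized over the retained set.
    """
    total = sum(prefix_probs.values())
    return sum(
        word_given_parse(word, parse) * prob / total
        for parse, prob in prefix_probs.items()
    )


# Toy example with two retained prefix parses (made-up numbers).
prefix_probs = {"T1": 0.006, "T2": 0.002}
table = {("lecture", "T1"): 0.5, ("lecture", "T2"): 0.1}
p = next_word_prob("lecture", prefix_probs, lambda w, t: table[(w, t)])
print(p)
```

Here the normalized weights are 0.75 and 0.25, so the prediction is 0.5 * 0.75 + 0.1 * 0.25.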
3 Training Procedure
This section describes the training procedure for the SLM when applied to
information extraction and introduces the modifications that need to be made
to the SLM operation.
The training of the model proceeds in four stages:
1. initialize the SLM as a syntactic parser for the domain we are interested
   in. A general purpose parser (such as NLPwin (Heidorn, 1999)) can be used
   to generate a syntactic treebank from which the SLM parameters can be
   initialized. Another possibility for initializing the SLM is to use a
   treebank for out-of-domain data (such as the UPenn Treebank (Marcus et
   al., 1993)); see Section 6.1.
2. train the SLM as a matched constrained parser. At this step the parser is
   going to propose a set of N syntactic binary parses for a given word
   string (N-best parsing), all matching the constituent boundaries specified
   by the semantic parse: a parse T is said to match the semantic parse S,
   denoted $T \ni S$, if and only if the set of unlabeled constituents that
   define S is included in the set of constituents that define T.
   At this time only the constituent span information in S is taken into
   account.
3. enrich the non-terminal and pre-terminal labels of the resulting parses
   with the semantic tags (frame and slot) present in the semantic parse,
   thus expanding the vocabulary of non-terminal and pre-terminal tags used
   by the syntactic parser to include semantic information alongside the
   usual syntactic tags.
4. train the SLM as an L(abel)-matched constrained [...] constituent labels
   are taken into account too, which means that a parse P, containing both
   syntactic and semantic information, is said to L(abeled)-match S if and
   only if the set of labeled semantic constituents that defines S is
   identical to the set of semantic constituents that defines P. If we let
   SEM(P) denote the function that maps a tree P containing both syntactic
   and semantic information to the tree containing only semantic information,
   referred to as the semantic projection of P, then all the parses
   $P_i, \forall i < N$, proposed by the SLM for a given sentence W, L-match
   S and thus satisfy $SEM(P_i) = S, \forall i < N$.
   The semantic tree S has a two level structure so the above requirement can
   be satisfied only if the parses SEM(P) proposed by the SLM are also on two
   levels, frame and slot level respectively. We have incorporated this
   constraint into the operation of the SLM; see Section 4.2.
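
The match and L-match relations used in stages 2 and 4 reduce to a subset test and an equality test on sets of constituents. The following sketch assumes a flat representation chosen here for illustration: unlabeled constituents are (start, end) pairs, and labeled constituents are (syntactic tag, semantic tag or None, start, end) tuples. The function names are hypothetical.

```python
from typing import Optional, Set, Tuple

Span = Tuple[int, int]
SemConstituent = Tuple[str, int, int]                # (semantic tag, start, end)
Constituent = Tuple[str, Optional[str], int, int]    # (syn tag, sem tag, start, end)


def matches(tree_spans: Set[Span], semantic_spans: Set[Span]) -> bool:
    """T matches S: every unlabeled constituent of S is a constituent of T."""
    return semantic_spans <= tree_spans


def sem(parse: Set[Constituent]) -> Set[SemConstituent]:
    """Semantic projection: keep only constituents carrying a semantic tag."""
    return {(s, l, r) for (_syn, s, l, r) in parse if s is not None}


def l_matches(parse: Set[Constituent], semantic_parse: Set[SemConstituent]) -> bool:
    """P L-matches S: the labeled semantic constituents are identical."""
    return sem(parse) == semantic_parse


# Toy sentence "meeting with megan hokins", words indexed 0..3.
S = {("CalendarTask", 0, 3), ("ByFullName*Person", 2, 3)}
S_spans = {(l, r) for (_tag, l, r) in S}

good_tree = {(0, 3), (1, 3), (2, 3)}   # right-branching, keeps (2, 3) intact
bad_tree = {(0, 3), (0, 2), (0, 1)}    # left-branching, splits "megan hokins"

P = {
    ("S", "CalendarTask", 0, 3),
    ("PP", None, 1, 3),
    ("NP", "ByFullName*Person", 2, 3),
}

print(matches(good_tree, S_spans), matches(bad_tree, S_spans), l_matches(P, S))
```

The purely syntactic PP node in `P` disappears under the projection, which is why `P` L-matches the two-level tree `S`.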
The model thus trained is then used to parse test sentences and recover the
semantic parse using $S = SEM(\arg\max_{P_i} P(P_i, W))$. In principle, one
should sum over all the parses P that yield the same semantic parse S and
then choose
$S = \arg\max_S \sum_{P_i \ \mathrm{s.t.}\ SEM(P_i) = S} P(P_i, W)$.
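
The difference between the two decision rules above can be seen on an N-best list. The sketch below is an illustration only: it assumes each N-best entry has already been reduced to a pair (semantic projection, joint probability), and the names and numbers are made up.

```python
from collections import defaultdict
from typing import Hashable, Iterable, Tuple

Entry = Tuple[Hashable, float]  # (SEM(P_i), P(P_i, W))


def best_parse_rule(nbest: Iterable[Entry]) -> Hashable:
    """S = SEM(argmax_i P(P_i, W)): projection of the single best parse."""
    return max(nbest, key=lambda entry: entry[1])[0]


def summed_rule(nbest: Iterable[Entry]) -> Hashable:
    """S = argmax_S of the total probability of parses projecting to S."""
    totals = defaultdict(float)
    for projection, prob in nbest:
        totals[projection] += prob
    return max(totals, key=totals.get)


# Toy 3-best list: two parses share the projection "S_b".
nbest = [("S_a", 0.40), ("S_b", 0.30), ("S_b", 0.25)]
print(best_parse_rule(nbest), summed_rule(nbest))
```

In this toy case the rules disagree: the single best parse projects to "S_a", while the parses projecting to "S_b" carry more probability mass in total.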
A few iterations of the N-best EM variant (see Section 2) were run at each of
the second and fourth steps in the training procedure. The constrained parser
operation makes this an EM variant where the hidden space (the possible parse
trees for a given sentence) is a priori limited by the semantic constraints
to a subset of the hidden space of the unrestricted model. At test time we
wish to recover the most likely subset of the hidden space consistent with
the constraints imposed on the sentence.
To be more specific, during the second training stage, the E-step of the
reestimation procedure will only explore syntactic trees (hidden events) that
match the semantic parse; the fourth stage E-steps will consider hidden
events that are constrained even further to L-match the semantic parse. We
have no proof that this procedure should lead to better performance in terms
of slot/frame accuracy but intuitively one expects it to place more and more
probability mass on the desirable trees, that is, the trees that are
consistent with the semantic annotation. This is confirmed experimentally by
the fact that the likelihood of the training word sequence (observable),
calculated by Eq. (1) where the sum runs over the parse trees that
match/L-match the semantic constraints, does increase at every training step,
as presented in Section 6, Table 1. However, [...] with a decrease in error
rate on the training data, see Tables 2 and 3 in Section 6.
4 Constrained Parsing Using the Structured Language Model
We now detail the constrained operation of the SLM (matched and L-matched
parsing) used at the second and fourth steps of the training procedure
described in the previous section.
A semantic parse S for a given sentence W consists of a set of constituent
boundaries along with semantic tags. When parsing the sentence using the
standard formulation of the SLM, one obtains binary parses that are not
guaranteed to match the semantic parse S, i.e. the constituents proposed by
the SLM may cross semantic constituent boundaries; for the constituents
matching the semantic constituent boundaries, the labels proposed may not be
the desired ones.
To fix terminology, we define a constrained constituent (or simply a
constraint) c to be a span together with a set [2] of allowable tags for the
span: $c = \langle l, r, Q \rangle$ where l is the left boundary of the
constraint, r is the right boundary of the constraint and Q is the set of
allowable non-terminal tags for the constraint.
A semantic parse can be viewed as a set of constraints; for each constraint
the set of allowable non-terminal tags Q contains a single element,
respectively the semantic tag for each constituent. An additional fact to be
kept in mind is that the semantic parse tree consists of exactly two levels:
the frame level (root semantic tag) and the slot level (leaf semantic tags).
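
Viewing a semantic parse as a set of constraints c = <l, r, Q> can be sketched as below. The class and function names, and the 0-based inclusive word indices that ignore the sentence boundary markers, are assumptions made for this illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Sequence, Tuple


@dataclass(frozen=True)
class Constraint:
    l: int              # left boundary (index of first word, inclusive)
    r: int              # right boundary (index of last word, inclusive)
    Q: FrozenSet[str]   # allowable non-terminal tags for the span


def constraints_from_semantic_parse(
    frame: str, n_words: int, slots: Sequence[Tuple[str, int, int]]
) -> List[Constraint]:
    """One constraint for the frame (whole sentence) plus one per slot.

    Each tag set Q is a singleton, as is the case for annotated data.
    """
    constraints = [Constraint(0, n_words - 1, frozenset({frame}))]
    constraints += [Constraint(l, r, frozenset({tag})) for tag, l, r in slots]
    return constraints


# Training-time constraints for the 12-word sentence of Figure 1.
train_constraints = constraints_from_semantic_parse(
    "CalendarTask",
    12,
    [
        ("ByFullName*Person", 3, 4),
        ("SubjectByWildCard*Subject", 6, 7),
        ("PreciseTime*Time", 9, 11),
    ],
)

# A root-only constraint whose tag set holds several candidate frame tags.
root_only = Constraint(
    0, 11, frozenset({"CalendarTask", "ContactsTask", "MailTask"})
)

print(len(train_constraints), sorted(root_only.Q))
```

Allowing Q to hold more than one tag is what lets the same structure express the looser root-only case, where only the set of candidate frame tags is known.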
During training, we wish to constrain the SLM operation such that it
considers only parses that match the constraints $c_i, i = 1 \ldots C$ as it
proceeds left to right through a given sentence W. In light of the training
procedure sketched in the introduction, we consider two flavors of
constrained parsing, one in which we only generate parses that match the
constraint boundaries and another in which we also enforce that the proposed
tag for every matching constituent is among the constrained set of
non-terminal tags $c_i.Q$ (L(abeled)-match constrained parsing).
The only constraints available for the test sentences are:
- the semantic tag of the root node, which spans the entire sentence, must
  be in the set of frame tags. If it were a test sentence, the example in
  Figure 1 would have the following semantic parse (constraints):
  ({CalendarTask, ContactsTask, MailTask} [...] internal lecture to two
  thirty p.m.)
- the semantic projection of the trees proposed by the SLM must have exactly
  two levels; this constraint is built into the operation of the L-match
  parser.
[2] The set of allowable tags must contain at least one element.
The next section will describe the constrained parsing algorithm. Section 4.2
will describe further changes that the algorithm uses to produce only parses
P whose semantic projection SEM(P) has exactly two levels, frame (root) and
slot (leaf) level, respectively; this applies only in the L-match case. We
conclude with Section 4.3, explaining how the constrained parsing algorithm
interacts with the pruning of the SLM search space for the most likely parse.
4.1 Match and L-match SLM Parsing
The trees produced by the SLM are binary trees. The tags annotating the nodes
of the tree are purely syntactic (during the second training stage) or
syntactic+semantic (during the last training stage or at test time). It can
be proved that satisfying the following two conditions at each position k in
the input sentence ensures that all the binary trees generated by the SLM
parsing algorithm match the pre-set constraints $c_i, i = 1 \ldots C$ as it
proceeds left to right through the input sentence $W = w_0 \ldots w_{n+1}$.
- for a given word-parse k-prefix $W_k T_k$ (see Section 2) accept an adjoin
  transition if and only if:
  1. the resulting constituent does not violate any of the constraints
     $c_i, i = 1 \ldots C$
  2. L-match parsing only: if the semantic projection of the non-terminal tag
     SEM(NTtag) proposed by the adjoin operation is non-void then the newly
     created constituent must L-match an existing constraint,
     $\exists\, c_i$ s.t. $SEM(NTtag) \in c_i.Q$.
- for a given word-parse k-prefix $W_k T_k$ (see Section 2) accept the null
  transition if and only if all the constraints $c_i$ whose right boundary is
  equal to the current word index k, $c_i.r = k$, have been matched. If these
  constraints remain unmatched they will be broken at a later time during the
  process of completing the parse for the current sentence W: there will be
  an adjoin operation involving a constituent to the right of the current
  position that will break all the constraints ending at the current
  position k.
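
The two acceptance conditions can be sketched as simple predicates. This is an illustration under stated assumptions, not the SLM implementation: a constituent is taken to violate a constraint when their spans cross (overlap without nesting), L-matching a constraint is taken to mean having the same span and a semantic tag in Q, and the names `Constraint`, `crosses`, `accept_adjoin` and `accept_null` are introduced here.

```python
from typing import FrozenSet, NamedTuple, Optional, Sequence, Set, Tuple


class Constraint(NamedTuple):
    l: int             # left boundary (inclusive)
    r: int             # right boundary (inclusive)
    Q: FrozenSet[str]  # allowable tags


def crosses(span: Tuple[int, int], c: Constraint) -> bool:
    """True if the spans overlap but neither contains the other (assumed
    meaning of 'violate')."""
    l, r = span
    return (l < c.l <= r < c.r) or (c.l < l <= c.r < r)


def accept_adjoin(
    span: Tuple[int, int],
    sem_tag: Optional[str],
    constraints: Sequence[Constraint],
    labeled: bool,
) -> bool:
    """Condition 1 always; condition 2 only for L-match parsing."""
    if any(crosses(span, c) for c in constraints):
        return False
    if labeled and sem_tag is not None:
        return any(
            (c.l, c.r) == span and sem_tag in c.Q for c in constraints
        )
    return True


def accept_null(
    k: int, constraints: Sequence[Constraint], matched: Set[Constraint]
) -> bool:
    """Allow moving past word k only if every constraint ending at k is matched."""
    return all(c in matched for c in constraints if c.r == k)


cs = [
    Constraint(0, 11, frozenset({"CalendarTask"})),
    Constraint(3, 4, frozenset({"ByFullName*Person"})),
]
print(accept_adjoin((2, 3), None, cs, labeled=False))   # crosses (3, 4)
print(accept_adjoin((3, 4), "ByFullName*Person", cs, labeled=True))
print(accept_null(4, cs, matched=set()))
```

Rejecting the null transition early is a pruning device: once the parser moves past position k, any constituent covering an unmatched constraint that ends at k must also include words to its right, so that constraint can no longer be matched.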
4.2 Semantic Tag Layering
The two-layer structure of the semantic trees need not be enforced during
training, simply L-matching [...] constraint. As explained above, for test
sentences we can only specify the frame level constraint, leaving open the
possibility of generating a tree whose semantic projection would contain more
than two levels (nested slot level constituents). In order to avoid this,
each tree in a given word-parse has two bits that describe whether the tree
already contains a constituent whose semantic projection is a frame/slot
level tag, respectively. An adjoin operation proposing a tag that violates
the correct layering of frame/slot level tags can now be detected and
discarded.
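
The two bits per tree can be propagated bottom-up at each adjoin. The sketch below shows one possible reading of "correct layering", stated here as an assumption: a slot-level tag may not be placed over a subtree that already contains a slot-level or frame-level constituent, and a frame-level tag may not be placed over a subtree that already contains a frame-level constituent. The names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Layer:
    has_frame: bool = False  # subtree contains a frame-level constituent
    has_slot: bool = False   # subtree contains a slot-level constituent


def adjoin_layering(
    left: Layer, right: Layer, new_level: Optional[str]
) -> Optional[Layer]:
    """Return the bits of the new constituent, or None if the proposed tag
    violates the frame/slot layering.

    new_level is "frame", "slot", or None for a purely syntactic tag.
    """
    has_frame = left.has_frame or right.has_frame
    has_slot = left.has_slot or right.has_slot
    if new_level == "slot" and (has_slot or has_frame):
        return None  # would nest a slot or a frame inside a slot
    if new_level == "frame" and has_frame:
        return None  # would nest a frame inside a frame
    return Layer(
        has_frame=has_frame or new_level == "frame",
        has_slot=has_slot or new_level == "slot",
    )


plain = Layer()
slot = adjoin_layering(plain, plain, "slot")
print(slot, adjoin_layering(slot, plain, "slot"))
```

Because each bit is an OR over the children, the check costs constant time per adjoin and needs no traversal of the subtree.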
4.3 Interaction with Pruning
In the absence of pruning the search for the most likely parse satisfying the
constraints for a given sentence becomes computationally intractable [4]. In
practice, we are forced to use pruning techniques in order to limit the size
of the search space. However, it is possible that during the left to right
traversal of the sentence, the pruning scheme will keep alive only parses
whose continuation cannot meet constraints that we have not encountered yet
and no complete parse for the current sentence can be returned. In such
cases, we back off to unconstrained parsing (regular SLM usage). In our
experiments, we noticed that this was necessary for very few training
sentences (1 out of 2,239) and relatively few test sentences (31 out of
1,101).
5 Comparison with Previous Work
The use of a syntactic parser augmented with semantic tags for information
extraction from text is not a novel idea. The basic approach we described is
very similar to the one presented in (Miller et al., 2000); however, there
are a few major differences:
- in our approach the augmentation of the syntactic tags with semantic tags
  is straightforward due to the fact that the semantic constituents are
  matched exactly [5]. The approach in (Miller et al., 2000) needs to insert
  additional nodes in the syntactic tree to account for the semantic
  constituents that do not have a corresponding syntactic one. We believe our
  approach ensures tighter coupling between the syntactic and the semantic
  information in the final augmented trees.
- our constraint definition allows for a set of semantic tags to be matched
  on a given span.
- [...] trees is a structural constraint that is embedded in the operation
  of the SLM and thus can be guaranteed on test sentences.
[4] It is assumed that the constraints for a given sentence are consistent,
namely there exists at least one parse that meets all of them.
[5] This is a consequence of the fact that the SLM generates [...]
The semantic annotation required by our task is much simpler than that
employed by (Miller et al., 2000). One possibly beneficial extension of our
work suggested by (Miller et al., 2000) would be to add semantic tags
describing relations between entities (slots), in which case the semantic
constraints would not be structured strictly on the two levels used in the
current approach, respectively frame and slot level. However, this would
complicate the task of data annotation, making it more expensive.
The same constrained EM variant employed for reestimating the model
parameters has been used by (Pereira and Schabes, 1992) for training a purely
syntactic parser, showing an increase in likelihood but no improvement in
parsing accuracy.
6 Experiments
We have evaluated the model on manually annotated data for the MiPad (Huang
et al., 2000) task. We have used 2,239 sentences (27,119 words) for training
and 1,101 sentences (8,652 words) for test. There were 2,239/5,431 semantic
frames/slots in the training data and 1,101/1,698 in the test data,
respectively.
The word vocabulary size was 1,035, closed over the test data. The slot and
frame vocabulary sizes were 79 and 3, respectively. The pre-terminal (POStag)
vocabulary sizes were 64 and 144 for training steps 2 and 4 (see Section 3),
respectively; the non-terminal (NTtag) vocabulary sizes were 61 and 540 for
training steps 2 and 4 (see Section 3), respectively. We have used the NLPwin
(Heidorn, 1999) parser to obtain the MiPad syntactic treebank needed for
initializing the SLM at training step 1.
Training                 Perplexity
Stage            It   Training Set   Test Set
2 (matched)      0        9.27        34.81
2 (matched)      1        5.81        31.25
2 (matched)      2        5.51        31.41
4 (L-matched)    0        4.71        24.39
4 (L-matched)    1        4.61        24.73
4 (L-matched)    2        4.56        24.88
Table 1: Likelihood Evolution during Training
Although not guaranteed theoretically, the N-best EM variant used for the SLM
parameter reestimation increases the likelihood of the training data with
each iteration when the parser is run in both [...] evolution of the training
and test data perplexities (calculated using the probability assignment in
Eq. 1) during the constrained training steps 2 and 4.
The training data perplexity decreases monotonically during both training
steps whereas the test data perplexity doesn't decrease monotonically in
either case. We attribute this discrepancy between the evolution of the
likelihood on the training and test corpora to the different constrained
settings for the SLM.
The most important performance measure is the slot/frame error rate. To
measure it, we use manually created parses which consist of frame-level
labels and slot-level labels and spans as reference. A frame-level error is
caused by a frame label of the hypothesis parse which is different from the
frame label of the reference. In order to calculate the slot-level errors, we
create a set of slot label and slot span pairs for the reference and
hypothesis parse, respectively. The number of slot errors is then the minimum
edit distance between these 2 sets using the substitution, insertion and
deletion operations on the elements of the set.
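
For unordered sets with unit costs, the minimum edit distance has a closed form: each reference element missing from the hypothesis can be paired with a spurious hypothesis element as one substitution, and whatever is left over is a deletion or an insertion. The sketch below assumes unit costs and exact match on both label and span; the function name and the example data are hypothetical.

```python
from typing import Iterable, Tuple

SlotItem = Tuple[str, int, int]  # (slot label, span start, span end)


def slot_errors(reference: Iterable[SlotItem], hypothesis: Iterable[SlotItem]) -> int:
    """Minimum number of substitutions, insertions and deletions (unit cost)
    turning the hypothesis set into the reference set."""
    ref, hyp = set(reference), set(hypothesis)
    missed = len(ref - hyp)     # reference slots not recovered
    spurious = len(hyp - ref)   # hypothesized slots absent from the reference
    return max(missed, spurious)


reference = {("ByFullName*Person", 3, 4), ("PreciseTime*Time", 9, 11)}
hypothesis = {
    ("ByFullName*Person", 3, 4),          # correct
    ("PreciseTime*Time", 10, 11),         # right label, wrong span
    ("SubjectByWildCard*Subject", 6, 7),  # spurious slot
}
print(slot_errors(reference, hypothesis))
```

In the example, one substitution repairs the time slot and one deletion removes the spurious subject slot, giving two errors. How the error count is normalized into a rate is not spelled out in the text above, so no normalization is shown.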
Table 2 shows the error rate on training and test data at different stages
during training. The last column of test data results (Test-L1) shows the
results obtained by assuming that the user has specified the identity of the
frame, and thus the frame level constraint contains only the correct semantic
tag. This is a plausible scenario if the user has the possibility to choose
the frame using a different input modality such as a stylus. The error rates
on the training data were calculated by running the model with the same
constraint as on the test data, namely constraining the set of allowable tags
at the frame level. This could be seen as an upper bound on the performance
of the model (since the model parameters were estimated on the same data).
Our model significantly outperforms the baseline model (a simple semantic
context free grammar authored manually for the MiPad task) in terms of slot
error rate (about 35% relative reduction in slot error rate) but it is
outperformed by the latter in terms of frame error rate. When running the
models from training step 2 on test data one cannot add any constraints; only
frame level constraints can be used when evaluating the models from training
step 4 on test data. N-best reestimation at either training stage (2 or 4)
doesn't improve the accuracy of the system, although the results obtained by
initializing the model using the reestimated stage 2 model (iteration
2-{0,1,2} models) tend to be slightly better than their 0-{0,1,2}
counterparts. Constraining the frame level tag to have the correct value
doesn't significantly reduce the slot error rate in either ap[...] results.
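
The relative reduction quoted above can be checked against the test slot error rates reported in Table 2. The helper below is a small arithmetic sketch; the function name is an assumption, and the figures are the Baseline, 0-0 and 2-1 rows of the Test/Slot column.

```python
def relative_reduction(baseline: float, system: float) -> float:
    """Fraction of the baseline error rate removed by the system."""
    return (baseline - system) / baseline


BASELINE_TEST_SLOT = 57.36  # Table 2, Baseline row, Test/Slot column

# Stage 2 / stage 4 iteration pairs and their Test/Slot error rates (Table 2).
for label, err in [("0-0", 37.87), ("2-1", 37.51)]:
    reduction = relative_reduction(BASELINE_TEST_SLOT, err)
    print(label, f"{100 * reduction:.1f}%")
```

Both rows give a reduction of roughly 34 to 35%, in line with the "about 35%" figure.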
6.1 Out-of-domain Initial Statistics
Recent results (Chelba, 2001) on the portability of syntactic structure
within the SLM framework show that it is possible to initialize the SLM
parameters from a treebank for out-of-domain text and maintain the same
language modeling performance. We have repeated the experiment in the context
of information extraction.
Similar to the approach in (Miller et al., 2000) we initialized the SLM
statistics from the UPenn Treebank parse trees (about 1M words of training
data) at the first training stage, see Section 3. The remaining part of the
training procedure was the same as in the previous set of experiments.
The word, slot and frame vocabulary were the same as in the previous set of
experiments. The pre-terminal (POStag) vocabulary sizes were 40 and 204 for
training steps 2 and 4 (see Section 3), respectively; the non-terminal
(NTtag) vocabulary sizes were 52 and 434 for training steps 2 and 4 (see
Section 3), respectively.
The results are presented in Table 3, showing improved performance over the
model initialized from in-domain parse trees. The frame accuracy increases
substantially, almost matching that of the baseline model, while the slot
accuracy is just slightly increased. We attribute the improved performance of
the model initialized from the UPenn Treebank to the fact that the model
explores a more diverse set of trees for a given sentence than the model
initialized from the MiPad automatic treebank generated using the NLPwin
parser.
6.2 Impact of Training Data Size on Performance
We have also evaluated the impact of the training data size on the model
performance. The results are presented in Table 4, showing a strong
dependence of both the slot and frame error rates on the amount of training
data used. This, together with the high accuracy of the model on training
data (see Table 3), suggests that we are far from saturation in performance
and that more training data is very likely to improve the model performance
substantially.
6.3 Error Trends
As a summary error analysis, we have investigated the correlation between the
semantic frame/slot error rate and the number of semantic slots in a
sentence. We have binned the sentences in the test set according to the
number of slots in the manual annotation and evaluated the frame/slot error
rate in each bin. The results are shown in Table 5.

Training It                          Error Rate (%)
                                Training         Test          Test-L1
Stage 2            Stage 4    Slot   Frame    Slot   Frame    Slot   Frame
Baseline                      43.41   7.20   57.36   14.90   57.30   6.90
0                  0           9.78   1.65   37.87   21.62   37.46   0.64
0                  1          10.36   1.20   39.16   21.80   38.28   0.64
0                  2           9.42   1.05   39.75   22.25   38.63   0.82
2                  0           8.92   1.25   38.04   22.07   37.81   0.91
2                  1           9.01   0.95   37.51   21.89   37.28   0.91
2                  2           9.47   0.90   38.99   21.89   38.57   0.82
Table 2: Training and Test Data Slot/Frame Error Rates
[6] The frame error rate in this column should be 0; in practice this doesn't
happen because some test sentences could [...]

Training It                          Error Rate (%)
                                Training         Test          Test-L1
Stage 2            Stage 4    Slot   Frame    Slot   Frame    Slot   Frame
Baseline                      43.41   7.20   57.36   14.90   57.30   6.90
0, MiPad/NLPwin    0           9.78   1.65   37.87   21.62   37.46   0.64
1, UPenn Trbnk     0           8.44   2.10   36.93   16.08   36.34   0.91
1, UPenn Trbnk     1           7.82   1.70   36.98   16.80   36.22   0.82
1, UPenn Trbnk     2           7.69   1.50   36.98   16.80   36.22   1.00
Table 3: Training and Test Data Slot/Frame Error Rates, UPenn Treebank
initial statistics

Training   Training It                       Error Rate (%)
Corpus                                  Training         Test          Test-L1
Size       Stage 2         Stage 4    Slot   Frame    Slot   Frame    Slot   Frame
Baseline                              43.41   7.20   57.36   14.90   57.30   6.90
all        1, UPenn Trbnk  0           8.44   2.10   36.93   16.08   36.34   0.91
1/2 all    1, UPenn Trbnk  0            -      -     43.76   18.44   43.40   0.45
1/4 all    1, UPenn Trbnk  0            -      -     49.47   22.98   49.53   1.82
Table 4: Performance Degradation with Training Data Size

The frame/slot accuracy increases with the number of slots per sentence
(except for the 5+ bin, where the frame error rate increases), showing that
slot co-occurrence statistics improve performance; sentences containing more
semantic slots tend to be less ambiguous from an information extraction point
of view.

                    Error Rate (%)
No. slots/sent     Slot     Frame    No. Sent
1                  43.97    18.01    755
2                  39.23    16.27    209
3                  26.44     5.17     58
4                  26.50     4.00     50
5+                 21.19     6.90     29
Table 5: Frame/Slot Error Rate versus Slot Density
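
Since every sentence carries exactly one frame, the per-bin frame error rates of Table 5 can be recombined into an overall test frame error rate by weighting each bin with its sentence count. The short check below does this arithmetic; the variable names are chosen here, and the observation that follows it is an inference from the numbers rather than a statement made in the text.

```python
# (slots per sentence, frame error rate in %, number of sentences), Table 5.
bins = [
    ("1", 18.01, 755),
    ("2", 16.27, 209),
    ("3", 5.17, 58),
    ("4", 4.00, 50),
    ("5+", 6.90, 29),
]

total_sentences = sum(count for _, _, count in bins)
weighted_frame_error = (
    sum(rate * count for _, rate, count in bins) / total_sentences
)

print(total_sentences, round(weighted_frame_error, 2))
```

The bins add up to the 1,101 test sentences, and the weighted average comes out at about 16.07%, close to the 16.08% test frame error rate of the UPenn Treebank initialized model at iteration 0 in Table 3. That suggests, without the text saying so, that Table 5 was computed for that model.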
7 Conclusions and Future Directions
We have presented a data-driven approach to information extraction that,
despite the small amount of training data used, is shown to outperform the
slot level accuracy of a simple semantic grammar authored manually for the
MiPad (personal information management) task.
The performance of the baseline model could be improved with more authoring
effort, although this is expensive.
The big difference in performance between training and test and the fact that
we are using so little training data make improvements by using more training
data very likely, although this may be expensive. A framework which utilizes
the vast amounts of text data collected once such a system is deployed would
be desirable. Statistical modeling techniques that make more effective use of
the training data should be used in the SLM, maximum entropy (Berger et al.,
1996) being a good candidate.
[...] impact of incorporating the semantic constraints on the word-level
accuracy of the system. Another possible research direction is to modify the
framework such that it finds the most likely semantic parse given the
acoustics, thus treating the word sequence as a hidden variable.
References
A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra. 1996. A maximum
entropy approach to natural language processing. Computational Linguistics,
22(1):39-72, March.
Ciprian Chelba and Frederick Jelinek. 2000. Structured language modeling.
Computer Speech and Language, 14(4):283-332, October.
Ciprian Chelba. 2001. Portability of syntactic structure for language
modeling. In Proceedings of ICASSP, page to appear. Salt Lake City, Utah.
A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977. Maximum likelihood from
incomplete data via the EM algorithm. In Journal of the Royal Statistical
Society, volume 39 of B, pages 1-38.
George Heidorn. 1999. Intelligent writing assistance. In R. Dale, H. Moisl,
and H. Somers, editors, Handbook of Natural Language Processing. Marcel
Dekker, New York.
X. Huang, A. Acero, C. Chelba, L. Deng, D. Duchene, J. Goodman, H. Hon,
D. Jacoby, L. Jiang, R. Loynd, M. Mahajan, P. Mau, S. Meredith, S. Mughal,
S. Neto, M. Plumpe, K. Wang, and Y. Wang. 2000. MiPad: A next generation PDA
prototype. In ICSLP'00, Proceedings, Beijing, China.
Daniel Jurafsky and James H. Martin, 2000. An Introduction to Natural
Language Processing, Computational Linguistics, and Speech Recognition, pages
577-583. Prentice Hall.
M. Marcus, B. Santorini, and M. Marcinkiewicz. 1993. Building a large
annotated corpus of English: the Penn Treebank. Computational Linguistics,
19(2):313-330.
Scott Miller, Heidi Fox, Lance Ramshaw, and Ralph Weischedel. 2000. A novel
use of statistical parsing to extract information from text. In Proceedings
of ANLP-NAACL, pages 226-233. Seattle, Washington.
Fernando Pereira and Yves Schabes. 1992. Inside-outside reestimation from
partially bracketed corpora. ACL, 30:128-135.
Y.-Y. Wang. 1999. A robust parser for spoken language understanding. In
Eurospeech'99