in Supervised Learning for Japanese Named Entity Recognition
Manabu Sassano
FujitsuLaboratories, Ltd.
4-4-1, Kamikodanaka,Nakahara-ku,
Kawasaki211-8588,Japan
Takehito Utsuro
Department ofInformation
and Computer Sciences,
Toyohashi University of Technology
Tenpaku-cho, Toyohashi 441-8580,Japan
Abstract
Thispaperfocusesontheissueofnamedentity
chunkinginJapanesenamedentityrecognition.
We apply the supervised decision list
learn-ing method to Japanese named entity
recogni-tion. We also investigate and incorporate
sev-eral named-entity noun phrase chunking
tech-niques and experimentally evaluate and
com-paretheirperformance. Inaddition,wepropose
a method for incorporating richer contextual
information as well as patterns of constituent
morphemes within a named entity, which have
not been considered in previous research, and
show that the proposed method outperforms
these previousapproaches.
1 Introduction
It is widely agreed that named entity
recog-nition is an important step for various
appli-cations of natural language processing such as
information retrieval, machine translation,
in-formation extraction and natural language
un-derstanding. In the English language, the
task of named entity recognition is one of the
tasks of the Message Understanding
Confer-ence (MUC) (e.g., MUC-7 (1998)) and has
been studied intensively. In the Japanese
lan-guage, severalrecent conferences,such asMET
(Multilingual Entity Task, MET-1 (Maiorano,
1996)andMET-2(MUC,1998))andIREX
(In-formation Retrieval and Extraction Exercise)
Workshop(IREXCommittee,1999),focusedon
named entity recognition as one of their
con-testtasks,thuspromotingresearchonJapanese
named entity recognition.
In Japanese named entity recognition, it is
quite common to apply morphological
analy-sis as a preprocessing step and to segment
the sentence string into a sequence of
mor-rulesand/orstatistical namedentityrecognizer
are applied to recognize named entities. It is
often the case that named entities to be
rec-ognizedhave dierentsegmentationboundaries
from those ofmorphemes obtainedbythe
mor-phological analysis. For example, in our
anal-ysis of the IREXworkshop's trainingcorpusof
named entities, about half of the named
enti-ties have segmentationboundariesthat are
dif-ferent fromtheresultof morphologicalanalysis
by a Japanese morphological analyzer
break-fast(Sassanoetal.,1997)(section2). Thus,in
Japanese named entity recognition, among the
mostdiÆcultproblemsishowtorecognizesuch
named entities that have segmentation
bound-ary mismatch against themorphemes obtained
by morphological analysis. Furthermore, in
al-most90%ofcasesofthosesegmentation
bound-arymismatches,namedentitiestoberecognized
can be decomposed into several morphemes as
theirconstituents. Thismeansthattheproblem
of recognizingnamed entitiesinthosecases can
be solved by incorporating techniques of base
noun phrase chunking (Ramshaw and Marcus,
1995).
Inthispaper, wefocusontheissueof named
entitychunkinginJapanesenamedentity
recog-nition. First, we take asupervisedlearning
ap-proach rather than a hand-crafted rule based
approach, because the former is more
promis-ing than the latter withrespect to the amount
of humanlaborifrequires,aswellasits
adapt-abilityto a new domain or a new denition of
named entities. In general, creating training
data forsupervisedlearningis somewhateasier
than creating pattern matching rules by hand.
Next, we apply Yarowsky's method for
super-viseddecisionlistlearning 1
(Yarowsky,1994)to
frequency(%)
NEType Training Test
ORGANIZATION 3676(19.7) 361(23.9)
PERSON 3840(20.6) 338(22.4)
LOCATION 5463(29.2) 413(27.4)
ARTIFACT 747(4.0) 48(3.2)
DATE 3567(19.1) 260(17.2)
TIME 502(2.7) 54(3.5)
MONEY 390(2.1) 15(1.0)
PERCENT 492(2.6) 21(1.4)
Total 18677 1510
Japanese named entity recognition, into which
we incorporate several noun phrase chunking
techniques (sections 3 and 4) and
experimen-tally evaluate their performance on the IREX
workshop's training and test data (section 5).
As one of those noun phrase chunking
tech-niques, we propose a method for incorporating
richercontextualinformationaswellaspatterns
of constituent morphemes within a named
en-tity,comparedwiththoseconsideredinthe
pre-vious research (Sekine et al., 1998; Borthwick,
1999),andshowthattheproposedmethod
out-performsthese approaches.
2 Japanese Named Entity
Recognition
2.1 Task of the IREX Workshop
The task of named entity recognition of the
IREXworkshopisto recognizeeight named
en-titytypesinTable 1 (IREXCommittee,1999).
The organizer of the IREX workshop provided
1,174 newspaper articles which include 18,677
named entitiesasthetrainingdata. Inthe
for-mal run(general domain) of theworkshop, the
participating systems were requested to
recog-nize 1,510 named entities included inthe
held-out71 newspaperarticles.
2.2 Segmentation Boundaries of
Morphemes and Named Entities
Intheworkpresentedhere,wecomparethe
seg-mentation boundariesof named entities in the
IREX workshop's trainingcorpuswith thoseof
supervisedlearning techniquemainlybecause itis easy
toimplementandquitestraightforward toextenda
su-pervisedlearningversiontoaminimallysupervised
ver-sion(CollinsandSinger,1999;CucerzanandYarowsky,
1999). We alsoreportedin(UtsuroandSassano, 2000)
theexperimentalresultsofa minimallysupervised
ver-match of Morphemes (M) and Named Entities
(NE)
Match/Mismatch freq. ofNE Tags(%)
1Mto 1NE 10480(56.1)
n(2)Ms n=2 4557(24.4)
to n=3 1658(8.9) 7175
1NE n4 960(5.1) (38.4)
otherboundarymismatch 1022(5.5)
Total 18677
morphemes which were obtained through
mor-phological analysis by a Japanese
morphologi-calanalyzerbreakfast(Sassanoetal.,1997). 2
Detailed statistics of the comparison are
pro-vided in Table 2. Nearly half of the named
entities have boundarymismatchesagainst the
morphemes and also almost 90% of the named
entities with boundary mismatches can be
de-composed into more thanone morpheme.
Fig-ure 1showssome examples of suchcases. 3
3 Chunking and Tagging Named
Entities
In this section, we formalize the problem of
named entity chunking in Japanese named
en-tity recognition. We describe a novel
tech-nique aswellasthose proposedintheprevious
works on named entity recognition. The novel
technique incorporates richer contextual
infor-mation as well as patterns of constituent
mor-phemes withina named entity, compared with
thetechniquesproposedinpreviousresearchon
named entityrecognition andbasenounphrase
chunking.
3.1 Task Denition
First, we will provideourdenitionof thetask
of Japanese named entity chunking. Suppose
2
Thesetofpart-of-speechtagsofbreakfastconsists
of about 300tags. breakfast achieves 99.6%
part-of-speechaccuracyagainstnewspaperarticles.
3
Inmostcasesofthe\other boundarymismatch"in
Table 2, one or more named entities have to be
rec-ognized as a part of a correctly analyzed morpheme
and thosecasesarenot causedbyerrorsof
morpholog-ical analysis. One frequent example of this type is a
Japaneseverbalnoun\hou-bei(visitingUnitedStates)"
whichconsistsoftwocharacters\hou(visiting)"and\bei
(United States)",where\bei(UnitedStates)"hasto be
recognized as <LOCATION>. We believe that boundary
mismatchesofthistypecanbeeasilysolvedby
Named EntityTag <ORG> <LOC> <LOC>
Morpheme Sequence M M M M M M M M
Inside/OutsideEncoding O ORGI O LOCI LOCI LOCI LOCB O
Start/EndEncoding O ORGU O LOCS LOCC LOCE LOCU O
2Morphemesto 1NamedEntity
<ORGANIZATION>
Roshia gun
(Russian) (army)
<PERSON>
Murayama Tomiichi shushou
(lastname) (rstname) ( prime
minister )
3Morphemesto 1NamedEntity
<TIME>
gozen ku ji
(AM) (nine) (o'clock)
<ARTIFACT>
hokubei jiyuu-boueki kyoutei
( North
America
) (freetrade) (treaty)
Figure 1: Examples of Boundary Mismatch of
Morphemes and Named Entities
that a sequence of morphemes is given as
be-low:
( Left
Context
) (NamedEntity) ( Right
Context )
M L
k M
L
1 M
NE
1 M
NE
i M
NE
m M
R
1 M
R
l
"
(CurrentPosition)
Then, given that the current position is at
the morphemeM NE
i
,the task of named entity
chunkingistoassignachunkingstate(tobe
de-scribedinSection3.2)aswellasanamedentity
type to themorphemeM NE
i
atthe current
po-sition, considering the patterns of surrounding
morphemes. Note that inthesupervised
learn-ing phasewe can usethechunkinginformation
onwhichmorphemesconstituteanamedentity,
andwhichmorphemesareintheleft/right
con-textsof thenamed entity.
3.2 Encoding Schemes of Named
Entity Chunking States
In this paper, we evaluate the following two
schemes of encoding chunking states of named
entities. Examples of these encoding schemes
3.2.1 Inside/Outside Encoding
The Inside/Outsidescheme of encoding
chunk-ing states of base noun phrases was studied in
Ramshawand Marcus(1995). Thisscheme
dis-tinguishes the following three states: O { the
wordat thecurrentpositionisoutsideanybase
noun phrase. I { the word at the current
po-sition is inside some base nounphrase. B{ the
word at the current position marks the
begin-ningofabasenounphrasethatimmediately
fol-lowsanotherbasenounphrase. We extendthis
schemetonamedentitychunkingbyfurther
dis-tinguishingeachofthestatesIandBintoeight
named entitytypes. 4
Thus, thisscheme
distin-guishes 28+1=17 states.
3.2.2 Start/End Encoding
The Start/End scheme of encoding chunking
statesofnamedentitieswasemployedinSekine
et al. (1998) and Borthwick (1999). This
scheme distinguishes the following four states
for each named entity type: S{ themorpheme
atthecurrentpositionmarksthebeginningofa
named entityconsisting of more thanone
mor-pheme. C{ themorphemeat the current
posi-tionmarksthemiddleofanamedentity
consist-ing of more than one morpheme. E{ the
mor-phemeat thecurrentpositionmarkstheending
of a named entity consisting of more than one
morpheme. U { the morpheme at the current
positionisanamedentityconsistingofonlyone
morpheme. Theschemealsoconsidersone
addi-tional state forthe positionoutside any named
entity: O { the morpheme at the current
posi-tion is outside any named entity. Thus, inour
setting,thisschemedistinguishes48+1=33
states.
3.3 Preceding/Subsequent Morphemes
as Contextual Clues
In this paper, we evaluate the following two
models of considering preceding/subsequent
4
We allow the state xB for a named entity type x
onlywhenthemorphemeatthecurrent positionmarks
im-chunking/tagging. Hereweprovideabasic
out-line of these models, and the details of how to
incorporate them into the decisionlistlearning
frameworkwillbe describedinSection4.2.2.
3.3.1 3-gram Model
In this paper, we refer to the model used in
Sekine etal. (1998) and Borthwick(1999) asa
3-gram model. Suppose that the current
posi-tion is at the morpheme M
0
,as illustrated
be-low. Then, when assigning achunkingstate as
well as a named entity type to the morpheme
M
0
, the 3-gram model considers the preceding
singlemorphemeM
1
aswellasthesubsequent
single morphemeM
1
asthecontextual clue.
( Left
Context ) (
Current
Position ) (
Right
Context )
M
1
M
0
M
1
(1)
The major disadvantage of the 3-gram model
is that in the training phase it does not
take into account whether or not the
pre-ceding/subsequent morphemes constitute one
named entity together with the morpheme at
thecurrent position.
3.3.2 Variable Length Model
Inordertoovercomethisdisadvantageofthe
3-grammodel,weproposeanovelmodel,namely
the \Variable Length Model", which
incorpo-rates richer contextual information as well as
patterns of constituent morphemes within a
named entity. Inprinciple,aspartofthe
train-ingphasethismodelconsiderswhichofthe
pre-ceding/subsequent morphemes constitute one
named entity together with the morpheme at
the current position. It also considers
sev-eralmorphemesintheleft/rightcontextsofthe
namedentity. Herewerestrictthismodelto
ex-plicitly considering the cases of named entities
of the length up to three morphemes and only
implicitly considering those longer than three
morphemes. We also restrict it to considering
two morphemes in bothleft and right contexts
of thenamed entity.
( Left
Context
) (NamedEntity) ( Right
Context )
M L
2 M
L
1 M
NE
1 M
NE
i M
NE
m(3) M
R
1 M
R
2
" (2)
4 Supervised Learning for Japanese
Named Entity Recognition
This section describes how to apply the
deci-sion list learning method to chunking/tagging
named entities.
4.1 Decision List Learning
A decision list (Rivest, 1987; Yarowsky, 1994)
is a sorted list of decision rules, each of which
decidesthevalueofadecisionDgivensome
ev-idenceE. Eachdecisionruleinadecisionlistis
sortedindescendingorderwithrespectto some
preference value, and rules with higher
prefer-ence valuesare appliedrst when applyingthe
decisionlist to some newtest data.
First, the random variable D representing a
decisionvariesoverseveral possiblevalues, and
the random variable E representing some
evi-dence varies over `1' and `0' (where `1' denotes
the presence of the corresponding piece of
evi-dence, `0' itsabsence). Then,givensome
train-ing datain which thecorrect value of the
deci-sion D is annotated to each instance, the
con-ditional probabilities P(D=x j E =1) of
ob-serving the decisionD=x under the condition
of the presence of the evidence E (E =1) are
calculated and the decision list is constructed
bythefollowing procedure.
1. Foreachpieceofevidence,wecalculatethe
log of likelihood ratio of the largest
condi-tional probability of the decision D = x
1
(given the presence of that piece of
ev-idence) to the second largest conditional
probabilityofthe decisionD=x
2 :
log
2
P(D=x
1
jE=1)
P(D=x
2
jE=1)
Then, a decision list is constructed with
pieces of evidence sortedin descending
or-der with respect to their log of likelihood
ratios,wherethedecisionoftheruleateach
line is D=x
1
with the largest conditional
probability. 5
5
Yarowsky (1994) discusses several techniques for
avoiding the problems which arise when an observed
count is 0. From among thosetechniques, we employ
the simplestone, i.e.,addingasmallconstant (0:1
0:25) to the numerator and denominator. With
`adefault',wherethelogoflikelihoodratio
is calculated from the ratio of the largest
marginalprobabilityofthedecisionD=x
1
to the second largest marginal probability
ofthe decisionD=x
The `default' decision of this nal line is
D=x
1
withthe largestmarginal
probabil-ity.
4.2 Decision List Learning for
Chunking/Tagging NamedEntities
4.2.1 Decision
Foreachofthetwoschemesof encoding
chunk-ing states of named entities described in
Sec-tion 3.2, as the possible values of the
deci-sionD,weconsiderexactlythesamecategories
of chunking states as those described in
Sec-tion 3.2.
4.2.2 Evidence
The evidence E used inthe decision list
learn-ing is a combination of the features of
preced-ing/subsequent morphemes aswellasthe
mor-pheme at the current position. The following
describeshow to form theevidence E forboth
the3-gram modeland variable lengthmodel.
3-gram Model
TheevidenceErepresentsatuple(F
1
denotethefeaturesof
imme-diately preceding/subsequent morphemes M
1
andM
1
,respectively,F
0
thefeatureofthe
mor-phemeM
0
atthecurrent position(see Formula
(1)in Section3.3.1). Thedenitionof the
pos-sible values of those features F
1 , F
0
, and F
1
are given below, where M
i
denotes the
mor-pheme itself (i.e., including its lexical form as
well as part-of-speech), C
i
the character type
(i.e., Japanese(hiragana or katakana), Chinese
(kanji), numbers, English alphabets, symbols,
and all possible combinations of these) of M
i ,
T
i
thepart-of-speechof M
i
unsmoothedconditional probability P(D=x j E=1).
Yarowsky's trainingalgorithm also dierssomewhat in
hisuse oftheratio P(D=x
,whichis equivalentin
thecaseofbinaryclassications,andalsobythe
interpo-lation between theglobal probabilities (used here)and
theresidualprobabilities furtherconditionalon
higher-rankedpatternsfailingtomatchinthelist.
1 1 1 1 1
As the evidence E, we consider each possible
combination of the values of those three
fea-tures.
Variable Length Model
The evidence E represents a tuple
(F
the features of the morphemes M L
intheleft/rightcontextsofthecurrent
named entity, respectively, F
NE
the features
of the morphemes M NE
constituting the current named entity (see
Formula (2)inSection3.3.2). Thedenitionof
the possible values of those features F
L
are given below, where F NE
j
denotes
the feature of the j-th constituent morpheme
M NE
j
within the current named entity, and
M NE
i
isthemorphemeatthecurrentposition:
F
As the evidence E, we consider each possible
combination of the values of those three
fea-tures, except that the following three
restric-tions areapplied.
1. In the cases where the current named
en-tity consistsof up to three morphemes,as
the possible values of the feature F
NE in
the denition (3), we consider only those
which are consistent with therequirement
thateach morphemeM NE
j
isaconstituent
of the current named entity. For example,
supposethatthecurrentnamedentity
con-sists of three morphemes, where the
cur-rentpositionis atthemiddleofthose
con-stituent morphemesasbelow:
(
Then,asthepossiblevaluesof thefeature
F
NE
,we consideronlythefollowingfour:
jF
2. Inthecaseswherethecurrentnamedentity
consists of more than three morphemes,
only the three constituent morphemes are
regarded as within the current named
en-tity and the rest are treated as if they
were outside the named entity. For
exam-ple, suppose that the current named
en-tity consists of four morphemes as below:
(
In this case, the fourth constituent
mor-pheme M NE
4
istreated as ifit were in the
right context of the current named entity
asbelow:
3. As the evidence E, among the possible
combination of the values of three
fea-tures F
, we only accept
those in which the positions of the
mor-phemes are continuous, and reject those
discontinuous combinations. For example,
in the case of Formula (4) above, as the
evidence E, we accept the combination
(M
; null), while we reject
(M
4.3 Procedures for Training and
Testing
Next we will briey describe the entire
pro-cesses of learning the decision list for
chunk-ing/tagging named entities as wellas applying
itto chunking/taggingunseen namedentities.
4.3.1 Training
In the training phase, at the positions where
thecorrespondingmorphemeisaconstituentof
anamedentity,asdescribedinSection4.2,each
allowable combination of features is considered
as the evidence E. On the other hand, at the
positionswherethecorrespondingmorphemeis
outside any named entity, the way the
combi-nation of features is considered is dierent in
1 in the previous section is no longer applied.
Therefore, allthe possiblevaluesof thefeature
F
NE
inDenition(3)areaccepted. Finally,the
frequencyof each decisionDand evidence E is
counted and the decision list is learned as
de-scribedin Section4.1.
4.3.2 Testing
When applying the decision list to
chunk-ing/taggingunseennamedentities,rst,ateach
morphemeposition,thecombinationoffeatures
isconsideredasinthecaseofthenon-entity
po-sitioninthetrainingphase. Then,thedecision
listisconsultedandallthedecisionsoftherules
with a log of likelihood ratio above a certain
threshold are recorded. Finally, as in the case
ofpreviousresearch(Sekineet al.,1998;
Borth-wick, 1999), the most appropriate sequence of
thedecisionsthatareconsistentthroughoutthe
wholesequence issearched for. By consistency
ofthe decisions,we mean requirementssuchas
that thedecision representing thebeginning of
some named entity type has to be followed by
thatrepresentingthemiddleof thesame entity
type(inthecaseofStart/Endencoding). Also,
inourcase, theappropriatenessofthesequence
of the decisions is measured by the sum of the
logoflikelihoodratios ofthecorresponding
de-cisionrules.
5 Experimental Evaluation
We experimentally evaluate the performance
of the supervisedlearning for Japanese named
entity recognition on the IREX workshop's
training and test data. We compare the
re-sults of the combinations of the two
encod-ing schemes of named entity chunking states
(the Inside/Outside and the Start/End
encod-ingschemes)andthetwoapproachesto
contex-tualfeaturedesign(the3-gramandtheVariable
Length models). For each of those
combina-tions,we searchforanoptimalthresholdofthe
log of likelihoodratio in the decision list. The
performance of each combination measured by
F-measure( =1) isgiven inTable 4.
In this evaluation, we exclude the named
entities with \other boundary mismatch" in
Table 2. We also classify the system
output according to the number of
con-stituent morphemes of each named entity
sub-n Morphemes to1 Named Entity
n1 n=1 n=2 n=3 n2 n3 n4
3-gram Inside/Outside 72.9 75.9 79.7 51.4 69.4 42.5 29.2
Start/End 72.7 76.6 79.6 43.7 68.1 37.8 29.6
Variable Inside/Outside 74.3 77.6 80.0 55.5 70.9 49.9 41.0
Length Start/End 72.1 77.0 75.6 51.5 67.2 48.6 43.6
set, we compare the performance of the four
combinations of f3-gram, Variable Lengthg
fInside/Outside,Start/Endg and show the
highest performancewithbold-faced font.
Several remarkable points of these results of
performancecomparisoncanbestatedasbelow:
Amongthefourcombinations,theVariable
LengthModelwithInside/Outside
Encod-ing performs best in total (n 1) as well
asintherecognitionofnamedentities
con-sisting of more than one morpheme (n =
2;3;n2;3).
In the recognition of named entities
con-sistingof more than two morphemes (n =
3;n 3;4), the Variable Length Model
performs signicantly better than the
3-gram model. This result clearly supports
the claim that our modeling of the
Vari-ableLengthModelhasanadvantageinthe
recognitionof longnamed entities.
In general, the Inside/Outside encoding
scheme performs slightly better than the
Start/End encoding scheme, even though
theformerdistinguishesconsiderablyfewer
statesthan thelatter.
6 Conclusion
In this paper, we applied the supervised
deci-sionlistlearningmethodtoJapanesenamed
en-tityrecognition,intowhichweincorporated
sev-eral noun phrase chunking techniques and
ex-perimentally evaluated their performance. We
showedthatanoveltechniquethatweproposed
outperformedthoseusingpreviouslyconsidered
contextual features.
7 Acknowledgments
This research was carried out while the
au-thors were visiting scholars at Department of
Computer Science, Johns Hopkins University.
The authors would like to thank Prof. David
Yarowsky of Johns Hopkins University for
in-References
A.Borthwick. 1999. A Japanesenamed entity
rec-ognizerconstructedbyanon-speakerofJapanese.
InProc. ofthe IREX Workshop,pages187{193.
M.CollinsandY.Singer. 1999. Unsupervised
mod-els of named entity classication. In Proc. of
the 1999 Joint SIGDAT Conference on
Empiri-cal Methods in Natural Language Processing and
Very Large Corpora,pages100{110.
S.CucerzanandD.Yarowsky.1999. Language
inde-pendentnamedentityrecognitioncombining
mor-phological and contextual evidence. In Proc. of
the 1999 Joint SIGDAT Conference on
Empiri-cal Methods in Natural Language Processing and
Very Large Corpora,pages90{99.
IREXCommittee, editor. 1999. Proceedings of the
IREXWorkshop. (inJapanese).
S. Maiorano. 1996. The multilingual entity task
(MET): Japanese results. In Proc. of TIPSTER
PROGRAMPHASEII,pages449{451.
MUC. 1998. Proceedings ofthe7thMessage
Under-standingConference(MUC-7).
L.Ramshaw andM. Marcus. 1995. Text chunking
using transformation-basedlearning. In Proc. of
the 3rd Workshop on Very Large Corpora, pages
83{94.
R.L.Rivest. 1987. Learningdecisionlists. Machine
Learning,2:229{246.
M. Sassano, Y. Saito, and K. Matsui. 1997.
Japanese morphologicalanalyzer for NLP
appli-cations. In Proc. of the 3rd Annual Meeting of
the Association for Natural LanguageProcessing,
pages441{444. (inJapanese).
S. Sekine, R. Grishman, and H. Shinnou. 1998.
A decision tree method for nding and
classify-ingnames in Japanesetexts. InProc. of the 6th
WorkshoponVeryLargeCorpora,pages148{152.
T. Utsuro and M. Sassano. 2000. Minimally
su-pervisedJapanesenamed entity recognition:
Re-sourcesandevaluation. InProc. ofthe2nd
Inter-national Conference on Language Resources and
Evaluation,pages1229{1236.
D. Yarowsky. 1994. Decision listsfor lexical
ambi-guity resolution: Application to accent
restora-tioninSpanishand French. InProc. of the 32nd