Named Entity Chunking Techniques in Supervised Learning for Japanese Named Entity Recognition

(1)

in Supervised Learning for Japanese Named Entity Recognition

Manabu Sassano

FujitsuLaboratories, Ltd.

4-4-1, Kamikodanaka,Nakahara-ku,

Kawasaki211-8588,Japan

[email protected]

Takehito Utsuro

Department ofInformation

and Computer Sciences,

Toyohashi University of Technology

Tenpaku-cho, Toyohashi 441-8580,Japan

[email protected]

Abstract

Thispaperfocusesontheissueofnamedentity

chunkinginJapanesenamedentityrecognition.

We apply the supervised decision list

learn-ing method to Japanese named entity

recogni-tion. We also investigate and incorporate

sev-eral named-entity noun phrase chunking

tech-niques and experimentally evaluate and

com-paretheirperformance. Inaddition,wepropose

a method for incorporating richer contextual

information as well as patterns of constituent

morphemes within a named entity, which have

not been considered in previous research, and

show that the proposed method outperforms

these previousapproaches.

1 Introduction

It is widely agreed that named entity

recog-nition is an important step for various

appli-cations of natural language processing such as

information retrieval, machine translation,

in-formation extraction and natural language

un-derstanding. In the English language, the

task of named entity recognition is one of the

tasks of the Message Understanding

Confer-ence (MUC) (e.g., MUC-7 (1998)) and has

been studied intensively. In the Japanese

lan-guage, severalrecent conferences,such asMET

(Multilingual Entity Task, MET-1 (Maiorano,

1996)andMET-2(MUC,1998))andIREX

(In-formation Retrieval and Extraction Exercise)

Workshop(IREXCommittee,1999),focusedon

named entity recognition as one of their

con-testtasks,thuspromotingresearchonJapanese

named entity recognition.

In Japanese named entity recognition, it is

quite common to apply morphological

analy-sis as a preprocessing step and to segment

the sentence string into a sequence of

mor-rulesand/orstatistical namedentityrecognizer

are applied to recognize named entities. It is

often the case that named entities to be

rec-ognizedhave dierentsegmentationboundaries

from those ofmorphemes obtainedbythe

mor-phological analysis. For example, in our

anal-ysis of the IREXworkshop's trainingcorpusof

named entities, about half of the named

enti-ties have segmentationboundariesthat are

dif-ferent fromtheresultof morphologicalanalysis

by a Japanese morphological analyzer

break-fast(Sassanoetal.,1997)(section2). Thus,in

Japanese named entity recognition, among the

mostdiÆcultproblemsishowtorecognizesuch

named entities that have segmentation

bound-ary mismatch against themorphemes obtained

by morphological analysis. Furthermore, in

al-most90%ofcasesofthosesegmentation

bound-arymismatches,namedentitiestoberecognized

can be decomposed into several morphemes as

theirconstituents. Thismeansthattheproblem

of recognizingnamed entitiesinthosecases can

be solved by incorporating techniques of base

noun phrase chunking (Ramshaw and Marcus,

1995).

Inthispaper, wefocusontheissueof named

entitychunkinginJapanesenamedentity

recog-nition. First, we take asupervisedlearning

ap-proach rather than a hand-crafted rule based

approach, because the former is more

promis-ing than the latter withrespect to the amount

of humanlaborifrequires,aswellasits

adapt-abilityto a new domain or a new denition of

named entities. In general, creating training

data forsupervisedlearningis somewhateasier

than creating pattern matching rules by hand.

Next, we apply Yarowsky's method for

super-viseddecisionlistlearning 1

(Yarowsky,1994)to

(2)

frequency(%)

NEType Training Test

ORGANIZATION 3676(19.7) 361(23.9)

PERSON 3840(20.6) 338(22.4)

LOCATION 5463(29.2) 413(27.4)

ARTIFACT 747(4.0) 48(3.2)

DATE 3567(19.1) 260(17.2)

TIME 502(2.7) 54(3.5)

MONEY 390(2.1) 15(1.0)

PERCENT 492(2.6) 21(1.4)

Total 18677 1510

Japanese named entity recognition, into which

we incorporate several noun phrase chunking

techniques (sections 3 and 4) and

experimen-tally evaluate their performance on the IREX

workshop's training and test data (section 5).

As one of those noun phrase chunking

tech-niques, we propose a method for incorporating

richercontextualinformationaswellaspatterns

of constituent morphemes within a named

en-tity,comparedwiththoseconsideredinthe

pre-vious research (Sekine et al., 1998; Borthwick,

1999),andshowthattheproposedmethod

out-performsthese approaches.

2 Japanese Named Entity

Recognition

2.1 Task of the IREX Workshop

The task of named entity recognition of the

IREXworkshopisto recognizeeight named

en-titytypesinTable 1 (IREXCommittee,1999).

The organizer of the IREX workshop provided

1,174 newspaper articles which include 18,677

named entitiesasthetrainingdata. Inthe

for-mal run(general domain) of theworkshop, the

participating systems were requested to

recog-nize 1,510 named entities included inthe

held-out71 newspaperarticles.

2.2 Segmentation Boundaries of

Morphemes and Named Entities

Intheworkpresentedhere,wecomparethe

seg-mentation boundariesof named entities in the

IREX workshop's trainingcorpuswith thoseof

supervisedlearning techniquemainlybecause itis easy

toimplementandquitestraightforward toextenda

su-pervisedlearningversiontoaminimallysupervised

ver-sion(CollinsandSinger,1999;CucerzanandYarowsky,

1999). We alsoreportedin(UtsuroandSassano, 2000)

theexperimentalresultsofa minimallysupervised

ver-match of Morphemes (M) and Named Entities

(NE)

Match/Mismatch freq. ofNE Tags(%)

1Mto 1NE 10480(56.1)

n(2)Ms n=2 4557(24.4)

to n=3 1658(8.9) 7175

1NE n4 960(5.1) (38.4)

otherboundarymismatch 1022(5.5)

Total 18677

morphemes which were obtained through

mor-phological analysis by a Japanese

morphologi-calanalyzerbreakfast(Sassanoetal.,1997). 2

Detailed statistics of the comparison are

pro-vided in Table 2. Nearly half of the named

entities have boundarymismatchesagainst the

morphemes and also almost 90% of the named

entities with boundary mismatches can be

de-composed into more thanone morpheme.

Fig-ure 1showssome examples of suchcases. 3

3 Chunking and Tagging Named

Entities

In this section, we formalize the problem of

named entity chunking in Japanese named

en-tity recognition. We describe a novel

tech-nique aswellasthose proposedintheprevious

works on named entity recognition. The novel

technique incorporates richer contextual

infor-mation as well as patterns of constituent

mor-phemes withina named entity, compared with

thetechniquesproposedinpreviousresearchon

named entityrecognition andbasenounphrase

chunking.

3.1 Task Denition

First, we will provideourdenitionof thetask

of Japanese named entity chunking. Suppose

2

Thesetofpart-of-speechtagsofbreakfastconsists

of about 300tags. breakfast achieves 99.6%

part-of-speechaccuracyagainstnewspaperarticles.

3

Inmostcasesofthe\other boundarymismatch"in

Table 2, one or more named entities have to be

rec-ognized as a part of a correctly analyzed morpheme

and thosecasesarenot causedbyerrorsof

morpholog-ical analysis. One frequent example of this type is a

Japaneseverbalnoun\hou-bei(visitingUnitedStates)"

whichconsistsoftwocharacters\hou(visiting)"and\bei

(United States)",where\bei(UnitedStates)"hasto be

recognized as <LOCATION>. We believe that boundary

mismatchesofthistypecanbeeasilysolvedby

(3)

Named EntityTag <ORG> <LOC> <LOC>

Morpheme Sequence M M M M M M M M

Inside/OutsideEncoding O ORGI O LOCI LOCI LOCI LOCB O

Start/EndEncoding O ORGU O LOCS LOCC LOCE LOCU O

2Morphemesto 1NamedEntity

Roshia gun

(Russian) (army)

Murayama Tomiichi shushou

(lastname) (rstname) ( prime

minister )

3Morphemesto 1NamedEntity

<TIME>

gozen ku ji

(AM) (nine) (o'clock)

hokubei jiyuu-boueki kyoutei

( North

America

) (freetrade) (treaty)

Figure 1: Examples of Boundary Mismatch of

Morphemes and Named Entities

that a sequence of morphemes is given as

be-low:

( Left

Context

) (NamedEntity) ( Right

Context )

M L

k M

L

1 M

NE

1 M

NE

i M

NE

m M

R

1 M

R

l

"

(CurrentPosition)

Then, given that the current position is at

the morphemeM NE

i

,the task of named entity

chunkingistoassignachunkingstate(tobe

de-scribedinSection3.2)aswellasanamedentity

type to themorphemeM NE

i

atthe current

po-sition, considering the patterns of surrounding

morphemes. Note that inthesupervised

learn-ing phasewe can usethechunkinginformation

onwhichmorphemesconstituteanamedentity,

andwhichmorphemesareintheleft/right

con-textsof thenamed entity.

3.2 Encoding Schemes of Named

Entity Chunking States

In this paper, we evaluate the following two

schemes of encoding chunking states of named

entities. Examples of these encoding schemes

3.2.1 Inside/Outside Encoding

The Inside/Outsidescheme of encoding

chunk-ing states of base noun phrases was studied in

Ramshawand Marcus(1995). Thisscheme

dis-tinguishes the following three states: O { the

wordat thecurrentpositionisoutsideanybase

noun phrase. I { the word at the current

po-sition is inside some base nounphrase. B{ the

word at the current position marks the

begin-ningofabasenounphrasethatimmediately

fol-lowsanotherbasenounphrase. We extendthis

schemetonamedentitychunkingbyfurther

dis-tinguishingeachofthestatesIandBintoeight

named entitytypes. 4

Thus, thisscheme

distin-guishes 28+1=17 states.

3.2.2 Start/End Encoding

The Start/End scheme of encoding chunking

statesofnamedentitieswasemployedinSekine

et al. (1998) and Borthwick (1999). This

scheme distinguishes the following four states

for each named entity type: S{ themorpheme

atthecurrentpositionmarksthebeginningofa

named entityconsisting of more thanone

mor-pheme. C{ themorphemeat the current

posi-tionmarksthemiddleofanamedentity

consist-ing of more than one morpheme. E{ the

mor-phemeat thecurrentpositionmarkstheending

of a named entity consisting of more than one

morpheme. U { the morpheme at the current

positionisanamedentityconsistingofonlyone

morpheme. Theschemealsoconsidersone

addi-tional state forthe positionoutside any named

entity: O { the morpheme at the current

posi-tion is outside any named entity. Thus, inour

setting,thisschemedistinguishes48+1=33

states.

3.3 Preceding/Subsequent Morphemes

as Contextual Clues

In this paper, we evaluate the following two

models of considering preceding/subsequent

4

We allow the state xB for a named entity type x

onlywhenthemorphemeatthecurrent positionmarks

(4)

im-chunking/tagging. Hereweprovideabasic

out-line of these models, and the details of how to

incorporate them into the decisionlistlearning

frameworkwillbe describedinSection4.2.2.

3.3.1 3-gram Model

In this paper, we refer to the model used in

Sekine etal. (1998) and Borthwick(1999) asa

3-gram model. Suppose that the current

posi-tion is at the morpheme M

0

,as illustrated

be-low. Then, when assigning achunkingstate as

well as a named entity type to the morpheme

M

0

, the 3-gram model considers the preceding

singlemorphemeM

1

aswellasthesubsequent

single morphemeM

1

asthecontextual clue.

( Left

Context ) (

Current

Position ) (

Right

Context )

M

1

M

0

M

1

(1)

The major disadvantage of the 3-gram model

is that in the training phase it does not

take into account whether or not the

pre-ceding/subsequent morphemes constitute one

named entity together with the morpheme at

thecurrent position.

3.3.2 Variable Length Model

Inordertoovercomethisdisadvantageofthe

3-grammodel,weproposeanovelmodel,namely

the \Variable Length Model", which

incorpo-rates richer contextual information as well as

patterns of constituent morphemes within a

named entity. Inprinciple,aspartofthe

train-ingphasethismodelconsiderswhichofthe

pre-ceding/subsequent morphemes constitute one

named entity together with the morpheme at

the current position. It also considers

sev-eralmorphemesintheleft/rightcontextsofthe

namedentity. Herewerestrictthismodelto

ex-plicitly considering the cases of named entities

of the length up to three morphemes and only

implicitly considering those longer than three

morphemes. We also restrict it to considering

two morphemes in bothleft and right contexts

of thenamed entity.

( Left

Context

) (NamedEntity) ( Right

Context )

M L

2 M

L

1 M

NE

1 M

NE

i M

NE

m(3) M

R

1 M

R

2

" (2)

4 Supervised Learning for Japanese

Named Entity Recognition

This section describes how to apply the

deci-sion list learning method to chunking/tagging

named entities.

4.1 Decision List Learning

A decision list (Rivest, 1987; Yarowsky, 1994)

is a sorted list of decision rules, each of which

decidesthevalueofadecisionDgivensome

ev-idenceE. Eachdecisionruleinadecisionlistis

sortedindescendingorderwithrespectto some

preference value, and rules with higher

prefer-ence valuesare appliedrst when applyingthe

decisionlist to some newtest data.

First, the random variable D representing a

decisionvariesoverseveral possiblevalues, and

the random variable E representing some

evi-dence varies over `1' and `0' (where `1' denotes

the presence of the corresponding piece of

evi-dence, `0' itsabsence). Then,givensome

train-ing datain which thecorrect value of the

deci-sion D is annotated to each instance, the

con-ditional probabilities P(D=x j E =1) of

ob-serving the decisionD=x under the condition

of the presence of the evidence E (E =1) are

calculated and the decision list is constructed

bythefollowing procedure.

1. Foreachpieceofevidence,wecalculatethe

log of likelihood ratio of the largest

condi-tional probability of the decision D = x

1

(given the presence of that piece of

ev-idence) to the second largest conditional

probabilityofthe decisionD=x

2 :

log

2

P(D=x

1

jE=1)

P(D=x

2

jE=1)

Then, a decision list is constructed with

pieces of evidence sortedin descending

or-der with respect to their log of likelihood

ratios,wherethedecisionoftheruleateach

line is D=x

1

with the largest conditional

probability. 5

5

Yarowsky (1994) discusses several techniques for

avoiding the problems which arise when an observed

count is 0. From among thosetechniques, we employ

the simplestone, i.e.,addingasmallconstant (0:1

0:25) to the numerator and denominator. With

(5)

`adefault',wherethelogoflikelihoodratio

is calculated from the ratio of the largest

marginalprobabilityofthedecisionD=x

1

to the second largest marginal probability

ofthe decisionD=x

The `default' decision of this nal line is

D=x

1

withthe largestmarginal

probabil-ity.

4.2 Decision List Learning for

Chunking/Tagging NamedEntities

4.2.1 Decision

Foreachofthetwoschemesof encoding

chunk-ing states of named entities described in

Sec-tion 3.2, as the possible values of the

deci-sionD,weconsiderexactlythesamecategories

of chunking states as those described in

Sec-tion 3.2.

4.2.2 Evidence

The evidence E used inthe decision list

learn-ing is a combination of the features of

preced-ing/subsequent morphemes aswellasthe

mor-pheme at the current position. The following

describeshow to form theevidence E forboth

the3-gram modeland variable lengthmodel.

3-gram Model

TheevidenceErepresentsatuple(F

1

denotethefeaturesof

imme-diately preceding/subsequent morphemes M

1

andM

1

,respectively,F

0

thefeatureofthe

mor-phemeM

0

atthecurrent position(see Formula

(1)in Section3.3.1). Thedenitionof the

pos-sible values of those features F

1 , F

0

, and F

1

are given below, where M

i

denotes the

mor-pheme itself (i.e., including its lexical form as

well as part-of-speech), C

i

the character type

(i.e., Japanese(hiragana or katakana), Chinese

(kanji), numbers, English alphabets, symbols,

and all possible combinations of these) of M

i ,

T

i

thepart-of-speechof M

i

unsmoothedconditional probability P(D=x j E=1).

Yarowsky's trainingalgorithm also dierssomewhat in

hisuse oftheratio P(D=x

,whichis equivalentin

thecaseofbinaryclassications,andalsobythe

interpo-lation between theglobal probabilities (used here)and

theresidualprobabilities furtherconditionalon

higher-rankedpatternsfailingtomatchinthelist.

1 1 1 1 1

As the evidence E, we consider each possible

combination of the values of those three

fea-tures.

Variable Length Model

The evidence E represents a tuple

(F

the features of the morphemes M L

intheleft/rightcontextsofthecurrent

named entity, respectively, F

NE

the features

of the morphemes M NE

constituting the current named entity (see

Formula (2)inSection3.3.2). Thedenitionof

the possible values of those features F

L

are given below, where F NE

j

denotes

the feature of the j-th constituent morpheme

M NE

j

within the current named entity, and

M NE

i

isthemorphemeatthecurrentposition:

F

As the evidence E, we consider each possible

combination of the values of those three

fea-tures, except that the following three

restric-tions areapplied.

1. In the cases where the current named

en-tity consistsof up to three morphemes,as

the possible values of the feature F

NE in

the denition (3), we consider only those

which are consistent with therequirement

thateach morphemeM NE

j

isaconstituent

of the current named entity. For example,

supposethatthecurrentnamedentity

con-sists of three morphemes, where the

cur-rentpositionis atthemiddleofthose

con-stituent morphemesasbelow:

(

Then,asthepossiblevaluesof thefeature

F

NE

,we consideronlythefollowingfour:

(6)

jF

2. Inthecaseswherethecurrentnamedentity

consists of more than three morphemes,

only the three constituent morphemes are

regarded as within the current named

en-tity and the rest are treated as if they

were outside the named entity. For

exam-ple, suppose that the current named

en-tity consists of four morphemes as below:

(

In this case, the fourth constituent

mor-pheme M NE

4

istreated as ifit were in the

right context of the current named entity

asbelow:

3. As the evidence E, among the possible

combination of the values of three

fea-tures F

, we only accept

those in which the positions of the

mor-phemes are continuous, and reject those

discontinuous combinations. For example,

in the case of Formula (4) above, as the

evidence E, we accept the combination

(M

; null), while we reject

(M

4.3 Procedures for Training and

Testing

Next we will briey describe the entire

pro-cesses of learning the decision list for

chunk-ing/tagging named entities as wellas applying

itto chunking/taggingunseen namedentities.

4.3.1 Training

In the training phase, at the positions where

thecorrespondingmorphemeisaconstituentof

anamedentity,asdescribedinSection4.2,each

allowable combination of features is considered

as the evidence E. On the other hand, at the

positionswherethecorrespondingmorphemeis

outside any named entity, the way the

combi-nation of features is considered is dierent in

1 in the previous section is no longer applied.

Therefore, allthe possiblevaluesof thefeature

F

NE

inDenition(3)areaccepted. Finally,the

frequencyof each decisionDand evidence E is

counted and the decision list is learned as

de-scribedin Section4.1.

4.3.2 Testing

When applying the decision list to

chunk-ing/taggingunseennamedentities,rst,ateach

morphemeposition,thecombinationoffeatures

isconsideredasinthecaseofthenon-entity

po-sitioninthetrainingphase. Then,thedecision

listisconsultedandallthedecisionsoftherules

with a log of likelihood ratio above a certain

threshold are recorded. Finally, as in the case

ofpreviousresearch(Sekineet al.,1998;

Borth-wick, 1999), the most appropriate sequence of

thedecisionsthatareconsistentthroughoutthe

wholesequence issearched for. By consistency

ofthe decisions,we mean requirementssuchas

that thedecision representing thebeginning of

some named entity type has to be followed by

thatrepresentingthemiddleof thesame entity

type(inthecaseofStart/Endencoding). Also,

inourcase, theappropriatenessofthesequence

of the decisions is measured by the sum of the

logoflikelihoodratios ofthecorresponding

de-cisionrules.

5 Experimental Evaluation

We experimentally evaluate the performance

of the supervisedlearning for Japanese named

entity recognition on the IREX workshop's

training and test data. We compare the

re-sults of the combinations of the two

encod-ing schemes of named entity chunking states

(the Inside/Outside and the Start/End

encod-ingschemes)andthetwoapproachesto

contex-tualfeaturedesign(the3-gramandtheVariable

Length models). For each of those

combina-tions,we searchforanoptimalthresholdofthe

log of likelihoodratio in the decision list. The

performance of each combination measured by

F-measure( =1) isgiven inTable 4.

In this evaluation, we exclude the named

entities with \other boundary mismatch" in

Table 2. We also classify the system

output according to the number of

con-stituent morphemes of each named entity

(7)

sub-n Morphemes to1 Named Entity

n1 n=1 n=2 n=3 n2 n3 n4

3-gram Inside/Outside 72.9 75.9 79.7 51.4 69.4 42.5 29.2

Start/End 72.7 76.6 79.6 43.7 68.1 37.8 29.6

Variable Inside/Outside 74.3 77.6 80.0 55.5 70.9 49.9 41.0

Length Start/End 72.1 77.0 75.6 51.5 67.2 48.6 43.6

set, we compare the performance of the four

combinations of f3-gram, Variable Lengthg

fInside/Outside,Start/Endg and show the

highest performancewithbold-faced font.

Several remarkable points of these results of

performancecomparisoncanbestatedasbelow:

Amongthefourcombinations,theVariable

LengthModelwithInside/Outside

Encod-ing performs best in total (n 1) as well

asintherecognitionofnamedentities

con-sisting of more than one morpheme (n =

2;3;n2;3).

In the recognition of named entities

con-sistingof more than two morphemes (n =

3;n 3;4), the Variable Length Model

performs signicantly better than the

3-gram model. This result clearly supports

the claim that our modeling of the

Vari-ableLengthModelhasanadvantageinthe

recognitionof longnamed entities.

In general, the Inside/Outside encoding

scheme performs slightly better than the

Start/End encoding scheme, even though

theformerdistinguishesconsiderablyfewer

statesthan thelatter.

6 Conclusion

In this paper, we applied the supervised

deci-sionlistlearningmethodtoJapanesenamed

en-tityrecognition,intowhichweincorporated

sev-eral noun phrase chunking techniques and

ex-perimentally evaluated their performance. We

showedthatanoveltechniquethatweproposed

outperformedthoseusingpreviouslyconsidered

contextual features.

7 Acknowledgments

This research was carried out while the

au-thors were visiting scholars at Department of

Computer Science, Johns Hopkins University.

The authors would like to thank Prof. David

Yarowsky of Johns Hopkins University for

in-References

A.Borthwick. 1999. A Japanesenamed entity

rec-ognizerconstructedbyanon-speakerofJapanese.

InProc. ofthe IREX Workshop,pages187{193.

M.CollinsandY.Singer. 1999. Unsupervised

mod-els of named entity classication. In Proc. of

the 1999 Joint SIGDAT Conference on

Empiri-cal Methods in Natural Language Processing and

Very Large Corpora,pages100{110.

S.CucerzanandD.Yarowsky.1999. Language

inde-pendentnamedentityrecognitioncombining

mor-phological and contextual evidence. In Proc. of

the 1999 Joint SIGDAT Conference on

Empiri-cal Methods in Natural Language Processing and

Very Large Corpora,pages90{99.

IREXCommittee, editor. 1999. Proceedings of the

IREXWorkshop. (inJapanese).

S. Maiorano. 1996. The multilingual entity task

(MET): Japanese results. In Proc. of TIPSTER

PROGRAMPHASEII,pages449{451.

MUC. 1998. Proceedings ofthe7thMessage

Under-standingConference(MUC-7).

L.Ramshaw andM. Marcus. 1995. Text chunking

using transformation-basedlearning. In Proc. of

the 3rd Workshop on Very Large Corpora, pages

83{94.

R.L.Rivest. 1987. Learningdecisionlists. Machine

Learning,2:229{246.

M. Sassano, Y. Saito, and K. Matsui. 1997.

Japanese morphologicalanalyzer for NLP

appli-cations. In Proc. of the 3rd Annual Meeting of

the Association for Natural LanguageProcessing,

pages441{444. (inJapanese).

S. Sekine, R. Grishman, and H. Shinnou. 1998.

A decision tree method for nding and

classify-ingnames in Japanesetexts. InProc. of the 6th

WorkshoponVeryLargeCorpora,pages148{152.

T. Utsuro and M. Sassano. 2000. Minimally

su-pervisedJapanesenamed entity recognition:

Re-sourcesandevaluation. InProc. ofthe2nd

Inter-national Conference on Language Resources and

Evaluation,pages1229{1236.

D. Yarowsky. 1994. Decision listsfor lexical

ambi-guity resolution: Application to accent

restora-tioninSpanishand French. InProc. of the 32nd