Chinese Word Segmentation without Using Lexicon and Hand crafted Training Data

(1)

Chinese Word Segmentation

without Using Lexicon and Hand-crafted Training Data

Sun Maosong, Shen Dayang*, Benjamin K Tsou**

State Key Laboratory o f Intelligent Technology and Systems, Tsinghua University, Beijing, China Email: lkc-dcs@mail.tsinghua.edu.cn

* Computer Science Institute, Shantou University, Guangdong, China

** Language Information Sciences Research Centre, City University o f H o n g Kong, Hong Kong

Abstract

Chinese word segmentation is the first step in any Chinese NLP system. This paper presents a new algoritlml for segmenting Chinese texts without making use of any lexicon and hand-crafted linguistic resource. The statistical data required by the algorithm, that is, mutual information and the difference of t-score between characters, is derived automatically from raw Chinesc corpora. The preliminary experiment shows that the segmentation accuracy of our algorithm is acceptable. We hope the gaining of this approach will be beneficial to improving the performance(especially in ability to cope with unkamwn words and ability to adapt to various domains) of the existing segmenters, though the algorithm itself can also be utilized as a stand-alone segmenter in some NLP applications.

1. Introduction

An)' Chinese word is composed of either single or multiple characters. Chinese texts are explicitly concatenations of characters, words are not delimited by spaces as that in English. Chinese word segmentation is therefore the first step for an), Chinese infonnation processing system[ 1 ].

Almost all methods for Chinese word segmentation developed so far, both statistical and rule-based, exploited two kinds of important resources, i.e., lexicon and hand-crafted linguistic resources(manually segmented and tagged corpus, knowledge for unknom~ words, and linguistic

This work was supported in part by the National Natural Science Foundation of China under grant No. 69433010.

roles)[1,2,3,5,6,8,9,10]. Lexicon is usually used as the means for finding segmentation candidates for input sentences, while linguistic resources for solving segmentation ambiguities. Preparation of these resources (well-defined lexicon, widely accepted tag set, consistent mmotated corpus etc.) is very hard due to particularity of Chinese, and time consuming Furthermore, even the lexicon is large enough, and the corpus annotated is balanced and huge in size, the word segmenter will still face the problem of data incompleteness, sparseness and bias as it is utilized in different domains.

An important issue in designing Chinese segmenters is thus how to reduce the effort of human supervision as much as possible. Pahner(1997) conducted a Chinese segmenter which merely made use of a manually segmented corpus(without referring to any lexicon). A transformation-based algorithm was then explored to learn segmentation rules automatically from the segmented corpus. Sproat and Shih(1993) further proposed a method using neither lexicon nor segmented corpus: for input texts, simply grouping character pairs with high value of mutual information into words. Although this strategy is very simple and has many limitations(e.g., it can only treat bi-character words), the characteristic of it is that it is fully automatic -- the mutual information between characters can be trained from raw Chinese corpus directly.

(2)

accuracy is acceptable); nevertheless, our main purpose is to study how and how well the work can be done by machine at the extreme conditions, say, without any assistance of human. We believe the performance of the existing Chinese segmenters, that is, the ability to deal with segmentation ambiguities and unknown words as well as the ability to adapt to new domains, will be improved in some degree if the gaining of this approach is incorporated into systems properly.

2. Principle

2 . 1 . M u t u a l i n f o r m a t i o n a n d d i f f e r e n c e o f t - s c o r e b e t w e e n c h a r a c t e r s

Mutual

information

and

t-score,

two

important concepts in inforlnation theory and statistics, have been exploited to measure the degree of association between two words in an English corpus[4]. We adopt these measures almost completely here, with one major modification: the variables in two relevant fommlae are no longer

words'

but

Chinese characters.

Definition 1 Given a Chinese character string 'xy',

the

mutual information

between characters x and

3'(or equally, the

mutual information

of the

location

between x and y) is defined as:

mi(x: y)

= log~

p ( x , y )

p ( x ) p ( y )

where p(x,y) is the co-occurrence probability of x and y, and p(x), p(y) are the independent probabilities o f x and y respectively.

As claimed by Clmrch(1991), the larger the mutual information between x and y, the higher the possibility of x and y being combined together. For example:

mi

The distribution of

mi(x:y)

for sentence (1) is illustrated in Fig. l(where " O " denotes x, y should be combined and "M" be separated in terms of human judgment. This convention will be effective throughout the paper). The correct segmentation for (1) can be achieved when we decide that every location between x and y in the sentence be treated as 'combined' or 'separated' accordingly if its mi value is greater than or below a threshold(suppose the threshold is 3.0 for this example):

economy cooperation will be

I

for current world economy trend

of an appropriate answer

(Economic

cooperation

will

be

an

appropriate answer to the trend o f economics

in current worM.)

It is evident that x and y are to be strongly combined together if

mi(x:y)>>O

and to be separated if

mi(x:y)<<O.

But if

mi(x:y) ~ O,

the association of x and y becomes uncertain.

Observe the

mi

distribution for sentence (2) in Fig. 2:

(2)

In the region of 2.0 ~ mi < 4.0, there exist some confusions: we have

mi(Zfl: ~)=mi(~.'~/)>

mi(y~:." ¢7~), mi(,/~." ~:) > mi(~." ~/") > mi('/gL ~ ) ,

and

mi ( ~j:q' ~ ) > mi ( ~ . f/),

however, "54~:~'"'-~::

,,- x . , , , - ~,,, h

~. ; ~ : v ~ : ~ s ould be separated and " ~ : ~ ' " ' ~ : ; ~ ' " ' 7 ~ : ~1'"'$~: ~ " be combined by human judgment -- the power

ofmi

is somewhat weak in

10

8

6

4

2

0

-2

9

I l l

I1 II1 Iii

I I !

t I a • m | i t . . . . ~ ' I . . . . J " | i • i • I d , ~ " I I

B I i

N

. . . m . . . .

Fig. l T h e d i s t r i b u t i o n of m i ( s e n t e n c e 1)

• c o n n e c t

R break

[image:2.612.77.487.570.712.2]

(3)

mi g

6

4

2

0

-2

-4

, e

41" : ' ' .:

. . . ¢ , , i . ,. , ! : : : :: ' :; : : • c o i I n e c t • ~" : - - ~'~ I ~ b r e a k I

I . : : : '

i ~ I I : " i " " : : ' i ; I I I " I . . . . : ' 1 ~ : ' l ' : : ' ' " ~ '

• ; " ' " : " : " : . L ' "! : : :

I I I " : . •

I l l . . . . i . : : : " •

F i g . 2 T h e d i s t r i b u t i o n o f n l i ( s e n t e n c e 2 ) C h a r a c t e r p m r s m s e n t e n c e

the 'intermediate' range of its value. To solve this problem, we need to seek other ways additionally. Definition 2 Given a Chinese character string 'xvz'. the

t-score

of the character y relevant to characters x and z is defined as:

p(zl y) - p(yl x)

ts.:(y) = ~ v a r ( p ( z l y ) ) + var(p(ylx))

where p(ylx) is the conditional probability of y given x, and p(zly), of z given y, and var(p(ylx)), var(p(z[y)) are variances of p(ylx) and of p(zly) respectively.

Also as pointed out by Church(1991),

ts~,,(y)

indicates the binding tendency of y m the context of x and z:

if p(zly ) > p(ylx), or

is,, (y) > 0

then y tends to be bound with z rather than with x

i f p ( y l x ) > p(zly), or

tsx,:(y ) < 0

then y tends to be bound with x rather than with z

A distinct feature of ts is that it is context- dependent (a relative measure), along with certain degree of flexibility to the context, whereas

mi

is context-independent (an absolute measure). Its drawback is it attaches to a character rather than to the location between two adjacent characters. This may cause some inconvenience if we want to unify it with

mi.

We initially introduce a new measure

dts

instead of

ts:

Definition 3 Given a Chinese character string 'vxyw', the

difference oft-score

between characters x and v is defined as:

dts(x: y ) = ts,., (x) -

ts.,... ( y )

Now

dls(x:y)

is allocated to the location

between x and y, just like

m i ( x : y ) .

And the context of

dts(x:y)

becomes 4 characters, 1 character larger than that of ts~., ( y ) .

The value of

dts(x:y)

reflects the competition results among four adjacent characters v, x, y and w:

(1) t , % ( x ) > 0 ts,.w(y ) < 0

(x tends to combine with y, and y tends to combine with x) ==>

dts(x:y) > 0

Ill this case, x and y attract each other. The location between x and y should bc bound.

(2)

t%,y(x) < 0

ts.,.w(y) > 0

(x tends to combine with v, and y tends to combine with w) ==>

dts(x:y) < 0

[11 this case, x and y repel each other. The location between x and y should be separated.

(3a)

tSv.y (x) > 0

tsx, ~

(y) > 0

(x tends to combine with y, whereas y tends to combine with w)

(35) tsv,, (x) < 0 ts,..w ( y ) < 0

(x tends to combine with v, whereas y tends to combine with x)

®

(y)

In cases of (3a) mad (3b), the status of the location between x and y is determined by the competition of

t,%,y (x)

and t,5'~,~ (y ) :

if

dts(x:

y ) > 0 then it tends to be bound

(4)

dts 2 0 0

150

100

5O

0 i...

-50 _#

-100

If:

• co~mect

• break

Fig. 3 The distribution of dts(sentence 2) Character pairs in sentence

The general role governing

dts

is similar as that governing

mi:

the higher the difference of t- score between x and y, the stronger the combination strength between them, and vice versa. But the role of

dts

is somewhat different from that of rm: it is capable of complementing the 'blind area" of

mi

on some occasions.

Consider sentence (2) again. The distribution of

dts

for it is shown in Fig. 3. Return to the character pairs whose

mi

values fall into the region

of

2.0 <~ mi

< 4.0 in Fig. 2, compare their

dts

values accordingly:

dts( J~: ,z/z) > dts(.~_~k+.. ~ ) >

dts(S?:

dts(/A. N) >

> dts( :

and

dts(]2.][) > dts((~/.: ~_£) --

the conclusion

drawn from these comparisons is very close to the human judgment.

2.2. Local m a x i m u m and local m i n i m u m o f

dts

Most of the character pairs in sentence (2) have got satisfactory explanations by their mi and

dts

so far. " - ~ : ~ . . . . T[z :;~A ~7, are two of few exceptions. We have

mi(.~k)~)>

m i ( ~ : ~ ) and

dts(~:/_/,~) > dts(e~/::

i f ) , however, the human

judgment is the former should be separated and the latter be bound. Aiming at this, we further proposed two new concepts, that is, local maximum and local minimum of

dts.

Definition 4 Given 'vxyw' a Chinese character string,

dts(x:y)

is said to be a local maximum if

dts(x:y) > dts(v.'x)

and

dts(x:y) > dts(y:w).

And,

the height of the local maximum

dts(x.'y)

is defined

as:

h(dts(x:y))

= rain {

dts(x:y)- dts(v:x),

dts(x.'y) -- dts(y:w) }

Definition 5 Given 'vxyw' a Chinese character string,

dts(x:y)

is said to be a local minimum if

dts(x:y) < dts(v:x)

and

dts(x:y) < dts(y.'w). And,

the depth of the local minimum

dts(x.'y)

is defined

as:

d(dts(x.'y))

= min {

dts(v.'x)- dts(x:y),

c#s(y..w)- Jts(x..y) }

Two basic hypotheses can be easily made as the consequence of context-dependability of

dts(note: mi

has not such property):

Hypothesis 1 x and y tends to be bound

ifdts'(x:y)

is a local maximum, regardless of the value of

dts(x:y)(even

it is low).

Hypothesis 2 x and y tends to be separated if

dts(x.y)

is a local nainimuna, regardless of the value

of dts(x.y)(even

it is high).

In Fig. 3,

dts(,~J~." ~ )

is a local mininmm whereas

dts(H.'~)

isn't. At least we can say that s~: z~ is likely to be separated, as suggested by the hypothesis 2(though we still can say nothing more about " ~ : ~ " ) .

2.3. Tile second local m a x i m u m and the second local m i n i m u m of

dts

We continue to define other four related concepts:

Defnition 6 Suppose 'vxyzw' is a Chinese character string, and

dts(x:y)

is a local maximum. Then

dts(y:z)

is said to be the

right

second local maximum of

dts(x.'y)

if

dts(y:z) > dts(v:x) and

dts(y:z)> dts(z:w).And,

the distance between the

local maximum and the second local maximum is defined as:

dis(locmax, y:z) = dts(x.'y) -- dts(y:z)

[image:4.612.91.488.82.213.2]

(5)

character string, and dts(x:y) is a local minimum. Then dts(y:z) is said to be the right second local minimum o f dts(x:y) if dts(y:z)< dts(v:x) and dts(y:z) < dts(z:w). And, the distance between the local minimum and the second local minimum is defined as:

dis(locmin, y:z) = d t s ( y : z ) - dts(x:y)

The left second local maximum and the left

second local minimum of dts(x:y) can be defined similarly.

Refer to Fig. 3. By definition, dts(ff.'y?g) is the le~q second local minimum of d t s ( ~ : ~ ) , and

dts(y.A.~:O 3) is the right second local maximum of

dts(~'/." ~ : ) meanwhile the left second local minimum of dts( -¢~..: ~ ) .

These four measures are designed to deal with two common construction types in Chinese word fornmtion: "2 characters + 1 character" and "1 character + 2 characters". We will skip the discussion about this due to the limited volume of the paper.

3. Algorithm

The basic idea is to try to integrate all of the measures introduced in section 2 together into an algorithm, making best use of the advantages and bypassing the disadvantages of them under different conditions.

Given an input sentence S. let

/t .... the mean of mi of all locations in S;

or,,, the standard deviation of mi of all locations in S;

# d.. the mean of dts of all locations in S;

(in fact, /tdt,,. ~ 0)

o-at,, : the standard deviation of dts of all

locations in S

we divide the distribution graphs of mi and dts

of S into several regions(4 regions for each graph) by It,,,, o r , , ~ldt s and O'dt,~"

~ o n A

re_ on B

C

region D.

dts(x:y) > era,., 0 < dts(x:y)<~ oa~.~ -cr at.,. < dts(x.'y)~ 0

dts(x:y) cr

mi(x:y) >/tmi + O-,m It,, ~. mi(x:y)<~ ,tt,,~ q- o-,,

/t., i -- or., i < mi(x:y)~ /l .,

re~ion d mi(x.'y) <~ /t ,,,i -- or,. i

The algorithm scans the input sentence S from left to right two times:

The first round for S

For any location (x:y) in S, do

1. in cases that <dts(x.'y), mi(x:y)> falls into:

1.1 Aa or Ba or Ca or Da or Ab

mark (x:y) 'bound'

1.2 A d o r B d o r C d o r D d o r D c

mark (x:y) 'separated'

1.3 Ac or Cb

ifdts(x:y) is local maxinmm then

if h(dts(x:y)) > 61

then mark (x:y) 'bound' else '?'

ifdts(x:y) is local minimum then

if d(dts(x:y))> ~2

then mark (x:y) 'separated' else "7' 1.4 Bc or Db

if dts(x:y) is local maximum then

if h(dts(x:y)) > 5 2

then mark (x:y) 'bound' else "7'

ifdts(x:y) is local minimum then

if d(dtsOc:y)) > ~

then mark (x:y) 'separated' else '?'

1.5 Cc

if (dts(x:y) is local maxinmm) and

(h(dts(x:y)) > 5 3 )

then mark (x:y) 'bound' else '9'

ifdts(x:y) is local minimum

then mark (x:y) 'separated' else '9' 1.6 Bb

if dts(x:y) is local maximum then mark (x:y) 'bound' else '9' if (dts(x:y) is local minimum) and

(d(dt,(x:y)) > )

then mark (x:y) 'separated' else ")' 2. For (x:y) unmarked so far, mark it as '9'

cxcept that:

ifdts(x:y) is tile second local maximum then if dis(locmax, x:y) <

0.5 × lrmin(loc, x:y)

/* Refer to the notations in definition 6&7.

(6)

then mark (x:y) ' ~ - ' if

(x:y) is the right second local max o r ' ~ ' i f

(x:y) is the left second local max

ifdts(x:y) is the second local minimum then if dis(locmin, x:y) <

0.5 X lrmin(loe, x.'y)

then mark (x:y) '4-' if

(x:y) is the right second local min or '-~' if

(x:y) is the left second local min The second round for S

if (x:y) is marked '9' then if mi(x:y) >~ 0

then mark (x:y) 'bound' else 'separated' if (x:y) is marked ''--"

then the status of (x:y) follows that of the adjacent location on the left side if (x:y) is marked ' ~ "

then the status of (x:y) follows that of the adjacent location on the right side (The constants c~, d2, c~3, ~1, ~2, ~3 are determined by experinaents, satisfying:

G

<G

and 0 =2.5)

Generally speaking, the lower the <dts(x:y), mi(x:y)> in distribution graphs, the more restrictive the constraints. Take 'bound' operation as example: there is not any additional condition in case 1.1; in case 1.6 however, the existence of a local maximum is needed; in case 1.3, a requirement for the height of local maximum is added; in case 1.4, the height required becomes even higher; and in case 1.5, which is the worst case for 'bound' operation, the height must be high enough.

Case 2 says if the second local maximum is pretty near to the local maximum corresponded, then its status ('bound' or 'separated') would be likely to be consistent with that of the local maximum. So does the second local minimum.

Finally, for locations marked '?' with which we have no more means to cope, simply make decisions by the value of mi(we set it to 2.5, same as that in the system of Sproat and Shih(1993)).

Recall sentence (2). The character pair "SK:: ~ " is regarded as "separated' successfully by

following "~E: W__,"(local minimum) with the rule in case 2 although its mi value is rather high(3.4). " ~ : ;~" is marked '?' in the first round and treated properly by 0 in the second round.

The algorithm outputs segmentation for sentence (2) at last:

the correct

France tennis competition today

in Paris the western suburbs

open curtain

(The Tennis Competition o f France opened in the western suburbs o f Paris today.)

Note that there exist two ambiguous fragments " ~ } ~ 5 ~ - " ( " ~ I ~ " or " ~ Y [ : ~ " ) and " ~ ~ " ( " ~ I ~ i i ~ " or " ~ I ~ I ~v"), as well as two proper nouns "France" and "Paris" in sentence (2).

4. Experimental results

We select 100 Chinese sentences, consisting of 1588 characters(or 1587 locations between character pairs) randomly as testing texts. The statistical data required by calculating rni and dts,

in fact it is character bigram, is automatically derived from a news corpus of about 20M Chinese characters. The testing texts and training corpus are mutually excluded.

Out of 1587 locations in the testing texts,

1456 are correctly marked by our algoritlun. We define the accuracy of segmentation as:

# of locations being correctly marked

.................................................

# of locations in texts

Then, the accuracy for testing texts is 1456/1587 = 91.75%.

The distribution of local maximum, local minimum and other types of dts value(involving the second local maximum and the second local minimum) of the testing texts over <tits, mi>

regions is summarized in Fig. 4 (Fig. 5 is the same distribution in percentage representation). This would be helpful for readers to understand our algoritlml.

(7)

experiments; (2) refining the algorithm by studying the relationship between mi and dts in depth; and (3) integrating it as a module with the existing Chinese segmenters so as to improve their performance (especially in ability to cope with unknown words and ability to adapt to various domains). -- it

is

indeed the ultimate goal of our research here.

5. Acknowledgments

This work benefited a lot from discussions with Professor Huang Changning of Tsinghua University, Bejing, China. We would also like to thank anonymous COLING-ACL'98 reviewers for their helpful comments.

G

,-+

250

2()0

150

100

50

0

Aa Ab Ac Ad Ba Bb Bc Bd Ca Cb Cc Cd I)a Db Dc Dd

Fig.4 The distribution of dts types in testing texts Region

[] Others • LocMin m LocMax

100%

"v 80%

a

e. 60%

~ 40%

O~ rt,

2O%

0%

:~: i l~ ~l,i i; i~:! ii) #' iii~iii ii!!i!i iiii!ii ii

: ~ ~

Aa Ab Ac Ad Ba Bb Bc Bd Ca Cb Cc Cd l)a Db Dc Dd

t:ig.5 The distribution of dts types in testing texts

[] Others ]

; [] LocMin ] _El LocMax]

Region

References

[1] Liang N.Y.. "CDWS: An Automatic Word Segmentation System for Written Chinese Texts",

,hmrnal of Chines'e Information Processing, Vol. 1, No.2, 1987 (in Chinese)

]2] Fan C.K.,Tsai W.H.. "Automatic Word Identification in Chinese Sentences by the Relaxation Technique", Computer Processing of Chinese & Oriental Languages, Vol.4, No. 1, 1988 [3] Yao T.S., Zhang G.P., Wu Y.M., "A Rule- based Chinese Word Segmentation System",

Journal of Chinese Infi)rmation Processing, Vol.4, No. 1, 1990 (in Chinese)

114] Church K.W., Hanks P., Hindle D., "Using Statistics in Lexical /ulalysis", In Lexical Acquisition: Exploiting On-line Resources to Build a Lexicon, edited by U. Zernik, Hillsdale, N.J.:Erlbaum, 1991

[5] Chan K.J., Liu S.H., "Word Identification for Mandarin Chinese Sentences", Proc. of COLING- 92, Nantes, 1992

[6] Sun M.S, Lai B.Y., Lun S., Sun C.F., "Some Issues on Statistical Approach to Chinese Word Identification", Proc. of the 3rd International Conference on Chinese InJbrmation Processing,

Beijing, 1992

[7] Sproat R., Shill C.L., "A Statistical Method tbr Finding Word Boundaries in Chinese Text",

Computer Processing of Chinese and Oriental Languages, No.4, 1993

[8] Sproat R. et al, "A Stochastic Finite-State Word Segmentation Algorithnl for Chinese", Proc. of the 32nd Annual Meeting of ACL, New Mexico,

1994