Using Features from a Bilingual Alignment Model in Transliteration Mining



andrew.finch@nict.go.jp

seyamamo@mail.doshisha.ac.jp

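Bilingual segment pairs $(s_k, t_k)$ are drawn from a Dirichlet process with concentration parameter $\alpha > 0$ and base measure $G_0$:

$$G \mid \alpha, G_0 \sim \mathrm{DP}(\alpha, G_0)$$

$$(s_k, t_k) \mid G \sim G$$

The probability of the $k$-th pair given all the other pairs $(s_{-k}, t_{-k})$ is

$$p\big((s_k, t_k) \mid (s_{-k}, t_{-k})\big) = \frac{N((s_k, t_k)) + \alpha\, G_0((s_k, t_k))}{N + \alpha}$$

where $N$ is the total number of pairs among the others and $N((s_k, t_k))$ is the number of times the pair $(s_k, t_k)$ occurs among them.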
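As an illustration of the predictive probability, the following minimal Python sketch evaluates $(N((s_k,t_k)) + \alpha G_0((s_k,t_k))) / (N + \alpha)$ from a table of pair counts. The function names, the uniform base measure `uniform_base`, its `vocab_size` parameter, and the toy counts are all assumptions made for the example; the source does not specify the form of $G_0$ in this section.

```python
from collections import Counter


def predictive_prob(pair, counts, total, alpha, base_prob):
    """Return (N(pair) + alpha * G0(pair)) / (N + alpha).

    `counts` and `total` are assumed to exclude the pair currently
    being scored (the "-k" pairs in the formula).
    """
    return (counts[pair] + alpha * base_prob(pair)) / (total + alpha)


def uniform_base(pair, vocab_size=1000):
    """Hypothetical base measure: uniform over `vocab_size` possible pairs."""
    return 1.0 / vocab_size


# Toy counts of previously generated (source, target) segment pairs.
counts = Counter({("マイ", "mi"): 3, ("ケ", "cha"): 1})
total = sum(counts.values())

# A pair seen before is dominated by its count; an unseen pair
# falls back on the base measure scaled by alpha.
p_seen = predictive_prob(("マイ", "mi"), counts, total, 1.0, uniform_base)
p_new = predictive_prob(("ル", "el"), counts, total, 1.0, uniform_base)
```

With these toy numbers the seen pair receives most of the mass, while the unseen pair receives only the share contributed by $\alpha G_0$.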


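[Figure: segment-level alignment of a Japanese character sequence (segments アン, ド, リュー) with an English character sequence. Labels: Japanese Character Sequence, English Character Sequence, Model Score (0.034, 0.012, 10e-12).]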

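Features $f_1, f_2, f_3, f_4$:

$$f_1 = \frac{\mathrm{logprob}}{\mathrm{numsegs}}, \qquad f_2 = \frac{|t|}{|s|}, \qquad f_3 = \frac{|s_{bad}| + |t_{bad}|}{|s| + |t|}, \qquad f_4 = \mathrm{minprob}$$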
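The four features can be computed from a single co-segmented pair, as in the minimal Python sketch below. Several points are assumptions rather than statements from the source: `logprob` is taken to be the sum of the per-segment log probabilities, a "bad" segment is taken to be one whose log probability falls below a hypothetical cutoff `bad_logprob`, and the function name and input layout are invented for the example.

```python
def mining_features(segments, bad_logprob=-10.0):
    """Compute (f1, f2, f3, f4) for one co-segmented word pair.

    `segments` is a list of (source_segment, target_segment, logprob).
    `bad_logprob` is a hypothetical cutoff: segments scoring below it
    are counted as "bad" (an assumption, not the source's definition).
    """
    s_len = sum(len(s) for s, _, _ in segments)
    t_len = sum(len(t) for _, t, _ in segments)
    logprobs = [lp for _, _, lp in segments]

    bad = [(s, t) for s, t, lp in segments if lp < bad_logprob]
    s_bad = sum(len(s) for s, _ in bad)
    t_bad = sum(len(t) for _, t in bad)

    f1 = sum(logprobs) / len(segments)       # logprob / numsegs
    f2 = t_len / s_len                       # |t| / |s|
    f3 = (s_bad + t_bad) / (s_len + t_len)   # share of characters in bad segments
    f4 = min(logprobs)                       # minprob: least likely segment
    return f1, f2, f3, f4


# Segments and scores borrowed from the マイ|mi ケ|cha ル|el example;
# reading the three scores as per-segment log probabilities is an assumption.
example = [("マイ", "mi", -4.6), ("ケ", "cha", -7.3), ("ル", "el", -5.1)]
f1, f2, f3, f4 = mining_features(example, bad_logprob=-7.0)
```

Normalizing by the number of segments and by the total length keeps the features comparable across word pairs of different lengths.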


[Figure: mining framework. Boxes: Web Resource (Wikipedia); Japanese Wiki Titles and English Wiki Titles joined by interlanguage links (e.g. マイケルジャクソン ... / Michael Jackson ...); Bilingual Co-segmentation; Segment File (e.g. マイ|mi ケ|cha ル|el -4.6 -7.3 - -5.1); Features; SVM with Threshold; Seed Sentences (Positive Examples) and Negative Examples for training (Train); Good pairs and Bad pairs as output for the test data (Test). Annotation on the test data: "Test pairs are a randomly sampled".]


[Figure: plot with axes "Log probability of the least likely segment" and "Average log probability of the segments".]

[Figure: precision as a function of the SVM classification threshold. Axes: SVM classification threshold (-1 to 1) and Score (0.8 to 1).]
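The trade-off controlled by the SVM classification threshold can be illustrated with the minimal Python sketch below, which sweeps a threshold over classifier scores and reports precision and recall at each setting. The scores and labels are invented toy data, and the convention of reporting precision as 1.0 when nothing is accepted is an assumption of the sketch.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when pairs scoring >= threshold are accepted."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y)
    fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y)
    # Assumed convention: precision is 1.0 when no pair is accepted.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy classifier scores and gold labels (True = correct transliteration pair).
scores = [0.9, 0.4, 0.1, -0.2, -0.7]
labels = [True, True, False, True, False]

# One (threshold, precision, recall) triple per threshold setting.
curve = [(t, *precision_recall(scores, labels, t))
         for t in (-1.0, -0.5, 0.0, 0.5, 1.0)]
```

Raising the threshold accepts fewer pairs, which generally trades recall for precision, although precision need not rise monotonically on small samples such as this one.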


[Figure: precision-recall curves for En-Ar, En-Ch, En-Hi, En-Ru and En-Ta. Axes: Recall and Precision. Curves: proposed, LCSR baseline (lcsr50, lcsr40), random baseline.]
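The lcsr curves refer to a longest common subsequence ratio (LCSR) baseline. A minimal Python sketch of the usual definition, the length of the longest common subsequence divided by the length of the longer string, is given below. It is not taken from the source: the function names are invented, the comparison is assumed to be made on romanized strings, and reading lcsr50 and lcsr40 as thresholds of 0.5 and 0.4 is a guess.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a, b):
    """LCS length divided by the length of the longer string."""
    if not a and not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


# Toy comparison of an English name with a romanized rendering.
ratio = lcsr("michael", "maikeru")
# Hypothetical threshold of 0.4; the pair is accepted if the ratio reaches it.
accepted = ratio >= 0.4
```

A baseline of this kind only measures surface overlap between character strings, whereas the proposed features are derived from the alignment model's segment probabilities.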
