Automatic Text Processing: Cross-Lingual Text Categorization


Dipartimento di Ingegneria dell’Informazione Università degli Studi di Siena

Dottorato di Ricerca in Ingegneria dell’Informazione, XVII ciclo

Candidate: Advisor:

Artificial Intelligence Research Group of Siena

Outline

− Introduction to Cross-Lingual Text Categorization:
  − Relationships with Cross-Lingual Information Retrieval
  − Possible approaches
− Text Categorization
  − Multinomial Naive Bayes models
  − Distance distribution and term filtering
  − Learning with labeled and unlabeled data
− The algorithm
  − The basic solution
  − The modified algorithm
− Experimental results and conclusions

Cross-Lingual Text Categorization

− The problem has arisen in recent years because of the large number of documents available in many different languages
− Many industries would like to categorize new documents according to their existing class structure without building a separate text management system for each language
− Cross-Lingual Text Categorization (CLTC) is closely related to Cross-Lingual Information Retrieval (CLIR):
  − Many works in the literature deal with CLIR
  − Very little work has been done on CLTC

Cross-Lingual Information Retrieval

a) Poly-Lingual:
  − The data are composed of documents in different languages
  − The dictionary contains terms from the dictionaries of the different languages
  − A large learning set containing sufficient documents for each language is needed
  − A single classifier is trained
b) Cross-Lingual:
  − The language of a document is identified and the document is translated into a different language
  − A new classifier is trained for each language

a) Poly-Lingual

Drawbacks:
  − Requires many documents in the learning set for each language
  − High dimensionality of the dictionary:
    − n vocabularies
    − Many terms shared between two languages
  − Difficult feature selection due to the coexistence of many different languages

Advantages:
  − Conceptually simple method
  − A single classifier is used
  − Quite good performance

b) Cross-Lingual

Drawbacks:
  − Use of a translation step:
    − Very low performance
    − Named Entity Recognition (NER)
    − Time consuming
  − In some approaches experts for each language are needed

Advantages:
  − It does not need experts for each language

Three different approaches:
  1. Training set translation
  2. Test set translation
  3. “Esperanto”

1. Training set translation

− The classifier is trained with documents in language L2 obtained by translating the L1 learning set:
  − L2 is the language of the unlabeled data
  − The learning set is very noisy and the classifier may perform poorly
− The system works on the documents in language L2:
  − Fewer translations are needed than in the test set translation approach
− Rarely used in CLIR

2. Test set translation

− The model is trained using documents in language L1 without translation:
  − Training uses data not corrupted by noise
− The unlabeled documents in language L2 are translated into language L1:
  − The translation step is highly time consuming
  − Translation performs very poorly and introduces much noise
  − A filtering phase on the test data is needed after the translation
− The translated documents are categorized by the classifier trained in language L1:
  − Possible inconsistency between training data and unlabeled data

3. Esperanto

− All documents in every language are translated into a new universal language, “Esperanto” (LE):
  − The new language should preserve all the semantic features of each language
    − Very difficult to design
    − A large amount of knowledge about each language is needed
− The system works in this new universal language:
  − It requires translating both the training set and the test set
    − Very time consuming
− Rarely used in CLIR

From CLIR to CLTC

Following the CLIR:

a) Poly-Lingual approach
  − n mono-lingual text categorization problems, one for each language
  − It requires a test set for each language: experts who label the documents for each language
b) Cross-Lingual
  1. Test set translation:
    − It requires translating the test set ⇒ time consuming
  2. Esperanto:
    − It is very time consuming and requires a large amount of knowledge for each language
  3. Training set translation:
    − No proposals using this technique

CLTC problem formulation

− “Given a predefined category organization for documents in language L1, the task is to classify documents in language L2 according to that organization without having to manually label the data in L2, since labeling requires experts in that language and is expensive.”
− The Poly-Lingual approach is not usable in this case, since it requires a learning set in the unknown language L2
− The “Esperanto” approach is not possible either, since it needs knowledge about all the languages
− Only the training set translation and test set translation approaches can be used for this type of problem

Naive Bayes classifier

− The two most successful techniques for text categorization:
  − Naive Bayes
  − SVM
− Naive Bayes
  − A document $d_i$ belongs to the class $C_j$ such that:

    $$C_j = \arg\max_{C_r} P(C_r \mid d_i)$$

  − Using Bayes' rule, the probability $P(C_r \mid d_i)$ can be expressed as:

    $$P(C_r \mid d_i) = \frac{P(d_i \mid C_r)\, P(C_r)}{P(d_i)}$$

Multinomial Naive Bayes

− Since $P(d_i)$ is a common factor, it can be neglected
− $P(C_r)$ can be easily estimated from the document distribution in the training set, or otherwise it can be considered constant
− The naive assumption is that the presence of each word in a document is an independent event and does not depend on the others. This allows us to write:

  $$P(d_i \mid C_r) = \prod_{w_t \in d_i} P(w_t \mid C_r)^{N(w_t, d_i)}$$

  where $N(w_t, d_i)$ is the number of occurrences of word $w_t$ in the document $d_i$.
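For long documents the product of word probabilities underflows quickly, so the decision rule is commonly evaluated in log space. The short derivation below uses only the symbols defined on these slides; the step itself is a standard reformulation and is not taken from the slides.

$$
\begin{aligned}
C_j &= \arg\max_{C_r} \frac{P(C_r)\, P(d_i \mid C_r)}{P(d_i)}
     = \arg\max_{C_r} P(C_r) \prod_{w_t \in d_i} P(w_t \mid C_r)^{N(w_t, d_i)} \\
    &= \arg\max_{C_r} \Big[ \log P(C_r) + \sum_{w_t \in d_i} N(w_t, d_i)\, \log P(w_t \mid C_r) \Big]
\end{aligned}
$$

The denominator $P(d_i)$ disappears because it does not depend on $C_r$, and the logarithm preserves the arg max because it is monotonically increasing.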

Multinomial Naive Bayes

− Assuming that each document is drawn from a multinomial distribution of words, the probability of $w_t$ in class $C_r$ can be estimated as:

  $$P(w_t \mid C_r) = \frac{\sum_{d_i \in C_r} N(w_t, d_i)}{\sum_{w_s} \sum_{d_i \in C_r} N(w_s, d_i)}$$

− This method is very simple and is one of the most widely used in text categorization
− Despite the strong naive assumption, it yields good performance in most cases

Smoothing techniques

− A typical problem in probabilistic models is zero values:
  − If a feature was never observed during training, its estimated probability is 0. When it is observed during classification, the 0 value cannot be used, since it makes the likelihood null
− The two main methods to avoid zeros are:
  − Additive smoothing (add-one or Laplace):

    $$\hat{P}(w_t \mid C_j) = \frac{\#(w_t \in C_j) + 1}{\#(w \in C_j) + |V|}$$

  − Good-Turing smoothing:

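    $$\hat{P}\big(w : \#(w \in C_j) = 0\big) = \frac{\#\{w : \#(w \in C_j) = 1\}}{\#(w \in C_j)}$$

    that is, the probability mass reserved for words never observed in $C_j$ is the number of words observed exactly once in $C_j$ divided by the total number of word occurrences in $C_j$.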
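The multinomial estimate and add-one smoothing can be combined in a few lines. The following is a minimal sketch, not the implementation used for the experiments: it assumes documents are already tokenized into lists of words, scores classes in log space, and simply skips words outside the training vocabulary. The function names and the toy data are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels):
    """Estimate log P(C_r) and add-one smoothed log P(w_t | C_r) from tokenized documents."""
    vocab = {w for doc in docs for w in doc}
    class_docs = Counter(labels)          # number of documents per class
    word_counts = defaultdict(Counter)    # word_counts[c][w] = sum of N(w, d_i) over d_i in c
    for doc, c in zip(docs, labels):
        word_counts[c].update(doc)

    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_cond = {}
    for c in class_docs:
        total = sum(word_counts[c].values())
        # add-one (Laplace) smoothing: (count + 1) / (total + |V|)
        log_cond[c] = {
            w: math.log((word_counts[c][w] + 1) / (total + len(vocab)))
            for w in vocab
        }
    return log_prior, log_cond, vocab


def classify(doc, log_prior, log_cond, vocab):
    """Return argmax over classes of log P(C_r) + sum_t N(w_t, d_i) log P(w_t | C_r)."""
    counts = Counter(w for w in doc if w in vocab)   # out-of-vocabulary words are skipped
    scores = {
        c: log_prior[c] + sum(n * log_cond[c][w] for w, n in counts.items())
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Toy data, invented for this example.
docs = [["engine", "wheel", "engine"], ["cpu", "disk"], ["goal", "match", "goal"]]
labels = ["auto", "hw", "sport"]
model = train_multinomial_nb(docs, labels)
print(classify(["engine", "goal", "engine"], *model))
```

Thanks to smoothing, the word "goal" has a non-zero probability in the "auto" class even though it never occurs there in training, so the likelihood of the test document is not forced to zero.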

Distance distribution

− The documents are distributed uniformly in the space and do not form clouds
− The distance between two similar documents and the distance between two different documents are very close
− This depends on:
  − The high number of dimensions
  − The high number of non-discriminative words, which dominate the others in the evaluation of the distances
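The effect of dimensionality on distances can be illustrated with synthetic data. The sketch below is only an illustration under a strong assumption: it uses uniformly random vectors instead of real document vectors, and measures how much the farthest point differs from the nearest one relative to the nearest distance. The function name and parameters are hypothetical.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(max distance - min distance) / min distance from one query point to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))   # Euclidean distance
    return (max(dists) - min(dists)) / min(dists)


# The contrast shrinks as the number of dimensions grows.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

As the contrast approaches zero, nearest and farthest neighbors become hard to tell apart, which is one motivation for filtering out non-discriminative terms before computing distances.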

Distances distribution

[Figure: distances distribution]

Information Gain

− Term filtering:
  − Stopword list
  − Luhn reduction
  − Information gain
− Information gain:

  $$IG(w_i, C_k) = \sum_{c \in \{C_k, \bar{C}_k\}} \; \sum_{w \in \{w_i, \bar{w}_i\}} P(w, c) \times \log_2 \frac{P(w, c)}{P(w)\, P(c)}$$

  $$IG(w_i) = \sum_{k=1}^{|C|} IG(w_i, C_k)$$
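The two information gain formulas translate directly into code. This is a minimal sketch under stated assumptions: probabilities are estimated from document-level presence or absence of a word (each document is a set of words), and terms with zero joint probability contribute nothing to the sum. The helper `top_k_terms` and the toy data are illustrative additions.

```python
import math


def information_gain(docs, labels, word, cls):
    """IG(w_i, C_k): sum over word present/absent and class member/non-member."""
    n = len(docs)
    ig = 0.0
    for has_word in (True, False):
        for in_class in (True, False):
            joint = sum(
                1 for d, c in zip(docs, labels)
                if (word in d) == has_word and (c == cls) == in_class
            ) / n
            p_w = sum(1 for d in docs if (word in d) == has_word) / n
            p_c = sum(1 for c in labels if (c == cls) == in_class) / n
            if joint > 0:   # 0 * log(0) is taken as 0
                ig += joint * math.log2(joint / (p_w * p_c))
    return ig


def total_information_gain(docs, labels, word):
    """IG(w_i): sum of IG(w_i, C_k) over all classes C_k."""
    return sum(information_gain(docs, labels, word, c) for c in set(labels))


def top_k_terms(docs, labels, k):
    """Keep the k terms with the highest IG; ties are broken alphabetically."""
    vocab = {w for d in docs for w in d}
    return sorted(vocab, key=lambda w: (-total_information_gain(docs, labels, w), w))[:k]


# Toy data, invented for this example: "the" occurs everywhere and carries no information.
docs = [{"engine", "the"}, {"engine", "the"}, {"goal", "the"}, {"goal", "the"}]
labels = ["auto", "auto", "sport", "sport"]
print(top_k_terms(docs, labels, 2))
```

A word that appears in every document, like "the" here, gets an information gain of zero and is the first to be filtered out.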

Learning from labeled and unlabeled data

− New research area in Automatic Text Processing:
  − Building a large labeled dataset is usually time consuming and very expensive
− Learning from labeled and unlabeled examples:
  − Use a small initial labeled dataset
  − Extract information from a large unlabeled dataset
− The idea is:
  − Use the labeled data to initialize a labeling process on the unlabeled data
  − Use the newly labeled data to build the classifier

Learning from labeled and unlabeled data

− EM algorithm
  − E step: the data are labeled using the current parameter configuration
  − M step: the model is updated assuming the labels to be correct
− The model is initialized using the small labeled dataset
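The E and M steps can be sketched as a simple loop around a Naive Bayes classifier. This is an illustrative sketch, not the code used in the experiments: it assumes hard label assignments in the E step, refits on labeled plus newly labeled data in the M step, and stops when the labels no longer change. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def fit_nb(docs, labels, vocab):
    """Multinomial Naive Bayes with add-one smoothing: class -> (log prior, log P(w | class))."""
    model = {}
    for c in sorted(set(labels)):
        counts = Counter(w for d, y in zip(docs, labels) if y == c for w in d if w in vocab)
        total = sum(counts.values())
        cond = {w: math.log((counts[w] + 1) / (total + len(vocab))) for w in vocab}
        model[c] = (math.log(labels.count(c) / len(labels)), cond)
    return model


def predict_nb(model, doc):
    """Most probable class for a tokenized document; unknown words are ignored."""
    scores = {c: p + sum(cond[w] for w in doc if w in cond) for c, (p, cond) in model.items()}
    return max(scores, key=scores.get)


def em_labeled_unlabeled(labeled_docs, labels, unlabeled_docs, iterations=10):
    """Initialize on the small labeled set, then alternate E and M steps."""
    vocab = {w for d in labeled_docs + unlabeled_docs for w in d}
    model = fit_nb(labeled_docs, labels, vocab)                 # initialization
    guesses = None
    for _ in range(iterations):
        new_guesses = [predict_nb(model, d) for d in unlabeled_docs]   # E step
        if new_guesses == guesses:
            break                                               # labels stopped changing
        guesses = new_guesses
        # M step: refit assuming the current labels are correct
        model = fit_nb(labeled_docs + unlabeled_docs, labels + guesses, vocab)
    return model, guesses


# Toy data, invented for this example.
labeled = [["engine", "wheel"], ["goal", "match"]]
unlabeled = [["engine", "brake"], ["brake", "wheel", "brake"],
             ["match", "referee"], ["referee", "goal"]]
model, guesses = em_labeled_unlabeled(labeled, ["auto", "sport"], unlabeled)
print(guesses, predict_nb(model, ["brake"]))
```

The word "brake" never occurs in the labeled data, yet after the EM iterations the model associates it with the "auto" class because it co-occurs with "engine" and "wheel" in the unlabeled documents.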

Cross-Lingual Text Categorization

− The problem can be stated as:
  − We have a small labeled dataset in language L1
  − We want to categorize a large unlabeled dataset in language L2
  − We do not want to use experts for language L2
− The idea is:
  − We can translate the training set into language L2
  − We can initialize an EM algorithm with these very noisy data
  − We can reinforce the behavior of the classifier using the unlabeled data in language L2

Notation

− With L1, L2 and L1→2 we indicate languages 1 and 2, and L1 translated into L2
− We use these subscripts for the training set Tr, the test set Ts and the classifier C:
  − C1→2 indicates the classifier trained with Tr1→2, that is, the training set Tr1 translated into language L2

The basic algorithm

[Figure: flow diagram of the basic algorithm. Blocks: Tr1, Translation 1→2, Tr1→2, start, E step, M step, E(t), Ts2, C2→1, results]

The basic algorithm

− Once the classifier is trained, it can be used to label a larger dataset
− This algorithm can start from a small initial dataset, which is an advantage since our initial dataset is very noisy
− Problems:
  − Data
  − Translation
  − Algorithm

Data

− Temporal dependency:
  − Documents about the same topic written at different times deal with different themes
− Geographical dependency:
  − Documents about the same topic written in different places deal with different people, facts, etc.
− Find the discriminative terms for each topic that are independent of time and place

Translation

− The translator performs very poorly, especially when the text is badly written:
  − Named Entity Recognition (NER):
    − words that should not be translated
    − different words referring to the same entity
  − Word-sense disambiguation:
    − It is a fundamental problem in translation

Algorithm

− The EM algorithm has some important limitations:
  − The trivial solution is a good solution:
    − all documents in a single cluster
    − all the other clusters empty
  − It usually tends to form a few large central clusters and many small peripheral clusters:
    − This depends on the starting point and on the noise in the data added to each cluster at each EM step

Improved algorithm by using IG

[Figure: flow diagram of the improved algorithm. Blocks: Tr1→2, IG k1, start, EM iterations (E step, M step, IG k2), E(t), Ts2, C2→1, results]

The filter k1

− Highly selective, since the data consist of translated text and are very noisy
− It initializes the EM process by selecting the most informative words in the data

[Figure: detail of the flow diagram. Blocks: Tr1→2, IG k1, Ts2, results]

The filter k2

− It has a regularizing effect on the EM algorithm:
  − It selects the most discriminative words at each EM iteration
  − Non-significant words do not influence the update of the centroid in the EM iterations
− The parameter should be higher than the previous one (k1):
  − It works on the original data

[Figure: detail of the flow diagram. Blocks: Ts2, C2→1, E(t), E step, M step, IG k2, results]
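The interplay of the two filters with the EM loop can be sketched as follows. This is a simplified illustration and not the implementation behind the reported results: it assumes a multinomial Naive Bayes classifier with add-one smoothing, hard label assignments, information gain computed from word presence, and an M step that refits on the target-language documents only. The function names and the toy Italian tokens are invented for the example.

```python
import math
from collections import Counter


def fit_nb(docs, labels, vocab):
    """Multinomial Naive Bayes restricted to `vocab`, with add-one smoothing."""
    model = {}
    for c in sorted(set(labels)):
        counts = Counter(w for d, y in zip(docs, labels) if y == c for w in d if w in vocab)
        total = sum(counts.values())
        cond = {w: math.log((counts[w] + 1) / (total + len(vocab))) for w in vocab}
        model[c] = (math.log(labels.count(c) / len(labels)), cond)
    return model


def predict_nb(model, doc):
    scores = {c: p + sum(cond[w] for w in doc if w in cond) for c, (p, cond) in model.items()}
    return max(scores, key=scores.get)


def info_gain(docs, labels, word):
    """IG(w): sum over classes of the presence/absence information gain of `word`."""
    n, ig = len(docs), 0.0
    present = [word in d for d in docs]
    for c in set(labels):
        member = [y == c for y in labels]
        for w_val in (True, False):
            for c_val in (True, False):
                joint = sum(p == w_val and m == c_val for p, m in zip(present, member)) / n
                p_w = sum(p == w_val for p in present) / n
                p_c = sum(m == c_val for m in member) / n
                if joint > 0:
                    ig += joint * math.log2(joint / (p_w * p_c))
    return ig


def top_k(docs, labels, k):
    vocab = {w for d in docs for w in d}
    return set(sorted(vocab, key=lambda w: (-info_gain(docs, labels, w), w))[:k])


def filtered_em(translated_train, train_labels, target_docs, k1, k2, iterations=10):
    """Start from the translated training set filtered by k1, then run EM on the
    target-language documents, re-selecting the k2 best terms at every iteration."""
    vocab = top_k(translated_train, train_labels, k1)        # IG filter k1
    model = fit_nb(translated_train, train_labels, vocab)    # start
    guesses = None
    for _ in range(iterations):
        new_guesses = [predict_nb(model, d) for d in target_docs]   # E step
        if new_guesses == guesses:
            break
        guesses = new_guesses
        vocab = top_k(target_docs, guesses, k2)               # IG filter k2
        model = fit_nb(target_docs, guesses, vocab)           # M step
    return model, guesses


# Toy data, invented for this example ("il" plays the role of a non-discriminative word).
train = [["motore", "ruota", "il"], ["partita", "gol", "il"]]
target = [["motore", "freno", "il"], ["freno", "ruota", "il"],
          ["gol", "arbitro", "il"], ["arbitro", "partita", "il"]]
model, guesses = filtered_em(train, ["auto", "sport"], target, k1=4, k2=2)
print(guesses)
```

In this toy run the k1 filter drops "il" from the translated training set, and the k2 filter then keeps "freno" and "arbitro", two words that never occurred in the translated data but are the most discriminative ones in the target-language documents.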

Previous works

− Nuria et al. used the ILO corpus and two languages (E, S) to test three different approaches to CLTC:
  − Poly-lingual
  − Test set translation
  − Profile-based translation
− They used the Winnow (ANN) and Rocchio algorithms
− They compared the results with the monolingual test
− Low performance: 70%-75%

Multi-lingual Dataset

− Very few multi-lingual data sets are available:
  − None of them includes the Italian language
− We built the data set by crawling the newsgroups
− Newsgroups:
  − Availability of the same groups in different languages
  − Large number of available messages
  − Different levels of each topic

Multi-lingual Dataset

− Multi-lingual dataset composition
  − Two languages: Italian (LI) and English (LE)
  − Three groups: auto, hardware and sport

| Group  | TRAIN: TrE | TRAIN: TrI | TEST: TsI |
|--------|------------|------------|-----------|
| Auto   | 1,000      | 1,000      | 6,988     |
| Hw     | 1,000      | 1,000      | 6,991     |
| Sports | 1,000      | 1,000      | 6,984     |
| Total  | 3,000      | 3,000      | 20,963    |

Multi-lingual Dataset

− Drawbacks:
  − Short messages
  − Informal documents:
    − Slang terms
    − Badly written words
  − Often transversal topics:
    − advertising, spam, other current topics (elections)
  − Temporal dependency: the same topic at two different moments deals with different problems
  − Geographical dependency: the same topic in two different places deals with different people, facts, etc.

Monolingual test

− No translation
− Training set and test set in the Italian language
− Results are averaged over a ten-fold cross-validation

[Figure: flow diagram. Blocks: TrI, CI, TsI, results]

| Class  | TsI test set | Precision     | Recall        |
|--------|--------------|---------------|---------------|
| Auto   | 6,988        | 93.76 ± 1.09% | 94.01 ± 1.03% |
| Hw     | 6,991        | 93.01 ± 0.45% | 96.21 ± 0.93% |
| Sports | 6,984        | 96.74 ± 1.24% | 92.89 ± 1.12% |
| Total  | 20,963       | 94.43 ± 0.90% | 94.43 ± 0.90% |
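The result tables report precision and recall for each class. As a reminder of what the two measures count, here is a minimal sketch of the standard per-class definitions; the function name and the toy labels are illustrative and the sketch is not the evaluation code used for the experiments.

```python
def precision_recall(true_labels, predicted_labels, cls):
    """Per-class precision and recall from two parallel lists of labels."""
    # documents of class `cls` that were also predicted as `cls`
    tp = sum(t == cls and p == cls for t, p in zip(true_labels, predicted_labels))
    predicted = sum(p == cls for p in predicted_labels)   # everything assigned to cls
    actual = sum(t == cls for t in true_labels)           # everything that belongs to cls
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall


# Toy labels, invented for this example.
true = ["auto", "auto", "hw", "sport"]
pred = ["auto", "hw", "hw", "sport"]
print(precision_recall(true, pred, "hw"))
```

A class that absorbs documents from other classes shows high recall and low precision, while a class that loses most of its documents shows the opposite pattern.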

Baseline multilingual test

− Translation from English to Italian

[Figure: flow diagram. Blocks: TrE, Translation E→I, TrE→I, CE→I, TsI, results]

| Class  | TsI test set | Precision     | Recall        |
|--------|--------------|---------------|---------------|
| Auto   | 6,988        | 66.56 ± 4.76% | 69.56 ± 5.34% |
| Hw     | 6,991        | 63.35 ± 3.72% | 87.24 ± 2.02% |
| Sports | 6,984        | 88.22 ± 4.36% | 50.95 ± 6.28% |
| Total  | 20,963       | 69.26 ± 4.22% | 69.26 ± 4.22% |

Simple EM Algorithm

− Translation from English to Italian
− Results are averaged over a ten-fold cross-validation

[Figure: flow diagram. Blocks: TrE, Translation E→I, TrE→I, start, EM iterations (E step, M step), E(t), CE→I, TsI, results]

| Class  | TsI test set | Precision     | Recall        |
|--------|--------------|---------------|---------------|
| Auto   | 6,988        | 51.40 ± 1.00% | 71.32 ± 1.05% |
| Hw     | 6,991        | 61.55 ± 0.98% | 98.04 ± 1.01% |
| Sports | 6,984        | 65.41 ± 0.05% | 0.73 ± 0.41%  |
| Total  | 20,963       | 56.32 ± 1.10% | 56.32 ± 1.10% |

Filtered EM algorithm

− Translation from English to Italian
− k1 = 300
− k2 = 1000

[Figure: flow diagram. Blocks: TrE→I, IG k1, start, EM iterations (E step, M step, IG k2), E(t), CE→I, TsI, results]

| Class  | TsI test set | Precision     | Recall        |
|--------|--------------|---------------|---------------|
| Auto   | 6,988        | 87.07 ± 1.02% | 92.59 ± 1.05% |
| Hw     | 6,991        | 92.78 ± 0.88% | 87.88 ± 0.98% |
| Sports | 6,984        | 92.28 ± 0.90% | 91.01 ± 1.03% |

Conclusions

− The filtered EM algorithm performs better than the other algorithms existing in the literature
− It does not need an initial labeled dataset in the desired language:
  − No other algorithm with this feature has been proposed
− It achieves good results starting from few translated documents:
  − It does not require much time for translation
