• No results found

Big Data Adaptation DatAptor Dr Khalil Sima'an

N/A
N/A
Protected

Academic year: 2021

Share "Big Data Adaptation DatAptor Dr Khalil Sima'an"

Copied!
25
0
0

Loading.... (view fulltext now)

Full text

(1)

Big Data Adaptation

Big Data Adaptation

DatAptor

DatAptor

Dr Khalil Sima

Dr Khalil Sima

'

'

an

an

Institute for Logic, Language and Computation

Institute for Logic, Language and Computation

University of Amsterdam

University of Amsterdam

The Netherlands

(2)

MT at ILLC-UvA

MT at ILLC-UvA

Statistical

Language Processing and Learning Lab.

Language Processing and Learning Lab.

MT at ILLC-UvA

MT at ILLC-UvA

Statistical

Language Processing and Learning Lab.

Language Processing and Learning Lab.

Linguistic structure Statistical LearningStatistical Learning Language Technology

Language Technology

Machine Translation

(3)

MT at ILLC-UvA

MT at ILLC-UvA

Statistical Language Processing and Learning Lab.

Statistical Language Processing and Learning Lab.

MT at ILLC-UvA

MT at ILLC-UvA

Statistical Language Processing and Learning Lab.

Statistical Language Processing and Learning Lab.

● SLPL Lab (Growing: 8 PhD students; 4 postdocs; Programmer)SLPL Lab ● Five projects on Statistical MT (2012—2019)Five projects on Statistical MT

This talk: Big Data and DatAptor (Feb 2013 - Feb 2017)

This talk: Big Data and DatAptor (Feb 2013 - Feb 2017)

Main topics within SLPL

Main topics within SLPL

● Syntax-driven SMT (learning, decoding) ● Learning latent reordering for translation ● Hierarchical models with morph. And syntax ● Data-powered Adaptation ● ... Statistical Learning Linguistics Machine Translation Machine Translation

(4)

BIG DATA

BIG DATA

What Comes to Mind?

(5)

BIG DATA

BIG DATA

BIG DATA

BIG DATA

Data Data Data Data ...

Data Data Data Data ...

(R

(6)

BIG DATA

BIG DATA

BIG DATA

BIG DATA

Everyone wants big data

Everyone wants big data

(Does anyone know what for?)

(7)

BIG DATA

BIG DATA

What comes to mind?

What comes to mind?

BIG DATA

BIG DATA

What comes to mind?

What comes to mind?

Efficient computing

Big storage + Fast search

Diversity

Q

uality differences

Noise; Difficult statistics ...

Saturated statistics

Just count and divide

Simple models are enough

[cf. The Unreasonable Effectiveness of Data. Halevy, Norvig and Pereira 2009.]

(8)

BIG

BIG

DATA

DATA

What comes to mind?

What comes to mind?

BIG

BIG

DATA

DATA

What comes to mind?

What comes to mind?

Efficient computing

Storage + search.

Diversity

Q

uality differences.

Noise, difficult statistics ...

Saturated statistics

Just count and divide

Simple models are enough.

[cf. The Unreasonable Effectiveness of Data. Halevy, Norvig and Pereira 2009.]

Diversity also offers advantages

Diversity also offers advantages

Language use (different domains)

(9)

BIG DATA

BIG DATA

What comes to mind?

What comes to mind?

BIG DATA

BIG DATA

What comes to mind?

What comes to mind?

Efficient computers

Storage + search.

Diversity

Q

uality differences.

Noise, difficult statistics ...

Saturated statistics

Just count and divide

Simple models are enough.

[cf. The Unreasonable Effectiveness of Data. Halevy, Norvig and Pereira 2009.]

Diversity also offers advantages

Diversity also offers advantages

Language use (different domains)

Translators practice and guidelines

DatAptor Project

(10)

DatAptor Project 2013-2017

DatAptor Project 2013-2017

DatAptor Project 2013-2017

DatAptor Project 2013-2017

● TAUSTAUS ● EC DGT EC DGT ● Intel Inc. Intel Inc. ● Symantec Symantec

Researchers

Researchers

(SLPL, ILLC, UvA)

(SLPL, ILLC, UvA)

Dr Khalil Sima'an (principal investigator)

Dr Christof Monz (senior researcher)

Dr Bart Mellebeek (postdoc: Jan 2013)

Milos Stanojevic (PhD student: March 2013)

Vacancies: postdoc + programmer

Data-Powered

Domain-Specific

Translation Services

on Demand

Partners (User Board)

(11)

DatAptor

DatAptor

DatAptor

Motivation

Motivation

DatAptor

Motivation

Motivation

Versatile adaptation needed

Potential demand vs. current demand

Continuously increasing text volumes

Large variability in kinds of texts (domains)

Changing translation market

Changing domains, e.g. shifting international

trade/cultural/... exchange

Changing acceptance for automatic translation

(12)

So many domains... so little time...

So many domains... so little time...

So many domains... so little time...

So many domains... so little time...

Versatile

Versatile

Build an

MT System

For every

domain of language use

Automatically and rapidly

On demand: user specs.

sports; news; politics; financial; banking;

automotive; drugs; food; appliance manuals; hardware/software manuals;

scientific articles; ....

A population of MT Systems!

Minimal human intervention

User supplied example texts to be translated

(13)

Current Practice

Current Practice

Tiny Data Adaptation

Tiny Data Adaptation

Tiny Data Adaptation

Current Practice

Current Practice

Tiny Data Adaptation

(14)

Current Practice

Current Practice

Tiny Data Adaptation

Tiny Data Adaptation

Current Practice

Current Practice

Tiny Data Adaptation

Tiny Data Adaptation

DATA

out (News)

DATA

out (News)

MT

Out

MT

Out

DATA

in

(15)

Current Practice

Current Practice

Tiny Data Adaptation

Tiny Data Adaptation

Current Practice

Current Practice

Tiny Data Adaptation

Tiny Data Adaptation

Adapt Adapt Adapt Adapt

D

in

D

in

MT

Out

MT

Out

A_MT

In

A_MT

In

DATA

in

DATA

in

DATA

out (News)

DATA

out (News)

(16)

Task

Build MT system from tiny in-domain data

using whatever out-of-domain data exists

Theoretically Challenging, interesting, very difficult

Practice

Assumptions maybe too strong

Alternative Scenario

BIG DATA

BIG DATA

Adaptation

Adaptation

Tiny Data Adaptation

Tiny Data Adaptation

Current Practice

Current Practice

Tiny Data Adaptation

Tiny Data Adaptation

Current Practice

(17)

Big Data == Diversity

Big Data == Diversity

Big Data == Diversity

Big Data == Diversity

News Politics Food Drugs Financial Banking Sports Hardware Software Appliance

(18)

DatAptor Hypothesis

DatAptor Hypothesis

Big Data

Big Data

DatAptor Hypothesis

DatAptor Hypothesis

Big Data

Big Data

Metaphore

Imagine a world of translators

A translator: background + experience

A translator for every situation

For every new translation order

(19)

DatAptor Hypothesis

DatAptor Hypothesis

Big Data ==

Big Data ==

Diversity

Diversity

DatAptor Hypothesis

DatAptor Hypothesis

Big Data ==

Big Data ==

Diversity

Diversity

Metaphore

Imagine a world of translators

● A translator for every situation

● A translator with own background and experience

For every new translation order

Find the best suitable translator

If (Big Data == ``A World of Domains”)

Diversity enables rapid adaptation to new domain

``Find the most suitable MT system in the Data”

If (Big Data == ``A World of Domains”)

Diversity enables rapid adaptation to new domain

(20)

DatAptor Challenges I

DatAptor Challenges I

DatAptor Challenges I

DatAptor Challenges I

Map of BIG DATA

Map of BIG DATA

Efficiency for distilling suitable training data

Map: the more related, the closer to each other on the map

How to measure domain similarity?

Statistical (hierarchical-)topic similarity; translation-equivalence and

instance weighting …

INPUT: User documents from some domain + BIG DATA

INPUT: User documents from some domain + BIG DATA

OUTPUT: SMT system adapted for domain

OUTPUT: SMT system adapted for domain

Distill from BIG DATA a suitable training data

Weigh some documents as more relevant than others

(21)

BIG DATA

BIG DATA

vs.

vs.

BIG

(22)

BIG

BIG

BIG

T-

T-

DATA

DATA

BIG

T-

T-

DATA

DATA

D Doommaiainn DDiviveersrsitityy Semantic Similarity(?) Semantic Similarity(?) Topic Similarity(?) Topic Similarity(?)

BIG DATA

BIG DATA

Meaning Equations Meaning Equations Translation Equations Translation Equations Semantic Equation Semantic Equation

Translation Data

Translation Data

+

+

VII = (V+II) = (5+2) = 7 VII = (V+II) = (5+2) = 7 VII = 7 VII = 7 V+II = VII V+II = VII 5+2 = 75+2 = 7

(23)

BIG

BIG

BIG

T-

T-

DATA

DATA

BIG

T-

T-

DATA

DATA

D Doommaiainn DDiviveersrsitityy Semantic Similarity(?) Semantic Similarity(?) Topic Similarity(?) Topic Similarity(?)

BIG DATA

BIG DATA

+

+

More … Many More

More … Many More

Accurate Accurate Topic/Semantic Topic/Semantic VII = 7 VII = 7 Meaning Equations Meaning Equations Translation Equations Translation Equations Semantic Equation Semantic Equation

Translation Data

Translation Data

VII = (V+II) = (5+2) = 7 VII = (V+II) = (5+2) = 7 VII = 7 VII = 7 V+II = VII V+II = VII 5+2 = 75+2 = 7

(24)

DatAptor Challenges II

DatAptor Challenges II

DatAptor Challenges II

DatAptor Challenges II

User-Driven Data-Powered Adaptation

Recursive, Hierarchical Meaning Equations

Structures translation equations; Better reordering

(25)

To conclude

To conclude

Big Trans. Data

Big Trans. Data

Enables Data-Powered Adaptation (DatAptation)

Statistics over ``Meaning Equations”

More than MT? Language understanding!

Thank you!

Thank you!

[email protected]

References

Related documents