Language Technology based on Big Data: Current Situation and Future Perspectives

(1)

Timo Honkela

30 October 2014

Department of Modern Languages

Centre for Preservation and Digitisation

(2)

Introductory remarks

(3)

HELSINKI

MIKKELI

Department of Modern Languages

Language

(4)

Digital humanities

Research within humanities with the help of computers

Digital resources

Computational models

Basic motivation

One can already fly to the moon and build sophisticated industrial products

The most important open questions in the world are related to humanities and social sciences

(5)

Changing role of computers

Machines are increasingly capable of performing pattern recognition and learning.

Traditionally, ICT systems were programmed to perform their operations in a manner that made them predictable.

Today's systems do not repeat their actions in the same manner over and over; they evolve and can take contextual factors into account better than before.

(6)

Early personal experiences with rule-based natural language processing

H. Jäppinen, T. Honkela, H. Hyötyniemi & A. Lehtola (1988): A Multilevel Natural Language Processing Model. Nordic Journal of Linguistics 11:69-87.

What is the turnover of the ten largest stock exchange companies in forestry?

Morphological analysis

Dependency parsing

Logical analysis

(7)

DIGITAL RESOURCES

Texts, images, videos, speeches/conversations, numerical data, multimedia documents, interactive systems, computer software, computational models

(8)

Complexity of language as an object of study and as a means of representation and communication

(9)

> 6000 languages,

many more dialects

Billions of people

A large number of different cultures

A vast number of ways to relate language, concepts and

Image sources: blogs.state.gov, en.wikipedia.org

(10)
(11)

Challenge:

A tension between the usability and standardization of content descriptions

and

the richness and evolution of language and its interpretation, genre and style variation, and contextuality, subjectivity and

(12)

red wine

red skin

red shirt

(13)

Color naming

(14)

Richness and contextuality of interpretation

Shall I Compare Thee To A Summer's Day

A small elephant versus a big mouse

A beautiful scenery, painting or composition

Democracy, equality, sustainability,

(15)

Present and emerging methodological possibilities

(16)

Opportunities:

(17)

Classical example: Learning meaning from context

Maps of words in Grimm fairy tales (Honkela, Pulkki & Kohonen 1995)

Figure label: Automated learning of word relations using self-organizing map on text context data
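The idea of learning word relations from context data with a self-organizing map can be sketched as follows. This is a minimal illustration, not the implementation of the cited study: the toy sentences, the one-word context window, the 3x3 grid, and all function names are assumptions made for the example. It uses the standard Kohonen update with a Gaussian neighbourhood.

```python
import math
import random


def context_vectors(sentences, window=1):
    """Represent each word by counts of the words seen within `window` positions."""
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = {w: [0.0] * len(vocab) for w in vocab}
    for s in sentences:
        for i, w in enumerate(s):
            for j in range(max(0, i - window), min(len(s), i + window + 1)):
                if j != i:
                    vectors[w][index[s[j]]] += 1.0
    # Normalise to unit length so that frequent words do not dominate.
    for w, v in vectors.items():
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        vectors[w] = [x / norm for x in v]
    return vectors


def best_unit(weights, x):
    """Return the grid position whose weight vector is closest to x."""
    return min(weights, key=lambda u: math.dist(weights[u], x))


def train_som(data, rows=3, cols=3, epochs=200, seed=0):
    """Train a small self-organizing map with a Gaussian neighbourhood."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for t in range(epochs):
        frac = t / epochs
        lr = 0.5 * (1.0 - frac)                                 # decaying learning rate
        radius = 1.0 + (max(rows, cols) / 2.0) * (1.0 - frac)   # shrinking neighbourhood
        for x in rng.sample(data, len(data)):
            bmu = best_unit(weights, x)
            for unit, w in weights.items():
                grid_d2 = (unit[0] - bmu[0]) ** 2 + (unit[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * radius ** 2))
                for k in range(dim):
                    w[k] += lr * h * (x[k] - w[k])
    return weights


# Toy corpus (invented for illustration, not taken from the Grimm tales).
sentences = [
    ["the", "king", "sleeps"],
    ["the", "queen", "sleeps"],
    ["a", "wolf", "runs"],
    ["a", "fox", "runs"],
]
vectors = context_vectors(sentences)
som = train_som(list(vectors.values()))
placement = {w: best_unit(som, v) for w, v in vectors.items()}
```

Words that occur in the same contexts ("king" and "queen" here) have identical context vectors and therefore land on the same map unit, which is the effect the word maps visualize at a larger scale.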

(18)

Independent Component Analysis of wellbeing-related words in Reddit texts

(19)

Opportunities:

Analysis and visualization of text corpora

(20)

We are facing a new situation

Systems can simulate or imitate human interpretation to some extent

Systems are actually becoming increasingly “epistemologically autonomous”

Software used in an analysis not only contains prebuilt assumptions but also evolves over time based on the data it has “read” or “seen”

(21)

Map of Finnish Science

Regions labelled on the map: Chemistry; Natural sciences and engineering; Bio- and environmental sciences; Health; Culture and society

(22)

Opportunities:

(23)

Acknowledgements:

Finnish Broadcasting Company (YLE)

An example of automatic multimedia content analysis

users.ics.aalto.fi/jorma/

scholar.google.com/citations?user=suHzeyIAAAAJ&hl=en

users.ics.aalto.fi/mikkok/

elec.aalto.fi/en/about/careers/professors/mikko_kurimo/

Jorma Laaksonen

Mikko Kurimo

(24)

Speaker recognition

Video analysis / scene classification

(25)

Video analysis / scene classification

Speaker recognition

Speech recognition (speech to text)

OCR

(26)

Opportunities:

Analysis of

(27)

Labeling movements

(28)

RUNNING

WALKING

(29)

Opportunities:

Modeling subjectivity

(30)

GICA: Grounded Intersubjective Concept Analysis

Honkela, Raitio, Lagus & Nieminen 2012

(31)

Analysis of “health” in the State of the Union addresses

Subjects on objects in contexts: Using GICA method to quantify epistemological subjectivity. Timo Honkela, Juha Raitio, Krista Lagus, Ilari T. Nieminen, Nina Honkela, and Mika Pantzar.

(32)

Distant .. close reading

We will have more and more methods that let machines help in conducting close reading

(33)

Opportunities:

(34)

Google

Speech-to-speech

(35)

Consider how different languages divide the conceptual space in different ways

(36)

Opportunities:

Analysis of human interpretation in the description of data

(37)

Analyzing Emotional Semantics of Abstract Art Using Low-Level Image Features. He Zhang, Eimontas Augilius, Timo Honkela, Jorma Laaksonen, Hannes Gamper and Henok Alene. Proceedings of IDA 2011.

(38)

Opportunities:

Using text mining to

(39)

Text Mining for Qualitative Research

Nina Janasik, Timo Honkela, and Henrik Bruun. Text mining in qualitative research: Application of an unsupervised learning method. Organizational Research Methods, 12(3):436–460, 2009.

(40)

(41)

Opportunities:

(42)

Honkela, Korhonen, Lagus & Saarinen: Five-dimensional sentiment analysis of corpora, documents and words, WSOM 2014

P: Positive

E: Engagement

R: Relationships

M: Meaning

A: Achievement
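One simple way to obtain a five-dimensional profile of a text along these dimensions is to count matches against one word list per dimension. The sketch below is only an illustration of that idea: the mini-lexicon is an invented placeholder, and the counting scheme is an assumption, not the method of the cited paper.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon, one word set per dimension (placeholders only).
LEXICON = {
    "P": {"happy", "joy", "glad"},
    "E": {"absorbed", "focused", "curious"},
    "R": {"friend", "family", "together"},
    "M": {"purpose", "meaning", "value"},
    "A": {"win", "goal", "success"},
}


def perma_profile(text):
    """Return the share of tokens that match each of the five dimensions."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for tok in tokens:
        for dim, words in LEXICON.items():
            if tok in words:
                counts[dim] += 1
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {dim: counts[dim] / total for dim in LEXICON}


profile = perma_profile("Happy family with a goal")
```

The same function can be applied to a single word, a document, or a concatenated corpus, which mirrors the three levels named in the title of the cited work.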

(43)

Opportunities:

Interoperability without standardization?!

(44)

Emergence of a coherent lexicon in a community of interacting SOM-based agents (Lindh-Knuutila, Lagus & Honkela, SAB'06)

(45)

Concept Formation and Communication - General Theory

Timo Honkela, Ville Könönen, Tiina Lindh-Knuutila, and Mari-Sanna Paukkeri. Simulating processes of concept

$C_i$: $N$-dimensional metric concept space of agent $i$

$S_i$: symbol space, the vocabulary of agent $i$, which consists of discrete symbols

$\lambda: C_i \times C_j \to \mathbb{R},\ i \neq j$: distance between two points in the concept spaces of different agents

$\xi_i: S_i \to C_i$: an individual mapping function from symbols to concepts

$\varphi_i: S_i \to D$: an individual mapping from agent $i$'s vocabulary to the signal space $D$, and an inverse mapping $\varphi_i^{-1}$ from the signal space to the symbol space

Observing $f_1$, and after a symbol selection process, agent 1 communicates a symbol $s^*$ to agent 2 as signal $d$. When agent 2 observes $d$, it maps it to some $s_2 \in S_2$ by using the function $\varphi_2^{-1}$. Then it maps the symbol to some point in its concept space by using $\xi_2$. If this point is close to its observation $f_2$ in the sense of $\lambda$, the communication process has succeeded.
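A single communication round of this kind can be sketched in a few lines. The sketch makes several assumptions that are not in the slides: both concept spaces have the same dimension, the distance λ is Euclidean, symbol selection picks the nearest concept, success means falling within a fixed threshold, and the receiver decodes the signal with its own inverse mapping. All names and numbers are illustrative.

```python
import math


class Agent:
    """An agent with a concept mapping (xi) and a signal mapping (phi)."""

    def __init__(self, concepts, signals):
        self.xi = dict(concepts)   # symbol -> point in the agent's concept space
        self.phi = dict(signals)   # symbol -> signal in the shared signal space D
        self.phi_inv = {d: s for s, d in self.phi.items()}  # signal -> symbol

    def select_symbol(self, observation):
        # Assumed selection rule: the symbol whose concept is nearest to the observation.
        return min(self.xi, key=lambda s: math.dist(self.xi[s], observation))


def communicate(sender, receiver, f_sender, f_receiver, threshold):
    """Return True if the receiver's decoded concept is close to its own observation."""
    s_star = sender.select_symbol(f_sender)   # symbol selection by the sender
    d = sender.phi[s_star]                    # symbol -> signal
    s_recv = receiver.phi_inv.get(d)          # signal -> receiver's symbol
    if s_recv is None:
        return False                          # signal unknown to the receiver
    point = receiver.xi[s_recv]               # symbol -> receiver's concept space
    return math.dist(point, f_receiver) <= threshold  # Euclidean stand-in for lambda


agent1 = Agent({"red": (1.0, 0.0), "blue": (0.0, 1.0)}, {"red": "R", "blue": "B"})
agent2 = Agent({"rouge": (0.9, 0.1), "bleu": (0.1, 0.9)}, {"rouge": "R", "bleu": "B"})

ok = communicate(agent1, agent2, (0.95, 0.05), (0.9, 0.0), threshold=0.2)
failed = communicate(agent1, agent2, (0.95, 0.05), (0.0, 1.0), threshold=0.2)
```

In the first call both agents observe something near the "red" region, so the round succeeds; in the second the receiver's observation lies in a different region of its concept space, so the round fails.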

(46)
(47)

DIGITAL RESOURCES

Archives, Libraries, Universities, Citizens, Researchers, Media, Museums, Teachers, Artists, Companies, Societies, Municipalities, State, Decision makers, Journalists, Information specialists

(48)

DIGITAL RESOURCES

Texts, images, videos, speeches/conversations, numerical data, multimedia documents, interactive systems, computer software, computational models

(49)

Resources

Content and information professionals

Users of the contents (professionals and lay people)

Machine learning and pattern recognition systems

Formal metadata

Language technology resources and systems

(50)
