
Academic year: 2021


Natural Language Processing: current projects and research at the IXA Group

IXA Research Group on NLP
University of the Basque Country

(2)

Motivation

• A language that seeks to survive in the modern information society requires language technology products.
• Most of the working applications are only available for the "big" languages.
• "Minority" languages have to make a great

(3)

How to face the challenge?

• Open proposal for the development of language technology.
• Steps to take: from the necessary infrastructure to useful language engineering (LE) applications.
• Based on the IXA Research Group's twelve years of experience in natural language processing applied to Basque.

(4)

IXA Research Group on NLP (UPV/EHU) (I)

• Main research fields: NLP, computational linguistics, language engineering.
• Goal: to collaborate on
    laying foundations for research;
    the development of language processing software.

(5)

FOR MORE INFO ...

http://ixa.si.ehu.es

IXA Research Group on NLP (UPV/EHU) (II)

• 1986/1987: 4-5 university lecturers (CS)
• 2000/2001: ~30 members
    13 lecturers (11 doctorates, senior researchers)
    13 PhD students (research grants)
    A few research assistants assigned to projects
• Interdisciplinary team: computer scientists & linguists

(6)

IXA Research Group on NLP (UPV/EHU) (III)

• Relationships with other universities in Euskal Herria; Madrid; Toulouse; Barcelona; Maryland, Las Cruces (USA); Sydney (Australia); Massey (New Zealand); Rome (Italy); Helsinki (Finland); ...
• And companies: Hizkia, Jalgi, Egunkaria, Microsoft, Xerox, LingSoft, LexiQuest, ...
• Funding: local government, University of the Basque

(7)

What is the language industry?

A growing number of people use computer systems in their everyday life. Many of these systems involve the use and processing of language.

• Document writing and correction
• Remote information retrieval
• Consultation of dictionaries and encyclopedias
• Translation of documents
• Electronic messaging

(8)

Some terminology on Natural Language Processing (I)

• NLP deals with the automatic processing of both spoken and written text: communicating with and through computers by means of everyday language.
• Computational linguistics, or computer-oriented linguistics: formalisation of linguistic knowledge for computer processing.

(9)

Some terminology on Natural Language Processing (II)

• Language engineering: production of computer systems which can recognise, understand, interpret and generate human language in all its forms.
    Typical products of LE are language software systems (lingware) such as lemmatisers, phrase recognisers, word sense disambiguation programs, translation aids, etc.
• All this is usually gathered under the heading of

(10)

Underlying philosophy (I)

• Use, share, and reuse:
    theories, formalisms, and methodologies
    techniques and expertise
    technology
• Build our own linguistic resources in order to develop:
    general and specific tools

(11)

Underlying philosophy (II)

As an example:

• Several OCR programs claim to have Basque among the languages they are set up for.
• None of them includes language-specific information (dictionary, bigram or trigram information, etc.).
• In some of them, support for ŕ (r acute) and similar obsolete features is the only reason for that claim!

Result

• These programs do not work as well with Basque texts as they do with other languages.
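To make the point about bigram information concrete, here is a minimal sketch of how character-bigram counts could help an OCR system choose between candidate readings of a word. It is only an illustration under stated assumptions: the tiny word list, the function names (`train_bigrams`, `score`, `pick`) and the smoothing constants are all hypothetical and do not describe any particular OCR product or any IXA tool.

```python
import math
from collections import Counter


def train_bigrams(words):
    """Count character bigrams over a word list, with ^ and $ as boundary markers."""
    counts = Counter()
    for w in words:
        padded = f"^{w.lower()}$"
        counts.update(padded[i:i + 2] for i in range(len(padded) - 1))
    return counts


def score(word, counts, alpha=1.0, vocab_size=1000):
    """Sum of add-alpha smoothed log relative frequencies of the word's bigrams.

    alpha and vocab_size are arbitrary illustrative constants.
    A higher score means the letter sequence looks more like the training words.
    """
    total = sum(counts.values())
    padded = f"^{word.lower()}$"
    return sum(
        math.log((counts[padded[i:i + 2]] + alpha) / (total + alpha * vocab_size))
        for i in range(len(padded) - 1)
    )


def pick(candidates, counts):
    """Return the candidate reading with the highest bigram score."""
    return max(candidates, key=lambda w: score(w, counts))


# Toy Basque word list (assumption: a real system would train on a large corpus).
COUNTS = train_bigrams(["etxe", "etxea", "mendi", "euskara", "txori"])

# An OCR engine unsure between "t" and "l" could rank both readings.
best = pick(["elxea", "etxea"], COUNTS)
```

With this toy data, the reading containing the frequent Basque sequence "tx" outscores the one containing the unseen bigrams "el" and "lx". A system with no language-specific counts has no basis for that preference.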

(12)

Strategic priorities: from basic research to application development

Research & development
End-user applications
Language tools
Basic & applied research

(13)

Linguistic foundations & resources, tools and applications

• Linguistic foundations and resources: necessary infrastructure for the automatic processing of a language.
• Tools: mainly intended for application developers.
• Applications: commercial or non-commercial,

(14)

Phase I: laying foundations

Basic Lexical Database

(15)

Phase II: first basic tools and applications

Lemmatiser/Tagger
Morphological analyser
Statistical tools for the treatment of corpora
Comp. description of morphology
MRD's
Morphologically annotated corpus
Enriched Lexical Database

(16)

Phase III: more advanced tools and applications

Morphological analyser
Lemmatiser/Tagger
Statistical tools for the treatment of corpora
Lexical Database
Environment for linguistic tools integration
Basic CALL
MRD's
Comp. description of morphology
Xuxen: spelling checker/corrector
Comp. grammar
Lexical-
Surface syntax analyser
WSD
Web crawler
Grammar checker
Electronic dictionaries

(17)

Phase IV: multilinguality and general use applications

MRD's
Lexical Database
Comp. description of morphology
Morphological analyser
Lemmatiser/Tagger
Statistical tools for the treatment of corpora
Comp. grammar
Environment for linguistic tools integration
Electronic dictionaries
Web crawler
Grammar checker
Information retrieval and extraction
NL generation, translation aids, dialog systems, ...
Syntax analyser
WSD
Xuxen: spelling checker/corrector
Advanced CALL
Morphol., synt., and semantically annotated multilingual corpus
Multilingual lexical-semantic KB

(18)

What not to do (I)

• Do not start developing applications before linguistic foundations are created.
    ➢ Follow, in general, the sequence stated above: foundations, tools, and applications.
• When a new system must be built, do not create ad hoc linguistic resources.
    ➢ Design these resources to be easily extended for full coverage and make them reusable by any

(19)

What not to do (II): example

• Basque is a language with a very rich morphology.
    ➢ We decided not to begin with advanced applications (machine translation, ...) but rather to develop a broad foundation based on lexicon and morphology.
• Now those foundations have become the base

(20)

Reusability is a must: example (I)

Structured electronic dictionaries
Lemmatiser
Surface syntax parser
Machine-Lexical

(21)

Reusability is a must: example (II)

Translation aids
Morphological analyser
Structured electronic dictionaries
Surface syntax parser
Word-sense disambiguation
Lexical Database
Corpus
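The reuse idea behind these two examples can be sketched in a few lines: one shared lexical resource serves several tools instead of each tool carrying its own ad hoc word list. Everything below is a hypothetical toy (the `Lexicon` class, the two lemmas, the three suffixes and the function names are assumptions made for illustration). It is not the IXA lexical database, and it does not reflect how the group's analyser or Xuxen are actually built. Real Basque morphology is far richer than plain suffix stripping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lexicon:
    """Toy shared lexical database: a set of lemmas plus a few suffixes."""
    lemmas: frozenset
    suffixes: tuple


def analyse(word, lex):
    """Return every (lemma, suffix) split of the word that the lexicon licenses."""
    word = word.lower()
    results = []
    if word in lex.lemmas:
        results.append((word, ""))
    for suffix in lex.suffixes:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and stem in lex.lemmas:
            results.append((stem, suffix))
    return results


def lemmatise(word, lex):
    """Tool 1: a lemmatiser built on the shared analysis."""
    analyses = analyse(word, lex)
    return analyses[0][0] if analyses else None


def is_correct(word, lex):
    """Tool 2: a spelling check that reuses the same lexicon and analysis."""
    return bool(analyse(word, lex))


# Toy data (assumption): two Basque nouns and three common noun endings.
LEX = Lexicon(frozenset({"etxe", "mendi"}), ("ak", "an", "a"))

lemma = lemmatise("etxean", LEX)      # inflected form reduced to its lemma
ok = is_correct("mendiak", LEX)       # accepted: known lemma + known suffix
bad = is_correct("etxex", LEX)        # rejected: no analysis found
```

Because both tools go through `analyse`, extending the lexicon with a new lemma or suffix improves the lemmatiser and the spelling check at the same time.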

(22)

What not to do (III)

• When you complete a new resource or tool, do not keep it to yourself:
    many researchers in the world are working on English, but only a few on each minority language;
    we will not become rich (market criteria do not usually apply) ☺
→ Results should be public and shared for research purposes.

(23)

Conclusions

• Long-term strategy for research and development in language engineering.
• Based on the experience of the IXA Group in the automatic processing of Basque.
• Every foundation, tool, and application developed in the previous phases is of great importance for facing new problems and challenges.
• The development of a sound language industry should be the result of a coordinated effort involving research groups, institutions and industry.
