(1)

Towards automatic terminology extraction for Norwegian based on parallel corpora

Gisle Andersen

LSP Conference, Vienna

8 July 2015

(2)

Background and contents

• NHH is developing a national infrastructure that integrates terminological language resources: Termportalen; WP7 in CLARINO project (NFR)
• Many specialist fields lacking systematic terminology
• Case: Sjøfartsdirektoratet (Norwegian Maritime Authority)

Contents

• Introduction: aim
• Data and methods
• Pattern matching
• Conclusion

(3)

Aim of work

• Purpose: providing aid to field experts where systematic terminology work is lacking
• A generic system meant to enhance terminology for various domains
• Maximising value of existing tools and language resources
• Setting up an infrastructure, a production line for term extraction (TE)
• Using a variety of techniques for parallel corpus-based TE

(4)

The necessary disclaimer

• Items extracted through computational methods are always term candidates.
• Need for subsequent manual check by field experts
• Need to supply additional information about concepts (definitions, structure, term variation); cf. e.g. Heylen & De Hertog (2015)

(5)

DATA AND METHODS


(6)

The corpus (1/2)

Sjøfartsdirektoratet

(NMA)

• Parallel corpus of translated texts (EN-NO)
• Policies and legislation relating to shipping
  - navigation, communication, safety, etc.
• Current version: translated regulations from International Maritime Organization (IMO)
  - small; currently 9 items
• To be extended to include
  - Skipssikkerhetsloven / The Ship Safety and Security Act
  - NMA’s own regulations

(7)

The corpus (2/2)

Title of regulation | Navn på forskrift (Norwegian title) | TCA2 file (EN) | TCA2 file (NO)
Regulations of 1 July 2014 No. 1072 on the construction of ships | Forskrift 01. juli 2014 om bygging av skip | RCS_E | RCS_N
Regulations of 1 July 2014 No. 944 on dangerous goods on Norwegian ships | Forskrift 1. juli 2014 om farlig last på norske skip | RDG_E | RDG_N
Regulations of 1 July 2014 No. 1099 on fire protection on ships | Forskrift 1. juli 2014 om brannsikring på skip | RFP_E | RFP_N
Regulations of 1 July 2014 on life-saving appliances on ships | Forskrift 1. juli 2014 om redningsredskaper på skip | RLS_E | RLS_N
Regulations of 5 June 2014 No. 805 on medical examination of employees on Norwegian ships and mobile offshore units | Forskrift 01.07.2014 nr. xxxx om helseundersøkelse av arbeidstakere på norske skip og flyttbare innretninger | RME_E | RME_N
Regulations of 5 September 2014 No. 1157 on navigation and navigational aids for ships and mobile offshore units | Forskrift om navigasjon og navigasjonshjelpemidler for skip og flyttbare innretninger | RNN_E | RNN_N
Regulations of 1 July 2014 No. 955 concerning radiocommunication equipment for Norwegian ships and mobile offshore units | Forskrift 1. juli 2014 om radiokommunikasjonsutstyr for norske skip og flyttbare innretninger | RRR_E | RRE_N
Regulations of 5 January 2014 No. 1191 on a safety management system for Norwegian ships and mobile offshore units | Forskrift om sikkerhetsstyringssystem for norske skip, og flyttbare innretninger | RSM_E | RSM_N
IMO standard marine communication phrases (SMCPs) | IMOs standarduttrykk for maritim kommunikasjon | |

(8)
(9)

Step 1: Text conversion: doc → html → xml

(10)
(11)

Step 3A: Pattern matching

English:

<s>b) barges;</s>
<s>The spooling device shall:</s>
<s>a) initial certification upon changes in use;</s>
<s>e) handrails, corridors and passageways, doorways, doors, lifts, vehicle decks, passenger lounges, accommodation and washrooms shall be …
Wire/chain stoppers shall be dimensioned for a safe working load …

Norwegian:

<s>d) lektere</s>
<s>Spoleapparatet skal:</s>
<s>a) førstegangssertifisering ved endret bruk</s>
<s>e) Håndlister, korridorer og ganger, døråpninger, dører, heiser, bildekk, passasjersalonger, innredning og toaletter skal være …
En wire- og kjettingstopper skal være dimensjonert for en sikker arbeidsbelastning …

• Premise: recognisable patterns in sentence and paragraph structure, punctuation, etc. suggesting termhood
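The premise can be illustrated with a couple of regular expressions. This is only a minimal sketch and not the project's actual implementation: the two patterns, the function names candidate and candidate_pair, and the four-token limit are assumptions chosen to fit the example sentences on this slide.

```python
import re

# Hypothetical pattern 1: an enumerated list item such as "<s>b) barges;</s>".
LIST_ITEM = re.compile(r"^<s>\s*[a-z]\)\s*(.*?)\s*[;.,:]?\s*</s>$")

# Hypothetical pattern 2: a sentence-initial subject followed by a modal verb
# ("shall" in English, "skal" in Norwegian), with an optional leading article.
SUBJECT_MODAL = re.compile(
    r"^<s>\s*(?:(?:The|An|A|En|Et)\s+)?(.+?)\s+(?:shall|skal)\b"
)


def candidate(sentence, max_tokens=4):
    """Return a term candidate found in one <s>...</s> sentence, or None."""
    for pattern in (LIST_ITEM, SUBJECT_MODAL):
        match = pattern.match(sentence.strip())
        if match:
            text = match.group(1)
            # Long matches are more likely to be clauses than terms.
            if 0 < len(text.split()) <= max_tokens:
                return text
    return None


def candidate_pair(en_sentence, no_sentence):
    """Return an (English, Norwegian) candidate pair if both sides match."""
    en, no = candidate(en_sentence), candidate(no_sentence)
    return (en, no) if en and no else None


print(candidate_pair("<s>The spooling device shall:</s>",
                     "<s>Spoleapparatet skal:</s>"))
```

Requiring a match on both sides of an aligned sentence pair is one possible way of using the parallel corpus to filter out spurious monolingual matches.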

(12)

Step 3B: Check of terminological inventory

• Premise: if word/sequence of words is already registered as term in other component of Termportalen, it has high termhood (it is likely to constitute a term in current context also)
• Question 1: same or different translation relation
• Question 2: same or different domain
• Methodological issue: inflected forms in texts; base form in term base
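The methodological issue of inflected forms can be illustrated as follows. This is a deliberately naive sketch: the suffix list, the function names and the toy term base are all assumptions, it handles single words only, and a real system would more plausibly use a proper lemmatiser than suffix stripping.

```python
# Hypothetical, deliberately naive list of Norwegian inflectional endings.
SUFFIXES = ("ene", "ane", "er", "en", "et", "e", "a")


def base_form_candidates(word):
    """Yield the lower-cased word and versions with one ending removed."""
    word = word.lower()
    yield word
    for suffix in SUFFIXES:
        # Keep at least three characters of stem to avoid absurd reductions.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            yield word[: -len(suffix)]


def lookup(word, term_base):
    """Return the first base-form candidate registered in term_base, or None."""
    for form in base_form_candidates(word):
        if form in term_base:
            return form
    return None


# Toy term base holding base forms only (illustrative entries).
term_base = {"lekter", "spoleapparat"}
print(lookup("lektere", term_base))
print(lookup("Spoleapparatet", term_base))
```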

(13)

Step 3C: Neology detection

• Premise: if word/sequence of words can be shown to be a neologism (domain-specific vocabulary), it has high termhood (is likely to be a term)
• Check against inventory of words in large general language corpus (GLC); Norsk aviskorpus (Norwegian Newspaper Corpus, NNC; cf. Andersen 2012; Andersen & Hofland 2012)
• Check among neologisms registered in NNC’s neology database
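The check against a general language corpus amounts to a vocabulary difference. A minimal sketch, assuming the GLC word inventory is available as a plain collection of word forms (the function name and the toy vocabulary are hypothetical, and nothing here reflects how NNC is actually queried):

```python
def neologism_candidates(domain_tokens, general_vocabulary):
    """Return domain word forms absent from the general-language vocabulary."""
    general = {word.lower() for word in general_vocabulary}
    seen, result = set(), []
    for token in domain_tokens:
        form = token.lower()
        # Skip punctuation and numbers; report each unseen form only once.
        if form.isalpha() and form not in general and form not in seen:
            seen.add(form)
            result.append(form)
    return result


# Toy general-language vocabulary (illustrative only).
general = {"ved", "endret", "bruk", "skal"}
tokens = ["førstegangssertifisering", "ved", "endret", "bruk"]
print(neologism_candidates(tokens, general))
```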

(14)

Step 3D: Monolingual/bilingual lexicon lookup

• Premise: if word/sequence of words is found among the lexical inventory in a mono/bilingual technical or specialised dictionary, it has high termhood
• Agreement with Kunnskapsforlaget to reuse some of their manuscripts

(15)

Step 3E: Association measures (AMs)

• Premise: terms are often constituted as collocations, i.e. words with a strong tendency to co-occur, so strong collocations may be seen as indicators of termhood
• Association measures, statistical measures of unithood/termhood (Heylen & De Hertog 2015)
• Important to select adequate AM for TE, e.g. Pointwise Mutual Information, Chi-square (cf. Lyse & Andersen 2012)
• Collocation patterns should be compared with GLC data (NNC)
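As an illustration of one such measure, Pointwise Mutual Information for a word pair is PMI(x, y) = log2(P(x, y) / (P(x) P(y))). The sketch below estimates it for adjacent word pairs from raw counts. The function name, the min_count parameter and the estimation of probabilities as simple relative frequencies are assumptions made for the example, not a description of the project's setup.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=1):
    """Return {(x, y): PMI} for adjacent word pairs in a list of tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_bigrams = max(n_tokens - 1, 1)
    scores = {}
    for (x, y), count in bigrams.items():
        if count < min_count:
            continue  # PMI is unreliable for very rare pairs
        p_xy = count / n_bigrams
        p_x = unigrams[x] / n_tokens
        p_y = unigrams[y] / n_tokens
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


tokens = "a safe working load and a safe working area".split()
for pair, score in sorted(bigram_pmi(tokens, min_count=2).items()):
    print(pair, round(score, 3))
```

The same scores computed on GLC data would give the baseline against which domain collocations can be compared.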

(16)

Step 3F: Parsing techniques

• Premise: terminological units are typically constituted as (complex) noun phrases; output from syntactic parsing may give good guidance towards terminological units
• Parsers for Norwegian and English: INESS project (UiB; cf. Rosén 2012)

(17)

A CLOSER LOOK AT PATTERN MATCHING


(18)

Term extraction based on pattern matching

EN | NO
final loading conditions | Endelige lastetilstander
Hydrostatics containing the following parameters as a function of the draught with a specified reference point | Hydrostatikk som inneholder følgende parametere som funksjon av dypgang med spesifisert referansepunkt
? | ?
displacement | deplasement
KB | KB
centre of buoyancy | oppdriftssenter
If warranted by the ferry's size or type | Når fergens størrelse eller type tilsier det
the Norwegian Maritime Authority may require the mooring arrangement to be dimensioned for a mooring force higher than 30 tonnes | kan Sjøfartsdirektoratet kreve at fortøyningsarrangementet blir dimensjonert for høyere fortøyningskraft enn 30 vekttonn
KM | KM
transverse metacentre above the baseline | tverrskips metasenter over basis
AwT | AwT
waterline area | vannlinjeareal
TP1 | TP1
tonnes per unit submersion | enhets neddykking
MT1 | MT1
moment to change trim | enhets trimmoment
LCF | LCF

(19)

Output of procedure: term database file (tsv)
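A tab-separated file of this kind can be produced with standard tools. The sketch below is an assumption about the general shape only: the column names en and no and the function name to_tsv are hypothetical, and the actual file layout used for import into the term base is not specified here.

```python
import csv
import io


def to_tsv(pairs, header=("en", "no")):
    """Serialise (English, Norwegian) term candidate pairs as TSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(pairs)
    return buffer.getvalue()


pairs = [
    ("displacement", "deplasement"),
    ("centre of buoyancy", "oppdriftssenter"),
    ("waterline area", "vannlinjeareal"),
]
print(to_tsv(pairs), end="")
```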

(20)

The next stage: manual editing in Termportalen

(21)

CONCLUSION


(22)

Other remaining tasks

• The output of each processing procedure: a bilingual list of term candidates
• Precision and recall need to be checked against a gold standard
• The gold standard will be developed via manual term extraction performed by field experts/research assistant
• The performance/contribution of each module will be checked separately
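Once a gold standard exists, the check for each module reduces to set comparison: precision is the share of extracted candidates that are in the gold standard, and recall is the share of gold-standard terms that were extracted. A minimal sketch, with a hypothetical function name and invented toy data:

```python
def precision_recall(candidates, gold):
    """Return (precision, recall) of extracted candidates against a gold set."""
    candidates, gold = set(candidates), set(gold)
    hits = len(candidates & gold)
    precision = hits / len(candidates) if candidates else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Toy data: term pairs as (English, Norwegian) tuples, invented for the example.
extracted = {("barges", "lektere"), ("doors", "dører"),
             ("shall", "skal"), ("in use", "bruk")}
gold = {("barges", "lektere"), ("doors", "dører"), ("lifts", "heiser")}
print(precision_recall(extracted, gold))
```

Running the same function on the output of each module separately gives the per-module comparison mentioned in the last bullet.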

(23)

Summary

• A hybrid approach, using a combination of linguistic and statistical approaches to (bilingual) TE
• combining the strengths of both approaches
• at the same time attempting to utilise and maximise the value of “old” existing language resources
• although based on data drawn specifically from the maritime sector, the production line and infrastructure proposed here are meant to be generic and applicable in (all) other domains

(24)

References

Andersen, Gisle, ed. 2012. Exploring Newspaper Language - Using the web to create and investigate a large corpus of modern Norwegian. Amsterdam: John Benjamins.

Andersen, Gisle, and Knut Hofland. 2012. Building a large monitor corpus based on newspapers on the web. In Exploring Newspaper Language - Using the web to create and investigate a large corpus of modern Norwegian, edited by G. Andersen. Amsterdam: John Benjamins.

Heylen, Kris, and Dirk De Hertog. 2015. Automatic term extraction. In Handbook of Terminology, edited by H. J. Kockaert and F. Steurs. Amsterdam: John Benjamins.

Lyse, Gunn Inger, and Gisle Andersen. 2012. Collocations and statistical analysis of n-grams. In Exploring Newspaper Language - Using the web to create and investigate a large corpus of modern Norwegian, edited by G. Andersen. Amsterdam: John Benjamins.

Rosén, Victoria. 2012. Exploring corpora through syntactic annotation. In Exploring Newspaper Language - Using the web to create and investigate a large corpus of modern Norwegian, edited by G. Andersen. Amsterdam: John Benjamins.
