Towards automatic terminology extraction for Norwegian based on parallel corpora
Gisle Andersen
LSP Conference, Vienna
8 July 2015
Background and contents
• NHH is developing a national infrastructure that integrates terminological language resources
• Termportalen; WP7 in CLARINO project (NFR)
• Many specialist fields lacking systematic terminology
• Case: Sjøfartsdirektoratet (Norwegian Maritime Authority)
Contents
• Introduction: aim
• Data and methods
• Pattern matching
• Conclusion
Aim of work
• Purpose: providing aid to field experts where systematic terminology work is lacking
• A generic system meant to enhance terminology for various domains
• Maximising value of existing tools and language resources
• Setting up an infrastructure, a production line for term extraction (TE)
• Using a variety of techniques for parallel corpus-based TE
The necessary disclaimer
• What is extracted through computational methods is always a set of term candidates.
• Need for subsequent manual check by field experts
• Need to supply additional information about concepts (definitions, structure, term variation); cf. e.g. Heylen & De Hertog (2015)
DATA AND METHODS
The corpus (1/2)
• Sjøfartsdirektoratet (NMA)
• Parallel corpus of translated texts (EN-NO)
• Policies and legislation relating to shipping
- navigation, communication, safety, etc.
• Current version: translated regulations from International Maritime Organization (IMO)
- small; currently 9 items
• To be extended to include
- Skipssikkerhetsloven / The Ship Safety and Security Act
- NMA’s own regulations
The corpus (2/2)
Title of regulation | Navn på forskrift | TCA2 file (EN) | TCA2 file (NO)
Regulations of 1 July 2014 No. 1072 on the construction of ships | Forskrift 01. juli 2014 om bygging av skip | RCS_E | RCS_N
Regulations of 1 July 2014 No. 944 on dangerous goods on Norwegian ships | Forskrift 1. juli 2014 om farlig last på norske skip | RDG_E | RDG_N
Regulations of 1 July 2014 No. 1099 on fire protection on ships | Forskrift 1. juli 2014 om brannsikring på skip | RFP_E | RFP_N
Regulations of 1 July 2014 on life-saving appliances on ships | Forskrift 1. juli 2014 om redningsredskaper på skip | RLS_E | RLS_N
Regulations of 5 June 2014 No. 805 on medical examination of employees on Norwegian ships and mobile offshore units | Forskrift 01.07.2014 nr. xxxx om helseundersøkelse av arbeidstakere på norske skip og flyttbare innretninger | RME_E | RME_N
Regulations of 5 September 2014 No. 1157 on navigation and navigational aids for ships and mobile offshore units | Forskrift om navigasjon og navigasjonshjelpemidler for skip og flyttbare innretninger | RNN_E | RNN_N
Regulations of 1 July 2014 No. 955 concerning radiocommunication equipment for Norwegian ships and mobile offshore units | Forskrift 1. juli 2014 om radiokommunikasjonsutstyr for norske skip og flyttbare innretninger | RRR_E | RRE_N
Regulations of 5 January 2014 No. 1191 on a safety management system for Norwegian ships and mobile offshore units | Forskrift om sikkerhetsstyringssystem for norske skip, og flyttbare innretninger | RSM_E | RSM_N
IMO standard marine communication phrases (SMCPs) | IMOs standarduttrykk for maritim kommunikasjon | |
Step 1: Text conversion: doc → html → xml
Step 3A: Pattern matching
English:
<s>b) barges;</s>
<s>The spooling device shall:</s>
<s>a) initial certification upon changes in use;</s>
<s>e) handrails, corridors and passageways, doorways, doors, lifts, vehicle decks, passenger lounges, accommodation and washrooms shall be …
Wire/chain stoppers shall be dimensioned for a safe working load …

Norwegian:
<s>d) lektere</s>
<s>Spoleapparatet skal:</s>
<s>a) førstegangssertifisering ved endret bruk</s>
<s>e) Håndlister, korridorer og ganger, døråpninger, dører, heiser, bildekk, passasjersalonger, innredning og toaletter skal være …
En wire- og kjettingstopper skal være dimensjonert for en sikker arbeidsbelastning …
• Premise: recognisable patterns in sentence and paragraph structure, punctuation, etc. suggesting termhood
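A minimal sketch of how such patterns might be matched, using the example segments above. The two regular expressions (a short lettered list item, and the subject preceding "shall"/"skal") and the three-word length cut-off are illustrative assumptions, not the patterns used in the project.

```python
import re

# Hypothetical pattern 1: a lettered list item that fills a whole <s> segment.
LIST_ITEM = re.compile(r"<s>\s*[a-z]\)\s*(?P<body>[^<]+?)\s*[;.]?\s*</s>")

# Hypothetical pattern 2: the subject preceding the modal verb "shall"/"skal".
SUBJECT = {
    "en": re.compile(r"<s>\s*(?:The\s+)?(?P<subj>[^<]+?)\s+shall\b"),
    "no": re.compile(r"<s>\s*(?:En\s+|Et\s+)?(?P<subj>[^<]+?)\s+skal\b"),
}


def candidates(segment: str, lang: str, max_words: int = 3) -> list:
    """Return term candidates found in one <s> segment (lang is 'en' or 'no')."""
    found = []
    item = LIST_ITEM.search(segment)
    # Long list items are more likely clauses than terms, so they are skipped.
    if item and len(item.group("body").split()) <= max_words:
        found.append(item.group("body"))
    subject = SUBJECT[lang].search(segment)
    if subject and len(subject.group("subj").split()) <= max_words:
        found.append(subject.group("subj"))
    return found
```

Applied to "<s>The spooling device shall:</s>" and "<s>Spoleapparatet skal:</s>", this yields the pair "spooling device" / "Spoleapparatet", which would still have to be confirmed by a field expert.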
Step 3B: Check of terminological inventory
• Premise: if a word/sequence of words is already registered as a term in another component of Termportalen, it has high termhood (it is likely to constitute a term in the current context also)
• Question 1: same or different translation relation
• Question 2: same or different domain
• Methodological issue: inflected forms in texts; base form in term base
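The inflection issue can be illustrated with a small lookup that tries the surface form first and then a few stripped variants. This is only a sketch: the suffix list is a crude, hypothetical stand-in for proper lemmatisation of Norwegian, and the term-base entries are invented for the example rather than taken from Termportalen.

```python
from typing import Optional

# Hypothetical list of common Norwegian noun endings; a real system
# would use a lemmatiser or a full-form lexicon instead.
SUFFIXES = ("ene", "en", "et", "er", "e", "a")


def lookup(token: str, term_base: set) -> Optional[str]:
    """Return the matching base form from term_base, or None if there is none."""
    form = token.lower()
    if form in term_base:
        return form
    for suffix in SUFFIXES:
        if form.endswith(suffix):
            stem = form[: -len(suffix)]
            if stem in term_base:
                return stem
    return None


# Toy term base (assumed entries, for illustration only).
toy_term_base = {"lekter", "spoleapparat", "bildekk"}
```

With this toy term base, `lookup("lektere", toy_term_base)` finds "lekter" and `lookup("Spoleapparatet", toy_term_base)` finds "spoleapparat". Suffix stripping will both over- and under-match on real data, which is why the slide flags this as a methodological issue.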
Step 3C: Neology detection
• Premise: if word/sequence of words can be shown to be a neologism (domain-specific vocabulary), it has high termhood (is likely to be a term)
• Check against inventory of words in large general language corpus (GLC); Norsk aviskorpus (Norwegian Newspaper Corpus, NNC; cf. Andersen 2012; Andersen & Hofland 2012)
• Check among neologisms registered in NNC’s neology database
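The check against a general-language inventory amounts to filtering out word forms that the GLC already knows. A minimal sketch, assuming the GLC is available as a plain frequency dictionary (the function name, the threshold parameter and the toy frequencies are all hypothetical, not NNC data):

```python
def unseen_in_glc(tokens, glc_freq, max_freq=0):
    """Return lower-cased word forms whose GLC frequency is at most max_freq.

    Order of first occurrence is kept and duplicates are dropped.
    """
    result = []
    for token in tokens:
        form = token.lower()
        if glc_freq.get(form, 0) <= max_freq and form not in result:
            result.append(form)
    return result


# Toy frequency list (invented numbers, not taken from the NNC).
toy_glc = {"dører": 1520, "skip": 9800}
domain_tokens = ["Spoleapparatet", "dører", "bildekk", "spoleapparatet"]
```

Raising `max_freq` above zero would also admit words that are merely rare in general language, at the cost of more noise among the candidates.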
Step 3D: Monolingual/bilingual lexicon lookup
• Premise: if word/sequence of words is found among the lexical inventory in a mono/bilingual technical or specialised dictionary, it has high termhood
• Agreement with Kunnskapsforlaget to reuse some of their manuscripts
Step 3E: Association measures (AMs)
• Premise: terms are often constituted as collocations, i.e. words with a strong tendency to co-occur, so strong collocations may be seen as indicators of termhood
• Association measures: statistical measures of unithood/termhood (Heylen & De Hertog 2015)
• Important to select an adequate AM for TE, e.g. Pointwise Mutual Information, Chi-square (cf. Lyse & Andersen 2012)
• Collocation patterns should be compared with GLC data (NNC)
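As an illustration of one such measure, Pointwise Mutual Information for a bigram (x, y) is log2( P(x, y) / (P(x) · P(y)) ). The sketch below estimates the probabilities from raw counts in a token list; the function name and the toy sentence are assumptions made for the example, and no smoothing or frequency cut-off is applied.

```python
import math
from collections import Counter


def pmi_scores(tokens):
    """Return {(x, y): PMI} for every adjacent word pair in tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)
    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bigrams
        p_x = unigrams[x] / n_unigrams
        p_y = unigrams[y] / n_unigrams
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


toy_tokens = (
    "the spooling device shall be fitted "
    "the spooling device shall be tested"
).split()
toy_scores = pmi_scores(toy_tokens)
```

PMI is known to favour rare pairs: in this toy text a pair seen once, such as ("fitted", "the"), receives the same score as ("spooling", "device"). This is one reason why the choice of measure, and the comparison with GLC data, matters.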
Step 3F: Parsing techniques
• Premise: terminological units are typically constituted as (complex) noun phrases; output from syntactic parsing may give good guidance towards terminological units
• Parsers for Norwegian and English: INESS project (UiB; cf. Rosén 2012)
A CLOSER LOOK AT PATTERN MATCHING
Term extraction based on pattern matching
English | Norwegian
final loading conditions | Endelige lastetilstander
Hydrostatics containing the following parameters as a function of the draught with a specified reference point | Hydrostatikk som inneholder følgende parametere som funksjon av dypgang med spesifisert referansepunkt
? | ?
displacement | deplasement
KB | KB
centre of buoyancy | oppdriftssenter
If warranted by the ferry's size or type | Når fergens størrelse eller type tilsier det
the Norwegian Maritime Authority may require the mooring arrangement to be dimensioned for a mooring force higher than 30 tonnes | kan Sjøfartsdirektoratet kreve at fortøyningsarrangementet blir dimensjonert for høyere fortøyningskraft enn 30 vekttonn
KM | KM
transverse metacentre above the baseline | tverrskips metasenter over basis
AwT | AwT
waterline area | vannlinjeareal
TP1 | TP1
tonnes per unit submersion | enhets neddykking
MT1 | MT1
moment to change trim | enhets trimmoment
LCF | LCF
Output of procedure: term database file (tsv)
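A tab-separated file of this kind can be produced with Python's standard csv module. The slides do not specify the column layout, so the three columns below (term_en, term_no, source) and the "demo" source label are assumptions; the term pairs are taken from the aligned segments shown above.

```python
import csv
import io


def to_tsv(pairs, source):
    """Serialise (English, Norwegian) term candidates as TSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["term_en", "term_no", "source"])  # assumed header
    for english, norwegian in pairs:
        writer.writerow([english, norwegian, source])
    return buffer.getvalue()


toy_pairs = [
    ("displacement", "deplasement"),
    ("centre of buoyancy", "oppdriftssenter"),
    ("waterline area", "vannlinjeareal"),
]
tsv_text = to_tsv(toy_pairs, "demo")
```

Writing to a string buffer keeps the sketch free of file-system side effects; passing an open file handle to csv.writer works the same way.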
The next stage: manual editing in Termportalen
CONCLUSION
Other remaining tasks
• The output of each processing procedure: a bilingual list of term candidates
• Precision and recall need to be checked against a gold standard
• The gold standard will be developed via manual term extraction performed by field experts/research assistant
• The performance/contribution of each module will be checked separately
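Once a gold standard exists, the evaluation of a module could look roughly like the sketch below: precision is the share of extracted candidates that are in the gold standard, recall the share of gold-standard terms that were extracted. The candidate and gold lists here are invented for illustration.

```python
def precision_recall(candidates, gold):
    """Return (precision, recall) of candidate term pairs against a gold set."""
    candidates, gold = set(candidates), set(gold)
    true_positives = len(candidates & gold)
    precision = true_positives / len(candidates) if candidates else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy data: one spurious candidate, two gold terms missed.
toy_candidates = [("barges", "lektere"), ("doors", "dører"), ("shall", "skal")]
toy_gold = [
    ("barges", "lektere"),
    ("doors", "dører"),
    ("lifts", "heiser"),
    ("vehicle decks", "bildekk"),
]
precision, recall = precision_recall(toy_candidates, toy_gold)
```

Running the same function on the output of each module separately would give the per-module comparison mentioned in the last bullet.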
Summary
• A hybrid approach, using a combination of linguistic and statistical approaches to (bilingual) TE
• combining the strengths of both approaches
• at the same time attempting to utilise and maximise the value of “old” existing language resources
• although based on data drawn specifically from the maritime sector, the production line and infrastructure proposed here are meant to be generic and applicable in (all) other domains
References
Andersen, Gisle, ed. 2012. Exploring Newspaper Language - Using the web to create and investigate a large corpus of modern Norwegian. Amsterdam: John Benjamins.
Andersen, Gisle, and Knut Hofland. 2012. Building a large monitor corpus based on newspapers on the web. In Exploring Newspaper Language - Using the web to create and investigate a large corpus of modern Norwegian, edited by G. Andersen. Amsterdam: John Benjamins.
Heylen, Kris, and Dirk De Hertog. 2015. Automatic term extraction. In Handbook of Terminology, edited by H. J. Kockaert and F. Steurs. Amsterdam: John Benjamins.
Lyse, Gunn Inger, and Gisle Andersen. 2012. Collocations and statistical analysis of n-grams. In Exploring Newspaper Language - Using the web to create and investigate a large corpus of modern Norwegian, edited by G. Andersen. Amsterdam: John Benjamins.
Rosén, Victoria. 2012. Exploring corpora through syntactic annotation. In Exploring Newspaper Language - Using the web to create and investigate a large corpus of modern Norwegian, edited by G. Andersen. Amsterdam: John Benjamins.