DanNet
From Dictionary
to Wordnet
Jörg Asmussen
Society for Danish Language and Literature, DSL, Copenhagen
Bolette Sandford Pedersen
Centre for Language Technology, CST, University of Copenhagen
Lars Trap-Jensen
Outline
1. Introduction LTJ, 2 min.
2. Characteristics of the DDO LTJ, 5 min.
3. Building DanNet BSP, 8 min.
4. Extraction of differentia info JA, 7 min.
DanNet
• Lexical-semantic wordnet for Danish
• Joint project
  • Society for Danish Language and Literature
  • Centre for Language Technology, University of Copenhagen
Limited resources
• Adapt an existing wordnet? or
• Reuse other lexical-semantic resources:
  • SIMPLE-DK
Outline
1. Introduction
2. Characteristics of the DDO
3. Building DanNet
4. Extraction of differentia info from definitions
5. Conclusions
Den Danske Ordbog
• Published by DSL 2003–5
• Corpus-based, DDOC
• 60,000 entries
• Spelling, morphology, pronunciation, meaning, collocations, fixed phrases, syntax, usage, word formation, etymology
Den Danske Ordbog
• Words edited in related groups
• Machine readable
• Fine-grained microstructure
Systematic domain info
→ concerns relation
Sense definition
→ relevant info „manually“ extracted
Hyperonym
Sense relations, i.e. synonyms
Collocational information
Authentic example
Definitions in the DDO
Definition scheme:
• Genus proximum – closest hyperonym: apparat ‚technical device‘
• Differentia specifica – distinctive feature: remaining part of the definition
Outline
1. Introduction
2. Characteristics of the DDO
3. Building DanNet
4. Extraction of differentia info from definitions
5. Conclusions
Building DanNet
• Extract definitions and genus specifications
• Include them in the DanNet tool
• Use it for domain-wise development of data:
  1. Homonymy and polysemy
  2. Establishing synsets
Homonymy & polysemy
celle ‚cell‘ is genus proximum of
• gærcelle ‚yeast cell‘
• fængselscelle ‚prison cell‘
Convert lexical expressions into concepts:
• celle-1 ‚part of living organism‘
Establishing synsets
lære ‚studies‘, fag ‚subject‘, videnskab ‚science‘
informatik ‚informatics‘, bromatologi ‚nutrition science‘, samfundsfag ‚social studies‘, datalogi ‚computer science‘
One synset
Building the hierarchy
Hyponymy is generally defined as
• X is a Y
Taxonymy is a subtype of this:
• X is a kind/type of Y
Example: Hyponymy?
træ ‚tree‘
• kirsebærtræ ‚cherry tree‘
• birketræ ‚birch‘
• vejtræ ‚roadside tree‘ („Orthogonal“ Hyponymy)
Building the hierarchy
stol ‚chair‘
siddemøbel ‚sitting furniture‘
møbel ‚furniture‘
genstand ‚object‘ TOP
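The chain from stol up to TOP can be pictured as a simple child-to-parent mapping. The following is only a minimal sketch of that idea: the dictionary `HYPERNYM` and the helper names `hypernym_chain` and `is_hyponym_of` are hypothetical and not part of the DanNet tool.

```python
# Toy data taken from the slide; None marks the TOP of the hierarchy.
# The names HYPERNYM, hypernym_chain and is_hyponym_of are illustrative only.
HYPERNYM = {
    "stol": "siddemøbel",
    "siddemøbel": "møbel",
    "møbel": "genstand",
    "genstand": None,
}


def hypernym_chain(word, hypernym=HYPERNYM):
    """Return the word followed by all its hypernyms, ending below TOP."""
    chain = [word]
    seen = {word}
    while hypernym.get(chain[-1]) is not None:
        parent = hypernym[chain[-1]]
        if parent in seen:
            # A cycle would mean the hierarchy is not a proper taxonomy.
            raise ValueError(f"cycle at {parent!r}")
        chain.append(parent)
        seen.add(parent)
    return chain


def is_hyponym_of(word, ancestor, hypernym=HYPERNYM):
    """True if ancestor is reachable from word by following hypernym links."""
    return ancestor in hypernym_chain(word, hypernym)[1:]


print(" -> ".join(hypernym_chain("stol")))
```

A single parent per concept is a simplification made here for brevity; a real wordnet may allow several hypernyms per synset.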
Definition composition
• Genus selection – a conscious process
• Differentia:
  • No editorial specifications, i.e. no fixed definition vocabulary nor syntax
• Consequences for DanNet:
  • Complicates computational exploitation
Coding relations
• What is done manually:
  • No semantic info other than that of DDO
  • Reduction of semantic info
• What is done automatically:
Outline
1. Introduction
2. Characteristics of the DDO
3. Building DanNet
4. Extraction of differentia info from definitions
5. Conclusions
Extraction of telic role
genus expression
fjernsyn ‚tv set‘
‚box-shaped device that can receive tv signals and transform them into animated pictures on a screen and accompanying sound in the speakers of the device‘
Telic role:
Hypothesis
‣ VPs in a relative clause which are headed by kan ‚can‘ specify the telic role (i.e. the for_purpose_of relation) of the definiendum
Corpus query
Find all definitions with genus apparat
followed by der or som
followed by kan
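The three query steps can be approximated with one regular expression. This is a rough sketch, not the query language actually used on the DDO data: the Danish definition strings are invented toy examples, and `QUERY`, `DEFINITIONS` and `telic_candidates` are hypothetical names.

```python
import re

# genus "apparat", then "der" or "som", then "kan"; the captured group is the
# first word after "kan", taken here as the head of the VP.
QUERY = re.compile(r"\bapparat\s+(?:der|som)\s+kan\s+(\w+)")

# Invented toy definitions for illustration only (not actual DDO text).
DEFINITIONS = {
    "fjernsyn": "kasseformet apparat der kan modtage tv-signaler",
    "speedometer": "apparat som kan måle et køretøjs hastighed",
    "grammofon": "apparat til at afspille plader med",
    "diktafon": "apparat der bruges til at optage tale",
}


def telic_candidates(definitions):
    """Map each headword whose definition matches the query to its VP head."""
    hits = {}
    for headword, text in definitions.items():
        match = QUERY.search(text)
        if match:
            hits[headword] = match.group(1)
    return hits


print(telic_candidates(DEFINITIONS))
```

In this toy set only two of the four definitions match, because the other two express purpose without kan.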
Results of corpus query
[Table: query; VP heads denoting telic role; dictionary entries]
Only 26 occurrences of this pattern – but 203
Why this bad coverage?
1. Definitions where the pattern contains interposed material are not captured
2. Other structural patterns indicating a for_purpose_of relation than the one given in our hypothesis
Further patterns (GE = genus expression)
1. GE that can VP-inf
2. GE that is used for to VP-inf with
3. GE for to VP-inf with/on/in
4. GE that VP-fin
5. GE for NP
6. GE that is specially designed for to VP-inf
head for_purpose_of
These patterns capture 70% of the apparat
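A pattern inventory like this one could be tried out as an ordered list of regular expressions, with coverage measured as the share of definitions matched by any of them. The sketch below makes several assumptions: the slides give the patterns only as English glosses, so the Danish wordings in the regexes are guesses; patterns 4 and 5 are left out because they would need part-of-speech information; and the definitions are invented toy strings.

```python
import re

# Assumed Danish surface forms; every regex here is a guess, not DDO wording.
GE = r"\bapparat\s+"
REL = r"(?:der|som)\s+"
PATTERNS = [
    ("1: GE that can VP-inf",
     re.compile(GE + REL + r"kan\s+(\w+)")),
    ("2: GE that is used for to VP-inf",
     re.compile(GE + REL + r"bruges\s+til\s+at\s+(\w+)")),
    ("6: GE that is specially designed for to VP-inf",
     re.compile(GE + REL + r"er\s+specielt\s+beregnet\s+til\s+at\s+(\w+)")),
    ("3: GE for to VP-inf",
     re.compile(GE + r"til\s+at\s+(\w+)")),
]

# Invented toy definitions for illustration only (not actual DDO text).
TOY_DEFINITIONS = {
    "fjernsyn": "kasseformet apparat der kan modtage tv-signaler",
    "speedometer": "apparat som kan måle et køretøjs hastighed",
    "grammofon": "apparat til at afspille plader med",
    "diktafon": "apparat der bruges til at optage tale",
    "stroboskop": "apparat med blinkende lys",
}


def classify(text):
    """Return (pattern name, VP head) for the first matching pattern, else None."""
    for name, regex in PATTERNS:
        match = regex.search(text)
        if match:
            return name, match.group(1)
    return None


def coverage(definitions):
    """Share of definitions matched by at least one pattern."""
    matched = sum(1 for text in definitions.values() if classify(text) is not None)
    return matched / len(definitions)


for headword, text in TOY_DEFINITIONS.items():
    print(headword, classify(text))
print("coverage:", coverage(TOY_DEFINITIONS))
```

The coverage figure printed here describes the toy data only and says nothing about the DDO.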
A statistical approach
• Frequency list of types in definitions with genus apparat
compared with
• frequency list of types in all definitions
using a statistical test (e.g. log likelihood)
‣ Salient types are listed for investigation and may give hints on semantic relations
Some salient types
• afspille ‚to play back‘
• afspilning ‚play back‘
• måle ‚measure‘
• måler ‚measuring tool‘
• måling ‚gauging‘
• målinger ‚measurements‘
grammofon, cd-afspiller, afspiller, sequencer, diktafon, kassettespiller, hjemmevideo, kassettebåndoptager, båndoptager
stroboskop, måler, timer, løgnedetektor, ekkolod, gasmåler, speedometer, omdrejningstæller, benzinmåler, fotofælde, elmåler, trykmåler, luxmeter, spirometer, gyrometer, alkometer, newtonmeter, magnetometer, instrument, kalorimeter, måleinstrument
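The comparison of the two frequency lists can be sketched with the log-likelihood (G2) statistic in its common two-corpus form. The slides name log likelihood only as an example test, so the formula variant, the function names `log_likelihood` and `salient_types`, and the token lists are all assumptions made for illustration.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """G2 for a type seen a times in c target tokens and b times in d reference tokens."""
    expected_target = c * (a + b) / (c + d)
    expected_reference = d * (a + b) / (c + d)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / expected_target)
    if b > 0:
        g2 += b * math.log(b / expected_reference)
    return 2 * g2


def salient_types(target_tokens, reference_tokens, top=5):
    """Types over-represented in the target, ranked by descending G2."""
    target_freq = Counter(target_tokens)
    reference_freq = Counter(reference_tokens)
    c, d = len(target_tokens), len(reference_tokens)
    scored = []
    for word, a in target_freq.items():
        b = reference_freq[word]
        # Keep only types that are relatively more frequent in the target.
        if a / c > b / d:
            scored.append((log_likelihood(a, b, c, d), word))
    scored.sort(reverse=True)
    return [word for _, word in scored[:top]]


# Invented toy token lists: "target" stands for definitions with genus apparat,
# "reference" for the definitions it is compared with.
target = ["måle"] * 3 + ["der"] * 3
reference = ["der"] * 30 + ["måle"] * 1 + ["og"] * 29
print(salient_types(target, reference))
```

A ranked list of this kind is meant as input for the editor's inspection, in line with the slide, and not as a relation extractor in itself.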
Automatic extraction?
Basically NO...
Developing reliable methods is too expensive!
• Structural and lexical properties of definitions differ considerably
‣ Difficult to automatically extract semantic relations from definitions
‣ Concordances and lists of salient definition types may help the editor
‣ But the DanNet editor still has to do the core job of analysing dictionary definitions
Outline
1. Introduction
2. Characteristics of the DDO
3. Building DanNet
4. Extraction of differentia info from definitions
5. Conclusions
Conclusion
Reusing the DDO
Semi-automatic exploitation of the dictionary structure
• hyponymy structure
• synonym/antonym info
Automatic exploitation of definitions proper to find other semantic relations
Cheap
Conclusion
The DanNet approach
Translation/expansion of existing WNs?
• Better coherence with other WNs
• Linguistic bias
Reusing/merging language resources?
• More loyal to the specific language
• Expensive, unless based on an existing resource, i.e. a dictionary
Expensive ↔ Cheap