The BioLexicon: a Large-Scale Domain-Specific Lexical Resource for Biomedical Text Mining


(1)

Simonetta Montemagni

Istituto di Linguistica Computazionale “Antonio Zampolli” - CNR

LREC 2010

2nd Workshop on Building and evaluating resources for biomedical text mining

(2)

Outline

The BioLexicon
Who
Why
How
What
Where from
Focus on Verbs
Representation
How many entries
Evaluation
Distribution
Conclusions

(3)

The BioLexicon: who

Joint and collaborative work of the following teams in the framework of the European BOOTStrep project (FP6 - 028099)

D. Rebholz-Schuhmann, P. Pezik, V. Lee, J.J. Kim
European Bioinformatics Institute, Wellcome Trust Genome Campus, Cambridge, CB10 1SD, UK

N. Calzolari, M. Monachini, S. Montemagni, R. del Gratta, S. Marchi, V. Quochi, G. Venturi
Istituto di Linguistica Computazionale “Antonio Zampolli” - CNR, Via Giuseppe Moruzzi N° 1, 56124 Pisa, Italy

S. Ananiadou, J. McNaught, Y. Sasaki, P. Thompson
School of Computer Science, The University of Manchester, 131 Princess Street, M1 7DN, UK

(4)

The BioLexicon: why

Text Mining needs information about words
  the lexical component still remains a major bottleneck
  TM systems in the biomedical domain must be provided with a substantial lexicon covering a realistic vocabulary and providing the kinds of linguistic information appropriate to grasp the knowledge embedded in texts

Biomedical term variants (orthographic, semantic, geographical, ...)
  better information retrieval

Terminological verbs and their combinatorial properties (subcategorization frames and predicate-argument structure)
  better information extraction and question answering

Word derivations
  to reach similar meaning expressed in different ways (e.g. activation

(5)

The BioLexicon as a key ingredient to bridge the gap between text and knowledge

To perform such a key role the BioLexicon MUST
  reflect the actual usage of words in biomedical texts
  be continuously updated with new word synonyms emerging from texts
  include rich linguistic information on the behavioural properties of nouns and verbs

[Figure labels: Biomedical literature; BioLexicon; Biomedical knowledge]

(6)

The BioLexicon: how

General Requirements
  Modularity, extensibility, conformity to standards, reusability

Biomedical Domain Specific Requirements
  Gene names, protein names, bio-events and participants, ...

Linguistic/Terminological Requirements
  term variants, source identifiers, acronyms, syntactic and semantic properties of terms, ...

Text Mining / Machine Learning Requirements
  Confidence scores for automatically extracted info (e.g. variants, subclusterizations

(7)

The BioLexicon: what

integrated lexical-terminological resource of ~2.2M lexical entries for bio-text mining with information about
  nouns, verbs, adjectives, adverbs
  both domain-specific and general language words

populated with terms gathered from
  available biomedical sources
  texts (biomedical literature)

including rich linguistic information ranging over different linguistic description levels
  e.g. derivational morphology, subcategorization patterns, predicate argument structure, syntax-semantics linking

combining features of both terminologies and open-domain computational lexicons

conforming to international lexical representation standards (the ISO/DIS 24613 Lexical Mark-up Framework)

(8)

The BioLexicon: where from

Incremental population process

[Figure: population workflow diagram. Recoverable labels:
  sources: existing repositories; MEDLINE
  processing steps: subclustering of term variants; Named Entity Recognition; term mapping by normalisation; manual curation; linguistic pre-processing; subcat extraction; manual annotation of a bio-event corpus; bio-event extraction; syn-sem linking
  content entering the BioLexicon: chemical compounds, species names, disease, enzymes; genes/proteins; new genes/proteins names; verbs, nouns, adjs, advs (variants, inflected forms, derivative relations, ...)
  BL Population ToolKit]

(9)

The BioLexicon: focus on verbs

Accurate TM applications focused on event extraction require lexical resources providing an exhaustive account of the semantic and syntactic combinatorial properties of lexical units conveying event information
  Several exist for the general language domain, e.g. FrameNet, VerbNet, PropBank

Specialist domains such as biology require domain-specific resources
  use of different predicates to describe events
    E.g. methylate, phosphorylate
  general language predicates may have different properties
    E.g. the patient presented with influenza to the doctor vs the patient presented the doctor with influenza

(10)

Current biomedical lexical resources

A number of attempts have been made to produce domain-specific extensions of general-purpose lexical semantic resources providing information on predicate-argument structure
  BioFrameNet and PASBio
    corpus-based
    small-scale
  SPECIALIST lexicon
    extension of a large lexicon of general English
    not corpus-driven
    syntactic complementation patterns only

Creation of resources focussed on predicate-argument structure can be a major bottleneck
  Mostly manually created by lexicographers
  Limited coverage
  Time-consuming to port to new domains

Automatic or semi-automatic acquisition methods more promising and increasingly viable
  Advances in NLP and machine learning technology

(11)

Current biomedical lexical resources: the need

To our knowledge, there is currently no large-scale domain-specific lexical resource
  providing predicate-argument information
  based on domain-specific corpora
  containing both syntactic and semantic information

The

(12)

The BioLexicon: our approach towards acquisition of verb information

bootstrapped from biomedical corpora
  the most relevant verbs are included in the lexicon
  their encoded behaviour is domain-specific

contains both syntactic and semantic information
  syntactic subcategorization (e.g. act ARG1#PP-as#)
  semantic event frame information (e.g. bind AGENT#THEME#LOC#)
  explicit link between the two (e.g. express AGENT>ARG1#THEME>ARG2#LOC>PP-in#)

built semi-automatically by combining NLP and Machine Learning techniques
  syntactic frames are extracted through unsupervised learning on dependency-annotated text
  semantic frames are based on manual annotation of gene regulation bio-events by domain experts
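The three frame notations quoted on this slide (syntactic, semantic, and linked) share one surface pattern: slots terminated by "#", with ">" joining a semantic role to its syntactic slot in the linked form. The following is a minimal sketch of a reader for that notation. The names Slot and parse_frame and the "kind" argument are hypothetical and are not part of the BioLexicon distribution; the actual storage format is LMF-based and is not shown in the slides.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Slot:
    """One frame slot; either field may be missing depending on the frame kind."""
    role: Optional[str]    # semantic role, e.g. "AGENT"
    syntax: Optional[str]  # syntactic slot, e.g. "ARG1" or "PP-in"


def parse_frame(frame: str, kind: str) -> List[Slot]:
    """Split a '#'-terminated frame string into slots.

    kind is "syntactic" (ARG1#PP-as#), "semantic" (AGENT#THEME#LOC#)
    or "linked" (AGENT>ARG1#THEME>ARG2#LOC>PP-in#).
    """
    slots: List[Slot] = []
    for part in (p for p in frame.split("#") if p):
        if kind == "linked":
            role, syntax = part.split(">", 1)
            slots.append(Slot(role, syntax))
        elif kind == "semantic":
            slots.append(Slot(part, None))
        elif kind == "syntactic":
            slots.append(Slot(None, part))
        else:
            raise ValueError(f"unknown frame kind: {kind}")
    return slots


if __name__ == "__main__":
    for slot in parse_frame("AGENT>ARG1#THEME>ARG2#LOC>PP-in#", "linked"):
        print(slot)
```

Keeping the role and the syntactic slot as separate optional fields lets the same structure hold all three notations, which is convenient when comparing a verb's subcategorization frames with its event frames.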

(13)

From Text to Knowledge: NLP and Knowledge Extraction

[Figure labels: Text Annotation Tools; Knowledge Extraction Tools; Lexicons and ontologies; Structured Knowledge]

(14)

The BioLexicon: our approach to subcat extraction

acquired through unsupervised automatic acquisition techniques from linguistically pre-processed domain corpora
  the starting point: shallow or deep syntactic annotation?

particular requirements for Subcategorization Frames (SCFs) in biomedical language
  SCFs should also include strongly selected modifiers (such as location, manner and timing), as these are deemed to be essential for the correct interpretation of texts
  average number of arguments in SCFs higher than general language

discovery approach to SCF acquisition based on a looser notion of SCFs, which includes typical verb modifiers in addition to strongly selected arguments
  no a priori knowledge about the set of possible SCFs
  no distinction between argument/modifier

to meet this basic requirement, SCF induction operates

(15)

The BioLexicon: the subcat extraction

SCFs extracted from a corpus of MEDLINE abstracts and full papers made up of 6 million word tokens

The induction process was performed through:
  syntactic annotation of the acquisition corpus with Enju syntactic parser (v2.2, http://www-tsujii.is.s.u-tokyo.ac.jp/enju/) (adapted to biomedical texts)
  extraction of the observed dependency sets (ODSs) for each verbal occurrence
    each ODS represented as a set of dependencies described in terms of relation type (e.g. ARG1, ARG2, PP-in, PP-across, that-CL, etc.)
    order of dependencies in each ODS is normalised
  induction of relevant SCF associated with a given verb
    for each ODS type, the conditional probability given the verb type v was computed
    weighted thresholds used to filter out noisy frames
    an ODS type with an associated probability score beyond a certain threshold is selected as eligible SCF for that verb type
  each SCF has been extracted for one normalised verb token, i.e. the extraction process makes abstraction from the passive usages
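The induction step described on this slide (order-normalise each ODS, estimate the conditional probability of each ODS type given the verb, keep the types above a threshold) can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the function names are hypothetical, alphabetical sorting stands in for the unspecified normalisation order, and a single fixed threshold replaces the weighted thresholds mentioned on the slide.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def normalise_ods(dependencies: Iterable[str]) -> str:
    """Render one observed dependency set in a fixed order, e.g. 'ARG1#ARG2#PP-in#'.

    Assumption: alphabetical order is used as the normal form.
    """
    return "".join(f"{dep}#" for dep in sorted(set(dependencies)))


def induce_scfs(
    observations: Iterable[Tuple[str, List[str]]],
    threshold: float = 0.05,  # hypothetical value; the slides do not give one
) -> Dict[str, Dict[str, float]]:
    """Return, per verb, the ODS types whose p(ODS | verb) exceeds the threshold."""
    verb_totals: Counter = Counter()
    ods_counts: Dict[str, Counter] = defaultdict(Counter)
    for verb, deps in observations:
        verb_totals[verb] += 1
        ods_counts[verb][normalise_ods(deps)] += 1

    induced: Dict[str, Dict[str, float]] = {}
    for verb, counts in ods_counts.items():
        total = verb_totals[verb]
        induced[verb] = {
            ods: n / total for ods, n in counts.items() if n / total > threshold
        }
    return induced


if __name__ == "__main__":
    # Toy observations, invented for illustration only.
    toy = (
        [("methylate", ["ARG1", "ARG2"])] * 7
        + [("methylate", ["ARG2", "ARG1", "PP-in"])] * 2
        + [("methylate", ["ARG1", "PP-at"])]
    )
    print(induce_scfs(toy, threshold=0.15))
```

With the toy data, the frame seen once in ten occurrences falls below the 0.15 cut-off and is dropped as noise, while the two more frequent ODS types are retained as SCFs.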

(16)

Overproduction of TorR in a torT strain resulted in partial constitutive expression of the torA'-'lacZ fusion, suggesting that TorR acts downstream from TorT.

(one dependency per line: argument index, POS, lemma, form; relation; predicate index, POS, lemma, form)
20 VBZ act acts verb_mod_arg12 17 VBG suggest suggesting
-1 UNKNOWN UNKNOWN UNKNOWN verb_mod_arg12 17 VBG suggest suggesting
7 VBD result resulted verb_mod_arg12 17 VBG suggest suggesting
23 NNP tort TorT prep_arg12 22 IN from from
20 VBZ act acts prep_arg12 22 IN from from
11 NN expression expression prep_arg12 8 IN in in
7 VBD result resulted prep_arg12 8 IN in in
20 VBZ act acts adj_arg1 21 RB downstream downstream
20 VBZ act acts comp_arg1 18 IN that that
15 NN fusion fusion prep_arg12 12 IN of of
11 NN expression expression prep_arg12 12 IN of of
7 VBD result resulted punct_arg1 16 , -comma- ,
6 NN strain strain prep_arg12 3 IN in in
0 NN overproduction Overproduction prep_arg12 3 IN in in
2 NN torr TorR prep_arg12 1 IN of of
0 NN overproduction Overproduction prep_arg12 1 IN of of
19 NN torr TorR verb_arg1 20 VBZ act acts
11 NN expression expression adj_arg1 10 JJ constitutive constitutive
15 NN fusion fusion noun_arg1 14 NN tora'-'lacz torA'-'lacZ
11 NN expression expression adj_arg1 9 JJ partial partial
6 NN strain strain noun_arg1 5 NN tort torT
15 NN fusion fusion det_arg1 13 DT the the
6 NN strain strain det_arg1 4 DT a a
0 NN overproduction Overproduction verb_arg1 7 VBD result resulted
7 VBD result resulted ROOT -1 ROOT ROOT ROOT

act ARG1=torr@NN 0 0 0 0 PP-from=tort@NNP 0 0 0 0 0 0 0 0 0 0 VBZ 0
result ARG1=overproduction@NN 0 0 0 0 PP-in=expression@NN 0 0 0 0 MOD=suggest@VBG 0 0 0 0 0
suggest ARG1=UNKNOWN@UNKNOWN 0 0 0 0 0 0 0 0 0 0 0 0 that-CL=act@VBZ 0 0 VBG 0

(17)

The BioLexicon: subcat extraction results

136 different induced SCFs
  vs SPECIALIST Lexicon: very limited number of complementation patterns

Verb        SCF                  P(subcat|v)  Pass
abolish     ARG1#ARG2#           0.8669767    0.1437768
abolish     ARG1#ARG2#MOD@VBG#   0.0390697    0.1904761
abolish     ARG1#ARG2#PP-in#     0.0939534    0.7029702
accumulate  ARG1#ARG2#           0.2940677    0.0403458
accumulate  ARG1#                0.4627118    0
accumulate  ARG1#ARG2#PP-in#     0.1084745    0.140625
accumulate  ARG1#PP-in#          0.1347457    0

(18)

The BioLexicon: acquired SCFs and strongly selected modifiers

many of the strongly selected modifiers spread over different SCFs
  radically underestimated role

SCFs complemented with information about individual dependencies of verbs
  typical verbal dependencies, corresponding to either arguments or strongly selected modifiers, detected through the ll association score
  44 induced dependency types

v          SCF                v_freq  SCF_freq  p(SCF|v)  % passive usages
methylate  ARG1#ARG2#         422     294       0.6967    0.1258
methylate  ARG1#ARG2#PP-in#   422     34        0.0805    0.5588
methylate  ARG1#ARG2#PP-at#   422     14        0.0332    0.2143

v          DEP      all_dep  dep_freq  p(dep|v)  ll        % passive usages
methylate  ARG2#    1406     410       0.2916    778.5146  0.25
methylate  PP-at#   1406     29        0.0206    57.9113   0.31
methylate  PP-in#   1406     45        0.0320    18.2749   0.60
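The slide names an "ll association score" for verb-dependency pairs but does not say how it is computed. One common formulation for word association is the log-likelihood ratio (G2) over a 2x2 contingency table; the sketch below assumes that formulation, which may differ from the one actually used for the BioLexicon. The function name and the example counts other than 410 and 1406 are hypothetical.

```python
import math


def log_likelihood(k11: int, k12: int, k21: int, k22: int) -> float:
    """Log-likelihood ratio (G2) for a 2x2 contingency table.

    k11: occurrences of the verb with the dependency
    k12: occurrences of the verb with any other dependency
    k21: occurrences of other verbs with the dependency
    k22: occurrences of other verbs with other dependencies
    """
    total = k11 + k12 + k21 + k22
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22

    def term(observed: int, row: int, col: int) -> float:
        # A zero cell contributes nothing (limit of x * log x as x -> 0).
        if observed == 0:
            return 0.0
        expected = row * col / total
        return observed * math.log(observed / expected)

    return 2.0 * (
        term(k11, row1, col1)
        + term(k12, row1, col2)
        + term(k21, row2, col1)
        + term(k22, row2, col2)
    )


if __name__ == "__main__":
    # 410 of 1406 dependencies of the verb are ARG2 (from the table on this slide);
    # the counts for all other verbs are invented for illustration.
    print(log_likelihood(410, 1406 - 410, 1000, 50000))
```

The score is zero when the verb and the dependency are statistically independent and grows as the observed co-occurrence departs from what independence would predict, which is why it can rank dependencies that are spread thinly across many SCFs.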

(19)

The BioLexicon: an example of stored subcat information for the verb acquire

[Labels: Full parsing; Preposition-based parsing]

v        SCF                    p(SCF|v)  % pass
acquire  ARG1#ARG2#             0.5461    0.1284
acquire  ARG1#ARG2#PP-in#       0.0886    0.0833
acquire  ARG1#ARG2#PP-from#     0.0406    0.1818
acquire  ARG1#ARG2#PP-by#       0.0406    0.0000
acquire  ARG1#ARG2#PP-during#   0.0295    0.3750

v        DEP        ll         % pass
acquire  ARG2#      579.96392  0.1512915
acquire  WH-when#   25.703417  0.1
acquire  PP-from#   22.716082  0.3333333
acquire  PP-by#     13.626654  0
acquire  PP-in#     13.416025  0.1666667

(20)

The BioLexicon: contained verb subcategorization information

complementarity between Verb-Dep and Verb-SCF associations
  both information types included into the BioLexicon

subcat frame information acquired for 759 different verbs, corresponding to 658 different base forms
  e.g. the occurrences of colocalize, colocalise, co-localize and co-localise recorded under colocalize

the BioLexicon was augmented with:
  1410 Verb-SCF associations, involving 136 different subcat frame types
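The grouping of spelling variants under one base form (colocalize, colocalise, co-localize and co-localise all recorded under colocalize) can be illustrated with a deliberately naive normaliser. The rules below (drop hyphens, rewrite a final "-ise" as "-ize" unless the verb is in a small exception list) are assumptions made for this sketch; the slides do not describe the procedure actually used, and the function names are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# Verbs whose final "-ise" is not a spelling variant of "-ize" (partial list).
ISE_EXCEPTIONS = {"advise", "comprise", "compromise", "devise", "exercise",
                  "promise", "revise", "supervise", "surprise"}


def base_form(verb: str) -> str:
    """Map a verb spelling variant to a single base form (naive rules)."""
    form = verb.lower().replace("-", "")
    if form not in ISE_EXCEPTIONS:
        # Require at least three letters before "ise" so that short verbs
        # such as "rise" or "raise" are left alone.
        form = re.sub(r"(?<=[a-z]{3})ise$", "ize", form)
    return form


def group_variants(verbs: Iterable[str]) -> Dict[str, List[str]]:
    """Group observed verb spellings under their base form."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for verb in verbs:
        groups[base_form(verb)].append(verb)
    return {base: sorted(variants) for base, variants in groups.items()}


if __name__ == "__main__":
    print(group_variants(["colocalize", "colocalise", "co-localize", "co-localise"]))
```

Counting the keys of the resulting dictionary gives the number of base forms, while counting the input spellings gives the number of distinct verbs, which is the kind of difference the 759 versus 658 figures reflect.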

(21)

The BioLexicon: bio-event frames

Extracted from a corpus of 677 MEDLINE abstracts manually annotated by biologists
  semantic parsers not mature enough to provide the starting point (as opposed to dependency parsers)

manual semantic annotation carried out on top of shallow syntactic annotation (chunking)
  consistency of marked text spans helped by annotating syntactic chunks

[NP The narL gene product ] [VP activates ] [NP the nitrate reductase operon ] [PP in ] [NP Escherichia coli ]
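The bracketed chunk notation in the example above is regular enough to read with a single pattern. The sketch below turns such a line into (label, tokens) pairs, which is the kind of unit an annotator would then mark with a semantic role. The function name is hypothetical, and the pattern only handles flat (non-nested) chunks as in the example.

```python
import re
from typing import List, Tuple

# One chunk: "[", a label, whitespace, the chunk text, optional space, "]".
CHUNK_PATTERN = re.compile(r"\[(\w+)\s+([^\]]*?)\s*\]")


def parse_chunks(line: str) -> List[Tuple[str, List[str]]]:
    """Return the chunks of a bracketed line as (label, tokens) pairs."""
    return [
        (match.group(1), match.group(2).split())
        for match in CHUNK_PATTERN.finditer(line)
    ]


if __name__ == "__main__":
    example = ("[NP The narL gene product ] [VP activates ] "
               "[NP the nitrate reductase operon ] [PP in ] [NP Escherichia coli ]")
    for label, tokens in parse_chunks(example):
        print(label, tokens)
```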

(22)

Bio-Event annotated corpus: incremental annotation approach

[Figure: annotation layers between TEXT and KNOWLEDGE. Recoverable labels: Raw domain corpora; Tokenised text; Morpho-syntactically tagged text; Syntactically “chunked” text; Domain independent annotation; Domain specific annotation; Linguistic Annotation; Bio-Event]

(23)

The BioLexicon: bio-event frames annotation

Annotation consisted of:
  Identifying relevant gene regulation events centred on verbs and nominalised verbs (e.g. expression)
  Finding all semantic arguments of an identified event
    Within-sentence annotation
  Assigning a semantic role to each argument
  Assigning named entity types to semantic arguments (where appropriate)

A hierarchy of NEs, specially tuned to gene regulation, was created
  Organised into five entity-specific super-classes

Thompson P., Cotter P., McNaught J., Ananiadou S., Montemagni S., Trabucco A., Venturi G. 2008. Building a Bio-Event Annotated Corpus for the Acquisition of Semantic Frames from Biomedical Corpora. In Proceedings of LREC

(24)

Bio-Event annotated corpus: semantic roles

Aim for a set of verb-specific event frames

Use of frame-independent semantic roles
  annotation of all sublanguage semantic arguments, using a set of domain-specific and domain-independent roles

The proposed set of 12 event-independent semantic roles includes:
  two domain-specific semantic roles, i.e. CONDITION and MANNER;
  semantic roles particularly important for the precise definition of complex biological relations, even though not necessarily specific to the field, i.e. LOCATION and TEMPORAL;

(25)

Bio-Event annotated corpus: list

AGENT
  Description: Drives/instigates event
  Example: The narL gene product activates the nitrate reductase operon

THEME
  Description: a) Affected by/results from event b) Subject of events describing states
  Examples: recA protein was induced by UV radiation; The FNR protein resembles CRP

MANNER
  Description: Method/way in which event is carried out
  Example: cpxA gene increases the levels of csgA transcription by dephosphorylation of CpxR

INSTRUMENT
  Description: Used to carry out event
  Example: We have isolated a strain with the aid of the Casadaban Mud phage

LOCATION
  Description: Where complete event takes place
  Example: Phosphorylation of OmpR modulates expression of the ompF and ompC genes in Escherichia coli

SOURCE
  Description: Start point of event
  Example: A transducing lambda phage was isolated from a strain harboring a glpD’’lacZ fusion

DESTINATION
  Description: End point of event
  Example: Transcription of gntT is activated by binding of the cyclic AMP (cAMP)-cAMP receptor protein (CRP) complex to a CRP binding site

(26)

Bio-Event annotated corpus: list

TEMPORAL
  Description: Situates event w.r.t. another event
  Example: The Alp protease activity is detected in cells after introduction of plasmids carrying the alpA gene

CONDITION
  Description: Environmental conditions/changes in conditions
  Example: Strains carrying a mutation in the crp structural gene fail to repress ODC and ADC activities in response to increased cAMP

RATE
  Description: Change of level or rate
  Example: marR mutations elevated inaA expression by 10- to 20-fold over that of the wild-type.

DESCRIPTIVE-AGENT
  Description: Descriptive information about AGENT
  Example: It is likely that HyfR acts as a formate-dependent regulator of the hyf operon

DESCRIPTIVE-THEME
  Description: Descriptive information about THEME
  Example: The FNR protein resembles CRP.

PURPOSE
  Description: Purpose/reason for the event occurring
  Example: The fusion strains were used to study the regulation of the cysB gene by assaying the fused lacZ gene product

(27)

Named Entity Superclasses

NE class: Definition

DNA: Entities chiefly composed of nucleic acids and their structural or positional references. This includes the physical structure of all DNA-based entities and the functional roles associated with regions thereof.

PROTEIN: Entities chiefly composed of amino acids and their positional references. This includes the physical structure and functional roles associated with each type.

EXPERIMENTAL: Both physical and methodological entities, either used, consumed or required for a reaction to take place.

ORGANISMS: Entities representing individuals or collections of living things and their component parts.

PROCESSES: A set of event classes used to label biological processes described in text.

(28)

Event Annotation example

A promoter has been identified that directs relA gene transcription towards the pyrG gene in a counterclockwise direction on the E. Coli chromosome

  directs: verb
  A promoter: AGENT (NE type DNA)
  relA gene transcription: THEME (NE type PROCESS)
  the pyrG gene: DESTINATION (NE type DNA)
  on the E. Coli chromosome: LOCATION (NE type DNA)
  a counterclockwise direction

(29)

The extraction process of Event Frame

Input sentence (surviving fragments): transfer operon; expresses; F-like plasmids
  annotated roles: Agent, Theme
  NE types: DNA, DNA

Extracted event frame:
  [Agent:NN:DNA] [Verb:VBZ:express] [Theme:NN:DNA]
  express(Agent=>DNA, Theme=>DNA)
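The two output notations on this slide, the bracketed frame and the compact verb(role=>type) form, can both be derived from one annotated event. The sketch below shows that derivation. The Argument class and both function names are hypothetical, introduced only for illustration; the extraction tooling used in the project is not described in the slides.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Argument:
    """One annotated semantic argument of an event."""
    role: str     # e.g. "Agent"
    pos: str      # part-of-speech tag of the head, e.g. "NN"
    ne_type: str  # named entity superclass, e.g. "DNA"


def bracketed_frame(verb_lemma: str, verb_pos: str,
                    before: List[Argument], after: List[Argument]) -> str:
    """Render '[Agent:NN:DNA] [Verb:VBZ:express] [Theme:NN:DNA]' in sentence order."""
    parts = [f"[{a.role}:{a.pos}:{a.ne_type}]" for a in before]
    parts.append(f"[Verb:{verb_pos}:{verb_lemma}]")
    parts.extend(f"[{a.role}:{a.pos}:{a.ne_type}]" for a in after)
    return " ".join(parts)


def frame_signature(verb_lemma: str, arguments: List[Argument]) -> str:
    """Render the compact form 'express(Agent=>DNA, Theme=>DNA)'."""
    inner = ", ".join(f"{a.role}=>{a.ne_type}" for a in arguments)
    return f"{verb_lemma}({inner})"


if __name__ == "__main__":
    agent = Argument("Agent", "NN", "DNA")
    theme = Argument("Theme", "NN", "DNA")
    print(bracketed_frame("express", "VBZ", [agent], [theme]))
    print(frame_signature("express", [agent, theme]))
```

Aggregating the compact signatures over all annotated occurrences of a verb would yield the list of its event frames, with or without NE types depending on whether the type is kept in the signature.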

(30)

The BioLexicon: acquired bio-event frames

verb      Bio-event frames
activate  Agent#Theme#
          Agent#Theme#Condition#
          Agent#Theme#Location#
          Agent#Theme#Manner#
          Agent#Theme#Source#
          Theme#
          Theme#Condition#

verb      Bio-event frames with NE types
activate  Agent-DNA#Theme-DNA#
          Agent-Organisms#Theme-Protein#
          Agent-Protein#Theme-DNA#

(31)

The BioLexicon: Syntax-Semantics Linking (1)

• The starting point:
  - acquired subcategorization frames
  - verbal bio-event frames based on corpus annotation
  - acquired using different techniques and corpora of different size

• Linking concerned 168 verbs for which both syntactic and semantic information was available

• Linking process carried out manually by a linguist

• Different information types were taken into account, i.e.
  - literature regarding hierarchies of semantic roles and grammatical functions: given a thematic role hierarchy (agent>theme ...) and a syntactic functions hierarchy (subject>object ...), the mapping usually proceeds from left to right
  - a list of ‘prototypic’ syntactic realisations of semantic arguments
  - exploitation of general language repositories of semantic frames containing both syntactic and semantic information (as possible benchmarks)
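The left-to-right heuristic mentioned above can be illustrated with a short sketch. It is not the linking procedure itself, which was carried out manually by a linguist; the two hierarchies contain only the members named on the slide, and the function name is hypothetical.

```python
from typing import List, Tuple

# Only the members named on the slide; the real hierarchies are longer.
ROLE_HIERARCHY = ["Agent", "Theme"]
FUNCTION_HIERARCHY = ["subject", "object"]


def link_left_to_right(roles: List[str], functions: List[str]) -> List[Tuple[str, str]]:
    """Pair the highest-ranked role with the highest-ranked function, and so on.

    Raises ValueError for a role or function missing from the hierarchies.
    Extra items on either side are left unpaired (zip stops at the shorter list).
    """
    ranked_roles = sorted(roles, key=ROLE_HIERARCHY.index)
    ranked_functions = sorted(functions, key=FUNCTION_HIERARCHY.index)
    return list(zip(ranked_roles, ranked_functions))


print(link_left_to_right(["Theme", "Agent"], ["object", "subject"]))
```

A heuristic like this only proposes a default pairing; the other information types listed on the slide are what allow it to be overridden.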

(32)

The BioLexicon: Syntax-Semantics Linking (2)

• Linking process resulted in 668 linked frames

• Different types of mapping were performed:
  - full mapping (239 frames): arity of the subcategorization and bio-event frames is the same (ISO)
  - partial mapping:
    1) semantic frame contains more slots than corresponding subcategorization frame (123 frames) (AUG)
       e.g. AGENT>ARG1#THEME>ARG2#LOCATION>PP-in#CONDITION>0
    2) subcategorized slots do not have a semantic counterpart in the corresponding bio-event frame (166 frames) (RED)
       e.g. 0>ARG1#THEME>ARG2#DESTINATION>PP-into
    3) a combination of cases 1) and 2) above (140 frames)

e.g.

(33)

The BioLexicon: syntax-semantics linking

Linked frames for the verb activate (semantic role > syntactic slot, followed by the mapping type):

Agent>ARG1 # Theme>ARG2 # Manner>PP-in: ISO
Theme>ARG1: ISO
Agent>ARG1 # Theme>ARG2: ISO
Theme>ARG2 # Condition>PP-in # 0>ARG1: RED
Agent>ARG1 # Theme>ARG2 # Source>0: AUG
Agent>ARG1 # Theme>ARG2 # Location>PP-in: ISO
Agent>ARG1 # Theme>ARG2 # Manner>PP-by: ISO
Agent>ARG1 # Theme>ARG2 # Condition>PP-in: ISO
Theme>ARG2 # 0>ARG1: RED

Useful information for mixed syntax-semantics approaches

G. Venturi, S. Montemagni, S. Marchi, Y. Sasaki, P. Thompson, J. McNaught, S. Ananiadou, 2009, “Bootstrapping a Verb Lexicon for Biomedical Information Extraction”, in Proceedings of the CICLing-2009 conference, Mexico.
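The linked-frame notation used in the linking slides (ROLE>SLOT pairs joined by "#", with "0" marking a missing counterpart) lends itself to a small classifier. The sketch below assumes that a "0" on the syntactic side signals AUG and a "0" on the semantic side signals RED; the label "AUG+RED" for the combined case is an assumption, since the slides give no tag for it, and the function names are hypothetical.

```python
from typing import List, Tuple


def parse_linked_frame(frame: str) -> List[Tuple[str, str]]:
    """Split 'ROLE>SLOT#ROLE>SLOT...' into (role, slot) pairs."""
    pairs = []
    for item in frame.strip("#").split("#"):
        role, _, slot = item.partition(">")
        pairs.append((role.strip(), slot.strip()))
    return pairs


def mapping_type(frame: str) -> str:
    """Classify a linked frame as ISO, AUG, RED or the combined case."""
    pairs = parse_linked_frame(frame)
    # Semantic slot without a syntactic realisation (e.g. CONDITION>0).
    augmented = any(slot == "0" for _, slot in pairs)
    # Syntactic slot without a semantic counterpart (e.g. 0>ARG1).
    reduced = any(role == "0" for role, _ in pairs)
    if augmented and reduced:
        return "AUG+RED"  # assumed label for case 3)
    if augmented:
        return "AUG"
    if reduced:
        return "RED"
    return "ISO"


print(mapping_type("AGENT>ARG1#THEME>ARG2#LOCATION>PP-in#CONDITION>0"))
print(mapping_type("0>ARG1#THEME>ARG2#DESTINATION>PP-into"))
```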

(34)

From bricks of biolexical knowledge to

(35)

The BioLexicon: representation model

• The BL model is conformant to ISO-LMF (ISO 24613:2008)
  - high-level objects: the meta-model, i.e. a set of independent lexical objects with relations among them
  - low-level objects: a set of Data Categories, i.e. linguistic constants in the form of attribute-value pairs (either drawn from the ISO-12620 or defined for the special domain)

• XML DTD for the entire lexicon

• The implementation consists of a flexible, extensible relational MySQL database

• Automatic population procedures relying on a dedicated input data structure, the BioLexicon XML Interchange Format (XIF)

(36)

The BioLexicon: representation model

The BioLexicon Model: High-level objects, lexical objects

(37)

The BioLexicon: representation model

BioLexicon Model: Data Categories

Syntax

Semantics

e.g.

<feat att="POS" val="VVZ">
<feat att="ConfScore" val="0.9">
<feat att="source" val="UNIPROT" ……

- Partially drawn from the ISO Data Category Registry
- Partially integrated with domain-specific DCs
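Attribute-value pairs of the kind shown in the `feat` example can be read with a standard XML parser. The snippet below is a sketch only: the wrapping `entry` element and the self-closing form of the tags are assumptions made so the fragment is well-formed, and they are not taken from the BioLexicon DTD or the XIF specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment built around the slide's three features.
SNIPPET = """
<entry>
  <feat att="POS" val="VVZ"/>
  <feat att="ConfScore" val="0.9"/>
  <feat att="source" val="UNIPROT"/>
</entry>
"""


def read_features(xml_text: str) -> dict:
    """Collect every <feat att=... val=...> element into a dictionary."""
    root = ET.fromstring(xml_text.strip())
    return {feat.get("att"): feat.get("val") for feat in root.iter("feat")}


features = read_features(SNIPPET)
print(features)
```

All values come back as strings, so a numeric feature such as the confidence score would still need an explicit conversion (for example `float(features["ConfScore"])`).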

(38)

The BioLexicon: the starting point

Semantic type | Resources
Cell | Cell ontology
Cell Component | Gene Ontology GO:0005575 cellular component
Chemical | CHEBI, IMR:0000947 chemical
Disease | OMIM
Enzyme | Enzyme commission
Gene | BioThesaurus
Ligand | IMR - INOH Protein name/family name ontology
Nuclear Receptor | GO:0004879 ligand-dependent nuclear receptor activity
NucleicAcid Region | Sequence Ontology :Region
Operon | RegulonDB, ODB (Operon DataBase)
Organism | NCBI Species
Transcription Factor-BindingSite | Sequence Ontology
Protein | BioThesaurus
Protein Complex | Corum database
Protein Domain | InterPro
Transcription Regulator | RegulonDB, TransFac, Gene Ontology Annotation

(39)

The BioLexicon by numbers

Entries and variants by semantic type

Sem. Type | # Entries | # Variants
Gene/Prot | 1640608 | 1408312
Gene/Prot (synsets) | 358335 | 936126
Organisms | 482992 | 182610
Enzymes | 4016 | 4164
Protein Domains | 16940 | 15412
Protein Compl. | 2104 | 418
Chemicals | 19637 | 77475
Diseases | 19457 | 11314
Molecular Roles | 8850 | 29831
Cell | 842 | 512
Trans. Factors | 160 | 129
Operons | 2672 | 368
Sequences | 1431 | 741

Entries by part of speech

POS | # entries
Nouns | 2231574
Adjectives | 3428
Verbs | 1154
Adverbs | 550

Verbs: 658 domain-specific, 496 general

- 15274 inflected forms
- 2764 related entries (e.g. absorb -> absorption/N, absorber/N, absorbing/J, absorbable/J, absorbent/J, absorbently/R)
- 3040 verb-SCF associations
- 1710 verb-SLOT associations
- 856 bio-event frames
- 668 syntax-semantics mappings (concerned with 168 verbs)

(40)

The BioLexicon: intrinsic evaluation

Comparison with two existing large-scale dictionaries
- WordNet: General English Thesaurus
- NLM Specialist Lexicon: Biomedical Lexicon

Coverage evaluation

[Figure: coverage chart, axis 0 to 100. Categories: Terminologies (verb, noun, adjective, adverb) and Derivational relations (nominalization, adjectival, adverbial). Legend: BioLexicon, WordNet, Specialist Lexicon; covered, not covered.]
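A coverage figure of this kind reduces to the share of a reference term list that a lexicon contains. The sketch below shows that arithmetic only; the function name, the case-insensitive matching and the toy word lists are all assumptions, and nothing here reproduces the data or the exact procedure behind the chart.

```python
from typing import Iterable, Set


def coverage_percent(reference_terms: Iterable[str], lexicon: Set[str]) -> float:
    """Percentage of reference terms found in the lexicon (case-insensitive)."""
    terms = [t.lower() for t in reference_terms]
    if not terms:
        return 0.0
    known = {entry.lower() for entry in lexicon}
    covered = sum(1 for t in terms if t in known)
    return 100.0 * covered / len(terms)


# Toy data, invented for illustration only.
reference = ["activate", "express", "phosphorylate", "bind"]
toy_lexicon = {"activate", "bind", "express"}

print(coverage_percent(reference, toy_lexicon))  # 3 of 4 terms covered
```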

(41)

BioLexicon in BioCreAtIve II, GN task

Biolexicon, human

Applied methods:

- Abner for gene identification

- Statistical features (BioLexicon) for filtering of non-relevant terms

- Classification on BioCreAtIve II data

- Only human concept ids

Baseline results

Highly reproducible

Available as Whatizit module

(42)

The BioLexicon: extrinsic evaluation

Task-based evaluation (still ongoing)

Task | Data
IE | UoM Gene Regulation Corpus
IR | TREC Genomics Track 2007

Tool: BLTagger, NeMine (NER); BLTagger, NeMine (NER), Enju with the BL

NeMine (http://text0.mib.man.ac.uk/~sasaki/bootstrep/nemine.html)

Y. Sasaki, P. Thompson, J. McNaught, S. Ananiadou, 2009, “Three BIONLP tools powered by the BioLexicon”, in Proceedings of EACL 2009.

(43)

BioLexicon distribution

• The BioLexicon (MySQL version) is distributed through the European Language Resources Association (ELRA)

• http://www.elra.info or http://www.elda.org

• Benefits
  - Servicing of bug reporting through ELRA
  - Organisational embedding into other lexical resources
  - Long-term availability
  - Support to European language infrastructures

• Different licence types for
  - Commercial use
  - Research use by commercial organisations
  - Research use by academic organisations

(44)

Conclusions

• BL: a unique resource amongst large-scale computational lexicons within the biomedical domain in terms of coverage and typology of contained information

• Populated with info from available biomedical resources and texts

• Semi-automatically populated from corpora: population toolkit available

• Including both domain-specific and general language words

• Rich linguistic information ranging over different levels of linguistic description

• Conformant to international lexical representation standards

• Designed to meet bio-Text Mining requirements

TRY IT!

(45)

THANK YOU

