The
The
BioLexicon
BioLexicon
:
:
a Large
a Large
-
-
Scale
Scale
Domain
Domain
-
-
Specific
Specific
Lexical Resource
Lexical Resource
for Biomedical Text Mining
for Biomedical Text Mining
Simonetta Montemagni
Istituto di Linguistica Computazionale “Antonio Zampolli” - CNR
LREC 2010
2nd Workshop on Building and evaluating resources for biomedical text mining
Outline
Outline
The
The
BioLexicon
BioLexicon
Who
Who
Why
Why
How
How
What
What
Where
Where
from
from
Focus on Verbs
Focus on Verbs
Representation
Representation
How
How
many
many
entries
entries
Evaluation
Evaluation
Distribution
Distribution
Conclusions
Conclusions
The
The
BioLexicon
BioLexicon
:
:
who
who
Joint and collaborative work of the following teamsJoint and collaborative work of the following teams
in the framework of the European
in the framework of the European BOOTStrepBOOTStrep
project
project (FP6 (FP6 -- 028099)028099)
D. RebholzD. Rebholz--SchuhmannSchuhmann, P. Pezik, P. Pezik, V. Lee, J.J. , V. Lee, J.J. Kim
Kim
European Bioinformatics Institute,
European Bioinformatics Institute, WellcomeWellcome Trust Trust Genome Campus, Cambridge, CB10 1SD, UK
Genome Campus, Cambridge, CB10 1SD, UK
N. N. CalzolariCalzolari, M. , M. MonachiniMonachini,, S. S. MontemagniMontemagni, R. , R. del
del GrattaGratta,, S. S. Marchi, V. Marchi, V. QuochiQuochi, G. , G. VenturiVenturi
Istituo
Istituo didi LInguisticaLInguistica ComputazionaleComputazionale ““Antonio Antonio Zampolli
Zampolli”” -- CNR, Via Giuseppe CNR, Via Giuseppe MoruzziMoruzzi NN°° 1, 56124 1, 56124 Pisa, Italy
Pisa, Italy
S. AnaniadouS. Ananiadou, J. , J. McNaught,McNaught, Y. Sasaki, P. Y. Sasaki, P. Thompson
Thompson
School of Computer Science, The University of
School of Computer Science, The University of
Manchester, 131 Princess Street, M1 7DN, UK
The
The
BioLexicon
BioLexicon
:
:
why
why
Text Mining needs information about words
Text Mining needs information about words
the lexical component still remains a major bottleneck the lexical component still remains a major bottleneck
TM systems in the biomedical domain must be provided with
TM systems in the biomedical domain must be provided with
a substantial lexicon covering a realistic vocabulary and
a substantial lexicon covering a realistic vocabulary and
providing the kinds of linguistic information appropriate to
providing the kinds of linguistic information appropriate to
grasp the knowledge embedded in texts
grasp the knowledge embedded in texts
Biomedical
Biomedical
term
term
variants
variants
(
(
orthographic
orthographic
,
,
semantic
semantic
,
,
geographical
geographical
,
,
…
…
)
)
betterbetter informationinformation retrievalretrieval
Terminological
Terminological
verbs
verbs
and
and
their
their
combinatorial
combinatorial
properties
properties
(
(subcategorizationsubcategorization framesframes and and predicatepredicate--argumentargument structurestructure))
betterbetter informationinformation extractionextraction and and questionquestion answeringanswering
Word
Word
derivations
derivations
toto reachreach similarsimilar meaningmeaning expressedexpressed in in differentdifferent waysways (e.g. (e.g. activation
The
The
BioLexicon
BioLexicon
as
as
a
a
key
key
ingredient
ingredient
to
to
bridge the gap
bridge the gap
between
between
text
text
and
and
knowledge
knowledge
To perform such a key role
To perform such a key role
the BioLexicon MUST
the BioLexicon MUST
reflect
reflect
the
the
actual
actual
usage
usage
of
of
words
words
in
in
biomedical
biomedical
texts
texts
be
be
continuously
continuously
updated
updated
with
with
new
new
word
word
synonyms
synonyms
emerging from texts
emerging from texts
include
include
rich
rich
linguistic
linguistic
information
information
on the
on the
behavioural
behavioural
properties
properties
of
of
nouns
nouns
and
and
verbs
verbs
Biomedical literature BioLexicon BioLexicon Biomedical knowledge
The
The
BioLexicon
BioLexicon
:
:
how
how
General Requirements
General Requirements
Modularity, extensibility, conformity to standards, reusability
Modularity, extensibility, conformity to standards, reusability
Biomedical Domain Specific Requirements
Biomedical Domain Specific Requirements
Gene names, protein names, bio
Gene names, protein names, bio--events and participants, events and participants, ……
Linguistic/Terminological Requirements
Linguistic/Terminological Requirements
term variants, source identifiers, acronyms, syntactic and seman
term variants, source identifiers, acronyms, syntactic and semantic tic properties of terms,
properties of terms, ……
Text Mining / Machine Learning Requirements
Text Mining / Machine Learning Requirements
Confidence scores for automatically extracted info (e.g. variant
Confidence scores for automatically extracted info (e.g. variants, s, subclusterizations
The
The
BioLexicon
BioLexicon
:
:
what
what
integrated
integrated
lexical
lexical
-
-
terminological
terminological
resource
resource
of
of
~
~
2.2M
2.2M
lexical
lexical
entries
entries
for
for
bio
bio
-
-
text
text
mining
mining
with
with
information
information
about
about
nounsnouns, , verbsverbs, , adjectivesadjectives, , adverbsadverbs
bothboth domaindomain--specificspecific and and generalgeneral languagelanguage wordswords
populated with terms gathered from
populated with terms gathered from
available biomedical sources available biomedical sources
texts (biomedical literature)texts (biomedical literature)
including rich linguistic information ranging over different
including rich linguistic information ranging over different
linguistic descriptions levels
linguistic descriptions levels
e.g. derivational morphology, e.g. derivational morphology, subcategorizationsubcategorization patterns, predicate patterns, predicate argument structure, syntax
argument structure, syntax--semantics linkingsemantics linking
combining features of both terminologies and open
combining features of both terminologies and open
-
-
domain
domain
computational lexicons
computational lexicons
conforming
conforming
to
to
international
international
lexical
lexical
representation
representation
standards
standards
(
(
the ISO/DIS 24613
the ISO/DIS 24613
“
“
Lexical Mark
Lexical Mark
-
-
up Framework
up Framework
”
”
)
)
The
The
BioLexicon
BioLexicon
:
:
where
where
from
from
Existing repositories
MEDLINE
BioLexiconBioLexicon chemical compounds, species names, disease, enzymes
Subclustering of term variants genes/proteins
Incremental
Incremental
population
population
process
process
Named Entity Recognition
Term Mapping by Normalisation new genes/proteins names
Manual curation Verbs, nouns, adjs, advs (variants, inflected forms, derivative relations, ...) Linguistic pre-processing Subcat extraction
Manual annotation of a
bio-event corpus Bio-event extraction
Syn-sem linking BL Population ToolKit
The
The
BioLexicon
BioLexicon
:
:
focus
focus
on
on
verbs
verbs
Accurate TM applications focused on event
Accurate TM applications focused on event
extraction require lexical resources providing an
extraction require lexical resources providing an
exhaustive account of the
exhaustive account of the
semantic and
semantic and
syntactic combinatorial properties of lexical
syntactic combinatorial properties of lexical
units
units
conveying
conveying
event information
event information
Several exist for the general language domain, e.g.
Several exist for the general language domain, e.g.
FrameNet
FrameNet
,
,
VerbNet
VerbNet
,
,
PropBank
PropBank
Specialist domains such as
Specialist domains such as
biology
biology
require
require
domain
domain
-
-specific
specific
resources
resources
use of
use of
different predicates
different predicates
to describe events
to describe events
E.g.
E.g.
methylate
methylate
,
,
phosphorylate
phosphorylate
general language predicates may have
general language predicates may have
different
different
properties
properties
E.g.
E.g.
the
the
patient
patient
presented
presented
with
with
influenza
influenza
to
to
the
the
doctor
doctor
vs
vs
the
the
patient
patient
presented
presented
the
the
doctor
doctor
with
with
influenza
Current
Current
biomedical
biomedical
lexical
lexical
resources
resources
A number of attempts have been made to produce
A number of attempts have been made to produce
domain
domain
-
-specific extensions
specific extensions
of
of
general
general
-
-
purpose lexical semantic
purpose lexical semantic
resources providing information on predicate
resources providing information on predicate
-
-
argument
argument
structure
structure
BioFrameNet and PASBio BioFrameNet and PASBio
corpuscorpus--basedbased
smallsmall--scalescale
SPECIALIST lexicon SPECIALIST lexicon
extension of a large lexicon of general Englishextension of a large lexicon of general English
notnot corpuscorpus--drivendriven
syntactic complementation patterns onlysyntactic complementation patterns only
Creation of resources focussed on predicate
Creation of resources focussed on predicate
-
-
argument
argument
structure can be
structure can be
a major bottleneck
a major bottleneck
Mostly manually created by lexicographersMostly manually created by lexicographers
Limited coverageLimited coverage
TimeTime--consuming to port to new domainsconsuming to port to new domains
Automatic or semi
Automatic or semi
-
-
automatic acquisition methods
automatic acquisition methods
more
more
promising and increasingly viable
promising and increasingly viable
Advances in NLP and machine learning technologyAdvances in NLP and machine learning technology
Current
Current
biomedical
biomedical
lexical
lexical
resources
resources
: the
: the
need
need
To
To
our
our
knowlege
knowlege
,
,
there
there
is
is
currently
currently
no
no
l
l
arge
arge
-
-scale domain
scale domain
-
-
specific lexical resource
specific lexical resource
providing predicate
providing predicate
-
-
argument information
argument information
based on domain
based on domain
-
-
specific corpora
specific corpora
containing both syntactic and semantic
containing both syntactic and semantic
information
information
The
The
The
BioLexicon
BioLexicon
:
:
our
our
approach
approach
towards
towards
acquisition
acquisition
of
of
verb
verb
information
information
bootstrapped
bootstrapped
from biomedical corpora
from biomedical corpora
the most relevant verbs are included in the lexiconthe most relevant verbs are included in the lexicon
their encoded their encoded behaviourbehaviour is domainis domain--specificspecific
contains
contains
both
both
syntactic
syntactic
and
and
semantic
semantic
information
information
syntactic syntactic subcategorizationsubcategorization (e.g. (e.g. actact ARG1#PPARG1#PP--asas#)#)
semantic event frame information semantic event frame information (e.g. (e.g. bindbind AGENT#THEME#LOC#)AGENT#THEME#LOC#)
explicit linkexplicit link between the twobetween the two (e.g. (e.g. expressexpress
AGENT>ARG1#THEME>ARG2#LOC>PP
AGENT>ARG1#THEME>ARG2#LOC>PP--inin#)#)
built
built
semi
semi
-
-
automatically
automatically
by combining NLP and
by combining NLP and
Machine Learning techniques
Machine Learning techniques
syntactic frames are syntactic frames are extracted through unsupervised learning on extracted through unsupervised learning on
dependency
dependency--annotated text annotated text
semantic frames are semantic frames are based on manual annotation of gene regulation based on manual annotation of gene regulation
bio
bio--events by domain expertsevents by domain experts
Structured Knowledge
From
From
Text
Text
to
to
Knowledge
Knowledge
:
:
NLP and
NLP and
Knowledge
Knowledge
Extraction
Extraction
Lexicons and ontologies Knowledge Extraction Tools Text Annotation Tools
The
The
BioLexicon
BioLexicon
:
:
our
our
approach
approach
to
to
subcat
subcat
extraction
extraction
acquired through unsupervised automatic acquisition
acquired through unsupervised automatic acquisition
techniques from linguistically pre
techniques from linguistically pre
-
-
processed domain corpora
processed domain corpora
the starting point: shallow or deep syntactic annotation? the starting point: shallow or deep syntactic annotation?
particular requirements for
particular requirements for
Subcategorization
Subcategorization
Frames (
Frames (
SCFs
SCFs
)
)
in biomedical language
in biomedical language
SCFsSCFs should also include strongly selected modifiers (such as locatishould also include strongly selected modifiers (such as location, on, manner and timing), as these are deemed to be essential for the
manner and timing), as these are deemed to be essential for the
correct interpretation of texts
correct interpretation of texts
average number of arguments in average number of arguments in SCFsSCFs higher than general languagehigher than general language
“
“
discovery
discovery
”
”
approach to SCF acquisition based on a looser
approach to SCF acquisition based on a looser
notion of
notion of
SCFs
SCFs
, which includes typical verb modifiers in
, which includes typical verb modifiers in
addition to strongly selected arguments
addition to strongly selected arguments
no a priori knowledge about the set of possible no a priori knowledge about the set of possible SCFsSCFs
no distinction between argument/modifierno distinction between argument/modifier
to meet this basic requirement, SCF induction
to meet this basic requirement, SCF induction
operates
SCFs
SCFs
extracted from a corpus of MEDLINE abstracts and
extracted from a corpus of MEDLINE abstracts and
full papers made up of 6 million word tokens
full papers made up of 6 million word tokens
The induction process was performed through:
The induction process was performed through:
syntactic annotationsyntactic annotation of the acquisition corpus with of the acquisition corpus with EnjuEnju syntactic syntactic parser (v2.2,
parser (v2.2, http://wwwhttp://www--tsujii.is.s.utsujii.is.s.u--tokyo.ac.jp/enjutokyo.ac.jp/enju//) (adapted to ) (adapted to biomedical texts)
biomedical texts)
extraction of the observed dependency setsextraction of the observed dependency sets ((ODSsODSs) for each verbal ) for each verbal occurrence
occurrence
each ODS represented as a set of dependencies described in termseach ODS represented as a set of dependencies described in terms of of
relation type (e.g. ARG1, ARG2, PP
relation type (e.g. ARG1, ARG2, PP--in, PPin, PP--across, thatacross, that--CL, etc.) CL, etc.)
order of dependencies in each ODS is normalisedorder of dependencies in each ODS is normalised
induction of relevant SCF associated with a given verbinduction of relevant SCF associated with a given verb
for each ODS type, the conditional probability given the verb tyfor each ODS type, the conditional probability given the verb type pe vv
was computed
was computed
weighted thresholds used to filter out noisy framesweighted thresholds used to filter out noisy frames
an ODS type with an associated probability score beyond a certain an ODS type with an associated probability score beyond a certain
threshold is selected as eligible SCF for that verb type threshold is selected as eligible SCF for that verb type
each SCF each SCF has been extracted for one has been extracted for one normalised verb tokennormalised verb token, i.e. the , i.e. the
extraction process makes abstraction from the passive usages
extraction process makes abstraction from the passive usages
The
The
BioLexicon
BioLexicon
: the
: the
subcat
subcat
extraction
Overproduction of TorR in a torT strain resulted in partial constitutive expression of the torA'-'lacZ fusion, suggesting that TorR acts downstream from TorT.
20 VBZ act acts verb_mod_arg12 17 VBG suggest suggesting -1 UNKNOWN UNKNOWN UNKNOWN verb_mod_arg12 17 VBG suggest suggesting 7 VBD result resulted verb_mod_arg12 17 VBG suggest suggesting 23 NNP tort TorT prep_arg12 22 IN from from 20 VBZ act acts prep_arg12 22 IN from from 11 NN expression expression prep_arg12 8 IN in in 7 VBD result resulted prep_arg12 8 IN in in 20 VBZ act acts adj_arg1 21 RB downstream downstream 20 VBZ act acts comp_arg1 18 IN that that 15 NN fusion fusion prep_arg12 12 IN of of 11 NN expression expression prep_arg12 12 IN of of 7 VBD result resulted punct_arg1 16 , -comma-, 6 NN strain strain prep_arg12 3 IN in in 0 NN overproduction Overproduction prep_arg12 3 IN in in 2 NN torr TorR prep_arg12 1 IN of of 0 NN overproduction Overproduction prep_arg12 1 IN of of 19 NN torr TorR verb_arg1 20 VBZ act acts 11 NN expression expression adj_arg1 10 JJ constitutive constitutive 15 NN fusion fusion noun_arg1 14 NN tora'-'lacz torA'-'lacZ 11 NN expression expression adj_arg1 9 JJ partial partial 6 NN strain strain noun_arg1 5 NN tort torT 15 NN fusion fusion det_arg1 13 DT the the 6 NN strain strain det_arg1 4 DT a a 0 NN overproduction Overproduction verb_arg1 7 VBD result resulted 7 VBD result resulted ROOT -1 ROOT ROOT ROOT
act ARG1=torr@NN 0 0 0 0 PP-from=tort@NNP 0 0 0 0 0 0 0 0 0 0 VBZ 0
result ARG1=overproduction@NN 0 0 0 0 PP-in=expression@NN 0 0 0 0 MOD=suggest@VBG 0 0 0 0 0 suggest ARG1=UNKNOWN@UNKNOWN 0 0 0 0 0 0 0 0 0 0 0 0 that-CL=act@VBZ 0 0 VBG 0
136 different induced 136 different induced SCFsSCFs
vsvs SPECIALIST Lexicon: very limited number of complementation patternsSPECIALIST Lexicon: very limited number of complementation patterns
The
The
BioLexicon
BioLexicon
:
:
subcat
subcat
extraction
extraction
results
results
0 0 0.1347457 0.1347457 ARG1#PPARG1#PP--in#in# accumulate accumulate 0.140625 0.140625 0.1084745 0.1084745 ARG1#ARG2#PP
ARG1#ARG2#PP--in#in# accumulate accumulate 0 0 0.4627118 0.4627118 ARG1# ARG1# accumulate accumulate 0.0403458 0.0403458 0.2940677 0.2940677 ARG1#ARG2# ARG1#ARG2# accumulate accumulate 0.7029702 0.7029702 0.0939534 0.0939534 ARG1#ARG2#PP
ARG1#ARG2#PP--in#in# abolish abolish 0.1904761 0.1904761 0.0390697 0.0390697 ARG1#ARG2#MOD@VBG# ARG1#ARG2#MOD@VBG# abolish abolish 0.1437768 0.1437768 0.8669767 0.8669767 ARG1#ARG2# ARG1#ARG2# abolish abolish Pass Pass P( P(subcat|vsubcat|v)) SCF SCF Verb Verb
many of the strongly selected modifiers spread over different
many of the strongly selected modifiers spread over different
SCFs
SCFs
radically underestimated roleradically underestimated role
SCFs
SCFs
complemented with information about individual
complemented with information about individual
dependencies of verbs
dependencies of verbs
typical verbal dependencies, corresponding to either arguments otypical verbal dependencies, corresponding to either arguments or r strongly selected modifiers, detected through the
strongly selected modifiers, detected through the llll association scoreassociation score
44 induced dependency types44 induced dependency types
The
The
BioLexicon
BioLexicon
:
:
acquired
acquired
SCFs
SCFs
and
and
strongly selected modifiers
strongly selected modifiers
0.2143 0.0332 14 422 ARG1#ARG2#PP-at# methylate 0.5588 0.0805 34 422 ARG1#ARG2#PP-in# methylate 0.1258 0.6967 294 422 ARG1#ARG2# methylate % passive usages p(SCF|v) SCF_freq v_freq SCF v 0.60 18.2749 0.0320 45 1406 PP-in# methylate 0.31 57.9113 0.0206 29 1406 PP-at# methylate 0.25 778.5146 0.2916 410 1406 ARG2# methylate % passive usages ll p(dep|v) dep_ freq all_ dep DEP v
The
The
BioLexicon
BioLexicon
:
:
an
an
example of
example of
stored
stored
subcat
subcat
information for the
information for the
verb
verb
acquire
acquire
Full parsing Preposition-based parsing 0.3750 0.0295 ARG1#ARG2#PP-during# acquire 0.0000 0.0406 ARG1#ARG2#PP-by# acquire 0.1818 0.0406 ARG1#ARG2#PP-from# acquire 0.0833 0.0886 ARG1#ARG2#PP-in# acquire 0.1284 0.5461 ARG1#ARG2# acquire % pass p(SCF|v) SCF v 0.1666667 13.416025 PP-in# acquire 0 13.626654 PP-by# acquire 0.3333333 22.716082 PP-from# acquire 0.1 25.703417 WH-when# acquire 0.1512915 579.96392 ARG2# acquire % pass ll DEP v
The
The
BioLexicon
BioLexicon
:
:
contained
contained
verb
verb
subcategorization
subcategorization
information
information
complementarity
complementarity
between Verb
between Verb
-
-
Dep
Dep
and Verb
and Verb
-
-
SCF
SCF
associations
associations
both information types included into the
both information types included into the
BioLexicon
BioLexicon
subcat
subcat
frame information acquired for 759 different verbs,
frame information acquired for 759 different verbs,
corresponding to 658 different base forms
corresponding to 658 different base forms
e.g. the occurrences of
e.g. the occurrences of
colocalize
colocalize
,
,
colocalise
colocalise
,
,
co
co
-
-
localize
localize
and
and
co
co
-
-
localise
localise
recorded under
recorded under
colocalize
colocalize
the
the
BioLexicon
BioLexicon
was augmented with:
was augmented with:
1410
1410
Verb
Verb
-
-
SCF
SCF
associations, involving
associations, involving
136
136
different
different
subcat
subcat
frame types
frame types
The
The
BioLexicon
BioLexicon
:
:
bio
bio
-
-
event
event
frames
frames
Extracted
Extracted
from
from
a corpus of 677 MEDLINE
a corpus of 677 MEDLINE
abstracts
abstracts
manually
manually
annotated
annotated
by
by
biologists
biologists
semantic
semantic
parsers
parsers
not
not
mature
mature
enough
enough
to
to
provide
provide
the
the
starting
starting
point
point
(
(
as
as
opposed
opposed
to
to
dependency
dependency
parsers
parsers
)
)
manual
manual
semantic
semantic
annotation
annotation
carried
carried
out on top of
out on top of
shallow
shallow
syntactic
syntactic
annotation
annotation
(
(
“
“
chunking
chunking
”
”
)
)
consistency of marked text spans helped by annotating syntactic consistency of marked text spans helped by annotating syntactic
chunks
chunks
[NP The
[NP The
narL
narL
gene product ]
gene product ]
[VP activates ]
[VP activates ]
[NP the nitrate
[NP the nitrate
reductase
reductase
operon
operon
]
]
[PP in ]
[PP in ]
[NP Escherichia coli ]
Bio
Bio
-
-
Event
Event
annotated
annotated
corpus:
corpus:
incremental
incremental
annotation
annotation
approach
approach
TEXT
KNOWLEDGE
Raw domain corpora Tokenised text
Morpho-syntactically tagged text Syntactically “chunked” text Domain specific annotation
Domain independent annotation Bio-Event Linguistic Annotation
The
The
BioLexicon
BioLexicon
:
:
bio
bio
-
-
event
event
frames
frames
annotation
annotation
Annotation consisted of:
Annotation consisted of:
Identifying relevant
Identifying relevant
gene regulation
gene regulation
events centred on
events centred on
verbs
verbs
and
and
nominalised
nominalised
verbs (e.g.
verbs (e.g.
expression
expression
)
)
Finding all
Finding all
semantic arguments
semantic arguments
of an identified event
of an identified event
Within
Within
-
-
sentence annotation
sentence annotation
Assigning a
Assigning a
semantic role
semantic role
to each argument
to each argument
Assigning
Assigning
named entity types
named entity types
to semantic arguments
to semantic arguments
(where appropriate)
(where appropriate)
A hierarchy of
A hierarchy of
NEs
NEs
, specially tuned to
, specially tuned to
gene regulation,
gene regulation,
was created
was created
Organised into five entity
Organised into five entity
-
-
specific super
specific super
-
-
classes
classes
Thompson P., Cotter P., McNaught J., Ananiadou S., Montemagni S., Trabucco A., Venturi G. 2008. Building a Bio-Event Annotated Corpus for the Acquisition of Semantic Frames from Biomedical Corpora. In Proceedings of LREC
Bio
Bio
-
-
Event
Event
annotated
annotated
corpus:
corpus:
semantic roles
semantic roles
Aim for a set of
verb-specific event frames
Use of frame-independent semantic roles
annotation of
all sublanguage semantic arguments
,
using a set of
domain-specific
and
domain-independent roles
The proposed set of 12 event-independent semantic
roles includes:
two domain-specific semantic roles, i.e. CONDITION and
MANNER;
semantic roles particularly important for the precise
definition of complex biological relations, even though not
necessarily specific to the field, i.e. LOCATION and
TEMPORAL;
Transcription of
Transcription of gntTgntTis activated by is activated by bindingbinding of the of the cyclic AMP (
cyclic AMP (cAMP)cAMP)--cAMPcAMP receptor protein (CRP) receptor protein (CRP) complex to
complex to a CRP binding sitea CRP binding site End point of event
End point of event DESTINATION
DESTINATION
A
A transducingtransducing lambda phage was lambda phage was isolatedisolatedfrom from a
a strainstrainharboring a glpDharboring a glpD’’’’lacZlacZ fusion fusion Start point of event
Start point of event SOURCE
SOURCE
Phosphorylation
Phosphorylation of OmpRof OmpR modulatesmodulatesexpression of expression of the
the ompFompF and ompCand ompC genes genes inin Escherichia coliEscherichia coli
Where
Where completecomplete event takes place event takes place LOCATION
LOCATION
We have
We have isolatedisolateda strain with the aid of a strain with the aid of the
the CasadabanCasadabanMud phageMud phage Used to carry out
Used to carry out event
event INSTRUMENT
INSTRUMENT
cpxA
cpxA gene gene increasesincreasesthe levels of csgAthe levels of csgA transcription transcription by by dephosphorylationdephosphorylationof CpxRof CpxR Method/way in Method/way in which event is which event is carried out carried out MANNER MANNER recA
recA proteinproteinwas was inducedinducedby UV radiationby UV radiation The FNR protein
The FNR protein resembles resembles CRPCRP a) Affected
a) Affected
by/results from event by/results from event b) Subject of events b) Subject of events describing states describing states THEME THEME The
The narLnarLgene productgene productactivatesactivatesthe nitrate the nitrate reductase
reductase operonoperon Drives/instigates Drives/instigates event event AGENT AGENT
Bio
Bio
-
-
Event
Event
annotated
annotated
corpus:
corpus:
list
The fusion strains were
The fusion strains were usedused totostudystudythe regulation the regulation of the
of the cysBcysBgene by assaying the fused lacZgene by assaying the fused lacZgene gene product
product Purpose/reason for
Purpose/reason for the event occurring the event occurring PURPOSE
PURPOSE
The FNR protein
The FNR protein resemblesresemblesCRPCRP.. Descriptive Descriptive information about information about THEME THEME DESCRIPTIVE DESCRIPTIVE- -THEME THEME It is likely that
It is likely that HyfRHyfR actsactsas as a a formateformate--dependent dependent regulator
regulator of the of the hyfhyfoperonoperon Descriptive Descriptive information about information about AGENT AGENT DESCRIPTIVE DESCRIPTIVE- -AGENT AGENT marR
marR mutations mutations elevatedelevatedinaAinaAexpression by expression by 1010-- toto 20
20--foldfoldover that of the wild-over that of the wild-type.type. Change of level or Change of level or rate rate RATE RATE S
Strains carrying a mutation in the crptrains carrying a mutation in the crp structural gene structural gene fail to
fail to repressrepressODC and ADC activities in response to ODC and ADC activities in response to increased
increased cAMPcAMP Environmental Environmental conditions/changes conditions/changes in conditions in conditions CONDITION CONDITION
The Alp protease activity is
The Alp protease activity is detecteddetected in cells in cells afterafter introduction
introduction of plasmids carrying the alpAof plasmids carrying the alpA genegene Situates event w.r.t Situates event w.r.t another event another event TEMPORAL TEMPORAL
Bio
Bio
-
-
Event
Event
annotated
annotated
corpus:
corpus:
list
Named Entity
Named Entity
Superclasses
Superclasses
A set of
A set of eventevent classes used to label biological classes used to label biological processes described in text.
processes described in text.
PROCESSES
PROCESSES
Entities representing individuals or collections of
Entities representing individuals or collections of
living things and their component parts.
living things and their component parts.
ORGANISMS
ORGANISMS
Both physical and methodological entities, either
Both physical and methodological entities, either
used, consumed or required for a reaction to take
used, consumed or required for a reaction to take
place.
place.
EXPERIMENTAL
EXPERIMENTAL
Entities chiefly composed of amino acids and their
Entities chiefly composed of amino acids and their
positional references. This includes the physical
positional references. This includes the physical
structure and functional roles associated with each
structure and functional roles associated with each
type.
type.
PROTEIN
PROTEIN
Entities chiefly composed of nucleic acids and their
Entities chiefly composed of nucleic acids and their
structural or positional references. This includes
structural or positional references. This includes
the physical structure of all DNA
the physical structure of all DNA--based entities based entities and the functional roles associated with regions
and the functional roles associated with regions
thereof. thereof. DNA DNA Definition Definition NE class NE class
DNA
A promoter
AGENT
Event Annotation example
Event Annotation example
has been identified that
has been identified that
towards
towards
in
in
directs
verb PROCESSrelA gene transcription
THEME
DNA
the pyrG gene
DESTINATION
DNA
on the E. Coli chromosome
LOCATION
a counterclockwise direction
The extraction process of
The extraction process of
Event Frame
Event Frame
Extracted event frame:
[
Agent:NN:DNA
] [
Verb:VBZ:express
] [
Theme:NN:DNA
]
transfer operon
expresses
F-like plasmids
Agent Theme
DNA DNA
express
(
Agent=>DNA
,
Theme=>DNA
)
Input sentence:
The
The
BioLexicon
BioLexicon
:
:
acquired
acquired
bio
bio
-
-
event
event
frames
frames
Theme#Condition Theme#Condition## Theme Theme## Agent#Theme#Source Agent#Theme#Source## Agent#Theme#Manner Agent#Theme#Manner## Agent#Theme#Location Agent#Theme#Location## Agent#Theme#Condition Agent#Theme#Condition## Agent#Theme Agent#Theme## activate activate BioBio--event framesevent frames verb
verb
Agent
Agent--Protein#ThemeProtein#Theme--DNADNA# # Agent
Agent--Organisms#ThemeOrganisms#Theme--ProteinProtein# # Agent
Agent--DNA#ThemeDNA#Theme--DNADNA# # activate
activate
Bio
Bio--event frames with NE typesevent frames with NE types verb
The BioLexicon:
The BioLexicon:
Syntax
Syntax
-
-
Semantics
Semantics
Linking (1)
Linking (1)
The
The
starting
starting
point
point
:
:
acquiredacquired subcategorizationsubcategorization framesframes
verbalverbal biobio--eventevent framesframes basedbased on corpus on corpus annotationannotation
acquired acquired using different techniques and corpora of different sizeusing different techniques and corpora of different size
Linking
Linking
concerned
concerned
168 verbs
168 verbs
for which both syntactic
for which both syntactic
and semantic information was available
and semantic information was available
Linking
Linking
process
process
carried
carried
out
out
manually
manually
by
by
a
a
linguist
linguist
Different information types were taken into account, i.e.Different information types were taken into account, i.e.
literature regarding hierarchies of semantic roles and grammaticliterature regarding hierarchies of semantic roles and grammatical al
functions
functions
given a thematic role hierarchy (agent>theme ...) and a syntactic given a thematic role hierarchy (agent>theme ...) and a syntactic
functions hierarchy (subject>object ...), the mapping usually pr
functions hierarchy (subject>object ...), the mapping usually proceeds oceeds from left to right
from left to right
a list of a list of ‘‘prototypicprototypic’’ syntactic realisations of semantic argumentssyntactic realisations of semantic arguments
exploitationexploitation ofof general language repositories of semantic frames general language repositories of semantic frames
containing both syntactic and semantic information (as possible
containing both syntactic and semantic information (as possible
benchmarks)
The BioLexicon:
The BioLexicon:
Syntax
Syntax
-
-
Semantics
Semantics
Linking (2)
Linking (2)
Linking process resulted in
Linking process resulted in
668 linked frames
668 linked frames
Different types of mapping were performed:
Different types of mapping were performed:
full
full
mapping (239 frames)
mapping (239 frames)
arityarity of the of the subcategorizationsubcategorization and bioand bio--event frames is the event frames is the
same
same (ISO)(ISO)
partial
partial
mapping
mapping
1)
1) semantic frame contains more slots than corresponding semantic frame contains more slots than corresponding
subcategorization
subcategorization frameframe ((123 123 framesframes) (AUG)) (AUG)
e.g.
e.g. AGENT>ARG1#THEME>ARG2#LOCATION>PPAGENT>ARG1#THEME>ARG2#LOCATION>PP--in#in#CONDITIONCONDITION>0>0
2)
2) subcategorized slots do not have a semantic counterpart in the subcategorized slots do not have a semantic counterpart in the
corresponding bio
corresponding bio--event frameevent frame ((166 166 framesframes) (RED)) (RED)
e.g.
e.g. 0>ARG10>ARG1#THEME>ARG2#DESTINATION>PP#THEME>ARG2#DESTINATION>PP--intointo
3)
3) a combination of cases 1) and 2) abovea combination of cases 1) and 2) above ((140 140 framesframes))
e.g.
The
The
BioLexicon
BioLexicon
:
:
syntax
syntax
-
-
semantics
semantics
linking
linking
ISO ISO PP PP--inin Manner Manner ARG2 ARG2 Theme Theme ARG1 ARG1 Agent Agent ISO ISO ARG1 ARG1 Theme Theme ISO ISO ARG2 ARG2 Theme Theme ARG1 ARG1 Agent Agent RED RED ARG1 ARG1 0 0 PP PP--inin Condition Condition ARG2 ARG2 Theme Theme AUG AUG 0 0 Source Source ARG2 ARG2 Theme Theme ARG1 ARG1 Agent Agent ISO ISO PP PP--inin Location Location ARG2 ARG2 Theme Theme ARG1 ARG1 Agent Agent ISO ISO PP PP--byby Manner Manner ARG2 ARG2 Theme Theme ARG1 ARG1 Agent Agent ISO ISO PP PP--inin Conditio Conditio n n ARG2 ARG2 Theme Theme ARG1 ARG1 Agent Agent RED RED ARG1 ARG1 0 0 ARG2 ARG2 Theme Theme activate activate
Useful information for mixed syntax-semantics approaches
G. Venturi, S. Montemagni, S. Marchi, Y. Sasaki, P. Thompson, J. McNaught, S. Ananiadou, 2009, “Bootstrapping a Verb Lexicon for Biomedical Information Extraction”, in Proceedings of the CICLing-2009 conference, Mexico.
From
From
bricks
bricks
of
of
biolexical
biolexical
knowledge
knowledge
to
The
The
BioLexicon
BioLexicon
:
:
representation
representation
model
model
The BL model
The BL model
is
is
conformant
conformant
to
to
ISO
ISO
-
-
LMF (ISO 24613:2008)
LMF (ISO 24613:2008)
high
high
-
-
level objects
level objects
: the meta
: the meta
-
-
model, i.e. a set of
model, i.e. a set of
independent lexical objects with relations among them
independent lexical objects with relations among them
low
low
-
-
level objects
level objects
: a set of Data Categories, i.e. linguistic
: a set of Data Categories, i.e. linguistic
constants
constants
in the form of attribute
in the form of attribute
-
-
value pairs (either drawn
value pairs (either drawn
from the ISO
from the ISO
-
-
12620 or defined for the special domain)
12620 or defined for the special domain)
XML DTD for the entire lexicon
XML DTD for the entire lexicon
The implementation consists of a flexible, extensible
The implementation consists of a flexible, extensible
relational
relational
MySQL
MySQL
database
database
Automatic population procedures relying on a dedicated
Automatic population procedures relying on a dedicated
input data structure, the
input data structure, the
BioLexicon
BioLexicon
XML Interchange
XML Interchange
Format (XIF)
Format (XIF)
The BioLexicon Model: High-level objects, lexical objects
The
The
BioLexicon
BioLexicon
:
:
representation
BioLexicon Model:
Data Categories
Syntax
Semantics
e.g.
<feat att=“POS” val=“VVZ”> <feat att=“ConfScore” val=“0.9”> <feat att=“source” val=“UNIPROT” ……
The
The
BioLexicon
BioLexicon
:
:
representation
representation
model
model
- Partially drawn by the ISO Data Category Registry - Partially integrated with domain-specific DCs
The
The
BioLexicon
BioLexicon
:
:
the
the
starting
starting
point
point
RegulonDB
RegulonDB, , TransFac
TransFac, Gene , Gene Ontology Ontology Annotation Annotation Transcription Transcription Regulator Regulator InterPro InterPro Protein Domain Protein Domain Corum database Corum database Protein Complex Protein Complex BioThesaurus BioThesaurus Protein Protein Sequence Sequence Ontology Ontology Transcription Transcription Factor
Factor--BindingSiteBindingSite
NCBI Species
NCBI Species
Organism
Organism
RegulonDB
RegulonDB, ODB , ODB ( (OperonOperon DataBase DataBase)) Operon Operon Sequence Sequence Ontology :Region Ontology :Region NucleicAcid NucleicAcid Region Region Resources Resources Semantic type Semantic type GO:0004879
GO:0004879 ligandligand- -dependent nuclear dependent nuclear receptor activity receptor activity Nuclear Nuclear Receptor Receptor IMR
IMR -- INOH Protein INOH Protein name/family name name/family name ontology ontology Ligand Ligand BioThesaurus BioThesaurus Gene Gene Enzyme commission Enzyme commission Enzyme Enzyme OMIM OMIM Disease Disease CHEBI, IMR:0000947 CHEBI, IMR:0000947 chemical chemical Chemical Chemical Gene Ontology Gene Ontology GO:0005575 cellular GO:0005575 cellular component component Cell Cell Component Component Cell ontology Cell ontology Cell Cell Resources Resources Semantic Semantic type type
The
The
BioLexicon
BioLexicon
by
by
numbers
numbers
741 741 1431 1431 Sequences Sequences 368 368 2672 2672 Operons Operons 129 129 160 160 Trans.
Trans. FactorsFactors
512 512 842 842 Cell Cell 29831 29831 8850 8850 Molecular
MolecularRolesRoles
11314 11314 19457 19457 Diseases Diseases 77475 77475 19637 19637 Chemicals Chemicals 418 418 2104 2104 Protein
Protein ComplCompl..
15412 15412 16940 16940 Protein Domains Protein Domains 4164 4164 4016 4016 Enzymes Enzymes 182610 182610 482992 482992 Organisms Organisms 936126 936126 358335 358335 Gene/
Gene/ProtProt((synsetssynsets))
1408312 1408312 1640608 1640608 Gene/ Gene/ProtProt # # VariantsVariants # # Entries Entries Sem. Type Sem. Type
Entries and variants Entries and variants
by semantic type by semantic type -668 668 syntax
syntax--semanticssemantics mappings
mappings ((concernedconcernedwithwith 168 168 verbsverbs)) -856 856 bio
bio--eventevent framesframes
-1710
1710
verb
verb--SLOTSLOTassociationsassociations
-3040
3040 verb
verb--SCFSCF associationsassociations
-2764
2764
related
related entriesentries(e.g. (e.g. absorb
absorb--> > absorptionabsorption/N, /N, absorber
absorber/N, /N, absorbingabsorbing/J, /J,
absorbable
absorbable/J, /J, absorbentabsorbent/J, /J, absorbently
absorbently/R)/R)
15274 15274 inflected
inflected formsforms
496 496 658 658 general general domain domain- -specific specific Verbs Verbs 550 550 Adverbs Adverbs 1154 1154 Verbs Verbs 3428 3428 Adjectives Adjectives 2231574 2231574 Nouns Nouns # # entriesentries POS POS Entries by part of Entries by part of speech speech
Comparison with two existing large
Comparison with two existing large
-
-
scale dictionaries
scale dictionaries
WordNetWordNet:: General English ThesaurusGeneral English Thesaurus
NLM Specialist Lexicon:NLM Specialist Lexicon: Biomedical LexiconBiomedical Lexicon
Coverage evaluation
Coverage evaluation
The
The
BioLexicon
BioLexicon
:
:
intrinsic
intrinsic
evaluation
evaluation
0 10 20 30 40 50 60 70 80 90 100 v e r b n o u n a d j e c t i v e a d v e r b n o m i n a l i z a t i o n a d j e t i v i a l a d v e r b a l T e r m i n o l o gi e s De r i vat io n al r e l at i o n s
Wo rdN e t Spe c ialist Le xic o n
BioLexicon WordNet
Specialist Lexicon
covered not covered
Biolexicon, human
BioLexicon
BioLexicon
in
in
BioCreAtIve
BioCreAtIve
II, GN
II, GN
task
task
Applied methods:
- Abner for gene identification
- Statistical features (BioLexicon)
for filtering of non-relevant terms
- Classification on BioCreAtIve II data
- Only human concept ids
⇒ Baseline results
⇒ Highly reproducible
⇒ Available as Whatizit module
The
The
BioLexicon
BioLexicon
:
:
extrinsic
extrinsic
evaluation
evaluation
Task
Task
-
-
based evaluation (still ongoing)
based evaluation (still ongoing)
BLTagger
BLTagger
NeMine
NeMine
(NER)
(NER)
Enju
Enju
with the BL
with the BL
BLTagger
BLTagger
NeMine
NeMine
(NER)
(NER)
Tool
Tool
TREC Genomics Track
TREC Genomics Track
2007
2007
IR
IR
UoM
UoM
Gene Regulation
Gene Regulation
Corpus
Corpus
Data
Data
IE
IE
Task
Task
NeMine (
http://text0.mib.man.ac.uk/~sasaki/bootstrep/nemine.html
)
Y. Sasaki, P. Thompson, J. McNaught, S. Ananiadou, 2009, “Three BIONLP tools powered by the BioLexicon”, in Proceedings of EACL 2009.
BioLexicon
BioLexicon
distribution
distribution
• The BioLexicon (MySQL version) is distributed
through the
European Language Resources
European Language Resources
Association
Association
(
ELRA
)
• http://www.elra.info or http://www.elda.org
• Benefits
• Servicing of bug reporting through ELRA
• Organisational embedding into other lexical resources • Long-term availability
• Support to European language infrastructures
• Different licence types for
• Commercial use
• Research use by commercial organisations • Research use by academic organisations
Conclusions
Conclusions
BL
BL
BL
A unique resource amongst large-scale computational lexiconswithin the biomedical domain in terms of
coverage and
typology of contained information
Populated with info from available
biomedical
resources and texts
TRY IT!
TRY IT!
Semi-automatically
populated from corpora:
Population toolkit available
Including both
domain-specific
and general
language words
Rich linguistic information
ranging over different linguistic descriptions levels
Conformant to international lexical representation
standards
Designed to meet bio-Text Mining requirements
THANK YOU
THANK YOU