Knowledge Management in Bioinformatics Humboldt-Universität Berlin
Ulf Leser
Text Mining and Knowledge
Management
University of Applied Sciences
Berlin Center for
Genome Based Bioinformatics
... something magic ...
Data Integration
KEGG
OMIM
PDB CATH
FSSP Ontology Gene
SCOP
All known structures of mammal proteins involved in the pentose phosphate pathway
that carry a Rossmann fold
resolved with a resolution better than 2.5 A
UniProt
Managing Biological Networks
Characteristics of the yeast proteome: map of protein-protein interactions.
PQL: Pathway Query Language
length=*
length=*
C=Lactaldehyde
B ISA Enzyme D=L-Lactaldehyde
length=2
BTW: PQL has many meanings!
Knowledge Sources
• Where is knowledge in biomedical research?
- People's minds
- Publications and textbooks
- Databases
• Databases
- Experimental data: Too little abstraction
- Annotation: Text again
• Most valuable information is in text
- Human language superior to all formal methods
- Reputation gained by publications, not by database submissions
- Biological databases employ professional “readers” (curators)
Real Sentences
„The human T cell leukemia lymphotropic virus type 1 Tax protein represses MyoD-dependent transcription by inhibiting MyoD-binding to the KIX domain of p300.“
Named Entity Recognition is difficult
• 10 components
• Unclear borders
„NSCLC often becomes resistant to chemotherapy due to multiple defects found in expression of CD95-L, CD95 and members of the Bcl-2 and IAP family, as well as caspase-8, -9 and -3 as examined by immunohistochemistry, ..“
[Figure: the sentence annotated with the labels Protein (five mentions), Protein family, Disease, Therapy, Relation, Evidence, Reason]
Real Sentences
• Complex multi-token names with endless variations
• Synonyms and homonyms, esp. gene, protein, disease, clone, region, locus, …
• Enumerations
• Cross-sentence dependencies
• Tables & figures
• …
Understanding Text?
• Artificial Intelligence
- Natural language processing
- Full parsing, complete syntax tree
- Aims at “understanding” the text
• Text mining
- Simple NLP and machine learning
• Stemming, part-of-speech (chunking)
• Classification, pattern matching
- Pragmatic approach
- Usually not perfect
- Needs careful evaluation
POS:    NOM    VRB    PRP  NOM
Stem:   FLICE  bind   to   FADD
Token:  FLICE  binds  to   FADD
Overview
• Why text mining for biomedical research
• Text Mining and Knowledge Management
• Subtasks
- Named Entity Recognition
- Relationship Mining
- Mining Attributed Relationships
Tasks in Knowledge Management
• Create new knowledge
- Experiments, intuition, and analysis
• Organize knowledge
- Make it searchable
- Make it exchangeable
• Integrate different sources of knowledge
- Put your results into context
- Structured (databases) and unstructured (text)
- Internal (LabDB, Endnote) and external (Swiss-Prot, Medline)
Organizing Personal Knowledge
Integrating Structured Data
A difficult topic on its own
Genes
Sequence Proteins Diseases Location
Homology
Splicing
Structure
Domains
SNPs
Expression
Regulation
Phylogeny
Which Data?
EMBL Swissprot OMIM GDB
BLAST
dbEST
PDB
InterPro
HMDB
ArrayExprs
KEGG
ParaLogs
Incyte PIR NCICB LocusLink
Mult. Alig.
UniGene
DSSP
SCOP
HapMap
Affy
Enzyme
Taxonomy
RefSeq PDB GeneCards Ensembl
Pat.-Hunt
ASDB
Predator
CATH
dbSNP
GEO
Brenda
HomoloGen
Problem: Semantics
A Gene ?
Names
• Gene
• Protein
• ORF
• CDR
• EST
• cDNA
• ...
Definitions
• Start – Stop
• ... promoter
• ... introns
• … splice variants
• … (im)mature mRNA
• ... protein
• ...
Facts
• Accuracy
• Exp. validated
• Prediction
• ... on similarity
• ... on conservation
• ...
Integrating Unstructured Knowledge
• Majority of knowledge is only available in text
- Publications & abstracts
- Notes, memos, lab books
• Problem: Find relevant knowledge
- “We found 50 significantly up-regulated genes …”
- Do they interact?
• Gene – gene relationship
- Are they related to specific diseases?
• Gene – disease relationship
- Do they share a common function?
Finding Relevant Knowledge
• PubMed/Medline
- 16,000,000 abstracts, ~400,000 new articles per year
• Find relevant articles
- Information retrieval
- What is “relevant”?
- Often: Large, unspecific results
- Often: Missing results (synonyms, full text versus abstract)
• Find relevant information inside each article
- Information extraction
- “Summarize” the results for my task
- Reading many abstracts is tedious
Extracting Information
• Find objects
- Genes, diseases, drugs, molecules, species, tissue, …
- Named Entity Recognition
• Find relationships between objects
- Gene regulation, protein interaction, gene-disease relation, …
- Relationship Mining
• Find properties of these relationships
- Intensity, type, kinetics, evidence, …
- Mining Attributed Relationships
• Integrate with your results
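The three extraction steps above map onto a small data model. The following is only a sketch under stated assumptions: the class names `Entity` and `Relationship` and their fields are illustrative and do not come from any system described in these slides.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    """An object found by named entity recognition."""
    text: str   # surface form as it appears in the sentence
    kind: str   # e.g. "gene", "disease", "drug"
    start: int  # character offset of the first character
    end: int    # character offset one past the last character


@dataclass
class Relationship:
    """A relationship between two entities, with optional attributes."""
    source: Entity
    target: Entity
    kind: str                                       # e.g. "interaction"
    attributes: dict = field(default_factory=dict)  # e.g. type, evidence


sentence = "FLICE binds to FADD"
flice = Entity("FLICE", "protein", 0, 5)
fadd = Entity("FADD", "protein", 15, 19)
relation = Relationship(flice, fadd, "interaction", {"trigger": "binds"})
```

Keeping character offsets on each entity makes it possible to trace an extracted fact back to its sentence when integrating it with other results.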
Overview
• Why text mining for biomedical research
• Text Mining and Knowledge Management
• Subtasks
- Named Entity Recognition
- Relationship Mining
- Mining Attributed Relationships
BioCreative Cup 2004
• NER is building block for many text mining applications
• Critical Assessment of Information Extraction Systems in Biology
- International competition
- Data provided by organizers in cooperation with database curators (Swiss-Prot)
- Test data available for one week
• Boost: Top systems reach ~84 F-measure
• Corpus of 7500 sentences
- 140,000 non-gene words
- 60,000 gene names
• SVMlight on different feature sets
• Dictionary compiled from Genbank, HUGO, MGD, YDB
• Post-processing for compound gene names
Approach: SVM for NER
[Figure: system diagram. Training: tokenized training corpus, vector generator, SVMlight, SVM model. Tagging: new text, vector generator, model-driven tagger, post-processor, tagged text.]
Features
Feature                  Weight     Example
Word                     tf * idf   kinase
n-grams
  N=1                    tf * idf   k, i, n, a, s, e
  N=2                    tf * idf   ki, in, na, as, se
  N=3                    tf * idf   kin, ina, nas, ase
Special signs
  HasNumbers             [1|0]      p300
  HasCapitals            [1|0]      abLIM
  AllCaps                [1|0]      DMD
  InitCap                [1|0]      Pax
  HasNumbers & Letters   [1|0]      cMOAT2, EST90757
Context
  preceding word         [1|0]      Gene
  succeeding word        [1|0]      Product
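To make the feature list concrete, here is a minimal sketch of a vector generator for a single token. It is an illustration under stated assumptions, not the system's actual code: the function names are hypothetical, the tf * idf weights are replaced by plain binary indicators, and the exact definitions of AllCaps and InitCap are guesses based on the examples on the slide.

```python
def char_ngrams(token, n):
    """All character n-grams of a token, left to right."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def token_features(token, prev_token=None, next_token=None):
    """Sparse feature dictionary for one token (binary weights only)."""
    has_digit = any(c.isdigit() for c in token)
    has_letter = any(c.isalpha() for c in token)
    features = {
        "HasNumbers": int(has_digit),
        "HasCapitals": int(any(c.isupper() for c in token)),
        # Assumption: digits are allowed, but every letter must be upper case.
        "AllCaps": int(has_letter and token.isupper()),
        # Assumption: first letter upper case, all remaining letters lower case.
        "InitCap": int(token[:1].isupper() and token[1:].islower()),
        "HasNumbersAndLetters": int(has_digit and has_letter),
    }
    features[f"word={token.lower()}"] = 1
    for n in (1, 2, 3):
        for gram in char_ngrams(token.lower(), n):
            features[f"{n}-gram={gram}"] = 1
    # Context features: one indicator per neighbouring word.
    if prev_token is not None:
        features[f"prev={prev_token.lower()}"] = 1
    if next_token is not None:
        features[f"next={next_token.lower()}"] = 1
    return features
```

A real setup would map these feature names to integer indices and write one sparse vector per token for the SVM trainer.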
Performance
[Figure: bar chart of precision and recall (axis 0 to 80) for three feature sets: syntax features (SF), SF + dictionary (simple), SF + dictionary (advanced)]
• Best result for BioCreative Cup: 73 F-measure
• Current feature set reaches 79 F-measure
• Rises from 73 to 83 under loose evaluation
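The gap between strict and loose evaluation can be illustrated with a short sketch. The slides do not define "loose", so the assumption here is that a predicted span counts as correct when it overlaps a gold span at all, while strict evaluation requires identical borders. All function names are illustrative.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (balanced F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(gold, predicted, loose=False):
    """Score predicted spans against gold spans.

    Spans are (start, end) token offsets with an exclusive end.
    """
    def matches(p, g):
        if loose:
            return p[0] < g[1] and g[0] < p[1]  # any overlap counts
        return p == g                           # exact borders required

    correct_predictions = sum(any(matches(p, g) for g in gold) for p in predicted)
    found_gold = sum(any(matches(p, g) for p in predicted) for g in gold)
    precision = correct_predictions / len(predicted) if predicted else 0.0
    recall = found_gold / len(gold) if gold else 0.0
    return precision, recall, f_measure(precision, recall)


gold = [(0, 2), (5, 6)]
predicted = [(0, 1), (5, 6), (8, 9)]
strict_scores = evaluate(gold, predicted)
loose_scores = evaluate(gold, predicted, loose=True)
```

In this toy case the prediction (0, 1) misses the right border of the gold name (0, 2), so it only counts under loose evaluation.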
NER – Current Stage
• Most successful features found by trial and error
- Brute force approach
- Apparently true for all BioCreative participants
• NER results depend on type of object
- Gene or protein is hard
- Gene and protein is much harder
- Cell type: 81; virus strains: 67; disease: ?; drugs: ? …
• What is left?
- Entity names are not really defined (borders)
- Inter-biologist agreement on type (gene, protein, RNA) and exact borders is around 70% (Krauthammer et al. 2000)
- How good can we get at all (for genes)?
- Overfitting to annotators likely
Overview
• Why text mining for biomedical research
• Text Mining and Knowledge Management
• Subtasks
- Named Entity Recognition
- Relationship Mining
- Mining Attributed Relationships
PubGene
• Pure co-occurrence
• Edge weight based on counting co-occurrence
• Contains approx. 6 million associations
• Commercial
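Pure co-occurrence counting is simple enough to sketch in a few lines. This is not PubGene's implementation. It assumes one count per text in which two names appear together, a naive tokenizer, and exact name matching. The example texts are invented.

```python
import re
from collections import Counter
from itertools import combinations

# Assumption: names consist of letters and digits, optionally joined by hyphens.
TOKEN = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def cooccurrence_counts(texts, names):
    """Count, for each pair of names, the number of texts mentioning both."""
    counts = Counter()
    for text in texts:
        present = sorted(set(names) & set(TOKEN.findall(text)))
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts


texts = [
    "CBF-A and CBF-C interact.",
    "CBF-B associates with CBF-A.",
    "CBF-A binds CBF-C.",
]
edge_weights = cooccurrence_counts(texts, {"CBF-A", "CBF-B", "CBF-C"})
```

Each counter entry corresponds to one weighted edge in the resulting gene network. Nothing in the count says what kind of relationship, if any, the texts describe.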
„We show that CBF-A and CBF-C interact with each other to form a CBF-A-CBF-C complex and that CBF-B does not interact with CBF-A or CBF-C individually but that it associates with the CBF-A-CBF-C complex.“
[Figure: graph with the nodes CBF-A, CBF-C, CBF-B, and CBF-A-CBF-C complex, and the edge labels interact, complex, associates]
Example 2
Simple Example
Real Examples
Z-100 is an arabinomannan extracted from Mycobacterium tuberculosis that has various immunomodulatory activities, such as the induction of interleukin 12, interferon gamma (IFN-gamma) and beta-chemokines. The effects of Z-100 on human immunodeficiency virus type 1 (HIV-1) replication in human monocyte-derived macrophages (MDMs) are investigated in this paper. In MDMs, Z-100 markedly suppressed the replication of not only macrophage-tropic (M-tropic) HIV-1 strain (HIV-1JR-CSF), but also HIV-1 pseudotypes that possessed amphotropic Moloney murine leukemia virus or vesicular stomatitis virus G envelopes. Z-100 was found to inhibit HIV-1 expression, even when added 24 h after infection. In addition, it substantially inhibited the expression of the pNL43lucDeltaenv vector (in which the env gene is defective and the nef gene is replaced with the firefly luciferase gene) when this vector was transfected directly into MDMs. These findings suggest that Z-100 inhibits virus replication, mainly at HIV-1 transcription. However, Z-100 also downregulated expression of the cell surface receptors CD4 and CCR5 in MDMs, suggesting some inhibitory effect on HIV-1 entry. Further experiments revealed that Z-100 induced IFN-beta production in these cells, resulting in induction of the 16-kDa CCAAT/enhancer binding protein (C/EBP) beta transcription factor that represses HIV-1 long terminal repeat transcription. These effects were alleviated by SB 203580, a specific inhibitor of p38 mitogen-activated protein kinases (MAPK), indicating that the p38 MAPK
Possible Approaches to PPI
• Co-occurrence
- Two proteins in one sentence -> PPI
- Tendency: Low precision, good recall
• Use machine learning
- Classify tokens / phrases; requires annotated corpus
- Tendency: Low precision, good recall
• Full sentence parsing
- Only ~30% of sentences are parsed unambiguously
- Tendency: Good precision, low recall
Relationship Mining
• Most systems work on language patterns
- Sentence
• … GENE regulates expression of GENE …
• … GENE is strongly suppressed by GENE …
- Linguistic annotation
• … GENE VRB NOM PRP GENE …
• … GENE is ADJ VRB PRP GENE …
• Patterns: Different levels of abstraction
- … GENE .* VRB .* GENE
- … GENE [is] ADJ? {regulat|suppres} NOM? PRP GENE
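The two abstraction levels can be tried out with ordinary regular expressions over an annotated sentence. This is a sketch under stated assumptions: sentences arrive already tagged as (token, tag) pairs, "is" is tagged VRB, verbs are kept as words in the specific layer so their stem can be tested, and the regular expressions are one possible reading of the pattern notation on the slide.

```python
import re

# Abstract pattern: two genes with some verb in between.
LOOSE = re.compile(r"GENE .*VRB .*GENE")
# Specific pattern: optional "is", optional ADJ, a verb with the stem
# regulat/suppres, optional NOM, then a preposition and the second gene.
SPECIFIC = re.compile(
    r"GENE (?:is )?(?:ADJ )?(?:regulat|suppres)\w* (?:NOM )?PRP GENE"
)


def tag_layer(annotated):
    """Sequence of tags only, e.g. 'GENE VRB NOM PRP GENE'."""
    return " ".join(tag for _, tag in annotated)


def mixed_layer(annotated):
    """Tags for all tokens except verbs, which stay as lower-cased words."""
    return " ".join(tok.lower() if tag == "VRB" else tag for tok, tag in annotated)


regulates = [("A", "GENE"), ("regulates", "VRB"), ("expression", "NOM"),
             ("of", "PRP"), ("B", "GENE")]
suppressed = [("A", "GENE"), ("is", "VRB"), ("strongly", "ADJ"),
              ("suppressed", "VRB"), ("by", "PRP"), ("B", "GENE")]
binds = [("A", "GENE"), ("binds", "VRB"), ("to", "PRP"), ("B", "GENE")]
```

The abstract pattern accepts all three sentences, while the specific one rejects the `binds` sentence. This shows on a small scale how precision is traded against recall.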
State-of-the-Art
• Systems work on hand-crafted pattern sets
- Hundreds of patterns
- Tendency: Very good precision, low recall
• There are never enough / the right patterns
• AliBaba
- Learn patterns automatically
- Uses PPI database as input
Workflow
[Figure: workflow diagram with the stages IntAct, PubMed, protein pairs, search sentences, NER and POS tagging, initial patterns, clustering, alignment, consensus pattern]
Initial Pattern
• Sentences: pair of proteins and “interaction”
- “…show that FADD immediately activates procaspase-8 during…”
• Extraction of core phrase around proteins
- “…show that FADD immediately activates procaspase-8 during…”
• Derivation of initial pattern
- “(FADD) (immediately) (activates) (procaspase-8) …”
• Semantic abstraction
- “PTN (immediately) (activates) PTN …”
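The derivation steps above can be sketched as a small function. It assumes the sentence is already tokenized, that the protein names are known from the database pair, and that the core phrase is simply the span from the first to the second protein mention. The function name is hypothetical.

```python
def initial_pattern(tokens, proteins):
    """Core phrase between two protein mentions, with proteins abstracted to PTN.

    Returns None when the sentence mentions fewer than two proteins.
    """
    positions = [i for i, token in enumerate(tokens) if token in proteins]
    if len(positions) < 2:
        return None
    first, second = positions[0], positions[1]
    core = tokens[first:second + 1]
    # Semantic abstraction: replace concrete protein names by a placeholder.
    return ["PTN" if token in proteins else token for token in core]


tokens = "show that FADD immediately activates procaspase-8 during".split()
pattern = initial_pattern(tokens, {"FADD", "procaspase-8"})
```

Here `pattern` is the token sequence PTN, immediately, activates, PTN, which corresponds to the abstracted pattern on the slide.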
Linguistic Annotation
• Part-of-speech and word stems
• Multi-layered patterns
• Highly specific patterns
• High precision, low recall
• Need to be generalized
POS:    PTN  ADV          VRB        PTN
Stem:   PTN  immediat     activat    PTN
Token:  PTN  immediately  activates  PTN
Workflow
[Figure: workflow diagram with the stages IntAct, PubMed, protein pairs, search sentences, NER and POS tagging, initial patterns, clustering, alignment, consensus pattern, extracted PPI]
Generalization
• Pattern similarity using sentence alignment
• Cluster patterns based on similarity matrix
• Consensus by multiple sentence alignment
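Pattern similarity by alignment can be illustrated with a standard global alignment (Needleman-Wunsch) over pattern elements instead of residues. The scoring values below (match +1, mismatch -1, gap -1) are assumptions chosen for illustration, not the parameters of the actual system.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of two patterns given as lists of elements."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch
            )
            score[i][j] = max(
                diagonal,                  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,     # gap in b
                score[i][j - 1] + gap,     # gap in a
            )
    return score[-1][-1]


def similarity_matrix(patterns):
    """Pairwise alignment scores, usable as input for clustering."""
    return [[alignment_score(p, q) for q in patterns] for p in patterns]


patterns = [
    ["PTN", "ADV", "VRB", "PTN"],
    ["PTN", "VRB", "PTN"],
    ["PTN", "VRB", "NOM", "PRP", "PTN"],
]
matrix = similarity_matrix(patterns)
```

The first two patterns differ only by the optional adverb and so score higher with each other than with the third. A clustering step would group them before a consensus is built.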
Workflow
[Figure: workflow diagram with the stages IntAct, PubMed, protein pairs, search sentences, NER and POS tagging, initial patterns, clustering, alignment, consensus pattern, extracted PPI]
Search Phase
• Sentences
- … are searched for protein names
- … matched against all consensus patterns
- Highest scoring pattern wins
• Results
- ~42,000 IntAct pairs yield ~20,000 initial patterns
- Generalized into ~10,000 consensus patterns
- Yields 79% precision at 52% recall (SPIES)
• Careful: What is an interaction? Which corpus? Which task?
- Fully automatic: rapid applicability, self-learning, probably
[Figure: screenshot with the panels Query, PubMed visualized, Extracted infos]
Tweaking and Tuning
• Special dictionaries
- Name lists incl. synonyms
• Special entity classes
- Database of pairs
• Special publications
- Certain journals, full-text articles
• High precision or high recall
- Fuzziness of name search, pattern generalization, thresholds
• Dictionaries versus ML NER
Overview
• Why text mining for biomedical research
• Text Mining and Knowledge Management
• Subtasks
- Named Entity Recognition
- Relationship Mining
- Mining Attributed Relationships
Kinetic Modelling
[Figure: diagram with the labels Source, Data, Method, Objective]
Example
The apparent K(m) value was calculated for adenosine and found to be 3.63 x 10(-3) M, which indicates high affinity of adenosine deaminase for its substrate adenosine.
Constant: K(m)
Value: 3.63 x 10(-3)
Unit: M
Enzyme: Adenosine deaminase
Compound: adenosine
Reaction: R01560
System Overview
[Figure: system diagram with the components XML pre-processing, entity recognition, information extraction & processing, information representation, and KMedDB]
Entity Recognition
• Entities to recognize:
- Enzymes (KEGG)
- Compounds (KEGG)
- Species (NCBI Taxonomy)
- Kinetic constants
- Values
- Units
- Temperature & pH values
• Recognized with dictionaries and hand-crafted regular expressions
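A minimal sketch of this combination, applied to the adenosine deaminase sentence from the example slide, might look as follows. The two name lists are toy stand-ins for dictionaries that would be compiled from KEGG, and the regular expressions cover only the notation seen in that one sentence. Both are assumptions for illustration.

```python
import re

# Toy dictionaries (assumption): real lists would be compiled from KEGG.
ENZYMES = ["adenosine deaminase"]
COMPOUNDS = ["adenosine"]

# Hand-crafted expressions for constants and for values such as "3.63 x 10(-3) M".
CONSTANT = re.compile(r"K\(m\)|k\(cat\)|V\(max\)")
VALUE_UNIT = re.compile(r"(\d+(?:\.\d+)?) x 10\((-?\d+)\) (mM|uM|nM|M)\b")


def find_names(sentence, names):
    """Dictionary lookup: names (lower case) that occur as whole words."""
    lowered = sentence.lower()
    return [n for n in names if re.search(r"\b" + re.escape(n) + r"\b", lowered)]


def extract_kinetics(sentence):
    """Collect constant, value, unit, enzymes, and compounds from one sentence."""
    record = {
        "enzymes": find_names(sentence, ENZYMES),
        "compounds": find_names(sentence, COMPOUNDS),
    }
    constant = CONSTANT.search(sentence)
    if constant:
        record["constant"] = constant.group(0)
    value = VALUE_UNIT.search(sentence)
    if value:
        mantissa, exponent, unit = value.groups()
        record["value"] = float(mantissa) * 10 ** int(exponent)
        record["unit"] = unit
    return record


sentence = (
    "The apparent K(m) value was calculated for adenosine and found to be "
    "3.63 x 10(-3) M, which indicates high affinity of adenosine deaminase "
    "for its substrate adenosine."
)
record = extract_kinetics(sentence)
```

The sketch only collects the entities of one sentence. Deciding which value belongs to which enzyme and compound when several occur is the harder part and is not attempted here.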
Information Extraction
KMedDB
Precision
[Figure: bar chart of precision [%] (axis 0 to 100) per class: Constant, Value, Unit, Enzyme, Compound, Reaction, pH, Temp, Species, Overall]
Recall
[Figure: bar chart of recall [%] (axis 0 to 60) per class: Constant, Value, Unit, Enzyme, Compound, Reaction, pH, Temp, Species, Overall]