Text Mining and Knowledge Management

(1)

Knowledge Management in Bioinformatics Humboldt-Universität Berlin

Ulf Leser

Text Mining and Knowledge Management

(2)

University of Applied Sciences

Berlin Center for Genome Based Bioinformatics

(3)

Data Integration

Query: All known structures of mammal proteins involved in the pentose phosphate pathway that carry a Rossmann fold, resolved with a resolution better than 2.5 Å

[Figure: "... something magic ..." integrating the sources KEGG, OMIM, PDB, CATH, FSSP, Gene Ontology, SCOP, UniProt]

(4)

Managing Biological Networks

(5)

Managing Biological Networks

Characteristics of the yeast proteome: map of protein-protein interactions.

(6)

PQL: Pathway Query Language

[Figure: example PQL query graph with node constraints (B ISA Enzyme, C=Lactaldehyde, D=L-Lactaldehyde) and path-length constraints (length=*, length=2)]

BTW: PQL has many meanings!

(7)

Knowledge Sources

• Where is knowledge in biomedical research?

- People's minds

- Publications and text books

- Databases

• Databases

- Experimental data: Too little abstraction

- Annotation: Text again

• Most valuable information is in text

- Human language superior to all formal methods

- Reputation gained by publications, not by database submissions

- Biological databases employ professional “readers” (curators)

(8)

Real Sentences

"The human T cell leukemia lymphotropic virus type 1 Tax protein represses MyoD-dependent transcription by inhibiting MyoD-binding to the KIX domain of p300."

(9)

Real Sentences


Named Entity Recognition is difficult

• 10 components

• Unclear borders

(10)

Real Sentences

"NSCLC often becomes resistant to chemotherapy due to multiple defects found in expression of CD95-L, CD95 and members of the Bcl-2 and IAP family, as well as caspase-8, -9 and -3 as examined by immunohistochemistry, .."

[Figure: the sentence annotated with the labels Disease, Therapy, Relation, Reason, Protein (five times), Protein family, Evidence]

• Complex multi-token names with endless variations

• Synonyms and homonyms, esp. gene, protein, disease, clone, region, locus, …

• Enumerations

• Cross-sentence dependencies

• Tables & figures

• …

(11)

Understanding Text?

• Artificial Intelligence

- Natural language processing

- Full parsing, complete syntax tree

- Aims at “understanding” the text

• Text mining

- Simple NLP and machine learning

• Stemming, part-of-speech (chunking)

• Classification, pattern matching

- Pragmatic approach

- Usually not perfect

- Needs careful evaluation

POS: NOM VRB PRP NOM

Stem: FLICE bind to FADD

Token: FLICE binds to FADD
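The three annotation layers shown for "FLICE binds to FADD" (part-of-speech tag, stem, token) can be reproduced with a toy annotator. This is only a minimal sketch: the tiny POS lexicon and the single-suffix stemmer are hypothetical stand-ins invented for this one sentence, not a real tagger or stemming algorithm.

```python
# Hypothetical mini-lexicon: covers only the example sentence.
POS_LEXICON = {"flice": "NOM", "fadd": "NOM", "binds": "VRB", "bind": "VRB", "to": "PRP"}
SUFFIXES = ("s",)  # naive assumption: only strip a third-person "s" from verbs


def stem(token: str) -> str:
    """Strip one known suffix from verbs; leave every other token untouched."""
    if POS_LEXICON.get(token.lower()) == "VRB":
        for suffix in SUFFIXES:
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                return token[: -len(suffix)]
    return token


def annotate(sentence: str) -> list[tuple[str, str, str]]:
    """Return one (POS, stem, token) triple per whitespace-separated token."""
    return [(POS_LEXICON.get(t.lower(), "UNK"), stem(t), t) for t in sentence.split()]


layers = annotate("FLICE binds to FADD")
for name, index in (("POS", 0), ("Stem", 1), ("Token", 2)):
    print(f"{name:6s}", " ".join(triple[index] for triple in layers))
```

Real systems replace both the lexicon and the stemmer with trained or rule-based components, which is why the results are "usually not perfect" and need evaluation.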

(12)

Overview

• Why text mining for biomedical research

• Text Mining and Knowledge Management

• Subtasks

- Named Entity Recognition

- Relationship Mining

- Mining Attributed Relationships

(13)

Tasks in Knowledge Management

• Create new knowledge

- Experiments, intuition, and analysis

• Organize knowledge

- Make it searchable

- Make it exchangeable

• Integrate different sources of knowledge

- Put your results into context

- Structured (databases) and unstructured (text)

- Internal (LabDB, Endnote) and external (Swiss-Prot, Medline)

(14)

Organizing Personal Knowledge

(15)

Integrating Structured Data

A difficult topic on its own

[Figure: kinds of data to integrate: Genes, Sequence, Proteins, Diseases, Location, Homology, Splicing, Structure, Domains, SNPs, Expression, Regulation, Phylogeny]

(16)

Which Data?

[Figure: cloud of candidate data sources and tools: EMBL, Swissprot, OMIM, GDB, BLAST, dbEST, PDB, InterPro, HMDB, ArrayExprs, KEGG, ParaLogs, Incyte, PIR, NCICB, LocusLink, Mult. Alig., UniGene, DSSP, SCOP, HapMap, Affy, Enzyme, Taxonomy, RefSeq, GeneCards, Ensembl, Pat.-Hunt, ASDB, Predator, CATH, dbSNP, GEO, Brenda, HomoloGen]

(17)

Problem: Semantics

A Gene ?

Names

• Gene

• Protein

• ORF

• CDR

• EST

• cDNA

• ...

Definitions

• Start – Stop

• ... promoter

• ... introns

• … splice variants

• … (im)mature mRNA

• ... protein

• ...

Facts

• Accuracy

• Exp. validated

• Prediction

• ... on similarity

• ... on conservation

• ...

(18)

Integrating Unstructured Knowledge

• Majority of knowledge is only available in text

- Publications & abstracts

- Notes, memos, lab books

• Problem: Find relevant knowledge

- “We found 50 significantly up-regulated genes …”

- Do they interact?

• Gene – gene relationship

- Are they related to specific diseases?

• Gene – disease relationship

- Do they share a common function?

(19)

Finding Relevant Knowledge

• PubMed/Medline

- 16,000,000 abstracts, ~400,000 new articles per year

• Find relevant articles

- Information retrieval

- What is “relevant”?

- Often: Large, unspecific results

- Often: Missing results (synonyms, full text versus abstract)

• Find relevant information inside each article

- Information extraction

- “Summarize” the results for my task

- Reading many abstracts is tedious

(20)

Extracting Information

• Find objects

- Genes, diseases, drugs, molecules, species, tissue, …

- Named Entity Recognition

• Find relationships between objects

- Gene regulation, protein interaction, gene-disease relation, …

- Relationship Mining

• Find properties of these relationships

- Intensity, type, kinetics, evidence, …

- Mining Attributed Relationships

• Integrate with your results

(21)

Overview

• Why text mining for biomedical research

• Text Mining and Knowledge Management

• Subtasks

- Named Entity Recognition

- Relationship Mining

- Mining Attributed Relationships

(22)

Biocreative Cup 2004

• NER is building block for many text mining applications

• Critical Assessment of Information Extraction Systems in Biology

- International competition

- Data provided by organizers in cooperation with database curators (Swiss-Prot)

- Test data available for one week

• Boost: Top systems reach ~84 F-measure

(23)

Approach: SVM for NER

• Corpus of 7500 sentences

- 140,000 non-gene words

- 60,000 gene names

• SVM^light on different feature sets

• Dictionary compiled from Genbank, HUGO, MGD, YDB

• Post-processing for compound gene names

[Figure: architecture of the tagger. Training path: tokenized training corpus, vector generator, SVM^light, SVM model. Tagging path: new text, vector generator, SVM-model-driven tagger, post-processor, tagged text.]

(24)

Features

Feature                  Weight     Example

Word                     tf * idf   kinase

n-grams
  N=1                    tf * idf   k, i, n, a, s, e
  N=2                    tf * idf   ki, in, na, as, se
  N=3                    tf * idf   kin, ina, nas, ase

Special signs
  HasNumbers             [1|0]      p300
  HasCapitals            [1|0]      abLIM
  AllCaps                [1|0]      DMD
  InitCap                [1|0]      Pax
  HasNumbers & Letters   [1|0]      cMOAT2, EST90757

Context
  preceding word         [1|0]      Gene
  succeeding word        [1|0]      Product
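The character n-gram and "special signs" features from the table can be computed directly from a token. The sketch below is illustrative only: the function names are made up here, the flags are emitted as plain 0/1 values, and the tf * idf weighting and the context features are left out because the slides do not specify how they were computed.

```python
def char_ngrams(word: str, n: int) -> list[str]:
    """All contiguous character n-grams of a word, in order."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def surface_features(word: str) -> dict[str, int]:
    """Binary orthographic features named after the rows of the feature table."""
    has_digit = any(c.isdigit() for c in word)
    has_alpha = any(c.isalpha() for c in word)
    return {
        "HasNumbers": int(has_digit),
        "HasCapitals": int(any(c.isupper() for c in word)),
        "AllCaps": int(has_alpha and word.isupper()),
        "InitCap": int(word[:1].isupper() and word[1:].islower()),
        "HasNumbersAndLetters": int(has_digit and has_alpha),
    }


for example in ("p300", "abLIM", "DMD", "Pax", "cMOAT2"):
    print(example, surface_features(example))
print(char_ngrams("kinase", 2))
```

Each token's n-grams and flags would then be concatenated into one sparse vector for the classifier.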

(25)

Performance

[Figure: bar chart of precision and recall (axis 0 to 80) for three feature sets: Syntax features (SF), SF + Dictionary (simple), SF + Dictionary (advanced)]

• Best result for BioCreative Cup: 73 F-measure

• Current feature set reaches 79 F-measure

• Rises from 73 to 83 under loose evaluation


(27)

NER – Current Stage

• Most successful features were found by trial and error

- Brute force approach

- Apparently true for all Biocreative participants

• NER results depend on type of object

- Gene or protein is hard

- Gene and protein is much harder

- Cell type: 81; virus strains: 67; disease: ?; drugs: ? …

• What is left?

- Entity names are not really defined (borders)

- Inter-biologist agreement on type (gene, protein, RNA) and exact borders is around 70% (Krauthammer et al. 2000)

- How good can we get at all (for genes)?

- Overfitting to annotators likely

(28)

Overview

• Why text mining for biomedical research

• Text Mining and Knowledge Management

• Subtasks

- Named Entity Recognition

- Relationship Mining

- Mining Attributed Relationships

(29)

PubGene

• Pure co-occurrence

• Edge weight based on counting co-occurrence

• Contains approx. 6 million associations

• Commercial
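Pure co-occurrence counting is simple enough to sketch in a few lines. The code below is a hedged illustration of the general idea, not PubGene's actual implementation: the abstracts and gene list are made-up toy data, name matching is a naive case-insensitive token lookup, and the raw pair count is used directly as the edge weight.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(abstracts: list[str], gene_names: list[str]) -> Counter:
    """Count, for every gene pair, in how many abstracts both names occur."""
    names = {g.lower() for g in gene_names}
    counts: Counter = Counter()
    for text in abstracts:
        # Naive tokenization: split on whitespace, strip common punctuation.
        tokens = {t.strip(".,;()").lower() for t in text.split()}
        present = sorted(tokens & names)
        counts.update(combinations(present, 2))  # one count per unordered pair
    return counts


# Toy data (hypothetical sentences, for illustration only).
toy_abstracts = [
    "FADD activates FLICE.",
    "FLICE binds to FADD and CD95.",
    "CD95 is expressed.",
]
edges = cooccurrence_counts(toy_abstracts, ["FADD", "FLICE", "CD95"])
for pair, weight in edges.most_common():
    print(pair, weight)
```

Because no linguistic analysis is involved, every pair mentioned together becomes an edge, which explains the approach's tendency toward high recall and low precision.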

(30)

Simple Example

"We show that CBF-A and CBF-C interact with each other to form a CBF-A-CBF-C complex and that CBF-B does not interact with CBF-A or CBF-C individually but that it associates with the CBF-A-CBF-C complex."

[Figure: extracted relationship graph with the nodes CBF-A, CBF-C, CBF-B and CBF-A-CBF-C complex, and the edge labels interact, complex, associates]

(31)

Real Examples

Z-100 is an arabinomannan extracted from Mycobacterium tuberculosis that has various immunomodulatory activities, such as the induction of interleukin 12, interferon gamma (IFN-gamma) and beta-chemokines. The effects of Z-100 on human immunodeficiency virus type 1 (HIV-1) replication in human monocyte-derived macrophages (MDMs) are investigated in this paper. In MDMs, Z-100 markedly suppressed the replication of not only macrophage-tropic (M-tropic) HIV-1 strain (HIV-1JR-CSF), but also HIV-1 pseudotypes that possessed amphotropic Moloney murine leukemia virus or vesicular stomatitis virus G envelopes. Z-100 was found to inhibit HIV-1 expression, even when added 24 h after infection. In addition, it substantially inhibited the expression of the pNL43lucDeltaenv vector (in which the env gene is defective and the nef gene is replaced with the firefly luciferase gene) when this vector was transfected directly into MDMs. These findings suggest that Z-100 inhibits virus replication, mainly at HIV-1 transcription. However, Z-100 also downregulated expression of the cell surface receptors CD4 and CCR5 in MDMs, suggesting some inhibitory effect on HIV-1 entry. Further experiments revealed that Z-100 induced IFN-beta production in these cells, resulting in induction of the 16-kDa CCAAT/enhancer binding protein (C/EBP) beta transcription factor that represses HIV-1 long terminal repeat transcription. These effects were alleviated by SB 203580, a specific inhibitor of p38 mitogen-activated protein kinases (MAPK), indicating that the p38 MAPK


(33)

Possible Approaches to PPI

• Co-occurrence

- Two proteins in one sentence -> PPI

- Tendency: Low precision, good recall

• Use machine learning

- Classify tokens / phrases; requires annotated corpus

- Tendency: Low precision, good recall

• Full sentence parsing

- Only ~30% of sentences are parsed unambiguously

- Tendency: Good precision, low recall

(34)

Relationship Mining

• Most systems work on language patterns

- Sentence

• … GENE regulates expression of GENE …

• … GENE is strongly suppressed by GENE …

- Linguistic annotation

• … GENE VRB NOM PRP GENE …

• … GENE is ADJ VRB PRP GENE …

• Patterns: Different levels of abstraction

- … GENE .* VRB .* GENE

- … GENE [is] ADJ? {regulat|suppres} NOM? PRP GENE
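The two abstraction levels above map naturally onto regular expressions over an annotated token sequence. The following is a minimal sketch under stated assumptions: sentences are already tagged as (token, tag) pairs, the tag names follow the slide (GENE, VRB, NOM, PRP, ADJ), and the gene names GeneA/GeneB are placeholders rather than real findings. The translation of the slide's pattern notation into Python regex syntax is my own reading of it.

```python
import re

# Loose pattern over the POS layer only: GENE .* VRB .* GENE
LOOSE = re.compile(r"\bGENE\b.*\bVRB\b.*\bGENE\b")
# Stricter mixed-layer pattern: GENE [is] ADJ? {regulat|suppres} NOM? PRP GENE
STRICT = re.compile(r"\bGENE (is )?(ADJ )?(regulat|suppres)\w* (NOM )?PRP GENE\b")


def pos_layer(tagged: list[tuple[str, str]]) -> str:
    """Render a sentence as its sequence of tags."""
    return " ".join(tag for _, tag in tagged)


def mixed_layer(tagged: list[tuple[str, str]]) -> str:
    """Verbs keep their lower-cased surface form; all other tokens become tags."""
    return " ".join(tok.lower() if tag == "VRB" else tag for tok, tag in tagged)


# Placeholder sentences (hypothetical, for illustration only).
s1 = [("GeneA", "GENE"), ("regulates", "VRB"), ("expression", "NOM"),
      ("of", "PRP"), ("GeneB", "GENE")]
s2 = [("GeneA", "GENE"), ("is", "VRB"), ("strongly", "ADJ"),
      ("suppressed", "VRB"), ("by", "PRP"), ("GeneB", "GENE")]
s3 = [("GeneA", "GENE"), ("binds", "VRB"), ("to", "PRP"), ("GeneB", "GENE")]

for s in (s1, s2, s3):
    print(mixed_layer(s),
          "| loose:", bool(LOOSE.search(pos_layer(s))),
          "| strict:", bool(STRICT.search(mixed_layer(s))))
```

The loose pattern accepts all three sentences, while the strict one rejects the "binds to" sentence: the more abstract the pattern, the higher the recall and the lower the precision.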

(35)

State-of-the-Art

• Systems work on hand-crafted pattern sets

- Hundreds of patterns

- Tendency: Very good precision, low recall

• There are never enough / the right patterns

• AliBaba

- Learn patterns automatically

- Uses PPI database as input

(36)

Workflow

[Figure: workflow from the sources IntAct and PubMed through the stages protein pairs, search sentences, NER and POS tagging, initial patterns, clustering, alignment, consensus pattern]

(37)

Initial Pattern

• Sentences: pair of proteins and “interaction”

- “…show that FADD immediately activates procaspase-8 during…”

• Extraction of core phrase around proteins

- “…show that FADD immediately activates procaspase-8 during…”

• Derivation of initial pattern

- “(FADD) (immediately) (activates) (procaspase-8) …”

• Semantic abstraction

- “PTN (immediately) (activates) PTN …”
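The steps on this slide (find the protein pair, cut out the core phrase, abstract the names to PTN) can be sketched as follows. This is an assumption-laden illustration: the core phrase is taken to be the span from the first to the last protein mention, tokenization is plain whitespace splitting, and the function name is hypothetical. The slides do not say how much surrounding context the real system keeps.

```python
def initial_pattern(sentence: str, proteins: set[str]) -> list[str]:
    """Cut the core phrase between the outermost protein mentions and
    replace every protein name by the placeholder PTN."""
    tokens = sentence.split()
    hits = [i for i, tok in enumerate(tokens) if tok in proteins]
    if len(hits) < 2:
        return []  # no protein pair, so no pattern can be derived
    core = tokens[hits[0]: hits[-1] + 1]
    return ["PTN" if tok in proteins else tok for tok in core]


pattern = initial_pattern(
    "show that FADD immediately activates procaspase-8 during",
    {"FADD", "procaspase-8"},
)
print(pattern)
```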

(38)

Linguistic Annotation

• Part-of-speech and word stems

• Multi-layered pattern

• Highly specific pattern

• High precision, low recall

• Needs to be generalized

POS: PTN ADV VRB PTN

Stem: PTN immediat activat PTN

Token: PTN immediately activates PTN

(39)

Workflow

[Figure: workflow from the sources IntAct and PubMed through the stages protein pairs, search sentences, NER and POS tagging, initial patterns, clustering, alignment, consensus pattern, extracted PPI]

(40)

Generalization

• Pattern similarity using sentence alignment

• Cluster patterns based on similarity matrix

• Consensus by multiple sentence alignment
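Pattern similarity by sentence alignment can be illustrated with a standard global alignment (Needleman-Wunsch) over tag sequences. The sketch below is not the system's actual procedure: the scoring values, the threshold, and the greedy clustering step are assumptions chosen for brevity, standing in for clustering on a full similarity matrix, and consensus building by multiple alignment is omitted.

```python
def alignment_score(a: list[str], b: list[str],
                    match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Needleman-Wunsch global alignment score of two tag sequences."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[-1][-1]


def cluster(patterns: list[list[str]], threshold: int) -> list[list[list[str]]]:
    """Greedy stand-in for clustering: a pattern joins the first cluster
    whose first member it aligns to with at least the threshold score."""
    clusters: list[list[list[str]]] = []
    for p in patterns:
        for c in clusters:
            if alignment_score(p, c[0]) >= threshold:
                c.append(p)
                break
        else:
            clusters.append([p])
    return clusters


p1 = ["PTN", "ADV", "VRB", "PTN"]
p2 = ["PTN", "VRB", "PTN"]
p3 = ["PTN", "VRB", "NOM", "PRP", "PTN"]
groups = cluster([p1, p2, p3], threshold=2)
print(alignment_score(p1, p2), groups)
```

With these toy settings the first two patterns end up together because they differ by a single gap, while the third needs three gaps to align and starts its own cluster.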


(42)

Search Phase

• Sentences

- … are searched for protein names

- … matched against all consensus patterns

- Highest scoring pattern wins

• Results

- ~42,000 IntAct pairs yield ~20,000 initial patterns

- Generalized into ~10,000 consensus patterns

- Yields 79% precision at 52% recall (SPIES)

• Careful: What is an interaction? Which corpus? Which task?

- Fully automatic: rapid applicability, self-learning, probably
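To compare the reported precision and recall with the F-measure figures quoted elsewhere in these slides, both can be combined into one number. The sketch assumes the usual balanced F-measure (F1, the harmonic mean of precision and recall); the slides do not state which variant was used.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta = 1 gives the harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# 79% precision at 52% recall, as reported on the SPIES corpus.
f1 = f_measure(0.79, 0.52)
print(f"F1 = {f1:.3f}")  # about 0.627
```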

(43)

[Figure: screenshot with three panels: query, PubMed results visualized, extracted information]

(44)

Tweaking and Tuning

• Special dictionaries

- Name lists incl. synonyms

• Special entity classes

- Database of pairs

• Special publications

- Certain journals, fulltext articles

• High precision or high recall

- Fuzziness of name search, pattern generalization, thresholds

• Dictionaries versus ML NER

(45)

Overview

• Why text mining for biomedical research

• Text Mining and Knowledge Management

• Subtasks

- Named Entity Recognition

- Relationship Mining

- Mining Attributed Relationships

(46)

Kinetic Modelling

[Figure: diagram with the elements Source, Data, Method, Objective]

(47)

Example

The apparent K(m) value was calculated for adenosine and found to be 3.63 x 10(-3) M, which indicates high affinity of adenosine deaminase for its substrate adenosine.

Constant: K(m)

Value: 3.63 x 10(-3)

Unit: M

Enzyme: Adenosine deaminase

Compound: adenosine

(48)

Example

The apparent K(m) value was calculated for adenosine and found to be 3.63 x 10(-3) M, which indicates high affinity of adenosine deaminase for its substrate adenosine.

Constant: K(m)

Value: 3.63 x 10(-3)

Unit: M

Enzyme: Adenosine deaminase

Compound: adenosine

Reaction: R01560

(49)

System Overview

[Figure: system pipeline with the stages XML Pre-Processing, Entity Recognition, Information Extraction & Processing, Information Representation, and the database KMedDB]

(50)

Entity Recognition

• Entities to recognize:

- Enzymes (KEGG)

- Compounds (KEGG)

- Species (NCBI Taxonomy)

- Kinetic constants

- Values

- Units

- Temperature & pH-values

• Recognized with dictionaries and hand-crafted regular expressions

(51)

Information Extraction

(52)

KMedDB


(55)

Precision

[Figure: bar chart of precision [%] (axis 0 to 100) per class: Constant, Value, Unit, Enzyme, Compound, Reaction, pH, Temp, Species, Overall]

(56)

Recall

[Figure: bar chart of recall [%] (axis 0 to 60) per class: Constant, Value, Unit, Enzyme, Compound, Reaction, pH, Temp, Species, Overall]

(57)