• No results found

Protein Protein Interaction Networks

N/A
N/A
Protected

Academic year: 2022

Share "Protein Protein Interaction Networks"

Copied!
41
0
0

Loading.... (view fulltext now)

Full text

(1)

Functional Pattern Mining from Genome‐Scale Functional Pattern Mining from Genome‐Scale 

Protein‐Protein Interaction Networks 

Young-Rae Cho, Ph.D.

Assistant Professor

Department of Computer Science B l U i it

Baylor University

(2)

My Definition of Bioinformatics

 Hidden Knowledge (Pattern) Discovery

Computational

Machine Learning Techniques Data Mining Algorithms

Mathematical Models

p

Techniques

Mathematical Models

Knowledge Biomedical

Applications Biological

Data

Genome Proteome Networks

Functional Characterization Disease Diagnosis Drug Development Drug Development

(3)

Bioinformatics Milestone

Computational Biology Functional Genomics Systems Biology

Stage 1. 

Sequence Analysis

Stage 2. 

Structure Analysis

Stage 3. 

Genome Analysis

Stage 4. 

System Analysis

Sequence Analysis Structure Analysis Genome Analysis y y

‱ DNA sequencing

‱ Homolog search

‱ Motif finding

‱ Protein folding

‱ Homolog search

‱ Binding site prediction

‱ Function prediction

‱ Gene clustering

‱ Network modeling

‱ Module finding

‱ Pathway prediction

(4)

Protein‐Protein Interactions

 Protein‐Protein Interactions  (PPIs)

 Permanent interactions  vs. Transient interactions

 Determined by high‐throughput experimental methods:

mass spectrometry and yeast two‐hybrid system

 Interactome

 The entire set of PPIs in a species

 Problem?

‱ Large scale & Unreliability

 Demand?

‱ Computational, integrative approaches

(5)

Protein‐Protein Interaction Networks

 Protein‐Protein Interaction Networks

 A set of nodes V represents proteins 

 A set of edges E represents interactions

 Undirected unweighed graph, G(V,E)

 Problem?

‱ Complex connectivity

 Demand?

‱ Computational, 

systematic approaches

(6)

Overview

 Protein Function Prediction

 Protein Complex Prediction

 Functional Module Detection

 Functional Hub Identification

 Functional Pathway Identification

(7)

Protein Function Prediction from PPI Networks

 Previous Intuitions

 Utilize connectivity between proteins

‱ Local neighborhood‐based methods

 Limited accuracy

‱ Some interactions irrelevant to functional linkage

 My Direction

 Utilize network motifs

‱ Interconnection patterns occurring  in PPI networks more frequently  than in randomized networks

‱ Core components  for functional  activities

‱ Evolutionary conserved

Wuchty, et al., 2003

(8)

Function Association Patterns

 Subgraph Patterns

 Functional Association Patterns

 Labeled subgraphs

 Assigned a set of functions of a protein into the corresponding node as a label

 Functional Association Pattern Mining

 Identification of functional association patterns occurring frequently

 Function Prediction

 Predicting protein function by functional association pattern mining

(9)

Pattern Mining Process

 Assumption in Frequency

 If a sub‐pattern of a pattern p is not frequent, then p is not frequent

ï‚ź Downward closure

ï‚ź Applied the frequent item‐set mining algorithm in the market basket problem

 Selective Joining

 Merged two frequent (k‐1)‐node patterns to generate a candidate k‐node pattern

 Considered the (k‐1)‐node patterns which share a frequent (k‐2)‐node sub‐pattern

 Apriori Pruning

 Filtered out infrequent patterns using a threshold of minimum frequency

 Counted all isomorphic patterns using canonical forms

(10)

Pattern Mining Example

candidate

2-node patterns frequency

{ f1, f2} { f1}

{ f1}

{f1} - {f1} 1 {f1, f2} - {f1, f2} 1 {f1, f3} - {f1, f3} 0 {f1} - {f1, f2} 3

frequent

2-node patterns frequency {f1} - {f1, f2} 3

{ f1, f2} { f1, f3}

{f1} - {f1, f3} 2 {f1, f2} - {f1, f3} 1

{f1} - {f1, f3} 2

{ f1, f2} { f1}

{ f1} frequent

3-node patterns frequency

candidate

3-node patterns frequency

{ f1, f2} { f1, f3}

3-node patterns

{f1, f2} - {f1} - {f1, f3} 3 {f1} - {f1, f2} - {f1} 1 {f1, f2} - {f1} - {f1, f2} 1 {f1} - {f1, f3} - {f1} 1 {f1, f3} - {f1} - {f1, f3} 0 {f1, f2} - {f1} - {f1, f3} 3

(11)

Function Prediction Algorithm

 Pattern Analogue

 Replaced only one node label in a pattern p with a different label

 Function Prediction

 Given k, predicting function of an unknown protein by projecting each k‐node frequent  pattern

 Algorithm

P di ti ( G(V E) Fk 1 Fk ∈ V ) Prediction ( G(V,E), Fk-1, Fk, vu∈ V )

Fnk-1← selecting frequent patterns pnk-1including vn∈ N(vu) Puk← generating patterns pukby extending pnk-1 ∈ Fnk-1to vu if pik∈ Pukis a pattern analogue of pjk ∈ Fk

f← predicting function of vuby the most frequent pjk end if

return f

(12)

Function Prediction Accuracy

 Evaluation Design

 Leave‐one‐out cross‐validation

 Exact match vs. Inclusive match

‱ Exact match:  { predicted func ons } ≡ { real func ons }

‱ Inclusive match:  { predicted functions } ≀ { real functions }

 Results

0.8 1.0

0.8 1.0

85% 86%

0.4 0.6

accuracy

0.4 0.6

accuracy

0.0 0.2

2 3 4 5 6

min. freq. = 20 min. freq. = 40 min. freq. = 60

0.0 0.2

2 3 4 5 6

min. freq. = 20 min. freq. = 40 min. freq. = 60

 Cho and Zhang, IEEE Transactions on Information Technology in Biomedicine (TITB), 2010

2 3 4 5 6

k

2 3 4 5 6

k

(13)

Overview

 Protein Function Prediction

 Protein Complex Prediction

 Functional Module Detection

 Functional Hub Identification

 Functional Pathway Identification

(14)

Protein Complex Prediction from PPI Networks

 Previous Intuitions

 Search densely connected subgraphs

‱ Graph clustering algorithms

 Limited accuracy

‱ Exclusion of proteins with low connectivity

 My Direction

 Apply a seed‐growth style approach

‱ An initial seed cluster grows based a connectivity function

‱ Searches core (dense region) and periphery (sparse region)

(15)

Graph Entropy

 Graph Entropy

 An information‐theoretic function of connectivity

 General Notations

‱ pi(v): probability of v having inner links (edges from v to the vertices in V’ of G’(V’,E’) ) 

‱ po(v): probability of v having outer links (edges from v to the vertices not in V’ )

 Definition

‱ Vertex Entropy:  e(v) = ‐ pi(v) log2pi(v) ‐ po(v) log2po(v)

‱ Graph Entropy:  e(G(V,E)) = Σ vVe(v)

 Graph Entropy Example

(16)

Protein Complex Prediction Algorithm

 Algorithm

 Greedy algorithmÂ Â ï‚ź Randomness,  No parameters

Creates an initial cluster including a vertex (seed) and its neighbors

Removes vertices on cluster inner boundary iteratively to decrease graph entropy until it is minimal

Adds vertices on cluster outer boundary iteratively to decrease graph entropy until it is minimal

Outputs the cluster, and repeats all until there is no more vertex left

(17)

Protein Complex Prediction Accuracy

 Evaluation Design

 F‐measure

‱ Mean of recall and precision

 P‐value

‱ Hyper‐geometric distribution

 Results

 Kenley and Cho, Proteomics, 2011

 Kenley and Cho, ICDM 2012

(18)

Overview

 Protein Function Prediction

 Protein Complex Prediction

 Functional Module Detection

 Functional Hub Identification

 Functional Pathway Identification

(19)

Functional Module Detection from PPI Networks

 Previous Intuitions

 Partition PPI networks

‱ Graph clustering algorithms

 Limited accuracy

‱ Unreliable, complex connections

‱ Non‐overlapping clusters

 My Direction

 Utilize an integrative approach

‱ Semantic analytics using the genome‐wide Gene Ontology (GO) data

 Simulate information propagation

‱ Information propagation from a selected source node through the network

(20)

Information Propagation Model

 Path Strength Measurement

F l

 Formula :

 Factors

‱ Normalized edge weights

‱ Inverse of path length

‱ Inverse of node degree

 Functional Impact Scoring

 Functional impact Fs(vi) of a source vs on vi:  Sum of strength of all possible paths from vsto vi

(21)

Dynamic Propagation Algorithm

 Algorithm

 Computation of cumulative functional impact of a source vss on all the other nodes

 Repeated  random walk simulation starting from a specific node vs

 Uses a minimum functional impact threshold

 Stops when there are no more updates on cumulative functional impact on any nodes

 Example

0.45

0.28 0.79

0.65

0.15

S

1.0 0.83

1.74 1.26 0.27

0.89 0.41

1.38 0.92

0.31 0.11

(22)

Propagation Pattern Mining

 Process

(23)

Functional Module Detection Efficiency

 Evaluation in Synthetic Networks

 Potential Factors

‱ Number of nodes

‱ Network density

‱ Average node degree

 Results

50 60

constant density constant average degree

100 120

constant number of nodes constant average degree

20 30 40

run time (msec)

40 60 80

un time (msec)

0 10 20

0 1000 2000 3000 4000 5000 6000 7000 0

20 40

0 0.002 0.004 0.006 0.008 0.01

ru

number of nodes

density

(24)

Functional Module Detection Accuracy

 Hypothesis

 If two proteins have coherent propagation patterns,  0.8

0.9

patterns random pairs from different functions random pairs from the same functions

then they are likely to perform the same functions

0.4 0.5 0.6 0.7

on of functional flow p

0 0.1 0.2 0.3

average correlatio

 Results

 Cho, Hwang, Ramanathan and Zhang, BMC Bioinformatics, 2007

 Cho, Shi and Zhang, ICDM 2009

(25)

Overview

 Protein Function Prediction

 Protein Complex Prediction

 Functional Module Detection

 Functional Hub Identification

 Functional Pathway Identification

(26)

Functional Hub Identification from PPI Networks

 Previous Intuitions

 Search high‐degree nodes

 Search centrally located nodes

‱ Closeness / Betweenness

 Limited accuracy

‱ Unreliable, complex connections

‱ Unable to find a hierarchy

 My Direction

 Utilize an integrative approach

‱ Semantic analytics using the genome‐wide Gene Ontology (GO) data

 Reconstruct the hierarchical structure of proteins

(27)

Functional Similarity Model

 Path Strength

F l

 Formula :

 Functional Similarity

 k‐length path strength:

 Functional similarity:      where   

th h ld threshold

(28)

Network Conversion Algorithm

 Algorithm

 Conversion of a complex PPI network into a hierarchical tree structure of proteins

 Process

(1) Computes centrality for each node a

(2) Obtains a set of ancestor nodes T(a) of a

(3) Selects a parent node p(a) of a

(29)

Hub Confidence Measurement

 Algorithm

 Quantification of functional “hubness” of proteins 

 Process

(1) Obtains a set of child nodes D(a) of a

(2) Obtains a set of descendent nodes Laof a

(3) Computes the hub confidence H(a) of a

(30)

Topological Assessment of Functional Hubs

 Network Vulnerability Test

 Random attack:  Repeatedly disrupt a randomly selected node

 Degree‐based hub attack:  Repeatedly disrupt the highest degree node

 Functional hub attack:  Repeatedly disrupt the node with the highest hub confidence

 For each iteration, observe the largest component

 Results

0.95 1.00

t

0.75 0.80 0.85 0.90

of largest component

0.60 0.65 0.70

0 20 40 60 80 100 120 140 160

b f d

fraction o

random attack

degree-based hub attack structural hub attack

number of nodes

(31)

Biological Assessment of Functional Hubs

 Protein Lethality Test

 To determine lethal proteins by knock‐out experiment ( Lethality represents functional essentiality. )

 Order proteins by degree and hub confidence

 For every 10 proteins, observe the cumulative proportion of lethal proteins

 Results

0 9 1.0

structural hubs

d b d h b

0.6 0.7 0.8 0.9

average lethality

degree-based hubs

0.4 0.5

0 20 40 60 80 100 120 140

number of hubs

a

 Cho and Zhang, BMC Bioinformatics, 2010

(32)

Overview

 Protein Function Prediction

 Protein Complex Prediction

 Functional Module Detection

 Functional Hub Identification

 Functional Pathway Identification

(33)

Functional Pathway Identification from PPI Networks

 Previous Intuitions

 Search the strongest path between a source and a target

ï‚ź Computational inefficiency

 Limited accuracy

‱ Difference between signal transduction and interaction

 My Direction

 Utilize an integrative approach

‱ Semantic analytics using the genome‐wide Gene Ontology (GO) data

 Measurement of path frequency towards the target node

(34)

Frequent Path Mining

 Notations

 lx :  a list of length x,  

 Pk :  a set of all paths of length k between a source and a target

 Lx:  a set of path prefixes of Pk of length x

 A selection function f :  returning a subset of Pk, having a specific prefix lx

 Frequent Path Mining

 Given a path prefix lx, computes the support of the association rule, lxï‚ź vi

(35)

Functional Pathway Prediction Algorithm

 Algorithm

Mining frequent paths

Selecting multiple successors by an expansion parameter

Was the target selected?

no

Merging the pathways yes

(36)

Efficiency Improvement of Path Mining

 Preprocessing

 Quasi‐clustering

‱ Information propagation from the source node

‱ Frequent path mining in a subgraph of the PPI network

 Approximation

 Approximate support pre‐computation

‱ Estimation of cycles

‱ Use of a pruning parameter

(37)

Functional Pathway Identification Performance

 Accuracy

 Test for MAP Kinase signaling pathways

 Efficiency

(38)

Overview

 Protein Function Prediction

 Protein Complex Prediction

 Functional Module Detection

 Functional Hub Identification

 Functional Pathway Identification

(39)

Conclusion

 Bioinformatics Research Trend

 Gene‐level  →   Genome‐level

 The local  →  The global

 The par cular  →  The universal

 Functional Characterization Process

 Pre‐genomic era:  sequence → structure → function

 Post‐genomic era: interaction → network → function

(40)

Bioinformatics Program at Baylor

 Bioinformatics in Computer Science

(41)

References

 My personal webpage:   http:// web.ecs.baylor.edu/faculty/cho/

 My lab webpage:   http://bionet.ecs.baylor.edu/

 Baylor, Bioinformatics program webpage:  http://www.ecs.baylor.edu/bioinformatics/

 Baylor, Institute of Biomedical Studies webpage:  http://www.baylor.edu/biomedical_studies/

References

Related documents

The board of county commissioners hereby declares that it is the intent of this article to establish a coastal setback line to buffer major habitable and major accessory

malacitanus, an Endangered species (Freyhof and Kottelat, 2008), may be drastically reduced in the future as a result of global climate change and the trend to an

Understanding the impact of climate change on water quality will require an understanding of the link between P concentrations, loads and recipient freshwater ecosystems within

They may work for federal, state and local government, forensic laboratories, medical examiners offices, hospitals, universities, toxicology laboratories, police departments,

The increasing authority of inter- national public law and use of soft law in the private legal realm, as well as the prevalence of comparative law as a normative lens through which

To successfully be a full time beekeeper, you typically need to produce more than just honey; a combination of honey, pollination and queen rearing is typically necessary.. Also,

In our model of a composite web service, each peer has a queue for all of its in- put messages and may send messages to the input queues of other peers. To model the global behavior

The disk graph of a handlebody H of genus g ≄ 2 with m ≄ 0 marked points on the boundary is the graph whose vertices are isotopy classes of disks disjoint from the marked points