Functional Pattern Mining from Genome-Scale
Protein-Protein Interaction Networks
Young-Rae Cho, Ph.D.
Assistant Professor
Department of Computer Science
Baylor University
My Definition of Bioinformatics
■ Hidden Knowledge (Pattern) Discovery
[Diagram: computational techniques (machine learning techniques, data mining algorithms, mathematical models) are applied to biological data (genome, proteome, networks) to discover knowledge for biomedical applications (functional characterization, disease diagnosis, drug development)]
Bioinformatics Milestone
[Diagram: four stages spanning computational biology, functional genomics, and systems biology]
Stage 1. Sequence Analysis
• DNA sequencing
• Homolog search
• Motif finding
Stage 2. Structure Analysis
• Protein folding
• Homolog search
• Binding site prediction
Stage 3. Genome Analysis
• Function prediction
• Gene clustering
Stage 4. System Analysis
• Network modeling
• Module finding
• Pathway prediction
Protein-Protein Interactions
■ Protein-Protein Interactions (PPIs)
▪ Permanent interactions vs. transient interactions
▪ Determined by high-throughput experimental methods: mass spectrometry and the yeast two-hybrid system
■ Interactome
▪ The entire set of PPIs in a species
▪ Problem?
• Large scale and unreliability
▪ Demand?
• Computational, integrative approaches
Protein-Protein Interaction Networks
■ Protein-Protein Interaction Networks
▪ A set of nodes V represents proteins
▪ A set of edges E represents interactions
▪ Undirected, unweighted graph G(V,E)
▪ Problem?
• Complex connectivity
▪ Demand?
• Computational, systematic approaches
Overview
■ Protein Function Prediction
■ Protein Complex Prediction
■ Functional Module Detection
■ Functional Hub Identification
■ Functional Pathway Identification
Protein Function Prediction from PPI Networks
■ Previous Intuitions
▪ Utilize connectivity between proteins
• Local neighborhood-based methods
▪ Limited accuracy
• Some interactions are irrelevant to functional linkage
■ My Direction
▪ Utilize network motifs (Wuchty et al., 2003)
• Interconnection patterns occurring in PPI networks more frequently than in randomized networks
• Core components for functional activities
• Evolutionarily conserved
Functional Association Patterns
■ Subgraph Patterns
■ Functional Association Patterns
▪ Labeled subgraphs
▪ Each node is labeled with the set of functions of the corresponding protein
■ Functional Association Pattern Mining
▪ Identification of frequently occurring functional association patterns
■ Function Prediction
▪ Predicting protein function by functional association pattern mining
Pattern Mining Process
■ Assumption in Frequency
▪ If a sub-pattern of a pattern p is not frequent, then p is not frequent
→ Downward closure
→ Applies the frequent item-set mining algorithm from the market basket problem
■ Selective Joining
▪ Merges two frequent (k-1)-node patterns to generate a candidate k-node pattern
▪ Considers only the (k-1)-node patterns that share a frequent (k-2)-node sub-pattern
■ Apriori Pruning
▪ Filters out infrequent patterns using a minimum frequency threshold
▪ Counts all isomorphic patterns using canonical forms
Pattern Mining Example
[Figure: example labeled graph whose nodes carry the labels {f1}, {f1, f2}, and {f1, f3}]

candidate 2-node patterns        frequency
{f1} - {f1}                      1
{f1, f2} - {f1, f2}              1
{f1, f3} - {f1, f3}              0
{f1} - {f1, f2}                  3
{f1} - {f1, f3}                  2
{f1, f2} - {f1, f3}              1

frequent 2-node patterns         frequency
{f1} - {f1, f2}                  3
{f1} - {f1, f3}                  2

candidate 3-node patterns        frequency
{f1, f2} - {f1} - {f1, f3}       3
{f1} - {f1, f2} - {f1}           1
{f1, f2} - {f1} - {f1, f2}       1
{f1} - {f1, f3} - {f1}           1
{f1, f3} - {f1} - {f1, f3}       0

frequent 3-node patterns         frequency
{f1, f2} - {f1} - {f1, f3}       3
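The counting, joining, and pruning steps can be illustrated with a minimal sketch. It is not the published implementation: it handles only path-shaped patterns of two and three nodes, and the function names (`canonical`, `mine_path_patterns`), the toy graph, and the threshold of 2 are assumptions made for illustration. The toy graph is hypothetical and is not the graph drawn on the slide.

```python
from collections import Counter
from itertools import combinations

def canonical(path):
    """Return one representative for a labeled path and its mirror image."""
    return min(path, tuple(reversed(path)))

def mine_path_patterns(edges, labels, min_freq):
    """Mine frequent 2-node and 3-node labeled path patterns (illustrative only)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # Every edge is one occurrence of a 2-node pattern.
    count2 = Counter(canonical((labels[u], labels[v])) for u, v in edges)
    frequent2 = {p: c for p, c in count2.items() if c >= min_freq}

    # Selective joining: combine two 2-node patterns that share the middle node.
    count3 = Counter()
    for mid, nbrs in adj.items():
        for a, b in combinations(sorted(nbrs), 2):
            left = canonical((labels[a], labels[mid]))
            right = canonical((labels[mid], labels[b]))
            # Downward closure: skip candidates with an infrequent sub-pattern.
            if left in frequent2 and right in frequent2:
                count3[canonical((labels[a], labels[mid], labels[b]))] += 1
    frequent3 = {p: c for p, c in count3.items() if c >= min_freq}
    return frequent2, frequent3

# Hypothetical toy graph: node id -> set of functions (stored as sorted tuples).
F1, F12, F13 = ("f1",), ("f1", "f2"), ("f1", "f3")
labels = {1: F1, 2: F12, 3: F13, 4: F12, 5: F1}
edges = [(1, 2), (1, 3), (1, 4), (5, 2), (5, 3)]
frequent2, frequent3 = mine_path_patterns(edges, labels, min_freq=2)
```

Taking the smaller of a path and its reverse is the simplest possible canonical form. General subgraph patterns would need a real graph canonicalization step.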
Function Prediction Algorithm
■ Pattern Analogue
▪ A pattern obtained by replacing only one node label in a pattern p with a different label
■ Function Prediction
▪ Given k, predicts the function of an unknown protein by projecting each k-node frequent pattern
■ Algorithm
Prediction ( G(V,E), F_{k-1}, F_k, v_u ∈ V )
    F^n_{k-1} ← selecting frequent patterns p^n_{k-1} including v_n ∈ N(v_u)
    P^u_k ← generating patterns p^u_k by extending p^n_{k-1} ∈ F^n_{k-1} to v_u
    if p^i_k ∈ P^u_k is a pattern analogue of p^j_k ∈ F_k
        f ← predicting function of v_u by the most frequent p^j_k
    end if
    return f
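For k = 2 the procedure reduces to something quite small. The sketch below is one possible reading, not the published code: the unknown protein's node is treated as the single label that may differ, so any frequent 2-node pattern that matches a neighbor's label counts as an analogue. The function name `predict_function` and the sample data are hypothetical.

```python
def predict_function(unknown, adj, labels, frequent2):
    """Predict a label for `unknown` from frequent 2-node patterns (k = 2 only).

    adj:       dict node -> set of neighbor nodes
    labels:    dict node -> label (tuple of functions); unknown nodes are absent
    frequent2: dict (label_a, label_b) -> frequency
    """
    best_label, best_freq = None, 0
    for nb in sorted(adj[unknown]):
        nb_label = labels.get(nb)
        if nb_label is None:
            continue  # a neighbor without annotation gives no evidence
        for (a, b), freq in frequent2.items():
            # The pattern is undirected, so try the neighbor on either side.
            for known, candidate in ((a, b), (b, a)):
                if known == nb_label and freq > best_freq:
                    best_label, best_freq = candidate, freq
    return best_label

# Hypothetical data: protein "u" is unannotated, its neighbor "p1" has {f1}.
frequent_patterns = {
    (("f1",), ("f1", "f2")): 3,
    (("f1",), ("f1", "f3")): 2,
}
network = {"u": {"p1"}, "p1": {"u"}}
known_labels = {"p1": ("f1",)}
predicted = predict_function("u", network, known_labels, frequent_patterns)
```

Here the most frequent matching pattern is {f1} - {f1, f2}, so the sketch assigns {f1, f2} to the unknown protein.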
Function Prediction Accuracy
■ Evaluation Design
▪ Leave-one-out cross-validation
▪ Exact match vs. inclusive match
• Exact match: { predicted functions } ≡ { real functions }
• Inclusive match: { predicted functions } ⊆ { real functions }
■ Results
[Figure: two panels plotting accuracy (0.0 to 1.0) against pattern size k (2 to 6) for min. freq. = 20, 40, and 60; annotated values 85% and 86%]
■ Cho and Zhang, IEEE Transactions on Information Technology in Biomedicine (TITB), 2010
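The two match criteria can be written as set comparisons. This is a minimal sketch with assumed names (`exact_match`, `inclusive_match`, `accuracy`). Treating an empty prediction as a miss is an assumption, not something the slide states.

```python
def exact_match(predicted, real):
    """True when the predicted function set equals the real function set."""
    return set(predicted) == set(real)

def inclusive_match(predicted, real):
    """True when every predicted function is a real function.

    Assumption: an empty prediction does not count as a match.
    """
    predicted = set(predicted)
    return bool(predicted) and predicted <= set(real)

def accuracy(pairs, match):
    """Fraction of (predicted, real) pairs accepted by the match criterion."""
    return sum(match(p, r) for p, r in pairs) / len(pairs)

# Hypothetical leave-one-out outcomes: (predicted functions, real functions).
outcomes = [
    ({"f1"}, {"f1", "f2"}),
    ({"f1", "f2"}, {"f1", "f2"}),
    ({"f3"}, {"f1"}),
    (set(), {"f1"}),
]
exact_accuracy = accuracy(outcomes, exact_match)
inclusive_accuracy = accuracy(outcomes, inclusive_match)
```

Every exact match is also an inclusive match, so inclusive accuracy can never be lower than exact accuracy on the same outcomes.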
Protein Complex Prediction from PPI Networks
■ Previous Intuitions
▪ Search for densely connected subgraphs
• Graph clustering algorithms
▪ Limited accuracy
• Exclusion of proteins with low connectivity
■ My Direction
▪ Apply a seed-growth style approach
• An initial seed cluster grows based on a connectivity function
• Searches both the core (dense region) and the periphery (sparse region)
Graph Entropy
■ Graph Entropy
▪ An information-theoretic function of connectivity
▪ General Notations
• $p_i(v)$: probability of v having inner links (edges from v to the vertices in V' of G'(V',E'))
• $p_o(v)$: probability of v having outer links (edges from v to the vertices not in V')
▪ Definition
• Vertex entropy: $e(v) = -p_i(v) \log_2 p_i(v) - p_o(v) \log_2 p_o(v)$
• Graph entropy: $e(G(V,E)) = \sum_{v \in V} e(v)$
■ Graph Entropy Example
Protein Complex Prediction Algorithm
■ Algorithm
▪ Greedy algorithm → Randomness, no parameters
(1) Creates an initial cluster including a vertex (seed) and its neighbors
(2) Iteratively removes vertices on the cluster's inner boundary to decrease graph entropy until it is minimal
(3) Iteratively adds vertices on the cluster's outer boundary to decrease graph entropy until it is minimal
(4) Outputs the cluster, and repeats all steps until no vertex is left
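The entropy definition and a single seed-growth pass can be sketched as follows. Two assumptions are made: p_i(v) is read as the fraction of v's neighbors that lie inside the cluster, and the removal step simply tries every non-seed member. The function names and the toy graph are hypothetical.

```python
import math

def vertex_entropy(v, cluster, adj):
    """e(v) = -p_i log2 p_i - p_o log2 p_o, with p_i the inner-link fraction."""
    nbrs = adj[v]
    if not nbrs:
        return 0.0
    p_in = sum(1 for n in nbrs if n in cluster) / len(nbrs)
    return -sum(p * math.log2(p) for p in (p_in, 1.0 - p_in) if p > 0)

def graph_entropy(cluster, adj):
    """Sum of vertex entropies over all vertices of the graph."""
    return sum(vertex_entropy(v, cluster, adj) for v in adj)

def grow_cluster(seed, adj):
    """One greedy pass: seed plus neighbors, then removals, then additions."""
    cluster = {seed} | adj[seed]
    best = graph_entropy(cluster, adj)
    # Removal step: drop a member only if that lowers graph entropy.
    for v in sorted(adj[seed]):
        trial = cluster - {v}
        entropy = graph_entropy(trial, adj)
        if entropy < best:
            cluster, best = trial, entropy
    # Addition step: add outer-boundary vertices while entropy keeps dropping.
    improved = True
    while improved:
        improved = False
        boundary = {n for v in cluster for n in adj[v]} - cluster
        for v in sorted(boundary):
            trial = cluster | {v}
            entropy = graph_entropy(trial, adj)
            if entropy < best:
                cluster, best = trial, entropy
                improved = True
    return cluster, best

# Hypothetical toy graph: two triangles joined by the bridge c-d.
toy = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
found, entropy = grow_cluster("a", toy)
```

Starting from seed a, the sketch keeps the first triangle and rejects d, because adding d would raise the entropy of e and f. Repeating the pass from new seeds until every vertex is assigned would complete the loop described in step (4).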
Protein Complex Prediction Accuracy
■ Evaluation Design
▪ F-measure
• Harmonic mean of recall and precision
▪ P-value
• Hyper-geometric distribution
■ Results
■ Kenley and Cho, Proteomics, 2011
■ Kenley and Cho, ICDM 2012
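Both evaluation measures are short enough to write out. The sketch below assumes that a predicted cluster is compared with one reference complex, and that the p-value is the upper tail of the hypergeometric distribution. The names `f_measure` and `hypergeom_pvalue` and the sample sets are hypothetical.

```python
from math import comb

def f_measure(predicted, reference):
    """Harmonic mean of precision and recall for two protein sets."""
    overlap = len(predicted & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)

def hypergeom_pvalue(population, annotated, cluster_size, overlap):
    """P(X >= overlap) when cluster_size proteins are drawn at random from a
    population in which `annotated` proteins belong to the reference set."""
    upper = min(annotated, cluster_size)
    tail = sum(
        comb(annotated, i) * comb(population - annotated, cluster_size - i)
        for i in range(overlap, upper + 1)
    )
    return tail / comb(population, cluster_size)

# Hypothetical example: 10 proteins in total, 5 of them in the reference complex.
predicted_cluster = {"a", "b", "c", "d"}
reference_complex = {"b", "c", "d", "e", "f"}
score = f_measure(predicted_cluster, reference_complex)
p_value = hypergeom_pvalue(10, 5, 4, 3)
```

A low p-value means an overlap this large is unlikely to arise by chance, which is why it complements the F-measure.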
Functional Module Detection from PPI Networks
■ Previous Intuitions
▪ Partition PPI networks
• Graph clustering algorithms
▪ Limited accuracy
• Unreliable, complex connections
• Non-overlapping clusters
■ My Direction
▪ Utilize an integrative approach
• Semantic analytics using the genome-wide Gene Ontology (GO) data
▪ Simulate information propagation
• Information propagation from a selected source node through the network
Information Propagation Model
■ Path Strength Measurement
▪ Formula: [equation not recoverable from the extracted slide]
▪ Factors
• Normalized edge weights
• Inverse of path length
• Inverse of node degree
■ Functional Impact Scoring
▪ Functional impact $F_s(v_i)$ of a source $v_s$ on $v_i$: sum of the strengths of all possible paths from $v_s$ to $v_i$
Dynamic Propagation Algorithm
■ Algorithm
▪ Computes the cumulative functional impact of a source $v_s$ on all the other nodes
▪ Repeated random walk simulation starting from a specific node $v_s$
▪ Uses a minimum functional impact threshold
▪ Stops when there are no more updates to the cumulative functional impact of any node
■ Example
[Figure: example network with source node S and the cumulative functional impact values reached on the other nodes]
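The propagation loop can be sketched under a stand-in rule, since the path strength formula itself did not survive extraction. The rule used here is an assumption: a node passes its incoming impact to each neighbor, scaled by the normalized edge weight and divided by the sender's degree, and drops anything below the threshold. The name `propagate`, its parameters, and the toy network are hypothetical.

```python
def propagate(adj, source, min_impact=0.1, max_rounds=100):
    """Accumulate impact from `source` until no message passes the threshold.

    adj: dict node -> dict neighbor -> normalized edge weight in (0, 1]
    Assumed rule (not from the slides): passed = impact * weight / degree.
    """
    cumulative = {v: 0.0 for v in adj}
    frontier = {source: 1.0}
    for _ in range(max_rounds):  # safety cap in case impact never decays
        incoming = {}
        for u, impact in frontier.items():
            degree = len(adj[u])
            for w, weight in adj[u].items():
                passed = impact * weight / degree
                if passed >= min_impact:  # drop negligible impact
                    incoming[w] = incoming.get(w, 0.0) + passed
        if not incoming:  # no node was updated, so stop
            break
        for v, value in incoming.items():
            cumulative[v] += value
        frontier = incoming
    return cumulative

# Hypothetical path network s - a - b with all edge weights 0.5.
chain = {
    "s": {"a": 0.5},
    "a": {"s": 0.5, "b": 0.5},
    "b": {"a": 0.5},
}
impact = propagate(chain, "s")
```

With this rule the impact halves or better at every hop, so the threshold ends the simulation after two rounds on the toy network. The `max_rounds` cap only matters for inputs where impact does not decay.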
Propagation Pattern Mining
■ Process
Functional Module Detection Efficiency
■ Evaluation in Synthetic Networks
▪ Potential Factors
• Number of nodes
• Network density
• Average node degree
■ Results
[Figure, left panel: run time (msec) versus number of nodes (0 to 7000), at constant density and at constant average degree]
[Figure, right panel: run time (msec) versus density (0 to 0.01), at constant number of nodes and at constant average degree]
Functional Module Detection Accuracy
■ Hypothesis
▪ If two proteins have coherent propagation patterns, then they are likely to perform the same functions
■ Results
[Figure: average correlation of functional flow patterns (0 to 0.9) for random pairs from different functions versus random pairs from the same functions]
■ Cho, Hwang, Ramanathan and Zhang, BMC Bioinformatics, 2007
■ Cho, Shi and Zhang, ICDM 2009
Functional Hub Identification from PPI Networks
■ Previous Intuitions
▪ Search for high-degree nodes
▪ Search for centrally located nodes
• Closeness / Betweenness
▪ Limited accuracy
• Unreliable, complex connections
• Unable to find a hierarchy
■ My Direction
▪ Utilize an integrative approach
• Semantic analytics using the genome-wide Gene Ontology (GO) data
▪ Reconstruct the hierarchical structure of proteins
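Functional Similarity Model
■ Path Strength
▪ Formula: [equation not recoverable from the extracted slide]
■ Functional Similarity
▪ k-length path strength: [equation not recoverable from the extracted slide]
▪ Functional similarity: [equation not recoverable from the extracted slide; its definition involves a threshold]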
Network Conversion Algorithm
■ Algorithm
▪ Conversion of a complex PPI network into a hierarchical tree structure of proteins
■ Process
(1) Computes centrality for each node a
(2) Obtains a set of ancestor nodes T(a) of a
(3) Selects a parent node p(a) of a
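The three steps can be sketched once concrete definitions are chosen. The slides do not give these definitions, so the sketch assumes them: centrality is the sum of a node's functional similarities, T(a) is the set of linked nodes with higher centrality than a, and p(a) is the member of T(a) most similar to a. The name `to_tree` and the similarity values are hypothetical.

```python
def to_tree(similarity):
    """Convert a similarity-weighted network into a parent map (a forest).

    similarity: dict node -> dict neighbor -> functional similarity (symmetric)
    """
    # (1) Centrality of each node; assumed to be the sum of its similarities.
    centrality = {a: sum(similarity[a].values()) for a in similarity}
    parent = {}
    for a in similarity:
        # (2) Ancestor candidates T(a): linked nodes that are more central.
        ancestors = [b for b in similarity[a] if centrality[b] > centrality[a]]
        # (3) Parent p(a): the most similar ancestor; roots get None.
        parent[a] = (
            max(ancestors, key=lambda b: similarity[a][b]) if ancestors else None
        )
    return centrality, parent

# Hypothetical symmetric similarities among four proteins.
sim = {
    "a": {"b": 0.9, "c": 0.8, "d": 0.1},
    "b": {"a": 0.9, "c": 0.2},
    "c": {"a": 0.8, "b": 0.2, "d": 0.7},
    "d": {"a": 0.1, "c": 0.7},
}
centrality, parent = to_tree(sim)
```

Every node except the most central one points to a strictly more central node, so the parent links cannot form a cycle.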
Hub Confidence Measurement
■ Algorithm
▪ Quantification of the functional "hubness" of proteins
■ Process
(1) Obtains a set of child nodes D(a) of a
(2) Obtains a set of descendant nodes $L_a$ of a
(3) Computes the hub confidence H(a) of a
Topological Assessment of Functional Hubs
■ Network Vulnerability Test
▪ Random attack: repeatedly disrupt a randomly selected node
▪ Degree-based hub attack: repeatedly disrupt the highest-degree node
▪ Functional hub attack: repeatedly disrupt the node with the highest hub confidence
▪ For each iteration, observe the largest component
■ Results
[Figure: fraction of largest component (0.60 to 1.00) versus number of nodes (0 to 160) for random attack, degree-based hub attack, and structural hub attack]
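The vulnerability test can be reproduced in miniature with a short simulation. In this sketch the attack order is supplied by the caller, and the largest component is reported as a fraction of the original node count. Both choices are assumptions, and the function names and the star network are hypothetical.

```python
def largest_component_size(adj, removed):
    """Size of the largest connected component after deleting `removed`."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:  # iterative depth-first search
            u = stack.pop()
            size += 1
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        best = max(best, size)
    return best

def attack(adj, order):
    """Disrupt nodes in `order`; return the largest-component fraction per step."""
    removed, fractions = set(), []
    for v in order:
        removed.add(v)
        fractions.append(largest_component_size(adj, removed) / len(adj))
    return fractions

# Hypothetical star network: one hub connected to four leaves.
star = {"hub": {"l1", "l2", "l3", "l4"},
        "l1": {"hub"}, "l2": {"hub"}, "l3": {"hub"}, "l4": {"hub"}}
hub_attack = attack(star, ["hub"])
leaf_attack = attack(star, ["l1"])
```

Removing the hub shatters the star into isolated leaves, while removing a leaf barely changes the largest component. The test looks for the same contrast between hub attacks and random attacks.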
Biological Assessment of Functional Hubs
■ Protein Lethality Test
▪ Lethal proteins are determined by knock-out experiments (lethality represents functional essentiality)
▪ Order proteins by degree and by hub confidence
▪ For every 10 proteins, observe the cumulative proportion of lethal proteins
■ Results
[Figure: average lethality (0.4 to 1.0) versus number of hubs (0 to 140) for structural hubs and degree-based hubs]
■ Cho and Zhang, BMC Bioinformatics, 2010
Functional Pathway Identification from PPI Networks
■ Previous Intuitions
▪ Search for the strongest path between a source and a target
→ Computational inefficiency
▪ Limited accuracy
• Difference between signal transduction and interaction
■ My Direction
▪ Utilize an integrative approach
• Semantic analytics using the genome-wide Gene Ontology (GO) data
▪ Measurement of path frequency towards the target node
Frequent Path Mining
■ Notations
▪ $l_x$: a list of length x
▪ $P_k$: a set of all paths of length k between a source and a target
▪ $L_x$: a set of path prefixes of $P_k$ of length x
▪ A selection function f: returns the subset of $P_k$ having a specific prefix $l_x$
■ Frequent Path Mining
▪ Given a path prefix $l_x$, computes the support of the association rule $l_x \rightarrow v_i$
Functional Pathway Prediction Algorithm
■ Algorithm
(1) Mining frequent paths
(2) Selecting multiple successors by an expansion parameter
(3) Was the target selected? If no, repeat from mining frequent paths; if yes, merge the pathways
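The support computation behind the mining step can be sketched as below. The slides do not define support, so the sketch assumes one definition: the share of paths in P_k that begin with the prefix l_x and continue with v_i. The function names and the toy network are hypothetical.

```python
def simple_paths(adj, source, target, k):
    """All simple paths with exactly k edges from source to target (P_k)."""
    paths = []

    def extend(path):
        if len(path) - 1 == k:
            if path[-1] == target:
                paths.append(tuple(path))
            return
        for w in sorted(adj[path[-1]]):
            if w not in path:  # keep the path simple (no repeated nodes)
                extend(path + [w])

    extend([source])
    return paths

def support(paths, prefix, successor):
    """Assumed support of the rule prefix -> successor over the path set."""
    if not paths:
        return 0.0
    x = len(prefix)
    hits = sum(
        1 for p in paths
        if p[:x] == prefix and len(p) > x and p[x] == successor
    )
    return hits / len(paths)

# Hypothetical network with source "s" and target "t".
net = {
    "s": {"a", "b", "c"},
    "a": {"s", "b", "c", "t"},
    "b": {"s", "a", "t"},
    "c": {"s", "a"},
    "t": {"a", "b"},
}
P3 = simple_paths(net, "s", "t", 3)
support_a = support(P3, ("s",), "a")
support_after_sc = support(P3, ("s", "c"), "a")
```

In the algorithm, an expansion parameter would decide how many of the highest-support successors to keep at each step. Enumerating all of P_k, as done here, is only workable on small graphs.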
Efficiency Improvement of Path Mining
■ Preprocessing
▪ Quasi-clustering
• Information propagation from the source node
• Frequent path mining in a subgraph of the PPI network
■ Approximation
▪ Approximate support pre-computation
• Estimation of cycles
• Use of a pruning parameter
Functional Pathway Identification Performance
■ Accuracy
▪ Test on MAP kinase signaling pathways
■ Efficiency
Conclusion
■ Bioinformatics Research Trend
▪ Gene-level → Genome-level
▪ The local → The global
▪ The particular → The universal
■ Functional Characterization Process
▪ Pre-genomic era: sequence → structure → function
▪ Post-genomic era: interaction → network → function
Bioinformatics Program at Baylor
■ Bioinformatics in Computer Science
References
■ My personal webpage: http://web.ecs.baylor.edu/faculty/cho/
■ My lab webpage: http://bionet.ecs.baylor.edu/
■ Baylor, Bioinformatics program webpage: http://www.ecs.baylor.edu/bioinformatics/
■ Baylor, Institute of Biomedical Studies webpage: http://www.baylor.edu/biomedical_studies/