Metodi Numerici per la Bioinformatica

(1)


A.A. 2008/2009

Biclustering

(2)

Outline

• Motivation

• What is Biclustering?

• Why Biclustering and not just Clustering?

Francesco Archetti

• Bicluster Types

• Algorithms


(3)

Motivations

• Gene expression matrices have been extensively analyzed using clustering in one of two dimensions:

– The gene dimension

– The condition dimension

• This corresponds to the:

– Analysis of expression patterns of genes by comparing rows in the matrix.

– Analysis of expression patterns of samples by comparing columns in the matrix.

(4)

Motivations

• Analysis via clustering makes several a priori

assumptions that may not be adequate in all

circumstances:

– Clustering can be applied to either genes or samples,

implicitly directing the analysis to a particular aspect of

the system under study (e.g., groups of patients or

groups of co-regulated genes)

– Clustering algorithms usually seek a disjoint cover of

the set of elements, requiring that no gene or sample

belongs to more than one cluster.

(5)

Motivations

• The results of applying standard clustering techniques to genes are limited, because there are a number of experimental conditions under which the activity of the genes is uncorrelated.

• Many activation patterns are common to a group of genes only under specific experimental conditions.

• Discovering such local expression patterns may be the

key to uncovering many genetic pathways that are not

apparent otherwise.

• It is therefore highly desirable to move beyond the

clustering paradigm and develop approaches capable of

discovering local patterns in microarray data.

(6)

What is Biclustering?

BICLUSTER:

• A submatrix spanned by a set of genes (rows) and a set of samples (columns).

• Given a gene expression matrix, it is possible to characterize the biological phenomena it embodies by a collection of biclusters, each representing a different type of joint behavior of a set of genes in a corresponding set of samples.

(7)

What is Biclustering?

(8)

• Given the matrix A = (X,Y):

– I = subset of rows

– J = subset of columns

• (I,Y) = a subset of rows that exhibit similar behavior across the set of all columns = cluster of rows

• (X,J) = a subset of columns that exhibit similar behavior across the set of all rows = cluster of columns
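The notation above can be illustrated with a minimal sketch. The toy matrix, the index sets and the helper name `submatrix` are all hypothetical and chosen only for illustration; rows stand for genes and columns for conditions.

```python
# Toy expression matrix: 4 genes (rows) x 5 conditions (columns); values are made up.
A = [[float(5 * r + c) for c in range(5)] for r in range(4)]


def submatrix(matrix, rows, cols):
    """Return the submatrix spanned by the given row and column indices."""
    return [[matrix[i][j] for j in cols] for i in rows]


X = list(range(len(A)))       # all rows
Y = list(range(len(A[0])))    # all columns
I = [0, 2]                    # subset of rows (genes)
J = [1, 3, 4]                 # subset of columns (conditions)

row_cluster = submatrix(A, I, Y)   # (I, Y): selected rows, all columns
col_cluster = submatrix(A, X, J)   # (X, J): all rows, selected columns
bicluster = submatrix(A, I, J)     # (I, J): selected rows and selected columns
```

Note that the selected rows and columns do not need to be contiguous in the original matrix.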

(9)

Biclustering Goals:

• Find a set of significant biclusters in a matrix: identify sub-matrices (subsets of rows and subsets of columns) with interesting properties.

• Perform simultaneous clustering on the row and column dimensions of the gene expression matrix instead of clustering the rows and columns separately.

• Gene Expression Data Analysis

• Identify subgroups of genes and subgroups of conditions, where the genes exhibit highly correlated activities for every condition.

(10)

Why Biclustering and not just Clustering?

• Clustering (general models)

– Can be applied to either the rows or the columns of the data matrix, separately.

– Produces either clusters of rows (subgroups of rows) or clusters of columns (subgroups of columns).

• Biclustering (local models)

– Performs simultaneous clustering of both rows and columns of the data matrix.

– Produces biclusters (subgroups of rows and subgroups of columns).

(11)

Why Biclustering and not just Clustering?

Unlike clustering:

• Biclustering identifies groups of genes that show similar activity patterns under a specific subset of the experimental conditions.

Biclustering is the key technique to use when:

• Only a small set of the genes participates in a cellular process of interest.

• An interesting cellular process is active only in a subset of the conditions.

• A single gene may participate in multiple pathways that may or may not be co-active under all conditions.

(12)

Biclustering vs. Clustering

[Figure: genes A to M (rows) against conditions 1 to 10 (columns); the highlighted bicluster spans conditions 1, 2, 3, 5, 7, 10.]

Similarity does not exist over all attributes…

Solution: cluster both rows and columns simultaneously (biclustering).

Bicluster: {1,2,3,5,7,10} {A,B,C,D,E,F}

(13)

Biclustering characteristics

Biclustering algorithms should identify groups of genes and conditions, obeying the following rules:

• A cluster of genes should be defined with respect to only a subset of the conditions.

• A cluster of conditions should be defined with respect to only a subset of the genes.

• The clusters should not be exclusive and/or exhaustive.

• There are no a priori constraints on the organization of biclusters: a gene or condition should be able to belong to more than one bicluster or to no bicluster at all.

• The lack of structural constraints on biclustering solutions allows greater freedom but consequently makes them more vulnerable to overfitting.

• Biclustering algorithms must therefore guarantee that the output biclusters are meaningful, by means of an accompanying statistical model or a heuristic scoring method that defines which of the many possible submatrices represent a significant biological behavior.

(14)

Biclustering: clinical application

• In clinical applications, gene expression analysis is done on tissues

taken from patients with a medical condition. Using such assays,

biologists have identified molecular fingerprints that can help in the

classification and diagnosis of the patient status and guide treatment

protocols.

• The focus is to identify profiles of expression over a subset of the genes that can be associated with clinical conditions and treatment outcomes, where ideally the set of samples is equal in all but the subtype or the stage of the disease.

• However, a patient may be a part of more than one clinical group,

e.g., may suffer from syndrome A, have a genetic background B and

be exposed to environment C.

• Biclustering analysis is thus highly appropriate for identifying and

distinguishing the biological factors affecting the patients along with

the corresponding gene subsets.

(15)

Biclustering:

functional genomics application

• Goal: understand the functions of each of the genes operating in a biological system.

• The rationale is that genes with similar expression patterns are likely to be regulated by the same factors and therefore may share function.

• By collecting expression profiles from many different biological conditions and identifying joint patterns of gene expression among them, researchers have characterized transcriptional programs and assigned putative function to thousands of genes.

• Since genes have multiple functions, and since transcriptional programs are often based on combinatorial regulation, biclustering is highly appropriate for these applications as well.

• An important aspect of gene expression data is their high noise levels: biclustering algorithms should be robust enough to cope with significant levels of noise

(16)

Bicluster Types

An interesting criterion for evaluating a biclustering algorithm is the type of biclusters the algorithm is able to find.

We identified four major classes of biclusters:

1. Biclusters with constant values.

2. Biclusters with constant values on rows or columns.

3. Biclusters with coherent values.

4. Biclusters with coherent evolutions.

(17)

Bicluster Types

• According to the specific properties of each problem:

– One or more of these different types of biclusters are generally considered interesting.

– A different type of merit function should be used to evaluate the quality of the biclusters identified.

• The choice of the merit function is strongly related to the characteristics of the biclusters each algorithm aims at finding.

(18)

Biclusters with constant values

• The simplest biclustering algorithms identify subsets of rows and subsets of columns with constant values.

• A perfect constant bicluster is a sub-matrix (I,J) where all values within the bicluster are equal for all i ∈ I and j ∈ J:

$$a_{ij} = \mu$$

a a a a
a a a a
a a a a
a a a a

• The merit function used to compute and evaluate constant biclusters is, in general, the variance or some metric based on it.

(19)

Biclusters with constant values on rows

• A perfect bicluster with constant rows is a sub-matrix (I,J) where all values within the bicluster can be obtained using one of the following expressions:

$$a_{ij} = \mu + \alpha_i \qquad \text{or} \qquad a_{ij} = \mu \times \alpha_i$$

Additive case:

a a a a
a+i a+i a+i a+i
a+j a+j a+j a+j
a+k a+k a+k a+k

Multiplicative case:

a a a a
a x i a x i a x i a x i
a x j a x j a x j a x j
a x k a x k a x k a x k

• A bicluster with constant values in the rows identifies a subset of genes with similar expression values across a subset of conditions, allowing the expression levels to differ from gene to gene.

Where:

• µ is the typical value within the bicluster

• α_i is the adjustment for row i ∈ I.

(20)

Biclusters with constant values on columns

• A perfect bicluster with constant columns is a sub-matrix (I,J) where all values within the bicluster can be obtained using one of the following expressions:

$$a_{ij} = \mu + \beta_j \qquad \text{or} \qquad a_{ij} = \mu \times \beta_j$$

Additive case:

a a+i a+j a+k
a a+i a+j a+k
a a+i a+j a+k
a a+i a+j a+k

Multiplicative case:

a a x i a x j a x k
a a x i a x j a x k
a a x i a x j a x k
a a x i a x j a x k

• A bicluster with constant values in the columns identifies a subset of conditions within which a subset of genes present similar expression values, assuming that the expression values may differ from condition to condition.

Where:

• µ is the typical value within the bicluster

• β_j is the adjustment for column j ∈ J.

(21)

Biclusters with constant values

on rows or columns

• The straightforward approach to identify non-constant biclusters is to normalize the rows or the columns of the data matrix using the row mean and the column mean, respectively.

• By doing this, the biclusters with constant rows/columns are transformed into constant biclusters before the biclustering algorithm is applied.
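As a small illustration of this normalization, the sketch below centers each row on its own mean. It assumes the additive form a_ij = µ + α_i; the function name `subtract_row_means` and the toy values are hypothetical.

```python
def subtract_row_means(matrix):
    """Center each row by subtracting its own mean (row normalization)."""
    normalized = []
    for row in matrix:
        mean = sum(row) / len(row)
        normalized.append([value - mean for value in row])
    return normalized


# Bicluster with constant rows: a_ij = mu + alpha_i with mu = 1 and alpha = (0, 1, 2).
constant_rows = [
    [1.0, 1.0, 1.0],
    [2.0, 2.0, 2.0],
    [3.0, 3.0, 3.0],
]

# After normalization every entry has the same value, i.e. a constant bicluster.
normalized = subtract_row_means(constant_rows)
```

Column normalization is analogous: subtract the column mean from every entry of each column.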

(22)

Biclusters with coherent values

• A perfect bicluster with coherent values is defined as a subset of rows and a subset of columns whose values are predicted using the following expressions:

– ADDITIVE MODEL:

$$a_{ij} = \mu + \alpha_i + \beta_j$$

a b c d
a+i b+i c+i d+i
a+j b+j c+j d+j
a+k b+k c+k d+k

Where:

• µ is the typical value within the bicluster

• α_i is the adjustment for row i ∈ I

• β_j is the adjustment for column j ∈ J

(23)

Biclusters with coherent values

– MULTIPLICATIVE MODEL:

$$a_{ij} = \mu' \times \alpha'_i \times \beta'_j$$

a b c d
a x i b x i c x i d x i
a x j b x j c x j d x j
a x k b x k c x k d x k

Where:

• µ’ is the typical value within the bicluster

• α’_i is the adjustment for row i ∈ I

• β’_j is the adjustment for column j ∈ J
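The two coherent-value models can be sketched as follows. The function names and parameter values are hypothetical. The last step relies on a standard observation that is not stated on the slide: for strictly positive values, taking logarithms turns the multiplicative model into an additive one.

```python
import math


def additive_bicluster(mu, alpha, beta):
    """a_ij = mu + alpha_i + beta_j"""
    return [[mu + a + b for b in beta] for a in alpha]


def multiplicative_bicluster(mu, alpha, beta):
    """a_ij = mu * alpha_i * beta_j"""
    return [[mu * a * b for b in beta] for a in alpha]


add = additive_bicluster(1.0, [0.0, 1.0, 3.0, 4.0], [0.0, 1.0, 4.0, -1.0])
mul = multiplicative_bicluster(1.0, [1.0, 2.0, 4.0, 3.0], [1.0, 2.0, 0.5, 1.5])

# In log space the multiplicative bicluster becomes additive:
# the difference between two rows is the same in every column.
log_mul = [[math.log(value) for value in row] for row in mul]
row_diffs = [log_mul[1][j] - log_mul[0][j] for j in range(4)]
```

Here every entry of `row_diffs` equals log(2) up to rounding, because the second row is the first row multiplied by 2.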

(24)

Types of Biclusters : examples

Constant values:

1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0

Constant values on rows:

1.0 1.0 1.0 1.0
2.0 2.0 2.0 2.0
3.0 3.0 3.0 3.0
4.0 4.0 4.0 4.0

Constant values on columns:

1.0 2.0 3.0 4.0
1.0 2.0 3.0 4.0
1.0 2.0 3.0 4.0
1.0 2.0 3.0 4.0

Coherent values (additive):

1.0 2.0 5.0 0.0
2.0 3.0 6.0 1.0
4.0 5.0 8.0 3.0
5.0 6.0 9.0 4.0

Coherent values (multiplicative):

1.0 2.0 0.5 1.5
2.0 4.0 1.0 3.0
4.0 8.0 2.0 6.0
3.0 6.0 1.5 4.5

(25)

General additive models

• For every element a_ij:

– The general additive model represents a sum of models.

– Each model represents the contribution of the bicluster B_k to the value of a_ij in case i ∈ I and j ∈ J.

• The general additive model is defined as follows:

$$a_{ij} = \sum_{k=0}^{K} \theta_{ijk}\,\rho_{ik}\,\kappa_{jk}$$

where:

– K is the number of biclusters

– The terms ρ_ik and κ_jk are binary values that represent memberships:

• ρ_ik is the membership of row i in the bicluster k.

• κ_jk is the membership of column j in the bicluster k.

(26)

General additive models

The value of θ_ijk specifies the contribution of each bicluster k and can be one of the following expressions:

• µ_k

• µ_k + α_ik

• µ_k + β_jk

• µ_k + α_ik + β_jk

Representing different types of biclusters:

• Constant Biclusters

• Biclusters with constant rows/columns

• Biclusters with additive model
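A minimal sketch of the general additive model with two overlapping constant biclusters follows. The data layout (a list of layers, each with its member rows, member columns and a `theta` function) is an assumption made for illustration and not a structure prescribed by the slides.

```python
def general_additive(n_rows, n_cols, layers):
    """Sum the bicluster contributions: a_ij = sum_k theta_ijk * rho_ik * kappa_jk."""
    A = [[0.0] * n_cols for _ in range(n_rows)]
    for layer in layers:
        for i in layer["rows"]:        # rho_ik = 1 only for member rows
            for j in layer["cols"]:    # kappa_jk = 1 only for member columns
                A[i][j] += layer["theta"](i, j)
    return A


layers = [
    # Constant bicluster with mu_k = 1 on rows 0-2 and columns 0-2.
    {"rows": [0, 1, 2], "cols": [0, 1, 2], "theta": lambda i, j: 1.0},
    # Constant bicluster with mu_k = 2 on rows 2-3 and columns 2-3.
    {"rows": [2, 3], "cols": [2, 3], "theta": lambda i, j: 2.0},
]

A = general_additive(4, 4, layers)
```

The cell shared by both biclusters, `A[2][2]`, receives the sum of the two contributions, while cells outside every bicluster stay at zero.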

(27)

General additive models:

[Figure: examples of overlapping biclusters under the general additive model, with panels for constant values, constant rows, constant columns and coherent values. In the overlap region each cell is the sum of the contributions of the overlapping biclusters, e.g. 1.0 + 2.0 = 3.0, 3.0 + 5.0 = 8.0 and 4.0 + 6.0 = 10.]

(28)

General multiplicative models

• Similarly, we can also think of a general multiplicative model:

$$a_{ij} = \prod_{k=0}^{K} \theta_{ijk}\,\rho_{ik}\,\kappa_{jk}$$

where:

– K is the number of biclusters

– The terms ρ_ik and κ_jk are binary values that represent memberships:

• ρ_ik is the membership of row i in the bicluster k.

• κ_jk is the membership of column j in the bicluster k.

(29)

General multiplicative models

The value of θ_ijk specifies the contribution of each bicluster k and can be one of the following expressions:

• µ_k

• µ_k x α_ik

• µ_k x β_jk

• µ_k x α_ik x β_jk

Representing different types of biclusters:

• Constant Biclusters

• Biclusters with constant rows/columns

• Biclusters with multiplicative model

(30)

General multiplicative models

[Figure: examples of overlapping biclusters under the general multiplicative model, with panels for constant values, constant rows, constant columns and coherent values. In the overlap region each cell is the product of the contributions of the overlapping biclusters, e.g. 3.0 x 5.0 = 15 and 4.0 x 6.0 = 24.]

(31)

BICLUSTERING

ALGORITHMS

(32)

Algorithms

• Different Objectives

– Identify one bicluster.

– Identify a given number of biclusters.

• Different Approaches

– Discover one bicluster at a time.

– Discover one set of biclusters at a time.

– Discover all biclusters at the same time (simultaneous bicluster identification).

(33)

Algorithms:

• Iterative Row and Column Clustering Combination

– Apply clustering algorithms to the rows and columns of the data matrix, separately.

– Combine the results using some sort of iterative procedure to combine the two cluster arrangements.

• Divide and Conquer:

– Break the problem into several sub-problems that are similar to the original problem but smaller in size.

– Solve the problems recursively.

– Combine the intermediate solutions to create a solution to the

original problem.

– Usually break the matrix into submatrices (biclusters) based on a

certain criterion and then continue the biclustering process on

the new submatrices.

(34)

Algorithms:

• Greedy Iterative Search (e.g., the Cheng & Church algorithm):

– Make a locally optimal choice in the hope that this choice will lead to a globally good solution.

– Usually perform greedy row/column addition/removal.

• Exhaustive Bicluster Enumeration:

– The best biclusters are identified using an exhaustive enumeration of all possible biclusters existent in the data, in exponential time.

(35)

Overview of the Biclustering Algorithms

Method | Published | Cluster Model | Goal
Cheng & Church | ISMB 2000 | Background + row effect + column effect | Minimize mean squared residue of biclusters
Getz et al. (CTWC) | PNAS 2000 | Depending on plugin clustering algorithm | Depending on plugin clustering algorithm
Lazzeroni & Owen (Plaid Models) | Bioinformatics 2000 | Background + row effect + column effect | Minimize modeling error
Ben-Dor et al. (OPSM) | RECOMB 2002 | All genes have the same order of expression values | Minimize the p-values of biclusters
Tanay et al. (SAMBA) | Bioinformatics 2002 | Maximum bounded bipartite subgraph | Minimize the p-values of biclusters
Yang et al. (FLOC) | BIBE 2003 | Background + row effect + column effect | Minimize mean squared residue of biclusters
Kluger et al. (Spectral) | Genome Res. 2003 | Background × row effect × column effect | Finding checkerboard structures

(36)

Overview of the Biclustering Algorithms

Method | Allow overlap? | Bicluster Discovery | Complexity | Testing Data
Cheng & Church | Yes (rare in reality) | One at a time | O(MN) or O(M log N) | Yeast (2884×17), lymphoma (4026×96)
Getz et al. (CTWC) | Yes | One set at a time | Exponential | Leukemia (1753×72), colon cancer (2000×62)
Lazzeroni & Owen (Plaid Models) | Yes | One at a time | Polynomial | Food (961×6), forex (276×18), yeast (2467×79)
Ben-Dor et al. (OPSM) | Yes | All at the same time | O(NM^3 l) | Breast tumor (3226×22)
Tanay et al. (SAMBA) | | | O((N 2^(d+1)) log((r+1)/r) (rd)) | Lymphoma (4026×96), yeast (6200×515)
Yang et al. (FLOC) | | | O((N+M)^2 kp) | Yeast (2884×17)
Kluger et al. (Spectral) | No | All at the same time | Polynomial | Lymphoma (1 rel., 1 abs.), leukemia, breast cell line, CNS embryonal tumor

(37)

Cheng and Church’s Algorithm

• Cheng and Church were the first to introduce biclustering to gene expression analysis.

• Their algorithmic framework represents the biclustering problem as an optimization problem, defining a score for each candidate bicluster and developing heuristics to solve the constrained optimization problem defined by this score function. The constraints force the uniformity of the matrix, and the procedure gives preference to larger submatrices.

• Cheng and Church implicitly assume that (gene, condition) pairs in a “good” bicluster have a constant expression level, plus possibly additive row and column specific effects.

Biclustering of Expression Data

Y. Cheng and G. M. Church, ISMB 2000

(38)

Cheng and Church’s Algorithm

• Model: A bicluster is represented by the submatrix A of the

whole expression matrix (the involved rows and columns need

not be contiguous in the original matrix).

• Each entry a_ij in the bicluster is the summation of:

1. The background level

2. The row (gene) effect

3. The column (condition) effect

• A dataset contains a number of biclusters, which are not

necessarily disjoint.

(39)

Cheng and Church’s Algorithm:residue

• In the matrix A the residue score of element a_ij is given by:

$$R(a_{ij}) = a_{ij} - a_{iJ} - a_{Ij} + a_{IJ}$$

where:

• a_iJ = mean of row i: $a_{iJ} = \frac{1}{|J|}\sum_{j \in J} a_{ij}$

• a_Ij = mean of column j: $a_{Ij} = \frac{1}{|I|}\sum_{i \in I} a_{ij}$

• a_IJ = mean of A: $a_{IJ} = \frac{1}{|I||J|}\sum_{i \in I, j \in J} a_{ij}$

• Biological meaning: the genes have the same (amount of) response to the conditions.

(40)

Cheng and Church’s Algorithm: mean square residue

$$H(I,J) = \frac{1}{|I||J|}\sum_{i \in I, j \in J} (a_{ij} - a_{iJ} - a_{Ij} + a_{IJ})^2 = \frac{1}{|I||J|}\sum_{i \in I, j \in J} R(a_{ij})^2$$

• The mean square residue equals the mean row variance plus the mean column variance, minus the variance of the set of all elements in the bicluster.

• A submatrix A_IJ is called a δ-bicluster if H(I,J) ≤ δ for some δ ≥ 0.

GOAL: find biclusters with low mean squared residue, in particular large and maximal ones with scores below a certain threshold δ.
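The score can be computed directly from its definition. The sketch below is a plain-Python illustration; the function name `mean_squared_residue` and the toy matrices are hypothetical.

```python
def mean_squared_residue(A, I, J):
    """H(I, J): mean of the squared residues over the submatrix (I, J)."""
    row_mean = {i: sum(A[i][j] for j in J) / len(J) for i in I}       # a_iJ
    col_mean = {j: sum(A[i][j] for i in I) / len(I) for j in J}       # a_Ij
    mean = sum(A[i][j] for i in I for j in J) / (len(I) * len(J))     # a_IJ
    total = 0.0
    for i in I:
        for j in J:
            residue = A[i][j] - row_mean[i] - col_mean[j] + mean
            total += residue * residue
    return total / (len(I) * len(J))


# A 2x2 matrix that is not perfectly additive has a positive score.
M = [[1.0, 2.0], [3.0, 5.0]]
h = mean_squared_residue(M, [0, 1], [0, 1])

# A perfectly additive bicluster (rows differ by a constant shift) scores zero.
additive = [[1.0, 2.0, 5.0, 0.0], [2.0, 3.0, 6.0, 1.0], [4.0, 5.0, 8.0, 3.0]]
h_additive = mean_squared_residue(additive, [0, 1, 2], [0, 1, 2, 3])
```

For `M` every residue is ±0.25, so `h` is 0.0625; for the additive matrix the score is zero up to floating-point rounding.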

(41)

Cheng & Church’s algorithm

$$H(I,J) = \frac{1}{|I||J|}\sum_{i \in I, j \in J} (a_{ij} - a_{iJ} - a_{Ij} + a_{IJ})^2 = \frac{1}{|I||J|}\sum_{i \in I, j \in J} R(a_{ij})^2$$

• A score of H(I,J)=0 means that the gene expression levels fluctuate in unison. This includes the trivial case of a constant bicluster whose elements all have a single value.

• With a score of H(I,J)≠0 it is always possible to remove a row or a column to lower the score, until the remaining bicluster becomes constant.

• The global H score gives an indicator of how data fits together within that matrix, i.e. whether it has some coherence or is random:

– A high H value signifies that data is uncorrelated.

– A low H value means that there is a correlation in the matrix.

(42)

Minimum squared residue: example

• If 5 was replaced with 3 then the score would change to: H(M2) = 2.06

• A matrix with elements randomly and uniformly generated in the range [a,b] has an expected score of (b−a)²/12. In this case (a=1, b=12): (12−1)²/12 ≈ 10.08.

(43)

• Constraints:

– 1×M and N×1 matrices always give zero residue. Therefore: find biclusters with maximum sizes, with residues not more than a threshold δ (largest δ-biclusters).

– Constant matrices always give zero residue. Therefore: use the average row variance to evaluate the “interestingness” of a bicluster. Biologically, a high row variance represents genes that have a large change in expression values over different conditions.

(44)

• Objective function for heuristic methods (to minimize):

$$H(I,J) = \frac{1}{|I||J|}\sum_{i \in I, j \in J} (a_{ij} - a_{iJ} - a_{Ij} + a_{IJ})^2 = \frac{1}{|I||J|}\sum_{i \in I, j \in J} R(a_{ij})^2$$

The score is a sum of the components from each row and column, which suggests simple greedy algorithms that evaluate each row and column independently.

(45)

Cheng and Church’s Algorithm

• Greedy approach to rapidly converge to a maximal bicluster.

• In phase I, it removes rows/columns with a large contribution to the mean squared residue score (msr).

• In phase II, rows/columns are added that have a low contribution to the msr, without exceeding δ.

• After a bicluster is identified, its values are randomized to prevent it from showing up again.

(46)

Cheng and Church’s Algorithm

Given the threshold parameter δ, the algorithm runs in two phases:

FIRST PHASE:

• The algorithm removes rows and columns from the full matrix. At each step, where the current submatrix has row set I and column set J, the algorithm examines the set of possible moves.

• For rows it calculates:

$$d(i) = \frac{1}{|J|}\sum_{j \in J} RS_{I,J}(i,j)$$

• For columns it calculates:

$$e(j) = \frac{1}{|I|}\sum_{i \in I} RS_{I,J}(i,j)$$

where RS_{I,J}(i,j) denotes the squared residue of element a_ij in the current submatrix.

• It then selects the highest scoring row or column and removes it from the current submatrix, as long as H(I,J)>δ.

The idea is that rows/columns with a large contribution to the score can be removed with a guaranteed improvement (decrease) in the total mean square residue score.

A possible variation of this heuristic removes at each step all rows/columns with a contribution to the residue score that is higher than some threshold.
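The deletion phase can be sketched as below. This is a simplified illustration rather than the authors' implementation: the function names are hypothetical, ties are broken in favor of rows, and a guard stops the loop when only one row or column is left (an added safeguard against rounding error, since such submatrices have zero residue).

```python
def squared_residues(A, I, J):
    """Squared residue of every element of the submatrix (I, J)."""
    row_mean = {i: sum(A[i][j] for j in J) / len(J) for i in I}
    col_mean = {j: sum(A[i][j] for i in I) / len(I) for j in J}
    mean = sum(A[i][j] for i in I for j in J) / (len(I) * len(J))
    return {(i, j): (A[i][j] - row_mean[i] - col_mean[j] + mean) ** 2
            for i in I for j in J}


def single_node_deletion(A, delta):
    """Remove the worst row or column until H(I, J) <= delta."""
    I = list(range(len(A)))
    J = list(range(len(A[0])))
    while True:
        rs = squared_residues(A, I, J)
        h = sum(rs.values()) / (len(I) * len(J))
        if h <= delta or len(I) == 1 or len(J) == 1:
            return I, J, h
        d = {i: sum(rs[i, j] for j in J) / len(J) for i in I}   # row scores d(i)
        e = {j: sum(rs[i, j] for i in I) / len(I) for j in J}   # column scores e(j)
        worst_row = max(d, key=d.get)
        worst_col = max(e, key=e.get)
        if d[worst_row] >= e[worst_col]:
            I.remove(worst_row)
        else:
            J.remove(worst_col)


# Three perfectly additive rows plus one row that does not follow the pattern.
A = [[1.0, 2.0, 3.0],
     [2.0, 3.0, 4.0],
     [3.0, 4.0, 5.0],
     [9.0, 0.0, 7.0]]
I, J, h = single_node_deletion(A, delta=0.1)
```

On this toy matrix the outlier row has the largest score d(i) and is removed first, after which the remaining 3×3 submatrix is perfectly additive.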

(47)

Cheng and Church’s Algorithm

SECOND PHASE:

• Goal: increase the matrix size without crossing the threshold δ.

To this end, rows and columns are added using the same scoring scheme, but this time looking for the lowest square residues d(i), e(j) at each move, and terminating when none of the possible moves increases the matrix size without crossing the threshold δ.

Upon convergence, the algorithm outputs a submatrix with low mean residue and locally maximal size.

To discover more than one bicluster, Cheng and Church suggested repeated application of the biclustering algorithm on modified matrices. The modification includes randomization of the values in the cells of the previously discovered biclusters, preventing the correlative signal in them from being beneficial for any other bicluster in the matrix. This has the obvious effect of precluding the identification of biclusters with significant overlaps.

(48)

Evolutionary bicluster

• Binary encoding for rows/columns

• Fitness:

– mean squared residue

– row variance

– large volume

– penalty (exponential)

• Typical genetic operators

Evolutionary Biclustering of Gene Expressions

H. Banka and S. Mitra, ACM Ubiquity, 7 (42), 2006

(49)

Genetic Algorithms

- a brief introduction -

• The idea of the genetic algorithm (GA) was first introduced by John Holland in the early 1970s.

• It is based on an adaptive global search heuristic inspired by natural evolution and genetics, with a survival-of-the-fittest strategy.

• It is a stochastic, population-based search strategy that works on the biological mechanisms of natural selection, crossover, and mutation.

• GAs are executed iteratively on a set of coded solutions, called the population, with the three basic operators: selection, crossover, and mutation.

• To solve a problem, a GA starts with a set of encoded random solutions (i.e., chromosomes) and evolves better sets of solutions over generations (iterations) by applying the basic GA operators.

• Better solutions are determined from objective values (fitness functions) that determine the suitability of the solutions for reproduction. Hence better solutions are selected, whereas the bad ones are eliminated from the population at each generation.

(50)

Simple Genetic Algorithm

{
    initialize population;
    evaluate population;
    while Termination Criteria Not Satisfied
    {
        select parents for reproduction;
        perform recombination and mutation;
        evaluate population;
    }
}
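The loop above can be turned into a small runnable sketch. The concrete choices here are assumptions made for illustration and are not specified on the slide: binary tournament selection, one-point crossover, independent bit-flip mutation, a fixed number of generations as termination criterion, and a toy fitness (the number of ones in the string).

```python
import random


def simple_ga(fitness, n_bits, pop_size=20, generations=30, p_mut=0.05, seed=0):
    """Minimal generational GA on bit strings; returns the best individual seen."""
    rng = random.Random(seed)
    # Initialize and evaluate the population.
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):  # termination criterion: fixed budget
        children = []
        while len(children) < pop_size:
            # Select parents for reproduction (binary tournament).
            p1 = max(rng.sample(pop, 2), key=fitness)
            p2 = max(rng.sample(pop, 2), key=fitness)
            # Recombination (one-point crossover).
            cut = rng.randint(1, n_bits - 1)
            child = p1[:cut] + p2[cut:]
            # Mutation (flip each bit with probability p_mut).
            child = [1 - b if rng.random() < p_mut else b for b in child]
            children.append(child)
        pop = children
        # Evaluate the new population and remember the best solution so far.
        best = max(pop + [best], key=fitness)
    return best


best = simple_ga(sum, n_bits=16)
```

Because the generator is seeded, repeated calls with the same arguments return the same individual.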

(51)

Evolutionary biclustering:

Representation

• An encoded solution representing a bicluster:

– Each bicluster is represented by a fixed sized binary string called chromosome or individual, with a bit string for genes appended by another bit string for conditions.

– The chromosome corresponds to a solution for this optimal bicluster generation problem.

– A bit is set to one if the corresponding gene and/or condition is present in the bicluster, and reset to zero otherwise.
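The encoding can be illustrated with a short decoding routine. The function name `decode` and the example chromosome are hypothetical.

```python
def decode(chromosome, n_genes):
    """Split a binary chromosome into selected gene and condition indices."""
    gene_bits = chromosome[:n_genes]
    condition_bits = chromosome[n_genes:]
    genes = [i for i, bit in enumerate(gene_bits) if bit == 1]
    conditions = [j for j, bit in enumerate(condition_bits) if bit == 1]
    return genes, conditions


# 5 gene bits followed by 3 condition bits.
chromosome = [1, 0, 1, 1, 0, 0, 1, 1]
genes, conditions = decode(chromosome, n_genes=5)
```

Here the chromosome encodes the bicluster made of genes 0, 2, 3 and conditions 1, 2.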

(52)

Evolutionary biclustering:

fitness function

• Goal:

generating maximal set of genes and conditions while

maintaining the “homogeneity” of the biclusters

• Maximize:

Multi-objective optimization

• where:

– g and c are the number of ones in the genes and conditions within the bicluster

– G(g, c) is its mean squared residue score

– δ is the user-defined threshold for the maximum acceptable dissimilarity or mean squared residue score of the bicluster

– G and C are the total number of genes and conditions of the original gene expression array

(53)

Evolutionary biclustering:

Local search

• Since the initial biclusters are generated randomly, it may happen

that some irrelevant genes and/or conditions get included in spite of

their expression values lying far apart in the feature space.

• An analogous situation may also arise during crossover and mutation

in each generation.

• These genes and conditions, with dissimilar values, need to be

eliminated deterministically.

• Furthermore, for good biclustering, some genes and/or conditions

having similar expression values need to be incorporated as well.

• The algorithm starts with a given bicluster and an initial gene

expression array (G,C).

• The irrelevant genes or conditions having mean squared residue above (or below) a certain threshold are now selectively eliminated (or added) using some conditions.

(54)

Evolutionary biclustering:

• Domination: if there are M objective functions, a solution x(1) is said to dominate another solution x(2) if both of the following conditions hold: x(1) is no worse than x(2) in all the M objective functions, and x(1) is strictly better than x(2) in at least one of the M objective functions.

• Crowding distance: this assigns the highest value to the boundary solutions and, to every other solution i, the average distance of the two solutions [(i+1)th and (i−1)th] on either side of solution i along each of the objectives.

• Crowding selection: a solution i wins a tournament with another solution j if:

– solution i has better rank, i.e., r_i < r_j.

– both the solutions are in the same front, i.e., r_i = r_j, but solution i is less densely located in the search space, i.e., d_i > d_j.
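The dominance test and the tournament rule can be sketched as follows. The sketch assumes that every objective is to be maximized, that ranks are integers with lower meaning better, and that the crowding distances are already computed; the function names are hypothetical.

```python
def dominates(x, y):
    """True if objective vector x dominates y (all objectives maximized)."""
    no_worse = all(a >= b for a, b in zip(x, y))
    strictly_better = any(a > b for a, b in zip(x, y))
    return no_worse and strictly_better


def wins_tournament(rank_i, dist_i, rank_j, dist_j):
    """Crowded comparison: better rank wins; equal rank is decided by distance."""
    return rank_i < rank_j or (rank_i == rank_j and dist_i > dist_j)
```

For example, `dominates((2, 3), (2, 2))` holds, while two solutions that each win on a different objective, such as `(3, 1)` and `(1, 3)`, do not dominate each other.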

(55)

Evolutionary biclustering:

The algorithm

The main steps of the proposed algorithm, repeated over a specified number of generations, are:

1. Generate a random population of size P.

2. Delete or add multiple nodes (genes and conditions) from each individual of the population.

3. Calculate the multi-objective fitness functions f1 and f2.

4. Rank the population using the dominance criteria.

5. Calculate crowding distance.

6. Perform selection using crowding tournament selection.

7. Perform crossover and mutation (as in conventional GA) to generate offspring population of size P.

8. Combine parent and offspring population.

9. Rank the mixed population using dominance criteria and crowding distance, as above.

10.Replace the parent population by the best |P| members of the combined population.

(56)

Biclustering advantages

1. Automatically selects genes and conditions with more coherent measurements.

2. Groups items based on a similarity measure that depends on a context, which is best defined as a subset of the attributes. It discovers not only the grouping, but the context as well. To some extent, these two become inseparable and exchangeable, which is a major difference between biclustering and clustering rows after clustering columns.

3. Allows rows and columns to be included in multiple biclusters, and thus allows one gene or one condition to be identified by more than one function category. This added flexibility correctly reflects the reality in the functionality of genes and overlapping factors in tissue samples and experiment conditions.

(57)

Biclustering: observations

• The algorithms presented demonstrate some of the approaches

developed for the identification of bicluster patterns in large

matrices, and in gene expression matrices in particular.

• The different methods can be classified:

a) By their model and scoring schemes

b) By the type of algorithm used for detecting biclusters

(58)

Biclustering: models and score

• To ensure that the biclusters are statistically significant, each of the biclustering methods defines a scoring scheme to assess the quality of candidate biclusters, or a constraint that determines which submatrices represent significant bicluster behavior.

• Constraint based methods: search for gene (property) sets that define “stable” subsets of properties.

Algorithms: iterative signature algorithm, the coupled two-way clustering method and the spectral algorithm of Kluger et al.

• Scoring based methods : rely on a background model for the data. The basic model assumes that biclusters are essentially uniform submatrices and scores them according to their deviation from such uniform behavior. More

elaborate models allow different distributions for each condition and gene, usually in a linear way.

Algorithms: the Cheng-Church algorithm and the Plaid model.

(59)

Biclustering: algorithmic approaches

• The algorithmic approaches for detecting biclusters given the

data are greatly affected by the type of score/constraint model

in use:

– Several algorithms alternate between phases of gene set and condition set optimization (the iterative signature algorithm and the coupled two-way clustering algorithm).

– Others use standard linear algebra or optimization algorithms to solve key subproblems (the Plaid model and the Spectral algorithm).

– A heuristic hill climbing algorithm is used in the Cheng-Church algorithm.

(60)

Research Opportunities

Many issues in biclustering algorithm design also remain open and

should be addressed by the scientific community:

– Propose other bicluster models.

– Based on the current models, propose new algorithms that improve bicluster quality (validated statistically or biologically) and/or time complexity.

– Combine the strength of multiple studies.

– Investigate the effects of normalization on the models/algorithms.

– Compare the different methods on some other real datasets.

– Make better use of domain knowledge.