• No results found

A parallel algorithm for the extraction of structured motifs

N/A
N/A
Protected

Academic year: 2021

Share "A parallel algorithm for the extraction of structured motifs"

Copied!
86
0
0

Loading.... (view fulltext now)

Full text

(1)

A parallel algorithm for the extraction of structured motifs

Alexandra Carvalho

MEIC 2002/03

(2)

Plan of the talk

Biological model of regulation Nucleic acids: DNA and RNA

Classification of living organisms: Prokaryotes and Eukaryotes Transcription and Translation

Promoter and Regulatory Sequences Computational model of regulation

Suffix tree and generalized suffix tree

Single models extraction

[M.-F. Sagot, Latin, 1998]

Structured models extraction

[L. Marsan and M.-F. Sagot, J. Computational Biology, 2000]

Parallelization

[A. Carvalho, A. Freitas, A. Oliveira and M.-F. Sagot, submitted, 2003]

The PARTITION UP TO ε problem

· The SimpleCut algorithm The tree partition problem

· The PSMILE algorithm

Experimental results

(3)

Nucleic acids

Nucleotides:

storage and retrieval of biological information

building blocks for the construction of nucleic acids

(4)

Nucleic acids

Nucleotides:

storage and retrieval of biological information

building blocks for the construction of nucleic acids

Two main types of nucleic acids:

DNA - DeoxyriboNucleic Acid:

contain the bases A, C, G, and T double-stranded molecule

RNA - RiboNucleic Acid:

contain the bases A, C, G, and U

single-stranded molecule

(5)

Classification of living organisms

Prokaryotes:

Greek words: pro"before"and karyon ≡ "nucleus"

bacteria and prokaryotes are generally used interchangeably most prokaryotes live as single-celled organisms

Eukaryotes:

Greek words: eu"well"and karyon ≡ "nucleus"

yeast is an eukaryotic single-celled organism

(6)

Transcription and Translation

(7)

Promoter and Regulatory Sequences

(8)

Structured motifs

Definition.

model

A model is an element in Σ

+

.

Definition.

structured model

A structured model is a pair (m, d) where:

m = (m

i

)

1≤i≤p

, denoting the p boxes

d = (d

mini

, d

maxi

, δ

i

)

1≤i≤p−1

, denoting the p − 1 intervals of distance

(9)

Structured motifs

Definition.

model

A model is an element in Σ

+

.

Definition.

structured model

A structured model is a pair (m, d) where:

m = (m

i

)

1≤i≤p

, denoting the p boxes

d = (d

mini

, d

maxi

, δ

i

)

1≤i≤p−1

, denoting the p − 1 intervals of distance

m1 k1

m2 k2

mp

kp p boxes

[d1min,d1max]

e2 subst.

e1 subst. ep subst.

(10)

Structured motifs

Definition.

e -occurrence

A model m e-occurs in the input sequences if exists u in the input sequences such that

HammingDistance(m, u) ≤ e (minimum number of substitutions to transform u into m).

(11)

Structured motifs

Definition.

e -occurrence

A model m e-occurs in the input sequences if exists u in the input sequences such that HammingDistance(m, u) ≤ e (minimum number of substitutions to transform u into m).

Definition.

valid model, quorum

A model is valid if e-occurs in at least q input sequences, where q is called the quorum.

(12)

Structured motifs

Definition.

e -occurrence

A model m e-occurs in the input sequences if exists u in the input sequences such that HammingDistance(m, u) ≤ e (minimum number of substitutions to transform u into m).

Definition.

valid model, quorum

A model is valid if e-occurs in at least q input sequences, where q is called the quorum.

k1=[6,8] k2=[6,8]

[16,18]

e1=2 e2=1

(13)

Structured motifs

Definition.

e -occurrence

A model m e-occurs in the input sequences if exists u in the input sequences such that HammingDistance(m, u) ≤ e (minimum number of substitutions to transform u into m).

Definition.

valid model, quorum

A model is valid if e-occurs in at least q input sequences, where q is called the quorum.

k1=[6,8] k2=[6,8]

[16,18]

e1=2 e2=1

TTGACA TATAAT

TTGACT TTGACA TTGCCA

TTGTCT

TAAAAT TATAAA

TATTAT TATAAT

17 17

too far 16

(14)

Input sequences

>strand + guaB inositol-monophosphate dehydrogenas

CTTTCCGTTATCTAAATATTTCAACTCTTTCCCGCTTCCTTGACATGCTCTTGGCTAGTTGATAATCT ACATATAATATTTTGCCGAAAA

>strand - yaaC yaaC

TTTTCGGCAAAATATTATATGTAGATTATCAACTAGCCAAGAGCATGTCAAGGAAGCGGGAAAGAGTT GAAATATTTAGATAACGGAAAG

>strand + yaaJ similar to hypothetical proteins

CCGTTTCAGTTATAGTTAATATGTAGCCTTTTTAGGCAATGAAAAAACTTTGAAA

>strand - yaaI similar to isochorismatase

TTTCAAAGTTTTTTCATTGCCTAAAAAGGCTACATATTAACTATAACTGAAACGG

>strand + metS methionyl-tRNA synthetase

ATTTTATAAATATTTAATAAAGCTATTATCCTACTAAAAATCCTTTTAAATCAAGACTTTCGAACCAA

AGTTTTTTATTTCATTTGATTATATACGACAAAATTCGACACGAACAGACTTTTTTTATTTTCATTAA

AGATTTTTAATTTTAATTATTCTTTTTCAGGGCGTATGTATATATTCTTGATCTTAAAGGCTAAGATG

GTATCATAGATAAAGGATAAATATAAATAATATTCATATATGATTTGCACTTATCGCCGCTCTCGTCC

TTTGGGCGGGAGCTTTTTGACATTCTGA

(15)

Suffix Tree

Definition.

Suffix tree

A suffix tree of a n-character string S is a rooted directed tree with exactly n leaves:

leaves are numbered 1 to n

each internal node has at least two children

each edge is labeled with a nonempty substring of S

no two edges out of a node can have edge-labels beginning with the same character

The key feature of the suffix tree is that for any leaf i, the label of the path from the root to the leaf i exactly spells out the suffix of S that starts at position i.

Weiner, IEEE Symposium on Switching and Automata Theory, 1973 Ukkonen, Algorithmica, 1995

Theorem.

Suffix trees can be built in linear-time.

(16)

Suffix Tree

Suffix tree for the string TACTA$

(17)

Suffix Tree

Suffix tree for the string TACTA$

A

C

T

A

$ T

(18)

Suffix Tree

Suffix tree for the string TACTA$

A C

T A

$

A

C

T

A

$ T

(19)

Suffix Tree

Suffix tree for the string TACTA$

A C

T A

$

A

C

T

A

$

C T

T A

$

(20)

Suffix Tree

Suffix tree for the string TACTA$

A C

T A

$

A

C

T

A

$

C T

T A

$

$

(21)

Suffix Tree

Suffix tree for the string TACTA$

A C

T A

$

A

C

T

A

$

C T

T A

$

$

$

(22)

Suffix Tree

Suffix tree for the string TACTA$

A C

T A

$

A

C

T

A

$

C T

T A

$

$

$

$

(23)

Generalized Suffix Tree

Suffix tree for the strings TACTA$ and CACTCA#

(24)

Generalized Suffix Tree

Suffix tree for the strings TACTA$ and CACTCA#

A C

T A

$

A

C

T

A

$

C T

T A

$

$

$

$

(25)

Generalized Suffix Tree

Suffix tree for the strings TACTA$ and CACTCA#

A C

T A

$

A

C

T

A

$

C T

T A

$

$

$

A

C

T

C

A

#

$

(26)

Generalized Suffix Tree

Suffix tree for the strings TACTA$ and CACTCA#

A C

T A

$

A

C

A

$

C T

T A

$

$

$

$

A

C

T

C

A

# T

C A

#

(27)

Generalized Suffix Tree

Suffix tree for the strings TACTA$ and CACTCA#

A C

T A

$

A

C

A

$

C T

T A

$

$

$

$

A

C

T

C

A

# T

C A

# C

A

#

(28)

Generalized Suffix Tree

Suffix tree for the strings TACTA$ and CACTCA#

A C

T A

$

A

C

A

$

C T

T A

$

$

$

$

A

C

T

C

A

#

C A #

T C

A

# C

A

#

(29)

Generalized Suffix Tree

Suffix tree for the strings TACTA$ and CACTCA#

A C

T A

$

A

C

A

$

C T

T A

$

$

$

$

A

C

T

C

A

#

C A #

T C

A

# C

A

# #

(30)

Generalized Suffix Tree

Suffix tree for the strings TACTA$ and CACTCA#

A C

T A

$

A

C

A

$

C T

T A

$

$

$,#

A

C

T

C

A

#

C A #

T C

A

# C

A

# #

$

(31)

Generalized Suffix Tree

Suffix tree for the strings TACTA$ and CACTCA#

A C

T A

$

A

C

A

$

C T

T A

$

$

$,#

A

C

T

C

A

#

C A #

T C

A

# C

A

# #

$,#

(32)

Generalized Suffix Tree with Colors

Suffix tree for the strings TACTA$ and CACTCA#

A C

T A

$

A

C

A

$

C T

T A

$

$

$,#

A

C

T

C

A

#

C A #

T C

A

# C

A

# #

$,#

[1,1]

[1,1]

[1,1]

[1,1]

[0,1]

[1,0] [0,1]

[1,0]

[1,1]

[0,1]

[0,1]

[1,0]

[1,0]

[1,1] [0,1]

[1,1]

[1,0]

[1,1]

(33)

Extraction of Single Models

Definition.

e -node-occurrence

A e-node-occurrence of a model m is represented by a pair (v, e

v

) where v is a tree node and e

v

≤ e is the Hamming distance between the label of the path from the root to v and m.

Notation.

ν (e, k)

The number of distinct words at Hamming distance at most e from a k-long word:

ν(e, k) =

e

X

i=0

 k

i

 (|Σ| − 1)

i

≤ k

e

|Σ|

e

.

Notation.

n k

The number of tree nodes at depth k of a suffix tree.

(34)

Extraction of Single Models

M.-F. Sagot, Latin, 1998

k = 2 Input sequences: TAC and CCC

e = 1 q = 2

A A A A

A

A C

C C C C C C

G

G T G T G T G T

T T

A C

[1,1]

[0,1] [1,0]

[1,0]

(35)

Extraction of Single Models

M.-F. Sagot, Latin, 1998

k = 2 Input sequences: TAC and CCC

e = 1 q = 2

[1,1]

A A A A

A

A C

C C C C C C

G

G T G T G T G T

T T

A C

[1,0] [0,1] [1,0]

(A,0); (C,1); (T,1)

(36)

Extraction of Single Models

M.-F. Sagot, Latin, 1998

k = 2 Input sequences: TAC and CCC

e = 1 q = 2

[1,1]

A A A A

A

A C

C C C C C C

G

G T G T G T G T

T T

A C

[1,0] [0,1] [1,0]

(A,0); (C,1); (T,1)

(AC,1); (TA,1)

(37)

Extraction of Single Models

M.-F. Sagot, Latin, 1998

k = 2 Input sequences: TAC and CCC

e = 1 q = 2

[1,1]

A A A

A

A C

C C C C C C

G

G T G T G T G T

T T

A C

[1,0] [0,1] [1,0]

(A,0); (C,1); (T,1)

(38)

Extraction of Single Models

M.-F. Sagot, Latin, 1998

k = 2 Input sequences: TAC and CCC

e = 1 q = 2

[1,1]

A A A

A

A C

C C C C C C

G

G T G T G T G T

T T

A C

[1,0] [0,1] [1,0]

(A,0); (C,1); (T,1)

(AC,0); (CC,1)

(39)

Extraction of Single Models

M.-F. Sagot, Latin, 1998

k = 2 Input sequences: TAC and CCC

e = 1 q = 2

[1,1]

A A A

A

A C

C C C C C

G

G T G T G T G T

T T

A C

[1,0] [0,1] [1,0]

(A,0); (C,1); (T,1) C

(AC,0)

(CC,1)

(40)

Extraction of Single Models

M.-F. Sagot, Latin, 1998

k = 2 Input sequences: TAC and CCC

e = 1 q = 2

[1,1]

A A A

A

A C

C C C C C

G

G T G T G T G T

T T

A C

[1,0] [0,1] [1,0]

(A,0); (C,1); (T,1) C

(AC,1) (AC,0)

(CC,1)

(41)

Extraction of Single Models

M.-F. Sagot, Latin, 1998

k = 2 Input sequences: TAC and CCC

e = 1 q = 2

[1,1]

A A A

A

A C

C C C C C

G

G G G

T T T T

T T

A C

[1,0] [0,1] [1,0]

(A,0); (C,1); (T,1) C

(AC,0)

(CC,1)

(42)

Extraction of Single Models

M.-F. Sagot, Latin, 1998

k = 2 Input sequences: TAC and CCC

e = 1 q = 2

[1,1]

A A A

A

A C

C C C C C

G

G G G

T T T T

T T

A C

[1,0] [0,1] [1,0]

(A,0); (C,1); (T,1) C

(AC,1) (AC,0)

(CC,1)

(43)

Extraction of Single Models

M.-F. Sagot, Latin, 1998

k = 2 Input sequences: TAC and CCC

e = 1 q = 2

[1,1]

A A A

A

A C

C C C C C

G

G T G T G T

T T

A C

[1,0] [0,1] [1,0]

(A,0); (C,1); (T,1) C

(AC,0)

(CC,1)

(44)

Extraction of Single Models

M.-F. Sagot, Latin, 1998

k = 2 Input sequences: TAC and CCC

e = 1 q = 2

[1,1]

A A A

A

A C

C C C C C

G

G T G T G T

T T

A C

[1,0] [0,1] [1,0]

C

(AC,0)

(CC,1)

(45)

Extraction of Single Models

M.-F. Sagot, Latin, 1998

k = 2 Input sequences: TAC and CCC

e = 1 q = 2

[1,1]

A A A

A

A C

C C C C C

G

G T G T G T

T T

A C

[1,0] [0,1] [1,0]

C

(A,1); (C,0); (T,1) (AC,0)

(CC,1)

(46)

Extraction of Single Models

M.-F. Sagot, Latin, 1998

k = 2 Input sequences: TAC and CCC

e = 1 q = 2

[1,1]

A A A

A

A C

C C C C C

G

G T G T G T

T T

A C

[1,0] [0,1] [1,0]

C

(A,1); (C,0); (T,1) (CC,1); (TA,1) (AC,0)

(CC,1)

(47)

Extraction of Single Models

M.-F. Sagot, Latin, 1998

k = 2 Input sequences: TAC and CCC

e = 1 q = 2

[1,1]

A A A

A

A C

C C C C C

G

G T G T G T

T T

A C

[1,0] [0,1] [1,0]

C

(A,1); (C,0); (T,1) (AC,0)

(CC,1)

(CC,1)

(TA,1)

(48)

Extraction of Single Models

M.-F. Sagot, Latin, 1998

k = 2 Input sequences: TAC and CCC

e = 1 q = 2

[1,1]

A A A

A

A C

C C C C C

G

G T G T G T

T T

A C

[1,0] [0,1] [1,0]

C

(A,1); (C,0); (T,1) (AC,1); (CC,0) (AC,0)

(CC,1)

(CC,1)

(TA,1)

(49)

Extraction of Single Models

M.-F. Sagot, Latin, 1998

k = 2 Input sequences: TAC and CCC

e = 1 q = 2

[1,1]

A A A

A

A C

C C C C C

G

G T G T G T

T T

A C

[1,0] [0,1] [1,0]

C

(A,1); (C,0); (T,1) (AC,0)

(CC,1)

(CC,1) (TA,1)

(AC,1)

(CC,0)

(50)

Extraction of Single Models

M.-F. Sagot, Latin, 1998

k = 2 Input sequences: TAC and CCC

e = 1 q = 2

[1,1]

A A A

A

A C

C C C C C

G

G T G T G T

T T

A C

[1,0] [0,1] [1,0]

C

(A,1); (C,0); (T,1) (CG,1)

(AC,0) (CC,1)

(CC,1) (TA,1)

(AC,1)

(CC,0)

(51)

Extraction of Single Models

M.-F. Sagot, Latin, 1998

k = 2 Input sequences: TAC and CCC

e = 1 q = 2

[1,1]

A A A

A

A C

C C C C C

G

G G

T T T

T T

A C

[1,0] [0,1] [1,0]

C

(A,1); (C,0); (T,1) (AC,0)

(CC,1)

(CC,1) (TA,1)

(AC,1)

(CC,0)

(52)

Extraction of Single Models

M.-F. Sagot, Latin, 1998

k = 2 Input sequences: TAC and CCC

e = 1 q = 2

[1,1]

A A A

A

A C

C C C C C

G

G G

T T T

T T

A C

[1,0] [0,1] [1,0]

C

(A,1); (C,0); (T,1) (CT,1)

(AC,0) (CC,1)

(CC,1) (TA,1)

(AC,1)

(CC,0)

(53)

Extraction of Single Models

M.-F. Sagot, Latin, 1998

k = 2 Input sequences: TAC and CCC

e = 1 q = 2

[1,1]

A A A

A

A C

C C C C C

G

G T G T

T T

A C

[1,0] [0,1] [1,0]

C

(A,1); (C,0); (T,1) (AC,0)

(CC,1)

(CC,1) (TA,1)

(AC,1)

(CC,0)

(54)

Extraction of Single Models

M.-F. Sagot, Latin, 1998

k = 2 Input sequences: TAC and CCC

e = 1 q = 2

[1,1]

A A A

A

A C

C C C C C

G

G T G T

T T

A C

[1,0] [0,1] [1,0]

C (AC,0) (CC,1)

(CC,1) (TA,1)

(AC,1)

(CC,0)

(55)

Extraction of Single Models

M.-F. Sagot, Latin, 1998

k = 2 Input sequences: TAC and CCC

e = 1 q = 2

[1,1]

A

A

A C

C C C C C

G T T

A C

[1,0] [0,1] [1,0]

C (AC,0) (CC,1)

(CC,1) (TA,1)

(AC,1) (CC,0)

(AC,1) (CC,1)

(AC,1)

(CC,1)

(56)

Extraction of Single Models

M.-F. Sagot, Latin, 1998

k = 2 Input sequences: TAC and CCC

e = 1 q = 2

[1,1]

A

A

A C

C C C C C

G T T

A C

[1,0] [0,1] [1,0]

C (AC,0) (CC,1)

(CC,1) (TA,1)

(AC,1) (CC,0)

(AC,1) (CC,1)

(AC,1) (CC,1)

Proposition.

The single motifs extraction takes O(N n

k

ν (e, k)) time.

(57)

Extraction of Structured Models: SMILE

L. Marsan and M.-F. Sagot, Journal of Computational Biology, 2000

ExtractModels (

Model

m

, Block

i)

1. for each node-occurrence v of m 2. if ( i > 1 )

3. put in P otencialStarts the children of v at levels (i − 1)k + (i − 1)d

mini−1

to (i − 1)k + (i − 1)d

maxi−1

4. else

5. put v in P otencialStarts

6. for each model m

i

obtained by doing a recursive depth-first traversal from the root of the virtual model tree M while simultaneously traversing T from the node-occurrences in P otencialStarts

7. if ( i < p )

8. ExtractModels ( m = m

1

. . . m

i

, i + 1 )

(58)

Extraction of Structured Models: SMILE

L. Marsan and M.-F. Sagot, Journal of Computational Biology, 2000

p = 2 Input sequences: ACTGAA and CACGTA

k

1

= 2, d = 1, k

2

= 2 e

1

= 1 , e

2

= 1

q = 2

[1,1]

[1,1]

[0,1] [0,1] [1,0]

A

A

A

[1,0]

C

C T

T

T

G G

A

A A [1,1]

G T C G

A A A A

A C

C C C C

G

G T G T G T G T

T

A A A A

A C

C C C C

G

G T G T G T G T

T

(59)

Extraction of Structured Models: SMILE

L. Marsan and M.-F. Sagot, Journal of Computational Biology, 2000

p = 2 Input sequences: ACTGAA and CACGTA

k

1

= 2, d = 1, k

2

= 2 e

1

= 1 , e

2

= 1

q = 2

[1,1]

[1,1]

[0,1] [0,1] [1,0]

A

A

A

[1,0]

C

C T

T

T

G G

A

A A [1,1]

G T C G

A A

A C

C C C C

G

G T G T

T

A A

A C

C C C C

G

G T G T

T

A G T A G T

(60)

Extraction of Structured Models: SMILE

L. Marsan and M.-F. Sagot, Journal of Computational Biology, 2000

p = 2 Input sequences: ACTGAA and CACGTA

k

1

= 2, d = 1, k

2

= 2 e

1

= 1 , e

2

= 1

q = 2

[1,1]

[1,1]

[0,1] [0,1] [1,0]

A

A

A

[1,0]

C

C T

T

T

G G

A

A A [1,1]

G T C G

(AC,1) (CA,1)

A A

A C

C C C C

G

G T G T

T

A A

A C

C C C C

G

G T G T

T

A G T A G T

(61)

Extraction of Structured Models: SMILE

L. Marsan and M.-F. Sagot, Journal of Computational Biology, 2000

p = 2 Input sequences: ACTGAA and CACGTA

k

1

= 2, d = 1, k

2

= 2 e

1

= 1 , e

2

= 1

q = 2

[1,1]

[1,1]

[0,1] [0,1] [1,0]

A

A

A

[1,0]

C

C T

T

T

G G

A

A A [1,1]

G T C G

(AC,1) (CA,1)

A A

A C

C C C C

G

G T G T

T

A

A G T

A A

C

A C G T

(62)

Extraction of Structured Models: SMILE

L. Marsan and M.-F. Sagot, Journal of Computational Biology, 2000

p = 2 Input sequences: ACTGAA and CACGTA

k

1

= 2, d = 1, k

2

= 2 e

1

= 1 , e

2

= 1

q = 2

[1,1]

[1,1]

[0,1] [0,1] [1,0]

A

A

A

[1,0]

C

C T

T

T

G G

A

A A [1,1]

G T C G

A A

A C

C C C C

G

G T G T

T

A

A G T

A A

(AC,1)

C

C

C C

G T A G T C G T G T

(63)

Extraction of Structured Models: SMILE

L. Marsan and M.-F. Sagot, Journal of Computational Biology, 2000

p = 2 Input sequences: ACTGAA and CACGTA

k

1

= 2, d = 1, k

2

= 2 e

1

= 1 , e

2

= 1

q = 2

[1,1]

[1,1]

[0,1] [0,1] [1,0]

A

A

A

[1,0]

C

C T

T

T

G G

A

A A [1,1]

G T C G

A A

A C

C C C C

G

G T G T

T

A

A G T

A A

(AC,1)

C

A

(64)

Extraction of Structured Models: SMILE

L. Marsan and M.-F. Sagot, Journal of Computational Biology, 2000

p = 2 Input sequences: ACTGAA and CACGTA

k

1

= 2, d = 1, k

2

= 2 e

1

= 1 , e

2

= 1

q = 2

[1,1]

[1,1]

[0,1] [0,1] [1,0]

A

A

A

[1,0]

C

C T

T

T

G G

A

A A [1,1]

G T C G

A A

A C

C C C C

G

G T G T

T

A

A G T

A A

(AC,1)

C

A

(65)

PARTITION UP TO ε

PARTITION UP TO

ε

problem

:

` gold bars

w

i

≥ 0 is the the weight of the ith gold bar any gold bar can be cut in c equal parts

Optimization version

: The problem is how to share the gold between r persons, with the

minimum number of gold bars z , in such a way that each person gets the same share of gold up to some weight ε > 0 .

Decision version

: The problem is to decide whether it is possible to share the gold between r persons, with z gold bars, in such a way that each person gets the same share of gold up to some weight ε ≥ 0.

Proposition.

The PARTITION UP TO ε problem is NP-complete in the strong sense.

(66)

PARTITION UP TO ε

SimpleCut (

Partition

i

, GoldBars

`

, Persons

r

, Weights

w

j, CutFactor

c

, WorkOverload

ε ) 1. find the smallest t such that

max wj

ct

≤ ε 2. for each j ∈ {1, ..., `}

3. let V

j

=

h P

j−1

k=1

w

k

× c

t

, P

j

k=1

w

k

× c

t

 4. let w = P

`

j=1

w

j

5. let γ = w × c

t

mod r 6. let δ = b

w×cr t

c

7. let I

i0

=

[(i − 1)(δ + 1), i(δ + 1) ) for all i ≤ γ [γ(δ + 1) + (i − (γ + 1))δ, γ(δ + 1) + (i − γ)δ ) otherwise 8. transform I

i0

= [a, b) into I

i

= [f (a), f (b)) with f : w × c

t

→ ` × c

t

:

f (x) =

(j − 1) × c

t

+

x−inf(Vw j)

j

for all x ∈ V

j

` × c

t

if x = w × c

t

(67)

PARTITION UP TO ε

j 1 2

w

j

2 1 r = 3 ε = 1

V1=[0,4) V2=[4,6)

I’1=[0,2) I’2=[2,4) I’3=[4,6)

I3=[2,4) I1=[0,1) I2=[1,2)

(68)

PARTITION UP TO ε

j 1 2

w

j

2 1 r = 3 ε = 1 t = 1

1. find the smallest

t

such that max wj

ct

≤ ε

V1=[0,4) V2=[4,6)

I’1=[0,2) I’2=[2,4) I’3=[4,6)

I3=[2,4) I1=[0,1) I2=[1,2)

(69)

PARTITION UP TO ε

j 1 2

w

j

2 1 r = 3 ε = 1 t = 1

2. for each

j ∈ 1, . . . , `

3.

V

j =

h P

j−1

k=1

w

k

× c

t

, P

j

k=1

w

k

× c

t



V1=[0,4) V2=[4,6) I’1=[0,2) I’2=[2,4) I’3=[4,6)

I3=[2,4) I1=[0,1) I2=[1,2)

(70)

PARTITION UP TO ε

j 1 2

w

j

2 1 r = 3 ε = 1 t = 1

w = 3 γ = 0 δ = 2

4.

w

=

P

`

j=1

w

j

5.

γ

=

w × c

t

mod r

6.

δ

=

b

w×cr t

c

V1=[0,4) V2=[4,6) I’1=[0,2) I’2=[2,4) I’3=[4,6)

I3=[2,4) I1=[0,1) I2=[1,2)

(71)

PARTITION UP TO ε

j 1 2

w

j

2 1 r = 3 ε = 1 t = 1

w = 3 γ = 0 δ = 2

7.

I

i0 =

( [(i − 1)(δ + 1), i(δ + 1) )

for all

i ≤ γ [γ(δ + 1) + (i − (γ + 1))δ, γ(δ + 1) + (i − γ)δ )

otherwise

V1=[0,4) V2=[4,6)

I’1=[0,2) I’2=[2,4) I’3=[4,6) I3=[2,4)

I1=[0,1) I2=[1,2)

(72)

PARTITION UP TO ε

j 1 2

w

j

2 1 r = 3 ε = 1 t = 1

w = 3 γ = 0 δ = 2

8. transform

I

i0 =

[a, b)

into

I

i =

[f (a), f (b))

with

f (x) =

( (j − 1) × c

t

+

x−inf(Vw j)

j

for all x ∈ V

j

` × c

t

if x = w × c

t

V1=[0,4) V2=[4,6)

I’1=[0,2) I’2=[2,4) I’3=[4,6)

I3=[2,4) I1=[0,1) I2=[1,2)

(73)

PARTITION UP TO ε

Proposition.

The SimpleCut algorithm requires O(`) time.

Proposition.

The SimpleCut algorithm has a ratio bound ρ(`, r, (w

i

)

1≤i≤`

, c, ε) = O(

max wε i

).

(74)

Parallelization

Reducing the tree partition problem to the PARTITION UP TO ε problem

Input of the SimpleCut algorithm for the ith grid node:

` = |Σ|

r matches the number of grid nodes

w

j

of each symbol of the alphabet is obtained by scanning the input sequences c = |Σ|

ε is an user parameter

(75)

Parallelization

Reducing the tree partition problem to the PARTITION UP TO ε problem

Input of the SimpleCut algorithm for the ith grid node:

` = |Σ|

r matches the number of grid nodes

w

j

of each symbol of the alphabet is obtained by scanning the input sequences c = |Σ|

ε is an user parameter

Output of the SimpleCut algorithm for the ith grid node:

the number t of cuts gives the depth t + 1 of the tree where the partition is defined

an interval I

i

corresponding to tree nodes at depth t + 1 assigned to the ith grid node

(76)

Parallelization

Reducing the tree partition problem to the PARTITION UP TO ε problem

Input of the SimpleCut algorithm for the ith grid node:

` = |Σ|

r matches the number of grid nodes

w

j

of each symbol of the alphabet is obtained by scanning the input sequences c = |Σ|

ε is an user parameter

Output of the SimpleCut algorithm for the ith grid node:

the number t of cuts gives the depth t + 1 of the tree where the partition is defined an interval I

i

corresponding to tree nodes at depth t + 1 assigned to the ith grid node SimpleCut (

Partition

i

, AlphabetSize

`

, GridNodes

r

, Weights

w

j, AlphabetSize

c

, WorkOverload

ε )

max w

(77)

Parallelization

j 1 2 3 4

σ

j

A C T G

w

j

2 1 1 2

r = 5 ε = 1

A G T

C G T A C G T A C G T A C G T

C

A

V1=[0,8) V2=[8,12) V3=[12,16) V4=[16,24)

I’2=[5,10) I’3=[10,15) I’4=[15,20)

I’1=[0,5) I’5=[20,24)

I1=[0,2) I2=[2,6) I3=[6,11) I4=[11,14) I5=[14,16)

(78)

Parallelization

j 1 2 3 4

σ

j

A C T G

w

j

2 1 1 2

r = 5 ε = 1 t

0

= 1 t = 1

1. find the smallest

t

0 such that max wj

ct0

≤ ε

2.

t

=

min

(depth(

M

) -

1

,

t

0)

A G T

C G T A C G T A C G T A C G T

C

A

V1=[0,8) V2=[8,12) V3=[12,16) V4=[16,24)

I’2=[5,10) I’3=[10,15) I’4=[15,20)

I’1=[0,5) I’5=[20,24)

I1=[0,2) I2=[2,6) I3=[6,11) I4=[11,14) I5=[14,16)

(79)

Parallelization

j 1 2 3 4

σ

j

A C T G

w

j

2 1 1 2

r = 5 ε = 1 t

0

= 1 t = 1

3. for each

j ∈ 1, . . . , `

4.

V

j =

h P

j−1

k=1

w

k

× c

t

, P

j

k=1

w

k

× c

t



A G T

C G T A C G T A C G T A C G T

C

A

V1=[0,8) V2=[8,12) V3=[12,16) V4=[16,24)

I’2=[5,10) I’3=[10,15) I’4=[15,20)

I’1=[0,5) I’5=[20,24)

I1=[0,2) I2=[2,6) I3=[6,11) I4=[11,14) I5=[14,16)

(80)

Parallelization

j 1 2 3 4

σ

j

A C T G

w

j

2 1 1 2

r = 5 ε = 1 t

0

= 1 t = 1

5.

w

=

P

`

j=1

w

j

6.

γ

=

w × c

t

mod r

7.

δ

=

b

w×cr t

c

A G T

C G T A C G T A C G T A C G T

C

A

V1=[0,8) V2=[8,12) V3=[12,16) V4=[16,24)

I’2=[5,10) I’3=[10,15) I’4=[15,20)

I’1=[0,5) I’5=[20,24)

I1=[0,2) I2=[2,6) I3=[6,11) I4=[11,14) I5=[14,16)

(81)

Parallelization

j 1 2 3 4

σ

j

A C T G

w

j

2 1 1 2

r = 5 ε = 1 t

0

= 1 t = 1

w = 6 γ = 4 δ = 4

8.

I

i0 =

( [(i − 1)(δ + 1), i(δ + 1) )

for all

i ≤ γ [γ(δ + 1) + (i − (γ + 1))δ, γ(δ + 1) + (i − γ)δ )

otherwise

A G T

C G T A C G T A C G T A C G T

C

A

V1=[0,8) V2=[8,12) V3=[12,16) V4=[16,24)

I’2=[5,10) I’3=[10,15) I’4=[15,20)

I’1=[0,5) I’5=[20,24)

I1=[0,2) I2=[2,6) I3=[6,11) I4=[11,14) I5=[14,16)

(82)

Parallelization

j 1 2 3 4

σ

j

A C T G

w

j

2 1 1 2

r = 5 ε = 1 t

0

= 1 t = 1

w = 6 γ = 4 δ = 4

9. transform

I

i0 =

[a, b)

into

I

i =

[f (a), f (b))

with

f (x) =

( (j − 1) × c

t

+

x−inf(Vw j)

j

for all x ∈ V

j

` × c

t

if x = w × c

t

A G T

C G T A C G T A C G T A C G T

C

A

V1=[0,8) V2=[8,12) V3=[12,16) V4=[16,24)

I’2=[5,10) I’3=[10,15) I’4=[15,20)

I’1=[0,5) I’5=[20,24)

(83)

Parallelization

PExtractModels (

Model

m

, Block

i

, PartitionSet

I

i of

M ) 1. for each node-occurrence v of m

2. if (i > 1 )

3. put in P otencialStarts the children of v at levels (i − 1)k + (i − 1)d

mini−1

to (i − 1)k + (i − 1)d

maxi−1

4. else

5. put v in P otencialStarts

6. for each model m

i

∈ I

i

obtained by doing a recursive depth-first traversal from the root of the virtual model tree M while

simultaneously traversing T from the node-occurrences in P otencialStarts

7. if ( i < p )

8. PExtractModels ( m = m

1

. . . m

i

, i + 1 , I

i

)

9. else

(84)

Parallelization

PSmile (

GridNode

i

, WorkOverload

ε )

1. compute weights (w

i

)

1≤i≤|Σ|

; 2. build suffix tree T ;

3. create colors on T ;

4. let I

i

= SimpleCut( i, |Σ|, r, (w

i

)

1≤i≤|Σ|

, |Σ|, ε );

5. call PExtractModels( T , I

i

);

Proposition.

Assume Σ fixed and w

i

= 1 for 1 ≤ i ≤ |Σ|. The parallel algorithm PSmile is

work-efficient with respect to the sequential version when r = O(ν

p2

(e, k)) and

wε

1r

.

(85)

Parallelization

Experimental results

2 boxes 3 boxes

models time (sec) models time (sec)

grid node 1 2 155.83 9987 444.50

grid node 2 1 168.11 6178 385.28

grid node 3 2 245.35 3108 473.70

grid node 4 16 262.51 15884 581.64

total 21 831.80 35157 1885.18

parallel time 262.51 581.64

sequential time 757.97 1790.70

speed up 2.9 3.1

(86)

On going and future work

Implementation of a more efficient sequential algorithm to extract structured models

[L. Marsan and M.-F. Sagot, J. Computational Biology, 2000]

Establishing an even more efficient algorithm to extract structured models

[A. Carvalho, A. Freitas, A. Oliveira and M.-F. Sagot, in preparation, 2003]

Comparison between algorithms which attempts to model the combinatorics of

regulation

References

Related documents

Through practice we must make the body, the senses, the mind, the breath, all of them rhythmic, then only we come to have the proper mood for spiritual practices and meditation, and

 Human elements involved in the problem  Support systems surrounding the problem  Tracking systems related to the problem..  Institutional process for managing

In the United States, the Deaf community, or culturally Deaf is largely comprised of individuals that attended deaf schools and utilize American Sign Language (ASL), which is

Furthermore, observational visits were made to two special schools for the hearing impaired, one integrated school for children with intellectual impairments, and one mainstream

For the poorest farmers in eastern India, then, the benefits of groundwater irrigation have come through three routes: in large part, through purchased pump irrigation and, in a

• Follow up with your employer each reporting period to ensure your hours are reported on a regular basis?. • Discuss your progress with

Public awareness campaigns of nonnative fish impacts should target high school educated, canal bank anglers while mercury advisories should be directed at canal bank anglers,

This narrative inquiry bears ontological, epistemological, and ethi- cal implications for teacher education programs because identity joins emotions and knowledge