Longest Common Extensions via Fingerprinting


Philip Bille, Inge Li Gørtz, Jesper Kristensen

Technical University of Denmark, DTU Informatics

LATA, March 9, 2012


Contents

Introduction
  The LCE Problem
Existing Results
  The DirectComp Algorithm
  The SuffixNca and LcpRmq Algorithms
  Practical Results
The Fingerprint_k Algorithm
  Data Structure
  Query
  Preprocessing
  Practical Results
  Cache Optimization
Summary


The LCE Problem

LCE value: LCE_s(i, j) is the length of the longest common prefix of the two suffixes of a string s starting at indices i and j.

LCE problem: Efficiently answer multiple LCE queries on a static string s for varying pairs (i, j).

Example:

input: s = abbababba, (i, j) = (4, 6)
suffix i of s = ababba
suffix j of s = abba
longest common prefix = ab
LCE_s(i, j) = 2


Existing Algorithm: DirectComp

Input

• s = abbababba
• (i, j) = (4, 6)

The DirectComp algorithm

s = a b b a b a b b a

s[4] vs. s[6]: a = a, 1 match
s[5] vs. s[7]: b = b, 2 matches
s[6] vs. s[8]: a ≠ b, stop

Result

LCE_s(4, 6) = 2
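
The character-by-character walk above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code; the function name `direct_comp` and the 1-based index convention (chosen to match the slides) are assumptions.

```python
def direct_comp(s: str, i: int, j: int) -> int:
    """Return LCE_s(i, j) by direct comparison; i and j are 1-based."""
    n = len(s)
    v = 0
    # Advance while both positions are inside s and the characters agree.
    while i + v <= n and j + v <= n and s[i + v - 1] == s[j + v - 1]:
        v += 1
    return v


print(direct_comp("abbababba", 4, 6))  # prints 2
```

No preprocessing and no extra memory are needed beyond the string itself, which is why the next slide lists the space as O(1) + |s|.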


Existing Algorithm: DirectComp

Space: O(1) + |s|
Query: O(LCE(i, j)) = O(n)
Average query: O(1)

For a string of length n and alphabet size σ, the average LCE value over all σ^n strings and n^2 query pairs is O(1).

References

L. Ilie, G. Navarro, and L. Tinta. The longest common extension problem revisited and applications to approximate string searching. J. Disc. Alg., 8(4):418-428, 2010.
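
The average-case claim can be checked empirically for tiny inputs by enumerating every string and every query pair. The sketch below is an illustration only: the helper names `direct_comp` and `average_lce` are hypothetical, and the exhaustive enumeration is feasible only for very small n and σ.

```python
from itertools import product


def direct_comp(s, i, j):
    """LCE_s(i, j) by direct comparison, 1-based indices."""
    v = 0
    while i + v <= len(s) and j + v <= len(s) and s[i + v - 1] == s[j + v - 1]:
        v += 1
    return v


def average_lce(n, sigma):
    """Average LCE over all sigma**n strings of length n and all n*n pairs (i, j)."""
    alphabet = "abcdefghijklmnopqrstuvwxyz"[:sigma]
    total = 0
    for chars in product(alphabet, repeat=n):
        s = "".join(chars)
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                total += direct_comp(s, i, j)
    return total / (sigma ** n * n * n)


# The averages stay below a small constant as n grows.
for n in (2, 4, 6, 8):
    print(n, round(average_lce(n, 2), 3))
```

Intuitively, for i ≠ j each further character matches with probability 1/σ, so the expected number of matches is bounded by a geometric series, and the n pairs with i = j contribute only O(1) on average.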


Existing Algorithms: SuffixNca and LcpRmq

Two algorithms with the best known bounds:

SuffixNca: Nearest common ancestor queries on a suffix tree
LcpRmq: Range minimum queries on a longest common prefix array

Space: O(n)
Query: O(1)
Average query: O(1)

References

J. Fischer and V. Heun. Theoretical and Practical Improvements on the RMQ-Problem, with Applications to LCA and LCE. In Proc. 17th CPM, pages 36-48, 2006.

D. Harel and R. E. Tarjan. Fast Algorithms for Finding Nearest Common Ancestors. SIAM J. Comput., 13(2):338-355, 1984.
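
To make the LcpRmq idea concrete, here is a simplified sketch: LCE_s(i, j) equals the minimum of the LCP array between the ranks of suffixes i and j. It is not the linear-space structure cited above. As assumptions for brevity, the suffix array is built by naive sorting and the range minimum uses a sparse table, which takes O(n log n) space; the names `build_lcp_rmq` and `lce_rmq` are hypothetical.

```python
def build_lcp_rmq(s):
    """Return (rank, table): suffix ranks and a sparse table over the LCP array."""
    n = len(s)
    sa = sorted(range(n), key=lambda p: s[p:])  # naive suffix array (0-based)
    rank = [0] * n
    for r, p in enumerate(sa):
        rank[p] = r
    # Kasai's algorithm: lcp[r] = LCP of the suffixes at sa[r - 1] and sa[r].
    lcp = [0] * n
    h = 0
    for p in range(n):
        if rank[p] > 0:
            q = sa[rank[p] - 1]
            while p + h < n and q + h < n and s[p + h] == s[q + h]:
                h += 1
            lcp[rank[p]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    # table[L][r] = min(lcp[r .. r + 2**L - 1])
    table = [lcp]
    span = 1
    while 2 * span <= n:
        prev = table[-1]
        table.append([min(prev[r], prev[r + span]) for r in range(n - 2 * span + 1)])
        span *= 2
    return rank, table


def lce_rmq(s, rank, table, i, j):
    """LCE_s(i, j) for 1-based i, j via a range minimum over the LCP array."""
    if i == j:
        return len(s) - i + 1
    lo, hi = sorted((rank[i - 1], rank[j - 1]))
    level = (hi - lo).bit_length() - 1  # largest power of two within the range
    row = table[level]
    return min(row[lo + 1], row[hi - (1 << level) + 1])


s = "abbababba"
rank, table = build_lcp_rmq(s)
print(lce_rmq(s, rank, table, 4, 6))  # prints 2
```

Each query touches two table entries regardless of the LCE value, which matches the O(1) query bound; the memory accesses are scattered, though, unlike the sequential scan of DirectComp.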


Existing Algorithms: Practical Results

Query times of DirectComp and LcpRmq by string length

[Figure: two plots of query time against string length (10 to 1e+07). Average case: s = random characters, σ = 10, query times from 1e-08 to 1e-06. Worst case: s = aaaaa..., query times from 1e-08 to 1e-05.]


The Fingerprint_k Algorithm: Data Structure

• For a string s[1..n], the t-length fingerprints F_t[1..n] are natural numbers such that F_t[i] = F_t[j] if and only if s[i..i + t − 1] = s[j..j + t − 1].
• k levels, 1 ≤ k ≤ ⌈log n⌉
• For each level ℓ = 0..k − 1:
  • t_ℓ = Θ(n^(ℓ/k)), t_0 = 1
  • H_ℓ = F_{t_ℓ}

i           1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
s = H_0[i]  a  b  b  a  a  b  b  a  b  a  b  b  a  a  b  b  a  b  a  b  a  a  b  a  b  a  $
H_1[i]      1  2  3  4  1  2  5  6  5  1  2  3  4  1  2  5  6  5  6  3  4  6  5  6  7  8  9
H_2[i]      1  2  3  4  5  6  7  8  9  1  2  3 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

Space: O(k · n)
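
The fingerprint tables for the 27-character example can be reproduced with a short sketch. It assumes block lengths t = 1, 3, 9 (that is, n^(ℓ/k) for n = 27 and k = 3), and it numbers fingerprints in order of first occurrence using a dictionary. That numbering is an illustrative shortcut, not the preprocessing described later in the talk, and the function name `fingerprints` is hypothetical. The level-0 row comes out as integers here, whereas the slide uses the characters of s directly.

```python
def fingerprints(s, t):
    """F_t as a 0-based list: equal values if and only if equal t-length substrings."""
    ids = {}
    out = []
    for p in range(len(s)):
        block = s[p:p + t]  # shorter than t near the end of s, hence unique
        out.append(ids.setdefault(block, len(ids) + 1))
    return out


s = "abbaabbababbaabbababaababa$"
levels = [fingerprints(s, t) for t in (1, 3, 9)]  # H_0, H_1, H_2
print(levels[1])
print(levels[2])
```

Each level stores one integer per position, so k levels occupy k · n integers, in line with the O(k · n) space bound.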


The Fingerprint_k Algorithm: Query

Start with v = 0 and ℓ = 0.

1. As long as H_ℓ[i + v] = H_ℓ[j + v], increment v by t_ℓ, increment ℓ by one, and repeat this step unless ℓ = k − 1.
2. As long as H_ℓ[i + v] = H_ℓ[j + v], increment v by t_ℓ and repeat this step.
3. Stop and return v when ℓ = 0; otherwise decrement ℓ by one and go to step 2.

[Figure: H_0, H_1 and H_2 of the example string from the data structure slide, shown once for the positions j + v and once for the positions i + v.]

LCE(3, 12) = 9

Query: O(k · n^(1/k))
Average query: O(1)
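
The three query steps can be sketched as follows. This is an illustrative reading of the slide, not the authors' implementation: the names `build_levels` and `lce_query` are hypothetical, the tables are built with a simple dictionary, the block lengths 1, 3, 9 are assumed for the example string, and explicit bounds checks are added so the sketch also works on strings without a unique terminator.

```python
def build_levels(s, ts):
    """One fingerprint table per block length in ts (ts[0] must be 1)."""
    levels = []
    for t in ts:
        ids = {}
        levels.append([ids.setdefault(s[p:p + t], len(ids) + 1) for p in range(len(s))])
    return levels


def lce_query(levels, ts, i, j):
    """LCE(i, j) for 1-based i, j, following steps 1 to 3 of the slide."""
    n = len(levels[0])
    if i == j:
        return n - i + 1
    k = len(ts)

    def same(level, v):
        a, b = i + v, j + v
        return a <= n and b <= n and levels[level][a - 1] == levels[level][b - 1]

    v, level = 0, 0
    # Step 1: climb to larger blocks while the fingerprints agree.
    while level < k - 1 and same(level, v):
        v += ts[level]
        level += 1
    # Steps 2 and 3: exhaust each level, then descend to the next smaller one.
    while True:
        while same(level, v):
            v += ts[level]
        if level == 0:
            return v
        level -= 1


s = "abbaabbababbaabbababaababa$"
ts = [1, 3, 9]
levels = build_levels(s, ts)
print(lce_query(levels, ts, 3, 12))  # prints 9
```

For the query (3, 12) the sketch matches one character at level 0 and one block at level 1, fails at level 2, then matches one more block at level 1 and two characters at level 0, giving 1 + 3 + 3 + 1 + 1 = 9.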


The Fingerprint_k Algorithm: Preprocessing

• For each level ℓ
  • For each t_ℓ-length substring in lexicographically sorted order
    • If the current substring s[SA[i]..SA[i] + t_ℓ − 1] is equal to the previous substring, give it the same fingerprint as the previous substring; otherwise give it a new unused fingerprint. The two substrings are equal when LCP[i] ≥ t_ℓ.

Example: s = abbababba, t_ℓ = 3

Subst.   a    aba  abb  abb  ba   bab  bab  bba  bba
H_ℓ[i]   1    2    3    3    5    6    6    8    8
i        9    4    6    1    8    3    5    7    2

H_ℓ = [3, 8, 6, 2, 6, 3, 8, 5, 1]

Preprocessing: O(k · n + sort(n, σ))
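
The preprocessing pass for a single level can be sketched as below. As assumptions, the suffix array and LCP array are built naively here (the slide's bound relies on an efficient suffix sort instead), the "new unused fingerprint" is taken to be the 1-based rank in sorted order, which is what the example table suggests, and the function names are hypothetical.

```python
def suffix_and_lcp_arrays(s):
    """Naive suffix array (0-based starts) and LCP array; lcp[0] = 0."""
    n = len(s)
    sa = sorted(range(n), key=lambda p: s[p:])
    lcp = [0] * n
    for r in range(1, n):
        a, b = sa[r - 1], sa[r]
        h = 0
        while a + h < n and b + h < n and s[a + h] == s[b + h]:
            h += 1
        lcp[r] = h
    return sa, lcp


def fingerprints_from_sa(sa, lcp, t):
    """Assign t-length fingerprints by scanning substrings in sorted order."""
    h = [0] * len(sa)
    current = 0
    for r in range(len(sa)):
        if r == 0 or lcp[r] < t:
            current = r + 1  # substring differs from the previous one: new fingerprint
        h[sa[r]] = current
    return h


sa, lcp = suffix_and_lcp_arrays("abbababba")
print(fingerprints_from_sa(sa, lcp, 3))  # prints [3, 8, 6, 2, 6, 3, 8, 5, 1]
```

Once SA and LCP are available, each level costs a single linear scan, which is where the k · n term of the preprocessing bound comes from.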


The Fingerprint_k Algorithm

1 ≤ k ≤ ⌈log n⌉
Space: O(k · n)
Query: O(k · n^(1/k))
Average query: O(1)

                k = 1    k = 2    k = ⌈log n⌉
Space           O(n)     O(n)     O(n log n)
Query           O(n)     O(√n)    O(log n)
Average query   O(1)     O(1)     O(1)

Space for Fingerprint_k is the same as for LcpRmq when k = 6.
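
The three query entries in the table follow from substituting k into the general bound O(k · n^(1/k)). A short worked version, assuming the logarithm is base 2 and n ≥ 2:

```latex
\begin{align*}
k = 1 &: \quad k \cdot n^{1/k} = n \\
k = 2 &: \quad k \cdot n^{1/k} = 2\sqrt{n} = O(\sqrt{n}) \\
k = \lceil \log_2 n \rceil &: \quad n^{1/k} \le n^{1/\log_2 n} = 2
  \;\Longrightarrow\; k \cdot n^{1/k} \le 2\lceil \log_2 n \rceil = O(\log n)
\end{align*}
```

The space entries follow the same way from O(k · n): a constant k gives O(n), and k = ⌈log n⌉ gives O(n log n).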


Practical Results

Query times of DirectComp, Fingerprint_2, Fingerprint_3, Fingerprint_⌈log n⌉ and LcpRmq by string length

[Figure: two plots of query time against string length (10 to 1e+07): average case (query times from 1e-08 to 1e-06) and worst case (query times from 1e-08 to 1e-05).]


Cache Optimization of Fingerprint_k

• Original:
  • Data structure: H_ℓ[i] = F_{t_ℓ}[i]
  • Size: |H_ℓ| = n
  • I/O: O(k · n^(1/k))
• Cache optimized:
  • Data structure: H_ℓ[((i − 1) mod t_ℓ) · ⌈n/t_ℓ⌉ + ⌊(i − 1)/t_ℓ⌋ + 1] = F_{t_ℓ}[i]
  • Size: |H_ℓ| = n + t_ℓ
  • I/O: O(k · (n^(1/k)/B + 1))
• Best when k is small ⟹ n^(1/k) is large.
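
The effect of the cache-optimized index formula is that the entries visited by a query at one level, F[i], F[i + t], F[i + 2t], ..., end up in consecutive memory slots. The sketch below illustrates the mapping; the function names `cache_index` and `cache_layout` are hypothetical, and unused padding slots are filled with `None`.

```python
def cache_index(i, t, n):
    """1-based slot of F_t[i] in the cache-optimized array (formula from the slide)."""
    rows = -(-n // t)  # ceil(n / t)
    return ((i - 1) % t) * rows + (i - 1) // t + 1


def cache_layout(f, t):
    """Rearrange the fingerprint list f (F_t[1..n], stored 0-based) into the new layout."""
    n = len(f)
    rows = -(-n // t)
    h = [None] * (rows * t)  # fewer than n + t slots
    for i in range(1, n + 1):
        h[cache_index(i, t, n) - 1] = f[i - 1]
    return h


n, t = 27, 3
print([cache_index(i, t, n) for i in (1, 4, 7, 10)])  # prints [1, 2, 3, 4]
print([cache_index(i, t, n) for i in (2, 5, 8, 11)])  # prints [10, 11, 12, 13]
```

Stepping by t in the string moves by one slot in memory, so a run of block comparisons at one level reads a contiguous range. The price is the extra arithmetic per access: one modulo, one division and one multiplication instead of a plain index.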

[Figure: H_0, H_1 and H_2 of the example string from the data structure slide, shown once for the positions j + v and once for the positions i + v.]


Cache Optimization, Practical Results

Is I/O optimization good in practice?

• Pro: better cache efficiency
  • Best for small k, no change for k = ⌈log n⌉
• Con: calculating memory addresses is more complicated
  • ((i − 1) mod t_ℓ) · ⌈n/t_ℓ⌉ + ⌊(i − 1)/t_ℓ⌋ + 1 vs. i


Cache Optimization, Practical Results

Query times of Fingerprint_2 without cache optimization and with cache optimization, using shift operations vs. multiplication, division and modulo

[Figure: two plots of query time against string length (10 to 1e+07): average case (query times from 1e-08 to 1e-06) and worst case (query times from 1e-08 to 1e-05).]


Cache Optimization, Practical Results

Query times of Fingerprint_3 without cache optimization and with cache optimization, using shift operations vs. multiplication, division and modulo

[Figure: two plots of query time against string length (10 to 1e+07): average case (query times from 1e-08 to 1e-06) and worst case (query times from 1e-08 to 1e-05).]


Summary

                  DirectComp     LcpRmq / SuffixNca   Fingerprint_k
Space             O(1)           O(n)                 O(k · n)
Query             O(n)           O(1)                 O(k · n^(1/k))
Average query     O(1), fast     O(1), slow           O(1), fast
Query I/O         O(n/B)         O(1)                 O(k · (n^(1/k)/B + 1))
Code complexity   very simple    complex              simple

• Cache optimization of Fingerprint_k improves query times at k = 2 and worsens query times at k ≥ 3.
• Space for Fingerprint_k is the same as for LcpRmq when k = 6.
