Finding compact structural motifs


Theoretical Computer Science

journal homepage: www.elsevier.com/locate/tcs


Dongbo Bu a,c, Ming Li a,∗, Shuai Cheng Li a, Jianbo Qian a, Jinbo Xu b

a David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada N2L 3G1

b Toyota Technological Institute at Chicago, 1427 East 60th Street, Chicago, IL 60637, United States

c Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China

Keywords: Compact; Structural motif; NP-hardness; Approximation algorithm

Abstract

Protein structural motif detection has important applications in structural genomics. Compared with sequence motifs, structural motifs are more sensitive in revealing the evolutionary relationships among proteins. A variety of algorithms have been proposed to attack this problem. However, they are either heuristic without theoretical performance guarantee, or inefficient due to employing exhaustive search strategies. This paper studies a reasonably restricted version of this problem: the compact structural motif problem. We prove that this restricted version is still NP-hard, and we present a polynomial-time approximation scheme to solve it. This is the first approximation algorithm with a guaranteed ratio for the protein structural motif problem.

1. Introduction

It is widely accepted that during the evolution of proteins, structures are more conserved than sequences. In addition, structural motifs are tightly related to protein functions [2]. Thus, identifying the common substructures from a set of proteins, or a family of proteins to be more precise, can help us learn their evolutionary history and functions. With rapid accumulation of protein structures in the Protein Data Bank (PDB) there is a demand for fast and accurate structural comparison and motif finding methods.

The multiple structural motif finding problem is the structural analogue of the sequence motif finding problem, which has been thoroughly studied; see, for example, [12]. For the former problem, the input consists of a set of protein structures in three-dimensional (3D) space, $\mathbb{R}^3$. The objective is to find a set of substructures, one from each protein, that exhibit the highest degree of similarity.

Roughly speaking, there are two main methods to measure structural similarity: coordinate root mean squared deviation (cRMSD) and distance root mean squared deviation (dRMSD). The first uses the Euclidean distance between the corresponding residues of different protein structures; to do this, the optimal rigid transformation of these protein structures must be found first. In contrast, the second method first calculates the internal distances of each protein and then compares these internal distance matrices, so no superimposition is needed.

Various methods have been proposed to solve the structural motif finding problem under different similarity measuring schemes. Under the unit-vector RMSD (URMSD) measure, Chew et al. [5] proposed an iterative algorithm to compute the consensus shape and proved the convergence of the algorithm. Applying graph-based data mining tools, Bandyopadhyay et al. [4] described a method to assign a protein structure to functional families using the family-specific fingerprints. Under the bottleneck metric similarity measure, Shatsky et al. [15] presented an algorithm for recognition of binding patterns common to a set of protein structures. This problem is also studied in [6,7,11,13,14,18].

∗ Corresponding author. Tel.: +1 519 888 4659; fax: +1 519 885 1208.

A closely related problem is the structural alignment problem, for which many successful approaches have been developed. Among them, DALI [8] and CE [16] attempt to identify the alignment with minimal dRMSD, while STRUCTURAL [17] and TM-align [20] employ heuristics to detect the alignment with minimal cRMSD.

However, the methods mentioned above are all heuristic and the solutions are not guaranteed to be optimal or near optimal. Recently, Kolodny et al. [10] proposed the first polynomial-time approximate algorithm for pairwise protein structure alignment based on the Lipschitz property of the scoring function. Though this method can be extended to the case of multiple protein structural alignment, the simple extension has a time complexity exponential in the number of proteins.

We present an approximation algorithm employing the random sampling technique of [12] under the coordinate mean squared distance cMSD measure, which is the square of the cRMSD. Adopting a reasonable assumption, we prove that our algorithm is efficient and produces a good approximation solution. In contrast to the method in [10], our algorithm has a time complexity polynomial in the number of input protein structures. Furthermore, the sampling size is an adjustable parameter, adding more flexibility to our algorithm.

The rest of this manuscript is organized as follows. In Section 2, we introduce the notations and some background knowledge. In Section 3, we prove the NP-hardness of the compact structural motif problem. The algorithm, along with its performance analysis, is given in Section 4.

2. Preliminaries

A protein consists of a sequence of amino acids (also called residues), each of which contains a number of atoms including one $C_\alpha$ atom. In a protein structure, each atom is associated with a 3D coordinate. Here, we only take into consideration the $C_\alpha$ atom of a residue; thus a protein structure can be simplified as a sequence of 3D points. The common globular protein is generally compact, with the distance between two consecutive $C_\alpha$ atoms restrained by the bond lengths and bond angles, and the volume of the bounding sphere linear in the number of residues.

A structural motif of a protein is a subset of its residues arranged in the order of appearance, and its length is the number of residues in this subset. In this paper we study only (R,C)-compact motifs: a motif is (R,C)-compact if its minimal bounding ball B has radius at most R and at most C residues inside B do not belong to the motif. This ball B is referred to as the containing ball.

To measure the similarity of two protein structures, a transformation, including a rotation and a translation, should be done first. Such a transformation is known as a rigid transformation and can be expressed with a 6D-vector $\tau = (t_x, t_y, t_z, r_1, r_2, r_3)$, where $r_1, r_2, r_3 \in [0, 2\pi]$ and $t_x, t_y, t_z \in \mathbb{R}$. Here, $(t_x, t_y, t_z)$ denotes a translation and $(r_1, r_2, r_3)$ denotes a rotation. Applying transformation $\tau$ to a 3D point $u$, we get a new 3D coordinate $\tau(u)$.

We adopt cMSD to measure the similarity between two structural motifs. Formally, for two motifs $u = (u_1, u_2, \ldots, u_\ell)$ and $v = (v_1, v_2, \ldots, v_\ell)$, the cMSD distance between them is defined as $d(u, v) = \sum_{i=1}^{\ell} \|u_i - v_i\|^2$, where $\|\cdot\|$ is the Euclidean distance.
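
As a minimal illustration of this definition, the following Python sketch computes the cMSD of two equal-length motifs given as lists of 3D tuples. The function name `cmsd` and the `Point` alias are our own choices for illustration, and the motifs are assumed to be already transformed, so no superimposition is performed here.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float, float]

def cmsd(u: Sequence[Point], v: Sequence[Point]) -> float:
    """Sum of squared Euclidean distances between corresponding points."""
    if len(u) != len(v):
        raise ValueError("motifs must have the same length")
    return sum(
        (ux - vx) ** 2 + (uy - vy) ** 2 + (uz - vz) ** 2
        for (ux, uy, uz), (vx, vy, vz) in zip(u, v)
    )

# Two motifs of length 2 that differ only in their second point.
u = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
v = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
print(cmsd(u, v))  # 2.0
```

Note that cMSD is a sum, not a mean or a root, so it grows with the motif length $\ell$.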

We will study the (R,C)-Compact Consensus Structural Motif ((R,C)-CCSM) problem: Given n protein structures $P_1, P_2, \ldots, P_n$ and an integer $\ell$, find an (R,C)-compact motif $u_i$ of length $\ell$ along with a rigid transformation $\tau_i$ for each $P_i$, and a consensus structure of $\ell$ 3D points $q = (q_1, q_2, \ldots, q_\ell)$, where each $q_i$ is a point in 3D, minimizing the distance function $\sum_{i=1}^{n} d(q, \tau_i(u_i))$.

Before presenting the approximation algorithm, we first prove that the transformation can be simplified. More specifically, we need to consider only rotations.

The following lemma, due to [9], is used in our proofs. Roughly speaking, it states that if we want to superimpose two chains of points optimally, we must make their centroids coincide.

Lemma 1. Given n ordered 3D points $A = (a_1, a_2, \ldots, a_n)$ and n ordered 3D points $B = (b_1, b_2, \ldots, b_n)$, to minimize $\sum_i \|\rho(a_i) + T - b_i\|^2$, where $\rho$ is a rotation matrix and $T$ is a translation vector, $T$ must make the centroids of A and B coincide with each other.

Lemma 2. In the optimal solution of the (R,C)-Compact Consensus Structural Motif problem, the centroid of $\tau_i(u_i)$ must coincide with the centroid of $q$, for $1 \le i \le n$.

Lemma 2 is a direct corollary of Lemma 1. The basic idea is that to find the optimal rigid transformation, we can first translate the proteins so that their centroids coincide with each other. Therefore, a transformation can be simplified to a rotation vector $(r_1, r_2, r_3)$ in the rotation space $\mathcal{T}$, where $r_1, r_2, r_3 \in [0, 2\pi]$. Discrete transformation set is a commonly-used technique [10]. Specifically, for a real number $\varepsilon$, we can discretize the range $[0, 2\pi]$ into a series of bins with a width of $\varepsilon$. We refer to the discrete transformation set $\{(i \cdot \varepsilon, j \cdot \varepsilon, k \cdot \varepsilon) \mid 0 \le i, j, k \le 2\pi/\varepsilon\}$, where $i, j, k$ are integers, as an $\varepsilon$-net of rotation space $\mathcal{T}$.
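
The discretization can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the function names `rotation_net` and `nearest_in_net` are hypothetical, rotations are represented simply as triples of angles, and the rounding guarantee of `eps / 2` per coordinate is stated only for the case where $2\pi/\varepsilon$ is an integer.

```python
import math
from itertools import product

def rotation_net(eps):
    """Grid {(i*eps, j*eps, k*eps) : 0 <= i, j, k <= 2*pi/eps}, i, j, k integers."""
    steps = int(math.floor(2 * math.pi / eps))
    angles = [i * eps for i in range(steps + 1)]
    return list(product(angles, repeat=3))

def nearest_in_net(rotation, eps):
    """Round each angle to the nearest grid value, clamped to the grid range.

    When 2*pi/eps is an integer, each coordinate moves by at most eps/2.
    """
    steps = int(math.floor(2 * math.pi / eps))
    return tuple(min(round(a / eps), steps) * eps for a in rotation)

eps = math.pi / 2
net = rotation_net(eps)
print(len(net))  # 125, i.e. 5 grid values per angle
```

The size of the net grows as $(2\pi/\varepsilon)^3$, which is where the $\varepsilon^{-3}$ factor per enumerated rotation comes from in the running-time analysis.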

3. Hardness result

In this section, we will show that the (R,C)-Compact Consensus Structural Motif problem is NP-hard. The reduction is from the Local Multiple Alignment (LMA) problem, which has been proven to be NP-hard in [1].

Fig. 1. The reduction from the Local Multiple Alignment problem to the (R,C)-Compact Consensus Structural Motif problem.

Local Multiple Alignment (sum-of-pairs variant) [1]: Given a set of sequences $S = \{s_1, s_2, \ldots, s_n\}$ over an alphabet $\Sigma = \{0, 1\}$, and an integer $\ell$, find a substring $t_i$ of length $\ell$ from each $s_i$, minimizing the sum-of-pairs score $\text{SP-score}(t_1, \ldots, t_n) = \sum_{1 \le i < j \le n} d_H(t_i, t_j)$, where $d_H(x, y)$ is the Hamming distance. We call $(t_1, \ldots, t_n)$ a local alignment.

Given an LMA instance, we transform it into a (1,0)-CCSM instance by two steps, namely, extending and mapping (see Fig. 1).

1. Extending Step: For each $\ell$-mer of each sequence $s_i$, say $t_i = s_i^{j} s_i^{j+1} \cdots s_i^{j+\ell-1}$, we first append its complement, then attach a tail of $2\ell$ 0's and $2\ell$ 1's. For example, 100 becomes 100011000000111111.

2. Mapping Step: We map the extended string to $6\ell$ ordered 3D points: 0 is mapped to a 3D point $t_i^0 = (0, 2i, 0)$ and 1 is mapped to a 3D point $t_i^1 = (1, 2i, 0)$. Hence, each $\ell$-mer $t_i$ is mapped to $6\ell$ ordered points; those points are located at only two 3D coordinates. Let $M(t_i)$ denote these ordered points. Note that the centroid of $M(t_i)$ is $(1/2, 2i, 0)$.

By the above transformation, sequence $s_i$ is mapped to a protein structure $P_i$ with $6\ell(m - \ell + 1)$ points, where m is the length of $s_i$. We should notice that in real protein structures, points do not share identical coordinates. However, it is not difficult to modify our reduction to take this aspect into consideration. We ignore this for simplicity and clarity.
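
The two steps of the reduction can be sketched as follows. This is an illustrative Python sketch, not the authors' code: the function names `extend`, `map_points`, `protein` and `centroid` are hypothetical, and sequences are assumed to be Python strings over the characters "0" and "1".

```python
def extend(t):
    """Extending step: t, then its complement, then 2l 0's and 2l 1's."""
    comp = "".join("1" if ch == "0" else "0" for ch in t)
    l = len(t)
    return t + comp + "0" * (2 * l) + "1" * (2 * l)

def map_points(t, i):
    """Mapping step: symbol b of an l-mer from sequence i goes to (b, 2i, 0)."""
    return [(float(ch), 2.0 * i, 0.0) for ch in extend(t)]

def protein(s, i, l):
    """Structure P_i: the mapped points of every l-mer of s, in order."""
    return [p for j in range(len(s) - l + 1) for p in map_points(s[j:j + l], i)]

def centroid(points):
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))

print(extend("100"))                       # 100011000000111111
print(centroid(map_points("100", 1)))      # (0.5, 2.0, 0.0)
print(len(protein("011100101", 1, 3)))     # 126 = 6 * 3 * (9 - 3 + 1)
```

The complement and the balanced tail guarantee that every mapped motif has exactly $3\ell$ points at each of the two coordinates, which is why the centroid does not depend on the content of the $\ell$-mer.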

With the above reduction, we can prove our hardness result.

Theorem 1. (R,C)-Compact Consensus Structural Motif is NP-hard.

The result can be proved by the following lemmas.

Lemma 3. $\text{cMSD}(M(t_i), M(t_j)) = 2 d_H(t_i, t_j)$.

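Proof. Suppose $\tau_i$ and $\tau_j$ are the optimal transformations of $M(t_i)$ and $M(t_j)$ to superimpose them, respectively. The centroids of $M(t_i)$ and $M(t_j)$ should coincide under the optimal superimposition according to Lemma 1. Let the angle between the two line segments $\langle \tau_i(t_i^0), \tau_i(t_i^1) \rangle$ and $\langle \tau_j(t_j^0), \tau_j(t_j^1) \rangle$ be $\theta$ (see Fig. 2). Denote $d = d_H(t_i, t_j)$. Every point lies at distance 1/2 from the common centroid; the $6\ell - 2d$ positions where the extended strings agree form an angle $\theta$ at the centroid, and the $2d$ positions where they differ form an angle $\pi - \theta$. According to simple geometry (the law of cosines), the cMSD value of $M(t_i)$ and $M(t_j)$ under this superimposition can be expressed as:

$$
\begin{aligned}
d(M(t_i), M(t_j)) &= (6\ell - 2d) \times \left( \left(\tfrac{1}{2}\right)^2 + \left(\tfrac{1}{2}\right)^2 - 2 \times \left(\tfrac{1}{2}\right)^2 \times \cos\theta \right) \\
&\quad + 2d \times \left( \left(\tfrac{1}{2}\right)^2 + \left(\tfrac{1}{2}\right)^2 + 2 \times \left(\tfrac{1}{2}\right)^2 \times \cos\theta \right) \\
&= 3\ell - (3\ell - 2d)\cos\theta.
\end{aligned}
$$

Since $d \le \ell$, the coefficient $3\ell - 2d$ is positive, so it is clear that this distance reaches its minimum when $\theta = 0$, and $\text{cMSD}(M(t_i), M(t_j)) = 2d$.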
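
The closed form in this proof can be checked numerically. The sketch below is an illustration under our own assumptions: it recomputes the extending step locally, places the common centroid at the origin, and models the relative rotation as a planar rotation by `theta` (which suffices here because all mapped points lie on one line through the centroid). The helper names are hypothetical.

```python
import math

def extend(t):
    """Extending step of the reduction: t, its complement, 2l 0's, 2l 1's."""
    comp = "".join("1" if ch == "0" else "0" for ch in t)
    return t + comp + "0" * (2 * len(t)) + "1" * (2 * len(t))

def centered(t):
    """Mapped points with the centroid moved to the origin (x = -1/2 or +1/2)."""
    return [(float(ch) - 0.5, 0.0) for ch in extend(t)]

def cmsd_at_angle(ti, tj, theta):
    """cMSD of M(ti) and M(tj) after rotating M(tj) by theta about the centroid."""
    c, s = math.cos(theta), math.sin(theta)
    total = 0.0
    for (ax, ay), (bx, by) in zip(centered(ti), centered(tj)):
        rx, ry = bx * c - by * s, bx * s + by * c
        total += (ax - rx) ** 2 + (ay - ry) ** 2
    return total

def closed_form(ti, tj, theta):
    """3l - (3l - 2d) cos(theta), with d the Hamming distance of ti and tj."""
    l = len(ti)
    d = sum(a != b for a, b in zip(ti, tj))
    return 3 * l - (3 * l - 2 * d) * math.cos(theta)

for theta in (0.0, 0.7, math.pi / 2, math.pi):
    assert abs(cmsd_at_angle("100", "110", theta) - closed_form("100", "110", theta)) < 1e-9

print(cmsd_at_angle("100", "110", 0.0))  # 2.0, i.e. 2 * d_H("100", "110")
```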

The following lemma, due to [19], suggests the equivalence of the minimum of sum of the scores to the centroid and the sum-of-pair scores for a structural motif set.

Lemma 4. Given a set of structural motifs $m_1, \ldots, m_n$, the minimal value of $\sum_{i=1}^{n} \text{cMSD}(q, m_i)$ is reached when $q$ is the average of these motifs under the optimal transformations, and the minimal value is $\frac{1}{n} \sum_{i<j} \text{cMSD}(m_i, m_j)$.
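
The identity behind Lemma 4 can be illustrated on a small example. The sketch below holds the transformations fixed (the motifs are assumed to be already superimposed), so it checks only the algebraic equality between the sum of distances to the average and the scaled sum of pairs; the motif coordinates are made up for illustration, and exact rational arithmetic is used to avoid rounding issues.

```python
from fractions import Fraction
from itertools import combinations

def cmsd(u, v):
    """Sum of squared Euclidean distances between corresponding points."""
    return sum(sum((a - b) ** 2 for a, b in zip(p, q)) for p, q in zip(u, v))

def average(motifs):
    """Point-wise average of equal-length motifs."""
    n = len(motifs)
    length = len(motifs[0])
    return [
        tuple(sum(m[j][k] for m in motifs) / n for k in range(3))
        for j in range(length)
    ]

raw = [
    [(0, 0, 0), (1, 0, 0)],
    [(0, 1, 0), (1, 1, 1)],
    [(2, 0, 1), (0, 0, 3)],
]
motifs = [[tuple(Fraction(c) for c in p) for p in m] for m in raw]

q = average(motifs)
to_average = sum(cmsd(q, m) for m in motifs)
sum_of_pairs = sum(cmsd(a, b) for a, b in combinations(motifs, 2))

assert to_average == Fraction(1, len(motifs)) * sum_of_pairs
print(to_average)  # 10
```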

Suppose there is a local alignment consisting of $t_1, \ldots, t_n$ with a cost $c = \sum_{1 \le i < j \le n} d_H(t_i, t_j)$. Lemma 3 tells us that the corresponding motifs $M(t_i)$ have a score of $2c$, i.e., $\sum_{1 \le i < j \le n} d(M(t_i), M(t_j)) = 2c$. Furthermore, Lemma 4 shows that we can find a $q$ that has a score of $\frac{2}{n}c$, i.e., $\sum_{1 \le i \le n} d(M(t_i), q) = \frac{2}{n}c$.

Conversely, given an optimal solution of an instance of the (R,C)-Compact Consensus Structural Motif problem with a score of $c$, the corresponding sequence fragments form a local multiple alignment with a cost of $\frac{n}{2}c$. Thus Theorem 1 holds.

4. (R,C)-compact motif finding algorithm

In this section, we present an approximation algorithm to solve the (R,C)-Compact Consensus Structural Motif problem. The basic idea of our algorithm is as follows: first, we translate the proteins to make their centroids coincide. Then, for each discrete rigid transformation and each r-tuple of compact motifs, we calculate the median, and find from each protein the closest part to this median. Ultimately, the median with the minimal value of the objective function is output.

Fig. 2. Superimposition of two motifs.

(R,C)-Compact Motif Finding Algorithm

Input: n protein structures $P_1, P_2, \ldots, P_n$, integers $\ell$, C, r, real numbers R, $\varepsilon$.

Output: median consensus $u$ of length $\ell$, rigid transformation $\tau_i$, (R,C)-compact motif $u_i$ of length $\ell$ for $P_i$, for $1 \le i \le n$.

1. Fix $P_1$, translate other proteins to make their centroids coincide with that of $P_1$.

2. FOR every r length-$\ell$ (R,C)-compact motifs $u_1, u_2, \ldots, u_r$, where $u_i$ is a motif of some $P_j$ DO

3. FOR every r − 1 transformations $\tau_2, \tau_3, \ldots, \tau_r$ from the $\varepsilon/(Rn\ell)$-net of rotation space $\mathcal{T}$ DO

(a) Find the average of $u_1, \tau_2(u_2), \ldots, \tau_r(u_r)$: $u = (u_1 + \tau_2(u_2) + \cdots + \tau_r(u_r))/r$

(b) FOR $i = 1, 2, \ldots, n$ DO

Find the length-$\ell$ (R,C)-compact motif $v_i$ of $P_i$ and its optimal rigid transformation $\tau'_i$ that minimize $d(u, \tau'_i(v_i))$.

(c) Let $c(u) = \sum_{i=1}^{n} d(u, \tau'_i(v_i))$.

4. Output $u$ and the corresponding $v_i$, $\tau'_i$ that minimize $c(u)$.

Let $f(x) = \sum_{i=1}^{n} (x - a_i)^2$. It is easy to see that $f(x)$ is minimized when $x$ equals the average of $\{a_i\}$. The following lemma states that if we randomly choose r numbers from $\{a_i\}$, and let $x$ be the average of these r numbers, then the expected value of $f(x)$ is $(1 + 1/r)$ times the minimum.

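Lemma 5. Let $a_1, a_2, \ldots, a_n$ be n real numbers, and let $1 \le r \le n$ be an integer. Then the following equation holds:

$$
\frac{1}{n^r} \sum_{1 \le i_1, \ldots, i_r \le n} \sum_{i=1}^{n} \left( \frac{a_{i_1} + \cdots + a_{i_r}}{r} - a_i \right)^2 = \frac{r+1}{r} \sum_{i=1}^{n} \left( \frac{a_1 + \cdots + a_n}{n} - a_i \right)^2 .
$$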

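Proof. For the sake of simplicity, we use $\sigma$ to denote $\sum_{i=1}^{n} a_i$ and $\sigma'$ to denote $\sum_{i=1}^{n} a_i^2$.

$$
\begin{aligned}
&\frac{1}{n^r} \sum_{1 \le i_1, \ldots, i_r \le n} \sum_{i=1}^{n} \left( \frac{a_{i_1} + \cdots + a_{i_r}}{r} - a_i \right)^2 \\
&= \frac{1}{n^r} \sum_{1 \le i_1, \ldots, i_r \le n} \left( n \frac{(a_{i_1} + \cdots + a_{i_r})^2}{r^2} - 2 \frac{a_{i_1} + \cdots + a_{i_r}}{r} \sigma + \sigma' \right) \\
&= \frac{1}{n^r} \sum_{1 \le i_1, \ldots, i_r \le n} n \frac{a_{i_1}^2 + \cdots + a_{i_r}^2 + 2(a_{i_1} a_{i_2} + \cdots + a_{i_{r-1}} a_{i_r})}{r^2} - 2 \frac{r n^{r-1} \sigma}{r n^r} \sigma + \sigma' \\
&= \frac{r n^{r-1} \sigma' + r(r-1) n^{r-2} \sigma^2}{r^2 n^{r-1}} - 2 \frac{\sigma^2}{n} + \sigma' \\
&= \frac{\sigma'}{r} + \frac{r-1}{r} \frac{\sigma^2}{n} - 2 \frac{\sigma^2}{n} + \sigma' \\
&= \frac{r+1}{r} \left( \sigma' - \frac{\sigma^2}{n} \right) \\
&= \frac{r+1}{r} \left( \frac{\sigma^2}{n} - 2 \frac{\sigma^2}{n} + \sigma' \right) \\
&= \frac{r+1}{r} \sum_{i=1}^{n} \left( \frac{\sigma}{n} - a_i \right)^2 .
\end{aligned}
$$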
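
The identity of Lemma 5 is easy to confirm by brute force on a small input. The sketch below enumerates all $n^r$ index tuples (sampling with replacement, as in the sum over $1 \le i_1, \ldots, i_r \le n$) and compares the average cost with $(1 + 1/r)$ times the optimum. The function names and the sample values are our own; exact rational arithmetic is used so that the comparison is an equality, not an approximation.

```python
from fractions import Fraction
from itertools import product

def f(x, a):
    """Cost of using x as the center: sum of squared deviations."""
    return sum((x - ai) ** 2 for ai in a)

def expected_cost(a, r):
    """Average of f over the means of all r-tuples drawn with replacement."""
    n = len(a)
    total = sum(f(Fraction(sum(s), r), a) for s in product(a, repeat=r))
    return total / n ** r

def optimal_cost(a):
    """Minimum of f, attained at the average of a."""
    return f(Fraction(sum(a), len(a)), a)

a = [1, 4, 6, 9]
for r in (1, 2, 3):
    assert expected_cost(a, r) == Fraction(r + 1, r) * optimal_cost(a)

print(optimal_cost(a), expected_cost(a, 1))  # 34 68
```

Because the average over all tuples equals $(1 + 1/r)$ times the minimum, at least one tuple must achieve a cost no larger than that, which is the form in which the lemma is used in the analysis of the algorithm.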


The following lemma is needed for our analysis of the time complexity.

Lemma 6. All of the (R,C)-compact motifs of length $\ell$ for protein $P$ with m residues can be enumerated in $O(m^4 \ell^C)$ time.

Proof. According to the definition of the (R,C)-compact motif, we know that the containing ball B of a motif must contain $\ell$ to $\ell + C$ residues of $P$. In addition, it is easy to see that due to the minimality of B, either there are 4 residues on the surface of B, or there are 3 residues on its surface and the radius of B is R. Therefore, to enumerate the compact motifs, we can first enumerate the containing balls, which takes $O(m^4)$ time; then from each ball, we enumerate the motifs, which takes $O(\ell^C)$ time. In total, it takes $O(m^4 \ell^C)$ time.

Theorem 2. The (R,C)-Compact Motif Finding Algorithm outputs a solution with cost no more than $\left(1 + \frac{1}{r}\right) c_{opt} + O(\varepsilon)$, in time $O\left(n^{4r-2} m^{4r+4} R^{3r-3} \ell^{Cr+C+3r-2} / \varepsilon^{3r-3}\right)$, where $c_{opt}$ is the cost of the optimal solution.

Proof. Step 1 takes $O(nm)$ time. The enumeration of $\{u_i\}$ takes $O(n^r (m^4 \ell^C)^r)$ time, and that of $\{\tau_i\}$ takes $O((Rn\ell/\varepsilon)^{3(r-1)})$ time. Steps 3(a)–(c) take $O(n \cdot m^4 \ell^C \cdot \ell)$ time (finding $\tau'_i$ takes $O(\ell)$ time according to [3]). So, the time complexity of the algorithm is $O\left(n^{4r-2} m^{4r+4} R^{3r-3} \ell^{Cr+C+3r-2} / \varepsilon^{3r-3}\right)$.
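
For readers who want to check the exponents, the total running time is the product of the three factors above. The following worked multiplication is our own bookkeeping, with $\varepsilon$ denoting the net width parameter and $C$ the compactness constant, and is meant only as a sanity check:

```latex
\begin{aligned}
&\underbrace{n^{r}\, m^{4r}\, \ell^{Cr}}_{\text{motif tuples } \{u_i\}}
 \;\cdot\;
 \underbrace{R^{3r-3}\, n^{3r-3}\, \ell^{3r-3}\, \varepsilon^{-(3r-3)}}_{\text{rotations } \{\tau_i\}}
 \;\cdot\;
 \underbrace{n\, m^{4}\, \ell^{C+1}}_{\text{steps 3(a)--(c)}} \\
&\quad = n^{\,r + (3r-3) + 1}\; m^{\,4r + 4}\; R^{\,3r-3}\; \ell^{\,Cr + (3r-3) + C + 1}\; \varepsilon^{-(3r-3)} \\
&\quad = n^{4r-2}\, m^{4r+4}\, R^{3r-3}\, \ell^{Cr + C + 3r - 2} \,/\, \varepsilon^{3r-3}.
\end{aligned}
```

The $O(nm)$ cost of Step 1 is dominated by this product and does not affect the bound.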

Now, we prove the performance ratio. Given an instance of the problem, we use $u^*$ to denote the optimal median; $v_i^*$ and $\tau_i^*$ denote the optimal motif in $P_i$ and the corresponding optimal rigid transformation, respectively. Then we have $c_{opt} = \sum_{i=1}^{n} d(u^*, \tau_i^*(v_i^*))$. By the property of our cost function, it is easy to see that $u^*$ is the average of $\tau_1^*(v_1^*), \tau_2^*(v_2^*), \ldots, \tau_n^*(v_n^*)$, i.e., $u^* = (\tau_1^*(v_1^*) + \tau_2^*(v_2^*) + \cdots + \tau_n^*(v_n^*))/n$.

First, we claim that $c_{opt}$ can be approximated by sampling r proteins. In particular, we will show that there exist $1 \le i_1, i_2, \ldots, i_r \le n$ such that

$$
\sum_{i=1}^{n} d\left(u^*_{i_1 \ldots i_r}, \tau_i^*(v_i^*)\right) \le (1 + 1/r)\, c_{opt}, \qquad (1)
$$

where $u^*_{i_1 \ldots i_r} = (\tau^*_{i_1}(v^*_{i_1}) + \tau^*_{i_2}(v^*_{i_2}) + \cdots + \tau^*_{i_r}(v^*_{i_r}))/r$. It suffices to prove that the average of such value for all $1 \le i_1, i_2, \ldots, i_r \le n$ is $(1 + 1/r)\, c_{opt}$, which can be easily deduced from Lemma 5 by applying it to each coordinate of each of the $\ell$ points.

For $1 \le i \le n$, let $\tau'_i$ be a rotation in the $\varepsilon/(Rn\ell)$-net (remember R is the maximum radius of the motifs) of $\mathcal{T}$ that is closest to $\tau_i^*$ in $\mathcal{T}$. Then $\tau_i^*$ can be reached from $\tau'_i$ by moving at most $\varepsilon/(2Rn\ell)$ along each of the three dimensions. Let

$$
u'_{i_1 \ldots i_r} = (\tau'_{i_1}(v^*_{i_1}) + \tau'_{i_2}(v^*_{i_2}) + \cdots + \tau'_{i_r}(v^*_{i_r}))/r .
$$

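Now we will prove that $\sum_{i=1}^{n} d(u'_{i_1 \ldots i_r}, \tau'_i(v_i^*)) \le \sum_{i=1}^{n} d(u^*_{i_1 \ldots i_r}, \tau_i^*(v_i^*)) + O(\varepsilon)$. Let $u^*_{i_1 \ldots i_r}[j]$ be the jth element of $u^*_{i_1 \ldots i_r}$, and $v_i^*[j]$ be the jth element of $v_i^*$, then we have

$$
\begin{aligned}
&\sum_{i=1}^{n} d\left(u'_{i_1 \ldots i_r}, \tau'_i(v_i^*)\right) \\
&= \sum_{i=1}^{n} \sum_{j=1}^{\ell} \left\| u'_{i_1 \ldots i_r}[j] - \tau'_i(v_i^*[j]) \right\|^2 \\
&= \sum_{i=1}^{n} \sum_{j=1}^{\ell} \left\| u'_{i_1 \ldots i_r}[j] - u^*_{i_1 \ldots i_r}[j] + u^*_{i_1 \ldots i_r}[j] - \tau_i^*(v_i^*[j]) + \tau_i^*(v_i^*[j]) - \tau'_i(v_i^*[j]) \right\|^2 \\
&\le \sum_{i=1}^{n} \sum_{j=1}^{\ell} \Bigl( \left\| u^*_{i_1 \ldots i_r}[j] - \tau_i^*(v_i^*[j]) \right\| + \bigl( \left\| u'_{i_1 \ldots i_r}[j] - u^*_{i_1 \ldots i_r}[j] \right\| + \left\| \tau_i^*(v_i^*[j]) - \tau'_i(v_i^*[j]) \right\| \bigr) \Bigr)^2 \\
&= \sum_{i=1}^{n} \sum_{j=1}^{\ell} \Bigl( \left\| u^*_{i_1 \ldots i_r}[j] - \tau_i^*(v_i^*[j]) \right\|^2 + \bigl( \left\| u'_{i_1 \ldots i_r}[j] - u^*_{i_1 \ldots i_r}[j] \right\| + \left\| \tau_i^*(v_i^*[j]) - \tau'_i(v_i^*[j]) \right\| \bigr)^2 \\
&\qquad\qquad + 2 \left\| u^*_{i_1 \ldots i_r}[j] - \tau_i^*(v_i^*[j]) \right\| \times \bigl( \left\| u'_{i_1 \ldots i_r}[j] - u^*_{i_1 \ldots i_r}[j] \right\| + \left\| \tau_i^*(v_i^*[j]) - \tau'_i(v_i^*[j]) \right\| \bigr) \Bigr) \\
&\le \sum_{i=1}^{n} d\left(u^*_{i_1 \ldots i_r}, \tau_i^*(v_i^*)\right) + O(\varepsilon) + 8\varepsilon^2/(Rn\ell) \\
&= \sum_{i=1}^{n} d\left(u^*_{i_1 \ldots i_r}, \tau_i^*(v_i^*)\right) + O(\varepsilon).
\end{aligned}
$$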

The first inequality follows by the triangle inequality of the cRMSD metric, and the second one follows by the property of the $\varepsilon$-net of transformation space and the compactness of motifs. Specifically, by our choice of $\tau'_i$, we have $\|\tau'_i(v_i^*) - \tau_i^*(v_i^*)\| \le \varepsilon/(Rn\ell)$ and $\|u'_{i_1 \ldots i_r} - u^*_{i_1 \ldots i_r}\| \le \varepsilon/(Rn\ell)$ (more details can be found in [10]). In addition, $\|u^*_{i_1 \ldots i_r}[j] - \tau_i^*(v_i^*[j])\| \le 2R$ since the points are bounded by a containing ball with radius at most R.

It is easy to see that the output of our algorithm is at least as good as $\sum_{i=1}^{n} d(u'_{i_1 i_2 \ldots i_r}, \tau'_i(v_i^*))$. Together with (1), the performance ratio of our algorithm is proven.

5. Conclusion

We present a sampling-based approximation algorithm for the problem of finding the compact consensus shape from a family of proteins. Our algorithm requires that the consensus pattern satisfies the compactness condition. Finding a good algorithm for the more general case remains an open problem.

Acknowledgments

We thank the editors and referees for their useful suggestions and efficient handling of this paper. We thank our coauthors Bin Ma and Lusheng Wang of paper [12]. This work is partially supported by the Canada Research Chairs program, the NSERC Discovery Grant OGP0046506, an 863 Grant by the Ministry of Science and Technology of China, an NSERC Collaborative grant, a MITACS grant, and a grant by the National Natural Science Foundation of China 30800168.

References

[1] T. Akutsu, H. Arimura, S. Shimozono, On approximation algorithms for local multiple alignment, in: Proceedings of the Fourth Annual International Conference on Computational Molecular Biology, ACM, New York, NY, USA, 2000, pp. 1–7.

[2] P. Aloy, E. Querol, F.X. Aviles, M.J. Sternberg, Automated structure-based prediction of functional sites in proteins: Applications to assessing the validity of inheriting protein function from homology in genome annotation and to protein docking, Journal of Molecular Biology 311 (2001) 395–408.

[3] K.S. Arun, T.S. Huang, S.D. Blostein, Least square fitting of two 3-D point sets, IEEE Transactions on Pattern Analysis and Machine Intelligence 9 (5) (1987) 698–700.

[4] D. Bandyopadhyay, J. Huan, J. Liu, J. Prins, J. Snoeyink, W. Wang, A. Tropsha, Structure-based function inference using protein family-specific fingerprints, Protein Science 15 (2006) 1537–1543.

[5] L.P. Chew, K. Kedem, Finding the consensus shape of a protein family, in: Proceedings of 18th Annual ACM Symposium on Computational Geometry, 2002, pp. 64–73.

[6] I. Gelfand, A. Kister, C. Kulikowski, O. Stoyanov, Geometric invariant core for the VL and VH domains of immunoglobulin molecules, Protein Engineering 11 (1998) 1015–1025.

[7] M. Gerstein, R.B. Altman, Average core structure and variability measures for protein families: application to the immunoglobins, Journal of Molecular Biology 112 (1995) 535–542.

[8] L. Holm, C. Sander, Dali: a network tool for protein structure comparison, Trends in Biochem Sciences 20 (11) (1995) 478–480.

[9] T.S. Huang, S.D. Blostein, E.A. Margerum, Least-square estimation of motion parameters from 3-d point correspondences, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 69 (1986) 198–201.

[10] R. Kolodny, N. Linial, Approximate protein structural alignment in polynomial time, Proceedings of the National Academy of Sciences 101 (2004) 12201–12206.

[11] N. Leibowitz, Z.Y. Fligelman, R. Nussinov, Multiple structural alignment and core detection by geometric hashing, in: Proc. 7th Int. Conf. Intell. Sys. Mol. Biol, 1999, pp. 169–177.

[12] M. Li, B. Ma, L. Wang, Finding similar regions in many strings, in: Proceedings of the 31st Annual ACM Symposium on Theory of Computing, Atlanta, 1999, pp. 473–482.

[13] C. Orengo, CORA-Topological fingerprints for protein structural family, Protein Science 8 (1999) 699–715.

[14] C. Orengo, W. Taylor, SSAP: Sequential structure alignment program for protein structure comparison, Methods in Enzymology 266 (1996) 617–635.

[15] M. Shatsky, A. Shulman-Peleg, R. Nussinov, H.J. Wolfson, The multiple common point set problem and its application to molecule binding pattern detection, Journal of Computational Biology 13 (2) (2006) 407–428.

[16] I.N. Shindyalov, P.E. Bourne, Protein structure alignment by incremental combinatorial extension (CE) of the optimal path, Protein Engineering 11 (9) (1998) 739–747.

[17] S. Subbiah, D.V. Laurents, M. Levitt, Structural similarity of DNA-binding domains of bacteriophage repressors and the globin core, Current Biology 3 (1993) 141–148.

[18] J. Xu, F. Jiao, B. Berger, A parameterized algorithm for protein structure alignment, in: Proceedings of the Tenth Annual International Conference on Computational Molecular Biology, 2006, pp. 488–499.

[19] J. Ye, R. Janardan, Approximate multiple protein structure alignment using the sum-of-pairs distance, Journal of Computational Biology 11 (5) (2004) 986–1000.