Secondary sequence alignment - Description of Algorithm

4.2 Description of Algorithm

4.2.2 Secondary sequence alignment

It has been observed that if two structures are similar, their secondary structure sequences can be aligned well at the sequence level, where each secondary structure is represented by either α for an α helix or β for a β sheet. Aligning the α-β sequences filters out a large number of unrelated structures and greatly speeds up the database search.

The secondary structure sequences of all the protein structures in the databank are extracted. For each protein, its secondary structure sequence has the format $s_1 s_2 \cdots s_k$, where each $s_i$ contains the following information:

• α-β type.

• Number of Cα atoms in the secondary structure. Define $c_\alpha(s_i)$ to be the number of Cα atoms in $s_i$.

• Two crucial points of the secondary structure.

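Define the weight function $w$ as

$$
w(a, b) =
\begin{cases}
\dfrac{\max(c_\alpha(a),\, c_\alpha(b))}{c_\alpha(a) + c_\alpha(b)} & \text{if the } \alpha\text{-}\beta \text{ types of } a \text{ and } b \text{ are different, or one of them is a space (gap),} \\[2ex]
\dfrac{|c_\alpha(a) - c_\alpha(b)|}{c_\alpha(a) + c_\alpha(b)} & \text{if the } \alpha\text{-}\beta \text{ types of } a \text{ and } b \text{ are the same.}
\end{cases}
$$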
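The weight function can be sketched in a few lines of Python. This is only an illustration: the `Element` record, the one-letter type codes, and the use of `None` for a gap are hypothetical choices, and the sketch assumes that a gap contributes zero Cα atoms, which the text does not state explicitly.

```python
from typing import NamedTuple, Optional


class Element(NamedTuple):
    """One secondary structure (hypothetical record type for this sketch)."""
    kind: str   # "a" for an alpha helix, "b" for a beta sheet
    n_ca: int   # number of C-alpha atoms, i.e. c_alpha(s)


def weight(a: Optional[Element], b: Optional[Element]) -> float:
    """w(a, b) as defined above; None stands for a gap."""
    # Assumption: a gap contributes zero C-alpha atoms.
    ca = a.n_ca if a is not None else 0
    cb = b.n_ca if b is not None else 0
    total = ca + cb
    if total == 0:
        raise ValueError("at least one argument must be a non-empty element")
    if a is None or b is None or a.kind != b.kind:
        # Different types, or one side is a gap.
        return max(ca, cb) / total
    # Same type: the cost grows with the difference in length.
    return abs(ca - cb) / total
```

Under the zero-atom assumption, aligning any element with a gap costs exactly 1, two elements of the same type and equal length cost 0, and two elements of different types cost at least 0.5.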

An alignment of two sequences $a_1 \cdots a_n$ and $b_1 \cdots b_m$ of secondary structures adds gaps, each marked by a ‘-’, into each sequence. The first sequence becomes $a'_1 \cdots a'_k$ and the second sequence becomes $b'_1 \cdots b'_k$, and for each $1 \le i \le k$, at least one of $a'_i$ and $b'_i$ must not be a gap.

The total cost of the alignment of $a'_1 \cdots a'_k$ and $b'_1 \cdots b'_k$ is measured by $\sum_{i=1}^{k} w(a'_i, b'_i)$.

Define $D(i, j)$ to be the cost of an optimal alignment between $a_1 \cdots a_i$ and $b_1 \cdots b_j$.

We have the following recursion, which implies that D(n, m) can be computed in O(mn) time by the method of dynamic programming.

$$
D(i, j) = \min
\begin{cases}
D(i-1, j-1) + w(a_i, b_j), \\
D(i-1, j) + w(a_i, -), \\
D(i, j-1) + w(-, b_j)
\end{cases}
\tag{4.1}
$$

Secondary-structure-sequence-alignment(S, S′)

Input: S is the first protein structure, and S′ is the second protein structure.

Output: an alignment for the secondary structure sequences between S and S′.

Begin
    Let s1 s2 · · · sk be the sequence of the first protein S.
    Let s′1 s′2 · · · s′k′ be the sequence of the second protein S′.
    Apply the dynamic programming with weight function w().
    Output the alignment with the best score.
End (of Secondary-structure-sequence-alignment)
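The dynamic programming step can be sketched as follows. This is a minimal illustration rather than the original implementation: the tuple encoding of a secondary structure, the function names, and the boundary conditions (D(0, 0) = 0, with the first row and column filled by gap costs) are assumptions, as is the convention that a gap contributes zero Cα atoms.

```python
from typing import List, Optional, Tuple

# Hypothetical encoding: (alpha-beta type, number of C-alpha atoms).
Elem = Tuple[str, int]
Column = Tuple[Optional[Elem], Optional[Elem]]


def weight(a: Optional[Elem], b: Optional[Elem]) -> float:
    """w(a, b) from the text; None is a gap (assumed to hold zero atoms)."""
    ca = a[1] if a is not None else 0
    cb = b[1] if b is not None else 0
    if a is None or b is None or a[0] != b[0]:
        return max(ca, cb) / (ca + cb)
    return abs(ca - cb) / (ca + cb)


def align(s: List[Elem], t: List[Elem]) -> Tuple[float, List[Column]]:
    """Fill D by recursion (4.1) in O(nm) time, then trace back one optimal alignment."""
    n, m = len(s), len(t)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    # Assumed boundary conditions: a prefix aligned against nothing is all gaps.
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + weight(s[i - 1], None)
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + weight(None, t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + weight(s[i - 1], t[j - 1]),
                D[i - 1][j] + weight(s[i - 1], None),
                D[i][j - 1] + weight(None, t[j - 1]),
            )
    # Trace back from D(n, m), re-evaluating the same expressions used above.
    columns: List[Column] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + weight(s[i - 1], t[j - 1]):
            columns.append((s[i - 1], t[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + weight(s[i - 1], None):
            columns.append((s[i - 1], None))
            i -= 1
        else:
            columns.append((None, t[j - 1]))
            j -= 1
    columns.reverse()
    return D[n][m], columns


# Usage: a helix-sheet-helix sequence against a helix-helix sequence.
cost, columns = align([("a", 10), ("b", 5), ("a", 8)], [("a", 10), ("a", 8)])
```

In this small example the two helices are matched at zero cost and the sheet is aligned with a gap, so the total cost is 1.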

We use the function Select-via-secondary-structure-sequence() to select those proteins whose secondary structure sequences align well with that of the input protein structure S0.

Select-via-secondary-structure-sequence(S0, Structure-list)

Input: S0 is the input protein structure, and Structure-list is the list of structures to be searched for similar proteins.

Output: a sublist of protein structures that can be well aligned with S0 according to the secondary structure sequence alignment.

Begin
    Let L = ∅.
    For each protein structure S in the Structure-list
    Begin
        A = Secondary-structure-sequence-alignment(S0, S).
        If (alignment A is good enough) Then put S into L.
    End (of For)
    Return L.
End (of Select-via-secondary-structure-sequence)
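The filtering loop can be sketched as below. The text does not say what "good enough" means, so the criterion used here (alignment cost divided by the length of the longer sequence, compared against a threshold) is purely a hypothetical stand-in, as are the function and parameter names. The alignment cost is passed in as a callable so that the sketch stays independent of any particular alignment routine.

```python
from typing import Callable, List, Sequence, TypeVar

Seq = TypeVar("Seq", bound=Sequence)


def select_via_sequence_alignment(
    query: Seq,
    candidates: List[Seq],
    align_cost: Callable[[Seq, Seq], float],
    max_cost_per_element: float,
) -> List[Seq]:
    """Keep the candidates whose alignment cost against the query is low enough."""
    selected = []
    for cand in candidates:
        cost = align_cost(query, cand)
        # Hypothetical "good enough" test: cost normalised by the longer length.
        longest = max(len(query), len(cand), 1)
        if cost / longest <= max_cost_per_element:
            selected.append(cand)
    return selected


def toy_cost(x: str, y: str) -> float:
    """Toy stand-in for the alignment cost: mismatches plus length difference."""
    return sum(c != d for c, d in zip(x, y)) + abs(len(x) - len(y))


# Usage with alpha-beta strings as the sequences.
kept = select_via_sequence_alignment(
    "abab", ["abab", "abbb", "bbbbbb", "aba"], toy_cost, 0.25
)
```

With the toy cost and a threshold of 0.25, the candidate "bbbbbb" is filtered out and the other three are kept.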

4.2.3 3-D alignment for secondary structures

In this phase, we select those protein structures that have a good geometric alignment of their secondary structures. This phase is also fast, since each protein has about 30 secondary structures on average and each secondary structure is represented by just two points.

Build-Star((s1, s′1), (s2, s′2), S, S′)

Input: S and S′ are two protein structures, s1 and s2 are secondary structures in S, s′1 and s′2 are secondary structures in S′, and there exists a rigid body alignment for (s1, s′1) and (s2, s′2).

Output: a star with center at (s1, s′1), (s2, s′2).

Begin
    Let Center = {(s1, s′1), (s2, s′2)}.
    Let Star = Center.
    For each pair of secondary structures (s, s′) between S and S′
    Begin
        If (there exists a rigid body transformation for Center and (s, s′))
        Then Let Star = Star ∪ {(s, s′)}.
    End (of For)
    Return Star.
End (of Build-Star)
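Build-Star can be sketched as follows. The text does not specify how the existence of a rigid body transformation is tested, so this sketch swaps in a simple distance-preservation check: a rigid motion preserves all pairwise distances, so the representative points of Center plus the candidate pair must have matching distance matrices up to a tolerance. This is a necessary condition only (it does not rule out mirror images), and the tolerance `tol`, the type aliases, and the function names are all assumptions.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Element = Tuple[Point, Point]    # a secondary structure as two representative points
Pair = Tuple[Element, Element]   # (s, s') with s from S and s' from S'


def rigidly_compatible(pairs: Sequence[Pair], tol: float) -> bool:
    """Stand-in test: all pairwise point distances agree between the two sides."""
    p = [pt for s, _ in pairs for pt in s]
    q = [pt for _, s2 in pairs for pt in s2]
    return all(
        abs(math.dist(p[i], p[j]) - math.dist(q[i], q[j])) <= tol
        for i, j in combinations(range(len(p)), 2)
    )


def build_star(
    center: Sequence[Pair],
    elems_s: Sequence[Element],
    elems_s2: Sequence[Element],
    tol: float = 0.5,
) -> List[Pair]:
    """Collect every pair (s, s') that is compatible with the two center pairs."""
    star = list(center)
    for s in elems_s:
        for s2 in elems_s2:
            pair = (s, s2)
            if pair not in star and rigidly_compatible(list(center) + [pair], tol):
                star.append(pair)
    return star


def shift(e: Element, t: Point) -> Element:
    """Translate both points of an element by t (helper for the usage example)."""
    return tuple(tuple(c + d for c, d in zip(pt, t)) for pt in e)


# Usage: the second structure is a translated copy of the first plus one outlier.
e1 = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0))
e2 = ((0.0, 2.0, 0.0), (1.0, 2.0, 0.0))
e3 = ((0.0, 0.0, 3.0), (1.0, 0.0, 3.0))
f1, f2, f3 = (shift(e, (5.0, 5.0, 5.0)) for e in (e1, e2, e3))
f4 = ((100.0, 0.0, 0.0), (101.0, 0.0, 0.0))
star = build_star([(e1, f1), (e2, f2)], [e1, e2, e3], [f1, f2, f3, f4])
```

In this example only (e3, f3) joins the two center pairs, because every other combination changes at least one pairwise distance by more than the tolerance.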

The function Prune-star() deletes pairs from a star until there exists an alignment with RMSD no more than a threshold r.

Prune-star(Star, r)

Input: Star is a star of secondary structures, and r is a threshold.

Output: a new star of secondary structures that has a rigid body alignment with RMSD no more than r.

Begin
    While (RMSD(Star) > r)
        Remove the pair (s, s′) of Star that has the largest distance dist(s, s′).
    Return Star.
End (of Prune-star)
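The pruning loop can be sketched as below. To keep the example short, it assumes that the coordinates of the second structure have already been superimposed onto the first, so RMSD and dist(s, s′) are computed directly on the given coordinates; a fuller implementation would presumably recompute the best rigid body superposition after each removal. The definitions of `pair_dist` and `rmsd` over the two representative points are likewise assumptions, since the text does not define them.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Element = Tuple[Point, Point]    # a secondary structure as two representative points
Pair = Tuple[Element, Element]   # (s, s'), coordinates assumed already superimposed


def pair_dist(pair: Pair) -> float:
    """Hypothetical dist(s, s'): RMS distance over the representative points."""
    s, s2 = pair
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(s, s2)) / len(s))


def rmsd(star: Sequence[Pair]) -> float:
    """RMSD over all representative points of all pairs in the star."""
    sq = [math.dist(p, q) ** 2 for s, s2 in star for p, q in zip(s, s2)]
    return math.sqrt(sum(sq) / len(sq))


def prune_star(star: Sequence[Pair], r: float) -> List[Pair]:
    """Drop the worst-fitting pair until the RMSD is no more than r."""
    star = list(star)
    while star and rmsd(star) > r:
        star.remove(max(star, key=pair_dist))
    return star


# Usage: one exact match, one near match, and one outlier.
a = (((0.0, 0.0, 0.0), (1.0, 0.0, 0.0)), ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0)))
b = (((0.0, 2.0, 0.0), (1.0, 2.0, 0.0)), ((0.3, 2.0, 0.0), (1.3, 2.0, 0.0)))
c = (((0.0, 0.0, 3.0), (1.0, 0.0, 3.0)), ((10.0, 0.0, 3.0), (11.0, 0.0, 3.0)))
pruned = prune_star([a, b, c], 1.0)
```

With r = 1.0 only the outlier pair is removed; a stricter threshold such as r = 0.1 also removes the near match.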

The function Secondary-structure-3-D-alignment() aligns the 3-D secondary structures of two protein structures S and S′. The method is based on building and pruning stars.

Secondary-structure-3-D-alignment(S, S′)

Input: S is a protein structure; S′ is a protein structure.

Output: a 3-D alignment between the secondary structures of S and S′.

Begin
    L = Secondary-structure-sequence-alignment(S, S′).
    Best-star = ∅.
    For each two neighboring aligned pairs (s1, s′1), (s2, s′2) of secondary structures in L
    Begin
        Star = Build-Star((s1, s′1), (s2, s′2), S, S′).
        Star = Prune-star(Star, r).
        If (size(Best-star) < size(Star)) Then Best-star = Star.
    End (of For)
    Return Best-star as an alignment.
End (of Secondary-structure-3-D-alignment)

The function Select-via-secondary-structure-3-D-alignment() selects those proteins that can be well aligned with S0 by the Secondary-structure-3-D-alignment function.

Select-via-secondary-structure-3-D-alignment(S0, Structure-list)

Input: S0 is the input protein structure, and Structure-list is the list of structures to be searched for similar proteins.

Output: a sublist of protein structures that can be well aligned with S0.

Begin
    Let L = ∅.
    For each protein structure S in the Structure-list
    Begin
        A = Secondary-structure-3-D-alignment(S0, S).
        If (alignment A is good enough) Then put S into L.
    End (of For)
    Return L.
End (of Select-via-secondary-structure-3-D-alignment)

In document Robust and Efficient Algorithms for Protein 3-D Structure Alignment and Genome Sequence Comparison (Page 51-54)