5.4 Voronoi contact patterns around functionally important sites
5.4.1 Outline of the method
Matthias Siebert evaluated the usefulness of the Vorolign scoring function for identifying functionally important, conserved 3D patterns in protein families in his diploma thesis, carried out under the supervision of the author. Similar to other 3D-template-based methods, we aim at the automatic identification of highly conserved 3D residue patterns in protein structures that are known to have the same function. In a first step, we identify (functionally) important seed residues, e.g. using PROSITE patterns or by identifying highly conserved and spatially close residues in multiple alignments of the family. For the identification of larger patterns, we assume that not only the functional residues themselves but also the surrounding residues (called supporting shell residues in the following) may have important functions, e.g. coordinating functionally important residues or exhibiting previously undiscovered functions, and may therefore be highly conserved in the family as well. In a second step, we therefore aim at identifying highly conserved, consistent residue networks in the supporting shells of functional sites across all members of the family. Given a Vorolign pattern for a family, we can then, in a third step, search a complete database of protein structures for matches of the pattern and thereby identify proteins with potentially similar functions. In the following, we briefly outline the individual steps of the method.
In the first step, we need to identify residues of potential interest, which serve as starting points for the further exploration of contact network conservation. The simplest way to define such residues is to use PROSITE patterns describing functionally important residues and active sites, such as the histidine residue of the catalytic triad in trypsin-like serine proteases (pattern PS00134: [LIVM]-[ST]-A-[STAG]-H-C). In this first attempt to test our ideas, we mainly used such active-site PROSITE patterns to define seed residues (see also Figure 5.12). However, another approach, which does not require prior knowledge, would be to explore as potential seeds all sites (e.g. all pairs or triplets of residues) that are highly conserved in a multiple Vorolign alignment of the family and close in space on the surface of the protein.
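As an illustration of how a PROSITE pattern can yield a seed position, the following minimal sketch translates the simple pattern syntax into a regular expression and reports the index of the seed residue in each match. The function names, the toy sequence and the offset of the histidine within the pattern are assumptions made for this example. They are not taken from the thesis implementation, and the translation covers only a subset of the PROSITE syntax.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE pattern into a Python regular expression.

    Only a subset of the syntax is handled (assumption for this sketch):
    literal residues, [..] alternatives, {..} exclusions, x wildcards and
    (n) / (n,m) repeat counts. Anchors (< and >) are not supported.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("x", ".")                       # x = any residue
        element = element.replace("{", "[^").replace("}", "]")    # {P} = anything but P
        element = element.replace("(", "{").replace(")", "}")     # x(2,4) -> .{2,4}
        parts.append(element)
    return "".join(parts)


def find_seed_positions(sequence: str, pattern: str, seed_offset: int) -> list:
    """Return 0-based sequence indices of the seed residue for every match.

    seed_offset is the position of the seed residue inside the pattern,
    e.g. 4 for the histidine in [LIVM]-[ST]-A-[STAG]-H-C.
    """
    regex = re.compile(prosite_to_regex(pattern))
    return [m.start() + seed_offset for m in regex.finditer(sequence)]


# Hypothetical toy sequence, not a real protein.
TRYPSIN_HIS = "[LIVM]-[ST]-A-[STAG]-H-C"
toy_sequence = "GGWVLTAAHCYK"
seeds = find_seed_positions(toy_sequence, TRYPSIN_HIS, seed_offset=4)
```

In a real setting the sequence index would still have to be mapped to the residue numbering of the structure file before it can be used as a seed in the tessellation.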
The second step aims at identifying 3D patterns containing highly conserved residues in the supporting shell around the seeds. The idea is supported by the finding that interactions between catalytic and non-catalytic residues may play functional roles in catalysis [63], e.g. in controlling tautomerization of the histidine imidazole ring or in stabilizing charged residues. Starting from the seed residues defined in step one, we extend the potential pattern to all direct neighbors of the seed residues in the Voronoi tessellation of the structure (the supporting shell) and to all direct neighbors of the supporting shell residues, i.e. the supporting-supporting shell. All of these residues are candidates for inclusion in the final Voronoi pattern (see also Figure 5.13).
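The shell expansion described above can be sketched as two rounds of neighbor collection on the contact graph. The sketch assumes that the Voronoi neighbors of every residue are already available as a plain dictionary (the name `neighbors` and the toy contact graph are hypothetical). How the tessellation itself is computed is outside the scope of this example.

```python
def supporting_shells(neighbors, seeds):
    """Collect the supporting shell and the supporting-supporting shell.

    neighbors: dict mapping a residue id to an iterable of residue ids that
               share a Voronoi face with it (assumed to be precomputed).
    seeds:     iterable of seed residue ids.
    Returns two disjoint sets: (first shell, second shell).
    """
    seeds = set(seeds)

    # First shell: direct neighbors of any seed, excluding the seeds themselves.
    first = set()
    for residue in seeds:
        first |= set(neighbors.get(residue, ()))
    first -= seeds

    # Second shell: direct neighbors of first-shell residues that are
    # neither seeds nor already part of the first shell.
    second = set()
    for residue in first:
        second |= set(neighbors.get(residue, ()))
    second -= seeds | first

    return first, second


# Hypothetical toy contact graph: residue 1 is the only seed.
toy_neighbors = {
    1: [2, 3],
    2: [1, 4],
    3: [1, 4, 5],
    4: [2, 3, 6],
    5: [3],
    6: [4],
}
first_shell, second_shell = supporting_shells(toy_neighbors, seeds=[1])
```

Residue 6 is three contacts away from the seed in this toy graph and is therefore not a candidate for the pattern.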
Given two protein structures together with their seed and shell residues, the task is to identify the residue network conserved in both structures. Since the mapping of the seed residues is given by the PROSITE pattern or the multiple alignment, we first map the supporting shell residues of the seeds. To this end, we use the Vorolign scoring function to compute the similarity of residues in the cell neighborhood, and then
we extract the final residue mapping from the Vorolign low-level matrix (see also Figure 5.3). Having mapped the shell residues of the seeds, we carry out the same procedure for the mapped shell residues and their nearest neighbors according to the Voronoi tessellation. This finally yields a set of residues placed around the seed residues, each mapped onto its counterpart in the other structure. These residues can then be used to define the Voronoi pattern, using different features for the vertices and edges of the pattern, such as secondary structure and amino acid conservation for the vertices or geometrical constraints (distance, face area, ...) for the edges.
Figure 5.11: Consistency check example considering three patterns from proteins A (red), B (blue) and C (green). Seed residues are indicated by black circles and the patterns are depicted in a graph-like representation. The (consistent) mapping of the leucine residues is shown by black arrows. (Figure taken from the diploma thesis of Matthias Siebert.)
If the family contains more than two structures, we need to identify the contact networks that are conserved across all members of the family, given the set of all pairwise Voronoi contact patterns between proteins in the set. Combining these pairwise patterns into one consensus pattern is a problem similar to computing a multiple alignment from a set of pairwise alignments, and the idea of our mapping procedure resembles the T-Coffee method for multiple alignments. Our goal is to identify all residues that are consistently mapped between two proteins A and B given a third structure C (a simple mapping case is shown in Figure 5.11), and finally to identify all residues that are consistent and conserved among all members of the family. The final consensus Voronoi pattern (an example is shown in Figure 5.12) is defined by the pair of patterns from two structures A and B that implies the largest consistency across all other patterns and structures in the group, and therefore by the multiple pattern alignment that contains the most gap-free alignment columns.
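The consistency idea can be illustrated with a small sketch. It assumes that each pairwise pattern is reduced to a dictionary mapping residues of one structure to residues of the other. A residue pair of A and B is kept only if the detour over every other structure C leads to the same partner. The data layout, the function names and the toy residue labels are assumptions for this example, and the actual procedure in the thesis may weight or filter mappings differently.

```python
from itertools import permutations


def with_inverses(mappings):
    """Add the reverse direction (Y -> X) for every given mapping (X -> Y)."""
    full = dict(mappings)
    for (x, y), mapping in mappings.items():
        full.setdefault((y, x), {ry: rx for rx, ry in mapping.items()})
    return full


def consistent_pairs(mappings, a, b, others):
    """Keep residue pairs of (a, b) that are confirmed via every structure in others."""
    kept = {}
    for res_a, res_b in mappings[(a, b)].items():
        # Follow a -> c -> b; a missing link yields None and fails the check.
        if all(mappings[(c, b)].get(mappings[(a, c)].get(res_a)) == res_b
               for c in others):
            kept[res_a] = res_b
    return kept


def best_pair(mappings, structures):
    """Return (a, b, kept) for the pair with the most consistent residue pairs."""
    best = None
    for a, b in permutations(structures, 2):
        others = [s for s in structures if s not in (a, b)]
        kept = consistent_pairs(mappings, a, b, others)
        if best is None or len(kept) > len(best[2]):
            best = (a, b, kept)
    return best


# Hypothetical pairwise mappings between three toy structures.
pairwise = with_inverses({
    ("A", "B"): {"L10": "L12", "V20": "V22", "G30": "G31"},
    ("A", "C"): {"L10": "L11", "V20": "V21", "G30": "G35"},
    ("C", "B"): {"L11": "L12", "V21": "V22", "G35": "G99"},
})
consensus = best_pair(pairwise, ["A", "B", "C"])
```

In the toy data the glycine mapping is dropped because the path over C ends at a different residue of B than the direct mapping from A to B.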
Given a mapping of all patterns in the set, we can extract different features to form the consensus pattern. In principle, these can include any geometric, structural or biochemical features. In the diploma thesis of Matthias Siebert, several such feature representations were tested. Surprisingly, the features also used by Vorolign, namely the combination of secondary structure
and amino acid features, turned out to perform best and are therefore applied in the case study described below.
The final question is how to search with a pattern, or a set of patterns, against a database of, e.g., newly resolved protein structures. The task is to match a pattern, which corresponds to a Delaunay graph, to the Delaunay graph of the protein structure being searched; this is an instance of the subgraph isomorphism problem. Although this problem is known to be NP-complete, we use a brute-force enumeration procedure, more precisely an exhaustive depth-first tree search. This approach is computationally feasible because the patterns and their vertex and edge features allow the search space to be pruned dramatically.
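A minimal sketch of such a depth-first search with label-based pruning is given below. It assumes that every vertex carries one label (here a tuple of amino acid and secondary structure state) and that a match only requires each pattern edge to be present in the target, i.e. the non-induced variant of the problem. Edge features such as distances or face areas are omitted. All names and the toy graphs are hypothetical and do not reproduce the thesis implementation.

```python
def adjacency(edges):
    """Build an undirected adjacency dict of sets from a list of vertex pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def match_pattern(p_labels, p_edges, t_labels, t_edges):
    """Enumerate all embeddings of the pattern graph in the target graph.

    p_labels / t_labels: dict vertex -> label.
    p_edges / t_edges:   lists of undirected edges (vertex pairs).
    Returns a list of dicts mapping pattern vertices to target vertices.
    """
    p_adj, t_adj = adjacency(p_edges), adjacency(t_edges)
    order = list(p_labels)          # fixed order in which pattern vertices are assigned
    results = []

    def extend(assignment):
        if len(assignment) == len(order):
            results.append(dict(assignment))
            return
        vertex = order[len(assignment)]
        for candidate in t_labels:
            if candidate in assignment.values():
                continue            # mapping must be injective
            if t_labels[candidate] != p_labels[vertex]:
                continue            # prune: vertex features differ
            if any(u in assignment
                   and assignment[u] not in t_adj.get(candidate, set())
                   for u in p_adj.get(vertex, set())):
                continue            # prune: a pattern edge has no counterpart
            assignment[vertex] = candidate
            extend(assignment)
            del assignment[vertex]  # backtrack

    extend({})
    return results


# Hypothetical toy pattern and target; labels are (amino acid, secondary structure).
pattern_labels = {"p0": ("H", "E"), "p1": ("D", "C"), "p2": ("S", "E")}
pattern_edges = [("p0", "p1"), ("p0", "p2")]
target_labels = {"t0": ("H", "E"), "t1": ("D", "C"), "t2": ("S", "E"),
                 "t3": ("S", "E"), "t4": ("H", "E")}
target_edges = [("t0", "t1"), ("t0", "t2"), ("t2", "t3"), ("t4", "t3")]
matches = match_pattern(pattern_labels, pattern_edges, target_labels, target_edges)
```

The label check discards most candidates before any recursion happens, which is the effect the text attributes to the vertex and edge features of the pattern.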