1.8 Structure comparison metrics

1.8.7 Secondary structure matching-based Q score

1.8.7.2 Pairwise protein comparison: Sequence and

Pairwise protein comparison using SSM generates a quality score (Q score), which is used as a measure of distance between the structures. This section demonstrates the SSM metric for protein comparison in two cases: one pair of conserved proteins and one pair of divergent proteins.

First, the α- and β-haemoglobins from Anser indicus are compared with one another, as they are known to be the result of a relatively recent gene duplication and divergence event. SSM generates a rotation matrix, which is used to superpose the structures as shown in Figure 1.14. The sequence alignment of the α- and β-haemoglobins in Figure 1.15 shows 34% identity and 54% similarity. In this case strong sequence similarity goes together with strong structural similarity, which is captured by the SSM alignment.

Figure 1.14: Superposition of structures using SSM: α (comprising 141 amino acid residues) and β (comprising 146 amino acid residues) haemoglobins from Anser indicus (PDB 1hv4) are superposed using the transformation matrix from SSM. The α and β chains are in red and blue respectively. A Q score of 0.63 was achieved, with an RMSD of 1.35 Å over 125 aligned residues.
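As a minimal sketch of what superposing with a transformation matrix involves, the Python below applies a rotation and a translation to one set of coordinates and then measures the RMSD against a second, paired set. The function names, the separate translation vector and the toy coordinates are assumptions made for illustration; they are not SSM output and not real haemoglobin coordinates.

```python
import math


def apply_transform(coords, rotation, translation):
    """Return the points after x' = R x + t (R is 3x3, t has 3 entries)."""
    moved = []
    for point in coords:
        moved.append(tuple(
            sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
            for i in range(3)
        ))
    return moved


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long, paired point lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = sum(
        (a[k] - b[k]) ** 2
        for a, b in zip(coords_a, coords_b)
        for k in range(3)
    )
    return math.sqrt(total / len(coords_a))


# Toy data (hypothetical): a 90 degree rotation about z plus a shift along x.
rot_z90 = [[0.0, -1.0, 0.0],
           [1.0, 0.0, 0.0],
           [0.0, 0.0, 1.0]]
shift = [1.0, 0.0, 0.0]
mobile = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
target = [(1.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]

rmsd_before = rmsd(mobile, target)
superposed = apply_transform(mobile, rot_z90, shift)
rmsd_after = rmsd(superposed, target)
```

In a real comparison the paired points would be the atoms of the aligned residues (typically Cα atoms), and the RMSD would be reported over those aligned residues only, as in the figure captions.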

Figure 1.15: Pairwise sequence alignment of α- and β-haemoglobins. The alignment shows 34% identity (labelled “*”) and 54% similarity (labelled “:” and “.”). The similar residues include those labelled identical.

Next, the two nucleosome-forming histone proteins H3 and H4 from Homo sapiens are compared; see Figures 1.16 and 1.17. H3 and H4 both belong to the histone family. This case is analogous to that of the α- and β-haemoglobins, with one difference: the gene duplication and divergence that gave rise to H3 and H4 is a deeper evolutionary event than in the haemoglobin case. Owing to their biological role, these proteins have been conserved at the structural level but to a lesser extent at the sequence level. This is reflected in the sequence alignment between H3 and H4, which shows 23% identity and 36% similarity. Because the sequence alignment scores fall in the “twilight zone”, a unified sequence-based phylogenetic analysis has not been attempted for these proteins. However, the conservation in structure can be detected by an SSM-based superposition (Figure 1.16) and is reflected in a Q score of 0.43.

Figure 1.16: Superposed structures of histone H3 (136 residues) and H4 (103 residues) proteins from Homo sapiens (PDB 2cv5). H3 and H4 are in red and blue respectively. A Q score of 0.43 was achieved, with an RMSD of 1.92 Å over 68 aligned residues.

Figure 1.17: Pairwise sequence alignment of H3 and H4 histone proteins. The alignment indicates 23% identity (labelled “*”) and 36% similarity (labelled “:” and “.”). The similar residues include those labelled identical.
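The identity and similarity percentages quoted for the alignments can be tallied from the consensus line of a Clustal-style alignment, where “*” marks identical columns and “:” and “.” mark similar ones. The sketch below is illustrative only: the function name is hypothetical, the denominator is assumed to be the number of alignment columns (other conventions divide by the length of the shorter sequence), and the consensus string is a toy example rather than either of the alignments shown in the figures.

```python
def identity_similarity(consensus):
    """Return (percent identity, percent similarity) from a consensus line.

    Assumption: percentages are taken over all alignment columns, and
    similar columns include the identical ones, as in the figure captions.
    """
    columns = len(consensus)
    if columns == 0:
        raise ValueError("consensus line is empty")
    identical = consensus.count("*")
    similar = identical + consensus.count(":") + consensus.count(".")
    return 100.0 * identical / columns, 100.0 * similar / columns


# Toy consensus line (hypothetical), ten alignment columns long.
toy_consensus = "**:. *  :."
toy_identity, toy_similarity = identity_similarity(toy_consensus)
```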

SSM has been in use since 2004 and has been tested thoroughly by its authors [89]. It also satisfies the following criteria: it utilizes structural components, albeit as vectors, instead of reducing them to distances; it considers the aligned sequence as well as the lengths of the individual proteins; and it generates a normalized score, which can be seen as equivalent to a distance between the structures compared. Together with the two cases examined here, one where the protein sequences were slightly different (the α- and β-haemoglobins, whose comparison is not in the “twilight zone”) and the other where they were significantly different (the H3 and H4 histone proteins), this indicates that the metric is a satisfactory choice for generating distances between structures for use in structure-based phylogeny determination, as done previously by Lundin et al. [8].
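To make the dependence of the score on the aligned residues and the protein lengths concrete, the sketch below evaluates the Q score in the form published for SSM, Q = N_align^2 / ((1 + (RMSD/R0)^2) N1 N2). The formula and the value R0 = 3 Å are taken from the SSM publication and are assumed here rather than restated from this section; the function and variable names are illustrative.

```python
def q_score(n_aligned, rmsd, n_res_1, n_res_2, r0=3.0):
    """Q = N_align^2 / ((1 + (RMSD / R0)^2) * N1 * N2).

    Assumption: r0 = 3.0 (in angstroms) is the empirical constant used by SSM.
    """
    if n_res_1 <= 0 or n_res_2 <= 0:
        raise ValueError("protein lengths must be positive")
    return n_aligned ** 2 / ((1.0 + (rmsd / r0) ** 2) * n_res_1 * n_res_2)


# Values quoted in the caption of Figure 1.14 (alpha- vs beta-haemoglobin).
q_haemoglobin = q_score(n_aligned=125, rmsd=1.35, n_res_1=141, n_res_2=146)
print(round(q_haemoglobin, 2))  # 0.63
```

Here N1 and N2 are the numbers of residues in the two structures being compared, which for a crystal structure can be fewer than the full sequence length. The score equals 1 only when every residue of both proteins is aligned with zero RMSD, and it falls as the RMSD grows or as fewer residues are aligned.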

1.9 Inferential method

The method of inference used in this approach is neighbour-joining (NJ). It was chosen over the other distance and character-based methods discussed earlier for three reasons. First, the metric used for structural comparison generates a score that can be interpreted directly as a structural distance, which justifies the use of a distance method. Second, among the distance methods NJ is chosen for convenience, as the other reliable method, minimum evolution, requires an optimality criterion, which in the case of structures cannot be determined in a straightforward way. Third, character-based methods, for example Bayesian methods, require an evolutionary model and prior probabilities. These approaches cannot be satisfactorily extended to use structure alone, as highlighted previously in Section 1.6.2. Thus, NJ becomes a suitable choice.

Once all the proteins in the structural data set have been compared pairwise, the similarity scores are converted to distances. The distances are used to construct a square distance matrix of size n × n, where n is the number of structures being analysed. Each value in the matrix, d_xy, corresponds to the distance between the two structures listed on row x and column y. Using this nomenclature, the following section explains the NJ algorithm [98] in detail and lists all the steps for converting the distances in the matrix to an unrooted phylogenetic tree.
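The construction of the distance matrix can be sketched as follows. The conversion d = 1 − Q is an assumption made for illustration, since the conversion is not specified at this point in the text; the function name, the structure labels and the Q scores are likewise hypothetical and are not results from this work.

```python
from itertools import combinations


def build_distance_matrix(names, q_scores):
    """Build a symmetric n x n matrix of distances d_xy from pairwise Q scores.

    Assumption: a similarity Q in [0, 1] is converted to a distance as 1 - Q.
    q_scores maps a pair of names, in either order, to its Q score.
    """
    n = len(names)
    index = {name: i for i, name in enumerate(names)}
    matrix = [[0.0] * n for _ in range(n)]  # d_xx = 0 on the diagonal
    for a, b in combinations(names, 2):
        q = q_scores[(a, b)] if (a, b) in q_scores else q_scores[(b, a)]
        distance = 1.0 - q
        matrix[index[a]][index[b]] = distance
        matrix[index[b]][index[a]] = distance
    return matrix


# Hypothetical Q scores for three structures, for illustration only.
toy_names = ["s1", "s2", "s3"]
toy_q = {("s1", "s2"): 0.5, ("s1", "s3"): 0.25, ("s3", "s2"): 0.75}
toy_matrix = build_distance_matrix(toy_names, toy_q)
```

Because each pair is compared once and mirrored across the diagonal, n structures require n(n − 1)/2 pairwise comparisons to fill the matrix.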

In document Exploring deep phylogenies using protein structure : a dissertation submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Biochemistry, Institute of Natural and Mathematical Sciences, Massey University, Auckland, Ne (Page 62-65)