1.8 Structure comparison metrics
1.8.1 RMSD
Root mean squared deviation (RMSD) is a method of quantifying dissim- ilarity between structures and hence can be regarded as a scoring function. RMSD, essentially, calculates distance between a pair of points or, in the case of protein structures, atoms. The mathematical relationship to quantify this difference is given by:
RM SD(v, w) = v u u t 1 n n X i=1 ((vix−wix)2+ (viy−wiy)2+ (viz−wiz)2) (1.4)
where v and w are two sets ofn points and x,y, z are 3D Cartesian coor- dinates.
RMSD therefore generates a measure of difference in the coordinates of the two sets of points compared. In case of rigid superposition of struc- tures before employing RMSD it is essential to remove any contributions from rotational and translational motions by protein structures in 3D space. This is because the empirical superposition approach using RMSD does not represent protein structure in a position-independent manner. To achieve an optimal superposition, a pairwise correspondence is considered between points inv, e.g. coordinates of atoms from structureA, as they are mapped onto points inw, e.g. coordinates of atoms from structureB. During super- position one structure is kept as a target onto which the other structure is mapped. A common method of mapping uses the Kabsch algorithm [94], to generate a transformation matrix which maps one structure onto the other such that the RMSD between the superposed structures is minimized.
Other comparison metrics differ from this RMSD-based superposition method in that they first represent the structures internally. This elimi- nates the need to remove position-dependant rotational and translational motions. Secondly, the metrics considered here allow for flexible alignments i.e. the structural comparison will not assume a one to one relation. This means that while rigid superposition would superpose atom with index“m”
in structure “A” with atom“m” in structure “B”, a flexible alignment could compare atom“m” from structure “A” to“m - k” or“m + k” in structure “B” where“k”can be a number that introduces an offset of a few amino acid residues. This alignment process decides which atoms are considered equiv- alent and hence should be compared, while retaining connectivity. Once an equivalence is established RMSD (as a scoring function only) or other complex functions could be used to quantify the difference between them.
Following from the above discussion, RMSD could not be employed for this work as in itself it does not employ a flexible method of aligning struc- tures. For phylogenetic analysis, the metric needs to be able to perform flexible alignments to better capture evolutionary processes.
1.8.2 DALI
DALI (distance matrix alignment) is a structural comparison metric that utilizes a distance matrix to represent a protein structure internally. It cal- culates distances between all α-carbons (Cα) in a structure and uses the distances to populate a square distance matrix in a sequence dependant
manner. The distance matrix now represents the three dimensional struc- ture. For pairwise comparison both structures are first represented by their respective distance matrices. This is followed by the comparison process which starts by reducing each distance matrix to a set of distance matri- ces of six amino acids each. Each of these matrices essentially represents a hexapeptide fragment from the original protein structure. The same pro- cess is carried out with the distance matrix of the other structure. An all by all comparison between the two sets of hexapeptide matrices is carried out. The final alignment is such that the scoring function in Equation 1.5 is maximized [95]. S = X i∈core X j∈core (θ−∆(dAij, dBij))ω(dAij, dBij) (1.5)
Where for structuresAandB, core is the set of structurally equivalent pairs (iA,jB), ∆ is the deviation of intramolecular Cα-Cαdistances between (iA,
jA) and (iB, jB), relative to their arithmetic mean (d). θ is an empirical threshold of similarity, empirically set to 0.2,ω is an envelope function such thatω = exp(−d2/r2) wherer = 20 ˚A.
Once a final alignment is achieved aZ-score is returned which indicates how similar or different a particular structural alignment is against a distri- bution of random structural alignments.
Although effective in many cases, the DALI metric has two drawbacks a) loss of information when converting a 3D structure to a 2D distance matrix and b) the output score (Z-score) is an open ended probability score i.e. it is not normalised. Holm et al. [96] have suggested a way in which to normalize the score but this uses a background distribution of comparisons. This is problematic because a change in the limited structural data used to calculate the background distribution may impact the normalization process. Furthermore probability scores are not equivalent to distances and hence even in a normalized state the Z-score cannot be used as a measure of structural distance.
1.8.3 TM-Align
TM-Align uses α-carbons while reducing a structure to its respective SSEs. A helix, sheet or coil is assigned by looking at a five amino acid neighbour-
hood of each α-carbon. Dynamic programming is then used to align the SSEs across two structures, with “1” indicating a match, “0” a mismatch and penalty of “-1” for a gap. In the second step an alignment is performed, using dynamic programming, where no gaps are inserted and the best align- ment, using TM-score in Equation 1.6, is selected.
T Mscore = 1 LN LT X i=1 1 1 + di d0 (1.6)
whereLN is the length of the native structure,LT is the number of residues aligned to the template structure,di is the distance between theith pair of
aligned amino acid residues andd0 is a scaling factor given by:
d0 = 1.243 p
LN −15−1.8 (1.7) A third alignment then reintroduces gaps but uses a mix of scores from the first two alignments. The three initial alignments are followed by a heuristic approach which uses the rotation matrix obtained in the earlier alignments to iteratively improve the alignments such that the TM-score is maximized.
TM-Align provides results with a score normalized between 0 and 1 however the normalization is performed with a scaling factord0, see Equa-
tion 1.7, in which the empirical constants are based on random structural matches. Furthermore the score formulation, see Equation 1.6, only consid- ers the length of one structure onto which the other is being mapped i.e. LN.
Although the output generates two scores, each normalized on the length of the respective structure, significantly different scores emerge when the two structures are of different sizes, which is usually the case with significantly diverged proteins.
1.8.4 CE
Combinatorial extension (CE) is based on internal distance matrices similar to DALI, but instead of computing the distance matrix for the entire struc- ture it considers octameric fragments. These fragments are then aligned across the two structures, with the best aligned pair acting as the seed for the remaining alignment. CE defines the best alignment as the longest con-
tinuous path P of aligned fragment pairs (AFPs) between two structures
nA and nB amongst all AFPs. Three criterion determine the continuity of
AFPs constituting an alignment. These are:
pAi+1 =pAi +m ∧ pBi+1=pBi +m (1.8) or
pAi+1 > pAi +m ∧ pBi+1=pBi +m (1.9) or
pAi+1 =pAi +m ∧ pBi+1> pBi +m (1.10)
where pAi and pBi are the starting residues in protein A and B at the ith position in the alignment, m is size of the similarity matrix set to eight for octameric fragments. Equation 1.8 describes consecutive AFPs whereas Equations 1.9 and 1.10 represent consecutive AFPs aligned with gaps in structureA and B, respectively.
The seed alignment is extended to maximize the alignment by incor- porating other aligned fragments using Equations 1.8, 1.9 and 1.10. To minimize gaps Equations 1.9 and 1.10 are improved through the addition of gaps and given by:
pAi+1=pAi +m+G (1.11)
pBi+1=pBi +m+G (1.12)
where G is the maximum gap size of 30, chosen empirically. A heuristic approach is used coupled with distance measures for extending seed align- ments. Following the alignment the aligned fragment pairs are listed along with aZ-score which uses a normal distribution with mean “0” and standard
deviation “1”. The Z-score gives the quality of the alignment by provid- ing the probability of attaining a better alignment from comparing random structures.
Similar to DALI, CE loses information in reducing the structure to a matrix of distances. Moreover CE does not provide a normalized score.
1.8.5 VAST
The vector alignment search tool (VAST) [93] method only allows for com- parison between a query structure and a target database and therefore can- not be used for pairwise comparisons. Although this method is used in the literature, it is not suitable for generating pairwise structural distances, therefore it is not discussed in detail here.
1.8.6 MAMMOTH
Matching molecular models obtained from theory (MAMMOTH) represents a continuous fragment of sevenα-carbons as a unit vector. The vector points from theith to theith + 1α-carbon. The vectors are mapped to the origin as unit vectors. For comparison between two structures, unit vectors from one structure are mapped onto the unit vectors of the other structure and a rotation matrix is determined. The final rotation matrix is such that it minimizes the sum of squared distances between corresponding unit vectors and the square root of the minimized sum gives the URMS (unit vector root mean square) distance. In the second step the URMS is converted to a similarity score by comparing the observed URMS with a minimum URMS. The minimum URMS is given by:
U RM SR= s
2− 2√.84
n (1.13)
whereRis a random set ofn unit vectors. The similarity score is thus given by:
SAB = U RM S
R−U RM SAB
U RM SR ∆(U RM S
where ∆(U RM SR, U RM SAB) is “10” if U RM SR > U RM SAB and “0” otherwise. Comparison between all possible heptapeptides therefore popu- lates a scoring matrix. In the following step dynamic programming is applied to the similarity matrix to build an alignment using the global alignment method, discussed in the previously, Section 1.5.2.1. Finally a maximum subset of local structures havingα-carbons within 4 ˚A is identified and re- ported as PSI (percentage of structural identity). Extreme value fitting is used to compute a P-value to assess the score quality by comparing to random structural alignments.
All the algorithms outlined in this section either decompose structure to distance, which may result in a loss of information, or have non-normalized comparison scores, i.e. there is no upper bound for the score, which in- creases as the sizes of proteins compared increases. Additionally, instead of scores some metrics generate a probability value to assess the quality of the comparison, probability values that may change when the background dis- tribution used to calculate them changes. For these reasons, none of these metrics were used here. Instead the SSM-basedQscorewas used. This metric is outlined in the following section.
1.8.7 Secondary structure matching-based Qscore