CHAPTER 2. CLOSE CORRESPONDENCE BETWEEN THE PROTEIN MO-
2.4 Results and Discussion
2.4.5 Large Overlaps between PCs and Normal Modes – A Structure-Based Explanation of Observed Motions
The dominant directions of motions represented by the first few PCs have been obtained by direct principal component analysis (PCA) of experimental data (X-ray or NMR) and MD trajectories. In this section, we will investigate whether there are structure-based and physics-based explanations for these directions of motions. In other words, are there intrinsic reasons why these directions of motions are preferred?
For this purpose, we compare these directions of motions with the mode motions predicted computationally by ENM. We calculate the overlaps between the first few PCs and low-frequency modes according to Equation 2.7, for the 3 datasets. In all cases, we observe some large overlap values between the first several PCs and a few low-frequency modes. The results imply that the observed structures and the corresponding conformational changes are likely facilitated by the low-frequency, global motions that are intrinsic to the structure. ENM thus provides a coarse-grained, structure-based explanation for the experimentally observed conformational changes taking place mostly upon inhibitor binding (for the X-ray structures), as well as for the dynamics revealed from both the NMR ensemble and the simulated MD dataset.
In addition to explaining the observed conformational changes, the mode motions of the protein from ENM can also be used to predict collective motions of the protein that have not been detected in crystal or NMR structures. When combined with the experimentally observed conformational changes, they can deepen our understanding of the dynamics of the protein and provide specific information regarding the dynamics in the vicinity of the binding site, e.g., the motion of the flaps. Such an understanding (and visualization) of the dynamics may provide key insights for better ways to design new drugs for protein targets.
2.4.5.1 Matching a Single PC with a Single Mode
The overlaps between the first 3 PCs and the first 3 low-frequency modes (Modes 1–3) are shown in Table 2.1(a). In the X-ray-II dataset, the largest overlap is 0.52, between PC 1 and Mode 2. The overlap between PC 2 and Mode 3 is 0.51. In the NMR dataset, the largest overlap is 0.91, between PC 1 and Mode 2. The overlap between PC 2 and Mode 1 is 0.88. In the MD dataset, the largest overlap is 0.74, between PC 1 and Mode 1. The overlap between PC 3 and Mode 3 is 0.65. These results indicate that, in each of the X-ray, NMR and MD cases, the principal motions (i.e., the first few PCs) can each be explained well by a single low-frequency normal mode.
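As an illustration of how an overlap table such as Table 2.1(a) could be assembled, the following is a minimal sketch. It assumes the overlap of Equation 2.7 takes the commonly used form of the absolute cosine of the angle between a PC and a mode, |p · m| / (|p| |m|); Equation 2.7 is not reproduced in this section, so this form is an assumption. The function name `overlap_matrix` and the toy vectors are hypothetical and are not taken from the datasets discussed here.

```python
import numpy as np

def overlap_matrix(pcs, modes):
    """Absolute cosine between every PC (row of pcs) and every mode (row of modes).

    Assumed form of the overlap: |p . m| / (|p| |m|).
    """
    pcs = np.asarray(pcs, dtype=float)
    modes = np.asarray(modes, dtype=float)
    # Normalize each row so the dot product equals the cosine of the angle.
    pcs = pcs / np.linalg.norm(pcs, axis=1, keepdims=True)
    modes = modes / np.linalg.norm(modes, axis=1, keepdims=True)
    return np.abs(pcs @ modes.T)

# Toy data only: an orthonormal basis of a 12-dimensional space.
rng = np.random.default_rng(0)
basis, _ = np.linalg.qr(rng.standard_normal((12, 12)))

modes = basis[:, :3].T                     # three orthonormal "modes"
pcs = np.vstack([modes[1],                 # "PC 1" coincides with Mode 2
                 modes[0],                 # "PC 2" coincides with Mode 1
                 basis[:, 5]])             # "PC 3" is orthogonal to Modes 1-3

overlaps = overlap_matrix(pcs, modes)      # shape (3 PCs, 3 modes)
best_mode = overlaps.argmax(axis=1) + 1    # 1-based index of the best-matching mode
print(np.round(overlaps, 2))
print(best_mode[:2])
```

In this toy setup the first two rows of `overlaps` each contain a single entry close to 1, mirroring the situation in which one PC is matched by one mode, while the third row is close to 0 throughout.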
The largest overlaps found for the first two PCs of the NMR ensemble are very high, at 0.91 and 0.88 respectively (see Table 2.1(a)). This has two implications. On one hand, as mentioned above, the dynamics revealed by applying PCA to the NMR ensemble have a structure-based explanation. On the other hand, the NMR ensemble shows better agreement with the ENM modes than the X-ray structures do, so the dynamics it reveals may serve as an important tool for validating the accuracy of the ENM modes of motion. The large overlaps suggest that the ENM, even though coarse-grained, can capture well the essential dynamics of the protein in solution (for the NMR case). In a recent study, Yang et al. [30] applied GNM to both X-ray structures and NMR ensembles of the same proteins and found that GNM reproduces the residue fluctuations in NMR structures better than those in X-ray structures. These results also support the applicability of ENM to capturing the dynamics of NMR structures.
However, we also see that the largest overlap for the third PC of the NMR dataset is far smaller (0.30). This is mainly because there are only 28 structures in the NMR ensemble, which means that higher PCs may quickly become unreliable. Therefore, a larger ensemble or more ensembles are desired. Unfortunately, there is no other NMR structure available for HIV-1 protease in the Protein Data Bank. A more thorough study using other NMR ensembles of structures is underway.
2.4.5.2 Principal Motion (PC) Represented by a Few Modes
Since ENM is a coarse-grained model, it is possible that each individual mode may not be so precise. The details of each normal mode will of course depend on the force field details. However, the subspace of the low-frequency modes is much less affected by such details [31,32], and it has been shown that the overall shape is dominant in determining the motions of the slower modes [33–35]. Therefore, it is worthwhile to determine how well a given principal motion (PC) can be represented by a few low-frequency normal modes collectively. To do so, we calculate the cumulative overlap (CO) for each PC with the subspace defined by the first few low-frequency normal modes.
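The cumulative overlap can be sketched as follows, assuming the commonly used definition CO(k) = sqrt(sum over j = 1..k of O_j^2), where O_j is the overlap between the PC and mode j. The defining equation is not reproduced in this section, so this form is an assumption. The function name `cumulative_overlap` and the toy vectors are hypothetical. The sketch also assumes the modes are mutually orthogonal, as normal modes are; otherwise the value could exceed 1.

```python
import numpy as np

def cumulative_overlap(pc, modes, k):
    """Cumulative overlap of one PC with the first k modes (rows of modes).

    Assumed definition: sqrt(sum_j overlap_j^2) over j = 1..k, with the
    modes taken to be mutually orthogonal.
    """
    pc = np.asarray(pc, dtype=float)
    pc = pc / np.linalg.norm(pc)
    first_k = np.asarray(modes, dtype=float)[:k]
    first_k = first_k / np.linalg.norm(first_k, axis=1, keepdims=True)
    # Squared projections onto each mode add up to the squared length of the
    # projection of the PC onto the subspace spanned by the k modes.
    return float(np.sqrt(np.sum((first_k @ pc) ** 2)))

# Toy data only: six orthonormal "modes" (unit vectors of a 6-D space) and a
# "PC" spread evenly over the first four of them.
modes = np.eye(6)
pc = np.array([1.0, 1.0, 1.0, 1.0, 0.0, 0.0])

co_1 = cumulative_overlap(pc, modes, 1)   # overlap with Mode 1 alone
co_3 = cumulative_overlap(pc, modes, 3)   # Modes 1-3 collectively
co_4 = cumulative_overlap(pc, modes, 4)   # Modes 1-4 span the PC completely
print(round(co_1, 3), round(co_3, 3), round(co_4, 3))
```

The toy case shows why the cumulative measure is informative: no single mode overlaps with the PC by more than 0.5, yet four modes together describe it completely.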
The results in Table 2.1(b) show that even with 3 modes, overlap values are usually significantly improved. Further improvements are gained across the board when the first 20 low-frequency modes are used. The cumulative overlap for PC 3 of the NMR set remains relatively low. As pointed out earlier, this is mainly due to the small size of the NMR ensemble, which renders its higher PCs unreliable. In summary, the principal motions determined from PCA can be well captured by a small number of low-frequency normal modes.
2.4.5.3 Overlaps between PC and Mode Subspaces
The first few PCs collectively capture the majority of the total variance, so the subspace spanned by these PCs reflects the dominant motion space of the protein. To measure how well this motion space can be captured by the first several low-frequency normal modes, we calculate the RMSIP (see Equation 2.9) between the two spaces. Intuitively, RMSIP measures the percentage of the PC subspace that is covered by the subspace spanned by the selected low-frequency modes.
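A minimal sketch of an RMSIP calculation is given below. It assumes the commonly used definition, RMSIP = sqrt((1/I) * sum over i, j of (p_i · m_j)^2), with I the number of PCs, and assumes that both sets of vectors are orthonormal. Equation 2.9 is not reproduced in this section, so both the form and the normalization by the number of PCs are assumptions. The function name `rmsip` and the toy subspaces are hypothetical.

```python
import numpy as np

def rmsip(pcs, modes):
    """Root mean square inner product between two sets of orthonormal vectors.

    Assumed definition: sqrt(sum_ij (p_i . m_j)^2 / I), where the rows of pcs
    are the I principal components and the rows of modes are the normal modes.
    """
    pcs = np.asarray(pcs, dtype=float)
    modes = np.asarray(modes, dtype=float)
    inner = pcs @ modes.T                  # all pairwise inner products
    return float(np.sqrt(np.sum(inner ** 2) / pcs.shape[0]))

# Toy data only: unit vectors of an 8-D space stand in for PCs and modes.
basis = np.eye(8)
pcs = basis[:4]                            # "PC subspace": directions 1-4

r_same = rmsip(pcs, basis[:4])             # identical subspaces
r_half = rmsip(pcs, basis[:2])             # modes cover 2 of the 4 PC directions
r_none = rmsip(pcs, basis[4:])             # orthogonal subspaces
print(round(r_same, 3), round(r_half, 3), round(r_none, 3))
```

Under this assumed normalization, the value is 1 when the mode subspace contains the PC subspace and 0 when the two are orthogonal. When fewer modes than PCs are used, the value cannot reach 1, which is worth keeping in mind when comparing entries computed with different numbers of modes.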
Table 2.1(c) lists the RMSIP values between the subspace spanned by the first 6 PCs and those spanned by the first 3, 6, and 20 modes. Large RMSIP values are seen even with 3 modes, and marginal improvements are achieved as more modes are included, until the RMSIP values reach about 0.7 (or 70%) when the first 20 low-frequency modes are considered. These results suggest that the majority of the dynamics displayed in these datasets can be explained by a small set of the ENM modes. This, in addition to ENM’s success in interpreting the crystal B-factors of X-ray structures and the NMR ensembles [30], supports the validity of using ENM to study protein dynamics. Moreover, the dynamics considered here cover a broad range of cases: proteins in crystals, proteins in solution, and MD simulations.
Though ENMs are coarse-grained models, their usefulness in capturing the collective dynamics of macromolecules has been demonstrated repeatedly over the last decade. Table 2.1(c) shows this again: the subspace spanned by the first 20 low-frequency modes of the ENM matches quite well with the subspace spanned by the PCs of the X-ray and the NMR structures, as well as that of the MD trajectory.