

10.3.4 Correlation Between Sequence Entropy and Structural Conservation

For each fold family, we compute the sequence entropy of the representative protein as described in Section 10.2.6. Figure 10.4 shows the sequence entropies of all sites in the Ig family; to allow a direct comparison, the structure entropy is plotted in the same figure. Comparing the two entropies, we notice that low sequence entropy is always coupled with low structure entropy. For example, site 70 has a very low sequence entropy (0.5) and also a very low structure entropy (close to 0). The converse, however, does not always hold. For example, site 71 has a very high sequence entropy (0.9) but still has a very low structure entropy, since it is located in a structurally conserved region of the protein.
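The entropy definition itself is given in Section 10.2.6 and is not reproduced here. As a rough illustration only, the sketch below computes a per-site Shannon entropy from the columns of a multiple sequence alignment and scales it to the range [0, 1], which matches the range of the values quoted above. The scaling by the logarithm of the alphabet size, the handling of gaps, and the toy alignment are all assumptions, not the definition used in this chapter.

```python
import math
from collections import Counter


def column_entropy(column, alphabet_size=20):
    """Shannon entropy of one alignment column, scaled to [0, 1].

    Assumption: gaps ('-') are ignored and the entropy is divided by
    log(alphabet_size); Section 10.2.6 may define this differently.
    """
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    # Sum of p * log(1/p) over the residue types observed in the column.
    h = sum((n / total) * math.log(total / n) for n in counts.values())
    return h / math.log(alphabet_size)


# Hypothetical toy alignment: four sequences, four aligned sites.
alignment = ["ACDE", "ACDF", "ACEG", "AGEH"]
entropies = [column_entropy(col) for col in zip(*alignment)]
```

In this toy input the first column is fully conserved and receives entropy 0, while the last column, where every sequence differs, receives the largest value.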

[Plot: sequence entropy and structure entropy (y-axis, 0 to 1) versus residue location (x-axis, 0 to 90).]

Figure 10.4: Sequence and structure entropies in the Ig fold family.

The asymmetric relation between sequence-conserved regions and structure-conserved regions is not hard to understand. It is well known that sequence-conserved sites play an important role in the structure (and/or function) of a protein. On the other hand, sites in a highly conserved structural region may still undergo rapid mutation, which is why so many proteins with diverse sequences can adopt the same fold. In the sequel, we call sites that are conserved in both sequence (characterized by low sequence entropy) and structure (characterized by low structure entropy) “influential”. For example, sites 33, 68, and 70 in the Ig family are influential.

In Figure 10.5, we further study the correlation between sequence entropy and structure entropy in the Ig family using linear regression. The left panel of Figure 10.5 shows the scatter plot of sequence entropy (x-axis) against structure entropy (y-axis) with the fitted regression line. If sites with low sequence entropy were always coupled with low structure entropy and vice versa, we would expect a positive correlation between the two entropies. With a few outliers (discussed below), this is the case for the Ig family. The correlation coefficient is 0.48 and, under the null hypothesis of no correlation, the P-value of the observed linear regression is 1.9 × 10^-6. Since there is no obvious reason to expect a linear relationship between sequence entropy and structure entropy, the regression line and the P-value serve only an illustrative purpose.
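For readers who want to reproduce this kind of analysis, the following is a minimal sketch of the quantities involved: the least-squares line, the Pearson correlation coefficient r, and the t statistic from which a P-value under the no-correlation null hypothesis is usually derived. The function names and the sample entropies are hypothetical, and the code is not the Matlab routine used for the reported numbers.

```python
import math


def linear_fit(x, y):
    """Return (slope, intercept, r) of the least-squares line y = slope*x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)  # Pearson correlation coefficient
    return slope, intercept, r


def t_statistic(r, n):
    """t statistic for testing r = 0 with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Hypothetical per-site entropies (not the Ig family data).
seq_entropy = [0.10, 0.20, 0.50, 0.90, 0.80, 0.60]
struct_entropy = [0.00, 0.10, 0.30, 0.10, 0.60, 0.40]

slope, intercept, r = linear_fit(seq_entropy, struct_entropy)
t = t_statistic(r, len(seq_entropy))
# Residuals of the fit, i.e. the third quantity plotted in the right panel.
residuals = [y - (slope * x + intercept)
             for x, y in zip(seq_entropy, struct_entropy)]
```

A two-sided P-value is then obtained from the Student t distribution with n - 2 degrees of freedom, which is typically taken from a statistics library rather than coded by hand.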

[Left panel: structural entropy versus sequence entropy (0 to 1). Right panel, "Various Distributions": sequence entropy, structural entropy, and residuals of the linear regression versus position along the sequence (0 to 90).]

Figure 10.5: Correlation between sequence and structure entropies. Left: the scatter plot of the sequence entropy (x-axis) and the structure entropy (y-axis) with the fitted linear regression line. Right: distribution of sequence entropy (blue), structural entropy (green), and linear regression residual (red) of sites in the Ig fold family.

We postulate that the asymmetric relation between sequence entropy and structure entropy contributes to the low correlation we observe. Specifically, sites that are close to an influential site tend to have low structure entropy (since they are located in a highly conserved structural region) but high sequence entropy. To test this, we use a simple method to eliminate “outliers”: we remove sites that have high sequence entropy (>0.8) and low structure entropy (<0.4) if they are in direct contact with at least one influential residue. This improves the correlation: the correlation coefficient increases from 0.48 to 0.74 and the P-value of the linear regression drops from 1.9 × 10^-6 to 6 × 10^-14.
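The outlier rule can be sketched as a simple filter. The thresholds 0.8 and 0.4 come from the text; the data layout (dictionaries keyed by site number, a contact map stored as neighbor sets) and all entropy values other than those quoted for sites 70 and 71 are hypothetical choices made for illustration.

```python
def remove_outliers(seq_entropy, struct_entropy, contacts, influential,
                    seq_cutoff=0.8, struct_cutoff=0.4):
    """Return the sites kept after dropping outliers.

    A site is dropped only if all three conditions hold: sequence
    entropy > seq_cutoff, structure entropy < struct_cutoff, and
    direct contact with at least one influential site.
    """
    kept = []
    for site in sorted(seq_entropy):
        high_seq = seq_entropy[site] > seq_cutoff
        low_struct = struct_entropy[site] < struct_cutoff
        touches = any(nb in influential for nb in contacts.get(site, ()))
        if not (high_seq and low_struct and touches):
            kept.append(site)
    return kept


# Hypothetical data; only the sequence entropies of sites 70 and 71
# are taken from the text.
seq_entropy = {69: 0.60, 70: 0.50, 71: 0.90, 72: 0.85}
struct_entropy = {69: 0.30, 70: 0.05, 71: 0.10, 72: 0.70}
contacts = {69: {70}, 71: {70}, 72: {70}}  # site -> sites in direct contact
influential = {70}

kept_sites = remove_outliers(seq_entropy, struct_entropy, contacts, influential)
```

Here site 71 is removed because it meets all three conditions, while site 72 is kept because its structure entropy is not low. The correlation is then recomputed on the kept sites only.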

We carried out similar tests for the remaining fold families; the results are shown in Table 10.6.

Fold   Ig            OB             R          α/β-P           TIM
r      0.74          0.62           0.74       0.73            0.68
p      6 × 10^-14    1.3 × 10^-6    <10^-15    5.9 × 10^-14    <10^-15

Table 10.6: Linear correlation of sequence and structure entropy. Both the correlation and the statistical significance were computed using Matlab. <10^-15 is reported in this table when Matlab output zero for the statistical significance.

10.3.5 Structure Entropy and Conserved Contact Subnetworks

Given a group of contact maps, the focus of this section is to use graph mining to locate the most invariant contact subnetwork(s) across protein structures in a fold family. To that end, we use the FFSM (Fast Frequent Subgraph Mining) algorithm [HWP03]. The algorithm takes several parameters, two of which are important for our task: (1) the frequency threshold σ, which requires that a mined subgraph pattern appear in at least a fraction σ of the graphs, and (2) the closeness D to a complete graph, which requires that a subgraph pattern be missing at most D edges relative to a complete graph. A complete graph, also called a clique in graph theory, is one in which every pair of nodes is joined by an edge. In the context of a contact map, cliques (or near-cliques) correspond to substructures whose residues interact extensively.
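The two parameters can be made concrete with a small sketch. This is not FFSM, which searches labeled graphs for frequent subgraphs in general; it only checks the two acceptance conditions for one candidate pattern, under the simplifying assumption that all contact maps share the same residue numbering so that no subgraph matching is needed. The function names and the three toy contact maps are hypothetical.

```python
from itertools import combinations


def edge(u, v):
    """Undirected edge between two residues."""
    return frozenset((u, v))


def missing_from_clique(nodes, pattern_edges):
    """Number of edges the pattern lacks relative to a complete graph."""
    return sum(1 for u, v in combinations(nodes, 2)
               if edge(u, v) not in pattern_edges)


def support(pattern_edges, graphs):
    """Fraction of graphs that contain every edge of the pattern."""
    hits = sum(1 for g in graphs if pattern_edges <= g)
    return hits / len(graphs)


def accept(nodes, pattern_edges, graphs, sigma, d):
    """True if the pattern is within d edges of a clique and has support >= sigma."""
    return (missing_from_clique(nodes, pattern_edges) <= d
            and support(pattern_edges, graphs) >= sigma)


nodes = (836, 837, 870, 871)
clique = {edge(u, v) for u, v in combinations(nodes, 2)}  # all 6 edges
near_clique = clique - {edge(836, 871)}                   # 1 edge missing

# Three hypothetical contact maps, each stored as a set of edges.
graphs = [clique | {edge(834, 835)}, set(clique), set(near_clique)]

clique_ok = accept(nodes, clique, graphs, sigma=0.6, d=0)
near_ok = accept(nodes, near_clique, graphs, sigma=0.9, d=1)
```

In this toy input the full clique occurs in two of the three maps, so it passes σ = 0.6 but would fail σ = 0.9, whereas the near-clique occurs in all three maps and is accepted once D allows one missing edge.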

Figure 10.6 shows one of the five most frequent cliques we obtained from the Ig family. This clique is composed of four residues: Thr836, Tyr837, Glu870, and Val871. Two residues in the clique (Tyr837 and Val871) are believed to belong to the folding nucleus of the protein [HSC00]. The other four cliques are shown in Table 7.

All the residues covered by the five cliques correspond to sites 33-35 and sites 69-73 in the Ig fold family. These sites all have low structure entropy, and some of them also have low sequence entropy (e.g., sites 35 and 70). This offers additional evidence for correlated sequence and structure evolution.

Clique   Residues
1        Glu834, Leu835, Ser872, Tyr873
2        Thr836, Val871, Ser872, Glu870
3        Glu834, Leu835, Thr836, Ser872
4        Glu834, Ser872, Tyr873, Ile874

[Figure panels: residues 836, 837, 870, and 871 are labeled in both the left and right panels.]

Figure 10.6: One of the patterns obtained in protein 1TEN, with support 164. Left: the occurrence of the pattern in protein 1TEN. Right: the topology of the pattern, which is a fully connected graph.