We present a method to infer a specific function for proteins of unknown function, independent of sequence patterns, structural alignments and templates of known functional sites, which complements the other methods and may succeed where they fail. The use of multiple motifs mined from protein families using almost-Delaunay graph edges makes the method especially robust. Our method can find and identify packing patterns of functionally important residues even when there are distortions in the functional site. Unlike clique patterns [WTR+03, MSO03] and triplets [BKB02, LWT05b], our method finds family-specific patterns of arbitrary topology and size.
The successful inference of new members of SCOP families validates the predictive power of fingerprints and the decision to use families from a structural classification; the success rate of 69% is high considering that there are functional outliers among existing and new members of SCOP families. The packing patterns in fingerprints do capture functional information that is unique to a functional family, rather than shared structural information. This is shown by function discrimination within the TIM barrel fold, and the inference of YcdX as belonging to the sequence-diverse metallo-dependent hydrolase family even though it has a different fold.
Our method can be trained on families from a functional classification system such as EC [Bai00] or GO [Con04], and we will report these results in the future.
Our structure-based method may be applied to function prediction at the sequence level using either good quality predicted structures, or sequence patterns derived from fingerprints whose sequence order is preserved within the family. We have preliminary results on this that will be reported in a future publication.
We do observe annotations that oppose our inference. For example, the Gene Ontology Annotation (GOA) database [Con04] annotates 1m65 (which we believe is a metallo-dependent hydrolase) as having DNA-directed DNA polymerase activity (GO:0003887). We notice that this function assignment is putative and is made on the basis of electronic annotation transferred from InterPro, a sequence database. This is the least reliable evidence code for GO annotation, Inferred from Electronic Annotation (IEA). The discoverers of the PHP domain sequence family [AK98] indicated shared active site sequence motifs between the metallo-dependent hydrolase family and the PHP-domain family, and hypothesized that bacterial and archaeal DNA polymerases possess intrinsic phosphatase activity that hydrolyzes the pyrophosphate released during nucleotide polymerization. Several metallo-dependent hydrolases hydrolyze phosphoester or phosphate bonds; thus, the assigned GO term may not contradict the function inferred by our method, which is left for further study.
Our method has a few limitations, arising from implementation choices, algorithmic issues, and the nature of the problem itself. In our implementation, we use Cα coordinates to calculate graph edges and lengths; this choice captures shared topology, but may miss interactions with long side-chains. Currently we do not allow residue substitutions in patterns, other than unifying V,A,I,L. Merging commonly substituted residue types (e.g. D,E) increases the sensitivity of fingerprints but can decrease their specificity; we may lose fingerprints that are no longer unique to a family. Also, the distance edge matching criteria may be too restrictive to find some patterns that vary widely in their geometry or have distances lying on bin boundaries. We are developing a new distance edge representation that will remedy this.
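The Cα-based edge labelling and the bin-boundary issue described above can be illustrated with a minimal sketch. It is not the implementation used in this work: a plain distance cutoff stands in for the almost-Delaunay edge criterion, and the function names, the 8.5 Å cutoff, and the 2.0 Å bin width are all assumptions chosen for illustration.

```python
import math
from itertools import combinations

# Unify V, A, I, L into one node label, as described in the text.
# The label string "VAIL" is a hypothetical choice.
MERGED = {"V": "VAIL", "A": "VAIL", "I": "VAIL", "L": "VAIL"}


def bin_distance(d, bin_width=2.0):
    """Map a distance in angstroms to an integer bin index (assumed bin width)."""
    return int(d // bin_width)


def contact_graph(residues, cutoff=8.5, bin_width=2.0):
    """Build a labelled graph from C-alpha coordinates.

    residues: list of (one_letter_code, (x, y, z)) tuples.
    Returns (labels, edges) where edges maps (i, j), i < j, to a distance bin.
    NOTE: a simple distance cutoff replaces the almost-Delaunay criterion here.
    """
    labels = [MERGED.get(aa, aa) for aa, _ in residues]
    edges = {}
    for (i, (_, p)), (j, (_, q)) in combinations(enumerate(residues), 2):
        d = math.dist(p, q)
        if d <= cutoff:
            edges[(i, j)] = bin_distance(d, bin_width)
    return labels, edges


# Two nearly identical distances (3.99 and 4.01 angstroms) fall into different bins.
toy = [("V", (0.0, 0.0, 0.0)), ("D", (3.99, 0.0, 0.0)),
       ("L", (0.0, 4.01, 0.0)), ("G", (20.0, 0.0, 0.0))]
toy_labels, toy_edges = contact_graph(toy)
```

In the toy input, the edges from residue 0 to residues 1 and 2 differ by only 0.02 Å yet receive different bin labels, so a pattern containing one would not match the other under exact bin matching.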
Algorithmically, subgraph mining is an NP-complete problem since it involves subgraph isomorphism. Though the FFSM algorithm [HWB+04] avoids the isomorphism problem in most cases by storing graph embeddings, the problem does affect families with very similar or identical structures, and families of 3 or fewer proteins. Such families often yield excessive and non-specific fingerprints that have little power to infer new members. Also, finding fingerprints in a family is not guaranteed; roughly 35% of SCOP families that we tried gave no fingerprints (typically functionally heterogeneous SCOP superfamilies, and some families with extreme sequence divergence) or too many fingerprints (typically highly sequence-similar enzyme families). The number, specificity and sensitivity of fingerprints depend on family characteristics such as size and heterogeneity; thus the support and background occurrence parameters must be varied to find meaningful sets of fingerprints in the maximum number of families.
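The role of the support and background occurrence parameters can be sketched as a simple filter over mined patterns. This is a hedged illustration only: the function name, the data layout, and the default thresholds (0.8 and 0.05) are assumptions and are not values reported in this work.

```python
def select_fingerprints(patterns, family, background,
                        min_support=0.8, max_background=0.05):
    """Keep patterns that are frequent in the family and rare in the background.

    patterns:   dict mapping a pattern id to the set of protein ids containing it.
    family:     set of protein ids in the family being mined.
    background: set of protein ids outside the family.
    Thresholds are hypothetical and would be tuned per family.
    """
    selected = []
    for pattern_id, occurrences in patterns.items():
        support = len(occurrences & family) / len(family)
        bg_rate = len(occurrences & background) / len(background)
        if support >= min_support and bg_rate <= max_background:
            selected.append(pattern_id)
    return sorted(selected)


# Toy data: p1 is frequent and specific, p2 is too rare in the family,
# p3 is frequent in the family but also common in the background.
family_ids = {"a", "b", "c", "d", "e"}
background_ids = {f"x{i}" for i in range(100)}
mined = {
    "p1": {"a", "b", "c", "d", "x1"},
    "p2": {"a", "b"},
    "p3": family_ids | {f"x{i}" for i in range(10)},
}
kept = select_fingerprints(mined, family_ids, background_ids)
```

Raising `min_support` or lowering `max_background` shrinks the fingerprint set, which is one way to handle families that otherwise yield too many non-specific patterns.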
On the systemic level, the method identifies functional families with four or more representatives by learning what packing patterns are unique to each family, and it does not identify the exact function within a family. Also, several structures have errors, missing fragments or mutations that lead to failure of mining or function inference. Careful manual selection of families and fixing errors in structures should improve the results further. Finally, all function inferences made by computational methods such as ours need to be validated in the laboratory.
Chapter 10
Coordinated Evolution of Protein Sequences and Structures
10.1 Introduction
Evolutionary relationships between extant proteins afford a valuable database with which to systematically examine how protein structures derive from their sequences. Protein families exhibit immense population diversity: many families contain only a few members, while a few families contain large numbers of related structures. Large protein families include the TIM barrel family and the Rossmann fold family among others, which account for 30% of the entire proteome [SMM+11]. An important key to understanding the origins of such large families may be found by identifying sequence patterns and structure patterns that are associated with such large fold classes. The study of the relation between protein sequence spaces and protein structure spaces has reached a significant plateau, from which one can envision important new insights from an appropriate analysis of the joint distributions of conserved sequence and structure motifs [KWK04].
Our investigation springs from previous studies of protein fold families, which comprise collections of proteins whose structures are similar, but whose sequences differ considerably [KWK04]. In those fold families, conserved sequence sites have been studied extensively. For example, Dokholyan et al. defined a measure called sequence entropy to identify conserved sites in protein superfamilies. The concept of sequence entropy was also elaborated by Donald et al., who established a database of Conservatism of Conservatism (CoC) to record the sequence entropy of all residues of many protein structures [DHR+ar]. As reported by Donald et al., residues with low sequence entropy and low solvent accessibility are responsible for rapid folding and are thus candidates for a “folding nucleus”.
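As a rough illustration of the idea behind sequence entropy, the sketch below computes the Shannon entropy of each column of a multiple sequence alignment, so that conserved positions score near zero. This is a common formulation and an assumption here; the exact definition used by the cited authors (for example, any grouping of residue types into classes or treatment of gaps) may differ.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (natural log) of the residue frequencies in one column."""
    counts = Counter(column)
    n = len(column)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def sequence_entropies(alignment):
    """Per-position entropy for a list of equal-length aligned sequences."""
    return [column_entropy(col) for col in zip(*alignment)]


# Toy alignment: position 0 is fully conserved, position 2 is evenly split.
toy_alignment = ["ACD", "ACE", "ACD", "AGE"]
toy_entropies = sequence_entropies(toy_alignment)
```

Positions with low entropy in such a profile are the conserved sites; combining them with solvent accessibility, as described above, is a separate step not shown here.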
Tertiary structure information is utilized sparingly in most of the current studies for protein fold family evolution. The recent exponential growth in the structure databases has brought with it considerable new potential for knowledge-based methods to study structure conservation within a fold family. Quantifying and identifying structurally conserved regions can supplement our understanding of protein sequence conservation, enhancing our understanding of how sequence determines structure by identifying correlations between sequence and structure conservation.
To achieve a quantitative measure of structure conservation, we have developed structure entropy to measure the conservation of the local environment in a protein fold family. Our methods make it feasible to directly relate the sequence and structure conservation during protein evolution [DHR+ar, SDDS03]. Specifically, we transform a protein structure to a consistent contact graph where a node corresponds to an amino acid residue in the protein and an edge represents a pairwise interaction of residues. Given a group of graphs and a node v, we search for a subset of nodes that (almost) always connect to v in the entire set of graphs and define structure entropies accordingly. The computation of structure entropy involves a search in a high dimensional space and has been shown to be reasonably stable in the presence of structural noise.
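The neighbour search described above can be sketched as follows. This is only an illustration of the "(almost) always connected" step, assuming nodes are indexed consistently across graphs (for example by alignment position); the function name and the `min_fraction` threshold are hypothetical, and the entropy definition itself is not reproduced because it is not given in this section.

```python
def conserved_neighbors(graphs, v, min_fraction=0.9):
    """Return the nodes adjacent to v in at least min_fraction of the graphs.

    graphs: list of adjacency dicts mapping a node to the set of its neighbours,
            with node ids consistent across all graphs (an assumption).
    """
    counts = {}
    for g in graphs:
        for u in g.get(v, set()):
            counts[u] = counts.get(u, 0) + 1
    n = len(graphs)
    return {u for u, c in counts.items() if c / n >= min_fraction}


# Toy family of three contact graphs, showing only the neighbourhood of node 1.
toy_graphs = [{1: {2, 3, 4}}, {1: {2, 3}}, {1: {2, 5}}]
always = conserved_neighbors(toy_graphs, 1, min_fraction=1.0)
mostly = conserved_neighbors(toy_graphs, 1, min_fraction=0.66)
```

Lowering `min_fraction` below 1.0 is what tolerates the occasional missing contact, in line with the "(almost) always" wording above.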
Our results show that there is a strong correlation between the sequence entropy, structure entropy, and frequent structure patterns in a fold family. These results provide the first direct evidence at the residue level for the correlated sequence and structure evolution in protein fold families.