2.2.1 Calculating the structural conservation of structural domains within CATH-Gene3D FunFams and CATH superfamilies
To explore the conservation of domains within FunFams, we took each FunFam that contains structural domains and clustered its CATH structural domains at 90% sequence identity (S90 clusters). The structural representative of each S90 cluster was the relative whose length was closest to the average domain length in the cluster. We then carried out pairwise structural comparisons between all the S90 representatives using SSAP (Taylor and Orengo, 1989).
We also measured the structural conservation of CATH domains across CATH superfamilies, to compare against the conservation within FunFams. For each superfamily, we compared representatives from clusters at 35% sequence identity (S35 clusters). The S35 structural representative was selected based on the average length of the domains in the S35 cluster and for having the best X-ray resolution. For each type of analysis (i.e. FunFam or superfamily), we carried out pairwise structural comparisons between all the representatives.
We calculated the mean normalised RMSD and the mean SSAP score over the comparisons.
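The representative selection and averaging steps can be sketched as follows. This is a minimal illustration, not the pipeline used here: the domain identifiers, the deterministic tie-break and the input layout are assumptions, and the SSAP comparisons themselves are taken as already computed.

```python
from statistics import mean


def pick_representative(domain_lengths):
    """Return the domain whose length is closest to the cluster's mean length.

    domain_lengths: dict mapping a (hypothetical) domain ID to its residue count.
    Ties are broken alphabetically by domain ID (an assumption, for determinism).
    """
    avg = mean(domain_lengths.values())
    return min(sorted(domain_lengths), key=lambda d: abs(domain_lengths[d] - avg))


def summarise_comparisons(comparisons):
    """Average normalised RMSD and SSAP score over pairwise comparisons.

    comparisons: list of (normalised_rmsd, ssap_score) tuples, one per pair
    of representatives, as produced by an external SSAP run.
    """
    rmsds, ssaps = zip(*comparisons)
    return mean(rmsds), mean(ssaps)


# Toy cluster: mean length is 133, so the 131-residue domain is chosen.
cluster = {"domA": 118, "domB": 131, "domC": 150}
rep = pick_representative(cluster)
mean_rmsd, mean_ssap = summarise_comparisons([(1.0, 90.0), (3.0, 80.0)])
```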
2.2.2 Assessing the model quality assessment methods, query-template alignment and template selection strategy
To assess the protocols for selecting templates, aligning templates and queries, and ranking and assessing models, a query dataset was compiled from sequences of known structural domains in CATH (Sillitoe et al., 2015). The clustering program CD-HIT (Fu et al., 2012) was used to cluster CATH domain sequences into clusters of relatives showing 30% or more sequence identity (S30 clusters). A single representative was selected from each cluster to generate a non-redundant dataset.
CHAPTER 2. MODELLING PROTEIN MONOMERS 68
To ensure that the selected proteins had high-quality structures, only proteins solved by X-ray crystallography at a resolution of 2 Å or better were used. At this resolution, structures tend to contain few errors, with only a small number of incorrect rotamers (Huang, 2007).
This dataset was separated into two subsets based on sequence homology. The close homologue subset consists of query targets that have sequence relatives with at least 30% global sequence identity. By contrast, the remote homologue subset comprises query targets whose sequence relatives have less than 30% global sequence identity. This gave 8,633 close homologue query targets and 602 remote homologue query targets.
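A minimal sketch of this partition is given below. It assumes that each query target is summarised by the global sequence identity to its closest sequence relative; that reading of the criterion, the function name and the toy identifiers are all assumptions made for illustration.

```python
def split_by_homology(best_identity, threshold=30.0):
    """Partition query targets into close and remote homologue subsets.

    best_identity: dict mapping a query ID to the highest global sequence
    identity (in %) between that query and any of its sequence relatives.
    Queries at or above the threshold are 'close'; the rest are 'remote'.
    """
    close = sorted(q for q, ident in best_identity.items() if ident >= threshold)
    remote = sorted(q for q, ident in best_identity.items() if ident < threshold)
    return close, remote


# Toy example with hypothetical query identifiers.
close_set, remote_set = split_by_homology(
    {"query1": 62.5, "query2": 30.0, "query3": 18.4}
)
```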
2.2.2.1 Template selection methods
The default parameters suggested by the authors of the respective methods were used for all the methods.
HHsearch Query sequences in the benchmark set were scanned against the CATH HHsearch HMM library using HHsearch. The library is composed of HMMs built for every domain in CATH version 4.0. The HMM library was built according to the protocol provided in the HHsearch guidebook: for each CATH target sequence, we performed two iterations of HHsearch searches against the UniProt20 database (provided by HHsearch, clustered at 20% sequence identity), with an E-value threshold of 0.001.
For each target sequence, we selected the structural template with the highest probability of being a true positive, as reported by the program, which considers both the sequence alignment and the secondary structure alignment. Selected structural templates had to exceed the default probability score of 20 and have an E-value below 0.001. The original structures of the query sequences were excluded as templates.
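The filtering rule described above can be illustrated as follows. This sketch assumes the HHsearch hits have already been parsed into simple records; the `Hit` class, its field names and the identifiers are hypothetical and do not correspond to any HHsearch API.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    template_id: str    # hypothetical identifier of the matched CATH domain
    probability: float  # HHsearch probability of a true positive (0-100)
    evalue: float       # HHsearch E-value


def select_template(hits, query_id, min_prob=20.0, max_evalue=1e-3):
    """Return the highest-probability hit that passes both thresholds.

    The query's own structure is excluded; None is returned if no hit
    satisfies probability > min_prob and E-value < max_evalue.
    """
    candidates = [
        h for h in hits
        if h.template_id != query_id
        and h.probability > min_prob
        and h.evalue < max_evalue
    ]
    return max(candidates, key=lambda h: h.probability, default=None)


hits = [
    Hit("query_dom", 100.0, 1e-50),  # the query's own structure: excluded
    Hit("templ_A", 98.7, 1e-20),
    Hit("templ_B", 99.1, 1e-25),
    Hit("templ_C", 15.0, 0.5),       # fails both thresholds
]
best = select_template(hits, query_id="query_dom")
```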
FunFam Query sequences in the benchmark set were scanned against the HMMs
selected. If the matched FunFam did not contain any structural templates, the next closest FunFam was checked to determine whether there was a structural template. However, the query sequence had to match the FunFam with an E-value threshold of <0.001.
The original structures of the query sequences were excluded as templates. The best templates were selected based on their average BLOSUM62 sequence similarity to the query targets. Templates with an X-ray resolution better (numerically lower) than 5 Å were prioritised.
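One possible reading of this selection logic is sketched below. Several points are assumptions rather than details taken from the text: FunFam matches are tried in order of increasing E-value as a stand-in for "next closest", a FunFam left with no template once the query's own structure is excluded is skipped, templates better than 5 Å are ranked ahead of all others before similarity is compared, and the average BLOSUM62 similarities are supplied precomputed. The dictionary layout and identifiers are hypothetical.

```python
def select_funfam_template(funfam_hits, query_id, max_evalue=1e-3):
    """Pick a structural template from the closest FunFam that has one.

    funfam_hits: list of dicts with keys
      'evalue'    - E-value of the query's match to the FunFam HMM
      'templates' - dict: template ID -> (mean BLOSUM62 similarity to the
                    query, X-ray resolution in angstroms)
    Returns a template ID, or None if no FunFam passes the E-value cutoff
    with at least one usable template.
    """
    for hit in sorted(funfam_hits, key=lambda h: h["evalue"]):
        if hit["evalue"] >= max_evalue:
            break  # remaining matches are weaker still
        usable = {t: v for t, v in hit["templates"].items() if t != query_id}
        if usable:
            # Rank by (resolution better than 5 A, mean BLOSUM62 similarity).
            return max(usable, key=lambda t: (usable[t][1] < 5.0, usable[t][0]))
    return None


funfam_hits = [
    {"evalue": 1e-30, "templates": {"q1": (5.0, 1.8)}},  # only the query itself
    {"evalue": 1e-10, "templates": {"t1": (1.2, 2.0), "t2": (1.5, 6.0), "t3": (0.9, 1.5)}},
    {"evalue": 0.5, "templates": {"t9": (3.0, 1.0)}},     # fails the E-value cutoff
]
chosen = select_funfam_template(funfam_hits, query_id="q1")
```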
2.2.2.2 Comparing different model quality assessment methods
A variety of model quality assessment methods were tested. These methods were used to select the best model from the ten models built by MODELLER. nDOPE (i.e. normalised DOPE) (Shen and Sali, 2006), GOAP (Zhou and Skolnick, 2011) and BACH (Sarti et al., 2013) are single-model quality assessment methods based on statistical potentials. By contrast, GA341 combines statistical potentials with information on target-template sequence identity and structural compactness (Melo et al., 2002). ProQ2 (Ray et al., 2012) is a machine learning method based on evolutionary information, an MSA and structural features. ModFOLDclust2 (McGuffin and Roche, 2010) is a clustering-based model quality assessment method.
To assess the performance of the model assessment methods, the best model selected by the different methods was superposed on the original protein structure, and the quality of the model was given by the structural similarity score.
2.2.2.3 Comparing different query-template alignment methods
MAFFT alignment We employed the MAFFT MSA method (Katoh and Standley, 2013) to align the query and template identified by the different template searching algorithms.
For HHsearch, we realigned the sequences of the best HHsearch match using MAFFT (L-INS-i mode). Next, the query sequence was added to the alignment using the mafft --add option. Subsequently, for both BLAST and HHsearch,
the alignment of query and template was submitted to MODELLER.
For the FunFam method, query sequences were added to the pre-built alignment of the matched FunFam using MAFFT, and the alignment of query and template was submitted to MODELLER. The original FunFam alignment was produced using MAFFT (L-INS-i mode).
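The two MAFFT invocations described above might be assembled as in the sketch below. The file names are hypothetical, and the exact options used in this work beyond L-INS-i mode and `--add` are not stated, so the commands show only the standard forms: `--localpair --maxiterate 1000` is the long form of L-INS-i. The commands are built but not executed here.

```python
import shlex


def mafft_linsi_cmd(fasta_path):
    """Command for an L-INS-i alignment (long form of the 'linsi' shortcut)."""
    return ["mafft", "--localpair", "--maxiterate", "1000", fasta_path]


def mafft_add_cmd(query_fasta, existing_alignment):
    """Command that adds new sequences to an existing alignment."""
    return ["mafft", "--add", query_fasta, existing_alignment]


# Hypothetical file names; MAFFT writes the resulting alignment to stdout.
realign = mafft_linsi_cmd("template_family.fasta")
add_query = mafft_add_cmd("query.fasta", "template_family.aln")
print(shlex.join(realign))
print(shlex.join(add_query))
```

Each list can be passed to `subprocess.run(cmd, capture_output=True, text=True, check=True)`, with the alignment read from the captured standard output.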
HHsearch alignment For HHsearch, we obtained the query-template sequence alignment provided by the HHsearch program. For the FunFam method, after identifying the FunFam to which the query belongs and selecting the best structural templates, we obtained an HHsearch HMM profile for the structural template to generate an alignment. Then, we aligned the query sequence with the template HMM profile using the HHsearch alignment methods and extracted the query-template alignment.
2.2.2.4 Comparative modelling
The comparative modelling software MODELLER (version 9.15) was used to build ten models for each query target for each template selection method. The best model was selected using the best model quality assessment score obtained in Section 2.3.2.
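Choosing one model out of the ten can be sketched as below. The scoring method is deliberately left generic, because the best-performing method is determined in Section 2.3.2; the model names and score values are made up for illustration. The only method-specific fact used is the direction of each score: lower is better for DOPE-type energies such as nDOPE, whereas higher is better for GA341.

```python
def pick_best_model(scores, lower_is_better=True):
    """Return the name of the model with the best quality score.

    scores: dict mapping a model name to its quality assessment score.
    lower_is_better: True for energy-like scores (e.g. nDOPE),
                     False for scores where larger means better (e.g. GA341).
    """
    chooser = min if lower_is_better else max
    return chooser(scores, key=scores.get)


# Hypothetical nDOPE-like scores for three of the ten models.
ndope = {"model_01": -0.85, "model_02": -1.32, "model_03": -1.10}
best_by_ndope = pick_best_model(ndope, lower_is_better=True)

# Hypothetical GA341-like scores for the same models.
ga341 = {"model_01": 0.97, "model_02": 0.99, "model_03": 0.80}
best_by_ga341 = pick_best_model(ga341, lower_is_better=False)
```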
2.2.2.5 Assessing the performance of the prediction protocols and ranking the models
To assess the template selection, sequence alignment and model ranking protocols, we used TMscore, a sequence-dependent structural superposition program (Zhang and Skolnick, 2005). TMscore calculates a TM-score (Zhang and Skolnick, 2004) in the range 0 to 1. Protein pairs with a TM-score > 0.5 are generally in the same CATH/SCOP fold group (Xu and Zhang, 2010).
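For reference, the TM-score of a given superposition can be written in a few lines. This sketch evaluates the published formula, TM = (1/L) Σ 1/(1 + (d_i/d0)²) with d0 = 1.24 (L − 15)^(1/3) − 1.8, for a fixed set of Cα distances. It does not perform the superposition search that the TMscore program carries out to maximise the score, and the 0.5 Å floor applied to d0 for very short targets is an assumption of this sketch.

```python
def tm_d0(target_length):
    """Length-dependent distance scale d0 (in angstroms)."""
    if target_length <= 15:
        return 0.5  # assumed floor; the cube-root formula is undefined here
    return max(1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8, 0.5)


def tm_score(distances, target_length):
    """TM-score for one fixed superposition.

    distances: C-alpha distances (angstroms) for the aligned residue pairs.
    target_length: number of residues in the reference (native) structure.
    Unaligned residues contribute nothing to the sum.
    """
    d0 = tm_d0(target_length)
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return total / target_length


perfect = tm_score([0.0] * 100, 100)        # identical structures
half = tm_score([tm_d0(100)] * 100, 100)    # every pair exactly d0 apart
```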
The TMscore program was also used to calculate the GDT-HA score. The GDT-HA score takes into account the number of Cα pairs within distance cutoffs of 0.5 Å, 1 Å, 2 Å and 4 Å after superposition of the two structures (Read and Chavali, 2007; Zemla, 2003; Zemla et al., 1999). GDT-HA is the current official structural comparison score
used in the Critical Assessment of Protein Structure Prediction (CASP) competition (Kryshtafovych et al., 2015).
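The four-cutoff average behind GDT-HA can be illustrated as follows. This is a simplified sketch for a single, fixed superposition: the full procedure searches over superpositions to maximise the fraction of residues under each cutoff, which is not attempted here. Reporting the score on a 0 to 1 scale and treating "within" as less than or equal to the cutoff are assumptions.

```python
def gdt_ha(distances, target_length, cutoffs=(0.5, 1.0, 2.0, 4.0)):
    """GDT-HA for one fixed superposition, on a 0-1 scale.

    distances: C-alpha distances (angstroms) for the aligned residue pairs.
    target_length: number of residues in the reference structure.
    The score is the mean, over the cutoffs, of the fraction of reference
    residues whose aligned C-alpha lies within that cutoff.
    """
    fractions = [
        sum(1 for d in distances if d <= c) / target_length for c in cutoffs
    ]
    return sum(fractions) / len(cutoffs)


# Five residues: 1, 2, 3 and 4 of them fall within the four cutoffs.
example = gdt_ha([0.3, 0.8, 1.5, 3.0, 6.0], target_length=5)
```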
2.2.3 Modelling structurally uncharacterised sequences in human and fly
Once a robust modelling pipeline had been established by the benchmarking, 3D models were generated for human and fly domain sequences classified in the CATH-Gene3D resource (version 12 (Lees et al., 2014)). After removing all sequences that already had a structure in the PDB, there were 97,326 human (Homo sapiens) sequences and 36,761 fly (Drosophila melanogaster) sequences, inclusive of isoforms.
Different template searching methods were combined to model all sequences:
• FunFams
• HHsearch (using the CATH 4.0 HMM library, which includes HMMs for all CATH domains)
• HHsearch (using the SCOP 1.75 HMM library clustered at 70% sequence identity)
• HHsearch (using the SCOPe 2.05 HMM library clustered at 90% sequence identity)
• HHsearch (using the September 2014 release of the PDB HMM library clustered at 70% sequence identity)
After the models had been built, we used the best model quality assessment score obtained in Section 2.3.2 to assess the quality of the models.