Materials and Methods - A domain based protein structural modelling platform applied in the ana

2.2.1 Calculating the structural conservation of structural domains within CATH-Gene3D FunFams and CATH superfamilies

To explore the structural conservation of domains within FunFams, we took each FunFam that contains structural domains and clustered its CATH structural domains at 90% sequence identity (S90 clusters). The structural representative of each S90 cluster was the relative whose length was closest to the average domain length of the cluster. We then carried out pairwise structural comparisons between all the S90 representatives using SSAP (Taylor and Orengo, 1989).
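The representative selection rule described above can be sketched as follows. This is a minimal illustration rather than the pipeline code: the function name, the domain identifiers and the tie-breaking rule (alphabetical order of identifiers) are assumptions introduced here.

```python
from statistics import mean


def pick_representative(domain_lengths):
    """Return the domain ID whose length is closest to the cluster's mean length.

    domain_lengths maps a (hypothetical) domain ID to its length in residues.
    """
    average = mean(domain_lengths.values())
    # Assumption: ties are broken by domain ID so the choice is deterministic.
    return min(domain_lengths, key=lambda d: (abs(domain_lengths[d] - average), d))


# Hypothetical S90 cluster with three members.
cluster = {"dom_a": 120, "dom_b": 135, "dom_c": 150}
representative = pick_representative(cluster)
print(representative)
```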

For comparison with the conservation within FunFams, we also assessed the structural conservation of CATH domains across CATH superfamilies. For each superfamily, we compared representatives of clusters formed at 35% sequence identity (S35 clusters). Each S35 structural representative was selected on the basis of the average length of the domains in its S35 cluster and for having the best X-ray resolution. For each type of analysis (i.e. FunFam or superfamily), we carried out pairwise structural comparisons between all the representatives.

We calculated the mean of the normalised RMSD and SSAP score for the comparisons.

2.2.2 Assessing the model quality assessment methods, query-template alignment and template selection strategy

To assess the protocols for selecting templates, aligning templates and queries, ranking and assessing models, a query dataset was compiled from sequences of known structural domains in CATH (Sillitoe et al., 2015). The clustering program CD-HIT (Fu et al., 2012) was used to cluster CATH domain sequences into clusters of relatives showing 30% or more sequence identity (S30 clusters). A single representative was selected from each cluster to generate a non-redundant dataset.

CHAPTER 2. MODELLING PROTEIN MONOMERS 68

To ensure that the selected proteins had high-quality structures, only proteins solved by X-ray crystallography at a resolution of 2 Å or better were used. At this resolution, structures tend to contain few errors, with only a few incorrect rotamers (Huang, 2007).

This dataset was separated into two subsets based on sequence homology. The close homologue subset consists of query targets that have sequence relatives with at least 30% global sequence identity. By contrast, the remote homologue subset comprises query targets that have sequence relatives with less than 30% global sequence identity. This gave 8,633 close homologue query targets and 602 remote homologue query targets.
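As an illustration of this partition, the sketch below splits query targets on the 30% threshold. It assumes that each query has already been assigned the global sequence identity of its closest sequence relative; that interpretation, the function name and the identifiers are assumptions made for the example.

```python
def split_by_homology(best_identity, threshold=30.0):
    """Split queries into close and remote homologue subsets.

    best_identity maps a (hypothetical) query ID to the percentage global
    sequence identity of its closest relative (an assumed input).
    """
    close = sorted(q for q, ident in best_identity.items() if ident >= threshold)
    remote = sorted(q for q, ident in best_identity.items() if ident < threshold)
    return close, remote


# Hypothetical identities for three query targets.
close, remote = split_by_homology({"q1": 45.2, "q2": 30.0, "q3": 18.7})
print(close, remote)
```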

2.2.2.1 Template selection methods

The default parameters suggested by the authors of the respective methods were used for all the methods.

HHsearch. Query sequences in the benchmark set were scanned against the CATH HHsearch HMM library using HHsearch. The library is composed of HMMs built for every domain in CATH version 4.0. The HMM library was built according to the protocol provided in the HHsearch guidebook: for each CATH target sequence, we performed two iterations of HHsearch searches against the UniProt20 database (provided by HHsearch, clustered at 20% sequence identity), with an E-value threshold of 0.001.

For each target sequence, we selected the structural template with the highest probability of being a true positive, as reported by the program, which considers both the sequence and the secondary structure alignment. Selected structural templates had to have a probability score above the default threshold of 20 and an E-value below 0.001. The original structures of the query sequences were excluded as templates.
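The selection criteria above can be expressed as a short filter. The sketch assumes that HHsearch hits have already been parsed into simple records; the Hit class, its field names and the identifiers are hypothetical and do not correspond to any HH-suite API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    template_id: str    # hypothetical template identifier
    probability: float  # HHsearch probability, on a 0-100 scale
    evalue: float


def select_template(hits: List[Hit], query_id: str,
                    min_probability: float = 20.0,
                    max_evalue: float = 1e-3) -> Optional[Hit]:
    """Return the highest-probability hit passing both thresholds, or None."""
    candidates = [
        h for h in hits
        if h.template_id != query_id        # exclude the query's own structure
        and h.probability > min_probability
        and h.evalue < max_evalue
    ]
    return max(candidates, key=lambda h: h.probability, default=None)


hits = [
    Hit("query_dom", 100.0, 1e-40),  # the query itself, excluded
    Hit("template_1", 95.0, 1e-10),
    Hit("template_2", 15.0, 0.5),    # fails both thresholds
]
best = select_template(hits, "query_dom")
print(best.template_id if best else "no template")
```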

FunFam. Query sequences in the benchmark set were scanned against the HMMs


selected. If the matched FunFam did not contain any structural templates, the next closest FunFam was checked for a structural template. In every case, the query sequence had to match the FunFam with an E-value below 0.001.

The original structures of the query sequences were excluded as templates. The best templates were selected based on their average BLOSUM62 sequence similarity to the query targets. Templates with an X-ray resolution below 5 Å were prioritised.
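One way to read this ranking is sketched below. It assumes that the average BLOSUM62 similarity of each candidate template to the query has already been computed, and that "prioritised" means that templates with a resolution below 5 Å are ranked ahead of all others; both points, and the dictionary keys used, are assumptions for illustration.

```python
def rank_templates(templates, query_id, max_resolution=5.0):
    """Rank candidate templates for one query.

    Each template is a dict with hypothetical keys:
      "id", "similarity" (average BLOSUM62 similarity to the query),
      "resolution" (X-ray resolution in angstroms).
    """
    eligible = [t for t in templates if t["id"] != query_id]
    # Templates below the resolution cutoff sort first (False < True),
    # then higher similarity sorts first within each group.
    return sorted(
        eligible,
        key=lambda t: (t["resolution"] >= max_resolution, -t["similarity"]),
    )


candidates = [
    {"id": "t1", "similarity": 0.9, "resolution": 6.0},
    {"id": "t2", "similarity": 0.7, "resolution": 2.1},
    {"id": "t3", "similarity": 0.8, "resolution": 1.8},
    {"id": "query_dom", "similarity": 1.0, "resolution": 1.5},
]
print([t["id"] for t in rank_templates(candidates, "query_dom")])
```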

2.2.2.2 Comparing different model quality assessment methods

A variety of model quality assessment methods were tested. These methods were used to select the best model from the ten models built by MODELLER. nDOPE (i.e. normalised DOPE) (Shen and Sali, 2006), GOAP (Zhou and Skolnick, 2011) and BACH (Sarti et al., 2013) are single-model quality assessment methods based on statistical potentials. GA341, by contrast, combines statistical potentials with information on target-template sequence identity and structural compactness (Melo et al., 2002). ProQ2 (Ray et al., 2012) is a machine learning method based on evolutionary information, an MSA and structural features. ModFOLDclust2 (McGuffin and Roche, 2010) is a clustering-based model quality assessment method.

To assess the performance of the model assessment methods, the best model selected by the different methods was superposed on the original protein structure, and the quality of the model was given by the structural similarity score.
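Selecting the best of the ten models by a given assessment score can be sketched as below. The direction of each score matters: lower (more negative) values are better for DOPE-type energies, whereas higher values are better for scores such as GA341. The function and model names are hypothetical.

```python
def select_best_model(scores, lower_is_better):
    """Return the name of the model with the best assessment score.

    scores maps a (hypothetical) model name to its score from one method.
    """
    chooser = min if lower_is_better else max
    return chooser(scores, key=scores.get)


# Energy-like score: the most negative value wins.
energy_scores = {"model_01": -0.82, "model_02": -1.15, "model_03": -0.97}
print(select_best_model(energy_scores, lower_is_better=True))

# Quality-like score on a 0 to 1 scale: the highest value wins.
quality_scores = {"model_01": 0.71, "model_02": 0.93, "model_03": 0.88}
print(select_best_model(quality_scores, lower_is_better=False))
```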

2.2.2.3 Comparing different query-template alignment methods

MAFFT alignment. We employed the MAFFT MSA method (Katoh and Standley, 2013) to align the query and template identified by the different template searching algorithms.

For HHsearch, we realigned the sequences of the best HHsearch match using MAFFT (L-INS-i mode). Next, the query sequence was added to the alignment using MAFFT (using the mafft --add option). Subsequently, for both BLAST and HHsearch, the alignment of query and template was submitted to MODELLER.

For the FunFam method, query sequences were added to the pre-built alignment of the matched FunFam using MAFFT, and the alignment of query and template was submitted to MODELLER. The original FunFam alignment was produced using MAFFT (L-INS-i mode).
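The two MAFFT invocations mentioned above can be assembled as command lines in the following sketch. In MAFFT, L-INS-i corresponds to the --localpair option with iterative refinement (--maxiterate 1000), and --add inserts new sequences into an existing alignment; MAFFT writes the result to standard output. The file names are hypothetical, any further options used in the actual pipeline are not known from the text, and the commands are only printed here rather than executed.

```python
import shlex


def mafft_linsi_command(fasta_path):
    """Command line for an L-INS-i alignment of the sequences in fasta_path."""
    return ["mafft", "--localpair", "--maxiterate", "1000", fasta_path]


def mafft_add_command(query_fasta, existing_alignment):
    """Command line for adding query sequences to an existing alignment."""
    return ["mafft", "--add", query_fasta, existing_alignment]


# Hypothetical file names; redirect stdout to a file when running for real,
# e.g. with subprocess.run(cmd, stdout=handle, check=True).
print(shlex.join(mafft_linsi_command("template_family.fasta")))
print(shlex.join(mafft_add_command("query.fasta", "template_family.aln")))
```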

HHsearch alignment. For HHsearch, we obtained the query-template sequence alignment provided by the HHsearch program. For the FunFam method, after identifying the FunFam to which the query belongs and selecting the best structural templates, we obtained an HHsearch HMM profile for the structural template to generate an alignment. Then, we aligned the query sequence with the template HMM profile using the HHsearch alignment methods and extracted the query-template alignment.

2.2.2.4 Comparative modelling

The comparative modelling software MODELLER (version 9.15) was used to build ten models for each query target and each template selection method. The best model was selected using the best model quality assessment score obtained in Section 2.3.2.

2.2.2.5 Assessing the performance of the prediction protocols and ranking the models

To assess the template selection, sequence alignment and model ranking protocols, we used TMscore, a sequence-dependent structural superposition program (Zhang and Skolnick, 2005). TMscore calculates a TM-score (Zhang and Skolnick, 2004) in the range of 0 to 1. Protein pairs with a TM-score > 0.5 are generally in the same CATH/SCOP fold group (Xu and Zhang, 2010).
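For reference, the TM-score of Zhang and Skolnick (2004) sums 1 / (1 + (d_i / d0)^2) over aligned residue pairs and divides by the target length, with d0 = 1.24 (L - 15)^(1/3) - 1.8. The sketch below evaluates this expression for a single, fixed superposition. The TMscore program additionally searches for the superposition that maximises the score, which is not reproduced here, and this sketch simply rejects short chains, for which the d0 expression is not meaningful. The function names are hypothetical.

```python
def d0_for_length(target_length):
    """Length-dependent distance scale d0 (in angstroms)."""
    if target_length <= 21:
        # Simplification of this sketch: short chains are not handled.
        raise ValueError("this sketch only handles targets longer than 21 residues")
    return 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8


def tm_score(distances, target_length):
    """TM-score for one fixed superposition.

    distances holds the C-alpha distances (angstroms) of the aligned residue
    pairs; unaligned target residues contribute nothing to the sum.
    """
    d0 = d0_for_length(target_length)
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return total / target_length


# A perfect model: every residue aligned at zero distance.
print(tm_score([0.0] * 100, 100))
```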

The TMscore program was also used to calculate the GDT-HA score. The GDT-HA score takes into account the number of Cα pairs within distance cutoffs of 0.5 Å, 1 Å, 2 Å and 4 Å after superposition of the two structures (Read and Chavali, 2007; Zemla, 2003; Zemla et al., 1999). GDT-HA is the current official structural comparison score used in the Critical Assessment of Protein Structure Prediction (CASP) competition (Kryshtafovych et al., 2015).
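The GDT-HA definition can be illustrated with the sketch below, which averages the fraction of target residues whose Cα atoms lie within each of the four cutoffs. It assumes a single, fixed superposition and returns a value between 0 and 1; the full GDT procedure searches for the best superposition at each cutoff, and results are often reported on a 0 to 100 scale. The function name is hypothetical.

```python
def gdt_ha(distances, target_length, cutoffs=(0.5, 1.0, 2.0, 4.0)):
    """GDT-HA for one fixed superposition, on a 0 to 1 scale.

    distances holds the C-alpha distances (angstroms) of the aligned residue
    pairs; unaligned target residues count as outside every cutoff.
    """
    fractions = [
        sum(1 for d in distances if d <= cutoff) / target_length
        for cutoff in cutoffs
    ]
    return sum(fractions) / len(cutoffs)


# Four residues at increasing distance from their native positions.
print(gdt_ha([0.3, 0.8, 1.5, 3.0], target_length=4))
```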

2.2.3 Modelling structurally uncharacterised sequences in human and fly

Once a robust modelling pipeline had been established by the benchmarking, 3D models were generated for human and fly domain sequences classified in the CATH-Gene3D resource (version 12 (Lees et al., 2014)). After removing all sequences that already had a structure in the PDB, there were 97,326 human (Homo sapiens) sequences and 36,761 fly (Drosophila melanogaster) sequences inclusive of isoforms.

Different template searching methods were combined to model all sequences:

• FunFams

• HHsearch (using the CATH 4.0 HMM library, which includes HMMs for all CATH domains)

• HHsearch (using the SCOP 1.75 HMM library clustered at 70% sequence identity)

• HHsearch (using the SCOPe 2.05 HMM library clustered at 90% sequence identity)

• HHsearch (using the September 2014 release of PDB HMM library clustered at 70% sequence identity)

After the models had been built, we used the best model quality assessment score obtained in Section 2.3.2 to assess the quality of the models.

In document A domain based protein structural modelling platform applied in the analysis of alternative splicing (Page 67-71)