Database searching - A domain based protein structural modelling platform applied in the analys

1.4.1 Sequence based

Generally, sequence-based methods search for sequence matches in databases such as UniProt (UniProt, 2017) or the PDB (Berman et al., 2000). BLAST (Altschul et al., 1990) is a very fast tool for scanning large sequence repositories. BLAST splits the query sequence into short fragments, or words (with a default length of 3). These words are then searched against the words of every sequence in the chosen database. A substitution matrix (e.g. BLOSUM or PAM) is used to score the matches. Each word match is then extended in both directions, allowing for gaps, to create the largest possible segment pair. All segment pairs are scored, assigned a significance value and ranked. Sequence identity and overlap between the query and the matched segments are typically used as criteria in a BLAST search (Altschul et al., 1990).
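The seed-and-extend idea described above can be illustrated with a minimal Python sketch. It is not the BLAST implementation: it assumes exact word matches only, a toy match/mismatch score in place of BLOSUM or PAM, and ungapped extension with a simple drop-off cut-off. The function names (`index_words`, `extend`, `search`) and the parameter `x_drop` are hypothetical and chosen for illustration.

```python
from collections import defaultdict

WORD_SIZE = 3  # word length quoted in the text


def score(a, b, match=5, mismatch=-4):
    """Toy substitution score (assumption); a real search uses BLOSUM or PAM."""
    return match if a == b else mismatch


def index_words(seq, k=WORD_SIZE):
    """Map every k-letter word in seq to the positions where it starts."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i)
    return table


def _extend_one_way(query, subject, q, s, step, x_drop):
    """Walk in one direction (step is +1 or -1) without gaps.

    Returns the best score gain and the number of residues that achieve it.
    Stops once the running score drops more than x_drop below the best."""
    best = run = 0
    best_len = length = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        run += score(query[q], subject[s])
        length += 1
        if run > best:
            best, best_len = run, length
        elif best - run > x_drop:
            break
        q += step
        s += step
    return best, best_len


def extend(query, subject, qi, si, k=WORD_SIZE, x_drop=10):
    """Extend a word hit in both directions into a segment pair."""
    seed = sum(score(query[qi + d], subject[si + d]) for d in range(k))
    right, r_len = _extend_one_way(query, subject, qi + k, si + k, +1, x_drop)
    left, l_len = _extend_one_way(query, subject, qi - 1, si - 1, -1, x_drop)
    return seed + left + right, qi - l_len, si - l_len, l_len + k + r_len


def search(query, subject, k=WORD_SIZE):
    """Return segment pairs as (score, query start, subject start, length)."""
    words = index_words(subject, k)
    pairs = set()
    for qi in range(len(query) - k + 1):
        for si in words.get(query[qi:qi + k], []):
            pairs.add(extend(query, subject, qi, si, k))
    return sorted(pairs, reverse=True)


hits = search("MKTAYIAKQR", "GGMKTAYLAKQRGG")
```

In this toy example every word hit lies on the same diagonal, so all seeds extend to one segment pair that spans the whole query despite the single I/L mismatch.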

1.4.2 Profile based

Sequence profiles capture the pattern of residue types at each position of an MSA of evolutionarily related sequences and can strengthen the signal available for sequence searching. These approaches can extend the sequence search into the "twilight zone" (sequence identity below 30%) and the "midnight zone" (sequence identity below 20%).

Evolutionary information from families of homologous proteins can be captured in position-specific scoring matrices (PSSMs). For example, PSI-BLAST (Altschul et al., 1997) uses a PSSM to score matches between query and database sequences and is about three times more sensitive than BLAST. PSI-BLAST first employs BLAST to find close relatives of the query sequence, from which an MSA is built. A PSSM is then generated from the MSA and used to find new relatives. A new MSA and PSSM are then built from the enlarged set of relatives. The cycle continues until no new relatives are found within a specified statistical threshold.
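A minimal sketch of how a PSSM might be derived from an MSA is given below. It is illustrative only and does not reproduce PSI-BLAST's procedure: it assumes uniform background frequencies, a flat pseudocount and no sequence weighting. The names `build_pssm` and `score_sequence` are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(msa, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column.

    Assumptions: uniform background frequency (1/20), a flat pseudocount
    added to every residue, and gap characters simply ignored."""
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*msa):
        counts = Counter(c for c in column if c in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_sequence(pssm, seq):
    """Score an ungapped sequence of the same length as the profile."""
    return sum(column[aa] for column, aa in zip(pssm, seq))


msa = ["MKTA", "MKSA", "MRTA"]
pssm = build_pssm(msa)
family_like = score_sequence(pssm, "MKTA")
unrelated = score_sequence(pssm, "WWWW")
```

Residues seen often in a column receive positive scores and unseen residues receive negative scores, so a sequence resembling the family scores higher than an unrelated one. In an iterative search, sequences scoring above a threshold would be added to the MSA and the matrix rebuilt.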

CHAPTER 1. INTRODUCTION 37

Hidden Markov models (HMMs) are more advanced forms of sequence profiles. HMMs were first proposed for use in bioinformatics by Gary Churchill of Cold Spring Harbor Laboratory. The first HMM-based program was SAM (Krogh et al., 1994). The distinguishing feature of HMMs is their ability to also capture the insertions and deletions found in MSAs. HMMs implement a statistical framework based on emission probabilities and on transition probabilities for moving from state to state along an MSA. Each alignment position is modelled by three states: a match state (models the distribution of residues allowed), an insert state (insertion of one or more residues) and a delete state (deletion of one or more residues). Traversing this probabilistic network 'emits' a residue at each match or insert state according to that state's residue distribution.

Figure 1.4: HMM model showing the transition probabilities between the different states
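To make the three states concrete, the following Python sketch labels each sequence of a toy MSA with match (M), insert (I) and delete (D) states and estimates transition probabilities from the resulting paths. It is a simplification under stated assumptions: columns with at most 50% gaps are treated as match columns, and transitions are pooled over all positions, whereas a profile HMM keeps position-specific probabilities. The function names are hypothetical.

```python
from collections import Counter


def state_paths(msa, gap="-", max_gap_fraction=0.5):
    """Label each aligned sequence with a string of M, I and D states.

    Columns in which at most max_gap_fraction of the sequences have a gap
    are treated as match columns (heuristic assumed here). In a match
    column a residue is M and a gap is D; residues in the remaining
    columns are I, and gaps in those columns are ignored."""
    n = len(msa)
    is_match = [sum(c == gap for c in col) / n <= max_gap_fraction
                for col in zip(*msa)]
    paths = []
    for seq in msa:
        path = []
        for ch, match_col in zip(seq, is_match):
            if match_col:
                path.append("D" if ch == gap else "M")
            elif ch != gap:
                path.append("I")
        paths.append("".join(path))
    return paths


def transition_probabilities(paths):
    """Estimate P(next state | current state), pooled over all positions."""
    counts = Counter()
    for path in paths:
        counts.update(zip(path, path[1:]))
    totals = Counter()
    for (src, _dst), c in counts.items():
        totals[src] += c
    return {(src, dst): c / totals[src] for (src, dst), c in counts.items()}


msa = ["MK-TA", "MKQTA", "M--TA", "MK-TA"]
paths = state_paths(msa)
probs = transition_probabilities(paths)
```

Here the third column is an insert column (only one sequence has a residue there), and the gap in the second column of the third sequence is a delete state.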

The HMMER (Eddy, 2011) software suite provides programs to create and manipulate profile HMMs, build databases of profile HMMs and perform sensitive searches of sequence and profile HMM databases. HMMER is widely used for building HMMs, particularly by protein family databases such as Pfam (Finn et al., 2016), SUPERFAMILY (Wilhelm et al., 2014) and Gene3D (Lam et al., 2016).

In addition, pairs of sequence profiles can be compared to measure the similarity between two MSAs. COMPASS (Sadreyev and Grishin, 2003) generates PSSMs from two input MSAs and produces an optimal PSSM-PSSM alignment. Profile comparer (PRC) (Madera, 2008) and HHsearch (Remmert et al., 2012) support HMM-HMM comparison and alignment.

HHsearch provides a suite of programs for HMM-based sequence searching and sequence alignment. HHsearch has been shown to perform better than HMMER and PSI-BLAST in sensitivity, alignment quality and speed (Remmert et al., 2012). HHsearch includes predicted secondary structure information in the HMM profile. Moreover, HHsearch employs a maximum accuracy (MAC) alignment algorithm, inspired by the MAC algorithm introduced by Holmes and Durbin (1998). The algorithm maximizes the expected number of correctly aligned residue pairs, using the posterior probability that two match states are aligned between the two HMMs. HHsearch has been employed by some of the state-of-the-art 3D structural modelling servers such as Robetta (Kim et al., 2014; Ovchinnikov et al., 2017), BioSerf (Buchan et al., 2013), SWISS-MODEL (Biasini et al., 2014), nns (Joo et al., 2016) and MULTICOM (Li et al., 2015) for both template searching and sequence alignment.
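The maximum accuracy idea can be sketched as a small dynamic programme. The example below is not HHsearch's implementation: it assumes the posterior probabilities are already available as a matrix, aligns globally, and applies no gap cost or probability threshold. The name `mac_align` is hypothetical.

```python
def mac_align(posterior):
    """Find the alignment that maximizes the summed posterior probability.

    posterior[i][j] is the (precomputed, assumed) probability that
    position i of the first profile aligns with position j of the second.
    Returns (expected number of correct pairs, list of aligned (i, j))."""
    n, m = len(posterior), len(posterior[0])
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(best[i - 1][j - 1] + posterior[i - 1][j - 1],
                             best[i - 1][j],   # skip position i
                             best[i][j - 1])   # skip position j
    # Trace back, preferring skips so that only pairs which strictly
    # improve the total are reported.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if best[i][j] == best[i - 1][j]:
            i -= 1
        elif best[i][j] == best[i][j - 1]:
            j -= 1
        else:
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
    pairs.reverse()
    return best[n][m], pairs


posterior = [
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 0.2, 0.7, 0.0],
    [0.0, 0.0, 0.1, 0.8],
]
expected_correct, aligned = mac_align(posterior)
```

In this toy matrix the second position of the first profile is paired with the third position of the second profile (posterior 0.7) rather than with the second (posterior 0.2), leaving one column unaligned.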

1.4.3 Assessing the significance of the results obtained from these programs

When inferring homology between sequences, it is important to know whether a sequence match constitutes evidence of homology and how likely it is to have arisen by chance. Measures such as the score, bit score and E-value are usually used to assess the quality of an alignment (match). A score, S, is a numerical value that describes the overall quality of an alignment and depends on the substitution matrix and gap penalties used. Typically, the higher the score, the higher the similarity and the better the alignment. The bit score is a normalised score that allows alignments to be compared irrespective of the substitution matrix or gap penalties used. It also indicates how large a search space would have to be before a match with that score is expected by chance. The bit score is calculated by:

$$\text{Bit score} = \frac{\lambda S - \ln K}{\ln 2}$$


where S is the overall alignment score and λ and K are statistical parameters that reflect the substitution matrix and gap penalties used. The E-value represents the number of hits one can "expect" to see by chance when searching a database. Hence, the E-value depends on the size of the database and the length of the query. The closer the E-value is to 0, the more significant the match. E-values are calculated using the following formula:

$$E = K m n \, e^{-\lambda S} \tag{1.5}$$

where m is the length of the query sequence, n is the total number of residues in the database, S is the overall alignment score, and K and λ are parameters dependent on the substitution matrix and the gap penalties.
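As a worked illustration, the short Python sketch below evaluates both quantities. It follows the usual Karlin-Altschul convention in which the bit score is expressed in base 2, S' = (λS − ln K) / ln 2, so that E = mn·2^(−S'), which is algebraically the same as Equation 1.5. The values of λ, K, the raw score, the query length and the database size are illustrative assumptions, not values taken from this work.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalised score in bits: (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, lam, k, query_len, db_len):
    """Expected number of chance hits: E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def e_value_from_bits(bits, query_len, db_len):
    """Equivalent form using the bit score: E = m * n * 2 ** (-S')."""
    return query_len * db_len * 2.0 ** (-bits)


# Illustrative parameter values (assumptions for this example only).
lam, k = 0.267, 0.041
m, n = 250, 100_000_000

for raw in (60, 100):
    bits = bit_score(raw, lam, k)
    e = e_value(raw, lam, k, m, n)
    print(f"S = {raw}: bit score = {bits:.1f}, E-value = {e:.2e}")
```

Because the E-value decays exponentially with the score but grows only linearly with m and n, a modest increase in score outweighs a large increase in database size.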
