3.2.1. Sequence analysis
The function and domain content of an uncharacterised protein sequence are predicted
by comparing its sequence with those of homologous proteins.
Nucleic acid and protein sequences that have evolved from a common ancestor are said
to be homologous. Sequences that are similar (possessing a degree of residue
similarity), but not homologous (possessing a common evolutionary origin) are said to
be convergent. This is fairly common in evolution at the macroscopic level (Creighton,
1993). In the process of sequence analysis, the initial step is to locate and retrieve
similar sequences from an appropriate sequence database. Various search engines have
been designed that will carry out this process quickly and efficiently. Similarity between
sequences is found by aligning two or more sequences for comparison, and scoring the
residues that are identical. The similarity expected between two unrelated sequences is
6% (Creighton, 1993) when all the amino acids occur at their normal frequency. Using
this alignment technique, sequences that are totally unrelated can give
an identity of ~20%, even though at this level sequences can be homologous (Tasman,
1989; Creighton, 1993). The lower limit to the technique is in the region of 20% to 30%
identity and is referred to as the ‘Twilight Zone’, where the alignments are no longer
statistically significant (Doolittle, 1986; Taylor, 1988; Taylor, 1989). Comparisons are
made using sequence alignment algorithms that score residue differences between
equivalent positions in two sequences. The most commonly used scoring schemes have
been derived by examining the substitution frequencies observed in sequence
matrices (Henikoff & Henikoff, 1992). Scoring schemes based upon the nucleotide base
changes required to interconvert the codons for the two residues, or the physicochemical
properties of amino acids have also been used (Barton, 1996). The identification of
homologues can be performed using algorithms such as FASTA (Pearson & Lipman,
1988) and BLAST (Altschul et al., 1990) which scan sequence databases for matches
to a target sequence. Alternatively, databases such as Entrez store existing sequences
with links to their close homologues. However, it may not be possible to find
homologous sequences because the residues that are essential for the domain fold only
represent a small fraction of the sequence. In such circumstances, short sequence motifs
that are characteristic of a protein family or superfamily can be invaluable for predicting
homologues. Known sequence motifs are compiled in the Prosite database (Bairoch,
1991).
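The identity scoring described above can be illustrated with a minimal sketch. It assumes two sequences that have already been aligned (gaps written as '-') and counts identical residues over the columns in which neither sequence has a gap; that choice of denominator is an assumption made here, since several conventions are in use. The function names and the default 20% and 30% bounds (taken from the range quoted above) are illustrative only.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of identical residues over columns where neither sequence is gapped.

    Assumption: gapped columns are excluded from the denominator.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap or res_b == gap:
            continue  # skip columns containing a gap
        compared += 1
        if res_a == res_b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def identity_zone(pid: float, lower: float = 20.0, upper: float = 30.0) -> str:
    """Classify a percent identity relative to the 20% to 30% 'Twilight Zone'."""
    if pid > upper:
        return "above twilight zone"
    if pid >= lower:
        return "twilight zone"
    return "below twilight zone"
```

An identity falling inside the twilight zone would not, on its own, be taken as evidence for or against homology; other information such as sequence motifs would be needed.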
The target sequence is aligned with its hom ologues for several purposes. For
homology modelling the sequence must be correctly aligned against the homologous
(template) structure. Multiple sequence alignments can be used for secondary structure
and tertiary structure predictions if the structure has not been solved for any of its
homologues. If homology with a known structure is ambiguous, multiple sequence
alignments may be used to assess the functionality of the target protein. Multiple
sequence alignments are conveniently performed using MULTAL (Taylor, 1988) or
CLUSTALW (Thompson et al., 1994). However, multiple sequence alignments are
generally refined manually afterwards, especially if the locations of the secondary
structure elements and the solvent accessible regions have been determined from a
homologous structure. MULTAL aligns sequences and alignments using a clustering
method. Each sequence is first aligned to all other sequences to obtain all the pairwise
alignments. Then, starting from the pairwise alignment with the highest degree of
relatedness, the multiple alignment is constructed by adding to it, in order of decreasing
relatedness, those pairwise alignments which have a sequence that overlaps with one of
its ‘free ends’. If a pairwise alignment cannot be linked to the multiple alignment
because it does not contain an overlapping sequence, it is used to start a further
alignment. Subsequent pairwise alignments are compared to the ‘free ends’ of all
fused together. This alignment generally produces an ordered list of sequences, but a
relatedness cutoff can be used to prevent subfamilies being linked together. MULTAL
is highly interactive and enables the user to alter many of the parameters that control the
clustering and the final alignment stages in order to generate an acceptable alignment.
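The ordering step of this clustering procedure can be sketched as follows. This is not MULTAL's implementation: it is a simplified illustration that works only on sequence names and pairwise relatedness scores, builds ordered lists ("chains") by attaching pairs at free ends, and performs no residue-level alignment. The function name, the score dictionary and the rule of skipping pairs that touch only the interior of a chain are assumptions made for this sketch.

```python
def chain_by_relatedness(pair_scores, cutoff=0.0):
    """Order sequences by linking pairwise alignments at the free ends of chains.

    pair_scores: dict mapping (name_a, name_b) -> relatedness (higher = more related).
    cutoff: pairs scoring below this value are ignored, so subfamilies stay separate.
    Returns a list of chains, each an ordered list of sequence names.
    """
    ranked = sorted(pair_scores.items(), key=lambda item: item[1], reverse=True)
    chains = []

    def find(name):
        for chain in chains:
            if name in chain:
                return chain
        return None

    def at_free_end(chain, name):
        return chain[0] == name or chain[-1] == name

    def attach(chain, end_name, new_name):
        if chain[-1] == end_name:
            chain.append(new_name)
        else:
            chain.insert(0, new_name)

    for (a, b), score in ranked:
        if score < cutoff:
            break  # all remaining pairs are less related than the cutoff
        chain_a, chain_b = find(a), find(b)
        if chain_a is None and chain_b is None:
            chains.append([a, b])  # no overlap: start a further chain
        elif chain_a is not None and chain_b is None:
            if at_free_end(chain_a, a):
                attach(chain_a, a, b)
        elif chain_a is None and chain_b is not None:
            if at_free_end(chain_b, b):
                attach(chain_b, b, a)
        elif (chain_a is not chain_b
              and at_free_end(chain_a, a) and at_free_end(chain_b, b)):
            # Fuse two chains so that a and b become neighbours.
            if chain_a[0] == a:
                chain_a.reverse()
            if chain_b[-1] == b:
                chain_b.reverse()
            chain_a.extend(chain_b)
            chains[:] = [c for c in chains if c is not chain_b]
    return chains
```

With no cutoff the sketch tends towards a single ordered list; raising the cutoff leaves weakly related groups as separate chains, which mirrors the use of a relatedness cutoff to keep subfamilies apart.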
CLUSTALW also starts by aligning each sequence to all others to obtain pairwise
alignments. These are used to calculate a distance matrix giving the divergence of each
pair of sequences and the matrix is used to calculate ‘guide trees’ which describe the
evolution of the sequences. The ‘trees’ are then used to determine the progressive
alignment of the sequences: starting from the sequences with the highest degree of
relatedness at the tips of the ‘tree’ the sequences are aligned in order of decreasing
similarity to the roots of the ‘tree’. At the progressive alignment stage, the choice of
BLOSUM matrix is varied depending upon the divergence between sequences and gaps
in the alignment are favoured more strongly in regions abundant in hydrophilic residues,
which are predicted to correspond to loops. A significant advantage of CLUSTALW is
that it creates an alignment without the requirement for significant user intervention.
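
The step from a distance matrix to a guide tree, and from the tree to an order of alignment, can be sketched as below. For brevity the sketch swaps in simple average-linkage (UPGMA-style) clustering, whereas CLUSTALW itself builds its guide tree by neighbour joining; the function names and the representation of distances as a dictionary keyed by frozenset pairs are assumptions made for illustration. No residue-level alignment is performed; only the tree and the order in which groups would be merged are returned.

```python
from itertools import combinations


def average_distance(members_a, members_b, dist):
    """Mean pairwise distance between the leaves of two clusters."""
    total = sum(dist[frozenset((x, y))] for x in members_a for y in members_b)
    return total / (len(members_a) * len(members_b))


def build_guide_tree(names, dist):
    """Average-linkage clustering of sequences into a nested-tuple guide tree.

    names: list of sequence names.
    dist:  dict mapping frozenset({a, b}) -> divergence between a and b.
    Returns (tree, merge_order); merge_order lists the pairs of subtrees in the
    order they are joined, i.e. most closely related groups first.
    """
    # Each cluster is (subtree, list of leaf names in that subtree).
    clusters = [(name, [name]) for name in names]
    merge_order = []
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_distance(
                clusters[ij[0]][1], clusters[ij[1]][1], dist
            ),
        )
        (tree_i, members_i), (tree_j, members_j) = clusters[i], clusters[j]
        merge_order.append((tree_i, tree_j))
        merged = ((tree_i, tree_j), members_i + members_j)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0], merge_order
```

A progressive aligner would walk through the merge order from first to last, aligning sequence to sequence, then sequence to alignment or alignment to alignment, as each pair of subtrees is joined.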