3.2.1. Sequence analysis

The function and domain content of an uncharacterised protein are predicted by comparing its sequence with those of homologous proteins. Nucleic acid and protein sequences that have evolved from a common ancestor are said to be homologous. Sequences that are similar (possessing a degree of residue similarity) but not homologous (possessing a common evolutionary origin) are said to be convergent. This is fairly common in evolution at the macroscopic level (Creighton, 1993). In sequence analysis, the initial step is to locate and retrieve similar sequences from an appropriate sequence database. Various search engines have been designed that carry out this process quickly and efficiently. Similarity between sequences is found by aligning two or more sequences for comparison and scoring the residues that are identical. The identity expected between two unrelated sequences is 6% (Creighton, 1993) when all the amino acids occur at their normal frequencies. Using this alignment technique, sequences that are totally unrelated can give an identity of approximately 20%, even though sequences at this level can be homologous (Tasman, 1989; Creighton, 1993). The lower limit of the technique is in the region of 20% to 30% identity and is referred to as the ‘Twilight Zone’, where the alignments are no longer statistically significant (Doolittle, 1986; Taylor, 1988; Taylor, 1989). Comparisons are made using sequence alignment algorithms that score residue differences between equivalent positions in two sequences. The most commonly used scoring schemes have been derived by examining the substitution frequencies observed in sequence matrices (Henikoff & Henikoff, 1992). Scoring schemes based upon the nucleotide base changes required to interconvert the codons for the two residues, or upon the physicochemical properties of amino acids, have also been used (Barton, 1996). Homologues can be identified using algorithms such as FASTA (Pearson & Lipman, 1988) and BLAST (Altschul et al., 1990), which scan sequence databases for matches to a target sequence. Alternatively, databases such as Entrez store existing sequences with links to their close homologues. However, it may not be possible to find homologous sequences because the residues that are essential for the domain fold represent only a small fraction of the sequence. In such circumstances, short sequence motifs that are characteristic of a protein family or superfamily can be invaluable for predicting homologues. Known sequence motifs are compiled in the Prosite database (Bairoch, 1991).
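The identity scoring described above can be illustrated with a minimal Python sketch. The function names `percent_identity` and `expected_random_identity` are hypothetical and do not come from any of the cited programs. The sketch assumes that gaps are written as `-` and that identity is counted over columns in which neither sequence has a gap, which is only one of several conventions in use. The second function gives the chance that two independently drawn residues match, the sum of the squared residue frequencies. For 20 equally frequent amino acids this is 5%, and any unequal set of frequencies gives a higher value, in line with the figure of about 6% quoted for normal frequencies.

```python
from typing import Mapping


def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of gap-free alignment columns that hold identical residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Assumption: columns containing a gap in either sequence are ignored.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


def expected_random_identity(frequencies: Mapping[str, float]) -> float:
    """Expected percent identity of two unrelated sequences: sum of p_i squared."""
    return 100.0 * sum(p * p for p in frequencies.values())


if __name__ == "__main__":
    # Toy alignment, invented for illustration only.
    print(percent_identity("ACDE-F", "ACDQGF"))
    # Uniform frequencies over the 20 standard amino acids.
    uniform = {residue: 1.0 / 20 for residue in "ACDEFGHIKLMNPQRSTVWY"}
    print(expected_random_identity(uniform))
```

Under these assumptions the toy alignment has four identical residues in five gap-free columns. Whether a given identity is meaningful still depends on alignment length and statistics, as the discussion of the ‘Twilight Zone’ indicates.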

The target sequence is aligned with its homologues for several purposes. For homology modelling, the sequence must be correctly aligned against the homologous (template) structure. Multiple sequence alignments can be used for secondary structure and tertiary structure predictions if the structure has not been solved for any of its homologues. If homology with a known structure is ambiguous, multiple sequence alignments may be used to assess the functionality of the target protein. Multiple sequence alignments are conveniently performed using MULTAL (Taylor, 1988) or CLUSTALW (Thompson et al., 1994). However, they are generally refined manually afterwards, especially if the locations of the secondary structure elements and the solvent accessible regions have been determined from a homologous structure. MULTAL aligns sequences and alignments using a clustering method. Each sequence is first aligned to all other sequences to obtain all the pairwise alignments. Then, starting from the pairwise alignment with the highest degree of relatedness, the multiple alignment is constructed by adding to it, in order of decreasing relatedness, those pairwise alignments which have a sequence that overlaps with one of its ‘free ends’. If a pairwise alignment cannot be linked to the multiple alignment because it does not contain an overlapping sequence, it is used to start a further alignment. Subsequent pairwise alignments are compared to the ‘free ends’ of all [...] fused together. This alignment generally produces an ordered list of sequences, but a relatedness cutoff can be used to prevent subfamilies being linked together. MULTAL is highly interactive and enables the user to alter many of the parameters that control the clustering and the final alignment stages in order to generate an acceptable alignment.
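The linking of pairwise alignments by their ‘free ends’ can be sketched as follows. This is not MULTAL itself: the function `chain_by_relatedness` and its helpers are hypothetical, the relatedness scores are invented, and the sketch orders sequence names only, without merging any alignments. It also assumes that two growing chains are fused when a pair joins a free end of each, which is one reading of the description above rather than a documented rule.

```python
def _find_chain(chains, name):
    """Return the chain (ordered list) that contains name, or None."""
    for chain in chains:
        if name in chain:
            return chain
    return None


def _attach(chain, anchor, new):
    """Add new next to anchor, but only if anchor is a free end of the chain."""
    if chain[-1] == anchor:
        chain.append(new)
    elif chain[0] == anchor:
        chain.insert(0, new)


def _fuse(chain_a, a, chain_b, b):
    """Join two chains end to end if a and b are both free ends, else None."""
    if chain_a[-1] == a:
        left = chain_a
    elif chain_a[0] == a:
        left = chain_a[::-1]
    else:
        return None
    if chain_b[0] == b:
        right = chain_b
    elif chain_b[-1] == b:
        right = chain_b[::-1]
    else:
        return None
    return left + right


def chain_by_relatedness(pair_scores, cutoff=0.0):
    """Order sequences by linking pairs in order of decreasing relatedness."""
    chains = []
    ranked = sorted(pair_scores.items(), key=lambda item: item[1], reverse=True)
    for (a, b), score in ranked:
        if score < cutoff:
            break  # remaining pairs are too distantly related to link
        chain_a, chain_b = _find_chain(chains, a), _find_chain(chains, b)
        if chain_a is None and chain_b is None:
            chains.append([a, b])  # start a further chain
        elif chain_b is None:
            _attach(chain_a, a, b)
        elif chain_a is None:
            _attach(chain_b, b, a)
        elif chain_a is not chain_b:
            fused = _fuse(chain_a, a, chain_b, b)
            if fused is not None:
                chains = [c for c in chains
                          if c is not chain_a and c is not chain_b]
                chains.append(fused)
    return chains


if __name__ == "__main__":
    # Hypothetical relatedness scores (higher means more closely related).
    scores = {("A", "B"): 0.9, ("B", "C"): 0.8, ("D", "E"): 0.7,
              ("C", "D"): 0.6, ("A", "E"): 0.2}
    print(chain_by_relatedness(scores))
    print(chain_by_relatedness(scores, cutoff=0.65))
```

With no cutoff the five hypothetical sequences end up in a single ordered chain. With a cutoff of 0.65 the link between C and D is rejected and two separate chains remain, which mirrors the use of a relatedness cutoff to keep subfamilies apart.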

CLUSTALW also starts by aligning each sequence to all others to obtain pairwise alignments. These are used to calculate a distance matrix giving the divergence of each pair of sequences, and the matrix is used to calculate ‘guide trees’ which describe the evolution of the sequences. The ‘trees’ are then used to determine the progressive alignment of the sequences: starting from the sequences with the highest degree of relatedness at the tips of the ‘tree’, the sequences are aligned in order of decreasing similarity towards the roots of the ‘tree’. At the progressive alignment stage, the choice of BLOSUM matrix is varied depending upon the divergence between sequences, and gaps in the alignment are favoured more strongly in regions abundant in hydrophilic residues, which are predicted to correspond to loops. A significant advantage of CLUSTALW is that it creates an alignment without the requirement for significant user intervention.
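The step from a distance matrix to a ‘guide tree’ and an alignment order can be sketched as below. This is an illustration under stated assumptions, not the CLUSTALW implementation: CLUSTALW builds its tree by neighbour-joining, whereas this sketch substitutes simple average-linkage (UPGMA-style) clustering for brevity. The function name `build_guide_tree` and the divergence values are hypothetical, and the progressive alignment itself (matrix choice, gap penalties) is not attempted. Only the order in which groups would be aligned is produced.

```python
from itertools import combinations


def build_guide_tree(distances):
    """Build a nested-tuple tree from {(name_a, name_b): divergence}.

    Returns (tree, merge_order), where merge_order lists the joins from the
    most closely related pair at the tips towards the root.
    Assumption: a divergence is supplied for every pair of sequences.
    """
    leaf_dist = {frozenset(pair): d for pair, d in distances.items()}
    names = sorted({name for pair in distances for name in pair})
    # Each cluster is (subtree, list of leaf names below it).
    clusters = [(name, [name]) for name in names]

    def average_distance(leaves_x, leaves_y):
        total = sum(leaf_dist[frozenset((x, y))]
                    for x in leaves_x for y in leaves_y)
        return total / (len(leaves_x) * len(leaves_y))

    merge_order = []
    while len(clusters) > 1:
        # Pick the two clusters with the smallest average divergence.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: average_distance(clusters[ij[0]][1],
                                                   clusters[ij[1]][1]))
        (tree_i, leaves_i), (tree_j, leaves_j) = clusters[i], clusters[j]
        merge_order.append((tree_i, tree_j))
        merged = ((tree_i, tree_j), leaves_i + leaves_j)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0], merge_order


if __name__ == "__main__":
    # Hypothetical pairwise divergences (0 = identical, 1 = unrelated).
    divergences = {("A", "B"): 0.1, ("A", "C"): 0.5, ("B", "C"): 0.6,
                   ("A", "D"): 0.8, ("B", "D"): 0.9, ("C", "D"): 0.7}
    tree, order = build_guide_tree(divergences)
    print(tree)
    for step in order:
        print("align:", step)
```

In this toy case A and B are joined first, C is then added to that pair, and D is added last, so the alignment would proceed from the most similar sequences towards the root as described above.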