
4 PART II: ASSEMBLING DATA FOR ANALYSIS

4.5 Building Sequence Alignments

4.5.4 CLUSTALW

About CLUSTALW

ClustalW is a widely used system for aligning any number of homologous nucleotide or protein sequences. For multi-sequence alignments, ClustalW uses progressive alignment methods: the most similar sequences, that is, those with the best alignment score, are aligned first, and progressively more distant groups of sequences are then aligned until a global alignment is obtained. This heuristic approach is necessary because finding the globally optimal solution is prohibitive in both memory and time requirements. ClustalW performs very well in practice.

The algorithm starts by computing a rough distance matrix between each pair of sequences based on pair-wise sequence alignment scores. These scores are computed using the pair-wise alignment parameters for DNA and protein sequences. Next, the algorithm uses the neighbor-joining method with midpoint rooting to create a guide tree, which is used to generate a global alignment. The guide tree serves as a rough template for clades that tend to share insertion and deletion features. This generally provides a close-to-optimal result, especially when the data set contains sequences with varied degrees of divergence, so that the guide tree is less sensitive to noise.
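The first stage described above (pair-wise scores turned into a distance matrix, then the closest sequences aligned first) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not ClustalW's implementation: it uses a plain Needleman-Wunsch alignment with a linear gap cost and made-up match, mismatch and gap values, defines distance as one minus the fraction of identical alignment columns, and replaces neighbor-joining with simply picking the closest pair. The names nw_identity, distance_matrix and closest_pair are hypothetical.

```python
from itertools import combinations

def nw_identity(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align a and b (Needleman-Wunsch, linear gap cost) and
    return the fraction of alignment columns that are identical."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for a[i-1] versus b[j-1] (illustrative values).
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back, counting identical columns and total columns.
    i, j, same, cols = n, m, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            same += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in b
        else:
            j -= 1  # gap in a
        cols += 1
    return same / cols if cols else 1.0

def distance_matrix(seqs):
    """Map each pair of sequence names to 1 - identity."""
    return {(x, y): 1.0 - nw_identity(seqs[x], seqs[y])
            for x, y in combinations(list(seqs), 2)}

def closest_pair(dist):
    """The pair a progressive method would align first."""
    return min(dist, key=dist.get)

# Toy data, invented for this example.
seqs = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "TTGTCCGA"}
dist = distance_matrix(seqs)
first = closest_pair(dist)
print(first, dist[first])
```

In a full progressive aligner, the distance matrix would feed a tree-building step and the resulting guide tree would set the order in which sequences and groups are merged; the closest-pair shortcut here only shows where that order comes from.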

See:

Thompson J. D., Higgins D. G., Gibson T. J. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673-4680 (1994).

CLUSTALW Options (DNA)

This dialog box displays a single tab containing a set of organized parameters that ClustalW uses to align DNA sequences. If you are aligning protein-coding sequences, please note that CLUSTALW will not respect the codon positions and may insert alignment gaps within codons. For aligning cDNA or sequence data containing codons, we recommend that you align the translated protein sequences (see Aligning coding sequences via protein sequences).

In this dialog box, you will see the following options:

Parameters for Pair-wise Sequence Alignment

Gap Opening Penalty: The penalty for opening a gap in the alignment. Increasing this value makes the gaps less frequent.

Gap Extension Penalty: The penalty for extending a gap by one residue. Increasing this value will make the gaps shorter. Terminal gaps are not penalized.

Parameters for Multiple Sequence Alignment

Gap Opening Penalty: The penalty for opening a gap in the alignment. Increasing this value makes the gaps less frequent.

Gap Extension Penalty: The penalty for extending a gap by one residue. Increasing this value will make the gaps shorter. Terminal gaps are not penalized.
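The interaction of the two gap penalties is easiest to see with a small calculation. The sketch below assumes a simple affine scheme in which a gap costs the opening penalty plus the extension penalty times the gap length, and in which terminal gaps cost nothing. ClustalW adjusts its penalties further by position and by sequence, so the exact formula, the function name gap_cost, and the values 10 and 0.5 are illustrative assumptions only.

```python
def gap_cost(length, gap_open, gap_extend, terminal=False):
    """Penalty for a single gap of `length` residues.

    Assumed affine model: gap_open + gap_extend * length.
    Terminal gaps are not penalized.
    """
    if length <= 0 or terminal:
        return 0.0
    return gap_open + gap_extend * length

# Hypothetical settings, chosen only to make the arithmetic easy.
GAP_OPEN, GAP_EXTEND = 10.0, 0.5

one_long_gap = gap_cost(4, GAP_OPEN, GAP_EXTEND)        # one gap of 4 residues
two_short_gaps = 2 * gap_cost(2, GAP_OPEN, GAP_EXTEND)  # two gaps of 2 residues
end_gap = gap_cost(3, GAP_OPEN, GAP_EXTEND, terminal=True)

print(one_long_gap, two_short_gaps, end_gap)
```

Under these assumptions, four gapped residues cost 12 when placed in one gap and 22 when split into two, which is why raising the opening penalty makes gaps less frequent while raising the extension penalty makes them shorter.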

Common Parameters

DNA Weight Matrix: The scores assigned to matches and mismatches (including IUB ambiguity codes).

Transition Weight: Gives transitions a weight between 0 and 1. A weight of zero means that the transitions are scored as mismatches, while a weight of 1 gives the transitions the match score. For distantly-related DNA sequences, the weight should be near zero; for closely-related sequences, it can be useful to assign a higher score.

Use Negative Matrix: When enabled, negative weight matrix values will be used if they are found; otherwise, the matrix will be automatically adjusted to all positive values.

Delay Divergent Cutoff (%): Delays the alignment of the most distantly-related sequences until after the most closely-related sequences have been aligned. The setting shows the percent identity level required to delay the addition of a sequence. Sequences that are less identical than this level will be aligned later.

Keep Predefined Gaps: When checked, alignment positions in which ANY of the sequences have a gap will be ignored.
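The Transition Weight setting described above can be pictured as a scoring rule for a pair of nucleotides. In the sketch below, a weight of 0 scores a transition as a mismatch and a weight of 1 scores it as a match, as stated above; the linear interpolation for intermediate weights, the match and mismatch values, and the function names are assumptions made for illustration, and ambiguity codes are ignored.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(x, y):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    pair = {x, y}
    return x != y and (pair <= PURINES or pair <= PYRIMIDINES)

def pair_score(x, y, transition_weight, match=1.0, mismatch=0.0):
    """Score one aligned nucleotide pair.

    transition_weight = 0 -> transitions score as mismatches
    transition_weight = 1 -> transitions score as matches
    Values in between are interpolated linearly (an assumption).
    """
    if x == y:
        return match
    if is_transition(x, y):
        return mismatch + transition_weight * (match - mismatch)
    return mismatch  # transversion

for w in (0.0, 0.5, 1.0):
    print(w, pair_score("A", "G", w), pair_score("A", "C", w))
```

Transversions such as A to C keep the mismatch score regardless of the weight; only transitions such as A to G or C to T are affected.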

NOTE: All definitions are derived from the CLUSTALW manual.
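The Delay Divergent Cutoff can be thought of as a filter applied before sequences are added to the alignment. The sketch below assumes that each sequence is summarized by its highest percent identity to any other sequence and that a sequence is delayed when that value falls below the cutoff. That summary, the function name, and the sample values are hypothetical; consult the CLUSTALW manual for the exact rule.

```python
def split_by_divergence(best_identity, cutoff_percent):
    """Split sequence names into (align_now, delayed).

    best_identity maps a sequence name to its highest percent identity
    with any other sequence (an assumed summary statistic).
    """
    align_now = [name for name, pid in best_identity.items()
                 if pid >= cutoff_percent]
    delayed = [name for name, pid in best_identity.items()
               if pid < cutoff_percent]
    return align_now, delayed

# Invented percent identities, for illustration only.
best_identity = {"human": 92.0, "mouse": 92.0, "yeast": 24.5}
align_now, delayed = split_by_divergence(best_identity, cutoff_percent=30)
print(align_now, delayed)
```

Raising the cutoff moves more sequences into the delayed group, so they are added only after the closely related sequences have been aligned.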

CLUSTALW Options (Protein)

This dialog box displays a single tab containing a set of organized parameters that ClustalW uses to align protein sequences. If you are aligning protein-coding sequences, please note that CLUSTALW will not respect the codon positions and may insert alignment gaps within codons. For aligning cDNA or sequence data containing codons, we recommend that you align the translated protein sequences (see Aligning coding sequences via protein sequences).

In this dialog box, you will see the following options:

Parameters for Pair-wise Sequence Alignment

Gap Opening Penalty: The penalty for opening a gap in the alignment. Increasing this value makes the gaps less frequent.

Gap Extension Penalty: The penalty for extending a gap by one residue. Increasing this value will make the gaps shorter. Terminal gaps are not penalized.

Parameters for Multiple Sequence Alignment

Gap Opening Penalty: The penalty for opening a gap in the alignment. Increasing this value makes the gaps less frequent.

Gap Extension Penalty: The penalty for extending a gap by one residue. Increasing this value will make the gaps shorter. Terminal gaps are not penalized.

Common Parameters

Protein Weight Matrix: The scores assigned to matches and mismatches between amino acid residues.

Residue-specific Penalties: Amino-acid-specific gap penalties that reduce or increase the gap opening penalties at each position in the alignment or sequence. For example, positions that are rich in glycine are more likely to have an adjacent gap than positions that are rich in valine. See the documentation for details.

Hydrophilic Penalties: Used to increase the chances of a gap within a run (5 or more residues) of hydrophilic amino acids; these are likely to be loop or random coil regions in which gaps are more common.

Gap Separation Distance: Tries to decrease the chances of gaps being too close to each other. Gaps that are less than this distance apart are penalized more than other gaps. This does not prevent close gaps; it makes them less frequent, promoting a block-like appearance of the alignment.

Use Negative Matrix: When enabled, negative weight matrix values will be used if they are found; otherwise, the matrix will be automatically adjusted to all positive values.

Delay Divergent Cutoff (%): Delays the alignment of the most distantly-related sequences until after the alignment of the most closely-related sequences. The setting shows the percent identity level required to delay the addition of a sequence; sequences that are less identical than this level will be aligned later.

Keep Predefined Gaps: When checked, any alignment positions in which ANY of the sequences have a gap will be ignored.
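The Hydrophilic Penalties option above depends on finding runs of five or more hydrophilic residues. The following sketch locates such runs in a single protein sequence. The residue set GPSNDQEKR is assumed here to match the CLUSTALW default and should be checked against the manual. The function name and the sample sequence are invented, and the code only finds the runs; it does not apply any penalty adjustment.

```python
import re

# Assumed default set of hydrophilic residues; verify against the manual.
HYDROPHILIC = "GPSNDQEKR"

def hydrophilic_runs(seq, min_len=5, residues=HYDROPHILIC):
    """Return (start, end) index pairs of runs of at least `min_len`
    consecutive hydrophilic residues; `end` is exclusive."""
    pattern = re.compile("[%s]{%d,}" % (re.escape(residues), min_len))
    return [(m.start(), m.end()) for m in pattern.finditer(seq.upper())]

# Invented sequence: residues 5..11 (GSNDQEK) form one hydrophilic run.
runs = hydrophilic_runs("MVLLAGSNDQEKVVVI")
print(runs)
```

Positions inside the reported runs are the ones where a gap would be made more likely, since such stretches tend to be loop or random coil regions.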

NOTE: All definitions are derived from the CLUSTALW manual.