Table 1.2 Summary of some advantages and disadvantages of direct cDNA selection (Lovett, 1994)

1.5.10 Similarity searches using sequence databases

Similarity searching is the process of comparing a new sequence against all other known sequences, then attempting to infer the function of the new sequence by assessing the matches and their biological annotations as described in the databases themselves and the literature.

There are a number of important issues in searching DNA and protein sequence databases (Altschul et al., 1994), but the most important is access to a comprehensive and up-to-date data repository. GenBank, distributed by the National Center for Biotechnology Information (NCBI), the EMBL nucleotide sequence database (36%), maintained by the European Bioinformatics Institute (EBI), and the DNA Data Bank of Japan (DDBJ) (4%) are three partners in a long-standing collaboration to collect and distribute all publicly available sequence data. The three sites exchange new sequence data and updates over the Internet on a daily basis and make this information available to the public by a variety of means, including e-mail, anonymous ftp and the World Wide Web (Harper, 1994), giving a comprehensive, up-to-date database. Timely access to complete and "nonredundant" sequence databases has become relatively simple and inexpensive.

Table 1.3 Experimental steps necessary to analyse a YAC for transcribed sequences

Method: Exon trapping
• subclone > arrayed cosmid library
• subclone pools of cosmids into splice vector
• transfect COS cells
• extract mRNA
• reverse transcribe/PCR
• clone PCR products in plasmid > array clones
Initial collection of gene candidates: redundant collection of small products
Follow-up analysis:
• reduce redundancy by screening random collection with individual members to exclude multiple isolates
• sequence
• GRAIL analysis
• design oligo probe from exon
• screen several cDNA libraries from different tissues/times to validate exons and to link exons through a common cDNA

Method: cDNA selection
• purify
• immobilise (filter/beads)
• block repeats, yeast
• hybridise with cDNA library inserts
• wash/elute
• PCR
• clone PCR products in plasmid > array clones
Initial collection of gene candidates: redundant collection of small PCR products
Follow-up analysis:
• reduce redundancy by screening random collection with individual members to exclude multiple isolates
• check chromosomal location to exclude non-specific isolates
• screen cDNA libraries from specific tissue/time to link partial overlapping clones

Method: Direct cDNA screening
• subclone > arrayed cosmid library
• deplete cDNA probe
• hybridise replica filters of arrayed cosmids
• Southern blot positive cosmids
• hybridise with depleted cDNA probe
• subclone individual positive fragments in plasmids
Initial collection of gene candidates: non-redundant collection of genomic fragments carrying transcribed sequences
Follow-up analysis:
• screen cDNA library from specific tissue/time

Listed are the major experimental steps involved in three different methods from starting with a YAC to arriving at a cDNA clone from a gene encoded by

Method: Exon trapping
Structural bias: genes with [illegible] exons; variation with position of exons relative to restriction sites used for cloning
Expression bias (level): no
Expression bias (tissue and time): ? tissue-specific splicing
Problems: redundant collection of initial candidates; validation of exons

Method: cDNA selection
Structural bias: no
Expression bias (level): variable, dependent upon number of clones picked; sensitivity [illegible]
Expression bias (tissue and time): limited to tissue and time in cDNA library
Problems: redundant collection of initial candidates; redundant isolation of genes expressed in more than one tissue/time; completion

Method: Direct cDNA screening
Structural bias: no
Expression bias (level): limited to sensitivity of hybridisation reaction ([illegible])
Expression bias (tissue and time): limited to tissue and time used in experiment
Problems: deriving sequence

Method: GRAIL
Structural bias: ?
Expression bias (level): no
Expression bias (tissue and time): no
Problems: validation; requires extensive genomic sequence

Strong biases exist in protein and nucleic acid sequences and sequence databases. Many of these reflect fundamental mosaic sequence properties that are of considerable biological interest in themselves, such as segments of low compositional complexity or short-period repeats. Databases also contain some very large families of related domains, motifs or repeated sequences, in some cases with hundreds of members. These biases commonly confound database search methods and interfere with the discovery of interesting new sequence similarities. Problems may include misleading, spuriously high scores and ambiguities in the phase of sequence alignments, and can result in chance similarities being claimed as significant, or biologically important relationships being overlooked. Large improvements in the efficiency of searching databases have been achieved using new strategies, such as preprocessing a query sequence to identify known domains and motifs, dispersed repeats, low-complexity segments and other regions of compositional bias such as potential membrane-spanning and α-helical coiled-coil regions. A complementary strategy is to reduce the redundancy in the target database(s) to be searched (Altschul et al., 1994).
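The idea of preprocessing a query to hide low-complexity segments can be illustrated with a short Python sketch. It is a simplified stand-in based on the Shannon entropy of a sliding window, not the filter used by any particular program or server; the function names, the window size, the entropy threshold and the mask character are all assumptions chosen for the example.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Shannon entropy (bits per residue) of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq: str, window: int = 12,
                        min_entropy: float = 2.2, mask_char: str = "X") -> str:
    """Replace every residue covered by a low-entropy window with mask_char.

    window and min_entropy are illustrative values, not tuned parameters.
    """
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)


# A serine-rich run is hidden before searching; a compositionally
# diverse sequence is left unchanged.
query = "MKTAYIAKQR" + "S" * 14 + "DLGHWEVNCF"
print(mask_low_complexity(query))
```

Windows that only partly overlap the biased run may also fall below the threshold, so the masked region can extend a few residues into the flanking sequence.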

The computer program used to search the sequence database is also important. A number of different search algorithms have been developed over the years, and further information about them may be found in Altschul et al. (1994). The BLAST family of programs ("Basic Local Alignment Search Tool") is a set of similarity search programs designed to explore all of the available sequence databases regardless of whether the query is protein or DNA. The BLAST programs have been designed for speed, with a minimal sacrifice of sensitivity to distant sequence relationships. The scores assigned in a BLAST search have a well-defined statistical interpretation, making real matches easier to distinguish from random background hits. BLAST therefore offers a good combination of speed, sensitivity, flexibility and statistical rigor. It uses a heuristic algorithm which seeks local as opposed to global alignments and is therefore able to detect relationships among sequences which share only isolated regions of similarity (Altschul et al., 1990).

A recent paper by Altschul et al. (1997) describes modifications to the algorithms in the BLAST programs that save the user time and provide a far greater return by allowing the production of gapped alignments (Gapped BLAST) and motif searching (Position-Specific Iterated BLAST, PSI-BLAST) within the BLAST system. Both tools are provided by the BLAST server (version 2.0, http://www.ncbi.nlm.nih.gov/BLAST/blast) (Altschul et al., 1997). The Gapped BLAST algorithm allows gaps (deletions and insertions) to be introduced into the alignments that are returned. Allowing gaps means that similar regions are not broken into several segments, and the scoring of these gapped alignments tends to reflect biological relationships more closely. PSI-BLAST provides an automated, easy-to-use version of a "profile" search, which is a sensitive way to look for sequence homologues. The program first performs a gapped BLAST database search. It then uses the information from any significant alignments returned to construct a position-specific score matrix, which replaces the query sequence for the next round of database searching. PSI-BLAST may be iterated until no new significant alignments are found. At this time PSI-BLAST may be used only for comparing protein queries with protein databases.
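The notion of a position-specific score matrix can be made concrete with a small Python sketch. It assumes a set of ungapped, equal-length aligned segments, a uniform background frequency and a fixed pseudocount; these are simplifying assumptions made for illustration and do not reproduce the weighting and pseudocount schemes of PSI-BLAST itself. The function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Return one dict per alignment column mapping residue -> log-odds score (bits).

    Assumes ungapped sequences of equal length and a uniform background.
    """
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    total = len(aligned) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        counts = Counter(s[col] for s in aligned)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_segment(pssm, segment):
    """Score a segment of the same length as the matrix, column by column."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


# Residues seen often in a column score positively; unseen residues score negatively.
pssm = build_pssm(["ACDW", "ACEW", "ACDW"])
print(score_segment(pssm, "ACDW") > score_segment(pssm, "WWWA"))  # True
```

In an iterated search, a matrix of this kind would take the place of the query in the next round, so that conserved positions contribute more to the score than variable ones.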

There are five implementations of BLAST, three designed for nucleotide sequence queries (BLASTN, BLASTX, TBLASTX) and two for protein sequence queries (BLASTP, TBLASTN). The former are used for the analysis of genomic sequence (including putative exons) and "single-pass" cDNA data; the latter usually come into play when one has identified discrete gene products from complete sequence. BLASTN compares a nucleotide query sequence against a nucleotide sequence database, BLASTX compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database, and TBLASTX compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Of the protein sequence query programs, BLASTP compares an amino acid query sequence against a protein sequence database and TBLASTN compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands).
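The six-frame conceptual translation used by BLASTX, TBLASTN and TBLASTX can be sketched in Python as follows. The sketch assumes the standard genetic code and an uppercase or lowercase DNA query containing only A, C, G and T (any other codon is translated as X); the function names and frame labels are illustrative and are not taken from the BLAST programs.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons from position 0; '*' marks a stop codon."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna: str) -> dict:
    """Return the three forward (+1..+3) and three reverse (-1..-3) translations."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


print(six_frame_translations("ATGGCCTAA"))
# {'+1': 'MA*', '+2': 'WP', '+3': 'GL', '-1': 'LGH', '-2': '*A', '-3': 'RP'}
```

Each of the six products would then be compared with the protein database as if it were a separate protein query.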

The fundamental unit of the BLAST algorithm output is the High-scoring Segment Pair (HSP). An HSP consists of two sequence fragments of arbitrary but equal length, one from the query and one from the database sequence, whose alignment is locally maximal and for which the alignment score meets or exceeds a threshold or cut-off score. A set of HSPs is thus defined by two sequences, a scoring system and a cut-off score; this set may be empty if the cut-off score is sufficiently high. In addition, a Maximal-scoring Segment Pair (MSP) has been defined. It is defined by two sequences and a scoring system, and is the highest-scoring of all possible segment pairs that can be produced from the two sequences.
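The definition of the MSP can be illustrated by an exhaustive search in Python. The sketch below scans every diagonal of the two sequences and keeps the best-scoring ungapped run; it is a direct, slow reading of the definition and not the heuristic that BLAST uses. The match and mismatch scores (+5 and -4) and the function name are assumptions made for the example.

```python
def maximal_segment_pair(a: str, b: str, match: int = 5, mismatch: int = -4):
    """Return (score, start in a, start in b, length) of the best ungapped segment pair.

    Every diagonal is scanned once, restarting the running segment whenever
    its score drops to zero or below. Returns (0, 0, 0, 0) if nothing scores
    above zero.
    """
    best = (0, 0, 0, 0)
    for diag in range(-(len(b) - 1), len(a)):
        i, j = (diag, 0) if diag >= 0 else (0, -diag)
        run_score, run_start = 0, (i, j)
        while i < len(a) and j < len(b):
            if run_score <= 0:
                run_score, run_start = 0, (i, j)
            run_score += match if a[i] == b[j] else mismatch
            if run_score > best[0]:
                best = (run_score, run_start[0], run_start[1], i - run_start[0] + 1)
            i += 1
            j += 1
    return best


print(maximal_segment_pair("TTTACGTACGTTT", "CCACGTACGCC"))  # (35, 3, 2, 7)
```

In the example the two sequences share the seven-residue segment ACGTACG, which scores 7 × 5 = 35. A set of HSPs would be obtained in the same spirit by keeping every locally maximal segment pair whose score meets the cut-off, rather than only the single best one.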

The task of finding HSPs begins with identifying short words of length W in the query sequence that either match or satisfy some positive-valued threshold score T when aligned with a word of the same length in a database sequence. T is referred to as the neighborhood word score threshold (Altschul et al., 1990). These initial neighborhood word hits act as seeds for initiating searches to find longer HSPs containing them. The word hits are extended in both directions along each sequence for as far as the cumulative alignment score can be increased. Extension of a word hit in each direction is halted when the cumulative alignment score falls off by the quantity X from its maximum achieved value, when the cumulative score goes to zero or below due to the accumulation of one or more negative-scoring residue alignments, or when the end of either sequence is reached. The sensitivity and speed of the program can be adjusted via the standard BLAST algorithm parameters W, T and X (Altschul et al., 1990); selectivity of the programs can be adjusted via the cut-off score.
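The roles of W, T, X and the cut-off score can be sketched in Python as a toy seed-and-extend search. The sketch assumes a simple match/mismatch scoring system (+5 and -4) on a DNA alphabet rather than a substitution matrix, and enumerates neighborhood words by brute force; the function names and default parameter values are hypothetical and the code is not the BLAST implementation.

```python
from itertools import product


def pair_score(x: str, y: str, match: int = 5, mismatch: int = -4) -> int:
    return match if x == y else mismatch


def neighborhood_words(word: str, alphabet: str, T: int) -> set:
    """All words of len(word) that score at least T when aligned with word."""
    return {"".join(w) for w in product(alphabet, repeat=len(word))
            if sum(pair_score(x, y) for x, y in zip(word, w)) >= T}


def extend(query, subject, qi, si, step, X, base):
    """Ungapped extension from (qi, si) in direction step (+1 or -1).

    Stops when the score drops X below its maximum, when the cumulative
    score (base + extension) reaches zero or below, or at a sequence end.
    Returns (best score gained, length of the best extension).
    """
    score = best = 0
    length = best_len = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        score += pair_score(query[qi], subject[si])
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score >= X or base + score <= 0:
            break
        qi += step
        si += step
    return best, best_len


def find_hsps(query, subject, W=3, T=15, X=10, cutoff=25, alphabet="ACGT"):
    """Return sorted (score, query start, subject start, length) tuples >= cutoff."""
    index = {}
    for s in range(len(subject) - W + 1):
        index.setdefault(subject[s:s + W], []).append(s)
    hsps = set()
    for q in range(len(query) - W + 1):
        qword = query[q:q + W]
        for word in neighborhood_words(qword, alphabet, T):
            seed = sum(pair_score(x, y) for x, y in zip(qword, word))
            for s in index.get(word, []):
                right, rlen = extend(query, subject, q + W, s + W, +1, X, seed)
                left, llen = extend(query, subject, q - 1, s - 1, -1, X, seed + right)
                total = seed + left + right
                if total >= cutoff:
                    hsps.add((total, q - llen, s - llen, llen + W + rlen))
    return sorted(hsps, reverse=True)


print(find_hsps("TTTACGTACGTTT", "CCACGTACGCC"))  # [(35, 3, 2, 7)]
```

With these scores a word of length 3 must match exactly to reach T = 15, whereas T = 6 would also admit words with a single mismatch, giving more seeds and a slower but more sensitive search. Raising X lets an extension survive a longer run of mismatches, and raising the cut-off reports fewer, stronger HSPs.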
