
1.6.4 The Human Genome Mapping Project and Sequence Database Searching.

The Human Genome Mapping Project (HGMP) is a world-wide collaboration with the ultimate aim of sequencing the entire human genome. If DNA sequence is available, computational analysis can be used to locate protein coding regions rapidly and inexpensively. Finding genes within the approximately 3 x 10^9 bp of the haploid genome will involve many computational techniques, and will incorporate the use of database “tags”, some of which are described below.

1.6.4.1 Expressed Sequence Tags (ESTs).

Sequence tagged sites (STSs) are randomly derived DNA segments of known sequence, up to 500 bp in length, that have in many cases been physically mapped to human chromosomes. Information on STSs is available in databases, allowing them to function as landmarks in genomic mapping (Olson et al 1989).

ESTs, as the name suggests, represent expressed sequences and have an analogous role in providing an expression map of the genome. Mapping using 3’UTR ESTs was first proposed in 1991 (Wilcox et al 1991), due to the ability of ESTs to act as genomic markers, but with the advantage of supplying genes for the map. The two main reasons to use the 3’UTR of mRNA sequences are the rarity of introns within these regions (thus making them comparable to genomic STSs), and the fact that conservation of this region is not strong, so it is easier to distinguish between individual members of gene families. Sequence from the 5’ and 3’ ends of each oligo-dT primed EST has been submitted to databases for approximately 447,000 ESTs, generated from a large variety of tissue sources. Of these, around 20,000 have been locally mapped to a chromosomal region using radiation hybrids (RH), yeast artificial chromosomes (YACs), or human mapping panels. The chromosome location mapping has an estimated accuracy of 99%. The accuracy of mapping to the correct subchromosomal location is slightly lower, at 95% (Schuler et al 1996). An RH map is concurrently being constructed that will also contain genetic markers, anonymous STSs and CpG island markers. EST databases, such as GenBank’s dbEST (Boguski et al 1993), will eventually permit any isolated cDNA to be mapped within the human genome. Of great importance in current positional cloning experiments is the ability to retrieve ESTs that map within the genomic region of interest.

1.6.4.2 Computational Techniques.

Discovering new genes, and their functions, can be aided not only by gene finding software, but also by searches in key databases and by computational programmes for finding particular sites relevant to gene expression, such as promoters and splice sites (Internet access addresses for the following programmes are listed in Fickett J. 1996). GRAIL (gene recognition and analysis internet link) (Uberbacher & Mural 1991) is a gene identification programme that relies on recognising the regularities in protein coding regions. Particular DNA sequences that make up individual codons in mature mRNA are found far more often in coding sequence than in sequences that are not eventually translated. Coding region detection algorithms within GRAIL use these regularities to calculate the likely presence or absence of a gene. The coding detection measurements are:

1) Frame bias matrix; tabulation of each codon frequency to detect “codon bias”.
2) Fickett; an algorithm that considers several properties of coding sequences.
3) Dinucleotide fractal dimension; examines dinucleotide occurrence and compares it to that of known coding and non-coding regions.
4) Coding 6-tuple word preferences; examines the frequency of nucleotide “words” of a given length (six bases) and compares it to the frequency of the same “words” in coding and non-coding regions.
5) Coding 6-tuple in-frame preferences; as before, but compares 6-tuple words in frame.
6) Word commonality; examines how commonly the “words” occur, contrasting the uncommon exon words with the common intron words.
7) Repetitive 6-tuple word preferences; compares the “words” to several classes of repetitive DNA and is thus a negative control.
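The 6-tuple word preference measurements (4 and 5 above) can be illustrated with a minimal sketch. This is not GRAIL's own code: the log-odds scoring, the pseudocount, the function names and the tiny training sequences are all assumptions chosen for illustration. A window is scored by how much more often its six-base words occur in known coding sequence than in known non-coding sequence; stepping by three bases gives the in-frame variant.

```python
from collections import Counter
from math import log

N_WORDS = 4 ** 6  # number of possible 6-tuples over A, C, G, T


def count_words(seqs, step=1):
    """Count 6-tuples in the training sequences.

    step=1 counts every overlapping word (measurement 4);
    step=3 counts only words starting at codon position 1 (measurement 5).
    """
    counts = Counter()
    for seq in seqs:
        for i in range(0, len(seq) - 5, step):
            counts[seq[i:i + 6]] += 1
    return counts


def word_preference(window, coding, noncoding, step=1, pseudo=1.0):
    """Mean log-odds (coding vs non-coding) of the 6-tuples in a window.

    The pseudocount is an illustrative assumption; it keeps unseen words
    from producing a zero probability.
    """
    total_c = sum(coding.values()) + pseudo * N_WORDS
    total_n = sum(noncoding.values()) + pseudo * N_WORDS
    scores = []
    for i in range(0, len(window) - 5, step):
        word = window[i:i + 6]
        p_c = (coding[word] + pseudo) / total_c
        p_n = (noncoding[word] + pseudo) / total_n
        scores.append(log(p_c / p_n))
    return sum(scores) / len(scores)


# Hypothetical toy training data, far too small for real use.
coding_train = ["ATGGCCAAGGCCGAGCTGGCCAAGGAG"]
noncoding_train = ["TTTTATATATTTTAAATATATTTTTAA"]

coding_counts = count_words(coding_train)
noncoding_counts = count_words(noncoding_train)

print(word_preference("GCCAAGGCCGAG", coding_counts, noncoding_counts))
print(word_preference("TTTTATATATTT", coding_counts, noncoding_counts))
```

A positive score indicates that the window's words are more typical of the coding training set, a negative score that they are more typical of the non-coding set.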

A neural network integrates the output from the seven sensor algorithms to form a single number called a discriminant. The discriminant is calculated for successive subsequences over the length of the whole sequence in question, and is displayed in graphical form. The ideal output from the network is a value of 1 for a position in a coding region, and a value of 0 for a position in a non-coding region. An accuracy of between 80% and 90% over sequence lengths greater than 100 bp has been calculated (Fickett J. 1996, Hochgeschwender U. 1992). Coding bias measurements are species specific, and GRAIL is therefore a programme specific to human sequence.
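The idea of combining sensor outputs into one discriminant, evaluated over successive subsequences, can be sketched as follows. This is a deliberately simplified stand-in and not GRAIL's network: a single logistic unit replaces the neural network, only two toy sensors are used instead of seven, and the weights, bias, window width and step are hypothetical values chosen for illustration.

```python
from math import exp

STOP_CODONS = ("TAA", "TAG", "TGA")


def gc_fraction(window):
    """Toy sensor: fraction of G and C bases in the window."""
    return sum(base in "GC" for base in window) / len(window)


def stop_codon_free(window):
    """Toy sensor: 1.0 if frame 0 of the window has no stop codon, else 0.0."""
    codons = (window[i:i + 3] for i in range(0, len(window) - 2, 3))
    return 0.0 if any(codon in STOP_CODONS for codon in codons) else 1.0


def discriminant(window, sensors, weights, bias):
    """Combine sensor outputs into one value between 0 and 1.

    A single logistic unit is assumed here in place of a trained network.
    """
    z = bias + sum(w * sensor(window) for w, sensor in zip(weights, sensors))
    return 1.0 / (1.0 + exp(-z))


def scan(seq, sensors, weights, bias, width=99, step=10):
    """Return (start, discriminant) for successive windows along the sequence."""
    return [
        (start, discriminant(seq[start:start + width], sensors, weights, bias))
        for start in range(0, len(seq) - width + 1, step)
    ]


# Hypothetical parameters; a real system would learn these from training data.
SENSORS = [gc_fraction, stop_codon_free]
WEIGHTS = [4.0, 2.0]
BIAS = -3.0

for start, value in scan("ATGGCC" * 20, SENSORS, WEIGHTS, BIAS):
    print(start, round(value, 3))
```

Plotting the second element of each pair against the first gives the kind of graphical profile described above, with values near 1 suggesting coding sequence and values near 0 suggesting non-coding sequence.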

Searching for a known gene homologue or functional motif is a simple method of identifying genes within a DNA sequence. BLAST (basic local alignment search tool) is one such homology database search tool that, as its name suggests, finds local alignments; it calculates a probability value (P-value) for each alignment between the DNA sequence under investigation and sequences within a chosen database (Altschul et al 1990). The BLASTN programme searches against a non-redundant nucleotide database such as GenBank (Benson et al 1994) or EMBL (Emmert et al 1994). BLASTX analysis searches against a non-redundant protein database, such as SWISS-PROT (Emmert et al 1994). Use of TrEMBL (Bairoch & Apweiler 1996), a database of translated sequences from EMBL, ensures completeness of each BLAST search (as some gene sequences within the nucleotide databases have no entry in the protein databases). The major advantage of finding a homologous product is that some biology of the protein may already be known. The limitations caused by the incompleteness of nucleotide and protein databases will be removed as more and more genes are isolated.
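The probability value attached to a BLAST alignment can be illustrated with the Karlin-Altschul statistics on which BLAST scoring is based: the expected number of chance alignments scoring at least S is E = K m n exp(-lambda S), and the probability of at least one such alignment is P = 1 - exp(-E). The sketch below is an illustration only; the function names are assumptions, and the values used for lambda and K are hypothetical, since the real values depend on the scoring system and sequence composition.

```python
from math import exp, expm1


def expected_hits(score, query_len, db_len, lam, k):
    """Expected number of chance alignments with at least this score.

    Implements E = K * m * n * exp(-lambda * S).
    """
    return k * query_len * db_len * exp(-lam * score)


def p_value(score, query_len, db_len, lam, k):
    """Probability of at least one chance alignment with at least this score.

    Implements P = 1 - exp(-E); expm1 keeps precision when E is tiny.
    """
    return -expm1(-expected_hits(score, query_len, db_len, lam, k))


# Hypothetical parameters for illustration only.
LAMBDA = 0.2
K = 0.1
QUERY_LEN = 500           # length of the sequence under investigation
DB_LEN = 100_000_000      # total length of the database searched

for s in (100, 150, 200):
    print(s, p_value(s, QUERY_LEN, DB_LEN, LAMBDA, K))
```

A small P-value indicates that an alignment with that score is unlikely to have arisen by chance in a database of that size, which is why the same score becomes less significant as the database grows.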