Microsatellites.
Since θ is defined as the frequency with which recombination is observed, the probability of a recombination event may be defined as θ and that of a non-recombinant as (1-θ).
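As an illustration of how these two probabilities are used, the sketch below computes a LOD score for a set of phase-known meioses. The function name `lod_score` and the restriction to phase-known, fully informative meioses are assumptions made for this example and are not taken from the text above; the score is the standard log10 ratio of the likelihood at a given θ to the likelihood at θ = 0.5 (no linkage).

```python
import math


def lod_score(recombinants: int, non_recombinants: int, theta: float) -> float:
    """LOD score for phase-known meioses at recombination fraction theta.

    Likelihood at theta:  theta**R * (1 - theta)**NR
    Likelihood at 0.5:    0.5**(R + NR)
    """
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie between 0 and 0.5")
    total = recombinants + non_recombinants
    if theta == 0.0:
        # A single observed recombinant excludes theta = 0 entirely.
        if recombinants > 0:
            return -math.inf
        return total * math.log10(2)
    log_l_theta = (recombinants * math.log10(theta)
                   + non_recombinants * math.log10(1.0 - theta))
    log_l_null = total * math.log10(0.5)
    return log_l_theta - log_l_null


if __name__ == "__main__":
    # Hypothetical data: 1 recombinant in 10 informative meioses.
    for theta in (0.05, 0.1, 0.2, 0.3, 0.4):
        print(f"theta={theta:.2f}  LOD={lod_score(1, 9, theta):.3f}")
```

Scanning θ over a grid in this way shows where the LOD score peaks, which for phase-known data is at the observed proportion of recombinants.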
1.6. Identification of expressed sequences.
The vast majority of eucaryotic genes have been shown to have a split gene structure - that is to say that short coding sequences are interrupted by intervening sequences, the introns (reviewed by Breathnach and Chambon, 1981; Brennan and Hochgeschwender, 1995). Identifying these coding regions within a relatively vast candidate region of genomic DNA is frequently time-consuming and difficult. Generally, coding regions are conserved, single copy sequences, preceded by CpG islands. In the following pages a number of methods used to identify expressed sequences are outlined; the first of these, a search for conserved sequences, was the method used by the author during the course of this work, but other techniques, direct cDNA selection and exon trapping, were used in this region by other groups.
1.6.1. Conserved sequences.
The search for fragments of DNA that have been conserved across species during evolution has proved to be a successful method of screening genomic DNA for potential coding sequences. DNA coding for proteins that have a biological function in various organisms is likely to have changed little compared to intervening sequences, where change is not deleterious and therefore not selected against. This strategy has been used successfully to identify conserved DNA fragments in cosmids during cloning of the genes for Duchenne Muscular Dystrophy (Monaco et al., 1986), choroideremia (Cremers et al., 1990), and a gene termed DCC - deleted in colon cancer (Fearon et al., 1990). Cosmids or genomic fragments which show discrete positive signals when hybridised to Southern blots containing DNA from diverse animal species can be used to screen libraries of transcribed sequences. However, this approach is by no means foolproof and is very labour-intensive. During the search for the adrenoleukodystrophy gene, every EcoRI fragment from a 370kb candidate region was used as a probe on zooblots - restriction digests of genomic DNA from various evolutionarily diverse animals. However, cross-hybridising conserved genomic fragments from the candidate region failed to detect transcribed sequences in cDNA libraries from five tissues and on Northern blots (Mosser et al., 1993). The gene was eventually isolated by sequencing the conserved fragments and searching for putative coding regions using computer algorithms.
1.6.2. Exon amplification.
All methods which depend on the screening of expressed sequence libraries with genomic fragments are subject to the problem of screening a library from an inappropriate tissue, or a library made outside the temporal window of expression of a gene of interest. The method of exon amplification enables large genomic regions to be screened for coding sequences independent of the time or tissue of expression (Buckler et al., 1991). Sections of genomic DNA, usually from cosmids, are cloned into a vector containing an intron of the Human Immunodeficiency Virus 1 tat gene along with part of its natural flanking exonic sequences. When the vector is transfected into COS cells, mRNA is produced and spliced according to the splice signals contained within the vector and the cloned DNA. If an exon is contained within the genomic insert, the resulting transcript will have the new exon spliced in, and this can be amplified by reverse transcriptase-PCR, based on primers to the flanking exons. Products of this technique can then be used as probes (Taylor, 1990). Since its inception, modifications have been made to improve the system (Church et al.,
gene by one group and identification of the myosin VII gene in the Shaker 1 mouse (Walker et al., 1993; Gibson et al., 1995).
1.6.3. cDNA selection.
Identification of transcribed sequences from large genomic regions by direct selection has been shown to be effective, as demonstrated during cloning of the genes for the immunodeficiencies, X-linked agammaglobulinaemia and Wiskott Aldrich Syndrome (Vetrie et al., 1993; Derry et al., 1994). Using this technique, a YAC or a cosmid contig is immobilised on a solid support such as beads or a membrane, and hybridised with total cDNA either from a library or a specific tissue. Unbound cDNAs are washed off and bound cDNAs may be eluted. These molecules are then amplified, using linkers attached to the cDNAs or vector primers if the cDNAs are cloned, and subjected to repeated rounds of hybridisation and washing. Enrichment of specific bound cDNAs occurs with each round of amplification and washing; fidelity is checked by hybridisation back to the original genomic source (YAC) or by verification with a selectable marker (Lovett et al., 1991). This method permits the screening of many more clones more quickly than is practical in conventional library screening. In the search for the gene for X-linked agammaglobulinaemia, Vetrie used enrichment of α-galactosidase transcripts as a marker of enrichment of cDNAs specific to the YAC of interest. As well as identifying a particular cDNA, this method identifies other transcripts from the genomic region, enabling transcript maps to be constructed (Korn et al., 1992).
1.6.4. Computer programmes.
Computer programmes have been shown to be effective in identifying coding sequences in large regions of uncharacterised genomic DNA, for example in the identification of the genes for adrenoleukodystrophy and diastrophic dysplasia (Hastbacka et al., 1994; Mosser et al., 1993). Protein coding regions can be recognised because the genetic code and amino acid composition of proteins impose constraints on DNA sequence. The frequency with which each of the four bases occupies the three positions in the codon is not random, as preferred codons are used to specify particular amino acids and all amino acids
reading frames are examined, and in a coding region, one should show a good fit with the constraints whilst the other two will not. In addition the overall base composition of test DNA is compared to that of known coding and non-coding sequences (Ubernacher et al., 1991). In diastrophic dysplasia, shotgun sequencing of genomic fragments (in which each member of a set of randomly generated subclones of the target region was completely sequenced and the sequence of the target region reconstructed by computer) showed one of these to bear a strong amino acid similarity to the 5' coding region of the rat sat-1 gene, a sulphate transporter (Hastbacka et al., 1994).
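The positional constraint described above can be illustrated with a minimal sketch that tabulates how often each base occupies codon positions 1, 2 and 3 in a chosen reading frame. The function name `position_base_frequencies` is hypothetical, and this is only the tabulation step: the published programmes go further and compare such tables against statistics derived from known coding and non-coding sequences, which is not attempted here.

```python
from collections import Counter

BASES = "ACGT"


def position_base_frequencies(seq, frame):
    """Base frequencies at codon positions 1-3 for one forward reading frame.

    frame is 0, 1 or 2: the offset of the first complete codon.
    Returns a list of three dicts mapping base -> frequency.
    """
    seq = seq.upper()
    counts = [Counter(), Counter(), Counter()]
    # Ignore trailing bases that do not make up a complete codon.
    usable_end = len(seq) - (len(seq) - frame) % 3
    for i in range(frame, usable_end):
        base = seq[i]
        if base in BASES:  # skip ambiguity codes such as N
            counts[(i - frame) % 3][base] += 1
    freqs = []
    for counter in counts:
        total = sum(counter.values())
        freqs.append({b: (counter[b] / total if total else 0.0) for b in BASES})
    return freqs


if __name__ == "__main__":
    example = "ATGGCCAAGGTGCTGACC"  # made-up sequence for illustration
    for frame in range(3):
        print(frame, position_base_frequencies(example, frame))
```

In a genuine coding region, one frame would be expected to show a markedly non-uniform table while the other two look closer to the background composition.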
Nucleic acid sequences can also be used in sequence similarity searches of GenBank and EMBL nucleic acid databases using programmes such as BLAST (Basic Local Alignment Search Tool) to detect homology with known genes or expressed sequences. The efficiency with which a match is found and "scored" will depend on the parameters used in the search, and the scores best suited to detecting similarity between greatly diverged sequences differ from those best for detecting short but nearly identical segments. BLASTX can be used to search a nucleic acid sequence directly for the presence of protein coding regions; the query sequence is translated in all six reading frames and is compared with a protein sequence database. Protein-protein comparison methods are important because distant evolutionary relationships which appear minor at the nucleic acid level can be much more compelling at the protein sequence level.
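The six-frame translation step that precedes a translated search can be sketched as follows. This is not the BLASTX implementation; it is a minimal illustration assuming the standard genetic code, and the function names (`reverse_complement`, `translate`, `six_frame_translations`) and frame labels are chosen for this example only.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
CODON_BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(CODON_BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons from position 0; unknown codons become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Translate a DNA sequence in three forward and three reverse frames."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frame_translations("ATGGCCTAA").items():
        print(name, protein)
```

Each of the six conceptual translations would then be compared against the protein database; stop codons are shown as `*`, so a frame with long stretches free of `*` is the more plausible coding frame.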
The impetus from the Human Genome Project has encouraged large scale efforts to sequence parts of expressed sequences, known as expressed sequence tags or ESTs (Adams et al., 1991 and 1992), which are then entered onto a database, dbEST (Khan et al., 1992). The match of an expressed sequence with sequence from a genomic region confirms that a transcribed sequence is present, and this technique should prove to be even more powerful once all these ESTs are mapped to YACs (Berry et al., 1995).
1.6.5. HTF island rescue.
that a large fraction, up to 60%, of promoters, especially those of "housekeeping" genes, are associated with unmethylated CpG dinucleotides (HTF islands). Thus, DNA around HTF islands is enriched for first exons of genes. PCR based "rescue" of islands is achieved by cutting YACs or cosmids from a critical region, within an island, with an enzyme that recognises CpG, and ligating fragments to a linker. This template is amplified using a linker specific primer and a primer to the family of abundant human repeat sequences, Alu. Products may then be used to screen libraries of expressed sequences (Valdes et al., 1994; John et al., 1994). The effectiveness of this method in cloning new genes is yet to be proven (Shiraishi et al., 1995).
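Where genomic sequence is available, the CpG enrichment that characterises HTF islands can also be assessed computationally. The sketch below is an illustration only and is not part of the rescue method described above: it computes GC content and the observed/expected CpG ratio for a sequence, and applies the commonly quoted thresholds (length of at least 200 bp, GC content above 50%, observed/expected CpG above 0.6) as default parameters. The function names and the choice of thresholds are assumptions for this example.

```python
def cpg_statistics(seq):
    """Return (GC content, observed/expected CpG ratio) for a DNA sequence.

    Expected CpG count is (number of C * number of G) / sequence length.
    """
    seq = seq.upper()
    length = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_count = seq.count("CG")
    gc_content = (c_count + g_count) / length if length else 0.0
    expected = c_count * g_count / length if length else 0.0
    obs_exp = cpg_count / expected if expected else 0.0
    return gc_content, obs_exp


def looks_like_cpg_island(seq, min_length=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply simple island criteria; the thresholds are adjustable assumptions."""
    gc_content, obs_exp = cpg_statistics(seq)
    return len(seq) >= min_length and gc_content > min_gc and obs_exp > min_obs_exp


if __name__ == "__main__":
    cpg_rich = "CG" * 100   # artificial CpG-rich sequence
    at_rich = "AT" * 100    # artificial sequence with no C or G
    print(cpg_statistics(cpg_rich), looks_like_cpg_island(cpg_rich))
    print(cpg_statistics(at_rich), looks_like_cpg_island(at_rich))
```

In practice such a test would be run in a sliding window along the candidate region rather than on a whole clone at once.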