Comparative vertebrate genomic sequence analysis studies based on Fugu rubripes


Marc Robert Michel Botcherby M.Sc.

A thesis submitted to the University of London for the degree of

Doctor of Philosophy

Department of Molecular Genetics Institute of Ophthalmology University College London

University of London Bath Street London EC1V 9EL

ProQuest Number: U643460


Published by ProQuest LLC(2016). Copyright of the Dissertation is held by the Author.


Abstract

This thesis explores the benefits of the comparative analysis of genomic DNA sequence in three vertebrate species: man, mouse and Fugu rubripes.

The first region studied is the interval around Fibroblast Growth Factor 2 (FGF2). It falls within the candidate mapping interval for Insulin Dependent Diabetes 3 (Idd3) in mouse, in which Spermatogenesis Associated Factor (SPAF), FGF2, NUDT4 (formerly known as the antisense of FGF2, GFG) and a Centrin-related sequence are known to be present. The Ensembl entry for the human draft sequence only contains FGF2. A Fugu cosmid was sequenced to 99.99% accuracy and is shown to contain orthologues of SPAF, FGF2, NUDT4 and a Centrin gene, as well as a novel gene prediction (vader), evolutionarily conserved in mouse and man. Comparison to the available mouse and human genomic sequences indicates the presence of SPAF, FGF2, NUDT4 and a Centrin gene in both the mouse and human FGF2 genomic regions.

The second area of study is the conventional Protein Kinase C family (cPKC), which comprises 4 isoenzymes in man: α, βI, βII and γ. A number of approaches are taken to identify the Fugu cPKC homologues: genomic library screening by hybridisation, sequence scanning, Southern blotting and degenerate PCR. A PKCα orthologue is identified and evidence for the existence of two further β homologues is found. These genes are compared to the PKCs found in other species by means of phylogenetic trees. Two overlapping cosmids containing the PKCα orthologue are identified, sequenced to 99.99% accuracy and analysed. Potential regulatory elements are identified. A calcium channel subunit gene (CACNG5) is also found close to Fugu PRKCA, which reflects an association between all cPKCs and CACNG genes. This is interesting both from an evolutionary point of view and in the light of the association between cPKCs and Retinitis pigmentosa.

Declaration

I declare that this thesis submitted for the degree of Doctor of Philosophy is my own composition and save as otherwise stated the data presented herein is my own original work.


Acknowledgements

I would like to thank the succession of people who have, in turn, acted as my supervisors: at the Institute of Ophthalmology, Prof. Shomi Bhattacharya, Dr Mai Al-Maghteth and eventually Dr Eranga Vithana, and at the HGMP Resource Centre, Dr Chris Mundy, then Dr Melody Clark and finally Dr Greg Elgar. I would also like to thank Drs Sarah Smith and Melody Clark for their support and advice during my period of research and for their constructive criticism of the thesis.

I would also like to thank my colleagues at the HGMP-RC for their support and understanding as, although this PhD was conducted in my spare time and always came second to my work, it most certainly has impacted on the investment, both in terms of time and intellectual input, I have been able to make in my job.

L’homme n’est qu’un roseau, le plus faible de la nature; mais c’est un roseau pensant. Il ne faut pas que l’univers entier s’arme pour l’écraser: une vapeur, une goutte d’eau suffit pour le tuer. Mais quand l’univers l’écraserait, l’homme serait encore plus noble que ce qui le tue, puisqu’il sait qu’il meurt, et l’avantage que l’univers a sur lui, l’univers n’en sait rien.

Man is only a reed, the weakest thing in nature, but he is a thinking reed. The entire universe need not take up arms to crush him: a humor, a drop of water is enough to kill him. But should the universe crush him, man would be even nobler than what kills him, since he knows he is dying and, of the advantage that the universe has over him, the universe knows nothing.


LIST OF FIGURES AND TABLES

Figure 3-1: Alignment of the Centrin family proteins known to date

Figure 3-2: Ethidium bromide stained 0.8% agarose gel of EcoRI (E) and SacI (S) digests of cosmid 196N09

Figure 3-3: Nix output for cosmid 196N09 displaying the results of exon predictions and Blast homology searches

Figure 3-4: Genomic features of cosmid 196N09

Figure 3-5: Percentage Identity Plot (PIP) comparing the sequence of Fugu cosmid 196N09 to the equivalent human and mouse draft sequences

Figure 3-6: Genomic sequence of the Fugu centrin genomic sequence

Figure 3-7: Alignment of Human Centrin 2 (HsCETN2), the Fugu putative Centrin (Fugu CEN), Human and Mouse Centrin 1 (HsCETN1 and MusCETN1) and C. reinhardtii Centrin (CRCEN)

Figure 3-8: Phylogenetic tree of known Centrin proteins

Figure 3-9: est2genome output aligning the mouse cDNA sequence to the Fugu genomic sequence for FGF2

Figure 3-10: Genomic sequence of the Fugu FGF2 genomic sequence

Figure 3-11: Alignment of FGF2 protein for Man, Fugu rubripes and three strains of mice: A/J, C57black6 and 129

Figure 3-12: Genomic sequence of the Fugu GFG genomic sequence

Figure 3-13: Alignment of the human (AAH09842 - nudix (nucleoside diphosphate linked moiety x)-type motif 6), rat (GFG_RAT), Xenopus laevis (GFG_XENLA) and Fugu (fugu_gfg) GFG sequences using clustalW

Figure 3-14: Available genomic sequence of the Fugu SPAF genomic sequence

Figure 3-15: Alignment of the predicted Fugu SPAF transcript to the mouse amino acid sequence

Figure 3-16: Depiction of the 6 kb region around the gene predicted by Genscan

Figure 3-18: Alignment of the protein predicted (gene prediction 3) from the Fugu sequence and its T. nigroviridis homologue

Figure 3-19: Analysis of the transcript of predicted gene 3 using F IX

Figure 3-20: Graphical display showing the transcription factor binding sites predicted by rVista

Figure 4-1: Domain structure of conventional PKC enzymes (PKC α, βI, βII and γ)

Figure 4-2: Relationship between PKC isotypes from mammalian and non-mammalian species based on the full-length amino acid sequences

Figure 4-4: Restriction digest of the PKC-G positive cosmids

Figure 4-5: Result of hybridisation of Southern blot of PKC-G positive cosmid

Figure 4-6: Example of fractionation (C76D01 cosmid)

Figure 4-7: Ethidium bromide stained 0.8% agarose gel of the SacI digests of the 20 clones isolated by screening the Fugu cosmid library with 12B17

Figure 4-8: Schematic description of the alignments of the Fugu sequences to the human PRKCG cDNA sequence

Figure 4-9: Restriction Digests of PKC cosmids and Centrin cosmid

Figure 4-10: Summary of primers used in degenerate PCR experiments

Figure 4-11: Ethidium bromide stained agarose gel (0.8%) displaying the results of a degenerate PCR experiment carried out using primers PKCGGF 5 and 6

Fig 4-12: Result of PCR amplification using the PKCDGF5/6 primers on the bands cut out from the original PCR from genomic and cosmid DNA

Fig 4-13: PKC degenerate primers 7 and 8 PCR experiment carried out at 46 °C and 50 °C

Fig 4-14: PCR experiment using primers PKCDGF 9/10 on cosmid 12B17 (C) and Fugu genomic (G) DNA

Figure 4-16: Phylogenetic tree comparing the PKC exon 11 and 12 sequences generated by degenerate PCR from Fugu genomic DNA to the human PKC

Figure 4-17: Stained agarose gel of DG9/10 PCR products on hybridisation positive cosmids

Figure 4-18: Ethidium bromide stained agarose gel (0.8%) showing EcoRI and BamHI digests of the hybridisation positive and sequence scanning identified cosmids

Figure 5-1: NIX output of the 123J1-12B17 contig sequence

Figure 5-2: Graphical summary of the features highlighted by NIX

Figure 5-3: Genomic sequence of the Fugu prkcl gene

Figure 5-4: ClustalW alignment of known PKC proteins and those predicted from Fugu sequences

Figure 5-5: Phylogenetic tree of all identified Fugu and mammalian PKC proteins

Figure 5-6: Phylogenetic tree of the Fugu and mammalian conventional PKC proteins

Fig 5-7: Percentage Identity Plot (PIP) comparing the finished sequence of Fugu contig (cosmids 123J1 and 12B17) to the available human, mouse and tetraodon draft sequences

Figure 5-8: Conserved regulatory element binding sites immediately upstream of PKC exon 1

Fig 5-9: Theatre prediction for P300 binding sites in the PKC upstream region

Figure 5-10: Genomic sequence of the Fugu eukaryotic initiation factor like predicted gene

Figure 5-11: BLAST alignment of the Salivary glue protein SGS-4 precursor amino acid sequence to the Fugu genomic sequence

Figure 5-12: BLAST alignment of the Voltage-dependent calcium channel gamma-5 subunit (CCG5) amino acid sequence to the Fugu genomic sequence

Figure 5-13: Genomic sequence of Fugu cacng5

proteins and the cacng5 protein predicted from Fugu sequences

Figure 5-15: Phylogenetic tree of the Fugu and mammalian calcium channel gamma subunit proteins (cacng)

Figure 5-16: BLAST alignment of the Voltage-dependent calcium channel gamma-3 subunit (cacgnS) amino acid sequence to the Fugu genomic sequence of scaffold S001310

Figure 5-17: Screenshot of Ensembl database (www.ensembl.org) of the 19q13.4 region showing PRKCG and CACNG7

Figure 5-18: BLAST alignment of the Cytohesin 3 amino acid sequence to the Fugu genomic sequence

Table 3-1: Summary of protein sequence matches found using BLAST

Table 3-2: Result of tBlastX search of Embl Blast databases for predicted gene 1

Table 3-3: Result of tBlastX search of Embl Blast databases for predicted gene 2

Table 3-4: Result of tBlastX search of Embl Blast databases for predicted gene 3

Table 3-5: Summary of the transcription binding site predictions made by rVista

Table 4-1: Distribution of PKC isoenzymes in various species

Table 4-2: Reference ID of the Fugu cosmids identified by screening by hybridisation with the human PKCG

Table 4-3: Cosmid subclones with significant database matches

Table 4-4: Cosmid subclones with significant database matches

Table 5-1: Association between cPKC genes and Retinitis pigmentosa

Table 5-2: Fugu scaffolds identified by searching the database with human PRKCA, B and G

List of abbreviations

BAC Bacterial artificial chromosome

cDNA Complementary DNA

EST Expressed sequence tag

HGP Human genome project

Kb Kilobase

LCR Locus control region

MAR Matrix associated region

Mb Megabase

Mtp Minimum tiling path

ORF Open reading frame

OST Open reading frame sequence tag

PAC P1-derived artificial chromosome

PCR Polymerase chain reaction

PKC Protein Kinase C

PRKCA Protein Kinase C alpha

PRKCB Protein Kinase C beta

PRKCG Protein Kinase C gamma

RFLP Restriction fragment length polymorphism

SAR Scaffold associated region

STS Sequence tagged site

YAC Yeast artificial chromosome

Introduction

In 2001 the draft DNA sequence of the human genome was determined both by the large scale government funded program, the Human Genome Project (HGP), and by the Celera corporation (International human genome sequencing consortium, 2001; Venter et al, 2001). This represents a tremendous advance for science and, although the sequences are not yet complete, it will have an enormous impact on the future of genomics research. The Celera assembly was created by assembling whole genome shotgun reads as well as publicly available data and is made up of ‘scaffolds’, or contiguous assemblies of overlapping sequences and paired sequence reads from 2, 10 and 50 kb plasmid inserts. Those scaffolds therefore contain large gaps yet cover nearly all the genome. The HGP assembly is made up of mapped genomic clones that have been fully or draft sequenced as well as whole genome shotgun reads and is also, in parts, incomplete. The HGP data is made available to the scientific community freely and as soon as it is available by sites such as that of the Ensembl project (http://ensembl.org). The Celera database contains the sequence generated by the corporation as well as the publicly available data. It is private and can be accessed by paying a subscription fee. Three chromosomes have so far been completed to accepted international standards: chromosome 22 (Dunham et al, 1999), chromosome 21 (Hattori et al, 2000) and chromosome 20 (Deloukas et al, 2001). As a

1-The Human Genome Project (HGP)

1-1 Mapping the genome

Originally called the Human Genome Mapping Project, this was started in the late 1980s with the ambitious aim of mapping and then sequencing the entire human genome by 2007. The brainchild of James Watson, it was set up as a worldwide collaboration under the aegis of the American National Institutes of Health (NIH).

1-1-1 The human genetic map

The first goal of the project was to produce a genetic map with a density of one marker per cM.

This was an ambitious target. Comprehensive maps had been achieved by means of setting up crosses and studying their progeny for a few organisms such as Drosophila melanogaster but, in order to map the human genome, one was going to have to rely on human pedigree analysis and relatively few polymorphic markers. Genetic mapping makes use of the frequency of recombination between chromosomes during meiosis. The more distant two linked markers are, the more likely they are to be separated by a cross-over event. In laboratory organisms, it is possible to set up crosses to specifically study a set of markers. In human mapping, the genotypes of members of a family have to be studied carefully in order to detect recombinations, and statistical analysis employed to determine the relative distance between markers. Furthermore, in order to derive this information, the markers have to be polymorphic

The first highly polymorphic markers to be recognised in animal genomes were Restriction Fragment Length Polymorphisms (RFLP). The first human RFLP map was published in 1987 (Donis-Keller, 1987). It was based on the study of 21 families, using 393 RFLPs and 10 additional polymorphic markers. It had an average density of one marker every 10 cM.

In order to refine the map, it was necessary to find more frequent markers. Microsatellites (simple tandem repeats, typically 10-30 copies of a repeat of 2, 3 or 4 bases) quickly became the marker of choice for three main reasons:

■ they are frequent in the human genome - several per Mb

■ they are highly variable - several alleles of each one are present in the population as a whole and they display a high level of heterozygosity (the average heterozygosity of the markers used in the 1996 map was 0.7, i.e. a 7 in 10 chance that an individual will be heterozygous for the marker)

■ they can be efficiently and accurately typed by PCR, being under 300 bp in length.
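The heterozygosity figure quoted above can be illustrated with a short calculation. The sketch below assumes random mating (Hardy-Weinberg proportions), under which the expected heterozygosity of a marker is one minus the sum of the squared allele frequencies. The function name and the allele frequencies are hypothetical and chosen only for illustration; they are not taken from the 1996 map.

```python
def expected_heterozygosity(allele_freqs):
    """Return 1 - sum(p_i^2) for allele frequencies that sum to 1.

    Assumes random mating, so the chance of carrying two copies of
    allele i is p_i squared.
    """
    if any(p < 0 for p in allele_freqs):
        raise ValueError("allele frequencies must be non-negative")
    if abs(sum(allele_freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in allele_freqs)


# Hypothetical frequencies, for illustration only.
two_allele_marker = [0.5, 0.5]             # best case for a two-allele marker
four_allele_marker = [0.4, 0.3, 0.2, 0.1]  # a multi-allele marker

print(round(expected_heterozygosity(two_allele_marker), 3))
print(round(expected_heterozygosity(four_allele_marker), 3))
```

Under this model a marker with only two alleles cannot exceed a heterozygosity of 0.5, whereas the hypothetical four-allele marker reaches 0.7, which is one way of seeing why multi-allelic microsatellites are more informative in pedigree analysis.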

The target of the 1 marker per Mb map was exceeded in 1994 with a map (published two years later; Dib, 1996) featuring on average one marker per 0.6 Mb. The maps of chromosomes 1 to 22 were produced from 8 families taken from the Centre d’étude du polymorphisme humain (CEPH) collection, using 5264 microsatellites. A further 12 families were required to map the X chromosome, due to the paucity of recombination events between X-linked markers. In total 304 individuals were

1-1-2 The human physical map

The production of refined physical maps was not far behind.

Using Sequence Tagged Site (STS) screening, as well as other clone fingerprinting methods, a clone contig map of the entire genome consisting of 33 000 YACs (Yeast Artificial Chromosomes) was published in the early 1990s (Cohen, 1993). Unfortunately a large proportion of the YACs were found to be chimeric (containing more than one genomic insert) and its value was therefore limited. A novel mapping technique, making use of Radiation Hybrid Panels, was to supplant it. These panels are generated by breaking up human chromosomes using radiation and fusing the resulting fragments into a rodent cell line. The resulting hybrid cell lines therefore contain human DNA fragments within their rodent chromosomes. The marker content of the different lines is determined and a matrix created, allowing the positioning of STSs depending on which cell lines are positive by PCR. A map of 200 kb density was generated (Hudson, 1995) and then a 100 kb map (Schuler, 1996). A radiation hybrid map of the human genome was also generated as a collaboration between the University of Cambridge and the Généthon (Gyapay et al, 1996). In a recent analysis, these maps were found to be in broad agreement with the human sequence map (Wright et al, 2001). An integrated map of the physical and genetic maps was then generated (Deloukas, 1998) and this was to form the basis for

1-2 Sequencing the genome

The 25th anniversary of the discovery of DNA sequencing should see the completion of the HGP. 1977 saw the publication, in the same issue of the Proceedings of the National Academy of Sciences, USA, of the two principal methods of DNA sequencing: the chain termination method (Sanger, 1977) and the chemical degradation method (Maxam and Gilbert, 1977).

The chemical degradation method was the more accessible technique at the time. In this method, the strands are radiolabelled at one end and then different reactions are carried out which cleave the DNA molecules at different bases: at guanine, at cytosine, at adenine or guanine and lastly at cytosine or thymine. The resulting fragments are separated on a polyacrylamide gel of 1 base resolution and visualised by autoradiography. Because only one end of the molecule is labelled, only the fragment upstream of the cleavage is visible. If the reaction is only allowed to proceed partially, the whole family of fragments ending in G will be present. The same will be true of A, T and C and therefore the banding pattern will represent the sequence of the bases in the DNA molecule. All the reagents required were readily available and the techniques employed were standard laboratory techniques.
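How a banding pattern encodes a sequence can be made concrete with a toy simulation. The sketch below is a simplification under stated assumptions: each cleavage is recorded as the position of the cleaved base counted from the labelled end (in the real chemistry the cleaved base is destroyed, so the visible fragment is one base shorter), every position yields a band, and the four lanes follow the four reactions listed above. The function and lane names are hypothetical.

```python
def cleavage_lanes(seq):
    """Simulate which band positions appear in each of the four lanes."""
    lanes = {"G": set(), "A+G": set(), "C": set(), "C+T": set()}
    for pos, base in enumerate(seq, start=1):
        if base == "G":
            lanes["G"].add(pos)
            lanes["A+G"].add(pos)
        elif base == "A":
            lanes["A+G"].add(pos)
        elif base == "C":
            lanes["C"].add(pos)
            lanes["C+T"].add(pos)
        elif base == "T":
            lanes["C+T"].add(pos)
        else:
            raise ValueError(f"unexpected base {base!r}")
    return lanes


def read_ladder(lanes):
    """Read the sequence from the shortest band to the longest."""
    positions = set().union(*lanes.values())
    if not positions:
        return ""
    bases = []
    for pos in range(1, max(positions) + 1):
        if pos in lanes["G"] and pos in lanes["A+G"]:
            bases.append("G")   # band in both purine lanes
        elif pos in lanes["A+G"]:
            bases.append("A")   # band in the A+G lane only
        elif pos in lanes["C"] and pos in lanes["C+T"]:
            bases.append("C")   # band in both pyrimidine lanes
        elif pos in lanes["C+T"]:
            bases.append("T")   # band in the C+T lane only
        else:
            raise ValueError(f"no band at position {pos}")
    return "".join(bases)


print(read_ladder(cleavage_lanes("GATTACA")))
```

Reading the lanes from the bottom of the gel upwards in this way recovers the input string, which is the sense in which the banding pattern represents the sequence.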

The chain termination method relies on synthesising a second DNA strand from a single stranded template. The nucleotides present in the reaction are a mixture of deoxynucleotides and dideoxynucleotides, which are chain extension terminators. If the ratio is set correctly, all the termination events will be represented in the reaction product and therefore the reaction product is similar to that of a chemical degradation

more easily available however, the chain termination method established itself as the method of choice and still forms the basis of the Human Genome Sequencing program. The sequencing revolution came with the advent of automated sequencers. These machines rely on fluorescent dye labelled sequencing primers or terminators. The automated sequencer separates the extension products using a polyacrylamide slab gel or capillaries and detects them using a complex system of optics. Briefly, a laser is used to excite the dye and the emission is detected either by a photomultiplier tube or a CCD camera. In this way a digital image of the gel is created and this can be read automatically, without having to dry the gel or expose it to film (Ansorge et al, 1986). The patenting of a method in which four different dyes were used, allowing all four reactions pertaining to a sample to be run in the same lane, improved the capacity of the machines. Novel enzymes and dyes have improved the performance of the technique so that it is possible to read 1000 bases in one sequencing experiment. It is also possible to read the sequence directly off large clones such as BACs or YACs or small genomes such as viral genomes. These improvements inspired a great deal of confidence and there was a general belief that, as one got nearer the deadline, the techniques would be so much more efficient that it would be possible to complete the task with less effort. This may partly account for the fact that the sequencing data was not really generated on a large scale until 1996-

1-2-1 Four different sequencing approaches

Genome-wide sequencing strategies have taken four different routes: cDNAs, sequence scans of genomic clones, finished sequence of genomic clones and whole genome shotguns.

1-2-1-1 Sequencing cDNAs

With the drive forward of the human physical map, a great many STSs were generated. Sequence tags of cDNAs (ESTs) are a representation of the expression of the genome and are also a 'gene catalogue' (Adams et al, 1991, 1992, 1993). Patents were filed for these, on the premise that these would subsequently be assigned function and be of financial interest (Kerlavage et al, 1993; Venter et al, 1992). The TIGR corporation formed a private company (HGS) which was to take a very large scale approach to this and sequenced hundreds of thousands of ESTs, all of which were only available to subscribers (Marshall, 1994; Rowe, 1995).

The IMAGE library is a cDNA library in which all the clones are sequence tagged (Lennon, 1996). This is a powerful resource as every clone isolated can be checked for homology to published databases using programs such as BLAST (Altschul, 1997).

The drawback in using ESTs to study expression is that there is vast over-representation of some cDNAs while others are missing, large numbers of ESTs in fact representing the same cDNA where other cDNAs are absent. A great deal of

Furthermore, full-length cDNA libraries are being generated and the complete, finished sequences of the cDNAs obtained. FANTOM, a large-scale annotation project of full-length mouse cDNAs, is under way in Japan, for example, at the Riken Institute (Kawai, 2001). This resource will allow scientists to identify genes and determine alternative transcripts. Another initiative is concerned with the creation of full-length, expression-ready cDNAs (FLEX clones). The inserts are the cDNA sequence, from methionine to stop, without the 5’ or 3’ untranslated regions (UTR), cloned recombinationally in a way that allows for the directional and in-frame shuttling of the gene into an expression vector (http://www.hip.Harvard.edu).

1-2-1-2 Sequence scanning of genomic clones

A great deal of information can be derived from partial sequences. A one fold or two fold coverage of a clone will be sufficient to identify nearly all the genes and markers present in a genomic clone. In order to analyse a given gene, a 6-10 fold coverage of the clone followed by editing is required. Genome scans, however, provide a cost effective way of analysing genome contents and provide the means for genome comparisons. The Fugu Landmark Genome Project, for example, scanned 25 Mb (6%) of the Fugu rubripes genome (Elgar et al, 1999). This successfully demonstrated the benefits of sequence scanning in comparative genomics, as

1-2-1-3 Full length sequencing from genomic clones

This is the strategy adopted by the HGP. Clones from a physical map were selected to form a minimal tiling path and sequenced individually. At the beginning this consisted of cosmids, but as larger clones became available, BACs and PACs became the template of choice. Directed strategies and ordered shotgun strategies have been devised, but the large sequencing centres have nearly all adopted the random shotgun approach. In this method, the clone is broken up randomly by sonication or shearing, and the fragments are size selected and end repaired before being sub-cloned by blunt-end ligation. The first vector of choice was M13, as it naturally produces large amounts of single-stranded DNA, traditionally the sequencing template of choice. As cycle sequencing became more widespread, its greater tolerance of double-stranded template made more stable vectors such as plasmids more attractive and now pUC18 has supplanted M13mp18 as the shotgun sequencing workhorse. These subclones are then end-sequenced to provide a relatively deep coverage of the clone (8-12 times coverage or 10-15 reads per kb). The reads are then ’assembled’, by which is meant that overlapping reads are aligned using a computer package or ’assembly engine’ such as Gap4 (Staden, 1998) or Phrap (Gordon et al, 1998). Of course these almost never cover the whole clone but fall into contiguous assemblies (contigs). These alignments are edited and experiments designed to link up the various contigs until the sequence of the whole clone is arrived at. Various conventions and quality standards are internationally accepted to provide greater than 99.99% accuracy (less than 1 error in 10 000 bases), based on the Phred base caller quality values. Phred

base caller and it assigns a quality value for each base 20, corresponding to a 99.99%

confidence value.
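The relationship between a base-caller quality value and an error probability can be sketched as follows. This assumes the usual Phred definition, in which the quality value is minus ten times the base-10 logarithm of the estimated error probability; the function names are hypothetical and the sketch is not a description of how Phred itself is implemented.

```python
import math


def quality_to_error_probability(quality):
    """Convert a Phred-style quality value to an error probability."""
    return 10 ** (-quality / 10)


def error_probability_to_quality(error_probability):
    """Convert an error probability (0 < p <= 1) to a quality value."""
    if not 0 < error_probability <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(error_probability)


for q in (10, 20, 30, 40):
    p = quality_to_error_probability(q)
    print(f"Q{q}: error probability {p:g}, accuracy {100 * (1 - p):g}%")
```

Under this definition a quality value of 20 corresponds to an error chance of 1 in 100, and an accuracy of 99.99% (1 error in 10 000 bases) corresponds to a quality value of 40.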

1-2-1-4 Whole Genome Shotgun

This method had previously been successfully applied to the sequencing of viral and bacterial genomes (Cole et al, 1998). Briefly, the principle of the method consists in fragmenting and sub-cloning the entire genome directly into a plasmid library without carrying out any physical mapping. Sufficient sub-clones are sequenced to give an adequate depth of coverage and the whole database of sequence reads assembled much in the same way as in a traditional shotgun project.
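What an "adequate depth of coverage" means can be illustrated with a simple idealised model. The sketch below assumes that reads of a fixed length fall uniformly at random on the target, so that the depth at any base is approximately Poisson distributed; it ignores repeats, cloning bias and the minimum overlap needed for assembly. The function names and the example numbers are hypothetical.

```python
import math


def coverage(n_reads, read_length, target_length):
    """Average depth: total bases sequenced divided by target length."""
    return n_reads * read_length / target_length


def uncovered_fraction(depth):
    """Expected fraction of bases hit by no read under a Poisson model."""
    return math.exp(-depth)


def reads_needed(depth, read_length, target_length):
    """Smallest number of reads giving at least the requested depth."""
    return math.ceil(depth * target_length / read_length)


# Hypothetical example: a 40 kb clone sequenced with 500 bp reads.
depth = coverage(n_reads=800, read_length=500, target_length=40_000)
print(depth)
print(uncovered_fraction(depth))
print(reads_needed(2, 500, 40_000))
```

In this model the expected uncovered fraction falls off exponentially with depth, which is why a one or two fold scan leaves a sizeable part of the target unsequenced while a roughly ten fold shotgun leaves very little.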

In 1997, the idea was put forward that it might be possible to apply this method to entire human chromosomes (Weber et al, 1997; Green, 1997). Much debate was occasioned as to whether the assembly would be correct or whether the presence of so many repeat sequences over such a long stretch of DNA would render it impossible to obtain an assembly that would be any more than a crude scribble, an opinion ventured by the then director of the NIH, Harold Varmus.

The conflict was further fuelled by a fundamental divergence of opinion as to the ownership of data. The genome centres taking part in the HGP are all signatories of the Bermuda agreement by which they pledge to make all sequence data publicly accessible with no delay. The Celera database is available only on subscription. The first sign of acceptance of the Celera corporation by the publicly funded bodies was the partnership between the Lawrence Berkeley National Laboratory and Celera

sequence assembly was based on maps acquired over years of research by publicly

funded scientists worldwide.

Craig Venter was the master of hubris in this conflict between government organisations and big corporations yet he seems to have been vindicated and also to have forced the HGP to bring forward their targets. By June this year the Whole Human Genome Shotgun was sequenced and the HGP published the draft sequence of the entire human genome (International human genome sequencing consortium, 2001; Venter et al, 2001).

1-3 The sequence of the human genome

The current status of the human genome sequence is that the Human Genome Project has obtained a 'draft' sequence for the human genome and that the Celera corporation has finished sequencing the whole human genome shotgun. The HGP draft is on average 5 bases deep, which enables the identification and location of genes and will form the basis for a fair amount of analysis and polymorphism detection. The determination of gene structure and regulation however will necessitate the full sequence allied with a degree of experimentation and comparative analysis.

The total number of genes in the genome is currently a matter for debate.

Estimates based purely on a gene index of clustered and filtered ESTs arrive at a number of human genes as high as 120 000 (Liang et al, 2000). Because clustering ESTs is a difficult task, requiring enormous computing resources, in order to eliminate clusters which are not really unique, these were matched to the finished

was found to contain 545 genes and 134 pseudogenes (Dunham et al, 1999). Extrapolating from these figures would give an estimated figure of 37 700 genes in the human genome. Chromosome 21 represents 1 to 1.5% of the genome and contains 127 known genes, 59 pseudogenes and 98 predicted genes (Hattori, 2000), therefore giving a lower figure of 28 400 genes. In a similar study, 680 assumed genes on chromosome 22 were found to match 2% of a selected set of EST clusters (Ewing and Green, 2000). On the assumption that those 680 correspond to 2% of all genes, a total figure of 34 000 genes can be extrapolated. Similarly, comparative sequence analysis of evolutionarily conserved regions in known genes on human chromosome 22 to the teleost Tetraodon nigroviridis gave an estimate of 30 000 genes, assuming a uniform rate of conserved regions per true gene (Roest Crollius et al, 2000). These latter figures are more closely in agreement with estimates resulting from combining de novo computer gene predictions with EST matches. Matching Genscan gene predictions to EST homology provides an estimated total gene number of 33 000 (www.ensembl.org). It has been argued, however, that estimates of sensitivity and redundancy should be taken into account when computing gene numbers, and that the total number of genes in the genome should be revised to 60 000 (Birney E, Genome Based Structure Determination Symposium, June 2000). In a more recent study aligning transcriptional units to the genome, a prediction of the genome containing 65 000-75 000 transcriptional units was made (Wright, 2001). The authors estimate the presence of 854 genes on chromosome 22, which is 25% more than the previous estimate of Ewing and Green, but representing only 1.4% (as opposed to 2%) of the gene total.
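The extrapolations quoted above all follow the same arithmetic: a count made on a sample is divided by the fraction of the total that the sample is assumed to represent. The sketch below reproduces two of the figures. The function name is hypothetical, and the reading of the chromosome 21 figure (all 284 listed features, taken at the 1% end of the 1 to 1.5% range) is an assumption made here because it matches the quoted total, not something the text states.

```python
def extrapolate_total(count_in_sample, fraction_of_total):
    """Scale a count up to the whole, given the sample's assumed share."""
    if not 0 < fraction_of_total <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return count_in_sample / fraction_of_total


# 680 assumed chromosome 22 genes taken as 2% of all genes.
ewing_green = extrapolate_total(680, 0.02)

# Chromosome 21: 127 known genes + 59 pseudogenes + 98 predicted genes,
# taken as 1% of the genome (an assumed reading, see the note above).
chromosome_21 = extrapolate_total(127 + 59 + 98, 0.01)

print(round(ewing_green))
print(round(chromosome_21))
```

The calculation makes plain how sensitive such estimates are to the assumed fraction: moving the chromosome 21 share from 1% to 1.5% lowers the extrapolated total by a third.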

It would seem that the emerging sequence data and refining of prediction tools is

however, for the complete sequence to be fully annotated in order to know the correct number of our genes. It would seem, for the present, that this number lies between 28 400 and 75 000.

It is interesting, at this stage, to introduce a comparison with another organism which is currently being annotated, the worm Caenorhabditis elegans. Whereas, in the case of the human genome, early estimates of 60 000 to 150 000 genes have been revised down to around 30 000 in the light of the annotation of the two finished chromosomes, the number of genes in C. elegans has been revised up from the early estimates of around 10 000 to 19 404 protein genes in Wormbase (http://www.wormbase.org), now that the finished sequence is available (The C. elegans sequencing consortium, 1998). In order to test whether this near doubling of the estimate of gene numbers could be substantiated, an open reading frame sequence tag (OST) approach was employed, in which it was attempted to amplify a product from cDNA for 1 222 predicted genes for which no EST had yet been obtained (Reboul et al, 2001). At least 70% of the genes studied were verified, although the predicted intron/exon structure was not always correct, indicating that a conservative gene number estimate for C. elegans is 17 387. The total gene number is likely to be even higher, as many small genes, such as those encoding neuropeptides, antimicrobial peptides and small regulatory proteins such as egl-1, are difficult to identify (Conradt and Horvitz, 1998). Moreover, this count does not include RNA-encoding genes, which are difficult to pinpoint in raw sequence, and may total over 1000 in C. elegans (Hodgkin, 2001).

The other consideration is that the gene count of 20 000 for C. elegans seems high, as

complexity of the organisms. There may be some reasons, however, why the two counts
are not directly comparable.

There is a relative lack of differential splicing in C. elegans, compared to Drosophila
or mammals (Hodgkin, 2001). Whereas 30 000 to 40 000 genes may give rise to 100 000
proteins or more in a mammal, only 4% of genes have been confirmed as exhibiting
alternative splicing in the worm (Wormbase: http://www.wormbase.org). This figure may
rise when more C. elegans cDNA sequences are deposited in the databases, since 40% of
C. elegans genes do not have associated ESTs as yet.

Furthermore, human proteins tend to contain more domains than C. elegans proteins,
particularly in the case of transmembrane proteins with large extracellular regions.
Mammalian proteins may therefore be more multifunctional.

Thirdly, many C. elegans genes and small genetic regions seem to have been involved in
local duplication events (The C. elegans sequencing consortium, 1998). Interestingly,
the C. elegans genome displays many examples of incomplete redundancy, in which genes
share some of their function, yet also display unique functions. For example, the
Notch-related receptor genes lin-12 and glp-1 have distinct postembryonic roles yet
share function in embryogenesis (Kimble and Simpson, 1997).

Finally, there are also gene families which are greatly expanded, such as the nuclear
hormone receptor genes (nhr), which are much more numerous in the worm than in the
Drosophila or human genomes (Maglich et al, 2001).

It is clear that the precise number of human genes is going to remain uncertain for a
few years to come. It is also probable, judging from the experience of C. elegans, that
computer predictions from the raw sequence will have to be supplemented by ESTs,

2 - Making sense of the genome sequence

Now that the tour de force of sequencing the human genome is nearing completion,

scientists are facing an even greater challenge: understanding the make up and

function of the genome.

■ which parts of the sequence constitute the transcriptome?

■ how is transcription regulated?

■ what functions to attribute to its product, the proteome?

2-1 The transcriptome

It is estimated that coding sequence accounts for only around 3% of the genome. The
great majority of the sequence is intergenic, while within a gene, coding exons are
separated by long stretches of intronic sequence. Coding sequences can be identified
by examining the DNA sequence or by performing experiments.

2-1-2 Gene prediction by sequence analysis

Protein coding genes are made up of open reading frames (ORF). These usually start

with the ATG codon and end with a TAA, TAG or TGA termination codon.
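The definition above can be illustrated with a minimal sketch that scans the three forward reading frames for an ATG followed, in frame, by one of the three termination codons. The function name, the `min_codons` parameter and the choice to ignore the reverse strand and any introns are assumptions made purely for illustration; this is not one of the prediction programs discussed in this chapter.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return sorted (start, end, frame) tuples for forward-strand ORFs.

    start is the 0-based index of the A of the ATG; end is exclusive and
    includes the stop codon. min_codons counts the ATG but not the stop.
    Introns and the reverse strand are deliberately ignored (assumption).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG seen since the last stop in this frame
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ATG in this frame
    return sorted(orfs)
```

On an uninterrupted coding sequence this recovers the ORF directly; on genomic sequence from a higher eukaryote it would only find fragments, for the reason given in the next paragraph.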

However, looking for ORFs in higher eukaryotic sequences is complicated by the fact

that genes are often split by introns and do not appear as continuous ORFs in the

■ Codon bias: each species has a specific frequency of codon usage for a given
amino acid. This will be found in real exons, but not in random series of triplets.

■ Exon-intron boundaries: these have distinct sequence features, the upstream
boundary being of the format 5'-AG||GTAAGT-3', in which the GT is invariably found
but the other bases are only the most common. The downstream boundary is defined as
5'-YYYYYYNCAG||-3' (Y denoting C or T and N denoting any base). Evidently, these
motifs are not easy to search for in a sequence and therefore identification of
intron-exon boundaries by analysing the sequence alone is not going to give the full
answer (Frech et al., 1997).

■ Upstream control sequences: these too have characteristic motifs. Unfortunately
they are not sufficiently well defined to make good predictors.

■ CpG islands: these regions of high prevalence of the CG motif are associated
with the upstream region of the majority of vertebrate genes (Bird, 1986). These
can be identified by computer programs and are a good indicator of the start of a
gene.
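As an illustration of the last point, the sketch below scores fixed-size windows by GC fraction and by the ratio of observed to expected CG dinucleotides. The window size of 200 bp and the thresholds (GC above 50%, observed/expected above 0.6) are commonly quoted values that are assumed here; they are not taken from the text, and the function names are hypothetical.

```python
def cpg_stats(window):
    """Return (gc_fraction, observed/expected CpG ratio) for a DNA string."""
    window = window.upper()
    n = len(window)
    c = window.count("C")
    g = window.count("G")
    cpg = window.count("CG")
    gc_fraction = (c + g) / n if n else 0.0
    # Expected CpG count if C and G were placed independently of each other.
    expected = (c * g) / n if n else 0.0
    ratio = cpg / expected if expected else 0.0
    return gc_fraction, ratio

def cpg_windows(seq, size=200, step=1, min_gc=0.5, min_ratio=0.6):
    """Return start positions of windows passing both (assumed) thresholds."""
    hits = []
    for start in range(0, len(seq) - size + 1, step):
        gc, ratio = cpg_stats(seq[start:start + size])
        if gc > min_gc and ratio > min_ratio:
            hits.append(start)
    return hits
```

Adjacent passing windows would normally be merged into a single island; that step is left out to keep the sketch short.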

2-1-2-1 Exon predictions

A number of exon prediction programs have been developed, such as HEXON (Solovyev
et al, 1994), MZEF (Zhang, 1997), Genemark (Lukashin and Borodovsky, 1998) or
GRAIL/exons (Uberbacher and Mural, 1991). Other programs attempt to link the exons
together into genes, such as Grailgap2 (Uberbacher et al, 1996), Fgene (Solovyev et al,
1995) or Genscan (Tiwari et al, 1997). Programs such as Grail use a neural network to
combine a number of tests that try to identify patterns characteristic

discrimination method, classifying the sequence on one or more observed sequence

patterns, as used by programs such as HEXON, FGENEH and MZEF.

2-1-2-2 Homology searching:

Homology to already published genomic or cDNA sequences is a powerful gene
identification method. This relies first and foremost on up-to-date, well curated
databases. The first sequences to be collected were of proteins, in the wake of the
development of protein sequencing methods (Sanger and Tuppy, 1951). This was the work
of Margaret Dayhoff (Dayhoff et al, 1972, 1978) and collaborators at the National
Biomedical Research Foundation (NBRF), resulting in the Protein Information Resource
(PIR). One particularly carefully annotated protein database is Swissprot.

DNA sequence databases were first assembled at Los Alamos National Laboratory (LANL)
by Walter Goad and colleagues in GenBank and at the European Molecular Biology
Laboratory (EMBL) in Heidelberg. GenBank is now under the auspices of the National
Center for Biotechnology Information (NCBI).

Searching these extensive databases required fast, efficient similarity searching
programs. One of those is FASTA (Pearson and Lipman, 1988), which breaks down the
sequences into short words a few characters long, and those words organised into

both sequences, particularly if several words can be joined, the sequences must be

similar in those regions.

An even faster program is BLAST (Altschul et al, 1990), which also prepares a table of
short sequences for each sequence, but additionally determines which of these words are
most significant such that they are a good indicator of similarity in both sequences
and then confines the search to these words. BLAST can search with protein or nucleic
acid sequences. BLASTX will translate a nucleic acid sequence to search a protein
database, TBLASTN searches a translated DNA database with a protein query sequence and
TBLASTX will search translated DNA against translated DNA.
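The word table idea shared by FASTA and BLAST can be sketched as follows. This is a deliberately simplified illustration, not an implementation of either program: it indexes every word of length k in one sequence, looks up the words of the other, and counts hits per diagonal (subject position minus query position). Several hits on the same diagonal indicate a region where words can be joined, that is, a candidate region of similarity. The function names and the default word length are assumptions.

```python
from collections import Counter, defaultdict

def word_table(seq, k):
    """Map each k-letter word in seq to the list of positions where it starts."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i)
    return table

def shared_word_diagonals(query, subject, k=4):
    """Count shared k-letter words per diagonal (subject_pos - query_pos)."""
    table = word_table(subject, k)
    diagonals = Counter()
    for q in range(len(query) - k + 1):
        for s in table.get(query[q:q + k], ()):
            diagonals[s - q] += 1
    return diagonals
```

For a query that occurs intact in the subject, nearly all hits fall on a single diagonal, which is what makes the lookup a fast first filter before any detailed alignment is attempted. The significance weighting of words that BLAST adds is not modelled here.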

BLAST is an immensely useful tool, revealing what novel sequences most closely
resemble, thereby aiding in their characterisation. Additionally, it reveals discrete
sections of alignment between a cDNA and genomic sequence and positions them, thereby
consolidating the results of the exon predictions.

This latter function is also performed by programs such as est2genome (R. Mott,
unpublished data), Sim4 (Florea et al, 1998) or Procrustes (Gelfand et al, 1996), which
attempt to determine the exon/intron structure of a gene by aligning the cDNA sequence
to the genomic.

All these programs can be run within NIX (Williams and Hingamp, unpublished data) at
http://www.hgmp.mrc.ac.uk. NIX is a web-based interface, which runs exon prediction
tools and BLAST homology searches on repeat masked sequence, as well as CpG island,
promoter and polyadenylation site predictions, and is a powerful front line gene
identification tool. The results are presented in an interactive graphical

To put matters into perspective, however, one should observe that, in the analysis of
the now complete Caenorhabditis elegans genome, around half of the genes originally
predicted by GeneFinder have no assigned function or known homology (The C. elegans
sequencing consortium, 1998). Clearly, exon predictions and homology searches alone
are not going to provide all the answers.

2-1-3 Identifying genes by comparative sequence analysis

Comparing the genomic sequences of syntenic regions from different species is a
powerful way to detect genes, by making use of the high conservation of coding
sequence between species. A number of genomes have been fully sequenced. They include
Saccharomyces cerevisiae (Goffeau, 1996; The yeast genome directory, 1997),
Schizosaccharomyces pombe, Caenorhabditis elegans (The C. elegans sequencing
consortium, 1998), Drosophila melanogaster (Adams, 2000) and Arabidopsis thaliana
(Lin, 1999; Mayer, 1999; Salanoubat, 2000; Tabata, 2000; Theologis, 2000).

The mouse genome is also being completely sequenced. Long range comparative studies
had strengthened the case for sequencing the mouse genome in order to assist gene
annotation and regulatory element identification (Hardison et al, 1997). The Celera
corporation have created an assembly from three different strains of mouse (129X1/SvJ,
A/J and DBA/2J) while the international mouse genome sequencing consortium are
sequencing the C57BL/6 genome from mapped RPCI-23 and RPCI-24 BAC clones, having first
made available 40 million C57BL/6 whole genome shotgun

specific interest are being sequenced from the same libraries by the UK mouse
sequencing consortium. All the publicly available mouse genomic data are being
integrated into mouse Ensembl (http://mouse.ensembl.org).

Interestingly, the Celera corporation report a level of single nucleotide
polymorphisms of one every 14 kb within the 129/SvJ strain (Mural, 2001) whereas none
have been found between the C57BL/6 whole genome shotgun and BAC sequences (North,
2001). The genetic variability among 129 sub-strains has been documented (Simpson et
al, 1997). The view has been expressed, however, that this reported incidence of SNPs
within a strain seems abnormally high (Eric Lander, 2001 Genome Sequencing and Biology
conference, Cold Spring Harbor). This may reflect the fact that whole genome shotgun
assemblies are at the mercy of repeats, some of which may not yet have been identified
and may cause mis-assemblies.

There is therefore a vast scope for comparative genomic sequence analysis. Fugu

rubripes is one organism which has been extensively used for these studies.

2-1-3-1 The Fugu model:

The Fugu rubripes genome is made up of 400 million bases and is therefore 7.5 times
smaller than the human genome (Hinegardner, 1968; Brenner, 1993). Its gene complement,
however, is thought to be very similar to that of higher vertebrates (Elgar, 1996). The
genome contraction was recently confirmed by the analysis of 199 introns in 22
orthologous genes, showing an eight-fold average size reduction

throughout the genome, particularly in view of the uneven distribution of genes in the
human genome (Deloukas et al, 1998).

Although its evolutionary divergence from man is great, at 400 million years, there are
convincing examples of conserved synteny and very strong evidence that almost all
genes are conserved. One early example was the identification of the Fugu orthologue
of the gene responsible for Huntington's disease, the HD gene (Baxendale et al, 1995).
In this case, all 67 exons and the exon/intron boundaries are conserved, the overall
identity of exons being 69% at the DNA level. The CAG repeat whose expansion is the
cause of Huntington's disease is also conserved in Fugu.

In view of the compact nature of its genome, the Fugu model has often been seen as an
aid to positional cloning strategies. One study demonstrated that three genes linked to
FOS in the familial Alzheimer disease locus (AD3) on human chromosome 14q24.3 have
homologues in the Fugu genome adjacent to Fugu cFOS. The three genes FOS, S31iii125 and
S20i15 are conserved in the same order in the two genomes but occupy 600 kb in man
whereas they take up 12.4 kb in Fugu rubripes. This highlighted the utility of the Fugu
model for sequence-based identification of genes in human disease loci (Trower et al,
1996).

Similarly, the WAGR region (11p13) (Wilms' tumour, Aniridia, Genitourinary
abnormalities and mental Retardation - Francke et al, 1979) shows gene order
conservation and contraction (Miles 1998). The region contains three genes, Wt1
(Wilms tumour), reticulocalbin (Rcn) and Paired box protein PAX-6 (Oculorhombin) or
PAX6. Wt1 and PAX6 are approximately 750 kb apart in human and mouse, yet the
equivalent Fugu region is contained within 90 kb and contains the three human

Similarly, a region of human Xp22.2-p22.1 containing 9 genes in a 600 kb stretch shows
conserved gene order in a 68 kb Fugu interval, with one inversion (Brunner et al,
1999). In another example of the compact nature of the Fugu genome, the 148 kb region
around the Fugu wnt1 locus contains 15 genes, showing an extremely high gene density
(Gellner and Brenner, 1999).
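The degree of contraction implied by these examples can be made explicit with a little arithmetic. The sketch below uses only the interval sizes quoted in this section (in kb) and the genome-wide figure of 7.5; the dictionary layout and function name are illustrative assumptions.

```python
# Interval sizes in kb (human, Fugu) as quoted in the text above.
REGIONS = {
    "AD3 / FOS (Trower et al, 1996)": (600.0, 12.4),
    "WAGR, Wt1 to PAX6 (Miles 1998)": (750.0, 90.0),
    "Xp22.2-p22.1 (Brunner et al, 1999)": (600.0, 68.0),
}

GENOME_WIDE_RATIO = 7.5  # human genome size / Fugu genome size, from the text

def compaction(human_kb, fugu_kb):
    """Return how many times smaller the Fugu interval is than the human one."""
    return human_kb / fugu_kb

def compaction_table(regions=REGIONS):
    """Return {region: fold compaction rounded to one decimal place}."""
    return {name: round(compaction(h, f), 1) for name, (h, f) in regions.items()}
```

On these figures the WAGR and Xp22 intervals are compacted roughly in line with the genome-wide ratio, whereas the AD3 interval is compacted far more strongly, which is consistent with the contraction not being uniform across the genome.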

Gene order is not, however, always preserved. The human region on 9q34.2-q34.1
(2-4 Mb) containing the Surfeit genes shows extensive rearrangements with regard to the
Fugu genome (Gilley and Fried, 1999). A three species study comparing a human region on
19p12-13 (225 kb) containing 7 genes to the equivalent mouse and Fugu regions found
synteny preserved between human and mouse, but found three of the genes to be missing
from the Fugu region (Clarke et al, 2001).

A novel transcription factor gene, SLP-1, a member of the stem cell leukaemia (SCL)
gene family, was identified by screening the Fugu cosmid library by hybridisation and
sequence scanning (Gottgens, 1998). The full sequence of the cosmid revealed a gene
organisation similar to that of the other gene family members. Expression pattern
analysis using RT-PCR and Northern blot confirmed the validity of the gene prediction,
the expression contrasting with that of the other gene family members. Preliminary
Southern blot data suggest the presence of a mammalian homologue (unpublished data);
however, the novel gene may not have a mammalian homologue, as evidence suggests this
is sometimes the case (Aparicio et al, 1997). This controversy arises from observations
relating to the HOX cluster genes. These encode DNA binding proteins that specify fate
along the anterior-posterior axis of bilateral animals (Krumlauf R., 1994;
Kmita-Cunisse M. et al., 1998; Kourakis M. et al., 1997; Lewis E.B., 1978), the order
of the HOX genes reflecting the order they act

HOX cluster and little axial diversity while tetrapods have four clusters and
substantial axial complexity (Garcia-Fernandez J. and Holland P.W.H., 1994; Di Gregorio
A. et al., 1995). Tetrapod clusters arose by duplications of an ancestral cluster
containing 13 genes (Holland P.W.H. and Garcia-Fernandez J., 1996; Sharman A. and
Holland P.W.H., 1998). Although it is widely assumed that all vertebrates have four HOX
clusters, initial studies of teleosts revealed unexpected HOX genes (Aparicio S. et
al., 1997; Misof B. and Wagner G., 1996; van der Hoeven et al., 1996; Prince V., 1998;
Eiken et al., 1987; Njolstad P.R. et al., 1988; Njolstad P.R. et al., 1988; Njolstad
P.R. et al., 1990; Runstadler J.A., 1991; Sordino P., 1996; Prince V., 1997). Genetic
mapping studies in the zebrafish (Danio rerio) indicate the presence of seven HOX
clusters. These results, combined with phylogenetic studies, suggest two rounds of HOX
chromosome duplications (possibly whole genome duplications) before the divergence of
ray-finned and lobe-finned fish lines and one more in ray-finned fish before teleost
radiation. Therefore, in the case of HOX genes, teleosts have more genes than mammals,
despite less complexity in the anterior-posterior axis (Amores A. et al., 1998).

Studies of gene families have so far not yielded evidence of large scale duplications
in Fugu, although a large number of rearrangements is probable (4000-16000 reciprocal
translocations or inversions) (McLysaght, 2000).

The Fugu rubripes Landmark project (Elgar et al, 1999), as mentioned in section

1-2-1-2, provides, as a public resource searchable by keyword or sequence on the world

wide web, a sequence scan covering 6% of the genome

(http://www.hgmp.mrc.ac.uk/Fugu). The high degree of gene conservation

experienced in Fugu, combined with the compactness of the genome, introns are short

by sequence scanning. A far greater proportion of the genome is coding (17.4% as
opposed to less than 3% in man) with a low incidence of repeats (less than 10%), and
the gene density is higher (one gene every 6-7 kb rather than every 80 kb in man).

This resource was used in an attempt to quantify the amount of conserved synteny
between man and Fugu: a study of gene linkage was carried out, based on sequence
scanned cosmids bearing more than one gene. The assessment of conserved synteny was
based on whether genes thus known to be linked in Fugu were also linked in man. Synteny
was found to be conserved for 40-50% of genes (McLysaght, 2000).
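The logic of such an assessment can be sketched as follows. This is a hypothetical illustration only, not the published method: "linked in man" is approximated here as "located on the same human chromosome", pairs are scored rather than individual genes, and the function name and inputs are assumptions.

```python
def conserved_synteny_fraction(fugu_pairs, human_chromosome):
    """Fraction of Fugu-linked gene pairs whose orthologues share a human chromosome.

    fugu_pairs: iterable of (gene_a, gene_b) found on the same Fugu cosmid.
    human_chromosome: dict mapping gene name to human chromosome label.
    Pairs with an unmapped member are skipped rather than counted as breaks.
    """
    scored = conserved = 0
    for gene_a, gene_b in fugu_pairs:
        chrom_a = human_chromosome.get(gene_a)
        chrom_b = human_chromosome.get(gene_b)
        if chrom_a is None or chrom_b is None:
            continue  # no mapped human orthologue: pair cannot be scored
        scored += 1
        if chrom_a == chrom_b:
            conserved += 1
    return conserved / scored if scored else 0.0
```

A stricter definition of linkage, for instance a maximum physical distance between the human orthologues, would give a lower figure, so the choice of criterion needs to be stated alongside any percentage.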

The Fugu genome has recently been draft sequenced by the Fugu Genome Consortium.
Indeed, an assembly of a 6-fold whole genome shotgun has recently been released (25th
October, 2001). This sequence complements the public resource of the Fugu landmark
project and is available on the same web site.

Another teleost genome, that of Tetraodon nigroviridis, is also being sequenced,
currently to a depth of 4.5 fold coverage, as a joint effort by the Genoscope and
Whitehead Institute (Roest Crollius, Cold Spring Harbor/Wellcome Trust bioinformatic
conference, August 2001). As the two organisms are relatively close in evolutionary
terms, useful comparisons will be possible.

2-1-3-2 The mouse model

A number of comparative genomic sequence analyses have been carried out using the
mouse model. Briefly, they comprise the Btk locus on the mouse and human X chromosomes
(Oeltjen et al, 1997), a gene-rich cluster on human 12p13 and mouse 6 (Ansari-Lari et
al, 1998), the mnd2 region on human 2p13 (Jang et al, 1999), the ADA region (Brickner
et al, 1999), the T-cell receptor locus (Koop and Hood, 1994)

In a recent mouse comparative study, a comparison of the 600 kb Bare patches (Bpa)/
Striated (Str) region was undertaken (Mallon et al, 2000). The Bare patches and
Striated phenotypes were possible models for two X-linked lethal conditions:
Incontinentia pigmenti and Chondrodysplasia punctata. Comparative analysis facilitated
the identification of the Bpa and Str gene, Nsdhl (Liu et al, 1999). A further eight
genes were identified in mouse and human sequence, not found previously by exon
trapping or cDNA selection. These include a member of the melanoma antigen gene family
(Magea9), two novel members of the X-linked lymphocyte-regulated family (xlr4 and xlr5)
and a zinc-finger gene (Zfp275). Mouse and human sequence comparison analysis
identified at least four potential additional genes on the basis of evolutionary
conservation.

Large stretches of close sequence identity were identified in intergenic regions where
no exons are predicted and no homology detected using BLAST. In order to set criteria
for the study of these evolutionary conserved regions (ECR), the degree of conservation
in coding exons, 5' UTRs and 3' UTRs was plotted. From these, two categories were
defined: ECRs greater in length than 50 bp and with a percentage identity greater than
80%, and ECRs longer than 50 bp and with a percentage identity higher than 70%. This
filter greatly reduced the noise, with 35 of the 80% category and 116 of the 70%
category ECRs detected. Of these, four were potentially transcribed regions.
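The two ECR categories can be expressed as a simple filter over aligned blocks. The sketch below assumes that each candidate block is supplied as two aligned strings of equal length (gaps written as "-") and that a block meeting the 80% criterion is reported only in that more stringent category; both points are assumptions, as are the function names.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns with identical, non-gap bases."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        return 0.0
    matches = sum(
        1
        for x, y in zip(aligned_a.upper(), aligned_b.upper())
        if x == y and x != "-"
    )
    return 100.0 * matches / len(aligned_a)

def ecr_category(aligned_a, aligned_b, min_length=50):
    """Return "80%", "70%" or None for one aligned block."""
    if len(aligned_a) <= min_length:
        return None  # the text requires a length greater than 50 bp
    identity = percent_identity(aligned_a, aligned_b)
    if identity > 80.0:
        return "80%"
    if identity > 70.0:
        return "70%"
    return None
```

Applied to every conserved block outside annotated exons, a filter of this kind yields the two lists of ECRs whose sizes are quoted above.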

Using RT-PCR analysis these can be found to be expressed in various tissues and
therefore it can be argued that comparative sequence analysis can help discover novel

2-2 Gene regulation

In order to understand the challenge faced in determining how genes function, it is

useful to revisit how genes are regulated.

2-2-1 Access to the genome

In the non-dividing nucleus, DNA presents itself in the form of heterochromatin,
tightly packaged in chromatin and inaccessible to DNA binding proteins, and
euchromatin, which is less compact, allowing entry to expression proteins. Telomeres
and centromeres are permanently packaged and sometimes termed constitutive
heterochromatin. Portions of the genome are included in the heterochromatin for part of
the cell cycle and are thought to contain genes that are active only in some cells for
part of the cell cycle: this makes up the facultative heterochromatin.

Heterochromatin tends to be concentrated at the edge of the nucleus. Euchromatin is
dispersed throughout the nucleus, and while it is associated with chromatin, it is less
tightly packaged, the DNA forming loops attached to the chromatin fibre via AT-rich
segments known as MARs (matrix associated regions) or SARs (scaffold associated
regions) (Brown, Genomes). It is thought that the DNA in these loops is readily
accessible to expression proteins and contains active genes. The maintenance of this
structure is, at least for some domains, under the control of the locus control region
(LCR). Loss of this function can cause disease, as mutations in the β-globin LCR give
rise to thalassemia (Wolffe, 1995). The fine structure of chromatin can also have a
marked effect, in that repositioning of nucleosomes can fine tune gene

polymerase (Wade, 1997). One such mechanism is histone acetylation, the other
chromatin remodelling (Cairns, 1998).

MARs and SARs are considered to be an indicator of the presence of expressed genes.
Although the sequence of only a small number of such regions has been determined,
several characteristic sequence patterns have been identified. The program MAR-FINDER
(http://www.ncgr.org/MarFinder/) searches for sequences that have a high representation
of such sites in genomic DNA (Singh et al, 1997).

Silencing of parts of the genome is also achieved by means of methylation of cytosine
residues. Transposable elements and repetitive sequences are typically hypermethylated
(Bender, 1998). Silencing of one of a pair of genes in genomic imprinting is also
achieved by methylation (Jaenisch, 1997), as is X inactivation, the mechanism by which
one of the female X chromosomes is inactivated under the control of the Xist gene.
Methylation patterns are preserved in DNA replication in an enzymatic process known as
maintenance methylation. Housekeeping genes which are expressed in all tissues have
unmethylated CpG islands close to them and tissue specific genes have unmethylated CpG
islands only in the tissues where they are expressed. It has been shown that
methylation is associated with reduced levels of histone acetylation and therefore
induces chromatin formation in methylated regions (Eden, 1998).

2-2-2 Transcription

In eukaryotes there are, in addition to the core promoters, at which site the initiation

In the absence of an upstream promoter element, assembly of the initiation factor will
take place very inefficiently.

Different genes will be transcribed by different RNA polymerases and each of the three
eukaryotic polymerases has a different type of promoter.

RNA polymerases attach to the core promoter directly or via proteins to form the
initiation complex.

Some eukaryotic repressors may compete with transcription activators for the same
binding site but they have not been comprehensively studied in eukaryotes.

Regulation of transcription factors is achieved either by controlling their synthesis
or their ability to activate or repress transcription. This is achieved either by
extracellular signal compounds entering the cell (direct activation) or binding to cell
surface receptors which transduce the signal to the cell interior (indirect
activation).

As the upstream promoter elements and the transcription factor binding sites are not

clearly correlated with the regions exerting transcription control, it is difficult to

identify them. Furthermore, although consensus DNA binding sites for transcription

factors do fall into categories such as CAAT box, GC box for example (Twyman

1998), upstream promoter sequences are far more difficult to characterise and

therefore more difficult to search for in one genomic sequence alone.

It is hoped that comparative sequence analysis strategies can play a role in identifying

them.

2-2-3 Identifying regulatory elements using bioinformatic tools

A number of programs have been written with the aim to detect Transcription Factor

of initiation of transcription. GrailII (Uberbacher and Mural, 1991; Uberbacher, 1996)
makes use of neural networks based on scores of characteristic sequence patterns and
composition. The TRANSFAC programs and TRANSFAC database (Heinemeyer et al, 1999;
Klingenhoff et al, 1999) provide searches for TF binding sites (http://www.gsf.de):
MatInd for making scoring matrices and MatInspector for searching for matches to
matrices. A great deal can be learnt from the identification of conserved non-coding
sequences (CNS). Tools such as Dotter (Sonnhammer and Durbin, 1995) allow for the close
comparison of two genomic sequences. Being able to visualise conservation in the
context of the annotated sequence is an obvious advantage, as is the possibility of
aligning multiple sequences. Two programs are available on the web: PIP (Schwartz et
al, 2000) (http://nog.cse.psu.edu/pipmaker/) and Vista (Dubchak et al, 2000; Mayor et
al, 2000) (http://www-gsd.lbl.gov/vista). In addition, the regulatory Vista program
(rVista) (Loots et al, 2001) will combine sequence identity and TRANSFAC search, as
will the Theatre Program (http://www.hgmp.mrc.ac.uk) (Edwards, unpublished data).

2-2-4 Identifying regulatory elements by comparative sequence analysis

In a large scale comparison of the human and mouse Bruton's tyrosine kinase (BTK)
region, four clusters of potential transcription factor-binding sites were identified
within evolutionary conserved regions by searching the TRANSFAC database. Transient
expression analysis in relevant cell lines indicated that the conserved

In the three way comparison of the WAGR region, where a megabase of human, a megabase
of mouse C57BL/6 and 90 kb of Fugu finished sequence can be aligned, known regulatory
elements can be identified by sequence conservation. The comparative study of PAX6 had
already yielded such examples (Miles et al, 1998). Two clusters were found 5 kb
upstream of the 5' end of the gene, while a third, highly conserved region in intron 4
was a proposed retinal specific enhancer (Plaza et al, 1995) and a fourth was found in
intron 7. The availability of the mouse sequence and the extension of the Fugu sequence
reveal the presence of large stretches of highly conserved sequence, including a
sequence of around 500 bp, corresponding to a known enhancer, which is conserved at
around 95% identity in the mouse and at around 60-75% in Fugu.

In the search for long range regulatory elements, a different strategy was employed in
the case of Interleukin 4, 13 and 5 (Loots et al, 2000). In a comparison of a megabase
of human and mouse genomic sequence, 90 conserved non coding sequences (CNS) stretching
over 100 bp with greater than 70% identity were identified. Fifteen were selected for
further analysis. Degenerate primers were designed with the view of assessing their
copy number in human and their presence in other vertebrates as well as chicken and
Fugu. Ten were found in at least two further species and 12 appeared to be unique in
human, as confirmed by low stringency Southern blot. The largest CNS, 401 bp in length
and situated in the intergenic region between IL-4 and IL-13, was amplified in all
species, although bearing no homology in chicken and Fugu. In CNS-1 transgenic mice
studies, it was found that the absence of CNS-1 reduced the expression of IL-4 by half
and IL-13 by two thirds, IL-5 being also affected. This strategy was therefore
successful in identifying a long range regulatory element which

TABLE OF CONTENTS

LIST OF FIGURES AND TABLES