Serial analysis of genes expressed in normal human glomerular mesangial cells


Serial Analysis of Genes Expressed in Normal Human Glomerular Mesangial Cells

A thesis presented for the degree of Doctor of Philosophy by

James Maxwell Wilkinson

UNIVERSITY COLLEGE LONDON University Of London

Royal Free & University College London Medical School Department of Medicine

Centre for Nephrology Royal Free Campus

ROWLAND HILL STREET LONDON


ProQuest Number: 10016117


Published by ProQuest LLC (2016). Copyright of the Dissertation is held by the Author.


Abstract

Advances in sequencing-based genomics, such as the Human Genome Mapping Project (HGMP), have meant that the majority of the estimated human genes have been at least partially sequenced. The variation in expression of a set of essentially identical genes will provide information on the molecular basis of phenotype. Serial analysis of gene expression (SAGE) is based on the ability to assign an individual transcript to a ten base pair ‘tag’, and on technology that facilitates rapid sampling of such tags.

Glomerular mesangial cells (MC) are considered to play a major role in the development of renal disease, and in vitro culture of MCs has become a model system with which to study the molecular mechanisms of glomerular pathology. To this end, a SAGE project was undertaken to identify genes expressed in normal human mesangial cells (NHMC).

Primary normal human mesangial cells were cultured for periods of up to 96 hrs. A total of 46,219 tags were sampled (14,953 unique tags). Tags were mapped to 20,382 sequences. Of these, 79% of tags mapped to characterised cDNAs and 16% mapped to ESTs; 5% of tags failed to match any database entry. The most abundant tags mapped to ribosomal genes or genes associated with the cytoskeleton. Represented in the top ten tags were the matricellular genes transgelin (1.2%), SPARC (1%), type IV collagen (0.5%) and fibronectin (0.53%), which supports the notion that the MC is a producer and re-modeller of the glomerular extracellular matrix (ECM). The contractile nature of MC was apparent from the high abundance of contractile proteins such as myosins and tropomyosins.
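The tag concept and the relative abundances quoted above can be illustrated with a minimal sketch. It assumes, as in the standard SAGE protocol, that a tag is the ten bases immediately 3' of the last NlaIII site (CATG) in a cDNA. The function names `extract_tag` and `tag_abundance` and the example sequence are hypothetical and serve only to illustrate the idea; they are not the analysis software used in this thesis.

```python
from collections import Counter
from typing import Dict, Iterable, Optional

ANCHOR_SITE = "CATG"  # NlaIII recognition site (assumed anchoring enzyme site)
TAG_LENGTH = 10       # ten base pair tag, as described in the abstract


def extract_tag(cdna: str) -> Optional[str]:
    """Return the 10 bases following the 3'-most CATG, or None if unavailable."""
    seq = cdna.upper()
    pos = seq.rfind(ANCHOR_SITE)  # last (3'-most) anchoring site
    if pos == -1:
        return None  # no anchoring site, so no tag can be generated
    start = pos + len(ANCHOR_SITE)
    tag = seq[start:start + TAG_LENGTH]
    # A site too close to the 3' end yields a truncated, unusable tag.
    return tag if len(tag) == TAG_LENGTH else None


def tag_abundance(tags: Iterable[str]) -> Dict[str, float]:
    """Convert a list of sampled tags into fractional abundances."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}


# Hypothetical cDNA fragment containing two CATG sites; only the last is used.
example_tag = extract_tag("GGCATGAAACCCGGGTTTCATGACGTACGTACGGAAAA")
```

On this basis, a tag reported at 1.2% of a library is simply one whose count divided by the total number of sampled tags equals 0.012.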


Acknowledgements

The work described in this thesis was carried out in the Centre for Nephrology at the Royal Free Hospital in London and was in part supported by a grant from the British Diabetic Association (Grant No. BDA: RD98/0001854). To begin, thanks must go to the Centre Director and my supervisor, Prof. Steven H Powis, who first brought the method of SAGE to my attention, suggested I look into it, and then provided me with the opportunity to do so.

The other members of the Centre for Nephrology also require acknowledgement for assistance at the Royal Free Hospital, continually answering endless questions regarding the locations of various pieces of equipment and reagents and the proofing of manuscripts as they emerged. Also requiring acknowledgement are the departmental secretaries who assisted with all my office requirements.

Heartfelt thanks must also go to members of my family, friends and especially my partner, who are all as relieved as I am that this particular project is ending. I could not have wanted more support from any of them, and indeed only ever received encouragement.

Table of Contents

TITLE PAGE
ABSTRACT
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
FIGURES
TABLES
EQUATIONS
ABBREVIATIONS

1 INTRODUCTION
1.1 The Human Genome Mapping Project (HGMP) & Functional Genomics
1.1.1 Estimating the Complement of Genes in the Genome
1.2 The Three Molecular Levels of a Cell
1.2.1 The Genome
1.2.2 The Transcriptome
1.2.3 The Proteome
1.3 A Dynamic Link between Genotype and Phenotype
1.4 The Study of Transcriptomes
1.4.1 Differential Display
1.4.2 Micro Arrays and Gene Chips
1.4.3 In silico mining
1.4.4 Serial Analysis of Gene Expression
1.5 Using SAGE to Investigate the Transcription of Genes
1.5.1 Analysing Transcriptomes
1.5.2 Summary of Transcriptome Analysis
1.6 The Pathology and Progression of Diabetes Mellitus in Target Organs and Tissues
1.6.1 Diabetes and the Eye
1.6.2 Diabetes and the Nervous System
1.6.3 Diabetes and the Vasculature
1.6.4 Diabetes and the Kidney
1.7 Mechanisms of Hyperglycaemic Stress
1.7.1 Polyol Pathway
1.7.2 Hexosamine Pathway
1.7.3 AGE Formation and Persistence
1.7.4 PKC Activation
1.7.5 Oxidative Stress
1.7.6 The Mechanisms for Hyperglycaemia Induced Pathology Remain Complex
1.8 Glomerular Apparatus in DM
1.8.1 The Glomerulus
1.9 Glomerular Mesangial Cells
1.9.1 Functions of Mesangial Cells
1.9.2 Histology of Mesangial Cells
1.9.3 MC and the Mesangial Matrix (mECM)
1.10 Mechanisms of Mesangial Cell Dysfunction
1.10.1 Cellular Factors
1.10.2 Extracellular Matrix Factors
1.10.3 The Cell-Cycle
1.10.4 Signal Transduction
1.10.5 Metabolic Mechanisms
1.10.6 Mechanical Strain
1.10.7 Transcription of genes
1.10.8 Current Hypothesis of Mesangial Cell Contribution to DN
1.11 Hypotheses Underlying this Thesis
1.12 Aims of this Thesis

2 METHODS
2.1 General Overview of Technical Protocols
2.2 Cell Culture
2.2.1 THP-1
2.2.2 NHMC
2.2.3 HMCL
2.2.4 Cryo-Preservation of Cells
2.3 RNA Isolation
2.3.1 Isolation of Total RNA from Cell Suspensions
2.3.2 Isolation of Total RNA from Cell Mono-Layers
2.3.3 Purification of mRNA from TRNA
2.3.4 Preparation of cDNA
2.4 Hybridisation Experiments
2.4.1 Hybridisation to GeneFilters™
2.4.2 Hybridisation to Northern Blots
2.4.3 Hybridisation to Dot Blots
2.4.4 Determining Band or Spot Density
2.5 IMAGE Clones
2.5.1 Identifying IMAGE Clones
2.5.2 Growing Bacterial Culture
2.5.3 Mini-prep Plasmid DNA Isolation
2.5.4 Identifying by Sequence
2.5.5 Size Determination
2.5.6 Molar Concentration
2.6 Serial Analysis of Gene Expression: SAGE
2.6.1 Cleaving the cDNA and Binding to Dynal Magnetic Beads
2.6.2 Creating Specific SAGE Linkers
2.6.3 Ligating Linkers to the 5' cDNA and Releasing Tags
2.6.4 Ligating the Tags to Form Ditags
2.6.6 Ligation of Ditags to Form Concatemers
2.6.7 Cloning the Concatemers
2.6.8 PCR Amplification of Vector Insert DNA
2.6.9 DNA Sequencing
2.7 SAGE Analysis
2.7.1 Mapping Tags and Genes
2.8 Statistics
2.8.1 Correlation Functions
2.8.2 Detecting a Transcript
2.8.3 Detecting a Change in Expression Level
2.8.4 Comparing Means
2.9 Reverse Transcriptase PCR
2.9.1 Real Time RT-PCR
2.9.1.1 Light Cycler
2.9.1.2 ABI 7000 Sequence Detection System
2.9.2 Quantitation of Relative of Gene Transcription
2.10 Digital Northerns

3 PRELIMINARY VALIDATION EXPERIMENTS
3.1 Introduction
3.2 THP-1 Pilot Project
3.2.1 Preliminary Northern blot
3.2.2 Tag Sampling
3.2.3 Mapping Tags to Genes
3.2.4 Differential Gene Expression
3.2.5 Concluding Remarks on the Pilot Project
3.3 Northern Blot Hybridisations
3.4 Dot Blot Hybridisations
3.5 GeneFilter Hybridisations
3.6 Discussion

4 CONSTRUCTION & SAMPLING OF NHMC SAGE LIBRARIES
4.1 Introduction
4.2 Cumulative Sampling of Tags
4.2.1 Tag Sampling Indicates a Complex Population
4.2.2 Each Sub-library has Similar Complexity
4.3 Comparison of SAGE Libraries from other Cells and Tissues
4.4 Combining Sub-libraries
4.5 Tag Frequency Distribution and Probability of Detection
4.5.1 Frequency Distribution
4.5.2 Detecting a Transcript
4.6 Experimental Errors
4.6.1 Efficiency of Tag Generation
4.6.2 Contamination by 5' Anchoring Enzyme Digestion Products
4.6.4 PCR Bias
4.7 Discussion

5 GENERATION & VALIDATION OF THE NHMC TRANSCRIPTOME
5.1 Introduction
5.2 Strategy for Mapping SAGE Tags to Genes
5.2.1 Generation of the Primary NHMC SAGE Transcriptome
5.2.2 Condensing the Primary NHMC Transcriptome into the Secondary Transcriptome
5.2.3 Generation of the Final, Non-Redundant Transcriptome
5.2.4 Summary of Mapping Data
5.3 Validation of SAGE Library as a Catalogue of Transcription
5.3.1 Dot Blots Demonstrate the Presence of Transcripts
5.3.2 Real-Time RT-PCR Quantifies Abundance of Transcripts
5.4 SAGE Tags are Present in Other Libraries at Similar Relative Levels
5.5 Summary of SAGE Validation
5.6 Discussion

6 DESCRIPTION OF GENES WITHIN THE NHMC TRANSCRIPTOME
6.1 Introduction
6.2 The NHMC Transcriptome (2° Transcriptome)
6.2.1 High Abundance Genes
6.2.2 Categories According to Function
6.2.2.1 Prominent Cytoskeleton Genes
6.2.2.2 Prominent ECM Genes
6.2.2.3 Prominent Transcription and Translation Factors
6.2.2.4 Prominent Metabolic Enzymes
6.2.2.5 Prominent Receptors and Antigenic Markers
6.2.2.6 Prominent Cytokines and Cellular Factors
6.2.2.7 Miscellaneous Genes
6.2.3 Genes of Potential Functional Significance in DN
6.2.4 Summary of the NHMC transcriptome
6.3 Tag Anomalies
6.3.1 Ambiguity in Tags
6.3.2 Ambiguity in Genes
6.3.2.1 RTN4
6.3.2.2 CTGF
6.3.3 Summary of Tag Anomalies
6.4 Virtual Northern
6.4.1 Comparing Transcriptomes
6.4.2 Constructing a Virtual Northern
6.4.2.1 Housekeeping Genes (Present in All)
6.4.2.2 Restricted Transcription in NHMCs
6.4.2.3 Genes Present in ‘NHMC-like’ Cells
6.4.3 Summary of the Digital Northern

7 ANALYSIS OF DIFFERENTIAL TRANSCRIPTION
7.1 Introduction
7.2 Real Time RT-PCR Analysis
7.2.1 Tracked Candidate Genes
7.2.2 Summary of Tracked Genes
7.3 Candidate Genes Determined From SAGE Analysis
7.3.1 Primary Comparison
7.3.2 Determining Statistical Significance to Changes in Tag Frequency
7.3.3 Selection of Reliable SAGE Candidates
7.3.3.1 Primary Tags
7.3.3.2 Stable Accumulation of Tags
7.3.3.3 Level of Sampling
7.3.3.4 Resolving the 11th base pair
7.3.3.5 Final List
7.3.4 Real time RT-PCR Analysis to Test Changes in Candidates
7.3.5 Summary on the Analysis of SAGE Determined Candidates
7.4 Candidate Genes from GeneFilter Analysis
7.4.1 Primary Comparison of GeneFilter Signals
7.4.2 RT-PCR Analysis of GeneFilter Candidates
7.4.3 Summary of the GeneFilter Analysis
7.5 RT-PCR Analysis of NHMC & HMCL
7.5.1 Candidates for Comparing HMCL to NHMC
7.5.2 RT-PCR Analysis of HMCL in Glucose Stress
7.5.3 Comparison of NHMC to HMCL
7.5.4 Summary of the Comparison of NHMC and HMCL
7.6 Discussion

8 GENERAL DISCUSSION
8.1 Summary of Results
8.2 Model Systems
8.2.1 Culture Model
8.2.2 Culture Conditions
8.2.3 Validating Culture Protocol
8.2.4 Pure Cell Culture versus Tissue
8.3 Analysing Transcriptomes
8.3.1 SAGE as a Tool to Explore Transcriptomes
8.3.2 Technical Errors in the SAGE Protocol
8.3.3 Experimental Errors in the SAGE Analysis
8.3.3.1 Sampling Errors
8.3.3.2 Sequencing Errors
8.3.3.3 Non-Random DNA
8.3.3.4 Non-Unique Tags
8.3.4 Hybridisation Analysis of Transcriptomes
8.3.5 In silico Analysis of Global Expression Data
8.4.1 Mapping Tags to Genes
8.4.2 Validation of Transcription
8.4.3 The NHMC Transcriptome
8.4.4 Mapping Anomalies
8.4.5 Virtual Northern
8.5 Differential Transcription
8.5.1 GeneFilter Analysis
8.5.2 SAGE Analysis
8.6 Conclusions of Project
8.7 Thesis

REFERENCES
APPENDIX 1. GENE ABBREVIATIONS
APPENDIX 2. PRIMER SEQUENCES & REFERENCE ACCESSION NUMBERS
APPENDIX 3. THP-1 1° TRANSCRIPTOME
APPENDIX 4. NHMC 2° TRANSCRIPTOME
APPENDIX 5. FULL NHMC DIFFERENTIAL LIST

Figures

FIGURE 1.1. Flow chart illustrating the steps involved in hybridisation
FIGURE 1.2. Schema of the steps involved in a SAGE analysis
FIGURE 1.3. The Polyol Pathway
FIGURE 1.4. The Hexosamine Pathway
FIGURE 1.5. Production of AGEs from glucose turnover
FIGURE 1.6. Activation and actions of PKC
FIGURE 1.7. Electron transport in Oxidative Phosphorylation
FIGURE 1.8. The Glomerulus
FIGURE 1.9. Signal transduction pathways present in mesangial cells
FIGURE 1.10. Summary of the pathways implicated in DN
FIGURE 2.1. Schema of the culturing protocol
FIGURE 2.2. SAGE Linkers
FIGURE 2.3 A & B. PCR amplification of ditags
FIGURE 2.4. Concatemers of ditags
FIGURE 2.5. Representative screen of transformants
FIGURE 2.6 A & B. Prior and Posterior Beta pdfs
FIGURE 3.1. Northern Blot of THP-1 RNA probed with GAPDH and IL-1β
FIGURE 3.2 A, B & C. Northern Blot of THBS1 & GLUT1
FIGURE 3.3. Modulation of selected genes based on dot blot data
FIGURE 3.4. GeneFilter GF200 hybridised to 1 µg of labelled first strand cDNA
FIGURE 3.5. Modulation of selected genes relative to LG on GF200
FIGURE 3.6. Density Distribution of Relative Signals from GF200
FIGURE 4.1. The accumulation of unique tags as a function of tags sampled
FIGURE 4.2. Venn diagrammes illustrating the intersection of unique tags
FIGURE 4.3. Graphical representation of linkage between SAGE libraries
FIGURE 4.4. Representation of the position of SAGE tags
FIGURE 5.1. Schema Flowchart of the process used to map tags to genes
FIGURE 5.2 A & B. A manually constructed dot blot hybridised to labelled ssDNA (A)
FIGURE 5.3. Comparison of SAGE and RT-PCR determined abundance
FIGURE 6.1. Tags generated from the RTN4 gene
FIGURE 6.2. Tags generated from the CTGF gene
FIGURE 6.3. Schema outlining construction and output from the Virtual Northern
FIGURE 7.1. Graphs and table of PCR data ‘Tracked Set 1’
FIGURE 7.2. Graph and table of PCR data ‘Tracked Set 2’
FIGURE 7.3. Proportional distribution for all matched tag pairs A & B
FIGURE 7.4 A, B, C. Genes predicted to alter transcription as determined by SAGE

Tables

TABLE 1.1. Internet sites providing bio-informatics and data warehousing
TABLE 1.2. Protein Kinase C isoforms
TABLE 1.3. Cellular factors implicated in DN by association with MC
TABLE 1.4. Modulation of MC matrix proteins in DN or high glucose
TABLE 1.5. Matrix turnover factors and the effect of DN or HG
TABLE 1.6. Gene transcription in response to glucose or in DM
TABLE 2.1. Datasets available for public access at NCBI
TABLES 3.1 (A, B, C, D). Tag frequency distribution for the pilot SAGE library
TABLE 3.2 A & B. Top 10 tags (total matches/redundancies/multiple hits)
TABLE 3.3. Top 20 unique tags (final list and mapping data)
TABLE 3.4. Top 40 differentially expressed tags
TABLE 3.6. Genes differentially regulated as assessed by GeneFilter analysis
TABLE 4.1. General efficiency statistics of the individual SAGE libraries
TABLE 4.2 A & B. The frequency distribution of tags in each sub-library
TABLE 4.3 A & B. Comparing NHMC libraries to independent libraries
TABLE 4.4. Tag distribution in combined libraries
TABLE 4.5. Probability of detection
TABLE 4.5. Gene sequences used to generate the artificial tags
TABLE 4.6. Contamination by anti-sense tags
TABLE 4.7. Examples of linker sequences removed from the SAGE library
TABLE 5.1. Primary NHMC Transcriptome (Top 50)
TABLE 5.2. Secondary NHMC Transcriptome (Top 50)
TABLE 5.3. Non-redundant NHMC Transcriptome (Top 50)
TABLE 5.4. Summary of the complete mapping of tags to genes
TABLE 5.5. Probes used to validate gene transcription in NHMC
TABLE 5.6. Individual hybridisation signal compared to SAGE derived frequency
TABLE 5.7. Genes used to create templates for rtRT-PCR and the primer sequences used
TABLE 5.8. Digital Northern of a selection of housekeeping genes
TABLE 5.9. Top ten tags that initially failed to match in the reliable database
TABLE 6.1. Top 20 tags and corresponding genes extracted from the 2° Transcriptome
TABLE 6.2. Top 10 genes associated with the cytoskeleton
TABLE 6.3. Genes associated with the ECM
TABLE 6.4. Genes associated with translation (ribosomal proteins) and transcription
TABLE 6.5. Top 10 enzymes in the 2° Transcriptome
TABLE 6.6. Top 10 receptors or antigenic markers in the NHMC Transcriptome
TABLE 6.7. Top 10 genes for cytokines and those associated with cell cycle
TABLE 6.8. Top 10 genes not placed in any of the other groups
TABLE 6.9. Genes of functional significance in DN
TABLE 6.10. Tags in NHMC library that map to RTN4 (Hs.65450)
TABLE 6.12. Libraries used in the construction of Virtual Northern database
TABLE 6.13 A & B. Digital Northern of housekeeping genes and ribosomal proteins (RP)
TABLE 6.14. Digital Northern of genes restricted to NHMCs
TABLE 6.15. Digital Northern of genes restricted to cells of mesenchymal origin
TABLE 7.1. Summary of tracked candidates
TABLE 7.2. Primary SAGE determined tags of differential frequency
TABLE 7.3. Summary of reliable SAGE differential candidates
TABLE 7.4. Summary of candidate genes as determined from GeneFilter analysis

Equations

EQUATION 2.1. Molar relationship of DNA
EQUATION 2.2. Beta(a, b)
EQUATION 2.3. Integral of Beta(A+a, B+b)
EQUATION 2.4. Exponential PCR amplification

Abbreviations

General abbreviations used throughout this thesis. A complete list of gene abbreviations is presented in APPENDIX 1.

ABBREVIATION DESCRIPTION

x g (rcf) Relative Centrifugal Force ‘g’

βME Beta Mercaptoethanol

β-NAD Beta Nicotinamide Adenine Dinucleotide

A260 Absorbance at 260nm

AE Anchoring Restriction Enzyme (Nla III)

AGE Advanced Glycation End products

Amp Ampicillin

AMV Avian Myeloblastosis Virus

AR Aldose Reductase

ARI Aldose Reductase Inhibitor

ATCC American Type Culture Collection

BLASTn Basic Local Alignment Search Tool nucleotide

bp Base Pairs

BSA Bovine Serum Albumin

BW Binding & Washing buffer

cDNA Complementary DNA (to mRNA)

CGAP Cancer Genome Anatomy Project

DAG Diacyl Glycerol

DCCT Diabetes Control and Complications Trial

DM Diabetes Mellitus

DMSO Dimethyl Sulfoxide

DN Diabetic Nephropathy

DNA Deoxyribonucleic Acid

dNTPs Deoxynucleoside Triphosphates (dATP,dCTP,dTTP and dGTP)

dsDNA Double Stranded DNA

DTT Dithiothreitol

ECM Extracellular matrix

EDTA Ethylenediaminetetraacetic Acid

ESRF End Stage Renal Failure

ET-1 Endothelin-1

EtBr Ethidium Bromide

FBS Foetal Bovine Serum

FCS Foetal Calf Serum

GAPDH Glyceraldehyde-3-Phosphate Dehydrogenase

GF200 Gene Filter release 200

GSH/GSSG Glutathione (Reduced/Oxidised)

GTE Glucose Tris EDTA buffer

GTT Gene-To-Tag

HBSS Hanks Balanced Salt Solution

HGMP Human Genome Mapping Project

HMCL Human Mesangial Cell Line

IAC Chloroform:Isoamyl Alcohol (24:1)

IDDM Insulin Dependent Diabetes Mellitus

IMAGE Integrated Molecular Analysis of Genome Expression

kb Kilo basepairs

LB Luria Broth

LPS Lipopolysaccharide

MC Mesangial Cell

mECM Mesangial Extracellular Matrix

MGO Methylglyoxal

MMV Moloney Murine Leukemia Virus

mol Amount of Substance (SI)

MOPS 3-(N-Morpholino)propanesulfonic acid

mRNA Messenger RNA

MW Molecular Weight

NCBI National Center for Biotechnology Information

NHMC Normal Human Mesangial Cells

NIDDM Non Insulin Dependent Diabetes Mellitus

NOS Nitric Oxide Synthase

oligos Oligonucleotides

P/IAC Phenol:Chloroform:Isoamyl Alcohol (25:24:1)

PAGE Polyacrylamide Gel Electrophoresis

PAI-1 Plasminogen Activator Inhibitor 1

PBS Phosphate Buffered Saline

PCR Polymerase Chain Reaction

PDGF Platelet Derived Growth Factor

PEG8000 Polyethylene Glycol (average MW 8000)

PKC Protein Kinase C

PMA Phorbol 12-Myristate 13-Acetate

PolyA+ RNA RNA with poly A tails

PPP Pentose Phosphate Pathway

R/T Room Temperature

rcf Relative Centrifugal Force

RNA Ribonucleic Acid

ROS Reactive Oxygen Species

rpm Revolutions Per Minute

RT-PCR Reverse Transcriptase PCR

rtRT-PCR Real-Time RT-PCR

SAGE Serial Analysis of Gene Expression

SAGEmap Data files relating SAGE tags to Unigene Clusters

SDS Sodium Dodecyl Sulphate

SNP Single Nucleotide Polymorphism

SSC Saline Sodium Citrate (0.015M Citrate,0.15M NaCl)

STZ Streptozotocin

ssDNA Single Stranded DNA

SSPE Saline Sodium Phosphate EDTA (10mM Phosphate, 150mM NaCl, 1mM EDTA)

SYBR Green I Syanogen Bromide green I

T10E1 Tris EDTA buffer (10mM Tris, 1mM EDTA, pH 8.0)

TAE Tris Acetate EDTA buffer

TAPS Tris-Acetate PCR Buffer (Qiagen )

TBE Tris Borate EDTA buffer

TE Tagging Enzyme (Bsm FI)

TGFβ Transforming Growth Factor beta

THP-1 Human Monocyte Cell Line

TRNA Total cellular RNA

TTG Tag-To-Gene

UTR Untranslated Region

CHAPTER 1. Introduction

1.1 The Human Genome Mapping Project (HGMP) & Functional Genomics

Sequencing the entire human genome will provide a complete catalogue of genes. Such a catalogue will not only contain information on the name and sequence of each gene, but also contain many variations and mutations, and as such can be thought of as the biological equivalent of a chemical periodic table (Fields et al., '94, Lander, '96, Fields, '97). This catalogue of all genes required to define a living organism will also contain a basis of classification for these genes. From its inception in the early 1980s, the task of sequencing the entire human genome was considered a methodical sequencing project that would be completed in 20-25 years, and was designed to reveal three levels of genomic classification, reflected in the three analytical layers of the sequencing project. The first step produced the genetic maps of gene units, the second mapped the physical location of these gene units and the final step would be completing the DNA sequence of the genome (Lander et al., '01).

Together with this methodical approach came the unexpectedly efficient method of random sequencing of cDNA in conjunction with an alternative experimental procedure, the whole genome shotgun (WGS). This process complemented the time-consuming sequencing of hierarchical contigs, concentrating instead on high-throughput single read sequencing of expressed genes easily obtained from cloned cDNA and of genomic fragments produced directly from genomic DNA (Venter et al., '01). These EST (Expressed Sequence Tag) libraries provided a resource for molecular and cellular biologists to use in the ever-expanding search for the functional significance of new genes. This was a useful resource for the dissemination of coding sequences within the genomic DNA, which in vertebrates is remarkably diffuse, with some genes extending over hundreds of kilobases for an mRNA that is only 1.5kb. The result was a draft map created by two alternative methods, one of which was methodical and rational, the other efficient and resourceful.


the already available public data the level of coverage is currently believed to be 94%. The sequence therefore is not complete and still in the initial draft form. Many gaps and ambiguous regions exist within the current data set, and these require resolving, as does the remaining 4% of genomic sequence. Despite this, the HGP will form a valuable archive of data and a resource for future study.

In its current state, the HGP has facilitated discussion in a very general sense. Questions such as the distribution of genes and repeated elements along the chromosomes, the relationships to homologous genes in other species and the allelic differences in populations have all been given insight through the HGP. Additionally, the identification of widely and evenly dispersed single nucleotide polymorphisms (SNPs) means that even individuals within a population may be classified (Sachidanandam et al., '01, Stoneking, '01). However, detailed questions regarding gene expression have been revealed to be more complicated than once thought.

1.1.1 Estimating the Complement of Genes in the Genome

The coding capacity of the genome is an ongoing topic of debate. The window of estimation as the HGP progressed was between 30,000 and 150,000 genes (Lander, '96, Deloukas et al., '98). Following the initial draft of the genome this has now been downgraded and estimates now run to 30,000-40,000 genes (McPherson et al., '01). To date the HGP has described location and sequence for 22,000 of these genes and other mapping projects claim to have data to place 26,000 genes although proprietary licence precludes easy non-commercial access (Claverie, '01).


While these estimations are neither static nor complete, it is becoming apparent that the number of genes is only one mechanism for assessing the genome. The literature is rich with reports of alternative transcripts for the same gene, often the result of multiple transcription start sites, alternative splicing of introns, premature or delayed termination, and transcription artefacts (Montoliu et al., '90, Lin et al., '93, Ayoubi and Van De Ven, '96, Rogaev et al., '97). Indeed the complex nature of the transcriptional complement of the cell is currently an area of intense research and forms the primary experimental approach of this thesis. The HGP has supported the notion of multiple transcripts with data suggesting numerous EST clusters over the same gene. Recently, a report used 700,000 ESTs and assembled them into 15,000 full-length mRNA clusters (Camargo et al., '01). These authors estimated that over 80% of human genes are now at least partially sequenced and that enough ESTs have been compiled to facilitate the building of a scaffold over a gene, which would experimentally close many of the gaps present in the coding regions of the HGP draft sequence. However, this scaffold revealed discrete clustering of ESTs over genes, which suggests transcriptional units rather than discrete genes.

With the advances in sequencing-based genomics and public and private EST libraries, the majority of the estimated human genes have been at least partially sequenced. The next major biological challenge will be to assign a functional significance to this genetic information, so-called functional genomics (Deloukas et al., '98). The hypothesis that the phenotype of a cell is essentially defined by the genes it expresses forms the basis of functional genomics. The dynamic link between the information contained in the cell’s genome and its phenotype provides an opportunity to test this hypothesis and investigate the transcriptional elements associated with changing phenotype.

1.2 The Three Molecular Levels of a Cell

several levels of genomic expression. The expression of a genome is an important and natural extension to the HGP and studies of gene expression are numerous in the literature. Somewhat ignored during the early intensive activity of the HGP is the study of the proteins encoded in the genome. A gene will usually produce a protein (or other non-peptide catalytic unit such as ribozymes), and it is the timely and ordered functioning of the protein that forms the metabolic dimension of the cell.

Therefore, the human genome project represents only the first dimension of the molecular basis of a cell. Beyond the genome is the elucidation of the mechanisms of gene transcription and the entire transcriptional profile of a cell, the transcriptome. The final level will be the complete complement of proteins and their specific roles in the biochemistry of the cell, the proteome. These increasing levels of resolution reflect increasing levels of complexity. The genome may be complicated in the sense that it is a large amount of data, but it is essentially static and thus more readily quantifiable. The transcriptome is a dynamic entity and changes in response to stimuli, differentiation lineage and cellular life cycle. A transcriptome may be considered unique to a cell or cell type within a mass of thousands that constitute the organism. The proteome is by far the most complicated of the three. Not only does the sequence of a protein indicate its function, but the same protein can also be procured, translocated, structurally altered, phosphorylated, glycosylated, secreted and sequestered, all in a specific manner. The functional significance of this ‘protean’ expression, from gene to transcript to functional catalytic unit, creates a truly complex set of data, and increasing complexity leads to greater inaccuracy in measuring the relationships between these levels.

1.2.1 The Genome


Drosophila melanogaster and Caenorhabditis elegans. Over 1.4 million single nucleotide polymorphisms (SNPs) have been identified and their frequent occurrence across the genome should facilitate further mapping of linkage disequilibrium of genes throughout the population (Sachidanandam et al., '01).

1.2.2 The Transcriptome

The second level of complexity is the controlled transcription of the genome within the organism. Transcription of genes from DNA into mRNA is a fundamental factor in gene expression. In the context of the human body, each of the several thousand cell types is believed to have a series of unique patterns of gene expression designed for specific biological function at particular times in their cycle. External and internal factors can modulate the levels of gene expression and lead to altered phenotype associated with physiological differentiation and disease. This variation in gene expression between cell types provides us with the next level of genetic complexity.

Researchers have long recognised the relationship between a gene’s expression and its functional role. This reasoning has led to the initiation and continuing search for differentially transcribed genes in disease states and, more simply, in altered phenotype. The variation in expression of a set of essentially identical genes has provided biology with clues to the functional roles of genes in the cellular context as well as a molecular basis for phenotype. Technical advances have led to several methods for investigating genome-wide expression. Many are based largely on the partial gene sequences derived from HGP and EST libraries. With the development of high resolution fluorescent detection and solid support for hybridisation targets it is now possible to monitor the level of expression of tens of thousands of genes on a single microscope slide (reviewed in Ferea and Brown, '99). Alternatively, one may enrich for differentially expressed genes by selective amplification using anchored redundant sets of PCR primers (Liang and Pardee, '92).


approach is useful in two ways. First, it allows the simultaneous monitoring of many genes within a single experiment and secondly, it has facilitated the annotation of genes not previously associated with the particular model system by virtue of the ability to include them in an analysis. This was evident in the study of fibroblasts and their proliferative response to serum, an in vitro model for the wound healing process. A total of 9000 genes were monitored over several time scales but because the genes present on the array were not limited to those known to be involved in proliferation, a complex pattern relating to the physiology of wound healing was unexpectedly revealed (Iyer et al., '99).

More recently, the clustering of EST data to chromosomal maps has revealed regions of chromosomes that contain differing densities of gene transcription, with some regions particularly rich for highly transcribed genes. Additionally, differences between the transcriptional potential of the chromosomes became apparent. Chromosome 19 displayed an extremely high level of transcription compared to other chromosomes, while chromosome 13 displayed very little transcriptional activity at all. Compiling these maps has revealed novel genes involved in disease (Caron et al., '01).

1.2.3 The Proteome


The knowledge of protein active sites is only a step in the understanding of protein function. A protein may have many active sites and participate in a single biochemical pathway, or only one active site and possess activity in many pathways, as is evident in the complexities of secondary messengers and signal transduction. Proteins can act alone or in concert with many others. The cellular location of a protein can also affect a control mechanism, as with PKC isoforms, only aligning substrates with catalysts at certain times of the cellular cycle.

One particularly useful analysis is determining the total number of distinct protein units in an organism, the ‘core proteome’ (Schuler et al., '96, Ferea and Brown, '99, Mushegian, '99, Ison et al., '00, Persson, '00). While grouping protein paralogs as single units is a crude indicator of complexity, the comparison of so assembled core proteomes has revealed another counterintuitive phenomenon hinted at when comparing the genomes. It appears that despite the clear magnitudes of difference between the single-cell yeast, Saccharomyces cerevisiae, and the metazoan fly, D. melanogaster, there is only a two-fold difference in the core proteome (Rubin et al., '00). When this is added to the hypothesis that the large human genome is a result of multiple duplications, this leads to the idea that there are probably not many more protein members in the human ‘core proteome’ than there are in that of the fly or nematode.


1.3 A Dynamic Link between Genotype and Phenotype

With the near completion of the human genome mapping project, as well as other genome projects of the mouse, nematode, numerous viruses, various plants and invertebrates, a large dataset has been generated and the first step to a greater understanding of the dynamic functioning of the living organism has been achieved (Velculescu et al., '00, Caron et al., '01). The genome represents a structural basis for what is essentially the plan of cellular function. As with most plans its mere presence is not in itself informative of an entire picture but a guide to the arrangement of raw products into functional molecules. The central dogma of molecular biology is that genes in DNA contain the information required for the synthesis of proteins from a relatively small set of nucleotides and amino acid building blocks. The cellular proteins effect their function through catalytic or specific sites before being degraded back to amino acids. Protein turnover is constant through the life of the cell. From this it appears that DNA in itself is a relatively benign cellular molecule, effecting no real function except as a ‘user manual’. However, the information contained in DNA is more complex than simply a code for protein synthesis. Continued study of DNA has revealed that there are controlling mechanisms encoded within the DNA molecule that affect the transcription of genes, the processing of pro-transcripts and even the signalling of specific translation and processing of early proteins.


1.4 The Study of Transcriptomes

Several techniques have been developed which use changes in gene expression to identify and characterise novel genes, or are designed specifically to search for such genes. An understanding of the available technologies and their advantages and disadvantages will benefit the understanding of transcription analysis and provide a basis for the inclusion or exclusion of a specific technique. A brief review of techniques also provides an understanding of the data types that are generated.

1.4.1 Differential Display

Differential display was first described as a tool to identify genes that were differentially regulated by their selective amplification using anchored and random primers in PCR reactions (Liang and Pardee, '92). Products of various sizes were enriched as the genes were ‘turned on’, then resolved on polyacrylamide gels where the induction or suppression of a gene was seen as an appearance or disappearance of product bands. Though initially useful and to a degree reproducible, differential display tended to produce a large number of false positives and was reliant on redundant, anchored oligos that showed non-random annealing and a high contamination rate in the amplification stage (Matz and Lukyanov, '98, Ledakis et al., '98, Frost and Guggenheim, '99). The next generation of techniques based on differential display, e.g. suppressive-subtractive hybridisation, have been used to study the effects of high glucose on cells of renal origin and have successfully identified a number of up-regulated genes. These include CTGF, Thrombospondin-1, a zinc finger homologue and the amiloride sensitive sodium channel as well as a number of unidentified transcripts (Page et al., '97, Holmes et al., '97, Death et al., '99).

1.4.2 Micro Arrays and Gene Chips


bound on a nylon filter and with gene chips tens of thousands of genes are bound on glass slides (Schena et al., '95). Both these methods facilitate the simultaneous monitoring of gene transcripts based on homology and represent the most popular and accessible method of global transcription analysis.

Micro-array and gene chip technology has been used to examine the pathology of many disorders, including diabetes mellitus (DM) (Wada et al., '01). Despite the apparent widespread use of this technology and its application in global expression analysis, there are currently far more reviews describing experiments and possible uses than original research publications. No doubt this will change as the technology advances, though there certainly appear to be problems associated with interpretation of data, especially the inability to standardise platforms across the research community (Ermolaeva et al., '98, Boguski, '99, Scherf et al., '00).

FIGURE 1.1. Flow chart illustrating the steps involved in hybridisation. [Flow chart steps: gene fragments immobilised on a nylon/nitrocellulose filter; isolate and label mRNA from culture/tissue; hybridise to immobilised gene fragments; expose to X-ray film or other imaging technology; scan for density or fluorescence; hybridisation signal ∝ mRNA level.]


1.4.3 In silico Mining

In silico mining is a relatively new term in molecular biology. The literal meaning regards the exploration of genomic, proteomic and transcriptional data within the setting of data storage and retrieval, i.e. bioinformatics. With the advent of molecular techniques in DNA cloning, sequencing and expression, a vast repository of data has amassed, initially nucleic acid and protein sequence (see TABLE 1.1). While the structuring of this data and its retrieval are uppermost in its usefulness, the notion of hypothesis driven research is beginning to be supplanted by in silico mining in attempting to access the information contained in these vast datasets. A primary and current use of in silico mining is the automated searching for coding regions (CDS) and open reading frames (ORFs) in raw genomic sequence data. Such regions are then annotated as containing putative genes and transcription sites within the genome.
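As a rough illustration of what automated ORF searching involves, the following minimal Python sketch scans the three forward reading frames of a DNA string for ATG...stop stretches. It is a toy under stated assumptions, not the annotation software used by any genome project: the function name `find_orfs`, the `min_codons` threshold and the example sequence are hypothetical, and the reverse strand, introns and splice sites are ignored.

```python
# Minimal sketch: forward-strand ORF scan (hypothetical helper, not a real annotation tool).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, min_codons=3):
    """Return (start, end, frame) for each ATG...stop ORF on the forward strand.

    start/end are 0-based, end-exclusive indices; the stop codon is included.
    """
    seq = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the frame one codon at a time.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    example = "CCATGAAATTTTAGGG"  # hypothetical sequence
    for start, end, frame in find_orfs(example):
        print(frame, start, end, example[start:end])
```

Real gene finders add statistical models of coding potential; the sketch only shows why the task lends itself to automation over raw sequence.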

Rana et al. ('01) presented an example of in silico mining. Using only homology searches, the gene structures of adenylate cyclases were determined. Adenylate cyclases are a multigene family consisting of some 9 transmembrane isoforms, ADCY1-9, which catalyse the formation of cAMP from ATP. The cloning of the DNA for this gene family was troublesome because, as inferred from other species, the genes for ADCY contain numerous large introns, which complicate most standard cloning techniques, including PCR and recombinant based technologies. Alignments of homologous sequences using BLASTn at NCBI revealed numerous contigs that were unassigned and unaligned in the HGMP. Further, when fully analysed they revealed the 21 exons of the ADCY gene and confirmed that indeed it was spread over 18.4kb. While such data may have been determined from a cloning project, the in silico experimental approach, or data mining, was conducted with minimum laboratory time and only required access to relatively inexpensive computing equipment and basic knowledge of nucleic acid bioinformatics.


validates the continued efforts in EST library construction and dissemination of data (see TABLE 1.1).

1.4.4 Serial Analysis of Gene Expression

The techniques described above have inherent limitations. Differential display is highly dependent upon the sensitivity of degenerate primers and may identify only a fraction of candidate genes, while micro arrays, gene chips and in silico mining still require the use of known cDNA sequences (Heller et al., '97, Ledakis et al., '98, Frost and Guggenheim, '99, Rana et al., '01). The generation of large EST libraries has proved useful in the analysis of the primary DNA data emerging from the HGMP, but the task of randomly sequencing entire cDNA libraries is laborious, even using current automated standards, and represents a large investment in time and resources.

A relatively new method of transcriptional analysis has been described, serial analysis of gene expression (SAGE), which addresses many of these problems (Velculescu et al., '95). SAGE is based on the mathematical calculation that the genetic code of a 9bp fragment contains sufficient information to discriminate between 4^9 (262,144) individuals. As this represents a several fold redundancy of the human genome, it should be possible to identify any gene transcript from a 9bp sequence. Furthermore, the use of short tag sequences permits their serial incorporation into vectors to facilitate high throughput sequencing (see FIGURE 1.2).
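The tag-capacity arithmetic can be checked with a short sketch. With four possible bases at each position, a tag of n bases can take 4^n distinct values. The function names and the transcript count passed to `redundancy` in the example are hypothetical, chosen only to show how the fold redundancy scales with tag length.

```python
# Minimal sketch of the tag-capacity arithmetic (hypothetical helper names).

def tag_capacity(tag_length):
    """Number of distinct sequences of the given length over the bases A, C, G, T."""
    return 4 ** tag_length


def redundancy(tag_length, n_transcripts):
    """Fold excess of possible tags over an assumed number of distinct transcripts."""
    return tag_capacity(tag_length) / n_transcripts


if __name__ == "__main__":
    for length in (9, 10):
        print(length, tag_capacity(length))
    # 50,000 is an illustrative transcript count, not a figure from the text.
    print(redundancy(9, 50_000))
```

Each additional base multiplies the number of distinguishable tags by four, which is why a one-base difference in tag length matters for unambiguous assignment.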


To date SAGE has been used to successfully study genes expressed in many human cell systems, both in vitro and in vivo, and has amassed some 2.2 million tags, representing 454,836 unique tags, from some 136 SAGE libraries. SAGE data from Arabidopsis thaliana, Mus musculus and Rattus norvegicus libraries are also collected (http://www.ncbi.nlm.nih.gov/SAGE) (see TABLE 1.1).

FIGURE 1.2. Schema of the steps involved in a SAGE analysis. [Schema steps: cells/tissue; mRNA; convert to cDNA and isolate 9bp tags at the 3' NlaIII site; amplify for efficient cloning; clone into plasmid vector and sequence; count tag frequency; tag frequency ∝ mRNA abundance.]

mRNA is isolated from experimental samples and converted to cDNA. Tags of 9-10bp are isolated from points in the cDNA defined by the anchoring restriction enzyme NlaIII (AE). These tags are amplified, concatenated, cloned and sampled by sequencing. The frequency of each tag is proportional to the abundance of the gene it represents. Using this protocol for samples under different experimental conditions allows the identification of genes that are differentially transcribed in response to the experimental stress.
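The tag isolation and counting steps in the legend above can be mimicked in silico. The sketch below is a simplification under stated assumptions: it takes the tag to be the bases immediately 3' of the 3'-most NlaIII recognition site (CATG) in each cDNA, as in the commonly described SAGE protocol, and it ignores ditag formation, concatenation and sequencing error. The function names and example sequences are hypothetical.

```python
# Minimal sketch: derive SAGE-style tags from cDNA strings and count them.
from collections import Counter

ANCHOR_SITE = "CATG"  # NlaIII recognition sequence


def extract_tag(cdna, tag_length=9):
    """Return the tag immediately 3' of the 3'-most CATG, or None if unavailable."""
    seq = cdna.upper()
    pos = seq.rfind(ANCHOR_SITE)
    if pos == -1:
        return None  # no anchoring site, so no tag can be generated
    tag = seq[pos + len(ANCHOR_SITE): pos + len(ANCHOR_SITE) + tag_length]
    return tag if len(tag) == tag_length else None


def count_tags(cdnas, tag_length=9):
    """Tally tag frequencies across a collection of cDNA sequences."""
    tags = (extract_tag(c, tag_length) for c in cdnas)
    return Counter(t for t in tags if t is not None)


if __name__ == "__main__":
    sample = ["AACATGTTTCATGACGTACGTAAAA"] * 2 + ["AAAA"]  # hypothetical cDNAs
    print(count_tags(sample))
```

The resulting counts are the digital read-out: a transcript sampled twice contributes a count of two to its tag.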

TABLE 1.1. Internet sites providing bio-informatics and data warehousing.

Information Web Sites | Platform | Comments
www.ncbi.nlm.nih.gov/SAGE | Serial Analysis of Gene Expression | Database for the entire SAGE data including mapping and digital Northern data.
www.ncbi.nlm.nih.gov/UniGene | cDNA clustering | Creates (builds) gene clusters from submitted gene sequence and EST data.
www.affymetrix.com | Micro array | Current market leader in high density gene ‘chips’ and scanning technology.
www.geneindex.org | Micro array | Microchip data profiling gene expression data in many cell systems.
www.nhgri.nih.gov/DIR/LCG/15K | Micro array | Database for warehousing micro array data together with profiling and mining algorithms.
www.ncbi.nlm.nih.gov/CGAP | Genome annotation | Warehouse for cancer genome anatomy project.
www.msri.oru | Genome annotation | Warehouse for genome annotation.
www.sagenet.org | SAGE | Site providing SAGE technology, references and links to warehouse data.


With the problems of data storage, retrieval and cross platform access faced by the micro-array and gene chip technology, SAGE offers the advantages of both warehousing and in silico mining. The output from a SAGE library is essentially four fields containing the tag sequence, the frequency in the library, the mapping information and some unique tag identifier. In this regard, the SAGE libraries are much simpler than their hybridisation-derived cousins and thus retrieval and cross-library comparisons are also straightforward (Adams, '96, Zhang et al., '97, Velculescu et al., '00). Digital data is based on absolute frequency within a population sample and not on a relatively arbitrary fluorescence or radioactive signal. Many believe the SAGE library is a truer picture of a transcriptional profile than any other current form of array technology, and that assignations of absolute abundance rather than relative abundance are valid. An added advantage of a SAGE library is that library membership is based on abundance and is not dependent on predetermined sequence homology. In this respect the SAGE analysis represents the next generation of bioinformatics, where discovery is based on the presence of an individual rather than an active search for that individual.
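The four-field output described above maps naturally onto a simple record type. The sketch below is one possible representation, not the schema of SAGEmap or of any particular software: the class name `TagRecord`, its field names and the example values are hypothetical. It also shows why cross-library comparison is straightforward, since dividing each count by the library total puts libraries of different sizes on the same scale.

```python
# Minimal sketch: a four-field SAGE tag record and a per-library normalisation.
from dataclasses import dataclass


@dataclass(frozen=True)
class TagRecord:
    tag_id: str    # unique tag identifier
    sequence: str  # tag sequence
    count: int     # frequency of the tag in the library
    mapping: str   # mapping information, e.g. a gene or cluster label


def relative_frequencies(records):
    """Map each tag sequence to its fraction of all tags sampled in the library."""
    total = sum(r.count for r in records)
    if total == 0:
        return {}
    return {r.sequence: r.count / total for r in records}


if __name__ == "__main__":
    library = [  # hypothetical entries
        TagRecord("T1", "ACGTACGTA", 3, "gene A"),
        TagRecord("T2", "TTTTGGGGC", 1, "unassigned"),
    ]
    print(relative_frequencies(library))
```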

1.5 Using SAGE to Investigate the Transcription of Genes

A further use for SAGE libraries is the application to differential analysis. Clearly, if a SAGE library can accurately describe a cellular phenotype, comparison of SAGE libraries constructed in parallel in the same cells, one a control and the other under experimental conditions, will reveal tags and thus genes that are differentially transcribed in the experimental system. The nature of sampled SAGE data is essentially digital and facilitates comparison. Assigning significance to such comparison is more complicated than usual standard statistical techniques, but has been addressed in many SAGE publications and is currently used in the SAGEmap database (NCBI) (Chen et al., '98a, Lash et al., '00). A summary of the statistical methods used is presented in CHAPTER 2. Once a significance factor is used to filter the differential data, straightforward molecular techniques can be applied to verify the data.
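To make the idea of comparing tag counts between a control and an experimental library concrete, the sketch below computes a plain 2x2 chi-square statistic for a single tag. This is only an illustration under a deliberately simple assumption; it is not necessarily one of the methods summarised in CHAPTER 2 or the test used by SAGEmap, and the function name and the counts in the example are hypothetical.

```python
# Minimal sketch: 2x2 chi-square statistic for one tag in two libraries.
# Illustrative only; SAGE-specific significance tests differ from this.

def chi_square_2x2(count_a, total_a, count_b, total_b):
    """Chi-square statistic (1 degree of freedom, no continuity correction).

    count_x is the number of times the tag was seen in library x;
    total_x is the total number of tags sampled from library x.
    """
    a, b = count_a, total_a - count_a  # library A: this tag, all other tags
    c, d = count_b, total_b - count_b  # library B: this tag, all other tags
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # degenerate table, no evidence either way
    return n * (a * d - b * c) ** 2 / denominator


if __name__ == "__main__":
    # Hypothetical counts: 30 of 100 tags in one library, 10 of 100 in the other.
    print(chi_square_2x2(30, 100, 10, 100))
```

A larger statistic indicates a count difference less compatible with equal abundance; with thousands of tags tested at once, any threshold would also need to account for multiple comparisons.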


was a small SAGE library with a low level of sampling (approximately 1000 tags) and described the normal pancreatic profile (Velculescu et al., '95). The second important publication involved the larger application of SAGE to the yeast cell cycle with the express purpose of identifying new genes involved in the yeast cell cycle, together with estimating the complete transcriptome of the yeast cell (Velculescu et al., '97). In these experiments, more than 60,000 tags were sampled, representing 4,500 genes. Over 1,980 genes had been previously characterised while the remaining 2,520 were not. The last of the three original papers involved the large scale sampling of SAGE tags from several cancer cells, primary cells and dissected cancers, amassing some 300,000 tags that were representative of some 45,000 transcripts (Zhang et al., '97). This study identified many genes that had not previously been described in these particular cancers, but also identified the differential transcription of 500 genes occurring in cancer cells, indicating possible causative mechanisms and thus diagnostic markers or therapeutic targets.

The database of SAGE tags has expanded dramatically in the last few years, and the ease with which data can be stored, manipulated and analysed is reflected in the usefulness of the SAGE mapping project, SAGEmap (ncbi.nlm.nih.gov/sage). SAGEmap has compiled the data from all submitted libraries and mapped the tags back to the transcripts they represent, in this case an individual EST and/or mRNA as well as the UniGene cluster they belong to (Lash et al., '00). Original publications using SAGE as a primary technique are increasing rapidly. Identifications of novel disease candidates from SAGE data are also increasing, indicating a powerful tool in molecular analysis. It is likely that investment in SAGE analysis will yield large amounts of data regarding gene expression.

1.5.1 Analysing Transcriptomes


a situation which is at least partially addressed with public access to the databases and software for in silico mining. Micro-arrays provide convenient access to transcriptome analysis and, with the ease of introducing a time dimension, can accelerate a truer dynamic profile of gene transcription. Micro-array construction requires access to expensive robotic infrastructure and is ultimately dependent on hybridisation kinetics and predetermined sequence. Additionally, micro-array data requires some sort of ‘industry standard’ so that data can be shared more easily across the various platforms available.

SAGE offers possibly the highest resolution of transcription profiling to date. The techniques are straightforward, although complicated, and the data is simple and easily manipulated. The storage and retrieval of SAGE data is particularly suited to in silico mining, and transfer between laboratories, models and organisms is remarkably easy. Based on transcript abundance, the discovery of new transcripts will compete with redundant sampling in a SAGE library. Abundant genes are more likely to be well characterised, a phenomenon gleaned from the early mapping projects, but are also more likely to be sampled, and are thus redundant in this setting. The more obscure genes or gene products may have unique tags that can discriminate between their family members, but they may be present at such a low level that large sampling projects will be required before accurate identification and quantitation is achieved.
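The dependence of detection on sampling depth can be illustrated with a simple probability sketch. Assuming each sampled tag is an independent draw and a given transcript accounts for a fraction p of all tags, the chance of seeing it at least once in N tags is 1 - (1 - p)^N. The independence assumption, the function names and the abundance used in the example are illustrative and are not taken from the text.

```python
# Minimal sketch: sampling depth needed to detect a rare transcript,
# assuming independent draws with a fixed per-tag probability.
import math


def detection_probability(abundance, n_tags):
    """Probability of sampling a transcript at least once in n_tags draws."""
    return 1.0 - (1.0 - abundance) ** n_tags


def tags_required(abundance, target_probability):
    """Smallest number of tags giving at least the target detection probability."""
    if not 0.0 < abundance < 1.0 or not 0.0 < target_probability < 1.0:
        raise ValueError("abundance and target_probability must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - target_probability) / math.log(1.0 - abundance))


if __name__ == "__main__":
    rare = 1e-5  # hypothetical: one transcript in 100,000
    for depth in (10_000, 100_000, 1_000_000):
        print(depth, round(detection_probability(rare, depth), 3))
    print(tags_required(rare, 0.95))
```

Under these assumptions a transcript at one part in 100,000 needs several hundred thousand tags for reliable detection, whereas an abundant transcript is resampled many times over at the same depth.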


1.5.2 Summary of Transcriptome Analysis

Genetics continues to be faced with the problems of bridging the gap between genotype and phenotype. Transcriptome analysis provides the first step in addressing this gap, and forms the basis with which to continue addressing the expression of genes from the genome through the transcriptome and proteome. Access to the data of the HGMP coupled with evolution of transcriptional analysis offers a powerful resource in understanding the transcription of gene products and any changes in this transcription that occur from experimental stimuli or disease models. The ordered and timely expression of genes represents an important level of complexity in the understanding of an organism. The disruption of such balance, and the ability to measure changes, facilitates an understanding of pathology, which could identify causal relationships and therapeutic targets.

1.6 The Pathology and Progression of Diabetes Mellitus in Target Organs and Tissues

Prior to the introduction of insulin as a treatment for insulin dependent diabetes mellitus (IDDM), most patients did not survive long enough for clinical complications to develop. It was only after the introduction of insulin treatment that the manifestations of diabetes (retinopathy, neuropathy and diabetic nephropathy (DN)) became serious clinical issues. Because of the tight association between insulin and glycaemic control, the hypothesis that hyperglycaemia is a causative agent for diabetic pathology was proposed. The glucose hypothesis postulates that hyperglycaemia causes diabetic complications and that correction of hyperglycaemia will prevent them. Rigorous testing of this hypothesis in experimental animals demonstrated a strong correlation between elevated blood glucose and multi-organ and tissue pathology similar to those present in humans with longstanding diabetes (Pirart et al., '78, Klein et al., '88, Chase et al., '89).


complications. The primary question addressed was whether intensive insulin therapy could prevent diabetic retinopathy in patients with no retinopathy. A second question was whether intensive therapy could slow the progression of early retinopathy (DCCT and Group, '93). The study found that intensive insulin therapy reduced the incidence of diabetic retinopathy by 50% after five years, and that the incidence continued to decrease with time. The intensive therapy also reduced the risk of progression of diabetic retinopathy by 54%.

In summary, the DCCT found that, although not able to reverse the complications of IDDM, intensive insulin treatment of IDDM reduced the incidence and delayed the progression of diabetic pathology in multiple organ systems. Hyperglycaemia remains the hallmark of diabetes and the hypothesis persists that good glycaemic control reduces the risk for development and progression of diabetes specific pathology within the retina, peripheral nerve, vasculature and glomerular apparatus.

1.6.1 Diabetes and the Eye

Diabetic retinopathy is the leading cause of visual impairment in the developed world (Palmberg et al., '81, Frank et al., '82, DCCT, '95). In the eye, retinal terminal capillary damage (micro-aneurysms) leads to leaking erythrocytes and dot and blot haemorrhages. The retinal vessels are also abnormally permeable and leak serous fluid that will eventually form hard exudates. With increasing duration of diabetes the retinal vessels can become occluded and lead to ischemic infarctions in the retinal nerve layer. The response to this ischemia is the formation of new blood vessels (neo-vascularisation) and proliferation out of the retinal surface and into the vitreous cavity. The new vessels are fragile and tend to bleed into the vitreous cavity, resulting in vision obstruction. These vitreous haemorrhages will be resorbed but the fibro-proliferative changes that ensue will result in retinal traction, eventual detachment and loss of vision.

1.6.2 Diabetes and the Nervous System


function, bladder function, cardiac function and vascular tone (Nathan, '92). Not only do the nerve fibres degenerate in DM but regeneration mechanisms are short lived and fail to progress. This failure of neuronal buds to mature and progress creates the progressive neuropathy, and is currently believed to be a result of three mechanisms: metabolic dysfunction in the neurone, ischemic effects caused by vasculature abnormalities and deleterious effects of protein glycation on the supporting Schwann cells and ECM (King, '01, Cameron et al., '01).

1.6.3 Diabetes and the Vasculature

Vascular impairment is also a chronic complication of diabetes, with accelerated atheroma resulting in premature, aggressive coronary artery, cerebrovascular and peripheral vascular disease. Endothelial dysfunction appears to be a common starting point for diabetic vascular disease, with subsequent involvement from other cells of the vasculature, predominantly smooth muscle cells. Early haemodynamic changes include an increase in blood flow in the skin, retina and glomerulus. This increase in blood flow is seen before structural changes and is reversible early in DM. Once progressed, however, the structural alterations become irreversible. Increases in blood flow and micro-vascular pressure cause leaking and thickening of the capillary membrane and lead to failure in normal functioning, tissue ischaemia and organ damage. Interestingly, hyperglycaemia only displays an indirect association to vascular damage and there appears to be a group of patients that fail to develop microvascular complications even after long duration DM. This suggests a genetic basis for microvasculature damage (Shore and Tooke, '94).

1.6.4 Diabetes and the Kidney
