
Data mining and integration of heterogeneous bioinformatics data sources.


Academic year: 2021


Badr H. Al-Daihani Al-Mutairy


UMI U559833

Published by ProQuest LLC 2013. Copyright in the Dissertation held by the Author. Microform Edition © ProQuest LLC.

All rights reserved. This work is protected against unauthorized copying under Title 17, United States Code.


DECLARATION

This work has not previously been accepted in substance for any degree and is not concurrently submitted in candidature for any degree.

Signed ............ (candidate)    Date ............

STATEMENT 1

This thesis is being submitted in partial fulfilment of the requirements for the degree of PhD.

Signed ............ (candidate)    Date ............

STATEMENT 2

This thesis is the result of my own independent work/investigation, except where otherwise stated. Other sources are acknowledged by explicit references.

Signed ............ (candidate)    Date ............

STATEMENT 3

I hereby give consent for my thesis, if accepted, to be available for photocopying and for inter-library loan, and for the title and summary to be made available to outside organisations.

Signed ............ (candidate)    Date ............


my wife,

Acknowledgements

My first and foremost thanks and praises are due to Allah (God) Almighty who has helped me and provided me with faith, patience and commitment to complete this research.

I would like to express my deep thanks and gratitude to my supervisor, Professor Alex Gray, for his supervision, guidance, support and encouragement throughout this research.

My special thanks also go to Dr. Peter Kille for his continued and unlimited help with regard to the biological aspects of my research. I am very grateful for his careful reading of, and constructive comments on, this thesis.

Special thanks are due to the members of the school for their help, especially Mrs. Margaret Evans who has helped me with travel-related issues, Mrs. Helen Williams for her help in administrative issues, and Mr. Robert Evans and Dr. Rob Davies for their technical assistance.

I would also like to express my thanks to my fellow research students in the School of Computer Science at Cardiff University for providing a pleasant and stimulating research environment. I really enjoyed the friendship that I developed with them while doing this research.

Special admiration and gratitude are due to my parents, wife, brothers and sisters whose prayers, love, care, patience, support and encouragement have always enabled me to perform to the best of my abilities.

Last but not least, I would like to thank all the people, members of my family and close friends, who have borne with me during the period of my PhD studies.

Abstract

The integration of bioinformatics data sources is one of the most challenging problems facing bioinformaticians today, due to the increasing number of bioinformatics data sources and the exponential growth of their content.

In this thesis, we present a novel approach to interoperability that is based on biological relationships. This relationship-based integration uses different relationship types, with different relationship closeness values, to link gene expression datasets with other information available in public bioinformatics data sources. These relationships provide flexible linkage for biologists to discover linked data across the biological universe. Relationship closeness is a variable used to measure the closeness of the biological entities in a relationship and is a characteristic of the relationship. The novelty of this approach is that it allows a user to link a gene expression dataset with heterogeneous data sources dynamically and flexibly to facilitate comparative genomics investigations. Our research has demonstrated that using different relationships allows biologists to analyze experimental datasets in different ways, shortens the time needed to analyze the datasets and provides an easier way to undertake this analysis. Thus, it gives biologists more power to experiment with changing threshold values and linkage types. This is achieved in our framework by introducing the Soft Link Model (SLM) and a Relationship Knowledge Base (RKB), which is built and used by the SLM.

The Integration and Data Mining of Bioinformatics Data Sources (IDMBD) system is implemented as a proof-of-concept prototype to demonstrate the linkage technique described in the thesis.

Content

DECLARATION
Acknowledgements
Abstract
Content
List of Figures
List of Tables
List of Acronyms

CHAPTER 1: Introduction
1.1 Synopsis
1.2 Background to Integration of bioinformatics sources
1.2.1 Experimental Datasets
1.3 Rationale
1.4 The hypothesis and the aim of the research
1.4.1 Objectives
1.5 Research Approach
1.6 Overall Achievements of the research
1.7 Thesis organization

CHAPTER 2: Background
2.1 Synopsis
2.2 Introduction
2.3.1 Data
2.3.2 Data sources
2.4 Heterogeneity in Bioinformatics Data Sources
2.4.1 Syntactic
2.4.2 Semantic
2.4.3 Data models
2.5 Summary

CHAPTER 3: Bioinformatics Data Source Integration
3.1 Synopsis
3.2 Introduction
3.3 Integration approaches
3.3.1 Architecture
3.3.2 Joining and matching strategies (mechanism)
3.4 Existing systems
3.5 Challenges
3.6 Summary

CHAPTER 4: Soft Link Model
4.1 Synopsis
4.2 Comparative genomics
4.3 Biological relationships
4.3.2 Significance of the types of relationship
4.3.3 Calculation of relationship closeness
4.4 Soft Link Model
4.4.1 Definitions
4.4.2 Formal Representation
4.4.3 SLM Operators
4.5 Source selection algorithm
4.6 Summary

CHAPTER 5: System Architecture
5.1 Introduction
5.2 System architecture
5.2.1 Architecture layers
5.2.2 Integration Phases
5.3 Building the SLM
5.4 System Sequence
5.5 Interaction between the Mediator and SLM
5.5.1 Request
5.5.2 Response
5.6 Summary

CHAPTER 6: Extracting Metadata of Experimental dataset
6.2 Introduction
6.3 Experimental dataset model
6.3.1 Metadata extraction
6.3.2 Schema creation
6.3.3 Schema exploitation
6.4 Metadata Linkages with Domain Ontology
6.4.1 Ontology
6.4.2 Discovering semantic relationships
6.4.3 Enhanced metadata
6.5 System Architecture
6.6 Limitation
6.7 Summary

CHAPTER 7: Implementation
7.1 Synopsis
7.2 Requirement Analysis
7.3 Implementation overview
7.4 Choice of programming language
7.5 Modules
7.5.1 Soft Link Model
7.5.2 Configuration
7.5.4 Wrapper
7.5.5 Parser
7.5.6 User Interfaces
7.6 Genericity
7.7 Summary

CHAPTER 8: Analysis of "wet laboratory" data
8.1 Introduction
8.2 Data from Wet Laboratory experiment
8.3 Objectives of the SLM Analysis
8.4 Integration of Wet Laboratory data into "Soft Link Model Environment"
8.4.1 Metadata extraction
8.4.2 Identifier conversion
8.4.3 Cross species transformations
8.4.4 Defining genes conserved between species using specific functions and thresholds
8.4.5 Comparison and validation
8.5 Results from SLM Analysis
8.5.1 Orthological and Ontological Data Transformation
8.5.2 Determining the optimal threshold for cross-species orthology relationship
... conservation
8.5.4 Functional enrichment through cross-experimental comparison
8.6 Biologist evaluation
8.7 Summary

CHAPTER 9: Evaluation
9.1 Synopsis
9.2 Introduction
9.3 Current research process
9.4 The IDMBD approach
9.4.1 SLM
9.4.2 The Architecture
9.5 IDMBD evaluation
9.5.1 Saving time
9.5.2 Genericity and Uniform access
9.5.3 Reducing human interaction
9.5.4 Transparency and autonomy
9.5.5 Flexibility
9.5.6 Extendibility
9.5.7 Heterogeneity
9.5.8 Functionality
9.6 Summary

CHAPTER 10: Conclusions and future work
10.1 Synopsis
10.2 Thesis summary
10.3 Thesis contributions
10.4 Strengths and Limitations of SLM
10.5 Future Work
10.6 Conclusion

APPENDIX A. System comparison
APPENDIX B. XML documents and Schema
APPENDIX C. Technologies
APPENDIX D. JAVA Classes
APPENDIX E. Biologist's Evaluation
REFERENCE

List of Figures

Figure 2.1: Growth of bioinformatics data sources (1999-2007)
Figure 2.2: Development of the International Nucleotide Sequence Database
Figure 2.3: Growth of GenBank (1982-2005)
Figure 3.1: Basic data integration models based on architecture
Figure 3.2: Basic joining and integration strategies
Figure 4.1: Orthologs and paralogs explained graphically
Figure 4.2: A sample part of a BLAST output showing the pair of sequence identifiers, score, e-value and identities between each pair of the sequences
Figure 4.3: Representation of Soft Link Model
Figure 4.4: An excerpt of the blastp program report used to find possible homologues between mouse sequences and C. elegans sequences
Figure 4.5: Source selection algorithm
Figure 5.1: The IDMBD Framework: a conceptual view
Figure 5.2: Algorithm to generate a relationship knowledge base
Figure 5.3: XML schema for SLM metadata
Figure 5.4: Overall Architecture of Integration system
Figure 5.7: The mediator interacts with the SLM via a request/response paradigm
Figure 5.8: XML schema definition for the Request operation
Figure 5.9: XML schema definition for the Response operation
Figure 6.1: Algorithm for mapping experimental dataset elements to Ontology
Figure 6.2: Domain Ontology
Figure 6.3: Mapping the experimental dataset concept into the Domain Ontology
Figure 6.4: Discovered semantic relationships between the experimental dataset concept and Domain ontology concepts
Figure 6.5: Query Handler and Metadata extraction Architecture
Figure 6.6: Sample of flat files
Figure 7.1: An overview of the implementation Architecture
Figure 7.2: An example of SLM metadata
Figure 7.3: A graph representing protein-protein relationships between mouse and C. elegans
Figure 7.4: The IDMBD modules
Figure 7.5: GUI Main interface for relationship discovery and building SLM
Figure 7.6: User interface for discovering relationship between concepts. User chooses the concepts, data sources, relationship type and algorithm to compute relationship closeness
Figure 7.8: Uploading experimental dataset from a flat file
Figure 7.9: The metadata detected from experimental dataset
Figure 7.10: Schema view and user parameters for integration process
Figure 8.1: Screen snapshot showing the extracted metadata from the experimental datasets
Figure 8.2: A schematic overview of query workflow, and how various inputs and outputs are interlinked
Figure 8.3: The profile of the relationship between protein sequence conservation (as expressed by homology score) and maintenance of the biological role
Figure 8.4: A graph exploring the overlap between the homologues of the cohort of mouse genes displaying up-regulation in response to age with an ontological category in both mouse and C. elegans defined as "age" and "growth"
Figure 8.5: DAVID functional annotation clustering using classification stringency "high", "Gene List MC-10 Pair"
Figure 8.6: DAVID functional annotation clustering using classification stringency "high", "Gene List MC-70 Pair"
Figure 8.7: DAVID functional annotation clustering using classification stringency "high", "Gene List MC-MF Pair"
Figure 9.1: Typical sequence of steps a biologist performs to drive a series of computational analyses relating to comparative genomic analyses
... genomic analyses
Figure B.1: XML schema of metadata of data sources
Figure B.2: Metadata description of data sources
Figure B.3: XML schema for SLM metadata
Figure D.1: Main SoftLink Interface Class with Primitives for SLM API
Figure D.2: Query Handler Class with Primitives for SLM API
Figure D.3: Relationship Wrapper Class with Primitives for SLM API
Figure D.4: GenerateSoftLinkTable Class
Figure D.5: BlastParser Class with Primitives for SLM API
Figure D.6: Gene Class
Figure D.7: Relations Info Class
Figure D.8: Algorithm Class with Primitives for SLM API
Figure D.9: UniGeneWrapper Class with Primitives for SLM API
Figure D.10: Wrapper Manager Class
Figure D.11: Wrapper Class
Figure D.12: GO Wrapper Class
Figure D.13: SLM Parser Class

List of Tables

Table 2.1: Growth of bioinformatics data sources (1999-2007) [82-85]
Table 3.1: Dimensions used in characterising existing systems
Table 4.1: Types of relationships in SLM
Table 4.2: Sample of gene annotation of C. elegans
Table 4.3: Sample of gene annotation of Mouse
Table 4.4: Sample of gene annotation of C. elegans
Table 4.5: Sample of gene annotation of Mouse
Table 4.6: The result of applying Lin's measure to compute semantic similarity between pairs of gene products using Molecular Function GO terms annotation of genes in Table 4.4 and Table 4.5
Table 4.7: Different subsets from the Cartesian product of R and S of each pair in the alignment of the sequences (ri, sj)
Table 5.1: Steps taken by the system to answer a user query
Table 6.1: Scoring System
Table 7.1: Query Handler methods
Table 8.1: Comparison of the experimental metadata describing the two wet lab experiments used for SLM analysis
Table 8.2: Number of intersecting homolog pairs between two datasets at different thresholds
Table 8.4: Intersection between homology pair and MF, BP and CC
Table 8.5: Fraction of MF, BP and CC to homology across mouse and C. elegans. Mapping mouse age-related genes onto C. elegans components using different relationships and thresholds. These figures are calculated by: 1. MF = (HM X MF)/HM, 2. BP = (HM X BP)/HM, and 3. CC = (HM X CC)/HM
Table 8.6: The number of genes with GO terms related to aging and growth
Table 8.7: The ratio of genes with GO terms related to aging and growth to the total with conserved ontological classification across two datasets
Table B.1: Description of XML schema elements
Table C.1: Technologies used in the implementation of IDMBD

List of Acronyms

ACEDB    A Caenorhabditis Elegans Database
AcePerl  An object-oriented Perl interface for AceDB
API      Application Programming Interface
AQL      Acedb Query Language
BLAST    Basic Local Alignment Search Tool
BP       Biological Process
CAS      Chemical Abstracts Service
CC       Cellular Component
CDM      Common Data Model
cDNA     complementary DNA
CPL      Collection Programming Language
DAVID    Database for Annotation, Visualization, and Integrated Discovery
DB       Database
DBMS     Database Management System
DBS      Database System
DDBJ     DNA Data Bank of Japan
DM       Data Mining
DNA      DeoxyriboNucleic Acid
EMBL     European Molecular Biology Laboratory
EC       Enzyme Commission
GO       Gene Ontology
GRAIL    GALEN Representation and Integration Language
GUI      Graphical User Interface
HMM      Hidden Markov Model
IDMBD    Integration and Data Mining of Bioinformatics Data Sources
JDBC     Java Database Connectivity
JDOM     Java Document Object Model
JSP      Java Server Pages
MF       Molecular Function
MGI      Mouse Genome Informatics
OODBMS   Object Oriented Database Management Systems
ORDBMS   Object Relational Database Management Systems
OODM     Object Oriented Data Model
OQL      Object Query Language
OWL      Web Ontology Language
RC       Relationship Closeness
RDBMS    Relational Database Management System
RDF      Resource Description Framework
RKB      Relationship Knowledge Base
SAX      Simple API for XML
SEMEDA   Semantic Meta Database
SLA      Soft Link Adapter
SLM      Soft Link Model
SOAP     Simple Object Access Protocol
SQL      Structured Query Language
SRS      Sequence Retrieval System
TAMBIS   Transparent Access to Multiple Bioinformatics Information Sources
TaO      TAMBIS Ontology
PERL     Practical Extraction and Report Language
URL      Uniform Resource Locator
WWW      World Wide Web
XML      Extensible Markup Language

CHAPTER 1: Introduction

1.1 Synopsis

Bioinformatics data sources are heterogeneous in their representation and query capabilities across diverse information fields held in distributed autonomous resources. The volume of data collected and stored in these distributed and heterogeneous data sources presents a major challenge with respect to the efficient and effective accession, processing, extraction, discovery and integration of this information. In particular, this occurs when a biologist wants to use data mining tools linked with information held in existing knowledge and computational resources in investigations to exploit the exponentially increasing amount of comparative genomic data. In this chapter, a background to this problem is provided, followed by the research motivations for the thesis. Next, the hypothesis, the aims and objectives of the research are presented. The research methodology used is presented, followed by a summary of the overall achievements of the research. The chapter ends by describing the organization of the thesis.

1.2 Background to Integration of bioinformatics sources

The integration of bioinformatics data sources is one of the most challenging problems facing bioinformaticians today, due to the increasing number of bioinformatics data sources and the exponential growth of their content and usage [131, 138]. These sources usually differ in their structure, scope and contents [139]. Most data sources are centred on one primary class of objects, such as gene, protein, or DNA sequences. This means that each data source contains different pieces of biological information and knowledge reflecting the purpose of the source, and can answer queries appropriate to its domain, but cannot help with queries that cross domain boundaries and involve different data repositories. Answering such cross-domain queries is an area of research that is growing in importance.

In most existing integration systems, joining information held in different data sources is based on the uniqueness of common fields in the sources or on linkage through ontology terms. Data entries in some data sources have relationships expressed as links, or predefined cross-references. Such cross-references are usually stored as a pair of values, for example, target data source and accession number, and are effected through a hyperlink on a web page [36, 140]. These links are added to data entries for many different reasons: for example, data curators insert them as structural relationships between two data sources, and biologists insert them when they discover a confident relationship between items [36]. Yet these links are not established in collaboration with the curator of the linked data sources. These static links (hyperlinks) are problematic, as the hyperlink may change. Thus, if a curator changes or withdraws an entry that is related to an entry in another data source, the link fails [36, 140]. With sources changing quickly, this leads to inconsistency, and continual updating is needed.

Moreover, many bioinformatics data sources do not support explicit relationships, such as ortholog and other types of relationship, with data held in other data sources. Bioinformatics data sources need linking using associations between entities that are hard to find, as they are implicit in the sources and not explicit in the data [3]. Relationships between data held in such data sources are usually numerous, and only partially explicit. There is, therefore, a growing need to link these data sources using dynamic and flexible linking at a higher level through relationships, particularly if this can be achieved in an efficient manner.
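The fragility of static cross-references described above can be illustrated with a small sketch. It is not taken from the thesis: the `CrossRef` type, the `dangling_links` helper, and the source and accession names are all hypothetical, and each data source is simplified to a set of accession numbers. The sketch stores each cross-reference as a (target data source, accession number) pair and shows how a link silently breaks once the target entry is withdrawn.

```python
from typing import NamedTuple


class CrossRef(NamedTuple):
    """A static cross-reference: (target data source, accession number)."""
    target_source: str
    accession: str


def dangling_links(xrefs, sources):
    """Return the cross-references whose target entry no longer exists.

    `sources` maps a (hypothetical) data source name to the set of
    accession numbers it currently holds.
    """
    return [
        x for x in xrefs
        if x.accession not in sources.get(x.target_source, set())
    ]


# Hypothetical data: one target source holding two entries.
sources = {"SourceB": {"B001", "B002"}}
xrefs = [CrossRef("SourceB", "B001"), CrossRef("SourceB", "B002")]

all_valid = dangling_links(xrefs, sources)   # every link resolves

# The curator of SourceB withdraws an entry without notifying linkers.
sources["SourceB"].discard("B002")
broken = dangling_links(xrefs, sources)      # the link to B002 now fails
```

Because the linking source is not told about the withdrawal, it can only detect the failure by re-checking every stored pair, which is the continual updating cost noted above.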


1.2.1 Experimental Datasets

The emergence of biotechnology has made it possible to study the expression of thousands of genes or proteins in a single experiment in the laboratory, which creates an experimental dataset [7, 181]. This raises many challenges:

• In order to mine relevant biological knowledge from an experimental dataset, it is important not only to analyse the experimental data, but also to cross-reference and associate the large volumes of data produced in this way with information available in external bioinformatics data sources, in order to conduct comparative genomics investigations and so predict gene functions and undertake evolutionary analysis [186].

• Due to the complexity of the biological problems under study and the lack of complete experimental and analytical models, there is a need to design a knowledge-driven system that assists in the explanation and validation of the predictive outcomes of experiments [198].

• Researchers have great difficulty in setting up large-scale experiments, mainly because of a shortage of expertise and limited resources to recruit appropriate staff [25], so most current researchers annotate genes one at a time, using online data sources or a manual literature search [106]. A previous study [107] has revealed that 40 to 60% of genes found in new genomic sequences do not have assigned functions.

• Many researchers struggle to identify the most appropriate sources and tools to be used in the analysis of their experimental datasets [106].

• One of the significant challenges is to integrate gene annotation with the gene expression and sequence information [136, 138, 193, 194], so that biologists can study genes based on their function, chromosomal location, and tissue expression, and cross-reference the data derived from different species across diverse expression analysis platforms.

• When linking and integrating data presented in an experimental dataset in a semi-structured form with data held in a bioinformatics data source, it is essential to determine as much information about the experimental dataset as possible. This information can be detected automatically from its metadata, such as column names and their content descriptions [75].

Thus, rather than being overwhelmed by long lists of unannotated data, researchers need a system that allows them to annotate genes and microarray information by linkage to additional information from various online public data sources. The system should have the ability to integrate experimental datasets with the rich set of gene annotation information available within and across species. Such a system should allow researchers to collect and manage large amounts of gene expression, gene sequence, and gene annotation data.

In our research, we aim to develop a framework for integrating bioinformatics data sources that uses relationships across species and user preferences. It should allow the user to specify constraints and parameters for the integration, giving a biologist flexible use of different types of comparative genomics relationships in investigations.

1.3 Rationale

In 2006, over 100,000 individual samples were deposited in public repositories for gene expression/molecular abundance data. These submissions represent over 2000 platforms or array types from 60 different species [87]. This body of public data is growing exponentially and is matched by an equal or greater number of studies in the private domain. Few tools have been developed to compare directly the results yielded from individual studies. Although significant advances have been made in visualizing [22, 38, 47, 88] and manipulating individual datasets (including data processing [200], statistical analysis [103], clustering [16, 211] and annotation-based over-representation [73]), these approaches allow cross-experimental comparison only by subjective analysis of the output. These comparisons offer an opportunity to reveal conserved disease mechanisms or common modes of action in cases of toxicosis caused by chemical exposure. The value of this data to the fundamental understanding of these processes should not be underestimated, but new approaches are needed. The major hurdles to these dataset comparisons include variations in reported nomenclature, database versioning, orthology/paralogy, choice of relationship, and the threshold used to determine relationship validity. In this research, we set out to develop a platform that would allow direct comparison between two datasets, within species, allowing variable gene identifiers to be mapped onto the species-specific primary data source, which in turn could be used to yield sequence or gene annotation that would facilitate comparison, with flexibility in the relationship types used and the thresholds of linkage.
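The idea of linking with a user-chosen relationship type and threshold can be sketched as follows. This is a minimal illustration under stated assumptions rather than the Soft Link Model itself: the `SoftLink` record, the `link` function, the identifiers and the relationship type names are hypothetical, and relationship closeness is assumed here to be a score in [0, 1] where higher means closer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SoftLink:
    """One precomputed relationship between two biological entities."""
    source_id: str     # identifier in the experimental dataset
    target_id: str     # identifier in an external data source
    rel_type: str      # hypothetical type name, e.g. "homolog"
    closeness: float   # assumed scale: 0.0 (distant) to 1.0 (close)


def link(dataset_ids, links, rel_type, threshold):
    """Keep links of one relationship type at or above the threshold."""
    wanted = set(dataset_ids)
    return [
        l for l in links
        if l.source_id in wanted
        and l.rel_type == rel_type
        and l.closeness >= threshold
    ]


# Hypothetical relationship records between two species.
links = [
    SoftLink("m1", "c1", "homolog", 0.9),
    SoftLink("m1", "c2", "homolog", 0.4),
    SoftLink("m2", "c3", "go_similarity", 0.7),
]

strict = link(["m1", "m2"], links, "homolog", threshold=0.5)
relaxed = link(["m1", "m2"], links, "homolog", threshold=0.3)
```

Rerunning the same query with a different `rel_type` or `threshold` changes which entities are linked without touching the underlying sources, which is the kind of flexibility the rationale calls for.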

1.4 The hypothesis and the aim of the research

The research hypothesis for this thesis is:

Hidden relationships between biological objects can be used in integrating bioinformatics data sources, so that a biologist can flexibly link an experimental dataset with bioinformatics data sources and the resulting data source can be mined effectively to inform the investigation.

Thus, the aim of the research is to investigate the use of relationships between biological objects to link heterogeneous bioinformatics data sources to annotate genes discovered in experiments and predict gene functions via comparative genomics analysis.


1.4.1 Objectives

In order to demonstrate the hypothesis, we aim to meet a number of objectives:

Objective 1: to extract an experimental dataset's metadata and to detect suitable candidate keys for linkage in it

Most experimental datasets are stored in unstructured files that do not have metadata saved in logical fields. In order to investigate fully the dataset being generated by a microarray or in a laboratory experiment, it is essential to detect and use as much information about the experimental dataset as possible. This information can be found in headings and content descriptions, and needs to be extracted and exploited to ensure that the data can be integrated in valid ways and so increase the scope of the investigations of the experimental dataset. Thus, a tool is needed to discover and extract this information.

Experimental datasets usually have many elements. Only a few of these elements can be used as a candidate key for linkage with other data. A candidate key helps us to join tuples in datasets with other data. Therefore, we need to try to detect automatically candidate keys that can be used to link and integrate a dataset with public data sources.
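To make the notion of a candidate key concrete, the following is a minimal sketch, not the detection method developed later in this thesis. It assumes the dataset has already been parsed into rows of named fields, and it treats a column as a candidate key when every value is present and no value repeats. The function name `candidate_keys` and the sample identifiers are hypothetical.

```python
from typing import Dict, List


def candidate_keys(rows: List[Dict[str, str]]) -> List[str]:
    """Return the columns whose values are non-empty and unique across all rows."""
    if not rows:
        return []
    keys = []
    for column in rows[0]:
        values = [row.get(column, "").strip() for row in rows]
        # A candidate key must be present in every tuple and must never repeat.
        if all(values) and len(set(values)) == len(values):
            keys.append(column)
    return keys


# Hypothetical experimental dataset; identifiers are illustrative only.
dataset = [
    {"probe_id": "P001", "gene_symbol": "geneA", "fold_change": "2.1"},
    {"probe_id": "P002", "gene_symbol": "geneB", "fold_change": "0.4"},
    {"probe_id": "P003", "gene_symbol": "geneA", "fold_change": "0.4"},
]
print(candidate_keys(dataset))
```

In this toy dataset only `probe_id` qualifies, because `gene_symbol` and `fold_change` both contain repeated values. A real detector would also need to consider whether the key's values match identifiers used in a public source.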

Objective 2: to transform extracted metadata and datasets into a form that can be used for linkage with other sources

Usually, experimental datasets are not in a form that can be directly linked to other bioinformatics data sources. The metadata should be stored in a format that allows its effective use. Also, datasets need to be analysed and stored so that they can be integrated and linked to other bioinformatics sources. Once the data has been stored in a suitable structure, it can be used to link with other appropriate public bioinformatics sources.


Objective 3: to show that these relationships can provide flexible and loosely coupled linkages across heterogeneous data sources

Bioinformatics data sources contain a large variety of objects. These objects are connected in a variety of ways giving an extensive interconnected graph of relationships. These relationships are often many-to-many, and refer to dynamic effects that one object has on another. Discovering these relationships between biological objects is important for biologists so that they can investigate whether the links enrich their knowledge about the genetic structure. Thus, the discovered relationships provide a means for joining information and linking data sources dynamically and flexibly, and so provide biologists with rich information and annotation. Thus, the objective is to detect these semantic relationships and build a relationship knowledge base containing this information that can be used to join information based on the GO classification association or homology between sequences, so that a biologist can assess the significance of the different links used in an investigation.
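The idea of a flexible, loosely coupled linkage can be illustrated with a small sketch. This is an illustration under stated assumptions, not the Soft Link Model itself: each discovered relationship is assumed to carry a type and a numeric score between 0 and 1, and the biologist chooses which type to follow and how strong a link must be. The names `Relationship` and `soft_link`, the score scale and the sample identifiers are all hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Relationship(NamedTuple):
    source_id: str  # object in the experimental dataset
    target_id: str  # object in a public data source
    kind: str       # assumed labels, e.g. "homology" or "go_term"
    score: float    # assumed strength of the relationship, 0.0 to 1.0


def soft_link(relationships: List[Relationship], kind: str,
              threshold: float) -> List[Tuple[str, str]]:
    """Keep links of the requested kind whose score meets the threshold."""
    return [(r.source_id, r.target_id)
            for r in relationships
            if r.kind == kind and r.score >= threshold]


# Hypothetical knowledge-base entries; identifiers are illustrative only.
knowledge_base = [
    Relationship("worm_gene_1", "mouse_gene_7", "homology", 0.92),
    Relationship("worm_gene_1", "mouse_gene_9", "homology", 0.55),
    Relationship("worm_gene_2", "mouse_gene_7", "go_term", 0.80),
]

# The same knowledge base gives different linkages as the settings change.
strict = soft_link(knowledge_base, "homology", 0.9)
relaxed = soft_link(knowledge_base, "homology", 0.5)
print(strict, relaxed)
```

Because the relationship type and the threshold are parameters rather than part of the schema, the investigator can rerun the same linkage with different settings and compare the results.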

Objective 4: to build a knowledge base of discovered relationships between sources and to exploit this to combine annotation knowledge from different sources.

Discovered relationships between biological objects will be stored in a knowledge base that can be used in the integration process to enrich a query. User queries can be extended using these relationships to obtain a greater amount of relevant information. The objective is to store these relationships in an appropriate model so that they can be reused in future investigations.

Objective 5: to provide users with uniform access to bioinformatics sources so that they can be queried as if they were a single source, thus shielding users from the underlying structure of sources.

One aim of integration is to provide users with a single interface to access and query multiple bioinformatics sources. The system should enable users to submit a single query to multiple bioinformatics data sources, and return a unified set of results, rather than the user having to spend unnecessary time submitting the same query over and over again to many data sources and then integrating the results manually. Moreover, end users of the integration system should not need to be aware of the underlying structure of sources when accessing or querying heterogeneous data sources. The system should handle all the underlying mechanics needed to process a user's query and return results. The objective is to hide the internal structure of these sources from users to simplify the interface for the biologist.
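The uniform-access objective can be sketched as a single entry point that forwards one query to several source-specific wrappers and merges the answers. This is a minimal sketch, assuming each source is hidden behind a wrapper function that accepts an identifier and returns a list of records. The wrappers below are in-memory stand-ins, not calls to any real data source, and every name in the block is hypothetical.

```python
from typing import Callable, Dict, List

# A wrapper hides one source's structure behind a common call signature.
Wrapper = Callable[[str], List[dict]]


def mediated_query(gene_id: str, wrappers: Dict[str, Wrapper]) -> List[dict]:
    """Send one query to every registered source and return a unified result list."""
    results = []
    for source_name, wrapper in wrappers.items():
        for record in wrapper(gene_id):
            # Tag each record with its origin so the user can trace it.
            results.append({"source": source_name, **record})
    return results


# In-memory stand-ins for two sources with different internal layouts.
_annotations = {"gene_1": ["annotation A", "annotation B"]}
_sequences = {"gene_1": "ATGC"}


def annotation_wrapper(gene_id: str) -> List[dict]:
    return [{"annotation": a} for a in _annotations.get(gene_id, [])]


def sequence_wrapper(gene_id: str) -> List[dict]:
    seq = _sequences.get(gene_id)
    return [{"sequence": seq}] if seq else []


sources = {"annotation_source": annotation_wrapper,
           "sequence_source": sequence_wrapper}
unified = mediated_query("gene_1", sources)
print(unified)
```

The caller sees only `mediated_query`. Adding a further source means registering another wrapper, with no change to the way queries are submitted.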

1.5 Research Approach

In this section, we summarise the methodology used in conducting our research. Firstly, the problem is defined as linking experimental datasets from biological experiments with heterogeneous bioinformatics data sources in flexible ways to support knowledge discovery, comparative genomics, or further investigation. Existing integration systems are then reviewed to determine the most appropriate approach. The literature review is split into two tracks: the first concentrates on the integration of heterogeneous data sources in general and the second is about bioinformatics data source integration and the mining of biological data. These tracks are then combined to support the research aim.

Discussions with professionals in biological science were undertaken, as it was our targeted application field. Dr. Peter Kille (Bioscience School, Cardiff University) was frequently consulted to ensure that our research met a biologist's needs. Experimental datasets were collected under the supervision of staff of the School of Bioscience. Different bioinformatics data sources were selected to be integrated with these datasets based on the biology under investigation, namely, Wormbase [46, 210], MGD [33-35, 41, 71] and Gene Ontology (GO) [89].

Based on our investigation of the research problem, we built a model for capturing and storing relationships between the biological objects to be used for the integration and linkage of the bioinformatics data sources. An initial system structure was proposed which provided a user with uniform access to heterogeneous bioinformatics sources. The final step in our research was the implementation of our proposed system as a prototype.

1.6 Overall Achievements of the research

The following is a summary of the main achievements of this research:

a) Introducing an approach for extracting an experimental dataset's metadata and identifying appropriate candidate keys for linkage with other related data (Chapter 6).

b) The creation (see Chapter 4) of a novel approach (SLM) to the integration of bioinformatics data sources, which allows biologists to create different types of linkages between bioinformatics data sources easily, drive the integration process, and change or adjust the linkage type flexibly. The investigator can therefore try different linkages, see the effect of using them and so determine which one, if any, matches the purposes of their research and produces significant results. This allows biologists to analyze experimental datasets in different ways, shortens the time needed to analyze the datasets and provides an easier way to undertake this analysis. Thus, SLM provides biologists with a tool which supports experimentation by using different threshold values and linkage types and thereby supports investigative research (Chapter 8).

c) The creation of a knowledge base of the discovered relationships between biological objects (Section 9.4), which is used to compare and link the experimental datasets with public sources. This knowledge base improves comparative approaches to annotating genes, by identifying possible relationships between objects across species, and predicting protein function from sequence homology, orthology and GO terms. By integrating functional and sequence data across species, biologists can annotate the genome of a species using functional data from another. Comparative genomics provides evidence for close evolutionary relationships between gene families. Also, this knowledge can be reused in other investigations.

d) A flexible mediator architecture for linking (i.e. integrating) experimental datasets with relevant information held in heterogeneous data sources (see Chapter 5). This means that a biologist does not need to query individual data sources directly or use a variety of Internet search tools for this purpose. Our mediated architecture offers a set of tools for discovering semantic relationships between biological objects, browsing these relationships and automating metadata extraction, and offers a single point of access to a set of data sources. It enables flexible integration of heterogeneous data sources: biologists can create different types of linkages between bioinformatics data sources easily, drive the integration process, and change or adjust the linkage type, so that the investigator can try different linkages to see which one, if any, matches the purposes of their research, determine the effect of different relationships and so identify their biological significance.

e) The determination of the optimal threshold for cross-species orthology relationships. This is demonstrated for Mouse and C. elegans (see Section 8.5).


Six papers were published on the work reported in this thesis. The full details of these papers are found in [8-12]. The conferences and the workshops in which the papers appear are:

1. 21st Annual British National Conference on Databases, BNCOD 21, Edinburgh, UK, 7-9 July 2004.

2. Sixth Informatics Workshop for Research Students, University of Bradford, Bradford, UK, March 2005.

3. 22nd British National Conference on Databases, BNCOD 22, Sunderland, UK, 5-7 July 2005.

4. HIBIT 05: International Symposium on Health Informatics and Bioinformatics, Belek, Antalya, Turkey, 10-12 November 2005.

5. 4th International Workshop on Biological Data Management - BIDM'06 in conjunction with DEXA 2006, Krakow, Poland, 3-7 September 2006.

6. VLDB 2006 on Data Mining in Bioinformatics in conjunction with VLDB 2006, Seoul, South Korea, 11-15 September 2006.

1.7 Thesis organization

This section gives an overview of the thesis organization and of the contents of each chapter.

Chapter 2: Background

This chapter gives the necessary background information about the characteristics of biological objects and bioinformatics data sources.

Chapter 3: Bioinformatics Data Source Integration

This chapter surveys the background areas of research related to the main ideas presented in the thesis on linking datasets.


Chapter 4: Soft Link Model

This chapter introduces the proposed Soft Link Model for data source integration and describes the approach used.

Chapter 5: System Architecture

This chapter introduces the design of the architecture and the different components of the IDMBD (Integration and Data Mining of Bioinformatics Data sources) system.

Chapter 6: Implementation

This chapter discusses the implementation issues for the proposed system, and describes the prototype implementation.

Chapter 7: Extracting Metadata of Experimental Dataset

This chapter presents an approach for extracting the experimental datasets' metadata and finding the suitable linkage keys that can be used for integration based on a mathematical foundation. Furthermore, it shows how to map a linkage key with the domain ontology to find related concepts and semantic relationships.

Chapter 8: Analysis of "wet laboratory" data

This chapter demonstrates the utility of our prototype system. We used the tools to analyse datasets generated by wet laboratory experimentation. The aim was to demonstrate that the soft link framework would allow us to derive novel insights into the experimental system by determining the elements conserved between species.

Chapter 9: Evaluation

This chapter provides an evaluation of the system in terms of different dimensions.

Chapter 10: Conclusions and future work

This chapter summarizes and comments on the contributions made by the research and discusses the perspectives and research directions that remain open for future work that could be carried out to improve the effectiveness of the SLM as a method of integrating heterogeneous bioinformatics data sources.


Background

2.1 Synopsis

This chapter gives the necessary background on biological data and bioinformatics data sources. It covers the reasons for the growth in the number and size of bioinformatics data sources, and the characteristics of bioinformatics and its data sources. This growth is often described in the literature as explosive [113, 187, 214]. Heterogeneity present in bioinformatics data sources is detailed and types of conflict explained. Data models are defined and described in detail, and their advantages and disadvantages discussed.

2.2 Introduction

In recent years, there has been a massive increase in the number and size of bioinformatics data sources, which is expected to continue at the same, or an even faster, pace in the coming years [131]. The growth in the number of data sources is related to the content of data held in them [65]. The reasons for this growth can be summarised as follows:

i. Rapid progress of the human genome project and other sequencing projects [58];

ii. Easy access to stored data provided by the Internet [13, 131];

iii. Proliferation of new biodata analysis technologies, bio-statistical approaches, computational algorithms, knowledge discovery, data mining and data analysis tools [60, 157];

iv. Design and development of new biotechnology and efficient (with respect to speed and accuracy) experimental techniques, primarily DNA sequencing, DNA microarrays and other high throughput technologies [131]; and

v. Massive investment in genomics by governments and the pharmaceutical industry [92, 131, 199].

In June 2008, the GenBank database alone held the records of more than 88,554,578 sequences and over 92,008,611,867 bases [86]. According to a recent survey, more than 1078 bioinformatics data sources are available online [83]. Table 2.1 and Figure 2.1 show the increase in the number of bioinformatics data sources from 1999 to the present day. Figure 2.2 illustrates the development of the International Nucleotide Sequence Database [86]. Figure 2.3 shows the growth of the GenBank database from 1982 to 2005. In this period, there was an exponential growth in base pair data from 680K to 56,037 million and in sequences from 606 to 52 million [85]. Such explosive growth is expected to continue well into the 21st century [113, 114, 187, 196]. Data sources are maintained by different communities and organizations [131, 138]; they are autonomous, distributed, disparate, heterogeneous and often do not provide direct access [29, 138]. A description of these characteristics can be found in section 2.3.2.

Data sources in general can be classified as primary or secondary. A primary source holds information from an experiment and is sometimes called an archival data source. It contains raw data of sequences or structures. Examples of these primary sources are GenBank [31, 32], EMBL and DDBJ for genome sequences and the Protein Databank for protein structures [21].


Figure 2.1: Growth of bioinformatics data sources 1999-2008, based on statistics published in [79-83]

Year    1999  2000  2001  2002  2003  2004  2005  2006  2007  2008
Number   197   226   281   335   386   548   719   858   968  1078

Table 2.1: Growth of bioinformatics data sources (1999-2008) [82-85]
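The figures in Table 2.1 can be summarised as an average annual growth rate. The short calculation below is an illustrative sketch rather than part of the original analysis. It assumes steady compound growth between the first and last years, which the year-by-year values only approximately follow.

```python
# Values copied from Table 2.1.
years = [1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008]
counts = [197, 226, 281, 335, 386, 548, 719, 858, 968, 1078]

# Compound annual growth rate over the nine intervals from 1999 to 2008.
periods = years[-1] - years[0]
annual_growth = (counts[-1] / counts[0]) ** (1 / periods) - 1

# Year-on-year growth, to show how uneven the increase actually was.
yearly = [later / earlier - 1 for earlier, later in zip(counts, counts[1:])]

print(f"average annual growth: {annual_growth:.1%}")
print(f"largest single-year growth: {max(yearly):.1%}")
```

Under this assumption the number of sources grew by roughly 21% per year on average, with the largest single-year jump between 2003 and 2004.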


Figure 2.2: Development of the International Nucleotide Sequence Database [85]

Secondary data source information is derived from primary data source data; secondary data sources hold data, such as conserved sequences, signature sequences and active site residues of the protein families derived by the multiple sequence alignment of a set of related proteins. A secondary data source is called a curated data source and examples include MGD [34] and Wormbase [46].

While the contents of primary data sources are controlled by the submitter, the contents of secondary data sources are controlled by a third party. Secondary data sources are derived from the following procedures [132]:

• Annotating and enriching data, either manually or automatically,
• Cleansing and removing redundant information,
• Collecting data from literature,
• Mining and compiling data from several data sources, and
• Analysing primary data.


In general, bioinformatics data sources cover a wide range of subjects and data types, including gene sequences, gene expression data, protein sequences, protein structure and metabolic pathways. They can be classified as general purpose or specific purpose data sources [29].

Figure 2.3: Growth of GenBank (1982-2005) [85]

2.3 Characteristics of bioinformatics data sources

The characteristics of bioinformatics data sources are presented here to give the reader an understanding of the field and the challenges it presents.


2.3.1 Data

Elmasri and Navathe [70] identify several characteristics of biological data that make it difficult to manage:

Complexity: biological data are arguably more complex than the data found in most other applications [177]. They are connected to each other in many ways, in a highly interconnected graph of relationships [174]. Thus, definitions of such biological data must be able to represent a complex substructure of data as well as relationships [70, 154]. For example, bioinformatics data sources include not only the functions of individual genes and proteins, but their complex interactions within a tissue, cell tissue, and whole organism [70, 154, 159, 177].

Diversity: biological data have a great diversity of types, such as sequences, spatial, 3D structures, graphs, string, scalar and vector data. There may also be overlaps in data types between different species and different genome sources [70, 154].

Incomplete: biological data are very often incomplete since some biological objects are large and full descriptions take time to achieve, or the limited resources available prevent the collection of relevant data [177]. For example, most of the genomes are incomplete and not annotated because the function of some genes is still unknown.

Large size: one of the most notable characteristics of biological data is their large size on account of the complexity of biological concepts, data types and structure. Sequences, graphs and protein-protein interactions all contribute to the complexity and size of biological data [131].

Lack of a standardised nomenclature: different organisations and communities use their own terminology to describe biological concepts. Thus, biological data frequently suffer from ambiguous and unclear concepts since there is no standardised nomenclature for them [131, 177].


2.3.2 Data sources

Here we discuss the differing characteristics of bioinformatics data sources [29]:

Heterogeneous in structure and content: each data source has its own data model and uses its own terminology and ontology. Different designers model a particular concept in different ways, and the aims of the experiment and project also contribute to this heterogeneity [98, 154]. Thus, the structure of data sources, and representations of the same data query results, may be different (see section 2.4).

Large in size: in the last few years, the number and size of new bioinformatics data sources have been growing exponentially, as has the number of computational tools available for analysing these data. There is no sign of any deceleration of growth [29].

Dynamic: bioinformatics data sources are dynamic. Their interfaces alter from time to time and their schemas change at a rapid pace, as do their contents [70].

Autonomous: bioinformatics data sources are autonomously owned and maintained by different communities and organisations, often for different purposes [138]. Consequently, query types allowed on data sources and the precise mode of interaction are diverse because of the different reasons for holding the data [29, 138].

Widely distributed: bioinformatics data sources are widely distributed across the world, and such data is currently not held in a centralised location for analytical purposes. This is most likely to continue to be the case [29, 138].

2.4 Heterogeneity in Bioinformatics Data Sources

This section identifies different types of heterogeneity that affect bioinformatics data sources with the aim of showing the challenges they present to making an interoperable system. This heterogeneity may exist at three levels, namely, syntactic, semantic and data model levels [26, 69, 84, 99, 110, 123, 128, 129, 131].
