Background
2.2 Introduction
In recent years, there has been a massive increase in the number and size of bioinformatics data sources, which is expected to continue at the same, or an even faster, pace in the coming years [131]. The growth in the number of data sources is related to the content of data held in them [65]. The reasons for this growth can be summarised as follows:
i. Rapid progress of the human genome project and other sequencing projects [58];
ii. Easy access to stored data provided by the Internet [13, 131];
iii. Proliferation of new biodata analysis technologies, bio-statistical approaches, computational algorithms, knowledge discovery, data mining and data analysis tools [60, 157];
iv. Design and development of new biotechnology and efficient (with respect to speed and accuracy) experimental techniques, primarily DNA sequencing, DNA microarrays and other high throughput technologies [131]; and
v. Massive investment in genomics by governments and the pharmaceutical industry [92, 131, 199].
In June 2008, the GenBank database alone held the records of more than 88,554,578 sequences and over 92,008,611,867 bases [86]. According to a recent survey, more than 1078 bioinformatics data sources are available online [83]. Table 2.1 and Figure 2.1 show the increase in the number of bioinformatics data sources from 1999 to the present day. Figure 2.2 illustrates the development of the International Nucleotide Sequence Database [86]. Figure 2.3 shows the growth of the GenBank database from 1982 to 2005. In this period, there was an exponential growth in base pair data from 680K to 56,037 million and in sequences from 606 to 52 million [85]. Such explosive growth is expected to continue well into the 21st century [113, 114, 187, 196]. Data sources are maintained by different communities and organisations [131, 138]; they are autonomous, distributed, disparate, heterogeneous and often do not provide direct access [29, 138]. A description of these characteristics can be found in section 2.3.2.
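As a rough check on the scale of this growth, the GenBank figures quoted above imply a doubling time that can be estimated under the simplifying assumption of constant exponential growth between 1982 and 2005. The sketch below is illustrative only: the function name doubling_time is hypothetical, "680K" is read as 680,000 base pairs, and the real growth curve was not perfectly exponential.

```python
import math


def doubling_time(start_value, end_value, years):
    """Doubling time in years, assuming constant exponential growth."""
    return years * math.log(2) / math.log(end_value / start_value)


# Figures quoted in the text for GenBank, 1982 to 2005 [85].
span_years = 2005 - 1982
base_pairs_1982 = 680_000          # assumption: "680K" means 680,000
base_pairs_2005 = 56_037_000_000   # "56,037 million"
sequences_1982 = 606
sequences_2005 = 52_000_000        # "52 million"

bp_doubling = doubling_time(base_pairs_1982, base_pairs_2005, span_years)
seq_doubling = doubling_time(sequences_1982, sequences_2005, span_years)

print(f"Base pairs double about every {bp_doubling:.2f} years")
print(f"Sequences double about every {seq_doubling:.2f} years")
```

Under these assumptions, both base pairs and sequences double in a little under a year and a half.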
Data sources in general can be classified as primary or secondary. A primary source holds information from an experiment and is sometimes called an archival data source. It contains raw data of sequences or structures. Examples of these primary sources are GenBank [31, 32], EMBL and DDBJ for genome sequences and the Protein Data Bank for protein structures [21].
[Figure 2.1 chart: "Growth of bioinformatics data sources", number of data sources plotted by year, 1999-2008]
Figure 2.1: Growth of bioinformatics data sources 1999-2008, based on statistics published in [79-83]
Year     1999  2000  2001  2002  2003  2004  2005  2006  2007  2008
Number    197   226   281   335   386   548   719   858   968  1078
Table 2.1: Growth of bioinformatics data sources (1999-2008) [82-85]
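The counts in Table 2.1 can be turned into growth rates with a few lines of arithmetic. The following is a minimal sketch, not part of the cited surveys: the function names yearly_growth and compound_annual_growth are hypothetical, and the compound rate assumes a constant percentage increase between the first and last years.

```python
# Counts copied from Table 2.1 (number of bioinformatics data sources per year).
counts = {
    1999: 197, 2000: 226, 2001: 281, 2002: 335, 2003: 386,
    2004: 548, 2005: 719, 2006: 858, 2007: 968, 2008: 1078,
}


def yearly_growth(counts):
    """Return {year: percentage increase over the previous year}."""
    years = sorted(counts)
    return {
        year: 100.0 * (counts[year] - counts[prev]) / counts[prev]
        for prev, year in zip(years, years[1:])
    }


def compound_annual_growth(counts):
    """Constant yearly percentage that links the first and last counts."""
    years = sorted(counts)
    first, last = years[0], years[-1]
    ratio = counts[last] / counts[first]
    return 100.0 * (ratio ** (1.0 / (last - first)) - 1.0)


growth = yearly_growth(counts)
cagr = compound_annual_growth(counts)

for year, pct in growth.items():
    print(f"{year}: {pct:+.1f}%")
print(f"Compound annual growth 1999-2008: {cagr:.1f}%")
```

The per-year figures show how uneven the increase was, while the compound rate gives a single summary number for the whole period.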
[Figure 2.2 chart: "Growth of the International Nucleotide Sequence Database Collaboration"; series: GenBank, EMBL and DDBJ]
Figure 2.2: Development of the International Nucleotide Sequence Database [85]
Secondary data source information is derived from primary data source data. Secondary data sources hold data such as conserved sequences, signature sequences and active site residues of protein families, derived by the multiple sequence alignment of a set of related proteins. A secondary data source is also called a curated data source; examples include MGD [34] and WormBase [46].
While the contents of primary data sources are controlled by the submitter, the contents of secondary data sources are controlled by a third party. Secondary data sources are derived from the following procedures [132]:
• Annotating and enriching data, either manually or automatically,
• Cleansing and removing redundant information,
• Collecting data from literature,
• Mining and compiling data from several data sources, and
• Analysing primary data.
In general, bioinformatics data sources cover a wide range of subjects and data types, including gene sequences, gene expression data, protein sequences, protein structure and metabolic pathways. They can be classified as general purpose or specific purpose data sources [29].
[Figure 2.3 chart: "Growth of GenBank (1982-2005)"; series: Base Pairs and Sequences, plotted by year]
Figure 2.3: Growth of GenBank (1982-2005) [85]
2.3 Characteristics of bioinformatics data sources
The characteristics of bioinformatics data sources are presented here to give the reader an understanding of the field and the challenges it presents.
2.3.1 Data
Elmasri and Navathe [70] identify several characteristics of biological data that make it difficult to manage:
Complexity: biological data are arguably the most complex data known when compared with the data of most other applications [177]. They are connected to each other in many ways, in a highly interconnected graph of relationships [174]. Thus, definitions of such biological data must be able to represent a complex substructure of data as well as relationships [70, 154]. For example, bioinformatics data sources include not only the functions of individual genes and proteins, but also their complex interactions within a cell, a tissue and a whole organism [70, 154, 159, 177].
Diversity: biological data have a great diversity of types, such as sequences, spatial data, 3D structures, graphs, strings, and scalar and vector data. There may also be overlaps in data types between different species and different genome sources [70, 154].
Incomplete: biological data are very often incomplete, since some biological objects are large and full descriptions take time to achieve, or the limited resources available prevent the collection of relevant data [177]. For example, most genomes are incomplete and not fully annotated because the function of some genes is still unknown.
Large size: one of the most notable characteristics of biological data is their large size, on account of the complexity of biological concepts, data types and structure. Sequences, graphs and protein-protein interactions all contribute to the complexity and size of biological data [131].
Lack of a standardised nomenclature: different organisations and communities use their own terminology to describe biological concepts. Thus, biological data frequently suffer from ambiguous and unclear concepts, since there is no standardised nomenclature for them [131, 177].
2.3.2 Data sources
Here we discuss the differing characteristics of bioinformatics data sources [29]:
Heterogeneous in structure and content: each data source has its own data model and uses its own terminology and ontology. Different designers have used several ways to model a particular concept, and the aims of the experiment and project also contribute to this heterogeneity [98, 154]. Thus, the structure of data sources and the representations of the same data in query results may differ (see section 2.4).
Large in size: in the last few years, the number and size of new bioinformatics data sources have been growing exponentially, as has the number of computational tools available for analysing these data. There is no sign of any deceleration of growth [29].
Dynamic: bioinformatics data sources are dynamic. Their interfaces alter from time to time and their schemas change at a rapid pace, as do their contents [70].
Autonomous: bioinformatics data sources are autonomously owned and maintained by different communities and organisations, often for different purposes [138]. Consequently, the query types allowed on data sources and the precise mode of interaction are diverse, because of the different reasons for holding the data [29, 138].
Widely distributed: bioinformatics data sources are widely distributed