Querying distributed heterogeneous structured and semi-structured data sources


Cardiff University

Prifysgol Caerdydd

Querying Distributed Heterogeneous Structured and

Semi-structured Data Sources

by

Fahad M. Al-Wasil

A thesis submitted in partial fulfillment of the requirements for the degree of

Doctor of Philosophy in

Computer Science

School of Computer Science

Cardiff University April 2007


All rights reserved

INFORMATION TO ALL USERS

The quality of this reproduction is dependent upon the quality of the copy submitted. In the unlikely event that the author did not send a complete manuscript and there are missing pages, these will be noted. Also, if material had to be removed, a note will indicate the deletion.

Dissertation Publishing

UMI U584905

Published by ProQuest LLC 2013. Copyright in the Dissertation held by the Author. Microform Edition © ProQuest LLC.

All rights reserved. This work is protected against unauthorized copying under Title 17, United States Code.

ProQuest LLC

789 East Eisenhower Parkway P.O. Box 1346


This work has not previously been accepted in substance for any degree and is not concurrently submitted in candidature for any degree.

Signed .................... (candidate) Date ....................

STATEMENT 1

This thesis is being submitted in partial fulfillment of the requirements for the degree of PhD.

Signed .................... (candidate) Date ....................

STATEMENT 2

This thesis is the result of my own independent work/investigation, except where otherwise stated. Other sources are acknowledged by explicit references.

Signed .................... (candidate) Date ....................

STATEMENT 3

I hereby give consent for my thesis, if accepted, to be available for photocopying and for inter-library loan, and for the title and summary to be made available to outside organisations.

Signed .................... (candidate)


my wife,


I would like to start by praising Allah (God) Almighty for providing me with faith, patience and commitment to complete this research.

I would like to express my sincere gratitude to my supervisors, Prof. W. A. Gray and Prof. N. J. Fiddian, for their expert guidance and encouragement throughout this research. I am grateful for their careful reading and constructive comments on this thesis and our joint papers.

I would like to thank the paper referees whose comments on my published papers have added to the success of this project.

Special thanks are due to the members of the school for their help, especially Mrs. Margaret Evans who has helped me with travel related issues, Mrs. Helen Williams for her help in administrative issues, and Mr. Robert Evans and Dr. Rob Davies for their technical assistance.

I would also like to express my thanks to my fellow research students in the School of Computer Science at Cardiff University for their friendship and help. I really enjoyed the friendships that I developed while doing this research.

Special admiration and gratitude are due to my parents, brothers and sisters whose prayers, love, care, patience, support and encouragement have always enabled me to perform to the best of my abilities.

Last, but certainly not least, I am indebted to my wife for her endurance and unconditional support which provided vital encouragement during the period of my PhD study. Without her love and devotion, this research would have been impossible. Finally, I would like to mention my beloved children, Nora, Mohammed, and Adeem who have given me happiness during the difficult period of my study.


The continuing growth and widespread popularity of the internet means that the collection of useful data available for public access is rapidly increasing both in number and size. These data are spread over distributed heterogeneous data sources like traditional databases or sources of various forms containing unstructured and semi-structured data. Obviously, the value of these data sources would in many cases be greatly enhanced if the data they contain could be combined and queried in a uniform manner. The research work reported in this dissertation is concerned with querying and integrating a multiplicity of distributed heterogeneous structured data residing in relational databases and semi-structured data held in well-formed XML documents produced by internet applications or human-coded. In particular, we have addressed the problems of: (1) specifying the mappings between a global schema and the local data sources' schemas, and resolving the heterogeneity which can occur between data models, schemas or schema concepts; (2) processing queries that are expressed on a global schema into local queries.

We have proposed an approach to combine and query the data sources through a mediation layer. Such a layer is intended to establish and evolve an XML Metadata Knowledge Base (XMKB) incrementally which assists the Query Processor in mediating between user queries posed over the global schema and the queries on the underlying distributed heterogeneous data sources. It translates such queries into sub-queries, called local queries, which are appropriate to each local data source. The XMKB is built in a bottom-up fashion by extracting and merging incrementally the metadata of the data sources. It holds the data sources' information (names, types and locations), descriptions of the mappings between the global schema and the participating data source schemas, and function names for handling semantic and structural discrepancies between the representations.


... prototype system called SISSD (System to Integrate Structured and Semi-structured Databases). The system automatically creates a GUI tool for meta-users (who do the metadata integration) which they use to describe mappings between the global schema and local data source schemas. These mappings are used to produce the XMKB. The SISSD allows the translation of user queries into sub-queries fitting each participating data source, by exploiting the mapping information stored in the XMKB.

The major results of the thesis are: (1) an approach that facilitates building structured and semi-structured data integration systems; (2) a method for generating mappings between a global and local schemas' paths, and resolving the conflicts caused by the heterogeneity of the data sources such as naming, structural, and semantic conflicts which may occur between the schemas; (3) a method for translating queries in terms of a global schema into sub-queries in terms of local schemas. Hence, the presented approach shows that: (a) mapping of the schemas' paths can only be partially automated, since the logical heterogeneity problems need to be resolved by human judgment based on the application requirements; (b) querying distributed heterogeneous structured and semi-structured data sources is possible.


List of Figures
Acronyms

CHAPTER 1 Introduction
1.1 Motivation of the Research
1.2 Problem Statement
1.3 Hypothesis, Aims and Objectives
1.4 Achievement of the Research
1.5 Organization of the Thesis

CHAPTER 2 Background and survey of the state-of-the-art
2.1 Distributed heterogeneous databases
2.2 A Taxonomy for Integrating Heterogeneous Data Sources
2.2.2 Data Warehouses
2.2.3 Metasearch Engines
2.2.4 Virtual Integration of Databases
2.2.4.1 Federated database systems
2.2.4.2 Multi-database systems
2.2.5 Summary of previous approaches
2.2.6 Mediation System
2.3 Data interoperability
2.4 Heterogeneity of the data sources
2.5 Data integration
2.6 Global-As-View (GAV) approach
2.6.1 GAV systems
2.7 Local-As-View (LAV) approach
2.7.1 LAV systems
2.8 Related Work

CHAPTER 3 XML and related technologies
3.1 XML
3.2 DTD and XML Schema
3.2.1 DTD
3.2.2 XML Schema
3.3 XML application programming interfaces
3.3.1 DOM
3.3.2 SAX
3.3.3 JDOM
3.4 XML query languages
3.4.1.1 XPath 1.0
3.4.1.2 XPath 2.0
3.4.2 XQL
3.4.3 XML-QL
3.4.4 The Quilt query language
3.4.5 XQuery

CHAPTER 4 The SISSD data integration system
4.1 Introduction
4.2 An overview of our approach
4.3 The SISSD architecture and Components
4.4 Heterogeneity issues in the SISSD system
4.5 An application example

CHAPTER 5 The mediation process
5.1 Generating Schema Structure Definition (SSD)
5.2 paths generation
5.3 paths correspondence
5.4 Creating XMKB
5.4.1 The Structure of XMKB
5.4.2 The generation process of the XMKB
5.4.3 Index number generation for the master view elements
5.4.4 Mapping cases between elements
5.5 Summary

CHAPTER 6 The query translation process
6.1 Introduction
6.2 The Query Processor architecture and Components
6.4 XQuery-to-SQL translation process
6.5 Query translation examples
6.5.1 One-to-one query example
6.5.2 Function-involved one-to-one query example
6.5.3 One-to-many query example
6.5.4 Many-to-one query example

CHAPTER 7 The SISSD implementation
7.1 Introduction
7.2 metadata extracting process
7.3 XMKB establishing and mapping process
7.4 Query parser and translation process

CHAPTER 8 Evaluation & Discussion
8.1 Evaluation
8.1.1 Functionality of SISSD
8.1.2 Flexibility of SISSD system
8.1.3 Architecture of SISSD system
8.1.4 Construction of the XMKB
8.1.5 Choice of XML as the data model
8.1.6 Handling different types of heterogeneity
8.1.7 Ways of using the system
8.2 Discussion

CHAPTER 9 Summary, conclusion and future work
9.1 Thesis summary
9.2 Conclusions
9.3 The future work

Bibliography


2.1 Classification of Systems for Integrating Heterogeneous Data Sources [56]
2.2 The three-tier mediator architecture
2.3 Conflicts Classification
3.1 An example of a simple XML document
3.2 A DTD of an XML document in Figure 3.1
3.3 An XML schema of an XML document in Figure 3.1
3.4 The core rules of XPath
3.5 The XML-QL query
4.1 The SISSD Architecture
4.2 Mapping between Marks and Grades
4.3 Summary of Conflicts supported by SISSD system
4.4 A part of the tree structure of four data sources
5.1 Algorithm to generate SSD for XML document
5.2 The SSD Model structure for the bib schema structure
5.3 Algorithm to generate SSD paths
5.4 The tree structure model for bib SSD
5.5 The generated paths of the bib data source
5.6 A sample XMKB
5.7 The XMKB XML schema definition
5.8 Algorithm for XMKB generation process
5.9 A GUI for Schema Structure Definition shown in Figure 5.10
5.10 Schema Structure Definition (SSD) of bib XML document
5.11 Algorithm to generate index numbers
5.12 The master view tree structure with index numbers
5.13 The Master View
5.14 One to N mapping example
5.15 N to one mapping example
5.16 Example of one to one mapping with an operation
6.1 The QP Architecture
6.2 Algorithm for the query translation process
6.3 The part of XMKB which maintain data sources information
6.4 The Master View
6.5 Schema Structures of the four data sources
6.6 Some parts of XMKB used to translate Q1
6.7 The generated local queries from Q1
6.8 Some parts of XMKB used to translate Q2
6.9 The generated local queries from Q2
6.10 Some parts of XMKB used to translate Q3
6.11 The generated local queries from Q3
6.12 Some parts of XMKB used to translate Q3
6.13 The generated local queries from Q4
7.1 The main interface of SISSD system
7.2 SSD of bib XML document
7.4 Relational DB connection parameters
7.5 XML document connection parameters
7.6 Index numbers generated for master view shown in Figure 7.7
7.7 Master view
7.8 Part of the GUI for SSD shown in Figure 7.2
7.9 Interface for submitting index numbers
7.10 Generated paths mapping
7.11 Interface for removing data source
7.12 Example of a global query translation
8.1 Example of resolving structural heterogeneity
8.2 Example of handling synonym conflict


API      Application Programming Interface
CDM      Common Data Model
DB       Database
DBMS     Database Management System
DBS      Database System
DDB      Distributed Database
DDBMS    Distributed Database Management System
DOM      Document Object Model
DTD      Document Type Definition
FDBS     Federated Database System
FLWR     For-Let-Where-Return
GUI      Graphical User Interface
HTML     HyperText Markup Language
JDBC     Java Database Connectivity
JDOM     Java Document Object Model
JXC      Java XML Connectivity
KS       Knowledge Server
LAV      Local-As-View
MDBS     Multi-database System
MDE      Metadata Extractor
MVP      Master View Parser
OEM      Object Exchange Model
QP       Query Processor
SAX      Simple API for XML
SGML     Standard Generalized Markup Language
SISSD    System to Integrate Structured and Semi-structured Databases
SQL      Structured Query Language
SSD      Schema Structure Definition
SSDP     Schema Structure Definition Parser
UDF      User-Defined Function
URL      Uniform Resource Locator
W3C      World Wide Web Consortium
XDSDL    XML Data Source Definition Language
XMKB     XML Metadata Knowledge Base
XMKBML   XML Metadata Knowledge Base Mapping Language
XML      Extensible Markup Language
XQIS     XQuery Internal Structure


Introduction

1.1 Motivation of the Research

Users and application programs in a wide variety of businesses today are increasingly requiring the integration of multiple distributed autonomous heterogeneous data sources [86, 130]. The continuing growth and widespread popularity of the Internet mean that the collection of useful data sources available for public access is rapidly increasing both in number and size. Furthermore, the value of these data sources would in many cases be greatly enhanced if the data they contain could be combined, "queried" in a uniform manner (i.e. using a single query language and interface), and subsequently returned in a machine-readable form. For the foreseeable future, much data will continue to be stored in relational database systems because of the reliability, scalability, tools and performance associated with these systems [68, 133]. However, due to the impact of the web, there is an explosion in complementary data availability: this data can be automatically generated by web-based applications or can be human-coded [102]. Such data is called semi-structured data, which means that although the data may have some structure, the structure is not regular or complete as is the case with data held in traditional database management systems (see [9] for a survey on semi-structured data). In the domain of semi-structured data, the eXtensible Markup Language (XML) is arguably the major data representation language as well as data exchange format. XML has a W3C specification [4] that allows creation and transformation of a semi-structured document conforming to its XML syntax rules which has no referenced DTD or XML schema. Such a document has metadata buried inside the document and is called a well-formed XML document. Well-formed XML documents simply mark up pages with descriptive tags. They do not need to describe or explain what these tags mean. In other words, a well-formed XML document does not need a DTD or XML schema, but it must conform to the XML syntax rules. If all tags in a document are correctly formed and follow XML guidelines, then the document is considered well-formed. The metadata content of an XML document enables automated processing, generation, transformation and consumption of the semi-structured data in the document by applications. Much interesting and useful data can be published as a well-formed XML document by web-based applications or by human-coding.
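The distinction between well-formed and schema-valid documents can be illustrated with a minimal sketch. It assumes Python's standard `xml.etree.ElementTree` parser, which checks syntax only and does not validate against a DTD or XML Schema; the function name `is_well_formed` and the sample `bib` documents are illustrative and not taken from the thesis.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text obeys the XML syntax rules (no DTD or schema needed)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Illustrative documents: neither references a DTD or an XML Schema.
good = "<bib><book><title>Data on the Web</title></book></bib>"
bad = "<bib><book><title>Data on the Web</book></bib>"  # <title> is never closed

print(is_well_formed(good))
print(is_well_formed(bad))
```

The first document parses because every tag is properly nested and closed; the second is rejected on syntax alone, without any knowledge of what the tags mean.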

Hence, building a data integration system that provides a unified method of access to semantically and structurally diverse data sources is highly desirable as it will be able to link structured data residing in relational databases and semi-structured data held in well-formed XML documents [73, 101]. These XML documents can be XML files on local hard drives or documents held on remote web servers. Such a data integration system will have to find structural transformations and semantic mappings that result in correct merging of the data and allow users to query the resulting so-called mediated schema [100]. This linking is a challenging problem since the pre-existing databases concerned are typically autonomous and located on heterogeneous hardware and software platforms. This means it is necessary to resolve conflicts caused by the heterogeneity of the data sources which can occur between data models, schemas or schema concepts. Consequently, mappings between entities in different sources representing the same real-world objects have to be defined. The main difficulty in this process is that the related data in different sources may be represented in different formats and in incompatible ways. For instance, bibliographical databases of different publishers may use different formats for authors' or editors' names (e.g. full name or separated first and last names), or different units for prices (e.g. dollars, pounds or euros). Moreover, the same expression may have a different meaning, or the same meaning may be specified by different expressions. This means that syntactical data and metadata alone cannot provide sufficient semantics for all potential integration purposes. As a result, the data integration process is often very labour-intensive and demands more computing expertise than most application users have. Therefore, semi-automated approaches are the most promising way forward, where mediation engineers are given an easy to use tool to describe mappings between the integrated (integrated and master are used interchangeably in this thesis) view and local schemas. This produces an integrated schema which is a uniform view over all the participating local data sources [148].

In the thesis we use interchangeably the terms mediated, integrated, master and global to describe the global view created by the integration process.
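The name-format and price-unit discrepancies mentioned above are the kind of conflict that small conversion functions can bridge. The following is a rough sketch only: the function names, the sample name and the exchange rates are all hypothetical placeholders, and nothing here is taken from the SISSD implementation.

```python
from typing import Tuple


def full_name(first: str, last: str) -> str:
    """Merge separated first and last names into one full-name value."""
    return f"{first.strip()} {last.strip()}"


def split_name(full: str) -> Tuple[str, str]:
    """Split a full name at its last space into (first, last)."""
    first, _, last = full.strip().rpartition(" ")
    return first, last


# Placeholder conversion rates, invented for illustration only.
RATES_TO_POUNDS = {"pounds": 1.0, "dollars": 0.5, "euros": 0.7}


def to_pounds(amount: float, unit: str) -> float:
    """Convert a price in the given unit to pounds using the placeholder rates."""
    return round(amount * RATES_TO_POUNDS[unit], 2)


print(full_name("Jane", "Smith"))
print(split_name("Jane Smith"))
print(to_pounds(10.0, "dollars"))
```

A mapping between two schemas would then record not only which elements correspond but also which function, if any, has to be applied when values move between them. Choosing that function is the step that needs human judgment.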

XML is becoming the de-facto standard format to exchange information over the internet. The advantages of XML as an exchange model, such as rich expressiveness, clear notation and extensibility, make it an excellent candidate to be a data model for an integrated schema. As the importance of XML has increased, a series of standards has grown up around it, many of which were defined by the World Wide Web Consortium (W3C). For example, the XML Schema language provides a notation for defining new types of XML elements and XML documents. XML with its self-describing hierarchical structure and associated language XML Schema provide the flexibility and expressive power needed to accommodate distributed and heterogeneous data. At the conceptual level, the data can be visualized as trees or hierarchical graphs.
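The tree view of an XML document can be made concrete by listing the distinct root-to-element paths it contains. The sketch below is one simple way to do this with Python's standard library; it is not the path-generation algorithm of the thesis, and the function name `element_paths` and the sample `bib` document are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List


def element_paths(xml_text: str) -> List[str]:
    """Return the distinct root-to-element paths of a document, in first-seen order."""
    root = ET.fromstring(xml_text)
    paths: List[str] = []

    def walk(node: ET.Element, prefix: str) -> None:
        path = f"{prefix}/{node.tag}"
        if path not in paths:  # repeated siblings contribute a path only once
            paths.append(path)
        for child in node:
            walk(child, path)

    walk(root, "")
    return paths


# Illustrative document: the second book has no author element.
doc = (
    "<bib>"
    "<book><title>T1</title><author>A1</author></book>"
    "<book><title>T2</title></book>"
    "</bib>"
)

for p in element_paths(doc):
    print(p)
```

Because the structure is read off the document itself, this works for well-formed documents that carry no DTD or XML Schema, and irregular records (such as the book without an author) simply contribute fewer paths.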

This thesis concentrates on the problem of integrating and querying a multiplicity of distributed heterogeneous structured data residing in relational databases and semi-structured data sources held as well-formed XML documents.

1.2 Problem Statement

A vast and growing amount of heterogeneous data sources is available to institutions or companies. As a result, integration of such data sources in the public domain is inevitable. Therefore, integrating and querying heterogeneous data sources is a fundamental problem in data management [25, 52]. The problem is concerned with building data integration systems, which provide a unified view over heterogeneous data sources. Such a unified view is structured according to a so-called mediated schema (often referred to as a global schema), which describes the contents of the data sources and exposes the aspects of the data that might be of interest to the user. The reason for this is that one of the principal goals of a data integration system is to free the user from having to know about the specific data sources and their structure in order to interact with them [35, 119]. A mediated schema is a virtual representation of the data available to its user in the integrated system (in the sense that the data in the local data sources need not conform to its structure). As a consequence, the data integration system must first reformulate a user query into a query that refers directly to the schemas in the data sources. In order for the system to be able to reformulate a user query, it needs to have a set of data source descriptions, specifying the mapping between the elements in the data sources and the elements in the mediated schema. These descriptions specify the relationship between elements.

In this context, providing a reasonable structured and semi-structured data integration framework for a user to effectively integrate and query distributed heterogeneous structured data residing in relational databases and semi-structured data held in well-formed XML documents has become a challenge for database integration researchers. There is a lack of fully automated schema-mapping processes, and a high degree of logical heterogeneity between the data sources. Another problem impeding data integration is the query translation process, which is one of the most important problems in the design of a data integration system, as it enables the system to reformulate a query posed in terms of the global schema into a set of queries, suited to the local data sources. Thus, tools are needed to mediate between user queries and heterogeneous data sources which transform such queries into local queries. Doing these tasks manually is not only time consuming but also error prone. Hence, methods for simplifying heterogeneous data source integration would be of great theoretical and practical importance. Therefore, our objective is to facilitate the task of a designer building an XML data integration system. In general, building data integration systems requires the designer to address several issues [87]. In this thesis, we concentrate on two basic issues:

1. Specifying the mappings between the global schema and the local data sources.

2. Processing queries expressed against the global schema into queries reflecting local schemas.
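The two issues can be pictured together in a small sketch: a table of mappings from global-schema paths to local elements, and a function that uses it to rewrite a global path into one local query per source (a path expression for an XML source, an SQL statement for a relational one). This is a deliberately simplified illustration under stated assumptions. The source names, paths, table and column names are hypothetical, and the thesis's own mapping format and translation process are described in its later chapters, not here.

```python
from typing import Dict, List, NamedTuple


class LocalTarget(NamedTuple):
    source: str    # hypothetical data source name
    kind: str      # "xml" or "relational"
    location: str  # local path (xml) or "table.column" (relational)


# Issue 1: mappings between global schema paths and local source elements.
MAPPINGS: Dict[str, List[LocalTarget]] = {
    "/library/book/title": [
        LocalTarget("bib_xml", "xml", "/bib/book/title"),
        LocalTarget("shop_db", "relational", "books.title"),
    ],
}


# Issue 2: rewriting a query over the global schema into local queries.
def to_local_queries(global_path: str) -> Dict[str, str]:
    """Return one local query string per source that maps the given global path."""
    queries: Dict[str, str] = {}
    for target in MAPPINGS.get(global_path, []):
        if target.kind == "xml":
            queries[target.source] = target.location
        else:
            table, column = target.location.split(".")
            queries[target.source] = f"SELECT {column} FROM {table}"
    return queries


for source, query in to_local_queries("/library/book/title").items():
    print(source, "->", query)
```

The user only ever names the global path; which sources are consulted, and in which query language, is decided entirely by the stored mappings.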

1.3 Hypothesis, Aims and Objectives

In our research, the main focus is on integrating and querying distributed heterogeneous structured and semi-structured data sources. Our hypothesis is that:


It is possible to integrate and query the distributed heterogeneous structured data residing in relational databases and semi-structured data held in well-formed XML documents which can be found on a local hard drive or remote web servers, by building in a bottom-up approach a dynamic XML Metadata Knowledge Base (XMKB) of data source meta-data resolving structural and semantic conflicts in the data that is used in rewriting a user query over a chosen view into sub-queries which fit each local data source, by using the mapping information stored in the XMKB.

This thesis shows how to mediate distributed heterogeneous structured and semi-structured data sources in a mediation architecture which enables users to query multiple structured and semi-structured data sources in a uniform manner. Specifically, our goals are to:

1. Facilitate the designer effort involved in building structured and semi-structured data integration systems.

2. Design a system capable of partially automating the integration of distributed heterogeneous structured and semi-structured data sources.

3. Resolve the logical heterogeneity, such as naming, structural, and semantic conflicts which may occur between the schemas. Thus a solution which overcomes the logical heterogeneity problem is needed.

4. Enable transparent querying of all data sources participating in the integration system without the users needing a detailed knowledge of the underlying data sources, their location and their structure. Thus, formulating a method for translating a user query into local queries is desired.
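The bottom-up, incremental knowledge base named in the hypothesis can be pictured as an XML document that grows as sources are registered. The sketch below is an assumption-laden illustration: the element and attribute names (`knowledge_base`, `source`, `mapping`, `index`, `path`, `function`) and the sample locations are invented here and do not reproduce the actual XMKB structure, which the thesis defines in a later chapter.

```python
import xml.etree.ElementTree as ET


def new_knowledge_base() -> ET.Element:
    """Create an empty knowledge base root (hypothetical element name)."""
    return ET.Element("knowledge_base")


def register_source(kb: ET.Element, name: str, kind: str, location: str) -> ET.Element:
    """Add one data source entry; an existing entry with the same name is replaced."""
    for existing in kb.findall("source"):
        if existing.get("name") == name:
            kb.remove(existing)
    return ET.SubElement(kb, "source", name=name, type=kind, location=location)


def add_mapping(source: ET.Element, index: int, local_path: str, function: str = "") -> ET.Element:
    """Link a global element (identified by an index number) to a local path."""
    attrs = {"index": str(index), "path": local_path}
    if function:
        attrs["function"] = function  # name of a conflict-handling function
    return ET.SubElement(source, "mapping", attrs)


kb = new_knowledge_base()
bib = register_source(kb, "bib_xml", "xml", "file:///data/bib.xml")
add_mapping(bib, 1, "/bib/book/title")
shop = register_source(kb, "shop_db", "relational", "db://localhost/shop")
add_mapping(shop, 1, "books.title")

print(ET.tostring(kb, encoding="unicode"))
```

Each call to `register_source` touches only the entry for that source, so sources can be added, replaced or removed one at a time without rebuilding the rest of the knowledge base.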


1.4 Achievement of the Research

The importance of this research lies in its demonstration of the feasibility of building an XML Metadata Knowledge Base (XMKB), in a bottom-up fashion by extracting and merging incrementally the metadata of the data sources, and its demonstration of the benefit of this XMKB in mediating user queries posed over the global schema into local queries on the distributed heterogeneous data sources, by translating such queries into sub-queries which are appropriate to each local data source. The main contributions of this thesis are:

1. S in ce fully a u to m atic sc h e m a m a p p in g g en eratio n is infeasible, a se m i-a u to m a tic a p p ro a c h is d e m o n stra te d b ased on an a ssistin g tool w h ich red u c es th e d e sig n e r e ffo rt req u ire d to b u ild in teg ratio n sy stem s lin k in g stru c tu re d an d sem i-stru c tu re d data. A so lu tio n to o v e rc o m e th e h e te ro g e n e ity p ro b le m is fo rm u lated . T w o im p o rtan t task s w ere d e v elo p ed to so lv e th e p ro b lem : (1) estab lish in g a p p ro p ria te m ap p in g s b e tw e e n th e g lo b al sc h em a and the sch em as o f the local d ata so u rces; (2 ) u sers q u eried the d istrib u ted h e te ro g e n e o u s stru c tu re d an d se m i-stru c tu re d d ata so u rces in term s o f th e g lo b al sc h em a , w ith a m a p p in g p ro cess and qu ery tran sla tio n p ro ce ss fo rm u lated to tra n sfo rm th ese q u eries into local queries. 2. A p ro to ty p e sy stem is d e v e lo p e d to d e m o n strate th at the ideas

ex p lo re d in the th esis are so u n d an d p ractical.

3. A b o tto m -u p a p p ro a ch is u sed to estab lish and ev o lv e the X M L M e tad a ta K n o w le d g e B ase (X M K B ) in crem en tally from the m etad a ta ex tra cted from th e d a ta sources.

4. T o o ls h av e b een d ev elo p ed w h ic h can be u sed to o v e rc o m e co n flicts, such as n am in g , stru ctu ral, and se m an tic c o n flicts w h ich m ay o c cu r b etw een th e sch em as.


5. A mapping is established between the global schema elements and the elements of each local data source schema, linking elements with the same meaning by using a unique index number generated automatically for the global schema elements.

6. The design of the XML Metadata Knowledge Base (XMKB) to capture:

a) The mapping information between the global schema elements and the local data sources' elements,

b) The function names of the functions handling semantic and structural discrepancies,

and to assist the Query Processor (QP) in generating sub-queries for relevant local data sources.

7. A software tool has been designed and built which extracts metadata from data sources to build the Schema Structure Definition (SSD) for these data sources. This tool can be applied to relational databases, well-formed XML documents which have no referenced DTDs or XML schemas, and also XML documents with referenced DTDs or XML schemas.

1.5 Organization of the Thesis

This section presents an overview of the thesis' organization. The first chapter has presented an introduction to the research undertaken, its motivations and the hypothesis to be tested, and has highlighted the aims and objectives of the research and its original achievements.

Chapter 2: Background and survey of the state-of-the-art

This chapter presents an overview of the work in the field of integrating distributed heterogeneous data sources and how it relates to this thesis.


Chapter 3: XML and related technologies

This chapter presents an overview of XML and related technologies.

Chapter 4: The SISSD data integration system

This chapter introduces the main ideas of the thesis. It presents a brief description of the motivation of this work, and describes our approach and its system architecture. In addition, it describes the logical heterogeneity problem, and introduces an application example which is used throughout the thesis to show how the integration is accomplished by the system.

Chapter 5: The mediation process

This chapter details the mediation process, which is the first part of our approach. It is a basic idea of the thesis, as it is proposed as a tool to overcome the heterogeneity problems which may occur among the data sources.

Chapter 6: The query translation process

This chapter details the second important point in the thesis, the query translation process, which is an integral part of the mediation layer of the system. It gives a brief introduction to the query translation task in data integration systems, and presents the query translation process developed in this work. Finally, it gives some examples of query translations.

Chapter 7: The SISSD implementation

This chapter covers the implementation of the proposed architecture. It presents the implementation of the metadata extracting process. It also presents the implementation of the processes used in creating an XMKB. In addition, it introduces the development of the query parsing and translating processes.


Chapter 8: Evaluation & Discussion

This chapter focuses on the evaluation of the prototype system and contains a critical assessment of our research approach and its contribution.

Chapter 9: Summary, conclusion and future work

This chapter concludes the thesis with a summary of the accomplishments and issues to be considered in the future.


Background and survey of the state-of-the-art

The integration of data sources poses many challenges due to differences in data management systems, data models, query and data manipulation languages, data types, format (structured, semi-structured), representation, and semantics. This chapter discusses related work and the basic issues affecting the integration of heterogeneous distributed data sources. Firstly, we give an overview of the field of distributed heterogeneous databases. Secondly, since the main topic of this work is querying and integrating data from a network of data sources, we present the approaches for solving this problem. Thirdly, we give an overview of data interoperability. Next, we present a detailed survey on data integration. Finally, we summarize related work on querying and integrating heterogeneous data sources.


2.1 Distributed heterogeneous databases

A database integrates and stores related data in an organized manner. A database system (DBS) [48] consists of software, called a database management system (DBMS), one or more databases that it manages, and any associated application software utilizing the database contents. A DBMS is the software that handles all access to the database. A DBS may be either centralized or distributed. A centralized DBS consists of a single centralized DBMS managing a single database on the same computer. A distributed DBS consists of a single distributed DBMS (DDBMS) managing multiple databases. The databases may reside on a single computer system or on multiple computer systems that may differ in hardware and system software.

A Distributed Database (DDB) is defined as a collection of multiple, logically interrelated data distributed over different computers of a computer network [23, 38, 45, 62, 122]. The physical distribution does not necessarily imply that the computer systems are geographically far apart; they could actually be in the same building or even in the same room. It simply implies that communication between them is done over a network instead of through shared memory. Each node of the network has autonomous capability, performs local applications and may participate in the execution of some global applications that require accessing data at several sites. Distributed databases [64] emerged as a merger of two technologies: (1) database technology, and (2) network and data communication technology. They also met the requirement of organizations interested in the decentralization of processing while achieving an integration of the information resources at the logical level within their geographically distributed systems of databases.

A particular property of a distributed database is that it can be homogeneous or heterogeneous [136]. A homogeneous distributed database (simply called a distributed database) is one in which all the physical components run on the same distributed database management system, and the distributed database system supports a single data model and query language with a single schema.

Conversely, database systems that provide interoperation and varying degrees of integration among multiple databases of different types have been termed heterogeneous distributed database systems (simply called heterogeneous databases). They consist of database systems which differ physically and logically, and which have different data models, manipulation languages, and schemas. Despite these databases being independently created and managed, they must cooperate and interoperate. Users need to access and manipulate data from several databases, and applications may require data from a wide variety of the independent databases. Therefore, a new system architecture is required to manipulate and manage distinct and multiple databases in a transparent way.

There are a number of factors that differentiate types of DDBMS. These factors characterize a set of multiple DBSs in three orthogonal dimensions: distribution, heterogeneity, and autonomy [32, 62, 121, 134-136]. These dimensions characterize systems in which multiple databases may be put together and be managed by multiple DBMSs. We introduce each of these dimensions below.

The distribution dimension specifies how the data of a DDBS is distributed among multiple sites in a computer network.

Heterogeneity is concerned with the differences between the local DBSs comprising the DDBS. The types of heterogeneity are caused by technological differences and independent design. These may be classified as system heterogeneity and logical heterogeneity [71]. System [...] management system (including data models, languages, transaction management) and communication systems. Logical heterogeneity covers differences in the way the real world is modeled in the databases (i.e. differences in schema and data representation).

Autonomy refers to the distribution of control, not of data. It indicates the degree to which individual DBSs can operate independently [90]. Autonomy is a function of a number of factors such as whether the component systems exchange information, whether they can independently execute transactions, and who is allowed to modify them. Several kinds of autonomy (design, communication, execution and association autonomy) can be identified [136].

2.2 A Taxonomy for Integrating Heterogeneous Data Sources

Integration of heterogeneous data sources continues to receive much attention from the research community [19, 42, 46, 47, 74, 104, 107, 150]. Information systems integration is a complex problem since information systems comprise data, processes and applications. As a consequence their integration must be done at each level [53]. In the context of this thesis, we consider only data integration. Since the main topic of this work is querying and integrating data from a network of data sources, we present other proposed solutions for this problem and highlight their strengths and shortcomings. We then consider a particular approach, Mediation Systems, and characterize it in more detail.

We first distinguish between materialized and virtual approaches. They are called in [144] the eager or in-advance approach and the lazy or on-demand approach. In the materialized approach, data coming from the local data sources are integrated and stored in a single new database. All queries then operate on this comprehensive database. In the virtual approach, by contrast, data remains in the local data sources. Thus, queries operate directly on the local data sources and data integration takes place during query processing by combining results. As a consequence, the two approaches have the following advantages and disadvantages:

• In the materialized approach, data must first be prepared before queries can be submitted. The participating data sources are (manually) analyzed; a static view over the data is defined, the local data is used to populate a new integrated database conforming to the static view and queries are formulated against this view. As a consequence, new data sources cannot be easily integrated and made available for querying. This approach is suitable for applications which require specific, exact portions of the available data which are mostly static (for example, financial transactions). A query is evaluated directly using the materialized database and as a result query processing can be optimized for this database. Additionally, there is no need to access the underlying data sources, so connection costs are non-existent. However, if the local data is dynamic, updating of the integrated DB is hard. Also, some of the materialized data may never be accessed.

• For the virtual approach, a query must first be analyzed in order to find data sources which can answer it, and then it is split into sub-queries which finally are adjusted according to the query capabilities of each data source. As a consequence, query processing is dependent on the availability of the data sources, their connection times and query performance. Query optimization opportunities are limited and an important requirement for this approach is that data sources accept ad-hoc queries. Its main advantage is that new data sources can be easily made available for querying. This approach is suitable for users with "unpredictable needs" [144], i.e. if users have a variety of information needs. It is suited to dynamic databases as the processing occurs on the local data, and there is no need to preprocess data not required by a query.
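
The steps of the virtual approach (find the sources that can contribute, split the query into sub-queries, adjust each sub-query to its source, and combine the results) can be illustrated with a minimal Python sketch. It is not the system developed in this thesis: the names `Source`, `projector` and `virtual_query`, the sample data, and the assumption that every source exposes a shared key attribute are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Source:
    """A hypothetical local data source as seen by the integration layer."""
    name: str
    attributes: frozenset   # global-schema attributes this source can answer
    fetch: Callable         # fetch(columns) -> list of row dicts


def projector(rows):
    """Build a toy fetch function that projects in-memory rows onto columns."""
    return lambda columns: [{c: r[c] for c in columns} for r in rows]


def virtual_query(wanted, key, sources):
    """Answer a global query at request time; nothing is materialized in advance."""
    merged = {}
    for src in sources:
        # Analyze the query: which requested attributes can this source answer?
        usable = [a for a in wanted if a != key and a in src.attributes]
        if key not in src.attributes or not usable:
            continue  # this source cannot contribute to the answer
        # Adjust the sub-query to what the source actually offers.
        sub_query = [key] + usable
        # Combine the partial results on the shared key.
        for row in src.fetch(sub_query):
            merged.setdefault(row[key], {}).update(row)
    return [merged[k] for k in sorted(merged)]


# Assumed sample data: one relational-style source and one XML-style source.
staff_db = [{"id": 1, "name": "Ann", "salary": 10},
            {"id": 2, "name": "Bob", "salary": 20}]
staff_xml = [{"id": 1, "office": "A1"}, {"id": 3, "office": "B2"}]

sources = [
    Source("relational", frozenset({"id", "name", "salary"}), projector(staff_db)),
    Source("xml", frozenset({"id", "office"}), projector(staff_xml)),
]

answer = virtual_query(["name", "office"], "id", sources)
```

Because each source is contacted only when a query arrives, adding a further `Source` to the list makes it available for querying immediately, which is the main advantage noted above; the price is that every answer depends on all contributing sources being reachable.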

In our work, we adopt the virtual approach to support read-only data integration of distributed heterogeneous structured and semi-structured data, which means a global schema is created to be used for answering user queries, and not for updating data. The number of underlying data sources linked in the integration system may increase or decrease at any time; in a materialized approach, where data is imported into a new integrated repository, this type of dynamic change cannot be easily made. The data requirement of the expected users is unpredictable and likely to vary with the resources currently linked. For these reasons, a virtual approach is more suitable as it produces a scalable system with respect to the dynamic nature of the available information resources.

Another classification of approaches for integrating heterogeneous data is based on the structure of the data. Most data sources can usually be classified into one of three categories depending on the kind of data that they are primarily designed to handle:

1. Text retrieval systems are concerned with the management and query-based retrieval of collections of unstructured text documents.

2. Structured database systems are concerned with the management of structured or strictly-typed data, i.e., data that conforms to a well-defined schema (e.g., data held in DBSs managed by DBMSs).

3. Semi-structured databases are designed to efficiently manage data that only partially conforms to a schema, or whose schema can evolve rapidly (e.g. XML documents) [9].

There are approaches which consider integrating just one kind of data such as relational databases [28, 93], or Object-Oriented databases [13, 59], or XML documents [18, 120, 148], so query formulation, processing and results accommodate only that particular kind of data. On the other hand there has been a significant interest in combining, integrating, and inter-operating between heterogeneous data that belong to different classes of data sources [86, 113]. The primary motivation for most of the work in this area is that many applications require processing of data that belongs to more than one type. For instance, a medical information system at a hospital must process doctor reports (free text documents) as well as patient records (structured relational data). Similarly, an order processing application might need to handle inventory information in a relational database as well as purchase orders received as (semi-structured) XML documents [126].

Earlier work on database integration [12, 21, 65, 79, 96, 111, 140] focussed on the integration of well-structured databases, with fixed schemas, that support powerful query languages. This thesis focusses on the integration of distributed heterogeneous structured and semi-structured data sources. For the foreseeable future, most data will continue to be stored in relational database systems because of the reliability, scalability, tools and performance associated with these systems. Additionally, much interesting and useful data can be published as well-formed XML documents; this data can be automatically generated by Web-based applications or can be human-coded. Such data is called semi-structured data due to its varying degree of structure. It can also vary between static databases and ephemeral data having a very short life. Hence, with the Web's increasing role as a data provider, building a data integration system that provides unified access to semantically and structurally diverse data sources is highly desirable, as it will link structured data residing in relational databases and semi-structured data held in well-formed XML documents, whether produced by Internet applications or human-coded.

Since we are targeting a system for querying and integrating distributed heterogeneous structured and semi-structured data sources, our work has adopted a mediation approach. There have been several integration methods which combine data from several data sources, such as the universal DBMS (UDBMS) method [54, 95], federated databases [34, 61, 111, 136], data warehouses [27, 41, 150], multi-databases [61, 71, 90, 94, 96, 109-111, 134], and the mediator method [20, 70, 99, 125, 138, 145]. While other methods are applicable to integration of structured heterogeneous data which is usually stored using a DBMS, a mediation approach is appropriate to integration of unstructured, semi-structured, and structured data. We now overview some integration approaches in more detail as classified in Figure 2.1 [54]. There are additional features that characterize these approaches which are not presented in this figure, but these will be discussed in the following subsections.

[Figure 2.1 node labels: Systems for integrating heterogeneous data sources; Materialized; Virtual; Integrated Databases; Virtual Systems; Universal DBMS; Data Warehouse; Federated DBMS; Multi-database Language Approach; Mediated Query Systems; (Meta) Search Engines]

Figure 2.1: Classification of Systems for Integrating Heterogeneous Data Sources [54].


2.2.1 Universal Database Management Systems

In a UDBMS approach data is migrated from the local systems to a unique separate DBMS. First, the global integrated schema is defined; then data from the local systems is imported into the new database and the local systems cease to operate. In this way queries can be formulated against the new database and results are presented to users. In this case, the underlying data sources are usually also DBSs and the new DBS must accommodate all types of data available in the underlying sources. Thus, the new database must be able to handle all (or many) types of information, i.e. it must be a universal DBS.

During the migration process, data from the underlying systems are extracted, transformed, integrated and stored in the central universal database. The main drawback of this approach is that existing applications for the local systems will have to be rewritten for the new database structure as the local DBs cease to exist. Moreover, the process of data migration can be very expensive, since the old data has to be transformed and often semantically enriched for the new system (the new database usually has a richer data model). Nevertheless, migration can be a good solution, for example, when users or applications need the whole functionality of a DBMS (not just the query functionality) and the old systems' applications are no longer needed [22]. Note that migration is the only materialized approach in which native data is queried, and query optimization on native data can be best achieved.

2.2.2 Data Warehouses

Data warehousing is a materialized approach. Data from the local data sources are imported into one DBMS, the data warehouse. The difference from the UDBMS is that the underlying data sources are still operational, so the data is in fact replicated deliberately in at least two DBs. First, the warehouse schema is defined; then data from the underlying sources is processed and stored in the data warehouse. The warehouse data is typically not imported in the same form and volume as it exists in the local data systems. It may be transformed, cleaned and prepared for certain analysis tasks, like data mining and OLAP (Online Analytical Processing). Data warehouses often do not make the most recent data available, since a data warehouse is usually not updated immediately after a local data source has changed because of the overheads associated with immediate updates. Thus, the warehouse stores historical data, as required by OLAP and data mining applications.

According to [144] the data warehouse approach is suitable for the following kinds of clients:

• Clients who do not need the most recent data available, since a data warehouse is usually not updated immediately after a local data source has changed;

• Clients who require historical, derived and specific information - for this reason data may need to be transformed, cleaned, aggregated and prepared for certain analytical tasks, such as data mining and OLAP (Online Analytical Processing); or

• Clients who require high query performance - since large amounts of complex data must be queried; data warehouses are optimized for the dominant business scenario but are less than optimal for others.
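
The trade-off described above, fast queries over prepared data at the cost of freshness, can be made concrete with a small Python sketch. The `Warehouse` class, its per-customer aggregation, and the sample orders are hypothetical and serve only to illustrate that a warehouse answers from its imported copy until the next refresh.

```python
class Warehouse:
    """Toy data warehouse: keeps an aggregated copy of a still-operational source."""

    def __init__(self, source_rows):
        self._source = source_rows   # the local source keeps running independently
        self._totals = {}
        self.refresh()

    def refresh(self):
        """Re-import the source, transforming rows into per-customer totals."""
        totals = {}
        for row in self._source:
            totals[row["customer"]] = totals.get(row["customer"], 0) + row["amount"]
        self._totals = totals

    def total(self, customer):
        """Answer from the materialized copy only; the source is never contacted."""
        return self._totals.get(customer, 0)


orders = [{"customer": "c1", "amount": 10}, {"customer": "c1", "amount": 5}]
warehouse = Warehouse(orders)
before = warehouse.total("c1")

orders.append({"customer": "c1", "amount": 7})   # the local source changes
stale = warehouse.total("c1")                    # warehouse still reports the old total

warehouse.refresh()                              # deferred, batched update
fresh = warehouse.total("c1")
```

Queries are served from the pre-aggregated totals without touching the source, which suits clients who value query performance and derived data over having the most recent values.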

2.2.3 Metasearch Engines

Regarding the querying of unstructured distributed sources, search engines and metasearch engines have gained importance in recent years, mainly because of the development of the Web. Search engines are systems which accept as queries only natural language keywords (or simple combinations of them) and return documents as answers. The main characteristic of search engines is that data can be easily made available for querying and queries can be formulated in a simple way.

Search engines are characterized by:

• search efficiency, which means how fast the results are returned, and

• search effectiveness, which indicates how good the results are, or "the ability to retrieve what the user wants to see" [129].

In order to achieve high effectiveness, search engines use heuristics for finding the meaning of input queries and for retrieving documents which may match them. However, if we consider a heterogeneous environment like the Web, it is very difficult to find the right meaning of queries and in such cases search engines perform quite poorly.

To increase effectiveness, different search engines are combined to form metasearch engines. Their users formulate queries against a uniform interface; these queries are processed (for example, stop words are eliminated) and split into sub-queries which are then sent to the individual search engines. Finally, the results are collected, combined and presented in a unified way. Examples are SavvySearch [57] and MetaCrawler [131, 132]. Metasearch engines do not prepare the data to be queried. They simply use the query interfaces of the underlying search engines and prepare the input queries for them. They also need to implement suitable heuristics for combining the results from the different sources. Metasearch engines are thus examples of systems for querying data available in a network of data sources. However, the underlying sources must be unstructured; for this reason metasearch engines are not suitable for querying sources where the data is structured in any way, and would be inappropriate when linking structured and semi-structured data.
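
The metasearch pipeline described above (prepare the query, forward it to each engine, combine the ranked results) can be sketched in Python as follows. This is an illustration only: the toy engines, the stop-word list, and the reciprocal-rank merging heuristic are assumptions made for the example, and they are not claimed to be what SavvySearch or MetaCrawler do.

```python
STOP_WORDS = {"the", "a", "an", "of", "in", "and"}   # assumed, deliberately tiny


def preprocess(query):
    """Lower-case the query and eliminate stop words."""
    return [w for w in query.lower().split() if w not in STOP_WORDS]


def make_engine(docs):
    """Build a toy search engine that ranks documents by matching keywords."""
    def search(keywords):
        scored = []
        for doc_id, text in docs.items():
            words = set(text.lower().split())
            score = sum(1 for k in keywords if k in words)
            if score:
                scored.append((doc_id, score))
        scored.sort(key=lambda pair: (-pair[1], pair[0]))
        return [doc_id for doc_id, _ in scored]
    return search


def metasearch(query, engines):
    """Send the prepared query to every engine and merge the ranked lists."""
    keywords = preprocess(query)
    scores = {}
    for search in engines:
        for rank, doc_id in enumerate(search(keywords)):
            # Assumed merging heuristic: sum of reciprocal ranks across engines.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank + 1)
    return sorted(scores, key=lambda d: (-scores[d], d))


engine_a = make_engine({"d1": "integration of data sources", "d2": "search engines"})
engine_b = make_engine({"d1": "data integration survey", "d3": "query data"})

results = metasearch("the integration of data", [engine_a, engine_b])
```

Note that the sketch only handles keyword queries over free text; nothing in it can exploit a schema, which reflects why this class of system is unsuitable for structured and semi-structured sources.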
