University of New Hampshire
University of New Hampshire Scholars' Repository
Doctoral Dissertations
Student Scholarship
Winter 1997
A metric-based scientific data model for knowledge
discovery
David T. Kao
University of New Hampshire, Durham
Follow this and additional works at:
https://scholars.unh.edu/dissertation
This Dissertation is brought to you for free and open access by the Student Scholarship at University of New Hampshire Scholars' Repository. It has been accepted for inclusion in Doctoral Dissertations by an authorized administrator of University of New Hampshire Scholars' Repository. For more information, please [email protected].
Recommended Citation
Kao, David T., "A metric-based scientific data model for knowledge discovery" (1997). Doctoral Dissertations. 1994.
INFORMATION TO USERS
This manuscript has been reproduced from the microfilm master. UMI
films the text directly from the original or copy submitted. Thus, some
thesis and dissertation copies are in typewriter face, while others may be
from any type o f computer printer.
The quality of this reproduction is dependent upon the quality of the
copy submitted. Broken or indistinct print, colored or poor quality
illustrations and photographs, print bleedthrough, substandard margins,
and improper alignment can adversely afreet reproduction.
In the unlikely event that the author did not send UMI a complete
manuscript and there are missing pages, these will be noted. Also, if
unauthorized copyright material had to be removed, a note will indicate
the deletion.
Oversize materials (e.g., maps, drawings, charts) are reproduced by
sectioning the original, beginning at the upper left-hand comer and
continuing from left to right in equal sections with small overlaps. Each
original is also photographed in one exposure and is included in reduced
form at the back of the book.
Photographs included in the original manuscript have been reproduced
xerographically in this copy. Higher quality 6” x 9” black and white
photographic prints are available for any photographs or illustrations
appearing in this copy for an additional charge. Contact UMI directly to
order.
A M
e t r i c- B
a s e dS
c i e n t i f i cD
a t aM
o d e lf o r
K
n o w l e d g eD
i s c o v e r yby
David T . Kao
B.S.. N ational C entral University. 1986
M.S.. M ichigan S ta te University. 1991
;
DISSERTATION
* r
Subm itted to the U niversity o f New H am pshire
in P artial Fulfillment of
the R equirem ents for the Degree of
I
D octor of Philosophy
in
C om puter Science
UMI Number: 9819677
Copyright 1997 by
K a o , David T .
All rights reserved.
I
1 it {i
t UMI Microform 9819677Copyright 1998, by UMI Company. All rights reserved. This microform edition is protected against unauthorized
copying under Title 17, United States Code.
j
UMI
f 300 North Zeeb Road
J
ALL RIGHTS RESERVED
© 1997
The dissertation has-been exam ined and approved.
D issertation Co-Directbr, R. Daniel Bergeron Professor of C om puter Science
D issertation Co-Dirpctor, Ted M. Sparr Professor of C om puter Science
/?
R aym ond Greenlaw
Associate Professor of C om puter Science
Georges G. G rinstein
Professor of C om puter Science University of M assachusetts, Lowell
Samuel D. Shore
Professor of M athem atics
IV
D ed ication
To iny fa th e r. Y u-San K ao. a n d m y m o th e r. Fong-M ei Lee. for th e ir u n d e r s ta n d in g , p atien ce, a n d love.
A cknow ledgem ents
M any p e o p le have c o n trib u te d , in big a n d sm a ll ways, to th e c o m p le tio n o f this d is s e rta tio n a n d as im p o rta n tly , to th e q u a lity o f m y life d u rin g th is jo u rn e y . I would like to take a d v a n ta g e o f th is o p p o r tu n ity to acknow ledge th e ir effort a n d e x p re ss m y a p p re c ia tio n .
P ro fe sso r R . D an iel B ergeron has b e e n a c o n s ta n t so u rc e o f in sp ira tio n . His vision played a p iv o tin g ro le in id en tify in g m y re search to p ic a n d d e fin in g m y d isse rta tio n scope. I ca n alw ays rely o n h im for s o u n d advice a n d h o n est o p in io n .
P ro fe sso r T ed M. S p a rr in tro d u ce d m e to th e field o f scien tific d a ta b a s e system s. His e x te n siv e k n o w led g e in d a ta b a s e a n d s ta tis tic s h e lp e d m e to p u t my research into p e rsp e c tiv e a n d p re v e n te d m e from re in v e n tin g w heels.
; T h a n k s to P ro fe sso r R ay m ond G reen law th is d is s e r ta tio n h a s m any fewer
incon-' sistencies a n d e d ito ria l p itfalls. His th o ro u g h n ess has set a new s t a n d a r d in w ritin g for me.
All th e re m a in in g in c o n siste n c ie s an d m ista k es a re solely m ine.
P ro fe sso r G eorges G . G rin ste in has p ro v id ed m a n y v a lu a b le com m ents a n d fresh
; ideas for m e to p o n d e r. It was a d in n e r co n v e rsatio n w ith h im a n d D a n in A tla n ta . G eorgia
th a t in stille d th e co n fid e n ce in m e to p u rsu e m y d is s e r ta tio n re se a rc h .
P ro fe sso r S am u el D. S h o re in tro d u c e d m e to th e fields o f logic, topology, a n d m e tric I theory, w h ich serv e as th e th e o re tic a l fo u n d a tio n o f th is d is s e r ta tio n . He is not only my
j m e n to r in m a th e m a tic s , b u t also a very d e a r friend.
D u rin g m y P h .D . stu d y . I have h ad th e o p p o r tu n ity to w ork w ith P ro fessor P ila r de la T o rre o n trie co m p le x ity an aly sis. W h ile th is c o lla b o r a tio n d id n o t d irec tly c o n trib u te to m y d is s e r ta tio n re se a rc h , it h a s been a very re w a rd in g ex p e rie n c e .
I enjo y ed th e o p p o r tu n ity to c o a u th o r w ith M ike C u llin a n e 011 two occasions. T h e m any h o u rs o f fun I s p e n t w ith h im a n d S am in o u r w eekly to p o lo g y se m in a rs will be m issed a lot.
I w ould a lso like to th a n k th e sta ff o f c o m p u te r scien ce d e p a rtm e n t for th e ir ad-k
? rn in istra tiv e s u p p o r t. L in d a A ndrew s is alw ays p le a s a n t a n d a tte n tiv e to an y m u n d a n e
p ro b lem I b ro u g h t to her. G in a Ross a n d A n d y E van s h av e p ro v id e d excellent h ard w are t
‘ a n d so ftw are serv ice.
M an y fellow g ra d u a te s tu d e n ts have en ric h ed m y life on a d a ily basis. T h e y will be re m e m b e re d a n d I believe t h a t our s e p a ra te p a th s w ill cro ss in th e future.
sup-p o rtiv e. A lth o u g h we a r e g eo g rasup-p h ically se sup-p a ra te d by th o u s a n d s o f m iles, sh e never fails to m ake h e r p resen ce fe lt th ro u g h o u r m an y phone calls.
L ast b u t n o t th e least. I have to th a n k m y p a r e n ts . T h e y have fa ith in me. It is th ro u g h th e ir e n c o u ra g e m e n t a n d s u p p o rt th a t I was a b le to com e to th e S ta te s to p u rsu e m y g ra d u a te stu d y . T h e y alw ays give m e the b e st: to th e m . I a m forever in d e b t.
V I !
Table o f C on ten ts
D e d ic a tio n iv A ck n o w le d g e m e n ts v L ist o f F ig u res x List o f T ables x ii A b stra c t x iii 1 In tr o d u ctio n 1 1.1 S cientific D a ta ... 1 1.2 T a x o n o m y o f C o n v e n tio n a l D a ta M o d e l s ... 3 1.3 In te r -In s ta n c e R e l a t i o n s h i p s ... 1 1.4 M odels o f In te r -In s ta n c e R e l a t i o n s h i p s ... (i 1.5 T h e sis O u tlin e ... 82 S cien tific D a t a M o d els 10 2.1 I n t r o d u c t i o n ... 10
2.2 Im p le m e n ta tio n P rim itiv e s a n d P h y sical D a ta M o d els ... 11
2.2.1 G e o m e t r i e s ... 12 2.2.2 T o p o lo g ie s ... 13 2.2.3 In d e x a b le T o p o lo g ie s ... 15 2.2.4 S tr u c tu r e d D a t a ... 17 2.3 B asic O p e r a t i o n s ... 20 2.3.1 S u b s e t O p e r a t i o n s ... 20 2.3.2 A n a ly sis a n d E x p l o r a t i o n ... 22 2.3.3 M iscellan e o u s D a ta b a s e O p e r a t io n s ... 23 2.4 C o m p u ta tio n a l G r i d ... 24 2.4.1 T a x o n o m y o f C o m p u ta tio n a l G r i d s ... 25 2.4.2 C o m p u ta tio n a l G rid S u m m a r y ... 27
2.5 A p p lic a tio n V isu a liz a tio n S y ste m ( A V S ) ... 28
2.5.1 AVS S tr u c tu r e d D a t a ... 28 2.5.2 AVS U n s tru c tu r e d D a t a ... 2!J
v ii i
2 .5 .3 AVS S u m m a r y ... 30
2.6 F ib e r B u n d l e ... 30
2.6.1 F ib e r B u n d le M odel o f F ield D a t a ... 30
2 .6 .2 P iecew ise Field R e p r e s e n ta tio n s ... 32
2 .6 .3 C o m p a c t Field R e p r e s e n t a t i o n s ... 33 2 .6 .4 F ib e r B u n d le S u m m a r y ... 34 2.7 L a t t i c e ... 34 2.7.1 M u ltip le Views o f a D a ta S e t ... 35 2 .7 .2 D e fin itio n of L a t t i c e ... 35 2 .7 .3 L a ttic e S u m m a r y ... 37 2.8 S u m m a r y ... 37 3 M a t h e m a t i c a l P r e l i m i n a r i e s 3 8 3.1 I n t r o d u c t i o n ... 38 3.2 D is ta n c e F u n c tio n a n d M etric A x i o m s ... 38 i 3.3 C o n tin u ity a n d N e i g h b o r h o o d s ... 42 J 3.4 T o p o lo g ies a n d D istances ... 47 = 3.5 S u m m a r y ... 50 4 A M e t r i c - B a s e d S c ie n tif ic D a t a M o d e l 5 1 ! 4.1 I n t r o d u c t i o n ... 51 ! 4.2 D a ta a s F u n c t i o n s ... 51 4.2.1 D a ta S ets and F u n c t i o n s ... 52 4 .2 .2 F o rm u la tio n s o f D a ta F u n c t i o n s ... 53 4 .2 .3 F u n c tio n a l D a ta M odel ... 54 4 .2 .4 O rd e rs o f D a t a ... 55
I 4 .2 .5 I n te r -E n tity R e la tio n sh ip s a n d F u n c tio n F o r m u la t io n ... 57
4.3 F u n c tio n a l D ep en d en cy am o n g A t t r i b u t e s ... 58 4.4 P s e u d o - Q u a s im e tr ic s ... 60 4.4.1 M o t i v a t i o n ... 61 : 4 .4 .2 F ro m P se u d o -Q u a sim e tric s to P a r t i a l O r d e r s ... 62
1
4.5 O b s e rv a b le P r o p e r t i e s ... 63 4.5.1 O rd e r-T h e o re tic versus S e t - T h e o r e t i c ... 64 4 .5 .2 P ro p e rty -B a se d M e t r i c s ... 65 4 .5 .3 P r o p e r ty T o p o lo g iz a tio n ... 66 j 4 .5 .4 M e tric E v a l u a t i o n ... 67 s 4.6 V a ria te M e tric D e r i v a t i o n ... 67 4.7 D o m a in M e tric D e r i v a t io n ... 72I 4.8 C o n tin u ity for M etric V a l i d a t i o n ... 76
! 4.9 S u m m a r y ... 77 l 5 P h y s i c a l D a t a M o d e l s 79 5.1 I n t r o d u c t i o n ... 79 5.2 H ie ra rc h ic a l M etric D a ta S tr u c tu re s ... 80 5.3 M u ltip o la r M a p p i n g s ... 83
ix 5.4 M axico D i s t a n c e s ... 86 5.5 D im en sio n ality I s s u e ... 88 5.6 A verage In te rp o le D ista n c e ... !J() 5.7 C o o rd in a te V olum e R e d u c t i o n ... 62 5.8 P ro v isio n s for P s e u d o - Q u a s im e tr ic s ... 94
5.9 M u ltip o la r M a p p in g v ersus E xisting A p proaches ... 97
5.10 S u m m a r y ... 98
6 P erfo rm a n ce A n a ly sis 99 6.1 I n t r o d u c t i o n ... 99
6.2 D a ta S y n t h e s i s ... 99
6.3 D ista n c e D is trib u tio n s a n d P ro x im ity A c c u m u la tio n s ... 101
6.4 In trin s ic D i m e n s i o n a l i t i e s ... 104 6.5 C lu ste re d D a t a ... 106 6.6 D a ta S et Sizes ... 115 i 6.7 N u m b e r o f P oles ... L16 I 6.8 P ole S e l e c t i o n ... 119 £ I 6.9 C lu s te r C e n te rs a s P o l e s ... 130 % 6.10 G re e d y A lg o rith m ... 131 L 6.11 Very H igh D i m e n s i o n a l i t y ... 135 ' 6.12 S u m m a r y ... 135 7 C on clu sion s 141 7.1 O verview ... 141 7.2 C o n t r i b u t i o n s ... 142 j 7.3 F u tu re R esearch ... 143 B ib liograp h y 145
A M a th em a tic a l D e fin itio n s 151
B T h eo rem P ro o fs 152
I
i
Ii.
X
List o f Figures
L.l E R S chem a D ia g ra m s o f th e S a te llite I m a g e ... 5
1.2 D ifferent P a tte r n s o f In te r-In s ta n c e R e la ti o n s h ip s ... 6
1.3 D a ta M odels a n d P rim itiv e s for R ep resen tin g In te r-In s ta n c e R e la tio n s h ip s . 7 ; 2.1 T w o D ifferent T o p o lo g ies D erived from th e S am e G e o m e t r y ... 14
2.2 G e o m etric S pace a n d In d e x S p a c e ... 16
2.3 N o n-D efault C o n n e c tio n s ... 17
2.4 D a ta Set w ith P a t t e r n a lo n g O n e D im ension O n l y ... IS j 2.5 In d ex Space for th e D a ta S et in F igure 2 . 4 ... 19
2.6 V arious 2-D G rid s ... 26
2.7 Topologies W h ich C a n N ot Be M odeled by C o m p u ta tio n a l G r i d s ... 28
2.8 P rim itiv e C ell T y p e s ... 29
2.9 C o m p o n e n ts o f a F i e l d ... 32 [ 2.10 A B ase C o n sists o f T w o S e g m e n t s ... 33 2.11 C ross P ro d u c t o f F i e l d s ... 34 2.12 L a ttic e as a F u n c t i o n ... 36 3.1 T w o D istances o n th e S am e S e t ... 49 4.1 E x am p le o f a R ich n ess R e l a t i o n ... 57
4.2 In te r-E n tity R e la tio n s h ip s Specified by F u n ctio n F o r m u l a t i o n ... 57
I 4.3 D ifferent D o m ain G e o m e trie s on th e S am e D o m ain S p a c e ... 60
4.4 C h o o sin g a G e o m e tric C e n t e r ... 63
4.5 T h e P ro cess o f V a ria te M e tric D e r i v a t i o n ... 67
: 5.1 H ierarch ical M e tric D e c o m p o s i t i o n ... 82
^ 5.2 D istan ce I llu s tra tio n o f X ... 84
5.3 P o in t P lo t o f M 2 [ A T ] ... 85
r 5.4 C o m p a riso n b e tw e e n S p h e re s B ased on rn > a n d E - > ... 87
s 5.5 D istan ce D is trib u tio n H i s t o g r a m ... 89
5.6 N u m b e r o f P o in ts E x a m in e d a t R adius 5. 10. a n d 1 5 ... 92
5.7 A reas Void o f P o i n t s ... 93
5.8 B o u n d in g P la n es fo r A/3 [A T ]... 94
XI
6.1 E x a m p le o f D ista n c e D is trib u tio n (D D ) ... 102 (i.2 E x a m p le o f P ro x im ity A c c u m u la tio n ( P A ) ... 103 6.3 (in . n . p . c . r ) = ( 1 . . . 1 0 . 10'1. 0 .0 .0 ) a n d 5 -P o la r M a p p in g s: D D s a n d PA s . . 105 6.4 (in . n . p . c . r ) = ( 1 . . . 10. 10°. 0 .0 .0 ): D D s a n d PA s ( O v e r la y e d ) ... 106 6.5 (in . n . p . c . r ) = ( 1 . . . 10. 10°. 0 .0 .0 ) a n d 5 -P o la r M a p p in g s: D D s a n d PA s . . 107 6.6 (in. n . p. c. r ) = ( 1 0 .1 0 1. 0.0 . . . 1 .0 .1 0 0 .2 ): DD H i s t o g r a m s ... 109 6.7 (rn. n . p . c . r ) = (10. 10'1. 0.0 . . . 1 .0 .1 0 0 .2 ): P a r tia l DD H i s t o g r a m s ... 110 6.8 ( i n . n . p . c . r ) = ( 1 0 .1 0 l . 0.0 . . . 1.0.100. 2): P a r tia l D D H isto g ra m s a f te r 5-P o la r M a p p i n g s ... 112 6.9 ( i n . n . p . c . r ) = ( 1 0 .1 0 1. 0.0 . . . 1 .0 .1 0 0 .2 ): P e rfo rm a n c e o f N e a re st N e ig h b o r Q u e ries ... 113 6.10 (m . n . p . c . r ) = (10. lO '.O .O ... 1.0. 100.2): P e rfo rm a n c e o f P ro x im ity Q u e rie s 111 6.11 ( i n . n . p . c . r ) = ( 3 . r t . 2 0 . 1 0 0 .2 ). n = 10*.2 x 10‘* 10 x 10'1: E fficiency for
D a ta S et o f V arious Sizes ... 117 6.12 ( i n . n . p . c . r ) = (3. n . 2 0 . 100. 2 ). n = 10°. 2 x 10° 10 x 10°: E fficiency for
I D a ta S et o f V arious Sizes ... 118
\ 6.13 ( m . n . p . c . r ) = (10 .1 0 '* .p . 100.2). p = 0 .2 5 .0 .5 0 .0 .7 5 . C o lle c tio n I: Effi
ciency w ith R esp ec t to 1...10 P o l e s 120
6.14 ( i n . n . p . c . r ) - (10. 1 0 1. p. 100.2). p = 0 .2 5 .0 .5 0 .0 .7 5 . C o lle c tio n 2:
Effi-| ciency w ith R e sp e c t to 1...10 P o l e s ... 121
i 6.15 ( i n . n . p . c . r ) 1 (10. 10 1. p. 100.2). p = 0 .2 5 .0 .5 0 .0 .7 5 . C o lle c tio n 3:
Effi-I ciency w ith R esp ec t to 1...10 P o l e s ... 122
I
6.16 ( m . n . p . c . r ) = ( 10. 10 1. p. 100. 2). p = 0 .2 5 .0 .5 0 .0 .7 5 : Efficiency o f 4 -P o la rI M a p p in g s w ith R e sp e c t to A verage In te rp o le D i s t a n c e s ... 124
| 6.17 A verage In te rp o le D ista n c e s versus S ta n d a r d D e v i a t i o n s ... 125
6.18 A verag e In te rp o le D ista n c e s versus S ta n d a r d D e v i a t i o n s ... 127 6.19 ( i n . n . p . c . r ) = ( 1 0 .10'1. 0.75 .1 0 0 . 2): Efficiency o f 4 - P o la r M a p p in g s w ith
R e sp e c t to A verag e In te rp o le D istan ces ... 128 6.20 ( n i . n . p . c . r ) = ( 1 0 .10'1. 0 .7 5 .1 0 0 .2 ): E fficiency for 4 -p o la r M a p p in g s w ith
R e sp e c t to th e S ta n d a r d D e v ia tio n s o f In te rp o le D i s t a n c e s ... 129 6.21 ( i n . n . p . c . r ) = (10. 1 0 * .0 .7 5 .4 .4 ): 1.000 4 -P o le S ets ... 130 6.22 ( i n . n . p . c . r ) = (10. 1 0 * .0 .7 5 .4 .4 ): E fficiency for 4 -P o Ia r M a p p in g s w ith
Re-j s p e c t to A v e rag e In te rp o le D i s t a n c e s ... 132 6.23 ( i n . n . p . c . r ) = (2. 101. 0 .7 5 .4 .5 ): 2 -D im en sio n al S c a tte r P l o t ... 133 6.24 ( i n . n . p . c . r ) = ( 2 . 1 0 1. 0 .7 5 .4 .5 ): 1000 4 -P o le S ets. A v e rag e In te rp o le
Dis-; ta n c e s v ersu s S ta n d a r d D e v i a t i o n s ... 133 ! 6.25 ( m . n . p . c . r ) 1 (2 .1 0 '* .0 .7 5 .4 .5 ): Efficiency o f 4 -P o la r M a p p in g s w ith R e-, sp e c t to A v e rag e In te rp o lc D i s t a n c e s ... 134 ■ 6.26 (rn. n . p. c . r ) = ( 1 0 .10'1. 0.25. 100.2): P e rfo rm a n c e o f th e G re e d y A lg o rith m 136 6.27 (rn. n . p. c. r) = ( 1 0 .10'1.0 .5 0 . 100.2): P e rfo rm a n c e o f th e G re e d y A lg o rith m 137 6.28 ( m . n . p . c . r ) = (1 0 .1 0 * .0 .7 5 . 100.2): P e rfo rm a n c e o f th e G re e d y A lg o rith m 138 6.29 ( m . n . p . c . r ) = ( 1 0 . 10'1.0 .5 0 . 100.2): P e rfo rm a n c e o f th e G re e d y A lg o rith m 139
x i i
List o f T ables
3.1 B asic M e tric A x i o m s ... 3(J
4.1 D ista n c e s S p ecifie d by d\ over X ... 09
4.2 V alues o f Zq- ... 70
4.3 V alues o f d > ... 70
4.4 O b se rv a b le P ro p e r tie s for Lo ... 75
5.1 N u m b e r o f P o in ts V isite d versus D im e n s io n a lity ... 90
6.1 P e rfo rm a n c e o f 5 -P o la r M a p p i n g s ... 108
1 L
A bstract
A
M e t r i c - B a s e d S c i e n t i f i c D a t a M o d e l f o r K n o w l e d g e D i s c o v e r ybv
DavicI T. Kao
University of New Ham pshire. December. 1997
Scientific d a ta , w hich in clu d e m u ltim e d ia (e.g.. im ages, au d io , a n d video) a n d n o n -s ta n d a rd d a t a (e.g.. Rnger p rin ts a n d DMA sequences), is c h a ra c te riz e d by rich a n d co m p lex int.er-
insta n ce relationships in a d d itio n to th e in te r -e n tity relationships fo u n d in tra d itio n a l d a ta . C onven tio n a l data m odels a re insufficient for m o d elin g such in te r-in s ta n c e re la tio n sh ip s.
T h is thesis p ro poses a m e tric-b a sed sc ie n tific data m odel from th e no tio n s o f data-as-
fu n c tio n s an d p seu d o -q u a sim e tric s, w hich a re u sed to m odel in te r-e n tity a n d in te r-in sta n c e
re la tio n sh ip s respectively. C o m p a re d to o th e r scientific d a t a m od els, th e m e tric -b a se d con c e p tu a l m odel ca n be a p p lie d to m an y d a t a se ts w h e re g eo m etric view s m ight n o t otherw ise b e available.
A d e ta ile d a p p ro a c h is o u tlin e d for e x p lo rin g a n d d e riv in g p sc iid o -q u a sim e trirs to re p re se n t in te r-in sta n c e re la tio n s h ip s in a w ide v a rie ty o f d a ta . In p a r tic u la r, we introduce* th e n o tio n of observable pro p erties a n d show how it ca n be a p p lie d w ith id eas from point
se t topology to s y s te m a tic a lly d e riv e m etrics from n o n m e tric d a t a c o m p o n e n ts su c h as c a t
eg o rical d a ta . We also d e m o n s tra te th e use o f c o n tin u ity as a m a th e m a tic a lly precise tool to v a lid a te m etrics d eriv ed th ro u g h th e p ro p o se d a p p ro a c h .
In o rd e r to s u p p o rt th e m e tric -b a se d m o d el a t th e phy sical level, we dev elo p ed two sim p le m echanism s, th e m u ltip o la r m apping, for tra n sfo rm in g a p se u d o -m e tric sp ace into a m u ltid im en sio n al space, a n d th e m edian tra n sfo rm a tio n , for d e riv in g a p seu d o -m e tric from a p seu d o -q u asim ctric. A fte r a p p lic a tio n o f m u ltip o la r m a p p in g a n d (p o ssib ly ) m edian tra n sfo rm a tio n , it is easy to use e x istin g p o in t spatial data s tru c tu re s such a s quadtree o r octree for m e tric d a ta s to ra g e a n d access. T h e re su lts o f o u r p e rfo rm a n c e analysis d e m o n s tra te th a t th e m u ltip o la r a p p ro a c h is ro b u s t a n d s ta b le o v er a wide ra n g e of d a ta p a ra m e te rs for d a t a se ts w ith in trin sic d im e n sio n a lity o f 10 or less. W h ile it is s till unclear w h e th e r the m u ltip o la r a p p ro a c h offers sig n ifican t p e rfo rm a n c e a d v a n ta g e o n proxim ity
q u erie s for d a t a sets o f very h ig h d im en sio n ality , p re lim in a ry re s u lts for 100 d im en sio n al d a t a still show ex cellen t p e rfo rm a n c e o n n earest n e ig h b o r q u erie s.
C hapter 1
In trod u ction
1.1
Scientific D a ta
In te rre la tio n s h ip s in scien tific data a re fu n d a m e n ta lly m o re im p licit, d ifficu lt to c a p tu re , a n d h a r d e r to m o d el th a n those in o th e r d a ta b a s e a p p lic a tio n s. For m o st c o n v en tio n al a p p lic a tio n s , th e o n ly in te rre la tio n s h ip s o f in te re s t a r e th e ones a m o n g d iffe ren t sem an tic e n titie s , su c h as th e ones w hich c a n b e d e s c rib e d by a n E R d ia g ra m [16. 12]. In stan c es o f a p a r tic u la r s e m a n tic e n tity a tu p le o f a re la tio n o r a n o b je c t o f a class are o n ly re la te d by s h a r in g a set o f com m on p ro p e rtie s o f t h a t se m a n tic e n tity . In m o st situ a tio n s , su ch in te r - e n tity re la tio n sh ip s a re know n in a d v a n c e a n d ex p ressed d ire c tly ;l s
m e ta d a ta .
F or scientific d a ta , th e re is a n o th e r a s p e c t o f c o m p le x ity in th e in te rre la tio n s h ip s th e in te rre la tio n s h ip s a m o n g in sta n ces o f th e sa m e s e m a n tic e n tity . W hile th e precise' m ean in g o f sc ie n tific d a ta is still o p e n to d iscu ssio n , we d e fin e scien tific d a ta as d a t a w h ich possess rich in te r -in s ta n c e relationships. For e x a m p le , b o th m u ltim e d ia (e.g.. im ages, a u d io . an d video) a n d n o n -sta n d a rd data (e.g.. finger p rin ts a n d DMA sequences) a re c o n sid e re d as scientific d a t a by o u r d e fin itio n . A sc ie n tific database s y s te m m u st m an ag e d a t a b efo re such in te r-in s ta n c e re la tio n s h ip s a re well u n d e rs to o d . In fa c t, o n e m a jo r o b je c tiv e o f data
m in in g o r know ledge d isco ve ry in databases (K D D ) [19. 18. 6] is to discover, q u a n tify a n d
fu rth e r v a lid a te th e se r e la tio n s h ip s . 1
'T h e p ro cess o f e x p lo rin g m e a n in g fu l p a tte r n s in d a t a is know n b y d ifferen t n am es in v arious re s e a rc h disciplines. In p a rtic u la r, t h e te r m exploratory data a n a ly sis (E D A ) [63] h a s b een used e x te n s iv e ly in th e past for su ch a c tiv ity by t h e s ta tis tic s c o m m u n ity . For th e re s t o f th i s th e s is, we use th e te rm kno w led g e
In a h roacl sen se , scien tific database s y ste m s a r e d a ta b a s e sy ste m s w ith in teg ra te d s u p p o rt for th e m a n a g e m e n t a n d analysis o f scientific d a ta . S im ila r to co n v e n tio n a l d a ta b a s e sy stem s, scien tific d a ta b a s e system s also m odel a n d m a n a g e d a t a a c c o rd in g to its se m a n tic
stru ctu re, i.e.. th e in te rre la tio n s h ip s in d a ta . Since p a r t o f th e s e m a n tic s tr u c tu r e is d e te r
m in ed o r evolves a s th e d a t a analysis process p ro c eed s, th e c o u p lin g b etw een d a ta m a n ag e m e n t a n d d a t a a n a ly sis is a tight one. A sc ie n tific d a ta m o d el is a c o h e re n t fram ew ork for m o d elin g b o t h in te r-e n tity a n d in te r-in sta n c e re la tio n s h ip s to s u p p o r t d a t a m anagem ent a n d a n a ly sis a c tiv itie s .
T h is th e s is su m m a riz e s o u r research in th e d e v e lo p m e n t o f a fo rm a l scientific d a ta m odel. A s p a r t o f th e th e sis, a d a ta m odel a n d a class o f p h y sic a l data m odels a re proposed. T h e d a t a m o d el, w h ic h we ca ll the m etric-based sc ie n tific d a ta m odel, is d ev elo p ed to e x te n d th e c a p a b ilitie s o f e x is tin g conceptual a n d im p le m e n ta tio n d a t a m o d els su c h th a t in te r in sta n c e re la tio n s h ip s c a n b e p ro p erly re p resen te d . A c la ss o f p h y sical d a t a m odels a re d ev elo p ed for efficien t im p le m e n ta tio n o f such a c o n c e p tu a l e x te n sio n .
A t th e c o n c e p tu a l level, o u r scientific d a t a m o d e l is b a se d on a m a th e m a tic a l fram ew ork d e v e lo p e d from tw o basic n o tio n s d a ta -a s -fu n c tio n s a n d p seu d o -q u a sim etrics - w hich a re u se d fo r m o d e lin g in ter-e n tity a n d in te r-in s ta n c e re la tio n s h ip s respectively. By
re p re se n tin g d a t a a s fu n c tio n s, the m a th e m a tic a l fu n c tio n fo rm u la tio n s o f d a t a can be used for m o d elin g in te r - e n tity re la tio n sh ip s. A d e ta ile d a p p r o a c h is o u tlin e d for exploring a n d d e riv in g p s e u d o -q u a s im e tric s to rep resen t in te r-in sta n c e re la tio n s h ip s in a w ide variety of d a ta . In p a r tic u la r , we in tro d u c e th e n o tio n o f observable p ro p erties a n d show how it can b e a p p lie d w ith id e a s from p o in t set topology* [5] to s y s te m a tic a lly d e riv e m e tric s from non- m e tric d a t a c o m p o n e n ts . F inally, we d e m o n s tra te th e use o f c o n tin u ity as a m a th e m a tic a l Im precise to o l to v a lid a te m e tric s derived th ro u g h th e p ro p o s e d a p p ro a c h .
A t th e p h y s ic a l level, various hierarchical m e tr ic d a ta str u c tu r e s c a n be utilized as p hysical d a t a m o d e ls to im p lem en t o u r m e tric -b a se d c o n c e p tu a l m odel. In o rd e r to have efficient im p le m e n ta tio n s , we propose an innovative a p p r o a c h to d e riv e a new class of h ie r arch ical m e tric d a t a s tr u c tu r e s from point spatial data s t r u c t u r e s In s te a d o f perform ing d ire c t d e c o m p o s itio n o n m e tr ic data as is d o n e for e x is tin g h ie ra rc h ic a l m e tric d a ta s tru c
-know ledgc from d a ta .
' P o in t se i topology is also know n as general topology.
3In p o in t s p a tia l d a t a s t r u c tu r e s , th e s p a tia l o b je c ts to be s to r e d a r e d iscrete p o in ts in m u ltid im en sio n al spaces, c o m p a re d to o th e r s p a tia l d a ta s tru c tu re s such as region, rectangle, o r lin e s p a tia l d a ta s tru c tu r e s w hich a re d esig n ed fo r c o n tin u o u s sp atial o b je c ts [54. 53].
cures su ch a s m e tric trees [65. 64] a n d up-trees [67]. we define a class o f sim p le p r o n m ity -
p reserving m a p p in g s from p se u d o -m etric spaces to m u ltid im e n sio n a l spaces, w hich we call m ultipolar m ap p in g s. By a p p ly in g m u ltip o la r m a p p in g s to m etric d a ta , h ie ra rc h ic a l d e
c o m p o sitio n s c a n be done in m u ltid im e n sio n a l sp a c e , a n d various e x istin g w e ll-u n d e rsto o d p o in t s p a tia l d a t a s tru c tu re s [51. 52]. such as quadtree, octree, o r k-d tree, c a n b e u tiliz e d its h ie ra rc h ic a l m e tric d a ta s tru c tu re s .
1.2
T axonom y of C onventional D a ta M odels
A da ta m odel is a collection o f c o n c e p tu a l to o ls for describ in g d a ta , re la tio n s h ip s , se m a n tic s, o p e ra tio n s , a n d c o n s tra in ts [43]. W e c a n ca te g o rize d a t a m odels by how th e y • d e sc rib e th e d a t a a b s tra c tio n . T ra d itio n a l d a ta b a s e a r c h ite c tu re p a r titio n s d a t a m o d els in to
th re e d iffe re n t levels o f a b stra c tio n : conceptual, im p le m e n ta tio n , a n d physical [16]. Ideally, d a t a a b s tr a c tio n a t each level is in d e p e n d e n t o f th e o nes in th e o th e r two.
C o n cep tu a l data m odels a re b ased on th e u s e rs ’ p ercep tio n s o f d a ta . T h e y p ro v id e
a high-level a b s tr a c tio n of th e real w orld e n titie s a n d th e in te rre la tio n sh ip s a m o n g th e m . T h e E n tity -R e la tio n s h ip (E R ) m odel p ro p o se d by C h e n [7] along w ith its v ario u s e x te n sio n s [57. 61. 24] is th e m ost w idely used c o n c e p tu a l d a t a m odel.
P h ysica l data models specify th e low -level o rg a n iz a tio n o f how d a ta is s to re d in
th e c o m p u te r. P h y sic al d a ta m odels a re u su a lly in v isib le to d a ta b a s e users. It is th e d a ta b a s e s y s te m d esig n er's jo b to define th e p h y sic a l d a t a m odels a n d d ec id e how th e y a re im p le m e n te d . C o m m o n physical d a t a m odels in c lu d e B - tr c e . B ~ -t.ree. B '- t r e e . R - t r r r . a n d so on.
1
Im p le m e n ta tio n data m odels serve a s a b rid g e betw een c o n c e p tu a l a n d p h y sical
d a t a m o d els. T h e y provide co n cep ts a t an in te rm e d ia te level o f a b s tra c tio n still u n d e r s ta n d a b le by u se rs b u t w ith enough d e ta il to d efine th e way d a t a is logically o rg a n iz ed w ith in I
\ th e c o m p u te r. T h e in tera ctio n s betw een users a n d d a ta b a s e sy stem s also ta k e p lace a t th is
I a b s tra c tio n level. T h e interface is im p le m e n te d u sin g a D D L (data d efin itio n language) a n d
a D M L (d a ta m a n ip u la tio n language). A D D L is u sed to specify th e logical s tr u c tu r e s o f
? d a ta w hile a D M L is used to access d a t a s to re d in th e logical s tru c tu re s specified by th e
D D L. C o m m o n im p le m e n ta tio n d a ta m odels in c lu d e hierarchical, netw ork, relational, a n d
object-oriented.
id eally the choices of d a t a m o d els a t th e th re e a b s tra c tio n levels s h o u ld b e in d e p e n d e n t from ea ch o th e r, in re a lity th e y a r e c o rre la te d . G enerally, not every d a ta m o d e l a t o n e a b s tra c tio n level s u p p o rts all th e m o d e ls a t th e h ig h er level(s) e q u a lly well. E ven tw o d iffe re n t schem as o f th e sam e d a t a m odel c a n h av e d iffe ren t re q u ire m e n ts w hich are b e t t e r se rv e d by different fla ta m odels a t th e low er lev el(s). T h u s , w hen a new d a t a m odel is in tro d u c e d o r a n old d a t a m odel is ex te n d e d , it is im p o rta n t to m ake su re t h a t th e re e x ist d a t a m o d els a t lower level(s) to pro v id e th e n e c e ssa ry s u p p o r t efficiently.
1.3
Inter-Instance R elationships
T h e focus o f c o n v e n tio n a l d a t a m odels is in te r-e n tity re la tio n s h ip s w h ich a r e of p rim a ry in te re st in tr a d itio n a l d a ta . However, scientific d a ta also e n c a p s u la te com plex in te r-in sta n c e re la tio n sh ip s. L et us illu s tra te such in te r-in sta n c e re la tio n s h ip s th ro u g h a sim p le exam ple.
A m u lti-b a n d la n d s a te llite im age can b e re p re se n te d as a tw o -d im e n sio n a l a rra y o f records w ith each o f th e re c o rd s re p re se n tin g m u ltip le m e a su re m e n ts o f a specific sm all la n d area. E ach such re c o rd s h a re s th e sa m e m em b ersh ip in fo rm a tio n o f th e se m a n tic e n tity in q u e stio n , in th is case, th e im ag e o r th e 2-D a rra y re p re s e n ta tio n o f th e im age. T h e m em b e rs h ip in fo rm a tio n in c lu d e s th e fo rm a t o f th e record a n d o th e r re le v a n t in fo rm a tio n such as th e d o m ain a n d p o ssib le c o n s tra in ts on each m e a su re m e n t. T h ese a r e th e s tr u c tu r e s a n d re la tio n sh ip s w hich can b e m o d eled by conventional c o n c e p tu a l d a t a m o d els. However, in a d d itio n to th e m e m b e rsh ip s h a rin g , re co rd s o f th e 2-D a r ra y a re also in te r r e la te d sp atially . W h ile a conventional c o n c e p tu a l d a t a m o d el such a s th e E R m odel c a n a s s o c ia te ea ch record w ith its s p a tia l c o o rd in a te s, th e s p a tia l re la tio n sh ip a m o n g records c a n n o t b e a d e q u a te ly d escrib ed .
L et us use th e e n tity n a m e S Q U A R E to re p re se n t th e records o f th e 2-D satellite; im age. U sing th e E R m odel, th e c o n c e p tu a l sch e m a o f th e im age is illu s tr a te d in F ig u re 1.1(a). T h e E R d ia g ra m show s t h a t th e re a re four possible re la tio n sh ip s a m o n g in s ta n c e s o f e n tity SQ U A R E - N O R T H . S O U T H . E A S T , a n d W E S T - b ased o n th e s p a tia l re la tio n s h ip am ong in stan ce s. However it fails to re p re se n t:
1. T h e way in stan ce s o f S Q U A R E a re re la te d th ro u g h N O R T H . S O U T H . E A S T , o r W E S T :
re la tio n a n d th u s , no in stan ce c a n b e im m e d ia te ly to th e N O R T H o f tw o different, in sta n c e s a n d no in s ta n c e can have m ore t h a n o n e in s ta n c e a s its im m e d ia te N O R T H .
B eyond t h a t , we know n o th in g a b o u t th e w ay in s ta n c e s a r e re la te d th ro u g h th e n o r t h
re la tio n . F or e x a m p le , th e c o n s tra in t t h a t th e N O R T H r e la tio n in tro d u c e s no cycles (i.e.. th e re is no in s ta n c e .s such t h a t s is N O R T H o f . . . itse lf) can n o t b e seen from th e E R d ia g r a m . 1
2. T h e w ay th e re la tio n s h ip s N O R T H . S O U T H . E A S T , a n d W E S T a re in te rre la te d : For ex a m p le , g iv en tw o in stan ce s .S[ a n d »■•>. we c a n n o t c o n c lu d e from th e E R d ia g ra m th a t .si is th e N O R T H o f .s-j if a n d o n ly if s-t is th e S O U T H o f .S[. N e ith e r c a n we say t h a t th e E A S T o f th e SO U T H o f a n in sta n c e , if it e x is ts , is th e sam e as th e S O U T H o f th e e a s t o f t h a t in sta n c e .
V
SQUARE I ✓ X I -< ----\ y SQUARE M \ 1 ---N E X T ---(a) (b)F ig u re 1. 1: E R S chem a D iag ram s o f th e S a te llite Im age
T h e se c o n d issu e is a classical e x a m p le o f o n e w ell-k n o w n lim ita tio n o f th e ER m o d el - it is n o t p o ssib le to ex p ress re la tio n sh ip s a m o n g re la tio n s h ip s [43], For co n v e n tio n a l d a ta b a s e a p p lic a tio n s , th is lim ita tio n can b e overcom e b y th e in tro d u c tio n o f th e aggregation c o n s tru c t as a n e x te n s io n to th e E R m odel. H ow ever, in th e c a se o f th e 2-D s a te llite image*, a g g re g a tio n re s u lts in a sin g le re la tio n , say N E X T , by c o lla p s in g N O R T H . S O U T H . E A S T , a n d
W E S T (F ig u re 1 .1 (b )). T h is sim p ly red u ces th e se c o n d issu e in to th e first issue above. V arious o th e r e x te n s io n s to th e E R m o d el have b e e n p ro p o s e d to a lle v ia te th is in h e re n t
’ in fact, ev en im p le m e n ta tio n sch em a d eriv e d from th is E R d ia g r a m c a n n o t enforce th is c o n s tra in t efficiently. A v io la tio n o f th is k in d can n o t be d e te c te d by c h e c k in g th e c o n te n t o f a specific in sta n c e ; it is n ecessary to tr a v e rs e all in s ta n c e s in th e tr a n s itiv e closure.
(i
lim ita tio n [61. 62. 24). N one o f th e m su cc eed s in p ro v id in g an effective m e c h a n ism to d e sc rib e co m p lex in te r-in sta n c e re la tio n s h ip s in g e n e ra l.
In su m m a ry . E R d iag ra m s ca n n o t re p re s e n t th e re g u la ritie s, c o n s tra in ts , o r p a t te rn s o f th e re la tio n s h ip s am o n g in sta n c e s o f a n e n tity . F or exam ple, from th e E R sch e m a , it is n o t p o ssib le to tell th e differences a m o n g th e th r e e d is tin c t p a tte r n s in F ig u re 1.2. S ince m a k in g re la tio n s h ip s ex p licit is on e o f th e m o st b asic o b jectiv es o f d a t a m o d els at th e c o n c e p tu a l level in p a r tic u la r - co n v e n tio n a l c o n c e p tu a l d a ta m o d els a r e in a d e q u a te as m o d els for scien tific d a ta . C learly, we need a c o n c e p tu a l scientific d a t a m o d el t h a t s u p p o r ts th e re p re s e n ta tio n o f in te r-in sta n c e re la tio n s h ip s . T h is is th e p rin c ip a l c o n trib u tio n o f th e
m e tric-b a sed sc ie n tific data m odel d e sc rib e d in th is th e sis.
F ig u re 1.2: D ifferent P a t te r n s o f In te r -In s ta n c e R elatio n sh ip s
A s m e n tio n e d in S ection 1.2. th e d e v e lo p m e n t o f a new c o n c e p tu a l d a t a m odel m ay n e c e s s ita te th e d evelopm ent o f new d a t a m o d els a t th e two low er a b s tr a c tio n levels. T h e d e v e lo p m e n t o f su p p o rtin g d a ta m o d e ls for o u r m e tric -b a se d scientific d a t a m o d e l a t th e im p le m e n ta tio n a n d physical levels is d isc u sse d la te r . In p a rtic u la r, th e d e v e lo p m e n t a n d a n a ly sis o f s u p p o r tin g p hysical d a t a m o d els, i.e.. hierarchical m e tr ic data stru c tu re s. form a n im p o r ta n t c o m p o n e n t o f th is th esis.
In th e re st o f th is thesis, th e g e n e ra l te rm •'scientific d a ta m o d el" is u sed in place o f " c o n c e p tu a l scientific d a ta m odel" o r "scien tific d a t a m o d el a t th e c o n c e p tu a l level."
1.4
M od els o f Inter-Instance R elation sh ips
W h ile th e re e x ist differences in te rm in o lo g y a n d im p le m e n ta tio n , e x is tin g scien tific d a t a m o d e ls a r e all b ase d on th e sam e c o n c e p tu a l m o d e l o f in te r-in sta n c e re la tio n s h ip s th e g e o m e tric m odel. D e p en d in g on th e c o m p le x ity o f th e in te r-in sta n c e re la tio n s h ip s , th re e p rim itiv e s, g eo m etry, topology, a n d indexable topology, c a n b e used to s u p p o rt th e g e o m e tric
m odel a t th e im p le m e n ta tio n level (C h a p te r 2). E ach o f th e e x is tin g scientific d a t a m odels in c o rp o ra te s som e o r all o f th e th re e im p le m e n ta tio n p r im itiv e s . T h e r e also ex ists a g ro u p o f d a t a s tru c tu re s to s u p p o r t ea ch of th ese n o tio u s in a p h y sic a l d a t a m odel. F ig u re 1.3 illu s tra te s a h iera rch y o f d a t a m o d els for in te r-in sta n c e re la tio n s h ip s , w h e re each d a t a m odel is re p re se n te d by a re c ta n g le . In ste a d o f re p re se n tin g e a c h o f th e e x istin g scientific d a t a m odels, only th e m o d elin g p rim itiv e s em ployed, re p re s e n te d by ellip ses, are illu s tra te d in th e im p le m e n ta tio n layer. In g en eral, th e m ore ex p ressiv e a n d flexible th e d a ta m odel a n d th e m o d elin g p rim itiv e , th e w id er th e ra n g e o f d a t a t h a t c a n b e m odeled, b u t th e less efficient th e s u p p o rtin g d a ta s tru c tu re s .
C o n c e p iu a l D a ta M o d e ls Im p le m e n tatio n D a ta M od els and P rim itiv es
P h y sica l D a ta M o d e ls (S u p p o rtin ': D a ta Stru ctu res)
M e tric -B a s e d M o d e l (P s e u d o -Q u a s im e tric i G e o m e tric M ode! F o rm u la tio n o f P s e u d o Q u a s im e tr ic M e d ia n T r a n s f o r m a tio n E x istin g S c ie n tific D a ta M o d e ls G e o m e try T o p o lo g y In d ex a b le T o p o lo g y H ie ra rc h ic a l M e tric D a ta S tru c tu re s M u ltip o la r M apping P o in t S p a tia l D a ta S tru c tu re s G ra p h D a ta S tru c tu re s M u ltid im e n s io n a l A rra y s
F ig u re 1.3: D a ta M odels a n d P rim itiv es for R e p re se n tin g In te r -In s ta n c e R elatio n sh ip s
In th is th esis, we p ro p o se th e use o f a sp ecial d a t a fu n c tio n , th e p seud o -q u a sim et-
ric. as a basic c o n c e p tu a l m o d e lin g p rim itiv e for re p re s e n tin g in te r-in s ta n c e re la tio n sh ip s
(C h a p te r s 3 a n d 4). In te rm s o f expressiveness, a p s e u d o -q u a s im e tric is m ore pow erfid th a n th e g e o m e tric m odel a n d c a n b e ap p lied to a m uch w id e r ra n g e o f scientific d a ta . W h ile th e re a r e no d a ta s tru c tu re s re a d ily available to s u p p o rt p s e u d o -q u a s im e tric s as a c o n c e p tu a l m o d el, th e re ex ists a g ro u p o f hierarchical m e tric data stru c tu re s, su c h a s m e tric trees [65. 64] a n ti up-trees [67]. w hich can b e u tilized to s u p p o rt p se u d o -m e tric s, a m ore re stric tiv e n o tio n t h a n p se u d o -q u a sim e tric . In C h a p te r 5. we p ro p o se a n in n o v a tiv e a p p ro a c h , th e m ultipolar
s
wo develop a sim p le a n d p ra c tic a l te c h n iq u e , m edian tra n sfo rm a tio n , for u tiliz in g ex istin g hiera rch ica l m e tric d a t a s tr u c tu r e s as well as m u ltip o la r m a p p in g s to s u p p o r t p seu d o -q u a- sim etrics.
It sh o u ld b e n o te d t h a t th e te rm topology has d u a l in te r p r e ta tio n s in th is thesis. T h e first in te r p r e ta tio n is b a se d o n th e conv en tio n s used a n d u n d e r s to o d by re se a rc h e rs and p ra c titio n e rs in v a rio u s fields o f c o m p u te r science. B ased o n th is in te r p r e ta tio n , a topology is a d e sc rip tio n o f c o n n e c tio n p a tte r n s (S e c tio n 2.2). F orm ally, it c a n b e re p re s e n te d as a
directed graph. For in s ta n c e , we c a n ta lk a b o u t th e to p o lo g y o f a c o m p u te r n e tw o rk , o r the
topology o f a c o m p u ta tio n a l g rid . T h e sec o n d in te rp re ta tio n o f to p o lo g y in th is th e sis is based o n its form al d e fin itio n in g e n e ra l to p o lo g y (S ec tio n 3.4). In th is sen se, we c a n talk a b o u t th in g s su ch a s the u su a l topology o f the real num bers, o r neighborhood stru c tu r e o f a
topological space. N o n e th e le ss, th e re e x ists a co n n e ctio n b e tw e e n th e tw o in te r p r e ta tio n s
§ nam ely, b o th in te r p r e ta tio n s p ro v id e a way to d e sc rib e th e id e a o f "closeness." “n c a rb v .” or 3
| "n eig h b o rh o o d ." In th is th e sis, th e in te n d e d in te rp re ta tio n o f th e te rm to p o lo g y is obvious
| from its c o n te x t.
1.5
T hesis O u tlin e
:■ C h a p te r 2 p re s e n ts a su rv e y o f e x istin g scientific d a t a m o d els a n d th e p rim itiv e s
j! em ployed for re p re s e n tin g in te r-in s ta n c e re la tio n sh ip s, nam ely, g eo m etry, topology a n d in
dexable topology. F o u r m o d els a re selec ted b a se d on th e c rite rio n t h a t each o n e c a n rep resen t
in te r-in sta n c e re la tio n s h ip s e x p lic itly b ase d o n one o r m ore o f th e s e p rim itiv e s. W h ile none of th ese m o d els a d e q u a te ly a d d re sse s th e p ro b le m o f m o d e lin g in te r-in s ta n c e re la tio n sh ip s, the su rv ey pro v id es a fram ew o rk for th e p ro p o se d m e tric -b a se d scien tific d a t a m o d el.
In C h a p te r 3. we give a b rie f in tro d u c tio n to th e m a th e m a tic a l fo u n d a tio n o f the
| m e tric-b ased m odel. S o m e b asic te rm in o lo g y a n d n o tio n s from gen era l topology a n d m e tric
■ theory a re covered th e re . T h e m a in th e m e is th e s tu d y o f s e m a n tic s for c o n tin u ity in J different m a th e m a tic a l s e ttin g s . In th e s u b se q u e n t c h a p te rs , th e n o tio n o f c o n tin u ity plays
an im p o rta n t role in b o th th e m e tric -b a se d m o d el a n d th e s u p p o r tin g d a t a s tr u c tu r e s . C h a p te r 4 d e s c rib e s o u r p ro p o se d m e tric -b a se d delta m o d el. S ev eral e x a m p le s are given to illu s tra te th e b asics. A s y s te m a tic a p p ro a c h is d e v e lo p e d to d e riv e m e tric s from no n -m etric a t tr ib u te s su c h as th e o nes in ca te g o rica l d a ta . L ast b u t n o t th e le a st, we p resent a form al m e th o d for v a lid a tin g th e d eriv e d m odel.
C h a p te r 5 is d e v o te d to th e s tu d y o f p hysical d a t a m o d els, i.e.. s u p p o r tin g d a ta s tru c tu re s , for th e p r o p o s e d m e tric -b a s e d d a t a m odel. T w o in n o v a tiv e a p p ro a c h e s , m u lti
polar m ap p in g a n d m e d ia n tr a n sfo rm a tio n , a r e dev elo p ed for th is p u rp o s e . T h e y e n a b le th e
a p p lic a tio n s o f e x is tin g h ierarchical m e tr ic data stru ctu res a n d p o in t sp a tia l d a ta stru ctu res as p h y sical d a t a m o d e ls fo r th e m e tric -b a s e d m odel.
In C h a p t e r 6 . we p re s e n t a p e rfo rm a n c e s tu d y on m a jo r p a r a m e te r s o f th e m ul tip o la r m a p p in g s. In p a r tic u la r , we a r e in te re ste d in le a rn in g how th e s e p a r a m e te rs affect efficiency o f p r o x im ity q u e rie s in d a t a s e ts o f various c h a ra c te ris tic s . T h e re s u lts give us a sim ple g u id e lin e to fo r m u la te a n effective m u ltip o la r m a p p in g for a w id e ra n g e o f d a ta .
C h a p te r 7 p ro v id e s c o n c lu d in g re m a rk s a n d d ire c tio n s for f u tu r e re se a rc h .
£
4
10
Chapter 2
Scientific D a ta M od els
r
2.1
Introduction
x *I
T h e need for m o d e lin g scien tific d a t a w ith com plex in te r-in s ta n c e re la tio n s h ip shas long been re cognized by th e c o m p u ta tio n a l flu id d yn a m ic s a n d sc ie n tific visu a liza tio n com m u n ities. N o ta b le w o rk in c lu d e s th e c o m p u ta tio n a l g rid m odel [40. 60] a n d its v ario u s ex tensions, th e A V S m o d e l o f G e lb e rg e t al. [23]. th e fib er bundle m odel o f H a b e r. L ucas a n d Collins [28]. a n d th e la ttic e m odel o f B erg ero n a n d G rin ste in [2. 35]. F ro m th e per-
f sp ec tiv e of m o d elin g in te r-in s ta n c e re la tio n sh ip s, th e y all sh are th e sa m e g e o m e tric m odel
a t th e concep tu al level. W h ile th e r e e x ist differences in term in o lo g y a n d m e th o d o lo g y , th e o b jectiv e for all th e e x is tin g sc ie n tific d a t a m odels a t th e im p le m e n ta tio n level is to d eriv e a co m p act re p re s e n ta tio n o f " s tr u c tu r e d d a t a ” o r " d a ta w ith re g u la r p a t t e r n s ' b a se d on som e adjacency relations in th e logical space on w hich a m a p p in g to th e p h y sic a l space is defined. For in sta n c e , in a 2-D logical sp ac e, fo u r-n eig h b o r adjacency a n d eig h t-n e ig h b o r
adjacency a re two co m m o n a d ja c e n c y re la tio n s. T h e se m odels a re n o t c a p a b le o f efficient.
§
• re p re se n ta tio n o f h ig h ly c o m p le x in te r-in s ta n c e re la tio n sh ip s w here closed form a d ja c e n c y
; re la tio n specifications a r e n o t av a ila b le .
A dvances in sp a tia l d a ta m odels [54] p re se n t a different a p p ro a c h for m o d e lin g in te r-in sta n c e re la tio n sh ip s. In a s p a tia l d a t a m o d el, th e in te r-in sta n c e re la tio n s h ip s are represen ted as a se t o f c o o r d in a te s in a vecto r space. S p a tia l d a t a m o d els a r e e x tre m e ly useful as th e fo u n d a tio n for sp a tia l da ta stru c tu re s su c h as th e quadtree o r R -tre e [52. 53]. w hich have n u m ero u s a p p lic a tio n s [51]. H ow ever, for n o il-sp a tia l d a ta , th e re m ig h t n o t exist, a n a tu ra l m ap p in g from d a t a p o in t re co rd s to som e v e c to r space such th a t th e in te r-in s ta n c e
re la tio n sh ip s a r e re p re se n te d by c o o rd in a te s in th a t v e c to r sp ace.
It sh o u ld b e n o te d t h a t th e re also exists a g ro u p o f scien tific d a t a ex c h an g e s t a n d a rd s w hich do n o t fall in to th e th re e layer d a ta m o d e l tax o n o m y . N o n e th e le ss, in a s u b tle way. th e ir fo rm a ts do re p re s e n t im p lic it c o n c e p tu a l view's o f th e d a ta . T h e m o st p o p u la r two such d a t a ex c h an g e s ta n d a r d s a re C D F (C o m m o n D a ta F o rm a t) d ev e lo p e d a t N A SA [49. 26. 45] a n d H D F (H ie ra rc h ic a l D a ta F o rm at) d e v e lo p e d a t N C S A [46]. T h e c o n c e p tu a l m odels im p lic itly d efin e d by b o th C D F a n d H D F a re e n c o m p a sse d by th e g en e ral c o m p u ta tio n a l g rid m o d el (S ec tio n 2.4). D a ta exchange s ta n d a r d s su c h as th e se a r e sp ec ific atio n s of file fo rm a ts w h ich c a n b e c o n sid e re d as scientific d a t a m o d els for file sy stem s.
T h is c h a p te r su m m a riz e s th e fo u r scientific d a t a m o d els w hich re p resen t in te r- instan ce re la tio n sh ip s e x p lic itly : c o m p u ta tio n a l g rid m o d el. AVS m odel, fib er b u n d le m o d e,
j a n d la ttic e m odel. B efore we g e t in to d e ta ils o f th e four re s p e c tiv e d a t a m o d els (S ectio n s 2.4.
£ 2.5. 2.6. a n d 2.7). it is useful to s tu d y th e g en eral c o n c e p ts o f d a t a o rg a n iz a tio n (S ectio n 2 .2 )
[ a n d com m on o p e r a tio n s p e rfo rm e d on scientific d a t a (S e c tio n 2.3).
2.2
Im p lem en tation P rim itives and P h ysical D a ta M odels
A s c ie n tific data s e t is a fin ite set o f data points. E ac h d a t a p o in t c o rre sp o n d s
| to o b serv atio n s m ad e a t a p o in t o r sm a ll a re a in th e p h y sic a l space a t a specific tim e o r
p erio d . A p h y sical sp a c e ca n b e a v o lu m e o f atm o sp h e re , a m a g n e tic e n e rg y field, a specific biological p o p u la tio n , o r a c o m p u te r s im u la tio n o f such p h y sic a l e n titie s. E ac h d a t a p o in t is re p resen te d by a re co rd o f a t tr i b u te s re flectin g o b se rv a tio n s m a d e a t t h a t p o in t. S ince th e physical sp ace is w h ere th e d a t a s a m p lin g takes place, it is also called th e sam plintj sp a ce .1
i
S ince we a r e o n ly in te re ste d in scien tific d a ta , the te rm data s e t is used in ste a d o f sc ie n tificdata set in th e thesis.
A t th e c o n c e p tu a l level, e x is tin g scientific d a t a m o d e ls a re all b a se d on th e geo m etric view o f th e p h y sical sp a c e - th e g eo m etric m odel. F or a given d a t a se t. w hile th e re is ex actly on e p h y sical sp ac e, th e re c a n b e m any d iffe ren t g e o m e tric view s o f th e sam e d a t a set. T he p o ssib ility o f m u ltip le g e o m e tric views e n a b le s us to ex p lo re im p lic it in te rre la tio n sh ip s in d a ta , b o th in te r-e n tity a n d in te r-in sta n c e , w ith o u t th e lim ita tio n im p o sed o r suggested by th e p ro cess u se d for s a m p lin g d a ta in th e p h y sic a l space. A m o n g th e e x istin g scientific d a t a m odels, th e la ttice m odel, in p a rtic u la r, is sp e c ia lly d e sig n ed to fa c ilita te
12
m u ltip le g eo m etric views (S ec tio n 2.7).
W h ile th e ex istin g scientific d a t a m o d e ls differ in te rm in o lo g y a n d o b je c tiv e s , th ey a re all b a se d on one o r m ore o f th e th r e e im p le m e n ta tio n p r im itiv e s id e n tifie d in th is sec tio n for re p re se n tin g th e in te r-in s ta n c e re la tio n s h ip s a t th e im p le m e n ta tio n level. T h ese p rim itiv e s a re geo m etry. topology, a n d in d exa b le topology. In te rm s o f th e c o m p le x ity o f th e in te r-in s ta n c e re la tio n sh ip s th e y a re c a p a b le o f ex p ressin g , th e th re e n o tio n s fo rm a h ie ra r chy w ith g e o m e try b ein g th e m o st p o w e rfu l a n d in d ex ab le to p o lo g y b e in g th e least. T h is k in d o f expressive pow er com es a t th e p ric e o f efficiency, how ever. T h e n o tio n s o f g eom etry, top o lo g y , a n d in d ex ab le to p o lo g y p ro v id e a fram ew o rk for s tu d y in g th e four scien tific d a ta m o d els p re se n te d in th is c h a p te r a n d o u r p ro p o s e d m e tric -b a se d m odel. O n e im p o rta n t o b je c tiv e is to deriv e s u ita b le p h y sical d a t a m o d e ls, i.e.. d a ta s tr u c tu r e s , for ea ch o f th ese
k p rim itiv e s. W hile th e re is a g ro u p o f d a t a s tr u c tu r e s for ea ch o f th e se p rim itiv e s, a scientific
|
d a t a m o d e l m ight need to be s u p p o r te d by m o re th a n o n e k in d o f d a t a s t r u c t u r e a t th e p h y sic a l level. F ig u re 1.3 illu s tra te s th e re la tio n s h ip s am o n g all th e d a t a m o d els, p rim itiv e s.
!
a n d s u p p o r tin g d a ta s tru c tu r e s d isc u sse d in th is section.
2 .2 .1 G e o m e t r ie s
I T h e te rm g eo m etry h as m a n y d iffe re n t m eanings. G iv en a d a ta s e t. we define
g e o m e try as a n im p le m e n ta tio n p rim itiv e for its in te r-in sta n c e re la tio n s h ip s , i.e.. th e in
te rr e la tio n s h ip s am o n g its d a t a p o in ts, b a se d o n a d irec t in te r p r e ta tio n o f th e c o n c e p tu a l g e o m e tric m odel. S pecifically, a g e o m e try c o n sists o f two co m p o n e n ts:
( 1. a m a p p in g from d a ta re co rd s to c o o r d in a te s in the g e o m e tric space.
I 2 . a d is ta n c e fu n c tio n in th e g e o m e tric sp ac e.
S E sse n tia lly , g eo m etry d esc rib es th e in te rre la tio n s h ip s am o n g d a t a p o in ts b a s e d o n th e ir
I lo c a tio n s in th e g eo m etric space. G e o m e tric in fo rm a tio n is e n c o d e d in th e form o f a tt r ib u te s
i ■j
;■ o f d a t a p o in ts, m eta d a ta o f th e d a t a s e t. a n d th e g eo m etric view su p p lie d by th e user.
A ttr ib u te s selected for re p re se n tin g g e o m e tric c o o rd in a te s a re c a lle d g e o m e tric a ttrib u tes. A d a t a s e t c a n have m an y d iffe ren t g e o m e trie s. D ifferent se ts o f a t tr i b u te s m ig h t b e used ;is g e o m e tric a ttr ib u te s , re s u ltin g in d iffe re n t g e o m e tric spaces, a n d th u s , d iffe ren t g eo m etrie s. F or th e sam e g eo m etric sp ac e, d iffe ren t choices o f d ista n c e fu n c tio n s also in d u c e different g e o m e trie s.
N o te t h a t fo r m a n y d a t a sets, th e re m ig h t n o t b e a m e a n in g fu l g e o m e tric rep re s e n ta tio n for its in te r - in s ta n c e re la tio n sh ip s. W h ile th e re is a lw a y s a n a t u r a l g eo m etric view for a s p a tia l d a t a s e t. th e p h y sic a l space for a n o n -s p a tia l d a t a set m ig h t n o t b e a c o o rd in a te sp ac e for in sta n c e , it m ig h t b e a m e tric space o r a b in a ry rela tio n . A g e o m e tric view a n d a g e o m e tric r e p re s e n ta tio n , i.e.. a geom etry, m ig h t n o t be re a d ily a v a ila b le in su c h cases.
T h e m o st o b v io u s w ay to sto re a d a ta set b a se d o n its g e o m e try is to u tilize som e
p o in t sp a tia l data s tr u c tu r e [54. 55] e n c o d in g t h a t g eo m etry . P o in t s p a tia l d a t a s tru c tu re s ,
how ever, a r e g e n e ra lly n o t a s efficient as m u ltid im e n sio n a l a r ra y s w h e re ea ch d a t a record c a n be accessed by its n u m e ric a l index. For m an y d a t a s e ts w ith a "sim p le" o r "regular" g eo m etry , it is o fte n p o s s ib le to d erive a b e tte r d a t a s tr u c t u r e for its access a n d storage. T h e re d u c tio n in c o m p le x ity re s u lts in th e n o tio n s o f topology a n d indexable topology.
»
| 2 .2 .2 T o p o lo g ie s
i A topology1 is a d ire c te d g ra p h defined on a set o f d a t a p o in ts b a se d o n a chosen
geo m etry . T h e o re tic a lly , a to p o lo g y is eq u iv ale n t to a d is ta n c e fu n c tio n w h ich takes only two values. 0 a n d 1. T h e r e e x is ts an arc. i.e.. d ire c te d ed g e , from p to q in a topology if a n d o n ly if th e d is ta n c e fro m p to q is 0. T h e m a p p in g fro m d a t a re c o rd s to geom etric c o o rd in a te s in th e g e o m e tr ic sp a c e , i.e.. th e first c o m p o n e n t o f a g e o m etry , is n o t p a rt o f a
!
topology. A to p o lo g y is u s u a lly d eriv ed to achieve on e o f t h e follow ing tw o objectiv es: 1. th e g e o m e try c a n b e tra v e rs e d o r ex p lo re d m ore efficien tly th ro u g h th e topology [60]. 2. specific c o m p u ta tio n a l p ro c e d u re s c a n b e a p p lie d to th e d a t a m o re efficiently [40].
j D e p e n d in g o n th e re q u ire m e n ts o f a n a p p lic a tio n , m o re t h a n o n e topology can
I b e d eriv e d from th e g e o m e try . N evertheless, th e e x iste n c e o f a n a r c from a p o in t p to a
I p o in t q u su a lly im p lie s t h a t q is re la tiv e ly close to p (it d o e s n o t n e c e ssa rily im p ly th a t p
t is close to q). U n d e r m o s t c irc u m sta n c e s, th e d is ta n c e fu n c tio n specified by a g eo m etry is
r sy m m e tr ic (S ec tio n 3.2) a n d so is th e sim p lified tw o-value d is ta n c e for th e to p o lo g y derived
t from it. T h u s , u n d ir e c te d g ra p h s suffice to d e sc rib e m o st to p o lo g ie s. F or th e rest, of th e
: th esis, u n less specified o th e rw is e , we a ssu m e u n d ire c te d g r a p h s for to p o lo g ie s.
In g en e ral, a u se fu l to p o lo g y for a d a ta set is a tra d e -o ff b etw e en th e following two factors:
1-t o 0 o 0 ° 0 -< --- --- 1 ,s - ° O p o ' , ° o\ V-> p p o' O / * 0 0 o 1 f o (a) (b) (c)
F ig u re 2.1: T w o D ifferent T opologies D eriv ed from th e S am e G e o m e try
L. T h e to p o lo g y closely reflects th e g eo m etry .
2. T h e to p o lo g y consists o f m o stly sim p le a n d re p e titiv e p a tte rn s such t h a t d a ta access a n d / o r c o m p u ta tio n a re fa c ilita te d .
F ig u re 2 .1 (a) illu s tra te s th e g e o m e try o f a d a t a s e t. a ssu m in g E u clid ea n d ista n c e . F or th e co n v e n ie n ce o f c o m p u ta tio n a n d access, we m ig h t d eriv e a to p o lo g y as d e p ic te d in F ig u re 2 .1 (b ). As c a n be seen, th e edges ro u g h ly in d ic a te closeness betw een p o in ts in th e g e o m e try , w ith a few exceptions. F ig u re 2.1(c) d e p ic ts a n o th e r to p o lo g y in tro d u c e d In s e ttin g a d is ta n c e th re sh o ld su ch t h a t two p o in ts a re c o n n e c te d by a n edge if a n d o n ly if th e ir d is ta n c e is lower th a n th e th re sh o ld . A lth o u g h th is topology re p re se n ts th e g e o m e try in a m o re p recise w ay th e ex isten c e o f a n e d g e re p re se n ts a c e rta in d egree o f closeness, as sp ec ifie d by th e th re sh o ld - it does n o t c o n sist o f re g u la r p a tte r n s to fa c ilita te c o m p u ta tio n o r s to ra g e access. T h u s, it m ig h t n o t be a s u sefu l as th e first topology.
T h e d e riv a tio n o f a to p o lo g y w hich closely reflects th e g eo m etry is u su ally achieved th ro u g h a series o f refinem ents. T h e in itia l to p o lo g y is o fte n c o n stru c te d from som e v a ria n t o f th e b a sic n o tio n o f nearest neighbor. F or in s ta n c e , we ca n c o n stru c t a n in itia l to p o lo g y by c o n n e c tin g ea ch p o in t to its n e a re st n e ig h b o r, o r th e n e a re st fc neighbors. T h e a p p ro a c h for d e riv in g F ig u re 2.1(c) is also b ase d o n a v a r ia n t o f th e n e a re st n eig h b o r notion.
F or a d a t a set w here th e in te r-in s ta n c e re la tio n s h ip s can b e a d e q u a te ly m o d eled by a topology, it is o ften b e tte r to d e riv e a d a t a s tr u c tu r e usin g such a to p o lo g y th a n to u tiliz e a p o in t s p a tia l d a t a s tr u c tu r e b a se d o n its g eo m etry . E ac h edge in th e topology- c a n b e im p le m e n te d as a b i-d ire c tio n a l p o in te r fro m o n e d a t a p o in t to a n o th e r. We ca n also la b e l ea c h p o in t w ith a u n ique in d ex a n d s to re th e a djacency m atrix. E ith e r way. access to d a t a p o in ts s till o ften requires som e se q u e n tia l tra v e rs a l o f p a r t o f the topology, a n d ra n d o m