• No results found

A metric-based scientific data model for knowledge discovery

N/A
N/A
Protected

Academic year: 2021

Share "A metric-based scientific data model for knowledge discovery"

Copied!
173
0
0

Loading.... (view fulltext now)

Full text

(1)

University of New Hampshire

University of New Hampshire Scholars' Repository

Doctoral Dissertations

Student Scholarship

Winter 1997

A metric-based scientific data model for knowledge

discovery

David T. Kao

University of New Hampshire, Durham

Follow this and additional works at:

https://scholars.unh.edu/dissertation

This Dissertation is brought to you for free and open access by the Student Scholarship at University of New Hampshire Scholars' Repository. It has been accepted for inclusion in Doctoral Dissertations by an authorized administrator of University of New Hampshire Scholars' Repository. For more information, please [email protected].

Recommended Citation

Kao, David T., "A metric-based scientific data model for knowledge discovery" (1997). Doctoral Dissertations. 1994.

(2)

INFORMATION TO USERS

This manuscript has been reproduced from the microfilm master. UMI

films the text directly from the original or copy submitted. Thus, some

thesis and dissertation copies are in typewriter face, while others may be

from any type o f computer printer.

The quality of this reproduction is dependent upon the quality of the

copy submitted. Broken or indistinct print, colored or poor quality

illustrations and photographs, print bleedthrough, substandard margins,

and improper alignment can adversely afreet reproduction.

In the unlikely event that the author did not send UMI a complete

manuscript and there are missing pages, these will be noted. Also, if

unauthorized copyright material had to be removed, a note will indicate

the deletion.

Oversize materials (e.g., maps, drawings, charts) are reproduced by

sectioning the original, beginning at the upper left-hand comer and

continuing from left to right in equal sections with small overlaps. Each

original is also photographed in one exposure and is included in reduced

form at the back of the book.

Photographs included in the original manuscript have been reproduced

xerographically in this copy. Higher quality 6” x 9” black and white

photographic prints are available for any photographs or illustrations

appearing in this copy for an additional charge. Contact UMI directly to

order.

(3)
(4)

A M

e t r i c

- B

a s e d

S

c i e n t i f i c

D

a t a

M

o d e l

f o r

K

n o w l e d g e

D

i s c o v e r y

by

David T . Kao

B.S.. N ational C entral University. 1986

M.S.. M ichigan S ta te University. 1991

;

DISSERTATION

* r

Subm itted to the U niversity o f New H am pshire

in P artial Fulfillment of

the R equirem ents for the Degree of

I

D octor of Philosophy

in

C om puter Science

(5)

UMI Number: 9819677

Copyright 1997 by

K a o , David T .

All rights reserved.

I

1 it {

i

t UMI Microform 9819677

Copyright 1998, by UMI Company. All rights reserved. This microform edition is protected against unauthorized

copying under Title 17, United States Code.

j

UMI

f 300 North Zeeb Road

(6)

J

ALL RIGHTS RESERVED

© 1997

(7)

The dissertation has-been exam ined and approved.

D issertation Co-Directbr, R. Daniel Bergeron Professor of C om puter Science

D issertation Co-Dirpctor, Ted M. Sparr Professor of C om puter Science

/?

R aym ond Greenlaw

Associate Professor of C om puter Science

Georges G. G rinstein

Professor of C om puter Science University of M assachusetts, Lowell

Samuel D. Shore

Professor of M athem atics

(8)

IV

D ed ication

To iny fa th e r. Y u-San K ao. a n d m y m o th e r. Fong-M ei Lee. for th e ir u n d e r s ta n d in g , p atien ce, a n d love.

(9)

A cknow ledgem ents

M any p e o p le have c o n trib u te d , in big a n d sm a ll ways, to th e c o m p le tio n o f this d is s e rta tio n a n d as im p o rta n tly , to th e q u a lity o f m y life d u rin g th is jo u rn e y . I would like to take a d v a n ta g e o f th is o p p o r tu n ity to acknow ledge th e ir effort a n d e x p re ss m y a p p re c ia tio n .

P ro fe sso r R . D an iel B ergeron has b e e n a c o n s ta n t so u rc e o f in sp ira tio n . His vision played a p iv o tin g ro le in id en tify in g m y re search to p ic a n d d e fin in g m y d isse rta tio n scope. I ca n alw ays rely o n h im for s o u n d advice a n d h o n est o p in io n .

P ro fe sso r T ed M. S p a rr in tro d u ce d m e to th e field o f scien tific d a ta b a s e system s. His e x te n siv e k n o w led g e in d a ta b a s e a n d s ta tis tic s h e lp e d m e to p u t my research into p e rsp e c tiv e a n d p re v e n te d m e from re in v e n tin g w heels.

; T h a n k s to P ro fe sso r R ay m ond G reen law th is d is s e r ta tio n h a s m any fewer

incon-' sistencies a n d e d ito ria l p itfalls. His th o ro u g h n ess has set a new s t a n d a r d in w ritin g for me.

All th e re m a in in g in c o n siste n c ie s an d m ista k es a re solely m ine.

P ro fe sso r G eorges G . G rin ste in has p ro v id ed m a n y v a lu a b le com m ents a n d fresh

; ideas for m e to p o n d e r. It was a d in n e r co n v e rsatio n w ith h im a n d D a n in A tla n ta . G eorgia

th a t in stille d th e co n fid e n ce in m e to p u rsu e m y d is s e r ta tio n re se a rc h .

P ro fe sso r S am u el D. S h o re in tro d u c e d m e to th e fields o f logic, topology, a n d m e tric I theory, w h ich serv e as th e th e o re tic a l fo u n d a tio n o f th is d is s e r ta tio n . He is not only my

j m e n to r in m a th e m a tic s , b u t also a very d e a r friend.

D u rin g m y P h .D . stu d y . I have h ad th e o p p o r tu n ity to w ork w ith P ro fessor P ila r de la T o rre o n trie co m p le x ity an aly sis. W h ile th is c o lla b o r a tio n d id n o t d irec tly c o n trib u te to m y d is s e r ta tio n re se a rc h , it h a s been a very re w a rd in g ex p e rie n c e .

I enjo y ed th e o p p o r tu n ity to c o a u th o r w ith M ike C u llin a n e 011 two occasions. T h e m any h o u rs o f fun I s p e n t w ith h im a n d S am in o u r w eekly to p o lo g y se m in a rs will be m issed a lot.

I w ould a lso like to th a n k th e sta ff o f c o m p u te r scien ce d e p a rtm e n t for th e ir ad-k

? rn in istra tiv e s u p p o r t. L in d a A ndrew s is alw ays p le a s a n t a n d a tte n tiv e to an y m u n d a n e

p ro b lem I b ro u g h t to her. G in a Ross a n d A n d y E van s h av e p ro v id e d excellent h ard w are t

‘ a n d so ftw are serv ice.

M an y fellow g ra d u a te s tu d e n ts have en ric h ed m y life on a d a ily basis. T h e y will be re m e m b e re d a n d I believe t h a t our s e p a ra te p a th s w ill cro ss in th e future.

(10)

sup-p o rtiv e. A lth o u g h we a r e g eo g rasup-p h ically se sup-p a ra te d by th o u s a n d s o f m iles, sh e never fails to m ake h e r p resen ce fe lt th ro u g h o u r m an y phone calls.

L ast b u t n o t th e least. I have to th a n k m y p a r e n ts . T h e y have fa ith in me. It is th ro u g h th e ir e n c o u ra g e m e n t a n d s u p p o rt th a t I was a b le to com e to th e S ta te s to p u rsu e m y g ra d u a te stu d y . T h e y alw ays give m e the b e st: to th e m . I a m forever in d e b t.

(11)

V I !

Table o f C on ten ts

D e d ic a tio n iv A ck n o w le d g e m e n ts v L ist o f F ig u res x List o f T ables x ii A b stra c t x iii 1 In tr o d u ctio n 1 1.1 S cientific D a ta ... 1 1.2 T a x o n o m y o f C o n v e n tio n a l D a ta M o d e l s ... 3 1.3 In te r -In s ta n c e R e l a t i o n s h i p s ... 1 1.4 M odels o f In te r -In s ta n c e R e l a t i o n s h i p s ... (i 1.5 T h e sis O u tlin e ... 8

2 S cien tific D a t a M o d els 10 2.1 I n t r o d u c t i o n ... 10

2.2 Im p le m e n ta tio n P rim itiv e s a n d P h y sical D a ta M o d els ... 11

2.2.1 G e o m e t r i e s ... 12 2.2.2 T o p o lo g ie s ... 13 2.2.3 In d e x a b le T o p o lo g ie s ... 15 2.2.4 S tr u c tu r e d D a t a ... 17 2.3 B asic O p e r a t i o n s ... 20 2.3.1 S u b s e t O p e r a t i o n s ... 20 2.3.2 A n a ly sis a n d E x p l o r a t i o n ... 22 2.3.3 M iscellan e o u s D a ta b a s e O p e r a t io n s ... 23 2.4 C o m p u ta tio n a l G r i d ... 24 2.4.1 T a x o n o m y o f C o m p u ta tio n a l G r i d s ... 25 2.4.2 C o m p u ta tio n a l G rid S u m m a r y ... 27

2.5 A p p lic a tio n V isu a liz a tio n S y ste m ( A V S ) ... 28

2.5.1 AVS S tr u c tu r e d D a t a ... 28 2.5.2 AVS U n s tru c tu r e d D a t a ... 2!J

(12)

v ii i

2 .5 .3 AVS S u m m a r y ... 30

2.6 F ib e r B u n d l e ... 30

2.6.1 F ib e r B u n d le M odel o f F ield D a t a ... 30

2 .6 .2 P iecew ise Field R e p r e s e n ta tio n s ... 32

2 .6 .3 C o m p a c t Field R e p r e s e n t a t i o n s ... 33 2 .6 .4 F ib e r B u n d le S u m m a r y ... 34 2.7 L a t t i c e ... 34 2.7.1 M u ltip le Views o f a D a ta S e t ... 35 2 .7 .2 D e fin itio n of L a t t i c e ... 35 2 .7 .3 L a ttic e S u m m a r y ... 37 2.8 S u m m a r y ... 37 3 M a t h e m a t i c a l P r e l i m i n a r i e s 3 8 3.1 I n t r o d u c t i o n ... 38 3.2 D is ta n c e F u n c tio n a n d M etric A x i o m s ... 38 i 3.3 C o n tin u ity a n d N e i g h b o r h o o d s ... 42 J 3.4 T o p o lo g ies a n d D istances ... 47 = 3.5 S u m m a r y ... 50 4 A M e t r i c - B a s e d S c ie n tif ic D a t a M o d e l 5 1 ! 4.1 I n t r o d u c t i o n ... 51 ! 4.2 D a ta a s F u n c t i o n s ... 51 4.2.1 D a ta S ets and F u n c t i o n s ... 52 4 .2 .2 F o rm u la tio n s o f D a ta F u n c t i o n s ... 53 4 .2 .3 F u n c tio n a l D a ta M odel ... 54 4 .2 .4 O rd e rs o f D a t a ... 55

I 4 .2 .5 I n te r -E n tity R e la tio n sh ip s a n d F u n c tio n F o r m u la t io n ... 57

4.3 F u n c tio n a l D ep en d en cy am o n g A t t r i b u t e s ... 58 4.4 P s e u d o - Q u a s im e tr ic s ... 60 4.4.1 M o t i v a t i o n ... 61 : 4 .4 .2 F ro m P se u d o -Q u a sim e tric s to P a r t i a l O r d e r s ... 62

1

4.5 O b s e rv a b le P r o p e r t i e s ... 63 4.5.1 O rd e r-T h e o re tic versus S e t - T h e o r e t i c ... 64 4 .5 .2 P ro p e rty -B a se d M e t r i c s ... 65 4 .5 .3 P r o p e r ty T o p o lo g iz a tio n ... 66 j 4 .5 .4 M e tric E v a l u a t i o n ... 67 s 4.6 V a ria te M e tric D e r i v a t i o n ... 67 4.7 D o m a in M e tric D e r i v a t io n ... 72

I 4.8 C o n tin u ity for M etric V a l i d a t i o n ... 76

! 4.9 S u m m a r y ... 77 l 5 P h y s i c a l D a t a M o d e l s 79 5.1 I n t r o d u c t i o n ... 79 5.2 H ie ra rc h ic a l M etric D a ta S tr u c tu re s ... 80 5.3 M u ltip o la r M a p p i n g s ... 83

(13)

ix 5.4 M axico D i s t a n c e s ... 86 5.5 D im en sio n ality I s s u e ... 88 5.6 A verage In te rp o le D ista n c e ... !J() 5.7 C o o rd in a te V olum e R e d u c t i o n ... 62 5.8 P ro v isio n s for P s e u d o - Q u a s im e tr ic s ... 94

5.9 M u ltip o la r M a p p in g v ersus E xisting A p proaches ... 97

5.10 S u m m a r y ... 98

6 P erfo rm a n ce A n a ly sis 99 6.1 I n t r o d u c t i o n ... 99

6.2 D a ta S y n t h e s i s ... 99

6.3 D ista n c e D is trib u tio n s a n d P ro x im ity A c c u m u la tio n s ... 101

6.4 In trin s ic D i m e n s i o n a l i t i e s ... 104 6.5 C lu ste re d D a t a ... 106 6.6 D a ta S et Sizes ... 115 i 6.7 N u m b e r o f P oles ... L16 I 6.8 P ole S e l e c t i o n ... 119 £ I 6.9 C lu s te r C e n te rs a s P o l e s ... 130 % 6.10 G re e d y A lg o rith m ... 131 L 6.11 Very H igh D i m e n s i o n a l i t y ... 135 ' 6.12 S u m m a r y ... 135 7 C on clu sion s 141 7.1 O verview ... 141 7.2 C o n t r i b u t i o n s ... 142 j 7.3 F u tu re R esearch ... 143 B ib liograp h y 145

A M a th em a tic a l D e fin itio n s 151

B T h eo rem P ro o fs 152

I

i

I

i.

(14)

X

List o f Figures

L.l E R S chem a D ia g ra m s o f th e S a te llite I m a g e ... 5

1.2 D ifferent P a tte r n s o f In te r-In s ta n c e R e la ti o n s h ip s ... 6

1.3 D a ta M odels a n d P rim itiv e s for R ep resen tin g In te r-In s ta n c e R e la tio n s h ip s . 7 ; 2.1 T w o D ifferent T o p o lo g ies D erived from th e S am e G e o m e t r y ... 14

2.2 G e o m etric S pace a n d In d e x S p a c e ... 16

2.3 N o n-D efault C o n n e c tio n s ... 17

2.4 D a ta Set w ith P a t t e r n a lo n g O n e D im ension O n l y ... IS j 2.5 In d ex Space for th e D a ta S et in F igure 2 . 4 ... 19

2.6 V arious 2-D G rid s ... 26

2.7 Topologies W h ich C a n N ot Be M odeled by C o m p u ta tio n a l G r i d s ... 28

2.8 P rim itiv e C ell T y p e s ... 29

2.9 C o m p o n e n ts o f a F i e l d ... 32 [ 2.10 A B ase C o n sists o f T w o S e g m e n t s ... 33 2.11 C ross P ro d u c t o f F i e l d s ... 34 2.12 L a ttic e as a F u n c t i o n ... 36 3.1 T w o D istances o n th e S am e S e t ... 49 4.1 E x am p le o f a R ich n ess R e l a t i o n ... 57

4.2 In te r-E n tity R e la tio n s h ip s Specified by F u n ctio n F o r m u l a t i o n ... 57

I 4.3 D ifferent D o m ain G e o m e trie s on th e S am e D o m ain S p a c e ... 60

4.4 C h o o sin g a G e o m e tric C e n t e r ... 63

4.5 T h e P ro cess o f V a ria te M e tric D e r i v a t i o n ... 67

: 5.1 H ierarch ical M e tric D e c o m p o s i t i o n ... 82

^ 5.2 D istan ce I llu s tra tio n o f X ... 84

5.3 P o in t P lo t o f M 2 [ A T ] ... 85

r 5.4 C o m p a riso n b e tw e e n S p h e re s B ased on rn > a n d E - > ... 87

s 5.5 D istan ce D is trib u tio n H i s t o g r a m ... 89

5.6 N u m b e r o f P o in ts E x a m in e d a t R adius 5. 10. a n d 1 5 ... 92

5.7 A reas Void o f P o i n t s ... 93

5.8 B o u n d in g P la n es fo r A/3 [A T ]... 94

(15)

XI

6.1 E x a m p le o f D ista n c e D is trib u tio n (D D ) ... 102 (i.2 E x a m p le o f P ro x im ity A c c u m u la tio n ( P A ) ... 103 6.3 (in . n . p . c . r ) = ( 1 . . . 1 0 . 10'1. 0 .0 .0 ) a n d 5 -P o la r M a p p in g s: D D s a n d PA s . . 105 6.4 (in . n . p . c . r ) = ( 1 . . . 10. 10°. 0 .0 .0 ): D D s a n d PA s ( O v e r la y e d ) ... 106 6.5 (in . n . p . c . r ) = ( 1 . . . 10. 10°. 0 .0 .0 ) a n d 5 -P o la r M a p p in g s: D D s a n d PA s . . 107 6.6 (in. n . p. c. r ) = ( 1 0 .1 0 1. 0.0 . . . 1 .0 .1 0 0 .2 ): DD H i s t o g r a m s ... 109 6.7 (rn. n . p . c . r ) = (10. 10'1. 0.0 . . . 1 .0 .1 0 0 .2 ): P a r tia l DD H i s t o g r a m s ... 110 6.8 ( i n . n . p . c . r ) = ( 1 0 .1 0 l . 0.0 . . . 1.0.100. 2): P a r tia l D D H isto g ra m s a f te r 5-P o la r M a p p i n g s ... 112 6.9 ( i n . n . p . c . r ) = ( 1 0 .1 0 1. 0.0 . . . 1 .0 .1 0 0 .2 ): P e rfo rm a n c e o f N e a re st N e ig h b o r Q u e ries ... 113 6.10 (m . n . p . c . r ) = (10. lO '.O .O ... 1.0. 100.2): P e rfo rm a n c e o f P ro x im ity Q u e rie s 111 6.11 ( i n . n . p . c . r ) = ( 3 . r t . 2 0 . 1 0 0 .2 ). n = 10*.2 x 10‘* 10 x 10'1: E fficiency for

D a ta S et o f V arious Sizes ... 117 6.12 ( i n . n . p . c . r ) = (3. n . 2 0 . 100. 2 ). n = 10°. 2 x 10° 10 x 10°: E fficiency for

I D a ta S et o f V arious Sizes ... 118

\ 6.13 ( m . n . p . c . r ) = (10 .1 0 '* .p . 100.2). p = 0 .2 5 .0 .5 0 .0 .7 5 . C o lle c tio n I: Effi­

ciency w ith R esp ec t to 1...10 P o l e s 120

6.14 ( i n . n . p . c . r ) - (10. 1 0 1. p. 100.2). p = 0 .2 5 .0 .5 0 .0 .7 5 . C o lle c tio n 2:

Effi-| ciency w ith R e sp e c t to 1...10 P o l e s ... 121

i 6.15 ( i n . n . p . c . r ) 1 (10. 10 1. p. 100.2). p = 0 .2 5 .0 .5 0 .0 .7 5 . C o lle c tio n 3:

Effi-I ciency w ith R esp ec t to 1...10 P o l e s ... 122

I

6.16 ( m . n . p . c . r ) = ( 10. 10 1. p. 100. 2). p = 0 .2 5 .0 .5 0 .0 .7 5 : Efficiency o f 4 -P o la r

I M a p p in g s w ith R e sp e c t to A verage In te rp o le D i s t a n c e s ... 124

| 6.17 A verage In te rp o le D ista n c e s versus S ta n d a r d D e v i a t i o n s ... 125

6.18 A verag e In te rp o le D ista n c e s versus S ta n d a r d D e v i a t i o n s ... 127 6.19 ( i n . n . p . c . r ) = ( 1 0 .10'1. 0.75 .1 0 0 . 2): Efficiency o f 4 - P o la r M a p p in g s w ith

R e sp e c t to A verag e In te rp o le D istan ces ... 128 6.20 ( n i . n . p . c . r ) = ( 1 0 .10'1. 0 .7 5 .1 0 0 .2 ): E fficiency for 4 -p o la r M a p p in g s w ith

R e sp e c t to th e S ta n d a r d D e v ia tio n s o f In te rp o le D i s t a n c e s ... 129 6.21 ( i n . n . p . c . r ) = (10. 1 0 * .0 .7 5 .4 .4 ): 1.000 4 -P o le S ets ... 130 6.22 ( i n . n . p . c . r ) = (10. 1 0 * .0 .7 5 .4 .4 ): E fficiency for 4 -P o Ia r M a p p in g s w ith

Re-j s p e c t to A v e rag e In te rp o le D i s t a n c e s ... 132 6.23 ( i n . n . p . c . r ) = (2. 101. 0 .7 5 .4 .5 ): 2 -D im en sio n al S c a tte r P l o t ... 133 6.24 ( i n . n . p . c . r ) = ( 2 . 1 0 1. 0 .7 5 .4 .5 ): 1000 4 -P o le S ets. A v e rag e In te rp o le

Dis-; ta n c e s v ersu s S ta n d a r d D e v i a t i o n s ... 133 ! 6.25 ( m . n . p . c . r ) 1 (2 .1 0 '* .0 .7 5 .4 .5 ): Efficiency o f 4 -P o la r M a p p in g s w ith R e-, sp e c t to A v e rag e In te rp o lc D i s t a n c e s ... 134 ■ 6.26 (rn. n . p. c . r ) = ( 1 0 .10'1. 0.25. 100.2): P e rfo rm a n c e o f th e G re e d y A lg o rith m 136 6.27 (rn. n . p. c. r) = ( 1 0 .10'1.0 .5 0 . 100.2): P e rfo rm a n c e o f th e G re e d y A lg o rith m 137 6.28 ( m . n . p . c . r ) = (1 0 .1 0 * .0 .7 5 . 100.2): P e rfo rm a n c e o f th e G re e d y A lg o rith m 138 6.29 ( m . n . p . c . r ) = ( 1 0 . 10'1.0 .5 0 . 100.2): P e rfo rm a n c e o f th e G re e d y A lg o rith m 139

(16)

x i i

List o f T ables

3.1 B asic M e tric A x i o m s ... 3(J

4.1 D ista n c e s S p ecifie d by d\ over X ... 09

4.2 V alues o f Zq- ... 70

4.3 V alues o f d > ... 70

4.4 O b se rv a b le P ro p e r tie s for Lo ... 75

5.1 N u m b e r o f P o in ts V isite d versus D im e n s io n a lity ... 90

6.1 P e rfo rm a n c e o f 5 -P o la r M a p p i n g s ... 108

1 L

(17)

A bstract

A

M e t r i c - B a s e d S c i e n t i f i c D a t a M o d e l f o r K n o w l e d g e D i s c o v e r y

bv

DavicI T. Kao

University of New Ham pshire. December. 1997

Scientific d a ta , w hich in clu d e m u ltim e d ia (e.g.. im ages, au d io , a n d video) a n d n o n -s ta n d a rd d a t a (e.g.. Rnger p rin ts a n d DMA sequences), is c h a ra c te riz e d by rich a n d co m p lex int.er-

insta n ce relationships in a d d itio n to th e in te r -e n tity relationships fo u n d in tra d itio n a l d a ta . C onven tio n a l data m odels a re insufficient for m o d elin g such in te r-in s ta n c e re la tio n sh ip s.

T h is thesis p ro poses a m e tric-b a sed sc ie n tific data m odel from th e no tio n s o f data-as-

fu n c tio n s an d p seu d o -q u a sim e tric s, w hich a re u sed to m odel in te r-e n tity a n d in te r-in sta n c e

re la tio n sh ip s respectively. C o m p a re d to o th e r scientific d a t a m od els, th e m e tric -b a se d con­ c e p tu a l m odel ca n be a p p lie d to m an y d a t a se ts w h e re g eo m etric view s m ight n o t otherw ise b e available.

A d e ta ile d a p p ro a c h is o u tlin e d for e x p lo rin g a n d d e riv in g p sc iid o -q u a sim e trirs to re p re se n t in te r-in sta n c e re la tio n s h ip s in a w ide v a rie ty o f d a ta . In p a r tic u la r, we introduce* th e n o tio n of observable pro p erties a n d show how it ca n be a p p lie d w ith id eas from point

se t topology to s y s te m a tic a lly d e riv e m etrics from n o n m e tric d a t a c o m p o n e n ts su c h as c a t­

eg o rical d a ta . We also d e m o n s tra te th e use o f c o n tin u ity as a m a th e m a tic a lly precise tool to v a lid a te m etrics d eriv ed th ro u g h th e p ro p o se d a p p ro a c h .

In o rd e r to s u p p o rt th e m e tric -b a se d m o d el a t th e phy sical level, we dev elo p ed two sim p le m echanism s, th e m u ltip o la r m apping, for tra n sfo rm in g a p se u d o -m e tric sp ace into a m u ltid im en sio n al space, a n d th e m edian tra n sfo rm a tio n , for d e riv in g a p seu d o -m e tric from a p seu d o -q u asim ctric. A fte r a p p lic a tio n o f m u ltip o la r m a p p in g a n d (p o ssib ly ) m edian tra n sfo rm a tio n , it is easy to use e x istin g p o in t spatial data s tru c tu re s such a s quadtree o r octree for m e tric d a ta s to ra g e a n d access. T h e re su lts o f o u r p e rfo rm a n c e analysis d e m o n s tra te th a t th e m u ltip o la r a p p ro a c h is ro b u s t a n d s ta b le o v er a wide ra n g e of d a ta p a ra m e te rs for d a t a se ts w ith in trin sic d im e n sio n a lity o f 10 or less. W h ile it is s till unclear w h e th e r the m u ltip o la r a p p ro a c h offers sig n ifican t p e rfo rm a n c e a d v a n ta g e o n proxim ity

(18)

q u erie s for d a t a sets o f very h ig h d im en sio n ality , p re lim in a ry re s u lts for 100 d im en sio n al d a t a still show ex cellen t p e rfo rm a n c e o n n earest n e ig h b o r q u erie s.

(19)

C hapter 1

In trod u ction

1.1

Scientific D a ta

In te rre la tio n s h ip s in scien tific data a re fu n d a m e n ta lly m o re im p licit, d ifficu lt to c a p tu re , a n d h a r d e r to m o d el th a n those in o th e r d a ta b a s e a p p lic a tio n s. For m o st c o n ­ v en tio n al a p p lic a tio n s , th e o n ly in te rre la tio n s h ip s o f in te re s t a r e th e ones a m o n g d iffe ren t sem an tic e n titie s , su c h as th e ones w hich c a n b e d e s c rib e d by a n E R d ia g ra m [16. 12]. In stan c es o f a p a r tic u la r s e m a n tic e n tity a tu p le o f a re la tio n o r a n o b je c t o f a class are o n ly re la te d by s h a r in g a set o f com m on p ro p e rtie s o f t h a t se m a n tic e n tity . In m o st situ a tio n s , su ch in te r - e n tity re la tio n sh ip s a re know n in a d v a n c e a n d ex p ressed d ire c tly ;l s

m e ta d a ta .

F or scientific d a ta , th e re is a n o th e r a s p e c t o f c o m p le x ity in th e in te rre la tio n s h ip s th e in te rre la tio n s h ip s a m o n g in sta n ces o f th e sa m e s e m a n tic e n tity . W hile th e precise' m ean in g o f sc ie n tific d a ta is still o p e n to d iscu ssio n , we d e fin e scien tific d a ta as d a t a w h ich possess rich in te r -in s ta n c e relationships. For e x a m p le , b o th m u ltim e d ia (e.g.. im ages, a u d io . an d video) a n d n o n -sta n d a rd data (e.g.. finger p rin ts a n d DMA sequences) a re c o n sid e re d as scientific d a t a by o u r d e fin itio n . A sc ie n tific database s y s te m m u st m an ag e d a t a b efo re such in te r-in s ta n c e re la tio n s h ip s a re well u n d e rs to o d . In fa c t, o n e m a jo r o b je c tiv e o f data

m in in g o r know ledge d isco ve ry in databases (K D D ) [19. 18. 6] is to discover, q u a n tify a n d

fu rth e r v a lid a te th e se r e la tio n s h ip s . 1

'T h e p ro cess o f e x p lo rin g m e a n in g fu l p a tte r n s in d a t a is know n b y d ifferen t n am es in v arious re s e a rc h disciplines. In p a rtic u la r, t h e te r m exploratory data a n a ly sis (E D A ) [63] h a s b een used e x te n s iv e ly in th e past for su ch a c tiv ity by t h e s ta tis tic s c o m m u n ity . For th e re s t o f th i s th e s is, we use th e te rm kno w led g e

(20)

In a h roacl sen se , scien tific database s y ste m s a r e d a ta b a s e sy ste m s w ith in teg ra te d s u p p o rt for th e m a n a g e m e n t a n d analysis o f scientific d a ta . S im ila r to co n v e n tio n a l d a ta b a s e sy stem s, scien tific d a ta b a s e system s also m odel a n d m a n a g e d a t a a c c o rd in g to its se m a n tic

stru ctu re, i.e.. th e in te rre la tio n s h ip s in d a ta . Since p a r t o f th e s e m a n tic s tr u c tu r e is d e te r­

m in ed o r evolves a s th e d a t a analysis process p ro c eed s, th e c o u p lin g b etw een d a ta m a n ­ ag e m e n t a n d d a t a a n a ly sis is a tight one. A sc ie n tific d a ta m o d el is a c o h e re n t fram ew ork for m o d elin g b o t h in te r-e n tity a n d in te r-in sta n c e re la tio n s h ip s to s u p p o r t d a t a m anagem ent a n d a n a ly sis a c tiv itie s .

T h is th e s is su m m a riz e s o u r research in th e d e v e lo p m e n t o f a fo rm a l scientific d a ta m odel. A s p a r t o f th e th e sis, a d a ta m odel a n d a class o f p h y sic a l data m odels a re proposed. T h e d a t a m o d el, w h ic h we ca ll the m etric-based sc ie n tific d a ta m odel, is d ev elo p ed to e x te n d th e c a p a b ilitie s o f e x is tin g conceptual a n d im p le m e n ta tio n d a t a m o d els su c h th a t in te r­ in sta n c e re la tio n s h ip s c a n b e p ro p erly re p resen te d . A c la ss o f p h y sical d a t a m odels a re d ev elo p ed for efficien t im p le m e n ta tio n o f such a c o n c e p tu a l e x te n sio n .

A t th e c o n c e p tu a l level, o u r scientific d a t a m o d e l is b a se d on a m a th e m a tic a l fram ew ork d e v e lo p e d from tw o basic n o tio n s d a ta -a s -fu n c tio n s a n d p seu d o -q u a sim etrics - w hich a re u se d fo r m o d e lin g in ter-e n tity a n d in te r-in s ta n c e re la tio n s h ip s respectively. By

re p re se n tin g d a t a a s fu n c tio n s, the m a th e m a tic a l fu n c tio n fo rm u la tio n s o f d a t a can be used for m o d elin g in te r - e n tity re la tio n sh ip s. A d e ta ile d a p p r o a c h is o u tlin e d for exploring a n d d e riv in g p s e u d o -q u a s im e tric s to rep resen t in te r-in sta n c e re la tio n s h ip s in a w ide variety of d a ta . In p a r tic u la r , we in tro d u c e th e n o tio n o f observable p ro p erties a n d show how it can b e a p p lie d w ith id e a s from p o in t set topology* [5] to s y s te m a tic a lly d e riv e m e tric s from non- m e tric d a t a c o m p o n e n ts . F inally, we d e m o n s tra te th e use o f c o n tin u ity as a m a th e m a tic a l Im­ precise to o l to v a lid a te m e tric s derived th ro u g h th e p ro p o s e d a p p ro a c h .

A t th e p h y s ic a l level, various hierarchical m e tr ic d a ta str u c tu r e s c a n be utilized as p hysical d a t a m o d e ls to im p lem en t o u r m e tric -b a se d c o n c e p tu a l m odel. In o rd e r to have efficient im p le m e n ta tio n s , we propose an innovative a p p r o a c h to d e riv e a new class of h ie r­ arch ical m e tric d a t a s tr u c tu r e s from point spatial data s t r u c t u r e s In s te a d o f perform ing d ire c t d e c o m p o s itio n o n m e tr ic data as is d o n e for e x is tin g h ie ra rc h ic a l m e tric d a ta s tru c

-know ledgc from d a ta .

' P o in t se i topology is also know n as general topology.

3In p o in t s p a tia l d a t a s t r u c tu r e s , th e s p a tia l o b je c ts to be s to r e d a r e d iscrete p o in ts in m u ltid im en sio n al spaces, c o m p a re d to o th e r s p a tia l d a ta s tru c tu re s such as region, rectangle, o r lin e s p a tia l d a ta s tru c tu r e s w hich a re d esig n ed fo r c o n tin u o u s sp atial o b je c ts [54. 53].

(21)

cures su ch a s m e tric trees [65. 64] a n d up-trees [67]. we define a class o f sim p le p r o n m ity -

p reserving m a p p in g s from p se u d o -m etric spaces to m u ltid im e n sio n a l spaces, w hich we call m ultipolar m ap p in g s. By a p p ly in g m u ltip o la r m a p p in g s to m etric d a ta , h ie ra rc h ic a l d e­

c o m p o sitio n s c a n be done in m u ltid im e n sio n a l sp a c e , a n d various e x istin g w e ll-u n d e rsto o d p o in t s p a tia l d a t a s tru c tu re s [51. 52]. such as quadtree, octree, o r k-d tree, c a n b e u tiliz e d its h ie ra rc h ic a l m e tric d a ta s tru c tu re s .

1.2

T axonom y of C onventional D a ta M odels

A da ta m odel is a collection o f c o n c e p tu a l to o ls for describ in g d a ta , re la tio n s h ip s , se m a n tic s, o p e ra tio n s , a n d c o n s tra in ts [43]. W e c a n ca te g o rize d a t a m odels by how th e y • d e sc rib e th e d a t a a b s tra c tio n . T ra d itio n a l d a ta b a s e a r c h ite c tu re p a r titio n s d a t a m o d els in to

th re e d iffe re n t levels o f a b stra c tio n : conceptual, im p le m e n ta tio n , a n d physical [16]. Ideally, d a t a a b s tr a c tio n a t each level is in d e p e n d e n t o f th e o nes in th e o th e r two.

C o n cep tu a l data m odels a re b ased on th e u s e rs ’ p ercep tio n s o f d a ta . T h e y p ro v id e

a high-level a b s tr a c tio n of th e real w orld e n titie s a n d th e in te rre la tio n sh ip s a m o n g th e m . T h e E n tity -R e la tio n s h ip (E R ) m odel p ro p o se d by C h e n [7] along w ith its v ario u s e x te n sio n s [57. 61. 24] is th e m ost w idely used c o n c e p tu a l d a t a m odel.

P h ysica l data models specify th e low -level o rg a n iz a tio n o f how d a ta is s to re d in

th e c o m p u te r. P h y sic al d a ta m odels a re u su a lly in v isib le to d a ta b a s e users. It is th e d a ta b a s e s y s te m d esig n er's jo b to define th e p h y sic a l d a t a m odels a n d d ec id e how th e y a re im p le m e n te d . C o m m o n physical d a t a m odels in c lu d e B - tr c e . B ~ -t.ree. B '- t r e e . R - t r r r . a n d so on.

1

Im p le m e n ta tio n data m odels serve a s a b rid g e betw een c o n c e p tu a l a n d p h y sical

d a t a m o d els. T h e y provide co n cep ts a t an in te rm e d ia te level o f a b s tra c tio n still u n d e r s ta n d ­ a b le by u se rs b u t w ith enough d e ta il to d efine th e way d a t a is logically o rg a n iz ed w ith in I

\ th e c o m p u te r. T h e in tera ctio n s betw een users a n d d a ta b a s e sy stem s also ta k e p lace a t th is

I a b s tra c tio n level. T h e interface is im p le m e n te d u sin g a D D L (data d efin itio n language) a n d

a D M L (d a ta m a n ip u la tio n language). A D D L is u sed to specify th e logical s tr u c tu r e s o f

? d a ta w hile a D M L is used to access d a t a s to re d in th e logical s tru c tu re s specified by th e

D D L. C o m m o n im p le m e n ta tio n d a ta m odels in c lu d e hierarchical, netw ork, relational, a n d

object-oriented.

(22)

id eally the choices of d a t a m o d els a t th e th re e a b s tra c tio n levels s h o u ld b e in d e p e n d e n t from ea ch o th e r, in re a lity th e y a r e c o rre la te d . G enerally, not every d a ta m o d e l a t o n e a b s tra c tio n level s u p p o rts all th e m o d e ls a t th e h ig h er level(s) e q u a lly well. E ven tw o d iffe re n t schem as o f th e sam e d a t a m odel c a n h av e d iffe ren t re q u ire m e n ts w hich are b e t t e r se rv e d by different fla ta m odels a t th e low er lev el(s). T h u s , w hen a new d a t a m odel is in tro d u c e d o r a n old d a t a m odel is ex te n d e d , it is im p o rta n t to m ake su re t h a t th e re e x ist d a t a m o d els a t lower level(s) to pro v id e th e n e c e ssa ry s u p p o r t efficiently.

1.3

Inter-Instance R elationships

T h e focus o f c o n v e n tio n a l d a t a m odels is in te r-e n tity re la tio n s h ip s w h ich a r e of p rim a ry in te re st in tr a d itio n a l d a ta . However, scientific d a ta also e n c a p s u la te com plex in te r-in sta n c e re la tio n sh ip s. L et us illu s tra te such in te r-in sta n c e re la tio n s h ip s th ro u g h a sim p le exam ple.

A m u lti-b a n d la n d s a te llite im age can b e re p re se n te d as a tw o -d im e n sio n a l a rra y o f records w ith each o f th e re c o rd s re p re se n tin g m u ltip le m e a su re m e n ts o f a specific sm all la n d area. E ach such re c o rd s h a re s th e sa m e m em b ersh ip in fo rm a tio n o f th e se m a n tic e n tity in q u e stio n , in th is case, th e im ag e o r th e 2-D a rra y re p re s e n ta tio n o f th e im age. T h e m em ­ b e rs h ip in fo rm a tio n in c lu d e s th e fo rm a t o f th e record a n d o th e r re le v a n t in fo rm a tio n such as th e d o m ain a n d p o ssib le c o n s tra in ts on each m e a su re m e n t. T h ese a r e th e s tr u c tu r e s a n d re la tio n sh ip s w hich can b e m o d eled by conventional c o n c e p tu a l d a t a m o d els. However, in a d d itio n to th e m e m b e rsh ip s h a rin g , re co rd s o f th e 2-D a r ra y a re also in te r r e la te d sp atially . W h ile a conventional c o n c e p tu a l d a t a m o d el such a s th e E R m odel c a n a s s o c ia te ea ch record w ith its s p a tia l c o o rd in a te s, th e s p a tia l re la tio n sh ip a m o n g records c a n n o t b e a d e q u a te ly d escrib ed .

L et us use th e e n tity n a m e S Q U A R E to re p re se n t th e records o f th e 2-D satellite; im ­ age. U sing th e E R m odel, th e c o n c e p tu a l sch e m a o f th e im age is illu s tr a te d in F ig u re 1.1(a). T h e E R d ia g ra m show s t h a t th e re a re four possible re la tio n sh ip s a m o n g in s ta n c e s o f e n ­ tity SQ U A R E - N O R T H . S O U T H . E A S T , a n d W E S T - b ased o n th e s p a tia l re la tio n s h ip am ong in stan ce s. However it fails to re p re se n t:

1. T h e way in stan ce s o f S Q U A R E a re re la te d th ro u g h N O R T H . S O U T H . E A S T , o r W E S T :

(23)

re la tio n a n d th u s , no in stan ce c a n b e im m e d ia te ly to th e N O R T H o f tw o different, in sta n c e s a n d no in s ta n c e can have m ore t h a n o n e in s ta n c e a s its im m e d ia te N O R T H .

B eyond t h a t , we know n o th in g a b o u t th e w ay in s ta n c e s a r e re la te d th ro u g h th e n o r t h

re la tio n . F or e x a m p le , th e c o n s tra in t t h a t th e N O R T H r e la tio n in tro d u c e s no cycles (i.e.. th e re is no in s ta n c e .s such t h a t s is N O R T H o f . . . itse lf) can n o t b e seen from th e E R d ia g r a m . 1

2. T h e w ay th e re la tio n s h ip s N O R T H . S O U T H . E A S T , a n d W E S T a re in te rre la te d : For ex a m p le , g iv en tw o in stan ce s .S[ a n d »■•>. we c a n n o t c o n c lu d e from th e E R d ia g ra m th a t .si is th e N O R T H o f .s-j if a n d o n ly if s-t is th e S O U T H o f .S[. N e ith e r c a n we say t h a t th e E A S T o f th e SO U T H o f a n in sta n c e , if it e x is ts , is th e sam e as th e S O U T H o f th e e a s t o f t h a t in sta n c e .

V

SQUARE I ✓ X I -< ----\ y SQUARE M \ 1 ---N E X T ---(a) (b)

F ig u re 1. 1: E R S chem a D iag ram s o f th e S a te llite Im age

T h e se c o n d issu e is a classical e x a m p le o f o n e w ell-k n o w n lim ita tio n o f th e ER m o d el - it is n o t p o ssib le to ex p ress re la tio n sh ip s a m o n g re la tio n s h ip s [43], For co n v e n tio n a l d a ta b a s e a p p lic a tio n s , th is lim ita tio n can b e overcom e b y th e in tro d u c tio n o f th e aggregation c o n s tru c t as a n e x te n s io n to th e E R m odel. H ow ever, in th e c a se o f th e 2-D s a te llite image*, a g g re g a tio n re s u lts in a sin g le re la tio n , say N E X T , by c o lla p s in g N O R T H . S O U T H . E A S T , a n d

W E S T (F ig u re 1 .1 (b )). T h is sim p ly red u ces th e se c o n d issu e in to th e first issue above. V arious o th e r e x te n s io n s to th e E R m o d el have b e e n p ro p o s e d to a lle v ia te th is in h e re n t

’ in fact, ev en im p le m e n ta tio n sch em a d eriv e d from th is E R d ia g r a m c a n n o t enforce th is c o n s tra in t efficiently. A v io la tio n o f th is k in d can n o t be d e te c te d by c h e c k in g th e c o n te n t o f a specific in sta n c e ; it is n ecessary to tr a v e rs e all in s ta n c e s in th e tr a n s itiv e closure.

(24)

(i

lim ita tio n [61. 62. 24). N one o f th e m su cc eed s in p ro v id in g an effective m e c h a n ism to d e sc rib e co m p lex in te r-in sta n c e re la tio n s h ip s in g e n e ra l.

In su m m a ry . E R d iag ra m s ca n n o t re p re s e n t th e re g u la ritie s, c o n s tra in ts , o r p a t­ te rn s o f th e re la tio n s h ip s am o n g in sta n c e s o f a n e n tity . F or exam ple, from th e E R sch e m a , it is n o t p o ssib le to tell th e differences a m o n g th e th r e e d is tin c t p a tte r n s in F ig u re 1.2. S ince m a k in g re la tio n s h ip s ex p licit is on e o f th e m o st b asic o b jectiv es o f d a t a m o d els at th e c o n c e p tu a l level in p a r tic u la r - co n v e n tio n a l c o n c e p tu a l d a ta m o d els a r e in a d e q u a te as m o d els for scien tific d a ta . C learly, we need a c o n c e p tu a l scientific d a t a m o d el t h a t s u p p o r ts th e re p re s e n ta tio n o f in te r-in sta n c e re la tio n s h ip s . T h is is th e p rin c ip a l c o n trib u tio n o f th e

m e tric-b a sed sc ie n tific data m odel d e sc rib e d in th is th e sis.

F ig u re 1.2: D ifferent P a t te r n s o f In te r -In s ta n c e R elatio n sh ip s

A s m e n tio n e d in S ection 1.2. th e d e v e lo p m e n t o f a new c o n c e p tu a l d a t a m odel m ay n e c e s s ita te th e d evelopm ent o f new d a t a m o d els a t th e two low er a b s tr a c tio n levels. T h e d e v e lo p m e n t o f su p p o rtin g d a ta m o d e ls for o u r m e tric -b a se d scientific d a t a m o d e l a t th e im p le m e n ta tio n a n d physical levels is d isc u sse d la te r . In p a rtic u la r, th e d e v e lo p m e n t a n d a n a ly sis o f s u p p o r tin g p hysical d a t a m o d els, i.e.. hierarchical m e tr ic data stru c tu re s. form a n im p o r ta n t c o m p o n e n t o f th is th esis.

In th e re st o f th is thesis, th e g e n e ra l te rm •'scientific d a ta m o d el" is u sed in place o f " c o n c e p tu a l scientific d a ta m odel" o r "scien tific d a t a m o d el a t th e c o n c e p tu a l level."

1.4

M od els o f Inter-Instance R elation sh ips

W h ile th e re e x ist differences in te rm in o lo g y a n d im p le m e n ta tio n , e x is tin g scien tific d a t a m o d e ls a r e all b ase d on th e sam e c o n c e p tu a l m o d e l o f in te r-in sta n c e re la tio n s h ip s th e g e o m e tric m odel. D e p en d in g on th e c o m p le x ity o f th e in te r-in sta n c e re la tio n s h ip s , th re e p rim itiv e s, g eo m etry, topology, a n d indexable topology, c a n b e used to s u p p o rt th e g e o m e tric

(25)

m odel a t th e im p le m e n ta tio n level (C h a p te r 2). E ach o f th e e x is tin g scientific d a t a m odels in c o rp o ra te s som e o r all o f th e th re e im p le m e n ta tio n p r im itiv e s . T h e r e also ex ists a g ro u p o f d a t a s tru c tu re s to s u p p o r t ea ch of th ese n o tio u s in a p h y sic a l d a t a m odel. F ig u re 1.3 illu s tra te s a h iera rch y o f d a t a m o d els for in te r-in sta n c e re la tio n s h ip s , w h e re each d a t a m odel is re p re se n te d by a re c ta n g le . In ste a d o f re p re se n tin g e a c h o f th e e x istin g scientific d a t a m odels, only th e m o d elin g p rim itiv e s em ployed, re p re s e n te d by ellip ses, are illu s tra te d in th e im p le m e n ta tio n layer. In g en eral, th e m ore ex p ressiv e a n d flexible th e d a ta m odel a n d th e m o d elin g p rim itiv e , th e w id er th e ra n g e o f d a t a t h a t c a n b e m odeled, b u t th e less efficient th e s u p p o rtin g d a ta s tru c tu re s .

C o n c e p iu a l D a ta M o d e ls Im p le m e n tatio n D a ta M od els and P rim itiv es

P h y sica l D a ta M o d e ls (S u p p o rtin ': D a ta Stru ctu res)

M e tric -B a s e d M o d e l (P s e u d o -Q u a s im e tric i G e o m e tric M ode! F o rm u la tio n o f P s e u d o Q u a s im e tr ic M e d ia n T r a n s f o r m a tio n E x istin g S c ie n tific D a ta M o d e ls G e o m e try T o p o lo g y In d ex a b le T o p o lo g y H ie ra rc h ic a l M e tric D a ta S tru c tu re s M u ltip o la r M apping P o in t S p a tia l D a ta S tru c tu re s G ra p h D a ta S tru c tu re s M u ltid im e n s io n a l A rra y s

F ig u re 1.3: D a ta M odels a n d P rim itiv es for R e p re se n tin g In te r -In s ta n c e R elatio n sh ip s

In th is th esis, we p ro p o se th e use o f a sp ecial d a t a fu n c tio n , th e p seud o -q u a sim et-

ric. as a basic c o n c e p tu a l m o d e lin g p rim itiv e for re p re s e n tin g in te r-in s ta n c e re la tio n sh ip s

(C h a p te r s 3 a n d 4). In te rm s o f expressiveness, a p s e u d o -q u a s im e tric is m ore pow erfid th a n th e g e o m e tric m odel a n d c a n b e ap p lied to a m uch w id e r ra n g e o f scientific d a ta . W h ile th e re a r e no d a ta s tru c tu re s re a d ily available to s u p p o rt p s e u d o -q u a s im e tric s as a c o n c e p tu a l m o d el, th e re ex ists a g ro u p o f hierarchical m e tric data stru c tu re s, su c h a s m e tric trees [65. 64] a n ti up-trees [67]. w hich can b e u tilized to s u p p o rt p se u d o -m e tric s, a m ore re stric tiv e n o tio n t h a n p se u d o -q u a sim e tric . In C h a p te r 5. we p ro p o se a n in n o v a tiv e a p p ro a c h , th e m ultipolar

(26)

s

wo develop a sim p le a n d p ra c tic a l te c h n iq u e , m edian tra n sfo rm a tio n , for u tiliz in g ex istin g hiera rch ica l m e tric d a t a s tr u c tu r e s as well as m u ltip o la r m a p p in g s to s u p p o r t p seu d o -q u a- sim etrics.

It sh o u ld b e n o te d t h a t th e te rm topology has d u a l in te r p r e ta tio n s in th is thesis. T h e first in te r p r e ta tio n is b a se d o n th e conv en tio n s used a n d u n d e r s to o d by re se a rc h e rs and p ra c titio n e rs in v a rio u s fields o f c o m p u te r science. B ased o n th is in te r p r e ta tio n , a topology is a d e sc rip tio n o f c o n n e c tio n p a tte r n s (S e c tio n 2.2). F orm ally, it c a n b e re p re s e n te d as a

directed graph. For in s ta n c e , we c a n ta lk a b o u t th e to p o lo g y o f a c o m p u te r n e tw o rk , o r the

topology o f a c o m p u ta tio n a l g rid . T h e sec o n d in te rp re ta tio n o f to p o lo g y in th is th e sis is based o n its form al d e fin itio n in g e n e ra l to p o lo g y (S ec tio n 3.4). In th is sen se, we c a n talk a b o u t th in g s su ch a s the u su a l topology o f the real num bers, o r neighborhood stru c tu r e o f a

topological space. N o n e th e le ss, th e re e x ists a co n n e ctio n b e tw e e n th e tw o in te r p r e ta tio n s

§ nam ely, b o th in te r p r e ta tio n s p ro v id e a way to d e sc rib e th e id e a o f "closeness." “n c a rb v .” or 3

| "n eig h b o rh o o d ." In th is th e sis, th e in te n d e d in te rp re ta tio n o f th e te rm to p o lo g y is obvious

| from its c o n te x t.

1.5

T hesis O u tlin e

:

■ C h a p te r 2 p re s e n ts a su rv e y o f e x istin g scientific d a t a m o d els a n d th e p rim itiv e s

j! em ployed for re p re s e n tin g in te r-in s ta n c e re la tio n sh ip s, nam ely, g eo m etry, topology a n d in ­

dexable topology. F o u r m o d els a re selec ted b a se d on th e c rite rio n t h a t each o n e c a n rep resen t

in te r-in sta n c e re la tio n s h ip s e x p lic itly b ase d o n one o r m ore o f th e s e p rim itiv e s. W h ile none of th ese m o d els a d e q u a te ly a d d re sse s th e p ro b le m o f m o d e lin g in te r-in s ta n c e re la tio n sh ip s, the su rv ey pro v id es a fram ew o rk for th e p ro p o se d m e tric -b a se d scien tific d a t a m o d el.

In C h a p te r 3. we give a b rie f in tro d u c tio n to th e m a th e m a tic a l fo u n d a tio n o f the

| m e tric-b ased m odel. S o m e b asic te rm in o lo g y a n d n o tio n s from gen era l topology a n d m e tric

theory a re covered th e re . T h e m a in th e m e is th e s tu d y o f s e m a n tic s for c o n tin u ity in J different m a th e m a tic a l s e ttin g s . In th e s u b se q u e n t c h a p te rs , th e n o tio n o f c o n tin u ity plays

an im p o rta n t role in b o th th e m e tric -b a se d m o d el a n d th e s u p p o r tin g d a t a s tr u c tu r e s . C h a p te r 4 d e s c rib e s o u r p ro p o se d m e tric -b a se d delta m o d el. S ev eral e x a m p le s are given to illu s tra te th e b asics. A s y s te m a tic a p p ro a c h is d e v e lo p e d to d e riv e m e tric s from no n -m etric a t tr ib u te s su c h as th e o nes in ca te g o rica l d a ta . L ast b u t n o t th e le a st, we p resent a form al m e th o d for v a lid a tin g th e d eriv e d m odel.

(27)

C h a p te r 5 is d e v o te d to th e s tu d y o f p hysical d a t a m o d els, i.e.. s u p p o r tin g d a ta s tru c tu re s , for th e p r o p o s e d m e tric -b a s e d d a t a m odel. T w o in n o v a tiv e a p p ro a c h e s , m u lti­

polar m ap p in g a n d m e d ia n tr a n sfo rm a tio n , a r e dev elo p ed for th is p u rp o s e . T h e y e n a b le th e

a p p lic a tio n s o f e x is tin g h ierarchical m e tr ic data stru ctu res a n d p o in t sp a tia l d a ta stru ctu res as p h y sical d a t a m o d e ls fo r th e m e tric -b a s e d m odel.

In C h a p t e r 6 . we p re s e n t a p e rfo rm a n c e s tu d y on m a jo r p a r a m e te r s o f th e m ul­ tip o la r m a p p in g s. In p a r tic u la r , we a r e in te re ste d in le a rn in g how th e s e p a r a m e te rs affect efficiency o f p r o x im ity q u e rie s in d a t a s e ts o f various c h a ra c te ris tic s . T h e re s u lts give us a sim ple g u id e lin e to fo r m u la te a n effective m u ltip o la r m a p p in g for a w id e ra n g e o f d a ta .

C h a p te r 7 p ro v id e s c o n c lu d in g re m a rk s a n d d ire c tio n s for f u tu r e re se a rc h .

£

4

(28)

10

Chapter 2

Scientific D a ta M od els

r

2.1

Introduction

x *

I

T h e need for m o d e lin g scien tific d a t a w ith com plex in te r-in s ta n c e re la tio n s h ip s

has long been re cognized by th e c o m p u ta tio n a l flu id d yn a m ic s a n d sc ie n tific visu a liza tio n com m u n ities. N o ta b le w o rk in c lu d e s th e c o m p u ta tio n a l g rid m odel [40. 60] a n d its v ario u s ex tensions, th e A V S m o d e l o f G e lb e rg e t al. [23]. th e fib er bundle m odel o f H a b e r. L ucas a n d Collins [28]. a n d th e la ttic e m odel o f B erg ero n a n d G rin ste in [2. 35]. F ro m th e per-

f sp ec tiv e of m o d elin g in te r-in s ta n c e re la tio n sh ip s, th e y all sh are th e sa m e g e o m e tric m odel

a t th e concep tu al level. W h ile th e r e e x ist differences in term in o lo g y a n d m e th o d o lo g y , th e o b jectiv e for all th e e x is tin g sc ie n tific d a t a m odels a t th e im p le m e n ta tio n level is to d eriv e a co m p act re p re s e n ta tio n o f " s tr u c tu r e d d a t a ” o r " d a ta w ith re g u la r p a t t e r n s ' b a se d on som e adjacency relations in th e logical space on w hich a m a p p in g to th e p h y sic a l space is defined. For in sta n c e , in a 2-D logical sp ac e, fo u r-n eig h b o r adjacency a n d eig h t-n e ig h b o r

adjacency a re two co m m o n a d ja c e n c y re la tio n s. T h e se m odels a re n o t c a p a b le o f efficient.

§

• re p re se n ta tio n o f h ig h ly c o m p le x in te r-in s ta n c e re la tio n sh ip s w here closed form a d ja c e n c y

; re la tio n specifications a r e n o t av a ila b le .

A dvances in sp a tia l d a ta m odels [54] p re se n t a different a p p ro a c h for m o d e lin g in te r-in sta n c e re la tio n sh ip s. In a s p a tia l d a t a m o d el, th e in te r-in sta n c e re la tio n s h ip s are represen ted as a se t o f c o o r d in a te s in a vecto r space. S p a tia l d a t a m o d els a r e e x tre m e ly useful as th e fo u n d a tio n for sp a tia l da ta stru c tu re s su c h as th e quadtree o r R -tre e [52. 53]. w hich have n u m ero u s a p p lic a tio n s [51]. H ow ever, for n o il-sp a tia l d a ta , th e re m ig h t n o t exist, a n a tu ra l m ap p in g from d a t a p o in t re co rd s to som e v e c to r space such th a t th e in te r-in s ta n c e

(29)

re la tio n sh ip s a r e re p re se n te d by c o o rd in a te s in th a t v e c to r sp ace.

It sh o u ld b e n o te d t h a t th e re also exists a g ro u p o f scien tific d a t a ex c h an g e s t a n ­ d a rd s w hich do n o t fall in to th e th re e layer d a ta m o d e l tax o n o m y . N o n e th e le ss, in a s u b tle way. th e ir fo rm a ts do re p re s e n t im p lic it c o n c e p tu a l view's o f th e d a ta . T h e m o st p o p u la r two such d a t a ex c h an g e s ta n d a r d s a re C D F (C o m m o n D a ta F o rm a t) d ev e lo p e d a t N A SA [49. 26. 45] a n d H D F (H ie ra rc h ic a l D a ta F o rm at) d e v e lo p e d a t N C S A [46]. T h e c o n c e p tu a l m odels im p lic itly d efin e d by b o th C D F a n d H D F a re e n c o m p a sse d by th e g en e ral c o m p u ­ ta tio n a l g rid m o d el (S ec tio n 2.4). D a ta exchange s ta n d a r d s su c h as th e se a r e sp ec ific atio n s of file fo rm a ts w h ich c a n b e c o n sid e re d as scientific d a t a m o d els for file sy stem s.

T h is c h a p te r su m m a riz e s th e fo u r scientific d a t a m o d els w hich re p resen t in te r- instan ce re la tio n sh ip s e x p lic itly : c o m p u ta tio n a l g rid m o d el. AVS m odel, fib er b u n d le m o d e,

j a n d la ttic e m odel. B efore we g e t in to d e ta ils o f th e four re s p e c tiv e d a t a m o d els (S ectio n s 2.4.

£ 2.5. 2.6. a n d 2.7). it is useful to s tu d y th e g en eral c o n c e p ts o f d a t a o rg a n iz a tio n (S ectio n 2 .2 )

[ a n d com m on o p e r a tio n s p e rfo rm e d on scientific d a t a (S e c tio n 2.3).

2.2

Im p lem en tation P rim itives and P h ysical D a ta M odels

A s c ie n tific data s e t is a fin ite set o f data points. E ac h d a t a p o in t c o rre sp o n d s

| to o b serv atio n s m ad e a t a p o in t o r sm a ll a re a in th e p h y sic a l space a t a specific tim e o r

p erio d . A p h y sical sp a c e ca n b e a v o lu m e o f atm o sp h e re , a m a g n e tic e n e rg y field, a specific biological p o p u la tio n , o r a c o m p u te r s im u la tio n o f such p h y sic a l e n titie s. E ac h d a t a p o in t is re p resen te d by a re co rd o f a t tr i b u te s re flectin g o b se rv a tio n s m a d e a t t h a t p o in t. S ince th e physical sp ace is w h ere th e d a t a s a m p lin g takes place, it is also called th e sam plintj sp a ce .1

i

S ince we a r e o n ly in te re ste d in scien tific d a ta , the te rm data s e t is used in ste a d o f sc ie n tific

data set in th e thesis.

A t th e c o n c e p tu a l level, e x is tin g scientific d a t a m o d e ls a re all b a se d on th e geo ­ m etric view o f th e p h y sical sp a c e - th e g eo m etric m odel. F or a given d a t a se t. w hile th e re is ex actly on e p h y sical sp ac e, th e re c a n b e m any d iffe ren t g e o m e tric view s o f th e sam e d a t a set. T he p o ssib ility o f m u ltip le g e o m e tric views e n a b le s us to ex p lo re im p lic it in te rre la ­ tio n sh ip s in d a ta , b o th in te r-e n tity a n d in te r-in sta n c e , w ith o u t th e lim ita tio n im p o sed o r suggested by th e p ro cess u se d for s a m p lin g d a ta in th e p h y sic a l space. A m o n g th e e x istin g scientific d a t a m odels, th e la ttice m odel, in p a rtic u la r, is sp e c ia lly d e sig n ed to fa c ilita te

(30)

12

m u ltip le g eo m etric views (S ec tio n 2.7).

W h ile th e ex istin g scientific d a t a m o d e ls differ in te rm in o lo g y a n d o b je c tiv e s , th ey a re all b a se d on one o r m ore o f th e th r e e im p le m e n ta tio n p r im itiv e s id e n tifie d in th is sec­ tio n for re p re se n tin g th e in te r-in s ta n c e re la tio n s h ip s a t th e im p le m e n ta tio n level. T h ese p rim itiv e s a re geo m etry. topology, a n d in d exa b le topology. In te rm s o f th e c o m p le x ity o f th e in te r-in s ta n c e re la tio n sh ip s th e y a re c a p a b le o f ex p ressin g , th e th re e n o tio n s fo rm a h ie ra r­ chy w ith g e o m e try b ein g th e m o st p o w e rfu l a n d in d ex ab le to p o lo g y b e in g th e least. T h is k in d o f expressive pow er com es a t th e p ric e o f efficiency, how ever. T h e n o tio n s o f g eom etry, top o lo g y , a n d in d ex ab le to p o lo g y p ro v id e a fram ew o rk for s tu d y in g th e four scien tific d a ta m o d els p re se n te d in th is c h a p te r a n d o u r p ro p o s e d m e tric -b a se d m odel. O n e im p o rta n t o b je c tiv e is to deriv e s u ita b le p h y sical d a t a m o d e ls, i.e.. d a ta s tr u c tu r e s , for ea ch o f th ese

k p rim itiv e s. W hile th e re is a g ro u p o f d a t a s tr u c tu r e s for ea ch o f th e se p rim itiv e s, a scientific

|

d a t a m o d e l m ight need to be s u p p o r te d by m o re th a n o n e k in d o f d a t a s t r u c t u r e a t th e p h y sic a l level. F ig u re 1.3 illu s tra te s th e re la tio n s h ip s am o n g all th e d a t a m o d els, p rim itiv e s.

!

a n d s u p p o r tin g d a ta s tru c tu r e s d isc u sse d in th is section.

2 .2 .1 G e o m e t r ie s

I T h e te rm g eo m etry h as m a n y d iffe re n t m eanings. G iv en a d a ta s e t. we define

g e o m e try as a n im p le m e n ta tio n p rim itiv e for its in te r-in sta n c e re la tio n s h ip s , i.e.. th e in­

te rr e la tio n s h ip s am o n g its d a t a p o in ts, b a se d o n a d irec t in te r p r e ta tio n o f th e c o n c e p tu a l g e o m e tric m odel. S pecifically, a g e o m e try c o n sists o f two co m p o n e n ts:

( 1. a m a p p in g from d a ta re co rd s to c o o r d in a te s in the g e o m e tric space.

I 2 . a d is ta n c e fu n c tio n in th e g e o m e tric sp ac e.

S E sse n tia lly , g eo m etry d esc rib es th e in te rre la tio n s h ip s am o n g d a t a p o in ts b a s e d o n th e ir

I lo c a tio n s in th e g eo m etric space. G e o m e tric in fo rm a tio n is e n c o d e d in th e form o f a tt r ib u te s

i ■j

;■ o f d a t a p o in ts, m eta d a ta o f th e d a t a s e t. a n d th e g eo m etric view su p p lie d by th e user.

A ttr ib u te s selected for re p re se n tin g g e o m e tric c o o rd in a te s a re c a lle d g e o m e tric a ttrib u tes. A d a t a s e t c a n have m an y d iffe ren t g e o m e trie s. D ifferent se ts o f a t tr i b u te s m ig h t b e used ;is g e o m e tric a ttr ib u te s , re s u ltin g in d iffe re n t g e o m e tric spaces, a n d th u s , d iffe ren t g eo m etrie s. F or th e sam e g eo m etric sp ac e, d iffe ren t choices o f d ista n c e fu n c tio n s also in d u c e different g e o m e trie s.

(31)

N o te t h a t fo r m a n y d a t a sets, th e re m ig h t n o t b e a m e a n in g fu l g e o m e tric rep re­ s e n ta tio n for its in te r - in s ta n c e re la tio n sh ip s. W h ile th e re is a lw a y s a n a t u r a l g eo m etric view for a s p a tia l d a t a s e t. th e p h y sic a l space for a n o n -s p a tia l d a t a set m ig h t n o t b e a c o o rd in a te sp ac e for in sta n c e , it m ig h t b e a m e tric space o r a b in a ry rela tio n . A g e o m e tric view a n d a g e o m e tric r e p re s e n ta tio n , i.e.. a geom etry, m ig h t n o t be re a d ily a v a ila b le in su c h cases.

T h e m o st o b v io u s w ay to sto re a d a ta set b a se d o n its g e o m e try is to u tilize som e

p o in t sp a tia l data s tr u c tu r e [54. 55] e n c o d in g t h a t g eo m etry . P o in t s p a tia l d a t a s tru c tu re s ,

how ever, a r e g e n e ra lly n o t a s efficient as m u ltid im e n sio n a l a r ra y s w h e re ea ch d a t a record c a n be accessed by its n u m e ric a l index. For m an y d a t a s e ts w ith a "sim p le" o r "regular" g eo m etry , it is o fte n p o s s ib le to d erive a b e tte r d a t a s tr u c t u r e for its access a n d storage. T h e re d u c tio n in c o m p le x ity re s u lts in th e n o tio n s o f topology a n d indexable topology.

»

| 2 .2 .2 T o p o lo g ie s

i A topology1 is a d ire c te d g ra p h defined on a set o f d a t a p o in ts b a se d o n a chosen

geo m etry . T h e o re tic a lly , a to p o lo g y is eq u iv ale n t to a d is ta n c e fu n c tio n w h ich takes only two values. 0 a n d 1. T h e r e e x is ts an arc. i.e.. d ire c te d ed g e , from p to q in a topology if a n d o n ly if th e d is ta n c e fro m p to q is 0. T h e m a p p in g fro m d a t a re c o rd s to geom etric c o o rd in a te s in th e g e o m e tr ic sp a c e , i.e.. th e first c o m p o n e n t o f a g e o m etry , is n o t p a rt o f a

!

topology. A to p o lo g y is u s u a lly d eriv ed to achieve on e o f t h e follow ing tw o objectiv es: 1. th e g e o m e try c a n b e tra v e rs e d o r ex p lo re d m ore efficien tly th ro u g h th e topology [60]. 2. specific c o m p u ta tio n a l p ro c e d u re s c a n b e a p p lie d to th e d a t a m o re efficiently [40].

j D e p e n d in g o n th e re q u ire m e n ts o f a n a p p lic a tio n , m o re t h a n o n e topology can

I b e d eriv e d from th e g e o m e try . N evertheless, th e e x iste n c e o f a n a r c from a p o in t p to a

I p o in t q u su a lly im p lie s t h a t q is re la tiv e ly close to p (it d o e s n o t n e c e ssa rily im p ly th a t p

t is close to q). U n d e r m o s t c irc u m sta n c e s, th e d is ta n c e fu n c tio n specified by a g eo m etry is

r sy m m e tr ic (S ec tio n 3.2) a n d so is th e sim p lified tw o-value d is ta n c e for th e to p o lo g y derived

t from it. T h u s , u n d ir e c te d g ra p h s suffice to d e sc rib e m o st to p o lo g ie s. F or th e rest, of th e

: th esis, u n less specified o th e rw is e , we a ssu m e u n d ire c te d g r a p h s for to p o lo g ie s.

In g en e ral, a u se fu l to p o lo g y for a d a ta set is a tra d e -o ff b etw e en th e following two factors:

(32)

1-t o 0 o 0 ° 0 -< --- --- 1 ,s - ° O p o ' , ° o\ V-> p p o' O / * 0 0 o 1 f o (a) (b) (c)

F ig u re 2.1: T w o D ifferent T opologies D eriv ed from th e S am e G e o m e try

L. T h e to p o lo g y closely reflects th e g eo m etry .

2. T h e to p o lo g y consists o f m o stly sim p le a n d re p e titiv e p a tte rn s such t h a t d a ta access a n d / o r c o m p u ta tio n a re fa c ilita te d .

F ig u re 2 .1 (a) illu s tra te s th e g e o m e try o f a d a t a s e t. a ssu m in g E u clid ea n d ista n c e . F or th e co n v e n ie n ce o f c o m p u ta tio n a n d access, we m ig h t d eriv e a to p o lo g y as d e p ic te d in F ig u re 2 .1 (b ). As c a n be seen, th e edges ro u g h ly in d ic a te closeness betw een p o in ts in th e g e o m e try , w ith a few exceptions. F ig u re 2.1(c) d e p ic ts a n o th e r to p o lo g y in tro d u c e d In­ s e ttin g a d is ta n c e th re sh o ld su ch t h a t two p o in ts a re c o n n e c te d by a n edge if a n d o n ly if th e ir d is ta n c e is lower th a n th e th re sh o ld . A lth o u g h th is topology re p re se n ts th e g e o m e try in a m o re p recise w ay th e ex isten c e o f a n e d g e re p re se n ts a c e rta in d egree o f closeness, as sp ec ifie d by th e th re sh o ld - it does n o t c o n sist o f re g u la r p a tte r n s to fa c ilita te c o m p u ta tio n o r s to ra g e access. T h u s, it m ig h t n o t be a s u sefu l as th e first topology.

T h e d e riv a tio n o f a to p o lo g y w hich closely reflects th e g eo m etry is u su ally achieved th ro u g h a series o f refinem ents. T h e in itia l to p o lo g y is o fte n c o n stru c te d from som e v a ria n t o f th e b a sic n o tio n o f nearest neighbor. F or in s ta n c e , we ca n c o n stru c t a n in itia l to p o lo g y by c o n n e c tin g ea ch p o in t to its n e a re st n e ig h b o r, o r th e n e a re st fc neighbors. T h e a p p ro a c h for d e riv in g F ig u re 2.1(c) is also b ase d o n a v a r ia n t o f th e n e a re st n eig h b o r notion.

F or a d a t a set w here th e in te r-in s ta n c e re la tio n s h ip s can b e a d e q u a te ly m o d eled by a topology, it is o ften b e tte r to d e riv e a d a t a s tr u c tu r e usin g such a to p o lo g y th a n to u tiliz e a p o in t s p a tia l d a t a s tr u c tu r e b a se d o n its g eo m etry . E ac h edge in th e topology- c a n b e im p le m e n te d as a b i-d ire c tio n a l p o in te r fro m o n e d a t a p o in t to a n o th e r. We ca n also la b e l ea c h p o in t w ith a u n ique in d ex a n d s to re th e a djacency m atrix. E ith e r way. access to d a t a p o in ts s till o ften requires som e se q u e n tia l tra v e rs a l o f p a r t o f the topology, a n d ra n d o m

References

Related documents

Berdasarkan hasil dari penelitian yang berjalan, sistem yang diusulkan untuk memerikan solusi pada permasalahan ini adalah dengan membuat Aplikasi Mobile Learning Client

• The program is an interdisciplinary team of public health, nursing, and social work students whose aim is to:.. • Provide evidence-based health information, screening,

The data quality assessment methodology as outlined in this paper has been used to recover seriously failing IT development / data warehousing projects; as a basis for

the 160 clinics in the Ebola sample (see Table D.20 for estimates using our full sample), in Table 5 we estimate treatment effects on general utilization; satisfaction with

Cost-benefit analysis is a framework for analysis that follows a logical sequence of steps: identifying policy options to solve a problem, setting out costs and benefits of

For example, clients will be able to upload documents and other electronic evidence into the War Room, lawyers can make use of the War Room to give clients access to draft

The fertility rates by second order live births were highest at age 28 on a level of 52 live births per 1 000 women in semi-urban or rural municipalities and at age 31 on a level