• No results found

Enhancing Traditional Databases to Support Broader Data Management Applications. Yi Chen Computer Science & Engineering Arizona State University

N/A
N/A
Protected

Academic year: 2021

Share "Enhancing Traditional Databases to Support Broader Data Management Applications. Yi Chen Computer Science & Engineering Arizona State University"

Copied!
41
0
0

Loading.... (view fulltext now)

Full text

(1)

Enhancing Traditional Databases to

Support Broader Data Management

Applications

Yi Chen

Computer Science & Engineering

Arizona State University

(2)

What Is a Database System?

„

Of course, there are traditional relational

database management systems (RDBMS)

„

Was introduced in 1970 by Dr. E. F. Codd (of

IBM)

„

Commercial relational databases began to

appear in the 1980s

(3)

3

A Relational Database (RDBMS)

Row

(Tuple, Record)

Column (Attribute)

Table (Relation)

Name Skill

age

James Beginner

21

Bob Experienced

33

Climber

Name Route

Date

Duration

Bob

Last Tango 10/10/05 5

Bob

Last Tango 1/10/06

4.5

Climbs

A predefined data structure (schema) is required.

(4)

projection:

Route = “Last

Tango”

selection:

σ

Name = “James”

Querying RDBMS: SQL

Name Skill

age

James Beginner

21

Bob Experienced

33

Climber

Name Route

Date

Duration

Bob

Last Tango 10/10/05 5

Bob

Last Tango 1/10/06

4.5

Climbs

Climber.name = climbs.name

Climbs

join:

Climber

Name

Skill

Age Route

Date

Duration

(5)

5

The Advantages of RDBMS

„

Good data organization

„

High efficiency for large datasets via indexing

and query optimization

(6)

But, 80% of the World’s Data is Not

in RDBMS!

Examples:

‹

WWW, Emails

‹

Personal data, documents of various format

‹

Sensor data

‹

A lot of scientific data (experimental data, large images,

documentation, etc)

¾

Why not?

¾

There are several assumptions in relational databases that

do not fit for handling this data.

(7)

7

Challenges for RDBMS (I)

„

RDBMS Assumption

: data conforms to a predefined

fixed schema, which is separated from the data itself

„

Reality:

‹

Data may be collected from different sources on the

web, therefore has different schemas

‹

Schema can change over time for a single source

„

Requirements:

We need to handle data of different

schemas and have the schemas tightly associated with

the data

(8)

XML as a Data Representation Format

„

XML has become a standard data format for

various applications, because of:

‹

Flexibility in schemas -- semi-structured data

‹

Self - describing feature

(9)

9

Yi Chen --- January 23, 2006

XML: the Standard for Web Data

Representation

PubMed

GenBank

BLAST

Internet

Web Service Requester

NCBI Web Service Publisher

...

XML Data

(10)

XML: Representing Phylogenetic Trees

From the Tree of the Life Website,

University of Arizona

(11)

11

Challenges for RDBMS (II)

„

RDBMS Assumption:

Data is clean and consistent.

„

Reality: real world data is dirty

‹

Data collected from different sources may have

missing and conflicting information

‹

Data that is obtained from data mining is often not

error-prone

‹

Experimental data often contains random errors

„

Requirements:

we need to measure data quality and

(12)

Roadmap of This Talk

„

Managing XML by leveraging mature RDBMS

[Chen et al 04]

‹

Introduction to XML

‹

A generic and efficient XML-to-RDBMS mapping

)

Data mapping from trees to tables

)

Query translation from tree navigation queries

to SQL queries that are efficient

„

Handling imprecise and incomplete data in

(13)

13

Sample XML Data

<books>

<book>

<title>

The lord of the rings...

</title>

<section>

<title>

Locating middle-earth

</title> ...

</section> …

</book>

</books>

books

book

title

section

title

section

title

figure

description

“The lord

of the

rings …”

“Locating

middle-earth”

“A hall

fit for a

king”

“King Theoden's

golden hall”

...

...

...

(14)

...

Sample XML Queries

„

XML query languages are

based on hierarchical

structure navigation (e.g.

XPath)

„

Sample queries:

‹

What are all the section

titles:

//section/title

books

book

title

section

title

section

title

figure

description

“The lord

of the

rings …”

“Locating

middle-earth”

“King Theoden's

golden hall”

...

...

“A hall

fit for a

king”

(15)

15

...

Sample XML Queries

„

XML query languages are

based on hierarchical

structure navigation (e.g.

XPath)

„

Sample queries:

‹

What are all the section

titles:

//section/title

‹

What are the titles of

sections that contain a

figure:

//section[/figure]/title

books

book

title

section

title

section

title

figure

description

“The lord

of the

rings …”

“Locating

middle-earth”

“King Theoden's

golden hall”

...

...

“A hall

fit for a

king”

Predicates

(16)

How to Query XML Data efficiently?

„

RDBMS have achieved high performance in

query evaluation.

„

Can we leverage RDBMS by encoding XML to

(17)

17

Analogy: Fourier Transforms

g * h =

∫∫

-

g(u)h(u)du

G(f)H(f)

Complex

+

(18)

Mapping XML Data to RDBMS

XPath

XML data

SQL

Query Translation

XML

fragments

Relational databases

Storage

Mapping

Challenge:

How to build the bridge

between hierarchies and

tables?

(19)

19

Data Mapping

ID

Tag

Value

Structural

Information

1

books

2

book

3

title

The...

4

section

5

title

Locating…

… …

T

(5)

books

book

title

section

title

section

title

figure

description

“Locating

middle-earth”

“A hall

fit for a

king”

“King Theoden's

golden hall”

(4)

(3)

(2)

(1)

“The lord of

the rings …”

Parent ID

(20)

Data Mapping

ID

Tag

Value

Structural

Information

1

books

2

book

3

title

The...

4

section

5

title

Locating…

… …

T

(5)

books

book

title

section

title

section

title

figure

description

“Locating

middle-earth”

“A hall

fit for a

king”

“King Theoden's

golden hall”

(4)

(3)

(2)

(1)

“The lord of

the rings …”

.

Design special labels

to encode node

(21)

21

Query Translator Architecture

„

How to choose XPath subqueries, such that:

‹

they can be easily translated to SQL subqueries

‹

the SQL subqueries can be efficiently evaluated

„

How to combine SQL subqueries to a complete one?

XPath

decomposition

XPath

Sub-query

translation

SQL sub-query

composition

Query Translator

(22)

Query Translator

Q: //book[//figure]/section/title

section

book

figure

title

(23)

23

Query Translator: (I) Decomposition to

Suffix Paths

Q: //book[//figure]/section/title

book

figure

section

title

book

(24)

(342000,343000)

σ

342000

Plabel

343000

T

Encoding Suffix Paths Using P-labeling

//book/section/title

books

book

title

section

title

section

title

figure

description

“The lord

of the

rings …”

“Locating

middle-earth”

“A hall

fit for a

king”

“King Theoden's

golden hall”

(1)

(2)

(3)

(4)

...

...

...

(100)

id

Plabel

1

100000

2

210000

3

321000

4

421000

5

342100

T

(5)

books

book

title

section

title

section

title

figure

description

“The lord

of the

rings …”

“Locating

middle-earth”

“A hall

fit for a

king”

“King Theoden's

golden hall”

(1)

(2)

(3)

(4)

...

...

...

(100)

Evaluating suffix paths

SQL selections on P-labels

(25)

25

Query Translator: (II) Selection on P-labels

Q: //book[//figure]/section/title

book

figure

section

title

(26)

D-labeling Scheme

books

book

title

section

title

section

title

figure

description

“The lord

of the

rings …”

“Locating

middle-earth”

“A hall

fit for a

king”

“King Theoden's

golden hall”

(1, 20000, 1)

(6, 1200, 2)

(10,80,3)

(81, 250,3)

...

...

...

(100, 200,4)

• D-labeling is used to connect

suffix paths.

D-labels

(start, end, depth)

can be used to detect

ancestor-descendant

relationships between nodes

in a tree.

(27)

27

A Bi-labeling Based Query Translation

//book//figure

books

book

title

section

title

section

title

Figure (120, 160, 5)

description

“The lord

of the

rings …”

“Locating

middle-earth”

“A hall

fit for a

king”

“King Theoden's

golden hall”

(1, 20000, 1)

(6, 1200, 2)

(10,80,3)

(81, 250,3)

...

...

...

(100, 200,4)

(

σ

Plabel for “//book”

(

ρ

T

T’

))

T’.Start

T.Start

T’.End

T.End

(

σ

Plabel for “//figure”

T )

Plabel Start End Depth Val

100000 1 20000 1 …

210000 6 1200

2 …

321000 10 80 3

421000 81 250 3

442100 100 200 4

554680 120 160 5

… …

T

Stitching suffix paths

SQL joins on D-labels

(28)

Query Translator: (III) Join on D-labels

Q: //book[//figure]/section/title

book

figure

section

title

book

(29)

29

Comparison with Previous Approach

book

figure

section

title

book

Ours:

fewer disk accesses,

fewer joins

book

figure

section

title

Previous Approach

[Li & Moon 01, Zhang et al 01,

Tatarinov et al 02,

(30)

Compare our system (BLAS) with the previous

approach using D-labeling scheme only

„

Data sets

„

Query sets

)

Suffix path queries

)

Path queries

)

XPath queries

)

Benchmark queries

„

Query Engines: DB2, TwigStack Join

[Bruno et al 02]

Experiment Setup

Data Set

Size(MB) Nodes(K) Tags Depth DTD

Protein 70 2277 66

7 Tree

Shakespeare 26

640

19

7

Acyclic

graph

(31)

31

Query Execution Time

Query Name:

A:Auction

P: Protein

S: Shakespeare

1: suffix path query

2: path query

3: XPath query

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

QA1 QA2 QA3 QP1 QP2 QP3 QS1 QS2 QS3

Queries

Ti

m

e

(32)

0

2

4

6

8

10

12

14

35

70

105

139

174

File Size (MB)

Ti

m

e

(

s

e

c

onds

)

DLabeling

BLAS

Benchmark data, Benchmark query Q3

(33)

33

Summary of XML Data Management

„

We proposed a generic XML-to-RDB mapping,

based on a bi-labeling scheme.

„

It is more efficient compared with previous

approach, since it generates SQL queries that

require:

‹

fewer disk accesses

‹

fewer joins

‹

fewer intermediate results

(34)

Roadmap of This Talk

„

Managing XML by leveraging mature RDBMS

[Chen et al 04]

‹

Introduction to XML

‹

A generic and efficient XML-to-RDBMS mapping

)

Data mapping from trees to tables

)

Query translation from tree navigation queries

to SQL queries that are efficient

„

Handling imprecise and incomplete data in

(35)

35

Probabilistic Databases:

Managing Imprecise Data

„

How to measure the imprecision of the data?

„

The simplest model: associate each tuple a

probability

Name Skill

age

Pr

James Beginner

21

p1

Bob Experienced

33

p2

Climber

Name Route

Date

Duration

Pr

Bob

Last Tango 10/10/05 5

q1

Bob

Last Tango 1/10/06

4.5

q2

Climbs

(36)

How Does This Affect Query

Evaluation?

[Fuhr&Roellke 97]

σ

v p

v p

v

1

p

1

v

1

v

2

p

1

p

2

v

2

p

2

Π

v

p

1

v

p

2

v

1-(1-p

1

)(1-p

2

)

(37)

37

Yi Chen --- January 23, 2006

LT

1-(1-p2q1)(1- p2q2)

Π

LT 1-(1-q

1

)(1-q

2

)

Bob LT

p2(1-(1-q

1

)(1-q

2

))

Wrong !

Correct

Challenges: Correctness Depends on

Execution Order

[Suciu et al 05]

!

Name Skill

Pr

Bob Exp

p2

Name Route

Pr

Bob LT

q1

Bob LT

q2

Find distinct climb routes that have been climbed by an

experienced climber.

Name Route

Pr

Bob LT

q1

Bob LT

q2

Π

Name Skill

Pr

Bob

Exp p2

Bob

LT p2q1

Bob

LT p2q2

(38)

This can not be

convertible !

How to Handle Incomplete Data?

Seller Make Model

Body

Style Year Price

Mark BMW 325Cic

convertible

2006 36000

Alice Ford Taurus

2001 8000

Our techniques infer possible values with

probability by mining data statistics

(39)

39

Querying Incomplete Databases

„

Given missing value prediction

Querying incomplete databases

Querying probabilistic databases

„

What if we are not able to store the predicted

values? --- e.g. data integration applications

„

On-the-fly query rewriting

How should we handling imprecise and

incomplete XML Data?

(40)

Conclusions

„

Traditional RDBMS does not satisfy the requirements

in ever growing scientific and web applications

„

We have discussed two enhancement to RDBMS

‹

Efficient XML data management

‹

Handling imprecise and incomplete data

„

Other enhancement to RDBMS that I am working on

‹

Data stream processing

(41)

41

Thank you!

Figure

figure section
figure section

References

Related documents

Why would you want to do this? An example would be a large spreadsheet in which one of the cells has a number in it that is used by three formulas in other cells. Using

The average annual cost of basic financial services in Australia (a basic transaction account, a low rate credit card and some basic general insurance) is $,740 – just to maintain

As more two-way bilingual programs in Oregon require effective bilingual teachers who can help students develop biliteracy, researching the lived experiences of

This paper develops a framework to analyze the business cycle movements of stock market returns, volatility and volatility risk-premia. In our model, the aggregate stock market

Early Detection of Mountain Pine Beetle Damage in Ponderosa Pine Forests of the Black Hills Using Hyperspectral and WorldView-2 Data.. Kyle

In the present study, a mini core collection of finger millet (80 accessions), along with accessions from different parts of India and four control cultivars were used to

There are many direct and indirect costs of high debt in low-income countries. Empirical evidence suggests that there is likely a link between debt levels and economic growth, but