• No results found

Shuffling Data Around

N/A
N/A
Protected

Academic year: 2021

Share "Shuffling Data Around"

Copied!
22
0
0

Loading.... (view fulltext now)

Full text

(1)

Shuffling Data Around…

An introduction to the keywords in

Data Integration, Exchange and Sharing

Shuffling Data Around

Shuffling Data Around

An introduction to the

An introduction to the keywords keywords in in

Data Integration, Exchange and Sharing Data Integration, Exchange and Sharing

Dr. Anastasios Kementsietsidis Dr. Anastasios

Dr. Anastasios Kementsietsidis Kementsietsidis

Special thanks to Prof. Renée J. Miller

Special thanks to Prof. Ren Ren é é e J. Miller e J. Miller

(2)

The Cause and Effect Principle

Cause:

Data sources are autonomous, heterogeneous – Different data models, types and schemas

– Different vocabularies (in data and schemas)

– Different requirements for what/how data is shared Effect:

Integration: Provide uniform access to heterogeneous data

Exchange: Move data between heterogeneous sources

Sharing: Provide non-uniform access to data

through each source’s schema and vocabulary

(3)

Data Warehousing Architecture

Outside Website Outside Website

Data Extraction/Forwarding tools Data Extraction/Forwarding tools

User Query

Relational Database (Warehouse) Relational Database

(Warehouse)

Swissprot Database Swissprot Database GDB

Database GDB Database

Image server

What about updates here?

What about updates here?

or here?

or here?

(4)

Virtual Integration Architecture

Metadata

Mediator Metadata

Swissprot Database Swissprot Database GDB

Database GDB Database

Wrapper Wrapper Wrapper Wrapper

Reformulation Engine Optimization Engine

Execution Engine

User Query

Mediated Schema

Mediated Schema

(5)

Peer-to-Peer Architecture

Outside Website Outside Website

Swissprot Database Swissprot Database GDB

Database GDB

Database Image server

User Query

User Query

User Query

(6)

What are the Metadata?

• Schemas (models of data)

– Structured or Semi-structured

• e.g., relational, object-oriented, XML data, … – Talk will not cover unstructured data

• e.g., documents, images, audio files,…

– Data(base) is an instance of a schema

• Mappings

– Model relationship between schemas or data

• Schema mapping (e.g., views)

• Data mapping (e.g., aliases)

– Requirements for mapping specifications

(7)

Metadata Lifecycle

• Creation

– Automatic discovery or creation – Design tools facilitating creation

• Maintenance

– Maintain (integrated) schemas as sources change – Maintain mappings as schemas change

• Use

– Query answering

– Data exchange (materialization), updates, etc…

(8)

Schema Integration

Schema Design Problem:

Create global integrated schema G (and mappings) for a set of independently designed local schemas S i , 1 ≤ i ≤ n

Data Data Data Data

Local Source Schema S n Local Source

Schema S n Local Source

Schema S 1 Local Source

Schema S 1

Data Data

Local Source Schema S 2 Local Source

Schema S 2

Global Integrated Schema G Global Integrated

Schema G

What about schema changes ? or source additions?

What about

schema changes ?

or source additions?

(9)

Mappings (a.k.a. views)

A view is just a query… works like a function….

• It accepts as input the local source instance(s)

• It outputs an instance of the global (target) schema

create

create view view funding funding ( ( aid aid , , amount amount , project, , project, date date ) as ) as

( select ( select grant grant . . gid gid , , grant grant . . amount amount , , grant grant . . project project , , received received . . date date from from grant grant , , received received

where

where grant grant . . gid gid = = received received . . gid gid ) )

Relation: funding (aid, amount, project, date) Relation: finances (aid, date, amount)

Global Integrated Schema G

Relation: grant (gid, amount, project) Relation: received (gid, date)

Global Integrated Schema G

Local Source Schema S Local Source

Schema S

This is

Global-as-View (GAV) This is

Global-as-View (GAV)

(10)

This is Local-as-View (LAV)

create

create view view grant grant ( ( gid gid , , amount amount , project) as , project) as

( select ( select funding funding . . aid aid , , funding funding . . amount amount , , funding funding . . project project , , from

from funding funding ) )

Relation: grant (gid, amount, project) Relation: received (gid, date)

Relation: funding (aid, amount, project, date) Relation: finances (aid, date, amount)

Global Integrated Schema G Global Integrated

Schema G

Local Source Schema S Local Source

Schema S

(11)

GAV vs. LAV (in “plain’’ English)

• GAV:

– Gives direct information about which data satisfy the elements of the global schema

– Not easily extendible on source schema changes or source additions

– Query answering is “easy”…

• LAV:

– Does not give direct information about which data satisfy the global schema

– Easily extendible on source schema changes or source additions

– Query answering is “hard”…

(12)

But There Is More…

Global-and-Local-as-View (GLAV)

Relation: grant (gid, amount, project) Relation: received (gid, date)

Relation: funding (aid, amount, project, date) Relation: finances (aid, date, amount)

Global Integrated Schema G Global Integrated

Schema G

Local Source Schema S Local Source

Schema S

( ( select select funding funding . . aid aid , , funding funding . . amount amount , , funding funding . . project project , , funding funding . . date date from from funding funding , , finances finances

where

where funding funding . . aid aid = = finances finances . . aid aid ) ⊇ ) ⊇

( ( select select grant grant . . gid gid , , grant grant . . amount amount , , grant grant . . project project , , received received . . date date from from grant grant , , received received

where

where grant grant . . gid gid = = received received . . gid gid ) )

(13)

Creating Mappings

… with the help of Schema Matching

grant gid

amount project

received gid date

funding aid amount proj date

financial aid date amount

0.56 0.95

0.85

0.65 0.40

It uses schema-level and/or instance-level information

(14)

© 2005 Anastasios Kementsietsidis and Renée J. Miller Renée J. Miller 4

QuickTime™ and a decompressor are needed to see this picture.

QuickTime™ and a decompressor are needed to see this picture.

QuickTime™ and a decompressor are needed to see this picture.

QuickTime™ and a decompressor are needed to see this picture.

QuickTime™ and a decompressor are needed to see this picture.

Virtual Integration Architecture

Outside Website Outside Website

Mediator

QuickTime™ and a decompressor are needed to see this picture.

Metadata

QuickTime™ and a decompressor are needed to see this picture.

QuickTime™ and a decompressor are needed to see this picture.

County

Database QuickTime™ and a decompressor

are needed to see this picture.

City

Database Wrapper

QuickTime™ and a decompressor are needed to see this picture.

Image server

Wrapper Wrapper Wrapper

QuickTime™ and a decompressor are needed to see this picture.

QuickTime™ and a decompressor are needed to see this picture.

Mediated Schema

Execution Engine Optimization Engine Reformulation Engine

User Query

QuickTime™ and a decompressor are needed to see this picture.

QuickTime™ and a decompressor are needed to see this picture.

In Data Integration

… And Using Them

© 2005 Anastasios Kementsietsidis and Renée J. Miller Renée J. Miller 3

QuickTime™ and a decompressor are needed to see this picture.

QuickTime™ and a decompressor are needed to see this picture.

QuickTime™ and a decompressor are needed to see this picture.

QuickTime™ and a decompressor are needed to see this picture.

QuickTime™ and a decompressor are needed to see this picture.

QuickTime™ and a decompressor are needed to see this picture.

Data Warehousing Architecture

Outside Website Outside Website

Data Extraction/Forwarding tools

User Query

QuickTime™ and a decompressor are needed to see this picture.

QuickTime™ and a decompressor are needed to see this picture.

Relational Database (Warehouse)

QuickTime™ and a decompressor are needed to see this picture.

County Database

QuickTime™ and a decompressor are needed to see this picture.

City

Database

QuickTime™ and a decompressor are needed to see this picture.

Image server

In Data Exchange

(15)

Data Integration vs. Exchange

• Data Integration

– Global schema is a reconciled virtual view of heterogeneous sources – Uses GAV or LAV mappings

– No constraints in the global schema are considered

– Query is answered using source data; integration is virtual – Answer is set of tuples in query result on ALL possible target

instances: certain answers

• Data Exchange

– Global schema is an independently created local source schema – Uses GLAV mappings

– Considers the presence of constraints

– Query is answered using ONE materialized target

– Can single target give same information as source(s)?

(16)

Something Slightly Different…

Data Mappings and Data Sharing

Useful in environments where:

• Sources are unwilling to share schemas

• The schema of one source cannot be expressed as a view of another

• There is a need to map (data) vocabularies

gid name 001 NF1 002 NGFR 003 NEU1

gid pid 001 101 002 102 003 103 003 104

“encodes for “ encodes for” ”

pid name

101 Neurofibromin 102 p75 ICD

103 Sialidase 1 104 G9 Sialidase

GDB GDB

gene ( gid , name ) gene gene ( ( gid gid , , name name ) )

SwissProt SwissProt protein ( pid , name ) protein

protein ( ( pid pid , , name name ) )

This is a mapping table

This is a

mapping table

(17)

Data Sharing Architecture

I have info for gene APEH I have I have info for info for

gene APEH

gene APEH gene

NF1

APEH …

… article id …

96281126 99455262 …

I have article 96281126 I have I have article article

96281126

96281126 I want info for

protein AARE I want I want info for info for

protein AARE protein AARE

protein …

AARE

G9 Sialidase

z z Establish mapping tables between the vocabularies of different Establish mapping tables between the vocabularies of different sources.

sources.

z z Use these tables to translate query requests between the sources Use these tables to translate query requests between the sources

SwissProt SwissProt GDB GDB

UofE Library

UofE Library

Mapping Tables Mapping Tables Mapping

Tables Mapping Tables

Mapping Tables Mapping Tables Mapping

Tables Mapping Tables

Mapping

Tables

Mapping

Tables

(18)

Network Network Network

Management of Tables

The Consistency and Inference problems, the main vehicles for managing mapping tables. Solving these problems allows us to:

– Infer new mapping tables from existing ones

– Augment existing mapping tables with new associations – Validate mapping tables

Consistent?

Consistent?

Mapping Tables Mapping Tables

Mapping Tables Mapping Tables Mapping

Tables Mapping Tables

Mapping Tables Mapping Tables

Inferred Tables Inferred Tables

Inferred Tables Inferred Tables

Source S 1 Source S 1

Source S n

Source S n

(19)

Closing Remarks…

… and things to remember (other than the keywords):

• Integration, one of the oldest problems in database research.

Research is still going strong in this area

• Exchange, an interesting and practical problem (e.g. B2B apps)

• Sharing, the latest twist in the integration problem, also of practical importance (e.g. in P2P apps)

Disclaimer:

This talk provides only a glimpse of the research issues in these areas.

Note: If you find any of these interesting, TALK to us…

We are in Appleton Tower, 2nd Floor

(20)

Questions!?

(21)

GAV vs. LAV

• GAV: Q S (S) ⊆ R G ,

where R G is a relation in G, Q S is a query on S

• LAV: R S ⊆ Q G (G),

where R s is a relation in S, Q G is a query on G

Q S/G

Data Data

Local Source Schema S Local Source

Schema S

Global Integrated Schema G Global Integrated

Schema G

Data (?) Data (?)

User Query

(22)

Query Answering

GAV uses Query/View Unfolding

create

create view view funding funding ( ( aid aid , , amount amount , project, , project, date date ) as ) as

( select ( select grant grant . . gid gid , , grant grant . . amount amount , , grant grant . . project project , , received received . . date date from from grant grant , , received received

where

where grant grant . . gid gid = = received received . . gid gid ) ) select

select project project from from funding funding where

where amount amount > > $45.000 $45.000

select

select project project

from from grant, received grant, received where

where grant.gid grant.gid = received.gid = received.gid AND AND grant.

grant.amount amount > > $45.000 $45.000

What about LAV?

It uses a method called Query Rewriting (not presented here)

What about LAV?

It uses a method called

Query Rewriting

(not presented here)

References

Related documents

1. Instagram merupakan media yang paling banyak digunakan oleh responden baik untuk melakukan maupun menggunakan e-WOM sesudah maupun sebelum melakukan

A cross sectional study was conducted from October 2010/11 to April 2011/12 at Hashim Nur’s Ethiopian livestock and meat export abattoir, with the aim to

Shipman , 593 S.E.2d at 324–25 (contending that the continuous representation rule takes into account “the special trust and confidence inherent in the attorney-client

Rob Rasmussen of The Transtec Group will discuss the pavement surface characteristics work ongoing at the National CP Tech Center including a link to two-lift concrete

• Test Negative – Taken to your home in a government bus or a designated hotel for centralized quarantine if not eligible for home quarantine.. Please note, private car pickup is

• Shipboard equipment details (Make; Type; Serial number; Model); • Software details (Previous software version, Current software. version, Last update applied, Last

This activity was supported by the United States Agency for International Development (USAID) under Cooperative Agreement No. The contents of this document do not necessarily

A thesis statement in an essay is a sentence that explicitly identifies the purpose of the paper or previews its main ideas.. A thesis statement is an assertion, not a statement of