• No results found

Data Integration and Network Marketing

N/A
N/A
Protected

Academic year: 2021

Share "Data Integration and Network Marketing"

Copied!
7
0
0

Loading.... (view fulltext now)

Full text

(1)

Data Integration

Alon Halevy Google Inc. University of Aalborg September, 2007

Introduction

What is Data Integration

and Why is it Hard?

DBMS: it’s all about abstraction

Logical vs. Physical; What vs. How.

SSN Name Category 123-45-6789 Charles undergrad 234-56-7890 Dan grad … … SSN CID 123-45-6789 CSE444 123-45-6789 CSE444 234-56-7890 CSE142 … Students: Takes:

CID Name Quarter

CSE444 Databases fall

CSE541 Operating systems winter Courses:

SELECT C.name

FROM Students S, Takes T, Courses C

WHERE S.name=“Mary” and S.ssn = T.ssn and T.cid = C.cid

Data Integration:

A Higher-level Abstraction

Mediated Schema Query S1 S2 S3 SSN Name Category 123-45-6789 Charles undergrad 234-56-7890 Dan grad … … SSN CID 123-45-6789 CSE444 123-45-6789 CSE444 234-56-7890 CSE142 …

CID Name Quarter CSE444 Databases fall CSE541 Operating systems winter

… …

Semantic Mappings

Independence of: • source & location • data model, syntax • semantic variations • …

<cd> <title> The best of … </title> <artist> Carreras </artist> <artist> Pavarotti </artist> <artist> Domingo </artist> <price> 19.95 </price> </cd>

Application Area 1: Business

Single Mediated View

Legacy Databases

Services and Applications Enterprise Databases CRM ERP Portals … EII Apps:

50% of all IT $$$ spent here!

Application Area 2: Science

OMIM HUGO Swiss-Prot GO

Gene-Clinics Locus-Link Entrez GEO Sequenceable

Entity Gene

Phenotype VocabularyStructured Experiment

Protein NucleotideSequence ExperimentMicroarray

Hundreds of biomedical data sources available; growing rapidly!

(2)

Application Area 3: The Web

FullServe Corporation

Employees Training Sales Resumes Services HelpLine FullTimeEmp Hire TempEmployees Courses Enrollments Products Sales Interview CV Services Customers Contracts Calls

EuroCard Corporation

Employees Credit Cards Resumes HelpLine Employees Hire Customer CustDetail Interview Calls

Heterogeneity 101

Employees FullTimeEmp ssn, empId, firstName middleName, lastName Hire

empId, hireDate, recruiter TempEmployees ssn, hireStart, hireEnd Employees Employees ID, firstNameMiddleInitial, lastName Hire

ID, hireDate, recruiter

Find all employees (making over $100K)

Customer Call Center

Sales Services Products Sales Services Customers Contracts Credit Cards Customer CustDetail Customer: I lost my credit card!!

Agent: would you like to buy an espresso machine?

Challenges & Opportunities

Create a (useful) web site for tracking services

Collaborate with third parties

E.g., create branded services

Comply with government regulations

Find “risky” employees

Business intelligence

(3)

The Deep Web

Millions of (good) forms out there Each form has its own special interface

 Hard to explore data across sites.

Goal (for some domains):

A single interface into a multitude of

deep-web sources.

The Semantic Web

Knowledge sharing at web scale.

Web resources are described by ontologies: Rich domain models; allow reasoning. RDF/OWL are the emerging standards

OWL-lite may actually be useful. Issues:

Too complex for users? Killer apps? Scalability of reasoning?

Proposal: let’s build the SW bottom up.

Goal of Data Integration

Uniform query access to a set of data sources

Handle:

Scale of sources: from tens to millions

Heterogeneity

Autonomy

(4)

Why is it Hard?

Systems-level reasons:

Managing different platforms

SQL across multiple systems is not so simple Distributed query processing

Logical reasons:

Schema (and data) heterogeneity

‘Social’ reasons:

Locating and capturing relevant data in the enterprise. Convincing people to share (data fiefdoms)

Security, privacy and performance implications.

Setting Expectations

Data integration is AI-Complete.

Completely automated solutions unlikely.

Goal 1:

Reduce the effort needed to set up an

integration application.

Goal 2:

Enable the system to perform gracefully

with uncertainty (e.g., see the web)

Terminology

Relations and Queries

Relational Terminology

Relational schemas

Tables, attributes

Relation instances

Sets (or multi-sets) of tuples

Integrity constraints

Keys, foreign keys, inclusion dependencies

Hitachi Household $203.99 MultiTouch Canon Photography $149.99 SingleTouch GizmoWorks Gadgets $29.99 Powergizmo GizmoWorks Gadgets $19.99 Gizmo Manufacturer Category Price PName Product Attribute names

Table/relation name

SQL (very basic)

Interview:

candidate, date, recruiter, hireDecision, grade EmployeePerf:

empID, name, reviewQuarter, grade, reviewer select recruiter, candidate from Interview, EmployeePerf where recuiter=name AND grade < 2.5

(5)

SQL (w/aggregation)

EmployeePerf:

empID, name, reviewQuarter, grade, reviewer select reviewer, Avg(grade) from EmployeePerf where reviewQuarter=“1/2007”

Conjunctive Queries

Q(R,C) :- Interview(X,D,Y,H,F), EmployeePerf(E,Y,T,W,Z), W < 2.5.

select recruiter, candidate from Interview, EmployeePerf where recuiter=name AND grade < 2.5

Unions of Conjunctive Queries

Q(R,C) :- Interview(X,D,Y,H,F), EmployeePerf(E,Y,T,W,Z), W < 2.5. Q(R,C) :- Interview(X,D,Y,H,F), EmployeePerf(E,Y,T,W,Z), Manager(y), W > 3.9.

Datalog (recursion)

Path(X,Y) :- edge(X,Y)

Path(X,Y) :- edge(X,Z), path(Z,Y)

Database: edge(X,Y)

Warmup Exercise

Virtual Data Integration

Architecture

(6)

wrapper wrapper wrapper wrapper wrapper Mediated Schema Source Descriptions Cinemas: place, movie, start Reviews: title, date grade, review Movies: name, actors, director, genre Cinemas in NYC: cinema, title, startTime Cinemas in SF: location, movie, startingTime

Movie: Title, director, year, genre Actors: title, actor

Plays: movie, location, startTime Reviews: title, rating, description

S1 S2 S3 S4 S5

Wrappers

<cd> <title> The best of … </title> <artist> Abiteboul </artist> <artist> Pavarotti </artist> <artist> Domingo </artist> <price> 19.95 </price> </cd>

•Lixto [Vienna] •Fetch [ISI]

•XQWrap [Georgia Tech] •Wrapper induction [Dublin]

Mediation Languages

Books Title ISBN Price DiscountPrice Edition CDs Album ASIN Price DiscountPrice Studio BookCategories ISBN Category CDCategories ASIN Category Artists ASIN ArtistName GroupName Authors ISBN FirstName LastName

CD: ASIN, Title, Genre,… Artist: ASIN, name, …

Mediated Schema

logic

Execution engine

wrapper wrapper wrapper wrapper wrapper Query optimizer

Query reformulation Query

Logical query plan

Physical query plan Replanning request

Query Processing

Woody Allen Comedies in NY

select title, startTime from Movie, Plays

where Movie.title=Plays.movie AND location=“New York” AND

Movie: Title, director, year, genre Actors: title, actor

Plays: movie, location, startTime Reviews: title, rating, description

(7)

Cinemas: place, movie, start Reviews: title, date grade, review Movies: name, actors, director, genre Cinemas in NYC: cinema, title, startTime Cinemas in SF: location, movie, startingTime

Movie: Title, director, year, genre Actors: title, actor

Plays: movie, location, startTime Reviews: title, rating, description

S1 S2 S3 S4 S5

select title, startTime from Movie, Plays

where Movie.title=Plays.movie AND location=“New York” AND director=“Woody Allen”

Outline (1)

Mediation languages:

theoretical foundations (containment,

answering queries using views)

Creating schema mappings Query processing

Adaptive query processing

XML and its role in data integration

Outline (2)

Other architectures for data integration

Warehousing, data exchange, p2p dbms

Advanced (i.e., current) topics:

Dataspaces

The deep web

References

Related documents

Register online by 11 June at www.ibanet.org/conferences/conf546.aspx and make payment by credit card, to avail of the ten per cent online registration discount or complete

In order to combat the subpar job that carriers do in helping to optimize wireless bills, and the growing cost of wireless bills in general, organizations are turning to MMS to

Awards serve as incentives along many dimensions because they entail a number of different benefits for the recipient such as status, recognition, current and future material

Additionally, based on results from other studies in chronic illnesses we expect that longer time since diagnosis will be positively associated with illness coherence and

Clinical Position: Medical Director of Anesthesiology, Dayton Children’s Hospital, Dayton,.

A more interesting auction would occur if we allowed a second price auction for each individual slot but introduced the concept of “bump-up.” Because the valuations are ordered

According to the view, in postwar Japan, the main bank relationship has contributed to efficient management of borrower firms in place of the capital market that disciplines

The existence of the childbed mortality data published in the Scots Magazine indicates, first, the enthusiasm of the editors to enlighten, by sharing the need for improvement