Data Integration
Alon Halevy Google Inc. University of Aalborg September, 2007Introduction
What is Data Integrationand Why is it Hard?
DBMS: it’s all about abstraction
Logical vs. Physical; What vs. How.
SSN Name Category 123-45-6789 Charles undergrad 234-56-7890 Dan grad … … SSN CID 123-45-6789 CSE444 123-45-6789 CSE444 234-56-7890 CSE142 … Students: Takes:
CID Name Quarter
CSE444 Databases fall
CSE541 Operating systems winter Courses:
SELECT C.name
FROM Students S, Takes T, Courses C
WHERE S.name=“Mary” and S.ssn = T.ssn and T.cid = C.cid
Data Integration:
A Higher-level Abstraction
Mediated Schema Query S1 S2 S3 SSN Name Category 123-45-6789 Charles undergrad 234-56-7890 Dan grad … … SSN CID 123-45-6789 CSE444 123-45-6789 CSE444 234-56-7890 CSE142 …CID Name Quarter CSE444 Databases fall CSE541 Operating systems winter
… …
Semantic Mappings
Independence of: • source & location • data model, syntax • semantic variations • …
<cd> <title> The best of … </title> <artist> Carreras </artist> <artist> Pavarotti </artist> <artist> Domingo </artist> <price> 19.95 </price> </cd>
Application Area 1: Business
Single Mediated View
Legacy Databases
Services and Applications Enterprise Databases CRM ERP Portals … EII Apps:
50% of all IT $$$ spent here!
Application Area 2: Science
OMIM HUGO Swiss-Prot GO
Gene-Clinics Locus-Link Entrez GEO Sequenceable
Entity Gene
Phenotype VocabularyStructured Experiment
Protein NucleotideSequence ExperimentMicroarray
Hundreds of biomedical data sources available; growing rapidly!
Application Area 3: The Web
FullServe Corporation
Employees Training Sales Resumes Services HelpLine FullTimeEmp Hire TempEmployees Courses Enrollments Products Sales Interview CV Services Customers Contracts CallsEuroCard Corporation
Employees Credit Cards Resumes HelpLine Employees Hire Customer CustDetail Interview CallsHeterogeneity 101
Employees FullTimeEmp ssn, empId, firstName middleName, lastName HireempId, hireDate, recruiter TempEmployees ssn, hireStart, hireEnd Employees Employees ID, firstNameMiddleInitial, lastName Hire
ID, hireDate, recruiter
Find all employees (making over $100K)
Customer Call Center
Sales Services Products Sales Services Customers Contracts Credit Cards Customer CustDetail Customer: I lost my credit card!!
Agent: would you like to buy an espresso machine?
Challenges & Opportunities
Create a (useful) web site for tracking servicesCollaborate with third parties
E.g., create branded services
Comply with government regulations
Find “risky” employees
Business intelligence
The Deep Web
Millions of (good) forms out there Each form has its own special interface Hard to explore data across sites.
Goal (for some domains):
A single interface into a multitude of
deep-web sources.
The Semantic Web
Knowledge sharing at web scale.Web resources are described by ontologies: Rich domain models; allow reasoning. RDF/OWL are the emerging standards
OWL-lite may actually be useful. Issues:
Too complex for users? Killer apps? Scalability of reasoning?
Proposal: let’s build the SW bottom up.
Goal of Data Integration
Uniform query access to a set of data sourcesHandle:
Scale of sources: from tens to millions
Heterogeneity
Autonomy
Why is it Hard?
Systems-level reasons:
Managing different platforms
SQL across multiple systems is not so simple Distributed query processing
Logical reasons:
Schema (and data) heterogeneity
‘Social’ reasons:
Locating and capturing relevant data in the enterprise. Convincing people to share (data fiefdoms)
Security, privacy and performance implications.
Setting Expectations
Data integration is AI-Complete.Completely automated solutions unlikely.
Goal 1:
Reduce the effort needed to set up an
integration application.
Goal 2:
Enable the system to perform gracefully
with uncertainty (e.g., see the web)
Terminology
Relations and Queries
Relational Terminology
Relational schemasTables, attributes
Relation instances
Sets (or multi-sets) of tuples
Integrity constraints
Keys, foreign keys, inclusion dependencies
Hitachi Household $203.99 MultiTouch Canon Photography $149.99 SingleTouch GizmoWorks Gadgets $29.99 Powergizmo GizmoWorks Gadgets $19.99 Gizmo Manufacturer Category Price PName Product Attribute names
Table/relation name
SQL (very basic)
Interview:
candidate, date, recruiter, hireDecision, grade EmployeePerf:
empID, name, reviewQuarter, grade, reviewer select recruiter, candidate from Interview, EmployeePerf where recuiter=name AND grade < 2.5
SQL (w/aggregation)
EmployeePerf:empID, name, reviewQuarter, grade, reviewer select reviewer, Avg(grade) from EmployeePerf where reviewQuarter=“1/2007”
Conjunctive Queries
Q(R,C) :- Interview(X,D,Y,H,F), EmployeePerf(E,Y,T,W,Z), W < 2.5.select recruiter, candidate from Interview, EmployeePerf where recuiter=name AND grade < 2.5
Unions of Conjunctive Queries
Q(R,C) :- Interview(X,D,Y,H,F), EmployeePerf(E,Y,T,W,Z), W < 2.5. Q(R,C) :- Interview(X,D,Y,H,F), EmployeePerf(E,Y,T,W,Z), Manager(y), W > 3.9.Datalog (recursion)
Path(X,Y) :- edge(X,Y)Path(X,Y) :- edge(X,Z), path(Z,Y)
Database: edge(X,Y)
Warmup Exercise
Virtual Data Integration
Architecture
wrapper wrapper wrapper wrapper wrapper Mediated Schema Source Descriptions Cinemas: place, movie, start Reviews: title, date grade, review Movies: name, actors, director, genre Cinemas in NYC: cinema, title, startTime Cinemas in SF: location, movie, startingTime
Movie: Title, director, year, genre Actors: title, actor
Plays: movie, location, startTime Reviews: title, rating, description
S1 S2 S3 S4 S5
Wrappers
<cd> <title> The best of … </title> <artist> Abiteboul </artist> <artist> Pavarotti </artist> <artist> Domingo </artist> <price> 19.95 </price> </cd>
…
•Lixto [Vienna] •Fetch [ISI]
•XQWrap [Georgia Tech] •Wrapper induction [Dublin]
Mediation Languages
Books Title ISBN Price DiscountPrice Edition CDs Album ASIN Price DiscountPrice Studio BookCategories ISBN Category CDCategories ASIN Category Artists ASIN ArtistName GroupName Authors ISBN FirstName LastNameCD: ASIN, Title, Genre,… Artist: ASIN, name, …
Mediated Schema
logic
Execution engine
wrapper wrapper wrapper wrapper wrapper Query optimizer
Query reformulation Query
Logical query plan
Physical query plan Replanning request
Query Processing
Woody Allen Comedies in NY
select title, startTime from Movie, Plays
where Movie.title=Plays.movie AND location=“New York” AND
Movie: Title, director, year, genre Actors: title, actor
Plays: movie, location, startTime Reviews: title, rating, description
Cinemas: place, movie, start Reviews: title, date grade, review Movies: name, actors, director, genre Cinemas in NYC: cinema, title, startTime Cinemas in SF: location, movie, startingTime
Movie: Title, director, year, genre Actors: title, actor
Plays: movie, location, startTime Reviews: title, rating, description
S1 S2 S3 S4 S5
select title, startTime from Movie, Plays
where Movie.title=Plays.movie AND location=“New York” AND director=“Woody Allen”
Outline (1)
Mediation languages:theoretical foundations (containment,
answering queries using views)
Creating schema mappings Query processing
Adaptive query processing
XML and its role in data integration
Outline (2)
Other architectures for data integration
Warehousing, data exchange, p2p dbms
Advanced (i.e., current) topics:
Dataspaces
The deep web