Functional Dependency Generation and Applications in Pay As You Go Data Integration Systems


Academic year: 2021


(1)

Functional Dependency Generation and Applications in Pay-As-You-Go Data Integration Systems

Daisy Zhe Wang, Luna Dong, Anish Das Sarma, Michael J. Franklin, and Alon Halevy

UC Berkeley, AT&T Research, Stanford University, and Google Inc.

(2)

Web-scale Structured Data

• HTML tables extracted from the Web
• Database views in the Deep Web, accessed through HTML forms on the Web
• Relations generated by information extraction from web pages

Example source text:

For years, Microsoft Corporation CEO Bill Gates was against open source. But today he appears to have changed his mind. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

Extracted relation:

Name              Title     Organization
Bill Gates        CEO       Microsoft
Bill Veghte       VP        Microsoft
Richard Stallman  Founder   Free Soft..

(3)

A Typical Data Integration System

[Figure: a Mediated Schema (G) connected through Semantic Mappings (M) to Different Structured Data Sources (S)]

• Data Sources (S): a set of data sources of a specific domain
• Mediated Schema (G): a set of relations and attributes that we wish to expose to users
• Schema Mapping (M): a set of mappings from the attributes in S to the attributes in G
• Query processing
– A user query over G is reformulated into multiple queries over S using M
– Results are retrieved from multiple data sources and combined
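The query-processing steps above can be illustrated with a minimal sketch. It assumes each source is a list of row dictionaries and each mapping in M is a dictionary from a source attribute name to a mediated attribute name. The function names `reformulate` and `answer`, and the sample data, are hypothetical and only serve the illustration.

```python
def reformulate(query_attrs, mapping):
    """Translate mediated-schema attribute names into one source's names.

    `mapping` goes from source attribute to mediated attribute (one entry of M).
    Returns None when the source cannot supply every requested attribute.
    """
    inverse = {g_attr: s_attr for s_attr, g_attr in mapping.items()}
    if not all(attr in inverse for attr in query_attrs):
        return None
    return [inverse[attr] for attr in query_attrs]


def answer(query_attrs, sources, mappings):
    """Run a projection query over G against every source and combine results."""
    results = []
    for name, rows in sources.items():
        source_attrs = reformulate(query_attrs, mappings[name])
        if source_attrs is None:
            continue  # this source does not cover the query
        for row in rows:
            results.append(tuple(row[attr] for attr in source_attrs))
    return results


# Hypothetical sources (S) and mappings (M) for a bibliography domain.
sources = {
    "s1": [{"author": "Alice", "paper title": "Paper X"}],
    "s2": [{"authors": "Bob", "title": "Paper Y"}],
    "s3": [{"venue": "Workshop Z"}],
}
mappings = {
    "s1": {"author": "author", "paper title": "title"},
    "s2": {"authors": "author", "title": "title"},
    "s3": {"venue": "venue"},
}

combined = answer(["author", "title"], sources, mappings)
```

Here the query over G asks for `author` and `title`; sources `s1` and `s2` are queried under their own attribute names, and `s3` is skipped because it maps neither attribute.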

(4)

Data Integration at Web-scale

• A typical data integration solution is impractical for web-scale data
– Too many domains of interest (Web data is about everything)
– Huge number of sources for each domain
– Designing a mediated schema is infeasible
– Data sources are dirty, incomplete, and lack meta-data

• A web-scale data integration system
– Can only afford pay-as-you-go [Franklin et al. 2005]
– Supports automated schema design and mapping
– Provides best-effort services

(5)

Functional Dependency (FD)

Can we use FD theory in some way to automate the massive data integration problem?

• FDs are specified top-down in the database design process as statements of truth about how attributes relate to each other
– FD X → Y holds if and only if each X value is associated with precisely one Y value

• One of Armstrong's Axioms, used in normalization
– Transitivity: if X → Y and Y → Z, then X → Z
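As a minimal sketch of the definition above, the check below tests whether X → Y holds on a table given as a list of row dictionaries. The function name `fd_holds` and the sample rows are hypothetical. The final checks illustrate transitivity on this one table; they are an example, not a proof of the axiom.

```python
def fd_holds(rows, x, y):
    """Return True if each X value is associated with precisely one Y value.

    `x` and `y` are lists of attribute names; `rows` is a list of dicts.
    """
    seen = {}
    for row in rows:
        x_val = tuple(row[attr] for attr in x)
        y_val = tuple(row[attr] for attr in y)
        # setdefault stores the first Y value seen for this X value
        if seen.setdefault(x_val, y_val) != y_val:
            return False
    return True


# Hypothetical sample table.
rows = [
    {"zip": "02108", "city": "Boston", "state": "MA"},
    {"zip": "02101", "city": "Boston", "state": "MA"},
    {"zip": "98101", "city": "Seattle", "state": "WA"},
]

zip_to_city = fd_holds(rows, ["zip"], ["city"])
city_to_state = fd_holds(rows, ["city"], ["state"])
zip_to_state = fd_holds(rows, ["zip"], ["state"])  # follows by transitivity
city_to_zip = fd_holds(rows, ["city"], ["zip"])    # Boston has two zip values
```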

(6)

Probabilistic Functional Dependencies (pFDs)

• Why probabilistic FDs?

• Definition of a probabilistic FD (pFD)
– X → A with probability p, where p is the likelihood that the FD holds in general

• Related work
– TANE [Huhtala et al. 1999]
– CORDS [Ilyas et al. 2004]

• The new challenge: from a single large table to many, potentially incomplete and dirty tables

(7)

Generating pFDs

• Probability of a pFD over a single data source R
– Per-Tuple counting:
– Per-Value counting:

• Probability of a pFD over multiple data sources
– Merge pFDs:
– Merge Data:
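The counting formulas are not given in the text above, so the following is only a sketch under stated assumptions. It assumes that per-tuple counting is the fraction of tuples that agree with the most frequent A value for their X value, and that per-value counting averages that fraction over the distinct X values. It further assumes that "Merge pFDs" takes an unweighted average of per-source probabilities and that "Merge Data" concatenates the sources (with attribute names already aligned) before counting. All function names and sample rows are hypothetical.

```python
from collections import Counter, defaultdict


def _group(rows, x, a):
    """Count A values per X value, skipping rows that miss either attribute."""
    groups = defaultdict(Counter)
    for row in rows:
        if row.get(x) is not None and row.get(a) is not None:
            groups[row[x]][row[a]] += 1
    return groups


def per_tuple(rows, x, a):
    """Assumed per-tuple counting: tuples agreeing with the majority A value."""
    groups = _group(rows, x, a)
    total = sum(sum(c.values()) for c in groups.values())
    if total == 0:
        return None
    return sum(max(c.values()) for c in groups.values()) / total


def per_value(rows, x, a):
    """Assumed per-value counting: majority fraction averaged over X values."""
    groups = _group(rows, x, a)
    if not groups:
        return None
    fractions = [max(c.values()) / sum(c.values()) for c in groups.values()]
    return sum(fractions) / len(fractions)


def merge_pfds(sources, x, a, count=per_tuple):
    """Assumed 'Merge pFDs': average the probabilities computed per source."""
    probs = [p for p in (count(rows, x, a) for rows in sources) if p is not None]
    return sum(probs) / len(probs) if probs else None


def merge_data(sources, x, a, count=per_tuple):
    """Assumed 'Merge Data': pool all tuples, then count once."""
    merged = [row for rows in sources for row in rows]
    return count(merged, x, a)


# Hypothetical sources.
s1 = [
    {"zip": "02101", "city": "Boston"},
    {"zip": "02101", "city": "Boston"},
    {"zip": "02101", "city": "Cambridge"},
    {"zip": "98101", "city": "Seattle"},
]
s2 = [
    {"zip": "60601", "city": "Chicago"},
    {"zip": "60601", "city": "Chicago"},
]
```

Under these assumptions the two merge strategies can disagree: averaging gives each source equal weight, while pooling weights each source by its number of tuples.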

(8)

Results for pFD Generation Algorithms

Number of data sources: 50 to 600

(9)

App1: Normalize Mediated Schema Example (I)

Attributes in the mediated schema of the Bibliography Domain:

author, authors, author(s), paper title, abstract, pages, subject, subjects, key words, year, issn, eissn, journal, journal title, conference, meeting, colloquium, editor, school, venue, location, place, website, date, company, association, position

(10)

App1: Normalize Mediated Schema Example (II)

[Figure: the mediated-schema attributes grouped into normalized relations, including Paper and Journal. Attributes shown: author, authors, author(s), paper title, abstract, pages, subject, subjects, key words, year, issn, eissn, journal, journal title, conference, meeting, colloquium, editor, school, company, association, address, website, position, date, dates]

(11)

Normalizing Mediated Schema

• Prune pFD set
– Prune low-probability pFDs
– Prune pFDs that can be generated by transitivity

• Avoid over-splitting

[Figure: pFD graph over the attributes paper title, author, authors, author(s), issn, journal title, journal, subject, subjects, conference, meeting, colloquium, zip, address, and city, with edge probabilities ranging from 0.9 to 1.0]
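The two pruning steps can be sketched as follows. This is a minimal illustration, not the authors' algorithm. It assumes pFDs are stored as a dictionary from an (X, A) attribute pair to a probability, that "low probability" means below a cutoff (the value 0.9 is hypothetical), and that a pFD X → Z counts as derivable when some Y exists with X → Y and Y → Z both still retained. The function name `prune_pfds` and the sample pFDs are hypothetical.

```python
def prune_pfds(pfds, min_prob=0.9):
    """Drop low-probability pFDs, then pFDs implied by transitivity.

    `pfds` maps (x, a) attribute pairs to probabilities.
    `min_prob` is an assumed cutoff, not a value taken from the slides.
    """
    kept = {edge: p for edge, p in pfds.items() if p >= min_prob}
    attrs = {attr for edge in kept for attr in edge}

    # Check edges in a fixed order against the edges that are still retained,
    # so that two pFDs cannot be used to eliminate each other.
    for x, z in sorted(kept):
        for y in attrs:
            if y in (x, z):
                continue
            if (x, y) in kept and (y, z) in kept:
                del kept[(x, z)]
                break
    return kept


# Hypothetical pFDs.
pfds = {
    ("zip", "city"): 0.95,
    ("city", "state"): 0.97,
    ("zip", "state"): 0.96,   # derivable from the two pFDs above
    ("zip", "name"): 0.30,    # low probability
}

pruned = prune_pfds(pfds)
```

Checking against the retained set, rather than the original one, is a deliberate choice: with synonym attributes such as `author` and `authors` pointing at each other, it keeps at least one path to every dependent attribute.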

(12)

Results for Schema Normalization

(13)

App2: Identify Dirty Data Sources

• Structured data sources from the Web can be dirty

Dummy Values
name    company  email
Alice   IBM      email
Bob     Google   email
Cathy   Yahoo    email
David   MSR      email

Entity Ambiguity
name    country        city
Alice   USA            Boston
Bob     US             Boston
Cathy   u.s.a          Boston
David   United States  Boston

Nested Columns
name    city      country
Alice   Boston    02101,USA
Bob     Seattle   98101,USA
Cathy   Chicago   60601,USA
David   New York  12201,USA

• We report data sources that violate pFDs with high probabilities

Results
– People: 3 out of 3 reported are dirty
– Course: 31 out of 80 reported are dirty (estimated total: 66)
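The reporting step can be sketched as below. This is an illustration under assumptions, not the authors' procedure: a source is reported when a pFD with a high global probability holds only weakly inside that source. The per-source measure used here (fraction of tuples agreeing with the majority value), the two cutoffs, the pFD probability of 0.95, and the names `local_probability` and `report_dirty_sources` are all hypothetical. The first sample source reuses the Entity Ambiguity table from the slide.

```python
from collections import Counter, defaultdict


def local_probability(rows, x, a):
    """Fraction of tuples agreeing with the majority A value for their X value."""
    groups = defaultdict(Counter)
    for row in rows:
        if row.get(x) is not None and row.get(a) is not None:
            groups[row[x]][row[a]] += 1
    total = sum(sum(c.values()) for c in groups.values())
    if total == 0:
        return None
    return sum(max(c.values()) for c in groups.values()) / total


def report_dirty_sources(sources, pfds, min_global=0.9, max_local=0.5):
    """Report sources that violate pFDs which hold with high global probability.

    `min_global` and `max_local` are assumed cutoffs for illustration.
    """
    reported = {}
    for name, rows in sources.items():
        violations = []
        for (x, a), p in pfds.items():
            if p < min_global:
                continue  # only trust high-probability pFDs
            local = local_probability(rows, x, a)
            if local is not None and local <= max_local:
                violations.append((x, a, local))
        if violations:
            reported[name] = violations
    return reported


sources = {
    "people_1": [  # Entity Ambiguity example from the slide
        {"name": "Alice", "country": "USA", "city": "Boston"},
        {"name": "Bob", "country": "US", "city": "Boston"},
        {"name": "Cathy", "country": "u.s.a", "city": "Boston"},
        {"name": "David", "country": "United States", "city": "Boston"},
    ],
    "people_2": [  # hypothetical clean source
        {"name": "Erin", "country": "USA", "city": "Boston"},
        {"name": "Frank", "country": "USA", "city": "Seattle"},
    ],
}
pfds = {("city", "country"): 0.95}  # hypothetical global probability

reported = report_dirty_sources(sources, pfds)
```

In `people_1` the single city value Boston maps to four different spellings of the country, so only one tuple in four agrees with any majority value and the source is reported.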

(14)

Conclusion: FD in Pay-as-you-go

• Web-scale data integration can only afford to pay-as-you-go

• Automation is the key
– Automatically setting up the mediated schema and mapping [Das Sarma et al. 2008]
– Automatically measuring and improving the quality of data integration
  – Measuring quality of data sources
  – Measuring and improving quality of mediated schema, schema mapping, etc.

• FD-based Quality Measuring and Improvement
– Identifying dirty data sources
– Improving mediated schemas

(15)

Future Work

• Automatic mediated schema design for millions of HTML tables
– Domain cluster (clustering over source schema)
– Entity/Relationship cluster (clustering + pFD normalization)
– Attribute cluster (synonyms + string similarity)

• Related Issues
– Scalability
– Visualization
