H.Galhardas SAD 2007/08
Extract, Transform, Load
(ETL)
Overview
• General ETL issues:
– Data staging area (DSA)
– Building dimensions
– Building fact tables
– Extract
– Load
– Transformation/cleaning
• Commercial (and open-source) tools
• The AJAX data cleaning and transformation
framework
DW Phases
• Design phase
  – Modeling, DB design, source selection, …
• Loading phase
  – First load/population of the DW
  – Based on all data in the sources
• Refreshment phase
  – Keep the DW up-to-date w.r.t. source data changes
The ETL Process
• The most underestimated process in DW development
• The most time-consuming process in DW development
  – Often, 80% of development time is spent on ETL
• Extract
  – Extract relevant data
• Transform
  – Transform data to DW format
  – Build keys, etc.
  – Cleansing of data
• Load
  – Load data into the DW
  – Build aggregates, etc.
DW Architecture
[Diagram: data sources (operational DBs, other sources) feed a Monitor & Integrator; Extract, Transform, Load and Refresh move data through the Data Staging area into the Data Warehouse and Data Marts (described by Metadata); an OLAP server serves the data to the front-end tools: query, reports, analysis, data mining.]
DW Architecture – the ETL process
[Diagram: operational DBs and other sources → Extract, Transform, Load, Refresh via the Data Staging area → Data Warehouse.]
Implemented using an ETL tool!
Ex: SQL Server 2005 Integration Services
Data Staging Area
• Transit storage for data underway in the ETL process
  – Transformations/cleansing done here
• No user queries (although some do it)
• Sequential operations (few) on large data volumes
  – Performed by central ETL logic
  – Easily restarted
  – No need for locking, logging, etc.
  – RDBMS or flat files? (DBMSs have become better at this)
• Finished dimensions are copied from the DSA to the relevant marts
ETL construction process
Plan
  1) Make a high-level diagram of the source-destination flow
  2) Test, choose and implement an ETL tool
  3) Outline complex transformations, key generation and job sequence for every destination table
Construction of dimensions
  4) Construct and test the static dimension builds
  5) Construct and test change mechanisms for one dimension
  6) Construct and test the remaining dimension builds
Construction of fact tables and automation
  7) Construct and test the initial fact table build
  8) Construct and test incremental update
  9) Construct and test aggregate build
Building Dimensions
• Static dimension table
  – Assignment of keys: map production keys to DW keys using a table
  – Combination of data sources: find a common key?
• Handling dimension changes
  – Slowly changing dimensions
  – Find the newest DW key for a given production key
  – The table mapping production keys to DW keys must be updated
• Load of dimensions
  – Small dimensions: replace
  – Large dimensions: load only changes
Building fact tables
Two types of load:
• Initial load
  – ETL for all data up till now
  – Done when the DW is started for the first time
  – Often problematic to get correct historical data
  – Very heavy: large data volumes
• Incremental update
  – Move only changes since the last load
  – Done periodically (monthly/weekly/daily/hourly/…) after DW start
  – Less heavy: smaller data volumes
• Dimensions must be updated before facts
  – The relevant dimension rows for new facts must be in place
  – Special key considerations if an initial load must be performed
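The surrogate-key lookup that the incremental fact load performs can be sketched as follows. The table layouts and field names are hypothetical illustrations: incoming fact rows carry production keys, which must be replaced by the newest DW keys from the dimension key map (maintained by the dimension build, which runs first).

```python
# Sketch of an incremental fact load: production keys in incoming fact
# rows are replaced by DW surrogate keys looked up in the dimension
# key map. Layouts are illustrative, not a fixed schema.

def load_facts(incoming_facts, key_map):
    """incoming_facts: dicts with a 'prod_key' and a 'measure'.
    key_map: {production_key: newest_dw_key}, built by the
    dimension load, which must run before this step."""
    loaded, rejected = [], []
    for fact in incoming_facts:
        dw_key = key_map.get(fact["prod_key"])
        if dw_key is None:
            # Dimension row not in place yet -> this fact cannot load.
            rejected.append(fact)
        else:
            loaded.append({"dim_key": dw_key, "measure": fact["measure"]})
    return loaded, rejected

facts = [{"prod_key": "P1", "measure": 10},
         {"prod_key": "P9", "measure": 5}]
dims = {"P1": 1001}
ok, bad = load_facts(facts, dims)
```

Rejected rows make the dependency explicit: a fact whose dimension row is missing cannot be loaded, which is why dimensions are always built first.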
Types of data sources
• Non-cooperative sources
  – Snapshot sources: provide only a full copy of the source
  – Specific sources: each is different, e.g., legacy systems
  – Logged sources: write a change log (DB log)
  – Queryable sources: provide a query interface, e.g., SQL
• Cooperative sources
  – Replicated sources: publish/subscribe mechanism
  – Call-back sources: call external code (ETL) when changes occur
  – Internal action sources: only internal actions when changes occur (DB triggers are an example)
The extract strategy depends heavily on the source types
Extract phase
Goal: fast extract of relevant data
• Extracts from source systems can take a long time
Types of extracts:
• Extract applications (SQL): co-existence with other applications
• DB unload tools: much faster than SQL-based extracts
• Extract applications are sometimes the only solution
Often too time consuming to ETL everything
  – Extracts can take days/weeks
  – Drain on the operational systems
  – Drain on DW systems
Computing deltas
• Much faster to "ETL" only the changes since the last load
• A number of methods can be used
• Store sorted total extracts in the DSA
  – The delta can easily be computed from the current + last extract
  + Always possible
  + Handles deletions
  - Does not reduce extract time
• Put an update timestamp on all rows
  – Updated by a DB trigger
  – Extract only where "timestamp > time of last extract"
  + Reduces extract time
  +/- Less operational overhead
  - Cannot (alone) handle deletions
  - The source system must be changed
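The first method, comparing the current total extract against the last one stored in the DSA, can be sketched like this (rows are keyed records; the format is illustrative):

```python
# Sketch: compute the delta between the last and the current total
# extract kept in the DSA. Rows are {key: value} entries; comparing
# the two snapshots yields inserts, updates and deletes.

def compute_delta(last, current):
    """last/current: dicts {key: row}. Returns (inserts, updates, deletes)."""
    inserts = {k: v for k, v in current.items() if k not in last}
    deletes = {k: v for k, v in last.items() if k not in current}
    updates = {k: v for k, v in current.items()
               if k in last and last[k] != v}
    return inserts, updates, deletes

last = {1: "a", 2: "b", 3: "c"}
curr = {1: "a", 2: "B", 4: "d"}
ins, upd, dele = compute_delta(last, curr)
```

Note how deletions fall out naturally (key 3 above), which is exactly the "+ Handles deletions" point: a snapshot comparison sees rows that vanished, whereas a timestamp column on its own cannot.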
Load (1)
Goal: fast loading into the DW
  – Loading deltas is much faster than a total load
• SQL-based update is slow
  – Large overhead (optimization, locking, etc.) for every SQL call
  – DB load tools are much faster
  – Some load tools can also perform UPDATEs
• Indexes on tables slow the load a lot
  – Drop indexes and rebuild them after the load
  – Can be done per partition
• Parallelization
  – Dimensions can be loaded concurrently
  – Fact tables can be loaded concurrently
  – Partitions can be loaded concurrently
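Because dimension tables are independent of each other, their loads can run in parallel. A minimal sketch, where `load_dimension` is a hypothetical stand-in for the real bulk-loader call:

```python
# Sketch: independent dimension tables loaded concurrently.
# load_dimension is a placeholder for the actual loader invocation.
from concurrent.futures import ThreadPoolExecutor

def load_dimension(name, rows):
    # A real ETL job would call the DB bulk loader here.
    return name, len(rows)

dimensions = {"customer": ["c1", "c2"], "product": ["p1"],
              "date": ["d1", "d2", "d3"]}

with ThreadPoolExecutor(max_workers=3) as pool:
    results = dict(pool.map(lambda item: load_dimension(*item),
                            dimensions.items()))
```

The same pattern applies to fact-table partitions: anything with no load-order dependency is a candidate for concurrent loading.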
Load (2)
• Relationships in the data
  – Referential integrity must be ensured
  – Can be done by the loader
• Aggregates
  – Must be built and loaded at the same time as the detail data
  – Today, RDBMSes can often do this
• Load tuning
  – Load without log
  – Sort the load file first
  – Make only simple transformations in the loader
  – Use loader facilities for building aggregates
  – Use the loader within the same database
• Should the DW be on-line 24×7?
  – Use partitions or several sets of tables
Overview
• General ETL issues:
– Data staging area (DSA)
– Building dimensions
– Building fact tables
– Extract
– Load
– Transformation/cleaning
• Commercial (and open-source) tools
• The AJAX data cleaning and transformation
framework
Data Cleaning
Activity of converting source data into target data without errors, duplicates, and inconsistencies, i.e., cleaning and transforming to get…
High-quality data!
Why Data Cleaning and Transformation?
Data in the real world is dirty:
• incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  – e.g., occupation=""
• noisy: containing errors or outliers (spelling, phonetic and typing errors, word transpositions, multiple values in a single free-form field)
  – e.g., Salary="-10"
• inconsistent: containing discrepancies in codes or names (synonyms and nicknames, prefix and suffix variations, abbreviations, truncation and initials)
  – e.g., Age="42", Birthday="03/07/1997"
  – e.g., rating was "1,2,3", now rating is "A, B, C"
  – e.g., discrepancies between duplicate records
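The three kinds of dirtiness can be flagged mechanically. A minimal sketch using the examples above; the field names and the reference date are illustrative assumptions:

```python
# Sketch: flag incomplete, noisy and inconsistent records, using the
# slide's examples (empty occupation, negative salary, age that does
# not agree with the birthday). Rules are illustrative only.
import datetime

def audit(record, today=datetime.date(2007, 1, 1)):
    problems = []
    if not record.get("occupation"):            # incomplete: occupation=""
        problems.append("incomplete")
    if record.get("salary", 0) < 0:             # noisy: Salary="-10"
        problems.append("noisy")
    birth = record.get("birthday")
    if birth is not None:
        derived_age = today.year - birth.year   # inconsistent: age vs birthday
        if abs(derived_age - record.get("age", derived_age)) > 1:
            problems.append("inconsistent")
    return problems

rec = {"occupation": "", "salary": -10, "age": 42,
       "birthday": datetime.date(1997, 3, 7)}
issues = audit(rec)
```

A record born in 1997 cannot be 42 in 2007, so the age/birthday pair is flagged as inconsistent alongside the empty and negative fields.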
Why Is Data Dirty?
• Incomplete data comes from:
  – data values not available when collected
  – different criteria between the time the data was collected and the time it is analyzed
  – human/hardware/software problems
• Noisy data comes from:
  – data collection: faulty instruments
  – data entry: human or computer errors
  – data transmission
• Inconsistent (and redundant) data comes from:
  – different data sources, hence non-uniform naming conventions/data codes
  – functional dependency and/or referential integrity violations
Why Is Data Cleaning Important?
• A data warehouse needs consistent integration of quality data
  – Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse
• No quality data, no quality decisions!
  – Quality decisions must be based on quality data (e.g., duplicate or missing data may cause incorrect or even misleading statistics)
Types of data cleaning
• Conversion, parsing and normalization
  – Text coding, date formats, etc.
  – The most common type of cleansing
• Special-purpose cleansing
  – Normalize spellings of names, addresses, etc.
  – Remove duplicates, e.g., duplicate customers
• Domain-independent cleansing
  – Approximate, "fuzzy" joins on not-quite-matching keys
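Conversion and normalization, the most common kind of cleansing, can be sketched as follows. The target formats chosen here (ISO dates, lower-cased whitespace-normalized names) and the accepted input formats are illustrative assumptions:

```python
# Sketch: conversion/normalization cleansing. Try a list of known
# date formats and emit ISO dates; collapse whitespace and case in
# names. Formats and conventions are illustrative choices.
import datetime

DATE_FORMATS = ("%d/%m/%Y", "%Y-%m-%d", "%m-%d-%Y")

def normalize_date(text):
    for fmt in DATE_FORMATS:
        try:
            return datetime.datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # unparseable: leave for exception handling

def normalize_name(name):
    return " ".join(name.split()).lower()
```

Unparseable values are returned as `None` rather than guessed at, so a later step can route them to an exceptions table instead of silently loading bad data.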
Data quality vs cleaning
• Data quality = Data cleaning +
  – Data enrichment
    • enhancing the value of internally held data by appending related attributes from external sources (for example, consumer demographic attributes or geographic descriptors)
  – Data profiling
    • analysis of data to capture statistics (metadata) that provide insight into the quality of the data and aid in the identification of data quality issues
  – Data monitoring
    • deployment of controls to ensure ongoing conformance of data to the business rules that define data quality for the organization
  – Data stewards responsible for data quality
  – DW-controlled improvement
  – Source-controlled improvement
Overview
• General ETL issues:
– Data staging area (DSA)
– Building dimensions
– Building fact tables
– Extract
– Load
– Transformation/cleaning
• Commercial (and open-source) tools
• The AJAX data cleaning and transformation
framework
ETL tools
• ETL tools from the big vendors, e.g.,
  – Oracle Warehouse Builder
  – IBM DB2 Warehouse Manager
  – Microsoft Integration Services
• Offer much functionality at a reasonable price (included…)
  – Data modeling
  – ETL code generation
  – Scheduling DW jobs
  – …
• Many others
  – Hundreds of tools
  – Often specialized tools for certain jobs (insurance cleansing, …)
• The "best" tool does not exist
  – Choose based on your own needs
ETL and data quality tools
• http://www.etltool.com/
• Magic Quadrant for Data Quality Tools, 2007
• Some open-source ETL tools:
  – Talend
  – Enhydra Octopus
  – Clover.ETL
• Not so many open-source quality/cleaning tools
Application context
• Integrate data from different sources
  – E.g., populating a DW from different operational data stores
• Eliminate errors and duplicates within a single source
  – E.g., duplicates in a file of customers
• Migrate data from a source schema into a different fixed target schema
  – E.g., discontinued application packages
• Convert poorly structured data into structured data
The AJAX data transformation
and cleaning framework
Motivating example (1)
DirtyData(paper: String)
  → Data Cleaning & Transformation →
Events(eventKey, name)
Publications(pubKey, title, eventKey, url, volume, number, pages, city, month, year)
Authors(authorKey, name)
Motivating example (2)
DirtyData:
[1] Dallan Quass, Ashish Gupta, Inderpal Singh Mumick, and Jennifer Widom. Making Views Self-Maintainable for Data Warehousing. In Proceedings of the Conference on Parallel and Distributed Information Systems. Miami Beach, Florida, USA, 1996
[2] D. Quass, A. Gupta, I. Mumick, J. Widom, Making views self-maintianable for data warehousing, PDIS'95
  → Data Cleaning & Transformation →
Events: PDIS | Conference on Parallel and Distributed Information Systems
Publications: QGMW96 | Making Views Self-Maintainable for Data Warehousing | PDIS | null | null | null | null | Miami Beach | Florida, USA | 1996
Authors: DQua | Dallan Quass
         AGup | Ashish Gupta
         JWid | Jennifer Widom
         …
PubsAuthors: QGMW96 | DQua
             QGMW96 | AGup
             …
Modeling a data cleaning process
A data cleaning process is modeled by a directed acyclic graph of data transformations
[Diagram: Extraction splits DirtyData into DirtyAuthors, DirtyTitles, DirtyEvents, …; Standardization (using Cities and Tags reference data) and Formatting normalize them; Duplicate Elimination then produces Authors, etc.]
Existing technology
• Ad-hoc programs written in a programming language like C or Java, or using an RDBMS proprietary language
  – Programs difficult to optimize and maintain
• Data transformation scripts using an ETL (Extraction-Transformation-Loading) or a data quality tool
Problems of ETL and data quality solutions (1)
The semantics of some data transformations is defined in terms of their implementation algorithms
[Diagram: the same data cleaning transformations are re-implemented for each application domain.]
Problems of ETL and data quality solutions (2)
There is a lack of interactive facilities to tune a data cleaning application program
[Diagram: dirty data enters the cleaning process, which produces clean data plus rejected data that must be inspected and fed back.]
AJAX features
• An extensible data quality framework
  – Logical operators as extensions of relational algebra
  – Physical execution algorithms
• A declarative language for logical operators
  – SQL extension
• A debugger facility for tuning a data cleaning application program
Logical level: parametric operators
• View: arbitrary SQL query
• Map: iterator-based one-to-many mapping with arbitrary user-defined functions
• Match: iterator-based approximate join
• Cluster: uses an arbitrary clustering function
• Merge: extends SQL group-by with user-defined aggregate functions
• Apply: executes an arbitrary user-defined algorithm
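The Merge operator above, a group-by whose aggregate is an arbitrary user-defined function, can be sketched like this. The "pick the longest string as the canonical value" aggregate is an illustrative choice, not AJAX's fixed behavior:

```python
# Sketch of the Merge operator: group-by extended with a user-defined
# aggregate. Here the aggregate picks the longest name in a cluster
# of duplicates as the canonical value (an illustrative choice).
from itertools import groupby

def merge(rows, key, aggregate):
    """rows: list of (cluster_key, value); one output tuple per group."""
    rows = sorted(rows, key=key)
    return [(k, aggregate([v for _, v in grp]))
            for k, grp in groupby(rows, key=key)]

longest = lambda values: max(values, key=len)
rows = [(1, "j freytag"), (1, "johann christoph freytag"), (2, "j widom")]
merged = merge(rows, key=lambda r: r[0], aggregate=longest)
```

Because the aggregate is passed in as a function, the same operator serves very different cleaning policies without changing the group-by machinery.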
Logical and physical levels
[Diagram: the logical cleaning graph over DirtyData (Extraction, Standardization, Formatting, Duplicate Elimination producing Authors, DirtyTitles, …, using the Cities and Tags reference data) is expressed with the logical operators Map, Cluster, Match and Merge; at the physical level each operator is bound to an execution algorithm, e.g., SQL scans, Java scans, nested loop (NL) or transitive closure (TC).]
AJAX features
• An extensible data quality framework
  – Logical operators as extensions of relational algebra
  – Physical execution algorithms
• A declarative language for logical operators
  – SQL extension
• A debugger facility for tuning a data cleaning application program
  – Based on a mechanism of exceptions
Match
• Input: 2 relations
• Finds data records that correspond to the same real object
• Calls distance functions for comparing field values and computing the distance between input tuples
• Output: 1 relation containing matching tuples and possibly 1 or 2 relations containing non-matching tuples
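The Match operator's behavior can be sketched directly from this description: compare the tuples of two relations with a distance function and split them into matching and non-matching outputs. The `editDistance` function is sketched here as a plain Levenshtein distance; this is an illustration, not AJAX's implementation:

```python
# Sketch of the Match operator over two input relations: a distance
# function is called on field values, and tuples are routed to a
# matching output plus non-matching outputs for each input.

def edit_distance(a, b):
    """Plain Levenshtein distance (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def match(s1, s2, max_dist):
    matching = [(t1, t2) for t1 in s1 for t2 in s2
                if edit_distance(t1, t2) < max_dist]
    matched1 = {t1 for t1, _ in matching}
    matched2 = {t2 for _, t2 in matching}
    non1 = [t for t in s1 if t not in matched1]
    non2 = [t for t in s2 if t not in matched2]
    return matching, non1, non2

m, n1, n2 = match(["jc freytag"], ["j-c freytag", "j widom"], max_dist=3)
```

The double loop is exactly the nested-loop evaluation discussed below, which is why alternative physical algorithms matter for large inputs.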
Example
[Diagram: Duplicate Elimination = Match → Cluster → Merge, from DirtyAuthors to Authors.]

CREATE MATCH MatchDirtyAuthors
FROM DirtyAuthors da1, DirtyAuthors da2
LET distance = editDistance(da1.name, da2.name)
WHERE distance < maxDist
INTO MatchAuthors
Example
CREATE MATCH MatchDirtyAuthors
FROM DirtyAuthors da1, DirtyAuthors da2
LET distance = editDistance(da1.name, da2.name)
WHERE distance < maxDist
INTO MatchAuthors

Input: DirtyAuthors(authorKey, name)
861|johann christoph freytag
822|jc freytag
819|j freytag
814|j-c freytag

Output: MatchAuthors(authorKey1, authorKey2, name1, name2)
861|822|johann christoph freytag|jc freytag
822|814|jc freytag|j-c freytag
...
Implementation of the match operator
∀ s1 ∈ S1, s2 ∈ S2: (s1, s2) is a match if distance(s1, s2) < maxDist
Nested loop
[Diagram: every tuple of S1 is compared against every tuple of S2 with editDistance.]
• Very expensive evaluation when handling large amounts of data
⇒ Need alternative execution algorithms for the same logical specification
A database solution
CREATE TABLE MatchAuthors AS
SELECT authorKey1, authorKey2, distance
FROM (SELECT a1.authorKey authorKey1,
             a2.authorKey authorKey2,
             editDistance(a1.name, a2.name) distance
      FROM DirtyAuthors a1, DirtyAuthors a2)
WHERE distance < maxDist;
• No optimization supported for a Cartesian product with external function calls
Window scanning
[Diagram: a window of size n slides over the sorted relation S; only tuples within the same window are compared.]
• May lose some matches
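Window scanning (the sorted-neighborhood idea) can be sketched as follows; the `same_prefix` predicate is an illustrative stand-in for a real distance test:

```python
# Sketch of window scanning: sort the relation, then compare each
# tuple only with the next n-1 tuples. Far cheaper than a nested
# loop, but matches falling outside a window are lost.

def window_matches(rows, n, is_match):
    rows = sorted(rows)
    out = []
    for i, a in enumerate(rows):
        for b in rows[i + 1:i + n]:
            if is_match(a, b):
                out.append((a, b))
    return out

same_prefix = lambda a, b: a[:3] == b[:3]  # illustrative match test
rows = ["freytag j", "freytag jc", "widom j", "freitag j"]
pairs = window_matches(rows, n=2, is_match=same_prefix)
```

With a window of 2, the pair ("freitag j", "freytag jc") is never compared even though it satisfies the predicate: sorting put two other tuples between them, illustrating why window scanning may lose some matches.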
String distance filtering
[Example with maxDist = 1: a string of length `length` need only be compared to strings of length `length-1`, `length` or `length+1`, e.g., "John Smith" against "John Smit", "Jogn Smith", "John Smithe"; editDistance is computed only for these candidates.]
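The filtering step rests on the fact that |len(a) − len(b)| is a lower bound on editDistance(a, b), so pairs whose lengths differ by more than maxDist can be discarded without computing any distance. A minimal sketch using the slide's example strings:

```python
# Sketch of string-distance filtering: since the length difference is
# a lower bound on the edit distance, pairs with
# |len(a) - len(b)| > maxDist cannot match and are pruned before the
# expensive editDistance call.

def length_filter(s1, s2, max_dist):
    """Return only the candidate pairs surviving the length test."""
    return [(a, b) for a in s1 for b in s2
            if abs(len(a) - len(b)) <= max_dist]

s = ["John Smith", "John Smit", "Jogn Smith", "John Smithe"]
candidates = length_filter(s, s, max_dist=1)
```

Of the 16 raw pairs, only the two pairing the 9-character "John Smit" with the 11-character "John Smithe" are pruned here, but on large relations mapping strings to their length first shrinks the comparison space dramatically.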
Annotation-based optimization
• The user specifies types of optimization
• The system suggests which algorithm to use
Ex:
CREATE MATCHING MatchDirtyAuthors
FROM DirtyAuthors da1, DirtyAuthors da2
LET dist = editDistance(da1.name, da2.name)
WHERE dist < maxDist
% distance-filtering: map = length; dist = abs %
INTO MatchAuthors
DEFINE FUNCTIONS AS
  Choose.uniqueString(OBJECT[]) RETURN STRING THROWS CiteSeerException
  Generate.generateId(INTEGER) RETURN STRING
  Normal.removeCitationTags(STRING) RETURN STRING (600)
DEFINE ALGORITHMS AS
  TransitiveClosure
  SourceClustering(STRING)
DEFINE INPUT DATA FLOWS AS
  TABLE DirtyData (paper STRING (400));
  TABLE City (city STRING (80), citysyn STRING (80)) KEY city, citysyn;
DEFINE TRANSFORMATIONS AS
  CREATE MAPPING mapKeDiDa
  FROM DirtyData Dd
  LET keyKdd = generateId(1)
  {SELECT keyKdd AS paperKey, Dd.paper AS paper
   KEY paperKey
   CONSTRAINT NOT NULL mapKeDiDa.paper}
AJAX features
• An extensible data quality framework
  – Logical operators as extensions of relational algebra
  – Physical execution algorithms
• A declarative language for logical operators
  – SQL extension
• A debugger facility for tuning a data cleaning application program
  – Based on a mechanism of exceptions
Management of exceptions
• Problem: how to mark tuples not handled by the cleaning criteria of an operator
• Solution: specify the generation of exception tuples within a logical operator, when
  – exceptions are thrown by external functions
  – output constraints are violated
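The mechanism can be sketched as an operator wrapper that routes problem tuples into an exceptions relation instead of failing, for both cases above: the external function throws, or an output constraint (here NOT NULL) is violated. The toy `extract_title` parser and its "key|title" input format are illustrative assumptions:

```python
# Sketch: a logical operator that emits exception tuples when the
# user-defined function throws or when an output constraint
# (NOT NULL on the result) is violated.

def apply_with_exceptions(rows, fn):
    output, exceptions = [], []
    for row in rows:
        try:
            result = fn(row)
            if result is None:  # output constraint (NOT NULL) violated
                raise ValueError("NOT NULL constraint")
            output.append(result)
        except Exception as exc:
            exceptions.append((row, str(exc)))
    return output, exceptions

def extract_title(record):
    # Toy parser: expects "key|title" (an illustrative format).
    parts = record.split("|")
    title = parts[1].strip() if len(parts) == 2 else ""
    return title or None

rows = ["QGMW96|Making Views Self-Maintainable", "garbled-entry"]
out, exc = apply_with_exceptions(rows, extract_title)
```

Keeping the offending input tuple alongside the reason is what makes the debugger's backward derivation possible: each exception points back at the data that caused it.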
Debugger facility
• Supports the (backward and forward) data derivation of tuples w.r.t. an operator, to debug exceptions
• Supports interactive data modification and, in the future, the incremental execution of logical operators