H.Galhardas SAD 2007/08
Extract, Transform, Load
(ETL)
Overview
• General ETL issues:
– Data staging area (DSA)
– Building dimensions
– Building fact tables
– Extract
– Load
– Transformation/cleaning
• Commercial (and open-source) tools
• The AJAX data cleaning and transformation
framework
DW Phases
• Design phase
  – Modeling, DB design, source selection, …
• Loading phase
  – First load/population of the DW
  – Based on all data in the sources
• Refreshment phase
  – Keep the DW up-to-date w.r.t. source data changes
The ETL Process
• The most underestimated process in DW development
• The most time-consuming process in DW development
  – Often, 80% of development time is spent on ETL
• Extract
  – Extract relevant data
• Transform
  – Transform data to DW format
  – Build keys, etc.
  – Cleansing of data
• Load
  – Load data into the DW
  – Build aggregates, etc.
DW Architecture
[Diagram: data sources (operational DBs, other sources) feed a Monitor & Integrator; Extract, Transform, Load and Refresh move data through the Data Staging area into the Data Warehouse and Data Marts (described by Metadata); an OLAP server serves the data to the front-end tools: query, reports, analysis, data mining.]
DW Architecture – the ETL process
[Diagram: operational DBs and other sources → Extract, Transform, Load, Refresh via the Data Staging area → Data Warehouse.]
Implemented using an ETL tool!
Ex: SQL Server 2005 Integration Services
Data Staging Area
• Transit storage for data underway in the ETL process
  – Transformations/cleansing done here
• No user queries (although some do it)
• Sequential operations (few) on large data volumes
  – Performed by central ETL logic
  – Easily restarted
  – No need for locking, logging, etc.
  – RDBMS or flat files? (DBMSs have become better at this)
• Finished dimensions are copied from the DSA to the relevant marts
ETL construction process
Plan
  1) Make a high-level diagram of the source-destination flow
  2) Test, choose and implement an ETL tool
  3) Outline complex transformations, key generation and job sequence for every destination table
Construction of dimensions
  4) Construct and test the static dimension builds
  5) Construct and test change mechanisms for one dimension
  6) Construct and test the remaining dimension builds
Construction of fact tables and automation
  7) Construct and test the initial fact table build
  8) Construct and test incremental update
  9) Construct and test aggregate build
Building Dimensions
• Static dimension table
  – Assignment of keys: map production keys to DW keys using a table
  – Combination of data sources: find a common key?
• Handling dimension changes
  – Slowly changing dimensions
  – Find the newest DW key for a given production key
  – The table mapping production keys to DW keys must be updated
• Load of dimensions
  – Small dimensions: replace
  – Large dimensions: load only changes
Building fact tables
Two types of load:
• Initial load
  – ETL for all data up till now
  – Done when the DW is started for the first time
  – Often problematic to get correct historical data
  – Very heavy: large data volumes
• Incremental update
  – Move only changes since the last load
  – Done periodically (monthly/weekly/daily/hourly/…) after DW start
  – Less heavy: smaller data volumes
• Dimensions must be updated before facts
  – The relevant dimension rows for new facts must be in place
  – Special key considerations if an initial load must be performed
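The surrogate-key lookup that the incremental fact load performs can be sketched as follows. The table layouts and field names are hypothetical illustrations: incoming fact rows carry production keys, which must be replaced by the newest DW keys from the dimension key map (maintained by the dimension build, which runs first).

```python
# Sketch of an incremental fact load: production keys in incoming fact
# rows are replaced by DW surrogate keys looked up in the dimension
# key map. Layouts are illustrative, not a fixed schema.

def load_facts(incoming_facts, key_map):
    """incoming_facts: dicts with a 'prod_key' and a 'measure'.
    key_map: {production_key: newest_dw_key}, built by the
    dimension load, which must run before this step."""
    loaded, rejected = [], []
    for fact in incoming_facts:
        dw_key = key_map.get(fact["prod_key"])
        if dw_key is None:
            # Dimension row not in place yet -> this fact cannot load.
            rejected.append(fact)
        else:
            loaded.append({"dim_key": dw_key, "measure": fact["measure"]})
    return loaded, rejected

facts = [{"prod_key": "P1", "measure": 10},
         {"prod_key": "P9", "measure": 5}]
dims = {"P1": 1001}
ok, bad = load_facts(facts, dims)
```

Rejected rows make the dependency explicit: a fact whose dimension row is missing cannot be loaded, which is why dimensions are always built first.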
Types of data sources
• Non-cooperative sources
  – Snapshot sources: provide only a full copy of the source
  – Specific sources: each is different, e.g., legacy systems
  – Logged sources: write a change log (DB log)
  – Queryable sources: provide a query interface, e.g., SQL
• Cooperative sources
  – Replicated sources: publish/subscribe mechanism
  – Call-back sources: call external code (ETL) when changes occur
  – Internal action sources: only internal actions when changes occur (DB triggers are an example)
The extract strategy depends heavily on the source types
Extract phase
Goal: fast extract of relevant data
• Extracts from source systems can take a long time
Types of extracts:
• Extract applications (SQL): co-existence with other applications
• DB unload tools: much faster than SQL-based extracts
• Extract applications are sometimes the only solution
Often too time consuming to ETL everything
  – Extracts can take days/weeks
  – Drain on the operational systems
  – Drain on DW systems
Computing deltas
• Much faster to "ETL" only the changes since the last load
• A number of methods can be used
• Store sorted total extracts in the DSA
  – The delta can easily be computed from the current + last extract
  + Always possible
  + Handles deletions
  - Does not reduce extract time
• Put an update timestamp on all rows
  – Updated by a DB trigger
  – Extract only where "timestamp > time of last extract"
  + Reduces extract time
  +/- Less operational overhead
  - Cannot (alone) handle deletions
  - The source system must be changed
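The first method, comparing the current total extract against the last one stored in the DSA, can be sketched like this (rows are keyed records; the format is illustrative):

```python
# Sketch: compute the delta between the last and the current total
# extract kept in the DSA. Rows are {key: value} entries; comparing
# the two snapshots yields inserts, updates and deletes.

def compute_delta(last, current):
    """last/current: dicts {key: row}. Returns (inserts, updates, deletes)."""
    inserts = {k: v for k, v in current.items() if k not in last}
    deletes = {k: v for k, v in last.items() if k not in current}
    updates = {k: v for k, v in current.items()
               if k in last and last[k] != v}
    return inserts, updates, deletes

last = {1: "a", 2: "b", 3: "c"}
curr = {1: "a", 2: "B", 4: "d"}
ins, upd, dele = compute_delta(last, curr)
```

Note how deletions fall out naturally (key 3 above), which is exactly the "+ Handles deletions" point: a snapshot comparison sees rows that vanished, whereas a timestamp column on its own cannot.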
Load (1)
Goal: fast loading into the DW
  – Loading deltas is much faster than a total load
• SQL-based update is slow
  – Large overhead (optimization, locking, etc.) for every SQL call
  – DB load tools are much faster
  – Some load tools can also perform UPDATEs
• Indexes on tables slow the load a lot
  – Drop indexes and rebuild them after the load
  – Can be done per partition
• Parallelization
  – Dimensions can be loaded concurrently
  – Fact tables can be loaded concurrently
  – Partitions can be loaded concurrently
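Because dimension tables are independent of each other, their loads can run in parallel. A minimal sketch, where `load_dimension` is a hypothetical stand-in for the real bulk-loader call:

```python
# Sketch: independent dimension tables loaded concurrently.
# load_dimension is a placeholder for the actual loader invocation.
from concurrent.futures import ThreadPoolExecutor

def load_dimension(name, rows):
    # A real ETL job would call the DB bulk loader here.
    return name, len(rows)

dimensions = {"customer": ["c1", "c2"], "product": ["p1"],
              "date": ["d1", "d2", "d3"]}

with ThreadPoolExecutor(max_workers=3) as pool:
    results = dict(pool.map(lambda item: load_dimension(*item),
                            dimensions.items()))
```

The same pattern applies to fact-table partitions: anything with no load-order dependency is a candidate for concurrent loading.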
Load (2)
• Relationships in the data
  – Referential integrity must be ensured
  – Can be done by the loader
• Aggregates
  – Must be built and loaded at the same time as the detail data
  – Today, RDBMSes can often do this
• Load tuning
  – Load without log
  – Sort the load file first
  – Make only simple transformations in the loader
  – Use loader facilities for building aggregates
  – Use the loader within the same database
• Should the DW be on-line 24×7?
  – Use partitions or several sets of tables
Overview
• General ETL issues:
– Data staging area (DSA)
– Building dimensions
– Building fact tables
– Extract
– Load
– Transformation/cleaning
• Commercial (and open-source) tools
• The AJAX data cleaning and transformation
framework
Data Cleaning
Activity of converting source data into target data without errors, duplicates, and inconsistencies, i.e., cleaning and transforming to get…
High-quality data!
Why Data Cleaning and Transformation?
Data in the real world is dirty:
• incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  – e.g., occupation=""
• noisy: containing errors or outliers (spelling, phonetic and typing errors, word transpositions, multiple values in a single free-form field)
  – e.g., Salary="-10"
• inconsistent: containing discrepancies in codes or names (synonyms and nicknames, prefix and suffix variations, abbreviations, truncation and initials)
  – e.g., Age="42", Birthday="03/07/1997"
  – e.g., rating was "1,2,3", now rating is "A, B, C"
  – e.g., discrepancies between duplicate records
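The three kinds of dirtiness can be flagged mechanically. A minimal sketch using the examples above; the field names and the reference date are illustrative assumptions:

```python
# Sketch: flag incomplete, noisy and inconsistent records, using the
# slide's examples (empty occupation, negative salary, age that does
# not agree with the birthday). Rules are illustrative only.
import datetime

def audit(record, today=datetime.date(2007, 1, 1)):
    problems = []
    if not record.get("occupation"):            # incomplete: occupation=""
        problems.append("incomplete")
    if record.get("salary", 0) < 0:             # noisy: Salary="-10"
        problems.append("noisy")
    birth = record.get("birthday")
    if birth is not None:
        derived_age = today.year - birth.year   # inconsistent: age vs birthday
        if abs(derived_age - record.get("age", derived_age)) > 1:
            problems.append("inconsistent")
    return problems

rec = {"occupation": "", "salary": -10, "age": 42,
       "birthday": datetime.date(1997, 3, 7)}
issues = audit(rec)
```

A record born in 1997 cannot be 42 in 2007, so the age/birthday pair is flagged as inconsistent alongside the empty and negative fields.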
Why Is Data Dirty?
• Incomplete data comes from:
  – data values not available when collected
  – different criteria between the time the data was collected and the time it is analyzed
  – human/hardware/software problems
• Noisy data comes from:
  – data collection: faulty instruments
  – data entry: human or computer errors
  – data transmission
• Inconsistent (and redundant) data comes from:
  – different data sources, hence non-uniform naming conventions/data codes
  – functional dependency and/or referential integrity violations
Why Is Data Cleaning Important?
• A data warehouse needs consistent integration of quality data
  – Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse
• No quality data, no quality decisions!
  – Quality decisions must be based on quality data (e.g., duplicate or missing data may cause incorrect or even misleading statistics)
Types of data cleaning
• Conversion, parsing and normalization
  – Text coding, date formats, etc.
  – The most common type of cleansing
• Special-purpose cleansing
  – Normalize spellings of names, addresses, etc.
  – Remove duplicates, e.g., duplicate customers
• Domain-independent cleansing
  – Approximate, "fuzzy" joins on not-quite-matching keys
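Conversion and normalization, the most common kind of cleansing, can be sketched as follows. The target formats chosen here (ISO dates, lower-cased whitespace-normalized names) and the accepted input formats are illustrative assumptions:

```python
# Sketch: conversion/normalization cleansing. Try a list of known
# date formats and emit ISO dates; collapse whitespace and case in
# names. Formats and conventions are illustrative choices.
import datetime

DATE_FORMATS = ("%d/%m/%Y", "%Y-%m-%d", "%m-%d-%Y")

def normalize_date(text):
    for fmt in DATE_FORMATS:
        try:
            return datetime.datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # unparseable: leave for exception handling

def normalize_name(name):
    return " ".join(name.split()).lower()
```

Unparseable values are returned as `None` rather than guessed at, so a later step can route them to an exceptions table instead of silently loading bad data.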
Data quality vs cleaning
• Data quality = Data cleaning +
  – Data enrichment
    • enhancing the value of internally held data by appending related attributes from external sources (for example, consumer demographic attributes or geographic descriptors)
  – Data profiling
    • analysis of data to capture statistics (metadata) that provide insight into the quality of the data and aid in the identification of data quality issues
  – Data monitoring
    • deployment of controls to ensure ongoing conformance of data to the business rules that define data quality for the organization
  – Data stewards responsible for data quality
  – DW-controlled improvement
  – Source-controlled improvement
Overview
• General ETL issues:
– Data staging area (DSA)
– Building dimensions
– Building fact tables
– Extract
– Load
– Transformation/cleaning
• Commercial (and open-source) tools
• The AJAX data cleaning and transformation
framework
ETL tools
• ETL tools from the big vendors, e.g.,
  – Oracle Warehouse Builder
  – IBM DB2 Warehouse Manager
  – Microsoft Integration Services
• Offer much functionality at a reasonable price (included…)
  – Data modeling
  – ETL code generation
  – Scheduling DW jobs
  – …
• Many others
  – Hundreds of tools
  – Often specialized tools for certain jobs (insurance cleansing, …)
• The "best" tool does not exist
  – Choose based on your own needs
ETL and data quality tools
• http://www.etltool.com/
• Magic Quadrant for Data Quality Tools, 2007
• Some open-source ETL tools:
  – Talend
  – Enhydra Octopus
  – Clover.ETL
• Not so many open-source quality/cleaning tools
Application context
• Integrate data from different sources
  – E.g., populating a DW from different operational data stores
• Eliminate errors and duplicates within a single source
  – E.g., duplicates in a file of customers
• Migrate data from a source schema into a different fixed target schema
  – E.g., discontinued application packages
• Convert poorly structured data into structured data
The AJAX data transformation
and cleaning framework
Motivating example (1)
DirtyData(paper: String)
  → Data Cleaning & Transformation →
Events(eventKey, name)
Publications(pubKey, title, eventKey, url, volume, number, pages, city, month, year)
Authors(authorKey, name)
Motivating example (2)
DirtyData:
[1] Dallan Quass, Ashish Gupta, Inderpal Singh Mumick, and Jennifer Widom. Making Views Self-Maintainable for Data Warehousing. In Proceedings of the Conference on Parallel and Distributed Information Systems. Miami Beach, Florida, USA, 1996
[2] D. Quass, A. Gupta, I. Mumick, J. Widom, Making views self-maintianable for data warehousing, PDIS'95
  → Data Cleaning & Transformation →
Events: PDIS | Conference on Parallel and Distributed Information Systems
Publications: QGMW96 | Making Views Self-Maintainable for Data Warehousing | PDIS | null | null | null | null | Miami Beach | Florida, USA | 1996
Authors: DQua | Dallan Quass
         AGup | Ashish Gupta
         JWid | Jennifer Widom
         …
PubsAuthors: QGMW96 | DQua
             QGMW96 | AGup
             …
Modeling a data cleaning process
A data cleaning process is modeled by a directed acyclic graph of data transformations
[Diagram: Extraction splits DirtyData into DirtyAuthors, DirtyTitles, DirtyEvents, …; Standardization (using Cities and Tags reference data) and Formatting normalize them; Duplicate Elimination then produces Authors, etc.]
Existing technology
• Ad-hoc programs written in a programming language like C or Java, or using an RDBMS proprietary language
  – Programs difficult to optimize and maintain
• Data transformation scripts using an ETL (Extraction-Transformation-Loading) or a data quality tool
Problems of ETL and data quality solutions (1)
The semantics of some data transformations is defined in terms of their implementation algorithms
[Diagram: the same data cleaning transformations are re-implemented for each application domain.]
Problems of ETL and data quality solutions (2)
There is a lack of interactive facilities to tune a data cleaning application program
[Diagram: dirty data enters the cleaning process, which produces clean data plus rejected data that must be inspected and fed back.]
AJAX features
• An extensible data quality framework
  – Logical operators as extensions of relational algebra
  – Physical execution algorithms
• A declarative language for logical operators
  – SQL extension
• A debugger facility for tuning a data cleaning application program
Logical level: parametric operators
• View: arbitrary SQL query
• Map: iterator-based one-to-many mapping with arbitrary user-defined functions
• Match: iterator-based approximate join
• Cluster: uses an arbitrary clustering function
• Merge: extends SQL group-by with user-defined aggregate functions
• Apply: executes an arbitrary user-defined algorithm
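The Merge operator above, a group-by whose aggregate is an arbitrary user-defined function, can be sketched like this. The "pick the longest string as the canonical value" aggregate is an illustrative choice, not AJAX's fixed behavior:

```python
# Sketch of the Merge operator: group-by extended with a user-defined
# aggregate. Here the aggregate picks the longest name in a cluster
# of duplicates as the canonical value (an illustrative choice).
from itertools import groupby

def merge(rows, key, aggregate):
    """rows: list of (cluster_key, value); one output tuple per group."""
    rows = sorted(rows, key=key)
    return [(k, aggregate([v for _, v in grp]))
            for k, grp in groupby(rows, key=key)]

longest = lambda values: max(values, key=len)
rows = [(1, "j freytag"), (1, "johann christoph freytag"), (2, "j widom")]
merged = merge(rows, key=lambda r: r[0], aggregate=longest)
```

Because the aggregate is passed in as a function, the same operator serves very different cleaning policies without changing the group-by machinery.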
Logical and physical levels
[Diagram: the logical cleaning graph over DirtyData (Extraction, Standardization, Formatting, Duplicate Elimination producing Authors, DirtyTitles, …, using the Cities and Tags reference data) is expressed with the logical operators Map, Cluster, Match and Merge; at the physical level each operator is bound to an execution algorithm, e.g., SQL scans, Java scans, nested loop (NL) or transitive closure (TC).]
AJAX features
• An extensible data quality framework
  – Logical operators as extensions of relational algebra
  – Physical execution algorithms
• A declarative language for logical operators
  – SQL extension
• A debugger facility for tuning a data cleaning application program
  – Based on a mechanism of exceptions
Match
• Input: 2 relations
• Finds data records that correspond to the same real object
• Calls distance functions for comparing field values and computing the distance between input tuples
• Output: 1 relation containing matching tuples and possibly 1 or 2 relations containing non-matching tuples
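The Match operator's behavior can be sketched directly from this description: compare the tuples of two relations with a distance function and split them into matching and non-matching outputs. The `editDistance` function is sketched here as a plain Levenshtein distance; this is an illustration, not AJAX's implementation:

```python
# Sketch of the Match operator over two input relations: a distance
# function is called on field values, and tuples are routed to a
# matching output plus non-matching outputs for each input.

def edit_distance(a, b):
    """Plain Levenshtein distance (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def match(s1, s2, max_dist):
    matching = [(t1, t2) for t1 in s1 for t2 in s2
                if edit_distance(t1, t2) < max_dist]
    matched1 = {t1 for t1, _ in matching}
    matched2 = {t2 for _, t2 in matching}
    non1 = [t for t in s1 if t not in matched1]
    non2 = [t for t in s2 if t not in matched2]
    return matching, non1, non2

m, n1, n2 = match(["jc freytag"], ["j-c freytag", "j widom"], max_dist=3)
```

The double loop is exactly the nested-loop evaluation discussed below, which is why alternative physical algorithms matter for large inputs.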
Example
[Diagram: Duplicate Elimination = Match → Cluster → Merge, from DirtyAuthors to Authors.]

CREATE MATCH MatchDirtyAuthors
FROM DirtyAuthors da1, DirtyAuthors da2
LET distance = editDistance(da1.name, da2.name)
WHERE distance < maxDist
INTO MatchAuthors
Example
CREATE MATCH MatchDirtyAuthors
FROM DirtyAuthors da1, DirtyAuthors da2
LET distance = editDistance(da1.name, da2.name)
WHERE distance < maxDist
INTO MatchAuthors

Input: DirtyAuthors(authorKey, name)
861|johann christoph freytag
822|jc freytag
819|j freytag
814|j-c freytag

Output: MatchAuthors(authorKey1, authorKey2, name1, name2)
861|822|johann christoph freytag|jc freytag
822|814|jc freytag|j-c freytag
...
Implementation of the match operator
∀ s1 ∈ S1, s2 ∈ S2: (s1, s2) is a match if distance(s1, s2) < maxDist
Nested loop
[Diagram: every tuple of S1 is compared against every tuple of S2 with editDistance.]
• Very expensive evaluation when handling large amounts of data
⇒ Need alternative execution algorithms for the same logical specification
A database solution
CREATE TABLE MatchAuthors AS
SELECT authorKey1, authorKey2, distance
FROM (SELECT a1.authorKey authorKey1,
             a2.authorKey authorKey2,
             editDistance(a1.name, a2.name) distance
      FROM DirtyAuthors a1, DirtyAuthors a2)
WHERE distance < maxDist;
• No optimization supported for a Cartesian product with external function calls
Window scanning
[Diagram: a window of size n slides over the sorted relation S; only tuples within the same window are compared.]
• May lose some matches
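Window scanning (the sorted-neighborhood idea) can be sketched as follows; the `same_prefix` predicate is an illustrative stand-in for a real distance test:

```python
# Sketch of window scanning: sort the relation, then compare each
# tuple only with the next n-1 tuples. Far cheaper than a nested
# loop, but matches falling outside a window are lost.

def window_matches(rows, n, is_match):
    rows = sorted(rows)
    out = []
    for i, a in enumerate(rows):
        for b in rows[i + 1:i + n]:
            if is_match(a, b):
                out.append((a, b))
    return out

same_prefix = lambda a, b: a[:3] == b[:3]  # illustrative match test
rows = ["freytag j", "freytag jc", "widom j", "freitag j"]
pairs = window_matches(rows, n=2, is_match=same_prefix)
```

With a window of 2, the pair ("freitag j", "freytag jc") is never compared even though it satisfies the predicate: sorting put two other tuples between them, illustrating why window scanning may lose some matches.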
String distance filtering
[Example with maxDist = 1: a string of length `length` need only be compared to strings of length `length-1`, `length` or `length+1`, e.g., "John Smith" against "John Smit", "Jogn Smith", "John Smithe"; editDistance is computed only for these candidates.]
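The filtering step rests on the fact that |len(a) − len(b)| is a lower bound on editDistance(a, b), so pairs whose lengths differ by more than maxDist can be discarded without computing any distance. A minimal sketch using the slide's example strings:

```python
# Sketch of string-distance filtering: since the length difference is
# a lower bound on the edit distance, pairs with
# |len(a) - len(b)| > maxDist cannot match and are pruned before the
# expensive editDistance call.

def length_filter(s1, s2, max_dist):
    """Return only the candidate pairs surviving the length test."""
    return [(a, b) for a in s1 for b in s2
            if abs(len(a) - len(b)) <= max_dist]

s = ["John Smith", "John Smit", "Jogn Smith", "John Smithe"]
candidates = length_filter(s, s, max_dist=1)
```

Of the 16 raw pairs, only the two pairing the 9-character "John Smit" with the 11-character "John Smithe" are pruned here, but on large relations mapping strings to their length first shrinks the comparison space dramatically.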
Annotation-based optimization
• The user specifies types of optimization
• The system suggests which algorithm to use
Ex:
CREATE MATCHING MatchDirtyAuthors
FROM DirtyAuthors da1, DirtyAuthors da2
LET dist = editDistance(da1.name, da2.name)
WHERE dist < maxDist
% distance-filtering: map = length; dist = abs %
INTO MatchAuthors
DEFINE FUNCTIONS AS
  Choose.uniqueString(OBJECT[]) RETURN STRING THROWS CiteSeerException
  Generate.generateId(INTEGER) RETURN STRING
  Normal.removeCitationTags(STRING) RETURN STRING (600)
DEFINE ALGORITHMS AS
  TransitiveClosure
  SourceClustering(STRING)
DEFINE INPUT DATA FLOWS AS
  TABLE DirtyData (paper STRING (400));
  TABLE City (city STRING (80), citysyn STRING (80)) KEY city, citysyn;
DEFINE TRANSFORMATIONS AS
  CREATE MAPPING mapKeDiDa
  FROM DirtyData Dd
  LET keyKdd = generateId(1)
  {SELECT keyKdd AS paperKey, Dd.paper AS paper
   KEY paperKey
   CONSTRAINT NOT NULL mapKeDiDa.paper}
AJAX features
• An extensible data quality framework
  – Logical operators as extensions of relational algebra
  – Physical execution algorithms
• A declarative language for logical operators
  – SQL extension
• A debugger facility for tuning a data cleaning application program
  – Based on a mechanism of exceptions
Management of exceptions
• Problem: how to mark tuples not handled by the cleaning criteria of an operator
• Solution: specify the generation of exception tuples within a logical operator, when
  – exceptions are thrown by external functions
  – output constraints are violated
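The mechanism can be sketched as an operator wrapper that routes problem tuples into an exceptions relation instead of failing, for both cases above: the external function throws, or an output constraint (here NOT NULL) is violated. The toy `extract_title` parser and its "key|title" input format are illustrative assumptions:

```python
# Sketch: a logical operator that emits exception tuples when the
# user-defined function throws or when an output constraint
# (NOT NULL on the result) is violated.

def apply_with_exceptions(rows, fn):
    output, exceptions = [], []
    for row in rows:
        try:
            result = fn(row)
            if result is None:  # output constraint (NOT NULL) violated
                raise ValueError("NOT NULL constraint")
            output.append(result)
        except Exception as exc:
            exceptions.append((row, str(exc)))
    return output, exceptions

def extract_title(record):
    # Toy parser: expects "key|title" (an illustrative format).
    parts = record.split("|")
    title = parts[1].strip() if len(parts) == 2 else ""
    return title or None

rows = ["QGMW96|Making Views Self-Maintainable", "garbled-entry"]
out, exc = apply_with_exceptions(rows, extract_title)
```

Keeping the offending input tuple alongside the reason is what makes the debugger's backward derivation possible: each exception points back at the data that caused it.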
Debugger facility
• Supports the (backward and forward) data derivation of tuples w.r.t. an operator, to debug exceptions
• Supports interactive data modification and, in the future, the incremental execution of logical operators