Extract, Transform, Load (ETL)


H.Galhardas SAD 2007/08

Extract, Transform, Load (ETL)

Overview

• General ETL issues:

– Data staging area (DSA)

– Building dimensions

– Building fact tables

– Extract

– Load

– Transformation/cleaning

• Commercial (and open-source) tools

• The AJAX data cleaning and transformation framework


DW Phases

Design phase

– Modeling, DB design, source selection,…

Loading phase

– First load/population of the DW

– Based on all data in sources

Refreshment phase

– Keep the DW up-to-date w.r.t. source data changes


The ETL Process

• The most underestimated process in DW development
• The most time-consuming process in DW development

– Often, 80% of development time is spent on ETL

Extract
– Extract relevant data

Transform
– Transform data to DW format
– Build keys, etc.
– Cleansing of data

Load
– Load data into DW
– Build aggregates, etc.


DW Architecture

[Diagram: operational DBs and other sources feed Extract / Transform / Load / Refresh through a data staging area into the data warehouse and data marts (data storage); an OLAP server serves front-end tools (query, reports, analysis, data mining); a monitor & integrator maintains the metadata.]

The ETL process is implemented using an ETL tool!
Ex: SQL Server 2005 Integration Services


Data Staging Area

• Transit storage for data underway in the ETL process
– Transformations/cleansing done here

• No user queries (some do it)

• Sequential operations (few) on large data volumes
– Performed by central ETL logic
– Easily restarted
– No need for locking, logging, etc.
– RDBMS or flat files? (DBMSs have become better at this)

• Finished dimensions copied from the DSA to the relevant marts


ETL construction process

Plan

1) Make a high-level diagram of the source-destination flow
2) Test, choose and implement an ETL tool
3) Outline complex transformations, key generation and job sequence for every destination table

Construction of dimensions

4) Construct and test the static dimension build
5) Construct and test the change mechanisms for one dimension
6) Construct and test the remaining dimension builds

Construction of fact tables and automation

7) Construct and test the initial fact table build
8) Construct and test the incremental update
9) Construct and test the aggregate build


Building Dimensions

Static dimension table
– Assignment of keys: production keys are mapped to DW keys using a table
– Combination of data sources: find a common key?

Handling dimension changes
– Slowly changing dimensions
– Find the newest DW key for a given production key
– The table mapping production keys to DW keys must be updated

Load of dimensions
– Small dimensions: replace
– Large dimensions: load only changes
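The key-assignment step above can be sketched as follows (an illustration, not from the slides; the function name and in-memory data layout are invented):

```python
def upsert_dimension(source_rows, key_map, dim_table):
    """Assign DW surrogate keys to incoming dimension rows.

    key_map   : persisted mapping production key -> DW surrogate key
    dim_table : surrogate key -> dimension attributes
    Small dimensions could simply be replaced wholesale; this sketch
    loads only changes, as suggested for large dimensions.
    """
    for prod_key, attrs in source_rows:
        if prod_key not in key_map:
            # First time this production key is seen: assign a new DW key.
            key_map[prod_key] = len(key_map) + 1
        dim_table[key_map[prod_key]] = attrs
    return key_map, dim_table
```

A persisted `key_map` is what lets a later load find the DW key for a production key again, which the slide lists as a core requirement.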

Building fact tables

Two types of load:

• Initial load
– ETL for all data up till now
– Done when the DW is started for the first time
– Often problematic to get correct historical data
– Very heavy: large data volumes

• Incremental update
– Move only changes since the last load
– Done periodically (month/week/day/hour/...) after DW start
– Less heavy: smaller data volumes

• Dimensions must be updated before facts
– The relevant dimension rows for new facts must be in place
– Special key considerations if an initial load must be performed
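A minimal sketch (hypothetical names, continuing the production-key/surrogate-key idea from the dimension slides) of why dimension rows must precede facts:

```python
def load_facts(fact_rows, key_map):
    """Resolve production keys to DW surrogate keys before inserting facts.

    key_map: production key -> DW surrogate key, filled by the dimension load.
    Facts whose dimension row is not yet in place are rejected rather than
    loaded -- illustrating why dimensions must be updated before facts.
    """
    loaded, rejected = [], []
    for prod_key, measure in fact_rows:
        if prod_key in key_map:
            loaded.append((key_map[prod_key], measure))
        else:
            rejected.append((prod_key, measure))
    return loaded, rejected
```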


Types of data sources

Non-cooperative sources
– Snapshot sources: provide only a full copy of the source
– Specific sources: each is different, e.g., legacy systems
– Logged sources: write a change log (DB log)
– Queryable sources: provide a query interface, e.g., SQL

Cooperative sources
– Replicated sources: publish/subscribe mechanism
– Call-back sources: call external code (ETL) when changes occur
– Internal action sources: only internal actions when changes occur (DB triggers are an example)

The extract strategy is very dependent on the source types


Extract phase

Goal: fast extract of relevant data
– Extract from source systems can take a long time

Types of extracts:
– Extract applications (SQL): co-existence with other applications
– DB unload tools: much faster than SQL-based extracts
– Extract applications are sometimes the only solution

Often too time-consuming to ETL everything:
– Extracts can take days/weeks
– Drain on the operational systems
– Drain on DW systems


Computing deltas

• Much faster to only "ETL" changes since the last load
– A number of methods can be used

• Store sorted total extracts in the DSA
– The delta can easily be computed from the current + last extract
+ Always possible
+ Handles deletions
− Does not reduce extract time

• Put an update timestamp on all rows
– Updated by a DB trigger
– Extract only where "timestamp > time of last extract"
+ Reduces extract time
+/− Less operational overhead
− Cannot (alone) handle deletions
− The source system must be changed
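The first strategy can be sketched as follows (illustrative Python, assuming the two total extracts fit in memory as dicts keyed by primary key):

```python
def compute_delta(last_extract, current_extract):
    """Derive inserts/updates/deletes from two total extracts keyed by
    primary key -- the 'store total extracts in the DSA' strategy: always
    possible and it handles deletions, but it does not reduce extract time."""
    inserts = {k: v for k, v in current_extract.items()
               if k not in last_extract}
    updates = {k: v for k, v in current_extract.items()
               if k in last_extract and last_extract[k] != v}
    deletes = [k for k in last_extract if k not in current_extract]
    return inserts, updates, deletes
```

Note how deletions fall out naturally here, which the timestamp-based method cannot provide on its own.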

Load (1)

Goal: fast loading into the DW
– Loading deltas is much faster than a total load

SQL-based update is slow
– Large overhead (optimization, locking, etc.) for every SQL call
– DB load tools are much faster
– Some load tools can also perform UPDATEs

Indexes on tables slow the load a lot
– Drop indexes and rebuild them after the load
– Can be done per partition

Parallelization
– Dimensions can be loaded concurrently
– Fact tables can be loaded concurrently
– Partitions can be loaded concurrently
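Partition-parallel loading can be sketched with a thread pool (illustrative only; `load_one` stands for an assumed user-supplied loader for a single partition):

```python
from concurrent.futures import ThreadPoolExecutor

def load_concurrently(partitions, load_one):
    """Run a per-partition loader over independent partitions in parallel.

    Because partitions are independent, they can be loaded concurrently --
    the same applies to separate dimensions and separate fact tables.
    """
    with ThreadPoolExecutor(max_workers=4) as pool:
        # pool.map preserves the input order of the partitions.
        return list(pool.map(load_one, partitions))
```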


Load (2)

• Relationships in the data
– Referential integrity must be ensured
– Can be done by the loader

• Aggregates
– Must be built and loaded at the same time as the detail data
– Today, RDBMSes can often do this

• Load tuning
– Load without log
– Sort the load file first
– Make only simple transformations in the loader
– Use loader facilities for building aggregates
– Use the loader within the same database

• Should the DW be on-line 24×7?
– Use partitions or several sets of tables


Overview

• General ETL issues:

– Data staging area (DSA)

– Building dimensions

– Building fact tables

– Extract

– Load

– Transformation/cleaning

• Commercial (and open-source) tools

• The AJAX data cleaning and transformation framework


Data Cleaning

Activity of converting source data into target data without errors, duplicates, and inconsistencies, i.e., cleaning and transforming to get… high-quality data!

Why Data Cleaning and Transformation?

Data in the real world is dirty:

– incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
• e.g., occupation=“”

– noisy: containing errors or outliers (spelling, phonetic and typing errors, word transpositions, multiple values in a single free-form field)
• e.g., Salary=“-10”

– inconsistent: containing discrepancies in codes or names (synonyms and nicknames, prefix and suffix variations, abbreviations, truncation and initials)
• e.g., Age=“42”, Birthday=“03/07/1997”
• e.g., was rating “1,2,3”, now rating “A, B, C”
• e.g., discrepancy between duplicate records


Why Is Data Dirty?

• Incomplete data comes from:
– data values not available when collected
– different criteria between the time when the data was collected and when it is analyzed
– human/hardware/software problems

• Noisy data comes from:
– data collection: faulty instruments
– data entry: human or computer errors
– data transmission

• Inconsistent (and redundant) data comes from:
– different data sources, hence non-uniform naming conventions/data codes
– functional dependency and/or referential integrity violations


Why Is Data Cleaning Important?

• A data warehouse needs consistent integration of quality data
– Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse

• No quality data, no quality decisions!
– Quality decisions must be based on quality data (e.g., duplicate or missing data may cause incorrect or even misleading statistics)


Types of data cleaning

• Conversion, parsing and normalization
– Text coding, date formats, etc.
– Most common type of cleansing

• Special-purpose cleansing
– Normalize spellings of names, addresses, etc.
– Remove duplicates, e.g., duplicate customers

• Domain-independent cleansing
– Approximate, "fuzzy" joins on not-quite-matching keys
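Date-format normalization, the most common cleansing step above, might look like this (a sketch; the list of source formats is an assumption, not from the slides — in practice it comes from profiling the sources):

```python
from datetime import datetime

# Assumed source formats (hypothetical).
SOURCE_FORMATS = ["%d/%m/%Y", "%Y-%m-%d", "%d.%m.%Y"]

def normalize_date(raw):
    """Try each known source format and emit one canonical ISO date.

    Returns None when nothing matches, so the caller can route the
    value to rejected data instead of loading garbage."""
    for fmt in SOURCE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None
```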

Data quality vs cleaning

• Data quality = data cleaning +

– Data enrichment
• enhancing the value of internally held data by appending related attributes from external sources (for example, consumer demographic attributes or geographic descriptors)

– Data profiling
• analysis of data to capture statistics (metadata) that provide insight into the quality of the data and aid in the identification of data quality issues

– Data monitoring
• deployment of controls to ensure ongoing conformance of data to the business rules that define data quality for the organization

– Data stewards responsible for data quality
– DW-controlled improvement
– Source-controlled improvement


Overview

• General ETL issues:

– Data staging area (DSA)

– Building dimensions

– Building fact tables

– Extract

– Load

– Transformation/cleaning

• Commercial (and open-source) tools

• The AJAX data cleaning and transformation framework


ETL tools

ETL tools from the big vendors, e.g.,
– Oracle Warehouse Builder
– IBM DB2 Warehouse Manager
– Microsoft Integration Services

They offer much functionality at a reasonable price (included…)
– Data modeling
– ETL code generation
– Scheduling DW jobs
– …

Many others
– Hundreds of tools
– Often specialized tools for certain jobs (insurance cleansing, …)

The "best" tool does not exist
– Choose based on your own needs


ETL and data quality tools

http://www.etltool.com/

Magic Quadrant for Data Quality Tools, 2007

• Some open source ETL tools:
– Talend
– Enhydra Octopus
– Clover.ETL

• Not so many open-source quality/cleaning tools

Application context

• Integrate data from different sources
– E.g., populating a DW from different operational data stores

• Eliminate errors and duplicates within a single source
– E.g., duplicates in a file of customers

• Migrate data from a source schema into a different fixed target schema
– E.g., discontinued application packages

• Convert poorly structured data into structured data


The AJAX data transformation and cleaning framework


Motivating example (1)

DirtyData(paper: String)

→ Data Cleaning & Transformation →

Events(eventKey, name)
Publications(pubKey, title, eventKey, url, volume, number, pages, city, month, year)
Authors(authorKey, name)


Motivating example (2)

[1] Dallan Quass, Ashish Gupta, Inderpal Singh Mumick, and Jennifer Widom. Making Views Self-Maintainable for Data Warehousing. In Proceedings of the Conference on Parallel and Distributed Information Systems. Miami Beach, Florida, USA, 1996

[2] D. Quass, A. Gupta, I. Mumick, J. Widom, Making views self-maintianable for data warehousing, PDIS’95

DirtyData → Data Cleaning & Transformation →

Events:
PDIS | Conference on Parallel and Distributed Information Systems

Publications:
QGMW96 | Making Views Self-Maintainable for Data Warehousing | PDIS | null | null | null | null | Miami Beach | Florida, USA | 1996

Authors:
DQua | Dallan Quass
AGup | Ashish Gupta
JWid | Jennifer Widom
…

PubsAuthors:
QGMW96 | DQua
QGMW96 | AGup
…

Modeling a data cleaning process

A data cleaning process is modeled by a directed acyclic graph of data transformations.

[Diagram: DirtyData → Extraction → DirtyAuthors, DirtyTitles..., DirtyEvents → Standardization (using Cities, Tags) → Duplicate Elimination → Formatting → Authors, …]


Existing technology

• Ad-hoc programs written in a programming language like C or Java, or using an RDBMS proprietary language
– Programs difficult to optimize and maintain

• Data transformation scripts using an ETL (Extraction-Transformation-Loading) or a data quality tool


Problems of ETL and data quality solutions (1)

The semantics of some data transformations is defined in terms of their implementation algorithms.

[Diagram: the same data cleaning transformations shared by application domains 1, 2 and 3.]


Problems of ETL and data quality solutions (2)

There is a lack of interactive facilities to tune a data cleaning application program.

[Diagram: dirty data enters the cleaning process, which produces clean data and rejected data.]

AJAX features

• An extensible data quality framework
– Logical operators as extensions of relational algebra
– Physical execution algorithms

• A declarative language for logical operators
– SQL extension

• A debugger facility for tuning a data cleaning application program
– Based on a mechanism of exceptions




Logical level: parametric operators

• View: an arbitrary SQL query
• Map: an iterator-based one-to-many mapping with arbitrary user-defined functions
• Match: an iterator-based approximate join
• Cluster: uses an arbitrary clustering function
• Merge: extends SQL GROUP BY with user-defined aggregate functions
• Apply: executes an arbitrary user-defined algorithm


Logical level

[Diagram: the cleaning graph annotated with logical operators — Extraction and Standardization as Map steps, Duplicate Elimination as Match → Cluster → Merge, producing Authors from DirtyAuthors (alongside DirtyTitles..., Cities, Tags).]

Physical level

[Diagram: each logical operator bound to a physical algorithm — SQL Scan and Java Scans over DirtyData, DirtyAuthors, DirtyTitles..., Cities and Tags, with NL (nested loop) matching and TC (transitive closure) clustering producing Authors.]




Match

• Input: 2 relations
• Finds data records that correspond to the same real object
• Calls distance functions for comparing field values and computing the distance between input tuples
• Output: 1 relation containing the matching tuples and possibly 1 or 2 relations containing the non-matching tuples
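A naive nested-loop realization of Match can be sketched in Python (illustrative, standing in for the AJAX operator; `edit_distance` is the usual Levenshtein distance named `editDistance` in the examples that follow):

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]

def match(s1, s2, max_dist):
    """Emit (key1, key2) for tuple pairs whose names are within max_dist,
    plus the tuples of s1 that matched nothing (the non-matching relation)."""
    pairs = [(k1, k2) for k1, n1 in s1 for k2, n2 in s2
             if edit_distance(n1, n2) < max_dist]
    matched_keys = {k1 for k1, _ in pairs}
    non_matching = [(k, n) for k, n in s1 if k not in matched_keys]
    return pairs, non_matching
```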


Example

[Diagram: Duplicate Elimination = Match → Cluster → Merge, from DirtyAuthors through MatchAuthors to Authors.]

CREATE MATCH MatchDirtyAuthors
FROM DirtyAuthors da1, DirtyAuthors da2
LET distance = editDistance(da1.name, da2.name)
WHERE distance < maxDist
INTO MatchAuthors


Example

CREATE MATCH MatchDirtyAuthors
FROM DirtyAuthors da1, DirtyAuthors da2
LET distance = editDistance(da1.name, da2.name)
WHERE distance < maxDist
INTO MatchAuthors

Input: DirtyAuthors(authorKey, name)
861 | johann christoph freytag
822 | jc freytag
819 | j freytag
814 | j-c freytag

Output: MatchAuthors(authorKey1, authorKey2, name1, name2)
861 | 822 | johann christoph freytag | jc freytag
822 | 814 | jc freytag | j-c freytag
...

Implementation of the match operator

∀ s₁ ∈ S₁, s₂ ∈ S₂ : (s₁, s₂) is a match if distance(s₁, s₂) < maxDist


Nested loop

[Diagram: every tuple of S₁ is compared with every tuple of S₂ by calling editDistance.]

• Very expensive evaluation when handling large amounts of data

⇒ Need alternative execution algorithms for the same logical specification

A database solution

CREATE TABLE MatchAuthors AS
SELECT authorKey1, authorKey2, distance
FROM (SELECT a1.authorKey authorKey1,
             a2.authorKey authorKey2,
             editDistance(a1.name, a2.name) distance
      FROM DirtyAuthors a1, DirtyAuthors a2)
WHERE distance < maxDist;

• No optimization supported for a Cartesian product with external function calls


Window scanning

[Diagram: a window of n tuples slides over the sorted relation S; only tuples inside the same window are compared.]

• May lose some matches
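Window scanning over a sorted relation can be sketched as follows (hypothetical signature; `sort_key` and `distance` stand for user-supplied functions):

```python
def window_scan(records, sort_key, distance, max_dist, window=3):
    """Sort the relation on a key, then compare each tuple only with the
    next `window` tuples -- far cheaper than the full cross product, at
    the price of losing matches that happen to sort far apart."""
    recs = sorted(records, key=sort_key)
    matches = []
    for i, r in enumerate(recs):
        for s in recs[i + 1:i + 1 + window]:
            if distance(r, s) < max_dist:
                matches.append((r, s))
    return matches
```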

String distance filtering

[Diagram: with maxDist = 1, "John Smith" is compared only against strings of length − 1, length, and length + 1 (e.g., "John Smit", "Jogn Smith", "John Smithe") before editDistance is called.]
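The length-based filter rests on the fact that the edit distance between two strings is at least the difference of their lengths. A sketch (assuming the "lengths within ±maxDist" reading shown in the slide's picture):

```python
from collections import defaultdict

def length_filtered_candidates(s1, s2, max_dist):
    """Prune pairs whose lengths differ by more than max_dist -- such pairs
    cannot have editDistance <= max_dist -- so the expensive distance
    function is called only on surviving candidates."""
    by_len = defaultdict(list)
    for b in s2:
        by_len[len(b)].append(b)
    for a in s1:
        for l in range(len(a) - max_dist, len(a) + max_dist + 1):
            for b in by_len[l]:
                yield a, b
```

Bucketing `s2` by length makes candidate lookup constant-time per length, instead of rescanning the whole relation.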


Annotation-based optimization

• The user specifies types of optimization
• The system suggests which algorithm to use

Ex:
CREATE MATCHING MatchDirtyAuthors
FROM DirtyAuthors da1, DirtyAuthors da2
LET dist = editDistance(da1.name, da2.name)
WHERE dist < maxDist
% distance-filtering: map = length; dist = abs %
INTO MatchAuthors


DEFINE FUNCTIONS AS
  Choose.uniqueString(OBJECT[]) RETURN STRING THROWS CiteSeerException
  Generate.generateId(INTEGER) RETURN STRING
  Normal.removeCitationTags(STRING) RETURN STRING (600)

DEFINE ALGORITHMS AS
  TransitiveClosure
  SourceClustering(STRING)

DEFINE INPUT DATA FLOWS AS
  TABLE DirtyData (paper STRING (400));
  TABLE City (city STRING (80), citysyn STRING (80)) KEY city, citysyn;

DEFINE TRANSFORMATIONS AS
CREATE MAPPING mapKeDiDa
FROM DirtyData Dd
LET keyKdd = generateId(1)
{SELECT keyKdd AS paperKey, Dd.paper AS paper
 KEY paperKey
 CONSTRAINT NOT NULL mapKeDiDa.paper}



Management of exceptions

• Problem: mark the tuples not handled by the cleaning criteria of an operator

• Solution: specify the generation of exception tuples within a logical operator
– exceptions are thrown by external functions
– output constraints are violated


Debugger facility

• Supports the (backward and forward) data derivation of tuples w.r.t. an operator, to debug exceptions

• Supports interactive data modification and, in the future, the incremental execution of logical operators


To see it working....

Workshop-UQ?

Tomorrow at Tagus Park,

10H-16H
