• No results found

Data Integration and ETL Process

N/A
N/A
Protected

Academic year: 2021

Share "Data Integration and ETL Process"

Copied!
76
0
0

Loading.... (view fulltext now)

Full text

(1)

Data Integration and ETL Process

Krzysztof Dembczy´nski

Intelligent Decision Support Systems Laboratory (IDSS) Pozna´n University of Technology, Poland

(2)

Review of the previous lecture

Mining of massive datasets

Evolution of database systems: operational vs. analytical systems.

Dimensional modeling:

I Three goals of the logical design of data warehouse: simplicity, expressiveness and performance.

I The most popular conceptual schema: star schema.

I Designing data warehouses is not an easy task.

(3)

Outline

1 Motivation

2 Data Extraction

3 Transformation and Integration of Data

4 Load of Data

5 Data Warehouse Refreshment

(4)

Outline

1 Motivation

2 Data Extraction

3 Transformation and Integration of Data

4 Load of Data

5 Data Warehouse Refreshment

6 Summary

(5)

Motivation

OLAP queries are usually performed in a separate system, i.e., in a data warehouse.

Data warehouses combine data from multiple sources.

Data must be translated into a consistent format.

Data integration represents 80% of effort for a typical data warehouse project!

(6)

ETL process

ETL = Extraction, Transformation, and Load

I Extractionof data from source systems,

I Transformation and integrationof data into a useful format for analysis,

I Loadof data into the warehouse and build of additional structures.

Refreshment of data warehouse is closely related to ETL process.

The ETL process is described by metadata stored in data warehouse.

Architecture of data warehousing:

Data sources → Data staging area → Data warehouse

(7)

ETL process

(8)

ETL tools

Data extraction from heterogeneous data sources.

Data transformation, integration, and cleansing.

Data quality analysis and control.

Data loading.

High-speed data transfer.

Data refreshment.

Managing and analyzing metadata.

Examples of ETL tools:

I MS SQL Server Integration Services(SSIS), IBM Infosphere DataStage, SAS ETL Studio, Oracle Warehouse Builder, Oracle Data Integrator, Business Objects Data Integrator, Pentaho Data Integration.

(9)

ETL tools

MS SQL Server Integration Services(SSIS)

(10)

ETL tools

MS SQL Server Integration Services(SSIS)

(11)

Outline

1 Motivation

2 Data Extraction

3 Transformation and Integration of Data

4 Load of Data

5 Data Warehouse Refreshment

(12)

Data extraction

Data warehouse needs extraction of data from different external data sources:

I operational databases (relational, hierarchical, network, itp.),

I files of standard applications (Excel, COBOL applications),

I additional databases (direct marketing databases) and data services (stock data),

I and other documents (.doc, XML, WWW).

(13)

Data sources

Data sources are often operational systems, providing the lowest level of data.

Data sources are designed for operational use, not for decision support, and the data reflect this fact.

Multiple data sources are often from different systems, run on a wide range of hardware and much of the software is built in-house or highly customized.

Data sources can be designed using different logical structures.

(14)

Data sources

Data sources are often operational systems, providing the lowest level of data.

Data sources are designed for operational use, not for decision support, and the data reflect this fact.

Multiple data sources are often from different systems, run on a wide range of hardware and much of the software is built in-house or highly customized.

Data sources can be designed using different logical structures.

(15)

Data sources

Data sources are often operational systems, providing the lowest level of data.

Data sources are designed for operational use, not for decision support, and the data reflect this fact.

Multiple data sources are often from different systems, run on a wide range of hardware and much of the software is built in-house or highly customized.

Data sources can be designed using different logical structures.

(16)

Data sources

Data sources are often operational systems, providing the lowest level of data.

Data sources are designed for operational use, not for decision support, and the data reflect this fact.

Multiple data sources are often from different systems, run on a wide range of hardware and much of the software is built in-house or highly customized.

Data sources can be designed using different logical structures.

(17)

Data extraction

Identification of concepts and objects does not have to be easy.

Example: Extract information about sales from the source system.

I What is meant by the termsale?

A sale has occurred when

• the order has been received by a customer,

• the order is sent to the customer,

• the invoice has been raised against the order.

I It is a common problem that there is no table SALES in the operational databases; some other tables can exist like ORDER with an attribute ORDER STATUS.

(18)

Data extraction

Identification of concepts and objects does not have to be easy.

Example: Extract information about sales from the source system.

I What is meant by the termsale? A sale has occurred when

• the order has been received by a customer,

• the order is sent to the customer,

• the invoice has been raised against the order.

I It is a common problem that there is no table SALES in the operational databases; some other tables can exist like ORDER with an attribute ORDER STATUS.

(19)

Data extraction

Identification of concepts and objects does not have to be easy.

Example: Extract information about sales from the source system.

I What is meant by the termsale? A sale has occurred when

• the order has been received by a customer,

• the order is sent to the customer,

• the invoice has been raised against the order.

I It is a common problem that there is no table SALES in the operational databases; some other tables can exist like ORDER with an attribute ORDER STATUS.

(20)

Change monitoring

Change monitoring is directly connected with data warehouse refreshment.

Detect changes to an information source.

Different monitoring techniques: external and intrusive techniques.

Monitoring Techniques

I Snapshot vs. timestamped sources

I Queryable, logged, and replicated sources

I Callback and internal action sources

(21)

Change monitoring

Change monitoring is directly connected with data warehouse refreshment.

Detect changes to an information source.

Different monitoring techniques: external and intrusive techniques.

Monitoring Techniques

I Snapshot vs. timestamped sources

I Queryable, logged, and replicated sources

I Callback and internal action sources

(22)

Outline

1 Motivation

2 Data Extraction

3 Transformation and Integration of Data

4 Load of Data

5 Data Warehouse Refreshment

6 Summary

(23)

Transformation and integration of data

Transformation and integration of data is the most important part of data warehousing.

This phase consists in removing all inconsistencies and redundancies of data coming to the data warehouse from operational data sources.

The data are conform to the conceptual schema used by the warehouse.

Integration concerns data and data schemas.

Different levels of integration: schema, table, tuple, attribute values.

(24)

Transformation and integration of data

Transformation and integration of data is the most important part of data warehousing.

This phase consists in removing all inconsistencies and redundancies of data coming to the data warehouse from operational data sources.

The data are conform to the conceptual schema used by the warehouse.

Integration concerns data and data schemas.

Different levels of integration: schema, table, tuple, attribute values.

(25)

Transformation and integration of data

Transformation and integration of data is the most important part of data warehousing.

This phase consists in removing all inconsistencies and redundancies of data coming to the data warehouse from operational data sources.

The data are conform to the conceptual schema used by the warehouse.

Integration concerns data and data schemas.

Different levels of integration: schema, table, tuple, attribute values.

(26)

Transformation and integration of data

Transformation and integration of data is the most important part of data warehousing.

This phase consists in removing all inconsistencies and redundancies of data coming to the data warehouse from operational data sources.

The data are conform to the conceptual schema used by the warehouse.

Integration concerns data and data schemas.

Different levels of integration: schema, table, tuple, attribute values.

(27)

Transformation and integration of data

Transformation and integration of data is the most important part of data warehousing.

This phase consists in removing all inconsistencies and redundancies of data coming to the data warehouse from operational data sources.

The data are conform to the conceptual schema used by the warehouse.

Integration concerns data and data schemas.

Different levels of integration: schema, table, tuple, attribute values.

(28)

Conflicts and dirty data

Different logical models of operational sources,

Different data types (account number stored asString or Numeric),

Different data domains (gender: M,F,male,female, 1, 0),

Different date formats (dd-mm-yyyy or mm-dd-yyyy),

Different field lengths (address stored by using 20 or 50 chars),

Different naming conventions: homonyms and synonyms,

Semantic conflicts, when the same objects are modeled on different logical levels,

Structural conflicts, when the same concepts are modeled using different structures.

(29)

Conflicts and dirty data

Different logical models of operational sources,

Different data types (account number stored asString or Numeric),

Different data domains (gender: M,F,male,female, 1, 0),

Different date formats (dd-mm-yyyy or mm-dd-yyyy),

Different field lengths (address stored by using 20 or 50 chars),

Different naming conventions: homonyms and synonyms,

Semantic conflicts, when the same objects are modeled on different logical levels,

Structural conflicts, when the same concepts are modeled using different structures.

(30)

Conflicts and dirty data

Different logical models of operational sources,

Different data types (account number stored asString or Numeric),

Different data domains (gender: M,F,male,female, 1, 0),

Different date formats (dd-mm-yyyy or mm-dd-yyyy),

Different field lengths (address stored by using 20 or 50 chars),

Different naming conventions: homonyms and synonyms,

Semantic conflicts, when the same objects are modeled on different logical levels,

Structural conflicts, when the same concepts are modeled using different structures.

(31)

Conflicts and dirty data

Different logical models of operational sources,

Different data types (account number stored asString or Numeric),

Different data domains (gender: M,F,male,female, 1, 0),

Different date formats (dd-mm-yyyy or mm-dd-yyyy),

Different field lengths (address stored by using 20 or 50 chars),

Different naming conventions: homonyms and synonyms,

Semantic conflicts, when the same objects are modeled on different logical levels,

Structural conflicts, when the same concepts are modeled using different structures.

(32)

Conflicts and dirty data

Different logical models of operational sources,

Different data types (account number stored asString or Numeric),

Different data domains (gender: M,F,male,female, 1, 0),

Different date formats (dd-mm-yyyy or mm-dd-yyyy),

Different field lengths (address stored by using 20 or 50 chars),

Different naming conventions: homonyms and synonyms,

Semantic conflicts, when the same objects are modeled on different logical levels,

Structural conflicts, when the same concepts are modeled using different structures.

(33)

Conflicts and dirty data

Different logical models of operational sources,

Different data types (account number stored asString or Numeric),

Different data domains (gender: M,F,male,female, 1, 0),

Different date formats (dd-mm-yyyy or mm-dd-yyyy),

Different field lengths (address stored by using 20 or 50 chars),

Different naming conventions: homonyms and synonyms,

Semantic conflicts, when the same objects are modeled on different logical levels,

Structural conflicts, when the same concepts are modeled using different structures.

(34)

Conflicts and dirty data

Different logical models of operational sources,

Different data types (account number stored asString or Numeric),

Different data domains (gender: M,F,male,female, 1, 0),

Different date formats (dd-mm-yyyy or mm-dd-yyyy),

Different field lengths (address stored by using 20 or 50 chars),

Different naming conventions: homonyms and synonyms,

Semantic conflicts, when the same objects are modeled on different logical levels,

Structural conflicts, when the same concepts are modeled using different structures.

(35)

Conflicts and dirty data

Different logical models of operational sources,

Different data types (account number stored asString or Numeric),

Different data domains (gender: M,F,male,female, 1, 0),

Different date formats (dd-mm-yyyy or mm-dd-yyyy),

Different field lengths (address stored by using 20 or 50 chars),

Different naming conventions: homonyms and synonyms,

Semantic conflicts, when the same objects are modeled on different logical levels,

Structural conflicts, when the same concepts are modeled using

(36)

Conflicts and dirty data

Different entries for the same attribute (state name or abbreviation),

Text fields can possess hidden information (contact person name can be given in the company name field),

Wrong attribute entries (attribute name contains company name or contact person name),

Inconsistent information concerning the same object,

Information concerning the same object, but indicated by different keys,

Missing values,

. . .

(37)

Conflicts and dirty data

Different entries for the same attribute (state name or abbreviation),

Text fields can possess hidden information (contact person name can be given in the company name field),

Wrong attribute entries (attribute name contains company name or contact person name),

Inconsistent information concerning the same object,

Information concerning the same object, but indicated by different keys,

Missing values,

. . .

(38)

Conflicts and dirty data

Different entries for the same attribute (state name or abbreviation),

Text fields can possess hidden information (contact person name can be given in the company name field),

Wrong attribute entries (attribute name contains company name or contact person name),

Inconsistent information concerning the same object,

Information concerning the same object, but indicated by different keys,

Missing values,

. . .

(39)

Conflicts and dirty data

Different entries for the same attribute (state name or abbreviation),

Text fields can possess hidden information (contact person name can be given in the company name field),

Wrong attribute entries (attribute name contains company name or contact person name),

Inconsistent information concerning the same object,

Information concerning the same object, but indicated by different keys,

Missing values,

. . .

(40)

Conflicts and dirty data

Different entries for the same attribute (state name or abbreviation),

Text fields can possess hidden information (contact person name can be given in the company name field),

Wrong attribute entries (attribute name contains company name or contact person name),

Inconsistent information concerning the same object,

Information concerning the same object, but indicated by different keys,

Missing values,

. . .

(41)

Conflicts and dirty data

Different entries for the same attribute (state name or abbreviation),

Text fields can possess hidden information (contact person name can be given in the company name field),

Wrong attribute entries (attribute name contains company name or contact person name),

Inconsistent information concerning the same object,

Information concerning the same object, but indicated by different keys,

Missing values,

. . .

(42)

Conflicts and dirty data

Different entries for the same attribute (state name or abbreviation),

Text fields can possess hidden information (contact person name can be given in the company name field),

Wrong attribute entries (attribute name contains company name or contact person name),

Inconsistent information concerning the same object,

Information concerning the same object, but indicated by different keys,

Missing values,

. . .

(43)

Data cleansing

We would like to analyze high-quality data, since our goal is to support decision making.

(44)

Data cleansing techniques

Conversion and normalization methods (date formats “dd/mm/rrrr”, names conventions: Jan Kowalski),

Parsing text fields in order to identify and isolate data elements:

I Transformation (splitting the text into records {title = mgr, first name

= Jan, last name = Kowalski}),

I Standardization (Jan Kowalski, magister ⇒ mgr Jan Kowalski),

Dictionary-based methods (database of names, geographical places, pharmaceutical data),

Domain-specific knowledge methods to complete data (postal codes),

Rationalization of data (PHX323RFD110A4 ⇒ Print paper, format A4),

Rule-based cleansing (replace gender by sex )

Cleansing by using data mining.

(45)

Data cleansing techniques

Conversion and normalization methods (date formats “dd/mm/rrrr”, names conventions: Jan Kowalski),

Parsing text fields in order to identify and isolate data elements:

I Transformation (splitting the text into records {title = mgr, first name

= Jan, last name = Kowalski}),

I Standardization (Jan Kowalski, magister ⇒ mgr Jan Kowalski),

Dictionary-based methods (database of names, geographical places, pharmaceutical data),

Domain-specific knowledge methods to complete data (postal codes),

Rationalization of data (PHX323RFD110A4 ⇒ Print paper, format A4),

Rule-based cleansing (replace gender by sex )

Cleansing by using data mining.

(46)

Data cleansing techniques

Conversion and normalization methods (date formats “dd/mm/rrrr”, names conventions: Jan Kowalski),

Parsing text fields in order to identify and isolate data elements:

I Transformation (splitting the text into records {title = mgr, first name

= Jan, last name = Kowalski}),

I Standardization (Jan Kowalski, magister ⇒ mgr Jan Kowalski),

Dictionary-based methods (database of names, geographical places, pharmaceutical data),

Domain-specific knowledge methods to complete data (postal codes),

Rationalization of data (PHX323RFD110A4 ⇒ Print paper, format A4),

Rule-based cleansing (replace gender by sex )

Cleansing by using data mining.

(47)

Data cleansing techniques

Conversion and normalization methods (date formats “dd/mm/rrrr”, names conventions: Jan Kowalski),

Parsing text fields in order to identify and isolate data elements:

I Transformation (splitting the text into records {title = mgr, first name

= Jan, last name = Kowalski}),

I Standardization (Jan Kowalski, magister ⇒ mgr Jan Kowalski),

Dictionary-based methods (database of names, geographical places, pharmaceutical data),

Domain-specific knowledge methods to complete data (postal codes),

Rationalization of data (PHX323RFD110A4 ⇒ Print paper, format A4),

Rule-based cleansing (replace gender by sex )

Cleansing by using data mining.

(48)

Data cleansing techniques

Conversion and normalization methods (date formats “dd/mm/rrrr”, names conventions: Jan Kowalski),

Parsing text fields in order to identify and isolate data elements:

I Transformation (splitting the text into records {title = mgr, first name

= Jan, last name = Kowalski}),

I Standardization (Jan Kowalski, magister ⇒ mgr Jan Kowalski),

Dictionary-based methods (database of names, geographical places, pharmaceutical data),

Domain-specific knowledge methods to complete data (postal codes),

Rationalization of data (PHX323RFD110A4 ⇒ Print paper, format A4),

Rule-based cleansing (replace gender by sex )

Cleansing by using data mining.

(49)

Data cleansing techniques

Conversion and normalization methods (date formats “dd/mm/rrrr”, names conventions: Jan Kowalski),

Parsing text fields in order to identify and isolate data elements:

I Transformation (splitting the text into records {title = mgr, first name

= Jan, last name = Kowalski}),

I Standardization (Jan Kowalski, magister ⇒ mgr Jan Kowalski),

Dictionary-based methods (database of names, geographical places, pharmaceutical data),

Domain-specific knowledge methods to complete data (postal codes),

Rationalization of data (PHX323RFD110A4 ⇒ Print paper, format A4),

Rule-based cleansing (replace gender by sex )

Cleansing by using data mining.

(50)

Data cleansing techniques

Conversion and normalization methods (date formats “dd/mm/rrrr”, names conventions: Jan Kowalski),

Parsing text fields in order to identify and isolate data elements:

I Transformation (splitting the text into records {title = mgr, first name

= Jan, last name = Kowalski}),

I Standardization (Jan Kowalski, magister ⇒ mgr Jan Kowalski),

Dictionary-based methods (database of names, geographical places, pharmaceutical data),

Domain-specific knowledge methods to complete data (postal codes),

Rationalization of data (PHX323RFD110A4 ⇒ Print paper, format A4),

Rule-based cleansing (replace gender by sex )

Cleansing by using data mining.

(51)

Data cleansing techniques

Conversion and normalization methods (date formats “dd/mm/rrrr”, names conventions: Jan Kowalski),

Parsing text fields in order to identify and isolate data elements:

I Transformation (splitting the text into records {title = mgr, first name

= Jan, last name = Kowalski}),

I Standardization (Jan Kowalski, magister ⇒ mgr Jan Kowalski),

Dictionary-based methods (database of names, geographical places, pharmaceutical data),

Domain-specific knowledge methods to complete data (postal codes),

Rationalization of data (PHX323RFD110A4 ⇒ Print paper, format A4),

Cleansing by using data mining.

(52)

Data cleansing techniques

Conversion and normalization methods (date formats “dd/mm/rrrr”, names conventions: Jan Kowalski),

Parsing text fields in order to identify and isolate data elements:

I Transformation (splitting the text into records {title = mgr, first name

= Jan, last name = Kowalski}),

I Standardization (Jan Kowalski, magister ⇒ mgr Jan Kowalski),

Dictionary-based methods (database of names, geographical places, pharmaceutical data),

Domain-specific knowledge methods to complete data (postal codes),

Rationalization of data (PHX323RFD110A4 ⇒ Print paper, format A4),

Rule-based cleansing (replace gender by sex )

(53)

Data cleansing techniques

Deduplication ensures that one accurate record exists for each business entity represented in a database,

Householding is the technique of grouping individual customers by the household or organization of which they are a member; this technique has some interesting marketing implications, and can also support cost-saving measures of direct advertising.

Example:

I Consider the following rows in a database:

Tim Jones 123 Main Street Marlboro MA 12234 T. Jones 123 Main St. Marlborogh MA 12234 Timothy Jones 321 Maine Street Marlborog AM 12234 Jones, Timothy 123 Maine Ave Marlborough MA 13324

I The sales for around $500 are counted for each tuple.

I Is it the same person?

(54)

Data cleansing techniques

Deduplication ensures that one accurate record exists for each business entity represented in a database,

Householding is the technique of grouping individual customers by the household or organization of which they are a member; this technique has some interesting marketing implications, and can also support cost-saving measures of direct advertising.

Example:

I Consider the following rows in a database:

Tim Jones 123 Main Street Marlboro MA 12234 T. Jones 123 Main St. Marlborogh MA 12234 Timothy Jones 321 Maine Street Marlborog AM 12234 Jones, Timothy 123 Maine Ave Marlborough MA 13324

I The sales for around $500 are counted for each tuple.

I Is it the same person?

(55)

Data cleansing techniques

Deduplication ensures that one accurate record exists for each business entity represented in a database,

Householding is the technique of grouping individual customers by the household or organization of which they are a member; this technique has some interesting marketing implications, and can also support cost-saving measures of direct advertising.

Example:

I Consider the following rows in a database:

Tim Jones 123 Main Street Marlboro MA 12234 T. Jones 123 Main St. Marlborogh MA 12234 Timothy Jones 321 Maine Street Marlborog AM 12234 Jones, Timothy 123 Maine Ave Marlborough MA 13324

I The sales for around $500 are counted for each tuple.

I Is it the same person?

(56)

Data cleansing techniques

Deduplication ensures that one accurate record exists for each business entity represented in a database,

Householding is the technique of grouping individual customers by the household or organization of which they are a member; this technique has some interesting marketing implications, and can also support cost-saving measures of direct advertising.

Example:

I Consider the following rows in a database:

Tim Jones 123 Main Street Marlboro MA 12234 T. Jones 123 Main St. Marlborogh MA 12234 Timothy Jones 321 Maine Street Marlborog AM 12234 Jones, Timothy 123 Maine Ave Marlborough MA 13324

I The sales for around $500 are counted for each tuple.

I Is it the same person?

(57)

Data cleansing techniques

Deduplication ensures that one accurate record exists for each business entity represented in a database,

Householding is the technique of grouping individual customers by the household or organization of which they are a member; this technique has some interesting marketing implications, and can also support cost-saving measures of direct advertising.

Example:

I Consider the following rows in a database:

Tim Jones 123 Main Street Marlboro MA 12234 T. Jones 123 Main St. Marlborogh MA 12234 Timothy Jones 321 Maine Street Marlborog AM 12234 Jones, Timothy 123 Maine Ave Marlborough MA 13324

I Is it the same person?

(58)

Data cleansing techniques

Deduplication ensures that one accurate record exists for each business entity represented in a database,

Householding is the technique of grouping individual customers by the household or organization of which they are a member; this technique has some interesting marketing implications, and can also support cost-saving measures of direct advertising.

Example:

I Consider the following rows in a database:

Tim Jones 123 Main Street Marlboro MA 12234 T. Jones 123 Main St. Marlborogh MA 12234 Timothy Jones 321 Maine Street Marlborog AM 12234 Jones, Timothy 123 Maine Ave Marlborough MA 13324

I The sales for around $500 are counted for each tuple.

I Is it the same person?

(59)

Outline

1 Motivation

2 Data Extraction

3 Transformation and Integration of Data

4 Load of Data

5 Data Warehouse Refreshment

(60)

Load of data

After extracting, cleaning and transforming, data must be loaded into the warehouse.

Loading the warehouse includes some other processing tasks: checking integrity constraints, sorting, summarizing, creating indexes, etc.

Batch (bulk) load utilities are used for loading.

A load utility must allow the administrator to monitor status, to cancel, suspend, and resume a load, and to restart after failure with no loss of data integrity.

(61)

Load of data

After extracting, cleaning and transforming, data must be loaded into the warehouse.

Loading the warehouse includes some other processing tasks: checking integrity constraints, sorting, summarizing, creating indexes, etc.

Batch (bulk) load utilities are used for loading.

A load utility must allow the administrator to monitor status, to cancel, suspend, and resume a load, and to restart after failure with no loss of data integrity.

(62)

Load of data

After extracting, cleaning and transforming, data must be loaded into the warehouse.

Loading the warehouse includes some other processing tasks: checking integrity constraints, sorting, summarizing, creating indexes, etc.

Batch (bulk) load utilities are used for loading.

A load utility must allow the administrator to monitor status, to cancel, suspend, and resume a load, and to restart after failure with no loss of data integrity.

(63)

Load of data

After extracting, cleaning and transforming, data must be loaded into the warehouse.

Loading the warehouse includes some other processing tasks: checking integrity constraints, sorting, summarizing, creating indexes, etc.

Batch (bulk) load utilities are used for loading.

A load utility must allow the administrator to monitor status, to cancel, suspend, and resume a load, and to restart after failure with no loss of data integrity.

(64)

Load of data

Concern very large data volumes.

Sequential loads can take a very long time.

Full load can be treated as a single long batch transaction that builds up a new database.

Using checkpoints ensures that if a failure occurs during the load, the process can restart from the last checkpoint.

(65)

Outline

1 Motivation

2 Data Extraction

3 Transformation and Integration of Data

4 Load of Data

5 Data Warehouse Refreshment

(66)

Data warehouse refreshment

Refreshing a warehouse means propagating updates on source data to the data stored in the warehouse.

Follows the same structure as ETL process.

Several constraints: accessibility of data sources, size of data, size of data warehouse, frequency of data refreshing, degradation of

performance of operational systems.

(67)

Data warehouse refreshment

Refreshing a warehouse means propagating updates on source data to the data stored in the warehouse.

Follows the same structure as ETL process.

Several constraints: accessibility of data sources, size of data, size of data warehouse, frequency of data refreshing, degradation of

performance of operational systems.

(68)

Data warehouse refreshment

Refreshing a warehouse means propagating updates on source data to the data stored in the warehouse.

Follows the same structure as ETL process.

Several constraints: accessibility of data sources, size of data, size of data warehouse, frequency of data refreshing, degradation of

performance of operational systems.

(69)

Data warehouse refreshment

Detect changes in external data sources.

Extract the changes and integrate into the warehouse.

Update indexes, subaggregates and materialized views.

(70)

Data warehouse refreshment

Detect changes in external data sources.

Extract the changes and integrate into the warehouse.

Update indexes, subaggregates and materialized views.

(71)

Data warehouse refreshment

Detect changes in external data sources.

Extract the changes and integrate into the warehouse.

Update indexes, subaggregates and materialized views.

(72)

Data warehouse refreshment

Periodical refreshment (daily or weekly).

Immediate refreshment.

Determined by usage, types of data source, etc.

(73)

Refreshment vs. load

Refreshment can be asynchronous.

Load of data requires a long-term access to source data.

Refreshment concerns less data.

Refreshment is faster.

(74)

Outline

1 Motivation

2 Data Extraction

3 Transformation and Integration of Data

4 Load of Data

5 Data Warehouse Refreshment

6 Summary

(75)

Summary

ETL process is a strategic element of data warehousing.

Main concepts: extraction, transformation and integration, load, data warehouse refreshment and metadata.

New emerging technology . . ..

(76)

Bibliography

C. Todman. Designing a Data Warehouse: Supporting Customer Relationship Management.

Prentice Hall PTR, 2001

Matthias Jarke, Y. Vassiliou, P. Vassiliadis, and M. Lenzerini. Fundamentals of Data Warehouses.

Springer-Verlag New York, second edition, 1999

R. Kimball and M. Ross. The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling.

John Wiley & Sons, third edition, 2013

www.kimballuniversity.com

References

Related documents

In this regard, the department will focus on enhanced support to Small Business and Cooperatives development with an emphasis on programmes to

If the metabolite cocaethylene (alcohol) is positive on a nail or hair test, it proves that the person consumed alcohol at the same time as the cocaine... EtG ALCOHOL TEST

If Box 3 is checked YES on the official Intel ISEF Abstract and Certification, Regulated Research Institution/Industrial Setting Form 1C must be vertically displayed.. Display

K EY WORDS critical discourse analysis; Finland; folk church; freedom of religion; ideology; national identity; religious equality; state church.. We have come in this country

Composing Elements: Carbon, Hydrogen , and Oxygen(CHO) Composing Elements: Carbon, Hydrogen , and Oxygen (CHO) Composing Elements: Carbon, Hydrogen , Oxygen, and

This system effectively identifies security laws in the CRM applications using Blowfish algorithm effectively.After establishing “Independent Encryption/Decryption

Flexible analog input: 1 V p-p to 2 V p-p range Offset binary or twos complement data format Clock duty cycle stabilizer.. APPLICATIONS

Historically, the nation-state building processes were an important phase in the period of transition from the governmental traditions in Europe and the establishment of a