• No results found

Integrate multiple, heterogeneous data sources. Data cleaning and data integration techniques are applied

N/A
N/A
Protected

Academic year: 2021

Share "Integrate multiple, heterogeneous data sources. Data cleaning and data integration techniques are applied"

Copied!
11
0
0

Loading.... (view fulltext now)

Full text

(1)

Data Warehousing and OLAP

Data Warehousing and OLAP

Lecture 2/DMBI/IKI83403T/MTI/UI

Yudho Giri Sucahyo, Ph.D, CISA ([email protected]) Faculty of Computer Science, University of Indonesia

Objectives

Objectives

`

Motivation: Why data warehouse?

`

What is a data warehouse?

`

Why separate DW?

y p

`

Conceptual modeling of DW

`

Data Mart

`

Data Mart

`

Data Warehousing Architectures

`

Data Warehouse Development

`

Data Warehouse Development

`

Data Warehouse Vendors

R l

DW

`

Real-time DW

2

Motivation: Why data warehouse?

Motivation: Why data warehouse?

`

Construction of data warehouses (DW) involves data

cleaning and data integration Æ important

preprocessing step for data mining (DM).

`

DW provide OLAP for the interactive analysis of

multidimensional data, which facilitates effective DM.

,

`

Data mining functions can be integrated with OLAP

operations to enhance interactive mining of knowledge.

operations to enhance interactive mining of knowledge.

`

DW will provide an effective platform for DM.

`

Whil DW

t

i

t t d DM DW t

`

While DWs are not requirements to do DM, DW store

massive amounts of data that can be uses for DM. [DO]

What is a data warehouse? [JH]

What is a data warehouse? [JH]

`

Defined in many different ways, but not rigorously.

` A decision support database that is maintained separately

from the organization’s ODB.

` Support information processing by providing a solid platform

of consolidated, historical data for analysis.

`

“A data warehouse is a

subject-oriented, integrated,

time-variant

,

and

nonvolatile

collection of data in

support of management’s decision-making process.” —

W. H. Inmon

`

Case Study 2: Continental Airlines flies high with its

(2)

What is a data warehouse? [ET]

What is a data warehouse? [ET]

`

Data warehouse

A physical repository where relational data are specially

organized to provide enterprise-wide, cleansed data in a

standardized format.

`

Characteristics

` Subject oriented, Integrated, Time Variant, Non-volatile

` Web-based, Relational/multidimensional, Client/server, Real-time ` Include metadata

`

Data warehousing

` Process of constructing and using data warehouses.

` Requires data integration, data cleaning, and data consolidation.

5

Subject Oriented

Subject Oriented

`

Organized around major subjects, such as

`

Organized around major subjects, such as

customer, product, sales

.

P

id

i

l

d

i

i

d

`

Provide

a simple and concise

view around

particular subject issues by

excluding data that

are not useful in the decision support process

.

`

Focusing on the modeling and analysis of data

`

Focusing on the modeling and analysis of data

for decision makers, not on daily operations or

transaction processing

transaction processing.

6

Integrated

Integrated

`

Integrate multiple, heterogeneous data sources

` Relational databases, flat-files, on-line transaction records

`

Data cleaning and data integration techniques are

g

g

q

applied

` Ensure consistency in naming conventions, encoding ` Ensure consistency in naming conventions, encoding

structures, attribute measures, etc. among different data sources

sources

` E.g., Hotel price: currency, tax, breakfast covered, etc.

Wh d i d h h i i d

` When data is moved to the warehouse, it is converted.

Time Variant

Time Variant

`

The time horizon for the data warehouse is significantly

longer than that of operational systems.

` Operational database: current value data. ` Operational database: current value data.

` Data warehouse data: provide information from a historical

perspective (e g past 5-10 years) perspective (e.g., past 5-10 years)

`

Every key structure in the data warehouse

` Contains an element of time, explicitly or implicitly

` But the key of operational data may or may not contain “time

(3)

Non volatile

Non-volatile

`

A

physically separate store

p y

y p

of data transformed from the

operational environment.

O

i

l

d

f d

d

i h d

`

Operational

update of data does not occur

in the data

warehouse environment.

` Does not require transaction processing, recovery, and

concurrency control mechanismsy

` Requires only two operations in data accessing:

i i i l l di f d d f d

` initial loading of dataand access of data.

9

Data Warehouse vs Heterogeneous DBMS

Data Warehouse vs. Heterogeneous DBMS

` Traditional heterogeneous DB integration:

` Build wrappers/mediatorson top of multiple, heterogeneous databases. Ex: IBM Data Joiner, Informix DataBlade

Q d i h

` Query driven approach:

` When a query is posed to a client site, a metadata-dictionary is used

to translate the query into queries appropriate for the individual to translate the query into queries appropriate for the individual heterogeneous sites involved. There queries are then mapped and sent to local query processors. The results returned from the different

d l b l

sites are integrated into a global answer set.

` Complex information filtering and integration processes, compete for

resources resources.

` Inefficient and potentially expensive for frequent queries, especially for

queries requireing aggregations.

q q g gg g

10

Data Warehouse vs Heterogeneous DBMS (2)

Data Warehouse vs. Heterogeneous DBMS (2)

` Using DW Æ update-driven approach

` Information from multiple, heterogeneous sources is integrated in advance and stored in a warehouse for direct querying and analysis.

` Unlike OLTP DW do not contain the most current information ` Unlike OLTP, DW do not contain the most current information. ` DW brings high performance to the integrated heterogeneous

DB system since data are copied preprocessed integrated DB system since data are copied, preprocessed, integrated, annotated, summarized, and restructured into one data store.

` Query processing in DW does not interfere with the processing ` Query processing in DW does not interfere with the processing

at local sources

` DW can store and integrate historical information and support g pp

complex multidimensional queries.

DW vs ODB

DW vs. ODB

`

Major task of ODB Æ OLTP:

` Day-to-day operations: purchasing, inventory, banking,

manufacturing, payroll, registration, accounting, etc.

DW f d l i d d i i ki Æ OLAP

` DW serve for data analysis and decision making Æ OLAP `

Distinct Features (OLTP vs. OLAP)

U d i i k

` User and system orientation: customer vs. market

` Data contents: current, detailed vs. historical, consolidated ` Database design: ER + application vs star + subject

` Database design: ER + application vs. star + subject ` View: current, local vs. evolutionary, integrated

` Access patterns: update vs. read-only but complex queries ` Access patterns: update vs. read only but complex queries

(4)

OLTP vs OLAP

OLTP vs OLAP

OLTP OLAP

users Clerk IT professional Knowledge worker users Clerk, IT professional Knowledge worker function day to day operations decision support DB design application-oriented subject-oriented

data current, up-to-date

detailed, flat relational isolated

historical,

summarized, multidimensional integrated, consolidated

usage repetitive ad-hoc

access read/write

index/hash on prim. key

lots of scans unit of work short, simple transaction complex query

# records accessed tens millions

#users thousands hundreds

#users thousands hundreds

DB size 100MB-GB 100GB-TB

metric transaction throughput query throughput, response

13

Why Separate DW?

Why Separate DW?

`

High performance for both systems:

g p

y

` DBMS — tuned for OLTP: access methods, indexing,

concurrency control, recovery

` Warehouse — tuned for OLAP: complex OLAP queries,

computation of large groups of data at summarized levels, multidimensional view, consolidation.,

`

Processing OLAP queries in operational databases would

degrade the performance of operational tasks.

`

In ODB, concurrency control and recovery mechanisms

(locking, logging) are required to ensure the consistency

d b

f

i

and robustness of transactions.

`

OLAP Æ read only access. No need for concurrency

control and recovery

control and recovery.

14

Why Separate DW? (2)

Why Separate DW? (2)

`

Different functions and different data:

` missing data: Decision support requires historical data which

operational DBs do not typically maintain. So, data in ODB is usually yfar from complete p for decision making. g

` data consolidation: DS requires consolidation (aggregation,

summarization) of data from heterogeneous sources. ODB

t i d t il d d t (t ti ) hi h d t b

contain detailed raw data (transactions) which need to be consolidated before analysis.

` data quality: q y different sources typically use inconsistent data yp y

representations, codes and formats which have to be reconciled.

Conceptual Modeling of DW

Conceptual Modeling of DW

`

Data Cube:

` see TSBD Lecture Notes on Visualization of Data Cubes

`

M d li d t

h

di

i

&

t

`

Modeling data warehouses: dimensions & measurements

` Star schema: A single object (fact table) in the middle connected

to a number of objects (dimension tables one for each

to a number of objects (dimension tables, one for each dimension).

` Snowflake schema: A refinement of star schema where the

dimensional hierarchy is represented explicitly by normalizing the dimension tables.

` Fact constellations: Multiple fact tables share dimension tables.

(5)

Example of Star Schema

Example of Star Schema

Date Product

Day Month Year

Sales Fact Table Date ProductNo ProdName ProdDesc C Date Product Store Category QOH Store CustId C tN Cust Store Customer StoreID City State CustName CustCity CustCountry unit_sales dollar_sales State Country Region Yen_sales Measurements `

Potensi Redundansi

` Bandung, Bogor keduanya

17

ada di Jawa Barat

Snowflake Schema

Snowflake Schema

Product Year

Day

Date Sales Fact Table ProductNo ProdName ProdDesc Month Year Month Year Day Month Date Product ProdDesc Category QOH Year CustId Cust Store Customer City StoreID Cit Store CustId CustName CustCity CustCountry unit_sales dollar sales City State State State City CustCountry _ Yen_sales Country Region Country State Country 18 Measurements Region

View of Warehouses and Hierarchies

View of Warehouses and Hierarchies

` Importing data ` Table Browsing ` Dimension creation ` Dimension browsing ` Cube buildingg ` Cube browsing

Data Cube

Data Cube

Total annual sales

D t Total annual sales

of TV in U.S.A. Date sum TV PC 1Qtr 2Qtr 3Qtr 4Qtr U S A ry sum VCR PC U.S.A Canada Count Ca ada Mexico sum

(6)

Data Cube

Data Cube

`

Visualization

`

OLAP capabilities

21

p

`

Interactive manipulation

Typical OLAP Operations

Typical OLAP Operations

` Roll up (drill-up): summarize data

` by climbing up hierarchy or by dimension reduction ` by climbing up hierarchy or by dimension reduction

` Drill down (roll down): reverse of roll-up

` from higher level summary to lower level summary or detailed data or ` from higher level summary to lower level summary or detailed data, or

introducing new dimensions

` Slice and dice:

` project and select

` Pivot (rotate):

` reorient the cube, visualization, 3D to series of 2D planes.

` Other operations

` d ill i l i ( ) th f t t bl

` drill across: involving (across) more than one fact table.

` drill through: through the bottom level to its back-end relational tables.

` More info: ` More info:

` www.knowledgecenters.org, www.olapreport.com, www.olapcouncil.org

22

Data Mart

Data Mart

`

DW collects information about subjects that span the

entire organization, such as customers, products, sales, assets,

and personnel. Its scope is

enterprise-wide

.

`

For DW,

fact constellation

schema is commonly used

since it can model multiple, interrelated subjects.

`

Data Mart is a subset of a DW, focuses on a particular

subject. Its scope is

department-wide

. Typically, a data mart

f l b

( k

consisting of a single subject area (e.g. marketing,

operations).

`

For Data Mart,

star or snowflake schema

are commonly

used since both are geared towards modeling single

bj t lth

h th t h

i

l

subjects, although the star schema is more popular.

Data Mart

Data Mart

`

A data mart can be either

dependent

or

independent

.

`

A

dependent data mart

is a subset that is created directly

from the DW.

` Consistent data model ` Providing quality data

` DW must be constructed first

` Ensures that the user viewing the same version of the data that

d b ll h d h

are accessed by all other data warehouse users

`

An

independent data mart

is a small warehouse designed

f

d

d i

i

EDW

(7)

Data Warehousing Process Overview

Data Warehousing Process Overview

25

Data Warehousing Process Overview

Data Warehousing Process Overview

`

The major components of a data warehousing process

` Data sources

` Legacy systems, external data providers (e.g. BPS), OLTP,

ERP Systems ` Data extraction ` Data loading ` Comprehensive database ` Metadata ` Middleware tools 26

Data Warehousing Architectures

(8)

Data Warehousing Architectures

Data Warehousing Architectures

29

Data Warehousing Architectures

Data Warehousing Architectures

30

Data Integration and the ETL Process

Data Integration and the ETL Process

`

Various integration technologies:

` Enterprise Application Integration (EAI)

` A technology that provides a vehicle for pushing data from source

t i t d t h systems into a data warehouse

` Integrating application functionality and is focused on sharing

functionality across systems

` Traditionally, API. Nowadays, SOA (web services).

` Enterprise Information Integration (EII)

` An evolving tool space that promises real-time data integration from

a variety of sources, such as relational databases, Web services, and multidimensional databases

` A mechanism for pulling data from source systems to satisfy a request

for information.

Data Integration and the ETL Process

Data Integration and the ETL Process

` ETL

` 60-70% of the time in a data-centric project.

` Extraction: Reading data from one or more databases

` Transformation

` Transformation

` Converting the extracted data from its previous form into the form in

which it needs to be so that it can be placed into a DW ` Load

` Putting the

d

data into the DW

(9)

Data Warehouse Development

Data Warehouse Development

`

Direct benefits

` Allowing end users to perform extensive analysis in numerous

ways

A f ( f

` A consolidated view of corporate data (i.e a single version of

the truth)

` Better and more timely information ` Better and more timely information

` Enhanced system performance. DW frees production

processing because some operational system reporting processing because some operational system reporting requirements are moved to DSS

` Simplification of data access

33

Data Warehouse Development

Data Warehouse Development

`

Some best practices for implementing a DW (Weir, 2002):

` Project must fit with corporate strategy and business objectives ` There must be complete buy-in to the project by executives,

managers and users managers, and users

` It is important to manage user expectations about the completed

project

` The data warehouse must be built incrementally ` Build in adaptability

M d b b h IT d b i f i l

` Managed by both IT and business professionals ` Develop a business/supplier relationship

` O l l d d t th t h b l d d f lit

` Only load data that have been cleansed and are of a quality

understood by the organization

` Do not overlook training requirements ` Do not overlook training requirements `34 Be politically aware

Data Warehouse Vendors

Data Warehouse Vendors

`

Computer Associates

`

Microsoft

`

DataMirror

`

Data Advantage Group

`

Oracle

`

SAS

g

p

`

Dell Computer

`

Embarcadero Technologies

`

Siemens

`

Sybase

`

Embarcadero Technologies

`

Business Objects

`

HP

`

Sybase

`

Teradata

`

Please visit:

`

HP

`

Hummingbird

H

`

Please visit:

` Data Warehousing Institute

(tdwi com)

`

Hyperion

`

IBM

(tdwi.com)

` DM Review (dmreview.com)

Data Warehouse Vendors

Data Warehouse Vendors

`

Six guidelines to considered when developing a

g

p g

vendor list:

1

Financial strength

1.

Financial strength

2.

ERP linkages

Q lifi d l

3.

Qualified consultants

4.

Market share

5.

Industry experience

6.

Established partnerships

p

p

(10)

Real time DW

Real-time DW

`

Traditionally, updated on a weekly basis.

`

Unsuitable for some businesses.

`

Real-time (active) data warehousing

(

)

g

The process of loading and providing data via a data

warehouse as they become available

y

`

Levels of data warehouses:

1. Reports what happened

1. Reports what happened

2. Some analysis occurs

3. Provides prediction capabilities,p p , 4. Operationalization

5. Becomes capable of making events happenp g pp

37

Real time DW

Real-time DW

38

Real time DW

Real-time DW

From DW to DM [JH]

From DW to DM [JH]

`

Three kinds of data warehouse applications

` Information processing

` supports querying, basic statistical analysis, and reporting

using crosstabs, tables, charts and graphs

` Analytical processing

` multidimensional analysis of data warehouse data

` supports basic OLAP operations, slice-dice, drilling, pivoting

` Data mining

` knowledge discovery from hidden patterns

` supports associations, constructing analytical models,

performing classification and prediction, and presenting the i i lt i i li ti t l

(11)

References

References

`

[JH] Jiawei Han and Micheline Kamber, Data Mining:

Concepts and Techniques, Morgan Kaufmann, 2001.

`

[ET] Efraim Turban et al., Decision Support and Business

Intelligence Systems, Pearson, 2007.

`

[DO] David Olson and Yong Shi, Introduction to Business

Data Mining, McGraw-Hill, 2007.

References

Related documents

Section 2 (d) (Nature of the Business) - The following sentence shall be modified to read as follows: "The Applicant will discourage the illegal public consumption of alcohol

Why did the international community decide to withdraw United Nations peacekeeping troops from Rwanda during the 1994 genocide? Analysis of newly released documents and results from

property, it is often the desire of the shareholders to transfer the leased ground (and perhaps some equipment) to the newly created corporation with the intention that the departing

Short (Division of Entomology, Biodiversity Institute and Department of Ecology and Evolutionary Biology, University of Kansas, Lawrence, KS, U.S.A) — Martin Fikáček (Department

Table 5.22: Correlation between IAUGC 60 measurement and clinical markers .... Table 5.23: Difference in mean multiparametric values according to clinical features ... 245 Table

The activity of seedingand weeding were not done by men farmer because both activities still could be done by women farmer.The findings at Toba Samosir Regency

The focus group facilitators asked if the participants felt more loyal to the manufacturers who provided value-added programs and essential services. Essential services were defined

contested RCD has no individual character, as it produces the same overall impression on the informed user as a number of identically or similarly shaped umbrellas that have been