Data warehouses. Data Mining. Abraham Otero. Data Mining. Agenda

18 

Full text

(1)

1/36

Abraham Otero Data Mining

Data warehouses

Data Mining

Abraham Otero

Agenda

Why do I need a data warehouse?

Data warehouse architecture

ETL systems

Real-Time Data Warehousing

Open problems

(2)

3/36

Abraham Otero Data Mining

Why do I need a data warehouse?

Why do I need a data warehouse?

Maybe you do not need it…

If the volume of data is small and the data is static, a file can be enough.

If we are going to work on multiple data sources, new data arrives continuously and/or the volume of data is very high, in the long term a data warehouse will save us time.

OK, but wouldnd’t it be enough with the database

of the company?

Usually not.

Operational requirements differ greatly from the analytical ones.

Why do I need a data warehouse?

Case Study

An international company wants to identify which products are selling best and worst in each country where it operates in order to refine their marketing campaigns within each country.

Do they have all the information they need in their

databases?

PRODUCT ... STRORE ... SALE ... COUNTRY ... Database EMPLOYEE ... OFFICE ... DEPARTMENT ...

(3)

5/36

Abraham Otero Data Mining

Why do I need a data warehouse?

No

Required Information PRODUCT ... STRORE ... SALE ... COUNTRY ... Database EMPLOYEE ... OFFICE ... DEPARTMENT ... Geographical data Census Climate …

Why do I need a data warehouse?

On the other hand, OLTP and OLAP systems

have completely different purposes…

…which translates into different requirements and therefore a different design.

OLTP, On-Line Transactional Processing

Must meet the operational requirements of the

company. It supports the operation of the organization applications.

OLAP, On-Line Analytical Processing

Supports analytical processes that try to help in decision making processes.

Typically, companies do not invest in them until they have all their operational requirements satisfied.

(4)

7/36

Abraham Otero Data Mining

Why do I need a data warehouse?

OLTP Systems Data warehouses •Support operational requirements

•Current data •Dynamic data

•Response time is small •It serves many users •Large Size

•Contain data of the organization •SQL

•Read and write operations •Transactional operations

•Support analytical requirements •Historical data

•Static data (only increases)

•Response time is large (killer queries) •Serve few users

•Larger size

•Contain data relating to the organization and other sources

•SQL and custom tools •Read operations

•Non-transactional operations

Why do I need a data warehouse?

On average, the construction and initial load of a

data warehouse are 50% of the work of the data

mining process .

Do not underestimate the time needed for this task.

This task is very important; if the data quality is

low, no matter how good the data mining

(5)

9/36

Abraham Otero Data Mining

Agenda

Why do I need a data warehouse?

Data warehouse architecture

ETL systems

Real-Time Data Warehousing

Open problems

Data warehouse architecture

Data are often organized into "facts", or

“instances".

“The client X on February 20, 2008 bought the products P1, P2, and P3 at the store T"

"On May 25 the temperature was 78 °, there was 75% humidity, and wind, and the game was not played" "The length of the sepal is 5.1 cm, with a width of 3.5 cm; the petal length is 1.4 cm, with a width and 0.2 cm"

(6)

11/36

Abraham Otero Data Mining

Data warehouse architecture

The wheather problem

Data warehouse architecture

The most famous data mining test set:

(7)

13/36

Abraham Otero Data Mining

Data warehouse architecture

The iris data:

Data warehouse architecture

It is often necessary to transform the data to

organize it in this way.

How could we learn the relationship "sister" from this data?

(8)

15/36

Abraham Otero Data Mining

Data warehouse architecture

One possible

representation would

be to list all possible

pairs indicating whether

or not they fulfill the

relationship:

Data warehouse architecture

Under the " closed world

assumption " the table can

be compressed:

The closed world assumption considers that all the cases not listed are negative.

However, from this table we

cannot learn anything that

allows us to predict whether

or not two people are

sisters.

(9)

17/36

Abraham Otero Data Mining

Data warehouse architecture

The following table contains all the information we

need, expressed as facts:

The knowledge we need

Data warehouse architecture

The examples we have seen so far are easy.

For them, a plain text file would be enough (we do not need a data warehouse).

For more complex examples, the data warehouse

is advisable.

How do we organize the information for the data

warehouse?

Also as facts or instances, but with more complex attributes that define various dimensions of the fact. The dimensions have an internal hierarchical structure that defines different levels of aggregation.

(10)

19/36

Abraham Otero Data Mining

Data warehouse architecture

Hierarchy of different levels of aggregation:

Data warehouse architecture

Star schema

SALE Amount # of items Client Item Store Time ITEM Wholesale Price Range CITY Name State Country # of inhabitants Climate WHOLESALE Name Country City Valuation YEAR STORE City Address Info regarding the area. DAY Months Week Name HOUR Date morning/afternoon Holiday/Work day MONTH Quarter Name QUARTER Year Name COUNTRY Name Country # of inhabitants Climate WORLD REGION Name # of inhabitants Climate

"It is aggregated on"

Location dimension Item dimension Time dimension

(11)

21/36

Abraham Otero Data Mining

STATE Name Country # of inhabitants Climate SALE Amount # of items Client Item Store Time ITEM Wholesale Price Range RANGE Category CITY Name State Country # of inhabitants Climate WHOLESALE Name Country City Valuation YEAR STORE City Address Info regarding the area. DAY Months Week Name HOUR Date morning/afternoon Holiday/Work day MONTH Quarter Name QUARTER Year Name WEEK Year COUNTRY Name Country # of inhabitants Climate WORLD REGION Name # of inhabitants Climate

"It is aggregated on"

Location dimension Item dimension Time dimension

Data warehouse architecture

Snowflake schema

Data warehouse architecture

Is it possible to collect all the information into a

single star or snowflake?

No, more than one are usually needed.

Each of the schemes is often called DataMart.

Usually we shall have one for every different aspect of the organization that we want to explore.

Sales Staff CAMPAIGN Time Time Time Item Location Project Team Product Supplier Location

(12)

23/36

Abraham Otero Data Mining

Agenda

Why do I need a data warehouse?

Data warehouse architecture

ETL systems

Real-Time Data Warehousing

Open problems

ETL systems

The ETL systems (extraction,

transformation and load) have to be built by the data warehouse team. Its implementation is highly dependent on the application.

(13)

25/36

Abraham Otero Data Mining

ETL systems

There are certain patterns in the pre-processing of

data before data mining:

Integration and cleansing of data. Transformation of attributes. Numerization and discretization. ...

This will be discussed in a separate section

because they should be used whether we are

using a data warehouse or not.

Agenda

Why do I need a data warehouse?

Data warehouse architecture

ETL systems

Real-Time Data Warehousing

(14)

27/36

Abraham Otero

Real-Time Data Warehousing

Real-time (active) data warehousing: the process

of loading and providing data via a data

warehouse as they become available

Levels of data warehouses:

1. Reports what happened 2. Some analysis occurs

3. Provides prediction capabilities 4. Operationalization

5. Becomes capable of making events happen

Data Mining

(15)

29/36

Abraham Otero

Real-Time Data Warehousing

The need for real-time data

A business often cannot afford to wait a whole day for its operational data to load into the data warehouse for analysis

Provides incremental real-time data showing every state change and almost analogous patterns over time

Maintaining metadata in sync is possible

Reduce or eliminate the nightly batch processes

Data Mining

Agenda

Why do I need a data warehouse?

Data warehouse architecture

ETL systems

Real-Time Data Warehousing

Open problems

(16)

31/36

Abraham Otero Data Mining

Open problems

S Rizzi, Open problems in data warehousing: 8 years later. 5th International Workshop on Design and Management of Data Warehouses. 2003 (Keynote).

S. Rizzi, A. Abelló, J. Lechtenbörger, J. Trujillo. Research in data warehouse modeling and design: dead or alive?. 9th ACM international workshop on Data warehousing and OLAP, pp 3-10. 2006.

Widom, J. Research problems in data warehousing. International Conference on Information and Knowledge Management (CIKM95), ACM Press. 1995.

Dinter, B., Sapia, C. Hölfing, G., Blaschka, M. OLAP market and research: initiating the cooperation. Journal of Computer Science and Information Management, 2(3), 1999.

Open problems

Data warehousing conferences:

International Workshop on Data Warehousing and

OLAP. (DOLAP)

International Conference on Data Warehousing

and Knowledege Discovery. (DaWaK)

International Workshop on Data Warehouse and

Data Mining. (DWDM)

(17)

33/36

Abraham Otero Data Mining

Open problems

Journals:

International Journal of Data Warehousing and

Mining

.

Data and Knowledge Engineering

.

Information Sciences

.

Open problems

Hot topics

How to integrate data arising from multiple sources. Queries: language optimization, processing.

Consistency and quality

Data Warehouse design: conceptual models, design methodologies.

ETL loading and recovery of failures during loading. Planning loads and refreshments.

Maintenance of Data Warehouse. Data cleaning and preprocessing

(18)

35/36

Abraham Otero Data Mining

Bibliography

William H. Inmon. Building the Data Warehouse. John Wiley and Sons, 2005. ISBN 0764599445,

9780764599446.

A. Sen, AP. Sinha. A comparison of data warehousing methodologies. Communications of the ACM archive, Volume 48 , Issue 3, Pages: 79 - 84. 2005.

J. Van den Berg. Integral Warehouse Management: The Next Generation in Transparency, Collaboration and

Warehouse Management Systems. Management Outlook, 2007. ISBN: 1419668765.

Jiawei Han y Micheline Kamber (2005). Data Mining, Second Edition, Second Edition : Concepts and Techniques. The Morgan Kaufmann Series in Data Management Systems.

Bibliography

Inmon, W.H. et al. "Managing the Data Warehouse", John Wiley, 1997

Inmon, W.H. et al. "Data Warehouse Performance", John Wiley, 1999

Kimball, R. "The Data Warehouse Toolkit", John Wiley, 1996

Kimball, R et al. "The Data Warehouse Lifecycle Toolkit", John Wiley, 1998

Giovinazzo, W. "Object-Oriented Data Warehouse Design", Prentice-Hall, 2000.

Jarke, M. et al. "Fundamentals of Data Warehouses", Springer, 2000.

Figure

Updating...

References

Updating...

Related subjects :