• No results found

Two-Phase Data Warehouse Optimized for Data Mining

N/A
N/A
Protected

Academic year: 2021

Share "Two-Phase Data Warehouse Optimized for Data Mining"

Copied!
20
0
0

Loading.... (view fulltext now)

Full text

(1)

Balázs Rácz Csaba István Sidló András Lukács András A. Benczúr

Data Mining and Web Search Research Group Computer and Automation Research Institute Hungarian Academy of Sciences (MTA SZTAKI)

http://datamining.sztaki.hu/

Two-Phase Data Warehouse

Optimized for Data Mining

(2)

Outline

Demands and challenges

Related data warehouse techniques

The two-phase architecture

The second phase component

Data model and data mining framework

(3)

Data storage and management

• long term storage

Data manipulation, statistical queries Data mining platform

• custom tasks

• quick development/reconfiguration

DM visualization specialized to user needs Data analysis know-how

• e.g. web/telco usage, churn, user groups, …

Basic DBMS functionalities • long range • changing • high dimensional data sources patterns knowledge

BI and knowledge discovery

(4)

Demands in data processing

Long term storage

• growing data volume

• cost of storage

Proper coupling between data warehouses

and data mining tools

• optimized data access

• effectively implemented data mining tools

Data mining query language

• help and guide the knowledge discovery

process

• flexibility, fast deployment

(5)

Related data warehouse techniques

Read/cost optimized databases

• column-oriented databases (C-store), column wise

compression

• nearline data warehouses (Sybase IQ, Sand/DNA)

Coupling with data mining systems

• tight – DM is integrated into the DBMS

• semi-tight – interfaced extension of the SQL

(6)

The two-phase architecture

Streaming data management and data mining toolkit (second phase)

Short-term data cubes Data

sources

Database (first phase)

Long-term disk storage Long-term aggregates, patterns Metadata Compression module Data mining module Data stream query engine ETL

tools

OLAP and other analysis tools built by standard DBMS tools new C++ implementation

(7)

The second phase

Data stream approach

• optimize data mining access to very large data sets

• (block) sequential access, typically full scans of data

• fit for data mining algorithms: frequent itemset mining,

partitioning clustering, Bayes classification, decision trees

Streaming data management and data mining toolkit (second phase)

Short-term data cubes Data

sources

Database (first phase)

Long-term disk storage Long-term aggregates, patterns Metadata Compression module Data mining module Data stream query engine ETL

tools

OLAP and other analysis tools

Effective storage

• compression by columns

• large compression rate

• row-wise storage

• fast read-only access

• a new semantic

(8)

New compression method for log data

0 5 10 15 20 25 0 0,2 0,4 0,6 0,8 1 1,2

compression rate rel. to gzip's

decompression time 0 5 10 15 20 0 0,2 0,4 0,6 0,8 1 1,2

compression time rel. to gzip's

gzip bzip2 dslc 0.00 0.05 0.10 0.15 0.20 0.25 0.30 100 1 000 10 000 100 000 1 000 000 10 000 000

size of weblog (kbyte)

compression rate

dslc

original compr. compr.

size size rate

(GB) (MB) (%) weblog 1.89 57.03 2.95% PIX log 0.92 46.70 4.97% data type Examples of compressions

(9)

Data models

1 4 5 9 0 14 1 2 3 4 5 6 7 8 9 10 11 12 13 1 0 0 1 1 0 0 0 1 0 0 0 0 a row of a binary matrix

sparse matrix data model relational data model

sparse format of the row

n-tuple

(10)

The data model of the second phase

A common generalization of the relational (k=0 or k=n) and the sparse matrix (h for the rows, b for the nonzero columns) data model

k header n-k body

bodies with a same value in header

h b1 h b3 h b2 h b4 h b5 h b1 b2 b3 b4 b5

(11)

DM framework and its query language

Data mining toolkit

• flexible modular architecture

• standardized streaming interface between modules

• pre- and postprocessing, data transforming and mining

• variety of data mining algorithms implemented

• so far 200+ modules (more than 100k lines C++ code)

• new modules can be added

DM query language

• order and configuration of

the necessary modules should be given

• graphical application

(12)

transactional & customer data from DW compression for internal aggregates prefix, private, company users with several numbers, user groups ISDN churn classifier, clustering cluster info refeed as input results refilled into data tables

(13)

A case study: T-Online Hungary

Field of operation: online content provider

Business needs:

– Long-term log storage with access to analysis – Custom reports to management and editors

State before:

– Logs kept on tapes, never read back

– Previous data warehouse projects failed due to data volume

– MOLAP technology failed on dimensionality

(14)

Space requirements of one month web log

1.9 GB Second-phase compressed storage

39.1 GB Compressed DB table

44.9 GB Standard DB table

17.1 GB Compressed (bzip) preprocessed log files

180.7 GB Compressed (bzip) raw log files

Size on disk Storing method

(15)
(16)

Short summary of the two phase

new compression method and data mining framework built by standard DBMS tools

prepared for data mining handling dimension tables

stream data model relational data model

longer time storage (archive) shorter time storage

second phase first phase

Streaming data management and data mining toolkit (second phase)

Short-term data cubes Data

sources

Database (first phase)

Long-term disk storage Long-term aggregates, patterns Metadata Compression module Data mining module Data stream query engine ETL

tools

OLAP and other analysis tools

(17)

Thanks

• Katalin Hum and László Lukács • T-Online Hungary Inc.

• Hungarian National Office for Research and Technology (NKTH)

– T-mining GVOP-3.1.1-2004/-05-0054/3.0 – Data Riddle NKFP-2/0017/2002

• Hungarian Scientific Research Fund (OTKA) – T042706

• Inter-University Center for Telecommunications and Informatics, Budapest (ETIK)

(18)

Compression of PIX router log

Compression of PIX router log

10

3

-10

5

records/sec

Goal: preservation for off-line security

Exemplar PIX log of 100 minutes router activity

– 5.2 million records

– 940 MB in its raw form

Results:

– 46.7 MB compressed log – 4.96% compression ratio – 5 min parsing from text +

(19)

Execution times for reference SQL queries

Execution times for reference SQL queries

select count(distinct USER_ID) from FACT_PAGE_IMPRESSION where DATE_KEY between 20060116 and 20060122

Q3

select count(*) from FACT_PAGE_IMPRESSION

where DATE_KEY between 20060101 and 20060131 and HTTP_STATUS_CODE = 200 Q2

select sum(PAGE_ID) from FACT_PAGE_IMPRESSION where DATE_KEY between 20060101 and 20060131 Q1

(20)

Frequent

References

Related documents

A hybrid statistical model representing both the pose and shape variation of the carpal bones is built, based on a number of 3D CT data sets obtained from different subjects

Resolution 11 – is to empower the directors to offer and grant awards pursuant to the Sembcorp Industries Performance Share Plan 2010 and the Sembcorp Industries Restricted Share

[87] demonstrated the use of time-resolved fluorescence measurements to study the enhanced FRET efficiency and increased fluorescent lifetime of immobi- lized quantum dots on a

Immunoprecipi- tation and Western blot for FGFR3 proteins confirmed the presence of both FGFR3 proteins in the cell lysate, suggesting that this decrease in phosphorylation did

In examining the ways in which nurses access information as a response to these uncertainties (Thompson et al. 2001a) and their perceptions of the information’s usefulness in

As a formal method it allows the user to test their applications reliably based on the SXM method of testing, whilst using a notation which is closer to a programming language.

For the cells sharing a given channel, the antenna pointing angles are first calculated and the azimuth and elevation angles subtended by each cell may be used to derive

Scrooge said to the Ghost, 'Oh, please tell me who that dead man was!' The Ghost took him near his office, but it didn't stop. 'Wait!'