• No results found

Knowledge Discovery in Databases. Databases. date name surname street city account no. payment balance

N/A
N/A
Protected

Academic year: 2021

Share "Knowledge Discovery in Databases. Databases. date name surname street city account no. payment balance"

Copied!
12
0
0

Loading.... (view fulltext now)

Full text

(1)

Databases

Flat file client transaction account Relational database

Querying: QBE vs. SQL

date name surname street city account no. payment balance

980103 Jan Novak Dlouha 5 Praha 1 9945371 100.00 100.00 980105 Jan Novak Dlouha 5 Praha 1 9945371 1500.00 1600.00 980106 Jan Novak Dlouha 5 Praha 1 9945371 -1550.00 50.00 980106 Karel Nemec Podolska 4 Praha 2 24867134 3000.00 6000.00 980107 Karel Nemec Podolska 4 Praha 2 24867134 -4000.00 2000.00 980108 Jan Novak Dlouha 5 Praha 1 9945371 -150.00 -100.00 980111 Karel Nemec Podolska 4 Praha 2 24867134 5000.00 7000.00 . . . id_client name surname street city . . . id_transaction id_account date payment balance . . . id_account id_client . . .

SELECT client.name, client.surname, client.street, client.city, account. id_client, transaction.balance

FROM client, account, transaction

WHERE client.id_client = account.id_account;

AND transaction.id_account = account.id_account; AND transaction.balance < 100;

(2)

Decision support using databases

1. Executive Information Systems

managerial information systems designed for

fast access to information

user friendly interface but less flexible

2. On-Line Analytical Processing

multidimensional concept of data storage and

manipulation (DATA CUBE),

intuitive data manipulation,

work with data collected from heterogeneous

data sources – data conversion is necessary,

use of analytical methods – statistical

summaries, what-if analysis,

Client/Server architecture,

support for multi-user access,

OLAP results stored separately from the

source data,

dynamic manipulation with sparse matrices,

missing values processing,

unlimited number of dimensions and

aggregation levels.

(3)

Database structure

date product city sales

10.1. screws Praha 241 10.1. nuts Praha 61 10.1. screws Brno 17 10.1. nails Brno 42 10.2. screws Praha 92 10.2. nails Praha 27 10.2. screws Kladno 35 Table SALES

Praha Brno Kladno

screws nuts nails screws nuts nails screws nuts nails

10.1. 241 61 17 42

10.2. 92 27 35

Sparse matrix

product date region

city sales

(4)

Data manipulation:

slice and dice

roll up vs. drill down

Microsoft Data Analyzer

sales aggregation cities aggregation regions aggregation products

(5)

Implementation:

hypercube

multicube

„true“ OLAP vs. ROLAP

MOLAP vs. ROLAP

physical implementation of the system:

star schema,

snowflake schema.

SQL engine OLAP engine summarized data granular data User interface ROLAP MOLAP

(6)

dimension store fact table dimension time

dimension product

Star

dimension store

fact table region fact table district

fact table store

Snowflake

STORE KEY PRODUCT KEY PERIOD KEY cena počet STORE KEY data o prodejně město ID okresu data o okresu ID regionu data o regionu úroveň (level) PRODUCT KEY data o produktu značka výrobce úroveň (level) PERIOD KEY data o období rok čtvrtletí měsíc den STORE KEY PRODUCT KEY PERIOD KEY cena počet STORE KEY data o prodejně město ID okresu data o okresu ID regionu data o regionu

úroveň (level) ID okresu

PRODUCT KEY PERIOD KEY cena počet ID regionu PRODUCT KEY PERIOD KEY cena počet ID okresu data o okresu ID regionu ID regionu data o regionu

(7)

OLAP functionality „reachable“ by classic

technologies

Microsoft Access

(8)

3. Data warehouse

subject oriented,

integrated,

time variant,

nonvolatile data repository used for decision

support

ddddd Data

současná detailní data

starší detailní data m e t a d a t a středně sumarizovaná data silně sumarizovaná data 1. vrstva Data Mart Data Warehouse 2. vrstva 3. vrstva produkční databáze

(9)

4. Business Intelligence

computerized tools and techniques used to

collect, integrate, analyze, interpret and

present (business) data and information. 4 main

components:

data warehouse

business analytics (querying, reporting,

statistical analyses, data mining)

business performance management

user interface (presentations)

(10)

Databases meet Data Mining

1. Query languages for KDD

Mine Rule

(Boulicaut, 1998) - association rules

MSQL

(Imielinski, Virmani, 1999) – association

rules, records

MSQL - rules

MSQL - exceptions

MINE RULE Priklad AS

SELECT DISTINCT 1..n produkt AS BODY, 1..1 produkt AS HEAD, SUPPORT, CONFIDENCE

FROM Prodej

WHERE BODY.město = HEAD.město AND BODY.datum = HEAD.datum

EXTRACTING RULES WITH SUPPORT: 0.1, CONFIDENCE: 0.5

Emp(Id,Age,Sex,Salary,Position,Car)

GetRules (Emp) into R

where support > 0.1 and confidence > 0.9

SelectRules (R)

where body has {Age=*), (Sex=*)} and body is {(Car=*)}

Select * from Emp

where violates all (GetRules (Emp) where body is {(Age=*)}

and head is {(Salary=*)} and confidence > 0.3)

(11)

DMQL

(Han et al., 1996) – different rules

DMQL – association rules DMQL – classification rules DMQL – discrimination rules

2. API standards

SQL/MM Data Mining

OLE DB for Data Mining

Find classification rules for computer_science_students according to average_grading

related to birth_place, address from student

where major = „computer_science“ and birth_place = „Canada“

Find association rules

related to average_grading, birth_place, address from student

where major = „computer_science“ and birth_place = „Canada“ with support threshold = 0.05

with confidence threshold = 0.7

Find discriminant rule

for cs_grads with status = „graduate“

in contrast to cs_undergrads with status = „undergraduate“ related to average_grading, birth_place, address

from student

where major = „computer_science“ and birth_place = „Canada“

(12)

3. Extending databse systems with data

mining („in-database“ data mining)

MicroSoft SQL Server 2005

- Decision trees, association rules, naive bayes

classifier, neuronal nets, text mining, sequence

clustering, time series

- uses OLE DB DM and

PMML to describe

tasks and models) and

BI

Development

studio as interface

Oracle Data Mining

tools for classification, regression, anomaly

detection, association mining, clustering, feature

extraction, attribute ranking

References

Related documents