Databases
Flat file client transaction account Relational databaseQuerying: QBE vs. SQL
date name surname street city account no. payment balance
980103 Jan Novak Dlouha 5 Praha 1 9945371 100.00 100.00 980105 Jan Novak Dlouha 5 Praha 1 9945371 1500.00 1600.00 980106 Jan Novak Dlouha 5 Praha 1 9945371 -1550.00 50.00 980106 Karel Nemec Podolska 4 Praha 2 24867134 3000.00 6000.00 980107 Karel Nemec Podolska 4 Praha 2 24867134 -4000.00 2000.00 980108 Jan Novak Dlouha 5 Praha 1 9945371 -150.00 -100.00 980111 Karel Nemec Podolska 4 Praha 2 24867134 5000.00 7000.00 . . . id_client name surname street city . . . id_transaction id_account date payment balance . . . id_account id_client . . .
SELECT client.name, client.surname, client.street, client.city, account. id_client, transaction.balance
FROM client, account, transaction
WHERE client.id_client = account.id_account;
AND transaction.id_account = account.id_account; AND transaction.balance < 100;
Decision support using databases
1. Executive Information Systems
managerial information systems designed for
fast access to information
user friendly interface but less flexible
2. On-Line Analytical Processing
multidimensional concept of data storage and
manipulation (DATA CUBE),
intuitive data manipulation,
work with data collected from heterogeneous
data sources – data conversion is necessary,
use of analytical methods – statistical
summaries, what-if analysis,
Client/Server architecture,
support for multi-user access,
OLAP results stored separately from the
source data,
dynamic manipulation with sparse matrices,
missing values processing,
unlimited number of dimensions and
aggregation levels.
Database structure
date product city sales
10.1. screws Praha 241 10.1. nuts Praha 61 10.1. screws Brno 17 10.1. nails Brno 42 10.2. screws Praha 92 10.2. nails Praha 27 10.2. screws Kladno 35 Table SALES
Praha Brno Kladno
screws nuts nails screws nuts nails screws nuts nails
10.1. 241 61 17 42
10.2. 92 27 35
Sparse matrix
product date region
city sales
Data manipulation:
slice and dice
roll up vs. drill down
Microsoft Data Analyzer
sales aggregation cities aggregation regions aggregation products
Implementation:
hypercube
multicube
„true“ OLAP vs. ROLAP
MOLAP vs. ROLAP
physical implementation of the system:
star schema,
snowflake schema.
SQL engine OLAP engine summarized data granular data User interface ROLAP MOLAPdimension store fact table dimension time
dimension product
Star
dimension store
fact table region fact table district
fact table store
Snowflake
STORE KEY PRODUCT KEY PERIOD KEY cena počet STORE KEY data o prodejně město ID okresu data o okresu ID regionu data o regionu úroveň (level) PRODUCT KEY data o produktu značka výrobce úroveň (level) PERIOD KEY data o období rok čtvrtletí měsíc den STORE KEY PRODUCT KEY PERIOD KEY cena počet STORE KEY data o prodejně město ID okresu data o okresu ID regionu data o regionuúroveň (level) ID okresu
PRODUCT KEY PERIOD KEY cena počet ID regionu PRODUCT KEY PERIOD KEY cena počet ID okresu data o okresu ID regionu ID regionu data o regionu
OLAP functionality „reachable“ by classic
technologies
Microsoft Access
3. Data warehouse
subject oriented,
integrated,
time variant,
nonvolatile data repository used for decision
support
ddddd Data
současná detailní data
starší detailní data m e t a d a t a středně sumarizovaná data silně sumarizovaná data 1. vrstva Data Mart Data Warehouse 2. vrstva 3. vrstva produkční databáze
4. Business Intelligence
computerized tools and techniques used to
collect, integrate, analyze, interpret and
present (business) data and information. 4 main
components:
data warehouse
business analytics (querying, reporting,
statistical analyses, data mining)
business performance management
user interface (presentations)
Databases meet Data Mining
1. Query languages for KDD
Mine Rule
(Boulicaut, 1998) - association rules
MSQL
(Imielinski, Virmani, 1999) – association
rules, records
MSQL - rules
MSQL - exceptions
MINE RULE Priklad AS
SELECT DISTINCT 1..n produkt AS BODY, 1..1 produkt AS HEAD, SUPPORT, CONFIDENCE
FROM Prodej
WHERE BODY.město = HEAD.město AND BODY.datum = HEAD.datum
EXTRACTING RULES WITH SUPPORT: 0.1, CONFIDENCE: 0.5
Emp(Id,Age,Sex,Salary,Position,Car)
GetRules (Emp) into R
where support > 0.1 and confidence > 0.9
SelectRules (R)
where body has {Age=*), (Sex=*)} and body is {(Car=*)}
Select * from Emp
where violates all (GetRules (Emp) where body is {(Age=*)}
and head is {(Salary=*)} and confidence > 0.3)
DMQL
(Han et al., 1996) – different rules
DMQL – association rules DMQL – classification rules DMQL – discrimination rules2. API standards
SQL/MM Data Mining
OLE DB for Data Mining
Find classification rules for computer_science_students according to average_grading
related to birth_place, address from student
where major = „computer_science“ and birth_place = „Canada“
Find association rules
related to average_grading, birth_place, address from student
where major = „computer_science“ and birth_place = „Canada“ with support threshold = 0.05
with confidence threshold = 0.7
Find discriminant rule
for cs_grads with status = „graduate“
in contrast to cs_undergrads with status = „undergraduate“ related to average_grading, birth_place, address
from student
where major = „computer_science“ and birth_place = „Canada“