In the last lesso n we lo aded o ur fact tables, assuming that we co uld reso lve all o f the fo reign keys required fo r the dimensio ns.
But what if a key canno t be fo und? Let's use o ur stageFactCustomerCount and factCustomerCount tables to help us wo rk o n this pro blem . First, run yo ur jo b to make sure yo ur tables co ntain data. Take a lo o k at the o utput (so me lines have been o mitted):
OBSERVE:
Starting job ProcessDataWarehouse at 14:24 23/12/2008. Inserting with :
INSERT INTO factCustomerCount (SELECT null, dd.date_key , dc.customer_key , ds.store_ke y , 1, ss.run_id FROM stageFactCustomerCount ss INNER JOIN dimDate dd ON( dd.date = DATE(ss.create_date) ) INNER JOIN dimCustomer dc ON( dc.customer_id = ss.customer_i d AND dc.start_date <= dd.date AND dc.end_date > dd.date ) INNER JOIN dimStore ds O N( ds.store_id = ss.store_id AND ds.start_date <= dd.date AND ds.end_date > dd.date ))
--> 589 rows inserted.
We kno w that 5 89 ro ws were put into factCustomerCount. How many were in stageFactCustomerCount? Run this co mmand against yo ur perso nal database:
CODE TO TYPE:
select count(*) from stageFactCustomerCount; Oh my! It lo o ks like stageFactCustomerCount has 599 rows!
OBSERVE:
mysql> select count(*) from stageFactCustomerCount; +---+
| count(*) | +---+ | 599 | +---+
1 row in set (0.05 sec) mysql>
Our jo in has excluded 10 ro ws. Ho w do we find the missing ro ws?
Debugging tELT MysqlMap
OBSERVE: Output fro m TOS
Starting job ProcessDataWarehouse at 14:24 23/12/2008. Inserting with :
INSERT INTO factCustomerCount (SELECT null, dd.date_key , dc.customer_key , ds.s tore_key , 1, ss.run_id FROM stageFactCustomerCount ss INNER JOIN dimDate dd ON( dd.date = DATE(ss.create_date) ) INNER JOIN dimCustomer dc ON( dc.custom er_id = ss.customer_id AND dc.start_date <= dd.date AND dc.end_date > dd.date ) INNER JOIN dimStore ds ON( ds.store_id = ss.store_id AND ds.start_date <= dd.date AND ds.end_date > dd.date ))
--> 589 rows inserted. Inserting with :
INSERT INTO factSales (SELECT null, dd.date_key , dc.customer_key , IFNULL(dm.mo vie_key , -1), ds.store_key , dst.staff_key , ss.amount , ss.run_id FROM stage FactSales ss INNER JOIN dimDate dd ON( dd.date = DATE(ss.payment_date) ) LEFT OUTER JOIN dimMovie dm ON( dm.film_id = ss.film_id ) INNER JOIN dimCustomer dc ON( dc.customer_id = ss.customer_id AND dc.start_date <= dd.date AND dc.e nd_date > dd.date ) INNER JOIN dimStaff dst ON( dst.staff_id = ss.staff_id AN D dst.start_date <= dd.date AND dst.end_date > dd.date ) INNER JOIN dimStore ds ON( ds.store_id = ss.store_id AND ds.start_date <= dd.date AND ds.end_dat e > dd.date ))
--> 15767 rows inserted.
Job ProcessDataWarehouse ended at 14:24 23/12/2008. [exit code=0]
The o utput includes the INSERT ... SELECT statement used to populate f act Cust o merCo unt . If we get rid o f the INSERT part, and reformat the query slightly, it will look like this:
OBSERVE:
SELECT null, dd.date_key , dc.customer_key , ds.store_key , 1, ss.run_id
FROM stageFactCustomerCount ss INNER JOIN dimDate dd ON( dd.date = DATE(ss.cr eate_date) )
INNER JOIN dimCustomer dc ON( dc.customer_id = ss.customer_id AND dc.start_da te <= dd.date AND dc.end_date > dd.date )
INNER JOIN dimStore ds ON( ds.store_id = ss.store_id AND ds.start_date <= dd. date AND ds.end_date > dd.date )
Because we used inner jo ins in o ur query, no n-matching ro ws are excluded fro m the results. There is ho pe after all. =)
We can change this query so it returns o nly no n-matching ro ws. We do that by co nverting the INNER J OIN to LEFT J OIN and adding a WHERE clause. We also want to add three co lumns to the SELECT part, to help us debug: ss.cust o m e r_id, ss.st o re _id, ss.cre at e _dat e. Run this co mmand against yo ur perso nal
database:
CODE TO TYPE: SELECT
ss.customer_id, ss.store_id, ss.create_date,
dd.date_key , dc.customer_key , ds.store_key , ss.run_id FROM stageFactCustomerCount ss
INNER JOIN dimDate dd ON( dd.date = DATE(ss.create_date) )
LEFT JOIN dimCustomer dc ON( dc.customer_id = ss.customer_id AND dc.start_dat e <= dd.date AND dc.end_date > dd.date )
LEFT JOIN dimStore ds ON( ds.store_id = ss.store_id AND ds.start_date <= dd.d ate AND ds.end_date > dd.date )
WHERE dd.date_key IS NULL OR dc.customer_key IS NULL OR ds.store_key IS NULL
This query will return ro ws fro m stageFactCustomerCount that do not have corresponding rows in either dimDate, dimCustomer o r dimStore. No n-matching ro ws will have a NULL dat e _ke y, cust o m e r_ke y, o r a NULL st o re _ke y.
After yo u run the query, yo u'll see the missing ro ws: OBSERVE:
mysql> SELECT
-> ss.customer_id, ss.store_id, ss.create_date,
-> dd.date_key , dc.customer_key , ds.store_key , ss.run_id -> FROM stageFactCustomerCount ss
-> INNER JOIN dimDate dd ON( dd.date = DATE(ss.create_date) )
-> LEFT JOIN dimCustomer dc ON( dc.customer_id = ss.customer_id AND dc.st art_date <= dd.date AND dc.end_date > dd.date )
-> LEFT JOIN dimStore ds ON( ds.store_id = ss.store_id AND ds.start_date <= dd.date AND ds.end_date > dd.date )
-> WHERE dd.date_key IS NULL OR dc.customer_key IS NULL OR ds.store_key IS N ULL;
+---+---+---+---+---+--- ---+---+
| customer_id | store_id | create_date | date_key | customer_key | store _key | run_id | +---+---+---+---+---+--- ---+---+ | 1 | 1 | 2004-06-26 00:00:00 | 20040626 | NULL | 1 | 76 | | 2 | 1 | 2004-10-12 00:00:00 | 20041012 | NULL | 1 | 76 | | 3 | 1 | 2004-06-12 00:00:00 | 20040612 | NULL | 1 | 76 | | 4 | 2 | 2004-11-22 00:00:00 | 20041122 | NULL | 2 | 76 | | 5 | 1 | 2004-02-17 00:00:00 | 20040217 | NULL | 1 | 76 | | 6 | 2 | 2004-12-18 00:00:00 | 20041218 | NULL | 2 | 76 | | 7 | 1 | 2004-06-09 00:00:00 | 20040609 | NULL | 1 | 76 | | 8 | 2 | 2004-04-18 00:00:00 | 20040418 | NULL | 2 | 76 | | 9 | 2 | 2004-02-29 00:00:00 | 20040229 | NULL | 2 | 76 | | 10 | 1 | 2004-12-03 00:00:00 | 20041203 | NULL | 1 | 76 | +---+---+---+---+---+--- ---+---+
10 rows in set (0.10 sec)
It lo o ks like we fo und o ur missing ro ws: cust o m e r_id 1 thro ugh 10 . If yo u recall, we specifically excluded these custo mers fro m o ur dim Cust o m e r dimensio n, because o ur business users to ld us they were "test" acco unts.
To fix this pro blem, we'll alter o ur select statement. Edit the query o n the t MysqlInput co mpo nent fo r st age Cust o m e rCo unt . Change the query so it lo o ks like this (make sure to use the co rrect quo tatio n marks):
CODE TO TYPE:
select customer_id, store_id, create_date from customer WHERE customer_id > 10;
Handling Missing Keys
We pro cess dimensio ns befo re we pro cess facts so we can be sure that o ur dimensio ns are current and then o ur facts can lo ad co rrectly. If we lo ad o ur sales facts in o ur data wareho use, but o ne ro w do esn't have a mo vie asso ciated with it, we have several o ptio ns to deal with the missing mo vie:
2. Sto p the current pro cess, alerting so meo ne to the erro r.
3. Lo ad the ro w, but set the ro w to a special "!!!! MISSING MOVIE !!!!" entry in the dimMovie dimensio n.
Lo gically, yo ur business users will have the last wo rd o n handling this situatio n. Mo st will no t cho o se pick o ptio ns # 1 o r # 2 fo r these reaso ns:
1. Igno ring the ro w can cause daily to tals to be o ff, so the wareho use repo rts may no t match existing repo rts fro m the so urce systems.
2. Sto pping the current pro cess to alert users to the erro r causes the who le pro cess to be interrupted, and will cause the entire wareho use to be inaccessible.
Since o ur fact tables are relo aded o n a daily basis, o ptio n # 3 is attractive because the ro w in erro r can be fixed within the so urce system and the wareho use will be "fixed" o n the next run.
Let's implement this fix. We'll start my altering dimMo vie so film_id is a no rmal integer. Currently it is
unsigned, meaning that no negative values are allo wed. Run this co mmand against yo ur perso nal database: CODE TO TYPE:
ALTER TABLE dimMovie modify film_id int not null;
Next we will add the special missing ro w to dimMovie. Run this command against your personal database: CODE TO TYPE:
insert into dimMovie (movie_key, film_id, title, run_id ) values (-1, -1, '!!! M ISSING MOVIE !!!', 1);
Let's check the ro w we just inserted to see what it lo o ks like. Run this co mmand against yo ur perso nal database:
CODE TO TYPE:
select * from dimMovie where movie_key=-1; As lo ng as yo u typed everything co rrectly yo u'll see this:
OBSERVE:
select * from dimMovie where movie_key=-1;
+---+---+---+---+---+---- ---+---+---+---+---+--- -+---+
| movie_key | film_id | title | description | release_year | lan guage | original_language | rental_duration | length | rating | special_features | run_id |
+---+---+---+---+---+---- ---+---+---+---+---+--- -+---+
| -1 | -1 | !!! MISSING MOVIE !!! | NULL | NULL | | NULL | 0 | 0 | | NULL | 1 |
+---+---+---+---+---+---- ---+---+---+---+---+--- -+---+
1 row in set (0.05 sec) mysql>
Everything lo o ks go o d. Next, o pen yo ur jo b in TOS, and do uble click o n the t ELT MysqlMap co mpo nent fo r f act Sale s.
We can use MySQL's IFNULL functio n to be sure o ur jo in was successful. If the jo in failed, dm.movie_key will be null, so the IFNULL function will return -1. If the join worked, dm.movie_key will have a value, which will be returned by IFNULL.
Next, change the expressio n o n the f act Sale s o utput fro m dm.movie_key to IFNULL(dm.movie_key, -1):
Click OK, then save yo ur jo b.
We do n't have access to o ur so urce system, so we canno t insert test data into sakila. Instead, we can
tempo rarily mo dify o ur sto red pro cedure to insert test data into stageFactSales. Run this command against yo ur perso nal database:
CODE TO TYPE:
DROP PROCEDURE etl_preFactSales; DELIMITER //
CREATE PROCEDURE etl_preFactSales () BEGIN
ALTER TABLE stageFactSales add index(payment_id); ALTER TABLE stageFactSales add index(payment_date); ALTER TABLE stageFactSales add index(customer_id); ALTER TABLE stageFactSales add index(film_id); ALTER TABLE stageFactSales add index(store_id); TRUNCATE TABLE factSales;
INSERT INTO stageFactSales values (-1, -1, '999.99', '2005-01-01', 11, 99999, 1, 1);
END //
OBSERVE:
Starting job ProcessDataWarehouse at 14:24 23/12/2008. Inserting with :
INSERT INTO factCustomerCount (SELECT null, dd.date_key , dc.customer_key , ds.s tore_key , 1, ss.run_id FROM stageFactCustomerCount ss INNER JOIN dimDate dd ON( dd.date = DATE(ss.create_date) ) INNER JOIN dimCustomer dc ON( dc.custom er_id = ss.customer_id AND dc.start_date <= dd.date AND dc.end_date > dd.date ) INNER JOIN dimStore ds ON( ds.store_id = ss.store_id AND ds.start_date <= dd.date AND ds.end_date > dd.date ))
--> 589 rows inserted. Inserting with :
INSERT INTO factSales (SELECT null, dd.date_key , dc.customer_key , IFNULL(dm.mo vie_key , -1), ds.store_key , dst.staff_key , ss.amount , ss.run_id FROM stage FactSales ss INNER JOIN dimDate dd ON( dd.date = DATE(ss.payment_date) ) LEFT OUTER JOIN dimMovie dm ON( dm.film_id = ss.film_id ) INNER JOIN dimCustomer dc ON( dc.customer_id = ss.customer_id AND dc.start_date <= dd.date AND dc.e nd_date > dd.date ) INNER JOIN dimStaff dst ON( dst.staff_id = ss.staff_id AN D dst.start_date <= dd.date AND dst.end_date > dd.date ) INNER JOIN dimStore ds ON( ds.store_id = ss.store_id AND ds.start_date <= dd.date AND ds.end_dat e > dd.date ))
--> 15767 rows inserted.
Job ProcessDataWarehouse ended at 14:24 23/12/2008. [exit code=0]
Switch back to MySql mo de to see if yo ur change wo rked. Run this co mmand against yo ur perso nal database:
CODE TO TYPE:
select * from factSales where movie_key=-1; If yo ur jo b ran successfully, yo u'll see this:
OBSERVE:
mysql> select * from factSales where movie_key=-1;
+---+---+---+---+---+---+--- ---+---+
| sales_key | date_key | customer_key | movie_key | store_key | staff_key | sale s_amount | run_id | +---+---+---+---+---+---+--- ---+---+ | 6858 | 20050101 | 884 | -1 | 1 | 1 | 999.99 | -1 | +---+---+---+---+---+---+--- ---+---+
1 row in set (0.08 sec) mysql>
It lo o ks like yo ur left jo in saved the day!
Aggregating
Data wareho uses are built with the presumptio n that data needs to be aggregated in different ways. If the increment we are using in o ur data wareho use is o ne day, and we query the wareho use to see sales in May, we need to SUM(Sales) fo r each day in May.
This wo rks fo r mo st types o f facts, but what if the fact under co nsideratio n is an acco unt balance? Acco unt balances are usually sto red as po int in t im e values. Take a lo o k:
Dat e De script io n Paym e nt De po sit Balance
0 1/0 1 Starting Balance 5 9 2.20
0 1/0 2 Gro cery Sto re 25.9 0 5 6 6 .30 0 1/0 3 Co mputer Sto re 19 .50 5 4 6 .80 0 1/0 4 Co nsulting Wo rk 150 0 .0 0 20 4 6 .80
The acco unt balance o n 0 1/0 3 is 5 4 6 .80 , and the acco unt balance o n 0 1/0 4 is 20 4 6 .80 . If to day is January 4th, the acco unt balance fo r the mo nth o f January is 20 4 6 .80 , no t 59 2.20 + 56 6 .30 + 546 .8 0 + 20 46 .8 0 = 3752.10 . Likewise, the acco unt balance fo r 20 0 8 is also 20 4 6 .80 , no t 3752.10 .
The pro per aggregate fo r an acco unt balance is no t SUM, it is LAST.
If we take a lo o k at MySQL's gro up by functio ns yo u'll no tice there is MAX, MIN, and of course SUM, however there is no LAST o r FIRST. This is because LAST and FIRST are no t currently suppo rted by MySQL.
Getting aro und this pro blem is tricky with MySQL. Our o nly o ptio n is to use ORDER BY to so rt the results by date, so the o ldest reco rd will appear first, and to LIMIT o ur result to o ne ro w. A sample query to get the last value fro m factSales wo uld lo o k like this:
CODE TO TYPE: SELECT * FROM factSales
ORDER BY date_key DESC
LIMIT 0, 1;
MySQL wo uld return the last ro w: OBSERVE:
mysql> SELECT * FROM factSales
ORDER BY date_key DESC LIMIT 0, 1;
+---+---+---+---+---+---+--- --+---+
| sales_key | date_key | customer_key | movie_key | store_key | staff_key | sales_amoun t | run_id | +---+---+---+---+---+---+--- --+---+ | 7 | 20060214 | 591 | 267 | 1 | 1 | 4.9 9 | 76 | +---+---+---+---+---+---+--- --+---+
1 row in set (0.06 sec) mysql>
Other databases have extensio ns that make this type o f query easier.
Deaggregating Data
We already saw that aggregating certain types o f data can po se so me pro blems. In so me situatio ns, data may already be aggregated, causing a different type o f pro blem.
Suppo se the DVD sto re started shipping packages. Shippers usually want to kno w the weight o f packages, since it is used to calculate shipping co st. If so meo ne o rders fo ur DVDs at the same time, tho se fo ur DVDs are co mbined into a single package and sent to the custo mer. Their o rder may lo o k so mething like this:
T it le Price
TITANIC BOONDOCK 5.9 9 NEWTON LABYRINTH 5.9 9
APOLLO TEEN 9 .9 9
Shipping & Handling 5.9 0 == T o t al == 37 .86
Our business user wants to kno w, fo r example, ho w much shipping co sts were fo r the mo vie APOLLO TEEN? Lo o king at o ur data, we o nly kno w that it co st $ 5 .9 0 to ship APOLLO TEEN alo ng with DADDY PITTSBURG, TITANIC BOONDOCK and NEWTON LABYRINTH.
Our business user uses this fo rmula to calculate shipping co sts:
Shipping o n an it e m = Shipping & Handling / # o f It e m s
With this fo rmula in mind, we can calculate the shipping & handling o n each individual item in the o rder:
T it le Price Shipping
DADDY PITTSBURGH 9 .9 9 5 .9 0 /4 = 1.4 7 5 TITANIC BOONDOCK 5.9 9 5 .9 0 /4 = 1.4 7 5 NEWTON LABYRINTH 5.9 9 5 .9 0 /4 = 1.4 7 5 APOLLO TEEN 9 .9 9 5 .9 0 /4 = 1.4 7 5 Shipping & Handling 5.9 0 1.475 * 4 = 5.9 0
== T o t al == 37 .86
So no w yo u might ask yo urself, "How can shipping be 1.475? Shouldn't it be rounded?"
The answer to that questio n (like the shipping calculatio n itself) can o nly be answered by yo ur business users.
Po ssible so lutio ns might be to do nothing - meaning let the end users pick if and ho w they want to ro und the data, o r to implement and do cument a ro unding algo rithm.
So where and ho w wo uld yo u implement this deaggregatio n? Yo u wo uld have to add a T ransfo rmatio n step to yo ur fact pro cessing to calculate the shipping & handling value.