• No results found

Missing Keys

In document DBA 3: Creating a Data Warehouse (Page 144-151)

In the last lesso n we lo aded o ur fact tables, assuming that we co uld reso lve all o f the fo reign keys required fo r the dimensio ns.

But what if a key canno t be fo und? Let's use o ur stageFactCustomerCount and factCustomerCount tables to help us wo rk o n this pro blem . First, run yo ur jo b to make sure yo ur tables co ntain data. Take a lo o k at the o utput (so me lines have been o mitted):

OBSERVE:

Starting job ProcessDataWarehouse at 14:24 23/12/2008. Inserting with :

INSERT INTO factCustomerCount (SELECT null, dd.date_key , dc.customer_key , ds.store_ke y , 1, ss.run_id FROM stageFactCustomerCount ss INNER JOIN dimDate dd ON( dd.date = DATE(ss.create_date) ) INNER JOIN dimCustomer dc ON( dc.customer_id = ss.customer_i d AND dc.start_date <= dd.date AND dc.end_date > dd.date ) INNER JOIN dimStore ds O N( ds.store_id = ss.store_id AND ds.start_date <= dd.date AND ds.end_date > dd.date ))

--> 589 rows inserted.

We kno w that 5 89 ro ws were put into factCustomerCount. How many were in stageFactCustomerCount? Run this co mmand against yo ur perso nal database:

CODE TO TYPE:

select count(*) from stageFactCustomerCount; Oh my! It lo o ks like stageFactCustomerCount has 599 rows!

OBSERVE:

mysql> select count(*) from stageFactCustomerCount; +---+

| count(*) | +---+ | 599 | +---+

1 row in set (0.05 sec) mysql>

Our jo in has excluded 10 ro ws. Ho w do we find the missing ro ws?

Debugging tELT MysqlMap

OBSERVE: Output fro m TOS

Starting job ProcessDataWarehouse at 14:24 23/12/2008. Inserting with :

INSERT INTO factCustomerCount (SELECT null, dd.date_key , dc.customer_key , ds.s tore_key , 1, ss.run_id FROM stageFactCustomerCount ss INNER JOIN dimDate dd ON( dd.date = DATE(ss.create_date) ) INNER JOIN dimCustomer dc ON( dc.custom er_id = ss.customer_id AND dc.start_date <= dd.date AND dc.end_date > dd.date ) INNER JOIN dimStore ds ON( ds.store_id = ss.store_id AND ds.start_date <= dd.date AND ds.end_date > dd.date ))

--> 589 rows inserted. Inserting with :

INSERT INTO factSales (SELECT null, dd.date_key , dc.customer_key , IFNULL(dm.mo vie_key , -1), ds.store_key , dst.staff_key , ss.amount , ss.run_id FROM stage FactSales ss INNER JOIN dimDate dd ON( dd.date = DATE(ss.payment_date) ) LEFT OUTER JOIN dimMovie dm ON( dm.film_id = ss.film_id ) INNER JOIN dimCustomer dc ON( dc.customer_id = ss.customer_id AND dc.start_date <= dd.date AND dc.e nd_date > dd.date ) INNER JOIN dimStaff dst ON( dst.staff_id = ss.staff_id AN D dst.start_date <= dd.date AND dst.end_date > dd.date ) INNER JOIN dimStore ds ON( ds.store_id = ss.store_id AND ds.start_date <= dd.date AND ds.end_dat e > dd.date ))

--> 15767 rows inserted.

Job ProcessDataWarehouse ended at 14:24 23/12/2008. [exit code=0]

The o utput includes the INSERT ... SELECT statement used to populate f act Cust o merCo unt . If we get rid o f the INSERT part, and reformat the query slightly, it will look like this:

OBSERVE:

SELECT null, dd.date_key , dc.customer_key , ds.store_key , 1, ss.run_id

FROM stageFactCustomerCount ss INNER JOIN dimDate dd ON( dd.date = DATE(ss.cr eate_date) )

INNER JOIN dimCustomer dc ON( dc.customer_id = ss.customer_id AND dc.start_da te <= dd.date AND dc.end_date > dd.date )

INNER JOIN dimStore ds ON( ds.store_id = ss.store_id AND ds.start_date <= dd. date AND ds.end_date > dd.date )

Because we used inner jo ins in o ur query, no n-matching ro ws are excluded fro m the results. There is ho pe after all. =)

We can change this query so it returns o nly no n-matching ro ws. We do that by co nverting the INNER J OIN to LEFT J OIN and adding a WHERE clause. We also want to add three co lumns to the SELECT part, to help us debug: ss.cust o m e r_id, ss.st o re _id, ss.cre at e _dat e. Run this co mmand against yo ur perso nal

database:

CODE TO TYPE: SELECT

ss.customer_id, ss.store_id, ss.create_date,

dd.date_key , dc.customer_key , ds.store_key , ss.run_id FROM stageFactCustomerCount ss

INNER JOIN dimDate dd ON( dd.date = DATE(ss.create_date) )

LEFT JOIN dimCustomer dc ON( dc.customer_id = ss.customer_id AND dc.start_dat e <= dd.date AND dc.end_date > dd.date )

LEFT JOIN dimStore ds ON( ds.store_id = ss.store_id AND ds.start_date <= dd.d ate AND ds.end_date > dd.date )

WHERE dd.date_key IS NULL OR dc.customer_key IS NULL OR ds.store_key IS NULL

This query will return ro ws fro m stageFactCustomerCount that do not have corresponding rows in either dimDate, dimCustomer o r dimStore. No n-matching ro ws will have a NULL dat e _ke y, cust o m e r_ke y, o r a NULL st o re _ke y.

After yo u run the query, yo u'll see the missing ro ws: OBSERVE:

mysql> SELECT

-> ss.customer_id, ss.store_id, ss.create_date,

-> dd.date_key , dc.customer_key , ds.store_key , ss.run_id -> FROM stageFactCustomerCount ss

-> INNER JOIN dimDate dd ON( dd.date = DATE(ss.create_date) )

-> LEFT JOIN dimCustomer dc ON( dc.customer_id = ss.customer_id AND dc.st art_date <= dd.date AND dc.end_date > dd.date )

-> LEFT JOIN dimStore ds ON( ds.store_id = ss.store_id AND ds.start_date <= dd.date AND ds.end_date > dd.date )

-> WHERE dd.date_key IS NULL OR dc.customer_key IS NULL OR ds.store_key IS N ULL;

+---+---+---+---+---+--- ---+---+

| customer_id | store_id | create_date | date_key | customer_key | store _key | run_id | +---+---+---+---+---+--- ---+---+ | 1 | 1 | 2004-06-26 00:00:00 | 20040626 | NULL | 1 | 76 | | 2 | 1 | 2004-10-12 00:00:00 | 20041012 | NULL | 1 | 76 | | 3 | 1 | 2004-06-12 00:00:00 | 20040612 | NULL | 1 | 76 | | 4 | 2 | 2004-11-22 00:00:00 | 20041122 | NULL | 2 | 76 | | 5 | 1 | 2004-02-17 00:00:00 | 20040217 | NULL | 1 | 76 | | 6 | 2 | 2004-12-18 00:00:00 | 20041218 | NULL | 2 | 76 | | 7 | 1 | 2004-06-09 00:00:00 | 20040609 | NULL | 1 | 76 | | 8 | 2 | 2004-04-18 00:00:00 | 20040418 | NULL | 2 | 76 | | 9 | 2 | 2004-02-29 00:00:00 | 20040229 | NULL | 2 | 76 | | 10 | 1 | 2004-12-03 00:00:00 | 20041203 | NULL | 1 | 76 | +---+---+---+---+---+--- ---+---+

10 rows in set (0.10 sec)

It lo o ks like we fo und o ur missing ro ws: cust o m e r_id 1 thro ugh 10 . If yo u recall, we specifically excluded these custo mers fro m o ur dim Cust o m e r dimensio n, because o ur business users to ld us they were "test" acco unts.

To fix this pro blem, we'll alter o ur select statement. Edit the query o n the t MysqlInput co mpo nent fo r st age Cust o m e rCo unt . Change the query so it lo o ks like this (make sure to use the co rrect quo tatio n marks):

CODE TO TYPE:

select customer_id, store_id, create_date from customer WHERE customer_id > 10;

Handling Missing Keys

We pro cess dimensio ns befo re we pro cess facts so we can be sure that o ur dimensio ns are current and then o ur facts can lo ad co rrectly. If we lo ad o ur sales facts in o ur data wareho use, but o ne ro w do esn't have a mo vie asso ciated with it, we have several o ptio ns to deal with the missing mo vie:

2. Sto p the current pro cess, alerting so meo ne to the erro r.

3. Lo ad the ro w, but set the ro w to a special "!!!! MISSING MOVIE !!!!" entry in the dimMovie dimensio n.

Lo gically, yo ur business users will have the last wo rd o n handling this situatio n. Mo st will no t cho o se pick o ptio ns # 1 o r # 2 fo r these reaso ns:

1. Igno ring the ro w can cause daily to tals to be o ff, so the wareho use repo rts may no t match existing repo rts fro m the so urce systems.

2. Sto pping the current pro cess to alert users to the erro r causes the who le pro cess to be interrupted, and will cause the entire wareho use to be inaccessible.

Since o ur fact tables are relo aded o n a daily basis, o ptio n # 3 is attractive because the ro w in erro r can be fixed within the so urce system and the wareho use will be "fixed" o n the next run.

Let's implement this fix. We'll start my altering dimMo vie so film_id is a no rmal integer. Currently it is

unsigned, meaning that no negative values are allo wed. Run this co mmand against yo ur perso nal database: CODE TO TYPE:

ALTER TABLE dimMovie modify film_id int not null;

Next we will add the special missing ro w to dimMovie. Run this command against your personal database: CODE TO TYPE:

insert into dimMovie (movie_key, film_id, title, run_id ) values (-1, -1, '!!! M ISSING MOVIE !!!', 1);

Let's check the ro w we just inserted to see what it lo o ks like. Run this co mmand against yo ur perso nal database:

CODE TO TYPE:

select * from dimMovie where movie_key=-1; As lo ng as yo u typed everything co rrectly yo u'll see this:

OBSERVE:

select * from dimMovie where movie_key=-1;

+---+---+---+---+---+---- ---+---+---+---+---+--- -+---+

| movie_key | film_id | title | description | release_year | lan guage | original_language | rental_duration | length | rating | special_features | run_id |

+---+---+---+---+---+---- ---+---+---+---+---+--- -+---+

| -1 | -1 | !!! MISSING MOVIE !!! | NULL | NULL | | NULL | 0 | 0 | | NULL | 1 |

+---+---+---+---+---+---- ---+---+---+---+---+--- -+---+

1 row in set (0.05 sec) mysql>

Everything lo o ks go o d. Next, o pen yo ur jo b in TOS, and do uble click o n the t ELT MysqlMap co mpo nent fo r f act Sale s.

We can use MySQL's IFNULL functio n to be sure o ur jo in was successful. If the jo in failed, dm.movie_key will be null, so the IFNULL function will return -1. If the join worked, dm.movie_key will have a value, which will be returned by IFNULL.

Next, change the expressio n o n the f act Sale s o utput fro m dm.movie_key to IFNULL(dm.movie_key, -1):

Click OK, then save yo ur jo b.

We do n't have access to o ur so urce system, so we canno t insert test data into sakila. Instead, we can

tempo rarily mo dify o ur sto red pro cedure to insert test data into stageFactSales. Run this command against yo ur perso nal database:

CODE TO TYPE:

DROP PROCEDURE etl_preFactSales; DELIMITER //

CREATE PROCEDURE etl_preFactSales () BEGIN

ALTER TABLE stageFactSales add index(payment_id); ALTER TABLE stageFactSales add index(payment_date); ALTER TABLE stageFactSales add index(customer_id); ALTER TABLE stageFactSales add index(film_id); ALTER TABLE stageFactSales add index(store_id); TRUNCATE TABLE factSales;

INSERT INTO stageFactSales values (-1, -1, '999.99', '2005-01-01', 11, 99999, 1, 1);

END //

OBSERVE:

Starting job ProcessDataWarehouse at 14:24 23/12/2008. Inserting with :

INSERT INTO factCustomerCount (SELECT null, dd.date_key , dc.customer_key , ds.s tore_key , 1, ss.run_id FROM stageFactCustomerCount ss INNER JOIN dimDate dd ON( dd.date = DATE(ss.create_date) ) INNER JOIN dimCustomer dc ON( dc.custom er_id = ss.customer_id AND dc.start_date <= dd.date AND dc.end_date > dd.date ) INNER JOIN dimStore ds ON( ds.store_id = ss.store_id AND ds.start_date <= dd.date AND ds.end_date > dd.date ))

--> 589 rows inserted. Inserting with :

INSERT INTO factSales (SELECT null, dd.date_key , dc.customer_key , IFNULL(dm.mo vie_key , -1), ds.store_key , dst.staff_key , ss.amount , ss.run_id FROM stage FactSales ss INNER JOIN dimDate dd ON( dd.date = DATE(ss.payment_date) ) LEFT OUTER JOIN dimMovie dm ON( dm.film_id = ss.film_id ) INNER JOIN dimCustomer dc ON( dc.customer_id = ss.customer_id AND dc.start_date <= dd.date AND dc.e nd_date > dd.date ) INNER JOIN dimStaff dst ON( dst.staff_id = ss.staff_id AN D dst.start_date <= dd.date AND dst.end_date > dd.date ) INNER JOIN dimStore ds ON( ds.store_id = ss.store_id AND ds.start_date <= dd.date AND ds.end_dat e > dd.date ))

--> 15767 rows inserted.

Job ProcessDataWarehouse ended at 14:24 23/12/2008. [exit code=0]

Switch back to MySql mo de to see if yo ur change wo rked. Run this co mmand against yo ur perso nal database:

CODE TO TYPE:

select * from factSales where movie_key=-1; If yo ur jo b ran successfully, yo u'll see this:

OBSERVE:

mysql> select * from factSales where movie_key=-1;

+---+---+---+---+---+---+--- ---+---+

| sales_key | date_key | customer_key | movie_key | store_key | staff_key | sale s_amount | run_id | +---+---+---+---+---+---+--- ---+---+ | 6858 | 20050101 | 884 | -1 | 1 | 1 | 999.99 | -1 | +---+---+---+---+---+---+--- ---+---+

1 row in set (0.08 sec) mysql>

It lo o ks like yo ur left jo in saved the day!

Aggregating

Data wareho uses are built with the presumptio n that data needs to be aggregated in different ways. If the increment we are using in o ur data wareho use is o ne day, and we query the wareho use to see sales in May, we need to SUM(Sales) fo r each day in May.

This wo rks fo r mo st types o f facts, but what if the fact under co nsideratio n is an acco unt balance? Acco unt balances are usually sto red as po int in t im e values. Take a lo o k:

Dat e De script io n Paym e nt De po sit Balance

0 1/0 1 Starting Balance 5 9 2.20

0 1/0 2 Gro cery Sto re 25.9 0 5 6 6 .30 0 1/0 3 Co mputer Sto re 19 .50 5 4 6 .80 0 1/0 4 Co nsulting Wo rk 150 0 .0 0 20 4 6 .80

The acco unt balance o n 0 1/0 3 is 5 4 6 .80 , and the acco unt balance o n 0 1/0 4 is 20 4 6 .80 . If to day is January 4th, the acco unt balance fo r the mo nth o f January is 20 4 6 .80 , no t 59 2.20 + 56 6 .30 + 546 .8 0 + 20 46 .8 0 = 3752.10 . Likewise, the acco unt balance fo r 20 0 8 is also 20 4 6 .80 , no t 3752.10 .

The pro per aggregate fo r an acco unt balance is no t SUM, it is LAST.

If we take a lo o k at MySQL's gro up by functio ns yo u'll no tice there is MAX, MIN, and of course SUM, however there is no LAST o r FIRST. This is because LAST and FIRST are no t currently suppo rted by MySQL.

Getting aro und this pro blem is tricky with MySQL. Our o nly o ptio n is to use ORDER BY to so rt the results by date, so the o ldest reco rd will appear first, and to LIMIT o ur result to o ne ro w. A sample query to get the last value fro m factSales wo uld lo o k like this:

CODE TO TYPE: SELECT * FROM factSales

ORDER BY date_key DESC

LIMIT 0, 1;

MySQL wo uld return the last ro w: OBSERVE:

mysql> SELECT * FROM factSales

ORDER BY date_key DESC LIMIT 0, 1;

+---+---+---+---+---+---+--- --+---+

| sales_key | date_key | customer_key | movie_key | store_key | staff_key | sales_amoun t | run_id | +---+---+---+---+---+---+--- --+---+ | 7 | 20060214 | 591 | 267 | 1 | 1 | 4.9 9 | 76 | +---+---+---+---+---+---+--- --+---+

1 row in set (0.06 sec) mysql>

Other databases have extensio ns that make this type o f query easier.

Deaggregating Data

We already saw that aggregating certain types o f data can po se so me pro blems. In so me situatio ns, data may already be aggregated, causing a different type o f pro blem.

Suppo se the DVD sto re started shipping packages. Shippers usually want to kno w the weight o f packages, since it is used to calculate shipping co st. If so meo ne o rders fo ur DVDs at the same time, tho se fo ur DVDs are co mbined into a single package and sent to the custo mer. Their o rder may lo o k so mething like this:

T it le Price

TITANIC BOONDOCK 5.9 9 NEWTON LABYRINTH 5.9 9

APOLLO TEEN 9 .9 9

Shipping & Handling 5.9 0 == T o t al == 37 .86

Our business user wants to kno w, fo r example, ho w much shipping co sts were fo r the mo vie APOLLO TEEN? Lo o king at o ur data, we o nly kno w that it co st $ 5 .9 0 to ship APOLLO TEEN alo ng with DADDY PITTSBURG, TITANIC BOONDOCK and NEWTON LABYRINTH.

Our business user uses this fo rmula to calculate shipping co sts:

Shipping o n an it e m = Shipping & Handling / # o f It e m s

With this fo rmula in mind, we can calculate the shipping & handling o n each individual item in the o rder:

T it le Price Shipping

DADDY PITTSBURGH 9 .9 9 5 .9 0 /4 = 1.4 7 5 TITANIC BOONDOCK 5.9 9 5 .9 0 /4 = 1.4 7 5 NEWTON LABYRINTH 5.9 9 5 .9 0 /4 = 1.4 7 5 APOLLO TEEN 9 .9 9 5 .9 0 /4 = 1.4 7 5 Shipping & Handling 5.9 0 1.475 * 4 = 5.9 0

== T o t al == 37 .86

So no w yo u might ask yo urself, "How can shipping be 1.475? Shouldn't it be rounded?"

The answer to that questio n (like the shipping calculatio n itself) can o nly be answered by yo ur business users.

Po ssible so lutio ns might be to do nothing - meaning let the end users pick if and ho w they want to ro und the data, o r to implement and do cument a ro unding algo rithm.

So where and ho w wo uld yo u implement this deaggregatio n? Yo u wo uld have to add a T ransfo rmatio n step to yo ur fact pro cessing to calculate the shipping & handling value.

In document DBA 3: Creating a Data Warehouse (Page 144-151)