• No results found

DBA 3: Creating a Data Warehouse

N/A
N/A
Protected

Academic year: 2021

Share "DBA 3: Creating a Data Warehouse"

Copied!
168
0
0

Loading.... (view fulltext now)

Full text

(1)

DBA 3: Creating a Data Warehouse

Lesso n 1: Int ro duct io n

Using the Learning Sandbo x Enviro nment Data Wareho using

Lesso n 2: A Dat a Ware ho use Facts and Dimensio ns

Facts Dimensio ns The Dimensio nal Mo del

Selecting Facts and Dimensio ns Star Schema

Lesso n 3: Im ple m e nt ing t he Dim e nsio nal Mo de l, Part I Creating the Date Dimensio n

Slo wly Changing Dimensio ns Type 0 SCD

Type 1 SCD Type 2 SCD Type 3 SCD Type 4 SCD

Creating the Custo mer Dimensio n Sno wflake Schemas

Lesso n 4: Im ple m e nt ing T he Dim e nsio nal Mo de l, Part II Creating the Mo vie Dimensio n

Creating the Sto re Dimensio n Creating Facts

Sales

Custo merCo unt RentalCo unt

Lesso n 5: Ext ract , T ransf o rm , Lo ad (ET L) What is ETL?

Lo gging and Auditing

Getting Data into the Wareho use dimDate

dimCusto mer dimMo vie dimSto re

Lesso n 6 : T o o ls f o r ET L

ETL--Past, Present, and Future

Getting Started with Talend Open Studio Yo ur First TOS Jo b

Lesso n 7: ET L: T he Dat e Dim e nsio n Jo b Structure

(2)

Jo b Structure

Lo ading Data fro m Excel

Adding Co lumns to o ur Data Flo w Adding Data to dimDate

If yo u run into pro blems...

Lesso n 8 : Basic Dim e nsio n Pro ce ssing Lo ading dimMo vie

Jo b Structure Pre and Po st Jo b Lo gging

dimMo vie Perfo rmance

Lesso n 9 : SCD Pro ce ssing

The Algo rithm: Slo wly Changing Dimensio ns Implementing the Dimensio ns

dimCusto mer Do es o ur SCD wo rk? dimSto re

Lesso n 10 : Pro ce ssing Fact s, Part I Orchestratio n

factCusto merCo unt

Lesso n 11: Pro ce ssing Fact s, Part II factSales

Lesso n 12: Spe cial Fact s Missing Keys

Debugging tELTMysqlMap Handling Missing Keys Aggregating

Deaggregating Data Early Arriving Facts

Lesso n 13: Que rying a Re lat io nal Dat a Ware ho use Viewing Data

Answering Questio ns Pro blems with Queries

Bad Jo ins Inco rrect Filtering Lesso n 14: Final Pro je ct

No rthwind Traders fp_dimDate fp_dimEmplo yees fp_dimCusto mers fp_dimSuppliers fp_dimPro ducts

(3)

fp_dimOrders Order Unit Price Supplier Unit Price

Copyright © 1998-2014 O'Reilly Media, Inc.

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. See http://creativecommons.org/licenses/by-sa/3.0/legalcode for more information.

(4)

Introduction

Welco me to the third co urse in the O'Reilly Scho o l o f Techno lo gy (OST) DBA series.

Course Objectives

When yo u co mplete this co urse, yo u will be able to :

implement the dimensio nal mo del using standard ETL pro cesses. demo nstrate understanding o f dimensio n, SCD, and fact pro cessing. query relatio nal data wareho uses using standard SQL co mmands. develo p a co mplete data wareho use using Talend Open Studio .

This co urse is written under the assumptio n that yo u have wo rked thro ugh the first two co urses in the series and are familiar with MySQL. If yo u'd like to refresh yo ur memo ry, feel free to go back o ver the first two co urses—then, get ready to take yo ur MySQL kno wledge to the next level! In this co urse, yo u'll learn what makes up a data wareho use and gain an understanding o f the dimensio nal mo del.

Fro m beginning to end, yo u will learn by do ing pro jects using Talend Open Studio , an Eclipse-based to o l fo r implementing data wareho uses. Yo u'll co mplete pro jects using Talend, develo ping yo ur o wn co mplete data wareho uses. The pro jects add to yo ur po rtfo lio and will co ntribute to certificate co mpletio n. Besides a bro wser and internet co nnectio n, all so ftware is pro vided o nline by the O'Reilly Scho o l o f Techno lo gy.

Learning with O'Reilly School of Technology Courses

As with every O'Reilly Scho o l o f Techno lo gy co urse, we'll take a user-active appro ach to learning. This means that yo u (the user) will be active! Yo u'll learn by do ing, building live pro grams, testing them and experimenting with them— hands-o n!

To learn a new skill o r techno lo gy, yo u have to experiment. The mo re yo u experiment, the mo re yo u learn. Our system is designed to maximize experimentatio n and help yo u learn to learn a new skill.

We'll pro gram as much as po ssible to be sure that the principles sink in and stay with yo u.

Each time we discuss a new co ncept, yo u'll put it into co de and see what YOU can do with it. On o ccasio n we'll even give yo u co de that do esn't wo rk, so yo u can see co mmo n mistakes and ho w to reco ver fro m them. Making mistakes is actually ano ther go o d way to learn.

Abo ve all, we want to help yo u to learn to learn. We give yo u the to o ls to take co ntro l o f yo ur o wn learning experience. When yo u co mplete an OST co urse, yo u kno w the subject matter, and yo u kno w ho w to expand yo ur kno wledge, so yo u can handle changes like so ftware and o perating system updates.

Here are so me tips fo r using O'Reilly Scho o l o f Techno lo gy co urses effectively:

T ype t he co de . Resist the temptatio n to cut and paste the example co de we give yo u. Typing the co de actually gives yo u a feel fo r the pro gramming task. Then play aro und with the examples to find o ut what else yo u can make them do , and to check yo ur understanding. It's highly unlikely yo u'll break anything by

experimentatio n. If yo u do break so mething, that's an indicatio n to us that we need to impro ve o ur system! T ake yo ur t im e . Learning takes time. Rushing can have negative effects o n yo ur pro gress. Slo w do wn and let yo ur brain abso rb the new info rmatio n tho ro ughly. Taking yo ur time helps to maintain a relaxed, po sitive appro ach. It also gives yo u the chance to try new things and learn mo re than yo u o therwise wo uld if yo u blew thro ugh all o f the co ursewo rk to o quickly.

Expe rim e nt . Wander fro m the path o ften and explo re the po ssibilities. We can't anticipate all o f yo ur questio ns and ideas, so it's up to yo u to experiment and create o n yo ur o wn. Yo ur instructo r will help if yo u go co mpletely o ff the rails.

Acce pt guidance , but do n't de pe nd o n it . Try to so lve pro blems o n yo ur o wn. Go ing fro m misunderstanding to understanding is the best way to acquire a new skill. Part o f what yo u're learning is pro blem so lving. Of co urse, yo u can always co ntact yo ur instructo r fo r hints when yo u need them. Use all available re so urce s! In real-life pro blem-so lving, yo u aren't bo und by false limitatio ns; in OST co urses, yo u are free to use any reso urces at yo ur dispo sal to so lve pro blems yo u enco unter: the Internet, reference bo o ks, and o nline help are all fair game.

Have f un! Relax, keep practicing, and do n't be afraid to make mistakes! Yo ur instructo r will keep yo u at it until yo u've mastered the skill. We want yo u to get that satisfied, "I'm so co o l! I did it!" feeling. And yo u'll have

(5)

until yo u've mastered the skill. We want yo u to get that satisfied, "I'm so co o l! I did it!" feeling. And yo u'll have so me pro jects to sho w o ff when yo u're do ne.

Lesson Format

We'll try o ut lo ts o f examples in each lesso n. We'll have yo u write co de, lo o k at co de, and edit existing co de. The co de will be presented in bo xes that will indicate what needs to be do ne to the co de inside.

Whenever yo u see white bo xes like the o ne belo w, yo u'll type the co ntents into the edito r windo w to try the example yo urself. The CODE TO TYPE bar o n to p o f the white bo x co ntains directio ns fo r yo u to fo llo w:

CODE TO TYPE:

White boxes like this contain code for you to try out (type into a file to run).

If you have already written some of the code, new code for you to add looks like this.

If we want you to remove existing code, the code to remove will look like this.

We may also include instructive comments that you don't need to type.

We may run pro grams and do so me o ther activities in a terminal sessio n in the o perating system o r o ther co mmand-line enviro nment. These will be sho wn like this:

INTERACTIVE SESSION:

The plain black text that we present in these INTERACTIVE boxes is

provided by the system (not for you to type). The commands we want you to type look lik e this.

Co de and info rmatio n presented in a gray OBSERVE bo x is fo r yo u to inspect and absorb. This info rmatio n is o ften co lo r-co ded, and fo llo wed by text explaining the co de in detail:

OBSERVE:

Gray "Observe" boxes like this contain information (usually code specifics) for you to observe.

The paragraph(s) that fo llo w may pro vide additio n details o n inf o rm at io n that was highlighted in the Observe bo x. We'll also set especially pertinent info rmatio n apart in "No te" bo xes:

Note

No tes pro vide info rmatio n that is useful, but no t abso lutely necessary fo r perfo rming the tasks at hand.

T ip

Tips pro vide info rmatio n that might help make the to o ls easier fo r yo u to use, such as sho rtcut keys.

WARNING

Warnings pro vide info rmatio n that can help prevent pro gram crashes and data lo ss.

Using the Learning Sandbox Environment

Note

If yo u haven't read Getting Started with Talend Open Studio yet, go ahead and do that no w. In this co urse, yo u'll need to co nnect to yo ur Unix acco unts using SSH. To allo w yo u to do that, we pro vide two T e rm inal views:

(6)

Note

Depending o n the width o f yo ur mo nito r, the text o n the tabs may be truncated. Terminal 1 is the leftterminal, and Terminal 2 is the right terminal. If yo u click o n the seco nd red leaf, the terminals will be lo cated lo wer o n the screen:

(7)

To co nnect, click o n o ne o f the terminal tabs and then click o n Co nne ct :

Change the Co nne ct io n T ype to SSH, set the ho st to co ld.o re illyscho o l.co m , and then enter yo ur OST username and passwo rd:

(8)

The first time yo u co nnect yo u will see a few o ther warnings. Click o n Ye s fo r all o f them:

After yo u click OK, yo u will be co nnected to yo ur acco unt:

Yo u'll also be saving so me o f yo ur SQL queries and do cumentatio n in text files. We'll sto re these in a pro ject accessible fro m TOS. To add this pro ject, click o n the Navigat o r tab:

(9)

Yo u co uld use this view to peek at the files that TOS sto res "behind the scenes." We'll use it to ho ld o ur text files. Right-click in the white space under the fo lders, and select Ne w | Pro je ct :

(10)

Name this pro ject Do cum e nt at io n and click Finish:

WARNING

Yo u must name yo ur o bjects exactly as specified in the lesso n so yo ur mento r can find yo ur wo rkand help yo u if yo u need it. To create a new text file, right-click o n Do cum e nt at io n and select Ne w | Ot he r...:

(11)

Under the Ge ne ral fo lder, cho o se File and click Ne xt :

Make sure yo u select Do cum e nt at io n as the parent fo lder. Name yo ur new file dba3_le sso n1_pro je ct 1.t xt —yo u will add to this file as yo u co mplete yo ur first pro ject. Click Finish:

(12)

An edito r will o pen, allo wing yo u to enter the fo llo wing text: CODE TO TYPE:

I can't wait to learn more about data warehouses!

Save yo ur wo rk:

Data Warehousing

At this po int in yo ur database educatio n, yo u are familiar with SQL databases and their capabilities. By far, the mo st po pular use fo r databases is the sto rage o f o peratio nal data generated thro ugh transactio ns.

In the previo us co urses, we examined the database o f a mo vie rental sto re. We used the database to keep track o f custo mers, the mo vies in the sto re's invento ry, and the mo vies currently being rented. We designed and o ptimized o ur tables fo r this "o peratio nal" purpo se.

Keeping track o f current custo mers and rentals is a key task o f a mo vie rental sto re database, but at so me po int a sto re manager may want to gain additio nal insight into the business. He may have questio ns such as:

Ho w many new custo mers were added this quarter? What is the mo st po pular rental?

Ho w much revenue did o ur East side sto re generate co mpared to o ur West side sto re? Ho w do sales this mo nth co mpare to last mo nth o r last year?

Are mo vies that were po pular in the theater po pular rentals as well?

So me o f o ur questio ns can be answered by writing a straightfo rward query against the o peratio nal database. Other, mo re co mplicated, questio ns, such as "Ho w much revenue did o ur East side sto re generate co mpared to o ur West

(13)

side sto re?", can be mo re challenging to answer. And still o ther questio ns canno t be answered at all witho ut additio nal info rmatio n that isn't sto red in the database.

A Data Warehouse is designed to pro vide a unified platfo rm to answer all o f the questio ns po sed abo ve. A go o d data wareho use pro vides:

A separate system that wo n't interrupt business critical o peratio nal systems:

A single po int o f access fo r all analytical queries:

A unified and co nsistent view o f underlying data (even data fro m external systems):

(14)

In this co urse yo u'll learn everything yo u need to kno w abo ut a data wareho use, fro m planning to implementatio n. In the next lesso n we'll take a fresh lo o k at o ur o peratio nal database and start planning o ur wareho use. See yo u there! To co ntinue, do yo ur ho mewo rk! Right-click in the lesso n text and select Back to return to the syllabus and select the Quiz and Pro je ct fo r this lesso n, co mpleting each and handing them in to the instructo r.

Copyright © 1998-2014 O'Reilly Media, Inc.

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. See http://creativecommons.org/licenses/by-sa/3.0/legalcode for more information.

(15)

A Data Warehouse

Facts and Dimensions

In the first lesso n, we discussed reaso ns to develo p and use a Data Wareho use. The bo tto m line was that o ur mo vie sto re manager might want answers to a few go o d questio ns:

Ho w many new custo mers were added this quarter? What is the mo st po pular rental?

Ho w much revenue did o ur East side sto re generate co mpared to o ur West side sto re? Ho w do sales this mo nth co mpare to last mo nth o r last year?

Are mo vies that were po pular in the theater po pular rentals as well? Which custo mers rent the mo st DVDs each mo nth and at which sto re?

We can rewrite so me o f the questio ns so that they share a fo rmat we can use mo re readily in o ur queries: Ho w m any ne w cust o m e rs did we add by quart e r?

Ho w m any t im e s we re DVDs re nt e d, by DVD and by m o nt h? Ho w m uch sale s did we do, by st o re and by m o nt h?

Ho w m uch sale s did we do by m o nt h?

Ho w m any t im e s we re DVDs re nt e d, by m o nt h and by t he at e r po pularit y? Ho w m any t im e s we re DVDs re nt e d, by cust o m e r, m o nt h and st o re ?

Tho ugh the questio ns are slightly different fro m the o riginal o nes, they are no w structured like analytical queries, with f act s and dim e nsio ns.

Facts

Fact s are numbers, and are so metimes referred to as measures. A fact relating to sales co uld be "Sales in US Do llars" o r "Sales in Euro s." Other facts co uld be "Ho urs o f Wo rk," o r "Times Rented."

Fact s have a defined grain—a level o f detail. Fo r example, "Sales in US Do llars" may be daily, o r even ho urly. If yo u have sales data o n a daily grain, yo u canno t display sales by ho ur. Yo u can, ho wever, co mbine (aggregate) daily sales to larger grains such as weekly o r mo nthly:

(16)

Aggregates are applied to facts in o rder to mo ve to a larger grain. The mo st co mmo n aggregate is SUM. Other aggregates are Average (AVG), co unt (COUNT), maximum (MAX), and minimum (MIN). Aggregates take a set o f data and return a summary o f that data.

Let's experiment with so me aggregates in o ur SQL database. Switch to the SSH mo de, and lo g into yo ur acco unt. In Unix mo de, use the m ysql co mmand to co nnect to the sakila database as the ano nym o us user. When pro mpted fo r a passwo rd, press Ent e r. In Unix mo de, run the fo llo wing co mmand:

INTERACTIVE SESSION:

cold1:~$ mysql -h sql sakila -u anonymous -p

Yo u see so mething like this: OBSERVE:

Reading table information for completion of table and column names You can turn off this feature to get a quicker startup with -A Welcome to the MySQL monitor. Commands end with ; or \g. Your MySQL connection id is 28527

Server version: 5.0.62-log Source distribution

Type 'help;' or '\h' for help. Type '\c' to clear the buffer. mysql>

Let's take a lo o k at the tables. Run this co mmand: INTERACTIVE SESSION:

mysql> show tables;

+---+ | Tables_in_sakila | +---+ | actor | | actor_info | | address | | category | | city | | country | | customer | | customer_list | | film | | film_actor | | film_category | | film_list | | film_text | | inventory | | language | | nicer_but_slower_film_list | | payment | | rental | | sales_by_film_category | | sales_by_store | | staff | | staff_list | | store | +---+ 23 rows in set (0.00 sec)

(17)

fo llo wing co mmand:

INTERACTIVE SESSION: mysql> describe payment;

+---+---+---+---+---+---+

| Field | Type | Null | Key | Default | Extra |

+---+---+---+---+---+---+

| payment_id | smallint(5) unsigned | NO | PRI | NULL | auto_in crement |

| customer_id | smallint(5) unsigned | NO | MUL | NULL | |

| staff_id | tinyint(3) unsigned | NO | MUL | NULL | |

| rental_id | int(11) | YES | MUL | NULL | |

| amount | decimal(5,2) | NO | | NULL | |

| payment_date | datetime | NO | | NULL | |

| last_update | timestamp | NO | | CURRENT_TIMESTAMP | |

+---+---+---+---+---+---+

7 rows in set (0.00 sec)

This table has has a co lumn named am o unt , which is a measure.

So , "What is the to tal o f all payments received?" Run this co mmand to find o ut: INTERACTIVE SESSION:

mysql> SELECT sum(amount) from payment;

+---+ | sum(amount) | +---+ | 67416.51 | +---+

1 row in set (0.14 sec)

No w let's answer this questio n: "What is the largest payment received?" Run this co mmand: INTERACTIVE SESSION:

mysql> SELECT max(amount) from payment;

+---+ | max(amount) | +---+ | 11.99 | +---+

1 row in set (0.02 sec)

So , what if yo u need to track an event in the data wareho use? Events do n't usually have a numeric value attached to them (o ther than perhaps "co unt"). They are kno wn as factless facts. We'll talk mo re abo ut factless facts later.

(18)

Dimensions

Dim e nsio ns are used to filter, catego rize, and label f act s. A fact such as "Sales in US Do llars" might have dimensio ns fo r Date, Customer, Store, and Movie. Written in English, this might translate to so mething like this:

On May 25 t h, Rut h Mart ine z re nt e d t he m o vie " Cabin Flash" f ro m t he We st side st o re fo r $ 9 .9 9. Or, bro ken into its co mpo nents, it lo o ks like this:

Nam e Value

Dim e nsio n Date May 25th

Dim e nsio n Custo mer Ruth Martinez Dim e nsio n Mo vie Cabin Flash

Dim e nsio n Sto re West

Fact Sales in US Do llars $9 .9 9

The first and mo st impo rtant dim e nsio n used in a wareho use is the date dimensio n. This dimensio n is o ften presented in a hierarchy:

Year -> Quarter -> Mo nth -> Day

Days can "ro ll up" to a mo nth. Mo nths can "ro ll up" to a quarter, and quarters "ro ll up" to a year. Daily sales "ro ll up" to mo nthly sales, mo nthly sales "ro ll up" to quarterly sales, and quarterly sales "ro ll up" to yearly sales:

(19)

Ye ar, Quart e r, Mo nt h, and Day are no t dimensio ns themselves. They represent levels in the Dat e dimensio n's hierarchy.

Dates o ften have multiple uses in a wareho use. Fo r DVD rentals, dates are reco rded at least twice: o nce when a mo vie is rented and again when the mo vie is returned. When the same underlying date dimensio n is used fo r bo th o f these purpo ses, the dimensio n is kno wn as a role-playing dimensio n.

In the SQL wo rld, we specify dimensio ns in the GROUP BY and WHERE clauses. Let's see these clauses in actio n using o ur examples.

First let's examine the data sto red in o ur database that co rrespo nds to Ruth renting "Cabin Flash" fro m the West sto re o n May 25th fo r $9 .9 9 . Fo r the sake o f experiment, we happen to kno w that this data is sto red with a paym e nt _id=4 9 1. (Just play alo ng fo r no w.) Run the fo llo wing co mmand:

INTERACTIVE SESSION:

mysql> select c.first_name, c.last_name, f.title, p.amount,

-> DATE_FORMAT(p.payment_date, '%b %D') as paymentDate, s.region

-> from payment p

-> join customer c on (p.customer_id=c.customer_id)

-> join rental r on (p.rental_id = r.rental_id)

-> join inventory i on (r.inventory_id=i.inventory_id)

-> join film f on (i.film_id=f.film_id)

-> join store s on (c.store_id=s.store_id)

-> where p.payment_id=491;

+---+---+---+---+---+---+ | first_name | last_name | title | amount | paymentDate | region | +---+---+---+---+---+---+ | RUTH | MARTINEZ | CABIN FLASH | 9.99 | May 25th | West | +---+---+---+---+---+---+ 1 row in set (0.09 sec)

Lo o ks go o d! No w let's answer o ur first questio n: How much was rented on May 25th by Ruth Martinez in the West store? Go ahead and run this co mmand:

(20)

INTERACTIVE SESSION:

mysql> select c.first_name, c.last_name, p.amount,

-> DATE_FORMAT(p.payment_date, '%b %D') as paymentDate, s.region

-> from payment p

-> join customer c on (p.customer_id=c.customer_id)

-> join store s on (c.store_id=s.store_id)

-> where day(p.payment_date)=25 and month(p.payment_date)=5

-> AND c.first_name='RUTH' and c.last_name='MARTINEZ';

+---+---+---+---+---+ | first_name | last_name | amount | paymentDate | region | +---+---+---+---+---+ | RUTH | MARTINEZ | 0.99 | May 25th | West | | RUTH | MARTINEZ | 9.99 | May 25th | West | +---+---+---+---+---+ 2 rows in set (0.15 sec)

This is co rrect, but unfo rtunately it isn't exactly what we're after. We actually want o ne ro w o f summarized data instead o f two ro ws o f detail data. We need to aggregate the am o unt fact, and make sure to GROUP BY o ur dimensio ns. Run this co mmand:

INTERACTIVE SESSION:

mysql> select c.first_name, c.last_name, sum(p.amount),

-> DATE_FORMAT(p.payment_date, '%b %D') as paymentDate, s.region

-> from payment p

-> join customer c on (p.customer_id=c.customer_id)

-> join store s on (c.store_id=s.store_id)

-> where day(p.payment_date)=25 and month(p.payment_date)=5

-> AND c.first_name='RUTH' and c.last_name='MARTINEZ'

-> GROUP BY c.first_name, c.last_name, paymentDate, s.region;

+---+---+---+---+---+ | first_name | last_name | sum(p.amount) | paymentDate | region | +---+---+---+---+---+ | RUTH | MARTINEZ | 10.98 | May 25th | West | +---+---+---+---+---+ 1 row in set (0.00 sec)

We were able to answer o ur questio n using the info rmatio n sto red in o ur current tables. So if that's the case, ho w is a data warehouse different fro m o ur existing database? Read o n...

The Dimensional Model

So why go to the tro uble o f creating a wareho use when o ur existing database has all the info rmatio n we need? It seems like we've just invented a few new terms fo r o ur existing data.

In the last lesso n we had several go o d reaso ns fo r creating a wareho use, remember? Data wareho uses pro vide: a separate system that wo n't interrupt business critical o peratio nal systems.

a single po int o f access fo r all analytical queries.

a unified and co nsistent view o f underlying data (even data fro m external systems). a straightfo rward way to analyze trends (such as mo nthly sales co mpariso ns).

Our existing database can pro vide answers to so me o f o ur pertinent questio ns, but it do esn't pro vide any o f the features listed abo ve. Data wareho uses do .

(21)

Our o riginal database is fairly co mplex. Take a lo o k at the database diagram.

Selecting Facts and Dimensions

Ho w do we cho o se which facts and dimensio ns to use in o ur data wareho use? Ask the users, of course! After all, if the system do esn't meet the needs o f the users, what go o d is it? Ask them, "Which questio ns wo uld yo u like to ask?" If they need an example, say so mething like, "I wo uld like to see ho w much the to tal sales were fo r the West sto re last week."

Co mpile tho se questio ns, o rganize them acco rding to effo rt and split them into f act s and dim e nsio ns, just like we did earlier in this lesso n. Keep in mind that facts and dimensio ns are generic terms—like the fact is "sales," no t "to tal sales," and the dimensio ns are "regio n" and "date," no t "east and west regio ns" and "mo nth." Also , wo rds like "to tal" and "to p" and "lo ngest"—they are just extra wo rds, and are no t part o f the dimensio n o r fact themselves.

Let's take the questio ns fro m earlier in the lesso n, and o rganize them acco rding to difficulty. We already split the f act s and dim e nsio ns:

1. Ho w m uch sale s did we doby m o nt h?

2. Ho w m uch sale s did we do, by st o re and by by m o nt h? 3. Ho w m any ne w cust o m e rs did we addby quart e r?

4. Ho w m any t im e s we re DVDs re nt e d, by DVD and by m o nt h?

5. Ho w m any t im e s we re DVDs re nt e d, by m o nt h and by m o vie po pularit y? 6 . Ho w m any t im e s we re DVDs re nt e d, by cust o m e r, m o nt h and st o re?

The first two questio ns use a sale s fact. The third questio n uses a Cust o m e r Co unt fact. The fo urth and fifth use a Re nt al Co unt fact.

All questio ns use a dat e dimensio n. The fo urth questio n uses a f ilm dimensio n.

The fifth questio n po ses a pro blem, tho ugh, because we do n't have any po pularity data right no w. We'll have to revisit that questio n later.

In the real wo rld, it's impo rtant to get feedback fro m end users so yo u can determine what's impo rtant to them. Yo u can always create multiple facts o r dimensio ns if end users do n't agree o n o r even kno w what's impo rtant yet. It's perfectly fine to have two o r mo re facts that are very similar if it helps end users get the info rmatio n they need.

Star Schema

No w that we've picked o ur facts and dimensio ns, its time to o rganize o ur data. Data wareho uses are typically o rganized using a star schema. Facts (measures) are sto red in fact tables at the center o f the star, and the dimensio ns surro und the measures. Facts have fo reign keys (using the integer data type) to each dimensio n table:

(22)

Here is o ur custo mer co unt fact:

Here is o ur rental co unt fact:

These separate diagrams might suggest o ur facts and dimensio ns are sto red separately, but that's no t the case. The dimensio ns are shared:

(23)

Yo u may wo nder why we're using separate tables fo r dimensio ns. Co uldn't we just put the mo vie dimensio ns next to the fact in the same table? Well, we could do this, but we sho uldn't, fo r o ne go o d reaso n: perfo rmance. It is safe to assume that yo ur fact table will beco me very large (millio ns o r even billio ns o f ro ws). Yo ur

dimensio ns may be large as well, but it is unlikely they will be nearly as large as o ur fact tables.

Suppo se yo u have ten millio n ro ws o f fact data and ten tho usand distinct mo vies. Then yo u realize so meo ne entered a film into yo ur wareho use using the name "The Dude" instead o f the film's actual name, "The Big Lebo wski." Updating every fact ro w to co rrect that mistake co uld take a very lo ng time. Even a simple query fo r "The Big Lebo wski" co uld cause the database great pain; text is much mo re difficult to index and search than integers.

Well, we've co vered a lo t in this lesso n. In the next lesso n we'll begin to implement o ur fact and dimensio n tables. See yo u there!

Copyright © 1998-2014 O'Reilly Media, Inc.

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. See http://creativecommons.org/licenses/by-sa/3.0/legalcode for more information.

(24)

Implementing the Dimensional Model, Part I

In the last lesso n we discussed facts, dimensio ns, and the star schema. No w it's time to implement what we learned!

Creating the Date Dimension

The first and mo st impo rtant dimensio n in a data wareho use is the dat e dimensio n. Mo st, if no t all, queries use o ne o r mo re dates. In o ur DVD rental sto re there are several dates captured:

1. Custo mers have a create_date. 2. Payments have a payment_date.

3. Rentals have a rental_date and a return_date.

We are go ing to use a table to create a single date dimensio n to handle all o f these dates. Let's call the table dimDate. FYI, if yo ur dimensio n is reused fo r multiple purpo ses, such as reco rding bo th rental and return dates, it is a kno wn as a role-playing dimensio n.

So , what kind o f structure sho uld we create fo r dimDate? In star schemas, facts use foreign keys of the data type integer to "po int" to dimensio ns. That means we'll use a single integer as a primary key. At first yo u might envisio n so mething like this:

Co lum n T ype

date_key integer, primary key, auto _increment

date date

This setup is a go o d start, but is it the best way to help o ur end users? If yo u recall the sample questio ns the users gave us, several wanted to see results by month. So ho w wo uld yo u extract info rmatio n abo ut a mo nth fro m a date type? Yo u co uld use a functio n like month(), but it's probably not reasonable to expect end users to use that function. A better so lutio n wo uld be to pre-calculate and pre-po pulate the mo st impo rtant date attributes as required by the users. The best way to determine what's mo st impo rtant is ask the users what they need. So let's suppo se we did ask them, and used the info rmatio n they supplied to co me up with this structure:

Co lum n T ype

date_key int e ge r, prim ary ke y

date date year smallint quarter tinyint mo nth tinyint day tinyint week tinyint is_we e ke nd bo o lean

is_ho liday bo o lean

We kept the date column because it can be used to calculate attributes that didn't make it to the table. We are not going to use an auto_increment. No rmally we wo uld use an auto_increment column, but it's much more convenient to make the key a co ded fo rmat such as yyyyMMDD. With this format, a value of 20080101 would represent January 1st, 20 0 8 .

We included two additio nal co lumns: is_we e ke nd and is_ho liday. These wo uld be useful if we wanted to co mpare weekend sales o r ho liday sales to weekday sales. We keep the number o f data types required fo r o ur co lumns to a minimum by co nsulting MySQL's do cumentatio n.

Let's go ahead and implement this table. (We'll po pulate it with data in a future lesso n.) Switch to the terminal mo de, and lo g into yo ur acco unt. Once lo gged in, co nnect to yo ur o wn MySQL database. Be sure to replace use rnam e and use rnam e with yo ur o wn user name. Type in the co de belo w at the UNIX pro mpt:

(25)

CODE TO TYPE:

cold1:~$ mysql -h sql -p -u username username

Next, create the dimDate table. Run this command: CODE TO TYPE:

CREATE TABLE dimDate (

date_key integer NOT NULL, date date NOT NULL,

year smallint NOT NULL, quarter tinyint NOT NULL, month tinyint NOT NULL, day tinyint NOT NULL, week tinyint NOT NULL, is_weekend boolean, is_holiday boolean, PRIMARY KEY(date_key) );

Execute the query. Yo u'll see Query OK, 0 rows affected.

Slowly Changing Dimensions

So there yo u are, abo ut to create a dimensio n fo r yo ur custo mers, when a business user mentio ns, " I'd like t o se e sale s acco rding t o t he cit y in which a cust o m e r live . What happe ns whe n so m e o ne m o ve s f ro m o ne cit y t o ano t he r? Will t he sale s dat a f ro m last ye ar re f le ct t hat change ?" When a dimensio n's values change o ver time, the dimensio n is kno wn as a slowly changing dimension (o r SCD). There are several ways we can deal with these changes.

T ype 0 SCD

The mo st basic SCD isn't really a change at all. If yo u do abso lutely no thing to handle a changing dimensio n, that dimensio n is Type 0. In English, a type 0 translates to , "Do n't do anything when this value changes."

T ype 1 SCD

A Type 1 SCD is o ften the easiest way to acco mmo date changing dimensio ns. In this type, ro ws in the dimensio n tables are updated when changes o ccur. Suppo se Mary Smith gets married in April and changes her last name to J o ne s. (She'll keep the same email address fo r no w.)

The o ld dimensio n ro w wo uld lo o k like this:

Cust o m e r ID First Nam e Last Nam e Em ail Cit y

1 MARY SMITH MARY.SMITH@sakilacusto mer.o rg Sasebo

(26)

Cust o m e r ID First Nam e Last Nam e Em ail Cit y

1 MARY J ONES MARY.SMITH@sakilacusto mer.o rg Sasebo

So me changes are less impo rtant than o thers. Name changes are no t always impo rtant to business users. Fo r their purpo ses, it's irrelevant whether Mary Jo nes used to be kno wn as Mary Smith. But suppo se Mary Smith mo ves fro m o ne city to ano ther in July. A Type 1 custo mer SCD wo uld simply update the existing ro w fo r Mary Smith, fo rgetting the previo us city. No w a user wo uld be unable to see sales trends acco rding to custo mer and city, because all histo rical data co ncerning Mary prio r to July wo uld no w be asso ciated with the new city.

T ype 2 SCD

Type 1 isn't the best way to handle all slo wly changing dimensio ns tho ugh. Ano ther metho d to track changes in dimensio ns is to create a new ro w in the dimensio n table when each change o ccurs, and then use be gin and e nd dates to specify the valid time perio d fo r a ro w.

The database ro w fo r Mary Smith wo uld initially lo o k like this: Cust o m e r

Ke y

First Nam e

Last

Nam e Em ail Cit y

St art

Dat e End Dat e 1 MARY SMITH MARY.SMITH@sakilacusto mer.o rg Sasebo 0 1-Jan-20 0 8 0 1-JAN-20 9 9 No w suppo se Mary Smith gets married in April and beco mes Mary Jo nes. Her dimensio n time line wo uld lo o k like this:

And her table wo uld lo o k like this: Cust o m e r

Ke y

First Nam e

Last

Nam e Em ail Cit y

St art Dat e

End Dat e 1 MARY SMITH MARY.SMITH@sakilacusto mer.o rg Sasebo 0 1-JAN-20 0 8 0 1-APR-20 0 8 2 MARY JONES MARY.SMITH@sakilacusto mer.o rg Sasebo 0 1-APR-20 0 8 0 1-JAN-20 9 9 No w let's say she mo ves fro m Sasebo to Bellevue in July, her dimensio n time line wo uld lo o k like this:

(27)

The updated dimensio n data wo uld lo o k like this: Cust o m e r Ke y First Nam e Last

Nam e Em ail Cit y

St art Dat e

End Dat e 1 MARY SMITH MARY.SMITH@sakilacusto mer.o rg Sasebo 0 1-JAN-20 0 8 0 1-APR-20 0 8 2 MARY JONES MARY.SMITH@sakilacusto mer.o rg Sasebo 0 1-APR-20 0 8 0 1-JUL-20 0 8

3 MARY JONES MARY.SMITH@sakilacusto mer.o rg Bellevue 0

1-JUL-20 0 8

0 1-JAN-20 9 9 In each o f the two tables that reflect Mary's new circumstances, there is o ne "current" ro w that has 01-JAN-2099 fo r an End Date.

Note

Instead o f using 01-JAN-2099 fo r an end date, so me wareho uses use NULL, but usually it'sbetter to use a real date instead o f NULL, because real dates can make better use o f indexes.

T ype 3 SCD

Type 2 slo wly changing dimensio ns (SCDs) allo w unlimited changes, but this might be excessive fo r so me types o f changes. Fo r example, when a po stal co de is changed, even tho ugh it's a fairly mino r change and do esn't happen that o ften, it wo uld still need to be tracked in the database. In this case, we wo uld cho o se to use the Type 3 SCD metho d.

Suppo se Mary Smith in Sasebo has her po stal co de changed fro m 3520 0 to 3520 1. The change wo uld lo o k like this:

The o ld dimensio n data wo uld lo o k like this: Cust o m e r

Ke y

First Nam e

Last

Nam e Em ail Cit y

Curre nt Po st al Co de Pre vio us Po st al Co de 1 MARY SMITH MARY.SMITH@sakilacusto mer.o rg Sasebo 3520 0

The updated dimensio n data wo uld lo o k like this: Cust o m e r

Ke y

First Nam e

Last

Nam e Em ail Cit y

Curre nt Po st al Co de Pre vio us Po st al Co de

1 MARY SMITH MARY.SMITH@sakilacusto mer.o rg Sasebo 3520 1 3520 0

The table may o r may no t have an "Effective Date" co lumn to explain when the po stal co de changed. Type 3 SCDs wo rk well fo r changes that happen infrequently, ho wever this type fails to capture multiple changes.

(28)

T ype 4 SCD

A Type 4 SCD is fairly straightfo rward; the dimensio n table always co ntains up-to -date info rmatio n. Changes are reco rded in a separate history table. This adds co mplexity to dimensio ns, but may cause co nfusio n because users must keep in mind that histo rical data is sto red in a separate lo catio n.

Fo r example, suppo se Mary mo ves fro m Sasebo to Bellevue o n July 15. The change wo uld lo o k like this:

The dimCustomer table would look like this:

Cust o m e r Ke y First Nam e Last Nam e Em ail Cit y

1 MARY SMITH MARY.SMITH@sakilacusto mer.o rg Bellevue

The dimCustomerHistory table would look like this:

Cust o m e r Ke y First Nam e Last Nam e Em ail Cit y Change Dat e

1 MARY SMITH MARY.SMITH@sakilacusto mer.o rg Sasebo 15-Jul-20 0 8

In practice, Type 1 and Type 2 are the mo st widely used ways to deal with slo wly changing dimensio ns.

Ro ws do no t have to be co mprised entirely o f a single SCD type. Fo r example, fo r many data wareho uses, the time that a custo mer name change takes place is no t significant, and the change is po sted fo r that reco rd o n-the-fly. In that case, the name co lumns wo uld be o f Type 1. Custo mer addresses are usually mo re impo rtant, so tho se co lumns wo uld be o f Type 2. It's perfectly fine to handle changes in this way.

Creating the Customer Dimension

Okay, let's create o ur Customer dimension as Type 2. But before we do, we'll review the source of data for our dimensio n. The sakila database has a table called customer which will be the source of information for this dimensio n. Switch to the seco nd SSH terminal, and lo g into yo ur acco unt. Use the co mmand line mysql pro gram to co nnect to the sakila database. When you're prompted for a password, click enter. In Unix mode, run the following co mmand:

CODE TO TYPE:

cold1:~$ mysql -h sql sakila -u anonymous -p

Once we're co nnected, we're able to see the structure o f the customer table. Run the following command against the sakila database:

CODE TO TYPE: describe customer;

As lo ng as yo u have typed everything co rrectly, and are co nnected to the sakila database (no t yo ur perso nal database) yo u'll see this:

(29)

OBSERVE:

mysql> describe customer;

+---+---+---+---+---+---+

| Field | Type | Null | Key | Default | Extra |

+---+---+---+---+---+---+

| customer_id | smallint(5) unsigned | NO | PRI | NULL | auto_increment |

| store_id | tinyint(3) unsigned | NO | MUL | NULL | |

| first_name | varchar(45) | NO | | NULL | |

| last_name | varchar(45) | NO | MUL | NULL | |

| email | varchar(50) | YES | | NULL | |

| address_id | smallint(5) unsigned | NO | MUL | NULL | |

| active | tinyint(1) | NO | | 1 | |

| create_date | datetime | NO | | NULL | |

| last_update | timestamp | NO | | CURRENT_TIMESTAMP | |

+---+---+---+---+---+---+

9 rows in set (0.00 sec)

The table has a lo t o f info rmatio n. Observe that it co ntains a co lumn called address_id. This indicates that the address info rmatio n is sto red in a different table. Let's take a lo o k at the address table. Now run the following co mmand against the sakila database:

CODE TO TYPE: describe address; Yo u'll see these results:

(30)

OBSERVE:

mysql> describe address;

+---+---+---+---+---+---+

| Field | Type | Null | Key | Default | Extra |

+---+---+---+---+---+---+

| address_id | smallint(5) unsigned | NO | PRI | NULL | auto_increment |

| address | varchar(50) | NO | | NULL | |

| address2 | varchar(50) | YES | | NULL | |

| district | varchar(20) | NO | | NULL | |

| city_id | smallint(5) unsigned | NO | MUL | NULL | |

| postal_code | varchar(10) | YES | | NULL | |

| phone | varchar(20) | NO | | NULL | |

| last_update | timestamp | NO | | CURRENT_TIMESTAMP | |

+---+---+---+---+---+---+

8 rows in set (0.00 sec)

See the co lumn cit y_id? It is a fo reign key to the table city. Take a look at that table. Then run the following command against the sakila database:

CODE TO TYPE: describe city;

Yo u'll see the fo llo wing structure: OBSERVE:

mysql> describe city;

+---+---+---+---+---+---+

| Field | Type | Null | Key | Default | Extra |

+---+---+---+---+---+---+

| city_id | smallint(5) unsigned | NO | PRI | NULL | auto_increment |

| city | varchar(50) | NO | | NULL | |

| country_id | smallint(5) unsigned | NO | MUL | NULL | |

| last_update | timestamp | NO | | CURRENT_TIMESTAMP | |

+---+---+---+---+---+---+

4 rows in set (0.00 sec)

It lo o ks like this table references yet ano ther table, using co unt ry_id. Let's take a lo o k at that table as well. Then run the fo llo wing co mmand against the sakila database:

CODE TO TYPE: describe country;

(31)

Yo u'll see these results: OBSERVE:

mysql> describe country;

+---+---+---+---+---+---+

| Field | Type | Null | Key | Default | Extra |

+---+---+---+---+---+---+

| country_id | smallint(5) unsigned | NO | PRI | NULL | auto_increment |

| country | varchar(50) | NO | | NULL | |

| last_update | timestamp | NO | | CURRENT_TIMESTAMP | |

+---+---+---+---+---+---+

3 rows in set (0.00 sec)

No w we'll co mbine all o f these tables into o ur single dimCustomer dimension. At first glance, we might come up with the fo llo wing structure, reusing cust o m e r_id fro m o ur so urce customer table as the primary key for dimCustomer.

OBSERVE:

CREATE TABLE dimCustomer (

customer_id smallint(5) unsigned NOT NULL,

first_name varchar(45) NOT NULL, last_name varchar(45) NOT NULL, email varchar(50),

address varchar(50) NOT NULL, address2 varchar(50),

district varchar(20) NOT NULL, city varchar(50) NOT NULL,

country varchar(50) NOT NULL, postal_code varchar(10), phone varchar(20) NOT NULL,

active tinyint(1) NOT NULL, create_date datetime NOT NULL,

last_update datetime NOT NULL, PRIMARY KEY(customer_id)

);

We could simply reuse this co lumn as the key o n o ur dimensio n table, but it's no t a go o d practice because:

It fo rces o ur slo wly changing dimensio n to be T ype 1 instead o f Type 2 since custo mer_id must be unique fo r all ro ws in the table.

Changes in the so urce customer table could break the keys in the data warehouse.

Co mbining multiple so urces o f data into a single dimCustomer dimension is impossible if we rely on keys generated in o ne system.

We'll avo id po tential pro blems by keeping o ur o riginal key, cust o m e r_id, and using o ur o wn cust o m e r_ke y surro gate key.

We'll po pulate the Type 2 slo wly changing dimensio n (SCD) in a future lesso n, but fo r no w we'll just create the table. Switch back to the first SSH mo de, the o ne that's co nnected to yo ur perso nal database. Then run the fo llo wing co mmand against yo ur perso nal database:

(32)

CODE TO TYPE:

CREATE TABLE dimCustomer (

customer_key int NOT NULL AUTO_INCREMENT, customer_id smallint(5) unsigned NOT NULL, first_name varchar(45) NOT NULL,

last_name varchar(45) NOT NULL, email varchar(50),

address varchar(50) NOT NULL, address2 varchar(50),

district varchar(20) NOT NULL, city varchar(50) NOT NULL, country varchar(50) NOT NULL, postal_code varchar(10),

phone varchar(20) NOT NULL, active tinyint(1) NOT NULL, create_date datetime NOT NULL, start_date date NOT NULL, end_date date NOT NULL, PRIMARY KEY(customer_key) );

Execute the query. If everything went o kay yo u will see this: Query OK, 0 rows affected.

Snowflake Schemas

Fo r o ur custo mer dimensio n, we've taken fo ur tables and co llapsed them into o ne table. Why did we do this? Sim plicit y.

One o f the go als o f a data wareho use is to create a simple structure that users can query easily. Multiple tables means multiple jo ins, and added co mplexity. Here we traded disk space fo r simplicity.

We can also wo rk in the o ppo site directio n, using mo re co mplex schemas when o ur purpo se calls fo r that. Addresses represent such a hierarchy. Co unt rie s have St at e s (o r regio ns), and states have Cit ie s. So me business users may be interested in seeing sales data by co unt ry, whereas o thers may be interested in viewing sales data by st at e o r by cit y. One way to deal with this hierarchy is with a snowflake schema. In a snowflake schema yo u split a dimensio n into o ne "primary" dimensio n table and o ne o r mo re snowflake tables. It lo o ks like this:

Sno wflake schemas are also an effective way to handle a different type o f pro blem. Suppo se o ur DVD sto re starts to rent DVDs o ver the internet. Our sto re no w has two types o f custo mers - Internet custo mers and In Store custo mers. We kno w very little abo ut the In Store custo mers; perhaps we o nly kno w their telepho ne numbers and ho me addresses. By co mpariso n we kno w a lo t abo ut o ur Internet custo mers; we might have their email addresses, telepho ne numbers, physical addresses, mo vie preferences, and the number o f times

(33)

they have visited the web site.

Suppo se we have 10 ,0 0 0 custo mers - 2,50 0 In Store and 7,50 0 Internet custo mers. We may actually cause co nfusio n (and waste a lo t o f disk space) by sho ving bo th types o f custo mers into a single table. Also , we may want to see sales statistics as they vary by cust o m e r t ype.

We can use a snowflake schema to help with that as well. In this situatio n there wo uld be o ne custo mer dimensio n that wo uld ho ld attributes shared amo ng Internet and In Sto re custo mers. Two additio nal tables -dimCustomerInternet and dimCustomerInStore wo uld sto re type-specific attributes. This schema lo o ks like this:

Fo r right no w, we'll stick to o ur simple star schema.

We co vered a lo t o f material in this lesso n! We'll create the rest o f o ur dimensio ns in the next lesso n. See yo u there! Copyright © 1998-2014 O'Reilly Media, Inc.

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. See http://creativecommons.org/licenses/by-sa/3.0/legalcode for more information.

(34)

Implementing The Dimensional Model, Part II

We began implementatio n o f the dimensio nal mo del in the last lesso n. No w it's time to finish creating o ur dimensio ns, and then create o ur facts.

Creating the Movie Dimension

Befo re we implement dimMovie, let's examine the data source: the film table. Connect to the sakila database. Now in Unix mo de, run this co mmand:

CODE TO TYPE:

cold1:~$ mysql -h sql sakila -u anonymous

Once yo u're co nnected, we'll examine the structure o f the film table. Run the following command against the sakila database:

CODE TO TYPE: describe film; The table lo o ks like this:

OBSERVE:

mysql> describe film;

+---+---+---+---+---+---+

| Field | Type | Null | Key | Default | Extra |

+---+---+---+---+---+---+

| film_id | smallint(5) unsigned | NO | PRI | NULL | auto_increment |

| title | varchar(255) | NO | MUL | NULL | | | description | text | YES | | NULL | | | release_year | year(4) | YES | | NULL | |

| language_id | tinyint(3) unsigned | NO | MUL | NULL | |

| original_language_id | tinyint(3) unsigned | YES | MUL | NULL | |

| rental_duration | tinyint(3) unsigned | NO | | 3 | |

| rental_rate | decimal(4,2) | NO | | 4.99 | |

| length | smallint(5) unsigned | YES | | NULL | |

| replacement_cost | decimal(5,2) | NO | | 19.99 | |

| rating | enum('G','PG','PG-13','R','NC-17') | YES | | G | |

| special_features | set('Trailers','Commentaries','Deleted Scenes','Behind the Sce nes') | YES | | NULL | |

| last_update | timestamp | NO | | CURRENT_TIMESTAMP | |

+---+---+---+---+---+---+

13 rows in set (0.01 sec)

(35)

sto red in o ur data wareho use that allo w us to answer questio ns like, "What is o ur pro fit (amo unt o f rental inco me, minus film co st) fo r each mo vie?" But since that and similar questio ns are o utside the sco pe fo r o ur current pro ject, we'll o mit tho se facts fro m o ur data wareho use and free up so me space.

So it lo o ks like we have two fo reign keys: language _id and o riginal_language _id. Bo th po int to the language table. Take a lo o k. Run the fo llo wing co mmand against the sakila database:

CODE TO TYPE: describe language;

Execute the line to see the structure o f language. OBSERVE:

mysql> describe language;

+---+---+---+---+---+---+ | Field | Type | Null | Key | Default | Extra | +---+---+---+---+---+---+ | language_id | tinyint(3) unsigned | NO | PRI | NULL | auto_increment | | name | char(20) | NO | | NULL | | | last_update | timestamp | NO | | CURRENT_TIMESTAMP | | +---+---+---+---+---+---+ 3 rows in set (0.00 sec)

We'll co nso lidate these tables into dimMovie. As for changes, they are fairly infrequent in this case, so we'll implement a Type 1 slo wly changing dimensio n. Switch to the terminal mo de, and lo g into yo ur acco unt. Once yo u're lo gged in, co nnect to yo ur o wn MySQL database. Be sure to replace use rnam e and use rnam e with yo ur o wn user name. Then type the fo llo wing at the UNIX pro mpt:

CODE TO TYPE:

cold1:~$ mysql -h sql -p -u username username

Next, run the statement belo w, against yo ur perso nal database, in o rder to create the dimMovie table: CODE TO TYPE:

CREATE TABLE dimMovie (

movie_key int NOT NULL AUTO_INCREMENT, film_id smallint(5) unsigned NOT NULL, title varchar(255) NOT NULL,

description text, release_year year(4),

language varchar(20) NOT NULL, original_language varchar(20),

rental_duration tinyint(3) unsigned NOT NULL, length smallint(5) unsigned NOT NULL, rating varchar(5) NOT NULL,

special_features varchar(60) NOT NULL, PRIMARY KEY (movie_key)

);

Once again, yo u'll see Query OK, 0 rows affected.

Creating the Store Dimension

No w it's time to implement dimStore. Take a look at the data source: the store table. Switch terminals so that you're using the sakila database. Run the following command against the sakila database:

(36)

CODE TO TYPE: describe store;

Execute the line to see the structure o f the store table. OBSERVE:

mysql> describe store;

+---+---+---+---+---+---+

| Field | Type | Null | Key | Default | Extra |

+---+---+---+---+---+---+

| store_id | tinyint(3) unsigned | NO | PRI | NULL | auto_incre ment |

| manager_staff_id | tinyint(3) unsigned | NO | UNI | NULL | |

| address_id | smallint(5) unsigned | NO | MUL | NULL | |

| last_update | timestamp | NO | | CURRENT_TIMESTAMP | |

| region | varchar(10) | YES | | NULL | |

+---+---+---+---+---+---+

5 rows in set (0.00 sec)

Our versio n o f the sakila database is slightly different than the versio n distributed by MySQL. Our versio n includes a region co lumn. Our table also includes an addre ss_id co lumn. (Feel free to refer back to the previo us lesso n if yo u want to go o ver the address table structure again.)

The next interesting aspect to this table is the m anage r_st af f _id co lumn. This co lumn is a fo reign key to staff. Let's take a lo o k at that table no w. Run the fo llo wing co mmand against the sakila database:

CODE TO TYPE: describe staff;

(37)

OBSERVE:

mysql> describe staff;

+---+---+---+---+---+---+

| Field | Type | Null | Key | Default | Extra |

+---+---+---+---+---+---+

| staff_id | tinyint(3) unsigned | NO | PRI | NULL | auto_increment |

| first_name | varchar(45) | NO | | NULL | |

| last_name | varchar(45) | NO | | NULL | |

| address_id | smallint(5) unsigned | NO | MUL | NULL | |

| picture | blob | YES | | NULL | |

| email | varchar(50) | YES | | NULL | |

| store_id | tinyint(3) unsigned | NO | MUL | NULL | |

| active | tinyint(1) | NO | | 1 | |

| username | varchar(16) | NO | | NULL | |

| password | varchar(40) | YES | | NULL | |

| last_update | timestamp | NO | | CURRENT_TIMESTAMP | |

+---+---+---+---+---+---+

11 rows in set (0.00 sec)

We'll merge the staff table into a single dimStore dimension, and omit many of the columns from staff such as picture, email, address, username, and passwo rd. Since sto res may change managers, we'll make o ur dimensio n a Type 2 SCD so we can track management changes accurately o ver time. That will require two additio nal co lumns: start_date and end_date. Feel free to review the Type 2 SCD sectio n in the third lesso n if yo u like.

Switch terminals so that yo u're using yo ur perso nal database. No w let's create o ur dimensio n! Run the co mmand belo w against yo ur perso nal database:

CODE TO TYPE:

CREATE TABLE dimStore (

store_key int NOT NULL AUTO_INCREMENT, store_id smallint(5) unsigned NOT NULL, address varchar(50) NOT NULL, address2 varchar(50),

district varchar(20) NOT NULL, city varchar(50) NOT NULL,

country varchar(50) NOT NULL, postal_code varchar(10), region varchar(10),

manager_first_name varchar(45) NOT NULL, manager_last_name varchar(45) NOT NULL, start_date date NOT NULL,

end_date date NOT NULL, PRIMARY KEY (store_key)

);

(38)

Creating Facts

No w that o ur dimensio ns have been created, we can implement o ur f act s. Fact tables are fairly straightfo rward; they co ntain fo reign keys to all dimensio n tables, and a single co lumn fo r the fact value.

Let's get started!

Sales

Our sales data will co me fro m the payment table in the sakila database. Let's take a look. Switch back to the sakila database and run this co mmand:

CODE TO TYPE: describe payment;

Execute the line to see the structure o f payment: OBSERVE:

mysql> describe payment;

+---+---+---+---+---+---+

| Field | Type | Null | Key | Default | Extra |

+---+---+---+---+---+---+

| payment_id | smallint(5) unsigned | NO | PRI | NULL | auto_in crement |

| customer_id | smallint(5) unsigned | NO | MUL | NULL | |

| staff_id | tinyint(3) unsigned | NO | MUL | NULL | |

| rental_id | int(11) | YES | MUL | NULL | |

| amount | decimal(5,2) | NO | | NULL | |

| payment_date | datetime | NO | | NULL | |

| last_update | timestamp | NO | | CURRENT_TIMESTAMP | |

+---+---+---+---+---+---+

7 rows in set (0.00 sec)

We'll pay particular attentio n to the am o unt co lumn. It will be the basis fo r o ur factSales table.

Note

Make sure to review the so urces o f yo ur facts, so yo u do n't implement the wro ng data type. Switch back to yo ur perso nal database. Let's create o ur fact. Run the co mmand belo w against yo ur perso nal database:

(39)

CODE TO TYPE:

CREATE TABLE factSales (

sales_key INT NOT NULL AUTO_INCREMENT,

date_key INT NOT NULL, customer_key INT NOT NULL, movie_key INT NOT NULL, store_key INT NOT NULL,

sales_amount decimal(5,2) NOT NULL,

FOREIGN KEY fk_date (date_key) REFERENCES dimDate(date_key),

FOREIGN KEY fk_customer (customer_key) REFERENCES dimCustomer(customer_key), FOREIGN KEY fk_movie (movie_key) REFERENCES dimMovie(movie_key),

FOREIGN KEY fk_store (store_key) REFERENCES dimStore(store_key), PRIMARY KEY (sales_key)

);

Once again, if everything went acco rding to plan, yo u'll see Query OK, 0 rows affected.

A single ro w in factSales will represent the amount of sales for a specific date, for a specific customer, for a specific mo vie, at a specific sto re.

Yo u might think that the primary key sho uld be a co mpo site key acro ss all fo reign keys to the dimensio ns. After all, these co lumns sho uld uniquely identify a fact ro w, right? But the pro blem with that type o f primary key is that it tends to be very wide. To start, create a primary key o n the surro gate key alo ne - sale s_ke y. This will give yo u o ptimum flexibility when evaluating future indexing strategies.

CustomerCount

No w we'll implement o ur factCustomerCount. The factCustomerCount is a tally of the number of

custo mers who created acco unts with o ur sto re. This table do es no t have a fo reign key to dimMovie because the number o f custo mers isn't relative to any particular mo vie.

We'll examine the so urce fo r this data in a future lesso n. Fo r no w, let's create the fact. Make sure yo u are using yo ur perso nal database. Review the fo llo wing CREATE TABLE statement:

OBSERVE:

CREATE TABLE factCustomerCount (

customerCount_key INT NOT NULL AUTO_INCREMENT,

date_key INT NOT NULL, customer_key INT NOT NULL,

store_key INT NOT NULL, customer_count INT NOT NULL,,

FOREIGN KEY fk_date (date_key) REFERENCES dimDate(date_key),

FOREIGN KEY fk_customer (customer_key) REFERENCES dimCustomer(customer_key), FOREIGN KEY fk_store (store_key) REFERENCES dimStore(store_key),

PRIMARY KEY (customerCount_key) );

A single ro w in this table represents a specific custo mer who created an acco unt o n a specific day, at a specific sto re.

Befo re yo u execute the co mmand, take a clo ser lo o k at the cust o m e r_co unt measure. What values might it have?

Since cust o m e r_ke y po ints to exactly o ne custo mer, cust o m e r_co unt will always have the value o f 1.

Since cust o m e r_co unt will always be 1, we co uld o mit the co lumn fro m the table. Ho wever we will leave it in o ur table since it will make it easier fo r business users to query the table.

Since factCustomerCount doesn't have any "real" facts, it is known as a f act less f act . There will be no measure co lumns in this table, o nly fo reign keys to dimensio ns. Factless facts are go o d at sto ring events.

(40)

Let's create the table. This time we'll specify a de f ault value o f 1 o n cust o m e r_co unt. Run this co mmand against yo ur perso nal database:

CODE TO TYPE:

CREATE TABLE factCustomerCount (

customerCount_key INT NOT NULL AUTO_INCREMENT, date_key INT NOT NULL,

customer_key INT NOT NULL, store_key INT NOT NULL,

customer_count INT NOT NULL DEFAULT 1,

FOREIGN KEY fk_date (date_key) REFERENCES dimDate(date_key),

FOREIGN KEY fk_customer (customer_key) REFERENCES dimCustomer(customer_key), FOREIGN KEY fk_store (store_key) REFERENCES dimStore(store_key),

PRIMARY KEY (customerCount_key) );

Again, as lo ng as yo u see Query OK, 0 rows affected, you're all set!

RentalCount

Our final fact is factRentalCount. It's similar to factCustomerCount in that it is also a f act less f act . As such, we'll also specify a default value fo r the re nt al_co unt co lumn. (We'll po pulate this table in a future lesso n.) Run this co mmand against yo ur perso nal database:

(41)

CODE TO TYPE:

CREATE TABLE factRentalCount (

rentalCount_key INT NOT NULL AUTO_INCREMENT,

date_key INT NOT NULL, customer_key INT NOT NULL, movie_key INT NOT NULL, store_key INT NOT NULL,

rental_count INT NOT NULL DEFAULT 1,

FOREIGN KEY fk_date (date_key) REFERENCES dimDate(date_key),

FOREIGN KEY fk_customer (customer_key) REFERENCES dimCustomer(customer_key), FOREIGN KEY fk_movie (movie_key) REFERENCES dimMovie(movie_key),

FOREIGN KEY fk_store (store_key) REFERENCES dimStore(store_key), PRIMARY KEY (rentalCount_key)

);

Yo u sho uld see: Query OK, 0 rows affected.

A single ro w in this table represents a specific custo mer who rented an specific mo vie, o n a specific day, at a specific sto re.

We co vered a lo t o f material in this lesso n. No w that o ur dimensio ns and facts are implemented, we'll develo p a strategy fo r transferring data fro m o ur OLTP database to o ur OLAP database. See yo u in the next lesso n!

Copyright © 1998-2014 O'Reilly Media, Inc.

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. See http://creativecommons.org/licenses/by-sa/3.0/legalcode for more information.

(42)

Extract, Transform, Load (ETL)

Welco me back! In the last few lesso ns we implemented the dimensio nal mo del. No w it's time to figure o ut ho w to po pulate the tables we created.

What is ETL?

ET L is an acro nym fo r Extract, T ransfo rm, and Lo ad. It's the pro cess that takes data fro m o ne o r mo re so urce systems, transfo rms and cleanses that data, and lo ads the result into the data wareho use.

Yo u might be thinking "This sounds simple! We learned about exports and imports in the last course!"

And actually, we did learn abo ut the E and L in the last co urse, but we didn't co ver any T ransfo rmatio ns. An example o f a Transfo rmatio ns wo uld be co nverting co des like "D" within a so urce system, to a wo rd like "Deleted" within a destinatio n.:

And we still haven't learned ho w to auto mate data lo ading, o r handle failures auto matically and gracefully. Failure is no t always an o ptio n, but when yo u're do ing bulk expo rts and lo ads it's no t to o difficult to handle. If an expo rt fails due to a disk space issue, yo u free up so me space and try again. If an impo rt fails, yo u figure o ut what went wro ng and try again.

Failure may no t be an o ptio n with ET L. Data wareho uses ho ld a lo t o f data, and mo st o f that data is pro cessed o n a daily (o r even ho urly) basis and is extremely time sensitive. If yo u miss a day o f pro cessing, yo u may lo se data. The o nly way to handle a large vo lume o f co mplex data is to have an auto mated ETL pro cess.

Logging and Auditing

We kno w which data we want to pull into the data wareho use, and we kno w the destinatio n tables fo r that data. There are several bits o f info rmatio n to be lo gged during an ETL pro cess. First, yo u'll track the dat a co unt (the number o f ro ws) transfered at each step in the pro cess. This is useful info rmatio n fo r a few go o d reaso ns:

We can make sure no ro ws are "lo st" in the ETL pro cess.

We can detect abno rmal data; if we pro cess 10 0 0 ro ws o ne day , but the next day we o nly pro cess 10 , we'll kno w that so mething pro bably went wro ng.

In the future we can use this captured data to predict future capacity needs. Other useful bits o f info rmatio n are the st art and e nd t im e s o f the pro cess:

They can be used to alert us to pro blems. They also allo w us to plan fo r future capacity.

Lo gging is no t always co mplex. While we co uld use a single table to track this info rmatio n, we'll split it into two tables: e t lRuns and e t lLo g. Switch to a terminal, and lo g into yo ur acco unt, then co nnect to yo ur perso nal database. Run this co mmand against yo ur perso nal database:

(43)

CODE TO TYPE:

CREATE TABLE etlRuns (

run_id integer NOT NULL AUTO_INCREMENT, start_time datetime NOT NULL,

end_time datetime, PRIMARY KEY(run_id) );

Next we'll create the e t lLo g table, which will be used to lo g messages and statistics. Many o f these co lumns are TOS-specific (Talend Open Studio -TOS-specific. We'll explain mo re abo ut Talend in the next lesso n). We will see them again when we implement lo gging in a later lesso n. So me o f this info rmatio n wo n't be useful fo r every wareho use; it is up to yo u to decide the amo unt and type lo gging yo u need. Run the fo llo wing co mmand against yo ur perso nal database:

CODE TO TYPE: CREATE TABLE etlLog (

run_id integer NOT NULL, moment datetime NOT NULL, pid varchar(20), father_pid varchar(20), root_pid varchar(20), system_pid double, project varchar(50), job varchar(50), job_repository_id varchar(255), job_version varchar(255), context varchar(50), priority int, origin varchar(255), message_type varchar(255), message varchar(255), code int, duration double, count int, reference int, thresholds varchar(255), key(run_id) );

With these tables in place, we are ready to tackle audit ing.

Why do we need auditing? Suppo se we co me into wo rk o ne day to find that o ur daily sales jumped o vernight fro m $10 ,0 0 0 to $1,0 0 0 ,0 0 0 . Yo u kno w that the co mpany did no t sell $1,0 0 0 ,0 0 0 in o ne day, but ho w do yo u track do wn the pro blem?

Yo u can use the auditing features o f the data wareho use to debug the pro blem. Auditing allo ws us to link ro ws in tables with specific "runs" via the run_id column in etlLog. Each row in our fact and dimension tables will have a run_id co lumn, letting us kno w exactly when that data was added to the wareho use.

We implemented o ur dimensio ns and fact tables in the prio r lesso n, and tho se tables do n't have any co lumns related to auditing. We did create so me dimensio ns type 2 SCD, but they can't be used fo r auditing purpo ses. Instead we'll need to add a run_id column to all of those tables.

References

Related documents

Methods: All cases were subjected to thorough history taking including autistic symptoms and sensory abnormalities, comprehensive medical examination, psychiatric assessment

It is therefore prudent to consider OpenStack as a software-based operations ecosystem that requires a sufficiently large team of operators to design, deploy, operate,

Uncultivated microbes and specific changes in community composition of selected samples of mudflat sediment (0-2 mm oiled and non-oiled from 2-h, days 3 and 12; 2-4 mm oiled and

This study was aimed to investigate the roles of a local grocery store for a rural community by considering the concept of sustainable community development (SCD), sense of community

For example, set one group to route SIP calls within your enterprise to an Avaya Converged Communications Server on your LAN, and set another group to route calls through your

Courts have held that nearly any securities violation, occurring within or outside the United States borders, and having a real impact on investors or markets in the United

Umeå International Jazz Festival, Umeå Open, House of metal, Kulturnatta, Littfest, Umeå Autumn Lights, Holmön Folk Song Festival, Sami Week, MADE in Umeå, Umeå Football Festival and

A study was conducted to determine the reproductive performance, success rate and factors affecting conception rates following artificial insemination (Al) atAmbewela,