• No results found

Now perf orm these steps:

In document DBA 3: Creating a Data Warehouse (Page 90-94)

1. Dro p a t MysqlOut put co mpo nent o n the canvas. 2. Set the co nnectio n to the Re po sit o ry -->

Dat aWare ho use .

3. Link the o utput o f t Map to yo ur new t MysqlOut put co mpo nent.

4. Set the table to "dimMovie".

5. Change the Act io n o n dat a to Update or Insert.

When yo u're do ne, t MysqlOut put sho uld lo o k similar to this:

What do es the Act io n o n dat a setting o f Update or Insert mean?

Go o d questio n! Update or Insert means that fo r each ro w o f data sent to t MysqlOut put , TOS will determine if the ro w exists in the database table dimMovie. If it exists and is different, the row will be updated. If it do es no t exist, the ro w is inserted.

Well then, yo u ask, how does TOS know if the row exists?. Ano ther go o d questio n. TOS lo o ks at the Ke y that we specified in the schema. In this case, we set the co lumn film_id to be a key column, so this column is used to check to see whether the ro w exists. All co lumns specified as keys are checked when perfo rming an insert o r update.

Ano ther questio n yo u might be asking yo urself is, why don't we just delete data from dimMovie and reload the whole thing?

Yet ano ther go o d questio n. The answer is: f o re ign ke ys. Our primary key fo r dimMovie is an auto increment field called movie_key. This column will be used to link facts to this dimension. If we wipe out dimMovie each time we run o ur lo ad, we'll always have to relo ad all facts as well. This might be o kay in the sho rt term, but at so me po int we may no t want to relo ad everything in o ur data wareho use. Using "insert o r update" means that existing data do es no t get deleted, so a movie_key is preserved across runs. This is important even if the underlying database do es no t enfo rce fo reign key co nstraints.

No w that we have o ur jo b do ne, it's time to run it! Click o n the butto n at the to p o f the screen. If everything go es well, yo u'll o nly see two lines o f o utput:

OBSERVE:

Starting job ProcessDataWarehouse at 13:29 09/06/2009.

Job ProcessDataWarehouse ended at 13:32 09/06/2009. [exit code=0]

Yo u can also check the database to see what has been lo gged fo r yo ur run. Switch to a terminal, and lo g into yo ur perso nal database. Run the fo llo wing co mmand against yo ur perso nal database:

CODE TO TYPE:

mysql> select * from etlRuns;

Yo ur results wo n't be exactly the same, but they sho uld lo o k similar to the fo llo wing: OBSERVE:

mysql> select * from etlRuns;

+---+---+---+ | run_id | start_time | end_time | +---+---+---+ | 1 | 2009-06-09 11:49:07 | 2009-06-09 11:49:12 | +---+---+---+ 1 row in set (0.00 sec)

Next, check the etlLog table. Run the following command against your personal database: CODE TO TYPE:

mysql> select * from etlLog;

OBSERVE:

mysql> select * from etlLog;

+---+---+---+---+---+---+--- +---+---+---+---+---+-- ---+---+---+---+---+---+---+---+ | run_id | moment | pid | father_pid | root_pid | system_pid | project | job | job_repository_id | job_version | context | priority | o rigin | message_type | message | code | duration | count | reference | thresholds | +---+---+---+---+---+---+--- +---+---+---+---+---+-- ---+---+---+---+---+---+---+---+ | 1 | 2009-06-09 11:49:07 | GOrr6m | GOrr6m | GOrr6m | 8356 | DBA3 | ProcessDataWarehouse | _pzsw4GWKEd6GbtKHsp1gXA | 0.1 | Default | NULL | N ULL | begin | NULL | NULL | NULL | NULL | NULL | NULL | | 1 | 2009-06-09 11:49:12 | GOrr6m | GOrr6m | GOrr6m | 8356 | DBA3 | ProcessDataWarehouse | _pzsw4GWKEd6GbtKHsp1gXA | 0.1 | Default | NULL | N ULL | end | success | NULL | 5594 | NULL | NULL | NULL | +---+---+---+---+---+---+--- +---+---+---+---+---+-- ---+---+---+---+---+---+---+---+ 2 rows in set (0.00 sec)

These results sho w that o ur run was succe ssf ul, and to o k 5 5 9 4 milliseco nds (~5 seco nds) to run. This is duplicated by the etlRuns table, which shows the st art and e nd t im e s o f the jo b. We're lo o king pretty go o d!

Performance

When develo ping a so ftware system, it is o ften best to start with a simple so lutio n and mo ve to a mo re co mplex so lutio n as develo pment go es o n:

Our mo vie dimensio n is very small - it o nly has 10 0 0 ro ws. Since it is so small, it is acceptable to recreate this dimensio n (alo ng with o ther related facts) fro m scratch each day. This so lutio n is simple and wo rks well fo r small data wareho uses.

But what if o ur mo vie dimensio n had 50 0 ,0 0 0 ro ws in it? What if o ur so urce system was an ancient co mputer,

requiring 10 ho urs to extract mo vie data? Running a data lo ad fo r 10 ho urs everyday wo uld no t be a great o ptio n; even if the rest o f the wareho use pro cessing o nly to o k a minute o r two , there wo uld o nly be 14 ho urs left fo r wareho use use. Sho uld the mo vie dimensio n gro w to be 1,0 0 0 ,0 0 0 ro ws, the wareho use lo ad might take 20 ho urs to co mplete! When dimensio ns are large, it is necessary to add co mplexity to the wareho use in o rder to reduce lo ad times. The first step to ward o ptimizing perfo rmance seems straightfo rward: o nly que ry t he so urce syst e m f o r ne w and change d re co rds. Our audit tables capture the date and time that the dimensio n was updated. That time stamp can be used to select reco rds in the so urce system. But this pro cess is o ften mo re difficult than it initially seems.

has a co lumn called last_update which should get set when the row is created, and updated when the row changes. But what if it had a co lumn called date_created instead? How would we know when a row had changed?

In many situatio ns yo u will have to make changes to so urce systems to make data wareho use lo ads easier to

manage. This might invo lve adding time stamp co lumns, mo difying existing co lumns, o r even creating co mpletely new tables. Keep this in mind as we mo ve fo rward.

We've co vered a lo t in this lesso n! Stay tuned - in the next lesso n we'll co ntinue wo rking with o ur dimensio ns and learn ho w to pro cess slowly changing dimensions. See yo u then!

Copyright © 1998-2014 O'Reilly Media, Inc.

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. See http://creativecommons.org/licenses/by-sa/3.0/legalcode for more information.

In document DBA 3: Creating a Data Warehouse (Page 90-94)