
DBA 3: Data Warehousing (PDF)

Lesson 1: Introduction
Using the Learning Sandbox Environment
Data Warehousing
Lesson 2: A Data Warehouse
Facts and Dimensions
Facts
Dimensions
The Dimensional Model
Selecting Facts and Dimensions
Star Schema
Lesson 3: Implementing the Dimensional Model, Part I
Creating the Date Dimension
Slowly Changing Dimensions
Type 0 SCD
Type 1 SCD
Type 2 SCD
Type 3 SCD
Type 4 SCD
Creating the Customer Dimension
Snowflake Schemas
Lesson 4: Implementing the Dimensional Model, Part II
Creating the Movie Dimension
Creating the Store Dimension
Creating Facts
Sales
CustomerCount
RentalCount
Lesson 5: Extract, Transform, Load (ETL)
What is ETL?
Logging and Auditing
Getting Data into the Warehouse
dimDate
dimCustomer
dimMovie
dimStore
Lesson 6: Tools for ETL
ETL--Past, Present, and Future
Getting Started with Talend Open Studio
Your First TOS Job
Lesson 7: ETL: The Date Dimension
Job Structure
Loading Data from Excel
Adding Columns to our Data Flow
Adding Data to dimDate
If you run into problems...
Lesson 8: Basic Dimension Processing
Loading dimMovie
Job Structure
Pre and Post Job
Logging
dimMovie
Performance
Lesson 9: SCD Processing
The Algorithm: Slowly Changing Dimensions
Implementing the Dimensions
dimCustomer
Does our SCD work?
dimStore
Lesson 10: Processing Facts, Part I
Orchestration
factCustomerCount
Lesson 11: Processing Facts, Part II
factSales
Lesson 12: Special Facts
Missing Keys
Debugging tELTMysqlMap
Handling Missing Keys
Aggregating
Deaggregating Data
Early Arriving Facts
Lesson 13: Querying a Relational Data Warehouse
Viewing Data
Answering Questions
Problems with Queries
Bad Joins
Incorrect Filtering
Lesson 14: Final Project
Northwind Traders
fp_dimDate
fp_dimEmployees
fp_dimCustomers
fp_dimSuppliers
fp_dimProducts
fp_dimOrders
Order Unit Price
Supplier Unit Price
Copyright 1998-2013 O'Reilly Media, Inc.
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
See http://creativecommons.org/licenses/by-sa/3.0/legalcode for more information.

Getting Started with Talend Open Studio


DBA 3: Data Warehousing Lesson 0
Getting Started with Talend Open Studio
Note

These instructions will only show up here, at the beginning of this course. Although you probably won't
need these instructions again, feel free to bookmark or print them out, just in case.

In this course we'll be using Talend Open Studio, an Eclipse-based data warehousing tool. Our version of Talend
Open Studio, or TOS, is nearly identical to the version you can get for free from Talend's website. We've added a few
features to enhance your learning experience. Our plug-in allows you to view the course, program your labs, submit
your projects, and receive your grades and comments, all without leaving Talend Open Studio.
We will be using a Terminal Service session, or thin client. A thin client allows you to connect to a remote terminal
server running Talend Open Studio. The terminal server is a computer that serves a desktop to other computers
via the network. You will be using this thin client to access Talend Open Studio on our Windows servers:

Your machine will send mouse and keyboard information to our server. Our server will send the resulting visual output
back to your computer. It will seem just like you are running Talend Open Studio on your own machine, but
in reality it is being run on our server and returned to your machine. We call our system the Learning Sandbox. The
Learning Sandbox is a safe place where you can write and execute your own programs without worrying about
breaking your own computer. It also gives you the ability to work from anywhere there's an internet connection. Since
all of your work is stored on our server, there are no disks or USB drives to carry around.
In a moment you will be asked to switch windows back to your student start page and to press the Enter button for
your course:

After you follow the specific instructions for your machine, return to the student startup window.

Connecting to the Sandbox - Windows


These instructions are for Windows users only.
Switch back to your student start page, and press the Enter button for your course. You will see this:

Make sure the Open with radio button is selected. If you like, you can also check the Do this
automatically for files like this from now on checkbox. Now click OK.

Connecting to the Sandbox - Macintosh


These instructions are for Macintosh users only.
You need to install the Remote Desktop Connection software on your computer.

Note

The file names and screen shots below may differ on your own computer. Microsoft
occasionally updates the RDC program.
Start by downloading this file

Once you've downloaded the RDC200_ALL.dmg (disk image) file, you need to locate it and open it.

Next you will see the following screen. Double-click on the box and follow the install instructions.

Click "Continue" and accept all the default recommendations and buttons for each screen:

You'll be asked to give your login and password TO YOUR MACHINE:

Switch back to your student start page, and press the Enter button for your course. Your browser will
download an RDP file. Save this file to your desktop.

Next, double-click the RDP file to open it. You will be asked for the USERNAME and PASSWORD for the O'Reilly
School of Technology:

If you see a warning sign, just click Connect; it's just Microsoft trying to make people buy Vista.

Connecting to the Sandbox - Linux


These instructions are for Linux users only.
There are many distributions of Linux. Unfortunately, we do not have the resources to support them all. These
instructions are geared towards the Ubuntu family of Linux distributions. If you are versed in Linux, you should
be able to tailor these instructions to your specific distribution and computer setup.
In Ubuntu, you will need to make sure that Terminal Server Client is installed and functioning. Using the
Synaptic or Adept package manager, simply install the Terminal Server Client. Or, from a terminal window,
execute the following line:
OBSERVE:
sudo apt-get install tsclient
Using the menu system for your Ubuntu or Kubuntu Linux distribution, start the Terminal Server Client
application, or from a terminal window, execute the following line:
OBSERVE:
tsclient
Switch back to your student start page, and press the Enter button for your course.
Your web browser will download an RDP file to your computer. Switch back to the Terminal Server Client
program, and open the downloaded RDP file. Click Connect. You will be asked for the USERNAME and
PASSWORD for the O'Reilly School of Technology.
If you have any problems or questions, don't hesitate to contact your mentor at learn@oreillyschool.com.

Initial Setup
At this point you should see the Talend splash screen:

The next screen you will see is the license acknowledgment. Click Accept:

Before we can get started with TOS we need to set up a repository connection and then a project. The
repository is where TOS will keep your objects. Click on the ... button to manage the repository connection:

Enter your email address in the repository management window to create a new repository, then click OK:

Next, create your project by choosing Create a new local project from the drop-down list and click Go!:

Name your project DBA3, set its language to Java, and click Finish:

Next, choose your new project from the drop-down and click Open:

When your project loads you'll see a screen asking if you want to register with Talend. If you are interested in
keeping up to date with Talend, enter your email address. Otherwise click Cancel:

The next step may take some time to complete. Under the hood, TOS generates Java code from your design
instructions. This process requires some work to complete, because TOS does a lot of set-up to make everything
work properly. While this work is taking place, you'll see the following screen:

WARNING

Do not cancel this process. If you do, it is highly likely that TOS will not work properly.

Once this process is complete, click Start Now to begin using TOS.
The next time you log in, you won't need to go through all of these steps. TOS will start, and you'll be able to
choose the DBA3 project you created already. You'll see the "generation" screen again, and when TOS
finishes that work, you will be ready to begin yours.
TOS is highly customizable - windows, palettes, and toolbars can all be moved, resized, and closed. We've
also added two reset buttons to TOS (the red leaves at the top of your screen). The first red leaf resets TOS
for the first part of the course, and the second red leaf resets TOS for the second part of the course. If you
accidentally close some aspect of TOS, or just want to get back to the beginning, click on the red leaf to reset
TOS.
If you have not done so already, reset TOS by clicking on the first red leaf:

Your student start page is on the top. Scroll down to find the DBA 3 course, and click the Enter button to view
your syllabus:

Use your syllabus to view your lessons, projects, and quizzes.

Note

If you have any questions or need assistance you can always contact your mentor at
learn@oreillyschool.com.

Logging Out
When you log out, you need to quit TOS instead of just closing the window. If you disconnect from your session and
then reconnect using a different computer, two copies of TOS will be running, potentially overwriting each other's files.
You can prevent that problem by closing TOS from the File -> Exit menu after you're done working.

WARNING

Again, make sure you quit TOS when you are done working so your jobs are saved correctly!


Introduction
DBA 3: Data Warehousing Lesson 1
Course Objectives
Welcome to the third course in O'Reilly School of Technology's (OST) DBA series. The content of this course has been written
under the assumption that you have worked through the first two courses in the series, and are familiar with MySQL. If you'd like
to refresh your memory, feel free to go back over the first two courses. Then get ready to take your MySQL knowledge to the
next level!
In this course, you'll learn what makes up a data warehouse and gain an understanding of the dimensional model. Upon
completion of this course, you will be able to:
Implement the dimensional model using standard ETL processes
Demonstrate understanding of dimension, SCD, and fact processing
Query relational data warehouses using standard SQL commands
Develop a complete data warehouse using Talend Open Studio
From beginning to end, you will learn by doing projects using Talend Open Studio, an Eclipse-based tool for implementing data
warehouses. You'll complete projects using Talend, developing your own complete data warehouses. The projects add to your
portfolio and will contribute to certificate completion. Besides a browser and internet connection, all software is provided online
by the O'Reilly School of Technology.

Learning with O'Reilly School of Technology Courses


As with every O'Reilly School of Technology course, we'll take the useractive approach to learning. This means that
you (the user) will be active! You'll learn by doing, building live programs, testing them and experimenting with them
hands-on!
To learn a new skill or technology, you have to experiment. The more you experiment, the more you learn. Our system
is designed to maximize experimentation and help you learn to learn a new skill.
We'll program as much as possible to be sure that the principles sink in and stay with you.
Each time we discuss a new concept, you'll put it into code and see what YOU can do with it. On occasion we'll even
give you code that doesn't work, so you can see common mistakes and how to recover from them. Making mistakes
is actually another good way to learn.
Here are some tips for using O'Reilly School of Technology courses effectively:
Type the code. Resist the temptation to cut and paste the example code we give you. Typing the code
actually gives you a feel for the programming task. Then play around with the examples to find out what else
you can make them do, and to check your understanding. It's highly unlikely you'll break anything by
experimentation. If you do break something, that's an indication to us that we need to improve our system!
Take your time. Learning takes time. Rushing can have negative effects on your progress. Slow down and
let your brain absorb the new information thoroughly. Taking your time helps to maintain a relaxed, positive
approach. It also gives you the chance to try new things and learn more than you otherwise would if you
blew through all of the coursework too quickly.
Experiment. Wander from the path often and explore the possibilities. We can't anticipate all of your
questions and ideas, so it's up to you to experiment and create on your own. Your instructor will help if you
go completely off the rails.
Accept guidance, but don't depend on it. Try to solve problems on your own. Going from
misunderstanding to understanding is the best way to acquire a new skill. Part of what you're learning is
problem solving. Of course, you can always contact your instructor for hints when you need them.
Use all available resources! In real-life problem-solving, you aren't bound by false limitations; in OST
courses, you are free to use any resources at your disposal to solve problems you encounter: the Internet,
reference books, and online help are all fair game.
Have fun! Relax, keep practicing, and don't be afraid to make mistakes! Your instructor will keep you at it
until you've mastered the skill. We want you to get that satisfied, "I'm so cool! I did it!" feeling. And you'll have
some projects to show off when you're done.

Lesson Format

We'll try out lots of examples in each lesson. We'll have you write code, look at code, and edit existing code. The code
will be presented in boxes that will indicate what needs to be done to the code inside.
Whenever you see white boxes like the one below, you'll type the contents into the editor window to try the example
yourself. The CODE TO TYPE bar on top of the white box contains directions for you to follow:
CODE TO TYPE:
White boxes like this contain code for you to try out (type into a file to run).
If you have already written some of the code, new code for you to add looks like this.
If we want you to remove existing code, the code to remove will look like this.
We may run programs and do some other activities in a terminal session in the operating system or other command-line environment. These will be shown like this:
INTERACTIVE SESSION:
The plain black text that we present in these INTERACTIVE boxes is
provided by the system (not for you to type). The commands we want you to type look like this.

Code and information presented in a gray OBSERVE box is for you to inspect and absorb. This information is often
color-coded, and followed by text explaining the code in detail:
OBSERVE:
Gray "Observe" boxes like this contain information (usually code specifics) for you to
observe.
The paragraph(s) that follow may provide additional details on information that was highlighted in the Observe box.
We'll also set especially pertinent information apart in "Note" boxes:

Note

Notes provide information that is useful, but not absolutely necessary for performing the tasks at hand.

Tip

Tips provide information that might help make the tools easier for you to use, such as shortcut keys.

WARNING

Warnings provide information that can help prevent program crashes and data loss.

Note

If you have not read the initial course instructions, Getting Started with Talend Open Studio, please go ahead
and do that now.

Using the Learning Sandbox Environment


In this course you will need to connect to your Unix accounts using SSH. To allow you to do that, there are two
Terminal views provided:

Note

Depending on the width of your monitor, the text on the tabs may be truncated. Terminal 1 is the left
terminal, and Terminal 2 is the right terminal.

If you clicked on the second red leaf, the terminals will be located lower on the screen:

To connect, click on one of the terminal tabs and then click on Connect:

Change the Connection Type to SSH, set the host to cold.useractive.com, then enter your username and
password:

The first time you connect you will see a few other warnings. Click on Yes for all of them:

After you click OK, you will be connected to your account:

You'll also be saving some of your SQL queries and documentation in text files. We'll store these in a project
accessible from TOS. To add this project, click on the Navigation tab:

You could use this view to peek at the files that TOS stores "under the hood." We'll use it to hold our text files. Right-click in the white space under the folders, and choose New -> Project:

Expand the General folder, choose Project, and click Next:

Name this project Documentation and click Finish:

Note

You must name your objects exactly as specified in the lesson to allow your mentor to locate your work
and help you if you need it.

To create a new text file, right-click on Documentation and choose New -> Other...:

Under the General folder, choose File and click Next:

Make sure you select Documentation as the parent folder. Name your new file dba3_lesson1_project1.txt - you
will add to this file as you complete your first project. Click on Finish:

An editor will open, allowing you to work on your file:

Save your work:

Data Warehousing
At this point in your database education, you are familiar with SQL databases and their capabilities. By far, the most
popular use for databases is the storage of operational data generated through transactions.
In the previous courses we examined the database of a DVD rental store. The database was used to keep track of
customers, the DVDs in the store's inventory, and the DVDs that were currently being rented. Tables were designed
and optimized for this "operational" purpose.

Keeping track of current customers and rentals is a key task of a DVD rental store database, but at some point a store
manager may want to gain additional insight into the business. He may have questions such as:
How many new customers were added this quarter?
What is the most popular rental?
How much revenue did our East side store generate compared to our West side store?
How do sales this month compare to last month or last year?
Are movies that were popular in the theater popular rentals as well?
Some of our questions can be answered by writing a straightforward query against the operational database. Other,
more complicated questions, such as "How much revenue did our East side store generate compared to our West
side store?" can be more challenging to answer. And still other questions cannot be answered at all without additional
information that isn't stored in the database.
A data warehouse is designed to provide a unified platform to answer all of the questions posed above. A good
data warehouse provides:
A separate system that won't interrupt business critical operational systems
A single point of access for all analytical queries
A unified and consistent view of underlying data (even data from external systems)
A straightforward way to analyze trends

In this course you'll learn everything you need to know about a data warehouse - from planning to implementation. In
the next lesson we'll take a fresh look at our operational database and start planning our warehouse. See you there!


A Data Warehouse
DBA 3: Data Warehousing Lesson 2
Facts and Dimensions
In the first lesson we discussed reasons to develop and use a data warehouse. The bottom line was that our video
store manager wanted answers to a few good questions:
How many new customers were added this quarter?
What is the most popular rental?
How much revenue did our East side store generate compared to our West side store?
How do sales this month compare to last month or last year?
Are movies that were popular in the theater popular rentals as well?
Which customers rent the most DVDs each month and at which store?
We can rewrite some of the questions so that they share a format we can use more readily in our queries:
How many new customers did we add by quarter?
How many times were DVDs rented, by DVD and by month?
How much sales did we do, by store and by month?
How much sales did we do by month?
How many times were DVDs rented, by month and by theater popularity?
How many times were DVDs rented, by customer, month and store?
Though the questions are slightly different than they were originally, they are now structured like analytical queries, with
facts and dimensions.

Facts
Facts are numbers, and are sometimes referred to as measures. Facts relating to sales could be "Sales in
US Dollars" and "Sales in Euros." Other facts could be "Hours of Work," or "Times Rented."

Facts have a defined grain - the level of detail. For example, "Sales in US Dollars" may be daily, or even
hourly. If you have sales data on a daily grain, you cannot display sales by hour. You can, however, combine
(aggregate) daily sales to larger grains such as weekly or monthly:

Aggregates are applied to facts in order to move to a larger grain. The most common aggregate is SUM.
Other aggregates are average (AVG), count (COUNT), maximum (MAX), and minimum (MIN). Aggregates take a set of
data and return a summary of that data.
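The move from a daily grain to a monthly grain can be sketched in a few lines. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for MySQL, and the daily_sales table, its columns, and its values are hypothetical and not part of the sakila database.

```python
import sqlite3

# Hypothetical daily-grain fact table; names and values are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (sale_date TEXT, sales_usd REAL)")
conn.executemany(
    "INSERT INTO daily_sales VALUES (?, ?)",
    [
        ("2005-05-24", 20.0),
        ("2005-05-25", 30.0),
        ("2005-06-01", 10.0),
        ("2005-06-02", 40.0),
    ],
)

# Aggregate the daily grain up to a monthly grain.
# substr(sale_date, 1, 7) keeps the 'YYYY-MM' part of each date.
monthly = conn.execute(
    """
    SELECT substr(sale_date, 1, 7) AS sale_month,
           SUM(sales_usd), AVG(sales_usd), COUNT(*),
           MAX(sales_usd), MIN(sales_usd)
    FROM daily_sales
    GROUP BY sale_month
    ORDER BY sale_month
    """
).fetchall()

for row in monthly:
    print(row)
```

Each output row summarizes one month. Going the other way is not possible: once only the monthly rows are kept, the daily values cannot be recovered from them.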
Let's experiment with some aggregates in our SQL database. Switch to SSH mode and log into your
account. In Unix mode, use the mysql command to connect to the sakila database as the anonymous user.
When prompted for a password, press Enter. Run the following command:
CODE TO TYPE:
cold1:~$ mysql -h sql sakila -u anonymous -p
If you have entered everything correctly you will see this:
OBSERVE:
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 28527
Server version: 5.0.62-log Source distribution
Type 'help;' or '\h' for help. Type '\c' to clear the buffer.
mysql>
Let's take a look at the tables, using the show tables command. Run this command:
CODE TO TYPE:
mysql> show tables;

OBSERVE:
mysql> show tables;
+----------------------------+
| Tables_in_sakila           |
+----------------------------+
| actor                      |
| actor_info                 |
| address                    |
| category                   |
| city                       |
| country                    |
| customer                   |
| customer_list              |
| film                       |
| film_actor                 |
| film_category              |
| film_list                  |
| film_text                  |
| inventory                  |
| language                   |
| nicer_but_slower_film_list |
| payment                    |
| rental                     |
| sales_by_film_category     |
| sales_by_store             |
| staff                      |
| staff_list                 |
| store                      |
+----------------------------+
23 rows in set (0.00 sec)
The sakila database has 23 tables and views. Let's take a closer look at the table called payment. Run the
following command:
CODE TO TYPE:
mysql> describe payment;
OBSERVE:
mysql> describe payment;
+--------------+----------------------+------+-----+-------------------+----------------+
| Field        | Type                 | Null | Key | Default           | Extra          |
+--------------+----------------------+------+-----+-------------------+----------------+
| payment_id   | smallint(5) unsigned | NO   | PRI | NULL              | auto_increment |
| customer_id  | smallint(5) unsigned | NO   | MUL | NULL              |                |
| staff_id     | tinyint(3) unsigned  | NO   | MUL | NULL              |                |
| rental_id    | int(11)              | YES  | MUL | NULL              |                |
| amount       | decimal(5,2)         | NO   |     | NULL              |                |
| payment_date | datetime             | NO   |     | NULL              |                |
| last_update  | timestamp            | NO   |     | CURRENT_TIMESTAMP |                |
+--------------+----------------------+------+-----+-------------------+----------------+
7 rows in set (0.00 sec)

This table has a column named amount, which is a measure.

So, "What is the total of all payments received?" Run this command to find out:
CODE TO TYPE:
SELECT sum(amount) from payment;
If you typed your query correctly you'll see the following results:
OBSERVE:
mysql> SELECT sum(amount) from payment;
+-------------+
| sum(amount) |
+-------------+
|    67416.51 |
+-------------+
1 row in set (0.14 sec)
Now let's answer this question: "What is the largest payment received?" In order to do that, run this
command:
CODE TO TYPE:
SELECT max(amount) from payment;
It looks like our largest payment was $11.99:
OBSERVE:
mysql> SELECT max(amount) from payment;
+-------------+
| max(amount) |
+-------------+
|       11.99 |
+-------------+
1 row in set (0.02 sec)
So, what if you need to track an event in the data warehouse? Events don't usually have a numeric value
attached to them (other than perhaps "count"). They are known as factless facts. We'll talk more about
factless facts later.
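As a rough illustration of the idea, the sketch below records events that carry only dimension values and no numeric measure, then derives the one number available: a count. The event list, store names, and customer names are hypothetical and chosen only for this example.

```python
from collections import Counter

# Hypothetical event rows: each tuple records that a customer visited a store on a day.
# There is no numeric measure attached, only dimension values.
visits = [
    ("2005-05-25", "West", "RUTH MARTINEZ"),
    ("2005-05-25", "West", "MARY SMITH"),
    ("2005-05-25", "East", "LINDA WILLIAMS"),
]

# The only available "fact" is the count of events per dimension combination.
visits_by_day_store = Counter((day, store) for day, store, _ in visits)

for key, count in visits_by_day_store.items():
    print(key, count)
```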
By themselves, facts are not particularly useful, so next we'll add context through dimensions.

Dimensions
Dimensions are used to filter, categorize, and label facts. A fact such as "Sales in US Dollars" might have
dimensions for Date, Customer, Store, and Movie. Written in English, this might translate to something like
this:
On May 25th, Ruth Martinez rented the movie "Cabin Flash" from the West side store for $9.99.
Or, broken into its components, it looks like this:

           Name                 Value
Dimension  Date                 May 25th
Dimension  Customer             Ruth Martinez
Dimension  Movie                Cabin Flash
Dimension  Store                West
Fact       Sales in US Dollars  $9.99

The first and most important dimension used in a warehouse is the date dimension. This dimension is often
presented in a hierarchy:
Year -> Quarter -> Month -> Day

Days can "roll up" to a month. Months can "roll up" to a quarter, and quarters "roll up" to a year. Daily sales
"roll up" to monthly sales, monthly sales "roll up" to quarterly sales, and quarterly sales "roll up" to yearly
sales:

Year, Quarter, Month, and Day are not dimensions themselves. They represent levels in the Date
dimension's hierarchy.
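The roll-up through the date hierarchy can be sketched as follows. This is a minimal example, not the course's date dimension: the daily_sales values and the hierarchy_levels helper are hypothetical, and calendar quarters (January to March is quarter 1) are assumed.

```python
from collections import defaultdict
from datetime import date

# Hypothetical daily sales; the dates and amounts are made up for illustration.
daily_sales = {
    date(2005, 5, 24): 20.0,
    date(2005, 5, 25): 30.0,
    date(2005, 6, 1): 10.0,
    date(2005, 8, 15): 40.0,
}


def hierarchy_levels(d):
    """Return the (year, quarter, month) levels that a day rolls up to."""
    quarter = (d.month - 1) // 3 + 1  # assumes calendar quarters
    return d.year, quarter, d.month


monthly = defaultdict(float)
quarterly = defaultdict(float)
yearly = defaultdict(float)

for day, amount in daily_sales.items():
    year, quarter, month = hierarchy_levels(day)
    monthly[(year, quarter, month)] += amount  # days roll up to a month
    quarterly[(year, quarter)] += amount       # months roll up to a quarter
    yearly[year] += amount                     # quarters roll up to a year

print(dict(monthly))
print(dict(quarterly))
print(dict(yearly))
```

Every level is computed from the same daily facts, which is why Year, Quarter, and Month behave as levels of one Date dimension rather than as separate dimensions.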

Dates often have multiple uses in a warehouse. For DVD rentals, dates are recorded at least twice: once
when a movie is rented and again when the movie is returned. When the same underlying date dimension is
used for both of these purposes, the dimension is known as a role-playing dimension.
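One way to picture a role-playing dimension is a single date table joined twice under two aliases. The sketch below assumes hypothetical dim_date and rental_fact tables and uses Python's built-in sqlite3 module as a stand-in for MySQL; none of these names come from the course database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical tables: one date dimension and one rental table that references it twice.
conn.executescript(
    """
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, full_date TEXT, month_name TEXT);
    CREATE TABLE rental_fact (rental_id INTEGER, rental_date_key INTEGER, return_date_key INTEGER);
    INSERT INTO dim_date VALUES (20050525, '2005-05-25', 'May'), (20050602, '2005-06-02', 'June');
    INSERT INTO rental_fact VALUES (1, 20050525, 20050602);
    """
)

# The same dim_date table plays two roles through two aliases:
# "rented" for the rental date and "returned" for the return date.
row = conn.execute(
    """
    SELECT r.rental_id, rented.month_name, returned.month_name
    FROM rental_fact r
    JOIN dim_date rented   ON r.rental_date_key = rented.date_key
    JOIN dim_date returned ON r.return_date_key = returned.date_key
    """
).fetchone()

print(row)
```

Only one physical date table exists here; the aliases give it a separate meaning in each join.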
In the SQL world we specify dimensions in the GROUP BY and WHERE clauses. Let's see these clauses in
action using our examples.
First let's examine the data stored in our database that corresponds to Ruth renting "Cabin Flash" from the
West store on May 25th for $9.99. For the sake of experiment, we happen to know that this data is stored with
a payment_id=491. (Just play along for now.) Run the following command:
CODE TO TYPE:
select c.first_name, c.last_name, f.title, p.amount,
DATE_FORMAT(p.payment_date, '%b %D') as paymentDate, s.region
from payment p
join customer c on (p.customer_id=c.customer_id)
join rental r on (p.rental_id = r.rental_id)
join inventory i on (r.inventory_id=i.inventory_id)
join film f on (i.film_id=f.film_id)
join store s on (c.store_id=s.store_id)
where p.payment_id=491;
MySQL responds with our data:
OBSERVE:
mysql> select c.first_name, c.last_name, f.title, p.amount,
-> DATE_FORMAT(p.payment_date, '%b %D') as paymentDate, s.region
-> from payment p
-> join customer c on (p.customer_id=c.customer_id)
-> join rental r on (p.rental_id = r.rental_id)
-> join inventory i on (r.inventory_id=i.inventory_id)
-> join film f on (i.film_id=f.film_id)
-> join store s on (c.store_id=s.store_id)
-> where p.payment_id=491;
+------------+-----------+-------------+--------+-------------+--------+
| first_name | last_name | title       | amount | paymentDate | region |
+------------+-----------+-------------+--------+-------------+--------+
| RUTH       | MARTINEZ  | CABIN FLASH |   9.99 | May 25th    | West   |
+------------+-----------+-------------+--------+-------------+--------+
1 row in set (0.09 sec)
Looks good! Now let's answer our first question: How much was rented on May 25th by Ruth Martinez in the
West store? Go ahead and run this command:
CODE TO TYPE:
select c.first_name, c.last_name, p.amount,
DATE_FORMAT(p.payment_date, '%b %D') as paymentDate, s.region
from payment p
join customer c on (p.customer_id=c.customer_id)
join store s on (c.store_id=s.store_id)
where day(p.payment_date)=25 and month(p.payment_date)=5
AND c.first_name='RUTH' and c.last_name='MARTINEZ';
The database does its job and returns the requested information:

OBSERVE:
mysql> select c.first_name, c.last_name, p.amount,
-> DATE_FORMAT(p.payment_date, '%b %D') as paymentDate, s.region
-> from payment p
-> join customer c on (p.customer_id=c.customer_id)
-> join store s on (c.store_id=s.store_id)
-> where day(p.payment_date)=25 and month(p.payment_date)=5
-> AND c.first_name='RUTH' and c.last_name='MARTINEZ';
+------------+-----------+--------+-------------+--------+
| first_name | last_name | amount | paymentDate | region |
+------------+-----------+--------+-------------+--------+
| RUTH       | MARTINEZ  |   0.99 | May 25th    | West   |
| RUTH       | MARTINEZ  |   9.99 | May 25th    | West   |
+------------+-----------+--------+-------------+--------+
2 rows in set (0.15 sec)
This is correct, but unfortunately it isn't exactly what we're after. We actually want one row of summarized data
instead of two rows of detail data. We need to aggregate the amount fact, and make sure to GROUP BY our
dimensions. Run this command:
CODE TO TYPE:
select c.first_name, c.last_name, sum(p.amount),
DATE_FORMAT(p.payment_date, '%b %D') as paymentDate, s.region
from payment p
join customer c on (p.customer_id=c.customer_id)
join store s on (c.store_id=s.store_id)
where day(p.payment_date)=25 and month(p.payment_date)=5
AND c.first_name='RUTH' and c.last_name='MARTINEZ'
GROUP BY c.first_name, c.last_name, paymentDate, s.region;
Excellent! Now we have our desired result:
OBSERVE:
mysql> select c.first_name, c.last_name, sum(p.amount),
-> DATE_FORMAT(p.payment_date, '%b %D') as paymentDate, s.region
-> from payment p
-> join customer c on (p.customer_id=c.customer_id)
-> join store s on (c.store_id=s.store_id)
-> where day(p.payment_date)=25 and month(p.payment_date)=5
-> AND c.first_name='RUTH' and c.last_name='MARTINEZ'
-> GROUP BY c.first_name, c.last_name, paymentDate, s.region;
+------------+-----------+---------------+-------------+--------+
| first_name | last_name | sum(p.amount) | paymentDate | region |
+------------+-----------+---------------+-------------+--------+
| RUTH       | MARTINEZ  |         10.98 | May 25th    | West   |
+------------+-----------+---------------+-------------+--------+
1 row in set (0.00 sec)
We were able to answer our question using the information stored in our current tables. So if that's the case, how is a
data warehouse different from our existing database? Read on...

The Dimensional Model
So why go to the trouble of creating a warehouse when our existing database has all the information we need? It
seems like we've just invented a few new terms for our existing data.
In the last lesson we had several good reasons for creating a warehouse, remember? Data warehouses provide:
a separate system that won't interrupt business critical operational systems.
a single point of access for all analytical queries.
a unified and consistent view of underlying data (even data from external systems).
a straightforward way to analyze trends (such as monthly sales comparisons).
Our existing database can provide answers to some of our pertinent questions, but it doesn't provide any of the
features listed above. Data warehouses do.

Note

Generally, you will create a data warehouse on a separate physical machine from your business critical
databases. For development purposes it is okay to share machines.

Our original database is fairly complex. Take a look at the database diagram.

Selecting Facts and Dimensions


How do we choose which facts and dimensions to use in our data warehouse? Ask the users, of course!
After all, if the system doesn't meet the needs of the users, what good is it? Ask them, "Which questions
would you like to ask?" If they need an example, say something like, "I would like to see how much the total
sales were for the West store last week."
Compile those questions, organize them according to effort, and split them into facts and dimensions,
just like we did earlier in this lesson. Keep in mind that facts and dimensions are generic terms: the
fact is "sales," not "total sales," and the dimensions are "region" and "date," not "east and west regions" and
"month." Words like "total," "top," and "longest" are just modifiers; they are not part of the
dimension or fact themselves.
Let's take the questions from earlier in the lesson, and organize them according to difficulty. We already split
them into facts and dimensions:
1. How much sales did we do by month?
2. How much sales did we do, by store and by month?
3. How many new customers did we add by quarter?
4. How many times were DVDs rented, by DVD and by month?
5. How many times were DVDs rented, by month and by movie popularity?
6. How many times were DVDs rented, by customer, month and store?
The first two questions use a Sales fact. The third question uses a Customer Count fact. The fourth, fifth, and
sixth use a Rental Count fact.
All questions use a date dimension. The fourth question uses a film dimension.
The fifth question poses a problem though, because we don't have any popularity data right now. We'll have
to revisit that question later.
In the real world it's important to get feedback from end users so you can determine what's important to them.
You can always create multiple facts or dimensions if end users don't agree on or even know what's
important yet. It's perfectly fine to have two or more facts that are very similar if it helps end users get the
information they need.

Star Schema
Now that we've picked our facts and dimensions, it's time to organize our data. Data warehouses are typically
organized using a star schema. Facts (measures) are stored in fact tables at the center of the star, and the
dimensions surround the measures. Facts have foreign keys (using the integer data type) to each dimension
table:

Here is our sales fact:

Here is our customer count fact:

Here is our rental count fact:

These separate diagrams might suggest our facts and dimensions are stored separately, but that's not the
case. The dimensions are shared:

You may wonder why we're using separate tables for dimensions. Couldn't we just put the movie dimensions
next to the fact in the same table? Well, we could do this, but we shouldn't for one good reason: performance.
It is safe to assume that your fact table will become very large (millions or even billions of rows). Your
dimensions may be large as well, but it is unlikely they will be nearly as large as your fact tables.
Suppose you have ten million rows of fact data and ten thousand distinct movies. Then you realize someone
entered a film into your warehouse using the name "The Dude" instead of the film's actual name, "The Big
Lebowski." If the title were stored on every fact row, correcting that mistake could take a very long time. Even
a simple query for "The Big Lebowski" could cause the database great pain; text is much more difficult to index
and search than integers.
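To make the performance argument concrete, here is a minimal sketch in MySQL. The tables demoDimMovie and demoFactSales are hypothetical, simplified stand-ins invented for this illustration; they are not the warehouse tables we build in later lessons.

```sql
-- Hypothetical, simplified tables for illustration only.
CREATE TABLE demoDimMovie
(
    movie_key int NOT NULL AUTO_INCREMENT,
    title     varchar(255) NOT NULL,
    PRIMARY KEY (movie_key)
);

CREATE TABLE demoFactSales
(
    movie_key int NOT NULL,           -- integer foreign key to demoDimMovie
    amount    decimal(5,2) NOT NULL   -- the fact (measure)
);

-- One movie entered under the wrong name, and two sales that point to it.
INSERT INTO demoDimMovie (movie_key, title) VALUES (1, 'The Dude');
INSERT INTO demoFactSales (movie_key, amount) VALUES (1, 2.99), (1, 4.99);

-- The correction touches a single dimension row, no matter how many
-- fact rows reference that movie.
UPDATE demoDimMovie
SET title = 'The Big Lebowski'
WHERE title = 'The Dude';

-- Queries filter on the small dimension table and reach the facts
-- through the integer key.
SELECT m.title, SUM(f.amount) AS total_sales
FROM demoFactSales f
JOIN demoDimMovie m ON (f.movie_key = m.movie_key)
WHERE m.title = 'The Big Lebowski'
GROUP BY m.title;
```

Because the title lives in exactly one dimension row, the fix stays a one-row UPDATE even when the fact table grows to millions of rows.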
Well, we've covered a lot in this lesson. In the next lesson we'll begin to implement our fact and dimension tables. See
you there!

Copyright 1998-2013 O'Reilly Media, Inc.

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
See http://creativecommons.org/licenses/by-sa/3.0/legalcode for more information.

Implementing the Dimensional Model, Part I


DBA 3: Data Warehousing Lesson 3
In the last lesson we discussed facts, dimensions, and the star schema. Now it's time to implement what we learned!

Creating the Date Dimension


The first and most important dimension in a data warehouse is the date dimension. Most, if not all, queries use one
or more dates. In our DVD rental store there are several dates captured:
1. Customers have a create_date.
2. Payments have a payment_date.
3. Rentals have a rental_date and a return_date.
We are going to use a table to create a single date dimension to handle all of these dates. Let's call the table dimDate.
FYI, if your dimension is reused for multiple purposes, such as recording both rental and return dates, it is known as
a role-playing dimension.
So, what kind of structure should we create for dimDate? In star schemas, facts use foreign keys of the data type
integer to "point" to dimensions. That means we'll use a single integer as a primary key. At first you might envision
something like this:
Column     Type
date_key   integer, primary key, auto_increment
date       date

This setup is a good start, but is it the best way to help our end users? If you recall the sample questions the users
gave us, several wanted to see results by month. So how would you extract information about a month from a date
type? You could use a function like month(), but it's probably not reasonable to expect end users to use that function.
A better solution would be to pre-calculate and pre-populate the most important date attributes as required by the
users. The best way to determine what's most important is to ask the users what they need. So let's suppose we did ask
them, and used the information they supplied to come up with this structure:
Column       Type
date_key     integer, primary key
date         date
year         smallint
quarter      tinyint
month        tinyint
day          tinyint
week         tinyint
is_weekend   boolean
is_holiday   boolean

We kept the date column because it can be used to calculate attributes that didn't make it to the table. We are not going
to use an auto_increment for the key. Normally we would use an auto_increment column, but here it's much more
convenient to make the key a coded format such as yyyyMMDD. With this format, a value of 20080101 would represent
January 1st, 2008.
We included two additional columns: is_weekend and is_holiday. These would be useful if we wanted to compare
weekend sales or holiday sales to weekday sales. We keep the number of data types required for our columns to a
minimum by consulting MySQL's documentation.
Let's go ahead and implement this table. (We'll populate it with data in a future lesson.) Switch to the terminal mode,
and log into your account. Once logged in, connect to your own MySQL database. Be sure to replace username and
username with your own user name. Type in the code below at the UNIX prompt:
CODE TO TYPE:
cold1:~$ mysql -h sql -p -u username username
Next, create the dimDate table. Run this co mmand:
CODE TO TYPE:
CREATE TABLE dimDate
(
date_key integer NOT NULL,
date date NOT NULL,
year smallint NOT NULL,
quarter tinyint NOT NULL,
month tinyint NOT NULL,
day tinyint NOT NULL,
week tinyint NOT NULL,
is_weekend boolean,
is_holiday boolean,
PRIMARY KEY(date_key)
);

Execute the query. You'll see Query OK, 0 rows affected.
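To get a feel for what one row of dimDate will hold, the following sketch derives the attributes for a single sample date using MySQL's built-in date functions. It only selects values and does not insert anything, so it will not interfere with loading the table in a later lesson. The sample date and the alias sample_day are assumptions made for this illustration; is_holiday is left out because it depends on a holiday calendar that MySQL cannot derive on its own.

```sql
SELECT
    CAST(DATE_FORMAT(d, '%Y%m%d') AS UNSIGNED) AS date_key,   -- yyyyMMDD, e.g. 20080101
    d                                          AS `date`,
    YEAR(d)                                    AS `year`,
    QUARTER(d)                                 AS `quarter`,
    MONTH(d)                                   AS `month`,
    DAYOFMONTH(d)                              AS `day`,
    WEEK(d)                                    AS `week`,     -- numbering depends on the week mode
    DAYOFWEEK(d) IN (1, 7)                     AS is_weekend  -- 1 = Sunday, 7 = Saturday
FROM (SELECT DATE('2008-01-01') AS d) AS sample_day;
```

Note that WEEK() follows the server's default week mode, so the week number for a given date may differ between installations unless you pass an explicit mode as the second argument.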

Slowly Changing Dimensions


So there you are, about to create a dimension for your customers, when a business user mentions, "I'd like to see
sales according to the city in which a customer lives. What happens when someone moves from one city
to another? Will the sales data from last year reflect that change?" When a dimension's values change
over time, the dimension is known as a slowly changing dimension (or SCD). There are several ways we can deal with
these changes.

Type 0 SCD
The most basic SCD isn't really a change at all. If you do absolutely nothing to handle a changing dimension,
that dimension is Type 0. In English, a Type 0 translates to, "Don't do anything when this value changes."

Type 1 SCD
A Type 1 SCD is often the easiest way to accommodate changing dimensions. In this type, rows in the
dimension tables are updated when changes occur. Suppose Mary Smith gets married in April and changes
her last name to Jones. (She'll keep the same email address for now.)

The old dimension row would look like this:

Customer ID  First Name  Last Name  Email                          City
1            MARY        SMITH      MARY.SMITH@sakilacustomer.org  Sasebo

The updated dimension row would look like this:

Customer ID  First Name  Last Name  Email                          City
1            MARY        JONES      MARY.SMITH@sakilacustomer.org  Sasebo

Some changes are less important than others. Name changes are not always important to business users.
For their purposes, it's irrelevant whether Mary Jones used to be known as Mary Smith. But suppose Mary
Smith moves from one city to another in July. A Type 1 customer SCD would simply update the existing row
for Mary Smith, forgetting the previous city. Now a user would be unable to see sales trends according to
customer and city, because all historical data concerning Mary prior to July would now be associated with the
new city.
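In SQL terms, a Type 1 change is nothing more than an in-place UPDATE. The sketch below uses a hypothetical table named demoCustomerType1, invented for this illustration so that it can be run on its own in your personal database.

```sql
-- Hypothetical table for illustration only.
CREATE TABLE demoCustomerType1
(
    customer_id int NOT NULL,
    first_name  varchar(45) NOT NULL,
    last_name   varchar(45) NOT NULL,
    city        varchar(50) NOT NULL,
    PRIMARY KEY (customer_id)
);

INSERT INTO demoCustomerType1 (customer_id, first_name, last_name, city)
VALUES (1, 'MARY', 'SMITH', 'Sasebo');

-- Type 1: overwrite the value in place. After this statement the table
-- no longer contains any trace of the old last name.
UPDATE demoCustomerType1
SET last_name = 'JONES'
WHERE customer_id = 1;

SELECT * FROM demoCustomerType1;
```

The same statement with SET city = ... is what causes the problem described above: every fact that points to this row is now reported under the new city.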

Type 2 SCD
Type 1 isn't the best way to handle all slowly changing dimensions though. Another method to track changes
in dimensions is to create a new row in the dimension table when each change occurs, and then use begin
and end dates to specify the valid time period for a row.
The database row for Mary Smith would initially look like this:

Customer Key  First Name  Last Name  Email                          City    Start Date   End Date
1             MARY        SMITH      MARY.SMITH@sakilacustomer.org  Sasebo  01-Jan-2008  01-JAN-2099

Now suppose Mary Smith gets married in April and becomes Mary Jones. Her dimension timeline would
look like this:

And her table would look like this:

Customer Key  First Name  Last Name  Email                          City    Start Date   End Date
              MARY        SMITH      MARY.SMITH@sakilacustomer.org  Sasebo  01-JAN-2008  01-APR-2008
              MARY        JONES      MARY.SMITH@sakilacustomer.org  Sasebo  01-APR-2008  01-JAN-2099

Now let's say she moves from Sasebo to Bellevue in July. Her dimension timeline would look like this:

The updated dimension data would look like this:

Customer Key  First Name  Last Name  Email                          City      Start Date   End Date
              MARY        SMITH      MARY.SMITH@sakilacustomer.org  Sasebo    01-JAN-2008  01-APR-2008
              MARY        JONES      MARY.SMITH@sakilacustomer.org  Sasebo    01-APR-2008  01-JUL-2008
              MARY        JONES      MARY.SMITH@sakilacustomer.org  Bellevue  01-JUL-2008  01-JAN-2099

In each of the two tables that reflect Mary's new circumstances, there is one "current" row that has 01-JAN-2099 for an End Date.

Note

Instead of using 01-JAN-2099 for an end date, some warehouses use NULL, but usually it's
better to use a real date instead of NULL, because real dates can make better use of indexes.
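The two steps behind a Type 2 change (close the current row, then add a new current row) can be sketched in SQL as follows. The table demoCustomerType2 and its surrogate customer_key column are hypothetical and exist only for this illustration; the real dimension is created later in this lesson and populated in a later one.

```sql
-- Hypothetical table for illustration only.
CREATE TABLE demoCustomerType2
(
    customer_key int NOT NULL AUTO_INCREMENT,  -- one key per version of the customer
    customer_id  int NOT NULL,                 -- identifier from the source system
    first_name   varchar(45) NOT NULL,
    last_name    varchar(45) NOT NULL,
    city         varchar(50) NOT NULL,
    start_date   date NOT NULL,
    end_date     date NOT NULL,
    PRIMARY KEY (customer_key)
);

-- Mary's current row before she moves.
INSERT INTO demoCustomerType2 (customer_id, first_name, last_name, city, start_date, end_date)
VALUES (1, 'MARY', 'JONES', 'Sasebo', '2008-04-01', '2099-01-01');

-- Step 1: close the current row on the date of the change.
UPDATE demoCustomerType2
SET end_date = '2008-07-01'
WHERE customer_id = 1 AND end_date = '2099-01-01';

-- Step 2: add a new current row that carries the changed value.
INSERT INTO demoCustomerType2 (customer_id, first_name, last_name, city, start_date, end_date)
VALUES (1, 'MARY', 'JONES', 'Bellevue', '2008-07-01', '2099-01-01');

-- Find the version of Mary that was valid on a given date.
SELECT customer_key, city
FROM demoCustomerType2
WHERE customer_id = 1
  AND start_date <= '2008-05-25'
  AND end_date   >  '2008-05-25';
```

Treating start_date as inclusive and end_date as exclusive is a design choice made for this sketch; it guarantees that a date such as July 1 matches exactly one row.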

Type 3 SCD
Type 2 slowly changing dimensions (SCDs) allow unlimited changes, but this might be excessive for some
types of changes. For example, when a postal code is changed, even though it's a fairly minor change and
doesn't happen that often, it would still need to be tracked in the database. In this case, we would choose to
use the Type 3 SCD method.
Suppose Mary Smith in Sasebo has her postal code changed from 35200 to 35201. The change would look
like this:

The old dimension data would look like this:

Customer Key  First Name  Last Name  Email                          City    Current Postal Code  Previous Postal Code
1             MARY        SMITH      MARY.SMITH@sakilacustomer.org  Sasebo  35200

The updated dimension data would look like this:

Customer Key  First Name  Last Name  Email                          City    Current Postal Code  Previous Postal Code
1             MARY        SMITH      MARY.SMITH@sakilacustomer.org  Sasebo  35201                35200

The table may or may not have an "Effective Date" column to explain when the postal code changed.
Type 3 SCDs work well for changes that happen infrequently; however, this type fails to capture multiple
changes.
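A Type 3 change can be sketched as a single UPDATE that shifts the current value into the "previous" column before overwriting it. The table demoCustomerType3 and the column names current_postal_code and previous_postal_code are assumptions made for this illustration.

```sql
-- Hypothetical table for illustration only.
CREATE TABLE demoCustomerType3
(
    customer_key         int NOT NULL,
    first_name           varchar(45) NOT NULL,
    last_name            varchar(45) NOT NULL,
    city                 varchar(50) NOT NULL,
    current_postal_code  varchar(10),
    previous_postal_code varchar(10),
    PRIMARY KEY (customer_key)
);

INSERT INTO demoCustomerType3 (customer_key, first_name, last_name, city, current_postal_code)
VALUES (1, 'MARY', 'SMITH', 'Sasebo', '35200');

-- Type 3: keep exactly one previous value next to the current one.
-- MySQL evaluates single-table SET assignments from left to right, so
-- previous_postal_code receives the old value before it is overwritten.
UPDATE demoCustomerType3
SET previous_postal_code = current_postal_code,
    current_postal_code  = '35201'
WHERE customer_key = 1;

SELECT customer_key, current_postal_code, previous_postal_code
FROM demoCustomerType3;
```

Running the UPDATE a second time with another postal code would discard 35200, which is the limitation mentioned above. The left-to-right evaluation is MySQL behavior; other databases may evaluate the assignments differently.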

Type 4 SCD
A Type 4 SCD is fairly straightforward; the dimension table always contains up-to-date information. Changes
are recorded in a separate history table. This adds complexity to dimensions, and may cause confusion
because users must keep in mind that historical data is stored in a separate location.
For example, suppose Mary moves from Sasebo to Bellevue on July 15. The change would look like this:

The dimCustomer table would look like this:

Customer Key  First Name  Last Name  Email                          City
1             MARY        SMITH      MARY.SMITH@sakilacustomer.org  Bellevue

The dimCustomerHistory table would look like this:

Customer Key  First Name  Last Name  Email                          City    Change Date
1             MARY        SMITH      MARY.SMITH@sakilacustomer.org  Sasebo  15-Jul-2008
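As a rough sketch, a Type 4 change copies the outgoing values into the history table and then overwrites the current row. The tables demoCustomerCurrent and demoCustomerHistory below are hypothetical, cut-down versions of the two tables shown above.

```sql
-- Hypothetical tables for illustration only.
CREATE TABLE demoCustomerCurrent
(
    customer_key int NOT NULL,
    first_name   varchar(45) NOT NULL,
    last_name    varchar(45) NOT NULL,
    city         varchar(50) NOT NULL,
    PRIMARY KEY (customer_key)
);

CREATE TABLE demoCustomerHistory
(
    customer_key int NOT NULL,
    first_name   varchar(45) NOT NULL,
    last_name    varchar(45) NOT NULL,
    city         varchar(50) NOT NULL,
    change_date  date NOT NULL
);

INSERT INTO demoCustomerCurrent (customer_key, first_name, last_name, city)
VALUES (1, 'MARY', 'SMITH', 'Sasebo');

-- Step 1: record the values that are about to be replaced.
INSERT INTO demoCustomerHistory (customer_key, first_name, last_name, city, change_date)
SELECT customer_key, first_name, last_name, city, '2008-07-15'
FROM demoCustomerCurrent
WHERE customer_key = 1;

-- Step 2: overwrite the current row, exactly as a Type 1 change would.
UPDATE demoCustomerCurrent
SET city = 'Bellevue'
WHERE customer_key = 1;
```

Queries about the present only need demoCustomerCurrent; questions about the past have to consult demoCustomerHistory as well.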

In practice, Type 1 and Type 2 are the most widely used ways to deal with slowly changing dimensions.
Rows do not have to consist entirely of a single SCD type. For example, for many data warehouses, the time
that a customer name change takes place is not significant, and the change is posted for that record on-the-fly. In that
case, the name columns would be of Type 1. Customer addresses are usually more important, so those columns
would be of Type 2. It's perfectly fine to handle changes in this way.

Creating the Customer Dimension


Okay, let's create our Customer dimension as Type 2. But before we do, we'll review the source of data for our
dimension. The sakila database has a table called customer which will be the source of information for this
dimension. Switch to the second SSH terminal, and log into your account. Use the command line mysql program to
connect to the sakila database. When you're prompted for a password, press Enter. In Unix mode, run the following
command:
CODE TO TYPE:
cold1:~$ mysql -h sql sakila -u anonymous -p

Once we're connected, we're able to see the structure of the customer table. Run the following command against the
sakila database:
CODE TO TYPE:
describe customer;
As long as you have typed everything correctly, and are connected to the sakila database (not your personal database)
you'll see this:
OBSERVE:
mysql> describe customer;
+-------------+----------------------+------+-----+-------------------+----------------+
| Field       | Type                 | Null | Key | Default           | Extra          |
+-------------+----------------------+------+-----+-------------------+----------------+
| customer_id | smallint(5) unsigned | NO   | PRI | NULL              | auto_increment |
| store_id    | tinyint(3) unsigned  | NO   | MUL | NULL              |                |
| first_name  | varchar(45)          | NO   |     | NULL              |                |
| last_name   | varchar(45)          | NO   | MUL | NULL              |                |
| email       | varchar(50)          | YES  |     | NULL              |                |
| address_id  | smallint(5) unsigned | NO   | MUL | NULL              |                |
| active      | tinyint(1)           | NO   |     | 1                 |                |
| create_date | datetime             | NO   |     | NULL              |                |
| last_update | timestamp            | NO   |     | CURRENT_TIMESTAMP |                |
+-------------+----------------------+------+-----+-------------------+----------------+
9 rows in set (0.00 sec)
The table has a lot of information. Observe that it contains a column called address_id. This indicates that the
address information is stored in a different table. Let's take a look at the address table. Now run the following
command against the sakila database:
CODE TO TYPE:
describe address;
You'll see these results:

OBSERVE:
mysql> describe address;
+-------------+----------------------+------+-----+-------------------+----------------+
| Field       | Type                 | Null | Key | Default           | Extra          |
+-------------+----------------------+------+-----+-------------------+----------------+
| address_id  | smallint(5) unsigned | NO   | PRI | NULL              | auto_increment |
| address     | varchar(50)          | NO   |     | NULL              |                |
| address2    | varchar(50)          | YES  |     | NULL              |                |
| district    | varchar(20)          | NO   |     | NULL              |                |
| city_id     | smallint(5) unsigned | NO   | MUL | NULL              |                |
| postal_code | varchar(10)          | YES  |     | NULL              |                |
| phone       | varchar(20)          | NO   |     | NULL              |                |
| last_update | timestamp            | NO   |     | CURRENT_TIMESTAMP |                |
+-------------+----------------------+------+-----+-------------------+----------------+
8 rows in set (0.00 sec)
See the column city_id? It is a foreign key to the table city. Let's take a look at that table. Run the following command
against the sakila database:
CODE TO TYPE:
describe city;
You'll see the following structure:
OBSERVE:
mysql> describe city;
+-------------+----------------------+------+-----+-------------------+----------------+
| Field       | Type                 | Null | Key | Default           | Extra          |
+-------------+----------------------+------+-----+-------------------+----------------+
| city_id     | smallint(5) unsigned | NO   | PRI | NULL              | auto_increment |
| city        | varchar(50)          | NO   |     | NULL              |                |
| country_id  | smallint(5) unsigned | NO   | MUL | NULL              |                |
| last_update | timestamp            | NO   |     | CURRENT_TIMESTAMP |                |
+-------------+----------------------+------+-----+-------------------+----------------+
4 rows in set (0.00 sec)
It looks like this table references yet another table, using country_id. Let's take a look at that table as well. Run
the following command against the sakila database:
CODE TO TYPE:
describe country;

You'll see these results:


OBSERVE:
mysql> describe country;
+-------------+----------------------+------+-----+-------------------+----------------+
| Field       | Type                 | Null | Key | Default           | Extra          |
+-------------+----------------------+------+-----+-------------------+----------------+
| country_id  | smallint(5) unsigned | NO   | PRI | NULL              | auto_increment |
| country     | varchar(50)          | NO   |     | NULL              |                |
| last_update | timestamp            | NO   |     | CURRENT_TIMESTAMP |                |
+-------------+----------------------+------+-----+-------------------+----------------+
3 rows in set (0.00 sec)
Now we'll combine all of these tables into our single dimCustomer dimension. At first glance, we might come up with
the following structure, reusing customer_id from our source customer table as the primary key for dimCustomer.
OBSERVE:
CREATE TABLE dimCustomer
(
customer_id smallint(5) unsigned NOT NULL,
first_name varchar(45) NOT NULL,
last_name varchar(45) NOT NULL,
email varchar(50),
address varchar(50) NOT NULL,
address2 varchar(50),
district varchar(20) NOT NULL,
city varchar(50) NOT NULL,
country varchar(50) NOT NULL,
postal_code varchar(10),
phone varchar(20) NOT NULL,
active tinyint(1) NOT NULL,
create_date datetime NOT NULL,
last_update datetime NOT NULL,
PRIMARY KEY(customer_id)
);
We could simply reuse this column as the key on our dimension table, but it's not a good practice because:
It forces our slowly changing dimension to be Type 1 instead of Type 2, since customer_id must be unique
for all rows in the table.
Changes in the source customer table could break the keys in the data warehouse.
Combining multiple sources of data into a single dimCustomer dimension is impossible if we rely on keys
generated in one system.
We'll avoid potential problems by keeping our original key, customer_id, as an ordinary column and using our own
customer_key surrogate key as the primary key.
We'll populate the Type 2 slowly changing dimension (SCD) in a future lesson, but for now we'll just create the table.
Switch back to the first SSH terminal, the one that's connected to your personal database. Then run the following
command against your personal database:

CODE TO TYPE:
CREATE TABLE dimCustomer
(
customer_key int NOT NULL AUTO_INCREMENT,
customer_id smallint(5) unsigned NOT NULL,
first_name varchar(45) NOT NULL,
last_name varchar(45) NOT NULL,
email varchar(50),
address varchar(50) NOT NULL,
address2 varchar(50),
district varchar(20) NOT NULL,
city varchar(50) NOT NULL,
country varchar(50) NOT NULL,
postal_code varchar(10),
phone varchar(20) NOT NULL,
active tinyint(1) NOT NULL,
create_date datetime NOT NULL,
start_date date NOT NULL,
end_date date NOT NULL,
PRIMARY KEY(customer_key)
);

Execute the query. If everything went okay you will see this: Query OK, 0 rows affected.
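To preview the data that will eventually flow into dimCustomer, you can flatten the four source tables with a join like the one below. This is only a sketch of how the source columns map to the dimension, using the columns shown in the describe output above; it is not the load process itself, which we build in a later lesson. Run it against the sakila database, not your personal database.

```sql
-- Run against the sakila database (read only).
SELECT c.customer_id, c.first_name, c.last_name, c.email,
       a.address, a.address2, a.district,
       ci.city, co.country,
       a.postal_code, a.phone,
       c.active, c.create_date
FROM customer c
JOIN address a  ON (c.address_id = a.address_id)
JOIN city ci    ON (a.city_id = ci.city_id)
JOIN country co ON (ci.country_id = co.country_id)
ORDER BY c.customer_id
LIMIT 5;
```

The LIMIT clause just keeps the preview short; remove it to see every customer.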

Snowflake Schemas
For our customer dimension, we've taken four tables and collapsed them into one table. Why did we do this?
Simplicity.
One of the goals of a data warehouse is to create a simple structure that users can query easily. Multiple
tables means multiple joins, and added complexity. Here we traded disk space for simplicity.
We can also work in the opposite direction, using more complex schemas when our purpose calls for that.
Addresses, for example, form a hierarchy. Countries have States (or regions), and states have Cities. Some
business users may be interested in seeing sales data by country, whereas others may be interested in
viewing sales data by state or by city. One way to deal with this hierarchy is with a snowflake schema.
In a snowflake schema you split a dimension into one "primary" dimension table and one or more snowflake
tables. It looks like this:
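In table form, a snowflaked address hierarchy might be sketched as follows. All three tables are hypothetical and invented for this illustration, and the state level is left out to keep the example short.

```sql
-- Hypothetical snowflake tables for illustration only.
CREATE TABLE demoDimCountry
(
    country_key int NOT NULL AUTO_INCREMENT,
    country     varchar(50) NOT NULL,
    PRIMARY KEY (country_key)
);

CREATE TABLE demoDimCity
(
    city_key    int NOT NULL AUTO_INCREMENT,
    city        varchar(50) NOT NULL,
    country_key int NOT NULL,          -- points to demoDimCountry
    PRIMARY KEY (city_key)
);

-- The "primary" dimension table keeps only a key to the city.
CREATE TABLE demoDimCustomerSnowflake
(
    customer_key int NOT NULL AUTO_INCREMENT,
    first_name   varchar(45) NOT NULL,
    last_name    varchar(45) NOT NULL,
    city_key     int NOT NULL,         -- points to demoDimCity
    PRIMARY KEY (customer_key)
);

-- Finding a customer's country now takes two extra joins.
SELECT cu.first_name, cu.last_name, ci.city, co.country
FROM demoDimCustomerSnowflake cu
JOIN demoDimCity ci    ON (cu.city_key = ci.city_key)
JOIN demoDimCountry co ON (ci.country_key = co.country_key);
```

Compare this with the single dimCustomer table above, where city and country are plain columns and no joins are needed.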

Snowflake schemas are also an effective way to handle a different type of problem. Suppose our DVD store
starts to rent DVDs over the internet. Our store now has two types of customers: Internet customers and In
Store customers. We know very little about the In Store customers; perhaps we only know their telephone
numbers and home addresses. By comparison we know a lot about our Internet customers; we might have
their email addresses, telephone numbers, physical addresses, movie preferences, and the number of times
they have visited the web site.

Suppose we have 10,000 customers: 2,500 In Store and 7,500 Internet customers. We may actually cause
confusion (and waste a lot of disk space) by shoving both types of customers into a single table. Also, we
may want to see sales statistics as they vary by customer type.
We can use a snowflake schema to help with that as well. In this situation there would be one customer
dimension that would hold attributes shared among Internet and In Store customers. Two additional tables,
dimCustomerInternet and dimCustomerInStore, would store type-specific attributes. This schema looks
like this:

For right now, we'll stick to our simple star schema.

We covered a lot of material in this lesson! We'll create the rest of our dimensions in the next lesson. See you there!


Implementing The Dimensional Model, Part II


DBA 3: Data Warehousing Lesson 4
We began implementation of the dimensional model in the last lesson. Now it's time to finish creating our dimensions, and
then create our facts.

Creating the Movie Dimension


Before we implement dimMovie, let's examine the data source: the film table. Connect to the sakila database. Now in
Unix mode, run this command:
CODE TO TYPE:
cold1:~$ mysql -h sql sakila -u anonymous
Once you're connected, we'll examine the structure of the film table. Run the following command against the sakila
database:
CODE TO TYPE:
describe film;
The table looks like this:

OBSERVE:
mysql> describe film;
+----------------------+---------------------------------------------------------------------+------+-----+-------------------+----------------+
| Field                | Type                                                                | Null | Key | Default           | Extra          |
+----------------------+---------------------------------------------------------------------+------+-----+-------------------+----------------+
| film_id              | smallint(5) unsigned                                                | NO   | PRI | NULL              | auto_increment |
| title                | varchar(255)                                                        | NO   | MUL | NULL              |                |
| description          | text                                                                | YES  |     | NULL              |                |
| release_year         | year(4)                                                             | YES  |     | NULL              |                |
| language_id          | tinyint(3) unsigned                                                 | NO   | MUL | NULL              |                |
| original_language_id | tinyint(3) unsigned                                                 | YES  | MUL | NULL              |                |
| rental_duration      | tinyint(3) unsigned                                                 | NO   |     | 3                 |                |
| rental_rate          | decimal(4,2)                                                        | NO   |     | 4.99              |                |
| length               | smallint(5) unsigned                                                | YES  |     | NULL              |                |
| replacement_cost     | decimal(5,2)                                                        | NO   |     | 19.99             |                |
| rating               | enum('G','PG','PG-13','R','NC-17')                                  | YES  |     | G                 |                |
| special_features     | set('Trailers','Commentaries','Deleted Scenes','Behind the Scenes') | YES  |     | NULL              |                |
| last_update          | timestamp                                                           | NO   |     | CURRENT_TIMESTAMP |                |
+----------------------+---------------------------------------------------------------------+------+-----+-------------------+----------------+
13 rows in set (0.01 sec)
This table has two decimal columns: rental_rate and replacement_cost. These quantities might become facts
stored in our data warehouse that allow us to answer questions like, "What is our profit (amount of rental income,
minus film cost) for each movie?" But since that and similar questions are outside the scope of our current project,
we'll omit those facts from our data warehouse and free up some space.
The table also has two foreign keys: language_id and original_language_id. Both point to the language table.
Take a look. Run the following command against the sakila database:
CODE TO TYPE:
describe language;
Execute the line to see the structure of language.

OBSERVE:
mysql> describe language;
+-------------+---------------------+------+-----+-------------------+----------------+
| Field       | Type                | Null | Key | Default           | Extra          |
+-------------+---------------------+------+-----+-------------------+----------------+
| language_id | tinyint(3) unsigned | NO   | PRI | NULL              | auto_increment |
| name        | char(20)            | NO   |     | NULL              |                |
| last_update | timestamp           | NO   |     | CURRENT_TIMESTAMP |                |
+-------------+---------------------+------+-----+-------------------+----------------+
3 rows in set (0.00 sec)
We'll consolidate these tables into dimMovie. As for changes, they are fairly infrequent in this case, so we'll implement
a Type 1 slowly changing dimension. Switch to the terminal mode, and log into your account. Once you're logged in,
connect to your own MySQL database. Be sure to replace username and username with your own user name. Then
type the following at the UNIX prompt:
CODE TO TYPE:
cold1:~$ mysql -h sql -p -u username username
Next, run the statement below, against your personal database, in order to create the dimMovie table:
CODE TO TYPE:
CREATE TABLE dimMovie
(
movie_key int NOT NULL AUTO_INCREMENT,
film_id smallint(5) unsigned NOT NULL,
title varchar(255) NOT NULL,
description text,
release_year year(4),
language varchar(20) NOT NULL,
original_language varchar(20),
rental_duration tinyint(3) unsigned NOT NULL,
length smallint(5) unsigned NOT NULL,
rating varchar(5) NOT NULL,
special_features varchar(60) NOT NULL,
PRIMARY KEY (movie_key)
);

Once again, you'll see Query OK, 0 rows affected.
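As a preview of how film and language consolidate into one row per movie, here is a sketch of a query you could run against the sakila database. It joins language twice, once for each foreign key. The aliases l and ol are choices made for this illustration, and the LEFT JOIN is used because original_language_id allows NULL in the describe output above.

```sql
-- Run against the sakila database (read only).
SELECT f.film_id, f.title, f.description, f.release_year,
       l.name  AS language,
       ol.name AS original_language,
       f.rental_duration, f.length, f.rating, f.special_features
FROM film f
JOIN language l       ON (f.language_id = l.language_id)
LEFT JOIN language ol ON (f.original_language_id = ol.language_id)
ORDER BY f.film_id
LIMIT 5;
```

With an inner join on original_language_id, every film without an original language would silently drop out of the result.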

Creating the Store Dimension


Now it's time to implement dimStore. Take a look at the data source: the store table. Switch terminals so that you're
using the sakila database. Run the following command against the sakila database:
CODE TO TYPE:
describe store;
Execute the line to see the structure of the store table.

OBSERVE:
mysql> describe store;
+------------------+----------------------+------+-----+-------------------+----------------+
| Field            | Type                 | Null | Key | Default           | Extra          |
+------------------+----------------------+------+-----+-------------------+----------------+
| store_id         | tinyint(3) unsigned  | NO   | PRI | NULL              | auto_increment |
| manager_staff_id | tinyint(3) unsigned  | NO   | UNI | NULL              |                |
| address_id       | smallint(5) unsigned | NO   | MUL | NULL              |                |
| last_update      | timestamp            | NO   |     | CURRENT_TIMESTAMP |                |
| region           | varchar(10)          | YES  |     | NULL              |                |
+------------------+----------------------+------+-----+-------------------+----------------+
5 rows in set (0.00 sec)
Our version of the sakila database is slightly different from the version distributed by MySQL. Our version includes a
region column. Our table also includes an address_id column. (Feel free to refer back to the previous lesson if you
want to go over the address table structure again.)
The next interesting aspect of this table is the manager_staff_id column. This column is a foreign key to staff. Let's
take a look at that table now. Run the following command against the sakila database:
CODE TO TYPE:
describe staff;
Execute the line to see the structure of staff.

OBSERVE:
mysql> describe staff;
+-------------+----------------------+------+-----+-------------------+----------------+
| Field       | Type                 | Null | Key | Default           | Extra          |
+-------------+----------------------+------+-----+-------------------+----------------+
| staff_id    | tinyint(3) unsigned  | NO   | PRI | NULL              | auto_increment |
| first_name  | varchar(45)          | NO   |     | NULL              |                |
| last_name   | varchar(45)          | NO   |     | NULL              |                |
| address_id  | smallint(5) unsigned | NO   | MUL | NULL              |                |
| picture     | blob                 | YES  |     | NULL              |                |
| email       | varchar(50)          | YES  |     | NULL              |                |
| store_id    | tinyint(3) unsigned  | NO   | MUL | NULL              |                |
| active      | tinyint(1)           | NO   |     | 1                 |                |
| username    | varchar(16)          | NO   |     | NULL              |                |
| password    | varchar(40)          | YES  |     | NULL              |                |
| last_update | timestamp            | NO   |     | CURRENT_TIMESTAMP |                |
+-------------+----------------------+------+-----+-------------------+----------------+
11 rows in set (0.00 sec)
We'll merge the staff table into a single dimStore dimension, and omit many of the columns from staff such as
picture, email, address, username, and password. Since stores may change managers, we'll make our dimension a
Type 2 SCD so we can track management changes accurately over time. That will require two additional columns:
start_date and end_date. Feel free to review the Type 2 SCD section in the third lesson if you like.
Switch terminals so that you're using your personal database. Now let's create our dimension! Run the command
below against your personal database:
CODE TO TYPE:
CREATE TABLE dimStore
(
    store_key          int NOT NULL AUTO_INCREMENT,
    store_id           smallint(5) unsigned NOT NULL,
    address            varchar(50) NOT NULL,
    address2           varchar(50),
    district           varchar(20) NOT NULL,
    city               varchar(50) NOT NULL,
    country            varchar(50) NOT NULL,
    postal_code        varchar(10),
    region             varchar(10),
    manager_first_name varchar(45) NOT NULL,
    manager_last_name  varchar(45) NOT NULL,
    start_date         date NOT NULL,
    end_date           date NOT NULL,
    PRIMARY KEY (store_key)
);

So long as you see the familiar Query OK, 0 rows affected, you're all set.
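To make the Type 2 behavior concrete, here is a minimal sketch of how a change of manager might be recorded once the table holds data. The dates, the manager name, the address values, and the use of '9999-12-31' as an open-ended end_date are all assumptions for illustration; the ETL jobs built later in the course may handle this differently.

```sql
-- Hypothetical example: store 1 gets a new manager on 2013-06-01.
-- Assumes the current row for a store carries the sentinel end_date '9999-12-31'
-- (end_date is NOT NULL in dimStore, so some sentinel value is needed).

-- 1. Close out the row that is current today.
UPDATE dimStore
   SET end_date = '2013-05-31'
 WHERE store_id = 1
   AND end_date = '9999-12-31';

-- 2. Add a new row that repeats the store attributes and carries the new manager.
INSERT INTO dimStore
    (store_id, address, address2, district, city, country, postal_code, region,
     manager_first_name, manager_last_name, start_date, end_date)
VALUES
    (1, '47 MySakila Drive', NULL, 'Alberta', 'Lethbridge', 'Canada', NULL, 'West',
     'Jane', 'Doe', '2013-06-01', '9999-12-31');

-- Both versions of store 1 are now visible, each with its own store_key.
SELECT store_key, store_id, manager_first_name, manager_last_name, start_date, end_date
  FROM dimStore
 WHERE store_id = 1
 ORDER BY start_date;
```

Because each version gets its own store_key, facts loaded before the change keep pointing at the old manager's row.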

Creating Facts
Now that our dimensions have been created, we can implement our facts. Fact tables are fairly straightforward; they
contain foreign keys to the relevant dimension tables, and a single column for the fact value.
Let's get started!

Sales
Our sales data will come from the payment table in the sakila database. Let's take a look. Switch back to the
sakila database and run this command:
CODE TO TYPE:
describe payment;
Execute the line to see the structure of payment:
OBSERVE:
mysql> describe payment;
+--------------+----------------------+------+-----+-------------------+----------------+
| Field        | Type                 | Null | Key | Default           | Extra          |
+--------------+----------------------+------+-----+-------------------+----------------+
| payment_id   | smallint(5) unsigned | NO   | PRI | NULL              | auto_increment |
| customer_id  | smallint(5) unsigned | NO   | MUL | NULL              |                |
| staff_id     | tinyint(3) unsigned  | NO   | MUL | NULL              |                |
| rental_id    | int(11)              | YES  | MUL | NULL              |                |
| amount       | decimal(5,2)         | NO   |     | NULL              |                |
| payment_date | datetime             | NO   |     | NULL              |                |
| last_update  | timestamp            | NO   |     | CURRENT_TIMESTAMP |                |
+--------------+----------------------+------+-----+-------------------+----------------+
7 rows in set (0.00 sec)
We'll pay particular attention to the amount column. It will be the basis for our factSales table.

Note

Make sure to review the sources of your facts, so you don't implement the wrong data type.

Switch back to your personal database. Let's create our fact. Run the command below against your personal
database:

CODE TO TYPE:
CREATE TABLE factSales
(
    sales_key    INT NOT NULL AUTO_INCREMENT,
    date_key     INT NOT NULL,
    customer_key INT NOT NULL,
    movie_key    INT NOT NULL,
    store_key    INT NOT NULL,
    sales_amount decimal(5,2) NOT NULL,
    FOREIGN KEY fk_date (date_key) REFERENCES dimDate(date_key),
    FOREIGN KEY fk_customer (customer_key) REFERENCES dimCustomer(customer_key),
    FOREIGN KEY fk_movie (movie_key) REFERENCES dimMovie(movie_key),
    FOREIGN KEY fk_store (store_key) REFERENCES dimStore(store_key),
    PRIMARY KEY (sales_key)
);

Once again, if everything went according to plan, you'll see Query OK, 0 rows affected.
A single row in factSales will represent the amount of sales for a specific date, for a specific customer, for a
specific movie, at a specific store.
You might think that the primary key should be a composite key across all foreign keys to the dimensions.
After all, these columns should uniquely identify a fact row, right? But the problem with that type of primary key
is that it tends to be very wide. To start, create a primary key on the surrogate key alone - sales_key. This will
give you optimum flexibility when evaluating future indexing strategies.
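If queries later turn out to filter or join heavily on the dimension keys, a secondary index can be added without touching the primary key. The sketch below shows one possibility rather than a recommendation; the index name ix_factSales_dims and the column order are assumptions.

```sql
-- Hypothetical non-unique index covering the four dimension keys.
-- The narrow surrogate primary key on sales_key stays exactly as it is.
CREATE INDEX ix_factSales_dims
    ON factSales (date_key, customer_key, movie_key, store_key);

-- List the indexes that now exist on the table.
SHOW INDEX FROM factSales;
```

Keeping the wide column list in a secondary index means it can be dropped or reshaped later without rebuilding the table's primary key.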

CustomerCount
Now we'll implement our factCustomerCount. The factCustomerCount is a tally of the number of
customers who created accounts with our store. This table does not have a foreign key to dimMovie because
the number of customers isn't relative to any particular movie.
We'll examine the source for this data in a future lesson. For now, let's create the fact. Make sure you are
using your personal database. Review the following CREATE TABLE statement:
OBSERVE:
CREATE TABLE factCustomerCount
(
    customerCount_key INT NOT NULL AUTO_INCREMENT,
    date_key          INT NOT NULL,
    customer_key      INT NOT NULL,
    store_key         INT NOT NULL,
    customer_count    INT NOT NULL,
    FOREIGN KEY fk_date (date_key) REFERENCES dimDate(date_key),
    FOREIGN KEY fk_customer (customer_key) REFERENCES dimCustomer(customer_key),
    FOREIGN KEY fk_store (store_key) REFERENCES dimStore(store_key),
    PRIMARY KEY (customerCount_key)
);

A single row in this table represents a specific customer who created an account on a specific day, at a
specific store.
Before you execute the command, take a closer look at the customer_count measure. What values might it
have?
Since customer_key points to exactly one customer, customer_count will always have the
value of 1.
Since customer_count will always be 1, we could omit the column from the table. However, we
will leave it in our table since it will make it easier for business users to query the table.
Since factCustomerCount doesn't have any "real" facts, it is known as a factless fact. A strictly
factless fact table has no measure columns, only foreign keys to dimensions; ours keeps the constant
customer_count purely as a convenience. Factless facts are good at storing events.

Let's create the table. This time we'll specify a default value of 1 on customer_count. Run this command
against your personal database:
CODE TO TYPE:
CREATE TABLE factCustomerCount
(
    customerCount_key INT NOT NULL AUTO_INCREMENT,
    date_key          INT NOT NULL,
    customer_key      INT NOT NULL,
    store_key         INT NOT NULL,
    customer_count    INT NOT NULL DEFAULT 1,
    FOREIGN KEY fk_date (date_key) REFERENCES dimDate(date_key),
    FOREIGN KEY fk_customer (customer_key) REFERENCES dimCustomer(customer_key),
    FOREIGN KEY fk_store (store_key) REFERENCES dimStore(store_key),
    PRIMARY KEY (customerCount_key)
);

Again, as long as you see Query OK, 0 rows affected, you're all set!
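The reason for keeping a column that is always 1 shows up at query time. The sketch below assumes the table has been populated (that happens in a later lesson) and uses only columns defined above; the new_customers alias is made up for the example.

```sql
-- New customer accounts per store.
-- Summing the constant measure reads naturally to business users and
-- gives the same answer as COUNT(*) would.
SELECT s.store_id,
       SUM(f.customer_count) AS new_customers
  FROM factCustomerCount f
  JOIN dimStore s ON (f.store_key = s.store_key)
 GROUP BY s.store_id;
```

Reporting tools that only know how to aggregate a numeric column can use customer_count directly, which is the convenience the column exists for.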

RentalCount
Our final fact is factRentalCount. It's similar to factCustomerCount in that it is also a factless fact. As
such, we'll also specify a default value for the rental_count column. (We'll populate this table in a future
lesson.) Run this command against your personal database:

CODE TO TYPE:
CREATE TABLE factRentalCount
(
    rentalCount_key INT NOT NULL AUTO_INCREMENT,
    date_key        INT NOT NULL,
    customer_key    INT NOT NULL,
    movie_key       INT NOT NULL,
    store_key       INT NOT NULL,
    rental_count    INT NOT NULL DEFAULT 1,
    FOREIGN KEY fk_date (date_key) REFERENCES dimDate(date_key),
    FOREIGN KEY fk_customer (customer_key) REFERENCES dimCustomer(customer_key),
    FOREIGN KEY fk_movie (movie_key) REFERENCES dimMovie(movie_key),
    FOREIGN KEY fk_store (store_key) REFERENCES dimStore(store_key),
    PRIMARY KEY (rentalCount_key)
);

You should see: Query OK, 0 rows affected.


A single row in this table represents a specific customer who rented a specific movie, on a specific day, at a
specific store.
We covered a lot of material in this lesson. Now that our dimensions and facts are implemented, we'll develop a strategy for
transferring data from our OLTP database to our OLAP database. See you in the next lesson!

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
See http://creativecommons.org/licenses/by-sa/3.0/legalcode for more information.

Extract, Transform, Load (ETL)


DBA 3: Data Warehousing Lesson 5
Welcome back! In the last few lessons we implemented the dimensional model. Now it's time to figure out how to populate the
tables we created.

What is ETL?
ETL is an acronym for Extract, Transform, and Load. It's the process that takes data from one or more source
systems, transforms and cleanses that data, and loads the result into the data warehouse.
You might be thinking "This sounds simple! We learned about exports and imports in the last course!"
And actually, we did learn about the E and L in the last course, but we didn't cover any transformations. An example of
a transformation would be converting a code like "D" in a source system to a word like "Deleted" in the
destination.

And we still haven't learned how to automate data loading, or handle failures automatically and gracefully. Failure is
never welcome, but when you're doing bulk exports and loads it's not too difficult to handle. If an export fails due
to a disk space issue, you free up some space and try again. If an import fails, you figure out what went wrong and try
again.
Failure may not be an option with ETL. Data warehouses hold a lot of data, and most of that data is processed on a
daily (or even hourly) basis and is extremely time sensitive. If you miss a day of processing, you may lose data.
The only way to handle a large volume of complex data is to have an automated ETL process.

Logging and Auditing


We know which data we want to pull into the data warehouse, and we know the destination tables for that data. There
are several bits of information to be logged during an ETL process. First, you'll track the data count (the number of
rows) transferred at each step in the process. This is useful information for a few good reasons:
We can make sure no rows are "lost" in the ETL process.
We can detect abnormal data; if we process 1000 rows one day, but the next day we only process 10, we'll
know that something probably went wrong.
In the future we can use this captured data to predict future capacity needs.
Other useful bits of information are the start and end times of the process:
They can be used to alert us to problems.
They also allow us to plan for future capacity.
Logging is not always complex. While we could use a single table to track this information, we'll split it into two tables:
etlRuns and etlLog. Switch to a terminal, and log into your account, then connect to your personal database. Run
this command against your personal database:


CODE TO TYPE:
CREATE TABLE etlRuns (
run_id integer NOT NULL AUTO_INCREMENT,
start_time datetime NOT NULL,
end_time datetime,
PRIMARY KEY(run_id)
);

Next we'll create the etlLog table, which will be used to log messages and statistics. Many of these columns are TOS-specific
(Talend Open Studio-specific; we'll explain more about Talend in the next lesson). We will see them again
when we implement logging in a later lesson. Some of this information won't be useful for every warehouse; it is up to
you to decide the amount and type of logging you need. Run the following command against your personal database:
CODE TO TYPE:
CREATE TABLE etlLog
(
run_id integer NOT NULL,
moment datetime NOT NULL,
pid varchar(20),
father_pid varchar(20),
root_pid varchar(20),
system_pid double,
project varchar(50),
job varchar(50),
job_repository_id varchar(255),
job_version varchar(255),
context varchar(50),
priority int,
origin varchar(255),
message_type varchar(255),
message varchar(255),
code int,
duration double,
count int,
reference int,
thresholds varchar(255),
key(run_id)
);

With these tables in place, we are ready to tackle auditing.
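Once a few runs have been logged, the two tables can answer the questions raised earlier: how long did each run take, and how many rows did it move? The query below is only a sketch. It assumes the count column in etlLog ends up holding a row count for each logged step, which depends on how logging is configured later in the course; the run_seconds and rows_logged aliases are made up for the example.

```sql
-- Duration and logged row total for every run.
-- LEFT JOIN keeps runs that have not written any log rows yet.
SELECT r.run_id,
       r.start_time,
       r.end_time,
       TIMESTAMPDIFF(SECOND, r.start_time, r.end_time) AS run_seconds,
       SUM(l.`count`) AS rows_logged
  FROM etlRuns r
  LEFT JOIN etlLog l ON (l.run_id = r.run_id)
 GROUP BY r.run_id, r.start_time, r.end_time
 ORDER BY r.run_id;
```

A run that is still open has a NULL end_time, so its run_seconds comes back as NULL rather than a misleading number.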


Why do we need auditing? Suppose we come into work one day to find that our daily sales jumped overnight from
$10,000 to $1,000,000. We know that the company did not sell $1,000,000 in one day, but how do we track down
the problem?
We can use the auditing features of the data warehouse to debug the problem. Auditing allows us to link rows in
tables with specific "runs" via the run_id column in etlLog. Each row in our fact and dimension tables will have a
run_id column, letting us know exactly when that data was added to the warehouse.
We implemented our dimensions and fact tables in the prior lesson, and those tables don't have any columns related
to auditing. We did create some dimensions as Type 2 SCDs, but they can't be used for auditing purposes. Instead we'll
need to add a run_id column to all of those tables.
Let's alter our tables to add that column. Run this command against your personal database:

CODE TO TYPE:
ALTER TABLE dimCustomer ADD run_id int not null REFERENCES etlRuns(run_id);
ALTER TABLE dimMovie ADD run_id int not null REFERENCES etlRuns(run_id);
ALTER TABLE dimStore ADD run_id int not null REFERENCES etlRuns(run_id);
ALTER TABLE dimStaff ADD run_id int not null REFERENCES etlRuns(run_id);
ALTER TABLE factSales ADD run_id int not null REFERENCES etlRuns(run_id);
ALTER TABLE factCustomerCount ADD run_id int not null REFERENCES etlRuns(run_id);
ALTER TABLE factRentalCount ADD run_id int not null REFERENCES etlRuns(run_id);
ALTER TABLE factRentalDuration ADD run_id int not null REFERENCES etlRuns(run_id);

Note

We won't add auditing to dimDate since it will only be loaded once.
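With run_id on the fact tables, the "$1,000,000 overnight" question becomes a query. The sketch below assumes factSales has been loaded by several runs; the rows_loaded and total_sales aliases are made up for the example.

```sql
-- Which ETL run contributed how much to factSales?
-- An unexpectedly large total points at the run to investigate.
SELECT f.run_id,
       r.start_time,
       COUNT(*)            AS rows_loaded,
       SUM(f.sales_amount) AS total_sales
  FROM factSales f
  JOIN etlRuns r ON (f.run_id = r.run_id)
 GROUP BY f.run_id, r.start_time
 ORDER BY total_sales DESC;
```

From there, the messages logged in etlLog for the same run_id are the next place to look.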

ETL processes themselves are typically broken into three parts:
1. Initial housekeeping, such as creating a "run" or clearing temp files and tables.
2. Extract, Transform, and Load data.
3. Final housekeeping, such as ending a "run," sending email, or clearing temp files and tables.
To do the initial housekeeping we will use a stored procedure, called etl_StartRun. This procedure will be used to
populate the etlRuns table and return the run_id to be used in all ETL processes. It will return the same run_id each
time it is called, until the corresponding "final housekeeping" procedure etl_EndRun is called. Run this command
against your personal database:
CODE TO TYPE:
DELIMITER //
CREATE PROCEDURE etl_StartRun()
BEGIN
    DECLARE current_run_id INTEGER;
    SELECT max(run_id) into current_run_id
    FROM etlRuns
    WHERE end_time IS NULL;
    IF current_run_id IS NULL THEN
    BEGIN
        INSERT INTO etlRuns (start_time) VALUES (now());
        SELECT LAST_INSERT_ID() into current_run_id;
    END;
    END IF;
    SELECT 'run_id' as "key", current_run_id as value;
END;
//
DELIMITER ;
With that procedure out of the way, we can think about the last part of the process: a procedure to perform final
housekeeping. Run this command against your personal database:
CODE TO TYPE:
DELIMITER //
CREATE PROCEDURE etl_EndRun()
BEGIN
    UPDATE etlRuns SET end_time=now() where end_time IS NULL;
END;
//
DELIMITER ;
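A quick way to see how the two procedures behave together is to call them by hand. This is a sketch that assumes no other run is currently open in etlRuns.

```sql
-- Opens a run and returns its id as a key/value pair.
CALL etl_StartRun();

-- Returns the same run_id again, because that run has no end_time yet.
CALL etl_StartRun();

-- Stamps end_time on the open run.
CALL etl_EndRun();

-- The run now shows both a start_time and an end_time.
SELECT run_id, start_time, end_time
  FROM etlRuns
 ORDER BY run_id;
```

Calling etl_StartRun() after etl_EndRun() creates a new row in etlRuns and returns a new run_id.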

That looks great! Now we're ready to look at our source data.

Getting Data into the Warehouse


Now that we have our data warehouse set up, it's time to review our source data. We defined the data we're putting into
our warehouse in the previous lessons, and we have some understanding of the data source. But do we know
everything about our source data? What information can we provide for our business users?
Part of ETL is Transformation - cleaning and transforming source data so it's easier to understand and more useful.
Transformation can alter source data to make it better by:
Changing customer status codes, such as O, D, C to "OK," "Deleted Customer," and "Account in
Collections."
Handling known source data errors or test data, such as all accounts with the prefix "TST_01."
Splitting data in one column into multiple columns, for example, splitting a single field for "R:2008-05-20"
into "Rental" and "2008-05-20."
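The three kinds of transformation listed above can be sketched in a single SQL query. Everything about the source here is hypothetical: the inline src rows, the account_name, status_code, and raw_event column names, and the 'Unknown' and 'Other' fallbacks are assumptions made so the example runs without any real table.

```sql
SELECT
    src.account_name,
    -- 1. Translate status codes into readable text.
    CASE src.status_code
        WHEN 'O' THEN 'OK'
        WHEN 'D' THEN 'Deleted Customer'
        WHEN 'C' THEN 'Account in Collections'
        ELSE 'Unknown'
    END AS status_text,
    -- 3. Split 'R:2008-05-20' into an event type and a date.
    CASE SUBSTRING_INDEX(src.raw_event, ':', 1)
        WHEN 'R' THEN 'Rental'
        ELSE 'Other'
    END AS event_type,
    CAST(SUBSTRING_INDEX(src.raw_event, ':', -1) AS DATE) AS event_date
FROM (
    -- Hypothetical source rows standing in for a real source table.
    SELECT 'ACME' AS account_name, 'O' AS status_code, 'R:2008-05-20' AS raw_event
    UNION ALL
    SELECT 'TST_01_DEMO', 'D', 'R:2008-05-21'
) AS src
-- 2. Drop known test data: accounts whose name starts with TST_01.
WHERE LEFT(src.account_name, 6) <> 'TST_01';
```

The TST_01 row is filtered out, and the remaining row comes back with its code, event type, and date each in a separate column.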
Often business analysts and users are responsible for determining and documenting transformation and mapping
rules. Other times, those responsibilities fall upon the programmer. But in any of those situations, it's important to
have clear documentation. We want no confusion about the reasons customer codes of "D" are getting translated to
"Deleted Customer" in the data warehouse.

Note

Most companies have data scattered across many different systems, databases, and files. We'll keep
things simple for this course by restricting our data sources. No matter where your data originates,
the process for getting it into the data warehouse is the same.

So, how will we document our transformation and mapping rules? We'll use the easiest and most useful
method available. This might be a Word document in some situations, or a spreadsheet in others. For this course
we'll just use plain text documents to describe our transformations.

dimDate
Our date dimension doesn't really have a source other than a calendar. So how do we come up with the
data? Programmers will commonly use one of these methods:
Create a program to populate the date table.
Create a spreadsheet with date data in it.
Copy the date dimension from an existing data warehouse.
Suppose one of the business users is handy with Excel, and has offered to create a spreadsheet for you. The
spreadsheet will already contain all of the required information, including holidays and weekends. In this
case, most of the work is done for us. We only need to load the data (which we'll do in the next lessons).

dimCustomer
Let's take a closer look at dimCustomer. Back in lesson three we discovered that a customer record is stored
in several tables in the sakila database: customer, address, city and country. We're not planning on
doing any transformations on the data, but suppose a business user informs us that rows in the customer
table where customer_id <= 10 are actually test accounts that should be excluded from the data
warehouse.
Let's write the query we need to extract the data from the customer table. Switch to the second terminal, and
log into the sakila database. Run this command against the sakila database:

CODE TO TYPE:
SELECT
c.customer_id, c.first_name, c.last_name, c.email,
a.address, a.address2, a.district,
ci.city,
co.country,
postal_code,
a.phone,c.active, c.create_date
FROM customer c
JOIN address a on (c.address_id = a.address_id)
JOIN city ci on (a.city_id = ci.city_id)
JOIN country co on (ci.country_id = co.country_id)
WHERE customer_id > 10;

If you are connected to the sakila database and typed everything correctly, you'll see lots of results:

OBSERVE:
mysql> SELECT
-> c.customer_id, c.first_name, c.last_name, c.email,
-> a.address, a.address2, a.district,
-> ci.city,
-> co.country,
-> postal_code,
-> a.phone,c.active, c.create_date
-> FROM customer c
-> JOIN address a on (c.address_id = a.address_id)
-> JOIN city ci on (a.city_id = ci.city_id)
-> JOIN country co on (ci.country_id = co.country_id)
-> WHERE customer_id > 10;
+-------------+------------+-------------+----------------------------------------+-----------------------------+----------+------------+----------+----------------+-------------+--------------+--------+---------------------+
| customer_id | first_name | last_name   | email                                  | address                     | address2 | district   | city     | country        | postal_code | phone        | active | create_date         |
+-------------+------------+-------------+----------------------------------------+-----------------------------+----------+------------+----------+----------------+-------------+--------------+--------+---------------------+
|         218 | VERA       | MCCOY       | VERA.MCCOY@sakilacustomer.org          | 1168 Najafabad Parkway      |          | Kabol      | Kabul    | Afghanistan    | 40301       | 886649065861 |      1 | 2004-03-19 00:00:00 |
|         441 | MARIO      | CHEATHAM    | MARIO.CHEATHAM@sakilacustomer.org      | 1924 Shimonoseki Drive      |          | Batna      | Batna    | Algeria        | 52625       | 406784385440 |      1 | 2004-10-07 00:00:00 |
|          69 | JUDY       | GRAY        | JUDY.GRAY@sakilacustomer.org           | 1031 Daugavpils Parkway     |          | Bchar      | Bchar    | Algeria        | 59025       | 107137400143 |      1 | 2004-02-25 00:00:00 |
|         176 | JUNE       | CARROLL     | JUNE.CARROLL@sakilacustomer.org        | 757 Rustenburg Avenue       |          | Skikda     | Skikda   | Algeria        | 89668       | 506134035434 |      1 | 2004-08-11 00:00:00 |
|         320 | ANTHONY    | SCHWAB      | ANTHONY.SCHWAB@sakilacustomer.org      | 1892 Nabereznyje Telny Lane |          | Tutuila    | Tafuna   | American Samoa | 28396       | 478229987054 |      1 | 2004-07-20 00:00:00 |
|         528 | CLAUDE     | HERZOG      | CLAUDE.HERZOG@sakilacustomer.org       | 486 Ondo Parkway            |          | Benguela   | Benguela | Angola         | 35202       | 105882218332 |      1 | 2004-01-24 00:00:00 |
...lines omitted...
|         303 | WILLIAM    | SATTERFIELD | WILLIAM.SATTERFIELD@sakilacustomer.org | 687 Alessandria Parkway     |          | Sanaa      | Sanaa    | Yemen          | 57587       | 407218522294 |      1 | 2004-04-22 00:00:00 |
|         213 | GINA       | WILLIAMSON  | GINA.WILLIAMSON@sakilacustomer.org     | 1001 Miyakonojo Lane        |          | Taizz      | Taizz    | Yemen          | 67924       | 584316724815 |      1 | 2004-08-02 00:00:00 |
|         553 | MAX        | PITT        | MAX.PITT@sakilacustomer.org            | 1917 Kumbakonam Parkway     |          | Vojvodina  | Novi Sad | Yugoslavia     | 11892       | 698182547686 |      1 | 2004-02-09 00:00:00 |
|         438 | BARRY      | LOVELACE    | BARRY.LOVELACE@sakilacustomer.org      | 1836 Korla Parkway          |          | Copperbelt | Kitwe    | Zambia         | 55405       | 689681677428 |      1 | 2004-09-24 00:00:00 |
+-------------+------------+-------------+----------------------------------------+-----------------------------+----------+------------+----------+----------------+-------------+--------------+--------+---------------------+
589 rows in set (0.04 sec)
It looks like this is a good query to use to extract customer information. Save this query - we will use it in a
future lesson.
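Before relying on an extract query, it can help to check that the row count is what the filter implies. The sketch below runs against the sakila customer table; the three column aliases are made up for the example.

```sql
-- Compare the size of the source table with the size of the extract.
-- In MySQL a comparison evaluates to 1 or 0, so SUM() counts matching rows.
SELECT COUNT(*)               AS total_customers,
       SUM(customer_id <= 10) AS test_accounts,
       SUM(customer_id > 10)  AS rows_to_extract
  FROM customer;
```

If rows_to_extract differs from the number of rows the extract query returns, one of the inner joins to address, city, or country is dropping customers.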

dimMovie
The next table we will be populating is dimMovie. Data for this table comes from two tables: film and
language. We will have to join on language twice, however, since the film table joins to language on
language_id and original_language_id.
Let's write the query needed to extract the data from the film table. Run this command against the
sakila database:
CODE TO TYPE:
SELECT f.film_id, f.title, f.description, f.release_year,
l.name as language, orig_lang.name as original_language,
f.rental_duration, f.length, f.rating, f.special_features
FROM film f
JOIN language l on (f.language_id=l.language_id)
JOIN language orig_lang on (f.original_language_id = orig_lang.language_id);
Try executing the query. If you typed everything correctly, you will see the following:
OBSERVE:
mysql> SELECT f.film_id, f.title, f.description, f.release_year,
-> l.name as language, orig_lang.name as original_language,
-> f.rental_duration, f.length, f.rating, f.special_features
-> FROM film f
-> JOIN language l on (f.language_id=l.language_id)
-> JOIN language orig_lang on (f.original_language_id = orig_lang.language_id);
Empty set (0.01 sec)
What happened to the data? We don't have a WHERE clause, so that can't be the problem. But we do have two
joins. Let's write another query to find out which join is failing us. Run this command against the sakila
database:
CODE TO TYPE:
SELECT count(distinct language_id), count(distinct original_language_id)
FROM film f;
Run the query, and observe the results:
OBSERVE:
mysql> SELECT count(distinct language_id), count(distinct original_language_id)
-> FROM film f;
+-----------------------------+--------------------------------------+
| count(distinct language_id) | count(distinct original_language_id) |
+-----------------------------+--------------------------------------+
|                           1 |                                    0 |
+-----------------------------+--------------------------------------+
1 row in set (0.00 sec)
It looks like we don't have any films that have been translated: original_language_id is NULL for every film, so
the inner join to language on that column matches nothing and removes every row. Perhaps this is a feature in
progress, or an old feature that has since been abandoned. Whatever the reason, we will need to alter our SELECT query to use a
LEFT JOIN instead of a normal join. Run this command against the sakila database:

CODE TO TYPE:
SELECT f.film_id, f.title, f.description, f.release_year,
l.name as language, orig_lang.name as original_language,
f.rental_duration, f.length, f.rating, f.special_features
FROM film f
JOIN language l on (f.language_id=l.language_id)
LEFT JOIN language orig_lang on (f.original_language_id = orig_lang.language_id);
As long as you typed everything correctly, you will see lots of results:
OBSERVE:
mysql> SELECT f.film_id, f.title, f.description, f.release_year,
-> l.name as language, orig_lang.name as original_language,
-> f.rental_duration, f.length, f.rating, f.special_features
-> FROM film f
-> JOIN language l on (f.language_id=l.language_id)
-> LEFT JOIN language orig_lang on (f.original_language_id = orig_lang.language_id);
+---------+-------------------+------------------------------------------------------------------------------------------------------+--------------+----------+-------------------+-----------------+--------+--------+-----------------------------------------+
| film_id | title             | description                                                                                          | release_year | language | original_language | rental_duration | length | rating | special_features                        |
+---------+-------------------+------------------------------------------------------------------------------------------------------+--------------+----------+-------------------+-----------------+--------+--------+-----------------------------------------+
|       1 | ACADEMY DINOSAUR  | A Epic Drama of a Feminist And a Mad Scientist who must Battle a Teacher in The Canadian Rockies     |         2006 | English  | NULL              |               6 |     86 | PG     | Deleted Scenes,Behind the Scenes        |
|       2 | ACE GOLDFINGER    | A Astounding Epistle of a Database Administrator And a Explorer who must Find a Car in Ancient China |         2006 | English  | NULL              |               3 |     48 | G      | Trailers,Deleted Scenes                 |
|       3 | ADAPTATION HOLES  | A Astounding Reflection of a Lumberjack And a Car who must Sink a Lumberjack in A Baloon Factory     |         2006 | English  | NULL              |               7 |     50 | NC-17  | Trailers,Deleted Scenes                 |
...lines omitted...
|     998 | ZHIVAGO CORE      | A Fateful Yarn of a Composer And a Man who must Face a Boy in The Canadian Rockies                   |         2006 | English  | NULL              |               6 |    105 | NC-17  | Deleted Scenes                          |
|     999 | ZOOLANDER FICTION | A Fateful Reflection of a Waitress And a Boat who must Discover a Sumo Wrestler in Ancient China     |         2006 | English  | NULL              |               5 |    101 | R      | Trailers,Deleted Scenes                 |
|    1000 | ZORRO ARK         | A Intrepid Panorama of a Mad Scientist And a Boy who must Redeem a Boy in A Monastery                |         2006 | English  | NULL              |               3 |     50 | NC-17  | Trailers,Commentaries,Behind the Scenes |
+---------+-------------------+------------------------------------------------------------------------------------------------------+--------------+----------+-------------------+-----------------+--------+--------+-----------------------------------------+
1000 rows in set (0.02 sec)
This looks great!
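The empty result from the first version of the query can also be confirmed directly, without counting distinct values. The sketch below runs against the sakila film table; the column aliases are made up for the example.

```sql
-- COUNT(column) skips NULLs, while COUNT(*) counts every row,
-- so the gap between them is the number of films with no original language.
SELECT COUNT(*)                          AS films,
       COUNT(original_language_id)       AS films_with_original_language,
       SUM(original_language_id IS NULL) AS films_without_original_language
  FROM film;
```

Whenever a join column is NULL in the source rows, an inner join silently drops those rows, which is why the LEFT JOIN is needed here.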

dimStore
The last table we'll work on populating is dimStore. Data for this table comes from many tables: store,
staff, address, city, and country. Run this command against the sakila database:
CODE TO TYPE:
SELECT s.store_id, a.address, a.address2, a.district,
c.city, co.country, a.postal_code, s.region,
st.first_name as manager_first_name,
st.last_name as manager_last_name
FROM
store s
JOIN staff st on (s.manager_staff_id = st.staff_id)
JOIN address a on (s.address_id = a.address_id)
JOIN city c on (a.city_id = c.city_id)
JOIN country co on (c.country_id = co.country_id);
Run the query, and observe the results:
OBSERVE:
mysql> SELECT s.store_id, a.address, a.address2, a.district,
-> c.city, co.country, a.postal_code, s.region,
-> st.first_name as manager_first_name,
-> st.last_name as manager_last_name
-> FROM
-> store s
-> JOIN staff st on (s.manager_staff_id = st.staff_id)
-> JOIN address a on (s.address_id = a.address_id)
-> JOIN city c on (a.city_id = c.city_id)
-> JOIN country co on (c.country_id = co.country_id)
-> ;
+----------+--------------------+----------+----------+------------+-----------+-------------+--------+--------------------+-------------------+
| store_id | address            | address2 | district | city       | country   | postal_code | region | manager_first_name | manager_last_name |
+----------+--------------------+----------+----------+------------+-----------+-------------+--------+--------------------+-------------------+
|        1 | 47 MySakila Drive  | NULL     | Alberta  | Lethbridge | Canada    |             | West   | Mike               | Hillyer           |
|        2 | 28 MySQL Boulevard | NULL     | QLD      | Woodridge  | Australia |             | East   | Jon                | Stephens          |
+----------+--------------------+----------+----------+------------+-----------+-------------+--------+--------------------+-------------------+
2 rows in set (0.00 sec)
This looks great too!
Great job so far. In the next lesson we'll practice writing an ETL job. See you then!


Tools for ETL


DBA 3: Data Warehousing Lesson 6
Now that we have the mapping for our data sources done, let's consider how we are going to extract the data, transform it, and
get it into the data warehouse.

ETL--Past, Present, and Future


There is no one right way to perform ETL. In the past people have written custom programs in any number of
languages such as C, C++, Perl, or Python, or used the stored procedures provided by an SQL database. Programs were
glued together using BAT files on Windows machines, shell scripts on Unix machines, and everything in between.
But those ways of doing ETL are history. The available tools have improved dramatically since then. Now we
don't have to figure out how to get Python to read data from an Excel spreadsheet for transformation and placement in an
Oracle database.
In this course we'll use Talend Open Studio (TOS), an Eclipse-based tool that simplifies development and
maintenance of ETL procedures, yet allows enough access to the underlying Java language for power-users to extend
and expand its capabilities.
TOS (and other ETL tools) operate with the data flow model. You specify sources of data, transformations on that data,
and then a destination. This corresponds fairly closely with the standard ETL concepts of extracting data, transforming
data, and loading data. One or more data flows are grouped together in a job.

ETL tools keep track of schemas. A schema is a definition of the columns in a data flow that includes data types and sizes. The
schema also tracks which columns are allowed to be null, and which columns are part of the key.
Schemas are an important abstraction in the ETL world. They allow us to specify the makeup of our data once, and use
that specification in many different components. (We'll learn more about schemas soon.)
What does the future hold? Nobody can say for sure, but it looks like we will see faster, easier to use, and more
reliable ETL solutions that let us focus on interesting problems instead of mundane connections and transformations.
Let's create a sample job to see what we can do.

Getting Started with Talend Open Studio


So far we haven't really used any features of Talend Open Studio. That's about to change. Before we get started, switch
to the second OST perspective by clicking on the red leaf with a 2 inside it:

Note

Feel free to resize any widget on your screen. You can always get back to the default perspective by
clicking on the red leaf.

On the left side of the screen you will see a tab called "Repository." The text may be truncated to "Rep," depending on
the width of your screen.

The repository is where TOS stores all objects related to your project. They are:
Business Models - diagrams to document processes or flows.
Job Designs - implemented processes or flows.
Contexts - sets of variables or values that are shared across several jobs.
Code - bits of Java code shared across several jobs.
SQL Patterns - templates of SQL code that can be used as a basis for queries in jobs.
Metadata - data about your data - database connections, file layouts, and descriptions of database tables
and query results.
Documentation - storage for Word documents, spreadsheets and other items created outside of TOS.
Recycle bin - last stop for trash, just like the recycle bin in Windows.
To simplify things, for this course we won't use Business Models, Code, SQL Patterns, or Documentation.

Your First TOS Job


Now that you've read a little about TOS, it's time to create a simple job.
Right-click on Job Designs and choose Create Job:

Name the job ETLDemo and click Finish:

If everything went okay, you will see a blank "canvas" for Job ETLDemo 0.1 and a new Palette on the lower left:

Right now your ETL job is blank, so it doesn't do anything. We need to add a data source. On the Palette, click File to
expand that category, then click Input.

Note

If you don't see Input, click on the up and down arrows.

Click tFileInputDelimited once to select it, then move your mouse over the canvas. Click the canvas to drop the
tFileInputDelimited widget.

Your canvas should now look like this:

So, what's with that red circle with the X through it? Hover your mouse over that circle, and you'll see this:

The warning and error occur because we haven't set any properties on the tFileInputDelimited widget. Let's do that
now. Click once in the middle of the tFileInputDelimited widget, then switch to the Component tab at the bottom of
the screen:

Now you'll see the basic aspects of tFileInputDelimited that you can modify. We'll need to change the file to point
to a sample CSV input. Change the File Name so it looks like this:
CODE TO TYPE:
"C:/talend_files/in/csv/customer1.csv"

WARNING

Make sure you type forward slashes ( / ) instead of the usual backslashes ( \ ). Under the hood,
TOS uses Java to run your transformation, and in a Java string a backslash starts an escape
sequence (such as \t for a tab) instead of standing for itself.
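To see why backslashes cause trouble, here is a small Java sketch that can be run outside of TOS. The class name PathSlashes and the second, shorter path are made up for illustration; the shorter path was picked because its backslashes happen to form valid escapes, so the code compiles but produces the wrong string.

```java
final class PathSlashes {

    // Forward slashes need no escaping in a Java string.
    static String forwardSlashes() {
        return "C:/talend_files/in/csv/customer1.csv";
    }

    // Each doubled backslash in the source is a single backslash in the resulting string.
    static String escapedBackslashes() {
        return "C:\\talend_files\\in\\csv\\customer1.csv";
    }

    // Hypothetical path: here "\t" becomes a tab and "\n" a newline,
    // so the resulting string contains no backslash at all.
    static String unescapedBackslashes() {
        return "C:\talend_files\new.csv";
    }

    public static void main(String[] args) {
        System.out.println(forwardSlashes());
        System.out.println(escapedBackslashes());
        // Prints false: the backslashes were consumed as escape sequences.
        System.out.println(unescapedBackslashes().contains("\\"));
    }
}
```

Many other backslash combinations (for example a backslash followed by the letter i) are not valid escapes and would stop the code from compiling at all, which is why forward slashes are the simpler habit.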

Next, we want TOS to skip over the header row in the file. To do this, change the 0 next to Header to a 1. This tells TOS to
skip one row at the beginning of the file.
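The effect of the Header setting can be imitated in a few lines of plain Java. This is only a sketch of the idea, not the code TOS generates; the class name, the semicolon delimiter, and the sample rows are assumptions, since the contents of customer1.csv are not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class HeaderSkipSketch {

    // Splits text into lines, skips the first `header` lines,
    // and splits every remaining line on the delimiter.
    static List<String[]> read(String text, int header, String delimiter) {
        List<String[]> rows = new ArrayList<String[]>();
        String[] lines = text.split("\r?\n");
        for (int i = header; i < lines.length; i++) {
            rows.add(lines[i].split(Pattern.quote(delimiter), -1));
        }
        return rows;
    }

    // Hypothetical file contents: one header line followed by two data lines.
    static String sample() {
        return "customer_id;first_name\n1;Mary\n2;Patricia\n";
    }

    public static void main(String[] args) {
        System.out.println(read(sample(), 0, ";").size());    // 3: header treated as data
        System.out.println(read(sample(), 1, ";").size());    // 2: header skipped
        System.out.println(read(sample(), 1, ";").get(0)[1]); // Mary
    }
}
```

With a header value of 0 the column titles would flow through the job as if they were a customer, which is exactly what we want to avoid.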

Now that we've specified the input file, we need to specify the schema (structure) of the input file. We do this by clicking
the button named "..." next to Edit Schema. You may have to scroll the component panel to see the Schema.

Note

Read through the next set of instructions before trying them. TOS uses many modal windows (windows
that are always on top of other windows), so you won't be able to scroll in this lesson unless you close
the modal window.

After you click the button, you'll see an empty window:

Now we could click the add button to add columns to the schema and enter the schema by hand, but
instead we'll import it from an XML file. Click on the icon to the left of the floppy disk - it looks like this:

Pick C:\talend_files\in\csv\customer1_Schema.xml as the file and click OK.


With the schema imported, you'll see the definitions of all of the columns. It will look something like this:

Click OK to save your changes. The red circle with the X is gone now, replaced by a warning sign. The warning still
exists because we don't have a destination for our data.

Note

Transformations are not always necessary. Sometimes there isn't anything to do other than read data
from one place and place it somewhere else.

We don't really care where the data ends up, since we are just doing a little test. Instead of putting the data in a
database somewhere and then querying the database, or putting the data in a different text file, let's use the tLogRow
widget to display rows on the console.
Click the File group to collapse it, then click the Logs & Errors group. Click once on tLogRow and drag it to the
canvas:

Now both components have warnings. What's the problem? Well, we haven't made any connections between the
source of data and the data destination.

To make a connection, right-click on the data source and choose Row -> Main:

Drop this connection on tLogRow:

The canvas should now look like this:

Note

If you make a mistake, you can always select a component and hit the Delete key to remove it. You can
also right-click on a component and choose Delete.

As long as your canvas looks similar to the screenshot, we're ready to run the job!

Note

Your canvas doesn't have to look exactly the same as our image here. The layout is strictly informational.
But the connections are important because they define how data flows through the job.

To run the job, click once on the canvas to make sure it is selected, then click the little green "Play" button at the top of
the screen:

You'll see some messages and activity, then a whole bunch of data scrolling on the console:

Congratulations! You've completed your first ETL job! In the next lesson we'll write our first real ETL job - it will import
an Excel spreadsheet for our date dimension. See you there!

Copyright 1998-2013 O'Reilly Media, Inc.

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
See http://creativecommons.org/licenses/by-sa/3.0/legalcode for more information.

ETL: The Date Dimension


DBA 3: Data Warehousing Lesson 7
Welcome back! In the last lesson we created our first ETL job. In this lesson we'll create our first "real" job, one that will load our
date dimension from an existing Excel spreadsheet.

Job Structure
Jobs in TOS can be as large or small as you want them to be. Your initial jobs can then execute subsequent jobs. We
will use this basic structure for our jobs:

Since dimensions give context to facts, we must process dimensions before we process facts.
This structure makes it possible for us to process the various components of the entire data warehouse separately: we
might only process dimensions, only process facts, or only process a single dimension or fact. Flexibility like this is
great when you're developing a data warehouse. Why bother to process the entire warehouse when you are tracking
down an issue with a single dimension, right?
Some data warehouse tasks (like loading our date dimension) occur only once. We'll create a job for each of those
types of tasks, but they will not execute as part of our normal warehouse processing.
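The ordering rule can be sketched in plain Java. This is only an illustration of the structure, not something TOS generates: the class and method names are hypothetical, the job names are borrowed from the warehouse tables in this course, and the order within each group is arbitrary. The one-time dimDate load is deliberately left out of the regular plan.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class JobStructureSketch {

    // Regular dimension and fact jobs; dimDate is loaded once by a separate job.
    static final List<String> DIMENSIONS = Arrays.asList("dimCustomer", "dimMovie", "dimStore");
    static final List<String> FACTS = Arrays.asList("factSales", "factCustomerCount", "factRentalCount");

    // Builds the run order for a warehouse update; dimensions always come before facts.
    static List<String> plan(boolean processDimensions, boolean processFacts) {
        List<String> order = new ArrayList<String>();
        if (processDimensions) {
            order.addAll(DIMENSIONS);
        }
        if (processFacts) {
            order.addAll(FACTS);
        }
        return order;
    }

    public static void main(String[] args) {
        System.out.println(plan(true, true));   // whole warehouse
        System.out.println(plan(true, false));  // dimensions only
    }
}
```

Because each group is a separate step, running only the dimensions (or only the facts) is a matter of which steps the parent job asks for.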

Loading Data from Excel


Before we can load our spreadsheet into our database, we need to create a new job. Right-click on Job Designs and
choose Create job:

Name the new job dimDate.


At this point you will see the blank job canvas on the screen. In the Palette, click to expand File --> Input. Find the
tFileInputExcel component and drag it to your canvas:

Your component may be named tFileInputExcel_1, or tFileInputExcel_2. Its unique name is generated
automatically by TOS, and isn't really useful to us. We'll rename it so it makes more sense. Click in the middle of the
tFileInputExcel component you just dropped on the canvas, then switch to the Component tab:

Now you might see the Basic settings sub tab. To change the name of the component, change to the View sub tab:

The component's current name is set to __UNIQUE_NAME__. Click in the label format box, and change the text to a
more meaningful value: Date Spreadsheet.
Now we can switch back to the Basic settings sub tab. The current file name points to an invalid spreadsheet. You
can click inside the text box to type in the correct location, or click the button next to it to pick the file. Change the file name
to:
CODE TO TYPE:
C:/talend_files/DateDimension.xls

Note

If you type in the file name, be sure to use forward slashes instead of backslashes.

Excel spreadsheets are also known as workbooks, and workbooks contain sheets. Suppose your coworker tells you
that the data for the date dimension is located on a sheet called Sheet1. Scroll down through the basic settings until
you see the Sheet list. Find the button and click it. Change the default text to "Sheet1". (Don't forget to use
those double quotation marks):

Your spreadsheet also contains a header row. We need to tell TOS about this header row, otherwise it would try to
import it as data. Scroll down farther until you see the Header and Footer section. Change the Header text box to 1:

Our component is still in error because we haven't specified the schema of our Excel file. Typically, we would specify
the schema in the Metadata section of the repository, but this spreadsheet is only going to be used in this single job,
so in this instance, we'll keep the schema within our component. Our coworker has already provided us with the
schema definition:
Column        Type
date          Date
is_weekend    Boolean
is_holiday    Boolean
year          Integer
quarter       Character
month         Integer
week_in_year  Integer
day_in_week   Integer

To edit the schema for our spreadsheet, scroll to the bottom of the component window, and click the "..." button
located next to Edit Schema. Click the add button to add a new entry to the schema. Specify the column name and
type for each of the columns in the previous table. All of our columns contain data, so we don't want to allow NULL
values. Uncheck the Nullable box for each column.

This schema defines the layout of the data in the Excel spreadsheet. It is important to match the column data types and
order precisely. Bad things can happen if we get the order wrong. For instance, a mismatch could
confuse TOS so that it wouldn't know whether to interpret the value "2008" as a month or a year. That kind of confusion
has the potential to wreak all sorts of havoc on our work.
When you're done, your schema should look like this:

Click "OK" to close the window. TOS puts an asterisk * next to the file name when your job has changes that have not
been saved. When you make changes to your job, get in the habit of saving them. You can save your file in one of two
ways: use the Save command on the File menu or click the floppy disk icon on the toolbar:

Let's test our component to make sure it's working properly. We can test it using the tLogRow component. In the
Palette, click to expand the Logs & Errors tab, then click and drag tLogRow to your canvas.
Link the Date Spreadsheet component to the tLogRow component by right-clicking on your Date Spreadsheet
component and choosing Row -> Main. Drop the link on tLogRow:

Note

For more information on linking components, review the previous lesson.

Run your job by clicking on the green "Play" button. If everything is set up correctly, you'll see output that looks like this:

OBSERVE:
Starting job dimDate at 16:36 14/10/2008.
01-01-2000|true|false|2000|1|1|1|7
02-01-2000|true|false|2000|1|1|2|1
03-01-2000|false|false|2000|1|1|2|2
04-01-2000|false|false|2000|1|1|2|3
... lines omitted ...
27-12-2050|false|false|2050|4|12|53|3
28-12-2050|false|false|2050|4|12|53|4
29-12-2050|false|false|2050|4|12|53|5
30-12-2050|true|false|2050|4|12|53|6
31-12-2050|true|false|2050|4|12|53|7
Job dimDate ended at 16:36 14/10/2008. [exit code=0]
We are off to a great start!

Adding Columns to our Data Flow


Before we load this data into our dimDate table, let's review the columns in dimDate. Open a terminal and connect to
your personal database. Run the following command:
CODE TO TYPE:
describe dimDate;
Your results should look something like this:
OBSERVE:
mysql> describe dimDate;
+------------+-------------+------+-----+---------+-------+
| Field      | Type        | Null | Key | Default | Extra |
+------------+-------------+------+-----+---------+-------+
| date_key   | int(11)     | NO   | PRI | NULL    |       |
| date       | date        | NO   |     | NULL    |       |
| year       | smallint(6) | NO   |     | NULL    |       |
| quarter    | char(2)     | NO   |     | NULL    |       |
| month      | tinyint(4)  | NO   |     | NULL    |       |
| day        | tinyint(4)  | NO   |     | NULL    |       |
| week       | tinyint(4)  | NO   |     | NULL    |       |
| is_weekend | tinyint(1)  | YES  |     | NULL    |       |
| is_holiday | tinyint(1)  | YES  |     | NULL    |       |
| month_name | varchar(9)  | YES  |     | NULL    |       |
| day_name   | varchar(9)  | YES  |     | NULL    |       |
+------------+-------------+------+-----+---------+-------+
11 rows in set (0.00 sec)
Now take a look back at the schema for our data source. It contains the following columns: date, is_weekend,
is_holiday, year, quarter, month, week_in_year, day_in_week. Compared with dimDate, our spreadsheet is
missing date_key, day (of the month), month_name, and day_name, and its week_in_year column corresponds to the
week column in dimDate. And while the spreadsheet includes a column for day_in_week, our date
dimension does not have that column.
ETL is used specifically to address these kinds of differences. Sources of data are rarely in the format we want for our
data warehouse. They might be missing some information, or contain information that we don't need.
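The comparison between the two column lists can also be done mechanically. The following Java sketch is just an illustration (the ColumnDiff class and its method names are hypothetical); it uses the spreadsheet columns and the dimDate columns listed above and computes the difference in both directions.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

final class ColumnDiff {

    // Columns provided by the Excel spreadsheet.
    static Set<String> source() {
        return new LinkedHashSet<String>(Arrays.asList(
                "date", "is_weekend", "is_holiday", "year", "quarter",
                "month", "week_in_year", "day_in_week"));
    }

    // Columns of the dimDate table, in table order.
    static Set<String> target() {
        return new LinkedHashSet<String>(Arrays.asList(
                "date_key", "date", "year", "quarter", "month", "day", "week",
                "is_weekend", "is_holiday", "month_name", "day_name"));
    }

    // Returns the elements of a that are not in b, keeping a's order.
    static Set<String> minus(Set<String> a, Set<String> b) {
        Set<String> result = new LinkedHashSet<String>(a);
        result.removeAll(b);
        return result;
    }

    public static void main(String[] args) {
        // [date_key, day, week, month_name, day_name]
        System.out.println("Needed by dimDate: " + minus(target(), source()));
        // [week_in_year, day_in_week]
        System.out.println("Only in spreadsheet: " + minus(source(), target()));
    }
}
```

A comparison by name cannot tell that week_in_year and week hold the same information under different names; spotting that kind of rename still takes a person who knows the data.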
TOS has many different components, and several could be used to add columns to our data flow. For this example,
we'll use a component called tMap. The map component is an all-purpose tool in TOS that allows you to perform
joins, filters, transformations, and data splits. We will use it for transformations.
Before we add tMap to our canvas, we need to delete the link between Date Spreadsheet and tLogRow_1. We will
leave tLogRow_1 on our canvas for now, since we will use it to test our changes.
To delete the link, you can right-click on the row1 (Main) arrow between Date Spreadsheet and tLogRow_1 and
choose Delete, or click once on the row1 (Main) arrow and press the Delete key on your keyboard.

Note

If you accidentally delete the wrong component, just choose "Undo" from the Edit menu.

Next, expand the Palette, and click on Processing. Scroll down until you find the tMap component. Add it to your
canvas, and feel free to rearrange the other components:

Rename tMap_1 to something more meaningful - change it to Add Columns. Next, link Date Spreadsheet to Add
Columns. Before we link Add Columns to our logging component, let's add our new columns. Click once on Add
Columns, then change the tab to Component. The window will look like this:

Note

The tMap window is another modal window, so you won't be able to scroll course content while you're
working with tMap. Make sure you click OK to close the tMap window so your progress is retained, and
save your job often!

To edit our transformation, click on Basic settings and then click on the button next to Map Editor. You'll
probably want to expand the new window to fill your entire screen.
The editor for tMap has three distinct areas: Inputs, Variables, and Outputs:

Each flow into tMap shows up in the Input section. You must have at least one input.
The Variables section lets you set or modify variables, which is useful when you want to create counters.
The Output section lets you define the way input rows are passed on to the next component. You may have
multiple outputs.

For our current dimension, we won't use variables, only inputs and outputs. Click the add button in the Outputs
section:

Name your new output dimDate:

Now we need to link our input columns to output columns. Click and hold on the date input column:

Now drag the column over to the dimDate output that was just created:

Once you drop the column, you'll see the link:

Repeat these steps for all of the remaining columns except day_in_week; we are not using that column. When
you're done, you'll see this:

Next, let's add a new column for date_key. Primary keys in data warehouses are often implemented using auto-increment
columns, but it's much more convenient to have date_key in a coded format, such as yyyyMMdd. Using this
format, a value of 20080101 would represent January 1st, 2008.

To add a new column to the output, click the add button in the schema section:

Name this column date_key, change its type to int, and uncheck the Nullable checkbox. Eventually the order of
columns in our output must match our actual dimDate table, so we might as well move the date_key column to the
very top now. We can do this using the up and down arrows to the right of the add button:

Next, change the name of the output column from week_in_year to week.


Add another new column called day, change its type to int, and uncheck the Nullable checkbox. Next, add two new
string columns: month_name and day_name.
Finally, reorder the remaining columns so they match YOUR dimDate table. The order will be similar to this:
1. date_key
2. date
3. year
4. quarter
5. month
6. day
7. week
8. is_weekend
9. is_holiday
10. month_name
11. day_name
TOS inserts data into your dimDate table using the column order you specify. If you specify the incorrect order, TOS
may insert data into the wrong column.
With date_key at the top, we can now use an expression to set the value of date_key. Click the button in the
expression box:

The expression builder looks like this:

Because our TOS project is based on Java, expressions are also written in Java. This gives us lots of power. You can
use Java string functions to create some very powerful expressions. The bottom half of the expression window is a
catalog of some common expressions that you can use. We'll use a function from the TalendDate category to convert
our date to "yyyyMMdd" format, then convert that string into an integer, ready for the database. (Don't worry if you're
not quite an expert using Java - we'll provide you with the expressions you need for this course. If you're interested in
learning more about Java, check out the Java Certificate Series.)
In the expression builder, type in this code:
CODE TO TYPE:
Integer.parseInt(
TalendDate.formatDate("yyyyMMdd",row1.date)
)
Click "OK" to save your expression. Next, edit the expression for the day column. Type the code below into the
expression builder:
CODE TO TYPE:
row1.date.getDate()
Next, set the expression for month_name. Type the code below into the expression builder:
CODE TO TYPE:
TalendDate.formatDate("MMMM",row1.date)
Finally, set the expression for day_name. Type the code below into the expression builder:
CODE TO TYPE:
TalendDate.formatDate("EEEE",row1.date)
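If you would like to try these four expressions outside of TOS, the sketch below reproduces them in plain Java. TalendDate is only available inside TOS, so this sketch swaps in java.text.SimpleDateFormat, on the assumption that the patterns used here ("yyyyMMdd", "MMMM", "EEEE") mean the same thing in both. The class and method names are hypothetical, and the locale is fixed to English so the names come out predictably.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;

final class DateColumnsSketch {

    // Stand-in for TalendDate.formatDate(pattern, date), with an English locale.
    static String format(String pattern, Date date) {
        return new SimpleDateFormat(pattern, Locale.ENGLISH).format(date);
    }

    // date_key: the date rendered as yyyyMMdd, then parsed as an integer.
    static int dateKey(Date date) {
        return Integer.parseInt(format("yyyyMMdd", date));
    }

    // day: same result as date.getDate(), but without the deprecated Date method.
    static int dayOfMonth(Date date) {
        Calendar calendar = Calendar.getInstance();
        calendar.setTime(date);
        return calendar.get(Calendar.DAY_OF_MONTH);
    }

    static String monthName(Date date) {
        return format("MMMM", date);
    }

    static String dayName(Date date) {
        return format("EEEE", date);
    }

    // Helper for building test dates; month is 1-based here.
    static Date dateOf(int year, int month, int day) {
        return new GregorianCalendar(year, month - 1, day).getTime();
    }

    public static void main(String[] args) {
        Date date = dateOf(2000, 1, 1);
        System.out.println(dateKey(date));     // 20000101
        System.out.println(dayOfMonth(date));  // 1
        System.out.println(monthName(date));   // January
        System.out.println(dayName(date));     // Saturday
    }
}
```

Date.getDate() has been deprecated since JDK 1.1 but still works, which is why the short expression above is fine inside the map editor; the Calendar version is simply the non-deprecated way to get the same number.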
We've used the date input column to come up with several output columns. Graphically, TOS displays this with
multiple arrows from date going to different rows in the output:

With our columns complete, we are free to close the map editor. Click "OK" and then save your job.
We are nearly there! Link Add Columns to tLogRow_1.
Hey, it looks like we have a problem. There's that little red circle with an X in it:

If you hover over the component, you'll see this error:


OBSERVE:
tLogRow_1
Errors:
- The schema from the input link "dimDate" is different from the schema defined in
the component.
The problem is that tLogRow_1 still has the schema from its earlier connection. To fix this problem, click
tLogRow_1 once to select it, then change to the Component tab. You'll see a button called Sync Columns - click it.

With that problem fixed, we are free to run the job.


After you run the job, you will see new output:
OBSERVE:
Starting job dimDate at 16:34 15/06/2009.
20000101|01-01-2000|2000|1|1|1|1|true|false
20000102|02-01-2000|2000|1|1|2|2|true|false
20000103|03-01-2000|2000|1|1|3|2|false|false
... lines omitted ...
20501229|29-12-2050|2050|4|12|29|53|false|false
20501230|30-12-2050|2050|4|12|30|53|true|false
20501231|31-12-2050|2050|4|12|31|53|true|false
Job dimDate ended at 16:34 15/06/2009. [exit code=0]
It's looking really good now.

Adding Data to dimDate

So far, we've read data from our Excel spreadsheet and added four new columns (date_key, day, month_name, and
day_name) to the data flow. The next step is to deposit our data into the dimDate table.
We'll be connecting to MySQL often, so it would be good to keep MySQL connection information in one place. We can
do this using the Metadata section of our repository.
To create a connection, click to expand the Metadata section of the repository:

Next, right-click on Db Connections and choose Create Connection:

Spaces are not allowed in connection names, so give your connection this name: DataWarehouse. If you want to,
you can leave the purpose and description fields blank, keep version set at 0.1, and leave status unselected. Those
fields are just additional metadata for your connection:

When you're done, click Next >.


In the final screen, choose MySQL as the DB Type, and enter this information:
Login: your username
Password: your password
Server: sql.oreillyschool.com
Port: 3306
Database: your username

Click the Check button to try to connect to your database. If your connection is good, you'll see the message
"DataWarehouse" connection successful. Click Finish to save your connection.

Note

If you edit your database connection, TOS will ask you if you want to propagate the modifications to all
jobs. Choose yes - you want your changes to apply everywhere.

Since we are storing our database connection information in the repository, we should also store our table schema in
the repository. Right-click on your database connection, and choose Retrieve Schema:

We don't need to filter our schema, so click Next > at the bottom of the window:

In the next window, scroll through your database until you come across the objects for this course:
dimCustomer
dimDate
dimMovie
dimStore
etlLog
etlRuns
factCustomerCount
factRentalCount
factSales

Click Next >. In this final screen you can review the schema and make any necessary changes. We don't need to
change anything, so click Finish:

Now the repository will show the tables associated with our connection:

With our connection set up, we can swap out tLogRow_1 for a MySQL output. First, delete tLogRow_1 from your
canvas. Next, expand the Databases section of the Palette, then expand the MySQL sub-section:

Drag tMysqlOutput to your canvas, and link it from the Add Columns component. Your canvas will now look
something like this:

Next, select tMysqlOutput_1, and switch to the Component tab. Change the name of tMysqlOutput_1 to
dimDateTable.
Now switch back to the Basic settings tab. By default, TOS assumes you are going to save the database connection
information inside of the component. This is known as a Built-In property type.
Our connection information is stored in the repository, so change the Property Type from Built-In to Repository. We
only have one database connection in the repository, and TOS selects it for us automatically:

We're almost done. Next, we'll tell our output component where and how to place the data. Scroll down the Basic
settings tab to see the remaining properties.
Set the Table to dimDate. We don't want duplicate rows in our table, and we want to reload our table completely each
time we execute this job, so set Action on table to Clear Table. This essentially executes a DELETE FROM dimDate;
before inserting data into dimDate. We only want to insert rows (without updating or deleting individual rows), so set Action on
data to Insert. Finally, if there is any problem, we want to stop the job immediately, so check the box next to Die on
error:
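The combination of these three settings can be pictured with a toy model in plain Java. This is not what tMysqlOutput does internally, and it does not model transactions or rejected-row handling; the class name, the in-memory list standing in for the table, and the sample values are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ReloadSketch {

    // Reloads the "table" from incoming values and returns the number of rows inserted.
    static int reload(List<Integer> table, List<String> incoming, boolean dieOnError) {
        table.clear();                                    // Action on table: Clear Table
        int inserted = 0;
        for (String raw : incoming) {
            try {
                table.add(Integer.valueOf(raw.trim()));   // Action on data: Insert
                inserted++;
            } catch (NumberFormatException e) {
                if (dieOnError) {
                    // Die on error: stop at the first bad row.
                    throw new IllegalStateException("Stopping on bad row: " + raw, e);
                }
                // Without die-on-error, this sketch skips the bad row and continues.
            }
        }
        return inserted;
    }

    public static void main(String[] args) {
        List<Integer> table = new ArrayList<Integer>(Arrays.asList(19991231));
        List<String> incoming = Arrays.asList("20000101", "20000102", "20000103");
        reload(table, incoming, true);
        reload(table, incoming, true);   // a second run does not duplicate rows
        System.out.println(table);       // [20000101, 20000102, 20000103]
    }
}
```

Clearing first is what makes the job safe to run more than once, and stopping on the first error means a problem surfaces right away instead of leaving a table that looks complete but is missing rows.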

Save your job. Now you're ready to run it! It might take a minute or two to read from the Excel spreadsheet and transfer
everything to your database. When the job is complete, you'll see this output:
OBSERVE:
Starting job dimDate at 13:29 16/10/2008.
Job dimDate ended at 13:31 16/10/2008. [exit code=0]

We can double-check by running a quick query. Switch back to the terminal and run this command:
CODE TO TYPE:
SELECT * from dimDate
LIMIT 0, 10;

If your job ran successfully, and your query is correct, you will see this:


OBSERVE:
mysql> SELECT * from dimDate
    -> LIMIT 0, 10;
+----------+------------+------+---------+-------+-----+------+------------+------------+------------+-----------+
| date_key | date       | year | quarter | month | day | week | is_weekend | is_holiday | month_name | day_name  |
+----------+------------+------+---------+-------+-----+------+------------+------------+------------+-----------+
| 20000101 | 2000-01-01 | 2000 | 1       |     1 |   1 |    1 |          1 |          0 | January    | Saturday  |
| 20000102 | 2000-01-02 | 2000 | 1       |     1 |   2 |    2 |          1 |          0 | January    | Sunday    |
| 20000103 | 2000-01-03 | 2000 | 1       |     1 |   3 |    2 |          0 |          0 | January    | Monday    |
| 20000104 | 2000-01-04 | 2000 | 1       |     1 |   4 |    2 |          0 |          0 | January    | Tuesday   |
| 20000105 | 2000-01-05 | 2000 | 1       |     1 |   5 |    2 |          0 |          0 | January    | Wednesday |
| 20000106 | 2000-01-06 | 2000 | 1       |     1 |   6 |    2 |          0 |          0 | January    | Thursday  |
| 20000107 | 2000-01-07 | 2000 | 1       |     1 |   7 |    2 |          1 |          0 | January    | Friday    |
| 20000108 | 2000-01-08 | 2000 | 1       |     1 |   8 |    2 |          1 |          0 | January    | Saturday  |
| 20000109 | 2000-01-09 | 2000 | 1       |     1 |   9 |    3 |          1 |          0 | January    | Sunday    |
| 20000110 | 2000-01-10 | 2000 | 1       |     1 |  10 |    3 |          0 |          0 | January    | Monday    |
+----------+------------+------+---------+-------+-----+------+------------+------------+------------+-----------+
10 rows in set (0.00 sec)
It looks great!
dimDate is almost complete. Because most facts reference dimDate and need to look up a date_key based on a real
date, we need to add an index on the date column. Run the following command against your personal database:
CODE TO TYPE:
ALTER TABLE dimDate ADD INDEX(date);
If you typed everything correctly, you'll see this:
OBSERVE:
Query OK, 18628 rows affected (0.46 sec)
Records: 18628 Duplicates: 0 Warnings: 0

If you run into problems...


You have a couple of debugging options if TOS gives you an error when you try to run your job or your
SELECT * from dimDate query doesn't return any results.
First, take a look at the job itself for any red X's:

If you see a red X, one of your components contains an error. Hover over the component and TOS should let
you know what the problem is. If your schemas differ between your components, click on the Sync Columns
button of your last component:

Next, check the job output for red text that looks like this:

TOS is telling us that it was unable to interpret the value 01-Jan-2000. Chances are, the input schema is
incorrect - either the columns are out of order or a data type is incorrect. Check your schema again.
If you don't see any results from the query, other than 0 rows in set (0.00 sec), double-check your
output schema in tMap. Your columns may be out of order or you might have an incorrect data type.
If you are unable to find the source of a problem, you can always contact your mentor at
learn@oreillyschool.com.
You've accomplished a lot in this lesson - you extracted an Excel spreadsheet, transformed the columns in the spreadsheet,
and loaded the data into the warehouse: ETL! In the next lesson we'll press on with our ETL and the remaining dimensions.
See you then!


Basic Dimension Processing


DBA 3: Data Warehousing Lesson 8
Welcome back. In the last lesson we created our first dimension: dimDate. In this lesson we'll create another basic dimension:
dimMovie.

Loading dimMovie
Job Structure
Before we dive into our movie dimension, let's review our job structure:

Our job for dimMovie is a sub-job executed by Process Dimensions, which is itself a sub-job executed by Daily
Warehouse Update. This structure allows us a great deal of flexibility: we can reload the entire data
warehouse by running a single job, reload the facts alone by running a single job, or reload only a
single dimension or fact.
Let's set up this structure. Create a new job:

Name this job ProcessDataWarehouse. (Remember, spaces are not allowed in job names.)

Pre and Post Job


TOS has two special components: tPrejob and tPostjob. These components let you specify what
happens before a job is run and after the job is completed. We'll use a tPrejob component to run our stored
procedure etl_StartRun, which we created back in lesson 5. We'll use a tPostjob component to run the
stored procedure etl_EndRun.
The etl_StartRun stored procedure returns two columns: a key and a value. TOS interprets these key/value pairs
as part of the context of job execution. In TOS, the context is a grouping of global variables that can be referenced
within the job. Essentially, these key/value pairs are turned into variables in each job, which enables us to use the run_id
in other parts of our job.
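The idea of turning key/value rows into named variables can be sketched in plain Java. This is a simplification for illustration, not the way TOS implements its context; the class name is hypothetical and the run_id value of 42 is made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ContextLoadSketch {

    // Copies key/value rows into a name -> value map, the way a context groups variables.
    static Map<String, String> load(String[][] rows) {
        Map<String, String> context = new LinkedHashMap<String, String>();
        for (String[] row : rows) {
            context.put(row[0], row[1]);
        }
        return context;
    }

    // Values arrive as strings; run_id is read back as an integer.
    static Integer runId(Map<String, String> context) {
        return Integer.valueOf(context.get("run_id"));
    }

    public static void main(String[] args) {
        String[][] rows = { {"run_id", "42"} };   // hypothetical stored procedure output
        Map<String, String> context = load(rows);
        System.out.println(runId(context));       // 42
    }
}
```

Once the value sits in the context under a known name, any later step in the job can ask for run_id without calling the stored procedure again.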
In the last lesson we added a database connection, and used it to store the schemas for several tables. TOS
also lets us store generic schemas, which are schemas not associated with specific connections. Let's
create a generic schema that defines the columns returned by the etl_StartRun stored procedure.
First, expand the Metadata section of the repository, then right-click on Generic schemas. Choose Create
generic schema:

Name it Auditing and click Next:

Finally, name the schema etlRun instead of metadata, and add two columns to the schema. The columns
will be named key and value; they are strings of length 255 and they allow nulls. When you're done, click
Finish:

Now we're ready to start working on the actual job. Drag a tPrejob component from the palette to your
canvas. It's located under the Orchestration section:

Next, drag a tMysqlInput component to the canvas. It's located under Databases --> MySQL. Position it to
the right of tPrejob. Right-click on tPrejob and link Trigger --> On Component OK from tPrejob to
tMysqlInput.
Change the name of tMysqlInput to etl_StartRun using the View sub tab in the Component tab.
Click on etl_StartRun, then choose the Component tab below your canvas. Set these properties:
Property Type: Repository - DB (MYSQL):DataWarehouse (select your database connection
from the repository)
Schema: Repository - GENERIC:Auditing - etlRun (select the metadata you just created
from the repository)
Query: "CALL etl_StartRun()" (don't forget the quotation marks)
Choose the generic schema you just created by clicking on the button to the left of Edit Schema. Navigate to
the schema that was just added and select etlRun:

Yo ur screen sho uld lo o k so mething like this:

So far, so good. Next, drop a tContextLoad component on the canvas. This component is located under
the Misc tab in the palette. This component accepts data and loads it to the current context.
Link the main row output of etl_StartRun to tContextLoad.
To extract the run_id from the context, we'll need to give TOS a little more information. At the bottom of the
window, click on the Contexts tab. Make sure you are on the Variables sub-tab, then click the button to add a
new entry.

The Name is run_id, its source is built-in, its type is Integer, and the script code is context.run_id.
When you are done, the tab should look like this:

Finally, left-click in the blue area surrounding etl_StartRun to select the sub job.

Make sure you're on the Component tab, then click to select the box Show subjob title. Type in an
appropriate title:

Your canvas should now look something like this:

If you haven't saved the job, now is a good time to do that.


Now that we have our start in place, we need to put our end in place. Since our stored procedure doesn't
output any rows, we'll use the tMysqlSP component instead of tMysqlInput. tMysqlSP is used specifically
to call stored procedures. Okay, now execute these steps:
1. Drag a tPostjob component from the palette (under Orchestration) to your canvas.
2. Drag a tMysqlSP component to the right of tPostjob (under Databases --> MySQL).
3. Link the On Component OK trigger from tPostjob to tMysqlSP.
4. Set the connection properties on tMysqlSP.
5. Set the SP Name to etl_EndRun.
Your tMysqlSP component's settings should look like this:

When you're done, your post job should look like this:

Logging
TOS has two "catcher" components: tLogCatcher and tStatCatcher. These components listen for log
messages and statistics and then create a flow using that information. This gives us the flexibility to write to
a log file, to a database, anywhere!
In lesson 5 we created a table called etlLog. We'll use the catcher components and some transformation
logic to save logs and stats in that table. Let's get started. Drag one tLogCatcher component
and one tStatCatcher component to your canvas; they are located under the Logs & Errors tab:

Take a look at the schema for the components you just dropped on the canvas. Select one of them and
switch to the Component tab. Then click the button next to Edit Schema. (This is a little misleading
since you can't actually edit the schema for these components.)
You can see that these components are similar, but not exactly the same. We could have sent each
component to a different database table, but then we would have to do extra work to see all our log
information. With a little work we can transform, then unite the two components to log into our single etlLog
table.
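If it helps to see the idea outside of TOS, here is a minimal Python sketch of "transform, then unite": two
differently shaped row types are reshaped into one common column list and then combined into a single flow.
This is not TOS code. The helper names, the sample rows, and the trimmed column list are all assumptions
made up for illustration.

```python
# Common column order for the single log table (a trimmed, illustrative subset).
ETL_LOG_COLUMNS = ["run_id", "moment", "job", "message_type", "message", "code", "duration"]

def map_log_row(row, run_id):
    """Reshape a log-catcher style row; stat-only columns stay blank (None)."""
    return {
        "run_id": run_id,
        "moment": row["moment"],
        "job": row["job"],
        "message_type": row["type"],   # the log flow calls this column "type"
        "message": row["message"],
        "code": row["code"],
        "duration": None,              # only the stat flow supplies a duration
    }

def map_stat_row(row, run_id):
    """Reshape a stat-catcher style row; log-only columns stay blank (None)."""
    return {
        "run_id": run_id,
        "moment": row["moment"],
        "job": row["job"],
        "message_type": row["message_type"],
        "message": row["message"],
        "code": None,                  # only the log flow supplies a code
        "duration": row["duration"],
    }

def unite(*flows):
    """Concatenate flows, insisting every row has the same columns in the same order."""
    united = []
    for flow in flows:
        for row in flow:
            if list(row) != ETL_LOG_COLUMNS:
                raise ValueError("schema mismatch: %r" % list(row))
            united.append(row)
    return united

# Made-up sample rows standing in for what the two catchers might emit.
log_rows = [{"moment": "2009-06-09 11:49:08", "job": "ProcessDataWarehouse",
             "type": "warning", "message": "example warning", "code": 42}]
stat_rows = [{"moment": "2009-06-09 11:49:12", "job": "ProcessDataWarehouse",
              "message_type": "end", "message": "success", "duration": 5594}]

run_id = 1
etl_log = unite([map_log_row(r, run_id) for r in log_rows],
                [map_stat_row(r, run_id) for r in stat_rows])
for row in etl_log:
    print(row)
```

The check inside unite mirrors the reason column order matters in the job: both flows have to agree on the
same columns, in the same order, before they can share one output table.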
We'll use two tMap components to transform the log and stat output. Drag two tMap components from the
Processing menu of the palette to your canvas, then link the Main output of each catcher to its own tMap
component. When done, that part of your canvas should look like this:

Next, double-click the tMap component linked to tLogCatcher. Your input to tMap might not be called row2
- that's fine. A modal window will open where you can add output information. Some columns should have
blank expressions. Those columns will be used by the stat catcher only. Some columns, like run_id, do not
exist in the input. You'll have to add them manually. Also, be sure your columns are in the correct order. Add
one output, then add or link the columns below:
Expression        Column             Type
context.run_id    run_id             integer
row2.moment       moment
row2.pid          pid
row2.father_pid   father_pid
row2.root_pid     root_pid
blank             system_pid         Long, length 8
row2.project      project
row2.job          job
blank             job_repository_id  string, length 255
blank             job_version        string, length 255
row2.context      context
row2.priority     priority
row2.origin       origin
row2.type         message_type
row2.message      message
row2.code         code
blank             duration           Long, length 8

Note

Make sure you rename the type column to message_type.

When you're done, your mappings should look like this:

Now is a good time to save your work.
Next, double-click the tMap component linked to tStatCatcher. Your input to tMap may not be called
row4 - that's okay. Some columns should have blank expressions. Those columns will be used by the log
catcher only. Add one output, then add or link the following columns:
Expression              Column             Type
context.run_id          run_id             integer
row4.moment             moment
row4.pid                pid
row4.father_pid         father_pid
row4.root_pid           root_pid
row4.system_pid         system_pid
row4.project            project
row4.job                job
row4.job_repository_id  job_repository_id
row4.job_version        job_version
row4.context            context
blank                   priority           integer, length 3
row4.origin             origin
row4.message_type       message_type
row4.message            message
blank                   code               integer, length 3
row4.duration           duration

When you're done, your mappings should look like this:

That looks great! Now we need to take our two flows of log data and unite them into a single flow. This is
done using the tUnite component, which is under the Orchestration section in the palette. Drop one onto
your canvas.
Next, link the output from each of your map components to tUnite. When you're finished, your canvas
should look similar to this (don't worry about names, it's okay if they're different):

You might see a warning on your tUnite component. This happens when the schema for one of the input
flows is different from the others. You can check out the schemas by selecting tUnite, clicking on the
Component tab, then clicking on the button next to Edit Schema.

You'll have to do a visual inspection to see which column differs, then go back to the correct tMap
component to fix the issue. If TOS asks you whether you want to Propagate Changes, choose Yes.
Finally, drop a tMysqlOutput component from the Databases menu of the palette onto your canvas. Link
the Main output of tUnite to tMysqlOutput.
Set the database connection. Click on the tMysqlOutput Component tab, and go to Basic settings. Set
the Property Type to Repository. TOS will ask if you want to take the schema from the input component.
You do, so choose Yes. Make sure you specify that the Table is "etlLog", and that you want to Insert data.
The Action on table should remain None.
Your canvas should now look similar to this:

Save your work. With this code done, we are ready to start working on our dimension!

dimMovie
Our basic job structure is in place, so now we're free to concentrate on the actual dimension processing.
Let's start with dimMovie. You might recall from lesson 3 that dimMovie is a Type 1 slowly changing
dimension. This means that we do not track any changes on the dimension - instead we update rows in our
data warehouse as they change in the source system. (We'll learn more about this a little later.)
This is the first time we need to connect to the sakila database, so we need to add a new shared database
connection to the repository.

To create a connection, click to expand the Metadata section of the repository:

Next, right-click on Db Connections and choose Create Connection:

Name the connection sakila then click Next >. The database type is once again MySQL. Leave the Login
and Password options blank, but set the server to sql.oreillyschool.com. The port is 3306, and the
Database is sakila. Click Check to make sure you can connect to sakila, then click Finish to save your new
connection.
Drop a tMysqlInput component on your canvas. Go to the Component tab, and set the connection to sakila
from the Repository using the button on the far right of the Property Type row.

At this point you could choose to save your query in the repository next to the database connection. But since
this query will only be used in this specific job, we'll leave the query in this tMysqlInput component. Click on
the button to the right of the query text box.

Back in lesson 5, we wrote a query to get data out of the movie tables. We'll use that same query now. Just
this once, copy and paste this query into the SQL builder window:
CODE TO USE:
SELECT f.film_id, f.title, f.description, f.release_year,
       l.name AS language, orig_lang.name AS original_language,
       f.rental_duration, f.length, f.rating, f.special_features
FROM film f
JOIN language l ON (f.language_id = l.language_id)
LEFT JOIN language orig_lang ON (f.original_language_id = orig_lang.language_id)

Your window should look something like this:

To test your query, click on the runner icon, or press Ctrl-Enter. You'll see, at most, 100 rows of results:

When you're satisfied with your query, click OK.

Next, we need to specify our schema. TOS does have a Guess Schema button which can save time.
Unfortunately, it has difficulty reading the schema of our query, so we'll have to enter it by hand. Click on the
button to the right of Edit Schema. Add the following columns (remember, order does matter!):
Column             Type     Length  Notes
film_id            int              key, not null
title              String   27
description        String   200
release_year       int
language           String   20
original_language  String   20
rental_duration    Integer
length             Integer
rating             String
special_features   String   60

dimMovie doesn't really need any transformations, since the source data of the dimension is very good. But
we do need to add our audit column to the flow:
1. Drag a tMap component to your canvas.
2. Map the Main output of tMysqlInput to tMap.
3. Edit tMap, adding an output called Main.
4. Add a column to the output: run_id of type Integer that cannot be null. The
expression for this column is context.run_id.
5. Click and drag all columns from the input to the Main output.

When you're done, your map should look similar to this:

Now perform these steps:
1. Drop a tMysqlOutput component on the canvas.
2. Set the connection to the Repository --> DataWarehouse.
3. Link the output of tMap to your new tMysqlOutput component.
4. Set the table to "dimMovie".
5. Change the Action on data to Update or Insert.

When you're done, tMysqlOutput should look similar to this:

What does the Action on data setting of Update or Insert mean?

Good question! Update or Insert means that for each row of data sent to tMysqlOutput, TOS will
determine if the row exists in the database table dimMovie. If it exists and is different, the row will be updated.
If it does not exist, the row is inserted.

Well then, you ask, how does TOS know if the row exists? Another good question. TOS looks at the Key that
we specified in the schema. In this case, we set the column film_id to be a key column, so this column is
used to check whether the row exists. All columns specified as keys are checked when performing an
insert or update.
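To see the "update or insert" decision in miniature, here is a rough Python sketch that uses the standard
sqlite3 module and an in-memory table in place of MySQL. It is not what TOS generates. The cut-down
dimMovie table, the update_or_insert helper, and the sample rows are assumptions for illustration only,
and unlike the behavior described above, the sketch updates every matching row without first checking
whether it is different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dimMovie (
        movie_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        film_id   INTEGER NOT NULL UNIQUE,            -- key column from the schema
        title     TEXT,
        run_id    INTEGER
    )
""")

def update_or_insert(conn, row):
    """Update the row whose key (film_id) matches, otherwise insert a new row."""
    found = conn.execute(
        "SELECT 1 FROM dimMovie WHERE film_id = ?", (row["film_id"],)
    ).fetchone()
    if found:
        conn.execute(
            "UPDATE dimMovie SET title = ?, run_id = ? WHERE film_id = ?",
            (row["title"], row["run_id"], row["film_id"]),
        )
        return "updated"
    conn.execute(
        "INSERT INTO dimMovie (film_id, title, run_id) VALUES (?, ?, ?)",
        (row["film_id"], row["title"], row["run_id"]),
    )
    return "inserted"

# The first run loads two films; the second run changes one title and adds a film.
first_run = [{"film_id": 1, "title": "ACADEMY DINOSAUR", "run_id": 1},
             {"film_id": 2, "title": "ACE GOLDFINGER", "run_id": 1}]
second_run = [{"film_id": 2, "title": "ACE GOLDFINGER (RENAMED)", "run_id": 2},
              {"film_id": 3, "title": "ADAPTATION HOLES", "run_id": 2}]

actions = [update_or_insert(conn, r) for r in first_run + second_run]
rows = conn.execute(
    "SELECT movie_key, film_id, title FROM dimMovie ORDER BY movie_key"
).fetchall()
print(actions)
print(rows)
```

Notice that film 2 keeps the same movie_key after its title changes; only its other columns are rewritten.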
Another question you might be asking yourself is, why don't we just delete data from dimMovie and reload the
whole thing?
Yet another good question. The answer is: foreign keys. Our primary key for dimMovie is an auto-increment
field called movie_key. This column will be used to link facts to this dimension. If we wipe out dimMovie each
time we run our load, we'll always have to reload all facts as well. This might be okay in the short term, but at
some point we may not want to reload everything in our data warehouse. Using "insert or update" means that
existing data does not get deleted, so a movie_key is preserved across runs. This is important even if the
underlying database does not enforce foreign key constraints.
Now that we have our job done, it's time to run it! Click on the run button at the top of the screen. If
everything goes well, you'll only see two lines of output:

OBSERVE:
Starting job ProcessDataWarehouse at 13:29 09/06/2009.
Job ProcessDataWarehouse ended at 13:32 09/06/2009. [exit code=0]
You can also check the database to see what has been logged for your run. Switch to a terminal, and log into your
personal database. Run the following command against your personal database:
CODE TO TYPE:
mysql> select * from etlRuns;
Your results won't be exactly the same, but they should look similar to the following:
OBSERVE:
mysql> select * from etlRuns;
+--------+---------------------+---------------------+
| run_id | start_time          | end_time            |
+--------+---------------------+---------------------+
|      1 | 2009-06-09 11:49:07 | 2009-06-09 11:49:12 |
+--------+---------------------+---------------------+
1 row in set (0.00 sec)
Next, check the etlLog table. Run the following command against your personal database:
CODE TO TYPE:
mysql> select * from etlLog;
You'll see some results similar to the following:

OBSERVE:
mysql> select * from etlLog;
+--------+---------------------+--------+------------+----------+------------+---------+----------------------+-------------------------+-------------+---------+----------+--------+--------------+---------+------+----------+-------+-----------+------------+
| run_id | moment              | pid    | father_pid | root_pid | system_pid | project | job                  | job_repository_id       | job_version | context | priority | origin | message_type | message | code | duration | count | reference | thresholds |
+--------+---------------------+--------+------------+----------+------------+---------+----------------------+-------------------------+-------------+---------+----------+--------+--------------+---------+------+----------+-------+-----------+------------+
|      1 | 2009-06-09 11:49:07 | GOrr6m | GOrr6m     | GOrr6m   |       8356 | DBA3    | ProcessDataWarehouse | _pzsw4GWKEd6GbtKHsp1gXA | 0.1         | Default |     NULL | NULL   | begin        | NULL    | NULL |     NULL |  NULL |      NULL |       NULL |
|      1 | 2009-06-09 11:49:12 | GOrr6m | GOrr6m     | GOrr6m   |       8356 | DBA3    | ProcessDataWarehouse | _pzsw4GWKEd6GbtKHsp1gXA | 0.1         | Default |     NULL | NULL   | end          | success | NULL |     5594 |  NULL |      NULL |       NULL |
+--------+---------------------+--------+------------+----------+------------+---------+----------------------+-------------------------+-------------+---------+----------+--------+--------------+---------+------+----------+-------+-----------+------------+
2 rows in set (0.00 sec)

These results show that our run was successful, and took 5594 milliseconds (about 5.6 seconds) to run. This is
corroborated by the etlRuns table, which shows the start and end times of the job. We're looking pretty good!
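If you'd like to double-check that the two tables agree, the arithmetic is easy to sketch in Python. The
values below are simply copied from the sample output above, and the one-second tolerance is an assumption
that accounts for etlRuns storing whole seconds.

```python
from datetime import datetime

# Values copied from the sample etlRuns and etlLog output.
start_time = datetime.strptime("2009-06-09 11:49:07", "%Y-%m-%d %H:%M:%S")
end_time = datetime.strptime("2009-06-09 11:49:12", "%Y-%m-%d %H:%M:%S")
logged_duration_ms = 5594

elapsed_seconds = (end_time - start_time).total_seconds()
logged_seconds = logged_duration_ms / 1000.0

# etlRuns stores whole seconds, so allow up to one second of rounding difference.
consistent = abs(elapsed_seconds - logged_seconds) < 1.0
print(elapsed_seconds, logged_seconds, consistent)
```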

Performance
When developing a software system, it is often best to start with a simple solution and move to a more complex
solution as development goes on.

Our movie dimension is very small - it only has 1000 rows. Since it is so small, it is acceptable to recreate this
dimension (along with other related facts) from scratch each day. This solution is simple and works well for small data
warehouses.
But what if our movie dimension had 500,000 rows in it? What if our source system was an ancient computer,
requiring 10 hours to extract movie data? Running a data load for 10 hours every day would not be a great option; even
if the rest of the warehouse processing only took a minute or two, there would only be 14 hours left for warehouse use.
Should the movie dimension grow to be 1,000,000 rows, the warehouse load might take 20 hours to complete!
When dimensions are large, it is necessary to add complexity to the warehouse in order to reduce load times.
The first step toward optimizing performance seems straightforward: only query the source system for new and
changed records. Our audit tables capture the date and time that the dimension was updated. That time stamp can be
used to select records in the source system. But this process is often more difficult than it initially seems.
Many source systems simply don't track enough data to make this query work. The film table in the sakila database
has a column called last_update which should get set when the row is created, and updated when the row changes.
But what if it had a column called date_created instead? How would we know when a row had changed?
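Here is a small Python sketch of the "only new and changed records" idea, under the assumption that every
source row carries a trustworthy last_update time stamp. The changed_since helper, the sample rows, and the
last_load_time value are made up for illustration; in a real job the filter would normally be a WHERE clause
in the source query.

```python
from datetime import datetime

def changed_since(rows, last_load_time, timestamp_column="last_update"):
    """Return only the rows whose time stamp is later than the previous load."""
    return [row for row in rows if row[timestamp_column] > last_load_time]

# Made-up sample of source rows.
film = [
    {"film_id": 1, "title": "ACADEMY DINOSAUR", "last_update": datetime(2009, 6, 1, 9, 0)},
    {"film_id": 2, "title": "ACE GOLDFINGER", "last_update": datetime(2009, 6, 9, 14, 30)},
    {"film_id": 3, "title": "ADAPTATION HOLES", "last_update": datetime(2009, 6, 10, 8, 15)},
]

# Hypothetical end time of the previous successful run, as an audit table might record it.
last_load_time = datetime(2009, 6, 9, 11, 49, 12)

to_process = changed_since(film, last_load_time)
print([row["film_id"] for row in to_process])
```

If the only time stamp available were a creation date, a later edit to film 1 would never pass this filter,
which is exactly the problem described above.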
In many situations you will have to make changes to source systems to make data warehouse loads easier to
manage. This might involve adding time stamp columns, modifying existing columns, or even creating completely new
tables. Keep this in mind as we move forward.
We've covered a lot in this lesson! Stay tuned - in the next lesson we'll continue working with our dimensions and
learn how to process slowly changing dimensions. See you then!

Copyright 1998-2013 O'Reilly Media, Inc.

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
See http://creativecommons.org/licenses/by-sa/3.0/legalcode for more information.

SCD Processing
DBA 3: Data Warehousing Lesson 9
Welcome back! In the last lesson we implemented our first dimension, dimMovie, as a Type 1 slowly changing dimension. In
this lesson we'll take a look at the remaining dimensions, and implement them as Type 2 slowly changing dimensions.

The Algorithm: Slowly Changing Dimensions


If you think back to the third lesson, we defined several types of slowly changing dimensions. They are:
Type 0 -- Do nothing; dimension records are not updated.
Type 1 -- No history is kept; dimension records are updated in-place.
Type 2 -- A full history is kept; dimension records have start and end dates to determine when they are valid.
Type 3 -- A limited history is kept; usually the most recent history of a column or two is kept in the same row
as current information.
Type 4 -- A full history is kept in a history table; the dimension table has current records.
We implemented a Type 1 dimension in the last lesson. In this lesson we'll cover Type 2.
Slowly changing dimensions (SCD) are really common, so TOS has a specific way to deal with them: the SCD family
of components. You'll need an open job to see the palette, so open your ProcessDataWarehouse job. Expand the
palette under Databases, then under MySQL, and you'll see this:

MySQL has two SCD components: tMysqlSCD and tMysqlSCDELT. In TOS, components that are named ELT are
processed on the database server itself. This can make the process much faster, because data doesn't have to travel
outside of the database server to be processed. The drawback to this process is that all data must be located on a
single database server - something that isn't often possible.
So how does the SCD component work for Type 2 dimensions? Like this:

OBSERVE:
For each row of incoming data:
1. Check whether the row exists within the dimension. If it does not, insert it and go on
   to the next row.
2. If the row already exists within the dimension, determine whether the specified Type 2
   columns have changed:
   a. If they have not changed, go on to the next row.
   b. If they have changed:
      - Insert a new row into the dimension, with the start date of today and the end date
        of 31-DEC-2099.
      - Update the prior row, setting the end date to today.

Note

Remember, some warehouses use NULL instead of 31-DEC-2099 as the end date.
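The steps above can be sketched in a few lines of Python. This is only an illustration of the algorithm on an
in-memory list of dictionaries, not the code inside the TOS SCD components. The apply_type2 function, its
parameters, and the sample customer rows (including the change of city) are made up for the example.

```python
from datetime import date

HIGH_DATE = date(2099, 12, 31)   # some warehouses use None (NULL) instead

def apply_type2(dimension, incoming, source_key, type2_columns, today):
    """Apply one batch of source rows to an in-memory Type 2 dimension (a list of dicts)."""
    for row in incoming:
        # The current version of a source key is the one that is still open-ended.
        current = next((d for d in dimension
                        if d[source_key] == row[source_key] and d["end_date"] == HIGH_DATE),
                       None)
        if current is None:
            # The row does not exist in the dimension yet: insert it.
            dimension.append(dict(row, start_date=today, end_date=HIGH_DATE))
        elif any(current[c] != row[c] for c in type2_columns):
            # A tracked column changed: close the prior row, then add a new version.
            current["end_date"] = today
            dimension.append(dict(row, start_date=today, end_date=HIGH_DATE))
        # Otherwise nothing changed; go on to the next row.
    return dimension

dim = []
apply_type2(dim, [{"customer_id": 218, "city": "Kabul"}],
            "customer_id", ["city"], today=date(2009, 6, 30))
apply_type2(dim, [{"customer_id": 218, "city": "Herat"}],   # hypothetical change of city
            "customer_id", ["city"], today=date(2009, 7, 15))
apply_type2(dim, [{"customer_id": 218, "city": "Herat"}],   # unchanged: no new version
            "customer_id", ["city"], today=date(2009, 7, 16))
for version in dim:
    print(version)
```

After the three loads the dimension holds two versions of customer 218: the original row, closed on the day
the change arrived, and the new row, open until the high end date.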

This algorithm isn't extremely difficult to implement; however, it would really stink if you had to implement it for each
dimension you wanted to process! Fortunately, it has been implemented, tested, and optimized for us already.
Type 3 and Type 4 SCDs are handled in almost the same way, except that the history is located elsewhere. In Type 3 it
is put into the same row; in Type 4 it is put into a different table.

Implementing the Dimensions


dimCustomer
Let's dive right in. First, drag a tMysqlInput component from the palette to your canvas. Set its database
connection to the sakila connection from the repository. For the query, we'll use this code (from lesson 5):
OBSERVE:
SELECT
c.customer_id, c.first_name, c.last_name, c.email,
a.address, a.address2, a.district,
ci.city,
co.country,
postal_code,
a.phone,c.active, c.create_date
FROM customer c
JOIN address a on (c.address_id = a.address_id)
JOIN city ci on (a.city_id = ci.city_id)
JOIN country co on (ci.country_id = co.country_id)
WHERE customer_id > 10;

Run your query to make sure you typed it in correctly. Then click on the button next to Edit Schema, and
enter these columns:
Column       Type    Nullable  Length
customer_id  int
first_name   String            45
last_name    String            45
email        String  Yes       50
address      String            50
address2     String  Yes       50
district     String            20
city         String            50
country      String            50
postal_code  String            10
phone        String            20
active       int
create_date  Date

Note

If your dimCustomer table is slightly different, then you'll need to change your schema. For
example, if your table allows NULLs for a column such as email, you should change the schema to allow
nulls in that column.

You might wonder why we aren't using the Guess Schema button to have TOS figure out the schema for us.
It seems like that would be fast and relatively easy. But in practice, the Guess Schema option only works well
with simple schemas and data types. That's because TOS examines the data that the database returns from
the query in order to make decisions on data types and lengths, and ignores the underlying data types set in
the database tables.
That might be okay for some queries, but it doesn't work for our dimCustomer query. In this query, the
address2 column currently has only NULL values. TOS can't determine a data type when there's no data. For other
columns, like postal_code, TOS sees only values like 90210, so TOS guesses that the data type is
Integer. We know that other parts of the world have different postal code formats ("SW1A 0AA" is a valid
postal code in the United Kingdom), so our dimCustomer table uses varchar as its data type, not integer.
It's fine to use Guess Schema as a starting point, but you still need to verify manually that the columns TOS
picks are of the correct data type, nullability, and length.
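To get a feel for why guessing from data goes wrong, here is a deliberately naive Python sketch of the idea.
It is not how TOS implements Guess Schema; the guess_type function and the sample values are assumptions
that simply mimic the behavior described above.

```python
def guess_type(values):
    """Guess a column type from sample values only, ignoring the real table definition."""
    present = [v for v in values if v is not None]
    if not present:
        return None                      # nothing but NULLs: no basis for a guess
    if all(str(v).isdigit() for v in present):
        return "Integer"
    return "String"

# Made-up sample of what a query might return.
sample = {
    "address2": [None, None, None],
    "postal_code": ["40301", "52625", "59025"],
    "city": ["Kabul", "Batna", "Bchar"],
}
guesses = {column: guess_type(values) for column, values in sample.items()}
print(guesses)

# A single UK-style code is enough to show why the column has to be a string.
print(guess_type(["40301", "SW1A 0AA"]))
```

The guess for postal_code is only as good as the rows that happen to be in the sample, which is why the
column definitions in the table itself are the safer reference.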
With the input out of the way, we are free to move on to the mapping. Drag a tMap component to the canvas,
and link the main row of the previous tMysqlInput component to tMap. Just like before, add a new output,
and add a column called run_id (type: integer, expression: context.run_id) to the output. Link every input
column to the output.
The last step for dimCustomer is to add a tMysqlSCD component to the canvas. Link the output of tMap to
the input of tMysqlSCD, and allow TOS to take the schema from the input component. If TOS does not ask
whether you want to use the input schema, click on Sync Columns.
Set the connection on tMysqlSCD to the data warehouse connection in the repository, and specify
dimCustomer for the table. Once that's done, click on the button next to SCD Editor.

In this new window you'll tell tMysqlSCD how to handle every column in the data flow. To set up the
component, drag a column from the Unused section to a different section. We'll start with Source Keys. Our
source key is a single column: customer_id. Drag that column to the Source Keys section. When you're
done, your screen will look like this:

Next, we'll set up our surrogate keys. Our surrogate key does not exist in the data flow. Instead, it will be
created by auto increment in MySQL. Name the surrogate key customer_key, and set its creation to Auto
increment. When you're finished, that section will look like this:

Now we'll specify our Type 0 columns. Since run_id is only included for auditing and logging, it
should never be considered part of the dimension. This value changes at each run, so we never want to track
changes on it. Drag run_id to the Type 0 fields section. That section will now look like this:

We are not particularly interested in tracking changes to our customers' names. And if a customer's
create_date changes, it probably means the source system had an error, and the current record is being fixed,
so we don't want to track changes on that either. Drag first_name, last_name and create_date to the Type
1 fields section. That section will now look like this:

Let's check out Type 2 changes now. Type 2 changes require additional configuration because there are
different ways to track history within the same table (if you'd like to review Type 2 changes, refer back to lesson
3).
Drag the remaining columns from Unused to the Type 2 fields section. Then we need to tell TOS how we are
keeping history. In this dimension, we are using a start and end date, but we are not using a version number
column or an active flag. Rename the start column start_date, and set its creation to Job start time.
Rename the end column end_date, change its creation to Fixed year value, and set its complement to
2099. After you have made these changes, the Type 2 fields section will look like this:

The last section is for Type 3 changes. Here you specify the Type 3 columns and note the corresponding
history columns.
This dimension doesn't have any Type 3 fields, so it's left blank:

Click "OK" to close the SCD Component editor, then save your changes. When you're done, your
dimCustomer sub job should look like this:

Does our SCD work?


At this point, we need to test our sub job to see if dimCustomer is receiving data. We don't need to run our
entire job though, because we're really only interested in dimCustomer. Fortunately TOS lets us disable sub
jobs.
To do that, right-click on the tMysqlInput component for the dimMovie sub job, then choose Deactivate
current sub job.

The sub job for dimMovie is now greyed-out and disabled.

Now run your job by clicking on the run button. As long as you've typed everything correctly, your job will run and
you'll see the following output:

OBSERVE:
Starting job ProcessDataWarehouse at 21:08 30/06/2009.
Job ProcessDataWarehouse ended at 21:08 30/06/2009. [exit code=0]
The job ran, but did it populate the dimCustomer table? Switch to a terminal, and log into your personal
database. Run the following command against your personal database:
CODE TO TYPE:
mysql> SELECT count(*) FROM dimCustomer;
If your job ran successfully, you will see this:
OBSERVE:
mysql> SELECT count(*) FROM dimCustomer;
+----------+
| count(*) |
+----------+
|      589 |
+----------+
1 row in set (0.00 sec)
To be sure everything worked, take a look at some rows. Run the following command against your personal
database:
CODE TO TYPE:
mysql> SELECT * FROM dimCustomer
LIMIT 0, 10;

OBSERVE:
mysql> SELECT * FROM dimCustomer
    -> LIMIT 0, 10;
+--------------+-------------+------------+-----------+-------------------------------------+-------------------------------+----------+--------------+-----------------+----------------+-------------+--------------+--------+---------------------+------------+------------+--------+
| customer_key | customer_id | first_name | last_name | email                               | address                       | address2 | district     | city            | country        | postal_code | phone        | active | create_date         | start_date | end_date   | run_id |
+--------------+-------------+------------+-----------+-------------------------------------+-------------------------------+----------+--------------+-----------------+----------------+-------------+--------------+--------+---------------------+------------+------------+--------+
|            1 |         218 | VERA       | MCCOY     | VERA.MCCOY@sakilacustomer.org       | 1168 Najafabad Parkway        |          | KABOL        | Kabul           | Afghanistan    | 40301       | 886649065861 |      1 | 2004-03-19 00:00:00 | 2009-06-30 | 2099-01-01 |     81 |
|            2 |         441 | MARIO      | CHEATHAM  | MARIO.CHEATHAM@sakilacustomer.org   | 1924 Shimonoseki Drive        |          | BATNA        | Batna           | Algeria        | 52625       | 406784385440 |      1 | 2004-10-07 00:00:00 | 2009-06-30 | 2099-01-01 |     81 |
|            3 |          69 | JUDY       | GRAY      | JUDY.GRAY@sakilacustomer.org        | 1031 Daugavpils Parkway       |          | BCHAR        | Bchar           | Algeria        | 59025       | 107137400143 |      1 | 2004-02-25 00:00:00 | 2009-06-30 | 2099-01-01 |     81 |
|            4 |         176 | JUNE       | CARROLL   | JUNE.CARROLL@sakilacustomer.org     | 757 Rustenburg Avenue         |          | SKIKDA       | Skikda          | Algeria        | 89668       | 506134035434 |      1 | 2004-08-11 00:00:00 | 2009-06-30 | 2099-01-01 |     81 |
|            5 |         320 | ANTHONY    | SCHWAB    | ANTHONY.SCHWAB@sakilacustomer.org   | 1892 Nabereznyje Telny Lane   |          | TUTUILA      | Tafuna          | American Samoa | 28396       | 478229987054 |      1 | 2004-07-20 00:00:00 | 2009-06-30 | 2099-01-01 |     81 |
|            6 |         528 | CLAUDE     | HERZOG    | CLAUDE.HERZOG@sakilacustomer.org    | 486 Ondo Parkway              |          | BENGUELA     | Benguela        | Angola         | 35202       | 105882218332 |      1 | 2004-01-24 00:00:00 | 2009-06-30 | 2099-01-01 |     81 |
|            7 |         383 | MARTIN     | BALES     | MARTIN.BALES@sakilacustomer.org     | 368 Hunuco Boulevard          |          | NAMIBE       | Namibe          | Angola         | 17165       | 106439158941 |      1 | 2004-05-31 00:00:00 | 2009-06-30 | 2099-01-01 |     81 |
|            8 |         381 | BOBBY      | BOUDREAU  | BOBBY.BOUDREAU@sakilacustomer.org   | 1368 Maracabo Boulevard       |          |              | South Hill      | Anguilla       | 32716       | 934352415130 |      1 | 2004-08-29 00:00:00 | 2009-06-30 | 2099-01-01 |     81 |
|            9 |         359 | WILLIE     | MARKHAM   | WILLIE.MARKHAM@sakilacustomer.org   | 1623 Kingstown Drive          |          | BUENOS AIRES | Almirante Brown | Argentina      | 91299       | 296394569728 |      1 | 2004-08-13 00:00:00 | 2009-06-30 | 2099-01-01 |     81 |
|           10 |         560 | JORDAN     | ARCHULETA | JORDAN.ARCHULETA@sakilacustomer.org | 1229 Varanasi (Benares) Manor |          | BUENOS AIRES | Avellaneda      | Argentina      | 40195       | 817740355461 |      1 | 2004-01-15 00:00:00 | 2009-06-30 | 2099-01-01 |     81 |
+--------------+-------------+------------+-----------+-------------------------------------+-------------------------------+----------+--------------+-----------------+----------------+-------------+--------------+--------+---------------------+------------+------------+--------+
10 rows in set (0.01 sec)
If you scroll to the right, you might notice something a bit strange. Check out the create_date and
start_date (copied below):

OBSERVE:
+--------+---------------------+------------+------------+--------+
| active | create_date         | start_date | end_date   | run_id |
+--------+---------------------+------------+------------+--------+
|      1 | 2004-03-19 00:00:00 | 2009-06-30 | 2099-01-01 |     81 |
|      1 | 2004-10-07 00:00:00 | 2009-06-30 | 2099-01-01 |     81 |
|      1 | 2004-02-25 00:00:00 | 2009-06-30 | 2099-01-01 |     81 |
|      1 | 2004-08-11 00:00:00 | 2009-06-30 | 2099-01-01 |     81 |
|      1 | 2004-07-20 00:00:00 | 2009-06-30 | 2099-01-01 |     81 |
|      1 | 2004-01-24 00:00:00 | 2009-06-30 | 2099-01-01 |     81 |
|      1 | 2004-05-31 00:00:00 | 2009-06-30 | 2099-01-01 |     81 |
|      1 | 2004-08-29 00:00:00 | 2009-06-30 | 2099-01-01 |     81 |
|      1 | 2004-08-13 00:00:00 | 2009-06-30 | 2099-01-01 |     81 |
|      1 | 2004-01-15 00:00:00 | 2009-06-30 | 2099-01-01 |     81 |
+--------+---------------------+------------+------------+--------+
The first customer's record was created back on March 19th, 2004, but the row in the dimension has a
start_date of today (2009-06-30).
The start_date and end_date columns on this Type 2 SCD indicate that "this row is valid and correct
between the dates of June 30th, 2009 and January 1st, 2099."
The customer with customer_id=218 existed on January 1st, 2007, but that's not what the row tells us
now.
You may recall that earlier in the lesson, we set the properties for the SCD component. Specifically, we
renamed the start column to start_date, and set its creation to Job start time.

The problem here has to do with historical data. The very first time we set up our dimension, the start_date
for each row must be the first date on which the row was valid, not today's date. Graphically, the time line after the initial load looks like this:

We need to fix this date, so each row has a valid start date. Graphically, the time line looks like this:

Fortunately, our source system contains the date column we need: create_date.


To apply this fix, we could just switch to SQL mode and run an update statement. This isn't an ideal
solution though, because we might forget to include this step the next time we have to reload the entire
dimension.
TOS has a component that will let us execute a single query: tMysqlRow. Drag that component to your
canvas, and name the new subjob On Initial Load Only. Set its connection to the data warehouse
connection from the repository, and set its command to this (make sure the quotation marks around the
query are correct!):
CODE TO TYPE:
UPDATE dimCustomer SET start_date = create_date;

Next, right-click on tMysqlSCD, and link the On Component OK trigger to tMysqlRow:

After our initial load, we'll disable this subjob.
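To see the effect of the update in isolation, here is a small sketch that again uses Python's sqlite3 module in place of MySQL. The two sample rows come from the output above; the function names are hypothetical. SQLite stores these values as text, so the sketch calls date() to drop the time portion, which a MySQL DATE column does on assignment.

```python
import sqlite3


def fix_initial_start_dates(conn):
    """Initial load only: backdate every row to the customer's create date."""
    # SQLite stand-in for: UPDATE dimCustomer SET start_date = create_date;
    conn.execute("UPDATE dimCustomer SET start_date = date(create_date)")
    conn.commit()


def build_demo():
    """Two rows as they look right after the first load."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dimCustomer (customer_key INTEGER PRIMARY KEY, "
        "create_date TEXT, start_date TEXT, end_date TEXT)"
    )
    conn.executemany(
        "INSERT INTO dimCustomer VALUES (?, ?, ?, ?)",
        [
            (1, "2004-03-19 00:00:00", "2009-06-30", "2099-01-01"),
            (2, "2004-10-07 00:00:00", "2009-06-30", "2099-01-01"),
        ],
    )
    return conn


def start_dates(conn):
    """Return start_date values in customer_key order."""
    query = "SELECT start_date FROM dimCustomer ORDER BY customer_key"
    return [row[0] for row in conn.execute(query)]


conn = build_demo()
fix_initial_start_dates(conn)
print(start_dates(conn))  # prints ['2004-03-19', '2004-10-07']
```

Running the update on every load would overwrite the start dates of rows created by later changes, which is why this step belongs to the initial load only.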


So, what if your data source doesn't have a valid start_date? In that case, you'll have to figure out the
earliest date where valid data is present in your dimension, perhaps January 1st, 2000. Generally, you'll
use Job Start time as your start_date, but after the first load of your dimension, you'll have to update the
start_date of every row in your table to January 1st, 2000. Here's an example of a query to update a
dimension without a valid start date:
OBSERVE:
UPDATE dimCustomer SET start_date='2000-01-01';
Picking the earliest valid date for your dimension is similar to picking a high end_date of January 1st, 2099.
Before we rerun our job, we should completely clear our dimension. To do this, we'll use the SQL keyword
TRUNCATE. Run this command against your personal database:

CODE TO TYPE:
mysql> TRUNCATE TABLE dimCustomer;
When the command executes, you'll see Query OK, 0 rows affected (0.00 sec), even though the table
is now empty.
Rerun the job. When it completes, switch back to the terminal, and run this command against your personal
database:
CODE TO TYPE:
mysql> SELECT count(*) FROM dimCustomer;
There should still be 589 rows in the table:
OBSERVE:
mysql> SELECT count(*) FROM dimCustomer;
+----------+
| count(*) |
+----------+
|      589 |
+----------+
1 row in set (0.00 sec)
Check the create and start date next. Run this command against your personal database:
CODE TO TYPE:
mysql> SELECT customer_key, first_name, last_name, address, district, city,
country, create_date, start_date, end_date FROM dimCustomer
LIMIT 0, 10;
The results look much better:

OBSERVE:
mysql> SELECT customer_key, first_name, last_name, address, district, city,
    -> country, create_date, start_date, end_date FROM dimCustomer
    -> LIMIT 0, 10;
+--------------+------------+-----------+-------------------------------+--------------+-----------------+----------------+---------------------+------------+------------+
| customer_key | first_name | last_name | address                       | district     | city            | country        | create_date         | start_date | end_date   |
+--------------+------------+-----------+-------------------------------+--------------+-----------------+----------------+---------------------+------------+------------+
|            1 | VERA       | MCCOY     | 1168 Najafabad Parkway        | KABOL        | Kabul           | Afghanistan    | 2004-03-19 00:00:00 | 2004-03-19 | 2099-01-01 |
|            2 | MARIO      | CHEATHAM  | 1924 Shimonoseki Drive        | BATNA        | Batna           | Algeria        | 2004-10-07 00:00:00 | 2004-10-07 | 2099-01-01 |
|            3 | JUDY       | GRAY      | 1031 Daugavpils Parkway       | BCHAR        | Bchar           | Algeria        | 2004-02-25 00:00:00 | 2004-02-25 | 2099-01-01 |
|            4 | JUNE       | CARROLL   | 757 Rustenburg Avenue         | SKIKDA       | Skikda          | Algeria        | 2004-08-11 00:00:00 | 2004-08-11 | 2099-01-01 |
|            5 | ANTHONY    | SCHWAB    | 1892 Nabereznyje Telny Lane   | TUTUILA      | Tafuna          | American Samoa | 2004-07-20 00:00:00 | 2004-07-20 | 2099-01-01 |
|            6 | CLAUDE     | HERZOG    | 486 Ondo Parkway              | BENGUELA     | Benguela        | Angola         | 2004-01-24 00:00:00 | 2004-01-24 | 2099-01-01 |
|            7 | MARTIN     | BALES     | 368 Hunuco Boulevard          | NAMIBE       | Namibe          | Angola         | 2004-05-31 00:00:00 | 2004-05-31 | 2099-01-01 |
|            8 | BOBBY      | BOUDREAU  | 1368 Maracabo Boulevard       |              | South Hill      | Anguilla       | 2004-08-29 00:00:00 | 2004-08-29 | 2099-01-01 |
|            9 | WILLIE     | MARKHAM   | 1623 Kingstown Drive          | BUENOS AIRES | Almirante Brown | Argentina      | 2004-08-13 00:00:00 | 2004-08-13 | 2099-01-01 |
|           10 | JORDAN     | ARCHULETA | 1229 Varanasi (Benares) Manor | BUENOS AIRES | Avellaneda      | Argentina      | 2004-01-15 00:00:00 | 2004-01-15 | 2099-01-01 |
+--------------+------------+-----------+-------------------------------+--------------+-----------------+----------------+---------------------+------------+------------+
10 rows in set (0.00 sec)
Now that the start date is looking good, we can disable the subjob that updates start_date. Right-click on
tMysqlRow and choose Deactivate current subjob:

We have data in dimCustomer, but does our dimension really track history?


Take a look at the first row of data returned from our previous query:
OBSERVE:
|            1 | VERA       | MCCOY     | 1168 Najafabad Parkway        | KABOL        | Kabul           | Afghanistan    | 2004-03-19 00:00:00 | 2004-03-19 | 2099-01-01 |

The FIRST NAME and LAST NAME are all in UPPERCASE letters, but the address, district, city, and country are
in Mixed Case letters. Let's alter our tMap component so that the address, district, city, and country are in
UPPERCASE letters as well. The next time we run our job, each row should change. That's how we'll be able
to tell if our SCD component is working correctly or not.
Double-click on the tMap component for the dimCustomer subjob. When the map screen opens, select the
address column on the output, and click on the button to open the expression editor window.

TOS has a built-in function for making a string uppercase. It's located in the StringHandling category, and is
called UPCASE. Set the expression for address so it looks like this:
CODE TO TYPE:
StringHandling.UPCASE(row6.address)
Your input may not be called row6; make sure to use its existing name.

Click OK to close the expression builder. Make similar changes for the other columns: address2, city,
country, and district. When you are done, tMap will look something like this:

Run your job again. As long as you've typed everything correctly, your job will run and
you'll see the following output:

OBSERVE:
Starting job ProcessDataWarehouse at 21:58 30/06/2009.
Job ProcessDataWarehouse ended at 22:02 30/06/2009. [exit code=0]
Switch to SQL mode. We'll inspect a single record (customer_id = 218) from our last SQL query to see
whether it changes. Run this command against your personal database:

CODE TO TYPE:
mysql> SELECT * FROM dimCustomer
WHERE customer_id=218
ORDER BY customer_key;
If you typed everything correctly, you'll see this:
OBSERVE:
mysql> SELECT * FROM dimCustomer
-> WHERE customer_id=218
-> ORDER BY customer_key;
+--------------+-------------+------------+-----------+-------------------------------+------------------------+----------+----------+-------+-------------+-------------+--------------+--------+---------------------+------------+------------+--------+
| customer_key | customer_id | first_name | last_name | email                         | address                | address2 | district | city  | country     | postal_code | phone        | active | create_date         | start_date | end_date   | run_id |
+--------------+-------------+------------+-----------+-------------------------------+------------------------+----------+----------+-------+-------------+-------------+--------------+--------+---------------------+------------+------------+--------+
|            1 |         218 | VERA       | MCCOY     | VERA.MCCOY@sakilacustomer.org | 1168 Najafabad Parkway |          | KABOL    | Kabul | Afghanistan | 40301       | 886649065861 |      1 | 2004-03-19 00:00:00 | 2004-03-19 | 2009-06-30 |     82 |
|          590 |         218 | VERA       | MCCOY     | VERA.MCCOY@sakilacustomer.org | 1168 NAJAFABAD PARKWAY |          | KABOL    | KABUL | AFGHANISTAN | 40301       | 886649065861 |      1 | 2004-03-19 00:00:00 | 2009-06-30 | 2099-01-01 |     83 |
+--------------+-------------+------------+-----------+-------------------------------+------------------------+----------+----------+-------+-------------+-------------+--------------+--------+---------------------+------------+------------+--------+
2 rows in set (0.01 sec)
Sure enough, the changes were recorded!
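The two rows above show what a Type 2 change amounts to: the old version is closed out by setting its end_date, and a new version is inserted with the high end date. The following sketch mimics that behavior with Python's sqlite3 module. It is not the code TOS generates; the function apply_type2_change, the trimmed table, and the choice to track only the address column are assumptions made for illustration.

```python
import sqlite3

HIGH_DATE = "2099-01-01"  # the fixed high end date used in this lesson


def apply_type2_change(conn, customer_id, new_address, today, run_id):
    """Expire the current row and insert a new version if the address changed."""
    current = conn.execute(
        "SELECT customer_key, address FROM dimCustomer "
        "WHERE customer_id = ? AND end_date = ?",
        (customer_id, HIGH_DATE),
    ).fetchone()
    if current is not None and current[1] == new_address:
        return False  # nothing changed, keep the current row
    if current is not None:
        # Close out the old version as of today.
        conn.execute(
            "UPDATE dimCustomer SET end_date = ? WHERE customer_key = ?",
            (today, current[0]),
        )
    # Insert the new version, valid from today until the high date.
    conn.execute(
        "INSERT INTO dimCustomer "
        "(customer_id, address, start_date, end_date, run_id) "
        "VALUES (?, ?, ?, ?, ?)",
        (customer_id, new_address, today, HIGH_DATE, run_id),
    )
    conn.commit()
    return True


def build_demo():
    """One current row for customer 218, as left by the initial load."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dimCustomer ("
        "customer_key INTEGER PRIMARY KEY AUTOINCREMENT, customer_id INTEGER, "
        "address TEXT, start_date TEXT, end_date TEXT, run_id INTEGER)"
    )
    conn.execute(
        "INSERT INTO dimCustomer "
        "(customer_id, address, start_date, end_date, run_id) "
        "VALUES (218, '1168 Najafabad Parkway', '2004-03-19', ?, 82)",
        (HIGH_DATE,),
    )
    return conn


conn = build_demo()
apply_type2_change(conn, 218, "1168 NAJAFABAD PARKWAY", "2009-06-30", 83)
for row in conn.execute("SELECT * FROM dimCustomer ORDER BY customer_key"):
    print(row)
```

The surrogate key of the new row differs from the warehouse output (2 here, 590 there) because the demo table holds a single customer.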
Since our test is over, we'll reload our dimension to get it back to a good starting point. Switch to SQL mode
and run the following query:
CODE TO TYPE:
TRUNCATE TABLE dimCustomer;
When the command executes, you'll see Query OK, 0 rows affected (0.00 sec). In TOS, enable the
Initial Load Only subjob, then run the job. Once it completes, disable the Initial Load Only subjob, and
right-click on tMysqlInput to disable the dimCustomer subjob.
We're in great shape!

dimStore
The subjob for dimStore is nearly identical to dimCustomer. Just like before, drag a tMysqlInput
component to your canvas. Configure it to use the sakila database. For the query, we'll use this code (from
lesson 5):

OBSERVE:
SELECT s.store_id, a.address, a.address2, a.district,
c.city, co.country, a.postal_code, s.region,
st.first_name as manager_first_name,
st.last_name as manager_last_name
FROM
store s
JOIN staff st on (s.manager_staff_id = st.staff_id)
JOIN address a on (s.address_id = a.address_id)
JOIN city c on (a.city_id = c.city_id)
JOIN country co on (c.country_id = co.country_id)

Once again, be sure to run your query to make sure you typed it in correctly. Once that's done, click on the
button next to Edit Schema, then enter these columns:

Column              Type    Nullable  Length
store_id            int
address             String            50
address2            String  Yes       50
district            String            20
city                String            50
country             String            50
postal_code         String            10
region              String            20
manager_first_name  String            45
manager_last_name   String            45

Let's move on to the mapping. Drag a tMap component to the canvas, and link the main row of the previous
tMysqlInput component to tMap. Add a new output and add a column called run_id (type: integer,
expression: context.run_id) to the output. Link every input column to the output.
Add a tMysqlSCD component to the canvas. Link the output of tMap to the input of tMysqlSCD, and allow
TOS to take the schema from the input component.
Set the connection on tMysqlSCD to the data warehouse connection in the repository, and specify dimStore
for the table. Once that's done, click on the button next to SCD Editor.

Set up these SCD columns:


Section         Columns
Source keys     store_id
Surrogate keys  store_key - creation Auto Increment
Type 0 fields   run_id
Type 1 fields
Type 2 fields   Use all remaining fields. Rename the start column start_date, and set its creation to Job
                start time. Rename the end column end_date, change its creation to Fixed year value,
                and set its complement to 2099.
Type 3 fields
Just like we did in dimCustomer, we'll have to run a special update statement the first time we load our
dimension here. Let's make life a little easier and reuse our work; copy the subjob we created for
dimCustomer. Right-click on the title Initial Load Only, and choose Copy:

Once copied, right-click on your canvas and choose Paste. The whole subjob should be pasted on your
canvas, but it is probably not next to dimStore. Move it so it is next to dimStore:

Enable the subjob you just pasted, then link the On Component OK trigger from tMysqlSCD to it:

Our source data does not give us a valid start date, so we'll have to come up with one. Let's suppose our
business users informed us that we could use January 1st, 2000 as a valid start date. Click on tMysqlRow,
and change the query so it looks like this:
CODE TO TYPE:
UPDATE dimStore SET start_date = '2000-01-01';
Save and run your job. If everything went all right, you'll see this familiar output:
OBSERVE:
Starting job ProcessDataWarehouse at 13:42 16/12/2008.
Job ProcessDataWarehouse ended at 13:42 16/12/2008. [exit code=0]
Let's take a look at the dimStore table. Switch to your terminal and run the following command against your
personal database:
CODE TO TYPE:
mysql> SELECT * from dimStore;
If your load ran properly, you'll see the following data:

OBSERVE:
mysql> SELECT * from dimStore;
+-----------+----------+--------------------+----------+----------+------------+-----------+-------------+--------+--------------------+-------------------+------------+------------+--------+
| store_key | store_id | address            | address2 | district | city       | country   | postal_code | region | manager_first_name | manager_last_name | start_date | end_date   | run_id |
+-----------+----------+--------------------+----------+----------+------------+-----------+-------------+--------+--------------------+-------------------+------------+------------+--------+
|         1 |        1 | 47 MySakila Drive  | NULL     | Alberta  | Lethbridge | Canada    |             | West   | Mike               | Hillyer           | 2000-01-01 | 2099-01-01 |     85 |
|         2 |        2 | 28 MySQL Boulevard | NULL     | QLD      | Woodridge  | Australia |             | East   | Jon                | Stephens          | 2000-01-01 | 2099-01-01 |     85 |
+-----------+----------+--------------------+----------+----------+------------+-----------+-------------+--------+--------------------+-------------------+------------+------------+--------+
2 rows in set (0.00 sec)
This looks great! Since our initial load is complete, right-click on tMysqlRow in Initial Load Only and
deactivate the current subjob.
Since our dimensions are working well now, we can reactivate all of the subjobs we disabled earlier. Right-click on the
tMysqlInput of the dimMovie job, then choose Activate current subjob:

Do the same for dimCustomer, but be sure to leave the Initial Load Only job disabled.
Wow. We covered a lot in this lesson! In the next lesson we'll combine our dimensions with facts to form our complete data
warehouse. See you there!

Copyright 1998-2013 O'Reilly Media, Inc.

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
See http://creativecommons.org/licenses/by-sa/3.0/legalcode for more information.

Processing Facts, Part I


DBA 3: Data Warehousing Lesson 10
Good to have you back. In the last lesson we implemented our dimensions. Now we'll move on to facts. Let's get started!

Orchestration
Most machines have at least two processor cores to optimize performance. TOS is built to take advantage of that, so
unless we specify otherwise, TOS may execute our subjobs in parallel. That means our dimensions may be loaded at
the same time. For now, we want to keep our warehouse processes simple and we don't want all of our subjobs
to execute at the same time. And we definitely want our dimensions loaded before we even consider loading our facts.

Note

Your components' names will likely differ from the examples here. That's fine.

The first component in each subjob is the "master" component - it can be used to trigger execution of another subjob
after the running subjob is complete. We will use it to link our dimension subjobs together, and eventually to link to
our fact subjobs.
To start, right-click on the tMysqlInput component for dimMovie. Select Trigger and then select On Subjob OK:

You'll notice the cursor change. Drag the connection to the tMysqlInput component for dimCustomer, and drop the
connection:

Do this same process to link dimCustomer to dimStore, and dimStore to dimStaff. Your job will look like this:

Good work so far.
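The chain of On Subjob OK triggers behaves like a sequential pipeline: each subjob starts only after the previous one finishes without error. The sketch below models that idea in plain Python. It is an analogy rather than anything TOS produces, and run_in_order and the stand-in load functions are hypothetical names.

```python
def run_in_order(subjobs):
    """Run (name, callable) pairs one at a time; stop at the first failure."""
    completed = []
    for name, step in subjobs:
        try:
            step()
        except Exception:
            # A failed subjob never fires its "OK" trigger,
            # so nothing downstream of it runs.
            break
        completed.append(name)
    return completed


def load_ok():
    """Stand-in for a dimension load that succeeds."""


def load_fails():
    """Stand-in for a dimension load that raises an error."""
    raise RuntimeError("load failed")


chain = [
    ("dimMovie", load_ok),
    ("dimCustomer", load_ok),
    ("dimStore", load_fails),
    ("dimStaff", load_ok),
]
print(run_in_order(chain))  # prints ['dimMovie', 'dimCustomer']
```

Because dimStore fails in this example, dimStaff never runs, and any fact subjob linked after it would not run either.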

factCustomerCount
The algorithm for populating a fact is fairly short. In English, it might read like this: "For each row, look up the keys
for each dimension, respecting start and end dates for the row and the dimension. When the keys have
been found, insert the row into the fact table."
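Read literally, the algorithm is a loop with one lookup per dimension. Here is a row-at-a-time sketch in plain Python, with dimensions held as lists of dictionaries. The data structures, the function names, and the date_key value of 10 are invented for illustration; the customer rows are taken from the SCD test in the previous lesson.

```python
from datetime import date


def lookup_key(dimension_rows, natural_id, on_day):
    """Return the surrogate key of the version valid on `on_day`."""
    for row in dimension_rows:
        # A version is valid from start (inclusive) to end (exclusive).
        if row["id"] == natural_id and row["start"] <= on_day < row["end"]:
            return row["key"]
    raise LookupError(f"no dimension row for id={natural_id} on {on_day}")


def build_fact_row(source_row, date_keys, customers, stores):
    """Look up every dimension key for one source row."""
    day = source_row["create_date"]
    return {
        "date_key": date_keys[day],
        "customer_key": lookup_key(customers, source_row["customer_id"], day),
        "store_key": lookup_key(stores, source_row["store_id"], day),
        "customer_count": 1,
    }


HIGH = date(2099, 1, 1)
customers = [
    {"key": 1, "id": 218, "start": date(2004, 3, 19), "end": date(2009, 6, 30)},
    {"key": 590, "id": 218, "start": date(2009, 6, 30), "end": HIGH},
]
stores = [{"key": 1, "id": 1, "start": date(2000, 1, 1), "end": HIGH}]
date_keys = {date(2004, 3, 19): 10}  # hypothetical date_key value
source_row = {"customer_id": 218, "store_id": 1, "create_date": date(2004, 3, 19)}

print(build_fact_row(source_row, date_keys, customers, stores))
```

The lookup raises an error when no version matches, which corresponds to the failed lookup this lesson sets aside for later.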

Note

For this lesson we will assume that we can successfully look up each dimension key. In a future lesson
we will study what might happen if a lookup fails.

This algorithm is fairly straightforward, but selecting the location and method used to do the lookup can have some
pretty drastic performance implications. The lookup is just like a database join - you specify how data is related in
order to produce a combined result.
We have two options for location:
1. the computer you are using to develop and run TOS jobs
2. the MySQL server
If you pick option 1, essentially you will move data from your source and your data warehouse to your computer, then
send it back to your data warehouse. This process is resource intensive and as such, isn't usually the preferred option.
If you pick option 2, you need to move data from your source to your data warehouse and place it in some temporary
location (called a staging table), then let the database server process the data. This takes some time, but is still usually
much faster than option 1. And to make things just a bit more difficult, most ETL tools are not capable of processing
data this way, leaving you (the developer) to write a whole mess of SQL to make this happen. Fortunately, TOS has
special ELT components that can be used to make the process a bit easier.

We also have two options for method:

1. Write SQL (perhaps using stored procedures) to look up and store foreign keys.
2. Use TOS to retrieve and join foreign keys.
To process our customer count fact, we will perform these steps:
1. Create a staging table called stageFactCustomerCount, with no indexes (or have TOS create the table for us).
2. Move data from the source system into the staging table.
3. Create indexes on the staging table.
4. Use the ELT family of components in TOS to populate factCustomerCount.
You are probably asking yourself, "Why do I create a table without any indexes, just to add them later?"
Great question! If you recall back in DBA 2, indexes are a great way to improve database query performance.
Unfortunately, using indexes means that each INSERT statement takes longer to execute, because the database must
do extra work to populate indexes. In other words, it usually takes more time for the database to insert to a table with
indexes than it does to insert to a "blank" table first and add indexes to the table later.
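The order of operations, load first and index afterwards, can be sketched as follows with Python's sqlite3 module standing in for MySQL. The sketch shows the sequence only and does not measure the speed difference; the function name, the index names, and the second sample row are made up for this example.

```python
import sqlite3


def load_then_index(rows):
    """Bulk-load an unindexed staging table, then add the indexes."""
    conn = sqlite3.connect(":memory:")
    # Step 1: create the staging table with no indexes.
    conn.execute(
        "CREATE TABLE stageFactCustomerCount "
        "(customer_id INTEGER, store_id INTEGER, create_date TEXT, run_id INTEGER)"
    )
    # Step 2: load all rows; no index has to be maintained during the inserts.
    conn.executemany("INSERT INTO stageFactCustomerCount VALUES (?, ?, ?, ?)", rows)
    # Step 3: build each index once, over the complete data set.
    for column in ("create_date", "customer_id", "store_id"):
        conn.execute(
            f"CREATE INDEX idx_stage_{column} ON stageFactCustomerCount ({column})"
        )
    conn.commit()
    return conn


conn = load_then_index([
    (218, 1, "2004-03-19 00:00:00", 85),
    (441, 1, "2004-10-07 00:00:00", 85),  # hypothetical second customer
])
print(conn.execute("SELECT COUNT(*) FROM stageFactCustomerCount").fetchone()[0])
```

The three indexed columns are the ones the later joins use: create_date, customer_id, and store_id.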
In lesson 4 we implemented the table for factCustomerCount. Let's review its structure by reviewing the CREATE
TABLE statement we used back then:
OBSERVE:
CREATE TABLE factCustomerCount
(
customerCount_key INT NOT NULL AUTO_INCREMENT,
date_key          INT NOT NULL REFERENCES dimDate,
customer_key      INT NOT NULL REFERENCES dimCustomer,
store_key         INT NOT NULL REFERENCES dimStore,
customer_count    INT NOT NULL DEFAULT 1,
PRIMARY KEY (customerCount_key)
);
A single row in this table represents a specific customer who created an account, on a specific day, at a specific store.
This is a factless fact, so we don't have any numeric values in our table. To make queries easier for others, we included
customer_count, which has a default value of 1.
Let's get started by populating our staging table, stageFactCustomerCount. Drag three components to your canvas:
tMysqlInput, tMap and tMysqlOutput.
Edit the properties for tMysqlInput, setting its connection to the sakila database. Then edit its query. Use this query to
retrieve customer data:
CODE TO TYPE:
select customer_id, store_id, create_date FROM customer;
Run the query to make sure you typed everything correctly.
Now let's specify the schema for tMysqlInput. For this query, we can use the Guess Schema button to do most of
the work. Go ahead and click on it.
TOS identifies most of the columns without difficulty, except for one: create_date. create_date isn't a string value,
it's a Date. Fix the schema, and it will look like this:

With that specified, link tMysqlInput to tMap, then open tMap to edit its connections.
We will use this tMap just like we used the map components for dimension processing. We'll add our auditing column
called run_id. Next add an output and then the run_id column. Finally, drag all of the input columns to the output.
When you are done, your tMap should look like this:

Next, connect tMap to tMysqlOutput. Set the connection properties on tMysqlOutput from the repository to the
data warehouse connection, and set the table to "stageFactCustomerCount" (in quotation marks).
We could manually create the stageFactCustomerCount table, but then we would have to delete its data and drop
its indexes before we populate it with data. Instead of doing that, we can have the tMysqlOutput component Drop
table if exists and create, which accomplishes the same thing. To do this, set the action on table to Drop table if
exists and create.
Now let's specify when we want to execute our subjob. Right-click on the tMysqlInput component of dimStaff, and
connect the On Subjob OK trigger to the tMysqlInput of the stageFactCustomerCount subjob.
When you are done, your job will look something like this:

Now we can begin adding indexes to stageFactCustomerCount and clearing data from factCustomerCount.
The tMysqlRow component in TOS can only execute one statement, but we need to execute three ALTER TABLE
statements and one TRUNCATE TABLE statement. We could include several tMysqlRow components, or write a single
stored procedure that executes everything we need, using just one tMysqlRow component. We'll use a stored
procedure to simplify things.
Switch to the terminal, and run the following command against your personal database:
CODE TO TYPE:
DELIMITER //
CREATE PROCEDURE etl_preFactCustomerCount ()
BEGIN
  ALTER TABLE stageFactCustomerCount ADD INDEX(create_date);
  ALTER TABLE stageFactCustomerCount ADD INDEX(customer_id);
  ALTER TABLE stageFactCustomerCount ADD INDEX(store_id);
  TRUNCATE TABLE factCustomerCount;
END
//
DELIMITER ;

Add a tMysqlRow component to your canvas. Set its database connection to the data warehouse, and set its query to
the following:
CODE TO TYPE:
call etl_preFactCustomerCount();
This procedure cannot execute before we populate stageFactCustomerCount with data. To do that, right-click on the
tMysqlInput component of the stageFactCustomerCount subjob and link its On Subjob OK trigger to
tMysqlRow.
We are nearly ready to populate factCustomerCount, but we have one more small step to complete first. We are
planning on using the ELT components to load our fact table, but those components need up-to-date schemas for all
tables.
Back in lesson seven, we created our data warehouse database connection, and let TOS read the schema for several
of our tables. We have made several changes since then, so we need to update our schema.
It would be nice to include stageFactCustomerCount, but that table hasn't been created yet. The easiest way to create
the table is to execute our job. Do so by clicking on the button at the top of the window.

As long as you do not have any errors in your job, you'll see output like this:


OBSERVE:
Starting job ProcessDataWarehouse at 13:00 21/12/2008.
Job ProcessDataWarehouse ended at 13:00 21/12/2008. [exit code=0]
After your job runs successfully, you can update the database schema. Right-click on the Data Warehouse
connection in the metadata section of TOS, and choose Retrieve Schema:

We don't need to filter our schema, so click Next > at the bottom of the window:

In the next window, scroll through your database until you come across the objects for this course:
dimCustomer
dimDate
dimMovie
dimStaff
dimStore
etlLog
etlRuns
factCustomerCount
factRentalCount
factRentalDuration
factSales
stageFactCustomerCount
Make sure all of those objects are checked, then click Next >.
In the final screen, select each table on the left and then click on the Retrieve Schema button. You might see a dialog
that looks like this:

Click "OK." Repeat this process for each table to make sure you have the latest schema for all of them. When you are
done, click Finish.
The repository will still show the tables associated with our connection:

With our schemas updated, we are finally ready to populate factCustomerCount.
Start by putting a tELTMysqlMap component on your canvas. The tELTMysqlMap component acts as the
"controller," so you'll want to position it in the middle somewhere. To ensure this component executes after the
staging table is loaded with indexes, link the On Subjob OK trigger of the previous tMysqlRow to tELTMysqlMap:

Set the database connection for tELTMysqlMap from the repository, to the data warehouse.
Our next task is to add tELTMysqlInput components for each source table. Drop a tELTMysqlInput component
onto the canvas. Edit its properties - set its schema to the stageFactCustomerCount table from the repository,
under the "Db Connections" and "DataWarehouse:"

With the metadata set, we can link our tELTMysqlInput component to tELTMysqlMap. Right-click on
tELTMysqlInput and choose Link, then stageFactCustomerCount:

Drop the link on top of tELTMysqlMap:

Repeat this process for the three remaining tables - dimDate, dimStore and dimCustomer. When you're done,
your job should look something like this:

We have inputs to our tELTMysqlMap "controller" component, but what about outputs? We only need one output for
this application, so drag a single tELTMysqlOutput component to the canvas. Right-click on tELTMysqlMap,
choose Link, and then choose *new output*:

Name this new output factCustomerCount:

Our job now looks like this:

Don't worry about warnings and errors just yet - we are not done configuring our components. You do want to set the
database connection for tELTMysqlOutput now, however; set it to your personal database.
tELTMysqlMap is very similar to tMap, with a couple of exceptions:
1. tMap has a variables section, but tELTMysqlMap does not. This is because tELTMysqlMap is
executed on the MySQL server directly, which does not have any understanding of variables from TOS.
2. tELTMysqlMap uses SQL for its expressions, whereas tMap uses Java for its expressions.
Our links are now complete, but our component is in error. This is because we haven't specified how we want TOS to
combine our inputs into a single output. Double-click on tELTMysqlMap.

Once the window loads, click on the button on the left side, to add an alias for your first input table,
stageFactCustomerCount. We'll add this table first since it is the basis of all of our joins:

Select the stageFactCustomerCount table, and specify ss as the alias:

This ss alias will be translated into the SQL statement. Check it out: at the bottom of the window, click on Generated
SQL Select query for 'table' output:
OBSERVE:
SELECT
FROM
stageFactCustomerCount ss
Now add an alias for the next table, dimDate. Name the alias dd.
At this point, dimDate is not joined to stageFactCustomerCount. To specify how these tables are joined we will
do two things. First, click on the triangle drop down to change the join type from (IMPLICIT JOIN) to INNER JOIN:

Next, check the Explicit Join box for the date row under dimDate, specify = as the Operator, and specify the
Foreign Column as DATE(ss.create_date). When you are done you'll see the join:

Note

We need to use the DATE() function because create_date in stageFactCustomerCount is a DATETIME
data type, and date in dimDate is a DATE data type.
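To see why the conversion matters, consider a datetime that carries a time of day. The sketch below uses Python's sqlite3 module, where both values are stored as text, so the comparison rules differ from MySQL's typed comparison. Treat it as an illustration of the principle only. The 14:30:00 timestamp, the date_key value, and the function count_matches are made up.

```python
import sqlite3


def count_matches(join_condition):
    """Count staging rows that find a dimDate row under `join_condition`."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dimDate (date_key INTEGER, date TEXT)")
    conn.execute(
        "CREATE TABLE stageFactCustomerCount (customer_id INTEGER, create_date TEXT)"
    )
    conn.execute("INSERT INTO dimDate VALUES (1, '2004-03-19')")
    conn.execute(
        "INSERT INTO stageFactCustomerCount VALUES (218, '2004-03-19 14:30:00')"
    )
    sql = (
        "SELECT COUNT(*) FROM stageFactCustomerCount ss "
        "INNER JOIN dimDate dd ON " + join_condition
    )
    return conn.execute(sql).fetchone()[0]


# Without the conversion the datetime never equals the plain date.
print(count_matches("dd.date = ss.create_date"))        # prints 0
# date() strips the time portion, so the join finds its row.
print(count_matches("dd.date = date(ss.create_date)"))  # prints 1
```

With an inner join, a staging row that fails to match is silently dropped from the result, so a missing conversion shows up as missing fact rows rather than as an error.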

We have our first join in place!

Note

There isn't any way to save your job when you are inside of the tELTMysqlMap editor. Be sure to click
OK and save your work often - you'll lose it if you accidentally hit Cancel!

To make more room on your screen you can collapse the dimDate table by clicking on the Minimize/Maximize
button:

Next, add an alias called dc for the dimCustomer table. Collapse the other tables to make room. Then do the
following:
1. Change the join type to INNER JOIN.
2. Check the Explicit Join box for the customer_id row.
3. Set the operator to =.
4. Set the Foreign column / expression to ss.customer_id.
But wait! What about the Type-2 columns: start_date and end_date?
Great question!
It isn't enough to specify the customer_id for this join, because there could be multiple rows in dimCustomer with the
same customer_id. We also need to join on start_date and end_date.
We'll join those columns against dimDate because it has a nice date column with the exact type that exists in
dimCustomer. As for the join operator, we are looking for a row in dimCustomer that has a start_date less than or
equal to create_date and an end_date greater than create_date.
On a time line, we need to pick the row in dimCustomer that covers a specific time period:

To join on these columns as well, do the following:

1. Check the Explicit Join box for the start_date and end_date rows.
2. For start_date, set the operator to <=.
3. For end_date, set the operator to >.
4. For both columns, set the Foreign column / expression to dd.date.
When you're done, your join will look like this:

The query at the bottom of the window should look like this:
OBSERVE:
SELECT
FROM
stageFactCustomerCount ss INNER JOIN dimDate dd ON( dd.date = DATE(ss.create_date) )
INNER JOIN dimCustomer dc ON( dc.customer_id = ss.customer_id AND dc.start_date <=
dd.date AND dc.end_date > dd.date )
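The choice of <= for start_date and > for end_date matters on the day a row changes, because the old version's end_date equals the new version's start_date. A short sketch, using Python's sqlite3 module and the two VERA MCCOY rows from the SCD test in the previous lesson, shows the difference. The function matching_versions is a hypothetical helper.

```python
import sqlite3


def matching_versions(end_operator, day):
    """Return customer_keys valid on `day` using the given end_date operator."""
    if end_operator not in (">", ">="):
        raise ValueError("end_operator must be '>' or '>='")
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dimCustomer (customer_key INTEGER, customer_id INTEGER, "
        "start_date TEXT, end_date TEXT)"
    )
    conn.executemany(
        "INSERT INTO dimCustomer VALUES (?, ?, ?, ?)",
        [
            (1, 218, "2004-03-19", "2009-06-30"),    # expired version
            (590, 218, "2009-06-30", "2099-01-01"),  # current version
        ],
    )
    sql = (
        "SELECT customer_key FROM dimCustomer "
        "WHERE customer_id = 218 AND start_date <= ? "
        f"AND end_date {end_operator} ? ORDER BY customer_key"
    )
    return [row[0] for row in conn.execute(sql, (day, day))]


print(matching_versions(">", "2009-06-30"))   # prints [590]
print(matching_versions(">=", "2009-06-30"))  # prints [1, 590]
```

With >= on end_date, a fact dated on the change day would join to both versions and be counted twice.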
Finally, add an alias called ds for the dimStore table. Once again, this dimension is a Type-2. Execute the following
steps:
1. Change the join type to INNER JOIN.
2. Check the Explicit Join box for the store_id row.
3. Set the operator to =.
4. Set the Foreign column / expression to ss.store_id.
5. Check the Explicit Join box for the start_date and end_date rows.
6. For start_date, set the operator to <=.
7. For end_date, set the operator to >.
8. For both columns, set the Foreign column / expression to dd.date.
Once that's done, your query will look like this:
OBSERVE:
SELECT
FROM
stageFactCustomerCount ss INNER JOIN dimDate dd ON( dd.date = DATE(ss.create_date) )
INNER JOIN dimCustomer dc ON( dc.customer_id = ss.customer_id AND dc.start_date <=
dd.date AND dc.end_date > dd.date )
INNER JOIN dimStore ds ON( ds.store_id = ss.store_id AND ds.start_date <= dd.date
AND ds.end_date > dd.date )

Note

Are you missing a table? Make sure you set the join type to INNER JOIN.

The only thing left to do is to specify our outputs. But before we do, let's review the structure of factCustomerCount.
Run the following command against your personal database:
CODE TO TYPE:
explain factCustomerCount;
You'll see these results:
OBSERVE:
mysql> explain factCustomerCount;
+-------------------+---------+------+-----+---------+----------------+
| Field             | Type    | Null | Key | Default | Extra          |
+-------------------+---------+------+-----+---------+----------------+
| customerCount_key | int(11) | NO   | PRI | NULL    | auto_increment |
| date_key          | int(11) | NO   |     | NULL    |                |
| customer_key      | int(11) | NO   |     | NULL    |                |
| store_key         | int(11) | NO   |     | NULL    |                |
| customer_count    | int(11) | NO   |     | 1       |                |
| run_id            | int(11) | NO   |     | 1       |                |
+-------------------+---------+------+-----+---------+----------------+
6 rows in set (0.05 sec)
We need to specify all of these columns in tELTMysqlMap. Two columns are not present in our inputs: our surrogate key,
customerCount_key, and our customer_count column.
ORDER IS IMPORTANT for these columns, so we'll start by adding customerCount_key. To add this column, click the
add-column button:

Name the column customerCount_key, and make sure you keep the Nullable box checked. Under the Expression
column above, type in NULL:

Note

To reiterate, order matters for this section, so use care when adding these columns. You can always
reorder the columns using the up and down arrow buttons at the bottom of the window.
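
Order matters because the INSERT that TOS generates has no column list, so values are matched to table columns purely by position. The sketch below shows the effect on a scratch table; demo_fact and its columns are hypothetical names used only for this illustration.

```sql
-- Hypothetical scratch table shaped like a tiny fact table.
CREATE TEMPORARY TABLE demo_fact (
    demo_key       INT NOT NULL AUTO_INCREMENT,
    date_key       INT NOT NULL,
    customer_count INT NOT NULL DEFAULT 1,
    PRIMARY KEY (demo_key)
);

-- Correct order: NULL lets AUTO_INCREMENT assign the surrogate key,
-- then date_key, then customer_count.
INSERT INTO demo_fact (SELECT NULL, 20040626, 1);

-- Same values in the wrong order: MySQL accepts the statement, but the
-- date lands in customer_count and the count lands in date_key.
INSERT INTO demo_fact (SELECT NULL, 1, 20040626);

-- An explicit column list removes the dependence on position.
INSERT INTO demo_fact (customer_count, date_key) SELECT 1, 20040627;

SELECT * FROM demo_fact;
```

The second row is loaded without any error, which is why a misordered map is easy to miss until the reports look wrong.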

Expand dimDate, then drag date_key over to the output table, right underneath customerCount_key:

Repeat this process, in order, for customer_key and store_key.

The next column is customerCount. Add this by clicking the add-column button. Name the column customerCount, and
make sure you keep the Nullable box checked. Under the Expression column above, type in 1.
Finally, drag the run_id column to the output. Your completed map will look like this:

Your query will look like this:


OBSERVE:
SELECT
null, dd.date_key , dc.customer_key , ds.store_key , 1, ss.run_id
FROM
stageFactCustomerCount ss INNER JOIN dimDate dd ON( dd.date = DATE(ss.create_date) )
INNER JOIN dimCustomer dc ON( dc.customer_id = ss.customer_id AND dc.start_date <=
dd.date AND dc.end_date > dd.date )
INNER JOIN dimStore ds ON( ds.store_id = ss.store_id AND ds.start_date <= dd.date
AND ds.end_date > dd.date )
There is one last thing we need to do before we run our job. Currently there's an error on tELTMysqlOutput:

Note

You can ignore the warning on tELTMysqlMap; TOS is just confused by the table aliases.

To fix this error, double-click tELTMysqlOutput, then click on Sync Columns:

We are now ready to run the job. Click on the Run icon at the top of the window.

As long as your components are not in error, you'll see output that looks like this:

OBSERVE:
Starting job ProcessDataWarehouse at 21:07 22/12/2008.
Inserting with :
INSERT INTO factCustomerCount (SELECT null, dd.date_key , dc.customer_key , ds.store_key ,
1, ss.run_id FROM stageFactCustomerCount ss INNER JOIN dimDate dd ON( dd.date =
DATE(ss.create_date) ) INNER JOIN dimCustomer dc ON( dc.customer_id = ss.customer_id
AND dc.start_date <= dd.date AND dc.end_date > dd.date ) INNER JOIN dimStore ds
ON( ds.store_id = ss.store_id AND ds.start_date <= dd.date AND ds.end_date > dd.date ))
--> 589 rows inserted.
Job ProcessDataWarehouse ended at 21:07 22/12/2008. [exit code=0]
You did it! factCustomerCount now has data in it!
We covered a whole lot in this lesson! We'll finish implementing the rest of the facts in the next lesson. See you in a bit!

Copyright 1998-2013 O'Reilly Media, Inc.

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
See http://creativecommons.org/licenses/by-sa/3.0/legalcode for more information.

Processing Facts, Part II


DBA 3: Data Warehousing Lesson 11
In the last lesson we implemented factCustomerCount. In this lesson we will look at our largest fact, factSales.

factSales
We implemented the table for factSales back in lesson four. Let's review its structure by looking at the
CREATE TABLE statement we used:
OBSERVE:
CREATE TABLE factSales
(
    sales_key    INT NOT NULL AUTO_INCREMENT,
    date_key     INT NOT NULL REFERENCES dimDate,
    customer_key INT NOT NULL REFERENCES dimCustomer,
    movie_key    INT NOT NULL REFERENCES dimMovie,
    store_key    INT NOT NULL REFERENCES dimStore,
    sales_amount decimal(5,2) NOT NULL,
    PRIMARY KEY (sales_key)
);

There is only one numeric value in this fact: sales_amount. The rest are foreign keys that point to dimensions.
To process our sales fact, we will perform the following steps:
1. Move data from the source system into the staging table called stageFactSales.
2. Create indexes on the staging table.
3. Use the ELT family of components in TOS to populate factSales.
Let's get started by populating our staging table, stageFactSales. To do this, drag these three components to your
canvas: tMysqlInput, tMap and tMysqlOutput.
Edit the properties for tMysqlInput, setting its connection to the sakila database. Next, edit its query. Use this query to
retrieve sales data:
CODE TO TYPE:
select p.payment_id, p.amount, p.payment_date,
p.customer_id, i.film_id, i.store_id, p.staff_id
from
payment p
join rental r on ( p.rental_id = r.rental_id )
join inventory i on ( r.inventory_id = i.inventory_id )
WHERE p.customer_id > 10

Run the query to make sure you typed everything correctly.

Once that's done, it's time to specify the schema for tMysqlInput. For this query, we can use the Guess Schema
button to do most of the work; click on it.
TOS identifies most of the columns without difficulty, except for one: amount. This isn't a float value, it's a BigDecimal
(decimal) value of length 5 and precision 2. Fix the schema. It will ultimately look like this:

Note

You might be wondering about the differences between float, double, decimal and integer numbers.
Float and double are approximate numeric types, meaning the computer may not store your value
exactly as you entered it "under the hood." So, even if two float numbers look the same to you, the
computer may not store them the same way, and they may not be equal to one another, at least
according to the computer. Decimal and integer are exact numeric types, so what you see is what the
computer stores.
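
The difference between approximate and exact types is easy to check from the mysql prompt. This is a small sketch that relies on MySQL treating literals written with an exponent (such as 0.1E0) as approximate DOUBLE values and plain literals (such as 0.1) as exact DECIMAL values; the column aliases are made up for the example.

```sql
-- Approximate arithmetic: the sum of the two doubles is not stored as
-- exactly 0.3, so the comparison should report 0 (not equal).
-- Exact arithmetic: the decimal sum is exactly 0.3, so it should report 1.
SELECT
    0.1E0 + 0.2E0 = 0.3E0 AS approximate_equal,
    0.1 + 0.2 = 0.3       AS exact_equal;

-- Money columns such as amount stay exact when kept as DECIMAL(5,2).
SELECT CAST(4.99 AS DECIMAL(5,2)) + CAST(0.01 AS DECIMAL(5,2)) AS exact_sum;
```

This is the reason the schema for amount should be BigDecimal rather than float: sales totals built from approximate values can drift by fractions of a cent.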

With that specified, link tMysqlInput to tMap, then open tMap to edit its connections.
We will use this tMap just like we used the map components for dimension processing. Add our auditing column
called run_id. Add an output, then add the run_id column. Finally, drag all of the input columns to the output. When
you're done, your tMap should look somewhat like this:

Next, connect tMap to tMysqlOutput. Set the connection properties on tMysqlOutput to the data warehouse
connection from the repository, and set the table to "stageFactSales" (within quotation marks).
We could manually create the stageFactSales table, but then we would have to delete its data and drop its indexes
before we populate it with data. Instead of doing that, we can have the tMysqlOutput component Drop table if
exists and create, which accomplishes the same thing. To do this, set the action on table to Drop table if exists
and create.
Right-click on the tMysqlInput component of factRentalCount, and connect the On Subjob OK trigger to the
tMysqlInput of the stageFactSales subjob.
When you're done, your job will look something like this:

Next, we need to create our stored procedure to create our indexes and clear the factSales table. Switch to the
terminal, and run the following query:
CODE TO TYPE:
DELIMITER //
CREATE PROCEDURE etl_preFactSales ()
BEGIN
    ALTER TABLE stageFactSales add index(payment_id);
    ALTER TABLE stageFactSales add index(payment_date);
    ALTER TABLE stageFactSales add index(customer_id);
    ALTER TABLE stageFactSales add index(film_id);
    ALTER TABLE stageFactSales add index(store_id);
    TRUNCATE TABLE factSales;
END
//
DELIMITER ;
Now, add a tMysqlRow component to your canvas. Set its database connection to the data warehouse, and set its
query to the following:
CODE TO TYPE:
call etl_preFactSales();
This procedure must not run until stageFactSales has been populated with data. Right-click on the tMysqlInput
component of the stageFactSales subjob and link its On Subjob OK trigger to tMysqlRow.
In the last lesson, we updated our table schema to include stageFactCustomerCount. We need to do the same for
stageFactSales. First, run your job to create the table. Click on the Run button at the top of the window.

As long as you don't have any errors in your job, you'll see something like the following output:
OBSERVE:
Starting job ProcessDataWarehouse at 13:00 21/12/2008.
Job ProcessDataWarehouse ended at 13:00 21/12/2008. [exit code=0]
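
An exit code of 0 only tells us the job finished. If you want to confirm that the staging load and etl_preFactSales did what we expect, a quick check along these lines can help. It is only a sketch; the aliases staged_rows and fact_rows are made up for the example.

```sql
-- Lists the indexes on the staging table; after the procedure has run
-- there should be entries for the five columns it indexes.
SHOW INDEX FROM stageFactSales;

-- The staging table should now hold the rows pulled from sakila.
SELECT COUNT(*) AS staged_rows FROM stageFactSales;

-- factSales was truncated by the procedure and has not been reloaded yet.
SELECT COUNT(*) AS fact_rows FROM factSales;
```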
After your job runs successfully, you can update the database schema by right-clicking on the Data Warehouse
connection in the metadata section of TOS, and choosing Retrieve Schema:

We don't need to filter our schema, so click Next > at the bottom of the window:

In the next window, scroll through your database until you come across the objects for this course, then check
stageFactSales.
In the final screen, select stageFactSales on the left and then click on the Retrieve Schema button.
Now we're ready to create our map. Drag a tELTMysqlMap component to your canvas. To ensure this component
executes after the staging table is loaded and indexed, link the On Subjob OK trigger of the previous tMysqlRow to
tELTMysqlMap:

Note

Be sure to set the database connection for tELTMysqlMap to the data warehouse, from the repository.

Next, we need to add tELTMysqlInput components for each source table. Start by adding a component for
stageFactSales:

With the metadata set, we can link our tELTMysqlInput component to tELTMysqlMap. Right-click on
tELTMysqlInput and choose Link, then stageFactSales:

Drop the link on top of tELTMysqlMap.

Repeat this process for the remaining required tables:
dimCustomer
dimDate
dimMovie
dimStaff
dimStore
When you are done, your canvas will look something like this:

We have inputs to our tELTMysqlMap "controller" component, but what about outputs? We only need one output for
this application, so drag a single tELTMysqlOutput component to the canvas. Right-click on tELTMysqlMap,
choose Link, then choose *new output*:

Name this new output factSales:

Our job now looks like this:

We're ready to set up our joins. Double-click on tELTMysqlMap.

Once the window loads, click on the button on the left side to add an alias for your first input table,
stageFactSales. We'll add this table first since it is the basis of all of our joins:

Select the stageFactSales table, and specify ss as the alias:

This ss alias will be carried into the SQL statement. At the bottom of the window, click on Generated SQL Select
query for 'table' output:
OBSERVE:
SELECT
FROM
stageFactSales ss
Now add an alias to the next table, dimDate. Name the alias dd.
dimDate is not yet joined to stageFactSales. To specify how these tables are joined we will do two things: first,
click on the triangle drop-down to change the join type from (IMPLICIT JOIN) to INNER JOIN:

Next, check the Explicit Join box for the date row under dimDate, specify = as the Operator, and specify the
Foreign Column as DATE(ss.payment_date). When you are done you'll see the join:

Now is a good time to save. (Actually, it's almost always a good time to save.) Click "OK" to close the map window,
then save your work.
Let's move on to our next input. Add an alias for the table dimMovie, called dm.
With dimDate out of the way, we can specify the join to dimMovie. Now take these steps:
1. Change the join type to INNER JOIN.
2. Check the Explicit Join box for the film_id row.
3. Set the operator to =.
4. Set the Foreign column / expression to ss.film_id.
When you are done, your join will show:

Three tables down, three more to go!


Next, add an alias called dc for the dimCustomer table, a Type-2 SCD. Collapse the other tables to make room.
Then execute these steps:
1. Change the join type to INNER JOIN.
2. Check the Explicit Join box for the customer_id row.
3. Set the operator to =.
4. Set the Foreign column / expression to ss.customer_id.
5. Check the Explicit Join box for the start_date and end_date rows.
6. For start_date, set the operator to <=.
7. For end_date, set the operator to >.
8. For both columns, set the Foreign column / expression to dd.date.
When you're done, your join will look like this:

We're almost there! Add an alias called dst for the dimStaff table. Collapse the other tables to make room. This
dimension is also a Type-2, so we'll have to join on start_date and end_date again. Now execute these steps:
1. Change the join type to INNER JOIN.
2. Check the Explicit Join box for the staff_id row.
3. Set the operator to =.
4. Set the Foreign column / expression to ss.staff_id.
5. Check the Explicit Join box for the start_date and end_date rows.
6. For start_date, set the operator to <=.
7. For end_date, set the operator to >.
8. For both columns, set the Foreign column / expression to dd.date.
Finally, add an alias called ds for the dimStore table. Once again, this dimension is a Type-2. Execute the following
steps:
1. Change the join type to INNER JOIN.
2. Check the Explicit Join box for the store_id row.
3. Set the operator to =.
4. Set the Foreign column / expression to ss.store_id.
5. Check the Explicit Join box for the start_date and end_date rows.
6. For start_date, set the operator to <=.
7. For end_date, set the operator to >.
8. For both columns, set the Foreign column / expression to dd.date.
Once you are done with all that, your query will look like this:

OBSERVE:
SELECT
FROM
stageFactSales ss INNER JOIN dimDate dd ON( dd.date = DATE(ss.payment_date) )
INNER JOIN dimMovie dm ON( dm.film_id = ss.film_id )
INNER JOIN dimCustomer dc ON( dc.customer_id = ss.customer_id AND dc.start_date <=
dd.date AND dc.end_date > dd.date )
INNER JOIN dimStaff dst ON( dst.staff_id = ss.staff_id AND dst.start_date <= dd.date
AND dst.end_date > dd.date )
INNER JOIN dimStore ds ON( ds.store_id = ss.store_id AND ds.start_date <= dd.date
AND ds.end_date > dd.date )

Note

Are you missing a table? Make sure you set the join type to INNER JOIN.

The only thing left to do now is to specify our outputs. Before we do that though, let's review the structure of
factSales. Run the following command against your personal database:
CODE TO TYPE:
explain factSales;
You'll see the following results:
OBSERVE:
mysql> explain factSales;
+--------------+--------------+------+-----+---------+----------------+
| Field        | Type         | Null | Key | Default | Extra          |
+--------------+--------------+------+-----+---------+----------------+
| sales_key    | int(11)      | NO   | PRI | NULL    | auto_increment |
| date_key     | int(11)      | NO   |     | NULL    |                |
| customer_key | int(11)      | NO   |     | NULL    |                |
| movie_key    | int(11)      | NO   |     | NULL    |                |
| store_key    | int(11)      | NO   |     | NULL    |                |
| staff_key    | int(11)      | NO   |     | NULL    |                |
| sales_amount | decimal(6,2) | YES  |     | NULL    |                |
| run_id       | int(11)      | NO   |     | NULL    |                |
+--------------+--------------+------+-----+---------+----------------+
8 rows in set (0.10 sec)
We need to specify all eight of these columns in tELTMysqlMap. The only column that isn't present is our surrogate
key, sales_key. To add this column, click the add-column button:

Name the column sales_key, and make sure you keep the Nullable box checked. Under the Expression column
above, type in NULL.

With that out of the way, we can add our other columns.

Note

One more time: order matters for this section, so use care when adding these columns. You can
always reorder the columns using the up and down arrow buttons at the bottom of the window.

First, expand dimDate, then drag date_key over to the output table, right under sales_key.

After you drop the column, the output will look like this:

Repeat this process, in order, for the remaining columns:

1. customer_key
2. movie_key
3. store_key
4. staff_key
5. amount
6. run_id
Take a look at the generated query. It should look something like this:
OBSERVE:
SELECT
null, dd.date_key , dc.customer_key , dm.movie_key , ds.store_key , dst.staff_key ,
ss.amount , ss.run_id
FROM
stageFactSales ss INNER JOIN dimDate dd ON( dd.date = DATE(ss.payment_date) )
INNER JOIN dimMovie dm ON( dm.film_id = ss.film_id )
INNER JOIN dimCustomer dc ON( dc.customer_id = ss.customer_id AND dc.start_date <=
dd.date AND dc.end_date > dd.date )
INNER JOIN dimStaff dst ON( dst.staff_id = ss.staff_id AND dst.start_date <= dd.date
AND dst.end_date > dd.date )
INNER JOIN dimStore ds ON( ds.store_id = ss.store_id AND ds.start_date <= dd.date
AND ds.end_date > dd.date )
We're done with tELTMysqlMap, so click "OK" to close the window, and save your work.
There is one thing we should check before we go much further: we should run EXPLAIN on the generated query to see
how it will run. Run the following command against your personal database:
CODE TO TYPE:
EXPLAIN SELECT
null, dd.date_key , dc.customer_key , dm.movie_key , ds.store_key , dst.staff_key ,
ss.amount , ss.run_id
FROM
stageFactSales ss INNER JOIN dimDate dd ON( dd.date = DATE(ss.payment_date) )
INNER JOIN dimMovie dm ON( dm.film_id = ss.film_id )
INNER JOIN dimCustomer dc ON( dc.customer_id = ss.customer_id AND dc.start_date <=
dd.date AND dc.end_date > dd.date )
INNER JOIN dimStaff dst ON( dst.staff_id = ss.staff_id AND dst.start_date <= dd.date
AND dst.end_date > dd.date )
INNER JOIN dimStore ds ON( ds.store_id = ss.store_id AND ds.start_date <= dd.date
AND ds.end_date > dd.date )

You'll see the following results:

OBSERVE:
mysql> explain SELECT
    -> null, dd.date_key , dc.customer_key , dm.movie_key , ds.store_key , dst.staff_key ,
       ss.amount , ss.run_id
    -> FROM
    -> stageFactSales ss INNER JOIN dimDate dd ON( dd.date = DATE(ss.payment_date) )
    -> INNER JOIN dimMovie dm ON( dm.film_id = ss.film_id )
    -> INNER JOIN dimCustomer dc ON( dc.customer_id = ss.customer_id AND
       dc.start_date <= dd.date AND dc.end_date > dd.date )
    -> INNER JOIN dimStaff dst ON( dst.staff_id = ss.staff_id AND dst.start_date <=
       dd.date AND dst.end_date > dd.date )
    -> INNER JOIN dimStore ds ON( ds.store_id = ss.store_id AND ds.start_date <=
       dd.date AND ds.end_date > dd.date );
+----+-------------+-------+------+------------------------------+---------+---------+---------------------+-------+-------------+
| id | select_type | table | type | possible_keys                | key     | key_len | ref                 | rows  | Extra       |
+----+-------------+-------+------+------------------------------+---------+---------+---------------------+-------+-------------+
|  1 | SIMPLE      | dst   | ALL  | NULL                         | NULL    | NULL    | NULL                |     2 |             |
|  1 | SIMPLE      | ds    | ALL  | NULL                         | NULL    | NULL    | NULL                |     2 |             |
|  1 | SIMPLE      | dm    | ALL  | NULL                         | NULL    | NULL    | NULL                |  1000 |             |
|  1 | SIMPLE      | ss    | ref  | customer_id,film_id,store_id | film_id | 4       | certjosh.dm.film_id |    16 | Using where |
|  1 | SIMPLE      | dc    | ALL  | NULL                         | NULL    | NULL    | NULL                |  1178 | Using where |
|  1 | SIMPLE      | dd    | ALL  | NULL                         | NULL    | NULL    | NULL                | 18628 | Using where |
+----+-------------+-------+------+------------------------------+---------+---------+---------------------+-------+-------------+
6 rows in set (0.08 sec)
mysql>
This looks pretty bad: every dimension table is scanned in full (type ALL), and only the staging table uses an index.
That's because we never indexed any columns (other than the primary key) when we created our dimensions.
Fortunately this problem has a quick fix: we'll add indexes to the ID columns we join on. We'll omit start_date and
end_date from the indexes for now, just to keep things shorter. Run the following commands against your personal
database:
CODE TO TYPE:
alter table dimCustomer add index(customer_id);
alter table dimDate add index(date);
alter table dimMovie add index(film_id);
alter table dimStaff add index(staff_id);
alter table dimStore add index(store_id);

As long as you typed everything correctly you'll see the following results:

OBSERVE:
mysql> alter table dimCustomer add index(customer_id);
Query OK, 1178 rows affected (0.08 sec)
Records: 1178 Duplicates: 0 Warnings: 0
mysql> alter table dimDate add index(date);
Query OK, 18628 rows affected (0.15 sec)
Records: 18628 Duplicates: 0 Warnings: 0
mysql> alter table dimMovie add index(film_id);
Query OK, 1000 rows affected (0.09 sec)
Records: 1000 Duplicates: 0 Warnings: 0
mysql> alter table dimStaff add index(staff_id);
Query OK, 2 rows affected (0.09 sec)
Records: 2 Duplicates: 0 Warnings: 0
mysql> alter table dimStore add index(store_id);
Query OK, 2 rows affected (0.05 sec)
Records: 2 Duplicates: 0 Warnings: 0
mysql>
Let's try the EXPLAIN again. Run the following command against your personal database:
CODE TO TYPE:
EXPLAIN SELECT
null, dd.date_key , dc.customer_key , dm.movie_key , ds.store_key , dst.staff_key ,
ss.amount , ss.run_id
FROM
stageFactSales ss INNER JOIN dimDate dd ON( dd.date = DATE(ss.payment_date) )
INNER JOIN dimMovie dm ON( dm.film_id = ss.film_id )
INNER JOIN dimCustomer dc ON( dc.customer_id = ss.customer_id AND dc.start_date <=
dd.date AND dc.end_date > dd.date )
INNER JOIN dimStaff dst ON( dst.staff_id = ss.staff_id AND dst.start_date <= dd.date
AND dst.end_date > dd.date )
INNER JOIN dimStore ds ON( ds.store_id = ss.store_id AND ds.start_date <= dd.date
AND ds.end_date > dd.date );
It looks like our query is using most of the indexes we created. The full scans of dimStaff and dimStore should not be a
problem, since each of those tables has only two rows.

OBSERVE:
mysql> EXPLAIN SELECT
    -> null, dd.date_key , dc.customer_key , dm.movie_key , ds.store_key , dst.staff_key ,
       ss.amount , ss.run_id
    -> FROM
    -> stageFactSales ss INNER JOIN dimDate dd ON( dd.date = DATE(ss.payment_date) )
    -> INNER JOIN dimMovie dm ON( dm.film_id = ss.film_id )
    -> INNER JOIN dimCustomer dc ON( dc.customer_id = ss.customer_id AND
       dc.start_date <= dd.date AND dc.end_date > dd.date )
    -> INNER JOIN dimStaff dst ON( dst.staff_id = ss.staff_id AND dst.start_date <=
       dd.date AND dst.end_date > dd.date )
    -> INNER JOIN dimStore ds ON( ds.store_id = ss.store_id AND ds.start_date <=
       dd.date AND ds.end_date > dd.date );
+----+-------------+-------+------+------------------------------+-------------+---------+-------------------------+-------+-------------+
| id | select_type | table | type | possible_keys                | key         | key_len | ref                     | rows  | Extra       |
+----+-------------+-------+------+------------------------------+-------------+---------+-------------------------+-------+-------------+
|  1 | SIMPLE      | ss    | ALL  | customer_id,film_id,store_id | NULL        | NULL    | NULL                    | 15766 |             |
|  1 | SIMPLE      | dm    | ref  | film_id                      | film_id     | 2       | certjosh.ss.film_id     |     1 | Using where |
|  1 | SIMPLE      | dd    | ref  | date                         | date        | 3       | func                    |     1 | Using where |
|  1 | SIMPLE      | dc    | ref  | customer_id                  | customer_id | 4       | certjosh.ss.customer_id |     2 | Using where |
|  1 | SIMPLE      | dst   | ALL  | staff_id                     | NULL        | NULL    | NULL                    |     2 | Using where |
|  1 | SIMPLE      | ds    | ALL  | store_id                     | NULL        | NULL    | NULL                    |     2 | Using where |
+----+-------------+-------+------+------------------------------+-------------+---------+-------------------------+-------+-------------+
6 rows in set (0.99 sec)
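
We left start_date and end_date out of the indexes to keep things short. If a Type-2 dimension grows large, one option is a composite index that covers the ID and both dates. The statements below are an optional sketch: the index names are hypothetical, and whether MySQL chooses these indexes depends on your data, so compare the EXPLAIN output before and after. The output shown in this lesson assumes only the single-column indexes.

```sql
-- Hypothetical composite indexes for the Type-2 dimensions.
-- The equality column comes first, followed by the range columns.
ALTER TABLE dimCustomer ADD INDEX idx_dimCustomer_range (customer_id, start_date, end_date);
ALTER TABLE dimStaff    ADD INDEX idx_dimStaff_range    (staff_id, start_date, end_date);
ALTER TABLE dimStore    ADD INDEX idx_dimStore_range    (store_id, start_date, end_date);

-- List the indexes now defined on one of the dimensions.
SHOW INDEX FROM dimCustomer;
```

With tables as small as dimStaff and dimStore, a full scan of two rows is already cheap, so the extra indexes matter mostly for dimCustomer.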
There is one last thing we need to do before we run our job. There is currently an error on tELTMysqlOutput:

Note

You can ignore the warning on tELTMysqlMap; TOS is just confused by the table aliases.

To fix this error, double-click tELTMysqlOutput, then click on Sync Columns:

We are now ready to run the job. Do this by clicking on the Run icon at the top of the window.

As long as your components are not in error, you'll see output that looks like this:

OBSERVE:
Starting job ProcessDataWarehouse at 18:56 21/12/2008.
Inserting with :
INSERT INTO factSales (SELECT null, dd.date_key , dc.customer_key , dm.movie_key ,
ds.store_key , dst.staff_key , ss.amount , ss.run_id FROM stageFactSales ss INNER JOIN
dimDate dd ON( dd.date = DATE(ss.payment_date) ) INNER JOIN dimMovie dm ON( dm.film_id =
ss.film_id ) INNER JOIN dimCustomer dc ON( dc.customer_id = ss.customer_id AND
dc.start_date <= dd.date AND dc.end_date > dd.date ) INNER JOIN dimStaff dst ON(
dst.staff_id = ss.staff_id AND dst.start_date <= dd.date AND dst.end_date > dd.date )
INNER JOIN dimStore ds ON( ds.store_id = ss.store_id AND ds.start_date <= dd.date AND
ds.end_date > dd.date ))
--> 15766 rows inserted.
Job ProcessDataWarehouse ended at 18:56 21/12/2008. [exit code=0]
Great job! We're really rolling now. Keep it up and see you shortly!


Special Facts
DBA 3: Data Warehousing Lesson 12
Hello! In the last lesson we implemented our facts. Our data warehouse is fairly straightforward. We read data from one
database, do a few basic transformations, and load the data into the warehouse.
The DVD Rental store only tracks sales. It doesn't have to worry about invoicing customers, shipping products, internet orders,
or tracking costs. Other businesses are much more complex. In this lesson, we'll investigate options for dealing with other
types of data and situations that occur in data warehouses.

Missing Keys
In the last lesson we loaded our fact tables, assuming that we could resolve all of the foreign keys required for the
dimensions.
But what if a key cannot be found? Let's use our stageFactCustomerCount and factCustomerCount tables to help
us work on this problem. First, run your job to make sure your tables contain data. Take a look at the output (some
lines have been omitted):
OBSERVE:
Starting job ProcessDataWarehouse at 14:24 23/12/2008.
Inserting with :
INSERT INTO factCustomerCount (SELECT null, dd.date_key , dc.customer_key , ds.store_key ,
1, ss.run_id FROM stageFactCustomerCount ss INNER JOIN dimDate dd ON( dd.date =
DATE(ss.create_date) ) INNER JOIN dimCustomer dc ON( dc.customer_id = ss.customer_id
AND dc.start_date <= dd.date AND dc.end_date > dd.date ) INNER JOIN dimStore ds
ON( ds.store_id = ss.store_id AND ds.start_date <= dd.date AND ds.end_date > dd.date ))
--> 589 rows inserted.
We know that 589 rows were put into factCustomerCount. How many were in stageFactCustomerCount? Run this
command against your personal database:
CODE TO TYPE:
select count(*) from stageFactCustomerCount;
Oh my! It looks like stageFactCustomerCount has 599 rows!
OBSERVE:
mysql> select count(*) from stageFactCustomerCount;
+----------+
| count(*) |
+----------+
|      599 |
+----------+
1 row in set (0.05 sec)
mysql>
Our join has excluded 10 rows. How do we find the missing rows?

Debugging tELTMysqlMap

Remember the output from TOS? Take a look:

OBSERVE: Output from TOS

Starting job ProcessDataWarehouse at 14:24 23/12/2008.
Inserting with :
INSERT INTO factCustomerCount (SELECT null, dd.date_key , dc.customer_key , ds.store_key ,
1, ss.run_id FROM stageFactCustomerCount ss INNER JOIN dimDate dd ON( dd.date =
DATE(ss.create_date) ) INNER JOIN dimCustomer dc ON( dc.customer_id = ss.customer_id
AND dc.start_date <= dd.date AND dc.end_date > dd.date ) INNER JOIN dimStore ds
ON( ds.store_id = ss.store_id AND ds.start_date <= dd.date AND ds.end_date > dd.date ))
--> 589 rows inserted.
Inserting with :
INSERT INTO factSales (SELECT null, dd.date_key , dc.customer_key , IFNULL(dm.movie_key ,
-1), ds.store_key , dst.staff_key , ss.amount , ss.run_id FROM stageFactSales ss
INNER JOIN dimDate dd ON( dd.date = DATE(ss.payment_date) ) LEFT OUTER JOIN dimMovie dm
ON( dm.film_id = ss.film_id ) INNER JOIN dimCustomer dc ON( dc.customer_id =
ss.customer_id AND dc.start_date <= dd.date AND dc.end_date > dd.date ) INNER JOIN
dimStaff dst ON( dst.staff_id = ss.staff_id AND dst.start_date <= dd.date AND
dst.end_date > dd.date ) INNER JOIN dimStore ds ON( ds.store_id = ss.store_id AND
ds.start_date <= dd.date AND ds.end_date > dd.date ))
--> 15767 rows inserted.
Job ProcessDataWarehouse ended at 14:24 23/12/2008. [exit code=0]
The output includes the INSERT ... SELECT statement used to populate factCustomerCount. If we get rid
of the INSERT part, and reformat the query slightly, it will look like this:
OBSERVE:
SELECT null, dd.date_key , dc.customer_key , ds.store_key , 1, ss.run_id
FROM stageFactCustomerCount ss
INNER JOIN dimDate dd ON( dd.date = DATE(ss.create_date) )
INNER JOIN dimCustomer dc ON( dc.customer_id = ss.customer_id
    AND dc.start_date <= dd.date AND dc.end_date > dd.date )
INNER JOIN dimStore ds ON( ds.store_id = ss.store_id
    AND ds.start_date <= dd.date AND ds.end_date > dd.date )
Because we used inner joins in our query, non-matching rows are excluded from the results. There is hope
after all. =)
We can change this query so it returns only non-matching rows. We do that by converting the dimension INNER JOINs to
LEFT JOINs and adding a WHERE clause. We also want to add three columns to the SELECT part, to help us
debug: ss.customer_id, ss.store_id, ss.create_date. Run this command against your personal
database:
CODE TO TYPE:
SELECT
ss.customer_id, ss.store_id, ss.create_date,
dd.date_key , dc.customer_key , ds.store_key , ss.run_id
FROM stageFactCustomerCount ss
INNER JOIN dimDate dd ON( dd.date = DATE(ss.create_date) )
LEFT JOIN dimCustomer dc ON( dc.customer_id = ss.customer_id
    AND dc.start_date <= dd.date AND dc.end_date > dd.date )
LEFT JOIN dimStore ds ON( ds.store_id = ss.store_id
    AND ds.start_date <= dd.date AND ds.end_date > dd.date )
WHERE dd.date_key IS NULL OR dc.customer_key IS NULL OR ds.store_key IS NULL

This query will return rows from stageFactCustomerCount that do not have corresponding rows in
dimCustomer or dimStore. Non-matching rows will have a NULL customer_key or a NULL store_key.
Because dimDate is still joined with an INNER JOIN, a staging row with no matching date would still be
dropped; change that join to a LEFT JOIN as well if you suspect missing dates.

After you run the query, you'll see the missing rows:


OBSERVE:
mysql> SELECT
    -> ss.customer_id, ss.store_id, ss.create_date,
    -> dd.date_key , dc.customer_key , ds.store_key , ss.run_id
    -> FROM stageFactCustomerCount ss
    -> INNER JOIN dimDate dd ON( dd.date = DATE(ss.create_date) )
    -> LEFT JOIN dimCustomer dc ON( dc.customer_id = ss.customer_id AND
       dc.start_date <= dd.date AND dc.end_date > dd.date )
    -> LEFT JOIN dimStore ds ON( ds.store_id = ss.store_id AND ds.start_date
       <= dd.date AND ds.end_date > dd.date )
    -> WHERE dd.date_key IS NULL OR dc.customer_key IS NULL OR ds.store_key IS NULL;
+-------------+----------+---------------------+----------+--------------+-----------+--------+
| customer_id | store_id | create_date         | date_key | customer_key | store_key | run_id |
+-------------+----------+---------------------+----------+--------------+-----------+--------+
|           1 |        1 | 2004-06-26 00:00:00 | 20040626 |         NULL |         1 |     76 |
|           2 |        1 | 2004-10-12 00:00:00 | 20041012 |         NULL |         1 |     76 |
|           3 |        1 | 2004-06-12 00:00:00 | 20040612 |         NULL |         1 |     76 |
|           4 |        2 | 2004-11-22 00:00:00 | 20041122 |         NULL |         2 |     76 |
|           5 |        1 | 2004-02-17 00:00:00 | 20040217 |         NULL |         1 |     76 |
|           6 |        2 | 2004-12-18 00:00:00 | 20041218 |         NULL |         2 |     76 |
|           7 |        1 | 2004-06-09 00:00:00 | 20040609 |         NULL |         1 |     76 |
|           8 |        2 | 2004-04-18 00:00:00 | 20040418 |         NULL |         2 |     76 |
|           9 |        2 | 2004-02-29 00:00:00 | 20040229 |         NULL |         2 |     76 |
|          10 |        1 | 2004-12-03 00:00:00 | 20041203 |         NULL |         1 |     76 |
+-------------+----------+---------------------+----------+--------------+-----------+--------+
10 rows in set (0.10 sec)
It looks like we found our missing rows: customer_id 1 through 10. If you recall, we specifically excluded
these customers from our dimCustomer dimension, because our business users told us they were "test"
accounts.
To fix this problem, we'll alter our select statement. Edit the query on the tMysqlInput component for
stageCustomerCount. Change the query so it looks like this (make sure to use the correct quotation
marks):
CODE TO TYPE:
select customer_id, store_id, create_date
from customer WHERE customer_id > 10;
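
If you'd like to experiment with the LEFT JOIN / IS NULL check outside the sandbox, here is a minimal sketch. It is only an illustration: it uses Python's built-in sqlite3 module instead of the course's MySQL database, and the two tables are cut-down, hypothetical versions of stageFactCustomerCount and dimCustomer with made-up rows.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the MySQL sandbox.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stageFactCustomerCount (customer_id INTEGER, store_id INTEGER);
    CREATE TABLE dimCustomer (customer_key INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO stageFactCustomerCount VALUES (1, 1), (2, 1), (11, 2), (12, 1);
    -- Customers 1 and 2 play the role of "test" accounts that were never loaded.
    INSERT INTO dimCustomer VALUES (590, 11), (591, 12);
""")

# Staging rows with no dimension match come back with a NULL customer_key.
missing = conn.execute("""
    SELECT ss.customer_id
    FROM stageFactCustomerCount ss
    LEFT JOIN dimCustomer dc ON (dc.customer_id = ss.customer_id)
    WHERE dc.customer_key IS NULL
    ORDER BY ss.customer_id
""").fetchall()

print(missing)
```

Only the staging rows without a dimension match survive the WHERE clause, which is the same reason customers 1 through 10 showed up in the output above.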

Handling Missing Keys


We process dimensions before we process facts so we can be sure that our dimensions are current and
our facts can load correctly. If we load our sales facts into our data warehouse, but one row doesn't have a
movie associated with it, we have several options to deal with the missing movie:
1. Ignore the row - don't load it.
2. Stop the current process, alerting someone to the error.
3. Load the row, but set the row to a special "!!! MISSING MOVIE !!!" entry in the dimMovie
dimension.
Logically, your business users will have the last word on handling this situation. Most will not choose
options #1 or #2 for these reasons:
1. Ignoring the row can cause daily totals to be off, so the warehouse reports may not match
existing reports from the source systems.
2. Stopping the current process to alert users to the error interrupts the whole process, and can
leave the entire warehouse inaccessible.
Since our fact tables are reloaded on a daily basis, option #3 is attractive because the row in error can be
fixed within the source system and the warehouse will be "fixed" on the next run.
Let's implement this fix. We'll start by altering dimMovie so film_id is a normal integer. Currently it is
unsigned, meaning that no negative values are allowed. Run this command against your personal database:
CODE TO TYPE:
ALTER TABLE dimMovie modify film_id int not null;
Next we will add the special missing row to dimMovie. Run this command against your personal database:
CODE TO TYPE:
insert into dimMovie (movie_key, film_id, title, run_id ) values (-1, -1, '!!! MISSING MOVIE !!!', 1);
Let's check the row we just inserted to see what it looks like. Run this command against your personal
database:
CODE TO TYPE:
select * from dimMovie where movie_key=-1;
As long as you typed everything correctly you'll see this:
OBSERVE:
select * from dimMovie where movie_key=-1;
+-----------+---------+-----------------------+-------------+--------------+----------+-------------------+-----------------+--------+--------+------------------+--------+
| movie_key | film_id | title                 | description | release_year | language | original_language | rental_duration | length | rating | special_features | run_id |
+-----------+---------+-----------------------+-------------+--------------+----------+-------------------+-----------------+--------+--------+------------------+--------+
|        -1 |      -1 | !!! MISSING MOVIE !!! | NULL        |         NULL |          | NULL              |               0 |      0 |        | NULL             |      1 |
+-----------+---------+-----------------------+-------------+--------------+----------+-------------------+-----------------+--------+--------+------------------+--------+
1 row in set (0.05 sec)
mysql>
Everything looks good. Next, open your job in TOS, and double click on the tELTMysqlMap component for
factSales.
Change the join type on dm (dimMovie) from INNER JOIN to LEFT OUTER JOIN:

We can use MySQL's IFNULL function to handle rows where the join found no match. If the join failed,
dm.movie_key will be null, so the IFNULL function will return -1. If the join worked, dm.movie_key will have
a value, which will be returned by IFNULL.
Next, change the expression on the factSales output from dm.movie_key to IFNULL(dm.movie_key, -1):

Click OK, then save your job.
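
To see the LEFT OUTER JOIN and IFNULL combination in isolation, here is a small sketch. It assumes Python's built-in sqlite3 module as a stand-in for MySQL (SQLite also provides IFNULL), and the tables are simplified, hypothetical versions of dimMovie and stageFactSales; the title 'SOME MOVIE' and the sample rows are made up for illustration.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; both support IFNULL(value, fallback).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dimMovie (movie_key INTEGER PRIMARY KEY, film_id INTEGER, title TEXT);
    CREATE TABLE stageFactSales (payment_id INTEGER, film_id INTEGER, amount REAL);
    INSERT INTO dimMovie VALUES (-1, -1, '!!! MISSING MOVIE !!!'), (267, 300, 'SOME MOVIE');
    -- Payment 2 refers to a film_id that has no row in dimMovie.
    INSERT INTO stageFactSales VALUES (1, 300, 4.99), (2, 99999, 999.99);
""")

# The LEFT OUTER JOIN keeps unmatched staging rows; IFNULL maps them to key -1.
rows = conn.execute("""
    SELECT ss.payment_id, IFNULL(dm.movie_key, -1) AS movie_key
    FROM stageFactSales ss
    LEFT OUTER JOIN dimMovie dm ON (dm.film_id = ss.film_id)
    ORDER BY ss.payment_id
""").fetchall()

print(rows)
```

With an INNER JOIN, payment 2 would silently disappear from the result; with the outer join it is kept and pointed at the special missing-movie row.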


We don't have access to our source system, so we cannot insert test data into sakila. Instead, we can
temporarily modify our stored procedure to insert test data into stageFactSales. Run this command against
your personal database:
CODE TO TYPE:
DROP PROCEDURE etl_preFactSales;
DELIMITER //
CREATE PROCEDURE etl_preFactSales ()
BEGIN
ALTER TABLE stageFactSales add index(payment_id);
ALTER TABLE stageFactSales add index(payment_date);
ALTER TABLE stageFactSales add index(customer_id);
ALTER TABLE stageFactSales add index(film_id);
ALTER TABLE stageFactSales add index(store_id);
TRUNCATE TABLE factSales;
INSERT INTO stageFactSales values (-1, -1, '999.99', '2005-01-01', 11, 99999, 1, 1);
END
//
DELIMITER ;
After you make that change, run your job. You'll see the familiar output:

OBSERVE:
Starting job ProcessDataWarehouse at 14:24 23/12/2008.
Inserting with :
INSERT INTO factCustomerCount (SELECT null, dd.date_key , dc.customer_key , ds.store_key , 1, ss.run_id FROM stageFactCustomerCount ss INNER JOIN dimDate dd ON( dd.date = DATE(ss.create_date) ) INNER JOIN dimCustomer dc ON( dc.customer_id = ss.customer_id AND dc.start_date <= dd.date AND dc.end_date > dd.date ) INNER JOIN dimStore ds ON( ds.store_id = ss.store_id AND ds.start_date <= dd.date AND ds.end_date > dd.date ))
--> 589 rows inserted.
Inserting with :
INSERT INTO factSales (SELECT null, dd.date_key , dc.customer_key , IFNULL(dm.movie_key , -1), ds.store_key , dst.staff_key , ss.amount , ss.run_id FROM stageFactSales ss INNER JOIN dimDate dd ON( dd.date = DATE(ss.payment_date) ) LEFT OUTER JOIN dimMovie dm ON( dm.film_id = ss.film_id ) INNER JOIN dimCustomer dc ON( dc.customer_id = ss.customer_id AND dc.start_date <= dd.date AND dc.end_date > dd.date ) INNER JOIN dimStaff dst ON( dst.staff_id = ss.staff_id AND dst.start_date <= dd.date AND dst.end_date > dd.date ) INNER JOIN dimStore ds ON( ds.store_id = ss.store_id AND ds.start_date <= dd.date AND ds.end_date > dd.date ))
--> 15767 rows inserted.
Job ProcessDataWarehouse ended at 14:24 23/12/2008. [exit code=0]
Switch back to MySQL mode to see if your change worked. Run this command against your personal
database:
CODE TO TYPE:
select * from factSales where movie_key=-1;
If your job ran successfully, you'll see this:
OBSERVE:
mysql> select * from factSales where movie_key=-1;
+-----------+----------+--------------+-----------+-----------+-----------+--------------+--------+
| sales_key | date_key | customer_key | movie_key | store_key | staff_key | sales_amount | run_id |
+-----------+----------+--------------+-----------+-----------+-----------+--------------+--------+
|      6858 | 20050101 |          884 |        -1 |         1 |         1 |       999.99 |     -1 |
+-----------+----------+--------------+-----------+-----------+-----------+--------------+--------+
1 row in set (0.08 sec)
mysql>
It looks like your left join saved the day!

Aggregating
Data warehouses are built with the presumption that data needs to be aggregated in different ways. If the increment we
are using in our data warehouse is one day, and we query the warehouse to see sales in May, we need to SUM(Sales)
for each day in May.
This works for most types of facts, but what if the fact under consideration is an account balance? Account balances
are usually stored as point in time values. Take a look:

Date   Description       Payment  Deposit  Balance
01/01  Starting Balance                     592.20
01/02  Grocery Store       25.90            566.30
01/03  Computer Store      19.50            546.80
01/04  Consulting Work            1500.00  2046.80


The account balance on 01/03 is 546.80, and the account balance on 01/04 is 2046.80. If today is January 4th, the
account balance for the month of January is 2046.80, not 592.20 + 566.30 + 546.80 + 2046.80 = 3752.10. Likewise,
the account balance for 2008 is also 2046.80, not 3752.10.
The proper aggregate for an account balance is not SUM, it is LAST.
If we take a look at MySQL's group by functions, you'll notice there is MAX, MIN, and of course SUM; however, there is no
LAST or FIRST. This is because LAST and FIRST are not currently supported by MySQL.
Getting around this problem is tricky with MySQL. The simplest option is to use ORDER BY to sort the results by date in
descending order, so the most recent record will appear first, and to LIMIT our result to one row. A sample query to get
the last value from factSales would look like this:
CODE TO TYPE:
SELECT *
FROM factSales
ORDER BY date_key DESC
LIMIT 0, 1;

MySQL would return the last row:


OBSERVE:
mysql> SELECT *
FROM factSales
ORDER BY date_key DESC
LIMIT 0, 1;
+-----------+----------+--------------+-----------+-----------+-----------+--------------+--------+
| sales_key | date_key | customer_key | movie_key | store_key | staff_key | sales_amount | run_id |
+-----------+----------+--------------+-----------+-----------+-----------+--------------+--------+
|         7 | 20060214 |          591 |       267 |         1 |         1 |         4.99 |     76 |
+-----------+----------+--------------+-----------+-----------+-----------+--------------+--------+
1 row in set (0.06 sec)
mysql>
Other databases have extensions that make this type of query easier.
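
The difference between SUM and "LAST" is easy to check with the account balance example from this section. The sketch below is an assumption-laden illustration, not part of the course warehouse: it uses Python's built-in sqlite3 module, a hypothetical factBalance table, and balances stored as whole cents so the arithmetic stays exact.

```python
import sqlite3

# Assumption: factBalance is a hypothetical table; balances are stored in cents.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE factBalance (date_key INTEGER, balance_cents INTEGER);
    INSERT INTO factBalance VALUES
        (20080101, 59220), (20080102, 56630), (20080103, 54680), (20080104, 204680);
""")

# SUM adds up four point-in-time snapshots, which is not a meaningful balance.
summed = conn.execute("SELECT SUM(balance_cents) FROM factBalance").fetchone()[0]

# "LAST": sort newest first and keep a single row.
last = conn.execute(
    "SELECT balance_cents FROM factBalance ORDER BY date_key DESC LIMIT 1"
).fetchone()[0]

print(summed / 100, last / 100)
```

The summed value corresponds to the misleading 3752.10 from the text, while the ORDER BY ... DESC LIMIT 1 query returns the January 4th balance of 2046.80.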

Deaggregating Data
We already saw that aggregating certain types of data can pose some problems. In some situations, data may already
be aggregated, causing a different type of problem.
Suppose the DVD store started shipping packages. Shippers usually want to know the weight of packages, since it is
used to calculate shipping cost. If someone orders four DVDs at the same time, those four DVDs are combined into a
single package and sent to the customer. Their order may look something like this:
Title                Price
DADDY PITTSBURGH      9.99
TITANIC BOONDOCK      5.99
NEWTON LABYRINTH      5.99
APOLLO TEEN           9.99
Shipping & Handling   5.90
== Total ==          37.86

Our business user wants to know, for example, how much the shipping costs were for the movie APOLLO TEEN.
Looking at our data, we only know that it cost $5.90 to ship APOLLO TEEN along with DADDY PITTSBURGH, TITANIC
BOONDOCK, and NEWTON LABYRINTH.
Our business user uses this formula to calculate shipping costs:
Shipping on an item = Shipping & Handling / # of Items
With this formula in mind, we can calculate the shipping & handling on each individual item in the order:
Title                Price  Shipping
DADDY PITTSBURGH      9.99  5.90 / 4 = 1.475
TITANIC BOONDOCK      5.99  5.90 / 4 = 1.475
NEWTON LABYRINTH      5.99  5.90 / 4 = 1.475
APOLLO TEEN           9.99  5.90 / 4 = 1.475
Shipping & Handling   5.90  1.475 * 4 = 5.90
== Total ==          37.86

So now you might ask yourself, "How can shipping be 1.475? Shouldn't it be rounded?"

The answer to that question (like the shipping calculation itself) can only come from your business users.
Possible solutions might be to do nothing, meaning let the end users pick if and how they want to round the data, or to
implement and document a rounding algorithm.
So where and how would you implement this deaggregation? You would have to add a Transformation step to your
fact processing to calculate the shipping & handling value for each item.
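
Here is one way such a transformation step might look, sketched in Python. Both helper functions are hypothetical names invented for this illustration. The first applies the business user's formula as-is; the second is just one possible rounding rule (round down to the cent, then hand out the leftover cents one at a time), shown as an example of an algorithm you could document, not as the rule your business users will necessarily want.

```python
from decimal import Decimal, ROUND_DOWN

CENT = Decimal("0.01")

def allocate_evenly(total, n_items):
    """Shipping on an item = Shipping & Handling / # of Items (no rounding)."""
    return [total / n_items] * n_items

def allocate_rounded(total, n_items):
    """Hypothetical rounding rule: round down, then spread the leftover cents."""
    base = (total / n_items).quantize(CENT, rounding=ROUND_DOWN)
    leftover_cents = int((total - base * n_items) / CENT)
    return [base + CENT if i < leftover_cents else base for i in range(n_items)]

shipping = Decimal("5.90")
even = allocate_evenly(shipping, 4)
rounded = allocate_rounded(shipping, 4)

print(even)
print(rounded, sum(rounded))
```

The rounded allocation still adds up to the original 5.90, which keeps warehouse totals in line with the source system.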

Early Arriving Facts


The next type of problem you may encounter in your data warehouse is Early Arriving Facts, also known as
Late Arriving Dimensions.
Here's a sample scenario: On January 1st, a new customer gets in line to check out at the grocery
store. She decides to sign up for the store's discount card. She fills out the paperwork and is allowed to use the card
(card #1059259) for her purchase. On January 4th, the card finally arrives at the corporate office, where the information
is entered into the database. The card number is unique to a customer, so it forms the primary key in the database.
A timeline of the events would look like this:

The customer's purchases are entered into the data warehouse on the morning of January 2nd; however, no details
exist in the customer dimension (under card #1059259) until January 5th.
This type of situation is slightly different from Missing Keys, because the keys are not exactly missing, they're just late.
How do you deal with this type of situation? First, you may have to alter the rules on dimCustomer to allow NULLs in
most columns. The only data we know about "late" customers is the card number (because it is the primary key).
Next, change your fact process like so:
1. Go through every fact row and check whether the customer exists in dimCustomer (perhaps using the LEFT
JOIN/IS NULL technique from earlier in this lesson).
2. For every missing dimension row, insert a new row into the dimension, possibly using default values
such as "PENDING ACCOUNT" for first and last name.
3. Once we are certain the corresponding dimension row exists, add the fact row to the table.

Note

We assume that dimCustomer has already been processed and is up-to-date. If your dimension isn't
current, then a lot of fact data is going to appear to be late!
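
The three steps above can be sketched as follows. This is a simplified illustration under several assumptions: it uses Python's built-in sqlite3 module in place of MySQL, the table layouts are hypothetical (a card_number column, amounts in cents), the known customer and the amounts are made-up sample data, and the start and end dates of the slowly changing dimension are left out to keep the example short.

```python
import sqlite3

# Assumption: simplified, hypothetical versions of the dimension, staging and fact tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dimCustomer (
        customer_key INTEGER PRIMARY KEY AUTOINCREMENT,
        card_number INTEGER, first_name TEXT, last_name TEXT,
        address TEXT, phone TEXT);
    CREATE TABLE stageFactSales (card_number INTEGER, amount_cents INTEGER);
    CREATE TABLE factSales (customer_key INTEGER, amount_cents INTEGER);
    INSERT INTO dimCustomer (card_number, first_name, last_name)
        VALUES (1000001, 'Known', 'Customer');
    INSERT INTO stageFactSales VALUES (1000001, 1250), (1059259, 4310);
""")

# Steps 1 and 2: find card numbers with no dimension row, insert placeholder rows.
conn.execute("""
    INSERT INTO dimCustomer (card_number, first_name, last_name)
    SELECT DISTINCT ss.card_number, 'PENDING', 'ACCOUNT'
    FROM stageFactSales ss
    LEFT JOIN dimCustomer dc ON (dc.card_number = ss.card_number)
    WHERE dc.customer_key IS NULL
""")

# Step 3: every staged fact now has a dimension row, so an INNER JOIN loses nothing.
conn.execute("""
    INSERT INTO factSales
    SELECT dc.customer_key, ss.amount_cents
    FROM stageFactSales ss
    INNER JOIN dimCustomer dc ON (dc.card_number = ss.card_number)
""")

pending = conn.execute(
    "SELECT card_number, first_name, last_name, address FROM dimCustomer "
    "WHERE first_name = 'PENDING'").fetchall()
loaded = conn.execute("SELECT COUNT(*) FROM factSales").fetchone()[0]
print(pending, loaded)
```

The placeholder row keeps the address and phone as NULL, which is why the dimension has to allow NULLs in most columns.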

After January 1st, dimCustomer would look something like this:

Card #   First Name  Last Name  Address  Phone  Start Date  End Date
1059259  PENDING     ACCOUNT    NULL     NULL   01-Jan      31-DEC-2099

So, what happens on January 5th, when the dimension data finally catches up to the fact data?
Nothing.
Our existing dimension process will see the customer's details, and update the dimension accordingly. After January
5th, there will be two rows in the database for card #1059259:
Card #   First Name  Last Name  Address        Phone     Start Date  End Date
1059259  PENDING     ACCOUNT    NULL           NULL      01-Jan      05-Jan
1059259  Sue         Shopper    519 Wells St.  555-2592  05-Jan      31-DEC-2099

This reflects history, exactly as it happened. Between January 1st and January 5th we didn't know the details for the
customer with card #1059259, and after January 5th we did.
We covered a lot of information in this lesson! Good job. In the next lesson we'll examine some common queries that people
run against data warehouses. See you there!

Copyright 1998-2013 O'Reilly Media, Inc.

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
See http://creativecommons.org/licenses/by-sa/3.0/legalcode for more information.

Querying a Relational Data Warehouse


DBA 3: Data Warehousing Lesson 13
We've spent most of our time implementing our data warehouse, but we still haven't really seen the fruits of our labor. In this
lesson we'll examine basic SQL queries you can use to unlock information stored in the warehouse.

Note

In the next course, DBA 4, you'll learn all about using MDX (Multidimensional Expressions) to query data
warehouses, so stay tuned!

Viewing Data
In the first lesson, we discussed the goals of a data warehouse. We want to create:
a separate system that won't interrupt business-critical operational systems.
a single point of access for all analytical queries.
a unified, consistent view of underlying data (even data from external systems).
a straightforward way to analyze trends (to see the way sales compare from month to month).
Our current structure of dimensions and facts is consistent and straightforward, but we can do better. Columns like
run_id are not important to end users, and may even be confusing to them. Tables like etlRuns don't matter to
anyone except database administrators, so they should be hidden from end users. Database views let us present a
simpler picture of the warehouse without changing the underlying tables.
Ease of use aside, views also provide other benefits:
Some information in the data warehouse may be very sensitive, so views can be used to provide row-level
security.
Views can be used to provide consistency to the data warehouse, especially as underlying tables grow in
complexity and undergo changes.
Columns can be renamed to make things easier to understand.
For our views, we will:
name them with a common prefix - Fact_ for fact tables, and Dimension_ for dimension tables.
omit surrogate keys from fact tables (such as sales_key).
omit start_date and end_date from Type-2 slowly changing dimensions.
omit run_id from all tables.
keep foreign key columns (the _key columns) unchanged.
keep keys from source systems (like customer_id) unchanged.
rename fact and dimension columns to more readable equivalents (for example, customer_count would
become Customer Count).
Let's get started and create a view for our factCustomerCount table. In this view we will omit the customerCount_key
and run_id. Switch to MySQL mode, and run this command against your personal database:
CODE TO TYPE:
CREATE VIEW Fact_CustomerCount
AS
SELECT date_key, customer_key, store_key, customer_count as `Customer Count`
FROM factCustomerCount;

If you typed everything correctly, you'll see these results:

OBSERVE:
mysql> CREATE VIEW Fact_CustomerCount
-> AS
-> SELECT date_key, customer_key, store_key, customer_count as `Customer Count`
-> FROM factCustomerCount;
Query OK, 0 rows affected (0.06 sec)
mysql>
It all looks good so far. To test out this view, let's check out the first ten rows. Run this command against your
personal database:
CODE TO TYPE:
SELECT * from Fact_CustomerCount
LIMIT 0, 10;

This is easier for end users to understand:


OBSERVE:
mysql> SELECT * from Fact_CustomerCount
-> LIMIT 0, 10;
+----------+--------------+-----------+----------------+
| date_key | customer_key | store_key | Customer Count |
+----------+--------------+-----------+----------------+
| 20040319 |          590 |         1 |              1 |
| 20041007 |          591 |         1 |              1 |
| 20040811 |          593 |         1 |              1 |
| 20040124 |          595 |         1 |              1 |
| 20040531 |          596 |         1 |              1 |
| 20040115 |          599 |         1 |              1 |
| 20040205 |          600 |         1 |              1 |
| 20040921 |          602 |         1 |              1 |
| 20040427 |          604 |         1 |              1 |
| 20041114 |          605 |         1 |              1 |
+----------+--------------+-----------+----------------+
10 rows in set (0.08 sec)
Next, let's create a view for factSales. In this view, we will omit the sales_key and run_id columns. Run this
command against your personal database:
CODE TO TYPE:
CREATE VIEW Fact_Sales
AS
SELECT date_key, customer_key, movie_key, store_key, staff_key, sales_amount as `Sales Amount`
FROM factSales;

If you typed everything correctly, you'll see this familiar MySQL result: Query OK, 0 rows affected (0.13 sec).
With a couple of facts out of the way, let's turn our attention to dimensions. Start with dimCustomer. Run this
command against your personal database:

CODE TO TYPE:
CREATE VIEW Dimension_Customer
AS
SELECT customer_key, customer_id, first_name as `First Name`, last_name as `Last Name`,
Email, Address, address2 as `Address 2`,
District, City, Country, postal_code as `Postal Code`,
Phone, Active, create_date as `Create Date`
FROM dimCustomer;
Great! Let's move on to dimDate. Run this command against your personal database:
CODE TO TYPE:
CREATE VIEW Dimension_Date
AS
SELECT date_key, Date, Year, Quarter, Month, month_name as `Month Name`, Day,
day_name as `Day of Week`, week as `Week In Year`,
is_weekend as `Is Weekend`, is_holiday as `Is Holiday`
FROM dimDate;
The next view we'll create is for dimMovie. Run this command against your personal database:
CODE TO TYPE:
CREATE VIEW Dimension_Movie
AS
SELECT movie_key, film_id, Title, Description, release_year as `Release Year`,
Language, original_language as `Original Language`,
rental_duration as `Rental Duration`, Length,
Rating, special_features as `Special Features`
FROM dimMovie;
The last view we'll create for now is for the dimStore table. Run this command against your personal database:
CODE TO TYPE:
CREATE VIEW Dimension_Store
AS
SELECT store_key, store_id, Address, address2 as `Address 2`, District,
City, Country, postal_code as `Postal Code`, Region,
manager_first_name as `Manager First Name`,
manager_last_name as `Manager Last Name`
FROM dimStore;
Great! We're ready to get started with our queries!
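
If you want to convince yourself that a view really hides the housekeeping columns, a quick check like the one below can help. It is a sketch that assumes Python's built-in sqlite3 module as a stand-in for MySQL (so the alias is quoted with double quotes rather than backticks), and the factCustomerCount table here is a cut-down copy with a single made-up row.

```python
import sqlite3

# Assumption: a cut-down factCustomerCount in SQLite stands in for the real table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE factCustomerCount (
        customerCount_key INTEGER PRIMARY KEY, date_key INTEGER, customer_key INTEGER,
        store_key INTEGER, customer_count INTEGER, run_id INTEGER);
    INSERT INTO factCustomerCount VALUES (1, 20040319, 590, 1, 1, 76);
    CREATE VIEW Fact_CustomerCount AS
        SELECT date_key, customer_key, store_key, customer_count AS "Customer Count"
        FROM factCustomerCount;
""")

# The cursor description lists the column names an end user would see.
cursor = conn.execute("SELECT * FROM Fact_CustomerCount")
columns = [column[0] for column in cursor.description]
print(columns)
```

Neither customerCount_key nor run_id appears in the list, and customer_count shows up under its friendlier name.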

Answering Questions
We'll use a standard template for querying the data warehouse. The template below shows where fact columns and
dimension columns belong. It isn't necessary to use all parts of our template; for example, we can leave out the
WHERE clause if we aren't interested in limiting our query.

OBSERVE:
SELECT columns from dimension tables,
SUM( fact columns )
FROM fact view
INNER JOIN dimension view 1 on (fact column = dimension column )
INNER JOIN dimension view 2 on (fact column = dimension column )
.
.
WHERE Limits to dimensions
limits to facts
GROUP BY dimension columns
ORDER BY dimension columns, fact columns
LIMIT 0, 5 (optional "top 5" results)

In lesson two we discussed questions that would be posed by management. We rewrote these questions so that they
were in the format of facts and dimensions. Now let's try to answer some of them!
First up: How many new customers did we add by quarter?
To answer this question, we'll need data from the Fact_CustomerCount and Dimension_Date views. Let's write a
query using the template we already created. Although our question does not specify a particular sorting order, we'll
sort by Quarter. Run this command against your personal database:
CODE TO TYPE:
SELECT dd.Quarter,
SUM( `Customer Count` ) as `Customer Count`
FROM Fact_CustomerCount fc
INNER JOIN Dimension_Date dd on (fc.date_key = dd.date_key)
GROUP BY dd.Quarter
ORDER BY dd.Quarter;
MySQL will reply with your answer:
OBSERVE:
mysql> SELECT dd.Quarter,
-> SUM( `Customer Count` ) as `Customer Count`
-> FROM Fact_CustomerCount fc
-> INNER JOIN Dimension_Date dd on (fc.date_key = dd.date_key)
-> GROUP BY dd.Quarter
-> ORDER BY dd.Quarter;
+---------+----------------+
| Quarter | Customer Count |
+---------+----------------+
| Q1      |            158 |
| Q2      |            140 |
| Q3      |            143 |
| Q4      |            148 |
+---------+----------------+
4 rows in set (0.06 sec)
mysql>
That's pretty slick! We didn't even have to figure out the specific quarter each customer registered.

Note

If you see an error that looks like this: ERROR 1305 (42000): FUNCTION certjosh.SUM does not
exist, make sure you do not have any spaces between SUM and the opening parenthesis. SUM(column) works
in MySQL, but SUM (column) will return an error.

Now suppose we want to extend this query, so it answers this question: How many new customers did we add
by quarter and by month? Let's go back to our template, and add a new column. Run this command against your
personal database:


CODE TO TYPE:
SELECT dd.Quarter, dd.`Month Name`,
SUM( `Customer Count` ) as `Customer Count`
FROM Fact_CustomerCount fc
INNER JOIN Dimension_Date dd on (fc.date_key = dd.date_key)
GROUP BY dd.Quarter, dd.`Month Name`
ORDER BY dd.Quarter, dd.`Month Name`;
It looks like the database gave us exactly what we asked for, even if it isn't exactly what we wanted:
OBSERVE:
mysql> SELECT dd.Quarter, dd.`Month Name`,
-> SUM( `Customer Count` ) as `Customer Count`
-> FROM Fact_CustomerCount fc
-> INNER JOIN Dimension_Date dd on (fc.date_key = dd.date_key)
-> GROUP BY dd.Quarter, dd.`Month Name`
-> ORDER BY dd.Quarter, dd.`Month Name`;
+---------+------------+----------------+
| Quarter | Month Name | Customer Count |
+---------+------------+----------------+
| Q1      | February   |             47 |
| Q1      | January    |             54 |
| Q1      | March      |             57 |
| Q2      | April      |             44 |
| Q2      | June       |             51 |
| Q2      | May        |             45 |
| Q3      | August     |             51 |
| Q3      | July       |             46 |
| Q3      | September  |             46 |
| Q4      | December   |             49 |
| Q4      | November   |             51 |
| Q4      | October    |             48 |
+---------+------------+----------------+
12 rows in set (0.06 sec)
Let's try ordering the months by number instead of by name, so January would occur before February in our results.
Run this command against your personal database:
CODE TO TYPE:
SELECT dd.Quarter, dd.`Month Name`,
SUM( `Customer Count` ) as `Customer Count`
FROM Fact_CustomerCount fc
INNER JOIN Dimension_Date dd on (fc.date_key = dd.date_key)
GROUP BY dd.Quarter, dd.`Month Name`
ORDER BY dd.Quarter, dd.`Month`;
Now that's more like it!

OBSERVE:
mysql> SELECT dd.Quarter, dd.`Month Name`,
-> SUM( `Customer Count` ) as `Customer Count`
-> FROM Fact_CustomerCount fc
-> INNER JOIN Dimension_Date dd on (fc.date_key = dd.date_key)
-> GROUP BY dd.Quarter, dd.`Month Name`
-> ORDER BY dd.Quarter, dd.`Month`;
+---------+------------+----------------+
| Quarter | Month Name | Customer Count |
+---------+------------+----------------+
| Q1      | January    |             54 |
| Q1      | February   |             47 |
| Q1      | March      |             57 |
| Q2      | April      |             44 |
| Q2      | May        |             45 |
| Q2      | June       |             51 |
| Q3      | July       |             46 |
| Q3      | August     |             51 |
| Q3      | September  |             46 |
| Q4      | October    |             48 |
| Q4      | November   |             51 |
| Q4      | December   |             49 |
+---------+------------+----------------+
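
The effect of sorting by month number instead of month name can be reproduced on a tiny dataset. The sketch below assumes Python's built-in sqlite3 module in place of MySQL, uses plain tables with simplified column names (quarter, month, month_name) instead of the course's views, and contains made-up rows. It also groups by the month number as well as the month name; that is a choice made for this sketch, since some stricter SQL modes may reject ordering by a column that is not part of the GROUP BY.

```python
import sqlite3

# Assumption: simplified stand-ins for Dimension_Date and Fact_CustomerCount.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Dimension_Date (
        date_key INTEGER PRIMARY KEY, quarter TEXT, month INTEGER, month_name TEXT);
    CREATE TABLE Fact_CustomerCount (date_key INTEGER, customer_count INTEGER);
    INSERT INTO Dimension_Date VALUES
        (20040115, 'Q1', 1, 'January'), (20040205, 'Q1', 2, 'February'),
        (20040319, 'Q1', 3, 'March'), (20040427, 'Q2', 4, 'April');
    INSERT INTO Fact_CustomerCount VALUES
        (20040115, 1), (20040205, 1), (20040205, 1), (20040319, 1), (20040427, 1);
""")

query = """
    SELECT dd.quarter, dd.month_name, SUM(fc.customer_count)
    FROM Fact_CustomerCount fc
    INNER JOIN Dimension_Date dd ON (fc.date_key = dd.date_key)
    GROUP BY dd.quarter, dd.month, dd.month_name
    ORDER BY dd.quarter, {order_column}
"""

# Same aggregation, two different sort keys.
by_name = conn.execute(query.format(order_column="dd.month_name")).fetchall()
by_number = conn.execute(query.format(order_column="dd.month")).fetchall()
print(by_name)
print(by_number)
```

Sorting by name puts February ahead of January; sorting by the month number restores calendar order without changing any of the totals.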
Okay, let's move on to a new question: What was the amount of sales revenue we earned, by store and by
month? To answer this question, we'll need to use the Fact_Sales, Dimension_Store, and Dimension_Date
views. This time we will try an alternate GROUP BY syntax; we'll specify the columns by position instead of by name.
For this query, 1 is the first column, ds.Address, and 2 is the second column, dd.`Month Name`. Run this command
against your personal database:
CODE TO TYPE:
SELECT ds.Address, dd.`Month Name`,
SUM( `Sales Amount` ) as `Sales Amount`
FROM Fact_Sales fs
INNER JOIN Dimension_Store ds on (fs.store_key = ds.store_key)
INNER JOIN Dimension_Date dd on (fs.date_key = dd.date_key )
GROUP BY 1, 2
ORDER BY ds.Address, dd.`Month`;
Once again, our warehouse answers our questions right away:

OBSERVE:
mysql> SELECT ds.Address, dd.`Month Name`,
-> SUM( `Sales Amount` ) as `Sales Amount`
-> FROM Fact_Sales fs
-> INNER JOIN Dimension_Store ds on (fs.store_key = ds.store_key)
-> INNER JOIN Dimension_Date dd on (fs.date_key = dd.date_key )
-> GROUP BY 1, 2
-> ORDER BY ds.Address, dd.`Month`;
+--------------------+------------+--------------+
| Address            | Month Name | Sales Amount |
+--------------------+------------+--------------+
| 28 MySQL Boulevard | February   |       270.09 |
| 28 MySQL Boulevard | May        |      2328.30 |
| 28 MySQL Boulevard | June       |      4829.30 |
| 28 MySQL Boulevard | July       |     13873.70 |
| 28 MySQL Boulevard | August     |     11910.70 |
| 47 MySakila Drive  | January    |       999.99 |
| 47 MySakila Drive  | February   |       238.11 |
| 47 MySakila Drive  | May        |      2418.35 |
| 47 MySakila Drive  | June       |      4640.01 |
| 47 MySakila Drive  | July       |     14020.33 |
| 47 MySakila Drive  | August     |     11740.45 |
+--------------------+------------+--------------+
11 rows in set (0.70 sec)

Note

The sakila database is a random set of data. That's the reason there were no sales for "47 MySakila
Drive" in March.

Now suppose we want to find out the top five sales, by store and by month. Let's give it a try! Run this command
against your personal database:
CODE TO TYPE:
SELECT ds.Address, dd.`Month Name`,
SUM( `Sales Amount` ) as `Sales Amount`
FROM Fact_Sales fs
INNER JOIN Dimension_Store ds on (fs.store_key = ds.store_key)
INNER JOIN Dimension_Date dd on (fs.date_key = dd.date_key )
GROUP BY ds.Address, dd.`Month Name`
ORDER BY 1, 2
LIMIT 0, 5;

Once again, MySQL answered our questions, but it isn't exactly the information we want:

OBSERVE:
mysql> SELECT ds.Address, dd.`Month Name`,
-> SUM( `Sales Amount` ) as `Sales Amount`
-> FROM Fact_Sales fs
-> INNER JOIN Dimension_Store ds on (fs.store_key = ds.store_key)
-> INNER JOIN Dimension_Date dd on (fs.date_key = dd.date_key )
-> GROUP BY ds.Address, dd.`Month Name`
-> ORDER BY 1, 2
-> LIMIT 0, 5;
+--------------------+------------+--------------+
| Address            | Month Name | Sales Amount |
+--------------------+------------+--------------+
| 28 MySQL Boulevard | August     |     11910.70 |
| 28 MySQL Boulevard | February   |       270.09 |
| 28 MySQL Boulevard | July       |     13873.70 |
| 28 MySQL Boulevard | June       |      4829.30 |
| 28 MySQL Boulevard | May        |      2328.30 |
+--------------------+------------+--------------+
5 rows in set (0.39 sec)
We retrieved five results, but not the top five: we sorted by store and month name instead of by Sales Amount in
descending order. Run this command against your personal database:
CODE TO TYPE:
SELECT ds.Address, dd.`Month Name`,
SUM( `Sales Amount` ) as `Sales Amount`
FROM Fact_Sales fs
INNER JOIN Dimension_Store ds on (fs.store_key = ds.store_key)
INNER JOIN Dimension_Date dd on (fs.date_key = dd.date_key )
GROUP BY ds.Address, dd.`Month Name`
ORDER BY 3 DESC, 1, 2
LIMIT 0, 5;

The results look much better now:


OBSERVE:
mysql> SELECT ds.Address, dd.`Month Name`,
-> SUM( `Sales Amount` ) as `Sales Amount`
-> FROM Fact_Sales fs
-> INNER JOIN Dimension_Store ds on (fs.store_key = ds.store_key)
-> INNER JOIN Dimension_Date dd on (fs.date_key = dd.date_key )
-> GROUP BY ds.Address, dd.`Month Name`
-> ORDER BY 3 DESC, 1, 2
-> LIMIT 0, 5;
+--------------------+------------+--------------+
| Address            | Month Name | Sales Amount |
+--------------------+------------+--------------+
| 47 MySakila Drive  | July       |     14020.33 |
| 28 MySQL Boulevard | July       |     13873.70 |
| 28 MySQL Boulevard | August     |     11910.70 |
| 47 MySakila Drive  | August     |     11740.45 |
| 28 MySQL Boulevard | June       |      4829.30 |
+--------------------+------------+--------------+
5 rows in set (0.38 sec)

Problems with Queries


So far we've written several queries against our data warehouse. The biggest problem we've run into so far was that
some results were not ordered correctly. So what else could go wrong?

Bad Joins

There is another, more serious problem that may take place in our data warehouse - bad joins.
Suppose you come into the office one day, and are asked to answer a question we've seen many times
before: How many new customers did we add by quarter? Let's approach this question again. Run this
command against your personal database:
CODE TO TYPE:
SELECT dd.Quarter,
SUM( `Customer Count` ) as `Customer Count`
FROM Fact_CustomerCount fc
INNER JOIN Dimension_Date dd on (fc.date_key = fc.date_key)
GROUP BY dd.Quarter
ORDER BY dd.Quarter;
You probably noticed right away that something was wrong. The query takes a very long time to return
results, and when it finally does, they look really strange:
OBSERVE:
mysql> SELECT dd.Quarter,
-> SUM( `Customer Count` ) as `Customer Count`
-> FROM Fact_CustomerCount fc
-> INNER JOIN Dimension_Date dd on (fc.date_key = fc.date_key)
-> GROUP BY dd.Quarter
-> ORDER BY dd.Quarter;
+---------+----------------+
| Quarter | Customer Count |
+---------+----------------+
| Q1      |        2711167 |
| Q2      |        2733549 |
| Q3      |        2763588 |
| Q4      |        2763588 |
+---------+----------------+
4 rows in set (14.04 sec)
Compare these results to the results we calculated previously:
OBSERVE:
+---------+----------------+
| Quarter | Customer Count |
+---------+----------------+
| Q1      |            158 |
| Q2      |            140 |
| Q3      |            143 |
| Q4      |            148 |
+---------+----------------+
So what happened here? It was a bad join. Instead of writing (fc.date_key = fc.date_key) for our join
criteria, we should have written (fc.date_key = dd.date_key).
The condition fc.date_key = fc.date_key compares a column to itself, so it is true for every fact row (as long
as date_key is not NULL) and says nothing about how Fact_CustomerCount joins to Dimension_Date. This caused
MySQL to return the cartesian product of those two tables instead of the properly joined results.

The real danger behind bad joins is that they can often go undetected. This example is extreme - our
business users would likely know there is a problem with the query, since the results for Q1 are over 17,000
times the actual value. That is pretty far off! But what if our company typically added 3,000,000 new customers
in a quarter? Then 2,711,167 wouldn't seem so far off at all.
The best way to prevent bad joins is to have several people review each query written against the data
warehouse. No query tool can reliably tell you if your join is bad, or if your query is otherwise written incorrectly.
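
Alongside peer review, one simple sanity check is to compare the total you get through the join with the total taken straight from the fact table. When every fact row matches exactly one dimension row, the two should agree. The sketch below illustrates the idea under stated assumptions: it uses Python's built-in sqlite3 module rather than MySQL, plain tables named after the course's views, three made-up rows, and a hypothetical helper called total_via_join.

```python
import sqlite3

# Assumption: tiny stand-ins for Dimension_Date and Fact_CustomerCount.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Dimension_Date (date_key INTEGER PRIMARY KEY, quarter TEXT);
    CREATE TABLE Fact_CustomerCount (date_key INTEGER, customer_count INTEGER);
    INSERT INTO Dimension_Date VALUES (20040115, 'Q1'), (20040427, 'Q2'), (20040811, 'Q3');
    INSERT INTO Fact_CustomerCount VALUES (20040115, 1), (20040427, 1), (20040811, 1);
""")

def total_via_join(join_condition):
    """Hypothetical helper: total customer count using the given join condition."""
    sql = ("SELECT SUM(fc.customer_count) FROM Fact_CustomerCount fc "
           "INNER JOIN Dimension_Date dd ON (" + join_condition + ")")
    return conn.execute(sql).fetchone()[0]

# Baseline: the total with no join at all.
fact_total = conn.execute(
    "SELECT SUM(customer_count) FROM Fact_CustomerCount").fetchone()[0]
good_total = total_via_join("fc.date_key = dd.date_key")
bad_total = total_via_join("fc.date_key = fc.date_key")

print(fact_total, good_total, bad_total)
```

With the correct join the totals match. With the bad join every fact row is paired with every date row, so the total is inflated by the number of rows in the date dimension.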

Incorrect Filtering
Now suppose your boss wants to know which movies had sales greater than $10.00. You sit down at your
desk and write a quick query to answer the question. Run this command against your personal
database:
CODE TO TYPE:
SELECT dm.Title,
SUM( `Sales Amount` ) as `Sales Amount`
FROM Fact_Sales fs
INNER JOIN Dimension_Movie dm on (fs.movie_key = dm.movie_key)
WHERE fs.`Sales Amount` > 10
GROUP BY 1;
It looks like fifty movies have had sales greater than $10.00:

OBSERVE:
mysql> SELECT dm.Title,
-> SUM( `Sales Amount` ) as `Sales Amount`
-> FROM Fact_Sales fs
-> INNER JOIN Dimension_Movie dm on (fs.movie_key = dm.movie_key)
-> WHERE fs.`Sales Amount` > 10
-> GROUP BY 1;
+---------------------------+--------------+
| Title                     | Sales Amount |
+---------------------------+--------------+
| !!! MISSING MOVIE !!!     |       999.99 |
| AMERICAN CIRCUS           |        43.96 |
| AUTUMN CROW               |        10.99 |
| BACKLASH UNDEFEATED       |        10.99 |
| BEAST HUNCHBACK           |        21.98 |
| BEHAVIOR RUNAWAY          |        21.98 |
| BILKO ANONYMOUS           |        43.96 |
| BRIGHT ENCOUNTERS         |        10.99 |
| CARIBBEAN LIBERTY         |        43.96 |
| CASUALTIES ENCINO         |        10.99 |
| DAUGHTER MADIGAN          |        10.99 |
| DOORS PRESIDENT           |        10.99 |
| FLASH WARS                |        10.99 |
| FLINTSTONES HAPPINESS     |        44.96 |
| FOOL MOCKINGBIRD          |        32.97 |
| GARDEN ISLAND             |        10.99 |
| HUSTLER PARTY             |        32.97 |
| INNOCENT USUAL            |        21.98 |
| ISHTAR ROCKETEER          |        10.99 |
| KING EVOLUTION            |        21.98 |
| KISSING DOLLS             |        43.96 |
| MAIDEN HOME               |        21.98 |
| MIDSUMMER GROUNDHOG       |        22.98 |
| MINDS TRUMAN              |        32.97 |
| MINE TITANS               |        55.95 |
| NIGHTMARE CHILL           |        10.99 |
| PANIC CLUB                |        10.99 |
| PATHS CONTROL             |        10.99 |
| PINOCCHIO SIMON           |        10.99 |
| RANGE MOONWALKER          |        21.98 |
| SATISFACTION CONFIDENTIAL |        10.99 |
| SATURDAY LAMBS            |        54.95 |
| SCORPION APOLLO           |        23.98 |
| SECRETS PARADISE          |        10.99 |
| SHOW LORD                 |        22.98 |
| STING PERSONAL            |        33.97 |
| STRANGER STRANGERS        |        10.99 |
| SUIT WALLS                |        21.98 |
| SUNRISE LEAGUE            |        21.98 |
| TEEN APOLLO               |        21.98 |
| TELEGRAPH VOYAGE          |        65.94 |
| TIES HUNGER               |        11.99 |
| TITANIC BOONDOCK          |        21.98 |
| TORQUE BOUND              |        43.96 |
| TRAP GUYS                 |        33.97 |
| TYCOON GATHERING          |        43.96 |
| VIRTUAL SPOILERS          |        44.96 |
| WIFE TURN                 |        43.96 |
| WONDERLAND CHRISTMAS      |        10.99 |
| ZORRO ARK                 |        10.99 |
+---------------------------+--------------+
50 rows in set (0.10 sec)
At first glance this answer appears to be correct, but is this result exactly what we wanted? Not quite. Let's take
a look at the query again, with an English translation for each line:

OBSERVE:
SELECT dm.Title,                                               -- Show the movie title
SUM( `Sales Amount` ) as `Sales Amount`                        -- and total sales per movie
FROM Fact_Sales fs                                             -- from Fact_Sales
INNER JOIN Dimension_Movie dm on (fs.movie_key = dm.movie_key) -- and from Dimension_Movie
WHERE fs.`Sales Amount` > 10                                   -- where Sales Amount in Fact_Sales is greater than 10
GROUP BY 1;
Instead of returning movies with total sales greater than $10.00, we have returned movies with one-time sales
greater than $10.00, and summed only those individual sales. We filtered our data incorrectly.
So how do we fix this error? One way would be to use a sub-query. We'll calculate the total sales for
each movie, then limit those results to Sales Amount > 10. Run this command against your personal
database:
CODE TO TYPE:
SELECT Title, `Sales Amount`
FROM (SELECT dm.Title,
SUM( `Sales Amount` ) as `Sales Amount`
FROM Fact_Sales fs
INNER JOIN Dimension_Movie dm on (fs.movie_key = dm.movie_key)
GROUP BY 1
) as subQuery
WHERE `Sales Amount` > 10;
This query returns many more movies in the result. It looks like nearly every movie has generated sales
greater than $10.00.
OBSERVE:
+-----------------------------+--------------+
| Title                       | Sales Amount |
+-----------------------------+--------------+
| !!! MISSING MOVIE !!!       |       999.99 |
| ACADEMY DINOSAUR            |        35.78 |
| ACE GOLDFINGER              |        52.93 |
| ADAPTATION HOLES            |        32.89 |
| AFFAIR PREJUDICE            |        91.77 |
...lines omitted...
| WRATH MILE                  |        23.86 |
| WRONG BEHAVIOR              |        62.80 |
| WYOMING STORM               |        72.87 |
| YENTL IDAHO                 |       130.78 |
| YOUTH KICK                  |        12.95 |
| ZHIVAGO CORE                |        14.91 |
| ZOOLANDER FICTION           |        67.84 |
| ZORRO ARK                   |       214.69 |
+-----------------------------+--------------+
947 rows in set (0.74 sec)
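The sub-query is one way to filter on the total. SQL also offers a HAVING clause, which filters groups after aggregation, whereas WHERE filters individual rows before aggregation. The sketch below compares the two using Python's built-in sqlite3 module. The movie titles, the amounts, and the sales_cents column (whole cents, to avoid floating-point rounding) are made up for illustration and are not the course data.

```python
import sqlite3

# In-memory database with made-up movies and sales, stored in whole cents.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Dimension_Movie (movie_key INTEGER PRIMARY KEY, Title TEXT);
    CREATE TABLE Fact_Sales (movie_key INTEGER, sales_cents INTEGER);
    INSERT INTO Dimension_Movie VALUES (1, 'ALPHA'), (2, 'BETA'), (3, 'GAMMA');
    INSERT INTO Fact_Sales VALUES
        (1, 1099), (1, 499),            -- ALPHA: one sale over $10.00, total $15.98
        (2, 499), (2, 499), (2, 499),   -- BETA: no sale over $10.00, total $14.97
        (3, 299);                       -- GAMMA: total $2.99
""")

# WHERE runs before grouping: only individual sales over $10.00 are summed.
where_filtered = conn.execute("""
    SELECT dm.Title, SUM(fs.sales_cents)
    FROM Fact_Sales fs
    INNER JOIN Dimension_Movie dm ON (fs.movie_key = dm.movie_key)
    WHERE fs.sales_cents > 1000
    GROUP BY dm.Title
    ORDER BY dm.Title
""").fetchall()

# HAVING runs after grouping: each movie's total is compared to $10.00.
having_filtered = conn.execute("""
    SELECT dm.Title, SUM(fs.sales_cents)
    FROM Fact_Sales fs
    INNER JOIN Dimension_Movie dm ON (fs.movie_key = dm.movie_key)
    GROUP BY dm.Title
    HAVING SUM(fs.sales_cents) > 1000
    ORDER BY dm.Title
""").fetchall()

print("WHERE: ", where_filtered)
print("HAVING:", having_filtered)
```

In this made-up data, the WHERE version misses BETA entirely and understates ALPHA's total, which is the same mistake the first query in this section makes.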
This type of problem is difficult to spot, especially when our first query seems to work. It's important to have
peers review queries to make sure everything is written correctly.
As usual, we've covered a lot in this lesson! You're doing really great, and you're nearly done with this course. The next lesson
will be a description of your final project. See you then!

Copyright 1998-2013 O'Reilly Media, Inc.

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
See http://creativecommons.org/licenses/by-sa/3.0/legalcode for more information.

Final Project
DBA 3: Data Warehousing Lesson 14
Northwind Traders
Northwind Traders is a sample database that Microsoft distributed with its Access product. It's an old database (the
newest dates in it are from 1996), but is a great example of database design.
This project uses an SQLite version of the Northwind Traders database. Unlike MySQL, SQLite packages an entire
database (tables, views, indexes, etc.) into a single file.
Here's a diagram of the tables from the database that get used in this project:

For your final project, you'll use Northwind Traders as a data source to design, implement, and populate a data
warehouse.
You are required to implement these dimensions:
Date
Employees
Customers
Suppliers
Products
Orders
and these facts:
Order Unit Price
Supplier Unit Price

A read-only copy of the Northwind database is available on the server:
OBSERVE:
C:/talend_files/Nwind.db

Note
If you type in the file name, be sure to use forward slashes instead of backslashes:
C:/talend_files/Nwind.db

Instead of using MySQL components as the source for your data, you'll need to use the SQLite components. You can
use the SQL Builder tool in TOS to examine the tables in Northwind and see the various data types of columns. To use
the SQL Builder, click the button to the right of the query box for tSQLiteInput.

Your data warehouse will be located in your existing MySQL database. To distinguish tables for your final project from
other tables, use the prefix fp_ for your table names. For example, you might name your date dimension fp_dimDate.

fp_dimDate
Your date dimension should be called fp_dimDate.
The dates in Northwind Traders range from 1994 to 1996. Make sure your date dimension has all of the
required dates in it. You can use the file c:/talend_files/NwindDates.xls to load your date dimension if
you like. This date dimension is not the same as the one used in the class. Its columns are: date,
is_weekend, year, quarter, month, and day.
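If you want to double-check that your date dimension covers every required date, one option is to generate the expected rows yourself and compare counts. The sketch below builds one row per calendar day from 1994 through 1996 with the six columns listed above. The value formats are assumptions (ISO date strings, a boolean for is_weekend, and quarters written as 'Q1' through 'Q4'), and the build_date_rows helper is hypothetical; the spreadsheet may store these values differently.

```python
from datetime import date, timedelta


def build_date_rows(first, last):
    """Return one dict per calendar day from first to last, inclusive."""
    rows = []
    current = first
    while current <= last:
        rows.append({
            "date": current.isoformat(),                          # e.g. '1994-01-01'
            "is_weekend": current.weekday() >= 5,                 # Saturday=5, Sunday=6
            "year": current.year,
            "quarter": "Q%d" % ((current.month - 1) // 3 + 1),    # assumed 'Q1'..'Q4'
            "month": current.month,
            "day": current.day,
        })
        current += timedelta(days=1)
    return rows


rows = build_date_rows(date(1994, 1, 1), date(1996, 12, 31))
print(len(rows), "rows from", rows[0]["date"], "to", rows[-1]["date"])
```

The row count (1996 is a leap year) gives you a number to compare against a COUNT(*) on fp_dimDate after loading.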

fp_dimEmployees
Use the following query to populate fp_dimEmployees:
CODE TO TYPE:
SELECT EmployeeID, LastName, FirstName, Title, TitleOfCourtesy,
BirthDate, HireDate, Address, City, Region, PostalCode, Country,
HomePhone, Extension
FROM Employees;

Note
The dates in the SQLite database are strings, and we need to convert them to dates in our tMap
component before writing them out to MySQL. Assuming that the HireDate column is identified
in the expression column of the tMap outputs section as row2.HireDate, we would use this
expression:
TalendDate.parseDate("dd-MMM-yyyy", row2.HireDate)
In the Schema editor in the lower right, set HireDate's Type to Date and the Date Pattern to
"yyyy-MM-dd". Apply this same technique to other dates encountered in this project.

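If the two date patterns in the note are unfamiliar, this small Python sketch shows the same conversion outside of Talend: it parses a day, abbreviated month name, and four-digit year, then writes the result in year-month-day form. It is only an analogue of the tMap expression, not something TOS runs. The convert_date helper and the sample value are hypothetical, and the format codes assume English month abbreviations (Python's default locale).

```python
from datetime import datetime


def convert_date(text):
    """Parse a 'dd-MMM-yyyy' style string and return it as 'yyyy-MM-dd'."""
    parsed = datetime.strptime(text, "%d-%b-%Y")   # %b = abbreviated month name
    return parsed.date().isoformat()


# Hypothetical sample value in the pattern described by the note.
converted = convert_date("17-Oct-1993")
print(converted)
```

A string that does not match the pattern raises a ValueError here, which is a useful reminder to check how the dates are actually stored in the source before choosing a pattern.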
fp_dimCustomers
Use the following query to populate fp_dimCustomers, a Type-2 SCD:
CODE TO TYPE:
SELECT CustomerID, CompanyName, ContactName, ContactTitle, Address, City,
Region, PostalCode, Country, Phone, Fax
FROM Customers;

fp_dimSuppliers
Use the following query to populate fp_dimSuppliers, a Type-2 SCD:
CODE TO TYPE:
SELECT SupplierID, CompanyName, ContactName, ContactTitle,
Address, City, Region, PostalCode, Country, Phone, Fax, HomePage
FROM Suppliers;

fp_dimProducts
Use the following query to populate fp_dimProducts, a Type-2 SCD:
CODE TO TYPE:
SELECT Products.ProductID, Products.ProductName, Products.Discontinued,
Categories.CategoryName, Categories.Description as CategoryDescription
FROM Products
INNER JOIN Categories on Products.CategoryID = Categories.CategoryID;

fp_dimOrders
Use the following query to populate fp_dimOrders:
CODE TO TYPE:
SELECT Orders.OrderID, Customers.CompanyName as CustomerName, Customers.ContactName,
Orders.OrderDate, Orders.RequiredDate,
Orders.ShipName, Orders.ShipAddress, Orders.ShipCity,
Orders.ShipRegion, Orders.ShipPostalCode, Orders.ShipCountry
FROM Orders
INNER JOIN Customers on Orders.CustomerID = Customers.CustomerID;

Order Unit Price
Use the following query to retrieve data for eventual placement in your fp_factOrderUnitPrice table:
CODE TO TYPE:
SELECT od.OrderID, od.ProductID, od.UnitPrice, o.OrderDate
FROM OrderDetails as od
INNER JOIN Orders as o on od.OrderID = o.OrderID;

Note
You will need to use a staging table for your fact process.
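As a rough illustration of what the staging step is for, the sketch below loads raw rows into a staging table and then looks up the dimension key for the version of the product that was in effect on the order date. It uses Python's built-in sqlite3 module instead of MySQL and TOS. The staging table name, the product_key and start_date columns, and all of the rows are assumptions made up for this example; your own dimension design may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical Type-2 product dimension: one row per version of a product.
    CREATE TABLE fp_dimProducts (
        product_key INTEGER PRIMARY KEY,
        ProductID   INTEGER,
        ProductName TEXT,
        start_date  TEXT,
        end_date    TEXT
    );
    INSERT INTO fp_dimProducts VALUES
        (1, 7, 'Widget',           '1994-01-01', '1995-06-30'),
        (2, 7, 'Widget (renamed)', '1995-07-01', '2099-01-01');

    -- Hypothetical staging table holding the raw rows from the source query.
    CREATE TABLE fp_stageOrderUnitPrice (
        OrderID INTEGER, ProductID INTEGER, UnitPrice REAL, OrderDate TEXT
    );
    INSERT INTO fp_stageOrderUnitPrice VALUES
        (1, 7, 14.0, '1994-08-04'),
        (2, 7, 21.0, '1995-08-09');
""")

# Swap the source ProductID for the dimension key of the product version
# that was current on the order date. ISO date strings compare correctly
# as text, so BETWEEN works on them here.
fact_rows = conn.execute("""
    SELECT s.OrderID, p.product_key, s.UnitPrice
    FROM fp_stageOrderUnitPrice s
    INNER JOIN fp_dimProducts p
        ON s.ProductID = p.ProductID
       AND s.OrderDate BETWEEN p.start_date AND p.end_date
    ORDER BY s.OrderID
""").fetchall()

print(fact_rows)
```

The first order picks up the original version of the product and the second picks up the renamed version, even though both staged rows carry the same ProductID.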

Supplier Unit Price
Use the following query to retrieve data for eventual placement in your fp_factSupplierUnitPrice table:
CODE TO TYPE:
SELECT ProductID, SupplierID, UnitPrice
FROM Products;

Note
You will need to use a staging table for your fact process.

There are no dates for this fact. We want to join on the "latest" values for the Product and Supplier dimensions.
To do so, add the expression end_date='2099-01-01' to your join.
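Here is a minimal sketch of that join, again using Python's built-in sqlite3 module with made-up data. The staging table name and the product_key and supplier_key columns are assumptions; the end_date='2099-01-01' condition is the one described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical Type-2 dimensions, each holding an old and a current row.
    CREATE TABLE fp_dimProducts (product_key INTEGER, ProductID INTEGER, end_date TEXT);
    CREATE TABLE fp_dimSuppliers (supplier_key INTEGER, SupplierID INTEGER, end_date TEXT);
    INSERT INTO fp_dimProducts VALUES (1, 7, '1995-06-30'), (2, 7, '2099-01-01');
    INSERT INTO fp_dimSuppliers VALUES (10, 3, '1995-12-31'), (11, 3, '2099-01-01');

    -- Hypothetical staging table with one raw row from the source query.
    CREATE TABLE fp_stageSupplierUnitPrice (ProductID INTEGER, SupplierID INTEGER, UnitPrice REAL);
    INSERT INTO fp_stageSupplierUnitPrice VALUES (7, 3, 18.0);
""")

# Without the end_date condition, the staged row matches every version
# of the product and every version of the supplier.
all_versions = conn.execute("""
    SELECT COUNT(*)
    FROM fp_stageSupplierUnitPrice s
    INNER JOIN fp_dimProducts p ON s.ProductID = p.ProductID
    INNER JOIN fp_dimSuppliers sup ON s.SupplierID = sup.SupplierID
""").fetchone()[0]

# With the condition, only the "latest" row of each dimension matches.
current_only = conn.execute("""
    SELECT p.product_key, sup.supplier_key, s.UnitPrice
    FROM fp_stageSupplierUnitPrice s
    INNER JOIN fp_dimProducts p
        ON s.ProductID = p.ProductID AND p.end_date = '2099-01-01'
    INNER JOIN fp_dimSuppliers sup
        ON s.SupplierID = sup.SupplierID AND sup.end_date = '2099-01-01'
""").fetchall()

print("rows without end_date filter:", all_versions)
print("rows with end_date filter:   ", current_only)
```

Leaving out the end_date condition turns one staged row into four fact rows in this example, so it is worth comparing row counts between the staging table and the fact table after your load.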
As always, feel free to contact your mentor if you have any questions.
Thanks for playing, have fun, and good luck with this last project. It's been great working with you!
