Sie sind auf Seite 1von 9

Dimensional modeling tutorial

Dimensional Modeling Definition

Dimensional modeling is the design concept used by many data warehouse designers to
build their data warehouse. Dimensional model is the underlying data model used by many
of the commercial OLAP products available today in the market. In this model, all data is
contained in two types of tables called Fact Table and Dimension Table.

Fact Table

Fact table contains the measurements or metrics or facts of business processes. If your
business process is Sales, then a measurement of this business process such as "monthly
sales number" is captured in the fact table. In addition to the measurements, the only other
things a fact table contains are foreign keys for the dimension tables.

Dimension Table

Context of the measurements are represented in dimension tables. You can also think of the
context of a measurement as the characteristics such as who, what, where, when, how of a
measurement (subject ). In your business process Sales, the characteristics of the 'monthly
sales number' measurement can be a Location (Where), Time (When), Product Sold (What).

The Dimension Attributes are the various columns in a dimension table. In the Location
dimension, the attributes can be Location Code, State, Country, Zip code. Generally the
Dimension Attributes are used in report labels, and query constraints such as where
Country='USA'. The dimension attributes also contain one or more hierarchical
relationships.

Before designing your data warehouse, you need to decide what this data warehouse
contains. Say if you want to build a data warehouse containing monthly sales numbers
across multiple store locations, across time and across products then your dimensions are:

Location
Time
Product

Each dimension table contains data for one dimension. In the above example you get all
your store location information and put that into one single table called Location. Your
store location data may be spanned across multiple tables in your OLTP system (unlike
OLAP), but you need to de-normalize all that data into one single table.
Online Transaction Processing (OLTP) Schema

In Online Transaction Processing (OLTP), the database is designed to achieve efficient


transactions such as INSERT and UPDATE. This is very different from the OLAP design.

Unlike OLAP, normalization is very important to reduce duplicates and also cut down on
the size of the data. our OLTP schema may look like this:

Locations Table
Field Name Type
Loc_Id INTEGER (4)
Loc_Code VARCHAR (5)
Loc_Name VARCHAR (30)
State_Id INTEGER (4)
Country_Id INTEGER (4)

States Table
Field Name Type
Sate_Id INTEGER (4)
State_Name VARCHAR (50)

Countries Table
Field Name Type
Country_Id INTEGER (4)
Country_Name VARCHAR (50)

In order to query for all locations that are in country 'USA' we will have to join these three
tables. The SQL will look like:

SELECT * FROM Locations, States, Countries where Locations.State_Id = States.State_Id


AND Locations.Country_id=Countries.Country_Id and Country_Name='USA'
Dimension Tables

Dimensional modeling allows only one table per dimension. But your OLTP data spans
across multiple tables as described. So we need de-normalize the OLTP schema and export
into your Dimension Tables.

For example, for the location dimension, you achieve this by joining the three OLTP tables
and inserting the data into the single Location table.

Your Location Table will look like this:

Location Dimension Table Schema


Field Name Type
Dim_Id INTEGER (4)
Loc_Code VARCHAR (4)
Name VARCHAR (50)
State_Name VARCHAR (20)
Country_Name VARCHAR (20)

All Dimension tables contain a key column called the dimension key. In this example
Dim_Id is our dimension Id. This is the unique key into our Location dimension table.

The actual data in your Location Table may look like this

Location Dimension Table Data


Dim_Id Loc_Code Name State_Name Country_Name
1001 IL01 Chicago Loop Illinois USA
1002 IL02 Arlington Hts Illinois USA
1003 NY01 Brooklyn New York USA
1004 TO01 Toronto Ontario Canada
1005 MX01 Mexico City Distrito Federal Mexico

You may notice that some of the information is repeated in the above dimension table. The
State Name and Country Name are repeated through out the table. You may feel that this is
waste of data space and against the normalization principles. But in dimensional modeling
this type of design makes the querying very optimized and reduces the query times. Also
we will learn later that in a typical data warehouse, the dimension tables make up only 10 to
15 % of the storage as the fact table is by far the largest table and takes up the rest of the
storage allocation.
Time Dimension Table

After de-normalization, your Time table will look like this:

Time Dimension Table Schema


Field Name Type
Dim_Id INTEGER (4)
Month SMALL INTEGER (2)
Month_Name VARCHAR (3)
Quarter SMALL INTEGER (4)
Quarter_Name VARCHAR (2)
Year SMALL INTEGER (2)

The actual data in your Time Table may will look like this:

Time Dimension Data


TM TM TM TM TM TM_Year
_Dim_Id _Month _Month_Name _Quarter _Quarter_Name
1001 1 Jan 1 Q1 2003
1002 2 Feb 1 Q1 2003
1003 3 Mar 1 Q1 2003
1004 4 Apr 2 Q2 2003
1005 5 May 2 Q2 2003

Product Dimension Table

After de-normalization, your Product table will look like this:

Product Dimension Table Schema


Field Name Type
Dim_Id INTEGER (4)
SKU VARCHAR (10)
Name VARCHAR (30)
Category VARCHAR (30)

In this table PR_Dim_Id is our dimension Id. This is the unique key into our Product
dimension table.

The actual data in your Product Table may look like this:
Product Dimension Table Data
Dim_Id SKU Name Category
1001 DOVE6K Dove Soap 6Pk Sanitary
1002 MLK66F# Skim Milk 1 Gal Dairy
1003 SMKSAL55 Smoked Salmon 6oz Meat

Fact Table

Fact table contains the actual business process measurements or metrics called facts.
Usually these facts are numeric.

These facts are generally Additive.


Some times the facts are semi additive such as balances
Some times they are non additive such as unit price

In the above example where you are building a data warehouse of monthly sales in dollars,
your fact table will contain the actual sales numbers, one row per month. In addition to the
data itself, you will have the foreign keys for the various dimensions in this row.

Granularity or Grain of Fact Table

The level of detail of the fact table is known as the grain of the fact table. In this example
the grain of your fact table is monthly sales number per location per product.

Your Fact Table will look like this

MonthlySales Fact Table Schema


Field Name Type
TM_Dim_Id INTEGER (4)
PR_ Dim_Id INTEGER (4)
LOC_ INTEGER (4)
Dim_Id
Sales INTEGER (4)

In this table the combination of all three dimension table foreign keys make up the primary
key in the fact table. (TM_Dim_Id, PR_ Dim_Id, LOC_ Dim_Id) is our primary key. This
is the unique key into our Sales fact table.

The actual data in your MonthlySales Table will look like this:
MonthlySales Fact Table Data
TM_Dim_Id PR_ Dim_Id LOC_ Sales
Dim_Id
1001 1001 1003 435677
1002 1002 1001 451121
1003 1001 1003 98765
1001 1004 1001 65432

A fact table may contain one or more Facts. Usually you create one fact table per business
process or event. For example if you want to analyze the sales numbers and also advertising
spending, they are two separate business processes. So you will create two separate fact
tables, one for sales data and one for advertising cost data. On the other hand if you want to
track the sales tax in addition to the sales number, you simply create one more fact column
in the Sales fact table called Tax.
Star Schema

Now if you look at our new dimensional modeled schema, it will look like this:

You can easily tell this looks like a STAR. Hence it is also known as Star Schema.

Advantages of Star Schema

Star Schema is very easy to understand, even for non technical business managers
Star Schema provides better performance and smaller query times
Star schema is easily extensible and will handle future changes easily

A Typical SQL Query Template for the Sales Schema will look like:
--- Select the measurements that you want to aggregate using SUM clause

SELECT P.Name, SUM(F.Sales)

--- JOIN the FACT table with Dimension Tables

FROM Sales F, Time T, Product P, Location L


WHERE F.TM_Dim_Id = T.Dim_Id
AND F.PR_Dim_Id = P.Dim_Id
AND F.LOC_Dim_Id = L.Dim_Id

--- Constrains the Dimension Attributes


AND T.Month='Jan' AND T.Year='2003' AND L.Country_Name='USA'

-- finally the 'group by' clause identifies the aggregation level. In this example you are
aggregating
-- all sales within a product category.

GROUP BY P.Category

How does this dimensional model map into an actual report?

Now if you look at a report generated from your dimensional modeled data warehouse, it
will look like this:
You can tell from the above report that all your dimensions are the row and column
headings in the report. The facts are the numeric numbers in the report.

Advantages of Star Schema

Star Schema is very easy to understand, even for non technical business managers
Star Schema provides better performance and smaller query times
Star schema is easily extensible and will handle future changes easily

Snowflake Schema

If we did not de-normalize our dimensions into one table each, then the schema would have
looked like this

You can see, this looks like a snow flake, hence this type of schema is called SNOW
FLAKE schema. General rule of thumb is keep away from snow flake schemas as even
though they may save you some space, they will cost a lot in terms of query times.

Step by Step Design of Dimensional Model

First identify the business process.


In this step you will determine what is your business process that your data
warehouse represents. This process will be the source of your metrics or
measurements.

Next Identify the Grain.


You will determine what does one row of fact table mean. In the previous example
you have decided that your grain is 'monthly sales per location per product'.

Next Identify the Dimensions.


Your dimensions should be descriptive (SQL VARCHAR or CHARACTER) as
much as possible and confirm to your grain.

Finally Identify the facts.


In this step you will identify what are your measurements (or metrics or facts). The
facts should be numeric and should confirm to the grain defines in step 2.

Das könnte Ihnen auch gefallen