Dimensional Modeling
Microsoft Confidential. © 2006 Microsoft Corporation. All rights reserved. These materials are
confidential to and maintained as a trade secret by Microsoft Corporation. Information in these
materials is restricted to Microsoft authorized recipients only. Any use, distribution or public
discussion of, and any feedback to, these materials is subject to the terms of the attached
license. By providing any feedback on these materials to Microsoft, you agree to the terms of that
license.
1. For good and valuable consideration, the receipt and sufficiency of which are acknowledged, You and
Microsoft agree as follows:
(a) If You are an authorized representative of the corporation or other entity designated below
("Company"), and such Company has executed a Microsoft Corporation Non-Disclosure Agreement that is
not limited to a specific subject matter or event ("Microsoft NDA"), You represent that You have
authority to act on behalf of Company and agree that the Confidential Information, as defined in the
Microsoft NDA, is subject to the terms and conditions of the Microsoft NDA and that Company will treat the
Confidential Information accordingly;
(b) If You are an individual, and have executed a Microsoft NDA, You agree that the Confidential
Information, as defined in the Microsoft NDA, is subject to the terms and conditions of the Microsoft NDA
and that You will treat the Confidential Information accordingly; or
(c) If a Microsoft NDA has not been executed, You (if You are an individual), or Company (if You are an
authorized representative of Company), as applicable, agrees: (a) to refrain from disclosing or distributing
the Confidential Information to any third party for five (5) years from the date of disclosure of the
Confidential Information by Microsoft to Company/You; (b) to refrain from reproducing or summarizing the
Confidential Information; and (c) to take reasonable security precautions, at least as great as the
precautions it takes to protect its own confidential information, but no less than reasonable care, to keep
confidential the Confidential Information. You/Company, however, may disclose Confidential Information
in accordance with a judicial or other governmental order, provided You/Company either (i) gives
Microsoft reasonable notice prior to such disclosure and to allow Microsoft a reasonable opportunity to
seek a protective order or equivalent, or (ii) obtains written assurance from the applicable judicial or
governmental entity that it will afford the Confidential Information the highest level of protection afforded
under applicable law or regulation. Confidential Information shall not include any information, however
designated, that: (i) is or subsequently becomes publicly available without Your/Company’s breach of any
obligation owed to Microsoft; (ii) became known to You/Company prior to Microsoft’s disclosure of such
information to You/Company pursuant to the terms of this Agreement; (iii) became known to
You/Company from a source other than Microsoft other than by the breach of an obligation of
confidentiality owed to Microsoft; or (iv) is independently developed by You/Company. For purposes of this
paragraph, "Confidential Information" means nonpublic information that Microsoft designates as being
confidential or which, under the circumstances surrounding disclosure ought to be treated as confidential
by Recipient. "Confidential Information" includes, without limitation, information in tangible or intangible
form relating to and/or including released or unreleased Microsoft software or hardware products, the
marketing or promotion of any Microsoft product, Microsoft's business policies or practices, and
information received from others that Microsoft is obligated to treat as confidential.
2. You may review these Materials only (a) as a reference to assist You in planning and designing Your
product, service or technology ("Product") to interface with a Microsoft Product as described in these
Materials; and (b) to provide feedback on these Materials to Microsoft. All other rights are retained by
Microsoft; this agreement does not give You rights under any Microsoft patents. You may not (i) duplicate
any part of these Materials, (ii) remove this agreement or any notices from these Materials, or (iii) give
any part of these Materials, or assign or otherwise provide Your rights under this agreement, to anyone
else.
3. These Materials may contain preliminary information or inaccuracies, and may not correctly represent
any associated Microsoft Product as commercially released. All Materials are provided entirely "AS IS." To
the extent permitted by law, MICROSOFT MAKES NO WARRANTY OF ANY KIND, DISCLAIMS ALL EXPRESS,
IMPLIED AND STATUTORY WARRANTIES, AND ASSUMES NO LIABILITY TO YOU FOR ANY DAMAGES OF
ANY TYPE IN CONNECTION WITH THESE MATERIALS OR ANY INTELLECTUAL PROPERTY IN THEM.
4. If You are an entity and (a) merge into another entity or (b) a controlling ownership interest in You
changes, Your right to use these Materials automatically terminates and You must destroy them.
6. Microsoft has no obligation to maintain confidentiality of any Microsoft Offering, but otherwise the
confidentiality of Your Feedback, including Your identity as the source of such Feedback, is governed by
Your NDA.
7. This agreement is governed by the laws of the State of Washington. Any dispute involving it must be
brought in the federal or state superior courts located in King County, Washington, and You waive any
defenses allowing the dispute to be litigated elsewhere. If there is litigation, the losing party must pay the
other party’s reasonable attorneys’ fees, costs and other expenses. If any part of this agreement is
unenforceable, it will be considered modified to the extent necessary to make it enforceable, and the
remainder shall continue in effect. This agreement is the entire agreement between You and Microsoft
concerning these Materials; it may be changed only by a written document signed
by both You and Microsoft.
Overview
Dimensional modeling is the design approach that many data warehouse designers use to
build their data warehouses. The dimensional model is also the underlying data model of
many of the commercial OLAP products on the market today. It provides a method for
making databases simple and understandable. The major purpose of building a data
warehouse from transactional systems is to create intelligence out of day-to-day activity.
A warehouse is not intended for extracting operational reports and should not be treated
merely as a report store.
Terminology
The terms used frequently in this document are explained below. Each term is covered in
more detail in later sections of this chapter.
Dimension table: A table that represents a business entity of the source system. On the
source system, a single business entity may be spread across multiple normalized tables.
Example: Customer Dimension, Product Dimension.
Fact table: A central table in a data warehouse schema that contains numerical
measures and keys relating facts to dimension tables.
Grain: The level of detail at which facts are stored in the fact table, determined by the
lowest level of the dimensions. For example, a fact table may have a lower (finer) grain
that records daily transactions, such as which customer purchased how many books, or a
higher (coarser) grain that simply states the number of books sold, without the details of
who bought which book.
Measures: A measure is a numeric attribute that is typically derived from transactions. It
is generally placed in the fact table of a dimensional structure. Example: Number of books
sold today.
Hierarchies and levels: A hierarchy is a group of attributes that logically relate to
each other in the form of a tree structure. Example: City, State, Country; Day, Month,
Year. Levels describe the relative position of the members of the hierarchy: in the
example above, City is at a lower level in the hierarchy and Country is at a higher level.
Star schema: The simplest relational schema for querying a data warehouse
database is the star schema. The star schema has a center, represented by a
fact table, and the points of the star, represented by the dimension tables. From a
technical perspective, the advantages of a star schema are that joins between the
dimensions and the fact tables are simple, queries perform well, data is easy to slice,
and the data is easy to understand.
Copyright © 2006 by Microsoft Corporation. All rights reserved. By using or providing
feedback on these materials, you agree to the attached license agreement. Please provide
feedback at BI Feedback Alias.
[Figure: a sample star schema]
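As a rough illustration of the star shape described above, the following sketch builds one fact table and two dimension tables in an in-memory SQLite database and slices the fact by one dimension with a single join. The table and column names (DimCustomer, DimProduct, FactSales, Quantity) and the sample rows are hypothetical and are not taken from this document.

```python
import sqlite3

# Hypothetical star schema: one fact table in the center, two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimCustomer (CustomerSK INTEGER PRIMARY KEY, CustomerName TEXT);
CREATE TABLE DimProduct  (ProductSK  INTEGER PRIMARY KEY, ProductName  TEXT);
CREATE TABLE FactSales (
    CustomerSK INTEGER REFERENCES DimCustomer(CustomerSK),
    ProductSK  INTEGER REFERENCES DimProduct(ProductSK),
    Quantity   INTEGER
);
INSERT INTO DimCustomer VALUES (1, 'Dave'), (2, 'Jim');
INSERT INTO DimProduct  VALUES (1, 'Book1'), (2, 'Book2');
INSERT INTO FactSales   VALUES (1, 1, 2), (1, 2, 1), (2, 1, 3);
""")


def quantity_by_product(connection):
    """Slice the fact by one dimension using a single fact-to-dimension join."""
    return connection.execute("""
        SELECT p.ProductName, SUM(f.Quantity)
        FROM FactSales f
        JOIN DimProduct p ON p.ProductSK = f.ProductSK
        GROUP BY p.ProductName
        ORDER BY p.ProductName
    """).fetchall()


star_result = quantity_by_product(conn)
```

Slicing by customer instead would only swap the joined dimension table; the fact table stays the single center of every query.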
Dimensions Explained
OLTP data sources typically use an entity-relationship (E-R) schema technique for
modeling transactional data, because OLTP workloads usually involve a large number
of transactions, each touching a small amount of data. Data warehouses, on the other hand,
usually involve fewer transactions over very large amounts of data, so the E-R schema
technique is not as efficient. Instead, data warehouses generally employ one of two
approaches to schema design, referred to as star schema or snowflake schema.
Dimension Types
There are various terms relating to dimensions that you should be familiar with. Following
is a brief summary. Most of these are discussed in more detail elsewhere.
Changing Dimension: A changing dimension is a dimension that has at least one
attribute whose value changes over time.
Slowly Changing Dimensions: Some attributes of a dimension change over time, but not
frequently. Whether the history of changes to a particular attribute should be preserved in
the warehouse depends on the business requirement. Example: An employee's work location
may change over time, but not frequently. Such an attribute is called a Slowly Changing
Attribute, and a dimension containing such an attribute is called a Slowly Changing
Dimension.
Rapidly Changing Dimensions: A dimension attribute that changes frequently is a
Rapidly Changing Attribute. For example, a CurrentLocation attribute for an Employee
may change very frequently. If you don’t need to track the changes, the Rapidly
Changing Attribute is no problem, but if you do need to track the changes, using a
standard Slowly Changing Dimension technique can result in a huge inflation of the size
of the dimension. One solution is to move the attribute to its own dimension, with a
separate foreign key in the fact table. This new dimension is called a Rapidly Changing
Dimension. Deciding whether an attribute should be moved into a Rapidly Changing Dimension
or kept in a Slowly Changing Dimension is up to the data architect. The decision depends
on factors such as disk space, performance, expected data volume, and history preservation
requirements.
Junk Dimensions: A junk dimension is a single table that combines different and
unrelated attributes to avoid having a large number of foreign keys in the fact table. Junk
dimensions are often created to manage the foreign keys created by Rapidly Changing
Dimensions. If you have bitmap indexes available, you may not need to create
separate junk dimensions, because the multiple low-cardinality foreign keys can be
managed efficiently.
Inferred Dimensions: While loading fact records, a dimension record may not yet be
ready. One solution is to generate a surrogate key for the missing record, with NULL for
all the other attributes.
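The inferred-dimension approach can be sketched as a lookup that creates a placeholder row when the business key is not yet present. This is a minimal sketch under assumptions: the DimCustomer table, the IsInferred flag, and the lookup_or_infer helper are hypothetical, and the placeholder keeps the business key (in addition to the surrogate key) so that the real dimension record can be matched to it when it arrives.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE DimCustomer (
        CustomerSK   INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        CustomerId   INTEGER UNIQUE,                     -- business key
        CustomerName TEXT,                               -- NULL until the real record arrives
        IsInferred   INTEGER NOT NULL DEFAULT 0          -- hypothetical marker column
    )
""")


def lookup_or_infer(connection, customer_id):
    """Return the surrogate key for a business key, inserting a placeholder row if needed."""
    row = connection.execute(
        "SELECT CustomerSK FROM DimCustomer WHERE CustomerId = ?", (customer_id,)
    ).fetchone()
    if row is not None:
        return row[0]
    cursor = connection.execute(
        "INSERT INTO DimCustomer (CustomerId, CustomerName, IsInferred) VALUES (?, NULL, 1)",
        (customer_id,),
    )
    return cursor.lastrowid


# The fact load can call the helper for every incoming business key.
sk_first = lookup_or_infer(conn, 123)
sk_again = lookup_or_infer(conn, 123)
```

When the real customer record is loaded later, the ETL process would update the placeholder row in place, so the fact rows that already reference its surrogate key remain valid.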
Identifying dimensions
Dimensions provide the 'who, what, where, when and why' that is applied to the
'how much' fact.
Lookup tables and master tables, which generally act as primary key tables in
OLTP, end up as dimensions in the warehouse.
For normalization reasons, there can be more than one table for a single entity in
OLTP, such as a customer table and a customer address table. Both can be combined
into one dimension in the warehouse as part of the denormalization process.
Dimensions are typically highly denormalized and descriptive. Their attributes are
typically text fields that serve as context for the facts.
Dimensions can be left as degenerate dimensions whenever they make sense by
themselves (usually a matter of opinion), provided that they do not have additional
attributes. One reason for creating a separate dimension table is if you want to
report on "missing" values from the fact table: the dimension table can include all
Dimension Maintenance
Following are the guidelines pertaining to dimension maintenance.
Avoid removing attributes from a dimension's schema. Do not remove an
attribute from a dimension without analyzing the impact on other objects such as
analytical queries, report queries, and cubes. Even if the attribute is discontinued
at the source, you can add a default value in the warehouse for new rows rather
than remove the attribute from the dimension. Business keys should always be
retained in the dimension tables, as these provide a mechanism for tracing
report values back to the source.
As an example, if the source schema is modified to remove an attribute such as
Date of Birth, data for the attribute probably already exists in the dimension
table. If the attribute is kept in the dimension, new rows will have NULL for it; if it is
removed, you lose historical data.
Adding a new attribute to the dimension should not change the granularity of the
dimension. For example, adding customer location should not change the
granularity of the customer dimension. If there is only one record per customer in the
dimension before adding location, there should still be only one record per customer
after the location attribute is added. If adding a new attribute
takes the dimension to a new lower granularity, consider managing the value
as a slowly changing attribute, or add an outrigger table to the dimension.
Changing the dimension granularity would affect the fact tables that reference
this dimension.
Dimension Guidelines
Consider the following guidelines when designing your Dimension tables:
Denormalization
Denormalization is the process of flattening the dimensions so that all the related
entities form one single unit. Normalization is generally performed on OLTP systems to
reduce redundancy and the risk of update anomalies while managing many transactions. In
contrast, denormalization introduces redundancy and reduces joins, allowing for faster
retrieval and more user-friendly structures.
Unlike a transaction system, where new transactions can be launched from many
processes, in a data warehouse only the ETL process adds or updates records, and the
update frequency is much lower than that of a typical OLTP system.
Benefits of selectively violating normalization rules and introducing redundancy:
Redundancy can reduce data retrieval time.
Redundancy can create a more user-friendly model.
Potential disadvantages of denormalizing the database:
It may cause update, delete and insert performance to suffer.
It will take up more space.
In a typical data warehouse, the advantages of denormalization far outweigh the
disadvantages.
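As a minimal sketch of the flattening described in this section, the following example combines a normalized customer table and customer address table into a single dimension with one row per customer. The table names, columns, and sample rows are assumptions made for illustration; a real ETL process would also handle customers with several addresses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalized OLTP tables (hypothetical).
CREATE TABLE Customer (CustomerId INTEGER PRIMARY KEY, CustomerName TEXT);
CREATE TABLE CustomerAddress (CustomerId INTEGER, City TEXT, Country TEXT);
INSERT INTO Customer VALUES (123, 'Dave'), (124, 'Jim');
INSERT INTO CustomerAddress VALUES (123, 'Redmond', 'USA'), (124, 'Redmond', 'USA');

-- Denormalized dimension: name and address attributes live in the same row.
CREATE TABLE DimCustomer (
    CustomerSK   INTEGER PRIMARY KEY AUTOINCREMENT,
    CustomerId   INTEGER,
    CustomerName TEXT,
    City         TEXT,
    Country      TEXT
);
INSERT INTO DimCustomer (CustomerId, CustomerName, City, Country)
SELECT c.CustomerId, c.CustomerName, a.City, a.Country
FROM Customer c
LEFT JOIN CustomerAddress a ON a.CustomerId = c.CustomerId
ORDER BY c.CustomerId;
""")

dim_rows = conn.execute(
    "SELECT CustomerId, CustomerName, City, Country FROM DimCustomer ORDER BY CustomerId"
).fetchall()
```

Queries against DimCustomer no longer need the customer-to-address join, at the cost of repeating the city and country values in every customer row.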
Association table
EmpId EmpName Age
1098 John 28
Also, the association is maintained without requiring any fact transactions to happen
at the source. In general, association tables hold keys that map a many-to-many
relationship, but if the underlying dimensions are Type II, there is maintenance overhead.
To reduce this overhead, we suggest creating the association table with business keys
rather than SKs. The key can easily be found in the Type II dimension with the help of EndKey.
Employee Dimension
SK  EmployeeName  EmpId  Location  AccountId  AccountType
1   Dave          123    Redmond   9287       Savings
2   Dave          123    Redmond   4235       Checking
3   Jim           124    Redmond   8923       Savings
Here the employee information is repeated because an employee has more than one account.
To reduce the redundancy and keep only one record per employee in the employee
dimension, create the following structure using an outrigger.
Employee Dimension
SK  EmployeeName  EmpId  Location
1   Dave          123    Redmond
3   Jim           124    Redmond
AccountsOutrigger
AccSK  EmpSK  AccountId  AccountType
1      1      9287       Savings
2      1      4235       Checking
3      3      8923       Savings
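The outrigger structure above can be tried out with a small in-memory example. This sketch loads the rows shown in the tables (using account 8923 for Jim, as in the first Employee Dimension table) and confirms that the dimension holds one row per employee while the accounts remain reachable through the outrigger. The SQLite column types are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE EmployeeDimension (
    SK INTEGER PRIMARY KEY, EmployeeName TEXT, EmpId INTEGER, Location TEXT
);
CREATE TABLE AccountsOutrigger (
    AccSK       INTEGER PRIMARY KEY,
    EmpSK       INTEGER REFERENCES EmployeeDimension(SK),
    AccountId   INTEGER,
    AccountType TEXT
);
INSERT INTO EmployeeDimension VALUES
    (1, 'Dave', 123, 'Redmond'), (3, 'Jim', 124, 'Redmond');
INSERT INTO AccountsOutrigger VALUES
    (1, 1, 9287, 'Savings'), (2, 1, 4235, 'Checking'), (3, 3, 8923, 'Savings');
""")

# One row per employee in the dimension itself.
employee_count = conn.execute("SELECT COUNT(*) FROM EmployeeDimension").fetchone()[0]

# Account details are reached through the outrigger only when a query needs them.
accounts_per_employee = conn.execute("""
    SELECT e.EmployeeName, COUNT(a.AccSK)
    FROM EmployeeDimension e
    LEFT JOIN AccountsOutrigger a ON a.EmpSK = e.SK
    GROUP BY e.SK, e.EmployeeName
    ORDER BY e.SK
""").fetchall()
```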
[Figure: conformed Customer dimension. Labels: OLTP, ETL, Customer Dimension (Conformed), Complete DW, Horizontal Data Partitioned Mart (two instances)]
You should use conformed dimensions whenever possible instead of individual ones.
The data marts and the data warehouse then use the same dimension rather than multiple
copies of it, and the dimension is loaded only once.
Conformed dimensions must be designed to satisfy the needs of all the data marts and the
data warehouse. A dimension can be taken to higher granularity but cannot be taken to
lower granularity once fact tables are populated. As the conformed dimensions are
Changing Dimension
A data warehouse must deal with attributes that change values over time. This is not an
issue for most transaction systems, because they deal only with the present. But a data
warehouse will often present data historically for anywhere from two to ten years.
Organization changes, customer attribute changes, and other changes must be
managed.
If the surrogate key (SK) is used to join facts and dimensions, getting the latest
information from a Type II dimension requires a self join on the dimension:
SELECT Cur.CustomerName, Measure
FROM Fact
INNER JOIN Customer Old ON Fact.SK = Old.CustomerSK
INNER JOIN Customer Cur ON Old.CustomerId = Cur.CustomerId
AND Cur.EndDate IS NULL
If the fact table also carries the business key, the self join can be eliminated:
SELECT Cur.CustomerName, Measure
FROM Fact
INNER JOIN Customer Cur ON Fact.CustomerId = Cur.CustomerId
AND Cur.EndDate IS NULL
This method of joining on the additional business key (instead of the dimension surrogate
key) is particularly useful when the "current" information is sought in the result set from
a large Type II dimension table. It does not eliminate the need for the
surrogate key in the facts; it only suggests carrying the dimension business key as an
additional attribute in the fact table to improve query performance.
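The two join strategies can be compared on a small in-memory example. This is a sketch under assumptions: the sample rows are invented, the dimension aliases are named Hist and Cur, and the current version of a customer is taken to be the row whose EndDate is NULL. Both queries should return the current customer name for every fact row, but the second one touches the dimension only once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Type II dimension: two versions of the same customer (business key 123).
CREATE TABLE Customer (
    CustomerSK INTEGER PRIMARY KEY, CustomerId INTEGER, CustomerName TEXT, EndDate TEXT
);
INSERT INTO Customer VALUES
    (1, 123, 'Dave Smith', '2006-01-31'),
    (2, 123, 'Dave Jones', NULL);

-- The fact carries both the surrogate key and the business key.
CREATE TABLE Fact (SK INTEGER, CustomerId INTEGER, Measure INTEGER);
INSERT INTO Fact VALUES (1, 123, 10), (2, 123, 5);
""")

VIA_SURROGATE_KEY = """
    SELECT Cur.CustomerName, Fact.Measure
    FROM Fact
    INNER JOIN Customer Hist ON Fact.SK = Hist.CustomerSK
    INNER JOIN Customer Cur ON Hist.CustomerId = Cur.CustomerId
                           AND Cur.EndDate IS NULL
    ORDER BY Fact.Measure
"""

VIA_BUSINESS_KEY = """
    SELECT Cur.CustomerName, Fact.Measure
    FROM Fact
    INNER JOIN Customer Cur ON Fact.CustomerId = Cur.CustomerId
                           AND Cur.EndDate IS NULL
    ORDER BY Fact.Measure
"""

rows_via_sk = conn.execute(VIA_SURROGATE_KEY).fetchall()
rows_via_bk = conn.execute(VIA_BUSINESS_KEY).fetchall()
```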
If the resulting dimension Business Key is NULL, the corresponding record
does not exist in the warehouse or has expired. In this case a new record is
inserted in the warehouse.
If the resulting source Business Key is NULL, the corresponding record has
been deleted at the source and should be flagged with an EndDate in the warehouse table.
If neither Business Key is NULL, the corresponding record exists in the
warehouse and is active. In this case the appropriate Type I or Type II processing can be
performed.
Using the Full Outer Join approach combined with a hash value for the attributes, you can
efficiently identify the Type I and Type II attribute changes, as follows:
Step 1: Identify inserts, updates, and deletes using a Full Outer Join.
Step 2: For inserts, insert directly into the dimension.
Step 3: For deletes, update the dimension to mark the records as inactive.
Step 4: For updates, compare the Type I and Type II hash values to determine the
subset of records that need Type I processing and the subset that need Type II processing.
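The four steps above can be sketched in plain Python. This is an illustration, not the implementation used by any particular ETL tool: the full outer join on the business key is emulated with a union of dictionary keys, SHA-256 stands in for whatever hash function the warehouse uses, and the column names and sample rows are hypothetical.

```python
import hashlib


def attr_hash(row, columns):
    """Hash the selected attribute values so they can be compared in one step."""
    joined = "|".join(str(row[column]) for column in columns)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def classify(source, dimension, type1_cols, type2_cols):
    """Emulate a full outer join on the business key and classify every key.

    source and dimension map business key -> attribute dict. The dimension
    mapping is assumed to hold only the active (non-expired) rows.
    """
    result = {"insert": [], "delete": [], "type1": [], "type2": [], "unchanged": []}
    for key in sorted(source.keys() | dimension.keys()):
        src, dim = source.get(key), dimension.get(key)
        if dim is None:      # dimension business key is NULL: new record
            result["insert"].append(key)
        elif src is None:    # source business key is NULL: deleted at source
            result["delete"].append(key)
        else:                # present on both sides: compare the hashes
            changed1 = attr_hash(src, type1_cols) != attr_hash(dim, type1_cols)
            changed2 = attr_hash(src, type2_cols) != attr_hash(dim, type2_cols)
            if changed1:
                result["type1"].append(key)
            if changed2:
                result["type2"].append(key)
            if not (changed1 or changed2):
                result["unchanged"].append(key)
    return result


source_rows = {
    1098: {"EmpName": "John", "Location": "Redmond"},
    1100: {"EmpName": "Ann", "Location": "Seattle"},
    1101: {"EmpName": "Jim B.", "Location": "Redmond"},
}
dimension_rows = {
    1098: {"EmpName": "John", "Location": "Bellevue"},
    1099: {"EmpName": "Sue", "Location": "Redmond"},
    1101: {"EmpName": "Jim", "Location": "Redmond"},
}
changes = classify(source_rows, dimension_rows, ["EmpName"], ["Location"])
```

In this sketch a record whose Type I and Type II attributes both changed appears in both lists, so each kind of processing can be applied to its own subset.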
Employee (source)
EmpId  EmpName  Age
1098   John     28
EmpDimension (Warehouse)
EmpSK  EmpId  EmpName  AgeRange
342    1098   John     20
Issue #2
Having both date and time in the same dimension can cause interpretation
problems for fact tables that have a granularity of a day. It would be easy to inadvertently
enter two records into the fact table for the same day, because of the extra granularity in
the dimension table.
A date dimension with one record per day will suffice if users do not
need time granularity finer than a single day
If the business analysis or reports do not need the exact time of day at which the
transaction occurred, then the Time dimension is not needed for this fact.
Summarizing data for a range of days requires joining only the date
dimension table to the fact table
Even if fact table rows have a grain of seconds, you do not need to join to the Time
dimension if you only need to aggregate at the day or a higher grain.
Example
For example, suppose that the fact table contains the detailed information that Book1
was sold at 5 hours 30 minutes 28 seconds on 2/1/2006.
When these transactions are aggregated to the day level at the end of the week, the
aggregation discards the time information and keeps only the day information. In this case,
there is no need to join to the Time dimension.
Example (aggregated)
Number of books sold on 2/1/2006
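The following sketch shows a day-level aggregation over a fact table whose rows have a grain of seconds, joining only the date dimension. The table names, the HHMMSS encoding of TimeKey, and the sample rows are assumptions, and 2/1/2006 is read here as February 1, 2006.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimDate (DateKey INTEGER PRIMARY KEY, CalendarDate TEXT);
-- TimeKey would reference a separate Time dimension; it is not needed below.
CREATE TABLE FactBookSales (DateKey INTEGER, TimeKey INTEGER, BookName TEXT, Quantity INTEGER);
INSERT INTO DimDate VALUES (20060201, '2006-02-01'), (20060202, '2006-02-02');
INSERT INTO FactBookSales VALUES
    (20060201,  53028, 'Book1', 1),   -- sold at 05:30:28
    (20060201, 161500, 'Book1', 2),
    (20060202,  90000, 'Book1', 1);
""")

# Aggregating to the day grain joins only the date dimension.
books_per_day = conn.execute("""
    SELECT d.CalendarDate, SUM(f.Quantity)
    FROM FactBookSales f
    JOIN DimDate d ON d.DateKey = f.DateKey
    GROUP BY d.CalendarDate
    ORDER BY d.CalendarDate
""").fetchall()
```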
Fact tables
Fact tables contain data that describes specific events within a business, such as bank
transactions or product sales.
Ideal fact measures are additive in nature, but a fact table is not required
to have numeric measures. For example, consider a list of computers added to and deleted
from an environment. The fact table may capture the computers that are added and
deleted. No measure is stored in the fact table for this kind of transaction; instead,
the count of computers becomes the measure.
Granularity
Granularity is defined as the lowest level of detail of the data in the fact tables,
particularly of the measures. The more granular the data, the more detailed your reports
and analysis can be. However, excess granularity consumes space and increases complexity.
It is a trade-off, because once you aggregate a fact table, you cannot decompose the
aggregates.
There are three types of grain: Transaction, Periodic Snapshot and Accumulating
Snapshot.
Transaction - This is the basic detail record of the warehouse, as it happened in
the OLTP systems. Example: A customer purchases a book, or money is transferred from
savings to checking. The time grain of a transaction is essentially continuous.
Periodic Snapshot - This is a set of detail records that is repeated over time. It
is used to capture, for example, the number of existing customers or the inventory of the
system at the end of the day. The grain of a snapshot is the interval at which the snapshot
is taken, whether hourly, daily, weekly, monthly, or yearly.
Accumulating Snapshot - This is a snapshot that can change over time. The
common example is student enrollment: the student's data doesn't
change, but the dates for enrollment, admission, etc. do.
EmployeeHours OLTP:
EmployeeId  Date      StartTime  EndTime
34          1/1/2006  9          17
21          1/2/2006  12         19
56          1/1/2006  13         20.5
In the above fact, the original measures StartTime and EndTime are not
included. As a result, a question such as how many employees started working
before 10 AM on 1/1/2006 cannot be answered from the fact, although the number can be
deduced from the source schema.
The guideline here is: when calculating custom measures on a fact, retain the source
measures as part of the fact.
Keeping the source measures also makes it possible to show an audit trail of how the
value was calculated.
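A minimal sketch of this guideline, using the EmployeeHours OLTP rows shown above: the fact keeps StartTime and EndTime next to the calculated measure. The derived measure is assumed here to be hours worked (EndTime minus StartTime), and the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmployeeHoursFact:
    employee_id: int
    date: str
    start_time: float    # source measure, hours since midnight
    end_time: float      # source measure, hours since midnight
    hours_worked: float  # calculated measure (assumed: end_time - start_time)


def build_fact(employee_id, date, start_time, end_time):
    """Derive the custom measure while retaining the source measures."""
    return EmployeeHoursFact(employee_id, date, start_time, end_time, end_time - start_time)


oltp_rows = [
    (34, "1/1/2006", 9, 17),
    (21, "1/2/2006", 12, 19),
    (56, "1/1/2006", 13, 20.5),
]
facts = [build_fact(*row) for row in oltp_rows]

# Answerable only because start_time was retained in the fact.
early_starters = sum(1 for f in facts if f.date == "1/1/2006" and f.start_time < 10)
total_hours = sum(f.hours_worked for f in facts)
```

The retained start and end times also document how each hours_worked value was calculated.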
InventoryFact
ProductKey  DateKey  Inventory
23          44       289
43          44       435
65          44       658
InvoiceFact
InventoryFact and InvoiceFact have the same granularity (Product – Daily) and
represent a single business entity, the product SKU number. Even if nothing is sold for a
product on a given day, a record can still be added to the combined fact with the
corresponding sold measure set to 0, because the inventory fact is a snapshot fact and does
not depend on whether the product was sold that day. The cardinality of the combined fact
would be the same as that of InventoryFact.
The combined fact would look like the following
Combining these two fact tables into one also enables calculated measures, such as the
percentage of SKUs sold today, that need measurements from both
facts. Without this combination, such a measure would require a cross-fact query, which
would lead to performance problems.
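The combination can be sketched as follows. The InventoryFact rows are the ones shown above; the InvoiceFact columns and rows (QuantitySold for products 23 and 65) are hypothetical, since the invoice table is not reproduced here, and CombinedFact is an assumed name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE InventoryFact (ProductKey INTEGER, DateKey INTEGER, Inventory INTEGER);
CREATE TABLE InvoiceFact   (ProductKey INTEGER, DateKey INTEGER, QuantitySold INTEGER);
INSERT INTO InventoryFact VALUES (23, 44, 289), (43, 44, 435), (65, 44, 658);
INSERT INTO InvoiceFact   VALUES (23, 44, 5), (65, 44, 2);  -- hypothetical sales

-- One row per inventory snapshot row; products with no sales get 0.
CREATE TABLE CombinedFact AS
SELECT i.ProductKey, i.DateKey, i.Inventory,
       COALESCE(s.QuantitySold, 0) AS QuantitySold
FROM InventoryFact i
LEFT JOIN InvoiceFact s
       ON s.ProductKey = i.ProductKey AND s.DateKey = i.DateKey;
""")

combined_rows = conn.execute(
    "SELECT ProductKey, DateKey, Inventory, QuantitySold FROM CombinedFact ORDER BY ProductKey"
).fetchall()

# A measure that needs both original facts now reads a single table.
percent_skus_sold = conn.execute("""
    SELECT 100.0 * SUM(CASE WHEN QuantitySold > 0 THEN 1 ELSE 0 END) / COUNT(*)
    FROM CombinedFact
    WHERE DateKey = 44
""").fetchone()[0]
```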
Keep the data types of the fact table columns as short as possible
Fact tables are already huge in size; for example, use TINYINT instead of INT where
possible. It is suggested to adopt the same data type for measures that the source system
uses. If you create aggregate tables, be sure to change the data types to be large enough
to handle the maximum aggregated values.
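The point about aggregate tables can be checked with simple arithmetic. The sketch below uses the value ranges of the SQL Server integer types to pick the narrowest type for a set of values; the helper name and the sample quantities are hypothetical.

```python
# Value ranges of the SQL Server integer types, narrowest first.
INTEGER_TYPES = [
    ("TINYINT", 0, 255),
    ("SMALLINT", -2**15, 2**15 - 1),
    ("INT", -2**31, 2**31 - 1),
    ("BIGINT", -2**63, 2**63 - 1),
]


def smallest_integer_type(values):
    """Return the narrowest integer type whose range holds every value."""
    lowest, highest = min(values), max(values)
    for name, type_min, type_max in INTEGER_TYPES:
        if type_min <= lowest and highest <= type_max:
            return name
    raise ValueError("values do not fit in any integer type")


daily_quantities = [200, 180, 250]

# Each detail row fits in the narrowest type...
detail_type = smallest_integer_type(daily_quantities)
# ...but their aggregate (630) does not, so the aggregate table needs a wider column.
aggregate_type = smallest_integer_type([sum(daily_quantities)])
```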
Fact loading should not create records that are not part of the OLTP system
Capture only transactional records from OLTP. Avoid having the ETL process add new
records that were not sourced from the OLTP systems. For example, if Customer1
purchased Book1, then the fact table should have only one entry for this transaction. Adding
records such as Customer1 did not purchase Book2 or Customer1 did not purchase Book3
adds no value and should be avoided.
Fact records are considered event records, i.e. transactions processed on the OLTP system.
The ETL process should not create fact records of its own; it is just a transformation.
It is suggested to create additive measures where possible and to avoid non-additive
measures in the fact tables.
The measures of a fact table are numeric and, ideally, additive. However, some
measures are non-additive or semi-additive. Semi-additive measures can be summed across
some dimensions but not others, for example a measure that cannot be summed across all
products but can be summed for the same product across the year. Non-additive measures
include ratios and percentages. All of these measures must still be numeric. For
non-additive and semi-additive measures, custom aggregations are created, and the
original additive source measures must be part of the fact table so that the custom
aggregations can be calculated.
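A short worked example shows why the additive source measures are needed for a non-additive measure. The margin ratio, the revenue and cost figures, and the function name are hypothetical; the point is only that the ratio has to be recomputed from summed additive measures rather than summed or averaged itself.

```python
# Additive source measures kept in the fact: (product, revenue, cost).
fact_rows = [
    ("A", 100.0, 50.0),   # margin ratio 0.50
    ("B", 300.0, 270.0),  # margin ratio 0.10
]


def margin_ratio(revenue, cost):
    """Non-additive measure derived from two additive measures."""
    return (revenue - cost) / revenue


# Custom aggregation: sum the additive measures first, then derive the ratio.
total_revenue = sum(revenue for _, revenue, _ in fact_rows)
total_cost = sum(cost for _, _, cost in fact_rows)
overall_margin = margin_ratio(total_revenue, total_cost)

# Averaging the stored ratios gives a different, misleading figure.
average_of_ratios = sum(margin_ratio(r, c) for _, r, c in fact_rows) / len(fact_rows)
```

With these figures the overall margin is 80 / 400 = 0.20, while the average of the per-product ratios is 0.30, because the average ignores how much revenue each product contributes.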
Use indexed views to improve the performance of queries that access data through views