08 - Dimensional Modeling

8
Dimensional Modeling
Microsoft Confidential. © 2006 Microsoft Corporation. All rights reserved. These materials are
confidential to and maintained as a trade secret by Microsoft Corporation. Information in these
materials is restricted to Microsoft authorized recipients only. Any use, distribution or public
discussion of, and any feedback to, these materials is subject to the terms of the attached
license. By providing any feedback on these materials to Microsoft, you agree to the terms of that
license.
Copyright © 2006 by Microsoft Corporation. All rights reserved. By using or providing

feedback on these materials, you agree to the attached license agreement. Please provide
feedback at BI Feedback Alias.
Microsoft Corporation Technical Documentation License Agreement (Standard)
READ THIS! THIS IS A LEGAL AGREEMENT BETWEEN MICROSOFT CORPORATION ("MICROSOFT") AND
THE RECIPIENT OF THESE MATERIALS, WHETHER AN INDIVIDUAL OR AN ENTITY ("YOU"). IF YOU HAVE
ACCESSED THIS AGREEMENT IN THE PROCESS OF DOWNLOADING MATERIALS ("MATERIALS") FROM A
MICROSOFT WEB SITE, BY CLICKING "I ACCEPT", DOWNLOADING, USING OR PROVIDING FEEDBACK ON
THE MATERIALS, YOU AGREE TO THESE TERMS. IF THIS AGREEMENT IS ATTACHED TO MATERIALS, BY
ACCESSING, USING OR PROVIDING FEEDBACK ON THE ATTACHED MATERIALS, YOU AGREE TO THESE
TERMS.
1. For good and valuable consideration, the receipt and sufficiency of which are acknowledged, You and
Microsoft agree as follows:
(a) If You are an authorized representative of the corporation or other entity designated below
("Company"), and such Company has executed a Microsoft Corporation Non-Disclosure Agreement that is
not limited to a specific subject matter or event ("Microsoft NDA"), You represent that You have
authority to act on behalf of Company and agree that the Confidential Information, as defined in the
Microsoft NDA, is subject to the terms and conditions of the Microsoft NDA and that Company will treat the
Confidential Information accordingly;
(b) If You are an individual, and have executed a Microsoft NDA, You agree that the Confidential
Information, as defined in the Microsoft NDA, is subject to the terms and conditions of the Microsoft NDA
and that You will treat the Confidential Information accordingly; or
(c)If a Microsoft NDA has not been executed, You (if You are an individual), or Company (if You are an
authorized representative of Company), as applicable, agrees: (a) to refrain from disclosing or distributing
the Confidential Information to any third party for five (5) years from the date of disclosure of the
Confidential Information by Microsoft to Company/You; (b) to refrain from reproducing or summarizing the
Confidential Information; and (c) to take reasonable security precautions, at least as great as the
precautions it takes to protect its own confidential information, but no less than reasonable care, to keep
confidential the Confidential Information. You/Company, however, may disclose Confidential Information
in accordance with a judicial or other governmental order, provided You/Company either (i) gives
Microsoft reasonable notice prior to such disclosure and to allow Microsoft a reasonable opportunity to
seek a protective order or equivalent, or (ii) obtains written assurance from the applicable judicial or
governmental entity that it will afford the Confidential Information the highest level of protection afforded
under applicable law or regulation. Confidential Information shall not include any information, however
designated, that: (i) is or subsequently becomes publicly available without Your/Company’s breach of any
obligation owed to Microsoft; (ii) became known to You/Company prior to Microsoft’s disclosure of such
information to You/Company pursuant to the terms of this Agreement; (iii) became known to
You/Company from a source other than Microsoft other than by the breach of an obligation of
confidentiality owed to Microsoft; or (iv) is independently developed by You/Company. For purposes of this
paragraph, "Confidential Information" means nonpublic information that Microsoft designates as being
confidential or which, under the circumstances surrounding disclosure ought to be treated as confidential
by Recipient. "Confidential Information" includes, without limitation, information in tangible or intangible
form relating to and/or including released or unreleased Microsoft software or hardware products, the
marketing or promotion of any Microsoft product, Microsoft's business policies or practices, and
information received from others that Microsoft is obligated to treat as confidential.
2. You may review these Materials only (a) as a reference to assist You in planning and designing Your
product, service or technology ("Product") to interface with a Microsoft Product as described in these
Materials; and (b) to provide feedback on these Materials to Microsoft. All other rights are retained by
Microsoft; this agreement does not give You rights under any Microsoft patents. You may not (i) duplicate
any part of these Materials, (ii) remove this agreement or any notices from these Materials, or (iii) give
any part of these Materials, or assign or otherwise provide Your rights under this agreement, to anyone
else.
3. These Materials may contain preliminary information or inaccuracies, and may not correctly represent
any associated Microsoft Product as commercially released. All Materials are provided entirely "AS IS." To
the extent permitted by law, MICROSOFT MAKES NO WARRANTY OF ANY KIND, DISCLAIMS ALL EXPRESS,
IMPLIED AND STATUTORY WARRANTIES, AND ASSUMES NO LIABILITY TO YOU FOR ANY DAMAGES OF
ANY TYPE IN CONNECTION WITH THESE MATERIALS OR ANY INTELLECTUAL PROPERTY IN THEM.
4. If You are an entity and (a) merge into another entity or (b) a controlling ownership interest in You
changes, Your right to use these Materials automatically terminates and You must destroy them.

5. You have no obligation to give Microsoft any suggestions, comments or other feedback ("Feedback")
relating to these Materials. However, any Feedback you voluntarily provide may be used in Microsoft
Products and related specifications or other documentation (collectively, "Microsoft Offerings") which in
turn may be relied upon by other third parties to develop their own Products. Accordingly, if You do give
Microsoft Feedback on any version of these Materials or the Microsoft Offerings to which they apply, You
agree: (a) Microsoft may freely use, reproduce, license, distribute, and otherwise commercialize Your
Feedback in any Microsoft Offering; (b) You also grant third parties, without charge, only those patent
rights necessary to enable other Products to use or interface with any specific parts of a Microsoft Product
that incorporate Your Feedback; and (c) You will not give Microsoft any Feedback (i) that You have reason
to believe is subject to any patent, copyright or other intellectual property claim or right of any third
party; or (ii) subject to license terms which seek to require any Microsoft Offering incorporating or derived
from such Feedback, or other Microsoft intellectual property, to be licensed to or otherwise shared with
any third party.
6. Microsoft has no obligation to maintain confidentiality of any Microsoft Offering, but otherwise the
confidentiality of Your Feedback, including Your identity as the source of such Feedback, is governed by
Your NDA.
7. This agreement is governed by the laws of the State of Washington. Any dispute involving it must be
brought in the federal or state superior courts located in King County, Washington, and You waive any
defenses allowing the dispute to be litigated elsewhere. If there is litigation, the losing party must pay the
other party’s reasonable attorneys’ fees, costs and other expenses. If any part of this agreement is
unenforceable, it will be considered modified to the extent necessary to make it enforceable, and the
remainder shall continue in effect. This agreement is the entire agreement between You and Microsoft
concerning these Materials; it may be changed only by a written document signed
by both You and Microsoft.

In this Chapter
● Choosing appropriate schema type and designing
● Dimensions, Facts and Granularity
● Slowly Changing Dimensions
● Modeling Time and Date Dimension
Overview
Dimensional modeling is the design concept used by many data warehouse designers to
build their data warehouse. Dimensional model is the underlying data model used by
many of the commercial OLAP products available today in the market. The dimensional
data model provides a method for making databases simple and understandable. The
major purpose of creating a data warehouse from transactional systems is creating
intelligence out of day to day activity and it is not intended for extracting operation reports
and cannot be treated merely as a report store.
Terminology
The terms that are used frequently in this document, are explained below. Details of
each of these terms are explained in the later sections of this chapter.
Dimension table: It is a business entity of the source system. There can be multiple
normalized table represent one single business entity on the source system. Example:
Customer Dimension, Product Dimension
Fact table: A central table in a data warehouse schema that contains numerical
measures and keys relating facts to dimension tables.
Grain: The level of the measures within a fact table represented by the lowest level of
the dimensions. It is the level of detail the facts are stored in the fact table. Example,
Fact table may contain lower grain to daily transactions such as customer purchasing
information and number books purchased. OR fact table may contain higher grain to
simply state the number of books sold without the details of who bought which book.
Measures: It is a numeric attribute that would typically be derived from transactions. It
is generally placed in fact table of dimensional structure. Example: Number of books
sold today
Hierarchies and levels: This is basically group of attributes that would logically relate
each other in the form a tree structure. Example: City, State, Country; Day, Month,
Year. Levels is the relation between the members of the hierarchy, such as in the
above example, city is at lower level in the hierarchy and Country is at higher level.
Physical Schema Structures

There are basically two types structures that are popularly used while designing the
physical dimensional modeling, they are Star Schema and Snow flake Schema
 Star Schema. The simplest relational schema for querying a data warehouse
databases is the star schema. The star schema has a center, represented by a
fact table, and the points of the star, represented by the dimension tables. From a
technical perspective, the advantage of a star schema is that joins between the
dimensions and the fact tables are simple, performance, ability to slicing and easy
understanding of data.
The following diagram depicts a sample star schema,
Product Dimension Store Dimension

ProductKey StoreKey
Product Name StoreName
StartDate Sales Fact StartDate
ProductKey
ProductManufacturer StoreLocation
CustomerKey
StoreKey
Customer Dimension DateKey
CustomerKey Date Dimension
CustomerName DateKey
StartDate CalendarDate
CustomerLocation Month
Day
 Snowflake Schema. A variation of a star schema is called a snowflake. In a

snowflake, the dimension tables are normalized. From a reporting perspective,
there is no logical difference because all the tables of a snowflake dimension can
be logically joined into a single star dimension. From a performance perspective,
the snowflake may result in slower queries because of the additional joins required.
Loc Dim
LocKey
Prod Dimension LocName
ProductKey Store Dimension StartDate
Product Name StoreKey
StartDate StoreName
Sales Fact StartDate
ProductMfr
LocKey
ProductKey
CustomerKey
Cust Dimension
StoreKey Date Dimension
CustomerKey
DateKey DateKey
CustomerName
StartDate CalendarDate
CustomerLocKey Month
CustLoc Dim
Day
CustomerLocKey
Location
StartDate
Designing a Warehouse Schema

Following are the basic steps for designing a data warehouse based on facts and
dimensions.
Identify Facts and Fact Table Grain

o Identity business processes for analysis such as sales trend over the period or
cross sell

o Determine the facts that are needed directly or indirectly for the identified
business analysis
o The facts identification is the process of looking at source database and see if
the source database can satisfy business needs such as source does not have
any information on products name/brand sold but has only consolidated
information on the kind of product that is sold
o Identify and list out source facts typically stored in transaction tables
o Determine the lowest level of granularity needed for the facts such as business
analysis needs only weekly sales information
o Determine the entities for the source transactions. It can be identified from the
parameters for the source measures such as Product cost is based on products,
date and store
o Business entities are termed as dimensions and numeric attributes of the
transactional tables are called measures. Fact table is a set of dimensions
attributes and measures.
Identify Dimension Keys and Attributes

A dimension logically consists of a primary (unique) key that maps to a foreign key in
the fact table. The dimension also includes additional attributes about that key. For
example, in a Customer dimension, the Customer ID may be the logical primary key.
Some of the attributes provide the basis for grouping the facts. These are the
groupable attributes. For example, Gender, City, State, and Country may be
groupable attributes of a customer. Other attributes are descriptive. For example,
Mobile Phone Number is an attribute of a customer, but a report is unlikely to group
by phone numbers. Some of the groupable attributes can be organized into a
hierarchy, allowing drill-down reporting. For example, City rolls up to State, which
rolls up to Country. Not all attributes can be organized into a single hierarchy. For
example, Gender does not have any correlation to City.
o As OLTP is normalized there can be more than one table present for each
business entity such as Products, Products Manufacturer details
o Extract the required attributes of each business entity such as Products needs
only Product Name, Manufacturer though there are other attributes of parameters
but not needed for the analysis
o Denormalize the business entity (flattening normalized tables to make unique set
of attributes of each business entity)
o Prepare a logical data modal consisting structure on how business entities
related each other and entities that depend on source measures
o It is not required to use all the business entities collected for each fact
o Hierarchies for the dimensions are stored with in the dimensional table in star
schema
o In Star schema every dimension will have surrogate key as a primary key and
there is no child/parent table present for the dimensions
o Fact tables will have a composite key containing the dimension keys
Steps in designing Snow Flake Schema

 It is similar to Star schema but with a degree of normalization is being performed
 One or more dimension tables, each with a generated primary key, which contain
one or more descriptive fields used to determine some level attributes for each
dimension. In addition, a dimension table may contain a join to one or more other
child dimension tables that supply additional level attributes for each dimension
Example: Manufacturer information is kept in a separate dimension and Product
dimension has a reference of Manufacturer in its attributes. In star schema the
manufacturer information would have been as attributes to Product dimension (not
a separate dimension)
 In a snowflake schema, more than one table is required to completely define the
set of attributes for the dimension.
 In a snow flake schema, a dimension table may have have a relationship to foreign
keys in more than one dimension table. For example, a Geography table can join
to the GeographyKey of Customer, Reseller, and Manufacturer
 Whether a dimension is designed as a single table (star) or multiple tables
(snowflake) has no effect on the fact table.
 In a fully-normalized snowflake dimension, you can clearly see the hierarchical
relationship between attributes. In a star dimension, you may not see the
hierarchical relationships.
Guidelines for choosing Star vs. Snowflake

 Consider a snowflake schema to reduce storage requirements for a large
dimension tables. For example, Storing the information about the manufacturer in a
product dimension table would make a product dimension unnecessarily large
because one manufacturer may provide multiple products
 Consider snowflaking a dimension if one entity is referenced by multiple
dimensions. For example, Customers, Resellers, and Manufacturers may all have
a Geography component. Extracting geography information into a normalized table
allows for consistency between geographical definitions for the different
dimensions. A snowflaked dimension accessed from multiple dimensions is called
a reference dimension.
 Another reason for using a snowflake schema is to supply information, not about
the level attributes of a dimension, but the secondary attributes of a level.
Consideration of Operational Data Source (ODS)

Operational Data Source is a topic covered in Extraction guide.
o ODS data store technique is adopted when business requirement does demand
OLTP granularity for shorter duration of time.
o ODS should be considered only when information retrieval requirements are to
be real time and cannot access OLTP systems because of criticality of
resources.
o ODS is a logically different data store apart from warehouse.
o The data in ODS is much granular or equal to the granularity of warehouse.
ODS does not maintain history of attributes and the duration of ODS data is
typically to the order of synch duration. Example: If data is loaded to warehouse
once in a day then intra day transactions are maintained in ODS.
o ODS does not maintain Surrogate keys but having business keys

o If ODS exists, it will be used for loading to warehouse than any other staging
source
o Population to ODS is through ETL process such as Data Extraction,
Transformation and Loading
Dimensions Explained
OLTP data sources typically use an entity-relationship (E-R) schema technique for
modeling transactional data, because OLTP transactions usually involve a large number
of transactions of small amounts of data. Data warehouses, on the other hand, usually
involve fewer transactions of very large amounts of data, so E-R schema technique is not
as efficient. Instead, data warehouses generally employ one of two approaches towards
schema design, referred to as star schema or snowflake schema.
Dimension Types
There are various terms relating to dimensions that you should be familiar with. Following
is a brief summary. Most of these will be discussed in more detail elsewhere.
Changing Dimension: Changing dimension is the dimension which has at least one
attribute whose value would change over the time.
Slowly Changing Dimensions: Attributes of a dimension that would undergo changes
over time. It depends on the business requirement whether particular attribute history of
changes should be preserved in the warehouse. Example: Employee work location
attribute may undergo changes over time not frequently. This is called a Slowly Changing
Attribute and a dimension containing such an attribute is called a Slowly Changing
Dimension.
Rapidly Changing Dimensions: A dimension attribute that changes frequently is a
Rapidly Changing Attribute. For example, a CurrentLocation attribute for an Employee
may change very frequently. If you don’t need to track the changes, the Rapidly
Changing Attribute is no problem, but if you do need to track the changes, using a
standard Slowly Changing Dimension technique can result in a huge inflation of the size
of the dimension. One solution is to move the attribute to its own dimension, with a
separate foreign key in the fact table. This new dimension is called a Rapidly Changing
Dimension. Deciding if an attribute should be moved into a Rapidly Changing Dimension
or kept in a Slowly Changing Dimension is up to the data architect. The decision depends
on factors such as disk space, performance, expected data volume, history preservation
etc,
Junk Dimensions: A junk dimension is a single table with a combination of different and
unrelated attributes to avoid having a large number of foreign keys in the fact table. Junk
dimensions are often created to manage the foreign keys created by Rapidly Changing
Dimensions. If you have bit-mapped indexes available, you may not need to create
separate Junk dimensions, because the multiple, small-cardinality foreign keys can be
managed efficiently.
Inferred Dimensions: While loading fact records, a dimension record may not yet be
ready. One solution is to generate an surrogate key with Null for all the other attributes.

This should technically be called an inferred member, but is often called an inferred
dimension.
Conformed Dimensions: A Dimension that is used in multiple locations is called a
conformed dimension. A conformed dimension may be used with multiple fact tables in a
single database, or across multiple data marts or data warehouses. Conformed
dimensions are discussed in detail later in this chapter.
Degenerate Dimensions: A degenerate dimension is when the dimension attribute is
stored as part of fact table, and not in a separate dimension table. Common examples
include Order Line Number or Transaction Number. These are essentially dimension
keys for which there are no other attributes. In a data warehouse, these are often used as
the result of a drill through query to analyze the source of an aggregated number in a
report. You can use these values to trace back to transactions in the OLTP system.
Role Playing Dimensions: A role-playing dimension is one where the same dimension
key—along with its associated attributes—can be joined to more than one foreign key in
the fact table. For example, a fact table may include foreign keys for both Ship Date and
Delivery Date. But the same date dimension attributes apply to each foreign key, so you
can join the same dimension table to both foreign keys. Here the date dimension is taking
multiple roles to map ship date as well as delivery date, and hence the name of Role
Playing dimension.
Shrunken Dimensions: A shrunken dimension is a subset of another dimension. For
example, the Orders fact table may include a foreign key for Product, but the Target fact
table may include a foreign key only for ProductCategory, which is in the Product table,
but much less granular. Creating a smaller dimension table, with ProductCategory as its
primary key, is one way of dealing with this situation of heterogeneous grain. If the
Product dimension is snowflaked, there is probably already a separate table for
ProductCategory, which can serve as the Shrunken Dimension.
Outtriggers: Sometimes a dimension attribute can have multiple values for a single
dimension key member. For example, a customer may have multiple accounts. One
solution is to create a separate table with Customer Account as its logical primary key. An
outrigger dimension is the logical inverse of a shrunken dimension.
Static Dimensions: Static dimensions are not extracted from the original data source,
but are created within the context of the data warehouse. A static dimension can be
loaded manually—for example with Status codes—or it can be generated by a procedure,
such as a Date or Time dimension.
Identifying dimensions
 Dimensions provide the 'who, what, where, when and why' that is applied to the
'how much' fact.
 Lookup tables and master tables which generally act as a primary key tables in
OLTP would end up as dimensions in warehouse
 For normalization reasons, there can be more than one table for a single entity in
OLTP such as customer table and customer address table, both of them can make
it up to one dimension in warehouse as a part of the denormalization process
 Dimensions are typically highly denormalized and descriptive. They are typically
text fields, and serve as context for the facts
 Dimensions can be left as degenerate dimensions whenever they make sense by
themselves (usually a matter of opinion), provided that they do not have additional
attributes. One reason for creating a separate dimension table is if you want to
report on ―missing‖ values from the fact table: the dimension table can include all

possible values, not just the ones for which there are fact transactions. By creating
an outer join, you can find members that do not have fact transactions.
 Dimensions are some times referred to as causal: they contain the causal factors
responsible for the collected measures
 If the attributes are connected in one or more hierarchies then these are usually
captured in the dimension table
 Dimension attributes are business descriptors users believe are of significance to
understanding the measures in the fact table
 Dimension attributes are usually discrete (not continuous) and are often used as
the source of constraints. Some dimension attributes may be continuous—for
example the list price for products. Continuous values are not useful for grouping.
You can often convert a continuous value into discrete buckets. For example,
converting prices into High, Medium, Low.
 Dimension attributes usually have a limited cardinality at least when compared to
facts, which come from OLTP transactions
 User input and evaluation of existing business reports help define the dimensions
to include in the data warehouse. The words ―by‖, ―for‖, ―of‖, and ―in‖ often flag a
dimension.
o FOR EX 1: A user who wants to see data by sales region and by product
has just identified two dimensions (geography and product).
o FOR EX 2: Business reports that group sales by salesperson or sales by
customer identify two more dimensions (salesforce and customer).
o FOR EX 3: The question, "What were the sales of food items in the
northwest region for the third quarter of 1999?" represents the use of three
dimensions (product, geography, time) to specify the information to be
summarized
Dimension Maintenance
Following are the guidelines pertaining to the dimension maintenance
 Avoid removing attributes from a dimension’s schema. Consider not removing an
attribute from a dimension without analyzing the impact on other objects such as
analytical queries, report queries, and cubes. Even if the attribute is discontinued
at the source, you can add a default value in the warehouse for new rows, rather
than remove the attribute from dimension. Business keys should always be
retained in the dimension tables, as these provide a mechanism for tracing the
report values back to the source.
 As an example, If the source schema is modified to remove an attribute such as
Date of Birth, there probably already exist data for the attribute in the dimension
table. So if the attribute is kept in Dimension it will have NULL, if removed you are
loosing historical data
 Adding a new attribute to the dimension should not change the granularity of the
dimension. For example: Adding Customer location should not change the
granularity of the customer dimension. If there is only one record in customer
dimension before adding location, it should still have only one record per customer
in the dimension even after adding location attribute to it. If adding a new attribute
takes the dimension to a new lower granularity then consider managing the value
as a slowly changing attribute, or add an outrigger table to the dimension.
Changing the dimension granularity would affect the fact tables with references to
this dimension

 Dimensions must have the ability to add attributes to dimensions incrementally.
Assume all attributes on the dimension is Type I then if a new attribute is added
which could potentially result in changing the dimension to Type II this would
impact the ETL logic. Consider maintenance changes while developing ETL
framework such some attributes of Type I today may become Type II tomorrow or
vice versa. Type I and Type II are part of slowly changing dimension design, which
is covered in the later section.
 Data warehouse architectures must be designed to accommodate ongoing data
updates, and to allow for future expansion with minimum impact on existing design.
Fortunately, the dimensional model and its straightforward schemas simplify these
activities.
 Records are added to the fact table in periodic batches, often with a little effect on
most dimensions. For example, a sale of an existing product to an existing
customer at an existing store will not affect the product, customer, or store
dimensions at all. If the customer is new, a new record should have been added to
the customer dimension table before adding the fact record. The historical nature
of data warehouses means that records almost never have to be deleted from
tables except to correct errors. Errors in source data are often detected in the
extraction and transformation processes and are corrected before the data is
loaded into the data warehouse database.
 The date and time dimensions are often created and maintained in the data
warehouse independent of the other dimension tables or fact tables – updating
date and time dimensions may involve only a simple annual task to physically add
the records for the next year.
 An entirely new schema can be developed to a data warehouse. A new business
subject area can be added by designing and creating a fact table and any new
dimensions specific to the subject area. Existing dimensions can be reused without
modification to maintain conformity throughout the entire warehouse. If the new
fact table has a different grain, you may need to create a shrunken dimension (or
add an outrigger table) while still conforming with the master dimension.
Dimension Guidelines
Consider the following guidelines when designing your Dimension tables:
Denormalization
This is the process of flattening the dimensions to form one single unit of all the related
entities. Normalization is generally performed on OLTP systems to reduce redundancy
and reduce the risk of update anomalies while managing many transactions. On the
contrary, denormalization introduces redundancy and reduces joins, allowing for faster
retrieval and more user-friendly structures
Unlike a transaction system, where new transactions can be launched from many
process, in Data warehouse only the ETL process will add or update records, and the
update frequency is much less than that of a typical OLTP.
Benefits of selectively violating normalization rules and introducing redundancy:
 Redundancy can reduce data retrieval time
 Redundancy can create a more user-friendly model
Potential disadvantages with denormalizing the database:
 It may cause update, delete and insert performance to suffer
 It will take up more space
In a typical data warehouse, the advantages of denormalization far outweigh the
disadvantages.
Avoid encoding record information in Surrogate key

A surrogate key should be a simple running sequence number and should not have any
encoded information. If the surrogate key is an integer, the query manager can perform
faster joins between the fact table and the dimension table. The surrogate key should be
able to change freely when you and it should be able to apply a new surrogate key to any
new records added because of Type II changes. Encoding record information in
surrogate key would have adverse affect when the record information is changed.
For example, a surrogate key that includes the combination of Employee ID and Location
ID would need to change if the employee location changes. This defeats the decoupling
purpose of the surrogate key.
Denormalize snowflake dimensions for end users

If you decide to store a dimension as a snowflake—in multiple tables—use a view to
denormalize it into a star before representing it to end users. A simple dimension
structure that consists of simply a dimension key and a list of attributes is easier for users
to understand than a complex relationship of tables with joins.
For example, an Employee dimension may be stored as an Employee table and a
Department table. If the Department ID is a foreign key in the Employee table, it is
logically part of the Employee dimension. Rather than show the end user two separate
entities, show the dimension as if it were a single table.
Do not show surrogate keys to the end user because they do not have any business
meaning.
Consider adding zero key for Dimension tables

Adding a dummy record in each dimension as a first record whose surrogate key value is
zero, so called as zero key. The business key and other attributes are defaulted to NULL
or default value such as Not Available.
The fact table typically contains the transactions that are recorded at source system.
Due to the nature of extraction process implemented, there is a possibility that business
key is not available at the time of fact table loading. In such a case, a default key is
assumed from dimension (as dimension is not yet populated with the required key)
having zero value. Once the dimension record comes back in the loading, the
corresponding zero key in the fact is updated with corresponding surrogate key of the
business key. Zero keys also added date and time dimensions. If the date/time
information is not available at the time fact table population zero keys of the
corresponding dimensions are assumed.
Do not combine attributes from different business entities into a

single dimension
One Dimension represents one entity and its details. Combining two entities into one
dimension in order to reduce number of dimensions does not adequately model the
relationship.
For example, suppose that a single Employee can participate in multiple Projects, and a
single Project can include multiple Employees. This means that Employees and Projects
are different entities. The relationship between Employees and Projects must be from an
association table or a fact table. If you try to include Project information in the Employee
dimension, you would need a separate Employee member for each Employee Project.
Create a surrogate key as the primary key of each dimension

Surrogate keys will act as a primary key for the dimension. It is required because of
following advantages
Consolidating from Multiple sources: When data is consolidated from multiple sources,
there is a possibility that business key repeated across sources. To have it uniquely
identified, surrogate key is created. This is a single column incremented for each added
record and its value is not based on the business key
Slowly changing members: Dimension members can change over time. In Type II
dimensions where the history of Type II attribute is tracked will have business key
repeated. Surrogate keys will uniquely identify each dimensional record irrespective of
type of change in attribute columns.
Inferred dim support: When the dimension record is a late arrival after its fact records
then a new dummy record (having business key and all other columns as NULL) is
inserted in dimension table with a business key. Surrogate key here supports the
scenario to get referenced in fact table, be aware that dummy dimensional record will get
updated once the dimension has received the late arrived dimensional record.
Natural Key may change: The natural key (source key) may be modified in source
system, having it referenced in fact table would lead to chaos as it needs to update entire
fact table wherever the natural key is referenced. Having surrogate key independent of
natural key would not have to change anything even if the natural key is modified in the
source system
Performance: Having an integer surrogate key in the dimension can improve the
performance of data access queries because of its integer data type and also single
attribute to represent the dimension record. Consider the natural key at source is not a
single column but a group of columns acting as a composite key. Having this in
dimension with surrogate key support would force the facts having these two keys
referenced, this would not only increase the disk usage but joins on both the columns to
get dimensional record
Use a simple integer for Surrogate keys

 Integer key joins are faster that joins with other data types, The SQL query engine is
optimized for integer comparison.
 If possible, use an Identity constraint to generate the surrogate key. This is much
more efficient than creating the key by using ETL code
 Each time a member has a new value for a Type II attributes, you add a new
record, which generate another surrogate key. This allows older fact table rows to
have the old attribute value, while new fact table rows have the new value.
 Surrogate keys can be defined as unique across source systems, so that a John
Reynolds from system XYZ and a John Reese from system ABC might both be
identified in their respective systems as JR. However, once they are loaded into
the data warehouse, they are each assigned a unique, non-overlapping value.

Identify type and relationship between the dimensions and the facts
Consider following guidelines when defining the relationship between dimensions and
facts
 If an attribute in a dimension is a numeric value that can be meaningfully
aggregated in user reports, consider adding the numeric value to a fact table
 A dimension can have numeric values that don’t need to be aggregated in reports.
These numeric values can remain in the dimension.
 A fact table may contain only foreign keys, but no measures. This is called a
factless fact table and exists when the business measures consist of only counts.
This is a valid fact table design, but you may need to use distinct counts to
aggregate the counts, rather than simple fact row counts.
 The purpose for a dimension table is the ability to have multiple attributes for the
dimension key. If a dimension key does not have any additional attributes, you can
implement the dimension as a degenerate dimension in the fact table.
Creating separate snow flake dimension tables for variable number

of attributes of entities
For heterogeneous dimension items such as products, some products have 100
attributes to track and some products just 10. Having all these product details in the
same dimension table would make the dimension table very wide, with no values for most
attributes in any one row (sparse). One solution is to create one main dimension table
containing the shared attributes and create individual dimension tables that join to the
main dimension table to track the extended attributes
Create an association table when there is a many-to-many

relationship between two dimensions
An association table maps the relationship between attributes from different dimensions.
Example:
Consider there are two dimensions namely Authors Dimension and BookTitle Dimension.
One author can have multiple bookTitles and one Booktitle can have multiple authors.
This is many to many relationship between authors and booktitles.
Consider creating a separate association table which has the association between
Authors and Titles.
Association table
EmpId EmpName Age
1098 John 28
EmpId EmpName Age
1098 John 28
Also, the association is maintained without the need of any fact "happening" transactions
at the source. In general, the association tables have keys mapping in many to many
relationship but if underlying dimensions are type II, there is maintenance overhead. To
reduce overhead, we would suggest creating association table with business keys rather
than SKs. The key can easily found in Type II dimension with the help of EndKey

Guidelines for choosing an Outrigger table
If a dimension attribute is not unique for the business key then another dimension is
created also called a child dimension which would have reference of the parent
dimension and also has its own surrogate key. One business key record in parent
dimension can have in multiple records in the child dimension, the child dimension here
called as outrigger.
This is sometimes called a Multi-valued Snowflake Dimension or a Reference Dimension
An outrigger table is different from a snowflake table. A snowflake has lower cardinality
than the dimension key; An outrigger has higher cardinality. In a snow flake, the parent
dimension (which is normalized from star to snowflake) will have reference of snow flaked
dimension, where as here the outrigger will have the reference of parent dimension.
Example, Employee dimension is snowlaked to have EmpDimension and Status
Dimension. Here Emp Dimension will have an attribute called Status Dimension SK. In
outrigger (example below), the outrigger table AccountOutrigger will have an attribute to
store the Emp SK.
Outriggers are meant for improving maintainability, save space or increase flexibility.
Outriggers are not same as Snow Flake. Outriggers are created only when there is more
than one value for a dimensional attribute where snow flake is created just to take out
some dimensional attribute (normalization)
Avoid snow flaking the outrigger as it could result complex, less stable & more difficult to
maintain model
Example:
Employee Dimension
SK EmployeeName EmpId Location AccountId AccountType
1 Dave 123 Redmond 9287 Savings
2 Dave 123 Redmond 4235 Checking
3 Jim 124 Redmond 8923 Savings
Here employee information is repeated because employee has more than one account,
to reduce the redundancy and to have only one record per employee in employee
dimension, create the following structure using outrigger.
Employee Dimension
SK EmployeeName EmpId Location
1 Dave 123 Redmond
3 Jim 124 Redmond
AccountsOutrigger
AccSK EmpSK AccountId AccountType
1 1 9287 Savings
2 1 4235 Checking
3 3 Jim Savings

Conformed (Shared) dimensions
A conformed (or shared) dimension is one that is used with multiple fact tables, or in
multiple data marts or in multiple data warehouses. Shared does not mean only one copy
is present, it does mean that only one copy is loaded with formal loading process and
other copies will get replicated from the loaded copy. A dimension is said to be
conformed if it has been used across different schemas such as data marts, data
warehouses and thus maintaining the logically same set of dimension attributes. Using
conformed dimensions is critical to successful data warehouse design.
Business Entities (also called as dimensions here) can be reused for other facts such as
Products entity can be reused for Sales as well as Product costs
Conformed dimension ensures the dimension data is sourced from only one process and
any other derived dimensions for data mart purposes would have consistent information.
When there are multiple fact tables or data marts in the system then all these fact tables
and data marts should access one logical dimension per one business entity. Example:
There should be only one customer dimension the system that is being used by all data
marts or fact tables or outrigger tables or other snow flake dimensions.
This dimension can be replicated or horizontally partitioned to suit the requirements of
other data marts. Horizontal partitioning of a dimension means that some of the rows are
in one copy and some are in another copy. For example, one dimension might have all
the East region customers, while another has all the West region customers.
ETL process is to populate only one customer dimension in the system, not once for each
data mart or separately for other horizontally/vertically separated dimensions
Horizontal Data
Partitioned Mart
Customer
OLTP Dimension Horizontal Data
ETL Partitioned Mart
(Conformed)
Complete DW
Dimension
You should use conformed dimensions whenever possible instead of individual ones.
Data mart and Data warehouse will use the same dimension not multiple copies of the
same dimension, loading process is only once
Conformed dimensions are to be designed to satisfy all the data mart and data
warehouse needs. A dimension can be taken to higher granularity but cannot be taken to
lower granularity once fact tables are populated. As the confirmed dimensions are

designed to be shared among data marts and data warehouse, it should be designed as
lower granularity possible as per source data base schema
Changing Dimension
A data warehouse must deal with attributes that change values over time. This is not an
issue for most transaction systems, because they deal only with the present. But a data
warehouse will often present data historically for anywhere from two to ten years.
Organization changes, customer attribute changes, and other changes must be
managed.
Procedure for identifying changes in a dimension

 Identifying the attributes that would change overtime needs knowledge about the
OLTP system. In principle, any attribute except the key can change over time. You
need to plan for what to do in case an attribute changes.
 Identifying the attributes whose change history to be maintained in warehouse is a
business analysis decision
 Changes are to be well modeled in the data model modal and data mapping sheet
 Granularity and normalization to be properly represented in data modal
 Assess the data type of surrogate keys that would sufficiently allow you to support
changing dimensions
 Dimension Size considerations
o Data must be split based on attribute changes and usage
o Data that is used together is kept together to reduce joins
o Splitting a fat dimension into several smaller dimensions can improve
performance
o Snowflake schema with its separated dimensions performance slightly
better than the star schema with its wide flattened dimensions
 If the rapidly changing attributes are not textual and discrete, but are numeric and
continuous, then consider placing them in facts
 The validity of the data warehouse is thus dependent on how well changes in the
its dimensions are tracked
Slowly Changing Dimensions Explained

The Slowly Changing Dimension (SCD) should technically be called Slowly Changing
Attribute, because any attribute can be handled differently from any other. There are
logically three ways of dealing with a slowly changing attribute:
 Type I: The existing record is updated with change. Replacing the data is very
appropriate if the old data was erroneous or it is unlikely that the changes will
materially affect important analyses. Type 1 slowly changing dimensions do
require re-creating any aggregations that would be affected by the change.
 Type II: History is preserved, a new record is added to dimension and old record is
marked as inactive. EndDate or IsLatest is generally used to represent latest
record of a business key. A new surrogate key is created for every change
irrespective of the business key. Type II changes can cause a serious inflation in
the number of members of a dimension.

 Type III: Type III preserves a version of the history by adding it to the dimension
record in a different attribute. Typically, there are two states: current and previous.
In some cases, you may have 4 or 5 states. When the number of defined states is
exceeded, previous states are discarded. A Type III change does not create a new
row. A Type III change does require extra columns in the table schema. As with a
Type I change, a Type III change requires a dimension update, so you need to re-
process all aggregations after a change.
Slowly Changing Dimension Guidelines

Consider the following guidelines for addressing slowly changing dimensions:
Avoid Type II for dimensions with a large number of columns; use

number of smaller dimensions
If the SCD dimension having a Type II attribute and also has large number of columns
then any change in this attribute would result in adding entire row again to the same
dimension though there is no change in any of the other columns. Create smaller
dimensions as an alternative solution.
Consider loading with correct SK for Late arriving facts

In the case of an inferred dimension member, where you create the member surrogate
key before the attributes are available, do not create a new Type II record when the
attributes arrive. Instead, load the initial attributes as if they were Type II attributes. For
example, RecordX of fact1 is delayed and it is corresponding to customer when the
location was L1 (SK = 100). The location of customer is changed to L2 and so the
surrogate (SK=101). Now when the RecordX is added to Fact1 then you should consider
SK=100 instead of the latest SK (SK=101) of customer dimension
Consider not treating attributes as Type II attributes unless required

The idea behind of Type II attributes is to maintain history for business analysis to track
the changes over time. If the changes need not be tracked and unnecessary then
consider the attribute as Type I (overwrite the attribute value every time it changes)
Consider size of dimension record when identifying Type II attributes

Having more number of Type II attributes in a single dimension would result in adding the
entire dimension record for any change of Type II attribute. If the number of Type II
attributes are unavoidable, consider breaking the dimensions
Use a sufficiently large data type for surrogate keys

When estimating the size of a data type for a surrogate key, be sure to include the
additional records that will be created from Type II changes.
Example: In a dimension with an estimated 9,000,000 customers, you might be tempted
to use a small integer for the key, but one Type II attribute, with only one change per
customer would result in 18,000,000 records. To be safe, you may want to use bigint
keys for dimensions.

Use TYPE III SCD when you need current and previous values
Type III attribute changes are relatively rare in a data warehouse, but if you do need to
report only on the current and previous states, it can simplify reporting compared to a
Type II implementation.
Slowly Changing Dimensions – Scenarios and

Solutions
Process Type I updates in the correct sequence
With a Type I attribute in the warehouse, it is possible to have a source system that
maintains multiple change entries. When applying the changes, be sure to apply them in
the correct sequence so that the warehouse maintains the last possible value. For
example, the Employee location was New York at 9AM. It was changed to Los Angeles at
1 PM and then changed again to Houston at 4 PM. When extraction is performed at 6
PM, it has all the changes starting from last extraction. When the data is applied to the
warehouse, it is critical that the final value for the dimension member be the last
change—Houston.
There are two ways to accomplish this
 Batch Aggregation: Hourly aggregate the changes and load it at the end of the
day of each hourly data. In such a case if the update occurred multiple times with
in an hour then the latest change is propagated to warehouse.
 Loading in Intervals: Consider loading the warehouse hourly from each hourly
data extract and reduce the interval further if updates are frequent and need to be
captured
Split a dimension table when attributes change at different

frequencies
If there is a dimension has that Employee information and Department information,
consider splitting into two tables, with a foreign key to the department table in the
employee table. If both employee and department dimension are in the same dimension
and as they change at different rates, it will lead to huge cardinality particularly if both are
Type II in nature.
Use an Rapidly Changing Dimension (RCD) or Outrigger table for

Type II attributes that change frequently
The best solution is to treat the attribute (that changes very frequently) as rapidly
changing and create a separate Rapidly Changing Dimensions (RCD) for the attribute.
Add the surrogate key of this new dimension to fact table to maintain the current status
when the record is inserted in the fact. If, however, there is no fact record then it is
difficult to know the latest value of this attribute. One solution is to create a separate fact
table that has only the dimension and the RCD surrogate key. (In essence, this becomes
an outrigger dimension.) .

When Current information is required on a large Type II Dimension
table, consider joining on Business Key
Reports that access fact table joins Type II SCD dimensions to get latest instance of the
business key. But it results in self join to get latest SK for the business key whose key
referenced from the corresponding record in fact. Fact.DimSK = Dim.DimSK and
Dim.BusinessKey = SelfDim.BusinessKey and SelfDim.EndDate is NULL (EndDate is
typically a column on Type II dimensions to mark the start and end date of the Type II
change). The solution is to add business key in the facts, not in composite keys, as an
additional field would reduce the self join. This is particularly used for Type II SCD
dimension having large volume and reports direct access the underlying relational tables.
Typically the join condition would be
Fact.BusinessKey = Dim.BusinessKey and Dim.EndDate is NULL
If using SK key as a means to join the facts and dimensions to get latest information from
Type II dimensions would as
Select Current.CustomerName, Measure
From
Fact
Inner Join Customer old on Fact.SK = CustomerSK
inner join Customer Current on Old.CustomerId = Current.CustomerId
If using Business key, the query can be eliminated with a self join as
Select Current.CustoerName, Measure
From
Fact
Inner Join Customer Current on Fact.CustomerId = Current.CustomerId and
Customer.EndDate is NULL
This method of joining on the additional business key (instead of dimension surrogate
key) is particularly useful when the ―current‖ information is sought in the result set on a
large volume Type II dimension table. This method will not eliminate the need of
surrogate key in the facts, it only suggests having an additional attribute in fact table as
dimensional business key for improved performance queries.
Use a Full Outer Join to identify Inserts, Updates, and Deletes

While loading warehouse, you need to check for the existence of the new record in the
warehouse. If record does exist, then you have a Type I or Type II effect depending on
the attribute that is changed. If the record does not exist, you must add one. In addition,
you have the case where the record exists in the warehouse, but is not in the new
extract. This signals a logical deletion. Comparing the existence of the values in both the
new and old data sets can be slow, particularly when the tables are large.
A good solution is to use a Full Outer Join between the new table and the existing
warehouse dimension table, using the business key for the join. You can then determine
Updates, Inserts, and Deletes in one query.
Example

SourceTable Full Outer Join Dim On SourceTable.BusinessKey = Dim.BusinessKey and
Dim.EndDate is NULL
If the resulting dimension Business Key is NULL then it means the corresponding record
does not exist in warehouse or that it has expired. In this case a new record is to be
inserted in the warehouse.
If the resulting source Business Key is NULL, then it means the corresponding record has
been deleted and should be flagged with an EndDate in the warehouse table.
If neither Business Key is NULL then it means the corresponding record exists in
warehouse and is active. In this case the appropriate Type I or Type II processing can be
performed.
Using the Full Outer Join approach combined with a Hash value for attributes, you can
optimally find out the Type I and Type II attribute changes, as given below
 Step 1: Identify Inserts, Updates, Deletes using Full Outer Join
 Step 2: For Inserts, directly insert into dimensions
 Step 3: For Deletes, update dimensions for marking as inactive
 Step 4: For Updates, compare the Type I and Type II hash values to determine
subset of records to have Type I effect and subset of records to have Type II effect
Rapidly Changing Dimension

A dimension is considered to be a rapidly changing dimension (RCD) if one or more of its
attributes changes frequently in many rows. Consider the following guidelines if you are
dealing with RCD.
Determine if an attribute is Rapidly Changing

When an attribute where you need to maintain history (a Type II attribute) changes
frequently, you can end up with a huge inflation of rows in the dimension table. To avoid
inflating the dimension table, you can move the attribute to a separate dimension, called
a Rapidly Changing Dimension (RCD). The threshold for when you would move an
attribute to an RCD is based on the size of the dimension, the number of changing
attributes, and the frequency of change. As a broad rule of thumb, if an attribute would
cause a dimension to double in size more than once per year, you may want to move it to
an RCD. The cut-off limit will vary depending on your resources and business
requirements.,
Slow down the rate of change by using Tracking bands

The number of changes the Rapidly Changed Attribute can be reduced if a range is
defined instead of precise value of the attribute, these ranges also termed as Tracking
Bands. This would reduce the number of changes in RCD unless it crosses the range
values. As the range it belongs to will not change as frequently as the value, it will not be
considered as RCD. This approach would eliminate the need of handling RCD in another
set of dimensions. Defining the range is not feasible for every RCD but can be
implemented wherever ―range‖ can be implemented and business requirements are met
with the range definition. Example, If business requirements does not demand actual
age of employee, and if age is considered to be rapidly changing attribute, then consider

defining the range for the age. The range include [10-20], [20-30] etc., So, even if
employee age changes from 25 to 26, the range he/she belongs does not change.
Employee (source)
EmpId EmpName Age
1098 John 28
EmpDimension (Warehouse)
EmpSK EmpId EmpName AgeRange
342 1098 John 20
Determine if Type II change is needed for Rapidly Changing

attributes
Determine if having history maintained for rapidly changing attributes is important for
business reports/analysis needed. If it is not required, a type I change on such attributes
would not create any cardinality or performance problems.
Use static time components rather than changing attributes

Rather than use Age as an attribute, you can store the Birth Date. Birth Date will not
change, but can be used to calculated the exact age as needed.
Move Rapidly Changing Attributes to another dimension

When you have a rapidly changing attribute, you avoid inflating the dimension table by
moving the attribute to its own dimension, with its own The solution for a dimension with
rapidly changing attributes is to break the offending attributes out of the dimension and
create one or more new dimensions. Consider the following example.
Example:
Consider a customer dimension having status attribute which is identified as rapidly
changing attribute and retaining history is necessary. As a solution, create a
CustomerStatusDimension and add the foreign key for that dimension to each row of the
fact table as the fact data is loaded. Rather than add the CustomerStatusDimension to
the main fact table, you can also create a separate fact table including the Date, the
CustomerID, and the the CustomerStatusDimensionID. This solution would also allows
end users to analyze the history of status changes.
Create Junk dimension to store multiple RCD attributes

If you create a large number of rapidly changing attributes and create a dimension for
each, the number of dimensions can become very large, and the fact table can end up
with a large number of foreign keys. One approach is to combine several of these mini-
dimensions into a single physical dimension. This approach works because each RCD
typically has a very low cardinality. This is the same technique used to create what is
often called a junk dimension—a dimension that contains unrelated attributes to get them
out of the fact table. A junk dimension in the fact table reduces the number of extra
foreign keys in the fact table, but it is still necessary to use the fact table to resolve the
relationship between the customer and the attributes.
To keep the RCD attributes completely out of the main fact table, create a separate fact
table with a Date, the CustomerID, and the RCD composite key.

Modeling Date and Time dimensions
There will be only one Date dimension and one Time dimension across all facts and
dimension tables. Not all date and time fields needed to be mapped to date and time
dimensions. Map a date-time field to a Date or Time dimension only when you need the
extra attributes from the dimension.
Typically the lowest granularity of the Date dimension is Day, and the lowest granularity
of the Time dimension is Second. Many warehouses do not need a Time dimension, but
a Date dimension is used in virtually every data warehouse system.
In general, create separate dimensions for Date and Time

If it is necessary to extract contiguous chunks of time that cross day boundaries (for
example 11/24/2000 10 p.m. to 11/25/2000 6 a.m.), then it is easier if the hour and day
are in the same dimension. Otherwise, separate dimensions for Date and Time are easier
to manage and to query.
It is easier to analyze cyclical and recurring daily events if Date and Time are separate
dimensions.
Example:
Number of meetings active at 9AM this week. This is an easy query if Date and Time are
separate; it is much more complex if Date and Time are combined into a single
dimension.
Following are issues of having date and time dimensions together in one DateTime
dimension
Issue#1:
Combining Date and Time into a single dimension creates an extremely large dimension,
particularly if you use Seconds as the grain. The problem is not as great if you use Hour
as the grain of the combined table
 Cardinality of Date and Time Dimensions (separated) for 10 years of data

Date Dimension: 10 * 365 = 3650
Time Dimension (granularity: seconds): 24 * 60 * 60 = 86400
 Cardinality of Date and Time Dimensions (combined) for 10 years of data

DateTime Dimensions (granularity Hourly): 10 * 365 * 24 = 87600
DateTime Dimensions (granularity Seconds): 10 * 365 * 24 * 60 * 60 = 315,360,000
The more records in the DateTime dimension, the slower query performance will be.
Issue #2
Having both date and time dimension in the same dimension can cause interpretation
problems for fact tables that have granularity of a Day. It would be easy to inadvertently
enter two records into the fact table for the same day, because of the extra granularity in
the dimension table.

When do you need Time information in the Fact Table?
 Business analysis needs time granularity for roll up or for measures/aggregations
based on transaction date time
 When the business analysis has requirement such as number of transactions
happened between 3 PM and 4 PM on 1/1/2006
 If no date time can be extracted related to a transaction date time, alternately Synch
Date Time is used in presentation. Example: There is a change of employee location
in one of the employee master tables at source but source didn’t capture the
date/time of the change. To have Type II affect on employee dimension, synch
date/time is used as a representation of location change occurred date/time, it is
opted only in case of date/time information is needed for ware house dimension and
the same is not present in source tables.
 When source transaction has date and time information and fact is loaded with no
aggregations
 Incase of date time information is not present, synch date time is used as a record
changing/event occurred time
Store Date keys in the Fact Table

Consider using a meaningful surrogate key for the Date dimension as an integer with the
structure YYYYMMDD. This makes it easy for ad hoc users and analysts to understand
the underlying date. You cannot use this method if both date and time are stored in one
dimension, because the structured key would be too long.
Typically a surrogate key should not contain any business value and should not be used
for any specific business rules. Dates are an exception for ease of use where date
dimension surrogate key format is suggested to have YYYYMMDD having data type as
INT(4)
Similarly for time dimension, the suggested format is HHMMSS
Time "4:23:47 PM" would have a surrogate key as 162347
Though it is readable avoid doing any calculation using the surrogate keys directly
without mapping to the corresponding date/time dimension.
A date dimension with one record per day will suffice if users do not
need time granularity finer than a single day
If the business analysis or reports does not need exact time of the day the transaction
occurred, then the Time dimension is not needed for this fact.
A separate Time dimension table should be constructed if a fine time

granularity, such as minute or second, is needed
Consider mapping Time dimension key to the facts only when the business requirement
demands fine granularity of the facts. Having the granularity to the seconds may
significantly increase the data volume in the warehouse. If Time granularity is not
required, you can pre-aggregate the fact table to the Day grain.

If exact event time is needed in the warehouse, the source system
must provide data at the necessary grain
Business needs such as the number of books purchased between 3 and 4 PM of a day
then the extracted transaction information from source transaction tables should have
granular information at least up to the hourly. Business analysis requirement demands
are met but limited to with what is available in source systems.
Guidelines on when to add timekey information to facts/dimensions

Just because there is a time-related field is available to put into the fact table, that doesn’t
mean it needs to be there. The following guidelines would help to decide whether to add
timekey information to dimensions
 Do not consider mapping to time dimension if there is need to aggregate based on
Time dimension attributes. For example, a business requirement to group the report
by hour would require the Hour attribute from the Time dimension.
 If there is no granularity to the hour, minute, or second level in the fact table, do not
consider mapping time dimension
Summarizing data for a range of days requires joining only the date
dimension table to the fact table
Even if fact table rows have a grain of seconds, you do not need to join to the Time
dimension if you only need to aggregate at the day or higher grain.
Example
For example, suppose that the fact table contains the detailed information that Book1
sold at 5 hours 30 minutes 28 seconds on 2/1/2006
When these transactions are aggregated at the end of the week to the day level, it would
take away the time information but keep only the day information. In this case, there is no
need to join to the fact table.
Example (aggregated)
Number of books sold on 2/1/2006
Add convenient attributes to Date and Time dimensions

It is often convenient to maintain attribute columns in a date dim and time dim to provide
additional convenience to business users such as calculating whether a transaction
happened on a holiday, a specific day in the financial cycle, at a peak time (for examlle
from 10 AM to 2 PM). You could even store aggregated values—such as the total
number of holidays in the month—in each row of the Date dimension.
With multiple time zones, store local and GMT times.

 If data comes from multiple time zones, adjust all times to a standard time sone, such
as GMT. You can also store an additional attribute for how to handle a specific
Daylight Savings time calculation. With multiple time zones, consider using GMT
time in the dimension, and storing the time zone delta in a fact table column. In this
case, the time zone shift is similar in function to a Rapidly Changing Dimension.
 Special care needs to be given when reporting multi-zone times as the date can
easily be misrepresented if the wrong Time attribute (GMT or local) is used.

 Take particular care when dealing with day aggregations when multiple time zones
are involved. Does the day begin at local time, or at a world-wide standard time?
Fact tables
Fact tables contain data that describes specific events within a business, such as bank
transactions or product sales
Types of Fact tables

There are basically two ways transaction-based fact table records are sourced, plus
another way for non-transaction counts.
 Transaction table: This is the transactions as it happens on the source. Each
transaction and its measure such as price is extracted, transformed and loaded to
facts at regular intervals of time. Ideally, if there are 100 transactions happened on
a given day on the source and if fact is designed at the same level of granularity, in
transaction fact table you would see 100 records added once ETL process finishes
loading.
 Snapshot table: This is not exactly a transaction on the source. It is generally to
get the snapshot of the system. Example: Getting all customer account balance as
of end of the day. Even when there is no transaction on any particular account, the
snapshot fact table would gather all the current account balance information and
keep it in the fact table to exactly know what the account balance of a given
account is on a given date.
 Factless Fact: This fact generally is not sourced from an OLTP transaction.
Occasionally you need only to capture whether an event happened. Example: You
might want to capture employee attendance, you only need to know which
employees came to work and which employees were absent. Absent employees
will not come from attendance sheet. This is prepared separately with the help of
employee table and attended employees to load into factless fact. Factless facts,
as the name says, will not have any measures. Factless facts are often used with
counts, and particularly distinct counts. For example, how many distinct employees
worked during a given week.
Ideal fact measures are additive in nature but it is not a requirement that a fact table
have numeric measures. For example, consider list of computers added and deleted
from an environment. The fact table may capture the computers that are added and
deleted. There is no measure here but count of computers becomes the measure.
There is no measure stored in the fact table for this kind of transaction..
Identifying Facts and Measures

 Attributes of fact table are in general containing surrogate keys of the respective
dimension tables
 The OLTP transactions or OLTP snapshot, are the source of fact records and
would decide what business keys would make up to OLTP transaction. During the
load process, these business keys are mapped to the business key of the
respective dimensions to extract the corresponding surrogate keys.
 Numeric values that resulted from an OLTP transaction typically end up as
measures in the fact table. One fact table can have more than one measure,
including custom calculated measures based on measures derived from the OLTP
system..
 OLTP transactions are almost always appended. Existing records are not updated.
 Calculated measures that include only values from a single row may be pre
computed during the load process and stored in the fact table, or they may be
computed on the fly as they are used. Determination of which measures should be
pre computed is a design consideration. There are other considerations in addition
to the usual tradeoff between storage space and computational time.
o All the facts have a specific granularity
o A set of foreign keys constitutes a logical concatenated key
o The fact table typically contains the largest volume of data
 The needs of the analyst must be supported by the facts in the fact table, there
must be measures which have relevance to the business goals which the
organization seeks to fulfill
 FACTS are measurable
 Facts are by nature dynamic and variable over time
 Fact tables have an unlimited number of rows—as opposed to dimension tables,
which have a finite cardinality, even if very large.
 Fact data represents the measurable results of a business event, or a business
state and Dimensional data provides the context for that event or state.
 Fact events vary depending on the business
 It is important to understand the basic business process and identify the business
events and states when developing a dimensional data warehouse.
 Fact tables should not contain descriptive information or any data other than the
numerical measurement fields and the foreign key pointers to the corresponding
entries in the dimension tables.
 Identifying Measures. The most useful measures to include in a fact table are
numbers that are additive.
o Additive measures. allow summary information to be obtained by adding
various quantities of the measure, such as the sales of a specific item at a
group of stores for a particular time period.
o Semi-additive measure: Is additive across all dimensions except time.
The classic example of a semi-additive fact is a daily account balance. It's
perfectly legitimate to add the balances of every account on a specific
day—the fact is additive across the Account dimension—but the end-of-
month balance of my savings account is not the sum of the balances for
the past month—the fact is non-additive across time. When aggregating
across time you need a special function such as the last non-empty value,
or the first non-empty value, or the average. if you include semi-additive
facts in a dimensional database, you should ideally use a front end tool
that manage the appropriate semi-additive aggregation. At the very least,
the front end tool should warn users if they attempt to summarize across a
semi-additive dimension.
o Non-additive measures: Non additive measures are those which cannot
be summarized like additive/semi additive measures. In general, you want
to convert non-additive measure to additive measures. You can often do
that by adding a weighting factor. For example, rather than store a rate,
you can multiply the rate by a weight. After summing the weighted rate and
also the weights, you can easily calculate a weighted average rate for any
level of summarization.

Each fact table also includes a multipart index that contains as foreign keys the primary
keys of related dimension tables, which contain the attributes that describe the fact
records.
Granularity
Granularity is defined as the lowest level of detail of the data in the fact tables particularly
the measures. The more granular the data, the more detailed your reports and analysis
can be. However, excess granularity consumes space and increases complexity. It is a
trade off because once you aggregate a fact table, you cannot decompose the
aggregates.
There are three types of grain: Transaction, Periodic Snapshots and Accumulating
Snapshots.
 Transactions - This is the basic detail record of the warehouse as it happened in
the OLTP systems. Example: Customer purchases a book or money transfer from
savings to checking. The Time grain of a transaction is essentially continuous.
 Periodic Snapshots - This is a set of detail records that are repeated over time. It
is to get number of customers existing, inventory of the system etc., at the end of
the day. The grain of a snapshot is the interval at which the snapshot is taken—
whether hourly, daily, weekly, monthly, or yearly.
 Accumulating Snapshot - This is a snapshot that can change over time. The
common example is the student enrollment example. The student's data doesn't
change but the dates for enrollment, admission, etc.
Fact tables Guidelines

Consider the following guidelines when designing your Fact tables:
Aggregating Data in Fact Tables

Data is generally aggregated to store the summarized information over a period of time of
the fact measures such as the number of units of a product sold in a given month. The
source for these aggregations is the detailed granular fact which collects transactional
information from source at regular intervals of time. In some data warehouses, the
aggregation is performed directly by the ETL process to move it into an aggregated fact
table where the detailed transactional records do not have any value for the report users.
For example, executives may not need to see the number of units sold at 6th hour of a
day. Consider the following guidelines when designing your Fact tables that contain
aggregate data:
 Aggregating data in the fact table should only be done after considering the
consequences. Aggregation is the process of calculating summary data from detail
records. It is often tempting to reduce the size of fact tables by aggregating data
into summary records when the fact table is created. Detailed information is no
longer directly available to the analyst. If detailed information is needed, the detail
rows that were summarized will have to be identified and located, possibly in the
source system that provided the data. Fact table data should be maintained at the
finest granularity feasible.
 Avoid aggregation in warehouse tables if you will use an OLAP cube as the source
for reports and analysis. In this case, the cube provides the aggregation
functionality, but if you need to get to details it still available in the fact table.

 If the reports query the underlying relational structure then aggregation will make
queries faster
 It would make sense to archive only aggregated information rather than all the
warehouse records.
 To enhance relational queries, you may want to store the summary records in
separate tables
 Mixing aggregated and detailed data in the fact table can cause issues and
complications when using the data warehouse. For example, a sales order often
contains several line items and may contain a discount, tax, or shipping cost that is
applied to the order total instead of individual line items, yet the quantities and item
identification are recorded at the line item level. Summarization queries become
more complex in this situation. If you are using a tool such as Analysis Services,
leave the aggregation to the tool
Determine if a source numeric field is to be measure or an attribute

Sometimes during the design process, it is unclear whether a numeric data field from a
production data source is a measured fact or an attribute. Generally, if the numeric data
field is a measurement that changes each time we sample it, it is a fact. If it is a discretely
valued description of something that is more or less constant, it is a dimension attribute.
One way of making the distinction is whether it is a value you would want to use as a
grouping item for a report. If it is, then it is an attribute. Often, an aggregatable measure
also appears as an attribute by bucketing the values. For example, order quantity is
typically a measure, but Order Size (i.e., Order Quantity bucketed into 5 discrete buckets)
is an attribute.
Storage is cheap and time is expensive:

This is true in most of the data warehouse processes. This means that you should store
measures needed for reports as part of the fact measures. Ideally, all the measures
would directly be coming from source transactions but any calculations on these
measures can be computed and stored in another measure attribute of the fact. This
reduces the report query time as there is no more computation involved. If reports
directly query relational tables, create aggregation tables with pre-computed totals to
enable reports to query at different grains with no computation during the reporting query.
Retain all measures coming from source in your fact

Have the original details of the measurements which are used for calculating the
additive/non additive fields, can be used for later aggregations. Example:
If you keep all the measurements that are available from source in the fact table, any
possible custom measurements later on the fact table can be done without changing the
design of the fact table. This ensures if one calculation is feasible to extract from source
system then the warehouse system provides the same feasibility.
EmployeeHours OLTP:
EmployeeId Date StartTime EndTime
34 1/1/2006 9 17
21 1/2/2006 12 19
56 1/1/2006 13 20.5

EmployeeHoursFact:
EmployeeKey DateKey NumberOfHours
11 45 8
12 46 7
13 45 7.5
Here in the above fact the original measures such as StartTime and EndTime are not
included, because of this the measures such as number of employee start working on
1/1/2006 before 10AM cannot be answered though it is possible to deduct the number
from the source schema.
Here the guideline is when calculating custom measures on fact, retain the source
measures as part of the fact.
Keeping the source measures also makes it possible to show an audit trail of how the
value was calculated.
If the granularity of two facts tables is the same, consider combining

them
As a general rule when you start your dimensional modeling each business function
should have its own fact table. As your dimensional model evolves try combining the fact
tables where the granularity is the same between the fact tables. Combing the facts
makes analysis easier and eliminates the need for unnecessary joins.
InventoryFact
ProductKey DateKey Inventory
23 44 289
43 44 435
65 44 658
InvoiceFact
ProductKey DateKey Sold

23 44 32
43 44 45
65 44 59
Both InventoryFact and InvoiceFact both have same granularity (Product – Daily) and
represent single business entity products SKU number. Even if there is nothing sold for a
product on a given day, the record can be added to the combined fact as inventory fact is
a snapshot fact, not based on whether the product is sold today or not, with
corresponding sold measure as 0. The cardinality of the combined fact would be same as
Inventory Fact.
The combined fact would look like the following

ProductSKUFact
ProductKey DateKey Inventory Sold
23 44 289 32
43 44 435 45
65 44 658 59
Combining these two fact tables into one also enables calculated measures such as
percentage of SKUs are sold today, that would need to use measurements from both the
facts, without this combination it is going to be cross fact and would lead to performance
problems.
Use indexed views to improve the performance of relational queries

The performance of queries against the relational data warehouse can be significantly
improved when you use indexed views. Indexed view is a pre-computed table comprising
aggregated or joined data from fact and possibly dimension tables. A indexed view is one
way of implementing particulary when the requested data set is summary values that
were not pre computed or aggregated
Fact table Index Guidelines

 Consider having a composite primary key for your Fact table. Consider composite
key (typically composite of all related dimension surrogate keys) as the primary key
for Fact tables. Conversely, you can add an Identity column as a primary key in the
fact table.
 Some dimensional modeling will have business keys also being part of facts for
enabling report performance. These business keys do not need to be part of the
composite key of the fact as their corresponding dimension SKs are declared in
composite key. There can be non clustered index to cover these business keys for
better reports performance.
 Business keys which are used by ad hoc queries should be part of non clustered
index. Non-clustered indexes will the increase efficiency and responsiveness of ad-
hoc queries. In the absence non key attributes covered by an index, would result
in a clustered index scan (provided clustered index is created & available) which is
expensive
Avoid updating fact records

Fact table represents the transactional record that was created by source business
processes. There should not be any record in the fact table that does not originate in a
business transaction at the source. At the same time, any change of the value of a
measure on the source system should result a new record in the fact, never an update of
the fact table. If you use an OLAP tool, any update of fact table records (apart from end
date if exists) would require refreshing the cube. Refreshing the cube having huge data
volume may take longer time and may result in down time of the system.
Keep the data types of the fact table columns as short as possible
Fact records are already huge in size; for ex. use tiny int where possible instead of INT. It
is suggested to adopt the same data type for measurements that source system has. If
you create aggregate tables, be sure to change the data types to be large enough to
handle the maximum aggregated values.
Fact loading should not create records that are not part of OLTP
system
Capture only transactional records from OLTP. Avoid having the ETL process add new
records that were not exactly sourced from OLTP systems. Example, if Customer1
purchased Book1 then fact table should have only one entry for this transaction. Adding
records such as Customer1 did not purchase Book2, Customer1 did not purchase Book3,
will not add any value and should be avoided.
Fact records are considered as event records – i.e. transactions processed on the OLTP.
No fact records should be created by the ETL, it is just a transformation.
Business needs must determine the minimum granularity of the fact

table
Fact table granularity comes from business needs, you can’t have further granularity than
OLTP records. It is driven by the business needs. Example: – if the business requires to
track only aggregated values for the entire day instead of hourly basis, then use day
aggregate as the base granularity
Ensure fact table measures have relevance to the business goals

Though it is possible to create multiple measurements from the source measures, the
fact measures should represent business requirement. Create only those measures that
are required for the reports/analytic needs of the business users.
Ex: If Percentage sold is not a requirement for business people, do not create this as a
measurement just because it is possible to create from the available data
It is suggested to try creating additive measures and avoid having non additive measures
in the fact tables.
The measures of the fact table are numeric in nature and additive. But some
measurements are non additive and semi additive. Semi additive measures are those
which cannot be summed up across all the products but can be summed of the same
product across the year. Non additive measurements are such as ratio, percentage etc.,
but all these measurements are to be numeric in nature. For non additive and semi
additive, custom aggregations are created but for this original source additive
measurements are to be part of fact table to calculate custom aggregations.
Aggregation to higher grain

Aggregations of the facts can be performed either going to higher grain within the
dimension, or by removing a dimension. If going to higher grain, you may need to create
a shrunken dimension table in the dimension or split out the higher grain attributes into a
snowflaked table. As an example of aggregations on non-hierarchy attributes, a fact table
may track the products with color, location so the dimension has a surrogate key on the
products which has two different records for two different colors for the same location and
same product. But when the fact table is aggregated to remove the color hierarchy, then
there should be another dimension with only one record for product and location (with no
color). As an example of aggregations on hierarchy attributes, consider a dimension
having city, state and country. If the aggregation of the fact table is performed to the
state level then this dimension cannot be used as the granular is higher. So, use a
shrunken dimension or a showflaked dimension table at the appropriate grain.
Determine when and where calculated measures are pre-computed.

Calculated measures are those measures which are not directly available from source
but computed from the measures coming from source.
Example: Number Sold is the source measure. Percentage of the items sold per city is
the measure calculated based on number of items sold in the city to the total items old in
the given day.
Calculated measures can be created in the facts while loading enabling the reports which
query the relational data warehouse table directly.
If you are using an OLAP tool such as Analysis Services, it may be easier to do
calculations that involve aggregated values (such as percent of total) in the logical cube
design where both detailed and aggregated values are accessible.
Keys and Indexes

This section describes guidelines for creating keys and indexes on different objects of
physical warehouse schema.
Types of Keys and Indexes

Consider the following types of keys:
 Surrogate Key: A key to represent uniqueness in dimensions and acts as a method
of being referenced from fact tables. It generally an integer data type and get
incremented for each key, to act as Identity column of the dimension. Details are
surrogate key is explained in the earlier sections of the chapter.
 Natural Key: Also termed as Business key. This is the primary key or the unique
key on the source system to represent the identification of entity uniquely. Typically
this is also part of physical implementation of warehouse schema participating in
dimension structure but will not be as a key. Natural key is just another attribute of
dimension structure.
Consider the following types of Indexes:
 Clustered: Need of having indexes on the fact tables and dimensions are very
crucial in warehouse and should be optimized for data access queries. Typically
clustered index on dimensions is created on surrogate key and surrogate keys used
in the fact will be a composite key for fact tables.
 Non-Clustered: There is a requirement of creating non clustered indexes on
measures and on other widely accessed columns of the fact table.
 Covering Index: Covering index will contain all the columns of the table, in this
context it is for fact table. The idea behind of creating covering index is boosting
performance for any queries need not be part of any other index At the cost of index
size.
Key and Index Guidelines

Consider the following Key and Index guidelines for your dimensional model:

 Avoid using GUIDs as keys. GUIDs may be used in data from distributed source
systems, but they are difficult to use as table keys. GUIDs use a significant amount of
storage (16 bytes each), cannot be efficiently sorted, and are difficult for humans to
read. Indexes on GUID columns may be relatively slower than indexes on integer
keys because GUIDs are four times larger. GUIDs are not generally accessed in the
query and so there is no need of creating indexes on GUIDs
 Have non clustered indexes created on the business keys if they are part of facts
 Have clustered index created on dimension surrogate keys in facts (or create
clustered index on Identity column and use non-clustered for SK FK’s)
 There is no need to create separate indexes on measurements as they are accessed
via the surrogate keys or business keys
 For each dimension table, add a clustered index on the surrogate keys (primary key)
 Make sure to use the index in the same order it is created. For example, if an index is
created on columns F1, F2 in the order, try having your query filters out first on F1
and then F2
 Analyze the query execution plan for specific report queries and sample ad hoc
queries before optimizing the usage
 Keep index keys smaller, do not include all the columns of the fact in the covering
index
 Use Index Tuning Wizard to tune your indexes based on query load
 Uses Indexed views to improve query performance that access data through views


08 - Dimensional Modeling

Hochgeladen von

Dokumentinformationen

Originalbeschreibung:

Copyright

Verfügbare Formate

Dieses Dokument teilen

Dokument teilen oder einbetten

Freigabeoptionen

Stufen Sie dieses Dokument als nützlich ein?

Sind diese Inhalte unangemessen?

Copyright:

Verfügbare Formate

08 - Dimensional Modeling

Hochgeladen von

Copyright:

Verfügbare Formate

8

Copyright © 2006 by Microsoft Corporation. All rights reserved. By using or providing

Copyright © 2006 by Microsoft Corporation. All rights reserved. By using or providing

Copyright © 2006 by Microsoft Corporation. All rights reserved. By using or providing

Physical Schema Structures

Product Dimension Store Dimension

 Snowflake Schema. A variation of a star schema is called a snowflake. In a

Designing a Warehouse Schema

Identify Facts and Fact Table Grain

Copyright © 2006 by Microsoft Corporation. All rights reserved. By using or providing

Identify Dimension Keys and Attributes

Steps in designing Snow Flake Schema

Guidelines for choosing Star vs. Snowflake

Consideration of Operational Data Source (ODS)

Copyright © 2006 by Microsoft Corporation. All rights reserved. By using or providing

Copyright © 2006 by Microsoft Corporation. All rights reserved. By using or providing

Copyright © 2006 by Microsoft Corporation. All rights reserved. By using or providing

Copyright © 2006 by Microsoft Corporation. All rights reserved. By using or providing

Avoid encoding record information in Surrogate key

Denormalize snowflake dimensions for end users

Consider adding zero key for Dimension tables

Do not combine attributes from different business entities into a

Create a surrogate key as the primary key of each dimension

Use a simple integer for Surrogate keys

Copyright © 2006 by Microsoft Corporation. All rights reserved. By using or providing

Creating separate snow flake dimension tables for variable number

Create an association table when there is a many-to-many

Copyright © 2006 by Microsoft Corporation. All rights reserved. By using or providing

Copyright © 2006 by Microsoft Corporation. All rights reserved. By using or providing

Copyright © 2006 by Microsoft Corporation. All rights reserved. By using or providing

Procedure for identifying changes in a dimension

Slowly Changing Dimensions Explained

Copyright © 2006 by Microsoft Corporation. All rights reserved. By using or providing

Slowly Changing Dimension Guidelines

Avoid Type II for dimensions with a large number of columns; use

Consider loading with correct SK for Late arriving facts

Consider not treating attributes as Type II attributes unless required

Consider size of dimension record when identifying Type II attributes

Use a sufficiently large data type for surrogate keys

Copyright © 2006 by Microsoft Corporation. All rights reserved. By using or providing

Slowly Changing Dimensions – Scenarios and

Split a dimension table when attributes change at different

Use an Rapidly Changing Dimension (RCD) or Outrigger table for

Copyright © 2006 by Microsoft Corporation. All rights reserved. By using or providing

Use a Full Outer Join to identify Inserts, Updates, and Deletes

Copyright © 2006 by Microsoft Corporation. All rights reserved. By using or providing

Rapidly Changing Dimension

Determine if an attribute is Rapidly Changing

Slow down the rate of change by using Tracking bands

Copyright © 2006 by Microsoft Corporation. All rights reserved. By using or providing

Determine if Type II change is needed for Rapidly Changing

Use static time components rather than changing attributes

Move Rapidly Changing Attributes to another dimension

Create Junk dimension to store multiple RCD attributes

Copyright © 2006 by Microsoft Corporation. All rights reserved. By using or providing

In general, create separate dimensions for Date and Time

 Cardinality of Date and Time Dimensions (separated) for 10 years of data

 Cardinality of Date and Time Dimensions (combined) for 10 years of data

Copyright © 2006 by Microsoft Corporation. All rights reserved. By using or providing

Store Date keys in the Fact Table

A separate Time dimension table should be constructed if a fine time

Copyright © 2006 by Microsoft Corporation. All rights reserved. By using or providing

Guidelines on when to add timekey information to facts/dimensions