BIT415 / 515

Data Analytics Techniques


Dimensional Modeling and Star
Schemas
First Course on Dimensional Modeling
The logical and physical designs are the
cornerstone of the Data Warehouse
While our prior textbook focused on Inmon's
theories, in this section, we will utilize
Kimball's philosophy for dimensional
modeling
Beyond Inmon's theories of single star
schema data marts, we will explore Kimball's
Data Warehouse Bus Architecture
Inmon vs. Kimball
Bill Inmon is sometimes called the father of data
warehousing.
First to define and champion the concept of a data warehouse
Credited with the term data warehouse.
Ralph Kimball is sometimes called the father of business
intelligence.
Codified the star schema and snowflake data structures
Defined many Business Intelligence concepts, such as:
Data marts
Dimensional hierarchies
Base and aggregate metrics
Drilling
In short, Kimball developed the science behind modern
analytical reporting tools
Both men have made immeasurable contributions to the
field
The Case for Dimensional Modeling
What is Entity Relationship Modeling?
We've already covered ER Modeling
Some traditional ERDs (for example, SAP)
have 1000s of entities that are not easily
queryable
This is a show stopper for BI
End users cannot remember the model
Software cannot easily query an ERD. Optimizers
may make wrong choices
Traditional ERDs are neither intuitive nor high
performance
What is Dimensional Modeling?
Design technique that presents data in a format that
is intuitive and allows high performance access
Has one table with a multipart key called the fact table
Has a set of smaller tables called dimension tables
Each dimension table has a single part primary key
that corresponds to one of the components of the
multipart key in the fact table
What is Dimensional Modeling? (continued)
Fact table has a multipart key made up of the FKs
from the dimensions. It represents a many-to-many (M-M)
relationship among the dimensions
Most useful fact tables contain facts that are numeric
and additive
Fact additivity is crucial
Dimension tables usually contain descriptive textual
information. These attributes are the source of the most
interesting constraints and are usually the row headers in a SQL
result set.
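The structure described above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not a design from the textbook: the retail tables, column names, and sample rows are all assumptions. It shows a fact table whose multipart key is built from the dimension keys, and a query in which dimension attributes become the row headers while an additive fact is summed.

```python
import sqlite3

# Hypothetical retail star schema; every table and column name is illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE date_dim    (date_key INTEGER PRIMARY KEY, full_date TEXT, quarter TEXT);
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE store_dim   (store_key INTEGER PRIMARY KEY, store_name TEXT, city TEXT);

-- The fact table's multipart key is made up of the dimension foreign keys.
CREATE TABLE sales_fact (
    date_key     INTEGER REFERENCES date_dim(date_key),
    product_key  INTEGER REFERENCES product_dim(product_key),
    store_key    INTEGER REFERENCES store_dim(store_key),
    units_sold   INTEGER,   -- numeric, additive fact
    dollar_sales REAL,      -- numeric, additive fact
    PRIMARY KEY (date_key, product_key, store_key)
);
""")
conn.executemany("INSERT INTO date_dim VALUES (?, ?, ?)",
                 [(1, "2024-01-15", "Q1"), (2, "2024-04-20", "Q2")])
conn.executemany("INSERT INTO product_dim VALUES (?, ?, ?)",
                 [(1, "Apples", "Fruit"), (2, "Cherries", "Fruit")])
conn.executemany("INSERT INTO store_dim VALUES (?, ?, ?)",
                 [(1, "Store A", "Chicago"), (2, "Store B", "Atlanta")])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?)", [
    (1, 1, 1, 10, 25.0),
    (1, 2, 1, 4, 20.0),
    (2, 1, 2, 7, 17.5),
    (2, 1, 1, 3, 7.5),
])

# Dimension attributes supply the row headers; the additive fact is summed.
rows = conn.execute("""
    SELECT s.city, p.product_name, SUM(f.units_sold)
    FROM sales_fact f
    JOIN product_dim p ON p.product_key = f.product_key
    JOIN store_dim s   ON s.store_key   = f.store_key
    GROUP BY s.city, p.product_name
    ORDER BY s.city, p.product_name
""").fetchall()

for city, product, units in rows:
    print(city, product, units)
```

Because units_sold is additive, the same fact table can answer the question at any combination of dimension attributes without changing the schema.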
Relationship between Dimensional Modeling
and ER Modeling
First step in converting from ER model is to separate
the model into its separate business processes
Second step is to select the M-M relations containing
numeric and additive non-key facts and designate
them as fact tables
Third step is to denormalize the remaining tables into
flat tables with a single part key
These become the dimension tables
Relationship between Dimensional
Modeling and ER Modeling (continued)
Resulting data warehouse model will contain
10 to 25 star schemas, each with 5 to 15
conformed dimensions
Many dimension tables will be shared
Applications that drill down will use multiple
dimensions from a single star join
Applications that drill across will link separate fact
tables through the conformed dimensions
Each fact table can be queried independently
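Drilling across can be sketched as two independent queries whose answer sets are lined up on a shared row header. The example below is an assumption-laden illustration: the sales and shipments fact tables, the product dimension, and the helper `totals_by_product` are hypothetical names, not part of the source material.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One conformed dimension shared by two separate fact tables (all hypothetical).
CREATE TABLE product_dim    (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE sales_fact     (product_key INTEGER, units_sold INTEGER);
CREATE TABLE shipments_fact (product_key INTEGER, units_shipped INTEGER);
INSERT INTO product_dim VALUES (1, 'Apples'), (2, 'Grapes');
INSERT INTO sales_fact VALUES (1, 5), (1, 3), (2, 4);
INSERT INTO shipments_fact VALUES (1, 10), (2, 6), (2, 2);
""")


def totals_by_product(fact_table, measure):
    """Query one fact table on its own, grouped by the conformed attribute."""
    # fact_table and measure are fixed strings from the calls below, not user input.
    sql = (f"SELECT p.product_name, SUM(f.{measure}) "
           f"FROM {fact_table} f JOIN product_dim p USING (product_key) "
           f"GROUP BY p.product_name")
    return dict(conn.execute(sql).fetchall())


sold = totals_by_product("sales_fact", "units_sold")
shipped = totals_by_product("shipments_fact", "units_shipped")

# Drill across: merge the two answer sets on the shared row header.
report = {name: (sold.get(name, 0), shipped.get(name, 0))
          for name in sorted(set(sold) | set(shipped))}
print(report)
```

The merge only works because product_name means the same thing in both queries, which is the point of a conformed dimension.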
The strengths of Dimensional Modeling
1. Predictable, standard framework
Query tools and user interfaces can make assumptions that
make user interfaces more understandable and processing
more efficient
Allows browsing across attributes within a dimension using
bit vector indexes (bit maps)
2. Every dimension is equivalent
Equal entry points into the fact table
Withstands unexpected changes in user behavior
Symmetrical user interfaces, query strategies, and SQL
3. Gracefully extensible to accommodate unexpected
new data elements and design changes
The strengths of Dimensional Modeling (continued)
4. There are a number of standard approaches for handling
common modeling situations, such as (discussed later in the
chapter):
Slowly changing dimensions
Heterogeneous products
Transaction based businesses
Event handling scenarios (i.e. factless fact tables)
5. Growing number of administrative utilities and software
processes that manage and use aggregates (discussed in a
later chapter)
Aggregates are summary records that are logically redundant; used
to enhance performance
The dimensional model is the only viable technique for
achieving both user comprehension and query performance
Putting Dimensional Models Together: DW Bus Architecture
The debate:
Do we build a central Data Warehouse or separate
subject areas? (Kimball vs. Inmon)
Kimball states that there are some Data Warehouse
myths. These are open to debate:
Nobody believes in a totally monolithic approach
All Data Warehouse practitioners use a step by step approach
We have (or, at least Kimball has) moved beyond the phase
in Data Warehouse development where a data mart must be
restricted to being an aggregated subset of a non-queryable
Data Warehouse
The planning crisis
Two unrelated challenges. The DW manager is
supposed to understand:
The content and location of all data in the enterprise
What keeps management awake at night (provide answers to
all of the high level questions that executives want answered)
Kimball says that the data mart is the solution to this
dilemma, built one at a time
However, isolated stovepipe data marts that cannot be tied
together are the bane of the DW movement
Data Marts with a Bus Architecture
Plan a series of steps with finite and specific goals
Separate data marts
Each implementation closely adheres to the architecture
2 steps
Create a surrounding architecture that defines the scope and
implementation of the complete DW
Oversee the construction of each piece
The biggest task in construction is designing the extract
system to get data, transform it, and load it into the final
database that allows querying
Conformed Dimensions and Standard Fact
Definitions
Before implementation, produce the suite of
conformed dimensions and standardize the definition
of facts
This set of standards is called Kimball's DW Bus Architecture
Every fact table is surrounded by conformed
dimensions in a star join
A conformed dimension means the same thing to every
possible fact table
A major responsibility of the DW team is to establish, publish,
maintain, and enforce conformed dimensions
Conformed Dimensions and Standard Fact
Definitions (continued)
Without conformed dimensions, data marts cannot be
used together and may produce wrong results
Conformed dimensions make possible:
A single dimension can be used with multiple fact tables
User interfaces and data content are consistent when used
with that dimension
Consistent interpretation of attributes across data marts
Designing the Conformed Dimensions
Conformed dimensions will naturally be defined at the
most granular level possible
The grain of the time dimension will be days
Conformed dimensions always have an anonymous
(surrogate) key that is not the production key from
one of the legacy systems
Taking the Pledge
The data mart teams must use the conformed
dimensions
Creation of conformed dimensions is as much a
political decision as it is a technical decision
Ch05 First Course on Dimensional Modeling
Establishing the Conformed Fact Definitions
The upfront data architecture effort will be about 20%
on conformed fact definitions and about 80% on
conformed dimensions
Usually done at the same time
Facts must also be conformed
Must be the same if they are called the same thing
Sometimes, a fact has one natural unit of measure in
one fact table and another natural unit of measure in
another fact table
This can cause problems for drill across reports
The correct solution is to carry the fact in both units of
measure in both tables (duplicate the fact)
If it is difficult or impossible to exactly conform a fact, then
give each interpretation a different name.
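Carrying a fact in both units of measure can be sketched as follows. The pack size of 12 units per case, the `Quantity` type, and the helper functions are assumptions made for illustration only.

```python
from dataclasses import dataclass

UNITS_PER_CASE = 12  # hypothetical pack size, not from the source


@dataclass(frozen=True)
class Quantity:
    """One fact carried in both units of measure."""
    cases: float
    units: int


def from_cases(cases: int) -> Quantity:
    return Quantity(cases=float(cases), units=cases * UNITS_PER_CASE)


def from_units(units: int) -> Quantity:
    return Quantity(cases=units / UNITS_PER_CASE, units=units)


shipped = from_cases(5)   # shipments are naturally counted in cases
sold = from_units(42)     # sales are naturally counted in units

# Either unit can now be compared directly in a drill across report.
unsold_units = shipped.units - sold.units
unsold_cases = shipped.cases - sold.cases
print(unsold_units, unsold_cases)
```

Storing both values at load time keeps the conversion out of every report, so two reports cannot disagree about the conversion factor.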
The importance of Data Mart granularity
Base level fact tables in each data mart should be at the natural
lowest levels of all the constituent dimensions
Granular fact tables can be gracefully extended by adding new
facts, new dimension attributes, or even whole new dimensions
Gracefully extended means that old queries and applications
continue to run
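Graceful extension can be demonstrated with a small sqlite3 sketch: an old query is run, the schema gains a new fact, a new dimension attribute, and a whole new dimension, and the same query is run again. Table names, the "No promotion" default row, and the data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE sales_fact  (product_key INTEGER, units_sold INTEGER);
INSERT INTO product_dim VALUES (1, 'Apples'), (2, 'Melons');
INSERT INTO sales_fact VALUES (1, 6), (2, 11), (1, 2);
""")

OLD_QUERY = """
    SELECT p.product_name, SUM(f.units_sold)
    FROM sales_fact f JOIN product_dim p ON p.product_key = f.product_key
    GROUP BY p.product_name
    ORDER BY p.product_name
"""
before = conn.execute(OLD_QUERY).fetchall()

conn.executescript("""
ALTER TABLE sales_fact  ADD COLUMN dollar_sales REAL;   -- new fact
ALTER TABLE product_dim ADD COLUMN category TEXT;       -- new dimension attribute
CREATE TABLE promotion_dim (promotion_key INTEGER PRIMARY KEY, promotion_name TEXT);
INSERT INTO promotion_dim VALUES (0, 'No promotion');
-- New dimension: existing fact rows point at the default member.
ALTER TABLE sales_fact ADD COLUMN promotion_key INTEGER DEFAULT 0;
""")
after = conn.execute(OLD_QUERY).fetchall()

print(before == after)
```

The old query names only the columns it needs, so additions to the fact and dimension tables leave its result unchanged.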
Multiple Source Data Marts
Kimball recommends
Start with a single source data mart
Most risk comes from taking on too large an extract programming job
After several single source marts, then combine them
This will satisfy users and allow the team to work on harder issues
Rescuing Stovepipes
If you have pre-existing efforts:
If the dimensions were proper conformed dimensions, then they can
become part of the overall architecture
If not, then shut them down and rebuild
When you don't need conformed dimensions
When there are several separately managed lines of business
A data mart is a complete subset of the overall data warehouse
Every data mart is a family of similar tables sharing conformed
dimensions
The Data Warehouse Bus
Conformed dimensions and conformed facts are the bus of the
DW
Basic Data Modeling Techniques
Fact Tables and Dimension Tables
The fundamental idea is that every type of business
data can be represented as a type of cube of data
The cells of the cube contain measured values
The edges of the cube define the natural dimensions
We call this a hypercube, or alternately, a cube or
data cube
Usually contain between 4 and 15 dimensions
Kimball says that models with 20 or more dimensions
seem unjustified
Envisioning a cube
We saw this in a previous slide
[Figure: a Sales Fact cube. Axes: Markets Dimension (Atlanta, Chicago, Dallas, Denver), Time Dimension (Q1, Q2, Q3, Q4), and a third axis listing Apples, Cherries, Grapes, Melons]
Envisioning a value in each dimension of a cube
[Figure: fact table representing Daily Sales counts by SalesRep, Product, and Date. Visible cell values, row by row: 6 11 2 12; 3 5 17 11; 1 21 14 22; 13 6 13 31]
Envisioning a value in each dimension of a cube with summaries
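Summaries along the edges of a cube are sums of the cell values. The sketch below uses the sixteen cell values visible on the slide as one two-dimensional slice. The row and column labels are assumptions, since the figure's axis labels for these cells did not survive.

```python
# Cell values are the ones visible on the slide; the labels are hypothetical.
products = ["Apples", "Cherries", "Grapes", "Melons"]
dates = ["Day 1", "Day 2", "Day 3", "Day 4"]
cells = [
    [6, 11, 2, 12],
    [3, 5, 17, 11],
    [1, 21, 14, 22],
    [13, 6, 13, 31],
]

# Summaries along each edge of the slice, plus the grand total.
row_totals = {p: sum(row) for p, row in zip(products, cells)}
col_totals = {d: sum(col) for d, col in zip(dates, zip(*cells))}
grand_total = sum(row_totals.values())

print(row_totals)
print(col_totals)
print(grand_total)
```

The row totals and the column totals add up to the same grand total, which is only true because the fact is additive.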
Fact Tables and Dimension Tables (continued)
Facts
An observation in the marketplace
Most are numeric
The designer should suspect that any numeric data is
probably a fact
Attribute
Usually text fields that describe a characteristic of a tangible
thing
Dimension
Textual attributes that describe things are organized within
the dimensions
Inside Dimension Tables, Drilling Up and Down
Drilling down is the most venerable (respected) kind
of drilling in a Data Warehouse
Drilling down means giving more detail
i.e. adding a row header to a report
Dimension attributes
Are textual
The source of application constraints
Conversely, removing a row header is drilling up
Not necessarily in the same order that they were added
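Drilling down and up can be sketched as adding and removing grouping columns in a report. The sales rows, the column names, and the `report` helper below are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows: (region, city, product, units)
SALES = [
    ("Midwest", "Chicago", "Apples", 6),
    ("Midwest", "Chicago", "Grapes", 11),
    ("South", "Atlanta", "Apples", 2),
    ("South", "Dallas", "Apples", 12),
]
COLUMNS = {"region": 0, "city": 1, "product": 2}


def report(row_headers):
    """Sum units for each distinct combination of the chosen row headers."""
    totals = defaultdict(int)
    for row in SALES:
        key = tuple(row[COLUMNS[h]] for h in row_headers)
        totals[key] += row[3]
    return dict(sorted(totals.items()))


by_region = report(["region"])
drilled_down = report(["region", "city"])  # add a row header: more detail
drilled_up = report(["city"])              # remove 'region', the first header added

print(by_region)
print(drilled_down)
print(drilled_up)
```

The last call removes the header that was added first, which illustrates that drilling up need not reverse the order in which headers were added.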
Inside Dimension Tables, Drilling Up and Down
(continued)
The DW industry has been using the term browsing
since the beginning of the 1980s
Means interactively examining the relationships among
attributes in a dimension table
Has nothing to do with browsing on the internet
Snowflake schemas (see figure 5.6 in the textbook)
When low cardinality fields in the dimension have been
removed into separate tables and linked back to the original
table with artificial keys
Kimball is against snowflaking.
Questionable space savings
Defeats the use of bitmap indexes
Inside Dimension Tables, Drilling Up and
Down (continued)
Importance of High Quality Verbose Attributes
The quality of the DW is measured by the quality of
the dimension attributes
An ideal dimension table contains many readable
text fields describing the members of the
dimension
Fully expanded words
No codes or abbreviations; eliminate them
See the top of the next slide
Dimension attributes should be:
Verbose (full words)
Descriptive
Complete (no missing values)
Quality Assured (no misspellings, impossible
values, obsolete or orphaned values, or
cosmetically different versions of the same
attribute)
Indexed (perhaps b-tree for high cardinality and
bitmap for low cardinality)
Equally available (in single flat-denormalized
dimension)
Documented (in metadata that explains the origin
and interpretation of each attribute)
Inside Dimension Tables, Drilling Up and Down
(continued)
Importance of High Quality Verbose Attributes
(continued)
Kimball recommends a standard time dimension
Includes a multinational sub-dimension
See next slide for details
Kimball recommends that a name and address record should
be broken down into as many parts as possible
Replace abbreviations with full text
Kimball recommends that, for commercial customers, you make a
separate customer record for each level in the hierarchy
Inside Dimension Tables, Drilling Up and Down
(continued)
For slowly changing dimensions (like product or
customer), there are 3 options
Type 1 - Overwrite the dimension record, losing history
Whenever the old value has no business significance
Type 2 - Create a new dimension record (with a new
surrogate key)
Whenever a true change has taken place and it is appropriate to
partition history by different descriptions
Type 3 - Create an old field to store the previous value
Whenever it is logically possible to act as if the change had not
occurred
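The three options can be sketched on a tiny in-memory customer dimension. This is an illustration under stated assumptions: the row layout, the `is_current` flag, the `previous_city` field name, and the customer data are all hypothetical and are not prescribed by the source.

```python
from itertools import count

_next_key = count(1)  # meaningless, sequentially assigned surrogate keys


def new_row(customer_id, city):
    return {
        "customer_key": next(_next_key),  # surrogate key
        "customer_id": customer_id,       # production key, kept as an attribute
        "city": city,
        "previous_city": None,            # used only by Type 3
        "is_current": True,               # assumption: a flag marking the live row
    }


def apply_type1(dim, customer_id, city):
    """Type 1: overwrite in place; history is lost."""
    for row in dim:
        if row["customer_id"] == customer_id and row["is_current"]:
            row["city"] = city


def apply_type2(dim, customer_id, city):
    """Type 2: retire the current row and add a new one with a new surrogate key."""
    for row in dim:
        if row["customer_id"] == customer_id and row["is_current"]:
            row["is_current"] = False
    dim.append(new_row(customer_id, city))


def apply_type3(dim, customer_id, city):
    """Type 3: keep the prior value in an 'old' field on the same row."""
    for row in dim:
        if row["customer_id"] == customer_id and row["is_current"]:
            row["previous_city"] = row["city"]
            row["city"] = city


t1 = [new_row("C-100", "Denver")]
apply_type1(t1, "C-100", "Dallas")

t2 = [new_row("C-100", "Denver")]
apply_type2(t2, "C-100", "Dallas")

t3 = [new_row("C-100", "Denver")]
apply_type3(t3, "C-100", "Dallas")

print(t1, t2, t3, sep="\n")
```

Only Type 2 changes the number of rows, so only Type 2 lets old fact rows keep pointing at the old description.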
Inside Dimension Tables, Drilling Up and Down
(continued)
Rapidly changing dimensions
Probably use the Type 2 technique
Large dimensions (millions of records)
Modern databases will support these
May require suppressing or not creating some records
Rapidly changing monster (large) dimensions
May need to break off hot (rapidly changing) attributes into their
own dimension table
Degenerate dimensions
e.g. order number in an order detail fact table. Keep this in the fact
table
Junk dimensions
Misc. flags and text
Kimball recommends putting them together in a dimension
FKs, PKs, and Surrogate Keys
All DW keys are surrogate keys
Ensure that they have no meaning
Never use original production keys
Typically a 4 byte integer (holds about 2 billion values)
Date Keys
Use surrogate key - Since some Date built-in fields are 8
bytes, 4 byte surrogate key will save 4 bytes
Facts in the fact table should be chosen to be
perfectly additive
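Surrogate key assignment can be sketched as a lookup that hands out meaningless integers in order of arrival. The `SurrogateKeyMap` class, the `build_date_dim` helper, the SKU strings, and the start date are hypothetical. The last line shows the largest value a signed 4 byte integer can hold, which is where the figure of about 2 billion comes from.

```python
from datetime import date, timedelta

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


class SurrogateKeyMap:
    """Maps production keys to meaningless integers, assigned in arrival order."""

    def __init__(self):
        self._keys = {}

    def key_for(self, production_key):
        if production_key not in self._keys:
            self._keys[production_key] = len(self._keys) + 1
        return self._keys[production_key]


def build_date_dim(start, days):
    """One row per day, keyed by a small integer rather than a date type."""
    rows = []
    for offset in range(days):
        d = start + timedelta(days=offset)
        rows.append({
            "date_key": offset + 1,
            "full_date": d.isoformat(),
            "day_of_week": DAY_NAMES[d.weekday()],
        })
    return rows


products = SurrogateKeyMap()
keys = [products.key_for(k) for k in ["SKU-9001", "SKU-0042", "SKU-9001"]]
date_dim = build_date_dim(date(2024, 1, 1), 3)
INT32_MAX = 2**31 - 1  # largest signed 4 byte integer

print(keys, date_dim[0], INT32_MAX)
```

The same production key always receives the same surrogate key, and nothing about the integer reveals the production key it stands for.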
4 Step Design Method for Designing an
Individual Fact Table -
4 choices made in order
1. Single source vs. multi source data mart
Kimball recommends starting with single source
2. Fact table grain - 3 styles
Individual transactions (e.g. sales records)
Snapshot - activity during a period (e.g. daily sales)
Line items from control documents (e.g. invoice lines)
3. Choose dimensions
Examine all data and attach single valued descriptors as
dimensions
4. Choose the facts
Dependent upon the grain of the fact table (#2 above)
Store aggregate or summary records in different fact tables
Families of Fact Tables
A single data mart can be a coordinated set of fact
tables
They use conformed dimensions
4 reasons for building families of fact tables
1. Chains and circles
e.g. an order, product, or customer evolves through a series of
steps. Each step captures transactions or snapshots. Each step
would have a fact table.
Often called a value chain
Called a value circle when a business or multiple entities can share
data with the same kind of transactions
2. Heterogeneous Product Schemas
May have multiple fact tables, e.g. a bank has one account fact
table and another fact table with the checking account subset.
Families of Fact Tables (continued)
4 reasons for building families of fact tables
(continued)
3. Transaction and Snapshot Schemas
Virtually every data mart has some need for 2 versions of data
One for transactions
One for periodic snapshots
May need Current Rolling Snapshot
e.g. keeping n months of periodic data
4. Aggregates
Stored summaries meant to improve performance
Created in separate fact tables
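An aggregate stored in its own fact table can be sketched with sqlite3 as below. The daily base table, the monthly aggregate table `sales_month_agg`, and the data are hypothetical. The sketch builds the aggregate from the base fact table and checks that both give the same monthly answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE date_dim   (date_key INTEGER PRIMARY KEY, full_date TEXT, month TEXT);
CREATE TABLE sales_fact (date_key INTEGER, product_key INTEGER, units_sold INTEGER);
INSERT INTO date_dim VALUES
    (1, '2024-01-30', '2024-01'),
    (2, '2024-01-31', '2024-01'),
    (3, '2024-02-01', '2024-02');
INSERT INTO sales_fact VALUES (1, 1, 6), (2, 1, 11), (3, 1, 2), (3, 2, 12);

-- The aggregate lives in a separate fact table at a coarser (monthly) grain.
CREATE TABLE sales_month_agg AS
    SELECT d.month AS month, f.product_key AS product_key,
           SUM(f.units_sold) AS units_sold
    FROM sales_fact f JOIN date_dim d ON d.date_key = f.date_key
    GROUP BY d.month, f.product_key;
""")

from_base = conn.execute("""
    SELECT d.month, SUM(f.units_sold)
    FROM sales_fact f JOIN date_dim d ON d.date_key = f.date_key
    GROUP BY d.month ORDER BY d.month
""").fetchall()

from_agg = conn.execute("""
    SELECT month, SUM(units_sold)
    FROM sales_month_agg
    GROUP BY month ORDER BY month
""").fetchall()

print(from_base == from_agg)
```

The aggregate is logically redundant: it returns the same totals as the base table while reading fewer rows and skipping the join.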
Factless Fact Tables
When the designer finds no facts to go into the fact
table
2 situations - events and coverage
Events: recording when something happens
e.g. student attendance: no record when the student doesn't
attend
Dummy attribute is added for later aggregation purposes
Coverage: when data is not available
e.g. sales promotion data is only available for items sold in that
promotion
Fact table will not contain items not sold during that sales
promotion
Fact table will also not contain items not on promotion
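Both situations can be sketched with sqlite3. The table names, keys, and rows are hypothetical. The attendance table holds only dimension keys plus a dummy count of 1, and the coverage table lists which products were on promotion so that promoted items with no sales can be found.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Event tracking: one row per attendance event, keys only plus a dummy count.
CREATE TABLE attendance_fact (
    date_key INTEGER, student_key INTEGER, course_key INTEGER,
    attendance_count INTEGER DEFAULT 1
);
INSERT INTO attendance_fact (date_key, student_key, course_key)
VALUES (1, 10, 100), (1, 11, 100), (2, 10, 100);

-- Coverage: which products were on promotion, whether or not they sold.
CREATE TABLE promotion_coverage_fact (date_key INTEGER, product_key INTEGER, promotion_key INTEGER);
CREATE TABLE sales_fact (date_key INTEGER, product_key INTEGER, promotion_key INTEGER, units_sold INTEGER);
INSERT INTO promotion_coverage_fact VALUES (1, 1, 5), (1, 2, 5);
INSERT INTO sales_fact VALUES (1, 1, 5, 8);
""")

# The dummy attribute makes event counts a plain SUM.
attendance_by_day = conn.execute("""
    SELECT date_key, SUM(attendance_count)
    FROM attendance_fact
    GROUP BY date_key ORDER BY date_key
""").fetchall()

# Products that were on promotion but have no matching sales row.
promoted_but_unsold = conn.execute("""
    SELECT c.product_key
    FROM promotion_coverage_fact c
    LEFT JOIN sales_fact s
           ON s.date_key = c.date_key
          AND s.product_key = c.product_key
          AND s.promotion_key = c.promotion_key
    WHERE s.product_key IS NULL
""").fetchall()

print(attendance_by_day, promoted_but_unsold)
```

The sales fact table alone cannot answer the second question, because an item that did not sell leaves no row there.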
