
Data Warehouse

Fundamentals of Star

Learning Objectives
Data Integrity
Keys, Indexes, Cardinality
Star Transformation
Source to Target Documentation
Learning Objectives

• To understand the purpose of the star transformation
• To understand what to index
• To understand when to index
Data Integrity

Data integrity refers to the overall completeness, accuracy, and consistency of data. It is normally enforced in a database system by a series of integrity constraints or rules:
• Null rule
• Unique Column Values
• Entity integrity
• Referential integrity
Null Rule

• A null rule is defined on a single column and allows or disallows inserts or updates of rows containing a null in that column.
Unique Column Values

• A unique-value rule defined on a column (or set of columns) allows the insert or update of a row only if it contains a unique value in that column (or set of columns).
Entity integrity
• Concept of a primary key
• Every table must have a primary key, and the column or columns chosen as the primary key must be unique and not null.
Referential integrity

• Concept of a foreign key
• Usually the foreign-key value refers to a primary-key value of another table in the database.
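
The four rules above can be tried out in a few lines. This is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for the warehouse database; the table and column names (patient_dim, claim_fact, shc_number) and the sample values are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when this is on

conn.executescript("""
CREATE TABLE patient_dim (
    patient_dim_id INTEGER PRIMARY KEY,   -- entity integrity
    shc_number     TEXT NOT NULL UNIQUE,  -- null rule + unique column values
    patient_name   TEXT NOT NULL
);
CREATE TABLE claim_fact (
    claim_id       INTEGER PRIMARY KEY,
    patient_dim_id INTEGER NOT NULL
                   REFERENCES patient_dim (patient_dim_id),  -- referential integrity
    amount         REAL NOT NULL
);
""")
conn.execute("INSERT INTO patient_dim VALUES (1, '100-232-563', 'SMITH, JOHN')")


def is_rejected(sql):
    """Return True when the database refuses the statement."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


results = {
    "null rule": is_rejected(
        "INSERT INTO patient_dim VALUES (2, NULL, 'DOE, JANE')"),
    "unique column values": is_rejected(
        "INSERT INTO patient_dim VALUES (3, '100-232-563', 'DOE, JANE')"),
    "entity integrity": is_rejected(
        "INSERT INTO patient_dim VALUES (1, '101-103-600', 'DOE, JANE')"),
    "referential integrity": is_rejected(
        "INSERT INTO claim_fact VALUES (10, 999, 100.00)"),
}
print(results)
```

Each entry in results reports whether the database rejected a row that breaks the corresponding rule. Constraint syntax and enforcement details differ between database products, so treat this as an illustration of the concepts rather than Oracle DDL.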

• In data modeling, cardinality refers to the number of rows in table A that relate to table B, such as "one-to-many" or "many-to-many." This is said to be the cardinality of a given table in relation to another.
• The "many" in a one-to-many relationship does not mean that there must be more than one instance of the child connected to a parent. It really means that there are zero, one, or more instances of the child paired with the parent.
Crow’s Foot Notation used in ERD (Entity Relationship Diagram)

• When referring to indexes, cardinality refers to the number of distinct values in a particular column. In a PERSON table, for example, GENDER is likely to be a very low cardinality column (only 5 values in GENDER_DIM in DW), while PATIENT_ID is likely to be a very high cardinality column (every row will have a different value).
• When looking at query plans, cardinality refers to the number of rows that are expected to be returned from a particular operation.
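
Column cardinality in the index sense can be measured by counting distinct values. The sketch below uses Python's sqlite3 module with a made-up PERSON table of 1,000 rows and three invented gender codes; only the PERSON, GENDER and PATIENT_ID names come from the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (patient_id INTEGER, gender TEXT)")

# Made-up sample data: 1,000 patients cycling through three invented codes.
genders = ["F", "M", "U"]
conn.executemany(
    "INSERT INTO person VALUES (?, ?)",
    [(i, genders[i % 3]) for i in range(1, 1001)],
)


def column_cardinality(table, column):
    """Return (distinct values, total rows) for one column.

    The identifiers are fixed strings in this sketch, never user input.
    """
    sql = f"SELECT COUNT(DISTINCT {column}), COUNT(*) FROM {table}"
    distinct_values, total_rows = conn.execute(sql).fetchone()
    return distinct_values, total_rows


gender_card = column_cardinality("person", "gender")
patient_card = column_cardinality("person", "patient_id")
print("gender:", gender_card)       # low cardinality: few distinct values
print("patient_id:", patient_card)  # high cardinality: one value per row
```

A ratio of distinct values to total rows close to 1 points to a high cardinality column, and a ratio close to 0 points to a low cardinality column.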

Indexes provide faster access to data for operations that return a small portion of a table's rows. In general, you should create an index on a column in any of the following situations:
• The column is filtered frequently.
• A UNIQUE key integrity constraint exists on the column (PK).
• A referential integrity constraint exists on the column.
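
To see the effect of indexing a frequently filtered column, the following sketch compares the query plan before and after the index is created. It uses Python's sqlite3 module rather than Oracle, and the claim table, its columns, and the index name claim_vendor_idx are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE claim (claim_id INTEGER PRIMARY KEY, vendor_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO claim (vendor_id, amount) VALUES (?, ?)",
    [(i % 500, 100.0) for i in range(10_000)],
)


def query_plan(sql):
    """Return the optimizer's plan as one string (last column of each plan row)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


# The query filters on vendor_id and returns a small portion of the rows.
query = "SELECT claim_id, amount FROM claim WHERE vendor_id = 42"

plan_before = query_plan(query)
conn.execute("CREATE INDEX claim_vendor_idx ON claim (vendor_id)")
plan_after = query_plan(query)

print("before:", plan_before)
print("after: ", plan_after)
```

The exact plan wording varies by SQLite version; the point is that the second plan names the index instead of scanning the whole table.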

Limit the Number of Indexes for Each Table

• The more indexes, the more overhead is incurred as the table is altered. When rows are inserted or deleted, all indexes on the table must be updated. When a column is updated, all indexes on the column must be updated.
• If a table is primarily read-only, you might use more indexes; but if a table is heavily updated, you might use fewer indexes.
Bitmap Index vs. B-tree Index

Internally, bitmap and B-tree indexes are very different, but functionally they are identical in that they serve to assist Oracle in retrieving rows faster than a full-table scan. The basic differences between B-tree and bitmap indexes include:
1. Syntax differences: The bitmap index includes the "bitmap" keyword. The B-tree index does not say "bitmap".
2. Cardinality differences: The bitmap index is generally for columns with lots of duplicate values (low cardinality), while B-tree indexes are best for high cardinality columns.
B-tree Index vs. Bitmap Index

B-tree index: A B-tree index keeps data sorted in a tree-like structure, and it walks the branches until it hits the node with the answer.

Bitmap index: A bitmap index keeps data sorted in a two-dimensional array with one column for every row in the table being indexed. It finds the answer by merging the bitmaps.
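
The "merging the bitmaps" idea can be sketched in plain Python. This is a conceptual illustration only: it ignores the compression and storage details of a real database, and the column values and helper names (build_bitmaps, matching_rows) are invented for the example.

```python
def build_bitmaps(values):
    """Map each distinct value to an int whose bit i is set when row i holds that value."""
    bitmaps = {}
    for row_number, value in enumerate(values):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row_number)
    return bitmaps


def matching_rows(bitmap):
    """Decode a bitmap back into the row numbers whose bit is set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical low cardinality columns for six rows (row numbers 0..5).
gender = ["F", "M", "F", "F", "M", "F"]
state = ["CA", "CA", "NV", "CA", "NV", "CA"]

gender_index = build_bitmaps(gender)
state_index = build_bitmaps(state)

# WHERE gender = 'F' AND state = 'CA' becomes one bitwise AND of two bitmaps.
merged = gender_index["F"] & state_index["CA"]
rows = matching_rows(merged)
print(rows)
```

With few distinct values there are few bitmaps to keep, which is one way to see why this structure suits low cardinality columns.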
Composite Index
You can create a composite index (using several columns, up to 32), and the same index can be used for queries that reference all of these columns, or just some of them.

In general, you should put the column expected to be used most often first in the index.
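
Column order in a composite index can be explored with query plans. The sketch below again uses Python's sqlite3 module, with a hypothetical claim table and index name; whether a query on only the trailing column can use the index depends on the database's optimizer, so the third plan is printed for inspection rather than assumed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE claim ("
    "claim_id INTEGER PRIMARY KEY, vendor_id INTEGER, service_month INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO claim (vendor_id, service_month, amount) VALUES (?, ?, ?)",
    [(i % 500, i % 12 + 1, 100.0) for i in range(10_000)],
)
# vendor_id is assumed to be the most frequently filtered column, so it goes first.
conn.execute(
    "CREATE INDEX claim_vendor_month_idx ON claim (vendor_id, service_month)"
)


def query_plan(sql):
    """Return the optimizer's plan as one string (last column of each plan row)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


plan_both = query_plan(
    "SELECT amount FROM claim WHERE vendor_id = 42 AND service_month = 6"
)
plan_leading = query_plan("SELECT amount FROM claim WHERE vendor_id = 42")
plan_trailing = query_plan("SELECT amount FROM claim WHERE service_month = 6")

print("both columns:   ", plan_both)
print("leading column: ", plan_leading)
print("trailing column:", plan_trailing)
```

Queries that filter on both columns, or on the leading column alone, name the composite index in their plans.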
Primary Key vs. Unique Key

Questions to ask clients:

• What data do you plan to filter?
• What are your required fields vs. ‘nice to haves’?
• Which columns uniquely identify each record?
Database Normalization

Database normalization is the process of organizing the attributes and tables of a relational database to minimize data redundancy.
First normal form (1NF)
First normal form sets the fundamental rules for database normalization and relates to a single table within a relational database system.
• Every column in the table must be unique
• Separate tables must be created for each set of related data
• Each table must be identified with a unique column or concatenated columns called the primary key
• No rows may be duplicated
• No row/column intersections contain a null value
• No row/column intersections contain multivalued fields
Second normal form (2NF)

Second normal form builds on the first normal form:
1. Split up all data resulting in many-to-many relationships and store the data as separate tables.
2. Each nonkey attribute in the relation must be functionally dependent upon the primary key.
Third Normal Form (3NF)

3NF states that only foreign key columns should be used to reference another table, and no other columns from the parent table should exist in the referenced table.
Source to Target
Star Schema
The Star Schema is a physical database model which consists of one or more fact tables referencing any number of dimension tables.

Benefits of using a star schema:
• Improved query performance
• Load performance and administration
• Referential integrity
Star Schema
Patient_DIM
• Patient_DIM_ID
• Patient Name
• Patient Address
• SHC#

Vendor_DIM
• Vendor_DIM_ID
• Vendor Number
• Vendor Name
• Vendor Address

Diagnosis_DIM
• DX_DIM_ID
• Diagnosis Code
• Diagnosis Desc

Service Dt_DIM
• Month
• Day
• Quarter

Star Schema Table


98675  1  100.00  68593  3287  2364
98675  2  100.00  68593  3287  3265
25555  6  250.00  42563  1256  8996
68395  3  500.00  23254  6985  5688

Flat Table (without Star Schema)



98675 1 100.00 100-232-563 SMITH, JOHN VU, LIU FAMILY 250.0 DIABETES
98675 2 100.00 100-232-563 SMITH, JOHN VU, LIU FAMILY 276. CHF
25555 6 250.00 101-103-600 DOE, JANE JONES, ORTHO 656.20 ANKLE
68395 3 500.00 102-896-405 MILLER, KELLY, KRIS PEDIATRICS 426.11 FEVER
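
The difference between the two tables above is that the star schema table stores dimension keys where the flat table repeats descriptive values. The sketch below shows one way that replacement could be done in Python. The column names are assumptions (the table headers did not survive in the source), only a few of the flat columns are used, and the sequential keys it assigns are illustrative, not the key values shown above.

```python
# Flat rows: (claim, line, amount, patient, diagnosis). Column names are assumed.
flat_rows = [
    (98675, 1, 100.00, "SMITH, JOHN", "DIABETES"),
    (98675, 2, 100.00, "SMITH, JOHN", "CHF"),
    (25555, 6, 250.00, "DOE, JANE", "ANKLE"),
]


def surrogate_key(dimension, natural_value):
    """Return the existing key for a value, or assign the next sequential one."""
    if natural_value not in dimension:
        dimension[natural_value] = len(dimension) + 1
    return dimension[natural_value]


patient_dim = {}    # patient name -> hypothetical Patient_DIM_ID
diagnosis_dim = {}  # diagnosis description -> hypothetical DX_DIM_ID
fact_rows = []

for claim, line, amount, patient, diagnosis in flat_rows:
    fact_rows.append((
        claim,
        line,
        amount,
        surrogate_key(patient_dim, patient),
        surrogate_key(diagnosis_dim, diagnosis),
    ))

print(patient_dim)
print(diagnosis_dim)
print(fact_rows)
```

Each distinct patient and diagnosis is stored once in its dimension, and the fact rows carry only the numeric keys, which is where the reduced redundancy of the star schema comes from.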
Snowflake Schema
• Structure in which a single fact table is surrounded by one or more multileveled dimensions
• Designed for flexible querying across more complex dimension relationships
• Suitable for many-to-many and one-to-many relationships among related dimension levels
Snowflake Schema
Gender_DIM
• Gender_DIM_ID
• Gender Code
• Gender Desc

Patient_DIM
• Patient_DIM_ID
• Patient Name
• Patient Address
• SHC#

Other tables shown in the diagram: Division_DIM, Vendor_DIM, Diagnosis_DIM, Service Dt_DIM

When to use Star vs. Snowflake
• If there are attributes in the lowest level dimension that need to be filtered on, use SNOWFLAKE.

This applies if users plan to filter on Gender values (e.g. Male or Female) during their reporting.

Gender_DIM
• Gender_DIM_ID
• Gender Code
• Gender Desc

Patient_DIM
• Patient_DIM_ID
• Patient Name
• Patient Address
• SHC#
• Gender_DIM_ID

MCA_CLAIM_FACT columns (names truncated in the source): Claim, Patient_DI, Vendor_D, Diagnosis, Service_D

When to use Star vs. Snowflake
• If there are attributes in the lowest level dimension that will only be displayed in reporting, but not filtered on, “flatten out” the attributes in the highest level dimension. This will create a Star Schema.

The Patient_DIM table should contain the Gender values.

Patient_DIM
• Patient_DIM_ID
• Patient Name
• Patient Address
• SHC#
• Gender Code
• Gender Desc

Fact table columns (names truncated in the source): Claim, Patient_DI, Vendor_D, Diagnosis, Service_D
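
The two Patient_DIM layouts can be compared side by side. This sketch builds both in an in-memory SQLite database through Python's sqlite3 module; the table suffixes (_snowflake, _star), the gender codes, and the sample patients are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Snowflake: gender lives in its own dimension, referenced from Patient_DIM.
CREATE TABLE gender_dim (
    gender_dim_id INTEGER PRIMARY KEY,
    gender_code   TEXT,
    gender_desc   TEXT
);
CREATE TABLE patient_dim_snowflake (
    patient_dim_id INTEGER PRIMARY KEY,
    patient_name   TEXT,
    gender_dim_id  INTEGER REFERENCES gender_dim (gender_dim_id)
);
-- Star: the gender attributes are flattened into Patient_DIM.
CREATE TABLE patient_dim_star (
    patient_dim_id INTEGER PRIMARY KEY,
    patient_name   TEXT,
    gender_code    TEXT,
    gender_desc    TEXT
);
INSERT INTO gender_dim VALUES (1, 'F', 'Female'), (2, 'M', 'Male');
INSERT INTO patient_dim_snowflake VALUES (1, 'SMITH, JOHN', 2), (2, 'DOE, JANE', 1);
INSERT INTO patient_dim_star VALUES
    (1, 'SMITH, JOHN', 'M', 'Male'),
    (2, 'DOE, JANE', 'F', 'Female');
""")

# Snowflake: filtering on gender goes through the separate Gender_DIM table.
snowflake_names = [row[0] for row in conn.execute("""
    SELECT p.patient_name
    FROM patient_dim_snowflake AS p
    JOIN gender_dim AS g ON g.gender_dim_id = p.gender_dim_id
    WHERE g.gender_desc = 'Female'
""")]

# Star: the gender description is read directly from the flattened Patient_DIM.
star_names = [row[0] for row in conn.execute(
    "SELECT patient_name FROM patient_dim_star WHERE gender_desc = 'Female'"
)]

print(snowflake_names, star_names)
```

Both layouts return the same patients. The snowflake version keeps each gender value in one place at the cost of an extra join, while the star version repeats the values on every patient row and needs no join.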

