# Data Warehouse

Fundamentals of Star Transformations
04/06/2015

## Overview

• Learning Objectives
• Data Integrity
• Keys, Indexes, Cardinality
• Star Transformation
• Source to Target Documentation
• Conclusion
• Questions
## Learning Objectives

• To understand the purpose of the star transformation
• To understand what to index
• To understand when to index
## Data Integrity

Data integrity refers to the overall completeness, accuracy, and consistency of data. It is normally enforced in a database system by a series of integrity constraints or rules:
• Null rule
• Unique column values
• Entity integrity
• Referential integrity
## Null Rule

• A null rule is a rule defined on a single column that allows or disallows inserts or updates of rows containing a null in that column.
## Unique Column Values

• A unique value rule defined on a column (or set of columns) allows the insert or update of a row only if it contains a unique value in that column (or set of columns).
## Entity Integrity

• Concept of a primary key
• Every table must have a primary key, and the column or columns chosen as the primary key must be unique and not null.
## Referential Integrity

• Concept of a foreign key
• Usually the foreign-key value refers to a primary-key value of another table in the database.
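
The four integrity rules above can be sketched in a few lines. This is a minimal illustration using Python's built-in `sqlite3` module rather than Oracle; the `patient` and `claim` tables and their columns are hypothetical names chosen for the example, and the sample values are borrowed from the flat table shown later in the deck.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical tables, one constraint per integrity rule
conn.executescript("""
CREATE TABLE patient (
    patient_id INTEGER PRIMARY KEY,      -- entity integrity
    shc_number TEXT NOT NULL UNIQUE,     -- null rule + unique column values
    name       TEXT NOT NULL
);
CREATE TABLE claim (
    claim_id   INTEGER PRIMARY KEY,
    patient_id INTEGER NOT NULL
               REFERENCES patient (patient_id)  -- referential integrity
);
INSERT INTO patient VALUES (1, '100-232-563', 'SMITH, JOHN');
""")


def rejected(sql):
    """Return True if the statement violates an integrity constraint."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


null_rejected = rejected("INSERT INTO patient VALUES (2, NULL, 'DOE, JANE')")
duplicate_rejected = rejected(
    "INSERT INTO patient VALUES (3, '100-232-563', 'DOE, JANE')"
)
duplicate_pk_rejected = rejected(
    "INSERT INTO patient VALUES (1, '101-103-600', 'DOE, JANE')"
)
orphan_rejected = rejected("INSERT INTO claim VALUES (98675, 42)")
valid_accepted = not rejected("INSERT INTO claim VALUES (98675, 1)")
```

Each of the first four statements breaks exactly one rule and is refused; the last one references an existing patient and is accepted.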
## Cardinality

• In data modeling, cardinality refers to the number of rows in table A that relate to table B: "one-to-many" or "many-to-many." This is said to be the cardinality of a given table in relation to another.
• The "many" in a one-to-many relationship does not mean that there must be more than one instance of the child connected to a parent. It really means that there are zero, one, or more instances of the child paired up with the parent.
## Cardinality

Crow’s foot notation as used in an ERD (Entity Relationship Diagram). [Diagram not recovered]
## Cardinality

• When referring to indexes, cardinality refers to the number of distinct values in a particular column. If you have a PERSON table, for example, GENDER is likely to be a very low cardinality column (only 5 values in GENDER_DIM in DW), while PATIENT_ID is likely to be a very high cardinality column (every row will have a different value).
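
As a rough illustration of column cardinality, the sketch below counts distinct values with `COUNT(DISTINCT ...)`. It uses Python's built-in `sqlite3` module; the `person` table, the 1,000 generated rows, and the five gender codes are made-up sample data, not values from the warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (patient_id INTEGER PRIMARY KEY, gender TEXT)")

# Hypothetical sample: 1,000 rows cycling through five gender codes
codes = ["F", "M", "O", "U", "X"]
conn.executemany(
    "INSERT INTO person VALUES (?, ?)",
    [(i, codes[i % len(codes)]) for i in range(1, 1001)],
)


def column_cardinality(column):
    """Count the distinct values in one column of the person table."""
    # The column name is interpolated, so only pass trusted identifiers
    (count,) = conn.execute(
        f"SELECT COUNT(DISTINCT {column}) FROM person"
    ).fetchone()
    return count


gender_cardinality = column_cardinality("gender")          # low: 5
patient_id_cardinality = column_cardinality("patient_id")  # high: 1000
```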
## Cardinality

• When looking at query plans, cardinality refers to the number of rows that are expected to be returned from a particular operation.
## Indexes

Indexes provide faster access to data for operations that return a small portion of a table's rows. In general, you should create an index on a column in any of the following situations:
• The column is filtered frequently.
• A UNIQUE key integrity constraint exists on the column (PK).
• A referential integrity constraint exists on the column (FK).
## Indexes

• The more indexes a table has, the more overhead is incurred as the table is altered. When rows are inserted or deleted, all indexes on the table must be updated. When a column is updated, all indexes on the column must be updated.
• If a table is primarily read-only, you might use more indexes; but if a table is heavily updated, you might use fewer indexes.
## Bitmap Index vs. B-tree Index

Internally, bitmap and B-tree indexes are very different, but functionally they serve the same purpose: both help Oracle retrieve rows faster than a full-table scan. The basic differences between B-tree and bitmap indexes include:
1. Syntax differences: The bitmap index includes the "bitmap" keyword. The B-tree index does not say "bitmap".
2. Cardinality differences: The bitmap index is generally for columns with lots of duplicate values (low cardinality), while B-tree indexes are best for high cardinality columns.
## B-tree Index vs. Bitmap Index

• A B-tree index keeps data sorted in a tree-like structure, and it walks the branches until it hits the node with the …
• A bitmap index keeps data sorted in a two-dimensional array with one column for every row in the table being indexed. It finds the …
## Composite Index

You can create a composite index (using several columns, up to 32), and the same index can be used for queries that reference all of these columns, or just some of them.

In general, you should put the column expected to be used most often first in the index.
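
The sketch below shows one composite index serving two different queries. It is only an illustration: it uses SQLite through Python's `sqlite3` module, whose optimizer differs from Oracle's, and the `claim_fact` columns and the index name `idx_claim_patient_date` are hypothetical. It assumes `patient_id` is the most frequently filtered column, so that column leads the index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE claim_fact (
    claim        INTEGER,
    patient_id   INTEGER,
    service_date TEXT,
    totbill      REAL
);
-- patient_id is assumed to be filtered most often, so it goes first
CREATE INDEX idx_claim_patient_date
    ON claim_fact (patient_id, service_date);
""")


def plan(sql):
    """Return the query-plan detail text SQLite reports for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(row[-1] for row in rows)  # detail is the last column


# Query that references every indexed column
both_columns = plan(
    "SELECT * FROM claim_fact "
    "WHERE patient_id = 1 AND service_date = '2015-04-06'"
)
# Query that references only the leading column
leading_only = plan("SELECT * FROM claim_fact WHERE patient_id = 1")

print(both_columns)
print(leading_only)
```

Both plans name the same index. A query that filters only on `service_date` is deliberately left out, because how a non-leading column is handled varies between database engines.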
## Primary Key vs. Unique Key

## Indexes

• What data do you plan to filter?
• What are your required fields vs. ‘nice to haves’?
• Which columns uniquely identify each record?
## Database Normalization

Database normalization is the process of organizing the attributes and tables of a relational database to minimize data redundancy.
## First Normal Form (1NF)

First normal form sets the fundamental rules for database normalization and relates to a single table within a relational database system.

• Every column in the table must be unique
• Separate tables must be created for each set of related data
• Each table must be identified with a unique column or concatenated columns called the primary key
• No rows may be duplicated
• No row/column intersections contain a null value
• No row/column intersections contain multivalued fields
## Second Normal Form (2NF)

A table in second normal form must first satisfy first normal form (1NF). In addition:

1. Split up all data resulting in many-to-many relationships and store the data as separate tables.
2. Each non-key attribute in the relation must be functionally dependent upon the primary key.
## Third Normal Form (3NF)

3NF states that only foreign key columns should be used to reference another table, and no other columns from the parent table should be repeated in the referencing table.
## Source to Target

## Star Schema

The star schema is a physical database model that consists of one or more fact tables referencing any number of dimension tables.

Benefits of using a star schema:
• Improved query performance
• Referential integrity
## Star Schema

MCA_CLAIM_FACT is the central fact table, surrounded by four dimension tables:

Patient_DIM
• Patient_DIM_ID
• Patient Name
• Patient Address
• SHC#

Vendor_DIM
• Vendor_DIM_ID
• Vendor Number
• Vendor Name
• Vendor Address

Diagnosis_DIM
• DX_DIM_ID
• Diagnosis Code
• Diagnosis Desc

Service Dt_DIM
• DATE_DIM_ID
• Month
• Day
• Quarter

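Star Schema Table (the fact row keeps the claim, line, and billed amount, followed by dimension IDs in place of the descriptive values)

| CLAIM | LINE | TOTBILL | DIM_ID | DIM_ID | DIM_ID |
|---|---|---|---|---|---|
| 68395 | 3 | 500.00 | 23254 | 6985 | 5688 |

Flat Table (without Star Schema)

| CLAIM | LINE | TOTBILL | PATIENT_SHC | PATIENT_NAME | VENDOR_NAME | VENDOR_HMO_DIV | DX_CODE | DX_Desc |
|---|---|---|---|---|---|---|---|---|
| 98675 | 1 | 100.00 | 100-232-563 | SMITH, JOHN | VU, LIU | FAMILY MEDICINE | 250.0 | DIABETES |
| 98675 | 2 | 100.00 | 100-232-563 | SMITH, JOHN | VU, LIU | FAMILY MEDICINE | 276. | CHF |
| 25555 | 6 | 250.00 | 101-103-600 | DOE, JANE | JONES, MARK | ORTHO | 656.20 | ANKLE SPRAIN |
| 68395 | 3 | 500.00 | 102-896-405 | MILLER, MIKE | KELLY, KRIS | PEDIATRICS | 426.11 | FEVER |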
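
To make the relationship between the two layouts concrete, the sketch below loads a subset of the flat table's columns into a small star schema and then joins the fact table back to its dimensions to rebuild the flat rows. It is a minimal illustration using Python's `sqlite3` module: the lowercase table and column names are hypothetical, the surrogate keys are whatever SQLite assigns (not the IDs shown in the deck), and the truncated code `276.` is kept exactly as it appears in the source.

```python
import sqlite3

# (claim, line, totbill, patient_shc, patient_name, vendor_name, dx_code)
flat_rows = [
    (98675, 1, 100.00, "100-232-563", "SMITH, JOHN", "VU, LIU", "250.0"),
    (98675, 2, 100.00, "100-232-563", "SMITH, JOHN", "VU, LIU", "276."),
    (25555, 6, 250.00, "101-103-600", "DOE, JANE", "JONES, MARK", "656.20"),
    (68395, 3, 500.00, "102-896-405", "MILLER, MIKE", "KELLY, KRIS", "426.11"),
]

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient_dim (patient_dim_id INTEGER PRIMARY KEY,
                          shc TEXT UNIQUE, name TEXT);
CREATE TABLE vendor_dim  (vendor_dim_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE dx_dim      (dx_dim_id INTEGER PRIMARY KEY, code TEXT UNIQUE);
CREATE TABLE claim_fact (
    claim INTEGER, line INTEGER, totbill REAL,
    patient_dim_id INTEGER REFERENCES patient_dim,
    vendor_dim_id  INTEGER REFERENCES vendor_dim,
    dx_dim_id      INTEGER REFERENCES dx_dim
);
""")

for claim, line, totbill, shc, patient, vendor, dx in flat_rows:
    # Each distinct dimension value is stored once; repeats are ignored
    conn.execute("INSERT OR IGNORE INTO patient_dim (shc, name) VALUES (?, ?)",
                 (shc, patient))
    conn.execute("INSERT OR IGNORE INTO vendor_dim (name) VALUES (?)", (vendor,))
    conn.execute("INSERT OR IGNORE INTO dx_dim (code) VALUES (?)", (dx,))
    # The fact row keeps the measures plus the dimension keys
    conn.execute(
        """INSERT INTO claim_fact
           SELECT ?, ?, ?, p.patient_dim_id, v.vendor_dim_id, d.dx_dim_id
           FROM patient_dim p, vendor_dim v, dx_dim d
           WHERE p.shc = ? AND v.name = ? AND d.code = ?""",
        (claim, line, totbill, shc, vendor, dx),
    )

# Joining the fact table to its dimensions rebuilds the flat rows
report = conn.execute("""
    SELECT f.claim, f.line, f.totbill, p.shc, p.name, v.name, d.code
    FROM claim_fact f
    JOIN patient_dim p ON p.patient_dim_id = f.patient_dim_id
    JOIN vendor_dim  v ON v.vendor_dim_id  = f.vendor_dim_id
    JOIN dx_dim      d ON d.dx_dim_id      = f.dx_dim_id
    ORDER BY f.claim, f.line
""").fetchall()

patient_count = conn.execute("SELECT COUNT(*) FROM patient_dim").fetchone()[0]
```

Four fact rows share three patient rows here, because the repeated patient on claim 98675 is stored only once in the dimension.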
## Snowflake Schema

• Structure in which a single fact table is surrounded by one or more multileveled dimensions
• Designed for flexible querying across more complex dimension relationships
• Suitable for many-to-many and one-to-many relationships among related dimension levels
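## Snowflake Schema

MCA_CLAIM_FACT is the central fact table. It is surrounded by Patient_DIM, Vendor_DIM, Diagnosis_DIM, and Service Dt_DIM, with Gender_DIM and Division_DIM as additional dimension levels:

Gender_DIM
• Gender_DIM_ID
• Gender Code
• Gender Desc

Patient_DIM
• Patient_DIM_ID
• Patient Name
• SHC#

Division_DIM

Vendor_DIM

Diagnosis_DIM

Service Dt_DIM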
## When to use Star vs. Snowflake

• If there are attributes in the lowest-level dimension that need to be filtered on, use a snowflake schema. This applies, for example, if users plan to filter on Gender values (e.g. Male or Female) during their reporting.

Gender_DIM
• Gender_DIM_ID
• Gender Code
• Gender Desc

Patient_DIM
• Patient_DIM_ID
• Patient Name
• SHC#
• Gender_DIM_ID

MCA_CLAIM_FACT
• Claim
• Patient_DIM_ID
• Vendor_DIM_ID
• Diagnosis_DIM_ID
• Service_Dt_DIM_ID
## When to use Star vs. Snowflake

• If there are attributes in the lowest-level dimension that will only be displayed in reporting, but not filtered on, “flatten out” the attributes into the highest-level dimension. This will create a star schema. In this example, the Patient_DIM table should contain the Gender values.

Patient_DIM
• Patient_DIM_ID
• Patient Name
• SHC#
• Gender Code
• Gender Desc

MCA_CLAIM_FACT
• Claim
• Patient_DIM_ID
• Vendor_DIM_ID
• Diagnosis_DIM_ID
• Service_Dt_DIM_ID
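
The flattening step can be sketched as a single join that copies the gender attributes into the patient dimension. This is a minimal illustration with Python's `sqlite3` module; the table names `patient_dim_snowflake` and `patient_dim_star`, the gender codes, and the gender assigned to each sample patient are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Snowflake: gender attributes live in their own dimension
CREATE TABLE gender_dim (
    gender_dim_id INTEGER PRIMARY KEY,
    gender_code   TEXT,
    gender_desc   TEXT
);
CREATE TABLE patient_dim_snowflake (
    patient_dim_id INTEGER PRIMARY KEY,
    patient_name   TEXT,
    shc            TEXT,
    gender_dim_id  INTEGER REFERENCES gender_dim (gender_dim_id)
);
-- Hypothetical sample rows
INSERT INTO gender_dim VALUES (1, 'M', 'Male'), (2, 'F', 'Female');
INSERT INTO patient_dim_snowflake VALUES
    (1, 'SMITH, JOHN', '100-232-563', 1),
    (2, 'DOE, JANE',   '101-103-600', 2);

-- Star: flatten the gender attributes into the patient dimension
CREATE TABLE patient_dim_star AS
SELECT p.patient_dim_id, p.patient_name, p.shc,
       g.gender_code, g.gender_desc
FROM patient_dim_snowflake p
JOIN gender_dim g ON g.gender_dim_id = p.gender_dim_id;
""")

star_rows = conn.execute(
    "SELECT patient_name, gender_code, gender_desc "
    "FROM patient_dim_star ORDER BY patient_dim_id"
).fetchall()
```

After flattening, a report can display the gender description from `patient_dim_star` without a second join, at the cost of repeating the code and description on every patient row.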
## Conclusion

• Learning Objectives
• Data Integrity
• Keys, Indexes, Cardinality
• Star Transformation
• Source to Target Documentation
• Questions?