BI Tech Session On Data Warehousing: Dhruv Nath

BI Tech Session on Data Warehousing
Dhruv Nath
Slides on OLAP
DW : Contents
ER Model vs Dimensional Model
Designing a Data Warehouse Starting with the ER Model
Facts and Dimensions
BI Products and Vendors
Data Warehouse Optimisation
OLAP Implementation
OLTP Databases use the Entity Relationship Model
Why no Many-Many relationships ?
Why can't we use the ER Model for Analytics / BI ?
Problems with using the ER Model / 3NF for Querying
Complex to understand and query
All kinds of tables being joined to all kinds of other tables
Maybe OK for joining a few tables. Not OK when lots of tables are involved
Complex to visualise
The E-R Model is very symmetric
No way to figure out what data is business numbers (changing) and what is constant (eg. Regions, Products)
The E-R Model is designed for capturing / updating detailed data, not for querying it
A different Model is required for querying this data by Management
An Easier Model to Query

[Figure: the facts Sales, Collections and Complaints at the centre, surrounded by the dimensions Product Model, Geography, Dealer and Year]
Dimensional Model
Facts and Dimensions
Benefits of the Dimensional Model

Simple
Can be used directly by the user
Very clear what data is business numbers (changing - facts) and what is constant (eg. Regions, Products - dimensions)
Example : Dimensional Model of Data

Dimension (Customer) : Cust. Id, Cust Name, Address, Phone
Dimension (Region) : Region Code, Region Name, Address, Manager
Dimension (Time) : Month & Yr, Quarter
Fact : Cust. Id, Month & Yr, Region Code, Balance
What is the primary key in each dimension ?
What is the primary key in the Fact table ?
What are the foreign keys ? What relationships do they define ?
What do we call this schema ? Star Schema
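One possible answer to the key questions on this slide, as a minimal sketch using Python's built-in sqlite3 module. The table names, column names and sample rows are hypothetical, and making the fact table's primary key the combination of its three foreign keys is an assumption, not something the slide states.

```python
import sqlite3

# Hypothetical names adapted from the slide's field labels.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_dim (cust_id TEXT PRIMARY KEY, cust_name TEXT, address TEXT, phone TEXT);
CREATE TABLE region_dim   (region_code TEXT PRIMARY KEY, region_name TEXT, address TEXT, manager TEXT);
CREATE TABLE time_dim     (month_yr TEXT PRIMARY KEY, quarter TEXT);
CREATE TABLE balance_fact (
    cust_id     TEXT REFERENCES customer_dim(cust_id),
    month_yr    TEXT REFERENCES time_dim(month_yr),
    region_code TEXT REFERENCES region_dim(region_code),
    balance     REAL,
    PRIMARY KEY (cust_id, month_yr, region_code)  -- assumed composite key
);
""")
conn.executemany("INSERT INTO customer_dim VALUES (?,?,?,?)",
                 [("C1", "Asha", "Delhi", "111"), ("C2", "Ravi", "Pune", "222")])
conn.executemany("INSERT INTO region_dim VALUES (?,?,?,?)",
                 [("N", "North", "Delhi", "Mehta"), ("W", "West", "Mumbai", "Shah")])
conn.executemany("INSERT INTO time_dim VALUES (?,?)",
                 [("2023-01", "2023-Q1"), ("2023-02", "2023-Q1")])
conn.executemany("INSERT INTO balance_fact VALUES (?,?,?,?)", [
    ("C1", "2023-01", "N", 100.0), ("C1", "2023-02", "N", 150.0),
    ("C2", "2023-01", "W", 200.0), ("C2", "2023-02", "W", 50.0)])

# Every join runs from the fact table to one dimension table.
rows = conn.execute("""
    SELECT r.region_name, f.month_yr, SUM(f.balance)
    FROM balance_fact f JOIN region_dim r ON r.region_code = f.region_code
    GROUP BY r.region_name, f.month_yr
    ORDER BY r.region_name, f.month_yr
""").fetchall()
```

Each foreign key in the fact table defines a one-to-many relationship from a dimension row to fact rows, which is what gives the schema its star shape.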
Example : Dimensional Model of Data

[Figure: the same star schema, a Fact table surrounded by three Dimension tables]
Each Dimension represents an entity (with attributes)
The Star Schema can be visualised as a Data Cube. How ?
Visualising a Star Schema as a Data Cube
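One way to picture the cube, as a rough sketch: each fact row becomes a cell whose coordinates are its dimension keys. The sample cells and the helper names `slice_cube` and `roll_up` are hypothetical.

```python
from collections import defaultdict

# Hypothetical cells: (cust_id, region_code, month_yr) -> balance
cube = {
    ("C1", "N", "2023-01"): 100.0,
    ("C1", "N", "2023-02"): 150.0,
    ("C2", "W", "2023-01"): 200.0,
}

def slice_cube(cube, axis, value):
    """Keep only the cells whose coordinate on `axis` equals `value`."""
    return {k: v for k, v in cube.items() if k[axis] == value}

def roll_up(cube, keep_axes):
    """Sum the measure over every axis not listed in `keep_axes`."""
    out = defaultdict(float)
    for key, value in cube.items():
        out[tuple(key[a] for a in keep_axes)] += value
    return dict(out)

# Slice on one month, then roll up customers to get balance by region.
jan = slice_cube(cube, 2, "2023-01")
by_region = roll_up(jan, [1])
```

Each dimension is one axis of the cube, and the fact value sits in the cell where the chosen dimension values meet.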
Querying : OLAP (vs OLTP)
Dimensional Model
[Figure: the same star schema, a Fact table surrounded by Dimension tables]
Can have any number of dimensions. Usually 5 - 15
How are snapshots added on ?
Exercise : Compare the ER Model with the Dimensional Model of Data
ER Model
Designed for entering / storing data (transactions)
Optimized for transactions: single row entry and retrieval
Thousands of concurrent users
No way to figure out what data is business numbers (changing) and what is constant / static / near-static (eg. Regions, Products). All of them are fields or relations. Therefore tough to implement a query
JOINs needed between any combination of tables. Therefore tough to implement a query
Dimensional Model
Designed for analysis / querying by the user
Optimized for bulk load and large, complex, unpredictable queries
Few concurrent users
What is constant / static / near-static (dimensions) and what are business numbers (facts) is very clear. Therefore easier to implement a query
JOINs only between the Fact Table and each Dimension Table. Therefore easier to implement a query
Data Marts
[Figure: the same star schema, a Fact table surrounded by Dimension tables]
How would Data Marts created out of such a Data Warehouse look ?
Similar. Some fields may be missing. Examples ?
Corporate customers : No personal details
Retail customers : No Organisational details
Data Cubes usually formed in Data Marts
Exercise : ER-Model to Dimensional Model

[ER diagram. Entities : SALES_REP, CUSTOMER, ORDER, LINE_ITEM, PRODUCT. Relationships : Sells_to, Places_Order, Contains, Is_Ordered]
Exercise : Convert this ER Model into a Dimensional Model (Star Schema)
Dimensional Model
Dimension CUSTOMER : Cust Id, Cust Name, Address
Dimension SALES_REP : Emp. Id, Name, Qualifications
Dimension TIME : Date, Quarter
Dimension PRODUCT : Product Code, Product Name, Brand, Rate
Dimension ORDER : Order Num, Credit Terms, Lead Time
Fact LINE_ITEM : Emp Id, Cust Id, Date, Order Num, Product Code, Quantity
What are the Foreign Keys in the Fact Table ? What is the primary key in the Fact Table ?
Star : Instead of keeping a relationship from Sales_Rep to Customer, the relationship is from both to Line_Item
New Dimension created : Time. Time will always be a dimension in a Data Warehouse
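A minimal sketch of querying the LINE_ITEM star, again with sqlite3. Only the PRODUCT and TIME dimensions are created, to keep it short; all names and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product  (product_code TEXT PRIMARY KEY, product_name TEXT, brand TEXT, rate REAL);
CREATE TABLE time_dim (date TEXT PRIMARY KEY, quarter TEXT);
CREATE TABLE line_item (
    emp_id TEXT, cust_id TEXT,
    date TEXT REFERENCES time_dim(date),
    order_num TEXT,
    product_code TEXT REFERENCES product(product_code),
    quantity INTEGER
);
""")
conn.executemany("INSERT INTO product VALUES (?,?,?,?)", [
    ("P1", "Pen", "Acme", 10.0), ("P2", "Ink", "Acme", 5.0), ("P3", "Pad", "Bolt", 20.0)])
conn.executemany("INSERT INTO time_dim VALUES (?,?)", [
    ("2023-01-15", "Q1"), ("2023-04-02", "Q2")])
conn.executemany("INSERT INTO line_item VALUES (?,?,?,?,?,?)", [
    ("E1", "C1", "2023-01-15", "O1", "P1", 3),
    ("E1", "C1", "2023-01-15", "O1", "P3", 1),
    ("E2", "C2", "2023-04-02", "O2", "P2", 4),
    ("E1", "C1", "2023-04-02", "O3", "P1", 2)])

# Quantity by quarter by brand: the fact table joins to each dimension
# directly, and the dimensions never join to each other.
rows = conn.execute("""
    SELECT t.quarter, p.brand, SUM(li.quantity)
    FROM line_item li
    JOIN time_dim t ON t.date = li.date
    JOIN product  p ON p.product_code = li.product_code
    GROUP BY t.quarter, p.brand
    ORDER BY t.quarter, p.brand
""").fetchall()
```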
Exercise : Is this a Normalised design ?

[Figure: the LINE_ITEM star schema from the previous exercise]

In the Fact Table, Emp Id is functionally dependent on (Cust Id + Date), not on the primary key
Logically, every time Customer P places an order with Salesman Q, we will have one row in the fact table for this Customer, Salesman combination
So redundancy. Cust Id should have been enough.
Therefore anomalies ???
Insert : Cannot insert a Customer - Salesman relationship till the Customer places an order
Delete : If an order is cancelled, and this is the only order the salesman has from this Customer, we lose the Salesman - Customer relationship
Does this lack of normalisation cause a problem ?
Does lack of normalisation cause a problem ?

A Data Warehouse has no updates, deletions or insertions
Only snapshots getting added on with time
So no anomalies. Lack of normalisation is not a problem
The E-R Model tries to remove redundancy completely
The Dimensional model tries to simplify the schema, and therefore brings in redundancy
eg. the relationship between sales_rep and customer is repeated in every line_item where these two are involved
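The redundancy can be counted directly. A small sketch over hypothetical LINE_ITEM rows, showing how often the same sales_rep / customer pair is stored again:

```python
from collections import Counter

# Hypothetical fact rows: (emp_id, cust_id, date, order_num, product_code, quantity)
line_items = [
    ("E1", "C1", "2023-01-15", "O1", "P1", 3),
    ("E1", "C1", "2023-01-15", "O1", "P3", 1),
    ("E1", "C1", "2023-04-02", "O3", "P1", 2),
    ("E2", "C2", "2023-04-02", "O2", "P2", 4),
]

# How many fact rows carry each (sales_rep, customer) pair.
pair_counts = Counter((emp, cust) for emp, cust, *_ in line_items)

# Every occurrence after the first repeats a relationship already stored.
redundant_copies = sum(n - 1 for n in pair_counts.values())
```

In a normalised design the pair would be stored once. Here it is stored once per line item, which costs space but does not cause anomalies as long as rows are only ever appended.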
Does lack of normalisation cause a problem contd. ?

Cannot enter a Salesman - Customer relationship till the customer places at least one order
Instead it is shown as a relationship between a customer and a line item, and a salesperson and the same line item. The relationship is only through the line item (Fact)
Is this a problem ?
In a DW we decide what our focus is - those are the facts. In this case our fact is the line items sold, not the relationship between the salesperson / customer rep and the customer
If the relationship (even without the order) is important to maintain, we create another Star Schema, around some other fact (say, Opportunity)
Constellation
Multiple STARs
Exercise : Implementing Data Marts
Which of these can be facts ?

Region, Sales, No. of Complaints, Type of complaint, Outstandings, Premium paid, Salary, Colour, Cash_on_hand, Collections, Breakages, Product, Customer, Interest
Typical characteristics of facts ??
Typical Characteristics of Facts

Numerical
Additive - why ?
Querying involves scanning lots of records
The end result of the query should be short - one or two pages / screens
Additive facts can provide this
Examples ? Sales, Collections, Revenue, Expenses
Continuously valued (even counts, eg. no. of complaints / no. of transactions, are considered continuously valued)
Will Facts always be additive ?

Semi-additive Facts ? Explain
Account Balance - Explain
Can be added across some dimensions, not all
Guidelines : What forms additive facts and what forms semi-additive facts ?
Flows vs Levels (eg. Deposits vs Balance, Collections vs Current outstandings)
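The flows vs levels distinction, as a small sketch over hypothetical month-end snapshots. Deposits are a flow and add up across time. A balance is a level: it adds up across customers at one point in time, but not across months.

```python
# Hypothetical snapshots: (cust_id, month_yr, deposits_in_month, closing_balance)
snapshots = [
    ("C1", "2023-01", 100.0, 100.0),
    ("C1", "2023-02", 50.0, 150.0),
    ("C2", "2023-01", 200.0, 200.0),
    ("C2", "2023-02", 0.0, 200.0),
]

# Flow (additive): total deposits by C1 over both months.
deposits_c1 = sum(d for c, m, d, b in snapshots if c == "C1")

# Level, summed across customers for one month: meaningful.
bank_balance_feb = sum(b for c, m, d, b in snapshots if m == "2023-02")

# Level, summed across months for one customer: arithmetically possible,
# but the result is not a balance anyone ever held.
meaningless_total_c1 = sum(b for c, m, d, b in snapshots if c == "C1")

# Across time, a level is usually reported as the latest value instead.
latest_c1 = max((m, b) for c, m, d, b in snapshots if c == "C1")[1]
```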
Non-additive Facts ? Explain

Interest %age, %age target achievement, %age profit
Cannot be added across any dimension
Can this be converted into an Additive fact ? Convert interest %age to an absolute value
When is this done ? ETL (Transform stage)
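A sketch of what such a Transform step might look like. The function name `to_additive` and the row layout are hypothetical.

```python
def to_additive(rows):
    """Transform step: replace interest % by the absolute interest amount."""
    return [
        {
            "account": r["account"],
            "principal": r["principal"],
            "interest_amount": r["principal"] * r["interest_pct"] / 100,
        }
        for r in rows
    ]

source_rows = [
    {"account": "A1", "principal": 1000.0, "interest_pct": 5.0},
    {"account": "A2", "principal": 3000.0, "interest_pct": 10.0},
]
facts = to_additive(source_rows)

# Absolute amounts can be summed across any dimension.
total_interest = sum(f["interest_amount"] for f in facts)
total_principal = sum(f["principal"] for f in facts)

# A percentage is then derived from the sums at query time.
overall_pct = 100 * total_interest / total_principal
```

Adding the two source percentages would give 15, which matches nothing real. The percentage derived from the summed amounts is 8.75.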
Additive Facts : Summarise

Facts will usually be additive, or semi-additive. Avoid non-additive facts
However, it is possible to have facts without satisfying some or all of these conditions
Ultimately, the designer decides.
Review : Facts - Guidelines

Numerical Continuously valued Additive
Semi-additive Non-additive
Dimensions
Determined by what you want as row and column headers in your query reports
Usually : Textual, Discrete
Could also be numeric. Where ?
Where they form column headers, and no calculations are done on them (eg. Age, Salary). Typically a range
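A numeric attribute turned into a range-based dimension, as a minimal sketch. The band boundaries and labels are hypothetical; a real design would agree them with the business users.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical band boundaries and labels.
BOUNDS = [18, 30, 45, 60]
LABELS = ["<18", "18-29", "30-44", "45-59", "60+"]

def age_band(age):
    """Map a numeric age onto the label of the range it falls in."""
    return LABELS[bisect_right(BOUNDS, age)]

ages = [17, 25, 29, 41, 63]

# The band is used as a report header; no arithmetic is done on it.
customers_per_band = Counter(age_band(a) for a in ages)
```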
Time is always one dimension. Why ?

Because of snapshots
Dimensions are an entry point into a Data Warehouse
Exercise : Facts or Dimensions ?

Region, Sales, No. of Complaints, Type of complaint, Outstandings, Premium paid, Salary, Colour, Cash_on_hand, Collections, Breakages, Product, Customer, Interest
The same thing can be modelled as a fact or as a dimension. Depends on the designer
Numeric dimensions are in the form of a range
BI Products and Vendors

[Figure: OLTP Databases, Data Warehouse, Data Marts, Data Cubes, Clients]
DBMS Vendors : Oracle, Microsoft SQL Server, IBM (DB2),..
BI Products and Vendors

[Figure: OLTP Databases, Data Warehouse, Data Marts, Data Cubes, Clients]
BI Tool Vendors
Provide everything except the OLTP DBMS and DW. ETL included
SAS, Cognos (IBM), Business Objects (SAP), Qlikview..
Implementing a Data Warehouse : Where should the Pilot be done ?
Four Regions (represented by 4 teams) :
1. Dynamic and keen Regional Manager (RM), very poor historical data
2. Excellent historical data. RM interested but doesn't have much time
3. Recently started Region. Not much historical data, but good current data. RM interested, may spend some time
4. Small, unimportant Region, but good RM, and interested. Good historical data, but not too much of it
Exercise : How big are the Fact and Dimension Tables ?
a) Number of records
b) Size in bytes
Dimension (Customer) : Cust. Id, Cust Name, Address, Phone
Dimension (Region) : Region Code, Region Name, Address, Manager
Dimension (Time) : Month & Yr, Quarter
Fact : Cust. Id, Month & Yr, Region Code, Balance
1 lakh customers, 10 regions. Data stored for the past 10 years
What if we store daily balances, and for each of the 1000 branches ?
Implications ? Space, speed.
So what do we do ? Optimise on Fact table size. Ignore dimension tables !!!
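A back-of-the-envelope sketch of the arithmetic. The row counts follow from the numbers on the slide, under the assumption that each customer belongs to exactly one region. The byte sizes are hypothetical (4-byte keys, 8-byte balance), and leap days are ignored.

```python
customers = 100_000            # 1 lakh
years = 10
months = years * 12

# Monthly balances: one fact row per customer per month (assumes each
# customer sits in exactly one region, so Region Code adds no rows).
monthly_rows = customers * months

# Hypothetical row layout: three 4-byte keys plus an 8-byte balance.
ROW_BYTES = 3 * 4 + 8
monthly_bytes = monthly_rows * ROW_BYTES

# Daily balances for each of 1000 branches: an upper bound, reached only
# if every customer had a balance in every branch on every day.
days = years * 365
branches = 1000
daily_rows_upper_bound = customers * days * branches
daily_bytes_upper_bound = daily_rows_upper_bound * (ROW_BYTES + 4)  # extra branch key

# The dimension tables stay tiny by comparison.
dimension_rows = {"customer": customers, "region": 10, "month": months}
```

Under these assumptions the monthly fact table has 12 million rows (about 240 MB), while the daily, per-branch version is bounded by 365 billion rows. The largest dimension has only 100,000 rows, which is why the effort goes into the fact table.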
Optimisation : Exercise : Can we modify this Star Schema to cut down space ?
[Figure: the LINE_ITEM star schema from the earlier exercise]
Is the Dimension Table Normalised ?
Star Schema Option 2
Denormalised Dimension Table
CUSTOMER + SALES_REP : Cust Id, Cust Name, Address, Emp. Id, Name, Qualifications
TIME : Date, Quarter
[Figure: the remaining tables of the LINE_ITEM star schema]
Advantage / Disadvantage ? Fact Table space vs. Ease of Querying
Which one would you use ?
Star Schema Option 3
More highly Denormalised Dimension Table
ORDER + CUSTOMER + SALES_REP : Order Num, Credit Terms, Lead Time, Cust Id, Cust Name, Address, Emp. Id, Name, Qualifications
TIME : Date, Quarter
PRODUCT : Product Code, Product Name, Brand, Rate
Advantage / Disadvantage ? Table space vs. Ease of Querying
Which one would you use ?
Optimisation : What occupies the maximum space in the Fact Table ?

[Figure: the Balance star schema. Fact : Cust. Id, Month & Yr, Region Code, Balance]
Keys
How do we reduce the size of the keys ? Use surrogate keys
Optimisation : Use Surrogate keys

Operational Keys - Disadvantage ?
English-like Ids : occupy space
Surrogate Keys - meaningless integers. 2 or 4 byte integers most common.
Advantage ? Much shorter
Disadvantage ? Processing required to transform from operational to surrogate keys
In any case, when the data comes from multiple sources, keys in all but one of the sources need to change
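A minimal sketch of the key transformation during load. The class name `SurrogateKeyMap`, the source-system names and the sample Ids are all hypothetical. Keying the map on the source system as well as the operational Id is one way to handle the same Id arriving from two sources.

```python
class SurrogateKeyMap:
    """Assigns a small integer to each distinct operational key."""

    def __init__(self):
        self._keys = {}

    def key_for(self, source, operational_key):
        # A new (source, key) pair gets the next integer; a known pair
        # gets back the integer it was given the first time.
        return self._keys.setdefault((source, operational_key), len(self._keys) + 1)

cust_keys = SurrogateKeyMap()

# Hypothetical incoming rows: (source system, operational customer Id, amount)
incoming = [
    ("crm", "CUST-DELHI-000123", 100.0),
    ("billing", "CUST-DELHI-000123", 40.0),   # same Id, different source
    ("crm", "CUST-DELHI-000123", 60.0),
]

# The fact table stores only the short integer key.
fact_rows = [(cust_keys.key_for(src, op), amount) for src, op, amount in incoming]
```

The long operational Id is stored once, in the dimension row, while every fact row carries only the integer.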
Exercise : Add surrogate keys to this schema

CUSTOMER : Cust Key (PK), Cust Id, Cust Name, Address
SALES_REP : Emp Key (PK), Emp. Id, Name, Qualifications
TIME : Date, Quarter
PRODUCT : Product Key, Product Code, Product Name, Brand, Rate
ORDER : Order Key, Order Num, Credit Terms, Lead Time
Fact LINE_ITEM : Emp Id, Emp Key, Cust Id, Cust Key, Date, Order Num, Order Key, Product Code, Product Key, Quantity
Do we need both the original and the surrogate key in the Dimension Table ? In the Fact Table ?
Designing a Data Warehouse

CUSTOMER : Cust Key (PK), Cust Id, Cust Name, Address
SALES_REP : Emp Key (PK), Emp. Id, Name, Qualifications
TIME : Date Key (PK), Date, Quarter
PRODUCT : Product Key, Product Code, Product Name, Brand, Rate
ORDER : Order Key, Order Num, Credit Terms, Lead Time
Fact LINE_ITEM : Emp Id, Emp Key, Cust Id, Cust Key, Date Key, Date, Order Num, Order Key, Product Code, Product Key, Quantity
Based on this exercise, what is the process for converting an ER Model into a Dimensional Model (Data Warehouse) ?
The DW Design Process

Identify an association table as the central fact table
Choose the Dimensions
Add date (time) dimension
Replace all operational keys with surrogate keys
Promote foreign keys from each dimension table to the fact table
Choose the Facts
Are the Dimensions normalised ?

Dimension (Customer) : Cust. Id, Cust Name, Address, Phone
Dimension (Region) : Region Code, Region Name, Address, Manager
Dimension (Time) : Month & Yr, Quarter
Fact : Cust. Id, Month, Region Code, Balance
Thinking question : Is there any situation where we would normalise a dimension table ?
Add fields to each dimension to make it denormalised
Now, what does the schema look like if we normalise each dimension table ? Snowflake Schema
Are Snowflake Schemas desirable ? Why ?
Speed of querying. Complexity of querying for the user
Representing dimensions
Product : SKU -> Brand -> Product -> Product Category -> Department -> All products
Store : Store -> Locality -> PIN Code -> City -> Region -> All
Date : Date -> Month -> Quarter -> Year -> All
Promotion : Promotion -> All
How do we represent a query - eg. Get Sales by SKU by Store by Date by Promotion ?
How do we show a Roll-up / Drill-down ?
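A roll-up along the Date hierarchy, as a rough sketch. A query picks one level in each dimension; rolling up means moving to a coarser level, drilling down means moving back to a finer one. The sample rows and the function name `sales_by` are hypothetical.

```python
from collections import defaultdict

# Hypothetical Date dimension: date -> its value at each coarser level
DATE_DIM = {
    "2023-01-15": {"month": "2023-01", "quarter": "2023-Q1", "year": "2023"},
    "2023-02-03": {"month": "2023-02", "quarter": "2023-Q1", "year": "2023"},
    "2023-04-20": {"month": "2023-04", "quarter": "2023-Q2", "year": "2023"},
}

# Hypothetical fact rows: (sku, store, date, promotion, sales)
SALES = [
    ("S1", "ST1", "2023-01-15", "none", 10.0),
    ("S1", "ST1", "2023-02-03", "promo", 20.0),
    ("S2", "ST2", "2023-04-20", "none", 5.0),
]

def sales_by(level):
    """Total sales grouped at one level of the Date hierarchy."""
    totals = defaultdict(float)
    for sku, store, date, promo, sales in SALES:
        key = date if level == "date" else DATE_DIM[date][level]
        totals[key] += sales
    return dict(totals)

by_month = sales_by("month")        # drill-down view
by_quarter = sales_by("quarter")    # one step of roll-up
by_year = sales_by("year")          # two steps of roll-up
```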
Representing dimensions
[Figure: the same dimension hierarchies]
For this query, we need to add fields across rows in the Fact Table. How many rows need to be summed ?
Problems ? Speed.
Solution ? Pre-aggregate sums and store
Multiple levels of aggregates

[Figure: the same dimension hierarchies]
Store multiple level aggregates
Redundancy : To speed up querying
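Pre-aggregation at several levels, as a minimal sketch. The sums are computed once at load time and stored, so a query at a coarse level becomes a lookup rather than a scan. The sample rows and names are hypothetical.

```python
from collections import defaultdict

# Hypothetical base facts: (month, quarter, year, sales)
BASE = [
    ("2023-01", "2023-Q1", "2023", 10.0),
    ("2023-02", "2023-Q1", "2023", 20.0),
    ("2023-04", "2023-Q2", "2023", 5.0),
]

def build_aggregates(base):
    """Compute the stored sums once, for each level of the hierarchy."""
    aggs = {
        "month": defaultdict(float),
        "quarter": defaultdict(float),
        "year": defaultdict(float),
    }
    for month, quarter, year, sales in base:
        aggs["month"][month] += sales
        aggs["quarter"][quarter] += sales
        aggs["year"][year] += sales
    return {level: dict(totals) for level, totals in aggs.items()}

AGGS = build_aggregates(BASE)

# A year-level query is now one lookup instead of a scan of BASE.
year_total = AGGS["year"]["2023"]
```

The same sales figure is now stored at three levels. That is deliberate redundancy, traded for query speed.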
Aggregation : Issues
When are aggregates computed ? During every update
How do we decide what aggregates to keep ?
Frequency of usage / repeat queries
Priority of users
Managers / Analysts should figure out the likely frequency, and therefore what aggregates to keep
Aggregation : Issues
Where are Aggregations stored ?
Separate Fact table
Families of Stars (Constellations)
When are they computed ? During every update
How do we decide what aggregates to keep ?
Frequency of usage / repeat queries
Priority of users
Users should not be aware of aggregation. The software automatically uses the aggregate Fact table to answer the query. Why ?
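A sketch of how software might pick the aggregate table on the user's behalf. The catalogue, table names and row counts are hypothetical, and real products use their own mechanisms; this only illustrates the idea of choosing the smallest table that can still answer the query.

```python
# Hypothetical catalogue: table name -> (dimension levels it holds, row count)
TABLES = {
    "sales_fact":         ({"sku", "store", "date"}, 10_000_000),
    "sales_by_sku_month": ({"sku", "month"}, 200_000),
    "sales_by_month":     ({"month"}, 120),
}

# Coarser levels that can be derived from a finer one.
ROLLS_UP_TO = {"date": {"month"}}

def can_answer(table_dims, wanted):
    """True if every wanted level is held by, or derivable from, the table."""
    available = set(table_dims)
    for d in table_dims:
        available |= ROLLS_UP_TO.get(d, set())
    return wanted <= available

def choose_table(wanted):
    """Pick the smallest table able to answer a query grouped by `wanted`."""
    candidates = [
        (rows, name)
        for name, (dims, rows) in TABLES.items()
        if can_answer(dims, wanted)
    ]
    return min(candidates)[1]
```

For example, `choose_table({"month"})` returns `"sales_by_month"`, while `choose_table({"store", "month"})` falls back to `"sales_fact"`. Because the choice is made by the software, aggregates can be added or dropped later without users rewriting their queries.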
Implementing OLAP
Relational OLAP (ROLAP)
Implemented using a regular Relational DBMS
Linked list structures
Multi-Dimensional OLAP (MOLAP)
MDDB (multi-dimensional database) created in advance and stored for querying
Array structures
Advantages and Disadvantages ?
ROLAP vs MOLAP
ROLAP
Linked List Structure, therefore slow
Space optimised : only records that have some value are stored
All data is available in the ROLAP. Can handle large DW
No pre-aggregated data, therefore slow
MOLAP
Array Structure, therefore fast
Pre-aggregated data, therefore fast
All cells in the Fact Table are stored whether they exist or not. Therefore huge space (Explain)
eg. (Bank example) A customer does not have any Account in a given branch
eg. A customer does not perform any transaction in most of his accounts on specific days
Therefore only small DW can be handled
Sparse Matrix techniques used to optimise space
For large DW, summarised data can be kept in the MDDB. Drilling down requires going back to ROLAP (called HOLAP - Hybrid OLAP)
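The space difference in the bank example, as a rough sketch. All sizes are hypothetical, including the assumption that each customer has an account in one branch and transacts on 20 days of the year.

```python
# Hypothetical sizes for the bank example
customers, branches, days = 1_000, 50, 365

# Dense array: one cell per (customer, branch, day), used or not.
dense_cells = customers * branches * days

# Sparse storage: only the combinations that actually occurred.
# Assumed: one branch per customer, transactions on 20 days.
sparse = {
    (c, c % branches, d): 1.0
    for c in range(customers)
    for d in range(20)
}
stored_rows = len(sparse)

# Fraction of the dense array that holds a real value.
fill_ratio = stored_rows / dense_cells
```

With these numbers the dense array reserves 18,250,000 cells for 20,000 real values, so roughly 0.1% of it is filled. This is the gap that sparse matrix techniques try to close.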
ROLAP vs MOLAP
DBMS vendors started off with ROLAP (knowhow already existed), but are now adding MOLAP
Pure BI vendors largely into MOLAP (proprietary)
Role Play : Implementation across Multiple Locations
Book
The Data Warehouse Toolkit - Ralph Kimball, Margy Ross - Wiley