
BI Tech Session on Data Warehousing

Dhruv Nath

Slides on OLAP

DW : Contents
ER Model vs Dimensional Model
Designing a Data Warehouse
Starting with the ER Model
Facts and Dimensions
BI Products and Vendors
Data Warehouse Optimisation
OLAP Implementation

OLTP Databases use the Entity Relationship Model

Why no Many-Many relationships ?

Why can't we use the ER Model for Analytics / BI ?

Problems with using the ER Model / 3NF for Querying

Complex to understand and query
All kinds of tables being joined to all kinds of other tables
Maybe OK for joining a few tables. Not OK when lots of tables involved

Complex to visualise

The E-R Model is very symmetric
No way to figure out what data is business numbers (changing) and what is constant (eg. Regions, Products)

The E-R Model is designed for capturing / updating detailed data. Not for querying it
Different Model required for querying this data by Management

An Easier Model to Query

[Diagram : Dimensional Model. Sales, Collections, Complaints shown together with Product Model, Geography, Dealer, Year]

Facts and Dimensions

Benefits of the Dimensional Model


Simple
Can be used directly by the user

Very clear what data is business numbers (changing - facts) and what is constant (eg. Regions, Products - dimensions)

Example : Dimensional Model of Data

Dimension : Cust. Id, Cust Name, Address, Phone
Dimension : Region Code, Region Name, Address, Manager
Dimension : Month & Yr, Quarter
Fact : Cust. Id, Month & Yr, Region Code, Balance

What is the primary key in each dimension ?
What is the primary key in the Fact table ?
What are the foreign keys ? What relationships do they define ?
What do we call this schema ? Star Schema
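A minimal sketch of this Star Schema in Python, using an in-memory SQLite database. The table names (customer, region, period, balance_fact), the sample rows and the composite primary key on the fact table are assumptions for illustration; the slides only list the fields.

```python
import sqlite3

# Hypothetical table and column names; the slides only list the fields.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (cust_id TEXT PRIMARY KEY, cust_name TEXT, address TEXT, phone TEXT);
CREATE TABLE region   (region_code TEXT PRIMARY KEY, region_name TEXT, address TEXT, manager TEXT);
CREATE TABLE period   (month_yr TEXT PRIMARY KEY, quarter TEXT);
-- Fact table: one foreign key per dimension; together they form the primary key.
CREATE TABLE balance_fact (
    cust_id     TEXT REFERENCES customer(cust_id),
    month_yr    TEXT REFERENCES period(month_yr),
    region_code TEXT REFERENCES region(region_code),
    balance     REAL,
    PRIMARY KEY (cust_id, month_yr, region_code)
);
""")
conn.executemany("INSERT INTO customer VALUES (?, ?, ?, ?)",
                 [("C1", "Asha", "Delhi", "111"), ("C2", "Ravi", "Pune", "222")])
conn.executemany("INSERT INTO region VALUES (?, ?, ?, ?)",
                 [("N", "North", "Delhi", "Mehta"), ("W", "West", "Pune", "Shah")])
conn.executemany("INSERT INTO period VALUES (?, ?)",
                 [("2023-01", "Q1"), ("2023-02", "Q1")])
conn.executemany("INSERT INTO balance_fact VALUES (?, ?, ?, ?)",
                 [("C1", "2023-01", "N", 100.0), ("C2", "2023-01", "W", 250.0),
                  ("C1", "2023-02", "N", 120.0), ("C2", "2023-02", "W", 200.0)])

# The query joins the fact table to the dimension it needs, never dimension to dimension.
rows = conn.execute("""
    SELECT r.region_name, f.month_yr, SUM(f.balance)
    FROM balance_fact f JOIN region r ON r.region_code = f.region_code
    GROUP BY r.region_name, f.month_yr
    ORDER BY r.region_name, f.month_yr
""").fetchall()
print(rows)
```

Each dimension has a single-column primary key; the fact table carries those keys as foreign keys, which is what gives the schema its star shape.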

Each Dimension represents an entity (with attributes)

The Star Schema can be visualised as a Data Cube. How ?

Visualising a Star Schema as a Data Cube
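One way to picture the cube, sketched in Python under assumed sample data: each dimension is an axis, and each fact row is one cell addressed by its dimension keys. The rows and the helper name slice_cube are hypothetical.

```python
# Hypothetical fact rows: (cust_id, region_code, month_yr, balance)
fact_rows = [
    ("C1", "N", "2023-01", 100.0),
    ("C2", "W", "2023-01", 250.0),
    ("C1", "N", "2023-02", 120.0),
]

# The cube: one axis per dimension, one cell per fact row.
cube = {(cust, region, month): balance for cust, region, month, balance in fact_rows}

def slice_cube(cube, month):
    """Fix one coordinate on the time axis and return the remaining 2-D slice."""
    return {(c, r): v for (c, r, m), v in cube.items() if m == month}

jan = slice_cube(cube, "2023-01")
print(jan)
```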

Querying : OLAP (vs OLTP)

Dimensional Model

Can have any number of dimensions. Usually 5 - 15
How are snapshots added on ?

Exercise : Compare the ER Model with the Dimensional Model of Data

ER Model
Designed for entering / storing data (transactions)
Optimized for transactions: single row entry and retrieval
Thousands of concurrent users
No way to figure out what data is business numbers (changing) and what is constant / static / near-static (eg. Regions, Products). All of them are fields or relations. Therefore tough to implement a query
JOINs needed between any combination of tables. Therefore tough to implement a query

Dimensional Model
Designed for analysis / querying by the user
Optimized for bulk load and large, complex, unpredictable queries
Few concurrent users
What is constant / static / near-static (dimensions) and what are business numbers (facts) very clear. Therefore easier to implement a query
JOINs only between the Fact Table and each Dimension Table. Therefore easier to implement a query

Data Marts

How would Data Marts created out of such a Data Warehouse look ?
Similar. Some fields may be missing. Examples ?
Corporate customers : No personal details
Retail customers : No Organisational details
Data Cubes usually formed in Data Marts


Exercise : ER-Model to Dimensional Model

[ER diagram. Entities : SALES_REP, CUSTOMER, ORDER, PRODUCT. Relationships : Sells_to, Places_Order, Contains, Is_Ordered, Line_Item]

Exercise : Convert this ER Model into a Dimensional Model (Star Schema)

Dimensional Model

What are the Foreign Keys in the Fact Table ? What is the primary key in the Fact Table ?

CUSTOMER : Cust Id, Cust Name, Address
SALES_REP : Emp. Id, Name, Qualifications
PRODUCT : Product Code, Product Name, Brand, Rate
ORDER : Order Num, Credit Terms, Lead Time
TIME : Date, Quarter
Fact LINE_ITEM : Emp Id, Cust Id, Date, Order Num, Product Code, Quantity

Star : Instead of keeping a relationship from Sales_Rep to Customer, the relationship is from both to Line_Item
New Dimension created : Time. Time will always be a dimension in a Data Warehouse


Exercise : Is this a Normalised design ?


In the Fact Table, Emp Id is functionally dependent on (Cust Id + Date), not on the primary key
Logically, every time Customer P places an order on Salesman Q, we will have one row in the fact table for this Customer, Salesman combination
So redundancy. Cust Id should have been enough.

Therefore anomalies ???
Insert : Cannot insert a Customer-Salesman relationship till the Customer places an order
Delete : If an order is cancelled, and this is the only order the salesman has from this Customer, we lose the Salesman-Customer relationship
Does this lack of normalisation cause a problem ?

Does lack of normalisation cause a problem ?


A Data Warehouse has no updates, deletions or insertions
Only snapshots get added on with time

So no anomalies. Lack of normalisation is not a problem
The E-R Model tries to remove redundancy completely
The Dimensional Model tries to simplify the schema, and therefore brings in redundancy
eg. the relationship between sales_rep and customer is repeated in every line_item where these two are involved

Does lack of normalisation cause a problem ? (contd.)

Cannot enter a Salesman-Customer relationship till the customer places at least one order
Instead it is shown as a relationship between a customer and a line item, and a salesperson and the same line item. The relationship is only through the line item (Fact)
Is this a problem ?
In a DW we decide what our focus is - those are the facts. In this case our fact is the line items sold, not the relationship between the salesperson / customer rep and the customer
If the relationship (even without the order) is important to maintain, we create another Star Schema, around some other fact (say, Opportunity)

Constellation
Multiple STARs

Exercise : Implementing Data Marts


Which of these can be facts ?


Region, Sales, No. of Complaints, Type of complaint, Outstandings, Premium paid, Salary, Colour, Cash_on_hand, Collections, Breakages, Product, Customer, Interest

Typical characteristics of facts ??

Typical Characteristics of Facts


Numerical

Additive - why ?
Querying involves scanning lots of records
The end result of the query should be short - one or two pages / screens
Additive facts can provide this

Examples ?
Sales, Collections, Revenue, Expenses

Continuously valued (even counts, eg. no. of complaints / no. of transactions, are considered continuously valued)

Will Facts always be additive ?


Semi-additive Facts ? Explain
Account Balance - Explain
Can be added across some dimensions, not all
Guidelines : What forms additive facts and what forms semi-additive facts ?
Flows vs Levels (eg. Deposits vs Balance, Collections vs Current outstandings)
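A small Python sketch of why a balance (a level) is only semi-additive. The sample rows are hypothetical, and taking the closing balance is one common way of summarising a level over time, assumed here rather than taken from the slides.

```python
from collections import defaultdict

# Hypothetical month-end balances (a "level"): (cust_id, month_yr, balance)
balances = [
    ("C1", "2023-01", 100.0), ("C2", "2023-01", 250.0),
    ("C1", "2023-02", 120.0), ("C2", "2023-02", 200.0),
]

# Adding across the customer dimension is meaningful: total money held in each month.
total_by_month = defaultdict(float)
for cust, month, bal in balances:
    total_by_month[month] += bal

# Adding across the time dimension is not: C1 never held 220.
# One alternative is to take the closing (latest) balance instead.
closing_by_cust = {}
for cust, month, bal in sorted(balances, key=lambda row: row[1]):
    closing_by_cust[cust] = bal  # later months overwrite earlier ones

print(dict(total_by_month))
print(closing_by_cust)
```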

Non-additive Facts ? Explain


Interest %age, %age target achievement, %age profit
Cannot be added across any dimension
Can this be converted into an Additive fact ? Convert interest %age to an absolute value
When is this done ? ETL (Transform stage)
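A minimal sketch of that conversion in Python, assuming the source rows carry a principal alongside the interest rate. The field names and figures are hypothetical.

```python
# Hypothetical source rows: (account_id, principal, interest_rate_pct)
source_rows = [("A1", 1000.0, 5.0), ("A2", 3000.0, 10.0)]

def transform(row):
    """ETL transform step: derive an absolute, additive interest amount from the rate."""
    account_id, principal, rate_pct = row
    return {"account_id": account_id,
            "principal": principal,
            "interest_amount": principal * rate_pct / 100.0}

facts = [transform(r) for r in source_rows]

total_interest = sum(f["interest_amount"] for f in facts)   # additive
total_principal = sum(f["principal"] for f in facts)        # additive
# The overall rate is recomputed from the additive facts, never by adding 5% + 10%.
overall_rate_pct = 100.0 * total_interest / total_principal
print(total_interest, overall_rate_pct)
```

Storing the absolute amount keeps the fact additive; any percentage needed in a report can be derived from the sums at query time.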

Additive Facts : Summarise


Facts will usually be additive, or semi-additive. Avoid non-additive facts
However, it is possible to have facts without satisfying some or all of these conditions
Ultimately, the designer decides.

Review : Facts - Guidelines


Numerical
Continuously valued
Additive
Semi-additive
Non-additive

Dimensions
Determined by what you want as row and column headers in your query reports
Usually :
Textual
Discrete

Could also be numeric. Where ?


Where they form column headers, and no calculations are done on them (eg. Age, Salary). Typically a range
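A short Python sketch of a numeric attribute used as a dimension by turning it into ranges. The band boundaries and labels are assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical age bands; the upper bound is exclusive.
AGE_BANDS = [(0, 18, "0-17"), (18, 40, "18-39"), (40, 60, "40-59")]

def age_band(age):
    """Map a numeric age to the label of the range it falls in."""
    for low, high, label in AGE_BANDS:
        if low <= age < high:
            return label
    return "60+"

ages = [15, 22, 37, 41, 63, 59]
# The band labels become the column headers; no arithmetic is done on the ages themselves.
headers = Counter(age_band(a) for a in ages)
print(headers)
```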

Time is always one dimension. Why ?


Because of snapshots

Dimensions are an entry point into a Data Warehouse

Exercise : Facts or Dimensions ?


Region, Sales, No. of Complaints, Type of complaint, Outstandings, Premium paid, Salary, Colour, Cash_on_hand, Collections, Breakages, Product, Customer, Interest

The same thing can be modelled as a fact or as a dimension. Depends on the designer
Numeric dimensions are in the form of a range


BI Products and Vendors

[Diagram : OLTP Databases, Data Warehouse, Data Marts, Data Cubes, Clients]

DBMS Vendors
Oracle, Microsoft SQL Server, IBM (DB2), ..

BI Tool Vendors
Provide everything except the OLTP DBMS and DW. ETL included
SAS, Cognos (IBM), Business Objects (SAP), QlikView, ..

Implementing a Data Warehouse : Where should the Pilot be done ?

Four Regions (represented by 4 teams) :
1. Dynamic and keen Regional Manager (RM). Very poor historical data
2. Excellent historical data. RM interested but doesn't have much time
3. Recently started Region. Not much historical data, but good current data. RM interested, may spend some time
4. Small, unimportant Region, but good RM, and interested. Good historical data, but not too much of it


Exercise : How big are the Fact and Dimension Tables ?
a) Number of records
b) Size in bytes

Dimension : Cust. Id, Cust Name, Address, Phone
Dimension : Region Code, Region Name, Address, Manager
Dimension : Month & Yr, Quarter
Fact : Cust. Id, Month & Yr, Region Code, Balance

1 lakh (100,000) customers, 10 regions. Data stored for the past 10 years
What if we store daily balances, and for each of the 1000 branches ?
Implications ? Space, speed.
So what do we do ? Optimise on Fact table size. Ignore dimension tables !!!
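A back-of-the-envelope version of this exercise in Python. The customer, region, branch and year counts come from the exercise; the row-per-combination assumptions, the 365-day year and the bytes per row are illustrative guesses, not figures from the slides.

```python
CUSTOMERS = 100_000      # 1 lakh, from the exercise
REGIONS = 10
BRANCHES = 1_000
YEARS = 10

# Assumption: one balance row per customer per month (each customer sits in one region).
monthly_fact_rows = CUSTOMERS * 12 * YEARS

# Assumption (upper bound): one row per customer per branch per day, 365 days a year.
daily_branch_fact_rows = CUSTOMERS * BRANCHES * 365 * YEARS

# Dimension tables are tiny by comparison.
dimension_rows = {"customer": CUSTOMERS, "region": REGIONS, "time_monthly": 12 * YEARS}

# Assumption: 3 keys of 10 bytes each plus an 8-byte balance per fact row.
BYTES_PER_FACT_ROW = 3 * 10 + 8
monthly_fact_bytes = monthly_fact_rows * BYTES_PER_FACT_ROW

print(monthly_fact_rows, daily_branch_fact_rows, monthly_fact_bytes)
```

Under these assumptions the fact table dominates the dimension tables by orders of magnitude, which is why the optimisation effort goes into the fact table.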

Optimisation : Exercise : Can we modify this Star Schema to cut down space ?

CUSTOMER : Cust Id, Cust Name, Address
SALES_REP : Emp. Id, Name, Qualifications
PRODUCT : Product Code, Product Name, Brand, Rate
ORDER : Order Num, Credit Terms, Lead Time
TIME : Date, Quarter
Fact LINE_ITEM : Emp Id, Cust Id, Date, Order Num, Product Code, Quantity

Is the Dimension Table Normalised ?

Star Schema Option 2
Denormalised Dimension Table

CUSTOMER : Cust Id, Cust Name, Address, Emp. Id, Name, Qualifications
SALES_REP : Emp. Id, Name, Qualifications
PRODUCT : Product Code, Product Name, Brand, Rate
ORDER : Order Num, Credit Terms, Lead Time
TIME : Date, Quarter
Fact LINE_ITEM : Emp Id, Cust Id, Date, Order Num, Product Code, Quantity

Advantage / Disadvantage ? Fact Table space vs. Ease of Querying
Which one would you use ?

Star Schema Option 3
More highly Denormalised Dimension Table

CUSTOMER : Cust Id, Cust Name, Address, Emp. Id, Name, Qualifications
SALES_REP : Emp. Id, Name, Qualifications
PRODUCT : Product Code, Product Name, Brand, Rate
ORDER : Order Num, Credit Terms, Lead Time, Cust Id, Cust Name, Address, Emp. Id, Name, Qualifications
TIME : Date, Quarter
Fact LINE_ITEM : Emp Id, Cust Id, Date, Order Num, Product Code, Quantity

Advantage / Disadvantage ? Table space vs. Ease of Querying
Which one would you use ?

Optimisation : What occupies the maximum space in the Fact Table ?

Fact : Cust. Id, Month & Yr, Region Code, Balance

Keys
How do we reduce the size of the keys ? Use surrogate keys

Optimisation : Use Surrogate keys


Operational Keys - Disadvantage ?
English-like Ids : occupy space

Surrogate Keys - meaningless integers. 2 or 4 byte integers most common.
Advantage ?
Much shorter

Disadvantage ?
Processing required to transform from operational to surrogate keys
In any case, when the data comes from multiple sources, keys in all but one of the sources need to change
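A minimal Python sketch of the transformation from operational to surrogate keys. The function name assign_surrogate_keys and the sample ids are hypothetical; a real ETL tool would persist the mapping between loads.

```python
def assign_surrogate_keys(operational_ids, key_map=None):
    """Give each distinct operational key a small integer, reusing keys already issued."""
    key_map = {} if key_map is None else key_map
    for op_id in operational_ids:
        if op_id not in key_map:
            key_map[op_id] = len(key_map) + 1
    return key_map

# Hypothetical English-like operational keys from the source system.
cust_keys = assign_surrogate_keys(
    ["CUST-DELHI-000017", "CUST-PUNE-000342", "CUST-DELHI-000017"])

# During the transform stage, fact rows carry the surrogate key instead of the long id.
source_facts = [("CUST-DELHI-000017", "2023-01", 100.0),
                ("CUST-PUNE-000342", "2023-01", 250.0)]
fact_rows = [(cust_keys[cust_id], month, balance)
             for cust_id, month, balance in source_facts]
print(cust_keys)
print(fact_rows)
```

The long operational id is stored once, in the dimension row; every fact row then repeats only the short integer.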

Exercise : Add surrogate keys to this schema

CUSTOMER : Cust Key (PK), Cust Id, Cust Name, Address
SALES_REP : Emp Key (PK), Emp. Id, Name, Qualifications
PRODUCT : Product Key, Product Code, Product Name, Brand, Rate
ORDER : Order Key, Order Num, Credit Terms, Lead Time
TIME : Date, Quarter
Fact LINE_ITEM : Emp Id, Emp Key, Cust Id, Cust Key, Date, Order Num, Order Key, Product Code, Product Key, Quantity

Do we need both the original and the surrogate key in the Dimension Table ? Fact Table ?

Designing a Data Warehouse

CUSTOMER : Cust Key (PK), Cust Id, Cust Name, Address
SALES_REP : Emp Key (PK), Emp. Id, Name, Qualifications
PRODUCT : Product Key, Product Code, Product Name, Brand, Rate
ORDER : Order Key, Order Num, Credit Terms, Lead Time
TIME : Date Key (PK), Date, Quarter
Fact LINE_ITEM : Emp Id, Emp Key, Cust Id, Cust Key, Date Key, Date, Order Num, Order Key, Product Code, Product Key, Quantity

Based on this exercise, what is the process for converting an ER Model into a Dimensional Model (Data Warehouse) ?

The DW Design Process


Identify an association table as the central fact table
Choose the Dimensions
Add date (time) dimension
Replace all operational keys with surrogate keys
Promote foreign keys from each dimension table to the fact table
Choose the Facts

Are the Dimensions normalised ?

Dimension : Cust. Id, Cust Name, Address, Phone
Dimension : Region Code, Region Name, Address, Manager
Dimension : Month & Yr, Quarter
Fact : Cust. Id, Month, Region Code, Balance

Thinking question : Is there any situation where we would normalise a dimension table ?
Add fields to each dimension to make it denormalised
Now, what does the schema look like if we normalise each dimension table ? Snowflake Schema
Are Snowflake Schemas desirable ? Why ? Speed of querying. Complexity of querying for the user


Representing dimensions

Each dimension as a hierarchy, from the finest level up to All :
Product : SKU, Brand, Product, Product Category, Department, All products
Store : Store, Locality, PIN Code, City, Region, All
Date : Date, Month, Quarter, Year, All
Promotion : Promotion, All

How do we represent a query - eg. Get Sales by SKU by Store by Date by Promotion ?
How do we show a Roll-up / Drill-down ?

For this query, we need to add fields across rows in the Fact Table. How many rows need to be summed ?
Problems ? Speed.
Solution ? Pre-aggregate sums and store
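A small Python sketch of pre-aggregation, rolling the Date level up to Month. The sample rows and the function name rollup_to_month are hypothetical; dates are assumed to be ISO strings so that the first seven characters give the month.

```python
from collections import defaultdict

# Hypothetical base fact rows at the finest grain: (sku, store, date, sales)
base_facts = [
    ("S1", "Store-A", "2023-01-05", 10.0),
    ("S1", "Store-A", "2023-01-20", 15.0),
    ("S2", "Store-A", "2023-01-20", 7.0),
    ("S1", "Store-B", "2023-02-02", 4.0),
]

def rollup_to_month(rows):
    """Roll Date up to Month (first 7 characters of an ISO date) and sum sales."""
    agg = defaultdict(float)
    for sku, store, date, sales in rows:
        agg[(sku, store, date[:7])] += sales
    return dict(agg)

# Computed once at load time and stored, so queries at Month level scan fewer rows.
monthly_agg = rollup_to_month(base_facts)
print(monthly_agg)
```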

Multiple levels of aggregates

Store multiple level aggregates
Redundancy : To speed up querying

Aggregation : Issues

When are aggregates computed ?
During every update

How do we decide what aggregates to keep ?
Frequency of usage / repeat queries
Priority of users
Managers / Analysts should figure out the likely frequency, and therefore what aggregates to keep

Aggregation : Issues (contd.)

Where are Aggregations stored ?
Separate Fact table
Families of Stars (Constellations)

Users should not be aware of aggregation. The software automatically uses the aggregate Fact table to answer the query. Why ?
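A sketch in Python of how software might pick an aggregate table on the user's behalf. The catalogue of tables, their row counts and the function name choose_fact_table are all hypothetical; real BI tools implement this in their own ways.

```python
# Levels of the Date hierarchy from finest to coarsest.
DATE_LEVELS = ["Date", "Month", "Quarter", "Year"]

# Hypothetical catalogue of stored fact tables: name -> (time level available, row count).
FACT_TABLES = {
    "sales_by_date":  ("Date", 1_000_000),
    "sales_by_month": ("Month", 40_000),
    "sales_by_year":  ("Year", 4_000),
}

def choose_fact_table(requested_level):
    """Pick the smallest stored table whose grain is at least as fine as the request."""
    wanted = DATE_LEVELS.index(requested_level)
    usable = [(rows, name) for name, (level, rows) in FACT_TABLES.items()
              if DATE_LEVELS.index(level) <= wanted]
    return min(usable)[1]

print(choose_fact_table("Quarter"))
```

The user asks for Sales by Quarter; the software answers from the monthly aggregate because no quarterly table is stored and the monthly one is the smallest that can still be rolled up.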

Implementing OLAP

Relational OLAP (ROLAP)
Implemented using a regular Relational DBMS
Linked list structures

Multi-Dimensional OLAP (MOLAP)
MDDB
Created in advance and stored for querying
Array structures

Advantages and Disadvantages ?

ROLAP vs MOLAP

ROLAP
Linked List Structure, therefore slow
Space Optimised : only records that have some value are stored
All data is available in the ROLAP. Can handle large DW
No Pre-aggregated data, therefore slow

MOLAP
Array Structure, therefore fast
All cells in the Fact Table are stored whether they exist or not. Therefore huge space (Explain)
eg. (Bank example) A customer does not have any Account in a given branch. A customer does not perform any transaction in most of his accounts on specific days
Therefore only small DW can be handled. For large DW, summarised data can be kept in the MDDB. Drilling down requires going back to ROLAP (called HOLAP : Hybrid OLAP)
Pre-aggregated data, therefore fast
Sparse Matrix techniques used to optimise space
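A Python sketch contrasting the two storage styles on the bank example. The cube dimensions and the three stored balances are hypothetical, and the flat list stands in for an MDDB array without any sparse-matrix optimisation.

```python
# Hypothetical bank example: balances exist only for a few (customer, branch, day) cells.
customers, branches, days = 100, 20, 30

# ROLAP-style storage: only the records that have a value.
sparse_records = {(0, 3, 0): 500.0, (0, 3, 1): 520.0, (42, 7, 15): 90.0}

# MOLAP-style storage without sparse-matrix techniques: one array cell per combination.
dense_cells = customers * branches * days
dense_array = [0.0] * dense_cells

def offset(cust, branch, day):
    """Position of a cell in the flat array; direct arithmetic, no searching."""
    return (cust * branches + branch) * days + day

for (cust, branch, day), value in sparse_records.items():
    dense_array[offset(cust, branch, day)] = value

fill_ratio = len(sparse_records) / dense_cells
print(dense_cells, len(sparse_records), fill_ratio)
```

The array gives direct access to any cell by arithmetic on the dimension positions, at the cost of allocating space for every combination, most of which are empty.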

ROLAP vs MOLAP : Vendors
DBMS vendors started off with ROLAP (knowhow already existed), but are now adding MOLAP
Pure BI vendors largely into MOLAP (proprietary)

Role Play : Implementation across Multiple Locations

Book
The Data Warehouse Toolkit - Ralph Kimball, Margy Ross - Wiley

