You are on page 1of 168

Introduction to Introduction to Data Warehousing Data Warehousing

Session Objectives
Overview of Data Warehousing Data Warehouse Architectures How to create a data warehouse How to design a data warehouse Understand the ETL process What is metadata How to administer a data warehouse

Copyright 2004, Cognizant Academy, All Rights Reserved

Operational Systems Operational Systems

What is an Operational System?


Operational systems are just what their name implies; they are the systems that help us run the day-to-day enterprise operations.

These are the backbone systems of any enterprise, such as order entry inventory etc.

The classic examples are airline reservations, credit-card authorizations, and ATM withdrawals etc.,

Copyright 2004, Cognizant Academy, All Rights Reserved

Characteristics of Operational Systems


Continuous availability Predefined access paths Transaction integrity Volume of transaction - High Data volume per query - Low Used by operational staff Supports day to day control operations Large number of users

Copyright 2004, Cognizant Academy, All Rights Reserved

Historical Look at Informational Processing


The goal of Informational Processing is to turn data into information! Why? Because business questions are answered using information and the knowledge of how to apply that information to a given problem.
Data Data Information Information Knowledge Knowledge

Copyright 2004, Cognizant Academy, All Rights Reserved

Need for a Separate informational system

Data : Informational data is distinctly different from operational data in its structure and content . Processing : Informational processing is distinctly different from operational processing in its characteristics and use of data

Copyright 2004, Cognizant Academy, All Rights Reserved

The Information Center


Management requires business information A request for a report is made to the Information Center Information Center works on developing the report Requirements for the report must be clarified

Copyright 2004, Cognizant Academy, All Rights Reserved

The Information Center


Report provided to analyst Analyst manipulates data for decision making

Management receives information, but... What took so long? and How do I know its right?

Copyright 2004, Cognizant Academy, All Rights Reserved

The Information Center

Too Many Steps Involved!

Copyright 2004, Cognizant Academy, All Rights Reserved

10

Tactical Information
Transported Quantity Inventory Control System

Order quantity

OLTP Server

Production quantity

Supports day to day control operations Transaction Processing High Performance Operational Systems Fast Response Time Initiates immediate action
Copyright 2004, Cognizant Academy, All Rights Reserved 11

Strategic Information
Production & Inventory Marketing

Finance

Payroll

Understand Business Issues Analyze Trends and Relationships Analyze Problems Discover Business Opportunities Plan for the Future

Copyright 2004, Cognizant Academy, All Rights Reserved

12

Need for Tactical and Strategic information


OLTP Server Data Warehouse Server

Operational Data

Periodic Refresh

Tactical Information

Strategic Information

Operational data helps the organization meet operational and tactical requirements for data. While the Data Warehouse data helps the organization meet strategic requirements for information
Copyright 2004, Cognizant Academy, All Rights Reserved 13

Operational Vs Analytical systems


Operational
v v

Analytical
v v

Primarily primitive, Current; accurate as of now

Primarily derived, Historical; accuracy maintained over time

v v v v v

Constantly updated Minimal redundancy Highly detailed data Referential integrity Supports day-to-day business functions

v v v v v

Less frequently updated Managed redundancy Summarized data Historical integrity Supports long-term informational requirements

Normalized design

De-normalized design
14

Copyright 2004, Cognizant Academy, All Rights Reserved

Data Warehousing Data Warehousing

Data Warehouse Definition


The Data Warehouse is Subject Oriented Integrated Time variant Non-volatile collection of data in support of management decision processes

Copyright 2004, Cognizant Academy, All Rights Reserved

16

Data Warehouse- Differences from Operational Systems


Operational Systems
Order Entry Billing Accounting

Data Warehouse
Customer

Usage

Revenue

v Operational

data is organized by specific processes or tasks and is maintained by separate systems

v Warehoused

data is organized by subject area and is populated from many operational systems
17

Copyright 2004, Cognizant Academy, All Rights Reserved

Data Warehouse- Differences from Operational Systems


Operational Systems Data Warehouse

Application Specific
v Applications

Integrated

and their databases were designed and built separately over long periods of time

v Integrated

from the start

v Evolved

v Designed

(or Architected) at one time, implemented iteratively over short periods of time

Copyright 2004, Cognizant Academy, All Rights Reserved

18

Data Warehouse- Differences from Operational Systems


Operational Systems Data Warehouse

v Primarily

concerned with current data

v Generally

concerned with historical data

Copyright 2004, Cognizant Academy, All Rights Reserved

19

Datawarehouse- Differences from Operational Systems


Operational systems Database

Data warehouse

Update Insert Update Delete Insert

Load/ Update

Initial Load Incremental Load Incremental Load

Constant Change
v Updated

Consistent Points in Time


v Added

constantly

to regularly, but loaded data

is rarely directly changed


v Data

changes according to need, not a fixed schedule

v Does NOT mean the Data warehouse is never updated or never changes!!
20

Copyright 2004, Cognizant Academy, All Rights Reserved

Data in a Data Warehouse


What about the data in the Datawarehouse?

Separate DSS data base Storage of data only, no data is created Integrated and Scrubbed data Historical data Read only (no recasting of history) Various levels of summarization Meta data Subject Easily oriented accessible

Copyright 2004, Cognizant Academy, All Rights Reserved

21

Data Warehousing Features


Strategic enterprise level decision support Multi-dimensional view on the enterprise data Caters to the entire spectrum of management Descriptive, standard business terms High degree of scalability High analytical capability Historical data only

Copyright 2004, Cognizant Academy, All Rights Reserved

22

Datawarehouse - Business Benefits

Benefits To Business Understand business trends Better forecasting decisions Better products to market in timely manner Analyze daily sales information and make quick decisions Solution for maintaining your company's competitive edge

Copyright 2004, Cognizant Academy, All Rights Reserved

23

Data Warehouse- Application Areas


Following are some Business Applications of a data warehouse: Risk management Financial analysis Marketing programs Profit trends Procurement analysis Inventory analysis Statistical analysis Claims analysis Manufacturing optimization Customer relationship management
Copyright 2004, Cognizant Academy, All Rights Reserved 24

Data Marts Data Marts

What is a Data mart?


Data mart is a decentralized subset of data found either in a data warehouse or as a standalone subset designed to support the unique business unit requirements of a specific decision-support system. Data marts have specific business-related purposes such as measuring the impact of marketing promotions, or measuring and forecasting sales performance etc,. Data Mart

Enterprise Data Warehouse

Copyright 2004, Cognizant Academy, All Rights Reserved

Data Mart

26

Data marts - Main Features


Main Features: Low cost Controlled locally rather than centrally, conferring power on the user group. Contain less information than the warehouse Rapid response Easily understood and navigated than an enterprise data warehouse. Within the range of divisional or departmental budgets

Copyright 2004, Cognizant Academy, All Rights Reserved

27

Advantages of Datamart over Datawarehouse


Datamart Advantages : Typically single subject area and fewer dimensions Limited feeds Very quick time to market (30-120 days to pilot) Quick impact on bottom line problems Focused user needs Limited scope Optimum model for DW construction Demonstrates ROI Allows prototyping

Copyright 2004, Cognizant Academy, All Rights Reserved

28

Disadvantages of Data Mart


Data Mart disadvantages : Does not provide integrated view of business information. Uncontrolled proliferation of data marts results in redundancy More number of data marts complex to maintain Scalability issues for large number of users and increased data volume

Copyright 2004, Cognizant Academy, All Rights Reserved

29

Different Approaches for Different Approaches for Implementing Data marts Implementing Data marts

Non-architected Data marts

Q: When is a Data Warehouse not a Data Warehouse? A: When its an unarchitected collection of data marts

Copyright 2004, Cognizant Academy, All Rights Reserved

31

Non-architected Data marts


Source systems Data marts End user access

Significant and expensive duplication of effort and data.

Copyright 2004, Cognizant Academy, All Rights Reserved

32

The upsides of Non-architected Data marts are: 1. Speed 2. Low cost

The downsides of Non-architected Data marts are: 1.Multiple extraction processes 2. Multiple business rules 3. Multiple semantics 4. Extremely challenging to integrate

Copyright 2004, Cognizant Academy, All Rights Reserved

33

Architected Data Warehouse

Data Staging Metadata

Enterprise Data Warehouse

Source systems

End user access

Copyright 2004, Cognizant Academy, All Rights Reserved

34

Unarchitected Data marts Vs Data warehouse


Unarchitected Data Marts Data Warehouse

Enterprise
Data Warehouse

 Easy to do, Not architected ? Are the extracts, transformations, integration's & loads consistent? ? Is the redundancy managed? ? What is the impact on the sources?
Copyright 2004, Cognizant Academy, All Rights Reserved

v Architected v Data and results consistent v Redundancy is managed v Detailed history available for drilldown v Metadata is consistent!

35

The Operational Data The Operational Data Store Store

ODS Definition
The ODS is defined to be a structure that is: Integrated Subject oriented Volatile, where update can be done Current valued, containing data that is a day or perhaps a month old Contains detailed data only.

Copyright 2004, Cognizant Academy, All Rights Reserved

37

Why We Need Operational Data Store?

Need To obtain a system of record that contains the best data that exists in a legacy environment as a source of information

Best here implies data to be


Complete Up to date Accurate

In conformance with the organizations information model

Copyright 2004, Cognizant Academy, All Rights Reserved

38

Operational Data Store - Insulated from OLTP


OLTP Server

ODS data resolves data integration issues Data physically separated from production environment to insulate it from the processing demands of reporting and analysis Access to current data facilitated.

ODS

Tactical Analysis

Copyright 2004, Cognizant Academy, All Rights Reserved

39

Operational Data Store - Data


Detailed data
Records of Business Events

(e.g. Orders capture) Data from heterogeneous sources Does not store summary data Contains current data

Copyright 2004, Cognizant Academy, All Rights Reserved

40

ODS- Benefits
Integrates the data Synchronizes the structural differences in data High transaction performance Serves the operational and DSS environment Transaction level reporting on current data
Flat files 60,5.2,JOHN 72,6.2,DAVID Operational Data Store Relational Database

Excel files Copyright 2004, Cognizant Academy, All Rights Reserved

41

Operational Data Store- Update schedule

ODS Data

Data warehouse Data

Update schedule - Daily or less time frequency Detail of Data is mostly between 30 and 90 days Addresses operational needs

Weekly or greater time frequency Potentially infinite history Address strategic needs

Copyright 2004, Cognizant Academy, All Rights Reserved

42

ODS Vs Data warehouse Characteristics


Parameters Integrated and subject oriented Updated By Transactions Stores Summarized data Used for Strategic decisions Used at managerial level Used for tactical decisions Contains current and detailed data Lengthy historical perspective ODS Data warehouse


43

Copyright 2004, Cognizant Academy, All Rights Reserved

OLAP OLAP

What is OLAP

OLAP tools are used for analyzing data It helps users to get an insight into the organizations data It helps users to carry out multi dimensional analysis on the available data Using OLAP techniques users will be able to view the data from different perspectives Helps in decision making and business planning Converting OLTP data into information Solution for maintaining your company's competitive edge

Copyright 2004, Cognizant Academy, All Rights Reserved

45

OLAP Terminology

Drill Down and Drill Up Slice and Dice Multi dimensional analysis What IF analysis

Copyright 2004, Cognizant Academy, All Rights Reserved

46

Data Warehouse Data Warehouse Architecture Architecture

Basic Data Warehouse Architecture


Meta Data Management

Information Information Access Access

Data warehouse
Reporting tools

Operational & External data

ODS
Mining

Data Staging layer

OLAP

Data Marts

Information Servers

Web Browsers

Administration

Copyright 2004, Cognizant Academy, All Rights Reserved

48

Basic Data Warehouse Architecture

Copyright 2004, Cognizant Academy, All Rights Reserved

49

Operational & External Data layer


The database-ofrecord Consists of system specific reference data and event data Source of data for the data warehouse. Contains detailed data Continually changes due to updates Stores data up to the last transaction.

Operational & External Data Layer

Copyright 2004, Cognizant Academy, All Rights Reserved

50

Data Staging layer


Extracts data from operational and external databases. Transforms the data and loads into the data warehouse. This includes decoding production data and merging of records from multiple DBMS formats.

Data Staging layer

Copyright 2004, Cognizant Academy, All Rights Reserved

51

Data Warehouse layer

Stores data used for informational analysis Present summarized data to the end-user for analysis The nature of the operational data, the end-user requirements and the business
Data ware house Layer

objectives of the enterprise determine the structure

Copyright 2004, Cognizant Academy, All Rights Reserved

52

Meta Data layer


Metadata is data about data. Stored in a repository. Contains all corporate Metadata resources: database catalogs, data dictionaries

Meta Data Layer

Copyright 2004, Cognizant Academy, All Rights Reserved

53

Process Management layer


Scheduler or the high-level job control To build and maintain the data warehouse and data directory information To keep the Data warehouse up-to-date.

Process Management Layer

Copyright 2004, Cognizant Academy, All Rights Reserved

54

Information Access layer


Interfaced with the data warehouse through an OLAP server. Performs analytical operations and presents data for analysis. End-users generates ad-hoc reports and perform multidimensional analysis using OLAP tools

Information Access Layer

Copyright 2004, Cognizant Academy, All Rights Reserved

55

Data Warehouse Architecture Implementation


The following should be considered for a successful implementation of a Data Warehousing solution: Architecture : Open Data Warehousing architecture with common interfaces for product integration Data warehouse database server Tools : Data Modeling tools Extraction and Transformation/propagation tools Analysis/end-user tools: OLAP and Reporting Metadata Management tools
Copyright 2004, Cognizant Academy, All Rights Reserved 56

Different Approaches Different Approaches for Implementing an for Implementing an Enterprise Enterprise Datawarehouse Datawarehouse

What is an Enterprise Datawarehouse? (EDW)


An Enterprise Data Warehouse (EDW) contains detailed as well as summarized data

Separate subject-oriented database.

Supports detailed analysis of business trends over a period of time

Used for short- and long-term business planning and decision making covering multiple business units.

Copyright 2004, Cognizant Academy, All Rights Reserved

58

EDW- Top DownApproach


Heterogeneous Source Systems Source 1 Source 2 Source 3

Common Staging interface Layer

Staging

Data mart bus architecture Layer

Enterprise Datawarehouse

Incremental Architected data marts DM 2 DM 1 DM 3

Copyright 2004, Cognizant Academy, All Rights Reserved

59

EDW- Top Down Approach Implementation


An EDW is composed of multiple subject areas, such as finance, Human resources, Marketing, Sales, Manufacturing, etc.

In a top down scenario, the entire EDW is architected, and then a small slice of a subject area is chosen for construction

Subsequent slices are constructed, until the entire EDW is complete

Copyright 2004, Cognizant Academy, All Rights Reserved

60

Upsides and Downsides of Top-Down Approach


The upsides to a Top Down approach are: 1. Coordinated environment 2. Single point of control & development

The downsides to a Top Down approach are: 1. Cross everything nature of enterprise project 2. Analysis paralysis 3. Scope control 4. Time to market 5. Risk and exposure

Copyright 2004, Cognizant Academy, All Rights Reserved

61

EDW- Bottom upApproach


Heterogeneous Source Systems Source 1 Source 2 Source 3

Common Staging interface Layer

Staging

Data mart bus architecture Layer

Incremental Architected data marts DM 2 DM 1 DM 3

Enterprise Datawarehouse

Copyright 2004, Cognizant Academy, All Rights Reserved

62

EDW- Bottom Up Approach Implementation


Initially an Enterprise Data Mart Architecture (EDMA) is developed

Once the EDMA is complete, an initial subject area is selected for the first incremental Architected Data Mart (ADM).

The EDMA is expanded in this area to include the full range of detail required for the design and development of the incremental ADM.

Copyright 2004, Cognizant Academy, All Rights Reserved

63

Upsides and Downsides of Bottom Up Approach


The upsides to a bottom up approach are: 1. Quick ROI 2. Low risk, low political exposure learning and development environment 3. Lower level, shorter-term political will required 4. Fast delivery 5. Focused problem, focused team 6. Inherently incremental The downsides to a bottom up approach are: 1. Multiple team coordination 2. Must have an EDMA to integrate incremental data marts
Copyright 2004, Cognizant Academy, All Rights Reserved 64

Data warehouse Architecture - Summary


Lot of tools and technologies Data warehouse system architectures. Top down approach Bottom up approach

Copyright 2004, Cognizant Academy, All Rights Reserved

65

Building a Data Building a Data Warehouse Warehouse

Building a Data Warehouse


The initiatives involved in building a data warehouse are Identify the need and justify the cost Architect the warehouse Choose product and vendors Create a dimensional business model Create the physical model Design & develop extract, transform and load systems Test and refine the data warehouse

Data Warehouse design is driven by business users; Not by the IS team

Copyright 2004, Cognizant Academy, All Rights Reserved

67

Data Warehouse Life cycle


Business Requirement

Refine Model

ETL ETL

Data Data Ware Ware house house

Info Info Access Access

Logical Modeling
Map Req. to OLTP

Reporting tools Enterprise Data Warehouse

Reverse Engg.

Mining

OLAP OLTP System Map Data sources External Data Storage Web Browsers

Copyright 2004, Cognizant Academy, All Rights Reserved

68

ER Modeling ER Modeling

Review of Logical Modeling Terms & Symbols


Entities define specific groups of information

Sales Organization Sales Org ID Distribution Channel

Entity
Copyright 2004, Cognizant Academy, All Rights Reserved 70

Review of Logical Modeling Terms & Symbols


Entities are made up of attributes

Sales Organization Sales Org ID Distribution Channel Attributes

Copyright 2004, Cognizant Academy, All Rights Reserved

71

Review of Logical Modeling Terms & Symbols


One or more attribute uniquely identifies an instance of an entity

Sales Organization Sales Org ID Distribution Channel Identifier

Copyright 2004, Cognizant Academy, All Rights Reserved

72

Review of Logical Modeling Terms & Symbols


The logical model identifies relationships between entities

Copyright 2004, Cognizant Academy, All Rights Reserved

{
Relationship

Sales Detail Sales Record ID

Sales Rep Sales Rep ID

73

Logical Data Model


Suppliers Supplier ID Customer Customer ID Retail Market Wholesale Industry

Manufacturing Group Manufacturing Org ID

Sales Detail Sales Record ID

Sales Rep Sales Rep ID

Sales Organization Sales Org ID Distribution Channel

Factory Factory ID

Product Product SKU

Product Sales Plan Plan ID

Copyright 2004, Cognizant Academy, All Rights Reserved

74

Dimensional Modeling Dimensional Modeling

Facts and Measures


ue en

Pr o fita bili

es al S

ev R

Net Pr ofit

Gros s

Marg in

ty

t os C

Facts or Measures are the Key Performance Indicators of an enterprise Factual data about the subject area Numeric, summarized
Copyright 2004, Cognizant Academy, All Rights Reserved 76

Dimension
e nu e ev ) R re s u le as a e S M (

What was sold ? Whom was it sold to ? When was it sold ? Where was it sold ?

Dimensions put measures in perspective What, when and where qualifiers to the measures Dimensions could be products, customers, time, geography etc.

Copyright 2004, Cognizant Academy, All Rights Reserved

77

Some Examples of Data warehousing Dimensions


The following Dimensions are common in all Data warehouses in various forms

Product Dimension Service Dimension Geographic Dimension Time dimension

Copyright 2004, Cognizant Academy, All Rights Reserved

78

Dimension Elements
Geography

Product Time

Components of a dimension Represents the natural elements in the business dimension Directly related to the dimension Facilitates analysis from different perspectives of a dimension Often referred to as levels of a dimension.

Copyright 2004, Cognizant Academy, All Rights Reserved

79

Dimension Hierarchy
Time Dimension

Year Month Date 9/4/99 April

1999 May 28/4/99 5/5/99 17/5/99

Drill Down

Represents the natural business hierarchy within dimension elements Clarifies the drill up, drill down directions Each element represents different levels of aggregation End users may need custom hierarchies
80

Copyright 2004, Cognizant Academy, All Rights Reserved

Drill Up

Multi-Dimensional Analysis
100.0 80.0 East A East B West A North A West A 1st 2nd 3rd Qtr Qtr Qtr East A 4th Qtr West B North A North B

Product

60.0 40.0

Ti m e

20.0 0.0

Geography

East West North

A B A B A B

1st Qtr 2nd Qtr 3rd Qtr 4th Qtr 20.4 27.4 90.0 20.4 19.8 26.6 87.3 19.8 30.6 38.6 34.6 31.6 29.7 37.4 33.6 30.7 45.9 46.9 45.0 43.9 44.5 45.5 43.7 42.6

Characteristic of online analytical processing (OLAP)

Copyright 2004, Cognizant Academy, All Rights Reserved

81

Drill Up & Drill Down


Current Result Set
East West North 1st Qtr 2nd Qtr 3rd Qtr 4th Qtr 20.4 27.4 90 20.4 30.6 38.6 34.6 31.6 45.9 46.9 45 43.9

Up

E as t W es t N orth

1999 158.2 135.4 181.7

Down

Drill down is a process of requesting for detailed information Drill up is a process of summarizing the existing information

Copyright 2004, Cognizant Academy, All Rights Reserved

82

Dimensional Modeling
Subject Area Atomic Detail Dimensions Facts What do you want to know about? What level of detail do you need? Analyze key performance indicators Measures How fresh do you need it? How far back do you need to know it?

Frequency of Update Depth of History

Copyright 2004, Cognizant Academy, All Rights Reserved

83

Requirements for a Dimensional model


Clean, current, accurate logical models Physical models A subject area model Star / Snowflake schema design

Copyright 2004, Cognizant Academy, All Rights Reserved

84

Dimensional Modeling Methodology


s ine s Bu q Re s
Re fin em od el .

Logical Modeling

Map Req. to OLTP

External

Data Sources OLTP System

Copyright 2004, Cognizant Academy, All Rights Reserved

85

Techniques for Implementing a Dimensional model


Star Schema Snow-flake Schema Hybrid Schema Optimal Snow-flake Schema

Copyright 2004, Cognizant Academy, All Rights Reserved

86

Star schema- Logical structure


Dimension Dimension

Employee
Fact Table

Product

Dimension

Time

Employee Product Customer Day Units sold Revenue

Dimension

Customer

Copyright 2004, Cognizant Academy, All Rights Reserved

87

Star schema: Physical view


Geography_dim emp_code emp_name city_code city state_code state region_code region

Tim e _ d im day _c ode date d a y _ o f_ w e e k m on th _s eq m on th _n um m o n t h _ lo n g _ n a m e m o n t h _ s h o rt _ n a m e q t r_ s e q q t r_ n u m q u a rt e r y ear

Fact table emp_code prod_code day_code cust_code units revenue

Product_dim prod_code prod_name brand color_code

Customer_dim cust_code cust_name age_code age sex_code sex city_code city

Copyright 2004, Cognizant Academy, All Rights Reserved

88

Star schema characteristics

A star schema is a highly denormalized, query-centric model where the basic premise is that information can be broken into two groups: facts and dimensions.

In a star schema, facts are in a single place (the fact table) and the descriptions (or elements) that lead to those facts are in dimension tables. The star schema is built for simplicity and speed. The assumption behind it is that the database is static with no updates being performed online

Copyright 2004, Cognizant Academy, All Rights Reserved

89

Star schema: Dimension Table


PK Geography_dim
Empl_Code 2341 3424 1232 3554 3963 2924 2673 3253 234 2342 empl_name Mike King Jim McCann Kitty Stokes Clem Akins Duncan Moore Dawn McGuire Joe Becker Geoff Bergren Garth Boyd Lin Cepele city_code 101 106 104 102 101 103 105 107 106 104 city Atlantic city Chicago Austin Medford Atlantic city Englewood Alverton Springfield Chicago Austin state_code NJ IL PA NJ NJ NJ PA IL IL PA state New Jersey Illinois Pennsylvania New Jersey New Jersey New Jersey Pennsylvania Illinois Illinois Pennsylvania region_code 1 2 1 1 1 1 1 2 2 1 region New Jersey Illinois New Jersey New Jersey New Jersey New Jersey New Jersey Illinois Illinois New Jersey

Attributes Region State City Employee

Elements Region State City Employee

De-normalized structure Easy navigation within the dimension

Copyright 2004, Cognizant Academy, All Rights Reserved

90

Star schema: Fact Table


day_codeprod_codecust_code empl_codeunits sold revenue 1211 345 1231123 1232 23 7935 1211 22 1245223 3554 12 264 1211 112 1522342 3963 6 672 1212 233 1524665 2924 34 7922 1212 112 1366454 2673 76 8512 1212 22 1403453 3554 22 484

sales_fact Dimension Keys Measures

Contains columns for measures and dimensions

Copyright 2004, Cognizant Academy, All Rights Reserved

91

Snow-flake schema
Customer Time

Revenue Units Sold Net Profit

City

Product

Region

Brand

Country

Color

Copyright 2004, Cognizant Academy, All Rights Reserved

92

Snow-flake: Physical view


emp_code emp_name

emp_code city_code cityname city_code state_code statename state_code region_code regionname region_code country_code countryname

day_code day_name week_code

emp_code cust_code prod_code day_code units revenue

cust_code cust_name age_code age sex_code sex city_code city

prod_code brand_code prod_name

week_code week_name month_code

brand_code brand_name color_code

month_code month_name quarter_code year

color_code color_name

Copyright 2004, Cognizant Academy, All Rights Reserved

93

Hybrid schema: Physical view


em p_code em p_nam e city_code city state_code state region_code region

day_code day_name week_code week_code week_name month_code

emp_code cust_code prod_code day_code units revenue

cust_code cust_name age_code age sex_code sex city_code city

prod_code brand_code prod_name

brand_code brand_name color_code

month_code month_name quarter_code year

color_code color_name

Copyright 2004, Cognizant Academy, All Rights Reserved

94

Optimal Snow-flake schema


emp_code emp_name city_code city state_code state region_code region
cust_code cust_name age_code age sex_code sex city_code city

day_code day_name week_code week_code week_name month_code

emp_code cust_code prod_code day_code brand_code units revenue

prod_code brand_code prod_name

brand_code brand_name color_code

month_code month_name quarter_code year

color_code color_name

Copyright 2004, Cognizant Academy, All Rights Reserved

95

What is a Slowly Changing Dimension?


Although dimension tables are typically static lists, most dimension tables do change over time.

Since these changes are smaller in magnitude compared to changes in fact tables, these dimensions are known as slowly growing or slowly changing dimensions.

Copyright 2004, Cognizant Academy, All Rights Reserved

96

Slowly Changing Dimension -Classification


Slowly changing dimensions are classified into three different types TYPE I TYPE II TYPE III

Copyright 2004, Cognizant Academy, All Rights Reserved

97

Slowly Changing Dimensions Type I

Source
Emp id Name Email Emp id

Target
Name Email

1001

Shane

Shane@xyz.c om

1001

Shane

Shane@xyz. com

Source
Emp id Name Email Emp id

Target
Name Email

1001

Shane

Shane@ abc.co.in

1001

Shane

Shane@ abc.co.in

Shane@xyz. com

Copyright 2004, Cognizant Academy, All Rights Reserved

98

Slowly Changing Dimensions Type II

Target Source
Emp id Name Email PM_PRI MARYK EY Emp id Name Email PM_VER SION_N UMBER

10

Shane

Shane@xyz.c om

1000

10

Shane

Shane@x yz. com

Copyright 2004, Cognizant Academy, All Rights Reserved

99

Slowly Changing Dimensions -Versioning


Source
Name Emp id Email

10

Shane

Shane@ abc.co.in

PM_PRIMA RYKEY

Emp id

Name

Email

PM_VERSION_NUMBER

1000

10

Shane

Shane@ xyz.com Shane@ abc.co.in

1001

10

Shane

Target

Copyright 2004, Cognizant Academy, All Rights Reserved

100

Slowly Changing Dimensions -Versioning


Source
Name Emp id Email

10

Shane

Shane@ abc.com

PM_PRIM ARYKEY

Emp id

Name

Email

PM_VERSION_NUM BER

Target

1000

10

Shane

Shane@ xyz.com
Shane@ abc.co.in Shane@ abc.com

1001 1003

10 10

Shane Shane

1 2

Copyright 2004, Cognizant Academy, All Rights Reserved

101

Slowly Changing Dimensions Type II - Flag

Emp id

Name

Email

PM_PR IMAR YKEY

Emp id

Name

Email

PM_CUR RENT_FL AG

10

Shane

Shane@xyz.c om

1000

10

Shane

Shane@ xyz. com

Source

Target
102

Copyright 2004, Cognizant Academy, All Rights Reserved

Slowly Changing Dimensions - Flag Current Source


Emp id Name Email 10 Shane Shane@ abc.co.in

PM_PRIMA RYKEY

Emp id

Name

Email

PM_CURRENT_FLAG

1000

10

Shane

Shane@ xyz.com Shane@ abc.co.in

1001

10

Shane

Target

Copyright 2004, Cognizant Academy, All Rights Reserved

103

Slowly Changing Dimensions - Flag Current Source


Emp id Name Email 10 Shane Shane@ abc.com

PM_PRIMA RYKEY

Emp id

Name

Email

PM_CURRENT_FLAG

Target

1000

10

Shane

Shane@ xyz.com Shane@ abc.co.in Shane@ abc.com

1001

10

Shane

1003

10

Shane

Copyright 2004, Cognizant Academy, All Rights Reserved

104

Slowly Changing Dimensions Type II

Emp id

Name

Email

PM_PRI MARYK EY

Emp id

Name

Email

PM_BEG IN_DAT E

PM_EN D_DATE

10

Shane

Shane@xyz.c om 1000 10 Shane Shane@x yz.com 01/01/00

Source Target
Copyright 2004, Cognizant Academy, All Rights Reserved 105

Slowly Changing Dimensions -Effective Source Date Name


Emp id

Email
Shane@ abc.co.in

10

Shane

PM_PRIMAR YKEY

Emp id

Name

Email

PM_BEGIN_D ATE

PM_END_D ATE

1000

10

Shane

Shane@x yz.com Shane@ abc.co.in

01/01/00

03/01/00

1001

10

Shane

03/01/00

Target
Copyright 2004, Cognizant Academy, All Rights Reserved 106

Slowly Changing Dimensions - Effective Source Date


Emp id Name

Email
Shane@ abc.com

10

Shane

PM_PRIM ARYKEY

Emp id

Name

Email

PM_BEGIN_D ATE

PM_END_DA TE

1000

10

Shane

Shane@ xyz.com Shane@ abc.co.in Shane@ abc.com

01/01/00

03/01/00

1001 1003

10 10

Shane Shane

03/01/00 05/02/00

05/02/00

Target
Copyright 2004, Cognizant Academy, All Rights Reserved 107

Slowly Changing Dimensions Type III

PM_PRI MARYKE Y Emp id Name Email

Emp id

Name

Email

PM_Prev_ Column Name

PM_EFFEC T_DATE

10

Shane

Shane@xyz.c om

10

Shane

Shane@xyz. com

01/01/00

Source

Target

Copyright 2004, Cognizant Academy, All Rights Reserved

108

Slowly Changing Dimensions Type III


Source
Name Emp id

Email
Shane@ abc.co.in

10

Shane

PM_PRIMAR YKEY

Emp id

Name

Email

PM_Prev_Colu mnName

PM_EFFEC T_DATE

10

Shane

Shane@ abc.co.in

Shane@xyz.co m

01/02/00

Target
Copyright 2004, Cognizant Academy, All Rights Reserved 109

Slowly Changing Dimensions Type III


Source
Emp id Name

Email
Shane@ abc.com

10

Shane

PM_PRIM ARYKEY

Emp id

Name

Email

PM_Prev_Colu mnName

PM_EFFECT_ DATE

10

Shane

Shane@ abc.com

Shane@ abc.co.in

01/03/00

Target

Copyright 2004, Cognizant Academy, All Rights Reserved

110

Conformed Dimensions

Conformed dimensions are those which are consistent across Data marts.

Essential for integrating the Data marts into an Enterprise Data warehouse

Copyright 2004, Cognizant Academy, All Rights Reserved

111

Casual Dimensions
Casual dimensions can be used for explaining why a record exists in a fact table Casual dimensions should not change the grain of the fact table

Copyright 2004, Cognizant Academy, All Rights Reserved

112

Casual Dimension - Example


Example: Why did a customer buy a particular product Why did a customer use a particular ATM machine

Copyright 2004, Cognizant Academy, All Rights Reserved

113

Factless Fact Tables


The two types of factless fact tables are: Coverage tables Event tracking tables

Copyright 2004, Cognizant Academy, All Rights Reserved

114

Factless Fact Tables - Coverage Tables


Coverage tables are required when a primary fact table is sparse Example: Tracking products in a store that did not sell

Copyright 2004, Cognizant Academy, All Rights Reserved

115

Factless Fact Tables - Event Tracking


These tables are used for tracking a event: Example: Tracking student attendance

Copyright 2004, Cognizant Academy, All Rights Reserved

116

Helper Tables
Helper tables are used when there are multi valued dimensions. That is when there is a many to many relationship between a fact table and a dimension table Helper table can be placed between two dimensions tables or between a dimension table and a fact table.

Copyright 2004, Cognizant Academy, All Rights Reserved

117

Helper Tables - Example


Example : A customer having more than one bank account

Copyright 2004, Cognizant Academy, All Rights Reserved

118

Surrogate Keys

Joins between fact and dimension tables should be based on surrogate keys Surrogate keys should not be composed of natural keys glued together Users should not obtain any information by looking at these keys These keys should be simple integers

Copyright 2004, Cognizant Academy, All Rights Reserved

119

Why Existing Keys Should Not Be Used

Keys may be reused after they have been purged even thought they are used in the warehouse A product description or a customer description could be changed without changing the key Key formats may be generalized to handle some new situation A mistake could be made and a key could be reused

Copyright 2004, Cognizant Academy, All Rights Reserved

120

ETL- Extraction, ETL- Extraction, Transformation & Transformation & Loading Loading

What is ETL?
ETL(Extraction, Transformation and Loading) is a process by which data is integrated and transformed from the operational systems into the Data warehouse environment
Filters and Extractors Cleanser
Cleaning Rules Rule 1 Rule 2 Rule 3

Operational systems

Error View Check Correct

Transformation Rules Rule 1 Rule 2 Rule 3

Transformation Engine

Integrator

Error View Check Correct

Loader

Warehouse

Copyright 2004, Cognizant Academy, All Rights Reserved

122

Operational Data - Challenges


Data from heterogeneous sources Format differences Data Variations
Context
Across locations the same code could represent different customers Across periods of time a product code could have been reused

Copyright 2004, Cognizant Academy, All Rights Reserved

123

Extraction
80 tables
Data from tables 30

Oracle

Filter Data from 10 tables Where Date<10/12/99

50 tables
files om a fr t

Sybase

Da

Target Text files

Copyright 2004, Cognizant Academy, All Rights Reserved

124

Transformation
Source
Emp id
10001

Last Name
Jones

First Name
Indiana

10002

Holmes

Sherlock

Staging Area Name = Concat(First Name, Last Name) Indiana Jones Sherlock Homes

Copyright 2004, Cognizant Academy, All Rights Reserved

125

Loading
Source
Direct Load

Data Warehouse

Staging Area
d& rme d fo rans ata loa n,T d Clea

ed grat i nt e

Cleaning, Transformation & Integration of Raw data

Copyright 2004, Cognizant Academy, All Rights Reserved

126

Volume of ETL in a Data warehouse


Source OLTP Systems

Data Marts

Metadata Enterprise Data Warehouse

Design Design Mapping Mapping

Extract Extract Scrub Scrub Transform Transform

60

to

0% 8

ft o

e h

rk o

is

re e

Load Load Index Index Aggregation Aggregation

Replication Replication Data Set Distribution Data Set Distribution

Access & Analysis Access & Analysis Resource Scheduling & Distribution Resource Scheduling & Distribution

Meta Data Meta Data System Monitoring System Monitoring


Copyright 2004, Cognizant Academy, All Rights Reserved 127

Factors Influencing ETL Architecture


Volume at each warehouse component. The time window available for extraction. The extraction type (Full,Periodic etc.) Complexity of the processes at each stage.

Copyright 2004, Cognizant Academy, All Rights Reserved

128

Extraction Types Extraction Types

Extraction Types
Extraction

Full Extract

Periodic/ Incremental Extract

Copyright 2004, Cognizant Academy, All Rights Reserved

130

Full Extract

Existing data

Data Mart

Full Extract Source System

Copyright 2004, Cognizant Academy, All Rights Reserved

131

Full Extract

New data

Data Mart

Full Extract Source System

Copyright 2004, Cognizant Academy, All Rights Reserved

132

Full Extract
New data

Data Mart

Full Extract Source System

Copyright 2004, Cognizant Academy, All Rights Reserved

133

Incremental Extract

Existing data Incremental Data Data Mart

Incremental Extract Source System

Copyright 2004, Cognizant Academy, All Rights Reserved

134

Incremental Extract

Incremental Data

New data

Existing data

Data Mart

Source System

Incremental Extract

Changed data

Copyright 2004, Cognizant Academy, All Rights Reserved

135

Incremental Extract

Incremental Data

New data

Incremental addition to data mart Data Mart

Source System Incremental Extract

Changed data

Existing data updated using changed data

Copyright 2004, Cognizant Academy, All Rights Reserved

136

Transformation Transformation

Data Transformation
Conversions
Data type (e.g. Char to Date) Bring data to common units (Currency,Measuring Units)

Classifications
Changing continuous values to discrete ranges (e.g. Temperatures to Temperature Ranges)

Splitting of fields Merging of fields Aggregations (e.g. Sum, Avg., Count) Derivations (Percentages, Ratios, Indicators)

Copyright 2004, Cognizant Academy, All Rights Reserved

138

Structural Transformations
Additive Orders arrive every two minutes OLTP

Aggregate

Data ware house

Average Daily Productivity figures

OLTP

Average

Data ware house

Copyright 2004, Cognizant Academy, All Rights Reserved

139

Format transformation
Source Schema Target Schema Transformation 32 32

Data Type Conversions

Age as a String

Age as an Integer Target Schema


15 10 1999

Splitting

Source Schema
15-10-1992 Transformation

Day Date as a String

Month

Year

Date as a combination of 3 integer fields

Copyright 2004, Cognizant Academy, All Rights Reserved

140

Simple Conversions
Source Schema Target Schema

Rs. 10000

Multiply by 1/43

$232.56

Revenue in Rupees

Revenue in Dollars

1000 lbs.

Multiply by 0.4536

453.56 kgs.

Production in Pounds Source Schema

Production in Kilograms Target Schema

Transformations using Simple Conversions


141

Copyright 2004, Cognizant Academy, All Rights Reserved

Classification
Name John Black Richard Wayne Jennifer Goldman Helmut Koch Anna Ludwig Shito Maketha Tracy Withman Ada Zhesky David Rosenberg Pankaj Sharma Zhu Ling George Kurtz Rita Hartman Age 27 53 45 37 32 28 39 25 33 29 44 27 34
Age Group Frequency 20-25 1 26-30 4 31-35 3 36-40 2 41-45 2 46-50 1 51-55 1 56-60 0

Grouping

Copyright 2004, Cognizant Academy, All Rights Reserved

142

Data Consistency Transformations


Source 1 Gender Male M Female F Source 2 Gender Male Male Female Female Source 1 Gender Male 1 Female 2

Target Gender Male M Female F

Copyright 2004, Cognizant Academy, All Rights Reserved

143

Reconciliation of Duplicated data


Joseph Smith 123 Maine St. MA 70127 J.R.Smith 123 Maine St. MA 70127 Joe Smith 123 Maine St. MA 70127

Joseph R Smith 123 Maine St. MA - 70127

Copyright 2004, Cognizant Academy, All Rights Reserved

144

Data Aggregation - Design Requirements

Aggregates must be stored in their own fact tables and each level should have its own fact table Dimension tables attached to the aggregate fact tables should where ever possible be shrunken versions of the dimension tables attached to the base fact table The base fact table and all of its related aggregate fact tables must be associated together as a family of schemas

Copyright 2004, Cognizant Academy, All Rights Reserved

145

Loading Loading

Types of Data warehouse Loading


Target update types
Insert Update

Copyright 2004, Cognizant Academy, All Rights Reserved

147

Types of Data Warehouse Updates

Data Warehouse

Source data

Data Staging

v v v

Point in Time Snapshots New Data Changed Data

v Insert v Full Replace v Selective Replace v Update v Update plus Retain History

Copyright 2004, Cognizant Academy, All Rights Reserved

148

New Data and Point-In-Time Data Insert

Source data

New data OR Point-in-Time Snapshot (e.g.. Monthly) New Data Added to Existing Data

Copyright 2004, Cognizant Academy, All Rights Reserved

149

Changed Data Insert


Source data Changed Data Added to Existing Data

Changed data

Copyright 2004, Cognizant Academy, All Rights Reserved

150

Change of Dimension values


When the value of dimension in a data warehouse changes, then

History of change needs to be maintained. Changed data alone needs to be identified Changed data should be easier to access. Reconstruction of the dimension table any point in time should be easier

Copyright 2004, Cognizant Academy, All Rights Reserved

151

ETL - Approach in a nutshell


1) Identify the Operational systems based on data islands in the target 2) Map source-target dependencies. 3) Define cleaning and transformation rules 4) Validate source-target mapping 5) Consolidate Meta data for ETL 6) Draw the ETL architecture 7) Build the cleaning, transformation and auditing routines using either a tool or customized programs

Copyright 2004, Cognizant Academy, All Rights Reserved

152

Meta Data in a Meta Data in a Data Warehouse Data Warehouse

What is Metadata?
Data about data and the processes Metadata is stored in a data dictionary and repository. Insulates the data warehouse from changes in the schema of operational systems. It serves to identify the contents and location of data in the data warehouse

Copyright 2004, Cognizant Academy, All Rights Reserved

154

Why Do You Need Meta Data?


Share resources
Users Tools

Document system

Without meta data


Not Sustainable Not able to fully utilize resource

Copyright 2004, Cognizant Academy, All Rights Reserved

155

The Role of Meta Data in the Data Warehouse


Meta Data enables data to become information, because with it you

Know what data you have and You can trust it!

Copyright 2004, Cognizant Academy, All Rights Reserved

156

Meta Data Answers.


How have business definitions and terms changed over time? How do product lines vary across organizations? What business assumptions have been made? How do I find the data I need? What is the original source of the data? How was this summarization created? What queries are available to access the data

Copyright 2004, Cognizant Academy, All Rights Reserved

157

Meta Data Process


Integrated with entire process and data flow
Populated from beginning to end Begin population at design phase of project Dedicated resources throughout
Build Maintain

Design Design Mapping Mapping

Extract Extract Scrub Scrub Transform Transform

Load Load Index Index Aggregation Aggregation

Replication Replication Data Set Distribution Data Set Distribution

Access & Analysis Access & Analysis Resource Scheduling & Distribution Resource Scheduling & Distribution

Meta Data Meta Data System Monitoring System Monitoring


Copyright 2004, Cognizant Academy, All Rights Reserved 158

Types of ETL Meta Data


ETL Meta data

Technical Meta data

Operational Meta data

Copyright 2004, Cognizant Academy, All Rights Reserved

159

Classification of ETL Meta Data


Data Warehouse Meta data This Meta data stores descriptive information about the physical implementation details of data warehouse.

Source Meta data This Meta data stores information about the source data and the mapping of source data to data warehouse data

Copyright 2004, Cognizant Academy, All Rights Reserved

160

ETL Meta Data


Transformations & Integrations. This Meta data describes comprehensive information about the Transformation and loading.

Processing Information This Meta data stores information about the activities involved in the processing of data such as scheduling and archives etc

End User Information This Meta data records information about the user profile and security.

Copyright 2004, Cognizant Academy, All Rights Reserved

161

ETL -Planning for the Movement


The following may be helpful for planning the movement Develop a ETL plan Specifications Implementation

Copyright 2004, Cognizant Academy, All Rights Reserved

162

Data Warehouse Data Warehouse Administration Administration

Data Warehouse Administrative Tasks


Build and maintain the data warehouse Maintaining the meta data To keep the data warehouse up to date Tuning the data warehouse General administrative tasks

Copyright 2004, Cognizant Academy, All Rights Reserved

164

Dormant Data
The data that is hardly used in a data warehouse is called dormant data The faster data warehouses grows the more data becomes dormant. Over a period of time the amount of dormant data in a data warehouse increases

Copyright 2004, Cognizant Academy, All Rights Reserved

165

Origins of Dormant Data


Storing history data that is not required Storing columns that are never used Storing detail level data when only summary level data is used Creating summary data that is never used

Copyright 2004, Cognizant Academy, All Rights Reserved

166

Strategy For Removing Dormant Data


The strategy for removing dormant data might include: Removing data after a period of time say after two years Removing summary data that has not been accessed in the past six months Removing columns that have never or only very infrequently been accessed Storing data for high profile users even though that data has not been accessed Storing data for selected accounts even though that data has not been accessed

Copyright 2004, Cognizant Academy, All Rights Reserved

167

Tuning a Data Warehouse


Some of the techniques that can be used for tuning a data warehouse are: Handling dormant data Storing pre summarized data based on data pattern usage Creating indexes for data that is frequently used Merging tables that have common and regular access

Copyright 2004, Cognizant Academy, All Rights Reserved

168