
DATA WAREHOUSING
Basic Concepts
People Making Technology Work

Agenda

Evolution of DWH
Why should we consider Data Warehousing solutions?
Definition of Data Warehouse
Characteristics of DWH
Differences between DWH and OLTP
DWH Life Cycle
DWH Architecture
Dimensional Data Modeling
Star Schema Design
Fact Table
Fact Granularity
Dimension Tables
Snowflake Schema Design
Important aspects of Star Schema & Snowflake Schema
Data Acquisition (ETL)
ETL Concepts

Evolution of DWH

Traditional approaches to computer system design during the 1980s:

Not optimized for analysis and reporting
Company-wide reporting couldn't be supported from a single system
Developing reports often required writing specific computer programs, which was slow and expensive

Why should we consider Data Warehousing solutions?


When users are requesting access to a large amount of
historical information for reporting purposes, you should
strongly consider a warehouse or mart. The user will benefit
when the information is organized in an efficient manner for
this type of access.

Definition of Data Warehousing

A DWH is a type of relational database system specially designed for query and analysis processing rather than transaction processing.

DWH systems are also called Historical DBs, Read-only DBs, Integrated DBs, Decision Support Systems, Executive Information Systems, and Business Information Systems.

Characteristics of DWH

Subject Oriented
Non Volatile
Integrated
Time Variant

Differences between DWH and OLTP

DWH database (OLAP) | OLTP database
Designed for analysis of business measures by category and attributes. | Designed for real-time business operations.
Optimized for bulk loads and large, complex, unpredictable queries that access many rows per table. | Optimized for a common set of transactions, usually adding or retrieving a single row at a time per table.
Loaded with consistent, valid data; requires no real-time validation. | Optimized for validation of incoming data during transactions; uses validation data tables.
Supports few concurrent users relative to OLTP. | Supports thousands of concurrent users.

DWH database (OLAP) | OLTP database
Multidimensional database structures | Normalized data structures
Indexes - Many | Indexes - Few
Joins - Few | Joins - Many
Aggregated data - More | Aggregated data - Less
No. of users - Few | No. of users - Many
Periodic update of data | Data modification - More
Huge volumes of data | Small volumes of data
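
The contrast between the two access patterns can be sketched with Python's built-in sqlite3 module. This is only an illustration: the orders table, its columns and its rows are made up for the example and are not part of the slides.

```python
import sqlite3

# Hypothetical table and rows, used only to contrast the two query styles.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (order_id, region, amount) VALUES (?, ?, ?)",
    [(1, "East", 100.0), (2, "West", 250.0), (3, "East", 50.0)],
)

# OLTP-style access: retrieve a single row by its key.
single_order = conn.execute(
    "SELECT order_id, region, amount FROM orders WHERE order_id = ?", (2,)
).fetchone()

# OLAP-style access: read many rows and aggregate a measure by an attribute.
sales_by_region = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()

print(single_order)
print(sales_by_region)
```

A real warehouse would run the second kind of query over far larger tables, which is why it is tuned for scans and aggregates rather than single-row lookups.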

DWH Life Cycle

Business Analyst
Data Modeler
ETL Developer
Report Developer
Testing

DWH Architecture
Three common architectures are:
DWH Architecture (Basic)
DWH Architecture (With a staging area)
DWH Architecture (With a staging area and data marts)

DWH Architecture (Basic)

DWH Architecture (with a staging area)

DWH Architecture (with a staging area and data marts)

Dimensional Data Modeling


To develop a Star Schema design, a Data Modeler follows the dimensional modeling approach.
Dimensional modeling is a 3-stage process:

Conceptual Modeling
Logical Modeling
Physical Modeling

Before implementing the schema design, a Data Modeler should complete the following steps:
Understand the client's business requirements
Understand the grain of the fact
Design the Dimension tables
Design the Fact tables

Example of Dimensional Data Model (Star Schema Design)

Fact Table
Contains numeric measures of the business
Contains facts and is connected to dimensions
Has two types of columns:
facts or measures
foreign keys to dimension tables
May contain date-stamped data
May contain either detail-level facts or facts that have been aggregated
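
As a minimal sketch of the two column types, the following uses Python's sqlite3 module to create a small star schema. The table and column names (product_dim, location_dim, time_dim, sales_fact, sales_dollar) and the sample rows are assumptions chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical schema: three dimensions around one fact table.
conn.executescript("""
CREATE TABLE product_dim  (product_id  INTEGER PRIMARY KEY, product_name  TEXT);
CREATE TABLE location_dim (location_id INTEGER PRIMARY KEY, city_name     TEXT);
CREATE TABLE time_dim     (time_id     INTEGER PRIMARY KEY, calendar_date TEXT);
CREATE TABLE sales_fact (
    -- foreign keys to dimension tables
    product_id  INTEGER NOT NULL REFERENCES product_dim (product_id),
    location_id INTEGER NOT NULL REFERENCES location_dim (location_id),
    time_id     INTEGER NOT NULL REFERENCES time_dim (time_id),
    -- fact (measure)
    sales_dollar REAL NOT NULL
);
""")

conn.execute("INSERT INTO product_dim VALUES (1, 'Widget')")
conn.execute("INSERT INTO location_dim VALUES (1, 'Manhattan')")
conn.execute("INSERT INTO time_dim VALUES (1, '2005-01-01')")
conn.execute("INSERT INTO sales_fact VALUES (1, 1, 1, 19.99)")

# A fact row that points at a missing dimension row is rejected.
try:
    conn.execute("INSERT INTO sales_fact VALUES (99, 1, 1, 5.00)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

fact_rows = conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
print(orphan_rejected, fact_rows)
```

Every column of the fact table here is either a key into a dimension or a numeric measure, which matches the two column types listed above.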

Steps in designing Fact Table


Identify a business process for analysis (like sales).
Identify measures or facts (sales dollar).
Identify dimensions for facts (product dimension, location dimension, time dimension, organization dimension).
List the columns that describe each dimension (region name, branch name).
Determine the lowest level of summary in a fact table (sales dollar).

Types of Facts (Measures)

Additive - Measures that can be added across all dimensions.
Semi Additive - Measures that can be added across some dimensions but not others.
Non Additive - Measures that cannot be added across any dimension.

In the example, the sales fact table is connected to the dimensions location, product, time and organization. The measure "Sales Dollar" in the sales fact table can be added across all dimensions, independently or in combination, for example:
Sales Dollar value for a particular product
Sales Dollar value for a product in a location
Sales Dollar value for a product in a year within a location
Sales Dollar value for a product in a year within a location sold or serviced by an employee
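
The difference between additive and semi-additive measures can be sketched in plain Python. The sales rows are invented for illustration, and the account balance is an assumed example of a semi-additive measure that does not appear in the slides.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, location, year, sales_dollar).
sales = [
    ("Widget", "New York", 2004, 100.0),
    ("Widget", "Florida", 2004, 40.0),
    ("Widget", "New York", 2005, 60.0),
    ("Gadget", "New York", 2005, 25.0),
]


def total_by(rows, *key_positions):
    """Sum the last column of each row, grouped by the chosen key columns."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[i] for i in key_positions)
        totals[key] += row[-1]
    return dict(totals)


# Additive: Sales Dollar can be summed over any combination of dimensions.
by_product = total_by(sales, 0)
by_product_location = total_by(sales, 0, 1)

# Semi-additive (assumed example): balances can be summed across accounts
# within one month, but summing one account across months is meaningless.
balances = [
    ("A", "2005-01", 500.0),
    ("B", "2005-01", 300.0),
    ("A", "2005-02", 550.0),
    ("B", "2005-02", 250.0),
]
balance_by_month = total_by(balances, 1)

print(by_product)
print(by_product_location)
print(balance_by_month)
```

A non-additive measure, such as a ratio or percentage, would normally be recomputed from its additive components at each level rather than summed.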

Fact Granularity
A fact table holds numerical information.
Granularity is defined as the level of detail at which fact information is stored.
The level is determined by the dimension tables, for example the time dimension:
Year?
Quarter?
Month?
Week?
Day?
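
One consequence of the grain is that facts stored at a fine level (day) can be rolled up to coarser levels, while facts stored only per month cannot be split back into days. A small sketch, with invented daily rows and a hypothetical roll_up helper:

```python
from collections import defaultdict
from datetime import date

# Hypothetical facts stored at day grain: (day, sales_dollar).
daily_sales = [
    (date(2005, 1, 1), 100.0),
    (date(2005, 1, 2), 50.0),
    (date(2005, 2, 1), 75.0),
]


def roll_up(rows, level):
    """Aggregate day-grain facts to a coarser time level."""
    totals = defaultdict(float)
    for day, amount in rows:
        if level == "month":
            key = (day.year, day.month)
        elif level == "quarter":
            key = (day.year, (day.month - 1) // 3 + 1)
        elif level == "year":
            key = (day.year,)
        else:
            raise ValueError(f"unsupported level: {level}")
        totals[key] += amount
    return dict(totals)


monthly = roll_up(daily_sales, "month")
quarterly = roll_up(daily_sales, "quarter")
yearly = roll_up(daily_sales, "year")

print(monthly)
print(quarterly)
print(yearly)
```

Choosing the finest grain the business needs keeps every coarser report possible, at the cost of a larger fact table.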

Dimension Tables

Contain textual information that represents attributes of the business


Contain relatively static data
Are joined to the fact table through a foreign key reference
Are usually smaller than fact tables

Example of Location Dimension

Location Dimension

Location Dimension Id | Country Name | State Name | County Name | City Name   | Date Time Stamp
                      | USA          | New York   | Shelby      | Manhattan   | 1/1/2005 11:23:31 AM
                      | USA          | Florida    | Jefferson   | Panama City | 1/1/2005 11:23:31 AM
                      | USA          | California | Montgomery  | San Jose    | 1/1/2005 11:23:31 AM
                      | USA          | New Jersey | Hudson      | Jersey City | 1/1/2005 11:23:31 AM
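
A dimension like this is typically used by joining it to the fact table on its key and grouping by one of its attributes. The sketch below assumes location ids 1 and 2 and a sales_fact table with invented amounts; none of these values come from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE location_dim (
    location_id  INTEGER PRIMARY KEY,
    country_name TEXT,
    state_name   TEXT,
    county_name  TEXT,
    city_name    TEXT
);
CREATE TABLE sales_fact (
    location_id  INTEGER REFERENCES location_dim (location_id),
    sales_dollar REAL
);
""")

# Ids 1 and 2 are hypothetical surrogate keys chosen for this example.
conn.executemany(
    "INSERT INTO location_dim VALUES (?, ?, ?, ?, ?)",
    [
        (1, "USA", "New York", "Shelby", "Manhattan"),
        (2, "USA", "Florida", "Jefferson", "Panama City"),
    ],
)
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?)",
    [(1, 100.0), (1, 20.0), (2, 30.0)],
)

# Join the fact table to the dimension on the key, then group by an attribute.
sales_by_state = conn.execute("""
    SELECT d.state_name, SUM(f.sales_dollar)
    FROM sales_fact AS f
    JOIN location_dim AS d ON d.location_id = f.location_id
    GROUP BY d.state_name
    ORDER BY d.state_name
""").fetchall()

print(sales_by_state)
```

Because country, state, county and city all sit in one table, reporting at any of those levels needs only this single join.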

Star Schema Design benefits

Easy for users to understand


Fast response to queries
Supports multidimensional analysis
Supported by many front end tools

Snowflake Schema Design


Dimension table hierarchies are broken into simpler tables
Some organizations normalize the dimension tables to save space
Both Fact and Dimension tables are normalized
Increases the number of joins, which leads to poor performance in data retrieval
May become large and unmanageable
Degrades query performance

Example of Snowflake Schema

Important aspects of Star Schema & Snowflake Schema

In a star schema, every dimension has a primary key.
In a star schema, a dimension table does not have any parent table, whereas in a snowflake schema, a dimension table has one or more parent tables.
In a star schema, hierarchies for the dimensions are stored in the dimension table itself, whereas in a snowflake schema, hierarchies are broken into separate tables. These hierarchies help to drill down from the topmost level to the lowermost level.
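
To illustrate the parent-table structure of a snowflake schema, the sketch below splits a location hierarchy into country, state and city tables. The table names, keys and amounts are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical snowflake: city -> state -> country, each level in its own table.
conn.executescript("""
CREATE TABLE country (country_id INTEGER PRIMARY KEY, country_name TEXT);
CREATE TABLE state (
    state_id   INTEGER PRIMARY KEY,
    state_name TEXT,
    country_id INTEGER REFERENCES country (country_id)
);
CREATE TABLE city (
    city_id   INTEGER PRIMARY KEY,
    city_name TEXT,
    state_id  INTEGER REFERENCES state (state_id)
);
CREATE TABLE sales_fact (
    city_id      INTEGER REFERENCES city (city_id),
    sales_dollar REAL
);
INSERT INTO country VALUES (1, 'USA');
INSERT INTO state VALUES (1, 'New York', 1), (2, 'New Jersey', 1);
INSERT INTO city VALUES (1, 'Manhattan', 1), (2, 'Jersey City', 2);
INSERT INTO sales_fact VALUES (1, 100.0), (2, 40.0), (2, 10.0);
""")

# Top of the hierarchy: three joins are needed to reach the country name.
sales_by_country = conn.execute("""
    SELECT co.country_name, SUM(f.sales_dollar)
    FROM sales_fact AS f
    JOIN city    AS ci ON ci.city_id = f.city_id
    JOIN state   AS st ON st.state_id = ci.state_id
    JOIN country AS co ON co.country_id = st.country_id
    GROUP BY co.country_name
""").fetchall()

# Drill down one level: two joins reach the state name.
sales_by_state = conn.execute("""
    SELECT st.state_name, SUM(f.sales_dollar)
    FROM sales_fact AS f
    JOIN city  AS ci ON ci.city_id = f.city_id
    JOIN state AS st ON st.state_id = ci.state_id
    GROUP BY st.state_name
    ORDER BY st.state_name
""").fetchall()

print(sales_by_country)
print(sales_by_state)
```

In a star schema the same country-level report would need a single join to one location dimension table, which is the trade-off between saved space and extra joins.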

Data Acquisition
It is the process of extracting the relevant business information from the different source systems, transforming the data from one format into another, integrating the data into a homogeneous format, and loading the data into a warehouse database.
Data Extraction (E)
Data Transformation (T)
Data Loading (L)

Sample ETL Process Flow

ETL Process
The ETL process has the following basic steps:
Mapping the data between source systems and target database
Cleansing the source data in the staging area
Transforming the cleansed source data and then loading it into the target system
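
These steps can be sketched as a tiny pipeline in Python. Everything specific here is an assumption for illustration: the CSV source, the column names, the COLUMN_MAPPING dictionary and the sales_fact target table. A production ETL tool would do far more.

```python
import csv
import io
import sqlite3

# Hypothetical source extract with inconsistent casing, stray spaces
# and one invalid amount.
SOURCE_CSV = """cust_region,amt
 east ,100.50
WEST,20
east,not_a_number
"""

# Mapping: source column name -> target column name.
COLUMN_MAPPING = {"cust_region": "region", "amt": "sales_dollar"}


def extract(text):
    """Read the source rows into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def cleanse(rows):
    """Fix inconsistencies and drop rows whose amount is not numeric."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amt"])
        except ValueError:
            continue
        region = row["cust_region"].strip().title()
        clean.append({"cust_region": region, "amt": amount})
    return clean


def transform(rows):
    """Rename source columns to target columns using the mapping."""
    return [{COLUMN_MAPPING[k]: v for k, v in row.items()} for row in rows]


def load(rows, conn):
    """Insert the transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_fact (region TEXT, sales_dollar REAL)"
    )
    conn.executemany(
        "INSERT INTO sales_fact (region, sales_dollar) "
        "VALUES (:region, :sales_dollar)",
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(cleanse(extract(SOURCE_CSV))), conn)
loaded = conn.execute(
    "SELECT region, sales_dollar FROM sales_fact ORDER BY region"
).fetchall()
print(loaded)
```

Here the list of dictionaries passed between the functions plays the role of the staging area: data is cleansed there before anything reaches the target table.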

Source System
A database, application, file, or other storage facility from
which the data in a data warehouse is derived.
Mapping
The definition of the relationship and data flow between
source and target objects.
Staging Area
A place where data is processed before entering the
warehouse.
Cleansing
The process of resolving inconsistencies and fixing the
anomalies in source data, typically as part of the ETL
process.

Transformation
The process of manipulating data. Any manipulation beyond
copying is a transformation. Examples include cleansing,
aggregating, and integrating data from multiple sources.
Transportation
The process of moving copied or transformed data from a
source to a data warehouse.
Target System
A database, application, file, or other storage facility to which the
"transformed source data" is loaded in a data warehouse.

Thank You !!!
