Data warehousing Concepts

J.Srinivasa Reddy

Data Warehousing Concepts

Introduction

In today’s competitive global business environment, understanding and managing enterprise-wide information is crucial for making timely decisions and responding to changing business conditions. A tremendous amount of data is generated by day-to-day business operational applications.

In addition there is valuable data available from external sources such as market research organizations, independent surveys and quality testing labs.

Operational Data

Operational data is the data you use to run your business. This data is typically stored, retrieved, and updated by your Online Transaction Processing (OLTP) system. An OLTP system may be, for example, a reservations system, an accounting application, or an order entry application.

Informational Data

Informational data is created from the wealth of operational data that exists in your business and some external data useful to analyze your business. Informational data is what makes up a data warehouse. Informational data is typically:

• Summarized operational data
• Infrequently updated from the operational systems
• Optimized for decision support applications
• Possibly "read only" (no updates allowed)

Based on the way the data is used, databases can be classified into two types: those used for transactions, Online Transaction Processing (OLTP), and those used for analysis, Online Analytical Processing (OLAP).

Businesses these days hold huge amounts of data, and users are connected to these databases across the globe and around the clock, so the need to maintain a separate database for analysis is clear.

OLTP Databases

OLTP databases are what we generally refer to as “databases”. These are the databases that contain information about day-to-day transactions. Typically, an OLTP database has hundreds of users connected to it, performing transactions around the clock. Most of the time these transactions insert data into the database. Examples: ATM, online shopping, online application filing, online railway reservation.

The number of records being inserted is greater than the number of records being updated or deleted. Hence these databases are optimized for insertions. These databases are normalized to reduce data redundancy and to increase performance while inserting data.

OLAP Systems

An OLAP database is generally used to analyze data. It is optimized for retrieval, so that large amounts of data can be read quickly.

An OLAP database is generally created from the information you have put in an OLTP database. OLAP systems are often referred to as Decision Support Systems (DSS). Decision support (sometimes also called Business Intelligence or BI) is about synthesizing useful knowledge from large data sets.
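
As a rough illustration of the two workloads, the sketch below uses Python's standard sqlite3 module with a hypothetical orders table (the table, columns, and values are assumptions made for this example, not part of any real system). The first part performs OLTP-style single-row inserts; the second runs an OLAP-style query that summarizes many rows at once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)

# OLTP-style work: many small transactions, each inserting one record.
for region, amount in [("South", 120.0), ("North", 80.0), ("South", 50.0)]:
    with conn:  # commits each insert as its own transaction
        conn.execute(
            "INSERT INTO orders (region, amount) VALUES (?, ?)", (region, amount)
        )

# OLAP-style work: one read-only query that scans and summarizes many records.
totals = dict(
    conn.execute("SELECT region, SUM(amount) FROM orders GROUP BY region")
)
print(totals)
```

In a real environment the analytical query would run against a separate database loaded from the OLTP system, so that it does not compete with the transactions.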

Data Warehouses

Data warehousing is a concept. It is a set of hardware and software components that can be used to better analyze the massive amounts of data that companies are accumulating to make better business decisions. Data Warehousing is not just data in the data warehouse, but also the architecture and tools to collect, query, analyze and present information.

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates analysis workload from transaction workload and enables an organization to consolidate data from several sources.

“A data warehouse is a collection of corporate information, derived directly from operational systems and some external data sources. Its specific purpose is to support business decisions, not business operations.”

OLTP Vs Warehouse

Operational System | Data Warehouse
Transaction Processing | Query Processing
Time Sensitive | History Oriented
Operator View | Managerial View
Organized by transactions (Order, Input, Inventory) | Organized by subject (Customer, Product)
Relatively smaller database | Large database size
Many concurrent users | Relatively few concurrent users
Volatile Data | Non Volatile Data
Stores all data | Stores relevant data
Not Flexible | Flexible

Remember: Between OLTP and Data Warehouse systems

• Users are different
• Data content is different
• Data structures are different
• Hardware is different

Drawbacks of Conventional Reporting Architecture

As the volume of data in a database keeps increasing, the performance of report generation degrades.

If a query contains joins, group functions, a GROUP BY clause and so on, which are time-consuming and resource-consuming, all the resources of the system are used for query execution and, in turn, transactions are affected.

Conventional Reporting Architecture (CRA) does not support trend analysis (report generation based on data from the past).

CRA does not support integration of data for report generation.

To overcome the drawbacks of Conventional Reporting Architecture, we use a data warehouse to provide a “Modern Reporting Architecture”.

Modern Reporting Architecture

[Figure: Modern Reporting Architecture, with the labels OLTP, ODS, Historical Data, and Reporting Tools]

Different kinds of Information Needs

• Current information: Is this medicine available in stock? (OLTP)
• Recent information: What are the tests this patient has completed so far? (ODS)
• Historical information: Has the incidence of tuberculosis increased in the last 5 years in the Southern region? (Data Warehouse)

Common Terms in Warehousing

Source:-

A source is a database from which we extract the data. In a typical data warehouse environment the sources already exist and are read only. There can be one or more sources in a given environment.

Target:-

A target is a database into which we load the data. The target database may or may not already exist. In general there is only one target database.

[Figure: Staging Area and Data warehouse]

Staging Area:-

The staging area is a system that stands between the legacy systems and the analytics system (DWH). The Data Staging Area is considered the back room of the DWH. It is where Extract, Transform & Load (ETL) takes place, and it is off limits to end users. The Data Staging Area can be logical or physical. The Staging Area is used to populate the DWH.

Functions of Staging Area:-

Extracting data from multiple legacy systems.

Cleaning the data

Integrating the data from multiple systems into a single DWH.

Transforming legacy system keys into DWH keys (surrogate keys).

Transforming disparate codes for gender, marital status etc. into the DWH standards.

Loading the various DWH tables using automated jobs in a sequence.
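
Two of the functions above, assigning surrogate keys and standardizing disparate codes, can be sketched in a few lines of Python. Everything here is hypothetical: the source system names, the legacy rows, and the assumed DWH gender standard ('M', 'F', 'U') are chosen only for illustration.

```python
from itertools import count

# Hypothetical legacy rows from two source systems with different gender codes.
legacy_rows = [
    {"system": "CRM", "cust_id": "C-17", "gender": "M"},
    {"system": "BILLING", "cust_id": "0042", "gender": "female"},
    {"system": "CRM", "cust_id": "C-17", "gender": "Male"},
]

# Assumed DWH standard: 'M', 'F', or 'U' (unknown).
GENDER_STANDARD = {"m": "M", "male": "M", "f": "F", "female": "F"}

_next_key = count(1)
surrogate_keys = {}  # (source system, legacy key) -> DWH surrogate key


def surrogate_key_for(system, legacy_key):
    """Return the existing surrogate key or assign the next integer."""
    natural = (system, legacy_key)
    if natural not in surrogate_keys:
        surrogate_keys[natural] = next(_next_key)
    return surrogate_keys[natural]


def transform(row):
    """Clean one legacy row into the shape loaded into the DWH."""
    return {
        "customer_key": surrogate_key_for(row["system"], row["cust_id"]),
        "gender": GENDER_STANDARD.get(row["gender"].strip().lower(), "U"),
    }


staged = [transform(r) for r in legacy_rows]
print(staged)
```

The first and third rows carry the same legacy key from the same system, so they receive the same surrogate key, while their differently spelled gender codes both map to the single standard code.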

Need for Staging Area:-

To improve performance of DWH.

To integrate data from multiple sources.

For cleansing erroneous data, accidentally miscoded data, and deliberately distorted data in the legacy systems before loading into the DWH.

For data adjustment before the data can be used for analysis. Ex: multiple currencies must be translated into one common currency.

For aggregating the data before loading it into aggregate tables in the DWH.
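
The adjustment and aggregation steps can be pictured with the small Python sketch below. The exchange rates, regions, and amounts are invented for the example and are not real market values; a production staging area would normally do this inside an ETL tool or database.

```python
from collections import defaultdict

# Assumed, illustrative exchange rates to a common currency (USD).
RATES_TO_USD = {"USD": 1.0, "EUR": 1.25, "CAD": 0.75}

# Hypothetical staged sales rows in mixed currencies.
sales = [
    {"region": "South", "amount": 100.0, "currency": "USD"},
    {"region": "South", "amount": 80.0, "currency": "EUR"},
    {"region": "North", "amount": 200.0, "currency": "CAD"},
]


def adjust(row):
    """Data adjustment: express the amount in the common currency."""
    return {
        "region": row["region"],
        "amount_usd": row["amount"] * RATES_TO_USD[row["currency"]],
    }


def aggregate(rows):
    """Data aggregation: total per region, ready for an aggregate table."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount_usd"]
    return dict(totals)


region_totals = aggregate(adjust(r) for r in sales)
print(region_totals)
```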

Staging Area Processes:-

Data acquisition process

Data integration Process

Data adjustment process

Data aggregation process

Data cleansing process

ODS (Operational Data Store):-

Typically an ODS is a normalized structure that integrates the data based on a subject area. It holds only one to three months' worth of historical data, unlike a data warehouse, which stores years of historical data. It is used to store a copy of the current data.

An ODS is also used to populate the warehouse.

Types of ODS:-

Class I:

In this environment, updates to the source system are reflected in the ODS within a few seconds.

Class II:

A Class II ODS is updated intraday, every one to three hours.

Class III:

A Class III ODS is usually updated once a day, typically at night after the source system has closed down.

OLTP Vs ODS Vs DWH

Characteristic | OLTP | ODS | Data Warehouse
Audience | Operating personnel | Analysts | Managers and analysts
Data access | Individual records, transaction driven | Individual records, transaction or analysis driven | Set of records, analysis driven
Data content | Current, real-time | Current and near-current | Historical
Data structure | Detailed | Detailed and lightly summarized | Detailed and summarized
Data organization | Functional | Subject-oriented | Subject-oriented
Type of data | Homogeneous | Homogeneous | Vast supply of very heterogeneous data
Data redundancy | Non-redundant within system; unmanaged redundancy among systems | Somewhat redundant with operational databases | Managed redundancy
Data update | Field by field | Field by field | Controlled batch
Database size | Moderate | Moderate | Large to very large
Development methodology | Requirements driven, structured | Data driven, somewhat evolutionary | Data driven, evolutionary
Philosophy | Support day-to-day operation | Support day-to-day decisions & operational activities | Support managing the enterprise

Metadata:-

Metadata is data (information) about the data. Metadata describes the data contained in the data warehouse as well as the sources of the data and the transformations or derivations that may have been performed to create data elements.

For the SDB (Source Database) and SDO (Source Database Object):
• Connection information: ETL tool to SDB
• Information about SDO: table definitions (table name, no. of columns) and column definitions (column names, data types & length)

For the TDB (Target Database) and TDO (Target Database Object):
• Connection information: ETL tool to TDB
• Information about TDO: table definitions (table name, no. of columns) and column definitions (column names, data types & length)

Information about the data processing element.

[Figure: The ETL tool extracts from the SDO (columns C1 to C4) in the Source DB, processes the data (transformation, filter) in a data process unit, and loads the TDO (columns C1 to C4) in the Target DB.]
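
One way to picture this metadata is as a few small record types. The Python sketch below is only an assumption about how such metadata might be held in memory; the class names (ColumnDef, TableDef, Mapping), table names, and filter condition are hypothetical and do not correspond to any particular ETL tool.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnDef:
    name: str
    data_type: str
    length: int


@dataclass
class TableDef:
    table_name: str
    columns: list = field(default_factory=list)

    @property
    def column_count(self):
        return len(self.columns)


@dataclass
class Mapping:
    """Metadata about one data processing element (names are illustrative)."""
    source: TableDef
    target: TableDef
    filter_condition: str


sdo = TableDef("SRC_CUSTOMER", [ColumnDef("C1", "VARCHAR", 20),
                                ColumnDef("C2", "INTEGER", 4)])
tdo = TableDef("DIM_CUSTOMER", [ColumnDef("C1", "VARCHAR", 20),
                                ColumnDef("C2", "INTEGER", 4)])
mapping = Mapping(sdo, tdo, filter_condition="C2 > 0")
print(mapping.source.table_name, "->", mapping.target.table_name,
      f"({mapping.source.column_count} columns)")
```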

Data mart:-

A data warehouse with a particular subject of interest can be called a data mart. A data warehouse can contain any number of data marts.

Ex: sales data mart, finance data mart, inventory data mart, HR data mart.

Data marts are work-group or departmental warehouses, which are generally small in size, typically containing 10 to 50 GB of data.
Data marts contain informational data that is tailored to the needs of the specific departmental work group.
Data marts are less expensive and take less time to implement, with quick ROI (return on investment).
Data marts are scalable to a full data warehouse, and data marts are subsets of the enterprise data warehouse.

Advantages of Data mart:-

Easy access to frequently needed data.

Creates collective view by a group of users.

Improves end-user response time.

Ease of creation.

Lower cost than implementing a full DWH.

Potential users are more clearly defined than in a full Data warehouse.

According to Bill Inmon, known as the father of Data Warehousing,

A data warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions.

Subject Oriented:

Information is presented according to specific subjects or areas of interest. Data is intended to provide information about a particular subject.
Example: For a manufacturing company, sales, shipments, and inventory are critical business subjects.

Integrated:

Data is gathered into the data warehouse from a variety of sources and merged into a coherent whole. The data warehouse contains information about a variety of subjects, from a variety of sources.

Time-Variant:

Contains a history of the subject, as well as current information. Historical information is an important component of a data warehouse. The time-variant nature of data in a data warehouse:

• Allows for analysis of the past.
• Relates information to the present.
• Enables forecasts for the future.

Non-Volatile:

Information, once entered into the warehouse, should not change. It is stable information that doesn’t change each time an operational process is executed. Information is consistent regardless of when the warehouse is accessed.

Ralph Kimball’s Definition:

A data warehouse consists of a copy of transactional data specifically structured for query and analysis.

Approaches of Data warehouse:

• Top-Down Approach (Bill Inmon Approach)
• Bottom-Up Approach (Kimball Approach)

Advantages of Top – Down Approach are:

A truly corporate effort, an enterprise view of data.

Inherently architected – not a union of disparate data marts.

Single, central storage of data about the content

Centralized rules and control.

May see quick results if implemented with iterations.

Disadvantages of Top – Down Approach are:

Takes longer to build even with an iterative method.

High exposure/risk to failure.

Needs high level of cross-functional skills.

Advantages of Bottom-Up Approach are:

Faster and easier implementation of manageable pieces

Favorable return on investment and proof of concept.

Less risk of failure.

Inherently incremental; can schedule important data marts first.

Allows project team to learn and grow.

Disadvantages of Bottom-Up Approach are:

Each data mart has its own narrow view of data.

Permeates redundant data in every data mart.

Perpetuates inconsistent and conflicting data.

This is a simple data warehousing architecture. End users directly access data derived from several source systems through the data warehouse.

Cubes
Dimensional Modeling

Introduction to Dimensional Modeling

Dimensional modeling (DM) is the name of a logical design technique often used for data warehouses. It is the method of organizing data in a DWH. Its proponents, notably Ralph Kimball, describe it as the only viable technique for databases that are designed to support end-user queries. The goal of dimensional modeling is to represent a set of business measurements in a standard framework that allows for high-performance access. Any business process is an entity in dimensional modeling.

Dimensional modeling is attractive because end users usually easily understand this framework. The schemas that result from dimensional modeling are so predictable that query tool vendors can build their tools around a set of well-known structures.

Drawbacks of E-R Modeling for DWH

A data warehouse contains redundant data. E-R modeling aims to remove redundancy, so it is a poor fit where redundancy is intended. If an E-R model is still used for a DWH, it increases the complexity of the relationships between tables, which decreases DWH performance.

To overcome these problems we use a simplified E-R model, adapted to DWH requirements, called the Dimensional Model.

Why Dimensional Modeling

• Logical model is easy to understand

– Standard framework and business model for end user apps

– Model can be done (mostly) independent of expected queries

– Handles changes easily, such as adding new dimensional attributes

• Optimized for performance

– High performance “browsing” across the attributes

– Strategy for handling aggregates, leveraging summary tables or OLAP aggregation technologies

– Logically redundant with the base table to enhance query performance

– OLAP engines can make strong assumptions on how to optimize

• Historical tracking of information

– Strategies for handling changing dimensions

– Fact design allows high-volume snapshots and transaction tracking

Types of Dimensional Modeling

There are different types of Dimensional Models.

1. Star Schema Model

2. Snowflake Schema Model

3. Galaxy Schema Model

Star Schema Model

Star Schema is a relational database schema for representing multidimensional data. It is the simplest form of data warehouse schema and contains one or more dimensions and fact tables. It is called a star schema because the entity-relationship diagram between dimensions and fact tables resembles a star, where one fact table is connected to multiple dimensions. The center of the star schema consists of a large fact table, which points towards the dimension tables.

Fact Table

The centralized table in a star schema is called the FACT table. It contains facts and is connected to the dimensions.
A fact table typically has two types of columns:
1. Columns that contain facts.
2. Columns that are foreign keys to dimension tables.
The primary key of a fact table is usually a composite key that is made up of all of its foreign keys.
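
A minimal star schema along these lines can be sketched with Python's standard sqlite3 module. The table and column names (dim_product, dim_region, dim_time, fact_sales) are hypothetical and chosen only for illustration; the point is that the fact table holds foreign keys to each dimension plus numeric facts, with a composite primary key built from the foreign keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE dim_region  (region_key  INTEGER PRIMARY KEY, region_name  TEXT);
CREATE TABLE dim_time    (time_key    INTEGER PRIMARY KEY, year INTEGER);

-- Fact table: foreign keys to every dimension plus the numeric facts.
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    region_key  INTEGER REFERENCES dim_region(region_key),
    time_key    INTEGER REFERENCES dim_time(time_key),
    quantity    INTEGER,
    amount      REAL,
    PRIMARY KEY (product_key, region_key, time_key)  -- composite key
);
""")

# List the fact table columns that take part in its primary key.
pk_columns = [row[1] for row in conn.execute("PRAGMA table_info(fact_sales)")
              if row[5] > 0]
print(pk_columns)
```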

Fact = Subject of Analysis (e.g. Sales)
Measures = Attributes describing facts (e.g. Quantity, Price)
Derived Measures (e.g. Profit)

• Fact tables contain numbers and other business metrics.
– Define the basic measures users want to analyze
– Numbers are then aggregated according to related dimensions
• Fact tables contain dimension keys
– Define the relationship between measures and dimensions using surrogate keys
• Typically narrow tables, but often very large

Fact tables store different types of measures: additive, non-additive and semi-additive.
Additive - Measures that can be added across all dimensions.
Non Additive - Measures that cannot be added across any dimension.
Semi Additive - Measures that can be added across some dimensions but not across others.
A fact table might contain either detail-level facts or facts that have been aggregated (fact tables that contain aggregated facts are often called summary tables instead).
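
The difference between additive and semi-additive measures is easiest to see with numbers. The sketch below uses a made-up set of daily account-balance facts (the days, accounts, and amounts are assumptions for the example): deposits are additive, while balances are semi-additive because they can be summed across accounts but not across time.

```python
# Hypothetical daily account facts: (day, account, balance, deposits).
facts = [
    ("Mon", "A", 100, 100),
    ("Mon", "B", 200, 200),
    ("Tue", "A", 150, 50),
    ("Tue", "B", 200, 0),
]

# Additive measure: deposits can be summed across every dimension.
total_deposits = sum(deposit for _, _, _, deposit in facts)

# Semi-additive measure: balances can be summed across accounts for one day...
balance_on_tue = sum(bal for day, _, bal, _ in facts if day == "Tue")

# ...but summing balances across days counts the same money more than once.
misleading_total = sum(bal for _, _, bal, _ in facts)

print(total_deposits, balance_on_tue, misleading_total)
```

A ratio such as a profit margin would be an example of a non-additive measure: it has to be recomputed from its additive components rather than summed.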

In the real world, it is possible to have a fact table that contains no measures or facts. These tables are called factless fact tables.
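
A common textbook illustration is attendance tracking, sketched below with sqlite3. The tables and rows are hypothetical; the fact table records only which student attended which class, so analysis is done by counting rows instead of summing a measure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_student (student_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_class   (class_key   INTEGER PRIMARY KEY, title TEXT);

-- Factless fact table: only dimension keys, no measure columns.
CREATE TABLE fact_attendance (
    student_key INTEGER REFERENCES dim_student(student_key),
    class_key   INTEGER REFERENCES dim_class(class_key),
    PRIMARY KEY (student_key, class_key)
);

INSERT INTO dim_student VALUES (1, 'Asha'), (2, 'Ravi');
INSERT INTO dim_class   VALUES (10, 'SQL Basics'), (20, 'ETL Design');
INSERT INTO fact_attendance VALUES (1, 10), (2, 10), (1, 20);
""")

# Analysis is done by counting rows rather than summing a measure.
attendance = dict(conn.execute("""
    SELECT c.title, COUNT(*)
    FROM fact_attendance f
    JOIN dim_class c ON c.class_key = f.class_key
    GROUP BY c.title
"""))
print(attendance)
```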

Steps in designing Fact Table

Identify a business process for analysis (like sales).

Identify measures or facts (sales amount).

Identify dimensions for facts (product dimension, location dimension, time dimension, organization dimension).

List the columns that describe each dimension (region name, branch name).

Determine the lowest level of summary in a fact table (sales amount).

Dimension Table

The detailed descriptions of your facts are dimensions. A dimension table contains attributes that describe fact records in the fact table. A dimension table is a table, typically in a data warehouse, that contains further information about an attribute in a fact table.

For example, a SALES table can have the following dimension tables: TIME, PRODUCT, REGION, SALESPERSON, etc.

Dimensions are the qualifiers that make the measures of the fact table meaningful, because they answer the what, when, and where aspects of a question.

For example, consider the following business questions, for which the dimensions are utilized:

What accounts produced the highest revenue last year?

What was our profit by vendor?

How many units were sold for each product?

In the preceding set of questions, revenue, profit, and units sold are measures (not dimensions), as each represents quantitative or factual data.

In the above set of questions, Account, Year, Vendor, and Product are dimensions that make the measures meaningful by providing further information.
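
The last question, units sold for each product, can be expressed as a join between a fact table and a dimension table. The sketch below uses sqlite3 with hypothetical tables and sample rows: units_sold is the measure, and the Product dimension supplies the grouping attribute.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE fact_sales  (product_key INTEGER, units_sold INTEGER);
INSERT INTO dim_product VALUES (1, 'Tea'), (2, 'Coffee');
INSERT INTO fact_sales  VALUES (1, 5), (2, 3), (1, 7);
""")

# The measure (units_sold) is aggregated; the dimension gives it meaning.
units_by_product = conn.execute("""
    SELECT p.product_name, SUM(f.units_sold)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY p.product_name
    ORDER BY p.product_name
""").fetchall()
print(units_by_product)
```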

Dimensions = static structure of business information

Dimension Details

• Attributes
- Descriptive characteristics of an entity
- Building blocks of dimensions, describe each instance
- Usually text fields, with discrete values
- e.g., the flavor of a product, the size of a product
• Dimension Keys
- Surrogate Keys
- Candidate Business Keys
• Dimension Granularity
- Granularity in general is the level of detail of data contained in an entity
- A dimension's granularity is the lowest level object which uniquely identifies a member
- Typically the identifying name of a dimension

Dimension Keys

• Dimension Business Key
- Column or columns that identify a unique instance of the business record (not necessarily a unique record in the dimension table)
- Used in the ETL process to tie fact records with dimension members
• Dimension Record Surrogate Key
- Defines the dimension’s primary key
- Relates to the fact table foreign key field
- Numeric data type, typically integer (2, 4, or 8 bytes)

Dimension Surrogate Keys

• Surrogate Key Usage
– Consolidates multi-value business keys
– Allows tracking of dimension history
– Standardizes dimension tables
– Limits fact table width for optimization
• Surrogate Key Design Practices
– Avoid smart keys
– Avoid production keys (may change!)
– The company may acquire a competitor and thereby change the key building rules
– A record may change while its key is deliberately left unchanged
– Narrow as possible
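
To make the relationship between business keys and surrogate keys concrete, here is a small Python sketch. The CustomerDimension class and its upsert method are hypothetical names invented for this example; it shows one simple policy in which a changed business record gets a new row and a new surrogate key, so that history is kept, while an unchanged record reuses its existing key.

```python
from itertools import count


class CustomerDimension:
    """Toy dimension table keyed by surrogate keys (illustrative only)."""

    def __init__(self):
        self._next_key = count(1)
        self.rows = []      # each row: surrogate key, business key, attributes
        self.current = {}   # business key -> surrogate key of latest version

    def upsert(self, business_key, city):
        """Add a new row version when the business record is new or changed."""
        key = self.current.get(business_key)
        # Keys are assigned 1, 2, 3, ... so row `key` sits at index key - 1.
        if key is not None and self.rows[key - 1]["city"] == city:
            return key  # unchanged: reuse the existing surrogate key
        key = next(self._next_key)
        self.rows.append({"customer_key": key,
                          "business_key": business_key,
                          "city": city})
        self.current[business_key] = key
        return key


dim = CustomerDimension()
k1 = dim.upsert("CUST-001", "Chennai")
k2 = dim.upsert("CUST-001", "Chennai")    # same data, same key
k3 = dim.upsert("CUST-001", "Hyderabad")  # changed data, new row and key
print(k1, k2, k3, len(dim.rows))
```

After these calls the business key CUST-001 appears in two dimension rows, which is why a business key is not necessarily unique in the dimension table.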

Types of Dimensions

1. Conformed Dimensions.

A conformed dimension is a dimension that can be used across multiple data marts. It is basically one dimension that is shared by two or more fact tables.
Conformed dimensions are reusable dimensions: dimensions that are used multiple times or in multiple data marts, that is, common dimensions shared among multiple star schemas.
e.g.: a Time dimension shared between two different facts.
If two fact tables share the same dimension key, that dimension can be called a conformed dimension.
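
The Time example can be sketched with sqlite3 as below. The fact tables, columns, and rows are hypothetical; the sketch only shows that two fact tables referencing the same dimension key can be reported side by side.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time (time_key INTEGER PRIMARY KEY, year INTEGER);

-- Two fact tables (e.g. in two data marts) reference the same dim_time.
CREATE TABLE fact_sales (
    time_key INTEGER REFERENCES dim_time(time_key), amount REAL);
CREATE TABLE fact_shipments (
    time_key INTEGER REFERENCES dim_time(time_key), units INTEGER);

INSERT INTO dim_time VALUES (1, 2023), (2, 2024);
INSERT INTO fact_sales VALUES (1, 100.0), (2, 150.0), (2, 50.0);
INSERT INTO fact_shipments VALUES (1, 4), (2, 6);
""")

# Both facts use the same dimension key, so their results line up by year.
by_year = conn.execute("""
    SELECT t.year,
           (SELECT SUM(amount) FROM fact_sales s
             WHERE s.time_key = t.time_key),
           (SELECT SUM(units) FROM fact_shipments h
             WHERE h.time_key = t.time_key)
    FROM dim_time t
    ORDER BY t.year
""").fetchall()
print(by_year)
```
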

2. Junk Dimensions.

A number of very small dimensions might be lumped together to form a single dimension, a junk dimension, whose attributes are not closely related.
A "junk" dimension is a collection of random transactional codes, flags and/or text attributes that are unrelated to any particular dimension. The junk dimension is simply a structure that provides a convenient place to store the junk attributes.
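
One common way to build a junk dimension is to enumerate every combination of the small flags and give each combination its own key. The Python sketch below follows that idea; the flags (payment type, gift wrap, order channel) and their values are assumptions made for the example.

```python
from itertools import product

# Hypothetical low-cardinality flags that do not belong to any real dimension.
payment_types = ["cash", "card"]
gift_wrap = [False, True]
order_channels = ["web", "store"]

# One junk dimension row per combination, each with its own surrogate key.
junk_dimension = {
    combo: key
    for key, combo in enumerate(
        product(payment_types, gift_wrap, order_channels), start=1
    )
}

# A fact row stores a single junk key instead of three separate flag columns.
fact_row = {"amount": 25.0, "junk_key": junk_dimension[("card", True, "web")]}
print(len(junk_dimension), fact_row)
```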

3. Degenerate Dimension

A degenerate dimension is data that is dimensional in nature but stored in the fact table.
A degenerate dimension is a dimension which has only a single attribute. It is a dimension key in the fact table that is not connected to any dimension table.
A degenerate dimension corresponds to a dimension table that has no attributes. It can act as the primary key for the fact table and as a grouping element. It is generated at the time of the transaction.
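
An invoice number is a typical example. In the sketch below, which uses invented fact rows, invoice_no lives directly in the fact table with no dimension table behind it, yet it still serves as a grouping element.

```python
from collections import defaultdict

# Hypothetical sales fact rows. invoice_no has no dimension table of its own:
# it is a degenerate dimension stored directly in the fact table.
fact_sales = [
    {"invoice_no": "INV-1001", "product_key": 1, "amount": 40.0},
    {"invoice_no": "INV-1001", "product_key": 2, "amount": 10.0},
    {"invoice_no": "INV-1002", "product_key": 1, "amount": 25.0},
]

# The degenerate dimension still works as a grouping element.
amount_by_invoice = defaultdict(float)
for row in fact_sales:
    amount_by_invoice[row["invoice_no"]] += row["amount"]

print(dict(amount_by_invoice))
```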
