
Data Warehouse Concepts & Architecture

By Monstercourses.com

Topics
Data Warehousing & Architecture
Data Mart
ETL
OLTP
DSS
Database Design
Star Schema
Snowflake Schema
Useful information
Pitfalls
Summary


A producer wants to know:
What is the total revenue for the year 2009?
What is the most effective distribution channel?
Who are my customers and what products are they buying?
What product promotions have the biggest impact on revenue?
What impact will new products/services have on revenue and margins?
Which customers are most likely to go to the competition?

Data, Data everywhere, yet ...
I can't find the data I need
data is scattered over the network
many versions, subtle differences

I can't get the data I need
need an expert to get the data

I can't understand the data I found
available data is poorly documented

I can't use the data I found
results are unexpected
data needs to be transformed from one form to another

Scenario 1
ABC Pvt Ltd is a company with branches at Mumbai, Delhi, Chennai and Bangalore. The Sales Manager wants a quarterly sales report. Each branch has a separate operational system.

Scenario 1: ABC Pvt Ltd.
[Figure: the Mumbai, Delhi, Chennai, and Bangalore branches and the Sales Manager, who needs sales per item type per branch for the first quarter.]

Solution 1: ABC Pvt Ltd.
Extract sales information from each database.
Store the information in a common repository at a single site.
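The two steps above can be sketched as follows. This is only an illustration: the branch rows, the `sales` table, and its column names are hypothetical, and an in-memory SQLite database stands in for the common repository.

```python
import sqlite3

# Hypothetical extracts, one list of (item_type, quarter, amount) per branch system.
branch_extracts = {
    "Mumbai": [("Biscuits", "Q1", 1200.0), ("Soap", "Q1", 800.0)],
    "Delhi": [("Biscuits", "Q1", 950.0)],
    "Chennai": [("Soap", "Q1", 400.0)],
    "Bangalore": [("Biscuits", "Q1", 700.0), ("Soap", "Q1", 300.0)],
}

# The common repository at a single site (in-memory stand-in for the warehouse).
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE sales (branch TEXT, item_type TEXT, quarter TEXT, amount REAL)"
)
for branch, rows in branch_extracts.items():
    warehouse.executemany(
        "INSERT INTO sales VALUES (?, ?, ?, ?)",
        [(branch, item, quarter, amount) for item, quarter, amount in rows],
    )

# Sales per item type per branch for the first quarter.
report = warehouse.execute(
    "SELECT branch, item_type, SUM(amount) FROM sales "
    "WHERE quarter = 'Q1' GROUP BY branch, item_type "
    "ORDER BY branch, item_type"
).fetchall()
```

Once the data sits in one place, the Sales Manager's report is a single query instead of four separate extracts.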


Solution 1: ABC Pvt Ltd.
[Figure: data from the Mumbai, Delhi, Chennai, and Bangalore branches flows into a data warehouse; query and analysis tools produce the report for the Sales Manager.]

Scenario 2
One Stop Shopping Super Market has a huge operational database. Whenever executives want a report, the OLTP system becomes slow and data entry operators have to wait for some time.

Scenario 2: One Stop Shopping
[Figure: management runs reports against the operational database while the data entry operators wait.]

Solution 2
Extract the data needed for analysis from the operational database.
Store it in the warehouse.
Refresh the warehouse at regular intervals so that it contains up-to-date information for analysis.
The warehouse will contain data with a historical perspective.

Solution 2
[Figure: data entry operators run transactions against the operational database; data is extracted into the data warehouse, and the manager's reports run against the warehouse.]

Scenario 3
Cakes & Cookies is a small, new company. The president wants the company to grow and needs information to make correct decisions.

Solution 3
Improve the quality of data before loading it into the warehouse.
Perform data cleaning and transformation before loading the data.
Use query and analysis tools to support ad hoc queries.

Solution 3
[Figure: the President uses a query and analysis tool on the data warehouse; chart labels: sales, expansion, improvement, time.]

What the users are saying...
Data should be integrated across the enterprise
Summary data has a real value to the organization
Historical data holds the key to understanding data over time
What-if capabilities are required

Application Areas
Industry                  Application
Finance                   Credit card analysis
Insurance                 Claims, fraud analysis
Telecommunication         Call record analysis
Transport                 Logistics management
Consumer goods            Promotion analysis
Data service providers    Value-added data
Utilities                 Power usage analysis

Why a Separate Data Warehouse?
Performance
Operational databases are designed and tuned for known transactions and workloads.
Complex OLAP queries would degrade performance for operational transactions.
Multidimensional views and queries need special data organization, access, and implementation methods.

Function
Missing data: decision support requires historical data, which operational databases do not typically maintain.
Data consolidation: decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: operational databases and external sources.
Data quality: different sources typically use inconsistent data representations, codes, and formats, which have to be reconciled.

Data Warehouse: Defined
A data warehouse is a collection of corporate information, derived directly from operational systems and some external data sources. Its specific purpose is to support business decisions, not business operations.

Data Warehouse
A data warehouse is a
subject-oriented
integrated
time-variant
non-volatile
collection of data in support of management's decision making process.
-- Bill Inmon, Father of the Data Warehouse

Subject Oriented
Data is integrated and loaded by subject
[Figure: mainframe, Oracle, flat file, and legacy DB sources behind OLTP applications (O/P, A/R) are loaded into the D/W by subject (Cust, Prod), with data for 1996, 1997, and 1998.]

Time Variant
Operational System            Data Warehouse
View of the business today    Designated time frame (3 - 10 years)
Operational time frame        One snapshot per cycle
Key need not have date        Key includes date

Integration
Data is integrated in terms of:
encoding structures
measurement of attributes
physical attributes of data
naming conventions
data type formats
remarks

Non-Volatile
Operational System: CRUD actions (create/insert, read, update/replace, delete)
Data Warehouse: no data update; data is loaded and then only read

Characteristics of a DW
Subject-oriented Data
collects all data for a subject, from different sources
Read-only Requests
loaded during off-hours, read-only during day hours
Interactive Features, ad-hoc query
flexible design to handle spontaneous user queries
Pre-aggregated data
to improve runtime performance
Highly denormalized data structures
fat tables with redundant columns

Data Warehousing Architecture
[Figure: data sources (operational DBs, external sources) pass through extract, transform, load, and refresh to produce reconciled data in the warehouse, which has a metadata repository and monitoring & administration; the warehouse and data marts serve OLAP servers, which feed the tools: analysis, query/reporting, and data mining.]

DW Layered Architecture


What are Operational Systems?
They are OLTP systems
Run mission critical applications
Need to work with stringent performance requirements for routine tasks
Used to run a business!

Operational Systems
Run the business in real time
Based on up-to-the-second data
Optimized to handle large numbers of simple read/write transactions
Optimized for fast response to predefined transactions
Used by people who deal with customers and products: clerks, salespeople, etc.
They are increasingly used by customers

OLTP vs Data Warehouse
OLTP                     Warehouse (DSS)
Application oriented     Subject oriented
Used to run business     Used to analyze business
Detailed data            Summarized and refined
Current, up to date      Snapshot data
Isolated data            Integrated data
Repetitive access        Ad-hoc access
Clerical user            Knowledge user (manager)

OLTP vs Data Warehouse
OLTP                                     Data Warehouse
Performance sensitive                    Performance relaxed
Few records accessed at a time (tens)    Large volumes accessed at a time (millions)
Read/update access                       Mostly read (batch update)
No data redundancy                       Redundancy present
Database size 100 MB - 100 GB            Database size 100 GB - few terabytes

OLTP vs Data Warehouse
OLTP                                                    Data Warehouse
Transaction throughput is the performance metric        Query throughput is the performance metric
Thousands of users                                      Hundreds of users

To summarize ...
OLTP systems are used to run a business
The data warehouse helps to optimize the business

ETL?
Extraction, Transformation & Loading

Why ETL?
Data Integrity Problems
Same person, different spellings
Agarwal, Agrawal, Aggarwal etc...
Multiple ways to denote company name
Persistent Systems, PSPL, Persistent Pvt. Ltd.
Use of different names
Mumbai, Bombay
Different account numbers generated by different applications for the same customer
Required fields left blank
Invalid product codes collected at point of sale
manual entry leads to mistakes
in case of a problem use 9999999
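A minimal sketch of how a transformation step might catch some of these problems. The alias tables, the field names, and the `clean_record` helper are hypothetical; a real project would keep such mappings as maintained reference data.

```python
# Hypothetical alias tables mapping known variants to one standard value.
CITY_ALIASES = {"bombay": "Mumbai", "mumbai": "Mumbai"}
COMPANY_ALIASES = {
    "persistent systems": "Persistent Systems",
    "pspl": "Persistent Systems",
    "persistent pvt. ltd.": "Persistent Systems",
}
PLACEHOLDER_PRODUCT_CODE = "9999999"  # the "in case of a problem" code
REQUIRED_FIELDS = ("customer", "city", "product_code")


def clean_record(record):
    """Return (cleaned_record, problems) for one source row given as a dict."""
    cleaned = dict(record)
    problems = []
    # Required fields left blank.
    for field in REQUIRED_FIELDS:
        if not str(cleaned.get(field) or "").strip():
            problems.append(f"missing {field}")
    # Different names for the same city or company.
    city = str(cleaned.get("city") or "").strip().lower()
    cleaned["city"] = CITY_ALIASES.get(city, cleaned.get("city"))
    company = str(cleaned.get("company") or "").strip().lower()
    cleaned["company"] = COMPANY_ALIASES.get(company, cleaned.get("company"))
    # Placeholder product codes entered at the point of sale.
    if cleaned.get("product_code") == PLACEHOLDER_PRODUCT_CODE:
        problems.append("placeholder product code")
    return cleaned, problems


cleaned, problems = clean_record(
    {"customer": "Agarwal", "city": "bombay", "company": "PSPL",
     "product_code": "9999999"}
)
```

Rows with a non-empty `problems` list can be routed to a reject file for review instead of being loaded.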

Introduction
Extraction, Transformation, Validation, Load
[Figure: Source Systems 1, 2, and 3 pass through E/T/V/L into the staging area, and through E/T/V/L again into the data warehouse.]

Extraction
Source Systems (Multiple Source Systems)
Flat files, Excel, Legacy Systems, RDBMS etc.

Frequency of Extraction
Staging Area (If any? How many?)
Most Transformations from Source to Staging
Cleansing and Data Quality
Data integrity, de-duplication, completeness, correctness

Transformation
Usage of tools
Reusability of Transformations
Reusability of Mappings

Different tools

Informatica
Warehouse Builder
ETI
Sagent
PL/SQL scripts

Loading
Loading Frequency
Optimized Loading
Indexing
Partitioning

Aggregation
Sum
Average
Max

Update Strategy
Error Handling

Data Transformation Terms

Data Cleaning
Data Conditioning
Data Scrubbing
Data Merging
Data Aggregation


Data Transformation Terms


Data Cleaning
It is process of the cl
Sources for data are generally legacy mainframe systems in VSAM, IMS, IDMS, DB2; more data today is in relational databases on Unix

Conditioning
The conversion of data types from the source to the target data store (warehouse), which is typically a relational database
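As a rough illustration of conditioning, the sketch below converts a source row that arrives as text into typed values for the target store. The column names, the `DD/MM/YYYY` source date format, and the `condition_row` helper are assumptions made for the example.

```python
from datetime import datetime
from decimal import Decimal

# Hypothetical target column types for one warehouse table.
TARGET_TYPES = {
    "order_id": int,
    "amount": Decimal,
    # Assumes the source system writes dates as DD/MM/YYYY.
    "order_date": lambda text: datetime.strptime(text, "%d/%m/%Y").date(),
}


def condition_row(source_row):
    """Convert each source string to the type of its target column."""
    return {
        column: TARGET_TYPES[column](value.strip())
        for column, value in source_row.items()
    }


row = condition_row(
    {"order_id": "0042", "amount": "199.50", "order_date": "31/03/2009"}
)
```

Using `Decimal` rather than `float` for the amount avoids rounding surprises in monetary columns.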


Load Types
Ongoing Data Load or Incremental
Loading
Bulk Load (One time Load) for History
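The difference between the two load types can be sketched as below. The list standing in for the warehouse table and the use of an increasing `id` as the "last loaded" marker are assumptions for illustration; real sources often use a timestamp or change log instead.

```python
def bulk_load(warehouse, history_rows):
    """One-time load: replace the warehouse contents with the full history."""
    warehouse.clear()
    warehouse.extend(history_rows)


def incremental_load(warehouse, source_rows, last_loaded_id):
    """Ongoing load: append only rows newer than the last loaded id."""
    new_rows = [row for row in source_rows if row["id"] > last_loaded_id]
    warehouse.extend(new_rows)
    # Return the new high-water mark for the next run.
    return max((row["id"] for row in new_rows), default=last_loaded_id)


warehouse = []
bulk_load(warehouse, [{"id": 1}, {"id": 2}])
watermark = incremental_load(warehouse, [{"id": 2}, {"id": 3}], last_loaded_id=2)
```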


Data Extraction and Cleansing
Extract data from existing operational and legacy data
Issues:
Sources of data for the warehouse
Data quality at the sources
Merging different data sources
Data Transformation
How to propagate updates (on the sources) to the warehouse
Terabytes of data to be loaded

Scrubbing Data
Uses sophisticated transformation tools.
Used to clean data and improve its quality.
Clean data is vital for the success of the warehouse.
Example
Seshadri, Sheshadri, Sesadri, Seshadri S., Srinivasan Seshadri, etc. are the same person
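A small sketch of one scrubbing technique: grouping spelling variants by string similarity with the standard library's `difflib`. The 0.85 threshold and the helper names are assumptions. This simple approach only catches close spellings; variants with initials or first names, such as "Seshadri S." or "Srinivasan Seshadri", need additional rules.

```python
from difflib import SequenceMatcher


def same_person(name_a, name_b, threshold=0.85):
    """Treat two names as the same person if their spellings are close."""
    ratio = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
    return ratio >= threshold


def group_variants(names, threshold=0.85):
    """Put each name into the first group whose first member it resembles."""
    groups = []
    for name in names:
        for group in groups:
            if same_person(name, group[0], threshold):
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


groups = group_variants(["Seshadri", "Sheshadri", "Sesadri", "Agarwal"])
```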

STAGING AREA - Some Clarity

Staging Area
Optional
Used to cleanse the source data
Accepts data from different sources
A data model is required at the staging area
Multiple data models may be required for parking different sources and for transformed data to be pushed out to the warehouse

Types of Data Warehouse
Enterprise Data Warehouse
Data Mart
[Figure: an enterprise data warehouse feeding three data marts.]

Enterprise data warehouse
Contains data drawn from multiple operational systems
Supports time-series and trend analysis across different business areas
Can be used as a transient storage area to clean all data and ensure consistency
Can be used to populate data marts
Can be used for everyday and strategic decision making

What is a Data Mart?
A data mart is a subset of a data warehouse that is designed for a particular line of business, such as sales, marketing, or finance.
In a dependent data mart, data is derived from an enterprise-wide data warehouse. In an independent data mart, data is collected directly from sources.

Data Warehouse vs. Data Marts
What comes first?

Data Mart
Logical subset of an enterprise data warehouse
Organized around a single business process
Based on granular data
May or may not contain aggregates
Object of analytical processing by the end user
Less expensive and much smaller than a full-blown corporate data warehouse

Physical Data Warehouse: Data Warehouse --> Data Marts
[Figure: source data (operational data, external data) flows through the staging area into the data warehouse, which then feeds the data marts.]

Physical Data Warehouse: Data Marts --> Data Warehouse
[Figure: source data (operational data, external data) flows through the staging area into the data marts, which then feed the data warehouse.]

Physical Data Warehouse: Parallel Data Warehouse & Data Marts
[Figure: source data (operational data, external data) flows through the staging area into the data warehouse and the data marts in parallel.]

DW Implementation Approaches

Top Down
Bottom Up
Combination of both
Choices depend on:
current infrastructure
resources
architecture
ROI
Implementation speed

Top Down Implementation


Bottom Up Implementation


DW Implementation Approaches
Top Down
More planning and design initially
Involve people from different work-groups, departments
Data marts may be built later from the global DW
Overall data model to be decided up-front

Bottom Up
Can plan initially without waiting for global infrastructure
Built incrementally
Can be built before or in parallel with the global DW
Less complexity in design

DW Implementation Approaches
Top Down
Consistent data definition and enforcement of business rules across the enterprise
High cost, lengthy process, time consuming
Works well when there is a centralized IS department responsible for all H/W and resources

Bottom Up
Data redundancy and inconsistency between data marts may occur
Integration requires great planning
Less cost of H/W and other resources
Faster pay-back

DW Implementation Approaches
Combined Approach
Determine the degree of planning and design for a global approach to integrate data marts being built by the bottom-up approach
Develop a base level infrastructure definition for the global DW at business level
Develop a plan to handle data elements needed by multiple data marts
Build a common data store to be used by data marts and the global DW

Dimensional modeling
Must identify
Business process to be supported
Grain (level of detail)
Dimensions
Facts


Conventions used in Dimensional modeling
Facts/Measures (KPIs)
Dimensions
Dimension hierarchies
Dimension levels
Dimension level members

Facts
A fact is a collection of related data items, consisting of measures and context data.
Each fact typically represents a business item, a business transaction, or an event that can be used in analyzing the business or business process.
Facts are measured, continuously valued, rapidly changing information.
Can be calculated and/or derived.

Basic concept of Fact Table
The central table in a star schema is called the fact table. A fact table typically has two types of columns: those that contain facts and those that are foreign keys to dimension tables. The primary key of a fact table is usually a composite key that is made up of all of its foreign keys.
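A small sketch of such a fact table, using Python's built-in `sqlite3` module. The table names, columns, and sample rows are hypothetical; the point is the two kinds of columns (foreign keys and facts) and the composite primary key.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE dim_date    (date_id    INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);

-- Fact table: foreign keys to each dimension plus the numeric facts.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    store_id   INTEGER REFERENCES dim_store(store_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    quantity   INTEGER,
    revenue    REAL,
    PRIMARY KEY (product_id, store_id, date_id)  -- composite of the foreign keys
);

INSERT INTO dim_product VALUES (1, 'Cake'), (2, 'Cookie');
INSERT INTO dim_store   VALUES (1, 'Mumbai'), (2, 'Delhi');
INSERT INTO dim_date    VALUES (1, 2009, 'Q1');
INSERT INTO fact_sales  VALUES (1, 1, 1, 10, 500.0),
                               (2, 1, 1, 20, 200.0),
                               (1, 2, 1, 5, 250.0);
""")

# Facts are aggregated; dimensions supply the context (here: city).
revenue_by_city = db.execute("""
    SELECT s.city, SUM(f.revenue)
    FROM fact_sales f JOIN dim_store s ON s.store_id = f.store_id
    GROUP BY s.city ORDER BY s.city
""").fetchall()
```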

So? A fact table is...
A table that is used to store business information (measures) that can be used in mathematical equations with respect to the dimension codes
Quantities
Percentages
Prices

Dimensions
A dimension is a collection of members or units of the same type of views.
Dimensions determine the contextual background for the facts.
Dimensions represent the way business people talk about the data resulting from a business process, e.g., who, what, when, where, why, how

Dimension with respect to Fact
Table used to store qualitative data about fact records
Who
What
When
Where
Why

Dimension Table
A dimension table describes the business entities of an enterprise, represented as hierarchical, categorical information such as time, departments, locations, and products.

So? Dimensions are...
a collection of members or units of the same type of views
the contextual background for the facts
the parameters over which we want to perform OLAP (e.g. Time, Location/Region, Customers)

A member is a distinct name that determines a data item's position (e.g. Time - Month, Quarter)
A hierarchy arranges members into hierarchies or levels

Hierarchies
A logical structure that uses ordered levels as a means of organizing data. A hierarchy can be used to define data aggregation; for example, in a time dimension, a hierarchy might be used to aggregate data from the Month level to the Quarter level, and from the Quarter level to the Year level.
A hierarchy can also be used to define a navigational drill path, regardless of whether the levels in the hierarchy represent aggregated totals or not
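The Month to Quarter to Year aggregation can be sketched as follows. The sales figures and helper names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical monthly sales keyed by (year, month).
monthly_sales = {
    (2009, 1): 100, (2009, 2): 120, (2009, 3): 90,
    (2009, 4): 200, (2010, 1): 50,
}


def month_to_quarter(key):
    """Map a (year, month) member to its parent (year, quarter)."""
    year, month = key
    return (year, (month - 1) // 3 + 1)


def quarter_to_year(key):
    """Map a (year, quarter) member to its parent year."""
    year, _quarter = key
    return year


def roll_up(values, to_parent):
    """Aggregate values one level up the hierarchy by summing per parent."""
    totals = defaultdict(int)
    for key, value in values.items():
        totals[to_parent(key)] += value
    return dict(totals)


quarterly_sales = roll_up(monthly_sales, month_to_quarter)
yearly_sales = roll_up(quarterly_sales, quarter_to_year)
```

Drilling down is the reverse navigation: from a year total back to its quarters, and from a quarter to its months.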

Hierarchies
Allow for the rollup of data to more summarized levels.
Time: day, month, quarter, year

Hierarchies


Level
A position in a hierarchy. For example, a time dimension might have a hierarchy that represents data at the Month, Quarter, and Year levels.

Measures
A measure is a numeric attribute of a fact, representing the performance or behaviour of the business relative to dimensions.
The actual numbers are called variables.
e.g. sales in money, sales volume, quantity supplied, supply cost, transaction amount

A measure is determined by combinations of the members of the dimensions and is located on facts.

What is Star Schema?
A star schema is a relational database schema for representing multidimensional data. It is the simplest form of data warehouse schema and contains one or more fact tables and dimension tables. It is called a star schema because the entity-relationship diagram resembles a star, with one fact table connected to multiple dimensions. The center of the star is a large fact table whose foreign keys point to the dimension tables. The advantages of a star schema are easy slicing of data, better query performance, and data that is easy to understand.

Common structures for Data Marts: Denormalize!
Star
Single fact table surrounded by denormalized dimension tables
The fact table primary key is the composite of the foreign keys (primary keys of dimension tables)
Fact table contains transaction type information.
Many star schemas in a data mart
Easily understood by end users, more disk storage required

Example of Star Schema


Snowflake Schema
A snowflake schema is a star schema structure normalized through the use of outrigger tables, i.e. dimension table hierarchies are broken into simpler tables.
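A minimal sketch of snowflaking one dimension: the category hierarchy that is repeated in every row of a denormalized product dimension is moved into a separate outrigger table. The rows, column names, and the `snowflake` helper are hypothetical.

```python
# A denormalized (star) product dimension: the category hierarchy repeats per row.
star_product_dim = [
    {"product_id": 1, "product": "Cake", "category": "Bakery", "department": "Food"},
    {"product_id": 2, "product": "Cookie", "category": "Bakery", "department": "Food"},
    {"product_id": 3, "product": "Soap", "category": "Toiletries", "department": "Household"},
]


def snowflake(rows):
    """Split the category hierarchy out of the dimension into an outrigger table."""
    categories = {}  # category name -> outrigger row
    products = []
    for row in rows:
        name = row["category"]
        if name not in categories:
            categories[name] = {
                "category_id": len(categories) + 1,
                "category": name,
                "department": row["department"],
            }
        products.append({
            "product_id": row["product_id"],
            "product": row["product"],
            "category_id": categories[name]["category_id"],
        })
    return products, list(categories.values())


product_dim, category_outrigger = snowflake(star_product_dim)
```

The product dimension now stores only a `category_id`, at the cost of an extra join whenever a query needs the category or department.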


Common structures for Data Marts: Denormalize!
Snowflake
Single fact table surrounded by normalized dimension tables
Normalizes dimension tables to save data storage space.
Used when dimensions become very, very large
Less intuitive, slower performance due to joins

May want to use both approaches, especially if supporting multiple end-user tools.

Example of Snowflake Schema

Snowflake - Disadvantages
Normalization of dimensions makes it difficult for users to understand
Decreases query performance because it involves more joins
Dimension tables are normally smaller than fact tables, so space may not be a major enough issue to warrant snowflaking

Keys
Primary Keys
uniquely identify a record

Foreign Keys
primary key of another table referred here

Surrogate Keys
system-generated key for dimensions
key on its own has no meaning
integer key, less space
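A rough sketch of surrogate key assignment: each natural (source system) key gets a meaningless integer the first time it is seen, and the same integer on every later lookup. The class name and the sample customer codes are invented for the example; warehouses commonly use a database sequence or identity column for this instead.

```python
from itertools import count


class SurrogateKeyGenerator:
    """Hand out compact integer keys for natural keys from source systems."""

    def __init__(self):
        self._next_key = count(1)
        self._by_natural_key = {}

    def key_for(self, natural_key):
        # Reuse the existing surrogate key, or generate the next integer.
        if natural_key not in self._by_natural_key:
            self._by_natural_key[natural_key] = next(self._next_key)
        return self._by_natural_key[natural_key]


customer_keys = SurrogateKeyGenerator()
first = customer_keys.key_for("CUST-MUM-0017")
second = customer_keys.key_for("CUST-DEL-0003")
again = customer_keys.key_for("CUST-MUM-0017")
```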

Star Schema & Snowflake Schema
In a star schema every dimension will have a primary key.
In a star schema, a dimension table will not have any parent table, whereas in a snowflake schema, a dimension table will have one or more parent tables.
In a star schema, hierarchies for the dimensions are stored in the dimension table itself, whereas in a snowflake schema, hierarchies are broken into separate tables. These hierarchies help to drill down the data from the topmost level to the lowermost level.

Fact Constellation
Multiple fact tables that share many dimension tables
Booking and Checkout may share many dimension tables in the hotel industry
[Figure: the Booking and Checkout fact tables with the Hotels, Promotion, Travel Agents, Room Type, and Customer dimension tables.]

Basic Dimensional Modeling Techniques
Slowly Changing Dimensions
Conformed Dimensions
Degenerate Dimensions
Junk Dimensions

Slowly Changing Dimension
Dimensions that change over time are called Slowly Changing Dimensions.
For instance, a product price changes over time;
people change their names for some reason;
country and state names may change over time.

SCD - Types
Type 1 - Overwriting the existing values
Type 2 - Maintaining the history of changed values
Type 3 - Partial history maintenance
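The three types can be sketched on a product price change. The column names (`valid_from`, `valid_to`, `current`, `previous_price`) are assumptions for illustration, and the Type 2 sketch leaves out the new surrogate key that each version row would normally receive.

```python
from datetime import date


def scd_type1(row, new_price):
    """Type 1: overwrite the value; no history is kept."""
    return {**row, "price": new_price}


def scd_type2(history, new_price, change_date):
    """Type 2: close the current row and append a new current row."""
    latest = next(r for r in history if r["current"])
    closed = [
        {**r, "valid_to": change_date, "current": False} if r["current"] else r
        for r in history
    ]
    new_row = {**latest, "price": new_price, "valid_from": change_date,
               "valid_to": None, "current": True}
    return closed + [new_row]


def scd_type3(row, new_price):
    """Type 3: keep only the previous value in an extra column."""
    return {**row, "previous_price": row["price"], "price": new_price}


product = {"product_id": 1, "price": 10.0}
versions = [{"product_id": 1, "price": 10.0, "valid_from": date(2009, 1, 1),
             "valid_to": None, "current": True}]
versions = scd_type2(versions, 12.0, date(2009, 6, 1))
```

Type 2 keeps the full history at the cost of extra rows; Type 3 keeps exactly one earlier value per tracked column.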

Conformed Dimension
A conformed dimension is a set of data attributes that have been physically implemented in multiple data marts, star schemas, or snowflake schemas using the same structure and attributes.

Junk Dimension
Create special dimensions to hold miscellaneous attributes found in the source database
Scenario:
Occasionally, there are miscellaneous attributes, such as yes/no attributes or comment attributes, that don't fit into tight star schemas. Rather than discarding flag fields and yes/no attributes, place them in a junk dimension. In addition, you can handle comment and open-ended text attributes by creating a text-based junk dimension
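One way to picture a junk dimension for yes/no flags is a small table with one row per combination of flag values, so that the fact table carries a single key instead of several flag columns. The flag names and the `junk_key_for` helper below are hypothetical.

```python
from itertools import product

# Hypothetical yes/no attributes that do not belong to any regular dimension.
FLAGS = ("is_gift", "is_rush_order", "paid_online")

# One junk-dimension row per combination of flag values (2 ** 3 = 8 rows).
junk_dimension = {
    combo: junk_key
    for junk_key, combo in enumerate(
        product((False, True), repeat=len(FLAGS)), start=1
    )
}


def junk_key_for(order):
    """Look up the junk-dimension key for an order's flag values."""
    return junk_dimension[tuple(bool(order[flag]) for flag in FLAGS)]


key = junk_key_for({"is_gift": True, "is_rush_order": False, "paid_online": True})
```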

Degenerate Dimension
A degenerate dimension is data that is dimensional in nature but stored in a fact table.
Scenario:
If you have a dimension that only has Order Number and Order Line Number, you would have a 1:1 relationship with the fact table. Do you want two tables with a billion rows each, or one table with a billion rows? Therefore, this would be a degenerate dimension, and Order Number and Order Line Number would be stored in the fact table

Thank You
