Concepts & Architecture
By Monstercourses.com
Topics
Data Warehousing & Architecture
Data Mart
ETL
OLTP
DSS
Database Design
Star Schema
Snowflake Schema
Useful information
Pitfalls
Summary
A producer wants to know:
What is the total revenue for the year 2009?
What is the most effective distribution channel?
Who are my customers and what products are they buying?
What product promotions have the biggest impact on revenue?
What impact will new products/services have on revenue and margins?
Which customers are most likely to go to the competition?
Scenario 1
ABC Pvt Ltd is a company with branches at Mumbai, Delhi,
Chennai and Bangalore. The Sales Manager wants a quarterly
sales report. Each branch has a separate operational system.
[Figure: branch operational systems (Delhi, Chennai, Bangalore), query & analysis tools, and the Sales Manager]
Scenario 2
One Stop Shopping Super Market has a huge operational database.
Whenever executives want a report, the OLTP system becomes slow
and data entry operators have to wait for some time.
[Figure: management reporting against the operational database]
Solution 2
Extract data needed for analysis from
operational database.
Store it in warehouse.
Refresh the warehouse at regular intervals so that
it contains up-to-date information for analysis.
Warehouse will contain data with historical
perspective.
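The extract, store, and refresh steps above can be sketched roughly as follows. This is only an illustration: the table names `sales` and `sales_history`, the columns, and the idea of tracking the highest loaded `sale_id` are assumptions made for the example, not part of the original slides.

```python
import sqlite3


def refresh_warehouse(operational, warehouse):
    """Copy sales rows that are not yet in the warehouse; return how many were copied."""
    # Assumption: sale_id only grows, so the highest loaded id marks the refresh point.
    last_id = warehouse.execute(
        "SELECT COALESCE(MAX(sale_id), 0) FROM sales_history"
    ).fetchone()[0]
    new_rows = operational.execute(
        "SELECT sale_id, sale_date, amount FROM sales WHERE sale_id > ?",
        (last_id,),
    ).fetchall()
    warehouse.executemany(
        "INSERT INTO sales_history (sale_id, sale_date, amount) VALUES (?, ?, ?)",
        new_rows,
    )
    warehouse.commit()
    return len(new_rows)


# Hypothetical operational database and warehouse, both held in memory.
operational = sqlite3.connect(":memory:")
operational.execute(
    "CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, sale_date TEXT, amount REAL)"
)
operational.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, "2009-01-05", 120.0), (2, "2009-01-06", 80.0)],
)

warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE sales_history (sale_id INTEGER PRIMARY KEY, sale_date TEXT, amount REAL)"
)

first_refresh = refresh_warehouse(operational, warehouse)   # initial load
operational.execute("INSERT INTO sales VALUES (3, '2009-01-07', 50.0)")
second_refresh = refresh_warehouse(operational, warehouse)  # picks up only the new row
```

Reports then run against `sales_history`, so analysis queries do not compete with data entry on the operational database.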
[Figure: Solution 2. Data entry operators run transactions against the operational database; data is extracted into the data warehouse, which serves reports to the manager]
Scenario 3
Cakes & Cookies is a small, new
company. The president wants the
company to grow. He needs
information so that he can
make correct decisions.
Solution 3
Improve the quality of data before
loading it into the warehouse.
Perform data cleaning and
transformation before loading the data.
Use query and analysis tools to support
ad hoc queries.
[Figure: Solution 3. The President uses a query and analysis tool on the data warehouse to follow sales, expansion, and improvement over time]
Application Areas
Industry                  Application
Finance                   Credit card analysis
Insurance                 Claims, fraud analysis
Telecommunication         Call record analysis
Transport                 Logistics management
Consumer goods            Promotion analysis
Data service providers    Value-added data
Utilities                 Power usage analysis
Data Warehouse
A data warehouse is a subject-oriented, integrated, time-varying,
non-volatile collection of data in support of management decisions.
Subject Oriented
[Figure: source systems (mainframe, Oracle OLTP, flat files, legacy DB) feeding the data warehouse]
Time Variant
Operational System
Data Warehouse
Integration
Data is made consistent in terms of:
encoding structures
measurement of attributes
physical attributes of data
remarks
naming conventions
data type formats
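As a rough illustration of integration, the sketch below maps differently encoded source records onto one target convention. The gender codes, the length units, and the field names are hypothetical examples chosen for this sketch; the slides do not specify any particular encoding.

```python
# Hypothetical source encodings mapped onto a single warehouse convention.
GENDER_CODES = {"m": "M", "male": "M", "1": "M", "f": "F", "female": "F", "0": "F"}
CM_PER_UNIT = {"cm": 1.0, "m": 100.0, "inch": 2.54}


def integrate_record(record):
    """Return the record with one encoding for gender and one unit (cm) for length."""
    gender = GENDER_CODES[str(record["gender"]).strip().lower()]
    length_cm = record["length"] * CM_PER_UNIT[record["unit"]]
    return {"gender": gender, "length_cm": round(length_cm, 2)}


# Two sources describing the same kind of attribute in different ways.
source_a = {"gender": "male", "length": 2, "unit": "m"}
source_b = {"gender": 0, "length": 10, "unit": "inch"}

integrated = [integrate_record(source_a), integrate_record(source_b)]
```

After this step both records use the same encoding structure and the same unit of measurement, whichever source they came from.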
Non-Volatile
Operational System: CRUD actions (create/insert, read, update/replace, delete)
Data Warehouse: no data update (load, read)
Characteristics of a DW
Subject-oriented Data
collects all data for a subject, from different sources
Read-only Requests
loaded during off-hours, read-only during day hours
Interactive Features, ad-hoc query
flexible design to handle spontaneous user queries
Pre-aggregated data
to improve runtime performance
Highly denormalized data structures
fat tables with redundant columns
Data Warehousing
Architecture
[Figure: data sources (operational DBs, external sources) are extracted, transformed, loaded, and refreshed into the reconciled data warehouse and data marts, supported by a metadata repository and monitoring & administration; OLAP servers serve the analysis, query/reporting, and data mining tools]
DW Layered Architecture
Operational Systems
Run the business in real time
Based on up-to-the-second data
Optimized to handle large numbers
of simple read/write transactions
Optimized for fast response to
predefined transactions
Used by people who deal with
customers, products -- clerks,
salespeople etc.
They are increasingly used by
customers
OLTP
Detailed data
Current, up-to-date data
Isolated data
Repetitive access
Clerical user
Warehouse (DSS)
Subject oriented
Used to analyze the business
Summarized and refined data
Snapshot data
Integrated data
Ad hoc access
Knowledge user (manager)
OLTP
Read/update access
No data redundancy
Database size: 100 MB - 100 GB
Data Warehouse
Performance relaxed
Large volumes accessed at a time (millions)
Mostly read (batch update)
Redundancy present
Database size: 100 GB - a few terabytes
Data Warehouse
Query throughput is the performance metric
Hundreds of users
To summarize ...
OLTP systems are used to run a business.
ETL?
Why ETL?
Data Integrity Problems
Same person, different spellings
Agarwal, Agrawal, Aggarwal etc...
Multiple ways to denote company name
Persistent Systems, PSPL, Persistent Pvt. LTD.
Use of different names
Mumbai, Bombay
Different account numbers generated by different
applications for the same customer
Required fields left blank
Invalid product codes collected at point of sale
manual entry leads to mistakes
in case of a problem use 9999999
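A few of the problems listed above can be caught with simple checks before loading. The sketch below is illustrative only: the field names, the set of valid product codes, and the alias table are assumptions; only the `9999999` placeholder and the Mumbai/Bombay example come from the slide.

```python
CITY_ALIASES = {"bombay": "Mumbai", "mumbai": "Mumbai"}
VALID_PRODUCT_CODES = {"1000001", "1000002"}  # hypothetical product master
PLACEHOLDER_CODE = "9999999"                  # "in case of a problem" code from the slide
REQUIRED_FIELDS = ("customer", "city", "product_code")


def check_record(record):
    """Return a list of data integrity problems found in one source record."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not str(record.get(field) or "").strip():
            problems.append(f"missing {field}")
    code = record.get("product_code")
    if code == PLACEHOLDER_CODE:
        problems.append("placeholder product code")
    elif code and code not in VALID_PRODUCT_CODES:
        problems.append("invalid product code")
    return problems


def canonical_city(name):
    """Map known alternative city names onto one canonical spelling."""
    cleaned = name.strip()
    return CITY_ALIASES.get(cleaned.lower(), cleaned)


issues = check_record({"customer": "Agarwal", "city": "bombay", "product_code": "9999999"})
city = canonical_city(" bombay ")
```

Records with a non-empty problem list would typically be routed to an error table for review rather than loaded as they are.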
Introduction
Extraction, Transformation, Validation, Load
[Figure: Source Systems 1, 2, and 3 pass through E-T-V-L into the staging area, and the staging area passes through E-T-V-L into the data warehouse]
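The extraction, transformation, validation, and load sequence can be pictured as a chain of small steps. The following is a minimal sketch under assumed record shapes (`customer`, `amount`) and assumed validation rules; it covers only the pass from the source systems into the staging area.

```python
def extract(sources):
    """Extraction: read every row from every source system."""
    for source in sources:
        yield from source


def transform(rows):
    """Transformation: tidy names and convert amounts to numbers."""
    for row in rows:
        yield {"customer": row["customer"].strip().title(), "amount": float(row["amount"])}


def validate(rows, rejected):
    """Validation: pass good rows on, set bad rows aside for error handling."""
    for row in rows:
        if row["customer"] and row["amount"] >= 0:
            yield row
        else:
            rejected.append(row)


def load(rows, target):
    """Load: append the validated rows to the target store."""
    target.extend(rows)
    return target


# Hypothetical source systems represented as lists of raw records.
source_1 = [{"customer": " agarwal ", "amount": "120.50"}]
source_2 = [{"customer": "", "amount": "10"}, {"customer": "seshadri", "amount": "75"}]

rejected = []
staging_area = load(validate(transform(extract([source_1, source_2])), rejected), [])
```

The second pass shown on the slide, from the staging area into the data warehouse, would reuse the same four steps with warehouse-specific rules.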
Extraction
Source Systems (Multiple Source
Systems)
Flat files, Excel, Legacy Systems, RDBMS etc.
Frequency of Extraction
Staging Area (If any? How many?)
Most Transformations from Source to
Staging
Cleansing and Data Quality
Data integrity, De-duplication, completeness,
correctness
Transformation
Usage of tools
Reusability of Transformations
Reusability of Mappings
Different tools
Informatica
Warehouse Builder
ETI
Sagent
PL/SQL scripts
Loading
Loading Frequency
Optimized Loading
Indexing
Partitioning
Aggregation
Sum
Average
Max
Update Strategy
Error Handling
Data Cleaning
Data Conditioning
Data Scrubbing
Data Merging
Data Aggregation
Conditioning
The conversion of data types from the source to the
target data store (warehouse) -- always a relational
database
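Conditioning can be sketched as a per-column conversion table. The column names, the `DD/MM/YYYY` source date format, and the choice of target types below are assumptions made for illustration.

```python
from datetime import datetime
from decimal import Decimal

# Hypothetical mapping from source columns (all text) to target column types.
TARGET_TYPES = {
    "order_id": int,
    "order_date": lambda text: datetime.strptime(text, "%d/%m/%Y").date(),
    "amount": Decimal,
}


def condition(source_row):
    """Convert each source value to the data type of the target column."""
    return {column: convert(source_row[column]) for column, convert in TARGET_TYPES.items()}


conditioned = condition({"order_id": "42", "order_date": "05/01/2009", "amount": "199.90"})
```

A value that cannot be converted raises an exception here; a real load would normally catch it and send the row to error handling.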
Load Types
Ongoing Data Load or Incremental
Loading
Bulk Load (One time Load) for History
Scrubbing Data
Sophisticated transformation tools.
Used to improve the quality of data
Clean data is vital for the success of the warehouse
Example
Seshadri, Sheshadri, Sesadri, Seshadri S., Srinivasan Seshadri, etc. are the same person
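One simple way to approximate this kind of name scrubbing is fuzzy string matching. The sketch below uses Python's standard `difflib`; the 0.8 similarity threshold and the token-by-token comparison are assumptions for this example, and commercial scrubbing tools use far more elaborate rules.

```python
from difflib import SequenceMatcher


def same_person(name, canonical_surname, threshold=0.8):
    """Return True if any word in the name closely resembles the canonical surname."""
    target = canonical_surname.lower()
    for token in name.lower().split():
        token = token.strip(".,")  # drop punctuation left over from initials
        if token and SequenceMatcher(None, token, target).ratio() >= threshold:
            return True
    return False


variants = ["Seshadri", "Sheshadri", "Sesadri", "Seshadri S.", "Srinivasan Seshadri"]
matches = [same_person(variant, "Seshadri") for variant in variants]
```

Fuzzy matching can also merge different people with similar names, so matches are usually confirmed against another attribute, such as an address or account number, before records are merged.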
Staging Area
optional
to cleanse the source data
Accepts data from different sources
Data model is required at staging area
Multiple data models may be required for
parking different sources and for
transformed data to be pushed out to
warehouse
Data Mart
Logical subset of enterprise data
warehouse
Organized around a single business
process
Based on granular data
May or may not contain aggregates
Object of analytical processing by the
end user.
Less expensive and much smaller than
a full-blown corporate data warehouse.
[Figures: three arrangements of source data (operational and external data), staging area, data warehouse, and data marts]
DW Implementation
Approaches
Top Down
Bottom-up
Combination of both
Choices depend on:
current infrastructure
resources
architecture
ROI
Implementation speed
Bottom Up Implementation
DW Implementation
Approaches
Top Down
More planning and
design initially
Involve people from
different work-groups,
departments
Data marts may be built
later from Global DW
Overall data model to
be decided up-front
Bottom Up
Can plan initially without
waiting for global
infrastructure
built incrementally
can be built before or in
parallel with Global DW
Less complexity in
design
DW Implementation
Approaches
Bottom Up
Top Down
Consistent data definition
and enforcement of
business rules across
enterprise
High cost, lengthy
process, time consuming
Works well when there is
centralized IS department
responsible for all H/W
and resources
DW Implementation
Approaches
Combined Approach
Determine degree of planning and design for
a global approach to integrate data marts
being built by bottom-up approach
Develop base level infrastructure definition for
global DW at business level
Develop plan to handle data elements
needed by multiple data marts
Build a common data store to be used by
data marts and
global DW
Dimensional modeling
Must identify
Business process to be supported
Grain (level of detail)
Dimensions
Facts
Conventions used in
Dimensional modeling
Facts/Measures(KPIs)
Dimensions
Dimension hierarchies
Dimension Levels
Dimension Level members
Facts
A fact is a collection of related data
items, consisting of measures and
context data.
Each fact typically represents a
business item, a business transaction,
or an event that can be used in
analyzing the business or business
process.
Facts are measured, continuously
valued, rapidly changing information.
Can be calculated
and/or derived.
So?
A fact table stores business information (measures) that can be
used in mathematical calculations with respect to the
dimension codes, for example:
Quantities
Percentages
Prices
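To make the idea concrete, here is a small sketch of a fact table whose measures are calculated with respect to dimension keys. The table names, columns, and sample rows are invented for illustration; SQLite is used only because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE dim_time (time_key INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);
-- The fact table holds dimension keys plus numeric measures.
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    time_key INTEGER REFERENCES dim_time(time_key),
    quantity INTEGER,
    unit_price REAL
);
INSERT INTO dim_product VALUES (1, 'Cake'), (2, 'Cookie');
INSERT INTO dim_time VALUES (1, 2009, 1), (2, 2009, 2);
INSERT INTO fact_sales VALUES (1, 1, 10, 5.0), (2, 1, 20, 1.5), (1, 2, 4, 5.0);
""")

# Measures (quantity, unit_price) are combined and summed per dimension member.
revenue_by_product = conn.execute("""
    SELECT p.product_name, SUM(f.quantity * f.unit_price) AS revenue
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    GROUP BY p.product_name
    ORDER BY p.product_name
""").fetchall()
```

The same fact rows could be grouped by `dim_time` instead, which is what makes the dimension keys the context for every calculation.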
Dimensions
A dimension is a collection of members
or units of the same type of views.
Dimensions determine the contextual
background for the facts.
Dimensions represent the way business
people talk about the data resulting
from a business process, e.g., who,
what, when, where, why, how
Dimension Table
A dimension table is one that describes the
business entities of an enterprise,
represented as hierarchical, categorical
information such as time, departments,
locations, and products.
Hierarchies
A logical structure that uses ordered levels as a
means of organizing data. A hierarchy can be
used to define data aggregation; for example,
in a time dimension, a hierarchy might be
used to aggregate data from the Month level
to the Quarter level, from the Quarter level to
the Year level.
A hierarchy can also be used to define a
navigational drill path, regardless of whether
the levels in the hierarchy represent
aggregated totals or not
Hierarchies
Allow for the rollup of data to more
summarized levels.
Time
day
month
quarter
year
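Rolling data up along the time hierarchy can be sketched as grouping daily values by a coarser key. The sample figures and the function name `roll_up` are made up for this example.

```python
from collections import defaultdict
from datetime import date


def roll_up(daily_sales, level):
    """Aggregate (day, amount) pairs to the month, quarter, or year level."""
    def key_for(day):
        if level == "month":
            return (day.year, day.month)
        if level == "quarter":
            return (day.year, (day.month - 1) // 3 + 1)
        if level == "year":
            return (day.year,)
        raise ValueError(f"unknown level: {level}")

    totals = defaultdict(float)
    for day, amount in daily_sales:
        totals[key_for(day)] += amount
    return dict(totals)


# Hypothetical day-level sales.
daily_sales = [
    (date(2009, 1, 15), 100.0),
    (date(2009, 2, 10), 50.0),
    (date(2009, 4, 1), 25.0),
]

by_quarter = roll_up(daily_sales, "quarter")
by_year = roll_up(daily_sales, "year")
```

Each level is derived from the same day-level data, so drilling down from year to quarter to month is just the reverse walk along the hierarchy.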
Level
A position in a hierarchy. For example, a time
dimension might have a hierarchy that
represents data at the Month, Quarter, and
Year levels.
Measures
A measure is a numeric attribute of a
fact, representing the performance or
behaviour of the business relative to
dimensions.
The actual numbers are called variables.
e.g. sales in money, sales volume, quantity supplied,
supply cost, transaction amount
A measure is determined by
combinations of the members of the
dimensions and is located on facts.
Snowflake - Disadvantages
Normalization of dimension makes it
difficult for user to understand
Decreases the query performance
because it involves more joins
Dimension tables are normally smaller
than fact tables - space may not be a
major issue to warrant snowflaking
Keys
Primary Keys
uniquely identify a record
Foreign Keys
primary key of another table referred here
Surrogate Keys
system-generated key for dimensions
key on its own has no meaning
integer key, less space
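A surrogate key generator can be as simple as a counter plus a lookup from the source key. The class name, the source system names, and the natural keys below are hypothetical.

```python
import itertools


class SurrogateKeyGenerator:
    """Hand out meaningless integer keys for dimension rows."""

    def __init__(self):
        self._next_key = itertools.count(1)
        self._keys = {}

    def key_for(self, source_system, natural_key):
        """Return the existing surrogate key for a source record, or assign a new one."""
        lookup = (source_system, natural_key)
        if lookup not in self._keys:
            self._keys[lookup] = next(self._next_key)
        return self._keys[lookup]


customer_keys = SurrogateKeyGenerator()
first = customer_keys.key_for("billing", "CUST-0042")
second = customer_keys.key_for("crm", "A-17")
again = customer_keys.key_for("billing", "CUST-0042")  # same record, same key
```

Because the key carries no business meaning, a change in a source system's own numbering only affects the lookup, not the fact rows that reference the key.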
Fact Constellation
Multiple fact tables that share many dimension tables
Booking and Checkout may share many dimension tables in the hotel industry
[Figure: fact tables Booking and Checkout sharing the dimension tables Hotels, Promotion, Travel Agents, Room Type, and Customer]
SCD (Slowly Changing Dimension) - Types
Type 1 - Overwrite the existing values
Type 2 - Maintain the history of changed values
Type 3 - Maintain partial history
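The three types can be contrasted with a small sketch that applies the same change, a customer moving city, in each style. The column names (`valid_from`, `valid_to`, `is_current`, `previous_city`) are common conventions assumed for this example, not something the slides prescribe.

```python
from datetime import date


def apply_type1(row, new_city):
    """Type 1: overwrite the value; no history is kept."""
    return {**row, "city": new_city}


def apply_type2(history, new_city, change_date, next_key):
    """Type 2: close the current row and add a new row with a new surrogate key."""
    current = history[-1]  # assumption: the last row in the list is the current one
    closed = {**current, "valid_to": change_date, "is_current": False}
    opened = {
        **current,
        "customer_key": next_key,
        "city": new_city,
        "valid_from": change_date,
        "valid_to": None,
        "is_current": True,
    }
    return history[:-1] + [closed, opened]


def apply_type3(row, new_city):
    """Type 3: keep only the previous value in an extra column."""
    return {**row, "previous_city": row["city"], "city": new_city}


customer = {
    "customer_key": 1,
    "customer_id": "C1",
    "city": "Mumbai",
    "valid_from": date(2008, 1, 1),
    "valid_to": None,
    "is_current": True,
}

type1_row = apply_type1(customer, "Delhi")
type2_rows = apply_type2([customer], "Delhi", date(2009, 6, 1), next_key=2)
type3_row = apply_type3(customer, "Delhi")
```

Type 2 is the only variant here that lets old facts keep pointing at the old city, because the earlier row and its surrogate key are left in place.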
Conformed Dimension
A conformed dimension is a set of data
attributes that has been physically
implemented in multiple data marts, star
schemas, or snowflake schemas using the
same structure and attributes.
Junk Dimension
Create special dimensions to hold
miscellaneous attributes found in the source
database
Scenario:
Occasionally, there are miscellaneous attributes, such as
yes/no attributes or comment attributes, that don't fit into
tight star schemas. Rather than discarding flag fields and
yes/no attributes, place them in a junk dimension. In
addition, you can handle comment and open-ended text
attributes by creating a text-based junk dimension.
Degenerate Dimension
A degenerate dimension is data that is
dimensional in nature but stored in a fact
table.
Scenario:
If you have a dimension that only has Order Number and
Order Line Number, you would have a 1:1 relationship with
the fact table. Do you want two tables with a billion rows
each, or one table with a billion rows? Therefore, this
would be a degenerate dimension, and Order Number and
Order Line Number would be stored in the fact table.
Thank You