
Data Warehouse Concepts & Terminology
- Vamshi Myana

Contents

What is a Data Warehouse?
Why Separate Data Warehouse?
Data Granularity
Difference between OLTP & DW
Data Warehouse Architecture
Top-Down Versus Bottom-Up Approach
Data Warehouses Versus Data Marts
Dimensional Modeling Fundamentals
Extraction, Transformation and Load
ETL (Extract, Transform, Load) & OLAP

What is a Data Warehouse?
A data warehouse is a relational database that is
designed for query and analysis rather than for
transaction processing. It usually contains historical
data derived from transaction data, but it can include
data from other sources. It separates analysis
workload from transaction workload and enables an
organization to consolidate data from several sources.
In addition to a relational database, a data warehouse
environment includes an extraction,
transformation, and loading (ETL) solution, an
online analytical processing (OLAP) engine, client
analysis tools, and other applications that manage the
process of gathering data and delivering it to business
users.

Data Warehouse Properties
Subject-Oriented
Integrated
Nonvolatile
Time-Variant
-- Bill Inmon, Building the Data Warehouse, 1996

Subject-Oriented
Data is categorized and stored by business subject
rather than by application.
OLTP Applications: Equity Plans, Shares, Insurance, Savings, Loans
Data Warehouse Subject: Customer financial information

Integrated
Constructed by integrating multiple, heterogeneous data sources:
relational databases, flat files, on-line transaction records.
Data cleaning and data integration techniques are applied.
Ensure consistency in naming conventions, encoding structures,
attribute measures, etc. among different data sources.
E.g. Hotel price: currency, tax, breakfast covered, etc.
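The hotel price example can be sketched in code. This is a minimal illustration, assuming two hypothetical source systems that report prices in different currencies and disagree on whether tax is included; the exchange rates, tax rate, and all names below are invented for the example.

```python
from dataclasses import dataclass

# Hypothetical conversion and tax rates, for illustration only.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.25}
TAX_RATE = 0.10


@dataclass
class HotelPrice:
    amount: float
    currency: str
    tax_included: bool


def to_warehouse_price(price: HotelPrice) -> float:
    """Convert a source price to one convention: USD, tax included."""
    usd = price.amount * RATES_TO_USD[price.currency]
    if not price.tax_included:
        usd *= 1 + TAX_RATE
    return round(usd, 2)


# Two sources describing prices with different conventions.
source_a = HotelPrice(100.0, "USD", tax_included=False)
source_b = HotelPrice(80.0, "EUR", tax_included=True)

price_a = to_warehouse_price(source_a)
price_b = to_warehouse_price(source_b)
print(price_a, price_b)
```

Once both prices follow the same convention they can be compared or aggregated directly, which is the point of integration.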

Time-Variant
Data is stored as a series of snapshots, each
representing a period of time.

Time      Data
Jan-97    January
Feb-97    February
Mar-97    March

Nonvolatile
Typically, data in the data warehouse is not updated or deleted.
Operational: Insert, Update, Delete, Read
Warehouse: Load, Read
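The time-variant and nonvolatile properties can be illustrated together. The sketch below assumes a hypothetical `balance_snapshot` table in an in-memory SQLite database; the warehouse side only loads new snapshots and reads them, and never updates or deletes existing rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE balance_snapshot (period TEXT, account TEXT, balance REAL)"
)


def load_snapshot(period, rows):
    """Append one period's snapshot; earlier periods are left untouched."""
    conn.executemany(
        "INSERT INTO balance_snapshot VALUES (?, ?, ?)",
        [(period, account, balance) for account, balance in rows],
    )


# Each load adds a new snapshot instead of overwriting the old value.
load_snapshot("1997-01", [("A1", 100.0)])
load_snapshot("1997-02", [("A1", 150.0)])

history = conn.execute(
    "SELECT period, balance FROM balance_snapshot "
    "WHERE account = 'A1' ORDER BY period"
).fetchall()
print(history)
```

An operational system would typically keep only the current balance and update it in place; here the January value remains available after the February load.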

Why Separate Data Warehouse?
High performance for both systems
DBMS tuned for OLTP: access methods, indexing, concurrency control, recovery
Warehouse tuned for OLAP: complex OLAP queries, multidimensional view, consolidation

Different functions and different data:
Missing data: decision support requires historical data, which operational DBs do not typically maintain
Data consolidation: decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources
Data quality: different sources typically use inconsistent data representations, codes and formats, which have to be reconciled

Data Warehouse Terminology

Enterprise Data Warehouse
Collects all information about subjects (customers, products, sales, assets, personnel) that span the entire organization.

Data Mart
Departmental subsets that focus on selected subjects.

Decision Support System (DSS)
Not a product but an environment in which information technology is used to help the knowledge worker (executive, manager, analyst) make faster and better decisions.

Operational Data Store (ODS)
Stores tactical data from production systems that is subject-oriented and integrated to address operational needs.

Online Analytical Processing (OLAP)
An element of decision support systems (DSS) that provides analysis of data stored in a database. OLAP tools enable users to analyze different dimensions of multidimensional data.

Data Granularity
What is the granularity of your DW?
Granularity is the level of detail we
want to store in the data warehouse.
For a retail store, Point of Sale (POS) data is
the lowest granularity information
available.
For banking, it is the account-level detail
based on everyday transactions.
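To make the idea of grain concrete, the sketch below summarizes hypothetical POS line items (the lowest grain) to a coarser grain of one row per day, store and product. The data and the `summarize_daily` helper are invented for illustration; amounts are kept in cents to avoid rounding issues.

```python
from collections import defaultdict

# Lowest grain: one row per POS line item
# (day, store, product, quantity, amount_in_cents).
pos_lines = [
    ("1997-01-01", "store1", "tea", 2, 390),
    ("1997-01-01", "store1", "tea", 1, 195),
    ("1997-01-01", "store1", "coffee", 1, 250),
    ("1997-01-02", "store1", "tea", 3, 585),
]


def summarize_daily(lines):
    """Coarser grain: one row per (day, store, product)."""
    totals = defaultdict(lambda: [0, 0])
    for day, store, product, quantity, cents in lines:
        entry = totals[(day, store, product)]
        entry[0] += quantity
        entry[1] += cents
    return {key: tuple(value) for key, value in totals.items()}


daily = summarize_daily(pos_lines)
print(len(pos_lines), "POS rows ->", len(daily), "daily rows")
```

Storing only the daily rows saves space, but questions about individual transactions can no longer be answered, which is why the grain has to be chosen up front.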

Data Warehouse Versus OLTP

Property            Operational              Data Warehouse
Response Time       Subsecond to seconds     Seconds to hours
Operations          DML                      Primarily read only
Nature of Data      30-60 days               Snapshots over time
Data Organization   Applications             Subject, time
Size                Small to large           Large to very large
Data Source         Operational, Internal    Operational, Internal, External
Activities          Processes                Analysis

Data warehouse
Architectures


Top-Down Versus Bottom-Up Approach

Here are the two basic approaches:
Top-down: an overall data warehouse feeding dependent data marts
Bottom-up: several departmental or local data marts combining into a data warehouse

In the first approach, you extract data from the operational systems; you then transform, clean, integrate, and keep the data in the data warehouse.
So, which approach is best in your case, the top-down or the bottom-up approach?

Top-Down Approach
The advantages of this approach are:
A truly corporate effort, an enterprise
view of data
Inherently architected, not a union of
disparate data marts
Single, central storage of data about
the content
Centralized rules and control

Top-Down Approach
The disadvantages are:
Takes longer to build
High exposure/risk to failure
Needs high level of cross-functional
skills
High outlay without proof of concept

Bottom-Up Approach
The advantages of this approach are:
Faster and easier implementation of
manageable pieces
Favorable return on investment and
proof of concept
Less risk of failure
Inherently incremental; can schedule
important data marts first

Bottom-Up Approach
The disadvantages are:
Each data mart has its own narrow view
of data
Permeates redundant data in every data
mart
Perpetuates inconsistent and
irreconcilable data

Data Warehouses Versus Data Marts
[Figure: Data Warehouse compared with Data Mart]

Dimensional Model

A dimensional model is a model in which the data is structurally classified as fact or dimension.
General characteristics:
Query oriented
Structured around data usage, not business rules
Organized roughly into base facts and dimensions of those facts
Based on identification of key grains of data and on characteristics of those grains
Usually consisting of snapshot business data
Looks to reduce the number and depth of joins

General schema patterns:
Star schema: A fact table in the middle connected to a set of dimension tables
Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation

Example of Star Schema

Sales Fact Table
Keys: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales

Dimension tables
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, province_or_state, country
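A star schema like this one can be sketched with Python's built-in `sqlite3` module. The example below is a reduced version with only two dimensions; the table names `time_dim` and `item_dim` and all sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (
    time_key INTEGER PRIMARY KEY, day INTEGER, month INTEGER,
    quarter TEXT, year INTEGER
);
CREATE TABLE item_dim (
    item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT
);
-- The fact table holds foreign keys to each dimension plus the measures.
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    units_sold INTEGER,
    dollars_sold REAL
);
INSERT INTO time_dim VALUES (1, 15, 1, 'Q1', 1997), (2, 20, 4, 'Q2', 1997);
INSERT INTO item_dim VALUES (10, 'Laptop', 'Acme', 'PC'),
                            (11, 'Phone', 'Acme', 'Mobile');
INSERT INTO sales_fact VALUES (1, 10, 2, 2000.0), (1, 11, 1, 500.0),
                              (2, 10, 1, 1000.0);
""")

# A typical star query: join the fact table to its dimensions,
# then group by descriptive attributes and sum a measure.
sales_by_quarter_and_type = conn.execute("""
    SELECT t.quarter, i.type, SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN time_dim AS t ON t.time_key = f.time_key
    JOIN item_dim AS i ON i.item_key = f.item_key
    GROUP BY t.quarter, i.type
    ORDER BY t.quarter, i.type
""").fetchall()
print(sales_by_quarter_and_type)
```

Every dimension is exactly one join away from the fact table, which keeps the number and depth of joins low.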

Example of Snowflake Schema

Store Fact Table: STORE KEY, PRODUCT KEY, PERIOD KEY, Dollars, Units, Price

Store Dimension: STORE KEY, Store Description, City, State, District ID, Region_ID, Regional Mgr.
District table: District_ID, District Desc., Region_ID
Region table: Region_ID, Region Desc., Regional Mgr.
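The effect of normalizing the store hierarchy can be sketched as follows. This is a simplified version of the snowflake above, with lowercase table and column names and sample rows chosen only for illustration: reaching the region description from the fact table takes a chain of joins through store and district.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE region (region_id INTEGER PRIMARY KEY, region_desc TEXT);
CREATE TABLE district (
    district_id INTEGER PRIMARY KEY, district_desc TEXT,
    region_id INTEGER REFERENCES region(region_id)
);
CREATE TABLE store (
    store_key INTEGER PRIMARY KEY, store_desc TEXT,
    district_id INTEGER REFERENCES district(district_id)
);
CREATE TABLE store_fact (store_key INTEGER, dollars REAL, units INTEGER);
INSERT INTO region VALUES (1, 'North'), (2, 'South');
INSERT INTO district VALUES (10, 'D10', 1), (20, 'D20', 1), (30, 'D30', 2);
INSERT INTO store VALUES (100, 'S100', 10), (200, 'S200', 20),
                         (300, 'S300', 30);
INSERT INTO store_fact VALUES (100, 50.0, 5), (200, 70.0, 7), (300, 30.0, 3);
""")

# Fact -> store -> district -> region: one join per hierarchy level.
dollars_by_region = conn.execute("""
    SELECT r.region_desc, SUM(f.dollars)
    FROM store_fact AS f
    JOIN store AS s ON s.store_key = f.store_key
    JOIN district AS d ON d.district_id = s.district_id
    JOIN region AS r ON r.region_id = d.region_id
    GROUP BY r.region_desc
    ORDER BY r.region_desc
""").fetchall()
print(dollars_by_region)
```

Compared with a star schema, the region description is stored once instead of on every store row, at the cost of deeper joins at query time.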

Dimensional Modeling
Terminology
A Fact table stores measures as well as keys
representing relationships to various dimensions.
Additive - Measures that can be added across all
dimensions.
Semi Additive - Measures that can be added across some
dimensions but not across others.
Non Additive - Measures that cannot be added across any
dimension.
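A common illustration of a semi-additive measure is an account balance: it can be summed across accounts for one point in time, but summing it across months gives a meaningless number. The sketch below uses invented balances and helper names.

```python
# End-of-month balances keyed by (account, month); sample data only.
balances = {
    ("A1", "Jan"): 100, ("A1", "Feb"): 120,
    ("A2", "Jan"): 50, ("A2", "Feb"): 80,
}


def total_balance(month):
    """Valid: add balances across the account dimension for one month."""
    return sum(v for (_, m), v in balances.items() if m == month)


def average_balance(account):
    """Across time, use an average (or the latest value) instead of a sum."""
    values = [v for (a, _), v in balances.items() if a == account]
    return sum(values) / len(values)


jan_total = total_balance("Jan")
a1_average = average_balance("A1")
print(jan_total, a1_average)
```

Adding the January and February balances of account A1 would give 220, an amount the account never held, which is why the time dimension needs a different aggregation.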

Dimensions are perspectives with respect to which an organization wants to keep records. They contain textual attributes that describe the facts.

In the example, the sales fact table is connected to the dimensions location, product, time and organization. The measure "Sales Dollar" in the sales fact table can be added across all dimensions, independently or in combination, as shown below.

Sales Dollar value for a particular product
Sales Dollar value for a product in a location
Sales Dollar value for a product in a year within a location
Sales Dollar value for a product in a year within a location sold or serviced by an employee

Conformed Dimension
Dimension tables that adhere to a common
structure, and therefore allow queries to be
executed across star schemas.

Sales Schema
Item dimension: Item Key, Item Desc., Brand Desc., Category, ..
Sales Fact: DATE KEY, ITEM KEY, STORE KEY, PROMO KEY, ..

Inventory Schema
Item dimension: Item Key, Item Desc., Brand Desc., Category, ..
Inventory Fact: DATE KEY, ITEM KEY, STORE KEY
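Because both schemas share the same item dimension, results from the two fact tables can be lined up on a common attribute. The following is a minimal sketch with assumed table names (`item_dim`, `sales_fact`, `inventory_fact`), assumed measure columns and sample rows: each fact table is aggregated separately by category, then the two result sets are joined.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item_dim (
    item_key INTEGER PRIMARY KEY, item_desc TEXT, category TEXT
);
CREATE TABLE sales_fact (item_key INTEGER, units_sold INTEGER);
CREATE TABLE inventory_fact (item_key INTEGER, units_on_hand INTEGER);
INSERT INTO item_dim VALUES (1, 'Green tea', 'Tea'), (2, 'Black tea', 'Tea'),
                            (3, 'Espresso', 'Coffee');
INSERT INTO sales_fact VALUES (1, 5), (2, 3), (3, 4), (1, 2);
INSERT INTO inventory_fact VALUES (1, 20), (2, 10), (3, 15);
""")

# Aggregate each star separately, then join on the shared attribute.
sold_vs_on_hand = conn.execute("""
    SELECT s.category, s.units_sold, i.units_on_hand
    FROM (SELECT d.category AS category, SUM(f.units_sold) AS units_sold
          FROM sales_fact AS f JOIN item_dim AS d ON d.item_key = f.item_key
          GROUP BY d.category) AS s
    JOIN (SELECT d.category AS category,
                 SUM(f.units_on_hand) AS units_on_hand
          FROM inventory_fact AS f
          JOIN item_dim AS d ON d.item_key = f.item_key
          GROUP BY d.category) AS i
      ON i.category = s.category
    ORDER BY s.category
""").fetchall()
print(sold_vs_on_hand)
```

Aggregating before joining avoids multiplying rows between the two fact tables, which would inflate the sums.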

Extraction, Transformation and Load

OLTP Databases -> Staging File -> Warehouse Database

Purchase specialist tools, or develop programs.
Extraction -- mapping the data between source systems and the target database
Transformation -- validate, clean, integrate, and time stamp data
Load -- loading the transformed data into the target system
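For the "develop programs" option, the three steps can be sketched as a small pipeline. This is only an illustration under stated assumptions: the staging file is simulated with an in-memory CSV, the target is an in-memory SQLite table named `orders_fact`, and all column names and rows are hypothetical.

```python
import csv
import io
import sqlite3
from datetime import datetime, timezone

# Stand-in for a staging file exported from an OLTP system.
STAGING_FILE = io.StringIO(
    "order_id,customer,amount\n"
    "1, alice ,10.50\n"
    "2,BOB,not_a_number\n"
    "3,carol,7.25\n"
)


def extract(staging_file):
    """Read the staged rows as dictionaries."""
    return list(csv.DictReader(staging_file))


def transform(rows, load_time):
    """Validate, clean and time stamp; rows with bad amounts are skipped."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue
        customer = row["customer"].strip().lower()
        clean.append((int(row["order_id"]), customer, amount, load_time))
    return clean


def load(conn, rows):
    """Insert the transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders_fact "
        "(order_id INTEGER, customer TEXT, amount REAL, loaded_at TEXT)"
    )
    conn.executemany("INSERT INTO orders_fact VALUES (?, ?, ?, ?)", rows)


conn = sqlite3.connect(":memory:")
load_time = datetime.now(timezone.utc).isoformat()
load(conn, transform(extract(STAGING_FILE), load_time))

loaded = conn.execute(
    "SELECT order_id, customer, amount FROM orders_fact ORDER BY order_id"
).fetchall()
print(loaded)
```

A production pipeline would normally log or quarantine rejected rows rather than silently dropping them; that is omitted here to keep the sketch short.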

What is OLAP?
Online Analytical Processing. Viewing data in a
multi dimensional way.

Why OLAP?
Slice and dice for data warehouse.
RDBMS is a 2 dimensional way of storing /
viewing the data
OLAP is a multi dimensional way of storing /
viewing the data

OLAP operations
Roll up (drill-up): summarize data by climbing up a hierarchy or by dimension reduction.

Drill down (roll down): the reverse of roll-up; go from a higher-level summary to a lower-level summary or detailed data, or introduce new dimensions.
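Both forms of roll-up can be sketched on a small list of sales rows. The data and the `roll_up` helper are hypothetical; the location hierarchy assumed here is city -> country.

```python
from collections import defaultdict

# Detailed rows: (city, country, quarter, sales).
sales = [
    ("Toronto", "Canada", "1997-Q1", 100),
    ("Vancouver", "Canada", "1997-Q1", 50),
    ("Chicago", "USA", "1997-Q1", 70),
    ("Toronto", "Canada", "1997-Q2", 30),
]


def roll_up(rows, key):
    """Sum the sales measure for each group produced by `key`."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[3]
    return dict(totals)


# Detailed level: city by quarter.
by_city_quarter = roll_up(sales, lambda r: (r[0], r[2]))
# Roll up by climbing the hierarchy: city -> country.
by_country_quarter = roll_up(sales, lambda r: (r[1], r[2]))
# Roll up by dimension reduction: drop the time dimension.
by_country = roll_up(sales, lambda r: r[1])

print(by_country_quarter)
print(by_country)
```

Drill down is the same movement in the opposite direction, for example going from `by_country` back to `by_city_quarter`, and it requires that the detailed rows are still available.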

OLAP operations
Slicing: selecting the dimensions of the cube to be viewed.
Example: view Sales volume as a function of Product by Country by Quarter.

Dicing: specifying the values along one or more dimensions.
Example: view Sales volume for Product=PC by Country by Quarter.
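The Product=PC example can be sketched with a cube stored as a dictionary keyed by (product, country, quarter). This follows the definitions used on this slide; the cube contents and the `dice` helper are assumptions for illustration.

```python
# Dimension order used for the cube keys.
DIMENSIONS = ("product", "country", "quarter")

# Sales volume per cell; sample data only.
cube = {
    ("PC", "USA", "Q1"): 10,
    ("PC", "USA", "Q2"): 12,
    ("PC", "Canada", "Q1"): 4,
    ("Printer", "USA", "Q1"): 3,
}


def dice(cells, **fixed):
    """Keep only the cells whose members match the given dimension values."""
    positions = {name: DIMENSIONS.index(name) for name in fixed}
    return {
        key: value
        for key, value in cells.items()
        if all(key[positions[name]] == wanted for name, wanted in fixed.items())
    }


# Sales volume for Product=PC by Country by Quarter.
pc_view = dice(cube, product="PC")
# Fixing values on two dimensions narrows the view further.
pc_q1_view = dice(cube, product="PC", quarter="Q1")
print(pc_view)
print(pc_q1_view)
```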

Types of OLAP
There are three types of OLAP in the industry:
1. MOLAP: Multidimensional OLAP (e.g. MSOLAP, Essbase, Cognos)
2. ROLAP: Relational OLAP (e.g. Business Objects, MicroStrategy)
3. HOLAP: Hybrid OLAP

Architecture diagram of ROLAP
[Figure: a data warehouse or data mart feeds an app server running ROLAP tools (BO, Cognos, MicroStrategy, etc.) with BI metadata, which serves OLAP Report 1, OLAP Report 2, ... OLAP Report n]

When a report is executed by an end user, the actual SQL is issued to the RDBMS to get
the data. Some BI tools can also store the result set on the application server and
periodically refresh the report based on the data refreshes that happen in the DW.

Architecture diagram of MOLAP
[Figure: a data warehouse or data mart feeds Microsoft Analysis Services, which holds the MOLAP cubes and BI metadata (cube definitions, etc.) and serves OLAP Report 1, OLAP Report 2, ... OLAP Report n]

When a report is executed by an end user, the data is retrieved from the MOLAP
cubes using MDX queries generated for the report. MDX stands
for Multidimensional Expressions. SQL is used to get data from an RDBMS; MDX is used
to get data from MOLAP. The MOLAP cubes are refreshed periodically
based on the data refreshes that happen in the DW.
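The difference between the two architectures can be sketched without any BI product. In the illustration below, a plain SQLite table stands in for the warehouse and a Python dictionary stands in for a MOLAP cube; no real MDX is involved, and all names and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, quarter TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("PC", "Q1", 10), ("PC", "Q2", 12), ("Printer", "Q1", 3)],
)


def rolap_report(product):
    """ROLAP style: SQL is issued to the RDBMS when the report runs."""
    row = conn.execute(
        "SELECT SUM(amount) FROM sales WHERE product = ?", (product,)
    ).fetchone()
    return row[0]


def refresh_cube():
    """MOLAP style: precompute aggregates during a periodic refresh."""
    rows = conn.execute(
        "SELECT product, SUM(amount) FROM sales GROUP BY product"
    ).fetchall()
    return dict(rows)


def molap_report(cube, product):
    """Reports read the precomputed cube, not the warehouse tables."""
    return cube[product]


cube = refresh_cube()

# New data arrives in the warehouse after the cube was built.
conn.execute("INSERT INTO sales VALUES ('PC', 'Q3', 5)")

rolap_value = rolap_report("PC")               # sees the new row
stale_molap_value = molap_report(cube, "PC")   # cube not refreshed yet
fresh_molap_value = molap_report(refresh_cube(), "PC")
print(rolap_value, stale_molap_value, fresh_molap_value)
```

The stale value shows why the cube refresh has to be scheduled after the warehouse load.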

Terminology
Cube
A cube is a
multidimensional structure
of data. Cubes are defined
by a set of dimensions and
measures.

Terminology
Dimension
A structural attribute of a cube that acts as an index for identifying
values within a multidimensional array. If all dimensions have
a single member selected, then a single cell is defined.
[Figure: cube with axes Products, Location, and Time]

Terminology
Measures
Numeric data of interest, e.g. Revenue per Sale, Quantity.
[Figure: cube with axes Products (Coffee, Apples, Tea, Onions), Time (January, February, March, April), and Location (China, Peru, Japan, Italy); one cell shows the value 1.95]

Summary
This session covered the following topics:

What is a Data Warehouse?
Difference between OLTP & DW
Data warehouse Architecture and
approach
Dimensional Modeling
What is OLAP?

Questions ?

Thank You.
