
Inmon's Definition of a DW

A data warehouse is a
subject-oriented,
integrated,
nonvolatile, and
time-variant
collection of data in support of management's decisions.

According to this definition:
The data warehouse contains granular corporate data.
The form of the stored data (RDBMS, flat file) has nothing to do with whether something is a data warehouse.
Data warehousing is not necessarily for the needs of "decision makers" or used in the process of decision making.

R. Kimball's definition of a DW

A data warehouse is a copy of transactional data specifically structured for querying and analysis.

BITS Pilani, Pilani Campus

Operational vs DW

Operational system (OLTP)
Systems that support day-to-day operations
These systems get data into the DB
Ex: Take an order, process a claim, make a shipment, generate an invoice, etc.

Data Warehouse system (OLAP)
Systems that support strategic decisions
These systems get data out of the DB
Ex: Show top-selling products, show problem regions, show the highest margins, alert on thresholds.
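The contrast above can be sketched with an in-memory database: the OLTP side inserts individual transactions ("data into the DB"), while the OLAP side pulls an aggregate back out ("data out of the DB"). Table and product names are illustrative, not from the slides.

```python
import sqlite3

# Hypothetical order-entry table; names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, product TEXT, amount REAL)")

# OLTP: get data INTO the DB, one transaction at a time (take an order).
conn.execute("INSERT INTO orders VALUES (1, 'widget', 25.0)")
conn.execute("INSERT INTO orders VALUES (2, 'gadget', 40.0)")
conn.execute("INSERT INTO orders VALUES (3, 'widget', 30.0)")
conn.commit()

# OLAP-style: get data OUT of the DB, aggregated for analysis
# (show top-selling products).
top = conn.execute(
    "SELECT product, SUM(amount) AS total FROM orders "
    "GROUP BY product ORDER BY total DESC"
).fetchall()
print(top)  # [('widget', 55.0), ('gadget', 40.0)]
```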

Subject-Oriented Data Collections

Classical operational systems are organized around the applications of the company. For an insurance company, the applications may be auto, health, life, and casualty. The major subject areas of the insurance corporation might be customer, policy, premium, and claim. For a manufacturer, the major subject areas might be product, order, vendor, bill of material, and raw goods. For a retailer, the major subject areas may be product, SKU, sale, vendor, and so forth. Each type of company has its own unique set of subjects.

Integrated Data Collections

Of all the aspects of a data warehouse, integration is the most important. Data is fed from multiple disparate sources into the data warehouse. As the data is fed it is converted, reformatted, resequenced, summarized, and so forth. The result is that data, once it resides in the data warehouse, has a single physical corporate image.

Non-volatile Data Collections

Data is updated in the operational environment as a regular matter of course, but warehouse data exhibits a very different set of characteristics. Data warehouse data is loaded (usually en masse) and accessed, but it is not updated (in the general sense). Instead, when data in the data warehouse is loaded, it is loaded in a snapshot, static format. When subsequent changes occur, a new snapshot record is written. In doing so a history of data is kept in the data warehouse.

Time-variant Data Collections

Operational:
Time horizon 1-2 years.
Update of records
Key structure may/may not contain an element of time

Data Warehouse:
Time horizon 5-15 years.
Sophisticated snapshots of data
Key structure contains an element of time

Time variancy implies that every unit of data in the data warehouse is accurate as of some one moment in time. In some cases, a record is time-stamped. In other cases, a record has a date of transaction. But in every case, there is some form of time marking to show the moment in time during which the record is accurate. A 1-to-2-year time horizon is normal for operational systems; a 5-to-15-year time horizon is normal for the data warehouse. As a result of this difference in time horizons, the data warehouse contains much more history than any other environment.

Operational Data Store (ODS)

The Operational Data Store is used for tactical decision making while the DW supports strategic decisions. It contains transaction data, at the lowest level of detail for the subject area.
subject-oriented, just like a DW
integrated, just like a DW
volatile (or updateable), unlike a DW
an ODS is like a transaction processing system
information gets overwritten with updated data
no history is maintained (other than audit trail) or operational history
current, i.e., not time-variant, unlike a DW
current data, up to a few years

Data Warehouses and Data Marts

A data warehouse is a central repository for all or significant parts of the data that an enterprise's various business systems collect. It enables strategic decision making.
A data mart is a repository of data gathered from operational data and other sources that is designed to serve a particular community of knowledge workers. In scope, the data may derive from an enterprise-wide database or data warehouse or be more specialized. The emphasis of a data mart is on meeting the specific demands of a particular group of knowledge users in terms of analysis, content, presentation, and ease-of-use. Users of a data mart can expect to have data presented in terms that are familiar.
In practice, the terms data mart and data warehouse each tend to imply the presence of the other in some form. However, most writers using the term seem to agree that the design of a data mart tends to start from an analysis of user needs and that a data warehouse tends to start from an analysis of what data already exists and how it can be collected in such a way that the data can later be used. A data warehouse is a central aggregation of data (which can be distributed physically); a data mart is a data repository that may derive from a data warehouse or not and that emphasizes ease of access and usability for a particular designed purpose.
A data warehouse tends to be a strategic but somewhat unfinished concept; a data mart tends to be tactical and aimed at meeting an immediate need.

The goals of a Data Warehouse

"We have mountains of data in this company, but we can't access it."
"We need to slice and dice the data every which way."
"You've got to make it easy for business people to get at the data directly."
"Just show me what is important."
"It drives me crazy to have two people present the same business metrics at a meeting, but with different numbers."
"We want people to use information to support more fact-based decision making."

The data warehouse must make an organization's information easily accessible.
The data warehouse must present the organization's information consistently.
The data warehouse must be adaptive and resilient to change.
The data warehouse must be a secure bastion that protects our information assets.
The data warehouse must serve as the foundation for improved decision making.
The business community must accept the data warehouse if it is to be deemed successful.

Data Warehouse Architecture

Requirements for DW

Security Requirements
a paradox:
Data Warehouse: publish data widely
Security: restrict data to those with a need to know
role-based security at the final applications (not grant or revoke at the DBMS level)
security for developers (separate subnet), backups (tapes, disks)

Requirements for DW (contd)

Data Integration
at the core of the IT business, aka the 360-degree view of the business
specific to Data Warehouses: establishing common attributes (conformed dimensions), agreeing on common business metrics (conformed facts) so that one can perform mathematical calculations (differences, ratios, etc.)

Data Latency
how quickly to deliver data to the end user
improvements with algorithms, parallel processing, streaming

Archiving
change calculations
legal compliance lineage requirements

End User
reports, OLAP, data handoff

Technical Requirements for DW

Architecture
ETL tool versus hand coding
batch updates versus data streaming
horizontal (orders/shipments) versus vertical (customers/orders) task dependency
scheduler automation
quality handling/data cleansing
metadata
security
staging

Data Warehousing
BITS Pilani

July 26, 2014

Data Warehouse Design

Start with the ER-diagram that represents the corporate data model, or with one or more operational data models to be integrated
Remove data used purely in the operational environment
Enhance key structures with an element of time
Add derived (calculated) data (i.e., summaries)
Turn relationships of the ER model into artifacts in the data warehouse

Design Techniques:
Merging Tables

Many tables imply much dynamic I/O
Merging many tables together makes access faster

Design Techniques:
Introduction of Redundant Data

Design Technique:
Separation of Data when there is a disparity of probability of access

Design Technique:
Introduce Derived Data

Design Techniques:
Creative Indexes

Calculated once
Forever available

The top 10 customers
The average transaction value for this extract
The largest transaction
The number of customers who showed activity without purchasing.

Design Technique:
Forget Referential Integrity

In the operational environment, referential integrity appears as a dynamic link among tables of data.
Not in a data warehouse, because:
the volume of data is too large
the data warehouse is not updated, just appended to
the warehouse represents data over time, and relationships do not remain static
Relationships of data are represented by an artifact in the data warehouse environment. Therefore, some data will be duplicated, and some data will be deleted when other data is still in the warehouse. In any case, trying to replicate referential integrity in the data warehouse environment is a patently incorrect approach.

Dimensional Modeling

Data modeling for Data Warehouses
Based on
fact tables
dimension tables

Fact Tables

Represent a business process, i.e., model the business process as an artifact in the data model
contain the measurements or metrics or facts of business processes
"monthly sales number" in the Sales business process
most are additive (sales this month), some are semi-additive (balance as of), some are not additive (unit price)
the level of detail is called the grain of the table
contain foreign keys for the dimension tables

Dimension Tables

Represent the who, what, where, when and how of a measurement/artifact
Represent real-world entities, not business processes
Give the context of a measurement (subject)
For example, for the Sales fact table, the characteristics of the 'monthly sales number' measurement can be a Location (Where), Time (When), Product Sold (What).
The Dimension Attributes are the various columns in a dimension table. In the Location dimension, the attributes can be Location Code, State, Country, Zip code. Generally the Dimension Attributes are used in report labels, and in query constraints such as where Country='USA'. The dimension attributes also contain one or more hierarchical relationships.
Before designing your data warehouse, you need to decide what this data warehouse contains. Say you want to build a data warehouse containing monthly sales numbers across multiple store locations, across time and across products; then your dimensions are:
Location
Time
Product

Possible OLTP Location Design


Location Dimension

Dim_id  Loc_cd  Name          State_NM          Country_NM
1001    IL01    Chicago Loop  Illinois          USA
1002    IL02    Arlington     Illinois          USA
1003    NY01    Brooklyn      New York          USA
1004    TO01    Toronto       Ontario           Canada
1005    MX01    Mexico City   Distrito Federal  Mexico

With the OLTP design, in order to query for all locations that are in country 'USA' we will have to join three tables:

SELECT *
FROM Locations, States, Countries
WHERE Locations.State_Id = States.State_Id
AND Locations.Country_Id = Countries.Country_Id
AND Country_Name = 'USA'

With the Location dimension, the same query needs only one table:

SELECT *
FROM Location_dim
WHERE Country_Name = 'USA'

Note the redundancy.
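A minimal sketch of the point above, using small illustrative tables in SQLite: the normalized design needs a three-way join to answer the country question, while the denormalized dimension answers it with a single-table scan.

```python
import sqlite3

# Sketch of the slide's two designs; the data is illustrative.
db = sqlite3.connect(":memory:")
db.executescript("""
-- OLTP (normalized) design: three tables
CREATE TABLE Countries (Country_Id INTEGER, Country_Name TEXT);
CREATE TABLE States    (State_Id INTEGER, State_Name TEXT);
CREATE TABLE Locations (Loc_cd TEXT, Name TEXT, State_Id INTEGER, Country_Id INTEGER);
-- DW design: one denormalized dimension
CREATE TABLE Location_dim (Dim_id INTEGER, Loc_cd TEXT, Name TEXT,
                           State_NM TEXT, Country_Name TEXT);
INSERT INTO Countries VALUES (1,'USA'),(2,'Canada');
INSERT INTO States VALUES (10,'Illinois'),(20,'Ontario');
INSERT INTO Locations VALUES ('IL01','Chicago Loop',10,1),('TO01','Toronto',20,2);
INSERT INTO Location_dim VALUES (1001,'IL01','Chicago Loop','Illinois','USA'),
                                (1004,'TO01','Toronto','Ontario','Canada');
""")

# Normalized: a three-way join is needed to filter by country.
joined = db.execute("""
    SELECT Locations.Name FROM Locations, States, Countries
    WHERE Locations.State_Id = States.State_Id
      AND Locations.Country_Id = Countries.Country_Id
      AND Country_Name = 'USA'
""").fetchall()

# Denormalized dimension: a single-table scan answers the same question.
dim = db.execute(
    "SELECT Name FROM Location_dim WHERE Country_Name='USA'"
).fetchall()
print(joined, dim)  # both return [('Chicago Loop',)]
```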

Time Dimension

Dim_id  Month  Quarter  Year
1001    Jan    Q1       2005
1002    Feb    Q1       2005
1003    Mar    Q1       2005
1004    Apr    Q2       2005
1005    May    Q2       2005

(The full table also carries MonthName and QuarterName columns.)

Product Dimension

Prod_id  Prod_cd  Name                   Category
1001     STD      Short-Term-Disability  Disability
1002     LTD      Long-Term Disability   Disability
1003     GUL      Group Universal Life   Life
1004     PA       Personal Accident      Accident
1005     VADD     Voluntary Accident     Accident

Not as trivial as it seems ;-)

Star Schemas

Select the measurements:
SELECT P.Category, SUM(F.Sales)
Join the FACT table with the Dimensions:
FROM Sales F, Time T, Product P, Location L
WHERE F.TM_Dim_Id = T.Dim_Id
AND F.PR_Dim_Id = P.Dim_Id
AND F.LOC_Dim_Id = L.Dim_Id
Constrain the Dimensions:
AND T.Month='Jan' AND T.Year='2003' AND L.Country_Name='USA'
'Group by' for the aggregation level:
GROUP BY P.Category

Advantages:
- easy to understand
- better performance
- extensible

Snow-flake Schemas

Used if we did not de-normalize the dimensions.
Rule of thumb: don't use them.
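The annotated query above can be run end to end on a toy star schema. The rows, keys, and sales figures below are invented for illustration; only the query shape comes from the slide.

```python
import sqlite3

# Minimal star schema matching the slide's query shape (illustrative data).
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE Time    (Dim_Id INTEGER, Month TEXT, Year TEXT);
CREATE TABLE Product (Dim_Id INTEGER, Name TEXT, Category TEXT);
CREATE TABLE Location(Dim_Id INTEGER, Country_Name TEXT);
CREATE TABLE Sales   (TM_Dim_Id INTEGER, PR_Dim_Id INTEGER,
                      LOC_Dim_Id INTEGER, Sales REAL);
INSERT INTO Time VALUES (1,'Jan','2003'),(2,'Feb','2003');
INSERT INTO Product VALUES (10,'LeapPad','Education'),(11,'Walkie','Toy');
INSERT INTO Location VALUES (100,'USA'),(101,'Canada');
INSERT INTO Sales VALUES (1,10,100,50.0),(1,11,100,30.0),
                         (1,10,100,20.0),(2,10,100,99.0),(1,10,101,7.0);
""")

# Join fact to dimensions, constrain the dimensions, aggregate by category.
rows = db.execute("""
    SELECT P.Category, SUM(F.Sales)
    FROM Sales F, Time T, Product P, Location L
    WHERE F.TM_Dim_Id = T.Dim_Id
      AND F.PR_Dim_Id = P.Dim_Id
      AND F.LOC_Dim_Id = L.Dim_Id
      AND T.Month='Jan' AND T.Year='2003' AND L.Country_Name='USA'
    GROUP BY P.Category
""").fetchall()
print(rows)  # one total per category for Jan 2003 in the USA
```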

Dimensional Modeling Steps

Identify the business process
Identify the level of detail needed (grain)
Identify the dimensions
Identify the facts

Data Warehouses and Data Marts

A data warehouse is a central repository for all or significant parts of the data that an enterprise's various business systems collect. It enables strategic decision making.
Enterprise-wide

A data mart is a repository of data gathered from operational data and other sources that is designed to serve a particular community of knowledge workers. In scope, the data may derive from an enterprise-wide database or data warehouse or be more specialized.
The emphasis of a data mart is on meeting the specific demands of a particular group of knowledge users in terms of analysis, content, presentation, and ease-of-use.
Departmental

DW and Data Marts

Data Warehouse:
Corporate/Enterprise-wide
Union of all data marts
Data received from staging area
Queries on presentation resource
Structure for corporate view of data
Organized on E-R Model.

Data Mart:
Departmental
A single business process
STAR join (facts and dimensions)
Technology optimal for data access and analysis
Structure to suit the departmental view of data

Top-down vs Bottom-up Approach

Top-down approach:
Bill Inmon
Normalized data model
Enterprise view of data
Single, central storage of data
Takes longer to build
High exposure to risk and failure.

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Bottom-up approach:
Ralph Kimball
De-normalized data model
Collection of conformed data marts which gives an enterprise view
Inherently incremental
Less risk of failure, and allows the project team to learn and grow.


Data warehouse
Architecture types

Centralized Data Warehouse
(Source Data -> Data Staging -> DSO/ODS -> Reports/Queries)
Normalized data in third normal form.
Summarized data at DSO/ODS level.
Queries/Reports access the central DW.
There are no separate data marts.

Independent Data Marts
(Source Data -> Data Staging -> Data Marts -> Reports/Queries)
Each data mart in this model serves a particular organizational unit.
Each data mart is independent of one another.
Variances between data marts affect data analysis across data marts.
For example: Sales and Shipments are two independent data marts. Even though sales and shipments are related, in this model it is difficult to analyze sales and shipment data together.

Data warehouse
Architecture types

Hub and Spoke model
(Source Data -> Data Staging -> Data Storage -> Data Marts -> Reports/Queries)
Inmon's CIF (Corporate Information Factory) approach.
Centralized DW in third normal form.
Dependent Data marts obtain data from the Centralized DW.
Each dependent Data mart may have
- normalized
- denormalized
- summarized/dimensional data structures.

Kimball's conformed approach
(Source Data -> Data Staging -> Data Marts -> Reports/Queries)
Business dimensions from the first data mart are shared among other data marts.
Conformed dimensions will give a logical integrated DW with an enterprise view.
Bottom-up approach.

Characteristics of DW/BI

High profile and high impact
High risk
Highly political
Requires sophisticated and complex data gathering
Requires intensive user access, training and support
High maintenance

DW Lifecycle Principles

Focus on the business
Build an information infrastructure
Deliver in meaningful increments: six to twelve month timeframes
Deliver the entire solution: query and display tools in addition to the database

DW Lifecycle

Project Planning -> Business Requirements Definition, then three parallel tracks:
- Technical Architecture Design -> Product Selection & Installation
- Dimensional Modeling -> Physical Design -> ETL Design & Development
- BI Application Specification -> BI Application Development
The tracks converge into Deployment -> Maintenance -> Growth, with Project Management running throughout.

Specialized Roles

Data warehouse DBA
OLAP designer
ETL system developer
DW/BI management tools/Application developer


DW Infrastructure

Issues:
Performance drain on the operating environment
Technical skills of the data warehouse implementers
Operational issues such as funding requirements
Shop standards

Platforms:
Source system
Staging area
Application server
Desktop tools
Database server

DW Infrastructure
Database server

Size: 500 GB to 250 TB
< 500 GB is small
500 GB to 5 TB is medium
over 5 TB is large
Volatility: what is the nature and frequency of the update process
Users: the number of users as well as their level of knowledge
Number of business processes (marts):
In some cases there may be a separate platform for each one.
There may be an additional central server for management roll-ups.
Nature of use:
ad hoc queries from power users
standardized queries
data mining
Technical support: the hardware, the operating system and the database engine may all require specialized support
Software requirements may dictate platform.

DW Infrastructure
Operating systems

Mainframes
transaction oriented
complex administration
not parallel
Open system (UNIX) servers
specialized environment
NT servers
relatively small capacities (limited numbers of processors and less efficient performance)

DW Infrastructure
Hardware

Single-processor-at-a-time system
Symmetric Multiprocessing (SMP)
multiple processors
shared memory
common bus
Massively Parallel Processing (MPP)
multiple processors
distributed memory
distributed bus

DW Infrastructure
Performance

Indexing
Physical organization
Caching and blocking
Data distribution
Memory
Chip architecture

DW Infrastructure
Database Engine

Relational
well understood
includes DW support for star joins and fast access
flexible
Multidimensional (MOLAP)
extremely fast
pre-calculated combination facts

DW Infrastructure
Front Room

Configuration of desk-top
Client/Server
Web
Supplemental tools

DW Infrastructure
Operational management

Load window (operational scheduling)
Backup
User support
Change management


Fact Table

Measurements associated with a specific business process
Grain: level of detail of the table
Process events produce fact records
Facts (attributes) are usually
numeric
additive
Derived facts included
Foreign (surrogate) keys refer to dimension tables (entities)

Dimension Tables

Entities describing the objects of the process
Conformed dimensions - cross processes
Attributes are descriptive
text
numeric
Surrogate keys
1:m with the fact table
Null entries
Date dimensions

Bus Architecture

An architecture that permits aggregating data across multiple marts
Conformed dimensions and attributes
Bus matrix

Keys and Surrogate Keys

A surrogate key is a unique identifier for data warehouse records that replaces source primary keys (business/natural keys)
Protect against changes in source systems
Allow integration from multiple sources
Enable rows that do not exist in source data
Track changes over time (e.g. new customer instances when addresses change)
Replace text keys with integers for efficiency
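The bullet points above can be sketched as a small key generator: natural keys from any source map to stable integer surrogates, and the same natural key always gets the same surrogate back. The class, the starting value, and the example keys are illustrative.

```python
# Sketch: assigning integer surrogate keys to natural (business) keys as
# rows arrive from multiple sources. Names and values are illustrative.
class SurrogateKeyGenerator:
    def __init__(self):
        self._next = 1000          # starting point is arbitrary
        self._map = {}             # natural key -> surrogate key

    def key_for(self, natural_key):
        # Hand out a new integer the first time a natural key is seen;
        # return the same surrogate on every later lookup.
        if natural_key not in self._map:
            self._map[natural_key] = self._next
            self._next += 1
        return self._map[natural_key]

gen = SurrogateKeyGenerator()
a = gen.key_for(("CRM", "CUST-31421"))   # source system + its own key
b = gen.key_for(("BILLING", "31421"))    # another source, another key
c = gen.key_for(("CRM", "CUST-31421"))   # repeat lookup
print(a, b, c)  # 1000 1001 1000
```

Keeping the source system inside the natural key is what lets rows from multiple systems coexist without key collisions.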


Slowly Changing Dimensions

Attributes in a dimension that change more slowly than the fact granularity
Type 1: Current only / overwrite the old value
Type 2: All history / create a new dimensional record
Type 3: Most recent few (rare) / create a previous-value attribute

Note: rapidly changing dimensions usually indicate the presence of a business process that should be tracked as a separate dimension or as a fact table

Dimension with a slowly changing attribute:

CustKey  BKCustID  CustName    CommDist  Gender  HomOwn?
1552     31421     Jane Rider  ...       ...     ...

Fact Table:

Date       CustKey  ProdKey  ItemCount  Amount
1/7/2004   1552     95       ...        1,798.00
3/2/2004   1552     37       ...        27.95
5/7/2005   1552     87       ...        320.26
2/21/2006  2387     42       ...        19.95

After a Type 2 change (two versions of the same customer):

CustKey  BKCustID  CustName    CommDist  Eff       End
1552     31421     Jane Rider  ...       1/7/2004  1/1/2006
2387     31421     Jane Rider  31        1/2/2006  12/31/9999

Slowly Changing Dimensions

Original:
ProductKey  Description  Category   SKU
21553       LeapPad      Education  LP2105

Type 1 (overwrite the attribute):
ProductKey  Description  Category  SKU
21553       LeapPad      Toy       LP2105

Type 2 (add a new record):
ProductKey  Description  Category   SKU
21553       LeapPad      Education  LP2105
44631       LeapPad      Toy        LP2105

Type 3 (previous-value attribute):
ProductKey  Description  Category  OldCat     SKU
21553       LeapPad      Toy       Education  LP2105

Hybrid:
ProductKey  Description  Category   OldCat       SKU
21553       LeapPad      Education  Electronics  LP2105
44631       LeapPad      Toy        Education    LP2105
68122       LeapPad      Education  Electronics  LP2105
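A minimal sketch of the Type 1 and Type 2 mechanics above, reusing the Jane Rider example's keys and dates. The attribute names and the "Jane Smith" change are illustrative assumptions.

```python
from datetime import date

OPEN_END = date(9999, 12, 31)   # sentinel "end of time" as on the slide

def scd_type1(rows, business_key, new_attrs):
    """Type 1: overwrite attributes in place; no history is kept."""
    for row in rows:
        if row["bk"] == business_key:
            row.update(new_attrs)

def scd_type2(rows, business_key, new_attrs, next_key, change_date):
    """Type 2: close the current row and append a new versioned row."""
    for row in rows:
        if row["bk"] == business_key and row["end"] == OPEN_END:
            row["end"] = change_date            # expire the current version
            new_row = dict(row, **new_attrs)    # copy, apply the change
            new_row["key"] = next_key           # fresh surrogate key
            new_row["eff"], new_row["end"] = change_date, OPEN_END
            rows.append(new_row)
            return

dim = [{"key": 1552, "bk": 31421, "name": "Jane Rider",
        "eff": date(2004, 1, 7), "end": OPEN_END}]
scd_type2(dim, 31421, {"name": "Jane Smith"}, next_key=2387,
          change_date=date(2006, 1, 2))
print(len(dim), dim[0]["end"], dim[1]["key"])  # 2 2006-01-02 2387
```

New fact rows would now carry surrogate key 2387, while old fact rows keep 1552, which is exactly how history is preserved.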

Date Dimensions

One row for every day for which you expect to have data for the fact table (perhaps generated in a spreadsheet and imported)
Usually use a meaningful integer surrogate key (such as yyyymmdd: 20060926 for Sep. 26, 2006). Note: this order sorts correctly.
Include rows for missing or future dates to be added later.
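The slide above can be sketched as a small generator. Only a few of the many possible date attributes are included, and the attribute names are illustrative.

```python
from datetime import date, timedelta

def build_date_dim(start, end):
    """One row per day, keyed by a yyyymmdd integer surrogate key."""
    rows, d = [], start
    while d <= end:
        rows.append({
            "DateKey": d.year * 10000 + d.month * 100 + d.day,  # 20060926 style
            "FullDate": d.isoformat(),
            "DayOfWeek": d.strftime("%A"),
            "Quarter": (d.month - 1) // 3 + 1,
        })
        d += timedelta(days=1)
    return rows

dim = build_date_dim(date(2006, 9, 25), date(2006, 9, 27))
print([r["DateKey"] for r in dim])  # [20060925, 20060926, 20060927]
```

As the slide notes, the yyyymmdd integers sort in the same order as the dates themselves.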

Fact Tables

Precalculated summary tables

Transaction

Improve performance
Record data an coarser granularity

Track processes at discrete points in time when they occur

State change summary that has one row per item.

Periodic snapshot

Access rows on each update.

Accumulating snapshot

Cumulative performance over specific time intervals

Constantly updated over time. May include multiple dates representing


stages.

BITS Pilani, Pilani Campus

BITS Pilani, Pilani Campus

Star schema Model

POS FACT table:
DateKey
ProductKey
StoreKey
PromotionKey
POSTransactionNumber
SalesQuantity
SalesDollarAmount
CostDollarAmount
GrossProfitDollarAmount

Dimension tables: DATE (DateKey + attributes), PRODUCT (ProductKey + attributes), STORE (StoreKey + attributes), PROMOTION (PromotionKey + attributes)

Possible Date Attributes
SQL date, Full date description, Day of week, Day of month, Day of calendar year, Day of fiscal year, Month of calendar year, Month of fiscal year, Calendar Quarter, Fiscal Quarter, Fiscal week, Year, Month, Fiscal year, Holiday?, Holiday name, Day of holiday, Weekday?, Selling season, Major event, etc.

Possible Product Attributes
Description, SKU number, Brand description, Department, Package type, Package size, Fat content, Diet type, Weight, Weight units of measure, Storage type, Shelf unit type, Shelf width, Shelf height, Shelf depth, etc.

Possible Store Attributes
Store Name, Store Number, Street address, City, County, State, Zip, Manager, District, Region, Floor plan type, Photo processing type, Financial service type, Square footage, Selling square footage, First open date, Last remodel date, etc.

Conformed Dimensions:
Inventory Snapshot Model

Process: Store inventory
Grain: Daily inventory by product and store
Dimensions: Date, product, store
Fact: quantity-on-hand

Factless Fact Tables

In order to evaluate promotions that might have generated no sales, we need another approach.
Promotion could generate another fact table (or could be considered a fact table in itself). That new fact table would have no additive attributes.

The Bus Matrix

Rows are business processes: Retail Sales, Retail Inventory, Retail Deliveries, Warehouse Inventory, Warehouse Deliveries, Purchase Orders.
Columns are dimensions: Date, Product, Store, Promotion, Warehouse, Vendor, Contract, Shipper.

Dimensional Model

Inventory Fact table:
ProductKey
DateKey
StoreKey
QuantityOnHand
QuantitySold
ValueAtCost
ValueAtSellingPrice

Dimension tables: DATE (DateKey + attributes), PRODUCT (ProductKey + attributes), STORE (StoreKey + attributes)

Note: QuantityOnHand is semi-additive. It is additive across product and store, but not across date. The other attributes are additive.
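The semi-additivity note above can be demonstrated directly: summing the on-hand level across stores for one date is meaningful, while summing it across dates double-counts the same stock. The snapshot rows are invented for illustration, and averaging over time is one common convention for levels, not the only one.

```python
# Sketch: why QuantityOnHand is semi-additive. Data is illustrative.
facts = [  # (date_key, store, product, qty_on_hand)
    (20060925, "S1", "P1", 10), (20060925, "S2", "P1", 4),
    (20060926, "S1", "P1", 8),  (20060926, "S2", "P1", 5),
]

# Additive across stores: total stock of P1 on 2006-09-25.
per_date = sum(q for d, s, p, q in facts if d == 20060925)
print(per_date)  # 14 -- a real, physical quantity

# NOT additive across dates: 10 + 4 + 8 + 5 counts the same stock twice.
naive_total = sum(q for *_, q in facts)

# Instead, average the snapshots over time (one common choice for levels).
dates = {d for d, *_ in facts}
avg_on_hand = naive_total / len(dates)
print(naive_total, avg_on_hand)  # 27 13.5
```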

Data Acquisition from OLTP Systems

Why is it hard?
Multiple source systems technologies.
Inconsistent data representations.
Multiple sources for the same data element.
Complexity of required transformations.
Scarcity and cost of legacy cycles.
Volume of legacy data.

ETL Processing

Operational Data -> Data Transformation -> Enterprise Warehouse and Integrated Data Marts -> Replication -> Dependent Data Marts or Departmental Warehouses -> Business Users

Data Acquisition from OLTP Systems

Many possible source systems technologies:
* Flat files          * Excel      * Model 204
* VSAM                * Access     * DBF Format
* IMS                 * Oracle     * RDB
* IDMS                * Informix   * RMS
* DB2 (many flavors)  * Sybase     * Compressed
* Adabase             * Ingres     * Many others...

Inconsistent data representation: same data, different domain values...
Examples:
Date value representations:
- 1996-02-14
- 02/14/1996
- 14-FEB-1996
- 960214
- 14485
Gender value representations:
- M/F
- M/F/PM/PF
- 0/1
- 1/2
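The representations listed above imply a normalization step in the ETL flow. This sketch handles the four parseable date formats; a day serial like 14485 would additionally need the source system's epoch, which the slide does not give, so it is left unhandled. The 0-to-M, 1-to-F gender mapping is an assumption, since which code means which gender is source-specific.

```python
from datetime import datetime

def normalize_date(value):
    """Map several source date representations onto a single date object.
    The formats are the ones listed on the slide; a day serial (14485)
    would need the source system's epoch and is not handled here."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%d-%b-%Y", "%y%m%d"):
        try:
            return datetime.strptime(value, fmt).date()
        except ValueError:
            pass
    raise ValueError(f"unrecognized date representation: {value!r}")

def normalize_gender(value):
    """Map one source's 0/1 coding onto the warehouse's M/F domain
    (the 0->M, 1->F assignment is an assumption)."""
    return {"M": "M", "F": "F", "0": "M", "1": "F"}[value]

forms = ["1996-02-14", "02/14/1996", "14-FEB-1996", "960214"]
print({normalize_date(v) for v in forms})  # all four map to the same date
```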

Data Acquisition from OLTP Systems

Complexity of required transformations:
Simple scalar transformations.
0/1 => M/F
One-to-many element transformations.
6x30 address field => street1, street2, city, state, zip
Many-to-many element transformations.
Householding and individualization of customer records

Volume of legacy data:
Need lots of processing and I/O to effectively handle large data volumes.
The 2 GB file limit in older versions of UNIX is not acceptable for handling legacy data; need a full 64-bit file system.
Need efficient interconnect bandwidth to transfer large amounts of data from legacy sources.
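The first two transformation classes above can be sketched directly. The 0/1 mapping and the fixed-width 6x30 address layout are assumptions for illustration; householding (the many-to-many case) is far more involved and is not shown.

```python
# Sketch of a scalar recode and a one-to-many field split; the codes,
# field names, and layout are illustrative assumptions.
def recode_gender(code):
    """Simple scalar transformation: 0/1 => M/F (mapping is assumed)."""
    return {"0": "M", "1": "F"}[code]

def split_address(field, width=30):
    """One-to-many transformation: a fixed-width 6x30 address blob
    into named components (assumed line order)."""
    lines = [field[i:i + width].strip() for i in range(0, 6 * width, width)]
    street1, street2, city, state, zipcode, _ = lines
    return {"street1": street1, "street2": street2, "city": city,
            "state": state, "zip": zipcode}

blob = ("123 Main St".ljust(30) + "Suite 4".ljust(30)
        + "Springfield".ljust(30) + "IL".ljust(30)
        + "62701".ljust(30) + "".ljust(30))
print(recode_gender("1"), split_address(blob)["city"])  # F Springfield
```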

Data Acquisition from OLTP Systems

Parallel software and hardware architectures:
Use data parallelism (partitioning) to allow concurrent execution of multiple job streams.
Software architecture must allow efficient re-partitioning of data between steps in the transformation process.
Want powerful parallel hardware architectures with many processors and I/O channels.

ETL Processing

It is important to look at the big picture. Data acquisition time may include:
Extracts from source systems.
Data movement.
Transformations.
Data loading.
Index maintenance.
Statistics collection.
Summary data maintenance.
Data mart construction.
Backups.

Loading Strategies

Once we have transformed data, there are three primary loading strategies:
1. Full data refresh with block slamming into empty tables.
2. Incremental data refresh with block slamming into existing (populated) tables.
3. Trickle feed with continuous data acquisition using row-level insert and update operations.

We must also worry about rolling off old data as its economic value drops below the cost of storing and maintaining it (new data is loaded in as old data is rolled off).

Loading Strategies

Should consider:
Data storage requirements.
Impact on query workloads.
Ratio of existing to new data.
Insert versus update workloads.

Full Refresh Strategy

Completely re-load the table on each refresh.
Step 1: Load table using block slamming.
Step 2: Build indexes.
Step 3: Collect statistics.
This is a good (simple) strategy for small tables or when a high percentage of rows in the data changes on each refresh (greater than 10%), e.g., reference lookup tables or account tables where balances change on each refresh.

Full Refresh Strategy

Performance hints:
Remove referential integrity (RI) constraints from table definitions for loading operations.
Assume that data cleansing takes place in transformations.
Remove secondary index specifications from the table definition.
Build indices after the table has been loaded.
Make sure target table logging is disabled during loads.

Consider using shadow tables to allow refresh to take place without impacting query workloads:
1. Load the shadow table.
2. Replace-view operation to direct queries to the refreshed table and make the new data visible.
Trades storage for availability.
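The shadow-table-plus-view technique above can be sketched with SQLite. The table and view names are illustrative, and SQLite's DROP VIEW / CREATE VIEW pair stands in for the "replace-view" operation (real warehouses often have an atomic view-replace or synonym switch).

```python
import sqlite3

# Sketch: load into an idle shadow table, then repoint a view so that
# queries see the refreshed data. Names are illustrative.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE accounts_a (id INTEGER, balance REAL);
CREATE TABLE accounts_b (id INTEGER, balance REAL);
INSERT INTO accounts_a VALUES (1, 100.0);            -- current generation
CREATE VIEW accounts AS SELECT * FROM accounts_a;    -- queries use the view
""")

# Step 1: block-load the refreshed data into the idle shadow table.
db.executemany("INSERT INTO accounts_b VALUES (?, ?)",
               [(1, 120.0), (2, 55.0)])

# Step 2: replace the view to make the new data visible to queries.
db.executescript("""
DROP VIEW accounts;
CREATE VIEW accounts AS SELECT * FROM accounts_b;
""")

rows = db.execute("SELECT * FROM accounts ORDER BY id").fetchall()
print(rows)  # [(1, 120.0), (2, 55.0)]
```

On the next refresh the roles of the two tables swap, which is why the slide says the approach trades storage (two copies) for availability.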

Incremental Refresh Strategy

Incrementally load new data into an existing target table that has already been populated by previous loads.

Two primary strategies:
1. Incremental load directly into the target table.
2. Use a shadow table load followed by an insert-select operation into the target table.

Design considerations for incremental load directly into the target table using RDBMS utilities:
Indices should be maintained automatically.
Re-collect statistics if table demographics have changed significantly.
Typically requires a table lock to be taken during the block slamming operation.
Do you want to allow for dirty reads?
Logging behavior differs across RDBMS products.


Incremental Refresh Strategy

Design considerations for the shadow table implementation:
Use block slamming into an empty shadow table having an identical structure to the target table.
Staging space is required for the shadow table.
An insert-select operation from the shadow table to the target table will preserve indices.
Locking will normally escalate to a table-level lock.
Beware of log file size constraints.
Beware of the performance overhead of logging.
Beware of rollbacks if the operation fails for any reason.

Both incremental load strategies described preserve index structures during the loading operation. However, there is a cost to maintaining indexes during the loads:
Rule-of-thumb: each secondary index maintained during the load costs 2-3 times the resources of the actual row insertion of data into the table.
Rule-of-thumb: consider dropping and re-building index structures if the number of rows being incrementally loaded is more than 10% of the size of the target table.
Note: dropping and re-building secondary indices may not be acceptable due to availability requirements of the DW.
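A minimal sketch of strategy 2 (shadow table followed by insert-select) using sqlite3; the table names are illustrative, and a real warehouse would block-slam the shadow table with a bulk utility:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Target table, already populated by previous loads, with a secondary index.
cur.execute("CREATE TABLE sales (day TEXT, store INTEGER, amount REAL)")
cur.execute("CREATE INDEX idx_sales_store ON sales (store)")
cur.execute("INSERT INTO sales VALUES ('2024-01-01', 1, 10.0)")

# Empty shadow table with an identical structure; load the new batch here.
cur.execute("CREATE TABLE sales_shadow (day TEXT, store INTEGER, amount REAL)")
cur.executemany("INSERT INTO sales_shadow VALUES (?, ?, ?)",
                [("2024-01-02", 1, 20.0), ("2024-01-02", 2, 5.0)])

# Insert-select into the target preserves its index structures.
cur.execute("INSERT INTO sales SELECT * FROM sales_shadow")
cur.execute("DELETE FROM sales_shadow")  # the shadow is staging space only

print(cur.execute("SELECT COUNT(*) FROM sales").fetchone()[0])  # 3
```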


Trickle Feed

Acquire data on a continuous basis into the RDBMS using row-level SQL insert and update operations. This typically relies on Enterprise Application Integration (EAI) for data delivery.

A tradeoff exists between data freshness and insert efficiency:
Data is made available to the DW immediately, rather than waiting for batch loading to complete.
There is much higher overhead for data acquisition on a per-record basis as compared to batch strategies.
Row-level locking mechanisms allow queries to proceed during data acquisition.

Buffering rows for insertion allows for fewer round trips to the RDBMS, but waiting to accumulate rows into the buffer impacts data freshness.
Suggested approach: use a threshold that buffers up to M rows, but never waits more than N seconds before sending a buffer of data for insertion.
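The M-rows-or-N-seconds threshold can be sketched as a small buffering class. The names are illustrative, and `flush` stands in for the actual batched insert to the RDBMS:

```python
import time

class TrickleBuffer:
    """Buffer rows until either max_rows accumulate or max_wait seconds pass."""

    def __init__(self, max_rows, max_wait, flush):
        self.max_rows, self.max_wait, self.flush = max_rows, max_wait, flush
        self.rows, self.first_at = [], None

    def add(self, row):
        if not self.rows:
            self.first_at = time.monotonic()
        self.rows.append(row)
        self._maybe_flush()

    def _maybe_flush(self):
        too_many = len(self.rows) >= self.max_rows
        too_old = self.rows and time.monotonic() - self.first_at >= self.max_wait
        if too_many or too_old:
            self.flush(self.rows)       # one round trip for the whole batch
            self.rows, self.first_at = [], None

batches = []
buf = TrickleBuffer(max_rows=3, max_wait=60.0, flush=batches.append)
for row in range(7):
    buf.add(row)
print(batches)  # [[0, 1, 2], [3, 4, 5]] -- row 6 is still waiting in the buffer
```

A production version would also flush on a timer so a quiet feed cannot hold rows past N seconds.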

ETL versus ELT

There are two fundamental approaches to data acquisition:
ETL (extract, transform, load) performs the transform operations prior to loading data into the RDBMS. Transformation takes place on a transformation server, using either an engine or generated code.
ELT (extract, load, transform) performs the data transformations in the relational database on the data warehouse server.
Of course, hybrids are also possible.

ETL Processing

1. Extract data from the source systems.
2. Transform data into a form consistent with the target tables.
3. Load the data into the target tables (or to shadow tables).


ETL Processing

ETL processing is typically performed using resources on the source systems platform(s) or a dedicated transformation server.

Perform the transformations on the source system platform if available resources exist and significant data reduction can be achieved during the transformations.

Perform the transformations on a dedicated transformation server if the source systems are highly distributed, lack capacity, or have a high cost per unit of computing.

(Figure: Source Systems with pre-transformations feed a Transformation Server, which feeds the Data Warehouse.)


ELT Processing

First, load the raw data into empty tables using RDBMS block slamming utilities. The DW server is the transformation server for ELT processing.
Next, use SQL to transform the raw data into a form appropriate to the target tables. Ideally, the SQL is generated using a metadata-driven tool rather than hand coding.
Finally, use insert-select into the target table for incremental loads, or view switching if a full refresh strategy is used.

(Figure: Source Systems write files that are loaded over a network or channel into the Data Warehouse via Teradata FastLoad.)

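The load-then-transform flow can be sketched with sqlite3 standing in for the warehouse RDBMS. The staging/target table names and the cleansing rule are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Step 1 (Load): raw data block-slammed into an empty staging table as-is.
cur.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT)")
cur.executemany("INSERT INTO raw_orders VALUES (?, ?)",
                [("A-001", " 10.50"), ("A-002", "7.25 "), ("A-003", "bad")])

# Step 2 (Transform): SQL inside the database reshapes raw rows for the target.
cur.execute("CREATE TABLE orders (order_id TEXT, amount REAL)")
cur.execute("""
    INSERT INTO orders
    SELECT order_id, CAST(TRIM(amount) AS REAL)
    FROM raw_orders
    WHERE TRIM(amount) GLOB '[0-9]*'   -- illustrative cleansing rule
""")

print(cur.execute("SELECT COUNT(*) FROM orders").fetchone()[0])  # 2
```

Because the transform is a set-oriented SQL statement, it scales with the parallel RDBMS, which is exactly the ELT argument made on the next slide.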

ELT Processing

ELT processing obviates the need for a separate transformation server. It assumes that spare capacity exists on the DW server to support the transformation operations.
ELT leverages the built-in scalability and manageability of the parallel RDBMS and hardware platform.
You must allocate sufficient staging area space to support the load of raw data and the execution of the transformation SQL.
It works well only for batch-oriented transforms, because SQL is optimized for set processing.

Bottom Line

ETL is a significant task in any DW deployment.
There are many options for data loading strategies: need to evaluate tradeoffs in performance, data freshness, and compatibility with the source systems environment.
There are many options for ETL/ELT deployment: need to evaluate tradeoffs in where and how transformations should be applied.


Loading Dimensions

Dimension tables are physically built to have minimal sets of components:
The primary key is a single field containing a meaningless unique integer: a surrogate key. The DW owns these keys and never allows any other entity to assign them.
They are de-normalized flat tables: all attributes in a dimension must take on a single value in the presence of the dimension primary key.
They should possess one or more other fields that compose the natural key of the dimension.

The data loading module consists of all the steps required to administer slowly changing dimensions (SCDs) and write the dimension to disk as a physical table in the proper dimensional format, with correct primary keys, correct natural keys, and final descriptive attributes.
Creating and assigning the surrogate keys occurs in this module.
The table is definitely staged, since it is the object to be loaded into the presentation system of the data warehouse.


Loading Dimensions

When the DW receives notification that an existing row in a dimension has changed, there are three types of response:
Type 1
Type 2
Type 3


Type 1 Dimension

Overwrite the old attribute value with the new one; only the current value is kept.

Type 2 Dimension

Create a new dimensional record with a new surrogate key; all history is preserved.

Type 3 Dimension

Create a "previous value" attribute alongside the current value; only the most recent change is kept (rare).


Loading Facts

Fact tables hold the measurements of an enterprise. The relationship between fact tables and measurements is extremely simple: if a measurement exists, it can be modeled as a fact table row, and if a fact table row exists, it is a measurement.

When building a fact table, the final ETL step is converting the natural keys in the new input records into the correct, contemporary surrogate keys.
ETL maintains a special surrogate key lookup table for each dimension. This table is updated whenever a new dimension entity is created and whenever a Type 2 change occurs on an existing dimension entity.
All of the required lookup tables should be pinned in memory so that they can be randomly accessed as each incoming fact record presents its natural keys. This is one of the reasons for making the lookup tables separate from the original data warehouse dimension tables.

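The surrogate key substitution step can be sketched with in-memory dictionaries playing the role of the pinned lookup tables. All names and key values here are illustrative:

```python
# Pinned in-memory lookup tables: natural key -> current surrogate key.
customer_lookup = {"31421": 1552}
product_lookup = {"STD": 1001, "LTD": 1002}

def key_substitute(fact_rows):
    """Replace natural keys in incoming fact rows with surrogate keys."""
    loaded = []
    for cust_nk, prod_nk, amount in fact_rows:
        loaded.append((customer_lookup[cust_nk],
                       product_lookup[prod_nk],
                       amount))
    return loaded

incoming = [("31421", "STD", 1798.00), ("31421", "LTD", 19.95)]
print(key_substitute(incoming))  # [(1552, 1001, 1798.0), (1552, 1002, 19.95)]
```

When a Type 2 change occurs, the ETL updates the lookup dictionary so that subsequent fact rows pick up the new surrogate key.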

Key Building Process

Loading Fact Tables

Managing indexes: indexes are performance killers at load time.
Drop all indexes in pre-load time.
Segregate updates from inserts.
Load the updates.
Rebuild the indexes.

Managing partitions:
Partitions allow a table (and its indexes) to be physically divided into mini-tables for administrative purposes and to improve query performance.
The most common partitioning strategy on fact tables is to partition the table by the date key. Because the date dimension is preloaded and static, you know exactly what the surrogate keys are.
You need to partition the fact table on the key that joins to the date dimension for the optimizer to recognize the constraint.
The ETL team must be advised of any table partitions that need to be maintained.


The Classic Star Schema

A relational model with a one-to-many relationship between dimension tables and the fact table.
A single fact table, with detail and summary data.
The fact table primary key has only one key column per dimension.
Each dimension is a single table, highly denormalized.
Benefits: easy to understand, intuitive mapping between the business entities, easy to define hierarchies, reduces the number of physical joins, low maintenance, very simple metadata.
Drawbacks: summary data in the fact table yields poorer performance for summary levels, and huge dimension tables are a problem.

Need for Aggregates

Sizes of typical tables:
Time dimension: 5 years x 365 days = 1,825 rows
Store dimension: 300 stores reporting daily sales
Product dimension: 40,000 products in each store (about 4,000 sell in each store daily)
Maximum number of base fact table records: 2 billion (lowest level of detail)
Each brand has 500 products
Transactions are stored by product/store/week

A query involving 1 brand, all stores, and 1 year must retrieve and summarize over 7 million fact table rows.


Aggregating Fact Tables

Aggregate fact tables are summaries of the most granular data at higher levels along the dimension hierarchies.

Product hierarchy: Product key, Product, Category, Department
Time hierarchy: Time key, Date, Month, Quarter, Year
Store hierarchy: Store key, Store name, Territory, Region

Base fact table: Product key, Time key, Store key, Unit sales, Sale dollars
Total possible rows = 1825 * 300 * 4000 * 1 = 2 billion

Multiway aggregates hold data values at a higher level, e.g., Territory by Category by Month.


A way of making aggregates


Aggregate fact tables

Store Dimension: STORE KEY, Store Description, City, State, District ID, District Desc., Region ID, Region Desc., Regional Mgr.
Product Dimension: PRODUCT KEY, Product Desc., Brand, Color, Size, Manufacturer
Time Dimension: PERIOD KEY, Period Desc., Year, Quarter, Month, Day, Current Flag, Sequence

Base Fact Table: STORE KEY, PRODUCT KEY, PERIOD KEY, Dollars, Units, Price
District Fact Table (aggregate): District ID, PRODUCT KEY, PERIOD KEY, Dollars, Units, Price
Region Fact Table (aggregate): Region ID, PRODUCT KEY, PERIOD KEY, Dollars, Units, Price

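Building the district-level aggregate from the base fact table can be sketched in SQL (sqlite3 here; the column names follow the slide, simplified, and the data values are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Base fact table at store grain, plus the store dimension with its district.
cur.execute("CREATE TABLE fact (store_key INT, product_key INT, period_key INT,"
            "                   dollars REAL, units INT)")
cur.execute("CREATE TABLE store_dim (store_key INT, district_id INT)")
cur.executemany("INSERT INTO store_dim VALUES (?, ?)", [(1, 10), (2, 10), (3, 20)])
cur.executemany("INSERT INTO fact VALUES (?, ?, ?, ?, ?)",
                [(1, 7, 100, 50.0, 5), (2, 7, 100, 30.0, 3), (3, 7, 100, 20.0, 2)])

# Aggregate fact table: roll stores up to their district.
cur.execute("""
    CREATE TABLE district_fact AS
    SELECT s.district_id, f.product_key, f.period_key,
           SUM(f.dollars) AS dollars, SUM(f.units) AS units
    FROM fact f JOIN store_dim s ON f.store_key = s.store_key
    GROUP BY s.district_id, f.product_key, f.period_key
""")

print(cur.execute("SELECT district_id, dollars FROM district_fact"
                  " ORDER BY district_id").fetchall())  # [(10, 80.0), (20, 20.0)]
```

A region-level aggregate would be built the same way, grouping by region instead of district.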

Families of Stars

Snowflake Schema

A snowflake schema is a type of star schema, but a more complex model.
Snowflaking is a method of normalizing the dimension tables in a star schema.
The normalization eliminates redundancy, but the result is more complex queries and reduced query performance.
Reasons for snowflaking:
To save storage space
To optimize some specific queries (for attributes with low cardinality)


Snowflake Schema

The attributes with low cardinality in each original dimension table are removed to form separate tables. These new tables are linked back to the original dimension table through artificial keys.

Example: a Product table (Product key, Product name, Product code, Brand key) links to a Brand table (Brand key, Brand name, Category key), which links to a Category table (Category key, Product category).


Snowflake schema advantages:
Small saving in storage space
Normalized structures are easier to update and maintain

Snowflake schema disadvantages:
The schema is less intuitive, and end-users are put off by the complexity
The ability to browse through the contents is difficult
Query performance degrades because of the additional joins

What is the Best Design?


Performance benchmarking can be used to determine the best design.
Snowflake schema: easier to maintain dimension tables when the dimension tables are very large (reduces overall space). It is not generally recommended in a data warehouse environment.
Star schema: more effective for data cube browsing (fewer joins), which helps query performance.


Where Does OLAP Fit In? (1)

OLAP = On-line analytical processing.
OLAP is a characterization of applications, not a database design technique.
The idea is to provide very fast response time in order to facilitate iterative decision-making.
Analytical processing requires access to complex aggregations (as opposed to record-level access).

Where Does OLAP Fit In? (2)

Information is conceptually viewed as cubes to simplify the way in which users access, view, and analyze data.
Quantitative values are known as facts or measures, e.g., sales $, units sold, etc.
Descriptive categories are known as dimensions, e.g., geography, time, product, scenario (budget or actual), etc.
Dimensions are often organized in hierarchies that represent levels of detail in the data (e.g., UPC, SKU, product subcategory, product category, etc.).


Need for Multidimensional Analysis

A simple analysis: how many units of product A did we sell in the store in Racine, WI?
Typically, decision support requires more complex analyses: how much revenue did the new product X generate during the last three months, broken down by individual months, in the Southern Region, by individual stores, broken down by the promotions, compared to estimates, and compared to the previous version of the product?

Different Ways of Analysis

Roll-ups to provide summaries and aggregates along the hierarchies of the dimensions.
Drill-downs from the top level to the lowest along the hierarchies of the dimensions.
Calculations involving facts and metrics.
Algebraic equations involving key performance indicators.
Moving averages and growth percentages.
Trend analyses using statistical methods.


OLTP vs OLAP

OLAP Features

The name On-Line Analytical Processing was coined in a paper by E.F. Codd in 1993 ("Providing OLAP to User-Analysts: An IT Mandate").
A definition: OLAP is a category of software technology that enables analysts, managers, and executives to gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user.

Dimensional Analysis (1)

Dimensional Analysis (2)

Some Queries

Display the total sales of all products for the past five years in all stores.
Compare total sales for all stores, product by product, between years 2000 and 1999.
Show a comparison of sales by individual stores, product by product, between years 2000 and 1999, only for those products with reduced sales.
Show the results of the previous queries, but rotating the columns with the rows.

Hypercubes

Multi-dimension cubes are hard to visualize and display beyond three dimensions.
A multi-dimensional domain structure (MDS) represents each dimension as a line showing its values.
A multidimensional database (MDD) is a computer software system designed to allow for the efficient and convenient storage and retrieval of large volumes of data that is (1) intimately related and (2) stored, viewed, and analyzed from different perspectives. These perspectives are called dimensions.
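A tiny in-memory cube makes the idea concrete: facts indexed by (model, color) coordinates, which can then be sliced or rolled up along a dimension. The data values are the Gleason dealership sales volumes from the later slide; the helper names are illustrative:

```python
# A 2-D "cube" of sales volumes keyed by (model, color) coordinates.
cube = {
    ("Mini Van", "Blue"): 6, ("Mini Van", "Red"): 5, ("Mini Van", "White"): 4,
    ("Coupe", "Blue"): 3,    ("Coupe", "Red"): 5,    ("Coupe", "White"): 5,
    ("Sedan", "Blue"): 4,    ("Sedan", "Red"): 3,    ("Sedan", "White"): 2,
}

def slice_color(cube, color):
    """Slice: fix one dimension position, yielding a lower-dimensional view."""
    return {model: v for (model, c), v in cube.items() if c == color}

def rollup_model(cube):
    """Roll-up: aggregate away the color dimension, summing the facts."""
    totals = {}
    for (model, _color), v in cube.items():
        totals[model] = totals.get(model, 0) + v
    return totals

print(slice_color(cube, "Blue"))   # {'Mini Van': 6, 'Coupe': 3, 'Sedan': 4}
print(rollup_model(cube))          # {'Mini Van': 15, 'Coupe': 13, 'Sedan': 9}
```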


Relational vs Multi-Dimensional Models

SALES VOLUMES FOR GLEASON DEALERSHIP (relational table):

MODEL          COLOR   SALES VOLUME
MINI VAN       BLUE    6
MINI VAN       RED     5
MINI VAN       WHITE   4
SPORTS COUPE   BLUE    3
SPORTS COUPE   RED     5
SPORTS COUPE   WHITE   5
SEDAN          BLUE    4
SEDAN          RED     3
SEDAN          WHITE   2

(Figure: the same Sales Volumes data as a two-dimensional array, with MODEL (Mini Van, Coupe, Sedan) on one axis and COLOR (Blue, Red, White) on the other.)


Relational vs Multi-Dimensional Models

The multidimensional array structure represents a higher level of organization than the relational table.
Perspectives are embedded directly into the structure in the multidimensional model: all possible combinations of perspectives containing a specific attribute (the color BLUE, for example) line up along the dimension position for that attribute.
In the relational model, perspectives are placed in fields, which tells us nothing about the field contents.

MDS

Display of Hypercubes

Drill-Down and Roll-Up

Slice or Rotation

Also referred to as data slicing: each rotation yields a different slice, a two-dimensional table of data.
(Figure: the Sales Volumes cube rotated 90 degrees, exchanging the MODEL and COLOR axes between View #1 and View #2.)

Dice or Range

Also referred to as data dicing: the end user selects the desired positions along each dimension, and the data is scoped down to a subset grouping.
(Figure: the cube diced to models Mini Van and Coupe, colors Normal Blue and Metal Blue, and dealerships Carr and Clyde.)

Slice-and-Dice or Rotation

MOLAP Implementations

OLAP has historically been implemented through the use of multi-dimensional databases (MDDs).
Dimensions are key business factors for analysis:
geographies (zip, state, region, ...)
products (item, product category, product department, ...)
dates (day, week, month, quarter, year, ...)
Very high performance is achieved via fast look-up into the cube data structure to retrieve pre-calculated results.
Cube data structures allow pre-calculation of aggregate results for each possible combination of dimensional values.
An application programming interface (API) is used for access via front-end tools.

MOLAP Implementations

Need to consider both maintenance and storage implications when designing a strategy for when to build cubes.
Maintenance considerations: every data item received into the MDD must be aggregated into every cube (assuming to-date summaries are maintained).
Storage considerations: although cubes get much smaller (i.e., more dense) as dimensions get less detailed (e.g., year vs. day), the storage implications of building hundreds of cubes can be significant.
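Pre-calculating an aggregate for every combination of dimensional values can be sketched with itertools; the dimension levels and fact rows here are illustrative:

```python
from itertools import product

# Fact rows: (state, month, sales).
facts = [("WI", "Jan", 10), ("WI", "Feb", 20), ("IL", "Jan", 5), ("IL", "Feb", 15)]

states = ["WI", "IL"]
months = ["Jan", "Feb"]

# Pre-calculate one cell per (state, month) combination, plus "ALL" roll-ups,
# so queries become simple look-ups instead of scans.
cube = {}
for s, m in product(states + ["ALL"], months + ["ALL"]):
    cube[(s, m)] = sum(v for fs, fm, v in facts
                       if s in (fs, "ALL") and m in (fm, "ALL"))

print(cube[("WI", "ALL")])   # 30
print(cube[("ALL", "ALL")])  # 50
```

The combinatorial growth of this loop is exactly the maintenance and storage concern raised above: every incoming fact touches every pre-computed cell it contributes to.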


Virtual Cubes

Virtual cubes are used when there is a need to join information from two dissimilar cubes that share one or more common dimensions.
Similar to a relational view: two (or more) cubes are linked along their common dimension(s).
Often used to save space by eliminating redundant storage of information.

Partitioned Cubes

One logical cube of data can be spread across multiple physical cubes on separate (or the same) servers.
The divide-and-conquer approach of partitioned cubes helps to mitigate the scalability limitations of a MOLAP environment.
Ideal cube partitioning is completely invisible to end users.


MOLAP vs ROLAP


Bottom Line

There are many implementation techniques for delivery of an OLAP environment.
You must fully consider the performance, scalability, complexity, and flexibility characteristics when deciding between MOLAP and ROLAP.
Understand your tools and RDBMS!


Midterm Review

Data Warehouse:
Corporate/enterprise-wide
Union of all data marts
Data received from the staging area
Queries on the presentation resource
Structure for the corporate view of data
Organized on the E-R model

Data Mart:
Departmental
A single business process
STAR join (facts and dimensions)
Technology optimal for data access and analysis
Structure to suit the departmental view of data

The Dimensional Data Model:
Contains the same information as the normalized model
Has far fewer tables
Grouped in coherent business categories
Pre-joins hierarchies and lookup tables, resulting in fewer join paths and fewer intermediate tables
Normalized fact table with denormalized dimension tables

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Midterm Review

A surrogate key is a unique identifier for data warehouse records that replaces source primary keys (business/natural keys). Surrogate keys:
Protect against changes in source systems
Allow integration from multiple sources
Enable rows that do not exist in source data
Track changes over time (e.g., new customer instances when addresses change)
Replace text keys with integers for efficiency

Bus Architecture

An architecture that permits aggregating data across multiple marts:
Conformed dimensions and attributes
Bus matrix


Midterm Review

Slowly Changing Dimensions: attributes in a dimension that change more slowly than the fact granularity.
Type 1: Current only / overwrite the old value
Type 2: All history / create a new dimensional record
Type 3: Most recent few (rare) / create a "previous value" attribute
Note: rapidly changing dimensions usually indicate the presence of a business process that should be tracked as a separate dimension or as a fact table.
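A Type 2 response can be sketched as a function over an in-memory dimension table. The row layout, attribute names, and the `current` flag are illustrative:

```python
def apply_type2_change(dim_rows, natural_key, new_attrs, next_key):
    """Expire the current row for natural_key and add a new row with a new
    surrogate key, preserving full history (a Type 2 SCD response)."""
    for row in dim_rows:
        if row["natural_key"] == natural_key and row["current"]:
            row["current"] = False          # expire, but keep, the old row
    dim_rows.append({"surrogate_key": next_key, "natural_key": natural_key,
                     "current": True, **new_attrs})

customers = [{"surrogate_key": 1552, "natural_key": "31421",
              "current": True, "city": "Racine"}]
apply_type2_change(customers, "31421", {"city": "Madison"}, next_key=2387)

print([(r["surrogate_key"], r["city"], r["current"]) for r in customers])
# [(1552, 'Racine', False), (2387, 'Madison', True)]
```

Facts loaded after the change pick up surrogate key 2387, while facts already loaded keep pointing at 1552, which is how history is preserved.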

Fact Table - incoming data:

Date       CustKey  ProdKey  Item Count  Amount
1/1/2004   31421    STD      ...         1,798.00
1/2/2004   31421    STD      ...         27.95
1/3/2004   31421    STD      ...         320.26
1/2/2006   31421    LTD      ...         19.95

Fact Table - loaded, with natural keys replaced by surrogate keys (CustKey 31421 maps to 1552, and to 2387 for the 2006 row after a Type 2 change; ProdKey STD maps to 1001 and LTD to 1002):

Date  CustKey  ProdKey  Item Count  Amount
...   1552     1001     ...         1,798.00
...   1552     1001     ...         27.95
...   1552     1001     ...         320.26
731   2387     1002     ...         19.95

Data Extraction and Data Capture

Data extraction: data in operational systems exists as current values or periodic status. Data extraction types:

As-Is (Static) Data Capture
Capture of data at a given point in time (taking a snapshot of the relevant source data)
Primarily for the initial load of data to the DW
Full refresh of dimensional tables

Incremental Data Capture (data of revisions)
For revisions since the last time data was captured
Can be immediate or deferred

Immediate Data Extraction: data extraction is real-time.
a. Capture through transaction logs (replication technology)
b. Capture through database triggers
c. Capture in source applications

Deferred Data Extraction: data is not captured in real-time.
a. Capture based on date and time stamp
b. Capture by comparing files

Immediate Data Extraction

Data Transformation

Basic tasks:
Selection
Splitting/Joining
Conversion
Summarization
Enrichment

Major types:
Format revisions
Decoding of fields
Calculated and derived values
Splitting of single fields
Merging of information
Character set conversions and unit of measure conversions
Date/time conversions
Key restructuring
De-duplication
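A few of the transformation types above (decoding of fields, splitting of a single field, de-duplication) can be sketched as simple record-level functions. The field names and code values are illustrative:

```python
STATUS_CODES = {"A": "Active", "I": "Inactive"}  # decoding of fields

def transform(record):
    """Decode a coded field and split a single 'name' field into two."""
    first, _, last = record["name"].partition(" ")
    return {"first_name": first, "last_name": last,
            "status": STATUS_CODES[record["status"]]}

def deduplicate(records, key):
    """Keep only the first occurrence of each key value."""
    seen, out = set(), []
    for r in records:
        if r[key] not in seen:
            seen.add(r[key])
            out.append(r)
    return out

rows = [{"name": "Ada Lovelace", "status": "A"},
        {"name": "Ada Lovelace", "status": "A"}]
cleaned = deduplicate([transform(r) for r in rows], key="last_name")
print(cleaned)  # [{'first_name': 'Ada', 'last_name': 'Lovelace', 'status': 'Active'}]
```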

Deferred Data Extraction

Transformation for Dimension Attributes

Applying Data

Data loading: take the prepared data, apply it to the DW, and store it in the DB.
Different ways of moving data:
Initial load
Incremental load
Full refresh
General methods for applying data:
Writing special load programs
Load utilities of DBMSs

Loading data in dimension tables: dimension records are loaded first.
Data loads in fact tables: create the concatenated key for each fact table record from the dimension records.
History loads for fact tables; incremental loads for fact tables.

Aggregating Fact Tables / Aggregate Fact Tables

(Review: these slides repeat the earlier "Aggregating Fact Tables" and "Aggregate fact tables" material: summaries of the most granular data at higher levels along the dimension hierarchies, with District and Region aggregate fact tables derived from the base fact table.)


3-Tier Data Warehouse Architecture (by Ms. Subha)

Data warehouses often adopt a 3-tier architecture:
1. Bottom tier: data warehouse server
2. Middle tier: OLAP server
3. Top tier: front-end tools

Bottom Tier
The warehouse database server (mostly a relational database system).
Data is fed using back-end tools and utilities (extract, clean, transform, load, and refresh).
Data is extracted using programs called gateways (ODBC, JDBC).
It also contains the metadata repository.

Middle Tier
An OLAP server, typically implemented using either ROLAP or MOLAP:
ROLAP (relational OLAP model): an extended relational DBMS that maps operations on multidimensional data to standard relational operations.
MOLAP (multidimensional OLAP model): a special-purpose server that directly implements multidimensional data and operations.

Top Tier
A front-end client layer, which contains query and reporting tools, analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).
