
Inmon's Definition of a DW

A data warehouse is a
subject-oriented,
integrated,
nonvolatile, and
time-variant
collection of data in support of management's decisions.

According to this definition:
The data warehouse contains granular corporate data.
The form of the stored data (RDBMS, flat file) has nothing to do with whether something is a data warehouse.
Data warehousing is not necessarily for the needs of "decision makers" or used in the process of decision making.

R. Kimball's definition of a DW

A data warehouse is a copy of transactional data specifically structured for querying and analysis.

BITS Pilani, Pilani Campus

Operational vs DW

Operational system (OLTP)
Systems that support day-to-day operations
These systems get data into the DB
Ex: Take an order, process a claim, make a shipment, generate an invoice, etc.

Data Warehouse system (OLAP)
Systems that support strategic decisions
These systems get data out of the DB
Ex: Show top-selling products, show problem regions, show the highest margins, alert on thresholds.
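The contrast above can be sketched with an in-memory database: the OLTP side inserts individual transactions ("data into the DB"), while the OLAP side pulls an aggregate back out ("data out of the DB"). Table and product names are illustrative, not from the slides.

```python
import sqlite3

# Hypothetical order-entry table; names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, product TEXT, amount REAL)")

# OLTP: get data INTO the DB, one transaction at a time (take an order).
conn.execute("INSERT INTO orders VALUES (1, 'widget', 25.0)")
conn.execute("INSERT INTO orders VALUES (2, 'gadget', 40.0)")
conn.execute("INSERT INTO orders VALUES (3, 'widget', 30.0)")
conn.commit()

# OLAP-style: get data OUT of the DB, aggregated for analysis
# (show top-selling products).
top = conn.execute(
    "SELECT product, SUM(amount) AS total FROM orders "
    "GROUP BY product ORDER BY total DESC"
).fetchall()
print(top)  # [('widget', 55.0), ('gadget', 40.0)]
```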

Subject-Oriented Data Collections

Classical operational systems are organized around the applications of the company. For an insurance company, the applications may be auto, health, life, and casualty. The major subject areas of the insurance corporation might be customer, policy, premium, and claim. For a manufacturer, the major subject areas might be product, order, vendor, bill of material, and raw goods. For a retailer, the major subject areas may be product, SKU, sale, vendor, and so forth. Each type of company has its own unique set of subjects.

Integrated Data Collections

Of all the aspects of a data warehouse, integration is the most important. Data is fed from multiple disparate sources into the data warehouse. As the data is fed it is converted, reformatted, resequenced, summarized, and so forth. The result is that data, once it resides in the data warehouse, has a single physical corporate image.

Non-volatile Data Collections

Data is updated in the operational environment as a regular matter of course, but warehouse data exhibits a very different set of characteristics. Data warehouse data is loaded (usually en masse) and accessed, but it is not updated (in the general sense). Instead, when data in the data warehouse is loaded, it is loaded in a snapshot, static format. When subsequent changes occur, a new snapshot record is written. In doing so a history of data is kept in the data warehouse.

Time-variant Data Collections

Operational:
Time horizon 1-2 years.
Update of records
Key structure may/may not contain an element of time

Data Warehouse:
Time horizon 5-15 years.
Sophisticated snapshots of data
Key structure contains an element of time

Time variancy implies that every unit of data in the data warehouse is accurate as of some one moment in time. In some cases, a record is time-stamped. In other cases, a record has a date of transaction. But in every case, there is some form of time marking to show the moment in time during which the record is accurate. A 1-to-2-year time horizon is normal for operational systems; a 5-to-15-year time horizon is normal for the data warehouse. As a result of this difference in time horizons, the data warehouse contains much more history than any other environment.

Operational Data Store (ODS)

The Operational Data Store is used for tactical decision making while the DW supports strategic decisions. It contains transaction data, at the lowest level of detail for the subject area.
subject-oriented, just like a DW
integrated, just like a DW
volatile (or updateable), unlike a DW
an ODS is like a transaction processing system
information gets overwritten with updated data
no history is maintained (other than audit trail) or operational history
current, i.e., not time-variant, unlike a DW
current data, up to a few years

Data Warehouses and Data Marts

A data warehouse is a central repository for all or significant parts of the data that an enterprise's various business systems collect. It enables strategic decision making.
A data mart is a repository of data gathered from operational data and other sources that is designed to serve a particular community of knowledge workers. In scope, the data may derive from an enterprise-wide database or data warehouse or be more specialized. The emphasis of a data mart is on meeting the specific demands of a particular group of knowledge users in terms of analysis, content, presentation, and ease-of-use. Users of a data mart can expect to have data presented in terms that are familiar.
In practice, the terms data mart and data warehouse each tend to imply the presence of the other in some form. However, most writers using the term seem to agree that the design of a data mart tends to start from an analysis of user needs and that a data warehouse tends to start from an analysis of what data already exists and how it can be collected in such a way that the data can later be used. A data warehouse is a central aggregation of data (which can be distributed physically); a data mart is a data repository that may derive from a data warehouse or not and that emphasizes ease of access and usability for a particular designed purpose.
A data warehouse tends to be a strategic but somewhat unfinished concept; a data mart tends to be tactical and aimed at meeting an immediate need.

The goals of a Data Warehouse

"We have mountains of data in this company, but we can't access it."
"We need to slice and dice the data every which way."
"You've got to make it easy for business people to get at the data directly."
"Just show me what is important."
"It drives me crazy to have two people present the same business metrics at a meeting, but with different numbers."
"We want people to use information to support more fact-based decision making."

The data warehouse must make an organization's information easily accessible.
The data warehouse must present the organization's information consistently.
The data warehouse must be adaptive and resilient to change.
The data warehouse must be a secure bastion that protects our information assets.
The data warehouse must serve as the foundation for improved decision making.
The business community must accept the data warehouse if it is to be deemed successful.

Data Warehouse Architecture

Requirements for DW

Security Requirements
a paradox:
Data Warehouse: publish data widely
Security: restrict data to those with a need to know
role-based security at the final applications (not grant or revoke at the DBMS level)
security for developers (separate subnet), backups (tapes, disks)

Requirements for DW (contd)

Data Integration
at the core of the IT business, aka the 360-degree view of the business
specific to Data Warehouses: establishing common attributes (conformed dimensions), agreeing on common business metrics (conformed facts) so that one can perform mathematical calculations (differences, ratios, etc.)

Data Latency
how quickly to deliver data to the end user
improvements with algorithms, parallel processing, streaming

Archiving
change calculations
legal compliance lineage requirements

End User
reports, OLAP, data handoff

Technical Requirements for DW

Architecture
ETL tool versus hand coding
batch updates versus data streaming
horizontal (orders/shipments) versus vertical (customers/orders) task dependency
scheduler automation
quality handling/data cleansing
metadata
security
staging

Data Warehousing
BITS Pilani

July 26, 2014

Data Warehouse Design

Start with the ER-diagram that represents the corporate data model, or with one or more operational data models to be integrated
Remove data used purely in the operational environment
Enhance key structures with an element of time
Add derived (calculated) data (i.e., summaries)
Turn relationships of the ER model into artifacts in the data warehouse

Design Techniques:
Merging Tables

Many tables imply much dynamic I/O
Merging many tables together makes access faster

Design Techniques:
Introduction of Redundant Data

Design Technique:
Separation of Data when there is a disparity of probability of access

Design Technique:
Introduce Derived Data

Design Techniques:
Creative Indexes

Calculated once
Forever available

The top 10 customers
The average transaction value for this extract
The largest transaction
The number of customers who showed activity without purchasing.

Design Technique:
Forget Referential Integrity

In the operational environment, referential integrity appears as a dynamic link among tables of data.
Not in a data warehouse, because:
the volume of data is too large
the data warehouse is not updated, just appended to
the warehouse represents data over time, and relationships do not remain static
Relationships of data are represented by an artifact in the data warehouse environment. Therefore, some data will be duplicated, and some data will be deleted when other data is still in the warehouse. In any case, trying to replicate referential integrity in the data warehouse environment is a patently incorrect approach.

Dimensional Modeling

Data modeling for Data Warehouses
Based on
fact tables
dimension tables

Fact Tables

Represent a business process, i.e., model the business process as an artifact in the data model
contain the measurements or metrics or facts of business processes
"monthly sales number" in the Sales business process
most are additive (sales this month), some are semi-additive (balance as of), some are not additive (unit price)
the level of detail is called the grain of the table
contain foreign keys for the dimension tables

Dimension Tables

Represent the who, what, where, when and how of a measurement/artifact
Represent real-world entities, not business processes
Give the context of a measurement (subject)
For example, for the Sales fact table, the characteristics of the 'monthly sales number' measurement can be a Location (Where), Time (When), Product Sold (What).
The Dimension Attributes are the various columns in a dimension table. In the Location dimension, the attributes can be Location Code, State, Country, Zip code. Generally the Dimension Attributes are used in report labels, and in query constraints such as where Country='USA'. The dimension attributes also contain one or more hierarchical relationships.
Before designing your data warehouse, you need to decide what this data warehouse contains. Say you want to build a data warehouse containing monthly sales numbers across multiple store locations, across time and across products; then your dimensions are:
Location
Time
Product

Possible OLTP Location Design


Location Dimension

Dim_id  Loc_cd  Name          State_NM          Country_NM
1001    IL01    Chicago Loop  Illinois          USA
1002    IL02    Arlington     Illinois          USA
1003    NY01    Brooklyn      New York          USA
1004    TO01    Toronto       Ontario           Canada
1005    MX01    Mexico City   Distrito Federal  Mexico

With the OLTP design, in order to query for all locations that are in country 'USA' we will have to join three tables:

SELECT *
FROM Locations, States, Countries
WHERE Locations.State_Id = States.State_Id
AND Locations.Country_Id = Countries.Country_Id
AND Country_Name = 'USA'

With the Location dimension, the same query needs only one table:

SELECT *
FROM Location_dim
WHERE Country_Name = 'USA'

Note the redundancy.
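A minimal sketch of the point above, using small illustrative tables in SQLite: the normalized design needs a three-way join to answer the country question, while the denormalized dimension answers it with a single-table scan.

```python
import sqlite3

# Sketch of the slide's two designs; the data is illustrative.
db = sqlite3.connect(":memory:")
db.executescript("""
-- OLTP (normalized) design: three tables
CREATE TABLE Countries (Country_Id INTEGER, Country_Name TEXT);
CREATE TABLE States    (State_Id INTEGER, State_Name TEXT);
CREATE TABLE Locations (Loc_cd TEXT, Name TEXT, State_Id INTEGER, Country_Id INTEGER);
-- DW design: one denormalized dimension
CREATE TABLE Location_dim (Dim_id INTEGER, Loc_cd TEXT, Name TEXT,
                           State_NM TEXT, Country_Name TEXT);
INSERT INTO Countries VALUES (1,'USA'),(2,'Canada');
INSERT INTO States VALUES (10,'Illinois'),(20,'Ontario');
INSERT INTO Locations VALUES ('IL01','Chicago Loop',10,1),('TO01','Toronto',20,2);
INSERT INTO Location_dim VALUES (1001,'IL01','Chicago Loop','Illinois','USA'),
                                (1004,'TO01','Toronto','Ontario','Canada');
""")

# Normalized: a three-way join is needed to filter by country.
joined = db.execute("""
    SELECT Locations.Name FROM Locations, States, Countries
    WHERE Locations.State_Id = States.State_Id
      AND Locations.Country_Id = Countries.Country_Id
      AND Country_Name = 'USA'
""").fetchall()

# Denormalized dimension: a single-table scan answers the same question.
dim = db.execute(
    "SELECT Name FROM Location_dim WHERE Country_Name='USA'"
).fetchall()
print(joined, dim)  # both return [('Chicago Loop',)]
```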

Time Dimension

Dim_id  Month  Quarter  Year
1001    Jan    Q1       2005
1002    Feb    Q1       2005
1003    Mar    Q1       2005
1004    Apr    Q2       2005
1005    May    Q2       2005

(The full table also carries MonthName and QuarterName columns.)

Product Dimension

Prod_id  Prod_cd  Name                   Category
1001     STD      Short-Term-Disability  Disability
1002     LTD      Long-Term Disability   Disability
1003     GUL      Group Universal Life   Life
1004     PA       Personal Accident      Accident
1005     VADD     Voluntary Accident     Accident

Not as trivial as it seems ;-)

Star Schemas

Select the measurements:
SELECT P.Category, SUM(F.Sales)
Join the FACT table with the Dimensions:
FROM Sales F, Time T, Product P, Location L
WHERE F.TM_Dim_Id = T.Dim_Id
AND F.PR_Dim_Id = P.Dim_Id
AND F.LOC_Dim_Id = L.Dim_Id
Constrain the Dimensions:
AND T.Month='Jan' AND T.Year='2003' AND L.Country_Name='USA'
'Group by' for the aggregation level:
GROUP BY P.Category

Advantages:
- easy to understand
- better performance
- extensible

Snow-flake Schemas

Used if we did not de-normalize the dimensions.
Rule of thumb: don't use them.
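The annotated query above can be run end to end on a toy star schema. The rows, keys, and sales figures below are invented for illustration; only the query shape comes from the slide.

```python
import sqlite3

# Minimal star schema matching the slide's query shape (illustrative data).
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE Time    (Dim_Id INTEGER, Month TEXT, Year TEXT);
CREATE TABLE Product (Dim_Id INTEGER, Name TEXT, Category TEXT);
CREATE TABLE Location(Dim_Id INTEGER, Country_Name TEXT);
CREATE TABLE Sales   (TM_Dim_Id INTEGER, PR_Dim_Id INTEGER,
                      LOC_Dim_Id INTEGER, Sales REAL);
INSERT INTO Time VALUES (1,'Jan','2003'),(2,'Feb','2003');
INSERT INTO Product VALUES (10,'LeapPad','Education'),(11,'Walkie','Toy');
INSERT INTO Location VALUES (100,'USA'),(101,'Canada');
INSERT INTO Sales VALUES (1,10,100,50.0),(1,11,100,30.0),
                         (1,10,100,20.0),(2,10,100,99.0),(1,10,101,7.0);
""")

# Join fact to dimensions, constrain the dimensions, aggregate by category.
rows = db.execute("""
    SELECT P.Category, SUM(F.Sales)
    FROM Sales F, Time T, Product P, Location L
    WHERE F.TM_Dim_Id = T.Dim_Id
      AND F.PR_Dim_Id = P.Dim_Id
      AND F.LOC_Dim_Id = L.Dim_Id
      AND T.Month='Jan' AND T.Year='2003' AND L.Country_Name='USA'
    GROUP BY P.Category
""").fetchall()
print(rows)  # one total per category for Jan 2003 in the USA
```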

Dimensional Modeling Steps

Identify the business process
Identify the level of detail needed (grain)
Identify the dimensions
Identify the facts

Data Warehouses and Data Marts

A data warehouse is a central repository for all or significant parts of the data that an enterprise's various business systems collect. It enables strategic decision making.
Enterprise-wide

A data mart is a repository of data gathered from operational data and other sources that is designed to serve a particular community of knowledge workers. In scope, the data may derive from an enterprise-wide database or data warehouse or be more specialized.
The emphasis of a data mart is on meeting the specific demands of a particular group of knowledge users in terms of analysis, content, presentation, and ease-of-use.
Departmental

DW and Data Marts

Data Warehouse:
Corporate/Enterprise-wide
Union of all data marts
Data received from staging area
Queries on presentation resource
Structure for corporate view of data
Organized on E-R Model.

Data Mart:
Departmental
A single business process
STAR join (facts and dimensions)
Technology optimal for data access and analysis
Structure to suit the departmental view of data

Top-down vs Bottom-up Approach

Top-down approach:
Bill Inmon
Normalized data model
Enterprise view of data
Single, central storage of data
Takes longer to build
High exposure to risk and failure.

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Bottom-up approach:
Ralph Kimball
De-normalized data model
Collection of conformed data marts which gives an enterprise view
Inherently incremental
Less risk of failure, and allows the project team to learn and grow.


Data warehouse
Architecture types

Centralized Data Warehouse
(Source Data -> Data Staging -> DSO/ODS -> Reports/Queries)
Normalized data in third normal form.
Summarized data at DSO/ODS level.
Queries/Reports access the central DW.
There are no separate data marts.

Independent Data Marts
(Source Data -> Data Staging -> Data Marts -> Reports/Queries)
Each data mart in this model serves a particular organizational unit.
Each data mart is independent of one another.
Variances between data marts affect data analysis across data marts.
For example: Sales and Shipments are two independent data marts. Even though sales and shipments are related, in this model it is difficult to analyze sales and shipment data together.

Data warehouse
Architecture types

Hub and Spoke model
(Source Data -> Data Staging -> Data Storage -> Data Marts -> Reports/Queries)
Inmon's CIF (Corporate Information Factory) approach.
Centralized DW in third normal form.
Dependent Data marts obtain data from the Centralized DW.
Each dependent Data mart may have
- normalized
- denormalized
- summarized/dimensional data structures.

Kimball's conformed approach
(Source Data -> Data Staging -> Data Marts -> Reports/Queries)
Business dimensions from the first data mart are shared among other data marts.
Conformed dimensions will give a logical integrated DW with an enterprise view.
Bottom-up approach.

Characteristics of DW/BI

High profile and high impact
High risk
Highly political
Requires sophisticated and complex data gathering
Requires intensive user access, training and support
High maintenance

DW Lifecycle Principles

Focus on the business
Build an information infrastructure
Deliver in meaningful increments: six to twelve month timeframes
Deliver the entire solution: query and display tools in addition to the database

DW Lifecycle

Project Planning -> Business Requirements Definition, then three parallel tracks:
- Technical Architecture Design -> Product Selection & Installation
- Dimensional Modeling -> Physical Design -> ETL Design & Development
- BI Application Specification -> BI Application Development
The tracks converge into Deployment -> Maintenance -> Growth, with Project Management running throughout.

Specialized Roles

Data warehouse DBA
OLAP designer
ETL system developer
DW/BI management tools/Application developer


DW Infrastructure

Issues:
Performance drain on the operating environment
Technical skills of the data warehouse implementers
Operational issues such as funding requirements
Shop standards

Platforms:
Source system
Staging area
Application server
Desktop tools
Database server

DW Infrastructure
Database server

Size: 500 GB to 250 TB
< 500 GB is small
500 GB to 5 TB is medium
over 5 TB is large
Volatility: what is the nature and frequency of the update process
Users: the number of users as well as their level of knowledge
Number of business processes (marts):
In some cases there may be a separate platform for each one.
There may be an additional central server for management roll-ups.
Nature of use:
ad hoc queries from power users
standardized queries
data mining
Technical support: the hardware, the operating system and the database engine may all require specialized support
Software requirements may dictate platform.

DW Infrastructure
Operating systems

Mainframes
transaction oriented
complex administration
not parallel
Open system (UNIX) servers
specialized environment
NT servers
relatively small capacities (limited numbers of processors and less efficient performance)

DW Infrastructure
Hardware

Single-processor-at-a-time system
Symmetric Multiprocessing (SMP)
multiple processors
shared memory
common bus
Massively Parallel Processing (MPP)
multiple processors
distributed memory
distributed bus

DW Infrastructure
Performance

Indexing
Physical organization
Caching and blocking
Data distribution
Memory
Chip architecture

DW Infrastructure
Database Engine

Relational
well understood
includes DW support for star joins and fast access
flexible
Multidimensional (MOLAP)
extremely fast
pre-calculated combination facts

DW Infrastructure
Front Room

Configuration of desk-top
Client/Server
Web
Supplemental tools

DW Infrastructure
Operational management

Load window (operational scheduling)
Backup
User support
Change management


Fact Table

Measurements associated with a specific business process
Grain: level of detail of the table
Process events produce fact records
Facts (attributes) are usually
numeric
additive
Derived facts included
Foreign (surrogate) keys refer to dimension tables (entities)

Dimension Tables

Entities describing the objects of the process
Conformed dimensions - cross processes
Attributes are descriptive
text
numeric
Surrogate keys
1:m with the fact table
Null entries
Date dimensions

Bus Architecture

An architecture that permits aggregating data across multiple marts
Conformed dimensions and attributes
Bus matrix

Keys and Surrogate Keys

A surrogate key is a unique identifier for data warehouse records that replaces source primary keys (business/natural keys)
Protect against changes in source systems
Allow integration from multiple sources
Enable rows that do not exist in source data
Track changes over time (e.g. new customer instances when addresses change)
Replace text keys with integers for efficiency
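The bullet points above can be sketched as a small key generator: natural keys from any source map to stable integer surrogates, and the same natural key always gets the same surrogate back. The class, the starting value, and the example keys are illustrative.

```python
# Sketch: assigning integer surrogate keys to natural (business) keys as
# rows arrive from multiple sources. Names and values are illustrative.
class SurrogateKeyGenerator:
    def __init__(self):
        self._next = 1000          # starting point is arbitrary
        self._map = {}             # natural key -> surrogate key

    def key_for(self, natural_key):
        # Hand out a new integer the first time a natural key is seen;
        # return the same surrogate on every later lookup.
        if natural_key not in self._map:
            self._map[natural_key] = self._next
            self._next += 1
        return self._map[natural_key]

gen = SurrogateKeyGenerator()
a = gen.key_for(("CRM", "CUST-31421"))   # source system + its own key
b = gen.key_for(("BILLING", "31421"))    # another source, another key
c = gen.key_for(("CRM", "CUST-31421"))   # repeat lookup
print(a, b, c)  # 1000 1001 1000
```

Keeping the source system inside the natural key is what lets rows from multiple systems coexist without key collisions.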


Slowly Changing Dimensions

Attributes in a dimension that change more slowly than the fact granularity
Type 1: Current only / overwrite the old value
Type 2: All history / create a new dimensional record
Type 3: Most recent few (rare) / create a previous-value attribute

Note: rapidly changing dimensions usually indicate the presence of a business process that should be tracked as a separate dimension or as a fact table

Dimension with a slowly changing attribute:

CustKey  BKCustID  CustName    CommDist  Gender  HomOwn?
1552     31421     Jane Rider  ...       ...     ...

Fact Table:

Date       CustKey  ProdKey  ItemCount  Amount
1/7/2004   1552     95       ...        1,798.00
3/2/2004   1552     37       ...        27.95
5/7/2005   1552     87       ...        320.26
2/21/2006  2387     42       ...        19.95

After a Type 2 change (two versions of the same customer):

CustKey  BKCustID  CustName    CommDist  Eff       End
1552     31421     Jane Rider  ...       1/7/2004  1/1/2006
2387     31421     Jane Rider  31        1/2/2006  12/31/9999

Slowly Changing Dimensions

Original:
ProductKey  Description  Category   SKU
21553       LeapPad      Education  LP2105

Type 1 (overwrite the attribute):
ProductKey  Description  Category  SKU
21553       LeapPad      Toy       LP2105

Type 2 (add a new record):
ProductKey  Description  Category   SKU
21553       LeapPad      Education  LP2105
44631       LeapPad      Toy        LP2105

Type 3 (previous-value attribute):
ProductKey  Description  Category  OldCat     SKU
21553       LeapPad      Toy       Education  LP2105

Hybrid:
ProductKey  Description  Category   OldCat       SKU
21553       LeapPad      Education  Electronics  LP2105
44631       LeapPad      Toy        Education    LP2105
68122       LeapPad      Education  Electronics  LP2105
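A minimal sketch of the Type 1 and Type 2 mechanics above, reusing the Jane Rider example's keys and dates. The attribute names and the "Jane Smith" change are illustrative assumptions.

```python
from datetime import date

OPEN_END = date(9999, 12, 31)   # sentinel "end of time" as on the slide

def scd_type1(rows, business_key, new_attrs):
    """Type 1: overwrite attributes in place; no history is kept."""
    for row in rows:
        if row["bk"] == business_key:
            row.update(new_attrs)

def scd_type2(rows, business_key, new_attrs, next_key, change_date):
    """Type 2: close the current row and append a new versioned row."""
    for row in rows:
        if row["bk"] == business_key and row["end"] == OPEN_END:
            row["end"] = change_date            # expire the current version
            new_row = dict(row, **new_attrs)    # copy, apply the change
            new_row["key"] = next_key           # fresh surrogate key
            new_row["eff"], new_row["end"] = change_date, OPEN_END
            rows.append(new_row)
            return

dim = [{"key": 1552, "bk": 31421, "name": "Jane Rider",
        "eff": date(2004, 1, 7), "end": OPEN_END}]
scd_type2(dim, 31421, {"name": "Jane Smith"}, next_key=2387,
          change_date=date(2006, 1, 2))
print(len(dim), dim[0]["end"], dim[1]["key"])  # 2 2006-01-02 2387
```

New fact rows would now carry surrogate key 2387, while old fact rows keep 1552, which is exactly how history is preserved.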

Date Dimensions

One row for every day for which you expect to have data for the fact table (perhaps generated in a spreadsheet and imported)
Usually use a meaningful integer surrogate key (such as yyyymmdd: 20060926 for Sep. 26, 2006). Note: this order sorts correctly.
Include rows for missing or future dates to be added later.
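The slide above can be sketched as a small generator. Only a few of the many possible date attributes are included, and the attribute names are illustrative.

```python
from datetime import date, timedelta

def build_date_dim(start, end):
    """One row per day, keyed by a yyyymmdd integer surrogate key."""
    rows, d = [], start
    while d <= end:
        rows.append({
            "DateKey": d.year * 10000 + d.month * 100 + d.day,  # 20060926 style
            "FullDate": d.isoformat(),
            "DayOfWeek": d.strftime("%A"),
            "Quarter": (d.month - 1) // 3 + 1,
        })
        d += timedelta(days=1)
    return rows

dim = build_date_dim(date(2006, 9, 25), date(2006, 9, 27))
print([r["DateKey"] for r in dim])  # [20060925, 20060926, 20060927]
```

As the slide notes, the yyyymmdd integers sort in the same order as the dates themselves.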

Fact Tables

Precalculated summary tables

Transaction

Improve performance
Record data an coarser granularity

Track processes at discrete points in time when they occur

State change summary that has one row per item.

Periodic snapshot

Access rows on each update.

Accumulating snapshot

Cumulative performance over specific time intervals

Constantly updated over time. May include multiple dates representing


stages.

BITS Pilani, Pilani Campus

BITS Pilani, Pilani Campus

Star schema Model

POS FACT table:
DateKey
ProductKey
StoreKey
PromotionKey
POSTransactionNumber
SalesQuantity
SalesDollarAmount
CostDollarAmount
GrossProfitDollarAmount

Dimension tables: DATE (DateKey + attributes), PRODUCT (ProductKey + attributes), STORE (StoreKey + attributes), PROMOTION (PromotionKey + attributes)

Possible Date Attributes
SQL date, Full date description, Day of week, Day of month, Day of calendar year, Day of fiscal year, Month of calendar year, Month of fiscal year, Calendar Quarter, Fiscal Quarter, Fiscal week, Year, Month, Fiscal year, Holiday?, Holiday name, Day of holiday, Weekday?, Selling season, Major event, etc.

Possible Product Attributes
Description, SKU number, Brand description, Department, Package type, Package size, Fat content, Diet type, Weight, Weight units of measure, Storage type, Shelf unit type, Shelf width, Shelf height, Shelf depth, etc.

Possible Store Attributes
Store Name, Store Number, Street address, City, County, State, Zip, Manager, District, Region, Floor plan type, Photo processing type, Financial service type, Square footage, Selling square footage, First open date, Last remodel date, etc.

Conformed Dimensions:
Inventory Snapshot Model

Process: Store inventory
Grain: Daily inventory by product and store
Dimensions: Date, product, store
Fact: quantity-on-hand

Factless Fact Tables

In order to evaluate promotions that might have generated no sales, we need another approach.
Promotion could generate another fact table (or could be considered a fact table in itself). That new fact table would have no additive attributes.

The Bus Matrix

Rows are business processes: Retail Sales, Retail Inventory, Retail Deliveries, Warehouse Inventory, Warehouse Deliveries, Purchase Orders.
Columns are dimensions: Date, Product, Store, Promotion, Warehouse, Vendor, Contract, Shipper.

Dimensional Model

Inventory Fact table:
ProductKey
DateKey
StoreKey
QuantityOnHand
QuantitySold
ValueAtCost
ValueAtSellingPrice

Dimension tables: DATE (DateKey + attributes), PRODUCT (ProductKey + attributes), STORE (StoreKey + attributes)

Note: QuantityOnHand is semi-additive. It is additive across product and store, but not across date. The other attributes are additive.
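The semi-additivity note above can be demonstrated directly: summing the on-hand level across stores for one date is meaningful, while summing it across dates double-counts the same stock. The snapshot rows are invented for illustration, and averaging over time is one common convention for levels, not the only one.

```python
# Sketch: why QuantityOnHand is semi-additive. Data is illustrative.
facts = [  # (date_key, store, product, qty_on_hand)
    (20060925, "S1", "P1", 10), (20060925, "S2", "P1", 4),
    (20060926, "S1", "P1", 8),  (20060926, "S2", "P1", 5),
]

# Additive across stores: total stock of P1 on 2006-09-25.
per_date = sum(q for d, s, p, q in facts if d == 20060925)
print(per_date)  # 14 -- a real, physical quantity

# NOT additive across dates: 10 + 4 + 8 + 5 counts the same stock twice.
naive_total = sum(q for *_, q in facts)

# Instead, average the snapshots over time (one common choice for levels).
dates = {d for d, *_ in facts}
avg_on_hand = naive_total / len(dates)
print(naive_total, avg_on_hand)  # 27 13.5
```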

Data Acquisition from OLTP Systems

Why is it hard?
Multiple source systems technologies.
Inconsistent data representations.
Multiple sources for the same data element.
Complexity of required transformations.
Scarcity and cost of legacy cycles.
Volume of legacy data.

ETL Processing

Operational Data -> Data Transformation -> Enterprise Warehouse and Integrated Data Marts -> Replication -> Dependent Data Marts or Departmental Warehouses -> Business Users

Data Acquisition from OLTP Systems

Many possible source systems technologies:
* Flat files          * Excel      * Model 204
* VSAM                * Access     * DBF Format
* IMS                 * Oracle     * RDB
* IDMS                * Informix   * RMS
* DB2 (many flavors)  * Sybase     * Compressed
* Adabase             * Ingres     * Many others...

Inconsistent data representation: same data, different domain values...
Examples:
Date value representations:
- 1996-02-14
- 02/14/1996
- 14-FEB-1996
- 960214
- 14485
Gender value representations:
- M/F
- M/F/PM/PF
- 0/1
- 1/2
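The representations listed above imply a normalization step in the ETL flow. This sketch handles the four parseable date formats; a day serial like 14485 would additionally need the source system's epoch, which the slide does not give, so it is left unhandled. The 0-to-M, 1-to-F gender mapping is an assumption, since which code means which gender is source-specific.

```python
from datetime import datetime

def normalize_date(value):
    """Map several source date representations onto a single date object.
    The formats are the ones listed on the slide; a day serial (14485)
    would need the source system's epoch and is not handled here."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%d-%b-%Y", "%y%m%d"):
        try:
            return datetime.strptime(value, fmt).date()
        except ValueError:
            pass
    raise ValueError(f"unrecognized date representation: {value!r}")

def normalize_gender(value):
    """Map one source's 0/1 coding onto the warehouse's M/F domain
    (the 0->M, 1->F assignment is an assumption)."""
    return {"M": "M", "F": "F", "0": "M", "1": "F"}[value]

forms = ["1996-02-14", "02/14/1996", "14-FEB-1996", "960214"]
print({normalize_date(v) for v in forms})  # all four map to the same date
```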

Data Acquisition from OLTP Systems

Complexity of required transformations:
Simple scalar transformations.
0/1 => M/F
One-to-many element transformations.
6x30 address field => street1, street2, city, state, zip
Many-to-many element transformations.
Householding and individualization of customer records

Volume of legacy data:
Need lots of processing and I/O to effectively handle large data volumes.
The 2 GB file limit in older versions of UNIX is not acceptable for handling legacy data; need a full 64-bit file system.
Need efficient interconnect bandwidth to transfer large amounts of data from legacy sources.
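The first two transformation classes above can be sketched directly. The 0/1 mapping and the fixed-width 6x30 address layout are assumptions for illustration; householding (the many-to-many case) is far more involved and is not shown.

```python
# Sketch of a scalar recode and a one-to-many field split; the codes,
# field names, and layout are illustrative assumptions.
def recode_gender(code):
    """Simple scalar transformation: 0/1 => M/F (mapping is assumed)."""
    return {"0": "M", "1": "F"}[code]

def split_address(field, width=30):
    """One-to-many transformation: a fixed-width 6x30 address blob
    into named components (assumed line order)."""
    lines = [field[i:i + width].strip() for i in range(0, 6 * width, width)]
    street1, street2, city, state, zipcode, _ = lines
    return {"street1": street1, "street2": street2, "city": city,
            "state": state, "zip": zipcode}

blob = ("123 Main St".ljust(30) + "Suite 4".ljust(30)
        + "Springfield".ljust(30) + "IL".ljust(30)
        + "62701".ljust(30) + "".ljust(30))
print(recode_gender("1"), split_address(blob)["city"])  # F Springfield
```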

Data Acquisition from OLTP Systems

Parallel software and hardware architectures:
Use data parallelism (partitioning) to allow concurrent execution of multiple job streams.
Software architecture must allow efficient re-partitioning of data between steps in the transformation process.
Want powerful parallel hardware architectures with many processors and I/O channels.

ETL Processing

It is important to look at the big picture. Data acquisition time may include:
Extracts from source systems.
Data movement.
Transformations.
Data loading.
Index maintenance.
Statistics collection.
Summary data maintenance.
Data mart construction.
Backups.

Loading Strategies

Once we have transformed data, there are three primary loading strategies:
1. Full data refresh with block slamming into empty tables.
2. Incremental data refresh with block slamming into existing (populated) tables.
3. Trickle feed with continuous data acquisition using row-level insert and update operations.

We must also worry about rolling off old data as its economic value drops below the cost of storing and maintaining it (new data is loaded in as old data is rolled off).

Loading Strategies

Should consider:
Data storage requirements.
Impact on query workloads.
Ratio of existing to new data.
Insert versus update workloads.

Full Refresh Strategy

Completely re-load the table on each refresh.
Step 1: Load table using block slamming.
Step 2: Build indexes.
Step 3: Collect statistics.
This is a good (simple) strategy for small tables or when a high percentage of rows in the data changes on each refresh (greater than 10%), e.g., reference lookup tables or account tables where balances change on each refresh.

Full Refresh Strategy

Performance hints:
Remove referential integrity (RI) constraints from table definitions for loading operations.
Assume that data cleansing takes place in transformations.
Remove secondary index specifications from the table definition.
Build indices after the table has been loaded.
Make sure target table logging is disabled during loads.

Consider using shadow tables to allow refresh to take place without impacting query workloads:
1. Load the shadow table.
2. Replace-view operation to direct queries to the refreshed table and make the new data visible.
Trades storage for availability.
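The shadow-table-plus-view technique above can be sketched with SQLite. The table and view names are illustrative, and SQLite's DROP VIEW / CREATE VIEW pair stands in for the "replace-view" operation (real warehouses often have an atomic view-replace or synonym switch).

```python
import sqlite3

# Sketch: load into an idle shadow table, then repoint a view so that
# queries see the refreshed data. Names are illustrative.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE accounts_a (id INTEGER, balance REAL);
CREATE TABLE accounts_b (id INTEGER, balance REAL);
INSERT INTO accounts_a VALUES (1, 100.0);            -- current generation
CREATE VIEW accounts AS SELECT * FROM accounts_a;    -- queries use the view
""")

# Step 1: block-load the refreshed data into the idle shadow table.
db.executemany("INSERT INTO accounts_b VALUES (?, ?)",
               [(1, 120.0), (2, 55.0)])

# Step 2: replace the view to make the new data visible to queries.
db.executescript("""
DROP VIEW accounts;
CREATE VIEW accounts AS SELECT * FROM accounts_b;
""")

rows = db.execute("SELECT * FROM accounts ORDER BY id").fetchall()
print(rows)  # [(1, 120.0), (2, 55.0)]
```

On the next refresh the roles of the two tables swap, which is why the slide says the approach trades storage (two copies) for availability.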

Incremental Refresh Strategy

Incrementally load new data into an existing target table that has already been populated by previous loads.

Two primary strategies:
1. Incremental load directly into the target table.
2. Use a shadow table load followed by an insert-select operation into the target table.

Design considerations for incremental load directly into the target table using RDBMS utilities:
Indices should be maintained automatically.
Re-collect statistics if table demographics have changed significantly.
Typically requires a table lock to be taken during the block slamming operation.
Do you want to allow for dirty reads?
Logging behavior differs across RDBMS products.


Incremental Refresh Strategy

Design considerations for the shadow table implementation:
Use block slamming into an empty shadow table having an identical structure to the target table.
Staging space is required for the shadow table.
An insert-select operation from the shadow table to the target table will preserve indices.
Locking will normally escalate to a table-level lock.
Beware of log file size constraints.
Beware of the performance overhead of logging.
Beware of rollbacks if the operation fails for any reason.

Both incremental load strategies described preserve index structures during the loading operation. However, there is a cost to maintaining indexes during the loads:
Rule-of-thumb: each secondary index maintained during the load costs 2-3 times the resources of the actual row insertion of data into the table.
Rule-of-thumb: consider dropping and re-building index structures if the number of rows being incrementally loaded is more than 10% of the size of the target table.
Note: dropping and re-building secondary indices may not be acceptable due to availability requirements of the DW.
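A minimal sketch of strategy 2 (shadow table followed by insert-select) using sqlite3; the table names are illustrative, and a real warehouse would block-slam the shadow table with a bulk utility:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Target table, already populated by previous loads, with a secondary index.
cur.execute("CREATE TABLE sales (day TEXT, store INTEGER, amount REAL)")
cur.execute("CREATE INDEX idx_sales_store ON sales (store)")
cur.execute("INSERT INTO sales VALUES ('2024-01-01', 1, 10.0)")

# Empty shadow table with an identical structure; load the new batch here.
cur.execute("CREATE TABLE sales_shadow (day TEXT, store INTEGER, amount REAL)")
cur.executemany("INSERT INTO sales_shadow VALUES (?, ?, ?)",
                [("2024-01-02", 1, 20.0), ("2024-01-02", 2, 5.0)])

# Insert-select into the target preserves its index structures.
cur.execute("INSERT INTO sales SELECT * FROM sales_shadow")
cur.execute("DELETE FROM sales_shadow")  # the shadow is staging space only

print(cur.execute("SELECT COUNT(*) FROM sales").fetchone()[0])  # 3
```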


Trickle Feed

Acquire data on a continuous basis into the RDBMS using row-level SQL insert and update operations. This typically relies on Enterprise Application Integration (EAI) for data delivery.

A tradeoff exists between data freshness and insert efficiency:
Data is made available to the DW immediately, rather than waiting for batch loading to complete.
There is much higher overhead for data acquisition on a per-record basis as compared to batch strategies.
Row-level locking mechanisms allow queries to proceed during data acquisition.

Buffering rows for insertion allows for fewer round trips to the RDBMS, but waiting to accumulate rows into the buffer impacts data freshness.
Suggested approach: use a threshold that buffers up to M rows, but never waits more than N seconds before sending a buffer of data for insertion.
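The M-rows-or-N-seconds threshold can be sketched as a small buffering class. The names are illustrative, and `flush` stands in for the actual batched insert to the RDBMS:

```python
import time

class TrickleBuffer:
    """Buffer rows until either max_rows accumulate or max_wait seconds pass."""

    def __init__(self, max_rows, max_wait, flush):
        self.max_rows, self.max_wait, self.flush = max_rows, max_wait, flush
        self.rows, self.first_at = [], None

    def add(self, row):
        if not self.rows:
            self.first_at = time.monotonic()
        self.rows.append(row)
        self._maybe_flush()

    def _maybe_flush(self):
        too_many = len(self.rows) >= self.max_rows
        too_old = self.rows and time.monotonic() - self.first_at >= self.max_wait
        if too_many or too_old:
            self.flush(self.rows)       # one round trip for the whole batch
            self.rows, self.first_at = [], None

batches = []
buf = TrickleBuffer(max_rows=3, max_wait=60.0, flush=batches.append)
for row in range(7):
    buf.add(row)
print(batches)  # [[0, 1, 2], [3, 4, 5]] -- row 6 is still waiting in the buffer
```

A production version would also flush on a timer so a quiet feed cannot hold rows past N seconds.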

ETL versus ELT

There are two fundamental approaches to data acquisition:
ETL (extract, transform, load) performs the transform operations prior to loading data into the RDBMS. Transformation takes place on a transformation server, using either an engine or generated code.
ELT (extract, load, transform) performs the data transformations in the relational database on the data warehouse server.
Of course, hybrids are also possible.

ETL Processing

1. Extract data from the source systems.
2. Transform data into a form consistent with the target tables.
3. Load the data into the target tables (or to shadow tables).


ETL Processing

ETL processing is typically performed using resources on the source systems platform(s) or a dedicated transformation server.

Perform the transformations on the source system platform if available resources exist and significant data reduction can be achieved during the transformations.

Perform the transformations on a dedicated transformation server if the source systems are highly distributed, lack capacity, or have a high cost per unit of computing.

(Figure: Source Systems with pre-transformations feed a Transformation Server, which feeds the Data Warehouse.)


ELT Processing

First, load the raw data into empty tables using RDBMS block slamming utilities. The DW server is the transformation server for ELT processing.
Next, use SQL to transform the raw data into a form appropriate to the target tables. Ideally, the SQL is generated using a metadata-driven tool rather than hand coding.
Finally, use insert-select into the target table for incremental loads, or view switching if a full refresh strategy is used.

(Figure: Source Systems write files that are loaded over a network or channel into the Data Warehouse via Teradata FastLoad.)

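The load-then-transform flow can be sketched with sqlite3 standing in for the warehouse RDBMS. The staging/target table names and the cleansing rule are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Step 1 (Load): raw data block-slammed into an empty staging table as-is.
cur.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT)")
cur.executemany("INSERT INTO raw_orders VALUES (?, ?)",
                [("A-001", " 10.50"), ("A-002", "7.25 "), ("A-003", "bad")])

# Step 2 (Transform): SQL inside the database reshapes raw rows for the target.
cur.execute("CREATE TABLE orders (order_id TEXT, amount REAL)")
cur.execute("""
    INSERT INTO orders
    SELECT order_id, CAST(TRIM(amount) AS REAL)
    FROM raw_orders
    WHERE TRIM(amount) GLOB '[0-9]*'   -- illustrative cleansing rule
""")

print(cur.execute("SELECT COUNT(*) FROM orders").fetchone()[0])  # 2
```

Because the transform is a set-oriented SQL statement, it scales with the parallel RDBMS, which is exactly the ELT argument made on the next slide.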

ELT Processing

ELT processing obviates the need for a separate transformation server. It assumes that spare capacity exists on the DW server to support the transformation operations.
ELT leverages the built-in scalability and manageability of the parallel RDBMS and hardware platform.
You must allocate sufficient staging area space to support the load of raw data and the execution of the transformation SQL.
It works well only for batch-oriented transforms, because SQL is optimized for set processing.

Bottom Line

ETL is a significant task in any DW deployment.
There are many options for data loading strategies: need to evaluate tradeoffs in performance, data freshness, and compatibility with the source systems environment.
There are many options for ETL/ELT deployment: need to evaluate tradeoffs in where and how transformations should be applied.


Loading Dimensions

Dimension tables are physically built to have minimal sets of components:
The primary key is a single field containing a meaningless unique integer: a surrogate key. The DW owns these keys and never allows any other entity to assign them.
They are de-normalized flat tables: all attributes in a dimension must take on a single value in the presence of the dimension primary key.
They should possess one or more other fields that compose the natural key of the dimension.

The data loading module consists of all the steps required to administer slowly changing dimensions (SCDs) and write the dimension to disk as a physical table in the proper dimensional format, with correct primary keys, correct natural keys, and final descriptive attributes.
Creating and assigning the surrogate keys occurs in this module.
The table is definitely staged, since it is the object to be loaded into the presentation system of the data warehouse.


Loading Dimensions

When the DW receives notification that an existing row in a dimension has changed, there are three types of response:
Type 1
Type 2
Type 3


Type 1 Dimension

Overwrite the old attribute value with the new one; only the current value is kept.

Type 2 Dimension

Create a new dimensional record with a new surrogate key; all history is preserved.

Type 3 Dimension

Create a "previous value" attribute alongside the current value; only the most recent change is kept (rare).


Loading Facts

Fact tables hold the measurements of an enterprise. The relationship between fact tables and measurements is extremely simple: if a measurement exists, it can be modeled as a fact table row, and if a fact table row exists, it is a measurement.

When building a fact table, the final ETL step is converting the natural keys in the new input records into the correct, contemporary surrogate keys.
ETL maintains a special surrogate key lookup table for each dimension. This table is updated whenever a new dimension entity is created and whenever a Type 2 change occurs on an existing dimension entity.
All of the required lookup tables should be pinned in memory so that they can be randomly accessed as each incoming fact record presents its natural keys. This is one of the reasons for making the lookup tables separate from the original data warehouse dimension tables.

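The surrogate key substitution step can be sketched with in-memory dictionaries playing the role of the pinned lookup tables. All names and key values here are illustrative:

```python
# Pinned in-memory lookup tables: natural key -> current surrogate key.
customer_lookup = {"31421": 1552}
product_lookup = {"STD": 1001, "LTD": 1002}

def key_substitute(fact_rows):
    """Replace natural keys in incoming fact rows with surrogate keys."""
    loaded = []
    for cust_nk, prod_nk, amount in fact_rows:
        loaded.append((customer_lookup[cust_nk],
                       product_lookup[prod_nk],
                       amount))
    return loaded

incoming = [("31421", "STD", 1798.00), ("31421", "LTD", 19.95)]
print(key_substitute(incoming))  # [(1552, 1001, 1798.0), (1552, 1002, 19.95)]
```

When a Type 2 change occurs, the ETL updates the lookup dictionary so that subsequent fact rows pick up the new surrogate key.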

Key Building Process

Loading Fact Tables

Managing indexes: indexes are performance killers at load time.
Drop all indexes in pre-load time.
Segregate updates from inserts.
Load the updates.
Rebuild the indexes.

Managing partitions:
Partitions allow a table (and its indexes) to be physically divided into mini-tables for administrative purposes and to improve query performance.
The most common partitioning strategy on fact tables is to partition the table by the date key. Because the date dimension is preloaded and static, you know exactly what the surrogate keys are.
You need to partition the fact table on the key that joins to the date dimension for the optimizer to recognize the constraint.
The ETL team must be advised of any table partitions that need to be maintained.


The Classic Star Schema

A relational model with a one-to-many relationship between dimension tables and the fact table.
A single fact table, with detail and summary data.
The fact table primary key has only one key column per dimension.
Each dimension is a single table, highly denormalized.
Benefits: easy to understand, intuitive mapping between the business entities, easy to define hierarchies, reduces the number of physical joins, low maintenance, very simple metadata.
Drawbacks: summary data in the fact table yields poorer performance for summary levels, and huge dimension tables are a problem.

Need for Aggregates

Sizes of typical tables:
Time dimension: 5 years x 365 days = 1,825 rows
Store dimension: 300 stores reporting daily sales
Product dimension: 40,000 products in each store (about 4,000 sell in each store daily)
Maximum number of base fact table records: 2 billion (lowest level of detail)
Each brand has 500 products
Transactions are stored by product/store/week

A query involving 1 brand, all stores, and 1 year must retrieve and summarize over 7 million fact table rows.


Aggregating Fact Tables

Aggregate fact tables are summaries of the most granular data at higher levels along the dimension hierarchies.

Product hierarchy: Product key, Product, Category, Department
Time hierarchy: Time key, Date, Month, Quarter, Year
Store hierarchy: Store key, Store name, Territory, Region

Base fact table: Product key, Time key, Store key, Unit sales, Sale dollars
Total possible rows = 1825 * 300 * 4000 * 1 = 2 billion

Multiway aggregates hold data values at a higher level, e.g., Territory by Category by Month.


A way of making aggregates


Aggregate fact tables

Store Dimension: STORE KEY, Store Description, City, State, District ID, District Desc., Region ID, Region Desc., Regional Mgr.
Product Dimension: PRODUCT KEY, Product Desc., Brand, Color, Size, Manufacturer
Time Dimension: PERIOD KEY, Period Desc., Year, Quarter, Month, Day, Current Flag, Sequence

Base Fact Table: STORE KEY, PRODUCT KEY, PERIOD KEY, Dollars, Units, Price
District Fact Table (aggregate): District ID, PRODUCT KEY, PERIOD KEY, Dollars, Units, Price
Region Fact Table (aggregate): Region ID, PRODUCT KEY, PERIOD KEY, Dollars, Units, Price

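Building the district-level aggregate from the base fact table can be sketched in SQL (sqlite3 here; the column names follow the slide, simplified, and the data values are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Base fact table at store grain, plus the store dimension with its district.
cur.execute("CREATE TABLE fact (store_key INT, product_key INT, period_key INT,"
            "                   dollars REAL, units INT)")
cur.execute("CREATE TABLE store_dim (store_key INT, district_id INT)")
cur.executemany("INSERT INTO store_dim VALUES (?, ?)", [(1, 10), (2, 10), (3, 20)])
cur.executemany("INSERT INTO fact VALUES (?, ?, ?, ?, ?)",
                [(1, 7, 100, 50.0, 5), (2, 7, 100, 30.0, 3), (3, 7, 100, 20.0, 2)])

# Aggregate fact table: roll stores up to their district.
cur.execute("""
    CREATE TABLE district_fact AS
    SELECT s.district_id, f.product_key, f.period_key,
           SUM(f.dollars) AS dollars, SUM(f.units) AS units
    FROM fact f JOIN store_dim s ON f.store_key = s.store_key
    GROUP BY s.district_id, f.product_key, f.period_key
""")

print(cur.execute("SELECT district_id, dollars FROM district_fact"
                  " ORDER BY district_id").fetchall())  # [(10, 80.0), (20, 20.0)]
```

A region-level aggregate would be built the same way, grouping by region instead of district.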

Families of Stars

Snowflake Schema

A snowflake schema is a type of star schema, but a more complex model.
Snowflaking is a method of normalizing the dimension tables in a star schema.
The normalization eliminates redundancy, but the result is more complex queries and reduced query performance.
Reasons for snowflaking:
To save storage space
To optimize some specific queries (for attributes with low cardinality)


Snowflake Schema

The attributes with low cardinality in each original dimension table are removed to form separate tables. These new tables are linked back to the original dimension table through artificial keys.

Example: a Product table (Product key, Product name, Product code, Brand key) links to a Brand table (Brand key, Brand name, Category key), which links to a Category table (Category key, Product category).


Snowflake schema advantages:
Small saving in storage space
Normalized structures are easier to update and maintain

Snowflake schema disadvantages:
The schema is less intuitive, and end-users are put off by the complexity
The ability to browse through the contents is difficult
Query performance degrades because of the additional joins

What is the Best Design?


Performance benchmarking can be used to determine the best design.
Snowflake schema: easier to maintain dimension tables when the dimension tables are very large (reduces overall space). It is not generally recommended in a data warehouse environment.
Star schema: more effective for data cube browsing (fewer joins), which helps query performance.


Where Does OLAP Fit In? (1)

OLAP = On-line analytical processing.
OLAP is a characterization of applications, not a database design technique.
The idea is to provide very fast response time in order to facilitate iterative decision-making.
Analytical processing requires access to complex aggregations (as opposed to record-level access).

Where Does OLAP Fit In? (2)

Information is conceptually viewed as cubes to simplify the way in which users access, view, and analyze data.
Quantitative values are known as facts or measures, e.g., sales $, units sold, etc.
Descriptive categories are known as dimensions, e.g., geography, time, product, scenario (budget or actual), etc.
Dimensions are often organized in hierarchies that represent levels of detail in the data (e.g., UPC, SKU, product subcategory, product category, etc.).


Need for Multidimensional Analysis

A simple analysis: how many units of product A did we sell in the store in Racine, WI?
Typically, decision support requires more complex analyses: how much revenue did the new product X generate during the last three months, broken down by individual months, in the Southern Region, by individual stores, broken down by the promotions, compared to estimates, and compared to the previous version of the product?

Different Ways of Analysis

Roll-ups to provide summaries and aggregates along the hierarchies of the dimensions.
Drill-downs from the top level to the lowest along the hierarchies of the dimensions.
Calculations involving facts and metrics.
Algebraic equations involving key performance indicators.
Moving averages and growth percentages.
Trend analyses using statistical methods.


OLTP vs OLAP

OLAP Features

The name On-Line Analytical Processing was coined in a paper by E.F. Codd in 1993 ("Providing OLAP to User-Analysts: An IT Mandate").
A definition: OLAP is a category of software technology that enables analysts, managers, and executives to gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user.

Dimensional Analysis (1)

Dimensional Analysis (2)

Some Queries

Display the total sales of all products for the past five years in all stores.
Compare total sales for all stores, product by product, between years 2000 and 1999.
Show a comparison of sales by individual stores, product by product, between years 2000 and 1999, only for those products with reduced sales.
Show the results of the previous queries, but rotating the columns with the rows.

Hypercubes

Multi-dimension cubes are hard to visualize and display beyond three dimensions.
A multi-dimensional domain structure (MDS) represents each dimension as a line showing its values.
A multidimensional database (MDD) is a computer software system designed to allow for the efficient and convenient storage and retrieval of large volumes of data that is (1) intimately related and (2) stored, viewed, and analyzed from different perspectives. These perspectives are called dimensions.
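A tiny in-memory cube makes the idea concrete: facts indexed by (model, color) coordinates, which can then be sliced or rolled up along a dimension. The data values are the Gleason dealership sales volumes from the later slide; the helper names are illustrative:

```python
# A 2-D "cube" of sales volumes keyed by (model, color) coordinates.
cube = {
    ("Mini Van", "Blue"): 6, ("Mini Van", "Red"): 5, ("Mini Van", "White"): 4,
    ("Coupe", "Blue"): 3,    ("Coupe", "Red"): 5,    ("Coupe", "White"): 5,
    ("Sedan", "Blue"): 4,    ("Sedan", "Red"): 3,    ("Sedan", "White"): 2,
}

def slice_color(cube, color):
    """Slice: fix one dimension position, yielding a lower-dimensional view."""
    return {model: v for (model, c), v in cube.items() if c == color}

def rollup_model(cube):
    """Roll-up: aggregate away the color dimension, summing the facts."""
    totals = {}
    for (model, _color), v in cube.items():
        totals[model] = totals.get(model, 0) + v
    return totals

print(slice_color(cube, "Blue"))   # {'Mini Van': 6, 'Coupe': 3, 'Sedan': 4}
print(rollup_model(cube))          # {'Mini Van': 15, 'Coupe': 13, 'Sedan': 9}
```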


Relational vs Multi-Dimensional Models

SALES VOLUMES FOR GLEASON DEALERSHIP (relational table):

MODEL          COLOR   SALES VOLUME
MINI VAN       BLUE    6
MINI VAN       RED     5
MINI VAN       WHITE   4
SPORTS COUPE   BLUE    3
SPORTS COUPE   RED     5
SPORTS COUPE   WHITE   5
SEDAN          BLUE    4
SEDAN          RED     3
SEDAN          WHITE   2

(Figure: the same Sales Volumes data as a two-dimensional array, with MODEL (Mini Van, Coupe, Sedan) on one axis and COLOR (Blue, Red, White) on the other.)


Relational vs Multi-Dimensional Models

The multidimensional array structure represents a higher level of organization than the relational table.
Perspectives are embedded directly into the structure in the multidimensional model: all possible combinations of perspectives containing a specific attribute (the color BLUE, for example) line up along the dimension position for that attribute.
In the relational model, perspectives are placed in fields, which tells us nothing about the field contents.

MDS

Display of Hypercubes

Drill-Down and Roll-Up

Slice or Rotation

Also referred to as data slicing: each rotation yields a different slice, a two-dimensional table of data.
(Figure: the Sales Volumes cube rotated 90 degrees, exchanging the MODEL and COLOR axes between View #1 and View #2.)

Dice or Range

Also referred to as data dicing: the end user selects the desired positions along each dimension, and the data is scoped down to a subset grouping.
(Figure: the cube diced to models Mini Van and Coupe, colors Normal Blue and Metal Blue, and dealerships Carr and Clyde.)

Slice-and-Dice or Rotation

MOLAP Implementations

OLAP has historically been implemented through the use of multi-dimensional databases (MDDs).
Dimensions are key business factors for analysis:
geographies (zip, state, region, ...)
products (item, product category, product department, ...)
dates (day, week, month, quarter, year, ...)
Very high performance is achieved via fast look-up into the cube data structure to retrieve pre-calculated results.
Cube data structures allow pre-calculation of aggregate results for each possible combination of dimensional values.
An application programming interface (API) is used for access via front-end tools.

MOLAP Implementations

Need to consider both maintenance and storage implications when designing a strategy for when to build cubes.
Maintenance considerations: every data item received into the MDD must be aggregated into every cube (assuming to-date summaries are maintained).
Storage considerations: although cubes get much smaller (i.e., more dense) as dimensions get less detailed (e.g., year vs. day), the storage implications of building hundreds of cubes can be significant.
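Pre-calculating an aggregate for every combination of dimensional values can be sketched with itertools; the dimension levels and fact rows here are illustrative:

```python
from itertools import product

# Fact rows: (state, month, sales).
facts = [("WI", "Jan", 10), ("WI", "Feb", 20), ("IL", "Jan", 5), ("IL", "Feb", 15)]

states = ["WI", "IL"]
months = ["Jan", "Feb"]

# Pre-calculate one cell per (state, month) combination, plus "ALL" roll-ups,
# so queries become simple look-ups instead of scans.
cube = {}
for s, m in product(states + ["ALL"], months + ["ALL"]):
    cube[(s, m)] = sum(v for fs, fm, v in facts
                       if s in (fs, "ALL") and m in (fm, "ALL"))

print(cube[("WI", "ALL")])   # 30
print(cube[("ALL", "ALL")])  # 50
```

The combinatorial growth of this loop is exactly the maintenance and storage concern raised above: every incoming fact touches every pre-computed cell it contributes to.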


Virtual Cubes

Virtual cubes are used when there is a need to join information from two dissimilar cubes that share one or more common dimensions.
Similar to a relational view: two (or more) cubes are linked along their common dimension(s).
Often used to save space by eliminating redundant storage of information.

Partitioned Cubes

One logical cube of data can be spread across multiple physical cubes on separate (or the same) servers.
The divide-and-conquer approach of partitioned cubes helps to mitigate the scalability limitations of a MOLAP environment.
Ideal cube partitioning is completely invisible to end users.


MOLAP vs ROLAP


Bottom Line

There are many implementation techniques for delivery of an OLAP environment.
You must fully consider the performance, scalability, complexity, and flexibility characteristics when deciding between MOLAP and ROLAP.
Understand your tools and RDBMS!


Midterm Review

Data Warehouse:
Corporate/enterprise-wide
Union of all data marts
Data received from the staging area
Queries on the presentation resource
Structure for the corporate view of data
Organized on the E-R model

Data Mart:
Departmental
A single business process
STAR join (facts and dimensions)
Technology optimal for data access and analysis
Structure to suit the departmental view of data

The Dimensional Data Model:
Contains the same information as the normalized model
Has far fewer tables
Grouped in coherent business categories
Pre-joins hierarchies and lookup tables, resulting in fewer join paths and fewer intermediate tables
Normalized fact table with denormalized dimension tables

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Midterm Review

A surrogate key is a unique identifier for data warehouse records that replaces source primary keys (business/natural keys). Surrogate keys:
Protect against changes in source systems
Allow integration from multiple sources
Enable rows that do not exist in source data
Track changes over time (e.g., new customer instances when addresses change)
Replace text keys with integers for efficiency

Bus Architecture

An architecture that permits aggregating data across multiple marts:
Conformed dimensions and attributes
Bus matrix


Midterm Review

Slowly Changing Dimensions: attributes in a dimension that change more slowly than the fact granularity.
Type 1: Current only / overwrite the old value
Type 2: All history / create a new dimensional record
Type 3: Most recent few (rare) / create a "previous value" attribute
Note: rapidly changing dimensions usually indicate the presence of a business process that should be tracked as a separate dimension or as a fact table.
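A Type 2 response can be sketched as a function over an in-memory dimension table. The row layout, attribute names, and the `current` flag are illustrative:

```python
def apply_type2_change(dim_rows, natural_key, new_attrs, next_key):
    """Expire the current row for natural_key and add a new row with a new
    surrogate key, preserving full history (a Type 2 SCD response)."""
    for row in dim_rows:
        if row["natural_key"] == natural_key and row["current"]:
            row["current"] = False          # expire, but keep, the old row
    dim_rows.append({"surrogate_key": next_key, "natural_key": natural_key,
                     "current": True, **new_attrs})

customers = [{"surrogate_key": 1552, "natural_key": "31421",
              "current": True, "city": "Racine"}]
apply_type2_change(customers, "31421", {"city": "Madison"}, next_key=2387)

print([(r["surrogate_key"], r["city"], r["current"]) for r in customers])
# [(1552, 'Racine', False), (2387, 'Madison', True)]
```

Facts loaded after the change pick up surrogate key 2387, while facts already loaded keep pointing at 1552, which is how history is preserved.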

Fact Table - incoming data:

Date       CustKey  ProdKey  Item Count  Amount
1/1/2004   31421    STD      ...         1,798.00
1/2/2004   31421    STD      ...         27.95
1/3/2004   31421    STD      ...         320.26
1/2/2006   31421    LTD      ...         19.95

Fact Table - loaded, with natural keys replaced by surrogate keys (CustKey 31421 maps to 1552, and to 2387 for the 2006 row after a Type 2 change; ProdKey STD maps to 1001 and LTD to 1002):

Date  CustKey  ProdKey  Item Count  Amount
...   1552     1001     ...         1,798.00
...   1552     1001     ...         27.95
...   1552     1001     ...         320.26
731   2387     1002     ...         19.95

Data Extraction and Data Capture

Data extraction: data in operational systems exists as current values or periodic status. Data extraction types:

As-Is (Static) Data Capture
Capture of data at a given point in time (taking a snapshot of the relevant source data)
Primarily for the initial load of data to the DW
Full refresh of dimensional tables

Incremental Data Capture (data of revisions)
For revisions since the last time data was captured
Can be immediate or deferred

Immediate Data Extraction: data extraction is real-time.
a. Capture through transaction logs (replication technology)
b. Capture through database triggers
c. Capture in source applications

Deferred Data Extraction: data is not captured in real-time.
a. Capture based on date and time stamp
b. Capture by comparing files

Immediate Data Extraction

Data Transformation

Basic tasks:
Selection
Splitting/Joining
Conversion
Summarization
Enrichment

Major types:
Format revisions
Decoding of fields
Calculated and derived values
Splitting of single fields
Merging of information
Character set conversions and unit of measure conversions
Date/time conversions
Key restructuring
De-duplication
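A few of the transformation types above (decoding of fields, splitting of a single field, de-duplication) can be sketched as simple record-level functions. The field names and code values are illustrative:

```python
STATUS_CODES = {"A": "Active", "I": "Inactive"}  # decoding of fields

def transform(record):
    """Decode a coded field and split a single 'name' field into two."""
    first, _, last = record["name"].partition(" ")
    return {"first_name": first, "last_name": last,
            "status": STATUS_CODES[record["status"]]}

def deduplicate(records, key):
    """Keep only the first occurrence of each key value."""
    seen, out = set(), []
    for r in records:
        if r[key] not in seen:
            seen.add(r[key])
            out.append(r)
    return out

rows = [{"name": "Ada Lovelace", "status": "A"},
        {"name": "Ada Lovelace", "status": "A"}]
cleaned = deduplicate([transform(r) for r in rows], key="last_name")
print(cleaned)  # [{'first_name': 'Ada', 'last_name': 'Lovelace', 'status': 'Active'}]
```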

Deferred Data Extraction

Transformation for Dimension Attributes

Applying Data

Data loading: take the prepared data, apply it to the DW, and store it in the DB.
Different ways of moving data:
Initial load
Incremental load
Full refresh
General methods for applying data:
Writing special load programs
Load utilities of DBMSs

Loading data in dimension tables: dimension records are loaded first.
Data loads in fact tables: create the concatenated key for each fact table record from the dimension records.
History loads for fact tables; incremental loads for fact tables.

Aggregating Fact Tables / Aggregate Fact Tables

(Review: these slides repeat the earlier "Aggregating Fact Tables" and "Aggregate fact tables" material: summaries of the most granular data at higher levels along the dimension hierarchies, with District and Region aggregate fact tables derived from the base fact table.)


3-Tier Data Warehouse Architecture (by Ms. Subha)

Data warehouses often adopt a 3-tier architecture:
1. Bottom tier: data warehouse server
2. Middle tier: OLAP server
3. Top tier: front-end tools

Bottom Tier
The warehouse database server (mostly a relational database system).
Data is fed using back-end tools and utilities (extract, clean, transform, load, and refresh).
Data is extracted using programs called gateways (ODBC, JDBC).
It also contains the metadata repository.

Middle Tier
An OLAP server, typically implemented using either ROLAP or MOLAP:
ROLAP (relational OLAP model): an extended relational DBMS that maps operations on multidimensional data to standard relational operations.
MOLAP (multidimensional OLAP model): a special-purpose server that directly implements multidimensional data and operations.

Top Tier
A front-end client layer, which contains query and reporting tools, analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).
