R. Kimball's definition of a DW
A data warehouse is a copy of transactional data specifically structured for querying and analysis.
A data warehouse is a
subject-oriented,
integrated,
nonvolatile, and
time-variant
collection of data in support of management's decisions.
The data warehouse contains granular corporate data.
According to this definition:
The form of the stored data (RDBMS, flat file) has nothing to do with whether something is a data warehouse.
Data warehousing is not necessarily for the needs of "decision makers" or used in the process of decision making.
Operational vs DW
Operational system
OLTP
Systems that support day-to-day operations
These systems get data into DB
Ex: Take an order, process a claim, make a shipment, generate an invoice, etc.
Data Warehouse is an environment that
Data Warehouse system
OLAP
Systems that support strategic decisions
These systems get data out of DB
Ex: Show top-selling products, show problem regions, show the highest margins, alerts on thresholds.
BITS Pilani, Pilani Campus
Classical operations systems are organized around the applications of the company. For an insurance company, the applications may be auto, health, life, and casualty. The major subject areas of the insurance corporation might be customer, policy, premium, and claim. For a manufacturer, the major subject areas might be product, order, vendor, bill of material, and raw goods. For a retailer, the major subject areas may be product, SKU, sale, vendor, and so forth. Each type of company has its own unique set of subjects.
Of all the aspects of a data warehouse, integration is the most important. Data is fed from multiple disparate sources into the data warehouse. As the data is fed it is converted, reformatted, resequenced, summarized, and so forth. The result is that data once it resides in the data warehouse has a single physical corporate image.
Data is updated in the operational environment as a regular matter of course, but warehouse data exhibits a very different set of characteristics. Data warehouse data is loaded (usually en masse) and accessed, but it is not updated (in the general sense). Instead, when data in the data warehouse is loaded, it is loaded in a snapshot, static format. When subsequent changes occur, a new snapshot record is written. In doing so a history of data is kept in the data warehouse.
Operational:
Time horizon: 1 to 2 years.
Update of records
Key structure may/may not contain an element of time
Data warehouse:
Time horizon: 5 to 15 years.
Sophisticated snapshots of data
Key structure contains an element of time
Time variancy implies that every unit of data in the data warehouse is accurate as of some one moment in time. In some cases, a record is time stamped. In other cases, a record has a date of transaction. But in every case, there is some form of time marking to show the moment in time during which the record is accurate.
A 1-to-2-year time horizon is normal for operational systems; a 5-to-15-year time horizon is normal for the data warehouse. As a result of this difference in time horizons, the data warehouse contains much more history than any other environment.
current, i.e., not time-variant, unlike a DW
current data, up to a few years
no history is maintained (other than audit trail) or operational history
A data warehouse is a central repository for all or significant parts of the data
that an enterprise's various business systems collect. It enables strategic
decision making.
A data mart is a repository of data gathered from operational data and other
sources that is designed to serve a particular community of knowledge
workers. In scope, the data may derive from an enterprise-wide database or
data warehouse or be more specialized. The emphasis of a data mart is on
meeting the specific demands of a particular group of knowledge users in
terms of analysis, content, presentation, and ease-of-use. Users of a data
mart can expect to have data presented in terms that are familiar.
In practice, the terms data mart and data warehouse each tend to imply the
presence of the other in some form. However, most writers using the term
seem to agree that the design of a data mart tends to start from an analysis
of user needs and that a data warehouse tends to start from an analysis of
what data already exists and how it can be collected in such a way that the
data can later be used. A data warehouse is a central aggregation of data
(which can be distributed physically); a data mart is a data repository that
may derive from a data warehouse or not and that emphasizes ease of
access and usability for a particular designed purpose.
Requirements for DW
Security Requirements
a paradox:
Data Warehouse: publish data widely
Security: restrict data to those with a need to know
role-based security at the final applications (not
grant or revoke at the DBMS level)
security for developers (separate subnet), backups (tapes, disks)
Data Integration
at the core of the IT business, aka, the 360 degree
view of the business
specific to Data Warehouses: establishing common
attributes (conforming dimensions), agreeing on
common business metrics (conforming facts) so that
one can perform mathematical calculations
(differences, ratios, etc)
Archiving
change calculations
legal compliance lineage requirements
End User
reports, OLAP, data handoff
Data Warehousing
Design Techniques:
Merging Tables
Design Techniques:
Introduction of Redundant Data
Design Technique:
Separation of Data when there is a disparity of probability of access
Design Technique:
Introduce Derived Data
Design Techniques:
Creative Indexes
Calculated once
Forever available
Design Technique:
Forget Referential Integrity
Dimensional Modeling
In the operational environment, referential integrity appears as a dynamic link among tables of data.
Not in a data warehouse because
volume of data is too large
the data warehouse is not updated, just appended to
the warehouse represents data over time and relationships do
not remain static
Fact Tables
Dimension Tables
Represent a business process, i.e., models the business
process as an artifact in the data model
contain the measurements or metrics or facts of
business processes
"monthly sales number" in the Sales business process
most are additive (sales this month), some are semi-additive
(balance as of), some are not additive (unit price)
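The difference between additive, semi-additive, and non-additive facts can be sketched in a few lines of Python. The rows, field names, and numbers below are invented for illustration and are not taken from the slides; the point is only which aggregation is meaningful for each kind of fact.

```python
# Hypothetical daily fact rows for one account and product; all values are invented.
rows = [
    {"day": 1, "sales": 100.0, "balance": 500.0, "unit_price": 10.0, "units": 10},
    {"day": 2, "sales": 50.0, "balance": 550.0, "unit_price": 12.5, "units": 4},
    {"day": 3, "sales": 75.0, "balance": 625.0, "unit_price": 15.0, "units": 5},
]

# Additive: summing across any dimension, including time, is meaningful.
total_sales = sum(r["sales"] for r in rows)

# Semi-additive: a balance can be summed across accounts but not across time.
# Across time, take the closing value (or an average) instead of a sum.
closing_balance = max(rows, key=lambda r: r["day"])["balance"]

# Non-additive: unit prices cannot be summed; recompute from additive parts.
average_unit_price = total_sales / sum(r["units"] for r in rows)
```

Storing the additive components (sales amount and units) in the fact table is what makes it possible to derive a correct average price at any level of summarization.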
Location Dimension
Dim_id | Loc_cd | Name | State_NM | Country_NM
1001 | IL01 | Chicago Loop | Illinois | USA
1002 | IL02 | Arlington | Illinois | USA
1003 | NY01 | Brooklyn | New York | USA
1004 | TO01 | Toronto | Ontario | Canada
1005 | MX01 | Mexico City | Distrito Federal | Mexico
In order to query for all locations that are in country 'USA'
SELECT *
FROM Locations, States, Countries
WHERE Locations.State_Id = States.State_Id
AND Locations.Country_id = Countries.Country_Id
AND Country_Name = 'USA'
SELECT *
FROM Location_dim
WHERE Country_Name = 'USA'
Note the redundancy
Time Dimension
Dim_id | Month | MonthName | Quarter | QuarterName | Year
1001 | | Jan | | Q1 | 2005
1002 | | Feb | | Q1 | 2005
1003 | | Mar | | Q1 | 2005
1004 | | Apr | | Q2 | 2005
1005 | | May | | Q2 | 2005
Product Dimension
Prod_id | Prod_cd | Name | Category
1001 | STD | Short-Term Disability | Disability
1002 | LTD | Long-Term Disability | Disability
1003 | GUL | | Life
1004 | PA | Personal Accident | Accident
1005 | VADD | Voluntary Accident | Accident
Star Schemas
Snow-flake Schemas
-- Select the measurements
SELECT P.Name, SUM(F.Sales)
-- JOIN the FACT table with Dimensions
FROM Sales F, Time T, Product P, Location L
WHERE F.TM_Dim_Id = T.Dim_Id
AND F.PR_Dim_Id = P.Dim_Id
AND F.LOC_Dim_Id = L.Dim_Id
-- Constrain the Dimensions
AND T.Month='Jan' AND T.Year='2003' AND L.Country_Name='USA'
GROUP BY P.Name
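A runnable sketch of this star join, using Python's built-in sqlite3 module. The table and column names follow the query on the slide; the column types and the sample rows are assumptions made up for the example, and a GROUP BY clause is included so that the SUM is computed per product name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Minimal star schema: one fact table and three dimensions (sample data is invented).
conn.executescript("""
CREATE TABLE Time (Dim_Id INTEGER PRIMARY KEY, Month TEXT, Year TEXT);
CREATE TABLE Product (Dim_Id INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Location (Dim_Id INTEGER PRIMARY KEY, Country_Name TEXT);
CREATE TABLE Sales (TM_Dim_Id INTEGER, PR_Dim_Id INTEGER, LOC_Dim_Id INTEGER, Sales REAL);
INSERT INTO Time VALUES (1, 'Jan', '2003'), (2, 'Feb', '2003');
INSERT INTO Product VALUES (1, 'STD'), (2, 'LTD');
INSERT INTO Location VALUES (1, 'USA'), (2, 'Canada');
INSERT INTO Sales VALUES (1, 1, 1, 100.0), (1, 1, 1, 50.0), (1, 2, 1, 30.0),
                         (2, 1, 1, 999.0), (1, 1, 2, 999.0);
""")

query = """
SELECT P.Name, SUM(F.Sales)
FROM Sales F, Time T, Product P, Location L
WHERE F.TM_Dim_Id = T.Dim_Id
  AND F.PR_Dim_Id = P.Dim_Id
  AND F.LOC_Dim_Id = L.Dim_Id
  AND T.Month = 'Jan' AND T.Year = '2003' AND L.Country_Name = 'USA'
GROUP BY P.Name
ORDER BY P.Name
"""
# Only the January 2003 USA rows contribute; the two 999.0 rows are filtered out.
result = conn.execute(query).fetchall()
```

Each dimension is reached with exactly one join from the fact table, which is the property that keeps star queries simple to write and to optimize.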
Advantages:
-easy to understand
-better performance
-extensible
Rule of thumb: don't use them
Data Warehouse
Corporate/Enterprise-wide
Union of all data marts
Data received from staging area
Queries on presentation resource
Structure for corporate view of data
Organized on E-R Model.
Data Mart
Departmental
A single business process
STAR join (facts and Dim)
Technology optimal for data access and analysis
Structure to suit the departmental view of data
Data warehouse architecture types
Top-down approach
Bill Inmon
Normalized data model
Enterprise view of data
Single, central storage of data
Takes longer to build
High exposure to risk and failure.
Bottom-up approach
Ralph Kimball
De-normalized data model
Collection of conformed data marts which gives enterprise view
Inherently incremental
Less risk of failure and allows project team to learn and grow.
Independent Data Marts.
[Figure: Source Data, Data Staging, Data Marts, Reports/Queries]
Each data mart in this model serves a particular organizational unit.
The data marts are independent of one another.
Variances between data marts affect data analysis across data marts.
For example: Sales and Shipments are two independent data marts.
Even though sales and shipments are related, in this model, it is difficult to analyze sales and shipment data together.
[Figure: Source Data, Data Staging, DSO/ODS, Reports/Queries]
Normalized data in third normal form.
Summarized data at DSO/ODS level.
Queries/Reports access central DW.
There are no separate data marts.
[Figure: two diagrams, each showing Source Data, Data Staging, Data Marts, Reports/Queries]
Inmon CIF (Corporate Information Factory) approach.
Centralized DW in third normal form.
Dependent data marts obtain data from the centralized DW.
Each dependent data mart may have normalized, denormalized, or summarized/dimensional data structures.
Kimball's conformed approach
Business dimensions from the first data mart are shared among the other data marts.
Conformed dimensions will give a logical integrated DW with an enterprise view.
Bottom-up approach
Characteristics of DW/BI
DW Lifecycle Principles
High profile and high impact
High risk
Highly political
Specialized Roles
Data warehouse DBA
OLAP designer
ETL system developer
DW Lifecycle
Project Planning
Business Requirements Definition
Technical Architecture Design
Product Selection & Installation
Dimensional Modeling
Physical Design
BI Application Specification
BI Application Development
Deployment
Growth
Project Management
July 26, 2014
DW Infrastructure
Issues:
Platforms:
Source system
Staging area
Application server
Shop standards
Desktop tools
Database server
Size: 500 GB to 250 TB
Nature of use
Operating systems
Mainframes: transaction oriented, complex administration, not parallel
Open system (UNIX) servers: specialized environment
NT servers: relatively small capacities (limited numbers of processors and less efficient performance)
Hardware
Single processor at a time system
Symmetric Multiprocessing (SMP): multiple processors, shared memory, common bus
Multiple processors, distributed memory, distributed bus
Performance
Database Engine
Indexing
Physical organization
Caching and blocking
Data distribution
Memory
Chip architecture
Relational
Well understood
Include DW support for star joins and fast access
Flexible
Multidimensional (MOLAP)
Extremely fast
Pre-calculated combination facts
Front Room
Operational management
Configuration of desktop
Client/Server
Backup
Web
User support
Supplemental tools
Change management
Fact Table
Dimension Tables
Numeric
Additive
Text
Numeric
Surrogate keys
1:m with the fact table
Null entries
Date dimensions
Bus Architecture
CustKey | BKCustID | CustName | CommDist | Gender | HomOwn?
1552 | 31421 | Jane Rider | | |
Fact Table
Date | CustKey | ProdKey | Item Count | Amount
1/7/2004 | 1552 | 95 | | 1,798.00
3/2/2004 | 1552 | 37 | | 27.95
5/7/2005 | 1552 | 87 | | 320.26
2/21/2006 | 1552 -> 2387 | 42 | | 19.95
CustKey | BKCustID | CustName | CommDist | Gender | HomOwn? | Eff | End
1552 | 31421 | Jane Rider | | | | 1/7/2004 | 1/1/2006
2387 | 31421 | Jane Rider | 31 | | | 1/2/2006 | 12/31/9999
ProductKey | Description | Category | SKU
21553 | LeapPad | Education | LP2105
Type 1
ProductKey | Description | Category | SKU
21553 | LeapPad | Toy | LP2105
Type 2
ProductKey | Description | Category | SKU
21553 | LeapPad | Education | LP2105
44631 | LeapPad | Toy | LP2105
Type 3
ProductKey | Description | Category | OldCat | SKU
21553 | LeapPad | Toy | Education | LP2105
Hybrid
ProductKey | Description | Category | OldCat | SKU
21553 | LeapPad | Education | Electronics | LP2105
44631 | LeapPad | Toy | Education | LP2105
68122 | LeapPad | Education | Electronics | LP2105
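The three basic responses to a changed dimension attribute can be sketched as small Python functions operating on a list of row dictionaries. This is a minimal sketch, not a production ETL routine: the function names, the `eff`/`end` columns on the product row, and the dates are assumptions added for illustration, while the LeapPad keys and categories come from the example above.

```python
from datetime import date, timedelta

HIGH_DATE = date(9999, 12, 31)  # open-ended "End" marker, as in the customer example

def base_dimension():
    """One product row before any change (effective date is invented)."""
    return [{"product_key": 21553, "description": "LeapPad", "category": "Education",
             "sku": "LP2105", "eff": date(2004, 1, 1), "end": HIGH_DATE}]

def apply_type1(dim, sku, new_category):
    """Type 1: overwrite the attribute in place; the old value is lost."""
    for row in dim:
        if row["sku"] == sku:
            row["category"] = new_category

def apply_type2(dim, sku, new_category, change_date, new_key):
    """Type 2: expire the current row and append a row with a new surrogate key."""
    current = next(r for r in dim if r["sku"] == sku and r["end"] == HIGH_DATE)
    current["end"] = change_date - timedelta(days=1)
    dim.append({**current, "product_key": new_key, "category": new_category,
                "eff": change_date, "end": HIGH_DATE})

def apply_type3(dim, sku, new_category):
    """Type 3: keep the prior value in an 'old' column on the same row."""
    for row in dim:
        if row["sku"] == sku:
            row["old_cat"] = row["category"]
            row["category"] = new_category

t1 = base_dimension(); apply_type1(t1, "LP2105", "Toy")
t2 = base_dimension(); apply_type2(t2, "LP2105", "Toy", date(2006, 1, 2), 44631)
t3 = base_dimension(); apply_type3(t3, "LP2105", "Toy")
```

Only the Type 2 variant changes the number of rows, which is why fact rows loaded after the change must pick up the new surrogate key.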
Date Dimensions
One row for every day for which you expect to
have data for the fact table (perhaps
generated in a spreadsheet and imported)
Usually use a meaningful integer surrogate
key (such as yyyymmdd 20060926 for Sep.
26, 2006). Note: this order sorts correctly.
Include rows for missing or future dates to be
added later.
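Instead of a spreadsheet, the date dimension rows can also be generated programmatically. The sketch below assumes a hypothetical helper `build_date_dimension` and a small, invented set of attribute names; it shows the yyyymmdd integer key and confirms that the keys sort in calendar order.

```python
from datetime import date, timedelta

def build_date_dimension(start, end):
    """Return one row per calendar day from start to end, inclusive."""
    rows = []
    day = start
    while day <= end:
        rows.append({
            "date_key": day.year * 10000 + day.month * 100 + day.day,  # yyyymmdd
            "sql_date": day.isoformat(),
            "day_of_week": day.isoweekday(),  # 1 = Monday ... 7 = Sunday
            "month": day.month,
            "year": day.year,
        })
        day += timedelta(days=1)
    return rows

date_dim = build_date_dimension(date(2006, 9, 25), date(2006, 9, 27))
date_keys = [row["date_key"] for row in date_dim]
```

Because every day in the range gets a row up front, future or missing dates already have keys when the facts arrive.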
Aggregates
Improve performance
Record data at coarser granularity
Fact Tables
Transaction
Periodic snapshot
Accumulating snapshot
POS FACT: DateKey, ProductKey, StoreKey, PromotionKey, POSTransactionNumber, SalesQuantity, SalesDollarAmount, CostDollarAmount, GrossProfitDollarAmount
DATE: DateKey, Attributes
SQL date, Full date description, Day of week, Day of month, Day of calendar year, Day of fiscal year, Month of calendar year, Month of fiscal year, Calendar Quarter, Fiscal Quarter, Fiscal week, Year, Month, Fiscal year, Holiday ?, Holiday name, Day of holiday, Weekday ?, Selling season, Major event, etc.
PRODUCT: ProductKey, Attributes
Weight units of measure, Storage type, Shelf unit type, Shelf width, Shelf height, Shelf depth, etc.
STORE: StoreKey, Attributes
Store Name, Store Number, Street address, City, County, State, Zip, Manager, District, Region, Floor plan type, Photo processing type, Financial service type, Square footage, Selling square footage, First open date, Last remodel date, etc.
PROMOTION: PromotionKey, Attributes
Conformed Dimensions:
Process: Retail Sales, Retail Inventory, Retail Deliveries, Warehouse Inventory, Warehouse Deliveries, Purchase Orders
Dimensions: Product, Store, Promotion, Warehouse, Vendor, Contract, Shipper
Inventory Snapshot Model
Fact: quantity-on-hand
Dimensional Model
DATE
DateKey
Attributes
Inventory Fact
ProductKey
DateKey
StoreKey
QuantityOnHand
QuantitySold
ValueAtCost
ValueAtSellingPrice
PRODUCT
ProductKey
Attributes
STORE
StoreKey
Attributes
ETL Processing
Why is it hard?
Multiple source systems technologies.
Inconsistent data representations.
Multiple sources for the same data element.
Complexity of required transformations.
Scarcity and cost of legacy cycles.
Volume of legacy data.
[Figure: Operational Data, Data Transformation, Enterprise Warehouse and Integrated Data Marts, Replication, Dependent Data Marts or Departmental Warehouses, Business Users. Source formats shown: Excel, Access, Oracle, Informix, Sybase, Ingres, Model 204, DBF Format, RDB, RMS, Compressed, many others...]
ETL Processing
It is important to look at the big picture.
Data acquisition time may include:
Loading Strategies
[Figure: new data, old data]
Should consider:
Performance hints:
Trickle Feed
ETL Processing
[Figure: Source Systems, Pre-Transformations, Transformation Server, Data Warehouse]
ELT Processing
[Figure: Source Systems, Files, Network, Channel, Teradata Fastload, Data Warehouse]
Bottom Line
Many options for data loading strategies: need to evaluate tradeoffs in performance, data freshness, and compatibility with source systems environment.
Many options for ETL/ELT deployment: need to evaluate tradeoffs in where and how transformations should be applied.
Loading Dimensions
ETL is a significant task in any DW deployment.
When the DW receives notification that an existing row in a dimension has changed, it gives out 3 types of responses:
Type 1
Type 2
Type 3
Type 1 Dimension
Type 2 Dimension
Type 3 Dimension
Loading Facts
Fact tables hold the measurements of an enterprise. The relationship between fact tables and measurements is extremely simple. If a measurement exists, it can be modeled as a fact table row. If a fact table row exists, it is a measurement.
When building a fact table, the final ETL step is converting the natural keys in the new input records into the correct, contemporary surrogate keys.
ETL maintains a special surrogate key lookup table for each dimension. This table is updated whenever a new dimension entity is created and whenever a Type 2 change occurs on an existing dimension entity.
All of the required lookup tables should be pinned in memory so that they can be randomly accessed as each incoming fact record presents its natural keys. This is one of the reasons for making the lookup tables separate from the original data warehouse dimension tables.
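The surrogate key substitution step can be sketched with plain Python dictionaries standing in for the in-memory lookup tables. The function name `to_fact_row` and the record layout are hypothetical; the key values (customer 31421 mapping to the current Type 2 key 2387, product codes STD and LTD mapping to 1001 and 1002) reuse the examples from these slides.

```python
# Hypothetical in-memory lookup tables: natural key -> current surrogate key.
customer_lookup = {"31421": 2387}            # 2387 is the current Type 2 row for 31421
product_lookup = {"STD": 1001, "LTD": 1002}

def to_fact_row(record):
    """Swap natural keys for surrogate keys; fail loudly if a dimension row is missing."""
    try:
        return {
            "cust_key": customer_lookup[record["cust_id"]],
            "prod_key": product_lookup[record["prod_cd"]],
            "amount": record["amount"],
        }
    except KeyError as missing:
        raise LookupError(f"no surrogate key for natural key {missing}") from None

incoming = [{"cust_id": "31421", "prod_cd": "LTD", "amount": 19.95}]
fact_rows = [to_fact_row(r) for r in incoming]
```

When a Type 2 change creates a new dimension row, only the dictionary entry for that natural key needs to be updated, so later fact records automatically receive the contemporary key.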
Managing Partitions
Partitions allow a table (and its indexes) to be physically divided into minitables for administrative purposes and to improve query performance.
The most common partitioning strategy on fact tables is to partition the table by the date key. Because the date dimension is preloaded and static, you know exactly what the surrogate keys are.
Need to partition the fact table on the key that joins to the date dimension for the optimizer to recognize the constraint.
The ETL team must be advised of any table partitions that need to be maintained.
A relational model with a one-to-many relationship between dimension table and fact table.
A single fact table, with detail and summary data
Fact table primary key has only one key column per dimension
Each dimension is a single table, highly denormalized
Benefits: Easy to understand, intuitive mapping between the business entities, easy to define hierarchies, reduces # of physical joins, low maintenance, very simple metadata
Drawbacks: Summary data in the fact table yields poorer performance for summary levels, huge dimension tables a problem
Sizes of typical tables:
Time dimension: 5 years x 365 days = 1825
Store dimension: 300 stores reporting daily sales
Product dimension: 40,000 products in each store (about 4000 sell in each store daily)
Maximum number of base fact table records: 2 billion (lowest level of detail)
Each brand has 500 products
Transactions are stored by product/store/week.
A query involving 1 brand, all stores, 1 year: retrieve/summarize over 7 million fact table rows.
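The sizing figures above follow from simple multiplication, worked through here in Python. The only number not stated on the slide is the 52 weeks per year used for the brand query, which is an assumption.

```python
days = 5 * 365                 # time dimension rows: 1825
stores = 300
products_sold_daily = 4000     # products that sell in each store on a given day

# Upper bound on base fact rows at the lowest level of detail.
max_base_rows = days * stores * products_sold_daily

# Query for 1 brand, all stores, 1 year, with facts stored by product/store/week.
products_per_brand = 500
weeks_per_year = 52            # assumption: not stated on the slide
rows_for_brand_query = products_per_brand * stores * weeks_per_year
```

The first product is roughly 2.19 billion, which the slide rounds to 2 billion; the second is 7.8 million, matching "over 7 million" and motivating the aggregate tables discussed later.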
Time dimension: Time key, Date, Month, Quarter, Year
Store dimension: Store key, Store name, Territory, Region
Fact table: Product key, Time key, Store key, Unit sales, Sale dollars
Total Possible Rows = 1825 * 300 * 4000 * 1 = 2 billion
Multiway aggregates: Territory, Category, Month (data values at higher level)
Fact Table
STORE KEY
PRODUCT KEY
PERIOD KEY
Dollars
Units
Price
Product Dimension
PRODUCT KEY
Product Desc.
Brand
Color
Size
Manufacturer
Time Dimension
PERIOD KEY
Period Desc
Year
Quarter
Month
Day
Current Flag
Sequence
Families of Stars
Snowflake Schema
Snowflake schema is a type of star schema but a more complex model.
Snowflaking is a method of normalizing the dimension tables in a star schema.
The normalization eliminates redundancy.
The result is more complex queries and reduced query performance.
Reasons:
To save storage space
To optimize some specific queries (for attributes with low cardinality)
Snowflake Schema
The attributes with low cardinality in each original dimension table are removed to form separate tables. These new tables are linked back to the original dimension table through artificial keys.
Product key, Product name, Product code, Brand key
Brand key, Brand name, Category key
Category key, Product category
Advantages:
Small saving in storage space
Normalized structures are easier to update and maintain
Disadvantages:
Schema less intuitive and end-users are put off by the complexity
Ability to browse through the contents difficult
Degrade query performance because of additional joins
OLAP = On-line analytical processing.
OLAP is a characterization of applications, not a database design technique.
Idea is to provide very fast response time in order to facilitate iterative decision-making.
Analytical processing requires access to complex aggregations (as opposed to record-level access).
Quantitative values are known as facts or measures.
e.g., sales $, units sold, etc.
Need for Multidimensional Analysis
A simple analysis
How many units of product A did we sell in the store in Racine, WI
OLTP vs OLAP
OLAP
OLAP Features
Some Queries
Display the total sales of all products for past five years in all stores
Compare total sales for all stores, product by product, between years 2000 and 1999.
Show comparison of sales by individual stores, product by product, between years 2000 and 1999 only for those products with reduced sales.
Show the results of the previous queries, but rotating the columns with rows
Hypercubes
Multi-dimension cubes
Hard to visualize and display beyond three dimensions
Relational Vs Multi-Dimensional Models
[Figure: Sales Volumes shown two ways: as a relational table with columns including COLOR and SALES VOLUME, and as a two-dimensional array with MODEL (Mini Van, Coupe, Sedan) on one axis and COLOR (Blue, Red, White) on the other. Sales volume values shown: 6, 5, 4, 3, 5, 5, 4, 3, 2.]
MDS
Perspectives are placed in fields in the relational model - tells us nothing about field contents.
Display of Hypercubes
Slice or Rotation
[Figure: Sales Volumes by MODEL (Mini Van, Coupe, Sedan) and COLOR (Blue, Red, White), with a ROTATE 90 operation; View #1 and View #2]
Dice or Range
[Figure: Sales Volumes by MODEL (Mini Van, Coupe), COLOR (Normal Blue, Metal Blue), and DEALERSHIP (Carr, Clyde)]
The end user selects the desired positions along each dimension.
Also referred to as "data dicing."
The data is scoped down to a subset grouping
Slice-and-Dice or Rotation
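The slice, dice, and rotate operations can be sketched on a tiny hypercube held in a Python dictionary. The dimension values (models, colors, dealerships) come from the figures; the sales volumes are invented placeholders, and the function names are hypothetical. The definitions used here are the common ones: a slice fixes one dimension at a single value, a dice keeps a sub-range along each dimension, and a rotation swaps the axes of a view.

```python
from itertools import product

models = ["Mini Van", "Coupe", "Sedan"]
colors = ["Blue", "Red", "White"]
dealers = ["Carr", "Clyde"]

# Invented sales volumes: one cell per (model, color, dealer) combination.
cube = {cell: i + 1 for i, cell in enumerate(product(models, colors, dealers))}

def slice_cube(cube, dealer):
    """Slice: fix one dimension at a single value, leaving a 2-D view."""
    return {(m, c): v for (m, c, d), v in cube.items() if d == dealer}

def dice_cube(cube, keep_models, keep_colors, keep_dealers):
    """Dice: keep only the selected positions along each dimension."""
    return {k: v for k, v in cube.items()
            if k[0] in keep_models and k[1] in keep_colors and k[2] in keep_dealers}

def rotate(view):
    """Rotate (pivot) a 2-D view by swapping its row and column axes."""
    return {(c, m): v for (m, c), v in view.items()}

carr = slice_cube(cube, "Carr")
small = dice_cube(cube, {"Mini Van", "Coupe"}, {"Blue"}, {"Carr", "Clyde"})
rotated = rotate(carr)
```

None of these operations changes the underlying data; each returns a different view of the same cells, which is why OLAP tools can offer them interactively.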
MOLAP Implementations
OLAP has historically been implemented through use of multi-dimensional databases (MDDs).
Dimensions are key business factors for analysis:
geographies (zip, state, region,...)
products (item, product category, product department,...)
dates (day, week, month, quarter, year,...)
Very high performance via fast look-up into cube data structure to retrieve pre-calculated results.
Cube data structures allow pre-calculation of aggregate results for each possible combination of dimensional values.
Use of application programming interface (API) for access via front-end tools.
Need to consider both maintenance and storage implications when designing strategy for when to build cubes.
Maintenance Considerations: Every data item received into MDD must be aggregated into every cube (assuming to-date summaries are maintained).
Storage Considerations: Although cubes get much smaller (e.g., more dense) as dimensions get less detailed (e.g., year vs. day), storage implications for building hundreds of cubes can be significant.
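The maintenance and storage trade-off can be made concrete with a small sketch that pre-calculates one aggregate table per subset of dimensions. The dimension names, fact rows, and the `build_cubes` helper are invented for illustration and do not describe any particular MOLAP product.

```python
from collections import defaultdict
from itertools import combinations

dimensions = ("store", "product", "month")
facts = [  # invented base rows
    {"store": "S1", "product": "P1", "month": "Jan", "units": 3},
    {"store": "S1", "product": "P2", "month": "Jan", "units": 5},
    {"store": "S2", "product": "P1", "month": "Feb", "units": 2},
]

def build_cubes(facts, dimensions, measure):
    """Pre-calculate one aggregate table per subset of dimensions (2**n in total)."""
    cubes = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            totals = defaultdict(int)
            for row in facts:
                # Every fact row is folded into every cube: the maintenance cost.
                totals[tuple(row[d] for d in dims)] += row[measure]
            cubes[dims] = dict(totals)
    return cubes

cubes = build_cubes(facts, dimensions, "units")
```

With three dimensions there are already eight aggregate tables; each added dimension or hierarchy level multiplies that count, which is where the storage concern comes from.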
Virtual Cubes
Partitioned Cubes
MOLAP vs ROLAP
Bottom Line
There are many implementation techniques for
delivery of an OLAP environment.
Must fully consider the performance, scalability,
complexity, and flexibility characteristics when
deciding between MOLAP and ROLAP.
ROLAP
Understand your tools and RDBMS!
Midterm Review
The Dimensional Data Model
Data Warehouse
Corporate/Enterprise-wide
Union of all data marts
Data received from staging area
Queries on presentation resource
Structure for corporate view of data
Organized on E-R Model.
Data Mart
Departmental
A single business process
STAR join (facts and Dim)
Technology optimal for data access and analysis
Structure to suit the departmental view of data
Pre-joins hierarchies and lookup tables resulting in fewer join paths and
fewer intermediate tables
A surrogate key is a unique identifier for data warehouse records that replaces source primary keys (business/natural keys)
Protect against changes in source systems
Allow integration from multiple sources
Enable rows that do not exist in source data
Track changes over time (e.g. new customer
instances when addresses change)
Replace text keys with integers for efficiency
Bus Architecture
Slowly Changing Dimensions
Attributes in a dimension that change more slowly
than the fact granularity
Type 1: Current only / overwrite the old value
Type 2: All history / create a new dimensional
record
Type 3: Most recent few (rare) / create a "previous value" attribute
Note: rapidly changing dimensions usually indicate
the presence of a business process that should be
tracked as a separate dimension or as a fact table
Date | CustKey | ProdKey | Item Count | Amount
1/1/2004 | 31421 | STD | | 1,798.00
1/2/2004 | 31421 | STD | | 27.95
1/3/2004 | 31421 | STD | | 320.26
1/2/2006 | 31421 | LTD | | 19.95
Fact Table
Date | CustKey | ProdKey | Item Count | Amount
 | 1552 | 1001 | | 1,798.00
 | 1552 | 1001 | | 27.95
 | 1552 | 1001 | | 320.26
731 | 1552 -> 2387 | 1002 | | 19.95
Data Extraction:
Data in operational systems
Current Value
Periodic Status
Immediate Data Extraction
- data extraction is real-time.
a.
b.
c.
Data Transformation
Basic Tasks
Selection
Splitting/Joining
Conversion
Summarization
Enrichment
Major Types
Format Revisions
Decoding of Fields
Calculated and Derived Values
Splitting of Single Fields
Merging of Information
Character set conversion and Unit Measure Conversions
Date/Time Conversion
Key Restructuring
Deduplication
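Several of the transformation types listed above can be sketched on a single source record. Everything in this example is hypothetical: the field names, the gender code table, and the sample values are invented to show the shape of each transformation rather than any specific tool's behavior.

```python
from datetime import datetime

GENDER_CODES = {"M": "Male", "F": "Female"}   # hypothetical source code table
LB_TO_KG = 0.45359237

def transform(record):
    """Apply a few of the transformation types to one source record."""
    first, _, last = record["name"].partition(" ")        # splitting of single fields
    order_date = datetime.strptime(record["order_date"], "%m/%d/%Y").date()
    return {
        "first_name": first,
        "last_name": last,
        "gender": GENDER_CODES.get(record["gender"], "Unknown"),  # decoding of fields
        "order_date": order_date.isoformat(),                     # date/time conversion
        "weight_kg": round(record["weight_lb"] * LB_TO_KG, 2),    # unit conversion
        "amount": record["qty"] * record["unit_price"],           # derived value
    }

def deduplicate(records, key):
    """Deduplication: keep the first record seen for each key value."""
    seen, unique = set(), []
    for rec in records:
        if rec[key] not in seen:
            seen.add(rec[key])
            unique.append(rec)
    return unique

source = [
    {"id": 1, "name": "Jane Rider", "gender": "F", "order_date": "1/7/2004",
     "weight_lb": 10.0, "qty": 2, "unit_price": 9.5},
    {"id": 1, "name": "Jane Rider", "gender": "F", "order_date": "1/7/2004",
     "weight_lb": 10.0, "qty": 2, "unit_price": 9.5},
]
clean = [transform(r) for r in deduplicate(source, "id")]
```

Running deduplication before the field-level transformations avoids doing the same conversion work twice on duplicate rows.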
Applying Data
Data Loading
Take the prepared data, apply it to DW,
and store in DB.
Different ways of moving data
Initial load
Incremental load
Full Refresh
Aggregate fact tables are summaries of the most granular data at higher levels along the dimension hierarchies.
Product dimension: Product key, Product, Category, Department
Time dimension: Time key, Date, Month, Quarter, Year
Store dimension: Store key, Store name, Territory, Region
Fact table: Product key, Time key, Store key, Unit sales, Sale dollars
Multiway aggregates: Territory, Category, Month
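A multiway aggregate at the Territory, Category, Month level can be derived from the base fact rows by looking up the higher-level attribute in each dimension and summing. The dimension contents and fact rows below are invented sample data, and `build_aggregate` is a hypothetical helper written for this sketch.

```python
from collections import defaultdict

# Invented dimension rows: surrogate key -> higher-level attribute.
store_dim = {1: {"territory": "North"}, 2: {"territory": "North"}, 3: {"territory": "South"}}
product_dim = {10: {"category": "Toys"}, 11: {"category": "Books"}}
time_dim = {20050103: {"month": "2005-01"}, 20050117: {"month": "2005-01"},
            20050207: {"month": "2005-02"}}

base_facts = [  # (product_key, time_key, store_key, unit_sales, sale_dollars)
    (10, 20050103, 1, 2, 20.0),
    (10, 20050117, 2, 1, 10.0),
    (11, 20050103, 3, 4, 60.0),
    (10, 20050207, 1, 3, 30.0),
]

def build_aggregate(facts):
    """Roll base facts up to the Territory x Category x Month level."""
    agg = defaultdict(lambda: [0, 0.0])
    for product_key, time_key, store_key, units, dollars in facts:
        level = (store_dim[store_key]["territory"],
                 product_dim[product_key]["category"],
                 time_dim[time_key]["month"])
        agg[level][0] += units
        agg[level][1] += dollars
    return {level: tuple(totals) for level, totals in agg.items()}

aggregate_fact = build_aggregate(base_facts)
```

A query at this level then reads three aggregate rows instead of scanning the four base rows; at realistic volumes the same ratio is what turns millions of row reads into thousands.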
Store Dimension
STORE KEY
Store Description
City
State
District ID
District Desc.
Region_ID
Region Desc.
Regional Mgr.
Fact Table
STORE KEY
PRODUCT KEY
PERIOD KEY
Dollars
Units
Price
Product Dimension
PRODUCT KEY
Product Desc.
Brand
Color
Size
Manufacturer
Time Dimension
PERIOD KEY
Period Desc
Year
Quarter
Month
Day
Current Flag
Sequence
(Data values at higher level)
Bottom Tier
Middle Tier
It is an OLAP server that is typically implemented
using either ROLAP or MOLAP
ROLAP - relational OLAP model, an extended relational DBMS that maps operations on multidimensional data to standard relational operations
MOLAP - multidimensional OLAP model, a special-purpose server that directly implements multidimensional data and operations.
Top Tier
It is a frontend client layer, which contains
query and reporting tools, analysis tools,
and/or data mining tools (e.g., trend
analysis, prediction, and so on).