Lecture 10: Data Warehouses
Introduction
Operational vs. Warehouse
Multidimensional Data
Examples
MOLAP vs ROLAP
Dimensional Hierarchies
OLAP Queries
Demos
Comparison with SQL Queries
CUBE Operator
Multidimensional Design
Star/Snowflake Schemas
03/07/15
Online Aggregation
Implementation Issues
Bitmap Index
Constructing a Data Warehouse
Views
Materialized View Example
Materialized View is an Index
Issues in Materialized Views
Maintaining Materialized Views
Introduction
In the late 80s and early 90s, companies began to use their DBMSs for complex, interactive, exploratory analysis of historical data.
This was called Decision Support (DS) or On-Line Analytic Processing (OLAP).
DS slowed down the day-to-day operation of the company, called On-Line Transaction Processing (OLTP).
This led to the creation of Data Warehouses, separate from operational Databases.
Operational vs Data Warehouse Requirements

OLTP / Operational / Production | DSS / Warehouse / DataMart
Operate the business / Clerks | Diagnose the business / Managers
Short queries, small amounts of data | Opposite
Queries change data | Opposite
Customer inquiry, Order Entry, etc. | OLAP, Statistics, Visualization, Data Mining, etc.
Legacy Applications, Heterogeneous databases | Opposite
Often Distributed | Often Centralized (Warehouse)
Current data | Current and Historical data
Operational vs Data Warehouse Requirements, ctd

Operational | Warehouse
General E-R Diagrams | Multidimensional data model common
Locks necessary | No Locks necessary
Crash recovery required | Crash recovery optional
Smaller volume of data | Huge volume of data
Need indexes designed to access small amounts of data | Need indexes designed to access large volumes of data
Data Warehousing
Integrated data spanning long time periods, often augmented with summary information.
Several terabytes to petabytes common.
Interactive response times expected for complex queries; ad-hoc updates uncommon.
[Figure: operational data flows through EXTRACT, TRANSFORM, LOAD, and REFRESH into the DATA WAREHOUSE, which has a Metadata Repository and SUPPORTS OLAP and DATA MINING.]
Multidimensional Data
In order to support OLAP, warehouse data is often structured multidimensionally, as measures and dimensions.
Measure: Numeric attribute, e.g. sales amount
Dimension: attribute categorizing the measure, e.g. product, store, date of sale.
The fact table has a foreign key for each dimension, plus an attribute for each measure.
There will also be a dimension table for each dimension.
In the examples that follow, each fact table (red on the slide) is listed first, followed by its dimension tables (green on the slide).
Examples of Multidimensional Data
Purchase(ProductID, StoreID, DateID, Amt)
Product(ID, SKU, size, brand)
Store(ID, Address, Sales District, Region, Manager)
Date(ID, Week, Month, Holiday, Promotion)

Claims(ProvID, MembID, Procedure, DateID, Cost)
Providers(ID, Practice, Address, ZIP, City, State)
Members(ID, Contract, Name, Address)
Procedure(ID, Name, Type)

Telecomm(CustID, SalesRepID, ServiceID, DateID)
SalesRep(ID, Address, Sales District, Region, Manager)
Service(ID, Name, Category)
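The Purchase example can be sketched as an actual schema. This is a minimal illustration using Python's built-in sqlite3 module; the Store and Date tables are cut down to a few columns, and all sample rows are invented. The point is only that the fact table carries one foreign key per dimension plus the measure, and that a typical query joins the fact table to its dimension tables and aggregates the measure.

```python
import sqlite3

# In-memory database; names follow the Purchase example, sample rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Product (ID INTEGER PRIMARY KEY, SKU TEXT, size TEXT, brand TEXT);
CREATE TABLE Store   (ID INTEGER PRIMARY KEY, Address TEXT, Region TEXT);
CREATE TABLE "Date"  (ID INTEGER PRIMARY KEY, Week INTEGER, Month INTEGER);
-- Fact table: one foreign key per dimension plus the measure Amt.
CREATE TABLE Purchase (
    ProductID INTEGER REFERENCES Product(ID),
    StoreID   INTEGER REFERENCES Store(ID),
    DateID    INTEGER REFERENCES "Date"(ID),
    Amt       REAL
);
INSERT INTO Product VALUES (1, 'A-100', 'S', 'Acme'), (2, 'B-200', 'L', 'Bolt');
INSERT INTO Store   VALUES (1, '1 Main St', 'West'), (2, '9 Oak Ave', 'East');
INSERT INTO "Date"  VALUES (1, 1, 1), (2, 2, 1);
INSERT INTO Purchase VALUES (1, 1, 1, 10.0), (1, 2, 1, 5.0),
                            (2, 1, 2, 7.5), (2, 2, 2, 2.5);
""")

# Aggregate the measure over two dimensions by joining fact and dimension tables.
rows = conn.execute("""
    SELECT P.brand, S.Region, SUM(F.Amt)
    FROM Purchase F
    JOIN Product P ON F.ProductID = P.ID
    JOIN Store S   ON F.StoreID = S.ID
    GROUP BY P.brand, S.Region
    ORDER BY P.brand, S.Region
""").fetchall()
```

Here rows holds one total per (brand, region) pair.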
MOLAP vs ROLAP
Multidimensional data can be stored physically in a (disk-resident, persistent) array; these are called MOLAP systems. Alternatively, it can be stored as a relation; these are called ROLAP systems.
The main relation, which relates dimensions to a measure, is called the fact table. Each dimension can have additional attributes and an associated dimension table.
E.g., fact table Sales(pid, locid, timeid, amt)
Fact tables are much larger than dimension tables.
25.2 Multidimensional Data Model
Collection of numeric measures, which depend on a set of dimensions.
E.g., measure Amt, dimensions Product (key: pid), Location (locid), and Time (timeid).

pid  timeid  locid  amt
11   1       1      25
11   2       1      8
11   3       1      15
12   1       1      30
12   2       1      20
12   3       1      50
13   1       1      8
13   2       1      10
13   3       1      10
11   1       2      35

Slice locid=1 is shown:

        timeid=1  timeid=2  timeid=3
pid=11  25        8         15
pid=12  30        20        50
pid=13  8         10        10
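The difference between storing the data as a relation (ROLAP) and as an array (MOLAP) can be sketched in a few lines. This is only an illustration in plain Python: the rows follow the (pid, timeid, locid, amt) example as far as it can be read from the slide, and the nested list stands in for a disk-resident array, which a real MOLAP system would lay out very differently.

```python
# Fact rows (pid, timeid, locid, amt), following the slide's example.
sales = [
    (11, 1, 1, 25), (11, 2, 1, 8), (11, 3, 1, 15),
    (12, 1, 1, 30), (12, 2, 1, 20), (12, 3, 1, 50),
    (13, 1, 1, 8), (13, 2, 1, 10), (13, 3, 1, 10),
    (11, 1, 2, 35),
]

# ROLAP-style slice: a selection on the relation.
slice_rolap = {(pid, t): amt for pid, t, loc, amt in sales if loc == 1}

# MOLAP-style storage: a dense array indexed by position along each dimension.
pids = sorted({r[0] for r in sales})
times = sorted({r[1] for r in sales})
locs = sorted({r[2] for r in sales})
cube = [[[None for _ in locs] for _ in times] for _ in pids]
for pid, t, loc, amt in sales:
    cube[pids.index(pid)][times.index(t)][locs.index(loc)] = amt

def slice_loc(locid):
    """Return the pid x timeid matrix for one location (None = no sale)."""
    k = locs.index(locid)
    return [[cell[k] for cell in row] for row in cube]

slice_molap = slice_loc(1)
```

The array needs no stored keys for its cells, but it also reserves a cell for every combination, including the empty ones at locid=2.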
Dimension Hierarchies
For each dimension, some of the attributes may be organized in a hierarchy:
[Figure: hierarchy diagrams for three dimensions; levels recovered from the slide. PRODUCT: category, pname, PID. TIME: year, quarter, week, date. LOCATION: state.]
25.3 OLAP Queries

Influenced by SQL and by spreadsheets.
A common operation is to aggregate a measure over one or more dimensions.
Find total sales.
Find total sales for each city, or for each state.
Find top five products ranked by total sales.
Roll-up: Aggregating at different levels of a dimension hierarchy.
E.g., Given total sales by city, we can roll-up to get sales by state.
OLAP Queries
Drill-down: The inverse of roll-up.
E.g., Given total sales by state, can drill-down to get total sales by city.
E.g., Can also drill-down on a different dimension to get total sales by product for each state.
Pivoting: Aggregation on selected dimensions.
E.g., Pivoting on State and Year yields this cross-tabulation:

       OR   CA   Total
2007   63   81   144
2008   38   107  145
2009   75   35   110
Total  176  223  399

Slicing and Dicing: Equality and range selections on one or more dimensions.
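Roll-up, drill-down, slicing, and dicing can all be expressed with one small grouping helper. The sketch below is plain Python over invented fact rows; the cities, products, and the aggregate function are assumptions made for illustration, not part of any OLAP product.

```python
from collections import defaultdict

# Hypothetical fact rows: (city, state, year, product, amt).
facts = [
    ("Portland", "OR", 2007, "tv", 40), ("Eugene", "OR", 2007, "radio", 23),
    ("Portland", "OR", 2008, "tv", 38), ("Fresno", "CA", 2007, "tv", 81),
    ("Fresno", "CA", 2008, "radio", 60), ("Oakland", "CA", 2008, "tv", 47),
]
COLS = {"city": 0, "state": 1, "year": 2, "product": 3}

def aggregate(rows, dims):
    """SUM(amt) grouped by the named dimensions."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[COLS[d]] for d in dims)
        totals[key] += row[4]
    return dict(totals)

by_city = aggregate(facts, ["state", "city"])   # finer level of the hierarchy
by_state = aggregate(facts, ["state"])          # roll-up: city -> state

# A roll-up can also be computed from the finer aggregate alone.
rolled = defaultdict(int)
for (state, _city), amt in by_city.items():
    rolled[(state,)] += amt

# Drill-down on a different dimension: product within each state.
by_state_product = aggregate(facts, ["state", "product"])

# Slice: equality selection on one dimension. Dice: selections on several.
slice_2007 = [r for r in facts if r[2] == 2007]
dice = [r for r in facts if r[1] == "CA" and 2007 <= r[2] <= 2008 and r[3] == "tv"]
```

Because SUM is distributive, the roll-up from by_city gives the same totals as aggregating the raw facts by state, which is what makes precomputed finer aggregates reusable.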
Cognos Demo
Now we watch a demo of Cognos (bought by IBM)
Dimensions: Products, Margin ranges
Measure: Order value (sales)
First pivot from Product dimension to Margin Range
Notice how quickly the cube changes
Slice to Low Margin, pivot to Product and Company Region
Drill Down to High Tech, IDES AG
Now the guilty product is clear.
Tableau Demo
http://www.tableausoftware.com/products/tour2
Note the many measures.
Pivot on sales, date (drill down to month), region as color.
Clear date, pivot on product and drill down on subcategory.
Change region from color to rows
Move profit into color
Change bars to circles
Pivot on dates (columns)
Comparison with SQL Queries
The cross-tabulation obtained by pivoting can also be computed using a collection of SQL queries:

SELECT T.year, L.state, SUM(S.amt)
FROM Sales S, Times T, Locations L
WHERE S.timeid=T.timeid AND S.locid=L.locid
GROUP BY T.year, L.state

SELECT T.year, SUM(S.amt)
FROM Sales S, Times T
WHERE S.timeid=T.timeid
GROUP BY T.year

SELECT L.state, SUM(S.amt)
FROM Sales S, Locations L
WHERE S.locid=L.locid
GROUP BY L.state
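To see the three queries assemble a cross-tabulation, they can be run against a tiny database. The sketch uses Python's sqlite3 module; the table contents are invented, one pre-summed Sales row per year and state, chosen so that the cells match the numbers in the slide's cross-tab. The grand total needs a fourth query with no GROUP BY, which is added here as an assumption about how the corner cell would be filled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Times     (timeid INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE Locations (locid INTEGER PRIMARY KEY, state TEXT);
CREATE TABLE Sales     (timeid INTEGER, locid INTEGER, amt INTEGER);
INSERT INTO Times VALUES (1, 2007), (2, 2008), (3, 2009);
INSERT INTO Locations VALUES (1, 'OR'), (2, 'CA');
INSERT INTO Sales VALUES (1, 1, 63), (1, 2, 81), (2, 1, 38),
                         (2, 2, 107), (3, 1, 75), (3, 2, 35);
""")

# Body of the cross-tab: one cell per (year, state).
body = conn.execute("""
    SELECT T.year, L.state, SUM(S.amt)
    FROM Sales S, Times T, Locations L
    WHERE S.timeid = T.timeid AND S.locid = L.locid
    GROUP BY T.year, L.state
""").fetchall()
# Row totals.
by_year = conn.execute("""
    SELECT T.year, SUM(S.amt) FROM Sales S, Times T
    WHERE S.timeid = T.timeid GROUP BY T.year
""").fetchall()
# Column totals.
by_state = conn.execute("""
    SELECT L.state, SUM(S.amt) FROM Sales S, Locations L
    WHERE S.locid = L.locid GROUP BY L.state
""").fetchall()
# Corner cell: the empty grouping list.
grand_total = conn.execute("SELECT SUM(amt) FROM Sales").fetchone()[0]

crosstab = {(y, s): a for y, s, a in body}
crosstab.update({(y, "Total"): a for y, a in by_year})
crosstab.update({("Total", s): a for s, a in by_state})
crosstab[("Total", "Total")] = grand_total
```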
The CUBE Operator

Generalizing the previous example, if there are k dimensions, we have 2^k possible SQL GROUP BY queries that can be generated through pivoting on a subset of dimensions.
CUBE pid, locid, timeid BY SUM Sales
Equivalent to rolling up Sales on all eight subsets of the set {pid, locid, timeid}; each roll-up corresponds to an SQL query of the form:
SELECT SUM(S.amt)
FROM Sales S
GROUP BY grouping-list
Lots of work on optimizing the CUBE operator!
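The 2^k roll-ups can be enumerated directly. The following sketch is the naive approach, one GROUP BY query per subset of the dimensions, run with Python's sqlite3 module over a few invented Sales rows. It is not how an optimized CUBE operator works; it only makes the eight grouping lists concrete.

```python
import sqlite3
from itertools import combinations

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (pid INTEGER, locid INTEGER, timeid INTEGER, amt INTEGER)")
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?, ?)", [
    (11, 1, 1, 25), (11, 1, 2, 8), (12, 1, 1, 30), (11, 2, 1, 35),
])

dims = ["pid", "locid", "timeid"]
cube = {}
for k in range(len(dims) + 1):
    for group in combinations(dims, k):      # every subset of the dimensions
        cols = ", ".join(group)
        if group:
            sql = f"SELECT {cols}, SUM(amt) FROM Sales GROUP BY {cols}"
        else:
            sql = "SELECT SUM(amt) FROM Sales"   # empty grouping list: grand total
        # Map each group key to its total; the last column is always SUM(amt).
        cube[group] = {row[:-1]: row[-1] for row in conn.execute(sql)}
```

Each query scans Sales again, so the work grows with 2^k; sharing work between the groupings is what the optimization literature on CUBE addresses.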
Example Multidimensional Design
TIMES(timeid, date, week, month, quarter, year, holiday_flag)
PRODUCTS(pid, pname, category, price)
SALES (Fact table)(pid, timeid, locid, amt)
LOCATIONS(locid, city, state, country)
This kind of schema is very common in OLAP applications
It is called a star schema
What is wrong with it?
Star/Snowflake Schemas
Why normalize?
Space
Redundancy, anomalies
Why unnormalize?
Performance
Which is more important in Data Warehouses?
If normalized, it is a snowflake schema
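The trade-off can be made concrete with the LOCATIONS dimension. In this sketch (Python's sqlite3 module, invented rows, hypothetical table names States and Cities), the star version repeats state and country for every city, while the snowflake version stores each state once but needs an extra join to answer the same question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star: one denormalized dimension table; state and country repeat per city.
CREATE TABLE Locations (locid INTEGER PRIMARY KEY, city TEXT, state TEXT, country TEXT);
INSERT INTO Locations VALUES
    (1, 'Portland', 'OR', 'USA'), (2, 'Eugene', 'OR', 'USA'), (3, 'Fresno', 'CA', 'USA');

-- Snowflake: the hierarchy city -> state -> country is split into two tables.
CREATE TABLE States (state TEXT PRIMARY KEY, country TEXT);
CREATE TABLE Cities (locid INTEGER PRIMARY KEY, city TEXT,
                     state TEXT REFERENCES States(state));
INSERT INTO States SELECT DISTINCT state, country FROM Locations;
INSERT INTO Cities SELECT locid, city, state FROM Locations;
""")

star = conn.execute(
    "SELECT locid, city, state, country FROM Locations ORDER BY locid").fetchall()
snowflake = conn.execute("""
    SELECT C.locid, C.city, C.state, S.country
    FROM Cities C JOIN States S ON C.state = S.state
    ORDER BY C.locid
""").fetchall()
n_states = conn.execute("SELECT COUNT(*) FROM States").fetchone()[0]
```

Both queries return the same rows; the snowflake form removes the redundancy at the price of one more join per query.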
Online Aggregation
Consider an aggregate query, e.g., finding the average sales by state. Can we provide the user with some information before the exact average is computed for all states?
Can show the current running average for each state as the computation proceeds.
Even better, if we use statistical techniques and sample tuples to aggregate instead of simply scanning the aggregated table, we can provide bounds such as "the average for Oregon is 2000 ± 102 with 95% probability".
Should also use nonblocking algorithms!
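A running average with an error bound can be sketched as a generator that reports after every batch of sampled tuples. The code below is an illustration only: it assumes the tuples can be visited in random order, uses a normal approximation (z = 1.96 for roughly 95%) without finite-population correction, and the function name and the synthetic data are invented.

```python
import math
import random

def online_average(values, batch=100, z=1.96, seed=0):
    """Yield (n, running_mean, half_width) after each batch of sampled tuples."""
    rng = random.Random(seed)
    order = list(values)
    rng.shuffle(order)             # random order: every prefix is a random sample
    n, mean, m2 = 0, 0.0, 0.0      # Welford's running mean and squared deviations
    for x in order:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n % batch == 0 or n == len(order):
            std = math.sqrt(m2 / (n - 1)) if n > 1 else 0.0
            yield n, mean, z * std / math.sqrt(n)

# Synthetic sales amounts for one state.
data_rng = random.Random(42)
oregon_sales = [data_rng.gauss(2000, 500) for _ in range(10_000)]
estimates = list(online_average(oregon_sales))
```

Because the function is a generator, it is nonblocking in the sense of the slide: the caller can display each (mean, half-width) pair as soon as it is produced and stop early once the interval is tight enough.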
25.6 Implementation Issues
New indexing techniques: Bitmap indexes, Join indexes, array representations, compression, precomputation of aggregations, etc.
E.g., Bitmap index:
Bit-vector: 1 bit for each possible value.

sex (M F)   custid  name  sex  rating   rating (1 2 3 4 5)
10          112     Joe   M    3        00100
10          115     Ram   M    5        00001
01          119     Sue   F    5        00001
10          112     Woo   M    4        00010
Bitmap Indexes
Work when an attribute has few values, e.g. gender or rating
Advantage: Small enough to fit in memory
Many queries can be answered by bit-vector ops, e.g. females with rating = 3.
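A bitmap index is easy to sketch with Python integers used as bit vectors. This version keeps one vector per attribute value, with bit i set when row i has that value; the customer rows are the ones from the example table, while the function names are invented for illustration.

```python
customers = [
    (112, "Joe", "M", 3),
    (115, "Ram", "M", 5),
    (119, "Sue", "F", 5),
    (112, "Woo", "M", 4),
]

def build_bitmap_index(rows, col):
    """Map each distinct value to an int whose bit i is set if row i has that value."""
    index = {}
    for i, row in enumerate(rows):
        index[row[col]] = index.get(row[col], 0) | (1 << i)
    return index

sex_idx = build_bitmap_index(customers, 2)
rating_idx = build_bitmap_index(customers, 3)

def matching_rows(bits):
    """Return the rows whose bit is set in the given bit vector."""
    return [customers[i] for i in range(len(customers)) if (bits >> i) & 1]

# A conjunctive query is a bitwise AND of the two vectors.
females_rating_3 = matching_rows(sex_idx.get("F", 0) & rating_idx.get(3, 0))
males_rating_5 = matching_rows(sex_idx.get("M", 0) & rating_idx.get(5, 0))
```

With this sample data no female customer has rating 3, so the first query returns an empty list without touching the table rows at all.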
25.7 Constructing a Data Warehouse
Extract
Is the data in native format?
Clean
How many ways can you spell Mr.?
Errors, missing information
Transform
Fix semantic mismatches.
E.g. Last+first vs. Name
Load
Do it in parallel or else.
Refresh
Both data and indexes
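The steps above can be sketched as a toy pipeline. Everything here is hypothetical: two invented source layouts (one with last and first name, one with a single name field), a short list of courtesy titles, and an in-memory list standing in for the warehouse. It is meant only to show where cleaning and transformation sit between extract and load.

```python
import re

# Hypothetical source rows in two layouts, as might come from different systems.
source_a = [{"last": "Smith", "first": "Ann", "title": "Mrs"}]
source_b = [{"name": "mr. Bob  Jones"}, {"name": "Mister Al Wu"}, {"name": ""}]

# A few spellings of courtesy titles; a real list would be much longer.
TITLE = re.compile(r"^(mr|mister|mrs|ms|dr)\.?\s+", re.IGNORECASE)

def clean(name):
    """Strip courtesy titles and collapse whitespace; None marks missing data."""
    name = " ".join(name.split())
    name = TITLE.sub("", name)
    return name or None

def transform(row):
    """Fix the semantic mismatch: map both layouts onto a single name attribute."""
    if "name" in row:
        return clean(row["name"])
    return clean(f"{row['first']} {row['last']}")

def etl(*sources):
    warehouse, rejected = [], 0
    for source in sources:                 # extract
        for row in source:
            name = transform(row)          # clean and transform
            if name is None:
                rejected += 1              # missing information
            else:
                warehouse.append(name)     # load
    return warehouse, rejected

warehouse, rejected = etl(source_a, source_b)
```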
25.8,9 Views and Decision Support
In large databases, precomputation is necessary for decent response times
Examples: brain, Google
Example: Precompute daily sums for the cube.
What can be derived from those precomputations?
These precomputed queries are called Materialized Views (SQL Server: Indexed views).
Materialized View Example

Mat. View:
CREATE VIEW DailySum(date, sumamt)
AS SELECT date, SUM(amt)
FROM Times JOIN Sales USING (timeid)
GROUP BY date

Query:
SELECT week, SUM(amt)
FROM Times JOIN Sales USING (timeid)
GROUP BY week

Modified Query:
SELECT week, SUM(sumamt)
FROM Times JOIN DailySum USING (date)
GROUP BY week
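The rewrite can be checked on a small database. SQLite has no materialized views, so this sketch (Python's sqlite3 module, invented rows) stores DailySum as an ordinary table built with CREATE TABLE AS. It joins DailySum to Times on date, the only column the view shares with Times, and it assumes Times holds exactly one row per date; with several rows per date the join would count daily sums more than once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Times (timeid INTEGER PRIMARY KEY, date TEXT, week INTEGER);
CREATE TABLE Sales (timeid INTEGER, amt INTEGER);
INSERT INTO Times VALUES (1, '2015-03-02', 10), (2, '2015-03-03', 10),
                         (3, '2015-03-09', 11);
INSERT INTO Sales VALUES (1, 25), (1, 8), (2, 30), (3, 20), (3, 50);
-- Stand-in for the materialized view: the query result stored as a table.
CREATE TABLE DailySum AS
    SELECT date, SUM(amt) AS sumamt
    FROM Times JOIN Sales USING (timeid)
    GROUP BY date;
""")

# Original query: joins Times with the large fact table.
original = conn.execute("""
    SELECT week, SUM(amt) FROM Times JOIN Sales USING (timeid)
    GROUP BY week ORDER BY week
""").fetchall()
# Modified query: joins Times with the much smaller DailySum.
modified = conn.execute("""
    SELECT week, SUM(sumamt) FROM Times JOIN DailySum USING (date)
    GROUP BY week ORDER BY week
""").fetchall()
```

Weekly sums can be derived from daily sums because SUM is distributive; the same would not work directly for an aggregate such as a median.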
Pros and Cons of Materialized Views
Pro: Modified query is a join of two small tables; original query is a join with one huge table.
Con: Materialized views take up space, need to be updated.
A Materialized View is an Index
Recall the definition of an index
Data structure that provides fast access to data
Table indexes were of the form {(value, pointer)}, perhaps at leaf level of a search structure. This is different.
Needs to be maintained as underlying tables change.
Ideally, we want incremental view maintenance algorithms.
What views should we materialize?
Remember the software that automatically chooses optimal index configurations?
The same software will choose optimal materialized views, given a workload and available space.
What about the optimizer?
Given a query and a set of materialized views, can we use the materialized views to answer the query?
This is tricky. Best reference is [348]
Refreshing Materialized Views
How often should we refresh the materialized view?
Many enterprises refresh warehouse data only weekly/nightly, so can afford to completely rebuild their materialized views.
Others want their warehouses to be current, so materialized views must be updated incrementally if possible.
Let's look at some simple examples.
25.10 Maintaining Materialized Views*
Incremental view maintenance
Defn: make changes in view that correspond to changes in the base tables
Example: V = SELECT a FROM R
How is V modified if r is inserted to R?
How is V modified if r is deleted from R?
Maintaining Materialized Views*
Consider V = R ⋈ S
How is V modified if r is deleted from R?
Consider V = SELECT g, COUNT(*) FROM R GROUP BY g
How is V modified if r is deleted from R?
For more general cases, see [348]
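One possible answer to these questions can be sketched in plain Python. The class and function names are invented, R is assumed to contain no duplicate tuples, and the join view is kept as a list of (r, s) pairs. The count view handles a deletion by decrementing one group and dropping the group when it reaches zero; the same counting idea answers the projection case, since a value stays in SELECT a FROM R only while some tuple still produces it.

```python
from collections import Counter

class CountView:
    """V = SELECT g, COUNT(*) FROM R GROUP BY g, maintained incrementally."""

    def __init__(self, rows, key):
        self.key = key
        self.counts = Counter(key(r) for r in rows)

    def insert(self, r):
        self.counts[self.key(r)] += 1

    def delete(self, r):
        g = self.key(r)
        self.counts[g] -= 1
        if self.counts[g] == 0:      # last tuple of the group: drop the group
            del self.counts[g]

def delete_from_join(view, r):
    """V = R JOIN S: deleting r from R removes every joined tuple built from r."""
    return [(r2, s) for (r2, s) in view if r2 != r]

# Usage on invented data: R(g, k) and S(k, v), joined on k.
R = [("x", 1), ("x", 2), ("y", 3)]
S = [(1, "a"), (1, "b"), (3, "c")]

counts = CountView(R, key=lambda r: r[0])
counts.delete(("y", 3))              # group y disappears
counts.insert(("z", 4))              # new group z appears
counts.delete(("x", 1))              # group x shrinks but stays

joined = [(r, s) for r in R for s in S if r[1] == s[0]]
joined_after = delete_from_join(joined, ("x", 1))
```

Neither update rescans R, which is the point of incremental maintenance: the cost depends on the size of the change, not on the size of the base tables.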
