
Lecture 10: Data Warehouses

Introduction
Operational vs. Warehouse

Multidimensional Data
Examples
MOLAP vs ROLAP
Dimensional Hierarchies
OLAP Queries
Demos
Comparison with SQL Queries
CUBE Operator
Multidimensional Design
Star/Snowflake Schemas

03/07/15

Online Aggregation
Implementation Issues
Bitmap Index

Constructing a Data Warehouse
Views
Materialized View Example
Materialized View is an Index
Issues in Materialized Views
Maintaining Materialized Views

Introduction
In the late 1980s and early 1990s, companies began to use their DBMSs for complex, interactive, exploratory analysis of historical data.
This was called Decision Support (DS), or On-Line Analytic Processing (OLAP).
DS queries slowed down the day-to-day operation of the company, which is called On-Line Transaction Processing (OLTP).
This led to the creation of Data Warehouses, separate from the operational databases.

Operational vs Data Warehouse Requirements

OLTP / Operational / Production                 DSS / Warehouse / DataMart
Operate the business / Clerks                   Diagnose the business / Managers
Short queries, small amounts of data            Opposite
Queries change data                             Opposite
Customer inquiry, Order Entry, etc.             OLAP, Statistics, Visualization, Data Mining, etc.
Legacy Applications, Heterogeneous databases    Opposite
Often Distributed                               Often Centralized (Warehouse)
Current data                                    Current and Historical data

Operational vs Data Warehouse Requirements, ctd

Operational                                              Warehouse
General E-R Diagrams                                     Multidimensional data model common
Locks necessary                                          No Locks necessary
Crash recovery required                                  Crash recovery optional
Smaller volume of data                                   Huge volume of data
Need indexes designed to access small amounts of data    Need indexes designed to access large volumes of data

Data Warehousing

Integrated data spanning long time periods, often augmented with summary information.
Several terabytes to petabytes common.
Interactive response times expected for complex queries; ad-hoc updates uncommon.

[Figure: Operational Data flows through EXTRACT, TRANSFORM, LOAD, and REFRESH into the DATA WAREHOUSE, which has a Metadata Repository and SUPPORTS OLAP and DATA MINING.]

Multidimensional Data
In order to support OLAP, warehouse data is often structured multidimensionally, as measures and dimensions.
    Measure: numeric attribute, e.g. sales amount.
    Dimension: attribute categorizing the measure, e.g. product, store, date of sale.
The fact table has a foreign key for each dimension, plus an attribute for each measure.
There will also be a dimension table for each dimension.
In the examples that follow, each fact table is listed first, followed by its dimension tables (red and green, respectively, on the original slides).

Examples of Multidimensional Data

Purchase(ProductID, StoreID, DateID, Amt)
    Product(ID, SKU, size, brand)
    Store(ID, Address, Sales District, Region, Manager)
    Date(ID, Week, Month, Holiday, Promotion)

Claims(ProvID, MembID, Procedure, DateID, Cost)
    Providers(ID, Practice, Address, ZIP, City, State)
    Members(ID, Contract, Name, Address)
    Procedure(ID, Name, Type)

Telecomm(CustID, SalesRepID, ServiceID, DateID)
    SalesRep(ID, Address, Sales District, Region, Manager)
    Service(ID, Name, Category)
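
As a rough illustration of how a fact table and its dimension tables fit together, the sketch below builds a cut-down version of the Purchase example in an in-memory SQLite database. The rows, the reduced column lists, and the helper name `total_by_brand` are assumptions made for the example; the Date dimension is left out for brevity.

```python
import sqlite3

# Hypothetical miniature of the Purchase example; all rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Product (ID INTEGER PRIMARY KEY, SKU TEXT, size TEXT, brand TEXT);
    CREATE TABLE Store   (ID INTEGER PRIMARY KEY, Address TEXT, Region TEXT);
    CREATE TABLE Purchase (
        ProductID INTEGER REFERENCES Product(ID),  -- foreign key to a dimension
        StoreID   INTEGER REFERENCES Store(ID),    -- foreign key to a dimension
        Amt       REAL                             -- the measure
    );
    INSERT INTO Product VALUES (1, 'A-100', 'small', 'Acme'), (2, 'B-200', 'large', 'Bolt');
    INSERT INTO Store   VALUES (10, '1 Main St', 'West'), (20, '9 Oak Ave', 'East');
    INSERT INTO Purchase VALUES (1, 10, 5.0), (1, 20, 7.5), (2, 10, 20.0);
""")


def total_by_brand(connection):
    """Aggregate the measure Amt over the brand attribute of the Product dimension."""
    return connection.execute("""
        SELECT p.brand, SUM(f.Amt)
        FROM Purchase f JOIN Product p ON f.ProductID = p.ID
        GROUP BY p.brand
        ORDER BY p.brand
    """).fetchall()


print(total_by_brand(conn))
```

The fact table holds only keys and the measure; descriptive attributes such as brand live in the dimension table and are reached through a join.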


MOLAP vs ROLAP

Multidimensional data can be stored physically in a (disk-resident, persistent) array; these are called MOLAP systems. Alternatively, it can be stored as a relation; these are called ROLAP systems.
The main relation, which relates dimensions to a measure, is called the fact table. Each dimension can have additional attributes and an associated dimension table.
    E.g., Sales(pid, locid, timeid, amt)
    Fact tables are much larger than dimension tables.


25.2 Multidimensional Data Model

Collection of numeric measures, which depend on a set of dimensions.
E.g., measure Amt, dimensions Product (key: pid), Location (locid), and Time (timeid).

Slice locid=1 is shown:

             timeid
    pid      1     2     3
    11       25    8     15
    12       30    20    50
    13       8     10    10

The same data as a relation:

    pid   timeid   locid   amt
    11    1        1       25
    11    2        1       8
    11    3        1       15
    12    1        1       30
    12    2        1       20
    12    3        1       50
    13    1        1       8
    13    2        1       10
    13    3        1       10
    11    1        2       35
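
To make the array-versus-relation distinction concrete, here is a minimal sketch that stores the rows from this slide (as far as they can be read from the extraction) both ways and takes the locid=1 slice from each. The function names `build_array`, `slice_array`, and `slice_relation` are hypothetical, and a real MOLAP system would use a disk-resident array rather than nested Python lists.

```python
# Fact rows (pid, timeid, locid, amt) as read from the slide.
ROWS = [
    (11, 1, 1, 25), (11, 2, 1, 8), (11, 3, 1, 15),
    (12, 1, 1, 30), (12, 2, 1, 20), (12, 3, 1, 50),
    (13, 1, 1, 8), (13, 2, 1, 10), (13, 3, 1, 10),
    (11, 1, 2, 35),
]


def build_array(rows):
    """MOLAP-style storage: a dense 3-D array addressed by dimension position."""
    pids = sorted({r[0] for r in rows})
    timeids = sorted({r[1] for r in rows})
    locids = sorted({r[2] for r in rows})
    cube = [[[None for _ in locids] for _ in timeids] for _ in pids]
    for pid, timeid, locid, amt in rows:
        cube[pids.index(pid)][timeids.index(timeid)][locids.index(locid)] = amt
    return cube, pids, timeids, locids


def slice_array(cube, locids, locid):
    """Fix one location and return the pid x timeid grid of amounts."""
    k = locids.index(locid)
    return [[cell[k] for cell in per_pid] for per_pid in cube]


def slice_relation(rows, locid):
    """ROLAP-style slice: an equality selection on the fact relation."""
    return [r for r in rows if r[2] == locid]


cube, pids, timeids, locids = build_array(ROWS)
print(slice_array(cube, locids, 1))
print(len(slice_relation(ROWS, 1)))
```

The array wastes cells on combinations with no sales (stored as `None` here), while the relation stores only the rows that exist; that trade-off is one reason both designs are in use.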

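Dimension Hierarchies

For each dimension, some of the attributes may be organized in a hierarchy:

[Figure: hierarchies for the PRODUCT, TIME, and LOCATION dimensions. Levels legible in the extracted slide: PRODUCT: category, pname, PID; TIME: year, quarter, week, date; LOCATION: state.]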

25.3 OLAP Queries

Influenced by SQL and by spreadsheets.
A common operation is to aggregate a measure over one or more dimensions.
    Find total sales.
    Find total sales for each city, or for each state.
    Find top five products ranked by total sales.
Roll-up: Aggregating at different levels of a dimension hierarchy.
    E.g., Given total sales by city, we can roll-up to get sales by state.
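
A roll-up from city to state can be sketched in a few lines. The city names, amounts, and the `roll_up` helper below are invented for illustration; the only thing taken from the slide is the idea of re-aggregating city totals one level up the Location hierarchy.

```python
from collections import defaultdict

# Hypothetical city -> state mapping (one step of the Location hierarchy).
CITY_TO_STATE = {"Portland": "OR", "Eugene": "OR", "Fresno": "CA"}


def roll_up(sales_by_city, city_to_state):
    """Re-aggregate per-city totals into per-state totals."""
    totals = defaultdict(int)
    for city, amt in sales_by_city.items():
        totals[city_to_state[city]] += amt
    return dict(totals)


sales_by_city = {"Portland": 100, "Eugene": 76, "Fresno": 223}
print(roll_up(sales_by_city, CITY_TO_STATE))
```

Because SUM is distributive, the state totals can be computed from the city totals without going back to the individual sales; drill-down, by contrast, needs the finer-grained data.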


OLAP Queries

Drill-down: The inverse of roll-up.
    E.g., Given total sales by state, can drill-down to get total sales by city.
    E.g., Can also drill-down on a different dimension to get total sales by product for each state.
Pivoting: Aggregation on selected dimensions.
    E.g., Pivoting on State and Year yields this cross-tabulation:

             OR     CA     Total
    2007     63     81     144
    2008     38     107    145
    2009     75     35     110
    Total    176    223    399

Slicing and Dicing: Equality and range selections on one or more dimensions.

Cognos Demo

Now we watch a demo of Cognos (bought by IBM).
    Dimensions: Products, Margin ranges
    Measure: Order value (sales)
First pivot from Product dimension to Margin Range.
    Notice how quickly the cube changes.
Slice to Low Margin, pivot to Product and Company Region.
    Drill down to High Tech, IDES AG.
    Now the guilty product is clear.

Tableau Demo

http://www.tableausoftware.com/products/tour2

Note the many measures.
    Pivot on sales, date (drill down to month), region as color.
    Clear date, pivot on product and drill down on subcategory.
    Change region from color to rows.
    Move profit into color.
    Change bars to circles.
    Pivot on dates (columns).

Comparison with SQL Queries

The cross-tabulation obtained by pivoting can also be computed using a collection of SQL queries:

    SELECT T.year, L.state, SUM(S.amt)
    FROM Sales S, Times T, Locations L
    WHERE S.timeid=T.timeid AND S.locid=L.locid
    GROUP BY T.year, L.state

    SELECT T.year, SUM(S.amt)
    FROM Sales S, Times T
    WHERE S.timeid=T.timeid
    GROUP BY T.year

    SELECT L.state, SUM(S.amt)
    FROM Sales S, Locations L
    WHERE S.locid=L.locid
    GROUP BY L.state
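
The three queries can be tried out against a toy database. The sketch below uses SQLite through Python's standard `sqlite3` module; the table contents are an assumption, chosen so that each (year, state) cell matches the Year/State cross-tabulation shown in the OLAP Queries section, with one Sales row per cell.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Times (timeid INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE Locations (locid INTEGER PRIMARY KEY, state TEXT);
    CREATE TABLE Sales (timeid INTEGER, locid INTEGER, amt INTEGER);
    INSERT INTO Times VALUES (1, 2007), (2, 2008), (3, 2009);
    INSERT INTO Locations VALUES (1, 'OR'), (2, 'CA');
    INSERT INTO Sales VALUES (1, 1, 63), (1, 2, 81), (2, 1, 38),
                             (2, 2, 107), (3, 1, 75), (3, 2, 35);
""")

# Body of the cross-tabulation: one row per (year, state) cell.
cells = conn.execute("""
    SELECT T.year, L.state, SUM(S.amt)
    FROM Sales S, Times T, Locations L
    WHERE S.timeid = T.timeid AND S.locid = L.locid
    GROUP BY T.year, L.state
""").fetchall()

# Right-hand margin: totals per year.
by_year = conn.execute("""
    SELECT T.year, SUM(S.amt)
    FROM Sales S, Times T
    WHERE S.timeid = T.timeid
    GROUP BY T.year
""").fetchall()

# Bottom margin: totals per state.
by_state = conn.execute("""
    SELECT L.state, SUM(S.amt)
    FROM Sales S, Locations L
    WHERE S.locid = L.locid
    GROUP BY L.state
""").fetchall()

print(sorted(cells), sorted(by_year), sorted(by_state))
```

The grand total in the corner of the cross-tabulation would need one more query with no GROUP BY at all.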

The CUBE Operator

Generalizing the previous example, if there are k dimensions, we have 2^k possible SQL GROUP BY queries that can be generated through pivoting on a subset of dimensions.

    CUBE pid, locid, timeid BY SUM Sales

Equivalent to rolling up Sales on all eight subsets of the set {pid, locid, timeid}; each roll-up corresponds to an SQL query of the form:

    SELECT SUM(S.amt)
    FROM Sales S
    GROUP BY grouping-list

Lots of work on optimizing the CUBE operator!
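
The 2^k count is easy to check by enumerating the grouping lists. The sketch below generates one query string per subset of the dimensions; the function name `cube_queries` and its default arguments are assumptions for the example, and it only builds the query text rather than executing anything. Some SQL systems offer this directly as `GROUP BY CUBE (...)`.

```python
from itertools import combinations


def cube_queries(dimensions, measure="amt", table="Sales"):
    """Return one GROUP BY query per subset of the dimensions (2^k in total)."""
    queries = []
    for size in range(len(dimensions) + 1):
        for subset in combinations(dimensions, size):
            if subset:
                cols = ", ".join(subset)
                queries.append(
                    f"SELECT {cols}, SUM({measure}) FROM {table} GROUP BY {cols}"
                )
            else:
                # The empty grouping list is the grand total.
                queries.append(f"SELECT SUM({measure}) FROM {table}")
    return queries


for query in cube_queries(["pid", "locid", "timeid"]):
    print(query)
```

Running the eight queries independently repeats a lot of work, since coarser groupings can be derived from finer ones; that redundancy is what CUBE optimizations exploit.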

Example Multidimensional Design

    TIMES(timeid, date, week, month, quarter, year, holiday_flag)
    PRODUCTS(pid, pname, category, price)
    LOCATIONS(locid, city, state, country)
    SALES(pid, timeid, locid, amt)    (fact table)

This kind of schema is very common in OLAP applications.
It is called a star schema.
What is wrong with it?

Star/Snowflake Schemas

Why normalize?
    Space
    Redundancy, anomalies
Why unnormalize?
    Performance
Which is more important in Data Warehouses?
If the dimension tables are normalized, it is a snowflake schema.

Online Aggregation

Consider an aggregate query, e.g., finding the average sales by state. Can we provide the user with some information before the exact average is computed for all states?
    Can show the current running average for each state as the computation proceeds.
    Even better, if we use statistical techniques and sample tuples to aggregate instead of simply scanning the aggregated table, we can provide bounds such as "the average for Oregon is 2000 ± 102 with 95% probability."
    Should also use nonblocking algorithms!
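
One way to picture online aggregation is a generator that reads tuples in random order and periodically reports a running mean with an approximate 95% interval. The sketch below is an illustration under simplifying assumptions: it uses a normal approximation (z = 1.96), ignores the finite-population correction, and the sales figures are synthetic. The name `online_average` and its parameters are hypothetical.

```python
import math
import random


def online_average(values, report_every=1000, z=1.96, seed=0):
    """Yield (n, running_mean, half_width) while consuming values in random order."""
    rng = random.Random(seed)
    order = list(values)
    rng.shuffle(order)  # random order makes each prefix a random sample
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's method)
    for x in order:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n % report_every == 0 or n == len(order):
            variance = m2 / (n - 1) if n > 1 else 0.0
            yield n, mean, z * math.sqrt(variance / n)


# Synthetic sales amounts for a single state.
rng = random.Random(42)
sales = [rng.gauss(2000, 500) for _ in range(5000)]
for n, mean, half_width in online_average(sales):
    print(f"after {n} tuples: {mean:.0f} +/- {half_width:.0f}")
```

Because the generator yields as it goes, the caller sees estimates before the scan finishes, which is the nonblocking behavior the slide asks for; the interval narrows roughly with the square root of the number of tuples read.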


25.6 Implementation Issues

New indexing techniques: Bitmap indexes, Join indexes, array representations, compression, precomputation of aggregations, etc.
E.g., Bitmap index (bit-vector: 1 bit for each possible value):

    sex (M F)   custid   name   sex   rating   rating (1 2 3 4 5)
    10          112      Joe    M     3        00100
    10          115      Ram    M     5        00001
    01          119      Sue    F     5        00001
    10          112      Woo    M     4        00010

Bitmap Indexes
Work when an attribute has few distinct values, e.g. gender or rating.
Advantage: small enough to fit in memory.
Many queries can be answered by bit-vector operations, e.g. females with rating = 3: AND the bit-vector for sex = F with the bit-vector for rating = 3.
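
The bit-vector AND can be sketched with Python integers used as bit-vectors. Note the orientation: this sketch keeps one bit-vector per attribute value with one bit per row, which is the column-wise reading of the table on the previous slide. The customer rows come from that slide; the helpers `build_bitmap` and `matching_rows` are hypothetical names.

```python
# Customer rows (custid, name, sex, rating) from the bitmap index slide.
CUSTOMERS = [
    (112, "Joe", "M", 3),
    (115, "Ram", "M", 5),
    (119, "Sue", "F", 5),
    (112, "Woo", "M", 4),
]


def build_bitmap(rows, column):
    """Map each distinct value to an int whose bit i is set when row i has that value."""
    bitmaps = {}
    for i, row in enumerate(rows):
        bitmaps[row[column]] = bitmaps.get(row[column], 0) | (1 << i)
    return bitmaps


def matching_rows(rows, bits):
    """Return the rows whose positions are set in the bit-vector."""
    return [row for i, row in enumerate(rows) if (bits >> i) & 1]


sex_index = build_bitmap(CUSTOMERS, 2)
rating_index = build_bitmap(CUSTOMERS, 3)

# Conjunctive selections become a bitwise AND of two bit-vectors.
females_rating_5 = sex_index["F"] & rating_index[5]
females_rating_3 = sex_index["F"] & rating_index[3]
print(matching_rows(CUSTOMERS, females_rating_5))
print(matching_rows(CUSTOMERS, females_rating_3))
```

With this data the query for females with rating 3 matches no rows, and the base table never has to be touched to find that out.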


25.7 Constructing a Data Warehouse

Extract
    Is the data in native format?
Clean
    How many ways can you spell Mr.?
    Errors, missing information
Transform
    Fix semantic mismatches.
    E.g. Last+first vs. Name
Load
    Do it in parallel, or it will take too long.
Refresh
    Both data and indexes
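
The Clean and Transform steps can be pictured with two tiny helpers. Everything below is an illustrative assumption: the list of title spellings, the choice to return `None` for unknown titles, and the "First Last" target format are not from the slides.

```python
# Hypothetical spellings of the same title found in different source systems.
TITLE_VARIANTS = {
    "mr": "Mr.", "mr.": "Mr.", "mister": "Mr.",
    "ms": "Ms.", "ms.": "Ms.",
}


def clean_title(raw):
    """Clean: map known spelling variants to one canonical form.

    Returns None for anything unrecognized so it can be flagged for review.
    """
    return TITLE_VARIANTS.get(raw.strip().lower())


def to_name(last, first):
    """Transform: merge a source's separate last/first fields into one Name field."""
    return f"{first.strip()} {last.strip()}"


print(clean_title(" MR "), to_name("Smith ", " Ann"))
```

Returning `None` instead of guessing keeps errors and missing information visible rather than silently loading them into the warehouse.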


25.8, 25.9 Views and Decision Support

In large databases, precomputation is necessary for decent response times.
    Examples: brain, Google
Example: Precompute daily sums for the cube.
    What can be derived from those precomputations?
These precomputed queries are called Materialized Views (SQL Server: Indexed views).

Materialized View Example

Materialized view:
    CREATE VIEW DailySum(date, sumamt)
    AS SELECT date, SUM(amt)
    FROM Times JOIN Sales USING (timeid)
    GROUP BY date

Query:
    SELECT week, SUM(amt)
    FROM Times JOIN Sales USING (timeid)
    GROUP BY week

Modified query:
    SELECT week, SUM(sumamt)
    FROM Times JOIN DailySum USING (date)
    GROUP BY week
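
A quick check that the weekly totals can be answered from the daily sums, sketched with SQLite. SQLite has no materialized views, so the view is stored here as an ordinary table built with CREATE TABLE ... AS; the sample rows are made up, Times is assumed to have one row per date, and the rewritten query joins DailySum to Times on date because DailySum contains only date and sumamt.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Times (timeid INTEGER PRIMARY KEY, date TEXT, week INTEGER);
    CREATE TABLE Sales (timeid INTEGER, amt INTEGER);
    INSERT INTO Times VALUES (1, '2015-03-02', 10), (2, '2015-03-03', 10),
                             (3, '2015-03-09', 11);
    INSERT INTO Sales VALUES (1, 5), (1, 7), (2, 20), (3, 1), (3, 2);

    -- Stand-in for the materialized view: the query result stored as a table.
    CREATE TABLE DailySum AS
        SELECT date, SUM(amt) AS sumamt
        FROM Times JOIN Sales USING (timeid)
        GROUP BY date;
""")

# Original query: joins the dimension table with the (large) fact table.
original = conn.execute("""
    SELECT week, SUM(amt)
    FROM Times JOIN Sales USING (timeid)
    GROUP BY week ORDER BY week
""").fetchall()

# Rewritten query: joins the dimension table with the (small) stored sums.
modified = conn.execute("""
    SELECT week, SUM(sumamt)
    FROM Times JOIN DailySum USING (date)
    GROUP BY week ORDER BY week
""").fetchall()

print(original, modified)
```

The rewrite works because a sum of daily sums equals the sum over the underlying sales; it would not carry over unchanged to an aggregate such as AVG.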

Pros and Cons of Materialized Views

Pro: Modified query is a join of two small tables; original query is a join with one huge table.
Con: Materialized views take up space and need to be updated.

A Materialized View is an Index

Recall the definition of an index:
    Data structure that provides fast access to data.
Table indexes were of the form {(value, pointer)}, perhaps at the leaf level of a search structure. A materialized view is different: it stores precomputed query results rather than pointers.
Needs to be maintained as underlying tables change.
Ideally, we want incremental view maintenance algorithms.

What views should we materialize?

Remember the software that automatically chooses optimal index configurations?
The same software will choose optimal materialized views, given a workload and available space.

What about the optimizer?

Given a query and a set of materialized views, can we use the materialized views to answer the query?
This is tricky. Best reference is [348].

Refreshing Materialized Views

How often should we refresh the materialized view?
Many enterprises refresh warehouse data only weekly/nightly, so can afford to completely rebuild their materialized views.
Others want their warehouses to be current, so materialized views must be updated incrementally if possible.
Let's look at some simple examples.

25.10 Maintaining Materialized Views*

Incremental view maintenance
    Defn: make changes in the view that correspond to changes in the base tables.
Example: V = SELECT a FROM R
    How is V modified if r is inserted into R?
    How is V modified if r is deleted from R?
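
One possible answer for the projection view, sketched under the assumption that V is a set of distinct a-values (SELECT DISTINCT semantics): keep a count per value, so that a delete removes the value from V only when its last supporting tuple disappears. The class name `ProjectionView` and the dict-shaped tuples are hypothetical, and deletes are assumed to refer to tuples that are actually in R.

```python
from collections import Counter


class ProjectionView:
    """Incrementally maintained V = SELECT DISTINCT a FROM R."""

    def __init__(self):
        # Number of R tuples currently supporting each a-value.
        self.counts = Counter()

    def insert(self, r):
        """r was inserted into R: its a-value gains one supporting tuple."""
        self.counts[r["a"]] += 1

    def delete(self, r):
        """r was deleted from R: drop the a-value only if no other tuple has it."""
        self.counts[r["a"]] -= 1
        if self.counts[r["a"]] == 0:
            del self.counts[r["a"]]

    def rows(self):
        return sorted(self.counts)


view = ProjectionView()
view.insert({"a": 1, "b": "x"})
view.insert({"a": 1, "b": "y"})
view.insert({"a": 2, "b": "z"})
view.delete({"a": 1, "b": "x"})
print(view.rows())
```

Without the counts, deleting r would force a scan of R to see whether another tuple still projects to the same value.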


Maintaining Materialized Views*

Consider V = R ⋈ S
    How is V modified if r is inserted into R?
    How is V modified if r is deleted from R?
Consider V = SELECT g, COUNT(*) FROM R GROUP BY g
    How is V modified if r is inserted into R?
    How is V modified if r is deleted from R?
For more general cases, see [348].
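
For the two simple cases above, the changes can be computed from the changed tuple alone. The sketch below is one way to do it, assuming tuples are represented as dicts, the join is an equijoin on a single shared column, and deletes refer to tuples that exist; `join_delta` and `apply_count_delta` are hypothetical helper names.

```python
def join_delta(r, s_rows, key):
    """Tuples to add to (or remove from) V = R JOIN S when r is inserted into
    (or deleted from) R: r joined with its matching S tuples only."""
    return [{**r, **s} for s in s_rows if s[key] == r[key]]


def apply_count_delta(view, r, sign):
    """Maintain V = SELECT g, COUNT(*) FROM R GROUP BY g.

    sign is +1 when r is inserted into R and -1 when r is deleted from R.
    """
    g = r["g"]
    new_count = view.get(g, 0) + sign
    if new_count == 0:
        view.pop(g, None)  # a group with no tuples left disappears from V
    else:
        view[g] = new_count
    return view


S = [{"k": 1, "s": "a"}, {"k": 1, "s": "b"}, {"k": 2, "s": "c"}]
print(join_delta({"k": 1, "r": "x"}, S, "k"))

counts = {}
for row in ({"g": "x"}, {"g": "x"}, {"g": "y"}):
    apply_count_delta(counts, row, +1)
apply_count_delta(counts, {"g": "y"}, -1)
print(counts)
```

In both cases the work is proportional to the size of the change rather than the size of R, which is the point of incremental maintenance.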
