Sie sind auf Seite 1von 49

Data Warehousing

CPS216
Notes 13
Shivnath Babu

Warehousing
Growing industry: $8 billion way back in
1998
Range from desktop to huge:

Walmart:

900-CPU, 2,700 disk, 23TB


Teradata system

Lots of buzzwords, hype


slice

& dice, rollup, MOLAP, pivot, ...

Outline
What is a data warehouse?
Why a warehouse?
Models & operations
Implementing a warehouse
Future directions

What is a Warehouse?

Collection of diverse data


subject

oriented
aimed at executive, decision maker
often a copy of operational data
with value-added data (e.g., summaries, history)
integrated
time-varying
non-volatile
more
4

What is a Warehouse?

Collection of tools
gathering

data
cleansing, integrating, ...
querying, reporting, analysis
data mining
monitoring, administering warehouse

Warehouse Architecture
Client

Client
Query & Analysis

Metadata

Warehouse

Integration

Source

Source

Source
6

Motivating Examples
Forecasting
Comparing performance of units
Monitoring, detecting fraud
Visualization

Why a Warehouse?

Two Approaches:
Query-Driven

(Lazy)
Warehouse (Eager)

?
Source

Source

Query-Driven Approach

Client

Client
Mediator

Wrapper

Source

Wrapper

Wrapper

Source

Source

Advantages of Warehousing
High query performance
Queries not visible outside warehouse
Local processing at sources unaffected
Can operate when sources unavailable
Can query data not stored in a DBMS
Extra information at warehouse

Modify,

summarize (store aggregates)


Add historical information
10

Advantages of Query-Driven

No need to copy data


less

storage
no need to purchase data

More up-to-date data


Query needs can be unknown
Only query interface needed at sources
May be less draining on sources

11

OLTP vs. OLAP

OLTP: On Line Transaction Processing


Describes

processing at operational sites

OLAP: On Line Analytical Processing


Describes

processing at warehouse

12

OLTP vs. OLAP


OLTP

Mostly updates
Many small transactions
Mb-Gb of data
Raw data
Clerical users
Up-to-date data
Consistency,
recoverability critical

OLAP

Mostly reads
Queries long, complex
Gb-Tb of data
Summarized,
consolidated data
Decision-makers,
analysts as users

13

Data Marts
Smaller warehouses
Spans part of organization

e.g.,

marketing (customers, products, sales)

Do not require enterprise-wide consensus


but

long term integration problems?

14

Warehouse Models & Operators

Data Models
relations
stars

& snowflakes
cubes

Operators
slice

& dice
roll-up, drill down
pivoting
other
15

Star
product

prodId
p1
p2

name price
bolt
10
nut
5

sale oderId date


o100 1/7/97
o102 2/7/97
105 3/8/97

customer

custId
53
81
111

store

custId
53
53
111

name
joe
fred
sally

prodId
p1
p2
p1

storeId
c1
c1
c3

address
10 main
12 main
80 willow

qty
1
2
5

storeId
c1
c2
c3

city
nyc
sfo
la

amt
12
11
50

city
sfo
sfo
la
16

Star Schema

product
prodId
name
price

sale
orderId
date
custId
prodId
storeId
qty
amt

customer
custId
name
address
city

store
storeId
city

17

Terms
Fact table
Dimension tables
Measures

product
prodId
name
price

sale
orderId
date
custId
prodId
storeId
qty
amt

customer
custId
name
address
city

store
storeId
city

18

Dimension Hierarchies
sType

store

store storeId
s5
s7
s9

city

cityId
sfo
sfo
la

tId
t1
t2
t1

mgr
joe
fred
nancy

snowflake schema
constellations

region
sType tId
t1
t2

city

size
small
large

cityId pop
sfo
1M
la
5M

location
downtown
suburbs

regId
north
south

region regId
name
north cold region
south warm region
19

Cube
Fact table view:
sale

prodId storeId
p1
c1
p2
c1
p1
c3
p2
c2

Multi-dimensional cube:
amt
12
11
50
8

p1
p2

c1
12
11

c2

c3
50

dimensions = 2

20

3-D Cube
Fact table view:
sale

prodId
p1
p2
p1
p2
p1
p1

storeId
c1
c1
c3
c2
c1
c2

Multi-dimensional cube:
date
1
1
1
1
2
2

amt
12
11
50
8
44
4

day 2
day 1

p1
p2 c1
p1
12
p2
11

c1
44

c2
4
c2

c3
c3
50

dimensions = 3

21

ROLAP vs. MOLAP


ROLAP:
Relational On-Line Analytical Processing
MOLAP:
Multi-Dimensional On-Line Analytical
Processing

22

Aggregates
Add up amounts for day 1
In SQL: SELECT sum(amt) FROM SALE
WHERE date = 1
sale

prodId storeId
p1
c1
p2
c1
p1
c3
p2
c2
p1
c1
p1
c2

date
1
1
1
1
2
2

amt
12
11
50
8
44
4

81

23

Aggregates
Add up amounts by day
In SQL: SELECT date, sum(amt) FROM SALE
GROUP BY date
sale

prodId storeId
p1
c1
p2
c1
p1
c3
p2
c2
p1
c1
p1
c2

date
1
1
1
1
2
2

amt
12
11
50
8
44
4

ans

date
1
2

sum
81
48

24

Another Example
Add up amounts by day, product
In SQL: SELECT date, sum(amt) FROM SALE
GROUP BY date, prodId
sale

prodId storeId
p1
c1
p2
c1
p1
c3
p2
c2
p1
c1
p1
c2

date
1
1
1
1
2
2

amt
12
11
50
8
44
4

sale

prodId
p1
p2
p1

date
1
1
2

amt
62
19
48

rollup
drill-down
25

Aggregates
Operators: sum, count, max, min,
median, ave
Having clause
Using dimension hierarchy

average

by region (within store)


maximum by month (within date)

26

Cube Aggregation
Example: computing sums
day 2
day 1

p1
p2 c1
p1
12
p2
11

p1
p2

c1
56
11

c1
44

c2
4
c2

c3

...

c3
50

c2
4
8

rollup
drill-down

c3
50

sum

c1
67

c2
12

c3
50

129
p1
p2

sum
110
19

27

Cube Operators
day 2
day 1

p1
p2 c1
p1
12
p2
11

p1
p2

c1
56
11

c1
44

c2
4
c2

c3

...

c3
50

sale(c1,*,*)

c2
4
8

c3
50

sale(c2,p2,*)

sum

c1
67

c2
12

c3
50

129
p1
p2

sum
110
19

sale(*,*,*)
28

Extended Cube
c2
4
8
c312

p1
p2
c1
*
12

p1
p2
c1*
44

c1
56
11
c267
4

c2
44

c3
4
50

11
23

8
8

50

*
62
19
81

day 2

day 1

p1
p2
*

c3
50

* 50
48
48

*
110
19
129

sale(*,p2,*)

29

Aggregation Using Hierarchies

day 2
day 1

p1
p2 c1
p1
12
p2
11

c1
44

c2
4
c2

c3
c3
50

customer
region

country
p1
p2

region A region B
56
54
11
8

(customer c1 in Region A;
customers c2, c3 in Region B)

30

Pivoting
Fact table view:
sale

prodId storeId
p1
c1
p2
c1
p1
c3
p2
c2
p1
c1
p1
c2

Multi-dimensional cube:
date
1
1
1
1
2
2

amt
12
11
50
8
44
4

Pivot turns unique values from


one column into unique columns
in the output

day 2
day 1

p1
p2 c1
p1
12
p2
11

p1
p2

c1
56
11

c1
44

c2
4
c2

c3
c3
50

c2
4
8

c3
50

31

Derived Data

Derived Warehouse Data


indexes
aggregates
materialized

views (next slide)

When to update derived data?


Incremental vs. refresh

32

Materialized Views

sale

Define new warehouse relations using


SQL expressions
prodId storeId
p1
c1
p2
c1
p1
c3
p2
c2
p1
c1
p1
c2

date
1
1
1
1
2
2

joinTb prodId
p1
p2
p1
p2
p1
p1

amt
12
11
50
8
44
4

name
bolt
nut
bolt
nut
bolt
bolt

product

price
10
5
10
5
10
10

storeId
c1
c1
c3
c2
c1
c2

date
1
1
1
1
2
2

id
p1
p2

amt
12
11
50
8
44
4

name price
bolt
10
nut
5

does not exist


at any source

33

Processing
ROLAP servers vs. MOLAP servers
Index Structures
What to Materialize?
Algorithms

Client

Client

Query & Analysis

Metadata

Warehouse

Integration

Source

Source

Source

34

ROLAP Server

Relational OLAP Server

sale

prodId
p1
p2
p1

date
1
1
2

sum
62
19
48

tools

utilities

ROLAP
server

Special indices, tuning;


Schema is denormalized

relational
DBMS

35

MOLAP Server
Multi-Dimensional OLAP Server
Sales
B
A

M.D. tools

Product

milk
soda
eggs
soap

utilities

multidimensional
server

2 3 4
Date

could also
sit on
relational
DBMS

36

Index Structures

Traditional Access Methods


B-trees,

hash tables, R-trees, grids,

Popular in Warehouses
inverted

lists
bit map indexes
join indexes
text indexes

37

Inverted Lists
18
19

20
21
22

23
25
26

age
index

r5
r19
r37
r40

rId
r4
r18
r19
r34
r35
r36
r5
r41

name age
joe
20
fred
20
sally
21
nancy 20
tom
20
pat
25
dave
21
jeff
26

...

20
23

r4
r18
r34
r35

inverted
lists

data
records
38

Using Inverted Lists

Query:
Get

people with age = 20 and name = fred

List for age = 20: r4, r18, r34, r35


List for name = fred: r18, r52
Answer is intersection: r18

39

Bit Maps

20
23

20
21
22

1
1
0
1
1
0
0
0
0

23
25
26

age
index

bit
maps

0
0
1
0
0
0
1
0
1
1

id
1
2
3
4
5
6
7
8

name age
joe
20
fred
20
sally
21
nancy 20
tom
20
pat
25
dave
21
jeff
26

...

18
19

data
records
40

Using Bit Maps

Query:
Get

people with age = 20 and name = fred

List for age = 20: 1101100000


List for name = fred: 0100000001
Answer is intersection: 010000000000

Good if domain cardinality small


Bit vectors can be compressed

41

Join
Combine SALE, PRODUCT relations
In SQL: SELECT * FROM SALE, PRODUCT WHERE ...
sale

prodId storeId
p1
c1
p2
c1
p1
c3
p2
c2
p1
c1
p1
c2

date
1
1
1
1
2
2

joinTb prodId
p1
p2
p1
p2
p1
p1

amt
12
11
50
8
44
4

name
bolt
nut
bolt
nut
bolt
bolt

product

price
10
5
10
5
10
10

storeId
c1
c1
c3
c2
c1
c2

date
1
1
1
1
2
2

id
p1
p2

name price
bolt
10
nut
5

amt
12
11
50
8
44
4
42

Join Indexes
join index
product

sale

id
p1
p2

rId
r1
r2
r3
r4
r5
r6

name price
bolt
10
nut
5

jIndex
r1,r3,r5,r6
r2,r4

prodId storeId
p1
c1
p2
c1
p1
c3
p2
c2
p1
c1
p1
c2

date
1
1
1
1
2
2

amt
12
11
50
8
44
4

43

What to Materialize?
Store in warehouse results useful for
common queries
Example:
total sales

day 2
day 1

c1
c2
c3
p1
44
4
p2 c1
c2
c3
p1
12
50
p2
11
8

p1
p2

materialize

c1
56
11

c2
4
8

c3
50

...

p1

c1
67

c2
12

c3
50

129
p1
p2

c1
110
19

44

Materialization Factors
Type/frequency of queries
Query response time
Storage cost
Update cost

45

Cube Aggregates Lattice


129

all
c1
67

p1

c2
12

c3
50

city

city, product
p1
p2

c1
56
11

c2
4
8

product

city, date

date

product, date

c3
50

day 2
day 1

c1
c2
c3
p1
44
4
p2 c1
c2
c3
p1
12
50
p2
11
8

city, product, date

use greedy
algorithm to
decide what
to materialize
46

Dimension Hierarchies
all
cities

state

city
c1
c2

state
CA
NY

city

47

Dimension Hierarchies
all
city

city, product

product

city, date

city, product, date

date

product, date
state
state, date
state, product
state, product, date

not all arcs shown...


48

Interesting Hierarchy
time

all
years
weeks

quarters

months

day
1
2
3
4
5
6
7
8

week
1
1
1
1
1
1
1
2

month
1
1
1
1
1
1
1
1

quarter
1
1
1
1
1
1
1
1

year
2000
2000
2000
2000
2000
2000
2000
2000

conceptual
dimension table

days
49

Das könnte Ihnen auch gefallen