
IT-357:

Data-Warehousing

Motivation for Data Warehousing


"We have mountains of data in this company, but we can't access it."
"You've got to make it easy for business people to get at the data directly."
"Just show me what is important."
"It drives me crazy to have two people present the same business metrics at a meeting, but with different numbers."
"We want people to use information to support more fact-based decision making."
"We need to slice and dice the data every which way."

Motivation for Data Warehousing


The data warehouse must:
Make an organization's information easily accessible
Present the organization's information consistently
Be adaptive and resilient to change
Be secure, so that it protects the organization's information assets
Serve as the foundation for improved decision making
Be accepted by the business community if it is to be deemed successful

Typical Queries on DW
What was the total number of cell phones sold in India in 2013, grouped by company?
What was the total revenue from property sales for each type of property in Mangalore between 2006 and 2008?
What would be the effect on cell phone sales in Mangalore if a new college opened?
Which type of cell phone sells most in Mangalore?
Which was the most travelled train in India in 2013?

Data Warehouse
Defined in many different ways, but not rigorously.
A decision support database that is maintained separately from the organization's operational database
Supports information processing by providing a solid platform of consolidated, historical data for analysis

"A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process."
W. H. Inmon
Ralph Kimball

DW Subject Oriented
Organized around major subjects, such as
customer, product, sales
Focusing on the modeling and analysis of data
for decision makers, not on daily operations or
transaction processing
Provide a simple and concise view around
particular subject issues by excluding data that
are not useful in the decision support process

DW - Integrated
Constructed by integrating multiple, heterogeneous data sources
  relational databases, flat files, on-line transaction records
Data cleaning and data integration techniques are applied.
  Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
  E.g., hotel price: currency, tax, food products, etc.
When data is moved to the warehouse, it is converted.

DW Time Variant
The time horizon for the data warehouse is
significantly longer than that of operational
systems
Operational database: current value data
Data warehouse data: provide information from a
historical perspective (e.g., past 5-10 years)

Every key structure in the data warehouse contains an element of time, explicitly or implicitly
But the key of operational data may or may not contain a time element

DW Non-volatile
A physically separate store of data transformed
from the operational environment
Operational update of data does not occur in the
data warehouse environment
Does not require transaction processing, recovery,
and concurrency control mechanisms
Requires only two to three operations in data
accessing:
initial loading, incremental loading of data and
access of data

Difference between OLTP and OLAP


                    OLTP                          OLAP
users               clerk, IT professional        knowledge worker
function            day to day operations         decision support
DB design           application-oriented          subject-oriented
data                current, up-to-date;          historical;
                    detailed, flat relational;    summarized, multidimensional;
                    isolated                      integrated, consolidated
usage               repetitive                    ad-hoc/repetitive
access              read/write;                   lots of scans
                    index/hash on prim. key
unit of work        short, simple transaction     complex query
# records accessed  tens                          millions
# users             thousands                     hundreds
DB size             100MB-GB                      100GB-TB
metric              transaction throughput        query throughput, response

A Typical Data Warehouse

DW Building Operations (ETL)


Data extraction
Get data from multiple, heterogeneous, and external
sources
Data cleaning
Detect errors in the data and rectify them when
possible
Data transformation
Convert data from legacy or host format to warehouse
format
Load/Refresh
Sort, summarize, consolidate, compute views, check
integrity, and build indexes and partitions
Propagate updates from sources
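
The four ETL steps above can be sketched end to end in a few lines of Python. This is only an illustration under stated assumptions: the two source lists, the `sales` table, and the function names are made up for the example, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical source rows from two systems with different conventions.
source_a = [("C1", "asia", "100.50"), ("C2", "EUROPE", "n/a")]
source_b = [{"cust": "C3", "region": "Asia ", "amount": 70}]

def extract():
    """Extraction: pull rows from heterogeneous sources into one shape."""
    rows = [{"cust": c, "region": r, "amount": a} for c, r, a in source_a]
    rows.extend(source_b)
    return rows

def clean(rows):
    """Cleaning: keep only rows whose amount can be parsed as a number."""
    good = []
    for row in rows:
        try:
            good.append({**row, "amount": float(row["amount"])})
        except (TypeError, ValueError):
            continue  # a real pipeline would log or quarantine the row
    return good

def transform(rows):
    """Transformation: normalize region names to one convention."""
    return [{**row, "region": row["region"].strip().title()} for row in rows]

def load(rows, conn):
    """Load: insert into the warehouse table and build an index."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (cust TEXT, region TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (:cust, :region, :amount)", rows)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_sales_region ON sales (region)")
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(clean(extract())), conn)
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(totals)
```

The row with the unparseable amount is dropped during cleaning, so only the two Asia rows reach the warehouse table.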

Who does not need a Data Warehouse
If the business does not need a unified set of information/unified view
If the same person is involved in research, manufacturing, sale, support, relationship
If there is little or no latency time in the access and analysis of information
If the same product is sold over and over
If there is no need for historical data
Family businesses
Small businesses

Structure - Data Warehouse

What is Meta Data

Meta data is the data defining warehouse objects (re-usable). It stores:
Description of the structure of the data warehouse
  Schema, view, dimensions, hierarchies, derived data definition, data mart locations and contents
Operational meta-data
  History of migrated data, currency of data (active, archived, or purged), algorithms used for queries
The mapping from operational environment to the data warehouse
Data related to system performance
  Number of users, system checks performed, failures
Business data
  Business terms and definitions, ownership of data, policies (scope of DW, security)

Data Warehouse Meta Data


Metadata is one of the most important aspects of the data warehouse environment. It helps the DSS analyst find what data is in the warehouse and use that data effectively and efficiently.
Some of the components of data warehouse metadata are:
The structure and contents of the warehouse
The mapping of data into the data warehouse
The extract/transformation history
Aging/purging criteria
Ownership/stewardship information

Uses of Metadata
Some of the uses
Extraction and loading processes - metadata is used
to map data sources to a common view of information
within the warehouse
Warehouse management process - metadata is used
to automate the production of summary tables
Query management process - metadata is used to
direct a query to the most appropriate data source

RDBMS(OLTP) Vs DW (OLAP) Structure


OLTP: normalized tables

Region (Reg_ID, Reg_Name)
Reg_ID  Reg_Name
1       Europe
2       North America
3       Asia

Country (Reg_ID, Cntry_ID, Cntry_Name)
Reg_ID  Cntry_Name
1       Germany
1       Spain
2       Canada
2       Mexico
3       India
3       China

City (City_ID, Reg_ID, Cntry_ID, City_Name)
City_Name
Frankfurt
Vancouver
Toronto
Mexico City
Delhi
Beijing
Mumbai
Madrid

DW: denormalized table

Location (Region, Country, City)
Region         Country  City
Europe         Germany  Frankfurt
Europe         Spain    Madrid
North America  Canada   Vancouver
North America  Mexico   Mexico City
Asia           India    Delhi
Asia           China    Beijing
Asia           India    Mumbai

Concept Hierarchy

SQL for OLTP and OLAP


OLTP:
SELECT *
FROM region, country, city
WHERE region.reg_id = country.reg_id
  AND country.cntry_id = city.cntry_id;

OLAP:
SELECT * FROM location;

OLTP for quick Insert/Update
OLAP for quick Data Access
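
The contrast can be tried out with Python's built-in sqlite3 module. The sketch below assumes the Region/Country/City layout from the slides; the `cntry_id` and `city_id` values are invented for the example, and only a few rows are loaded. It builds the denormalized `location` table once from the three-way join and then checks that a plain scan of `location` returns the same rows as the join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE region  (reg_id INTEGER PRIMARY KEY, reg_name TEXT);
CREATE TABLE country (cntry_id INTEGER PRIMARY KEY, reg_id INTEGER, cntry_name TEXT);
CREATE TABLE city    (city_id INTEGER PRIMARY KEY, reg_id INTEGER,
                      cntry_id INTEGER, city_name TEXT);
INSERT INTO region  VALUES (1, 'Europe'), (3, 'Asia');
-- cntry_id and city_id values below are assumptions for this sketch
INSERT INTO country VALUES (10, 1, 'Germany'), (11, 1, 'Spain'), (30, 3, 'India');
INSERT INTO city    VALUES (100, 1, 10, 'Frankfurt'), (101, 1, 11, 'Madrid'),
                           (300, 3, 30, 'Delhi'),     (301, 3, 30, 'Mumbai');
""")

# OLTP style: the hierarchy is reassembled with joins on every read.
JOIN_SQL = """
    SELECT r.reg_name AS region, c.cntry_name AS country, ci.city_name AS city
    FROM region r
    JOIN country c ON c.reg_id = r.reg_id
    JOIN city ci   ON ci.cntry_id = c.cntry_id
"""
oltp_rows = conn.execute(JOIN_SQL + " ORDER BY ci.city_name").fetchall()

# OLAP style: pay for the join once at load time, then read a flat table.
conn.execute("CREATE TABLE location AS " + JOIN_SQL)
olap_rows = conn.execute(
    "SELECT region, country, city FROM location ORDER BY city"
).fetchall()

print(olap_rows)
```

The flat table repeats region and country names on every row, which is the price paid for join-free reads.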

Data Warehouse Usage


Three kinds of data warehouse applications
Information Processing
supports querying, basic statistical analysis, and reporting
using crosstabs, tables, charts and graphs
Analytical Processing
multidimensional analysis of data warehouse data
supports basic OLAP operations: slice-dice, drill down, roll-up
Data Mining
knowledge discovery from hidden patterns
supports associations, constructing analytical models,
performing classification and prediction, and presenting the
mining results using visualization tools

DW Queries Complexity
Complexity grows just by adding a maximum price:
Query 1: A simple data cube query: Find the total sales in 2004, broken down by product, region, and month, with subtotals for each dimension.
Query 2: A complex data cube query: Grouping by all subsets of {product, region, month}, find the maximum price in 2004 for each group and the total sales among all maximum-price tuples.

OLAP and Data Mining


Is OLAP data mining? NO
OLAP is a way to look at pre-aggregated query results
(Evaluate query and results)
Data Mining is building models of data
Data mining tools model data and return actionable
rules
Practically speaking: OLAP tools can be used on Data
Cubes as well as perform some form of Data Mining

Types of Data Warehouse Models


Various Type of FACTS/DIMENSION Tables
Star schema: A fact table in the middle connected to a
set of dimension tables
Snowflake schema: A refinement of star schema
where some dimensional hierarchy is normalized into
a set of smaller dimension tables, forming a shape
similar to snowflake
Fact constellations: Multiple fact tables share
dimension tables, viewed as a collection of stars,
therefore called galaxy schema or fact constellation
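
A minimal star schema can be sketched with sqlite3 to make the fact/dimension split concrete. The table names, columns, and rows below are hypothetical, loosely modeled on the cell phone query from the "Typical Queries" slide; a snowflake variant would further split `dim_location` into separate city and country tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes, one row per member.
CREATE TABLE dim_product  (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_location (location_id INTEGER PRIMARY KEY, city TEXT, country TEXT);
-- Fact table in the middle: foreign keys to the dimensions plus a measure.
CREATE TABLE fact_sales (
    product_id  INTEGER REFERENCES dim_product(product_id),
    location_id INTEGER REFERENCES dim_location(location_id),
    year        INTEGER,
    units_sold  INTEGER
);
INSERT INTO dim_product  VALUES (1, 'Phone A', 'Cell Phone'), (2, 'TV B', 'TV');
INSERT INTO dim_location VALUES (1, 'Mangalore', 'India'), (2, 'Delhi', 'India');
INSERT INTO fact_sales   VALUES (1, 1, 2013, 40), (1, 2, 2013, 60), (2, 1, 2013, 5);
""")

# A typical star join: filter on dimension attributes, aggregate the fact.
cell_phones_india_2013 = conn.execute("""
    SELECT SUM(f.units_sold)
    FROM fact_sales f
    JOIN dim_product  p ON p.product_id  = f.product_id
    JOIN dim_location l ON l.location_id = f.location_id
    WHERE p.category = 'Cell Phone' AND l.country = 'India' AND f.year = 2013
""").fetchone()[0]

print(cell_phones_india_2013)
```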

Star Schema

Note the concept hierarchy within the Dimension tables. WHY?

Snowflake Schema

Advantages and Disadvantages?

FACT Constellation

Indexes for Data Warehouse


Different types between OLTP and OLAP
Ordinary Index
Bitmap Index
JOIN Index

Ordinary Index
Surrogate Key
In most cases it is not a natural key; it is kept in addition to the business key.
Represents an object in the database, but is not visible outside it

B-Tree (B Plus Tree)


Should be used with unique or high-cardinality (near-unique) data

B Plus Tree

How can a query be made to use fewer disk accesses?

Bitmap Index
Index on a particular column
Each value in the column has a bit vector
The length of the bit vector: # of records in the
base table
Base table
Cust  Region   Type
C1    Asia     Retail
C2    Europe   Dealer
C3    Asia     Dealer
C4    America  Retail
C5    Europe   Dealer

Index on Region
RecID  Asia  Europe  America
1      1     0       0
2      0     1       0
3      1     0       0
4      0     0       1
5      0     1       0

Index on Type
RecID  Retail  Dealer
1      1       0
2      0       1
3      0       1
4      1       0
5      0       1
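
The bitmap idea can be sketched in plain Python using integers as bit vectors. The five rows are the base table from this slide; the function names are made up for the example. A predicate on two columns becomes a single bitwise AND of two vectors.

```python
# Base table from the slide: (Cust, Region, Type); RecID is position + 1.
rows = [
    ("C1", "Asia", "Retail"),
    ("C2", "Europe", "Dealer"),
    ("C3", "Asia", "Dealer"),
    ("C4", "America", "Retail"),
    ("C5", "Europe", "Dealer"),
]

def build_bitmap_index(table, column):
    """Map each distinct value to an int whose bit i is set when row i has it."""
    index = {}
    for i, row in enumerate(table):
        index[row[column]] = index.get(row[column], 0) | (1 << i)
    return index

def bit_vector(bitmap, length):
    """Render a bitmap as a string, one character per record, RecID 1 first."""
    return "".join("1" if (bitmap >> i) & 1 else "0" for i in range(length))

def matching_rec_ids(bitmap, length):
    """Return the 1-based RecIDs whose bit is set."""
    return [i + 1 for i in range(length) if (bitmap >> i) & 1]

region_idx = build_bitmap_index(rows, 1)
type_idx = build_bitmap_index(rows, 2)

# WHERE Region = 'Asia' AND Type = 'Dealer' is one AND of two bit vectors.
asia_dealers = matching_rec_ids(region_idx["Asia"] & type_idx["Dealer"], len(rows))

print(bit_vector(region_idx["Asia"], len(rows)), asia_dealers)
```

Each vector has one bit per base-table record, which is why the index stays small when the column has few distinct values.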

Usage of Bitmap Index


Static Tables
Data is not updated frequently.
Modification of a bitmap index is expensive
Bitmap indexes are a compressed index type

DSS Systems
Generally suited to low-cardinality columns (but need not be limited to them)
Suited to systems that receive changes during non-peak business hours

JOIN Index
In data warehouses, a join index relates the values of the dimensions of a star schema to rows in the fact table
E.g., Sales and two dimensions, city and product
A join index on city maintains, for each distinct city, a list of R-IDs (primary keys) of the tuples recording the sales in that city
Join indices can span multiple dimensions
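
As a rough illustration, a join index can be modeled as a dictionary from dimension values to lists of fact-table R-IDs. The fact rows and the helper name below are hypothetical; a real system would store the index on disk next to the fact table.

```python
from collections import defaultdict

# Hypothetical fact rows: (R-ID, city, product, amount).
sales = [
    ("R1", "Mangalore", "TV", 300),
    ("R2", "Delhi", "PC", 50),
    ("R3", "Mangalore", "PC", 70),
    ("R4", "Delhi", "TV", 200),
]

def build_join_index(fact_rows, *positions):
    """Map each combination of dimension values to the R-IDs of matching facts."""
    index = defaultdict(list)
    for row in fact_rows:
        key = tuple(row[p] for p in positions)
        index[key].append(row[0])
    return dict(index)

# Join index on one dimension (city) and on two dimensions (city, product).
city_index = build_join_index(sales, 1)
city_product_index = build_join_index(sales, 1, 2)

print(city_index[("Mangalore",)], city_product_index[("Delhi", "TV")])
```

Looking up a city then yields the fact rows directly, without scanning the fact table or performing the join at query time.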

Join Index

Size of a Data Warehouse and Query Cost

Number of Tuples/Records in DIMENSIONS
Number of Tuples/Records in FACT
E.g., Customer/Sale data

Cost of a query to locate a tuple using an index?

Optimizing Data Warehouse


1. Better understanding of the actual data
2. Better design (proper de-normalization)
3. Indexing:
Faster access

4. Partitioning:
Physical partitioning
Eg: Partitioning Date/Time Dimension

Data Cube
The key operation of OLAP is the formation of a data cube
Pre-computed query result
A data cube allows data to be modeled
and viewed in multiple dimensions. It is
defined by dimensions and facts
A data cube is a multidimensional
representation of data, together with all
possible aggregates

A Spreadsheet Data
Date      Location  Product  Sales
1-Jan-13  USA       TV       100
2-Jan-13  Canada    TV       250
3-Jan-13  Mexico    TV       300
4-Jan-13  Brazil    TV       200
1-Jan-13  USA       PC       50
2-Jan-13  Canada    PC       70
3-Jan-13  Mexico    PC       40
4-Jan-13  Brazil    PC       60

Represent in 3 Dimensions
Consider the previous sales data: products sold at a number of locations on various dates
This data can be represented as a 3-dimensional array
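
A short sketch of that representation, using the eight spreadsheet rows from the previous slide and nested Python lists as the 3-dimensional array. The axis order (date, location, product) and the helper name `cell` are choices made for this example only.

```python
# Spreadsheet rows from the previous slide: (date, location, product, sales).
rows = [
    ("1-Jan-13", "USA", "TV", 100),
    ("2-Jan-13", "Canada", "TV", 250),
    ("3-Jan-13", "Mexico", "TV", 300),
    ("4-Jan-13", "Brazil", "TV", 200),
    ("1-Jan-13", "USA", "PC", 50),
    ("2-Jan-13", "Canada", "PC", 70),
    ("3-Jan-13", "Mexico", "PC", 40),
    ("4-Jan-13", "Brazil", "PC", 60),
]

# One axis per dimension; each distinct value gets a position on its axis.
dates = sorted({r[0] for r in rows})
locations = sorted({r[1] for r in rows})
products = sorted({r[2] for r in rows})

# cube[date][location][product] holds the sales measure (0 where no sale).
cube = [[[0 for _ in products] for _ in locations] for _ in dates]
for date, location, product, sales in rows:
    cube[dates.index(date)][locations.index(location)][products.index(product)] += sales

def cell(date, location, product):
    """Look up one cell of the 3-D array by dimension values."""
    return cube[dates.index(date)][locations.index(location)][products.index(product)]

# Summing along the date and location axes gives the total for one product.
tv = products.index("TV")
tv_total = sum(cube[d][l][tv] for d in range(len(dates)) for l in range(len(locations)))

print(cell("2-Jan-13", "Canada", "PC"), tv_total)
```

Most cells are zero here because each location sold on only one date, which hints at why real cubes are usually stored sparsely.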

A Sample Data Cube
[Figure: 3-D data cube. Axes: Product (TV, PC, VCR, sum), Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Country (Americas, Asia Pacific, Europe, sum). Annotated cell: total annual sales of TVs in U.S.A.]

Data Cube Browsing (Tools in Practice)

Data Cube: Lattice of Cuboids 3D

0-D (apex) cuboid: all (highest level of summarization)
1-D cuboids: product; date; country
2-D cuboids: product,date; product,country; date,country
3-D (base) cuboid: product, date, country (lowest level of summarization)

A Simple Representation
Base and aggregate cells.
Consider the data cube with the DIMENSIONS Date, Product, Country and the FACT Quantity.
1D Cells: (Jan, *, *, 350)
1D Cells: (Feb, *, *, 50)
2D Cells: (Jan, *, Mexico, 70)
3D Cells: (Aug, TV, USA, 80)
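
To see how an aggregate cell relates to base cells, the sketch below sums the base cells that match a pattern, where `*` means "all values" of that dimension. The base cells are invented so that the totals reproduce the cells listed on this slide; only (Aug, TV, USA, 80) is taken from it directly.

```python
ALL = "*"

# Hypothetical base cells (Date, Product, Country, Quantity), chosen so that
# the aggregates below match the cells shown on the slide.
base_cells = [
    ("Jan", "TV", "USA", 200),
    ("Jan", "PC", "USA", 80),
    ("Jan", "TV", "Mexico", 40),
    ("Jan", "PC", "Mexico", 30),
    ("Feb", "TV", "USA", 50),
    ("Aug", "TV", "USA", 80),
]

def aggregate_cell(cells, pattern):
    """Sum the quantity of every base cell matching the pattern ('*' = any)."""
    total = 0
    for cell in cells:
        dims, quantity = cell[:-1], cell[-1]
        if all(p == ALL or p == d for p, d in zip(pattern, dims)):
            total += quantity
    return total

jan_all = aggregate_cell(base_cells, ("Jan", ALL, ALL))          # 1D cell
jan_mexico = aggregate_cell(base_cells, ("Jan", ALL, "Mexico"))  # 2D cell
aug_tv_usa = aggregate_cell(base_cells, ("Aug", "TV", "USA"))    # 3D (base) cell

print(jan_all, jan_mexico, aug_tv_usa)
```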

Data Cube: Lattice of Cuboids 4D


0-D (apex) cuboid: all (highest level of summarization)
1-D cuboids: time; product; location; supplier
2-D cuboids: time,product; time,location; time,supplier; product,location; product,supplier; location,supplier
3-D cuboids: time,product,location; time,product,supplier; time,location,supplier; product,location,supplier
4-D (base) cuboid: time, product, location, supplier (lowest level of summarization)
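
The lattice can be enumerated mechanically: an n-dimensional cube has one cuboid per subset of its dimensions, so 2^n cuboids in total when no dimension has a concept hierarchy. A small sketch (the function name `cuboids` is made up for the example):

```python
from itertools import combinations

def cuboids(dimensions):
    """Group the 2**n cuboids of an n-dimensional cube by their dimensionality."""
    return {k: list(combinations(dimensions, k)) for k in range(len(dimensions) + 1)}

lattice = cuboids(["time", "product", "location", "supplier"])
total = sum(len(level) for level in lattice.values())

for k, level in lattice.items():
    print(f"{k}-D cuboids ({len(level)}):", level)
print("total:", total)
```

Level 0 contains only the empty tuple, which corresponds to the apex cuboid, and level 4 contains the single base cuboid.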

Bottom Up Cube Calculation (BUC)

4 Dimension Computation (A, B, C, D attributes)


1D Cells: (Jan, *, *,*, 500)
2D Cells: (*, TV, *,Philips, 300)

The Compute Cube Operator


Cube definition and computation in DMQL
define cube sales [product, city, year]: sum(sales_in_dollars)
compute cube sales

Transform it into a SQL-like language (with a new operator CUBE BY, introduced by Gray et al. '96)
SELECT product, city, year, SUM(amount)
FROM SALES
CUBE BY product, city, year

Need to compute the following group-bys:
(product, city, year),
(product, city), (product, year), (city, year),
(product), (city), (year),
()

Oracle's CUBE Operator

The CUBE extension generates subtotals for all combinations of the dimensions specified
As the number of dimensions increases, so does the number of subtotal combinations that need to be calculated
SQL:
SELECT Date_Id, Product_Id, City_id, SUM(sales_value) AS sales_value
FROM dimension_tab
GROUP BY CUBE (Date_Id, Product_Id, City_id)
ORDER BY Date_Id, Product_Id, City_id;
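
SQLite has no CUBE operator, so the sketch below emulates one to show what it computes: one GROUP BY per subset of the dimensions, combined with UNION ALL, with NULL standing in for a rolled-up dimension. The `sales` table, its rows, and the helper `cube_query` are assumptions for the example; this is not how Oracle implements CUBE internally.

```python
import sqlite3
from itertools import combinations

DIMS = ["product", "city", "year"]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, city TEXT, year INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("TV", "Delhi", 2013, 100.0),
    ("TV", "Mumbai", 2013, 250.0),
    ("PC", "Delhi", 2013, 50.0),
])

def cube_query(table, dims, measure):
    """Build one SELECT per subset of dims; NULL marks a rolled-up dimension.

    Names are interpolated directly, so they must come from trusted code.
    """
    parts = []
    for k in range(len(dims), -1, -1):
        for subset in combinations(dims, k):
            cols = ", ".join(d if d in subset else f"NULL AS {d}" for d in dims)
            group = f" GROUP BY {', '.join(subset)}" if subset else ""
            parts.append(f"SELECT {cols}, SUM({measure}) AS total FROM {table}{group}")
    return " UNION ALL ".join(parts)

results = conn.execute(cube_query("sales", DIMS, "amount")).fetchall()

for row in results:
    print(row)
```

With three dimensions the generated statement contains eight SELECTs, one per group-by, which is the growth in subtotal combinations that the slide warns about.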

Attribute Oriented Induction


Proposed in 1989 (before the data cube concept was introduced)
How is it done?
Collect the task-relevant data (data for analysis) using a relational database query (initial relation)
Perform generalization by attribute removal or attribute generalization
Apply aggregation by merging identical, generalized tuples and accumulating their respective counts
Interact with users for knowledge presentation

Example for Attribute Oriented Induction


A DMQL Query
1. use DB_XYZ
mine characteristics as "Science Students"
in relevance to name, gender, major, birth place, birth date, residence, phone#, gpa
from student
where status in "graduate"
Algorithm to solve the Example


2. Removing/Generalizing Attributes:
The attribute has a large set of distinct values, and
Case 1: EITHER there is no concept hierarchy defined for the attribute
Case 2: OR its higher-level concepts are defined in terms of other attributes
Remove the attribute when either case is true.

3. Class Characterization:
Generalization will result in groups of identical tuples
Count the number of generalized (duplicate) tuples and record the count

Class Characterization
Initial relation
Name            Gender  Major    Birth_Place            Birth_date  Residence                 Phone #   GPA
Jim Woodman     M       CS       Vancouver, BC, Canada  8-12-76     3511 Main St., Richmond   687-4598  3.67
Scott Lachance  M       CS       Montreal, Que, Canada  28-7-75     345 1st Ave., Richmond    253-9106  3.70
Laura Lee       F       Physics  Seattle, WA, USA       25-8-70     125 Austin Ave., Burnaby  420-5232  3.83

Generalization applied to each attribute
Name: Removed
Gender: Retained
Major: Sci, Eng, Bus
Birth_Place: Country
Birth_date: Age range
Residence: City
Phone #: Removed
GPA: Excl, VG, ..

Generalized relation
Gender  Major    Birth_region  Age_range  Residence  GPA        Count
M       Science  Canada        20-25      Richmond   Very-good  16
F       Science  Foreign       25-30      Burnaby    Excellent  22
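
The generalize-then-merge step can be sketched as follows. Everything specific here is an assumption made for illustration: the three sample tuples carry only four attributes, the major-to-field mapping and the GPA cut-off of 3.75 are invented concept hierarchies, and the counts therefore differ from the 16 and 22 in the table above.

```python
from collections import Counter

# Hypothetical concept hierarchy for the major attribute.
MAJOR_TO_FIELD = {"CS": "Science", "Physics": "Science"}

# Three sample tuples; name is kept only to show that it gets removed.
students = [
    {"name": "Jim Woodman", "gender": "M", "major": "CS",
     "birth_place": "Vancouver, BC, Canada", "gpa": 3.67},
    {"name": "Scott Lachance", "gender": "M", "major": "CS",
     "birth_place": "Montreal, Que, Canada", "gpa": 3.70},
    {"name": "Laura Lee", "gender": "F", "major": "Physics",
     "birth_place": "Seattle, WA, USA", "gpa": 3.83},
]

def generalize(student):
    """Drop name, keep gender, and climb one level for the other attributes."""
    country = student["birth_place"].split(",")[-1].strip()
    return (
        student["gender"],                                       # retained
        MAJOR_TO_FIELD[student["major"]],                        # major -> field
        "Canada" if country == "Canada" else "Foreign",          # place -> region
        "Excellent" if student["gpa"] >= 3.75 else "Very-good",  # assumed cut-off
    )

# Merging identical generalized tuples and accumulating their counts.
generalized = Counter(generalize(s) for s in students)

for tuple_, count in generalized.items():
    print(tuple_, count)
```

The two Canadian CS students collapse into one generalized tuple with count 2, which is the merging described in step 3.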

Importance of Understanding the Data


The Developer/Tech Analyst/Bus Analyst must know the data (very) well
In relational databases the focus is on design, efficiency of data retrieval, GUI, etc. (although the data is important there too)
In OLAP databases/Data Warehouses, the data, how it is stored, and its quality become more and more important

References
Jiawei Han, Micheline Kamber, Jian Pei. Data Mining: Concepts and Techniques.
William Inmon. Building the Data Warehouse.
Ralph Kimball. The Data Warehouse Toolkit.
