DW Intro

IT-357:
Data-Warehousing
Motivation for Data Warehousing

"We have mountains of data in this company, but
we can't access it."
"You've got to make it easy for business people
to get at the data directly."
"Just show me what is important."
"It drives me crazy to have two people present
the same business metrics at a meeting, but with
different numbers."
"We want people to use information to support
more fact-based decision making."
"We need to slice and dice the data every which
way."
Motivation for Data Warehousing

The data warehouse must:
Make an organization's information easily
accessible
Present the organization's information
consistently
Be adaptive and resilient to change
Be secure, so that it protects the organization's
information assets
Serve as the foundation for improved decision
making
Be accepted by the business community if it is
to be deemed successful
Typical Queries on DW
What was the total number of cell phones sold in India
in 2013, grouped by company?
What was the total revenue from property sales for each
type of property in Mangalore between 2006 and 2008?
What would be the effect on cell phone sales in
Mangalore if a new college is opened?
Which type of cell phone sells most in Mangalore?
Which was the most travelled train in India in 2013?
Data Warehouse
Defined in many different ways, but not rigorously.
A decision support database that is maintained separately from
the organization's operational database
Supports information processing by providing a solid platform of
consolidated, historical data for analysis.
"A data warehouse is a subject-oriented, integrated, time-variant,
and nonvolatile collection of data in support of management's
decision-making process." (W. H. Inmon)
Ralph Kimball
DW Subject Oriented
Organized around major subjects, such as
customer, product, sales
Focusing on the modeling and analysis of data
for decision makers, not on daily operations or
transaction processing
Provide a simple and concise view around
particular subject issues by excluding data that
are not useful in the decision support process
DW - Integrated
Constructed by integrating multiple,
heterogeneous data sources
  relational databases, flat files, on-line transaction
  records
Data cleaning and data integration techniques are applied.
  Ensure consistency in naming conventions, encoding
  structures, attribute measures, etc. among different
  data sources
  E.g., Hotel price: currency, tax, food products, etc.
When data is moved to the warehouse, it is converted.
DW Time Variant
The time horizon for the data warehouse is
significantly longer than that of operational
systems
Operational database: current value data
Data warehouse data: provide information from a
historical perspective (e.g., past 5-10 years)
Every key structure in the data warehouse
contains an element of time, explicitly or implicitly
But the key of operational data may or may not
contain a time element
DW Non-volatile
A physically separate store of data transformed
from the operational environment
Operational update of data does not occur in the
data warehouse environment
Does not require transaction processing, recovery,
and concurrency control mechanisms
Requires only two to three operations in data
accessing:
initial loading, incremental loading of data and
access of data
Difference between OLTP and OLAP

                    OLTP                          OLAP
users               clerk, IT professional        knowledge worker
function            day to day operations         decision support
DB design           application-oriented          subject-oriented
data                current, up-to-date;          historical;
                    detailed, flat relational;    summarized, multidimensional;
                    isolated                      integrated, consolidated
usage               repetitive                    ad-hoc/repetitive
access              read/write;                   lots of scans
                    index/hash on prim. key
unit of work        short, simple transaction     complex query
# records accessed  tens                          millions
# users             thousands                     hundreds
DB size             100MB-GB                      100GB-TB
metric              transaction throughput        query throughput, response
A Typical Data Warehouse
DW Building Operations (ETL)

Data extraction
Get data from multiple, heterogeneous, and external
sources
Data cleaning
Detect errors in the data and rectify them when
possible
Data transformation
Convert data from legacy or host format to warehouse
format
Load/Refresh
Sort, summarize, consolidate, compute views, check
integrity, and build indexes and partitions
Propagate updates from sources
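The four steps above can be sketched in a few lines of Python. This is only a minimal illustration, not a production pipeline: the two source layouts, the field names, and the currency rate are all hypothetical.

```python
# Hypothetical source records from two heterogeneous systems (assumed layouts).
crm_rows = [{"cust": "C1", "amount": "100.0", "currency": "USD"},
            {"cust": "C2", "amount": "n/a", "currency": "USD"}]  # dirty value
erp_rows = [("C3", 5000.0, "INR")]

INR_PER_USD = 80.0  # assumed fixed rate, for illustration only


def extract():
    """Extraction: pull rows from every source into one common shape."""
    rows = [(r["cust"], r["amount"], r["currency"]) for r in crm_rows]
    rows += [(cust, str(amount), cur) for cust, amount, cur in erp_rows]
    return rows


def clean(rows):
    """Cleaning: drop rows whose amount cannot be parsed as a number."""
    good = []
    for cust, amount, cur in rows:
        try:
            good.append((cust, float(amount), cur))
        except ValueError:
            pass  # a real pipeline would log or repair the row instead
    return good


def transform(rows):
    """Transformation: convert every amount to one warehouse currency (USD)."""
    return [(cust, amt / INR_PER_USD if cur == "INR" else amt)
            for cust, amt, cur in rows]


def load(rows):
    """Load: sort and summarize into the warehouse table (a dict here)."""
    warehouse = {}
    for cust, amt in sorted(rows):
        warehouse[cust] = warehouse.get(cust, 0.0) + amt
    return warehouse


warehouse = load(transform(clean(extract())))
print(warehouse)
```

The unparseable C2 row is dropped during cleaning, and the INR amount is converted before loading, so the warehouse ends up with one consistent unit.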
Who does not need a Data Warehouse
If the business does not need a unified set of
information/unified view
If the same person is involved in research, manufacturing, sales,
support, and customer relationships
If there is little or no latency in the access to and
analysis of information
If the same product is sold over and over
If there is no need for historical data
Family businesses
Small businesses
Structure - Data Warehouse
What is Metadata
Metadata is the data defining warehouse objects (re-usable). It stores:
Description of the structure of the data warehouse
  Schema, views, dimensions, hierarchies, derived data definitions, data
  mart locations and contents
Operational metadata
  History of migrated data, currency of data (active, archived, or purged),
  algorithms used for queries
The mapping from the operational environment to the data warehouse
Data related to system performance
  Number of users, system checks performed, failures
Business data
  Business terms and definitions, ownership of data, policies (scope of
  DW, security)
Data Warehouse Meta Data

Metadata for the data warehouse environment is
one of the most important aspects. Metadata
helps the DSS analyst find what data is in the
warehouse and use that data effectively and
efficiently.
Some of the components of data warehouse
metadata are:
The structure and contents of the warehouse
The mapping of data into the data warehouse
The extract/transformation history
Aging/purging criteria
Ownership/stewardship information
Uses of Metadata
Some of the uses
Extraction and loading processes - metadata is used
to map data sources to a common view of information
within the warehouse
Warehouse management process - metadata is used
to automate the production of summary tables
Query management process - metadata is used to
direct a query to the most appropriate data source
RDBMS(OLTP) Vs DW (OLAP) Structure

OLTP (normalized) tables

Region
  Reg_ID  Reg_Name
  1       Europe
  2       North America
  3       Asia

Country (columns: Reg_ID, Cntry_ID, Cntry_Name)
  Reg_ID  Cntry_Name
  1       Germany
  1       Spain
  2       Canada
  2       Mexico
  3       India
  3       China

City (columns: City_ID, Reg_ID, Cntry_ID, City_Name)
  City_Name: Frankfurt, Vancouver, Toronto, Mexico City,
  Delhi, Beijing, Mumbai, Madrid

OLAP (denormalized) table

Location
  Region         Country  City
  Europe         Germany  Frankfurt
  Europe         Spain    Madrid
  North America  Canada   Vancouver
  North America  Mexico   Mexico City
  Asia           India    Delhi
  Asia           China    Beijing
  Asia           India    Mumbai
Concept Hierarchy
SQL for OLTP and OLAP

OLTP:
  select * from region, country, city
  where
    region.reg_id = country.reg_id and
    country.cntry_id = city.cntry_id
OLAP:
  select * from location
OLTP for quick Insert/Update
OLAP for quick Data Access
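The contrast can be tried out with Python's built-in sqlite3 module. This is a sketch under assumptions: the Cntry_ID and City_ID values are invented for the example, only a few rows are loaded, and city is joined to country on cntry_id, which is assumed to be the intended key.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE region  (reg_id INTEGER, reg_name TEXT);
CREATE TABLE country (cntry_id INTEGER, reg_id INTEGER, cntry_name TEXT);
CREATE TABLE city    (city_id INTEGER, cntry_id INTEGER, reg_id INTEGER,
                      city_name TEXT);
INSERT INTO region  VALUES (1,'Europe'),(2,'North America'),(3,'Asia');
-- cntry_id and city_id values are assumed for this sketch
INSERT INTO country VALUES (1,1,'Germany'),(2,1,'Spain'),
                           (3,2,'Canada'),(4,3,'India');
INSERT INTO city    VALUES (1,1,1,'Frankfurt'),(2,2,1,'Madrid'),
                           (3,3,2,'Vancouver'),(4,4,3,'Delhi'),
                           (5,4,3,'Mumbai');
""")

# OLTP style: the hierarchy is rebuilt with joins on every read.
oltp_sql = """
SELECT r.reg_name, c.cntry_name, t.city_name
FROM region r
JOIN country c ON c.reg_id = r.reg_id
JOIN city t    ON t.cntry_id = c.cntry_id
ORDER BY t.city_id
"""
oltp_rows = con.execute(oltp_sql).fetchall()

# OLAP style: denormalize once into a location table; reads need no join.
con.execute("CREATE TABLE location AS " + oltp_sql)
olap_rows = con.execute("SELECT * FROM location").fetchall()
print(olap_rows)
```

Both queries return the same rows. The normalized tables make updates cheap (a region name is stored once), while the flat table makes reads cheap.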
Data Warehouse Usage

Three kinds of data warehouse applications
Information Processing
supports querying, basic statistical analysis, and reporting
using crosstabs, tables, charts and graphs
Analytical Processing
multidimensional analysis of data warehouse data
supports basic OLAP operations: slice-dice, drill down, roll-up
Data Mining
knowledge discovery from hidden patterns
supports associations, constructing analytical models,
performing classification and prediction, and presenting the
mining results using visualization tools
DW Queries Complexity
Complexity grows just by adding a maximum price:
Query 1: A simple data cube query: Find the total sales
in 2004, broken down by product, region, and month,
with subtotals for each dimension.
Query 2: A complex data cube query: Grouping by all
subsets of product, region, month, find the maximum
price in 2004 for each group and the total sales among
all maximum price tuples.
OLAP and Data Mining

Is OLAP data mining? NO
OLAP is a way to look at pre-aggregated query results
(Evaluate query and results)
Data Mining is building models of data
Data mining tools model data and return actionable
rules
Practically speaking: OLAP tools can be used on data
cubes and can also perform some forms of data mining
Types of Data Warehouse Models

Various types of FACT/DIMENSION table arrangements
Star schema: A fact table in the middle connected to a
set of dimension tables
Snowflake schema: A refinement of star schema
where some dimensional hierarchy is normalized into
a set of smaller dimension tables, forming a shape
similar to snowflake
Fact constellations: Multiple fact tables share
dimension tables, viewed as a collection of stars,
therefore called galaxy schema or fact constellation
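A star schema can be sketched with Python's built-in sqlite3 module. All table names, column names, and rows below are hypothetical; the point is only the shape: one fact table holding the measure, with keys pointing to the dimension tables.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Dimension tables (hypothetical names and rows)
CREATE TABLE dim_product  (product_id INTEGER PRIMARY KEY,
                           name TEXT, type TEXT);
CREATE TABLE dim_location (location_id INTEGER PRIMARY KEY,
                           city TEXT, country TEXT);
-- Fact table: one row per sale, keyed by the dimensions
CREATE TABLE fact_sales (
    product_id  INTEGER REFERENCES dim_product(product_id),
    location_id INTEGER REFERENCES dim_location(location_id),
    amount      REAL
);
INSERT INTO dim_product  VALUES (1,'TV','Electronics'),(2,'PC','Electronics');
INSERT INTO dim_location VALUES (1,'Delhi','India'),(2,'Mumbai','India'),
                                (3,'Madrid','Spain');
INSERT INTO fact_sales   VALUES (1,1,100),(1,2,250),(2,2,70),(2,3,40);
""")

# One join from the fact table to a dimension answers "sales by country".
totals = con.execute("""
    SELECT l.country, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_location l ON l.location_id = f.location_id
    GROUP BY l.country
    ORDER BY l.country
""").fetchall()
print(totals)
```

In a snowflake variant of this sketch, country would move out of dim_location into its own table, which saves space but adds one more join to the query.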
Star Schema
Note the concept hierarchy within the Dimension tables. WHY?
Snowflake Schema
Advantages and Disadvantages?
FACT Constellation
Indexes for Data Warehouse

Index types differ between OLTP and OLAP
Ordinary Index
Bitmap Index
JOIN Index
Ordinary Index
Surrogate Key
In most cases it is not a natural key; it is kept
in addition to the business key.
Represents an object in the database, but is not
visible outside it
B-Tree (B+ Tree)
Should be used with unique or high-cardinality
(near-unique) data
B+ Tree
How can a query be made to need fewer
disk accesses?
Bitmap Index
Index on a particular column
Each value in the column has a bit vector
The length of the bit vector: # of records in the
base table
Base table
  Cust  Region   Type
  C1    Asia     Retail
  C2    Europe   Dealer
  C3    Asia     Dealer
  C4    America  Retail
  C5    Europe   Dealer

Index on Region
  RecID  Asia  Europe  America
  1      1     0       0
  2      0     1       0
  3      1     0       0
  4      0     0       1
  5      0     1       0

Index on Type
  RecID  Retail  Dealer
  1      1       0
  2      0       1
  3      0       1
  4      1       0
  5      0       1
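The bitmap index for this base table can be sketched in Python, using one integer per distinct value as the bit vector. The function names are made up for the example; a real DBMS would also compress the vectors.

```python
regions = ["Asia", "Europe", "Asia", "America", "Europe"]   # C1..C5
types = ["Retail", "Dealer", "Dealer", "Retail", "Dealer"]  # C1..C5


def build_bitmap_index(column):
    """Return {value: bitmask}; bit i is set when record i+1 has that value."""
    index = {}
    for i, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << i)
    return index


def rec_ids(mask):
    """Decode a bitmask back into 1-based RecIDs."""
    return [i + 1 for i in range(mask.bit_length()) if (mask >> i) & 1]


region_idx = build_bitmap_index(regions)
type_idx = build_bitmap_index(types)

# "Region = Europe AND Type = Dealer" becomes one bitwise AND of two vectors.
europe_dealers = rec_ids(region_idx["Europe"] & type_idx["Dealer"])
print(europe_dealers)
```

The selection never touches the base table: it combines two bit vectors and only then decodes the matching record IDs.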
Usage of Bitmap Index

Static tables
Data is not updated frequently.
Modifying a bitmap index is expensive
They are a compressed index type
DSS systems
Generally suited to low-cardinality columns (but
need not be limited to them)
Suited to systems that receive changes during
non-peak business hours
JOIN Index
In data warehouses, a join index relates the
values of the dimensions of a star schema
to rows in the fact table
E.g., Sales with two dimensions, city and
product
A join index on city maintains, for each
distinct city, a list of R-IDs (primary keys) of
the tuples recording the sales in that city
Join indices can span multiple dimensions
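A join index can be sketched as a dictionary from dimension values to fact-table R-IDs. The fact rows and the function name below are hypothetical.

```python
from collections import defaultdict

# Hypothetical fact rows: (R-ID, city, product, amount)
sales = [
    (1, "Delhi", "TV", 100),
    (2, "Mumbai", "PC", 70),
    (3, "Delhi", "PC", 50),
    (4, "Madrid", "TV", 300),
]


def build_join_index(fact_rows, key_positions):
    """Map each dimension value (or tuple of values) to the fact R-IDs."""
    index = defaultdict(list)
    for row in fact_rows:
        key = tuple(row[p] for p in key_positions)
        # A single-dimension index uses the bare value as its key.
        index[key if len(key) > 1 else key[0]].append(row[0])
    return dict(index)


city_index = build_join_index(sales, [1])             # city -> R-IDs
city_product_index = build_join_index(sales, [1, 2])  # spans two dimensions
print(city_index)
```

Looking up city_index["Delhi"] gives the R-IDs of the Delhi sales directly, without scanning or joining the fact table at query time.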
Join Index
Size of a Data Warehouse and Query Cost
Number of tuples/records in the DIMENSION tables
Number of tuples/records in the FACT table
E.g., Customer/Sale data
Cost of a query to locate a tuple using an index?
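One rough way to reason about the index cost is to count the levels of a B+ tree, since a lookup reads about one node per level. The sketch below is a simplified model under stated assumptions: every node holds exactly `fanout` entries, each level costs one disk access, and caching is ignored. The fanout of 100 is an assumed figure.

```python
def btree_levels(num_records, fanout):
    """Levels needed so that fanout**levels >= num_records (integer math)."""
    levels, capacity = 1, fanout
    while capacity < num_records:
        capacity *= fanout
        levels += 1
    return levels


# With an assumed fanout of 100, a million-row fact table needs 3 levels.
print(btree_levels(1_000_000, 100))
```

Under this model the cost grows with the logarithm of the table size, so a tenfold larger fact table adds at most one more level.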
Optimizing Data Warehouse

1. Better understanding of the actual data
2. Better design (proper de-normalization)
3. Indexing:
Faster access
4. Partitioning:
Physical partitioning
Eg: Partitioning Date/Time Dimension
Data Cube
The key operation of OLAP is the
formation of a data cube
Pre-computed query result
A data cube allows data to be modeled
and viewed in multiple dimensions. It is
defined by dimensions and facts
A data cube is a multidimensional
representation of data, together with all
possible aggregates
A Spreadsheet Data
  Date      Location  Product  Sales
  1-Jan-13  USA       TV       100
  2-Jan-13  Canada    TV       250
  3-Jan-13  Mexico    TV       300
  4-Jan-13  Brazil    TV       200
  1-Jan-13  USA       PC       50
  2-Jan-13  Canada    PC       70
  3-Jan-13  Mexico    PC       40
  4-Jan-13  Brazil    PC       60
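The spreadsheet rows can be turned into a small data cube "together with all possible aggregates" with a few lines of Python. This is a naive sketch (every row is added to each of its 2^3 cells), not an efficient cube algorithm; the function name is made up, and '*' marks an aggregated dimension.

```python
from itertools import product as cartesian

# Rows from the spreadsheet above: (date, location, product, sales)
rows = [
    ("1-Jan-13", "USA", "TV", 100), ("2-Jan-13", "Canada", "TV", 250),
    ("3-Jan-13", "Mexico", "TV", 300), ("4-Jan-13", "Brazil", "TV", 200),
    ("1-Jan-13", "USA", "PC", 50), ("2-Jan-13", "Canada", "PC", 70),
    ("3-Jan-13", "Mexico", "PC", 40), ("4-Jan-13", "Brazil", "PC", 60),
]


def data_cube(rows, n_dims=3):
    """Sum the measure into every cell; '*' stands for 'all values'."""
    cube = {}
    for row in rows:
        dims, measure = row[:n_dims], row[n_dims]
        # Each dimension is either kept or replaced by '*'.
        for keep in cartesian([True, False], repeat=n_dims):
            cell = tuple(d if k else "*" for d, k in zip(dims, keep))
            cube[cell] = cube.get(cell, 0) + measure
    return cube


cube = data_cube(rows)
print(cube[("*", "*", "TV")])  # total TV sales over all dates and locations
```

Once the cube is built, any subtotal is a dictionary lookup, which is the sense in which a data cube is a pre-computed query result.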
Represent in 3 Dimensions
Consider the previous sales of products at a number of
locations at various dates
This data can be represented as a 3-dimensional array
A Sample Data Cube
[Figure: a 3-D data cube with axes Product (TV, PC, VCR, sum),
Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), and Country (Americas,
Asia Pacific, Europe, sum). One highlighted cell is the total
annual sales of TVs in U.S.A.]
Data Cube Browsing (Tools In Practice)
Data Cube: Lattice of Cuboids 3D
0-D (apex) cuboid: all (highest level of summarization)
1-D cuboids: product, date, country
2-D cuboids: (product, date), (product, country), (date, country)
3-D (base) cuboid: (product, date, country) (lowest level of summarization)
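The lattice can be enumerated mechanically: a cuboid is any subset of the dimensions, so n dimensions give 2^n cuboids. A short sketch in Python (the function name is made up for the example):

```python
from itertools import combinations


def cuboids(dimensions):
    """Return the lattice as {k: [k-D cuboids]}, each cuboid a tuple of names."""
    return {k: list(combinations(dimensions, k))
            for k in range(len(dimensions) + 1)}


lattice = cuboids(["product", "date", "country"])
for k, group in lattice.items():
    print(f"{k}-D:", group)
```

The empty tuple at level 0 is the apex cuboid, and the single tuple at level 3 is the base cuboid.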
A Simple Representation
Base and Aggregate cells.
Consider the data cube with the DIMENSIONS
Date, Product, Country and the FACT
Quantity.
1D Cells: (Jan, *, *, 350)
1D Cells: (Feb, *, *, 50)
2D Cells: (Jan, * , Mexico, 70)
3D Cells: (Aug, TV, USA, 80)
Data Cube: Lattice of Cuboids 4D

0-D (apex) cuboid: all (highest level of summarization)
1-D cuboids: time, product, location, supplier
2-D cuboids: (time, product), (time, location), (time, supplier),
  (product, location), (product, supplier), (location, supplier)
3-D cuboids: (time, product, location), (time, product, supplier),
  (time, location, supplier), (product, location, supplier)
4-D (base) cuboid: (time, product, location, supplier) (lowest level of summarization)
Bottom Up Cube Calculation (BUC)
4 Dimension Computation (A, B, C, D attributes)

1D Cells: (Jan, *, *,*, 500)
2D Cells: (*, TV, *,Philips, 300)
The Compute Cube Operator

Cube definition and computation in DMQL
  define cube sales [product, city, year]: sum(sales_in_dollars)
  compute cube sales
Transform it into a SQL-like language (with a new
operator, CUBE BY, introduced by Gray et al. '96)
  SELECT product, city, year, SUM(amount)
  FROM SALES
  CUBE BY product, city, year
Need to compute the following group-bys:
  (city, product, year),
  (city, product), (city, year), (product, year),
  (city), (product), (year),
  ()
Oracle's CUBE Operator

The CUBE extension generates subtotals for all
combinations of the dimensions specified
As the number of dimensions increases, so does
the number of subtotal combinations that need to be
calculated (n dimensions give 2^n groupings)
SQL:
  SELECT Date_Id, Product_Id, City_id,
         SUM(sales_value) AS sales_value
  FROM dimension_tab
  GROUP BY CUBE (Date_Id, Product_Id, City_id)
  ORDER BY Date_Id, Product_Id, City_id;
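What GROUP BY CUBE computes can be imitated on an engine without the operator by issuing one GROUP BY per subset of the dimensions and combining them with UNION ALL. The sketch below uses SQLite only because it ships with Python; SQLite does not provide CUBE, which is why the emulation is needed there. The rows and the helper name are hypothetical, and NULL marks a dimension that has been aggregated away.

```python
import sqlite3
from itertools import combinations

DIMS = ["date_id", "product_id", "city_id"]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dimension_tab "
            "(date_id INT, product_id INT, city_id INT, sales_value INT)")
# Hypothetical rows, for illustration only.
con.executemany("INSERT INTO dimension_tab VALUES (?,?,?,?)",
                [(1, 1, 1, 10), (1, 2, 1, 20), (2, 1, 2, 30)])


def cube_sql(dims, table, measure):
    """Build one GROUP BY per subset of dims and glue them with UNION ALL."""
    parts = []
    for k in range(len(dims), -1, -1):
        for subset in combinations(dims, k):
            cols = ", ".join(d if d in subset else f"NULL AS {d}"
                             for d in dims)
            group = f" GROUP BY {', '.join(subset)}" if subset else ""
            parts.append(f"SELECT {cols}, SUM({measure}) AS {measure} "
                         f"FROM {table}{group}")
    return " UNION ALL ".join(parts)


sql = cube_sql(DIMS, "dimension_tab", "sales_value")
result = con.execute(sql).fetchall()
print(len(result), "rows from", sql.count("SELECT"), "group-bys")
```

Three dimensions produce eight SELECT statements, which shows directly why the work grows quickly as dimensions are added.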
Attribute Oriented Induction

Proposed in 1989 [Before Data Cube concept was
introduced]
How is it done?
Collect the task-relevant data (data for analysis) using a
relational database query (initial relation)
Perform generalization by attribute removal or attribute
generalization
Apply aggregation by merging identical, generalized tuples and
accumulating their respective counts
Interaction with users for knowledge presentation
Example for Attribute Oriented Induction

A DMQL Query
1. use DB_XYZ
   mine characteristics as "Science Students"
   in relevance to name, gender, major, birth_place,
     birth_date, residence, phone#, gpa
   from student
   where status in "graduate"
Algorithm to solve the Example

2. Removing/Generalizing Attributes:
   The attribute has a large set of distinct values, and
   Case 1: EITHER there is no concept hierarchy defined
   for the attribute
   Case 2: OR its higher-level concepts are defined in terms of
   other attributes
   Remove the attribute when either case is true.
3. Class Characterization:
   Generalization will result in groups of identical tuples
   Count the number of generalized (duplicate) tuples
   and mark the count
Class Characterization
Initial relation:
  Name            Gender  Major    Birth-Place            Birth_date  Residence                 Phone #   GPA
  Jim Woodman     M       CS       Vancouver, BC, Canada  8-12-76     3511 Main St., Richmond   687-4598  3.67
  Scott Lachance  M       CS       Montreal, Que, Canada  28-7-75     345 1st Ave., Richmond    253-9106  3.70
  Laura Lee       F       Physics  Seattle, WA, USA       25-8-70     125 Austin Ave., Burnaby  420-5232  3.83
Generalization per attribute:
  Name: Removed
  Gender: Retained
  Major: Sci, Eng, Bus
  Birth-Place: Country
  Birth_date: Age range
  Residence: City
  Phone #: Removed
  GPA: Excl, VG, ..
Generalized relation:
  Gender  Major    Birth_region  Age_range  Residence  GPA        Count
  M       Science  Canada        20-25      Richmond   Very-good  16
  F       Science  Foreign       25-30      Burnaby    Excellent  22
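The generalize-then-count idea can be sketched in Python on the three sample students, reduced to five attributes. The concept hierarchy for majors, the Canada/Foreign split, and the GPA threshold are assumptions made for this example, not values given in the slides.

```python
from collections import Counter

# Sample tuples: (name, gender, major, birth_country, gpa)
students = [
    ("Jim Woodman", "M", "CS", "Canada", 3.67),
    ("Scott Lachance", "M", "CS", "Canada", 3.70),
    ("Laura Lee", "F", "Physics", "USA", 3.83),
]

# Assumed concept hierarchy and threshold, for illustration only.
MAJOR_TO_FIELD = {"CS": "Science", "Physics": "Science"}
EXCELLENT_FROM = 3.75


def generalize(student):
    """Drop or generalize each attribute of one tuple."""
    name, gender, major, country, gpa = student
    # name: many distinct values and no concept hierarchy, so it is removed
    field = MAJOR_TO_FIELD[major]
    region = "Canada" if country == "Canada" else "Foreign"
    grade = "Excellent" if gpa >= EXCELLENT_FROM else "Very-good"
    return (gender, field, region, grade)


# Merge identical generalized tuples and accumulate their counts.
generalized = Counter(generalize(s) for s in students)
for row, count in generalized.items():
    print(row, count)
```

The two male CS students collapse into one generalized tuple with count 2, which is the same merging step that produces the Count column in the table.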
Importance of Understanding the Data

The developer, technical analyst, and business analyst
must know the data very well
In relational (OLTP) databases the emphasis is on design,
efficiency of data retrieval, GUI, etc. (although the
data is important there too)
In OLAP databases/data warehouses, the data itself, how
it is stored, and its quality become more and more important
References
Jiawei Han, Micheline Kamber, Jian Pei. Data Mining: Concepts and Techniques
William Inmon. Building the Data Warehouse
Ralph Kimball. The Data Warehouse Toolkit
