
DATA WAREHOUSING

BY
B.M.BRINDA,
AP / IT - VCEW

OVERVIEW

Introduction to Data
Data Warehouse
OLAP vs. OLTP
Multidimensional Data
Data Mining
Data Preprocessing

INTRODUCTION

Data
Data is information that has been translated into a form that is more convenient to move or process.

Database
A database is an organized collection of information which can easily be accessed, managed, and updated by a set of programs.

Data, Data everywhere, yet ...

I can't find the data I need
- data is scattered over the network
- many versions, subtle differences
I can't get the data I need
- need an expert to get the data
I can't understand the data I found
- available data is poorly documented
I can't use the data I found
- results are unexpected
- data needs to be transformed from one form to another

What is a Data Warehouse?

A single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context.

What is Data Warehousing?

A process of transforming data into information and making it available to users in a timely enough manner to make a difference.

[Diagram: Data transformed into Information]

DATA WAREHOUSE

"A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)

Subject-Oriented

Data that gives information about a particular subject instead of about a company's ongoing operations.
Data is categorized and stored by business subject rather than by application.

Application Oriented: Loans, ATM, Credit Card, Trust, Savings
Subject Oriented: Customer, Product, Vendor, Activity

Integrated

Data on a given subject is defined and stored once.
Data is gathered into the data warehouse from a variety of sources and merged into a coherent whole.

[Diagram: OLTP applications (Savings, Current accounts, Loans) feeding Customer data into the Data Warehouse]

Time-Variant

Data is stored as a series of snapshots, each representing a period of time.
All data in the data warehouse is identified with a particular time period.

[Diagram: monthly snapshots of the data, e.g. Jan-97, Feb-97, Mar-97]

Nonvolatile

Data in the data warehouse is not updated or deleted.
Data is stable in a data warehouse. More data is added, but data is never removed. This enables management to gain a consistent picture of the business.

Operational database: Insert, Update, Delete, Read
Warehouse: Load, Read

Changing Data

[Diagram: the operational database performs a first-time load into the warehouse database, followed by periodic refreshes]

OLTP

OLTP: Online Transaction Processing, or Operational Database Systems
Performs online transaction and query processing
Covers most of the day-to-day operations
Characterized by a large number of short online transactions (INSERT, UPDATE, DELETE)
Examples: purchasing, inventory, manufacturing, banking, payroll, registration, etc.

OLAP

OLAP: Online Analytical Processing, or Data Warehouse
Serves users or knowledge workers in the role of decision making and data analysis
Organizes and presents data in various formats in order to satisfy various user requests
Characterized by a relatively low volume of transactions
OLAP allows users to analyze database information from multiple database systems at one time
OLAP data is stored in multidimensional databases

Data Warehouse vs. Operational DBMS


Distinct features (OLTP vs. OLAP):
User and system orientation: customer vs. market
Data contents: current, detailed vs. historical, consolidated
Database design: ER + application vs. star + subject
View: current, local vs. evolutionary, integrated
Access patterns: update vs. read-only but complex queries


OLTP vs. OLAP

Feature              OLTP                              OLAP
users                clerk, IT professional            knowledge worker
function             day-to-day operations             decision support
DB design            application-oriented              subject-oriented
data                 current, up-to-date, detailed,    historical, summarized,
                     flat relational, isolated         multidimensional,
                                                       integrated, consolidated
usage                repetitive                        ad-hoc
access               read/write, index/hash on         lots of scans
                     primary key
unit of work         short, simple transaction         complex query
# records accessed   tens                              millions
# users              thousands                         hundreds
DB size              100 MB to GB                      100 GB to TB
metric               transaction throughput            query throughput, response time

Why Separate Data Warehouse?

High performance for both systems
- DBMS tuned for OLTP: access methods, indexing, concurrency control, recovery
- Warehouse tuned for OLAP: complex OLAP queries, multidimensional view, consolidation
Different functions and different data:
- missing data: decision support requires historical data which operational DBs do not typically maintain
- data consolidation: decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources
- data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled

MULTIDIMENSIONAL DATA MODEL

A data warehouse is based on a multidimensional data model which views data in the form of a data cube.
Data cube: allows data to be modeled and viewed in multiple dimensions.
Supports viewing/modeling of a variable (or a set of variables) of interest.
Measures are used to report the values of the particular variable with respect to a given set of dimensions.

DATA CUBE

A data cube is defined by dimensions and facts.
Dimensions: the entities with respect to which an organization wants to keep records.
Dimension table: each dimension may be associated with a table, e.g. Item, Branch, Location.
Facts: numerical measures.
Fact table: contains the names of the facts (measures) as well as keys to each of the related dimension tables, e.g. Units_sold, Amount_Budgeted.
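
As a quick illustration (not from the original slides), the dimension/fact vocabulary can be mimicked with a pandas group-by over a tiny, made-up fact table; the column names item, branch, location and units_sold follow the slide's examples:

```python
import pandas as pd

# Hypothetical fact table: dimension keys (item, branch, location)
# plus the measure units_sold.
fact = pd.DataFrame({
    "item": ["TV", "TV", "Phone", "Phone"],
    "branch": ["B1", "B2", "B1", "B2"],
    "location": ["Chennai", "Delhi", "Chennai", "Delhi"],
    "units_sold": [840, 952, 605, 512],
})

# One cuboid of the cube: the measure aggregated over (item, location).
cuboid = fact.groupby(["item", "location"])["units_sold"].sum()

# The apex (0-D) cuboid: total summarization of the measure.
apex = fact["units_sold"].sum()
```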

Cube: A Lattice of Cuboids

In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.

3D Data Cube

[Figure: a 3-D data cube of the measure Dollars_Sold]

Modeling of Data Warehouse

A data warehouse requires a concise, subject-oriented schema that facilitates on-line data analysis. Common schemas:
Star Schema
Snowflake Schema
Fact Constellation Schema

STAR SCHEMA

Contains a large central table (the fact table), which holds the bulk of the data, and a set of smaller attendant tables (dimension tables), one for each dimension.

SNOWFLAKE SCHEMA

A variant of the star schema model, where some dimension tables are normalized, thereby further splitting the data into additional tables.

FACT CONSTELLATION

The schema can be viewed as a collection of stars, and hence it is called a galaxy schema or a fact constellation.
Used for sophisticated applications.

CONCEPT HIERARCHY

Defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts, e.g. street < city < state < country for a location dimension.

OLAP OPERATIONS

Roll Up (Drill Up): reduction of a dimension
Drill Down (Roll Down): adds a new dimension
Slice and Dice
Pivot (Rotate)

ROLL UP

Performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction.

DRILL DOWN

Drill-down is the reverse of roll-up. It navigates from less detailed data to more detailed data.
Drill-down can be realized either by stepping down a concept hierarchy for a dimension or by introducing additional dimensions.

SLICE & DICE


The slice operation selects one particular
dimension from a given cube and provides a
new sub-cube.

Dice selects two or more dimensions from a


given cube and provides a new sub-cube.

12/04/2016

30

SLICE & DICE

12/04/2016

31

PIVOT

The pivot operation is also known as rotation. It rotates the data axes in view, in order to provide an alternative presentation of the data.
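
For intuition, the OLAP operations above can be sketched on a flat, hypothetical sales table with pandas; the table contents and column names are invented for illustration:

```python
import pandas as pd

# Hypothetical sales data as a flat table.
df = pd.DataFrame({
    "year": [2015, 2015, 2016, 2016],
    "city": ["Chennai", "Delhi", "Chennai", "Delhi"],
    "item": ["TV", "Phone", "TV", "Phone"],
    "dollars_sold": [400, 300, 250, 200],
})

# Roll-up: aggregate away the city dimension.
rollup = df.groupby(["year", "item"])["dollars_sold"].sum()

# Drill-down is the reverse: group by an additional dimension again.
drilldown = df.groupby(["year", "item", "city"])["dollars_sold"].sum()

# Slice: fix one dimension to a single value, yielding a sub-cube.
slice_ = df[df["year"] == 2016]

# Dice: restrict two or more dimensions.
dice = df[(df["year"] == 2016) & (df["city"] == "Chennai")]

# Pivot (rotate): swap the axes of the presentation.
pivoted = df.pivot_table(index="item", columns="city",
                         values="dollars_sold", aggfunc="sum")
```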

Design of Data Warehouse

Four views regarding the design of a data warehouse:
Top-down view
- allows selection of the relevant information necessary for the data warehouse
Data source view
- exposes the information being captured, stored, and managed by operational systems
Data warehouse view
- consists of fact tables and dimension tables
Business query view
- sees the perspectives of data in the warehouse from the view of the end user

Data Warehouse Design Process

Top-down, bottom-up approaches, or a combination of both:
- Top-down: starts with overall design and planning (mature)
- Bottom-up: starts with experiments and prototypes (rapid)

Data Warehouse Design Steps
- Choose a business process to model, e.g., orders, invoices, etc.
- Choose the grain (atomic level of data) of the business process
- Choose the dimensions that will apply to each fact table record
- Choose the measures that will populate each fact table record

3-TIER DATA WAREHOUSE ARCHITECTURE

[Figure: the three-tier data warehouse architecture]

Three Data Warehouse Models

Enterprise warehouse
- collects all of the information about subjects spanning the entire organization
Data mart
- a subset of corporate-wide data that is of value to a specific group of users; its scope is confined to specific, selected groups, such as a marketing data mart
- independent vs. dependent (populated directly from the warehouse) data marts
Virtual warehouse
- a set of views over operational databases
- only some of the possible summary views may be materialized

DATA MINING

Extracting or "mining" knowledge from large amounts of data.
Also called Knowledge Extraction, Knowledge Discovery from Data (KDD), and Data/Pattern Analysis.

DATA MINING STEPS

[Figure: the steps of the knowledge discovery process]

KDD STEPS

1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed or consolidated into forms appropriate for mining, by performing summary or aggregation operations, for instance)
5. Data mining (an essential process where intelligent methods are applied in order to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge, based on some interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user)

Why Data Preprocessing?

Data in the real world is dirty:
incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
- e.g., occupation = ""
noisy: containing errors or outliers
- e.g., Salary = "-10"
inconsistent: containing discrepancies in codes or names
- e.g., Age = "42", Birthday = "03/07/1997"
- e.g., was rating "1, 2, 3", now rating "A, B, C"
- e.g., discrepancy between duplicate records

Why Is Data Dirty?

Incomplete data may come from
- "not applicable" data values when collected
- different considerations between the time when the data was collected and when it is analyzed
- human/hardware/software problems
Noisy data (incorrect values) may come from
- faulty data collection instruments
- human or computer error at data entry
- errors in data transmission
Inconsistent data may come from
- different data sources
- functional dependency violations (e.g., modifying some linked data)
Duplicate records also need data cleaning

Why Is Data Preprocessing Important?

No quality data, no quality mining results!
Quality decisions must be based on quality data
- e.g., duplicate or missing data may cause incorrect or even misleading statistics
A data warehouse needs consistent integration of quality data
Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse

Multi-Dimensional Measure of Data Quality

A well-accepted multidimensional view:
- Accuracy
- Completeness
- Consistency
- Timeliness
- Believability
- Value added
- Interpretability
- Accessibility

Data Preprocessing - Tasks

Data cleaning
- Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Data integration
- Integration of multiple databases, data cubes, or files
Data transformation
- Normalization and aggregation
Data reduction
- Obtains a reduced representation in volume but produces the same or similar analytical results
Data discretization
- Part of data reduction but with particular importance, especially for numerical data

Forms of Data Preprocessing

[Figure: the forms of data preprocessing]

Data Cleaning

Importance
- "Data cleaning is one of the three biggest problems in data warehousing" (Ralph Kimball)
- "Data cleaning is the number one problem in data warehousing" (DCI survey)

Data cleaning tasks
- Fill in missing values
- Identify outliers and smooth out noisy data
- Correct inconsistent data
- Resolve redundancy caused by data integration

Data Cleaning: How to Handle Missing Data?

Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
Fill in the missing value manually.
Fill it in automatically with
- a global constant, e.g., "unknown" (which may be mistaken for a new class!)
- the attribute mean
- the attribute mean for all samples belonging to the same class (smarter)
- the most probable value: inference-based, such as a Bayesian formula or regression
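
A minimal pandas sketch of the automatic fill-in strategies above, on an invented table with one missing income value:

```python
import pandas as pd

# Invented data: one missing income value.
df = pd.DataFrame({
    "class": ["A", "A", "B", "B"],
    "income": [30_000, None, 52_000, 48_000],
})

# Global constant (a sentinel standing in for "unknown").
by_constant = df["income"].fillna(-1)

# Attribute mean over all samples.
by_mean = df["income"].fillna(df["income"].mean())

# Attribute mean within the same class (the "smarter" variant).
by_class_mean = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean()))
```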

Data Cleaning: How to Handle Noisy Data?

Binning
- first sort the data and partition it into (equal-frequency) bins
- then smooth by bin means, bin medians, bin boundaries, etc.
Regression
- smooth by fitting the data to regression functions
Clustering
- detect and remove outliers
Combined computer and human inspection
- detect suspicious values and check by human (e.g., deal with possible outliers)

Simple Discretization Methods: Binning

Equal-width (distance) partitioning
- Divides the range into N intervals of equal size: uniform grid
- If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A) / N
- The most straightforward, but outliers may dominate the presentation
- Skewed data is not handled well
Equal-depth (frequency) partitioning
- Divides the range into N intervals, each containing approximately the same number of samples
- Good data scaling
- Managing categorical attributes can be tricky

Data Cleaning: Binning Methods

Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
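
The worked example can be reproduced with a short NumPy sketch (bin means are rounded to whole dollars, as on the slide):

```python
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-frequency (equi-depth) bins: three bins of four values each.
bins = prices.reshape(3, 4)

# Smoothing by bin means: every value becomes its bin's (rounded) mean.
by_means = np.repeat(np.round(bins.mean(axis=1)), 4).reshape(3, 4).astype(int)

# Smoothing by bin boundaries: snap each value to the nearer boundary.
lo = bins.min(axis=1, keepdims=True)
hi = bins.max(axis=1, keepdims=True)
by_bounds = np.where(bins - lo < hi - bins, lo, hi)

print(by_means)   # [[ 9  9  9  9] [23 23 23 23] [29 29 29 29]]
print(by_bounds)  # [[ 4  4  4 15] [21 21 25 25] [26 26 26 34]]
```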

REGRESSION

Data can be smoothed by fitting the data to a function, such as with regression.
Linear regression involves finding the "best" line to fit two attributes (or variables), so that one attribute can be used to predict the other.
Multiple linear regression is an extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface.
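
A minimal NumPy sketch of smoothing by linear regression, on invented data:

```python
import numpy as np

# Invented noisy attribute pair; fit the best line y = w1*x + w0
# and replace each y with its fitted (smoothed) value.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.9, 3.2, 3.8, 5.1, 6.2, 6.8])

w1, w0 = np.polyfit(x, y, deg=1)  # least-squares slope and intercept
y_smoothed = w1 * x + w0
```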

Regression

[Figure: a regression line y = x + 1 fitted to data points]

CLUSTERING

Outliers may be detected by clustering, where similar values are organized into groups, or "clusters". Intuitively, values that fall outside of the set of clusters may be considered outliers.
Outliers: data objects with characteristics that are considerably different from most of the other data objects in the data set.
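
A rough sketch of clustering-based outlier detection with scikit-learn's KMeans on made-up points; the two flagging heuristics (near-empty clusters, large distance from the centroid) are illustrative choices, not a prescribed method:

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D points: two tight groups plus one far-away point.
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
              [5.0, 5.0], [5.1, 4.8], [4.9, 5.2],
              [20.0, 20.0]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
size = np.bincount(km.labels_)[km.labels_]

# Suspects: points in near-empty clusters or far from their centroid.
outliers = X[(size <= 1) | (dist > dist.mean() + 2 * dist.std())]
```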


Data Integration

Data integration: combines data from multiple sources into a coherent store
Schema integration: e.g., "A.cust-id" and "B.cust-#" refer to the same attribute
- integrate metadata from different sources
Entity identification problem:
- identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
Detecting and resolving data value conflicts
- for the same real-world entity, attribute values from different sources differ
- possible reasons: different representations, different scales, e.g., metric vs. British units

Handling Redundancy in Data Integration

Redundant data occur often when integrating multiple databases
- Object identification: the same attribute or object may have different names in different databases
- Derivable data: one attribute may be a derived attribute in another table, e.g., annual revenue
Redundant attributes may be detected by correlation analysis
Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality


Data Integration: Correlation Analysis (Numerical Data)

Correlation coefficient (also called Pearson's product-moment coefficient):

r_{A,B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} a_i b_i - n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}

where n is the number of tuples, \bar{A} and \bar{B} are the respective means of A and B, \sigma_A and \sigma_B are the respective standard deviations of A and B, and \sum a_i b_i is the sum of the AB cross-products.

If r_{A,B} > 0, A and B are positively correlated (A's values increase as B's do). The higher the value, the stronger the correlation.
r_{A,B} = 0: independent; r_{A,B} < 0: negatively correlated.
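
The formula translates directly into NumPy; this minimal sketch uses sample standard deviations (ddof=1) to match the (n-1) denominator:

```python
import numpy as np

def pearson_r(a, b):
    # r_{A,B} = sum((a_i - mean_A)(b_i - mean_B)) / ((n - 1) * sd_A * sd_B)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n = len(a)
    return (((a - a.mean()) * (b - b.mean())).sum()
            / ((n - 1) * a.std(ddof=1) * b.std(ddof=1)))

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0: perfectly correlated
```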

Data Integration: Correlation Analysis (Categorical Data)

\chi^2 (chi-square) test:

\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}

The larger the \chi^2 value, the more likely the variables are related.
The cells that contribute the most to the \chi^2 value are those whose actual count is very different from the expected count.
Correlation does not imply causality:
- the number of hospitals and the number of car thefts in a city are correlated
- both are causally linked to a third variable: population

Chi-Square Calculation: An Example

                           Play chess   Not play chess   Sum (row)
Like science fiction       250 (90)     200 (360)        450
Not like science fiction   50 (210)     1000 (840)       1050
Sum (col.)                 300          1200             1500

\chi^2 calculation (the numbers in parentheses are expected counts, calculated from the data distribution in the two categories):

\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93

It shows that like_science_fiction and play_chess are correlated in the group.
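
The hand calculation can be checked with SciPy; correction=False disables the Yates continuity correction so the statistic matches the plain formula above:

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[250, 200],     # like science fiction
                     [50, 1000]])    # not like science fiction

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(chi2)      # ~507.93, matching the hand calculation
print(expected)  # [[ 90. 360.] [210. 840.]]
```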

Data Transformation

Smoothing: remove noise from the data
Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
Normalization: scale to fall within a small, specified range
- min-max normalization
- z-score normalization
- normalization by decimal scaling
Attribute/feature construction
- new attributes constructed from the given ones

Data Transformation: Normalization

Min-max normalization, to [new_min_A, new_max_A]:

v' = \frac{v - \min_A}{\max_A - \min_A}\,(\text{new\_max}_A - \text{new\_min}_A) + \text{new\_min}_A

Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to \frac{73600 - 12000}{98000 - 12000}(1.0 - 0) + 0 = 0.716.

Z-score normalization (\mu: mean, \sigma: standard deviation):

v' = \frac{v - \mu_A}{\sigma_A}

Ex. Let \mu = 54,000 and \sigma = 16,000. Then \frac{73600 - 54000}{16000} = 1.225.

Normalization by decimal scaling:

v' = \frac{v}{10^j}

where j is the smallest integer such that \max(|v'|) < 1.
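
All three normalizations in a short NumPy sketch, reusing the slide's income figures:

```python
import numpy as np

income = np.array([12_000.0, 54_000.0, 73_600.0, 98_000.0])

# Min-max normalization to [0.0, 1.0]; 73,600 maps to ~0.716.
minmax = ((income - income.min()) / (income.max() - income.min())
          * (1.0 - 0.0) + 0.0)

# Z-score normalization with the slide's mu and sigma; 73,600 -> 1.225.
mu, sigma = 54_000.0, 16_000.0
zscore = (income - mu) / sigma

# Decimal scaling: smallest j with max(|v'|) < 1 (here j = 5).
j = int(np.ceil(np.log10(np.abs(income).max() + 1)))
decimal = income / 10.0 ** j
```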

Data Reduction Strategies

Why data reduction?
- A database/data warehouse may store terabytes of data
- Complex data analysis/mining may take a very long time to run on the complete data set
Data reduction
- Obtain a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results
Data reduction strategies:
- Aggregation
- Sampling
- Dimensionality reduction
- Feature subset selection
- Feature creation
- Discretization and binarization
- Attribute transformation

Data Reduction: Aggregation

Combining two or more attributes (or objects) into a single attribute (or object)
Purpose:
- Data reduction: reduce the number of attributes or objects
- Change of scale: cities aggregated into regions, states, countries, etc.
- More stable data: aggregated data tends to have less variability

Data Reduction: Sampling

Sampling is the main technique employed for data selection.
It is often used for both the preliminary investigation of the data and the final data analysis.
Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming.
Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming.

Data Reduction: Types of Sampling

Simple random sampling
- There is an equal probability of selecting any particular item
Sampling without replacement
- As each item is selected, it is removed from the population
Sampling with replacement
- Objects are not removed from the population as they are selected for the sample
- In sampling with replacement, the same object can be picked more than once
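
A minimal NumPy sketch of simple random sampling with and without replacement (population and sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
population = np.arange(1_000)  # stand-in data set

# Without replacement: each selected item leaves the population.
srs_wor = rng.choice(population, size=100, replace=False)

# With replacement: the same item can be picked more than once.
srs_wr = rng.choice(population, size=100, replace=True)
```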

Data Reduction: Dimensionality Reduction

Purpose:
- Avoid the curse of dimensionality
- Reduce the amount of time and memory required by data mining algorithms
- Allow data to be more easily visualized
- May help to eliminate irrelevant features or reduce noise
Techniques:
- Principal Component Analysis (PCA)
- Singular Value Decomposition (SVD)
- Others: supervised and non-linear techniques

Dimensionality Reduction: PCA

The goal is to find a projection that captures the largest amount of variation in the data.
Find the eigenvectors of the covariance matrix; the eigenvectors define the new space.

[Figure: 2-D data in (x1, x2) with its principal axes]
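
A compact NumPy sketch of PCA via the covariance-matrix eigendecomposition described above, on synthetic data:

```python
import numpy as np

# Synthetic 2-D data whose variation lies mostly along one direction.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])

Xc = X - X.mean(axis=0)                 # center the data
cov = np.cov(Xc, rowvar=False)          # covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # ascending eigenvalues
order = np.argsort(eigvals)[::-1]       # components by variance captured

# Project onto the first principal component: 2-D -> 1-D.
X_reduced = Xc @ eigvecs[:, order[:1]]
```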

Data Reduction: Feature Subset Selection

Another way to reduce the dimensionality of data
Redundant features
- duplicate much or all of the information contained in one or more other attributes
- Example: purchase price of a product and the amount of sales tax paid
Irrelevant features
- contain no information that is useful for the data mining task at hand
- Example: students' IDs are often irrelevant to the task of predicting students' GPAs

Data Reduction: Feature Subset Selection

Techniques:
Brute-force approach:
- Try all possible feature subsets as input to the data mining algorithm
Filter approaches:
- Features are selected before the data mining algorithm is run
Wrapper approaches:
- Use the data mining algorithm as a black box to find the best subset of attributes
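
As a toy example of a filter approach (one of many possible filters), this sketch drops one feature from any highly correlated pair before mining; the 0.95 cut-off is an arbitrary choice:

```python
import numpy as np

def filter_redundant(X, threshold=0.95):
    # Keep a feature only if it is not highly correlated
    # with a feature that has already been kept.
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, k] < threshold for k in keep):
            keep.append(j)
    return X[:, keep]
```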

Data Reduction: Feature Creation

Create new attributes that can capture the important information in a data set much more efficiently than the original attributes
Three general methodologies:
- Feature extraction: domain-specific
- Mapping data to a new space
- Feature construction: combining features

Data Reduction: Mapping Data to a New Space

Fourier transform
Wavelet transform

[Figure: two sine waves, with and without noise, and their frequency spectra]

Data Reduction: Discretization Using Class Labels

Entropy-based approach

[Figure: discretization into 3 and 5 categories for both x and y]

Data Reduction: Discretization Without Using Class Labels

[Figure: data discretized by equal frequency, equal interval width, and K-means]

Data Reduction: Attribute Transformation

A function that maps the entire set of values of a given attribute to a new set of replacement values, such that each old value can be identified with one of the new values
- Simple functions: x^k, log(x), e^x, |x|
- Standardization and normalization
