By B.M. Brinda, AP / IT - VCEW
12/04/2016
OVERVIEW
Introduction to Data
Data Warehouse
OLAP Vs. OLTP
Multi Dimensional Data
Data Mining
Data Preprocessing
INTRODUCTION
Data
Data is information that has been translated into a form that is more convenient to move or process.
Database
A database is an organized collection of information which can easily be accessed, managed, and updated by a set of programs.
A process of transforming data into information and making it available to users in a timely enough manner to make a difference.
DATA WAREHOUSE
"A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process." (W. H. Inmon)
Subject-Oriented
Data that gives information about a particular subject instead of about a company's ongoing operations. Data is categorized and stored by business subject rather than by application.
[Figure: application-oriented files (Loans, ATM, Credit Card, Trust, Savings) reorganized around business subjects (Customer, Product, Vendor, Activity)]
Integrated
Data on a given subject is defined and stored once. Data is gathered into the data warehouse from a variety of sources and merged into a coherent whole.
[Figure: OLTP applications (Savings, Current accounts, Loans) integrated into a single Customer subject in the data warehouse]
Time-Variant
Data is stored as a series of snapshots, each representing a period of time. All data in the data warehouse is identified with a particular time period.
[Figure: monthly snapshots (Jan-97, Feb-97, Mar-97) of January, February, and March data, each tagged with its time period]
Nonvolatile
Data in the data warehouse is not updated or deleted.
Data is stable in a data warehouse. More data is added but data
is never removed. This enables management to gain a
consistent picture of the business.
[Figure: operational systems perform load, insert, update, delete, and read; the warehouse performs only load and read]
Changing Data
[Figure: a first-time load from the operational database into the warehouse database, followed by periodic refreshes]
OLTP
OLTP: Online Transaction Processing, or operational database systems.
Performs online transaction and query processing.
Covers most day-to-day operations: purchasing, inventory, manufacturing, banking, payroll, registration, etc.
Characterized by a large number of short online transactions (INSERT, UPDATE, DELETE).
OLAP
OLAP: Online Analytical Processing, or data warehouse systems.
Serves users or knowledge workers in the role of decision making and data analysis.
Organizes and presents data in various formats to satisfy varied user requests.
Characterized by a relatively low volume of transactions.
OLAP allows users to analyze information from multiple database systems at one time; OLAP data is stored in multidimensional databases.
OLTP vs. OLAP

                     OLTP                          OLAP
users                clerk, IT professional        knowledge worker
function             day-to-day operations         decision support
DB design            application-oriented          subject-oriented
data                 current, up-to-date;          historical; summarized,
                     detailed, flat relational;    multidimensional; integrated,
                     isolated                      consolidated
usage                repetitive                    ad-hoc
access               read/write; index/hash        lots of scans
                     on primary key
unit of work         short, simple transaction     complex query
# records accessed   tens                          millions
# users              thousands                     hundreds
DB size              100 MB to GB                  100 GB to TB
metric               transaction throughput        query throughput, response time
DATA CUBE
A data cube is defined by dimensions and facts.
Dimensions: the entities with respect to which an organization wants to keep records, e.g., item, branch, location.
Dimension table: each dimension may be associated with a table describing it.
3D Data Cube
[Figure: a 3-D data cube of the measure dollars_sold]
STAR SCHEMA
Contains a large central table (the fact table), holding the bulk of the data, and a set of smaller attendant tables (dimension tables), one for each dimension.
SNOWFLAKE SCHEMA
A variant of the star schema model, in which some dimension tables are normalized, thereby further splitting the data into additional tables.
FACT CONSTELLATION
The schema can be viewed as a collection of stars, and hence is called a galaxy schema or a fact constellation. Used for sophisticated applications.
CONCEPT HIERARCHY
Defines a sequence of mappings from a set of low-level
concepts to higher-level, more general concepts.
OLAP OPERATIONS
Roll up (drill up): reduces a dimension / climbs up a concept hierarchy
Drill down (roll down): adds a new dimension / steps down a concept hierarchy
Slice and dice
Pivot (rotate)
ROLL UP
Performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction.
DRILL DOWN
Drill-down is the reverse of roll-up. It navigates from less detailed data to more detailed data. Drill-down can be realized either by stepping down a concept hierarchy for a dimension or by introducing additional dimensions.
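As a sketch, roll-up can be expressed as aggregation along a concept hierarchy. The sales figures and the city-to-country mapping below are made up for illustration:

```python
from collections import defaultdict

# Hypothetical sales fact table: (city, quarter) -> dollars_sold.
sales = {
    ("Chicago",   "Q1"): 854,  ("Chicago",   "Q2"): 943,
    ("New York",  "Q1"): 1087, ("New York",  "Q2"): 968,
    ("Toronto",   "Q1"): 818,  ("Toronto",   "Q2"): 894,
    ("Vancouver", "Q1"): 605,  ("Vancouver", "Q2"): 680,
}
# Assumed concept hierarchy for the location dimension: city < country.
country_of = {"Chicago": "USA", "New York": "USA",
              "Toronto": "Canada", "Vancouver": "Canada"}

def roll_up(fact, hierarchy):
    """Climb the location hierarchy: aggregate city-level cells to country level."""
    out = defaultdict(int)
    for (city, quarter), dollars in fact.items():
        out[(hierarchy[city], quarter)] += dollars
    return dict(out)

rolled = roll_up(sales, country_of)
print(rolled[("USA", "Q1")])   # 854 + 1087 = 1941
```

Drill-down would go the other way: returning from the country-level totals to the stored city-level cells, or joining in a finer dimension such as month.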
PIVOT
The pivot operation is also known as rotation. It rotates the data axes in view in order to provide an alternative presentation of the data.
Virtual warehouse
A set of views over operational databases
Only some of the possible summary views may be materialized
DATA MINING
Extracting or mining knowledge from large amounts of data. Also called knowledge extraction, knowledge discovery from data (KDD), or data/pattern analysis.
KDD STEPS
1. Data cleaning
2. Data integration
3. Data selection
4. Data transformation
5. Data mining
6. Pattern evaluation
7. Knowledge presentation
Data integration: integration of multiple databases, data cubes, or files
Data transformation: normalization and aggregation
Data reduction: obtains a reduced representation in volume that produces the same or similar analytical results
Data discretization: part of data reduction, of particular importance for numerical data
Data Cleaning
Importance
"Data cleaning is one of the three biggest problems in data warehousing" (Ralph Kimball)
"Data cleaning is the number one problem in data warehousing" (DCI survey)
Data Cleaning: How to Handle Missing Data?
Data Cleaning: How to Handle Noisy Data?
Binning: first sort the data and partition it into (equal-frequency) bins; then smooth by bin means, bin medians, bin boundaries, etc.
Regression: smooth by fitting the data to regression functions.
Clustering: detect and remove outliers.
Equal-width (distance) partitioning: if A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B − A)/N. The most straightforward approach, but outliers may dominate the presentation, and skewed data is not handled well.
Equal-depth (frequency) partitioning: divides the range into N intervals, each containing approximately the same number of samples. Good data scaling, though managing categorical attributes can be tricky.
Data Cleaning: Binning Methods
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
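The worked example above can be reproduced in plain Python; the bin count of 3 simply follows the example:

```python
# Equal-frequency binning of the example prices, then smoothing.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted
n_bins = 3
size = len(prices) // n_bins
bins = [prices[i * size:(i + 1) * size] for i in range(n_bins)]

# Smoothing by bin means: every value in a bin is replaced by the bin mean.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: each value snaps to the nearer bin boundary.
def by_boundaries(b):
    lo, hi = b[0], b[-1]
    return [lo if v - lo <= hi - v else hi for v in b]

print(by_means[0])             # [9, 9, 9, 9]
print(by_boundaries(bins[0]))  # [4, 4, 4, 15]
```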
REGRESSION
Data can be smoothed by fitting the data to a
function, such as with regression.
Linear regression involves finding the best line
to fit two attributes (or variables), so that one
attribute can be used to predict the other.
Multiple linear regression is an extension of linear
regression, where more than two attributes are
involved and the data are fit to a
multidimensional surface.
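A minimal least-squares line fit, on illustrative data points that lie roughly on the line y = x + 1:

```python
# Fit y = w1*x + w0 by least squares, so one attribute can smooth/predict the other.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 2.9, 4.2, 4.8]   # made-up data, roughly y = x + 1

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
# Slope: covariance of x and y over variance of x; intercept from the means.
w1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
w0 = my - w1 * mx

smoothed = [w1 * x + w0 for x in xs]   # values projected onto the fitted line
print(round(w1, 2), round(w0, 2))
```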
[Figure: linear regression fitting the line y = x + 1 through data points (X1, Y1)]
CLUSTERING
Outliers may be detected by clustering, where similar values are organized into groups, or clusters. Intuitively, values that fall outside of the set of clusters may be considered outliers.
Outliers: data objects with characteristics that are considerably different from most of the other data objects in the data set.
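A sketch of clustering-based outlier detection using a tiny 1-D k-means; the data, the choice of k = 3, and the "singleton cluster" rule are illustrative assumptions:

```python
# Group similar values into clusters with k-means, then treat members of
# very small clusters as outliers.
values = [22, 24, 25, 26, 61, 63, 64, 180]   # 180 is the planted outlier
centroids = [22.0, 61.0, 180.0]               # spread-out initialization

for _ in range(10):                           # Lloyd's iterations
    clusters = [[] for _ in centroids]
    for v in values:
        nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
        clusters[nearest].append(v)
    # Recompute each centroid as its cluster mean (keep it if the cluster is empty).
    centroids = [sum(c) / len(c) if c else centroids[i]
                 for i, c in enumerate(clusters)]

# Values that end up alone in a cluster are flagged as outliers.
outliers = [v for c in clusters for v in c if len(c) == 1]
print(outliers)   # [180]
```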
Data Integration
Data integration:
Combines data from multiple sources into a coherent store
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources
Entity identification problem:
Identify real world entities from multiple data sources, e.g.,
Bill Clinton = William Clinton
Detecting and resolving data value conflicts
For the same real world entity, attribute values from different
sources are different
Possible reasons: different representations, different scales,
e.g., metric vs. British units
Data Integration: Handling Redundancy in Data Integration
Data Integration: Correlation Analysis (Numerical Data)
Correlation coefficient (Pearson's product-moment coefficient):
r_A,B = Σ(a − Ā)(b − B̄) / ((n − 1) σ_A σ_B) = (Σ(ab) − n·Ā·B̄) / ((n − 1) σ_A σ_B)
where n is the number of tuples, Ā and B̄ are the respective means of A and B, and σ_A and σ_B are their standard deviations.
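The two algebraic forms of the correlation coefficient can be checked against each other in Python; the data below is made up:

```python
import math

# Compute r_A,B both ways and confirm they agree.
a = [2.0, 4.0, 6.0, 8.0, 10.0]
b = [1.0, 3.0, 4.0, 9.0, 13.0]
n = len(a)
mean_a, mean_b = sum(a) / n, sum(b) / n
sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))  # sample std dev
sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))

# Form 1: sum of cross-deviations over (n - 1) * sd_a * sd_b
r1 = sum((x - mean_a) * (y - mean_b)
         for x, y in zip(a, b)) / ((n - 1) * sd_a * sd_b)
# Form 2: (sum of products minus n * mean_a * mean_b) over the same denominator
r2 = (sum(x * y for x, y in zip(a, b))
      - n * mean_a * mean_b) / ((n - 1) * sd_a * sd_b)

print(round(r1, 4))   # strongly positive: a and b rise together
```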
Data Integration: Correlation Analysis (Categorical Data)
χ² (chi-square) test:
χ² = Σ (Observed − Expected)² / Expected
The larger the χ² value, the more likely the variables are related.
The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count.
Correlation does not imply causality: the number of hospitals and the number of car thefts in a city are correlated, but both are causally linked to a third variable, population.
Example contingency table (observed counts, with expected counts in parentheses):

                                       Sum (row)
             250 (90)    200 (360)       450
              50 (210)  1000 (840)      1050
Sum (col.)   300         1200           1500

χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 ≈ 507.93
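The χ² value can be reproduced directly from the observed and expected counts:

```python
# Chi-square statistic for the 2x2 contingency table above.
observed = [250, 50, 200, 1000]
expected = [90, 210, 360, 840]

# Sum of (O - E)^2 / E over all cells.
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi2, 1))   # about 507.9, matching the value above
```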
Data Transformation
Attribute/feature construction
New attributes constructed from the given ones
Data Transformation: Normalization
Min-max normalization to [new_min_A, new_max_A]:
v' = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A
Ex.: Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716.
Z-score normalization (Ā: mean, σ_A: standard deviation of A):
v' = (v − Ā) / σ_A
Ex.: Let Ā = 54,000 and σ_A = 16,000. Then 73,600 maps to (73,600 − 54,000) / 16,000 = 1.225.
Normalization by decimal scaling:
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
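The three normalizations can be checked on the income example (using the mean 54,000 and standard deviation 16,000 given above):

```python
# Normalize the income value v three different ways.
v, min_a, max_a = 73_600, 12_000, 98_000
mean_a, sd_a = 54_000, 16_000

# Min-max normalization to [0.0, 1.0]
minmax = (v - min_a) / (max_a - min_a) * (1.0 - 0.0) + 0.0
# Z-score normalization
zscore = (v - mean_a) / sd_a
# Decimal scaling: divide by 10^j for the smallest j with max(|v'|) < 1
decimal = v / 10 ** 5

print(round(minmax, 3), zscore, decimal)   # 0.716 1.225 0.736
```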
Data reduction
Obtain a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results.
Strategies:
Aggregation (change of scale, e.g., cities aggregated into regions, states, countries, etc.)
Sampling
Dimensionality reduction
Feature subset selection
Feature creation
Discretization and binarization
Attribute transformation
Data Reduction: Dimensionality Reduction
Purpose: reduce the time and memory required by mining algorithms, allow data to be more easily visualized, and help avoid the curse of dimensionality.
Techniques: Principal Components Analysis (PCA), Singular Value Decomposition (SVD).
Data Reduction: Feature Subset Selection
Redundant features: duplicate much or all of the information contained in one or more other attributes.
Irrelevant features: contain no information useful for the data mining task at hand.
Data Reduction: Feature Subset Selection
Techniques:
Brute-force approach:
Try all possible feature subsets as input to data mining
algorithm
Filter approaches:
Features are selected before data mining algorithm is run
Wrapper approaches:
Use the data mining algorithm as a black box to find best
subset of attributes
Data Reduction: Feature Creation
Data Reduction: Mapping Data to a New Space
Fourier transform
Wavelet transform
Data Reduction: Discretization Using Class Labels
Data Reduction: Discretization Without Using Class Labels
[Figure: the same data discretized by equal-frequency binning and by K-means clustering]
Data Reduction: Attribute Transformation