
Introduction to

Data Warehouse and Data Mining


IF5031
Manajemen Informasi B
Hira Laksmiwati / Saiful Akbar

Source:
1. Ankur Teredesai, Assistant Professor, Dept. of Computer Science, RIT.
2. CSEP 546, Pedro Domingos
3. Database System Concepts, 6th Ed., Silberschatz, Korth and Sudarshan

IF5031/Intro DWH-DM/Okt/2015

OBJECTIVES
Understand the concept and role of Data Warehouse (5W)
Understand the concept and role of Data Mining
Know the main Data Mining techniques
Understand the difference between DWH and DM
Understand the concepts of OLTP and OLAP


Topics

Decision Support Systems
Data Warehousing
Data Mining
Data Mining Techniques:
o Classification
o Association Rules
o Clustering
OLTP vs OLAP


Decision Support Systems

Decision-support systems are used to make business decisions, often based on data collected by on-line transaction-processing systems.
Examples of business decisions:
o What items to stock?
o What insurance premium to charge?
o To whom to send advertisements?
Examples of data used for making decisions:
o Retail sales transaction details
o Customer profiles (income, age, gender, etc.)


Decision-Support Systems: Overview

Data analysis tasks are simplified by specialized tools and SQL extensions
o Example tasks:
  For each product category and each region, what were the total sales in the last quarter, and how do they compare with the same quarter last year?
  As above, for each product category and each customer category
Statistical analysis packages (e.g., S++) can be interfaced with databases
o Statistical analysis is a large field, but not covered here
Data mining seeks to discover knowledge automatically, in the form of statistical rules and patterns, from large databases.
A data warehouse archives information gathered from multiple sources, and stores it under a unified schema, at a single site.
o Important for large businesses that generate data from multiple divisions, possibly at multiple sites
o Data may also be purchased externally


Data Warehousing
Data sources often store only current data, not
historical data
Corporate decision making requires a unified view
of all organizational data, including historical data
A data warehouse is a repository (archive) of
information gathered from multiple sources, stored
under a unified schema, at a single site
o Greatly simplifies querying, permits study of historical trends
o Shifts decision support query load away from transaction processing
systems



Data Warehouse: Subject-Oriented
Organized around major subjects, such as customer, product, sales.
Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.

Data Warehouse: Integrated
Constructed by integrating multiple, heterogeneous data sources
o relational databases, flat files, on-line transaction records
Data cleaning and data integration techniques are applied.
o Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
  E.g., hotel price: currency, tax, whether breakfast is covered, etc.
o When data is moved to the warehouse, it is converted.

Data Warehouse: Time-Variant
The time horizon for the data warehouse is significantly longer than that of operational systems.
o Operational database: current value data.
o Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
Every key structure in the data warehouse
o Contains an element of time, explicitly or implicitly
o But the key of operational data may or may not contain a time element.

Data Warehouse: Non-Volatile
A physically separate store of data transformed from the operational environment.
Operational updates of data do not occur in the data warehouse environment.
o Does not require transaction processing, recovery, or concurrency control mechanisms
o Requires only two operations in data accessing: initial loading of data and access of data.


Data Warehouse vs. Heterogeneous DBMS


Traditional heterogeneous DB integration:
o Build wrappers/mediators on top of heterogeneous
databases
o Query driven approach
When a query is posed to a client site, a meta-dictionary
is used to translate the query into queries appropriate for
individual heterogeneous sites involved, and the results
are integrated into a global answer set
Involves complex information filtering; queries compete with local processing for resources

Data warehouse: update-driven, high performance


o Information from heterogeneous sources is integrated in
advance and stored in warehouses for direct query and
analysis

Data Warehouse vs. Operational DBMS


OLTP (on-line transaction processing)
o Major task of traditional relational DBMS
o Day-to-day operations: purchasing, inventory, banking,
manufacturing, payroll, registration, accounting, etc.
OLAP (on-line analytical processing)
o Major task of data warehouse system
o Data analysis and decision making
Distinct features (OLTP vs. OLAP):
o User and system orientation: customer vs. market
o Data contents: current, detailed vs. historical, consolidated
o Database design: ER + application vs. star + subject
o View: current, local vs. evolutionary, integrated
o Access patterns: update vs. read-only but complex queries

OLTP vs. OLAP

Feature              OLTP                           OLAP
users                clerk, IT professional         knowledge worker
function             day-to-day operations          decision support
DB design            application-oriented           subject-oriented
data                 current, up-to-date;           historical, summarized,
                     detailed, flat relational;     multidimensional;
                     isolated                       integrated, consolidated
usage                repetitive                     ad-hoc
access               read/write;                    lots of scans
                     index/hash on prim. key
unit of work         short, simple transaction      complex query
# records accessed   tens                           millions
# users              thousands                      hundreds
DB size              100 MB - GB                    100 GB - TB
metric               transaction throughput         query throughput, response


Why Separate Data Warehouse?


High performance for both systems
o DBMS tuned for OLTP: access methods, indexing,
concurrency control, recovery
o Warehouse tuned for OLAP: complex OLAP queries,
multidimensional view, consolidation.
Different functions and different data:
o missing data: Decision support requires historical data
which operational DBs do not typically maintain
o data consolidation: DS requires consolidation
(aggregation, summarization) of data from
heterogeneous sources
o data quality: different sources typically use inconsistent
data representations, codes and formats which have
to be reconciled


Data Warehousing and OLAP


What is a data warehouse?
A multi-dimensional data model
Data warehouse architecture
Data warehouse implementation
Extensions of data cubes
From data warehousing to data mining

From Tables and Spreadsheets to Data Cubes

A data warehouse is based on a multidimensional data model which views data in the form of a data cube
A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
o Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year)
o Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables
In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.
IF5031/Intro DWH-DM/Okt/2015

17

Cube: A Lattice of Cuboids

(figure: lattice of cuboids for dimensions time, item, location, supplier)
0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid: time, item, location, supplier
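The lattice of cuboids is simply the power set of the dimension list: one cuboid per subset of dimensions to group by. A small sketch enumerating it (the function name cuboid_lattice is ours, for illustration):

```python
from itertools import combinations

def cuboid_lattice(dimensions):
    """Enumerate every cuboid (group-by subset) of a data cube,
    keyed by the number of dimensions it retains."""
    lattice = {}
    for k in range(len(dimensions) + 1):
        lattice[k] = list(combinations(dimensions, k))
    return lattice

lattice = cuboid_lattice(["time", "item", "location", "supplier"])
# 0-D apex cuboid is the empty group-by; 4-D base cuboid keeps all dimensions.
print([len(lattice[k]) for k in range(5)])  # → [1, 4, 6, 4, 1]
```

The counts 1, 4, 6, 4, 1 match the lattice on the slide: 2^4 = 16 cuboids in total for four dimensions.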


Conceptual Modeling of Data Warehouses

Modeling data warehouses: dimensions & measures
o Star schema: a fact table in the middle connected to a set of dimension tables
o Snowflake schema: a refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
o Fact constellation: multiple fact tables share dimension tables; viewed as a collection of stars, therefore also called a galaxy schema

Example of Star Schema

(figure: star schema for sales)
Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
Dimension tables:
o time (time_key, day, day_of_the_week, month, quarter, year)
o item (item_key, item_name, brand, type, supplier_type)
o branch (branch_key, branch_name, branch_type)
o location (location_key, street, city, province_or_state, country)
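The star schema above can be exercised directly in SQL; a minimal sketch using Python's built-in sqlite3, trimmed to two dimensions. The table contents are invented for illustration, and the time dimension is named time_dim here (our choice, to keep it visually distinct from SQL's time functions):

```python
import sqlite3

# Build a tiny star schema in memory: one fact table, two dimension tables.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT, type TEXT);
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim,
    item_key INTEGER REFERENCES item,
    units_sold INTEGER, dollars_sold REAL);
INSERT INTO time_dim VALUES (1, 'Q1', 2015), (2, 'Q2', 2015);
INSERT INTO item VALUES (10, 'laptop', 'electronics'), (11, 'novel', 'books');
INSERT INTO sales_fact VALUES (1, 10, 3, 3000.0), (1, 11, 5, 50.0), (2, 10, 2, 2000.0);
""")
# A typical OLAP-style query: total dollars sold per item type per quarter.
rows = con.execute("""
    SELECT t.quarter, i.type, SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN time_dim t ON f.time_key = t.time_key
    JOIN item i ON f.item_key = i.item_key
    GROUP BY t.quarter, i.type
    ORDER BY t.quarter, i.type
""").fetchall()
print(rows)  # → [('Q1', 'books', 50.0), ('Q1', 'electronics', 3000.0), ('Q2', 'electronics', 2000.0)]
```

Note how the fact table holds only keys and measures, while descriptive attributes live in the dimension tables, exactly as in the figure.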

Example of Snowflake Schema

(figure: snowflake schema for sales)
Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
Dimension tables (partially normalized):
o time (time_key, day, day_of_the_week, month, quarter, year)
o item (item_key, item_name, brand, type, supplier_key), with supplier (supplier_key, supplier_type)
o branch (branch_key, branch_name, branch_type)
o location (location_key, street, city_key), with city (city_key, city, province_or_state, country)

Example of Fact Constellation

(figure: fact constellation with sales and shipping fact tables)
Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped
Shared dimension tables:
o time (time_key, day, day_of_the_week, month, quarter, year)
o item (item_key, item_name, brand, type, supplier_type)
o branch (branch_key, branch_name, branch_type)
o location (location_key, street, city, province_or_state, country)
o shipper (shipper_key, shipper_name, location_key, shipper_type)

Design Issues
When and how to gather data
o Source-driven architecture: data sources transmit new information to the warehouse, either continuously or periodically (e.g., at night)
o Destination-driven architecture: the warehouse periodically requests new information from data sources
o Keeping the warehouse exactly synchronized with its data sources (e.g., using two-phase commit) is too expensive
  Usually OK to have slightly out-of-date data at the warehouse
  Data/updates are periodically downloaded from on-line transaction processing (OLTP) systems.
What schema to use
o Schema integration

More Warehouse Design Issues

Data cleansing
o E.g., correct mistakes in addresses (misspellings, zip code errors)
o Merge address lists from different sources and purge duplicates
How to propagate updates
o The warehouse schema may be a (materialized) view of schemas from data sources
What data to summarize
o Raw data may be too large to store on-line
o Aggregate values (totals/subtotals) often suffice
o Queries on raw data can often be transformed by the query optimizer to use aggregate values


Warehouse Schemas
Dimension values are usually encoded using small
integers and mapped to full values via dimension
tables
Resultant schema is called a star schema
o More complicated schema structures
Snowflake schema: multiple levels of dimension tables
Constellation: multiple fact tables


Data Warehouse Schema


Introduction to
DATA MINING


Data Mining: A KDD Process

Data mining is the core of the knowledge discovery process.

(figure: the KDD pipeline)
Databases, Data Cleaning, Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation

Steps of a KDD Process

1. Learning the application domain
   o relevant prior knowledge and goals of the application
2. Creating a target data set: data selection
3. Data cleaning and preprocessing (may take 60% of the effort!)
4. Data reduction and transformation
   o find useful features, dimensionality/variable reduction, invariant representation
5. Choosing the functions of data mining
   o summarization, classification, regression, association, clustering
6. Choosing the mining algorithm(s)
7. Data mining: search for patterns of interest
8. Pattern evaluation and knowledge presentation
   o visualization, transformation, removing redundant patterns, etc.
9. Use of discovered knowledge

Necessity Is the Mother of Invention

Data explosion problem
o Automated data collection tools and mature database technology lead to tremendous amounts of data accumulated and/or to be analyzed in databases, data warehouses, and other information repositories
We are drowning in data, but starving for knowledge!
Solution: data warehousing and data mining
o Data warehousing and on-line analytical processing
o Mining interesting knowledge (rules, regularities, patterns, constraints) from data in large databases

What Is Data Mining?

Data mining (knowledge discovery from data)
o Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
Alternative names
o Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
Watch out: is everything "data mining"?
o (Deductive) query processing
o Expert systems or small ML/statistical programs

Data Mining
Data mining is the process of semi-automatically
analyzing large databases to find useful patterns
Prediction based on past history
o Predict if a credit card applicant poses a good credit risk, based on some
attributes (income, job type, age, ..) and past history
o Predict if a pattern of phone calling card usage is likely to be fraudulent

Some examples of prediction mechanisms:


o Classification
Given a new item whose class is unknown, predict to which class it
belongs
o Regression formulae
Given a set of mappings for an unknown function, predict the function
result for a new parameter value


Data Mining (Cont.)


Descriptive Patterns
o Associations
Find books that are often bought by similar customers. If a new
such customer buys one such book, suggest the others too.
o Associations may be used as a first step in detecting causation
E.g., association between exposure to chemical X and cancer
o Clusters
E.g., typhoid cases were clustered in an area surrounding a
contaminated well
Detection of clusters remains important in detecting epidemics


Some Patterns
Association rules
o 98% of people who purchase diapers also buy
beer

Classification
o People with age less than 25 and salary > 40k
drive sports cars

Similar time sequences


o Stocks of companies A and B perform similarly

Outlier Detection
o Residential customers for telecom company with
businesses at home

Data Mining and Business Intelligence

(figure: the business intelligence pyramid; the potential to support business decisions increases toward the top)
Making Decisions (End User)
Data Presentation: Visualization Techniques (Business Analyst)
Data Mining: Information Discovery (Data Analyst)
Data Exploration: Statistical Analysis, Querying and Reporting (Data Analyst)
Data Warehouses / Data Marts: OLAP, MDA (DBA)
Data Sources: Paper, Files, Information Providers, Database Systems, OLTP

Architecture: Typical Data Mining System

(figure: layered architecture, top to bottom)
Graphical user interface
Pattern evaluation
Data mining engine
Knowledge base
Database or data warehouse server
Data cleaning, data integration, and filtering
Databases and Data Warehouse

Data Mining: On What Kinds of Data?

Relational database
Data warehouse
Transactional database
Advanced database and information repository
o Object-relational database
o Spatial and temporal data
o Time-series data
o Stream data
o Multimedia database
o Heterogeneous and legacy database
o Text databases & WWW


Data Mining Functionalities


Concept description: Characterization and discrimination
o Generalize, summarize, and contrast data characteristics, e.g.,
dry vs. wet regions

Association (correlation and causality)


o Diaper ⇒ Beer [support 0.5%, confidence 75%]

Classification and Prediction


o Construct models (functions) that describe and distinguish
classes or concepts for future prediction
E.g., classify countries based on climate, or classify cars based on
gas mileage

o Presentation: decision-tree, classification rule, neural network


o Predict some unknown or missing numerical values

Data Mining Functionalities (2)


Cluster analysis
o Class label is unknown: Group data to form new classes, e.g.,
cluster houses to find distribution patterns
o Maximizing intra-class similarity & minimizing interclass similarity
Outlier analysis
o Outlier: a data object that does not comply with the general
behavior of the data
o Noise or exception? No! useful in fraud detection, rare events
analysis
Trend and evolution analysis
o Trend and deviation: regression analysis
o Sequential pattern mining, periodicity analysis
o Similarity-based analysis
Other pattern-directed or statistical analyses

MORE ON DM PATTERNS
(for students to study on their own)


Classification Rules
Classification rules help assign new objects to classes.
o E.g., given a new automobile insurance applicant, should he or she be classified as low risk, medium risk or high risk?
Classification rules for the above example could use a variety of data, such as educational level, salary, age, etc.
o ∀ person P, P.degree = masters and P.income > 75,000 ⇒ P.credit = excellent
o ∀ person P, P.degree = bachelors and (P.income ≥ 25,000 and P.income ≤ 75,000) ⇒ P.credit = good
Rules are not necessarily exact: there may be some misclassifications
Classification rules can be shown compactly as a decision tree.
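The two rules above translate directly into code; a minimal sketch (the fallback class for applicants matched by neither rule is our assumption, since the slide notes rules need not cover every case):

```python
def classify_credit(degree, income):
    """Apply the two credit classification rules from the slide."""
    if degree == "masters" and income > 75_000:
        return "excellent"
    if degree == "bachelors" and 25_000 <= income <= 75_000:
        return "good"
    return "unknown"  # neither rule applies; rules are not necessarily exact

print(classify_credit("masters", 90_000))    # → excellent
print(classify_credit("bachelors", 40_000))  # → good
```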

Decision Tree


Construction of Decision Trees
Training set: a data sample in which the classification is already known.
Greedy top-down generation of decision trees.
o Each internal node of the tree partitions the data into groups, based on a partitioning attribute and a partitioning condition for the node
o Leaf node:
  all (or most) of the items at the node belong to the same class, or
  all attributes have been considered, and no further partitioning is possible.


Best Splits
Pick the best attributes and conditions on which to partition
The purity of a set S of training instances can be measured quantitatively in several ways.
o Notation: number of classes = k, number of instances = |S|, fraction of instances in class i = p_i.
The Gini measure of purity is defined as

    Gini(S) = 1 - Σ_{i=1..k} p_i^2

o When all instances are in a single class, the Gini value is 0
o It reaches its maximum (of 1 - 1/k) if each class has the same number of instances.
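The Gini measure is a one-liner over class counts; a sketch checking the two boundary cases just stated:

```python
from collections import Counter

def gini(labels):
    """Gini measure: Gini(S) = 1 - sum_i p_i^2, where p_i is the
    fraction of instances in class i."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["a", "a", "a"]))  # single class → 0.0
print(gini(["a", "b"]))       # two equal classes → 0.5, i.e. 1 - 1/k for k = 2
```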


Best Splits (Cont.)

Another measure of purity is the entropy measure, which is defined as

    entropy(S) = - Σ_{i=1..k} p_i log2 p_i

When a set S is split into multiple sets S_i, i = 1, 2, ..., r, we can measure the purity of the resultant set of sets as:

    purity(S_1, S_2, ..., S_r) = Σ_{i=1..r} (|S_i| / |S|) purity(S_i)

The information gain due to a particular split of S into S_i, i = 1, 2, ..., r:

    Information-gain(S, {S_1, S_2, ..., S_r}) = purity(S) - purity(S_1, S_2, ..., S_r)
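These three definitions compose naturally; a sketch using entropy as the purity measure:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """entropy(S) = - sum_i p_i log2 p_i (lower means purer)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def split_purity(subsets, purity=entropy):
    """Weighted purity of a split: sum_i (|S_i|/|S|) * purity(S_i)."""
    total = sum(len(s) for s in subsets)
    return sum(len(s) / total * purity(s) for s in subsets)

def information_gain(labels, subsets, purity=entropy):
    """purity(S) - purity(S_1, ..., S_r); larger means a more useful split."""
    return purity(labels) - split_purity(subsets, purity)

S = ["yes", "yes", "no", "no"]
# A perfect split separates the classes completely, so the gain
# equals entropy(S) = 1 bit.
print(information_gain(S, [["yes", "yes"], ["no", "no"]]))  # → 1.0
```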


Best Splits (Cont.)

Measure of the cost of a split:

    Information-content(S, {S_1, S_2, ..., S_r}) = - Σ_{i=1..r} (|S_i| / |S|) log2 (|S_i| / |S|)

    Information-gain ratio = Information-gain(S, {S_1, S_2, ..., S_r}) / Information-content(S, {S_1, S_2, ..., S_r})

The best split is the one that gives the maximum information gain ratio
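The information content depends only on the subset sizes, so it can be sketched separately; note how it penalizes splits into many small pieces, which is exactly why the ratio is used instead of the raw gain:

```python
from math import log2

def information_content(subset_sizes):
    """Cost of a split: - sum_i (|S_i|/|S|) log2(|S_i|/|S|)."""
    total = sum(subset_sizes)
    return -sum(s / total * log2(s / total) for s in subset_sizes)

def gain_ratio(info_gain, subset_sizes):
    """Information-gain ratio = gain / content."""
    return info_gain / information_content(subset_sizes)

# An even 2-way split of 8 instances costs 1 bit; an 8-way split costs 3 bits,
# so the same raw gain yields a 3x smaller ratio for the 8-way split.
print(information_content([4, 4]))   # → 1.0
print(information_content([1] * 8))  # → 3.0
print(gain_ratio(0.9, [4, 4]), gain_ratio(0.9, [1] * 8))
```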

Finding Best Splits

Categorical attributes (with no meaningful order):
o Multi-way split: one child for each value
o Binary split: try all possible breakups of values into two sets, and pick the best
Continuous-valued attributes (can be sorted in a meaningful order)
o Binary split:
  Sort values, try each as a split point
    E.g., if values are 1, 10, 15, 25, split at 1, 10, 15
  Pick the value that gives the best split
o Multi-way split:
  A series of binary splits on the same attribute has a roughly equivalent effect
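The binary-split procedure for a continuous attribute can be sketched directly: sort, try each value as a split point, and score each candidate, here with weighted Gini impurity (our choice of purity measure; the data values are invented for illustration):

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_binary_split(values, labels):
    """Try every sorted value v as a split point (value <= v goes left)
    and return (v, cost) with the lowest weighted Gini impurity."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for v, _ in pairs[:-1]:  # splitting at the largest value leaves nothing right
        left = [lab for val, lab in pairs if val <= v]
        right = [lab for val, lab in pairs if val > v]
        n = len(pairs)
        cost = len(left) / n * gini(left) + len(right) / n * gini(right)
        if cost < best[1]:
            best = (v, cost)
    return best

ages = [22, 24, 30, 45, 50]
classes = ["sports", "sports", "sedan", "sedan", "sedan"]
print(best_binary_split(ages, classes))  # → (24, 0.0): "age <= 24" separates the classes
```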


Decision-Tree Construction Algorithm

Procedure GrowTree(S)
    Partition(S);

Procedure Partition(S)
    if (purity(S) > δp or |S| < δs) then
        return;
    for each attribute A
        evaluate splits on attribute A;
    Use the best split found (across all attributes) to
        partition S into S1, S2, ..., Sr;
    for i = 1, 2, ..., r
        Partition(Si);
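A minimal recursive sketch of the Partition procedure, restricted to one numeric attribute and using Gini impurity as the (inverted) purity test; the stopping thresholds and tuple-based tree representation are our illustration choices:

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def partition(rows, max_impurity=0.0, min_size=2):
    """rows: list of (value, label). Recurse until pure enough or too small."""
    labels = [lab for _, lab in rows]
    if gini(labels) <= max_impurity or len(rows) < min_size:
        majority = Counter(labels).most_common(1)[0][0]
        return ("leaf", majority)
    # evaluate binary splits on the single attribute, keep the cheapest
    best = None
    for v, _ in sorted(rows)[:-1]:
        left = [r for r in rows if r[0] <= v]
        right = [r for r in rows if r[0] > v]
        cost = (len(left) * gini([l for _, l in left]) +
                len(right) * gini([l for _, l in right])) / len(rows)
        if best is None or cost < best[0]:
            best = (cost, v, left, right)
    _, v, left, right = best
    return ("node", v,
            partition(left, max_impurity, min_size),
            partition(right, max_impurity, min_size))

tree = partition([(22, "sports"), (24, "sports"), (30, "sedan"), (45, "sedan")])
print(tree)  # → ('node', 24, ('leaf', 'sports'), ('leaf', 'sedan'))
```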

Other Types of Classifiers

Neural-net classifiers are studied in artificial intelligence and are not covered here
Bayesian classifiers use Bayes' theorem, which says

    p(c_j | d) = p(d | c_j) p(c_j) / p(d)

where
    p(c_j | d) = probability of instance d being in class c_j,
    p(d | c_j) = probability of generating instance d given class c_j,
    p(c_j) = probability of occurrence of class c_j, and
    p(d) = probability of instance d occurring

Naïve Bayesian Classifiers

Bayesian classifiers require
o computation of p(d | c_j)
o precomputation of p(c_j)
o p(d) can be ignored since it is the same for all classes
To simplify the task, naïve Bayesian classifiers assume attributes have independent distributions, and thereby estimate

    p(d | c_j) = p(d_1 | c_j) * p(d_2 | c_j) * ... * p(d_n | c_j)

o Each of the p(d_i | c_j) can be estimated from a histogram on d_i values for each class c_j
  the histogram is computed from the training instances
o Histograms on multiple attributes are more expensive to compute and store
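A tiny naïve Bayesian classifier over categorical attributes, estimating each p(d_i | c_j) from per-class histograms as described. The +1 Laplace smoothing is our added assumption (to avoid zero probabilities for unseen values), and the training data is invented:

```python
from collections import Counter, defaultdict

def train(rows):
    """rows: list of (attribute_tuple, class_label). Build p(c_j) counts
    and per-(attribute, class) value histograms."""
    class_counts = Counter(label for _, label in rows)
    hist = defaultdict(Counter)  # (attr_index, class) -> value counts
    for attrs, label in rows:
        for i, v in enumerate(attrs):
            hist[(i, label)][v] += 1
    return class_counts, hist

def classify(attrs, class_counts, hist):
    """Pick the class maximizing p(c_j) * product of smoothed p(d_i | c_j)."""
    total = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for c, n_c in class_counts.items():
        score = n_c / total  # p(c_j)
        for i, v in enumerate(attrs):
            counts = hist[(i, c)]
            score *= (counts[v] + 1) / (n_c + len(counts) + 1)  # smoothed p(d_i | c_j)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

rows = [(("high", "young"), "sports"), (("high", "young"), "sports"),
        (("low", "old"), "sedan"), (("low", "old"), "sedan")]
model = train(rows)
print(classify(("high", "young"), *model))  # → sports
```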


Regression
Regression deals with the prediction of a value, rather than a class.
o Given values for a set of variables, X1, X2, ..., Xn, we wish to predict the value of a variable Y.
One way is to infer coefficients a0, a1, a2, ..., an such that

    Y = a0 + a1 * X1 + a2 * X2 + ... + an * Xn

Finding such a linear polynomial is called linear regression.
o In general, the process of finding a curve that fits the data is also called curve fitting.
The fit may only be approximate
o because of noise in the data, or
o because the relationship is not exactly a polynomial
Regression aims to find coefficients that give the best possible fit.
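For the one-variable case Y = a0 + a1 * X, the least-squares coefficients have a closed form; a sketch (multi-variable fits are usually delegated to a linear algebra library instead):

```python
def linear_regression(xs, ys):
    """Closed-form least-squares fit of y = a0 + a1 * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    a1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    a0 = mean_y - a1 * mean_x
    return a0, a1

# Noise-free data on the line Y = 2 + 3X recovers the coefficients exactly.
a0, a1 = linear_regression([0, 1, 2, 3], [2, 5, 8, 11])
print(a0, a1)  # → 2.0 3.0
```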


Association Rules
Retail shops are often interested in associations between different items that people buy.
o Someone who buys bread is quite likely also to buy milk
o A person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts.
Association information can be used in several ways.
o E.g., when a customer buys a particular book, an online shop may suggest associated books.
Association rules:
    bread ⇒ milk
    DB-Concepts, OS-Concepts ⇒ Networks
o Left-hand side: antecedent; right-hand side: consequent
o An association rule must have an associated population; the population consists of a set of instances
  E.g., each transaction (sale) at a shop is an instance, and the set of all transactions is the population


Association Rules (Cont.)

Rules have an associated support, as well as an associated confidence.
Support is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule.
o E.g., suppose only 0.001 percent of all purchases include both milk and screwdrivers. The support for the rule milk ⇒ screwdrivers is low.
Confidence is a measure of how often the consequent is true when the antecedent is true.
o E.g., the rule bread ⇒ milk has a confidence of 80 percent if 80 percent of the purchases that include bread also include milk.
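Support and confidence follow directly from their definitions; a sketch over a handful of invented transactions:

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """How often the consequent holds when the antecedent does:
    count(antecedent ∪ consequent) / count(antecedent)."""
    both = sum((antecedent | consequent) <= t for t in transactions)
    ant = sum(antecedent <= t for t in transactions)
    return both / ant

transactions = [{"bread", "milk"}, {"bread", "milk", "eggs"},
                {"bread", "jam"}, {"milk"}, {"bread", "milk", "jam"}]
print(support(transactions, {"bread", "milk"}))       # → 0.6
print(confidence(transactions, {"bread"}, {"milk"}))  # → 0.75
```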


Finding Association Rules

We are generally only interested in association rules with reasonably high support (e.g., support of 2% or greater)
Naïve algorithm
1. Consider all possible sets of relevant items.
2. For each set find its support (i.e., count how many transactions purchase all items in the set).
   Large itemsets: sets with sufficiently high support
3. Use large itemsets to generate association rules.
   From itemset A generate the rule A - {b} ⇒ b for each b ∈ A.
     Support of rule = support(A).
     Confidence of rule = support(A) / support(A - {b})


Finding Support
Determine the support of itemsets via a single pass over the set of transactions
o Large itemsets: sets with a high count at the end of the pass
If memory is not enough to hold all counts for all itemsets, use multiple passes, considering only some itemsets in each pass.
Optimization: once an itemset is eliminated because its count (support) is too small, none of its supersets needs to be considered.
The a priori technique to find large itemsets:
o Pass 1: count the support of all sets with just 1 item. Eliminate those items with low support
o Pass i: candidates: every set of i items such that all its (i-1)-item subsets are large
  Count the support of all candidates
  Stop if there are no candidates
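The level-wise passes above can be sketched compactly: count this pass's candidates, keep the large ones, and build the next pass's candidates only from itemsets whose subsets all survived (the transaction data is invented for illustration):

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {large itemset: count} using level-wise a priori passes."""
    transactions = [frozenset(t) for t in transactions]
    current = {frozenset([i]) for t in transactions for i in t}  # pass-1 candidates
    large = {}
    k = 1
    while current:
        counts = {c: sum(c <= t for t in transactions) for c in current}
        survivors = {c for c, n in counts.items() if n >= min_count}
        large.update((c, counts[c]) for c in survivors)
        # Pass k+1 candidates: unions of survivors whose k-subsets are all large
        k += 1
        current = {a | b for a in survivors for b in survivors
                   if len(a | b) == k
                   and all(frozenset(s) in survivors for s in combinations(a | b, k - 1))}
    return large

transactions = [{"bread", "milk"}, {"bread", "milk", "eggs"},
                {"bread", "jam"}, {"milk", "jam"}]
result = apriori(transactions, min_count=2)
print(sorted(tuple(sorted(s)) for s in result))
# → [('bread',), ('bread', 'milk'), ('jam',), ('milk',)]
```

Note the pruning at work: {bread, jam} and {milk, jam} each appear only once, so no 3-item candidate is ever generated.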

Other Types of Associations

Basic association rules have several limitations
Deviations from the expected probability are more interesting
o E.g., if many people purchase bread, and many people purchase cereal, quite a few would be expected to purchase both
o We are interested in positive as well as negative correlations between sets of items
  Positive correlation: co-occurrence is higher than predicted
  Negative correlation: co-occurrence is lower than predicted
Sequence associations / correlations
o E.g., whenever bonds go up, stock prices go down within 2 days
Deviations from temporal patterns
o E.g., deviation from a steady growth
o E.g., sales of winter wear go down in summer
  Not surprising, part of a known pattern
  Look for deviation from the value predicted using past patterns

Clustering
Clustering: Intuitively, finding clusters of points in the
given data such that similar points lie in the same
cluster
Can be formalized using distance metrics in several
ways
o Group points into k sets (for a given k) such that the average distance of
points from the centroid of their assigned group is minimized
Centroid: point defined by taking average of coordinates in each
dimension.
o Another metric: minimize average distance between every pair of points
in a cluster

Has been studied extensively in statistics, but on


small data sets
o Data mining systems aim at clustering techniques that can handle very
large data sets
o E.g., the Birch clustering algorithm (more shortly)
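The first formalization above (group points into k sets minimizing distance to each group's centroid) is what the classic k-means algorithm approximates; k-means itself is our choice of illustration, sketched here for 1-D points:

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Minimal 1-D k-means: assign each point to the nearest centroid,
    recompute centroids as cluster means, repeat until stable."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sorted(centroids)

# Two well-separated groups: the centroids converge to the group means.
print(kmeans([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], k=2))  # → [2.0, 11.0]
```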


Hierarchical Clustering
Example from biological classification
o (the word "classification" here does not mean a prediction mechanism)

    chordata
      mammalia: leopards, humans
      reptilia: snakes, crocodiles

Other examples: Internet directory systems (e.g., Yahoo; more on this later)
Agglomerative clustering algorithms
o Build small clusters, then cluster small clusters into bigger clusters, and so on
Divisive clustering algorithms
o Start with all items in a single cluster, repeatedly refine (break) clusters into smaller ones


Clustering Algorithms
Clustering algorithms have been designed to handle very large datasets
E.g., the Birch algorithm
o Main idea: use an in-memory R-tree to store points that are being clustered
o Insert points one at a time into the R-tree, merging a new point with an existing cluster if it is less than some distance away
o If there are more leaf nodes than fit in memory, merge existing clusters that are close to each other
o At the end of the first pass we get a large number of clusters at the leaves of the R-tree
  Merge clusters to reduce the number of clusters


Collaborative Filtering
Goal: predict what movies/books/… a person may be interested in, on the basis of
o Past preferences of the person
o Other people with similar past preferences
o The preferences of such people for a new movie/book/…
One approach based on repeated clustering
o Cluster people on the basis of preferences for movies
o Then cluster movies on the basis of being liked by the same clusters of people
o Again cluster people based on their preferences for (the newly created clusters of) movies
o Repeat the above till equilibrium
The above problem is an instance of collaborative filtering, where users collaborate in the task of filtering information to find information of interest

Other Types of Mining

Text mining: application of data mining to textual documents
o cluster Web pages to find related pages
o cluster pages a user has visited to organize their visit history
o classify Web pages automatically into a Web directory
Data visualization systems help users examine large volumes of data and detect patterns visually
o Can visually encode large amounts of information on a single screen
o Humans are very good at detecting visual patterns

End of Chapter

Figure 20.01


Figure 20.02


Figure 20.03


Figure 20.05
