
Introduction to

Data Warehouse and Data Mining


IF5031
Manajemen Informasi B
Hira Laksmiwati / Saiful Akbar

Source:
1. Ankur Teredesai, Assistant Professor, Dept. of Computer Science, RIT.
2. CSEP 546, Pedro Domingos
3. Database System Concepts, 6th Ed., Silberschatz, Korth and Sudarshan

IF5031/Intro DWH-DM/Okt/2015

OBJECTIVES
Understand the concept and role of Data Warehouse (5W)
Understand the concept and role of Data Mining
Know the main Data Mining techniques
Understand the difference between DWH and DM
Understand the concepts of OLTP and OLAP


Topics

Decision Support Systems
Data Warehousing
Data Mining
Data Mining Techniques:
o Classification
o Association Rules
o Clustering
OLTP vs OLAP


Decision Support Systems

Decision-support systems are used to make business decisions, often based on data collected by on-line transaction-processing systems.
Examples of business decisions:
o What items to stock?
o What insurance premium to charge?
o To whom to send advertisements?
Examples of data used for making decisions:
o Retail sales transaction details
o Customer profiles (income, age, gender, etc.)


Decision-Support Systems: Overview

Data analysis tasks are simplified by specialized tools and SQL extensions
o Example tasks:
  For each product category and each region, what were the total sales in the last quarter, and how do they compare with the same quarter last year?
  As above, for each product category and each customer category
Statistical analysis packages (e.g., S++) can be interfaced with databases
o Statistical analysis is a large field, but not covered here
Data mining seeks to discover knowledge automatically, in the form of statistical rules and patterns, from large databases.
A data warehouse archives information gathered from multiple sources, and stores it under a unified schema, at a single site.
o Important for large businesses that generate data from multiple divisions, possibly at multiple sites
o Data may also be purchased externally


Data Warehousing
Data sources often store only current data, not
historical data
Corporate decision making requires a unified view
of all organizational data, including historical data
A data warehouse is a repository (archive) of
information gathered from multiple sources, stored
under a unified schema, at a single site
o Greatly simplifies querying, permits study of historical trends
o Shifts decision support query load away from transaction processing
systems



Data Warehouse: Subject-Oriented
Organized around major subjects, such as customer, product, sales.
Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.

Data Warehouse: Integrated
Constructed by integrating multiple, heterogeneous data sources
o relational databases, flat files, on-line transaction records
Data cleaning and data integration techniques are applied.
o Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
  E.g., hotel price: currency, tax, whether breakfast is covered, etc.
o When data is moved to the warehouse, it is converted.

Data Warehouse: Time-Variant
The time horizon for the data warehouse is significantly longer than that of operational systems.
o Operational database: current value data.
o Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
Every key structure in the data warehouse
o Contains an element of time, explicitly or implicitly
o But the key of operational data may or may not contain a time element.

Data Warehouse: Non-Volatile
A physically separate store of data transformed from the operational environment.
Operational updates of data do not occur in the data warehouse environment.
o Does not require transaction processing, recovery, or concurrency control mechanisms
o Requires only two operations in data accessing: initial loading of data and access of data.


Data Warehouse vs. Heterogeneous DBMS


Traditional heterogeneous DB integration:
o Build wrappers/mediators on top of heterogeneous
databases
o Query driven approach
When a query is posed to a client site, a meta-dictionary
is used to translate the query into queries appropriate for
individual heterogeneous sites involved, and the results
are integrated into a global answer set
Involves complex information filtering; queries compete with local processing for resources

Data warehouse: update-driven, high performance


o Information from heterogeneous sources is integrated in
advance and stored in warehouses for direct query and
analysis

Data Warehouse vs. Operational DBMS


OLTP (on-line transaction processing)
o Major task of traditional relational DBMS
o Day-to-day operations: purchasing, inventory, banking,
manufacturing, payroll, registration, accounting, etc.
OLAP (on-line analytical processing)
o Major task of data warehouse system
o Data analysis and decision making
Distinct features (OLTP vs. OLAP):
o User and system orientation: customer vs. market
o Data contents: current, detailed vs. historical, consolidated
o Database design: ER + application vs. star + subject
o View: current, local vs. evolutionary, integrated
o Access patterns: update vs. read-only but complex queries

OLTP vs. OLAP

Feature              OLTP                           OLAP
users                clerk, IT professional         knowledge worker
function             day-to-day operations          decision support
DB design            application-oriented           subject-oriented
data                 current, up-to-date;           historical, summarized,
                     detailed, flat relational;     multidimensional;
                     isolated                       integrated, consolidated
usage                repetitive                     ad-hoc
access               read/write;                    lots of scans
                     index/hash on prim. key
unit of work         short, simple transaction      complex query
# records accessed   tens                           millions
# users              thousands                      hundreds
DB size              100 MB - GB                    100 GB - TB
metric               transaction throughput         query throughput, response


Why Separate Data Warehouse?


High performance for both systems
o DBMS tuned for OLTP: access methods, indexing,
concurrency control, recovery
o Warehouse tuned for OLAP: complex OLAP queries,
multidimensional view, consolidation.
Different functions and different data:
o missing data: Decision support requires historical data
which operational DBs do not typically maintain
o data consolidation: DS requires consolidation
(aggregation, summarization) of data from
heterogeneous sources
o data quality: different sources typically use inconsistent
data representations, codes and formats which have
to be reconciled


Data Warehousing and OLAP


What is a data warehouse?
A multi-dimensional data model
Data warehouse architecture
Data warehouse implementation
Extensions of data cubes
From data warehousing to data mining

From Tables and Spreadsheets to Data Cubes

A data warehouse is based on a multidimensional data model which views data in the form of a data cube
A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
o Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year)
o Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables
In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.
IF5031/Intro DWH-DM/Okt/2015

17

Cube: A Lattice of Cuboids

(figure: lattice of cuboids for dimensions time, item, location, supplier)
0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid: time, item, location, supplier
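The lattice of cuboids is simply the power set of the dimension list: one cuboid per subset of dimensions to group by. A small sketch enumerating it (the function name cuboid_lattice is ours, for illustration):

```python
from itertools import combinations

def cuboid_lattice(dimensions):
    """Enumerate every cuboid (group-by subset) of a data cube,
    keyed by the number of dimensions it retains."""
    lattice = {}
    for k in range(len(dimensions) + 1):
        lattice[k] = list(combinations(dimensions, k))
    return lattice

lattice = cuboid_lattice(["time", "item", "location", "supplier"])
# 0-D apex cuboid is the empty group-by; 4-D base cuboid keeps all dimensions.
print([len(lattice[k]) for k in range(5)])  # → [1, 4, 6, 4, 1]
```

The counts 1, 4, 6, 4, 1 match the lattice on the slide: 2^4 = 16 cuboids in total for four dimensions.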


Conceptual Modeling of Data Warehouses

Modeling data warehouses: dimensions & measures
o Star schema: a fact table in the middle connected to a set of dimension tables
o Snowflake schema: a refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
o Fact constellation: multiple fact tables share dimension tables; viewed as a collection of stars, therefore also called a galaxy schema

Example of Star Schema

(figure: star schema for sales)
Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
Dimension tables:
o time (time_key, day, day_of_the_week, month, quarter, year)
o item (item_key, item_name, brand, type, supplier_type)
o branch (branch_key, branch_name, branch_type)
o location (location_key, street, city, province_or_state, country)
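The star schema above can be exercised directly in SQL; a minimal sketch using Python's built-in sqlite3, trimmed to two dimensions. The table contents are invented for illustration, and the time dimension is named time_dim here (our choice, to keep it visually distinct from SQL's time functions):

```python
import sqlite3

# Build a tiny star schema in memory: one fact table, two dimension tables.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT, type TEXT);
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim,
    item_key INTEGER REFERENCES item,
    units_sold INTEGER, dollars_sold REAL);
INSERT INTO time_dim VALUES (1, 'Q1', 2015), (2, 'Q2', 2015);
INSERT INTO item VALUES (10, 'laptop', 'electronics'), (11, 'novel', 'books');
INSERT INTO sales_fact VALUES (1, 10, 3, 3000.0), (1, 11, 5, 50.0), (2, 10, 2, 2000.0);
""")
# A typical OLAP-style query: total dollars sold per item type per quarter.
rows = con.execute("""
    SELECT t.quarter, i.type, SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN time_dim t ON f.time_key = t.time_key
    JOIN item i ON f.item_key = i.item_key
    GROUP BY t.quarter, i.type
    ORDER BY t.quarter, i.type
""").fetchall()
print(rows)  # → [('Q1', 'books', 50.0), ('Q1', 'electronics', 3000.0), ('Q2', 'electronics', 2000.0)]
```

Note how the fact table holds only keys and measures, while descriptive attributes live in the dimension tables, exactly as in the figure.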

Example of Snowflake Schema

(figure: snowflake schema for sales)
Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
Dimension tables (partially normalized):
o time (time_key, day, day_of_the_week, month, quarter, year)
o item (item_key, item_name, brand, type, supplier_key), with supplier (supplier_key, supplier_type)
o branch (branch_key, branch_name, branch_type)
o location (location_key, street, city_key), with city (city_key, city, province_or_state, country)

Example of Fact Constellation

(figure: fact constellation with sales and shipping fact tables)
Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped
Shared dimension tables:
o time (time_key, day, day_of_the_week, month, quarter, year)
o item (item_key, item_name, brand, type, supplier_type)
o branch (branch_key, branch_name, branch_type)
o location (location_key, street, city, province_or_state, country)
o shipper (shipper_key, shipper_name, location_key, shipper_type)

Design Issues
When and how to gather data
o Source-driven architecture: data sources transmit new information to the warehouse, either continuously or periodically (e.g., at night)
o Destination-driven architecture: the warehouse periodically requests new information from data sources
o Keeping the warehouse exactly synchronized with its data sources (e.g., using two-phase commit) is too expensive
  Usually OK to have slightly out-of-date data at the warehouse
  Data/updates are periodically downloaded from on-line transaction processing (OLTP) systems.
What schema to use
o Schema integration

More Warehouse Design Issues

Data cleansing
o E.g., correct mistakes in addresses (misspellings, zip code errors)
o Merge address lists from different sources and purge duplicates
How to propagate updates
o The warehouse schema may be a (materialized) view of schemas from data sources
What data to summarize
o Raw data may be too large to store on-line
o Aggregate values (totals/subtotals) often suffice
o Queries on raw data can often be transformed by the query optimizer to use aggregate values


Warehouse Schemas
Dimension values are usually encoded using small
integers and mapped to full values via dimension
tables
Resultant schema is called a star schema
o More complicated schema structures
Snowflake schema: multiple levels of dimension tables
Constellation: multiple fact tables


Data Warehouse Schema


Introduction to
DATA MINING


Data Mining: A KDD Process

Data mining is the core of the knowledge discovery process.

(figure: the KDD pipeline)
Databases, Data Cleaning, Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation

Steps of a KDD Process

1. Learning the application domain
   o relevant prior knowledge and goals of the application
2. Creating a target data set: data selection
3. Data cleaning and preprocessing (may take 60% of the effort!)
4. Data reduction and transformation
   o find useful features, dimensionality/variable reduction, invariant representation
5. Choosing the functions of data mining
   o summarization, classification, regression, association, clustering
6. Choosing the mining algorithm(s)
7. Data mining: search for patterns of interest
8. Pattern evaluation and knowledge presentation
   o visualization, transformation, removing redundant patterns, etc.
9. Use of discovered knowledge

Necessity Is the Mother of Invention

Data explosion problem
o Automated data collection tools and mature database technology lead to tremendous amounts of data accumulated and/or to be analyzed in databases, data warehouses, and other information repositories
We are drowning in data, but starving for knowledge!
Solution: data warehousing and data mining
o Data warehousing and on-line analytical processing
o Mining interesting knowledge (rules, regularities, patterns, constraints) from data in large databases

What Is Data Mining?

Data mining (knowledge discovery from data)
o Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
Alternative names
o Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
Watch out: is everything "data mining"?
o (Deductive) query processing
o Expert systems or small ML/statistical programs

Data Mining
Data mining is the process of semi-automatically
analyzing large databases to find useful patterns
Prediction based on past history
o Predict if a credit card applicant poses a good credit risk, based on some
attributes (income, job type, age, ..) and past history
o Predict if a pattern of phone calling card usage is likely to be fraudulent

Some examples of prediction mechanisms:


o Classification
Given a new item whose class is unknown, predict to which class it
belongs
o Regression formulae
Given a set of mappings for an unknown function, predict the function
result for a new parameter value


Data Mining (Cont.)


Descriptive Patterns
o Associations
Find books that are often bought by similar customers. If a new
such customer buys one such book, suggest the others too.
o Associations may be used as a first step in detecting causation
E.g., association between exposure to chemical X and cancer
o Clusters
E.g., typhoid cases were clustered in an area surrounding a
contaminated well
Detection of clusters remains important in detecting epidemics


Some Patterns
Association rules
o 98% of people who purchase diapers also buy
beer

Classification
o People with age less than 25 and salary > 40k
drive sports cars

Similar time sequences


o Stocks of companies A and B perform similarly

Outlier Detection
o Residential customers for telecom company with
businesses at home

Data Mining and Business Intelligence

(figure: the business intelligence pyramid; the potential to support business decisions increases toward the top)
Making Decisions (End User)
Data Presentation: Visualization Techniques (Business Analyst)
Data Mining: Information Discovery (Data Analyst)
Data Exploration: Statistical Analysis, Querying and Reporting (Data Analyst)
Data Warehouses / Data Marts: OLAP, MDA (DBA)
Data Sources: Paper, Files, Information Providers, Database Systems, OLTP

Architecture: Typical Data Mining System

(figure: layered architecture, top to bottom)
Graphical user interface
Pattern evaluation
Data mining engine
Knowledge base
Database or data warehouse server
Data cleaning, data integration, and filtering
Databases and Data Warehouse

Data Mining: On What Kinds of Data?

Relational database
Data warehouse
Transactional database
Advanced database and information repository
o Object-relational database
o Spatial and temporal data
o Time-series data
o Stream data
o Multimedia database
o Heterogeneous and legacy database
o Text databases & WWW


Data Mining Functionalities


Concept description: Characterization and discrimination
o Generalize, summarize, and contrast data characteristics, e.g.,
dry vs. wet regions

Association (correlation and causality)


o Diaper ⇒ Beer [support 0.5%, confidence 75%]

Classification and Prediction


o Construct models (functions) that describe and distinguish
classes or concepts for future prediction
E.g., classify countries based on climate, or classify cars based on
gas mileage

o Presentation: decision-tree, classification rule, neural network


o Predict some unknown or missing numerical values

Data Mining Functionalities (2)


Cluster analysis
o Class label is unknown: Group data to form new classes, e.g.,
cluster houses to find distribution patterns
o Maximizing intra-class similarity & minimizing interclass similarity
Outlier analysis
o Outlier: a data object that does not comply with the general
behavior of the data
o Noise or exception? No! useful in fraud detection, rare events
analysis
Trend and evolution analysis
o Trend and deviation: regression analysis
o Sequential pattern mining, periodicity analysis
o Similarity-based analysis
Other pattern-directed or statistical analyses

MORE ON DM PATTERNS
(for students to study on their own)


Classification Rules
Classification rules help assign new objects to classes.
o E.g., given a new automobile insurance applicant, should he or she be classified as low risk, medium risk or high risk?
Classification rules for the above example could use a variety of data, such as educational level, salary, age, etc.
o ∀ person P, P.degree = masters and P.income > 75,000 ⇒ P.credit = excellent
o ∀ person P, P.degree = bachelors and (P.income ≥ 25,000 and P.income ≤ 75,000) ⇒ P.credit = good
Rules are not necessarily exact: there may be some misclassifications
Classification rules can be shown compactly as a decision tree.
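The two rules above translate directly into code; a minimal sketch (the fallback class for applicants matched by neither rule is our assumption, since the slide notes rules need not cover every case):

```python
def classify_credit(degree, income):
    """Apply the two credit classification rules from the slide."""
    if degree == "masters" and income > 75_000:
        return "excellent"
    if degree == "bachelors" and 25_000 <= income <= 75_000:
        return "good"
    return "unknown"  # neither rule applies; rules are not necessarily exact

print(classify_credit("masters", 90_000))    # → excellent
print(classify_credit("bachelors", 40_000))  # → good
```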

Decision Tree


Construction of Decision Trees
Training set: a data sample in which the classification is already known.
Greedy top-down generation of decision trees.
o Each internal node of the tree partitions the data into groups, based on a partitioning attribute and a partitioning condition for the node
o Leaf node:
  all (or most) of the items at the node belong to the same class, or
  all attributes have been considered, and no further partitioning is possible.


Best Splits
Pick the best attributes and conditions on which to partition
The purity of a set S of training instances can be measured quantitatively in several ways.
o Notation: number of classes = k, number of instances = |S|, fraction of instances in class i = p_i.
The Gini measure of purity is defined as

    Gini(S) = 1 - Σ_{i=1..k} p_i^2

o When all instances are in a single class, the Gini value is 0
o It reaches its maximum (of 1 - 1/k) if each class has the same number of instances.
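The Gini measure is a one-liner over class counts; a sketch checking the two boundary cases just stated:

```python
from collections import Counter

def gini(labels):
    """Gini measure: Gini(S) = 1 - sum_i p_i^2, where p_i is the
    fraction of instances in class i."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["a", "a", "a"]))  # single class → 0.0
print(gini(["a", "b"]))       # two equal classes → 0.5, i.e. 1 - 1/k for k = 2
```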


Best Splits (Cont.)

Another measure of purity is the entropy measure, which is defined as

    entropy(S) = - Σ_{i=1..k} p_i log2 p_i

When a set S is split into multiple sets S_i, i = 1, 2, ..., r, we can measure the purity of the resultant set of sets as:

    purity(S_1, S_2, ..., S_r) = Σ_{i=1..r} (|S_i| / |S|) purity(S_i)

The information gain due to a particular split of S into S_i, i = 1, 2, ..., r:

    Information-gain(S, {S_1, S_2, ..., S_r}) = purity(S) - purity(S_1, S_2, ..., S_r)
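These three definitions compose naturally; a sketch using entropy as the purity measure:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """entropy(S) = - sum_i p_i log2 p_i (lower means purer)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def split_purity(subsets, purity=entropy):
    """Weighted purity of a split: sum_i (|S_i|/|S|) * purity(S_i)."""
    total = sum(len(s) for s in subsets)
    return sum(len(s) / total * purity(s) for s in subsets)

def information_gain(labels, subsets, purity=entropy):
    """purity(S) - purity(S_1, ..., S_r); larger means a more useful split."""
    return purity(labels) - split_purity(subsets, purity)

S = ["yes", "yes", "no", "no"]
# A perfect split separates the classes completely, so the gain
# equals entropy(S) = 1 bit.
print(information_gain(S, [["yes", "yes"], ["no", "no"]]))  # → 1.0
```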


Best Splits (Cont.)

Measure of the cost of a split:

    Information-content(S, {S_1, S_2, ..., S_r}) = - Σ_{i=1..r} (|S_i| / |S|) log2 (|S_i| / |S|)

    Information-gain ratio = Information-gain(S, {S_1, S_2, ..., S_r}) / Information-content(S, {S_1, S_2, ..., S_r})

The best split is the one that gives the maximum information gain ratio
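The information content depends only on the subset sizes, so it can be sketched separately; note how it penalizes splits into many small pieces, which is exactly why the ratio is used instead of the raw gain:

```python
from math import log2

def information_content(subset_sizes):
    """Cost of a split: - sum_i (|S_i|/|S|) log2(|S_i|/|S|)."""
    total = sum(subset_sizes)
    return -sum(s / total * log2(s / total) for s in subset_sizes)

def gain_ratio(info_gain, subset_sizes):
    """Information-gain ratio = gain / content."""
    return info_gain / information_content(subset_sizes)

# An even 2-way split of 8 instances costs 1 bit; an 8-way split costs 3 bits,
# so the same raw gain yields a 3x smaller ratio for the 8-way split.
print(information_content([4, 4]))   # → 1.0
print(information_content([1] * 8))  # → 3.0
print(gain_ratio(0.9, [4, 4]), gain_ratio(0.9, [1] * 8))
```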

Finding Best Splits

Categorical attributes (with no meaningful order):
o Multi-way split: one child for each value
o Binary split: try all possible breakups of values into two sets, and pick the best
Continuous-valued attributes (can be sorted in a meaningful order)
o Binary split:
  Sort values, try each as a split point
    E.g., if values are 1, 10, 15, 25, split at 1, 10, 15
  Pick the value that gives the best split
o Multi-way split:
  A series of binary splits on the same attribute has a roughly equivalent effect
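The binary-split procedure for a continuous attribute can be sketched directly: sort, try each value as a split point, and score each candidate, here with weighted Gini impurity (our choice of purity measure; the data values are invented for illustration):

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_binary_split(values, labels):
    """Try every sorted value v as a split point (value <= v goes left)
    and return (v, cost) with the lowest weighted Gini impurity."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for v, _ in pairs[:-1]:  # splitting at the largest value leaves nothing right
        left = [lab for val, lab in pairs if val <= v]
        right = [lab for val, lab in pairs if val > v]
        n = len(pairs)
        cost = len(left) / n * gini(left) + len(right) / n * gini(right)
        if cost < best[1]:
            best = (v, cost)
    return best

ages = [22, 24, 30, 45, 50]
classes = ["sports", "sports", "sedan", "sedan", "sedan"]
print(best_binary_split(ages, classes))  # → (24, 0.0): "age <= 24" separates the classes
```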


Decision-Tree Construction Algorithm

Procedure GrowTree(S)
    Partition(S);

Procedure Partition(S)
    if (purity(S) > δp or |S| < δs) then
        return;
    for each attribute A
        evaluate splits on attribute A;
    Use the best split found (across all attributes) to
        partition S into S1, S2, ..., Sr;
    for i = 1, 2, ..., r
        Partition(Si);
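A minimal recursive sketch of the Partition procedure, restricted to one numeric attribute and using Gini impurity as the (inverted) purity test; the stopping thresholds and tuple-based tree representation are our illustration choices:

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def partition(rows, max_impurity=0.0, min_size=2):
    """rows: list of (value, label). Recurse until pure enough or too small."""
    labels = [lab for _, lab in rows]
    if gini(labels) <= max_impurity or len(rows) < min_size:
        majority = Counter(labels).most_common(1)[0][0]
        return ("leaf", majority)
    # evaluate binary splits on the single attribute, keep the cheapest
    best = None
    for v, _ in sorted(rows)[:-1]:
        left = [r for r in rows if r[0] <= v]
        right = [r for r in rows if r[0] > v]
        cost = (len(left) * gini([l for _, l in left]) +
                len(right) * gini([l for _, l in right])) / len(rows)
        if best is None or cost < best[0]:
            best = (cost, v, left, right)
    _, v, left, right = best
    return ("node", v,
            partition(left, max_impurity, min_size),
            partition(right, max_impurity, min_size))

tree = partition([(22, "sports"), (24, "sports"), (30, "sedan"), (45, "sedan")])
print(tree)  # → ('node', 24, ('leaf', 'sports'), ('leaf', 'sedan'))
```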

Other Types of Classifiers

Neural-net classifiers are studied in artificial intelligence and are not covered here
Bayesian classifiers use Bayes' theorem, which says

    p(c_j | d) = p(d | c_j) p(c_j) / p(d)

where
    p(c_j | d) = probability of instance d being in class c_j,
    p(d | c_j) = probability of generating instance d given class c_j,
    p(c_j) = probability of occurrence of class c_j, and
    p(d) = probability of instance d occurring

Naïve Bayesian Classifiers

Bayesian classifiers require
o computation of p(d | c_j)
o precomputation of p(c_j)
o p(d) can be ignored since it is the same for all classes
To simplify the task, naïve Bayesian classifiers assume attributes have independent distributions, and thereby estimate

    p(d | c_j) = p(d_1 | c_j) * p(d_2 | c_j) * ... * p(d_n | c_j)

o Each of the p(d_i | c_j) can be estimated from a histogram on d_i values for each class c_j
  the histogram is computed from the training instances
o Histograms on multiple attributes are more expensive to compute and store
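A tiny naïve Bayesian classifier over categorical attributes, estimating each p(d_i | c_j) from per-class histograms as described. The +1 Laplace smoothing is our added assumption (to avoid zero probabilities for unseen values), and the training data is invented:

```python
from collections import Counter, defaultdict

def train(rows):
    """rows: list of (attribute_tuple, class_label). Build p(c_j) counts
    and per-(attribute, class) value histograms."""
    class_counts = Counter(label for _, label in rows)
    hist = defaultdict(Counter)  # (attr_index, class) -> value counts
    for attrs, label in rows:
        for i, v in enumerate(attrs):
            hist[(i, label)][v] += 1
    return class_counts, hist

def classify(attrs, class_counts, hist):
    """Pick the class maximizing p(c_j) * product of smoothed p(d_i | c_j)."""
    total = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for c, n_c in class_counts.items():
        score = n_c / total  # p(c_j)
        for i, v in enumerate(attrs):
            counts = hist[(i, c)]
            score *= (counts[v] + 1) / (n_c + len(counts) + 1)  # smoothed p(d_i | c_j)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

rows = [(("high", "young"), "sports"), (("high", "young"), "sports"),
        (("low", "old"), "sedan"), (("low", "old"), "sedan")]
model = train(rows)
print(classify(("high", "young"), *model))  # → sports
```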


Regression
Regression deals with the prediction of a value, rather than a class.
o Given values for a set of variables, X1, X2, ..., Xn, we wish to predict the value of a variable Y.
One way is to infer coefficients a0, a1, a2, ..., an such that

    Y = a0 + a1 * X1 + a2 * X2 + ... + an * Xn

Finding such a linear polynomial is called linear regression.
o In general, the process of finding a curve that fits the data is also called curve fitting.
The fit may only be approximate
o because of noise in the data, or
o because the relationship is not exactly a polynomial
Regression aims to find coefficients that give the best possible fit.
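For the one-variable case Y = a0 + a1 * X, the least-squares coefficients have a closed form; a sketch (multi-variable fits are usually delegated to a linear algebra library instead):

```python
def linear_regression(xs, ys):
    """Closed-form least-squares fit of y = a0 + a1 * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    a1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    a0 = mean_y - a1 * mean_x
    return a0, a1

# Noise-free data on the line Y = 2 + 3X recovers the coefficients exactly.
a0, a1 = linear_regression([0, 1, 2, 3], [2, 5, 8, 11])
print(a0, a1)  # → 2.0 3.0
```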


Association Rules
Retail shops are often interested in associations between different items that people buy.
o Someone who buys bread is quite likely also to buy milk
o A person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts.
Association information can be used in several ways.
o E.g., when a customer buys a particular book, an online shop may suggest associated books.
Association rules:
    bread ⇒ milk
    DB-Concepts, OS-Concepts ⇒ Networks
o Left-hand side: antecedent; right-hand side: consequent
o An association rule must have an associated population; the population consists of a set of instances
  E.g., each transaction (sale) at a shop is an instance, and the set of all transactions is the population


Association Rules (Cont.)

Rules have an associated support, as well as an associated confidence.
Support is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule.
o E.g., suppose only 0.001 percent of all purchases include both milk and screwdrivers. The support for the rule milk ⇒ screwdrivers is low.
Confidence is a measure of how often the consequent is true when the antecedent is true.
o E.g., the rule bread ⇒ milk has a confidence of 80 percent if 80 percent of the purchases that include bread also include milk.
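Support and confidence follow directly from their definitions; a sketch over a handful of invented transactions:

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """How often the consequent holds when the antecedent does:
    count(antecedent ∪ consequent) / count(antecedent)."""
    both = sum((antecedent | consequent) <= t for t in transactions)
    ant = sum(antecedent <= t for t in transactions)
    return both / ant

transactions = [{"bread", "milk"}, {"bread", "milk", "eggs"},
                {"bread", "jam"}, {"milk"}, {"bread", "milk", "jam"}]
print(support(transactions, {"bread", "milk"}))       # → 0.6
print(confidence(transactions, {"bread"}, {"milk"}))  # → 0.75
```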


Finding Association Rules

We are generally only interested in association rules with reasonably high support (e.g., support of 2% or greater)
Naïve algorithm
1. Consider all possible sets of relevant items.
2. For each set find its support (i.e., count how many transactions purchase all items in the set).
   Large itemsets: sets with sufficiently high support
3. Use large itemsets to generate association rules.
   From itemset A generate the rule A - {b} ⇒ b for each b ∈ A.
     Support of rule = support(A).
     Confidence of rule = support(A) / support(A - {b})


Finding Support
Determine the support of itemsets via a single pass over the set of transactions
o Large itemsets: sets with a high count at the end of the pass
If memory is not enough to hold all counts for all itemsets, use multiple passes, considering only some itemsets in each pass.
Optimization: once an itemset is eliminated because its count (support) is too small, none of its supersets needs to be considered.
The a priori technique to find large itemsets:
o Pass 1: count the support of all sets with just 1 item. Eliminate those items with low support
o Pass i: candidates: every set of i items such that all its (i-1)-item subsets are large
  Count the support of all candidates
  Stop if there are no candidates
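The level-wise passes above can be sketched compactly: count this pass's candidates, keep the large ones, and build the next pass's candidates only from itemsets whose subsets all survived (the transaction data is invented for illustration):

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {large itemset: count} using level-wise a priori passes."""
    transactions = [frozenset(t) for t in transactions]
    current = {frozenset([i]) for t in transactions for i in t}  # pass-1 candidates
    large = {}
    k = 1
    while current:
        counts = {c: sum(c <= t for t in transactions) for c in current}
        survivors = {c for c, n in counts.items() if n >= min_count}
        large.update((c, counts[c]) for c in survivors)
        # Pass k+1 candidates: unions of survivors whose k-subsets are all large
        k += 1
        current = {a | b for a in survivors for b in survivors
                   if len(a | b) == k
                   and all(frozenset(s) in survivors for s in combinations(a | b, k - 1))}
    return large

transactions = [{"bread", "milk"}, {"bread", "milk", "eggs"},
                {"bread", "jam"}, {"milk", "jam"}]
result = apriori(transactions, min_count=2)
print(sorted(tuple(sorted(s)) for s in result))
# → [('bread',), ('bread', 'milk'), ('jam',), ('milk',)]
```

Note the pruning at work: {bread, jam} and {milk, jam} each appear only once, so no 3-item candidate is ever generated.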

Other Types of Associations

Basic association rules have several limitations
Deviations from the expected probability are more interesting
o E.g., if many people purchase bread, and many people purchase cereal, quite a few would be expected to purchase both
o We are interested in positive as well as negative correlations between sets of items
  Positive correlation: co-occurrence is higher than predicted
  Negative correlation: co-occurrence is lower than predicted
Sequence associations / correlations
o E.g., whenever bonds go up, stock prices go down within 2 days
Deviations from temporal patterns
o E.g., deviation from a steady growth
o E.g., sales of winter wear go down in summer
  Not surprising, part of a known pattern
  Look for deviation from the value predicted using past patterns

Clustering
Clustering: Intuitively, finding clusters of points in the
given data such that similar points lie in the same
cluster
Can be formalized using distance metrics in several
ways
o Group points into k sets (for a given k) such that the average distance of
points from the centroid of their assigned group is minimized
Centroid: point defined by taking average of coordinates in each
dimension.
o Another metric: minimize average distance between every pair of points
in a cluster

Has been studied extensively in statistics, but on


small data sets
o Data mining systems aim at clustering techniques that can handle very
large data sets
o E.g., the Birch clustering algorithm (more shortly)
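The first formalization above (group points into k sets minimizing distance to each group's centroid) is what the classic k-means algorithm approximates; k-means itself is our choice of illustration, sketched here for 1-D points:

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Minimal 1-D k-means: assign each point to the nearest centroid,
    recompute centroids as cluster means, repeat until stable."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sorted(centroids)

# Two well-separated groups: the centroids converge to the group means.
print(kmeans([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], k=2))  # → [2.0, 11.0]
```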


Hierarchical Clustering
Example from biological classification
o (the word "classification" here does not mean a prediction mechanism)

    chordata
      mammalia: leopards, humans
      reptilia: snakes, crocodiles

Other examples: Internet directory systems (e.g., Yahoo; more on this later)
Agglomerative clustering algorithms
o Build small clusters, then cluster small clusters into bigger clusters, and so on
Divisive clustering algorithms
o Start with all items in a single cluster, repeatedly refine (break) clusters into smaller ones


Clustering Algorithms
Clustering algorithms have been designed to handle very large datasets
E.g., the Birch algorithm
o Main idea: use an in-memory R-tree to store points that are being clustered
o Insert points one at a time into the R-tree, merging a new point with an existing cluster if it is less than some distance away
o If there are more leaf nodes than fit in memory, merge existing clusters that are close to each other
o At the end of the first pass we get a large number of clusters at the leaves of the R-tree
  Merge clusters to reduce the number of clusters


Collaborative Filtering
Goal: predict what movies/books/… a person may be interested in, on the basis of
o Past preferences of the person
o Other people with similar past preferences
o The preferences of such people for a new movie/book/…
One approach based on repeated clustering
o Cluster people on the basis of preferences for movies
o Then cluster movies on the basis of being liked by the same clusters of people
o Again cluster people based on their preferences for (the newly created clusters of) movies
o Repeat the above till equilibrium
The above problem is an instance of collaborative filtering, where users collaborate in the task of filtering information to find information of interest

Other Types of Mining

Text mining: application of data mining to textual documents
o cluster Web pages to find related pages
o cluster pages a user has visited to organize their visit history
o classify Web pages automatically into a Web directory
Data visualization systems help users examine large volumes of data and detect patterns visually
o Can visually encode large amounts of information on a single screen
o Humans are very good at detecting visual patterns

End of Chapter

Figure 20.01


Figure 20.02


Figure 20.03


Figure 20.05
