
DATA MINING ALGORITHMS

SUSHIL KULKARNI

INTENTIONS

Define the classification problem using a mapping and illustrate it with examples.
What are the different techniques to classify data into classes?
List the approaches to classification.
What are common methods to define classes? Give suitable examples.
What are the different issues faced in classifying data?


CLASSIFICATION PROBLEM
Given a database D = {t1, t2, ..., tn} and a set of
classes C = {C1, ..., Cm}, the Classification
Problem is to define a mapping f: D → C
where each ti is assigned to one class.
The mapping actually divides D into equivalence
classes.
Prediction is similar, but may be viewed as
having an infinite number of classes.

CLASSIFICATION EXAMPLES
Teachers classify students' grades as A,
B, C, D, or F.
Identify mushrooms as poisonous or
edible.
Predict when a river will flood.
Identify individuals who are credit risks.
Speech recognition
Pattern recognition

CLASSIFICATION EXAMPLE: MARKS
If x >= 90 then grade = A.
If 80 <= x < 90 then grade = B.
If 70 <= x < 80 then grade = C.
If 60 <= x < 70 then grade = D.
If x < 60 then grade = F.
[Figure: decision tree on marks x with splits at 90, 80, 70, and 60 leading to grades A, B, C, D, and F.]
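The rules above are simply a mapping from marks to grade classes. A minimal sketch in Python (thresholds taken from the slide; the function name is illustrative):

# Grade classifier: a mapping f from a mark x to a grade class.
def grade(x):
    if x >= 90:
        return "A"
    elif x >= 80:
        return "B"
    elif x >= 70:
        return "C"
    elif x >= 60:
        return "D"
    else:
        return "F"

print([grade(x) for x in [95, 83, 72, 65, 40]])  # ['A', 'B', 'C', 'D', 'F']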

CLASSIFICATION EXAMPLE: LETTER RECOGNITION
View letters as constructed from 5 components.
[Figure: letters A to F shown as combinations of the 5 components.]


CLASSIFICATION TECHNIQUES
Regression
Distance
Decision Trees
Rules
Neural Networks

CLASSIFICATION TECHNIQUES
Approach:
1. Create a specific model by evaluating the
training data (or by using domain experts'
knowledge).
2. Apply the model developed to new data.

CLASSIFICATION TECHNIQUES
Classes must be predefined.
The most common techniques use decision
trees, neural networks, or are based on
distances or statistical methods.

DEFINE CLASSES
Distance Based

Partitioning Based

ISSUES IN CLASSIFICATION
Missing Data
1. Ignore
2. Replace with an assumed value
Measuring Performance
1. Classification accuracy on test data
2. Confusion matrix
3. OC Curve

INTENTIONS

How can performance be measured when classifying data?
Explain the Operating Characteristic curve.
Define the confusion matrix.
How is regression used to classify data?
What are the two different approaches to classification using regression?
How is correlation used in the classification of data?
What is Bayes' theorem? Explain with an example.

PERFORMANCE MEASURE

HEIGHT EXAMPLE DATA
[Table of the height example training data not reproduced in this extract.]

MEASURING PERFORMANCE IN
CLASSIFICATION
Cj is a specific class and ti is a database
tuple. The tuple may or may not be assigned to
that class, while its actual membership may or
may not be in that class. This gives four
cases, as shown below:

MEASURING PERFORMANCE IN
CLASSIFICATION

1. True Positive: ti predicted to be in Cj and is
actually in it.
2. False Positive: ti predicted to be in Cj but
is not actually in it.
3. True Negative: ti not predicted to be in Cj
and is not actually in it.
4. False Negative: ti not predicted to be in
Cj but is actually in it.

CLASSIFICATION PERFORMANCE

                        Predicted in Cj     Not predicted in Cj
Actually in Cj          True Positive       False Negative
Not actually in Cj      False Positive      True Negative

OPERATING CHARACTERISTIC
CURVE
It shows the relationship between false
positives and true positives.
The OC curve was originally used to examine
false alarm rates.

OPERATING CHARACTERISTIC
CURVE
[Figure: OC curve plotting the true positive rate against the false positive rate.]

CONFUSION MATRIX
It illustrates the accuracy of the solution to a
classification problem.
Definition:
Given m classes, a confusion matrix is an m
by m matrix where entry (i, j) indicates the
number of tuples from D that were assigned
to class Cj but whose correct class is Ci.
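As a sketch of this definition, the snippet below builds an m by m confusion matrix from (correct, assigned) class pairs and reads accuracy off its diagonal; the class names and pairs are made up for illustration, not the deck's height data.

# Build a confusion matrix: entry (i, j) counts tuples whose correct class
# is Ci but which were assigned to Cj.
from collections import Counter

classes = ["Short", "Medium", "Tall"]                 # hypothetical classes
pairs = [("Tall", "Tall"), ("Medium", "Short"), ("Short", "Short"),
         ("Medium", "Medium"), ("Tall", "Medium")]    # (correct, assigned)

counts = Counter(pairs)
matrix = [[counts[(ci, cj)] for cj in classes] for ci in classes]

accuracy = sum(matrix[i][i] for i in range(len(classes))) / len(pairs)
for ci, row in zip(classes, matrix):
    print(ci, row)
print("accuracy:", accuracy)   # 0.6 on this made-up data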

CONFUSION MATRIX EXAMPLE
Using the height data example, with Output 1 as
the correct class and Output 2 as the actual
assignment.
[Table: confusion matrix for the height example not reproduced in this extract.]

STATISTICAL BASED ALGORITHMS

REGRESSION
Assume the data fits a predefined function.
Determine the best values for the regression
coefficients c0, c1, ..., cn.
Linear Regression:
y = c0 + c1x1 + ... + cnxn
Assuming an error term: y = c0 + c1x1 + ... + cnxn + e
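A minimal sketch of estimating the linear regression coefficients by least squares, using numpy; the small data set below is made up for illustration.

# Fit y = c0 + c1*x by least squares.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])      # roughly y = 2x + e

X = np.column_stack([np.ones_like(x), x])    # design matrix [1, x]
c, *_ = np.linalg.lstsq(X, y, rcond=None)    # c = [c0, c1]
print("c0 = %.2f, c1 = %.2f" % (c[0], c[1]))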

Linear Regression
[Figure: fitted regression line through the data points.]

LINEAR REGRESSION: Poor Fit
[Figure: example where a single regression line fits the data poorly.]

CLASSIFICATION USING
REGRESSION
Division: use the regression function to
divide the area into regions.
Prediction: use the regression function to
predict a class membership function.
The input includes the desired class.

DIVISION
[Figure: regression function dividing the area into class regions.]

PREDICTION
[Figure: regression functions used as class membership functions.]

CORRELATION
Examine the degree to which the
values for two variables behave
similarly.
Correlation coefficient r:
1 = perfect correlation
-1 = perfect but opposite correlation
0 = no correlation

BAYES THEOREM
Posterior Probability: P(h1 | xi)
Prior Probability: P(h1)
Bayes Theorem:
P(h1 | xi) = P(xi | h1) P(h1) / P(xi)
Assign probabilities to hypotheses given a
data value.

BAYES THEOREM EXAMPLE

Credit authorizations (hypotheses):
h1 = authorize purchase, h2 = authorize after
further identification, h3 = do not authorize,
h4 = do not authorize but contact police
Assign twelve data values for all
combinations of credit and income.
From the training data: P(h1) = 60%;
P(h2) = 20%; P(h3) = 10%; P(h4) = 10%.

Bayes Example (contd)
Training Data:
[Table: training data for the credit authorization example not reproduced in this extract.]
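Since the training-data table is not reproduced here, the sketch below simply applies Bayes' theorem using the priors quoted above together with assumed likelihoods P(x | h); the likelihood numbers are illustrative only.

# Posterior P(h | x) = P(x | h) P(h) / P(x) for each credit hypothesis.
priors = {"h1": 0.60, "h2": 0.20, "h3": 0.10, "h4": 0.10}     # P(h), from the slide
likelihood = {"h1": 0.30, "h2": 0.40, "h3": 0.10, "h4": 0.05} # P(x | h), assumed

evidence = sum(likelihood[h] * priors[h] for h in priors)      # P(x)
posterior = {h: likelihood[h] * priors[h] / evidence for h in priors}
print(posterior)   # posteriors sum to 1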

INTENTIONS

Explain different distance based algorithms.
Explain similarity measures between data using distances.
How is distance useful in the classification of data?
Explain KNN in detail.

DISTANCE BASED ALGORITHM

SIMILARITY MEASURES
Determine the similarity between two objects.
Alternatively, distance measures measure
how unlike or dissimilar objects are.

SIMILARITY MEASURES
Similarity characteristics:
Sim(ti, ti) = 1 : complete similarity
Sim(ti, tj) = 0 : no similarity

SIMILARITY MEASURES
[Formulas for the common similarity measures not reproduced in this extract.]

CLASSIFICATION USING
DISTANCE
Place items in the class to which they are
closest.

CLASSIFICATION USING
DISTANCE
Must determine distance between an item
and a class.


DISTANCE MEASURES
Measure dissimilarity between objects.
[Distance formulas shown on this slide are not reproduced in this extract; a sketch of two standard measures follows below.]
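A sketch of two standard distance measures, Euclidean and Manhattan; these are the usual definitions and may differ from the formulas on the original slide.

# Dissimilarity between two numeric tuples.
import math

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def manhattan(a, b):
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

print(euclidean((1, 2), (4, 6)))   # 5.0
print(manhattan((1, 2), (4, 6)))   # 7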

CLASSIFICATION USING
DISTANCE
Classes are represented by:
1. Centroid: central value.
2. Medoid: representative point.
3. Individual points.
Algorithm: KNN

K-NEAREST NEIGHBOUR (KNN)

Training set includes the classes.
Examine the K items nearest to the item to be
classified.
The new item is placed in the class containing
the most of these close items.
O(q) for each tuple to be classified.
(Here q is the size of the training set.)

KNN
[Figure: a query point and its K nearest training items.]

KNN ALGORITHM
[Pseudocode for the KNN algorithm not reproduced in this extract; a sketch follows below.]
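A minimal KNN sketch along the lines described above: the K nearest training tuples vote on the class of the new item. The tiny training set and the use of Euclidean distance are assumptions for illustration.

# KNN: place the new item in the class with the most of its K nearest items.
import math
from collections import Counter

def knn_classify(training, item, k=3):
    """training: list of (tuple, class); item: tuple to classify."""
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    nearest = sorted(training, key=lambda tc: dist(tc[0], item))[:k]
    votes = Counter(c for _, c in nearest)
    return votes.most_common(1)[0][0]

train = [((1.5,), "Short"), ((1.6,), "Short"), ((1.8,), "Tall"),
         ((1.9,), "Tall"), ((1.7,), "Medium")]
print(knn_classify(train, (1.85,), k=3))   # 'Tall'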


DECISION TREE
A tree where the root and each internal
node is labeled with a question.
The arcs represent the possible
answers to the associated question.
Each leaf node represents a
prediction of a solution to the
problem.

DECISION TREE
A popular technique for classification.
A leaf node indicates the class to which the
corresponding tuple belongs.

DECISION TREE: Example
[Figure: example decision tree not reproduced in this extract.]

DECISION TREE
Given:
D = {t1, ..., tn} where ti = <ti1, ..., tih>
Database schema contains
{A1, A2, ..., Ah}
Classes C = {C1, ..., Cm}

DECISION TREE
A Decision or Classification Tree is a tree
associated with D such that:
Each internal node is labeled with an
attribute, Ai
Each arc is labeled with a predicate
which can be applied to the attribute at its
parent
Each leaf node is labeled with a class,
Cj

DECISION TREE MODEL

A Decision Tree Model is a
computational model consisting of
three parts:
1. The decision tree
2. An algorithm to create the tree
3. An algorithm that applies the tree to
data

DECISION TREE MODEL

Creation of the tree is the most
difficult part.
Processing is basically a search
similar to that in a binary search tree
(although a DT need not be binary).

Decision Tree Algorithm
[Pseudocode for applying a decision tree to a tuple not reproduced in this extract; a sketch follows below.]
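A sketch of applying a decision tree to a tuple by following arc predicates from the root to a leaf. The tree encoding below is an assumption for illustration, not the deck's notation.

# Internal node: (attribute, [(predicate, child), ...]); leaf: a class label.
def classify(node, tuple_):
    if isinstance(node, str):                 # leaf reached: return its class
        return node
    attribute, branches = node
    value = tuple_[attribute]
    for predicate, child in branches:
        if predicate(value):                  # follow the arc whose predicate holds
            return classify(child, tuple_)
    return None                               # no arc applies

# Grade tree from the earlier marks example (branches checked in order).
tree = ("marks", [(lambda x: x >= 90, "A"),
                  (lambda x: x >= 80, "B"),
                  (lambda x: x >= 70, "C"),
                  (lambda x: x >= 60, "D"),
                  (lambda x: x < 60,  "F")])
print(classify(tree, {"marks": 75}))          # C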

DECISION TREE: ADVANTAGES
Easy to understand.
Easy to generate rules.

DECISION TREE: DISADVANTAGES
May suffer from overfitting.
Classifies by rectangular partitioning.
Does not easily handle nonnumeric
data.
Can be quite large, so pruning is
necessary.

CLASSIFICATION USING
DECISION TREE
Partitioning based: divide the search
space into rectangular regions.
A tuple is placed into a class based on the
region within which it falls.
DT approaches differ in how the tree is
built: DT Induction.

CLASSIFICATION USING
DECISION TREE
Internal nodes are associated with attributes
and arcs with values for those attributes.
Algorithms: ID3, C4.5, CART

DT INDUCTION
[Pseudocode for decision tree induction not reproduced in this extract.]

DT SPLIT AREA
[Figure: search space split on Gender (M / F) and then on Height into rectangular class regions.]

COMPARING DTs
[Figure: two decision trees for the same data, one balanced and one deep.]

DT ISSUES
Choosing Splitting Attributes
Ordering of Splitting Attributes
Splits
Tree Structure
Stopping Criteria
Training Data

Decision Tree Induction is often based
on Information Theory.

INFORMATION
[Figure: bowls of marbles illustrating how much information a grouping into classes provides.]

DT INDUCTION
When all the marbles in the bowl are
mixed up, little information is given.
When the marbles in the bowl are all
from one class, and those in the other
two classes are on either side, more
information is given.
Use this approach with DT Induction!
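DT induction algorithms such as ID3 make this intuition concrete with entropy: a pure bowl has entropy 0, a mixed one has high entropy, and the attribute whose split reduces entropy the most is chosen. A small sketch (the marble colours are made up):

# Entropy of a set of class labels: -sum p * log2(p).
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

mixed = ["red", "blue", "white", "red", "blue", "white"]
pure  = ["red", "red", "red", "red"]
print(entropy(mixed))   # about 1.58 bits: the grouping gives little information
print(entropy(pure))    # 0 bits (prints as -0.0): the class is known exactly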

ARTIFICIAL NEURAL
NETWORK (ANN)
An ANN is an information processing
paradigm that is inspired by the way the
brain processes information.
It is composed of a large number of highly
interconnected processing elements
called neurones.
ANNs, like people, learn by example.

ARTIFICIAL NEURAL
NETWORK (ANN)
Learning in biological systems involves
adjustments to the synaptic connections
that exist between the neurones.
This is true of ANNs as well.


HOW DOES THE HUMAN BRAIN LEARN?
[Figure: a biological neuron with dendrites, axon, and synapses.]

HOW DOES THE HUMAN BRAIN LEARN?

In the human brain, a typical neuron
collects signals from others through a
host of fine structures called dendrites.
The neuron sends out spikes of electrical
activity through a long, thin strand known
as an axon, which splits into thousands
of branches.

HOW DOES THE HUMAN BRAIN LEARN?

At the end of each branch, a structure
called a synapse converts the activity from
the axon into electrical effects that inhibit
or excite activity in the connected neurones.
Learning occurs by changing the
effectiveness of the synapses so that the
influence of one neuron on another
changes.

A SIMPLE NEURON
[Figure: a simple artificial neuron combining weighted inputs through an activation function.]

NEURAL NETWORKS
Based on the observed functioning of the
human brain (Artificial Neural Networks, ANN).
The first artificial neuron was produced
in 1943 by the neurophysiologist Warren
McCulloch and the logician Walter Pitts.

NEURAL NETWORKS
Our view of neural networks is very
simplistic.
We view a neural network (NN) from a
graphical viewpoint.
Used in pattern recognition, speech
recognition, computer vision, and
classification.

NEURAL NETWORKS: Example
[Figure: example neural network graph not reproduced in this extract.]

NEURAL NETWORKS
A NN is a directed graph F = <V, A> with vertices
V = {1, 2, ..., n} and arcs A = {<i, j> | 1 <= i, j <= n},
with the following restrictions:
V is partitioned into a set of input nodes VI,
hidden nodes VH, and output nodes VO.
The vertices are also partitioned into
layers.

NEURAL NETWORKS
Any arc <i,j> must have node i in layer
h-1 and node j in layer h.
Arc <i,j> is labeled with a numeric value
wij.
Node i is labeled with a function fi.


NN NODE
[Figure: a single NN node applying its function fi to the weighted inputs.]

NEURAL NETWORK MODEL

It is a computational model consisting of
three parts:
1. The neural network graph
2. A learning algorithm that indicates
how learning takes place.
3. Recall techniques that determine
how information is obtained from the
network.

NN : Advantages
Learning
Can continue learning even after training
set has been applied.
Easy parallelization
Solves many problems


NN: Disadvantages
Difficult to understand
May suffer from overfitting
Structure of graph must be determined a
priori.
Input values must be numeric.
Verification difficult.


CLASSIFICATION USING
NEURAL NETWORKS
Typical NN structure for classification:
1. One output node per class
2. Output value is the class membership
function value
Supervised learning

CLASSIFICATION USING
NEURAL NETWORKS
For each tuple in the training set, propagate it
through the NN. Adjust the weights on the edges
to improve future classification.
Algorithms: Propagation, Backpropagation,
Gradient Descent

NN ISSUES
Number of source nodes
Number of hidden layers
Training data
Number of sinks
Interconnections

NN ISSUES
Weights
Activation Functions
Learning Technique
When to stop learning

DECISION TREE VS. NEURAL NETWORK
[Figure: the same classification task modeled as a decision tree and as a neural network.]

PROPAGATION
[Figure: a tuple's input values propagated through the network to produce the output.]

NN PROPAGATION ALGORITHM
[Pseudocode for the propagation algorithm not reproduced in this extract; a sketch follows below.]
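A sketch of forward propagation: the tuple's input values are pushed layer by layer through the weighted arcs and each node applies its activation function. The sigmoid activation and the small weight matrices are assumptions for illustration, not the deck's worked example.

# Propagate an input tuple through successive layers of weighted nodes.
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def propagate(inputs, layers):
    """layers: list of weight matrices, one row of weights per node."""
    values = inputs
    for weights in layers:
        values = [sigmoid(sum(w * v for w, v in zip(node_w, values)))
                  for node_w in weights]
    return values

hidden = [[0.2, -0.1], [0.4, 0.3]]   # 2 hidden nodes, 2 inputs each
output = [[0.7, -0.5]]               # 1 output node fed by the 2 hidden nodes
print(propagate([1.0, 0.5], [hidden, output]))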

EXAMPLE: PROPAGATION
[Worked propagation example not reproduced in this extract.]

RULES

CLASSIFICATION USING RULES

Perform classification using If-Then
rules.
Classification Rule: r = <a, c> with
antecedent a and consequent c.
Rules may be generated from other
techniques (DT, NN) or generated
directly.
Algorithms: Gen, RX, 1R, PRISM

GENERATING RULES FROM DTs
[Algorithm for generating rules from a decision tree not reproduced in this extract; a sketch follows below.]
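A sketch of the usual approach, one If-Then rule per root-to-leaf path of the tree; the tree encoding and the grade example are illustrative assumptions, not the deck's own notation.

# Internal node: (attribute, [(condition text, child), ...]); leaf: a class label.
def tree_to_rules(node, antecedent=()):
    if isinstance(node, str):                      # leaf: emit one rule <a, c>
        return [(" and ".join(antecedent) or "true", node)]
    attribute, branches = node
    rules = []
    for condition, child in branches:
        rules += tree_to_rules(child, antecedent + (f"{attribute} {condition}",))
    return rules

tree = ("marks", [(">= 90", "A"), ("80-89", "B"), ("70-79", "C"),
                  ("60-69", "D"), ("< 60", "F")])
for a, c in tree_to_rules(tree):
    print(f"If {a} then grade = {c}")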

GENERATING RULES EXAMPLE
[Worked example of rules generated from a decision tree not reproduced in this extract.]

GENERATING RULES FROM NNs
[Algorithm for extracting rules from a trained NN not reproduced in this extract.]

1R ALGORITHM
[Pseudocode for the 1R algorithm not reproduced in this extract; a sketch follows below.]

1R EXAMPLE
[Worked 1R example not reproduced in this extract.]
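A sketch of 1R: for each attribute, map each of its values to the most frequent class, then keep the attribute whose rules make the fewest training errors. The tiny data set is made up for illustration.

# 1R: one-attribute rules; choose the attribute with the fewest errors.
from collections import Counter, defaultdict

def one_r(rows, class_attr):
    best = None
    for attr in rows[0]:
        if attr == class_attr:
            continue
        by_value = defaultdict(Counter)
        for row in rows:
            by_value[row[attr]][row[class_attr]] += 1
        rules = {v: counts.most_common(1)[0][0] for v, counts in by_value.items()}
        errors = sum(1 for row in rows if rules[row[attr]] != row[class_attr])
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best

data = [{"outlook": "sunny", "windy": "no",  "play": "no"},
        {"outlook": "sunny", "windy": "yes", "play": "no"},
        {"outlook": "rainy", "windy": "no",  "play": "yes"},
        {"outlook": "rainy", "windy": "yes", "play": "no"},
        {"outlook": "overcast", "windy": "no", "play": "yes"}]
print(one_r(data, "play"))   # (best attribute, its rules, error count)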

PRISM ALGORITHM
[Pseudocode for the PRISM algorithm not reproduced in this extract.]

PRISM EXAMPLE
[Worked PRISM example not reproduced in this extract.]

DECISION TREE VS. RULES

Decision Tree:
A tree has an implied order in which
splitting is performed.
A tree is created by looking at all
classes.

Rules:
Rules have no ordering of predicates.
Only one class needs to be examined to
generate its rules.

INTENTIONS

Clustering Examples:
Segment a customer database based on
similar buying patterns.
Group houses in a town into
neighborhoods based on similar features.
Identify new plant species.
Identify similar Web usage patterns.

CLUSTERING

CLUSTERING: Example
[Figure: points grouped into clusters not reproduced in this extract.]

CLUSTERING HOUSES
[Figure: houses clustered two ways, geographic distance based and size based.]

Clustering vs. Classification

No prior knowledge of:
the number of clusters
the meaning of the clusters
Unsupervised learning

Clustering Issues
Outlier handling
Dynamic data
Interpreting results
Evaluating results
Number of clusters
Data to be used
Scalability

Impact of Outliers on Clustering
[Figure: outlier points distorting the cluster boundaries.]

Clustering Problem
Given a database D = {t1, t2, ..., tn} of tuples
and an integer value k, the Clustering
Problem is to define a mapping f: D → {1, ..., k}
where each ti is assigned to one cluster Kj,
1 <= j <= k.
A cluster Kj contains precisely those
tuples mapped to it.
Unlike the classification problem, the clusters
are not known a priori.

Types of Clustering
Hierarchical: a nested set of clusters is
created.
Partitional: one set of clusters is created.
Incremental: each element is handled one
at a time.
Simultaneous: all elements are handled
together.
Overlapping / Non-overlapping

Clustering Approaches
Clustering
  Hierarchical
    Agglomerative
    Divisive
  Partitional
  Categorical
  Large DB
    Sampling
    Compression

Cluster Parameters
[Formulas for the cluster parameters not reproduced in this extract.]

Distance Between Clusters


Single Link: smallest distance between points
Complete Link: largest distance between
points
Average Link: average distance between
points
Centroid: distance between centroids


Hierarchical Clustering
Clusters are created in levels, actually creating
a set of clusters at each level.
Agglomerative
Initially each item is in its own cluster
Iteratively, clusters are merged together
Bottom up

Divisive
Initially all items are in one cluster
Large clusters are successively divided
Top down

Hierarchical Algorithms
Single Link
MST Single Link
Complete Link
Average Link


Dendrogram
Dendrogram: a tree data
structure which illustrates
hierarchical clustering
techniques.
Each level shows the clusters for
that level.
Leaf: individual clusters
Root: one cluster
A cluster at level i is the union
of its child clusters at level
i+1.

Agglomerative Example
[Figure: distance matrix for items A to E and the dendrogram produced at thresholds 1 to 5.]

MST Example
[Figure: minimum spanning tree over items A to E not reproduced in this extract.]

Agglomerative Algorithm
[Pseudocode for the agglomerative algorithm not reproduced in this extract.]

Single Link
View all items with links (distances)
between them.
Finds maximal connected components
in this graph.
Two clusters are merged if there is at
least one edge which connects them.
Uses threshold distances at each level.
Could be agglomerative or divisive.

MST Single Link Algorithm
[Pseudocode for the MST single link algorithm not reproduced in this extract.]

Single Link Clustering
[Figure: single link clustering example not reproduced in this extract.]

Partitional Clustering
Nonhierarchical
Creates clusters in one step as opposed
to several steps.
Since only one set of clusters is output,
the user normally has to input the desired
number of clusters, k.
Usually deals with static sets.

Partitional Algorithms
MST
Squared Error
K-Means
Nearest Neighbor
PAM
BEA
GA

K-Means
An initial set of clusters is randomly chosen.
Iteratively, items are moved among the sets
of clusters until the desired set is
reached.
A high degree of similarity among
elements in a cluster is obtained.
Given a cluster Ki = {ti1, ti2, ..., tim}, the
cluster mean is mi = (1/m)(ti1 + ... + tim).

K-Means Example
Given: {2,4,10,12,3,20,30,11,25}, k=2
Randomly assign means: m1=3,m2=4
K1={2,3}, K2={4,10,12,20,30,11,25},
m1=2.5,m2=16
K1={2,3,4},K2={10,12,20,30,11,25},
m1=3,m2=18
K1={2,3,4,10},K2={12,20,30,11,25},
m1=4.75,m2=19.6
K1={2,3,4,10,11,12},K2={20,30,25},
m1=7,m2=25
Stop as the clusters with these means are the
same.

K-Means Algorithm
[Pseudocode for the K-Means algorithm not reproduced in this extract; a sketch follows below.]
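A sketch of K-Means on the one-dimensional example above: assign each item to the cluster with the nearest mean, recompute the means, and repeat until the assignments stop changing. Run on the slide's data and starting means, it reproduces the iterations shown above.

# Simple 1-D K-Means: reassign items to the nearest mean, then update means.
def kmeans(items, means):
    while True:
        clusters = [[] for _ in means]
        for x in items:                                # assign to nearest mean
            i = min(range(len(means)), key=lambda j: abs(x - means[j]))
            clusters[i].append(x)
        new_means = [sum(c) / len(c) for c in clusters]
        if new_means == means:                         # converged
            return clusters, means
        means = new_means

print(kmeans([2, 4, 10, 12, 3, 20, 30, 11, 25], [3, 4]))
# ([[2, 4, 10, 12, 3, 11], [20, 30, 25]], [7.0, 25.0])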

Nearest Neighbor
Items are iteratively merged into the
existing clusters that are closest.
Incremental
Threshold, t, used to determine if items
are added to existing clusters or a new
cluster is created.


Nearest Neighbor Algorithm
[Pseudocode for the nearest neighbor clustering algorithm not reproduced in this extract.]

Clustering Large Databases

Most clustering algorithms assume a large
data structure which is memory resident.
Clustering may be performed first on a
sample of the database and then applied to
the entire database.
Algorithms:
BIRCH
DBSCAN
CURE

Desired Features for Large Databases
One scan (or less) of the DB
Online
Suspendable, stoppable, resumable
Incremental
Work with limited main memory
Different techniques to scan (e.g.
sampling)
Process each tuple once

BIRCH
Balanced Iterative Reducing and
Clustering using Hierarchies
Incremental, hierarchical, one scan
Save clustering information in a tree
Each entry in the tree contains
information about one cluster
New nodes inserted in closest entry in
tree

Clustering Feature
CF Triple: (N, LS, SS)
N: number of points in the cluster
LS: sum of the points in the cluster
SS: sum of the squares of the points in the cluster
CF Tree
Balanced search tree
Each node has a CF triple for each child
A leaf node represents a cluster and has a CF value
for each subcluster in it.
A subcluster has a maximum diameter

CURE
Clustering Using Representatives
Use many points to represent a cluster
instead of only one
Points will be well scattered


CURE Approach
[Overview figure of the CURE approach not reproduced in this extract.]

CURE Algorithm
[Pseudocode for the CURE algorithm not reproduced in this extract.]

CURE for Large Databases
[CURE adaptations for large databases not reproduced in this extract.]

ASSOCIATION RULES

Association Rules Outline

Goal: provide an overview of basic
Association Rule mining techniques.
Association Rules Problem Overview
  Large itemsets
Association Rules Algorithms
  Apriori
  Sampling
  Partitioning
  Parallel Algorithms
Comparing Techniques
Incremental Algorithm
Advanced AR Techniques

Example: Market Basket Data

Items frequently purchased together:
Bread → Butter
Uses:
Placement
Advertising
Sales
Coupons
Objective: increase sales and reduce
costs.

Association Rule Definitions

Set of items: I = {I1, I2, ..., Im}
Transactions: D = {t1, t2, ..., tn}, tj ⊆ I
Itemset: {Ii1, Ii2, ..., Iik} ⊆ I
Support of an itemset: percentage of
transactions which contain that itemset.
Large (Frequent) itemset: an itemset whose
number of occurrences is above a
threshold.

Association Rules Example

I = {Beer, Bread, Jelly, Milk, PeanutButter}
[Table: the example transactions not reproduced in this extract.]
Support of {Bread, PeanutButter} is 60%.

Association Rule Definitions

Association Rule (AR): an implication
X → Y where X, Y ⊆ I and X ∩ Y = ∅.
Support of AR (s) X → Y: percentage
of transactions that contain X ∪ Y.
Confidence of AR (α) X → Y: ratio of the
number of transactions that contain
X ∪ Y to the number that contain X.
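A sketch of computing support and confidence for a rule X → Y over a small, made-up transaction database (not the deck's example table):

# Support and confidence of an association rule over transactions.
transactions = [
    {"Bread", "Jelly", "PeanutButter"},
    {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):
    return support(x | y) / support(x)

X, Y = {"Bread"}, {"PeanutButter"}
print("support:", support(X | Y))       # 0.6
print("confidence:", confidence(X, Y))  # 0.75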

Association Rules Example (contd)
[Table: support and confidence values for the example rules not reproduced in this extract.]

Association Rule Problem

Given a set of items I = {I1, I2, ..., Im} and a
database of transactions D = {t1, t2, ..., tn}
where ti = {Ii1, Ii2, ..., Iik} and Iij ∈ I, the
Association Rule Problem is to
identify all association rules X → Y with
a minimum support and confidence.
Also called Link Analysis.
NOTE: The support of X → Y is the same as the
support of X ∪ Y.

Association Rule Techniques


1. Find Large Itemsets.
2. Generate rules from frequent
itemsets.


Algorithm to Generate ARs
[Pseudocode for generating association rules from large itemsets not reproduced in this extract.]

Apriori
Large Itemset Property:
Any subset of a large itemset is large.
Contrapositive:
If an itemset is not large, none of its
supersets are large.


Large Itemset Property
[Figure: itemset lattice illustrating the large itemset property.]

Apriori Example (contd)
s = 30%, α = 50%
[Table: large itemsets found at each pass not reproduced in this extract.]

Apriori Algorithm
1. C1 = itemsets of size one in I;
2. Determine all large itemsets of size 1, L1;
3. i = 1;
4. Repeat
5.   i = i + 1;
6.   Ci = Apriori-Gen(Li-1);
7.   Count Ci to determine Li;
8. until no more large itemsets are found;
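A compact sketch of Apriori, including a simple candidate-generation step in the spirit of Apriori-Gen (described on the next slide): candidates of size i+1 are unions of two large i-itemsets sharing i-1 items, and a candidate is pruned if any of its i-subsets is not large. The transaction list and minimum support are made up for illustration.

# Apriori: grow large itemsets level by level using the large itemset property.
from itertools import combinations

def apriori(transactions, minsup):
    n = len(transactions)
    support = lambda items: sum(items <= t for t in transactions) / n
    items = sorted({i for t in transactions for i in t})
    large = [frozenset([i]) for i in items if support(frozenset([i])) >= minsup]
    all_large = list(large)
    while large:
        # Join: unions of two large i-itemsets that differ in exactly one item.
        candidates = {a | b for a in large for b in large
                      if len(a | b) == len(a) + 1}
        # Prune: drop candidates with a subset of size i that is not large.
        candidates = {c for c in candidates
                      if all(frozenset(s) in set(large)
                             for s in combinations(c, len(c) - 1))}
        large = [c for c in candidates if support(c) >= minsup]
        all_large += large
    return all_large

db = [{"Bread", "Jelly", "PeanutButter"}, {"Bread", "PeanutButter"},
      {"Bread", "Milk", "PeanutButter"}, {"Beer", "Bread"}, {"Beer", "Milk"}]
print(apriori(db, minsup=0.3))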

Apriori-Gen
Generate candidates of size i+1 from large
itemsets of size i.
Approach used: join large itemsets of size
i if they agree on the first i-1 items.
May also prune candidates that have
subsets which are not large.

Apriori-Gen Example
[Worked Apriori-Gen example not reproduced in this extract.]

Apriori-Gen Example (contd)
[Continuation of the Apriori-Gen example not reproduced in this extract.]

Apriori Adv/Disadv
Advantages:
Uses large itemset property.
Easily parallelized
Easy to implement.

Disadvantages:
Assumes transaction database is memory
resident.
Requires up to m database scans.

Sampling
For large databases:
Sample the database and apply Apriori to the
sample.
Potentially Large Itemsets (PL): large
itemsets from the sample.
Negative Border (BD-):
A generalization of Apriori-Gen applied to
itemsets of varying sizes.
The minimal set of itemsets which are not in PL,
but whose subsets are all in PL.

Negative Border Example
[Figure: itemset lattice showing PL and PL ∪ BD-(PL).]

Sampling Algorithm
1. Ds = sample of database D;
2. PL = large itemsets in Ds using smalls;
3. C = PL ∪ BD-(PL);
4. Count C in the database using s;
5. ML = large itemsets in BD-(PL);
6. If ML = ∅ then done
7. else C = repeated application of BD-;
8. Count C in the database;

Sampling Example
Find AR assuming s = 20%
Ds = { t1,t2}
Smalls = 10%
PL = {{Bread}, {Jelly}, {PeanutButter},
{Bread,Jelly}, {Bread,PeanutButter}, {Jelly,
PeanutButter}, {Bread,Jelly,PeanutButter}}
BD-(PL)={{Beer},{Milk}}
ML = {{Beer}, {Milk}}
Repeated application of BD- generates all
remaining itemsets

Sampling Adv/Disadv
Advantages:
Reduces number of database scans to one
in the best case and two in worst.
Scales better.

Disadvantages:
Potentially large number of candidates in
second pass


Partitioning
Divide the database into partitions D1, D2, ..., Dp.
Apply Apriori to each partition.
Any large itemset must be large in at
least one partition.

Partitioning Algorithm
1. Divide D into partitions D1, D2, ..., Dp;
2. For i = 1 to p do
3.   Li = Apriori(Di);
4. C = L1 ∪ ... ∪ Lp;
5. Count C on D to generate L;

Partitioning Example
s = 10%
D1: L1 = {{Bread}, {Jelly}, {PeanutButter},
{Bread, Jelly}, {Bread, PeanutButter},
{Jelly, PeanutButter}, {Bread, Jelly, PeanutButter}}
D2: L2 = {{Bread}, {Milk}, {PeanutButter},
{Bread, Milk}, {Bread, PeanutButter},
{Milk, PeanutButter}, {Bread, Milk, PeanutButter},
{Beer}, {Beer, Bread}, {Beer, Milk}}

Partitioning Adv/Disadv
Advantages:
Adapts to available main memory
Easily parallelized
Maximum number of database scans is
two.

Disadvantages:
May have many candidates during second
scan.

Parallelizing AR Algorithms
Based on Apriori
Techniques differ:
What is counted at each site
How data (transactions) are distributed

Data Parallelism
Data partitioned
Count Distribution Algorithm

Task Parallelism
Data and candidates partitioned
Data Distribution Algorithm

T H A N K S !
