Market Baskets
Frequent Itemsets
A-priori Algorithm
[Diagram: the two passes of the A-priori algorithm — the first pass counts individual items; the second pass counts the candidate pairs.]
C1 = all items.
L1 = those items counted on the first pass to be frequent.
C2 = pairs, both items chosen from L1.
In general, Ck = k-tuples, every (k−1)-subset of which is in Lk−1.
Lk = those candidates with support ≥ s.
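The level-wise construction of Ck and Lk above can be sketched as follows; the basket data in the test is hypothetical and support s is a count threshold:

```python
from itertools import combinations

def apriori(baskets, s):
    """A-priori: level-wise candidate generation and counting.

    baskets : list of sets of items; s : minimum support count.
    Returns the set of all frequent itemsets (as frozensets).
    """
    # Pass 1: C1 = all items; L1 = items with support >= s.
    counts = {}
    for basket in baskets:
        for item in basket:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    L = {c for c, n in counts.items() if n >= s}
    frequent = set(L)
    k = 2
    while L:
        # Ck = k-tuples, every (k-1)-subset of which is in Lk-1.
        candidates = set()
        for a in L:
            for b in L:
                union = a | b
                if len(union) == k and all(
                    frozenset(t) in L for t in combinations(union, k - 1)
                ):
                    candidates.add(union)
        # Count candidates; Lk = those with support >= s.
        counts = {c: sum(1 for basket in baskets if c <= basket)
                  for c in candidates}
        L = {c for c, n in counts.items() if n >= s}
        frequent |= L
        k += 1
    return frequent
```

The pruning step (checking every (k−1)-subset) is exactly the monotonicity argument: no itemset can be frequent unless all of its subsets are.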
The selection of the initial points is extremely important for the K-means algorithm.
CityU, 3/14/2008
C4 Outliers => objects that are far away from the other objects in a cluster.
C5
Data Structures
Data matrix (two modes) — n objects × p attributes:

  [ x11 ... x1f ... x1p ]
  [ ... ... ... ... ... ]
  [ xi1 ... xif ... xip ]
  [ ... ... ... ... ... ]
  [ xn1 ... xnf ... xnp ]

Dissimilarity matrix (one mode) — n × n, lower triangular:

  [   0                       ]
  [ d(2,1)   0                ]
  [ d(3,1) d(3,2)   0         ]
  [   :      :      :         ]
  [ d(n,1) d(n,2) ... ...  0  ]
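The two structures can be represented directly in code; a minimal sketch with a hypothetical 3 × 2 data matrix and Euclidean d(i, j):

```python
import math

def dissimilarity_matrix(X):
    """Build the one-mode, lower-triangular dissimilarity matrix
    from a two-mode data matrix X (n objects x p attributes)."""
    n = len(X)
    # Row i holds d(i,0) .. d(i,i-1) followed by the diagonal 0.
    D = [[0.0] * (i + 1) for i in range(n)]
    for i in range(n):
        for j in range(i):
            D[i][j] = math.dist(X[i], X[j])  # Euclidean distance
    return D

# Hypothetical data matrix: 3 objects, 2 attributes.
X = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 0.0]]
```

Only the lower triangle is stored because d(i, j) = d(j, i) and d(i, i) = 0.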
DWDM : 05BIF403: Prof. D.RAJESH 225
C5 The learning of the classifier is "supervised" in that it is told to which class each training tuple belongs. This contrasts with unsupervised learning (clustering), in which the class label of each training tuple is not known, and the number or set of classes (k clusters) to be learned may not be known in advance.
C6
C6 Many clustering algorithms require users to input certain parameters (such as the desired number of clusters, k), and the clustering results can be quite sensitive to these parameters. The parameters are often difficult to determine, but if the number of clusters k is set correctly, it helps speed up the clustering process.
Interval-valued variables
Standardize data
Calculate the mean absolute deviation:

  s_f = (1/n)(|x_1f − m_f| + |x_2f − m_f| + ... + |x_nf − m_f|),

  where m_f = (1/n)(x_1f + x_2f + ... + x_nf)

Calculate the standardized measurement (z-score):

  z_if = (x_if − m_f) / s_f

Using the mean absolute deviation is more robust than using the standard deviation.
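A minimal sketch of the standardization step for one interval-valued attribute f:

```python
def standardize(values):
    """z_if = (x_if - m_f) / s_f, where s_f is the mean absolute
    deviation of attribute f (more robust than standard deviation,
    since deviations are not squared)."""
    n = len(values)
    m_f = sum(values) / n                            # attribute mean
    s_f = sum(abs(x - m_f) for x in values) / n      # mean absolute deviation
    return [(x - m_f) / s_f for x in values]
```

Because the deviations |x_if − m_f| are not squared, an outlier inflates s_f less than it would inflate the standard deviation, so outliers remain detectable after standardization.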
d(i, j) = (|x_i1 − x_j1|^q + |x_i2 − x_j2|^q + ... + |x_ip − x_jp|^q)^(1/q)

where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) are two p-dimensional data objects, and q is a positive integer.

If q = 1, d is Manhattan distance:

  d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + ... + |x_ip − x_jp|
Similarity and Dissimilarity Between
Objects (Cont.)
If q = 2, d is Euclidean distance (most popular):
d(i, j) = (|x_i1 − x_j1|^2 + |x_i2 − x_j2|^2 + ... + |x_ip − x_jp|^2)^(1/2)
Properties
d(i, j) ≥ 0 (non-negativity)
d(i, i) = 0
d(i, j) = d(j, i) (symmetry)
d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)
One can also use weighted distance, the parametric Pearson product-moment correlation, or other dissimilarity measures.
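A minimal sketch of the Minkowski family of distances, which yields Manhattan distance for q = 1 and Euclidean distance for q = 2:

```python
def minkowski(i, j, q):
    """d(i, j) = (sum over f of |x_if - x_jf|^q) ^ (1/q)
    for two p-dimensional objects i and j and positive integer q."""
    return sum(abs(a - b) ** q for a, b in zip(i, j)) ** (1.0 / q)
```

For example, between (0, 0) and (3, 4) the Manhattan distance (q = 1) is 7 while the Euclidean distance (q = 2) is 5.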
d(i, j) = (b + c) / (a + b + c)
Dissimilarity between Binary Variables
Example
d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
d(jack, jim) = (1 + 1) / (1 + 1 + 1) = 0.67
d(jim, mary) = (1 + 2) / (1 + 1 + 2) = 0.75
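The three dissimilarities can be computed directly from binary attribute vectors. The vectors below (fever, cough, test-1 ... test-4, with Y/P coded as 1 and N as 0) are a reconstruction consistent with the a, b, c counts in the fractions above, not taken verbatim from the slides:

```python
def asym_binary_d(x, y):
    """Asymmetric binary dissimilarity d = (b + c) / (a + b + c).

    a: attributes where both are 1; b: x = 1, y = 0; c: x = 0, y = 1.
    Negative (0-0) matches are ignored, as the attributes are asymmetric.
    """
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    return (b + c) / (a + b + c)

# fever, cough, test-1, test-2, test-3, test-4 (reconstructed values)
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]
```

Running the function on these vectors reproduces 0.33, 0.67, and 0.75, so jim and mary are the most dissimilar pair.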
Partitioning Method
[Figure: four scatter plots (axes 0–10) illustrating a partitioning method on a two-dimensional data set.]
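The partitioning process illustrated above can be sketched as a minimal k-means loop (the data points in the test are hypothetical); as noted earlier, the outcome depends on the choice of initial points:

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Basic k-means: choose k initial centers, then repeatedly
    (1) assign each point to its nearest center and
    (2) move each center to the mean of its cluster."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers: k random points
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Step 1: assign every point to the nearest current center.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[idx].append(p)
        # Step 2: recompute each center as the mean of its cluster.
        for c, members in enumerate(clusters):
            if members:
                centers[c] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return centers, clusters
```

With well-separated data the loop converges in a few iterations; with a poor initialization it can settle into a worse partition, which is why the initial points matter so much for k-means.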
C7 The k-medoids method is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean. However, its processing is more costly than the k-means method.
Case 1: p currently belongs to medoid Oj. If Oj is
replaced by Orandom as a medoid and p is closest to
one of the other medoids Oi, i ≠ j, then p is reassigned to Oi.
Case 2: p currently belongs to medoid Oj. If Oj is
replaced by Orandom as a medoid and p is closest to
Orandom, then p is reassigned to Orandom.
Case 3: p currently belongs to medoid Oi, i ≠ j. If
Oj is replaced by Orandom as a medoid and p is still
closest to Oi, then the assignment does not change.
Case 4: p currently belongs to medoid Oi, i ≠ j. If
Oj is replaced by Orandom as a medoid and p is
closest to Orandom, then p is reassigned to Orandom.
Two-dimensional example: pairwise distances among objects A–E
Item A B C D E
A 0 1 2 2 3
B 1 0 2 4 3
C 2 2 0 1 5
D 2 4 1 0 3
E 3 3 5 3 0
TC_AC = C_A,AC + C_B,AC + C_C,AC + C_D,AC + C_E,AC = 1 + 0 − 2 − 1 + 0 = −2

where C_A,AC is the cost change of object A after replacing medoid A with medoid C.
The new medoid set {B, C} is less costly. As a result, the k-medoids method should swap {A, B} for {B, C}.
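The total cost before and after the swap can be checked directly against the distance table above:

```python
# Distance table from the example (items A-E); only the upper
# triangle is stored, since the table is symmetric.
D = {
    ('A', 'A'): 0, ('A', 'B'): 1, ('A', 'C'): 2, ('A', 'D'): 2, ('A', 'E'): 3,
    ('B', 'B'): 0, ('B', 'C'): 2, ('B', 'D'): 4, ('B', 'E'): 3,
    ('C', 'C'): 0, ('C', 'D'): 1, ('C', 'E'): 5,
    ('D', 'D'): 0, ('D', 'E'): 3,
    ('E', 'E'): 0,
}

def d(x, y):
    """Symmetric lookup into the distance table."""
    return D.get((x, y), D.get((y, x)))

items = ['A', 'B', 'C', 'D', 'E']

def total_cost(medoids):
    """Sum, over all objects, of the distance to the nearest medoid."""
    return sum(min(d(o, m) for m in medoids) for o in items)

# TC_AC: the cost change of swapping medoid A for medoid C, keeping B.
tc_ac = total_cost(['B', 'C']) - total_cost(['A', 'B'])
```

The medoids {A, B} give total cost 7, the medoids {B, C} give total cost 5, so TC_AC = −2, matching the per-object sum 1 + 0 − 2 − 1 + 0 in the slide.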
Call Revenue
Call Usage
Derived business rules (observation)
Kevan M N P P N
Caroline F N P P P
Erik M P N N P
Genetic Algorithm
Genetic Algorithms (GAs) apply an evolutionary
approach to inductive learning. GAs have been
successfully applied to problems that are difficult to
solve using conventional techniques, such as
scheduling problems, the traveling salesperson problem,
network routing problems, and financial marketing.
C9 A genetic algorithm locates representative sample data (solutions) among the training (test) data; each solution represents a rule.
The idea is to change the solution set until every solution passes a fitness function.
Phase 1: The training data set is used to derive a rule of representative data (a solution) from the population elements (the input data to be mined).
Phase 2: The test data set is used to test the solution derived in Phase 1. If the result passes the required success rate, it becomes a rule. Otherwise, repeat Phase 1.
c1
(2) The minimum error rate (fitness-function score) specified by the user has been reached after matching all the training (test) data.
Digitized Genetic Knowledge Representation
C11 Own class means that the data in the training data set matches the target data (suggested solution).
Competing class means that the data in the training data set does not match the target data.
Step 2 of supervised genetic learning
Step 2a applies a fitness function to evaluate each
element currently in the population. With each
iteration, elements not satisfying the fitness
criteria are eliminated from the population. The
final result of a supervised genetic learning
session is a set of population elements that best
represents the training data.
3 ? No No Male 40-49
Note: the higher the fitness score, the smaller the error rate for the solution.
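The Step 2a loop — score, eliminate, refill — can be sketched as follows; the bit-string encoding, the 20% mutation rate, and the fitness function in the test are illustrative assumptions, not taken from the slides:

```python
import random

def genetic_search(population, fitness, threshold, generations=50, seed=0):
    """Sketch of supervised genetic learning: each generation, score
    every element with the fitness function, eliminate elements below
    the threshold, and refill the population by mutating survivors."""
    rng = random.Random(seed)
    for _ in range(generations):
        survivors = [e for e in population if fitness(e) >= threshold]
        if len(survivors) == len(population):
            break  # every element already satisfies the fitness criterion
        while len(survivors) < len(population):
            parent = rng.choice(survivors or population)
            # Flip each bit with 20% probability (hypothetical mutation rate).
            survivors.append([bit ^ (rng.random() < 0.2) for bit in parent])
        population = survivors
    return population
```

The returned population is the set of elements that best represents the training data under the chosen fitness function; a higher threshold corresponds to demanding a smaller error rate, as the note above states.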
A Second-Generation Population
3 ? No No Male 40-49