Learning
  Unsupervised learning
    Clustering
      K-means
      K-medoids
      Hierarchical
    Association analysis
  Supervised learning
    Classification
      Decision tree
      K-nearest neighbor
      Naive Bayesian
      Support vector machines
      Neural network
Clustering
Finding groups of objects such that the objects in a group will be similar (or related) to
one another and different from (or unrelated to) the objects in other groups.
Similarity
A numerical measure of how alike two data objects are. It is higher when objects are more alike.
Euclidean Distance
Standardization is necessary if scales differ.

      P1     P2     P3     P4
P1    0      2.84   3.16   5.09
P2    2.84   0      1.41   3.16
P3    3.16   1.41   0      -
P4    5.09   3.16   -      0
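A distance matrix like the one above can be computed with base R's dist(). The sketch below is only an illustration: the coordinates of P1 to P4 are hypothetical (the slides do not give them), so the printed values are not claimed to reproduce the matrix exactly.

```r
# Hypothetical coordinates, chosen only for illustration
pts <- matrix(c(0, 2,
                2, 0,
                3, 1,
                5, 1),
              ncol = 2, byrow = TRUE,
              dimnames = list(c("P1", "P2", "P3", "P4"), c("x", "y")))

# dist() returns the pairwise Euclidean distances;
# as.matrix() expands them into a full symmetric matrix
d <- as.matrix(dist(pts, method = "euclidean"))
print(round(d, 2))

# If the columns were on different scales, standardize them first
d_scaled <- as.matrix(dist(scale(pts)))
```

scale() centers each column and divides it by its standard deviation, which is one common way to standardize before measuring distance.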
K-means Clustering
Partitional clustering approach
Each cluster is associated with a centroid (center point)
Each point is assigned to the cluster with the closest centroid
Number of clusters, K, must be specified
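The points above can be tried out with base R's kmeans(). This is a minimal sketch, assuming the built-in iris data and K = 3; the seed and the nstart value are arbitrary choices made for the example.

```r
# Sketch: K-means on the numeric iris columns, with K = 3 as an assumed choice
iris2 <- iris[, 1:4]            # drop the Species label
set.seed(42)                    # initial centroids are random, so fix the seed
km <- kmeans(iris2, centers = 3, nstart = 10)

km$centers                      # one centroid (mean vector) per cluster
km$size                         # number of points assigned to each cluster
table(km$cluster, iris$Species) # cluster assignment vs. known species
```

nstart = 10 reruns the algorithm from ten random starts and keeps the best result, which makes the outcome less dependent on the initial centroids.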
K-means in R
Objective: minimize TSS
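# Elbow plot: within-groups sum of squares for K = 1..15
iris2 <- iris[, -5]   # numeric columns only (drop Species)
set.seed(1)           # kmeans starts from random centroids; the seed is arbitrary
wss <- (nrow(iris2)-1)*sum(apply(iris2,2,var))   # K = 1
for (i in 2:15) wss[i] <- sum(kmeans(iris2,centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")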
Scaling (normalization)
Table: Customer | Marital status | House | Car | Salary (surviving cells: 25000, 20000, 0, 0, 0, 25000)
d(A,B) = 5000
d(A,C) = 1.3
d(B,C) > 5000
A and C could be similar, so they will be in one cluster. Without scaling, salary dominates the distance.
Normalization = (Value - min) / (Max - min), applied to each column: Marital status, House, Car, Salary
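The effect of min-max normalization on distances can be sketched in R as follows. The customer values here are invented for illustration (the original table did not survive), and min_max is a hypothetical helper name.

```r
# Min-max normalization: (value - min) / (max - min), applied column by column
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

# Hypothetical customers; the values are invented for illustration
customers <- data.frame(
  married = c(1, 1, 0),
  house   = c(1, 1, 0),
  car     = c(1, 1, 0),
  salary  = c(25000, 20000, 25000),
  row.names = c("A", "B", "C")
)

dist(customers)   # raw data: the salary column dominates every distance

scaled <- as.data.frame(lapply(customers, min_max))
rownames(scaled) <- rownames(customers)
dist(scaled)      # after scaling, every column lies in [0, 1]
```

With these invented values, A and C are the closest pair before scaling, while A and B are the closest pair after it. Note that min_max divides by zero for a constant column, so such columns would need separate handling.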
K-means limitations
K-means has problems when the data contains outliers
Finding the optimum number of clusters K is difficult
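One way to see the outlier problem is to compare a centroid (a mean) with a medoid (an actual data point) on a tiny one-dimensional example. The values below are invented for illustration.

```r
x <- c(1, 2, 3, 4, 100)   # invented values; 100 is the outlier

# Centroid: the mean, which K-means uses as the cluster center
centroid <- mean(x)

# Medoid: the data point with the smallest total distance to all others
total_dist <- rowSums(as.matrix(dist(x)))
medoid <- x[which.min(total_dist)]

centroid   # 22
medoid     # 3
```

The single outlier pulls the mean far away from the bulk of the data, while the medoid stays among the typical points.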
K-medoids clustering
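library(fpc)                 # provides pamk()
iris2 <- iris
iris2$Species <- NULL        # drop the class label before clustering
pamk.result <- pamk(iris2)   # PAM; K is estimated by average silhouette width
table(pamk.result$pamobject$clustering, iris$Species)   # clusters vs. species
layout(matrix(c(1,2),1,2))   # two plots side by side
plot(pamk.result$pamobject)
layout(matrix(1))            # restore the single-plot layout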
Hierarchical clustering
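For comparison with the K-means and K-medoids code, a minimal sketch of agglomerative hierarchical clustering using base R's hclust() might look like this. The linkage method, the standardization step, and the cut at 3 clusters are assumptions made for the example.

```r
iris2 <- iris[, 1:4]                  # drop the Species label
d <- dist(scale(iris2))               # Euclidean distances on standardized columns
hc <- hclust(d, method = "average")   # average linkage (an assumed choice)

plot(hc, labels = FALSE, hang = -1)   # dendrogram
groups <- cutree(hc, k = 3)           # cut the tree into 3 clusters
table(groups, iris$Species)           # clusters vs. known species
```

dist() stores all pairwise distances, so memory use grows quadratically with the number of rows.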
Clustering depending on type of dataset
Should K-means be avoided for datasets with outliers?
Should hierarchical clustering be avoided for large datasets?
Thank You