Data points are a little bit like coconuts
[Figure: several coconuts, each labeled "one data point" / "another data point"]
Each Data Point Has Features
1. Color
2. Size
3. Amount of hair
4. Cut on top
Data Sets are like piles of coconuts
But actually they are matrices
[Figure: data matrix whose columns are the features Color, Size, Amount of hair, Cut on top]
Data points are unlabeled if we don't know their "type"
[Figure: an unlabeled data point with features Color, Size, Amount of hair, Cut on top]
Clustering: grouping data points whose features are similar.
Need a way to measure similarity.
[Figure: two data points compared feature by feature — Color, Size, Amount of hair, Cut on top]
Clustering 2D Data
Index Feature 1 Feature 2
0 50.3 8.5
1 12.1 86.4
2 8.6 76.6
3 69.8 6.5
4 99.1 10.2
5 10.1 87.6
6 8.4 64.5
7 6.5 75.4
8 83.5 8.4
9 92.1 2.1
10 6.6 75.5
11 6.7 92.4
Clustering 2D Data
[Figure: scatter plot of the table above, Feature 2 vs. Feature 1]
Similarity is Euclidean
[Figure: Euclidean distance between two points in the Feature 1 / Feature 2 plane]
Clustering
[Figure: the same points grouped into clusters]
Clustering with two features
[Figure: clusters in two feature dimensions]
Clustering with three features
[Figure: clusters in three feature dimensions]
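To make the Euclidean-similarity idea concrete, here is a minimal sketch (assuming NumPy; the variable names are illustrative, not from the slides) that computes the pairwise Euclidean distances for the twelve points in the table above:

```python
import numpy as np

# The twelve 2D data points from the table (Feature 1, Feature 2).
X = np.array([
    [50.3,  8.5], [12.1, 86.4], [ 8.6, 76.6], [69.8,  6.5],
    [99.1, 10.2], [10.1, 87.6], [ 8.4, 64.5], [ 6.5, 75.4],
    [83.5,  8.4], [92.1,  2.1], [ 6.6, 75.5], [ 6.7, 92.4],
])

# Pairwise Euclidean distance matrix: D[i, j] = ||X[i] - X[j]||.
diff = X[:, None, :] - X[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=-1))

# Small distance = high similarity: points 1 and 5 should cluster
# together, points 1 and 4 should not.
print(D[1, 5])   # small
print(D[1, 4])   # large
```

Points with small mutual distances sit close together in feature space, so any distance-based clustering should put them in the same cluster.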
What is Cluster Analysis?
• Cluster: a collection of data objects
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
• Cluster analysis
– Finding similarities between data according to the
characteristics found in the data and grouping similar data
objects into clusters
• Unsupervised learning: no predefined classes
• Typical applications
– As a stand-alone tool to get insight into data distribution
– As a preprocessing step for other algorithms
Clustering: Rich Applications and Multidisciplinary Efforts
• Pattern Recognition
• Spatial Data Analysis
– Create thematic maps in GIS by clustering feature spaces
– Detect spatial clusters or for other spatial mining tasks
• Image Processing
• Economic Science (especially market research)
• WWW
– Document classification
– Cluster Weblog data to discover groups of similar access
patterns
Examples of Clustering Applications
• Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop
targeted marketing programs
• Land use: Identification of areas of similar land use in an earth
observation database
• Insurance: Identifying groups of motor insurance policy holders
with a high average claim cost
• City-planning: Identifying groups of houses according to their
house type, value, and geographical location
• Earthquake studies: Observed earthquake epicenters should be clustered along continent faults
Types of Data in Cluster Analysis
• Interval-scaled variables
• Binary variables
• Nominal, ordinal, and ratio variables
• Variables of mixed types
Interval-valued Variables
• Standardize data
– Calculate the mean absolute deviation:
$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$
where $m_f$ is the mean of variable $f$
Similarity and Dissimilarity Between Objects
• Minkowski distance:
$d(i, j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q\right)^{1/q}$
where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two $p$-dimensional data objects, and $q$ is a positive integer
• If $q = 1$, $d$ is Manhattan distance:
$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$
Similarity and Dissimilarity Between Objects (Cont.)
• If $q = 2$, $d$ is Euclidean distance:
$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$
– Properties
• $d(i, j) \ge 0$
• $d(i, i) = 0$
• $d(i, j) = d(j, i)$
• $d(i, j) \le d(i, k) + d(k, j)$
• Also, one can use weighted distance, parametric Pearson product moment correlation, or other dissimilarity measures
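A minimal sketch of the Minkowski family above (the function name `minkowski` and the example values are my own):

```python
import numpy as np

def minkowski(x, y, q):
    """Minkowski distance between two p-dimensional points x and y."""
    return float((np.abs(np.asarray(x) - np.asarray(y)) ** q).sum() ** (1.0 / q))

x, y = [0.0, 0.0], [3.0, 4.0]
print(minkowski(x, y, 1))  # q = 1: Manhattan distance -> 7.0
print(minkowski(x, y, 2))  # q = 2: Euclidean distance -> 5.0
```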
Variable Selection (1)
• Categorical
• A categorical variable (sometimes called a nominal
variable) is one that has two or more categories, but there
is no intrinsic ordering to the categories. For example,
gender is a categorical variable having two categories (male
and female) and there is no intrinsic ordering to the
categories. Hair color is also a categorical variable having a
number of categories (blonde, brown, brunette, red, etc.)
and again, there is no agreed way to order these from
highest to lowest. A purely categorical variable is one that
simply allows you to assign categories, but you cannot
clearly order the categories. If the variable has a clear
ordering, then it is an ordinal variable,
as described below.
Variable Selection (2)
• Ordinal
• An ordinal variable is similar to a categorical variable. The difference between
the two is that there is a clear ordering of the categories. For example, suppose
you have a variable, economic status, with three categories (low, medium and
high). In addition to being able to classify people into these three categories, you
can order the categories as low, medium and high. Now consider a variable like
educational experience (with values such as elementary school graduate, high
school graduate, some college and college graduate). These also can be ordered
as elementary school, high school, some college, and college graduate. Even
though we can order these from lowest to highest, the spacing between the
values may not be the same across the levels of the variables. Say we assign
scores 1, 2, 3 and 4 to these four levels of educational experience and we
compare the difference in education between categories one and two with the
difference in educational experience between categories two and three, or the
difference between categories three and four. The difference between categories
one and two (elementary and high school) is probably much bigger than the
difference between categories two and three (high school and some college). In
this example, we can order the people in level of educational experience but the
size of the difference between categories is inconsistent (because the spacing
between categories one and two is bigger than categories two and three). If
these categories were equally spaced, then the variable would be an interval
variable.
Variable Selection (3)
• Interval
• An interval variable is similar to an ordinal variable,
except that the intervals between the values of the
interval variable are equally spaced. For example,
suppose you have a variable such as annual income
that is measured in dollars, and we have three people
who make $10,000, $15,000 and $20,000. The second
person makes $5,000 more than the first person and
$5,000 less than the third person, and the size of these
intervals is the same. If there were two other people
who make $90,000 and $95,000, the size of that
interval between these two people is also the same
($5,000).
Nominal Variables
• A generalization of the binary variable: it can take more than two states
• Simple matching: $d(i, j) = \frac{p - m}{p}$, where $m$ is the number of variables on which objects $i$ and $j$ match and $p$ is the total number of variables
Major Clustering Approaches (I)
• Partitioning approach:
– Construct various partitions and then evaluate them by some criterion, e.g.,
minimizing the sum of square errors
– Typical methods: k-means, k-medoids, CLARANS
• Hierarchical approach:
– Create a hierarchical decomposition of the set of data (or objects) using some
criterion
– Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
• Density-based approach:
– Based on connectivity and density functions
– Typical methods: DBSCAN, OPTICS, DenClue
Major Clustering Approaches (II)
• Grid-based approach:
– based on a multiple-level granularity structure
– Typical methods: STING, WaveCluster, CLIQUE
• Model-based:
– A model is hypothesized for each of the clusters; the aim is to find the best fit of the data to the given model
– Typical methods: EM, SOM, COBWEB
• Frequent pattern-based:
– Based on the analysis of frequent patterns
– Typical methods: pCluster
• User-guided or constraint-based:
– Clustering by considering user-specified or application-specific constraints
– Typical methods: COD (obstacles), constrained clustering
Typical Alternatives to Calculate the Distance between Clusters
• Single link: smallest distance between an element in one cluster and an element in the other, i.e., $\mathrm{dis}(K_i, K_j) = \min \mathrm{dist}(t_{ip}, t_{jq})$ over all pairs $t_{ip} \in K_i$, $t_{jq} \in K_j$
• Complete link: largest distance between an element in one cluster and an element in the other, i.e., $\mathrm{dis}(K_i, K_j) = \max \mathrm{dist}(t_{ip}, t_{jq})$
• Average: average distance between an element in one cluster and an element in the other, i.e., $\mathrm{dis}(K_i, K_j) = \mathrm{avg}\, \mathrm{dist}(t_{ip}, t_{jq})$
• Centroid: distance between the centroids of two clusters, i.e., $\mathrm{dis}(K_i, K_j) = \mathrm{dist}(C_i, C_j)$
• Medoid: distance between the medoids of two clusters, i.e., $\mathrm{dis}(K_i, K_j) = \mathrm{dist}(M_i, M_j)$
– Medoid: one chosen, centrally located object in the cluster
(A small code sketch of these alternatives follows below.)
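The following sketch (assuming NumPy; the two toy clusters and all names are illustrative) computes the single-link, complete-link, average, and centroid distances:

```python
import numpy as np

Ki = np.array([[0.0, 0.0], [1.0, 0.0]])   # cluster K_i
Kj = np.array([[4.0, 0.0], [6.0, 0.0]])   # cluster K_j

# All pairwise distances between elements of K_i and K_j.
pair = np.linalg.norm(Ki[:, None, :] - Kj[None, :, :], axis=-1)

print(pair.min())    # single link: 3.0 (closest pair)
print(pair.max())    # complete link: 6.0 (farthest pair)
print(pair.mean())   # average link: 4.5
print(np.linalg.norm(Ki.mean(axis=0) - Kj.mean(axis=0)))  # centroid: 4.5
```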
Centroid, Radius and Diameter of a Cluster (for numerical data sets)
• Centroid: the "middle" of a cluster: $C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}$
• Radius: square root of the average squared distance from any point of the cluster to its centroid: $R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - C_m)^2}{N}}$
• Diameter: square root of the average squared distance between all pairs of points in the cluster: $D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N-1)}}$
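A small sketch of these three quantities (assuming NumPy; the example cluster is made up):

```python
import numpy as np

cluster = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]])
N = len(cluster)

Cm = cluster.mean(axis=0)                          # centroid
Rm = np.sqrt(((cluster - Cm) ** 2).sum() / N)      # radius

# Diameter: average squared distance over all ordered pairs of points.
sq = ((cluster[:, None, :] - cluster[None, :, :]) ** 2).sum(axis=-1)
Dm = np.sqrt(sq.sum() / (N * (N - 1)))

print(Cm, Rm, Dm)
```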
Clustering is subjective
"The quality or state of being similar; likeness; resemblance; as, a similarity of features." — Webster's Dictionary
Similarity is hard to define, but "we know it when we see it."
[Figure: two types of clustering — hierarchical and partitional]
A Useful Tool for Summarizing Similarity Measurements
In order to better appreciate and evaluate the examples given in the early part of this talk,
we will now introduce the dendrogram.
[Figure: five pictured items with their pairwise distance matrix; e.g., D(·,·) = 8 for the most dissimilar pair shown and D(·,·) = 1 for the most similar]

0  8  8  7  7
   0  2  4  4
      0  3  3
         0  1
            0
Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.
At each level: consider all possible merges and choose the best one.
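A sketch of this bottom-up procedure using SciPy's hierarchical-clustering routines (assuming SciPy is available; the distances are the ones from the matrix above, in condensed upper-triangular form):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Condensed form of the 5x5 distance matrix above:
# d(0,1)=8, d(0,2)=8, d(0,3)=7, d(0,4)=7, d(1,2)=2,
# d(1,3)=4, d(1,4)=4, d(2,3)=3, d(2,4)=3, d(3,4)=1
d = np.array([8, 8, 7, 7, 2, 4, 4, 3, 3, 1], dtype=float)

Z = linkage(d, method='single')  # repeatedly merge the closest pair
print(Z)           # each row: (cluster a, cluster b, merge distance, size)
# dendrogram(Z)    # uncomment (with matplotlib) to draw the tree
```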
Hierarchical clustering
• Agglomerative (Bottom up)
• 5th iteration
[Figure: points 1–5 after five merge iterations]
Hierarchical clustering
• Agglomerative (Bottom up)
• Finally
[Figure: complete dendrogram over points 1–5, with merge nodes 6–9; the cut shown leaves 3 clusters]
Hierarchical clustering
• Divisive (Top-down)
– Start at the top with all patterns in one cluster
– The cluster is split using a flat clustering algorithm
– This procedure is applied recursively until each
pattern is in its own singleton cluster
How to find similarity in hierarchical clustering?
• Single linkage: [Figure: closest pair of points between clusters C1 and C2]
• Complete linkage: [Figure: farthest pair of points between clusters C1 and C2]
• Average group linkage: [Figure: distance between the centroids (marked +) of C1 and C2]
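Assuming SciPy, these criteria correspond to the `method` argument of `scipy.cluster.hierarchy.linkage`; a small sketch (synthetic data is made up) that also cuts the tree into flat clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (10, 2)),    # cluster C1
               rng.normal(5, 0.5, (10, 2))])   # cluster C2

for method in ('single', 'complete', 'average'):
    Z = linkage(X, method=method)
    labels = fcluster(Z, t=2, criterion='maxclust')  # cut into 2 clusters
    print(method, labels)
```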
Summary of Hierarchical Clustering Methods
Objective Function
• Minimize the sum of squared distances from each point to its assigned cluster center: $\sum_{k=1}^{K} \sum_{x_i \in C_k} \| x_i - \mu_k \|^2$
Algorithm k-means
1. Decide on a value for k.
2. Initialize the k cluster centers (randomly, if necessary).
3. Decide the class memberships of the N objects by
assigning them to the nearest cluster center.
4. Re-estimate the k cluster centers, by assuming the
memberships found above are correct.
5. If none of the N objects changed membership in the last
iteration, exit. Otherwise goto 3.
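A minimal NumPy sketch of these five steps (function and variable names are my own; the empty-cluster edge case is ignored for brevity):

```python
import numpy as np

def kmeans(X, k, seed=0):
    """Plain k-means following steps 1-5 above."""
    rng = np.random.default_rng(seed)
    # Steps 1-2: value of k is given; initialize centers to random points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    while True:
        # Step 3: assign each of the N objects to its nearest center.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        new_labels = dist.argmin(axis=1)
        # Step 5: exit when no object changed membership.
        if labels is not None and np.array_equal(new_labels, labels):
            return centers, labels
        labels = new_labels
        # Step 4: re-estimate the k centers from the current memberships.
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
```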
K-means Clustering: Steps 1–5
Algorithm: k-means; Distance Metric: Euclidean Distance
[Figures: five snapshots on a 5 × 5 plot with centers k1, k2, k3 — initial centers; first assignment of points to the nearest center; re-estimated centers; reassignment; convergence]
Comments on the K-Means Method
• Strength
– Relatively efficient: O(tkn), where n is # objects, k is # clusters,
and t is # iterations. Normally, k, t << n.
– Often terminates at a local optimum
• Weakness
– Applicable only when mean is defined, then what about
categorical data?
– Need to specify k, the number of clusters, in advance
– Unable to handle noisy data and outliers
– Expensive (with random seeds)
– How to choose the initial seeds?
K-means
• Disadvantages
– Dependent on initialization (e.g., randomly selected seeds)
– Ideas? How to choose seeds?
• Run the algorithm many times with different random seeds (see the sketch below)
• Pick seed points having minimum similarity with each other
• Pick points with the maximum number of attributes defined
– Sensitive to outliers
Deciding K
• Try a couple of values of k and compare the objective function:
– k = 1: objective function = 873.0
– k = 2: objective function = 173.1
– k = 3: objective function = 133.6
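These numbers illustrate the usual elbow heuristic: the objective always decreases as k grows, so one looks for the bend where the improvement flattens. A sketch reusing the `kmeans()` function from earlier (the synthetic data is made up):

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 1.0, (20, 2)) for m in (0, 6, 12)])  # 3 blobs

# Compute the objective for several k and look for the elbow.
for k in range(1, 6):
    centers, labels = kmeans(X, k)
    sse = ((X - centers[labels]) ** 2).sum()
    print(k, round(sse, 1))
```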