
Clustering

Dr. Rehan Ashraf


What is Clustering
• Cluster analysis or clustering is the task of grouping
a set of objects in such a way that objects in the
same group (called a cluster) are more similar (in
some sense or another) to each other than to those
in other groups (clusters). It is a main task of
exploratory data mining, and a common technique
for statistical data analysis, used in many fields,
including machine learning, pattern
recognition, image analysis, information
retrieval, bioinformatics, data compression,
and computer graphics.
• Many different clustering algorithms are used, differing in how they define a cluster and measure similarity.
Data-Points are a little bit like Coconuts

(Illustration: each coconut in the picture is one data point; another coconut is another data point.)
Each Data Point Has Features

1. Color
2. Size
3. Amount of hair
4. Cut on top
Data Sets are like piles of coconuts
But actually they are matrices

Index Color Size Hair Cut on Top


0 Green 52.6 False True
1 Brown 18.5 True False
2 Grey 45.4 True False
3 Green 13.6 False True
4 Green 22.6 False False
5 Brown 16.5 True False
Data points are unlabeled if we don’t know
their “type”

1. Color
2. Size
3. Amount of hair
4. Cut on top

This is a rotten coconut. But that piece of information is missing from our dataset.
Clustering: Grouping together similar data-points
Clustering

(Points whose features are similar are grouped together.)
Need a way to measure similarity

To compare two data points we compare their features:
1. Color
2. Size
3. Amount of hair
4. Cut on top
Clustering 2D Data
Index Feature 1 Feature 2
0 50.3 8.5
1 12.1 86.4
2 8.6 76.6
3 69.8 6.5
4 99.1 10.2
5 10.1 87.6
6 8.4 64.5
7 6.5 75.4
8 83.5 8.4
9 92.1 2.1
10 6.6 75.5
11 6.7 92.4
(Scatter plot of the twelve points above, with Feature 1 on the x-axis and Feature 2 on the y-axis.)
Similarity is Euclidean

(Same scatter plot: points that lie close together are similar; points that lie far apart are not similar.)
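As a concrete illustration (not part of the original slides), here is a minimal Python sketch that computes Euclidean distances between rows of the two-feature table above; the point indices used in the comparison are illustrative:

```python
import math

# The twelve (Feature 1, Feature 2) points from the table above.
points = [
    (50.3, 8.5), (12.1, 86.4), (8.6, 76.6), (69.8, 6.5),
    (99.1, 10.2), (10.1, 87.6), (8.4, 64.5), (6.5, 75.4),
    (83.5, 8.4), (92.1, 2.1), (6.6, 75.5), (6.7, 92.4),
]

def euclidean(p, q):
    """Euclidean distance between two 2-D points."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

print(euclidean(points[1], points[5]))  # ~2.3  -> close together, similar
print(euclidean(points[0], points[1]))  # ~86.8 -> far apart, not similar
```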
Clustering 2D Data

(Scatter plots: the points form clusters, first shown with two features, then with three features.)
What is Cluster Analysis?
• Cluster: a collection of data objects
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
• Cluster analysis
– Finding similarities between data according to the
characteristics found in the data and grouping similar data
objects into clusters
• Unsupervised learning: no predefined classes
• Typical applications
– As a stand-alone tool to get insight into data distribution
– As a preprocessing step for other algorithms
Clustering: Rich Applications and
Multidisciplinary Efforts
• Pattern Recognition
• Spatial Data Analysis
– Create thematic maps in GIS by clustering feature spaces
– Detect spatial clusters or for other spatial mining tasks
• Image Processing
• Economic Science (especially market research)
• WWW
– Document classification
– Cluster Weblog data to discover groups of similar access
patterns
Examples of Clustering Applications
• Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop
targeted marketing programs
• Land use: Identification of areas of similar land use in an earth
observation database
• Insurance: Identifying groups of motor insurance policy holders
with a high average claim cost
• City-planning: Identifying groups of houses according to their
house type, value, and geographical location
• Earth-quake studies: Observed earthquake epicenters should be
clustered along continent faults
Quality: What Is Good Clustering?

• A good clustering method will produce high quality clusters with
– high intra-class similarity
– low inter-class similarity
• The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation
• The quality of a clustering method is also measured by its
ability to discover some or all of the hidden patterns
Measure the Quality of Clustering

• Dissimilarity/Similarity metric: Similarity is expressed in terms of
a distance function, typically a metric d(i, j)
• There is a separate “quality” function that measures the
“goodness” of a cluster.
• The definitions of distance functions are usually very
different for interval-scaled, boolean, categorical, ordinal,
ratio, and vector variables.
• Weights should be associated with different variables
based on applications and data semantics.
• It is hard to define “similar enough” or “good enough”
– the answer is typically highly subjective.
Requirements of Clustering
• Scalability
• Ability to deal with different types of attributes
• Ability to handle dynamic data
• Discovery of clusters with arbitrary shape
• Minimal requirements for domain knowledge to determine input
parameters
• Able to deal with noise and outliers
• Insensitive to order of input records
• High dimensionality
• Incorporation of user-specified constraints
• Interpretability and usability
Type of data in clustering analysis

• Interval-scaled variables
• Binary variables
• Nominal, ordinal, and ratio variables
• Variables of mixed types
Interval-valued variables

• Standardize data
– Calculate the mean absolute deviation:

  $s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$

  where $m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$

– Calculate the standardized measurement (z-score):

  $z_{if} = \frac{x_{if} - m_f}{s_f}$

• Using the mean absolute deviation is more robust than using the
standard deviation
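A minimal sketch of this standardization (illustrative; it assumes the mean absolute deviation is non-zero):

```python
def standardize(values):
    """Standardize one variable f as above: z_if = (x_if - m_f) / s_f,
    where s_f is the mean absolute deviation."""
    n = len(values)
    m_f = sum(values) / n                           # mean of the variable
    s_f = sum(abs(x - m_f) for x in values) / n     # mean absolute deviation
    return [(x - m_f) / s_f for x in values]

# Example: the Size column of the coconut table.
print(standardize([52.6, 18.5, 45.4, 13.6, 22.6, 16.5]))
```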
Similarity and Dissimilarity Between
Objects

• Distances are normally used to measure the similarity or
dissimilarity between two data objects
• Some popular ones include the Minkowski distance:

  $d(i, j) = \sqrt[q]{|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q}$

  where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two
  p-dimensional data objects, and q is a positive integer
• If q = 1, d is the Manhattan distance:

  $d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$
Similarity and Dissimilarity Between
Objects (Cont.)
• If q = 2, d is the Euclidean distance:

  $d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$

– Properties
  • $d(i, j) \geq 0$
  • $d(i, i) = 0$
  • $d(i, j) = d(j, i)$
  • $d(i, j) \leq d(i, k) + d(k, j)$
• Also, one can use weighted distance, parametric
Pearson product-moment correlation, or other
dissimilarity measures
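The three distances can be written as one function; a short sketch (function and variable names are illustrative):

```python
def minkowski(i, j, q):
    """Minkowski distance between two p-dimensional objects i and j."""
    return sum(abs(a - b) ** q for a, b in zip(i, j)) ** (1.0 / q)

def manhattan(i, j):
    return minkowski(i, j, 1)    # q = 1

def euclidean(i, j):
    return minkowski(i, j, 2)    # q = 2

i, j = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
print(manhattan(i, j))   # 7.0
print(euclidean(i, j))   # 5.0
```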
Variable Selection(1)
• Categorical
• A categorical variable (sometimes called a nominal
variable) is one that has two or more categories, but there
is no intrinsic ordering to the categories. For example,
gender is a categorical variable having two categories (male
and female) and there is no intrinsic ordering to the
categories. Hair color is also a categorical variable having a
number of categories (blonde, brown, brunette, red, etc.)
and again, there is no agreed way to order these from
highest to lowest. A purely categorical variable is one that
simply allows you to assign categories but you cannot
clearly order the variables. If the variable has a clear
ordering, then that variable would be an ordinal variable,
as described below.
Variable Selection(2)
• Ordinal
• An ordinal variable is similar to a categorical variable. The difference between
the two is that there is a clear ordering of the variables. For example, suppose
you have a variable, economic status, with three categories (low, medium and
high). In addition to being able to classify people into these three categories, you
can order the categories as low, medium and high. Now consider a variable like
educational experience (with values such as elementary school graduate, high
school graduate, some college and college graduate). These also can be ordered
as elementary school, high school, some college, and college graduate. Even
though we can order these from lowest to highest, the spacing between the
values may not be the same across the levels of the variables. Say we assign
scores 1, 2, 3 and 4 to these four levels of educational experience and we
compare the difference in education between categories one and two with the
difference in educational experience between categories two and three, or the
difference between categories three and four. The difference between categories
one and two (elementary and high school) is probably much bigger than the
difference between categories two and three (high school and some college). In
this example, we can order the people in level of educational experience but the
size of the difference between categories is inconsistent (because the spacing
between categories one and two is bigger than categories two and three). If
these categories were equally spaced, then the variable would be an interval
variable.
Variable Selection(3)
• Interval
• An interval variable is similar to an ordinal variable,
except that the intervals between the values of the
interval variable are equally spaced. For example,
suppose you have a variable such as annual income
that is measured in dollars, and we have three people
who make $10,000, $15,000 and $20,000. The second
person makes $5,000 more than the first person and
$5,000 less than the third person, and the size of these
intervals is the same. If there were two other people
who make $90,000 and $95,000, the size of that
interval between these two people is also the same
($5,000).
Nominal Variables

• A generalization of the binary variable in that it can take
more than 2 states, e.g., red, yellow, blue, green
• Method 1: Simple matching
– m: # of matches, p: total # of variables

  $d(i, j) = \frac{p - m}{p}$

• Method 2: use a large number of binary variables
– creating a new binary variable for each of the M nominal states
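A small sketch of Method 1 (simple matching); the example objects below are made up:

```python
def simple_matching(i, j):
    """Nominal dissimilarity d(i, j) = (p - m) / p,
    where m = number of matching variables and p = total variables."""
    p = len(i)
    m = sum(1 for a, b in zip(i, j) if a == b)
    return (p - m) / p

# Two objects described by three nominal variables.
print(simple_matching(("red", "round", "small"),
                      ("red", "square", "small")))   # 1/3
```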
Ordinal Variables

• An ordinal variable can be discrete or continuous
• Order is important, e.g., rank
• Can be treated like interval-scaled
– replace $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$
– map the range of each variable onto [0, 1] by replacing the i-th
  object in the f-th variable by

  $z_{if} = \frac{r_{if} - 1}{M_f - 1}$

– compute the dissimilarity using methods for interval-scaled
  variables
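A minimal sketch of the rank-to-[0, 1] mapping; the ordinal scale used here is hypothetical:

```python
def ordinal_to_interval(state, ordered_states):
    """Replace an ordinal value by z_if = (r_if - 1) / (M_f - 1),
    where r_if is its rank and M_f the number of states."""
    r_if = ordered_states.index(state) + 1     # rank, 1..M_f
    M_f = len(ordered_states)
    return (r_if - 1) / (M_f - 1)

levels = ["low", "medium", "high"]             # hypothetical ordinal variable
print([ordinal_to_interval(s, levels) for s in levels])
# [0.0, 0.5, 1.0] -- now usable with interval-scaled distance measures
```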
Ratio-Scaled Variables

• Ratio-scaled variable: a positive measurement on a nonlinear scale,
approximately at exponential scale, such as $Ae^{Bt}$ or $Ae^{-Bt}$
• Methods:
– treat them like interval-scaled variables (not a good choice: the
  scale can be distorted)
– apply a logarithmic transformation: $y_{if} = \log(x_{if})$
– treat them as continuous ordinal data and treat their rank as
  interval-scaled
Variables of Mixed Types

• A database may contain all six types of variables
– symmetric binary, asymmetric binary, nominal, ordinal, interval
  and ratio
• One may use a weighted formula to combine their effects:

  $d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$

– f is binary or nominal:
  $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
– f is interval-based: use the normalized distance
– f is ordinal or ratio-scaled:
  • compute ranks $r_{if}$
  • treat $z_{if} = \frac{r_{if} - 1}{M_f - 1}$ as interval-scaled
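A hedged sketch of the weighted formula for two variable types only (nominal and already-normalized interval); the example objects and type labels are illustrative:

```python
def mixed_dissimilarity(i, j, types):
    """Weighted combination of per-variable dissimilarities, as above.
    `types` holds 'nominal' or 'interval' for each variable; interval
    values are assumed to be normalized to [0, 1] already."""
    num = den = 0.0
    for a, b, t in zip(i, j, types):
        if a is None or b is None:          # missing value: delta_ij^(f) = 0
            continue
        if t == "nominal":
            d_f = 0.0 if a == b else 1.0
        else:                               # interval: normalized distance
            d_f = abs(a - b)
        num += d_f
        den += 1.0                          # delta_ij^(f) = 1
    return num / den if den else 0.0

x = ("green", 0.9, "no")                    # (color, normalized size, hair)
y = ("brown", 0.2, "yes")
print(mixed_dissimilarity(x, y, ["nominal", "interval", "nominal"]))  # 0.9
```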
Major Clustering Approaches (I)

• Partitioning approach:
– Construct various partitions and then evaluate them by some criterion, e.g.,
minimizing the sum of square errors
– Typical methods: k-means, k-medoids, CLARANS
• Hierarchical approach:
– Create a hierarchical decomposition of the set of data (or objects) using some
criterion
– Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
• Density-based approach:
– Based on connectivity and density functions
– Typical methods: DBSCAN, OPTICS, DenClue
Major Clustering Approaches (II)
• Grid-based approach:
– based on a multiple-level granularity structure
– Typical methods: STING, WaveCluster, CLIQUE
• Model-based:
– A model is hypothesized for each of the clusters, and the method tries to find
the best fit of the data to the given model
– Typical methods: EM, SOM, COBWEB
• Frequent pattern-based:
– Based on the analysis of frequent patterns
– Typical methods: pCluster
• User-guided or constraint-based:
– Clustering by considering user-specified or application-specific constraints
– Typical methods: COD (obstacles), constrained clustering
Typical Alternatives to Calculate the Distance
between Clusters
• Single link: smallest distance between an element in one cluster
and an element in the other, i.e., dis(Ki, Kj) = min dist(tip, tjq)
• Complete link: largest distance between an element in one cluster
and an element in the other, i.e., dis(Ki, Kj) = max dist(tip, tjq)
• Average: average distance between an element in one cluster and an
element in the other, i.e., dis(Ki, Kj) = avg dist(tip, tjq)
• Centroid: distance between the centroids of two clusters, i.e.,
dis(Ki, Kj) = dis(Ci, Cj)
• Medoid: distance between the medoids of two clusters, i.e., dis(Ki,
Kj) = dis(Mi, Mj)
– Medoid: one chosen, centrally located object in the cluster
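The first three alternatives can be computed directly from the member points; a minimal sketch using Euclidean distance between elements (the clusters and coordinates are illustrative):

```python
import math

def single_link(Ki, Kj):
    """Smallest distance between an element of Ki and an element of Kj."""
    return min(math.dist(p, q) for p in Ki for q in Kj)

def complete_link(Ki, Kj):
    """Largest distance between an element of Ki and an element of Kj."""
    return max(math.dist(p, q) for p in Ki for q in Kj)

def average_link(Ki, Kj):
    """Average distance over all cross-cluster pairs."""
    return sum(math.dist(p, q) for p in Ki for q in Kj) / (len(Ki) * len(Kj))

A = [(0.0, 0.0), (1.0, 0.0)]
B = [(4.0, 0.0), (6.0, 0.0)]
print(single_link(A, B), complete_link(A, B), average_link(A, B))  # 3.0 6.0 4.5
```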
Centroid, Radius and Diameter of a Cluster
(for numerical data sets)
• Centroid: the “middle” of a cluster

  $C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}$

• Radius: square root of the average distance from any point of the
cluster to its centroid

  $R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N}}$

• Diameter: square root of the average mean squared distance between
all pairs of points in the cluster

  $D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N - 1)}}$
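A minimal sketch of these three quantities for a small, made-up cluster of 2-D points:

```python
import math

def centroid(points):
    """Mean of the points, computed feature by feature."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def radius(points):
    """Square root of the average squared distance to the centroid."""
    c = centroid(points)
    return math.sqrt(sum(math.dist(p, c) ** 2 for p in points) / len(points))

def diameter(points):
    """Square root of the average squared distance over all ordered pairs."""
    n = len(points)
    total = sum(math.dist(points[a], points[b]) ** 2
                for a in range(n) for b in range(n) if a != b)
    return math.sqrt(total / (n * (n - 1)))

cluster = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
print(centroid(cluster), radius(cluster), diameter(cluster))
```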
Partitioning Algorithms: Basic Concept
• Partitioning method: Construct a partition of a database D of n
objects into a set of k clusters such that the sum of squared
distances is minimized:

  $E = \sum_{m=1}^{k} \sum_{t_{mi} \in K_m} (C_m - t_{mi})^2$

• Given a k, find a partition of k clusters that optimizes the chosen
partitioning criterion
– Global optimal: exhaustively enumerate all partitions
– Heuristic methods: k-means and k-medoids algorithms
– k-means (MacQueen’67): Each cluster is represented by the center of the
cluster
– k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the objects in the
cluster
The K-Means Clustering Method
• Given k, the k-means algorithm is implemented in
four steps:
– Partition objects into k nonempty subsets
– Compute seed points as the centroids of the clusters of the
current partition (the centroid is the center, i.e., mean
point, of the cluster)
– Assign each object to the cluster with the nearest seed
point
– Go back to Step 2; stop when no new assignments are made
Question:
What is a natural grouping among
these objects?
What is a natural grouping among these objects?

Clustering is subjective

(The same set of characters can be grouped in different ways: as Simpson's Family, School Employees, Females, or Males.)

What is Similarity?

“The quality or state of being similar; likeness; resemblance; as, a similarity of features.”
(Webster's Dictionary)

Similarity is hard to define, but “we know it when we see it.”

The real meaning of similarity is a philosophical question.
Measure Distance or Similarity?
Two Types of Clustering
• Partitional algorithms: Construct various partitions and then evaluate them by some
criterion (K-Means, K-Means++)
• Hierarchical algorithms: Create a hierarchical decomposition of the set of objects using
some criterion (Agglomerative, single link)

(Illustrations: a hierarchical dendrogram and a partitional grouping.)
A Useful Tool for Summarizing Similarity Measurements
In order to better appreciate and evaluate the examples given in the early part of this talk,
we will now introduce the dendrogram.

(Dendrogram anatomy: root, internal nodes, internal branches, terminal branches, and leaves.)

The similarity between two objects in a dendrogram is represented as the height of the lowest internal node they share.
(How-to) Hierarchical Clustering

The number of dendrograms with n leafs = $(2n - 3)! \, / \, [2^{\,n-2} (n - 2)!]$
(a quick check of this count is sketched after this slide)

Number of Leafs    Number of Possible Dendrograms
2                  1
3                  3
4                  15
5                  105
...                ...
10                 34,459,425

Since we cannot test all possible trees, we will have to use a heuristic
search over possible trees. We could do this in two ways:

Bottom-Up (agglomerative): Starting with each item in its own cluster,
find the best pair to merge into a new cluster. Repeat until all
clusters are fused together.

Top-Down (divisive): Starting with all the data in a single cluster,
consider every possible way to divide the cluster into two. Choose the
best division and recursively operate on both sides.
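The dendrogram count in the table can be reproduced with a tiny helper (a sketch, valid for n ≥ 2):

```python
from math import factorial

def num_dendrograms(n):
    """Number of possible dendrograms on n leaves:
    (2n - 3)! / (2^(n-2) * (n - 2)!)."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

for n in (2, 3, 4, 5, 10):
    print(n, num_dendrograms(n))   # 1, 3, 15, 105, 34459425
```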
We begin with a distance matrix which contains the distances between every
pair of objects in our database. For five objects it looks like this
(upper triangle shown):

  0  8  8  7  7
     0  2  4  4
        0  3  3
           0  1
              0

The callouts on the slide highlight two example entries, D( , ) = 8 and
D( , ) = 1.
Bottom-Up (agglomerative): Starting with each item in its own cluster,
find the best pair to merge into a new cluster. Repeat until all
clusters are fused together.

(At each level: consider all possible merges, then choose the best one;
the animation repeats this step until the full dendrogram is built.)
Hierarchical clustering
• Agglomerative (Bottom up)

(Animation: each iteration merges the two closest clusters, until finally
only k clusters are left.)
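A minimal single-link agglomerative sketch (illustrative, not the exact procedure behind the slides): it starts from a precomputed distance matrix and repeatedly merges the closest pair of clusters until only k remain. The matrix below is the symmetric completion of the example matrix shown earlier.

```python
def agglomerative(D, k):
    """Bottom-up clustering over a symmetric distance matrix D,
    merging the closest pair of clusters (single link) until k remain."""
    clusters = [[i] for i in range(len(D))]          # each item starts alone
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(D[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)                 # best merge so far
        _, a, b = best
        clusters[a].extend(clusters[b])              # merge the best pair
        del clusters[b]
    return clusters

D = [
    [0, 8, 8, 7, 7],
    [8, 0, 2, 4, 4],
    [8, 2, 0, 3, 3],
    [7, 4, 3, 0, 1],
    [7, 4, 3, 1, 0],
]
print(agglomerative(D, 2))   # [[0], [1, 2, 3, 4]]
```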
Hierarchical clustering
• Divisive (Top-down)
– Start at the top with all patterns in one cluster
– The cluster is split using a flat clustering algorithm
– This procedure is applied recursively until each
pattern is in its own singleton cluster
How to find similarity in Hierarchical clustering?

(Given three clusters C1, C2 and C3: which pair of clusters should be merged?)
Clustering
Single Linkage

In this method the distance between two clusters is determined by the
distance of the two closest objects (nearest neighbors) in the different
clusters.

Clustering
Complete Linkage

In this method, the distances between clusters are determined by the
greatest distance between any two objects in the different clusters
(i.e., by the "furthest neighbors").

Clustering
Average Group Linkage

The distance between two clusters is the distance between the two cluster
means.
Summary of Hierarchical Clustering Methods

• No need to specify the number of clusters in advance.
• The hierarchical nature maps nicely onto human intuition for some
domains.
• They do not scale well: time complexity of at least O(n^2), where n is
the number of total objects.
• Like any heuristic search algorithm, local optima are a problem.
• Interpretation of results is (very) subjective.
Partitional Clustering
• Nonhierarchical: each instance is placed in
exactly one of K non-overlapping clusters.
• Since only one set of clusters is output, the user
normally has to input the desired number of
clusters K.
Partitional clustering: K-Means algorithm

Squared Error objective function: the sum, over all K clusters, of the
squared distances from each of the m data points in a cluster to that
cluster's center (m = number of data points in one cluster, K = total
number of clusters).

(Plot: example points on a 10 x 10 grid, with the error measured from
each point to the center of its cluster.)
Algorithm k-means
1. Decide on a value for k.
2. Initialize the k cluster centers (randomly, if necessary).
3. Decide the class memberships of the N objects by
assigning them to the nearest cluster center.
4. Re-estimate the k cluster centers, by assuming the
memberships found above are correct.
5. If none of the N objects changed membership in the last
iteration, exit. Otherwise goto 3.
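A minimal sketch of this loop (illustrative; it assumes numeric tuples and a fixed random seed, and is not production code):

```python
import math
import random

def k_means(points, k, max_iter=100, seed=0):
    """Plain k-means following the steps above: pick k centers, assign
    points to the nearest center, re-estimate centers, repeat."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)                  # step 2: initial seeds
    assignment = [None] * len(points)
    for _ in range(max_iter):
        # Step 3: assign each object to the nearest cluster center.
        new_assignment = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                          for p in points]
        if new_assignment == assignment:             # step 5: nothing changed
            break
        assignment = new_assignment
        # Step 4: re-estimate each center as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                              # keep old center if empty
                centers[c] = tuple(sum(m[d] for m in members) / len(members)
                                   for d in range(len(members[0])))
    return centers, assignment

pts = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
print(k_means(pts, k=2))
```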
K-means Clustering: Steps 1-5
Algorithm: k-means, Distance Metric: Euclidean Distance

(Animation over a 2-D scatter plot: Step 1 places the three centers k1, k2
and k3; Step 2 assigns each point to its nearest center; Steps 3 and 4
re-estimate the centers and re-assign the points; Step 5 shows the final,
stable clustering.)
Comments on the K-Means Method
• Strength
– Relatively efficient: O(tkn), where n is # objects, k is # clusters,
and t is # iterations. Normally, k, t << n.
– Often terminates at a local optimum
• Weakness
– Applicable only when mean is defined, then what about
categorical data?
– Need to specify k, the number of clusters, in advance
– Unable to handle noisy data and outliers
– Expensive (with random seeds)
– How to choose initial Seed?
K-means
• Disadvantages
– Dependent on initialization (random seeds are selected)
• Ideas? How to choose seeds?
– Run the algorithm many times with different random seeds
– Choose points having minimum similarity with each other
– Choose points with the maximum number of attributes defined
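One way to realize "points having minimum similarity with each other" is a farthest-first sweep (the idea behind k-means++-style seeding); a hedged sketch with made-up points:

```python
import math
import random

def spread_out_seeds(points, k, seed=0):
    """Pick k seeds that are far apart: start from one random point, then
    repeatedly add the point farthest from all seeds chosen so far."""
    rng = random.Random(seed)
    seeds = [rng.choice(points)]
    while len(seeds) < k:
        farthest = max(points,
                       key=lambda p: min(math.dist(p, s) for s in seeds))
        seeds.append(farthest)
    return seeds

pts = [(0, 0), (0, 1), (10, 10), (10, 11), (5, 5)]
print(spread_out_seeds(pts, k=3))   # three mutually distant points
```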
K-means
• Disadvantages
– Dependent on initialization
– Sensitive to outliers
Deciding K
• Try a couple of values of K and compare the objective function:
– When k = 1, the objective function is 873.0
– When k = 2, the objective function is 173.1
– When k = 3, the objective function is 133.6
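These numbers suggest the usual "elbow" heuristic: look at how much the objective drops as K increases and stop where the drop levels off. A tiny sketch using the values above:

```python
# Objective-function values reported above for k = 1, 2, 3.
objective = {1: 873.0, 2: 173.1, 3: 133.6}

# Drop in the objective when moving from k-1 clusters to k clusters.
drops = {k: objective[k - 1] - objective[k]
         for k in objective if k - 1 in objective}
print(drops)   # roughly {2: 699.9, 3: 39.5} -> the elbow is at k = 2
```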
