CLARANS

Clustering objects for spatial data
Clustering Large Applications Based upon RANdomized Search
Content
Dissimilarity
K-Medoids
PAM
CLARA
CLARANS
Spatial Data
What is Spatial Data?
Objects of types:
- points
- lines
- polygons
- other geographic and geometric data primitives
Used in/for:
- GIS - Geographic Information Systems
- GPS - Global Positioning System
- Environmental Studies
- etc.
K-Medoids Clustering Method
Compared to K-Means
Find representative objects, called medoids, in clusters.
Instead of using the mean of the members, it uses an actual data point as the cluster center.
The best-known K-Medoids algorithm is PAM - Partitioning Around Medoids
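The difference between a mean and a medoid can be illustrated with a minimal sketch. The one-dimensional data and the helper name `medoid` are assumptions made for this example only.

```python
def medoid(points):
    """Return the member of `points` with the smallest total distance to all members."""
    return min(points, key=lambda c: sum(abs(c - p) for p in points))

values = [1, 2, 3, 4, 100]            # illustrative data with one outlier
mean = sum(values) / len(values)      # pulled toward the outlier, not a data point
center = medoid(values)               # always one of the actual data points

print(mean, center)
```

Because the medoid must be a member of the data set, a single extreme value moves it far less than it moves the mean.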
K-Medoids - PAM
1. Initialize: select k of the n data points as the medoids.
2. Associate each data point with the closest medoid.
3. While the cost of the configuration decreases (that is, while the configuration changes):
a. For each medoid m, for each non-medoid data point o:
i. Swap m and o, recompute the cost (sum of distances of points to their medoid)
ii. If the total cost of the configuration increased in the previous step, undo the swap
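The steps above can be sketched in Python as follows. This is a simplified illustration, not a reference implementation: it assumes Euclidean distance, takes the first k points as the initial medoids, and recomputes the full cost for every candidate swap. The names `total_cost` and `pam` are chosen for this example.

```python
import math

def total_cost(points, medoids):
    """Sum of distances from every point to its closest medoid (medoids are indices)."""
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)

def pam(points, k):
    """Greedy swap search; returns (sorted medoid indices, cost)."""
    medoids = list(range(k))                 # step 1: naive initialization (assumption)
    cost = total_cost(points, medoids)       # step 2 is implicit in total_cost
    improved = True
    while improved:                          # step 3: repeat while the cost decreases
        improved = False
        for i in range(k):                   # each medoid position
            for o in range(len(points)):     # each non-medoid point
                if o in medoids:
                    continue
                candidate = medoids[:i] + [o] + medoids[i + 1:]
                new_cost = total_cost(points, candidate)
                if new_cost < cost:          # keep the swap only if it lowers the cost
                    medoids, cost = candidate, new_cost
                    improved = True
                # otherwise the swap is "undone" by never adopting `candidate`
    return sorted(medoids), cost

# Two small, well-separated groups of points (illustrative data)
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
medoids, cost = pam(points, k=2)
print(medoids, cost)
```

Each pass evaluates k(n-k) candidate swaps, and each evaluation touches all n points, which is why PAM becomes expensive on large data sets.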
Dissimilarity
● Distance between two samples under some criterion
● How different samples are
● Key concept for clustering
● The choice of dissimilarity measure directly influences the shape of the clusters
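As a small illustration of how the criterion matters, the sketch below compares two common dissimilarity measures. Euclidean and Manhattan distance are picked here as examples; the slides do not prescribe a particular measure.

```python
import math

def euclidean(a, b):
    """Straight-line distance."""
    return math.dist(a, b)

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

origin, p, q = (0, 0), (3, 3), (0, 5)

# Which of p and q is "closer" to the origin depends on the measure used.
closer_euclidean = min((p, q), key=lambda x: euclidean(origin, x))
closer_manhattan = min((p, q), key=lambda x: manhattan(origin, x))
print(closer_euclidean, closer_manhattan)
```

The same point can therefore be assigned to different medoids under different measures, which changes the resulting cluster boundaries.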
CLARA
● Draws a sample of the dataset and applies PAM to the sample in order to find the medoids
● The algorithm cannot find the best solution if one of the best k medoids is not in the selected sample
● Improvement
○ select multiple samples
○ choose the sample (k medoids) that has the lowest average dissimilarity over all objects in the entire dataset
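A minimal sketch of this sampling scheme is shown below. It assumes Euclidean distance and uses a deliberately small PAM. The parameter names `num_samples` and `sample_size` and their default values are illustrative and not taken from the slides.

```python
import math
import random

def avg_dissimilarity(points, medoids):
    """Average distance from each point to its closest medoid (medoids are points)."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points) / len(points)

def pam(points, k):
    """Small PAM: greedy medoid swaps until no swap lowers the cost."""
    medoids = points[:k]
    cost = avg_dissimilarity(points, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for o in points:
                if o in medoids:
                    continue
                candidate = medoids[:i] + [o] + medoids[i + 1:]
                c = avg_dissimilarity(points, candidate)
                if c < cost:
                    medoids, cost, improved = candidate, c, True
    return medoids

def clara(points, k, num_samples=5, sample_size=20, seed=0):
    """Run PAM on several random samples; keep the medoids that fit the whole data set best."""
    rng = random.Random(seed)
    best, best_cost = None, math.inf
    for _ in range(num_samples):
        sample = rng.sample(points, min(sample_size, len(points)))
        medoids = pam(sample, k)
        cost = avg_dissimilarity(points, medoids)   # evaluated on the entire data set
        if cost < best_cost:
            best, best_cost = medoids, cost
    return best, best_cost

# Two 5x5 grids of points far apart from each other (illustrative data)
grid = [(x, y) for x in range(5) for y in range(5)]
data = grid + [(x + 100, y + 100) for x, y in grid]
best, best_cost = clara(data, k=2)
print(best, best_cost)
```

PAM only ever sees the sampled points, so the returned medoids are always members of some sample. This is the limitation noted in the second bullet.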
CLARANS - Randomized CLARA
● Searching a graph where every node is a potential solution - a set of k medoids
● two nodes are neighbors if their sets differ by only one medoid
● each node has a cost - the total dissimilarity between every object and the medoid of its cluster
● the problem is ‘finding a minimum on the graph’
● in PAM, at each step all neighbors of the current node are searched and the one with the lowest cost is chosen
● examining all k(n-k) neighbors is time consuming, so CLARANS draws a random sample of neighbors to examine
● after finding a local optimum, CLARANS starts again from a new randomly selected node in search of a new local optimum
● input
○ the maximum number of neighbors to examine
○ the number of local optima to search for
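The graph view can be made concrete with a short sketch. Here a node is represented as a frozenset of k medoid indices out of n objects. This representation and the function names are assumptions chosen for illustration.

```python
import random

def neighbors(node, n):
    """All nodes that differ from `node` in exactly one medoid."""
    outside = [o for o in range(n) if o not in node]
    return [(node - {m}) | {o} for m in node for o in outside]

def random_neighbor(node, n, rng):
    """Draw one neighbor at random without enumerating all of them."""
    m = rng.choice(sorted(node))                           # medoid to drop
    o = rng.choice([x for x in range(n) if x not in node])  # non-medoid to add
    return (node - {m}) | {o}

n, k = 10, 3
node = frozenset(range(k))
all_neighbors = neighbors(node, n)
print(len(all_neighbors))              # equals k * (n - k)

sampled = random_neighbor(node, n, random.Random(0))
print(sorted(sampled))
```

Enumerating every neighbor grows with k(n-k), while drawing a single random neighbor does not require building that list at all.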
1. Input parameters num_local_min and max_neighbor. Initialize i to 1, and mincost to a large number.
2. Set current_node to an arbitrary node in G_{n,k}.
3. Set j to 1.
4. Consider a random neighbor S of current_node, and based on S, calculate the cost differential of the two nodes.
5. If S has a lower cost, set current_node to S, and go to Step 3.
6. Otherwise, increment j by 1. If j <= max_neighbor, go to Step 4.
7. Otherwise, when j > max_neighbor, compare the cost of current_node with mincost. If the former is less than mincost, set mincost to the cost of current_node and set bestnode to current_node.
8. Increment i by 1. If i > num_local_min, output bestnode and halt. Otherwise, go to Step 2.
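The eight steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: it uses Euclidean distance and recomputes the full cost of each neighbor instead of the incremental cost differential mentioned in Step 4. The default parameter values and the `seed` argument are illustrative additions.

```python
import math
import random

def cost(points, medoids):
    """Total distance from every point to its closest medoid (medoids are indices)."""
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)

def clarans(points, k, num_local_min=3, max_neighbor=200, seed=0):
    rng = random.Random(seed)
    n = len(points)
    best_node, min_cost = None, math.inf                  # step 1
    for _ in range(num_local_min):                        # steps 2 and 8
        current = rng.sample(range(n), k)                 # arbitrary starting node
        current_cost = cost(points, current)
        j = 1                                             # step 3
        while j <= max_neighbor:
            # step 4: random neighbor, replacing one medoid with one non-medoid
            neighbor = list(current)
            neighbor[rng.randrange(k)] = rng.choice(
                [o for o in range(n) if o not in current])
            neighbor_cost = cost(points, neighbor)
            if neighbor_cost < current_cost:              # step 5: move and reset j
                current, current_cost = neighbor, neighbor_cost
                j = 1
            else:                                         # step 6: count the failure
                j += 1
        if current_cost < min_cost:                       # step 7: record local optimum
            best_node, min_cost = current, current_cost
    return sorted(best_node), min_cost

# Two small, well-separated groups of points (illustrative data)
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
best_node, min_cost = clarans(points, k=2)
print(best_node, min_cost)
```

A larger max_neighbor makes each local search behave more like PAM. A smaller value is faster but more likely to stop at a node that still has a better neighbor.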
CLARANS
Effective spatial data mining algorithms
● SD(CLARANS) - spatial dominant approach
○ The spatial components of the objects in the dataset are collected and clustered using CLARANS
○ Non-spatial descriptions of the objects are brought into the resulting clusters
○ Each cluster (defined by spatial boundaries) is described by its relative abundance of non-spatial attributes
● NSD(CLARANS) - non-spatial dominant approach
○ Non-spatial attribute generalization produces k generalized attribute groups
○ The spatial components are clustered to find k clusters
○ If the clusters overlap, they may be merged (and their attribute descriptions merged as well)
○ Each cluster is described by a single attribute description
SD(CLARANS)
NSD(CLARANS)
CLARANS vs. PAM (K-Medoids) - The quality of the results is comparable, but CLARANS is much more efficient than PAM
CLARANS
Strengths:
- Deals with larger data sets than PAM
- Efficient and effective - outperforms PAM and CLARA
- Returns higher quality clusters than PAM and CLARA
- Handles outliers
- Can efficiently handle not only point objects but also polygon objects
Limitations:
- Efficiency depends on the sample size
- A good clustering on samples will not necessarily represent a good clustering of the whole data set if the sample is biased
- It assumes that the data to be clustered can fit in main memory simultaneously
Importance & Practical Applications
● Geographic Information Systems - thematic maps
● Discovering hotspots: unusual locations
● Helpful for outlier detection
Summary - CLARANS
● PAM checks every neighbor
● CLARA examines fewer neighbors - it searches in subgraphs built from samples
● CLARANS searches the whole graph but draws samples of neighbors dynamically
● CLARANS has the benefit of not confining the search to a restricted area
Summary - CLARANS
● Efficient medoid-based clustering algorithm
● Well suited to spatial data mining
● Searches a graph whose nodes are candidate sets of k medoids
Questions?
