
CLARANS

clustering objects for spatial data

Clustering Large Applications Based upon RANdomized Search
Content

Dissimilarity
K-Medoids
PAM
CLARA
CLARANS
Spatial Data
What is Spatial Data?

Objects of types:
- points
- lines
- polygons
- other geographic and geometric data primitives

Used in/for:
- GIS - Geographic Information Systems
- GPS - Global Positioning System
- Environmental Studies
- etc.
K-Medoids Clustering Method
Compared to K-Means

Find representative objects, called medoids, in clusters.

Instead of using the mean of the members, it uses an actual data point as the cluster center.

The best-known realization is PAM - Partitioning Around Medoids
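
To make the contrast with K-Means concrete, here is a minimal sketch of both kinds of cluster center. The helper names (`mean_point`, `medoid`) and the toy cluster are illustrative assumptions, not part of the original slides; Euclidean distance is assumed.

```python
import math

def mean_point(points):
    """K-Means style center: the coordinate-wise mean (usually not a data point)."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def medoid(points):
    """K-Medoids style center: the member with the smallest total distance to all members."""
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))

# Toy cluster with one far-away member (hypothetical data).
cluster = [(0.0, 0.0), (2.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
center_mean = mean_point(cluster)    # pulled toward the far point
center_medoid = medoid(cluster)      # always one of the actual members
```

The mean is dragged toward the distant member, while the medoid stays on a real object inside the dense part of the cluster.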
K-Medoids - PAM

1. Initialize: select k of the n data points as the medoids.
2. Associate each data point with the closest medoid.
3. While the cost of the configuration decreases (i.e. while the configuration still changes):
   a. For each medoid m and each non-medoid data point o:
      i. Swap m and o and recompute the cost (the sum of distances from each point to its medoid).
      ii. If the total cost of the configuration increased in the previous step, undo the swap.
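
The steps above can be sketched as follows. This is a simplified illustration, not a reference implementation: it assumes Euclidean distance, distinct points, and (as an arbitrary choice) the first k points as the initial medoids. The names `total_cost` and `pam` are hypothetical.

```python
import math

def total_cost(points, medoids):
    """Sum of distances from each point to its closest medoid (steps 2 and 3.a.i)."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def pam(points, k):
    medoids = list(points[:k])            # step 1 (assumption: take the first k points)
    best = total_cost(points, medoids)
    improved = True
    while improved:                       # step 3: loop while the cost still decreases
        improved = False
        for i in range(k):                # each medoid m
            for o in points:              # each non-medoid point o
                if o in medoids:
                    continue
                candidate = medoids[:i] + [o] + medoids[i + 1:]   # swap m and o
                cost = total_cost(points, candidate)
                if cost < best:           # keep the swap only if the cost dropped,
                    medoids, best = candidate, cost               # otherwise "undo" it
                    improved = True
    return medoids, best

# Two well-separated toy groups (hypothetical data).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
medoids, cost = pam(points, k=2)
```

Each pass evaluates all k(n-k) possible swaps and every evaluation touches all n points, which is why PAM becomes expensive on large data sets.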
Dissimilarity

● Distance between two samples under some criterion


● How different samples are
● Key concept for clustering
● It directly influences the shape of the cluster
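
As a small illustration of how the chosen dissimilarity changes the clustering, the sketch below assigns the same point to different medoids under two common measures. The function names and the coordinates are assumptions made for this example.

```python
import math

def euclidean(a, b):
    return math.dist(a, b)

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def closest_medoid(p, medoids, dissimilarity):
    """Assign p to the medoid with the smallest dissimilarity."""
    return min(medoids, key=lambda m: dissimilarity(p, m))

p = (0, 0)
medoids = [(3, 3), (5, 0)]
by_euclidean = closest_medoid(p, medoids, euclidean)   # sqrt(18) ~ 4.24 vs 5
by_manhattan = closest_medoid(p, medoids, manhattan)   # 6 vs 5
```

The same object ends up in a different cluster depending on the measure, so cluster boundaries (and therefore cluster shapes) depend on it as well.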
CLARA

● Draws a sample of the dataset and applies PAM to the sample in order to find the medoids
● The algorithm cannot find the best solution if one of the best k medoids is not in the selected sample
● Improvement
○ draw multiple samples
○ choose the set of k medoids that has the lowest average dissimilarity over all objects in the entire dataset
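
A minimal sketch of this sampling scheme is shown below. It is an illustration under assumptions: Euclidean distance, distinct points, a compact greedy PAM as the inner routine, and caller-chosen `num_samples` and `sample_size`. All function and parameter names are hypothetical.

```python
import math
import random

def total_cost(points, medoids):
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def pam(points, k):
    """Compact greedy PAM used on each sample (first k points as initial medoids)."""
    medoids, improved = list(points[:k]), True
    best = total_cost(points, medoids)
    while improved:
        improved = False
        for i in range(k):
            for o in points:
                if o in medoids:
                    continue
                cand = medoids[:i] + [o] + medoids[i + 1:]
                cost = total_cost(points, cand)
                if cost < best:
                    medoids, best, improved = cand, cost, True
    return medoids

def clara(points, k, num_samples, sample_size, seed=0):
    rng = random.Random(seed)
    best_medoids, best_avg = None, math.inf
    for _ in range(num_samples):
        sample = rng.sample(points, sample_size)          # draw a sample
        medoids = pam(sample, k)                          # PAM on the sample only
        avg = total_cost(points, medoids) / len(points)   # judged on the ENTIRE dataset
        if avg < best_avg:
            best_medoids, best_avg = medoids, avg
    return best_medoids, best_avg

# Two 3x3 grids far apart (hypothetical data).
points = [(x, y) for x in range(3) for y in range(3)]
points += [(x + 20, y + 20) for x in range(3) for y in range(3)]
medoids, avg_dissimilarity = clara(points, k=2, num_samples=4, sample_size=8, seed=1)
```

Note that the medoids can only ever come from the drawn samples, which is exactly the limitation stated in the second bullet.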
CLARANS - Randomized CLARA

● Searching a graph where every node is a potential solution - a set of k medoids
● two nodes are neighbors if their sets differ by only one medoid
● each node has a cost - the total dissimilarity between every object and the medoid of its cluster
● the problem is ‘finding a minimum on the graph’
● in an exhaustive search (as in PAM), at each step all neighbors of the current node are examined and the one with the lowest cost is chosen
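
The graph view can be made concrete with a short sketch. Nodes are represented here as `frozenset`s of k medoids, which is an implementation choice made for this example; Euclidean distance and the helper names are likewise assumptions.

```python
import math

def node_cost(points, node):
    """Total dissimilarity between every object and the medoid of its cluster."""
    return sum(min(math.dist(p, m) for m in node) for p in points)

def neighbors(points, node):
    """All nodes whose medoid set differs from `node` in exactly one medoid."""
    for m in node:
        for o in points:
            if o not in node:
                yield (node - {m}) | {o}

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]   # toy data
node = frozenset({(0, 0), (0, 1)})

all_neighbors = list(neighbors(points, node))        # k * (n - k) = 2 * 4 = 8 nodes
best_neighbor = min(all_neighbors, key=lambda s: node_cost(points, s))
```

Moving to `best_neighbor` repeatedly until no neighbor is cheaper is the exhaustive descent described in the last bullet.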
CLARANS - Randomized CLARA

● examining all k(n-k) neighbors of a node is time consuming, so CLARANS draws a random sample of neighbors to examine
● after finding a local optimum, CLARANS starts again from a new randomly selected node in search of another local optimum
● input
○ the maximum number of neighbors to examine
○ the number of local optima to search for
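
To give a feel for the numbers, the sketch below computes k(n-k) for one illustrative problem size and shows one way to draw a single random neighbor without enumerating them all. The sizes, the toy points, and the `random_neighbor` helper are assumptions for this example.

```python
import random

def random_neighbor(points, node, rng):
    """Return a node that differs from `node` in exactly one medoid."""
    leaving = rng.choice(sorted(node))                           # medoid to drop
    entering = rng.choice([p for p in points if p not in node])  # non-medoid to add
    return (node - {leaving}) | {entering}

# Illustrative size: 10,000 objects and 10 medoids.
n, k = 10_000, 10
neighbors_per_node = k * (n - k)      # 99,900 neighbors for every single node

rng = random.Random(0)
points = [(x, y) for x in range(5) for y in range(4)]   # 20 distinct toy points
node = frozenset(rng.sample(points, 3))
neighbor = random_neighbor(points, node, rng)
```

Drawing one neighbor costs a couple of random choices, whereas an exhaustive step would have to evaluate every one of the k(n-k) candidates.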
CLARANS - Randomized CLARA

1. Input parameters num_local_min and max_neighbor. Initialize i to 1, and mincost to a large number.
2. Set current_node to an arbitrary node in G_{n,k}.
3. Set j to 1.
4. Consider a random neighbor S of current_node, and based on S, calculate the cost differential
of the two nodes.
5. If S has a lower cost, set current_node to S, and go to Step 3.
6. Otherwise, increment j by 1. If j <= max_neighbor, go to Step 4.
7. Otherwise, when j > max_neighbor, compare the cost of current_node with mincost. If the
former is less than mincost, set mincost to the cost of current_node and set bestnode to
current_node.
8. Increment i by 1. If i > num_local_min, output bestnode and halt. Otherwise, go to Step 2.
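
The eight steps can be sketched in Python as below. This is a simplified illustration, not the authors' code: it assumes Euclidean distance and distinct points, and it recomputes the full cost of each neighbor instead of the cheaper cost differential mentioned in step 4. The parameter names follow the steps; everything else is hypothetical.

```python
import math
import random

def total_cost(points, medoids):
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def clarans(points, k, num_local_min, max_neighbor, seed=0):
    rng = random.Random(seed)
    best_node, min_cost = None, math.inf              # step 1
    for _ in range(num_local_min):                    # steps 2 and 8 (counter i)
        current = rng.sample(points, k)               # arbitrary node of the graph
        current_cost = total_cost(points, current)
        j = 1                                         # step 3
        while j <= max_neighbor:                      # step 6
            # step 4: random neighbor, i.e. replace one medoid by one non-medoid
            idx = rng.randrange(k)
            o = rng.choice([p for p in points if p not in current])
            neighbor = current[:idx] + [o] + current[idx + 1:]
            neighbor_cost = total_cost(points, neighbor)
            if neighbor_cost < current_cost:          # step 5: move and reset j
                current, current_cost = neighbor, neighbor_cost
                j = 1
            else:
                j += 1
        if current_cost < min_cost:                   # step 7: keep the best local minimum
            best_node, min_cost = current, current_cost
    return best_node, min_cost

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]   # toy data
medoids, cost = clarans(points, k=2, num_local_min=3, max_neighbor=50)
```

A node is accepted as a local minimum only after `max_neighbor` consecutive random neighbors fail to improve on it, so larger values make the search behave more like PAM at a higher cost.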
CLARANS

Effective spatial data mining algorithms

● SD(CLARANS) - spatial dominant approach
○ The spatial components of the objects in the dataset are collected and clustered using CLARANS
○ The non-spatial descriptions of the objects are brought into the resulting clusters
○ each cluster (defined by spatial boundaries) is described by the relative abundance of its non-spatial attributes
● NSD(CLARANS) - non-spatial dominant approach
○ Non-spatial attribute generalization produces k generalized attribute groups
○ the spatial components are clustered to find k clusters
○ if the clusters overlap, they may be merged (and their attribute descriptions merged as well)
○ each cluster is described by a single attribute description
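
The last SD(CLARANS) step, describing each spatial cluster by the relative abundance of a non-spatial attribute, can be sketched as below. This is a strong simplification: the cluster labels are assumed to come from a prior CLARANS run on the spatial components, a single categorical attribute stands in for the non-spatial description, and the data and names are made up for illustration.

```python
from collections import Counter

def describe_clusters(labels, attributes):
    """Map each cluster label to the relative frequency of each attribute value in it."""
    counts = {}
    for label, attr in zip(labels, attributes):
        counts.setdefault(label, Counter())[attr] += 1
    return {
        label: {attr: n / sum(counter.values()) for attr, n in counter.items()}
        for label, counter in counts.items()
    }

# Hypothetical output of the spatial clustering and one non-spatial attribute per object.
labels = [0, 0, 0, 1, 1]
land_use = ["residential", "residential", "commercial", "commercial", "commercial"]
description = describe_clusters(labels, land_use)
```

Here cluster 0 would be described as two thirds residential and one third commercial, while cluster 1 is entirely commercial.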
SD(CLARANS)
NSD(CLARANS)
CLARANS vs. PAM (K-Medoids) - The quality of the results is comparable, but CLARANS is much more efficient than PAM
CLARANS

Strengths:
- Deals with larger data sets than PAM
- Efficient and effective - outperforms PAM and CLARA
- Returns higher quality clusters than PAM and CLARA
- Handles outliers
- Can handle not only point objects, but also polygon objects efficiently

Limitations:
- Efficiency depends on the sample size
- A good clustering of the samples will not necessarily represent a good clustering of the whole data set if the sample is biased
- It assumes that the data to be clustered can be held in main memory simultaneously
Importance & Practical Applications

● Geographic Information System - thematic maps


● Discovering hotspots: unusual locations
● Helpful for outlier detection
Summary - CLARANS

● PAM checks every neighbor
● CLARA examines fewer neighbors - searches in subgraphs built from samples
● CLARANS searches the whole graph but draws samples of neighbors dynamically
● CLARANS has the benefit of not confining the search to a restricted area
Summary - CLARANS

● Efficient medoid-based clustering algorithm


● The best practice for spatial data-mining
● Applies a strategy to search in a certain graph
Questions?
