High Performance Data Mining

Chapter 3: Clustering

Vipin Kumar
Army High Performance Computing Research Center Department of Computer Science University of Minnesota
http://www.cs.umn.edu/~kumar


Chapter 3: Clustering Algorithms


Outline
- K-means Clustering
- Hierarchical Clustering: Single Link, Group Average, CURE, Chameleon
- Graph-Based Clustering
- Miscellaneous Topics: Scalability, Other Clustering Techniques


Clustering Definition
- Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
  - Data points in one cluster are more similar to one another.
  - Data points in separate clusters are less similar to one another.

- Similarity measures:
  - Euclidean distance if attributes are continuous.
  - Other problem-specific measures.


Input Data for Clustering


- A set of N points in an M-dimensional space, OR
- A proximity matrix that gives the pairwise distance or similarity between points.
  - Can be viewed as a weighted graph.

       I1    I2    I3    I4    I5    I6
  I1  1.00  0.70  0.80  0.00  0.00  0.00
  I2  0.70  1.00  0.65  0.25  0.00  0.00
  I3  0.80  0.65  1.00  0.00  0.00  0.00
  I4  0.00  0.25  0.00  1.00  0.90  0.85
  I5  0.00  0.00  0.00  0.90  1.00  0.95
  I6  0.00  0.00  0.00  0.85  0.95  1.00


K-means Clustering
- Find a single partition of the data into K clusters such that the within-cluster error, e.g.,

  $$\sum_{i=1}^{K} \sum_{x \in C_i} \| x - c_i \|^2,$$

  is minimized.
- Basic K-means algorithm:
  1. Select K points as the initial centroids.
  2. Assign all points to the closest centroid.
  3. Recompute the centroids.
  4. Repeat steps 2 and 3 until the centroids don't change.

- K-means is a gradient-descent algorithm that always converges, though perhaps only to a local minimum.


(Cluster Analysis for Applications, Anderberg)
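A minimal sketch of the basic algorithm above in Python with NumPy; the function name, the empty-cluster guard, and the convergence test are illustrative choices, not part of the original slides:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Basic K-means on an (N, M) array of N points in M dimensions."""
    rng = np.random.default_rng(seed)
    # Step 1: select K points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign every point to the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its points
        # (keep the old centroid if a cluster happens to become empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop when the centroids don't change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```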

Example: K-means

[Figure: initial data and seeds, and the resulting final clustering.]

Example: K-means

[Figure: another choice of initial data and seeds, and the resulting final clustering.]

K-means: Initial Point Selection


- A bad set of initial points gives a poor solution.
- Random selection:
  - Simple and efficient.
  - The initial points may not cover all clusters; with high probability some clusters are missed.
  - Many runs may be needed to find a good solution.
- Choose initial points from dense regions, so that the points are well-separated.
- Many more variations on initial point selection exist.



K-means: How to Update Centroids


- Depends on the exact error criterion used.
- If minimizing the squared error,

  $$\sum_{i=1}^{K} \sum_{x \in C_i} \| x - c_i \|^2,$$

  then the new centroid is the mean of the points in a cluster.
- If minimizing the sum of distances,

  $$\sum_{i=1}^{K} \sum_{x \in C_i} \| x - c_i \|,$$

  then the new centroid is the median of the points in a cluster.
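A small illustration of the two update rules, assuming NumPy; the sample cluster is made up:

```python
import numpy as np

cluster = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0]])

# Squared-error criterion: the mean minimizes the sum of squared distances.
centroid_mean = cluster.mean(axis=0)          # [4.0, 0.0]

# Sum-of-distances criterion: the coordinate-wise median minimizes the
# sum of Manhattan distances (for Euclidean distances the minimizer is
# the geometric median, which has no closed form).
centroid_median = np.median(cluster, axis=0)  # [2.0, 0.0]
```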

K-means: When to Update Centroids


- Update the centroids only after all points are assigned to centers, OR
- Update the centroids after each point assignment.
  - May adjust the relative weight of the point being added and the current center to speed convergence.
  - Possibility of better accuracy and faster convergence, at the cost of more work.
  - The update issues are similar to those of updating weights for neural nets using back-propagation. (Artificial Intelligence, Winston)

K-means: Pre and Post Processing


- Outliers can dominate the clustering and, in some cases, are eliminated by preprocessing.
- Post-processing attempts to fix up the clustering produced by the K-means algorithm:
  - Merge clusters that are close to each other.
  - Split loose clusters that contribute most to the error.
  - Permanently eliminate small clusters, since they may represent groups of outliers.
- These approaches are based on heuristics and require the user to choose parameter values.

K-means: Time and Space requirements


- O(M*N) space, since it uses just the data vectors, not the proximity matrix.
  - M is the number of attributes; N is the number of points.
  - Must also keep track of which cluster each point belongs to and of the K cluster centers.
- Time for basic K-means is O(T*K*M*N).
  - T is the number of iterations. (T is often small, 5-10, and can easily be bounded, since few changes occur after the first few iterations.)

K-means: Determining the Number of Clusters


- Mostly heuristic and domain-dependent approaches.
- Plot the error for 2, 3, ... clusters and find the "knee" in the curve, as in the sketch below.
- Use domain-specific knowledge and inspect the clusters for desired characteristics.
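A sketch of the knee heuristic, assuming scikit-learn's KMeans is available (any K-means implementation would do) and using placeholder data:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))  # placeholder data

errors = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    errors.append(km.inertia_)  # within-cluster squared error

# Plot k against errors; the "knee", where the error stops dropping
# sharply, suggests a reasonable number of clusters.
```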


K-means: Problems and Limitations


- Based on minimizing the within-cluster error, a criterion that is not appropriate for many situations.
  - Unsuitable when clusters have widely different sizes or have non-convex shapes.
- Restricted to data in Euclidean spaces, but variants of K-means can be used for other types of data.

K-medoid Clustering
- Find a single partition of the data into K clusters such that each cluster has a most representative point, i.e., the point that is most centrally located in the cluster with respect to some measure, e.g., distance.
- These representative points are called medoids.
- Basic K-medoid algorithm:
  1. Select K points as the initial medoids.
  2. Assign all points to the closest medoid.
  3. See if any other point is a better medoid.
  4. Repeat steps 2 and 3 until the medoids don't change.
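A minimal sketch of this algorithm in Python. It works directly from a precomputed distance matrix, which is what frees K-medoid from the Euclidean-space restriction noted on the next slide; the swap rule used here (pick the cluster member with the lowest total distance to the other members) is one common variant, not necessarily the exact rule the slides assume:

```python
import numpy as np

def kmedoids(D, k, max_iter=100, seed=0):
    """Basic K-medoid on a precomputed (N, N) distance matrix D."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(D), size=k, replace=False)
    for _ in range(max_iter):
        # Step 2: assign every point to the closest medoid.
        labels = D[:, medoids].argmin(axis=1)
        # Step 3: within each cluster, see if another point is a better medoid.
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if len(members):
                costs = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[j] = members[costs.argmin()]
        # Step 4: stop when the medoids don't change.
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return labels, medoids
```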


K-medoid Clustering
- Can be used with similarities as well as distances, and there is no Euclidean-space restriction.
- Finding a better medoid involves comparing all pairs of medoid and non-medoid points, and is relatively inefficient.
  - Sampling may be used. (Efficient and Effective Clustering Method for Spatial Data Mining, Ng and Han, 94)
- Better resistance to outliers.


(Finding Groups in Data, Kaufman and Rousseeuw)


Types of Clustering: Partitional and Hierarchical


- Partitional clustering (K-means and K-medoid) finds a one-level partitioning of the data into K disjoint groups.
- Hierarchical clustering finds a hierarchy of nested clusters (a dendrogram).
  - May proceed either bottom-up (agglomerative) or top-down (divisive).
  - Uses a proximity matrix.
  - Can be viewed as operating on a proximity graph.

Hierarchical Clustering Algorithms


- Hierarchical Agglomerative Clustering (see the sketch below):
  1. Initially, each item belongs to its own cluster.
  2. Combine the two most similar clusters.
  3. Repeat step 2 until there is only a single cluster.
  - The most popular approach.
- Hierarchical Divisive Clustering:
  - Starting with a single cluster, divide clusters until only single-item clusters remain.
  - Less popular, but equivalent in functionality.
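Both directions are available off the shelf; a minimal agglomerative sketch with SciPy (assuming SciPy is installed), where the "single", "complete", and "average" methods correspond to the merge criteria discussed on the following slides:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(20, 2))  # placeholder data

Z = linkage(X, method="single")  # merge table encoding the dendrogram
labels = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 clusters
```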


Cluster Similarity: MIN or Single Link


- The similarity of two clusters is based on the two most similar (closest) points in the different clusters.
  - Determined by one pair of points, i.e., by one link in the proximity graph.
- Can handle non-elliptical shapes.
- Sensitive to noise and outliers.


       I1    I2    I3    I4    I5
  I1  1.00  0.90  0.10  0.65  0.20
  I2  0.90  1.00  0.70  0.60  0.50
  I3  0.10  0.70  1.00  0.40  0.30
  I4  0.65  0.60  0.40  1.00  0.80
  I5  0.20  0.50  0.30  0.80  1.00

[Figure: single-link dendrogram for this proximity matrix.]

Cluster Similarity: MAX or Complete Linkage


- The similarity of two clusters is based on the two least similar (most distant) points in the different clusters.
  - Determined by all pairs of points in the two clusters.
  - Tends to break large clusters.
  - Less susceptible to noise and outliers.

       I1    I2    I3    I4    I5
  I1  1.00  0.90  0.10  0.65  0.20
  I2  0.90  1.00  0.70  0.60  0.50
  I3  0.10  0.70  1.00  0.40  0.30
  I4  0.65  0.60  0.40  1.00  0.80
  I5  0.20  0.50  0.30  0.80  1.00

[Figure: complete-link dendrogram for this proximity matrix.]


Cluster Similarity: Group Average


- The similarity of two clusters is the average of the pairwise similarities between points in the two clusters:

  $$\mathrm{Similarity}(C_i, C_j) = \frac{\sum_{p \in C_i} \sum_{q \in C_j} \mathrm{Similarity}(p, q)}{|C_i| \times |C_j|}$$

- A compromise between single and complete link.
- Need to use average connectivity for scalability, since total connectivity favors large clusters.

       I1    I2    I3    I4    I5
  I1  1.00  0.90  0.10  0.65  0.20
  I2  0.90  1.00  0.70  0.60  0.50
  I3  0.10  0.70  1.00  0.40  0.30
  I4  0.65  0.60  0.40  1.00  0.80
  I5  0.20  0.50  0.30  0.80  1.00

[Figure: group-average dendrogram for this proximity matrix.]
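The group-average formula translates directly into code; a sketch assuming NumPy and a precomputed similarity matrix S such as the one above:

```python
import numpy as np

def group_average(S, cluster_i, cluster_j):
    """Average pairwise similarity between two clusters, given a
    similarity matrix S and lists of point indices."""
    block = S[np.ix_(cluster_i, cluster_j)]
    return block.sum() / (len(cluster_i) * len(cluster_j))

# Example (0-based indices): similarity between {I1, I2, I3} and {I4, I5}.
# group_average(S, [0, 1, 2], [3, 4])
```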

Cluster Similarity: Centroid Methods


- The similarity of two clusters is based on the distance between the centroids of the two clusters.
- Similar to K-means:
  - Euclidean-distance requirement.
  - Problems with different-sized clusters and non-convex shapes.
- Variations include median-based methods.


Hierarchical Clustering: Time and Space requirements


- O(N^2) space, since it uses the proximity matrix.
  - N is the number of points.
- O(N^3) time in many cases.
  - There are N steps, and at each step the N^2-sized proximity matrix must be updated and searched.
  - By being careful, the complexity can be reduced to O(N^2 log N) time for some approaches.


Hierarchical Clustering: Problems and Limitations


- Once a decision is made to combine two clusters, it cannot be undone.
- No objective function is directly minimized.
- Different schemes have problems with one or more of the following:
  - Sensitivity to noise and outliers.
  - Difficulty handling different-sized clusters and non-convex shapes.
  - Breaking large clusters.


Recent Approaches: CURE


- Uses a number of points to represent a cluster.
- Representative points are found by selecting a constant number of points from a cluster and then shrinking them toward the center of the cluster (see the sketch below).
- Cluster similarity is the similarity of the closest pair of representative points from different clusters.
- Shrinking representative points toward the center helps avoid problems with noise and outliers.
- CURE is better able to handle clusters of arbitrary shapes and sizes.

(CURE, Guha, Rastogi, Shim)
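A sketch of the representative-point step only (full CURE also includes sampling, partitioning, and hierarchical merging); n_rep and alpha are illustrative values, and greedy farthest-point selection is one common way to obtain well-scattered points:

```python
import numpy as np

def cure_representatives(points, n_rep=5, alpha=0.2):
    """Pick scattered points from a cluster and shrink them toward its center."""
    center = points.mean(axis=0)
    # Greedy farthest-point selection of scattered representatives.
    reps = [points[np.linalg.norm(points - center, axis=1).argmax()]]
    while len(reps) < min(n_rep, len(points)):
        d = np.min([np.linalg.norm(points - r, axis=1) for r in reps], axis=0)
        reps.append(points[d.argmax()])
    reps = np.array(reps)
    # Shrink each representative a fraction alpha of the way to the center.
    return reps + alpha * (center - reps)
```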


Experimental Results: CURE

[Figure: CURE result compared with centroid-based and single-link clustering. Picture from CURE, Guha, Rastogi, Shim.]

Experimental Results: CURE

[Figure: a second CURE result compared with centroid-based and single-link clustering. Picture from CURE, Guha, Rastogi, Shim.]

Limitations of Current Merging Schemes


- Existing merging schemes are static in nature.


Chameleon: Clustering Using Dynamic Modeling


- Adapts to the characteristics of the data set to find the natural clusters.
- Uses a dynamic model to measure the similarity between clusters.
  - The main properties are the relative closeness and relative interconnectivity of the clusters.
  - Two clusters are combined if the resulting cluster shares certain properties with the constituent clusters.
  - The merging scheme preserves self-similarity.
- One of the areas of application is spatial data.



Characteristics of Spatial Data Sets


- Clusters are defined as densely populated regions of the space.
- Clusters have arbitrary shapes, orientations, and non-uniform sizes.
- Densities differ across clusters, and density varies within clusters.
- Special artifacts (streaks) and noise exist.
- The clustering algorithm must address the above characteristics while requiring minimal supervision.


Chameleon: Steps
- Preprocessing step: represent the data by a graph.
  - Given a set of points, construct the k-nearest-neighbor (k-NN) graph to capture the relationship between a point and its k nearest neighbors.
- Phase 1: use a multilevel graph-partitioning algorithm on the graph to find a large number of clusters of well-connected vertices.
  - Each cluster should contain mostly points from one true cluster, i.e., be a sub-cluster of a real cluster.
  - Graph algorithms take into account global structure.

Chameleon: Steps
- Phase 2: use hierarchical agglomerative clustering to merge sub-clusters.
  - Two clusters are combined if the resulting cluster shares certain properties with the constituent clusters.
  - Two key properties are used to model cluster similarity (see the formulas below):
    - Relative interconnectivity: absolute interconnectivity of two clusters, normalized by the internal connectivity of the clusters.
    - Relative closeness: absolute closeness of two clusters, normalized by the internal closeness of the clusters.
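Up to notation, the published definitions take the following form, where $EC(C_i, C_j)$ is the total weight of the edges crossing between the two sub-clusters, $EC(C_i)$ is the weight of the min-cut bisector of $C_i$, and $\bar{S}_{EC}$ denotes the average weight of the corresponding edges (a sketch of the definitions from the Chameleon paper, not from these slides):

$$\mathrm{RI}(C_i, C_j) = \frac{|EC(C_i, C_j)|}{\tfrac{1}{2}\left(|EC(C_i)| + |EC(C_j)|\right)}$$

$$\mathrm{RC}(C_i, C_j) = \frac{\bar{S}_{EC}(C_i, C_j)}{\frac{|C_i|}{|C_i| + |C_j|}\,\bar{S}_{EC}(C_i) + \frac{|C_j|}{|C_i| + |C_j|}\,\bar{S}_{EC}(C_j)}$$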


Experimental Results: CHAMELEON

[Figure: Chameleon clustering result.]

Experimental Results: CHAMELEON

[Figure: Chameleon clustering result on a second data set.]

Experimental Results: CURE (10 clusters)

[Figure: CURE clustering result with 10 clusters.]

Experimental Results: CURE (15 clusters)

[Figure: CURE clustering result with 15 clusters.]

Experimental Results: CHAMELEON

[Figure: Chameleon clustering result on a third data set.]

Experimental Results: CURE (9 clusters)

[Figure: CURE clustering result with 9 clusters.]

Experimental Results: CURE (15 clusters)

[Figure: CURE clustering result with 15 clusters.]

Hierarchical Divisive Clustering


- Starting with a single cluster, divide clusters until only single-item clusters remain.
- MST (Minimum Spanning Tree) based clustering:
  - Same as single link.
  - Susceptible to noise and outliers.
- Graph-based clustering:
  - Takes a global view.
  - Less susceptible to noise and outliers.
  - Example: graph-based clustering is not fooled by "bridges" between clusters.


Hypergraph-Based Clustering
- Construct a hypergraph in which related data items are connected via hyperedges.
- Partition this hypergraph in such a way that each partition contains highly connected data.
- How do we find related sets of data items? Use association rules!

S&P 500 Stock Data


- S&P 500 stock price movements from Jan. 1994 to Oct. 1996, one transaction per day:

  Day 1: Intel-UP, Microsoft-UP, Morgan-Stanley-DOWN, ...
  Day 2: Intel-DOWN, Microsoft-DOWN, Morgan-Stanley-UP, ...
  Day 3: Intel-UP, Microsoft-DOWN, Morgan-Stanley-DOWN, ...

- Frequent item sets from the stock data:

  {Intel-UP, Microsoft-UP}
  {Intel-DOWN, Microsoft-DOWN, Morgan-Stanley-UP}
  {Morgan-Stanley-UP, MBNA-Corp-UP, Fed-Home-Loan-UP}


Clustering of S&P 500 Stock Data

Cluster  Discovered Cluster                                            Industry Group

1        Applied-Matl-DOWN, Bay-Network-DOWN, 3-COM-DOWN,              Technology1-DOWN
         Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN,
         INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN,
         Texas-Inst-DOWN, Tellabs-Inc-DOWN, Natl-Semiconduct-DOWN,
         Oracl-DOWN, SGI-DOWN, Sun-DOWN

2        Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN,                     Technology2-DOWN
         ADV-Micro-Device-DOWN, Andrew-Corp-DOWN,
         Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN,
         EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN,
         Microsoft-DOWN, Scientific-Atl-DOWN

3        Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN,          Financial-DOWN
         Morgan-Stanley-DOWN

4        Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP,         Oil-UP
         Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP,
         Schlumberger-UP

5        Barrick-Gold-UP, Echo-Bay-Mines-UP, Homestake-Mining-UP,      Gold-UP
         Newmont-Mining-UP, Placer-Dome-Inc-UP

6        Alcan-Aluminum-DOWN, Asarco-Inc-DOWN, Cyprus-Amax-Min-DOWN,   Metal-DOWN
         Inland-Steel-Inc-DOWN, Inco-LTD-DOWN, Nucor-Corp-DOWN,
         Praxair-Inc-DOWN, Reynolds-Metals-DOWN,
         Stone-Container-DOWN, USX-US-Steel-DOWN

Other clusters found: Bank, Paper/Lumber, Motor/Machinery, Retail, Telecommunication, Tech/Electronics.

1984 Congressional Voting records


- Voting records of 435 congressmen on 16 key votes:

  Congressman 1: crime-YES, education-spending-YES, mx-missile-NO, ...
  Congressman 2: crime-YES, education-spending-NO, mx-missile-YES, ...
  Congressman 3: crime-NO, education-spending-NO, mx-missile-YES, ...

- Frequent item sets from the voting data:

  {crime-YES, education-spending-YES}
  {education-spending-NO, mx-missile-YES}
  {crime-NO, education-spending-NO, physician-fee-freeze-NO}


Clustering of Congressional Voting Data


[Figure: bar charts of Democrat and Republican counts per cluster, comparing our results (two clusters) with the AutoClass result (three clusters).]


Clustering of ESTs in Protein Coding Database


[Figure: EST clustering workflow, relating laboratory experiments on a new protein, similarity matching of expressed sequence tags (ESTs) against known proteins, and clusters of short segments of protein-coding sequences. Researchers: John Carlis, John Riedl, Ernest Retzel, Elizabeth Shoop.]


- Generate short segments of protein-coding sequences (ESTs).
- Match ESTs against known proteins using similarity-matching algorithms.
- Find clusters of ESTs that have the same functionality.
- Match a new protein against the EST clusters.
- Experimentally verify only the functionality of the proteins represented by the matching EST clusters.

EST Clusters by Hypergraph-Based Scheme


- 662 distinct items corresponding to ESTs.
- 11,986 variables corresponding to known proteins.
- Found 39 clusters:
  - 12 clean clusters, each corresponding to a single protein family (113 ESTs).
  - 6 clusters with two protein families.
  - 7 clusters with three protein families.
  - 3 clusters with four protein families.
  - 6 clusters with five protein families.
[Figure: EST counts per cluster, colored by protein family (e.g., dehybrogenase, cytochrome-b5, actin, 5-methyltetra, ubiqutin, s-adenosyl, proline-rich, iso-reductase, xyloglycan, glycine-hydro, heat-shock, tubulin).]

- Runtime was less than 5 minutes.



EST clusters by LSI/K-means


- Dimensionality was reduced to 50 using LSI.
- Found 38 clusters:
  - 17 clean clusters (69 ESTs).
  - 8 clusters with several protein families.
  - 1 cluster with 508 ESTs.
  - 22 clusters with one or two ESTs.
[Figure: EST counts per cluster for LSI/K-means, colored by protein family (e.g., nucleoside, actin-depoly, heat-shock, sucrose, glycine-rich, kinase, keratin, cyclophylin, proline-rich, histone-h2b, expansin, tubulin).]


Web Document Data


- Explosive growth of documents on the World Wide Web.
- Finding relevant documents becomes a challenge.
- Clusters of related words from documents can serve as keywords in searching.
- Clusters of documents can aid filtering and categorization of retrieved web documents.


Clustering of Related Words


Hypergraph Model:
- 87 transactions corresponding to Web documents.
- 5772 items corresponding to distinct word stems.
- 20 clusters.
- Runtime of 5 minutes.

AutoClass:
- 5772 transactions corresponding to distinct word stems.
- 87 attributes corresponding to Web documents.
- 35 word clusters.
- Runtime of 55 minutes.


Word Clusters Using Hypergraph-Based Method


Cluster 1: http, internet, mov, please, site, web, ww
Cluster 2: access, approach, comput, electron, goal, manufactur, power, step
Cluster 3: act, busi, check, enforc, feder, follow, govern, informate, page, public
Cluster 4: data, engineer, includes, manag, network, services, softwar, support, systems, technologi, wide
Cluster 5: action, administrate, agenci, complianc, establish, health, law, laws, nation, offic, regulations


Word Clusters Using AutoClass


Cluster 1: copyright, design, found, internate, object
Cluster 2: adopt, efficientli, hr, http, librari, offices, procedur, automaticalli, resist, basic, bookmarks, com, comprehens, held, html, hyper, mov, please, programm, reserv, bas, ww, changes
Cluster 3: concern, agent, documents, juli, apr, nov, patents, register, bear, timeout, trademark, uspto, court, doc, appeals, list, notficate, pac, recent, sites, tac, topics, user, word
Cluster 4: cornell, amend, formerli, meet, onlin, own, people, publications, select, servers, technic, version, web, center, effort, amendments, appear, news, organize, pages, portions, sections, server, structur, uscod, visit, welcom, central
Cluster 5: congress, employ, equal, homepag, ii, implementate, legislate, major, nbsp, representatives, senat, thomas, track, webmast, affirm, engineer, home, house, iii, legisl, mail, name, page, section, send, bills, trade, action


Comparison of Word Clusters


Hypergraph (the same five word clusters as on the previous slides):

Cluster 1: http, internet, mov, please, site, web, ww
Cluster 2: access, approach, comput, electron, goal, manufactur, power, step
Cluster 3: act, busi, check, enforc, feder, follow, govern, informate, page, public
Cluster 4: data, engineer, includes, manag, network, services, softwar, support, systems, technologi, wide
Cluster 5: action, administrate, agenci, complianc, establish, health, law, laws, nation, offic, regulations

LSI/K-means (dimension 10, 128 clusters):

[Table: seven sample LSI/K-means word clusters, containing stems such as leav, nuclear, base, structures, classifi, copyright, death, notices, enforc, heart, attornyes, injuri, investigate, awards, participat, third, share, protect, central, vii, refus, charge, class, commiss, committes, posit, profess, race, richard, sense, sex, tell, thank, equival, favor, ill, increases, labor, provid, secretari, steps, handbook, harm, incorrectli, letter, misus, names, otherwis, publish, soleli.]

Document Clusters Using Hierarchical Clustering


[Figure: document counts per cluster for hierarchical clustering, broken down by category (Manufacturing, Labor, Comm/Network, Business).]

Document Clusters Using Hypergraph-Based Method


[Figure: document counts per cluster for the hypergraph-based method, broken down by category (Manufacturing, Labor, Comm/Network, Business).]

Comparison of Document Clusters


[Figure: side-by-side bar charts of document counts per cluster for the hypergraph-based method and for LSI/K-means, with document categories PM, MSI, MP, IS, IPT, IP, ER, EC, BC, AA.]


Clustering Scalability for Large Data Sets


- One very common solution is sampling, but sampling can miss small clusters.
  - Data is sometimes not organized to make valid sampling easy or efficient.
- Another approach is to compress the data, or portions of the data.
  - Any such approach must ensure that not too much information is lost.

(Scaling Clustering Algorithms to Large Databases, Bradley, Fayyad and Reina)


Clustering Scalability for Large Data Sets: Birch


- BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies):
  - Can efficiently cluster data in a single pass, and can improve that clustering in additional passes.
  - Can work with a number of different distance metrics.
  - Can also deal effectively with outliers.


Clustering Scalability for Large Data Sets: Birch


- BIRCH is based on the notion of a clustering feature (CF) and a CF tree.
  - A cluster of data points (vectors) can be represented by a triple of numbers (N, LS, SS):
    - N is the number of points in the cluster.
    - LS is the linear sum of the points.
    - SS is the sum of squares of the points.
  - Points are processed incrementally:
    - Each point is placed in the leaf node corresponding to the closest cluster (CF).
    - The clusters (CFs) are then updated.
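A minimal sketch of a clustering feature in Python; keeping SS per coordinate is one common convention, and the class and method names are illustrative:

```python
import numpy as np

class ClusteringFeature:
    """The (N, LS, SS) summary of a set of points."""
    def __init__(self, point):
        p = np.asarray(point, dtype=float)
        self.n, self.ls, self.ss = 1, p.copy(), p * p

    def add(self, point):
        # Absorbing a point only touches the summary, not stored points.
        p = np.asarray(point, dtype=float)
        self.n += 1
        self.ls += p
        self.ss += p * p

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # Average squared distance to the centroid is derivable from
        # (N, LS, SS) alone: SS/N - centroid^2, per coordinate.
        var = np.maximum(self.ss / self.n - self.centroid() ** 2, 0.0)
        return float(np.sqrt(var.sum()))
```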


Clustering Scalability for Large Data Sets: Birch


- Basic steps of BIRCH:
  1. Load the data into memory by creating a CF tree that summarizes the data.
  2. Perform global clustering.
     - Produces a better clustering than the initial step.
     - An agglomerative hierarchical technique was selected.
  3. Redistribute the data points using the centroids of the clusters discovered in the global clustering phase, and thus discover a new (and hopefully better) set of clusters.

(See Zhang, Ramakrishnan and Livny, or Ganti, Ramakrishnan, and Gehrke.)

Other Clustering Approaches


- Modeling clusters as a mixture of multivariate normal distributions. (Raftery and Fraley)
- Bayesian approaches. (AutoClass, Cheeseman)
- Density-based clustering. (DBSCAN, Kriegel)
- Neural network approaches. (SOM, Kohonen)
- Subspace clustering. (CLIQUE, Agrawal)
- Many, many other variations and combinations of approaches.


Other Important Topics


- Dimensionality reduction:
  - Latent Semantic Indexing (LSI)
  - Principal Component Analysis (PCA)
- Feature transformation:
  - Normalizing features to the same scale by subtracting the mean and dividing by the standard deviation (see the sketch below).
- Feature selection:
  - As in classification, not all features are equally important.
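A one-line sketch of that normalization (z-scoring) with NumPy, on placeholder data:

```python
import numpy as np

X = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(100, 4))
X_std = (X - X.mean(axis=0)) / X.std(axis=0)  # zero mean, unit variance per feature
```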

Clustering: Summary
- Clustering is an old and multidisciplinary area.
- New challenges relate to new, or newly important, kinds of data:
  - Noisy
  - Large
  - High-dimensional
  - New kinds of similarity measures (non-metric)
  - Clusters of variable size and density
  - Arbitrary cluster shapes (non-globular)
  - Many and mixed attribute types (temporal, continuous, categorical)

- New data mining approaches and algorithms are being developed that may be more suitable for these problems.
