Chapter 3: Clustering
Vipin Kumar
Army High Performance Computing Research Center, Department of Computer Science, University of Minnesota
http://www.cs.umn.edu/~kumar
Clustering Definition
- Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
  - Data points in one cluster are more similar to one another.
  - Data points in separate clusters are less similar to one another.
- Similarity Measures:
K-means Clustering
- Find a single partition of the data into K clusters such that the within-cluster error, e.g., $\sum_{i=1}^{K} \sum_{\vec{x} \in C_i} \lVert \vec{x} - \vec{c}_i \rVert^2$, is minimized.
- Basic K-means Algorithm:
  1. Select K points as the initial centroids.
  2. Assign all points to the closest centroid.
  3. Recompute the centroids.
  4. Repeat steps 2 and 3 until the centroids don't change.
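The four steps above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names, the random choice of initial centroids, the `max_iter` cap, and the handling of empty clusters are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, seed=0, max_iter=100):
    """Basic K-means; returns (centroids, labels, within_cluster_error)."""
    rng = random.Random(seed)
    # Step 1: select K points as the initial centroids (assumed: at random).
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Step 2: assign every point to its closest centroid.
        labels = [
            min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            for p in points
        ]
        # Step 3: recompute each centroid as the mean of its members.
        new_centroids = []
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(mean_point(members) if members else centroids[i])
        # Step 4: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    # Within-cluster error: sum of squared distances to the assigned centroid.
    error = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, error
```

Points are passed as a list of tuples, for example `kmeans([(0.0, 0.0), (0.0, 1.0), (10.0, 10.0)], k=2)`. The returned error is the within-cluster sum of squared distances from the formula above.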
Example: K-means
[Figure: Final Clustering]
Data Mining for Scientific and Engineering Applications Ch 3/ 6
Example: K-means
[Figure: Final Clustering]
$\vec{x} - \vec{c}_i$
- Approaches are based on heuristics and require the user to choose parameter values.
R. Grossman, C. Kamath, V. Kumar Data Mining for Scientific and Engineering Applications Ch 3/ 11
- Restricted to data in Euclidean spaces, but variants of K-means can be used for other types of data.
K-medoid Clustering
- Find a single partition of the data into K clusters such that each cluster has a most representative point, i.e., a point that is the most centrally located point in the cluster with respect to some measure, e.g., distance.
- These representative points are called medoids.
- Basic K-medoid Algorithm:
  1. Select K points as the initial medoids.
  2. Assign all points to the closest medoid.
  3. See if any other point is a better medoid.
  4. Repeat steps 2 and 3 until the medoids don't change.
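A minimal sketch of the basic K-medoid steps above, assuming a greedy search that accepts any medoid/non-medoid swap that lowers the total distance. The function names and the swap strategy are illustrative assumptions. `dist` is any user-supplied dissimilarity function, so nothing here requires a Euclidean space.

```python
import random


def total_cost(points, medoids, dist):
    """Sum over all points of the distance to the closest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)


def kmedoids(points, k, dist, seed=0):
    """Basic K-medoid clustering by greedy medoid/non-medoid swaps."""
    rng = random.Random(seed)
    # Step 1: select K points as the initial medoids (assumed: at random).
    medoids = rng.sample(points, k)
    best = total_cost(points, medoids, dist)
    improved = True
    while improved:  # Step 4: repeat until the medoids don't change.
        improved = False
        # Step 3: see if any other point is a better medoid.
        for i in range(k):
            for candidate in points:
                if candidate in medoids:
                    continue
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                # Step 2 is implicit: total_cost assigns each point
                # to its closest medoid before summing.
                cost = total_cost(points, trial, dist)
                if cost < best:
                    medoids, best, improved = trial, cost, True
    labels = [min(range(k), key=lambda j: dist(p, medoids[j])) for p in points]
    return medoids, labels, best
```

Each pass evaluates every medoid/non-medoid pair and recomputes the full cost, so the search grows quickly with the number of points.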
K-medoid Clustering
- Can be used with similarities as well as distances, and there is no Euclidean space restriction.
- Finding a better medoid involves comparing all pairs of medoid and non-medoid points and is relatively inefficient.
  - Sampling may be used. (Efficient and effective clustering method for spatial data mining. Ng and Han, 94)
$$\text{Similarity}(\text{Cluster}_i, \text{Cluster}_j) = \frac{\sum_{p_i \in \text{Cluster}_i,\; p_j \in \text{Cluster}_j} \text{Similarity}(p_i, p_j)}{|\text{Cluster}_i| \cdot |\text{Cluster}_j|}$$

- Compromise between Single and Complete Link.
- Need to use average connectivity for scalability since total connectivity favors large clusters.

      I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00
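To make the averaging concrete, the sketch below applies group-average agglomerative merging to the I1 to I5 similarity matrix above. The function names and the merge loop are assumptions for illustration. The cluster score is the sum of pairwise similarities divided by the product of the two cluster sizes.

```python
ITEMS = ["I1", "I2", "I3", "I4", "I5"]
SIM = [
    [1.00, 0.90, 0.10, 0.65, 0.20],
    [0.90, 1.00, 0.70, 0.60, 0.50],
    [0.10, 0.70, 1.00, 0.40, 0.30],
    [0.65, 0.60, 0.40, 1.00, 0.80],
    [0.20, 0.50, 0.30, 0.80, 1.00],
]


def group_average(sim, a, b):
    """Average pairwise similarity between index clusters a and b."""
    total = sum(sim[i][j] for i in a for j in b)
    return total / (len(a) * len(b))


def agglomerate(sim):
    """Merge the most similar pair of clusters until one remains.

    Returns the merge history as (cluster_a, cluster_b, similarity).
    """
    clusters = [(i,) for i in range(len(sim))]
    history = []
    while len(clusters) > 1:
        # Score every pair of current clusters by group average.
        pairs = [
            (group_average(sim, a, b), a, b)
            for idx, a in enumerate(clusters)
            for b in clusters[idx + 1:]
        ]
        score, a, b = max(pairs, key=lambda t: t[0])
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        history.append((a, b, score))
    return history


for a, b, score in agglomerate(SIM):
    names_a = [ITEMS[i] for i in a]
    names_b = [ITEMS[i] for i in b]
    print(names_a, "+", names_b, "at similarity", round(score, 4))
```

On this matrix the merges are I1 with I2 (0.90), I4 with I5 (0.80), then {I1, I2} with {I4, I5} (1.95 / 4 = 0.4875), and finally I3 joins the rest (1.5 / 4 = 0.375).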
[Figure panels: (centroid), (single link)]
self-similarity.
Chameleon: Steps
- Preprocessing Step: Represent the data by a graph.
  - Given a set of points, we construct the k-nearest-neighbor (k-NN) graph to capture the relationship between a point and its k nearest neighbors.
- Phase 1: Use a multilevel graph partitioning algorithm on the graph to find a large number of clusters of well-connected vertices.
  - Each cluster should contain mostly points from one true cluster, i.e., it is a sub-cluster of a real cluster.
  - Graph algorithms take into account global structure.
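The preprocessing step can be illustrated with a brute-force k-NN graph builder. This is a sketch under stated assumptions: the function names are hypothetical, the graph is stored as an adjacency dictionary of point indices, and edges are unweighted. It does not cover the multilevel graph partitioning of Phase 1.

```python
def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def knn_graph(points, k, dist=euclidean):
    """Return {index: [indices of its k nearest neighbors]}."""
    graph = {}
    for i, p in enumerate(points):
        # Rank every other point by its distance to p.
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: dist(p, points[j]))
        graph[i] = others[:k]
    return graph
```

The resulting graph is directed: point A may list B among its k nearest neighbors without B listing A. Comparing all pairs costs quadratic time, so larger data sets would need a spatial index, which is outside this sketch.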
Chameleon: Steps
- Phase 2: Use Hierarchical Agglomerative Clustering to merge sub-clusters.
  - Two clusters are combined if the resulting cluster shares certain properties with the constituent clusters.
  - Two key properties are used to model cluster similarity:
    - Relative Interconnectivity: Absolute interconnectivity of two clusters normalized by the internal connectivity of the clusters.
    - Relative Closeness: Absolute closeness of two clusters normalized by the internal closeness of the clusters.
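The two normalized measures are often written as follows. The slides do not give the formulas, so the notation is an assumption: $EC(C_i, C_j)$ denotes the edges connecting the two clusters, $EC(C_i)$ the edges cut by a minimum bisection of cluster $C_i$, $|EC|$ the total weight of an edge set, and $\bar{S}_{EC}$ its average edge weight.

```latex
% Relative interconnectivity: cross-cluster edge weight over the
% average internal (bisection) edge weight of the two clusters.
RI(C_i, C_j) = \frac{\left| EC(C_i, C_j) \right|}
                    {\tfrac{1}{2}\left( \left| EC(C_i) \right| + \left| EC(C_j) \right| \right)}

% Relative closeness: average cross-cluster edge weight over the
% size-weighted average internal edge weight of the two clusters.
RC(C_i, C_j) = \frac{\bar{S}_{EC}(C_i, C_j)}
                    {\frac{|C_i|}{|C_i| + |C_j|}\,\bar{S}_{EC}(C_i)
                     + \frac{|C_j|}{|C_i| + |C_j|}\,\bar{S}_{EC}(C_j)}
```

Both ratios are near 1 when the connection between two sub-clusters resembles the connections inside them, which is the sense in which the merged cluster shares properties with its constituents.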
Hypergraph-Based Clustering
Construct a hypergraph in which related data are connected via hyperedges.
Partition this hypergraph in a way such that each partition contains highly connected data.
How do we find related sets of data items? Use Association Rules!
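As a rough sketch of the construction step, the code below treats every frequent itemset as a hyperedge over its items. Several choices are assumptions made for illustration: itemsets are found by brute-force enumeration rather than an association-rule miner, each hyperedge is weighted by its support count, and the function names and the `max_size` limit are hypothetical. Partitioning the hypergraph is not shown.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=3):
    """Return {itemset: support count} for itemsets of size 2..max_size."""
    counts = {}
    for t in transactions:
        items = sorted(set(t))
        for size in range(2, max_size + 1):
            for combo in combinations(items, size):
                counts[combo] = counts.get(combo, 0) + 1
    # Keep only itemsets that occur in at least min_support transactions.
    return {s: c for s, c in counts.items() if c >= min_support}


def build_hypergraph(transactions, min_support, max_size=3):
    """Hyperedges are frequent itemsets; weights here are support counts."""
    edges = frequent_itemsets(transactions, min_support, max_size)
    vertices = sorted({item for edge in edges for item in edge})
    return vertices, edges
```

Unlike an ordinary graph edge, a hyperedge such as `("a", "b", "c")` connects three items at once, which is what lets a single edge represent a set of related data items.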
Discovered Clusters and Industry Groups

Cluster 1 (Industry Group: Technology1-DOWN):
  Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN
Cluster 2 (Industry Group: Technology2-DOWN):
  Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN
Cluster 3:
  Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN
Cluster 4:
  Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP
Cluster 5:
  Barrick-Gold-UP, Echo-Bay-Mines-UP, Homestake-Mining-UP, Newmont-Mining-UP, Placer-Dome-Inc-UP
Cluster 6 (Industry Group: Metal-DOWN):
  Alcan-Aluminum-DOWN, Asarco-Inc-DOWN, Cyprus-Amax-Min-DOWN, Inland-Steel-Inc-Down, Inco-LTD-DOWN, Nucor-Corp-DOWN, Praxair-Inc-DOWN, Reynolds-Metals-DOWN, Stone-Container-DOWN, USX-US-Steel-DOWN

Other clusters found: Bank, Paper/Lumber, Motor/Machinery, Retail, Telecommunication, Tech/Electronics
[Figure: AutoClass result. x-axis: Cluster; legend: Dem., Rep.]
[Figure: Known Proteins. x-axis: Clusters (1 to 39); y-axis: 0 to 25; legend: other-8, other-7, other-6, other-5, other-4, other-3, other-2, other-1, dehydrogenase, cytochrome-b5, actin, 5-methyltetra, ubiquitin, s-adenosyl, proline-rich, iso-reductase, xyloglycan, glycine-hydro, heat-shock, tubulin]
[Figure: x-axis: Clusters (1 to 25); y-axis: 0 to 16; legend: other-5, other-4, other-3, other-2, other-1, nucleoside, actin-depoly, heat-shock, protein-h7, sucrose, adp-ribosylation, p-rich-low, arabinogalactan, glycine-rich, kinase, keratin, cyclophylin, proline-rich, histone-h2b, expansin, tubulin]
AutoClass
- 5772 transactions corresponding to distinct word stems
- 87 attributes corresponding to Web documents
- 35 word clusters
- Runtime of 55 minutes
Cluster 2
adopt, efficientli, hr, http, librari, offices, procedur, automaticalli, resist, basic, bookmarks, com, comprehens, held, html, hyper, mov, please, programm, reserv, bas, ww, changes

Cluster 3
concern, agent, documents, juli, apr, nov, patents, register, bear, timeout, trademark, uspto, court, doc, appeals, list, notficate, pac, recent, sites, tac, topics, user, word

Cluster 4
cornell, amend, formerli, meet, onlin, own, people, publications, select, servers, technic, version, web, center, effort, amendments, appear, news, organize, pages, portions, sections, server, structur, uscod, visit, welcom, central

Cluster 5
congress, employ, equal, homepag, ii, implementate, legislate, major, nbsp, representatives, senat, thomas, track, webmast, affirm, engineer, home, house, iii, legisl, mail, name, page, section, send, bills, trade, action
Cluster 1
http, internet, mov, please, site, web, ww

Cluster 2
access, approach, comput, electron, goal, manufactur, power, step

Cluster 3
act, busi, check, enforc, feder, follow, govern, informate, page, public

Cluster 4
data, engineer, includes, manag, network, services, softwar, support, systems, technologi, wide

Cluster 5
action, administrate, agenci, complianc, establish, health, law, laws, nation, offic, regulations
Cluster 1, Cluster 2, Cluster 3, Cluster 4, Cluster 5, Cluster 6, Cluster 7
leav, nuclear, base, structures, classifi, copyright, death, notices, enforc, heart, attornyes, injuri, investigate, awards, participat, third, share, protect, central, vii, refus, charge, class, commiss, committes, posit, profess, race, richard, sense, sex, tell, thank, equival, favor, ill, increases, labor, provid, secretari, steps, handbook, harm, incorrectli, letter, misus, names, otherwis, publish, soleli
[Two figures. x-axis: Clusters; legend: Manufacturing, Labor, Comm/Network, Business]
[Figure: LSI/K-means. y-axis: Document Counts; x-axis: Cluster (1 to 15); legend: PM, MSI, MP, IS, IPT, IP, ER, EC, BC, AA]
[Figure: x-axis: Cluster (1 to 16); legend: MSI, MP, IS, IPT, IP, ER, EC, BC, AA]
Redistribute the data points using the centroids of clusters discovered in the global clustering phase, and thus discover a new (and hopefully better) set of clusters. (See Zhang, Ramakrishnan, and Livny, or Ganti, Ramakrishnan, and Gehrke.)
- Feature Transformation
  - Normalizing features to the same scale by subtracting the mean and dividing by the standard deviation.
- Feature Selection
  - As in classification, not all features are equally important.
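The normalization described above (subtract the mean, divide by the standard deviation) can be sketched with the standard library. The function name is hypothetical. The use of the population standard deviation and the mapping of constant columns to 0 are assumptions, since the slide does not specify either.

```python
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardize each column to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Assumption: population standard deviation; constant columns map to 0.
    stds = [pstdev(col) for col in columns]
    return [
        tuple((x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds))
        for row in rows
    ]
```

After this transformation a feature measured in hundreds contributes to a Euclidean distance on the same scale as a feature measured in units, so no single attribute dominates the similarity measure only because of its range.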
Clustering: Summary
- Clustering is an old and multidisciplinary area.
- New challenges related to new or newly important kinds of data:
  - Noisy
  - Large
  - High-Dimensional
  - New Kinds of Similarity Measures (non-metric)
  - Clusters of Variable Size and Density
  - Arbitrary Cluster Shapes (non-globular)
  - Many and Mixed Attribute Types (temporal, continuous, categorical)
- New data mining approaches and algorithms are being developed that may be more suitable for these problems.