High Performance Data Mining

Chapter 3: Clustering

Vipin Kumar
Army High Performance Computing Research Center Department of Computer Science University of Minnesota
http://www.cs.umn.edu/~kumar


Chapter 3: Clustering Algorithms


Outline
- K-means Clustering
- Hierarchical Clustering: Single Link, Group Average, CURE, Chameleon
- Graph-Based Clustering
- Miscellaneous Topics: Scalability, Other Clustering Techniques


Clustering Definition
- Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
  - Data points in one cluster are more similar to one another.
  - Data points in separate clusters are less similar to one another.

- Similarity measures:
  - Euclidean distance if attributes are continuous.
  - Other problem-specific measures.


Input Data for Clustering


- A set of N points in an M-dimensional space, OR
- A proximity matrix that gives the pairwise distance or similarity between points.
  - Can be viewed as a weighted graph.

       I1    I2    I3    I4    I5    I6
  I1  1.00  0.70  0.80  0.00  0.00  0.00
  I2  0.70  1.00  0.65  0.25  0.00  0.00
  I3  0.80  0.65  1.00  0.00  0.00  0.00
  I4  0.00  0.25  0.00  1.00  0.90  0.85
  I5  0.00  0.00  0.00  0.90  1.00  0.95
  I6  0.00  0.00  0.00  0.85  0.95  1.00


K-means Clustering
- Find a single partition of the data into K clusters such that the within-cluster error, e.g.,

  $$\sum_{i=1}^{K} \sum_{x \in C_i} \| x - c_i \|^2,$$

  is minimized.
- Basic K-means algorithm:
  1. Select K points as the initial centroids.
  2. Assign all points to the closest centroid.
  3. Recompute the centroids.
  4. Repeat steps 2 and 3 until the centroids don't change.

- K-means is a gradient-descent algorithm that always converges, though perhaps only to a local minimum.


(Cluster Analysis for Applications, Anderberg)
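A minimal sketch of the basic algorithm above in Python with NumPy; the function name, the empty-cluster guard, and the convergence test are illustrative choices, not part of the original slides:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Basic K-means on an (N, M) array of N points in M dimensions."""
    rng = np.random.default_rng(seed)
    # Step 1: select K points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign every point to the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its points
        # (keep the old centroid if a cluster happens to become empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop when the centroids don't change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```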

Example: K-means

[Figure: initial data and seeds, and the resulting final clustering.]

Example: K-means

[Figure: another choice of initial data and seeds, and the resulting final clustering.]

K-means: Initial Point Selection


- A bad set of initial points gives a poor solution.
- Random selection:
  - Simple and efficient.
  - The initial points may not cover all clusters; with high probability some clusters are missed.
  - Many runs may be needed to find a good solution.
- Choose initial points from dense regions, so that the points are well-separated.
- Many more variations on initial point selection exist.



K-means: How to Update Centroids


- Depends on the exact error criterion used.
- If minimizing the squared error,

  $$\sum_{i=1}^{K} \sum_{x \in C_i} \| x - c_i \|^2,$$

  then the new centroid is the mean of the points in a cluster.
- If minimizing the sum of distances,

  $$\sum_{i=1}^{K} \sum_{x \in C_i} \| x - c_i \|,$$

  then the new centroid is the median of the points in a cluster.
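A small illustration of the two update rules, assuming NumPy; the sample cluster is made up:

```python
import numpy as np

cluster = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0]])

# Squared-error criterion: the mean minimizes the sum of squared distances.
centroid_mean = cluster.mean(axis=0)          # [4.0, 0.0]

# Sum-of-distances criterion: the coordinate-wise median minimizes the
# sum of Manhattan distances (for Euclidean distances the minimizer is
# the geometric median, which has no closed form).
centroid_median = np.median(cluster, axis=0)  # [2.0, 0.0]
```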

K-means: When to Update Centroids


- Update the centroids only after all points are assigned to centers, OR
- Update the centroids after each point assignment.
  - May adjust the relative weight of the point being added and the current center to speed convergence.
  - Possibility of better accuracy and faster convergence, at the cost of more work.
  - The update issues are similar to those of updating weights for neural nets using back-propagation. (Artificial Intelligence, Winston)

K-means: Pre and Post Processing


- Outliers can dominate the clustering and, in some cases, are eliminated by preprocessing.
- Post-processing attempts to fix up the clustering produced by the K-means algorithm:
  - Merge clusters that are close to each other.
  - Split loose clusters that contribute most to the error.
  - Permanently eliminate small clusters, since they may represent groups of outliers.
- These approaches are based on heuristics and require the user to choose parameter values.

K-means: Time and Space requirements


- O(M*N) space, since it uses just the data vectors, not the proximity matrix.
  - M is the number of attributes; N is the number of points.
  - Must also keep track of which cluster each point belongs to and of the K cluster centers.
- Time for basic K-means is O(T*K*M*N).
  - T is the number of iterations. (T is often small, 5-10, and can easily be bounded, since few changes occur after the first few iterations.)

K-means: Determining the Number of Clusters


- Mostly heuristic and domain-dependent approaches.
- Plot the error for 2, 3, ... clusters and find the "knee" in the curve, as in the sketch below.
- Use domain-specific knowledge and inspect the clusters for desired characteristics.
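A sketch of the knee heuristic, assuming scikit-learn's KMeans is available (any K-means implementation would do) and using placeholder data:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))  # placeholder data

errors = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    errors.append(km.inertia_)  # within-cluster squared error

# Plot k against errors; the "knee", where the error stops dropping
# sharply, suggests a reasonable number of clusters.
```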


K-means: Problems and Limitations


- Based on minimizing the within-cluster error, a criterion that is not appropriate for many situations.
  - Unsuitable when clusters have widely different sizes or have non-convex shapes.
- Restricted to data in Euclidean spaces, but variants of K-means can be used for other types of data.

K-medoid Clustering
- Find a single partition of the data into K clusters such that each cluster has a most representative point, i.e., the point that is most centrally located in the cluster with respect to some measure, e.g., distance.
- These representative points are called medoids.
- Basic K-medoid algorithm:
  1. Select K points as the initial medoids.
  2. Assign all points to the closest medoid.
  3. See if any other point is a better medoid.
  4. Repeat steps 2 and 3 until the medoids don't change.
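A minimal sketch of this algorithm in Python. It works directly from a precomputed distance matrix, which is what frees K-medoid from the Euclidean-space restriction noted on the next slide; the swap rule used here (pick the cluster member with the lowest total distance to the other members) is one common variant, not necessarily the exact rule the slides assume:

```python
import numpy as np

def kmedoids(D, k, max_iter=100, seed=0):
    """Basic K-medoid on a precomputed (N, N) distance matrix D."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(D), size=k, replace=False)
    for _ in range(max_iter):
        # Step 2: assign every point to the closest medoid.
        labels = D[:, medoids].argmin(axis=1)
        # Step 3: within each cluster, see if another point is a better medoid.
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if len(members):
                costs = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[j] = members[costs.argmin()]
        # Step 4: stop when the medoids don't change.
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return labels, medoids
```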


K-medoid Clustering
- Can be used with similarities as well as distances, and there is no Euclidean-space restriction.
- Finding a better medoid involves comparing all pairs of medoid and non-medoid points, and is relatively inefficient.
  - Sampling may be used. (Efficient and Effective Clustering Method for Spatial Data Mining, Ng and Han, 94)
- Better resistance to outliers.


(Finding Groups in Data, Kaufman and Rousseeuw)


Types of Clustering: Partitional and Hierarchical


- Partitional clustering (K-means and K-medoid) finds a one-level partitioning of the data into K disjoint groups.
- Hierarchical clustering finds a hierarchy of nested clusters (a dendrogram).
  - May proceed either bottom-up (agglomerative) or top-down (divisive).
  - Uses a proximity matrix.
  - Can be viewed as operating on a proximity graph.

Hierarchical Clustering Algorithms


- Hierarchical Agglomerative Clustering (see the sketch below):
  1. Initially, each item belongs to its own cluster.
  2. Combine the two most similar clusters.
  3. Repeat step 2 until there is only a single cluster.
  - The most popular approach.
- Hierarchical Divisive Clustering:
  - Starting with a single cluster, divide clusters until only single-item clusters remain.
  - Less popular, but equivalent in functionality.
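Both directions are available off the shelf; a minimal agglomerative sketch with SciPy (assuming SciPy is installed), where the "single", "complete", and "average" methods correspond to the merge criteria discussed on the following slides:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(20, 2))  # placeholder data

Z = linkage(X, method="single")  # merge table encoding the dendrogram
labels = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 clusters
```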


Cluster Similarity: MIN or Single Link


- The similarity of two clusters is based on the two most similar (closest) points in the different clusters.
  - Determined by one pair of points, i.e., by one link in the proximity graph.
- Can handle non-elliptical shapes.
- Sensitive to noise and outliers.


       I1    I2    I3    I4    I5
  I1  1.00  0.90  0.10  0.65  0.20
  I2  0.90  1.00  0.70  0.60  0.50
  I3  0.10  0.70  1.00  0.40  0.30
  I4  0.65  0.60  0.40  1.00  0.80
  I5  0.20  0.50  0.30  0.80  1.00

[Figure: single-link dendrogram for this proximity matrix.]

Cluster Similarity: MAX or Complete Linkage


- The similarity of two clusters is based on the two least similar (most distant) points in the different clusters.
  - Determined by all pairs of points in the two clusters.
  - Tends to break large clusters.
  - Less susceptible to noise and outliers.

       I1    I2    I3    I4    I5
  I1  1.00  0.90  0.10  0.65  0.20
  I2  0.90  1.00  0.70  0.60  0.50
  I3  0.10  0.70  1.00  0.40  0.30
  I4  0.65  0.60  0.40  1.00  0.80
  I5  0.20  0.50  0.30  0.80  1.00

[Figure: complete-link dendrogram for this proximity matrix.]


Cluster Similarity: Group Average


- The similarity of two clusters is the average of the pairwise similarities between points in the two clusters:

  $$\mathrm{Similarity}(C_i, C_j) = \frac{\sum_{p \in C_i} \sum_{q \in C_j} \mathrm{Similarity}(p, q)}{|C_i| \times |C_j|}$$

- A compromise between single and complete link.
- Need to use average connectivity for scalability, since total connectivity favors large clusters.

       I1    I2    I3    I4    I5
  I1  1.00  0.90  0.10  0.65  0.20
  I2  0.90  1.00  0.70  0.60  0.50
  I3  0.10  0.70  1.00  0.40  0.30
  I4  0.65  0.60  0.40  1.00  0.80
  I5  0.20  0.50  0.30  0.80  1.00

[Figure: group-average dendrogram for this proximity matrix.]
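The group-average formula translates directly into code; a sketch assuming NumPy and a precomputed similarity matrix S such as the one above:

```python
import numpy as np

def group_average(S, cluster_i, cluster_j):
    """Average pairwise similarity between two clusters, given a
    similarity matrix S and lists of point indices."""
    block = S[np.ix_(cluster_i, cluster_j)]
    return block.sum() / (len(cluster_i) * len(cluster_j))

# Example (0-based indices): similarity between {I1, I2, I3} and {I4, I5}.
# group_average(S, [0, 1, 2], [3, 4])
```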

Cluster Similarity: Centroid Methods


- The similarity of two clusters is based on the distance between the centroids of the two clusters.
- Similar to K-means:
  - Euclidean-distance requirement.
  - Problems with different-sized clusters and non-convex shapes.
- Variations include median-based methods.


Hierarchical Clustering: Time and Space requirements


- O(N^2) space, since it uses the proximity matrix.
  - N is the number of points.
- O(N^3) time in many cases.
  - There are N steps, and at each step the N^2-sized proximity matrix must be updated and searched.
  - By being careful, the complexity can be reduced to O(N^2 log N) time for some approaches.


Hierarchical Clustering: Problems and Limitations


- Once a decision is made to combine two clusters, it cannot be undone.
- No objective function is directly minimized.
- Different schemes have problems with one or more of the following:
  - Sensitivity to noise and outliers.
  - Difficulty handling different-sized clusters and non-convex shapes.
  - Breaking large clusters.


Recent Approaches: CURE


- Uses a number of points to represent a cluster.
- Representative points are found by selecting a constant number of points from a cluster and then shrinking them toward the center of the cluster (see the sketch below).
- Cluster similarity is the similarity of the closest pair of representative points from different clusters.
- Shrinking representative points toward the center helps avoid problems with noise and outliers.
- CURE is better able to handle clusters of arbitrary shapes and sizes.

(CURE, Guha, Rastogi, Shim)
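A sketch of the representative-point step only (full CURE also includes sampling, partitioning, and hierarchical merging); n_rep and alpha are illustrative values, and greedy farthest-point selection is one common way to obtain well-scattered points:

```python
import numpy as np

def cure_representatives(points, n_rep=5, alpha=0.2):
    """Pick scattered points from a cluster and shrink them toward its center."""
    center = points.mean(axis=0)
    # Greedy farthest-point selection of scattered representatives.
    reps = [points[np.linalg.norm(points - center, axis=1).argmax()]]
    while len(reps) < min(n_rep, len(points)):
        d = np.min([np.linalg.norm(points - r, axis=1) for r in reps], axis=0)
        reps.append(points[d.argmax()])
    reps = np.array(reps)
    # Shrink each representative a fraction alpha of the way to the center.
    return reps + alpha * (center - reps)
```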


Experimental Results: CURE

[Figure: CURE result compared with centroid-based and single-link clustering. Picture from CURE, Guha, Rastogi, Shim.]

Experimental Results: CURE

[Figure: a second CURE result compared with centroid-based and single-link clustering. Picture from CURE, Guha, Rastogi, Shim.]

Limitations of Current Merging Schemes


- Existing merging schemes are static in nature.


Chameleon: Clustering Using Dynamic Modeling


- Adapts to the characteristics of the data set to find the natural clusters.
- Uses a dynamic model to measure the similarity between clusters.
  - The main properties are the relative closeness and relative interconnectivity of the clusters.
  - Two clusters are combined if the resulting cluster shares certain properties with the constituent clusters.
  - The merging scheme preserves self-similarity.
- One of the areas of application is spatial data.



Characteristics of Spatial Data Sets


- Clusters are defined as densely populated regions of the space.
- Clusters have arbitrary shapes, orientations, and non-uniform sizes.
- Densities differ across clusters, and density varies within clusters.
- Special artifacts (streaks) and noise exist.
- The clustering algorithm must address the above characteristics while requiring minimal supervision.


Chameleon: Steps
- Preprocessing step: represent the data by a graph.
  - Given a set of points, construct the k-nearest-neighbor (k-NN) graph to capture the relationship between a point and its k nearest neighbors.
- Phase 1: use a multilevel graph-partitioning algorithm on the graph to find a large number of clusters of well-connected vertices.
  - Each cluster should contain mostly points from one true cluster, i.e., be a sub-cluster of a real cluster.
  - Graph algorithms take into account global structure.

Chameleon: Steps
- Phase 2: use hierarchical agglomerative clustering to merge sub-clusters.
  - Two clusters are combined if the resulting cluster shares certain properties with the constituent clusters.
  - Two key properties are used to model cluster similarity (see the formulas below):
    - Relative interconnectivity: absolute interconnectivity of two clusters, normalized by the internal connectivity of the clusters.
    - Relative closeness: absolute closeness of two clusters, normalized by the internal closeness of the clusters.
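Up to notation, the published definitions take the following form, where $EC(C_i, C_j)$ is the total weight of the edges crossing between the two sub-clusters, $EC(C_i)$ is the weight of the min-cut bisector of $C_i$, and $\bar{S}_{EC}$ denotes the average weight of the corresponding edges (a sketch of the definitions from the Chameleon paper, not from these slides):

$$\mathrm{RI}(C_i, C_j) = \frac{|EC(C_i, C_j)|}{\tfrac{1}{2}\left(|EC(C_i)| + |EC(C_j)|\right)}$$

$$\mathrm{RC}(C_i, C_j) = \frac{\bar{S}_{EC}(C_i, C_j)}{\frac{|C_i|}{|C_i| + |C_j|}\,\bar{S}_{EC}(C_i) + \frac{|C_j|}{|C_i| + |C_j|}\,\bar{S}_{EC}(C_j)}$$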


Experimental Results: CHAMELEON

[Figure: Chameleon clustering result.]

Experimental Results: CHAMELEON

[Figure: Chameleon clustering result on a second data set.]

Experimental Results: CURE (10 clusters)

[Figure: CURE clustering result with 10 clusters.]

Experimental Results: CURE (15 clusters)

[Figure: CURE clustering result with 15 clusters.]

Experimental Results: CHAMELEON

[Figure: Chameleon clustering result on a third data set.]

Experimental Results: CURE (9 clusters)

[Figure: CURE clustering result with 9 clusters.]

Experimental Results: CURE (15 clusters)

[Figure: CURE clustering result with 15 clusters.]

Hierarchical Divisive Clustering


- Starting with a single cluster, divide clusters until only single-item clusters remain.
- MST (Minimum Spanning Tree) based clustering:
  - Same as single link.
  - Susceptible to noise and outliers.
- Graph-based clustering:
  - Takes a global view.
  - Less susceptible to noise and outliers.
  - Example: graph-based clustering is not fooled by "bridges" between clusters.


Hypergraph-Based Clustering
- Construct a hypergraph in which related data items are connected via hyperedges.
- Partition this hypergraph in such a way that each partition contains highly connected data.
- How do we find related sets of data items? Use association rules!

S&P 500 Stock Data


- S&P 500 stock price movements from Jan. 1994 to Oct. 1996, one transaction per day:

  Day 1: Intel-UP, Microsoft-UP, Morgan-Stanley-DOWN, ...
  Day 2: Intel-DOWN, Microsoft-DOWN, Morgan-Stanley-UP, ...
  Day 3: Intel-UP, Microsoft-DOWN, Morgan-Stanley-DOWN, ...

- Frequent item sets from the stock data:

  {Intel-UP, Microsoft-UP}
  {Intel-DOWN, Microsoft-DOWN, Morgan-Stanley-UP}
  {Morgan-Stanley-UP, MBNA-Corp-UP, Fed-Home-Loan-UP}


Clustering of S&P 500 Stock Data

Cluster  Discovered Cluster                                            Industry Group

1        Applied-Matl-DOWN, Bay-Network-DOWN, 3-COM-DOWN,              Technology1-DOWN
         Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN,
         INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN,
         Texas-Inst-DOWN, Tellabs-Inc-DOWN, Natl-Semiconduct-DOWN,
         Oracl-DOWN, SGI-DOWN, Sun-DOWN

2        Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN,                     Technology2-DOWN
         ADV-Micro-Device-DOWN, Andrew-Corp-DOWN,
         Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN,
         EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN,
         Microsoft-DOWN, Scientific-Atl-DOWN

3        Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN,          Financial-DOWN
         Morgan-Stanley-DOWN

4        Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP,         Oil-UP
         Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP,
         Schlumberger-UP

5        Barrick-Gold-UP, Echo-Bay-Mines-UP, Homestake-Mining-UP,      Gold-UP
         Newmont-Mining-UP, Placer-Dome-Inc-UP

6        Alcan-Aluminum-DOWN, Asarco-Inc-DOWN, Cyprus-Amax-Min-DOWN,   Metal-DOWN
         Inland-Steel-Inc-DOWN, Inco-LTD-DOWN, Nucor-Corp-DOWN,
         Praxair-Inc-DOWN, Reynolds-Metals-DOWN,
         Stone-Container-DOWN, USX-US-Steel-DOWN

Other clusters found: Bank, Paper/Lumber, Motor/Machinery, Retail, Telecommunication, Tech/Electronics.

1984 Congressional Voting records


- Voting records of 435 congressmen on 16 key votes:

  Congressman 1: crime-YES, education-spending-YES, mx-missile-NO, ...
  Congressman 2: crime-YES, education-spending-NO, mx-missile-YES, ...
  Congressman 3: crime-NO, education-spending-NO, mx-missile-YES, ...

- Frequent item sets from the voting data:

  {crime-YES, education-spending-YES}
  {education-spending-NO, mx-missile-YES}
  {crime-NO, education-spending-NO, physician-fee-freeze-NO}


Clustering of Congressional Voting Data


[Figure: bar charts of Democrat and Republican counts per cluster, comparing our results (two clusters) with the AutoClass result (three clusters).]


Clustering of ESTs in Protein Coding Database


[Figure: EST clustering workflow, relating laboratory experiments on a new protein, similarity matching of expressed sequence tags (ESTs) against known proteins, and clusters of short segments of protein-coding sequences. Researchers: John Carlis, John Riedl, Ernest Retzel, Elizabeth Shoop.]


- Generate short segments of protein-coding sequences (ESTs).
- Match ESTs against known proteins using similarity-matching algorithms.
- Find clusters of ESTs that have the same functionality.
- Match a new protein against the EST clusters.
- Experimentally verify only the functionality of the proteins represented by the matching EST clusters.

EST Clusters by Hypergraph-Based Scheme


- 662 distinct items corresponding to ESTs.
- 11,986 variables corresponding to known proteins.
- Found 39 clusters:
  - 12 clean clusters, each corresponding to a single protein family (113 ESTs).
  - 6 clusters with two protein families.
  - 7 clusters with three protein families.
  - 3 clusters with four protein families.
  - 6 clusters with five protein families.
[Figure: EST counts per cluster, colored by protein family (e.g., dehybrogenase, cytochrome-b5, actin, 5-methyltetra, ubiqutin, s-adenosyl, proline-rich, iso-reductase, xyloglycan, glycine-hydro, heat-shock, tubulin).]

- Runtime was less than 5 minutes.



EST clusters by LSI/K-means


- Dimensionality was reduced to 50 using LSI.
- Found 38 clusters:
  - 17 clean clusters (69 ESTs).
  - 8 clusters with several protein families.
  - 1 cluster with 508 ESTs.
  - 22 clusters with one or two ESTs.
[Figure: EST counts per cluster for LSI/K-means, colored by protein family (e.g., nucleoside, actin-depoly, heat-shock, sucrose, glycine-rich, kinase, keratin, cyclophylin, proline-rich, histone-h2b, expansin, tubulin).]


Web Document Data


- Explosive growth of documents on the World Wide Web.
- Finding relevant documents becomes a challenge.
- Clusters of related words from documents can serve as keywords in searching.
- Clusters of documents can aid filtering and categorization of retrieved web documents.


Clustering of Related Words


Hypergraph Model:
- 87 transactions corresponding to Web documents.
- 5772 items corresponding to distinct word stems.
- 20 clusters.
- Runtime of 5 minutes.

AutoClass:
- 5772 transactions corresponding to distinct word stems.
- 87 attributes corresponding to Web documents.
- 35 word clusters.
- Runtime of 55 minutes.


Word Clusters Using Hypergraph-Based Method


Cluster 1: http, internet, mov, please, site, web, ww
Cluster 2: access, approach, comput, electron, goal, manufactur, power, step
Cluster 3: act, busi, check, enforc, feder, follow, govern, informate, page, public
Cluster 4: data, engineer, includes, manag, network, services, softwar, support, systems, technologi, wide
Cluster 5: action, administrate, agenci, complianc, establish, health, law, laws, nation, offic, regulations


Word Clusters Using AutoClass


Cluster 1: copyright, design, found, internate, object
Cluster 2: adopt, efficientli, hr, http, librari, offices, procedur, automaticalli, resist, basic, bookmarks, com, comprehens, held, html, hyper, mov, please, programm, reserv, bas, ww, changes
Cluster 3: concern, agent, documents, juli, apr, nov, patents, register, bear, timeout, trademark, uspto, court, doc, appeals, list, notficate, pac, recent, sites, tac, topics, user, word
Cluster 4: cornell, amend, formerli, meet, onlin, own, people, publications, select, servers, technic, version, web, center, effort, amendments, appear, news, organize, pages, portions, sections, server, structur, uscod, visit, welcom, central
Cluster 5: congress, employ, equal, homepag, ii, implementate, legislate, major, nbsp, representatives, senat, thomas, track, webmast, affirm, engineer, home, house, iii, legisl, mail, name, page, section, send, bills, trade, action


Comparison of Word Clusters


Hypergraph (the same five word clusters as on the previous slides):

Cluster 1: http, internet, mov, please, site, web, ww
Cluster 2: access, approach, comput, electron, goal, manufactur, power, step
Cluster 3: act, busi, check, enforc, feder, follow, govern, informate, page, public
Cluster 4: data, engineer, includes, manag, network, services, softwar, support, systems, technologi, wide
Cluster 5: action, administrate, agenci, complianc, establish, health, law, laws, nation, offic, regulations

LSI/K-means (dimension 10, 128 clusters):

[Table: seven sample LSI/K-means word clusters, containing stems such as leav, nuclear, base, structures, classifi, copyright, death, notices, enforc, heart, attornyes, injuri, investigate, awards, participat, third, share, protect, central, vii, refus, charge, class, commiss, committes, posit, profess, race, richard, sense, sex, tell, thank, equival, favor, ill, increases, labor, provid, secretari, steps, handbook, harm, incorrectli, letter, misus, names, otherwis, publish, soleli.]

Document Clusters Using Hierarchical Clustering


[Figure: document counts per cluster for hierarchical clustering, broken down by category (Manufacturing, Labor, Comm/Network, Business).]

Document Clusters Using Hypergraph-Based Method


[Figure: document counts per cluster for the hypergraph-based method, broken down by category (Manufacturing, Labor, Comm/Network, Business).]

Comparison of Document Clusters


[Figure: side-by-side bar charts of document counts per cluster for the hypergraph-based method and for LSI/K-means, with document categories PM, MSI, MP, IS, IPT, IP, ER, EC, BC, AA.]


Clustering Scalability for Large Data Sets


- One very common solution is sampling, but sampling can miss small clusters.
  - Data is sometimes not organized to make valid sampling easy or efficient.
- Another approach is to compress the data, or portions of the data.
  - Any such approach must ensure that not too much information is lost.

(Scaling Clustering Algorithms to Large Databases, Bradley, Fayyad and Reina)


Clustering Scalability for Large Data Sets: Birch


- BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies):
  - Can efficiently cluster data in a single pass, and can improve that clustering in additional passes.
  - Can work with a number of different distance metrics.
  - Can also deal effectively with outliers.


Clustering Scalability for Large Data Sets: Birch


- BIRCH is based on the notion of a clustering feature (CF) and a CF tree.
  - A cluster of data points (vectors) can be represented by a triple of numbers (N, LS, SS):
    - N is the number of points in the cluster.
    - LS is the linear sum of the points.
    - SS is the sum of squares of the points.
  - Points are processed incrementally:
    - Each point is placed in the leaf node corresponding to the closest cluster (CF).
    - The clusters (CFs) are then updated.
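A minimal sketch of a clustering feature in Python; keeping SS per coordinate is one common convention, and the class and method names are illustrative:

```python
import numpy as np

class ClusteringFeature:
    """The (N, LS, SS) summary of a set of points."""
    def __init__(self, point):
        p = np.asarray(point, dtype=float)
        self.n, self.ls, self.ss = 1, p.copy(), p * p

    def add(self, point):
        # Absorbing a point only touches the summary, not stored points.
        p = np.asarray(point, dtype=float)
        self.n += 1
        self.ls += p
        self.ss += p * p

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # Average squared distance to the centroid is derivable from
        # (N, LS, SS) alone: SS/N - centroid^2, per coordinate.
        var = np.maximum(self.ss / self.n - self.centroid() ** 2, 0.0)
        return float(np.sqrt(var.sum()))
```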


Clustering Scalability for Large Data Sets: Birch


- Basic steps of BIRCH:
  1. Load the data into memory by creating a CF tree that summarizes the data.
  2. Perform global clustering.
     - Produces a better clustering than the initial step.
     - An agglomerative hierarchical technique was selected.
  3. Redistribute the data points using the centroids of the clusters discovered in the global clustering phase, and thus discover a new (and hopefully better) set of clusters.

(See Zhang, Ramakrishnan and Livny, or Ganti, Ramakrishnan, and Gehrke.)

Other Clustering Approaches


- Modeling clusters as a mixture of multivariate normal distributions. (Raftery and Fraley)
- Bayesian approaches. (AutoClass, Cheeseman)
- Density-based clustering. (DBSCAN, Kriegel)
- Neural network approaches. (SOM, Kohonen)
- Subspace clustering. (CLIQUE, Agrawal)
- Many, many other variations and combinations of approaches.


Other Important Topics


- Dimensionality reduction:
  - Latent Semantic Indexing (LSI)
  - Principal Component Analysis (PCA)
- Feature transformation:
  - Normalizing features to the same scale by subtracting the mean and dividing by the standard deviation (see the sketch below).
- Feature selection:
  - As in classification, not all features are equally important.
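A one-line sketch of that normalization (z-scoring) with NumPy, on placeholder data:

```python
import numpy as np

X = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(100, 4))
X_std = (X - X.mean(axis=0)) / X.std(axis=0)  # zero mean, unit variance per feature
```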

Clustering: Summary
- Clustering is an old and multidisciplinary area.
- New challenges relate to new, or newly important, kinds of data:
  - Noisy
  - Large
  - High-dimensional
  - New kinds of similarity measures (non-metric)
  - Clusters of variable size and density
  - Arbitrary cluster shapes (non-globular)
  - Many and mixed attribute types (temporal, continuous, categorical)

- New data mining approaches and algorithms are being developed that may be more suitable for these problems.
