
S8:

Association Rules and Clustering


Shawndra Hill Spring 2013 TR 1:30-3pm and 3-4:30



DSS Course Outline


Introduction to Modeling & Data Mining
- Fundamental concepts and terminology
Data Mining methods
- Classification: decision trees, association rules, clustering and segmentation, collaborative filtering, genetic algorithms, etc.
- Inner workings
- Strengths and weaknesses
Evaluation
- How to evaluate the results of a data mining solution
Applications
- Real-world business problems DM can be applied to

Recall?

Instance-based (lazy) versus eager
Incremental versus not
Unsupervised versus supervised

Classification Model
Historical/Training Data → Classification Algorithms (Decision Tree, Naïve Bayes, KNN, ..., Your Classification Algorithm Here) → Classifier: Applicant → Class (yes/no)

NAME   Balance    Age   Default
Mike   23,000     30    yes
Mary   51,100     40    yes
Bill   68,000     55    no
Jim    74,000     46    no
Dave   23,000     47    yes
Anne   100,000    49    no

Prediction (Recall Our Goal)

Inputs (Independent Variables)      Outputs (Target, Class, Dependent Variable)
Income     Num. Kids                Risk of Defaulting
100,000    2                        0.12
30,000     2                        0.98

For traditional learning methods, you will need a table of i.i.d. records. i.i.d.?

Unsupervised Learning (Recall Our Goal)

Inputs (Independent Variables)
Income     Num. Kids
100,000    2
30,000     2

Is there anything useful on MMBs?

http://www.mmm-online.com/avandia-warning-signs-seen-online-as-early-as-04/printarticle/177982/

Social Mining for More than Cent$

What Can You Learn by Eavesdropping on a Million Conversations (on the Web)?

What do breast cancer patients talk about the most?


[Table: drug-symptom association rules mined from breast cancer message board posts, with columns Drug, Symptom, Count, Lift, P_Value, and label (IND / AE / FP). Drugs include citrate calcium, cis platinum, tums, glycerin suppositories, lactic acid, natural tears, medroxyprogesterone acetate, liquid vitamin, nystatin, artificial tears, norvasc, lisinopril, klonopin, senokot, eye drops, benzonatate, simply sleep, and zyrtec; symptoms include constipating, amnesia, appetite lost, acidity, watery eyes, fluid retention, heart burn, oral thrush, dry eyes, chest pain, muscle weakness, anxiousness, acid indigestion, eyes water, cough, restless, and nasal drip. Counts range from 2 to 6, lift values from about 990 to 18,600, and p-values from 2.6E-19 to 2.0E-06.]

Slide annotations: "Recipes?"; "All chemo drugs, makes sense"; "Isotope?"; "All well and good, but not sure how much can be gleaned from this..."

Is there anything useful on MMBs?


Arimidex: hormonal therapy for breast cancer

Side effect      Count      Side effect        Count
joint pain       925        hair loss          88
hot flashes      661        muscle pain        74
menopause        434        night sweat        63
bone pain        265        joint stiffness    47
bone loss        243        trigger finger     41
weight gain      219        mood swings        38
arthritis        207        fracture           37
osteoporosis     134        dry eye            35
depression       125        vaginal dryness    32
sleeplessness    113        high cholesterol   29

Not on label!

Our Social TV Recommendation System Approach

Data: Collection

Amazon Mechanical Turk: show Twitter handles for the 572 current show names (Jan-May 2012)
IMDB TV crawler: television content features
Twitter API: show network, status updates, local network

Models: Overview

Network models:
- Show network confidence *
- Follower social network *
- Network popularity

Show follower feature models:
- Gender
- Location
- General demographics-based

User-generated text models:
- TF-IDF transform on all words *
- TF-IDF transform on all words less show-related words
- TF-IDF on show-related words

Show feature model:
- Show content similarity

Matrix factorization

Random:
- Randomly selected shows
- Randomly selected words
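As a rough illustration of the user-generated text models listed above, a TF-IDF transform over each follower's tweets can be built with scikit-learn. This is only a sketch of the idea, not the project's actual pipeline; the sample tweets and the show-related word list below are hypothetical.

# Minimal sketch (assumed data): TF-IDF vectors from each user's tweet text
from sklearn.feature_extraction.text import TfidfVectorizer

user_tweets = [                                   # hypothetical concatenated tweets per user
    "watching american idol tonight with friends",
    "the voice finale was great last night",
]
show_related_words = ["idol", "voice", "duets"]   # hypothetical show-related word list

# "all words less show-related words" variant: treat show words as stop words
vectorizer = TfidfVectorizer(stop_words=show_related_words)
X = vectorizer.fit_transform(user_tweets)         # users x vocabulary TF-IDF matrix
print(X.shape)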

Data: TV show follower network


[Figure: show follower network; S1 = American Idol, S2 = The Voice, S3 = Duets]

Over 19 million unique followers

Data: Sampled followers


Identified all users who followed >= 2 shows (~5.5 million)
Randomly sampled up to 1,000 users in each show's local network

[Figure: S1 = American Idol, S2 = The Voice, S3 = Duets; u = Shawndra, v = Adrian, w = Jin]

Data: Status updates


Collected up to the past 400 tweets from each user in the follower sample
Removed user u if:
- language != en
- |Followers(u)| > 2000

Sample of 114K users

Experiment framework
10-fold cross validation over 114K users

function VALIDATE(Engine e, List[Set[user]] tests, List[Set[user]] trains) {
    List[Result] results = [];
    FOR (i IN 1:10) {
        Model m = TRAIN(e, trains[i]);
        FOR (u IN tests[i]) {
            Show randShow = GET_RANDOM_SHOW(u);
            List[Show] recommended = PREDICT(m, u, randShow);
            results += GET_PERFORMANCE(recommended, u, randShow);
        }
    }
    RETURN (SUM(results) / 10);
}

Validation metrics

Models: Show network confidence

Compute a similarity matrix between all shows based on confidence from the show network.

[Figure: S1 = American Idol, S2 = The Voice; u = Shawndra, v = Adrian]

Models: Follower social network

For a test user u, find all neighbors who are connected to TV shows.
Rank recommendations by the number of neighbors following each show.

[Figure: S1 = American Idol, S2 = The Voice; u = Shawndra; Ranking: S1, S2]

Considered user neighbors to have either the follower, friend, or reciprocally linked relation.

Models: Network popularity

Simply rank by the number of Twitter followers.
Ignores features of the input user and show.

[Figure: S1 = American Idol, S2 = The Voice, S3 = Duets; Ranking: S2, S3, S1]

Results: Precision

Results: Recall

Results: Text-based English only

Restricting to standard English words only results in a similar level of performance (4 million → 40,000 tokens).

Visualization of the Similarity Matrix

http://www.thesocialtvlab.com/adrian/videos/ http://thesocialtvlab.com/adrian/network_vis/interactive_network_recommender/

Learning Association Rules from Data


A descriptive approach for discovering relevant and valid associations among items in the data. E.g.,

If {buy diapers} Then {buy beer}

Market Basket Analysis


Transaction No.   Item 1      Item 2      Item 3
100               Beer        Diapers     Chocolate
101               Milk        Chocolate   Shampoo
102               Beer        Wine        Vodka
103               Beer        Cheese      Diapers
104               Ice Cream   Diapers     Beer

Examples:
Shoppers who buy ice cream are very likely to buy beer: If {buy ice cream} Then {buy beer}
Shoppers who buy Beer and Wine are likely to buy Cheese and Chocolate: If {buy Beer, Wine} Then {buy Cheese, Chocolate}

Association Rules

Rule format: If {set of items} Then {set of items}
                    body                 head
Example: If {Diapers, Baby Food} Then {Beer, Wine}
Body implies Head

Market basket analysis


Basket data: a collection of transactions, each consisting of a set of items bought in the same transaction.
Association rules from basket data: learn which items are frequently bought together.

Applications
Store planning: placing associated items together (Milk & Bread)?
- May reduce basket total value (shoppers buy less unplanned merchandise)
Customer segmentation based on buying behavior
Cross-marketing
Catalog design, etc.
Fraud detection: finding in insurance data that a certain doctor often works with a certain lawyer may indicate potential fraudulent activity.

Evaluation of Association Rules

What rules should be considered valid?
If {Diapers} Then {Beer}   (body → head)
An association rule is valid if it satisfies some evaluation measures.

Rule Evaluation

Support
- Milk & Wine co-occur, but only 2 out of 200K transactions contain these items

Transaction No.   Item 1      Item 2      Item 3
100               Beer        Diapers     Chocolate
101               Milk        Chocolate   Wine
102               Beer        Wine        Vodka
103               Beer        Cheese      Diapers
104               Ice Cream   Diapers     Beer
...

Rule Evaluation
Significance is measured by Support: the frequency with which the items in body and head co-occur.

Support = (No. of transactions containing the items in both body and head) / (Total no. of transactions in the database)

E.g., the support of the rule If {Diapers} Then {Beer} (body = {Diapers}, head = {Beer}) is 3/5: 60% of the transactions contain both items.

Transaction No.   Item 1      Item 2      Item 3
100               Beer        Diapers     Chocolate
101               Milk        Chocolate   Shampoo
102               Beer        Wine        Vodka
103               Beer        Cheese      Diapers
104               Ice Cream   Diapers     Beer
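As a small illustration of this definition, support can be computed directly from the transaction table above; the helper below is ours, not from the lecture.

# Support = fraction of transactions that contain every item in body ∪ head
transactions = [
    {"Beer", "Diapers", "Chocolate"},      # 100
    {"Milk", "Chocolate", "Shampoo"},      # 101
    {"Beer", "Wine", "Vodka"},             # 102
    {"Beer", "Cheese", "Diapers"},         # 103
    {"Ice Cream", "Diapers", "Beer"},      # 104
]

def support(itemset, transactions):
    hits = sum(1 for t in transactions if itemset <= t)   # itemset is a subset of t
    return hits / len(transactions)

print(support({"Diapers", "Beer"}, transactions))   # 0.6, i.e. the 3/5 from the example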

Rule Evaluation

Strength of the Implication

Which implication is stronger? Of the transactions containing Milk:
- 1% contain Wine:  If {Milk} Then {Wine}
- 20% contain Beer: If {Milk} Then {Beer}

Rule Evaluation
A rule's strength is measured by its confidence: how strongly does the body imply the head?
Confidence: the proportion of transactions containing the body that also contain the head.

Confidence = (No. of transactions containing both body and head) / (No. of transactions containing the body)

Example: the confidence of the rule If {Diapers} Then {Beer} is 3/3, i.e., 100% of the transactions that contain diapers also contain beer.

Transaction No.   Item 1      Item 2      Item 3
100               Beer        Diapers     Chocolate
101               Milk        Chocolate   Shampoo
102               Beer        Wine        Vodka
103               Beer        Cheese      Diapers
104               Ice Cream   Diapers     Beer
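Continuing the small sketch from the support slide, confidence follows directly from the same support helper (an illustration only).

def confidence(body, head, transactions):
    # Confidence = support(body ∪ head) / support(body)
    return support(body | head, transactions) / support(body, transactions)

print(confidence({"Diapers"}, {"Beer"}, transactions))   # 1.0, i.e. 3/3 as in the example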

Rule Evaluation: Confidence

Is the rule {Milk} → {Wine} equivalent to the rule {Wine} → {Milk}? When is the implication {Milk} → {Wine} more likely than the reverse?

Rule Evaluation

Example: Lift
Consider the rule: If {Milk} Then {Beer}

Assume: Support 20%, Confidence 100%. Now assume confidence is 60%.

What if 60% of shoppers buy beer?

What if all shoppers at the store buy beer?

Find rules where the frequency of the head given the body > the expected frequency of the head.

More Evaluation Criteria

Lift
Measures how much more likely the head is given the body than the head alone (confidence / frequency of head).
Example: If {Milk} Then {Beer}
- Total number of customers in the database: 1000
- No. of customers buying Milk: 200
- No. of customers buying Beer: 50
- No. of customers buying Milk & Beer: 20

Frequency of head: 50/1000 (5%)
Confidence: 20/200 (10%)
Lift: 10% / 5% = 2
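The lift computation on this slide can be reproduced with the raw counts alone; a quick worked check:

total = 1000              # customers in the database
milk = 200                # customers buying Milk (body)
beer = 50                 # customers buying Beer (head)
milk_and_beer = 20        # customers buying both

head_freq = beer / total              # 0.05  (5%)
conf = milk_and_beer / milk           # 0.10  (10%)
lift = conf / head_freq               # 2.0
print(head_freq, conf, lift)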

Comparison to Traditional Database Queries

Traditional methods such as database queries support hypothesis verification about a relationship, such as the co-occurrence of diapers & beer.

Transaction No.   Item 1      Item 2      Item 3
100               Beer        Diapers     Chocolate
101               Milk        Chocolate   Shampoo
102               Beer        Wine        Vodka
103               Beer        Cheese      Diapers
104               Ice Cream   Diapers     Beer

Comparison to Traditional Database Queries

Data Mining: explore the data for patterns. Data mining methods automatically discover significant association rules from data.
They find whatever patterns exist in the database, without the user having to specify in advance what to look for (data driven), and therefore allow finding unexpected correlations.

Algorithms to Extract Association Rules

The standard method was developed by Agrawal et al. The Association Rules problem was defined as:
Generate all association rules that have support greater than the user-specified minsup (minimum support) and confidence greater than the user-specified minconf (minimum confidence).

The algorithm performs an efficient search over the data to find all such rules.

Extracting Association Rules from Data

The problem is decomposed into two sub-problems:

Find all sets of items (called itemsets) that co-occur in at least minsup of the transactions in the database.
Itemsets with at least minimum support are called frequent itemsets (or large itemsets).

Extracting Association Rules from Data

Example
- A data set with 5 transactions
- Minsup = 40%, Minconf = 80%

Phase 1: Find all frequent itemsets
{Beer} (support=80%), {Diaper} (60%), {Chocolate} (40%), {Beer, Diaper} (60%)

Transaction No.   Item 1      Item 2      Item 3
100               Beer        Diaper      Chocolate
101               Milk        Chocolate   Shampoo
102               Beer        Wine        Vodka
103               Beer        Cheese      Diaper
104               Ice Cream   Diaper      Beer

Phase 1: Mining Association Rules Example

Phase 1: Finding all frequent itemsets

How can we perform an efficient search over all frequent itemsets?
Note: if {diaper, beer} is frequent, then {diaper} and {beer} are each frequent as well.
This means that if an itemset is not frequent (e.g., {wine}), then no itemset that contains wine, such as {wine, beer}, can be frequent either.

Phase 1: Mining Association Rules Example

Example: assume the following itemsets of size 1 were found to be frequent: {Milk}, {Bread}, {Butter}.
Since {Wine} is not frequent, {Wine, Butter} cannot be frequent. Only if both {Wine} and {Butter} were frequent could {Wine, Butter} be frequent.
Therefore:
1. Find all itemsets of size 1 that are frequent.
2. To find out which itemsets of size 2 are frequent, count the frequency only of itemsets of size 2 whose two items are among Milk, Bread, Butter.
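A compact sketch of Phase 1 in the spirit of the pruning idea above: only count candidate itemsets built from itemsets that are already frequent. This is an illustration, not Agrawal et al.'s optimized algorithm.

def frequent_itemsets(transactions, minsup):
    n = len(transactions)
    items = {i for t in transactions for i in t}
    # frequent itemsets of size 1
    current = {frozenset([i]) for i in items
               if sum(i in t for t in transactions) / n >= minsup}
    frequent = set(current)
    k = 2
    while current:
        # candidate k-itemsets: unions of frequent (k-1)-itemsets
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        current = {c for c in candidates
                   if sum(c <= t for t in transactions) / n >= minsup}
        frequent |= current
        k += 1
    return frequent

trans = [{"Beer", "Diaper", "Chocolate"}, {"Milk", "Chocolate", "Shampoo"},
         {"Beer", "Wine", "Vodka"}, {"Beer", "Cheese", "Diaper"},
         {"Ice Cream", "Diaper", "Beer"}]
print(frequent_itemsets(trans, 0.4))
# {Beer}, {Diaper}, {Chocolate}, {Beer, Diaper} -- as in the 5-transaction example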

Phase 2: Mining Association Rules Example

Assume {Milk, Bread, Butter} is a frequent itemset.

For each frequent itemset, find all possible rules Body → Head (using the items contained in the itemset). Example:
Does {Milk} → {Bread, Butter} satisfy minimum confidence?
What about {Bread} → {Milk, Butter}, {Butter} → {Milk, Bread}, {Bread, Butter} → {Milk}, {Milk, Butter} → {Bread}, {Milk, Bread} → {Butter}?

To calculate the confidence of the rule {Milk} → {Bread, Butter}:

Confidence = Support({Milk, Bread, Butter}) / Support({Milk}) = (No. of transactions that contain {Milk, Bread, Butter}) / (No. of transactions that contain {Milk})
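Phase 2 can then be sketched as enumerating every non-empty proper subset of a frequent itemset as a body, exactly as described above (helper names are ours).

from itertools import combinations

def rules_from_itemset(itemset, transactions, minconf):
    itemset = frozenset(itemset)
    n = len(transactions)
    sup = lambda s: sum(s <= t for t in transactions) / n
    rules = []
    for r in range(1, len(itemset)):
        for body in map(frozenset, combinations(itemset, r)):
            head = itemset - body
            conf = sup(itemset) / sup(body)       # confidence of body -> head
            if conf >= minconf:
                rules.append((set(body), set(head), conf))
    return rules

# e.g. with the 5-transaction example, rules_from_itemset({"Beer", "Diaper"}, trans, 0.8)
# keeps {Diaper} -> {Beer} (confidence 1.0) and drops {Beer} -> {Diaper} (confidence 0.75)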

Association

If the rule {Yogurt} → {Bread, Butter} is found to have minimum confidence, does it mean the rule {Bread, Butter} → {Yogurt} also has minimum confidence?
Example:
- Support of {Yogurt} is 20%, support of {Yogurt, Bread, Butter} is 10%, support of {Bread, Butter} is 50%
- Confidence of {Yogurt} → {Bread, Butter} is 10%/20% = 50%
- Confidence of {Bread, Butter} → {Yogurt} is 10%/50% = 20%

Back to the example

Minsup = 40%, Minconf = 80%

Phase 1: Find all large itemsets
Large itemsets: {Beer} (support=80%), {Diaper} (60%), {Chocolate} (40%), {Beer, Diaper} (60%)

Phase 2: For each frequent itemset of size >= 2 (containing two or more items), find all rules that satisfy minimum confidence.
{Beer} → {Diaper}: confidence = 3/4 (75%), not sufficient confidence
{Diaper} → {Beer}: confidence = 3/3 (100%)

Transaction No.   Item 1      Item 2      Item 3
100               Beer        Diaper      Chocolate
101               Milk        Chocolate   Shampoo
102               Beer        Wine        Vodka
103               Beer        Cheese      Diaper
104               Ice Cream   Diaper      Beer

Applications

Store planning:
- Placing associated items together (Milk & Bread)?
- May reduce basket total value (shoppers buy less unplanned merchandise)

Fraud detection:
- Finding in insurance data that a certain doctor often works with a certain lawyer may indicate potential fraudulent activity.

Sequential Patterns

Instead of finding associations between items in a single transaction, find associations between items bought by the same customer on different occasions.

Customer ID   Transaction Date   Item 1                  Item 2
AA            2/2/2001           Laptop                  Case
AA            1/13/2002          Wireless network card   Router
BB            4/5/2002           Laptop                  iPaq
BB            8/10/2002          Wireless network card   Router

Sequence: {Laptop}, {Wireless Card, Router}
A sequence has to satisfy some predetermined minimum support.

Clustering

What is clustering?
Clustering: the process of grouping a set of objects into classes of similar objects
Items within a cluster should be similar. Documents from different clusters should be dissimilar.

The commonest form of unsupervised learning


Unsupervised learning = learning from raw data, as opposed to supervised data where a classification of examples is given

A common and important task that finds many applications in IR and other places

Clustering
What is Clustering? Clustering can be considered the most important unsupervised learning problem; so, as with every other problem of this kind, it deals with finding a structure in a collection of unlabeled data. A loose definition of clustering could be "the process of organizing objects into groups whose members are similar in some way." A cluster is therefore a collection of objects which are similar to one another and dissimilar to the objects belonging to other clusters. We can show this with a simple graphical example:

A data set with clear cluster structure

How would you design an algorithm for finding the three clusters in this case?

Examples of Clustering Applications

- Marketing: discover customer groups and use them for targeted marketing and re-organization
- Astronomy: find groups of similar stars and galaxies
- Earthquake studies: observed earthquake epicenters should be clustered along continent faults
- Genomics: finding groups of genes with similar expression

Applications of clustering in IR
Whole corpus analysis/navigation
Better user interface: search without typing

For improving recall in search applications


Better search results (like pseudo RF)

For better navigation of search results


Effective user recall will be higher

For speeding up vector space retrieval


Cluster-based retrieval gives faster search

Yahoo! Hierarchy isn't clustering, but it is the kind of output you want from clustering
[Figure: a portion of the Yahoo! directory under www.yahoo.com/Science (30), with categories such as agriculture, biology, physics, CS, and space, and subcategories such as dairy, crops, agronomy, forestry, botany, cell, evolution, magnetism, relativity, AI, courses, HCI, craft, and missions]

Google News: automatic clustering gives an effective news presentation metaphor

Scatter/Gather: Cutting, Karger, and Pedersen

For visualizing a document collection and its themes


Wise et al, Visualizing the non-visual PNNL ThemeScapes, Cartia
[Mountain height = cluster size]

For improving search recall


Cluster hypothesis: documents in the same cluster behave similarly with respect to relevance to information needs.
Therefore, to improve search recall:
- Cluster docs in the corpus a priori
- When a query matches a doc D, also return other docs in the cluster containing D
Hope if we do this: the query "car" will also return docs containing "automobile", because clustering grouped together docs containing "car" with those containing "automobile".

Why might this happen?

For better navigation of search results


For grouping search results thematically

clusty.com / Vivisimo

Issues for clustering


Representation for clustering
Document representation
Vector space? Normalization?
Centroids aren't length normalized

Need a notion of similarity/distance

How many clusters?


Fixed a priori? Completely data driven?
Avoid trivial clusters - too large or small
In an application, if a cluster's too large, then for navigation purposes you've wasted an extra user click

What makes docs related?


Ideal: semantic similarity. Practical: statistical similarity
We will use cosine similarity. Docs as vectors. For many algorithms, easier to think in terms of a distance (rather than similarity) between docs. We will use Euclidean distance.

Clustering Algorithms
Flat algorithms
Usually start with a random (partial) partitioning Refine it iteratively
K means clustering (Model based clustering)

Hierarchical algorithms
Bottom-up, agglomerative (Top-down, divisive)

Hard vs. soft clustering


Hard clustering: Each document belongs to exactly one cluster
More common and easier to do

Soft clustering: A document can belong to more than one cluster.


Makes more sense for applications like creating browsable hierarchies You may want to put a pair of sneakers in two clusters: (i) sports apparel and (ii) shoes You can only do that with a soft clustering approach.

We won't do soft clustering today.

Partitioning Algorithms
Partitioning method: Construct a partition of n documents into a set of K clusters Given: a set of documents and the number K Find: a partition of K clusters that optimizes the chosen partitioning criterion
Globally optimal: exhaustively enumerate all partitions. Effective heuristic methods: K-means and K-medoids algorithms.

K-Means
Assumes documents are real-valued vectors. Clusters based on centroids (aka the center of gravity or mean) of points in a cluster, c:

μ(c) = (1/|c|) Σ_{x ∈ c} x
Reassignment of instances to clusters is based on distance to the current cluster centroids. (Or one can equivalently phrase it in terms of similarities)

K-Means Algorithm
Select K random docs {s1, s2, ..., sK} as seeds.
Until clustering converges (or other stopping criterion):
  For each doc di:
    Assign di to the cluster cj such that dist(xi, sj) is minimal.
  (Update the seeds to the centroid of each cluster)
  For each cluster cj:
    sj = μ(cj)
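A minimal NumPy sketch of the algorithm above, with Euclidean distance and random seeding; the toy data is ours, and this is not meant as an optimized implementation.

import numpy as np

def kmeans(X, K, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)]        # K random docs as seeds
    for _ in range(iters):
        # assign each doc to its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute each centroid as the mean of its assigned docs
        new = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                        else centroids[k] for k in range(K)])
        if np.allclose(new, centroids):                        # centroids unchanged: converged
            break
        centroids = new
    return labels, centroids

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
labels, centroids = kmeans(X, K=2)
print(labels)   # the two points near (1,1) and the two near (5,5) land in separate clusters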

K Means Example
(K=2)
[Figure: Pick seeds → Reassign clusters → Compute centroids → Reassign clusters → Compute centroids → Reassign clusters → Converged!]

Termination conditions
Several possibilities, e.g.,
- A fixed number of iterations.
- Doc partition unchanged.
- Centroid positions don't change.

Does this mean that the docs in a cluster are unchanged?

Convergence
Why should the K-means algorithm ever reach a fixed point?
A state in which clusters don't change.

K-means is a special case of a general procedure known as the Expectation Maximization (EM) algorithm.
EM is known to converge. Number of iterations could be large.
But in practice it usually isn't.


Convergence of K-Means
Define the goodness measure of cluster k as the sum of squared distances from the cluster centroid:

G_k = Σ_i (d_i − c_k)²   (sum over all d_i in cluster k)

G = Σ_k G_k
Reassignment monotonically decreases G since each vector is assigned to the closest centroid.

Convergence of K-Means
Recomputation monotonically decreases each G_k since (with m_k the number of members in cluster k):
Σ_i (d_i − a)² reaches its minimum where Σ_i −2(d_i − a) = 0, i.e. Σ_i d_i = m_k · a,
so a = (1/m_k) Σ_i d_i = c_k.
K-means typically converges quickly.

Time Complexity
Computing distance between two docs is O(m) where m is the dimensionality of the vectors. Reassigning clusters: O(Kn) distance computations, or O(Knm). Computing centroids: Each doc gets added once to some centroid: O(nm). Assume these two steps are each done once for I iterations: O(IKnm).

Seed Choice
Results can vary based on random seed selection. Some seeds can result in a poor convergence rate, or convergence to sub-optimal clusterings.
- Select good seeds using a heuristic (e.g., a doc least similar to any existing mean)
- Try out multiple starting points
- Initialize with the results of another method
Example showing sensitivity to seeds:

In the example above, if you start with B and E as centroids you converge to {A,B,C} and {D,E,F}; if you start with D and F you converge to {A,B,D,E} and {C,F}.
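One of the remedies above, trying out multiple starting points, can be sketched by running K-means several times and keeping the run with the lowest total within-cluster squared distance (the goodness measure G defined a few slides back). This reuses the kmeans() sketch from the K-Means Algorithm slide.

def best_of_restarts(X, K, restarts=10):
    best = None
    for s in range(restarts):                       # a different random seed per restart
        labels, centroids = kmeans(X, K, seed=s)
        G = sum(((X[labels == k] - centroids[k]) ** 2).sum() for k in range(K))
        if best is None or G < best[0]:
            best = (G, labels, centroids)
    return best                                     # (G, labels, centroids) of the best run

G, labels, centroids = best_of_restarts(X, K=2)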

How Many Clusters?


- Number of clusters K is given: partition n docs into a predetermined number of clusters
- Finding the "right" number of clusters is part of the problem: given docs, partition into an appropriate number of subsets.
  E.g., for query results the ideal value of K is not known up front, though the UI may impose limits.
- Can usually take an algorithm for one flavor and convert to the other.

K not specified in advance


Say, the results of a query. Solve an optimization problem: penalize having lots of clusters
application dependent, e.g., compressed summary of search results list.

Tradeoff between having more clusters (better focus within each cluster) and having too many clusters

K not specified in advance


Given a clustering, define the Benefit for a doc to be the cosine similarity to its centroid Define the Total Benefit to be the sum of the individual doc Benefits.
Why is there always a clustering of Total Benefit n?

Penalize lots of clusters


For each cluster, we have a Cost C. Thus for a clustering with K clusters, the Total Cost is KC.
Define the Value of a clustering to be: Total Benefit − Total Cost.

Find the clustering of highest value, over all choices of K.


Total benefit increases with increasing K. But we can stop when it doesn't increase by much. The Cost term enforces this.

K-means issues, variations, etc.


Recomputing the centroid after every assignment (rather than after all points are reassigned) can improve the speed of convergence of K-means.
Assumes clusters are spherical in vector space:
- Sensitive to coordinate changes, weighting, etc.

Disjoint and exhaustive


Doesn't have a notion of outliers by default, but outlier filtering can be added.

Hierarchical Clustering
Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents.
[Figure: dendrogram with root "animal", children "vertebrate" (fish, reptile, amphib., mammal) and "invertebrate" (worm, insect, crustacean)]

One approach: recursive application of a partitional clustering algorithm.

Dendrogram: Hierarchical Clustering


Clustering obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.

Hierarchical Agglomerative Clustering (HAC)


Starts with each doc in a separate cluster
then repeatedly joins the closest pair of clusters, until there is only one cluster.

The history of merging forms a binary tree or hierarchy.

Closest pair of clusters


Many variants for defining the "closest pair" of clusters:
- Single-link: similarity of the most cosine-similar pair
- Complete-link: similarity of the furthest points, i.e., the least cosine-similar pair
- Centroid: clusters whose centroids (centers of gravity) are the most cosine-similar

Single Link Agglomerative Clustering


Use maximum similarity of pairs:

sim(c_i, c_j) = max_{x ∈ c_i, y ∈ c_j} sim(x, y)

Can result in "straggly" (long and thin) clusters due to the chaining effect.
After merging c_i and c_j, the similarity of the resulting cluster to another cluster c_k is:

sim(c_i ∪ c_j, c_k) = max(sim(c_i, c_k), sim(c_j, c_k))

Single Link Example

Complete Link Agglomerative Clustering


Use minimum similarity of pairs: Makes tighter, spherical clusters that are typically preferable. After merging ci and cj, the similarity of the resulting cluster to another cluster, ck, is:

sim(ci ,c j ) = min sim( x, y )


xci , yc j

sim(( ci c j ), ck ) = min( sim(ci , ck ), sim(c j , ck ))


Ci Cj Ck

Complete Link Example
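A brief sketch of single- and complete-link HAC using SciPy. The toy points are ours, and SciPy's linkage works with distances rather than cosine similarities, but the "closest pair" definitions are the single/complete-link ones above.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0],    # a chain of nearby points
              [5.0, 5.0], [5.1, 5.0]])

Z_single = linkage(X, method="single")        # merge by closest pair (prone to chaining)
Z_complete = linkage(X, method="complete")    # merge by furthest pair (tighter clusters)

print(fcluster(Z_single, t=2, criterion="maxclust"))     # cut each dendrogram into 2 clusters
print(fcluster(Z_complete, t=2, criterion="maxclust"))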

Computational Complexity
In the first iteration, all HAC methods need to compute the similarity of all pairs of n individual instances, which is O(n²). In each of the subsequent n−2 merging iterations, compute the distance between the most recently created cluster and all other existing clusters. In order to maintain an overall O(n²) performance, computing the similarity to each other cluster must be done in constant time.

Group Average Agglomerative Clustering


Similarity of two clusters = average similarity of all pairs within the merged cluster:

sim(c_i, c_j) = (1 / (|c_i ∪ c_j| (|c_i ∪ c_j| − 1))) Σ_{x ∈ c_i ∪ c_j} Σ_{y ∈ c_i ∪ c_j, y ≠ x} sim(x, y)

Compromise between single and complete link.
Two options:
- Averaged across all ordered pairs in the merged cluster
- Averaged over all pairs between the two original clusters
No clear difference in efficacy.

Computing Group Average Similarity


Always maintain the sum of vectors in each cluster:

s(c_j) = Σ_{x ∈ c_j} x

Compute the similarity of clusters in constant time:

sim(c_i, c_j) = ((s(c_i) + s(c_j)) · (s(c_i) + s(c_j)) − (|c_i| + |c_j|)) / ((|c_i| + |c_j|) (|c_i| + |c_j| − 1))
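A quick NumPy check of the constant-time formula, under the assumption implicit in the slide that documents are unit-length vectors, so cosine similarity is a dot product and sim(x, x) = 1. The random vectors are only for illustration.

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 5)); A /= np.linalg.norm(A, axis=1, keepdims=True)   # cluster c_i
B = rng.normal(size=(3, 5)); B /= np.linalg.norm(B, axis=1, keepdims=True)   # cluster c_j
M = np.vstack([A, B])
n = len(M)

# naive: average sim(x, y) over all ordered pairs x != y in the merged cluster
S = M @ M.T
naive = (S.sum() - np.trace(S)) / (n * (n - 1))

# constant-time: ((s_i + s_j)·(s_i + s_j) - n) / (n (n - 1)), with s = sum of vectors
s = A.sum(axis=0) + B.sum(axis=0)
fast = (s @ s - n) / (n * (n - 1))

print(np.isclose(naive, fast))   # True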

What Is A Good Clustering?


Internal criterion: a good clustering will produce high-quality clusters in which:
- the intra-class (that is, intra-cluster) similarity is high
- the inter-class similarity is low
The measured quality of a clustering depends on both the document representation and the similarity measure used.

External criteria for clustering quality


Quality measured by its ability to discover some or all of the hidden patterns or latent classes in gold standard data Assesses a clustering with respect to ground truth requires labeled data Assume documents with C gold standard classes, while our clustering algorithms produce K clusters, 1, 2, , K with ni members.

External Evaluation of Cluster Quality


Simple measure: purity, the ratio between the size of the dominant class in cluster ω_i and the size of cluster ω_i:

Purity(ω_i) = (1/n_i) max_j (n_ij),  j ∈ C

Biased because having n clusters maximizes purity.
Others are entropy of classes in clusters (or mutual information between classes and clusters).

Purity example

[Figure: three example clusters drawn from three classes, with per-class counts Cluster I = (5, 1, 0), Cluster II = (1, 4, 1), Cluster III = (2, 0, 3)]
Cluster I: Purity = 1/6 (max(5, 1, 0)) = 5/6 Cluster II: Purity = 1/6 (max(1, 4, 1)) = 4/6 Cluster III: Purity = 1/5 (max(2, 0, 3)) = 3/5
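The purity numbers above can be reproduced in a few lines (the per-cluster class counts are read off the example figure):

def purity(class_counts):
    # class_counts: how many members of each gold class fall in this cluster
    return max(class_counts) / sum(class_counts)

clusters = {"I": [5, 1, 0], "II": [1, 4, 1], "III": [2, 0, 3]}
for name, counts in clusters.items():
    print(name, purity(counts))   # 5/6, 4/6, 3/5 as on the slide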

The Rand Index measures agreement between pair decisions. Here RI = 0.68.

Number of point pairs                Same cluster in clustering   Different clusters in clustering
Same class in ground truth           20 (A)                       24 (C)
Different classes in ground truth    20 (B)                       72 (D)

Rand Index and Cluster F-measure

RI = (A + D) / (A + B + C + D)

Compare with standard Precision and Recall:

P = A / (A + B)        R = A / (A + C)

People also define and use a cluster F-measure, which is probably a better measure.
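With the pair counts from the previous slide, the Rand Index and the pairwise precision/recall work out as follows (a small worked check; the F-measure line is the usual harmonic mean):

A, B, C, D = 20, 20, 24, 72          # pair counts from the table above

RI = (A + D) / (A + B + C + D)       # 92 / 136 ≈ 0.68
P = A / (A + B)                      # 0.50
R = A / (A + C)                      # ≈ 0.45
F = 2 * P * R / (P + R)              # pairwise ("cluster") F-measure
print(RI, P, R, F)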

Final word and resources


In clustering, clusters are inferred from the data without human input (unsupervised learning). However, in practice it's a bit less clear: there are many ways of influencing the outcome of clustering: number of clusters, similarity measure, representation of documents, ...

Discussion
Can interpret clusters by using supervised learning:
- learn a classifier based on clusters

Decrease dependence between attributes?
- pre-processing step
- e.g., use principal component analysis

Can be used to fill in missing values.

Key advantage of probabilistic clustering:
- Can estimate the likelihood of the data
- Use it to compare different models objectively

Clustering Summary
Unsupervised; many approaches:
- K-means: simple, sometimes useful
- K-medoids: less sensitive to outliers
- Hierarchical clustering: works for symbolic attributes

Evaluation is a problem.

S8: Association Rules and Clustering


Shawndra Hill Spring 2013 TR 1:30-3pm and 3-4:30
