
S8:

Association Rules and Clustering


Shawndra Hill Spring 2013 TR 1:30-3pm and 3-4:30



DSS Course Outline


Introduction to Modeling & Data Mining
- Fundamental concepts and terminology
Data Mining methods
- Classification: decision trees, association rules, clustering and segmentation, collaborative filtering, genetic algorithms, etc.
- Inner workings
- Strengths and weaknesses
Evaluation
- How to evaluate the results of a data mining solution
Applications
- Real-world business problems DM can be applied to

Recall?

Instance-based (lazy) versus eager
Incremental versus not
Unsupervised versus supervised

Classification Model
Historical/Training Data → Classification Algorithms (Decision Tree, Naïve Bayes, KNN, ..., Your Classification Algorithm Here) → Classifier: Applicant → Class (yes/no)

NAME   Balance    Age   Default
Mike   23,000     30    yes
Mary   51,100     40    yes
Bill   68,000     55    no
Jim    74,000     46    no
Dave   23,000     47    yes
Anne   100,000    49    no

Prediction (Recall Our Goal)

Inputs (Independent Variables)      Outputs (Target, Class, Dependent Variable)
Income     Num. Kids                Risk of Defaulting
100,000    2                        0.12
30,000     2                        0.98

For traditional learning methods, you will need a table of i.i.d. records. i.i.d.?

Unsupervised Learning (Recall Our Goal)

Inputs (Independent Variables)
Income     Num. Kids
100,000    2
30,000     2

Is there anything useful on MMBs?

http://www.mmm-online.com/avandia-warning-signs-seen-online-as-early-as-04/printarticle/177982/

Social Mining for More than Cent$

What Can You Learn by Eavesdropping on a Million Conversations (on the Web)?

What do breast cancer patients talk about the most?


[Table: drug-symptom association rules mined from breast cancer message board posts, with columns Drug, Symptom, Count, Lift, P_Value, and label (IND / AE / FP). Drugs include citrate calcium, cis platinum, tums, glycerin suppositories, lactic acid, natural tears, medroxyprogesterone acetate, liquid vitamin, nystatin, artificial tears, norvasc, lisinopril, klonopin, senokot, eye drops, benzonatate, simply sleep, and zyrtec; symptoms include constipating, amnesia, appetite lost, acidity, watery eyes, fluid retention, heart burn, oral thrush, dry eyes, chest pain, muscle weakness, anxiousness, acid indigestion, eyes water, cough, restless, and nasal drip. Counts range from 2 to 6, lift values from about 990 to 18,600, and p-values from 2.6E-19 to 2.0E-06.]

Slide annotations: "Recipes?"; "All chemo drugs, makes sense"; "Isotope?"; "All well and good, but not sure how much can be gleaned from this..."

Is there anything useful on MMBs?


Arimidex: hormonal therapy for breast cancer

Side effect      Count      Side effect        Count
joint pain       925        hair loss          88
hot flashes      661        muscle pain        74
menopause        434        night sweat        63
bone pain        265        joint stiffness    47
bone loss        243        trigger finger     41
weight gain      219        mood swings        38
arthritis        207        fracture           37
osteoporosis     134        dry eye            35
depression       125        vaginal dryness    32
sleeplessness    113        high cholesterol   29

Not on label!

Our Social TV Recommendation System Approach

Data: Collection

Amazon Mechanical Turk: show Twitter handles for the 572 current show names (Jan-May 2012)
IMDB TV crawler: television content features
Twitter API: show network, status updates, local network

Models: Overview

Network models:
- Show network confidence *
- Follower social network *
- Network popularity

Show follower feature models:
- Gender
- Location
- General demographics-based

User-generated text models:
- TF-IDF transform on all words *
- TF-IDF transform on all words less show-related words
- TF-IDF on show-related words

Show feature model:
- Show content similarity

Matrix factorization

Random:
- Randomly selected shows
- Randomly selected words
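As a rough illustration of the user-generated text models listed above, a TF-IDF transform over each follower's tweets can be built with scikit-learn. This is only a sketch of the idea, not the project's actual pipeline; the sample tweets and the show-related word list below are hypothetical.

# Minimal sketch (assumed data): TF-IDF vectors from each user's tweet text
from sklearn.feature_extraction.text import TfidfVectorizer

user_tweets = [                                   # hypothetical concatenated tweets per user
    "watching american idol tonight with friends",
    "the voice finale was great last night",
]
show_related_words = ["idol", "voice", "duets"]   # hypothetical show-related word list

# "all words less show-related words" variant: treat show words as stop words
vectorizer = TfidfVectorizer(stop_words=show_related_words)
X = vectorizer.fit_transform(user_tweets)         # users x vocabulary TF-IDF matrix
print(X.shape)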

Data: TV show follower network


[Figure: show follower network; S1 = American Idol, S2 = The Voice, S3 = Duets]

Over 19 million unique followers

Data: Sampled followers


Identified all users who followed >= 2 shows (~5.5 million)
Randomly sampled up to 1,000 users in each show's local network

[Figure: S1 = American Idol, S2 = The Voice, S3 = Duets; u = Shawndra, v = Adrian, w = Jin]

Data: Status updates


Collected up to the past 400 tweets from each user in the follower sample
Removed user u if:
- language != en
- |Followers(u)| > 2000

Sample of 114K users

Experiment framework
10-fold cross validation over 114K users

function VALIDATE(Engine e, List[Set[user]] tests, List[Set[user]] trains) {
    List[Result] results = [];
    FOR (i IN 1:10) {
        Model m = TRAIN(e, trains[i]);
        FOR (u IN tests[i]) {
            Show randShow = GET_RANDOM_SHOW(u);
            List[Show] recommended = PREDICT(m, u, randShow);
            results += GET_PERFORMANCE(recommended, u, randShow);
        }
    }
    RETURN (SUM(results) / 10);
}

Validation metrics

Models: Show network confidence

Compute a similarity matrix between all shows based on confidence from the show network.

[Figure: S1 = American Idol, S2 = The Voice; u = Shawndra, v = Adrian]

Models: Follower social network

For a test user u, find all neighbors who are connected to TV shows.
Rank recommendations by the number of neighbors following each show.

[Figure: S1 = American Idol, S2 = The Voice; u = Shawndra; Ranking: S1, S2]

Considered user neighbors to have either the follower, friend, or reciprocally linked relation.

Models: Network popularity

Simply rank by the number of Twitter followers.
Ignores features of the input user and show.

[Figure: S1 = American Idol, S2 = The Voice, S3 = Duets; Ranking: S2, S3, S1]

Results: Precision

Results: Recall

Results: Text-based English only

Restricting to standard English words only results in a similar level of performance (4 million → 40,000 tokens).

Visualization of the Similarity Matrix

http://www.thesocialtvlab.com/adrian/videos/ http://thesocialtvlab.com/adrian/network_vis/interactive_network_recommender/

Learning Association Rules from Data


A descriptive approach for discovering relevant and valid associations among items in the data. E.g.,

If {buy diapers} Then {buy beer}

Market Basket Analysis


Transaction No.   Item 1      Item 2      Item 3
100               Beer        Diapers     Chocolate
101               Milk        Chocolate   Shampoo
102               Beer        Wine        Vodka
103               Beer        Cheese      Diapers
104               Ice Cream   Diapers     Beer

Examples:
Shoppers who buy ice cream are very likely to buy beer: If {buy ice cream} Then {buy beer}
Shoppers who buy Beer and Wine are likely to buy Cheese and Chocolate: If {buy Beer, Wine} Then {buy Cheese, Chocolate}

Association Rules

Rule format: If {set of items} Then {set of items}
                    body                 head
Example: If {Diapers, Baby Food} Then {Beer, Wine}
Body implies Head

Market basket analysis


Basket data: a collection of transactions, each consisting of a set of items bought in the same transaction.
Association rules from basket data: learn which items are frequently bought together.

Applications
Store planning: placing associated items together (Milk & Bread)?
- May reduce basket total value (shoppers buy less unplanned merchandise)
Customer segmentation based on buying behavior
Cross-marketing
Catalog design, etc.
Fraud detection: finding in insurance data that a certain doctor often works with a certain lawyer may indicate potential fraudulent activity.

Evaluation of Association Rules

What rules should be considered valid?
If {Diapers} Then {Beer}   (body → head)
An association rule is valid if it satisfies some evaluation measures.

Rule Evaluation

Support
- Milk & Wine co-occur, but only 2 out of 200K transactions contain these items

Transaction No.   Item 1      Item 2      Item 3
100               Beer        Diapers     Chocolate
101               Milk        Chocolate   Wine
102               Beer        Wine        Vodka
103               Beer        Cheese      Diapers
104               Ice Cream   Diapers     Beer
...

Rule Evaluation
Significance is measured by Support: the frequency with which the items in body and head co-occur.

Support = (No. of transactions containing the items in both body and head) / (Total no. of transactions in the database)

E.g., the support of the rule If {Diapers} Then {Beer} (body = {Diapers}, head = {Beer}) is 3/5: 60% of the transactions contain both items.

Transaction No.   Item 1      Item 2      Item 3
100               Beer        Diapers     Chocolate
101               Milk        Chocolate   Shampoo
102               Beer        Wine        Vodka
103               Beer        Cheese      Diapers
104               Ice Cream   Diapers     Beer
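As a small illustration of this definition, support can be computed directly from the transaction table above; the helper below is ours, not from the lecture.

# Support = fraction of transactions that contain every item in body ∪ head
transactions = [
    {"Beer", "Diapers", "Chocolate"},      # 100
    {"Milk", "Chocolate", "Shampoo"},      # 101
    {"Beer", "Wine", "Vodka"},             # 102
    {"Beer", "Cheese", "Diapers"},         # 103
    {"Ice Cream", "Diapers", "Beer"},      # 104
]

def support(itemset, transactions):
    hits = sum(1 for t in transactions if itemset <= t)   # itemset is a subset of t
    return hits / len(transactions)

print(support({"Diapers", "Beer"}, transactions))   # 0.6, i.e. the 3/5 from the example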

Rule Evaluation

Strength of the Implication

Which implication is stronger? Of the transactions containing Milk:
- 1% contain Wine:  If {Milk} Then {Wine}
- 20% contain Beer: If {Milk} Then {Beer}

Rule Evaluation
A rule's strength is measured by its confidence: how strongly does the body imply the head?
Confidence: the proportion of transactions containing the body that also contain the head.

Confidence = (No. of transactions containing both body and head) / (No. of transactions containing the body)

Example: the confidence of the rule If {Diapers} Then {Beer} is 3/3, i.e., 100% of the transactions that contain diapers also contain beer.

Transaction No.   Item 1      Item 2      Item 3
100               Beer        Diapers     Chocolate
101               Milk        Chocolate   Shampoo
102               Beer        Wine        Vodka
103               Beer        Cheese      Diapers
104               Ice Cream   Diapers     Beer
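Continuing the small sketch from the support slide, confidence follows directly from the same support helper (an illustration only).

def confidence(body, head, transactions):
    # Confidence = support(body ∪ head) / support(body)
    return support(body | head, transactions) / support(body, transactions)

print(confidence({"Diapers"}, {"Beer"}, transactions))   # 1.0, i.e. 3/3 as in the example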

Rule Evaluation: Confidence

Is the rule {Milk} → {Wine} equivalent to the rule {Wine} → {Milk}? When is the implication {Milk} → {Wine} more likely than the reverse?

Rule Evaluation

Example: Lift
Consider the rule: If {Milk} Then {Beer}

Assume: Support 20%, Confidence 100%. Now assume confidence is 60%.

What if 60% of shoppers buy beer?

What if all shoppers at the store buy beer?

Find rules where the frequency of the head given the body > the expected frequency of the head.

More Evaluation Criteria

Lift
Measures how much more likely the head is given the body than the head alone (confidence / frequency of head).
Example: If {Milk} Then {Beer}
- Total number of customers in the database: 1000
- No. of customers buying Milk: 200
- No. of customers buying Beer: 50
- No. of customers buying Milk & Beer: 20

Frequency of head: 50/1000 (5%)
Confidence: 20/200 (10%)
Lift: 10% / 5% = 2
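The lift computation on this slide can be reproduced with the raw counts alone; a quick worked check:

total = 1000              # customers in the database
milk = 200                # customers buying Milk (body)
beer = 50                 # customers buying Beer (head)
milk_and_beer = 20        # customers buying both

head_freq = beer / total              # 0.05  (5%)
conf = milk_and_beer / milk           # 0.10  (10%)
lift = conf / head_freq               # 2.0
print(head_freq, conf, lift)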

Comparison to Traditional Database Queries

Traditional methods such as database queries support hypothesis verification about a relationship, such as the co-occurrence of diapers & beer.

Transaction No.   Item 1      Item 2      Item 3
100               Beer        Diapers     Chocolate
101               Milk        Chocolate   Shampoo
102               Beer        Wine        Vodka
103               Beer        Cheese      Diapers
104               Ice Cream   Diapers     Beer

Comparison to Traditional Database Queries

Data Mining: explore the data for patterns. Data mining methods automatically discover significant association rules from data.
They find whatever patterns exist in the database, without the user having to specify in advance what to look for (data driven), and therefore allow finding unexpected correlations.

Algorithms to Extract Association Rules

The standard method was developed by Agrawal et al. The Association Rules problem was defined as:
Generate all association rules that have support greater than the user-specified minsup (minimum support) and confidence greater than the user-specified minconf (minimum confidence).

The algorithm performs an efficient search over the data to find all such rules.

Extracting Association Rules from Data

The problem is decomposed into two sub-problems:

Find all sets of items (called itemsets) that co-occur in at least minsup of the transactions in the database.
Itemsets with at least minimum support are called frequent itemsets (or large itemsets).

Extracting Association Rules from Data

Example
- A data set with 5 transactions
- Minsup = 40%, Minconf = 80%

Phase 1: Find all frequent itemsets
{Beer} (support=80%), {Diaper} (60%), {Chocolate} (40%), {Beer, Diaper} (60%)

Transaction No.   Item 1      Item 2      Item 3
100               Beer        Diaper      Chocolate
101               Milk        Chocolate   Shampoo
102               Beer        Wine        Vodka
103               Beer        Cheese      Diaper
104               Ice Cream   Diaper      Beer

Phase 1: Mining Association Rules Example

Phase 1: Finding all frequent itemsets

How can we perform an efficient search over all frequent itemsets?
Note: if {diaper, beer} is frequent, then {diaper} and {beer} are each frequent as well.
This means that if an itemset is not frequent (e.g., {wine}), then no itemset that contains wine, such as {wine, beer}, can be frequent either.

Phase 1: Mining Association Rules Example

Example: assume the following itemsets of size 1 were found to be frequent: {Milk}, {Bread}, {Butter}.
Since {Wine} is not frequent, {Wine, Butter} cannot be frequent. Only if both {Wine} and {Butter} were frequent could {Wine, Butter} be frequent.
Therefore:
1. Find all itemsets of size 1 that are frequent.
2. To find out which itemsets of size 2 are frequent, count the frequency only of itemsets of size 2 whose two items are among Milk, Bread, Butter.
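A compact sketch of Phase 1 in the spirit of the pruning idea above: only count candidate itemsets built from itemsets that are already frequent. This is an illustration, not Agrawal et al.'s optimized algorithm.

def frequent_itemsets(transactions, minsup):
    n = len(transactions)
    items = {i for t in transactions for i in t}
    # frequent itemsets of size 1
    current = {frozenset([i]) for i in items
               if sum(i in t for t in transactions) / n >= minsup}
    frequent = set(current)
    k = 2
    while current:
        # candidate k-itemsets: unions of frequent (k-1)-itemsets
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        current = {c for c in candidates
                   if sum(c <= t for t in transactions) / n >= minsup}
        frequent |= current
        k += 1
    return frequent

trans = [{"Beer", "Diaper", "Chocolate"}, {"Milk", "Chocolate", "Shampoo"},
         {"Beer", "Wine", "Vodka"}, {"Beer", "Cheese", "Diaper"},
         {"Ice Cream", "Diaper", "Beer"}]
print(frequent_itemsets(trans, 0.4))
# {Beer}, {Diaper}, {Chocolate}, {Beer, Diaper} -- as in the 5-transaction example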

Phase 2: Mining Association Rules Example

Assume {Milk, Bread, Butter} is a frequent itemset.

For each frequent itemset, find all possible rules Body → Head (using the items contained in the itemset). Example:
Does {Milk} → {Bread, Butter} satisfy minimum confidence?
What about {Bread} → {Milk, Butter}, {Butter} → {Milk, Bread}, {Bread, Butter} → {Milk}, {Milk, Butter} → {Bread}, {Milk, Bread} → {Butter}?

To calculate the confidence of the rule {Milk} → {Bread, Butter}:

Confidence = Support({Milk, Bread, Butter}) / Support({Milk}) = (No. of transactions that contain {Milk, Bread, Butter}) / (No. of transactions that contain {Milk})
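Phase 2 can then be sketched as enumerating every non-empty proper subset of a frequent itemset as a body, exactly as described above (helper names are ours).

from itertools import combinations

def rules_from_itemset(itemset, transactions, minconf):
    itemset = frozenset(itemset)
    n = len(transactions)
    sup = lambda s: sum(s <= t for t in transactions) / n
    rules = []
    for r in range(1, len(itemset)):
        for body in map(frozenset, combinations(itemset, r)):
            head = itemset - body
            conf = sup(itemset) / sup(body)       # confidence of body -> head
            if conf >= minconf:
                rules.append((set(body), set(head), conf))
    return rules

# e.g. with the 5-transaction example, rules_from_itemset({"Beer", "Diaper"}, trans, 0.8)
# keeps {Diaper} -> {Beer} (confidence 1.0) and drops {Beer} -> {Diaper} (confidence 0.75)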

Association

If the rule {Yogurt} → {Bread, Butter} is found to have minimum confidence, does it mean the rule {Bread, Butter} → {Yogurt} also has minimum confidence?
Example:
- Support of {Yogurt} is 20%, support of {Yogurt, Bread, Butter} is 10%, support of {Bread, Butter} is 50%
- Confidence of {Yogurt} → {Bread, Butter} is 10%/20% = 50%
- Confidence of {Bread, Butter} → {Yogurt} is 10%/50% = 20%

Back to the example

Minsup = 40%, Minconf = 80%

Phase 1: Find all large itemsets
Large itemsets: {Beer} (support=80%), {Diaper} (60%), {Chocolate} (40%), {Beer, Diaper} (60%)

Phase 2: For each frequent itemset of size >= 2 (containing two or more items), find all rules that satisfy minimum confidence.
{Beer} → {Diaper}: confidence = 3/4 (75%), not sufficient confidence
{Diaper} → {Beer}: confidence = 3/3 (100%)

Transaction No.   Item 1      Item 2      Item 3
100               Beer        Diaper      Chocolate
101               Milk        Chocolate   Shampoo
102               Beer        Wine        Vodka
103               Beer        Cheese      Diaper
104               Ice Cream   Diaper      Beer

Applications

Store planning:
- Placing associated items together (Milk & Bread)?
- May reduce basket total value (shoppers buy less unplanned merchandise)

Fraud detection:
- Finding in insurance data that a certain doctor often works with a certain lawyer may indicate potential fraudulent activity.

Sequential Patterns

Instead of finding associations between items in a single transaction, find associations between items bought by the same customer on different occasions.

Customer ID   Transaction Date   Item 1                  Item 2
AA            2/2/2001           Laptop                  Case
AA            1/13/2002          Wireless network card   Router
BB            4/5/2002           Laptop                  iPaq
BB            8/10/2002          Wireless network card   Router

Sequence: {Laptop}, {Wireless Card, Router}
A sequence has to satisfy some predetermined minimum support.

Clustering

What is clustering?
Clustering: the process of grouping a set of objects into classes of similar objects
Items within a cluster should be similar. Documents from different clusters should be dissimilar.

The commonest form of unsupervised learning


Unsupervised learning = learning from raw data, as opposed to supervised data where a classification of examples is given

A common and important task that finds many applications in IR and other places

Clustering
What is Clustering? Clustering can be considered the most important unsupervised learning problem; so, as with every other problem of this kind, it deals with finding a structure in a collection of unlabeled data. A loose definition of clustering could be "the process of organizing objects into groups whose members are similar in some way." A cluster is therefore a collection of objects which are similar to one another and dissimilar to the objects belonging to other clusters. We can show this with a simple graphical example:

A data set with clear cluster structure

How would you design an algorithm for finding the three clusters in this case?

Examples of Clustering Applications

- Marketing: discover customer groups and use them for targeted marketing and re-organization
- Astronomy: find groups of similar stars and galaxies
- Earthquake studies: observed earthquake epicenters should be clustered along continent faults
- Genomics: finding groups of genes with similar expression

Applications of clustering in IR
Whole corpus analysis/navigation
Better user interface: search without typing

For improving recall in search applications


Better search results (like pseudo RF)

For better navigation of search results


Effective user recall will be higher

For speeding up vector space retrieval


Cluster-based retrieval gives faster search

Yahoo! Hierarchy isn't clustering, but it is the kind of output you want from clustering
[Figure: a portion of the Yahoo! directory under www.yahoo.com/Science (30), with categories such as agriculture, biology, physics, CS, and space, and subcategories such as dairy, crops, agronomy, forestry, botany, cell, evolution, magnetism, relativity, AI, courses, HCI, craft, and missions]

Google News: automatic clustering gives an effective news presentation metaphor

Scatter/Gather: Cutting, Karger, and Pedersen

For visualizing a document collection and its themes


Wise et al, Visualizing the non-visual PNNL ThemeScapes, Cartia
[Mountain height = cluster size]

For improving search recall


Cluster hypothesis: documents in the same cluster behave similarly with respect to relevance to information needs.
Therefore, to improve search recall:
- Cluster docs in the corpus a priori
- When a query matches a doc D, also return other docs in the cluster containing D
Hope if we do this: the query "car" will also return docs containing "automobile", because clustering grouped together docs containing "car" with those containing "automobile".

Why might this happen?

For better navigation of search results


For grouping search results thematically

clusty.com / Vivisimo

Issues for clustering


Representation for clustering
Document representation
Vector space? Normalization?
Centroids aren't length normalized

Need a notion of similarity/distance

How many clusters?


Fixed a priori? Completely data driven?
Avoid trivial clusters - too large or small
In an application, if a cluster's too large, then for navigation purposes you've wasted an extra user click

What makes docs related?


Ideal: semantic similarity. Practical: statistical similarity
We will use cosine similarity. Docs as vectors. For many algorithms, easier to think in terms of a distance (rather than similarity) between docs. We will use Euclidean distance.

Clustering Algorithms
Flat algorithms
Usually start with a random (partial) partitioning Refine it iteratively
K means clustering (Model based clustering)

Hierarchical algorithms
Bottom-up, agglomerative (Top-down, divisive)

Hard vs. soft clustering


Hard clustering: Each document belongs to exactly one cluster
More common and easier to do

Soft clustering: A document can belong to more than one cluster.


Makes more sense for applications like creating browsable hierarchies You may want to put a pair of sneakers in two clusters: (i) sports apparel and (ii) shoes You can only do that with a soft clustering approach.

We won't do soft clustering today.

Partitioning Algorithms
Partitioning method: Construct a partition of n documents into a set of K clusters Given: a set of documents and the number K Find: a partition of K clusters that optimizes the chosen partitioning criterion
Globally optimal: exhaustively enumerate all partitions. Effective heuristic methods: K-means and K-medoids algorithms.

K-Means
Assumes documents are real-valued vectors. Clusters based on centroids (aka the center of gravity or mean) of points in a cluster, c:

μ(c) = (1/|c|) Σ_{x ∈ c} x
Reassignment of instances to clusters is based on distance to the current cluster centroids. (Or one can equivalently phrase it in terms of similarities)

K-Means Algorithm
Select K random docs {s1, s2, ..., sK} as seeds.
Until clustering converges (or other stopping criterion):
  For each doc di:
    Assign di to the cluster cj such that dist(xi, sj) is minimal.
  (Update the seeds to the centroid of each cluster)
  For each cluster cj:
    sj = μ(cj)
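A minimal NumPy sketch of the algorithm above, with Euclidean distance and random seeding; the toy data is ours, and this is not meant as an optimized implementation.

import numpy as np

def kmeans(X, K, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)]        # K random docs as seeds
    for _ in range(iters):
        # assign each doc to its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute each centroid as the mean of its assigned docs
        new = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                        else centroids[k] for k in range(K)])
        if np.allclose(new, centroids):                        # centroids unchanged: converged
            break
        centroids = new
    return labels, centroids

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
labels, centroids = kmeans(X, K=2)
print(labels)   # the two points near (1,1) and the two near (5,5) land in separate clusters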

K Means Example
(K=2)
[Figure: Pick seeds → Reassign clusters → Compute centroids → Reassign clusters → Compute centroids → Reassign clusters → Converged!]

Termination conditions
Several possibilities, e.g.,
- A fixed number of iterations.
- Doc partition unchanged.
- Centroid positions don't change.

Does this mean that the docs in a cluster are unchanged?

Convergence
Why should the K-means algorithm ever reach a fixed point?
A state in which clusters don't change.

K-means is a special case of a general procedure known as the Expectation Maximization (EM) algorithm.
EM is known to converge. Number of iterations could be large.
But in practice it usually isn't.


Convergence of K-Means
Define the goodness measure of cluster k as the sum of squared distances from the cluster centroid:

G_k = Σ_i (d_i − c_k)²   (sum over all d_i in cluster k)

G = Σ_k G_k
Reassignment monotonically decreases G since each vector is assigned to the closest centroid.

Convergence of K-Means
Recomputation monotonically decreases each G_k since (with m_k the number of members in cluster k):
Σ_i (d_i − a)² reaches its minimum where Σ_i −2(d_i − a) = 0, i.e. Σ_i d_i = m_k · a,
so a = (1/m_k) Σ_i d_i = c_k.
K-means typically converges quickly.

Time Complexity
Computing distance between two docs is O(m) where m is the dimensionality of the vectors. Reassigning clusters: O(Kn) distance computations, or O(Knm). Computing centroids: Each doc gets added once to some centroid: O(nm). Assume these two steps are each done once for I iterations: O(IKnm).

Seed Choice
Results can vary based on random seed selection. Some seeds can result in a poor convergence rate, or convergence to sub-optimal clusterings.
- Select good seeds using a heuristic (e.g., a doc least similar to any existing mean)
- Try out multiple starting points
- Initialize with the results of another method
Example showing sensitivity to seeds:

In the example above, if you start with B and E as centroids you converge to {A,B,C} and {D,E,F}; if you start with D and F you converge to {A,B,D,E} and {C,F}.
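One of the remedies above, trying out multiple starting points, can be sketched by running K-means several times and keeping the run with the lowest total within-cluster squared distance (the goodness measure G defined a few slides back). This reuses the kmeans() sketch from the K-Means Algorithm slide.

def best_of_restarts(X, K, restarts=10):
    best = None
    for s in range(restarts):                       # a different random seed per restart
        labels, centroids = kmeans(X, K, seed=s)
        G = sum(((X[labels == k] - centroids[k]) ** 2).sum() for k in range(K))
        if best is None or G < best[0]:
            best = (G, labels, centroids)
    return best                                     # (G, labels, centroids) of the best run

G, labels, centroids = best_of_restarts(X, K=2)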

How Many Clusters?


- Number of clusters K is given: partition n docs into a predetermined number of clusters
- Finding the "right" number of clusters is part of the problem: given docs, partition into an appropriate number of subsets.
  E.g., for query results the ideal value of K is not known up front, though the UI may impose limits.
- Can usually take an algorithm for one flavor and convert to the other.

K not specified in advance


Say, the results of a query. Solve an optimization problem: penalize having lots of clusters
application dependent, e.g., compressed summary of search results list.

Tradeoff between having more clusters (better focus within each cluster) and having too many clusters

K not specified in advance


Given a clustering, define the Benefit for a doc to be the cosine similarity to its centroid Define the Total Benefit to be the sum of the individual doc Benefits.
Why is there always a clustering of Total Benefit n?

Penalize lots of clusters


For each cluster, we have a Cost C. Thus for a clustering with K clusters, the Total Cost is KC.
Define the Value of a clustering to be: Total Benefit − Total Cost.

Find the clustering of highest value, over all choices of K.


Total benefit increases with increasing K. But we can stop when it doesn't increase by much. The Cost term enforces this.

K-means issues, variations, etc.


Recomputing the centroid after every assignment (rather than after all points are reassigned) can improve the speed of convergence of K-means.
Assumes clusters are spherical in vector space:
- Sensitive to coordinate changes, weighting, etc.

Disjoint and exhaustive


Doesn't have a notion of outliers by default, but outlier filtering can be added.

Hierarchical Clustering
Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents.
[Figure: dendrogram with root "animal", children "vertebrate" (fish, reptile, amphib., mammal) and "invertebrate" (worm, insect, crustacean)]

One approach: recursive application of a partitional clustering algorithm.

Dendrogram: Hierarchical Clustering


Clustering obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.

Hierarchical Agglomerative Clustering (HAC)


Starts with each doc in a separate cluster
then repeatedly joins the closest pair of clusters, until there is only one cluster.

The history of merging forms a binary tree or hierarchy.

Closest pair of clusters


Many variants for defining the "closest pair" of clusters:
- Single-link: similarity of the most cosine-similar pair
- Complete-link: similarity of the furthest points, i.e., the least cosine-similar pair
- Centroid: clusters whose centroids (centers of gravity) are the most cosine-similar

Single Link Agglomerative Clustering


Use maximum similarity of pairs:

sim(c_i, c_j) = max_{x ∈ c_i, y ∈ c_j} sim(x, y)

Can result in "straggly" (long and thin) clusters due to the chaining effect.
After merging c_i and c_j, the similarity of the resulting cluster to another cluster c_k is:

sim(c_i ∪ c_j, c_k) = max(sim(c_i, c_k), sim(c_j, c_k))

Single Link Example

Complete Link Agglomerative Clustering


Use minimum similarity of pairs: Makes tighter, spherical clusters that are typically preferable. After merging ci and cj, the similarity of the resulting cluster to another cluster, ck, is:

sim(ci ,c j ) = min sim( x, y )


xci , yc j

sim(( ci c j ), ck ) = min( sim(ci , ck ), sim(c j , ck ))


Ci Cj Ck

Complete Link Example
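A brief sketch of single- and complete-link HAC using SciPy. The toy points are ours, and SciPy's linkage works with distances rather than cosine similarities, but the "closest pair" definitions are the single/complete-link ones above.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0],    # a chain of nearby points
              [5.0, 5.0], [5.1, 5.0]])

Z_single = linkage(X, method="single")        # merge by closest pair (prone to chaining)
Z_complete = linkage(X, method="complete")    # merge by furthest pair (tighter clusters)

print(fcluster(Z_single, t=2, criterion="maxclust"))     # cut each dendrogram into 2 clusters
print(fcluster(Z_complete, t=2, criterion="maxclust"))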

Computational Complexity
In the first iteration, all HAC methods need to compute the similarity of all pairs of n individual instances, which is O(n²). In each of the subsequent n−2 merging iterations, compute the distance between the most recently created cluster and all other existing clusters. In order to maintain an overall O(n²) performance, computing the similarity to each other cluster must be done in constant time.

Group Average Agglomerative Clustering


Similarity of two clusters = average similarity of all pairs within the merged cluster:

sim(c_i, c_j) = (1 / (|c_i ∪ c_j| (|c_i ∪ c_j| − 1))) Σ_{x ∈ c_i ∪ c_j} Σ_{y ∈ c_i ∪ c_j, y ≠ x} sim(x, y)

Compromise between single and complete link.
Two options:
- Averaged across all ordered pairs in the merged cluster
- Averaged over all pairs between the two original clusters
No clear difference in efficacy.

Computing Group Average Similarity


Always maintain the sum of vectors in each cluster:

s(c_j) = Σ_{x ∈ c_j} x

Compute the similarity of clusters in constant time:

sim(c_i, c_j) = ((s(c_i) + s(c_j)) · (s(c_i) + s(c_j)) − (|c_i| + |c_j|)) / ((|c_i| + |c_j|) (|c_i| + |c_j| − 1))
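A quick NumPy check of the constant-time formula, under the assumption implicit in the slide that documents are unit-length vectors, so cosine similarity is a dot product and sim(x, x) = 1. The random vectors are only for illustration.

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 5)); A /= np.linalg.norm(A, axis=1, keepdims=True)   # cluster c_i
B = rng.normal(size=(3, 5)); B /= np.linalg.norm(B, axis=1, keepdims=True)   # cluster c_j
M = np.vstack([A, B])
n = len(M)

# naive: average sim(x, y) over all ordered pairs x != y in the merged cluster
S = M @ M.T
naive = (S.sum() - np.trace(S)) / (n * (n - 1))

# constant-time: ((s_i + s_j)·(s_i + s_j) - n) / (n (n - 1)), with s = sum of vectors
s = A.sum(axis=0) + B.sum(axis=0)
fast = (s @ s - n) / (n * (n - 1))

print(np.isclose(naive, fast))   # True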

What Is A Good Clustering?


Internal criterion: a good clustering will produce high-quality clusters in which:
- the intra-class (that is, intra-cluster) similarity is high
- the inter-class similarity is low
The measured quality of a clustering depends on both the document representation and the similarity measure used.

External criteria for clustering quality


Quality measured by its ability to discover some or all of the hidden patterns or latent classes in gold standard data Assesses a clustering with respect to ground truth requires labeled data Assume documents with C gold standard classes, while our clustering algorithms produce K clusters, 1, 2, , K with ni members.

External Evaluation of Cluster Quality


Simple measure: purity, the ratio between the size of the dominant class in cluster ω_i and the size of cluster ω_i:

Purity(ω_i) = (1/n_i) max_j (n_ij),  j ∈ C

Biased because having n clusters maximizes purity.
Others are entropy of classes in clusters (or mutual information between classes and clusters).

Purity example

[Figure: three example clusters drawn from three classes, with per-class counts Cluster I = (5, 1, 0), Cluster II = (1, 4, 1), Cluster III = (2, 0, 3)]
Cluster I: Purity = 1/6 (max(5, 1, 0)) = 5/6 Cluster II: Purity = 1/6 (max(1, 4, 1)) = 4/6 Cluster III: Purity = 1/5 (max(2, 0, 3)) = 3/5
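The purity numbers above can be reproduced in a few lines (the per-cluster class counts are read off the example figure):

def purity(class_counts):
    # class_counts: how many members of each gold class fall in this cluster
    return max(class_counts) / sum(class_counts)

clusters = {"I": [5, 1, 0], "II": [1, 4, 1], "III": [2, 0, 3]}
for name, counts in clusters.items():
    print(name, purity(counts))   # 5/6, 4/6, 3/5 as on the slide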

The Rand Index measures agreement between pair decisions. Here RI = 0.68.

Number of point pairs                Same cluster in clustering   Different clusters in clustering
Same class in ground truth           20 (A)                       24 (C)
Different classes in ground truth    20 (B)                       72 (D)

Rand Index and Cluster F-measure

RI = (A + D) / (A + B + C + D)

Compare with standard Precision and Recall:

P = A / (A + B)        R = A / (A + C)

People also define and use a cluster F-measure, which is probably a better measure.
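With the pair counts from the previous slide, the Rand Index and the pairwise precision/recall work out as follows (a small worked check; the F-measure line is the usual harmonic mean):

A, B, C, D = 20, 20, 24, 72          # pair counts from the table above

RI = (A + D) / (A + B + C + D)       # 92 / 136 ≈ 0.68
P = A / (A + B)                      # 0.50
R = A / (A + C)                      # ≈ 0.45
F = 2 * P * R / (P + R)              # pairwise ("cluster") F-measure
print(RI, P, R, F)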

Final word and resources


In clustering, clusters are inferred from the data without human input (unsupervised learning). However, in practice it's a bit less clear: there are many ways of influencing the outcome of clustering: number of clusters, similarity measure, representation of documents, ...

Discussion
Can interpret clusters by using supervised learning:
- learn a classifier based on clusters

Decrease dependence between attributes?
- pre-processing step
- e.g., use principal component analysis

Can be used to fill in missing values.

Key advantage of probabilistic clustering:
- Can estimate the likelihood of the data
- Use it to compare different models objectively

Clustering Summary
Unsupervised; many approaches:
- K-means: simple, sometimes useful
- K-medoids: less sensitive to outliers
- Hierarchical clustering: works for symbolic attributes

Evaluation is a problem.

S8: Association Rules and Clustering


Shawndra Hill Spring 2013 TR 1:30-3pm and 3-4:30
