Professional documents
Culture documents
Recall?
Instance-based (lazy) versus eager
Incremental versus not
Unsupervised versus supervised
Classification Model
Historical / training data -> classification algorithms (Decision Tree, Naive Bayes, KNN, ..., your classification algorithm here)
Name   Balance   Age
Mike    23,000   30
Mary    51,100   40
Bill    68,000   55
Jim     74,000   46
Dave    23,000   47
Anne   100,000   49
For traditional learning methods, you will need a table of i.i.d. records.
i.i.d.?
http://www.mmm-online.com/avandia-warning-signs-seen-online-as-early-as-04/printarticle/177982/
What Can You Learn by Eavesdropping on a Million Conversations (on the Web)?
Recipes?
Side effects mentioned online (word-cloud/table residue; exact pairing with counts not recoverable): constipating, acidity, watery eyes, fluid retention, heartburn, oral thrush, dry eyes, chest pain, muscle weakness, anxiousness, acid indigestion, restlessness, cough, nasal drip -- counts 925, 661, 434, 265, 243, 219, 207, 134, 125, 113.
"All well and good, but not sure how much can be gleaned from this..."
Side effect / count: hair loss 88, muscle pain 74, night sweats 63, joint stiffness 47, trigger finger 41, mood swings 38, fracture 37, dry eye 35, vaginal dryness 32, high cholesterol 29.
Not on label!
Data: Collection
Amazon Mechanical Turk: show Twitter handles
572 current show names
Jan-May 2012 IMDB TV crawler: television content features
Twitter API
Models: Overview
Network models
Show follower feature models: gender, location, general demographics-based
Show network confidence*
Follower social network*
Network popularity
TF-IDF transform on all words*
TF-IDF transform on all words less show-related words
TF-IDF on show-related words
Show content similarity
Experiment framework
10-fold cross validation over 114K users
    function VALIDATE(Engine e, List[Set[user]] tests, List[Set[user]] trains) {
        List[Result] results = [];
        FOR (i IN 1:10) {
            Model m = TRAIN(e, trains[i]);
            FOR (u IN tests[i]) {
                Show randShow = GET_RANDOM_SHOW(u);
                List[Show] recommended = PREDICT(m, u, randShow);
                results += GET_PERFORMANCE(recommended, u, randShow);
            }
        }
        RETURN (SUM(results) / 10);
    }
Validation metrics
Considered user neighbors to be those with a follower, friend, or reciprocally linked relation
Results: Precision
Results: Recall
Restricting to standard English words only (40,000 tokens instead of 4 million) results in a similar level of performance.
http://www.thesocialtvlab.com/adrian/videos/ http://thesocialtvlab.com/adrian/network_vis/interactive_network_recommender/
Example rule: If buy {Diapers} Then buy {Beer}
Examples:
Shoppers who buy ice cream are very likely to buy beer.
Rule: If buy {Ice Cream} Then buy {Beer}
Shoppers who buy beer and wine are likely to buy cheese and chocolate.
Rule: If buy {Beer, Wine} Then buy {Cheese, Chocolate}
Association Rules
Rule format: If {set of items} Then {set of items}
             (body)            (head)
Example: If {Diapers, Baby Food} Then {Beer, Wine}
Body implies Head
Applications
- Store planning: placing associated items together (Milk & Bread)? May reduce basket total value (customers buy less unplanned merchandise).
- Customer segmentation based on buying behavior, cross-marketing, catalog design, etc.
- Fraud detection: finding in insurance data that a certain doctor often works with a certain lawyer may indicate potential fraudulent activity.
Rule Evaluation
Support
Milk & Wine co-occur, but only 2 out of 200K transactions contain both items.
Rule Evaluation
Significance is measured by Support: the frequency with which the items in the body and the head co-occur.
E.g., the support of the rule If {Diapers} Then {Beer} is 3/5: 60% of the transactions contain both items.
Support = (No. of transactions containing both body and head) / (Total no. of transactions)
Rule Evaluation
A rule's strength is measured by its confidence: how strongly does the body imply the head?
Confidence: the proportion of transactions containing the body that also contain the head.
Example: the confidence of the rule is 3/3, i.e., 100% of the transactions that contain diapers also contain beer.
Confidence = (No. of transactions containing both body and head) / (No. of transactions containing body)
Transaction No.   Item 1      Item 2      Item 3
100               Beer        Diapers     Chocolate
101               Milk        Chocolate   Shampoo
102               Beer        Wine        Vodka
103               Beer        Cheese      Diapers
104               Ice Cream   Diapers     Beer
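To make the definitions concrete, here is a small Python sketch (my own illustration, not from the slides; the transaction list mirrors the table above) that computes support and confidence for the rule If {Diapers} Then {Beer}:

    # Hypothetical sketch: support and confidence over the 5-transaction example.
    transactions = [
        {"Beer", "Diapers", "Chocolate"},   # 100
        {"Milk", "Chocolate", "Shampoo"},   # 101
        {"Beer", "Wine", "Vodka"},          # 102
        {"Beer", "Cheese", "Diapers"},      # 103
        {"Ice Cream", "Diapers", "Beer"},   # 104
    ]

    def support(itemset, transactions):
        # Fraction of transactions containing every item in the itemset.
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(body, head, transactions):
        # Of the transactions containing the body, how many also contain the head?
        return support(body | head, transactions) / support(body, transactions)

    body, head = {"Diapers"}, {"Beer"}
    print(support(body | head, transactions))    # 0.6 -> 60% support
    print(confidence(body, head, transactions))  # 1.0 -> 100% confidence

The printed values match the 3/5 support and 3/3 confidence figures quoted above.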
Is the rule If {Milk} Then {Wine} equivalent to the rule If {Wine} Then {Milk}? When is the implication {Milk} -> {Wine} more likely than the reverse?
Rule Evaluation: Lift
Example: consider the rule If {Milk} (body) Then {Beer} (head)
Find rules where the frequency of the head given the body > the expected frequency of the head.
Lift
Measures how much more likely the head is given the body than merely the head (confidence / frequency of head).
Example:
Total number of customers in database: 1000
No. of customers buying Milk: 200
No. of customers buying Beer: 50
No. of customers buying Milk & Beer: 20
Rule: If {Milk} (body) Then {Beer} (head)
Frequency of head: 50/1000 (5%)
Confidence: 20/200 (10%)
Lift: 10% / 5% = 2
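As a quick check of the arithmetic above, a small Python sketch (illustrative only; the counts are the ones from the example) computing lift from raw counts:

    # Lift = confidence(body -> head) / frequency(head), using the example counts.
    n_customers = 1000
    n_milk      = 200   # customers buying Milk (body)
    n_beer      = 50    # customers buying Beer (head)
    n_milk_beer = 20    # customers buying both

    freq_head  = n_beer / n_customers    # 0.05 (5%)
    confidence = n_milk_beer / n_milk    # 0.10 (10%)
    lift       = confidence / freq_head  # 2.0
    print(freq_head, confidence, lift)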
Data Mining: explore the data for patterns.
Data mining methods automatically discover significant association rules from data.
They find whatever patterns exist in the database, without the user having to specify in advance what to look for (data driven).
They therefore allow finding unexpected correlations.
The algorithm performs an efficient search over the data to find all such rules.
Find all sets of items (called itemsets) that co-occur in at least minsup of the transactions in the database.
Itemsets with at least minimum support are called frequent itemsets (or large itemsets).
Example
A data set with 5 transactions; minsup = 40%, minconf = 80%.
Phase 1: Find all frequent itemsets:
{Beer} (support = 80%), {Diaper} (60%), {Chocolate} (40%), {Beer, Diaper} (60%)

Transaction No.   Item 1      Item 2      Item 3
100               Beer        Diaper      Chocolate
101               Milk        Chocolate   Shampoo
102               Beer        Wine        Vodka
103               Beer        Cheese      Diaper
104               Ice Cream   Diaper      Beer
Phase 1:
Example: assume the following itemsets of size 1 were found to be frequent: {Milk}, {Bread}, {Butter}.
Since {Wine} is not frequent, {Wine, Butter} cannot be frequent. Only if both {Wine} and {Butter} were frequent could {Wine, Butter} be frequent.
Therefore:
1. Find all itemsets of size 1 that are frequent.
2. To find out which itemsets of size 2 are frequent, count the frequency only of itemsets of size 2 whose two items come from: Milk, Bread, Butter.
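A minimal Python sketch of this level-wise idea (my own illustration, not the slides' code; candidate generation here is the naive version that simply extends each frequent itemset with frequent single items):

    def frequent_itemsets(transactions, minsup):
        """Level-wise (Apriori-style) search: only extend itemsets that are already frequent."""
        n = len(transactions)
        support = lambda items: sum(items <= t for t in transactions) / n

        # Step 1: frequent itemsets of size 1.
        singles = {frozenset([i]) for t in transactions for i in t}
        frequent_1 = {s for s in singles if support(s) >= minsup}
        result = {s: support(s) for s in frequent_1}

        # Step 2: build size-(k+1) candidates only out of frequent size-k itemsets.
        current = frequent_1
        while current:
            size = len(next(iter(current))) + 1
            candidates = {a | b for a in current for b in frequent_1 if len(a | b) == size}
            current = {c for c in candidates if support(c) >= minsup}
            result.update({c: support(c) for c in current})
        return result

    transactions = [{"Beer", "Diaper", "Chocolate"}, {"Milk", "Chocolate", "Shampoo"},
                    {"Beer", "Wine", "Vodka"}, {"Beer", "Cheese", "Diaper"},
                    {"Ice Cream", "Diaper", "Beer"}]
    for itemset, s in frequent_itemsets(transactions, minsup=0.4).items():
        print(set(itemset), s)
    # Expected: {Beer} 0.8, {Diaper} 0.6, {Chocolate} 0.4, {Beer, Diaper} 0.6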
Phase 2:
For each frequent itemset, find all possible rules Body -> Head (using the items contained in the itemset).
Example: Does {Milk} -> {Bread, Butter} satisfy minimum confidence? What about {Bread} -> {Milk, Butter}, {Butter} -> {Milk, Bread}, {Bread, Butter} -> {Milk}, {Milk, Butter} -> {Bread}, {Milk, Bread} -> {Butter}?
To calculate the confidence of the rule {Milk} -> {Bread, Butter}:
Confidence = Support{Milk, Bread, Butter} / Support{Milk}
           = (No. of transactions that support {Milk, Bread, Butter}) / (No. of transactions that support {Milk})
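Phase 2 can be sketched in Python as follows (again my illustration; it assumes a `supports` dictionary mapping frequent itemsets to their support, e.g. as produced in Phase 1):

    from itertools import combinations

    def rules_from_itemset(itemset, supports, minconf):
        """Enumerate Body -> Head splits of a frequent itemset and keep the confident ones."""
        rules = []
        items = frozenset(itemset)
        for k in range(1, len(items)):          # every non-empty proper subset as the body
            for body in map(frozenset, combinations(items, k)):
                head = items - body
                conf = supports[items] / supports[body]   # support(body and head) / support(body)
                if conf >= minconf:
                    rules.append((set(body), set(head), conf))
        return rules

    # Supports from the 5-transaction example above.
    supports = {frozenset({"Beer"}): 0.8, frozenset({"Diaper"}): 0.6,
                frozenset({"Beer", "Diaper"}): 0.6}
    print(rules_from_itemset({"Beer", "Diaper"}, supports, minconf=0.8))
    # Only {Diaper} -> {Beer} (confidence 1.0) survives; {Beer} -> {Diaper} has confidence 0.75.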
Association
If the rule {Yogurt} -> {Bread, Butter} is found to have minimum confidence, does it mean the rule {Bread, Butter} -> {Yogurt} also has minimum confidence?
Example:
Support of {Yogurt} is 20%; support of {Yogurt, Bread, Butter} is 10%; support of {Bread, Butter} is 50%.
Confidence of {Yogurt} -> {Bread, Butter} is 10% / 20% = 50%.
Confidence of {Bread, Butter} -> {Yogurt} is 10% / 50% = 20%.
Phase 2: For each frequent itemset of size >= 2 (containing at least two items), find all rules that satisfy minimum confidence.
{Beer} -> {Diaper}: confidence = 3/4 (75%), not sufficient confidence.
{Diaper} -> {Beer}: confidence = 3/3 (100%).
Applications
Store planning: placing associated items together (Milk & Bread)? May reduce basket total value (customers buy less unplanned merchandise).
Fraud detection: finding in insurance data that a certain doctor often works with a certain lawyer may indicate potential fraudulent activity.
Sequential Patterns
Instead of finding associations between items in a single transaction, find associations between items bought by the same customer on different occasions.
Customer ID   Transaction Date   Item 1                  Item 2
AA            2/2/2001           Laptop                  Case
AA            1/13/2002          Wireless network card   Router
BB            4/5/2002           Laptop                  iPaq
BB            8/10/2002          Wireless network card   Router
Sequence: {Laptop}, {Wireless Card, Router}
A sequence has to satisfy some predetermined minimum support.
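A small sketch (my own illustration) of what "support of a sequence" means here: a customer supports the sequence if its itemsets appear, in order, across that customer's successive transactions.

    def supports_sequence(customer_transactions, sequence):
        """customer_transactions: list of item-sets in time order; sequence: list of item-sets."""
        i = 0
        for basket in customer_transactions:
            if i < len(sequence) and sequence[i] <= basket:
                i += 1
        return i == len(sequence)

    customers = {
        "AA": [{"Laptop", "Case"}, {"Wireless network card", "Router"}],
        "BB": [{"Laptop", "iPaq"}, {"Wireless network card", "Router"}],
    }
    seq = [{"Laptop"}, {"Wireless network card", "Router"}]
    support = sum(supports_sequence(tx, seq) for tx in customers.values()) / len(customers)
    print(support)   # 1.0 -- both customers bought a laptop first and card + router later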
Clustering
What is clustering?
Clustering: the process of grouping a set of objects into classes of similar objects
Items within a cluster should be similar. Documents from different clusters should be dissimilar.
A common and important task that finds many applications in IR and other places
Clustering
What is Clustering?
Clustering can be considered the most important unsupervised learning problem; as with every other problem of this kind, it deals with finding a structure in a collection of unlabeled data.
A loose definition of clustering could be "the process of organizing objects into groups whose members are similar in some way".
A cluster is therefore a collection of objects which are similar to one another and dissimilar to the objects belonging to other clusters.
We can show this with a simple graphical example:
How would you design an algorithm for finding the three clusters in this case?
Applications of clustering in IR
Whole corpus analysis/navigation
Better user interface: search without typing
Yahoo! Hierarchy isn't clustering but is the kind of output you want from clustering
[Figure: Yahoo! hierarchy excerpt under www.yahoo.com/Science (30): agriculture, biology, physics, CS, space, ... with subcategories such as dairy, crops, agronomy, forestry, botany, cell, evolution, magnetism, relativity, AI, courses, HCI, craft, missions]
clusty.com / Vivisimo
Clustering Algorithms
Flat algorithms
Usually start with a random (partial) partitioning and refine it iteratively
K-means clustering (model-based clustering)
Hierarchical algorithms
Bottom-up, agglomerative (Top-down, divisive)
Partitioning Algorithms
Partitioning method: construct a partition of n documents into a set of K clusters.
Given: a set of documents and the number K.
Find: a partition into K clusters that optimizes the chosen partitioning criterion.
Globally optimal: exhaustively enumerate all partitions.
Effective heuristic methods: K-means and K-medoids algorithms.
K-Means
Assumes documents are real-valued vectors. Clusters based on centroids (aka the center of gravity or mean) of points in a cluster, c:
μ(c) = (1/|c|) Σ_{x in c} x
Reassignment of instances to clusters is based on distance to the current cluster centroids. (Or one can equivalently phrase it in terms of similarities)
K-Means Algorithm
Select K random docs {s1, s2, ..., sK} as seeds.
Until clustering converges (or another stopping criterion is met):
  For each doc di:
    Assign di to the cluster cj such that dist(xi, sj) is minimal.
  (Update the seeds to the centroid of each cluster.)
  For each cluster cj: sj = μ(cj)
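A compact NumPy sketch of this loop (illustrative only, not the slides' own code). It follows the steps above: seed with K random docs, reassign each doc to the nearest centroid, recompute centroids, and stop when the assignment no longer changes.

    import numpy as np

    def kmeans(X, K, max_iter=100, seed=0):
        """X: (n, m) array of document vectors. Returns (centroids, cluster assignments)."""
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=K, replace=False)].copy()  # K random docs as seeds
        assign = None
        for _ in range(max_iter):
            # Assignment step: each doc goes to the cluster with the nearest centroid.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            new_assign = dists.argmin(axis=1)
            if assign is not None and np.array_equal(new_assign, assign):
                break                                   # doc partition unchanged -> converged
            assign = new_assign
            # Update step: move each centroid to the mean of its members.
            for k in range(K):
                if np.any(assign == k):
                    centroids[k] = X[assign == k].mean(axis=0)
        return centroids, assign

    # Tiny usage example on 2-D points:
    X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
    print(kmeans(X, K=2))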
K Means Example
(K=2)
Pick seeds -> reassign clusters -> compute centroids -> reassign clusters -> compute centroids -> reassign clusters -> converged!
Termination conditions
Several possibilities, e.g.,
A fixed number of iterations.
Doc partition unchanged.
Centroid positions don't change.
Convergence
Why should the K-means algorithm ever reach a fixed point?
A state in which clusters don't change.
K-means is a special case of a general procedure known as the Expectation Maximization (EM) algorithm.
EM is known to converge. Number of iterations could be large.
But in practice it usually isn't.
Convergence of K-Means
Define goodness measure of cluster k as sum of squared distances from cluster centroid:
Gk = Σi (di − ck)²   (sum over all di in cluster k)
G = Σk Gk
Reassignment monotonically decreases G since each vector is assigned to the closest centroid.
Convergence of K-Means
Recomputation monotonically decreases each Gk, since (with mk the number of members in cluster k):
Σ (di − a)² reaches its minimum when  Σ −2(di − a) = 0
⇔ Σ di = mk · a
⇔ a = (1/mk) Σ di = ck
K-means typically converges quickly.
Time Complexity
Computing distance between two docs is O(m) where m is the dimensionality of the vectors. Reassigning clusters: O(Kn) distance computations, or O(Knm). Computing centroids: Each doc gets added once to some centroid: O(nm). Assume these two steps are each done once for I iterations: O(IKnm).
Seed Choice
Results can vary based on random seed selection.
Some seeds can result in a poor convergence rate, or convergence to sub-optimal clusterings.
Select good seeds using a heuristic (e.g., a doc least similar to any existing mean).
Try out multiple starting points (see the sketch below).
Initialize with the results of another method.
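One common way to apply the "try out multiple starting points" advice, assuming scikit-learn is available (my assumption, not something the slides prescribe): KMeans with n_init runs the algorithm from several random seeds and keeps the solution with the lowest within-cluster sum of squares.

    from sklearn.cluster import KMeans
    import numpy as np

    X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
    km = KMeans(n_clusters=2, init="random", n_init=10, random_state=0).fit(X)
    print(km.labels_, km.inertia_)   # inertia_ = residual sum of squared distances (the G above)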
Example showing sensitivity to seeds
In the above, if you start with B and E as centroids, you converge to {A,B,C} and {D,E,F}. If you start with D and F, you converge to {A,B,D,E} and {C,F}.
Tradeoff between having more clusters (better focus within each cluster) and having too many clusters
Hierarchical Clustering
Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents.
[Dendrogram example: animal -> vertebrate (fish, reptile, amphib., mammal) and invertebrate (worm, insect, crustacean)]
Complete-link
Similarity of the furthest points, the least cosine-similar
Centroid
Clusters whose centroids (centers of gravity) are the most cosine-similar are merged.
Computational Complexity
In the first iteration, all HAC methods need to compute the similarity of all pairs of n individual instances, which is O(n²). In each of the subsequent n−2 merging iterations, compute the distance between the most recently created cluster and all other existing clusters. In order to maintain an overall O(n²) performance, computing similarity to each other cluster must be done in constant time.
Alternative ways to define cluster similarity when merging: averaged across all ordered pairs in the merged cluster, or averaged over all pairs between the two original clusters.
No clear difference in efficacy
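A naive bottom-up sketch (my own illustration, O(n³) rather than the O(n²) discussed above) using the complete-link criterion defined earlier, i.e. the distance between two clusters is the distance between their furthest members:

    import numpy as np

    def hac_complete_link(X, num_clusters):
        """Naive agglomerative clustering: repeatedly merge the pair of clusters whose
        furthest members are closest (complete link), until num_clusters remain."""
        clusters = [[i] for i in range(len(X))]
        dist = lambda a, b: max(np.linalg.norm(X[i] - X[j]) for i in a for j in b)
        while len(clusters) > num_clusters:
            # Find the pair of clusters with the smallest complete-link distance.
            pairs = [(dist(clusters[i], clusters[j]), i, j)
                     for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
            _, i, j = min(pairs)
            clusters[i] = clusters[i] + clusters[j]
            del clusters[j]
        return clusters

    X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [9.0, 0.0]])
    print(hac_complete_link(X, num_clusters=3))   # e.g. [[0, 1], [2, 3], [4]]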
Purity example
[Figure: three example clusters of labeled points — Cluster I, Cluster II, Cluster III]
Cluster I:   Purity = 1/6 · max(5, 1, 0) = 5/6
Cluster II:  Purity = 1/6 · max(1, 4, 1) = 4/6
Cluster III: Purity = 1/5 · max(2, 0, 3) = 3/5
Example pair counts: A = 20, B = 20, C = 24, D = 72.
RI = (A + D) / (A + B + C + D)
Compare with standard Precision and Recall:
P = A / (A + B)
R = A / (A + C)
People also define and use a cluster F-measure, which is probably a better measure.
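To tie the evaluation measures together, a small sketch (my own illustration; the cluster/class counts are the ones from the purity example above) computing overall purity and the Rand index:

    import numpy as np

    # Rows = clusters I, II, III; columns = classes; counts as in the purity example.
    counts = np.array([[5, 1, 0],
                       [1, 4, 1],
                       [2, 0, 3]])

    purity = counts.max(axis=1).sum() / counts.sum()
    print(purity)                      # (5 + 4 + 3) / 17 ≈ 0.71

    # Rand index over pairs of documents: agreements / all pairs.
    pairs = lambda n: n * (n - 1) // 2
    total_pairs  = pairs(counts.sum())
    same_cluster = sum(pairs(n) for n in counts.sum(axis=1))   # pairs put in the same cluster
    same_class   = sum(pairs(n) for n in counts.sum(axis=0))   # pairs that share a class
    tp           = sum(pairs(n) for n in counts.flatten())     # same cluster and same class
    fp, fn       = same_cluster - tp, same_class - tp
    tn           = total_pairs - tp - fp - fn
    print((tp + tn) / total_pairs)     # RI = (A + D) / (A + B + C + D) ≈ 0.68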
Discussion
Can interpret clusters by using supervised learning: learn a classifier based on the clusters.
Pre-processing step: e.g., use principal component analysis.
Clustering Summary
Unsupervised
Many approaches
K-means: simple, sometimes useful
K-medoids is less sensitive to outliers
Evaluation is a problem