Today: Cluster Evaluation
Internal: we don't know anything about the desired labels.
External: we have some information about the labels.
Internal Evaluation

Clusters have been identified. How successfully was a partitioning of the data set constructed through clustering? Internal measures have the quality that they can be directly optimized.
Intracluster variability: how far is a point from its cluster centroid?

J(x|θ) = (1/N) Σ_i ||x_i − c_i||²
Intuition: every point assigned to a cluster should be closer to the center of that cluster than to the center of any other cluster. K-means optimizes this measure.
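As a minimal sketch (the function name `kmeans_objective` is illustrative, not from any library), the measure can be computed directly from points, centroids, and assignments:

```python
import numpy as np

def kmeans_objective(X, centroids, assignments):
    """J = (1/N) * sum_i ||x_i - c_i||^2, where c_i is the centroid
    of the cluster that point x_i is assigned to."""
    residuals = X - centroids[assignments]       # (N, d) point-to-centroid offsets
    return float(np.mean(np.sum(residuals ** 2, axis=1)))

# Two well-separated 1-D clusters: every point is 0.5 from its centroid.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
centroids = np.array([[0.5], [10.5]])
assignments = np.array([0, 0, 1, 1])
print(kmeans_objective(X, centroids, assignments))  # 0.25
```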
Model Likelihood p(x|θ)

Intuition: the model that fits the data best represents the best clustering. Requires a probabilistic model. Can be included in AIC and BIC measures to limit the number of parameters.

GMM style: p(x|θ) = Σ_k π_k p(x|μ_k, Σ_k)
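A minimal sketch of the likelihood criterion for 1-D data under a Gaussian mixture (the function name and toy data are illustrative assumptions): the mixture whose components actually sit on the data scores a higher log-likelihood.

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, stds):
    """log p(x|theta) = sum_n log( sum_k pi_k * N(x_n | mu_k, sigma_k^2) )
    for 1-D data under a Gaussian mixture."""
    x = np.asarray(x, dtype=float)[:, None]          # shape (N, 1)
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    # Component densities, shape (N, K), then weighted sum over components.
    dens = np.exp(-0.5 * ((x - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))
    return float(np.sum(np.log(dens @ np.asarray(weights))))

x = [0.0, 0.1, 5.0, 5.1]
good = gmm_log_likelihood(x, [0.5, 0.5], [0.0, 5.0], [1.0, 1.0])  # means match data
bad = gmm_log_likelihood(x, [0.5, 0.5], [2.5, 2.5], [1.0, 1.0])   # means miss data
print(good > bad)  # True
```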
Intuition: two points that are similar should be in the same cluster. Spectral clustering optimizes this function.
External Evaluation

The number of clusters may not equal the number of classes. It may be difficult to assign a class to a cluster.

Some principles:
Homogeneity: each cluster should include members of as few classes as possible.
Completeness: each class should be represented in as few clusters as possible.
Some approaches

Purity: Purity = (1/n) Σ_{r=1}^{k} max_i (n_r^i)

where n_r^i is the number of points from class i in cluster r.
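A minimal sketch of the purity computation (names like `y_class` and `y_cluster` are illustrative):

```python
from collections import Counter

def purity(y_class, y_cluster):
    """Purity = (1/n) * sum over clusters of the size of the
    largest single-class group inside that cluster."""
    members = {}
    for c, k in zip(y_class, y_cluster):
        members.setdefault(k, []).append(c)
    majority = sum(max(Counter(m).values()) for m in members.values())
    return majority / len(y_class)

y_class = ['a', 'a', 'a', 'b', 'b', 'c']
y_cluster = [1, 1, 2, 2, 2, 2]
print(purity(y_class, y_cluster))  # (2 + 2) / 6 ≈ 0.667
```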
F-measure

Cluster definitions of precision and recall, combined using the harmonic mean as in the traditional F-measure:

R(c_i, k_j) = n_ij / |c_i|
P(c_i, k_j) = n_ij / |k_j|

where n_ij is the number of members of class c_i in cluster k_j.
[Example partitionings scored in the slides: F-measure 0.6, 0.6, 0.5, 0.5]
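A minimal sketch of a cluster F-measure. Combining the per-(class, cluster) scores into one number by weighting each class's best-matching cluster is one common convention, assumed here rather than taken from the slides:

```python
def cluster_f_measure(y_class, y_cluster):
    """Per-(class, cluster) precision P = n_ij/|k_j| and recall R = n_ij/|c_i|,
    combined by harmonic mean; the overall score weights each class's best F
    by the class's relative size (one common convention)."""
    n = len(y_class)
    total = 0.0
    for c in set(y_class):
        c_size = y_class.count(c)
        best = 0.0
        for k in set(y_cluster):
            k_size = y_cluster.count(k)
            n_ij = sum(1 for ci, kj in zip(y_class, y_cluster) if ci == c and kj == k)
            if n_ij:
                p, r = n_ij / k_size, n_ij / c_size
                best = max(best, 2 * p * r / (p + r))
        total += (c_size / n) * best
    return total

print(cluster_f_measure(['a', 'a', 'b', 'b'], [0, 0, 1, 1]))  # 1.0
```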
V-Measure

Conditional-entropy-based measure to explicitly calculate homogeneity and completeness.

V_β = ((1 + β) · h · c) / (β · h + c)

h = 1 if H(C, K) = 0, else 1 − H(C|K) / H(C)
c = 1 if H(K, C) = 0, else 1 − H(K|C) / H(K)
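A minimal sketch of the V-measure from label sequences (helper names are illustrative; the zero-entropy fallback here uses H(C) and H(K) directly, a slight simplification of the condition above):

```python
import math
from collections import Counter

def label_entropy(labels):
    """H over the empirical distribution of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def cond_entropy(a, b):
    """H(A|B): entropy of a's labels within each group of b, weighted by group size."""
    n = len(a)
    return sum((nb / n) * label_entropy([ai for ai, bi in zip(a, b) if bi == bv])
               for bv, nb in Counter(b).items())

def v_measure(y_class, y_cluster, beta=1.0):
    """V = (1 + beta) * h * c / (beta * h + c)."""
    hC, hK = label_entropy(y_class), label_entropy(y_cluster)
    h = 1.0 if hC == 0 else 1.0 - cond_entropy(y_class, y_cluster) / hC
    c = 1.0 if hK == 0 else 1.0 - cond_entropy(y_cluster, y_class) / hK
    return 0.0 if h + c == 0 else (1 + beta) * h * c / (beta * h + c)

print(v_measure(['a', 'a', 'b', 'b'], [0, 0, 1, 1]))  # 1.0
```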
Contingency Matrix

We want to know how much the introduction of clusters improves the information about the class distribution.
[Example contingency matrix of class vs. cluster counts]
Entropy

Entropy calculates the amount of information in a distribution.
- Wide distributions have a lot of information.
- Narrow distributions have very little.
Based on Shannon's limit on the number of bits required to transmit a distribution.

Calculation of entropy:

H(x) = −Σ_i p(x_i) log2 p(x_i)

Example (uniform over four outcomes):
H(x) = −(1/4) log2(1/4) − (1/4) log2(1/4) − (1/4) log2(1/4) − (1/4) log2(1/4)
     = (1/4)(2) + (1/4)(2) + (1/4)(2) + (1/4)(2)
     = 2

Example (skewed distribution):
H(x) = −(4/10) log2(4/10) − (4/10) log2(4/10) − (1/10) log2(1/10) − (1/10) log2(1/10)
     = (4/10)(1.32) + (4/10)(1.32) + (1/10)(3.32) + (1/10)(3.32)
     = 1.72
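Both worked examples above can be checked with a few lines of Python:

```python
import math

def entropy(probs):
    """H(x) = -sum_i p(x_i) * log2 p(x_i), in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1/4, 1/4, 1/4, 1/4]))      # 2.0
print(entropy([4/10, 4/10, 1/10, 1/10]))  # ≈ 1.72
```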
Jaccard

Counting over pairs of points (SS = same class and same cluster; SD and DS = together in one partition but not the other):

Jaccard = SS / (SS + SD + DS)

Fowlkes-Mallows

FM = √( (SS / (SS + SD)) · (SS / (SS + DS)) )
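A minimal sketch of both pair-counting measures. Which disagreement pattern is called SD versus DS varies by convention (the assignment below is one common choice); Jaccard is symmetric in the two, so its value is unaffected:

```python
from itertools import combinations
from math import sqrt

def pair_counts(y_class, y_cluster):
    """SS: pairs with same class and same cluster; SD: same class, different
    cluster; DS: different class, same cluster."""
    ss = sd = ds = 0
    for i, j in combinations(range(len(y_class)), 2):
        same_class = y_class[i] == y_class[j]
        same_cluster = y_cluster[i] == y_cluster[j]
        if same_class and same_cluster:
            ss += 1
        elif same_class:
            sd += 1
        elif same_cluster:
            ds += 1
    return ss, sd, ds

def jaccard(y_class, y_cluster):
    ss, sd, ds = pair_counts(y_class, y_cluster)
    return ss / (ss + sd + ds)

def fowlkes_mallows(y_class, y_cluster):
    ss, sd, ds = pair_counts(y_class, y_cluster)
    return sqrt((ss / (ss + sd)) * (ss / (ss + ds)))

print(jaccard(['a', 'a', 'b', 'b'], [0, 0, 1, 1]))  # 1.0
```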
B-cubed

Similar to pair-based counting systems, B-cubed calculates an element-by-element precision and recall:

Precision = (1/n) Σ_e Precision(e)
Recall = (1/n) Σ_e Recall(e)
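A minimal sketch of the B-cubed averages, using the standard per-element definitions (the function name is illustrative):

```python
def b_cubed(y_class, y_cluster):
    """Average per-element precision and recall.
    Precision(e): fraction of e's cluster that shares e's class.
    Recall(e): fraction of e's class that shares e's cluster."""
    n = len(y_class)
    precision = recall = 0.0
    for i in range(n):
        same_both = sum(1 for j in range(n)
                        if y_class[j] == y_class[i] and y_cluster[j] == y_cluster[i])
        precision += same_both / y_cluster.count(y_cluster[i])
        recall += same_both / y_class.count(y_class[i])
    return precision / n, recall / n

# One big cluster: recall is perfect, precision suffers.
print(b_cubed(['a', 'a', 'b', 'b'], [0, 0, 0, 0]))  # (0.5, 1.0)
```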
Next Time

Project Presentations
- The schedule has 15 minutes per presentation. This includes the transition to the next speaker, and questions.
- Prepare for 10 minutes.

Course Evaluations

Thank you