Frontiers of Computational Journalism
Columbia Journalism School
Week 2: Clustering
September 17, 2012
Week 2: Clustering

Vector representation of objects
Distance Metrics
Clustering Algorithms
Editorial Choice
Examples of features

number of claws
latitude
color {red, yellow, blue}
number of break-ins
1 for bought X, 0 for did not buy X
time, duration, etc.
number of times word Y appears in a document
votes cast
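Below is a minimal Python sketch, not from the slides, of how one object described by features like these becomes a vector; the feature names (num_claws, bought_x, etc.) are hypothetical stand-ins for the examples above.

import numpy as np

# One object described by a mix of the example features above
obj = {
    "num_claws": 4,            # number of claws
    "latitude": 40.8,
    "bought_x": 1,             # 1 for bought X, 0 for did not buy X
    "word_y_count": 7,         # times word Y appears in the document
}

# Fix a feature order so every object maps into the same vector space
feature_order = ["num_claws", "latitude", "bought_x", "word_y_count"]
x = np.array([obj[f] for f in feature_order], dtype=float)
print(x)   # [ 4.  40.8  1.   7. ]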
Feature selection

Technical meaning in machine learning etc.: which variables matter?

We're journalists, so we're interested in an earlier process: how to describe the world in numbers?
Choosing Features

Journalism: how do we represent the world numerically?
Categorical features take one of k values, where k ∈ ℕ:

finite, e.g. {on, off}
infinite, e.g. {red, yellow, blue, ... chartreuse}
ordered?
equivalence classes or other structure?
Likert scale: discrete scale, no fixed origin, abstract units, comparative, non-uniform
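One common way to put a categorical feature like color into a vector is one-hot encoding: one dimension per possible value. This particular encoding is an assumption on my part, not something the slides prescribe; a minimal Python sketch:

import numpy as np

colors = ["red", "yellow", "blue"]   # the finite category set

def one_hot(value, categories):
    # 1 in the slot for this value, 0 everywhere else
    v = np.zeros(len(categories))
    v[categories.index(value)] = 1.0
    return v

print(one_hot("yellow", colors))     # [0. 1. 0.]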
Even with all these caveats, the vector representation is incredibly flexible and powerful.
Week 2: Clustering

Vector representation of objects
Distance Metrics
Clustering Algorithms
Editorial Choice
Distance metric

Intuitively: how (dis)similar are two items?

Formally:
d(x, y) ≥ 0
d(x, x) = 0
d(x, y) = d(y, x)
d(x, z) ≤ d(x, y) + d(y, z)
Distance metric

d(x, y) ≥ 0 - distance is never negative
d(x, x) = 0 - reflexivity: zero distance to self
d(x, y) = d(y, x) - symmetry: x to y same as y to x
d(x, z) ≤ d(x, y) + d(y, z) - triangle inequality
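As a quick sanity check, here is a small Python sketch of my own, with Euclidean distance standing in for d, that verifies all four axioms on random points:

import numpy as np

def d(x, y):
    # Euclidean distance between two vectors
    return np.linalg.norm(x - y)

rng = np.random.default_rng(0)
x, y, z = rng.normal(size=(3, 5))

assert d(x, y) >= 0                     # never negative
assert d(x, x) == 0                     # zero distance to self
assert np.isclose(d(x, y), d(y, x))     # symmetry
assert d(x, z) <= d(x, y) + d(y, z)     # triangle inequality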
Distance matrix

Data matrix for M objects of N dimensions:

$$X = \begin{pmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,N} \\ x_{2,1} & x_{2,2} & & \\ \vdots & & \ddots & \\ x_{M,1} & & & x_{M,N} \end{pmatrix}$$

Distance matrix:

$$D_{ij} = D_{ji} = d(x_i, x_j) = \begin{pmatrix} d_{1,1} & d_{1,2} & \cdots & d_{1,M} \\ d_{2,1} & d_{2,2} & & \\ \vdots & & \ddots & \\ d_{M,1} & & & d_{M,M} \end{pmatrix}$$
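A sketch of building D from the data matrix with scipy's pairwise-distance helpers; the random X here is just placeholder data:

import numpy as np
from scipy.spatial.distance import pdist, squareform

M, N = 5, 3
X = np.random.default_rng(1).normal(size=(M, N))   # M x N data matrix

D = squareform(pdist(X, metric="euclidean"))       # M x M distance matrix
assert np.allclose(D, D.T)                         # D_ij = D_ji
assert np.allclose(np.diag(D), 0.0)                # d(x, x) = 0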
Week 2: Clustering

Vector representation of objects
Distance Metrics
Clustering Algorithms
Editorial Choice
Agglomerative hierarchical: start with leaves, repeatedly merge clusters
e.g. MIN and MAX approaches (sketched in code below)

Divisive hierarchical: start with root, repeatedly split clusters
e.g. binary split
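A sketch of both linkage choices using scipy, where "single" linkage is the MIN approach (merge the clusters whose closest members are nearest) and "complete" is MAX (compare farthest members); the data is random placeholder:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(2).normal(size=(10, 4))

Z_min = linkage(X, method="single")     # MIN linkage
Z_max = linkage(X, method="complete")   # MAX linkage

# Cut the MIN tree into 3 flat clusters
labels = fcluster(Z_min, t=3, criterion="maxclust")
print(labels)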
K-means demo
http://www.paused21.net/off/kmeans/bin/
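For reference, a minimal numpy sketch of the loop the demo animates; this is my own implementation of Lloyd's algorithm, not the demo's code: assign each point to its nearest center, move each center to the mean of its points, repeat.

import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(iters):
        # assignment step: nearest center for each point
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each center moves to the mean of its points
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

X = np.random.default_rng(3).normal(size=(200, 2))
labels, centers = kmeans(X, k=3)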
[Figure: dendrogram from average-linkage hierarchical clustering of UK House of Lords voting records; leaves labeled by party (Con, Lab, LDem, XB, Bp, DUP, UUP, UKIP, PC, Ind, Other); distance scale 0 to 200]
Dimensionality reduction

Given {x} ∈ ℝ^N, project to {y} ∈ ℝ^K, K << N. Probably K = 2 or K = 3.

Want a good projection that preserves separation between clusters.
Multidimensional scaling

Idea: try to preserve distances between clusters.

Given {x_i} ∈ ℝ^N and a distance matrix D_ij = |x_i − x_j| for all i, j, we can recover the {x_i} coordinates exactly (up to rigid transformations).
Multidimensional scaling

Torgerson's "classical MDS" algorithm (1952)

$$\mathrm{stress}(x) = \sum_{i,j}\left(\|x_i - x_j\| - d_{ij}\right)^2$$
Think of springs between every pair of points. The spring between x_i and x_j has rest length d_ij. Stress is zero if all high-dimensional distances are matched exactly in low dimension.
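A sketch of stress-minimizing MDS with scikit-learn; note that sklearn's MDS uses the SMACOF stress-majorization algorithm rather than Torgerson's eigendecomposition, but it minimizes a stress of exactly this spring form. The random input is placeholder data.

import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

X_high = np.random.default_rng(4).normal(size=(30, 10))  # placeholder data
D = squareform(pdist(X_high))                            # D_ij = |x_i - x_j|

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
Y = mds.fit_transform(D)   # 2-D coordinates for plotting
print(mds.stress_)         # residual stress after fitting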
Week 2: Clustering

Vector representation of objects
Distance Metrics
Clustering Algorithms
Editorial Choice
Robustness of results

If we see the same pattern using many different techniques, it's probably real...

...but there can still be major interpretive errors: what does the pattern mean, and how do we know we're even looking at the right objects?
[Figure: clustering of congressional voting records, Republicans vs. Democrats, with Jones (R NC-3) labeled]
Robustness of results

Regarding these analyses of congressional voting, we could still ask:

Are we modeling the right thing? (What about other legislative work, e.g. in committee?)

Are our underlying assumptions correct? (Do representatives really have ideal points in a preference space?)

What are we trying to argue? What will be the effect of pointing out this result?
Editorial choice in:

object selection
object encoding
distance function design
clustering algorithm
visualization algorithm
variables to correlate against
visualization design
story, lede, presentation