
Frontiers of

Computational Journalism
Columbia Journalism School
Week 2: Clustering
September 18, 2015

Classification and Clustering


Classification is arguably one of the most central and
generic of all our conceptual exercises. It is the
foundation not only for conceptualization, language,
and speech, but also for mathematics, statistics, and
data analysis in general.
- Kenneth D. Bailey, Typologies and Taxonomies: An Introduction to Classification Techniques

Vector representation of objects


$$x = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{bmatrix}$$

Each xi is a numerical or categorical feature


N = number of features or dimension
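
A concrete sketch in Python (the features here are invented for illustration):

import numpy as np

# One object, e.g. a legislator, as a feature vector:
# [yes_votes, no_votes, years_in_office, committee_count]  <- hypothetical features
x = np.array([212.0, 34.0, 8.0, 3.0])
N = len(x)  # N = 4, the dimension of the feature space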

Distance metric
Intuitively: how (dis)similar are two items?
Formally:
d(x, y) ≥ 0
d(x, x) = 0
d(x, y) = d(y, x)
d(x, z) ≤ d(x, y) + d(y, z)

Distance metric
d(x, y) ≥ 0
- distance is never negative

d(x, x) = 0
- reflexivity: zero distance to self

d(x, y) = d(y, x)
- symmetry: x to y is the same as y to x

d(x, z) ≤ d(x, y) + d(y, z)
- triangle inequality: going direct is never longer than a detour
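
These properties are easy to check numerically. A minimal sketch using Euclidean distance, one metric that satisfies all four:

import numpy as np

def euclidean(x, y):
    # Euclidean (L2) distance between two feature vectors
    return np.sqrt(np.sum((x - y) ** 2))

x = np.array([0.0, 0.0])
y = np.array([3.0, 4.0])
z = np.array([6.0, 0.0])

print(euclidean(x, y))                      # 5.0 -- never negative
print(euclidean(x, x))                      # 0.0 -- zero distance to self
print(euclidean(x, y) == euclidean(y, x))   # True -- symmetry
print(euclidean(x, z) <= euclidean(x, y) + euclidean(y, z))  # True -- triangle inequality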

Distance matrix
Data matrix for M objects of N dimensions:

$$X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_M \end{bmatrix} = \begin{bmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,N} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,N} \\ \vdots & \vdots & \ddots & \vdots \\ x_{M,1} & x_{M,2} & \cdots & x_{M,N} \end{bmatrix}$$

Distance matrix

$$D_{ij} = D_{ji} = d(x_i, x_j) = \begin{bmatrix} d_{1,1} & d_{1,2} & \cdots & d_{1,M} \\ d_{2,1} & d_{2,2} & \cdots & d_{2,M} \\ \vdots & \vdots & \ddots & \vdots \\ d_{M,1} & d_{M,2} & \cdots & d_{M,M} \end{bmatrix}$$
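
In practice D is usually computed with a library call. A sketch using scipy (Euclidean distance is an assumption; any metric works):

import numpy as np
from scipy.spatial.distance import pdist, squareform

# M = 4 objects, N = 3 features each
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 3.0],
              [4.0, 4.0, 0.0],
              [1.0, 1.0, 2.0]])

# pdist computes all pairwise distances; squareform lays them out as M x M
D = squareform(pdist(X, metric="euclidean"))

print(D.shape)              # (4, 4)
print(np.allclose(D, D.T))  # True: D is symmetric, D_ij = D_ji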

We think of a cluster like this

Real data isn't so simple

Many definitions of a cluster

Many definitions of a cluster


every point inside is closer to the center of this cluster than to the center of any other
no point outside the cluster is closer to its center than any point inside
every point in this cluster is closer to all points inside than to any point outside

Different clustering algorithms


Partitioning
o keep adjusting clusters until convergence
o e.g. K-means

Agglomerative hierarchical
o start with leaves, repeatedly merge clusters
o e.g. MIN and MAX approaches

Divisive hierarchical
o start with root, repeatedly split clusters
o e.g. binary split

K-means demo

http://www.paused21.net/off/kmeans/bin/
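
For readers who want code rather than the interactive demo, a minimal k-means sketch with scikit-learn on synthetic data:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# two synthetic blobs of 2-D points
points = np.vstack([rng.normal(0, 1, (50, 2)),
                    rng.normal(5, 1, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_[:10])      # cluster assignment for each point
print(km.cluster_centers_)  # the two centers after convergence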

Agglomerative: combining clusters
put each item into a leaf node
while num clusters > 1:
    find two closest clusters
    merge them

Divisive: splitting clusters
put all items into one cluster
while num clusters < num items:
    find largest cluster
    split it so the pieces are as far apart as possible
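
One common way to implement the split step is to run 2-means on the cluster being divided ("bisecting k-means"); a sketch of that choice, not the only possible one:

import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(points, num_clusters):
    # divisive clustering: repeatedly split the largest cluster in two
    clusters = [points]
    while len(clusters) < num_clusters:
        largest = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        cluster = clusters.pop(largest)
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(cluster)
        clusters.append(cluster[labels == 0])
        clusters.append(cluster[labels == 1])
    return clusters

rng = np.random.default_rng(1)
pts = rng.normal(0, 1, (100, 2))
print([len(c) for c in bisecting_kmeans(pts, 4)])  # sizes of the 4 clusters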

complete link or "max": distance between the farthest pair of points

single link or "min": distance between the closest pair of points

average: mean distance over all pairs

Trees and Dendrograms
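
A sketch with scipy that builds the merge tree and draws its dendrogram; the method argument selects complete ("max"), single ("min"), or average linkage:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(2)
points = np.vstack([rng.normal(0, 1, (20, 2)),
                    rng.normal(6, 1, (20, 2))])

Z = linkage(points, method="complete")  # agglomerative merges, bottom-up

# cut the tree into a flat clustering with exactly two clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)

dendrogram(Z)  # draw the merge tree
plt.show()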

UK House of Lords voting clusters

UK House of Lords voting clusters


Algorithm instructed to separate MPs into five clusters. Output:
[Slide shows a grid of cluster assignments, one number from 1 to 5 per member.]
Voting clusters with parties


[Slide shows the same grid with each member's party (Con, Lab, LDem, XB, Bp) next to the assigned cluster number. Labour members land almost entirely in cluster 2, Conservatives and Liberal Democrats mostly in cluster 1, and crossbenchers (XB) are spread across clusters 1 to 5.]

Clustering Algorithm
Input: data points (feature vectors).
Output: a set of clusters, each of which is a set of points.

Visualization
Input: data points (feature vectors).
Output: a picture of the points.

Dimensionality reduction
Problem: the vector space is high-dimensional, up to thousands of dimensions, but the screen is two-dimensional.
We have to go from
x ∈ R^N
to much lower-dimensional points
y ∈ R^K, K << N
Probably K = 2 or K = 3.

This is called "projection"

Projection from 3 to 2 dimensions

Linear projections
Projects each point in a straight line to the closest point on the "screen." Mathematically,
y = Px
where P is a K × N matrix.
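
A minimal numpy sketch; this particular P just keeps the first two coordinates and is chosen only for illustration:

import numpy as np

x = np.array([1.0, 2.0, 3.0])   # a point in N = 3 dimensions

# P is K x N; this choice drops the third coordinate entirely
P = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])

y = P @ x
print(y)  # [1. 2.] -- the projection onto the "screen"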

Projection from 2 to 1 dimensions

Think of this as rotating to align the "screen" with the coordinate axes, then simply throwing out the values of the higher dimensions.

Projection from 3 to 2 dimensions

Which direction should we look from?

Principal components analysis: find a linear projection that preserves the greatest variance.

Take the first K eigenvectors of the covariance matrix, corresponding to the largest eigenvalues. This gives a K-dimensional subspace for projection.
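
A sketch of exactly that recipe on synthetic data: center, form the covariance matrix, and keep the top-K eigenvectors:

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(0, 1, (200, 5))  # M = 200 points, N = 5 dimensions
K = 2

Xc = X - X.mean(axis=0)         # center the data
C = np.cov(Xc, rowvar=False)    # N x N covariance matrix

# eigh returns eigenvalues in ascending order for symmetric matrices,
# so the last K columns are the top-K eigenvectors
eigvals, eigvecs = np.linalg.eigh(C)
P = eigvecs[:, -K:].T           # a K x N projection matrix

Y = Xc @ P.T                    # project every point: y = Px
print(Y.shape)                  # (200, 2)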

Sometimes overlap is unavoidable

Real data isn't so simple

Nonlinear projections
Still going from high-dimensional x to low-dimensional y, but now
y = f(x)
for some nonlinear function f(). So it may not preserve relative distances, angles, etc.

Fish-eye projection from 3 to 2 dimensions

Multidimensional scaling
Idea: try to preserve the distances between points "as much as possible."
If we have the distances between all pairs of points in a distance matrix,
Dij = |xi − xj| for all i, j
we can recover the original {xi} coordinates exactly (up to rigid transformations). Like working out a country's map if you know how far each city is from every other.

Multidimensional scaling

Torgerson's "classical MDS" algorithm (1952)
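
A sketch of the classical algorithm under its standard formulation (double-center the squared distances, then eigendecompose); the round trip below checks that pairwise distances are recovered:

import numpy as np

def classical_mds(D, k=2):
    # recover k-dimensional coordinates from an M x M distance matrix D
    M = D.shape[0]
    J = np.eye(M) - np.ones((M, M)) / M   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)  # ascending eigenvalues
    top = np.maximum(eigvals[-k:], 0)     # clip tiny negatives from round-off
    return eigvecs[:, -k:] * np.sqrt(top)

# points -> distance matrix -> recovered points
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
Y = classical_mds(D, k=2)

D2 = np.sqrt(((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1))
print(np.allclose(D, D2))  # True: distances match; coordinates differ only by a rigid motion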

Reducing dimension with MDS

Notice: the dimension N is not encoded in the distance matrix D (it's M × M, where M is the number of points).
The MDS formula (theoretically) allows us to recover point coordinates {x} in any number of dimensions k.

MDS Stress minimization

The formula actually minimizes "stress":

$$\mathrm{stress}(x) = \sum_{i,j} \left( \| x_i - x_j \| - d_{ij} \right)^2$$

Think of springs between every pair of points. The spring between xi and xj has rest length dij.
Stress is zero if all high-dimensional distances are matched exactly in the low dimension.
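
In practice the stress is minimized iteratively. A sketch with scikit-learn's MDS, which minimizes a stress of this form using the SMACOF algorithm:

import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(4)
X = rng.normal(0, 1, (30, 10))  # high-dimensional points

# precompute the pairwise distances d_ij
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
Y = mds.fit_transform(D)  # the 2-D layout

print(Y.shape)      # (30, 2)
print(mds.stress_)  # residual stress: how much distortion remains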

Multi-dimensional Scaling
Like "flattening" a stretchy structure into 2D, so that distances between points are preserved (as much as possible).

House of Lords MDS plot

Robustness of results
Regarding these analyses of legislative voting, we could still ask:
Are we modeling the right thing? (What about other legislative work, e.g. in committee?)
Are our underlying assumptions correct? (Do representatives really have ideal points in a preference space?)
What are we trying to argue? What will be the effect of pointing out this result?

Why do clusters have meaning?



What is the connection between
mathematical and semantic properties?

No unique right clustering


Different distance metrics and clustering algorithms
give different results.
Should we sort incident reports by location, time,
actor, event type, author, cost, casualties?
There is only context-specific categorization.
And the computer doesn't understand your context.

Different libraries,
different categories
