
Frontiers of

Computational Journalism
Columbia Journalism School
Week 2: Clustering
September 18, 2015

Classification and Clustering


Classification is arguably one of the most central and
generic of all our conceptual exercises. It is the
foundation not only for conceptualization, language,
and speech, but also for mathematics, statistics, and
data analysis in general.
- Kenneth D. Bailey, Typologies and Taxonomies: An Introduction to Classification Techniques

Vector representation of objects


$$x = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{bmatrix}$$

Each xi is a numerical or categorical feature


N = number of features or dimension
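
A concrete sketch in Python (the features here are invented for illustration):

import numpy as np

# One object, e.g. a legislator, as a feature vector:
# [yes_votes, no_votes, years_in_office, committee_count]  <- hypothetical features
x = np.array([212.0, 34.0, 8.0, 3.0])
N = len(x)  # N = 4, the dimension of the feature space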

Distance metric
Intuitively: how (dis)similar are two items?
Formally:
d(x, y) ≥ 0
d(x, x) = 0
d(x, y) = d(y, x)
d(x, z) ≤ d(x, y) + d(y, z)

Distance metric
d(x, y) ≥ 0
- distance is never negative

d(x, x) = 0
- reflexivity: zero distance to self

d(x, y) = d(y, x)
- symmetry: x to y is the same as y to x

d(x, z) ≤ d(x, y) + d(y, z)
- triangle inequality: going direct is never longer than a detour
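
These properties are easy to check numerically. A minimal sketch using Euclidean distance, one metric that satisfies all four:

import numpy as np

def euclidean(x, y):
    # Euclidean (L2) distance between two feature vectors
    return np.sqrt(np.sum((x - y) ** 2))

x = np.array([0.0, 0.0])
y = np.array([3.0, 4.0])
z = np.array([6.0, 0.0])

print(euclidean(x, y))                      # 5.0 -- never negative
print(euclidean(x, x))                      # 0.0 -- zero distance to self
print(euclidean(x, y) == euclidean(y, x))   # True -- symmetry
print(euclidean(x, z) <= euclidean(x, y) + euclidean(y, z))  # True -- triangle inequality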

Distance matrix
Data matrix for M objects of N dimensions:

$$X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_M \end{bmatrix} = \begin{bmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,N} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,N} \\ \vdots & \vdots & \ddots & \vdots \\ x_{M,1} & x_{M,2} & \cdots & x_{M,N} \end{bmatrix}$$

Distance matrix

$$D_{ij} = D_{ji} = d(x_i, x_j) = \begin{bmatrix} d_{1,1} & d_{1,2} & \cdots & d_{1,M} \\ d_{2,1} & d_{2,2} & \cdots & d_{2,M} \\ \vdots & \vdots & \ddots & \vdots \\ d_{M,1} & d_{M,2} & \cdots & d_{M,M} \end{bmatrix}$$
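
In practice D is usually computed with a library call. A sketch using scipy (Euclidean distance is an assumption; any metric works):

import numpy as np
from scipy.spatial.distance import pdist, squareform

# M = 4 objects, N = 3 features each
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 3.0],
              [4.0, 4.0, 0.0],
              [1.0, 1.0, 2.0]])

# pdist computes all pairwise distances; squareform lays them out as M x M
D = squareform(pdist(X, metric="euclidean"))

print(D.shape)              # (4, 4)
print(np.allclose(D, D.T))  # True: D is symmetric, D_ij = D_ji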

We think of a cluster like this

Real data isn't so simple

Many definitions of a cluster

Many definitions of a cluster


every point inside is closer to the center of this cluster than to the center of any other
no point outside the cluster is closer to its center than any point inside
every point in this cluster is closer to all points inside than to any point outside

Different clustering algorithms


Partitioning
o keep adjusting clusters until convergence
o e.g. K-means

Agglomerative hierarchical
o start with leaves, repeatedly merge clusters
o e.g. MIN and MAX approaches

Divisive hierarchical
o start with root, repeatedly split clusters
o e.g. binary split

K-means demo

http://www.paused21.net/off/kmeans/bin/
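
For readers who want code rather than the interactive demo, a minimal k-means sketch with scikit-learn on synthetic data:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# two synthetic blobs of 2-D points
points = np.vstack([rng.normal(0, 1, (50, 2)),
                    rng.normal(5, 1, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_[:10])      # cluster assignment for each point
print(km.cluster_centers_)  # the two centers after convergence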

Agglomerative: combining clusters
put each item into a leaf node
while num clusters > 1:
    find two closest clusters
    merge them

Divisive: splitting clusters
put all items into one cluster
while num clusters < num items:
    find largest cluster
    split it so the pieces are as far apart as possible
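
One common way to implement the split step is to run 2-means on the cluster being divided ("bisecting k-means"); a sketch of that choice, not the only possible one:

import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(points, num_clusters):
    # divisive clustering: repeatedly split the largest cluster in two
    clusters = [points]
    while len(clusters) < num_clusters:
        largest = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        cluster = clusters.pop(largest)
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(cluster)
        clusters.append(cluster[labels == 0])
        clusters.append(cluster[labels == 1])
    return clusters

rng = np.random.default_rng(1)
pts = rng.normal(0, 1, (100, 2))
print([len(c) for c in bisecting_kmeans(pts, 4)])  # sizes of the 4 clusters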

complete link or "max": distance between the farthest pair of points

single link or "min": distance between the closest pair of points

average: mean distance over all pairs

Trees and Dendrograms
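
A sketch with scipy that builds the merge tree and draws its dendrogram; the method argument selects complete ("max"), single ("min"), or average linkage:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(2)
points = np.vstack([rng.normal(0, 1, (20, 2)),
                    rng.normal(6, 1, (20, 2))])

Z = linkage(points, method="complete")  # agglomerative merges, bottom-up

# cut the tree into a flat clustering with exactly two clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)

dendrogram(Z)  # draw the merge tree
plt.show()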

UK House of Lords voting clusters

UK House of Lords voting clusters


Algorithm instructed to separate MPs into five clusters. Output:
[Slide shows a grid of cluster assignments, one number from 1 to 5 per member.]
Voting clusters with parties


[Slide shows the same grid with each member's party (Con, Lab, LDem, XB, Bp) next to the assigned cluster number. Labour members land almost entirely in cluster 2, Conservatives and Liberal Democrats mostly in cluster 1, and crossbenchers (XB) are spread across clusters 1 to 5.]

Clustering Algorithm
Input: data points (feature vectors).
Output: a set of clusters, each of which is a set of points.

Visualization
Input: data points (feature vectors).
Output: a picture of the points.

Dimensionality reduction
Problem: the vector space is high-dimensional, up to thousands of dimensions, but the screen is two-dimensional.
We have to go from
x ∈ R^N
to much lower-dimensional points
y ∈ R^K, K << N
Probably K = 2 or K = 3.

This is called "projection"

Projection from 3 to 2 dimensions

Linear projections
Projects each point in a straight line to the closest point on the "screen." Mathematically,
y = Px
where P is a K × N matrix.
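
A minimal numpy sketch; this particular P just keeps the first two coordinates and is chosen only for illustration:

import numpy as np

x = np.array([1.0, 2.0, 3.0])   # a point in N = 3 dimensions

# P is K x N; this choice drops the third coordinate entirely
P = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])

y = P @ x
print(y)  # [1. 2.] -- the projection onto the "screen"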

Projection from 2 to 1 dimensions

Think of this as rotating to align the "screen" with the coordinate axes, then simply throwing out the values of the higher dimensions.

Projection from 3 to 2 dimensions

Which direction should we look from?

Principal components analysis: find a linear projection that preserves the greatest variance.

Take the first K eigenvectors of the covariance matrix, corresponding to the largest eigenvalues. This gives a K-dimensional subspace for projection.
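
A sketch of exactly that recipe on synthetic data: center, form the covariance matrix, and keep the top-K eigenvectors:

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(0, 1, (200, 5))  # M = 200 points, N = 5 dimensions
K = 2

Xc = X - X.mean(axis=0)         # center the data
C = np.cov(Xc, rowvar=False)    # N x N covariance matrix

# eigh returns eigenvalues in ascending order for symmetric matrices,
# so the last K columns are the top-K eigenvectors
eigvals, eigvecs = np.linalg.eigh(C)
P = eigvecs[:, -K:].T           # a K x N projection matrix

Y = Xc @ P.T                    # project every point: y = Px
print(Y.shape)                  # (200, 2)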

Sometimes overlap is unavoidable

Real data isn't so simple

Nonlinear projections
Still going from high-dimensional x to low-dimensional y, but now
y = f(x)
for some nonlinear function f(). So it may not preserve relative distances, angles, etc.

Fish-eye projection from 3 to 2 dimensions

Multidimensional scaling
Idea: try to preserve the distances between points "as much as possible."
If we have the distances between all pairs of points in a distance matrix,
Dij = |xi − xj| for all i, j
we can recover the original {xi} coordinates exactly (up to rigid transformations). Like working out a country's map if you know how far each city is from every other.

Multidimensional scaling

Torgerson's "classical MDS" algorithm (1952)
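
A sketch of the classical algorithm under its standard formulation (double-center the squared distances, then eigendecompose); the round trip below checks that pairwise distances are recovered:

import numpy as np

def classical_mds(D, k=2):
    # recover k-dimensional coordinates from an M x M distance matrix D
    M = D.shape[0]
    J = np.eye(M) - np.ones((M, M)) / M   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)  # ascending eigenvalues
    top = np.maximum(eigvals[-k:], 0)     # clip tiny negatives from round-off
    return eigvecs[:, -k:] * np.sqrt(top)

# points -> distance matrix -> recovered points
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
Y = classical_mds(D, k=2)

D2 = np.sqrt(((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1))
print(np.allclose(D, D2))  # True: distances match; coordinates differ only by a rigid motion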

Reducing dimension with MDS

Notice: the dimension N is not encoded in the distance matrix D (it's M × M, where M is the number of points).
The MDS formula (theoretically) allows us to recover point coordinates {x} in any number of dimensions k.

MDS Stress minimization

The formula actually minimizes "stress":

$$\mathrm{stress}(x) = \sum_{i,j} \left( \| x_i - x_j \| - d_{ij} \right)^2$$

Think of springs between every pair of points. The spring between xi and xj has rest length dij.
Stress is zero if all high-dimensional distances are matched exactly in the low dimension.
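
In practice the stress is minimized iteratively. A sketch with scikit-learn's MDS, which minimizes a stress of this form using the SMACOF algorithm:

import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(4)
X = rng.normal(0, 1, (30, 10))  # high-dimensional points

# precompute the pairwise distances d_ij
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
Y = mds.fit_transform(D)  # the 2-D layout

print(Y.shape)      # (30, 2)
print(mds.stress_)  # residual stress: how much distortion remains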

Multi-dimensional Scaling
Like "flattening" a stretchy structure into 2D, so that distances between points are preserved (as much as possible).

House of Lords MDS plot

Robustness of results
Regarding these analyses of legislative voting, we could still ask:
Are we modeling the right thing? (What about other legislative work, e.g. in committee?)
Are our underlying assumptions correct? (Do representatives really have ideal points in a preference space?)
What are we trying to argue? What will be the effect of pointing out this result?

Why do clusters have meaning?



What is the connection between
mathematical and semantic properties?

No unique right clustering


Different distance metrics and clustering algorithms
give different results.
Should we sort incident reports by location, time,
actor, event type, author, cost, casualties?
There is only context-specific categorization.
And the computer doesn't understand your context.

Different libraries,
different categories
