
Frontiers of Computational Journalism

Columbia Journalism School
Week 2: Clustering

September 12, 2014



Classification and Clustering
"Classification is arguably one of the most
central and generic of all our conceptual
exercises. It is the foundation not only for
conceptualization, language, and speech, but
also for mathematics, statistics, and data
analysis in general."

- Kenneth D. Bailey, Typologies and Taxonomies: An Introduction to
Classification Techniques
Vector representation of objects

$$x = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{bmatrix}$$

Each $x_i$ is a numerical or categorical feature.
N = number of features, or "dimension"
Examples of vector representations
Obvious
- movies watched / items purchased
- legislative voting history for a politician
- crime locations
Less obvious, but standard
- document vector space model
- psychological survey results
Tricky research problem: disparate field types
- corporate filing document
- Wikileaks SIGACT
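As a rough illustration of the "legislative voting history" example, the sketch below turns one politician's votes into a feature vector. The bill names, the votes, and the numeric encoding (aye = 1, no = -1, absent = 0) are all hypothetical choices made for this example, not something taken from the slides.

```python
# Hypothetical roll-call data: each key is a bill, each value a recorded vote.
votes = {"bill_1": "aye", "bill_2": "no", "bill_3": "absent", "bill_4": "aye"}

# Assumed encoding (an illustration only): aye = 1, no = -1, absent = 0.
ENCODING = {"aye": 1.0, "no": -1.0, "absent": 0.0}


def to_vector(votes, bill_order):
    """Return one number per bill, in a fixed order shared by all legislators."""
    return [ENCODING[votes[bill]] for bill in bill_order]


bill_order = sorted(votes)        # the N features, in a fixed order
x = to_vector(votes, bill_order)  # an N-dimensional feature vector
print(x)
```

Every legislator must be encoded with the same `bill_order`, so that position i means the same bill in every vector.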
What can we do with vectors?

Predict one variable based on others
- this is called "regression"
- supervised machine learning

Group similar items together
- this is classification or clustering
- we may or may not know pre-existing classes

Distance metric
Intuitively: how (dis)similar are two items?

Formally:

d(x, y) ≥ 0
- distance is never negative
d(x, x) = 0
- "reflexivity": zero distance to self
d(x, y) = d(y, x)
- "symmetry": x to y same as y to x
d(x, z) ≤ d(x, y) + d(y, z)
- "triangle inequality": going direct is shorter
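One familiar function with these properties is straight-line (Euclidean) distance. The sketch below defines it and spot-checks the four properties on a single triple of points. The helper name `satisfies_axioms` and the sample points are made up for illustration, and a check on one triple is of course not a proof.

```python
import math


def euclidean(x, y):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def satisfies_axioms(d, x, y, z, tol=1e-12):
    """Check the four metric properties for one triple of points."""
    return (
        d(x, y) >= 0                            # never negative
        and d(x, x) == 0                        # zero distance to self
        and d(x, y) == d(y, x)                  # symmetry
        and d(x, z) <= d(x, y) + d(y, z) + tol  # triangle inequality
    )


x, y, z = [0.0, 0.0], [3.0, 4.0], [6.0, 0.0]
print(euclidean(x, y), satisfies_axioms(euclidean, x, y, z))
```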


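Distance matrix
Data matrix for M objects of N dimensions:

$$X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_M \end{bmatrix} = \begin{bmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,N} \\ x_{2,1} & x_{2,2} & & \vdots \\ \vdots & & \ddots & \\ x_{M,1} & \cdots & & x_{M,N} \end{bmatrix}$$

Distance matrix, with entries $D_{ij} = D_{ji} = d(x_i, x_j)$:

$$D = \begin{bmatrix} d_{1,1} & d_{1,2} & \cdots & d_{1,M} \\ d_{2,1} & d_{2,2} & & \vdots \\ \vdots & & \ddots & \\ d_{M,1} & \cdots & & d_{M,M} \end{bmatrix}$$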
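A minimal sketch of building D from a data matrix X, assuming Euclidean distance as the metric d (any function satisfying the metric properties could be passed in instead). Because D is symmetric with a zero diagonal, only the upper triangle is computed and then mirrored.

```python
import math


def euclidean(x, y):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def distance_matrix(X, d=euclidean):
    """Return the M-by-M matrix D with D[i][j] = d(X[i], X[j])."""
    M = len(X)
    D = [[0.0] * M for _ in range(M)]
    for i in range(M):
        for j in range(i + 1, M):
            D[i][j] = D[j][i] = d(X[i], X[j])  # fill both halves at once
    return D


# Three objects (M = 3) with two features each (N = 2).
X = [[0.0, 0.0], [3.0, 4.0], [6.0, 0.0]]
D = distance_matrix(X)
for row in D:
    print(row)
```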
We think of a cluster like this.
Real data isn't so simple.
Many possible definitions of a cluster
"every point inside is closer to the center of
this cluster than the center of any other"
"no point outside this cluster is closer than ε
to any point inside"
"every point in this cluster is closer to all
points inside than any point outside"

Different clustering algorithms
Partitioning
- keep adjusting clusters until convergence
- e.g. k-means
Agglomerative hierarchical
- start with leaves, repeatedly merge clusters
- e.g. MIN and MAX approaches
Divisive hierarchical
- start with root, repeatedly split clusters
- e.g. binary split
k-means demo
http://www.paused21.net/off/kmeans/bin/
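The "keep adjusting clusters until convergence" idea behind k-means can be sketched as follows. This is a bare-bones version of the standard assign-then-update loop, not the code behind the demo. The random initialization from the data points, the `seed` and `max_iter` parameters, and the sample points are assumptions made for this example.

```python
import random


def sq_dist(x, y):
    """Squared Euclidean distance (enough for comparing which center is closer)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(col) / len(points) for col in zip(*points)]


def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumed init: k distinct data points
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:  # nothing moved: converged
            break
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # leave an empty cluster's center where it is
                centers[c] = mean(members)
    return labels, centers


points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centers = kmeans(points, k=2)
print(labels, centers)
```

The result depends on the initial centers. On well-separated data like this the two groups are recovered whichever points are picked first, but on real data different seeds can give different clusterings.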
Agglomerative - combining clusters

put each item into a leaf node
while num clusters > 1
    find two closest clusters
    merge them

Ways to measure the distance between two clusters:
"single link" or "min"
"complete link" or "max"
average
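The merge loop and the three linkage rules can be sketched in a few lines. This is a deliberately naive version that recomputes every cluster-to-cluster distance on each pass. It assumes Euclidean distance between points. The `num_clusters` stopping parameter and the returned merge history are additions made for this example, and the function and variable names are invented.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# How to turn all point-to-point distances between two clusters into one number.
LINKAGE = {
    "single": min,                            # "min": closest pair
    "complete": max,                          # "max": farthest pair
    "average": lambda ds: sum(ds) / len(ds),  # mean over all pairs
}


def cluster_distance(a, b, points, linkage):
    ds = [euclidean(points[i], points[j]) for i in a for j in b]
    return LINKAGE[linkage](ds)


def agglomerate(points, num_clusters=1, linkage="single"):
    clusters = [[i] for i in range(len(points))]  # each item starts as a leaf
    merges = []                                   # merge history, i.e. the tree
    while len(clusters) > num_clusters:
        # Find the two closest clusters under the chosen linkage.
        pairs = (
            (p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        a, b = min(
            pairs,
            key=lambda pq: cluster_distance(
                clusters[pq[0]], clusters[pq[1]], points, linkage
            ),
        )
        merges.append((clusters[a], clusters[b]))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return clusters, merges


points = [[0, 0], [0, 1], [5, 5], [5, 6], [20, 20]]
clusters, merges = agglomerate(points, num_clusters=2, linkage="single")
print(clusters)
```

Clusters hold indices into `points`. Running with `num_clusters=1` reproduces the loop as written on the slide, and `merges` then records the full hierarchy from leaves to root.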
UK House of Lords voting clusters
Algorithm instructed to separate MPs into five clusters. Output:


1 2 2 1 3 2 2 2 1 4
1 1 1 1 1 1 5 2 1 1
2 2 1 2 3 2 2 4 2 1
2 3 2 1 3 1 1 2 1 2
1 5 2 1 4 2 2 1 2 1
1 4 1 1 4 1 2 2 1 5
1 1 1 2 3 3 2 2 2 5
2 3 1 2 1 4 1 1 4 4
1 1 2 1 1 2 2 2 2 1
2 1 2 1 2 2 1 3 2 1
1 2 2 1 2 3 4 2 2 2
.
.
.
Voting clusters with parties
LDem XB Lab LDem XB Lab XB Lab Con XB
1 2 2 1 3 2 2 2 1 4
Con Con LDem Con Con Con LDem Lab Con LDem
1 1 1 1 1 1 5 2 1 1
Lab Lab Con Lab XB XB Lab XB Lab Con
2 2 1 2 3 2 2 4 2 1
Lab XB Lab Con XB XB LDem Lab XB Lab
2 3 2 1 3 1 1 2 1 2
Con Con Lab Con XB Lab Lab Con XB XB
1 5 2 1 4 2 2 1 2 1
Con XB Con Con XB Con Lab XB LDem Con
1 4 1 1 4 1 2 2 1 5
Con Con Con Lab Bp XB Lab Lab Lab LDem
1 1 1 2 3 3 2 2 2 5
Lab XB Con Lab Con XB Con Con XB XB
2 3 1 2 1 4 1 1 4 4
Con Con Lab Con Con XB Lab Lab Lab Con
1 1 2 1 1 2 2 2 2 1
Lab LDem Lab Con Lab Lab Con XB Lab Con
2 1 2 1 2 2 1 3 2 1
Con Lab XB Con XB XB XB Lab Lab Lab
1 2 2 1 2 3 4 2 2 2
.
.
.

Clustering Algorithm

Input: data points (feature vectors).
Output: a set of clusters, each of which is a set of
points.

Visualization

Input: data points (feature vectors).
Output: a picture of the points.

Dimensionality reduction
Problem: vector space is high-dimensional, up to
thousands of dimensions. The screen is two-dimensional.

We have to go from

$$x \in \mathbb{R}^N$$

to much lower dimensional points

$$y \in \mathbb{R}^k, \quad k \ll N$$

Probably k=2 or k=3.
This is called "projection".
Projection from 3 to 2 dimensions
Linear projections
Projects in a straight line to
closest point on "screen."
Mathematically,

$$y = Px$$

where P is a k by N matrix.
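A tiny worked instance of y = Px, assuming the simplest possible "screen": the plane spanned by the first two coordinate axes, so that P just drops the third coordinate. The matrix and the point are made up for illustration.

```python
def project(P, x):
    """Compute y = P x for a k-by-N matrix P (a list of rows) and an N-vector x."""
    return [sum(p * xi for p, xi in zip(row, x)) for row in P]


# k = 2, N = 3: keep the first two coordinates, discard the third.
P = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]
x = [2.0, 3.0, 7.0]
y = project(P, x)
print(y)
```

Looking from a different direction corresponds to a different P whose rows are still perpendicular unit vectors spanning the screen.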
Projection from 2 to 1 dimensions

Think of this as rotating to align the "screen" with coordinate
axes, then simply throwing out values of higher dimensions.
Projection from 3 to 2 dimensions
Which direction should we look from?
Principal components analysis: find a linear projection that
preserves greatest variance.

Take the k eigenvectors of the covariance matrix corresponding to
the k largest eigenvalues. This gives a k-dimensional sub-space for
projection.
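The eigenvector recipe above can be sketched with NumPy. This is a minimal version under the usual assumptions (rows of X are points, features are centered first). The function name `pca_project` and the sample data are invented for this example, and a production analysis would more likely call a library routine.

```python
import numpy as np


def pca_project(X, k):
    """Project the rows of X (M points, N features) onto the top-k principal components."""
    Xc = X - X.mean(axis=0)                 # center each feature
    cov = np.cov(Xc, rowvar=False)          # N-by-N covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # symmetric input, eigenvalues ascending
    order = np.argsort(eigvals)[::-1][:k]   # indices of the k largest eigenvalues
    P = eigvecs[:, order].T                 # k-by-N projection matrix
    return Xc @ P.T, P                      # y = P x for every (centered) row


# Points lying exactly on a line in 2-D: one component captures all the variance.
X = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
Y, P = pca_project(X, k=1)
print(P)
print(Y.ravel())
```

The sign of each eigenvector is arbitrary, so the projected coordinates may come out mirrored from one run or library to another without changing their meaning.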
Sometimes overlap is unavoidable
Real data isn't so simple.
Nonlinear projections
Still going from high-dimensional x to low-dimensional y, but now

y = f(x)

for some function f(), not linear. So, may not preserve relative
distances, angles, etc.
Fish-eye projection from 3 to 2 dimensions
Multidimensional scaling
Idea: try to preserve distances between points
"as much as possible."

If we have the distances between all points in a distance matrix,

$$D_{ij} = |x_i - x_j| \text{ for all } i, j$$

we can recover the original $\{x_i\}$ coordinates exactly (up to rigid
transformations). It is like working out a map of a country if you know
how far away each city is from every other.



Multidimensional scaling
Torgerson's "classical MDS" algorithm (1952)




Reducing dimension with MDS
Notice: the dimension N is not encoded in the
distance matrix D (it's M by M, where M is the
number of points).

The MDS formula (theoretically) allows us to recover
point coordinates {x} in any number of
dimensions k.
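The slide's own formula did not survive extraction, so the sketch below uses the standard textbook form of classical MDS instead: square the distances, double-center them to get a Gram matrix, and read coordinates off its top k eigenvectors. Treat the function name `classical_mds` and the three-point example as assumptions made for illustration.

```python
import numpy as np


def classical_mds(D, k):
    """Recover k-dimensional coordinates from an M-by-M distance matrix D."""
    M = D.shape[0]
    J = np.eye(M) - np.ones((M, M)) / M       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]     # keep the k largest
    lam = np.clip(eigvals[order], 0.0, None)  # guard against tiny negative values
    return eigvecs[:, order] * np.sqrt(lam)   # one row of coordinates per point


# Distances between three points that really do live in 2-D.
X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 0.0]])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

Y = classical_mds(D, k=2)
D_rec = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
print(np.round(D_rec, 6))
```

The recovered `Y` generally differs from `X` by a rotation, reflection, or shift, which is the "up to rigid transformations" caveat. Only the pairwise distances are reproduced.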




MDS Stress minimization
The formula actually minimizes "stress":

$$\text{stress}(x) = \sum_{i,j} \left( |x_i - x_j| - d_{ij} \right)^2$$

Think of "springs" between every pair of points. The spring
between $x_i$ and $x_j$ has rest length $d_{ij}$.

Stress is zero if all high-dimensional distances are matched
exactly in low dimension.
Multi-dimensional Scaling
Like "flattening" a stretchy
structure into 2D, so that
distances between points
are preserved (as much as
possible).
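How well a flattened layout preserves distances can be measured with the stress: the sum, over all pairs of points, of the squared difference between the layout distance and the target distance. The sketch below computes it for two hypothetical layouts of a two-point data set. The function name and the numbers are made up for illustration.

```python
import math


def stress(X, D):
    """Sum over all ordered pairs (i, j) of (|x_i - x_j| - d_ij)^2."""
    total = 0.0
    for i in range(len(X)):
        for j in range(len(X)):
            dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(X[i], X[j])))
            total += (dist - D[i][j]) ** 2
    return total


# Target distances: two points that should be 5 apart.
D = [[0.0, 5.0],
     [5.0, 0.0]]

perfect = [[0.0, 0.0], [3.0, 4.0]]  # layout distance 5: every spring at rest
squashed = [[0.0], [4.0]]           # layout distance 4: springs compressed by 1

print(stress(perfect, D), stress(squashed, D))
```

Because the sum runs over ordered pairs, each pair of points is counted twice. This scales the stress by a constant and does not change which layout minimizes it.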
House of Lords MDS plot
Robustness of results
Regarding these analyses of congressional voting,
we could still ask:
- Are we modeling the right thing? (What about
other legislative work, e.g. in committee?)
- Are our underlying assumptions correct? (Do
representatives really have "ideal points" in a
preference space?)
- What are we trying to argue? What will be the
effect of pointing out this result?
Why do clusters have meaning?

What is the connection between mathematical
and semantic properties?
No unique "right" clustering
Different distance metrics and clustering algorithms give
different results.

Should we sort incident reports by location, time, actor, event
type, author, cost, casualties...?

There is only context-specific categorization.
And the computer doesn't understand your context.
Different libraries, different categories
