
Frontiers of Computational Journalism

Columbia Journalism School
Week 1: Basics

September 3, 2014



Lecture 1: Basics

Computer Science and Journalism

Course Structure

Interpreting High Dimensional Data




Computational Journalism: Definitions
"Broadly defined, it can involve changing how
stories are discovered, presented, aggregated,
monetized, and archived. Computation can
advance journalism by drawing on innovations
in topic detection, video analysis,
personalization, aggregation, visualization, and
sensemaking."

- Cohen, Hamilton, Turner, Computational Journalism, 2011
Computational Journalism: Definitions
"Stories will emerge from stacks of financial
disclosure forms, court records, legislative hearings,
officials' calendars or meeting notes, and
regulators' email messages that no one today has
time or money to mine. With a suite of reporting
tools, a journalist will be able to scan, transcribe,
analyze, and visualize the patterns in these
documents."

- Cohen, Hamilton, Turner, Computational Journalism, 2011
Cohen et al. model
[Diagram labels: Data, Reporting, User, Computer Science]
CS for presentation / interaction
[Diagram labels: Data, Reporting, User, CS (twice)]
Filter many stories for user
[Diagram labels: User; several Data / Reporting / CS pipelines; Filtering; CS]
Examples of filters
What an editor puts on the front page
Google News
Reddit's comment system
Twitter
Facebook news feed
Techmeme
...
Memetracker, by Leskovec, Backstrom, Kleinberg
Kony 2012 early network, by Gilad Lotan / SocialFlow
Where CS applies to Journalism
[Diagram labels: User; several Data / Reporting / CS pipelines; Filtering; Effects; CS]
Where does data come from?
Quantification
Data
Journalism as a cycle
[Diagram labels: User, Data, Reporting, Filtering, Effects, with CS at each stage]
Algorithms as the subject of journalism
Websites Vary Prices, Deals Based on Users' Information
Valentino-DeVries, Singer-Vine and Soltani, WSJ, 2012
Message Machine
Jeff Larson, Al Shaw, ProPublica, 2012
Computer Science in Journalism

Reporting
Presentation
Filtering
Tracking
Algorithmic accountability

Computational Journalism: Definitions
"the application of computer science to the
problems of public information, knowledge, and
belief, by practitioners who see their mission as
outside of both commerce and government."

- Jonathan Stray, A Computational Journalism Reading List, 2011
Course Structure
Information retrieval: TF-IDF, search engines
Text analysis: clustering and topic modeling
Information filtering systems
Social network analysis
Knowledge representation
Drawing conclusions from data
Information security
Tracking flow and effects






[Diagram labels: Natural Language Processing, Visualization, Sociology, Artificial Intelligence, Cognitive Science, Statistics, Graph Theory, Clustering, Text Analysis, Filter Design, Social Network Analysis, Knowledge Representation, Drawing Conclusions, Information Retrieval, Epistemology]
Administration
Assignment after each class
Four assignments require programming, but
your writing counts for more than your code!
Course blog
http://compjournalism.com

Final project
for 6-pt students only






Grading
Dual degree students
Pass/Fail.
Final project: paper, story, or software.
Non-journalism students
80% assignments
20% class participation








Definition of data?

a collection of related pieces of
recorded information
My definition of data
structured data
unstructured data
Quantification
\[
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{bmatrix}
\]
Other things that are tricky to
quantify, but quantified anyway
Intelligence
Academic performance
Gender
Race, ethnicity, nationality
Number of sexual harassment incidents
Income
Political ideology
...
Different types of "quantitative"
Numeric
  continuous
  countable
  bounded?
  units of measurement?
Categorical
  finite, e.g. {on, off}
  infinite, e.g. {red, yellow, blue, ... chartreuse, ...}
  ordered?
  equivalence classes or other structure?
Different types of scales
Temperature
Continuous scale, fixed zero point, physical units,
comparative, uniform
Likert scale
Discrete scale, no fixed origin, abstract units,
comparative, non-uniform
Likert scales are non-uniform
No averages on a non-uniform scale
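It's not linear, so is 2x_1 twice as good?

\[ (x_1 + c) - (x_2 + c) = x_1 - x_2 \]

(True of the numeric codes, but on a non-uniform scale equal numeric gaps need not mean equal differences in what is being measured.)

Lots of things don't make much sense, such as

\[ \frac{1}{n} \sum_{i=1}^{n} x_i = \; ? \]

Average is not well defined! (nor std dev, etc.)
But rank order statistics are robust.
And all of this might not be a problem in practice.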
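
A minimal sketch of the point above, assuming made-up survey answers and two hypothetical numeric codings of the same five Likert labels. Both codings preserve the order of the labels, so neither is "the" correct one. The comparison of means flips between them, while the comparison of medians does not.

```python
from statistics import mean, median

# Hypothetical 5-point Likert answers from two groups (illustrative data only).
group_a = [1, 1, 5, 5, 5]
group_b = [3, 3, 4, 4, 4]

# Two order-preserving codings of the same labels (both are assumptions).
linear = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5}
stretched = {1: 1, 2: 2, 3: 3, 4: 4, 5: 10}

def recode(answers, coding):
    """Map each Likert label to the number the coding assigns to it."""
    return [coding[a] for a in answers]

def summarize(coding):
    """Report whether group A beats group B by mean and by median."""
    a, b = recode(group_a, coding), recode(group_b, coding)
    return {
        "mean_a_gt_b": mean(a) > mean(b),
        "median_a_gt_b": median(a) > median(b),
    }

linear_result = summarize(linear)        # mean says B is higher, median says A
stretched_result = summarize(stretched)  # mean now says A is higher, median unchanged
print(linear_result)
print(stretched_result)
```

The median only depends on the ordering of the labels, which is why it survives any order-preserving recoding.
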
Other issues with "quantitative"
Where did the data come from?
  physical measurement
  computer logging
  human recording
What are the sources of error?
  measurement error
  missing data
  ambiguity in human classification
  process errors
  intentional bias / deception

Vector representation of objects
\[
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{bmatrix}
\]
Fundamental representation for many data mining,
clustering, machine learning, visualization, NLP, etc.
algorithms.
Each x_i is a numerical or categorical feature
N = number of features, or "dimension"

Examples of features
number of claws
latitude
color {red, yellow, blue}
number of break-ins
1 for "bought X", 0 for "did not buy X"
time, duration, etc.
number of times a given word appears in a document
votes cast
...
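
A minimal sketch of turning one object into a vector of features like those listed above. The record, the field names, and the choice of one-hot encoding for the categorical color feature are assumptions made for illustration; the slides do not prescribe an encoding.

```python
# Possible values of the categorical "color" feature (finite set).
COLORS = ["red", "yellow", "blue"]

def encode(record):
    """Turn a dict describing one object into a flat list of numbers.

    Numeric features are copied as floats. The categorical color is
    one-hot encoded: one slot per possible value, 1.0 for the match.
    """
    one_hot = [1.0 if record["color"] == c else 0.0 for c in COLORS]
    return [float(record["claws"]), float(record["latitude"])] + one_hot

# Hypothetical record, made up for this example.
example = {"claws": 4, "latitude": 40.8, "color": "yellow"}
vector = encode(example)
print(vector)  # [4.0, 40.8, 0.0, 1.0, 0.0]
```

Here N = 5: two numeric features plus three slots for the color.
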
"Feature selection"
Technical meaning in machine learning etc.:

which variables matter?

We're journalists, so we're interested in an
earlier process:

how to describe the world in numbers?
Choosing Features
\[
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{bmatrix}
\;\longrightarrow\;
\begin{bmatrix} x_{f(1)} \\ x_{f(2)} \\ \vdots \\ x_{f(k)} \end{bmatrix}
\quad \text{where } k < N
\]
Journalism: How do we represent the world numerically?
Machine learning: Which variables carry the most information?
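
A minimal sketch of the selection step in the notation above: f picks k of the N feature indices. The values of x and the particular choice of f are made up; how to choose f well is the actual machine learning question and is not addressed here.

```python
def select_features(x, f):
    """Return [x_f(1), ..., x_f(k)], using 1-based indices as in the slide."""
    return [x[i - 1] for i in f]

x = [10.0, 20.0, 30.0, 40.0, 50.0]  # N = 5 features (made-up values)
f = [2, 5]                          # k = 2 chosen indices: f(1) = 2, f(2) = 5
selected = select_features(x, f)
print(selected)  # [20.0, 50.0]
```
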
Examples of vector representations
Obvious
  movies watched / items purchased
  legislative voting history for a politician
  crime locations
Less obvious, but standard
  document vector space model
  psychological survey results
Tricky research problem: disparate field types
  corporate filing document
  Wikileaks SIGACT
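
A minimal sketch of the document vector space model mentioned above, assuming raw word counts as features (weighting schemes such as TF-IDF are left out). The two sentences and the simple tokenizer are made up for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def to_vector(text, vocabulary):
    """Count how often each vocabulary word appears in the text."""
    counts = Counter(tokenize(text))
    return [counts[w] for w in vocabulary]

# Hypothetical two-document corpus.
docs = [
    "The court records show the payments.",
    "Payments were recorded in court.",
]

# One dimension per distinct word across the whole corpus.
vocabulary = sorted({w for d in docs for w in tokenize(d)})
vectors = [to_vector(d, vocabulary) for d in docs]

print(vocabulary)
for v in vectors:
    print(v)
```

Each document becomes a vector with one dimension per vocabulary word, so N grows with the size of the corpus vocabulary.
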
What can we do with vectors?

Predict one variable based on others
  this is called "regression"
  supervised machine learning
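
A minimal sketch of regression in its simplest form: predicting one variable from one other with an ordinary least-squares line. The data points are made up and lie exactly on y = 2x + 1, so the fit can be checked by eye; real data would not be this clean.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1.0, 2.0, 3.0, 4.0]  # known variable (made-up values)
ys = [3.0, 5.0, 7.0, 9.0]  # variable to predict, here exactly 2x + 1

slope, intercept = fit_line(xs, ys)
predicted = slope * 5.0 + intercept  # prediction for an unseen x = 5
print(slope, intercept, predicted)   # 2.0 1.0 11.0
```
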

Group similar items together
  This is classification or clustering
  We may or may not know pre-existing classes
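
A minimal sketch of grouping similar vectors when the classes are not known in advance. It uses k-means, which is one common clustering method chosen here for illustration and not something the slides specify. The points and the starting centroids are made up.

```python
def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of vectors."""
    return [sum(c) / len(points) for c in zip(*points)]

def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [centroid(c) if c else old
                     for c, old in zip(clusters, centroids)]
    return clusters, centroids

# Two visually obvious groups (made-up data).
points = [[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
          [8.0, 8.0], [9.0, 8.0], [8.0, 9.0]]
clusters, centroids = kmeans(points, centroids=[points[0], points[3]])
print(clusters[0])  # the three points near (1, 1)
print(clusters[1])  # the three points near (8, 8)
```

If the classes were known beforehand, the same distance function could instead assign each new item to the nearest labeled group, which is the classification case.
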
