From Algorithms

to Stories
Jonathan Stray
Columbia / ProPublica / Overview

or...
three hard lessons
in building tools for
computational
journalism

Science Journalism

Journalism through Science

Computational Journalism

Stories will emerge from stacks of financial disclosure
forms, court records, legislative hearings, officials'
calendars or meeting notes, and regulators' email messages
that no one today has time or money to mine. With a suite
of reporting tools, a journalist will be able to scan,
transcribe, analyze, and visualize the patterns in these
documents.

- Cohen, Hamilton, Turner, 2011

Links links links!


bit.ly/OverviewHackers

Doc sets in journalism

and then...
nobody used it


three years later...

Finalist, 2014 Pulitzer Prize in Public Service

Winner, 2014 Pulitzer Prize in Public Service

Demo: Obama Form Letters

Algorithm agnostic via visualization plugin API


Ships with clustering, word clouds, advanced search...

Lesson 1
Workflow >> Algorithm

User testing!
Loaded confirmation link, which goes to /docsets. "Hmm. What do I do now?" Eventually
clicked import link. "I need more guidance what to do next." Import pane opened to DC
login. Looked like he was about to type in credentials. Then: "I can't really do any of these
now." Eventually saw "example document sets" and clicked.
Cloned caracas-cables example set. Waited. Understood when document set import
complete. Then hesitated. Didn't know where to click to open. Eventually clicked.
"In general, you could be way more communicative."
Moved mouse to document list immediately. "For some reason, this drew me." Clicked around
doc list. "What am I looking at?"
Moved to tree view. Clicked + without hesitation to open node. Saw document in viewer
change. "It's not clear what I'm looking at in the viewer." Eventually: "Which document is
showing when I click a node? Is it the first?"
A little later, more conversationally: "I don't know how useful the document list is." He said this
twice at different points. "Is this a comma separated list of documents? It just looks like one
block of text." Suggested a horizontal delimiter of some sort.

The hardest feature to implement


The most requested, the most used

Lesson 2
It's humans + machines

By Maria Kiselyova
(Reuters) - Russian mobile phone operator Vimpelcom has become the
latest company to come under scrutiny over its operations in Uzbekistan, an
authoritarian country where rival MTS had its assets confiscated.
U.S.-listed Vimpelcom, Uzbekistan's biggest mobile operator by subscribers,
said on Wednesday that it was being investigated by the U.S. Securities and
Exchange Commission (SEC) and Dutch authorities.

Demo: Uzbekistan's Telco Bribes

VIS: Visual Investigative Scenarios

Lesson 3
Real data is messy

What researchers choose


News articles
Academic literature
NLP test data sets

What journalists deal with

PDF dumps
Printed, scanned emails
Scraping thousands of pages from an antique site
CD full of Excel files
...

Standard Named Entity Recognition not working
Test of OpenCalais against 5 random articles from various sources
versus hand-tagged entities


Overall precision = 77%
Overall recall = 30%

...and this is on the cleanest possible data
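
For readers unfamiliar with the two numbers above, here is a minimal sketch of how precision and recall can be computed when a tool's extracted entities are compared against a hand-tagged list. It is not the script used for the OpenCalais test. The function name `precision_recall`, the case-insensitive exact-match rule, and the sample entity lists are all assumptions made for illustration.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for two collections of entity strings.

    Assumption: two entities match only if they are identical after
    trimming whitespace and lowercasing. Real evaluations often need
    fuzzier matching (aliases, partial names).
    """
    predicted_set = {e.strip().lower() for e in predicted}
    gold_set = {e.strip().lower() for e in gold}

    # Entities the tool found that a human also tagged.
    true_positives = len(predicted_set & gold_set)

    # Precision: what fraction of the tool's output is correct.
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    # Recall: what fraction of the hand-tagged entities the tool found.
    recall = true_positives / len(gold_set) if gold_set else 0.0
    return precision, recall


# Illustrative data only, not the entities from the actual test.
gold = ["Vimpelcom", "MTS", "Uzbekistan", "SEC", "Maria Kiselyova"]
predicted = ["Vimpelcom", "Uzbekistan", "Reuters"]

p, r = precision_recall(predicted, gold)
print(f"precision = {p:.2f}, recall = {r:.2f}")
```

High precision with low recall, as on the slide, means the tool is mostly right about what it tags but misses most of what a human would tag.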

Meta-lesson
You don't know what
the user's problem is

Iterative design loop

A number of previous tools aim to help the user explore
a document collection (such as [6, 9, 10, 12]),
though few of these tools have been evaluated with
users from a specific target domain who bring their own
data, making us suspect that this imprecise term often
masks a lack of understanding of actual user tasks.

Six case studies, four of which were "search" tasks
(journalist needed to locate known or suspected evidence)

There are surprisingly few papers that comment on the
adoption of a visualization tool without the prompting
of designers: in a recent survey of eight hundred
visualization papers containing an evaluation
component, only five commented on adoption [31]

What are the metrics that count?

Evaluation Methods for Topic Models
Wallach et al., 2009

Better metrics?
How many stories got done?
o Are you solving a niche problem?
o Would resources have been better spent on reporting?

How long did it take to do the story?


o Is this faster than using text search?
o Is it even faster than just reading the documents?
o How much would it have cost to pay someone to do it?

What happened after the story was published?

Journalism as a cycle
[Cycle diagram with nodes: Action, Data, Reporting, User, Distribution, Story]

Use it!
overviewproject.org

Code it!
github.com/overview

Thank you!
Knight Foundation, Google Ideas, Open Syllabus Project
