
What is Really Going On?

Large amounts of news are produced every day by news media who interpret and highlight the events
around us. Each news agency's coverage of events outside its own country is likely colored by a variety of
factors, such as the preferences/inclinations of its audience, limited resources (e.g. number of
pages, staff) and the relations between countries. Our project sought to compare the coverage of events by news
media around the world with the coverage of the New York Times World section (NYT) and, in the likely case
that there are differences/biases, to investigate how and why the coverage in the NYT differs from that of the
rest of the world.
Using countries as the unit of analysis on data from January to April 2015, we found that there was a
correlation between the number of articles written in the NYT about a country and the number of articles
written by world media about the country. However, this correlation could not be explained by factors such as
the country's demographics (e.g. population, Gross Domestic Product, labor force) or the country's
relationships with the rest of the world (e.g. foreign direct investment inflow, official development assistance).
While we were unable to determine why the NYT's coverage differs from the rest of the world, we have
developed visualizations [1] that let users see the differences for themselves, perhaps helping them to figure out
what is really going on in both the world media and in the New York Times.
COMPONENTS OF THE PROJECT
This section outlines the main parts of the project in terms of data capture, analysis, and visualization, as
well as my contributions where applicable. Figure 1 shows an overview of how the parts are related to each
other.

[1] Link to visualizations: http://128.40.150.34/~ucfntka. This can be accessed via the UCL network and requires the
data service gdeltnytDataServer to be running on Node.

Figure 1: Overview of project components and relationships

Data Capture
We used three main sources of data for the project:
A. The Global Database of Events, Language, and Tone (GDELT) [2]. The database contains event data
collected from the world's news media. Each event is a record in the database, and each record
contains information such as the number of times it is mentioned in news articles, the location where the
event took place and the parties involved in the event (The GDELT Project, n.d.). We used data from January
to April 2015 for the project, amounting to 18 million records in the database. I contributed by
downloading the CSV files and writing MySQL scripts to import the files into the database. To decrease the
query time needed when the user interacts with the visualization, I also wrote scripts to create new tables
where the records were grouped by country, by date, and by both country and date. These reduced query
time considerably.

[2] GDELT link: http://gdeltproject.org/data.html#rawdatafiles

B. The New York Times (NYT) APIs [3]. We used the NYT Article Search API to retrieve data on articles
published in the New York Times World section from January to April 2015. We also used the NYT
Community API to retrieve comments written about these articles. I contributed by writing the Python
scripts to retrieve the data from these APIs and organize the data by both country and date. I also created
MySQL scripts to import the results into the database.
C. The World Bank API [4]. We used the World Bank indicators API to retrieve information on about 20
indicators such as population, labor force and Gross Domestic Product for over 200 countries around the
world. The most recent information, typically from between 2012 and 2014, was retrieved for each indicator.
These indicators were used later in the regression analysis. I contributed by writing the Python
script to retrieve the data.
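
The pre-aggregation described in item A (grouping records by country, by date, and by both) can be illustrated in Python. This is a minimal sketch rather than the MySQL scripts used in the project: the record layout (country code, date string, article count) and the name aggregate_mentions are assumptions made for illustration, and the sample values are invented.

```python
from collections import defaultdict


def aggregate_mentions(records):
    """Sum article counts by country, by date, and by (country, date).

    `records` is an iterable of (country, date, num_articles) tuples.
    This layout is a hypothetical simplification of a GDELT event record.
    """
    by_country = defaultdict(int)
    by_date = defaultdict(int)
    by_country_date = defaultdict(int)
    for country, date, num_articles in records:
        by_country[country] += num_articles
        by_date[date] += num_articles
        by_country_date[(country, date)] += num_articles
    return dict(by_country), dict(by_date), dict(by_country_date)


# Invented sample rows, for illustration only.
sample = [
    ("NP", "2015-04-25", 120),
    ("NP", "2015-04-26", 300),
    ("CN", "2015-04-25", 80),
]
by_country, by_date, by_country_date = aggregate_mentions(sample)
```

Computing the three summaries once, ahead of time, means an interactive query only has to read a few hundred pre-summed rows instead of scanning millions of event records.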
We also created API services [5] for the project, which we used in our visualizations. I contributed by writing
and documenting all the API services in Node.js. One important way to improve the API services would be to
plan for more flexible API endpoints from the beginning, so that it is easier to retrieve more data or
different subsets of data when needed for analysis/visualization.
Data Analysis
The analysis had two components:
A. Regression analysis. To investigate what influenced how often the New York Times covered a particular
country, we conducted regression analysis using GDELT and World Bank data as independent variables.
As the New York Times data was based on articles in its World section, where the focus was on events
outside the United States, we removed US-related data from the analysis. We tested multiple models using
countries' demographics and their relations to the rest of the world and found that almost all models
explained little or none of the variation in how much a country was mentioned in the New York Times.
Instead, the number of times a country was mentioned in the GDELT database was far better at
explaining variation in how much a country was mentioned in the New York Times. The models tested
and their cross-validated R-squared values are summarized in Table 1. I contributed by writing the
Python scripts to integrate the NYT, GDELT and World Bank data and to run the regression analyses
using the scikit-learn package (Pedregosa et al., 2011). Given more time, other important factors could
be tested, such as those relating to education levels in the country, whether the country is English
speaking, and a country's specific relationship with the United States (e.g. trade relations, diplomatic
relations). A better approach to the analysis may be to use events as the unit of analysis and investigate
how likely the New York Times is to report a particular event in the GDELT database, using factors
such as country of origin, significance/impact of the event and number of news sources covering the event.
We did not take this approach mainly because it would take significant effort to link individual events
in GDELT to NYT articles.

[3] New York Times API link: http://developer.nytimes.com/docs
[4] World Bank Data API link: http://data.worldbank.org/node/9
[5] Documentation for these services can be found at http://128.40.150.34:8886/ when the gdeltnytDataServer is
running on Node

Table 1: Summary of regression results

B. Sentiment analysis. Although not visualized or mentioned on the website, we conducted sentiment
analysis on readers' comments on New York Times articles as well.
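
The idea of a cross-validated R-squared, as used in the regression analysis in item A, can be sketched as follows. This is an illustration under stated assumptions, not the project's scripts: it uses only the Python standard library instead of scikit-learn, fits a single-predictor linear model, and assigns folds by index. The function names and the toy data are hypothetical. The toy series is perfectly linear, so it says nothing about the real results in Table 1.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(ys, predictions):
    """1 - SS_res / SS_tot; assumes the ys are not all identical."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


def cross_validated_r2(xs, ys, k=3):
    """Mean held-out R-squared over k folds (fold i holds out indices i, i+k, ...)."""
    scores = []
    for fold in range(k):
        train = [i for i in range(len(xs)) if i % k != fold]
        test = [i for i in range(len(xs)) if i % k == fold]
        intercept, slope = fit_line([xs[i] for i in train], [ys[i] for i in train])
        predictions = [intercept + slope * xs[i] for i in test]
        scores.append(r_squared([ys[i] for i in test], predictions))
    return sum(scores) / k


# Toy, invented data: hypothetical GDELT mention counts and NYT article counts.
toy_gdelt = [1, 2, 3, 4, 5, 6, 7, 8, 9]
toy_nyt = [2 * x + 1 for x in toy_gdelt]
score = cross_validated_r2(toy_gdelt, toy_nyt, k=3)
```

Scoring each model on countries it was not fitted on is what makes the R-squared values comparable across models with different numbers of predictors.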
Data Visualization
The website for the project was set up with Bootstrap, a front-end framework for developing responsive,
mobile-first projects (Bootstrap, n.d.). It houses the project's visualizations and storyline. I created two maps in
d3.js for the project (with design input from my group mates), which are embedded in iframes on the website:
A. News Coverage Map. This is an interactive choropleth map that allows users to investigate how news
coverage of countries varied over time. Figure 2 depicts the map. Users can use a date slider to change
the period they wish to investigate, and the map updates accordingly. Using the checkboxes below
the slider, users can choose to look at only GDELT coverage, only NYT coverage, or both together.
Figure 2: News Coverage Map for January 2015

For GDELT coverage, the map uses GDELT data to calculate, for each country, the number of articles
written per day over the time period selected on the slider (A). This figure is
compared against the number of articles written per day over the base period, January to April 2015 (B),
using the following formula:
Difference in GDELT news coverage on country = (A - B) / B x 100%

The results are mapped using a diverging color scheme from ColorBrewer. Blue is used for countries
with higher news coverage than usual over the time period chosen, and red is used for countries with
lower news coverage than usual. The darker the color, the larger the deviation from usual. For example,
Nepal was covered much more than usual by news media after the earthquake in the last week of April,
and shows up as dark blue when those dates are selected on the slider. Users can also hover over a country to
see more information on the GDELT coverage for the particular country.
The New York Times coverage is represented by yellow spotlights on countries that have been
mentioned by the NYT World section at least once. This method was chosen to allow users to look at
both the NYT coverage and GDELT coverage at the same time. The New York Times has a limited
amount of space for stories each day (about 30 articles per day on average), and this visualization shows
where there are discrepancies between NYT coverage and the rest of the world's media coverage for
particular time periods.
One future improvement could be to adjust the size (or some other characteristic) of the spotlight
for the NYT coverage based on the number of times the country is mentioned, using a formula similar to
the one for GDELT coverage. It would also be useful to add information on NYT coverage when users hover
over or click on the spotlight. An especially useful piece of context would be to
show the top news headlines about the country from GDELT and the NYT in a sidebar when users click on the
country, so that users can see whether the content reported (if any) was similar as well.
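
The GDELT coverage comparison used by the News Coverage Map, (A - B) / B x 100%, can be worked through in a few lines of Python. This is a sketch under assumptions: the function name, its parameters and the article counts are invented for illustration, and returning None for an empty base period is a design choice of this sketch, not something the project specifies. The base period of January to April 2015 is 120 days long.

```python
def coverage_change_percent(articles_in_period, days_in_period,
                            articles_in_base, days_in_base):
    """Percentage difference between per-day coverage in the selected
    period (A) and per-day coverage in the base period (B)."""
    a = articles_in_period / days_in_period
    b = articles_in_base / days_in_base
    if b == 0:
        # No base coverage: the ratio is undefined (assumed handling).
        return None
    return (a - b) / b * 100


# Invented counts: 700 articles in a 7-day window (100 per day) against
# 6000 articles over the 120-day base period (50 per day).
more_than_usual = coverage_change_percent(700, 7, 6000, 120)   # 100.0
# 175 articles in 7 days (25 per day) against the same base.
less_than_usual = coverage_change_percent(175, 7, 6000, 120)   # -50.0
```

Positive values correspond to the blue end of the diverging color scheme (more coverage than usual) and negative values to the red end.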
B. Correlation Map. This is a choropleth map that compares how often countries were covered by
the world media (represented by GDELT) with how often they were covered by the New York Times. Figure 3
depicts the map. The map correlates the number of times a country was mentioned by the global media (from
GDELT data) with the number of times the country was mentioned in the New York Times. Both numbers were
standardized by dividing them by the maximum number of times any country was mentioned in the
dataset from January to April 2015. A darker green means a higher correlation between world media
coverage and NYT coverage. Users can hover over a country to see the correlation figure.

Figure 3: Correlation map

There are many ways of improving the visualization further in the future. I could have tested alternative
ways to standardize the datasets to see if the data could be visualized better. Visualizing the overall
correlations also tells users nothing about the day-to-day variations within countries, which may be more
pertinent when thinking about media bias. The visualization could be improved by showing a chart
similar to Figure 4 when users click on individual countries, so they get a better understanding of the
day-to-day variations, and where coverage actually differs between the NYT and the world media in general.
Here, the number of mentions per day from each dataset is normalized by dividing the number of
mentions by the maximum number of times the particular country was mentioned over the entire time
period. This creates a ratio between 0 and 1, with the red line representing the NYT and the blue line
representing media around the world (GDELT). While we did visualize this for one country on the
website, it would be more interesting for users to investigate individual countries themselves.
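
The per-country normalization and correlation described above can be sketched in Python. Two things here are assumptions: the text does not say which correlation coefficient was used, so the Pearson coefficient is a guess on our part, and the daily counts below are invented. The function names are hypothetical.

```python
from math import sqrt


def normalize_by_max(counts):
    """Divide each daily count by the series maximum, giving values in [0, 1]."""
    peak = max(counts)
    if peak == 0:
        return [0.0 for _ in counts]
    return [c / peak for c in counts]


def pearson(xs, ys):
    """Pearson correlation coefficient; assumes neither series is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented daily mention counts for one country.
nyt_daily = [0, 2, 4, 1]
gdelt_daily = [10, 30, 50, 20]

nyt_norm = normalize_by_max(nyt_daily)      # red line in a Figure 4 style chart
gdelt_norm = normalize_by_max(gdelt_daily)  # blue line
correlation = pearson(nyt_norm, gdelt_norm)
```

If the Pearson coefficient is the one used, dividing each series by its own maximum leaves the correlation unchanged, since the coefficient is invariant to positive rescaling. In that case the normalization matters for putting both lines on a common 0 to 1 axis, not for the correlation figure itself.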

Figure 4: Correlation chart for China (correlation of 0.41)

FUTURE WORK
Other than the improvements mentioned, future work could expand the scope of the project. We focused on
the New York Times for this project as information was easily accessible through its API. One future direction
could be to look at how different news outlets cover world events. This could also be compared against trends in
social media to understand what people pay attention to. It would also be useful to find objective sources of
events in the world and compare them against the GDELT database. News is our lens to the rest of the world,
and we would be better off if we understood how and why our lenses are biased.

References (includes data sources and tools that were mentioned in the text)
Bootstrap. (n.d.). Bootstrap. Available from: http://getbootstrap.com/. Accessed 20th May 2015.
ColorBrewer2. (n.d.). ColorBrewer2. Available from: http://colorbrewer2.org/. Accessed 20th May 2015.
D3. (2013). Overview. Available from: http://d3js.org/. Accessed 20th May 2015.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P.,
Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.
(2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, pp. 2825-2830.
The GDELT Project. (n.d.). Intro. Available from: http://gdeltproject.org/. Accessed 20th May 2015.
The New York Times. (2014). APIs. Available from: http://developer.nytimes.com/docs. Accessed 20th May
2015.
The World Bank. (n.d.). Data. Available from: http://data.worldbank.org/node/9. Accessed 20th May 2015.
