Beruflich Dokumente
Kultur Dokumente
2 Seth Stephens-Davidowitz
Hal Varian
Google, Inc.
3 September 2014
4 Revised: March 7, 2015
5 Abstract
6 This document describes how to access and use Google data for social sci-
7 ence research. This document was created using the literate programming
8 system knitr so that all code in the document can be run as it stands.
9 Google provides three data sources that can be useful for social science: Google
10 Trends, Google Correlate, and Google Consumer Surveys. Google Trends pro-
11 vides an index of search activity on specific terms and categories of terms across
12 time and geography. Google Correlate finds queries that are correlated with
13 other queries or with user-supplied data across time or US states. Google Con-
14 sumer Surveys offers a simple, convenient way to conduct quick and inexpensive
15 surveys of internet users.
16 1 Google Correlate
17 Economic data is often reported with a lag of months or quarters while Google
18 query data is available in near real time. This means that queries that are
19 contemporaneously correlated with an economic time series may be helpful for
20 economic nowcasting.
21 We illustrate here how Google Correlate can help build a model for housing
22 activity. The first step is to download data for New One Family Houses Sold
23 from FRED1 We dont use data prior to January 2004 since thats when the
24 Google series starts. Delete the column headers and extraneous material from
25 the CSV file after downloading.
26 Now go to Google Correlate and click on Enter your own data followed by
27 Monthly Time Series. Select your CSV file, upload it, give the series a name,
28 and click Search correlations. You should something similar to Figure 1.
29 Note that the term most correlated with housing sales is [tahitian noni juice],
30 which appears to be a spurious correlation. The next few terms are similarly
31 spurious. However, after that, you get some terms that are definitely real-estate
1 http://research.stlouisfed.org/fred2/series/HSN1FNSA.
1
Figure 1: Screenshot from Google Correlate.
32 related. (Note that that the difference in the correlation coefficient for [tahitian
33 noni juice] and [80/20 mortgage] is tiny.)
34 You can download the hundred most correlated terms by clicking on the
35 Export as CSV link. The resulting CSV file contains the original series and
36 one hundred correlates. Each series is standardized by subtracting off its mean
37 and dividing by its standard deviation.
38 The question now is how to use these correlates to build a predictive model.
39 One option is to simply use your judgment in choosing possible predictors. As
40 indicated above, there will generally be spurious correlates in the data, so it
41 makes sense to remove these prior to further analysis. The first, and most
42 obvious, correlates to remove are queries that are unlikely to persist, such as
43 [tahitian noni juice], since that query will likely not help for future nowcasting.
44 For economic series, we generally remove non-economic queries from the CSV
45 file. When we do that, we end up with about 70 potential predictors for the 105
46 monthly observations.
47 At this point, it makes sense to use a variable selection mechanism such as
48 stepwise regression or LASSO. We will use a system developed by Steve Scott
49 at Google called Bayesian Structural Time Series, that allows you to model
50 both the time series and regression components of the predictive model.2
2 urlhttp://cran.r-project.org/web/packages/bsts/
2
trend seasonal.12.1 regression
appreciation.rate
2
1.5
1.5
1.5
irs.1031
1.0
1.0
1.0
1
century.21.realtors
distribution
distribution
distribution
distribution
0.5
0.5
0.5
real.estate.purchase
0
0.0
0.0
0.0
real.estate.appraisal
0.5
0.5
0.5
1
X80.20.mortgage
1.0
1.0
1.0
2004 2008 2012 2004 2008 2012 2004 2008 2012
2004 2006 2008 2010 2012 0.0 0.2 0.4 0.6 0.8 1.0
time time time
56 We now wait patiently while the model is estimated and then examine the
57 results, shown in Figure 2. The first panel shows the fit, the second panel
58 shows the most probable predictors, and third panel show the decomposition of
59 the time series into three components: a trend, a seasonal component, and a
3
actual
base
trends
1.5
1.0
0.5
dts
0.0
0.5
1.0 2006 2007 2008 2009 2010 2011 2012
Index
60 regression component. The last panel shows that the regression predictors are
61 important.
By default, the model computes the in-sample predictions. In order to eval-
uate the forecasting accuracy of the model, it may be helpful to examine out-
of-sample prediction. This can be done with BSTS but it is time consuming, so
we follow a hybrid strategy. We consider two models, a baseline autoregressive
model with a one-month and twelve-month lag:
yt = b1 yt1 + b12 yt12 + et ,
and the same model supplemented with some additional predictors from Google
Correlate:
yt = b1 yt1 + b12 yt12 + at xt + et .
62 We estimate each model through period t, forecast period t+1, and then compare
63 the mean absolute percent error (MAPE).
# load package for out-of-sample-forecasts
source("oosf.R")
# choose top predictors
x1 <- zoo(x[,cbind("appreciation.rate","irs.1031","century.21.realtors",
"real.estate.purchase")],as.Date(dat[,1]))
reg1 <- OutOfSampleForecast12(y,x1,k=24)
# mae.delta is the ratio of the trends MAE to the base MAE
MaeReport(reg1)
4
68 3 Cross section
69 We can also use Correlate to build models predicting cross-section data from
70 US states. (Other countries are not yet available.)
## [1] "short.sale.process"
## [2] "short.sale"
## [3] "underwater.mortgage"
## [4] "seterus"
## [5] "harp.3.0"
## [6] "short.sale.package"
## [7] "mortgage.forgiveness.debt.relief"
## [8] "mortgage.forgiveness.debt.relief.act"
## [9] "upside.down.mortgage"
## [10] "mortgage.short.sale"
78 Figure 3.1 illustrates the correlation between the price decline and the search
79 [short sale process].
80 If we take a linear combination of these queries (e.g., a regression) we can
81 normally improved prediction performance. We use the BoomSpikeSlab package
82 from CRAN to find good predictors.
library(BoomSpikeSlab)
reg0 <- lm.spike(housing.decline ~ .,niter=4000,data=d0,seed=123,ping=0)
plot(reg0,inc=.10)
5
Figure 4: Price decline and [short sale process].
underwater.mortgage
toontown.invasion.tracker
toontown.invasion
harp.3.0
short.sale.process
pool.heater
Inclusion Probability
83
d1 <- d0[,-grep("toontown",names(d0))]
d2 <- d1[,-grep("penguin",names(d1))]
reg2 <- lm.spike(housing.decline ~ .,niter=4000,data=d2,seed=123,ping=0)
plot(reg2,inc=.10)
6
Figure 5: Searches on toontown
harp.3.0
pool.heater
short.sale.process
solar.pool.heaters
Inclusion Probability
7
Nevada
Arizona Florida
California
pred
0
1
1 0 1 2
actual
94 drop the [solar pool heater] predictor as it is unlikely that the next housing
95 crisis would start in the sand states. On the other hand, if this query showed
96 up as a predictor early in the crisis, it may have helped attention to focus more
97 attention on those geographies where [solar pool heater] was common.
98 Finally, Figure 7 plots actual versus predicted, to give some idea of the fit.
99
8
107 that short life expectancy and the queries associated with short life expectancy
108 are concentrated in the Deep South and Appalachia.
109 We download the series of correlates as before and then build a predictive
110 model. Since this is cross sectional data, we use the package BoomSpikeSlab.
# library(BoomSpikeSlab)
dat <- read.csv("Data/correlate-negative_life_expectancy.csv")
d <- dat[,-1]
reg <- lm.spike(negative.life.expectancy ~ .,niter=4000,data=d)
plot(reg,inc=.10)
111 The predictors are interesting. The Obama says predictor seemed strange
112 so we tried it in Google Suggest. On April 17, 2014, the suggested completions
113 of Obama says . . . were 1) he is god, 2) there are 57 states, 4) constitution
114 is dead, 4) serve satan, 5) uh too much. Most of these searches seem to express
115 negative sentiment Obama.
116 Finally Figure 9 shows the actual morbidity compared to fitted. The big
117 negative outlier is the District of Columbia. In fact, we find that District of
118 Columbia is often an outlier. This could be because many searches likely come
119 from commuters.
9
3/17/2014 negative life expectancy - Google Correlate
hal@google.com|ManagemyCorrelatedata|Signout
CompareUSstates Correlatedwithnegativelifeexpectancy
Compareweeklytimeseries 0.9092 bloodpressuremedicine
0.8891 obamasaid
0.8882 sodomand
0.8882 theantichrist
0.8865 globelife
0.8858 thejudge
0.8834 hairpics
0.8833 medicinesideeffects
0.8829 momma
0.8828 jamesdavid
0.8823 flexeril
https://www.google.com/trends/correlate/search?e=id:vs1wQ1crZMx&t=all#default,90 1/4
10
obama.says
full.figured.women
blood.pressure.medicine
X40.caliber
obama.pics
gunshot.wounds
Inclusion Probability
2.0
1.5
1.0
neg.life
0.5
0.0
0.5
1.0
1 0 1 2
d$negative.life.expectancy
11
136 You can also look at an index of all searches in a category. For example,
137 choose the category Sports and the geography Worldwide and leave the search
138 term blank. This shows us an index for sports-related queries. The four-year
139 cycle of the Olympics is apparent.
140 Another way to disambiguate queries is to use the entities selection. Google
141 attempts to identify entities by looking at searches surrounding the search in
142 question. For example, if someone searches apple in conjunction with [turkey],
143 [sweet potato], [apple] they are probably looking for search results refer-
144 ring to the fruit. Entities are useful in that they bind together different ways to
145 describe somethingabbreviations, spelling, synonyms and so on.
150 - means to exclude a word. If you type jobs - steve, results will be
151 searches that include jobs but do not include steve
152 A space means and. If you type Lakers Celtics, the results will be
153 searches that include both the word Lakers and the word Celtics. The
154 order does not matter.
155
156 Quotes force a phrase match. If you type Lakers Celtics, results
157 will be searches that include the exact phrase Lakers Celtics.
12
175 relatively more searches in Rhode Island than California on that term. This is
176 the more meaningful metric for social science, since otherwise bigger places with
177 more searches would always score higher.
178 Here are four more important points. First, Google Trends has an unreported
179 privacy threshold. If total searches are below that threshold, a 0 will be reported.
180 This means that not enough were made to advance past the threshold. The
181 privacy threshold is based on absolute numbers. Thus, smaller places will more
182 frequently show zeros, as will earlier time periods. If you run into zeros, it may
183 be helpful to use a coarser time period or geography.
184 Second, Google Trends data comes from a sample of the total Google search
185 corpus. This means samples might differ slightly if you get a different sample. If
186 very precise data is necessary, a researcher can average different samples. That
187 said, the data is large enough that each sample should give similar results. In
188 cases where there appear to be outliers, researchers can just issue their query
189 again on another day.
190 Third, Google Trends data is averaged to the nearest integer. If this is a
191 concern, a researcher can pull multiple samples and average them to get a more
192 precise estimate. If you compare two queries, one of which is very popular and
193 the other much less so, the normalization can push the unpopular query to
194 zero. The way to deal with this is to run a separate request for each query.
195 The normalized magnitude of the queries will no longer be comparable, but the
196 growth rate comparison will still be meaningful.
197 Fourth, and related to the previous two points, data is cached each day. Even
198 though it comes from a sample, the same request made on the same day will
199 report data from the same sample. A researcher who wants to average multiple
200 samples must wait a day to get a new sample.
201 It is worth emphasizing that the sampling generally gives reasonably precise
202 estimates. Generally we do not expect that expect that researchers will need
203 more than a single sample.
13
218 popular on Sundays and is an order of magnitude more common than [martini
219 recipe] on January 1.
220 You can also compare multiple geos over the same time period. Figure 10
221 shows search volume for [hangover cure] during the same time period in the
222 United States. But it also adds another country, the United Kingdom. On
223 average, the United Kingdom searches for [hangover cure] more frequently
224 during this time period. But apparently the United States has bigger New
225 Years parties, as Americans top the British for [hangover cure] searches on
226 January 1.
14
Figure 10: Hangovers, United States versus United Kingdom
15
Figure 11: Search for Jewish versus Mormon
16
296 measures. Stephens-Davidowitz [2012] shows that the Google measure of racism
297 correlates with General Social Survey measures, such as opposition to interra-
298 cial marriage. Baker and Fradkin [2013] shows that Google job search measures
299 correlate with time-use survey measures. While existing measures have weak-
300 nesses motivating the use of Google Trends, zero or negative correlation between
301 Google searches and these measures may make us question the validity of the
302 Google measures.
303 There are many papers that use Google Trends for nowcasting economic
304 variables. Choi and Varian [2009] look at a number of examples, including
305 automobile sales, initial claims for unemployment benefits, destination plan-
306 ning, and consumer confidence. Scott and Varian [2012, 2014b] describe the
307 Bayesian Structure Time Series approach to variable selection mentioned earlier
308 and present models for initial claims, monthly retail sales, consumer sentiment,
309 and gun sales.
310 Researchers at several central banks have built interesting models using
311 Trends data as leading indicators. Noteworthy examples include Arola and
312 Galan [2012], McLaren and Shanbhoge [2011], Hellerstein and Middeldorp [2012],
313 Suhoy [2009], Carriere-Swallow and Labbe [2011], Cesare et al. [2014], and Meja
314 et al. [2013].
17
Figure 12: Example of survey shown to user.
18
Figure 13: Output screen of Google Consumer Surveys
362 I prefer to buy products that are assembled in America. [Agree or disagree
363 If you were asked to use one of these commonly used names for social
364 classes, which would you say you belong in?
365 Some of these cases were an attempt to replicate other published surveys.
366 For example, the last question about social class, was in a survey conducted by
367 Morin and Motel [2012]. Figure 5 shows a screenshot of GCS output for this
368 question.
369 Figure 14 shows the distribution of responses for the Pew survey and the
370 Google survey for this question. As can be seen the results are quite close.
371 We have found that the GCS surveys are generally similar to surveys pub-
372 lished by reputable organizations. Keeter and Christian [2012] is a report that
373 critically examines GCS results and is overall positive. Of course, the GCS sur-
374 veys have limitations: they have to be very short, you can only ask one question,
375 the sample of users is not necessarily representative, and so on. Nevertheless,
376 they can be quite useful for getting rapid results.
377 Recently has released a mobile phone survey tool called the Google Opinions
378 Rewards that targets mobile phone users who opt in to the program and allows
379 for a more flexible survey design.
19
pew
gcs
40
30
20
10
0
Figure 14: Comparing Pew and GCS answers to social class question.
20
Figure 15: Panel (a) shows searches for chevrolet, while Panel (b) shows
searches for toyota
.
406 1. Run a GCS asking Do you support Obama in the upcoming election?
407 2. Associate each (yes,no) response in the survey data to the city associated
408 with the respondent.
409 3. Build a predictive model for the responses using the Trends category data
410 described above.
21
Figure 16: Predictors of Obama supporters
22
Figure 17: Predictors for assembled in America question
428 6 Summary
429 We have described a few of the applications of Google Correlate, Google Trends,
430 and Google Consumer Surveys. In our view, these tools for data can be used
431 to generate several insights for social science and there a many other examples
432 waiting to be discovered.
433 References
434 Concha Arola and Enrique Galan. Tracking the future on the web: Construction
435 of leading indicators using internet searches. Technical report, Bank of Spain,
436 2012. URL http://www.bde.es/webbde/SES/Secciones/Publicaciones/
437 PublicacionesSeriadas/DocumentosOcasionales/12/Fich/do1203e.pdf.
438 Scott R. Baker and Andrey Fradkin. The Impact of Unemployment Insurance
439 on Job Search: Evidence from Google Search Data. SSRN Electronic Journal,
440 2013.
441 Yan Carriere-Swallow and Felipe Labbe. Nowcasting with Google Trends in
442 an emerging market. Journal of Forecasting, 2011. doi: 10.1002/for.1252.
443 URL http://ideas.repec.org/p/chb/bcchwp/588.html. Working Papers
444 Central Bank of Chile 588.
23
445 Antonio Di Cesare, Giuseppe Grande, Michele Manna, and Marco Taboga.
446 Recent estimates of sovereign risk premia for euro-area countries. Tech-
447 nical report, Banca dItalia, 2014. URL http://www.bancaditalia.it/
448 pubblicazioni/econo/quest_ecofin_2/qef128/QEF_128.pdf.
449 Hyunyoung Choi and Hal Varian. Predicting the present with Google Trends.
450 Technical report, Google, 2009. URL http://google.com/googleblogs/
451 pdfs/google_predicting_the_present.pdf.
452 Rebecca Hellerstein and Menno Middeldorp. Forecasting with internet search
453 data. Liberty Street Economics Blog of the Federal Reserve Bank of New
454 York, Jan 4 2012. URL http://libertystreeteconomics.newyorkfed.
455 org/2012/01/forecasting-with-internet-search-data.html.
456 Scott Keeter and Leah Christian. A comparison of results from
457 surveys by the pew research center and google consumer surveys.
458 Technical report, Pew Research Center for People and the Press,
459 2012. URL http://www.people-press.org/files/legacy-pdf/11-7-12%
460 20Google%20Methodology%20paper.pdf.
461 Andreas Madestam, Daniel Shoag, Stan Veuger, and David Yanagizawa-Drott.
462 Do Political Protests Matter? Evidence from the Tea Party Movement. The
463 Quarterly Journal of Economics, 128(4):16331685, August 2013. ISSN 0033-
464 5533. doi: 10.1093/qje/qjt021. URL http://qje.oxfordjournals.org/
465 content/128/4/1633.full.
466 Alex Mathews and Catherine Tucker. Government surveillance and internet
467 search behavior. Technical report, MIT, 2014. URL http://papers.ssrn.
468 com/sol3/papers.cfm?abstract_id=2412564.
469 Nick McLaren and Rachana Shanbhoge. Using internet search data as eco-
470 nomic indicators. Bank of England Quarterly Bulletin, June 2011. URL
471 http://www.bankofengland.co.uk/publications/quarterlybulletin/
472 qb110206.pdf.
473 Luis Fernando Meja, Daniel Monsalve, Santiago Pulido Yesid Parra, and ngela
474 Mara Reyes. Indicadores ISAAC: Siguiendo la actividad sectorial a partir
475 de google trends. Technical report, Ministerio de Hacienda y Credito Pulico
476 Colombia, 2013. URL http://www.minhacienda.gov.co/portal/page/
477 portal/HomeMinhacienda/politicafiscal/reportesmacroeconomicos/
478 NotasFiscales/22%20Siguiendo%20la%20actividad%20sectorial%20a%
479 20partir%20de%20Google%20Trends.pdf.
480 Rich Morin and Seth Motel. A third of americans now say they are
481 in the lower classes. Technical report, Pew Research Social & Demo-
482 graphic Trends, 2012. URL http://www.pewsocialtrends.org/2012/09/
483 10/a-third-of-americans-now-say-they-are-in-the-lower-classes/.
24
484 Steve Scott and Hal Varian. Bayesian variable selection for nowcasting economic
485 time series. Technical report, Google, 2012. URL http://www.ischool.
486 berkeley.edu/~hal/Papers/2012/fat.pdf. Presented at Joint Statistical
487 Meetings, San Diego.
488 Steve Scott and Hal Varian. Predicting the present with Bayesian structural
489 time series. Int. J. Mathematical Modeling and Numerical Optimisation, 5
490 (1), 2014a. URL http://www.ischool.berkeley.edu/~hal/Papers/2013/
491 pred-present-with-bsts.pdf. NBER Working Paper 19567.
492 Steven L. Scott and Hal R. Varian. Predicting the present with Bayesian struc-
493 tural time series. International Journal of Mathematical Modeling and Numer-
494 ical Optimization, 5(1/2):423, 2014b. URL http://www.sims.berkeley.
495 edu/~hal/Papers/2013/pred-present-with-bsts.pdf.
496 Seth Stephens-Davidowitz. The Cost of Racial Animus on a Black Presidential
497 Candidate: Evidence Using Google Search Data. Working Paper, 2012.
498 Seth Stephens-Davidowitz. Who Will Vote? Ask Google. Technical report,
499 Google, 2013.
500 Tanya Suhoy. Query indices and a 2008 downturn: Israeli data. Techni-
501 cal report, Bank of Israel, 2009. URL http://www.bankisrael.gov.il/
502 deptdata/mehkar/papers/dp0906e.pdf.
503 C. Douglas Swearingen and Joseph T. Ripberger. Google Insights and U.S.
504 Senate Elections: Does Search Traffic Provide a Valid Measure of Pub-
505 lic Attention to Political Candidates? Social Science Quarterly, pages 0
506 0, January 2014. ISSN 00384941. doi: 10.1111/ssqu.12075. URL http:
507 //doi.wiley.com/10.1111/ssqu.12075.
508 Hal R. Varian. Big data: New tricks for ecoometrics. Journal of Economic
509 Perspectives, 28(2):328, 2014.
25