
An Investigation into Using Google Trends as an Administrative Data Source in ONS

Daniel Ayoubkhani
Time Series Analysis Branch, Survey Methodology and Statistical Computing Division, Office for National Statistics, UK

Overview
1. Introduction to Google Trends
2. Using Google Trends Data
An Investigation Conducted by ONS:
3. Data
4. Methods
5. Results
6. Conclusions and Considerations

1. Introduction to Google Trends


Google provide weekly data on changes in search query share (rather than volume)
need to convert to levels and aggregate to months/quarters
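A minimal sketch of that conversion, under two assumptions that are not stated on the slide: each weekly figure is treated as a percentage change on the previous week, and each week is assigned to the calendar month in which it starts. The function names and the values are hypothetical and purely illustrative.

```python
from collections import defaultdict
from datetime import date, timedelta


def changes_to_levels(pct_changes, base=100.0):
    """Chain week-on-week percentage changes into an index starting from `base`."""
    levels = []
    level = base
    for change in pct_changes:
        level *= 1.0 + change / 100.0
        levels.append(level)
    return levels


def monthly_means(week_starts, levels):
    """Average the weekly levels within each (year, month) of the week start date."""
    buckets = defaultdict(list)
    for day, level in zip(week_starts, levels):
        buckets[(day.year, day.month)].append(level)
    return {month: sum(vals) / len(vals) for month, vals in sorted(buckets.items())}


# Hypothetical weekly changes (in %) for eight weeks starting 4 January 2004.
week_starts = [date(2004, 1, 4) + timedelta(weeks=i) for i in range(8)]
pct_changes = [0.0, 10.0, -5.0, 2.0, 0.0, 1.0, -1.0, 3.0]

levels = changes_to_levels(pct_changes)
monthly = monthly_means(week_starts, levels)
for (year, month), value in monthly.items():
    print(f"{year}-{month:02d}: {value:.2f}")
```

Weeks that straddle two months would need a more careful apportionment in practice; quarterly aggregation follows the same pattern with a (year, quarter) key.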

Data are available:


back to the start of January 2004
for individual search queries, 25 top level categories and hundreds of lower level categories
for free, to anyone with a Gmail account, from:
www.google.com/trends

1. Introduction to Google Trends


Example: Google searches for "statistics"

Source: Google Trends

1. Introduction to Google Trends


Example: search query to top level classification

statistics > Poverty & Hunger > Social Issues & Advocacy > People & Society
statistics > Demographics > Social Sciences
statistics > Reference

2. Using Google Trends Data


Choi, H and Varian, H (2009) Predicting the Present with Google Trends:
Paper pioneered use of Google Trends data as a nowcasting tool for economic variables
Fitted log-linear models to US retail, automotive and home sales
Predictive performance of models increased when Google Trends terms were included
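To illustrate the idea of adding a search-interest term to a log-linear model, here is a minimal sketch on synthetic data. The specification (log sales regressed on its own lags 1 and 12, with and without a search indicator) is an assumption made for illustration, as are all variable names and coefficient values; it is not a reproduction of the paper's models.

```python
import numpy as np


def fit_ols(X, y):
    """Return least-squares coefficients and the residual sum of squares."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return beta, float(resid @ resid)


rng = np.random.default_rng(0)
n = 96                                   # eight years of synthetic monthly data
x = rng.normal(size=n)                   # hypothetical search-interest indicator
log_y = np.zeros(n)                      # hypothetical log sales series
for t in range(n):
    prev = log_y[t - 1] if t >= 1 else 0.0
    seas = log_y[t - 12] if t >= 12 else 0.0
    log_y[t] = 0.4 * prev + 0.3 * seas + 0.5 * x[t] + rng.normal(scale=0.1)

# Build the design matrices for t = 12, ..., n - 1 so that both lags exist.
t_idx = np.arange(12, n)
X_base = np.column_stack([np.ones(len(t_idx)), log_y[t_idx - 1], log_y[t_idx - 12]])
X_gt = np.column_stack([X_base, x[t_idx]])
target = log_y[t_idx]

_, rss_base = fit_ols(X_base, target)
beta_gt, rss_gt = fit_ols(X_gt, target)
print(f"RSS without search term: {rss_base:.3f}")
print(f"RSS with search term:    {rss_gt:.3f}")
print(f"Estimated search coefficient: {beta_gt[3]:.3f}")
```

In-sample fit always improves when a regressor is added, so a comparison of predictive performance would need out-of-sample forecasts or a penalised criterion.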

Many studies using Google Trends data for prediction of economic variables published since then

2. Using Google Trends Data


Potential uses of Google Trends (GT) data identified by ONS:
1. Quality assurance of outputs
2. Nowcasting of outputs
3. Replacement of existing data sources

Focus of this investigation: quality assurance of the UK Retail Sales Index (RSI)

2. Using Google Trends Data


Aims of this investigation:
Fit benchmark models that are representative of current ONS practice
Fit alternative models that include appropriate GT terms as predictors
Compare models using empirical measures
Draw conclusions to inform ONS strategy

3. Data - Retail Sales Index


All Retail Sales
Non-Specialised Food Stores
Non-Specialised Non-Food Stores
Textiles, Clothing and Footwear
Furniture and Lighting
Home Appliances
Hardware, Paints and Glass
Audio and Video Equipment and Recordings
Books, Newspapers and Stationery
Computers and Telecommunications
Non-Store Retailing

3. Data - Retail Sales Index

Source: ONS

3. Data - Google Trends


All extracted GT time series:
represent weekly UK search activity
start in January 2004
end in July 2011

Each RSI series matched with:


at least one GT search category
top five search queries within each category

3. Data - Google Trends

RSI series: Furniture and Lighting

Google Trends category | Top 5 Google Trends queries
Lighting | lighting, light, lights, lamp, lamps
Home and Garden | furniture, ikea, garden, b&q, homebase
Homemaking and Interior Decor; Home Furnishings | blinds, curtains, curtains curtains curtains, bedroom furniture, ikea, beds, lighting, table table

4. Methods - Benchmark Models


RegARIMA (linear regression + ARIMA noise)
Regression terms capture deterministic effects:
  inconsistent survey periods due to 4-4-5 design
  moving holidays (e.g. Easter)
  additive outliers and level shifts

ARMA terms capture autocorrelation in the regression residuals
Non-stationarity handled via log transformation and differencing
Models automatically identified and estimated using X-12-ARIMA
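For reference, a regARIMA model of this kind is commonly written in the following general form. The notation is an assumption of this sketch rather than something shown on the slides: $y_t$ is the log-transformed series, $x_{it}$ are the regressors (survey-period, moving-holiday and outlier terms), $B$ is the backshift operator, $s$ is the seasonal period, and $\varepsilon_t$ is white noise.

```latex
\phi(B)\,\Phi(B^{s})\,(1-B)^{d}\,(1-B^{s})^{D}
\Bigl( y_t - \sum_{i} \beta_i x_{it} \Bigr)
= \theta(B)\,\Theta(B^{s})\,\varepsilon_t
```

Here $\phi$, $\Phi$ are the non-seasonal and seasonal autoregressive polynomials, $\theta$, $\Theta$ the corresponding moving-average polynomials, and $d$, $D$ the orders of regular and seasonal differencing.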

4. Methods - Alternative Models


Benchmark models extended with (log transformed, differenced) GT variables:
Forced static relationships estimated for all series
Lagged relationships identified from cross-correlation plots of pre-whitened series:
1. fit ARIMA models to all RSI and GT series
2. correlate each RSI residual series with each of its corresponding GT residual series (i.e. remove trend and seasonality and correlate the shocks)

Relationships identified at more than one lag modelled both individually and together
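The pre-whitening and cross-correlation steps can be sketched as follows. This is a simplified illustration on synthetic data: a least-squares AR(1) fit stands in for the full ARIMA models, the series are simulated as already stationary, and all names (`ar_residuals`, `cross_correlation`, `rsi`, `gt`) are hypothetical.

```python
import numpy as np


def ar_residuals(series, order=1):
    """Fit an AR(order) model by least squares and return its residuals."""
    y = series[order:]
    lags = np.column_stack(
        [series[order - k: len(series) - k] for k in range(1, order + 1)]
    )
    X = np.column_stack([np.ones(len(y)), lags])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta


def cross_correlation(resp_resid, pred_resid, max_lag=3):
    """Correlate the response shock at time t with the predictor shock at t - lag."""
    result = {}
    for lag in range(max_lag + 1):
        a = resp_resid[lag:]
        b = pred_resid[: len(pred_resid) - lag]
        result[lag] = float(np.corrcoef(a, b)[0, 1])
    return result


# Synthetic example: the response reacts to the predictor's shock one period later.
rng = np.random.default_rng(1)
n = 300
u = rng.normal(size=n)                   # shocks to the search series
v = rng.normal(scale=0.5, size=n)        # shocks specific to the sales series
gt = np.zeros(n)
rsi = np.zeros(n)
for t in range(1, n):
    gt[t] = 0.6 * gt[t - 1] + u[t]
    rsi[t] = 0.5 * rsi[t - 1] + 0.8 * u[t - 1] + v[t]

ccf = cross_correlation(ar_residuals(rsi), ar_residuals(gt))
for lag, value in ccf.items():
    print(f"lag {lag}: {value:+.2f}")
```

In this simulated case the correlation should stand out at lag 1, which is the kind of feature one would look for in a cross-correlation plot before adding a lagged GT term to the model.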

4. Methods - Alternative Models


Example: Furniture and Lighting vs "garden"

5. Results - Initial Analysis


Component of the RSI (no. alternative models fitted) | % alt. models with AICC lower than benchmark | % GT terms significant at 5% level
All Retail Sales (8) | 0.0 | 37.5
Non-Specialised Food Stores (6) | 0.0 | 0.0
Non-Specialised Non-Food Stores (6) | 0.0 | 83.3
Textiles, Clothing & Footwear (23) | 30.4 | 36.0
Furniture & Lighting (31) | 90.3 | 78.8
Home Appliances (7) | 14.3 | 0.0
Hardware, Paints & Glass (6) | 50.0 | 100.0
Audio & Video Equipment (44) | 43.2 | 51.0
Books, Newspapers & Stationery (6) | 16.7 | 100.0
Computers & Telecommunications (31) | 9.7 | 15.2
Non-Store Retailing (7) | 42.9 | 42.9
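For readers unfamiliar with the criterion, AICC is the Akaike information criterion with a small-sample correction, and lower values indicate a better trade-off between fit and complexity. The sketch below uses the standard formula; the log-likelihoods, parameter counts and sample size are hypothetical and are not taken from the ONS models.

```python
def aicc(log_likelihood, n_params, n_obs):
    """Small-sample corrected AIC: AIC + 2k(k + 1) / (n - k - 1)."""
    if n_obs - n_params - 1 <= 0:
        raise ValueError("n_obs must exceed n_params + 1")
    aic = -2.0 * log_likelihood + 2.0 * n_params
    return aic + (2.0 * n_params * (n_params + 1)) / (n_obs - n_params - 1)


# Hypothetical comparison: the alternative model has one extra (GT) parameter.
benchmark = aicc(log_likelihood=-210.0, n_params=5, n_obs=90)
alternative = aicc(log_likelihood=-200.0, n_params=6, n_obs=90)
print(f"benchmark AICC:   {benchmark:.2f}")
print(f"alternative AICC: {alternative:.2f}")
print("alternative preferred" if alternative < benchmark else "benchmark preferred")
```

When models are fitted to differenced data, `n_obs` is usually taken as the number of observations remaining after differencing, so compared models should share the same differencing orders.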

5. Results - Initial Analysis


Furniture and Lighting: top three models in terms of AICC

GT term in model | Lag(s) | GT category | AICC
lighting | 0 | Home Furnishings | 412.47
curtains curtains curtains | 0&1 | Homemaking & Interior Decor | 414.76
lights | | Lighting | 415.63
Benchmark | | | 432.29

5. Results - More Recent Analysis


Focused on GT search categories due to transient nature of popular search queries
Compared models using out-of-sample, one-step-ahead predictions
  relies on having sufficient number of observations for initial fitting
  24 periods: May 2010 to April 2012
  only calculated for models with significant GT terms
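A minimal sketch of an out-of-sample, one-step-ahead comparison over a 24-period holdout, summarised by the mean absolute percentage error (MAPE). Simple regressions refitted on an expanding window stand in for the regARIMA models, and the data, coefficients and names are all hypothetical.

```python
import numpy as np


def one_step_mape(y, X, n_holdout=24):
    """Expanding-window MAPE: refit on rows before t, then predict row t."""
    errors = []
    for t in range(len(y) - n_holdout, len(y)):
        beta, *_ = np.linalg.lstsq(X[:t], y[:t], rcond=None)
        prediction = X[t] @ beta
        errors.append(abs((y[t] - prediction) / y[t]))
    return 100.0 * float(np.mean(errors))


# Synthetic series whose movements are partly driven by a search indicator x.
rng = np.random.default_rng(2)
n = 100
x = rng.normal(size=n)
y = np.full(n, 100.0)
for t in range(1, n):
    y[t] = 50.0 + 0.5 * y[t - 1] + 4.0 * x[t] + rng.normal()

target = y[1:]
X_bench = np.column_stack([np.ones(n - 1), y[:-1]])   # constant + lagged response
X_alt = np.column_stack([X_bench, x[1:]])              # benchmark + search term

mape_bench = one_step_mape(target, X_bench)
mape_alt = one_step_mape(target, X_alt)
print(f"benchmark MAPE:   {mape_bench:.2f}")
print(f"alternative MAPE: {mape_alt:.2f}")
```

Each prediction uses only data available before the period being predicted, which is why a sufficient number of observations is needed for the initial fit.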

5. Results - More Recent Analysis


Component of the RSI: All Retail Sales; Non-Specialised Food Stores; Non-Specialised Non-Food Stores; Clothing & Footwear; Furniture & Lighting; Home Appliances
MAPE of benchmark model: 2.01, 2.70, 3.78, 5.20
MAPE of best alternative model: 1.87, 1.80, 2.89, 4.30
No. alt. models with MAPE lower than benchmark: 1/1, 1/2, 7/7, 4/4

Component of the RSI | MAPE of benchmark model | MAPE of best alternative model | No. alt. models with MAPE lower than benchmark
Hardware, Paints & Glass | 4.90 | 4.07 | 4/4
Audio & Video Equipment | 4.03 | 3.46 | 3/9
Books, Newspapers & Stationery | 3.71 | 3.55 | 1/3
Computers & Telecoms | 7.76 | 6.21 | 5/8
Non-Store Retailing | 3.25 | 3.24 | 1/1

5. Results - More Recent Analysis


Furniture and Lighting:

GT search category | Lag(s) | MAPE
[Lamps & Lighting] + [Rugs & Carpets] | 0,0 | 2.89
Home Furnishings | 0 | 2.90
Lamps & Lighting | 0 | 2.97
Rugs & Carpets | 0 | 3.19
Sofas & Chairs | 0 | 3.29
Homemaking & Interior Decor | 0 | 3.56
Clocks | 0 | 3.65
Benchmark | | 3.78

6. Conclusions and Considerations


Promising results for some RSI components...
  Furniture and Lighting
  Hardware, Paints and Glass
  Audio and Video Equipment and Recordings
...but less so for others
  All Retail Sales
  Non-Specialised Food Stores
  Non-Specialised Non-Food Stores
Additional information is only useful when the RSI series is not dominated by trend and seasonality

6. Conclusions and Considerations


1. GT variable selection
millions of potential explanatory variables
need for automation
Google Correlate
popularity of search queries is transitory:
  Home Improvement - top 5 search queries
  August 2011: b&q, homebase, bq, b and q, diy
  August 2012: doors, paint, flooring, tiles, homebase

6. Conclusions and Considerations


2. Changes to GT categorisation taxonomy

happened in December 2011
  new categories created
  infrequent categories deleted
  changes to taxonomic parents
  became possible to have more than one parent

3. GT data only available from 2004 onwards


most ONS economic series start much earlier

6. Conclusions and Considerations


4. Some factors affect the response variable but not the GT predictor (or vice-versa), even if the model performs well overall
e.g. heavy snowfall prevents customers travelling to shops, but internet sales unlikely to be adversely affected

5. Wider applicability to outputs
key economic outputs e.g. Index of Services
other possibilities e.g. migration?

6. Future cost and accessibility of GT data?

Questions?
