
Using R to analyse complex survey samples


Thomas Lumley
Associate Professor of Biostatistics,
University of Washington.
R Core Development Team
Outline
The R survey package
Why has R become successful?
Why open-source software matters to statistics
R (needs no introduction)
An open-source reimplementation of the S language
from Bell Labs
Initially a Kiwi creation, now used around the world
2008 Pickering Medal to Ross Ihaka for R
Probably the most popular medium for distributing
new statistical methodology
CRAN: 1500 packages
Bioconductor: 500 packages
http://faculty.washington.edu/tlumley/survey/
Brief history
2002: I visit Auckland, start writing survey package
January 2003: first version released
July 2003: replicate weights
April 2004: published in J. Stat. Software
(US) Spring 2005: multistage sampling, calibration
(US) Winter 2006: two-phase designs
(NZ) Winter 2008: database-backed designs
August 2009: book (I hope)
Design philosophy
Mostly comes from limited resources
write in high-level language
code reuse to expose bugs
keep data in memory (mostly)
don't optimize until someone complains (Moore's Law)
emphasize features that look like biostatistics
Package is about 8,000 lines of code
cf. 250,000 for VPLX from the US Census Bureau
about 300,000 for all of R; 25,000,000 for SAS (!)
Interesting features
Secondary analysis/modelling of large surveys
graphics, smoothing
regression models
analysis of multiply-imputed data
Simulations
R programming language
Calibration (raking, GREG) estimators
including calibration for regression models
Database-backed objects
data loaded as needed from relational database
Why me?
[i.e., Lumley? What does he know about surveys?]
Semiparametric model-based methods are
converging on design-based inference
sandwich variance estimators
model-robustness
concept of parameters as functionals on distributions
IPW in causal inference, missing data
two-phase sampling in cohort studies
Emphases are different: that's what users are for.
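One way to see the convergence described above is to compare a design-based regression fit with an ordinary weighted least-squares fit: the point estimates agree, and only the standard errors differ. The sketch below is an illustration, not taken from the talk. It assumes the apistrat sample shipped with the survey package, and the object names dstrat, m_design and m_model are chosen here for the example.

```r
library(survey)
data(api)  # example California school samples shipped with survey

# Stratified sample of schools with sampling weights and population sizes
dstrat <- svydesign(id = ~1, strata = ~stype, weights = ~pw,
                    fpc = ~fpc, data = apistrat)

# Design-based fit: weighted estimating equations, design-based variance
m_design <- svyglm(api00 ~ api99, design = dstrat)

# Model-based fit: weighted least squares using the same weights
m_model <- lm(api00 ~ api99, data = apistrat, weights = pw)

# The coefficients should agree up to numerical error
same_coef <- isTRUE(all.equal(unname(coef(m_design)),
                              unname(coef(m_model)),
                              tolerance = 1e-6))
print(same_coef)

# The standard errors are computed differently, so they need not agree
se_table <- cbind(design_se = sqrt(diag(vcov(m_design))),
                  model_se  = sqrt(diag(vcov(m_model))))
print(se_table)
```

The weighted lm() standard errors treat the weights as precision weights, which is generally not appropriate for sampling weights; the design-based fit uses a sandwich-type variance instead.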
User interface
Data and design meta-data are stored in a survey
design object
ensures meta-data and data are kept together
subset operator sets up data for domain estimation
post-stratification/calibration creates new object
Data variables in the object are specified by model
formulas
Example: NHANES III
library(survey)
# nhanes3: data frame of NHANES III records, loaded beforehand
dhanes <- svydesign(id=~SDPPSU6, strata=~SDPSTRA6,
    weights=~WTPFQX6, data=nhanes3, nest=TRUE)
svymean(~BMPWTMI+BMPHTMI, design=dhanes)
svyquantile(~BMPWTMI, design=dhanes, quantiles=0.5)
svytotal(~factor(HAB1MI), design=dhanes)
# restrict to the adult domain, then derive BMI and BMI groups
adults <- subset(dhanes, HSAGEIR>18)
adults <- update(adults,
    bmi=BMPWTMI/(BMPHTMI/100)^2)
adults <- update(adults,
    bmigp=cut(bmi, c(0,18.5,25,30,Inf)))
svymean(~bmigp, adults)
svyby(~bmigp, ~HAB1MI, svymean, design=adults)
Example: Californian schools
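library(survey)
data(api)  # loads apiclus2 and the other API samples shipped with survey
# two-stage cluster sample: districts (dnum), then schools (snum)
dclus2 <- svydesign(id=~dnum+snum, fpc=~fpc1+fpc2,
    data=apiclus2)
model1 <- svyglm(api00~api99+emer, design=dclus2)
model2 <- svyglm(api00~api99+meals+mobility+ell+
    emer, design=dclus2)
model3 <- svyglm(api00~api99+stype+emer,
    design=dclus2)
summary(model1)
summary(model2)
summary(model3)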
Large data
With all data kept in memory
on a laptop, NHANES-scale analyses feasible if
relevant variables selected first
inexpensive 64-bit Linux systems can handle millions
of records
Database-backed
variables loaded on-demand for each command
hundreds of thousands of records possible on laptop
2007 BRFSS: 430,000 records (is just feasible)
Database-backed
dhanes <- svydesign(id=~SDPPSU6,
    strata=~SDPSTRA6, weights=~WTPFQX6,
    data="set1", dbtype="ODBC",
    dbname="nhanes3", nest=TRUE)
Specify a SQL database table or view as the data
source. Only read access is needed
Design metadata is kept in memory, other
variables loaded only as needed
Works with ODBC, JDBC, and directly with
Oracle, PostgreSQL and other popular databases
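A runnable variant of the database-backed idea can be sketched with SQLite rather than ODBC. This is an illustration under assumptions: it assumes the RSQLite package is installed and uses the small api.db SQLite file that the survey package distributes for its own documentation examples; the names db_file, db_mean and mem_mean are chosen here.

```r
library(survey)
library(RSQLite)  # assumption: RSQLite is installed

# SQLite file distributed with survey; it contains a table named "apiclus1"
db_file <- system.file("api.db", package = "survey")

# data= is a table name (a character string), not a data frame
dbclus1 <- svydesign(id = ~dnum, weights = ~pw, fpc = ~fpc,
                     data = "apiclus1", dbtype = "SQLite",
                     dbname = db_file)

# api00 is read from the database when this command runs
db_mean <- svymean(~api00, dbclus1)

# The same design held entirely in memory, for comparison
data(api)
dclus1 <- svydesign(id = ~dnum, weights = ~pw, fpc = ~fpc, data = apiclus1)
mem_mean <- svymean(~api00, dclus1)

print(db_mean)
print(mem_mean)

close(dbclus1)  # release the database connection
```

Both calls should report the same estimate, since only the storage of the analysis variables differs.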
[Figure: health insurance coverage (by age). Data from NHIS: about 25k observations]
Post-stratification, raking, calibration
Adjust weights so that estimates match known
population totals
non-response correction
greater efficiency
rake(), postStratify(), calibrate()
produce a new survey object with all the necessary
information for estimates and standard errors
Calibration to phase one for two-phase designs
(sampling from cohorts)
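A minimal sketch of postStratify() and calibrate(), assuming the apiclus1 cluster sample and the school-type population counts used in the survey package's own documentation examples. The data set and the object names are choices made for this illustration, not part of the talk.

```r
library(survey)
data(api)

# One-stage cluster sample of school districts
dclus1 <- svydesign(id = ~dnum, weights = ~pw, fpc = ~fpc, data = apiclus1)

# Known population counts of elementary, high and middle schools
pop.types <- data.frame(stype = c("E", "H", "M"),
                        Freq  = c(4421, 755, 1018))

# Post-stratification returns a new design object with adjusted weights
dclus1p <- postStratify(dclus1, ~stype, pop.types)

# Linear (GREG) calibration to the same information, given as
# population totals of the model-matrix columns for ~stype
pop.totals <- c(`(Intercept)` = 6194, stypeH = 755, stypeM = 1018)
dclus1g <- calibrate(dclus1, ~stype, pop.totals)

# Estimated totals now reproduce the known population counts
ps_totals   <- coef(svytotal(~stype, dclus1p))
greg_totals <- coef(svytotal(~stype, dclus1g))
print(ps_totals)
print(greg_totals)

# Other estimates use the adjusted weights automatically
print(svymean(~api00, dclus1))
print(svymean(~api00, dclus1p))
```

The original design dclus1 is left unchanged, so estimates before and after adjustment can be compared side by side.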
Why is R successful?
Charlton Heston brings SAS down from Mt Sinai
R spreads through a terrified nation
1998
Reasons for the R pandemic?
Extensibility
Cost
Rapid development
Network effects
Extensibility
Can users write extensions that look like built-in
functionality?
Can users find these extensions?
Is it easy to tell what extensions are installed and
to get rid of them?
Can old versions of the software co-exist with new
ones?
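For the first question, a small base-R sketch shows how a user-defined result can plug into built-in generics such as print() and coef(). The class name "wmean" and the function weighted_mean() are hypothetical, invented for this illustration.

```r
# Hypothetical constructor: returns an object of (made-up) class "wmean"
weighted_mean <- function(x, w) {
  structure(list(estimate = sum(w * x) / sum(w), n = length(x)),
            class = "wmean")
}

# S3 methods: the built-in generics dispatch to these automatically
print.wmean <- function(x, ...) {
  cat("Weighted mean:", format(x$estimate), "from", x$n, "observations\n")
  invisible(x)
}
coef.wmean <- function(object, ...) object$estimate

est <- weighted_mean(c(1, 2, 3), w = c(1, 1, 2))
print(est)   # uses print.wmean
coef(est)    # uses coef.wmean
```

To the user, est behaves like the result of a built-in function: typing its name prints it, and coef() extracts the estimate.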
Free (as in beer)
Price sensitivity should be lower for specialist
statisticians, and for large companies where
statistics is mission-critical
but these are more likely to use R
Students are price-sensitive
low cost is useful in teaching
academics learn computing from their PhD students
Rapid development
Users' syntax is the same as developers' language
deliberate design for slippery slope to programming
Functional language, dynamic types
slow, inefficient memory use
lack of side effects makes it very easy to use
most of R is written in R
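The point about side effects can be illustrated in base R: a function that modifies its argument works on a copy, so the caller's object is untouched. The function add_bmi() and the toy data are hypothetical, made up for this sketch.

```r
# Hypothetical helper: adds a BMI column and returns the modified copy
add_bmi <- function(df) {
  df$bmi <- df$weight / (df$height / 100)^2
  df
}

people <- data.frame(weight = c(70, 80), height = c(175, 180))
people_bmi <- add_bmi(people)

names(people)      # the original data frame has no bmi column
names(people_bmi)  # the returned copy does
```

The same convention appears in the survey package, where update() and subset() return new design objects rather than changing the existing one.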
Network effects
Statistical researchers benefit from publishing
methods in widely-used software
slow uptake of CART software
rapid uptake of S-PLUS tree()
"Widely-used" means among academic researchers
Cost + extensibility + rapid development leads to
initial use, feedback leads to wider use
Why is open-source statistical software important?
Open-source
Three related benefits
publication of novel methods
dissemination of good statistical practice
reproducibility
Open-source platform is not required, but it helps
need widely available platform
need packaging system for distributable code
need archive of old platform versions
Code as language
Code describes exactly what analyses you did
equations miss many practical aspects
complete and precise descriptions in English are hard,
and more ugly than the code
Code can be reused
a problem should not need to be solved more than once
Tools for communicating with others also help
when communicating with yourself
