Thomas Lumley
Associate Professor of Biostatistics, University of Washington
R Core Development Team

Outline
- The R survey package
- Why has R become successful?
- Why open-source software matters to statistics

R (needs no introduction)
- An open-source reimplementation of the S language from Bell Labs
- Initially a Kiwi creation, now used around the world
- 2008 Pickering Medal to Ross Ihaka for R
- Probably the most popular medium for distributing new statistical methodology
  - CRAN: 1500 packages
  - Bioconductor: 500 packages

http://faculty.washington.edu/tlumley/survey/

Brief history
- 2002: I visit Auckland, start writing the survey package
- January 2003: first version released
- July 2003: replicate weights
- April 2004: published in J. Stat. Software (US)
- Spring 2005: multistage sampling, calibration (US)
- Winter 2006: two-phase designs (NZ)
- Winter 2008: database-backed designs
- August 2009: book (I hope)

Design philosophy
Mostly comes from limited resources:
- write in a high-level language
- code reuse to expose bugs
- keep data in memory (mostly)
- don't optimize until someone complains (Moore's Law)
- emphasize features that look like biostatistics
The package is about 8000 lines of code, cf. 250,000 for VPLX from the US Census Bureau, about 300,000 for all of R, and 25,000,000 for SAS (!)

Interesting features
- Secondary analysis/modelling of large surveys: graphics, smoothing, regression models, analysis of multiply-imputed data
- Simulations: R programming language
- Calibration (raking, GREG) estimators, including calibration for regression models
- Database-backed objects: data loaded as needed from a relational database

Why me? [i.e., Lumley? What does he know about surveys?]
Semiparametric model-based methods are converging on design-based inference:
- sandwich variance estimators
- model-robustness
- the concept of parameters as functionals on distributions
- IPW in causal inference and missing data
- two-phase sampling in cohort studies
Emphases are different: that's what users are for.
User interface
- Data and design meta-data are stored in a survey design object
  - ensures meta-data and data are kept together
  - the subset operator sets up data for domain estimation
  - post-stratification/calibration creates a new object
- Data variables in the object are specified by model formulas

Example: NHANES III

    dhanes <- svydesign(id=~SDPPSU6, strata=~SDPSTRA6,
                        weight=~WTPFQX6, data=nhanes3, nest=TRUE)
    svymean(~BMPWTMI+BMPHTMI, design=dhanes)
    svyquantile(~BMPWTMI, design=dhanes, quantile=0.5)
    svytotal(~factor(HAB1MI), design=dhanes)

    adults <- subset(dhanes, HSAGEIR>18)
    adults <- update(adults, bmi=BMPWTMI/(BMPHTMI/100)^2)
    adults <- update(adults, bmigp=cut(bmi, c(0,18.5,25,30,Inf)))
    svymean(~bmigp, adults)
    svyby(~bmigp, ~HAB1MI, svymean, design=adults)

Example: Californian schools

    dclus2 <- svydesign(id=~dnum+snum, fpc=~fpc1+fpc2, data=apiclus2)
    model1 <- svyglm(api00~api99+emer, design=dclus2)
    model2 <- svyglm(api00~api99+meals+mobility+ell+emer, design=dclus2)
    model3 <- svyglm(api00~api99+stype+emer, design=dclus2)
    summary(model1)
    summary(model2)
    summary(model3)

Large data
- With all data kept in memory on a laptop, NHANES-scale analyses are feasible if the relevant variables are selected first
- Inexpensive 64-bit Linux systems can handle millions of records
- Database-backed: variables are loaded on demand for each command
  - hundreds of thousands of records are possible on a laptop
  - 2007 BRFSS: 430,000 records (just feasible)

Database-backed

    dhanes <- svydesign(id=~SDPPSU6, strata=~SDPSTRA6,
                        weight=~WTPFQX6, data="set1",
                        dbtype="ODBC", dbname="nhanes3", nest=TRUE)

Specify a SQL database table or view as the data source.
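As a hedged sketch of the same idea without an ODBC data source: the survey package also accepts other database back-ends via dbtype, so (assuming the hypothetical SQLite file nhanes3.db contains a table set1 with the NHANES variables) the design could be backed by a local file instead:

```r
# Sketch only: "nhanes3.db" and the table name "set1" are hypothetical.
# Requires the survey package and an SQLite driver package.
library(survey)

dhanes <- svydesign(id=~SDPPSU6, strata=~SDPSTRA6,
                    weight=~WTPFQX6, data="set1",
                    dbtype="SQLite", dbname="nhanes3.db", nest=TRUE)

# Only the variables named in each command are read from the database
svymean(~BMPWTMI, design=dhanes)
```

Only the design meta-data (ids, strata, weights) stay in memory; each analysis command pulls just the columns it names from the table.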
- Only read access is needed
- Design metadata is kept in memory; other variables are loaded only as needed
- Works with ODBC, JDBC, and directly with Oracle, PostgreSQL, and other popular databases

Data from NHIS (about 25k observations): health insurance coverage, by age [figure]

Post-stratification, raking, calibration
- Adjust weights so that estimates match known population totals
  - non-response correction
  - greater efficiency
- rake(), postStratify(), and calibrate() produce a new survey object with all the necessary information for estimates and standard errors
- Calibration to phase one for two-phase designs (sampling from cohorts)

Why is R successful?
[Cartoon: Charlton Heston brings SAS down from Mt Sinai; R spreads through a terrified nation, 1998]

Reasons for the R pandemic?
- Extensibility
- Cost
- Rapid development
- Network effects

Extensibility
- Can users write extensions that look like built-in functionality?
- Can users find these extensions?
- Is it easy to tell what extensions are installed, and to get rid of them?
- Can old versions of the software co-exist with new ones?
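A hedged sketch of weight adjustment, building on the NHANES design object above (the variable name HSSEX and the population counts here are hypothetical, for illustration only):

```r
# Sketch only: assumes the survey package and the dhanes design object
# from the NHANES III example; HSSEX and the Freq counts are invented.
library(survey)

# Known population totals for each post-stratum
pop.totals <- data.frame(HSSEX = c(1, 2),
                         Freq  = c(9800000, 10200000))

# postStratify() returns a NEW design object whose weights are adjusted
# so that estimated stratum totals match pop.totals exactly
dhanes_ps <- postStratify(dhanes, strata = ~HSSEX,
                          population = pop.totals)

# Later estimates use the adjusted weights, and the standard errors
# account for the post-stratification
svymean(~BMPWTMI, dhanes_ps)
```

rake() and calibrate() follow the same pattern: each takes a design object plus population information and returns a new design object, so downstream estimation code is unchanged.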
Free (as in beer)
- Price sensitivity should be lower for specialist statisticians, and for large companies where statistics is mission-critical; but these are the groups more likely to use R
- Students are price-sensitive: low cost is useful in teaching
- Academics learn computing from their PhD students

Rapid development
- The user's syntax is the same as the developer's language: a deliberate design for a slippery slope into programming
- Functional language, dynamic types
  - slow, with inefficient memory use
  - but the lack of side effects makes it very easy to use
- Most of R is written in R

Network effects
- Statistical researchers benefit from publishing methods in widely-used software
  - slow uptake of CART software
  - rapid uptake of S-PLUS tree()
- "Widely-used" means among academic researchers
- Cost + extensibility + rapid development lead to initial use and feedback, which lead to wider use

Why is open-source statistical software important?

Open-source
- Three related benefits:
  - publication of novel methods
  - dissemination of good statistical practice
  - reproducibility
- An open-source platform is not required, but it helps:
  - need a widely available platform
  - need a packaging system for distributable code
  - need an archive of old platform versions

Code as language
- Code describes exactly what analyses you did
  - equations miss many practical aspects
  - complete and precise descriptions in English are hard, and uglier than the code
- Code can be reused: a problem should not need to be solved more than once
- Tools for communicating with others also help when communicating with yourself