
07/06/2016

Bootstrap and cross-validation for evaluating modelling strategies | R-bloggers


Bootstrap and cross-validation for evaluating modelling strategies

June 4, 2016
By Peter's stats stuff - R

(This article was first published on Peter's stats stuff - R, and kindly contributed to R-bloggers)

Modelling strategies

I've been re-reading Frank Harrell's Regression Modelling Strategies, a must-read for anyone who ever fits a regression model. Be prepared, though: depending on your background, you might get 30 pages in and suddenly become convinced you've been doing nearly everything wrong, which can be disturbing.

I wanted to evaluate three simple modelling strategies for dealing with data with many variables. Using data with 54 variables on 1,785 area units from New Zealand's 2013 census, I'm looking to predict median income on the basis of the other 53 variables. The features are all continuous and are variables like mean number of bedrooms, proportion of individuals with no religion, and proportion of individuals who are smokers. Restricting myself to traditional linear regression with a normally distributed response, my three alternative strategies were:

- use all 53 variables;
- eliminate the variables that can be predicted easily from the other variables (defined by having a variance inflation factor greater than ten), one by one, until the main collinearity problems are gone; or
- eliminate variables one at a time from the full model on the basis of comparing Akaike's Information Criterion of models with and without each variable.

None of these is exactly what I would use for real, but they serve the purpose of setting up a competition of strategies that I can test with a variety of model validation techniques.

Validating models

The main purpose of the exercise was actually to ensure I had my head around different ways of estimating the validity of a model, loosely definable as how well it would perform at predicting new data. As there is no possibility of new areas in New Zealand from 2013 that need to have their income predicted, the prediction is a thought exercise which we need to find a plausible way of simulating. Confidence in hypothetical predictions gives us confidence in the insights the model gives into relationships between variables.

There are many methods of validating models, although I think k-fold cross-validation has market dominance (not with Harrell though, who prefers varieties of the bootstrap). The three validation methods I've used for this post are:
1. Simple bootstrap. This involves creating resamples with replacement from the original data, of the same size; applying the modelling strategy to the resample; using the model to predict the values of the full set of original data; and calculating a goodness-of-fit statistic (eg either R-squared or root mean squared error) comparing the predicted value to the actual value. Note: following Efron, Harrell calls this the "simple bootstrap", but other authors and the useful caret package use "simple bootstrap" to mean the resample model is used to predict the out-of-bag values at each resample point, rather than the full original sample.
2. Enhanced bootstrap. This is a little more involved and is basically a method of estimating the optimism of the goodness-of-fit statistic. There's a nice step-by-step explanation by thestatsgeek which I won't try to improve on.
3. Repeated 10-fold cross-validation. 10-fold cross-validation involves dividing your data into ten parts, then taking turns to fit the model on 90% of the data and using that model to predict the remaining 10%. The average of the 10 goodness-of-fit statistics becomes your estimate of the actual goodness of fit. One of the problems with k-fold cross-validation is that it has a high variance, ie doing it different times you get different results based on the luck of your k-way split; so repeated k-fold cross-validation addresses this by performing the whole process a number of times and taking the average.
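The goodness-of-fit statistics all three methods rely on are simple to compute by hand. A minimal sketch with toy numbers (not data from this post); note that caret's `R2()` helper defaults to the squared correlation between predicted and observed, which can differ slightly from the traditional definition used here:

```r
# Toy predicted and actual values, purely for illustration
actual    <- c(10, 12, 9, 14, 11)
predicted <- c(11, 11, 10, 13, 12)

# Root mean squared error: the typical size of a prediction error
rmse <- sqrt(mean((actual - predicted)^2))  # every error is +/-1, so rmse is 1

# R-squared: proportion of variance in the actual values explained
r2 <- 1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
```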
As the sample sizes get bigger relative to the number of variables in the model, the methods should converge. The bootstrap methods can give over-optimistic estimates of model validity compared to cross-validation; there are various other methods available to address this issue, although none seem to me to provide an all-purpose solution.
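One of those other methods is the ".632" bootstrap, which blends the optimistic apparent error with the pessimistic out-of-bag error; caret exposes it directly. A minimal sketch, using a built-in dataset rather than the census data from this post:

```r
library(caret)

# The ".632" bootstrap is one of caret's resampling methods; it corrects
# out-of-bag bootstrap estimates for their pessimistic bias.
ctrl <- trainControl(method = "boot632", number = 100)

# Illustrative only: any lm-style model and data frame would do here
fit <- train(mpg ~ ., data = mtcars, method = "lm", trControl = ctrl)
fit$results$RMSE  # the .632-adjusted estimate of root mean squared error
```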
It's critical that the resampling in the process envelops the entire model-building strategy, not just the final fit. In particular, if the strategy involves variable selection (as two of my candidate strategies do), you have to automate that selection process and run it on each different resample. That's because one of the highest-risk parts of the modelling process is that variable selection. Running cross-validation or the bootstrap on a final model after you've eliminated a bunch of variables is missing the point, and will give materially misleading statistics (biased towards things being more significant than there really is evidence for). Of course, that doesn't stop this being common misguided practice.
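The point can be sketched abstractly: the whole strategy, selection included, has to be a function called inside the resampling loop. Here `strategy()` is a hypothetical stand-in for any model-building function, such as the stepwise or VIF-based functions defined in the code later in this post:

```r
# WRONG: select variables once on the full data, then resample only the
# final fit - the resamples never see the riskiest step, the selection.

# RIGHT: rerun the entire strategy from scratch on each resample.
validate <- function(full_data, strategy, n_resamples = 100) {
  sapply(seq_len(n_resamples), function(i) {
    resample <- full_data[sample(nrow(full_data), replace = TRUE), ]
    model <- strategy(resample)            # variable selection happens in here
    preds <- predict(model, newdata = full_data)
    sqrt(mean((full_data$MedianIncome - preds)^2))  # RMSE against original data
  })
}
```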

Results

One nice feature of statistics since the revolution of the 1980s is that the bootstrap helps you conceptualise what might have happened but didn't. Here's the root mean squared error from the 100 different bootstrap resamples when the three different modelling strategies (including variable selection) were applied:

[Plot: RMSE of the three strategies across 100 'simple' bootstrap resamples]

Notice anything? Not only does it seem to be generally a bad idea to drop variables just because they are collinear with others, but occasionally it turns out to be a really bad idea, like in resamples #4, #6 and around thirty others. Those thirty or so spikes are in resamples where random chance led to one of the more important variables being dumped before it had a chance to contribute to the model.

The thing that surprised me here was that the generally maligned stepwise selection strategy performed nearly as well as the full model, judged by the simple bootstrap. That result comes through for the other two validation methods as well:
[Plot: estimated RMSE from each of the three validation methods for the three strategies]

In all three validation methods there's really nothing substantive to choose between the full model and stepwise strategies, based purely on results.

Reflections

The full model is much easier to fit, interpret, estimate confidence intervals and perform tests on than stepwise. All the standard statistics for a final model chosen by stepwise methods are misleading, and careful recalculations are needed based on elaborate bootstrapping. So the full model wins hands down as a general strategy in this case.

With this data, we have a bit of freedom from the generous sample size. If approaching this for real I wouldn't eliminate any variables unless there were theoretical/subject-matter reasons to do so. I have made the mistake of eliminating the collinear variables before from this dataset, but will try not to do it again. The rule of thumb is to have 20 observations for each parameter (this is one of the most asked and most dodged questions in statistics education; see Table 4.1 of Regression Modelling Strategies for this particular answer), which suggests we can have up to 80 parameters with a bit to spare. This gives us 30 parameters to use for non-linear relationships and/or interactions, which is the direction I might go in a subsequent post. Bearing that in mind, I'm not bothering to report here the actual substantive results (eg which factors are related to income and how); that can wait for a better model down the track.

Data and computing

The census data are ultimately from Statistics New Zealand of course, but are tidied up and available in my nzelect R package, which is still very much under development and may change without notice. It's only available from GitHub at the moment (installation code below).

I do the bootstrapping with the aid of the boot package, which is generally the recommended approach in R. For repeated cross-validation of the two straightforward strategies (full model and stepwise variable selection) I use the caret package, in combination with stepAIC which is in the Venables and Ripley MASS package. For the more complex strategy that involved dropping variables with high variance inflation factors I found it easiest to do the repeated cross-validation old-school with my own for loops.

This exercise was a bit complex and I won't be astonished if someone points out an error. If you see a problem, or have any suggestions or questions, please leave a comment.

Here's the code:
#===================setup=======================
library(ggplot2)
library(scales)
library(MASS)
library(boot)
library(caret)
library(dplyr)
library(tidyr)
library(directlabels)
set.seed(123)

# install nzelect package that has census data
devtools::install_github("ellisp/nzelect/pkg")
library(nzelect)

# drop the columns with areas' code and name
au <- AreaUnits2013 %>%
   select(-AU2014, -Area_Code_and_Description)


# give meaningful rownames, helpful for some diagnostic plots later
row.names(au) <- AreaUnits2013$Area_Code_and_Description

# remove some repetition from the variable names
names(au) <- gsub("2013", "", names(au))

# restrict to areas with no missing data.  If this was any more complicated (eg
# imputation), it would need to be part of the validation resampling too; but
# just dropping them all at the beginning doesn't need to be resampled; the only
# implication would be sample size, which would be small impact and complicating.
au <- au[complete.cases(au), ]

#==================functions for two of the modelling strategies=====================
# The stepwise variable selection:
model_process_step <- function(the_data){
   model_full <- lm(MedianIncome ~ ., data = the_data)
   model_final <- stepAIC(model_full, direction = "both", trace = 0)
   return(model_final)
}

# The dropping of highly collinear variables, based on Variance Inflation Factor:
model_process_vif <- function(the_data){
   # remove the collinear variables based on vif
   x <- 20
   
   while(max(x) > 10){
      mod1 <- lm(MedianIncome ~ ., data = the_data)
      x <- sort(car::vif(mod1), decreasing = TRUE)
      the_data <- the_data[ , names(the_data) != names(x)[1]]
      # message(paste("dropping", names(x)[1]))
   }
   
   model_vif <- lm(MedianIncome ~ ., data = the_data)
   return(model_vif)
}

# The third strategy, full model, is only a one-liner with standard functions
# so I don't need to define a function separately for it.
#==================Different validation methods=================
# simple bootstrap comparison
# create a function suitable for boot that will return the goodness-of-fit
# statistics testing models against the full original sample.
compare <- function(orig_data, i){
   # create the resampled data
   train_data <- orig_data[i, ]
   test_data  <- orig_data   # ie the full original sample
   
   # fit the three modelling processes
   model_step <- model_process_step(train_data)
   model_vif  <- model_process_vif(train_data)
   model_full <- lm(MedianIncome ~ ., data = train_data)
   
   # predict the values on the original, unresampled data
   predict_step <- predict(model_step, newdata = test_data)
   predict_vif  <- predict(model_vif, newdata = test_data)
   predict_full <- predict(model_full, newdata = test_data)
   
   # return a vector of 6 summary results
   results <- c(
      step_R2   = R2(predict_step, test_data$MedianIncome),
      vif_R2    = R2(predict_vif, test_data$MedianIncome),
      full_R2   = R2(predict_full, test_data$MedianIncome),
      step_RMSE = RMSE(predict_step, test_data$MedianIncome),
      vif_RMSE  = RMSE(predict_vif, test_data$MedianIncome),
      full_RMSE = RMSE(predict_full, test_data$MedianIncome)
   )
   return(results)
}

# perform bootstrap
Repeats <- 100
res <- boot(au, statistic = compare, R = Repeats)

# restructure results for a graphic showing root mean square error, and for
# later combination with the other results.  I chose just to focus on RMSE;
# the messages are similar if R-squared is used.
RMSE_res <- as.data.frame(res$t[ , 4:6])
names(RMSE_res) <- c("AIC stepwise selection", "Remove collinear variables", "Use all variables")

RMSE_res %>%
   mutate(trial = 1:Repeats) %>%
   gather(variable, value, -trial) %>%
   # reorder levels:
   mutate(variable = factor(variable, levels = c(
      "Remove collinear variables", "AIC stepwise selection", "Use all variables"
   ))) %>%
   ggplot(aes(x = trial, y = value, colour = variable)) +
   geom_line() +
   geom_point() +
   ggtitle("'Simple' bootstrap of model fit of three different regression strategies",
           subtitle = "Predicting areas' median income based on census variables") +
   labs(x = "Resample id (there's no meaning in the order of resamples)\n",
        y = "Root Mean Square Error (higher is worse)\n",
        colour = "Strategy",
        caption = "Data from New Zealand Census 2013")

# store the three "simple bootstrap" RMSE results for later
simple <- apply(RMSE_res, 2, mean)


# enhanced (optimism) bootstrap comparison
# for convenience, estimate the models on the original sample of data
orig_step <- model_process_step(au)
orig_vif  <- model_process_vif(au)
orig_full <- lm(MedianIncome ~ ., data = au)

# create a function suitable for boot that will return the optimism estimates for
# statistics testing models against the full original sample.
compare_opt <- function(orig_data, i){
   # create the resampled data
   train_data <- orig_data[i, ]
   
   # fit the three modelling processes
   model_step <- model_process_step(train_data)
   model_vif  <- model_process_vif(train_data)
   model_full <- lm(MedianIncome ~ ., data = train_data)
   
   # predict the values on the original, unresampled data
   predict_step <- predict(model_step, newdata = orig_data)
   predict_vif  <- predict(model_vif, newdata = orig_data)
   predict_full <- predict(model_full, newdata = orig_data)
   
   # return a vector of 6 summary optimism results
   results <- c(
      step_R2   = R2(fitted(model_step), train_data$MedianIncome) - R2(predict_step, orig_data$MedianIncome),
      vif_R2    = R2(fitted(model_vif), train_data$MedianIncome) - R2(predict_vif, orig_data$MedianIncome),
      full_R2   = R2(fitted(model_full), train_data$MedianIncome) - R2(predict_full, orig_data$MedianIncome),
      step_RMSE = RMSE(fitted(model_step), train_data$MedianIncome) - RMSE(predict_step, orig_data$MedianIncome),
      vif_RMSE  = RMSE(fitted(model_vif), train_data$MedianIncome) - RMSE(predict_vif, orig_data$MedianIncome),
      full_RMSE = RMSE(fitted(model_full), train_data$MedianIncome) - RMSE(predict_full, orig_data$MedianIncome)
   )
   return(results)
}

# perform bootstrap
res_opt <- boot(au, statistic = compare_opt, R = Repeats)

# calculate and store the results for later
original <- c(
   RMSE(fitted(orig_step), au$MedianIncome),
   RMSE(fitted(orig_vif), au$MedianIncome),
   RMSE(fitted(orig_full), au$MedianIncome)
)
optimism <- apply(res_opt$t[ , 4:6], 2, mean)
enhanced <- original - optimism

# repeated cross validation
# The number of cross-validation repeats is the number of bootstrap repeats / 10:
cv_repeat_num <- Repeats / 10

# use caret::train for the two standard models:
the_control <- trainControl(method = "repeatedcv", number = 10, repeats = cv_repeat_num)
cv_full <- train(MedianIncome ~ ., data = au, method = "lm", trControl = the_control)
cv_step <- train(MedianIncome ~ ., data = au, method = "lmStepAIC", trControl = the_control, trace = 0)

# do it by hand for the VIF model:
results <- numeric(10 * cv_repeat_num)
for(j in 0:(cv_repeat_num - 1)){
   cv_group <- sample(1:10, nrow(au), replace = TRUE)
   for(i in 1:10){
      train_data <- au[cv_group != i, ]
      test_data  <- au[cv_group == i, ]
      results[j * 10 + i] <- RMSE(
         predict(model_process_vif(train_data), newdata = test_data),
         test_data$MedianIncome)
   }
}
cv_vif <- mean(results)

cv_vif_results <- data.frame(
   results = results,
   trial = rep(1:10, cv_repeat_num),
   cv_repeat = rep(1:cv_repeat_num, each = 10)
)

#===============reporting results===============
# combine the three cross-validation results together and combine with
# the bootstrap results from earlier
summary_results <- data.frame(rbind(
   simple,
   enhanced,
   c(mean(cv_step$resample$RMSE),
     cv_vif,
     mean(cv_full$resample$RMSE))
), check.names = FALSE) %>%
   mutate(method = c("Simple bootstrap", "Enhanced bootstrap",
                     paste(cv_repeat_num, "repeats 10-fold\ncross-validation"))) %>%
   gather(variable, value, -method)
# Draw a plot summarising the results
direct.label(
   summary_results %>%
      mutate(variable = factor(variable, levels = c(
         "Use all variables", "AIC stepwise selection", "Remove collinear variables"
      ))) %>%
      ggplot(aes(y = method, x = value, colour = variable)) +
      geom_point(size = 3) +
      labs(x = "Estimated Root Mean Square Error (higher is worse)\n",
           colour = "Modelling\nstrategy",
           y = "Method of estimating model fit\n",
           caption = "Data from New Zealand Census 2013") +
      ggtitle("Three different validation methods of three different regression strategies",
              subtitle = "Predicting areas' median income based on census variables")
)