9/6/2016

Improve Your Model Performance using Cross Validation (for Python Users)

Introduction
I have closely monitored a series of data hackathons and found an interesting trend (shown
below). This trend is based on participant rankings on the public and private leaderboards.
I noticed that participants who rank higher on the public leaderboard lose their position after
their ranks are validated on the private leaderboard. Some even failed to secure a rank in the
top 20 on the private leaderboard (image below).
Eventually, I discovered the phenomenon that causes such ripples on the leaderboard.

Take a guess! What could be the possible reason for high variation in ranks? In other words, why
does their model lose stability when evaluated on the private leaderboard? Let's look at some
possible reasons.

Why do models lose stability?
Let's understand this using the snapshot below, which illustrates the fit of various models:

http://www.analyticsvidhya.com/blog/2015/11/improvemodelperformancecrossvalidationinpythonr/

Here, we are trying to find the relationship between size and price. To do so, we've taken the
following steps:
1. We've established the relationship using a linear equation, for which the plots have been shown. The
first plot has high error on the training data points. Therefore, it will not perform well on either the
public or the private leaderboard. This is an example of underfitting. In this case, our model fails to
capture the underlying trend of the data.
2. In the second plot, we found just the right relationship between price and size, i.e. low training
error and a relationship that generalizes.
3. In the third plot, we found a relationship which has almost zero training error. This is because the
relationship is developed by considering each deviation in the data points (including noise), i.e. the
model is too sensitive and captures random patterns which are present only in the current data set.
This is an example of overfitting. With this relationship, there could be a large difference between the
public and private leaderboard scores.
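
To make the three cases concrete, here is a minimal sketch in plain Python. It does not reproduce the plots: the size and price data are synthetic, and the three models (a constant average, a least-squares line, and a nearest-point memorizer) are stand-ins I chose for underfitting, a good fit, and overfitting. All function names are illustrative.

```python
import random

def make_data(n, rng):
    """Synthetic (size, price) pairs: a linear trend plus Gaussian noise."""
    data = []
    for _ in range(n):
        size = rng.uniform(500, 3500)
        price = 100 * size + rng.gauss(0, 20000)
        data.append((size, price))
    return data

def fit_mean(train):
    """Underfit: ignore size and always predict the average training price."""
    average = sum(price for _, price in train) / len(train)
    return lambda size: average

def fit_line(train):
    """Reasonable fit: ordinary least-squares line through the training data."""
    n = len(train)
    mean_x = sum(size for size, _ in train) / n
    mean_y = sum(price for _, price in train) / n
    sxx = sum((size - mean_x) ** 2 for size, _ in train)
    sxy = sum((size - mean_x) * (price - mean_y) for size, price in train)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda size: slope * size + intercept

def fit_memorizer(train):
    """Overfit: return the price of the nearest training size, noise included."""
    return lambda size: min(train, key=lambda pair: abs(pair[0] - size))[1]

def mse(model, data):
    """Mean squared error of a model over (size, price) pairs."""
    return sum((model(size) - price) ** 2 for size, price in data) / len(data)

rng = random.Random(0)  # fixed seed so the run is repeatable
train, unseen = make_data(200, rng), make_data(200, rng)
fitters = {"underfit": fit_mean, "good fit": fit_line, "overfit": fit_memorizer}
results = {}
for name, fitter in fitters.items():
    model = fitter(train)
    results[name] = (mse(model, train), mse(model, unseen))
    print(f"{name:9s} train MSE = {results[name][0]:.3e}  "
          f"unseen MSE = {results[name][1]:.3e}")
```

The memorizer reaches zero training error yet does worse than the line on unseen data, which is the same gap that shows up between the public and private leaderboards.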

A common practice in data science competitions is to iterate over various models to find a better
performing model. However, it becomes difficult to distinguish whether an improvement in score
comes from capturing the relationship better or from simply overfitting the data. To answer this
question, we use the cross validation technique. This method helps us achieve more generalized
relationships.
Note: This article is meant for every aspiring data scientist keen to improve his/her performance in
data science competitions. In the end, I've shared Python and R codes for cross validation. In R, I've
used the iris data set for demonstration purposes.

What is Cross Validation?
Cross validation is a technique which involves reserving a particular sample of a data set on which
you do not train the model. Later, you test the model on this sample before finalizing the model.
Here are the steps involved in cross validation:
1. You reserve a sample of the data set.
2. Train the model using the remaining part of the data set.
3. Use the reserved sample of the data set as the test (validation) set. This will help you gauge the
effectiveness of the model's performance. If your model delivers a positive result on the validation
data, go ahead with the current model. It rocks!
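
The three steps above can be sketched in a few lines of plain Python. This is only an illustration: the 20% hold-out fraction, the synthetic data, and the helper names `holdout_split`, `fit_line` and `mse` are my assumptions, not something prescribed by the steps.

```python
import random

def holdout_split(rows, holdout_fraction, seed=0):
    """Step 1: shuffle a copy of the rows and reserve a fraction of them."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]  # (training, validation)

def fit_line(rows):
    """Step 2: fit a least-squares line y = slope * x + intercept."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mse(line, rows):
    """Step 3: mean squared error of the fitted line on a set of rows."""
    slope, intercept = line
    return sum((slope * x + intercept - y) ** 2 for x, y in rows) / len(rows)

# Synthetic data (an assumption for the demo): y = 3x + 2 plus unit noise.
rng = random.Random(1)
rows = [(float(x), 3.0 * x + 2.0 + rng.gauss(0, 1.0)) for x in range(100)]

training, validation = holdout_split(rows, holdout_fraction=0.2)
line = fit_line(training)
validation_mse = mse(line, validation)
print(f"trained on {len(training)} rows, validated on {len(validation)} rows")
print(f"validation MSE = {validation_mse:.3f}")
```

The model never sees the validation rows during fitting, so the validation error is an estimate of how it behaves on new data.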

What are common methods used for Cross Validation?
There are various methods of cross validation. I've discussed a few of them below:

1. The Validation Set Approach
In this approach, we reserve 50% of the data set for validation and the remaining 50% for model
training. We then test the model's performance on the validation data set. However, a major
disadvantage of this approach is that we train the model on only 50% of the data set, so we may
miss some interesting information in the data, i.e. higher bias.
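
One way to see the cost of committing to a single 50/50 split is to repeat it with different random halves and watch the estimate move. The sketch below does that on a small synthetic data set. The data, the five seeds, and the helper names are assumptions made for illustration only.

```python
import random

def fit_line(rows):
    """Least-squares line y = slope * x + intercept."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def half_split_mse(rows, seed):
    """Train on one random half of the rows, report MSE on the other half."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    validation, training = shuffled[:half], shuffled[half:]
    slope, intercept = fit_line(training)
    squared = [(slope * x + intercept - y) ** 2 for x, y in validation]
    return sum(squared) / len(squared)

# Small synthetic data set (an assumption): y = 3x + 2 plus noise.
rng = random.Random(42)
rows = [(float(x), 3.0 * x + 2.0 + rng.gauss(0, 5.0)) for x in range(40)]

estimates = [half_split_mse(rows, seed) for seed in range(5)]
for seed, estimate in enumerate(estimates):
    print(f"split {seed}: validation MSE = {estimate:.2f}")
```

Each run trains on only 20 of the 40 rows, and the reported error depends on which 20 were picked.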

2. Leave one out cross validation (LOOCV)
In this approach, we reserve only one data point of the available data set and train the model on
the rest of the data. This process iterates for each data point. It has its advantages and
disadvantages. Let's look at them:
We make use of all data points, hence low bias.
We repeat the cross validation process n times (where n is the number of data points), which results
in higher execution time.
This approach leads to higher variation in testing model effectiveness because we test against a single
data point. So, our estimate gets highly influenced by that data point. If the data point turns out to be
an outlier, it can lead to higher variation.
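
A minimal LOOCV loop might look like the sketch below. The ten-point data set with one planted outlier and the helper names are assumptions chosen to make the outlier effect visible.

```python
def fit_line(rows):
    """Least-squares line, returned as a prediction function."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def loocv_errors(rows, fit):
    """Hold out each row in turn, train on the rest, record the squared error."""
    errors = []
    for i, (x, y) in enumerate(rows):
        training = rows[:i] + rows[i + 1:]  # every row except row i
        model = fit(training)
        errors.append((model(x) - y) ** 2)
    return errors

# Ten points exactly on y = 2x + 1, then one planted outlier (an assumption).
rows = [(float(x), 2.0 * x + 1.0) for x in range(10)]
rows[5] = (5.0, 61.0)  # 50 above the line

errors = loocv_errors(rows, fit_line)
worst = errors.index(max(errors))
for i, error in enumerate(errors):
    print(f"held-out row {i}: squared error = {error:.1f}")
print(f"LOOCV error (mean of {len(errors)} runs) = {sum(errors) / len(errors):.1f}")
```

The model is refitted n times, and the single run that is tested against the outlier dominates the average.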

3. k-fold cross validation
From the above two validation methods, we've learnt:
1. We should train the model on a large portion of the data set. Otherwise, we'd fail every time to read
the underlying trend of the data, eventually resulting in higher bias.
2. We also need a good ratio of testing data points. As we have seen, too few data points can lead to
variance error while testing the effectiveness of the model.
3. We should iterate on the training and testing process multiple times, changing the train and test
data set distribution. This helps to validate the model's effectiveness well.
Do we have a method which takes care of all these 3 requirements?
Yes! That method is known as k-fold cross validation. It's easy to follow and implement. Here are
the quick steps:
1. Randomly split your entire data set into k folds.
2. For each of the k folds in your data set, build your model on the other k-1 folds. Then, test the
model to check its effectiveness on the kth fold.
3. Record the error you see on each of the predictions.
4. Repeat this until each of the k folds has served as the test set.
5. The average of your k recorded errors is called the cross-validation error and will serve as your
performance metric for the model.
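
The five steps above can be sketched in plain Python as follows. The synthetic data, the least-squares line used as the model, and the names `k_fold_indices` and `cross_validation_error` are assumptions for illustration.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Step 1: shuffle the row indices and deal them into k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def fit_line(rows):
    """Least-squares line, returned as a prediction function."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def cross_validation_error(rows, k, fit, seed=0):
    """Steps 2 to 5: train on k-1 folds, test on the remaining one, average."""
    fold_errors = []
    for test_indices in k_fold_indices(len(rows), k, seed):  # step 4: every fold
        held_out = set(test_indices)
        training = [row for i, row in enumerate(rows) if i not in held_out]
        model = fit(training)                                # step 2
        squared = [(model(rows[i][0]) - rows[i][1]) ** 2 for i in test_indices]
        fold_errors.append(sum(squared) / len(squared))      # step 3
    return sum(fold_errors) / len(fold_errors), fold_errors  # step 5

# Synthetic data (an assumption): y = 3x + 2 plus unit noise.
rng = random.Random(7)
rows = [(float(x), 3.0 * x + 2.0 + rng.gauss(0, 1.0)) for x in range(100)]

cv_error, fold_errors = cross_validation_error(rows, k=10, fit=fit_line)
print("per-fold MSE:", [round(e, 2) for e in fold_errors])
print(f"cross-validation error = {cv_error:.3f}")
```

In day-to-day work you would more likely reach for `KFold` or `cross_val_score` from `sklearn.model_selection`. The hand-written loop is only meant to show what happens in each step.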

Below is a visualization of how k-fold validation works for k=10.

Now, one of the most commonly asked questions is, "How to choose the right value of k?"
Always remember, a lower value of k is more biased and hence undesirable. On the other hand,
a higher value of k is less biased, but can suffer from large variability. It is good to know that a
smaller value of k takes us towards the validation set approach, whereas a higher value of k leads
to the LOOCV approach. Hence, it is often suggested to use k=10.
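
A quick way to see the two extremes is to look at the fold sizes for a given k. The helper `fold_sizes` below is a hypothetical name, and n = 100 is just an example.

```python
def fold_sizes(n, k):
    """Sizes of k near-equal folds over n rows (earlier folds take the remainder)."""
    base, extra = divmod(n, k)
    return [base + 1 if i < extra else base for i in range(k)]

n = 100
for k in (2, 10, n):
    sizes = fold_sizes(n, k)
    test_size = sizes[0]
    print(f"k = {k:3d}: test on {test_size:2d} rows, "
          f"train on {n - test_size} rows, repeated {k} times")
```

With k = 2 each model trains on half the data, as in the validation set approach. With k = n each model is tested on a single row, which is LOOCV.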

How to measure the model's bias-variance?
After k-fold cross validation, we'll get k different model estimation errors (e1, e2, ..., ek). In an
ideal scenario, these error values should add up to zero. To estimate the model's bias, we take the
average of all the errors. The lower the average value, the better the model.
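
As a small sketch, the average can be computed with Python's `statistics` module. The five error values below are made-up numbers for illustration, not results from any real model. The standard deviation is added as one possible way to look at how much the fold errors vary, which is my assumption rather than something stated above.

```python
import statistics

# Hypothetical fold errors e1..e5 (made-up values, for illustration only).
fold_errors = [0.20, 0.25, 0.15, 0.30, 0.10]

average_error = statistics.mean(fold_errors)  # lower is better
spread = statistics.stdev(fold_errors)        # assumption: spread across folds

print(f"average error across folds = {average_error:.3f}")
print(f"standard deviation across folds = {spread:.3f}")
```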
