Sie sind auf Seite 1von 5

# 9/6/2016

ImproveYourModelPerformanceusingCrossValidation(forPythonUsers)

Introduction
I have closely monitored the series of Data Hackathons and found an interesting trend (shown
gets validated at private leaderboard. Some even failed to secure rank in top 20s on private

Takeaguess!Whatcouldbethepossiblereasonforhighvariationinranks?Inotherwords,why
does their model lose stability when evaluated on private leaderboard? Lets look some
possiblereason.

Whydomodelslosestability?
Letsunderstandthisusingthesnapshotillustratingfitofvariousmodelsbelow:

http://www.analyticsvidhya.com/blog/2015/11/improvemodelperformancecrossvalidationinpythonr/

1/5

9/6/2016

ImproveYourModelPerformanceusingCrossValidation(forPythonUsers)

Here, we are trying to find the relationship between size and price. For which, weve taken the
followingsteps:
1.Weveestablishedtherelationshipusingalinearequationforwhichtheplotshavebeenshown.Firstplot
hashigherrorfromtrainingdatapoints.Therefore,thiswillnotperformwellatbothpublicandprivate
leader board. This is an example of Under fitting. In this case, our model fails to capture the
underlyingtrendofthedata.
2.In second plot, we just found the right relationship between price and size i.e. low training error and
generalizationofrelationship
3.In third plot, we found a relationship which has almost zero training error. This is because, the
relationshipisdevelopedbyconsideringeachdeviationinthedatapoint(includingnoise)i.e.modelis
toosensitiveandcapturesrandompatternswhicharepresentonlyinthecurrentdataset.Thisisan
board.

A common practice in data science competitions is to iterate over various models to find a better
performingmodel.However,itbecomesdifficulttodistinguishwhetherthisimprovementinscoreis
comingbecausewearecapturingtherelationshipbetterorwearejustoverfittingthedata.Tofind
the right answer of this question, we use cross validation technique. This method helps us to
achievemoregeneralizedrelationships.
Note:Thisarticleismeantforeveryaspiringdatascientistkeentoimprovehis/herperformancein
datasciencecompetitions.Intheend,IvesharedpythonandRcodesforcrossvalidation.InR,Ive
usedirisdatasetfordemonstrationpurpose.

http://www.analyticsvidhya.com/blog/2015/11/improvemodelperformancecrossvalidationinpythonr/

2/5

9/6/2016

ImproveYourModelPerformanceusingCrossValidation(forPythonUsers)

WhatisCrossValidation?
youdonottrainthemodel.Later,youtestthemodelonthissamplebeforefinalizingthemodel.
Herearethestepsinvolvedincrossvalidation:
1.Youreserveasampledataset.
2.Trainthemodelusingtheremainingpartofthedataset.
3.Usethereservesampleofthedatasettest(validation)set.Thiswillhelpyoutoknowtheeffectiveness
model.Itrocks!

WhatarecommonmethodsusedforCrossValidation
?
Therearevariousmethodsofcrossvalidation.Ivediscussedfewofthembelow:

1.TheValidationsetApproach
http://www.analyticsvidhya.com/blog/2015/11/improvemodelperformancecrossvalidationinpythonr/

3/5

9/6/2016

ImproveYourModelPerformanceusingCrossValidation(forPythonUsers)

1.TheValidationsetApproach
In this approach, we reserve 50% of dataset for validation and rest 50% for model training.After
testing the model performance on validation data set. However, a major disadvantage of this
approachisthatwetrainamodelon50%ofthedatasetonly,whereas,itmaybepossiblethatwe

2.Leaveoneoutcrossvalidation(LOOCV)
Inthisapproach,wereserveonlyonedatapointoftheavailabledataset.And,trainmodelonthe
rest of data set. This process iterates for each data point. It is also known for its advantages and
Wemakeuseofalldatapoints,hencelowbias.
Werepeatthecrossvalidationprocessiteratesntimes(wherenisnumberofdatapoints)whichresults
inhigherexecutiontime
point. So, our estimation gets highly influenced by the data point. If the data point turns out to be an

3.kfoldcrossvalidation
Fromtheabovetwovalidationmethods,wevelearnt:
ofdatasets.Eventually,resultinginhigherbias.
2.We also need a good ratio testing data points.As, we have seen that lower data points can lead to
varianceerrorwhiletestingtheeffectivenessofmodel.
3.Weshoulditerateontrainingandtestingprocessmultipletimes.Weshouldchangethetrainandtest
datasetdistribution.Thishelpstovalidatethemodeleffectivenesswell.

Dowehaveamethodwhichtakescareofallthese3requirements?

http://www.analyticsvidhya.com/blog/2015/11/improvemodelperformancecrossvalidationinpythonr/

4/5

9/6/2016

ImproveYourModelPerformanceusingCrossValidation(forPythonUsers)

thequicksteps:
1.Randomlysplityourentiredatasetintokfolds.
2.Foreachkfoldsinyourdataset,buildyourmodelonk1foldsofthedataset.Then,testthemodelto
checktheeffectivenessforkthfold.
3.Recordtheerroryouseeoneachofthepredictions.
4.Repeatthisuntileachofthekfoldshasservedasthetestset.
5.Theaverageofyourkrecordederrorsiscalledthecrossvalidationerrorandwillserveasyour
performancemetricforthemodel.

Belowisthevisualizationofhowdoesakfoldvalidationworkfork=10.