How to get into the top 15 of a Kaggle competition using Python | Dataquest Blog

03 MAY 2016 in tutorials, python, data science, kaggle, and expedia
Kaggle competitions are a fantastic way to learn data science and build your portfolio. I personally used Kaggle to learn many data science concepts. I started out with Kaggle a few months after learning programming, and later won several competitions.

Doing well in a Kaggle competition requires more than just knowing machine learning algorithms. It requires the right mindset, the willingness to learn, and a lot of data exploration. Many of these aspects aren't typically emphasized in tutorials on getting started with Kaggle, though. In this post, I'll cover how to get started with the Kaggle Expedia hotel recommendations competition, including establishing the right mindset, setting up testing infrastructure, exploring the data, creating features, and making predictions.

At the end, we'll generate a submission file using the techniques in this post. As of this writing, the submission would rank in the top 15.
https://www.dataquest.io/blog/kaggletutorial/
[Screenshot: where this submission would rank as of this writing.]
The Expedia Kaggle competition

The Expedia competition challenges you with predicting what hotel a user will book based on some attributes about the search the user is conducting on Expedia. Before we dive into any coding, we'll need to put in time to understand both the problem and the data.
A quick glance at the columns

The first step is to look at the description of the columns of the dataset. You can find that here. Towards the bottom of the page, you'll see a description of each column in the data. Looking over this, it appears that we have quite a bit of data about the searches users are conducting on Expedia, along with data on what hotel cluster they eventually booked in test.csv and train.csv. destinations.csv contains information about the regions users search in for hotels. We won't worry about what we're predicting just yet; we'll focus on understanding the columns.
Expedia

Since the competition consists of event data from users booking hotels on Expedia, we'll need to spend some time understanding the Expedia site. Looking at the booking flow will help us contextualize the fields in the data, and how they tie into using Expedia.
[Screenshot: the page you initially see when booking a hotel.]
Exploring the Kaggle data in Python

Now that we have a handle on the data at a high level, we can do some exploration to take a deeper look.
Downloading the data

You can download the data here. The datasets are fairly large, so you'll need a good amount of disk space. You'll need to unzip the files to get raw .csv files instead of .csv.gz.
Exploring the data with Pandas

Given the amount of memory on your system, it may or may not be feasible to read all the data in. If it isn't, you should consider creating a machine on EC2 or DigitalOcean to process the data with. Here's a tutorial on how to get started with that.
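If reading the full files at once is a problem, pandas can also stream a csv in fixed-size chunks via the chunksize parameter of read_csv. Here's a minimal sketch; the count_rows_in_chunks helper is illustrative, not part of this post's code:

```python
import pandas as pd

def count_rows_in_chunks(path_or_buf, chunksize=1_000_000):
    """Stream a large csv through memory one chunk at a time."""
    total = 0
    for chunk in pd.read_csv(path_or_buf, chunksize=chunksize):
        # Each chunk is an ordinary DataFrame, so any per-chunk
        # work (value_counts, filtering, groupby) can happen here.
        total += len(chunk)
    return total
```

The same pattern lets you build aggregates without ever holding the whole file in memory.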
Once we download the data, we can read it in using Pandas:
import pandas as pd

destinations = pd.read_csv("destinations.csv")
test = pd.read_csv("test.csv")
train = pd.read_csv("train.csv")
Let's first look at how much data there is:
train.shape

(37670293, 24)
test.shape

(2528243, 22)

train.head(5)
             date_time  site_name  posa_continent  user_location_country  user_location_region
0  2014-08-11 07:46:59          …               …                     66                   348
1  2014-08-11 08:22:12          …               …                     66                   348
2  2014-08-11 08:24:33          …               …                     66                   348
3  2014-08-09 18:05:16          …               …                     66                   442
4  2014-08-09 18:08:18          …               …                     66                   442
There are a few things that immediately stick out:

- date_time could be useful in our predictions, so we'll need to convert it.
- Most of the columns are integers or floats, so we can't do a lot of feature engineering. For example, user_location_country isn't the name of a country, it's an integer. This makes it harder to create new features, because we don't know exactly what each value means.
test.head(5)

  id            date_time  site_name  posa_continent  user_location_country  user_location_region
0  …  2015-09-03 17:09:54          …               …                     66                   174
1  …  2015-09-24 17:38:35          …               …                     66                   174
2  …  2015-06-07 15:53:02          …               …                     66                   142
3  …  2015-09-14 14:49:10          …               …                     66                   258
4  …  2015-07-17 09:32:04          …               …                     66                   467
There are a few things we can take away from looking at test.csv:

- The dates in test.csv are later than the dates in train.csv, so test.csv contains more recent data.
- The user ids in test.csv look like they may be a subset of the user ids in train.csv, which we'll verify later.

Figuring out what to predict

What we're predicting

We'll be predicting which hotel_cluster a user will book after a given search. According to the description, there are 100 clusters in total.
How we'll be scored
The evaluation page says that we'll be scored using Mean Average Precision @ 5, which means that we'll need to make 5 cluster predictions for each row, and will be scored on whether or not the correct prediction appears in our list. If the correct prediction comes earlier in the list, we get more points.

For example, if the correct cluster is 3, and we predict 4, 43, 5, 3, 20, our score will be lower than if we predict 3, 4, 43, 5, 20. We should put predictions we're more certain about earlier in our list of predictions.
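To make the metric concrete, here's a small sketch of how Average Precision @ 5 behaves when each row has exactly one correct label, as in this competition. The function names are illustrative; we'll use the ml_metrics package for the real scoring later:

```python
def average_precision_at_5(correct, predicted):
    """With one relevant item, AP@5 is 1/rank if the correct cluster
    appears in the first five predictions, and 0 otherwise."""
    for rank, p in enumerate(predicted[:5], start=1):
        if p == correct:
            return 1.0 / rank
    return 0.0

def mean_average_precision_at_5(targets, predictions):
    """MAP@5 is just the mean of the per-row average precisions."""
    scores = [average_precision_at_5(t, p) for t, p in zip(targets, predictions)]
    return sum(scores) / len(scores)
```

With the example above, predicting the correct cluster in position 1 scores 1.0, while predicting it in position 4 scores only 0.25.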
Exploring hotel clusters

Now that we know what we're predicting, it's time to dive in and explore hotel_cluster. We can use the value_counts method on Series to do this:

train["hotel_cluster"].value_counts()
91    1043720
41     772743
48     754033
64     704734
65     670960
5      620194
...
53     134812
88     107784
27     105040
74      48355
The output above is truncated, but it shows that the number of hotels in each cluster is fairly evenly distributed. There doesn't appear to be any relationship between cluster number and the number of items.
Exploring train and test user ids

Finally, we'll confirm our hypothesis that all the test user ids are found in the train DataFrame. We can do this by finding the unique values for user_id in test, and seeing if they all exist in train.

In the code below, we'll:

- Create a set of all the unique test user ids.
- Create a set of all the unique train user ids.
- Figure out how many test user ids are in the train user ids.
- See if the count matches the total number of test user ids.
test_ids = set(test.user_id.unique())
train_ids = set(train.user_id.unique())
intersection_count = len(test_ids & train_ids)
intersection_count == len(test_ids)

True
Looks like our hypothesis is correct, which will make working with this data much easier!
Downsampling our Kaggle data

The entire train.csv dataset contains 37 million rows, which makes it hard to experiment with different techniques. Ideally, we want a small enough dataset that lets us quickly iterate through different approaches but is still representative of the whole training data.

We can do this by first randomly sampling rows from our data, then picking new training and testing datasets from the sample.
Add in times and dates

The first step is to add month and year fields to train. Because the train and test data is differentiated by date, we'll need to add date fields to allow us to segment our data into two sets the same way. If we add year and month fields, we can split our data into training and testing sets using them.

The code below will:

- Convert the date_time column in train from an object to a datetime value. This makes it easier to work with as a date.
- Extract the year and month from date_time, and assign them to their own columns.
train["date_time"] = pd.to_datetime(train["date_time"])
train["year"] = train["date_time"].dt.year
train["month"] = train["date_time"].dt.month
Pick 10000 users

Because the user ids in test are a subset of the user ids in train, we'll need to do our random sampling in a way that preserves the full data of each user. We can accomplish this by selecting a certain number of users randomly, then only picking rows from train where user_id is in our random sample of user ids.
import random

unique_users = train.user_id.unique()

sel_user_ids = [unique_users[i] for i in sorted(random.sample(range(len(unique_users)), 10000))]
sel_train = train[train.user_id.isin(sel_user_ids)]
Pick new training and testing sets

We'll now need to pick new training and testing sets from sel_train. We'll call these sets t1 and t2.
t1 = sel_train[((sel_train.year == 2013) | ((sel_train.year == 2014) & (sel_train.month < 8)))]
t2 = sel_train[((sel_train.year == 2014) & (sel_train.month >= 8))]
In the original train and test DataFrames, test contained data from 2015, and train contained data from 2013 and 2014. We split this data so that anything after July 2014 is in t2, and anything before is in t1. This gives us smaller training and testing sets with similar characteristics to train and test.
Remove click events

If is_booking is 0, it represents a click, and a 1 represents a booking. test contains only booking events, so we'll need to sample t2 to only contain bookings as well.

t2 = t2[t2.is_booking == True]
A simple algorithm

The simplest technique we could try on this data is to find the most common clusters across the data, then use them as predictions.
We can again use the value_counts method to help us here:

most_common_clusters = list(train.hotel_cluster.value_counts().head().index)
Generating predictions

We can turn most_common_clusters into a list of predictions by making the same prediction for each row.

predictions = [most_common_clusters for i in range(t2.shape[0])]

This will create a list with as many elements as there are rows in t2. Each element will be equal to most_common_clusters.
Evaluating error

In order to evaluate error, we'll first need to figure out how to compute Mean Average Precision. Luckily, Ben Hamner has written an implementation that can be found here. It can be installed as part of the ml_metrics package, and you can find installation instructions for it here.

import ml_metrics as metrics

target = [[l] for l in t2["hotel_cluster"]]
metrics.mapk(target, predictions, k=5)
0.058020770920711007

Our result here isn't great, but we've just generated our first set of predictions, and evaluated our error! The framework we've built will allow us to quickly test out a variety of techniques and see how they score. We're well on our way to building a good performing solution for the leaderboard.
Finding correlations

Before we move on to creating a better algorithm, let's see if anything correlates well with hotel_cluster. This will tell us if we should dive more into any particular columns.

We can find linear correlations in the training set using the corr method:

train.corr()["hotel_cluster"]
site_name                    0.022408
posa_continent               0.014938
user_location_country        0.010477
user_location_region         0.007453
user_location_city           0.000831
orig_destination_distance    0.007260
user_id                      0.001052
is_mobile                    0.008412
is_package                   0.038733
channel                      0.000707
...

No column correlates linearly with hotel_cluster; all of the correlation values are close to 0.
Unfortunately, this means that techniques like linear regression and logistic regression won't work well on our data, because they rely on linear correlations between predictors and targets.
Creating better predictions for our Kaggle entry

The data for this competition is quite difficult to make predictions on using machine learning for a few reasons:

- There are 37 million rows, which increases runtime and memory usage for algorithms.
- There are 100 different clusters, and according to the competition admins, the boundaries are fairly fuzzy, so it will likely be hard to make predictions. As the number of clusters increases, classifiers generally decrease in accuracy.
- Nothing is linearly correlated with the target (hotel_cluster), meaning we can't use fast machine learning techniques like linear regression.

For these reasons, machine learning probably won't work well on our data, but we can try an algorithm and find out.
Generating features

The first step in applying machine learning is to generate features. We can generate features using both what's available in the training data, and what's available in destinations. We haven't looked at destinations yet, so let's take a quick peek.
Generating features from destinations

destinations contains an id that corresponds to srch_destination_id, along with 149 columns of latent information about that destination. Here's a sample:
  srch_destination_id        d1        d2        d3        d4        d5        d6
0                   …  2.198657  2.198657  2.198657  2.198657  2.198657  1.897627
1                   …  2.181690  2.181690  2.181690  2.082564  2.181690  2.165028
2                   …  2.183490  2.224164  2.224164  2.189562  2.105819  2.075407
3                   …  2.177409  2.177409  2.177409  2.177409  2.177409  2.115485
4                   …  2.189562  2.187783  2.194008  2.171153  2.152303  2.056618
The competition doesn't tell us exactly what each latent feature is, but it's safe to assume that it's some combination of destination characteristics, like name, description, and more. These latent features were converted to numbers, so they could be anonymized.

We can use the destination information as features in a machine learning algorithm, but we'll need to compress the number of columns down first, to minimize runtime. We can use PCA to do this. PCA will reduce the number of columns in a matrix while trying to preserve the variance in the data. Ideally, PCA will compress all the information contained in all the columns into fewer columns, but in practice, some information is lost.
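Before committing to a small number of components, it's worth checking how much variance the compression actually keeps. scikit-learn exposes this through explained_variance_ratio_; the snippet below shows the idea on synthetic data, since we haven't transformed destinations yet:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the latent destination columns.
rng = np.random.RandomState(0)
data = rng.normal(size=(200, 10))

pca = PCA(n_components=3)
pca.fit(data)

# One entry per kept component; the sum is the fraction of the total
# variance that survives the compression.
retained = float(pca.explained_variance_ratio_.sum())
```

If the retained fraction is very low, it can be worth raising n_components and paying the extra runtime.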
In the code below, we:

- Initialize a PCA model using scikit-learn.
- Specify that we want to only have 3 columns in our data.
- Transform the columns d1-d149 into 3 columns.
from sklearn.decomposition import PCA

pca = PCA(n_components=3)
dest_small = pca.fit_transform(destinations[["d{0}".format(i + 1) for i in range(149)]])
dest_small = pd.DataFrame(dest_small)
dest_small["srch_destination_id"] = destinations["srch_destination_id"]
Generating features

Now that the preliminaries are done with, we can generate our features. We'll do the following:

- Generate new date features based on date_time, srch_ci, and srch_co.
- Remove non-numeric columns like date_time.
- Add in features from dest_small.
- Replace any missing values with -1.
def calc_fast_features(df):
    df["date_time"] = pd.to_datetime(df["date_time"])
    df["srch_ci"] = pd.to_datetime(df["srch_ci"], format='%Y-%m-%d', errors="coerce")
    df["srch_co"] = pd.to_datetime(df["srch_co"], format='%Y-%m-%d', errors="coerce")

    props = {}
    for prop in ["month", "day", "hour", "minute", "dayofweek", "quarter"]:
        props[prop] = getattr(df["date_time"].dt, prop)

    carryover = [p for p in df.columns if p not in ["date_time", "srch_ci", "srch_co"]]
    for prop in carryover:
        props[prop] = df[prop]

    date_props = ["month", "day", "dayofweek", "quarter"]
    for prop in date_props:
        props["ci_{0}".format(prop)] = getattr(df["srch_ci"].dt, prop)
        props["co_{0}".format(prop)] = getattr(df["srch_co"].dt, prop)
    props["stay_span"] = (df["srch_co"] - df["srch_ci"]).astype('timedelta64[h]')

    ret = pd.DataFrame(props)

    ret = ret.join(dest_small, on="srch_destination_id", how='left', rsuffix="dest")
    ret = ret.drop("srch_destination_iddest", axis=1)
    return ret
df = calc_fast_features(t1)
df.fillna(-1, inplace=True)
The above will calculate features such as length of stay, check in day, and check out month. These features will help us train a machine learning algorithm later on.

Replacing missing values with -1 isn't the best choice, but it will work fine for now, and we can always optimize the behavior later on.
Machine learning

Now that we have features for our training data, we can try machine learning. We'll use 3-fold cross validation across the training set to generate a reliable error estimate. Cross validation splits the training set up into 3 parts, then predicts hotel_cluster for each part using the other parts to train with.

We'll generate predictions using the Random Forest algorithm. Random forests build trees, which can fit to nonlinear tendencies in data. This will enable us to make predictions, even though none of our columns are linearly related.
We'll first initialize the model and compute cross validation scores:

predictors = [c for c in df.columns if c not in ["hotel_cluster"]]

from sklearn import cross_validation
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=10, min_weight_fraction_leaf=0.1)
scores = cross_validation.cross_val_score(clf, df[predictors], df['hotel_cluster'], cv=3)
scores

array([0.06203556, 0.06233452, 0.06392277])
The above code doesn't give us very good accuracy, and confirms our original suspicion that machine learning isn't a great approach to this problem. However, classifiers tend to have lower accuracy when there is a high cluster count. We can instead try training 100 binary classifiers. Each classifier will just determine if a row is in its cluster, or not. This will entail training one classifier per label in hotel_cluster.
Binary classifiers

We'll again train Random Forests, but each forest will predict only a single hotel cluster. We'll use 2-fold cross validation for speed, and only train 10 trees per label.

In the code below, we:

- Loop across each unique hotel_cluster.
- Train a Random Forest classifier using 2-fold cross validation.
- Extract the probabilities from the classifier that the row is in the unique hotel_cluster.
- Combine all the probabilities.
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import KFold
from itertools import chain

all_probs = []
unique_clusters = df["hotel_cluster"].unique()
for cluster in unique_clusters:
    df["target"] = 1
    df["target"][df["hotel_cluster"] != cluster] = 0
    predictors = [col for col in df if col not in ['hotel_cluster', "target"]]
    probs = []
    cv = KFold(len(df["target"]), n_folds=2)
    clf = RandomForestClassifier(n_estimators=10, min_weight_fraction_leaf=0.1)
    for i, (tr, te) in enumerate(cv):
        clf.fit(df[predictors].iloc[tr], df["target"].iloc[tr])
        preds = clf.predict_proba(df[predictors].iloc[te])
        probs.append([p[1] for p in preds])
    full_probs = chain.from_iterable(probs)
    all_probs.append(list(full_probs))

prediction_frame = pd.DataFrame(all_probs).T
prediction_frame.columns = unique_clusters

def find_top_5(row):
    return list(row.nlargest(5).index)

preds = []
for index, row in prediction_frame.iterrows():
    preds.append(find_top_5(row))

metrics.mapk([[l] for l in t2["hotel_cluster"]], preds, k=5)
0.041083333333333326

Our accuracy here is worse than before, and people on the leaderboard have much better accuracy scores. We'll need to abandon machine learning and move to the next technique in order to compete. Machine learning can be a powerful technique, but it isn't always the right approach to every problem.
Top clusters based on hotel_cluster

There are a few Kaggle Scripts for the competition that involve aggregating hotel_cluster based on orig_destination_distance, or srch_destination_id. Aggregating on orig_destination_distance will exploit a data leak in the competition, and attempt to match the same user together. Aggregating on srch_destination_id will find the most popular hotel clusters for each destination. We'll then be able to predict that a user who searches for a destination is going to one of the most popular hotel clusters for that destination. Think of this as a more granular version of the most common clusters technique we used earlier.
We can first generate scores for each hotel_cluster in each srch_destination_id. We'll weight bookings higher than clicks. This is because the test data is all booking data, and this is what we want to predict. We want to include click information, but downweight it to reflect this. Step by step, we'll:

- Group t1 by srch_destination_id, and hotel_cluster.
- Iterate through each group, and:
  - Assign 1 point to each hotel cluster where is_booking is True.
  - Assign .15 points to each hotel cluster where is_booking is False.
  - Assign the score to the srch_destination_id / hotel_cluster combination in a dictionary.

Here's the code to accomplish the above steps:
def make_key(items):
    return "_".join([str(i) for i in items])

match_cols = ["srch_destination_id"]
cluster_cols = match_cols + ['hotel_cluster']
groups = t1.groupby(cluster_cols)
top_clusters = {}
for name, group in groups:
    clicks = len(group.is_booking[group.is_booking == False])
    bookings = len(group.is_booking[group.is_booking == True])

    score = bookings + .15 * clicks

    clus_name = make_key(name[:len(match_cols)])
    if clus_name not in top_clusters:
        top_clusters[clus_name] = {}
    top_clusters[clus_name][name[1]] = score

At the end, top_clusters will look something like this (truncated):

{'39331': {20: 1.15, 30: 0.15, 81: 0.3},
 '511': {17: 0.15, 34: 0.15, 55: 0.15, 70: 0.15}}
Next, we'll find the top 5 hotel clusters for each srch_destination_id. Here's the code:

import operator

cluster_dict = {}
for n in top_clusters:
    tc = top_clusters[n]
    top = [l[0] for l in sorted(tc.items(), key=operator.itemgetter(1), reverse=True)[:5]]
    cluster_dict[n] = top
Making predictions based on destination

Once we know the top clusters for each srch_destination_id, we can quickly make predictions. To make predictions, all we have to do is:

- Iterate through each row in t2.
- Extract the srch_destination_id for the row.
- Find the top clusters for that destination id.
- Append the top clusters to preds.

Here's the code:
preds = []
for index, row in t2.iterrows():
    key = make_key([row[m] for m in match_cols])
    if key in cluster_dict:
        preds.append(cluster_dict[key])
    else:
        preds.append([])
At the end, preds is a list of lists containing our predictions, like this:

[
    [2, 25, 28, 10, 64],
    [25, 78, 64, 90, 60],
    ...
]
Calculating error

Once we have our predictions, we can compute our accuracy using the mapk function from earlier:

metrics.mapk([[l] for l in t2["hotel_cluster"]], preds, k=5)

0.22388136288998359
We're doing pretty well! We more than tripled our accuracy over the best machine learning approach, and we did it with a far faster and simpler approach.
You may have noticed that this value is quite a bit lower than accuracies on the leaderboard. Local testing results in a lower accuracy value than submitting, so this approach will actually do fairly well on the leaderboard. Differences in leaderboard score and local score can come down to a few factors:

- Different data locally and in the hidden set that leaderboard scores are computed on. For example, we're computing error in a sample of the training set, and the leaderboard score is computed on the testing set.
- Techniques that result in higher accuracy with more training data. We're only using a small subset of data for training, and it may be more accurate when we use the full training set.
- Different randomization. With certain algorithms, random numbers are involved, but we're not using any of these.
Generating better predictions for your Kaggle submission

The forums are very important in Kaggle, and can often help you find nuggets of information that will let you boost your score. The Expedia competition is no exception. This post details a data leak that allows you to match users in the training set from the testing set using a set of columns including user_location_country, and user_location_region.

We'll use the information from the post to match users from the testing set back to the training set, which will boost our score. Based on the forum thread, it's okay to do this, and the competition won't be updated as a result of the leak.
Finding matching users

The first step is to find users in the training set that match users in the testing set.

In order to do this, we need to:

- Split the training data into groups based on the match columns.
- Loop through the testing data.
- Create an index based on the match columns.
- Get any matches between the testing data and the training data using the groups.

Here's the code to accomplish this:
match_cols = ['user_location_country', 'user_location_region', 'user_location_city', 'hotel_market', 'orig_destination_distance']

groups = t1.groupby(match_cols)
def generate_exact_matches(row, match_cols):
    index = tuple([row[t] for t in match_cols])
    try:
        group = groups.get_group(index)
    except Exception:
        return []
    clus = list(set(group.hotel_cluster))
    return clus

exact_matches = []
for i in range(t2.shape[0]):
    exact_matches.append(generate_exact_matches(t2.iloc[i], match_cols))
At the end of this loop, we'll have a list of lists that contain any exact matches between the training and the testing sets. However, there aren't that many matches. To accurately evaluate error, we'll have to combine these predictions with our earlier predictions. Otherwise, we'll get a very low accuracy value, because most rows have empty lists for predictions.
Combining predictions

We can combine different lists of predictions to boost accuracy. Doing so will also help us see how good our exact match strategy is. To do this, we'll have to:

- Combine exact_matches, preds, and most_common_clusters.
- Only take the unique predictions, in sequential order, using the f5 function from here.
- Ensure we have a maximum of 5 predictions for each row in the testing set.

Here's how we can do it:
def f5(seq, idfun=None):
    if idfun is None:
        def idfun(x): return x
    seen = {}
    result = []
    for item in seq:
        marker = idfun(item)
        if marker in seen: continue
        seen[marker] = 1
        result.append(item)
    return result
full_preds = [f5(exact_matches[p] + preds[p] + most_common_clusters)[:5] for p in range(len(preds))]
metrics.mapk([[l] for l in t2["hotel_cluster"]], full_preds, k=5)

0.28400041050903119
This is looking quite good in terms of error. We improved dramatically from earlier! We could keep going and make more small improvements, but we're probably ready to submit now.
Making a Kaggle submission file

Luckily, because of the way we wrote the code, all we have to do to submit is assign train to the variable t1, and test to the variable t2. Then, we just have to rerun the code to make predictions. Rerunning the code over the train and test sets should take less than an hour.

Once we have predictions, we just have to write them to a file:

write_p = [" ".join([str(l) for l in p]) for p in full_preds]
write_frame = ["{0},{1}".format(t2["id"][i], write_p[i]) for i in range(len(write_p))]
write_frame = ["id,hotel_clusters"] + write_frame

with open("predictions.csv", "w+") as f:
    f.write("\n".join(write_frame))

We'll then have a submission file in the right format to submit. As of this writing, making this submission will get you into the top 15.
Summary

We came a long way in this post! We went from just looking at the data all the way to creating a submission and getting onto the leaderboard. Along the way, some of the key steps we took were:

- Exploring the data and understanding the problem.
- Setting up a way to iterate quickly through different techniques.
- Creating a way to figure out accuracy locally.
- Reading the forums, scripts, and the descriptions of the contest very closely to better understand the structure of the data.
- Trying a variety of techniques and not being afraid to not use machine learning.

These steps will serve you well in any Kaggle competition.
Further improvements

In order to iterate quickly and explore techniques, speed is key. This is difficult with this competition, but there are a few strategies to try:

- Sampling down the data even more.
- Parallelizing operations across multiple cores.
- Using Spark or other tools where tasks can be run on parallel workers.
- Exploring various ways to write code and benchmarking to find the most efficient approach.
- Avoiding iterating over the full training and testing sets, and instead using groups.

Writing fast, efficient code is a huge advantage in this competition.
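As an example of the last two points, the per-group scoring loop we wrote earlier can be expressed as a single vectorized groupby aggregation. The tiny DataFrame below is made up for illustration, but the column names and the bookings + .15 * clicks scoring match what we used:

```python
import pandas as pd

# Made-up rows standing in for t1.
df = pd.DataFrame({
    "srch_destination_id": [1, 1, 1, 2, 2],
    "hotel_cluster":       [10, 10, 20, 10, 30],
    "is_booking":          [1, 0, 1, 0, 1],
})

# One pass over the data: sum gives bookings, count - sum gives clicks,
# for every (srch_destination_id, hotel_cluster) pair at once.
grouped = df.groupby(["srch_destination_id", "hotel_cluster"])["is_booking"]
scores = grouped.agg(["sum", "count"])
scores["score"] = scores["sum"] + .15 * (scores["count"] - scores["sum"])
```

This produces the same scores as the Python-level loop, but pandas does the grouping in compiled code, which matters at 37 million rows.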
Future techniques to try

Once you have a stable foundation on which to run your code, there are a few avenues to explore in terms of techniques to boost accuracy:

- Finding similarity between users, then adjusting hotel cluster scores based on similarity.
- Using similarity between destinations to group multiple destinations together.
- Applying machine learning within subsets of the data.
- Combining different prediction strategies in a less naive way.
- Exploring the link between hotel clusters and regions more.

I hope you have fun with this competition! I'd love to hear any feedback you have. If you want to learn more before diving into the competition, check out our courses on Dataquest to learn about data manipulation, statistics, machine learning, how to work with Spark, and more.