
How to get into the top 15 of a Kaggle competition using Python

03 MAY 2016 in tutorials, python, data science, kaggle, and expedia
Kaggle competitions are a fantastic way to learn data science and build your portfolio. I personally used Kaggle to learn many data science concepts. I started out with Kaggle a few months after learning programming, and later won several competitions.

Doing well in a Kaggle competition requires more than just knowing machine learning algorithms. It requires the right mindset, the willingness to learn, and a lot of data exploration. Many of these aspects aren't typically emphasized in tutorials on getting started with Kaggle, though. In this post, I'll cover how to get started with the Kaggle Expedia hotel recommendations competition, including establishing the right mindset, setting up testing infrastructure, exploring the data, creating features, and making predictions.
At the end, we'll generate a submission file using the techniques in this post. As of this writing, the submission would rank in the top 15%.

Where this submission would rank as of this writing.

The Expedia Kaggle competition

The Expedia competition challenges you with predicting what hotel a user will book based on some attributes about the search the user is conducting on Expedia. Before we dive into any coding, we'll need to put in time to understand both the problem and the data.

A quick glance at the columns

The first step is to look at the description of the columns of the dataset. You can find that here. Towards the bottom of the page, you'll see a description of each column in the data. Looking over this, it appears that we have quite a bit of data about the searches users are conducting on Expedia, along with data on what hotel cluster they eventually booked in test.csv and train.csv. destinations.csv contains information about the regions users search in for hotels. We won't worry about what we're predicting just yet; we'll focus on understanding the columns.

Expedia

Since the competition consists of event data from users booking hotels on Expedia, we'll need to spend some time understanding the Expedia site. Looking at the booking flow will help us contextualize the fields in the data, and how they tie into using Expedia.


The page you initially see when booking a hotel.

The box labelled "Going To" maps to the srch_destination_type_id, hotel_continent, hotel_country, and hotel_market fields in the data.
The box labelled "Check-in" maps to the srch_ci field in the data, and the box labelled "Check-out" maps to the srch_co field in the data.
The box labelled "Guests" maps to the srch_adults_cnt, srch_children_cnt, and srch_rm_cnt fields in the data.
The box labelled "Add a Flight" maps to the is_package field in the data.
site_name is the name of the site you visited, whether it be the main Expedia.com site, or another.
user_location_country, user_location_region, user_location_city, is_mobile, channel, is_booking, and cnt are all attributes that are determined by where the user is, what their device is, or their session on the Expedia site.

Just by looking at one screen, we can immediately contextualize all the variables. Playing around with the screen, filling in values, and going through the booking process can help further contextualize.

Exploring the Kaggle data in Python

Now that we have a handle on the data at a high level, we can do some exploration to take a deeper look.

Downloading the data

You can download the data here. The datasets are fairly large, so you'll need a good amount of disk space. You'll need to unzip the files to get raw .csv files instead of .csv.gz.

Exploring the data with Pandas

Given the amount of memory on your system, it may or may not be feasible to read all the data in. If it isn't, you should consider creating a machine on EC2 or DigitalOcean to process the data with. Here's a tutorial on how to get started with that.
Once we download the data, we can read it in using Pandas:

import pandas as pd

destinations = pd.read_csv("destinations.csv")
test = pd.read_csv("test.csv")
train = pd.read_csv("train.csv")
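If even a larger machine struggles to hold all of this, another option (a rough sketch; the column list here is just an example) is to read train.csv in chunks and keep only the columns you need:

# Read a million rows at a time, keep a few columns, then combine the pieces.
chunks = pd.read_csv("train.csv", chunksize=1000000,
                     usecols=["date_time", "user_id", "srch_destination_id",
                              "is_booking", "hotel_cluster"])
train_small = pd.concat(chunks)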

Let's first look at how much data there is:

train.shape

(37670293, 24)

test.shape

(2528243, 22)

We have about 37 million training set rows, and 2 million testing set rows, which will make this problem a bit challenging to work with.

We can explore the first few rows of the data:

train.head(5)

              date_time  ...  user_location_country  user_location_region  ...
0   2014-08-11 07:46:59  ...                     66                   348  ...
1   2014-08-11 08:22:12  ...                     66                   348  ...
2   2014-08-11 08:24:33  ...                     66                   348  ...
3   2014-08-09 18:05:16  ...                     66                   442  ...
4   2014-08-09 18:08:18  ...                     66                   442  ...

There are a few things that immediately stick out:

date_time could be useful in our predictions, so we'll need to convert it.
Most of the columns are integers or floats, so we can't do a lot of feature engineering. For example, user_location_country isn't the name of a country, it's an integer. This makes it harder to create new features, because we don't know exactly what each value means.


test.head(5)

   id            date_time  ...  user_location_country  user_location_region  ...
0   0  2015-09-03 17:09:54  ...                     66                   174  ...
1   1  2015-09-24 17:38:35  ...                     66                   174  ...
2   2  2015-06-07 15:53:02  ...                     66                   142  ...
3   3  2015-09-14 14:49:10  ...                     66                   258  ...
4   4  2015-07-17 09:32:04  ...                     66                   467  ...

There are a few things we can take away from looking at test.csv:

It looks like all the dates in test.csv are later than the dates in train.csv, and the data page confirms this. The testing set contains dates from 2015, and the training set contains dates from 2013 and 2014 (a quick check is sketched after this list).
It looks like the user ids in test.csv are a subset of the user ids in train.csv, given the overlapping integer ranges. We can confirm this later on.
test.csv contains only booking events (the equivalent of is_booking always being 1), and the data page confirms this.
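A quick way to sanity-check the date observation locally (a small sketch; date_time is still a string at this point, but ISO-formatted strings sort chronologically):

print(train["date_time"].min(), train["date_time"].max())
print(test["date_time"].min(), test["date_time"].max())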

Figuring out what to predict

What we're predicting

We'll be predicting which hotel_cluster a user will book after a given search. According to the description, there are 100 clusters in total.

How we'll be scored

The evaluation page says that we'll be scored using Mean Average Precision @ 5, which means that we'll need to make 5 cluster predictions for each row, and will be scored on whether or not the correct prediction appears in our list. If the correct prediction comes earlier in the list, we get more points.

For example, if the correct cluster appears first in our list of 5 predictions, our score for that row will be higher than if it appears last. We should put predictions we're more certain about earlier in our list of predictions.
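To make the scoring concrete, here's a toy version of average precision at 5 for a single row (not the official scorer, and the cluster numbers are arbitrary):

def apk_single(correct, predicted, k=5):
    # With a single correct cluster, AP@k is 1 / (rank of the correct cluster),
    # or 0 if it doesn't appear in the first k predictions.
    for i, p in enumerate(predicted[:k]):
        if p == correct:
            return 1.0 / (i + 1)
    return 0.0

print(apk_single(42, [42, 7, 13, 91, 5]))  # 1.0: correct cluster listed first
print(apk_single(42, [7, 13, 91, 5, 42]))  # 0.2: correct cluster listed last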

Exploring hotel clusters

Now that we know what we're predicting, it's time to dive in and explore hotel_cluster. We can use the value_counts method on Series to do this:

train["hotel_cluster"].value_counts()

91    1043720
41     772743
48     754033
64     704734
65     670960
5      620194
       ...
53     134812
88     107784
27     105040
74      48355

The output above is truncated, but it shows that the number of hotels in each cluster is fairly evenly distributed. There doesn't appear to be any relationship between cluster number and the number of items.


Exploring train and test user ids

Finally, we'll confirm our hypothesis that all the test user ids are found in the train DataFrame. We can do this by finding the unique values for user_id in test, and seeing if they all exist in train.

In the code below, we'll:

Create a set of all the unique test user ids.
Create a set of all the unique train user ids.
Figure out how many test user ids are in the train user ids.
See if the count matches the total number of test user ids.

test_ids = set(test.user_id.unique())
train_ids = set(train.user_id.unique())
intersection_count = len(test_ids & train_ids)
intersection_count == len(test_ids)

True

Looks like our hypothesis is correct, which will make working with this data much easier!

Downsampling our Kaggle data

The entire train.csv dataset contains 37 million rows, which makes it hard to experiment with different techniques. Ideally, we want a small enough dataset that lets us quickly iterate through different approaches but is still representative of the whole training data.

We can do this by first randomly sampling rows from our data, then selecting new training and testing datasets from train.csv.
By selecting both sets from train.csv, we'll have the true hotel_cluster label for every row, and we'll be able to calculate our accuracy as we test techniques.

Add in times and dates

The first step is to add month and year fields to train. Because the train and test data is differentiated by date, we'll need to add date fields to allow us to segment our data into two sets the same way. If we add year and month fields, we can split our data into training and testing sets using them.

The code below will:

Convert the date_time column in train from an object to a datetime value. This makes it easier to work with as a date.
Extract the year and month from date_time, and assign them to their own columns.

train["date_time"]=pd.to_datetime(train["date_time"
train["year"]=train["date_time"].dt.year
train["month"]=train["date_time"].dt.month

Pick 10,000 users

Because the user ids in test are a subset of the user ids in train, we'll need to do our random sampling in a way that preserves the full data of each user. We can accomplish this by selecting a certain number of users randomly, then only picking rows from train where user_id is in our random sample of user ids.

import random

unique_users = train.user_id.unique()

sel_user_ids = [unique_users[i] for i in sorted(random.sample(range(len(unique_users)), 10000))]

sel_train = train[train.user_id.isin(sel_user_ids)]

The above code creates a DataFrame called sel_train that only contains data from 10,000 users.

Pick new training and testing sets

We'll now need to pick new training and testing sets from sel_train. We'll call these sets t1 and t2.

t1 = sel_train[((sel_train.year == 2013) | ((sel_train.year == 2014) & (sel_train.month < 8)))]
t2 = sel_train[((sel_train.year == 2014) & (sel_train.month >= 8))]

In the original train and test DataFrames, test contained data from 2015, and train contained data from 2013 and 2014. We split this data so that anything after July 2014 is in t2, and anything before is in t1. This gives us smaller training and testing sets with similar characteristics to train and test.

Remove click events

If is_booking is 0, it represents a click, and a 1 represents a booking. test contains only booking events, so we'll need to sample t2 to only contain bookings as well.

t2 = t2[t2.is_booking == True]

A simple algorithm

The simplest technique we could try on this data is to find the most common clusters across the data, then use them as predictions.


We can again use the value_counts method to help us here:

most_common_clusters = list(train.hotel_cluster.value_counts().head().index)

The above code will give us a list of the 5 most common clusters in train. This is because the head method returns the first 5 rows by default, and the index property will return the index of the DataFrame, which is the hotel cluster after running the value_counts method.
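Reading the cluster numbers off the top of the value_counts output earlier, this works out to:

print(most_common_clusters)
# [91, 41, 48, 64, 65]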

Generating predictions

We can turn most_common_clusters into a list of predictions by making the same prediction for each row.

predictions = [most_common_clusters for i in range(t2.shape[0])]

This will create a list with as many elements as there are rows in t2. Each element will be equal to most_common_clusters.

Evaluating error

In order to evaluate error, we'll first need to figure out how to compute Mean Average Precision. Luckily, Ben Hamner has written an implementation that can be found here. It can be installed as part of the ml_metrics package, and you can find installation instructions for how to install it here.

We can compute our error metric with the mapk method in ml_metrics:

import ml_metrics as metrics

target = [[l] for l in t2["hotel_cluster"]]
metrics.mapk(target, predictions, k=5)


0.058020770920711007

Our target needs to be in a list of lists format for mapk to work, so we convert the hotel_cluster column of t2 into a list of lists. Then, we call the mapk method with our target, our predictions, and the number of predictions we want to evaluate (5).

Our result here isn't great, but we've just generated our first set of predictions, and evaluated our error! The framework we've built will allow us to quickly test out a variety of techniques and see how they score. We're well on our way to building a good performing solution for the leaderboard.
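One small way to support that quick iteration is to wrap the scoring step in a helper, so every technique we try is evaluated identically. A minimal sketch (the function name here is ours, not from the original code):

def score_predictions(preds, actual):
    # Score a list-of-lists of predictions against the true hotel_cluster labels.
    target = [[l] for l in actual["hotel_cluster"]]
    return metrics.mapk(target, preds, k=5)

score_predictions(predictions, t2)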

Finding correlations

Before we move on to creating a better algorithm, let's see if anything correlates well with hotel_cluster. This will tell us if we should dive more into any particular columns.

We can find linear correlations in the training set using the corr method:

train.corr()["hotel_cluster"]

site_name                    0.022408
posa_continent               0.014938
user_location_country        0.010477
user_location_region         0.007453
user_location_city           0.000831
orig_destination_distance    0.007260
user_id                      0.001052
is_mobile                    0.008412
is_package                   0.038733
channel                      0.000707

This tells us that no columns correlate linearly with hotel_cluster. This makes sense, because there is no linear ordering to hotel_cluster.
For example, having a higher cluster number isn't tied to having a higher srch_destination_id.

Unfortunately, this means that techniques like linear regression and logistic regression won't work well on our data, because they rely on linear correlations between predictors and targets.

Creating better predictions for our Kaggle entry

The data for this competition is quite difficult to make predictions on using machine learning for a few reasons:

There are millions of rows, which increases runtime and memory usage for algorithms.
There are 100 different clusters, and according to the competition admins, the boundaries are fairly fuzzy, so it will likely be hard to make predictions. As the number of clusters increases, classifiers generally decrease in accuracy.
Nothing is linearly correlated with the target (hotel_cluster), meaning we can't use fast machine learning techniques like linear regression.

For these reasons, machine learning probably won't work well on our data, but we can try an algorithm and find out.

Generating features

The first step in applying machine learning is to generate features. We can generate features using both what's available in the training data, and what's available in destinations. We haven't looked at destinations yet, so let's take a quick peek.

Generating features from destinations

Destinations contains an id that corresponds to srch_destination_id, along with 149 columns of latent information about that destination. Here's a sample:

srch_destination_id        d1        d2        d3        d4        d5        d6
                ...  2.198657  2.198657  2.198657  2.198657  2.198657  1.897627
                ...  2.181690  2.181690  2.181690  2.082564  2.181690  2.165028
                ...  2.183490  2.224164  2.224164  2.189562  2.105819  2.075407
                ...  2.177409  2.177409  2.177409  2.177409  2.177409  2.115485
                ...  2.189562  2.187783  2.194008  2.171153  2.152303  2.056618

The competition doesn't tell us exactly what each latent feature is, but it's safe to assume that it's some combination of destination characteristics, like name, description, and more. These latent features were converted to numbers, so they could be anonymized.

We can use the destination information as features in a machine learning algorithm, but we'll need to compress the number of columns down first, to minimize runtime. We can use PCA to do this. PCA will reduce the number of columns in a matrix while trying to preserve the same amount of variance per row. Ideally, PCA would compress all the information contained in all the columns into fewer columns, but in practice, some information is lost.
In the code below, we:

Initialize a PCA model using scikit-learn.
Specify that we only want to have 3 columns in our data.
Transform the columns d1-d149 into 3 columns.


from sklearn.decomposition import PCA

pca = PCA(n_components=3)
dest_small = pca.fit_transform(destinations[["d{0}".format(i + 1) for i in range(149)]])
dest_small = pd.DataFrame(dest_small)
dest_small["srch_destination_id"] = destinations["srch_destination_id"]

The above code compresses the 149 columns in destinations down to 3 columns, and creates a new DataFrame called dest_small. We preserve most of the variance in destinations while doing this, so we don't lose a lot of information, but we save a lot of runtime for a machine learning algorithm.
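If you want to verify that claim, scikit-learn exposes how much variance each component captures after fitting; a quick check (not part of the original walkthrough):

print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())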

Generating features

Now that the preliminaries are done with, we can generate our features. We'll do the following:

Generate new date features based on date_time, srch_ci, and srch_co.
Remove non-numeric columns like date_time.
Add in features from dest_small.
Replace any missing values with -1.

def calc_fast_features(df):
    df["date_time"] = pd.to_datetime(df["date_time"])
    df["srch_ci"] = pd.to_datetime(df["srch_ci"], format='%Y-%m-%d', errors="coerce")
    df["srch_co"] = pd.to_datetime(df["srch_co"], format='%Y-%m-%d', errors="coerce")

    # Numeric features derived from the search timestamp.
    props = {}
    for prop in ["month", "day", "hour", "minute", "dayofweek", "quarter"]:
        props[prop] = getattr(df["date_time"].dt, prop)

    # Carry over every column except the raw date columns.
    carryover = [p for p in df.columns if p not in ["date_time", "srch_ci", "srch_co"]]
    for prop in carryover:
        props[prop] = df[prop]

    # Check-in / check-out date features and length of stay.
    date_props = ["month", "day", "dayofweek", "quarter"]
    for prop in date_props:
        props["ci_{0}".format(prop)] = getattr(df["srch_ci"].dt, prop)
        props["co_{0}".format(prop)] = getattr(df["srch_co"].dt, prop)
    props["stay_span"] = (df["srch_co"] - df["srch_ci"]).astype('timedelta64[h]')

    ret = pd.DataFrame(props)

    # Join the compressed destination features.
    ret = ret.join(dest_small, on="srch_destination_id", how='left', rsuffix="dest")
    ret = ret.drop("srch_destination_iddest", axis=1)
    return ret

df = calc_fast_features(t1)
df.fillna(-1, inplace=True)

The above will calculate features such as length of stay, check-in day, and check-out month. These features will help us train a machine learning algorithm later on.

Replacing missing values with -1 isn't the best choice, but it will work fine for now, and we can always optimize the behavior later on.
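If we do revisit it later, one possible tweak (a sketch, not what the rest of this post uses) is to fill each column with its median instead of a constant:

# Alternative sketch: instead of df.fillna(-1, inplace=True) above,
# fill missing values with each column's median.
df = calc_fast_features(t1)
df = df.fillna(df.median())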

Machine learning

Now that we have features for our training data, we can try machine learning. We'll use 3-fold cross-validation across the training set to generate a reliable error estimate. Cross-validation splits the training set up into 3 parts, then predicts hotel_cluster for each part using the other parts to train with.

We'll generate predictions using the Random Forest algorithm. Random forests build trees, which can fit to nonlinear tendencies in data. This will enable us to make predictions, even though none of our columns are linearly related.
We'll first initialize the model and compute cross-validation scores:

predictors = [c for c in df.columns if c not in ["hotel_cluster"]]

from sklearn import cross_validation
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=10, min_weight_fraction_leaf=0.1)
scores = cross_validation.cross_val_score(clf, df[predictors], df['hotel_cluster'], cv=3)
scores

array([0.06203556, 0.06233452, 0.06392277])

The above code doesn't give us very good accuracy, and confirms our original suspicion that machine learning isn't a great approach to this problem. However, classifiers tend to have lower accuracy when there is a high cluster count. We can instead try training 100 binary classifiers. Each classifier will just determine if a row is in its cluster, or not. This will entail training one classifier per label in hotel_cluster.

Binary classifiers

We'll again train Random Forests, but each forest will predict only a single hotel cluster. We'll use 2-fold cross-validation for speed, and only train 10 trees per label.

In the code below, we:

Loop across each unique hotel_cluster.
Train a Random Forest classifier using 2-fold cross-validation.
Extract the probabilities from the classifier that the row is in the unique hotel_cluster.
Combine all the probabilities.


For each row, find the 5 largest probabilities, and assign those hotel_cluster values as predictions.
Compute accuracy using mapk.

from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import KFold
from itertools import chain

all_probs = []
unique_clusters = df["hotel_cluster"].unique()
for cluster in unique_clusters:
    # Binary target: is this row in the current cluster?
    df["target"] = 1
    df["target"][df["hotel_cluster"] != cluster] = 0
    predictors = [col for col in df if col not in ['hotel_cluster', "target"]]
    probs = []
    cv = KFold(len(df["target"]), n_folds=2)
    clf = RandomForestClassifier(n_estimators=10, min_weight_fraction_leaf=0.1)
    for i, (tr, te) in enumerate(cv):
        clf.fit(df[predictors].iloc[tr], df["target"].iloc[tr])
        preds = clf.predict_proba(df[predictors].iloc[te])
        probs.append([p[1] for p in preds])
    full_probs = chain.from_iterable(probs)
    all_probs.append(list(full_probs))

prediction_frame = pd.DataFrame(all_probs).T
prediction_frame.columns = unique_clusters

def find_top_5(row):
    return list(row.nlargest(5).index)

preds = []
for index, row in prediction_frame.iterrows():
    preds.append(find_top_5(row))

metrics.mapk([[l] for l in t2["hotel_cluster"]], preds, k=5)

0.041083333333333326

Our accuracy here is worse than before, and people on the leaderboard have much better accuracy scores. We'll need to abandon machine learning and move to the next technique in order to compete. Machine learning can be a powerful technique, but it isn't always the right approach to every problem.

Top clusters based on hotel_cluster

There are a few Kaggle scripts for the competition that involve aggregating hotel_cluster based on orig_destination_distance, or srch_destination_id. Aggregating on orig_destination_distance will exploit a data leak in the competition, and attempt to match the same user together. Aggregating on srch_destination_id will find the most popular hotel clusters for each destination. We'll then be able to predict that a user who searches for a destination is going to one of the most popular hotel clusters for that destination. Think of this as a more granular version of the most common clusters technique we used earlier.
We can first generate scores for each hotel_cluster in each srch_destination_id. We'll weight bookings higher than clicks. This is because the test data is all booking data, and this is what we want to predict. We want to include click information, but downweight it to reflect this. Step by step, we'll:
Group t1 by srch_destination_id, and hotel_cluster.
Iterate through each group, and:
    Assign 1 point to each hotel cluster where is_booking is True.
    Assign 0.15 points to each hotel cluster where is_booking is False.
Assign the score to the srch_destination_id / hotel_cluster combination in a dictionary.

Here's the code to accomplish the above steps:

def make_key(items):
    return "_".join([str(i) for i in items])

match_cols = ["srch_destination_id"]
cluster_cols = match_cols + ['hotel_cluster']
groups = t1.groupby(cluster_cols)
top_clusters = {}
for name, group in groups:
    clicks = len(group.is_booking[group.is_booking == False])
    bookings = len(group.is_booking[group.is_booking == True])

    # Weight bookings a full point and clicks much less.
    score = bookings + .15 * clicks

    clus_name = make_key(name[:len(match_cols)])
    if clus_name not in top_clusters:
        top_clusters[clus_name] = {}
    top_clusters[clus_name][name[1]] = score

At the end, we'll have a dictionary where each key is an srch_destination_id. Each value in the dictionary will be another dictionary, containing hotel clusters as keys with scores as values. Here's how it looks:

{'39331': {20: 1.15, 30: 0.15, 81: 0.3},
 '511': {17: 0.15, 34: 0.15, 55: 0.15, 70: 0.15}}

We'll next want to transform this dictionary to find the top hotel clusters for each srch_destination_id. In order to do this, we'll:

Loop through each key in top_clusters.
Find the top 5 clusters for that key.
Assign the top 5 clusters to a new dictionary, cluster_dict.


Here's the code:

import operator

cluster_dict = {}
for n in top_clusters:
    tc = top_clusters[n]
    top = [l[0] for l in sorted(tc.items(), key=operator.itemgetter(1), reverse=True)[:5]]
    cluster_dict[n] = top

Making predictions based on destination

Once we know the top clusters for each srch_destination_id, we can quickly make predictions. To make predictions, all we have to do is:

Iterate through each row in t2.
Extract the srch_destination_id for the row.
Find the top clusters for that destination id.
Append the top clusters to preds.

Here's the code:

preds = []
for index, row in t2.iterrows():
    key = make_key([row[m] for m in match_cols])
    if key in cluster_dict:
        preds.append(cluster_dict[key])
    else:
        preds.append([])

At the end of the loop, preds will be a list of lists containing our predictions. It will look like this:

[
    [2, 25, 28, 10, 64],
    [25, 78, 64, 90, 60],
    ...
]

Calculating error

Once we have our predictions, we can compute our accuracy using the mapk function from earlier:

metrics.mapk([[l] for l in t2["hotel_cluster"]], preds, k=5)

0.22388136288998359

We're doing pretty well! We more than tripled our accuracy over the best machine learning approach, and we did it with a far faster and simpler approach.

You may have noticed that this value is quite a bit lower than accuracies on the leaderboard. Local testing results in a lower accuracy value than submitting, so this approach will actually do fairly well on the leaderboard. Differences in leaderboard score and local score can come down to a few factors:
Different data locally and in the hidden set that leaderboard scores are computed on. For example, we're computing error in a sample of the training set, and the leaderboard score is computed on the testing set.
Techniques that result in higher accuracy with more training data. We're only using a small subset of data for training, and it may be more accurate when we use the full training set.
Different randomization. With certain algorithms, random numbers are involved, but we're not using any of these.


Generating better predictions for your Kaggle submission

The forums are very important in Kaggle, and can often help you find nuggets of information that will let you boost your score. The Expedia competition is no exception. This post details a data leak that allows you to match users in the training set from the testing set using a set of columns including user_location_country and user_location_region.

We'll use the information from the post to match users from the testing set back to the training set, which will boost our score. Based on the forum thread, it's okay to do this, and the competition won't be updated as a result of the leak.

Finding matching users

The first step is to find users in the training set that match users in the testing set. In order to do this, we need to:

Split the training data into groups based on the match columns.
Loop through the testing data.
Create an index based on the match columns.
Get any matches between the testing data and the training data using the groups.

Here's the code to accomplish this:

match_cols = ['user_location_country', 'user_location_region', 'user_location_city', 'hotel_market', 'orig_destination_distance']

groups = t1.groupby(match_cols)

def generate_exact_matches(row, match_cols):
    index = tuple([row[t] for t in match_cols])
    try:
        group = groups.get_group(index)
    except Exception:
        return []
    clus = list(set(group.hotel_cluster))
    return clus

exact_matches = []
for i in range(t2.shape[0]):
    exact_matches.append(generate_exact_matches(t2.iloc[i], match_cols))

At the end of this loop, we'll have a list of lists that contain any exact matches between the training and the testing sets. However, there aren't that many matches. To accurately evaluate error, we'll have to combine these predictions with our earlier predictions. Otherwise, we'll get a very low accuracy value, because most rows have empty lists for predictions.
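A quick way to see how sparse the matches are before combining (a small sketch):

# Count how many testing rows got at least one exact match.
sum(len(m) > 0 for m in exact_matches)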

Combining predictions

We can combine different lists of predictions to boost accuracy. Doing so will also help us see how good our exact match strategy is. To do this, we'll have to:

Combine exact_matches, preds, and most_common_clusters.
Only take the unique predictions, in sequential order, using the f5 function from here.
Ensure we have a maximum of 5 predictions for each row in the testing set.

Here's how we can do it:

def f5(seq, idfun=None):
    if idfun is None:
        def idfun(x): return x
    seen = {}
    result = []
    for item in seq:
        marker = idfun(item)
        if marker in seen: continue
        seen[marker] = 1
        result.append(item)
    return result

full_preds = [f5(exact_matches[p] + preds[p] + most_common_clusters)[:5] for p in range(len(preds))]
metrics.mapk([[l] for l in t2["hotel_cluster"]], full_preds, k=5)

0.28400041050903119

This is looking quite good in terms of error; we improved dramatically from earlier! We could keep going and make more small improvements, but we're probably ready to submit now.

Making a Kaggle submission file

Luckily, because of the way we wrote the code, all we have to do to submit is assign train to the variable t1, and test to the variable t2. Then, we just have to rerun the code to make predictions. Rerunning the code over the train and test sets should take less than an hour.
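In other words, the only change for the final run is which DataFrames feed the pipeline; the swap looks like this:

# Point the existing pipeline at the full datasets for the submission run.
t1 = train
t2 = test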

Once we have predictions, we just have to write them to a file:

write_p=["".join([str(l)forlinp])forpinfull_preds

write_frame=["{0},{1}".format(t2["id"][i],write_p
write_frame=["id,hotel_clusters"]+write_frame
withopen("predictions.csv","w+")asf:
f.write("\n".join(write_frame))

We'll then have a submission file in the right format to submit. As of this writing, making this submission will get you into the top 15%.

Summary

We came a long way in this post! We went from just looking at the data all the way to creating a submission and getting onto the leaderboard. Along the way, some of the key steps we took were:

Exploring the data and understanding the problem.
Setting up a way to iterate quickly through different techniques.
Creating a way to figure out accuracy locally.
Reading the forums, scripts, and the descriptions of the contest very closely to better understand the structure of the data.
Trying a variety of techniques and not being afraid to not use machine learning.

These steps will serve you well in any Kaggle competition.

Further improvements

In order to iterate quickly and explore techniques, speed is key. This is difficult with this competition, but there are a few strategies to try:

Sampling down the data even more.
Parallelizing operations across multiple cores.
Using Spark or other tools where tasks can be run on parallel workers.
Exploring various ways to write code and benchmarking to find the most efficient approach.
Avoiding iterating over the full training and testing sets, and instead using groups.

Writing fast, efficient code is a huge advantage in this competition.
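As one example of the last two points, the destination-based predictions from earlier can be generated without iterrows at all. A sketch (assuming cluster_dict from the earlier section, built with match_cols containing only srch_destination_id):

# Map each srch_destination_id straight to its precomputed top clusters.
dest_keys = t2["srch_destination_id"].astype(str)
mapped = dest_keys.map(cluster_dict)
preds = [p if isinstance(p, list) else [] for p in mapped]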

Future techniques to try

Once you have a stable foundation on which to run your code, there are a few avenues to explore in terms of techniques to boost accuracy:

Finding similarity between users, then adjusting hotel cluster scores based on similarity.
Using similarity between destinations to group multiple destinations together.
Applying machine learning within subsets of the data.
Combining different prediction strategies in a less naive way.
Exploring the link between hotel clusters and regions more.
I hope you have fun with this competition! I'd love to hear any feedback you have. If you want to learn more before diving into the competition, check out our courses on Dataquest to learn about data manipulation, statistics, machine learning, how to work with Spark, and more.
