
How to get into the top 15 of a Kaggle competition using Python

03 MAY 2016 in tutorials, python, data science, kaggle, and expedia
Kaggle competitions are a fantastic way to learn data science and build your portfolio. I personally used Kaggle to learn many data science concepts. I started out with Kaggle a few months after learning programming, and later won several competitions.

Doing well in a Kaggle competition requires more than just knowing machine learning algorithms. It requires the right mindset, the willingness to learn, and a lot of data exploration. Many of these aspects aren't typically emphasized in tutorials on getting started with Kaggle, though. In this post, I'll cover how to get started with the Kaggle Expedia hotel recommendations competition, including establishing the right mindset, setting up testing infrastructure, exploring the data, creating features, and making predictions.
At the end, we'll generate a submission file using the techniques in this post. As of this writing, the submission would rank in the top 15%.

Where this submission would rank as of this writing.

The Expedia Kaggle competition

The Expedia competition challenges you with predicting what hotel a user will book based on some attributes about the search the user is conducting on Expedia. Before we dive into any coding, we'll need to put in time to understand both the problem and the data.

A quick glance at the columns

The first step is to look at the description of the columns of the dataset. You can find that here. Towards the bottom of the page, you'll see a description of each column in the data. Looking over this, it appears that we have quite a bit of data about the searches users are conducting on Expedia, along with data on what hotel cluster they eventually booked in test.csv and train.csv. destinations.csv contains information about the regions users search in for hotels. We won't worry about what we're predicting just yet; we'll focus on understanding the columns.

Expedia

Since the competition consists of event data from users booking hotels on Expedia, we'll need to spend some time understanding the Expedia site. Looking at the booking flow will help us contextualize the fields in the data, and how they tie into using Expedia.


The page you initially see when booking a hotel.

The box labelled "Going To" maps to the srch_destination_type_id, hotel_continent, hotel_country, and hotel_market fields in the data.
The box labelled "Check-in" maps to the srch_ci field in the data, and the box labelled "Check-out" maps to the srch_co field in the data.
The box labelled "Guests" maps to the srch_adults_cnt, srch_children_cnt, and srch_rm_cnt fields in the data.
The box labelled "Add a Flight" maps to the is_package field in the data.
site_name is the name of the site you visited, whether it be the main Expedia.com site, or another.
user_location_country, user_location_region, user_location_city, is_mobile, channel, is_booking, and cnt are all attributes that are determined by where the user is, what their device is, or their session on the Expedia site.

Just by looking at one screen, we can immediately contextualize all the variables. Playing around with the screen, filling in values, and going through the booking process can help further contextualize.

Exploring the Kaggle data in Python

Now that we have a handle on the data at a high level, we can do some exploration to take a deeper look.

Downloading the data

You can download the data here. The datasets are fairly large, so you'll need a good amount of disk space. You'll need to unzip the files to get raw .csv files instead of .csv.gz.

Exploring the data with Pandas

Given the amount of memory on your system, it may or may not be feasible to read all the data in. If it isn't, you should consider creating a machine on EC2 or DigitalOcean to process the data with. Here's a tutorial on how to get started with that.
Once we download the data, we can read it in using Pandas:

import pandas as pd

destinations = pd.read_csv("destinations.csv")
test = pd.read_csv("test.csv")
train = pd.read_csv("train.csv")
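If even a larger machine struggles to hold all of this, another option (a rough sketch; the column list here is just an example) is to read train.csv in chunks and keep only the columns you need:

# Read a million rows at a time, keep a few columns, then combine the pieces.
chunks = pd.read_csv("train.csv", chunksize=1000000,
                     usecols=["date_time", "user_id", "srch_destination_id",
                              "is_booking", "hotel_cluster"])
train_small = pd.concat(chunks)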

Let's first look at how much data there is:

train.shape

(37670293, 24)

test.shape

(2528243, 22)

We have about 37 million training set rows, and 2 million testing set rows, which will make this problem a bit challenging to work with.

We can explore the first few rows of the data:

train.head(5)

              date_time  ...  user_location_country  user_location_region  ...
0   2014-08-11 07:46:59  ...                     66                   348  ...
1   2014-08-11 08:22:12  ...                     66                   348  ...
2   2014-08-11 08:24:33  ...                     66                   348  ...
3   2014-08-09 18:05:16  ...                     66                   442  ...
4   2014-08-09 18:08:18  ...                     66                   442  ...

There are a few things that immediately stick out:

date_time could be useful in our predictions, so we'll need to convert it.
Most of the columns are integers or floats, so we can't do a lot of feature engineering. For example, user_location_country isn't the name of a country, it's an integer. This makes it harder to create new features, because we don't know exactly what each value means.


test.head(5)

   id            date_time  ...  user_location_country  user_location_region  ...
0   0  2015-09-03 17:09:54  ...                     66                   174  ...
1   1  2015-09-24 17:38:35  ...                     66                   174  ...
2   2  2015-06-07 15:53:02  ...                     66                   142  ...
3   3  2015-09-14 14:49:10  ...                     66                   258  ...
4   4  2015-07-17 09:32:04  ...                     66                   467  ...

There are a few things we can take away from looking at test.csv:

It looks like all the dates in test.csv are later than the dates in train.csv, and the data page confirms this. The testing set contains dates from 2015, and the training set contains dates from 2013 and 2014 (a quick check is sketched after this list).
It looks like the user ids in test.csv are a subset of the user ids in train.csv, given the overlapping integer ranges. We can confirm this later on.
test.csv contains only booking events (the equivalent of is_booking always being 1), and the data page confirms this.
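A quick way to sanity-check the date observation locally (a small sketch; date_time is still a string at this point, but ISO-formatted strings sort chronologically):

print(train["date_time"].min(), train["date_time"].max())
print(test["date_time"].min(), test["date_time"].max())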

Figuring out what to predict

What we're predicting

We'll be predicting which hotel_cluster a user will book after a given search. According to the description, there are 100 clusters in total.

How we'll be scored

The evaluation page says that we'll be scored using Mean Average Precision @ 5, which means that we'll need to make 5 cluster predictions for each row, and will be scored on whether or not the correct prediction appears in our list. If the correct prediction comes earlier in the list, we get more points.

For example, if the correct cluster appears first in our list of 5 predictions, our score for that row will be higher than if it appears last. We should put predictions we're more certain about earlier in our list of predictions.
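To make the scoring concrete, here's a toy version of average precision at 5 for a single row (not the official scorer, and the cluster numbers are arbitrary):

def apk_single(correct, predicted, k=5):
    # With a single correct cluster, AP@k is 1 / (rank of the correct cluster),
    # or 0 if it doesn't appear in the first k predictions.
    for i, p in enumerate(predicted[:k]):
        if p == correct:
            return 1.0 / (i + 1)
    return 0.0

print(apk_single(42, [42, 7, 13, 91, 5]))  # 1.0: correct cluster listed first
print(apk_single(42, [7, 13, 91, 5, 42]))  # 0.2: correct cluster listed last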

Exploring hotel clusters

Now that we know what we're predicting, it's time to dive in and explore hotel_cluster. We can use the value_counts method on Series to do this:

train["hotel_cluster"].value_counts()

91    1043720
41     772743
48     754033
64     704734
65     670960
5      620194
       ...
53     134812
88     107784
27     105040
74      48355

The output above is truncated, but it shows that the number of hotels in each cluster is fairly evenly distributed. There doesn't appear to be any relationship between cluster number and the number of items.


Exploring train and test user ids

Finally, we'll confirm our hypothesis that all the test user ids are found in the train DataFrame. We can do this by finding the unique values for user_id in test, and seeing if they all exist in train.

In the code below, we'll:

Create a set of all the unique test user ids.
Create a set of all the unique train user ids.
Figure out how many test user ids are in the train user ids.
See if the count matches the total number of test user ids.

test_ids = set(test.user_id.unique())
train_ids = set(train.user_id.unique())
intersection_count = len(test_ids & train_ids)
intersection_count == len(test_ids)

True

Looks like our hypothesis is correct, which will make working with this data much easier!

Downsampling our Kaggle data

The entire train.csv dataset contains 37 million rows, which makes it hard to experiment with different techniques. Ideally, we want a small enough dataset that lets us quickly iterate through different approaches but is still representative of the whole training data.

We can do this by first randomly sampling rows from our data, then selecting new training and testing datasets from train.csv.
By selecting both sets from train.csv, we'll have the true hotel_cluster label for every row, and we'll be able to calculate our accuracy as we test techniques.

Add in times and dates

The first step is to add month and year fields to train. Because the train and test data is differentiated by date, we'll need to add date fields to allow us to segment our data into two sets the same way. If we add year and month fields, we can split our data into training and testing sets using them.

The code below will:

Convert the date_time column in train from an object to a datetime value. This makes it easier to work with as a date.
Extract the year and month from date_time, and assign them to their own columns.

train["date_time"]=pd.to_datetime(train["date_time"
train["year"]=train["date_time"].dt.year
train["month"]=train["date_time"].dt.month

Pick 10,000 users

Because the user ids in test are a subset of the user ids in train, we'll need to do our random sampling in a way that preserves the full data of each user. We can accomplish this by selecting a certain number of users randomly, then only picking rows from train where user_id is in our random sample of user ids.

import random

unique_users = train.user_id.unique()

sel_user_ids = [unique_users[i] for i in sorted(random.sample(range(len(unique_users)), 10000))]

sel_train = train[train.user_id.isin(sel_user_ids)]

The above code creates a DataFrame called sel_train that only contains data from 10,000 users.

Pick new training and testing sets

We'll now need to pick new training and testing sets from sel_train. We'll call these sets t1 and t2.

t1 = sel_train[((sel_train.year == 2013) | ((sel_train.year == 2014) & (sel_train.month < 8)))]
t2 = sel_train[((sel_train.year == 2014) & (sel_train.month >= 8))]

In the original train and test DataFrames, test contained data from 2015, and train contained data from 2013 and 2014. We split this data so that anything after July 2014 is in t2, and anything before is in t1. This gives us smaller training and testing sets with similar characteristics to train and test.

Remove click events

If is_booking is 0, it represents a click, and a 1 represents a booking. test contains only booking events, so we'll need to sample t2 to only contain bookings as well.

t2 = t2[t2.is_booking == True]

A simple algorithm

The simplest technique we could try on this data is to find the most common clusters across the data, then use them as predictions.


We can again use the value_counts method to help us here:

most_common_clusters = list(train.hotel_cluster.value_counts().head().index)

The above code will give us a list of the 5 most common clusters in train. This is because the head method returns the first 5 rows by default, and the index property will return the index of the DataFrame, which is the hotel cluster after running the value_counts method.
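Reading the cluster numbers off the top of the value_counts output earlier, this works out to:

print(most_common_clusters)
# [91, 41, 48, 64, 65]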

Generating predictions

We can turn most_common_clusters into a list of predictions by making the same prediction for each row.

predictions = [most_common_clusters for i in range(t2.shape[0])]

This will create a list with as many elements as there are rows in t2. Each element will be equal to most_common_clusters.

Evaluating error

In order to evaluate error, we'll first need to figure out how to compute Mean Average Precision. Luckily, Ben Hamner has written an implementation that can be found here. It can be installed as part of the ml_metrics package, and you can find installation instructions for how to install it here.

We can compute our error metric with the mapk method in ml_metrics:

import ml_metrics as metrics

target = [[l] for l in t2["hotel_cluster"]]
metrics.mapk(target, predictions, k=5)


0.058020770920711007

Our target needs to be in a list of lists format for mapk to work, so we convert the hotel_cluster column of t2 into a list of lists. Then, we call the mapk method with our target, our predictions, and the number of predictions we want to evaluate (5).

Our result here isn't great, but we've just generated our first set of predictions, and evaluated our error! The framework we've built will allow us to quickly test out a variety of techniques and see how they score. We're well on our way to building a good performing solution for the leaderboard.
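One small way to support that quick iteration is to wrap the scoring step in a helper, so every technique we try is evaluated identically. A minimal sketch (the function name here is ours, not from the original code):

def score_predictions(preds, actual):
    # Score a list-of-lists of predictions against the true hotel_cluster labels.
    target = [[l] for l in actual["hotel_cluster"]]
    return metrics.mapk(target, preds, k=5)

score_predictions(predictions, t2)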

Finding correlations

Before we move on to creating a better algorithm, let's see if anything correlates well with hotel_cluster. This will tell us if we should dive more into any particular columns.

We can find linear correlations in the training set using the corr method:

train.corr()["hotel_cluster"]

site_name                    0.022408
posa_continent               0.014938
user_location_country        0.010477
user_location_region         0.007453
user_location_city           0.000831
orig_destination_distance    0.007260
user_id                      0.001052
is_mobile                    0.008412
is_package                   0.038733
channel                      0.000707

This tells us that no columns correlate linearly with hotel_cluster. This makes sense, because there is no linear ordering to hotel_cluster.
For example, having a higher cluster number isn't tied to having a higher srch_destination_id.

Unfortunately, this means that techniques like linear regression and logistic regression won't work well on our data, because they rely on linear correlations between predictors and targets.

Creating better predictions for our Kaggle entry

The data for this competition is quite difficult to make predictions on using machine learning for a few reasons:

There are millions of rows, which increases runtime and memory usage for algorithms.
There are 100 different clusters, and according to the competition admins, the boundaries are fairly fuzzy, so it will likely be hard to make predictions. As the number of clusters increases, classifiers generally decrease in accuracy.
Nothing is linearly correlated with the target (hotel_cluster), meaning we can't use fast machine learning techniques like linear regression.

For these reasons, machine learning probably won't work well on our data, but we can try an algorithm and find out.

Generating features

The first step in applying machine learning is to generate features. We can generate features using both what's available in the training data, and what's available in destinations. We haven't looked at destinations yet, so let's take a quick peek.

Generating features from destinations

Destinations contains an id that corresponds to srch_destination_id, along with 149 columns of latent information about that destination. Here's a sample:

srch_destination_id        d1        d2        d3        d4        d5        d6
                ...  2.198657  2.198657  2.198657  2.198657  2.198657  1.897627
                ...  2.181690  2.181690  2.181690  2.082564  2.181690  2.165028
                ...  2.183490  2.224164  2.224164  2.189562  2.105819  2.075407
                ...  2.177409  2.177409  2.177409  2.177409  2.177409  2.115485
                ...  2.189562  2.187783  2.194008  2.171153  2.152303  2.056618

The competition doesn't tell us exactly what each latent feature is, but it's safe to assume that it's some combination of destination characteristics, like name, description, and more. These latent features were converted to numbers, so they could be anonymized.

We can use the destination information as features in a machine learning algorithm, but we'll need to compress the number of columns down first, to minimize runtime. We can use PCA to do this. PCA will reduce the number of columns in a matrix while trying to preserve the same amount of variance per row. Ideally, PCA would compress all the information contained in all the columns into fewer columns, but in practice, some information is lost.
In the code below, we:

Initialize a PCA model using scikit-learn.
Specify that we only want to have 3 columns in our data.
Transform the columns d1-d149 into 3 columns.


from sklearn.decomposition import PCA

pca = PCA(n_components=3)
dest_small = pca.fit_transform(destinations[["d{0}".format(i + 1) for i in range(149)]])
dest_small = pd.DataFrame(dest_small)
dest_small["srch_destination_id"] = destinations["srch_destination_id"]

The above code compresses the 149 columns in destinations down to 3 columns, and creates a new DataFrame called dest_small. We preserve most of the variance in destinations while doing this, so we don't lose a lot of information, but we save a lot of runtime for a machine learning algorithm.
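If you want to verify that claim, scikit-learn exposes how much variance each component captures after fitting; a quick check (not part of the original walkthrough):

print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())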

Generating features

Now that the preliminaries are done with, we can generate our features. We'll do the following:

Generate new date features based on date_time, srch_ci, and srch_co.
Remove non-numeric columns like date_time.
Add in features from dest_small.
Replace any missing values with -1.

def calc_fast_features(df):
    df["date_time"] = pd.to_datetime(df["date_time"])
    df["srch_ci"] = pd.to_datetime(df["srch_ci"], format='%Y-%m-%d', errors="coerce")
    df["srch_co"] = pd.to_datetime(df["srch_co"], format='%Y-%m-%d', errors="coerce")

    # Numeric features derived from the search timestamp.
    props = {}
    for prop in ["month", "day", "hour", "minute", "dayofweek", "quarter"]:
        props[prop] = getattr(df["date_time"].dt, prop)

    # Carry over every column except the raw date columns.
    carryover = [p for p in df.columns if p not in ["date_time", "srch_ci", "srch_co"]]
    for prop in carryover:
        props[prop] = df[prop]

    # Check-in / check-out date features and length of stay.
    date_props = ["month", "day", "dayofweek", "quarter"]
    for prop in date_props:
        props["ci_{0}".format(prop)] = getattr(df["srch_ci"].dt, prop)
        props["co_{0}".format(prop)] = getattr(df["srch_co"].dt, prop)
    props["stay_span"] = (df["srch_co"] - df["srch_ci"]).astype('timedelta64[h]')

    ret = pd.DataFrame(props)

    # Join the compressed destination features.
    ret = ret.join(dest_small, on="srch_destination_id", how='left', rsuffix="dest")
    ret = ret.drop("srch_destination_iddest", axis=1)
    return ret

df = calc_fast_features(t1)
df.fillna(-1, inplace=True)

The above will calculate features such as length of stay, check-in day, and check-out month. These features will help us train a machine learning algorithm later on.

Replacing missing values with -1 isn't the best choice, but it will work fine for now, and we can always optimize the behavior later on.
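If we do revisit it later, one possible tweak (a sketch, not what the rest of this post uses) is to fill each column with its median instead of a constant:

# Alternative sketch: instead of df.fillna(-1, inplace=True) above,
# fill missing values with each column's median.
df = calc_fast_features(t1)
df = df.fillna(df.median())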

Machine learning

Now that we have features for our training data, we can try machine learning. We'll use 3-fold cross-validation across the training set to generate a reliable error estimate. Cross-validation splits the training set up into 3 parts, then predicts hotel_cluster for each part using the other parts to train with.

We'll generate predictions using the Random Forest algorithm. Random forests build trees, which can fit to nonlinear tendencies in data. This will enable us to make predictions, even though none of our columns are linearly related.
We'll first initialize the model and compute cross-validation scores:

predictors = [c for c in df.columns if c not in ["hotel_cluster"]]

from sklearn import cross_validation
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=10, min_weight_fraction_leaf=0.1)
scores = cross_validation.cross_val_score(clf, df[predictors], df['hotel_cluster'], cv=3)
scores

array([0.06203556, 0.06233452, 0.06392277])

The above code doesn't give us very good accuracy, and confirms our original suspicion that machine learning isn't a great approach to this problem. However, classifiers tend to have lower accuracy when there is a high cluster count. We can instead try training 100 binary classifiers. Each classifier will just determine if a row is in its cluster, or not. This will entail training one classifier per label in hotel_cluster.

Binary classifiers

We'll again train Random Forests, but each forest will predict only a single hotel cluster. We'll use 2-fold cross-validation for speed, and only train 10 trees per label.

In the code below, we:

Loop across each unique hotel_cluster.
Train a Random Forest classifier using 2-fold cross-validation.
Extract the probabilities from the classifier that the row is in the unique hotel_cluster.
Combine all the probabilities.


For each row, find the 5 largest probabilities, and assign those hotel_cluster values as predictions.
Compute accuracy using mapk.

from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import KFold
from itertools import chain

all_probs = []
unique_clusters = df["hotel_cluster"].unique()
for cluster in unique_clusters:
    # Binary target: is this row in the current cluster?
    df["target"] = 1
    df["target"][df["hotel_cluster"] != cluster] = 0
    predictors = [col for col in df if col not in ['hotel_cluster', "target"]]
    probs = []
    cv = KFold(len(df["target"]), n_folds=2)
    clf = RandomForestClassifier(n_estimators=10, min_weight_fraction_leaf=0.1)
    for i, (tr, te) in enumerate(cv):
        clf.fit(df[predictors].iloc[tr], df["target"].iloc[tr])
        preds = clf.predict_proba(df[predictors].iloc[te])
        probs.append([p[1] for p in preds])
    full_probs = chain.from_iterable(probs)
    all_probs.append(list(full_probs))

prediction_frame = pd.DataFrame(all_probs).T
prediction_frame.columns = unique_clusters

def find_top_5(row):
    return list(row.nlargest(5).index)

preds = []
for index, row in prediction_frame.iterrows():
    preds.append(find_top_5(row))

metrics.mapk([[l] for l in t2["hotel_cluster"]], preds, k=5)

0.041083333333333326

Our accuracy here is worse than before, and people on the leaderboard have much better accuracy scores. We'll need to abandon machine learning and move to the next technique in order to compete. Machine learning can be a powerful technique, but it isn't always the right approach to every problem.

Top clusters based on hotel_cluster

There are a few Kaggle scripts for the competition that involve aggregating hotel_cluster based on orig_destination_distance, or srch_destination_id. Aggregating on orig_destination_distance will exploit a data leak in the competition, and attempt to match the same user together. Aggregating on srch_destination_id will find the most popular hotel clusters for each destination. We'll then be able to predict that a user who searches for a destination is going to one of the most popular hotel clusters for that destination. Think of this as a more granular version of the most common clusters technique we used earlier.
We can first generate scores for each hotel_cluster in each srch_destination_id. We'll weight bookings higher than clicks. This is because the test data is all booking data, and this is what we want to predict. We want to include click information, but downweight it to reflect this. Step by step, we'll:
Group t1 by srch_destination_id, and hotel_cluster.
Iterate through each group, and:
    Assign 1 point to each hotel cluster where is_booking is True.
    Assign 0.15 points to each hotel cluster where is_booking is False.
Assign the score to the srch_destination_id / hotel_cluster combination in a dictionary.

Here's the code to accomplish the above steps:

def make_key(items):
    return "_".join([str(i) for i in items])

match_cols = ["srch_destination_id"]
cluster_cols = match_cols + ['hotel_cluster']
groups = t1.groupby(cluster_cols)
top_clusters = {}
for name, group in groups:
    clicks = len(group.is_booking[group.is_booking == False])
    bookings = len(group.is_booking[group.is_booking == True])

    # Weight bookings a full point and clicks much less.
    score = bookings + .15 * clicks

    clus_name = make_key(name[:len(match_cols)])
    if clus_name not in top_clusters:
        top_clusters[clus_name] = {}
    top_clusters[clus_name][name[1]] = score

At the end, we'll have a dictionary where each key is an srch_destination_id. Each value in the dictionary will be another dictionary, containing hotel clusters as keys with scores as values. Here's how it looks:

{'39331': {20: 1.15, 30: 0.15, 81: 0.3},
 '511': {17: 0.15, 34: 0.15, 55: 0.15, 70: 0.15}}

We'll next want to transform this dictionary to find the top hotel clusters for each srch_destination_id. In order to do this, we'll:

Loop through each key in top_clusters.
Find the top 5 clusters for that key.
Assign the top 5 clusters to a new dictionary, cluster_dict.


Here's the code:

import operator

cluster_dict = {}
for n in top_clusters:
    tc = top_clusters[n]
    top = [l[0] for l in sorted(tc.items(), key=operator.itemgetter(1), reverse=True)[:5]]
    cluster_dict[n] = top

Making predictions based on destination

Once we know the top clusters for each srch_destination_id, we can quickly make predictions. To make predictions, all we have to do is:

Iterate through each row in t2.
Extract the srch_destination_id for the row.
Find the top clusters for that destination id.
Append the top clusters to preds.

Here's the code:

preds = []
for index, row in t2.iterrows():
    key = make_key([row[m] for m in match_cols])
    if key in cluster_dict:
        preds.append(cluster_dict[key])
    else:
        preds.append([])

At the end of the loop, preds will be a list of lists containing our predictions. It will look like this:

[
    [2, 25, 28, 10, 64],
    [25, 78, 64, 90, 60],
    ...
]

Calculating error

Once we have our predictions, we can compute our accuracy using the mapk function from earlier:

metrics.mapk([[l] for l in t2["hotel_cluster"]], preds, k=5)

0.22388136288998359

We're doing pretty well! We more than tripled our accuracy over the best machine learning approach, and we did it with a far faster and simpler approach.

You may have noticed that this value is quite a bit lower than accuracies on the leaderboard. Local testing results in a lower accuracy value than submitting, so this approach will actually do fairly well on the leaderboard. Differences in leaderboard score and local score can come down to a few factors:
Different data locally and in the hidden set that leaderboard scores are computed on. For example, we're computing error in a sample of the training set, and the leaderboard score is computed on the testing set.
Techniques that result in higher accuracy with more training data. We're only using a small subset of data for training, and it may be more accurate when we use the full training set.
Different randomization. With certain algorithms, random numbers are involved, but we're not using any of these.


Generating better predictions for your Kaggle submission

The forums are very important in Kaggle, and can often help you find nuggets of information that will let you boost your score. The Expedia competition is no exception. This post details a data leak that allows you to match users in the training set from the testing set using a set of columns including user_location_country and user_location_region.

We'll use the information from the post to match users from the testing set back to the training set, which will boost our score. Based on the forum thread, it's okay to do this, and the competition won't be updated as a result of the leak.

Finding matching users

The first step is to find users in the training set that match users in the testing set. In order to do this, we need to:

Split the training data into groups based on the match columns.
Loop through the testing data.
Create an index based on the match columns.
Get any matches between the testing data and the training data using the groups.

Here's the code to accomplish this:

match_cols = ['user_location_country', 'user_location_region', 'user_location_city', 'hotel_market', 'orig_destination_distance']

groups = t1.groupby(match_cols)

def generate_exact_matches(row, match_cols):
    index = tuple([row[t] for t in match_cols])
    try:
        group = groups.get_group(index)
    except Exception:
        return []
    clus = list(set(group.hotel_cluster))
    return clus

exact_matches = []
for i in range(t2.shape[0]):
    exact_matches.append(generate_exact_matches(t2.iloc[i], match_cols))

At the end of this loop, we'll have a list of lists that contain any exact matches between the training and the testing sets. However, there aren't that many matches. To accurately evaluate error, we'll have to combine these predictions with our earlier predictions. Otherwise, we'll get a very low accuracy value, because most rows have empty lists for predictions.
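A quick way to see how sparse the matches are before combining (a small sketch):

# Count how many testing rows got at least one exact match.
sum(len(m) > 0 for m in exact_matches)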

Combining predictions

We can combine different lists of predictions to boost accuracy. Doing so will also help us see how good our exact match strategy is. To do this, we'll have to:

Combine exact_matches, preds, and most_common_clusters.
Only take the unique predictions, in sequential order, using the f5 function from here.
Ensure we have a maximum of 5 predictions for each row in the testing set.

Here's how we can do it:

def f5(seq, idfun=None):
    if idfun is None:
        def idfun(x): return x
    seen = {}
    result = []
    for item in seq:
        marker = idfun(item)
        if marker in seen: continue
        seen[marker] = 1
        result.append(item)
    return result

full_preds = [f5(exact_matches[p] + preds[p] + most_common_clusters)[:5] for p in range(len(preds))]
metrics.mapk([[l] for l in t2["hotel_cluster"]], full_preds, k=5)

0.28400041050903119

This is looking quite good in terms of error; we improved dramatically from earlier! We could keep going and make more small improvements, but we're probably ready to submit now.

Making a Kaggle submission file

Luckily, because of the way we wrote the code, all we have to do to submit is assign train to the variable t1, and test to the variable t2. Then, we just have to rerun the code to make predictions. Rerunning the code over the train and test sets should take less than an hour.
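In other words, the only change for the final run is which DataFrames feed the pipeline; the swap looks like this:

# Point the existing pipeline at the full datasets for the submission run.
t1 = train
t2 = test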

Once we have predictions, we just have to write them to a file:

write_p=["".join([str(l)forlinp])forpinfull_preds

write_frame=["{0},{1}".format(t2["id"][i],write_p
write_frame=["id,hotel_clusters"]+write_frame
withopen("predictions.csv","w+")asf:
f.write("\n".join(write_frame))

We'll then have a submission file in the right format to submit. As of this writing, making this submission will get you into the top 15%.

Summary

We came a long way in this post! We went from just looking at the data all the way to creating a submission and getting onto the leaderboard. Along the way, some of the key steps we took were:

Exploring the data and understanding the problem.
Setting up a way to iterate quickly through different techniques.
Creating a way to figure out accuracy locally.
Reading the forums, scripts, and the descriptions of the contest very closely to better understand the structure of the data.
Trying a variety of techniques and not being afraid to not use machine learning.

These steps will serve you well in any Kaggle competition.

Further improvements

In order to iterate quickly and explore techniques, speed is key. This is difficult with this competition, but there are a few strategies to try:

Sampling down the data even more.
Parallelizing operations across multiple cores.
Using Spark or other tools where tasks can be run on parallel workers.
Exploring various ways to write code and benchmarking to find the most efficient approach.
Avoiding iterating over the full training and testing sets, and instead using groups.

Writing fast, efficient code is a huge advantage in this competition.
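As one example of the last two points, the destination-based predictions from earlier can be generated without iterrows at all. A sketch (assuming cluster_dict from the earlier section, built with match_cols containing only srch_destination_id):

# Map each srch_destination_id straight to its precomputed top clusters.
dest_keys = t2["srch_destination_id"].astype(str)
mapped = dest_keys.map(cluster_dict)
preds = [p if isinstance(p, list) else [] for p in mapped]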

Future techniques to try

Once you have a stable foundation on which to run your code, there are a few avenues to explore in terms of techniques to boost accuracy:

Finding similarity between users, then adjusting hotel cluster scores based on similarity.
Using similarity between destinations to group multiple destinations together.
Applying machine learning within subsets of the data.
Combining different prediction strategies in a less naive way.
Exploring the link between hotel clusters and regions more.
I hope you have fun with this competition! I'd love to hear any feedback you have. If you want to learn more before diving into the competition, check out our courses on Dataquest to learn about data manipulation, statistics, machine learning, how to work with Spark, and more.
