Data Science: A Kaggle Walkthrough - Understanding the Data

This article on understanding the data is Part II in a series looking at data science and machine learning by walking through a Kaggle competition. Part I can be found here.

Continuing on the walkthrough of data science via a Kaggle competition entry, in this part we focus on understanding the data provided for the Airbnb Kaggle competition.
Reviewing the Data

In any process involving data, the first goal should always be understanding the data. This involves looking at the data and answering a range of questions including (but not limited to):

1. What features (columns) does the dataset contain?
2. How many records (rows) have been provided?
3. What format is the data in (e.g. what format are the dates provided in, are there numerical values, what do the different categorical values look like)?
4. Are there missing values?
5. How do the different features relate to each other?
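
As a rough sketch, the first four questions above can be answered with a few lines of pandas. The function name `summarize` and the small sample frame are hypothetical and only loosely based on the example rows shown later in Table 1; they are not the competition data.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Collect basic facts about a dataset: columns, row count, dtypes, missing values."""
    return {
        "columns": list(df.columns),                # 1. which features are there?
        "n_rows": len(df),                          # 2. how many records?
        "dtypes": df.dtypes.astype(str).to_dict(),  # 3. what format is each column in?
        "missing": df.isna().sum().to_dict(),       # 4. missing values per column
    }


# Tiny illustrative frame (not the real competition file)
sample = pd.DataFrame(
    {
        "age": [56.0, None, 43.0],
        "gender": ["FEMALE", "-unknown-", "MALE"],
    }
)
summary = summarize(sample)
print(summary["n_rows"], summary["missing"])
```

The fifth question (how features relate to each other) usually needs more targeted checks, such as cross-tabulating two columns, so it is left out of this summary.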
For this competition, Airbnb has provided 6 different files. Two of these files provide background information (countries.csv and age_gender_bkts.csv), while sample_submission_NDF.csv provides an example of how the submission file containing our final predictions should be formatted. The three remaining files are the key ones:

1. train_users_2.csv: This dataset contains data on Airbnb users, including the destination countries. Each row represents one user, with the columns containing various information such as the users' ages and when they signed up. This is the primary dataset that we will use to train the model.
2. test_users.csv: This dataset also contains data on Airbnb users, in the same format as train_users_2.csv, except without the destination country. These are the users for which we will have to make our final predictions.
3. sessions.csv: This is supplementary data that can be used to train the model and make the final predictions. It contains information about the actions (e.g. clicked on a listing, updated a wish list, ran a search etc.) taken by the users in both the testing and training datasets above.
http://brettromero.com/wordpress/datascienceakagglewalkthroughunderstandingthedata/
6/7/2016

With this information in mind, an easy first step in understanding the data is reviewing the information provided by the data provider, Airbnb. For this competition, the information can be found here. The main points (aside from the descriptions of the columns) are as follows:
- All the users in the data provided are from the USA.
- There are 12 possible outcomes of the destination country: 'US', 'FR', 'CA', 'GB', 'ES', 'IT', 'PT', 'NL', 'DE', 'AU', 'NDF' (no destination found), and 'other'.
- 'other' means there was a booking, but in a country not included in the list, while 'NDF' means there was not a booking.
- The training and test sets are split by dates. In the test set, you will predict the destination country for all the new users with first activities after 7/1/2014.
- In the sessions dataset, the data only dates back to 1/1/2014, while the training dataset dates back to 2010.

After absorbing this information, we can start looking at the actual data. For now we will focus on the train_users_2.csv file only.

Table 1 - Three rows (transposed) from train_users_2.csv

| Column Name | Example 1 | Example 2 | Example 3 |
|---|---|---|---|
| id | 4ft3gnwmtx | v5lq9bj8gv | msucfwmlzc |
| date_account_created | 28/9/10 | 30/6/14 | 30/6/14 |
| timestamp_first_active | 20090609231247 | 20140630234429 | 20140630234729 |
| date_first_booking | 2/8/10 | | 16/3/15 |
| gender | FEMALE | -unknown- | MALE |
| age | 56 | | 43 |
| signup_method | basic | basic | basic |
| signup_flow | | 25 | |
| language | en | en | en |
| affiliate_channel | direct | direct | direct |
| affiliate_provider | direct | direct | direct |
| first_affiliate_tracked | untracked | untracked | untracked |
| signup_app | Web | iOS | Web |
| first_device_type | Windows Desktop | iPhone | Windows Desktop |
| first_browser | IE | -unknown- | Firefox |
| country_destination | US | NDF | US |

Looking at the sample of three records above provides us with a few key pieces of information about this dataset. The first is that at least two columns have missing values: the age column and the date_first_booking column. This tells us that before we use this data for training a model, these missing values need to be filled or the rows excluded altogether. These options will be discussed in more detail in the next part of this series.

Secondly, most of the columns provided contain categorical data (i.e. the values represent one of some fixed number of categories). In fact, 11 of the 16 columns provided appear to be categorical. Most of the algorithms that are used in classification do not handle categorical data like this very well, and so when it comes to the data transformation step, we will need to find a way to change this data into a form that is more suited for classification.
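
One common way to make categorical data usable by a classifier is one-hot encoding. The article does not say at this point which transformation will be used, so the snippet below is only an illustration of the idea on a made-up two-row frame, using the `signup_app` column from Table 1.

```python
import pandas as pd

# Made-up rows for illustration only
users = pd.DataFrame(
    {
        "age": [56, 43],
        "signup_app": ["Web", "iOS"],
    }
)

# One-hot encode: each category becomes its own 0/1 column
encoded = pd.get_dummies(users, columns=["signup_app"], dtype=int)
print(encoded)
```

Each distinct value of `signup_app` becomes a separate numeric column, which is a form most classification algorithms can work with directly.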

Thirdly, the timestamp_first_active column looks to be a full timestamp, but in the format of a number. For example, 20090609231247 looks like it should be 2009-06-09 23:12:47. This formatting will need to be corrected if we are to use the date values.
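
A minimal sketch of that correction, assuming the values are read in as integers: convert them to strings and parse them with an explicit format. The two values are taken from Table 1.

```python
import pandas as pd

# timestamp_first_active values as they appear in the file (plain numbers)
raw = pd.Series([20090609231247, 20140630234429])

# Convert the integers to strings, then parse with an explicit format
parsed = pd.to_datetime(raw.astype(str), format="%Y%m%d%H%M%S")
print(parsed.iloc[0])  # 2009-06-09 23:12:47
```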

Diving Deeper

Now that we have gained a basic understanding of the data by looking at a few example records, the next step is to start looking at the structure of the data.

Country Destination Values

Arguably, the most important column in the dataset is the one the model will try to predict: country_destination. Looking at the number of records that fall into each category can help provide some insights into how the model should be constructed, as well as pitfalls to avoid.
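
Table 2 - Users by Destination

| Destination | Records | % of Total |
|---|---|---|
| NDF | 124,543 | 58.3% |
| US | 62,376 | 29.2% |
| other | 10,094 | 4.7% |
| FR | 5,023 | 2.4% |
| IT | 2,835 | 1.3% |
| GB | 2,324 | 1.1% |
| ES | 2,249 | 1.1% |
| CA | 1,428 | 0.7% |
| DE | 1,061 | 0.5% |
| NL | 762 | 0.4% |
| AU | 539 | 0.3% |
| PT | 217 | 0.1% |
| Grand Total | 213,451 | 100.0% |
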
Looking at the breakdown of the data, one thing that immediately stands out is that about 88% of users fall into two categories, that is, they are either yet to make a booking ('NDF') or they made their first booking in the US. What's more, breaking down these percentage splits by year reveals that the percentage of users yet to make a booking increases each year and reached over 60% in 2014.
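
Breakdowns like these can be produced with `value_counts` and `pd.crosstab`. The sketch below is one possible way to do it, not necessarily how the figures in this article were generated. The function names and the four-row frame are made up for illustration, and the dates are assumed to be in a format `pd.to_datetime` can parse.

```python
import pandas as pd


def destination_shares(df: pd.DataFrame) -> pd.Series:
    """Share of users per destination, as a percentage of all users."""
    return df["country_destination"].value_counts(normalize=True).mul(100).round(1)


def destination_shares_by_year(df: pd.DataFrame) -> pd.DataFrame:
    """Share of users per destination within each account-creation year."""
    year = pd.to_datetime(df["date_account_created"]).dt.year
    table = pd.crosstab(df["country_destination"], year, normalize="columns")
    return table.mul(100).round(1)


# Made-up rows for illustration only
toy = pd.DataFrame(
    {
        "date_account_created": ["2013-01-05", "2013-06-01", "2014-02-10", "2014-03-15"],
        "country_destination": ["US", "NDF", "NDF", "NDF"],
    }
)
overall = destination_shares(toy)
by_year = destination_shares_by_year(toy)
print(overall)
print(by_year)
```

Normalizing by column means each year sums to 100%, which makes the shift in the mix of destinations over time easy to see.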

Table 3 - Users by Destination and Year

| Destination | 2010 | 2011 | 2012 | 2013 | 2014 | Overall |
|---|---|---|---|---|---|---|
| NDF | 42.5% | 45.4% | 55.0% | 59.2% | 61.8% | 58.3% |
| US | 44.0% | 38.1% | 31.1% | 28.9% | 26.7% | 29.2% |
| other | 2.8% | 4.7% | 4.9% | 4.6% | 4.8% | 4.7% |
| FR | 4.3% | 4.0% | 2.8% | 2.2% | 1.9% | 2.4% |
| IT | 1.1% | 1.7% | 1.5% | 1.2% | 1.3% | 1.3% |
| GB | 1.0% | 1.5% | 1.3% | 1.0% | 1.0% | 1.1% |
| ES | 1.5% | 1.7% | 1.2% | 1.0% | 0.9% | 1.1% |
| CA | 1.5% | 1.1% | 0.7% | 0.6% | 0.6% | 0.7% |
| DE | 0.6% | 0.8% | 0.7% | 0.5% | 0.3% | 0.5% |
| NL | 0.4% | 0.6% | 0.4% | 0.3% | 0.3% | 0.4% |
| AU | 0.3% | 0.3% | 0.3% | 0.3% | 0.2% | 0.3% |
| PT | 0.0% | 0.2% | 0.1% | 0.1% | 0.1% | 0.1% |
| Total | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% |

For modeling purposes, this type of split means a couple of things. Firstly, the spread of categories has changed over time. Considering that our final predictions will be made against user data from July 2014 onwards, this change provides us with an incentive to focus on more recent data for training purposes, as it is more likely to resemble the test data.

Secondly, because the vast majority of users fall into 2 categories, there is a risk that if the model is too generalized, or in other words not sensitive enough, it will select one of those two categories for every prediction. A key step will be ensuring the training data has enough information to ensure the model will predict other categories as well.
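
To put a number on that risk, the following sketch uses the record counts from Table 2 to work out how a "model" that always predicts the most common category would score. Plain accuracy is used here purely as an illustration; it is not claimed to be the competition's evaluation metric.

```python
# Record counts per destination, copied from Table 2
counts = {
    "NDF": 124_543, "US": 62_376, "other": 10_094, "FR": 5_023,
    "IT": 2_835, "GB": 2_324, "ES": 2_249, "CA": 1_428,
    "DE": 1_061, "NL": 762, "AU": 539, "PT": 217,
}
total = sum(counts.values())

# Accuracy of a "model" that always predicts the most common class
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total

# Share of all users that fall into the two largest classes
top_two_share = (counts["NDF"] + counts["US"]) / total

print(majority_class, round(baseline_accuracy * 100, 1), round(top_two_share * 100, 1))
```

Any model we build should do noticeably better than this trivial baseline, and should predict the smaller categories at least some of the time.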

Account Creation Dates

Let's now move on to the date_account_created column to see how the values have changed over time.

Chart 1 - Accounts Created Over Time

Chart 1 provides excellent evidence of the explosive growth of Airbnb, averaging over 10% growth in new accounts created per month. In the year to June 2014, the number of new accounts created was 125,884, a 132% increase from the year before.

But aside from showing how quickly Airbnb has grown, this data also provides another important insight: the majority of the training data provided comes from the latest 2 years. In fact, if we limited the training data to accounts created from January 2013 onwards, we would still be including over 70% of all the data. This matters because, referring back to the notes provided by Airbnb, if we want to use the data in sessions.csv we would be limited to data from January 2014 onwards. Again looking at the numbers, this means that even though the sessions.csv data only covers 11% of the time period (6 out of 54 months), it still covers over 30% of the training data, or 76,466 users.
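
These shares can be checked with some simple arithmetic. The yearly record counts below are copied from the "Total Records" column of Table 4 later in this article; the helper `share_from` is a hypothetical name used only for this sketch.

```python
# Total training records per account-creation year, copied from Table 4
records_by_year = {2010: 2_788, 2011: 11_775, 2012: 39_462, 2013: 82_960, 2014: 76_466}
total = sum(records_by_year.values())


def share_from(year: int) -> float:
    """Fraction of training records created in `year` or later."""
    kept = sum(n for y, n in records_by_year.items() if y >= year)
    return kept / total


print(f"2013 onwards: {share_from(2013):.1%}")
print(f"2014 onwards: {share_from(2014):.1%}")
print(f"Months covered by sessions.csv: {6 / 54:.1%}")
```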

Age Breakdown

Looking at the breakdown by age, we can see a good example of another issue that anyone working with data (whether a Data Scientist or not) faces regularly: data quality issues. As can be seen from Chart 2, there are a significant number of users that have reported their ages as well over 100. In fact, a significant number of users reported their ages as over 1000.

Chart 2 - Reported Ages of Users

So what is going on here? Firstly, it appears that a number of users have reported their birth year instead of their age. This would help to explain why there are a lot of users with ages between 1924 and 1953. Secondly, we also see significant numbers of users reporting their age as 105 and 110. This is harder to explain, but it is likely that some users intentionally entered their age incorrectly for privacy reasons. Either way, these values would appear to be errors that will need to be addressed.
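
Before deciding how to fix these values, it can help to simply label them. The sketch below is one hypothetical way to do that; the function name `flag_age` and the thresholds (anything from 1900 to 2016 is treated as a year, anything else above 100 as implausible) are assumptions made for illustration, not rules taken from the data provider.

```python
import pandas as pd


def flag_age(age: float, max_plausible: int = 100) -> str:
    """Label a reported age; the thresholds are illustrative assumptions."""
    if pd.isna(age):
        return "missing"
    if 1900 <= age <= 2016:
        return "looks like a year"
    if age > max_plausible:
        return "implausibly high"
    return "plausible"


# Made-up values covering each case discussed above
ages = pd.Series([56, 1953, 105, None, 43])
labels = ages.map(flag_age)
print(labels.value_counts())
```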

Additionally, as we saw in the example data provided above, another issue with the age column is that sometimes age has not been reported at all. In fact, if we look across all the training data provided, we can see a large number of missing values in all years.

Table 4 - Missing Ages

| Year | Missing Values | Total Records | % Missing |
|---|---|---|---|
| 2010 | 1,082 | 2,788 | 38.8% |
| 2011 | 4,090 | 11,775 | 34.7% |
| 2012 | 13,740 | 39,462 | 34.8% |
| 2013 | 34,950 | 82,960 | 42.1% |
| 2014 | 34,128 | 76,466 | 44.6% |
| Total | 87,990 | 213,451 | 41.2% |

When we clean the data, we will have to decide what to do with these missing values.
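
A table like Table 4 can be built by grouping on the account-creation year. This is a sketch under the assumption that date_account_created can be parsed by `pd.to_datetime`; the function name and the four-row frame are made up for illustration.

```python
import pandas as pd


def missing_age_by_year(df: pd.DataFrame) -> pd.DataFrame:
    """Count missing ages per account-creation year, in the style of Table 4."""
    year = pd.to_datetime(df["date_account_created"]).dt.year
    grouped = df["age"].isna().groupby(year)
    out = pd.DataFrame({"missing": grouped.sum(), "total": grouped.size()})
    out["pct_missing"] = (out["missing"] / out["total"] * 100).round(1)
    return out


# Made-up rows for illustration only
toy_ages = pd.DataFrame(
    {
        "date_account_created": ["2013-01-05", "2013-06-01", "2014-02-10", "2014-03-15"],
        "age": [30.0, None, None, None],
    }
)
missing_table = missing_age_by_year(toy_ages)
print(missing_table)
```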

First Device Type

Finally, one last column that we will look at is the first_device_type column.

Table 5 - First Device Used

| Device | 2010 | 2011 | 2012 | 2013 | 2014 | All Years |
|---|---|---|---|---|---|---|
| Mac Desktop | 37.2% | 40.4% | 47.2% | 44.2% | 37.3% | 42.0% |
| Windows Desktop | 21.6% | 25.2% | 37.7% | 36.9% | 31.0% | 34.1% |
| iPhone | 5.8% | 6.3% | 3.8% | 7.5% | 15.9% | 9.7% |
| iPad | 4.6% | 4.8% | 6.1% | 7.1% | 7.0% | 6.7% |
| Other/Unknown | 28.8% | 21.3% | 3.8% | 2.8% | 4.6% | 5.0% |
| Android Phone | 1.1% | 1.2% | 0.7% | 0.4% | 2.6% | 1.3% |
| Android Tablet | 0.4% | 0.4% | 0.3% | 0.5% | 0.9% | 0.6% |
| Desktop (Other) | 0.4% | 0.4% | 0.4% | 0.6% | 0.7% | 0.6% |
| SmartPhone (Other) | 0.0% | 0.1% | 0.1% | 0.0% | 0.0% | 0.0% |
| Total | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% |

The interesting thing about the data in this column is how the types of devices used have changed over time. Windows Desktop users have increased significantly as a percentage of all users (from 21.6% in 2010 to 31.0% in 2014). iPhone users have nearly tripled their share (from 5.8% to 15.9%), while users using 'Other/Unknown' devices have gone from the second largest group to less than 5% of users. Further, the largest of these shifts (in the Windows Desktop and Other/Unknown shares) occurred between 2011 and 2012, suggesting that there may have been a change in the way the classification was done.

As with the other columns we have reviewed above, this change over time reinforces the presumption that recent data is likely to be the most useful for building our model.

Other Columns

It should be noted that although we have not covered all of them here, having some understanding of all the data provided in a dataset is important for building an accurate classification model. In some cases, this may not be possible due to the presence of a very large number of columns, or due to the fact that the data has been abstracted (that is, the data has been converted into a different form). However, in this particular case, the number of columns is relatively small and the information is easily understandable.

Next Time

Now that we have taken the first step, understanding the data, in the next piece we will start cleaning the data to get it into a form that will help to optimize the model's performance.