6/7/2016

Data Science: A Kaggle Walkthrough - Understanding the Data

This article on understanding the data is Part II in a series looking at data science and machine
learning by walking through a Kaggle competition. Part I can be found here.
Continuing on the walkthrough of data science via a Kaggle competition entry, in this part we
focus on understanding the data provided for the Airbnb Kaggle competition.

Reviewing the Data
In any process involving data, the first goal should always be understanding the data. This
involves looking at the data and answering a range of questions including (but not limited to):
1. What features (columns) does the dataset contain?
2. How many records (rows) have been provided?
3. What format is the data in (e.g. what format are the dates provided in, are there numerical
values, what do the different categorical values look like)?
4. Are there missing values?
5. How do the different features relate to each other?
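
Questions 1, 2 and 4 in this list can be answered mechanically. The following is a minimal sketch using only Python's standard library. The `summarize` helper and the inline sample rows are hypothetical and introduced purely for illustration; they are not taken from the competition files.

```python
import csv
import io

# Tiny in-memory stand-in for a CSV file. The rows are made up for illustration.
SAMPLE_CSV = """id,date_account_created,age,gender
a1,2010-09-28,56,FEMALE
b2,2014-06-30,,unknown
c3,2014-06-30,43,MALE
"""


def summarize(csv_file):
    """Return the column names, the row count, and missing-value counts per column."""
    reader = csv.DictReader(csv_file)
    columns = list(reader.fieldnames)
    missing = {name: 0 for name in columns}
    n_rows = 0
    for row in reader:
        n_rows += 1
        for name in columns:
            # An empty (or absent) cell is counted as a missing value.
            if (row[name] or "").strip() == "":
                missing[name] += 1
    return columns, n_rows, missing


columns, n_rows, missing = summarize(io.StringIO(SAMPLE_CSV))
print(columns)   # which features are present
print(n_rows)    # how many records
print(missing)   # missing values per column
```

For a real file, the same function could be called on an opened file object, for example `open("train_users_2.csv", newline="")`, assuming the file sits in the working directory.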
For this competition, Airbnb have provided 6 different files. Two of these files provide
background information (countries.csv and age_gender_bkts.csv),
while sample_submission_NDF.csv provides an example of how the submission file containing
our final predictions should be formatted. The three remaining files are the key ones:
1. train_users_2.csv: This dataset contains data on Airbnb users, including the destination
countries. Each row represents one user, with the columns containing various information
such as the user's age and when they signed up. This is the primary dataset that we will
use to train the model.
2. test_users.csv: This dataset also contains data on Airbnb users, in the same format
as train_users_2.csv, except without the destination country. These are the users for
which we will have to make our final predictions.
3. sessions.csv: This is supplementary data that can be used to train the model and
make the final predictions. It contains information about the actions (e.g. clicked on a
listing, updated a wish list, ran a search etc.) taken by the users in both the testing and
training datasets above.
With this information in mind, an easy first step in understanding the data is reviewing the
information provided by the data provider, Airbnb. For this competition, the information can be
found here. The main points (aside from the descriptions of the columns) are as follows:
- All the users in the data provided are from the USA.
- There are 12 possible outcomes of the destination country: US, FR, CA, GB, ES,
  IT, PT, NL, DE, AU, NDF (no destination found), and other.
- other means there was a booking, but in a country not included in the list, while NDF
  means there was not a booking.
- The training and test sets are split by dates. In the test set, you will predict the
  destination country for all the new users with first activities after 7/1/2014.
- In the sessions dataset, the data only dates back to 1/1/2014, while the training dataset
  dates back to 2010.
After absorbing this information, we can start looking at the actual data. For now we will focus
on the train_users_2.csv file only.

Table 1: Three rows (transposed) from train_users_2.csv

| Column Name             | Example 1       | Example 2      | Example 3       |
|-------------------------|-----------------|----------------|-----------------|
| id                      | 4ft3gnwmtx      | v5lq9bj8gv     | msucfwmlzc      |
| date_account_created    | 28/9/10         | 30/6/14        | 30/6/14         |
| timestamp_first_active  | 20090609231247  | 20140630234429 | 20140630234729  |
| date_first_booking      | 2/8/10          |                | 16/3/15         |
| gender                  | FEMALE          | unknown        | MALE            |
| age                     | 56              |                | 43              |
| signup_method           | basic           | basic          | basic           |
| signup_flow             |                 | 25             |                 |
| language                | en              | en             | en              |
| affiliate_channel       | direct          | direct         | direct          |
| affiliate_provider      | direct          | direct         | direct          |
| first_affiliate_tracked | untracked       | untracked      | untracked       |
| signup_app              | Web             | iOS            | Web             |
| first_device_type       | Windows Desktop | iPhone         | Windows Desktop |
| first_browser           | IE              | unknown        | Firefox         |
| country_destination     | US              | NDF            | US              |

Looking at the sample of three records above provides us with a few key pieces of information
about this dataset. The first is that at least two columns have missing values: the age column
and the date_first_booking column. This tells us that before we use this data for training a model,
these missing values need to be filled or the rows excluded altogether. These options will be
discussed in more detail in the next part of this series.
Secondly, most of the columns provided contain categorical data (i.e. the values represent one
of some fixed number of categories). In fact, 11 of the 16 columns provided appear to be
categorical. Most of the algorithms that are used in classification do not handle categorical data
like this very well, and so when it comes to the data transformation step, we will need to find a
way to change this data into a form that is more suited for classification.
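
One common way to make a categorical column usable by a classifier is one-hot encoding, where each category becomes its own 0/1 indicator. The sketch below is only an illustration of that general technique, not necessarily the transformation used later in this series. The `one_hot` helper is a hypothetical name, and the input values are the signup_app entries from Table 1.

```python
def one_hot(values):
    """Turn a list of categorical values into 0/1 indicator rows.

    Returns the sorted category names and one indicator row per input value.
    """
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # mark the column for this value's category
        rows.append(row)
    return categories, rows


signup_app = ["Web", "iOS", "Web"]  # signup_app values from Table 1
categories, encoded = one_hot(signup_app)
print(categories)  # ['Web', 'iOS']
print(encoded)     # [[1, 0], [0, 1], [1, 0]]
```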
Thirdly, the timestamp_first_active column looks to be a full timestamp, but in the format of a
number. For example, 20090609231247 looks like it should be 2009-06-09 23:12:47. This
formatting will need to be corrected if we are to use the date values.
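
As a quick check of that reading, the number can be parsed with Python's standard `datetime` module. This is a minimal sketch that assumes every value follows the 14-digit year, month, day, hour, minute, second layout suggested by the example. `parse_first_active` is a hypothetical helper name.

```python
from datetime import datetime


def parse_first_active(raw):
    """Parse a timestamp_first_active value such as 20090609231247.

    Assumes the layout YYYYMMDDHHMMSS; raises ValueError otherwise.
    """
    return datetime.strptime(str(raw), "%Y%m%d%H%M%S")


parsed = parse_first_active(20090609231247)
print(parsed.isoformat(sep=" "))  # 2009-06-09 23:12:47
```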

Diving Deeper
Now that we have gained a basic understanding of the data by looking at a few example
records, the next step is to start looking at the structure of the data.

Country Destination Values
Arguably, the most important column in the dataset is the one the model will try to predict:
country_destination. Looking at the number of records that fall into each category can help
provide some insights into how the model should be constructed, as well as pitfalls to avoid.

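Table 2: Users by Destination

| Destination | Records | % of Total |
|-------------|---------|------------|
| NDF         | 124,543 | 58.3%      |
| US          | 62,376  | 29.2%      |
| other       | 10,094  | 4.7%       |
| FR          | 5,023   | 2.4%       |
| IT          | 2,835   | 1.3%       |
| GB          | 2,324   | 1.1%       |
| ES          | 2,249   | 1.1%       |
| CA          | 1,428   | 0.7%       |
| DE          | 1,061   | 0.5%       |
| NL          | 762     | 0.4%       |
| AU          | 539     | 0.3%       |
| PT          | 217     | 0.1%       |
| Grand Total | 213,451 | 100.0%     |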

Looking at the breakdown of the data, one thing that immediately stands out is that almost 90%
of users fall into two categories, that is, they are either yet to make a booking (NDF) or they
made their first booking in the US. What's more, breaking down these percentage splits by year
reveals that the percentage of users yet to make a booking increases each year and reached
over 60% in 2014.
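
A per-year breakdown of this kind can be produced by counting destinations within each year and dividing by that year's total. The sketch below shows the idea on a handful of made-up records; `share_by_year` is a hypothetical helper, and the toy data is not drawn from the competition files.

```python
from collections import Counter, defaultdict


def share_by_year(records):
    """Compute each destination's percentage share within each year.

    records: iterable of (year, destination) pairs.
    Returns {year: {destination: percent of that year's users}}.
    """
    counts = defaultdict(Counter)
    for year, destination in records:
        counts[year][destination] += 1
    shares = {}
    for year, counter in counts.items():
        total = sum(counter.values())
        shares[year] = {dest: 100.0 * n / total for dest, n in counter.items()}
    return shares


# Made-up toy records for illustration only.
toy = [
    (2013, "NDF"), (2013, "US"),
    (2014, "NDF"), (2014, "NDF"), (2014, "US"), (2014, "FR"),
]
shares = share_by_year(toy)
print(shares[2013])
print(shares[2014])
```

With real data, the year would come from the account creation date and the destination from country_destination.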

Table 3: Users by Destination and Year

| Destination | 2010   | 2011   | 2012   | 2013   | 2014   | Overall |
|-------------|--------|--------|--------|--------|--------|---------|
| NDF         | 42.5%  | 45.4%  | 55.0%  | 59.2%  | 61.8%  | 58.3%   |
| US          | 44.0%  | 38.1%  | 31.1%  | 28.9%  | 26.7%  | 29.2%   |
| other       | 2.8%   | 4.7%   | 4.9%   | 4.6%   | 4.8%   | 4.7%    |
| FR          | 4.3%   | 4.0%   | 2.8%   | 2.2%   | 1.9%   | 2.4%    |
| IT          | 1.1%   | 1.7%   | 1.5%   | 1.2%   | 1.3%   | 1.3%    |
| GB          | 1.0%   | 1.5%   | 1.3%   | 1.0%   | 1.0%   | 1.1%    |
| ES          | 1.5%   | 1.7%   | 1.2%   | 1.0%   | 0.9%   | 1.1%    |
| CA          | 1.5%   | 1.1%   | 0.7%   | 0.6%   | 0.6%   | 0.7%    |
| DE          | 0.6%   | 0.8%   | 0.7%   | 0.5%   | 0.3%   | 0.5%    |
| NL          | 0.4%   | 0.6%   | 0.4%   | 0.3%   | 0.3%   | 0.4%    |
| AU          | 0.3%   | 0.3%   | 0.3%   | 0.3%   | 0.2%   | 0.3%    |
| PT          | 0.0%   | 0.2%   | 0.1%   | 0.1%   | 0.1%   | 0.1%    |
| Total       | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0%  |

For modeling purposes, this type of split means a couple of things. Firstly, the spread of
categories has changed over time. Considering that our final predictions will be made against
user data from July 2014 onwards, this change provides us with an incentive to focus on more
recent data for training purposes, as it is more likely to resemble the test data.
Secondly, because the vast majority of users fall into 2 categories, there is a risk that if the
model is too generalized, or in other words not sensitive enough, it will select one of those two
categories for every prediction. A key step will be ensuring the training data has enough
information to ensure the model will predict other categories as well.
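
To put a number on that risk, the record counts from Table 2 show how well a model would appear to do if it ignored the features entirely and always predicted the most common category. The sketch below uses plain accuracy as the yardstick, which is an assumption made for illustration and may differ from how the competition itself scores submissions.

```python
# Record counts per destination, copied from Table 2.
counts = {
    "NDF": 124543, "US": 62376, "other": 10094, "FR": 5023,
    "IT": 2835, "GB": 2324, "ES": 2249, "CA": 1428,
    "DE": 1061, "NL": 762, "AU": 539, "PT": 217,
}

total = sum(counts.values())

# A "model" that always predicts the single most common category.
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total

# Share of users covered by the two largest categories combined.
top_two_share = sum(sorted(counts.values(), reverse=True)[:2]) / total

print(majority_class)                      # NDF
print(round(100 * baseline_accuracy, 1))   # 58.3
print(round(100 * top_two_share, 1))       # 87.6
```

Any model worth keeping has to beat this do-nothing baseline, and it can only do so by getting some of the smaller categories right.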

Account Creation Dates
Let's now move on to the date_account_created column to see how the values have changed
over time.

Chart 1: Accounts Created Over Time

Chart 1 provides excellent evidence of the explosive growth of Airbnb, averaging over 10%
growth in new accounts created per month. In the year to June 2014, the number of new
accounts created was 125,884, a 132% increase from the year before.
But aside from showing how quickly Airbnb has grown, this data also provides another
important insight: the majority of the training data provided comes from the latest 2 years. In
fact, if we limited the training data to accounts created from January 2013 onwards, we would
still be including over 70% of all the data. This matters because, referring back to the notes
provided by Airbnb, if we want to use the data in sessions.csv we would be limited to data from
January 2014 onwards. Again looking at the numbers, this means that even though
the sessions.csv data only covers 11% of the time period (6 out of 54 months), it still covers
over 30% of the training data, or 76,466 users.
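
These shares can be re-derived from the per-year record totals reported in Table 4 of this article. The sketch below is only a worked check of the arithmetic; it takes the 54-month span (January 2010 to June 2014) from the text above.

```python
# Total records per account-creation year, copied from Table 4.
records_by_year = {2010: 2788, 2011: 11775, 2012: 39462, 2013: 82960, 2014: 76466}

total = sum(records_by_year.values())

# Share of training data created from January 2013 onwards.
since_2013 = (records_by_year[2013] + records_by_year[2014]) / total

# sessions.csv only reaches back to January 2014: 6 of the 54 months covered.
sessions_time_share = 6 / 54
sessions_data_share = records_by_year[2014] / total

print(total)                                  # 213451
print(round(100 * since_2013, 1))             # 74.7
print(round(100 * sessions_time_share, 1))    # 11.1
print(round(100 * sessions_data_share, 1))    # 35.8
```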

Age Breakdown

Looking at the breakdown by age, we can see a good example of another issue that anyone
working with data (whether a Data Scientist or not) faces regularly: data quality issues. As can
be seen from Chart 2, there are a significant number of users that have reported their ages as
well over 100. In fact, a significant number of users reported their ages as over 1000.

Chart 2: Reported Ages of Users

So what is going on here? Firstly, it appears that a number of users have reported their birth
year instead of their age. This would help to explain why there are a lot of users with ages
between 1924 and 1953. Secondly, we also see significant numbers of users reporting their age
as 105 and 110. This is harder to explain, but it is likely that some users intentionally entered
their age incorrectly for privacy reasons. Either way, these values would appear to be errors
that will need to be addressed.
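
One possible way to handle both kinds of error is sketched below. It is an illustration only and not necessarily the cleaning approach taken in the next part of this series. The `clean_age` helper, the 15 to 100 plausibility bounds, and the use of a reference year (for example, the year the account was created) are all assumptions introduced here.

```python
def clean_age(raw_age, reference_year, min_age=15, max_age=100):
    """Return a plausible integer age, or None when the value cannot be trusted.

    min_age and max_age are illustrative thresholds, not values from the article.
    """
    if raw_age is None or raw_age == "":
        return None
    age = float(raw_age)
    # Values that look like a birth year (e.g. 1924 to 1953) are converted to an age.
    if 1900 <= age <= reference_year:
        age = reference_year - age
    # Anything still outside the plausible range is treated as missing.
    if min_age <= age <= max_age:
        return int(age)
    return None


print(clean_age(43, 2014))    # 43
print(clean_age(1950, 2014))  # 64
print(clean_age(105, 2014))   # None
```

Values turned into None here would then join the missing ages, so they can all be dealt with in one place.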
Additionally, as we saw in the example data provided above, another issue with the age
column is that sometimes age has not been reported at all. In fact, if we look across all the
training data provided, we can see a large number of missing values in all years.

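Table 4: Missing Ages

| Year  | Missing Values | Total Records | % Missing |
|-------|----------------|---------------|-----------|
| 2010  | 1,082          | 2,788         | 38.8%     |
| 2011  | 4,090          | 11,775        | 34.7%     |
| 2012  | 13,740         | 39,462        | 34.8%     |
| 2013  | 34,950         | 82,960        | 42.1%     |
| 2014  | 34,128         | 76,466        | 44.6%     |
| Total | 87,990         | 213,451       | 41.2%     |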

When we clean the data, we will have to decide what to do with these missing values.

First Device Type
Finally, one last column that we will look at is the first_device_type column.

Table 5: First Device Used

| Device             | 2010   | 2011   | 2012   | 2013   | 2014   | All Years |
|--------------------|--------|--------|--------|--------|--------|-----------|
| Mac Desktop        | 37.2%  | 40.4%  | 47.2%  | 44.2%  | 37.3%  | 42.0%     |
| Windows Desktop    | 21.6%  | 25.2%  | 37.7%  | 36.9%  | 31.0%  | 34.1%     |
| iPhone             | 5.8%   | 6.3%   | 3.8%   | 7.5%   | 15.9%  | 9.7%      |
| iPad               | 4.6%   | 4.8%   | 6.1%   | 7.1%   | 7.0%   | 6.7%      |
| Other/Unknown      | 28.8%  | 21.3%  | 3.8%   | 2.8%   | 4.6%   | 5.0%      |
| Android Phone      | 1.1%   | 1.2%   | 0.7%   | 0.4%   | 2.6%   | 1.3%      |
| Android Tablet     | 0.4%   | 0.4%   | 0.3%   | 0.5%   | 0.9%   | 0.6%      |
| Desktop (Other)    | 0.4%   | 0.4%   | 0.4%   | 0.6%   | 0.7%   | 0.6%      |
| SmartPhone (Other) | 0.0%   | 0.1%   | 0.1%   | 0.0%   | 0.0%   | 0.0%      |
| Total              | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0%    |

The interesting thing about the data in this column is how the types of devices used have
changed over time. Windows users have increased significantly as a percentage of all users.
iPhone users have tripled their share, while users using Other/unknown devices have gone
from the second largest group to less than 5% of users. Further, the majority of these changes
occurred between 2011 and 2012, suggesting that there may have been a change in the way
the classification was done.
As with the other columns we have reviewed above, this change over time reinforces the
presumption that recent data is likely to be the most useful for building our model.

Other Columns
It should be noted that although we have not covered all of them here, having some
understanding of all the data provided in a dataset is important for building an accurate
classification model. In some cases, this may not be possible due to the presence of a very
large number of columns, or due to the fact that the data has been abstracted (that is, the data
has been converted into a different form). However, in this particular case, the number of
columns is relatively small and the information is easily understandable.

Next Time
Now that we have taken the first step, understanding the data, in the next piece we will start
cleaning the data to get it into a form that will help to optimize the model's performance.

http://brettromero.com/wordpress/datascienceakagglewalkthroughunderstandingthedata/
