Team Yelpadvisor: Stephanie Wuerth, Bichen Wu, Tsu Fang Lu
December 14, 2015

Problem Statement and Background
Our goal is to classify restaurants into existing labels using the Yelp academic dataset. We also hope to further classify restaurants with more specific labels than their given labels. For example, a restaurant may just be labeled as Chinese, but can we further classify it as Sichuan or Taiwanese?

Our dataset is the Yelp academic dataset, which is provided for use as part of the Yelp Dataset Challenge [1]. This dataset spans approximately 10 years of Yelp reviews (text and star rating) and 5 years of Yelp tips, along with hourly sums of check-ins for each business. It also includes general information about the business, such as its categories, ambiance, business hours, and address, and some information about the Yelp reviewers, such as the number of reviews they have written and the average star rating they have given. The dataset includes 10 cities: Edinburgh, Karlsruhe, Charlotte, Urbana, Madison, Las Vegas, Phoenix, Pittsburgh, Montreal, and Waterloo (Canada).

To measure accuracy, we compare our predicted label to the true label. We measure basic accuracy, precision, and recall, as well as AUC. We also compare our model's accuracy to a baseline model.
Some potential applications of our classification models include:
1. Help Yelp automate restaurant labeling without user input.
2. Label vaguely labeled restaurants more specifically, or label restaurants with missing labels.
3. Inform customers about restaurants' specialties and particular cuisines by further subcategorizing restaurants into more specific labels.
Methods

Data collection

We used the Yelp academic dataset, which is made available by Yelp for the Yelp Dataset Challenge [1]. To obtain this data, we registered for the Yelp Dataset Challenge at http://www.yelp.com/dataset_challenge.
Data preparation

The data provided is in JSON format, but a Python script for converting to CSV is offered at https://github.com/Yelp/dataset-examples. We used this script (json_to_csv_converter.py) to convert the JSON data into CSV files, then read those into a Python notebook and stored the data in Pandas dataframes. We subset the data for what is potentially useful for our chosen problem. We use 9 of the 10 cities in the Yelp Academic Dataset for our model. The Karlsruhe, Germany data is omitted because most reviews there are not written in English, and review text is the richest component of our dataset. We further subset by selecting only restaurants (excluding Hotels, Spas, etc.). Within restaurants we further subset to the 20 most common types of restaurant, as dictated by their given labels. The labels chosen and the number of restaurants with each label in our subsetted dataset are given in Fig. 1 (see Appendix). We also removed EOL characters, carriage returns, and certain regex patterns from the review texts so that our bag-of-words model would work better.
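The subsetting and cleaning steps above can be sketched with the standard library alone. This is a minimal illustration, not our exact notebook code: the field names follow the Yelp business JSON schema, and the regex here is a simplified stand-in for the patterns we stripped.

```python
import json
import re

def clean_review_text(text):
    """Strip EOL characters, carriage returns, and non-letter runs so the
    bag-of-words tokenizer sees plain lowercase text."""
    text = text.replace("\r", " ").replace("\n", " ")
    text = re.sub(r"[^a-zA-Z' ]+", " ", text)   # drop digits and symbols
    return re.sub(r"\s+", " ", text).strip().lower()

def load_restaurants(json_lines, keep_labels):
    """Parse one-JSON-object-per-line business records, keeping only
    restaurants whose categories intersect the chosen 20 labels."""
    kept = []
    for line in json_lines:
        biz = json.loads(line)
        cats = set(biz.get("categories") or [])
        if "Restaurants" in cats and cats & keep_labels:
            kept.append(biz)
    return kept
```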
Featurization

We featurize our review text using a Bag of Words (BoW) model, building a training matrix of number of restaurants by size of vocabulary as follows: all reviews received by each restaurant in the training set (70% of the total) are joined and tokenized with stop words removed, then words are counted to create the sparse BoW vector for each restaurant.
We tested several different feature inclusions:
(1) N-grams: unigrams only, or bigrams + unigrams.
(2) Number of features retained: 6,000, 15,000, 100,000, or ~200,000 (which is the total count of unique words in our training corpus).
(3) Feature weights: raw frequencies, or term frequency-inverse document frequency (TFIDF) weighting. We note here the specifics of the TFIDF weighting: we used the default parameters of the sklearn.feature_extraction.text.TfidfTransformer() tool (norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False). The norm parameter means we normalize the final vectors, and the smooth_idf and use_idf parameters mean our features are weighted according to tf * (idf + 1), where tf is the frequency of the feature in the restaurant's merged reviews, and idf is the inverse frequency of the feature in the entire training corpus (all restaurant reviews).
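The weighting just described can be reproduced by hand; the sketch below mirrors TfidfTransformer's defaults (smooth_idf adds one to both the document count and the document frequency before the log, and each row is scaled to unit L2 norm). It takes dense count rows for clarity; the real matrices are sparse.

```python
import math

def tfidf_weight(count_rows):
    """sklearn-style smoothed TFIDF: idf(t) = ln((1+n)/(1+df(t))) + 1,
    weight = tf * idf, then each restaurant row is L2-normalized."""
    n = len(count_rows)
    n_terms = len(count_rows[0])
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n) / (1 + d)) + 1.0 for d in df]
    weighted = []
    for row in count_rows:
        w = [tf * i for tf, i in zip(row, idf)]
        norm = math.sqrt(sum(x * x for x in w)) or 1.0  # guard empty rows
        weighted.append([x / norm for x in w])
    return weighted
```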
Another featurization we tried, but did not implement in the final pipeline, is to use the star rating matrix, which is a matrix of number of users by number of restaurants. Each element in the matrix corresponds to a user's rating for a certain restaurant. We then performed matrix factorization (through PCA and Alternating Least Squares) to obtain a factor matrix of number of factors by number of restaurants. We treated each column vector (of length equal to the number of factors) as a data point representing each restaurant.
Learning

First we describe the learning methods used for the supervised problem of classifying restaurants into their existing labels. Then we describe the methods for the unsupervised problem of classifying restaurants into subcategories.
Supervised text-based classification into existing labels

Models tested: logistic regression and random forest.
Logistic regression marginally outperformed our random forest models, so we have chosen the logistic regression model as our primary model.
Parameter choices:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, max_iter=100, multi_class='ovr',
                   penalty='l2', random_state=None, solver='liblinear',
                   tol=0.0001, verbose=0)
Logistic regression: multi_class='ovr' indicates that a binary problem is fit for each label. So in our case, for each of the 20 categories we model whether a restaurant does or does not fall into that category. This is a logical choice since some restaurants fall into more than one category (for example, many Sushi Bars are also Japanese).
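Because a restaurant can carry several labels at once, this one-vs-rest setup amounts to fitting 20 independent binary classifiers. A minimal sketch (the array shapes are illustrative; our actual pipeline passes the BoW training matrix and the 20 label columns):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_one_vs_rest(X, Y):
    """Fit one binary logistic regression per label column of Y
    (n_restaurants x n_labels), mirroring multi_class='ovr' for a
    multilabel problem."""
    models = []
    for j in range(Y.shape[1]):
        clf = LogisticRegression(C=1.0, penalty='l2', solver='liblinear')
        clf.fit(X, Y[:, j])
        models.append(clf)
    return models

def predict_labels(models, X):
    """Stack the per-label binary predictions into an indicator matrix."""
    return np.column_stack([m.predict(X) for m in models])
```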
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features=6000, max_leaf_nodes=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=4,
                       oob_score=False, random_state=None, verbose=0,
                       warm_start=False)
Random forests: We tested a number of parameter choices, but the best performance was achieved by keeping 6000 features, 100 estimators, bootstrap on, and the gini criterion. We initially included fewer features because, according to the scikit-learn documentation [2], for classification tasks the number of features used in a random forest model is often optimized with max_features = sqrt(n_features). n_features in our case is ~200,000, so ~450 would be a good choice for max_features. However, we saw increased accuracy when we included more features.
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

Multinomial Naive Bayes (MNB) (baseline model): The alpha parameter is set by default to 1 to include additive (Laplace) smoothing.
We also tried Bernoulli Naive Bayes by binarizing features such that the presence of a word (a count of 1 or more) gave a feature value of 1, while the absence of a word gave a feature value of 0. This method gave us fairly high accuracy but zero recall for all categories, so we present Multinomial NB as our baseline model.
Clustering for Subcategorization

For subcategorization, we implemented spectral clustering, following the procedure of Ng, Jordan, and Weiss [3]. We implemented this algorithm ourselves and applied it to cluster: (1) all restaurants into groups, in order to see whether or not these groups correspond to a sensible composition of given restaurant types, and (2) Chinese restaurants into subcategories. The parameter in this algorithm is the scale sigma, which controls the connectivity of data points. The smaller sigma is, the more separated the clusters will appear. We set sigma to the value that makes the number of separated clusters equal to 5.
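The embedding step of the Ng-Jordan-Weiss procedure [3] can be sketched compactly with NumPy; this is an illustration under the stated Gaussian-affinity assumption, not our exact implementation. Running k-means on the returned rows yields the final clusters.

```python
import numpy as np

def spectral_embed(X, k, sigma):
    """Spectral embedding per Ng-Jordan-Weiss: Gaussian affinity with
    scale sigma, symmetrically normalized affinity matrix, top-k
    eigenvectors, rows renormalized to unit length."""
    # Pairwise squared distances and Gaussian affinity (zero diagonal).
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    A = np.exp(-sq / (2 * sigma ** 2))
    np.fill_diagonal(A, 0.0)
    # L = D^{-1/2} A D^{-1/2}, with D the diagonal degree matrix.
    d = A.sum(1)
    dinv = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = dinv[:, None] * A * dinv[None, :]
    # Top-k eigenvectors (eigh returns eigenvalues in ascending order).
    _, vecs = np.linalg.eigh(L)
    V = vecs[:, -k:]
    # Normalize each row to unit length before k-means.
    return V / np.maximum(np.linalg.norm(V, axis=1, keepdims=True), 1e-12)
```

With a small sigma, points in different clusters have near-zero affinity, so their embedded rows land in nearly orthogonal directions and k-means separates them cleanly.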
Other things we tried:

Initially we were working on a different problem involving time series analysis of daily review counts. We took the time series of the daily review counts as our features, and our hope was to (1) find different customer influx patterns for different types of restaurants, and (2) predict customer influx to certain cities and venues based on these time series. This analysis failed because, after running some statistical tests, we learned that there is not enough information in the time series for us to distinguish different types of restaurants or predict customer influx.
Another method we tried was to factorize the star rating matrix to yield a factor matrix corresponding to restaurants, then use this as features and apply k-means on it to find restaurant types. However, the star rating matrix is very sparse. Even using ALS (Alternating Least Squares) factorization, our average prediction error (measured by root mean squared error) was larger than 1 star. So we abandoned this feature and used bag of words instead.
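For illustration, ALS on a masked rating matrix alternates two ridge-regression solves, fitting only the observed entries. This is a toy sketch of the technique, not our exact code; the rank k, regularizer lam, and iteration count are hypothetical choices.

```python
import numpy as np

def als_factorize(R, mask, k=2, lam=0.1, iters=20, seed=0):
    """Alternating Least Squares on a (users x restaurants) rating
    matrix R, where mask > 0 marks observed entries. Returns user
    factors U (n_u x k) and restaurant factors V (n_r x k)."""
    rng = np.random.default_rng(seed)
    n_u, n_r = R.shape
    U = rng.standard_normal((n_u, k)) * 0.1
    V = rng.standard_normal((n_r, k)) * 0.1
    I = lam * np.eye(k)
    for _ in range(iters):
        # Solve for each user's factors with restaurant factors fixed.
        for i in range(n_u):
            obs = mask[i] > 0
            Vi = V[obs]
            U[i] = np.linalg.solve(Vi.T @ Vi + I, Vi.T @ R[i, obs])
        # Solve for each restaurant's factors with user factors fixed.
        for j in range(n_r):
            obs = mask[:, j] > 0
            Uj = U[obs]
            V[j] = np.linalg.solve(Uj.T @ Uj + I, Uj.T @ R[obs, j])
    return U, V
```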
Results

Supervised Labeling

How many features should we include?
Figure 2 in the appendix shows that, for the case of unigrams only and no TFIDF weighting, accuracy, precision, and AUC are all maximized by including the entire corpus. The effect of increased corpus size on recall is less clear-cut. Since recall is not drastically decreased by including more features, we can base our choices on the precisions and accuracies.
Should we weight our data by TFIDF? Should we include bigrams?
In Figure 3, we compare the performance of our logistic regression model for four featurization choices. In all cases, ~200,000 features are used, but we vary the inclusion of unigrams vs. unigrams + bigrams, and we test whether or not to weight with TFIDF. The top plot shows that using bigrams in addition to unigrams has little effect on the overall accuracies. We see a slight improvement for Italian and Chinese restaurants when adding bigrams, but this improvement is not substantial. The middle and bottom plots show precision and recall. We see that using TFIDF weights generally increases precision but decreases recall. We average over all 20 categories for these measures in the table below:
            Raw frequencies,  TFIDF weights,  Raw frequencies,     TFIDF weights,
            unigrams only     unigrams only   bigrams + unigrams   bigrams + unigrams
Accuracy    0.9517            0.9486          0.9562               0.9447
AUC         0.9220            0.9492          0.9347               0.9498
Precision   0.7215            0.8611          0.7737               0.9040
Recall      0.5672            0.3304          0.5551               0.2665

Table 1. Comparison of different featurization choices for the logistic regression model (with ~200,000 features retained). Measures are averaged across scores for the 20 categories.
The highest scores for each accuracy measure are shown in bold. The highest overall accuracy is achieved by including bigrams and unigrams and weighting the features by their raw counts. Precision and AUC are improved by weighting by TFIDF, but recall is markedly decreased. Such low recall would cause us, for example, to fail to recommend a relevant restaurant to a Yelp user, so we choose not to weight our data by TFIDF. Thus our final choice for featurization is: ~200,000 features of bigrams and unigrams, weighted by raw word counts.
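The averaging used in Table 1 (and the later tables) is a macro average: each measure is computed per category, then averaged with equal weight across the 20 categories. A minimal sketch:

```python
def macro_average_metrics(y_true, y_pred):
    """Per-category accuracy, precision, and recall, then an unweighted
    (macro) average across categories. y_true and y_pred are lists of
    per-restaurant 0/1 lists, one column per category."""
    n, m = len(y_true), len(y_true[0])
    acc = prec = rec = 0.0
    for j in range(m):
        tp = fp = fn = correct = 0
        for i in range(n):
            t, p = y_true[i][j], y_pred[i][j]
            correct += (t == p)
            tp += (t == 1 and p == 1)
            fp += (t == 0 and p == 1)
            fn += (t == 1 and p == 0)
        acc += correct / n
        prec += tp / (tp + fp) if tp + fp else 0.0
        rec += tp / (tp + fn) if tp + fn else 0.0
    return acc / m, prec / m, rec / m
```

Macro averaging treats rare categories (e.g. Chicken Wings) on equal footing with common ones, which is why the rare categories drag the averaged recall down noticeably.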
Discussion of individual category accuracies

Figure 4 shows accuracy, precision, and recall for each category for our chosen model and featurization. Alongside our model's accuracies, we include "Always False" accuracies, i.e. the accuracy of a model that simply predicts false uniformly for each label. We see that for all categories, our model outperforms the "Always False" classifier. However, accuracy is very close to this "Always False" classifier for the rarest categories: Sushi Bars, Delis, Steakhouses, Seafood, and Chicken Wings. With larger numbers of these types of restaurants, performance for these categories would potentially be improved.

Taking into account accuracy, precision, and recall, we see that our classifier is best at labeling Mexican, Pizza, and Chinese. It is not as successful at classifying American (Traditional and New), or at classifying the label "Food." This makes sense because American restaurants and "Food" restaurants have less obviously identifying word features than Mexican or Chinese restaurants. To visualize this effect, we examine word clouds for some of these cases (Figure 5). The word clouds display the most frequent words in all reviews for a given category, sized by their frequencies. Stop words are removed, in addition to the word "food," which is common to all categories.
Random Forest Model Results

Here we also report the accuracy measures for our Random Forest model, because it performed nearly as well as the Logistic Regression model. For the results presented here, we used the same training matrix as in the primary model (unigrams + bigrams, raw counts), but we only retain 6000 features. The parameters used are given in the Methods section. We halve our test dataset into validation (for testing parameter combinations) and final test datasets. Table 2 summarizes accuracy scores for this model (both validation and final test scores), with logistic regression included for comparison. Accuracy and recall are below the logistic regression model, but precision is higher. If given more time to test more parameter combinations, it is plausible that we could achieve higher accuracy with this random forests approach. Recall might improve with shallower trees or fewer features considered, since these parameters give a simpler model with lower variance, but with this comes potentially higher bias (lower accuracy).
                              Accuracy  Precision  Recall
Logistic Regression           0.9562    0.7737     0.5551
Random Forest (validation)    0.9530    0.8867     0.4666
Random Forest (final test)    0.9522    0.8972     0.4550

Table 2. Accuracy measure comparisons between primary (logistic regression) and a random forest model. Measures are averaged across scores for the 20 categories.
The random forest model allows us to examine the most important features. Here we list some of the most important features (in decreasing order): pizza, chinese, bar, mexican, pizzas, burger, subway, mcdonalds, mexican food, chinese food, sandwiches, bartender, sandwich, sushi, tacos, taco, bartenders, italian, coffee, burrito, crust, fries, burgers, bar food, drive, fried rice, pizza good, asada, salsa, pepperoni, beer, good pizza, fast food, pasta, rice, carne asada, waitress, breakfast, italian food, burritos, subs, wings, best pizza, happy hour, bars, bread, mein, drinks, pizza place, beers, sub, fast, great pizza, cafe, restaurant, italian restaurant, eggs, japanese, place, rice beans, deli, taco bell, great, carne, chinese restaurant, pub.
Many of these features are obvious identifiers for certain labels.
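Rankings like the list above come from sorting the fitted forest's feature_importances_ array against the vectorizer vocabulary; a minimal sketch (the importances and vocabulary here are toy stand-ins):

```python
import numpy as np

def top_features(importances, vocab, n=10):
    """Return the n vocabulary terms with the highest importance scores,
    in decreasing order (as produced by a fitted
    RandomForestClassifier's feature_importances_ attribute)."""
    order = np.argsort(importances)[::-1][:n]
    return [vocab[i] for i in order]
```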
Comparison to baseline model

In the table below we summarize accuracy measures (averaged across our 20 categories) for our primary model and our baseline model. We see significant improvement in all measures except for recall. The low precision of the baseline model indicates that it underfits our data, which is expected of a simple model such as Naive Bayes.
                                  Accuracy  AUC     Precision  Recall
Multinomial NB (Baseline)         0.9119    0.8690  0.4818     0.7648
Logistic Regression (Primary)     0.9562    0.9347  0.7737     0.5551

Table 3. Accuracy measure comparisons between primary (logistic regression) and baseline (multinomial naive Bayes) models. Measures are averaged across scores for the 20 categories.
We compare performance against the baseline model for all categories in Figure 6. In the top panel we see that our model is more accurate than our baseline model for all categories. While the improvement does not appear drastic, it should be noted that "Always False" never outperforms our primary model, but it outperforms the baseline model for 11 of the 20 categories (these 11 being Fast Food, American (Traditional), Sandwiches, Food, American (New), Breakfast and Brunch, Cafes, Delis, Steakhouses, Seafood, and Chicken Wings). The 9 categories for which the baseline model surpasses "Always False" in accuracy are all categories we expect to have more unique vocabularies, such as ethnic cuisines. Our primary model outperforms the baseline model most significantly for the labels Sandwiches (improvement by >20%) and Fast Food (improvement by >10%). Better accuracy is expected for logistic regression as compared to naive Bayes for a problem such as ours because naive Bayes is a simplification of logistic regression. Naive Bayes assumes that features (words) are generated independently given the class (in our case, the class is true or false for each label), whereas logistic regression does not make this assumption. As such, we expect the naive Bayes model to have higher bias but lower variance, and that it will underfit our data, leading to low precision and high recall.
Spectral Clustering

We applied spectral clustering to: (1) all restaurants, to classify them into groups and analyze the true labels that comprise these groups, and (2) Chinese restaurants, to classify them into subcategories. In order to figure out which labels each cluster corresponds to, we printed out the top 5 true labels of each of the 5 clusters. As shown in Figure 7, we see that cluster 3 corresponds to pizza or Italian restaurants, and cluster 2 corresponds to bar/nightlife types of restaurants. The other three clusters are more difficult to interpret because they contain mixed types of restaurants. Figure 8 is the result of applying spectral clustering to Chinese restaurants. Many of the Chinese restaurants have true labels in addition to Chinese, such as Taiwanese or Buffet. So, as we did for the clusters of all restaurants, we can again print the most common true labels (other than Chinese) for the restaurants in our Chinese clusters. First we notice that the most frequent labels in each cluster are Asian Fusion and Buffet, which provide little information about their types. Beyond that, we see that in the first cluster we observe Japanese and Sushi Bar, which implies that these restaurants' styles are more dominated by Japanese food. In the fifth cluster, we observe Thai, Vietnamese, and Szechuan restaurants, which are relatively spicy.
Tools

We performed all of our analysis in iPython notebooks because this platform is useful for visualizing results alongside code. We used Pandas and NumPy for data manipulation because these are tools all group members use. At first, we built our BoW features (and TFIDF weights) using handwritten code adapted from CS294 homework, but later we migrated towards scikit-learn tools for this task.
sklearn.feature_extraction.text.CountVectorizer() was used to form BoW training matrices. This tool simplified a few tasks:
(1) setting the maximum feature retention count (the max_features parameter),
(2) setting which n-grams to include (ngram_range), and
(3) setting which stop words to remove (we removed words from the given 'english' stop word list).
Once those matrices were built, we could transform the counts into their TFIDF representation with sklearn.feature_extraction.text.TfidfTransformer().
For supervised labeling, we implemented several models from scikit-learn. The justification is that these tools are easy to use, especially in an iPython notebook. Models we used include:
from sklearn.linear_model: LogisticRegression()
from sklearn.naive_bayes: BernoulliNB() and MultinomialNB()
from sklearn.ensemble: RandomForestClassifier()
We also used these tools for quantifying model performance:
from sklearn.metrics: roc_curve, roc_auc_score, auc
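For a single binary category, roc_auc_score is equivalent to the Wilcoxon-Mann-Whitney rank statistic: the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counting one half. A pure-Python sketch:

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random
    negative (ties count 0.5), matching sklearn's roc_auc_score."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```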
For unsupervised clustering, we mainly used k-means from scikit-learn.
For visualization, we used Matplotlib because it is well suited for simple graphics and can be used inline in an iPython notebook. We also used the wordcloud package to create some appealing visualizations of our review text.
Lessons Learned

Supervised Labeling: We explored a number of machine learning approaches for the supervised problem of classifying Yelp restaurants into existing labels. Our best model was a logistic regression model, closely followed by a random forests model. We thus selected the logistic regression model as our primary model, and we compared it to a baseline model (multinomial naive Bayes). The features used were the words from all of the reviews written for each restaurant that we aimed to classify. We evaluated a number of featurization choices for these words, including:
(1) whether to use unigrams only or to additionally include bigrams,
(2) whether to weight the features by raw word counts or TFIDF weights, and
(3) how many features to include.
As seen in Figure 2 and Table 1, we achieved the best performance for the logistic regression model by using bigrams + unigrams, retaining 200,000+ features, and representing features as raw word counts.
We measure accuracy in a number of ways:
(1) accuracy (did we correctly predict that a restaurant does or does not fall within a certain category?),
(2) area under the ROC curve,
(3) precision, and
(4) recall.
Scores for these accuracy measures are displayed in Table 3. Our logistic regression model outperforms our baseline model substantially in accuracy, AUC, and precision, but the baseline model has higher recall. We also show that our primary model outperforms the "Always False" model for all 20 categories, whereas our baseline model does not for many categories. Our primary model performs best at classifying ethnic cuisines such as Chinese and Mexican, which we hypothesize is due to these types of restaurants having special and unique identifying words, such as "Mexican" and "tacos" for Mexican restaurants and "Chinese" and "noodles" for Chinese restaurants. This is corroborated by the word cloud visualizations in Figure 5 and by looking at the most important features for our random forests model.
Unsupervised Labeling:

Unsupervised learning for subcategorization is relatively more difficult. In this project, we applied spectral clustering to the review text in order to find subcategories of restaurants. The intuition is that, say, for Chinese restaurants, people may use "hot" and "spicy" to describe a Sichuan restaurant, and use "milk tea" and "salted popcorn chicken" in reviews for Taiwanese restaurants. However, the difficulty is that it's not obvious what each cluster corresponds to.

One way to figure this out is to look at the percentages of existing labels. For example, if in a cluster 50% of the restaurants are "bar" and 25% are "nightlife," then we could reason that this cluster corresponds to the bar type of restaurant. Though we do observe this in some of the clusters (refer to the Results section), there are also clusters with mixed labels that are not easy to interpret. A more fundamental question to ask is: is the clustering based on restaurant types? Or is it perhaps more related to something else, like star rating, cost, or other latent factors? A key lesson for us is that unsupervised learning doesn't always give us the result we expect.
Team Contributions

*CS294* Bichen (40%): Time series analysis (majority of the Project Preliminary Data Analysis submission), star rating matrix factorization, spectral clustering of review texts for unsupervised subcategory classification.
*CS294* Stephanie (40%): Initial reading-in of data and exploration of the business dataset (majority of the Project Data Exploration submission). Completion of bag-of-words featurization. Small-scale supervised labeling (majority of results presented in PowerPoint presentation). Majority of text featurization and supervised labeling presented in the poster presentation and presented here.
*CS194* Tsu Fang (20%): Data exploration on review texts and user data. Started bag-of-words featurization and TFIDF analysis. Tested the value of adding a restaurant name feature and TFIDF effects on model accuracies after logistic regression and naive Bayes (not shown). Ported and formatted results for poster/presentation.
References

(1) Yelp academic dataset. https://www.yelp.com/academic_dataset.
(2) Ensemble Methods. http://scikit-learn.org/stable/modules/ensemble.html
(3) Ng, Andrew Y., Michael I. Jordan, and Yair Weiss. "On spectral clustering: Analysis and an algorithm." Advances in Neural Information Processing Systems 2 (2002): 849-856.

Our github repository is here: https://github.com/tsufanglu/YelpDatasetChallenge
The most relevant notebooks to this report are:
CatsAllCities.ipynb
Yelp_Restaurnats_Spectral_Clustering.ipynb
They are located in the code folder of the repo:
https://github.com/tsufanglu/YelpDatasetChallenge/tree/master/code
Appendix (Figures)

Figure 1. Chosen restaurant labels and their counts.

Figure 2. Accuracies, precisions, and recalls for our logistic regression model, colored by the number of words retained in the training corpuses (no TFIDF weighting). These indicate that we ought to keep as many words as possible as features.

Figure 3. Comparison of accuracies for 4 different featurization choices. In each case, 211,964 words (or 211,964 bigrams + unigrams in the bigram case) are retained for training.

Figure 4. Accuracy measures for our chosen model, broken down by category.

Figure 5. Word clouds for American (New) (upper left), American (Traditional) (upper right), Mexican, and Chinese. Notice Chinese has words unique to it, such as "Chinese," "noodle," "rice," and "dumpling"; Mexican has unique words like "Mexican," "taco," and "burrito"; but the upper 2 word clouds do not show obviously unique words.

Figure 6. Baseline comparisons. There is substantial improvement over baseline for accuracy, AUC, and precision. The simple baseline model has higher recall. Bottom panel labels serve as a guide for all panels.

Figure 7. Spectral clustering result for all restaurants. Most frequent 5 labels in each cluster.

Figure 8. Spectral clustering results for Chinese restaurants. Most frequent 5 labels in each cluster.