Classifying Yelp Restaurants

Team Yelpadvisor: Stephanie Wuerth, Bichen Wu, Tsu-Fang Lu
December 14, 2015

Problem Statement and Background
Our goal is to classify restaurants into existing labels using the Yelp academic dataset. We also hope to further classify restaurants with more specific labels than their given labels. For example, a restaurant may just be labeled as Chinese, but can we further classify it as Sichuan or Taiwanese?
Our dataset is the Yelp academic dataset, which is provided for use as part of the Yelp Dataset Challenge [1]. This dataset spans approximately 10 years of Yelp reviews (text and star rating) and 5 years of Yelp tips, along with hourly sums of check-ins for each business. It also includes general information about the business, such as its categories, ambiance, business hours, and address, and some information about the Yelp reviewers, such as the number of reviews they have written and the average star rating they have given. The dataset includes 10 cities: Edinburgh, Karlsruhe, Charlotte, Urbana, Madison, Las Vegas, Phoenix, Pittsburgh, Montreal, and Waterloo (Canada).

To measure accuracy, we compare our predicted labels to the true labels. We measure basic accuracy, precision, and recall, as well as AUC. We also compare our model's accuracy to a baseline model's.
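As a concrete illustration, here is a minimal sketch of how these measures can be computed per category and then averaged, using the scikit-learn metrics named in the Tools section; the function and variable names (per_category_scores, y_true, y_pred, y_score) and the 0/1 label-matrix layout are our own assumptions.

import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)

def per_category_scores(y_true, y_pred, y_score):
    # y_true, y_pred: (n_restaurants, n_categories) 0/1 arrays (assumed
    # layout); y_score: predicted probabilities of the same shape.
    cols = range(y_true.shape[1])
    return {
        "accuracy":  np.mean([accuracy_score(y_true[:, j], y_pred[:, j]) for j in cols]),
        "precision": np.mean([precision_score(y_true[:, j], y_pred[:, j]) for j in cols]),
        "recall":    np.mean([recall_score(y_true[:, j], y_pred[:, j]) for j in cols]),
        "auc":       np.mean([roc_auc_score(y_true[:, j], y_score[:, j]) for j in cols]),
    }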

Some potential applications of our classification models include:
1. Help Yelp automate restaurant labeling without user input.
2. Label vaguely labeled restaurants more specifically, or label restaurants with missing labels.
3. Inform customers about restaurants' specialties and particular cuisines by further subcategorizing restaurants into more specific labels.

Methods
Data collection
We used the Yelp academic dataset, which is made available by Yelp for the Yelp Dataset Challenge [1]. To obtain this data, we registered for the Yelp Dataset Challenge at http://www.yelp.com/dataset_challenge.
Data preparation
The data provided is in JSON format, but a Python script for converting to csv is offered at https://github.com/Yelp/dataset-examples. We used this script (json_to_csv_converter.py) to convert the JSON data into csv files, then we read those into a Python notebook and stored the data in Pandas dataframes. We subset the data for what is potentially useful for our chosen problem. We use 9 of the 10 cities in the Yelp Academic Dataset for our model. The Karlsruhe, Germany data is omitted because most reviews there are not written in English, and review text is the richest component of our dataset. We further subset by selecting only restaurants (excluding Hotels, Spas, etc.). Within restaurants we further subset for the 20 most common types of restaurant, as dictated by their given labels. The labels chosen and the number of restaurants with each label in our subsetted dataset are given in Fig. 1 (see Appendix). We also removed EOL characters, carriage returns, and certain regex patterns from the review texts so that our bag-of-words model would work better. A simplified sketch of this subsetting and cleanup appears below.
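The sketch assumes the csv files produced by json_to_csv_converter.py; the exact column names and the single-city filter are simplifications of what the converted files contain.

import re
import pandas as pd

business = pd.read_csv("yelp_academic_dataset_business.csv")

# Keep only restaurants, and drop the (mostly non-English) Karlsruhe data.
is_restaurant = business["categories"].str.contains("Restaurants", na=False)
restaurants = business[is_restaurant & (business["city"] != "Karlsruhe")]

def clean_review(text):
    # Strip EOL/carriage-return characters and collapse whitespace; the
    # additional regex patterns we removed are not reproduced here.
    return re.sub(r"\s+", " ", text.replace("\r", " ")).strip()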

Featurization
We featurize our review text using a Bag of Words (BoW) model, building a training matrix of number of restaurants by size of vocabulary as follows: all reviews received by each restaurant in the training set (70% of the total) are joined and tokenized, with stop words removed; then words are counted to create the sparse BoW vector for each restaurant.
We tested several different feature inclusions (a sketch of the resulting pipeline follows this list):
N-grams: unigrams only, or bigrams + unigrams.
Number of features retained: 6,000; 15,000; 100,000; or ~200,000 (the total count of unique words in our training corpus).
Feature weights: raw frequencies, or term frequency-inverse document frequency (TF-IDF) weighting. We note here the specifics of the TF-IDF weighting: we used the default parameters of the sklearn.feature_extraction.text.TfidfTransformer() tool (norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False). The norm parameter means we normalize the final vectors, and the smooth_idf and use_idf parameters mean our features are weighted according to tf * (idf + 1), where tf is the frequency of the feature in the restaurant's merged reviews and idf is the inverse frequency of the feature in the entire training corpus (all restaurant reviews).
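A sketch of this featurization pipeline with the scikit-learn tools named in the Tools section; docs is assumed to be the list of per-restaurant merged review strings for the training set.

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Raw-count BoW matrix: one row per restaurant, one column per token.
vectorizer = CountVectorizer(stop_words="english",
                             ngram_range=(1, 2),  # (1, 1) for unigrams only
                             max_features=None)   # or 6000 / 15000 / 100000
X_counts = vectorizer.fit_transform(docs)

# Optional TF-IDF weighting with the default parameters described above.
X_tfidf = TfidfTransformer(norm="l2", use_idf=True, smooth_idf=True,
                           sublinear_tf=False).fit_transform(X_counts)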

Another featurization we tried, but did not implement in the final pipeline, is to use the star-rating matrix, which is a matrix of number of users by number of restaurants. Each element in the matrix corresponds to a user's rating for a certain restaurant. We then performed matrix factorization (through PCA and Alternating Least Squares) to obtain a factor matrix of number of factors by number of restaurants. We treated each vector (with length equal to the number of factors) as a data point representing each restaurant.

Learning
First we describe the learning methods used for the supervised problem of classifying restaurants into their existing labels. Then we describe the methods for the unsupervised problem of classifying restaurants into subcategories.

Supervised text-based classification into existing labels

Models tested: logistic regression and random forest.
Logistic regression marginally outperformed our random forest models, so we have chosen the logistic regression model as our primary model.

Parameter choices:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, max_iter=100, multi_class='ovr',
                   penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
                   verbose=0)

Logistic regression: multi_class='ovr' indicates that a binary problem is fit for each label. So in our case, for each of the 20 categories, we model whether a restaurant does or does not fall into that category. This is a logical choice since some restaurants fall into more than one category (for example, many Sushi Bars are also Japanese).
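To make the per-label framing explicit, the sketch below wraps the same classifier in scikit-learn's OneVsRestClassifier (our substitution to handle the multi-label setting; it fits the same kind of per-label binary problems the text describes). X_train and Y_train are the assumed BoW matrix and (n_restaurants, 20) 0/1 label matrix.

from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# One binary logistic regression per category; parameters match those above.
clf = OneVsRestClassifier(
    LogisticRegression(C=1.0, penalty='l2', solver='liblinear', tol=0.0001))
clf.fit(X_train, Y_train)

Y_pred = clf.predict(X_test)           # 0/1 membership per category
Y_score = clf.predict_proba(X_test)    # per-category probabilities for AUC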

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features=6000, max_leaf_nodes=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=4,
                       oob_score=False, random_state=None, verbose=0,
                       warm_start=False)

Random forests: We tested a number of parameter choices, but the best performance was achieved by keeping 6,000 features, 100 estimators, bootstrap on, and the gini criterion. We initially included fewer features because, according to the scikit-learn documentation [2], for classification tasks the number of features used in a random forest model is often optimized with max_features=sqrt(n_features). n_features in our case is ~200,000, so ~500 would be a good choice for max_features. However, we saw increased accuracy when we included more features.

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

Multinomial Naive Bayes (MNB) (baseline model):
The alpha parameter is set by default to 1 to include additive (Laplace) smoothing.
We also tried Bernoulli Naive Bayes by binarizing features such that the presence of a word (a count of 1 or more) gave a feature value of 1, while the absence of a word gave a feature value of 0. This method gave us fairly high accuracy but zero recall for all categories, so we present MultinomialNB as our baseline model.
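A minimal sketch of both Naive Bayes variants for a single category, assuming y_cat is that category's 0/1 label vector; BernoulliNB's default binarize=0.0 reproduces the presence/absence scheme described above.

from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Baseline: multinomial NB on the raw-count BoW matrix.
mnb = MultinomialNB(alpha=1.0).fit(X_train, y_cat)

# Bernoulli variant: counts > 0 are mapped to presence (1), else absence (0).
bnb = BernoulliNB(alpha=1.0, binarize=0.0).fit(X_train, y_cat)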

Clustering for Subcategorization

For subcategorization, we implemented spectral clustering, which can be summarized by the following procedure [3]:

1. Form the affinity matrix A, with A_ij = exp(-||s_i - s_j||^2 / (2σ^2)) for i ≠ j and A_ii = 0.
2. Define D to be the diagonal matrix whose (i, i) element is the sum of A's i-th row, and construct L = D^(-1/2) A D^(-1/2).
3. Find the k eigenvectors x_1, ..., x_k corresponding to the k largest eigenvalues of L, and form the matrix X = [x_1 ... x_k].
4. Renormalize each row of X to unit length to form the matrix Y.
5. Treat each row of Y as a data point and perform k-means clustering on Y.

We implemented this algorithm ourselves and applied it to cluster: (1) all restaurants into groups, in order to see whether or not these groups correspond to a sensible composition of given restaurant types, and (2) Chinese restaurants into subcategories. The parameter in this algorithm is σ, which controls the connectivity of data points. The smaller σ is, the more separated the clusters will appear. We set σ to the value that makes the number of separated clusters equal to 5. A sketch of our implementation follows.
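The sketch below is a minimal NumPy rendering of the five steps above, assuming a dense restaurant-by-feature matrix S; the function name and the use of SciPy and scikit-learn helpers are our own choices, not necessarily the exact code we ran.

import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import KMeans

def spectral_cluster(S, k, sigma):
    # Step 1: Gaussian affinity matrix with a zeroed diagonal.
    A = np.exp(-squareform(pdist(S, 'sqeuclidean')) / (2 * sigma ** 2))
    np.fill_diagonal(A, 0.0)
    # Step 2: L = D^(-1/2) A D^(-1/2), with D the diagonal matrix of row sums.
    d = 1.0 / np.sqrt(A.sum(axis=1))
    L = A * d[:, None] * d[None, :]
    # Step 3: eigenvectors of the k largest eigenvalues (eigh sorts ascending).
    _, vecs = np.linalg.eigh(L)
    X = vecs[:, -k:]
    # Step 4: renormalize each row of X to unit length.
    Y = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Step 5: k-means on the rows of Y.
    return KMeans(n_clusters=k).fit_predict(Y)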

Other things we tried:

Initially we were working on a different problem involving time-series analysis of daily review counts. We took the time series of the daily review counts as our features, and our hope was to (1) find different customer-influx patterns for different types of restaurants, and (2) predict customer influx to certain cities and venues based on these time series. Our analysis failed because, after running some statistical tests, we learned that there is not enough information in the time series for us to distinguish different types of restaurants and predict customer influx.

Another method we tried was to factorize the star-rating matrix to yield a factor matrix corresponding to restaurants, then use this as features and apply k-means on it to find restaurant types. However, the star-rating matrix is very sparse. Even using ALS (Alternating Least Squares) factorization, our average prediction error (measured by root mean squared error) was larger than 1 star. So we abandoned this feature and used bag of words instead. A minimal sketch of the factorization appears below.
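The sketch assumes a dense user-by-restaurant matrix R and a boolean mask of observed ratings; the names and default values are ours.

import numpy as np

def als(R, mask, n_factors=20, n_iters=10, reg=0.1):
    # Alternate between ridge-regression solves for the user factors U and
    # the restaurant factors V, using only the observed entries of R.
    n_users, n_items = R.shape
    U = np.random.rand(n_users, n_factors)
    V = np.random.rand(n_items, n_factors)
    I = reg * np.eye(n_factors)
    for _ in range(n_iters):
        for u in range(n_users):
            obs = mask[u]
            U[u] = np.linalg.solve(V[obs].T @ V[obs] + I, V[obs].T @ R[u, obs])
        for i in range(n_items):
            obs = mask[:, i]
            V[i] = np.linalg.solve(U[obs].T @ U[obs] + I, U[obs].T @ R[obs, i])
    return U, V  # predicted ratings are U @ V.T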

Results

Supervised Labeling

How many features should we include?
Figure 2 in the appendix shows that, for the case of unigrams only and no TF-IDF weighting, accuracy, precision, and AUC are all maximized by including the entire corpus. The effect of increased corpus size on recall is less clear-cut. Since recall is not drastically decreased by including more features, we can base our choice on the precisions and accuracies.

Should we weight our data by TF-IDF? Should we include bigrams?
In Figure 3, we compare the performance of our logistic regression model for four featurization choices. In all cases, ~200,000 features are used, but we vary the inclusion of unigrams vs. unigrams + bigrams, and we test whether or not to weight with TF-IDF. The top plot shows that using bigrams in addition to unigrams has little effect on the overall accuracies. We see a slight improvement for Italian and Chinese restaurants when adding bigrams, but this improvement is not substantial. The middle and bottom plots show precision and recall. We see that using TF-IDF weights generally increases precision but decreases recall. We average over all 20 categories for these measures in the table below:

            Raw frequencies,   TF-IDF weights,   Raw frequencies,     TF-IDF weights,
            unigrams only      unigrams only     bigrams + unigrams   bigrams + unigrams
Accuracy    0.9517             0.9486            0.9562*              0.9447
AUC         0.9220             0.9492            0.9347               0.9498*
Precision   0.7215             0.8611            0.7737               0.9040*
Recall      0.5672*            0.3304            0.5551               0.2665

Table 1. Comparison of different featurization choices for the logistic regression model (with ~200,000 features retained). Measures are averaged across scores for the 20 categories; the highest score for each measure is marked with an asterisk.


The highest score for each accuracy measure is marked with an asterisk in Table 1. The highest overall accuracy is achieved by including bigrams and unigrams and weighting the features by their raw counts. Precision and AUC are improved by weighting by TF-IDF, but recall is markedly decreased. Such low recall would cause us, for example, to fail to recommend a relevant restaurant to a Yelp user, so we choose not to weight our data by TF-IDF. Thus our final choice for featurization is: ~200,000 features of bigrams and unigrams, weighted by raw word counts.

Discussion of individual category accuracies
Figure 4 shows accuracy, precision, and recall for each category for our chosen model and featurization. Alongside our model's accuracies, we include "Always False" accuracies, which is the accuracy for a model that simply predicts false uniformly for each label. We see that for all categories, our model outperforms the "Always False" classifier. However, accuracy is very close to this "Always False" classifier for the rarest categories: Sushi Bars, Delis, Steakhouses, Seafood, and Chicken Wings. With larger numbers of these types of restaurants, performance for these categories would potentially be improved.

Taking into account accuracy, precision, and recall, we see that our classifier is best at labeling Mexican, Pizza, and Chinese. It is not as successful at classifying American (Traditional and New), or at classifying the label "Food." This makes sense because American restaurants and "Food" restaurants have less obviously identifying word features than Mexican or Chinese restaurants. To visualize this effect, we examine word clouds for some of these cases (Figure 5). The word clouds display the most frequent words in all reviews for a given category, sized by their frequencies. Stop words are removed, in addition to the word "food," which is common to all categories.

Random Forest Model Results
Here we also report the accuracy measures for our Random Forest model, because it performed nearly as well as the Logistic Regression model. For the results presented here, we used the same training matrix as in the primary model (unigrams + bigrams, raw counts), but we only retain 6,000 features. The parameters used are given in the methods section. We halve our test dataset into validation (for testing parameter combinations) and final test datasets. Table 2 summarizes accuracy scores for this model (both validation and final test scores), with logistic regression included for comparison. Accuracy and recall are below the logistic regression model, but precision is higher. If given more time to test more parameter combinations, it is plausible that we could achieve higher accuracy with this random forests approach. Recall might improve with shallower trees or fewer features considered, since these parameters give a simpler model with lower variance, but with this comes potentially higher bias (lower accuracy).

                            Accuracy    Precision    Recall
Logistic Regression         0.9562      0.7737       0.5551
Random Forest (validation)  0.9530      0.8867       0.4666
Random Forest (final test)  0.9522      0.8972       0.4550

Table 2. Accuracy measure comparisons between the primary (logistic regression) and a random forest model. Measures are averaged across scores for the 20 categories.

The random forest model allows us to examine the most important features. Here we list some of the most important features (in decreasing order): pizza, chinese, bar, mexican, pizzas, burger, subway, mcdonalds, mexican food, chinese food, sandwiches, bartender, sandwich, sushi, tacos, taco, bartenders, italian, coffee, burrito, crust, fries, burgers, bar food, drive, fried rice, pizza good, asada, salsa, pepperoni, beer, good pizza, fast food, pasta, rice, carne asada, waitress, breakfast, italian food, burritos, subs, wings, best pizza, happy hour, bars, bread, mein, drinks, pizza place, beers, sub, fast, great pizza, cafe, restaurant, italian restaurant, eggs, japanese, place, rice beans, deli, taco bell, great, carne, chinese restaurant, pub.
Many of these features are obvious identifiers for certain labels. A sketch of how this ranking can be extracted follows.
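Here rf and vectorizer are the assumed fitted classifier and CountVectorizer from earlier (get_feature_names is the vocabulary accessor in scikit-learn versions of that era).

import numpy as np

names = np.array(vectorizer.get_feature_names())    # vocabulary, column order
order = np.argsort(rf.feature_importances_)[::-1]   # most important first
top_features = names[order][:60]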

Comparison to baseline model
In the table below we summarize accuracy measures (averaged across our 20 categories) for our primary model and our baseline model. We see significant improvement in all measures except for recall. The low precision of the baseline model indicates that it underfits our data, which is expected of a simple model such as Naive Bayes.

                               Accuracy    AUC       Precision    Recall
Multinomial NB (Baseline)      0.9119      0.8690    0.4818       0.7648
Logistic Regression (Primary)  0.9562      0.9347    0.7737       0.5551

Table 3. Accuracy measure comparisons between the primary (logistic regression) and baseline (multinomial naive Bayes) models. Measures are averaged across scores for the 20 categories.
We compare performance against the baseline model for all categories in Figure 6. In the top panel we see that our model is more accurate than our baseline model for all categories. While the improvement does not appear drastic, it should be noted that "Always False" never outperforms our primary model, but it outperforms the baseline model for 11 of the 20 categories (these 11 being Fast Food, American (Traditional), Sandwiches, Food, American (New), Breakfast and Brunch, Cafes, Delis, Steakhouses, Seafood, and Chicken Wings). The 9 categories for which the baseline model surpasses "Always False" in accuracy are all categories we expect to have more unique vocabularies, such as ethnic cuisines. Our primary model outperforms the baseline model most significantly for the labels Sandwiches (improvement by >20%) and Fast Food (improvement by >10%). Better accuracy is expected for logistic regression as compared to naive Bayes for a problem such as ours because naive Bayes is a simplification of logistic regression. Naive Bayes assumes that features (words) are generated independently given the class (in our case, the class is true or false for each label), whereas logistic regression does not make this assumption. As such, we expect the naive Bayes model to have higher bias but lower variance, and that it will underfit our data, leading to low precision and high recall.

Spectral Clustering

We applied spectral clustering to: (1) all restaurants, to classify them into groups and analyze the true labels that comprise these groups, and (2) Chinese restaurants, to classify them into subcategories. In order to figure out which labels each cluster corresponds to, we printed out the top 5 true labels of each of the 5 clusters. As shown in Figure 7, we see that cluster 3 corresponds to pizza or Italian restaurants, and cluster 2 corresponds to bar/nightlife types of restaurants. The other three clusters are more difficult to interpret because they contain mixed types of restaurants. Figure 8 shows the result of applying spectral clustering to Chinese restaurants. Many of the Chinese restaurants have true labels in addition to Chinese, such as Taiwanese or Buffet. So, as we did for the clusters of all restaurants, we can again print the most common true labels (other than Chinese) for the restaurants in our Chinese clusters. First we notice that the most frequent labels in each cluster are Asian Fusion and Buffet, which provide little information about the clusters' types. Beyond that, we see that the first cluster contains Japanese and Sushi Bar, which implies that these restaurants' styles are more dominated by Japanese food. In the fifth cluster, we observe Thai, Vietnamese, and Szechuan restaurants, which serve relatively spicy cuisines.

Tools
We performed all of our analysis in iPython notebooks because this platform is useful for visualizing results alongside code. We used Pandas and NumPy for data manipulation because these are tools all group members use. At first, we built our BoW features (and TF-IDF weights) using hand-written code adapted from CS294 homework, but later we migrated towards scikit-learn tools for this task. sklearn.feature_extraction.text.CountVectorizer() was used to form BoW training matrices. This tool simplified a few tasks:
(1) setting the maximum feature retention count (the max_features parameter),
(2) setting which n-grams to include (ngram_range), and
(3) setting which stop words to remove (we removed words from the given 'english' stop word list).
Once those matrices were built, we could transform the counts into their TF-IDF representation with sklearn.feature_extraction.text.TfidfTransformer().

For supervised labeling, we implemented several models from scikit-learn. The justification is that these tools are easy to use, especially in an iPython notebook. Models we used include:
from sklearn.linear_model: LogisticRegression()
from sklearn.naive_bayes: BernoulliNB() and MultinomialNB()
from sklearn.ensemble: RandomForestClassifier()
We also used these tools for quantifying model performance:
from sklearn.metrics: roc_curve, roc_auc_score, auc

For unsupervised clustering, we primarily used k-means from scikit-learn.

For visualization, we used Matplotlib because it is well suited for simple graphics and can be used inline in an iPython notebook. We also used the wordcloud package to create some appealing visualizations of our review text.

Lessons Learned
Supervised Labeling: We explored a number of machine learning approaches for the supervised problem of classifying Yelp restaurants into existing labels. Our best model was a logistic regression model, closely followed by a random forests model. We thus selected the logistic regression model as our primary model, and we compared it to a baseline model (multinomial naive Bayes). The features used were the words from all of the reviews written for each restaurant that we aimed to classify. We evaluated a number of featurization choices for these words, including:
(1) whether to use unigrams only or to additionally include bigrams,
(2) whether to weight the features by raw word counts or TF-IDF weights, and
(3) how many features to include.

As seen in Figure 2 and Table 1, we achieved the best performance for the logistic regression model by using bigrams + unigrams, retaining 200,000+ features, and representing features as raw word counts.
We measure accuracy in a number of ways:
(1) accuracy (did we correctly predict that a restaurant does or does not fall within a certain category?),
(2) area under the ROC curve,
(3) precision, and
(4) recall.
Scores for these accuracy measures are displayed in Table 3. Our logistic regression model outperforms our baseline model substantially in accuracy, AUC, and precision, but the baseline model has higher recall. We also show that our primary model outperforms the "Always False" model for all 20 categories, whereas our baseline model does not for many categories. Our primary model performs best at classifying ethnic cuisines such as Chinese and Mexican, which we hypothesize is due to these types of restaurants having special and unique identifying words, such as "Mexican" and "tacos" for Mexican restaurants and "Chinese" and "noodles" for Chinese restaurants. This is corroborated by the word cloud visualizations in Figure 5 and by looking at the most important features for our random forests model.
Unsupervised Labeling:
Unsupervised learning for subcategorization is relatively more difficult. In this project, we applied spectral clustering to the review text in order to find subcategories of restaurants. The intuition is that, say, for Chinese restaurants, people may use "hot" and "spicy" to describe a Sichuan restaurant, and use "milk tea" and "salted popcorn chicken" in reviews for Taiwanese restaurants. However, the difficulty is that it is not obvious what each cluster corresponds to.
One way to figure this out is to look at the percentage of existing labels. For example, if in a cluster 50% of restaurants are "bar" and 25% are "nightlife," then we could reason that this cluster corresponds to the bar type of restaurant. Though we do observe this in some of the clusters (refer to the Results section), there are also clusters with mixed labels that are not easy to interpret. A more fundamental question to ask is: is the clustering based on restaurant types? Or is it perhaps more related to something else, like star rating, cost, or other latent factors? A key lesson for us is that unsupervised learning doesn't always give us the result we expect.

Team Contributions

*CS294* Bichen (40%): Time-series analysis (majority of the Project Preliminary Data Analysis submission), star-rating matrix factorization, spectral clustering of review texts for unsupervised subcategory classification.
*CS294* Stephanie (40%): Initial reading-in of data and exploration of the business dataset (majority of the Project Data Exploration submission). Completion of bag-of-words featurization. Small-scale supervised labeling (majority of results presented in the PowerPoint presentation). Majority of text featurization and supervised labeling presented in the poster presentation and presented here.
*CS194* Tsu-Fang (20%): Data exploration on review texts and user data. Started bag-of-words featurization and TF-IDF analysis. Tested the value of adding a restaurant-name feature and TF-IDF effects on model accuracies after logistic regression and naive Bayes (not shown). Ported and formatted results for poster/presentation.

References
(1) Yelp academic dataset. https://www.yelp.com/academic_dataset
(2) Ensemble Methods. http://scikit-learn.org/stable/modules/ensemble.html
(3) Ng, Andrew Y., Michael I. Jordan, and Yair Weiss. "On spectral clustering: Analysis and an algorithm." Advances in Neural Information Processing Systems 14 (2002): 849-856.

Our GitHub repository is here: https://github.com/tsufanglu/YelpDatasetChallenge

The most relevant notebooks to this report are:
CatsAllCities.ipynb
Yelp_Restaurnats_Spectral_Clustering.ipynb

They are located in the code folder of the repo:
https://github.com/tsufanglu/YelpDatasetChallenge/tree/master/code

Appendix (Figures)

Figure 1. Chosen restaurant labels and their counts.


Figure 2. Accuracies, precisions, and recalls for our logistic regression model, colored by the number of words retained in the training corpora (no TF-IDF weighting). These indicate that we ought to keep as many words as possible as features.



Figure 3. Comparison of accuracies for 4 different featurization choices. In each case, 211,964 words (or 211,964 bigrams + unigrams in the bigram case) are retained for training.



Figure 4. Accuracy measures for our chosen model, broken down by category.

Figure 5. Word clouds for American (New) (upper left), American (Traditional) (upper right), Mexican, and Chinese. Notice that Chinese has words unique to it, such as "Chinese," "noodle," "rice," and "dumpling"; Mexican has unique words like "Mexican," "taco," and "burrito"; but the upper 2 word clouds do not show obviously unique words.


Figure 6. Baseline comparisons. There is substantial improvement over the baseline for accuracy, AUC, and precision. The simple baseline model has higher recall. Bottom panel labels serve as a guide for all panels.



Figure 7. Spectral clustering result for all restaurants. Most frequent 5 labels in each cluster.

Figure 8. Spectral clustering results for Chinese restaurants. Most frequent 5 labels in each cluster.

