2/9/2016 IT Shared: Data Science Interview Questions

IT Shared
Where Knowledge meets Sharing

19 October 2015
Data Science Interview Questions

Posted by Alexey Grigorev

Source: Data Science: An Introduction

Our IT4BI Master studies finished, and the next logical step after graduation is finding a job. I was
interested in Data Science jobs, and this post is a summary of my interview experience and
preparation.

The term "Data Science" is not yet well established, so interviews for Data Science jobs might include a
very broad range of questions, depending on how a particular company interprets the term. In
this post I attempt to organize Data Science interview questions in some usable form, but it might also
be biased by how I see Data Science myself. I hope you find it useful.

The sources of the questions are:

links that I discovered on the Internet,
my own data science interviews (being on the interviewee side).

The questions are without answers. First of all, the answers that I would write could be bad or wrong,
and second, the post would be too big. Also, going through the list and looking for the answers
yourself is a good exercise to prepare for an interview.

This list might look scary at first, but it's very unlikely that all of these questions will be asked during
one interview. Very few jobs require applicants to know all of these points. So it's rather a broad
overview of things that may potentially be asked. Don't let this list of questions discourage you if you
don't know the answer to some of them: chances are that these questions are not important for your
interview.

So, let's get started.

Table of Contents

Background Questions
    Careers
    General Questions
    Process

Mathematics
    Linear Algebra
    Other Areas

Probability and Statistics
    Basic Probability
    Distributions
    Basic Statistics
    Experiment Design
    Point Estimates
    Testing
    A/B Tests
    Bayesian Statistics
    Time Series
    Advanced

Machine Learning
    General ML Questions
    Regression
    Classification
    Regularization
    Dimensionality Reduction
    Cluster Analysis
    Optimization
    Recommendation
    Feature Engineering
    Natural Language Processing
    Meta Learning
    Miscellanea

Computer Science
    Libraries and Tools
    Databases
    Distributed Systems and Big Data

Hands-On
    Problem to Solve
    Coding
    Papers

Sources
Useful Links

Background Questions

Usually, interviews start with background questions: the interviewers can ask you to talk about yourself. This can
also happen at the telephone interview stage.

Careers

For background questions, be ready to give a summary of your career.

Summarize your experience.
What companies have you worked at? What was your role?
Do you have a project portfolio? What projects have you implemented? Discuss some of them in detail.
For graduating students: Tell me about your master's thesis.
For aspiring data scientists: Why do you want a career in data science?
What are your career goals?

General Questions

There can also be some questions not directly related to the projects you did, but rather to your
(self-)education. For example:

What have you done to improve your data analysis knowledge in the past year?
What is the latest paper or book you read? Why did you read it and what did you learn?
What data science blogs do you follow?
Have you taken any data science related online courses? If yes, how many did you complete with a certificate?

Process

All Machine Learning, Data Mining and Data Science projects should follow some process, so there
can be questions about it:

Can you outline the steps in an analytics project?
Have you heard of CRISP-DM (Cross-Industry Standard Process for Data Mining)?

CRISP-DM defines the following steps:

Problem Definition
Data Understanding (or Data Exploration)
Data Preparation
Modeling
Evaluation
Deployment (for production)

So next you may discuss each of these steps in detail:

What is the goal of each step?
What are possible activities at each step?

Mathematics

Some background in mathematics is necessary for doing Data Science, therefore you should expect
math-related questions.

Linear Algebra

Basic Linear Algebra questions might include:

What is [Math Processing Error]? How to solve it?
How do we multiply matrices?
What is an Eigenvalue? And what is an Eigenvector? What is Eigenvalue Decomposition or The Spectral Theorem?
What is Singular Value Decomposition?
You can expect tons of Linear Algebra questions in the Machine Learning part of the interview (see below).
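
As a refresher for the matrix multiplication and eigenvalue questions above, here is a minimal pure-Python sketch. The helper names `matmul` and `power_iteration` and the example matrix are illustrative assumptions, not part of the original list; power iteration is only one simple way to approximate the dominant eigenpair, and in practice you would likely use a linear algebra library.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions must match")
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def power_iteration(a, steps=100):
    """Approximate the dominant eigenvalue and eigenvector of a square matrix."""
    v = [1.0] * len(a)
    for _ in range(steps):
        w = [sum(row[k] * v[k] for k in range(len(v))) for row in a]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v^T A v (v has unit length here).
    av = [sum(row[k] * v[k] for k in range(len(v))) for row in a]
    eigenvalue = sum(x * y for x, y in zip(v, av))
    return eigenvalue, v

A = [[2.0, 1.0], [1.0, 2.0]]  # symmetric example with eigenvalues 3 and 1
product = matmul(A, A)
eigenvalue, eigenvector = power_iteration(A)
```

For this matrix the dominant eigenvector points along (1, 1), so both of its components come out equal.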

If you are interested in learning or refreshing Linear Algebra, see "Best Time to Learn Linear Algebra
is Now!"

Other Areas

Discrete Mathematics and Logic are not that important for Data Science.
Probability and Statistics are core skills and are discussed in the next section.
Calculus and Optimization are usually discussed in the Machine Learning part, typically when talking about a particular algorithm.

Probability and Statistics

Probability and Statistics is an important part of an interview, because it's the basis for Machine
Learning. It is also useful if the company is doing some marketing or website optimization, so they
could ask about related concepts such as A/B tests.

Basic Probability

You can get a couple of simple questions to check your understanding of probability.

For example:

Given two fair dice, what is the probability of getting scores that sum to 4? To 8?
A simple question on Bayes' rule: Imagine a test with a true positive rate of 100% and a false positive rate of 5%. Imagine a population with a 1/1000 rate of having the condition the test identifies. Given a positive test, what is the probability of having that condition?
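
Both example questions can be checked by direct enumeration. The following is a small sketch using exact fractions; the function and variable names are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

def prob_sum(target):
    """Probability that two fair six-sided dice sum to `target`."""
    outcomes = list(product(range(1, 7), repeat=2))
    hits = sum(1 for a, b in outcomes if a + b == target)
    return Fraction(hits, len(outcomes))

p_sum_4 = prob_sum(4)  # favourable outcomes: (1,3), (2,2), (3,1)
p_sum_8 = prob_sum(8)  # favourable outcomes: (2,6), (3,5), (4,4), (5,3), (6,2)

# Bayes' rule:
# P(condition | positive) = P(positive | condition) * P(condition) / P(positive)
prior = Fraction(1, 1000)
true_positive_rate = Fraction(1)
false_positive_rate = Fraction(5, 100)
p_positive = true_positive_rate * prior + false_positive_rate * (1 - prior)
posterior = true_positive_rate * prior / p_positive
```

The posterior works out to 20/1019, roughly 2%, which is the usual point of this question: a rare condition keeps the posterior low even for an apparently accurate test.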

Distributions

You can expect questions about probability distributions:

What is the normal distribution? Give an example of some variable that follows this distribution.
What about log-normal?
Explain what a long-tailed distribution is and provide three examples of relevant phenomena that have long tails. Why are they important in classification and prediction problems?
How to check if a distribution is close to Normal? Why would you want to check it? What is a Q-Q plot?
Give examples of data that does not have a Gaussian or log-normal distribution.
Do you know what the exponential family is?
Do you know the Dirichlet distribution? The multinomial distribution?

Basic Statistics

What is the Law of Large Numbers? The Central Limit Theorem?
Why are they important for Statistics?
What summary statistics do you know?

Experiment Design

Designing experiments is an important part of Statistics, and it's especially useful for doing A/B tests.

Sampling and Randomization

Why do we need to sample and how?
Why is randomization important in experimental design?
Some 3rd-party organization randomly assigned people to control and experiment groups. How can you verify that the assignment truly was random?
How do you calculate the needed sample size?
Power analysis. What is it?
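
For the sample size and power analysis questions, one common textbook approximation for comparing two group means is n = 2 * ((z_(1-alpha/2) + z_power) * sigma / effect)^2 per group. The sketch below assumes a two-sided test, equal group sizes, a known common standard deviation and the normal approximation; the function name and defaults are illustrative assumptions.

```python
import math
from statistics import NormalDist

def sample_size_per_group(effect, sigma, alpha=0.05, power=0.8):
    """Approximate per-group sample size to detect a difference in means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_power = NormalDist().inv_cdf(power)          # desired statistical power
    n = 2 * ((z_alpha + z_power) * sigma / effect) ** 2
    return math.ceil(n)

# Detect a shift of half a standard deviation with 80% power at alpha = 0.05.
n_needed = sample_size_per_group(effect=0.5, sigma=1.0)
n_small_effect = sample_size_per_group(effect=0.1, sigma=1.0)
```

Smaller effects need many more samples, because the required size grows with the inverse square of the effect.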

Biases

When you sample, what bias are you inflicting?
How do you control for biases?
What are some of the first things that come to mind when I do X in terms of biasing your data?

Other questions

What are confounding variables?

Point Estimates

Confidence intervals

What is a point estimate? What is a confidence interval for it?
How are they constructed?
Why do you need to standardize?
How to interpret confidence intervals?
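
To make the construction concrete, here is a minimal sketch of a confidence interval for a mean. It assumes the normal approximation (for a sample this small a t-quantile would be the more careful choice); the data and the function name are made up for illustration.

```python
import math
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    point_estimate = mean(sample)
    standard_error = stdev(sample) / math.sqrt(len(sample))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return point_estimate - z * standard_error, point_estimate + z * standard_error

sample = [4.9, 5.1, 5.0, 5.3, 4.7, 5.2, 4.8, 5.0]  # made-up measurements
low, high = mean_confidence_interval(sample)
```

The interval is centered on the point estimate and narrows as the sample grows, since the standard error shrinks with the square root of the sample size.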

Testing

Hypothesis tests

Why do we need hypothesis testing? What is a p-value?
What is the null hypothesis? How do we state it?
Do you know what Type I / Type II errors are?
What is [Math Processing Error] Test / [Math Processing Error] Test / ANOVA? When to use it?
How would you test if two populations have the same mean? What if you have 3 or 4 populations?
You applied ANOVA and it says that the mean is different. How do you identify the populations where the means are different?
What is the distribution of p-values, in general?

A/B Tests

What is A/B testing? How is it different from usual hypothesis testing?
How can you prove that one improvement you've brought to an algorithm is really an improvement over not doing anything? How familiar are you with A/B testing?
How can we tell whether our website is improving?
What are the metrics to evaluate a website? A search engine?
What kind of metrics would you track for your music streaming website?
Common metrics: engagement / retention rate, conversion, similar products / duplicates matching, and how to measure them.
Real-life numbers and intuition: expected user behavior, reasonable ranges for user signup / retention rate, session length / count, registered / unregistered users, deep / top-level engagement, spam rate, complaint rate, ads efficiency.

Bayesian Statistics

In my interviews I didn't have any questions about Bayesian Stats, nor did I find a lot of questions on
the Internet. But here are some:

Have you ever seen Bayes' Theorem?
Do you know what a conjugate prior is?

You might also get questions about Bayesian non-parametric models, but I'm not sure if it's common.

Time Series

What is a time series?
What is the difference between data for usual statistical analysis and time series data?
Have you used any of the following: time series models, cross-correlations with time lags, correlograms, spectral analysis, signal processing and filtering techniques? If yes, in which context?
In time series modeling, how can we deal with multiple types of seasonality like weekly and yearly seasonality?

Advanced

Resampling

Explain what resampling methods are and why they are useful. Also explain their limitations.
Bootstrapping: how and why is it used?
How to use resampling for hypothesis testing? Have you heard of Permutation Tests?
How would you apply resampling to time series data?
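
The two resampling ideas named above can be sketched in a few lines. This is a minimal illustration assuming independent observations (so it does not address the time series question); the data, function names and number of resamples are made-up assumptions.

```python
import random
from statistics import mean

def bootstrap_ci(sample, stat=mean, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for a statistic."""
    rng = random.Random(seed)
    # Resample with replacement and recompute the statistic each time.
    stats = sorted(stat(rng.choices(sample, k=len(sample)))
                   for _ in range(n_resamples))
    lower_index = int((1 - confidence) / 2 * n_resamples)
    upper_index = int((1 + confidence) / 2 * n_resamples) - 1
    return stats[lower_index], stats[upper_index]

def permutation_test(a, b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are exchangeable.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            extreme += 1
    return (extreme + 1) / (n_permutations + 1)

group_a = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3]  # made-up data
group_b = [13.0, 13.4, 12.9, 13.2, 13.1, 13.3]
ci_low, ci_high = bootstrap_ci(group_a)
p_value = permutation_test(group_a, group_b)
```

Because the two made-up groups do not overlap at all, very few shuffles reproduce a difference as large as the observed one, so the estimated p-value is small.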

Machine Learning

In my experience, the Machine Learning part is usually the largest part of the interview. It may be a
few basic questions, but it's helpful to be prepared for more in-depth Machine Learning questions,
especially if you claim to have worked with it on your CV.

General ML Questions

The ML part may start with something like:

What is the difference between supervised and unsupervised learning? Which algorithms are supervised learning and which are not? Why?
What is your favorite ML algorithm and why?

And then go into details.

Regression

Describe the regression problem. Is it supervised learning? Why?
What is linear regression? Why is it called linear?
Discuss the bias-variance tradeoff.

Linear Regression:

What is Ordinary Least Squares Regression? How can it be learned?
Can you derive the OLS Regression formula? (For the one-step solution)
Is the model [Math Processing Error] still linear? Why?
Do we always need the intercept term? When do we need it and when do we not?
What is collinearity and what to do with it? How to remove multicollinearity?
What if the design matrix is not full rank?
What is overfitting a regression model? What are ways to avoid it?
What is Ridge Regression? How is it different from OLS Regression? Why do we need it?
What is Lasso regression? How is it different from OLS and Ridge?
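
For the one-step (closed-form) solution mentioned above, the standard textbook result is sketched below. The notation is an assumption of this sketch rather than something taken from the original list: $X$ is the design matrix, $y$ the target vector, $\beta$ the coefficients and $\lambda > 0$ the ridge penalty strength.

```latex
\begin{align}
\hat{\beta}_{\text{OLS}} &= \arg\min_{\beta} \lVert y - X\beta \rVert_2^2 \\
0 &= \nabla_{\beta} \lVert y - X\beta \rVert_2^2 = -2 X^\top (y - X\beta) \\
X^\top X \hat{\beta}_{\text{OLS}} &= X^\top y \\
\hat{\beta}_{\text{OLS}} &= (X^\top X)^{-1} X^\top y \\
\hat{\beta}_{\text{ridge}} &= (X^\top X + \lambda I)^{-1} X^\top y
\end{align}
```

The OLS inverse exists only when $X$ has full column rank, which connects to the question about a rank-deficient design matrix; adding $\lambda I$ with $\lambda > 0$ makes the matrix invertible in any case.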

Linear Regression assumptions:

What are the assumptions required for linear regression?
What if some of these assumptions are violated?

Significant features in Regression

You would like to find significant features. How would you do that?
You fit a multiple regression to examine the effect of a particular feature. The feature comes back insignificant, but you believe it is significant. Why can it happen?
Your model considers the feature [Math Processing Error] significant, and [Math Processing Error] is not, but you expected the opposite result. Why can it happen?

Evaluation

How to check if the regression model fits the data well?

Other algorithms for regression

Decision trees for regression
k-Nearest Neighbors for regression. When to use it?
Do you know others? E.g. Splines? LOESS/LOWESS?

Classification

Basic:

Can you describe what the classification problem is?
What is the simplest classification algorithm?
What classification algorithms do you know? Which one do you like the most?

Decision trees:

What is a decision tree?
What are some business reasons you might want to use a decision tree model?
How do you build it?
What impurity measures do you know?
Describe some of the different splitting rules used by different decision tree algorithms.
Is a big bushy tree always good? Why would you want to prune it?
Is it a good idea to combine multiple trees?
What is Random Forest? Why is it good?

Logistic regression:

What is logistic regression?
How do we train a logistic regression model?
How do we interpret its coefficients?

Support Vector Machines

What is the maximal margin classifier? How can this margin be achieved and why is it beneficial?
How do we train an SVM? What about hard SVM and soft SVM?
What is a kernel? Explain the kernel trick.
Which kernels do you know? How to choose a kernel?

Neural Networks

What is an Artificial Neural Network?
How to train an ANN? What is backpropagation?
How does a neural network with three layers (one input layer, one inner layer and one output layer) compare to a logistic regression?
What is deep learning? What is a CNN (Convolutional Neural Network) or an RNN (Recurrent Neural Network)?

Other models:

What other models do you know?
How can we use the Naive Bayes classifier for categorical features? What if some features are numerical?
Tradeoffs between different types of classification models. How to choose the best one?
Compare logistic regression with decision trees and neural networks.

Regularization

What is Regularization?
Which problem does Regularization try to solve?
What does it mean (practically) for a design matrix to be ill-conditioned?
When might you want to use ridge regression instead of traditional linear regression?
What is the difference between the L1 and L2 regularization?
Why (geometrically) does LASSO produce solutions with zero-valued coefficients (as opposed to ridge)?
Let us go through the derivation of OLS or Logistic Regression. What happens when we add [Math Processing Error] regularization? How do the derivations change? What if we replace [Math Processing Error] regularization with [Math Processing Error] regularization?

Dimensionality Reduction

Basics:

What is the purpose of dimensionality reduction and why do we need it?
Are dimensionality reduction techniques supervised or not? Are all of them (un)supervised?
What ways of reducing dimensionality do you know?
Is feature selection a dimensionality reduction technique?
What is the difference between feature selection and feature extraction?
Is it beneficial to perform dimensionality reduction before fitting an SVM? Why or why not?

Principal Component Analysis:

What is Principal Component Analysis (PCA)? What is the problem it solves? How is it related to eigenvalue decomposition (EVD)?
What's the relationship between PCA and SVD? When is SVD better than EVD for PCA?
Under what conditions is PCA effective?
Why do we need to center data for PCA and what can happen if we don't do it? Do we need to scale data for PCA?
Is PCA a linear model or not? Why?

Other Dimensionality Reduction techniques:

Do you know other Dimensionality Reduction techniques?
What is Independent Component Analysis (ICA)? What's the difference between ICA and PCA?
Suppose you have a very sparse matrix where rows are highly dimensional. You project these rows on a random vector of relatively small dimensionality. Is it a valid dimensionality reduction technique or not?
Have you heard of Kernel PCA or other non-linear dimensionality reduction techniques? What about LLE (Locally Linear Embedding) or t-SNE (t-distributed Stochastic Neighbor Embedding)?
What is Fisher Discriminant Analysis? How is it different from PCA? Is it supervised or not?

Cluster Analysis

What is the cluster analysis problem?
Which cluster analysis methods do you know?
Describe k-Means. What is the objective of k-Means? Can you describe the Lloyd algorithm?
How do you select k for k-Means?
How can you modify k-Means to produce soft class assignments?
How to assess the quality of clustering?
Describe any other cluster analysis method. E.g. DBSCAN.
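
The Lloyd algorithm question is easier to answer with the two alternating steps in front of you. Below is a minimal sketch; the function name, the toy points and the explicitly passed initial centroids are assumptions made to keep the example deterministic (random or k-means++ initialization is more typical in practice).

```python
def kmeans(points, initial_centroids, n_iterations=50):
    """Lloyd's algorithm for points given as tuples of floats."""
    centroids = list(initial_centroids)
    clusters = []
    for _ in range(n_iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((x - c) ** 2 for x, c in zip(p, centroid))
                         for centroid in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged: assignments are stable
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]  # two made-up blobs
centroids, clusters = kmeans(points, initial_centroids=[(0.0, 0.0), (10.0, 10.0)])
```

Each iteration can only lower (or keep) the within-cluster sum of squared distances, which is the k-Means objective, but the result depends on the initialization and may be a local optimum.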

Optimization

You may have some basic questions about optimization:

What is the difference between a convex function and a non-convex one?
What is the Gradient Descent method?
Will Gradient Descent methods always converge to the same point?
What is a local optimum?
Is it always bad to have local optima?
What is Newton's method?
What kind of problems are well suited for Newton's method? BFGS? SGD?
What are slack variables?
Describe a constrained optimization problem and how you would tackle it.
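
A tiny experiment helps with the convergence questions. The sketch below runs plain gradient descent with a fixed learning rate on one convex and one non-convex one-dimensional function; the functions, starting points and step sizes are illustrative assumptions.

```python
def gradient_descent(gradient, start, learning_rate=0.1, n_steps=200):
    """Plain gradient descent with a fixed learning rate."""
    x = start
    for _ in range(n_steps):
        x = x - learning_rate * gradient(x)
    return x

# Convex case: f(x) = (x - 3)^2 has gradient 2 * (x - 3) and a single minimum at 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), start=10.0)

# Non-convex case: f(x) = x^4 - 3x^2 + x has two local minima, so the
# starting point decides which one gradient descent ends up in.
def nonconvex_gradient(x):
    return 4 * x ** 3 - 6 * x + 1

left_minimum = gradient_descent(nonconvex_gradient, start=-2.0,
                                learning_rate=0.01, n_steps=2000)
right_minimum = gradient_descent(nonconvex_gradient, start=2.0,
                                 learning_rate=0.01, n_steps=2000)
```

On the convex function every start leads to the same point, while on the non-convex one the two runs settle in different local minima.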

Recommendation

What is a recommendation engine? How does it work?
Do you know about the Netflix Prize problem? How would you approach it?
How to do customer recommendation?
What is Collaborative Filtering?
How would you generate related searches for a search engine?
How would you suggest followers on Twitter?

Feature Engineering

How to apply Machine Learning to audio data, images, texts, graphs, etc.?
What is Feature Engineering? Can you give an example? Why do we need it?
How to go from categorical variables to numerical?

Natural Language Processing

If the company deals with text data, you can expect some questions on NLP and Information
Retrieval:

What is NLP? How is it related to Machine Learning?
How would you turn unstructured text data into structured data usable for ML models?
What is the Vector Space Model?
What is TF-IDF?
Which distances and similarity measures can we use to compare documents? What is cosine similarity?
Why do we remove stop words? When do we not remove them?
Language Models. What are n-grams?
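
The Vector Space Model, TF-IDF and cosine similarity fit together in a few lines. This is a minimal sketch that assumes whitespace tokenization and the plain `tf * log(N / df)` weighting (libraries often use smoothed variants); the documents and function names are made up for illustration.

```python
import math
from collections import Counter

def tf_idf_vectors(documents):
    """Return one {term: weight} dict per whitespace-tokenized document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    document_frequency = Counter(term for tokens in tokenized
                                 for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / document_frequency[term])
            for term, count in counts.items()
        })
    return vectors

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

docs = ["the cat sat on the mat",
        "the cat chased the mouse",
        "stock prices fell on monday"]
vectors = tf_idf_vectors(docs)
sim_cats = cosine_similarity(vectors[0], vectors[1])
sim_unrelated = cosine_similarity(vectors[0], vectors[2])
```

The two cat sentences share more weighted terms than the cat and stock sentences, so their cosine similarity comes out higher.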

Meta Learning

Feature Selection:

Are all features equally good?
What are the downfalls of using too many or too few variables?
How many features should you use? How do you select the best features?
What is Feature Selection and why do we need it?
Describe several feature selection methods. Do these methods depend on the model or not?

Model selection:

You have built several different models. How would you select the best one?
You have one model and want to find the best set of parameters for this model. How would you do that?
How would you look for the best parameters? Do you know something else apart from grid search?
What is Cross-Validation?
What is 10-Fold CV?
What is the difference between holding out a validation set and doing 10-Fold CV?
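
To make the 10-Fold CV question concrete, here is a minimal sketch of how the index splits can be generated. The function name is an illustrative assumption, and the sketch does not shuffle; in practice the data is usually shuffled (or stratified) before splitting.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(n_samples=25, k=10))
```

Every sample lands in exactly one validation fold, so each observation is used for validation once and for training k - 1 times, unlike a single hold-out split.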

Model evaluation

How do you know if your model overfits?
How do you assess the results of a logistic regression?
Which evaluation metrics do you know? Something apart from accuracy?
Which is better: too many false positives or too many false negatives?
What are precision and recall?
What is a ROC curve? What is AUROC (AUC)? How to interpret the curve and AUROC?
Do you know about Concordance or Lift?
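
The metrics above are short enough to compute by hand. The sketch below uses made-up labels and scores, a 0.5 decision threshold chosen only for illustration, and computes AUC through its ranking interpretation: the probability that a random positive is scored above a random negative.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels encoded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def roc_auc(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                   # made-up labels
scores = [0.9, 0.4, 0.7, 0.3, 0.2, 0.6, 0.8, 0.1]   # made-up model scores
y_pred = [1 if s >= 0.5 else 0 for s in scores]
precision, recall = precision_recall(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Precision and recall depend on the chosen threshold, while AUC summarizes the ranking quality over all thresholds.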

Discussion Questions:

You have a marketing campaign and you want to send emails to users. You developed a model for predicting if a user will reply or not. How can you evaluate this model? Is there a chart you can use?

Miscellanea

Curse of Dimensionality

What is the Curse of Dimensionality? How does it affect distance and similarity measures?
What are the problems of a large feature space? How does it affect different models, e.g. OLS? What about computational complexity?
What dimensionality reductions can be used for preprocessing the data?
What is the difference between density sparse data and dimensionally sparse data?

Others

You are training an image classifier with limited data. What are some ways you can augment your dataset?

Computer Science

Knowledge of Computer Science is as important for Data Science as knowledge of Machine Learning.
So you may get the same type of questions as for any software developer position, but possibly with
lower expectations on your answers.

I was a Java developer for quite some time, and I prepared a list of questions I asked (and often was
asked) in Java interviews: Java Interview questions. This list can also be helpful when preparing for a
Data Science interview.

Libraries and Tools

Apart from the basics of Java/Scala/Python/etc., you may be asked about libraries for data analysis:

Which libraries for data analysis do you know in Python/R/Java?
Have you used numpy, scipy, pandas, sklearn?
What are some features of the sklearn API that differentiate it from fitting models in R?
What are some features of pandas/scipy that you like? Hate? Same questions for R.
Why is vectorization such a powerful method for optimizing numerical code? What is going on that makes the code faster relative to alternatives like nested for loops?
When is it better to write your own code than to use a data science software package?
State any 3 positive and negative aspects about your favorite statistical software.
Describe a difficult bug you've encountered and how you resolved it.
How does floating point affect precision of calculations? Equality tests?
What is BLAS? LAPACK?
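
The floating point question is commonly illustrated with decimal fractions that have no exact binary representation. A short sketch using only the standard library:

```python
import math

# 0.1, 0.2 and 0.3 are not exactly representable in binary floating point.
naive_equal = (0.1 + 0.2 == 0.3)                # False because of rounding error
tolerant_equal = math.isclose(0.1 + 0.2, 0.3)   # True: compares within a tolerance

# Rounding errors accumulate when adding step by step.
values = [0.1] * 10
running_total = 0.0
for value in values:
    running_total += value

# math.fsum tracks the lost low-order bits and returns an accurately rounded sum.
accurate_total = math.fsum(values)
```

This is why equality tests on floats are usually replaced by tolerance checks such as `math.isclose`.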

Databases

Have you been involved in database design and data modeling?
SQL-related questions: e.g. what's GROUP BY?
Or, given some DB schema, you may be asked to write a simple SQL query.
What is a star schema? A snowflake schema?
Describe different NoSQL technologies you're familiar with, what they are good at, and what they are bad at.
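
As an example of the kind of simple query you may be asked to write, the sketch below runs a GROUP BY aggregation against an in-memory SQLite database. The `orders` table, its columns and its rows are made up for illustration.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
connection.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 10.0), ("bob", 5.0), ("alice", 7.5), ("bob", 2.5), ("carol", 4.0)],
)

# GROUP BY collapses the rows of each customer into one row of aggregates.
rows = connection.execute(
    "SELECT customer, COUNT(*) AS n_orders, SUM(amount) AS total "
    "FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
connection.close()
```

Each result row holds one customer together with the number of orders and the total amount.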

Distributed Systems and Big Data

Basic Big Data questions:

What is the biggest dataset that you have processed and how did you process it? What was the result?
Have you used Apache Hadoop, Apache Spark, Apache Flink? Why? Have you used Apache Mahout?

MapReduce

What are the advantages/disadvantages of a shared-nothing architecture?
What is MapReduce? Why is it a shared-nothing architecture?
Can you implement word count in MapReduce? What about something a bit more complex like TF-IDF? Naive Bayes?
What is load balance? How to make sure a MapReduce application has good load balance?
Can you give examples where MapReduce does not work?
What are examples of embarrassingly parallelizable algorithms?
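
The word count question can be practiced without a cluster. The sketch below imitates the map, shuffle and reduce phases in a single Python process; the function names are illustrative assumptions and this is not the Hadoop API.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.lower().split()]

def shuffle_phase(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values of one key into a single count."""
    return key, sum(values)

documents = ["to be or not to be", "to do is to be"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle_phase(mapped).items())
```

Each map call looks at one document only and each reduce call at one key only, which is what lets a real framework run them on different machines without shared state.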

Implementation questions

How would you estimate the median of a dataset that is too big to hold in memory?

There are some posts that you may find useful when preparing for the Big Data part:

Hadoop and MapReduce
Naive Bayes on Apache Flink

Hands-On

Also, many interviews have a part which I call "hands-on": you are given some problem description
and you are asked to solve it. You can just talk the interviewers through your solution or even be
asked to sit and implement some parts. Sometimes there is also a test assignment to be done at
home (prior to the interview).

Problem to Solve

For example:

Assume that you are asked to lead a project on churn detection, and have a dataset of known users
who stopped using the service and ones who are still using it. This data includes demographics and
other features.

Do the following:

1. Describe the methodology and model that you will choose to identify churn, and describe your thought process.
2. Think about how you would communicate the results to the CEO.
3. Suppose in the dataset only 0.025 of users churned. How would you make it more balanced?
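
One possible answer to the balancing question is random oversampling of the minority class, sketched below. This is only one option among several (undersampling the majority class or using class weights are others), the data and function name are made up, and the sketch assumes both classes are present; resampling should be applied to the training split only.

```python
import random

def oversample_minority(rows, label_of, seed=0):
    """Randomly duplicate minority-class rows until both classes have equal size."""
    rng = random.Random(seed)
    positives = [r for r in rows if label_of(r) == 1]
    negatives = [r for r in rows if label_of(r) == 0]
    minority, majority = sorted([positives, negatives], key=len)
    # Draw with replacement from the minority class to fill the gap.
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return majority + minority + extra

# Made-up dataset: 5 churned users out of 200 (a churn rate of 0.025).
rows = [{"id": i, "churned": 1 if i < 5 else 0} for i in range(200)]
balanced = oversample_minority(rows, label_of=lambda r: r["churned"])
```

After oversampling, both classes contain 195 rows, although the churned rows are repeats of the same five users, so the model sees no new information about them.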

Also:

How would you implement it if you had one day? One month? One year?
How would your approach scale?

Other problems:

How would you approach identifying plagiarism?
How to find individual paid accounts shared by multiple users?
How to detect bogus reviews, or bogus Facebook accounts used for bad purposes?

Usually the domain of the problem is related to what the company is doing. If they're doing
marketing, it will most likely be marketing-related.

Additionally, you may be asked:

How would you approach collecting the data if you didn't have the dataset?

Coding

Sometimes you may even be presented with a small dataset and asked to do a particular task with any tool.
For example,

write a script to extract features,
then do some exploratory data analysis and
finally apply some ML algorithm to this dataset.

Or just the last two, with a ready-to-use dataset in tabular form.

Papers

It's also possible that you'll be asked to read some ML paper and share your thoughts on it, and then
discuss the proposed algorithm, its time complexity, and how it can be implemented and improved.

I wasn't asked to do it myself, but based on my experience working as an ML developer, I believe that
reading papers and being able to understand them is an important skill, so don't be surprised if
somebody tries to check this ability.

Sources

I had to work through a lot of sources to make this compilation. I did not include all the questions I
came across, just the ones that made sense or ones I really got during my interviews. It also, of
course, includes my own interviews.

Here is the list of sources I used:

http://www.quora.com/Whatisatypicaldatascientistinterviewlike
http://www.quora.com/Whataretheinterviewquestionsonregressionmodeling
http://www.quora.com/HowshouldIprepareforstatisticsquestionsforadatascienceinterview
http://www.quora.com/ABTesting/WhatkindofABtestingquestionsshouldIexpectinadatascientistinterviewandhowshouldIprepareforsuchquestions
http://www.quora.com/Whatare20questionstodetectfakedatascientists
http://www.quora.com/WhataresomecommonMachineLearninginterviewquestions
http://www.quora.com/Whatarethebestinterviewquestionstoevaluateamachinelearningresearcher
https://www.quora.com/AreCSquestionspartofadatascientistinterviewatFacebook
http://www.quora.com/DataScience/HowshouldIprepareforstatisticsquestionsforadatascienceinterview
http://stats.stackexchange.com/questions/5465/statisticsinterviewquestions
http://www.reddit.com/r/datascience/comments/2nhb4k/what_interview_questions_have_you_been_asked/
http://www.reddit.com/r/statistics/comments/310h76/i_have_an_interview_for_a_parttime_data_analyst/
https://www.reddit.com/r/datascience/comments/3fsz54/my_top10_technical_questions_for_job_candidates/
https://www.reddit.com/r/datascience/comments/3kzf69/data_scientist_interview_questions_on_pca_svm/
http://www.reddit.com/r/MachineLearning/comments/392nwy/interview_questions_for_data_scientist_positions/
http://blog.udacity.com/2015/04/datascienceinterviewquestions.html
http://alyaabbott.wordpress.com/2014/10/01/howtoaceadatascienceinterview/
http://www.marketingdistillery.com/2014/09/03/howtosuccessfullyrecruitadatascientist/
http://www.edureka.co/blog/frequentlyaskeddatascienceinterviewquestions
http://www.galvanize.it/blog/howtonailadatascienceinterview
http://analyticsindiamag.com/commonanalyticsinterviewquestions/
http://www.datasciencecentral.com/profiles/blogs/66jobinterviewquestionsfordatascientists

Useful Links

If you are preparing for a Data Science interview, you may also find the following links useful:

http://www2.udacity.com/rs/udacity/images/Ultimate%20Skills%20Checklist%20For%20Your%20First%20Data%20Analyst%20Job.pdf
http://www.quora.com/Whataresomeimportantquestionstoaskarecruiterwheninterviewingforadatasciencejob
http://www.quora.com/InadatascientistinterviewshouldIusePythonorC++foralgorithmdatastructurequestions
http://www.quora.com/HowdoIprepareforadatascientistinterview
http://datascienceinterview.quora.com/DataScienceInterviewPreparation
http://datascienceinterview.quora.com/Answers1
https://github.com/gkamradt/LessonsLearnedDataScienceInterviews
http://mathewanalytics.com/2015/08/18/homeworkduringthehiringprocessnothanks/
https://medium.com/@D33B/interviewquestionsfordatascientistpositions5ad3c5d5b8bd
https://medium.com/@D33B/interviewquestionsfordatascientistpositionspartiiac294c2c7241
http://www.jasq.org/justanotherscalaquant/newageyinterviewsatthegrocerystartup
http://www.erinshellman.com/crusheditlandingadatasciencejob/
http://treycausey.com/data_science_interviews.html
http://nadbordrozd.github.io/interviews/

The End

Even though the post was lengthy, I hope you enjoyed it and found this information useful. Happy
interviewing! And please do let us know if you got any interesting questions that we should add.

Labels: by Alexey, Data Science, Interview Questions, Machine Learning
