
Evaluating NoSQL performance: Which database is right for your data?
February 3, 2014 Sergey Sverchkov



Sergey Sverchkov explains how you can cut through the masses of NoSQL marketing jargon and find
the perfect database for your project.

Often referred to as NoSQL, non-relational databases feature elasticity and scalability. In addition,
they can store big data and work with cloud computing systems. All of these factors make them
extremely popular. In 2013, the number of NoSQL products exceeded 150, and the figure is still
growing. That variety makes it difficult to select just one. To make things worse, the abundance of
marketing materials describing NoSQL products makes it hard to understand whether a particular
solution will be useful for your use case.

To choose the best NoSQL solutions for our customers' projects, we at Altoros test them under
varying types of workloads. This article gives an overview of the results of the latest performance
tests of some of the most mature and popular NoSQL data stores on the market.

NoSQL data stores: The basics
Why did NoSQL data stores appear? Mostly because relational databases (RDBMS) have a number
of restrictions when working with large data sets. For example, RDBMS are hard to scale, and their
architecture is designed to work on a single machine:

Scaling write operations is hard, expensive, or impossible.

Vertical scaling (or upgrading equipment) is either limited or very expensive. Unfortunately, this is
often the only possible way you can scale.

Horizontal scaling (or adding new nodes to the cluster) is either unavailable or you can only
implement it partially. There are some solutions from Oracle and Microsoft that make it possible to
have computing instances on several servers. Still, the database itself remains in shared storage.

In addition to poor scalability, RDBMS have strict data models. The schema is created together with
the database, and you may need significant time and effort to change this structure. In most cases it
is a complex process that most likely involves a considerable amount of production downtime. Apart
from that, RDBMS have difficulties with semi-structured data.
NoSQL solutions address these and a lot of other problems. Several types exist: key-value,
columnar, document-oriented, and graph. None of them uses the relational data model. They are
inherently schema-free, avoid excessive complexity, and offer a flexible data model with eventual
consistency (complying with BASE rather than ACID).

NoSQL data stores provide APIs to perform various operations. Some of them support query
language operations, for example, Cassandra and HBase. However, there is no standard. NoSQL
architectures are designed to run in clusters. This makes it possible to scale them horizontally by
increasing the number of nodes in the deployment. In addition, NoSQL data stores serve huge
amounts of data and provide a very high throughput.

How do you evaluate NoSQL data stores?
NoSQL databases differ from RDBMS in their data models. These systems can be divided into four
distinct groups:
1. Key-value stores are similar to maps or dictionaries where data is addressed by a unique key.

2. Document-oriented stores encapsulate key-value pairs in JSON or JSON-like documents. Within
documents, keys have to be unique. In contrast to key-value stores, values are not opaque to the
system and can be queried as well.
3. Column-family stores are also known as column-oriented stores, extensible record stores, and
wide columnar stores.
4. Graph databases, in contrast to key-oriented NoSQL databases, specialize in the efficient
management of heavily linked data.
NoSQL databases differ in the way they distribute data among multiple machines. Since the data
models of key-value stores, document stores, and column-family stores are key-oriented, the two
common partition strategies are based on keys, too:
1. The first strategy distributes data sets by the range of their keys. A routing server splits the whole
key set into blocks and allocates these blocks to different nodes. Afterwards, each node is
responsible for storage and request handling of its specific key ranges. In order to find a certain key,
clients have to contact the routing server to get the partition table.

2. Higher availability and a much simpler cluster architecture can be achieved with the second type
of distribution.
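
The range-based strategy can be sketched as a lookup in a sorted partition table. This is only an illustration: the node names, range boundaries, and the `node_for_key` helper are hypothetical, and keys are assumed to be fixed-length strings so that they sort lexicographically in the expected order.

```python
import bisect

# Hypothetical partition table: UPPER_BOUNDS[i] is the inclusive upper bound
# of the key range owned by NODES[i]. Names and boundaries are illustrative.
UPPER_BOUNDS = ["user249999", "user499999", "user749999", "user999999"]
NODES = ["node-1", "node-2", "node-3", "node-4"]


def node_for_key(key: str) -> str:
    """Return the node whose key range contains `key`."""
    # bisect_left finds the first range whose upper bound is >= key.
    index = bisect.bisect_left(UPPER_BOUNDS, key)
    if index == len(UPPER_BOUNDS):
        raise KeyError(f"{key!r} is outside every known range")
    return NODES[index]


owner = node_for_key("user234123")
```

A client that has fetched such a table from the routing server can resolve the owning node locally, without another round trip.
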
Replication is another core feature of NoSQL solutions. In addition to better read performance
through load balancing, replication brings better availability and durability, because failing nodes can
be easily replaced.

If all replicas of a master server were updated synchronously, the system would not be available until
all slaves had committed a write operation. That is why this solution is not suitable for platforms
relying on high availability, because even a few milliseconds of latency can have a big influence on
user behavior.

And obviously, performance is also a very important factor. Performance of data storage solutions
can be evaluated using typical scenarios. These scenarios simulate the most common operations
performed by applications that use the data store, also known as typical workloads. The tests that
were performed by Altoros to compare performance of several NoSQL data stores also used typical
workloads.

Performance evaluation approach
Database vendors usually measure the productivity of their products with custom hardware and
software settings designed to demonstrate the advantages of their solutions. In our tests we tried to
see how NoSQL data stores perform under equal conditions.

For benchmarking, we used the Yahoo Cloud Serving Benchmark (YCSB). The core of YCSB is a
framework with a workload generator that creates the test workload, plus a set of workload scenarios.
When using YCSB, developers have to describe the scenario of the workload by operation type, i.e.
indicate what operations are performed on what types of records. Supported operations include:
insert, update (change one of the fields), read (one random field or all the fields of one record), and
scan (read the records in the order of the key, starting from a randomly selected record).

In our tests each workload was applied to a table of 100,000,000 records. Each record was 1,000
bytes in size and contained 10 fields. Each record was identified by a primary key, which was a
string such as user234123. Each field was named field0, field1, and so on. The values in each field
were random strings of ASCII characters, 100 bytes each.
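
To make the record layout concrete, here is a minimal sketch that builds one record of this shape. It is not YCSB's own generator: the `make_record` helper and the choice of alphabet (letters and digits) are assumptions made for illustration.

```python
import random
import string

FIELD_COUNT = 10
FIELD_LENGTH = 100  # bytes per field; each ASCII character is one byte


def make_record(key_number: int, rng: random.Random) -> tuple[str, dict[str, str]]:
    """Build one (key, fields) pair shaped like the records described above."""
    key = f"user{key_number}"
    alphabet = string.ascii_letters + string.digits  # assumed alphabet
    fields = {
        f"field{i}": "".join(rng.choices(alphabet, k=FIELD_LENGTH))
        for i in range(FIELD_COUNT)
    }
    return key, fields


key, fields = make_record(234123, random.Random(42))
# Total payload: 10 fields of 100 bytes each.
payload_bytes = sum(len(value.encode("ascii")) for value in fields.values())
```
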

Database performance was defined by the speed at which it computed basic operations. An action
performed by the workload executor, which drives multiple client threads, was considered to be a
basic operation. For a NoSQL data store, each thread executes a sequential series of operations by
making calls to the database interface layer, both to load the database (the load phase) and to
execute the workload (the transaction phase). The threads throttle the rate at which they generate
requests, so that we may directly control the load. In addition, the threads measure the latency and
throughput achieved when performing operations. This data is then sent to the statistics module.
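
The throttling and measurement idea can be sketched for a single thread as follows. This is not YCSB's implementation; `run_throttled` and its parameters are hypothetical names, and the database call is replaced by an arbitrary callable.

```python
import time
from typing import Callable


def run_throttled(operation: Callable[[], None], target_ops_per_sec: float,
                  operation_count: int) -> tuple[list[float], float]:
    """Run operations at a target rate; return latencies (ms) and achieved ops/sec."""
    interval = 1.0 / target_ops_per_sec
    latencies_ms = []
    start = time.perf_counter()
    for i in range(operation_count):
        # Throttle: wait until this operation's scheduled start time.
        delay = (start + i * interval) - time.perf_counter()
        if delay > 0:
            time.sleep(delay)
        began = time.perf_counter()
        operation()
        latencies_ms.append((time.perf_counter() - began) * 1000.0)
    elapsed = time.perf_counter() - start
    return latencies_ms, operation_count / elapsed


# Stand-in for a database call; a real client would issue a read or update here.
latencies, throughput = run_throttled(lambda: None, 200.0, 20)
```

If the operation is slower than the scheduled interval, the loop simply falls behind the target rate, which is how a saturated data store shows up as a throughput ceiling.
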

Testing environment
For our test, we decided to use the AWS public cloud environment. All the virtual machines operated
in one region (Ireland, Europe) and in one availability zone (or one data center). Each database
stored data in a four-node cluster. We used m1.xlarge computing instances. The nodes were 64-bit
instances with 16 GB of RAM, 4 vCPU, 8 ECU, and high-performance network. We used Amazon
Linux as the operating system.

For each node in the cluster, we created data storage. We used four EBS-optimized elastic block
storage (EBS) volumes of 25 GB each. The volumes were assembled in RAID 0. Data was
distributed (sharded) across the four nodes.
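
As a rough sizing check, the figures above can be combined with the data set described earlier (100,000,000 records of 1,000 bytes). The sketch assumes decimal gigabytes, perfectly even sharding, and ignores keys, storage overhead, and compression, so the real on-disk numbers would differ.

```python
RECORDS = 100_000_000
RECORD_BYTES = 1_000
NODES = 4
VOLUMES_PER_NODE = 4
VOLUME_GB = 25

# Raw payload of the whole data set, in decimal gigabytes.
raw_data_gb = RECORDS * RECORD_BYTES / 1e9
# Share of the payload per node, assuming perfectly even sharding.
per_node_data_gb = raw_data_gb / NODES
# RAID 0 stripes across volumes, so usable capacity is the sum of the volumes.
per_node_capacity_gb = VOLUMES_PER_NODE * VOLUME_GB
```

Under these assumptions each node holds about a quarter of its striped capacity in raw payload.
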

The client with the YCSB framework was on a separate c1.xlarge instance. For MongoDB, we used
two additional c1.medium instances that served as routers. This was necessary due to the specifics
of MongoDB's architecture.

We started all the instances with the same security group and preconfigured all the necessary
network ports for the nodes to communicate. We also configured all the ports required for each
database to be opened. Then we uploaded workload definitions to the YCSB client and performed
the tests. The results were measured at the instance where YCSB was installed.

Figure 1: The infrastructure for testing NoSQL data stores. Source: Altoros

Databases to evaluate and workload definitions
In our test we measured the performance of several NoSQL databases that we consider to be the
most mature and popular products on the market. Let's take a closer look at each of them:

Cassandra 2.0. This is a column-value data store. We ran it with Java 1.7.40 installed. The
transactions were performed with a non-default configuration. In particular, we used a random
partitioner to distribute data across nodes. The size of the key cache was 1 GB. The size of the row
cache was 6 GB. The size of the JVM heap was 6 GB.
MongoDB 2.4.6. This is a document-oriented database. Here, we did not do much additional
configuration or tuning: we added two instances that served as routers, as recommended by the
MongoDB documentation. However, if you need to simplify the model, the mongo router (mongos)
may run on the same machine as the YCSB client. In one of our earlier tests, we discovered that it
uses a lot of CPU. This is why we placed router processes on two separate machines. Data sharding
for MongoDB was based on the document key.
HBase 0.92. For HBase, we set the size of memory for the JVM to 12 GB.

Additionally, we used data compression with the Snappy compressor for Cassandra and HBase.
The replication factor was set to one for all data stores. This approach was intentional: we wanted to
test performance, not the failure tolerance of the cluster.

We used the following workloads with the YCSB framework:

Workload A. Workload A is an update-heavy scenario that simulates how a database works when
recording typical actions of an e-commerce solution user.
Workload B. Workload B is a read-mostly workload that has a 95/5 (ninety-five to five percent)
read/update ratio. It simulates content tagging, where adding a tag is an update but most operations
read tags.
Workload C. Workload C is a read-only workload that simulates a data caching layer, for example a
user profile cache.
Workload D. Workload D has a 95/5 read/insert ratio. The workload simulates accessing the latest
data, such as user status updates, or working with inbox messages.
Workload E. Workload E is a scan-short-ranges workload with a scan/insert proportion of 95/5. It
simulates threaded conversations that are clustered by a thread ID. Each scan is performed for the
posts of a given thread.
Workload F. Workload F has read-modify-write/read operations in a proportion of 50/50. It simulates
accessing a user database where records are read and modified by the user. This activity is also
recorded to this database.
Workload G. Workload G has a 10/90 read/insert ratio. It simulates a process of data migration or a
process of creating large amounts of data.
Every workload was executed by 100 concurrent threads. The data set consisted of 100,000,000
records, and 10,000 operations were divided among the threads.
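
The operation mixes can be written down as a small table and sampled from, which is roughly what a workload generator does when it picks the next operation. The dictionary layout and the `pick_operation` helper are illustrative assumptions, not YCSB's configuration format; Workload A uses the 50/50 read/update ratio reported in its results section.

```python
import random

# Operation proportions per workload, as described in the list above.
WORKLOADS = {
    "A": {"read": 0.50, "update": 0.50},
    "B": {"read": 0.95, "update": 0.05},
    "C": {"read": 1.00},
    "D": {"read": 0.95, "insert": 0.05},
    "E": {"scan": 0.95, "insert": 0.05},
    "F": {"read": 0.50, "read_modify_write": 0.50},
    "G": {"read": 0.10, "insert": 0.90},
}


def pick_operation(workload: str, rng: random.Random) -> str:
    """Choose the next operation type according to the workload's proportions."""
    mix = WORKLOADS[workload]
    return rng.choices(list(mix), weights=list(mix.values()), k=1)[0]


rng = random.Random(7)
sample = [pick_operation("B", rng) for _ in range(1000)]
read_share = sample.count("read") / len(sample)  # close to 0.95 for Workload B
```
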

Load phase
During the first stage of the test, the load phase, we uploaded 100,000,000 records of 1 KB each to
every data store. YCSB measured the average throughput in operations per second and the average
latency of operations in milliseconds. The next diagram displays the results of the load phase:
Figure 2: The results of the load phase. Source: Altoros
HBase demonstrated the lowest throughput, probably because we turned on the autoflush mode.
This mode ensures that each operation that creates a record is sent from the client to the server and
then persisted to the database. HBase also supports an alternative mode that uses an additional
cache on the client side. When the client-side cache fills up, the client sends the data from the cache
to the server. In this alternative mode, HBase saves data to disk in batches.

As we expected, Cassandra demonstrated excellent results with almost 18,000 operations per
second. This is due to Cassandra's architecture. It simultaneously updates data in memory and
writes it to the transaction journal on the disk. This guarantees data persistence should a node crash.

The number of operations per second in MongoDB's results was pretty close to that of HBase: the
average latency was around seven milliseconds at 13,000 operations per second.
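
To put these throughput figures in perspective, a rough estimate of the load duration follows from dividing the record count by the average rate. The sketch assumes the reported averages are total operations divided by total run time, and it treats "almost 18,000" and "13,000" as round numbers, so the results are approximations only.

```python
RECORDS = 100_000_000


def load_hours(ops_per_sec: float) -> float:
    """Estimated hours needed to insert all records at a given average rate."""
    return RECORDS / ops_per_sec / 3600


cassandra_hours = load_hours(18_000)  # roughly 1.5 hours
mongodb_hours = load_hours(13_000)    # roughly 2.1 hours
```
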

In this particular test, all data was loaded in a single iteration, but insert, update, read, and scan
operations in the transaction phase of the test were performed in five iterations for each workload (for
every database).

It should be noted that many of the diagrams with test results demonstrate that database
performance is limited and starts to decline at a certain throughput level. We also need to mention
that we used Amazon AWS and network storage, which could potentially influence the results.

Workload A
Workload A includes read and update operations in a ratio of 50/50. It simulates an e-commerce
application. The following diagram shows the results of update operations.
Figure 3: The results of update operations in Workload A. Source: Altoros

Cassandra and HBase demonstrated good performance with a latency below 20 milliseconds.

MongoDB's latency increased substantially as we increased the workload. At 100 operations per
second all the databases had similar performance. But when the workload reached 1,500
operations per second, MongoDB's latency increased to 100 milliseconds.

The next diagram shows the results of read operations in Workload A.

Figure 4: The results of read operations for Workload A. Source: Altoros

All the graphs are very different because reads and updates were randomly distributed. The results
for read operations in Workload A were more or less similar in all the tested solutions. The difference
in latencies was insignificant, within a range of 15 to 30 milliseconds.

Workload B
Workload B is read-mostly, with 95% reads and only 5% updates. It simulates content tagging,
where adding a tag is an update but most other transactions are reads. Here are the results for
update operations in Workload B.

Figure 5. The results of update operations for Workload B. Source: Altoros

Cassandra demonstrates a very low latency, but its performance is limited to 1,200 operations per
second. With HBase, the latency increases evenly as the workload grows. The behavior of MongoDB
is similar to the previous test, where the latency increased together with the throughput.

The next diagram shows the results of read operations, which make up 95% of Workload B.
Figure 6. The results of read operations for Workload B. Source: Altoros

HBase demonstrated the highest throughput, with several peaks in latency. MongoDB also had good
results thanks to its architecture, which features memory-mapped files. On this diagram, there are
points where the maximum throughput starts to degrade. They were left here on purpose.

Workload C
Workload C is a read-only workload. MongoDB's performance does not degrade as the workload
increases. Latency peaks on the graphs for Cassandra and HBase can be explained by the fact that
we used cloud infrastructure to run the tests and network storage for the data nodes.
Figure 7. The results of read operations for Workload C. Source: Altoros

Workload D
Workload D consists of 5% insert and 95% read operations. It simulates users checking inbox
messages or accessing the latest data, for example, status updates. The following diagram shows
the results of insert operations.

Figure 8. The results of insert operations for Workload D. Source: Altoros
Cassandra demonstrated the best performance. The latencies were within five milliseconds. This is
similar to what we saw at the upload stage, where Cassandra was pretty efficient.

The maximum throughput for HBase is 1,500 operations per second.

MongoDB has an acceptable throughput of up to 2,500 operations per second. At the same time, the
average latency doubles.

Read operations are the second scenario in Workload D. Cassandra and MongoDB achieved good
results. For HBase, the throughput was limited to 1,500 operations per second. The maximum
throughput in this mixed workload was demonstrated by MongoDB.

Figure 9. The results of read operations for Workload D. Source: Altoros

Workload F
Workload F consisted of randomly distributed read and complex read-modify-write operations. Each
record was read, changed, and then saved to the database.

The first diagram shows the results of read operations only. All the databases demonstrated similar
results in this test. Cassandra and HBase had maximum throughput values of about 1,500
operations per second (for read operations). The performance of MongoDB starts to decline at less
than 1,000 operations per second.
Figure 10. The results of read operations for Workload F. Source: Altoros

The second diagram for Workload F shows the results of update operations performed on data that
has already been read.

HBase and Cassandra have very low latencies. As you can see on all the previous graphs, update
operations are performed pretty fast by all the databases, except for MongoDB. As we increase the
workload, MongoDB's latency starts to grow substantially.

Figure 11. The results of update operations for Workload F. Source: Altoros

The third diagram for Workload F shows the results of read-modify-write operations. Cassandra and
HBase have similar performance. MongoDB's latency increases together with the workload, up to the
maximum throughput of roughly 1,000 operations per second.

We can conclude that Cassandra and HBase are pretty good at dealing with mixed workloads.

Figure 12. The results of read-modify-write operations for Workload F. Source: Altoros

Workload G
The last workload, Workload G, mostly consists of insert operations. It simulates the process of data
migration or inserting a lot of data into a database. The results are similar to what you can see on the
previous graphs. HBase and Cassandra demonstrate low latencies and a high throughput.
MongoDB's performance starts to decline at about 4,000 operations per second, with an average
latency up to five times greater than in the other databases.
Figure 13. The results of insert operations for Workload G. Source: Altoros

Below is the last diagram, showing the results of the read operations, which make up 10% of
Workload G. Here we can see that latencies vary for different solutions. This might be because the
data is in network storage on the cloud. Cassandra shows a maximum throughput of up to 7,000
operations per second, and MongoDB's throughput is limited to 4,000 operations per second.

Figure 14. The results of read operations for Workload G. Source: Altoros

Conclusion
What you choose depends on your needs. Before making the decision you should:

Determine what your data sets and your data model will be like. The data model will depend on the
data sets and the typical operations that your app will perform.
Determine your requirements for transaction support: decide whether you need transactions.

Choose whether you need replication, and decide on your requirements for data consistency.

Determine your performance requirements.

If the project is based on an existing solution, evaluate whether it is possible to migrate existing data.

Then, taking into account all these factors, evaluate different solutions and test their performance. It
is very useful to build a prototype of your future system during the proof-of-concept phase.
Prototyping makes it possible to see how the solution will work in a real-life project. If it does not work
well enough, you need to review the architecture and the components, and build a new prototype.

There are no perfect solutions and there are no bad NoSQL or RDBMS data stores. Which database
is the best for your use case can be determined by particular system requirements. The tests
performed by Altoros show that in different scenarios different solutions have very different results.
Your final choice might be a compromise. The main determinant will be what you want to achieve and
what properties you need most. The system may use several solutions, including relational and
NoSQL databases.

Sergey Sverchkov: Head of Software Development, Altoros

With more than 15 years in the IT industry and strong experience in the whole cycle of software
project development and support, Sergey serves as Senior R&D Engineer and Project Manager at
Altoros. He specializes in Big Data analysis, integration, and processing. Sergey is an Oracle DBA
trainer and a frequent speaker on data analysis and cloud technologies.
