NoSQL and SQL Databases
Part 1
2016 EDW
by Michael Bowers
2016-03-14
v. 4.9
mike@cssDesignPatterns.com
Abstract
We are in the middle of a database revolution. NoSQL is disrupting the database world by innovating on many fronts.
How do we model in these new paradigms?
How does the old SQL paradigm fit in this brave new world?
What paradigm is best for your project?
We are in a new data paradigm:
  New database architectures (software and hardware) handle the large and ever-growing velocity and volume of data that is dispersed across geographically distant data centers
  New graph, document, and wide-column modeling paradigms compete with relational and dimensional
  Schemaless databases enable maximum agility of software development and rapid changes to huge data sets
What will you learn?
You will be able to choose the best database to meet your needs for velocity, volume, variety, variability, relevance, productivity, data model, scale, consistency, and cost.
You will know the tradeoffs of ACID or BASE consistency models and when it is OK or not OK to compromise consistency.
You will understand the strengths and weaknesses of relational, dimensional, document, key-value, and triple models, and which SQL and NoSQL databases support which models.
About the Author
Michael Bowers
Principal Architect, LDS Church
Author of Pro CSS and HTML Design Patterns (published by Apress, 2007)
Author of Pro HTML5 and CSS3 Design Patterns (published by Apress, 2011)
mike@cssDesignPatterns.com
Church of Jesus Christ of Latter-day Saints
15 million members (29,621 congregations worldwide)
Humanitarian assistance in 185 countries
Thousands of documents in 188 published languages
192 websites and applications in production with billions of page views annually running on hundreds of MarkLogic servers
We are in a Database Revolution
Existing paradigms are being challenged:
  Models
  Hardware
  Software
  Languages
Will tweaking relational databases be enough?
Agenda
1. Defining NoSQL and Big Data
2. Optimizing for Velocity or Volume
3. Optimizing for Availability or Consistency
4. Optimizing for Modeling Paradigms
Databases (ranked by popularity as of 2016-03-14)
[Figure: map of database categories. Axis labels: High Bandwidth Analytical Volume; Low Latency Operational Velocity. Axis scales: hours, minutes, seconds, milliseconds, microseconds; PBs, TBs, GBs, 0.1Kt, 0.5Kt, 1Kt, 10Kt, 100Kt. Model axis: from more structure (schema) to less structure (schemaless), through Relational, Dimensional, Wide column/Key value, Document, Graph, and Raw.]
newSQL / Live Analytics: #1 Oracle Exalytics, #19 SAP HANA, #58 GemFire, #69 Oracle x10
Wide Column (complex key): #8 Cassandra, #15 HBase
Key/Value (simple key): #9 Redis, #23 Memcached, #26 DynamoDB, #31 Riak
SQL: #1 Oracle DB, #2 MySQL, #3 SQL Server, #5 PostgreSQL, #6 DB2, #10 SQLite, #12 SAP AS, #19 SAP HANA, #21 Informix, #22 MariaDB
Data Warehouse: #1 Oracle Exadata, #13 Teradata, #16 Hive, #28 Netezza, #29 Vertica, #33 Greenplum, #36 Amazon Redshift
Document (JSON): #4 MongoDB, #24 Couchbase, #25 CouchDB, #32 MarkLogic, #41 OrientDB, #48 Cloudant
Graph/RDF: #20 Neo4j, #32 MarkLogic, #41 OrientDB, #44 Titan
Doc Warehouse (XML): #11 ElasticSearch, #14 Solr, #35 MarkLogic, #37 Sphinx
Big Data (raw): Hadoop, #18 Splunk
Sample document pictured in the figure:
Hospital Name: John Hopkins
Operation Number: 13
Operation Type: Heart Transplant
Surgeon Name: Dorothy Oz
Drug Name    Drug Manufacturer    Dose Size  Dose UOM
Minicillan   Drugs R Us           200        mg
Maxicillan   Canada4Less Drugs    400        mg
Minicillan   Drug USA             150        mg
Databases (ranked for enterprises as of 2016-03-14)
[Figure: the database category map (velocity and volume versus more or less structure), showing only the enterprise-ranked databases listed here.]
newSQL / Live Analytics: Oracle Exalytics, SAP HANA, Oracle x10
Wide Column (complex key): Cassandra
Key/Value (simple key): DynamoDB
SQL: Oracle DB, SQL Server, DB2, MySQL, EnterpriseDB
Data Warehouse: Oracle Exadata, Teradata, Netezza
Document (JSON): MongoDB, MarkLogic
Graph/RDF: MarkLogic
Doc Warehouse (XML): MarkLogic
Big Data (raw): Splunk, Hadoop
Multimodel SQL Databases
[Figure: the database category map (velocity and volume versus more or less structure), highlighting Oracle DB as the #1 multimodel SQL database, together with Oracle Exadata.]
Multimodel NoSQL Databases
[Figure: the database category map (velocity and volume versus more or less structure), highlighting MarkLogic as the #1 multimodel NoSQL database; MarkLogic is marked in several areas of the map.]
What's wrong with SQL DBs?
Velocity
  SQL DBs are serialized to ensure consistency
  They use high-latency disk
  This prevents them from scaling horizontally and limits velocity
Volume
  SQL DBs are serialized to share resources: cores, caches, and storage
  This prevents them from scaling horizontally and limits volume
Storage Cost and Performance Defines Physical Database Architecture

Volume
Storage       Cost/Blade  GB/Blade  Cost/GB  Bandwidth (MB/s)  Cost/MB/s
RAID 0 HDDs   $2,500      4,000     $0.63    200               $12.58
Flash*        $8,600      1,200     $7.00    975               $10.00
RAM           $11,700     768       $15.23   12,800            $0.91

Velocity
Storage       IOPs (1000/s)  Latency (µs)  Cost/IOPs
RAID 0 HDDs   [missing]      7,000         $2.10
Flash*        115            81            $0.20
RAM           1,333,000      0.02          $0.00

* Flash is the average of SSDs and Flash PCIe cards
What do you need your databases to be optimized for?
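The Cost/GB column appears to be Cost/Blade divided by GB/Blade. The following is a minimal sketch of that arithmetic for the HDD and RAM rows, using the figures from the slide; the names `blades` and `cost_per_gb` are illustrative only. Flash is left out because the slide gives it as an average of SSDs and PCIe cards, so its ratios need not divide out exactly.

```python
# (cost per blade in USD, GB per blade), as printed on the slide.
blades = {
    "RAID 0 HDDs": (2500, 4000),
    "RAM": (11700, 768),
}


def cost_per_gb(cost_per_blade, gb_per_blade):
    """Cost/GB is Cost/Blade divided by GB/Blade."""
    return cost_per_blade / gb_per_blade


for name, (cost, gb) in blades.items():
    print(f"{name}: {cost_per_gb(cost, gb):.3f} USD per GB")

# How many times more expensive RAM is than HDD per GB of capacity.
ram_vs_hdd = cost_per_gb(*blades["RAM"]) / cost_per_gb(*blades["RAID 0 HDDs"])
```

On these numbers RAM costs roughly 24 times more per GB than RAID 0 HDDs, which is the volume side of the velocity versus volume tradeoff.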
Hardware Architecture
[Figure: two panels. Velocity Optimized (OLTP): RAM DB. Volume Optimized (Warehouse): Disk DB. Labels on both panels: Sync or Async Transactions; JSON; XML; the sample hospital document.]
Problem: Serialized DB Design
SQL DBs use serialization and synchronization for consistency
  Logging data
  Locking rows of data
  Loading data from disk to RAM
  Locking buffered/cached data
Percentages from Michael Stonebraker @ 2011 NoSQL Conference San Jose
Cost of Synchronization
Single-threaded operations are 746 times faster than two threads with locks (300 ns vs. 224,000 ns)

Operation                            Time (ns)
Single thread                        300
Single thread with memory barrier    4,700
Single thread with CAS               5,700
Single thread with lock              10,000
Two threads with CAS                 30,000
Two threads with lock                224,000

Slide from James Gates SORT presentation, When Performance Really Matters
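The 746x figure can be reproduced from the timings on the slide. The following is a small sketch of that arithmetic only; it does not re-measure anything, and the dictionary name `timings_ns` is illustrative.

```python
# Timings in nanoseconds, as listed on the slide.
timings_ns = {
    "Single thread": 300,
    "Single thread with memory barrier": 4_700,
    "Single thread with CAS": 5_700,
    "Single thread with lock": 10_000,
    "Two threads with CAS": 30_000,
    "Two threads with lock": 224_000,
}

baseline = timings_ns["Single thread"]

# Slowdown of each variant relative to the plain single-threaded case.
slowdown = {name: ns / baseline for name, ns in timings_ns.items()}

for name, factor in slowdown.items():
    print(f"{name}: {factor:.1f}x")
```

The last row works out to 224,000 / 300, or about 746.7, which the slide truncates to 746.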
Exponential Transistors
[Figure: log-scale chart, 1970 to 2010, plotting Transistors (000), Clock Speed (MHz), Power (W), and ILP. Annotations: Flat Clock (faster clock needs more power); Flat Power (power is expensive and hot); Flat ILP (instruction-level parallelism); Gap Causes Multiple Cores.]
Database Velocity

Volume      Real-world 1K     Real-world 1K     Relational DB         Document DB              Wide Column or
Per Day     Transactions      Transactions                                                     Key Value
            Per Day           Per Second
8 GB        8,640,000         100               As Is                 As Is                    As Is
86 GB       86,400,000        1,000             Tuned*                As Is                    As Is
432 GB      432,000,000       5,000             Appliance             Tuned*                   As Is
864 GB      864,000,000       10,000            Clustered Appliance   Clustered Servers        Tuned*
8,640 GB    8,640,000,000     100,000           -                     Many Clustered Servers   Clustered Servers
43,200 GB   43,200,000,000    500,000           -                     -                        Many Clustered Servers

* Tuned means tuning the model, queries, and/or hardware (more CPU, RAM, and Flash)
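The per-day columns follow from the transactions-per-second column, assuming a sustained rate over a full 86,400-second day, 1 KB per transaction, and decimal units (1 GB = 1,000,000 KB). The following is a minimal sketch of that arithmetic; the function names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def transactions_per_day(tps):
    """Transactions per day at a sustained rate of `tps` per second."""
    return tps * SECONDS_PER_DAY


def volume_gb_per_day(tps, kb_per_txn=1):
    """Daily data volume in GB, assuming decimal units (1 GB = 1,000,000 KB)."""
    return transactions_per_day(tps) * kb_per_txn / 1_000_000


for tps in (100, 1_000, 5_000, 10_000, 100_000, 500_000):
    print(tps, transactions_per_day(tps), volume_gb_per_day(tps))
```

For example, 1,000 transactions per second gives 86,400,000 transactions and about 86 GB per day, matching the second row.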
Hardware Takeaway
Choose DB designed to meet your scaling needs for velocity and volume at lowest hardware cost
  Leverage RAM when you need maximum velocity (low latency)
  Leverage disk when you need massive volume (high bandwidth)
  Scale horizontally for maximum parallel processing
  Choose the right mix of synchronous and asynchronous transactions
What velocity and volume do you need?
Thoughts?
Availability vs. Consistency
[Figure: availability through asynchronous replication. Labels: Receiver, Publisher, Asynchronous Replication, Input Disruptor, Output Disruptor, Journaler, Replicator, Marshaller, Unmarshaller, Business Logic Processor.]
Do I want multiple data centers to have consistent data immediately or eventually?
Do I want reads and/or writes to be available when data centers are down?
Various NoSQL databases give you different options
Brewer's 1998 CAP Theorem (Wikipedia)
It is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:
  Consistency (all nodes see the same data at the same time)
  Availability (a guarantee that every request receives a response about whether it succeeded or failed)
  Partition tolerance (the system continues to operate despite arbitrary partitioning due to network failures)
CAP In Practice Today
[Figure: scale spectrum running through One CPU Core, Multiple Cores, Multiple CPUs, Servers, Availability Zones, and Global Data Centers. One end: Realtime, Synchronous, Few Data Copies, Less Compute, Vertical Scale, Less Availability. Other end: Neartime, Asynchronous, Many Data Copies, More Compute, Horizontal Scale, Global Availability.]
As distance increases, communication latency increases. This makes communication less reliable, and real-time access to data becomes slower and less available.
This does not have to affect the point-in-time consistency of data.
Solutions
  Globally Consistent Clusters
  Multimaster Clusters
Clusters Mitigate CAP Limitations
Purposes of Database Clusters
1. Availability
2. Scalability
  Data
  Processing
[Figure: scale spectrum from One CPU Core (Consistent, Realtime, Few Data Copies, Less Compute, Vertical Scale, Less Availability) to Global Data Centers (Inconsistent, Neartime, Many Data Copies, More Compute, Horizontal Scale, Global Availability).]
Availability Clusters
[Figure: several datacenters, each with Zone 1 and Zone 2 on independent storage. Link labels: sync, async. Shown beside the scale spectrum from One CPU Core (Consistent, Realtime) to Global Data Centers (Inconsistent, Neartime).]
Scale Clusters
[Figure: one datacenter with data spread across its servers, shown beside the scale spectrum from One CPU Core (Consistent, Realtime) to Global Data Centers (Inconsistent, Neartime).]
Data is automatically spread across all servers to scale the storage, processing, and querying of big data
  Data can be dispersed randomly across all servers for maximum parallel query performance
  Data can be sharded onto specific servers for data colocation, predictable data processing, predictable lookups, etc.
  Common data can be replicated across all servers for quick local access
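The three placement options above can be sketched in a few lines. This is an illustration under stated assumptions, not how any particular database does it: the server names are hypothetical, and real products typically use more elaborate schemes (for example, consistent hashing) so that adding a server does not remap most keys.

```python
import hashlib
import random

SERVERS = ["server-0", "server-1", "server-2", "server-3"]  # hypothetical names


def disperse(servers=SERVERS):
    """Random dispersal: any server may receive the record."""
    return random.choice(servers)


def shard_for(key, servers=SERVERS):
    """Sharding: a stable hash of the key always maps to the same server."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


def replicate(servers=SERVERS):
    """Replication: common data is copied to every server."""
    return list(servers)


print(shard_for("patient-13"))   # same server on every call
print(disperse())                # varies from call to call
print(replicate())               # all servers
```

Sharding gives predictable lookups because the key alone determines the server; random dispersal gives up that predictability in exchange for an even spread.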
Globally Consistent Clusters
Writes only go to one scale cluster at a time with automatic failover between zones and data centers
[Figure: Datacenter 1 and Datacenter 2, each with Zone 1 and Zone 2. Link labels: sync, async. Shown beside the scale spectrum from One CPU Core (Consistent, Realtime) to Global Data Centers (Inconsistent, Neartime).]
Pros
  Global availability
  Realtime to neartime reads & writes
  Simple development
  Consistent data (some clusters get consistent data sooner than others)
Cons
  All writes around the world go to one cluster; this slows writes for distant locations
  Local reads may not come from the latest writes
  When a data center fails, any data committed in it but not yet transmitted to other data centers is lost until the failed data center comes back online
Multimaster Clusters
Writes eventually go to all clusters and the application deals with conflicts and inconsistent data
[Figure: Datacenter 1 and Datacenter 2, each with Zone 1 and Zone 2. Link labels: sync, async. Shown beside the scale spectrum from One CPU Core (Consistent, Realtime) to Global Data Centers (Inconsistent, Neartime).]
Pros
  Global availability
  Realtime to neartime reads & writes
  All writes go to closest location
  Local reads usually come from latest writes
Cons
  Inconsistent data (at any point in time each cluster has different data)
  Complex development to handle conflict resolution
  When a data center fails, any data committed in it but not yet transmitted to other data centers is lost until the failed data center comes back online
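To give a feel for the conflict-resolution code a multimaster application may need, here is a minimal sketch of one common policy, last-write-wins. The slides do not prescribe a policy; the `Version` type and its fields are hypothetical, and the sketch assumes cluster clocks are comparable, which is itself a strong assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: int
    timestamp: float  # write time reported by the originating cluster
    cluster: str      # tie-breaker so every cluster picks the same winner


def resolve(a, b):
    """Last-write-wins: keep the newer write; break timestamp ties by cluster name."""
    return max(a, b, key=lambda v: (v.timestamp, v.cluster))


# Two clusters updated the same record before they synchronized.
from_dc1 = Version(value=5, timestamp=10.0, cluster="dc1")
from_dc2 = Version(value=1, timestamp=12.0, cluster="dc2")
winner = resolve(from_dc1, from_dc2)
print(winner)
```

Note that the losing write is silently discarded, which is one way a multimaster design can end up with data that cannot be reconciled later.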
Choosing Between Global Clusters
Consistency and Development Simplicity vs. Write Locality
[Figure: Globally Consistent Clusters and Multimaster Clusters side by side, each drawn as Datacenter 1 and Datacenter 2 with Zone 1 and Zone 2.]
ACID vs. BASE
ACID (pictured as H2SO4, sulfuric acid)
  Atomic
  Consistent
  Isolated
  Durable
BASE (pictured as NaOH, sodium hydroxide)
  Basically Available
  Soft state
  Eventual consistency
ACID vs. BASE in Large Scale Clusters
ACID transactions between nodes (within clusters or across data centers) are synchronously coupled through a two-phase commit
1. Transaction coordinator pre-commits the transaction on each node, and each node indicates if the commit is possible
2. If both nodes agree a commit is possible, the transaction coordinator asks both nodes to apply the commit
3. If any node vetoes the commit, the transaction coordinator asks both nodes to roll back the commit
4. Adding nodes reduces availability and slows inserts, updates, and deletes
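The two-phase commit steps can be sketched as follows. This is a toy, in-process illustration: the `Node` class and its method names are hypothetical, and a real coordinator also has to handle timeouts, crashes between phases, and durable logging of its decision.

```python
class Node:
    """Hypothetical participant that can accept or veto a transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self, txn):
        """Phase 1: pre-commit and report whether the commit is possible."""
        self.state = "prepared" if self.can_commit else "vetoed"
        return self.can_commit

    def commit(self, txn):
        self.state = "committed"

    def rollback(self, txn):
        self.state = "rolled_back"


def two_phase_commit(nodes, txn):
    """Return True if every node committed, False if all were rolled back."""
    votes = [node.prepare(txn) for node in nodes]   # phase 1: collect votes
    if all(votes):
        for node in nodes:                          # phase 2a: apply the commit
            node.commit(txn)
        return True
    for node in nodes:                              # phase 2b: any veto rolls back all
        node.rollback(txn)
    return False


ok_nodes = [Node("a"), Node("b")]
veto_nodes = [Node("a"), Node("b", can_commit=False)]
print(two_phase_commit(ok_nodes, "txn-1"), two_phase_commit(veto_nodes, "txn-2"))
```

Every node must answer in phase 1 before anything commits, which is why each added node makes the slowest or least available participant matter more.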
BASE transactions between nodes (within clusters or across data centers) are asynchronously decoupled
1. Use asynchronous, guaranteed-delivery, and ordered messages
2. Add a log table to target database to track successful execution of queue messages
3. Entries into the log table occur when messages are successfully executed in the target database
4. Messages in the queue are dequeued only after the log confirms they have been executed
5. Adding nodes increases availability, parallel processing, data movement, and inconsistency
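The queue-plus-log-table pattern can be sketched in memory as below. All names are illustrative: the queue stands in for an ordered, guaranteed-delivery message system, `target` for the target database, and `log` for its log table. In a real system the write and the log entry would be made in one local transaction on the target.

```python
from collections import deque


def deliver(queue, target, log):
    """Apply queued messages in order; dequeue only after the log confirms execution."""
    while queue:
        msg_id, key, value = queue[0]   # peek: the message stays queued for now
        if msg_id not in log:           # skip messages that were already executed
            target[key] = value         # execute the message in the target database
            log.add(msg_id)             # record the successful execution
        queue.popleft()                 # the log confirms it, so dequeue


queue = deque([(1, "qty", 1), (2, "qty", 5)])
target, log = {}, set()
deliver(queue, target, log)

# A redelivered message (same id) is recognized via the log and not applied twice.
deliver(deque([(1, "qty", 1)]), target, log)
print(target, sorted(log))
```

Because a message is only removed after its log entry exists, a crash between the two steps leads to redelivery rather than loss, and the log makes that redelivery harmless.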
What's wrong with NoSQL?
In most NoSQL solutions, the developer is responsible for ensuring consistency
  Imagine programming an app to coordinate thousands of concurrent threads across gigabytes of data structures
  Imagine writing code to handle all threading issues
    Locks
    Contention
    Serialization
    Deadlocks
    Race conditions
    Threading bugs
NOTE: MarkLogic is currently the only NoSQL database that is fully ACID compliant
What are ACID transactions?
Relational databases use the ACID model to make it easy, reliable, and fast for concurrent processes to query and modify the same data consistently
Few NoSQL databases use the ACID model
Atomic
  All data and commands in a transaction succeed, or all fail and roll back
Consistent
  All committed data must be consistent with all data rules including constraints, triggers, cascades, atomicity, isolation, and durability
Isolated
  No transaction can interfere with other concurrent transactions
Durable
  Once a transaction is committed, data will survive system failures, and can be reliably recovered after an unwanted deletion
ACID-compliant NoSQL Databases
[Figure: the database category map (velocity and volume versus more or less structure), showing only the ACID-compliant NoSQL databases: MarkLogic (marked in several areas of the map) and FoundationDB.]
What is Durability?
Durable data survives system failures
  Once a transaction has committed, its data is guaranteed to survive system failures
  Failures may occur in the server, operating system, disk, and database.
  Failures may be caused by server crash, full disk, corrupted disk, power outages, etc.
  Durability requires storage to be redundant.
  Durability requires logs to replay asynchronously written data.
  Durability requires logs to be archived to another location so they can be recovered.
  Durability works with atomicity to ensure that partially written data is not durable.
  Without durability you can have faster inserts, updates, and deletes because you have no logs to write and you can store data in volatile memory while lazily writing it to disk
Durable data can be recovered after unwanted deletion
  Durability requires backups to be durable
  Durability requires recoveries to be durable
  Durability allows data to be recovered to a point in time before the system failed or before applications or users incorrectly destroyed or modified data
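The role of logs in durability can be illustrated with a toy write-ahead log. This is a sketch under stated assumptions, not a production design: `TinyStore` is a hypothetical class, it keeps a single local log file, and it omits the redundancy, archiving, and backups that the points above also call for.

```python
import json
import os


class TinyStore:
    """Hypothetical key-value store that logs every write before acknowledging it."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        self._replay()

    def _replay(self):
        """Rebuild in-memory state from the log after a restart."""
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, "r", encoding="utf-8") as log:
            for line in log:
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    break  # a torn final record was never acknowledged, so drop it
                self.data[record["key"]] = record["value"]

    def put(self, key, value):
        """Append to the log and force it to disk before reporting success."""
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self.data[key] = value
```

Creating a second `TinyStore` on the same path after a crash replays the log and recovers every acknowledged write. The `fsync` call on each `put` is also where the performance cost of durability shows up.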
How much durability do you need?
How durable is the database's data?
  Does the database ensure data is durable before it returns success?
  Does the database use logs and archive its logs?
  Does it write data to multiple servers?
  Does it allow developers to override durability at runtime for high performance?
How well does the database do backups and restores?
  How easy is it to manage backups and restores? Schedules? User interface?
  Does the database provide full, incremental, and differential backups?
  Can it restore to a point in time?
How much code will you have to write
  to back up data and configurations?
  to recover accidental deletion of data?
  to leverage the physical layout of the database cluster?
  to propagate data durably across the cluster?
  to restore database data and cluster configurations to a point in time?
What is Atomicity?
An atomic transaction is all or nothing
  Atomicity means all parts of a transaction succeed or nothing does
  This requires partially written data to be automatically removed
  A transaction is a single command or a set of commands that execute atomically
  A single command to a programmer often represents multiple commands to the database because the database needs to replicate data to multiple disks, to update indexes, to execute triggers, to verify constraints, and to cascade deletes and updates, etc.
  Without atomicity, you can have faster transactions because they don't need two-phase commit
Sets of data
  An operation may need to process multiple data items
  All data in the set needs to be changed or none; otherwise it becomes arbitrarily inconsistent
  For example, you want to delete part of a set of data and partway through the transaction fails. If all the changes are not automatically rolled back, the data is left in an inconsistent state that can affect the results of other transactions.
Sets of commands
  A series of commands often needs to work as a unit to produce accurate results
  All commands need to succeed or all need to fail; otherwise data becomes arbitrarily inconsistent
  Inconsistent data is hard to fix: data may contradict other data, and there may be extra data or missing data. Without atomicity to roll back failures, there may be no way to fix the data.
  The classic example is where you need to debit one account and credit another as a single transaction. If one command succeeds and the other fails, account totals are inaccurate.
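The debit and credit example can be tried with Python's built-in `sqlite3` module. The following is a minimal sketch; the `account` table, its `CHECK` constraint, and the names `alice` and `bob` are made up for illustration. The credit is deliberately applied first so that a failing debit has something to roll back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Credit and debit as one transaction; return True if it committed."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the debit violated the CHECK, so the credit was undone too


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM account ORDER BY name"))


first = transfer(conn, "alice", "bob", 30)    # succeeds
second = transfer(conn, "alice", "bob", 500)  # debit would go negative, so it fails
print(first, second, balances(conn))
```

After the failed transfer the balances are the same as before it started, so the account totals stay accurate.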
How much Atomicity do you need?
How well does the database support atomic transactions?
  How important is it for the database to start a transaction
    that spans more than one document?
    that spans more than one command?
    that spans more than one server or cluster or data center?
    that rolls back a transaction in process?
  Is saving some data more important than saving all or nothing?
How much code will you have to write
  to handle corner cases where parts of transactions succeed and others fail?
  to find and fix inconsistent data caused by partially failed transactions?
  to determine when data inconsistency is caused by a bug or a partially complete transaction?
  to implement two-phase commits?
  to implement atomic sequences, constraints, triggers, data replications?
  to handle inconsistencies caused by multimaster updates?
What is Isolation?
Isolation prevents concurrent transactions from affecting each other
  Read isolation ensures queries return accurate results as of a point in time
  Write isolation locks data to prevent race conditions during updates, deletes, and inserts
  Without isolation, queries and transactions run faster because the database doesn't have to provide a consistent view using locks, snapshots, or system version numbers
Sets of data
  An operation can only produce accurate results as of a point in time
  It takes time for a command to process a set of data
  During this time, concurrent transactions may insert, update, and delete data
  Without isolation, a single command executes against data while it is being changed by other concurrent transactions.
    Records may be added after the command started running
    Records may be deleted or changed after the command has processed them.
    Changes during a query violate joins across different types of records
    This creates inconsistent results: aggregate functions produce wrong answers
Sets of commands
  A series of commands needs to work on a consistent view of data to produce accurate results. Without isolation, each command in a series will execute against arbitrarily different data.
Isolation Examples

Unisolated Multimaster Query
Total quantity is affected by data changes that happen while the query runs and during cluster synchronization

Isolated Query
Total quantity is unaffected by data changes that happen after the query starts no matter how long it runs

Point-in-time results for aggregating total quantity:

Cluster 1 Data Actions   Total (always correct)   Cluster 2 Synced Actions   Total (sometimes correct)
Insert id1, qty:1        1                        Insert id2, qty:3          3
Insert id2, qty:3        4                        Insert id1, qty:1          4
Update id1, qty:5        8                        Insert id3, qty:2          6
Insert id3, qty:2        10                       Update id1, qty:5          10
Update id1, qty:1        6                        Delete id2                 7
Delete id2               3                        Update id1, qty:1          3

[Figure: timeline of the data actions with the points where the query starts and ends. Correct answer is 4. Records queried: id2, qty:3; id1, qty:5; id3, qty:2. Answer is 10.]
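The point-in-time totals in this example can be reproduced by replaying the two action sequences from the slide. The following is a small sketch; the tuple encoding of the actions and the function name `running_totals` are illustrative.

```python
def running_totals(actions):
    """Replay actions in order and record the total quantity after each one."""
    rows, totals = {}, []
    for op, rec_id, qty in actions:
        if op == "delete":
            rows.pop(rec_id, None)
        else:  # insert and update both set the record's quantity
            rows[rec_id] = qty
        totals.append(sum(rows.values()))
    return totals


# Actions in the order they happened on cluster 1.
cluster1 = [
    ("insert", "id1", 1), ("insert", "id2", 3), ("update", "id1", 5),
    ("insert", "id3", 2), ("update", "id1", 1), ("delete", "id2", None),
]

# The same actions in the order they arrive on cluster 2 through synchronization.
cluster2_synced = [
    ("insert", "id2", 3), ("insert", "id1", 1), ("insert", "id3", 2),
    ("update", "id1", 5), ("delete", "id2", None), ("update", "id1", 1),
]

print(running_totals(cluster1))
print(running_totals(cluster2_synced))
```

Both clusters end at the same total, but the intermediate totals differ, so an aggregate query run on the second cluster during synchronization can return a value that never existed on the first.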
How much Isolation do you need?
How well does the database support read isolation?
  How important is query accuracy?
How well does the database support write isolation?
  How important is it to prevent race conditions and deadlocks?
  How important is it to ensure commands run on a consistent set of data at a point in time?
How much code will you have to write
  to hide concurrent updates, inserts, and deletes from queries?
  to handle update conflicts, race conditions, and deadlocks?
  to handle joins to lookup tables to ensure they do not change during a query or a write?
  to ensure aggregate queries operate on unchanging data?
What is Consistency?
Consistency is the product of atomicity, isolation, and durability
  Atomicity ensures that if data rules are violated, such as constraints and triggers, the transaction fails and all changes are rolled back.
  Isolation ensures a query sees a consistent set of data while other concurrent commands are modifying the underlying data
  Isolation ensures bulk updates lock sets of data so they can be processed as a consistent unit without other concurrent commands modifying their data.
  Durability ensures that data is consistently replicated to other nodes in a cluster so a loss of a node won't cause a loss of data
All committed data must be consistent with all data rules
  Constraints, triggers, cascades, atomicity, isolation, and durability
  Data must always be in a consistent state at any point in time
"Consistency is the last refuge of the unimaginative" (Oscar Wilde)
Do you need complete consistency?
Not necessarily; instead, you may prefer
  Absolute fastest performance at lowest hardware cost
  Highest global data availability at lowest hardware cost
  Working with one document at a time
  Writing advanced code to create your own consistency model
  Eventually consistent data
  Some inconsistent data that can't be reconciled
  Some missing data that can't be recovered
  Some inconsistent query results
"Consistency is the last refuge of the unimaginative" (Oscar Wilde)
Consistency Takeaway
Choose a database that meets your needs for write locality or consistency
BASE (NaOH): Multimaster Clusters
  Write Locality
ACID (H2SO4): Globally Consistent Clusters
  Point-in-time Consistency
  Less data loss (durability)
  More query accuracy (isolation)
  More data integrity (atomicity)
  Less code to compensate for data inconsistencies and conflicts
[Figure: each cluster type drawn as Datacenter 1 and Datacenter 2 with Zone 1 and Zone 2.]
What do you need most?
Thoughts?
  Highest performance for queries and transactions
  Lowest hardware cost across multiple data centers
  Write locality
  Less data loss (i.e. durability)
  More query accuracy & fewer deadlocks (i.e. isolation)
  More data integrity (i.e. atomicity)
  Less code to compensate for lack of ACID compliance
Agenda
1. Defining NoSQL and Big Data
2. Optimizing for Velocity or Volume
3. Optimizing for Availability or Consistency
4. Optimizing for Modeling Paradigms
5. Summary
Modeling Takeaway
No one physical data model meets all needs, so choose a multimodel DB
  Dimensional: Business Intelligence reporting and analytics
  Relational: Flexible queries, joins, updates, mature, standard
  Wide Column: Simple, fast puts and gets, massively scalable
  Document: Fast development, schemaless JSON/XML, searchable
  Graph/RDF: Modeling anything at runtime including relationships
Documents combined with Graph are the future
Velocity Takeaway
Choose DB that handles your required velocity

Volume      Real-world 1K     Real-world 1K     Relational            Document                 Wide Column
Per Day     Transactions      Transactions                                                     or Key Value
            Per Day           Per Second
8 GB        8,640,000         100               As Is                 As Is                    As Is
86 GB       86,400,000        1,000             Tuned*                As Is                    As Is
432 GB      432,000,000       5,000             Appliance             Tuned*                   As Is
864 GB      864,000,000       10,000            Clustered Appliance   Clustered Servers        Tuned*
8,640 GB    8,640,000,000     100,000           -                     Many Clustered Servers   Clustered Servers
43,200 GB   43,200,000,000    500,000           -                     -                        Many Clustered Servers

* Tuned means tuning the model, queries, and/or hardware (more CPU, RAM, and Flash)
Hardware Takeaway
Choose DB designed to meet your scaling needs for velocity and volume at lowest hardware cost
  Leverages RAM when you need maximum velocity (low latency)
  Leverages disk when you need massive volume (high bandwidth)
  Scales horizontally for maximum parallel processing
  Lets you choose the right mix of synchronous and asynchronous transactions
Modeling Takeaway
Choose a database that meets your multiple modeling needs
  Dimensional: Business Intelligence reporting and analytics
  Relational: Flexible queries, joins, updates, mature, standard
  Wide Column: Simple, fast puts and gets, massively scalable
  Document: Fast development, schemaless JSON/XML, searchable
  Graph/RDF: Modeling anything at runtime including relationships
Documents combined with Graph are the future
NoSQL Enterprise Readiness Takeaways
[Figure: hype cycle with the phases Technology Trigger, Inflated Expectations, Disillusionment, Enlightenment, and Productivity, placing NoSQL, DBaaS, DB Appliances, SQL, and MapReduce. Legend: Enterprise Ready in 1 to 5 years or 5 to 10 years.]
Derived from Gartner Hype Cycle for Data Management
Evaluating and Modeling NoSQL and SQL Databases
Part 1
2016 EDW
by Michael Bowers
2016-03-14
v. 4.9
mike@cssDesignPatterns.com