
MySQL Scaling and High Availability Architectures
Jeremy Cole
jeremy@provenscaling.com
Eric Bergen
eric@provenscaling.com
Who are we?
• Proven Scaling is a consulting company founded in 2006 by Eric and Jeremy, specializing in MySQL
• We primarily deal with architecture and design for large scalable systems
• We also do training, DBA work, custom MySQL features, etc.
• Jeremy: optimization, architecture, performance
• Eric: operations, administration, monitoring
Overview
• What’s the problem?
• Basic Tenets of Scaling and High Availability
• Lifetime of a Scalable System
• Approaches to Scaling
• Approaches to High Availability
• Tools and Components
What’s the problem?
• Internet-age systems can grow (or be forced to
choose between growth and death) very quickly
• No matter what you plan for or predict, users will
always surprise you
• Mobs, err, valued users can be very annoying
sometimes (e.g. “biggest group ever” logic)
• Users may have vastly different usage patterns
• Web 2.0™ (blechhhh!) sites have changed the
world of scaling; it’s much harder now
• Everyone (your VCs included) expects you to be
Web 2.0® compliant™
Basic Tenets
• Don’t design scalable or highly available systems:
  - Using components you do not control or that have loose tolerances (e.g. DNS)
  - Using processes with potentially ugly side effects (e.g. code changes to add a new server) [Yes, configuration files are very often “code”]
• If a user doesn’t think/notice something is down,
it’s not really “down”
• Eliminate (or limit) single points of failure -- if you
have only one of any component, examine why
• Cache everything
Lifetime of a Scalable System
Newborn
• Shared hosting
• Might start worrying (a little bit) about query
optimization at this point
• Don’t have much control over configuration
• Overall performance may be poor
• Traffic picks up, and performance is bad... What do
we do about it?
Toddler
• A single (dedicated) server for everything
• MySQL, Apache, etc. all competing for resources
• MySQL needs memory for caching data
• Apache (and especially PHP etc.) needs lots of
memory for handling requests
• Memory contention will be the first major
bottleneck
Child
• Separate web servers and database server
• Usually go ahead and get multiple web servers now, since it’s easy
• Keep a single database server, since splitting it is hard -- maybe just better hardware?
• Now we need to do session management across web servers… hmm, we have this nice database… (sketched below)
• Other load same as before, but now with added
network overhead
• Single database server becomes your biggest
bottleneck
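At this stage a lot of sites stuff sessions into that nice database; a minimal sketch of what that looks like (the table and column names here are hypothetical, not from the talk):

  -- Hypothetical sessions table shared by all web servers; every page
  -- view reads (and usually writes) a row here, which is exactly the
  -- kind of load that hurts once the database becomes the bottleneck.
  CREATE TABLE sessions (
    session_id CHAR(32) NOT NULL PRIMARY KEY,   -- e.g. the PHP session id
    user_id    INT UNSIGNED NULL,
    data       BLOB NOT NULL,                   -- serialized session payload
    updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
                                  ON UPDATE CURRENT_TIMESTAMP
  ) ENGINE=InnoDB;

  -- Typical per-request access pattern:
  SELECT data FROM sessions WHERE session_id = 'abc123';
  UPDATE sessions SET data = '...' WHERE session_id = 'abc123';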
Teenager
• “Simple” division of load by moving tables or
processes
• Use replication to move reporting off production
• Move individual tables or databases to lighten load
• Use replication to move reads to slaves
• Modify code to know where everything is
• Still too many writes in some parts of system
• Replication synchronization problems mean either
annoying users or writing lots of code to work
around the problem
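When reads move to slaves, the synchronization problem shows up as replication lag; checking it only takes standard MySQL commands (the binlog coordinates below are placeholders):

  -- On a slave: Seconds_Behind_Master shows how far behind it is;
  -- NULL means the I/O or SQL thread is not running at all.
  SHOW SLAVE STATUS\G

  -- Code that cannot tolerate stale data either reads from the master
  -- or waits for the slave to reach a known master binlog position:
  SELECT MASTER_POS_WAIT('mysql-bin.000042', 1234567, 10);  -- 10 second timeout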
Late teens to 20s
• The “awkward” stage
• This is where many applications (and sometimes
entire companies) die by making bad decisions
• Death can be slow (becoming irrelevant due to
speed or lack of scalability) or quick (massive
meltdown losing user confidence)
• Managing the move from teenager into adulthood
is often the first real project requiring specs and
real processes to do it right
• Downtime at this point is hard to swallow due to
size of userbase
Adult
• Scalable system that can grow for a long time,
generally based on data partitioning
• Most improvements now are incremental
• System is built to allow incremental improvements
without downtime
• A lot has been learned from the successful
transition to adulthood
Data Partitioning: The only game in town
What is partitioning?
• Distributing data on a record-by-record basis
• Usually a single basis for distributing records in
each data set is chosen: a “partition key”
• An application may have multiple partition keys
• Each node has all related tables, but only a portion
of the data
Partitioning Models
• Fixed “hash key” partitioning
• Dynamic “directory” partitioning

• Partition by “group”
• Partition by “user”
Partitioning Difficulties
• Inter-partition interactions are a lot more difficult
• Example: Partitioning by user, where do we store a message sent from one user to another? How about a friend list? (sketched below)
• Overall reporting becomes more difficult
• Example: Find the average number of friends a user has by state…
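For the message example above, one common (hypothetical) answer is to write the message twice, once into each user's partition, so the sender's outbox and the recipient's inbox can both be read without crossing partitions; the table names here are illustrative only:

  -- On the sender's partition (user 1001 lives here):
  INSERT INTO messages_sent (from_user_id, to_user_id, body, sent_at)
  VALUES (1001, 2002, 'hello', NOW());

  -- On the recipient's partition (user 2002 lives here):
  INSERT INTO messages_received (to_user_id, from_user_id, body, sent_at)
  VALUES (2002, 1001, 'hello', NOW());

  -- The two INSERTs hit different servers, so the application (or a queue)
  -- has to deal with one succeeding and the other failing.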
Partition by …
• Partitioning by user (or equivalent) allows for the
most flexibility in most applications
• In many cases it may make sense to partition by
groups, if most (or all) interactions between users
are within that group
• You could also get most of the same benefits of
partitioning by group by partitioning by user with
an affinity based on group
Fixed Hash Key
• Divide the data into B buckets
• Divide the B buckets over M machines
• Example: Define 1024 user buckets 0..1023 based on (user_id % 1024) for 4 physical servers, so each server gets 256 of the buckets by range: 0-255, 256-511, 512-767, 768-1023
• Problem: Moving entire buckets means affecting 1/B of your users at a time in the best case… in simple implementations you may have to affect 1/M or 2/M of your users
• Problem: The bucket-to-machine mapping must be stored somewhere (usually in code) and updated atomically
• Problem: You have no control over which bucket (and thus machine) a given user is assigned to
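A worked example of the arithmetic above, using the same 1024-bucket / 4-server split:

  -- user_id 123457 maps to bucket (user_id % 1024) and, with 256 buckets
  -- per server, to server FLOOR(bucket / 256):
  SELECT 123457 % 1024                AS bucket,     -- 577
         FLOOR((123457 % 1024) / 256) AS server_no;  -- 2, i.e. the 512-767 range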
Dynamic Directory
• A “directory” server maintains a database of mappings between users and partitions
• A user is assigned (often randomly) to one partition and that mapping is stored
• Any user may be moved later by locking the user, moving their data, and updating their mapping in the directory
• Solution: Only single users are affected by any repartitioning that must be done
• Solution: Partitions may be rebalanced user-by-user at any time
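A minimal sketch of what the directory database might hold (hypothetical schema, not from the talk):

  -- One row per user on the directory server.
  CREATE TABLE user_directory (
    user_id      INT UNSIGNED NOT NULL PRIMARY KEY,
    partition_id SMALLINT UNSIGNED NOT NULL,     -- which MySQL shard holds the user
    locked       TINYINT(1) NOT NULL DEFAULT 0,  -- set while the user is being moved
    KEY (partition_id)
  ) ENGINE=InnoDB;

  -- New user: assign a partition (random, least-loaded, or by affinity).
  INSERT INTO user_directory (user_id, partition_id) VALUES (123457, 7);

  -- Every request starts with a lookup, then connects to that partition.
  SELECT partition_id, locked FROM user_directory WHERE user_id = 123457;

  -- Rebalancing: lock the user, copy their rows to the new partition,
  -- then repoint the mapping and unlock.
  UPDATE user_directory SET locked = 1 WHERE user_id = 123457;
  -- ... copy the user's data from partition 7 to partition 12 ...
  UPDATE user_directory SET partition_id = 12, locked = 0 WHERE user_id = 123457;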
Custom Solutions
• It’s very easy to implement simple hash key
partitioning to get data distributed
• It’s much more difficult to be able to re-partition
• It’s difficult to grow
Hibernate Shards
• Sort of a merge between fixed key partitioning and
directory-based partitioning
• “Virtual Shards” abstract the mapping of objects to
shards, but simplistically
• It’s still painful to repartition
• It doesn’t handle rebalancing at all currently
• It doesn’t handle aggregation at all
HiveDB
HiveDB Project
• HiveDB is an Open Source project to design and
implement the entire “standard” partition-by-key
MySQL system in Java
• Originally envisioned by Jeremy while working with
several customers
• Implemented by Fortress Consulting and
CafePress along with help and guidance from
Proven Scaling
• Many companies have built somewhat similar systems, but nobody has really open sourced them
Why HiveDB?
• Many solutions that exist only solve the easy part:
storing and retrieving data across many machines
• Nobody really touches on the hard part: being able
to rebalance and move users on the fly
Server Architecture
• Hive Metadata
 Partition definition
• Directory
 Partition Key -> Partition mapping
 Secondary Key -> Partition Key mapping
• Hive Queen - makes management and rebalancing
decisions
• Job Server (Quartz) - actually executes tasks
• Aggregation Layer (future)
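The two directory mappings can be pictured as two lookup tables; this is only an illustration of the idea, not HiveDB's actual schema:

  -- Partition key -> partition: which shard holds this record's data.
  CREATE TABLE key_to_partition (
    partition_key BIGINT UNSIGNED NOT NULL PRIMARY KEY,  -- e.g. user_id
    partition_id  SMALLINT UNSIGNED NOT NULL
  ) ENGINE=InnoDB;

  -- Secondary key -> partition key: find the shard when a request only
  -- carries, say, an email address instead of the user_id.
  CREATE TABLE email_to_key (
    email         VARCHAR(255) NOT NULL PRIMARY KEY,
    partition_key BIGINT UNSIGNED NOT NULL
  ) ENGINE=InnoDB;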
Client Architecture
• Client uses Hive API to request a connection for a certain partition key
• Client uses those direct connections to do work
• Hive API should be written in each development language as necessary
High Availability
Goals
• Avoid downtime due to failures
• No single point of failure
• Extremely fast failover
• No dependency on DNS changes
• No dependency on code changes
• Allow for painless, worry-free “casual failovers” to upgrade, change hardware, etc.
• Fail-back must be just as painless
MySQL Replication
Basics
• MySQL replication is master-slave, one-way, asynchronous replication
• “Master” keeps logs of all changes – called “binary logs” or “binlogs”
• “Slave” connects to the master through the normal MySQL protocol on TCP port 3306
• Slave requests binary logs from last position
• Master sends binary logs up to current time
• Master keeps sending binary logs in real-time
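Pointing a slave at a master takes nothing beyond standard MySQL statements (host name, credentials, and binlog coordinates below are placeholders):

  -- On the slave, after loading a consistent copy of the master's data:
  CHANGE MASTER TO
    MASTER_HOST     = 'master.example.com',
    MASTER_PORT     = 3306,
    MASTER_USER     = 'repl',
    MASTER_PASSWORD = 'secret',
    MASTER_LOG_FILE = 'mysql-bin.000042',  -- position the copy was taken at
    MASTER_LOG_POS  = 1234567;
  START SLAVE;

  -- Verify both replication threads are running and check the lag:
  SHOW SLAVE STATUS\G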
More Basics
• Replication works with all table types and (mostly) all features
• Any “critical” reads – ones that cannot be allowed to return stale data – must be done on the master; replication is asynchronous, so there may be a delay at any time
Typical Setup
• One Master (single source of truth)
• Any number of slaves
• Slaves are used for reads only
• All writes go to the master
• There are many other possibilities…
Replication Topologies
Master with One Slave
  Master -> Slave

Master with Many Slaves
  Master -> (Slave, Slave, Slave, Slave, Slave)

Master with Relay Slave
  Master -> Relay Slave -> Slave

Master with Relay and Many Slaves
  Master -> Relay Slave -> (Slave, Slave, Slave, Slave, Slave)

Master with Many Relays
  Master -> (Relay x 5), each Relay -> its own Slaves

Dual Masters
  Master <-> Master

Dual Masters with Slaves
  Slave <- Master <-> Master -> Relay Slave -> (Slave, Slave, Slave)

Ring (Don’t Use)
  Master -> Master -> Master -> (back to the first)
High Availability Options
Dual Master
• Two machines with independent storage
configured as master and slave of each other
• Optionally: Any number of slaves for reads only
• Manual (scripted) or automatic (heartbeat-based)
failover is possible
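A sketch of the setup, assuming two servers called db1 and db2 (names are placeholders); each needs log-bin and a unique server-id in my.cnf, and the auto-increment offsets below are a common precaution so generated keys cannot collide if both masters ever take writes:

  -- On db1 (server-id = 1, log-bin enabled in my.cnf):
  SET GLOBAL auto_increment_increment = 2;
  SET GLOBAL auto_increment_offset    = 1;
  CHANGE MASTER TO MASTER_HOST='db2', MASTER_USER='repl', MASTER_PASSWORD='secret';
  -- (plus MASTER_LOG_FILE / MASTER_LOG_POS as in the earlier example)
  START SLAVE;

  -- On db2 (server-id = 2, log-bin enabled in my.cnf):
  SET GLOBAL auto_increment_increment = 2;
  SET GLOBAL auto_increment_offset    = 2;
  CHANGE MASTER TO MASTER_HOST='db1', MASTER_USER='repl', MASTER_PASSWORD='secret';
  START SLAVE;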
Dual Master Pros
• Very simple configuration
• Simple to understand = simple to maintain
• Very similar to basic master-slave configuration
that many are familiar with
• Allows easy failover in either direction without
reconfiguration or rebuilding
• Allows for easy and reliable failover for non-
emergency situations: upgrades, schema changes,
etc.
• Allows for quick failover in emergency
• Can work between distant sites fairly easily
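A sketch of what a planned (“casual”) failover can look like with dual masters; the exact procedure varies, but every statement here is standard MySQL:

  -- 1. On the old active master: stop taking new writes.
  SET GLOBAL read_only = 1;   -- note: does not affect users with SUPER

  -- 2. Note where its binary log ends.
  SHOW MASTER STATUS;         -- e.g. File: mysql-bin.000099, Position: 4711

  -- 3. On the other master: wait until that position has been applied.
  SELECT MASTER_POS_WAIT('mysql-bin.000099', 4711);

  -- 4. Open the other master for writes and repoint the application
  --    (VIP or proxy change, not DNS); fail back the same way later.
  SET GLOBAL read_only = 0;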
Dual Master Cons
• Does not help scale writes (no, not at all)
• Limited to two sites; a MySQL slave can have only one master, so three or more co-masters is not possible
• Replication is asynchronous, and may get behind --
there is always a chance of data loss (albeit small)
SAN
• Shared storage of a single set of disks by two
MySQL servers, with a single copy of the data on a
FibreChannel or IP/iSCSI SAN
• Automatic (heartbeat) failover by fencing and
mounting the SAN on the other machine
SAN Pros
• Single copy of the data means lower storage cost
for extremely large databases
• No worries about replication getting behind
• SAN systems can achieve very high performance for the same or lower cost than two very large RAID arrays
SAN Cons
• Single copy of the data means corruption is
possible, and could be very damaging
• For medium or small databases, cost can be
prohibitive
• FibreChannel requires additional infrastructure
often not present in typical MySQL systems; iSCSI
can be very helpful in this regard
• Single copy of the data -- no schema change tricks
are possible
DRBD
• Block device-level replication between two
machines with their own independent storage
(mirrors of the same data)
• Automatic (heartbeat-based) failover by fencing
and mounting local copy of filesystem is typical
DRBD Pros
• Simple hardware and infrastructure using locally-
attached RAID
• No expensive hardware or network
DRBD Cons
• Complex configuration and maintenance
• May cause performance problems, especially if
poorly configured
• Failure of or problems with mirror can cause
problems in production
• From the software perspective, there is still a
single copy of the data, which may get corrupted
• Single copy of the data -- no schema change tricks
are possible
Putting It All Together
Partitioning + HA
• No partitioning solutions really address HA; they treat the “shards” or “partitions” as single MySQL servers
• In reality you would implement an HA solution for
each partition
• There are many possibilities
HiveDB + Dual Master
• We recommend HiveDB plus Dual Master for most
installations
• While not technically perfect, and with a chance of
data loss, administrative tasks are very simple
• Additionally, using LVM for volume management makes it easy to take snapshot backups (sketched below)
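A sketch of the snapshot-backup idea; the LVM commands run in the shell, only the MySQL side is shown, and the volume names and sizes are placeholders:

  -- 1. Briefly quiesce writes and flush tables to disk.
  FLUSH TABLES WITH READ LOCK;

  -- 2. Record the replication position the snapshot corresponds to.
  SHOW MASTER STATUS;

  -- 3. In another shell, take the snapshot, e.g.:
  --      lvcreate --snapshot --size 10G --name mysql_snap /dev/vg0/mysql

  -- 4. Release the lock; the write stall lasts only a few seconds.
  UNLOCK TABLES;

  -- 5. Mount the snapshot, copy the files off, then remove the snapshot.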
Any questions?
Discussion!
