Beruflich Dokumente
Kultur Dokumente
Scaling Storytime
April 2007
Brad Fitzpatrick
brad@danga.com
http://danga.com/words/
1
This Talk’s Gracious Sponsor
http://danga.com/words/
2
The plan...
BIG-IP
perlbal (httpd/proxy) Global Database
bigip1 mod_perl
bigip2 proxy1 master_a master_b
web1
proxy2 web2
proxy3 Memcached
web3 slave1 slave2 ... slave5
djabberd proxy4 mc1
web4
djabberd proxy5 ... mc2 User DB Cluster 1
djabberd
webN mc3 uc1a uc1b
mc4 User DB Cluster 2
... uc2a uc2b
gearmand
Mogile Storage Nodes gearmand1 mcN User DB Cluster 3
sto1 sto2 gearmandN uc3a uc3b
Mogile Trackers
... sto8
tracker1 tracker3 User DB Cluster N
ucNa ucNb
MogileFS Database “workers”
gearwrkN Job Queues (xN)
mog_a mog_b theschwkN jqNa jqNb
slave1 slaveN
http://danga.com/words/
4
LiveJournal Overview
http://danga.com/words/
5
Stuff we've built...
memcached djabberd
− distributed caching − the super-extensible
MogileFS everything-is-a-plugin
− distributed filesystem mod_perl/qpsmtpd of
Perlbal XMPP/Jabber servers
− HTTP load balancer, web .....
server, swiss-army knife OpenID
gearman federated identity
− LB/HA/coalescing low- protocol
latency function call
“router”
TheSchwartz
− reliable, async job
dispatch system
http://danga.com/words/
6
“Uh, why?”
http://danga.com/words/
7
Yes.
We build wheels.
− ... when existing suck,
− ... or don’t exist.
http://danga.com/words/
8
Yes.
We build wheels.
− ... when existing suck,
− ... or don’t exist.
http://danga.com/words/
8
Yes.
We build wheels.
− ... when existing suck,
− ... or don’t exist.
http://danga.com/words/
8
Yes.
We build wheels.
− ... when existing suck,
− ... or don’t exist.
http://danga.com/words/
8
Part I
Quick Scaling History
http://danga.com/words/
9
Quick Scaling History
1 server to hundreds...
you can do all this with just 1 server!
− then you’re ready for tons of servers, without pain
− don’t repeat our scaling mistakes
http://danga.com/words/
10
Terminology
Scaling:
− NOT: “How fast?”
− But: “When you add twice as many servers, are you
twice as fast (or have twice the capacity)?”
Fast still matters,
− 2x faster: 50 servers instead of 100...
that’s some good money
− but that’s not what scaling is.
http://danga.com/words/
11
Terminology
“Cluster”
− varying definitions... basically:
− making a bunch of computers work together for
some purpose
− what purpose?
load balancing (LB),
high availablility (HA)
Load Balancing?
High Availability?
Venn Diagram time!
− I love Venn Diagrams
http://danga.com/words/
12
LB vs. HA
http://danga.com/words/
13
LB vs. HA
Load High
Balancing Availability
LVS heartbeat,
http
round-robin DNS, cold/warm/hot spare,
reverse proxy,
data partitioning, ...
wackamole,
.... ...
http://danga.com/words/
14
Favorite Venn Diagram
http://danga.com/words/
15
My Hiring Venn Diagram
Talk Think
Do
http://danga.com/words/
16
enough Venn Diagrams!
http://danga.com/words/
17
One Server
Simple:
mysql
apache
http://danga.com/words/
18
Two Servers
apache
mysql
http://danga.com/words/
19
Two Servers - Problems
http://danga.com/words/
20
Four Servers
3 webs, 1 db
Now we need to load-balance!
LVS, mod_backhand, whackamole, BIG-IP,
Alteon, pound, Perlbal, etc, etc..
− ...
http://danga.com/words/
21
Four Servers - Problems
http://danga.com/words/
22
Five Servers
introducing MySQL replication
We buy a new DB
MySQL replication
Writes to DB (master)
Reads from both
http://danga.com/words/
23
More Servers
Chaos!
http://danga.com/words/
24
net.
BIG-IP
bigip1
bigip2
mod_proxy
mod_perl
proxy1
web1
proxy2
web2
proxy3 Global Database
web3
web4 master
...
web12
slave1 slave2 ... slave6
http://danga.com/words/
25
Problems with Architecture
or,
“This don't scale...”
DB master is SPOF
Adding slaves doesn't scale
well...
− only spreads reads, not writes!
500 reads/s
250 reads/s 250 reads/s
http://danga.com/words/
26
Eventually...
http://danga.com/words/
27
Spreading Writes
http://danga.com/words/
28
Partition your data!
to join in app
Each user assigned to a
cluster number
Each cluster has multiple
machines
− writes self-contained in
cluster (writing to 2-3
machines, not 6)
http://danga.com/words/
29
User Clusters
http://danga.com/words/
30
User Clusters
SELECT userid,
clusterid FROM
user WHERE
user='bob'
http://danga.com/words/
30
User Clusters
SELECT userid,
clusterid FROM
user WHERE
user='bob'
userid: 839
clusterid: 2
http://danga.com/words/
30
User Clusters
userid: 839
clusterid: 2
http://danga.com/words/
30
User Clusters
OMG i like
totally hate
my parents
userid: 839 they just
clusterid: 2 dont
understand me
and i h8 the
world omg lol
rofl *! :^-
^^;
add me as a
friend!!!
http://danga.com/words/
30
Details
per-user numberspaces
− don't use AUTO_INCREMENT
− PRIMARY KEY (user_id, thing_id)
− so:
Can move/upgrade users 1-at-a-time:
− per-user “readonly” flag
− per-user “schema_ver” property
− user-moving harness
job server that coordinates, distributed long-
http://danga.com/words/
31
Shared Storage
(SAN, SCSI, DRBD...)
innodb repairs,
http://danga.com/words/
32
Shared Storage: DRBD
http://danga.com/words/
35
Caching
caching's key to performance
− store result of a computation or I/O for quicker future
access (classic space/time trade-off)
Where to cache?
− mod_perl/php internal caching
memory waste (address space per apache child)
− shared memory
limited to single machine, same with Java/C#/
Mono
− MySQL query cache
flushed per update, small max size
− HEAP tables
fixed length rows, small max size
http://danga.com/words/
36
memcached
http://www.danga.com/memcached/
http://danga.com/words/
38
Web Load Balancing
http://danga.com/words/
39
Perlbal
Perl
single threaded, async event-based
− uses epoll, kqueue, etc.
console / HTTP remote management
− live config changes
handles dead nodes, smart balancing
multiple modes
− static webserver
− reverse proxy
− plug-ins (Javascript message bus.....)
plug-ins
− GIF/PNG altering, ....
http://danga.com/words/
40
Perlbal: Persistent Connections
http://danga.com/words/
41
Perlbal: can verify new connections
#include <sys/socket.h>
int listen(int sockfd, int backlog);
http://danga.com/words/
42
Perlbal: multiple queues
http://danga.com/words/
43
Perlbal: cooperative large file serving
http://danga.com/words/
44
Perlbal: cooperative large file serving
internal redirects
− mod_perl can pass off
serving a big file to
Perlbal
either from disk, or from
other URL(s)
− client sees no HTTP
redirect
− “Friends-only” images
one, clean URL
mod_perl does auth, and is
done.
perlbal serves.
http://danga.com/words/
45
Internal redirect picture
http://danga.com/words/
46
MogileFS
http://danga.com/words/
47
oMgFileS
http://danga.com/words/
48
MogileFS
http://danga.com/words/
49
MogileFS: Why
http://danga.com/words/
50
MogileFS: Main Ideas
http://danga.com/words/
51
MogileFS components
clients
mogilefsd (does all real work)
database(s) (MySQL, .... abstract)
storage nodes
http://danga.com/words/
52
MogileFS: Clients
http://danga.com/words/
53
MogileFS: Tracker
(mogilefsd)
The Meat
event-based message bus
load balances client requests, world info
process manager
− heartbeats/watchdog, respawner, ...
Child processes:
− ~30x client interface (“query” process)
interfaces client protocol w/ db(s), etc
− ~5x replicate
− ~2x delete
− ~1x fsck, reap, monitor, ..., ...
http://danga.com/words/
54
Trackers' Database(s)
http://danga.com/words/
55
MogileFS storage nodes
(mogstored)
HTTP transport
− GET
− PUT
− DELETE
mogstored listens on 2 ports...
HTTP. --server={perlbal,lighttpd,...}
− management/status:
iostat interface, AIO control, multi-stat() (for faster
fsck)
files on filesystem, not DB
− sendfile()! future: splice()
− filesystem can be any filesystem
http://danga.com/words/
56
Large file
GET
request
http://danga.com/words/
57
Auth: complex, but quick
Large file
GET
request
http://danga.com/words/
57
Spoonfeeding:
slow, but event-
based
Large file
GET
request
http://danga.com/words/
57
And the reverse...
http://danga.com/words/
58
Gearman
http://danga.com/words/
59
manaGer
http://danga.com/words/
60
Manager
dispatches work,
but doesn't do anything useful itself. :)
http://danga.com/words/
61
Gearman
different languages,
db connection pooling,
...
...
http://danga.com/words/
62
Gearman Pieces
gearmand
− the function call router
− event-loop (epoll, kqueue, etc)
workers.
− Gearman::Worker – perl
− register/heartbeat/grab jobs
clients
− Gearman::Client[::Async] -- perl
− also start of Ruby client recently
− submit jobs to gearmand
− opaque (to server) “funcname” string
− optional opaque (to server) “args” string
− opt coallescing key
http://danga.com/words/
63
Gearman Picture
http://danga.com/words/
64
Gearman Picture
http://danga.com/words/
64
Gearman Picture
Worker Worker
http://danga.com/words/
64
Gearman Picture
can_do(“funcA”)
can_do(“funcA”)
can_do(“funcB”)
Worker Worker
http://danga.com/words/
64
Gearman Picture
can_do(“funcA”)
can_do(“funcA”)
can_do(“funcB”)
http://danga.com/words/
64
Gearman Picture
call(“funcA”)
can_do(“funcA”)
can_do(“funcA”)
can_do(“funcB”)
http://danga.com/words/
64
Gearman Picture
call(“funcA”)
can_do(“funcA”)
can_do(“funcA”)
can_do(“funcB”)
http://danga.com/words/
64
Gearman Picture
call(“funcA”)
can_do(“funcA”)
call(“funcB”)
can_do(“funcA”)
can_do(“funcB”)
http://danga.com/words/
64
Gearman Protocol
http://danga.com/words/
65
Gearman Uses
http://danga.com/words/
66
Gearman Uses, cont..
http://danga.com/words/
67
Gearman Misc
Guarantees:
− none! hah! :)
please wait for your results.
goes away.
No policy/conventions in gearmand
− all policy/meaning between clients <-> workers
...
http://danga.com/words/
68
Sick Gearman Demo
$ ./dmap.pl
dmap says:
1 = sammy
2 = papag
3 = sammy
4 = papag
5 = sammy
6 = papag
7 = sammy
8 = papag
9 = sammy
10 = papag
http://danga.com/words/
69
Gearman Summary
Gearman is sexy.
− especially the coalescing
Check it out!
− it's kinda our little unadvertised secret
oh crap, did I leak the secret?
http://danga.com/words/
70
TheSchwartz
http://danga.com/words/
71
TheSchwartz
Like gearman:
− job queuing system
− opaque function name
− opaque “args” blob
− clients are either:
submitting jobs
workers
But not like gearman:
− Reliable job queueing system
− not low latency
− fire & forget (as opposed to gearman, where you wait for
result)
currently library, not network service
http://danga.com/words/
72
TheSchwartz Primitives
insert job
“grab” job (atomic grab)
− for 'n' seconds.
mark job done
temp fail job for future
− optional notes, rescheduling details..
replace job with 1+ other jobs
− atomic.
...
http://danga.com/words/
73
TheSchwartz
backing store:
− a database
− uses Data::ObjectDriver
MySQL,
Postgres,
SQLite,
....
but HA: you tell it @dbs, and it finds one to
insert job into
− likewise, workers foreach (@dbs) to do work
http://danga.com/words/
74
TheSchwartz uses
http://danga.com/words/
75
gearmand + TheSchwartz
http://danga.com/words/
76
djabberd
http://danga.com/words/
77
djabberd
http://danga.com/words/
78
djabberd hooks
everything is a hook
− not just auth! like, everything.
− auth,
− roster,
− vcard info (avatars),
− presence,
− delivery,
− inter-node cluster delivery,
− ala mod_perl, qpsmtpd, etc.
async hooks
− hooks phases can take as long as they want before
they answer, or decline to next phase in hook chain...
− we use Gearman::Client::Async
http://danga.com/words/
79
Thank you!
Questions to:
brad@danga.com
Software:
http://danga.com/
http://code.sixapart.com/
http://danga.com/words/
80
Bonus Slides
if extra time
http://danga.com/words/
81
Data Integrity
http://danga.com/words/
82
Persistent Connection Woes
http://danga.com/words/
83