scaling out with open source
Batara Kesuma
mixi, Inc.
bkesuma@mixi.co.jp
Introduction
•Batara Kesuma
•CTO of mixi, Inc.
What is mixi?
•Social networking service
• Diary, community, message, review, photo
album, etc.
• Invitation only
•Largest and fastest growing
SNS in Japan
[Screenshot: mixi home page. Latest information: friends' new diaries, comment history, community topics, friends' new reviews, friends' new albums. Also shown: friends, my latest diaries and reviews]
[Chart: Google Japan, mixi and Amazon Japan compared]
User growth in 2 years
[Chart: number of users (axis from 0 to 3,500,000) over time, 04/03 to 06/03]
Our technology
solutions
The technology behind
•Linux 2.6
•Apache 2.0
•MySQL
•Perl 5.8
•memcached
•Squid
[Diagram: requests pass through mod_proxy to mod_perl and to images; memcached holds hot objects]
[Diagram: MySQL replication. Write queries go to the MASTER and are replicated to the SLAVES. Labels: 50 writes/s on every server; 100, 50 and 25 reads/s per server]
Some statistics
•Diary related tables
• Read 85%
• Write 15%
[Diagram: one DB holding diary tables and other tables is split into DB 1 and DB 2]
Vertical partition
•Too many tables to deal with at
one time
•The transition in splitting gets
complex and difficult
Horizontal partition
[Diagram: message tables and diary tables move out of the OLD DB, each into its own NEW DB]
$dbh = $db->load_dbh();                    # OLD DB
$dbh = $db->load_dbh(type => "message");   # NEW DB with message tables
$dbh = $db->load_dbh(type => "diary");     # NEW DB with diary tables
Also called level 1 partitioning within mixi
Partition map for level 1
•Small and static
•Just put it in a configuration file
•For example:
$DB_DIARY = 'DBI:mysql:host=db1;database=diary';
$DB_MESSAGE = 'DBI:mysql:host=db2;database=message';
...
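A minimal sketch of how a static level 1 partition map could drive a `load_dbh`-style lookup. The hash layout, the `dsn_for` helper and the `db0` fallback host are assumptions made for illustration; only the two DSN strings come from the slide.

```perl
use strict;
use warnings;

# Level 1 partition map: table type -> DSN. Small and static, so a plain
# hash in a configuration module is enough. The two DSNs are from the
# slide; the "db0" DSN for the old DB is a made-up placeholder.
my %DSN_FOR = (
    diary   => 'DBI:mysql:host=db1;database=diary',
    message => 'DBI:mysql:host=db2;database=message',
);
my $DEFAULT_DSN = 'DBI:mysql:host=db0;database=mixi';   # hypothetical old DB

# Return the DSN for a table type, falling back to the old DB when the
# type is missing or has not been split out yet.
sub dsn_for {
    my (%args) = @_;
    my $type = $args{type};
    return $DEFAULT_DSN unless defined $type && exists $DSN_FOR{$type};
    return $DSN_FOR{$type};
}

print dsn_for(type => 'message'), "\n";
print dsn_for(), "\n";
```

A real `load_dbh` would presumably pass the chosen DSN on to `DBI->connect`; the sketch stops at the lookup so that it runs without a database.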
Easy transition
1 Writes to both DBs (mod_perl WRITEs to OLD DB and NEW DB)
2 Copies in background (SELECT from OLD DB, INSERT IGNORE into NEW DB)
3 Shifts reads (READs move from OLD DB to NEW DB)
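The three transition steps (write to both DBs, copy in background, shift reads) can be simulated without a database. In this sketch two hashes stand in for the OLD DB and NEW DB, and every write is treated as a whole-row write; all names are hypothetical. It is meant only to show why an INSERT IGNORE style copy does not clobber rows that were already written to the new DB.

```perl
use strict;
use warnings;

# Two hashes stand in for the OLD DB and NEW DB tables (id => row).
my %old_db = (1 => 'hello', 2 => 'world');
my %new_db;

# Step 1: every write goes to both DBs.
sub write_row {
    my ($id, $row) = @_;
    $old_db{$id} = $row;
    $new_db{$id} = $row;
}

# Step 2: background copy. Like INSERT IGNORE, a row that already exists
# in the new DB is skipped, so a newer dual write is never overwritten.
sub copy_in_background {
    for my $id (keys %old_db) {
        $new_db{$id} = $old_db{$id} unless exists $new_db{$id};
    }
}

# Step 3: reads move to the new DB once the copy has finished.
my $reads_from_new = 0;
sub read_row {
    my ($id) = @_;
    return $reads_from_new ? $new_db{$id} : $old_db{$id};
}

write_row(2, 'world, edited');   # arrives while the copy is still pending
copy_in_background();
$reads_from_new = 1;
print read_row(1), "\n";
print read_row(2), "\n";
```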
Problems with level 1
•Cannot use JOIN anymore
• Use FEDERATED tables (available from MySQL 5)
• Or do SELECT twice, which is faster than
using FEDERATED tables
• If table is small, just duplicate it
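"Doing SELECT twice" amounts to joining two result sets in the application. The sketch below uses in-memory data in place of the two queries; the column names (`diary_id`, `member_id`, `nickname`) and the sample rows are invented for illustration.

```perl
use strict;
use warnings;

# Result of the first SELECT (e.g. from the diary DB): diary rows.
my @diaries = (
    { diary_id => 10, member_id => 1, title => 'First post' },
    { diary_id => 11, member_id => 2, title => 'Lunch'      },
    { diary_id => 12, member_id => 1, title => 'Weekend'    },
);

# Collect the distinct member ids needed for the second SELECT,
# e.g. "... WHERE member_id IN (1, 2)" against the other DB.
my %seen;
my @member_ids = grep { !$seen{$_}++ } map { $_->{member_id} } @diaries;

# Result of the second SELECT, keyed by member_id (made-up data).
my %nickname_for = (1 => 'alice', 2 => 'bob');

# Join the two result sets in the application instead of in SQL.
my @joined = map { +{ %$_, nickname => $nickname_for{ $_->{member_id} } } } @diaries;

printf "%s by %s\n", $_->{title}, $_->{nickname} for @joined;
```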
Next step
•When the new DB gets
overloaded
•We split the DB, yet again
•Get ready for level 2
Partitioning key
•user id, message id
•Choose wisely!
[Diagram: partitioning by user id versus by message id, shown for user A and user B]
Level 2 partition
[Diagram: message tables in the LEVEL 1 DB are split across nodes by user (user A, user B, user C, user D); step 2: connects to node, e.g. NODE 3]
Manager based
•Pros:
• Easy to manage
• Add a new node, move data between
nodes
•Cons:
• Adds 1 extra query per request to look up
the partition map
• Needs to send a request to the manager
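A sketch of the manager-based approach, with a hash standing in for the manager's partition map. The user ids, node numbers and helper names are hypothetical. It illustrates both sides of the trade-off: moving a user is a one-line map update, but every request pays for one more lookup.

```perl
use strict;
use warnings;

# Stand-in for the manager DB: partition map from user id to node id.
my %node_of_user = (101 => 1, 102 => 2, 103 => 3);
my $lookups = 0;   # counts the extra partition-map queries

# Ask the manager which node holds this user's data.
sub lookup_node {
    my ($user_id) = @_;
    $lookups++;
    return $node_of_user{$user_id};
}

# Moving a user to another node is a single update of the map
# (after the data itself has been copied).
sub move_user {
    my ($user_id, $new_node) = @_;
    $node_of_user{$user_id} = $new_node;
}

print "user 103 is on node ", lookup_node(103), "\n";
move_user(103, 4);               # e.g. after adding a fourth node
print "user 103 is on node ", lookup_node(103), "\n";
print "partition map lookups: $lookups\n";
```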
Algorithm based
•Pros:
• Application servers can compute node id
by themselves
• Bypass the connection to the manager
•Cons:
• Difficult to manage
• Adding new nodes is tricky
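With the algorithm-based approach the application server derives the node from the key alone. The sketch uses the modulo formula that appears elsewhere in the deck, `(member_id % number_of_nodes) + 1`; the function name is an assumption.

```perl
use strict;
use warnings;

# Node id computed from the key alone; nodes are numbered from 1.
sub node_id {
    my ($member_id, $number_of_nodes) = @_;
    return ($member_id % $number_of_nodes) + 1;
}

for my $member_id (1 .. 6) {
    printf "member %d -> node %d\n", $member_id, node_id($member_id, 2);
}
```

No manager lookup is needed, but the result depends on the number of nodes, which is what makes adding nodes tricky.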
Adding nodes is tricky
old_node_id=(member_id%2)+1   (number of nodes = 2)
new_node_id=(member_id%4)+1   (number of nodes = 4)
1 Adds a new application logic
2 Writes to both DBs if node_id is different
3 Copies in background
4 Shifts reads
[Diagram: mod_perl READs, WRITEs and COPYs across NODE 1, NODE 2, NODE 3 and NODE 4]
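To see what going from 2 to 4 nodes means for the data, the two formulas can be compared directly. The helper `nodes_to_write` is a hypothetical rendering of step 2 (write to both DBs only if the node id differs); the range of 1000 member ids is arbitrary.

```perl
use strict;
use warnings;

sub old_node_id { my ($member_id) = @_; return ($member_id % 2) + 1 }
sub new_node_id { my ($member_id) = @_; return ($member_id % 4) + 1 }

# Step 2: a write goes to both nodes only if the node id is different.
sub nodes_to_write {
    my ($member_id) = @_;
    my ($old, $new) = (old_node_id($member_id), new_node_id($member_id));
    return $old == $new ? ($old) : ($old, $new);
}

# Count how many of the first 1000 member ids change node.
my $moved = grep { old_node_id($_) != new_node_id($_) } 1 .. 1000;

print "member 5 writes to: @{[ nodes_to_write(5) ]}\n";
print "member 6 writes to: @{[ nodes_to_write(6) ]}\n";
print "$moved of 1000 members move\n";
```

With these two formulas, every member whose id leaves a remainder of 2 or 3 when divided by 4 changes node, so half of the data has to be copied in the background.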
Problems with level 2
• Too many connections to different DBs
• Fortunately, on mixi, the majority are small data sets
• Cache them all by using distributed memory caching
• We rarely hit the DB
• Average page load time is about 0.02 sec*
* depending on data sets average load time may vary
[Diagram: member tables on NODE 1, NODE 2 and NODE 3; community tables on NODE 1 and NODE 2]
Caching
•memcached
• Also used by LiveJournal, Slashdot, etc.
•Install the server on the mod_perl
machines
•39 machines x 2 GB memory
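A sketch of the read path that lets most requests skip the DB: look in the cache first, and only on a miss query the DB and store the result. A plain hash stands in for memcached so the example runs anywhere; the key format and the `load_member_from_db` helper are assumptions.

```perl
use strict;
use warnings;

my %cache;          # stands in for memcached
my $db_hits = 0;

# Pretend DB query; counts how often the DB is actually hit.
sub load_member_from_db {
    my ($member_id) = @_;
    $db_hits++;
    return { member_id => $member_id, nickname => "member$member_id" };
}

# Look in the cache first; on a miss, read the DB and store the result.
sub get_member {
    my ($member_id) = @_;
    my $key = "member:$member_id";
    return $cache{$key} if exists $cache{$key};
    my $row = load_member_from_db($member_id);
    $cache{$key} = $row;
    return $row;
}

get_member(42) for 1 .. 100;
print "DB hits for 100 page views: $db_hits\n";
```

With a real client such as `Cache::Memcached`, the hash read and write would become `get` and `set` calls on the client object.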
Summary of DB partitioning
•Level 1 partition (split by table
types)
•Level 2 partition (split by
partitioning key)
• Manager based
• Algorithm based
Summary of DB partitioning
[Diagram: 1 Split by table types: other tables stay in the OLD DB, message tables move out; the message tables are then split again into LEVEL 2 nodes by user (user A, user B, user C)]
Image Servers
Statistics
•Total size is more than 8 TB of
storage
•Growth rate is about 23 GB / day
•We use MySQL to store
metadata only
Two types of images
•Frequently accessed images
• Number of image files is relatively small
(about a few million files)
• For example, user profile photos,
community logos
•Rarely accessed images
• Hundreds of millions of image files
• Diary photos, album photos, etc
Frequently accessed images
•Few hundred GBs of files
•Distributed via FTP and
Squid
•Third party Content Delivery
Network
Frequently accessed images
[Diagram: Squid and CDN in front of storage (sto1.mixi.jp, sto2.mixi.jp); step 2: pull images from storage]
[Diagram: uploading an image, with mod_perl, the MANAGER DB and Storage servers sto1.mixi.jp, sto2.mixi.jp, sto3.mixi.jp and sto4.mixi.jp]
1 Assigns an id for an image file (abc.gif)
2 Arranges a pair of area_id (area_id = 1,2)
3 Uploads image to storage
Viewing rarely accessed images
1 Asks for view_diary.pl (User to mod_perl)
2 Detects abc.gif in view_diary.pl
3 Asks for area_id (mod_perl to MANAGER DB)
4 Returns area_id (abc.gif: area_id = 1)
5 Creates image URL
6 Returns view_diary.pl and URL for abc.gif
7 Asks for abc.gif (User to Storage)
8 Returns abc.gif
[Diagram: User, mod_perl, MANAGER DB, and Storage servers sto1.mixi.jp, sto2.mixi.jp, sto3.mixi.jp and sto4.mixi.jp]
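The lookup in the middle of this flow (ask the manager for the area_id, then create the image URL) might look roughly like the sketch below. The slides do not say which storage host serves which area or what the URL path looks like, so the `%host_of_area` mapping and the URL format are assumptions; only the host names, `abc.gif` and `area_id = 1` appear in the deck.

```perl
use strict;
use warnings;

# Stand-in for the manager DB: image file -> area_id.
my %area_id_of = ('abc.gif' => 1);

# Hypothetical mapping from area_id to a storage host; the slides do not
# say which host serves which area.
my %host_of_area = (1 => 'sto1.mixi.jp', 2 => 'sto2.mixi.jp');

# Ask for the area_id, then create the image URL.
# Returns nothing if the manager does not know the file.
sub image_url {
    my ($file) = @_;
    my $area_id = $area_id_of{$file};
    return unless defined $area_id;
    return "http://$host_of_area{$area_id}/$file";   # assumed URL layout
}

print image_url('abc.gif'), "\n";
```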
To do
•Try MySQL Cluster
•Try to implement a better
algorithm
• Consistent hashing?
• Linear hashing?
•Level 3 partitioning?
• Split again by timestamp?
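For readers unfamiliar with the consistent hashing idea listed above, here is a generic textbook-style sketch, not anything mixi is known to run. Each node is placed at several points on a ring, and a key belongs to the first node point at or after its own hash. The node names, the 100 points per node and the use of MD5 are all arbitrary choices made for illustration.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Map a string to a 32-bit point on the ring.
sub point { my ($s) = @_; return hex(substr(md5_hex($s), 0, 8)) }

# Build a ring: each node is placed at several points (virtual nodes).
sub build_ring {
    my (@nodes) = @_;
    my %ring;
    for my $node (@nodes) {
        $ring{ point("$node#$_") } = $node for 1 .. 100;
    }
    return \%ring;
}

# A key belongs to the first node point at or after its own point,
# wrapping around to the smallest point at the end of the ring.
sub node_for {
    my ($ring, $key) = @_;
    my $p = point($key);
    my @points = sort { $a <=> $b } keys %$ring;
    for my $q (@points) {
        return $ring->{$q} if $q >= $p;
    }
    return $ring->{ $points[0] };
}

my $four = build_ring(map { "node$_" } 1 .. 4);
my $five = build_ring(map { "node$_" } 1 .. 5);
my $moved = grep { node_for($four, $_) ne node_for($five, $_) } 1 .. 1000;
print "$moved of 1000 members move when growing from 4 to 5 nodes\n";
```

Unlike the modulo scheme, the only keys that change owner here are the ones taken over by the new node; the existing nodes never exchange data with each other.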
Questions?
Thank you
•Further questions to
bkesuma@mixi.co.jp
•We are hiring :)
•Have a nice day!