
Memcached Design on High Performance RDMA Capable Interconnects
Jithin Jose, Hari Subramoni, Miao Luo, Minjia Zhang,
Jian Huang, Md. Wasi-ur-Rahman, Nusrat S. Islam,
Xiangyong Ouyang, Hao Wang, Sayantan Sur & D. K. Panda

Network-Based Computing Laboratory
Department of Computer Science and Engineering
The Ohio State University, USA

Outline
Introduction
Overview of Memcached
Modern High Performance Interconnects
Unified Communication Runtime (UCR)
Memcached Design using UCR
Performance Evaluation
Conclusion & Future Work

Introduction
Tremendous increase in interest in interactive websites (social networking, e-commerce, etc.)
Dynamic data is stored in databases for future retrieval and analysis
Database lookups are expensive
Memcached -- a distributed memory caching layer, implemented using traditional BSD sockets
Socket interface provides portability, but entails additional processing and multiple message copies
High-Performance Computing (HPC) has adopted advanced interconnects (e.g. InfiniBand, 10 Gigabit Ethernet/iWARP, RoCE)
Low latency, high bandwidth, low CPU overhead
Many machines in the Top500 list (http://www.top500.org)



Outline
Introduction
Overview of Memcached
Modern High Performance Interconnects
Unified Communication Runtime (UCR)
Memcached Design using UCR
Performance Evaluation
Conclusion & Future Work

Memcached Overview
[Figure: Memcached deployment -- Internet traffic reaches proxy servers (Memcached clients), which access Memcached servers and database servers over system area networks]
Memcached provides scalable distributed caching
Spare memory in data-center servers can be aggregated to speed up lookups
Basically a key-value distributed memory store
Keys can be any character strings, typically MD5 sums or hashes
Typically used to cache database queries, results of API calls, or webpage rendering elements
Scalable model, but typical usage is very network intensive -- performance is directly related to that of the underlying networking technology (a minimal client-side sketch follows)
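To make the key-value usage pattern concrete, the following minimal sketch stores and retrieves a value with the libmemcached C client (the client library used later in the evaluation, version 0.45). The server address, key, and value are illustrative assumptions, error handling is kept to a minimum, and type names follow recent libmemcached headers, so they may differ slightly in older releases.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <libmemcached/memcached.h>

int main(void)
{
    memcached_return_t rc;
    memcached_st *memc = memcached_create(NULL);

    /* Illustrative server address; a real deployment would push a pool of servers. */
    memcached_server_st *servers =
        memcached_server_list_append(NULL, "127.0.0.1", 11211, &rc);
    memcached_server_push(memc, servers);

    /* Cache a (hypothetical) database query result under a string key. */
    const char *key   = "user:1042:profile";
    const char *value = "{\"name\": \"alice\", \"karma\": 42}";
    rc = memcached_set(memc, key, strlen(key), value, strlen(value),
                       (time_t)0 /* no expiry */, (uint32_t)0 /* flags */);

    /* Subsequent lookups hit the cache instead of the database. */
    size_t value_len;
    uint32_t flags;
    char *cached = memcached_get(memc, key, strlen(key), &value_len, &flags, &rc);
    if (rc == MEMCACHED_SUCCESS && cached != NULL) {
        printf("cached: %.*s\n", (int)value_len, cached);
        free(cached);
    }

    memcached_server_list_free(servers);
    memcached_free(memc);
    return 0;
}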

Outline
Introduction
Overview of Memcached
Modern High Performance Interconnects
Unified Communication Runtime (UCR)
Memcached Design using UCR
Performance Evaluation
Conclusion & Future Work

Modern High Performance Interconnects


[Figure: Communication stacks on modern interconnects -- applications use either the Sockets interface or IB Verbs; protocol implementations include kernel-space TCP/IP (1/10 GigE Ethernet driver, IPoIB), hardware-offloaded TCP/IP (10 GigE-TOE), SDP, and user-space RDMA; the corresponding network adapters and switches cover 1/10 GigE, 10 GigE, and InfiniBand]


Problem Statement
High-performance RDMA capable interconnects have emerged in the scientific computation domain
Applications using Memcached still rely on sockets
Performance of Memcached is critical to most of its deployments
Can Memcached be re-designed from the ground up to utilize RDMA capable networks?


A New Approach using Unified Communication Runtime (UCR)
[Figure: Current approach -- Application over Sockets over 1/10 GigE networks; our approach -- Application over UCR over IB Verbs on RDMA capable networks (IB, 10GE, iWARP, RoCE, ...)]

Sockets are not designed for high performance
Stream semantics are often a mismatch for upper layers (Memcached, Hadoop)
Multiple copies can be involved (see the verbs sketch below)
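To make the contrast concrete, here is a minimal sketch of posting a zero-copy RDMA write with the OpenFabrics verbs API -- the kind of operation a UCR-based design builds on instead of socket send/recv. It assumes a queue pair, a registered memory region, and the peer's remote address/rkey have already been exchanged during connection setup; those setup steps are omitted.

#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Post a zero-copy RDMA write: the adapter moves `len` bytes from the locally
 * registered buffer directly into the remote buffer, with no intermediate
 * copies and no kernel involvement on the data path. */
static int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                           void *local_buf, size_t len,
                           uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };

    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id               = 1;
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.send_flags          = IBV_SEND_SIGNALED;   /* request a completion */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);       /* 0 on success */
}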


Outline
Introduction & Motivation
Overview of Memcached
Modern High Performance Interconnects
Unified Communication Runtime (UCR)
Memcached Design using UCR
Performance Evaluation
Conclusion & Future Work

Unified Communication Runtime (UCR)
Initially proposed to unify communication runtimes of different parallel programming models
J. Jose, M. Luo, S. Sur and D. K. Panda, Unifying UPC and MPI Runtimes: Experience with MVAPICH (PGAS '10)
Design of UCR evolved from the MVAPICH/MVAPICH2 software stacks (http://mvapich.cse.ohio-state.edu/)
UCR provides interfaces for Active Messages as well as one-sided put/get operations
Enhanced APIs to support cloud computing applications
Several enhancements in UCR: endpoint-based design, revamped active-message API, fault tolerance, and synchronization with timeouts
Communication is based on endpoints, analogous to sockets



Active Messaging in UCR
Active messages have proven to be very powerful in many environments
GASNet Project (UC Berkeley), MPI design using LAPI (IBM), etc.
We introduce active messages into the data-center domain
An active message consists of two parts: header and data
When the message arrives at the target, the header handler is run
The header handler identifies the destination buffer for the data
Data is put into the destination buffer
A completion handler is run afterwards (optional)
Special flags indicate local & remote completions (optional); a hypothetical handler sketch follows
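The UCR API itself is not listed on the slides, so the fragment below is only a hypothetical illustration of the header-handler / completion-handler flow described above; the type and function names are invented for this sketch and are not the actual UCR interface.

#include <stdint.h>
#include <stdlib.h>

/* Hypothetical handler types mirroring the flow above: the header handler
 * chooses the destination buffer, the runtime deposits the data there (via
 * RDMA for large messages), and an optional completion handler runs last. */
typedef void *(*am_header_handler_t)(const void *header, size_t header_len,
                                     size_t data_len);
typedef void  (*am_completion_handler_t)(void *dest_buf, size_t data_len);

/* Example header for a Memcached-style request (illustrative only). */
struct kv_header {
    uint32_t opcode;      /* e.g. GET or SET */
    uint32_t key_len;
};

/* Header handler: runs at the target when the header arrives and returns the
 * buffer into which the incoming data should be placed. */
static void *kv_header_handler(const void *header, size_t header_len,
                               size_t data_len)
{
    (void)header; (void)header_len;
    return malloc(data_len);          /* destination buffer for the data */
}

/* Completion handler (optional): runs once the data is in place, e.g. to
 * insert the item into the hash table and set the remote-completion flag. */
static void kv_completion_handler(void *dest_buf, size_t data_len)
{
    (void)dest_buf; (void)data_len;
}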

Active Messaging in UCR (contd.)
[Figure: Active message flows between origin and target. General active message functionality: the header is sent first, the header handler runs at the target, the data is transferred via RDMA into the destination buffer, the completion handler runs, and the local/remote completion flags are set. Optimized short active message functionality: header and data are sent together, the header handler copies the data, the completion handler runs, and the completion flags are set.]


Outline
Introduction & Motivation
Overview of Memcached
Modern High Performance Interconnects
Unified Communication Runtime (UCR)
Memcached Design using UCR
Performance Evaluation
Conclusion & Future Work

Memcached Design using UCR


[Figure: Hybrid Memcached server design -- a master thread assigns Sockets clients to Sockets worker threads and RDMA clients to Verbs worker threads; all worker threads operate on the shared data (memory slabs and items)]

Server and client perform a negotiation protocol
The master thread assigns clients to the appropriate worker thread
Once a client is assigned a verbs worker thread, it communicates directly with that thread and is bound to it
All other Memcached data structures are shared among RDMA and Sockets worker threads (see the dispatch sketch below)
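A minimal sketch of the dispatch idea, assuming a pthreads-based server; the structures and field names here are hypothetical and only illustrate the rule that the master thread binds each client to a sockets or verbs worker of the matching transport, while the item store stays shared.

#include <pthread.h>

/* Hypothetical descriptors -- not the actual Memcached/UCR data structures. */
enum transport { TRANSPORT_SOCKETS, TRANSPORT_VERBS };

struct client {
    enum transport transport;   /* learned during the negotiation protocol */
    int            handle;      /* socket fd or UCR endpoint id */
};

struct worker {
    pthread_t      thread;
    enum transport transport;   /* sockets worker or verbs worker */
    /* per-worker queue of bound clients omitted for brevity */
};

/* Master thread: after negotiating the transport with a new client, hand it
 * to a worker of the matching type; from then on the client communicates
 * with that worker directly, and only the item store is shared. */
static struct worker *assign_client(struct worker *workers, int nworkers,
                                    const struct client *c)
{
    for (int i = 0; i < nworkers; i++) {
        if (workers[i].transport == c->transport)
            return &workers[i];   /* enqueue the client here (omitted) */
    }
    return NULL;
}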

Outline
Introduction & Motivation
Overview of Memcached
Modern High Performance Interconnects
Unified Communication Runtime (UCR)
Memcached Design using UCR
Performance Evaluation
Conclusion & Future Work

Experimental Setup
Used Two Clusters
Intel Clovertown
Each node has 8 processor cores on 2 Intel Xeon 2.33 GHz Quad-core CPUs,
6 GB main memory, 250 GB hard disk
Network: 1GigE, IPoIB, 10GigE TOE and IB (DDR)

Intel Westmere
Each node has 8 processor cores on 2 Intel Xeon 2.67 GHz Quad-core CPUs,
12 GB main memory, 160 GB hard disk
Network: 1GigE, IPoIB, and IB (QDR)

Memcached Software
Memcached Server: 1.4.5
Memcached Client: (libmemcached) 0.45



Memcached Get Latency (Small Message)


[Figure: Memcached Get latency for small messages (1 byte to 4 KB) -- Time (us) vs. Message Size for 1G, 10G-TOE, SDP, IPoIB, and OSU Design; left: Intel Clovertown Cluster (IB: DDR), right: Intel Westmere Cluster (IB: QDR)]

Memcached Get latency
4 bytes -- DDR: 6 us; QDR: 5 us
4K bytes -- DDR: 20 us; QDR: 12 us
Almost a factor of four improvement over 10GE (TOE) for 4 KB on the DDR cluster
Almost a factor of seven improvement over IPoIB for 4 KB on the QDR cluster

Memcached Get Latency (Large Message)


[Figure: Memcached Get latency for large messages (4 KB to 512 KB) -- Time (us) vs. Message Size for 1G, 10G-TOE, SDP, IPoIB, and OSU Design; left: Intel Clovertown Cluster (IB: DDR), right: Intel Westmere Cluster (IB: QDR)]

Memcached Get latency
8K bytes -- DDR: 17 us; QDR: 13 us
512K bytes -- DDR: 362 us; QDR: 94 us
Almost a factor of three improvement over 10GE (TOE) for 512 KB on the DDR cluster
Almost a factor of four improvement over IPoIB for 512 KB on the QDR cluster

Memcached Set Latency (Small Message)


[Figure: Memcached Set latency for small messages (1 byte to 4 KB) -- Time (us) vs. Message Size for 1G, 10G-TOE, SDP, IPoIB, and OSU Design; left: Intel Clovertown Cluster (IB: DDR), right: Intel Westmere Cluster (IB: QDR)]

Memcached Set latency
4 bytes -- DDR: 7 us; QDR: 5 us
4K bytes -- DDR: 15 us; QDR: 13 us
Almost a factor of four improvement over 10GE (TOE) for 4 KB on the DDR cluster
Almost a factor of six improvement over IPoIB for 4 KB on the QDR cluster

Memcached Set Latency (Large Message)


[Figure: Memcached Set latency for large messages (4 KB to 512 KB) -- Time (us) vs. Message Size for 1G, 10G-TOE, SDP, IPoIB, and OSU Design; Intel Clovertown Cluster (IB: DDR) and Intel Westmere Cluster (IB: QDR)]

Memcached Set latency
8K bytes -- DDR: 18 us; QDR: 15 us
512K bytes -- DDR: 375 us; QDR: 185 us
Almost a factor of two improvement over 10GE (TOE) for 512 KB on the DDR cluster
Almost a factor of three improvement over IPoIB for 512 KB on the QDR cluster

Memcached Latency (10% Set, 90% Get)
[Figure: Mixed-workload latency (10% Set, 90% Get; 1 byte to 4 KB) -- Time (us) vs. Message Size for 1G, 10G, SDP, IPoIB, and OSU Design; left: Intel Clovertown Cluster (IB: DDR), right: Intel Westmere Cluster (IB: QDR)]

Memcached latency (10% Set, 90% Get)
4 bytes -- DDR: 7 us; QDR: 5 us
4K bytes -- DDR: 15 us; QDR: 12 us
Almost a factor of four improvement over 10GE (TOE) for 4K bytes on the DDR cluster


Memcached Latency (50% Set, 50% Get)
[Figure: Mixed-workload latency (50% Set, 50% Get; 1 byte to 4 KB) -- Time (us) vs. Message Size for 1G, 10G-TOE, SDP, IPoIB, and OSU Design; left: Intel Clovertown Cluster (IB: DDR), right: Intel Westmere Cluster (IB: QDR)]

Memcached latency (50% Set, 50% Get)
4 bytes -- DDR: 7 us; QDR: 5 us
4K bytes -- DDR: 15 us; QDR: 12 us
Almost a factor of four improvement over 10GE (TOE) for 4K bytes on the DDR cluster


Memcached Get TPS (4 bytes)


[Figure: Memcached Get transactions per second (TPS, in thousands) for 4-byte messages with 8 and 16 clients -- SDP, IPoIB, 10G-TOE, and OSU Design; left: Intel Clovertown Cluster (IB: DDR), right: Intel Westmere Cluster (IB: QDR)]

Memcached Get transactions per second for 4 bytes
On IB DDR: about 600K TPS for 16 clients
On IB QDR: 1.9M TPS for 16 clients
Almost a factor of six improvement over 10GE (TOE)
Significant improvement with native IB QDR compared to SDP and IPoIB

Memcached Get TPS (4KB)


[Figure: Memcached Get transactions per second (TPS) for 4 KB messages with 8 and 16 clients -- SDP, IPoIB, 10G-TOE, and OSU Design; Intel Clovertown Cluster (IB: DDR) and Intel Westmere Cluster (IB: QDR)]

Memcached Get transactions per second for 4K bytes
On IB DDR: about 330K TPS for 16 clients
On IB QDR: 842K TPS for 16 clients
Almost a factor of four improvement over 10GE (TOE)
Significant improvement with native IB QDR


Outline
Introduction & Motivation
Overview of Memcached
Modern High Performance Interconnects
Unified Communication Runtime (UCR)
Memcached Design using UCR
Performance Evaluation
Conclusion & Future Work

Conclusion & Future Work


Described a novel design of Memcached for RDMA capable networks
Provided a detailed performance comparison of our design against unmodified Memcached using sockets over RDMA and 10GE networks
Observed significant performance improvement with the proposed design
Factor of four improvement in Memcached Get latency (4K bytes)
Factor of six improvement in Memcached Get transactions/s (4 bytes)
We plan to improve UCR by exploiting more features of the OpenFabrics API and the Unreliable Datagram transport, designing iWARP and RoCE versions of UCR, and thereby scaling Memcached further
We are working on enhancing the Hadoop/HBase designs for RDMA capable networks

Thank You!
{jose, subramon, luom, zhanjmin, huangjia, rahmanmd,
islamn, ouyangx, wangh, surs, panda}@cse.ohio-state.edu

Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/
MVAPICH Web Page
http://mvapich.cse.ohio-state.edu/
