
Memcached Design on High Performance RDMA Capable Interconnects
Jithin Jose, Hari Subramoni, Miao Luo, Minjia Zhang,
Jian Huang, Md. Wasi-ur-Rahman, Nusrat S. Islam,
Xiangyong Ouyang, Hao Wang, Sayantan Sur & D. K. Panda

Network-Based Computing Laboratory
Department of Computer Science and Engineering
The Ohio State University, USA

Outline
Introduction
Overview of Memcached
Modern High Performance Interconnects
Unified Communication Runtime (UCR)
Memcached Design using UCR
Performance Evaluation
Conclusion & Future Work

Introduction
Tremendous increase in interest in interactive websites (social networking, e-commerce, etc.)
Dynamic data is stored in databases for future retrieval and analysis
Database lookups are expensive
Memcached -- a distributed memory caching layer, implemented using traditional BSD sockets
Socket interface provides portability, but entails additional processing and multiple message copies
High-Performance Computing (HPC) has adopted advanced interconnects (e.g. InfiniBand, 10 Gigabit Ethernet/iWARP, RoCE)
Low latency, high bandwidth, low CPU overhead
Many machines in the Top500 list (http://www.top500.org)



Outline
Introduction
Overview of Memcached
Modern High Performance Interconnects
Unified Communication Runtime (UCR)
Memcached Design using UCR
Performance Evaluation
Conclusion & Future Work

Memcached Overview
[Figure: Memcached deployment -- Internet traffic reaches proxy servers (Memcached clients), which access Memcached servers and database servers over system area networks]
Memcached provides scalable distributed caching
Spare memory in data-center servers can be aggregated to speed up lookups
Basically a key-value distributed memory store
Keys can be any character strings, typically MD5 sums or hashes
Typically used to cache database queries, results of API calls, or webpage rendering elements
Scalable model, but typical usage is very network intensive -- performance is directly related to that of the underlying networking technology (a minimal client-side sketch follows)
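To make the key-value usage pattern concrete, the following minimal sketch stores and retrieves a value with the libmemcached C client (the client library used later in the evaluation, version 0.45). The server address, key, and value are illustrative assumptions, error handling is kept to a minimum, and type names follow recent libmemcached headers, so they may differ slightly in older releases.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <libmemcached/memcached.h>

int main(void)
{
    memcached_return_t rc;
    memcached_st *memc = memcached_create(NULL);

    /* Illustrative server address; a real deployment would push a pool of servers. */
    memcached_server_st *servers =
        memcached_server_list_append(NULL, "127.0.0.1", 11211, &rc);
    memcached_server_push(memc, servers);

    /* Cache a (hypothetical) database query result under a string key. */
    const char *key   = "user:1042:profile";
    const char *value = "{\"name\": \"alice\", \"karma\": 42}";
    rc = memcached_set(memc, key, strlen(key), value, strlen(value),
                       (time_t)0 /* no expiry */, (uint32_t)0 /* flags */);

    /* Subsequent lookups hit the cache instead of the database. */
    size_t value_len;
    uint32_t flags;
    char *cached = memcached_get(memc, key, strlen(key), &value_len, &flags, &rc);
    if (rc == MEMCACHED_SUCCESS && cached != NULL) {
        printf("cached: %.*s\n", (int)value_len, cached);
        free(cached);
    }

    memcached_server_list_free(servers);
    memcached_free(memc);
    return 0;
}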

Outline
Introduction
Overview of Memcached
Modern High Performance Interconnects
Unified Communication Runtime (UCR)
Memcached Design using UCR
Performance Evaluation
Conclusion & Future Work

Modern High Performance Interconnects


[Figure: Communication stacks on modern interconnects -- applications use either the Sockets interface or IB Verbs; protocol implementations include kernel-space TCP/IP (1/10 GigE Ethernet driver, IPoIB), hardware-offloaded TCP/IP (10 GigE-TOE), SDP, and user-space RDMA; the corresponding network adapters and switches cover 1/10 GigE, 10 GigE, and InfiniBand]


Problem Statement
High-performance RDMA capable interconnects have emerged in the scientific computation domain
Applications using Memcached still rely on sockets
Performance of Memcached is critical to most of its deployments
Can Memcached be re-designed from the ground up to utilize RDMA capable networks?


A New Approach using Unified Communication Runtime (UCR)
[Figure: Current approach -- Application over Sockets over 1/10 GigE networks; our approach -- Application over UCR over IB Verbs on RDMA capable networks (IB, 10GE, iWARP, RoCE, ...)]

Sockets are not designed for high performance
Stream semantics are often a mismatch for upper layers (Memcached, Hadoop)
Multiple copies can be involved (see the verbs sketch below)
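To make the contrast concrete, here is a minimal sketch of posting a zero-copy RDMA write with the OpenFabrics verbs API -- the kind of operation a UCR-based design builds on instead of socket send/recv. It assumes a queue pair, a registered memory region, and the peer's remote address/rkey have already been exchanged during connection setup; those setup steps are omitted.

#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Post a zero-copy RDMA write: the adapter moves `len` bytes from the locally
 * registered buffer directly into the remote buffer, with no intermediate
 * copies and no kernel involvement on the data path. */
static int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                           void *local_buf, size_t len,
                           uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };

    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id               = 1;
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.send_flags          = IBV_SEND_SIGNALED;   /* request a completion */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);       /* 0 on success */
}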


Outline
Introduction & Motivation
Overview of Memcached
Modern High Performance Interconnects
Unified Communication Runtime (UCR)
Memcached Design using UCR
Performance Evaluation
Conclusion & Future Work

Unified Communication Runtime (UCR)
Initially proposed to unify communication runtimes of different parallel programming models
J. Jose, M. Luo, S. Sur and D. K. Panda, Unifying UPC and MPI Runtimes: Experience with MVAPICH (PGAS '10)
Design of UCR evolved from the MVAPICH/MVAPICH2 software stacks (http://mvapich.cse.ohio-state.edu/)
UCR provides interfaces for Active Messages as well as one-sided put/get operations
Enhanced APIs to support cloud computing applications
Several enhancements in UCR: endpoint-based design, revamped active-message API, fault tolerance, and synchronization with timeouts
Communication is based on endpoints, analogous to sockets



Active Messaging in UCR
Active messages have proven to be very powerful in many environments
GASNet Project (UC Berkeley), MPI design using LAPI (IBM), etc.
We introduce active messages into the data-center domain
An active message consists of two parts: header and data
When the message arrives at the target, the header handler is run
The header handler identifies the destination buffer for the data
Data is put into the destination buffer
A completion handler is run afterwards (optional)
Special flags indicate local & remote completions (optional); a hypothetical handler sketch follows
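The UCR API itself is not listed on the slides, so the fragment below is only a hypothetical illustration of the header-handler / completion-handler flow described above; the type and function names are invented for this sketch and are not the actual UCR interface.

#include <stdint.h>
#include <stdlib.h>

/* Hypothetical handler types mirroring the flow above: the header handler
 * chooses the destination buffer, the runtime deposits the data there (via
 * RDMA for large messages), and an optional completion handler runs last. */
typedef void *(*am_header_handler_t)(const void *header, size_t header_len,
                                     size_t data_len);
typedef void  (*am_completion_handler_t)(void *dest_buf, size_t data_len);

/* Example header for a Memcached-style request (illustrative only). */
struct kv_header {
    uint32_t opcode;      /* e.g. GET or SET */
    uint32_t key_len;
};

/* Header handler: runs at the target when the header arrives and returns the
 * buffer into which the incoming data should be placed. */
static void *kv_header_handler(const void *header, size_t header_len,
                               size_t data_len)
{
    (void)header; (void)header_len;
    return malloc(data_len);          /* destination buffer for the data */
}

/* Completion handler (optional): runs once the data is in place, e.g. to
 * insert the item into the hash table and set the remote-completion flag. */
static void kv_completion_handler(void *dest_buf, size_t data_len)
{
    (void)dest_buf; (void)data_len;
}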

Active Messaging in UCR (contd.)
[Figure: Active message flows between origin and target. General active message functionality: the header is sent first, the header handler runs at the target, the data is transferred via RDMA into the destination buffer, the completion handler runs, and the local/remote completion flags are set. Optimized short active message functionality: header and data are sent together, the header handler copies the data, the completion handler runs, and the completion flags are set.]


Outline
Introduction & Motivation
Overview of Memcached
Modern High Performance Interconnects
Unified Communication Runtime (UCR)
Memcached Design using UCR
Performance Evaluation
Conclusion & Future Work

Memcached Design using UCR


[Figure: Hybrid Memcached server design -- a master thread assigns Sockets clients to Sockets worker threads and RDMA clients to Verbs worker threads; all worker threads operate on the shared data (memory slabs and items)]

Server and client perform a negotiation protocol
The master thread assigns clients to the appropriate worker thread
Once a client is assigned a verbs worker thread, it communicates directly with that thread and is bound to it
All other Memcached data structures are shared among RDMA and Sockets worker threads (see the dispatch sketch below)
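A minimal sketch of the dispatch idea, assuming a pthreads-based server; the structures and field names here are hypothetical and only illustrate the rule that the master thread binds each client to a sockets or verbs worker of the matching transport, while the item store stays shared.

#include <pthread.h>

/* Hypothetical descriptors -- not the actual Memcached/UCR data structures. */
enum transport { TRANSPORT_SOCKETS, TRANSPORT_VERBS };

struct client {
    enum transport transport;   /* learned during the negotiation protocol */
    int            handle;      /* socket fd or UCR endpoint id */
};

struct worker {
    pthread_t      thread;
    enum transport transport;   /* sockets worker or verbs worker */
    /* per-worker queue of bound clients omitted for brevity */
};

/* Master thread: after negotiating the transport with a new client, hand it
 * to a worker of the matching type; from then on the client communicates
 * with that worker directly, and only the item store is shared. */
static struct worker *assign_client(struct worker *workers, int nworkers,
                                    const struct client *c)
{
    for (int i = 0; i < nworkers; i++) {
        if (workers[i].transport == c->transport)
            return &workers[i];   /* enqueue the client here (omitted) */
    }
    return NULL;
}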

Outline
Introduction & Motivation
Overview of Memcached
Modern High Performance Interconnects
Unified Communication Runtime (UCR)
Memcached Design using UCR
Performance Evaluation
Conclusion & Future Work

Experimental Setup
Used Two Clusters
Intel Clovertown
Each node has 8 processor cores on 2 Intel Xeon 2.33 GHz Quad-core CPUs,
6 GB main memory, 250 GB hard disk
Network: 1GigE, IPoIB, 10GigE TOE and IB (DDR)

Intel Westmere
Each node has 8 processor cores on 2 Intel Xeon 2.67 GHz Quad-core CPUs,
12 GB main memory, 160 GB hard disk
Network: 1GigE, IPoIB, and IB (QDR)

Memcached Software
Memcached Server: 1.4.5
Memcached Client: (libmemcached) 0.45



Memcached Get Latency (Small Message)


[Figure: Memcached Get latency for small messages (1 byte to 4 KB) -- Time (us) vs. Message Size for 1G, 10G-TOE, SDP, IPoIB, and OSU Design; left: Intel Clovertown Cluster (IB: DDR), right: Intel Westmere Cluster (IB: QDR)]

Memcached Get latency
4 bytes -- DDR: 6 us; QDR: 5 us
4K bytes -- DDR: 20 us; QDR: 12 us
Almost a factor of four improvement over 10GE (TOE) for 4 KB on the DDR cluster
Almost a factor of seven improvement over IPoIB for 4 KB on the QDR cluster

Memcached Get Latency (Large Message)


[Figure: Memcached Get latency for large messages (4 KB to 512 KB) -- Time (us) vs. Message Size for 1G, 10G-TOE, SDP, IPoIB, and OSU Design; left: Intel Clovertown Cluster (IB: DDR), right: Intel Westmere Cluster (IB: QDR)]

Memcached Get latency
8K bytes -- DDR: 17 us; QDR: 13 us
512K bytes -- DDR: 362 us; QDR: 94 us
Almost a factor of three improvement over 10GE (TOE) for 512 KB on the DDR cluster
Almost a factor of four improvement over IPoIB for 512 KB on the QDR cluster

Memcached Set Latency (Small Message)


[Figure: Memcached Set latency for small messages (1 byte to 4 KB) -- Time (us) vs. Message Size for 1G, 10G-TOE, SDP, IPoIB, and OSU Design; left: Intel Clovertown Cluster (IB: DDR), right: Intel Westmere Cluster (IB: QDR)]

Memcached Set latency
4 bytes -- DDR: 7 us; QDR: 5 us
4K bytes -- DDR: 15 us; QDR: 13 us
Almost a factor of four improvement over 10GE (TOE) for 4 KB on the DDR cluster
Almost a factor of six improvement over IPoIB for 4 KB on the QDR cluster

Memcached Set Latency (Large Message)


[Figure: Memcached Set latency for large messages (4 KB to 512 KB) -- Time (us) vs. Message Size for 1G, 10G-TOE, SDP, IPoIB, and OSU Design; Intel Clovertown Cluster (IB: DDR) and Intel Westmere Cluster (IB: QDR)]

Memcached Set latency
8K bytes -- DDR: 18 us; QDR: 15 us
512K bytes -- DDR: 375 us; QDR: 185 us
Almost a factor of two improvement over 10GE (TOE) for 512 KB on the DDR cluster
Almost a factor of three improvement over IPoIB for 512 KB on the QDR cluster

Memcached Latency (10% Set, 90% Get)
[Figure: Mixed-workload latency (10% Set, 90% Get; 1 byte to 4 KB) -- Time (us) vs. Message Size for 1G, 10G, SDP, IPoIB, and OSU Design; left: Intel Clovertown Cluster (IB: DDR), right: Intel Westmere Cluster (IB: QDR)]

Memcached latency (10% Set, 90% Get)
4 bytes -- DDR: 7 us; QDR: 5 us
4K bytes -- DDR: 15 us; QDR: 12 us
Almost a factor of four improvement over 10GE (TOE) for 4K bytes on the DDR cluster


Memcached Latency (50% Set, 50% Get)
[Figure: Mixed-workload latency (50% Set, 50% Get; 1 byte to 4 KB) -- Time (us) vs. Message Size for 1G, 10G-TOE, SDP, IPoIB, and OSU Design; left: Intel Clovertown Cluster (IB: DDR), right: Intel Westmere Cluster (IB: QDR)]

Memcached latency (50% Set, 50% Get)
4 bytes -- DDR: 7 us; QDR: 5 us
4K bytes -- DDR: 15 us; QDR: 12 us
Almost a factor of four improvement over 10GE (TOE) for 4K bytes on the DDR cluster


Memcached Get TPS (4 bytes)


[Figure: Memcached Get transactions per second (TPS, in thousands) for 4-byte messages with 8 and 16 clients -- SDP, IPoIB, 10G-TOE, and OSU Design; left: Intel Clovertown Cluster (IB: DDR), right: Intel Westmere Cluster (IB: QDR)]

Memcached Get transactions per second for 4 bytes
On IB DDR: about 600K TPS for 16 clients
On IB QDR: 1.9M TPS for 16 clients
Almost a factor of six improvement over 10GE (TOE)
Significant improvement with native IB QDR compared to SDP and IPoIB

Memcached Get TPS (4KB)


[Figure: Memcached Get transactions per second (TPS) for 4 KB messages with 8 and 16 clients -- SDP, IPoIB, 10G-TOE, and OSU Design; Intel Clovertown Cluster (IB: DDR) and Intel Westmere Cluster (IB: QDR)]

Memcached Get transactions per second for 4K bytes
On IB DDR: about 330K TPS for 16 clients
On IB QDR: 842K TPS for 16 clients
Almost a factor of four improvement over 10GE (TOE)
Significant improvement with native IB QDR


Outline
Introduction & Motivation
Overview of Memcached
Modern High Performance Interconnects
Unified Communication Runtime (UCR)
Memcached Design using UCR
Performance Evaluation
Conclusion & Future Work

Conclusion & Future Work


Described a novel design of Memcached for RDMA capable networks
Provided a detailed performance comparison of our design against unmodified Memcached using sockets over RDMA and 10GE networks
Observed significant performance improvement with the proposed design
Factor of four improvement in Memcached Get latency (4K bytes)
Factor of six improvement in Memcached Get transactions/s (4 bytes)
We plan to improve UCR by exploiting more features of the OpenFabrics API and the Unreliable Datagram transport, designing iWARP and RoCE versions of UCR, and thereby scaling Memcached further
We are working on enhancing the Hadoop/HBase designs for RDMA capable networks

Thank You!
{jose, subramon, luom, zhanjmin, huangjia, rahmanmd,
islamn, ouyangx, wangh, surs, panda}@cse.ohio-state.edu

Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/
MVAPICH Web Page
http://mvapich.cse.ohio-state.edu/
