Towards High-Performance SAN with Fast Storage Devices
JAE WOO CHOI, Seoul National University
DONG IN SHIN, Taejin Infotech
YOUNG JIN YU, HYEONSANG EOM, and HEON YOUNG YEOM, Seoul National University
Storage area network (SAN) is one of the most popular solutions for constructing server environments these
days. In these kinds of server environments, HDD-based storage usually becomes the bottleneck of the over-
all system, but it is not enough to merely replace the devices with faster ones in order to exploit their high
performance. In other words, proper optimizations are needed to fully utilize their performance gains. In
this work, we first adopted a DRAM-based SSD as a fast backend storage in an existing SAN environment and found significant performance degradation compared to the device's own capabilities, especially in the case of small-sized random I/O patterns, even though a high-speed network was used. We propose three optimizations to solve this problem: (1) removing software overhead in the SAN I/O path; (2) increasing parallelism in the procedures for handling I/O requests; and (3) adopting the temporal merge mechanism to reduce network overheads. We have implemented them as a prototype and found that our approaches yield substantial performance improvements of up to 39% in latency and 280% in bandwidth.
Categories and Subject Descriptors: D.4.2 [Operating Systems]: Storage Management–Secondary storage; D.4.4 [Operating Systems]: Communication Management–Network communication
General Terms: Design, Measurement, Performance
Additional Key Words and Phrases: SAN, fast storage, RDMA, InfiniBand, storage stack optimization
ACM Reference Format:
Choi, J. W., Shin, D. I., Yu, Y. J., Eom, H., and Yeom, H. Y. 2013. Towards high-performance SAN with fast
storage devices. ACM Trans. Storage 10, 2, Article 5 (March 2014), 18 pages.
DOI:http://dx.doi.org/10.1145/2577385
1. INTRODUCTION
Today's server environments consist of many machines forming clusters for distributed computing systems or storage area networks (SANs) for effectively processing and storing enormous amounts of data. In these environments, an SAN is a system that replaces the data bus between the host computer and storage with high-speed networks and provides access to block-level data storage via a standard network protocol. In the SAN environment, the host computer and the storage server are typically called Initiator and Target, respectively.
This work was supported by the Technology Innovation Program (Industrial Strategic technology develop-
ment program, 10039163) funded by the Ministry of Knowledge Economy (MKE, Korea). This research was
also supported by the Basic Science Research Program through the National Research Foundation of Korea
(NRF) funded by the Ministry of Education, Science and Technology (2010-0024969).
Authors' addresses: J. W. Choi, D. I. Shin, Y. J. Yu, H. Eom (corresponding author), and H. Y. Yeom,
Distributed Computing System Laboratory, Seoul National University, Daehak-dong, Gwanak-gu,
Seoul, Republic of Korea, 151-742; emails: aryuze@hotmail.com, tuataras@taejin.co.kr, lefoot@gmail.com,
hseom@cse.snu.ac.kr, yeom@snu.ac.kr.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
© 2014 ACM 1553-3077/2014/03-ART5 $15.00
DOI:http://dx.doi.org/10.1145/2577385
This kind of system is widely used in enterprise and small- to medium-sized business environments and can be implemented effectively with high-performing interconnect solutions supporting Gigabit Ethernet, Fibre Channel, InfiniBand, and so on [Tate et al. 2012].
Although computing capability and networks are becoming more powerful these days, due to the increasing number of cores per system and high data-transfer throughput, the performance of storage systems has not kept pace with these evolutions, and therefore storage becomes the bottleneck of the overall server environment. To overcome this problem, demand for high-performance storage is ever increasing, and HDD-based storage can no longer satisfy this need. So, other types of storage and techniques, such as flash-SSDs, DRAM-based SSDs, and the use of main memory for data storage, are rising as alternatives, and research on next-generation storage, such as PCM, FeRAM, and MRAM, is actively in progress [Burr et al. 2008; Freitas and Wilcke 2008].
In this work, we employed a DRAM-based SSD [Taejin Infotech 2012] using the PCI-E interface and an InfiniBand network supporting RDMA data transfer [Pfister 2001]. We first applied the storage device to an existing InfiniBand SAN solution, the SCST project using the SCSI RDMA Protocol (SRP) [Bolhovitin 2004], and discovered huge performance degradation in comparison with the device capability, especially in the case of small-sized random I/O patterns.
The first problem we identified is software overhead in the SAN I/O path. The current block device I/O path has been developed with diverse optimizations based on HDD, under the assumption that the disk is very slow. This old assumption imposes large overheads on the SAN I/O path when high-performance storage is applied. Existing SAN solutions usually conduct storage virtualization at the SCSI layer [Palekar et al. 2001], but the SCSI layer itself can make the I/O path needlessly longer if used with a low-latency device, and there is research that proves the existence of layer overheads [Caulfield et al. 2010]. Also, the I/O scheduler usually adopts a policy that utilizes delays, which is very effective for HDD-based devices but not for others.
The second problem is the lack of parallelism in the procedure handling I/O requests on the Target side. One of the most important lessons we have learned from existing SAN solutions is that maximizing parallelism on the Target side is essential to achieving high performance, especially if low-latency storage devices are employed in the system. SCST (i.e., the SCSI Target Subsystem), evaluated in this article, handles incoming I/O requests in a serial manner when conducting device I/O, which causes a significant performance penalty under small-sized random-access workloads. As the latency of a storage device becomes lower, the portion of software overhead becomes so high that the benefit of using fast devices might even be eliminated.
The last problem is processing overhead in the network. InfiniBand with RDMA data transfer shows very high performance, but when small-sized requests are transmitted in a random pattern, the overhead that went unnoticed under the huge latency of disk I/O is exposed in fast storage environments and hurts overall performance.
Therefore, we propose optimizations that are necessary for constructing high-performance SAN environments with fast storage devices. Our contributions are the following.
First, we shortened the SAN I/O path by removing software overheads.
Second, we boosted parallelism in the procedure processing I/O requests on the Target side.
Third, we proposed the temporal merge mechanism to minimize network processing overheads.
Finally, we implemented a new SAN solution with these optimizations as a prototype and observed large performance improvements.
Concretely, we have found in experiments that our solution lowers read and write latencies by up to 39% and 33%, respectively, and increases bandwidth across diverse I/O patterns by up to 280% compared with the existing SAN solution. We have also conducted various experiments to analyze the performance of the preceding optimizations and to evaluate them under real workloads.
We first explain the background of this work in Section 2. Next, we describe the three optimizations for the new SAN solution in Section 3 and then describe implementation details of the Initiator and Target in Section 4. In Section 5, we show the results of evaluating the performance of our solution with a variety of I/O benchmark tools. We present related work and conclude this article in Sections 6 and 7, respectively.
2. BACKGROUND
2.1. Fast Storage System
Today, the demand for fast storage systems is rapidly increasing, but HDD-based storage cannot satisfy these needs, so flash-SSD and DRAM-based storage systems are considered as solutions.
Among current flash-SSDs, the ioDrive series from Fusion-io (http://www.fusionio.com/products/iodrive/) is a notable example of fast storage. It is nonvolatile and delivers great throughput [Humphries et al. 2011]. However, it also has the problems of flash itself, such as the lifetime issue and performance degradation under random write patterns as time goes by.
There are two primary approaches to DRAM-based storage solutions. The first is using main memory itself as storage. This is further divided into using the main memories of thousands of commodity machines connected with high-speed networks [Ousterhout et al. 2010] and equipping one machine with massive DRAM [Schneider and Jandhyala 2012].
The former has the benefit that the system can be built with only commodity machines, but it also has the spatial problem of many machines and the network overheads between them. The latter needs special hardware techniques for using enormous DRAM as main memory. Both approaches have a fatal defect in that they might lose data at power failures due to the volatile property of DRAM itself. To overcome this weakness, replication between machines or data backup techniques are exploited.
The other way to use DRAM in storage systems is the DRAM-based SSD. In this case, the SSDs are usually connected to the host computer via the PCI-E channel and compensate for the drawback of volatility with additional batteries that prevent the system from losing data at power failures. Although its performance is lower than that of the first approach, it still performs better than other storage systems, such as flash-SSDs or HDDs.
2.2. Traditional SAN I/O Path
Every I/O request holds essential information, such as the read/write type, start address, and data size, for executing I/O operations. In the Linux I/O subsystem, the unit of an I/O request can change as it passes through the system layers, but this information is maintained in each unit. Figure 1 shows a typical SAN I/O path abstractly. As you can see, the I/O path on the Initiator side is quite similar to the path for local SCSI devices. An I/O request initiated from the application layer goes through
Fig. 1. Traditional SAN I/O path in the Linux I/O subsystem [Palekar et al. 2001].
the file system and is transferred to the block layer. The request can be merged with an existing request if their data locations are adjacent; this is called spatial merge. The requests are then inserted into the request queue and must wait until the I/O scheduler dispatches them to the SCSI layer. Finally, the SCSI command is constructed from the request at the SCSI mid-level, forwarded to the front-end Initiator driver, which builds a command including the information needed for I/O processing on the Target, and transmitted via the high-speed network.
Depending on the command from the Initiator, a sequence of procedures such as
RDMA data transfer and device I/O are executed on the Target side. After the oper-
ations are complete, Target sends a completion response to Initiator to finish the I/O
request entirely.
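
For illustration, the adjacency test behind spatial merge can be sketched in C as follows; the structure and field names are simplified stand-ins for the kernel's bio fields, not the actual definitions.

#include <stdbool.h>
#include <stdint.h>

/* Simplified request descriptor; the field names are illustrative,
 * not the actual struct bio layout. */
struct io_req {
    uint64_t start_sector;   /* first 512-byte sector of the request */
    uint32_t nr_sectors;     /* request length in sectors */
};

/* Spatial merge (a "back merge") is possible only when the second
 * request begins exactly where the first one ends. Temporal merge,
 * introduced in Section 3.3, drops this adjacency requirement. */
static bool spatially_adjacent(const struct io_req *a,
                               const struct io_req *b)
{
    return a->start_sector + a->nr_sectors == b->start_sector;
}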
2.3. Existing SAN Solutions with InniBand
InfiniBand is widely known as one of the fastest network interconnects available today (see the InfiniBand architecture specification, http://www.infinibandta.org/content/pages.php?pg=technology_download). There are two straightforward ways to create SAN environments with InfiniBand networks on a current Linux system [Mellanox 2012]. One is using the SCSI RDMA Protocol (SRP) [Bolhovitin 2004], and the other is employing the iSCSI Extensions for RDMA (iSER) [Ko 2003]. Both share the property that they support the InfiniBand driver and use RDMA data transfer at the SCSI level, but some trade-offs between them must be considered when determining which solution to adopt.
SRP and iSER can be implemented via the SCST and tgt projects, respectively. It is easier to construct SAN environments with SRP than with iSER. Also, while iSER runs in user space, the SRP Target is implemented in kernel space; thus, the performance of SRP is better than that of iSER regarding I/O latency and bandwidth. However, iSER has benefits for the maintenance and management of SAN environments, because its implementation is based on iSCSI, so it can use useful functions such as target discovery and password-based access control.
Table I. Latency Breakdown of Common-Case I/O Path in the SAN Environment

Layer   Context   Function        In/Out   Time (usecs)
VFS     dd        do_sync_read    in       0
BLK     dd        gen.make.req    in       4
BLK     dd        gen.make.req    out      6
VFS     dd        io_schedule     in       8
SCSI    kworker   scsi.req.fn     in       10
SCSI    kworker   scsi.req.fn     out      14
BLK     softirq   blk.done.irq    in       R + 14
BLK     softirq   bio_endio       in       R + 18
BLK     softirq   bio_endio       out      R + 20
SCSI    softirq   scsi.run.q.     in       R + 24
SCSI    softirq   scsi.run.q.     out      R + 26
BLK     softirq   blk.done.irq    out      R + 27
VFS     dd        io_schedule     out      R + 31
VFS     dd        do_sync_read    out      R + 36

Note: The R value is the latency of data transfer via network and device I/O.
In this work, we focus more on performance than on other properties in SAN, so we
choose SRP for constructing SAN environments and compare its performance to that
of the new solution we have designed and implemented.
3. DESIGN EXPLORATION
In this section, we propose the following three optimizations needed for the new SAN
solution based on high-performance storage with RDMA data transfer: (1) removing
software overheads in the SAN I/O path, (2) increasing parallelism for device I/O, and
(3) temporal merge in RDMA data transfer.
3.1. Removing Software Overheads
As mentioned earlier, the SCSI layer itself can penalize low-latency devices in that it makes the I/O path longer, so many drivers for these kinds of up-to-date devices do not adopt the SCSI layer. But existing SAN solutions conduct storage virtualization at the SCSI level and transfer SCSI commands to the Target, including SRP and the SRP Target in the SCST project for InfiniBand.
For the latency analysis of the I/O path, we first ran a very simple single-threaded read-only workload. We gave the O_DIRECT option to the dd test to bypass the page cache and use the block layer directly. In this case, the merge operation never takes place, since the next I/O request reaches the block layer only after the previous one is over.
According to the results in Table I, we can confirm that the software latency of the descending I/O path is approximately 14 usecs after a user process enters the kernel via the read system call. After R, the latency of data transfer via network and device I/O, the ascending I/O path takes 22 usecs to complete the post-processing, resulting in a total of 36 usecs of software latency.
These results show that the storage stack can already be an I/O performance bottleneck with current storage device technology. In the case of HDD, the R value was
Fig. 2. New SAN I/O path where the SCSI layer is removed.
very high, as high as 1,000 to 20,000 usecs, which made it difficult to bring about significant performance improvement by reducing the software latency. However, the DRAM-based SSD used for our evaluation has a low hardware latency of 5 to 10 usecs, and the data-transfer latency via InfiniBand is about 10 usecs. In this situation, a delay of just a few usecs can affect the I/O latency significantly.
Therefore, we first removed the SCSI layer from the SAN I/O path to reduce I/O latency and virtualized the storage on top of the generic block layer. Figure 2 shows the abbreviated path. The block layer directly interacts with the front-end Initiator driver that we implemented. Thus, the key I/O unit in the driver is the bio structure, not the SCSI command. In addition, the information in the bio structure is referenced to compose the command that will be sent to the Target.
The current I/O scheduler in the Linux subsystem commonly adopts a policy built around the disk assumption. CFQ (completely fair queuing) is the typical optimization for HDD-based device I/O, and it admits relatively long delays. Therefore, if fast storage devices are used, the policy should be changed to NOOP to achieve better performance.
We do not employ the existing I/O scheduler because it still conducts complex operations not suitable for fast storage. These include the use of delays in expectation of data merges and the use of the request structure employed by the existing scheduler rather than the bio structure that we use to maintain waiting queues. So we implemented a very simple I/O scheduler that dispatches I/O requests from the waiting queues and sends them to the Target as soon as possible, and we removed all operations that could cause unnecessary delays in the block layer, such as the elevator mechanism for merging and the plug/unplug scheme. Since the transfer protocol we designed between the machines is also based on the bio structure, it becomes easier and more efficient to reconstruct the bio structure from the Initiator's information on the Target side. Based on these modifications, we could simplify the SAN I/O path and lower the I/O latency substantially.
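
To make the shortened path concrete, the following sketch shows how a virtual block driver on Linux 2.6.32 (the kernel used in our evaluation) can register its own make_request function, so that bio structures from the block layer reach the network front end directly, bypassing both the I/O scheduler and the SCSI stack. This is a minimal sketch under stated assumptions, not our exact implementation; brp_send_bio() is an assumed transport hook.

#include <linux/module.h>
#include <linux/blkdev.h>
#include <linux/bio.h>

/* Assumed hook into the BRP transport (not shown here). */
extern int brp_send_bio(struct bio *bio);

static struct request_queue *brp_queue;

/* Called by the block layer for every bio: no elevator, no plugging,
 * and no SCSI command translation on the way to the network driver. */
static int brp_make_request(struct request_queue *q, struct bio *bio)
{
    if (brp_send_bio(bio))
        bio_endio(bio, -EIO);   /* fail the bio on a transport error */
    return 0;
}

static int __init brp_init(void)
{
    brp_queue = blk_alloc_queue(GFP_KERNEL);
    if (!brp_queue)
        return -ENOMEM;
    blk_queue_make_request(brp_queue, brp_make_request);
    /* gendisk setup and add_disk() omitted for brevity */
    return 0;
}
module_init(brp_init);
MODULE_LICENSE("GPL");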
3.2. Increasing Parallelism
The second optimization is to increase parallelism in a sequence of procedures han-
dling I/O requests from Initiator on the Target side. In high-performance storage
Fig. 3. Comparison between the existing RDMA transfer and temporal-merge-applied RDMA transfer.
systems, parallelism is implemented at the device level so that simultaneously incom-
ing I/O requests can be handled effectively by exploiting its high bandwidth.
A Target solution in SAN environments usually has an event handler that analyzes diverse events on the Target side and executes the proper operations for them. The operations can be classified into the following four kinds: (1) jobs for executing RDMA data transfer, (2) jobs for sending responses to the Initiator, (3) jobs for terminating I/O requests, and (4) jobs for device I/O. They are processed in order depending on the I/O type. Also, a number of I/O requests can be in flight. These operations from different I/O requests are independent of one another, so they can be processed in parallel. Specifically, the parallelism of the device I/O is critical for overall performance, because it determines whether the higher bandwidth of the device can be exploited. The existing SAN solution does conduct parallel processing of procedures with multiple threads. However, we have discovered that the device I/O is executed in a serial fashion, resulting in severe performance degradation for small-sized I/O patterns. Therefore, we have optimized the procedures to enable all operations to be handled by multiple threads in parallel, including device I/O. We discuss this issue in detail in Section 4.
3.3. Temporal Merge
Every data transfer via the network involves pre-processing and post-processing operations, the essential steps that must be executed before and after the data transfer, respectively.
Actually, InfiniBand with RDMA data transfer shows very high performance, but there are some inefficient aspects in using the existing transfer protocol as it is. The left part of Figure 3 shows how the existing RDMA data transfer works for write requests. A command including I/O information from the Initiator and a response including completion information are needed for every RDMA data transfer, and network transfers, whether commands, data, or responses, all require pre-processing and post-processing operations as well.
These processing overheads have been relatively small in HDD-based SAN environments because the device I/O latencies are so long that network overheads go unnoticed. But in the case of low-latency devices, the overheads can affect system performance. When large-sized data is transmitted, the data transfer time
Table II. InfiniBand RDMA Send Latency

Bytes   Min (usecs)   Max (usecs)   Typical (usecs)
64      1.73          7.38          1.81
128     1.85          8.10          1.91
256     2.71          8.56          2.83
512     3.44          8.76          3.59
1,024   3.96          9.76          4.12
dominates the whole latency in the network, so the processing overhead hardly affects
the system. However, once small data is transferred, the processing overhead in the
network is eventually exposed, because data transfer time per request is relatively
small.
Thus, we propose temporal merge to mitigate the processing overhead for small-sized random I/O patterns. It is similar to Nagle's algorithm in TCP [Peterson and Davie 2007]. Temporal merge is a technique that merges a number of requests regardless of their spatial adjacency, and it has already been introduced to exploit peak throughput on low-latency devices [Yu et al. 2012]. We have applied this idea to RDMA data transfer.
The right part of Figure 3 describes the RDMA transfer method with temporal merge. As you can see, the small requests on the Initiator are merged into one big command, called a jumbo command, and sent as a single message to the Target. This can lead to latency benefits: according to Table II, it is more efficient to send one 256B data item than four 64B data items in a row.
In addition, RDMA supports scatter/gather DMA, which can transfer a number of
spatially discontinuous data items in one RDMA execution. We believe that this also
helps reduce processing overhead. After the RDMA data transfer, device I/O will be
conducted in parallel with multithreads. Finally, the responses for I/O completion
are transferred to Initiator one by one in order of termination, which increases the
response speed for each request.
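
As a minimal user-space sketch with libibverbs (our Target uses the equivalent in-kernel API), several discontiguous buffers can ride in the sg_list of a single work request, so the posting and completion overhead is paid once for all of them. The helper name and the fixed limit of eight entries are assumptions for illustration; a real caller must also respect the queue pair's max_send_sge capability.

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Post one send work request whose scatter/gather list carries up
 * to eight spatially discontiguous, pre-registered buffers. */
int post_gathered_send(struct ibv_qp *qp, void *buf[], uint32_t len[],
                       struct ibv_mr *mr[], int n)
{
    struct ibv_sge sge[8];
    struct ibv_send_wr wr, *bad_wr;
    int i;

    if (n < 1 || n > 8)
        return -1;
    for (i = 0; i < n; i++) {
        sge[i].addr   = (uintptr_t)buf[i];
        sge[i].length = len[i];
        sge[i].lkey   = mr[i]->lkey;  /* memory regions registered elsewhere */
    }
    memset(&wr, 0, sizeof(wr));
    wr.sg_list    = sge;
    wr.num_sge    = n;                /* one work request, n buffers */
    wr.opcode     = IBV_WR_SEND;
    wr.send_flags = IBV_SEND_SIGNALED;

    /* One doorbell and one completion amortize the pre- and
     * post-processing across all n merged buffers. */
    return ibv_post_send(qp, &wr, &bad_wr);
}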
However, temporal merge is not always beneficial. From our experiments, we found that temporal merge is effective in I/O-intensive situations (see Section 5.2), so we needed a criterion for determining whether the I/O condition is intensive or not. Therefore, we implemented temporal merge so that it is activated depending on the I/O condition.
As described earlier, we applied the temporal merge technique and verified that it works properly, gaining performance benefits when small-sized random I/O occurs intensively during the execution of a workload.
4. IMPLEMENTATION DETAILS
To achieve our goal of constructing a fast SAN environment that exploits high per-
formance of fast storage as much as possible, we implemented the SAN solution,
where the optimizations are applied, as a prototype. In Section 4.1, we mainly describe
the implementation on the Initiator side, and then explain how we implemented the
Target part of the new SAN in Section 4.2.
4.1. Block RDMA Protocol (BRP)
As explained before, we discarded the SCSI layer and the SCSI-based transfer protocol and implemented a bio-based transfer protocol for the fast I/O path on top of the block layer in the Linux I/O subsystem. We named the protocol the Block RDMA Protocol (BRP) because it was implemented with reference to the SCSI RDMA Protocol (SRP).
The temporal merge mechanism is an optimization basically for improving the performance of small-sized random I/O. Large-sized I/O requests do not need to be merged because they already exploit the network bandwidth well, so we allowed temporal merge to be activated only for small-sized I/O, meaning requests smaller than the page size (4KB).
The bio structures from the file system are first inserted into the wait queue that we implemented. Then, a worker thread is woken up and executes the following in order: (1) fetch bios from the wait queue, (2) merge the information from the bios into one jumbo command needed for executing RDMA data transfer and device I/O on the Target side, and (3) send the jumbo command to the Target via InfiniBand functions [Woodruff et al. 2005]. A sketch of step (2) follows.
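
As a sketch of step (2), a jumbo command can be viewed as a fixed-size array of per-request descriptors; the field names below are assumptions about what such a command must carry, not the actual BRP wire format.

#include <stdint.h>

#define JUMBO_MAX 8   /* maximum number of requests merged into one command */

/* Hypothetical per-request descriptor carried inside a jumbo command. */
struct brp_io_desc {
    uint8_t  rw;            /* 0 = read, 1 = write */
    uint64_t start_sector;  /* target device address */
    uint32_t length;        /* bytes; at most one page (4KB) */
    uint64_t rdma_addr;     /* Initiator buffer address for RDMA */
    uint32_t rkey;          /* remote key for that buffer */
};

struct brp_jumbo_cmd {
    uint32_t nr_reqs;                    /* how many descriptors are valid */
    struct brp_io_desc desc[JUMBO_MAX];
};

/* Fold up to JUMBO_MAX pending small requests into one command that
 * is then sent with a single InfiniBand message. */
int build_jumbo(struct brp_jumbo_cmd *cmd,
                const struct brp_io_desc pending[], int navail)
{
    int i, n = navail < JUMBO_MAX ? navail : JUMBO_MAX;

    cmd->nr_reqs = (uint32_t)n;
    for (i = 0; i < n; i++)
        cmd->desc[i] = pending[i];
    return n;   /* caller removes n entries from the wait queue */
}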
One significant consideration is how to determine when temporal merge should be activated. As we noted before, this merge mechanism is effective only in I/O-intensive situations, because temporal merge can cause some serial execution among the merged requests, which obstructs parallelism on the Target side and degrades overall system performance. So we use an I/O counter with a threshold for deciding whether the current I/O situation is busy or not; the counter increases at the start of an I/O request and decreases at its end. If the I/O counter exceeds the threshold we set, the I/O situation is assumed to be intensive and temporal merge is turned on. A minimal sketch of this heuristic follows.
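
This is a minimal sketch of the activation heuristic, using C11 atomics and a hypothetical in-flight counter; the threshold of 64 is the value determined empirically in Section 5.2.

#include <stdatomic.h>
#include <stdbool.h>

#define MERGE_THRESHOLD 64   /* empirically chosen; see Section 5.2 */

static atomic_int inflight;

void io_start(void) { atomic_fetch_add(&inflight, 1); }  /* request issued */
void io_done(void)  { atomic_fetch_sub(&inflight, 1); }  /* request completed */

/* Merge only when enough requests are outstanding that the serial
 * handling of merged requests costs less than the per-message
 * network processing overhead it saves. */
bool should_merge(void)
{
    return atomic_load(&inflight) >= MERGE_THRESHOLD;
}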
Another consideration in implementing temporal merge is the maximum number of requests merged into one jumbo command. If the I/O situation is intensive enough, many requests will be pending in the wait queue, and it is necessary to determine how many of them should be merged. We assumed that this might be related to the number of cores in the system, but in the prototype implementation we set the maximum heuristically for the moment; we plan to address this issue in future work. In this work, we evaluated performance while varying the maximum and settled on eight as the proper number, because performance was lower with fewer than eight and did not improve with more than eight.
Additionally, for implementation issues regarding data transfer via InfiniBand, such as resource allocation, DMA mapping, and RDMA data transfer, we referenced and exploited the SRP source code and related functions [Woodruff et al. 2005].
In this work, we used a bio-based protocol instead of the SCSI command, which is considered a standard interface in SAN environments. The bio structure is the I/O unit for block devices in Linux-based systems, which might cause a compatibility issue with Initiator systems based on other OSes, such as Windows. However, the main purpose of using the bio structure in this work is to transfer the command from a layer above SCSI, earlier in the stack. Also, the contents of the command made from the bio and transferred to the Target are very general variables for I/O operations in most OS stacks, such as the I/O direction, start address, data length, and so on. Therefore, we believe compatibility is not a significant concern.
4.2. BRP Target
The BRP Target is a kernel module executing on the Target side of the new SAN environment we implemented. The BRP Target first analyzes the command from the Initiator and then executes RDMA data transfer and device I/O. Finally, it sends the completion response to the Initiator.
SAN solutions usually have an event handler that analyzes diverse events on the Target side and executes the proper operations for them. Figure 4 abstractly describes the usual Target in an SAN environment and its operations for an I/O request, which can be classified into four kinds and are processed in order depending on the
Fig. 4. BRP Target program.
I/O type. Also, a number of I/O requests can be in flight, and all these operations are independent of one another, so they can be processed in parallel.
Actually, the existing SAN solution exploits a thread pool to handle some operations on the Target side, so they are already partly processed in parallel, but device I/O operations are usually handled in a serial manner in the event handler, not in the thread pool. This is because HDD-based storage systems cannot get any benefit from multiple I/O accesses; in that case, serial execution of device I/O is more effective. Fast storage, however, provides high bandwidth for multiple I/O accesses. Therefore, we implemented the device I/O to be processed in the thread pool, as indicated by the yellow arrow in Figure 4, which means that all the operations can be executed in parallel. As a result, this optimization induces multiple I/O accesses, and the high bandwidth of fast storage can be exploited.
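
This change can be sketched with the kernel workqueue API available in Linux 2.6.32: each bio rebuilt from a jumbo command becomes its own work item, so the thread pool issues the device I/Os concurrently instead of the event handler submitting them one after another. The names are illustrative, not our exact code.

#include <linux/workqueue.h>
#include <linux/bio.h>

struct brp_io_work {
    struct work_struct work;
    struct bio *bio;            /* rebuilt from a jumbo command entry */
};

static struct workqueue_struct *brp_wq;   /* created at module init */

static void brp_do_io(struct work_struct *w)
{
    struct brp_io_work *iw = container_of(w, struct brp_io_work, work);

    /* Each worker thread submits its bio independently, so several
     * device I/Os are outstanding at once. */
    submit_bio(bio_data_dir(iw->bio), iw->bio);
}

/* Fan out all n bios of a jumbo command before waiting on any of them. */
void brp_dispatch(struct brp_io_work iw[], int n)
{
    int i;

    for (i = 0; i < n; i++) {
        INIT_WORK(&iw[i].work, brp_do_io);
        queue_work(brp_wq, &iw[i].work);
    }
}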
Basically, the procedure for preparing RDMA data transfer in the BRP Target is not very different from that of the existing solution, the SRP Target. It first allocates some buffers for RDMA and constructs the scatterlist structure for scatter/gather DMA. Finally, it executes DMA mapping and RDMA data transfer with InfiniBand APIs. The biggest difference from the SRP Target is that the BRP Target processes the jumbo command, into which the information for several requests is integrated using the temporal merge mechanism. If it simply passed through all the existing procedures as they are, this would cause some serial execution among the merged requests, hurting overall performance, so we implemented some optimizations for handling jumbo commands effectively.
As we explained before, temporal merge is executed only in the case of small-sized I/O of one page (4KB), so the procedures for resource allocation for a jumbo command are totally predictable. Therefore, we exploit a pre-allocation scheme to minimize the overhead of serial execution. For example, bio structures and scatterlist structures can be maintained in advance with a fixed size matched to the small-sized I/O; then, as many as needed can be used when a jumbo command arrives from the Initiator.
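
A sketch of the pre-allocation scheme under these assumptions (the pool size and names are illustrative): because every merged request covers a single 4KB page, one single-vector bio and a one-entry scatterlist can be set aside per slot when the module loads, so the jumbo-command path never allocates.

#include <linux/bio.h>
#include <linux/scatterlist.h>

#define POOL_SLOTS 256   /* illustrative pool size */

struct brp_slot {
    struct bio *bio;           /* one-vector bio, enough for one page */
    struct scatterlist sg;     /* matching single-entry DMA descriptor */
};

static struct brp_slot pool[POOL_SLOTS];

/* Carve out all per-request resources once, at initialization time,
 * so handling a jumbo command only consumes pre-built slots. */
int brp_pool_init(void)
{
    int i;

    for (i = 0; i < POOL_SLOTS; i++) {
        pool[i].bio = bio_alloc(GFP_KERNEL, 1);  /* room for one page */
        if (!pool[i].bio)
            return -ENOMEM;                      /* unwinding omitted */
        sg_init_table(&pool[i].sg, 1);
    }
    return 0;
}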
The processing of jumbo commands differs a little between reads and writes. While read requests execute device I/O first, write requests do RDMA data transfer before device I/O. So in the case of read I/O, the BRP Target immediately
parallelizes the procedures to exploit the high bandwidth of the fast device. In other
words, it first wakes up the work threads waiting in the thread pool and then executes
multiple device I/Os synchronously.
In the case of write I/O, RDMA data transfer should be processed first. Also, data
transfer for the several requests in the jumbo command needs to be handled by just
one RDMA operation in order to reduce network processing overheads. As we men-
tioned earlier, this is possible because RDMA supports scatter/gather DMA between
Initiator and Target. In that case, however, the preparation procedure for RDMA with
merged requests has no choice but to be processed in a serial manner, which negatively influences overall performance. To solve this problem, we allowed the Target
to take advantage of the pre-allocation that minimizes resource-allocation overheads.
After the RDMA data transfer, we parallelized the write I/O executions for the merged
requests, similar to read I/O, so that multiple accesses could be generated.
5. EVALUATION
We performed experiments with two machines, one as the Initiator and the other as
the Target. Both have two Intel Xeon E5630 2.53GHz quad-core CPUs (8 cores in total) and 8GB and 16GB of main memory, respectively. Linux 2.6.32 was used in the experiments. DRAM SSDs were installed on the Target side (512GB in total = 64GB*8), and one of them was used as the backend storage. We also used two Mellanox ConnectX-2 MHQH18B-XTR InfiniBand cards, one each for the Initiator and the Target.
We first evaluated the performance of the existing SAN solution and then compared the result with that of the new solution we implemented. We denote our solution as the Block RDMA Protocol (BRP) and number its variants BRP-1, BRP-2, and BRP-3: BRP-1 includes only the optimization removing software overheads; BRP-2 adds the parallelism optimization to BRP-1; BRP-3 adds temporal merge to BRP-2. BRP alone means BRP-3, the final version.
5.1. Latency and Bandwidth
As described before, we removed the SCSI layer and implemented a simple and fast I/O scheduler to lower the I/O latency in an SAN environment. Table III shows the read/write latency comparison between the existing solution and the new one.

Table III. Latency Comparison between SRP-SRPT I/O Path and BRP-BRPT I/O Path

I/O Type   SRP-SRPT (usecs)   BRP-BRPT (usecs)   Latency Reduction
Read       63 (51)            43 (31)            31.7 (39.2)%
Write      75 (62)            54 (41)            28 (33.8)%

These results were measured by the dd test with the direct I/O option. We conducted 10,000 direct I/Os repeatedly at 4KB size and calculated the average latency. The numbers in parentheses are the software-only overheads, which exclude the device latency of between 12 and 13 usecs. According to the results, read latency is lowered by about 31.7% (39.2% software-only), and write latency by about 28% (33.8% software-only). These results show that our mechanism for removing software overheads is effective for low-latency devices.
Next, we conducted experiments with the FIO micro benchmark to evaluate
throughput for diverse I/O patterns. Figure 5 shows the result of the throughput com-
parison between SRP-based SAN and BRP-based SAN. We executed 16 I/O threads,
each thread having 3GB workload, which amounts to 48GB in total. The I/O types are
sequential read/write (rd/wr), random read/write (r_rd/r_wr), and buffered/direct I/O.
Fig. 5. Throughput comparison between SRP-SRPT and BRP-BRPT for a variety of I/O patterns.
Two request sizes were set for the small (4KB) and large (1MB) cases. The bars indi-
cate throughput for SRP with the CFQ schedule policy, SRP with the NOOP schedule
policy, the local I/O test for the DRAM-SSD, and the BRP we implemented.
As you can see, the use of CFQ has a huge disadvantage for high-performance stor-
age. The NOOP policy could be a solution for overcoming the drawbacks of using CFQ,
and it actually works for most I/O patterns. However, it still shows relatively low per-
formance compared with the local I/O. Buffered I/O using page cache in the file system
can mitigate the penalties in some cases, but in the case of small-sized direct I/O or
small-sized random I/O, large performance gaps still exist between them.
Fig. 6. Performance analysis with different numbers of threads for random write pattern.
We implemented the new SAN solution with the three optimizations mentioned before and achieved throughput close to local I/O performance. Figure 5(a) shows the throughput of large-sized I/O patterns. Both SRP and BRP generally utilize the high bandwidth well. We also found that BRP can be an effective solution under large-sized workloads. Since temporal merge is not conducted for large-sized requests, the performance increments result only from removing software overheads in the I/O path and increasing parallelism. In some cases, the result of local I/O is inferior to that of BRP, because I/O processing in the DRAM SSD device driver on the Target side is based on polling, not interrupts. For low-latency devices, polling can be a better approach than interrupts [Yang et al. 2011], but it can cause CPU bursts. Although a CPU burst can slightly penalize I/O performance, the benefit from polling overwhelms the penalty, so overall performance can increase. In SAN environments, the consumption of computing resources is distributed, which means the CPU-burst penalty of polling diminishes on the Initiator side. This is why throughput for local I/O can be less than that for BRP.
Figure 5(b) shows the throughput for small-sized I/O patterns. As described in prior sections, the biggest goal of our optimization was to increase performance for this pattern, and we achieved throughput increments comparable to local I/O results in almost all cases. The temporal merge technique matters for intensive I/O patterns, and random buffered write is the best beneficiary of the technique, because asynchronous I/O with writeback makes the I/O situation very intensive. All of the small-sized I/O experiments were conducted with just 16 threads; the direct I/O and random buffered read patterns do not face an I/O-intensive situation, basically because 16 threads are too few to induce it. Thus, temporal merge is not executed in those cases at all (see the next section for more details in the BRP performance analysis).
5.2. BRP Performance Analysis
In this section, we analyze the performance of our three optimizations. Figure 6 indi-
cates the throughput variation as the number of threads gradually increases on the
Initiator side; this is the case of small-sized random write pattern in direct I/O. The
result for BRP-3T is distinguished from that for BRP-3 in that BRP-3T always exe-
cutes the temporal merge technique. BRP-3 uses a threshold that activates temporal
merge at the proper time. According to the gure, the throughputs in SRP and BRP-1,
Fig. 7. Performance analysis for three optimizations in BRP.
which are not optimized for parallelism, seem to be saturated at 16 threads, while the performance of the others continuously increases with the number of threads.
We can see the effect of temporal merge from the results of BRP-2 and BRP-3T. The results indicate that in the I/O-intensive situation, temporal merge can be effective, with throughput close to the device performance itself, but in sporadic I/O situations (up to 64 threads in the experiment for Figure 6), temporal merge obstructs parallelism. In other words, the penalties from interrupting the parallelism are bigger than the benefit from temporal merge. Therefore, we needed to set a threshold to identify the I/O-intensive situation and found that the number of in-flight requests serves as the condition value. The proper threshold was determined from repeated experiments and was set to 64.
Figure 7(a) shows the results of the performance analysis of the three optimizations with 16 threads. The results indicate throughput for all the optimizations normalized to local I/O. The throughput of SRP usually reaches only about 35% of local I/O throughput. In the case of buffered random write, each optimization affects the performance, and the effect of BRP-1 is bigger than in the direct I/O cases, because the lower latency increases the efficiency of the page cache. Direct I/O does not gain any
Fig. 8. Performance comparison for the TPC-C benchmark on PostgreSQL.
benefit from temporal merge with 16 threads, as we explained before, so we see performance increases only for BRP-1 and BRP-2. Generally, the performance of BRP seems reasonable enough: in the case of small I/O patterns with 16 threads, every case is over 80% of local I/O performance, and buffered random write reaches 98% of it.
Figure 7(b) shows the result with 128 threads in the same test environment as before. The number of threads in this experiment is large enough to realize the benefits of temporal merge, so we can see that the performance of BRP-3 increases for the direct I/O patterns and is over 94% of local I/O performance in all cases.
5.3. Real Workload Benchmark
First, we tested our solutions on a DBMS. We used the BenchmarkSQL tool [Lussier 2004] with the TPC-C benchmark (http://www.tpc.org/tpcc/) and PostgreSQL (http://www.postgresql.org/) as the database program. The number of warehouses was set to 320. We evaluated the average transactions per minute as the number of terminals increased from 8 to 128. Figure 8 shows that BRP leads to better performance than SRP in the DBMS environment as well, but the degree of performance increment is relatively smaller than in the case of the micro benchmark, FIO. After trace analysis, we discovered that the I/O rate from the DBMS is not very high, and the DBMS tends to perform large-sized I/Os due to tuning for writing to HDDs, whose seek latency is severe. Applications such as DBMSes internally include many complicated optimizations for HDDs, but these are not beneficial for fast storage devices.
The second real workload benchmark is Postmark [Katcher 1997], an I/O-intensive benchmark designed to simulate the operation of an email server. It generates a lot of small files and executes read/write operations on them. Figure 9 shows the results of using SRP and BRP for Postmark. We created 100,000 files under 100 directories and performed 200,000 transactions; the file size varied between 4KB and 40KB. We also increased the number of processes to make the I/O situation more intensive and evaluated transactions per second. The results show that more processes bring larger performance improvements in this benchmark, but the improvement rate is still lower than that of the micro benchmark.
Fig. 9. Performance comparison for the Postmark benchmark.
The results of these two real workload benchmarks indicate that optimizations at diverse layers, from application to device, are needed to fully exploit the high performance of fast storage.
6. RELATED WORK
NVM (nonvolatile memory) Express is a scalable host controller interface designed to specify access to SSDs on a PCI Express bus [Huffman 2012]. The NVM Express specification defines a register interface for communication and a standard command set for use with the NVM subsystem. NVM Express significantly reduces I/O latency by bypassing the regular block/SCSI layers, as in our work. However, it increases parallelism in processing with multiple command queues, unlike our work, which uses multiple workers in a thread pool. The SAN I/O path includes network transmission in which data packets are serially transferred; therefore, maintaining multiple command queues in an SAN is not very effective in general cases.
There is also work on an SAN storage target using NVM Express storage (Next Gen SAN, http://www.nvmexpress.org/). It shows that a fast-path network with open Fibre Channel over Ethernet (FCoE) [INCITS 2009] harmonizes well with the NVM Express storage stack.
Network Block Device (NBD) is another solution for network block devices [Machek 1997]. Like BRP, it is implemented in the block layer, but NBD is a general-purpose solution in that it does not include any specific optimizations for fast storage; it follows the existing block I/O path with the current I/O scheduler. Also, NBD is based on TCP/IP, so it needs a translation layer such as IP-over-InfiniBand in order to use high-speed networks such as InfiniBand, which causes additional translation overhead.
ATA over Ethernet (AoE) is a network protocol designed for implementing simple,
high-performance access of SATA storage devices over Ethernet networks [Cashin
2005]. It can be exploited to build an SAN environment with low cost and standard
technologies.
AoE does not use the Internet Protocol (IP), which means that it cannot be accessed from other IP-based networks at all. As a result, AoE-based networks have fewer protocol layers, which makes AoE fast and lightweight. This approach is quite similar to our optimization, which shortens the SAN I/O path by removing the SCSI layer and unnecessary block-layer operations. BRP, our solution, is based on the InfiniBand
network, while AoE is based on the Ethernet network. There exists a performance gap between them, and InfiniBand is known to be faster because it uses RDMA data transfer, which takes advantage of the zero-copy mechanism. We think the network capability for SAN optimization should be harmonized with its target storage system in order to obtain the expected performance. In this work, we used a DRAM-based SSD, which is very fast, as the target storage. Therefore, we believe SAN optimizations over InfiniBand are more suitable for implementing high-performance SAN environments with very fast storage.
7. CONCLUSION
We designed and implemented a new SAN solution for high-performance storage with RDMA data transfer. In this work, we proposed three major optimizations based on performance observations of an SAN with a low-latency storage device. The first was removing software overheads in the SAN I/O path: we eliminated the SCSI layer, which makes the I/O path longer, and replaced the I/O scheduler with a simple and fast one. Based on these optimizations, we lowered I/O latencies significantly. The second optimization was increasing parallelism. We analyzed the existing SAN solutions and found several places where I/O requests are handled serially, one by one, and changed them so that all work involved in processing I/O requests can be handled in parallel with multiple threads. We found from our experiments that throughput increased for every type of I/O pattern. Finally, we proposed the temporal merge technique to reduce network processing overhead for small-sized I/O patterns. The results show that this technique is effective in I/O-intensive situations.
REFERENCES
Bolhovitin, V. 2004. SCST: Generic SCSI target subsystem for Linux. SCST Ltd. http://scst.sourceforge.net/.
Burr, G. W., Kurdi, B. N., Scott, J. C., Lam, C. H., Gopalakrishnan, K., and Shenoy, R. S. 2008. Overview of candidate device technologies for storage-class memory. IBM J. Res. Develop. 52, 4, 449–464.
Cashin, E. L. 2005. Kernel korner: ATA over Ethernet: Putting hard drives on the LAN. Linux J. 134, 10.
Caulfield, A. M., De, A., Coburn, J., Mollow, T. I., Gupta, R. K., and Swanson, S. 2010. Moneta: A high-performance storage array architecture for next-generation, non-volatile memories. In Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'10). 385–395.
Freitas, R. F. and Wilcke, W. W. 2008. Storage-class memory: The next storage system technology. IBM J. Res. Develop. 52, 4, 439–447.
Huffman, A. 2012. NVM Express revision 1.1. Intel Corporation. http://www.nvmexpress.org/wp-content/uploads/NVM-Express-1_1.pdf.
Humphries, C., Tully, S., and Burkheimer, K. 2011. Fusion-io ioDrive performance testing. SPAWAR Systems Center Atlantic. http://www.fusionio.com/load/-media-/1ufyto/docsLibrary/WP - Navy - SSC Atlantic - Fusion-io Testing.pdf.
INCITS. 2009. Fibre Channel: Backbone 5, revision 2.00. American National Standard of Accredited Standards Committee INCITS, T11 Project 1871-D/Rev. 2.00. http://www.t11.org/ftp/t11/pub/fc/bb-5/09-056v5.pdf.
Katcher, J. 1997. Postmark: A new file system benchmark. Tech. rep. TR-3022. NetApp.
Ko, M. 2003. Technical overview of iSCSI extensions for RDMA (iSER) & Datamover architecture for iSCSI (DA). In Proceedings of the RDMA Consortium.
Lussier, D. 2004. BenchmarkSQL. http://benchmarksql.sourceforge.net/index.html.
Machek, P. 1997. Network block device (TCP version). http://nbd.sourceforge.net/.
Mellanox. 2012. Building a scalable storage with InfiniBand. Mellanox Technologies white paper. http://www.mellanox.com/related-docs/whitepapers/WP_Scalable_Storage_InfiniBand_Final.pdf.
Ousterhout, J., Agrawal, P., Erickson, D., Kozyrakis, C., Leverich, J., Mazières, D., Mitra, S., Narayanan, A., Parulkar, G., Rosenblum, M., Rumble, S. M., Stratmann, E., and Stutsman, R. 2010. The case for RAMClouds: Scalable high-performance storage entirely in DRAM. ACM SIGOPS Oper. Syst. Rev. 43, 4, 92–105.
Palekar, A., Ganapathy, N., Chadda, A., and Russell, R. D. 2001. Design and implementation of a Linux SCSI target for storage area networks. In Proceedings of the 5th Linux Showcase & Conference (ALS'01). Vol. 5, 11.
Peterson, L. L. and Davie, B. S. 2007. Computer Networks: A Systems Approach 4th Ed. Morgan Kaufmann.
Pfister, G. F. 2001. An introduction to the InfiniBand™ architecture. In High Performance Mass Storage and Parallel I/O: Technologies and Applications. Wiley/IEEE Press, 617–663.
Schneider, E. and Jandhyala, R. 2012. SAP HANA® Technical Overview. SAP AG. https://www.sap.com/bin/sapcom/downloadasset.sap-hana-technical-overview-pdf.html.
Taejin Infotech. 2012. HYBRID Appliance HHA 3804. http://www.taejin.co.kr/taejin/images/Taejin_Products.pdf.
Tate, J., Beck, P., Ibarra, H. H., Kumaravel, S., and Miklas, L. 2012. Introduction to storage area networks and system networking. IBM Redbooks Tech. rep. http://www.redbooks.ibm.com/redbooks/pdfs/sg245470.pdf.
Woodruff, B., Hefty, S., Dreier, R., and Rosenstock, H. 2005. Introduction to the InfiniBand core software. In Proceedings of the Linux Symposium. Vol. 2, 271–282.
Yang, J., Minturn, D. B., and Hady, F. 2011. When poll is better than interrupt. In Proceedings of the 9th USENIX Conference on File and Storage Technologies (FAST'11).
Yu, Y. J., Shin, D. I., Shin, W., Song, N. Y., Eom, H., and Yeom, H. Y. 2012. Exploiting peak device throughput from random access workload. In Proceedings of the 4th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage'12).
Received February 2013; revised June 2013; accepted August 2013