[Figure 2: the software stack (Storage API, SSD driver) running on top of the SDF hardware, which exposes /dev/ssd0 through /dev/ssd43, one device per flash channel, each backed by its own flash chips behind the SSD controller.]
Figure 2. The overall architecture of LOCS.

because there is only one access port in an HDD. Due to the latency of moving the disk head, using multiple threads for SSTable writing and compaction can introduce interference among requests from different threads and degrade I/O performance. However, for storage systems based on SSDs, the seek time is removed. For the SDF used in this work, since the internal flash channels are exposed to the software level, it is necessary to extend the LevelDB design to enable concurrent accesses to these channels.

First, we increase the number of Immutable MemTables in memory to make full use of SDF's 44 channels. As introduced in Section 2, in the stock LevelDB there are only two MemTables: one working MemTable and one Immutable MemTable. The entire Immutable MemTable is flushed to SSD in a single write request. If the working MemTable fills up while the Immutable MemTable is still being flushed, LevelDB will wait until the dumping ends. Since one write request cannot fill up all channels, we increase the upper limit on the number of Immutable MemTables to store more incoming data. When the working MemTable has accrued enough updates and reached the size threshold (2 MB in this work), it will generate one Immutable MemTable, set it ready for write, and send it to the scheduler. With this modification, write requests produced from multiple Immutable MemTables can be issued concurrently. We will show the impact of the number of Immutable MemTables in Section 4.

When the scheduler receives a write request, it inserts the request into a proper request queue according to the dispatching policy. As shown in Figure 2, there is one I/O request queue for each flash channel, which means that all requests in a queue are exclusively serviced by a single channel.

3.2.2 Write Traffic Control Policy

The second modification of LevelDB concerns the write traffic control policy. Traffic control means that LevelDB has a mechanism for limiting the rate of write requests from users when the number of Level 0 SSTables reaches a threshold. The purpose of this mechanism is to limit the number of Level 0 SSTables, thereby reducing the search cost associated with multiple overlapping Level 0 SSTables. In other words, write throughput is traded for read performance.

There are several thresholds that control the number of Level 0 SSTables. When the threshold kL0_CompactionTrigger (4 by default) is reached, the compaction process will be triggered. If the number of Level 0 SSTables cannot be efficiently decreased by the compaction process and reaches the second threshold, kL0_SlowdownWritesTrigger (set to 8 in the stock LevelDB), LevelDB will sleep for one millisecond to reduce the receiving rate of write requests. However, if the number of Level 0 SSTables keeps increasing and exceeds the third threshold, kL0_StopWritesTrigger (12 by default), all write requests will be blocked until the background compaction completes.

The write traffic control policy affects the write throughput significantly. When the insertion of KV pairs is blocked, the write throughput degrades substantially. Considering SDF's multiple-channel architecture, we adjust the write traffic control policy to improve the throughput. First, we increase the values of these thresholds, which were originally tuned for HDDs. When there are multiple MemTables being flushed to SSD simultaneously, the thresholds should not be triggered too frequently; by increasing them, we can enlarge the intervals between pauses. Second, we introduce an additional background thread when the slowdown condition is triggered. In a traditional HDD, all read and write I/O requests share the only disk head, so running multiple compactions concurrently would cause a random I/O problem. However, the multiple channels in SDF make it feasible to trigger multiple compactions at the same time. When user requests are blocked, a single compaction thread cannot utilize all channels effectively, i.e., some channels will be idle. By creating an additional thread to do the compaction during the pausing, we can reduce the number of files in Level 0 faster and mitigate the throughput loss brought by the pausing.
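To make the adjusted policy concrete, the threshold logic can be sketched as follows. This is a minimal simulation in our own notation, not LevelDB's actual code; the function names are ours, and the constants mirror the stock defaults quoted above.

```python
# Hypothetical sketch of the Level-0 write traffic control described
# above. Stock LevelDB defaults; Section 4 shows LOCS raising the
# slowdown threshold (e.g., to 68) for the 44-channel SDF.
L0_COMPACTION_TRIGGER = 4    # start background compaction
L0_SLOWDOWN_TRIGGER = 8      # sleep 1 ms per incoming write
L0_STOP_TRIGGER = 12         # block user writes entirely

def write_action(level0_files):
    """Decide how to treat an incoming user write."""
    if level0_files >= L0_STOP_TRIGGER:
        return "block"        # wait for background compaction
    if level0_files >= L0_SLOWDOWN_TRIGGER:
        # LOCS also launches one extra compaction thread here.
        return "slowdown"     # sleep 1 ms before accepting
    return "accept"

def may_resume(level0_files):
    """LOCS change: accept user writes again once Level 0 shrinks
    below half of kL0_CompactionTrigger, instead of waiting for all
    Level 0 SSTables to be compacted."""
    return level0_files < L0_COMPACTION_TRIGGER // 2
```

Under these constants, writes resume as soon as Level 0 drops below two files, well before the stop threshold would naturally clear.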
It should be mentioned that, if too many threads for compaction are introduced, they will interfere with normal data accesses. Our experiments show that one additional compaction thread is enough to reduce the throughput fluctuation. Additionally, we do not need to block write requests until all Level 0 SSTables are compacted. Instead, we modify LevelDB so that it accepts user write requests again once the number of Level 0 SSTables drops below half of kL0_CompactionTrigger.

3.2.3 Write Ahead Log

LevelDB maintains a single log file to recover the MemTables in main memory when accidents like power-down happen. In the original LevelDB design, the log is written to HDD using a memory-mapped file, and new updates are appended to the current log file as continuous I/O requests. This strategy of using a single log file can seriously impact the write throughput when there are intensive concurrent write requests. In this work we assume that we have a small amount of high-performance non-volatile memory (NVM), such as PCM or battery-backed DRAM, to hold these logs. Since the logs can be discarded whenever a MemTable is dumped to the SSD, the demand for storage space on the non-volatile memory is moderate. For the example of using SDF with 44 channels, we only need about 100 MB.[1]

[1] Currently the write ahead logs are stored in DRAM.

3.3 Scheduling and Dispatching Policies

With the extensions introduced in the last subsection, LevelDB can now work with the SDF. In LOCS, we find that the scheduling and dispatching policies for write requests have an impact on the throughput and working efficiency of the SSD. For example, if there are a large number of I/O requests in one channel while other channels are idle, the performance will be severely degraded. However, since we can directly control the accesses to the multiple channels of SDF, it is possible to optimize scheduling and dispatching with guiding information from LevelDB. Thus, in this subsection we study several different policies and discuss how to improve the performance of our system.

3.3.1 Round-Robin Dispatching

Our baseline dispatching policy is simple round-robin (RR) dispatching, which uniformly distributes write requests to all channels in the SSD. The write requests from LevelDB are in the granularity of an SSTable, the size of which is 2 MB. We dispatch the write requests to the channels in circular order.

This is similar to the design in a traditional hardware-based SSD controller. For example, in the Huawei SSD used in our experiments, each large write request from LevelDB is striped over multiple channels to benefit from parallel accesses. In our system, with the multiple request queues, the case is a little different: a large write request from LevelDB is issued to one channel, and its data will not be striped. Thus, the requests are distributed to the channels in the granularity of 2 MB.

Round-robin dispatching has the advantage of simplicity. Since the size of each write request is fixed, it works efficiently in the case where write requests are dominant. However, when there are intensive read and erase requests in the queues, round-robin dispatching becomes less efficient. This is because the dispatching of both of these request types is fixed. Especially when several read/erase requests are waiting in the same queue, it can result in unbalanced queue lengths.

Figure 3(a) gives an example of round-robin dispatching. Suppose that there are three channels in the SSD. The trace of 11 I/O requests is shown in Figure 3(c), including 6 write requests and 5 read requests; the third row shows the channel address of each read request. Note that the channel addresses of read requests are already determined and cannot be changed, but we have the flexibility to decide at which channel each write request should be serviced. When round-robin dispatching is applied, all write requests are dispatched to the channels uniformly in circular order. However, round-robin dispatching causes the request queue of Channel 2 to become much longer than those of the other two channels, since four consecutive read requests all fall into Channel 2. Therefore, the request queues are unbalanced, which means Channel 2 will take a longer time to handle its I/O requests than Channels 1 and 3, and it is possible that Channels 1 and 3 will be idle while Channel 2 is still busy. In this situation, the parallelism enabled by the multiple channels is not fully exploited. In fact, we have observed that such unbalanced queues can severely impact the performance. The detailed experimental results are shown and discussed in Section 4.

3.3.2 Least Weighted-Queue-Length Write Dispatching

In order to mitigate the unbalanced queue problem of round-robin dispatching, we propose a dispatching policy based on the length of the request queues. The basic idea is to maintain a table of weighted queue lengths to predict the latency of processing all requests in these queues. Since there are three types of requests (read, write, and erase) with various latencies, we assign different weights to the three types of requests. The weighted length of a queue can then be represented as follows:

    Length_weight = Σ_{i=1}^{N} W_i · Size_i        (1)

Note that N denotes the total number of requests in a queue, and W_i and Size_i represent the weight and size of each request, respectively. The weights are determined according to the corresponding latency of each type of request. The sizes of both write and erase requests are fixed, and only the size of a read request may vary.
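Equation (1) and the resulting channel choice can be sketched as follows. The request representation and the erase weight are illustrative assumptions of ours; the read and write weights of 1 and 2 follow the Figure 3 example discussed below.

```python
# Sketch of the least weighted-queue-length (LWQL) choice.
# A request is a (kind, size) pair; the weights are illustrative,
# chosen in rough proportion to read/write/erase latencies.
WEIGHT = {"read": 1, "write": 2, "erase": 16}

def weighted_length(queue):
    # Equation (1): sum of W_i * Size_i over the N queued requests.
    return sum(WEIGHT[kind] * size for kind, size in queue)

def pick_channel(queues):
    # Dispatch the next write to the queue with the least weighted length.
    return min(range(len(queues)), key=lambda c: weighted_length(queues[c]))

queues = [
    [("write", 1)],               # channel 0: one pending write
    [("read", 1)] * 4,            # channel 1: four pending reads
    [("write", 1), ("read", 1)],  # channel 2
]
# weighted lengths: 2, 4, 3 -> pick_channel(queues) returns 0
```

Round-robin would ignore the four queued reads on channel 1; LWQL steers the next write away from it.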
Then we select the channel with the least value of the weighted queue length and insert the current write request from LevelDB into that channel's queue. Unlike the round-robin dispatching policy, the least weighted-queue-length (denoted as LWQL) dispatching policy takes all three kinds of I/O requests into consideration.

Figure 3(b) is an example of using LWQL dispatching. In this example, the weights of read and write requests are 1 and 2, respectively. We notice that the dispatching of the 10th and the 11th requests differs from the RR policy, because the weighted queue length of Channel 2 is not the least when the 10th request arrives. Compared to round-robin dispatching, it is apparent that the queues become more balanced in terms of weighted queue length, and the throughput of SDF can be improved over RR dispatching.

[Figure 3: (a) round-robin and (b) least weighted-queue-length dispatching of the same trace into the request queues of Channels 1-3; (c) the trace of the 11 requests, listing each operation (W/R) and, for reads, its fixed channel address.]
Figure 3. Illustration of RR and LWQL dispatching.

3.3.3 Dispatching Optimization for Compaction

There are two types of write requests generated by LevelDB. Besides the write requests for dumping MemTables into Level 0, the other write requests are generated by compaction. We notice that the data access pattern of compaction is predictable, so we propose a dispatching optimization for compaction. As mentioned in Section 2.1, compaction merges SSTables within a certain key range, and the process includes both read and write requests. Since the channel addresses of read requests are fixed, if some input SSTables of one compaction are allocated in the same channel, they have to be read in sequence, resulting in an increase of the compaction latency. Figure 4(a) shows an example of this case. First, when a compaction process is conducted, three SSTables, denoted as Level 0 b~d (i.e., the SSTable whose key range is b~d in Level 0), Level 1 a~b, and Level 1 c~d, are read out from SDF into main memory, as shown in Step 1. Then a multi-way merge sort is performed on these SSTables to generate two new SSTables, Level 1 a~b and Level 1 c~d. After the merge operation, they are written back to the SDF, as illustrated in Step 2. If we do not allocate these SSTables carefully, it is possible that the Level 1 a~b SSTable will be assigned to Channel 1, which already contains a Level 2 a~b SSTable. This means we have to read both of these SSTables from the same channel in the next compaction, as shown in Step 3. Thus the efficiency of compaction is compromised.

It is obvious that the allocation of the new SSTables generated in the current compaction will affect the efficiency of future compactions, so write dispatching needs to be optimized for compaction. The goal is to make sure that SSTables are carefully dispatched so that SSTables with adjacent key ranges are not allocated in the same channel. Consequently, we propose a technique to optimize dispatching for SSTables generated from compaction. The dispatching policy is based on the LWQL policy and is described as follows.

- We record the channel location of each SSTable in the Manifest file. The Manifest file also contains the set of SSTables that make up each level, and the corresponding key ranges.
- For each SSTable generated from compaction, we first look for the queue with the shortest weighted queue length as the candidate. If there are any SSTables in the next level whose keys fall in the range of the four closest SSTables, this candidate queue is skipped.
- We then find the queue with the shortest weighted queue length among the remaining queues and check the key-range condition as in the previous step. This step is repeated until the condition is met.

Figure 4(b) shows an example of this policy. Step 1 is the same as in Figure 4(a). After Step 1, the compaction thread selects the candidate queue for the newly generated SSTables in Step 2. Supposing Channel 1 is selected for the Level 1 a~b SSTable according to the LWQL policy, the thread will then search the four nearest neighbors in Level 2. Consequently, it will find Level 2 a~b, which means the two would probably be merged in a future compaction. Thus Channel 1 is skipped, and the thread finds another candidate queue with the least weighted queue length. Then Channel 2 is chosen, where no neighbouring SSTables of Level 2 reside. As mentioned before, a compaction operation reads all involved neighboring SSTables into memory, so preventing SSTables with neighboring key ranges from residing in the same channel effectively improves read performance, as shown in Step 3 of Figure 4(b). Similarly, the Level 1 c~d SSTable in memory will avoid the channels holding the Level 2 a~b, c~d, e~f, and g~h SSTables. A statistical experiment shows that over 93% of the compaction operations in our workloads involve no more than five SSTables, which means that in most cases one SSTable is merged with four or fewer SSTables in the next level. Considering this fact, we set the search range to the four nearest SSTables in both the left and right directions.
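The neighbor-avoiding candidate selection can be sketched as follows. The data structures are assumptions of ours: key ranges are (lo, hi) pairs, next_level lists each next-level SSTable's range and channel, and the wlen callback stands in for the weighted queue length of Equation (1). For brevity, this sketch checks every overlapping next-level SSTable rather than only the four nearest neighbors.

```python
def overlaps(a, b):
    # Two key ranges (lo, hi) overlap unless one ends before the other starts.
    return not (a[1] < b[0] or b[1] < a[0])

def pick_channel_for_compaction(new_range, queues, next_level, wlen):
    """Sketch of the compaction-aware dispatch: try channels in order of
    increasing weighted queue length, skipping any channel that already
    holds a next-level SSTable whose key range overlaps the new SSTable."""
    for ch in sorted(range(len(queues)), key=wlen):
        conflict = any(ch == c and overlaps(new_range, r)
                       for r, c in next_level)
        if not conflict:
            return ch
    return min(range(len(queues)), key=wlen)  # fall back to plain LWQL

# Figure 4(b)-style example: the new Level 1 a~b SSTable avoids the
# channel (index 0 here) that already stores the Level 2 a~b SSTable.
next_level = [(("a", "b"), 0), (("c", "d"), 1)]
wlen = lambda ch: ch  # pretend channel 0 currently has the shortest queue
# pick_channel_for_compaction(("a", "b"), [[], [], []], next_level, wlen)
# returns 1, the second channel, mirroring the Figure 4(b) walkthrough
```

The fallback matters: if every channel conflicts, the policy degrades gracefully to plain LWQL rather than stalling.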
[Figure 4: compaction dispatching example over six channels (Ch. 1-6). In Step 1, Level 0 b~d, Level 1 a~b, and Level 1 c~d are read into main memory and merge-sorted. In Step 2, the new Level 1 a~b and c~d SSTables are written back, (a) landing in the channels that already hold the Level 2 a~b and c~d SSTables, versus (b) placed in non-conflicting channels by the optimized policy. Legend: old SSTable, new SSTable, to be erased.]

Note that such a technique cannot be applied in a traditional hardware-based SSD controller, because the controller does not have enough information from the software level. This demonstrates the flexibility of the software-hardware co-optimization provided by the scheduler in LOCS.

3.3.4 Scheduling Optimization for Erase

So far, the dispatching policy is only optimized for write requests. Besides dispatching, the scheduling of requests can also impact the performance of the system. In this subsection we discuss how to schedule erase requests to improve the throughput.

As mentioned in the previous subsections, the dispatching of erase requests cannot be adapted. However, since the erase process is not on the critical path, its scheduling can be adapted dynamically. For LSM-tree-based KV stores, erase operations only happen after compaction: the SSTables used as inputs of a compaction become useless and should be erased. A straightforward method is to recycle this storage space by erasing the SSTables right after the compaction. However, such an erase strategy can degrade performance, especially when there are intensive read operations. First, a read operation may be blocked by an erase for a long period of time due to the long erase latency. Second, the queues can become unbalanced, because the dispatching policy for both erase and read requests is fixed. An example is shown in Figure 5(a): the existence of one erase operation in Channel 1's request queue results in a long delay for the two read operations behind it.

The solution to this problem is to delay the erase requests and schedule them when there are enough write requests, because write requests can help balance the queue lengths. The key to this solution is to determine whether there are enough write requests. In LOCS, we set up a threshold THw for the ratio of write requests; the erase requests are scheduled when the ratio of write requests reaches this threshold. Note that erase requests are forced to be scheduled when the percentage of free blocks falls below a threshold. This design is similar to the garbage collection policy in a traditional hardware-based SSD controller.

Figure 5(b) shows that, by removing this erase operation from the request queue of Channel 1, the overall read delay is greatly reduced. Figure 5(c) shows a scenario where the queues contain seven write operations, each queue scheduled by the LWQL policy; it takes 12 time slots to process the requests.
When a delayed erase operation is inserted into the request queue, as shown in Figure 5(d), the LWQL policy ensures that the four write requests arriving after the erase request are inserted into the shortest queues, keeping the queues balanced without increasing the processing time. We can see that the total time to complete all operations in Figures 5(a) and 5(c) is 19 without erase scheduling, while it reduces to 15 in Figures 5(b) and 5(d) when the optimization is applied. Thus the overall throughput is improved.

[Figure 5: request queues of Channels 1-3 over time: (a) an erase blocking two reads in Channel 1, (b) the same queues with the erase removed, (c) seven writes scheduled by LWQL, (d) a delayed erase inserted while the later writes fill the shortest queues.]
Figure 5. Illustration of scheduling optimization for erase.

4. Evaluation

In this section we describe the experimental setup for the evaluation of LOCS, along with comprehensive evaluation results and analysis.

4.1 Experimental Setup

We conducted experiments on a machine equipped with SDF, the customized open-channel SSD. The configuration of the machine is described in Table 1.

    CPU      2x Intel E5620, 2.4 GHz
    Memory   32 GB
    OS       Linux 2.6.32

Table 1. Evaluation platform configuration.

We used a PCIe-based SSD, Huawei Gen3, for comparison in the evaluation. In fact, the Huawei SSD is the predecessor of SDF and shares the same flash chips with SDF, except that SDF has a redesigned controller and interface. In the Huawei SSD, data are striped over its 44 channels with a striping unit of 8 KB. Table 2 shows its specification.

    Host Interface               PCIe 1.1 x8
    Channel Count                44
    Flash Capacity per Channel   16 GB
    Channel Interface            Asynchronous 40 MHz
    NAND Type                    25 nm, MLC
    Page Size                    8 KB
    Block Size                   2 MB

Table 2. Device specification of SDF and Huawei Gen3.

In the following experiments, RR denotes the round-robin dispatching policy described in Section 3.3.1, LWQL denotes the least weighted-queue-length dispatching policy described in Section 3.3.2, and COMP denotes the dispatching optimization for compaction described in Section 3.3.3.

4.2 Evaluation Results

We compare the performance of the stock LevelDB running on the Huawei SSD with that of the optimized LevelDB running on SDF. Figure 6(a) shows the comparison of I/O throughput; the round-robin dispatching policy is used for the SDF. The channel dispatching for the Huawei SSD is implemented within its firmware, and the SSTables generated by LevelDB are striped and dispatched to all the channels uniformly. The results show that the I/O throughput is significantly improved on SDF with the optimized LevelDB for different benchmarks with various get-put request ratios; on average, the I/O throughput is improved by about 2.98x. This shows that LOCS can exploit high access parallelism even without any optimization of the scheduling and dispatching policies. In the Huawei SSD, each large (2 MB) write request from LevelDB is striped over its 44 channels with a striping unit size of 8 KB and distributed over different channels to benefit from parallel accesses. In SDF, with the multiple request queues, the case is different: a large write request from LevelDB is issued to one channel without striping, so requests are uniformly distributed to the channels in the granularity of 2 MB. In the Huawei SSD, large requests are split into multiple sub-requests and serviced in different channels. Accordingly, the requested data has to be split (for writes) or merged (for reads) during the request service, and each channel serves a larger number of smaller requests. This adds overhead and reduces throughput. Furthermore, there is additional write amplification overhead caused by garbage collection in the Huawei SSD, and the expensive garbage collection can compromise performance stability.

Figure 6(b) compares the performance of LevelDB in terms of the number of operations per second (OPs). The trend is similar to that of the I/O throughput. The improvement is about 2.84x on average, a little lower than that of the I/O throughput; the reason is that writes generated by compaction are counted in the I/O throughput but not in the calculation of OPs of LevelDB. Note that we only show results for data with a value size of 100 bytes due to space constraints. In fact, our LOCS system consistently achieves performance improvements across different data value sizes.
[Figure 6: (a) I/O throughput (MB/s) and (b) throughput (ops/sec) of RR with SDF versus the Huawei SSD, across workload get-put ratios from 1:4 to 4:1.]

In Figure 7, we study the impact of the number of Immutable MemTables in main memory on the I/O throughput of SDF. In this experiment, the get-put data ratio is set to 1:1, and the key-value size is 8 KB. Since the number of Immutable MemTables determines the number of concurrent write requests to the SDF, the I/O throughput increases proportionally with the number of MemTables. The results show that the I/O throughput saturates when the MemTable count reaches the number of flash channels. In fact, if we further increase the count, the I/O throughput of SDF could be reduced because of contention among the concurrent write requests at each flash channel. Thus, we always set the number of MemTables to the number of flash channels. The results for other configurations show the same trend.

[Figure 7: throughput (MB/s) for MemTable counts from 1 to 44.]
Figure 7. Impact of number of MemTables.

We analyze the impact of the thresholds for write request traffic control in LevelDB in the following experiments. The dispatching policy is round-robin, and the value size and get-put [...] kL0_SlowdownWritesTrigger is set to be 8 and 68, respectively. Thresh[...]

[Figure 8: throughput (MB/s) over 300 seconds for kL0_SlowdownWritesTrigger = 8 versus 68.]
Figure 8. Illustration of throughput fluctuation with different slowdown thresholds.

Note that the throughput is not always improved when the threshold increases. This is because a higher threshold results in more SSTables in Level 0; since the latency of searching data in the Level 0 SSTables increases, the overall throughput may decrease with a too-large threshold. In order to confirm this finding, we measure the throughput with different values of the threshold and compare them in Figure 9. We find that the sweet spot appears around the value of 68.

[Figure 9: throughput (MB/s) for kL0_SlowdownWritesTrigger values from 8 to 80.]
Figure 9. Impact of SlowdownWritesTrigger threshold.

We compare the throughput with a single compaction thread and with an additional compaction thread in Figure 10. We find that the fluctuation of throughput can be mitigated with this extra thread. This is because the SSTables in Level 0 can be compacted at a higher rate, so that the slowdown of write requests is alleviated. On average, the throughput is increased from 221 MB/s to about 361 MB/s. Note that the kL0_SlowdownWritesTrigger threshold is still set to be 8 in order to [...] balance the I/O intensity across channels.

[Figure 10: throughput (MB/s) over 300 seconds with a single compaction thread versus an additional compaction thread.]
[Figure 11: throughput (MB/s) over 300 seconds.]
Figure 11. Illustration of throughput fluctuation after applying the two techniques.
[Figure 12: throughput (MB/s) and number of empty channels over time.]
[Figure 13: standard deviation of the weighted queue length over 300 seconds.]

The third set of results in Figure 13 shows the deviation of queue length when the dispatching optimization for compaction is applied. As discussed in Section 3.3.3, this technique helps to address the problem of dispatching multiple SSTables, which are likely to be read in a later compaction, into the same channel. In other words, it helps to balance the intensity of read requests across different channels. Thus, the deviation of queue length is further reduced with this technique compared with the LWQL policy. Note that the improvement is not very significant, because the optimization can only be applied to the channels [...]

[Figure 14: throughput in MB/s (panels (a)-(d)) and ops/sec (panels (e)-(h)) for LWQL+COMP, LWQL, and RR at value sizes of 100 bytes, 8 KB, 64 KB, and 256 KB, across get-put ratios from 1:4 to 4:1; panel (i) shows the scheduling optimization for erase (without erase scheduling versus THw = 1 to 4) on Workloads 1 and 2.]
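The erase-deferral policy of Section 3.3.4, evaluated in Figure 14(i), can be sketched as follows. The function and parameter names are ours; we express the threshold here as a write fraction over a window of recent requests, whereas the paper sweeps THw over ratio values 1 to 4, and the free-block floor models the forced-erase case.

```python
# Sketch of the erase scheduling rule from Section 3.3.4.
# 'recent' is a window of recently arrived request kinds.

def should_flush_erases(recent, free_block_pct,
                        thw=0.5, free_floor=0.1):
    """Schedule pending erases only when writes are plentiful enough
    for LWQL to rebalance the queues (write fraction >= thw), or
    unconditionally when free blocks run low, mirroring the forced
    garbage collection of a hardware SSD controller."""
    if free_block_pct < free_floor:
        return True                    # forced: running out of blocks
    if not recent:
        return False                   # nothing to base the ratio on
    write_ratio = sum(k == "write" for k in recent) / len(recent)
    return write_ratio >= thw
```

This captures the Workload 2 failure mode directly: if the write fraction never reaches thw, erases pile up until the free-block floor forces them onto the critical path.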
improved by 21% after using the LWQL dispatching policy 5. Related Work
and is further improved to 31% after applying the optimiza- Key-Value Stores. Distributed key-value systems, such as
tion technique for compaction. It means that our optimiza- BigTable [18], Dynamo [22], and Cassandra [28], use a clus-
tion techniques can not only increase the throughput of SSD ter of storage nodes to provide high availability, scalability,
but also improve the performance of LevelDB. and fault-tolerance. LevelDB [10] used in this work runs on
In Figure 14(i), we illustrate the impact of erase schedul- a single node. By wrapping the client-server support around
ing operations with two workloads of varying read/write ra- the LevelDB, LevelDB can be employed as the underlying
tios. The first workload (Workload 1) can be divided into two single-node component in a distributed environment, such
phases: the equal-read-write phase with a get-put request ra- as Tair [6], Riak [5] and HyperDex [23]. In order to meet the
tio of 1 and the write-dominant phase with a get-put request high throughput and low latency demands of applications,
ratio of 1/4. The second workload (Workload 2) has the read- several recent works, such as FlashStore [20], SkimpyS-
modify-write pattern with a get-put ratio of around 1. The tash [21], and SILT [30], explore the key-value store de-
baseline policy is to erase blocks right after the compaction sign on the flash-based storage. But all of the existing works
without intelligent erase scheduling. This is compared to the use the conventional SSDs that hide its internal parallelism
erase scheduling policy described in Section 3.3.4 with var- to the software. Our work focuses on how to use the open-
ious thresholds T Hw on the ratio of write requests. We can channel SSD efficiently with the scheduling and dispatching
see that the throughput is improved for Workload 1 when techniques, and is complementary to the techniques aimed at
using our scheduling policy with a T Hw less than 4. This consistency and node failure.
is because significantly varied read/write ratios provide the Log-Structured Merge Trees. LSM-tree [32] borrows
opportunity for the policy to clearly identify a time period its design philosophy from the log-structured file-system
to perform erase operations for higher throughput. However, (LFS) [35]. The LSM-tree accumulates recent updates in
for Workload 2, the throughput is reduced with our schedul- memory, flushes the changes to the disk sequentially in
ing policy, as the write intensity rarely reaches the threshold batches, and merges on-disk components periodically to re-
for performing erase operations. Thus, most erase operations duce the disk seek costs. LevelDB [10] is one of the repre-
are postponed until free blocks are used up and have to be sentative implementations of LSM trees. TableFS [34] uses
performed on the critical path of the request service. LevelDB as part of the stacked filesystem to store small
metadata. Recently, several similar data structures on persistent storage were proposed. TokuDB [7] and TokuFS [24] use the Fractal Tree, which is related to cache-oblivious streaming B-trees [12]. VT-trees [36] were developed as a variant of the LSM-tree that avoids unnecessary copies of SSTables during compaction by adding pointers to old SSTables. Other LSM-tree-based KV stores, such as Cassandra [28] and HBase [9], can also make full use of software-defined flash. The scheduling and dispatching techniques proposed in Section 3.3.3 are also applicable to them.

NAND Flash-Based SSD. NAND flash-based SSDs have received widespread attention. To enhance I/O performance, most SSDs today adopt a multi-channel architecture. It is necessary to exploit parallelism at various layers of the I/O stack to achieve high SSD throughput under highly concurrent workloads. Chen et al. investigate the performance impact of SSDs' internal parallelism [19]. By effectively exploiting internal parallelism, the I/O performance of SSDs can be significantly improved. Hu et al. further explore the four levels of parallelism inside SSDs [26]. Awasthi et al. [11] describe how to run HBase [9] on a hybrid storage system consisting of HDDs and SSDs to minimize the cost without compromising throughput. However, their design doesn't utilize the parallelism of SSDs. Our design successfully exploits the channel-level parallelism in a software manner.

Some researchers have also tried to make use of the internal parallelism. Bjorling et al. propose a Linux block I/O layer design with two levels of multi-queue to reduce contention for SSD access on multi-core systems [13]. Our design also uses multiple queues, but the queues are in user space instead of in the kernel space. In addition, we propose a series of scheduling and dispatching policies optimized for LSM-tree-based KV stores. Wang et al. present an SSD I/O scheduler in the block layer [38] which utilizes the internal parallelism of SSDs by dividing the SSD logical address space into many subregions. However, the subregion division is not guaranteed to exactly match the channel layout. Moreover, they use only a round-robin method to dispatch I/O requests, which, as shown in our work, is not optimal.

More closely related to our design is SOS [25], which performs out-of-order scheduling at the software level. By rearranging I/O requests between queues for different flash chips, SOS can balance the queue sizes to improve the overall throughput. However, their software-based scheduler is actually located in the flash translation layer (FTL) and implemented inside an embedded FPGA controller, which is under the OS and part of the hardware. So they cannot get application-level information as we do. What is more, the scheduling policy they adopt considers only the number of requests in a queue, not the weighted queue length as we do in this work. Since the three types of NAND flash operations have sharply different costs, this scheduling policy makes it hard to balance the queues if there are both reads and writes in the request queue. By setting all weights to 1 in the LWQL policy, LWQL essentially degenerates into SOS and has a reduced throughput.

Several SSD designs similar to the open-channel SDF have also been proposed by industry. FusionIO's DFS [27] employs a virtualized flash storage layer. It moves the FTL from the hardware controller into the OS kernel to allow direct access to the NAND flash devices. The NVM Express specification [3] defines a new interface, in which an NVM storage device owns multiple deep queues to support concurrent operations.

6. Conclusions

The combination of LSM-tree-based KV stores and SSDs has the potential to improve I/O performance for storage systems. However, a straightforward integration of the two cannot fully exploit the high parallelism enabled by multiple channels in SSDs. We find that the I/O throughput can be significantly improved if access to the channels inside the SSD is exposed to KV stores. The experimental results show that the I/O throughput can be improved by up to 2.98×.

With such a storage system, the scheduling and dispatching policies for requests from the KV stores have an important impact on the throughput. Thus, we propose several optimization techniques for the scheduling and dispatching policies of the software-level scheduler. With these techniques, the I/O throughput of LOCS can be further improved by about 39% on average.

Acknowledgments

We are grateful to the anonymous reviewers and our shepherd, Christian Cachin, for their valuable feedback and comments. We thank the engineers at Baidu, especially Chunbo Lai, Yong Wang, Wei Qi, and Hao Tang, for their helpful suggestions. This work was supported by the National Natural Science Foundation of China (No. 61202072) and the National High-tech R&D Program of China (No. 2013AA013201), as well as by generous grants from Baidu and AMD.

References

[1] Apache CouchDB. http://couchdb.apache.org/.
[2] HyperLevelDB. http://hyperdex.org/performance/leveldb/.
[3] NVM Express explained. http://nvmexpress.org/wp-content/uploads/2013/04/NVM_whitepaper.pdf.
[4] Redis. http://redis.io/.
[5] Riak. http://basho.com/leveldb-in-riak-1-2/.
[6] Tair. http://code.taobao.org/p/tair/src/.
[7] TokuDB: MySQL performance, MariaDB performance. http://www.tokutek.com/products/tokudb-for-mysql/.
[8] Tokyo Cabinet: A modern implementation of DBM. http://fallabs.com/tokyocabinet/.
[9] Apache HBase. http://hbase.apache.org/.
[10] LevelDB: A fast and lightweight key/value database library by Google. http://code.google.com/p/leveldb/.
[11] A. Awasthi, A. Nandini, A. Bhattacharya, and P. Sehgal. Hybrid HBase: Leveraging flash SSDs to improve cost per throughput of HBase. In Proceedings of the 18th International Conference on Management of Data (COMAD), pages 68–79, 2012.
[12] M. A. Bender, M. Farach-Colton, J. T. Fineman, Y. R. Fogel, B. C. Kuszmaul, and J. Nelson. Cache-oblivious streaming B-trees. In Proceedings of the 19th ACM Symposium on Parallel Algorithms and Architectures, SPAA '07, pages 81–92, 2007.
[13] M. Bjorling, J. Axboe, D. Nellans, and P. Bonnet. Linux block IO: Introducing multi-queue SSD access on multi-core systems. In Proceedings of the 6th International Systems and Storage Conference, SYSTOR '13, pages 22:1–22:10, 2013.
[14] B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. Commun. ACM, 13(7):422–426, July 1970.
[15] R. Cattell. Scalable SQL and NoSQL data stores. SIGMOD Rec., 39(4):12–27, May 2011.
[16] A. M. Caulfield, A. De, J. Coburn, T. I. Mollow, R. K. Gupta, and S. Swanson. Moneta: A high-performance storage array architecture for next-generation, non-volatile memories. In Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 43, pages 385–395, 2010.
[17] A. M. Caulfield, T. I. Mollov, L. A. Eisner, A. De, J. Coburn, and S. Swanson. Providing safe, user space access to fast, solid state disks. In Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVII, pages 387–400, 2012.
[18] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. ACM Trans. Comput. Syst., 26(2):4:1–4:26, June 2008.
[19] F. Chen, R. Lee, and X. Zhang. Essential roles of exploiting internal parallelism of flash memory based solid state drives in high-speed data processing. In High Performance Computer Architecture (HPCA), 2011 IEEE 17th International Symposium on, pages 266–277, 2011.
[20] B. Debnath, S. Sengupta, and J. Li. FlashStore: High throughput persistent key-value store. Proc. VLDB Endow., 3(1-2):1414–1425, Sept. 2010.
[21] B. Debnath, S. Sengupta, and J. Li. SkimpyStash: RAM space skimpy key-value store on flash-based storage. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, pages 25–36, 2011.
[22] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon's highly available key-value store. In Proceedings of the 21st ACM Symposium on Operating Systems Principles, SOSP '07, pages 205–220, 2007.
[23] R. Escriva, B. Wong, and E. G. Sirer. HyperDex: A distributed, searchable key-value store. SIGCOMM Comput. Commun. Rev., 42(4):25–36, Aug. 2012.
[24] J. Esmet, M. A. Bender, M. Farach-Colton, and B. C. Kuszmaul. The TokuFS streaming file system. In Proceedings of the 4th USENIX Workshop on Hot Topics in Storage and File Systems, HotStorage '12, pages 14:1–14:5, 2012.
[25] S. Hahn, S. Lee, and J. Kim. SOS: Software-based out-of-order scheduling for high-performance NAND flash-based SSDs. In Mass Storage Systems and Technologies (MSST), 2013 IEEE 29th Symposium on, pages 13:1–13:5, 2013.
[26] Y. Hu, H. Jiang, D. Feng, L. Tian, H. Luo, and C. Ren. Exploring and exploiting the multilevel parallelism inside SSDs for improved performance and endurance. Computers, IEEE Transactions on, 62(6):1141–1155, 2013.
[27] W. K. Josephson, L. A. Bongo, D. Flynn, and K. Li. DFS: A file system for virtualized flash storage. In Proceedings of the 8th USENIX Conference on File and Storage Technologies, FAST '10, pages 85–100, 2010.
[28] A. Lakshman and P. Malik. Cassandra: A decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):35–40, Apr. 2010.
[29] N. Leavitt. Will NoSQL databases live up to their promise? Computer, 43(2):12–14, 2010.
[30] H. Lim, B. Fan, D. G. Andersen, and M. Kaminsky. SILT: A memory-efficient, high-performance key-value store. In Proceedings of the 23rd ACM Symposium on Operating Systems Principles, SOSP '11, pages 1–13, 2011.
[31] C. Min, K. Kim, H. Cho, S.-W. Lee, and Y. I. Eom. SFS: Random write considered harmful in solid state drives. In Proceedings of the 10th USENIX Conference on File and Storage Technologies, FAST '12, pages 139–154, 2012.
[32] P. O'Neil, E. Cheng, D. Gawlick, and E. O'Neil. The log-structured merge-tree (LSM-tree). Acta Inf., 33(4):351–385, June 1996.
[33] J. Ouyang, S. Lin, S. Jiang, Z. Hou, Y. Wang, and Y. Wang. SDF: Software-defined flash for web-scale internet storage systems. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '14, pages 471–484, 2014.
[34] K. Ren and G. Gibson. TABLEFS: Enhancing metadata efficiency in the local file system. In Proceedings of the 2013 USENIX Annual Technical Conference, USENIX ATC '13, pages 145–156, 2013.
[35] M. Rosenblum and J. K. Ousterhout. The design and implementation of a log-structured file system. ACM Trans. Comput. Syst., 10(1):26–52, Feb. 1992.
[36] P. Shetty, R. Spillane, R. Malpani, B. Andrews, J. Seyster, and E. Zadok. Building workload-independent storage with VT-trees. In Proceedings of the 11th Conference on File and Storage Technologies, FAST '13, pages 17–30, 2013.
[37] M. Stonebraker. SQL databases v. NoSQL databases. Commun. ACM, 53(4):10–11, Apr. 2010.
[38] H. Wang, P. Huang, S. He, K. Zhou, C. Li, and X. He. A novel I/O scheduler for SSD with improved performance and lifetime. In Mass Storage Systems and Technologies (MSST), 2013 IEEE 29th Symposium on, pages 6:1–6:5, 2013.