Sie sind auf Seite 1von 14

Calibrated Interrupts

Paper #42

Abstract concurrent requests, simply sending interrupts for each com-


pletion could result in an interrupt storm, grinding the system
NVMe storage devices are beginning to approach the per- to a halt [36, 47]. Networking devices, which handle many
formance of networking devices. Interrupt storms, which first more I/O requests than storage devices, have had much ex-
appeared in fast networks, are beginning to appear in storage perience dealing with interrupt storms. As a result, storage
stacks, and techniques from networking are being adapted to stacks have adopted strategies from networking. NVMe de-
mitigate them. We find that interrupt coalescing, borrowed vices borrow interrupt coalescing from networking, which
from networking and adopted by NVMe is not only subop- avoids delivering interrupts until a batch of completions are
timal, but unusable for practical workloads. Instead, we pro- ready or a timeout is met. Our experience reveals that inter-
pose an adaptive coalescing strategy to replace NVMe co- rupt coalescing is unusable for practical storage workloads
alescing and further observe that networks and storage are because it is difficult, if not impossible, to select batch sizes
fundamentally different. We exploit the fact that in storage, or timeouts that do not cause excessive latency or stall the
every I/O is a result of a request from from software. Our storage stack. In fact, NVMe interrupt coalescing is disabled
system, cinterrupts, calibrates interrupts by enabling software by default in Linux and real deployments work around it (§2).
to annotate each request with when it should be interrupted.
Our first contribution is implementing adaptive coalescing
Our results show that by calibrating interrupts, cinterrupts
for NVMe, a heuristic-based dynamic approach from network-
improves throughput and latency by as much as 64% and 77%
ing that tries to adjust batching and timeouts based on the
in microbenchmarks and up to 14% and 28% in macrobench-
workload, and find that it still adds unnecessary latency to
marks, respectively, when compared to other systems.
requests (§2.1). After attempting to adapt network strategies
for storage, we find that storage itself is fundamentally dif-
1 Introduction ferent from networking: each storage I/O is the result of a
The request-response messaging pattern is a basic commu- request from software, whereas in networking, I/O is often
nication pattern used in computer systems. In storage, ap- the result of unsolicited packets arriving on the network. With
plications send requests to storage devices, which respond this frame of mind, the semantic gap becomes clear: in stor-
by completing requests when data is read or written. Since age, the I/O requester knows when it should be notified of
devices may take some time to retrieve or persist data, the completions, but there is no way to express this information
operating system or application will usually schedule other to the device. As a result, the device is left to use heuristics to
work until it learns about the device’s completions, typically guess when to notify software (§2.2).
through an interrupt, which context switches out of the cur- Our solution to this problem is simple: bridge the semantic
rently running task so the requesting application or kernel can gap by enabling the requestor to inform the device of when
process the completion. it wants to be interrupted. We call this technique calibrat-
While interrupts enable concurrency and deliver comple- ing1 interrupts (or simply, cinterrupts), achieved by adding
tions to the requestor quickly, the costs of interrupts and the two simple bits to requests sent to the device. With calibrated
context switches they produce are well documented in the interrupts, interrupt storms are avoided while applications still
literature [22, 51]. In storage, these costs have gained atten- receive completions they need in a timely manner (§3).
tion as new interconnects such as NVM ExpressTM (NVMe) We build an emulator for cinterrupts in Linux 5.0.8 which
have begun to expose parallelism available in next generation runs on real NVMe hardware (§4). In our experiments, we
storage technologies. NVMe, for example, enables applica- show that by delivering interrupts when they are actually
tions to not only submit millions of requests per second, but needed, cinterrupts can maintain low latency for applica-
up to 65,535 concurrent requests [24, 27, 29, 32, 53], a great tions that require it while providing high throughput to other
increase from traditional disks which have only been able to
generate hundreds of interrupts per second. With so many 1 To calibrate: to adjust precisely for a particular function [45].

1
applications. In microbenchmarks, cinterrupts can improve application
throughput and latency by up to 64% and 77% compared to blk req IRQ blk req poll
kernel
other techniques. Evaluating on RocksDB and YCSB bench-
marks, cinterrupts can improve the throughput and latency of device
applications by as much as 14% and 28%, respectively. On the SQ CQ ... SQ CQ
other hand, alternative techniques favored specific workloads
at the expense of others (§5).
Figure 1: NVMe requests are submitted through submission queues
(SQ) and placed in completion queues (CQ) when done. Applications
2 Background and Motivation are notified of completions via either interrupts or polling.
Disks historically had seek times in the milliseconds and pro-
duced at most hundreds of interrupts per second, which meant
interrupts worked well to enable software-level concurrency
while avoiding costly overheads. However, storage devices are second [32, 53], so networking techniques to avoid interrupts
approaching an I/O rate that would result in excessive inter- have now been adopted by storage devices. Linux NAPI-like
rupt rates, creating interrupt storms that prevent the processor interfaces [4] have been added to the Linux block I/O layer
from doing useful work. to support polling (Figure 1). SPDK [10], µDepot [37] and
Lessons from networking. Networking devices have had Arrakis [49] bypass the kernel and enable software to directly
to deal with much higher I/O rates for a long time, so strate- poll storage devices from userspace.
gies now used in storage have been adopted from the network-
ing community. For example, 100Gbps networking cards can NVMe coalescing. The NVMe specification borrows the
process over 100 million packets per second in each direc- idea of interrupt coalescing from the networking community,
tion, over 200× that of a typical NVMe device. To avoid standardizing it for storage devices [7]. With coalescing en-
bombarding the processor with interrupts, network devices abled, an interrupt will fire only if there is a sufficient thresh-
apply interrupt coalescing [52] or moderation, which waits old of items in the CQ or after a timeout. While coalescing
until a threshold of packets are available or a timeout is trig- limits the interrupt rate, it increases the latency of requests
gered. Network stacks may also employ polling [5], where when there are insufficient items in the CQ.
software queries for packets to process rather than being noti-
fied. IX [15] and DPDK [23] expose the device directly to the There are two key problems with NVMe interrupt coalesc-
application, bypassing the kernel and the need for interrupts ing. First, NVMe only allows the aggregation time to be set in
by implementing polling in userspace. Technologies such as 100µs increments [7], while devices are approaching sub 10µs
Intel’s DDIO [25] or ARM’s ACP [48] enable networking latencies. This means that requests that do not meet the thresh-
devices to write incoming data directly into processor caches, old would incur unacceptably long latencies, which render
making polling even faster by turning MMIO queries into interrupt coalescing unusable. For example, for any non-zero
cache hits. threshold value, a thread submitting requests through pread
can only submit requests one at a time, causing requests to
Storage is adapting networking techniques. Storage de-
be delayed by at least 100µs, despite only needing 10µs to
vices have only begun to catch up to network I/O rates. New
complete at the device. This renders the timeout unusable in
storage devices are built with solid state memory which can
general-purpose deployments.
sustain not only millions of requests per second, but multi-
ple concurrent requests per channel. The NVMe specifica- Second, even if the NVMe aggregation granularity were
tion [7] exposes this parallelism to software by providing smaller, NVMe interrupt coalescing is fundamentally unus-
multiple queues where requests can be submitted and com- able because it cannot adapt to the underlying workload.
pleted. Linux also rewrote its block subsystem to match this Instead, because both the threshold and timeout are stati-
multi-queue paradigm [16], and further kernel improvements cally configured, interrupt coalescing easily breaks after small
have been proposed to enable software to drive a higher re- changes in workload – for example, if the workload temporar-
quest rate [35, 39, 50]. ily cannot meet the threshold value. Again, this means the
An NVMe driver can create as many submission queues only reasonable threshold value in a general deployment is
(SQs) and completion queues (CQs) as the device can support. 0, which is no coalescing. In fact, real NVMe deployments
Typically, as in the Linux kernel, the driver will assign one resort to driver workarounds and employ non-standard pro-
SQ/CQ pair per core [17]. Figure 1 shows how an applica- prietary coalescing schemes rather than using NVMe coalesc-
tion enqueues a request to a NVMe device and receives its ing [19, 40, 42, 46]. Interrupt coalescing is disabled by default
completion. With multiple queues, an application can submit in many operating systems including Linux [18]. Throughout
many concurrent requests at once. the rest of the paper, we refer to no coalescing as the “default”
Modern NVMe devices can process over 1M requests per interrupt strategy.

2
default
Algorithm 1: Adaptive coalescing strategy in cinterrupts latency
s1 idle until i c1 s2 idle until i c2 …
1 Parameters: ∆, thr completion completion interrupt fires upon
2 coalesced = 0, timeout = now + ∆; every completion
3 while true do cint
4 while now < timeout do latency interrupt calibrated
u u to the time when the
5 while new completion arrival do s1 idle until
completion i c1 s2 idle until
completion i c2 …
associated urgent
/* burst detection: update timeout */ request finishes
6 timeout = now + ∆; adaptive
7 if ++coalesced ≥ thr then latency
interrupt
8 fire IRQ and reset; s1 idle until delay i c1 s2 idle until delay i c2 … delayed for
completion completion
coalescing
/* end of quiescent period */
if coalesced > 0 then sk CPU time to i CPU time to ck CPU time to
9 submit request k process interrupt complete request k
10 fire IRQ and reset;
11 timeout = now + ∆; Figure 2: Completion timeline for a single submission. The adaptive
strategy (last row) exhibits a semantic gap, which adds a delay to
all completions. Cinterrupts can match the latency of the default
strategy, which refers to no coalescing, with a single annotation.
2.1 Adaptive Coalescing
NVMe interrupt coalescing is unusable due to the sensitivity
when either it has observed a completion quiescent interval of
of the coalescing algorithm to workload changes. We argue
∆ or thr requests have completed. In §5, we explain how de-
that to be usable, NVMe coalescing should at a minimum
vice manufacturers and system administrators can determine
react dynamically to workloads. In this section, we present an
reasonable ∆ and thr configurations. Throughout the rest of
adaptive coalescing strategy to replace the static NVMe coa-
the paper, we refer to Algorithm 1 as the “adaptive” strategy.
lescing algorithm. We devise our adaptive strategy to behave
The adaptive strategy forms the basis for cinterrupts, which
similarly to the adaptive interrupt moderation found in many
we fully describe in §3.
high-speed network interface cards (NICs) [28, 33, 44].
The goal of interrupt coalescing is to generate a single in-
terrupt for many completed requests. Our adaptive coalescing 2.2 The Semantic Gap
strategy further optimizes this approach by observing that Even though it is good at detecting bursts, the adaptive strat-
a device should generate a single interrupt for requests that egy is still suboptimal because it adds unnecessary latency due
temporally complete together, what we call a burst. A burst to ∆. As Figure 2 shows, if there is one submission request,
is any sequence of requests where each request completes the adaptive algorithm adds ∆ delay to confirm that there
within some time unit of the previous. Detecting these bursts are no more requests. The simplest example is a 4 KB read
is the responsibility of the adaptive strategy in cinterrupts. request via pread, which has latency ∼10 µs in the default
The default NVMe strategy cannot detect bursts because its strategy and cinterrupts, but ∼16 µs in the adaptive strategy.
timeout is fixed – it can only batch completions that happen This delay is due to a semantic gap between software and
to arrive in the next timeout interval. The problem is not the device: the device sees a stream of requests and cannot de-
improved by reducing the timeout granularity, because the termine which requests require interrupts in order to unblock
key problem is that this timeout has the wrong semantics. In the application. This semantic gap exists for both the default
our adaptive coalescing strategy, the timeout, called ∆, is a and adaptive strategies – the default strategy simply side-steps
bound on the time between requests, instead of a fixed interval. the issue by generating an interrupt for every request, wasting
Then our adaptive strategy can detect bursts of variable size, as CPU cycles in interrupt handling and context switching.
long as requests continue completing within ∆ of the previous. As a generalization of the previous example, Figure 3
In this sense, this strategy adapts to the size of a burst, acting shows how the adaptive strategy behaves when there are mul-
as a dynamic toggle for the threshold in the default NVMe tiple submission requests. Even though the adaptive algorithm
strategy. Algorithm 1 summarizes our adaptive strategy. The correctly detects the end of the submission batch, there is ad-
most important line is Line 6, where the timeout is pushed out ditional ∆ delay to confirm the end of the burst. Again, the
every time a new completion arrives, enabling the dynamism delay is due to the semantic gap: in the stream of requests
that supports burst detection. Adjusting ∆ detects bursts of s1 − s4 in Figure 3, the device cannot tell which request is
different densities. the last in the stream. Without this additional information the
Finally, to bound request latency, our adaptive strategy also device cannot align the interrupt optimally.
uses a thr that is the maximum number of requests it will In fact, the semantic gap prevents even sophisticated adap-
coalesce into a single interrupt. This is especially necessary tive coalescing strategies from achieving optimal latency. For
for long-lived bursts to prevent infinite delay of request com- example, despite years of tuning by various network vendors,
pletion. With this new strategy, a device will emit interrupts the adaptive coalescing algorithms in high-speed NICs also

3
default 1) device processed first request (k=1) Intel XL710 Mellanox Cx-5
s1 s2 s3 s4 idle i c1 i c2 i c3 i c 4

latency [usec]
70
150 coalescing 66
coalescing
no coalesc no coalesc 18
interrupt calibrated 100
2) device processed to the time when 27 14
cint last request (k=4) the last request 22 22 29 30 24 21
17 14
50 9.8 11 8
1.50.2 1.3 1.5
s1 s2 s3 s4b idle i c1 c2 c3 c 4 (k=4) finishes
0

64
128
256
512
1K
2K
4K
8K
16K
32K
64K

64
128
256
512
1K
2K
4K
8K
16K
32K
64K
adaptive 3) delay expires
s1 s2 s3 s4 idle i c1 c 2 c 3 c 4 150

[int/s]
100
delay
latency 50

sk CPU time to i CPU time to ck kernel CPU time to 0


submit request k process interrupt complete request k

64
128
256
512
1K
2K
4K
8K
16K
32K
64K

64
128
256
512
1K
2K
4K
8K
16K
32K
64K
Figure 3: Completion timeline for multiple submissions. The adap- message size [bytes]
tive strategy can detect them as part of a burst, but only after the delay
expires, which confirms the end of the batch. Cinterrupts bridges this Figure 4: Latency and interrupt rate of netperf TCP_RR with no
semantic gap by marking the last request in the batch. The default coalescing and with adaptive interrupt coalescing enabled. Labels
strategy resorts to an interrupt for every completion. show the differences between latencies. Additional latency imposed
by coalescing shows that NICs also exhibit the semantic gap.

exhibit this delay problem. We create a completion timeline


the completion is expected from the device, thereby saving
similar to the one in Figure 3 by running the TCP_RR, i.e.
wasted CPU cycles. Hybrid polling, however, is a heuristic-
ping-pong, benchmark of netperf [34] with varying message
based technique that is hard to tune and can be inaccurate.
sizes. Figure 4 shows the interrupt rate and latency of the
In cinterrupts, we use a more principled approach to take
benchmark on two NICS, an Intel XL710 40 GbE [26] and a
full advantage of this semantic information that is unique to
Mellanox ConnectX-5 100 GbE [44].
the storage stack. Simply put, cinterrupts observes that in
On both devices, adaptive interrupt coalescing helps reduce storage, software can bridge this semantic gap simply telling
the interrupt rate but causes an increase in latency. In the the device when it wants to be interrupted, enabling effective
Intel NIC, coalescing results in increased latency regardless use of interrupts. cinterrupts coexists with kernel-side polling,
of message size. In the Mellanox NIC, adaptive coalescing is such as in Linux NAPI for networking [1, 5], which switches
not effective if the message size is greater than 1500 bytes, between polling and interrupts based on demand. Userspace
the maximum transmission unit for Ethernet, because the polling as in SPDK is orthogonal to our work as it bypasses
message becomes split across multiple packets. NIC vendors the kernel and requires significant changes to the application.
have tried to add even more advanced heuristics such as Intel’s
low-latency interrupts (LLI) [31, 33, 38], which tries to use
3 Cinterrupts
filters or the TCP_PSH flag to detect when packets should
be delivered with interrupts. In the end, however, users have Cinterrupts bridges the semantic gap by enabling software to
had difficulty using these heuristics [21], and they have been tell the device when a workload requires completions, thereby
dropped from recent devices [26, 28]. NICs are also missing enabling the device to calibrate interrupt generation. Cinter-
semantic knowledge of how many packets comprise a single rupts does so with the use of just two simple types: Urgent
application-level message. and Barrier. In this section, we describe both types and how
Particularly challenging for any adaptive strategy is the they eliminate the problems from the previous section.
case when a submission stream consists of a mix of latency-
sensitive and throughput-sensitive requests, what we call a 3.1 Urgent
mixed workload. In this case, it is impossible for an adaptive Urgent is used to request an interrupt for a single request: the
strategy to achieve optimal latency again because of the se- device will generate an immediate interrupt for any request
mantic gap – in a stream of requests, the device cannot tell annotated with Urgent. The primary use for Urgent is to en-
the difference between the two types of requests. able the device to calibrate interrupts for latency-sensitive
Storage is different from networking. Unlike network requests. For example, s1 in Figure 2 is marked Urgent be-
devices, where unsolicited incoming packets can arrive at line cause the application is otherwise blocked waiting for the
rate, every completion (and resulting interrupt) from a storage completion. Urgent eliminates the need for the delay in the
device is in response to an application request. vIC [12] uses adaptive strategy.
a similar observation to moderate virtual interrupts. Hybrid To demonstrate the effectiveness of Urgent, in fio [13] we
polling [54] also tries to leverage this, only polling when run a synthetic mixed workload with two threads: one submit-

4
total IOPS sync IOPS sync latency interrupts IOPS latency idle CPU interrupts
[1000s] [1000s] [µsec] [1000s] [1000s] [µsec] [%] [1000s]

4.39
450 1.15 450 75 450

1.23
1.31
300 45 90 300

4.40
1.10
2.74
0.83

0.80

0.90
1 proc

0.66
300 300 50 300 200 30 60 200

1.06
0.93

0.80
0.25
0.24
150 150 25 150 100 15 30 100
0 0 0 0 0 0 0 0
(a) (b) (c) (d)

1.45
0.85

3.42
1.18
0.71
300 45 90 300
cint default adaptive

2 procs
200 30 60 200

0.44
Figure 5: Effect of Urgent. Synthetic workload mixing libaio and 100 15 30 100
pread requests. Cinterrupts achieves optimal synchronous latency 0 0 0 0

1.01

3.51
0.88

1.14
and better throughput over the default strategy. The adaptive strategy

0.99
300 90 90 300

4 procs
achieves better overall throughput, at the expense of synchronous 200 60 60 200

0.99
latency. Labels show performance relative to cinterrupts. 100 30 30 100
0 0 0 0
total IOPS sync IOPS sync latency interrupts (a) (b) (c) (d)
[1000s] [1000s] [µsec] [1000s]
2.1 cint default adaptive
6.8
1.20
1.19
1.16

450 45 180 90
0.59
1.11

1.12
Figure 7: Effect of Barrier. Each process submits four requests per
0.44

4.0
1.04

0.32

400 30 120 60
3.2

0.59
0.25

batch. Cinterrupts always detects the end of a batch with Barrier.


2.3
0.15

0.30
1.7

0.15
350 15 60 30 Labels show performance relative to cinterrupts.
300 0 0 0
4
8
16
32
64

4
8
16
32
64

4
8
16
32
64

4
8
16
32
64

adaptive coalescing threshold parameters to throttle interrupt rate to a statically configured


(a) (b) (c) (d) value [11, 43] – cannot bridge this semantic gap. These adap-
cint adaptive
tive schemes favor the asynchronous thread with aggressive
Figure 6: Sophisticated adaptive coalescing, such as that found in coalescing, which simply overwhelms the synchronous re-
NICs, try to throttle interrupt rate with aggressive coalescing. In quests, leading to unusable synchronous latencies.
a mixed workload, increasing the coalescing threshold will only Figure 6 shows how these sophisticated adaptive schemes
increase the latency of synchronous requests proportionally to the behave at higher throughputs. We run the same experiment as
coalescing rate. Labels show performance relative to cinterrupts. in Figure 5 with higher iodepth. As the target coalescing rate
increases for the adaptive strategy, there is a corresponding
linear increase in the synchronous latency. On the other hand,
ting 4 KB read requests via libaio with iodepth=162 and one the purple line in Figure 6(c) shows that Urgent in cinterrupts
submitting 4 KB read requests via pread. In cinterrupts, the makes synchronous latency acceptable. This latency comes
latency-sensitive pread requests are annotated with Urgent, at the expense of less asynchronous throughput, as shown
which is embedded in the NVMe request that is sent to the in Figure 6(a), but we believe this is an acceptable trade-off.
device (see §4.2). Results are shown in Figure 5. §3.3 will show that Urgent can be combined with throttling
Without cinterrupts, the requests from either thread are in- to achieve more rigorous performance guarantees.
distinguishable to the device. The default strategy addresses
this problem by generating an interrupt for every request, re-
3.2 Barrier
sulting in 2.7x more interrupts than cinterrupts (Figure 5(d)).
On the other hand, with Urgent, cinterrupts calibrates inter- To calibrate interrupts for batches of requests, cinterrupts uses
rupts to the latency-sensitive pread requests, enabling low- Barrier, which marks the end of a batch and instructs the de-
latency without generating spurious interrupts that hamper vice to generate an interrupt as soon as all preceding requests
the throughput of the asynchronous thread. This results in have finished. For example, in the submission stream s1 − s4
both higher asynchronous throughput and lower latency for in Figure 3, the last request, s4 , is marked with Barrier. Barrier
the synchronous requests when compared to the default. The minimizes the interrupt rate, which is always beneficial for
adaptive strategy is unable to identify the pread requests and CPU utilization, while enabling the device to generate enough
in fact tries to minimize interrupts for all requests, resulting in interrupts so that the application is not blocked.
higher asynchronous throughput but a corresponding increase To demonstrate the effectiveness of Barrier, we run an ex-
in pread request latency (Figure 5(c)). periment in fio that generates a completion timeline similar
Even the most sophisticated adaptive coalescing schemes – to that found in Figure 3. In the experiment, we run a vari-
such as those in NICs that can adaptively change coalescing able number of threads on the same core, where each thread
is doing 4 KB random reads through libaio, submitting in
2 iodepth represents the number of in-flight requests. fixed batch sizes of 4. The trick is determining the end of the

5
default Algorithm 2: cinterrupts coalescing strategy
p1 p2 p1 p2
1 Parameters: ∆, thr
s1 s2 s3 s4 s1 s2 s3 s4 i c1 i c2 i c3 i c4 i c1 i c2 i c3 i c4 … 2 coalesced = 0, timeout = now + ∆;
3 while true do
cint associated with 4 while now < timeout do
p1 p2 p1 p2 process #1 (=p1) 5 while new completion arrival do
b
s1 s2 s3 sb4 s1 s2 s3 s4 i c1 c2 c3 c4 i c1 c2 c3 c4 … 6 timeout = now + ∆;
associated with 7 if completion type == Urgent then
process #2 (=p2)
8 if ooo processing is enabled then
adaptive
p1 p2 p1 p2 /* only urgent requests */
9 fire urgent IRQ;
s1 s2 s3 s4 s1 s2 s3 s4 idle till last delay i c1 c2 c3 c4 c1 c2 c 3 c4 …
is processed 10 else
/* process all requests */
11 fire IRQ and reset coalesced;
Figure 8: Completion timeline for two threads submitting request 12 if completion type == Barrier then
batches. The adaptive strategy experiences CPU idleness both be- 13 fire IRQ and reset coalesced;
cause of the delay and because it waits to process any completions 14 else
until they all arrive. On the other hand, due to Barrier, cinterrupts 15 if ++coalesced ≥ thr then
can process each batch as soon as it completes. 16 fire IRQ and reset coalesced;

/* end of quiescent period */


17 if coalesced > 0 then
batch without additional overhead, which only cinterrupts can 18 fire IRQ and reset coalesced;
do. Figure 7 shows the throughput, latency, CPU utilization,
19 timeout = now + ∆;
and interrupt rate for the experiment, with each row showing
results for a different number of threads.
Single thread. When there is a single thread, the default
strategy can deliver lower latency than cinterrupts. This is well in this last case, we showed in §3.1 that this aggregation
because there is CPU idleness and no other thread running, comes at the expense of synchronous requests.
which means there is no penalty for the excessive interrupts Note that Figure 8 is a simplification of a real execution,
generated by the default strategy (4.4x the number of inter- because it conflates time spent in userspace and the kernel,
rupts of cinterrupts). Figure 3 also shows that the default and does not show completion reordering. For example, the
strategy can process some completions in parallel with device second set of completions (dark blue) in Figure 8 can be
processing, whereas cinterrupts waits for all completions in reordered to c3 , c4 , c1 , c2 . The full cinterrupts algorithm ad-
the batch to arrive before processing. On the other hand, the ∆ dresses reordering by employing the adaptive strategy, lever-
delay in the adaptive algorithm is clear: the latency of requests aging ∆ and thr to ensure no requests get stuck. In the next
is 29 µs, compared to 22 µs with cinterrupts. section, we describe the full cinterrupts design.
Two threads. When there are two threads in the experi-
ment, the advantage of the default strategy goes away: the 3.3 Out-of-Order Urgent
3.4x number of interrupts taxes a saturated CPU. On the other The full cinterrupts interrupt generation strategy is Algo-
hand, cinterrupts generates exactly one interrupt at the end rithm 1 enhanced by Urgent and Barrier, as shown in Al-
of each batch, as shown in Figure 8, which saves the CPU gorithm 2, which replaces Lines 7–8 in Algorithm 1 with
from wasting time in expensive context switches. The saved Lines 7–16, which handle Urgent and Barrier. Requests in cin-
CPU time is used to drive the other thread: cinterrupts has the terrupts are either unmarked or marked by Urgent or Barrier.
best throughput and latency because the calibrated interrupts Unmarked requests are handled by the underlying adaptive
enable better CPU usage. algorithm, which uses ∆ and thr to generate interrupts. Of
Finally, Figure 8 shows why the adaptive strategy exhibits course, unmarked requests can still piggyback on interrupts
highest synchronous latency, which is explained by CPU idle- generated by Urgent or Barrier.
ness; the idleness comes from waiting for the device and the When running the full cinterrupts algorithm on a mixed
delay built into the algorithm to detect the end of the batch. workload, we noticed that Urgent requests sometimes get
This idleness is eliminated in the next experiment, because completed with many other completions, which renders the
there are enough threads to keep the CPU busy. flag less effective, because the userspace thread is forced to
Four threads. With four threads, the comparison between block until the driver reaps all requests in the batch, increasing
cinterrupts and the default NVMe strategy remains the same. the latency of the Urgent request.
However, at four threads, the adaptive strategy matches the To address this, cinterrupts implements out-of-order (OOO)
performance of cinterrupts because without CPU idleness, the processing, a driver-level optimization for Urgent requests.
delay is less of a factor. Although the adaptive strategy does With OOO processing, the IRQ handler will only reap Ur-
gent requests in the current interrupt context, returning im-

6
start end start end start async IOPS sync latency interrupts idle CPU
[1000s] [µsec] [1000s] [%]
CQ CQ CQ CQ

0.96
0.84
450 60 450 60

0.88
unlimited

8.97
start

0.59
0.47
end end 300 40 300 40

0.37

2.36
1.15
time 150 20 150 20
0 0 0 0
Figure 9: OOO Urgent processing. Grayed entries are reaped entries.

3.70

1.16
Urgent requests in an interrupt batch (first interrupt) are processed 300 60 300 60

rate limited

1.00
1.00
1.00

0.88
1.31
immediately, and the interrupt handler returns. The other requests are 200 40 200 40

1.23
0.87

0.87

0.74

0.01
not reaped until the next interrupt, which consists only of non-Urgent 100 20 100 20
requests. After the second IRQ, the driver rings the completion queue 0 0 0 0
doorbell to signal that the device can reclaim the contiguous range. (a) (b) (c) (d)
cint default adaptive ooocint

mediately after these requests are reaped. This enables faster Figure 10: Out-of-order (OOO) driver processing of Urgent requests
unblocking of the userspace thread that was waiting for the enables lower latency, at the expense of more interrupts. If we instead
Urgent requests. The interrupt handler leaves the remaining limit the number of asynchronous requests (bottom row), this reduces
requests for the next context, as shown in Figure 9. After reap- the overhead of OOO processing.
ing them, the IRQ handler marks OOO Urgent requests with
a special flag so they are ignored by future interrupt contexts.
4.1 Hardware modifications
Unmarked requests will not be reaped until a completion Cinterrupts modifies the hardware-software boundary to sup-
batch consists only of those requests. The driver also does not port two bits, Urgent and Barrier. The key hardware compo-
ring the CQ doorbell until it completes a contiguous range nent in cinterrupts is an NVMe device that both supports these
of entries. thr ensures non-Urgent requests are eventually bits and implements Algorithm 2 as its interrupt generation
reaped. For example, suppose in Figure 9 that thr = 9. Then strategy. Unfortunately, interrupt generation is the responsi-
an interrupt will fire as soon as 9 entries (already reaped or bility of the device firmware, which is typically a blackbox.
otherwise) accumulate in the completion queue. Because we wanted to leverage real NVMe hardware, we
chose to emulate only the interrupt generation portion of cin-
The trade-off with OOO processing is an increase in the terrupts, rather than emulating the full storage device with
number of interrupts generated. In Figure 10 we report per- tools such as flash emulators.
formance metrics from running the same mixed workload as
in Figure 6. OOO processing generates 2.4x the number of
interrupts in order to reduce the latency of synchronous re- 4.1.1 Hardware emulation
quests by almost half. The impact of the additional interrupts We explored using several existing aspects of the NVMe spec-
is noticeable in the reduced number of asynchronous IOPS, ification to emulate interrupt generation in cinterrupts, all
as shown in the first column. of which were insufficient. We considered using the urgent
priority queues to implement Urgent. While this would have
Incidentally, these additional interrupts, as well as the in-
worked for Urgent, there is still no way to implement Barrier
terrupts in the default strategy, act as an inadvertent tax on
or Algorithm 2.
the asynchronous thread. If we instead limit the number of
We also considered using special bogus commands to force
asynchronous requests, the need for these additional interrupts
the NVMe device to generate an interrupt. The specifica-
goes away. In the second row of Figure 10, we throttle the
tion recommends that “commands that complete in error are
asynchronous thread with the blkio cgroup [3] to the through-
not coalesced” [7]. Unfortunately, in two devices that we in-
put of the asynchronous thread in the default scenario (green
spected [29, 53], neither device respected this aspect of the
bar in the first row). In this case, OOO cinterrupts only gener-
specification: interrupts for errored commands were still coa-
ates 23% more interrupts, and its synchronous latency actually
lesced by the existing coalescing policy.
matches that of the default strategy.
Instead, we prototype cinterrupts by emulating interrupt
OOO processing is turned on by default in the cinterrupts generation with a dedicated sidecore that uses interprocessor
NVMe driver but can be disabled with a module parameter. interrupts (IPIs) to emulate hardware interrupts. We imple-
ment this emulation on Linux 5.0.8.
Dedicated Core. Our emulation assigns a dedicated core
4 Implementation to a target core. The target core functions normally by running
applications that submit requests to the core’s NVMe submis-
In this section, we describe the hardware and software changes sion queue, which are passed normally to the NVMe device.
necessary to support cinterrupts. The dedicated core runs a pinned kernel thread, created in the

7
target core dedicated core System call Kernel default annotations
(p)read(v), (p)write(v) Urgent if fd is blocking or if write is O_DIRECT
nvme_irq() IPI preadv2, pwritev2 If RWF_NOWAIT is not set, use Urgent
io_submit Barrier on the last request
cinterrupts fdatasync, fsync, sync, Urgent
Polling algorithm syncfs
SQ CQ msync With MS_SYNC, Barrier on the last request

Table 1: Summary of storage I/O system calls and the corresponding
NVMe controller + device
default bits used by the kernel.

Figure 11: Cinterrupts emulation: dedicated core sends IPIs, which


its children. If this superset contains both Urgent and Barrier,
emulate hardware interrupts of a real device that supports cinterrupts.
we mark the merged request as Urgent for simplicity. This
is not a correctness issue because the underlying adaptive
NVMe device driver, that polls the completion queue of the algorithm will ensure that no request gets stuck.
target core and generates IPIs based on Algorithm 2. To faith- At the block layer, any iocb bits are embedded in the
fully emulate the proposed hardware, we disable hardware blk_mq tag, which is directly passed as the command ID
interrupts for the NVMe queue assigned to that core; in this of the NVMe submission entry. This ID, which is 16 bits, is
way, the target core only receives interrupts iff ideal cinter- used to communicate unique commands between the NVMe
rupts hardware would fire an interrupt. Figure 11 shows how layer and the block subsystem inside the kernel. For example,
our dedicated core emulates the proposed interrupt generation this ID is used by the blk_mq subsystem to finish a request
scheme. Importantly, we still leverage real hardware to exe- whenever it is reaped in the NVMe driver. In our setup, tag
cute the I/O requests, and the driver still communicates with numbers never go beyond 4096 (12 bits) because they are
the NVMe device through the normal SQ/CQ pairs, but we limited by the size of the NVMe queue, so we embed Urgent
replace the hardware’s native interrupt generation mechanism and Barrier in the most significant two bits of the tag without
with the dedicated core. Section 5.1 shows that this emulation overwriting the existing tag.
has a modest 3-6% overhead. Our emulation is forced to embed the cinterrupts bits in the
command ID, because this is the only field that appears in both
4.2 Software modifications the submission and completion entries. Hence the command
ID is the only field that is visible to the polling dedicated
It is software’s responsibility to pass correct request anno- core, which only sees completion entries. In a hardware-level
tations to the device. To minimize programmer burden, our implementation of cinterrupts, Urgent and Barrier would not
implementation of cinterrupts includes a modified kernel that need to be embedded in the command ID. Instead, they can be
sets default bits. Table 1 summarizes how the kernel sets communicated in any number of reserved bits in the submis-
these defaults, based on system call. Generally, any system sion queue entry of the NVMe specification [7]. For example,
call that blocks the application, such as p∗ {read | write}v∗ , in Version 1.3 of the NVMe specification, bits 13:10 in the
will be marked Urgent and any system call that supports asyn- command dword of a SQE are reserved and hence available
chronous submission, such as io_submit, will be marked Bar- for this protocol.
rier. System calls in the sync family are blocking, so they are
When the dedicated core polls for completed entries on the
marked Urgent.
completion queue of the target core, it inspects the command
All applications we evaluate in Section 5 are unmodified
ID to determine if any bits are set. Based on whether Urgent
and use the kernel’s default annotations, which are sufficient
or Barrier is set, the dedicated kernel core will determine
in many applications. Of course, ultimately the application
whether to send an IPI to the target core, as per Algorithm 2.
has the best knowledge of when it requires interrupts, so there
When the target core handles the completion entry, it clears
are cases in which the application might override kernel de-
any cinterrupts bits in the command ID, returning it to the
faults for even better performance. For example, many storage
original blk_mq tag so that the block subsystem can complete
applications use pread for background, non-urgent work. An
the correct request. Finally, to support OOO Urgent, the driver
application can reduce its interrupt load by explicitly inform-
allocates the third most significant bit as a “completion” flag,
ing the kernel not to mark these requests with Urgent. We
which is used to prevent already-reaped Urgent requests from
leave optimizing application-level annotations to future work.
getting completed in future interrupt contexts.
In the system call handling layer, cinterrupts embeds any
bits in the iocb struct. The block layer can split or merge
requests. In the case of request split – for example, a 1M write 4.3 Discussion
will get split into several smaller write blocks – each child Other IO requests. The modified kernel in cinterrupts only
request will retain the bit of the parent. In the case of merged marks requests that are generated through a system call in-
requests, the merged request will retain a superset of the bits in terface. It does not mark requests that are generated by the

8
kernel itself, for example write requests from page cache Sync latency of 4 KB, µs
flushing or file system journalling. Because these requests are mitigations off default off off
unmarked, their interrupts are handled by the underlying adap- baremetal baremetal emulation baremetal
tive strategy. Of course, as with all other requests, they can system interrupts interrupts interrupts polling
P3700 80.2±29.0 81.1±29.1 82.4±28.2 78.4±28.1
also piggyback on other interrupts generated by the device.
Optane 10.3±1.3 11.4±1.3 10.9±1.2 8.1±1.2
Note that unmarked requests do not pose a correctness issue
– as with traditional interrupt coalescing, delayed interrupts Table 2: Overhead of emulation is comparable with the overhead of
simply increase request latency without creating correctness default mitigations. To prevent double overhead we run our system
or reliability issues. Future work will consider the effects of with mitigations disabled to compensate for the emulation overhead.
marking kernel-generated I/O requests.
Other implementations. There are multiple ways to im-
plement cinterrupts in hardware. For example, a hardware Each core is assigned its own NVMe submission and com-
implementation could enforce a stricter Barrier ordering, only pletion queue. The microbenchmarks are run on a single
releasing a Barrier interrupt if all requests in front of it in core, but we run macrobenchmarks on multiple cores. For our
the submission queue have been completed. This strict or- microbenchmarks, we use fio [13] version 3.12 to generate
dering can even be enforced in the kernel: even if the driver workloads. All of our workloads are non-buffered random
receives completion notifications through device interrupts, it access. We run the benchmarks for 60 seconds and report
can withhold the completions from userspace until all other averages of 10 runs.
requests have completed. The cinterrupts implementation in For a consistent evaluation of cinterrupts, we implemented
this paper shows that even with a relaxed implementation of an emulated version of the default strategy. Similar to the em-
Barrier, which uses the timeout from the adaptive strategy to ulation of cinterrupts we described in §4.1.1, device interrupts
cleanup, cinterrupts enjoys significant performance benefits. are also emulated with IPIs.
We explored a preliminary software implementation of the The cinterrupts emulation is lightweight and its overheads
strict Barrier in our dedicated core emulation, but its over- come from sending the IPI and cache contention from the
heads were larger than its benefit. We suspect firmware im- dedicated core continuously polling on the CQ memory of the
plementations of a strict Barrier will be more efficient. target core. Table 2 summarizes the overhead of emulation.
Urgent storm. If all requests in the system are marked as We also show the overhead of mitigations for CPU vulnerabil-
Urgent, this can inadvertently cause an interrupt storm. To ities [6]. The overhead of emulation is comparable with the
address this, cinterrupts can be configured to target a fixed overhead of the default mitigations for CPU. We disabled mit-
interrupt rate, similar to NICs, enforced with a lightweight igations in our server in order to have performance numbers
heuristic based on Exponential Weighted Moving Average close to a real server with default mitigations enabled and to
(EWMA) of the interrupt rate. inspect the performance of cinterrupts in future architectures
that do not require software-level mitigation [14].
Emulation imposes a modest 3-6% latency overhead for
5 Evaluation
both devices. There is a difference in emulation overhead
These questions drive our evaluation: What is the overhead between the devices, which we suspect is due to each device’s
of our cinterrupts emulation (§5.1)? How do admins select ∆ time lag between updating the CQ and actually sending an
and thr (§5.2)? How does cinterrupts compare to the default interrupt. As the difference between the last column and the
and the adaptive strategies in terms of latency and through- first column shows, this lag varies between devices, and the
put (§5.3)? How much does cinterrupts improve latency and longer the lag, the smaller the overhead of emulation.
throughput in a variety of applications (§5.4)? Baselines. In this section, we compare cinterrupts to our
adaptive Algorithm 1 and to the default interrupt strategy
5.1 Methodology without coalescing. Although it seems like a naive baseline,
Experimental Setup. In our experiments we use two NVMe as we have discussed, in practice system administrators cannot
SSD devices: Intel DC P3700, 400 GB [27] and Intel Optane use NVMe interrupt coalescing because its large timeout does
DC P4800X, 375 GB [30]. For brevity, we will use short not work for general deployments.
names for these SSDs: P3700 and Optane.
Both SSDs are installed in a Dell PowerEdge R730 machine 5.2 Selection of ∆ and thr
equipped with two 14-core 2.0 GHz Intel Xeon E5-2660 v4 ∆ should approximate the interarrival time of requests, which
CPUs and 128 GB of memory running Ubuntu 16.04. The depends on workload. Figure 12 shows the interarrival time
server runs cinterrupts’ modified version of Linux 5.0.8 and for two types of workloads. The first workload is a single-
has C-states, Turbo Boost (dynamic clock rate control), and threaded workload that submits read requests of size 4 KB
SMT disabled. We use the maximum performance governor. with libaio and iodepth=256. The second workload is the
Our emulation pairs one dedicated core to one target core. same workload, except with batched requests. We run the

9
CDF of P3700 CDF of Optane sync latency async IOPS async inter. async cycles
[µsec] [1000s] [1000/sec] [1000/IO]
1

1.32
0.8

1.00
1.00
1.19
150 600 600 9

15.39

1.00
1.00
1.00

1.00

0.76
0.6

P3700
0.4 100 400 400 6
libaio libaio

1.00
1.00
0.2 libaio batch libaio batch 50 200 200 3
0 0 0 0 0
0 4 8 12 16 20 0 4 8 12 16 20

1.35
1.00
1.00
interarrival time [usec] 60 600 600 9

12.39

1.00
1.00
0.74
Optane
40 400 400 6

1.56
1.00

0.99
Figure 12: Using interarrival CDF to determine ∆.

0.99
0.99
20 200 200 3
Optane throughput [KIOPS] Optane interrupts [K/s] 0 0 0 0
500 400 (a) (b) (c) (d)
libaio
400 300 libaio batch cint default adaptive ooocint
200
300 libaio 100 Figure 14: Pure workloads. Column (a) shows latency of syn-
libaio batch chronous requests (lower is better); (b), (c) and (d) show metrics for
200 0
0 5 10 15 20 25 30 35 0 5 10 15 20 25 30 35 the asynchronous workload.
threshold [thr] threshold [thr]

Figure 13: Determining thr under a fixed ∆ (∆=6 µs for Optane and three evaluated strategies is shown in Figure 14.
∆ = 15 µs for P3700). thr is the smallest value where throughput
As in §3.1, the synchronous workload shows the drawback
plateaus, which is between 16-32, so we set thr = 32 for both devices.
of the adaptive strategy, which adds exactly ∆=15 µs to re-
We omitted P3700 results as it shows virtually the same throughput
and interrupts behavior as Optane, see Figure 14 (b) and (c). quest latency for P3700 and ∆=6 µs for Optane (first column
of Figure 14). Cinterrupts remedies this with Urgent. The
default strategy performs as good as cinterrupts, because it
same workloads on two our different NVMe devices, P3700 generates an interrupt for every request. This strategy is penal-
and Optane, to show that system administrators will pick ized in the asynchronous workload, where the default strategy
different ∆ for different devices. generates 12-15x the number of interrupts as cinterrupts. On
When libaio submits batches, the CPU can send many more the other hand, both the adaptive strategy and cinterrupts use
requests to the device, resulting in lower interarrival times – a the adaptive algorithm to detect asynchronous bursts, so they
90th percentile of 1 µs in the batched case versus 6 µs in the have comparable throughput.
non-batched case for Optane. For P3700, both workloads have In summary, cinterrupts matches the synchronous latency
a 99th percentile of 15 µs. We pick ∆ to minimize the interrupt of default, while achieving up to 35% of the asynchronous
rate without adding unnecessary delay, so for P3700 we set throughput, and matches the asynchronous throughput of the
∆=15 µs and for Optane we set ∆=6 µs. After fixing ∆, thr is adaptive strategy while achieving up to 36% lower latency. Fi-
straightforward to select. For our devices, we sweep thr in the nally, we note that OOO does not add overhead to cinterrupts
[0, 256) range and select the lowest thr after which throughput performance when it is not invoked.
plateaus; the results are shown in Figure 13. thr = 32 achieves
high throughput and low interrupt rate for both devices. 5.4 Macrobenchmarks
In practice, hardware vendors would use this methodology
To evaluate the effect of cinterrupts on real applications,
to set default values to ∆ and thr for their devices, and system
we run three different application setups: RocksDB [9],
administrators could tune these values for their workloads.
KVell [41], and RocksDB and KVell colocated on the same
Specifically, no application, kernel, or users should modify
cores. RocksDB is a widely used key-value store that uses
these parameters. Once cinterrupts becomes available in hard-
pread/write system calls in its storage engine, and KVell is
ware, we expect sysadmin can set ∆ and thr with a tool such
a new key-value store designed to fully utilize NVMe band-
as nvme-cli [8].
width, employing Linux AIO in its storage engine. We run the
unmodified applications, relying on the default bits assigned
5.3 Microbenchmarks by the cinterrupts kernel (see §4.2), which means requests in
We ran two pure workloads to show how cinterrupts behaves RocksDB that go through pread/write are marked Urgent and
at the extremes. The synchronous workload submits blocking batches in KVell are marked with Barrier.
4 KB reads via pread. The asynchronous workload submits We run each application on four cores. Four separate cores
4 KB reads via libaio, such as in a streaming workload, with are isolated for the dedicated cores described in §4.1.1. In
iodepth 256 and batches of size 16; this is a CPU-bound work- KVell, an additional four cores are allocated for clients. We
load. For each device, cinterrupts and the adaptive strategy run all applications on Optane, so both cinterrupts and the
are configured with the same ∆ and thr. The results for all adaptive strategy are configured with ∆ = 6 and thr = 32, as

10
Workload Description
readrandom readwhilewriting A update heavy: 50% reads, 50% writes
98 B read mostly: 95% reads, 5% writes
degradation [%]

40
30 C read only: 100% reads
68
latency

96 F read latest: 95% reads, 5% updates


20
131 D read-modify-write: 50% reads, 50% r-m-w
10 71 71 122123 57 56 83 83 E scan mostly: 95% scans, 5% updates
0
4 threads 8 threads 4 threads 8 threads Table 3: Summary of YCSB workloads.
degradation [%]

0 a. cint vs. default b. cint vs. adaptive


32 32 41 41 36 36 45 47
throughput

5 277 387 421 391 275 277 387 421 391 275

normalized IOPS
10 39 34 42 1.00
15 0.95 379 418
266 379
259 263 261
20 27 0.90
4 threads 8 threads 4 threads 8 threads 349 351
0.85 368
cint default adaptive 0.80 default adaptive
0.75 cint cint
Figure 15: Latency of get operation and throughput in RocksDB A B C F D A B C F D
for varying workloads. We show performance degradation with 1.10
respect to cinterrupts so that we can easily compare against the 1.08 2.68 2.74 3.76
3.73

avg latency
normalized
two other strategies. Labels show absolute values in µs and KIOPS, 2.72 2.71
1.06 2.70 2.62
respectively. As expected, cinterrupts and the default strategy have 1.04 3.62 3.60
nearly the same performance, but the adaptive strategy has up to 1.02
38% worse latency and 5-16% worse throughput due to the ∆ delay. 1.00
3.52 2.58 2.51 2.57 3.52 3.52 2.58 2.51 2.57 3.52
A B C F D A B C F D
determined in §5.2. We will show that cinterrupts is the only 1.15 4.29
strategy that performs the best across all three applications. 1.12 4.42 4.41
p99 latency
normalized

1.09 6.09 6.03 6.07


1.06 5.94 4.17
4.16 3.92
5.4.1 RocksDB 1.03
1.00
5.70 4.03 3.82 3.99 5.73 5.70 4.03 3.82 3.99 5.73
Using db_bench [2], we load a database with 10 M key-value
A B C F D A B C F D
pairs, with 16 byte keys and 1 KB values. We run two exper-
iments from db_bench: readwhilewriting, where one thread YCSB workload
inserts keys and the rest of the threads read randomly, and
Figure 16: Throughput and latency results for YCSB on KVell. La-
then readrandom, where each thread reads randomly from the
bels show absolute throughput in KIOPS and latency in ms.
key space. By the end of the readwhilewriting experiments,
the database has 22 M key-value pairs, for a total size of 22
GB, which is the size of the database during the readrandom throughput, average latency, and 99th percentile latency for
experiments. We use the direct IO option in RocksDB. For each YCSB workload.
each experiment, we also vary the number of threads. Throughput. Cinterrupts does much better than default for
The latency of the get operation and throughput for both throughput, because default generates an interrupt for every
experiments is shown in Figure 15. As expected, for both request. In contrast, cinterrupts uses Barrier to generate an
metrics, cinterrupts and the default strategy perform the same interrupt for a single batch, which consists of 10s of requests.
because both generate interrupts for every request; in the next The difference between cinterrupts and default is more pro-
two applications, the default strategy will suffer due to this nounced for more read-heavy workloads (B, C, F), but less
behavior. On the other hand, adaptive does consistently worse pronounced for write-heavy workloads (A, D). This is because
because of its ∆ delay. With 8 threads, this delay penalty writes in KVell are less CPU intensive than reads, which mean
is amortized across more threads, which reduces the perfor- the additional interrupts have less of an effect.
mance degradation. The adaptive strategy performs similarly to cinterrupts be-
cause it is designed to detect bursts. Its delay is more pro-
5.4.2 KVell nounced in latency measurements.
Latency. The adaptive strategy has 5-8% higher average
We use workloads derived from the YCSB benchmark [20], and 99th percentile latency than cinterrupts in all workloads.
summarized in Table 3. We load 80 M key-value pairs, with Again, this is the effect of the ∆ delay, which cinterrupts reme-
24 byte keys and 1 KB item sizes for a dataset of size 80 GB. dies with Barrier. Cinterrupts latency also does better than
Each workload does 20M operations. Figure 16 shows KVell the default, where interrupt handling and context switching

11
length=10 length=250 interrupt RocksDB normalized KVell normalized
interrupt scans normalized scans normalized scheme get lat [µs] [KIOPS]
scheme [KIOPS] [KIOPS] cint 116±0.8 1.00 171±2.8 1.00
cint 48.2±1.1 1.00 2.0±0.02 1.00 default 115±0.8 0.99 153±2.0 0.89
default 41.7±1.2 0.86 1.8±0.05 0.89 adaptive 129±0.0 1.11 171±2.0 1.00
adaptive 46.9±1.1 0.97 1.9±0.08 0.98
Table 5: Results from colocated experiment: 4 RocksDB threads
Table 4: YCSB-E throughput results for KVell. Excessive interrupt and KVell. As expected, cinterrupts both has lower latency than the
generation limits default throughput to 86%-89% of cinterrupts’. adaptive strategy and higher throughput than the baseline.
Since it is designed for high-throughput workloads, the adaptive
interrupt RocksDB normalized KVell normalized
strategy can almost match the performance of cinterrupts.
scheme get lat [µs] [KIOPS]
cint 164±0 1.00 131±1 1.00
YCSB-E, scan length=10 YCSB-E, scan length=250 default 163±0 0.99 123±0 0.94
1 adaptive 174±1 1.06 124±0 0.95
0.8
CDF

0.6 Table 6: Results from colocated experiment: 8 RocksDB threads and


0.4 cint
default KVell. The performance gains of cinterrupts is reduced with respect
0.2 adaptive
0 to Table 5, because with additional RocksDB threads, the CPU is
2.8 3 3.2 3.4 3.6 3.8 4 3.5 4 4.5 5 5.5 6 both context-switching more and spending more time in userspace.
latency [msec]

Figure 17: Latency CDF of scans of length 10 and 250 in KVell. 5.4.3 Colocated applications

both add to the latency of requests and slow down the request Finally, we run RocksDB and KVell on the same cores to
submission rate. The high number of interrupts in the default see the effects of cinterrupts in consolidated datacenter envi-
strategy also add to latency variability, which is noticeable in ronments. RocksDB runs the readrandom benchmark from
the larger 99th percentile latencies. before, and KVell runs YCSB-C. We run two experiments,
varying the number of threads of the RocksDB instance. The
YCSB-E. Scans are interesting because their latency is de-
latency of RocksDB requests and the throughput of KVell is
termined by the completion of requests that can span multiple
shown in Tables 5 and 6
submission boundaries. Table 4 shows throughput results for
YCSB-E with different scan lengths, and Figure 17 shows When there are four RocksDB threads, the default strategy
latency CDFs for scans of length 10 and 250. matches the RocksDB latency of oursys, but achieves 12%
less KVell throughput due to the excessive interrupt rate. Con-
Similar to Workload C, for shorter scan lengths, cinterrupts
versely, the adaptive strategy can match the KVell throughput
achieves much better throughput than default because it avoids
of cinterrupts, but has 11% worse RocksDB latency, because
unnecessary interrupts. Similar to the other YCSB workloads,
it cannot align interrupts to the RocksDB requests.
the adaptive strategy again can almost match the throughput
of cinterrupts, because it is designed for batching. At higher As before, when there are more RocksDB threads, the effect
scan lengths, factors such as application-level queueing begin of cinterrupts is less pronounced, because the CPU spends less
affecting scan throughput. These factors affect all strategies, of its time handling interrupts and more of its time context-
which is why the benefit of cinterrupts on throughput reduces switching and in userspace. Even so, cinterrupts still achieves
modestly for higher scan lengths. a modest 5-6% higher throughput and up to 6% better latency
than the other two strategies.
Figure 17 shows that there is a notable difference in
scan latency between cinterrupts and the default. At higher
scan lengths, the latency difference becomes even more pro- 6 Conclusion
nounced, because each scan is active across multiple submis- In this paper we show that the existing NVMe interrupt co-
sion batches. When the scan length is 10, the difference in 50th alescing API poses a serious limitation on practical coalesc-
percentile latencies between default and cinterrupts is 600 µs, ing. In addition to devising an adaptive coalescing strategy
but 1200 µs when the scan length is 250. This difference is for NVMe, our main insight is that software directives are
maintained at the 99th percentile latencies. the best way for a device to generate interrupts. Cinterrupts,
Notably, there is a modest 100µs difference between cin- with a combination of Urgent, Barrier, and the adaptive burst-
terrupts and adaptive 50th percentile latencies when the scan detection strategy, generates interrupts exactly when a work-
length is 10, which goes away when the scan length is 250. load needs them, enabling workloads to experience better
The adaptive strategy does well in KVell’s asynchronous pro- performance even in a dynamic environment. In doing so,
gramming model and longer scans are able to amortize the cinterrupts enables the software stack to take full advantage
additional delay over many requests. of existing and future low-latency storage devices.

12
References [22] Daniel Gruss, Moritz Lipp, Michael Schwarz, Richard Fellner, Clé-
mentine Maurice, and Stefan Mangard. KASLR is Dead: Long Live
[1] Batch processing of network packets. https://lwn.net/Articles/
KASLR. In International Symposium on Engineering Secure Software
763056/.
and Systems (ESSoS’17). Springer, 2017.
[2] Benchmarking tools. https://github.com/facebook/rocksdb/wi
[23] DPDK Intel. Data plane development kit, 2014.
ki/Benchmarking-tools.
[24] Intel Corporation. Intel Optane Technology for Data Centers.
[3] Block io controller. https://www.kernel.org/doc/Documentatio https://www.intel.com/content/www/us/en/architecture-
n/cgroup-v1/blkio-controller.txt. and-technology/optane-technology/optane-for-data-
centers.html. Accessed: September, 2019.
[4] Block-layer I/O polling. https://lwn.net/Articles/663879/.
[25] Intel Corporation. Intel Data Direct I/O Technology (Intel DDIO):
[5] Driver porting: Network drivers. https://lwn.net/Articles/
A Primer. https://www.intel.com/content/dam/www/public/
30107/.
us/en/documents/technology-briefs/data-direct-i-o-
[6] Hardware vulnerabilities, The Linux kernel user’s and administrator’s technology-brief.pdf, 2012. Accessed: May 2020.
guide. https://www.kernel.org/doc/html/latest/admin-guid [26] Intel Corporation. Intel Ethernet Converged Network Adapter XL710.
e/hw-vuln/index.html. https://ark.intel.com/content/www/us/en/ark/products/
83967/intel-ethernet-converged-network-adapter-xl710-
[7] NVM Express, Revision 1.3.
qda2.html, 2014. Accessed: May, 2020.
[8] NVMe management command line interface. https://github.com/
[27] Intel Corporation. Intel SSD DC P3700 Series. https://ark.intel.
linux-nvme/nvme-cli.
com/content/www/us/en/ark/products/79624/intel-ssd-dc-
p3700-series-400gb-1-2-height-pcie-3-0-20nm-mlc.html,
[9] RocksDB. https://github.com/facebook/rocksdb.
2014. Accessed: May, 2020.
[10] Storage Performance Development Kit. https://spdk.io/.
[28] Intel Corporation. Intel Ethernet Converged Network Adapter X550.
[11] Tuning Throughput Performance for Intel Ethernet Adapters. https://ark.intel.com/content/www/us/en/ark/products/
https://www.intel.com/content/www/us/en/support/arti 88209/intel-ethernet-converged-network-adapter-x550-
cles/000005811/network-and-i-o/ethernet-products.html, t2.html, 2016. Accessed: May, 2020.
2020. Accessed: April, 2020. [29] Intel Corporation. Intel Optane SSD 900P Series. https://
ark.intel.com/content/www/us/en/ark/products/123626/
[12] Irfan Ahmad, Ajay Gulati, and Ali Mashtizadeh. vIC: Interrupt co-
intel-optane-ssd-900p-series-480gb-1-2-height-pcie-
alescing for virtual machine storage device IO. In USENIX Annual
x4-20nm-3d-xpoint.html, 2017. Accessed: May, 2020.
Technical Conference (USENIX ATC’11), 2011.
[30] Intel Corporation. Intel Optane SSD DC P4800X Series.
[13] Jens Axboe. Flexible i/o tester. https://github.com/axboe/fio.
https://ark.intel.com/content/www/us/en/ark/products/
[14] Andrew Baumann. Hardware is the New Software. In Proceedings of 97162/intel-optane-ssd-dc-p4800x-series-375gb-1-2-
the 16th Workshop on Hot Topics in Operating Systems (HotOS’17), height-pcie-x4-3d-xpoint.html, 2017. Accessed: May, 2020.
2017. [31] Intel Corporation. Intel 82599 10 GbE Controller Datasheet. https://
www.intel.com/content/www/us/en/embedded/products/
[15] Adam Belay, George Prekas, Ana Klimovic, Samuel Grossman, Chris-
networking/82599-10-gbe-controller-datasheet.html, 2019.
tos Kozyrakis, and Edouard Bugnion. IX: A protected dataplane operat-
Accessed: May, 2020.
ing system for high throughput and low latency. In 11th USENIX Sym-
posium on Operating Systems Design and Implementation (OSDI’14), [32] Intel Corporation. Intel SSD DC P4618 Series. https://
2014. ark.intel.com/content/www/us/en/ark/products/192574/
intel-ssd-dc-p4618-series-6-4tb-1-2-height-pcie-3-1-
[16] Matias Bjørling, Jens Axboe, David Nellans, and Philippe Bonnet.
x8-3d2-tlc.html, 2019. Accessed: September, 2019.
Linux block IO: introducing multi-queue ssd access on multi-core
systems. In Proceedings of the 6th International Systems and Storage [33] Intel Corporation. Intel Ethernet Controller X540 Datasheet. http://
Conference (SYSTOR’13). ACM, 2013. www.intel.com/content/www/us/en/network-adapters/10-
gigabit-network-adapters/ethernet-x540-datasheet.html,
[17] Keith Busch. Linux nvme driver. Flash Memory Summit, 2013. 2020. Accessed: May, 2020.
[18] Keith Bush. Linux-NVME mailing list: Coalescing in polling mode [34] Rick A. Jones. Netperf: A Network Performance Benchmark. ht
in 4.9. http://lists.infradead.org/pipermail/linux-nvme/ tps://github.com/HewlettPackard/netperf, 1995. Accessed:
2018-February/015435.html, 2018. September, 2019.
[19] Keith Bush. Linux-NVME mailing list: nvme pci interrupt han- [35] Sangwook Kim, Hwanju Kim, Joonwon Lee, and Jinkyu Jeong. Enlight-
dling improvements. https://lore.kernel.org/linux-nvme/ ening the I/O path: a holistic approach for application performance. In
20191209175622.1964-1-kbusch@kernel.org/, 2019. 15th USENIX Conference on File and Storage Technologies (FAST’17),
2017.
[20] Brian F Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan,
and Russell Sears. Benchmarking cloud serving systems with YCSB. [36] Avi Kivity. Wasted processing time due to nvme interrupts. https://
In Proceedings of the 1st ACM symposium on Cloud computing. ACM, github.com/scylladb/seastar/issues/507, 2018.
2010.
[37] Kornilios Kourtis, Nikolas Ioannou, and Ioannis Koltsidas. Reaping
[21] Kevin R Fall and W Richard Stevens. TCP/IP Illustrated, Volume 1: the performance of fast NVM storage with uDepot. In 17th USENIX
The Protocols. Addison-Wesley, 2011. Conference on File and Storage Technologies (FAST’19), 2019.

13
[38] Charles M. Kozierok. The TCP/IP Guide, chapter TCP Immediate Data us/azure/virtual-machines/windows/storage-performance,
Transfer: "Push" Function. 2005. http://www.tcpipguide.com/ 2019. Accessed: May 2020.
free/t_TCPImmediateDataTransferPushFunction.htm, Ac-
cessed: May, 2020. [47] Jeffrey C. Mogul and Kadangode K. Ramakrishnan. Eliminating Re-
ceive Livelock in an Interrupt-Driven Kernel. In Proceedings of the
[39] Gyusun Lee, Seokha Shin, Wonsuk Song, Tae Jun Ham, Jae W Lee, and 1996 Annual Conference on USENIX Annual Technical Conference
Jinkyu Jeong. Asynchronous I/O stack: a low-latency kernel I/O stack (ATEC’96), 1996.
for ultra-low latency ssds. In USENIX Annual Technical Conference
(USENIX ATC’19), 2019. [48] Rikin J Nayak and Jaiminkumar B Chavda. Comparison of accelerator
coherency port (acp) and high performance port (hp) for data transfer
[40] Ming Lei. Linux-nvme mailing list: nvme-pci: check CQ after batch in ddr memory using xilinx zynq soc. In International Conference on
submission for Microsoft device. https://lore.kernel.org/li Information and Communication Technology for Intelligent Systems.
nux-nvme/20191114025917.24634-3-ming.lei@redhat.com/, Springer, 2017.
2019. Accessed: May 2020.
[49] Simon Peter, Jialin Li, Irene Zhang, Dan RK Ports, Doug Woos, Arvind
[41] Baptiste Lepers, Oana Balmau, Karan Gupta, and Willy Zwaenepoel. Krishnamurthy, Thomas Anderson, and Timothy Roscoe. Arrakis: The
KVell: the design and implementation of a fast persistent key-value operating system is the control plane. In 11th USENIX Symposium on
store. In Proceedings of the 27th ACM Symposium on Operating Operating Systems Design and Implementation (OSDI’14), 2014.
Systems Principles (SOSP’19), 2019.
[50] Woong Shin, Qichen Chen, Myoungwon Oh, Hyeonsang Eom, and
[42] Long Li. LKML: fix interrupt swamp in NVMe. https://lkml.org/
Heon Y Yeom. OS I/O path optimizations for Flash Solid-State Drives.
lkml/2019/8/20/45, 2019.
In USENIX Annual Technical Conference (USENIX ATC’14), 2014.
[43] Kan Liang, Andi Kleen, and Jesse Brandenburg. Improve Network
Performance by Setting per-Queue Interrupt Moderation in Linux. [51] Dan Tsafrir. The context-switch overhead inflicted by hardware inter-
https://01.org/linux-interrupt-moderation, 2017. Accessed: rupts (and the enigma of do-nothing loops). In Proceedings of the 2007
September, 2019. workshop on Experimental computer science. ACM, 2007.

[44] Mellanox Technologies. Mellanox ConnectX-5 VPI Adapter. [52] John Uffenbeck et al. The 80x86 family: design, programming, and
https://www.mellanox.com/related-docs/prod_adapter_ interfacing. Prentice Hall PTR, 1997.
cards/PB_ConnectX-5_VPI_Card.pdf, 2018. Accessed: September, [53] Western Digital. Ultrastar DC SN200. https://documents.we
2019. sterndigital.com/content/dam/doc-library/en_us/assets/
public/western-digital/product/data-center-drives/
[45] Merriam-Webster. "Calibrate". https://www.merriam-webster.co ultrastar-dc-ha200-series/data-sheet-ultrastar-dc-
m/dictionary/calibrate, 2020. Accessed: May, 2020. sn200.pdf. Accessed: September, 2019.

[46] Microsoft. Microsoft Documentation: Optimize performance on the [54] Tom Yates. Improvements to the block layer. https://lwn.net/
Lsv2-series virtual machines. https://docs.microsoft.com/en- Articles/735275/.

14