
cInterrupts: Calibrated Interrupts for Low-Latency Storage Devices

Paper #32

Abstract

In this paper, we show that the limited control NVMe devices have over notifications results in reduced performance or excessive interrupts. Instead of relying on heuristics to determine when to coalesce interrupts, we propose a much simpler solution: let the application tell the device when it should be notified. Our system, cInterrupts, provides this information by adding merely two bits, known as hints, to each request. We show that, combined with a new adaptive coalescing policy for NVMe, cInterrupts greatly improves performance in high-throughput workloads while providing better fairness in mixed workloads.

1 Introduction

State-of-the-art NVM Express™ (NVMe) devices expose multiple concurrent queues to allow software to fully exploit the parallelism available in high-speed non-volatile memory. These devices are capable of processing hundreds of thousands to millions of operations per second [11, 12, 14, 15, 30]. In order to communicate with an NVMe device, an application must enqueue an I/O request and wait for its completion. Dealing with multiple concurrent requests occurring in parallel is a challenge, as the requests may be completed out of order, and the device needs to notify the application of request completion. Typically, the application is notified of a completed I/O request through interrupts. However, interrupts have a high overhead and result in expensive context switches [10, 29]. With 1M operations per second, sending an interrupt for each request can result in an interrupt storm, grinding the system to a halt [18]. To address this issue, NVMe devices feature interrupt coalescing, which configures the device to send an interrupt only if there are sufficient requests based on a configurable batching setting or if a timeout is met [1]. This approach is suboptimal: if there are insufficient requests, the application must pay a latency penalty dependent on the timeout. If the timeout is set too low, the application loses the benefit of batching.

Interrupt coalescing is suboptimal because devices are forced to guess the batch size of the application, which can vary over time. A semantic gap exists between the application and device: the device cannot introspect requests to determine if the application is done sending data. Instead, the device must rely on heuristics, which, in our experiments, are very limited in NVMe and especially limit the throughput of high-throughput applications. In fact, even the Linux developers acknowledge that “while adaptive interrupt moderation might be good enough for general cases, it does not provide as good performance as you can achieve by tuning each case individually” [21].

To bridge the gap, we propose a very simple solution: have the application inform the device when it should be notified. Applications communicate with the device through hints, which are two bits that are added to each request and inform the device when the application wishes to receive a notification. In this paper, we describe cInterrupts, which enables applications to express when they wish to receive notifications from NVMe devices, reducing unnecessary interrupt traffic and improving overall performance. When applications cannot provide a hint, the kernel provides best-effort hints, and we also introduce α, a new adaptive interrupt coalescing scheme which enables interrupts to be batched automatically by the device at much finer granularities than that provided by NVMe. We evaluate NVMe’s existing interrupt coalescing mechanism and find that it is not flexible and often results in the device generating an interrupt for nearly every request. We then prototype our system and show that cInterrupts’s effectiveness at coalescing interrupts results in increased performance for high-throughput workloads by moderating the generation of interrupts, and increases fairness in settings where there is a mix of latency-sensitive and asynchronous workloads.

The contributions of this paper are:

• We present a background of NVMe and how I/O request queueing has evolved to support devices with high internal parallelism (Section 2).

• We show that naive interrupt coalescing is difficult to tune and misconfiguration leads to increased latency and degraded performance (Section 2.2).
• We describe how cInterrupts efficiently coalesces interrupts, first by using an adaptive coalescing algorithm, α, designed for NVMe, then by further using hints to deliver interrupts (Section 3).

• We evaluate cInterrupts through an emulator using real-world benchmarks and show that cInterrupts greatly improves the performance of high-throughput workloads by moderating the generation of interrupts, and increases fairness in settings where there is a mix of latency-sensitive and asynchronous workloads (Section 5).

Figure 1: Requests are submitted through NVMe submission queues and completed requests are placed in completion queues. Applications are either notified via interrupts (#3) or can poll for completions (#8).

Figure 2: Suboptimal NVMe coalescing: each workload has a static coalescing configuration (threshold, thr) with the best performance. But if we vary the workload dynamically, there is no single static configuration that can achieve optimal performance for all workloads.

2 Background and Related Work

In the next sections, we first give a brief overview of NVMe and how requests are enqueued, as well as how applications are notified via interrupts. Then, we explore how NVMe coalesces interrupts and why it is suboptimal.

2.1 NVM Express™ (NVMe)

Solid-state drives, such as those based on NAND flash or phase-change memory, are capable of massive internal parallelism and can process multiple concurrent requests at the same time. To expose that parallelism, NVM Express™ (NVMe) [1] was introduced, providing multiple queues for enqueueing and completing requests. In software, an NVMe driver can create as many submission queues (SQs) and completion queues (CQs) as the device can support. Typically, as in the Linux kernel, the driver will assign one SQ/CQ pair to a core [7]. Linux also rewrote its block subsystem to match this multi-queue paradigm [6].

Figure 1 shows how an application enqueues a request to an NVMe device and receives its completion. With multiple queues, an application can submit many concurrent requests at once. SQs can also be associated with a priority, and the device can arbitrate and prioritize high-priority queues when resources are scarce. It is important to note that NVMe queues do not have a mechanism to express ordering requirements. Instead, the standard states “host software or the associated application is required to enforce that ordering above the level of the controller” [1].

2.2 Interrupt Storms and Coalescing

The application making a request to an NVMe device must learn when the request completes and the corresponding completion has been inserted into the CQ. The application can learn about the completion either by polling, where the application polls on the CQ until an entry is present, or the device can send an interrupt, which notifies the application by forcing a context switch to an interrupt service routine (ISR), which then checks the CQ.

Since modern solid-state devices can process over 1M requests per second [15, 30], generating interrupts for each request could overwhelm the application with interrupts and context switches, known as an interrupt storm. To prevent interrupt storms, devices must moderate the interrupts they generate, or applications must resort to turning off interrupts and polling, which wastes CPU cycles and limits throughput.

NVMe moderates interrupts through the use of interrupt coalescing, which sends an interrupt only if there is a sufficient batch of items in the completion queue (the aggregation threshold), or if a deadline is reached (the aggregation time). While interrupt coalescing limits the number of interrupts generated, it can increase the latency of requests when there are insufficient items in the completion queue. Configuring these coalescing parameters is difficult: optimal aggregation times and thresholds vary depending on the workload. Figure 2 illustrates that there is no single configuration that works for all workloads by running an experiment that varies between a synchronous workload, a streaming workload, and a streaming workload with small batches.
Figure 3: Interrupt rate and latency of netperf TCP_RR when adaptive interrupt coalescing is enabled and when coalescing is disabled. Panels: (a) request latency and (b) interrupt rate, each for (i) Intel XL710 and (ii) Mellanox Cx-5, across message sizes from 64 bytes to 64K. Labels on the top figures show the differences between latencies.

Moreover, NVMe only allows the aggregation time to be set in 100µs increments [1], while devices are approaching sub-10µs latencies. This minimizes the value of interrupt coalescing, since workloads which generate only a few requests would incur unacceptably long latencies if they do not meet the aggregation threshold, causing the device to generate an interrupt only based on the deadline. As a result, interrupt coalescing is typically disabled by default in many systems, including Linux [23], because it would result in unacceptable latencies in many applications without fine tuning.

Network cards, which have long had higher request rates than storage devices, support a feature known as adaptive interrupt coalescing, which dynamically learns the batching parameters based on the running workload. Figure 3 shows the interrupt rate and latency of the netperf [16] TCP_RR benchmark on two high-speed NICs: Intel XL710 40 GbE [13] and Mellanox ConnectX-5 100 GbE [22]. On both devices, adaptive interrupt coalescing helps reduce the interrupt rate, but there is still an increase in latency even when coalescing parameters are adaptively computed, despite years of tuning by various network vendors.

In cInterrupts, we observe that the reason devices must resort to heuristic-based mechanisms to coalesce interrupts is the presence of a semantic gap between application and device. The storage device lacks the semantic context to determine when the application is finished sending a batch of requests. By adding a simple mechanism which enables the application to express that it wishes to receive a completion notification, cInterrupts is able to augment adaptive interrupt coalescing to efficiently batch at the granularity expected by the application, maximizing throughput and reducing spurious interrupts. In the next section, we describe the architecture of cInterrupts.

Figure 4: cInterrupts consists of a native, baseline coalescing strategy, α, that sits in the device. Additionally, cInterrupts supports hints, which are embedded in NVMe submission entries.

3 Architecture

cInterrupts is an interrupt coalescing strategy for low-latency devices that enables both low latency for latency-sensitive requests and high throughput for throughput-sensitive requests. It does so with two components: a more fine-grained native coalescing strategy, α, that resides entirely in the device, and a hinting mechanism that allows software and hardware to cooperate on coalescing, as shown in Figure 4.

3.1 α

Recall that the NVMe specification exposes an interrupt coalescing scheme with a timeout granularity of 100µs. Our first observation is that this granularity is not useful in a low-latency device where request completions are 10µs or lower. If a device user wishes to batch interrupts, she must at the very least set the timeout to 100µs. This means that if the coalescing threshold isn’t met, requests will be delayed by at least 100µs, as demonstrated in Figure 2.

Our second observation is that a device should coalesce interrupts that occur together, what we will call a burst. For example, if n requests complete within some ∆ interval, the device should only generate 1 IRQ for all n requests. Because we are targeting low-latency devices, ∆ is expressed in microseconds. To bound request latency, the device should also use some thr that is the maximum number of requests it will coalesce into a single interrupt.

Algorithm 1 summarizes our proposed interrupt coalescing strategy, which we call α, that leverages ∆ and thr.
Algorithm 1: α: native coalescing strategy in cInterrupts.
 1 Parameters: ∆, thr
 2 coalesced = 0;
 3 timeout = now + ∆;
 4 while true do
 5     while now < timeout do
 6         while new completion arrival do
 7             coalesced++;
 8             timeout = now + ∆;
 9             if coalesced ≥ thr then
10                 fire IRQ;
11             end
12         end
13     end
       /* end of quiescent period */
14     if coalesced > 0 then
15         fire IRQ;
16     end
17     timeout = now + ∆;
18 end

Under this coalescing strategy, a device will emit interrupts when either thr requests have completed or it has observed a completion-quiescent interval of ∆.

α behaves differently based on ∆ and thr, but in practice, similar to the existing NVMe coalescing strategy, they would be statically configured to avoid the costly overhead of a system administrator manually setting a new configuration every time a target configuration is selected. As described in Section 2, there are many existing approaches that use heuristics to adapt coalescing strategies to the workload. Exploring heuristic-based dynamic tuning of these parameters is a rich space we suggest for future investigation.

In Section 5, we explain how to determine a reasonable configuration for ∆ and thr. In particular, ∆ will depend on the interarrival time of requests.

3.2 Request Hinting

While an improvement on the existing hardware coalescing strategy, α always delays latency-sensitive requests by at least ∆. In the best case, suppose a latency-sensitive request is the only request in the system, i.e., during the read system call in a single-threaded system. Blocking on the system call implicitly forces a quiescent interval soon after the request is submitted, which means latency-sensitive requests will be delayed at most ∆. In the worst case, latency-sensitive requests can be delayed by ∆ · thr if exactly one request from another thread arrives in each ∆ time window; for example, with the ∆ = 6µs, thr = 32 configuration used in Section 5, this worst case is 6µs · 32 = 192µs.

To address this, we propose a hinting mechanism that enables the application to inform the device of latency-sensitive requests. The use of hints is critical in cInterrupts because they allow cInterrupts to adapt dynamically to the workload and generate or withhold interrupts when needed.

URGENT  BARRIER  Semantics
0       0        NONE: kernel will provide best-effort hints
0       1        BARRIER: IRQ when all previous requests finish
1       0        URGENT: IRQ now
1       1        AGNOSTIC: use α

Table 1: cInterrupts provides two hint bits, resulting in four types of requests, each with a unique interrupt semantic.

The hinting mechanism consists of two flags, which we call the URGENT and BARRIER flags. This results in four types of requests, as summarized in Table 1. We now describe each type.

NONE. When neither flag is set, the application is not providing a hint. This leaves the kernel free to speculate about the ideal coalescing strategy, and the kernel will insert flags as it sees fit, which we further describe in Section 3.3. Typically, this state appears in legacy applications that are unaware of this hinting mechanism (see AGNOSTIC below for hint-aware applications).

URGENT. When the URGENT flag is set on a request, software is requesting an immediate interrupt from the NVMe device. This flag is used for latency-sensitive requests, which, from the kernel’s perspective, means any blocking system call such as read or write.

BARRIER. The BARRIER flag acts like a memory barrier command: when it is set, software is requesting that the NVMe device fire an interrupt when all requests prior to and including this request are completed. This flag is typically used for throughput-sensitive requests, such as a batch-oriented libaio read or write. The BARRIER flag is required because the NVMe specification does not require commands to be executed in the order that they are submitted – in fact, the specification explicitly warns against this assumption [1]. In an NVMe implementation where commands are executed in the order that they are submitted, the BARRIER flag is semantically equivalent to the URGENT flag. For example, this was the case for the workloads and device that we used.

AGNOSTIC. When both the URGENT and BARRIER flags are set, this is interpreted as the AGNOSTIC state. The agnostic state differs from the NONE state because applications use it to indicate to the kernel that they are aware of the hinting mechanism. In this state, the application requests that the kernel simply use the baseline coalescing strategy, α, to deal with the request, so it is an explicit request NOT to attach an URGENT or BARRIER flag to the request. This differs from the NONE state, where the kernel can apply either the URGENT or BARRIER flag if it sees fit.
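To make the encoding concrete, the two hint bits can be represented as a small flag set. The following sketch mirrors Table 1; the names and bit positions are our own illustration rather than part of the NVMe specification (Section 3.4 notes that any reserved SQE bits could carry them).

    /* Sketch of the two hint bits from Table 1 (illustrative encoding). */
    #define CINT_HINT_URGENT   (1u << 0)   /* IRQ now                            */
    #define CINT_HINT_BARRIER  (1u << 1)   /* IRQ when all prior requests finish */

    enum cint_hint {
        CINT_NONE     = 0,                                      /* kernel supplies best-effort hints */
        CINT_BARRIER  = CINT_HINT_BARRIER,                      /* IRQ after all previous requests   */
        CINT_URGENT   = CINT_HINT_URGENT,                       /* immediate IRQ                     */
        CINT_AGNOSTIC = CINT_HINT_URGENT | CINT_HINT_BARRIER,   /* hint-aware, but just use α        */
    };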
Algorithm 2: cInterrupts interrupt coalescing strategy
 1 Parameters: ∆, thr
 2 coalesced = 0;
 3 timeout = now + ∆;
 4 urgent = false;
 5 while true do
 6     while now < timeout do
 7         while new completion arrival do
 8             coalesced++;
 9             timeout = now + ∆;
10             if URGENT || BARRIER then
11                 urgent = true;
12             end
13             if coalesced ≥ thr then
14                 fire IRQ;
15             end
16         end
           /* Generate an IRQ if there were any URGENT or BARRIER requests */
17         if urgent then
18             fire IRQ;
19         end
20     end
       /* end of quiescent period */
21     if coalesced > 0 then
22         fire IRQ;
23     end
24     timeout = now + ∆;
25 end

For example, consider a streaming workload that submits asynchronous writes to the kernel in batches of size b, where a batch is the number of requests submitted via a single system call, such as in vectored I/O. With the hinting mechanism, the application can set all requests to be AGNOSTIC and only set the last request in the stream as BARRIER. Without the hinting mechanism – for example, if it is a legacy application – all requests will have state NONE. Because the kernel does not know when the stream will end, the best it can do is coalesce within a system call; in other words, coalesce at most b requests by annotating the last request in each system call with a BARRIER flag.

Algorithm 2 shows the full coalescing strategy in cInterrupts. In particular, there is a subtle optimization in line 11: instead of firing an immediate interrupt for every URGENT or BARRIER request, the coalescing logic waits until the end of a burst to fire any urgent interrupts (line 18). This prevents unnecessary interrupts that could result in an interrupt storm.

3.3 Setting Hints

In this section, we discuss how applications should set the hints, as well as how the kernel should set hints in the case of legacy applications.

Generally, applications should set the URGENT flag for blocking system calls, but there can be exceptions. For example, many storage applications use pread for background, non-urgent work. Such blocking calls would not require the URGENT flag. As an extreme example, some research kernels rewrite the traditional blocking system calls to be asynchronous [28]. In this case, pread would not be latency-sensitive, so the application would omit the URGENT flag and instead use AGNOSTIC. In fact, in such systems, an API change makes clear whether the traditional blocking system call is asynchronous.

Generally, applications should use the BARRIER and AGNOSTIC flags for streaming, asynchronous work or any non-urgent work.

For legacy applications, the kernel can set flags while handling the relevant system calls. Unlike with hint-aware applications, with legacy applications the kernel does not have enough information to speculate across system call boundaries. For example, it would be detrimental to assume that an asynchronous system call is part of a streaming workload, because if there are no future requests, the kernel has implicitly added a latency of ∆ to the system call.

Table 2 summarizes all the system calls that could result in a request to the NVMe device. If the kernel receives a request in the NONE hint state, it can use the table to assign appropriate flags to the request. If the kernel does nothing, then α will handle all interrupt coalescing within the device.

System Call: Recommendation
(p)read(v): Assign URGENT if the fd is blocking or if the I/O is O_DIRECT.
(p)write(v): Same as (p)read(v). In particular, if it is a blocking write that only writes to the page cache before returning, then the URGENT flag is a no-op.
preadv2/pwritev2: With the RWF_NOWAIT flag, the kernel can explicitly tell whether the call is to be blocking or not. If blocking, then it should be URGENT.
io_submit: The application is always nonblocking, so use BARRIER.
fdatasync/fsync/sync/syncfs: Although the user blocks on these calls, we consider these streaming writes, so use BARRIER.
msync: With the flag MS_ASYNC, this is a no-op. When MS_SYNC is passed, this is similar to the above, so use BARRIER.

Table 2: Summary of storage I/O system calls and recommendations for how the application or kernel should tag them.
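Building on the hint values sketched earlier, the following is a minimal illustration of how the kernel might pick a best-effort hint for a legacy request, following the recommendations in Table 2. The helper and its parameters are hypothetical and do not correspond to an existing Linux interface.

    #include <stdbool.h>

    /* Hypothetical helper invoked when the kernel builds an I/O request for a
     * legacy application (hint state NONE). The mapping follows Table 2. */
    enum cint_hint cint_default_hint(bool blocking_syscall, bool odirect,
                                     bool from_io_submit, bool from_fsync)
    {
        if (from_io_submit || from_fsync)
            return CINT_BARRIER;      /* streaming or batched work            */
        if (blocking_syscall || odirect)
            return CINT_URGENT;       /* the caller is waiting on this request */
        return CINT_AGNOSTIC;         /* let α coalesce as usual               */
    }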

3.4 Discussion

Page cache flush. The kernel itself will periodically flush pages from the page cache to the device. These writes are not urgent, so they can be categorized as “streaming” writes: all writes are marked as AGNOSTIC, and the last write is marked as BARRIER.

Hint bits. In the NVMe specification, the URGENT and BARRIER flags can be communicated in any number of reserved bits in the submission queue entry (SQE) [1]. For example, in Version 1.3 of the NVMe specification, bits 13:10 in the command dword of an SQE are reserved and hence available for this protocol. Similarly, bytes 15:08 in the generic command format are still reserved and available.

Limitations. In our prototype, we did not implement a true BARRIER because we observed that our device completes requests in order for the workloads considered. To implement BARRIER, we would need to instrument the tagging mechanism in the block subsystem to keep track of completed requests.
Controlling interrupt rate. A user can intentionally or unintentionally abuse the URGENT flag by marking every request as urgent, causing an interrupt storm. To address this, we propose extending α with a lightweight heuristic based on an Exponential Weighted Moving Average (EWMA) of the interrupt rate. α would then adjust ∆ to keep the interrupt rate around 25K. Our initial prototype seemed promising, and in future work we will continue exploring dynamically adapting α with more sophisticated heuristics.
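A minimal sketch of such a rate limiter follows. The smoothing factor and the multiplicative adjustment of ∆ are our own illustration under stated assumptions, not the heuristic evaluated here; only the 25K interrupt-rate target comes from the text above.

    /* Illustrative EWMA-based throttle: grow ∆ when the smoothed interrupt
     * rate exceeds the target, shrink it when the rate is well below. */
    #define EWMA_WEIGHT      0.2       /* smoothing factor (assumed)          */
    #define TARGET_IRQ_RATE  25000.0   /* interrupts per second (from above)  */

    static double ewma_rate;           /* smoothed interrupts per second      */

    static unsigned int adjust_delta(unsigned int delta_us, double measured_rate)
    {
        ewma_rate = EWMA_WEIGHT * measured_rate + (1.0 - EWMA_WEIGHT) * ewma_rate;
        if (ewma_rate > TARGET_IRQ_RATE)
            delta_us *= 2;                         /* coalesce more aggressively */
        else if (ewma_rate < TARGET_IRQ_RATE / 2 && delta_us > 1)
            delta_us /= 2;                         /* relax toward lower latency */
        return delta_us;
    }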
4 Implementation

We explored using several existing aspects of the NVMe specification to build cInterrupts, but they were all insufficient.

First, we considered using the urgent priority queues to implement the URGENT flag. While this would have worked for the URGENT flag, there is no way to implement the semantics of the AGNOSTIC and BARRIER flags with multiple queues. There is also no way to implement α with the NVMe specification.

We also considered using special bogus commands to cause the NVMe device to generate an interrupt. The specification recommends that “commands that complete in error are not coalesced” [1]. We considered using these bogus commands to generate cInterrupts interrupts, but in two devices that we inspected [14, 30], neither device respected this aspect of the specification: interrupts for errored commands were still coalesced by the existing coalescing policy.

Instead, we prototype cInterrupts by emulating the desired coalescing strategy explained in the previous section. The key to our implementation is emulating the interrupts that a device would generate under cInterrupts.

Figure 5: We emulate cInterrupts by implementing a dedicated core that sends IPIs, which are precisely the interrupts generated by cInterrupts, to the target core.

4.1 Dedicated Core

We emulated cInterrupts by assigning a dedicated core to a target core. The target core functions normally by running applications that submit requests to the core’s NVMe submission queue. The dedicated core runs a pinned kernel thread, created in the NVMe device driver, that polls the completion queue of the target core and generates interprocessor interrupts (IPIs) based on the design in Section 3. To faithfully emulate the proposed hardware, we also disable hardware interrupts for the NVMe queue assigned to that core; in this way, the target core receives an interrupt if and only if cInterrupts would fire an interrupt. Figure 5 shows how our dedicated core emulates the proposed hardware.

4.2 Embedding Hints

In our emulation, we embed the URGENT and BARRIER flags in the blk_mq tag, which is directly passed as the command ID of the NVMe submission entry. This ID, which is 16 bits, is used to communicate unique commands between the NVMe layer and the block subsystem inside the kernel. For example, this ID is used by the blk_mq subsystem to finish a request whenever it is completed by the NVMe layer. In our setup, tag numbers never go beyond 4096 because they are limited by the size of the NVMe queue, so we embed the flags in the most significant two bits of the tag without overwriting the existing tag.

When the dedicated kernel core polls for completed entries on the completion queue of the target core, it inspects the tag to determine if any of the flags are set. Based on the configuration of the flags, the dedicated kernel core will determine whether to send an interprocessor interrupt (IPI) to the target core, as per Algorithm 2. When the target core reaps the completion entry, it unsets any flags in the tag so that the tag returns to its original value before the request is reaped in the block subsystem.
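A sketch of the tag encoding described above follows; the macro and helper names are hypothetical, and the sketch relies only on the stated facts that the command ID is 16 bits wide and that real tag values stay below 4096.

    #include <stdbool.h>
    #include <stdint.h>

    /* The two hint flags live in the top two bits of the 16-bit NVMe command ID
     * (the blk_mq tag); tags never exceed 4096, so bits 15:14 are free. */
    #define CINT_TAG_URGENT   (1u << 15)
    #define CINT_TAG_BARRIER  (1u << 14)
    #define CINT_TAG_MASK     (CINT_TAG_URGENT | CINT_TAG_BARRIER)

    static inline uint16_t cint_encode_tag(uint16_t tag, bool urgent, bool barrier)
    {
        return tag | (urgent ? CINT_TAG_URGENT : 0)
                   | (barrier ? CINT_TAG_BARRIER : 0);
    }

    static inline uint16_t cint_strip_tag(uint16_t tag)   /* restore the original tag */
    {
        return tag & (uint16_t)~CINT_TAG_MASK;
    }

    static inline bool cint_tag_is_hinted(uint16_t tag)    /* URGENT or BARRIER set */
    {
        return (tag & CINT_TAG_MASK) != 0;
    }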
Figure 6: State of the completion queue over time with OOO processing. Grayed entries are reaped entries. If there are any URGENT or BARRIER requests in an interrupt batch (first interrupt), they are processed immediately and the interrupt handler returns. The non-urgent requests are not reaped until the next interrupt, which consists only of non-urgent requests. At the end of the second IRQ, the driver rings the completion queue doorbell to signal to the device that it can reclaim the contiguous entries.

4.3 Out-of-Order Processing

We empirically observed that URGENT requests would sometimes get completed with a large batch of requests, which renders the flag less effective, because the userspace thread blocking on the system call must wait for the device driver to reap all requests in the batch.

To address this, cInterrupts also implements out-of-order processing (OOO processing) for URGENT or BARRIER requests. With OOO processing, if there are any URGENT or BARRIER requests in the interrupt context, the interrupt handler will only reap those requests and leave the remaining requests for the next context, as shown in Figure 6. The handler marks OOO-reaped requests with a special flag, so they are not re-processed by future interrupt contexts.

NONE or AGNOSTIC requests will not be reaped until a completion batch consists only of those requests. The driver also does not ring the CQ doorbell until it can reap a contiguous range of entries. thr places a bound on the number of interrupts before non-urgent requests are reaped. For example, suppose in Figure 6 that thr = 9. Then an interrupt will fire as soon as 9 entries (already reaped or otherwise) accumulate in the completion queue.

Of course, the trade-off with OOO processing is an increase in the number of interrupts generated. For example, every completion that was deferred in the first IRQ in Figure 6 is associated with two interrupts. In Section 5 we will show that OOO processing can reduce the latency of URGENT requests by 15%, but in some cases, if the workload already generates many interrupts, OOO processing does not help.
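The reaping decision described above can be sketched as follows. The structure, field names, and completion hook are invented for illustration and do not correspond to the actual Linux NVMe driver.

    #include <stdbool.h>
    #include <stdint.h>

    extern void blk_complete(uint16_t tag);   /* hypothetical completion hook */

    /* One interrupt's reaping pass. On the urgent pass, only completions whose
     * tag carries an URGENT or BARRIER hint (top two bits, Section 4.2) are
     * completed; everything else is deferred to a later interrupt. */
    struct cqe { uint16_t tag; bool reaped; };

    static void reap_pass(struct cqe *cq, int n, bool urgent_only)
    {
        for (int i = 0; i < n; i++) {
            if (cq[i].reaped)
                continue;                             /* handled by an earlier IRQ   */
            if (urgent_only && !(cq[i].tag & (3u << 14)))
                continue;                             /* defer non-urgent completion */
            blk_complete(cq[i].tag & (uint16_t)~(3u << 14));
            cq[i].reaped = true;                      /* do not re-process later     */
        }
        /* the doorbell is rung only once a contiguous prefix has been reaped */
    }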
5 Evaluation

The following questions drive our evaluation:

• How does the selection of ∆ and thr affect cInterrupts?

• How does α improve over the baseline interrupt coalescing strategy of a device?

• How does cInterrupts achieve better fairness than α with the hinting mechanism?

• How does cInterrupts affect performance in a variety of workloads with the hinting mechanism?

5.1 Methodology

Baselines. For a consistent evaluation of cInterrupts, we also implemented an emulated version of the bare-metal interrupt coalescing scheme as described in the current NVMe specification. Similar to the emulation of cInterrupts we described in Section 4, device interrupts in the bare-metal baseline are emulated with IPIs. On our main experimental machine, the IPI consistently costs 3,800 cycles to deliver, which is 1.8–2µs. Emulation also imposes cache contention because the dedicated core is continuously polling on the completion queue memory of the target core. Table 3 summarizes the overhead of emulation.

System                   bare-metal    bare-metal emulation    α
read latency (µs)        13.3 ± 4.0    16.6 ± 7.1              19.9 ± 6.9
max read tput (KIOPS)    343 ± 0.6     308 ± 1.1               305 ± 0.5

Table 3: Overhead of emulation. Results from an NVMe SQ/CQ pair on a single core.

For the results in this section, we compare cInterrupts to this bare-metal emulated baseline (which we will refer to as “baseline”), configured to generate interrupts on every request. Because this is a naive baseline, we also compare cInterrupts to α. Throughout this section, we will refer to particular configurations of α and cInterrupts in the form α(∆, thr) or cInterrupts(∆, thr).

Experimental Setup. In our experiments we use two NVMe SSDs: an Intel Optane 900P 480 GB [14] installed in a Dell PowerEdge R6415 machine equipped with two 8-core 2.1 GHz AMD EPYC 7251 CPUs and 64 GB of memory, running Ubuntu 18.04, and an Intel DC P3700 400 GB [12] installed in a Dell PowerEdge R730 machine equipped with two 14-core 2.0 GHz Intel Xeon E5-2660 v4 CPUs and 128 GB of memory, running Ubuntu 16.04. If unspecified, results are reported on the Intel Optane SSD. Both machines run Linux 5.0.8 and have hyperthreading, C-states and Turbo Boost (dynamic clock rate control) disabled. We use the maximum performance governor.

Our emulation pairs one dedicated core to one target core on the CPU local to the NVMe SSD. We could have run experiments across multiple target cores with multiple corresponding dedicated cores, but for simplicity, we run all experiments on a single core. Each core is assigned its own NVMe submission and completion queue.
For our benchmarks, we use fio [3] to generate workloads. For all synchronous workloads, we report IOPS instead of latency to maintain the consistent intuition that larger numbers are better. We verified that the latency of synchronous requests was inversely proportional to the IOPS reported.

Figure 7: Using interarrival time to determine ∆: interarrival-time CDFs of the libaio and batched libaio workloads on the Intel P3700 SSD and the Intel Optane SSD.

Figure 8: IOPS and interrupt rate for different ∆s, for the libaio and batched libaio workloads on the Intel P3700 SSD and the Intel Optane SSD.

Figure 9: Determining thr under a given ∆ (∆ = 6 for Optane and ∆ = 15 for P3700). thr is the smallest value where throughput plateaus, which is between 16–32, so we set thr = 32 for both devices.

5.2 Selection of ∆ and thr

∆ should approximate the interarrival time of requests, which depends on the workload: for example, for a single-threaded workload sending requests in a loop via the blocking read system call, the interarrival time of requests is the read latency of the device. For a multi-threaded workload sending asynchronous requests as fast as possible, the interarrival time is much lower.

Figure 7 shows the interarrival time for two types of workloads. We run the same workloads on two different devices, an Intel DC P3700 SSD [12] and an Intel Optane 900P SSD [14], to show that system administrators will pick different ∆ for different devices. The first workload is a single-threaded workload that submits read requests of size 4K with libaio and iodepth=256. The second workload is the same workload, except with batched requests.

Figure 7 shows the results. First, we observe that interarrival time differs for different workloads. When libaio submits batches, the CPU can send many more requests to the device, resulting in lower interarrival times – a 90th percentile of 1µs versus 3µs in the non-batched case for the Intel Optane SSD. For the Intel P3700 SSD, both workloads have a 99th percentile of 15µs.

Figure 8 shows both the throughput of the workloads and the interrupt rate with varying ∆. The interrupt rate clearly shows the impact of picking different ∆: we want to pick ∆ to minimize the interrupt rate without adding unnecessary delay. For example, for the Intel P3700 SSD we set ∆ = 15, and for the Intel Optane SSD we set ∆ = 6.

After fixing a ∆, thr is straightforward to select. For our devices, we sweep thr in the [0, 256) range and select the lowest thr after which throughput plateaus. As shown in Figure 9, we selected thr = 32 for both devices.

5.3 Benchmarks

Pure, single-threaded workloads. We ran two pure workloads to show how cInterrupts behaves at the extremes. The first workload is a thread that continuously submits blocking, synchronous read calls of size 4K. The second workload is a thread that continuously submits asynchronous read calls of size 4K with libaio, such as in a streaming workload, with iodepth 256 and batches of size 16.

cInterrupts and α get the same configuration per device: ∆ = 15, thr = 32 for the Intel DC P3700 and ∆ = 6, thr = 32 for the Intel Optane. The results for all three evaluated systems are shown in Figure 10.

cInterrupts performs the best in both workloads. The emulated baseline performs as well as cInterrupts in the synchronous workload, because it receives interrupts for every request; this is also the reason why the emulated baseline performs the worst in the asynchronous workload. α performs as well as cInterrupts in the asynchronous workload, because streaming workloads are essentially agnostic to ∆, and just need a high enough thr to enable coalescing. But α performs the worst in the synchronous workload because it waits ∆ after every request.
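For reference, the two pure microbenchmarks above can be approximated with fio job definitions along the following lines. Everything beyond the parameters stated in the text (4K reads, libaio, iodepth 256, batches of 16) is an assumption, including the device path, access pattern, and runtime.

    ; Sketch of the two pure workloads (assumed: random reads against the raw
    ; device with O_DIRECT; the text only fixes 4K reads, libaio, iodepth 256,
    ; and submission batches of 16).
    [global]
    filename=/dev/nvme0n1
    direct=1
    bs=4k
    rw=randread
    time_based=1
    runtime=60

    [sync-read]
    ioengine=psync

    [async-read]
    ioengine=libaio
    iodepth=256
    iodepth_batch_submit=16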
Figure 10: System throughput with pure workloads: (a) async throughput and (b) sync throughput, on the P3700 and Optane. cInterrupts performs the best in both types of workloads, on both devices.

Figure 11: Throughput under a mixed workload (async and sync throughput for baseline, α, cInterrupts, and cInterrupts+OOO; P3700 with ∆ = 15, thr = 32 and Optane with ∆ = 6, thr = 32).

Figure 12: CPU utilization for each process for the mixed workload. Both the baseline and cInterrupts + OOO allow the processes to use the CPU fairly.

Mixed workload. We run a mixed workload with 1 thread submitting synchronous reads and 1 thread submitting asynchronous reads in batches of 16, with iodepth 256. The mixed workload also motivates the use of OOO processing for URGENT requests, and the results are shown in Figure 11. Notice that, because the baseline strategy generates an interrupt for every request, it has the best synchronous throughput, but this directly causes the asynchronous thread to suffer. On the other hand, α is successful at coalescing interrupts from the asynchronous thread, but this causes the synchronous requests to suffer, because they must piggyback with the asynchronous requests to get a completion interrupt.

cInterrupts strikes a balance between these two extremes by improving the throughput of synchronous requests. This does come at the cost of reducing the throughput of the asynchronous requests, but we argue that this is more fair. Figure 12 shows the CPU utilization for each system while running the mixed workload. From the process scheduler’s point of view, a fair schedule would enable each thread to get 50% of the CPU, as is the case in the baseline, where all requests receive interrupts. However, with α, the device inadvertently prevents this fair scheduling because it withholds interrupts from the synchronous thread: the synchronous thread only gets 23% of the CPU because it is blocked.

cInterrupts uses the URGENT hint to better approximate fair scheduling, and OOO processing achieves the same CPU allocation as the baseline, due to immediate delivery of interrupts for synchronous requests.

For this mixed workload, we also sweep thr ∈ [0, 256], with the results shown in Figure 13. thr > 0 enables the baseline to coalesce interrupts. But as Figure 13 shows, this is to the detriment of the synchronous requests, whose throughput steadily declines with higher thr.

Sweeping thr also reveals an interesting characteristic of the OOO optimization. If thr > 32, OOO processing ends up harming synchronous throughput without increasing asynchronous throughput. This is because, as thr grows, it takes longer for non-urgent requests to accumulate, which gives more opportunities for URGENT requests to fire OOO interrupts. This means that each non-urgent request lives through more interrupt contexts, which effectively increases the interrupt amplification of each request. The higher interrupt rate directly contributes to the lower throughput.

Dynamic workloads. Unlike the baseline coalescing strategy and α, cInterrupts has the ability to adapt dynamically to workload changes without configuration changes. To highlight this, we run the dynamic workload shown originally in Section 2 (Figure 2). Recall that this experiment begins with a synchronous workload. After 20s, it switches to a streaming (asynchronous) workload, and after another 20s, it switches to a streaming workload with smaller batches. The results are shown in Figure 14.

As desired, cInterrupts is on the envelope of the curves, because it is able to achieve the highest throughput for each type of workload. The other systems are statically configured, so they cannot adjust to the workload, even though there might be a better static configuration for each workload. α comes close to the envelope because it can detect bursts, but it suffers during the synchronous workload, reaching only 70% of the throughput of cInterrupts.

We run another dynamic workload that begins with 4 synchronous threads submitting read requests. Every 20 seconds, a synchronous thread is replaced with a thread submitting asynchronous requests in batches of size 1, with iodepth 256. The last 20 seconds of the workload is a pure asynchronous workload. Each system is statically configured at the beginning of the experiment; we also evaluate two different values of ∆ for both α and cInterrupts. Figure 15 shows the results.
Figure 13: Mixed workload: throughput curves ((a) async, (b) sync) while varying thr. ∆ is fixed at 6.

Figure 14: Dynamic workload: each curve indicates a static configuration that can perform optimally on one of the workloads. cInterrupts is on the envelope of all curves because it can adapt dynamically.

Figure 15: Dynamic workload: throughput of asynchronous and synchronous requests as the workload changes every 20s. Each workload has (sync:async) number of threads. Aside from α(15, 32), which unfairly penalizes the synchronous requests, cInterrupts(15, 32) provides the best throughput for both sets of curves.

α(15, 32) has both the best asynchronous throughput and the worst synchronous throughput. This is directly because of the unfairness issue that we highlighted in Figure 12: α favors asynchronous threads over synchronous threads. Aside from α(15, 32), cInterrupts(15, 32) is consistently on the envelope of the synchronous curves and above the envelope of the asynchronous curves, which indicates that, while matching the synchronous throughput of the other systems, cInterrupts(15, 32) can also achieve higher throughput on the asynchronous requests. cInterrupts can do this with a combination of the underlying α coalescing interrupts for the asynchronous requests, and the URGENT flag.

Figure 15 also shows that different ∆ can change the performance of both α and cInterrupts. Higher ∆ drastically skews α in favor of asynchronous requests, but higher ∆ in cInterrupts can increase asynchronous throughput without harming the synchronous requests; notice that cInterrupts(6, 32) is already on the envelope of the synchronous curves. This is directly due to the URGENT hints.

Bursty workload. We run an experiment where a single-threaded application submits requests in a bursty manner, similar to either a web server or an embedded storage application. The size of each burst is uniformly random between 0 and 32. Figure 16 shows the throughput for different systems under different thinktimes (the time between bursts). For small thinktimes, the thread is configured to busy-spin rather than sleep during the thinktime, because otherwise the cost of entering and exiting sleep will be longer than the thinktime. This is why CPU utilization increases slightly for every system as thinktime increases from 1µs to 10µs. But with a thinktime of 100µs, the thread is able to sleep, which is why CPU usage drops dramatically for all systems.

Both configurations of cInterrupts can get better throughput because the application sets BARRIER flags that result in an immediate interrupt at the end of a burst. The throughput of cInterrupts is equivalent to the baseline throughput, which is expected because the baseline generates interrupts for every request. On the other hand, α has to wait for quiescence (which means ∆ added to the latency of every burst). This is why for ∆ = 15, the throughput of α drops dramatically.

The baseline coalescing strategy emulates the bare-metal device, which would send an interrupt on every request. Due to this behavior, it can send an interrupt immediately at the end of a burst just like cInterrupts, but pays for this with high CPU utilization, which is consistently 33% higher than the CPU utilization of cInterrupts, as shown in Figure 16.
Figure 16: Bursty workload: cInterrupts performs the same as the baseline, which sends an IRQ on every request. The baseline pays for this with high CPU utilization. We also notice a tradeoff with ∆: if it is configured higher, then cInterrupts becomes much better than α, because it can hint the end of the burst while α must wait ∆.

Although we did not show the results, the baseline coalescing strategy could have been configured to coalesce interrupts into batches of size b. However, this means that for all bursts of size n < b, the requests in the burst would have to wait a minimum of 100µs, which is the lowest possible timeout in existing NVMe devices. This shows the power of the BARRIER flag, which can precisely adjust to the current burst.

6 Related Work

A variety of systems redesign the software stack to enable applications to take full advantage of hardware performance. Many of them focus on polling and trying to avoid interrupts. For example, µDepot [19] modifies the application’s storage stack to use SPDK, a polling-based userspace event-driven execution environment, so that the application can actually drive enough IOPS to the device [2]. Linux also introduced polling and hybrid polling to its I/O stack to reduce the overhead of the software stack for NVMe devices [4, 9, 32].

[31] and [20] show that with low-latency devices, the kernel’s block subsystem accounts for a significant portion of a request’s completion latency. The authors address this by modifying the kernel’s block subsystem to work in an asynchronous, event-driven fashion. [17] prioritizes I/O requests in the kernel in order to reduce overall latency. [27] tries to minimize the overhead of scheduling delays by merging them into hardware interrupts. Similar to cInterrupts, the focus is on minimizing events that take valuable CPU time away from applications in order to drive more I/O.

IX [5] and Arrakis [25] are general-purpose kernels that aim to eliminate the overhead of the kernel when interfacing with devices. [24] reduces the overhead of interrupt processing in the kernel’s networking stack with interrupt steering. cInterrupts takes an orthogonal, complementary approach by generating interrupts precisely when applications need them, thereby reducing the time spent processing unnecessary interrupts.

Command queueing was standardized as part of the SCSI standard as Tagged Command Queueing (TCQ) [8] and the SATA standard as Native Command Queueing (NCQ) [26]. This feature enables each request to be tagged with an ordering identifier, which resembles the hints in cInterrupts. In TCQ/NCQ, however, these hints are used only for ordering and not notification. cInterrupts leverages hints to determine the boundaries of ordered batches so the device knows when the application desires an interrupt.

7 Conclusion

In this paper we show that the existing NVMe interrupt coalescing API poses a serious limitation on practical coalescing. Our main insight is that application-level hints are the best way for a device to coalesce interrupts. cInterrupts, with a combination of hints and an adaptive coalescing strategy that can detect bursts and has a smaller timeout granularity, generates interrupts exactly when a workload needs them, enabling workloads to experience better performance even in a dynamic environment. In doing so, cInterrupts enables the software stack to take full advantage of existing and future low-latency storage devices.

References

[1] NVM Express, Revision 1.3.

[2] Storage Performance Development Kit. https://spdk.io/.

[3] Jens Axboe. Flexible I/O tester. https://github.com/axboe/fio.

[4] Jens Axboe. Initial support for polled IO. https://lwn.net/Articles/663543/.

[5] Adam Belay, George Prekas, Ana Klimovic, Samuel Grossman, Christos Kozyrakis, and Edouard Bugnion. IX: A protected dataplane operating system for high throughput and low latency. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 49–65, 2014.

[6] Matias Bjørling, Jens Axboe, David Nellans, and Philippe Bonnet. Linux block IO: introducing multi-queue SSD access on multi-core systems. In Proceedings of the 6th International Systems and Storage Conference, page 22. ACM, 2013.

[7] Keith Busch. Linux NVMe driver. Flash Memory Summit, 2013.
[8] T10 Committee. SCSI Architecture Model - 3 (SAM-3). ftp://ftp.t10.org/t10/drafts/sam3/sam3r14.pdf, 2004. Accessed: September, 2019.

[9] Jonathan Corbet. Block-layer I/O polling. https://lwn.net/Articles/663879/.

[10] Daniel Gruss, Moritz Lipp, Michael Schwarz, Richard Fellner, Clémentine Maurice, and Stefan Mangard. KASLR is dead: long live KASLR. In International Symposium on Engineering Secure Software and Systems, pages 161–176. Springer, 2017.

[11] Intel Corporation. Intel Optane Technology for Data Centers. https://www.intel.com/content/www/us/en/architecture-and-technology/optane-technology/optane-for-data-centers.html. Accessed: September, 2019.

[12] Intel Corporation. Intel SSD DC P3700 Series. https://ark.intel.com/content/www/us/en/ark/products/79624/intel-ssd-dc-p3700-series-400gb-1-2-height-pcie-3-0-20nm-mlc.html, 2014. Accessed: September, 2019.

[13] Intel Corporation. Intel Ethernet Converged Network Adapter XL710. https://ark.intel.com/content/www/us/en/ark/products/83967/intel-ethernet-converged-network-adapter-xl710-qda2.html, 2016. Accessed: September, 2019.

[14] Intel Corporation. Intel Optane SSD 900P Series. https://ark.intel.com/content/www/us/en/ark/products/123626/intel-optane-ssd-900p-series-480gb-1-2-height-pcie-x4-20nm-3d-xpoint.html, 2017. Accessed: September, 2019.

[15] Intel Corporation. Intel SSD DC P4618 Series. https://ark.intel.com/content/www/us/en/ark/products/192574/intel-ssd-dc-p4618-series-6-4tb-1-2-height-pcie-3-1-x8-3d2-tlc.html, 2019. Accessed: September, 2019.

[16] Rick A. Jones. Netperf: A Network Performance Benchmark. https://github.com/HewlettPackard/netperf, 1995. Accessed: September, 2019.

[17] Sangwook Kim, Hwanju Kim, Joonwon Lee, and Jinkyu Jeong. Enlightening the I/O path: a holistic approach for application performance. In 15th USENIX Conference on File and Storage Technologies (FAST 17), pages 345–358, 2017.

[18] Avi Kivity. Wasted processing time due to nvme interrupts. https://github.com/scylladb/seastar/issues/507, 2018.

[19] Kornilios Kourtis, Nikolas Ioannou, and Ioannis Koltsidas. Reaping the performance of fast NVM storage with uDepot. In 17th USENIX Conference on File and Storage Technologies (FAST 19), pages 1–15, 2019.

[20] Gyusun Lee, Seokha Shin, Wonsuk Song, Tae Jun Ham, Jae W. Lee, and Jinkyu Jeong. Asynchronous I/O stack: a low-latency kernel I/O stack for ultra-low latency SSDs. In 2019 USENIX Annual Technical Conference (USENIX ATC 19), pages 603–616, 2019.

[21] Kan Liang, Andi Kleen, and Jesse Brandenburg. Improve Network Performance by Setting per-Queue Interrupt Moderation in Linux. https://01.org/linux-interrupt-moderation, 2017. Accessed: September, 2019.

[22] Mellanox Technologies. Mellanox ConnectX-5 VPI Adapter. https://www.mellanox.com/related-docs/prod_adapter_cards/PB_ConnectX-5_VPI_Card.pdf, 2018. Accessed: September, 2019.

[23] Alex Nln. coalescing in polling mode in 4.9. http://lists.infradead.org/pipermail/linux-nvme/2018-February/015435.html.

[24] Aleksey Pesterev, Jacob Strauss, Nickolai Zeldovich, and Robert T. Morris. Improving network connection locality on multicore systems. In Proceedings of the 7th ACM European Conference on Computer Systems, pages 337–350. ACM, 2012.

[25] Simon Peter, Jialin Li, Irene Zhang, Dan R. K. Ports, Doug Woos, Arvind Krishnamurthy, Thomas Anderson, and Timothy Roscoe. Arrakis: The operating system is the control plane. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 1–16, 2014.

[26] SATA-IO. Serial ATA Revision 3.0 (Gold) - June 2, 2009. https://sata-io.org/sites/default/files/documents/SATA%20Spec%20Rev%203%204%20PR%20FINAL.pdf, 2009. Accessed: September, 2019.

[27] Woong Shin, Qichen Chen, Myoungwon Oh, Hyeonsang Eom, and Heon Y. Yeom. OS I/O path optimizations for flash solid-state drives. In 2014 USENIX Annual Technical Conference (USENIX ATC 14), pages 483–488, 2014.

[28] Livio Soares and Michael Stumm. FlexSC: Flexible system call scheduling with exception-less system calls. In OSDI, volume 10, pages 1–8, 2010.

[29] Dan Tsafrir. The context-switch overhead inflicted by hardware interrupts (and the enigma of do-nothing loops). In Proceedings of the 2007 Workshop on Experimental Computer Science, page 4. ACM, 2007.

[30] Western Digital. Ultrastar DC SN200. https://documents.westerndigital.com/content/dam/doc-library/en_us/assets/public/western-digital/product/data-center-drives/ultrastar-dc-ha200-series/data-sheet-ultrastar-dc-sn200.pdf. Accessed: September, 2019.

[31] Qiumin Xu, Huzefa Siyamwala, Mrinmoy Ghosh, Tameesh Suri, Manu Awasthi, Zvika Guz, Anahita Shayesteh, and Vijay Balakrishnan. Performance analysis of NVMe SSDs and their implication on real world databases. In Proceedings of the 8th ACM International Systems and Storage Conference, page 6. ACM, 2015.

[32] Tom Yates. Improvements to the block layer. https://lwn.net/Articles/735275/.