Fast, Multicore-Scalable, Low-Fragmentation Memory Allocation
through Large Virtual Memory and Global Data Structures

Martin Aigner Christoph M. Kirsch Michael Lippautz Ana Sokolova


University of Salzburg
firstname.lastname@cs.uni-salzburg.at
arXiv:1503.09006v1 [cs.PL] 31 Mar 2015

Abstract

The problem of concurrent memory allocation is to find the right balance between temporal and spatial performance and scalability across a large range of workloads. Our contributions to address this problem are: uniform treatment of small and big objects through the idea of virtual spans, efficiently and effectively reclaiming unused memory through fast and scalable global data structures, and constant-time (modulo synchronization) allocation and deallocation operations that trade off memory reuse and spatial locality without being subject to false sharing. We have implemented an allocator, scalloc, based on these ideas that generally performs and scales in our experiments better than other allocators while using less memory and is still competitive otherwise.

1. Introduction

Dynamic memory management is a key technology for high-level programming. Most of the existing memory allocators are extremely robust and well designed. Nevertheless, dynamic memory management for multicore platforms is still a challenge, as shown in our benchmarks.

In a concurrent setting, most competitive allocators try to enable scalable memory allocation by separating the allocator into a local frontend, for fast allocation in so-called thread-local allocation buffers (TLABs) [3] or core-local allocation buffers (CLABs), and a global backend, for returning empty buffers to the operating system. LABs in the frontend are usually segregated by so-called size classes. For each size class a LAB contains chunks of memory – here called spans – partitioned into same-sized blocks that accommodate objects of a given size class. Spans are called superblocks in Hoard [3], superpages in Streamflow [23], spans in TCMalloc [8], and SLABs in llalloc [19]. The advantage of size-class-partitioned memory is reduced external fragmentation and improved spatial locality. The disadvantage is the presence of internal fragmentation and the upper bound on object size. Most size-class-based allocators are therefore hybrids that use relatively small spans for small objects (fast path) and some other approach for big objects that do not fit size classes (slow path). Another general problem is to decide when to move empty spans from the frontend into the global backend for returning memory to the OS.

In this paper we present (a) the idea of virtual spans that enable uniform treatment of small and big objects; (b) a scalable backend that builds upon recently developed global data structures; and (c) a constant-time (modulo synchronization) frontend that leverages the scalable backend by eagerly returning spans to it as soon as possible.

We have implemented those three new concepts in an allocator called scalloc that is written in standard C/C++ and supports the POSIX API of memory allocators (malloc, posix_memalign, free, etc.). Scalloc borrows from the concept of LABs which are by default TLABs and can be configured to be CLABs.

Spans in the backend occupy a 2MB portion of mostly unmapped virtual memory and are thus called virtual spans. Upon retrieving a virtual span from the backend, the span is associated with a size class by identifying a potentially much smaller portion as so-called real span which is then partitioned into same-sized blocks. Only the memory of a real span is mapped on demand (upon first access) by the OS. The idea of virtual spans enables uniform treatment of small and big objects at the expense of virtual memory fragmentation, only leaving huge objects (>1MB) to the OS. The concept of virtual spans may readily be used in other allocators and is completely orthogonal to the actual allocator design. Note that mapping virtual memory in a scalable fashion, e.g. as in RadixVM [4], does not solve the problem of designing a competitive allocator. Memory fragmentation and system-call overhead still need to be managed.

Global shared data structures were long considered performance and scalability bottlenecks that were avoided by introducing complex hierarchies. Recent developments in concurrent data structure design show that fast and scalable pools, queues, and stacks are possible [1, 10, 12]. We leverage these results by providing a fast and scalable backend, called the span-pool, that efficiently and effectively returns unused memory to the OS (through madvise system calls).

The frontend takes advantage of the scalable backend by eagerly returning spans as soon as they get empty. In contrast to other approaches, the frontend runs per-thread in constant time (modulo synchronization), meaning that at least one thread will make progress in constant time while others may have to retry. Scalloc is based on a mix of lock-free and lock-based data structures to minimize code complexity without sacrificing performance (scalloc is implemented in around 3000 lines of code). In our and others' [11] experience locks are still a good choice for synchronizing data structures that are mostly uncontended.

Our experiments show that scalloc increases performance by up to 19% while consuming up to 60% less memory than the previously fastest allocator (llalloc). In the worst case of temporal performance scalloc is as fast or faster than all other allocators but slower by 33% than llalloc while still using 22% less memory. Furthermore, scalloc outperforms and outscales all allocators for varying object sizes that fit virtual spans while consuming as much or less memory. Scalloc handles spatial locality without being subject to active or passive false sharing, like some other allocators. Access to memory allocated by scalloc is as fast or faster than with most other allocators.
2. Virtual Spans, Span-Pool, and Frontend

Like other allocators (e.g. Streamflow [23] and llalloc [19]), scalloc can be divided into two main parts:
(1) a frontend that manages memory in spans, and
(2) a backend for managing empty spans (ideally returning their memory to the OS).

Scalloc maintains scalability with respect to performance and memory consumption by:
• introducing virtual spans that enable unified treatment of variable-size objects;
• providing a scalable backend for managing spans;
• providing a frontend with allocation and deallocation calls that are per-thread constant-time modulo lock-freedom and locks.

In the following we describe these concepts in detail.

2.1 Real Spans and Size Classes

A (real) span is a contiguous portion of memory partitioned into same-sized blocks. The size of blocks in a span determines the size class of the span. All spans in a given size class have the same number of blocks. Hence, the size of a span is fully determined by its size class: it is the product of the block size and the number of blocks, plus a span header containing administrative information. In scalloc, there are 29 size classes but only 9 distinct real-span sizes which are all multiples of 4KB (the size of a system page).

The first 16 size classes, with block sizes ranging from 16 bytes to 256 bytes in increments of 16 bytes, are taken from TCMalloc [8]. This design of small size classes limits block-internal fragmentation for sizes larger than 16 bytes to less than 50%. These 16 size classes all have the same real-span size. Size classes for larger blocks range from 512 bytes to 1MB, in increments that are powers of two (limiting block-internal fragmentation to 50%). These size classes may have different real-span sizes, explaining the difference between 29 size classes and 9 distinct real-span sizes.

Objects of size larger than any size class are not managed by spans, but rather allocated directly from the operating system using mmap.
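To illustrate the size-class scheme just described, the following sketch rounds a requested size to a block size following the 16-byte and power-of-two increments; it is an illustration under assumed names, not scalloc's actual code.

  #include <cstddef>

  // Illustrative only: round a requested size up to a block size following
  // the size-class scheme described above (16-byte steps up to 256 bytes,
  // then powers of two up to 1MB); larger requests bypass spans entirely.
  static size_t RoundToBlockSize(size_t size) {
    if (size == 0) size = 1;
    if (size <= 256)
      return (size + 15) & ~static_cast<size_t>(15);  // next multiple of 16
    if (size <= (1UL << 20)) {
      size_t block = 512;
      while (block < size) block <<= 1;               // next power of two
      return block;
    }
    return 0;  // no size class: such objects are mmapped directly
  }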
[Figure 1: Structure of arena, virtual spans, and real spans]

2.2 Virtual Spans

A virtual span is a span allocated in a very large portion of virtual memory (32TB) which we call arena. All virtual spans have the same fixed size of 2MB and are 2MB-aligned in the arena. Each virtual span contains a real span, of one of the available size classes. By the size class of the virtual span we mean the size class of the contained real span. Typically, the real span is (much) smaller than the virtual span that contains it. The maximal real-span size is limited by the size of the virtual span. This is why virtual spans are suitable for big objects as well as for small ones. The structure of the arena, virtual spans, and real spans is shown in Figure 1. The advantages of using virtual spans are:

1. Virtual memory outside of real spans does not cause fragmentation of physical memory, as it is not used and therefore not mapped (on-demand paging of the OS);
2. Uniform treatment of small and big objects;
3. No repeated system calls upon every span allocation since the arena is mmapped only once.

Note that since virtual spans are of the same size and aligned in virtual memory, getting a new virtual span from the arena is simply incrementing a bump pointer. When a virtual span gets empty, it is inserted into the free-list of virtual spans, i.e., the span-pool discussed in the next section. The disadvantages of using virtual spans are:

1. Current kernels and hardware only provide a 48-bit, instead of a 64-bit, address space. As a result, not all of virtual memory can be utilized (see below);
2. Returning a virtual span to the span-pool may be costly in one scenario: a virtual span with a real span of a given size greater than a given threshold becomes empty and is inserted into the span pool. Then, in order to limit physical-memory fragmentation, we use madvise^1 to inform the kernel that the remaining virtual (and thus mapped physical) memory is no longer needed.

^1 madvise is used to inform the kernel that a range of virtual memory is not needed and the corresponding page frames can be unmapped.

Note that the design of the span-pool minimizes the chances that a virtual span changes its real-span size.

To our knowledge, mmapping virtual memory in a single call at this order of magnitude (32TB) is a new idea for memory allocation. Upon initialization, scalloc mmaps 2^45 virtual memory addresses, the upper limit for a single mmap call on Linux. This call does not introduce any significant overhead as the memory is not mapped by the operating system. It is still possible to allocate additional virtual memory using mmap, e.g. for other memory allocation or memory-mapped I/O. The virtual address space still left is 2^48 − 2^45 bytes, i.e., 224TB.

In the worst case of the current configuration with 2MB virtual spans, if real spans are the smallest possible (16KB), the physical memory addressable with scalloc is (2^45 / 2^21) · 2^14 bytes = 2^38 bytes = 256GB.

We have also experimented with configurations of up to 128MB for virtual spans resulting in unchanged temporal and spatial performance for the benchmarks that were not running out of arena space. Enhancing the Linux kernel to support larger arenas is future work. On current hardware, with up to 48 bits for virtual addresses, this would enable up to 256TB arena space and 2TB addressable physical memory (in the worst case, with 2MB virtual spans and 16KB real spans).

Note that in scalloc virtual spans do not restrict the possibility of observing segmentation faults because unmapped memory that is not used by a real span is still protected against access using the mprotect system call.
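Since all virtual spans share the same size and alignment, taking a new virtual span from the arena is indeed just a bump-pointer increment. The following sketch, with assumed names and mmap flags (Arena, kVirtualSpanSize, MAP_NORESERVE), is meant only to illustrate the idea, not to reproduce scalloc's implementation.

  #include <atomic>
  #include <cstdint>
  #include <sys/mman.h>

  // Illustrative arena: one large reservation at start-up, then 2MB-aligned
  // virtual spans are handed out by atomically bumping a pointer.
  class Arena {
   public:
    static constexpr uintptr_t kVirtualSpanSize = 2UL << 20;  // 2MB
    static constexpr size_t kArenaSize = 1UL << 45;           // 32TB, reserved once

    bool Init() {
      // Reserve address space only; real-span portions would later be made
      // accessible (cf. the mprotect note above) and are mapped on demand.
      void* start = mmap(nullptr, kArenaSize, PROT_NONE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
      if (start == MAP_FAILED) return false;
      uintptr_t p = reinterpret_cast<uintptr_t>(start);
      bump_.store((p + kVirtualSpanSize - 1) & ~(kVirtualSpanSize - 1));
      return true;
    }

    // Returns the start address of a fresh virtual span.
    void* AllocateVirtualSpan() {
      return reinterpret_cast<void*>(bump_.fetch_add(kVirtualSpanSize));
    }

   private:
    std::atomic<uintptr_t> bump_{0};
  };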
2.3 Backend: Span-Pool

The span-pool is a global concurrent data structure that logically corresponds to an array of real-span-size-segregated "stack-like" pools. The span-pool implements put and get methods; no values are lost nor invented from thin air; it neither provides a linearizable emptiness check, nor any order guarantees. However, each pool within the span-pool is a locally linearizable [9] "stack-like" pool. It is "stack-like" since in a single-threaded scenario it is actually a stack.

[Figure 2: Span-pool layout — one pre-allocated stack per real-span size and core; free spans are linked dynamically within each stack]

Figure 2 illustrates the segregation by real-span size, which is implemented as a pre-allocated array where each index in the array refers to a given real-span size. Consequently, all size classes that have the same real-span size refer to the same index. Each array entry then holds another pre-allocated array, the pool array, this time of lock-free Treiber stacks [25]. The pool array has size equal to the number of cores (determined at runtime during the initialization phase of the allocator). As a result a stack in any of the pools of the span-pool is identified by a real-span index and a core index.

The design is inspired by distributed queues [10]. We use stacks rather than queues for the following reasons: spatial locality, especially on thread-local workloads; lower latency of push and pop compared to enqueue and dequeue; and stacks can be implemented without sentinel nodes, i.e., no additional memory is needed for the data structure. Thereby, we utilize the memory of the elements inserted into the pool to construct the stacks, avoiding any dynamic allocation of administrative data structures. Distributed stacks are, to our knowledge, among the fastest scalable pools. To make the occurrence of the ABA problem [13] unlikely we use 16-bit ABA counters that are embedded into link pointers^2. Completely avoiding the ABA problem is a non-trivial task, which can be solved using e.g. hazard pointers [20].

^2 Currently a 64-bit address space is limited to 48 bits of address, enabling the other 16 bits to be used as ABA counter.
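The embedding of a 16-bit ABA counter into the upper bits of a 48-bit link pointer mentioned above can be sketched as follows; the name TaggedPtr and its interface are placeholders, not scalloc's actual code.

  #include <cstdint>

  // Sketch of a link pointer with a 16-bit ABA counter packed into the upper
  // 16 bits of a 64-bit word (the lower 48 bits hold the address).
  struct TaggedPtr {
    uint64_t raw;

    static TaggedPtr Make(void* ptr, uint16_t tag) {
      uint64_t addr = reinterpret_cast<uint64_t>(ptr) & ((1ULL << 48) - 1);
      return TaggedPtr{(static_cast<uint64_t>(tag) << 48) | addr};
    }
    void* Ptr() const { return reinterpret_cast<void*>(raw & ((1ULL << 48) - 1)); }
    uint16_t Tag() const { return static_cast<uint16_t>(raw >> 48); }
    // A successful compare-and-swap on the enclosing atomic word installs
    // Make(new_ptr, Tag() + 1), so an interleaved pop/push cycle is detected.
  };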
Listing 1: Span-pool pseudo code
 1 Int num_cores();    // Returns the number of cores.
 2 Int thread_id();    // Returns this thread's id
 3                     // (0-based).
 4
 5 // Utility functions to map spans and real-span
 6 // sizes to distinct indexes.
 7 Int real_span_idx(Span span);
 8 Int real_span_idx(Int real_span_size);
 9
10 // Returns the real span size of a given span.
11 Int real_span_size(Span span);
12
13 // Madvise all but a span's first page with
14 // MADV_DONTNEED.
15 void madvise_span(Span span);
16
17 SpanPool {
18   Stack spans[MAX_REAL_SPAN_IDX][num_cores()];
19
20   void put(Span span):
21     Int rs_idx = real_span_idx(span);
22     Int core_idx = thread_id() mod num_cores();
23     if real_span_size(span) >= MADVISE_THRESHOLD:
24       madvise_span(span);
25     spans[rs_idx][core_idx].put(span);
26
27   Span get(Int size_class):
28     Int rs_idx = real_span_idx(size_class);
29     Int core_idx = thread_id() mod num_cores();
30     // Fast path.
31     Span span = spans[rs_idx][core_idx].get();
32     if span == NULL:
33       // Try to reuse some other span.
34       for rs_idx in range(0, MAX_REAL_SPAN_IDX):
35         for core_idx in range(0, num_cores()):
36           span = spans[rs_idx][core_idx].get();
37           if span != NULL:
38             return span;
39       // If everything fails, just return a span from
40       // the arena.
41       return arena.allocate_virtual_span();
42     return span;
   }

Listing 1 shows the pseudo code of the span-pool. Upon returning a span to the span-pool, a thread performing a put call first determines the real-span index for a given span (line 21) and the core index as thread identifier modulo the number of cores (line 22). Before actually inserting (line 25) the given span into the corresponding stack the thread may return the span's underlying memory to the operating system using the madvise system call with advice MADV_DONTNEED (line 24), effectively freeing the affected memory. This is the expensive case, only performed on spans with large real-span size determined by a threshold, as unused spans with large physically mapped real spans result in noticeable physical fragmentation and the madvise system call may anyway be necessary upon later reuse, e.g., when a span is to be reused in a size class with a smaller real-span size. The MADVISE_THRESHOLD (line 23) is set to 32KB, which is the boundary between real-span sizes of size classes that are incremented by 16 bytes and those that are incremented in powers of two. Note that lowering the threshold does not substantially improve the observed memory consumption in our experiments while it noticeably decreases performance. Furthermore, for scenarios where physical fragmentation is an issue, one can add a compaction call that traverses and madvises particular spans.

Upon retrieving a span from the span pool, for a given size class, a thread performing a get call first determines the real-span index of the size class (line 28) and the core index as thread identifier modulo the number of cores (line 29). In the fast path for span retrieval the thread then tries to retrieve a span from this identified stack (line 31). Note that this fast path implements the match to the put call, effectively maximizing locality for consecutively inserted (put) and retrieved (get) spans of equal real-span sizes. If no span is found in the fast path, the thread searches all real-span size indexes and core indexes for a span to use (lines 34–38). Note that this motivates the design of the real-span sizes: for reuse, a span of a large real-span size has anyway been madvised whereas all other spans have the same real-span size; reusing a span in the same real-span size (even if the size class changes) amounts only to changing the header. Only when the search for an empty virtual span fails, the thread gets a new virtual span from the arena (as for initial allocation; line 41). Note that the search through the span-pool may fail even if there are spans in it due to the global use of the arrays (and the nonlinearizable emptiness check).

2.4 Frontend: Allocation and Deallocation

We now explain the mutator-facing frontend of scalloc, i.e., the part of the allocator that handles allocation and deallocation requests from the mutator. Recall that the span-pool serves as backend for retrieving and returning empty spans, i.e., spans that have no allocated blocks.

[Figure 3: Life cycle of a span — states expected (arena, RSS = 0), free (backend, RSS compacted), and hot, floating, reusable (frontend, RSS = real-span size)]

We distinguish several states in which a span can be, illustrated in Figure 3. A span can be in a state: expected, free, hot, floating, or reusable. A span is expected if it is still in the arena, i.e., it is completely unused. Note that in this state its memory footprint is 0 bytes. Spans contained in the span-pool are free. A span can be in some of the other states only when it is in the frontend, i.e., it is assigned a specific size class. Spans that are hot are used for allocating new blocks. For spans that are not hot we distinguish between floating and reusable based on a threshold of the number of free blocks. Spans with at most as many free blocks as the specified threshold are floating, spans with more free blocks than specified by the threshold are reusable. We refer to this threshold as reusability threshold. It is possible to only have spans that are floating, i.e., no reuse of nonempty spans, at the expense of increased fragmentation. Throughout its life in the frontend, a span is always assigned to exactly one LAB, the so-called owning LAB. In each LAB, for each size class there is a unique hot span. Furthermore each LAB contains for each size class a set of reusable spans. The details about this set will be clarified when we discuss the implementation in the next section.

Recall that a LAB can be a TLAB (the LAB has a single thread assigned to it) or a CLAB (one LAB per core, and threads with equal identifiers modulo number of cores share the same LAB). Any thread assigned to a LAB can be considered an owner of the spans owned by the LAB. A consequence of the concept of ownership is that deallocation of a block may happen in spans that are not owned by a thread. We refer to such deallocation as remote free, whereas deallocation in a span owned by a thread is a local free. All allocations in scalloc are local, performed in the corresponding hot span. A common problem in allocator design is handling remote frees in a scalable way. No mechanism for handling remote frees results in so-called blowup fragmentation [3], i.e., memory freed through remote frees can never be reused again. Similar to other span-based allocators [19, 23], scalloc provides two free lists of blocks in each span, a local free list and a remote free list. The local free list is only accessed by an owning thread, while the remote free list can be accessed concurrently by multiple (not owning) threads at the same time.
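The local/remote distinction described above amounts to comparing the deallocating thread's LAB with the span's owning LAB. A minimal sketch of that dispatch — with illustrative types and names, not scalloc's actual code — could look as follows.

  #include <atomic>

  struct Block { Block* next; };
  struct Lab;  // local allocation buffer (TLAB or CLAB)

  struct Span {
    Lab* owner;                                      // owning LAB of this span
    Block* local_free_list = nullptr;                // touched only by owning threads
    std::atomic<Block*> remote_free_list{nullptr};   // touched by any thread
  };

  inline void FreeBlock(Lab* my_lab, Span* span, Block* block) {
    if (span->owner == my_lab) {
      // Local free: single-owner list, no synchronization needed.
      block->next = span->local_free_list;
      span->local_free_list = block;
    } else {
      // Remote free: lock-free push, since several non-owning threads
      // may free blocks into the same span concurrently.
      Block* head = span->remote_free_list.load(std::memory_order_relaxed);
      do {
        block->next = head;
      } while (!span->remote_free_list.compare_exchange_weak(
          head, block, std::memory_order_release, std::memory_order_relaxed));
    }
  }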
Allocation. Upon allocation of a block in a given size class, a thread checks its LAB's size class for a hot span. If a hot span exists the thread tries to find a block in the local free list of the hot span. If a block is found, the thread allocates in this block (this is the allocation fast path). The following situations can also occur (for implementation details see Section 2.5):

(a) No hot span exists in the given size class. The thread then tries to assign a new hot span by trying to reuse one from the set of reusable spans. If no span is found there, the thread falls back to retrieving a span from the span-pool.
(b) There is a hot span, but its local free list is empty. The idea now is to use a remotely freed block. However, it is not a wise option to allocate in the remote free list, as that would make allocation interfere with remote frees, destroying the performance of allocation. Therefore, this is a point of choice: If there are enough blocks (in terms of the reusability threshold) in the remote free list, the thread moves them all to its local free list and continues with fast-path allocation. Otherwise, if there are not enough blocks in the remote free list, the thread gets a new hot span like in (a).

Deallocation. Upon deallocation, a thread returns the block to be freed to the corresponding free list, which is the local free list in case the thread owns the span, and the remote free list otherwise. Depending on the state of the span where the block is allocated, the thread then performs the following actions (for implementation details see Section 2.5):

(a) The span is floating. If the number of free blocks in this span is now larger than the reusability threshold, the span's state changes to reusable and the span is inserted into the set of reusable spans of the owning LAB.
(b) The span is reusable. If the free was the last free in the span, i.e., all blocks have been freed, the span is removed from the set of reusable spans of the owning LAB and returned to the span-pool.

If the span is hot, no additional action is taken.

Note that a new contribution of scalloc is that a span is freed upon the deallocation of the last object in the span. All other span-based allocators postpone freeing of a span until the next allocation which triggers a cleanup.

2.5 Implementation Details

We now explain the implementation details of scalloc, i.e., the encoding of fields in headers and the concrete algorithms used for allocation, deallocation, and thread termination.

[Figure 4: Real span layout — header fields link, epoch, owner, local f-list, remote f-list, followed by the block payload]

The real-span header layout is shown in Figure 4. A link field is used to link up spans when necessary, i.e., it is used to link spans in the span-pool as well as in the set of reusable spans. The epoch field is used to uniquely identify a span's state within its life cycle (see below). The local and remote free list contained in a span are encoded in the fields local f-list and remote f-list, respectively. A span's owning LAB is encoded in the owner field.

The fields of a LAB are: an owner field that uniquely identifies a LAB; for each size class a field that refers to the hot span, called hot span; and per size class the set of reusable spans kept in a field reusable spans.

Owner encoding. The owner field consists of two parts, an actual identifier (16 bits) of the owning LAB and a reference (48 bits) to the owning LAB. The whole field fits in a single word and can be updated atomically using compare-and-swap instructions. Note that upon termination of the last thread that is assigned to a LAB, the owner is set to TERMINATED. Subsequent reuses of the LAB (upon assigning newly created threads to it) result in a different owner, i.e., the actual identifier is different while the reference to the LAB stays the same. Also note that due to thread termination a span's owning LAB might have a different owner than the span's owner field indicates.

Epoch encoding. The epoch field is a single word that encodes a span's state and an ABA counter. The states hot, free, reusable, and floating are encoded in the upper parts (bitwise) of the word. The ABA counter (encoded in the rest of the word) is needed for versioning spans as the state alone is not enough to uniquely encode a state throughout a span's life cycle. E.g., one thread can observe a reusable span that, after the last free, is empty. Since freeing the object and transitioning the span into the state is not an atomic operation, another thread can now observe this span as empty (because it has been delayed after an earlier free operation) and put it into the span-pool. This span can now be reused by the same thread in the same size class, ultimately ending up in the state reusable, but not completely empty. At this point the thread that initially freed the last block in the previous round needs to be prevented from transitioning it into state free.

Remote f-list encoding. We use a Treiber stack [25] to implement the remote free list in a span. The top pointer of the stack is stored in its own cache line in the span header. Furthermore, we keep the number of blocks in the stack's top pointer. This number is increased on each put operation. A single call is used to retrieve all blocks from this free list by atomically setting the stack's top pointer to NULL (implicitly setting the block count to 0). Note that generating a new state (putting and retrieving all blocks) only requires the top pointer. As a result special ABA handling is not needed (ABA can occur, but is not a problem)^3.

^3 For detailed explanations of the ABA problem see [20].
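A possible rendering of such a counted top pointer is sketched below: the block count lives in the upper 16 bits of the 64-bit top word and retrieving all blocks is a single atomic exchange. Names and the exact bit layout are our own assumptions, not scalloc's code.

  #include <atomic>
  #include <cstdint>

  struct Block { Block* next; };

  class RemoteFreeList {
   public:
    void Put(Block* block) {
      uint64_t old_top = top_.load(std::memory_order_relaxed);
      uint64_t new_top;
      do {
        block->next = ToPtr(old_top);
        new_top = Pack(block, CountOf(old_top) + 1);
      } while (!top_.compare_exchange_weak(old_top, new_top,
                                           std::memory_order_release,
                                           std::memory_order_relaxed));
    }

    // Removes and returns the whole list; the count is implicitly reset to 0.
    Block* TakeAll(uint64_t* count_out) {
      uint64_t old_top = top_.exchange(0, std::memory_order_acquire);
      *count_out = CountOf(old_top);
      return ToPtr(old_top);
    }

    uint64_t Count() const { return CountOf(top_.load(std::memory_order_relaxed)); }

   private:
    static uint64_t Pack(Block* b, uint64_t count) {
      return (count << 48) | (reinterpret_cast<uint64_t>(b) & ((1ULL << 48) - 1));
    }
    static Block* ToPtr(uint64_t w) {
      return reinterpret_cast<Block*>(w & ((1ULL << 48) - 1));
    }
    static uint64_t CountOf(uint64_t w) { return w >> 48; }

    alignas(64) std::atomic<uint64_t> top_{0};  // own cache line, as noted above
  };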
Listing 2: Auxiliary structures and methods
43 // Constant indicating terminated LABs.
44 Int TERMINATED;
45
46 LAB get_lab(Int owner);
47 Bool is_orphan(Span span);
48
49 Span { /* Free list implementations omitted. */
50   Int epoch;
51   Int owner;
52   Int size_class;
53
54   Bool try_mark_hot(Int old_epoch);
55   Bool try_mark_floating(Int old_epoch);
56   Bool try_mark_reusable(Int old_epoch);
57   Bool try_mark_free(Int old_epoch);
58
59   Bool try_refill_from_remotes();
60   Bool try_adopt(Int new_owner);
61 }
62
63 Set { /* Set implementation omitted. */
64   Int owner;
65
66   void open(Int owner);
67   void close();  // Sets owner to TERMINATED.
68
69   Span get();
70   Bool put(Int old_owner, Span span);
71   Bool try_remove(Int old_owner, Span span);
72 }

Listing 2 provides an overview of auxiliary methods on spans and sets for reusable spans. Recall that an owner field embeds a reference to the corresponding LAB, which can be retrieved using get_lab (line 46). Furthermore, the function is_orphan (line 47) is used to check whether the given span is an orphan, i.e., all threads assigned to its owning LAB have terminated before all blocks have been returned.

A span then contains the previously mentioned epoch and owner fields (lines 50–51). The methods that try to mark a span as being in a specific state (lines 54–57) all take an epoch value and try to atomically change it to a new value that has the corresponding state bits set and the ABA counter increased. These calls are then used in the actual algorithm for allocation and deallocation to transition a span from one state into another. The method try_refill_from_remotes (line 59) is used to move remote blocks (if there are more available than the reusability threshold) from the remote free list to the local one. The method try_adopt (line 60) is used to adopt orphaned spans, i.e., atomically change their owning LAB.

Maintaining reusable spans should not have a noticeable performance impact (latency of allocation and deallocation) — which of course suggests using fast and scalable rather than non-scalable and slow data structures. Our design provides constant-time put, get, and remove of arbitrary spans (lines 69–71). Furthermore, reusable spans are cleaned up upon termination of the last thread assigned to a LAB, requiring open and close methods (lines 66–67) that effectively prohibit put and remove methods from accessing a set when no owner (i.e. TERMINATED) is present or the owner is different than the one provided as parameter. For details on thread termination see below. Contention on sets of reusable spans is low as the sets are segregated by size class and LABs. For the implementation of reusable sets of spans in scalloc we use a lock-based deque. We are aware of lock-free implementations of deques [5] that can be enhanced to be usable in scalloc. However, the process of cleaning up the set at thread termination (see below) requires wait-freedom as other threads may still be accessing the data structure. Helping approaches can be used to (even efficiently [16]) solve this problem. Experiments suggest that contention on these sets is low and we thus keep the implementation simple.
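A minimal sketch of such an owner-guarded, lock-based set of reusable spans is shown below (illustrative names, not scalloc's code); put and try_remove are rejected once the set has been closed or reopened for a different owner.

  #include <cstdint>
  #include <deque>
  #include <mutex>

  struct Span;
  constexpr uint64_t kTerminated = 0;  // assumed sentinel for "no owner"

  class ReusableSet {
   public:
    void Open(uint64_t owner) {
      std::lock_guard<std::mutex> guard(lock_);
      owner_ = owner;
    }
    void Close() {
      std::lock_guard<std::mutex> guard(lock_);
      owner_ = kTerminated;
    }
    bool Put(uint64_t old_owner, Span* span) {
      std::lock_guard<std::mutex> guard(lock_);
      if (owner_ == kTerminated || owner_ != old_owner) return false;
      spans_.push_back(span);
      return true;
    }
    Span* Get() {
      std::lock_guard<std::mutex> guard(lock_);
      if (spans_.empty()) return nullptr;
      Span* span = spans_.back();
      spans_.pop_back();
      return span;
    }
    bool TryRemove(uint64_t old_owner, Span* span) {
      std::lock_guard<std::mutex> guard(lock_);
      if (owner_ == kTerminated || owner_ != old_owner) return false;
      for (auto it = spans_.begin(); it != spans_.end(); ++it) {
        if (*it == span) { spans_.erase(it); return true; }
      }
      return false;
    }

   private:
    std::mutex lock_;
    uint64_t owner_ = kTerminated;
    std::deque<Span*> spans_;
  };

Note that the std::deque and the linear scan in TryRemove are only for brevity; a constant-time remove of arbitrary spans, as described above, would link the spans intrusively through their link field.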
Listing 3 illustrates the main parts of scalloc's frontend. For simplicity we omit error handling, e.g., returning out of memory. Recall that each LAB is assigned an owner and holds for each size class a reference to the hot span and the reusable spans (lines 74–76).

Listing 3: Frontend: Allocation, deallocation, and thread termination and initialization
 73 LAB {
 74   Span hot_span[NUM_SIZE_CLASSES];
 75   Set reuseable_spans[NUM_SIZE_CLASSES];
 76   Int owner;
 77
 78   // Retrieve a span from the set of reusable spans
 79   // or the span-pool.
 80   Span get_span(Int size_class):
 81     Span span;
 82     do:
 83       span = reuseable_spans[size_class].get();
 84       if span != NULL &&
 85          span.try_mark_hot(span.epoch):
 86         return span;
 87     until span == NULL;
 88     span = span_pool.get(size_class);
 89     span.try_mark_hot(span.epoch);  // always succeeds
 90     return span;
 91
 92   Block allocate(Int sc /* size class */):
 93     if hot_span[sc] == NULL:
 94       hot_span[sc] = get_span(sc);
 95     Block block = hot_span[sc].allocate_block();
 96     if block == NULL:
 97       // Case of empty local free list.
 98       if hot_span[sc].try_refill_from_remotes():
 99         block = hot_span[sc].allocate_block();
100         return block;
101       hot_span[sc].try_mark_floating(hot_span[sc].epoch);
102       hot_span[sc] = get_span(sc);
103       block = hot_span[sc].allocate_block();
104     return block;
105
106   void deallocate(Block block):
107     Span span = span_from_block(block);
108     Int sc = span.size_class;
109     Int old_owner = span.owner;
110     Int old_epoch = span.epoch;
111     span.free(block, owner);
112     if span.is_orphan():
113       span.try_adopt(owner);
114     if span.free_blocks() > REUSABILITY_THRESHOLD:
115       if span.try_mark_reusable(old_epoch):
116         old_owner.reuseable_spans[sc].put(
117           old_owner, span);
118     if span.is_full():
119       if span.try_mark_free(old_epoch):
120         old_owner.reuseable_spans[sc].try_remove(
121           old_owner, span);
122         span_pool.put(span);
123
124   void terminate():
125     for sc in size_classes:
126       reuseable_spans[sc].close();
127       hot_span[sc].try_mark_floating(
128         hot_span[sc].epoch);
129       hot_span[sc] = NULL;
130       Span span;
131       do:
132         span = reuseable_spans[sc].get();
133         if span != NULL: span.try_mark_floating(span.epoch);
134       until span == NULL;
135     owner = TERMINATED;
136
137   void init(Int new_owner):
138     owner = new_owner;
139     for sc in size_classes:
140       reuseable_spans[sc].open(new_owner);
141 }

The method get_span (line 80) is used to retrieve new spans, either from the reusable spans (lines 82–87), or from the span-pool (line 88). The calls to try_mark_hot on line 85 and line 89 represent the transitions reusable → hot and free → hot, respectively. Note that the transition free → hot does not compete with any other threads.

The method allocate (line 92) is used to allocate a single block in a hot span. If no hot span is present a new one is obtained using get_span (line 94). The hot span is then used to retrieve a block from the local free list of a span (line 95). If this attempt fails because the local free list is empty, the remote free list is inspected. If enough (wrt. the reusability threshold represented as REUSABILITY_THRESHOLD) remotely freed blocks are available, they are moved to the local free list (line 98), just before actually allocating a new block (line 99). If not enough remotely freed blocks are available the current hot span is marked as floating (line 101), i.e., the hot span takes the transition hot → floating, and a new hot span is retrieved (line 102). The block is then allocated in the new hot span (line 103).

The method deallocate (line 106) is used to free a single block. Since freeing a block and transitioning spans through states are non-atomic operations, the owner and epoch values of a span are stored before freeing the block (lines 109–110). The span's free call (line 111) then puts the block into the corresponding free list (local or remote, depending on the owner of the span). If after this free, the number of free blocks is larger than the reusability threshold (line 114), the span is put into state reusable (floating → reusable). Similar to other state transitions, this action is serialized through try_mark_reusable (line 115). The succeeding thread then also tries to insert the span into the set of reusable spans for this size class (line 117). Note that this call takes the old owner as parameter to prohibit inserting into a reusable set of a LAB that has either no owner or an owner that is different from the old owner. Similarly, making the transition reusable → free requires marking it as free through try_mark_free (line 119). Note that marking a span as free competes with reusing it in get_span. After successfully marking it as free the span can be removed (if needed) from the set of reusable spans (line 121). Finally, the span is put into the span-pool (line 122).

Thread Termination. Similar to others [14], we refer to spans that have not yet been transitioned into the state free (because they contain live blocks) while all threads assigned to their owning LAB have terminated as orphaned spans. Since LABs do not necessarily have references to all owned spans (there exist no references to floating spans) a span cannot be declared as orphaned by setting a flag. Instead, orphaned spans can be detected by comparing a span's owner against the owner that is set in the owning LAB. Owner fields that differ or a LAB owner set to TERMINATED indicate an orphaned span. Orphaned spans are always floating and adopted by threads upon freeing a block in these spans (lines 112–113). LAB cleanup happens in terminate (line 124) where for each size class (lines 125–134) the reusable spans set is closed (line 126), and all spans (hot and reusable) are transitioned into state floating (line 128 and line 133). For reusable spans this transition competes with reusable → free (line 119), where a potential last free of a block in a span triggers putting the span into the span-pool. Finally, the LAB is marked as terminated and consequently all spans that are not free can be observed as orphaned. Reusing a LAB later on requires setting a new unique owner (line 140).

Handling Large Objects. Scalloc provides span-based allocation of blocks of size less than or equal to 1MB and relies on conventional mmap for all other objects. For allocation this means that the frontend just forwards the allocation request to an allocator that mmaps the block and adds a header containing the required size. Deallocation requires checking whether a block has been allocated in a span or not. However, since spans are contained in a single large arena this check is cheap (xor-ing against the aligned arena boundary). Depending on whether the block has been allocated in a span or not, the request is just forwarded appropriately.

Unwritten Rules. The illustrated concepts yield a design that provides scalability on a multi-core system while keeping memory compact wrt. a reusability threshold. To this end we would like to note that being competitive in absolute terms requires an implementation that forces strict inlining of code, careful layout of thread-local storage, and intercepting thread creation and termination. Without those techniques absolute performance suffers from overheads of function calls as well as cache misses (for unnecessarily checking conditions related to thread-local storage).
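As mentioned under Handling Large Objects above, classifying a freed pointer only requires address arithmetic against the single, contiguous arena. A sketch under assumed constants and names (not scalloc's actual code):

  #include <cstdint>

  constexpr uintptr_t kArenaSize = 1ULL << 45;   // 32TB arena (illustrative)
  uintptr_t g_arena_start = 0;                   // set once at initialization

  inline bool InArena(void* ptr) {
    uintptr_t p = reinterpret_cast<uintptr_t>(ptr);
    return (p - g_arena_start) < kArenaSize;
  }

  void Deallocate(void* ptr) {
    if (InArena(ptr)) {
      // Block lives in a span: hand it to the span-based frontend.
      // frontend_deallocate(ptr);   (hypothetical call)
    } else {
      // Large object: read the size from its header and munmap it.
      // large_object_free(ptr);     (hypothetical call)
    }
  }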
2.6 Properties

Span-internal Fragmentation. Span-internal fragmentation is a global property and refers to memory assigned to a real span of a given size class that is currently free (unused by the mutator) but cannot be reused for serving allocation requests in other size classes by any LAB.

Let f be the current global span-internal fragmentation. Let s refer to the span on which the next operation happens. Let size be the size class of s and u be the size of the payload (memory usable for blocks) of s. At initialization f = 0. Then, for an allocation of a block in s

    f = f + u − size   if no usable span   (1)
    f = f − size       otherwise           (2)

where no usable span means no hot span and no reusable spans are present (1). Note that a span might already be reusable (wrt. the threshold) but not yet present in the set of reusable spans. This case is still covered by (1) and is a result of the fact that freeing an object and further processing it (reusable sets, or span-pool) are non-atomic operations.

Furthermore, for a deallocation of a block in s

    f = f − u      if last block   (3)
    f = f + size   otherwise       (4)

where last block (3) refers to the last free of a block in a given span. Note that to achieve this fragmentation property on a free call an allocator, such as scalloc, has to return an empty span to a global backend immediately. A regular free not emptying the span increases fragmentation by the size of the block as this span cannot be reused globally (4).

Operation Complexity. An allocation operation only considers hot spans and reusable spans and does not need to clean up empty spans. The operation is constant-time as in the uncontended case either a hot span is present and can be used for allocating a block or a reusable span is made hot again before allocating a block in it. In the contended case more than one reusable span may need to be considered because of concurrent deallocation operations. At least one of the operations will make progress in constant time.

A deallocation operation only considers the affected span, i.e., the span containing the block that is freed. Local deallocations are constant-time and remote deallocations are constant-time modulo synchronization (insertion into the remote free-list, which is lock-free). Spans that get reusable are made reusable in constant-time modulo synchronization (insertion into the set of reusable spans, which is lock-based). Spans that get empty are handled in the span-pool.

Span-pool put and get operations are constant-time modulo synchronization (the span-pool is lock-free).

3. Related Work

We first discuss related work on concurrent data structures and then on existing concurrent memory allocators.

Concurrent data structures in the fast path of the frontend, as well as the backend, are lock-free [13]. A recent trend towards semantically relaxed concurrent data structures [1, 12] opens doors for new design ideas and even greater performance and scalability so that hierarchies of spans (buffers in general) can be avoided and may be utilized globally across the whole system. The concurrent data structures in scalloc are pools, allowing in principle every data structure with pool semantics to be used. However, unlike segment queues [1] and k-FIFO stacks [12], Distributed Queues [10] with Treiber stacks [25], as used in scalloc, do not require dynamically allocated administrative memory (such as sentinel nodes), which is important for building efficient memory allocators. The data structures for reusable spans are implemented using locks but could in principle be replaced with wait-free sets, which nowadays can be implemented almost as efficiently as their lock-free counterparts [16].

Many concepts underlying scalloc such as size classes, hierarchical allocation (local allocation buffers, spans), and local memory caching in so-called private heaps (span ownership) have already been introduced and studied in thread-local variants [3, 26]. Scalloc borrows from some of these concepts and integrates them with lock-free concurrent data structures, and introduces new ideas like e.g. virtual spans.

In our experiments we compare scalloc to some of the best and most popular concurrent memory allocators: ptmalloc2^4 (libc 2.19), jemalloc (3.6.0), TCMalloc (googleperftools 2.1), Intel TBB allocator (4.3), Hoard (git-13c7e75), Streamflow (git-41aa80d), and llalloc (1.4). McRT-Malloc [14] is left out of our comparison because of a missing implementation. For Michael's allocator [21] there exists no reference implementation — an implementation for x86-64 by the Streamflow authors crashes for all our benchmarks; we have received another implementation^5 which unfortunately does not perform and scale as we expect from the original paper. We thus decided to leave the Michael allocator out of our comparisons. ptmalloc2 [7] extends Doug Lea's malloc [18] (dlmalloc; 2.7.x) and is part of the GNU libc library. jemalloc [6] is the default allocator in FreeBSD and NetBSD and has been integrated into products of Mozilla (like Firefox) and Facebook. TCMalloc [8] is Google's counter-part to jemalloc, also aiming at performance and scalability. TBB [15] is maintained by Intel as part of their thread building block library which aims at easy creation of fast and scalable multi-threaded programs. Hoard [3] and Streamflow [23] are both academic allocators known to perform well. llalloc [19] is an allocator developed by Lockless Inc.

^4 Surprisingly, ptmalloc3 performs worse than ptmalloc2, which is why we exclude it from our experimental evaluation.
^5 In a private correspondence we received pointers to the Amino Concurrent Building Blocks (http://amino-cbbs.sourceforge.net/) malloc implementation.

All mentioned allocators create private heaps in one way or another. This design has proven to reduce contention and (partially) avoid false sharing. Scalloc is no different in this aspect as it also creates private heaps (span ownership) and exchanges space between threads (through the span-pool).
Another aspect all allocators have in common is heaps segregated by size classes for spatial locality. It makes object headers obsolete (except for coalescing, which is why ptmalloc2 uses them).

Allocators implementing private heaps without returning remotely freed objects to the allocating threads suffer from unbounded blowup fragmentation in producer-consumer workloads [3]. Hence, it is necessary to transfer remotely freed memory back to the heap it was allocated on.

ptmalloc2 solves the blowup fragmentation problem by globally locking and then deallocating the block where it has been allocated. TCMalloc and jemalloc both maintain caches of remotely freed objects which are only returned to a global heap when reaching certain thresholds. Hoard allocates objects in superblocks which are similar to the spans in scalloc. Unlike scalloc, Hoard returns remotely freed objects in a hierarchical fashion, first by deallocating the objects in the superblocks in which they were allocated, then by transferring superblocks between private heaps, and eventually by transferring superblocks from private heaps to a global heap. For this purpose, Hoard locks the involved superblocks, private heaps, and the global heap. TBB, Streamflow, and llalloc maintain a private and a public free-list per thread and size class. The public free lists are implemented using lock-free algorithms. Scalloc does exactly the same.

To this end we would like to note that span-based allocation in the frontend can already be found in Streamflow and llalloc. However, both allocators require cleaning up empty spans in allocation calls, and use different strategies for large objects (binary buddy system) and backends (BIBOP tables [23]).

Another common practice in many allocators is to handle small and big objects, whatever the threshold between small and big is, in separate sub-allocators which are typically based on entirely different data structures and algorithms. ptmalloc2, TCMalloc, jemalloc, and llalloc are such "hybrid" allocators, whereas TBB, Hoard, Streamflow, and scalloc are not. ptmalloc2 manages big objects in best-fit free-lists, TCMalloc and jemalloc round the size of big objects up to page multiples and allocate them from the global heap, and llalloc maintains a binary tree of big objects (cf. binary buddy system) in a large portion of memory obtained from the operating system. Huge objects, again whatever the threshold between big and huge is, are handled by the operating system in all considered allocators including scalloc. The principal challenge is to determine the thresholds between small and big, for hybrid allocators, as well as between big and huge, for all allocators. Scalloc addresses that challenge, through virtual spans, by removing the threshold between small and big objects and by making the threshold between big and huge so large that it is likely to be irrelevant in the foreseeable future for most existing applications.

4. Experimental Evaluation

All experiments ran on a UMA machine with four 10-core 2GHz Intel Xeon E7-4850 processors supporting two hardware threads per core, 32KB 8-way associative L1 data cache and 32KB 4-way associative L1 instruction cache per core, 256KB unified 8-way associative L2 cache per core, 24MB unified 24-way associative L3 cache per processor, 128GB of main memory, and Linux kernel version 3.8.0.

Note that we have found Linux 3.8.13 (Ubuntu 14.04 LTS kernel) to sometimes use anticipated paging, i.e., physically mapping pages that have not yet been accessed. The use of mprotect circumvents this issue. However, since we cannot reliably enforce this for other allocators, thus making a fair comparison impossible, we still use 3.8.0.

There are two configurable parameters in scalloc, MADVISE_THRESHOLD and REUSABILITY_THRESHOLD. We choose MADVISE_THRESHOLD to be 32KB which is the smallest real-span size. In principle one can set the threshold as low as the system page size, effectively trading performance for lower memory consumption, but as spans are subject to reuse at all times this is not necessary. Furthermore we choose REUSABILITY_THRESHOLD to be 80%, i.e., spans may be reused as soon as 80% of their blocks are free. Span reuse is useful in workloads that exhibit irregular allocation and deallocation patterns. Since span reuse optimizes memory consumption with negligible overhead we have also considered a configuration that does not reuse spans and discuss the results but do not show the data for clarity.

In our experiments we compare scalloc against other allocators in benchmarks known to perform interesting usage patterns, e.g., threadtest provides a completely thread-local workload for batched allocation and deallocation of objects. Unless explicitly stated we report the arithmetic mean of 10 sample runs including the 95% confidence interval based on the corrected sample standard deviation. For memory consumption we always report the resident set size (RSS). The sampling frequency varies among experiments and has been chosen high enough to not miss peaks in memory consumption between samples. Since most benchmarks do not report memory consumption we employ an additional process for measuring the RSS. As a result, benchmarks like threadtest, larson and shbench only scale until 39 threads. We still report the 40 threads ticks to illustrate this behavior. A summary of all benchmarks can be found in Table 1.

Table 1: Summary of benchmarks

  BENCHMARK          OBJECT SIZE [BYTES]   LOCAL FREES   REMOTE FREES   THREAD TERM.
  483.xalancbmk      1-2M                  100%          0%             no
  Threadtest         64                    100%          0%             no
  Shbench            1-8                   100%          0%             no
  Larson             7-8                   >= 99%        < 1%           yes
  Object Sizes       16-4M                 100%          0%             no
  Mutator Locality   16-32                 100%          0%             no
  Prod.-Con.         16-512                see Section 4.3              no
4.1 Single-threaded Workload: 483.xalancbmk mark is configured to perform 106 rounds per thread of ob-
We compare different allocators on the 483.xalancbmk jects between 1 and 8 bytes in size. We exclude Stream-
workload of the SPEC CPU2006 suite which is known to flow from this experiment because it crashes for more than 1
be an allocation intense single-threaded workload [24]. thread in the shbench workload. For temporal performance
Figure 10b reports a benchmark score called ratio where we show the speedup wrt. single-threaded ptmalloc2.
a higher ratio means a lower benchmark runtime. The results Figure 6a shows the performance results. As objects sur-
omit data for Streamflow because it crashes. Scalloc (among vive rounds of allocations the contention on the span-pool
others) provides a significant improvement compared to pt- is not as high as in threadtest. The memory consumption
malloc2 but the differences among the best performing al- in Figure 6b indicates that the absolute performance is de-
locators including scalloc are small. Nevertheless, the re- termined by sizes of spans (or other local buffers). Alloca-
sults demonstrate competitive single-threaded temporal per- tors that keep memory compact in this benchmark also suf-
formance of scalloc. Note that the SPEC suite does not pro- fer from degrading absolute performance. Scalloc performs
vide metrics for memory consumption. better than all other allocators except for llalloc which con-
sumes more memory. Note that reusing spans in this bench-
4.2 Thread-local Workloads mark has an impact on memory consumption. Scalloc is con-
We evaluate the performance of allocators in workloads that figured to reuse at 80% free blocks in a span. Disabling
only consists of thread-local allocations and deallocations. reusing of spans results for 20 threads in a memory con-
Recall that scalloc only allocates blocks in a hot span of sumption increase of 14.7% (while having no noticeable im-
a given size class. Hence, workloads that perform more pact on performance). This suggests that reusing spans in
consecutive allocations in a single size class than a span can scalloc is important in non-cyclic workloads.
hold blocks (i.e. working is larger than the real-span size)
result in benchmarking the span-pool.
Threadtest Threadtest [3] allocates and deallocates objects
of the same size in a configurable number of threads and 4.2.1 Thread Termination Workload: Larson
may perform a configurable amount of work in between al- The larson benchmark [17] simulates a multi-threaded server
locations and deallocations. For a variable number of threads application responding to client requests. A thread in Larson
t, the benchmark is configured to perform 104 rounds per receives a set of objects, randomly performs deallocations
5
thread of allocating and deallocating 10t objects of 64 bytes. and allocations on this set for a given number of rounds,
For temporal performance we show the speedup with respect then passes the set of objects on to the next thread, and fi-
to single-threaded ptmalloc2 performance. nally terminates. The benchmark captures robustness of allo-
In scalloc, objects of 64 bytes are allocated in 64-byte cators for unusual allocation patterns including terminating
blocks in spans with real-span size 32KB. Since allocations threads. Unlike results reported elsewhere [23] we do not
and deallocations are performed in rounds reusing of spans observe a ratio of 100% remote deallocations as Larson also
has no effect on memory consumption. The overhead for exhibits thread-local allocation and deallocation in rounds.
adding and removing spans to the reusable set is negligible. For a variable number of threads t, the benchmark is config-
Figure 5a illustrates temporal performance where all allo- ured to last 10 seconds, for objects between 7 and 8 bytes
cators but jemalloc scale until 39 threads with only the abso- (smallest size class for all allocators), with 103 objects per
lute performance being different for most allocators. Since thread, and 104 rounds per slot of memory per thread. Unlike
the working set for a single round of a thread is roughly the other experiments, the larson benchmark runs for a fixed
6.4MB, allocators are forced to interact with their backend. amount of time and provides a throughput metric of memory
The results suggest that the span-pool with its strategy of management operations per second. We exclude Streamflow
distributing contention provides the fastest backend of all from this experiment as it crashes under the larson workload.
considered allocators. The memory consumption shown in Figure 7a illustrates temporal performance where all con-
Figure 5b suggests that threads do not exhibit a lock-step sidered allocators scale but provide different speedups. Sim-
behavior, i.e., they run out of sync wrt. to their local rounds, ilar to shbench, the rate at which spans get empty varies. The
which ultimately manifests in lower memory consumption memory consumption in Figure 7b illustrates that better ab-
for a larger number of threads. The span-pool supports this solute performance comes at the expense of increased mem-
behavior by providing a local fast path with a fall back to ory consumption. Furthermore, terminating threads impose
scalable global sharing of spans. the challenge of reassigning spans in scalloc (and likely also
Shbench. Similar to threadtest, shbench [22] exhibits a thread-local behavior where objects are allocated and deallocated in rounds. Unlike threadtest though, the lifetime of objects is not exactly one round but varies throughout the benchmark. For a variable number of threads t, the benchmark […] impact on performance). This suggests that reusing spans in scalloc is important in non-cyclic workloads.

4.2.1 Thread Termination Workload: Larson
The larson benchmark [17] simulates a multi-threaded server application responding to client requests. A thread in Larson receives a set of objects, randomly performs deallocations and allocations on this set for a given number of rounds, then passes the set of objects on to the next thread, and finally terminates. The benchmark captures robustness of allocators for unusual allocation patterns including terminating threads. Unlike results reported elsewhere [23] we do not observe a ratio of 100% remote deallocations, as Larson also exhibits thread-local allocation and deallocation in rounds. For a variable number of threads t, the benchmark is configured to last 10 seconds, for objects between 7 and 8 bytes (the smallest size class for all allocators), with 10^3 objects per thread and 10^4 rounds per slot of memory per thread. Unlike the other experiments, the larson benchmark runs for a fixed amount of time and provides a throughput metric of memory management operations per second. We exclude Streamflow from this experiment as it crashes under the larson workload.

Figure 7a illustrates temporal performance where all considered allocators scale but provide different speedups. Similar to shbench, the rate at which spans get empty varies. The memory consumption in Figure 7b illustrates that better absolute performance comes at the expense of increased memory consumption. Furthermore, terminating threads impose the challenge of reassigning spans in scalloc (and likely also impose a challenge in other allocators that rely on thread-local data structures). Similar to shbench, reusing spans before they get empty results in better memory consumption. Disabling span reuse increases memory consumption by 7.6% at 20 threads while reducing performance by 2.7%.
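The handoff pattern that makes larson stressful for thread-local allocator state can be sketched as follows; this is a simplified, hypothetical reconstruction rather than the original benchmark, with object counts, sizes, and round counts chosen only for illustration.

#include <cstdlib>
#include <thread>
#include <utility>
#include <vector>

// One generation: churn the inherited set of objects, then hand it back.
static std::vector<void*> churn(std::vector<void*> objects) {
  for (int round = 0; round < 10000; round++) {
    std::size_t i = std::rand() % objects.size();
    std::free(objects[i]);         // may free a block allocated by an earlier,
    objects[i] = std::malloc(8);   // already-terminated thread (a remote free)
  }
  return objects;
}

int main() {
  std::vector<void*> set(1000);
  for (auto& p : set) p = std::malloc(8);
  for (int generation = 0; generation < 10; generation++) {
    std::vector<void*> next;
    std::thread worker([&] { next = churn(std::move(set)); });
    worker.join();                 // the worker terminates before its successor starts
    set = std::move(next);
  }
  for (void* p : set) std::free(p);
  return 0;
}

Each generation terminates before the next one starts, so frees increasingly hit blocks allocated by threads that no longer exist, which is exactly the situation in which spans (or other thread-local structures) must be reassigned or adopted.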
4.3 Producer-Consumer Workload
This experiment evaluates the temporal and spatial performance of a producer-consumer workload to study the cost of remote frees and possible blowup fragmentation for an increasing number of producers and consumers. For that purpose we configure ACDC such that each thread shares all allocated objects with all other threads, accesses all local and shared objects, and eventually the last (arbitrary) thread accessing an object frees it. The probability of a remote free in the presence of n threads is therefore 1 − 1/n, e.g., running two threads causes on average 50% remote frees and running 40 threads causes on average 97.5% remote frees.

Figure 9a presents the total time each thread spends in the allocator for an increasing number of producers/consumers. Up to 30 threads scalloc and Streamflow provide the best temporal performance, and for more than 30 threads scalloc outperforms all other allocators.

The average per-thread memory consumption illustrated in Figure 9b suggests that all allocators deal with blowup fragmentation, i.e., we do not observe unbounded growth in memory consumption. However, the absolute differences among different allocators are significant. Scalloc provides competitive spatial performance where only jemalloc and ptmalloc2 require less memory, at the expense of higher total per-thread allocator time.

This experiment demonstrates that the approach of scalloc to distributing contention across spans with one remote free list per span works well in a producer-consumer workload and that using a lock-based implementation for reusing spans is not a performance bottleneck.
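A minimal sketch of such a producer-consumer pattern, with a single producer and a single consumer for brevity (the benchmark scales both), shows why almost every deallocation becomes a remote free; the queue, object size, and iteration count are illustrative and unrelated to the ACDC configuration.

#include <cstdlib>
#include <deque>
#include <mutex>
#include <thread>

int main() {
  std::deque<void*> shared;   // hypothetical buffer shared between the two threads
  std::mutex lock;
  bool done = false;

  std::thread producer([&] {
    for (int i = 0; i < 100000; i++) {
      void* obj = std::malloc(64);              // allocated by the producer
      std::lock_guard<std::mutex> guard(lock);
      shared.push_back(obj);
    }
    std::lock_guard<std::mutex> guard(lock);
    done = true;
  });

  std::thread consumer([&] {
    for (;;) {
      void* obj = nullptr;
      {
        std::lock_guard<std::mutex> guard(lock);
        if (!shared.empty()) {
          obj = shared.front();
          shared.pop_front();
        } else if (done) {
          break;
        }
      }
      if (obj != nullptr) std::free(obj);       // freed by a different thread: a remote free
    }
  });

  producer.join();
  consumer.join();
  return 0;
}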
4.4 Robustness against False Sharing
False sharing occurs when objects that are allocated in the same cache line are read from and written to by different threads. In cache-coherent systems this scenario can lead to performance degradation as all caches need to be kept consistent at all times. An allocator is prone to active false sharing [3] if objects that are allocated by different threads (without communication) end up in the same cache line. It is prone to passive false sharing [3] if objects that are remotely deallocated by a thread are immediately usable for allocation by this thread again. Both scenarios lead to objects sharing a cache line, potentially leading to false sharing.

Only TCMalloc is prone to passive and active false sharing. Hoard is susceptible to passive false sharing. Scalloc does not suffer from false sharing as spans need to be freed before they can be reused by other threads for allocation. In the rare case of thread termination, a thread adopting a span can allocate in the same span while other threads may still have blocks in it, causing false sharing. We omit the plots for space reasons.
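The active case can be made concrete with a small, hedged check: two threads allocate one block each without communicating, and we test whether the returned addresses fall into the same cache line. The 64-byte line size is an assumption about the platform, and a meaningful test would repeat this for many allocations.

#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <thread>

static void* block[2];

int main() {
  // Two threads allocate independently, without communicating.
  std::thread t0([] { block[0] = std::malloc(8); });
  std::thread t1([] { block[1] = std::malloc(8); });
  t0.join();
  t1.join();
  // Assumed cache-line size of 64 bytes: if both blocks map to the same line
  // index, writes by the two threads would actively falsely share that line.
  std::uintptr_t line0 = reinterpret_cast<std::uintptr_t>(block[0]) / 64;
  std::uintptr_t line1 = reinterpret_cast<std::uintptr_t>(block[1]) / 64;
  std::printf("same cache line: %s\n", line0 == line1 ? "yes" : "no");
  std::free(block[0]);
  std::free(block[1]);
  return 0;
}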
4.5 Robustness for Varying Object Sizes
We configure the ACDC allocator benchmark [2] to allocate, access, and deallocate increasingly larger thread-local objects in 40 threads (the number of native cores) to study the scalability of virtual spans and the span-pool.

Figure 8a shows the total time spent in the allocator, i.e., the time spent in malloc and free. The x-axis refers to intervals [2^x, 2^(x+2)) of object sizes in bytes with 4 ≤ x ≤ 20 at increments of two. For each object size interval ACDC allocates 2^x KB of new objects, accesses the objects, and then deallocates previously allocated objects. This cycle is repeated 30 times. For object sizes smaller than 1MB scalloc outperforms all other allocators because virtual spans enable scalloc to rely on efficient size-class allocation. The only possible bottleneck in this case is accessing the span-pool. However, even in the presence of 40 threads we do not observe contention on the span-pool. For objects larger than 1MB scalloc relies on mmap, which adds system call latency to allocation and deallocation operations and is also known to be a scalability bottleneck [4].

The average memory consumption (illustrated in Figure 8b) of scalloc allocating small objects is higher (yet still competitive) because the real-spans for size classes smaller than 32KB have the same size and madvise is not enabled for them. For larger object sizes scalloc causes the smallest memory overhead, comparable to jemalloc and ptmalloc2. This experiment demonstrates the advantages of trading virtual address space fragmentation for high throughput and low physical memory fragmentation.

4.6 Mutator Performance: Locality
In order to expose differences in spatial locality, we configure ACDC to access allocated objects (between 16 and 32 bytes) increasingly in allocation order (rather than out-of-allocation order). For this purpose, ACDC organizes allocated objects either in trees (in depth-first, left-to-right order, representing out-of-allocation order) or in lists (representing allocation order). ACDC then accesses the objects from the tree in depth-first, right-to-left order and from the list in FIFO order. We measure the total memory access time for an increasing ratio of lists, starting at 0% (only trees) and going up to 100% (only lists), as an indicator of spatial locality. ACDC provides a simple mutator-aware allocator called compact to serve as an optimal (yet, without further knowledge of mutator behavior, unreachable) baseline. Compact stores the lists and trees of allocated objects without space overhead in contiguous memory for optimal locality.

Figure 10a shows the total memory access time for an increasing ratio of object accesses in allocation order. Only jemalloc and llalloc provide a memory layout that can be accessed slightly faster than the memory layout provided by scalloc. Scalloc does not require object headers and reinitializes span free-lists upon retrieval from the span-pool. For a larger ratio of object accesses in allocation order, the other allocators improve too, but not as much as scalloc, TBB, llalloc, and Streamflow, which approach the memory access performance of the compact baseline allocator.
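As a rough illustration of the effect being measured (this is not ACDC, and the object count and size are arbitrary), the following sketch times a sweep over the same malloc'd objects once in allocation order and once in shuffled order; the gap between the two runs is a crude proxy for the spatial locality an allocator's placement provides.

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <random>
#include <vector>

static volatile long g_sink;  // keeps the compiler from discarding the sweeps

// Sum the objects in the given order and return the elapsed time in seconds.
static double sweep(const std::vector<int*>& order) {
  auto start = std::chrono::steady_clock::now();
  long sum = 0;
  for (int* p : order) sum += *p;
  g_sink = sum;
  std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
  return elapsed.count();
}

int main() {
  const int kObjects = 1 << 20;               // arbitrary number of small objects
  std::vector<int*> objects(kObjects);
  for (auto& p : objects) {                   // allocation order
    p = static_cast<int*>(std::malloc(24));   // 24 bytes: within the 16-32 byte range
    *p = 1;
  }
  std::vector<int*> shuffled = objects;       // same objects, out of allocation order
  std::shuffle(shuffled.begin(), shuffled.end(), std::mt19937(42));

  std::printf("allocation order: %.3fs\n", sweep(objects));
  std::printf("shuffled order:   %.3fs\n", sweep(shuffled));

  for (int* p : objects) std::free(p);
  return 0;
}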
Figure 5: Threadtest benchmark. (a) Speedup wrt. ptmalloc2 (more is better); (b) memory consumption in MB (less is better); x-axis: number of threads.

Figure 6: Shbench benchmark. (a) Speedup wrt. ptmalloc2 (more is better); (b) memory consumption in GB (less is better); x-axis: number of threads.

Figure 7: Larson benchmark. (a) Throughput in million operations per second (more is better); (b) memory consumption in MB (less is better); x-axis: number of threads.

Figure 8: Temporal and spatial performance for the object-size robustness experiment. (a) Total allocator time in seconds (logscale, less is better); (b) average memory consumption in MB (logscale, less is better); x-axis: object size range in bytes (logscale).

Figure 9: Temporal and spatial performance for the producer-consumer experiment. (a) Total per-thread allocator time in seconds (less is better); (b) average per-thread memory consumption in MB (less is better); x-axis: number of threads.

Figure 10: Mutator locality and single-threaded performance. (a) Total memory access time in seconds (less is better) over the percentage of object accesses in allocation order; (b) SPEC CPU2006 483.xalancbmk score (SPEC ratio, more is better) per allocator.

(All plots compare jemalloc, ptmalloc2, TCMalloc, Hoard, scalloc, llalloc, TBB, Streamflow, and, where applicable, the compact baseline.)

Note that we can improve memory access time even more by setting the reusability threshold of scalloc to 100, i.e., spans are only reused once they get completely empty and are reinitialized through the span-pool, at the expense of higher memory consumption. We omit this data for consistency reasons.

To explain the differences in memory access time we pick the data points for ptmalloc2 and scalloc at x = 60%, where the difference in memory access time is most significant, and compare the numbers of all hardware events obtainable using perf (https://perf.wiki.kernel.org). While most numbers are similar, we identify two events where the numbers differ significantly. First, the L1 cache miss rate with ptmalloc2 is 20.8% while scalloc causes an L1 miss rate of 15.2% at almost the same number of total L1 cache loads. We attribute this behavior to more effective cache line prefetching, because we observe 272.2M L1 cache prefetches with scalloc and only 119.8M L1 cache prefetches with ptmalloc2. Second, and related to the L1 miss rate, we observe 408.5M last-level cache loads with ptmalloc2 but only 190.2M last-level cache loads with scalloc, because every L1 cache miss causes a cache load on the next level. The last-level cache miss rate is negligible with both allocators, suggesting that the working set (by design) fits into the last-level cache. Note that last-level cache events in perf include L2 and L3 cache events on our hardware.
5. Conclusion

We presented three contributions: (a) virtual spans that enable uniform treatment of small and big objects; (b) a fast and scalable backend leveraging newly developed global data structures; and (c) a constant-time (modulo synchronization) frontend that eagerly returns empty spans to the backend. Our experiments show that scalloc is either better (threadtest, object sizes, producer-consumer) or competitive (shbench, larson, mutator locality, SPEC) in performance and memory consumption compared to other allocators.

Interesting future work may be (1) integrating virtual spans and virtual memory management, (2) NUMA-aware mapping of real spans, and in particular (3) dynamically resizing real spans to trade off LAB provisioning and performance based on runtime feedback from mutators, similar in spirit to just-in-time optimizations in virtual machines.

Acknowledgements

This work has been supported by the National Research Network RiSE on Rigorous Systems Engineering (Austrian Science Fund (FWF): S11404-N23, S11411-N23) and a Google PhD Fellowship.

References

[1] Y. Afek, G. Korland, and E. Yanovsky. Quasi-linearizability: Relaxed consistency for improved concurrency. In Proc. Conference on Principles of Distributed Systems (OPODIS), pages 395–410. Springer, 2010. doi:10.1007/978-3-642-17653-1_29.

[2] M. Aigner and C.M. Kirsch. ACDC: Towards a universal mutator for benchmarking heap management systems. In Proc. International Symposium on Memory Management (ISMM). ACM, 2013. doi:10.1145/2464157.2464161.

[3] E.D. Berger, K.S. McKinley, R.D. Blumofe, and P.R. Wilson. Hoard: A scalable memory allocator for multithreaded applications. In Proc. International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). ACM, 2000. doi:10.1145/384264.379232.

[4] A.T. Clements, M.F. Kaashoek, and N. Zeldovich. RadixVM: Scalable address spaces for multithreaded applications. In Proc. European Conference on Computer Systems (EuroSys). ACM, 2013. doi:10.1145/2465351.2465373.

[5] M. Dodds, A. Haas, and C.M. Kirsch. Fast concurrent data-structures through explicit timestamping. Technical Report TR 2014-03, Department of Computer Sciences, University of Salzburg, 2014.

[6] J. Evans. A scalable concurrent malloc(3) implementation for FreeBSD. In Proc. BSDCan, 2006.

[7] W. Gloger. ptmalloc2 – a multi-thread malloc implementation. http://malloc.de/en/.

[8] Google Inc. gperftools: Fast, multi-threaded malloc() and nifty performance analysis tools. http://code.google.com/p/gperftools/.

[9] A. Haas, T.A. Henzinger, A. Holzer, C.M. Kirsch, M. Lippautz, H. Payer, A. Sezgin, A. Sokolova, and H. Veith. Local linearizability. CoRR, abs/1502.07118, 2015. arXiv:1502.07118.

[10] A. Haas, T.A. Henzinger, C.M. Kirsch, M. Lippautz, H. Payer, A. Sezgin, and A. Sokolova. Distributed queues in shared memory—multicore performance and scalability through quantitative relaxation. In Proc. International Conference on Computing Frontiers (CF). ACM, 2013. doi:10.1145/2482767.2482789.

[11] D. Hendler, I. Incze, N. Shavit, and M. Tzafrir. Flat combining and the synchronization-parallelism tradeoff. In Proc. Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 355–364. ACM, 2010. doi:10.1145/1810479.1810540.

[12] T.A. Henzinger, C.M. Kirsch, H. Payer, A. Sezgin, and A. Sokolova. Quantitative relaxation of concurrent data structures. In Proc. Symposium on Principles of Programming Languages (POPL). ACM, 2013. doi:10.1145/2429069.2429109.

[13] M. Herlihy and N. Shavit. The Art of Multiprocessor Programming. Morgan Kaufmann Publishers Inc., 2008.

[14] R.L. Hudson, B. Saha, A.-R. Adl-Tabatabai, and B.C. Hertzberg. McRT-malloc: A scalable transactional memory allocator. In Proc. International Symposium on Memory Management (ISMM), pages 74–83. ACM, 2006. doi:10.1145/1133956.1133967.

[15] Intel Corporation. Threading Building Blocks (TBB). http://threadingbuildingblocks.org.

[16] A. Kogan and E. Petrank. A methodology for creating fast wait-free data structures. In Proc. Symposium on Principles and Practice of Parallel Programming (PPoPP), pages 141–150. ACM, 2012. doi:10.1145/2145816.2145835.
[17] P.-Å. Larson and M. Krishnan. Memory allocation for long-running server applications. In Proc. International Symposium on Memory Management (ISMM), pages 176–185. ACM, 1998. doi:10.1145/286860.286880.

[18] D. Lea. A memory allocator. http://g.oswego.edu/dl/html/malloc.html.

[19] Lockless Inc. llalloc: Lockless memory allocator. http://locklessinc.com/.

[20] M.M. Michael. Hazard pointers: Safe memory reclamation for lock-free objects. IEEE Trans. Parallel Distrib. Syst., 15(6):491–504, 2004. doi:10.1109/TPDS.2004.8.

[21] M.M. Michael. Scalable lock-free dynamic memory allocation. In Proc. Conference on Programming Language Design and Implementation (PLDI), pages 35–46. ACM, 2004. doi:10.1145/996893.996848.

[22] MicroQuill Inc. shbench. http://www.microquill.com/.

[23] S. Schneider, C.D. Antonopoulos, and D.S. Nikolopoulos. Scalable locality-conscious multithreaded memory allocation. In Proc. International Symposium on Memory Management (ISMM). ACM, 2006. doi:10.1145/1133956.1133968.

[24] K. Serebryany, D. Bruening, A. Potapenko, and D. Vyukov. AddressSanitizer: A fast address sanity checker. In Proc. USENIX Annual Technical Conference (USENIX ATC), pages 28–28. USENIX Association, 2012.

[25] R.K. Treiber. Systems programming: Coping with parallelism. Technical Report RJ-5118, IBM Research Center, 1986.

[26] P.R. Wilson, M.S. Johnstone, M. Neely, and D. Boles. Dynamic storage allocation: A survey and critical review. In Proc. International Workshop on Memory Management (IWMM), pages 1–116. Springer, 1995. doi:10.1007/3-540-60368-9_19.
