Memory Models

The C++ and CLR memory models / Sasha Goldshtein
Slides: http://s.sashag.net/dw2016mem
The C++ and CLR memory models
Sasha Goldshtein CTO, Sela Group
@goldshtn blog.sashag.net
Assumptions
You are a C++/C# developer
You write multithreaded code (who doesn't?)
You care about the correctness of your
code
You might have gotten used to the
nurturing embrace of x86, but now you
have to make sure your code is correct in
the fiery, dangerous pits of ARM as well
Agenda
Atomicity, exclusivity, and ordering
What does memory model even mean?
Examples of memory reorderings
Volatile and atomic variables
Examples of broken code and how to fix
it
This is genuinely a level-400 talk. Viewer
discretion is advised. Rated R because of
frequent mentions of memory barriers and
Atomicity,
Exclusivity, and
Ordering
Atomicity
An atomic operation is non-interruptible
Partial reads/writes or context switches aren't allowed
On Intel x86-64 processors, aligned reads and writes of 64-bit values are atomic
Many trivial operations, especially with
non-optimizing compilers,
are not atomic
Original source code
Resulting x86-64 instructions
++globalVar;
mov rax, qword ptr [globalVar]

add rax, 1
mov qword ptr [globalVar], rax
Atomicity Does Not

Guarantee Exclusivity
A non-interruptible operation still does
not guarantee exclusive access to
memory
[Figure: CPU 1 and CPU 2 each execute inc qword ptr [gv]. Both L1 caches and the L2 cache hold gv == 0, and each CPU's store buffer holds write(gv, 1), so both CPUs end up writing 1.]
Exclusivity
LOCK: Require exclusive access to memory
In the past, achieved by locking the memory bus
Currently, achieved by marking the cache line in exclusive mode
[Figure: CPU 1 and CPU 2 each execute lock inc qword ptr [gv]. One CPU holds gv == 0 in its L1 cache with write(gv, 1) in its store buffer; gv is evicted from the other CPU's L1 cache; the L2 cache holds gv == 0.]
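To make the difference between a plain increment and a locked increment concrete, here is a minimal C++ sketch. The names g_counter and run_counter_demo are made up for illustration; the only assumption is that std::atomic's fetch_add is an exclusive read-modify-write, which on x86-64 compilers typically emit as a lock-prefixed instruction.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical shared counter. std::atomic turns ++ into one
// indivisible, exclusive read-modify-write operation.
std::atomic<long> g_counter{0};

// Increments the counter `iterations` times from each of two threads
// and returns the final value.
long run_counter_demo(long iterations) {
    assert(iterations >= 0);
    g_counter.store(0);
    auto work = [iterations] {
        for (long i = 0; i < iterations; ++i) {
            g_counter.fetch_add(1);  // exclusive increment, no lost updates
        }
    };
    std::thread t1(work);
    std::thread t2(work);
    t1.join();
    t2.join();
    return g_counter.load();
}
```

With a plain long instead of std::atomic<long>, the same loop would be a data race and could lose increments.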
Ordering
Atomicity does not guarantee ordering
As we will see later, some memory
loads/stores may be reordered
E.g., stores may become visible to other
processors after subsequent loads retire
Processors may disagree on a variable's value
A processor may see its own writes before
other processors see them
Complex Relationships
atomicity
exclusivity
ordering
Memory Ordering
Memory Ordering
Question: Does the computer execute the
program you wrote?
No
Compilers, processors, and memory
controllers issue memory operations in a
different order
It's not malicious, it's an optimization!
Compiler Reordering
Hmm, it sounds like you're reading that array element many times inside the loop
Let me hoist that read out of the loop for ya!
int sum = 0;
int p = get_pivot();
for (int i = 0; i < p; ++i)
{
    sum += data[i] * data[p];
}
Sounds like a nice optimization, right?
Compiler Reordering
Automatic vectorization is essentially
compiler reordering
// original code:
for (int i = 0; i < n; ++i) {
dst[i] = src[i];
}
// vectorized (and therefore reordered):
for (int i = 0; i < n; i += 16) {
auto val = _mm_stream_load_si128((__m128i*)&src[i]);
_mm_stream_si128((__m128i*)&dst[i], val);
}
Compiler Reordering
Read elimination is a fairly common optimization
// original code:
if (arg1 == 0) throw new ArgumentException();
int val = arg2 + 1;
val += arg1;
// optimized (reordered) code:
int tmp = arg1;
if (tmp == 0) throw new ArgumentException();
int val = arg2 + 1;
val += tmp;
Compiler Reordering
How about this reordering?
// original code:
void enqueue(X new_element) {
bounded_queue_[++last] = new_element;
locked_ = 0;
}
// reordered:
void enqueue(X new_element) {
locked_ = 0;
bounded_queue_[++last] = new_element;
}
Out-Of-Order Execution
Processors have deep execution pipelines
designed to execute instructions in
parallel
This may cause reordering; specifically, reordering of memory operations
[Figure: pipeline timeline, time 0 to 5; ML = memory load, MS = memory store]
    ADD DWORD PTR [EAX], ECX    ML ALU MS
    ADD ECX, DWORD PTR [EBX]    ML ALU
    ADD DWORD PTR [EDX], ESI    ML ALU MS
    ADD EDI, 1                  ALU
Simple three-stage pipeline. The first instruction's memory store may execute concurrently with (or even after) the third instruction's memory load (assuming they are independent).
Caches
Caches and store buffers can dramatically delay memory writes and speed up memory reads
[Figure: one CPU socket with two CPU cores. Each core has a store buffer, an L1 cache (4 cycles, 32 KB), and an L2 cache (10 cycles, 256 KB). Below them sit the L3 cache (40 cycles, 4-16 MB), the memory controller, and main memory (>100 cycles, 1 GB to 1 TB). Data moves between levels in cache lines.]
What Kinds of Reorderings

Are Permissible?
Data dependencies must be honored
E.g., same thread writes X = 1, then reads X:
gets 1
Compilers may reorder any memory access under the as-if rule
Processors have different guarantees

Reordering allowed?    x86/x86-64    ARMv7, IA64    SPARC PSO
Loads after loads      No            Yes            No
Loads after stores     No            Yes            No
Stores after stores    No            Yes            Yes
Reordering, Example 1
Assuming g_p is widget*
Thread 1 may see g_p non-null, but only partially initialized
Thread 1:
    if (g_p != nullptr)
    {
        g_p->do_work();
    }
Thread 2:
    g_p = new widget();
Writes performed by widget's constructor aren't necessarily visible in Thread 1's call to do_work(); broken on ARM and SPARC PSO
Thread 2 may see complete as true, but value would not be initialized, or only partially initialized
Thread 1:
    value = SomeComputation();
    complete = true;
Thread 2:
    if (complete)
    {
        Use(value);
    }
Peterson's algorithm for synchronization
Assuming flag1, flag2 are initialized to 0
Thread 1:
    START1: flag1 = 1;
    if (flag2 == 0) {
        critical section
    } else {
        flag1 = 0;
        goto START1;
    }
Thread 2:
    START2: flag2 = 1;
    if (flag1 == 0) {
        critical section
    } else {
        flag2 = 0;
        goto START2;
    }
Each thread's store can be delayed past the load that follows it, so both threads see the other's flag as 0 and enter the critical section; broken even on x86!!!
Sequential Consistency
Sequential consistency (SC)
The result of any execution (reads and
writes) on multiple processors requires that
the operations of each individual processor
execute in the order specified by the
program
SC is often incredibly expensive and

precludes important optimizations
Modern compilers and processors do not
offer sequential consistency by default
SC-DRF
Race conditions
A memory location can be accessed
simultaneously by two threads, one of which
is a writer
Sequential consistency for data race free

programs (SC-DRF)
Executing reads and writes in program order, as long as you don't have a race condition
Hardware promises sequential consistency if you obey the constraints and don't write race conditions
Memory Models
The C++11 memory model offers SC-DRF
Compiler doesn't take care of preventing
processor reorderings; semantics are
hardware-dependent
Provides facilities for ensuring undesired
reorderings do not occur
The CLR memory model: which one?

ECMA CLI allows all reorderings, Microsoft
CLR implementation precludes store-store
reorderings
Provides facilities for ensuring undesired
reorderings do not occur
Good Fences Make

Good Neighbors
volatile
In C++:
Volatile variables prevent compiler
reorderings of reads and writes (to these
variables), and some other optimizations
Volatile variables do not prevent processor reorderings (although in some versions they used to ¯\_(ツ)_/¯)
In C#:
Volatile variables additionally prevent
processor reorderings, producing
unidirectional barriers
Compiler-Only Barriers
Prevent the compiler from moving reads or writes across the barrier
There are also compiler-only half-barriers; hardly useful
VC++: _ReadWriteBarrier()
GCC: asm volatile("" ::: "memory")
Thread 1:
    value = long_calculation();
    _ReadWriteBarrier();
    done = true;
Thread 2:
    if (done) {
        _ReadWriteBarrier();
        do something with value
    }
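A portable way to spell a compiler-only barrier in C++11 is std::atomic_signal_fence. The sketch below is single-threaded on purpose: g_value, g_done, long_calculation, and produce are illustrative names, and the claim is only that mainstream compilers treat this fence as a barrier to their own reordering while emitting no CPU fence instruction. It does not make the two-thread code above correct on hardware that reorders stores.

```cpp
#include <atomic>
#include <cassert>

int g_value = 0;      // hypothetical payload
bool g_done = false;  // hypothetical "ready" flag

int long_calculation() { return 42; }  // stand-in for real work

void produce() {
    g_value = long_calculation();
    // Compiler-only barrier: the compiler should not move the store to
    // g_value below this point. The processor is not constrained at all.
    std::atomic_signal_fence(std::memory_order_seq_cst);
    g_done = true;
}

// Single-threaded sanity check of the sketch.
void demo() {
    produce();
    assert(g_done && g_value == 42);
}
```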
Processor Memory Barriers

A full barrier prevents all reads and
writes from passing the barrier (in any
direction)
That's more than what we need in this case:
Thread 1:
    if (g_p != nullptr)
    {
        g_p->do_work();
    }
Thread 2:
    auto temp = new widget();
    MemoryBarrier();
    g_p = temp;
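In portable C++11 the same publication pattern can be sketched with std::atomic_thread_fence, using the weaker release and acquire fences instead of a full barrier. The widget body, publish, consume, and run_publish_demo below are assumptions made up for the example.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

struct widget {  // hypothetical stand-in for the slide's widget
    int payload;
    widget() : payload(42) {}
    int do_work() const { return payload; }
};

std::atomic<widget*> g_p{nullptr};

// Thread 2 on the slide: construct first, then publish the pointer.
void publish() {
    widget* temp = new widget();
    // Release fence: the constructor's writes cannot move below the
    // store to g_p that follows.
    std::atomic_thread_fence(std::memory_order_release);
    g_p.store(temp, std::memory_order_relaxed);
}

// Thread 1 on the slide: wait for the pointer, then use the object.
int consume() {
    widget* p;
    while ((p = g_p.load(std::memory_order_relaxed)) == nullptr) {
        std::this_thread::yield();
    }
    // Acquire fence: pairs with the release fence in publish().
    std::atomic_thread_fence(std::memory_order_acquire);
    return p->do_work();
}

int run_publish_demo() {
    int seen = 0;
    std::thread reader([&] { seen = consume(); });
    std::thread writer(publish);
    reader.join();
    writer.join();
    delete g_p.exchange(nullptr);
    return seen;
}
```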

In C#, Thread.MemoryBarrier is a full barrier
Volatile.Read, Volatile.Write, and accessing volatile variables produce unidirectional barriers
Thread 1:
    flag1 = 1;
    Thread.MemoryBarrier();
    if (flag2 == 0) {
        critical section
    }
    else handle_contention();
Thread 2:
    flag2 = 1;
    Thread.MemoryBarrier();
    if (flag1 == 0) {
        critical section
    }
    else handle_contention();
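A rough C++ counterpart of the full barrier between the store and the load is std::atomic_thread_fence(std::memory_order_seq_cst). In the sketch below, try_enter and count_double_entries are hypothetical helpers; the point is that with a full fence in both threads, at least one thread must observe the other's flag.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> flag1{0}, flag2{0};

// Returns true if the calling thread may enter the critical section.
bool try_enter(std::atomic<int>& mine, std::atomic<int>& theirs) {
    mine.store(1, std::memory_order_relaxed);
    // Full barrier: the store above cannot be delayed past the load below.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return theirs.load(std::memory_order_relaxed) == 0;
}

// Runs the two-thread race `rounds` times and counts how often both
// threads entered the critical section in the same round.
int count_double_entries(int rounds) {
    assert(rounds >= 0);
    int both = 0;
    for (int i = 0; i < rounds; ++i) {
        flag1.store(0);
        flag2.store(0);
        bool in1 = false, in2 = false;
        std::thread t1([&] { in1 = try_enter(flag1, flag2); });
        std::thread t2([&] { in2 = try_enter(flag2, flag1); });
        t1.join();
        t2.join();
        if (in1 && in2) ++both;
    }
    return both;
}
```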

What if we made flag1 (but only flag1) volatile?
Thread 1:
    flag1 = 1;
    if (flag2 == 0) {
        critical section
    }
Thread 2:
    flag2 = 1;
    if (flag1 == 0) {
        critical section
    }

What if we made both flag1 and flag2 volatile?
Thread 1:
    flag1 = 1;
    if (flag2 == 0) {
        critical section
    }
Thread 2:
    flag2 = 1;
    if (flag1 == 0) {
        critical section
    }
Other Forms of Barriers

Operations on synchronization
mechanisms are memory barriers
(usually unidirectional)
Operations on std::atomic variables are
memory barriers (usually unidirectional)
sync is some CLR object:
    Monitor.Enter(sync);
    ++x;
    Monitor.Exit(sync);
z is std::atomic<int>:
    z = 42;
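The C++ analogue of the Monitor example is a std::mutex: locking is an acquire operation and unlocking is a release operation, so the plain variable in between is safe. A minimal sketch, with g_sync, g_x, and run_mutex_demo as illustrative names:

```cpp
#include <cassert>
#include <mutex>
#include <thread>

std::mutex g_sync;  // plays the role of `sync` on the slide
long g_x = 0;       // plain, non-atomic variable protected by g_sync

// Increments g_x from two threads under the mutex and returns the total.
long run_mutex_demo(long iterations) {
    assert(iterations >= 0);
    g_x = 0;
    auto work = [iterations] {
        for (long i = 0; i < iterations; ++i) {
            std::lock_guard<std::mutex> lock(g_sync);  // acquire on entry
            ++g_x;
        }  // the guard releases the mutex at the end of each iteration
    };
    std::thread t1(work);
    std::thread t2(work);
    t1.join();
    t2.join();
    return g_x;
}
```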
Acquire and Release

Barriers
Unidirectional barriers are usually enough
to avoid data races
On Windows, unidirectional barriers are available with e.g. InterlockedExchangeRelease, InterlockedCompareExchangeAcquire
Thread 1:
    value = long_calculation();
    ???
    done = true;
Thread 2:
    if (done) {
        ???
        do something with value
    }
std::atomic
Portable API for low-level memory operations
Atomic: no torn reads or torn writes
Ordered: acquire/release and additional memory ordering guarantees
Suppose value and done are std::atomics:
Thread 1:
    value = long_calculation();
    done = true;
Thread 2:
    if (done) {
        do something with value
    }
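The flag-publishing pattern can be sketched in compilable form as follows. Unlike the slide, this sketch makes only the flag atomic and keeps the payload a plain int, to show that a release store paired with an acquire load is enough to publish ordinary data. g_value, g_done, long_calculation, and run_flag_demo are names chosen for the example.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int long_calculation() { return 42; }  // hypothetical stand-in

int g_value = 0;                  // plain data, published through g_done
std::atomic<bool> g_done{false};  // the flag carries the ordering

int run_flag_demo() {
    int observed = -1;
    std::thread t2([&] {
        // Acquire load: once true is seen, the write to g_value is visible.
        while (!g_done.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        observed = g_value;
    });
    std::thread t1([] {
        g_value = long_calculation();
        // Release store: earlier writes cannot be reordered after it.
        g_done.store(true, std::memory_order_release);
    });
    t1.join();
    t2.join();
    return observed;
}
```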
Pitfall: Release Barrier

Followed by Acquire Barrier
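Imagine flag1 and flag2 are std::atomic<int> accessed with release stores and acquire loads, or C# volatile variables
These instructions can still reorder! (std::atomic's default ordering, memory_order_seq_cst, does not allow it.)
Thread 1:
    flag1 = 1;
    if (flag2 == 0) {
        critical section
    }
Thread 2:
    flag2 = 1;
    if (flag1 == 0) {
        critical section
    }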
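The pitfall depends on the memory ordering used for the store and the load, which the following sketch makes explicit. count_double_entries is a hypothetical helper that takes both orderings as parameters. With memory_order_seq_cst for both, the standard rules out the case where both threads read 0. With release and acquire, that outcome is permitted; whether it actually shows up depends on the hardware and the compiler, so the sketch makes no claim about it.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> flag1{0}, flag2{0};

// Counts the rounds in which both threads read the other's flag as 0,
// using the given orderings for the store and the load.
int count_double_entries(int rounds, std::memory_order st, std::memory_order ld) {
    assert(rounds >= 0);
    int both = 0;
    for (int i = 0; i < rounds; ++i) {
        flag1.store(0);
        flag2.store(0);
        bool in1 = false, in2 = false;
        std::thread t1([&] { flag1.store(1, st); in1 = (flag2.load(ld) == 0); });
        std::thread t2([&] { flag2.store(1, st); in2 = (flag1.load(ld) == 0); });
        t1.join();
        t2.join();
        if (in1 && in2) ++both;
    }
    return both;
}
```

Calling it with std::memory_order_release and std::memory_order_acquire is the configuration the slide warns about.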
Examples
Thread-Safe Singleton: C++, BAD
static T* instance_ = nullptr;
static T* get_instance() {
if (instance_ == nullptr) {
instance_ = new T();
}
return instance_;
}
What if two
threads enter
this method at
the same time?
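Thread-Safe Singleton: C++, STILL BAD
static T* instance_ = nullptr;
static std::mutex protector_;
static T* get_instance() {
    if (instance_ == nullptr) {
        protector_.lock();
        instance_ = new T();
        protector_.unlock();
    }
    return instance_;
}
What if two threads get inside the if statement?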

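Thread-Safe Singleton: C++, STILL BAD
static T* get_instance() {
    if (instance_ == nullptr) {
        protector_.lock();
        if (instance_ == nullptr) {
            instance_ = new T();
        }
        protector_.unlock();
    }
    return instance_;
}
What if an exception occurs during initialization of T? (Like bad_alloc.)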

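Thread-Safe Singleton: C++, STILL BAD
static T* get_instance() {
    if (instance_ == nullptr) {
        std::lock_guard<std::mutex> lock{ protector_ };
        if (instance_ == nullptr) {
            instance_ = new T();
        }
    }
    return instance_;
}
What if writes from T's constructor are reordered w.r.t. the write to instance_?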

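Thread-Safe Singleton: C++, STILL BAD
static T* volatile instance_ = nullptr; // volatile pointer, not pointer-to-volatile
static T* get_instance() {
    if (instance_ == nullptr) {
        std::lock_guard<std::mutex> lock{ protector_ };
        if (instance_ == nullptr) {
            instance_ = new T();
        }
    }
    return instance_;
}
Same issue. C++ volatile doesn't prevent processor reorderings.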
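Thread-Safe Singleton: C++, WHEW
static std::atomic<T*> instance_{ nullptr };
static T* get_instance() {
    if (instance_ == nullptr) {
        std::lock_guard<std::mutex> lock{ protector_ };
        if (instance_ == nullptr) {
            instance_ = new T();
        }
    }
    return instance_;
}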
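For reference, here is a self-contained sketch of double-checked locking with std::atomic that spells out the acquire and release orderings instead of relying on the sequentially consistent default. The payload type T, its constructed counter, and run_singleton_demo are made up for the example.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

struct T {  // hypothetical payload type that counts its constructions
    static std::atomic<int> constructed;
    T() { constructed.fetch_add(1); }
};
std::atomic<int> T::constructed{0};

static std::atomic<T*> instance_{nullptr};
static std::mutex protector_;

T* get_instance() {
    T* p = instance_.load(std::memory_order_acquire);  // fast path, no lock
    if (p == nullptr) {
        std::lock_guard<std::mutex> lock{protector_};
        p = instance_.load(std::memory_order_relaxed);  // re-check under the lock
        if (p == nullptr) {
            p = new T();
            // Publish only after the constructor has finished.
            instance_.store(p, std::memory_order_release);
        }
    }
    return p;
}

// Calls get_instance() from several threads and returns how many
// T objects were constructed in total.
int run_singleton_demo(int threads) {
    assert(threads > 0);
    std::vector<std::thread> pool;
    for (int i = 0; i < threads; ++i) {
        pool.emplace_back([] { get_instance(); });
    }
    for (auto& t : pool) t.join();
    return T::constructed.load();
}
```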
Thread-Safe Singleton: C++
// Requires fully-conformant C++11 compiler
// Supported by VC++ since Visual Studio 2015

static T& get_instance() {
static T instance;
return instance;
}
Thread-Safe Singleton: C#, BAD
static T s_instance;
public static T Instance

{
get
{
if (s_instance == null)
s_instance = new T();
return s_instance;
}
}
Thread-Safe Singleton: C#, BAD UNDER ECMA
static T s_instance;
static readonly object s_protector = new object();
public static T Instance
{
    get
    {
        if (s_instance != null) return s_instance;
        lock (s_protector)
        {
            if (s_instance == null) s_instance = new T();
        }
        return s_instance;
    }
}
And also bad because of load speculation. Load-to-load reorderings are allowed if s_instance is not volatile, so the read of s_instance can be reordered with the reads of its fields!!! IA64 would actually be able to do it with speculative execution if it guesses s_instance. (see Duffy, p. 524)
Thread-Safe Singleton: C#
static volatile T s_instance;
static readonly object s_protector = new object();
public static T Instance
{
    get
    {
        if (s_instance != null) return s_instance;
        lock (s_protector)
        {
            if (s_instance == null) s_instance = new T();
        }
        return s_instance;
    }
}
Thread-Safe Singleton: C#
// Thread-safe but not lazy
static readonly T s_instance = new T();
public static T Instance
{
    get { return s_instance; }
}
// Thread-safe and lazy
static readonly Lazy<T> s_instance = new Lazy<T>(
    () => new T());
public static T Instance
{
    get { return s_instance.Value; }
}
One More Example

struct SpinLock_DO_NOT_USE
{
    private int _locked; // 0 = free, 1 = held (int, for the Interlocked.Exchange overload)
    public void Lock()
    {
        while (Interlocked.Exchange(ref _locked, 1) != 0) ;
    }
    public void Unlock()
    {
        _locked = 0;
    }
}
Does _locked need to be volatile?
Or, does the write in Unlock require Volatile.Write?
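For comparison, a C++ spin lock makes the ordering question explicit: taking the lock is an acquire operation and releasing it is a release store, which is the role Volatile.Write would play in the C# version. The class name spin_lock and the helper run_spin_lock_demo are illustrative; like the struct above, this is a teaching sketch rather than something to ship.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

class spin_lock {  // hypothetical name; illustration only
    std::atomic_flag locked_ = ATOMIC_FLAG_INIT;
public:
    void lock() {
        // Acquire: accesses inside the critical section cannot move above this.
        while (locked_.test_and_set(std::memory_order_acquire)) { /* spin */ }
    }
    void unlock() {
        // Release: accesses inside the critical section cannot move below this.
        locked_.clear(std::memory_order_release);
    }
};

// Increments a plain counter from two threads under the spin lock.
long run_spin_lock_demo(long iterations) {
    assert(iterations >= 0);
    spin_lock lock;
    long counter = 0;
    auto work = [&] {
        for (long i = 0; i < iterations; ++i) {
            lock.lock();
            ++counter;
            lock.unlock();
        }
    };
    std::thread t1(work);
    std::thread t2(work);
    t1.join();
    t2.join();
    return counter;
}
```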
Myths
Inconsistency Between
Processors
Myth: Different processors may
indefinitely keep seeing a different value
for a shared variable
Reality: This can happen for a very short time (say, 1 µs) because of CPU store buffering
Flushing The Cache

Myth: Writes to C# volatile variables
flush the cache and make the write
immediately visible to other processors
Reality: They don't flush the cache; a
volatile store is a fence, and a fence
waits for the store buffer to be drained;
the cache coherence protocol guarantees
other processors will see the stored value
volatile Is Like lock

Myth: For small types like ints you don't
need a full-blown lock, just use volatile
Reality: protected is like delegate
Use volatile To Avoid Race

Conditions
Myth: If you apply volatile to the right variables, you don't need synchronization mechanisms and won't have race conditions
Reality: Use kerosene to put out a fire
Summary
Here be dragons!
If possible, try to hide behind someone else's synchronization primitives
Thank you!
Sasha Goldshtein CTO, Sela Group
@goldshtn blog.sashag.net
