
The C++ and CLR memory models / Sasha Goldshtein

Slides: http://s.sashag.net/dw2016mem

The C++ and CLR memory models
Sasha Goldshtein CTO, Sela Group
@goldshtn blog.sashag.net


Assumptions
You are a C++/C# developer
You write multithreaded code (who doesn't?)
You care about the correctness of your code
You might have gotten used to the nurturing embrace of x86, but now you have to make sure your code is correct in the fiery, dangerous pits of ARM as well


Agenda
Atomicity, exclusivity, and ordering
What does "memory model" even mean?
Examples of memory reorderings
Volatile and atomic variables
Examples of broken code and how to fix it
This is genuinely a level-400 talk. Viewer
discretion is advised. Rated R because of
frequent mentions of memory barriers and


Atomicity, Exclusivity, and Ordering


Atomicity
An atomic operation is non-interruptible
Partial reads/writes or context switches aren't allowed
On Intel x86-64 processors, aligned reads and writes of 64-bit values are atomic
Many trivial operations, especially with non-optimizing compilers, are not atomic

Original source code:
    ++globalVar;
Resulting x86-64 instructions:
    mov rax, qword ptr [globalVar]
    add rax, 1
    mov qword ptr [globalVar], rax
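
The load/add/store sequence above can lose updates when two threads interleave. As a minimal sketch (the names `global_var` and `run_atomic_increments` are hypothetical, not from the slides), a `std::atomic` counter turns the increment into a single read-modify-write operation:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical shared counter; std::atomic makes ++ one indivisible
// read-modify-write instead of separate load, add, and store steps.
std::atomic<long> global_var{0};

// Runs two threads that each increment the counter `per_thread` times
// and returns the final value.
long run_atomic_increments(long per_thread) {
    global_var.store(0);
    auto worker = [per_thread] {
        for (long i = 0; i < per_thread; ++i) {
            ++global_var;  // equivalent to global_var.fetch_add(1)
        }
    };
    std::thread t1(worker);
    std::thread t2(worker);
    t1.join();
    t2.join();
    return global_var.load();
}
```

With a plain `long` the same loop would be a data race and could return less than `2 * per_thread`; with the atomic counter no increment is lost.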


Atomicity Does Not Guarantee Exclusivity
A non-interruptible operation still does not guarantee exclusive access to memory
CPU 1 executes: inc qword ptr [gv]
CPU 2 executes: inc qword ptr [gv]
[Diagram: each CPU has write(gv,1) pending in its store buffer, while both L1 caches and the L2 cache still hold gv == 0]


Exclusivity
LOCK: Require exclusive access to memory
In the past, achieved by locking the memory bus
Currently, achieved by marking the cache line in exclusive mode
CPU 1 executes: lock inc qword ptr [gv]
CPU 2 executes: lock inc qword ptr [gv]
[Diagram: CPU 1 has write(gv,1) in its store buffer and gv == 0 in its L1 cache; CPU 2's store buffer is empty and gv is evicted from its L1 cache; the L2 cache holds gv == 0]


Ordering
Atomicity does not guarantee ordering
As we will see later, some memory loads/stores may be reordered
E.g., stores may become visible to other processors after subsequent loads retire
Processors may disagree on a variable's value
A processor may see its own writes before other processors see them


Complex Relationships
atomicity

exclusivity
ordering


Memory Ordering


Memory Ordering
Question: Does the computer execute the program you wrote?
No
Compilers, processors, and memory controllers issue memory operations in a different order
It's not malicious, it's an optimization!


Compiler Reordering
"Hmm, it sounds like you're reading that array element many times inside the loop. Let me hoist that read out of the loop for ya!"
Sounds like a nice optimization, right?
int sum = 0;
int p = get_pivot();
for (int i = 0; i < p; ++i)
{
    sum += data[i] * data[p];
}
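
As a rough illustration of what the hoisting amounts to (a sketch only; the function names are hypothetical and the compiler's actual output will differ), the two functions below compute the same result in a single-threaded program, but the second reads `data[p]` exactly once:

```cpp
#include <cassert>
#include <vector>

// The loop as written: data[p] is read on every iteration.
int sum_as_written(const std::vector<int>& data, int p) {
    int sum = 0;
    for (int i = 0; i < p; ++i) {
        sum += data[i] * data[p];
    }
    return sum;
}

// The loop as the compiler may transform it: the read of data[p]
// is hoisted out of the loop and performed once.
int sum_hoisted(const std::vector<int>& data, int p) {
    int sum = 0;
    const int pivot = data[p];  // single read, before the loop
    for (int i = 0; i < p; ++i) {
        sum += data[i] * pivot;
    }
    return sum;
}
```

The transformation is invisible to a single thread. If another thread wrote `data[p]` while the loop was running, the hoisted version would never observe the new value, which is why such code needs synchronization rather than assumptions about how often memory is read.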


Compiler Reordering
Automatic vectorization is essentially
compiler reordering
// original code:
for (int i = 0; i < n; ++i) {
dst[i] = src[i];
}
// vectorized (and therefore reordered):
for (int i = 0; i < n; i += 16) {
auto val = _mm_stream_load_si128((__m128i*)&src[i]);
_mm_stream_si128((__m128i*)&dst[i], val);
}


Compiler Reordering
Read elimination is a fairly common optimization
// original code:
if (arg1 == 0) throw new ArgumentException();
int val = arg2 + 1;
val += arg1;
// optimized (reordered) code:
int tmp = arg1;
if (tmp == 0) throw new ArgumentException();
int val = arg2 + 1;
val += tmp;


Compiler Reordering
How about this reordering?
// original code:
void enqueue(X new_element) {
bounded_queue_[++last] = new_element;
locked_ = 0;
}
// reordered:
void enqueue(X new_element) {
locked_ = 0;
bounded_queue_[++last] = new_element;
}


Out-Of-Order Execution
Processors have deep execution pipelines designed to execute instructions in parallel
This may cause reordering; specifically, reordering of memory operations
[Pipeline diagram over time steps 0-5; stages used by each instruction:]
ADD DWORD PTR [EAX], ECX     ML, ALU, MS
ADD ECX, DWORD PTR [EBX]     ML, ALU
ADD DWORD PTR [EDX], ESI     ML, ALU, MS
ADD EDI, 1                   ALU
Simple three-stage pipeline. The first instruction's memory store may execute concurrently with (or even after) the third instruction's memory load (assuming they are independent).


Caches
Caches and store buffers can dramatically delay memory writes and speed up memory reads
[Diagram: one CPU socket with two cores]
Per core: store buffer and L1 cache (32 KB, 4 cycles)
Per core: L2 cache (256 KB, 10 cycles)
L3 cache (4-16 MB, 40 cycles)
Memory controller and main memory (1 GB - 1 TB, >100 cycles)


What Kinds of Reorderings Are Permissible?
Data dependencies must be honored
E.g., same thread writes X = 1, then reads X: gets 1
Compilers may reorder any memory access under the as-if rule
Processors have different guarantees

Reordering allowed?     x86/x86-64    ARMv7, IA64    SPARC PSO
Loads after loads       No            Yes            No
Loads after stores      No            Yes            No
Stores after stores     No            Yes            Yes


Reordering, Example 1
Assuming g_p is widget*
Thread 1 may see g_p non-null, but only partially initialized

Thread 1:
if (g_p != nullptr)
{
    g_p->do_work();
}

Thread 2:
g_p = new widget();

Writes performed by widget's constructor aren't necessarily visible in Thread 1; broken on ARM and SPARC PSO


Reordering, Example 2
Thread 2 may see complete as true, but value would not be initialized, or only partially initialized

Thread 1:
value = SomeComputation();
complete = true;

Thread 2:
if (complete)
{
    Use(value);
}


Reordering, Example 3
Peterson's algorithm for synchronization
Assuming flag1, flag2 are initialized to 0
The store can be delayed past the following load (store-load reordering), so both threads see the other's flag as 0 and enter the critical section; broken even on x86!!!

Thread 1:
START1: flag1 = 1;
if (flag2 == 0) {
    critical section
} else {
    flag1 = 0;
    goto START1;
}

Thread 2:
START2: flag2 = 1;
if (flag1 == 0) {
    critical section
} else {
    flag2 = 0;
    goto START2;
}


Sequential Consistency
Sequential consistency (SC)
The result of any execution is the same as if the reads and writes of all processors were executed in some single sequential order, with the operations of each individual processor appearing in the order specified by the program
SC is often incredibly expensive and precludes important optimizations
Modern compilers and processors do not offer sequential consistency by default


SC-DRF
Race conditions
A memory location can be accessed simultaneously by two threads, at least one of which is a writer
Sequential consistency for data race free programs (SC-DRF)
Reads and writes appear to execute in program order, as long as you don't have a race condition
Hardware promises sequential consistency if you obey the constraints and don't write race conditions


Memory Models
The C++11 memory model offers SC-DRF
Compiler doesn't take care of preventing processor reorderings; semantics are hardware-dependent
Provides facilities for ensuring undesired reorderings do not occur
The CLR memory model: which one?
ECMA CLI allows all reorderings, Microsoft CLR implementation precludes store-store reorderings
Provides facilities for ensuring undesired reorderings do not occur


Good Fences Make Good Neighbors


volatile
In C++:
Volatile variables prevent compiler reorderings of reads and writes (to these variables), and some other optimizations
Volatile variables do not prevent processor reorderings (although in some versions they used to ¯\_(ツ)_/¯)
In C#:
Volatile variables additionally prevent processor reorderings, producing unidirectional barriers


Compiler-Only Barriers
Prevent the compiler from moving reads or writes across the barrier
There are also compiler-only half-barriers; hardly useful
VC++: _ReadWriteBarrier()
GCC: asm volatile("" ::: "memory")

Thread 1:
value = long_calculation();
_ReadWriteBarrier();
done = true;

Thread 2:
if (done) {
    _ReadWriteBarrier();
    do something with value
}


Processor Memory Barriers


A full barrier prevents all reads and writes from passing the barrier (in any direction)
That's more than what we need in this case:

Thread 1:
if (g_p != nullptr)
{
    g_p->do_work();
}

Thread 2:
auto temp = new widget();
MemoryBarrier();
g_p = temp;
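
One portable way to get "just enough" ordering for this publication pattern is a release store paired with an acquire load. The following is a sketch under stated assumptions: the `widget` contents and the helper functions are hypothetical, and `g_p` is made a `std::atomic<widget*>` instead of a plain pointer guarded by `MemoryBarrier()`.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical widget whose constructor writes a field.
struct widget {
    int payload;
    widget() : payload(42) {}
    int do_work() const { return payload; }
};

std::atomic<widget*> g_p{nullptr};

// Publisher: the release store keeps the constructor's writes
// from being reordered after the pointer becomes visible.
void publish() {
    widget* temp = new widget();
    g_p.store(temp, std::memory_order_release);
}

// Consumer: the acquire load pairs with the release store, so a
// non-null pointer implies a fully constructed widget.
int consume() {
    widget* p = nullptr;
    while ((p = g_p.load(std::memory_order_acquire)) == nullptr) {
        std::this_thread::yield();
    }
    return p->do_work();
}

// Runs one reader and one writer and returns what the reader saw.
int run_publish_example() {
    int result = 0;
    std::thread reader([&result] { result = consume(); });
    std::thread writer(publish);
    reader.join();
    writer.join();
    delete g_p.exchange(nullptr);
    return result;
}
```

Unlike a full barrier, the release/acquire pair only constrains movement in one direction on each side, which is all this pattern requires.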


Processor Memory Barriers


In C#, Thread.MemoryBarrier is a full barrier
Volatile.Read, Volatile.Write, and accessing volatile variables produce unidirectional barriers

Thread 1:
flag1 = 1;
Thread.MemoryBarrier();
if (flag2 == 0) {
    critical section
}
else handle_contention();

Thread 2:
flag2 = 1;
Thread.MemoryBarrier();
if (flag1 == 0) {
    critical section
}
else handle_contention();


Processor Memory Barriers


What if we made flag1 (but only flag1) volatile?

Thread 1:
flag1 = 1;
if (flag2 == 0) {
    critical section
}
else handle_contention();

Thread 2:
flag2 = 1;
if (flag1 == 0) {
    critical section
}
else handle_contention();


Processor Memory Barriers


What if we made both flag1 and flag2 volatile?

Thread 1:
flag1 = 1;
if (flag2 == 0) {
    critical section
}
else handle_contention();

Thread 2:
flag2 = 1;
if (flag1 == 0) {
    critical section
}
else handle_contention();


Other Forms of Barriers


Operations on synchronization mechanisms are memory barriers (usually unidirectional)
Operations on std::atomic variables are memory barriers (usually unidirectional)

sync is some CLR object:
Monitor.Enter(sync);
++x;
Monitor.Exit(sync);

z is std::atomic<int>:
z = 42;


Acquire and Release Barriers
Unidirectional barriers are usually enough to avoid data races
On Windows, unidirectional barriers are available with e.g. InterlockedExchangeRelease, InterlockedCompareExchangeAcquire

Thread 1:
value = long_calculation();
???
done = true;

Thread 2:
if (done) {
    ???
    do something with value
}
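
In portable C++, one plausible way to fill in the ??? placeholders is to make `done` a `std::atomic<bool>`, written with release semantics and read with acquire semantics. This is a sketch, not necessarily the speaker's intended answer; `long_calculation` is a hypothetical stand-in and the helper function names are assumptions.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int value = 0;                  // plain, non-atomic payload
std::atomic<bool> done{false};  // flag that publishes the payload

// Hypothetical stand-in for the slide's long_calculation().
int long_calculation() { return 6 * 7; }

// Thread 1: the release store keeps the write to `value`
// from moving below the write to `done`.
void producer() {
    value = long_calculation();
    done.store(true, std::memory_order_release);
}

// Thread 2: the acquire load keeps the read of `value`
// from moving above the read of `done`.
int consumer() {
    while (!done.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return value;
}

// Runs both threads once and returns the value the consumer observed.
int run_done_flag_example() {
    value = 0;
    done.store(false);
    int seen = -1;
    std::thread t2([&seen] { seen = consumer(); });
    std::thread t1(producer);
    t1.join();
    t2.join();
    return seen;
}
```

The release barrier sits before the store to `done` and the acquire barrier after the load of `done`, which matches the positions of the two ??? markers.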


std::atomic
Portable API for low-level memory operations
Atomic: no torn reads or torn writes
Ordered: acquire/release and additional memory ordering guarantees
Suppose value and done are std::atomics:

Thread 1:
value = long_calculation();
done = true;

Thread 2:
if (done) {
    do something with value
}


Pitfall: Release Barrier Followed by Acquire Barrier
Imagine flag1 and flag2 are std::atomic<int> accessed with release stores and acquire loads, or C# volatile variables
These instructions can still reorder!

Thread 1:
flag1 = 1;
if (flag2 == 0) {
    critical section
}
else handle_contention();

Thread 2:
flag2 = 1;
if (flag1 == 0) {
    critical section
}
else handle_contention();
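
A release store followed by an acquire load on a different variable may still be reordered, because neither barrier blocks movement in that direction. For `std::atomic`, the default ordering `std::memory_order_seq_cst` does rule out this store-load reordering. The sketch below (the function `run_round` is hypothetical) spells the ordering out explicitly and counts how many threads enter the critical section in one round:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> flag1{0};
std::atomic<int> flag2{0};

// Runs one round: each thread raises its own flag, then checks the
// other's. Returns how many threads entered the critical section.
int run_round() {
    flag1.store(0);
    flag2.store(0);
    int entered1 = 0;
    int entered2 = 0;
    std::thread t1([&entered1] {
        flag1.store(1, std::memory_order_seq_cst);
        if (flag2.load(std::memory_order_seq_cst) == 0) {
            entered1 = 1;  // critical section
        }
    });
    std::thread t2([&entered2] {
        flag2.store(1, std::memory_order_seq_cst);
        if (flag1.load(std::memory_order_seq_cst) == 0) {
            entered2 = 1;  // critical section
        }
    });
    t1.join();
    t2.join();
    return entered1 + entered2;
}
```

Because all four operations take part in one total order, whichever store comes second is followed by a load that must observe the first store, so at most one thread enters. Replacing the orderings with `memory_order_release` and `memory_order_acquire` would remove that guarantee.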

Examples


Thread-Safe Singleton: C++, BAD
static T* instance_ = nullptr;
static T* get_instance() {
    if (instance_ == nullptr) {
        instance_ = new T();
    }
    return instance_;
}

What if two threads enter this method at the same time?


Thread-Safe Singleton: C++, STILL BAD
static T* instance_ = nullptr;
static std::mutex protector_;
static T* get_instance() {
    if (instance_ == nullptr) {
        protector_.lock();
        instance_ = new T();
        protector_.unlock();
    }
    return instance_;
}

What if two threads get inside the if statement?


Thread-Safe Singleton: C++, STILL BAD
static T* instance_ = nullptr;
static std::mutex protector_;
static T* get_instance() {
    if (instance_ == nullptr) {
        protector_.lock();
        if (instance_ == nullptr) {
            instance_ = new T();
        }
        protector_.unlock();
    }
    return instance_;
}

What if an exception occurs during initialization of T? (Like bad_alloc.)


Thread-Safe Singleton: C++, STILL BAD
static T* instance_ = nullptr;
static std::mutex protector_;
static T* get_instance() {
    if (instance_ == nullptr) {
        std::lock_guard<std::mutex> lock{ protector_ };
        if (instance_ == nullptr) {
            instance_ = new T();
        }
    }
    return instance_;
}

What if writes from T's constructor are reordered w.r.t. the write to instance_?


Thread-Safe Singleton: C++, STILL BAD
// The pointer itself is volatile (T* volatile), not the pointee (volatile T*)
static T* volatile instance_ = nullptr;
static std::mutex protector_;
static T* get_instance() {
    if (instance_ == nullptr) {
        std::lock_guard<std::mutex> lock{ protector_ };
        if (instance_ == nullptr) {
            instance_ = new T();
        }
    }
    return instance_;
}

Same issue. C++ volatile doesn't prevent processor reorderings.


Thread-Safe Singleton: C++, WHEW


static std::atomic<T*> instance_{ nullptr };
static std::mutex protector_;
static T* get_instance() {
if (instance_ == nullptr) {
std::lock_guard<std::mutex> lock{ protector_ };
if (instance_ == nullptr) {
instance_ = new T();
}
}
return instance_;
}
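
The version above relies on the default sequentially consistent ordering of `std::atomic`. A common variation, shown here only as a sketch, spells out the weaker orderings that the pattern actually needs. The `T` below is a hypothetical type with a construction counter, and `all_threads_agree` is an illustrative helper; neither is from the slides.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical singleton type that counts how often it is constructed.
struct T {
    static std::atomic<int> constructed;
    T() { constructed.fetch_add(1); }
};
std::atomic<int> T::constructed{0};

static std::atomic<T*> instance_{nullptr};
static std::mutex protector_;

static T* get_instance() {
    // Acquire: a non-null pointer implies the constructor's writes are visible.
    T* p = instance_.load(std::memory_order_acquire);
    if (p == nullptr) {
        std::lock_guard<std::mutex> lock{protector_};
        // Relaxed is enough here: the mutex already orders this load.
        p = instance_.load(std::memory_order_relaxed);
        if (p == nullptr) {
            p = new T();
            // Release: publish only after construction has completed.
            instance_.store(p, std::memory_order_release);
        }
    }
    return p;
}

// Calls get_instance() from several threads and reports whether
// every thread received the same non-null pointer.
bool all_threads_agree(int thread_count) {
    std::vector<T*> seen(thread_count, nullptr);
    std::vector<std::thread> threads;
    for (int i = 0; i < thread_count; ++i) {
        threads.emplace_back([&seen, i] { seen[i] = get_instance(); });
    }
    for (auto& t : threads) t.join();
    for (T* p : seen) {
        if (p != seen[0]) return false;
    }
    return seen[0] != nullptr;
}
```

The explicit orderings can be cheaper than the defaults on weakly ordered hardware, at the cost of code that is easier to get wrong.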


Thread-Safe Singleton: C++
// Requires fully-conformant C++11 compiler
// Supported by VC++ since Visual Studio 2015
static T& get_instance() {
    static T instance;
    return instance;
}


Thread-Safe Singleton: C#, BAD
static T s_instance;
public static T Instance
{
    get
    {
        if (s_instance == null)
            s_instance = new T();
        return s_instance;
    }
}


Thread-Safe Singleton: C#, BAD UNDER ECMA
static T s_instance;
static readonly object s_protector = new object();
public static T Instance
{
    get
    {
        if (s_instance != null) return s_instance;
        lock (s_protector)
        {
            if (s_instance == null) s_instance = new T();
        }
        return s_instance;
    }
}

And also bad because of load speculation. Load-to-load reorderings are allowed if s_instance is not volatile, so the read of s_instance can be reordered with the reads of its fields!!! IA64 would actually be able to do it with speculative execution if it guesses s_instance. (see Duffy, p. 524)


Thread-Safe Singleton: C#
static volatile T s_instance;
static readonly object s_protector = new object();
public static T Instance
{
get
{
if (s_instance != null) return s_instance;
lock (s_protector)
{
if (s_instance == null) s_instance = new T();
}
return s_instance;
}
}


Thread-Safe Singleton: C#
// Thread-safe but not lazy
static readonly T s_instance = new T();
public static T Instance
{
get { return s_instance; }
}
// Thread-safe and lazy
static readonly Lazy<T> s_instance = new Lazy<T>(
() => new T());
public static T Instance
{
get { return s_instance.Value; }
}


One More Example


struct SpinLock_DO_NOT_USE
{
private bool _locked;
public void Lock()
{
while (Interlocked.Exchange(ref _locked, true)) ;
}
public void Unlock()
{
_locked = false;
}
}

Does _locked need to be volatile?
Or, does the line "_locked = false;" require Volatile.Write?
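
For comparison, here is how the same lock is commonly sketched in C++ (an illustration with hypothetical names, not the speaker's answer to the question above). The unlocking store uses release semantics so that writes made inside the critical section cannot drift past the point where the lock appears free; the acquiring exchange uses acquire semantics for the opposite direction.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Minimal test-and-set spin lock; illustrative only, not tuned for contention.
class spin_lock {
    std::atomic_flag locked_ = ATOMIC_FLAG_INIT;

public:
    void lock() {
        // Acquire: later reads and writes stay after the lock is taken.
        while (locked_.test_and_set(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
    }
    void unlock() {
        // Release: earlier reads and writes stay before the lock is freed.
        locked_.clear(std::memory_order_release);
    }
};

// Two threads increment a plain counter under the lock;
// returns the final count.
long run_spin_lock_counter(long per_thread) {
    spin_lock lock;
    long counter = 0;  // deliberately non-atomic, protected by the lock
    auto worker = [&lock, &counter, per_thread] {
        for (long i = 0; i < per_thread; ++i) {
            lock.lock();
            ++counter;
            lock.unlock();
        }
    };
    std::thread t1(worker);
    std::thread t2(worker);
    t1.join();
    t2.join();
    return counter;
}
```

In this formulation the release store in `unlock` plays the role that a volatile write would play in the C# version.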

Myths


Inconsistency Between Processors
Myth: Different processors may
indefinitely keep seeing a different value
for a shared variable
Reality: This can happen for a very short time (say, 1 µs) because of CPU store buffering


Flushing The Cache


Myth: Writes to C# volatile variables flush the cache and make the write immediately visible to other processors
Reality: They don't flush the cache; a volatile store is a fence, and a fence waits for the store buffer to be drained; the cache coherence protocol guarantees other processors will see the stored value


volatile Is Like lock


Myth: For small types like ints you don't need a full-blown lock, just use volatile
Reality: protected is like delegate


Use volatile To Avoid Race Conditions
Myth: If you apply volatile to the right variables, you don't need synchronization mechanisms and won't have race conditions
Reality: Use kerosene to put out a fire


Summary
Here be dragons!
If possible, try to hide behind someone else's synchronization primitives


Thank you!
Sasha Goldshtein CTO, Sela Group
@goldshtn blog.sashag.net
