Slides: http://s.sashag.net/dw2016mem
Assumptions
You are a C++/C# developer
You write multithreaded code (who doesn't?)
You care about the correctness of your code
You might have gotten used to the nurturing embrace of x86, but now you have to make sure your code is correct in the fiery, dangerous pits of ARM as well
Agenda
Atomicity, exclusivity, and ordering
What does "memory model" even mean?
Examples of memory reorderings
Volatile and atomic variables
Examples of broken code and how to fix it
This is genuinely a level-400 talk. Viewer discretion is advised. Rated R because of frequent mentions of memory barriers and …
Atomicity, Exclusivity, and Ordering
Atomicity
An atomic operation is non-interruptible
Partial reads/writes or context switches aren't allowed
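The point above can be sketched in a few lines of C++11 (a minimal illustration; the function name atomic_counter_demo and its parameters are my own, not from the talk):

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Increment a std::atomic<int> from several threads. Each fetch_add is a
// single indivisible read-modify-write, so no increments are lost. With a
// plain int, the same loop would be a data race and would typically lose
// updates, because "read, add, write back" is three separate steps.
int atomic_counter_demo(int threads, int iterations) {
    std::atomic<int> counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < iterations; ++i)
                counter.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto& w : workers) w.join();
    return counter.load();
}
```

With 4 threads doing 10,000 increments each, the result is always exactly 40,000; the non-atomic version generally is not.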
(Diagram: two CPU cores, each with an L1 cache holding gv == 0, sharing an L2 cache; a write(gv,1) is pending in CPU 2's store buffer.)
Exclusivity
LOCK: Require exclusive access to memory
In the past, achieved by locking the memory bus
Currently, achieved by marking the cache line in exclusive mode

lock inc qword ptr [gv]

(Diagram: CPU 1 and CPU 2, each with a store buffer and an L1 cache, over a shared L2 cache; write(gv,1) sits in a store buffer, gv == 0 in cache, and gv's cache line is evicted from the other core.)
Ordering
Atomicity does not guarantee ordering
As we will see later, some memory loads/stores may be reordered
E.g., stores may become visible to other processors after subsequent loads retire
Complex Relationships
atomicity
exclusivity
ordering
Memory Ordering
Question: Does the computer execute the program you wrote?
No
Compilers, processors, and memory controllers issue memory operations in a different order
It's not malicious, it's an optimization!
Compiler Reordering
"Hmm, it sounds like you're reading that array element many times inside the loop. Let me hoist that read out of the loop for ya!"

int sum = 0;
int p = get_pivot();
for (int i = 0; i < p; ++i)
{
    sum += data[i] * data[p];
}

Sounds like a nice optimization, right?
Compiler Reordering
Automatic vectorization is essentially
compiler reordering
// original code:
for (int i = 0; i < n; ++i) {
dst[i] = src[i];
}
// vectorized (and therefore reordered):
for (int i = 0; i < n; i += 16) {
auto val = _mm_stream_load_si128((__m128i*)&src[i]);
_mm_stream_si128((__m128i*)&dst[i], val);
}
Compiler Reordering
Read elimination is a fairly common optimization

// original code:
if (arg1 == 0) throw new ArgumentException();
int val = arg2 + 1;
val += arg1;

// optimized (reordered) code:
int tmp = arg1;
if (tmp == 0) throw new ArgumentException();
int val = arg2 + 1;
val += tmp;
Compiler Reordering
How about this reordering?
// original code:
void enqueue(X new_element) {
bounded_queue_[++last] = new_element;
locked_ = 0;
}
// reordered:
void enqueue(X new_element) {
locked_ = 0;
bounded_queue_[++last] = new_element;
}
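The reordering above is disastrous: the "unlock" becomes visible before the element is written. One hedged sketch of a fix (the member names mirror the slide; the fixed queue size and int elements are my own assumptions) is to make locked_ a std::atomic and release it with store ordering, which forbids both compiler and processor from sinking the element write below the unlock:

```cpp
#include <atomic>
#include <cstddef>

// Fragment of a hypothetical bounded queue. The release store on locked_
// guarantees that every write before it (including the element write) is
// visible to any thread that observes locked_ == 0 with an acquire load.
struct bounded_queue {
    int elements_[64] = {};
    std::size_t last_ = 0;
    std::atomic<int> locked_{1};

    void enqueue(int new_element) {
        elements_[++last_] = new_element;
        // Cannot be reordered before the element write above.
        locked_.store(0, std::memory_order_release);
    }
};
```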
Out-Of-Order Execution
Processors have deep execution pipelines designed to execute instructions in parallel
This may cause reordering; specifically, reordering of memory operations

(Diagram: pipeline timeline, cycles 0-5, for ADD DWORD PTR [EAX], ECX followed by ADD ECX, DWORD PTR [EBX]; the memory-load, ALU, and memory-store stages of the two instructions overlap.)
Caches
Caches and store buffers can dramatically delay memory writes and speed up memory reads

(Diagram: a socket with two CPU cores, each with a store buffer and an L1 cache (32 KB, ~4 cycles) backed by an L2 cache (256 KB, ~10 cycles); the cores share an L3 cache (4-16 MB, ~40 cycles) and a memory controller in front of main memory (1 GB-1 TB, >100 cycles). Data moves between levels in cache lines.)
(Table: is each reordering permitted? Three columns per row, as on the original slide; the header row identifying each memory model was not recoverable.)
Loads after loads: No | Yes | No
Loads after stores: No | Yes | No
Stores after stores: No | Yes | Yes
Reordering, Example 1
Assuming g_p is widget*
Thread 1 may see g_p non-null, but only partially initialized

Thread 1:
if (g_p != nullptr)
{
    g_p->do_work();
}

Thread 2:
g_p = new widget();
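One hedged sketch of a fix (the widget type and its state field are stand-ins I invented for illustration): publish the pointer with a release store and read it with an acquire load, so a reader that sees a non-null pointer is guaranteed to see the constructor's writes too.

```cpp
#include <atomic>

struct widget {
    int state = 42;      // stands in for the constructor's writes
    void do_work() {}
};

std::atomic<widget*> g_p{nullptr};

void publisher() {
    widget* w = new widget();                 // construct fully first...
    g_p.store(w, std::memory_order_release);  // ...then publish
}

int consumer() {
    widget* w = g_p.load(std::memory_order_acquire);
    if (w != nullptr) {
        w->do_work();
        return w->state;  // guaranteed 42: acquire pairs with release
    }
    return -1;            // not yet published
}
```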
Reordering, Example 2
Thread 2 may see complete as true, but value would not be initialized, or only partially initialized

Thread 1:
value = SomeComputation();
complete = true;

Thread 2:
if (complete)
{
    Use(value);
}
Reordering, Example 3
Peterson's algorithm for synchronization
Assuming flag1, flag2 are initialized to 0
Store can pass load, so both threads may see the other's flag as 0 and enter the critical section simultaneously

Thread 1:
START1: flag1 = 1;
if (flag2 == 0) {
    critical section
} else {
    flag1 = 0;
    goto START1;
}

Thread 2:
START2: flag2 = 1;
if (flag1 == 0) {
    critical section
} else {
    flag2 = 0;
    goto START2;
}
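A hedged sketch of the fix for one thread's side (the helper name try_enter_thread1 is mine): making the flags std::atomic with the default sequentially consistent ordering forbids the store-load reordering, because each thread's flag store is globally ordered before its load of the other flag.

```cpp
#include <atomic>

// Flags as on the slide, but atomic. Default (seq_cst) ordering means the
// store below cannot be reordered after the load: at most one thread can
// observe the other's flag as 0.
std::atomic<int> flag1{0}, flag2{0};

bool try_enter_thread1() {
    flag1.store(1);           // seq_cst store: visible before the load below
    if (flag2.load() == 0) {  // seq_cst load: cannot pass the store
        return true;          // safe to enter the critical section
    }
    flag1.store(0);           // back off; caller retries (the slide's goto)
    return false;
}
```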
Sequential Consistency
Sequential consistency (SC)
The result of any execution (reads and writes) on multiple processors is as if all operations were executed in some single sequential order, and the operations of each individual processor appear in that order as specified by the program
SC-DRF
SC-DRF: sequential consistency for data-race-free programs
Race conditions: a memory location is accessed simultaneously by two threads, at least one of which is a writer
Memory Models
The C++11 memory model offers SC-DRF
The compiler doesn't take care of preventing processor reorderings; semantics are hardware-dependent
Provides facilities for ensuring undesired reorderings do not occur
volatile
In C++:
Volatile variables prevent compiler reorderings of reads and writes (to these variables), and some other optimizations
Volatile variables do not prevent processor reorderings (although in some compiler versions they used to ¯\_(ツ)_/¯)
In C#:
Volatile variables additionally prevent processor reorderings, producing unidirectional barriers
Compiler-Only Barriers
Prevent the compiler from moving reads or writes across the barrier
There are also compiler-only half-barriers; hardly useful
VC++: _ReadWriteBarrier()
GCC: asm volatile("" ::: "memory")
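A portable C++11 way to express the same thing (my suggestion, not from the slide) is std::atomic_signal_fence, which constrains compiler reordering without emitting a hardware fence:

```cpp
#include <atomic>

int a = 0, b = 0;

// atomic_signal_fence is a compiler-only barrier: the compiler may not move
// the stores across it, but no fence instruction is emitted, so the CPU may
// still reorder them. That is exactly the "compiler-only" caveat above.
void writer() {
    a = 1;
    std::atomic_signal_fence(std::memory_order_seq_cst);
    b = 1;
}
```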
Thread 1:
flag1 = 1;
if (flag2 == 0) {
    critical section
}
else handle_contention();

Thread 2:
flag2 = 1;
if (flag1 == 0) {
    critical section
}
else handle_contention();
z is std::atomic<int>

Monitor.Enter(sync);
++x;
Monitor.Exit(sync);
z = 42;
std::atomic
Portable API for low-level memory operations
Atomic: no torn reads or torn writes
Ordered: acquire/release and additional memory ordering guarantees
Suppose value and done are std::atomics:
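A minimal sketch of the value/done pattern with std::atomic (the helper run_value_done_demo and the literal 42 are my own illustration): because std::atomic defaults to sequentially consistent ordering, a reader that observes done == true is guaranteed to also see the write to value.

```cpp
#include <atomic>
#include <thread>

std::atomic<int> value{0};
std::atomic<bool> done{false};

// Writer publishes value, then signals done; reader spins on done, then
// reads value. The seq_cst store/load pair makes observed == 42 certain.
int run_value_done_demo() {
    int observed = 0;
    std::thread writer([] {
        value.store(42);
        done.store(true);
    });
    std::thread reader([&] {
        while (!done.load()) { /* spin until the writer signals */ }
        observed = value.load();
    });
    writer.join();
    reader.join();
    return observed;
}
```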
Examples
Thread-Safe Singleton: C++, BAD
static T* instance_ = nullptr;
static T* get_instance() {
if (instance_ == nullptr) {
instance_ = new T();
}
return instance_;
}
What if two threads enter this method at the same time?
What if two threads get inside the if statement?
What if an exception occurs during initialization of T? (Like bad_alloc.)
What if writes from T's constructor are reordered w.r.t. the write to instance_?

static T* get_instance() {
    if (instance_ == nullptr) {
        std::lock_guard<std::mutex> lock{ protector_ };
        if (instance_ == nullptr) {
            instance_ = new T();
        }
    }
    return instance_;
}
Thread-Safe Singleton: C++
// Requires fully-conformant C++11 compiler
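The code for this slide did not survive; a hedged sketch of what a conformant-C++11 singleton can look like (the config payload type is my own stand-in): "magic statics" require the compiler to make initialization of a function-local static thread-safe, so no explicit lock or double-checking is needed.

```cpp
// C++11 guarantees that a function-local static is initialized exactly
// once, even if multiple threads reach the declaration concurrently.
struct config { int version = 11; };  // hypothetical singleton payload

template <typename T>
T& get_instance() {
    static T instance;  // thread-safe lazy initialization ("magic static")
    return instance;
}
```

Every caller gets the same object, lazily constructed on first use.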
Thread-Safe Singleton: C#
static volatile T s_instance;
static readonly object s_protector = new object();
public static T Instance
{
get
{
if (s_instance != null) return s_instance;
lock (s_protector)
{
if (s_instance == null) s_instance = new T();
}
return s_instance;
}
}
Thread-Safe Singleton: C#
// Thread-safe but not lazy
static readonly T s_instance = new T();
public static T Instance
{
get { return s_instance; }
}
// Thread-safe and lazy
static readonly Lazy<T> s_instance = new Lazy<T>(
() => new T());
public static T Instance
{
    get { return s_instance.Value; }
}
Myths
Inconsistency Between Processors
Myth: Different processors may indefinitely keep seeing a different value for a shared variable
Reality: This can happen only for a very short time (on the order of a microsecond) because of CPU store buffering
Summary
Here be dragons!
If possible, try to hide behind someone else's synchronization primitives
Thank you!
Sasha Goldshtein CTO, Sela Group
@goldshtn blog.sashag.net