Memory in C/C++
Mission:
Agenda
Cache overview
- 6 pipelines, 6 instructions per cycle
- 64 kB L1 cache
- 256 kB L2 cache
(Diagram courtesy of Intel)
Performance Win:
Agenda
Data access pattern
void mtx_mult() {
    for (int i=0; i<N; i++)
        for (int j=0; j<N; j++)
            for (int k=0; k<N; k++)
                C[i][j] += A[i][k] * B[k][j];  // poor spatial locality: B walked down a column
}
[Diagram: each k iteration touches a new row of B (B[0][5], B[1][5], B[2][5], B[3][5], ...), so every access to B is a cache miss]
void mtx_mult() {
    for (int i=0; i<N; i++)
        for (int k=0; k<N; k++)
            for (int j=0; j<N; j++)
                C[i][j] += A[i][k] * B[k][j];  // loops interchanged: B walked along a row
}
[Diagram: consecutive j iterations stay within one row of B (B[0][5], B[0][6], B[0][7], B[0][8]), so there is one miss and then hits within the same cache line]
Performance Win: watch your access patterns.
Agenda
Multi-core effects
[Chart: speedup vs. thread count (5, 10): 1.6x, 2.4x]
Question:
[Chart: speedup vs. thread count (5, 10, 20) with L3 cache available: 1.6x, 2.4x, 18x]
Performance Win:
Especially when multithreading I/O-bound code
Agenda
Vector code
struct Particle {
    float_3 pos;
    float_3 vel;
    float_3 acc;
};
Particle particles[4096];
[Diagram: Particles in memory, array-of-structs layout: pos, vel, acc interleaved per particle, repeating]
Pop quiz: what's hot, and why?
[Diagram: the loop reads particles[j+4].pos.x/y/z and particles[j+6].pos.x/y/z, scattered across the interleaved layout]
[Diagram: gathering pos fields from Particles into three ymm registers (256 bits each); each register is assembled with 2x vmovss, 6x vinsertps, and 1x vinsertf128]
Question:
[Diagram: Particles in struct-of-arrays layout: separate contiguous X, Y, and Z arrays]
New code: 42 fps
Performance Win:
Agenda
Alignment
Performance Win:
Agenda
struct S {
    bool a;
    bool b;
    bool c;
    bool d;
};
Size of S is 4 bytes
struct S {
    int a : 1;
    int b : 1;
    int c : 1;
    int d : 1;
};

void set_c(S &s, bool v) {
    s.c = v;
}
struct MyDataType {
    float a; float b;   // Hot
    float c; float d;   // Hot
    float w; float x;   // Cold
    float y; float z;   // Cold
};
MyDataType myDataArray[100000];

[Diagram: myDataArray in memory; accessing 4 hot bytes (e.g. float b) also drags neighboring cold fields (e.g. float x) through the cache]

Hot/cold split:

struct MyHotDataType {
    float a; float b;
    float c; float d;
};
struct MyColdDataType {
    float w; float x;
    float y; float z;
};
struct MyDataType {
    MyHotDataType myHotData[100000];
    MyColdDataType myColdData[100000];
};
MyDataType myData;
Agenda
Win:
Caveat:
StructName
http://blogs.msdn.com/b/nativeconcurrency
http://blogs.msdn.com/b/vcblog/
http://channel9.msdn.com/Shows/C9-GoingNative
Why?
Misprediction cost is 14 cycles on Sandy Bridge.

class Vehicle {
    virtual void Drive(int miles) { ... }
};
class Car : public Vehicle {
    void Drive(int miles) { ... }
};

void test(Vehicle *v) {
    v->Drive(10);   // indirect call, requires prediction
}

Devirtualized call, no prediction:
    push 10
    call ?Drive@Car@@UAEXH@Z
[Diagram: CPU front-end, execution pipelines, back-end; courtesy of Intel]
Latency (cycles):
- Add/Sub (1), Logical (1): 1-cycle throughput
- FP mul (4)
- FP sqrt (14), FP div (14): 14-cycle throughput
- L1 access (4)
- L2 access (12)
- L3 access (26)
- L3 miss ()
Auto Vectorization

#include <intrin.h>
for (int i=0; i<100; i+=4) {
    __m128 t1 = _mm_loadu_ps(&A[i]);
    __m128 t2 = _mm_loadu_ps(&B[i]);
    __m128 t3 = _mm_add_ps(t1, t2);
    _mm_storeu_ps(&C[i], t3);
}
Low latency SIMD
[Diagram: scalar = 1 operation (r1 + r2 -> r3); vector = N operations across the vector length (v1 + v2 -> v3); courtesy of Intel]
Auto Vectorization

for (int i=0; i<100; i++)
    A[i] = B[i] + C[i];

becomes 128-bit operations. Speedup!

#include <intrin.h>
for (int i=0; i<100; i+=4) {
    __m128 t1 = _mm_loadu_ps(&A[i]);
    __m128 t2 = _mm_loadu_ps(&B[i]);
    __m128 t3 = _mm_add_ps(t1, t2);
    _mm_storeu_ps(&C[i], t3);
}
movss   xmm0, s$
shufps  xmm0, xmm0, 0
movups  xmm1, [ecx]
mulps   xmm1, xmm0
movups  [edx], xmm1

Performance win. Code size win.
New in Visual C++ 2013