
GPU Architectures
A CPU Perspective

Derek Hower, AMD Research, 5/21/2013


With updates by David Wood

Goals

- Data Parallelism: What is it, and how to exploit it?
  - Workload characteristics
- Execution Models / GPU Architectures
  - MIMD (SPMD), SIMD, SIMT
- GPU Programming Models
  - Terminology translations: CPU <-> AMD GPU <-> Nvidia GPU
  - Intro to OpenCL
- Modern GPU Microarchitectures
  - i.e., programmable GPU pipelines, not their fixed-function predecessors
- Advanced Topics (time permitting)
  - The Limits of GPUs: What they can and cannot do
  - The Future of GPUs: Where do we go from here?

Data Parallel Execution on GPUs
Data Parallelism, Programming Models, SIMT

Graphics Workloads

- Identical, Independent, Streaming computation on pixels

[Diagram: a stream of pixels flowing into and out of the GPU]

Architecture Spelling Bee

Spell "Independent":

P-A-R-A-L-L-E-L

Generalize: Data Parallel Workloads

- Identical, Independent computation on multiple data inputs

[Diagram: the same function f() applied independently to every input element, e.g., input (0,7) -> f() -> output (7,0)]
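In code, a data parallel workload is just one function applied at every index. A minimal sketch in OpenCL C (the language introduced later in this deck); f() is a placeholder for the per-element work, not a real OpenCL builtin:

    // Identical, independent computation on multiple data inputs.
    float f(float v) { return v * v; }   // placeholder per-element work

    __kernel void apply_f(__global const float *in, __global float *out)
    {
        int i = get_global_id(0);        // each work-item handles one element
        out[i] = f(in[i]);               // no communication between elements
    }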

Naive Approach

- Split independent work over multiple processors

[Diagram: the input elements distributed across CPU0-CPU3, each computing f() on its own data]

Data Parallelism: A MIMD Approach

- Multiple Instruction Multiple Data
  - Split independent work over multiple processors

[Diagram: CPU0-CPU3, each running its own program through a full Fetch/Decode/Execute/Memory/Writeback pipeline on different data]

- When work is identical (same program):
  - Single Program Multiple Data (SPMD)
  - (Subcategory of MIMD)

Data Parallelism: An SPMD Approach

- Single Program Multiple Data
  - Split identical, independent work over multiple processors

[Diagram: CPU0-CPU3 all running the same program, each with a full Fetch/Decode/Execute/Memory/Writeback pipeline, on different data]

Data Parallelism: A SIMD Approach

- Single Instruction Multiple Data
  - Split identical, independent work over multiple execution units (lanes)
  - More efficient: eliminates redundant fetch/decode

[Diagram: CPU0 with one shared Fetch/Decode stage feeding four parallel Execute/Memory/Writeback lanes]

SIMD: A Closer Look

- One Thread + Data Parallel Ops
  - Single PC, single register file

[Diagram: one Fetch/Decode stage feeding four lanes; all lanes read and write one shared register file]
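OpenCL C's vector types give a feel for this model. A sketch (not from the deck): one work-item, one instruction stream, four data lanes per operation.

    // SIMD flavor: a single thread of control operating on 4-wide data.
    __kernel void scale4(__global float4 *data, float k)
    {
        int i = get_global_id(0);
        data[i] = data[i] * k;   // one multiply, applied across four lanes
    }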

Data Parallelism: A SIMT Approach

- Single Instruction Multiple Thread
  - Split identical, independent work over multiple lockstep threads
- Multiple Threads + Scalar Ops
  - One PC, multiple register files

[Diagram: one Fetch/Decode stage driving the four lockstep threads of wavefront WF0, each with its own Execute/Memory/Writeback]

Terminology Headache #1

It's common to interchange SIMD and SIMT.

Data Parallel Execution Models

- MIMD/SPMD: Multiple independent threads
- SIMD/Vector: One thread with a wide execution datapath
- SIMT: Multiple lockstep threads

Execution Model Comparison

                 MIMD/SPMD          SIMD/Vector          SIMT
  Example        Multicore CPUs     x86 SSE/AVX          GPUs
  architecture
  Pros           More general:      Can mix sequential   Easier to program;
                 supports TLP       & parallel code      Gather/Scatter
                                                         operations
  Cons           Inefficient for    Gather/Scatter       Divergence kills
                 data parallelism   can be awkward       performance

GPUs and Memory

- Recall: GPUs perform Streaming computation
  - Streaming memory access
- DRAM latency: 100s of GPU cycles
- How do we keep the GPU busy (hide memory latency)?

Hiding Memory Latency

- Options from the CPU world:
  - Caches: need spatial/temporal locality
  - OoO/Dynamic Scheduling: need ILP
  - Multicore/Multithreading/SMT: need independent threads

Multicore Multithreaded SIMT

- Many SIMT threads grouped together into a GPU Core
  - SIMT threads in a group ~ SMT threads in a CPU core
  - Unlike a CPU, groups are exposed to programmers
- Multiple GPU Cores

[Diagram: a GPU containing multiple GPU Cores]

This is a GPU Architecture (Whew!)

GPU Component Names

  AMD/OpenCL           Derek's CPU Analogy
  Processing Element   Lane
  SIMD Unit            Pipeline
  Compute Unit         Core
  GPU Device           Device

GPU Programming Models
OpenCL

GPU Programming Models

- CUDA - Compute Unified Device Architecture
  - Developed by Nvidia -- proprietary
  - First serious GPGPU language/environment
- OpenCL - Open Computing Language
  - From the makers of OpenGL
  - Wide industry support: AMD, Apple, Qualcomm, Nvidia (begrudgingly), etc.
- C++ AMP - C++ Accelerated Massive Parallelism
  - Microsoft's
  - Much higher abstraction than CUDA/OpenCL
- OpenACC - Open Accelerator
  - Like OpenMP for GPUs (semi-auto-parallelizes serial code)
  - Much higher abstraction than CUDA/OpenCL

OpenCL

- Early CPU languages were light abstractions of physical hardware
  - E.g., C
- Early GPU languages are light abstractions of physical hardware
  - OpenCL + CUDA

[Diagram: GPU Architecture vs. OpenCL Model -- the GPU device maps to an NDRange, each GPU Core to a Workgroup, and each SIMT thread to a Work-item; work-items are grouped into Wavefronts]

NDRange

- N-Dimensional (N = 1, 2, or 3) index space
- Partitioned into workgroups, wavefronts, and work-items

[Diagram: an NDRange partitioned into Workgroups]
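A sketch (not from the deck) of how a work-item locates itself within that index space, using OpenCL's built-in ID queries:

    __kernel void where_am_i(__global int *out)
    {
        int gid   = get_global_id(0);    // position in the whole NDRange
        int lid   = get_local_id(0);     // position within this workgroup
        int group = get_group_id(0);     // which workgroup this is
        int n     = get_local_size(0);   // work-items per workgroup

        // Same element, two coordinate systems (assuming no global offset):
        out[gid] = group * n + lid;      // equals gid
    }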

Kernel

- Run an NDRange on a kernel (i.e., a function)
- Same kernel executes for each work-item
- Smells like MIMD/SPMD... but beware, it's not!

[Diagram: the kernel function f() applied to every input element; a Workgroup covers one tile of the elements]

OpenCL Code

    // Flip an image about both axes and recolor each pixel.
    // recolor() is a helper assumed to be defined elsewhere.
    // __global buffers are flat 1D allocations, so the 2D image is
    // indexed manually; NDRange dimensions are numbered from 0.
    __kernel
    void flip_and_recolor(__global const float3 *in_image,
                          __global float3 *out_image,
                          int img_dim_x, int img_dim_y)
    {
        int x = get_global_id(0); // work-item id in dimension 0
        int y = get_global_id(1); // work-item id in dimension 1
        out_image[(img_dim_y - 1 - y) * img_dim_x + (img_dim_x - 1 - x)] =
            recolor(in_image[y * img_dim_x + x]);
    }
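A usage note (an assumption, since the deck doesn't show the host side): this kernel would be enqueued over a 2D NDRange of img_dim_x by img_dim_y work-items, one per pixel, e.g., via clEnqueueNDRangeKernel.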

GPU Microarchitecture
AMD Graphics Core Next

GPU Hardware Overview

[Diagram: a GPU attached to GDDR5 DRAM through a shared L2 Cache; each GPU Core contains multiple SIMT Units plus its own L1 Cache and Local Memory]

Compute Unit -- A GPU Core

- Compute Unit (CU): runs Workgroups
  - Contains 4 SIMT Units
  - Picks one SIMT Unit per cycle for scheduling
- SIMT Unit: runs Wavefronts
  - Each SIMT Unit has a 10-wavefront instruction buffer
  - Takes 4 cycles to execute one wavefront

[Diagram: a CU with four SIMT Units, an L1 Cache, and Local Memory]

- 10 wavefronts x 4 SIMT Units = 40 Active Wavefronts / CU
- 64 work-items / wavefront x 40 active wavefronts = 2560 Active Work-items / CU

Compute Unit Timing Diagram

On average: fetch & commit one wavefront / cycle

[Timing diagram: cycles 1-12 across SIMT0-SIMT3. Each wavefront occupies its SIMT unit for 4 consecutive cycles (WF1_0..WF1_3 on SIMT0, then WF5, WF9; WF2, WF6, WF10 on SIMT1; and so on), and the four SIMT units are staggered by one cycle, so the CU as a whole starts and finishes one wavefront per cycle.]

SIMT Unit -- A GPU Pipeline

- Like a wide CPU pipeline -- except one fetch for the entire width
- 16-wide physical ALU
  - Executes a 64-work-item wavefront over 4 cycles. Why??
- 64KB register state / SIMT Unit
  - Compare to x86 (Bulldozer): ~1KB of physical register file state (~1/64 the size)
- Address Coalescing Unit
  - A key to good memory performance

[Diagram: one Fetch/Decode stage feeding 16 lanes, each with its own registers, plus an Address Coalescing Unit]

Address Coalescing

- Wavefront: issues 64 memory requests
- Common case:
  - Work-items in the same wavefront touch the same cache block
- Coalescing:
  - Merge many work-items' requests into a single cache block request
- Important for performance:
  - Reduces bandwidth to DRAM
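A sketch (not from the deck) contrasting an access pattern that coalesces with one that does not:

    // Coalesced: consecutive work-items touch consecutive addresses, so a
    // wavefront's 64 requests merge into a handful of cache block requests.
    __kernel void coalesced(__global float *a)
    {
        int i = get_global_id(0);
        a[i] = a[i] + 1.0f;                   // lane k touches a[base + k]
    }

    // Uncoalesced: a large stride scatters each lane to a different cache
    // block, so the 64 requests cannot be merged.
    __kernel void strided(__global float *a, int stride)
    {
        int i = get_global_id(0);
        a[i * stride] = a[i * stride] + 1.0f; // lane k touches a[k * stride]
    }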

GPU Memory

- GPUs have caches.

Not Your CPU's Cache

- By the numbers: Bulldozer FX-8170 vs. GCN Radeon HD 7970

                                       CPU (Bulldozer)   GPU (GCN)
  L1 data cache capacity               16KB              16KB
  Active threads (work-items)
    sharing L1 D cache                 1                 2560
  L1 dcache capacity / thread          16KB              6.4 bytes
  Last level cache (LLC) capacity      8MB               768KB
  Active threads (work-items)
    sharing LLC                        8                 81,920
  LLC capacity / thread                1MB               9.6 bytes

GPU Caches

- Maximize throughput, not hide latency
  - Not there for either spatial or temporal locality
- L1 Cache: coalesce requests to the same cache block by different work-items
  - i.e., streaming "thread locality"?
  - Keep a block around just long enough for each work-item to hit it once
  - Ultimate goal: Reduce bandwidth to DRAM
- L2 Cache: DRAM staging buffer + some instruction reuse
  - Ultimate goal: Tolerate spikes in DRAM bandwidth
- If there is any spatial/temporal locality:
  - Use local memory (scratchpad)

Scratchpad Memory

- GPUs have scratchpads (Local Memory)
  - Separate address space
  - Allocated to a workgroup
    - i.e., shared by wavefronts in the workgroup
  - Managed by software:
    - Rename address
    - Manage capacity -- manual fill/eviction

[Diagram: a GPU Core's four SIMT Units sharing an L1 Cache and Local Memory]
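A sketch (not from the deck) of the idiom: a workgroup stages data in __local memory, synchronizes with a local barrier, then reuses it from the scratchpad instead of DRAM.

    __kernel void shift_sum(__global const float *in,
                            __global float *out,
                            __local float *tile)   // sized by the host at launch
    {
        int gid = get_global_id(0);
        int lid = get_local_id(0);
        int n   = get_local_size(0);

        tile[lid] = in[gid];              // one global load per work-item
        barrier(CLK_LOCAL_MEM_FENCE);     // wait for the whole workgroup

        // Neighbor read now hits the scratchpad, not DRAM.
        out[gid] = tile[lid] + tile[(lid + 1) % n];
    }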

Example System: Radeon HD 7970

- High-end part
- 32 Compute Units:
  - 81,920 Active work-items
  - 32 CUs x 4 SIMT Units x 16 ALUs = 2048 Max FP ops/cycle
  - 264 GB/s Max memory bandwidth
- 925 MHz engine clock
  - 3.79 TFLOPS single precision (accounting trickery: FMA)
- 210W Max Power (chip)
  - >350W Max Power (card)
  - 100W idle power (card)

Radeon HD 7990 -- Cooking

- Two 7970s on one card:
  - 375W (AMD official), 450W (OEM)

A Rose by Any Other Name

Terminology Headaches #2-5

  Nvidia/CUDA                AMD/OpenCL           Derek's CPU Analogy
  CUDA Processor             Processing Element   Lane
  CUDA Core                  SIMD Unit            Pipeline
  Streaming Multiprocessor   Compute Unit         Core
  GPU Device                 GPU Device           Device

Terminology Headaches #6-9

  CUDA/Nvidia   OpenCL/AMD   Hennessy & Patterson
  Thread        Work-item    Sequence of SIMD Lane Operations
  Warp          Wavefront    Thread of SIMD Instructions
  Block         Workgroup    Body of vectorized loop
  Grid          NDRange      Vectorized loop

Terminology Headache #10

- GPUs have scratchpads (Local Memory)
  - Allocated to a workgroup
  - i.e., shared by wavefronts in the workgroup

- Nvidia calls Local Memory "Shared Memory".
- AMD sometimes calls it "Group Memory".

Recap

- Data Parallelism: identical, independent work over multiple data inputs
  - GPU version: add streaming access pattern
- Data Parallel Execution Models: MIMD, SIMD, SIMT
- GPU Execution Model: Multicore Multithreaded SIMT
- OpenCL Programming Model
  - NDRange over workgroup/wavefront
- Modern GPU Microarchitecture: AMD Graphics Core Next (GCN)
  - Compute Unit (GPU Core): 4 SIMT Units
  - SIMT Unit (GPU Pipeline): 16-wide ALU pipe (16x4 execution)
  - Memory: designed to stream
- GPUs: great for data parallelism. Bad for everything else.

Advanced Topics
GPU Limitations, Future of GPGPU

Choose Your Own Adventure!

- SIMT Control Flow & Branch Divergence
- Memory Divergence
- When GPUs talk
  - Wavefront communication
  - GPU coherence
  - GPU consistency
- Future of GPUs: What's next?

SIMT Control Flow

- Consider a SIMT conditional branch:
  - One PC
  - Multiple data (i.e., multiple conditions)

    if (x <= 0)
        y = 0;
    else
        y = x;

SIMT Control Flow

- Work-items in a wavefront run in lockstep
  - Don't all have to commit
- Branching through predication
  - Active lane: commit result
  - Inactive lane: throw away result
- Branch divergence, step by step:
  - All lanes active at start:        1111
  - Branch -- set execution mask:     1000
  - Else -- invert execution mask:    0111
  - Converge -- reset execution mask: 1111

    if (x <= 0)
        y = 0;
    else
        y = x;

Branch Divergence

- When control flow diverges, all lanes take all paths

Divergence Kills Performance
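A sketch (not from the deck) of the branch above written two ways. When lanes disagree on the condition, the branchy version executes both sides under an execution mask; the branch-free version keeps every lane doing identical work. (For a branch this small the compiler typically predicates it anyway; the contrast matters for bigger branch bodies.)

    __kernel void with_branch(__global const float *xs, __global float *ys)
    {
        int i = get_global_id(0);
        float x = xs[i];
        float y;
        if (x <= 0.0f)      // lanes disagree -> both paths run, masked
            y = 0.0f;
        else
            y = x;
        ys[i] = y;
    }

    __kernel void branch_free(__global const float *xs, __global float *ys)
    {
        int i = get_global_id(0);
        ys[i] = fmax(xs[i], 0.0f);   // same result, no divergence
    }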

Beware!

- Divergence isn't just a performance problem:

    // Slide pseudocode, made concrete with OpenCL atomics. The lock
    // lives in a __global buffer initialized to 0 by the host.
    void mutex_lock(__global volatile int *lock)
    {
        // acquire lock: spin until we atomically swap 0 -> 1
        while (atomic_cmpxchg(lock, 0, 1) != 0) {
            // spin
        }
        return;
    }

Deadlock: work-items in a wavefront can't enter the mutex together!
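A common workaround (a sketch, not from the deck): elect a single work-item to take the lock on behalf of its whole workgroup, so lockstep lanes never contend with one another. Cross-workgroup spinning has hazards of its own on GPUs, so this is illustrative, not a general recipe.

    __kernel void one_lock_holder(__global volatile int *lock,
                                  __global int *counter)
    {
        if (get_local_id(0) == 0) {                  // only the elected lane spins
            while (atomic_cmpxchg(lock, 0, 1) != 0)
                ;                                    // spin
            *counter += 1;                           // critical section
            atomic_xchg(lock, 0);                    // release
        }
        barrier(CLK_GLOBAL_MEM_FENCE);               // rest of the group waits
    }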

Memory Bandwidth

[Diagrams: SIMT lanes 0-3 issuing requests to DRAM banks 0-3.
 Parallel access: each lane hits a different bank, so all requests proceed at once.
 Sequential access: lanes collide on the same banks and requests serialize -- this is memory divergence.]

Memory Divergence

- One work-item stalls -> the entire wavefront must stall
  - Cause: bank conflicts, cache misses
- Data layout & partitioning is important

Divergence Kills Performance
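A sketch (not from the deck) of why layout matters, using the classic Array-of-Structures vs. Structure-of-Arrays contrast:

    typedef struct { float x, y, z, w; } particle_t;

    // Array-of-Structures: lanes reading .x touch addresses 16 bytes
    // apart, spreading one wavefront's loads across many cache blocks.
    __kernel void aos_read(__global const particle_t *ps, __global float *out)
    {
        int i = get_global_id(0);
        out[i] = ps[i].x;          // strided access
    }

    // Structure-of-Arrays: lanes reading xs[] touch consecutive
    // addresses, so the wavefront's loads coalesce into few requests.
    __kernel void soa_read(__global const float *xs, __global float *out)
    {
        int i = get_global_id(0);
        out[i] = xs[i];            // unit-stride access
    }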

Communication and Synchronization

- Work-items can communicate with:
  - Work-items in the same wavefront
    - No special sync needed -- they are lockstep!
  - Work-items in a different wavefront, same workgroup (local)
    - Local barrier
  - Work-items in a different wavefront, different workgroup (global)
    - OpenCL 1.x: Nope
    - OpenCL 2.x: Yes, but...
    - CUDA 4.x: Yes, but complicated

GPU Consistency Models

- Very weak guarantee:
  - Program order respected within a single work-item
  - All other bets are off
- Safety net:
  - Fence -- make sure all previous accesses are visible before proceeding
  - Built-in barriers are also fences
- A wrench:
  - GPU fences are scoped -- they only apply to a subset of work-items in the system
  - E.g., local barrier
- Take-away: area of active research
  - See Hower et al., "Heterogeneous-race-free Memory Models", ASPLOS 2014
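A sketch (not from the deck) of the fence-as-safety-net idea in OpenCL 1.x, with the scoping caveat spelled out:

    __kernel void publish(__global int *data, __global volatile int *flag)
    {
        if (get_global_id(0) == 0) {
            data[0] = 42;                       // write the payload
            mem_fence(CLK_GLOBAL_MEM_FENCE);    // order payload before flag
            atomic_xchg(flag, 1);               // raise the flag
        }
        // The wrench: OpenCL 1.x makes no promise that a work-item in a
        // DIFFERENT workgroup will ever observe this -- fences are scoped.
    }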

GPU Coherence?

- Notice: the GPU consistency model does not require coherence
  - i.e., Single Writer, Multiple Reader
- Marketing claims they are coherent...
- "GPU coherence":
  - Nvidia: disable private caches
  - AMD: flush/invalidate entire cache at fences

GPU Architecture Research

- Blending with CPU architecture:
  - Dynamic scheduling / dynamic wavefront re-org
  - Work-items have more locality than we think
- Tighter integration with CPU on SoC:
  - Fast kernel launch
  - Exploit fine-grained parallel regions: remember Amdahl's law
  - Common shared memory
- Reliability:
  - Historically: who notices a bad pixel?
  - Future: GPU compute demands correctness
- Power:
  - Mobile, mobile, mobile!!!

Computer Economics 101

- GPU Compute is cool + gaining steam, but...
  - is a "$0 billion industry" (to quote Mark Hill)
- GPU design priorities:
  1. Graphics
  2. Graphics
  ...
  N-1. Graphics
  N. GPU Compute
- Moral of the story:
  - The GPU won't become a CPU (nor should it)
