Abstract
Manycore accelerators such as graphics processing units (GPUs) organize processing units into single-instruction, multiple-data (SIMD) cores to improve throughput per unit hardware cost. Programming models for these accelerators encourage applications to run kernels with large groups of parallel scalar threads. The hardware groups these threads into warps/wavefronts and executes them in lockstep, a scheme NVIDIA dubs single-instruction, multiple-thread (SIMT). While current GPUs employ a per-warp (or per-wavefront) stack to manage divergent control flow, this approach loses efficiency for applications with nested, data-dependent control flow. In this paper, we propose and evaluate the benefits of extending the sharing of resources within a block of warps, already used for scratchpad memory, to exploit control flow locality among threads (where such sharing may at first seem detrimental). In our proposal, warps within a thread block share a common block-wide stack for divergence handling. At a divergent branch, threads are compacted into new warps in hardware. Our simulation results show that this compaction mechanism provides an average speedup of 22% over a baseline per-warp, stack-based reconvergence mechanism, and 17% over dynamic warp formation, on a set of CUDA applications that suffer significantly from control flow divergence.
1. Introduction
Many throughput computing workloads exhibit data-level parallelism [6] yet achieve poor performance on current single-instruction, multiple-data (SIMD) accelerator architectures due to divergent control flow, or branch divergence [12, 9, 19, 13]. While conventional SIMD architectures can support correct execution of divergent control flow through vector lane masking (predication) [4] and/or stack-based reconvergence mechanisms [15, 7, 9], such mechanisms reduce throughput. Given the significant potential area and energy consumption advantages of SIMD hardware over general purpose parallel hardware, improved control-flow mechanisms are desirable.
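To make the problem concrete, the following is a minimal CUDA sketch (the kernel and its data are illustrative, not drawn from the applications studied here) of a data-dependent branch that causes threads within one warp to follow different paths, forcing the SIMD hardware to execute both paths with some lanes masked off:

__global__ void divergent_kernel(const int *data, int *out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    // Data-dependent branch: threads in the same warp may take different
    // paths, so both paths are serialized with lane masking.
    if (data[tid] > 0) {
        out[tid] = data[tid] * 2;   // taken path
    } else {
        out[tid] = -data[tid];      // not-taken path
    }
}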
Proposals for attacking this challenge include novel
[Figure: comparison of execution flow under a per-warp reconvergence stack (PDOM), dynamic warp formation (DWF, ideal and with a starvation eddy), and thread block compaction; and an example of how regrouping threads into new warps (WX, WY) raises the number of memory accesses needed to service a warp from 2 to 4. Summary performance data omitted.]
1 Note that the mechanisms studied in this paper support CUDA and
OpenCL programs with arbitrary control flow within a kernel.
[Figure: baseline accelerator architecture. A kernel launch from the CPU dispatches thread blocks over an interconnection network to SIMT cores, each with a register file, shared memory, and constant/texture/data caches behind a memory port; memory partitions contain last-level cache banks and off-chip DRAM channels.]
2.1. Architecture
We study GPU-like manycore accelerator architectures running CUDA applications [21, 24]. A CUDA program starts on a CPU and then launches parallel compute kernels onto the accelerator. Each kernel launch in CUDA dispatches a hierarchy of threads (a grid of blocks of warps of scalar threads running the same compute kernel) onto the accelerator. Blocks are allocated as a single unit of work to a single-instruction, multiple-thread (SIMT) [16] core, which is heavily multithreaded. Threads within a block can communicate via an on-chip shared memory. The SIMT cores access banks of a globally coherent last-level (L2) cache and DRAM memory channels via an on-chip network.
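As a minimal illustration of this hierarchy (the kernel name and sizes are ours, not from the benchmarks), the host launches a grid of thread blocks, and the threads of each block cooperate through on-chip shared memory and a block-wide barrier:

__global__ void block_sum(const float *in, float *out) {
    __shared__ float buf[256];        // on-chip shared memory, private to one block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = in[i];
    __syncthreads();                  // barrier among the threads of this block
    if (threadIdx.x == 0) {
        float s = 0.0f;
        for (int t = 0; t < blockDim.x; ++t) s += buf[t];
        out[blockIdx.x] = s;          // one result per thread block
    }
}

// Host side: a kernel launch dispatches a grid of thread blocks onto the
// accelerator, e.g. 1024 blocks of 256 threads (assumes in has 1024*256 elements):
//   block_sum<<<1024, 256>>>(d_in, d_out);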
With the SIMT execution model, scalar threads are managed as a SIMD execution group called a warp (or wavefront in AMD terminology). In the CUDA programming
model, each warp contains a fixed group of scalar threads
throughout a kernel launch. This arrangement of scalar
threads into warps is exposed to the CUDA programmer/compiler for various control flow and memory access optimizations [24]. In this paper, we refer to warps with this
arrangement as static warps to distinguish them from the
dynamic warps that are dynamically created via dynamic
warp formation or thread block compaction.
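A small sketch of how this static arrangement is visible to software (illustrative only; warpSize is 32 on the hardware considered here): a thread's static warp and its lane within that warp follow directly from its thread index.

__global__ void warp_info(int *warp_of, int *lane_of) {
    // With a 1D block, consecutive thread indices map to the same static
    // warp for the entire kernel launch.
    int tid = threadIdx.x;
    warp_of[tid] = tid / warpSize;   // which static warp within the block
    lane_of[tid] = tid % warpSize;   // lane (position) within that warp
}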
2.2. Microarchitecture
Figure 4 illustrates our understanding of the multithreaded microarchitecture within a SIMT core of a contemporary GPU, which we assumed while developing thread block compaction. We defined this microarchitecture, described below, by considering details found in recent patents [8, 17, 7]. In our evaluation, we approximate some details to simplify our simulation model: we model the fetch unit as always hitting in the instruction cache, and allow an instruction to be fetched, decoded, and issued in the same cycle.
[Figure 4: baseline SIMT core microarchitecture. A fetch unit and instruction cache fill a per-warp instruction buffer (I-Buffer); decoded instructions are checked against a scoreboard and selected by the issue arbiter; the branch unit holds a per-warp reconvergence stack (TOS) and active mask; issued warps read the register file and execute on the SIMD ALU lanes or in the memory pipeline (address generation, access coalescing, bank-conflict handling, shared memory, data/constant/texture caches, memory port).]
[Figures: SIMD efficiency of the benchmarks and normalized IPC of DWF and DWF-WB relative to PDOM; detailed data omitted.]
Figure 6. DWF with and without warp barrier compared against baseline per-warp PDOM stack reconvergence.
while this manager thread is acquiring tasks. DWF executes NVRT incorrectly because it does not enforce this behaviour: with DWF, the worker threads incorrectly execute ahead with obsolete task IDs while the manager thread is acquiring new tasks. In general, we observed three problems when running CUDA applications on a DWF-enabled execution model: (1) applications relying on implicit synchronization within a static warp (e.g., NVRT) execute incorrectly; (2) starvation eddies may reduce SIMD efficiency; and (3) thread regrouping in DWF increases non-coalesced memory accesses and shared memory bank conflicts.
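Problem (1) can be illustrated with a sketch of an implicitly warp-synchronous reduction, a common CUDA idiom of this kind (this is our own example, not code from NVRT): the final steps omit __syncthreads() because the threads involved belong to one static warp that executes in lockstep, an assumption that DWF's regrouping of threads into new dynamic warps can violate.

__global__ void warp_sync_reduce(const float *in, float *out) {
    __shared__ volatile float s[256];       // assumes blockDim.x == 256
    int tid = threadIdx.x;
    s[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();
    // Block-wide strides need an explicit barrier between steps.
    for (int stride = 128; stride > 16; stride >>= 1) {
        if (tid < stride) s[tid] = s[tid] + s[tid + stride];
        __syncthreads();
    }
    // Final strides rely on *implicit* warp synchronization: threads 0-31
    // belong to one static warp, so under per-warp lockstep execution no
    // barrier is needed. DWF may place these threads into different dynamic
    // warps, breaking this assumption.
    if (tid < 16) {
        s[tid] = s[tid] + s[tid + 16];
        s[tid] = s[tid] + s[tid + 8];
        s[tid] = s[tid] + s[tid + 4];
        s[tid] = s[tid] + s[tid + 2];
        s[tid] = s[tid] + s[tid + 1];
    }
    if (tid == 0) out[blockIdx.x] = s[0];
}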
[Figure: block-wide reconvergence stack contents (PC, RPC, active threads) as a thread block of eight threads diverges at a branch and reconverges at the immediate post-dominator, together with the resulting execution flow under a per-warp reconvergence stack versus thread block compaction, in which the divergent threads are compacted into fewer, fuller warps.]
3.2. Implementation
Figure 8 illustrates the modifications to the SIMT core microarchitecture to implement thread block compaction. The modifications consist of three major parts: a modified branch unit (1), a new hardware unit called the thread compactor (2), and a modified instruction buffer called the warp buffer (3). The branch unit (1) has a block-wide reconvergence stack for each block. Each entry in the stack consists of the starting PC (PC) of the basic block that corresponds to the entry, the reconvergence PC (RPC) that indicates when this entry will be popped from the stack, a warp counter (WCnt) that stores the number of compacted warps this entry contains, and a block-wide activemask that records which threads are executing the current basic block. The thread compactor (2) consists of a set of priority encoders that compact the block-wide activemask into compacted warps of thread IDs. The warp buffer (3) is an instruction buffer that augments each entry with the thread IDs associated with the compacted warps.
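A minimal sketch of one such entry, written as a C++ structure (field widths and the block-size bound B are illustrative; B matches the notation of Figure 8):

#include <bitset>

constexpr int B = 512;        // maximum threads per block (illustrative)

// Conceptual model of one block-wide reconvergence stack entry.
struct BlockStackEntry {
    unsigned       pc;        // starting PC of this entry's basic block
    unsigned       rpc;       // reconvergence PC: entry is popped once all of
                              // its warps reach this PC
    unsigned       wcnt;      // warp counter: number of compacted warps
                              // belonging to this entry
    std::bitset<B> active;    // block-wide activemask: which of the block's
                              // threads execute this basic block
};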
To retain compatibility with applications that rely on
static warp synchronous behaviour (e.g., reduction in
the CUDA SDK), the thread compactor can optionally be
[Figure 8 (details omitted): the warp buffer stores, for each entry, the instruction, its valid bit, and the thread IDs of a compacted warp; the thread compactor's per-lane priority encoders feed thread IDs to the warp buffer; the branch unit holds block-wide reconvergence stack entries (PC, RPC, WCnt, ActiveMask[1:B]).]
Figure 8. Modifications to the SIMT core microarchitecture to implement Thread Block Compaction. N = #warps, B = maximum #threads in a block, S = B/W where W = #threads in a warp.
[Figure 9: thread block compaction at a divergent branch. As each static warp (W1, W2, W3) of a 12-thread block reaches the branch, the block-wide stack entries for the taken and not-taken targets accumulate the arriving warps' active masks; once all warps have arrived, the thread compactor's per-lane priority encoders emit compacted warps of thread IDs over subsequent cycles.]
of set bits in all lanes' activemask). A single block-wide activemask can generate at most as many compacted warps as there are static warps in the executing thread block.
For branches that NVIDIA's CUDA compiler marks as potentially divergent, we always create two entries, for the taken and not-taken outcomes, when the first warp reaches the branch, even if that warp is not divergent. If all threads in a block branch to the same target, one of the entries will have all bits set to zero and will be popped immediately when it becomes the top of stack.
Figure 9 also shows the operation of the thread compactor (13). The block-wide activemask sent to the thread compactor is stored in multiple buffers, each responsible for threads in a single home vector lane [9]. Each cycle, each priority encoder selects at most one thread from its corresponding inputs and sends the ID of this thread to the warp buffer (4 in Figure 8). The bits corresponding to the selected threads are then reset, allowing the encoders to select the remaining active threads in subsequent cycles.
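The compaction step can be modelled in software as follows (a functional sketch only, ignoring cycle-level timing; W, B, and the function name are ours): each home lane holds the IDs of its pending threads, and in every round each lane contributes at most one thread to the next compacted warp, so a thread never changes its register-file lane.

#include <array>
#include <bitset>
#include <vector>

constexpr int W = 32;    // threads per warp (one home lane per warp slot)
constexpr int B = 512;   // maximum threads per block (illustrative)

// Functional model of the thread compactor: compacts a block-wide active
// mask into dynamic warps of thread IDs. Thread t keeps its home lane
// (t % W), so its register file bank does not change after compaction.
std::vector<std::array<int, W>> compact(const std::bitset<B> &active, int block_size) {
    std::vector<std::vector<int>> lane(W);          // pending IDs per home lane
    for (int t = 0; t < block_size; ++t)
        if (active[t]) lane[t % W].push_back(t);

    std::vector<std::array<int, W>> warps;
    for (bool any = true; any; ) {
        std::array<int, W> w;
        w.fill(-1);                                  // -1 marks an idle lane
        any = false;
        for (int l = 0; l < W; ++l) {                // one priority encoder per
            if (!lane[l].empty()) {                  // lane selects at most one
                w[l] = lane[l].front();              // pending thread per round
                lane[l].erase(lane[l].begin());
                any = true;
            }
        }
        if (any) warps.push_back(w);                 // emit one compacted warp
    }
    return warps;
}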
4. Likely-Convergence Points
[Figure: example control-flow graph (blocks A-F) with edge frequencies and the corresponding reconvergence stack contents, illustrating how adding a likely-convergence entry lets most threads reconverge at a frequently reached block instead of waiting for the immediate post-dominator.]
[Figure 10 (data omitted): PDOM, PDOM-LCP, TBC, and TBC-LCP on NVRT and MUM.]
Figure 10. Likely-convergence points improve individual warp SIMD efficiency by reconverging before the immediate post-dominator.
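The following CUDA fragment (our own illustration, not code from NVRT or MUM) shows the kind of control flow where a likely-convergence point helps: because of the break, the immediate post-dominator of the divergent branch is the loop exit, yet whenever no thread actually breaks, both paths rejoin at the statement following the branch, which is the likely-convergence point.

__device__ int fixup(int v) { return (v > 0) ? v : -(v + 1); }

__global__ void scan(const int *val, int *out, int n) {
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        int v = val[i];
        if (v != 0) {            // divergent branch: because of the break below,
            v = fixup(v);        // its immediate post-dominator is the loop exit
            if (v < 0) break;    // rare early exit
        }
        // Likely-convergence point: whenever no thread takes the break, the
        // taken and not-taken paths of the branch above rejoin here instead
        // of waiting for the immediate post-dominator after the loop.
        out[i] = v;
    }
}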
5. Methodology
We model our proposed hardware changes using a modified version of GPGPU-Sim (version 2.1.1b) [3]. We evaluate the performance of various hardware configurations on the CUDA benchmarks listed in Table 1. Most of these benchmarks are selected from Rodinia [5] and the benchmarks used by Bakhoda et al. [3]. We did not exclude any benchmarks due to poor performance with thread block compaction, but we excluded some Rodinia benchmarks that do not run on our infrastructure because they rely on undocumented behaviour of barrier synchronization in CUDA. We also use some benchmarks from other sources:
Face Detection is a part of Visbench [19]. Recommended
optimizations [18] for SIMD efficiency were applied.
MUMMER-GPU++ improves on MUMMER-GPU [26], reducing data transfers with a novel data structure [11].
NAMD is a popular molecular dynamics simulator [25].
Ray Tracing (Persistent Threads) dynamically
distributes workload in software to mitigate hardware
inefficiencies [1]. We render the Conference scene.
7 Since likely-convergence and immediate post-dominator are the same.
[Table 1. Benchmarks — name, abbreviation (Abbr.), thread block dimensions (BlockDim), dynamic instruction count (#Instr.), and blocks per core for each benchmark in the divergent and coherent sets; per-benchmark rows omitted.]
6. Experimental Results
Figure 12 shows the performance of TBC and DWF relative to that of PDOM. TBC uses likely-convergence points whereas PDOM does not, and DWF and DWF-WB use majority scheduling [9]. For the divergent (DIVG) benchmark set, TBC has an overall 22% speedup over PDOM. Much of this benefit comes from speedups on the applications that have very low baseline SIMD efficiency (BFS2, FCDT, MUM, MUMpp, NVRT). While DWF can achieve speedups on these benchmarks as well (except for NVRT, which fails to execute), it also exhibits slowdowns on the other benchmarks that have higher baseline SIMD efficiency (HOTSP, LPS, NAMD), lowering its overall speedup to 4%. DWF with warp barrier (DWF-WB) recovers most of this slowdown and executes NVRT properly, but loses much of the speedup on MUMpp. Overall, TBC is 17% faster than the original DWF and 6% faster than DWF-WB.
Simulated GPU configuration:
# Streaming Multiprocessors: 30
Warp Size: 32
SIMD Pipeline Width: 8
Number of Threads / Core: 1024
Number of Registers / Core: 16384
Shared Memory / Core: 16KB
Constant Cache Size / Core: 8KB
Texture Cache Size / Core: 32KB, 64B line, 16-way assoc.
Number of Memory Channels: 8
L1 Data Cache: 32KB, 64B line, 8-way assoc.
L2 Unified Cache: 1MB / Memory Channel, 64B line, 64-way assoc.
Compute Core Clock: 1300 MHz
Interconnect Clock: 650 MHz
Memory Clock: 800 MHz
DRAM Request Queue Capacity: 32
Memory Controller: out-of-order (FR-FCFS)
Branch Divergence Method: PDOM [9]
Warp Scheduling Policy: Loose Round Robin
GDDR3 Memory Timing: tCL=10, tRP=10, tRC=35, tRAS=25, tRCD=12, tRRD=8
Memory Channel BW: 8 Bytes/Cycle
[Figure 12 (data omitted): normalized IPC of DWF, DWF-WB, and TBC relative to PDOM for each benchmark in the DIVG and COHE sets, with harmonic means (HM).]
Figure 12. Performance of Thread Block Compaction (TBC) and Dynamic Warp Formation (DWF) relative to baseline per-warp post-dominator reconvergence stack (PDOM) for the DIVG and COHE benchmark sets.
[Figure (data omitted): core cycles of PDOM, DWF, DWF-WB, and TBC for each benchmark, broken down by warp-group issue cycles (W1-4 through W29-32), idle and memory-pending cycles, pipeline stalls, and stalls from shared memory bank conflicts, memory coalescing, and constant/texture/global/local memory accesses.]
Figure 15 compares the performance of TBC under different thread block prioritization policies. A thread block prioritization policy sets the scheduling priority among warps from different thread blocks, while warps within a thread block are always scheduled with a loose round-robin policy [9]:
7. Implementation Complexity
Most of the implementation complexity of thread block compaction lies in the extra storage for thread IDs in the warp buffer and in the area overhead of relaying these IDs down the pipeline for register accesses.
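As a rough estimate based on the configuration in Section 5 (our own back-of-the-envelope numbers, assuming thread IDs name one of the 1024 hardware thread slots in a core): with 32-wide warps there are 1024 / 32 = 32 warp buffer entries per core, and each entry stores 32 thread IDs of log2(1024) = 10 bits, so the added thread-ID storage is 32 x 32 x 10 = 10,240 bits, or about 1.25 KB per core, in addition to the wiring needed to carry the selected IDs down the pipeline to index the register file.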
#Entries per benchmark:
DIVG: BFS2 = 3, FCDT = 5, HOTSP = 2, LPS = 5, MUM = 13, MUMpp = 14, NAMD = 5, NVRT = 5.
COHE: AES = 1, BACKP = 2, CP = 1, DG = 2, HRTWL = 5, LIB = 1, LKYT = 3, MGST = 38, NNC = 3, RAY = 9, STMCL = 3, STO = 1, WP = 8.
8. Conclusion
In this paper, we proposed a novel mechanism, thread block compaction, which uses a block-wide reconvergence stack shared by all threads in a thread block to exploit their control flow locality. Warps run freely until they encounter a divergent branch, where they synchronize and their threads are compacted into new warps. At the reconvergence point, the compacted warps synchronize again and resume in their original arrangements from before the divergence. We found that our proposal addresses some key challenges of dynamic warp formation [9]. Our simulations show that it achieves an overall 22% speedup over a per-warp reconvergence stack baseline for a set of divergent applications, while introducing no performance penalty for a set of control-flow coherent applications.
9. Acknowledgements
We thank the anonymous reviewers for their valuable comments. This work was supported by the Natural Sciences and Engineering Research Council of Canada.
[Figure 15 (data omitted): normalized IPC of TBC with the AGE, SRR, and RRB thread block prioritization policies relative to PDOM for each benchmark in the DIVG and COHE sets, with harmonic means (HM).]
Figure 15. Performance of TBC with various thread block prioritization policies relative to PDOM for the DIVG and COHE benchmark sets.
[Figure (data omitted): memory traffic of TBC-AGE and TBC-RRB normalized to PDOM for the DIVG and COHE benchmarks, and normalized IPC of PDOM, DWF, DWF-WB, TBC-AGE, TBC-SRR, and TBC-RRB with 1MB and 128kB last-level cache banks.]
References
[1] T. Aila and S. Laine. Understanding the Efficiency of Ray Traversal on GPUs. In HPG '09, 2009.
[2] AMD. R700-Family Instruction Set Architecture, 1.0 edition, March 2009.
[3] A. Bakhoda et al. Analyzing CUDA Workloads Using a Detailed GPU Simulator. In Int'l Symp. on Performance Analysis of Systems and Software (ISPASS), pages 163-174, April 2009.
[4] W. Bouknight et al. The Illiac IV System. Proceedings of the IEEE, 60(4):369-388, April 1972.
[5] S. Che et al. Rodinia: A Benchmark Suite for Heterogeneous Computing. In Int'l Symp. on Workload Characterization (IISWC), pages 44-54, 2009.
[6] Y.-K. Chen et al. Convergence of Recognition, Mining, and Synthesis Workloads and its Implications. Proceedings of the IEEE, 96(5), May 2008.
[7] B. W. Coon et al. United States Patent #7,353,369: System and Method for Managing Divergent Threads in a SIMD Architecture (Assignee NVIDIA Corp.), April 2008.
[8] B. W. Coon et al. United States Patent #7,434,032: Tracking Register Usage During Multithreaded Processing Using a Scoreboard having Separate Memory Regions and Storing Sequential Register Size Indicators (Assignee NVIDIA Corp.), October 2008.
[9] W. Fung et al. Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow. In Proc. 40th IEEE/ACM Int'l Symp. on Microarchitecture, 2007.