
Accelerating CUDA Graph Algorithms at Maximum Warp

By: Sungpack Hong, Sang Kyun Kim, Tayo Oguntebi, Kunle Olukotun

Presenter: Thang M. Le

Authors

Sungpack Hong

Ph.D. graduate from Stanford; currently a principal member of technical staff at Oracle

Sang Kyun Kim

Ph.D. candidate at Stanford

Tayo Oguntebi

Ph.D. candidate at Stanford

Kunle Olukotun

Professor of Electrical Engineering & Computer Science at Stanford; Director of the Pervasive Parallelism Laboratory

Agenda

What Is the Problem?
Why Does the Problem Exist?
Warp-Centric Programming Method
Other Techniques
Experimental Results
Study of Architectural Effects
Q&A

What Is the Problem?

The Parallel Random Access Machine (PRAM) abstraction is often used to investigate the theoretical parallel performance of graph algorithms
The PRAM approximation is quite accurate in supercomputer domains such as the Cray XMT
PRAM-based algorithms fail to perform well on GPU architectures due to workload imbalance among threads

Why Does the Problem Exist?

CUDA thread model exhibits certain discrepancies with the GPU architecture

Notably, there is no explicit notion of warps
Memory accesses behave differently depending on the access pattern:

Requests targeting the same address are merged
Accesses with spatial locality are maximally coalesced
All other memory requests are serialized

Why Does the Problem Exist?

SIMT relaxes SIMD constraints by allowing threads in a warp to follow different execution paths (path divergence)
Path divergence provides more flexibility at the cost of performance
Path divergence leads to hardware underutilization

Scattered memory access patterns

Why Does the Problem Exist?

A thread that processes a high-degree node iterates the loop at line 23 (the neighbor-expansion loop of the baseline kernel) many more times than the other threads, stalling the rest of the warp, as sketched below
Path divergence
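A minimal sketch of such a baseline, thread-per-node BFS kernel, assuming a CSR graph with hypothetical arrays row[], col[], and level[] (an illustration, not the paper's exact code):

// Baseline (thread-to-data) BFS level expansion: one thread per node.
// row[0..n] holds CSR edge offsets, col[] holds neighbor ids,
// level[] holds the BFS level of each node (-1 if unvisited).
__global__ void bfs_baseline(const int *row, const int *col,
                             int *level, int n, int curr, bool *changed) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v < n && level[v] == curr) {
        // A high-degree node makes this loop much longer for its thread,
        // while the other threads of the warp sit idle (path divergence).
        for (int e = row[v]; e < row[v + 1]; e++) {
            int w = col[e];
            if (level[w] == -1) {           // neighbor not yet visited
                level[w] = curr + 1;        // benign race: all writers store the same value
                *changed = true;
            }
        }
    }
}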

Why Does the Problem Exist?

Non-coalesced memory

Warp-Centric Programming Method

A program can run in either a SISD phase or a SIMT phase:

SISD:
All threads in a warp execute on the same data
Degree of parallelism (per SM) = O(# concurrent warps)

SIMT:
Each thread executes on different data
Degree of parallelism (per SM) = O(# threads per warp x # concurrent warps)

By default, all threads in a warp execute in SISD fashion
When appropriate, the kernel switches to SIMT to exploit data parallelism, as sketched below
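A minimal sketch of this two-phase structure for BFS, assuming a physical warp size of 32 and the same hypothetical CSR arrays as before (a sketch, not the paper's exact code):

#define WARP_SZ 32

// Warp-centric BFS expansion: one warp per node of the current frontier.
__global__ void bfs_warp_centric(const int *row, const int *col,
                                 int *level, int n, int curr, bool *changed) {
    int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / WARP_SZ;
    int lane    = threadIdx.x % WARP_SZ;

    // SISD phase: every lane of the warp evaluates the same code on the same data.
    int v = warp_id;
    if (v >= n || level[v] != curr) return;   // the whole warp takes the same branch
    int begin = row[v], end = row[v + 1];

    // SIMT phase: the 32 lanes cooperate on v's neighbor list.
    for (int e = begin + lane; e < end; e += WARP_SZ) {
        int w = col[e];
        if (level[w] == -1) {
            level[w] = curr + 1;
            *changed = true;
        }
    }
}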

Warp-Centric Programming Method

Is it safe to run logic in SISD fashion on a GPU architecture?


[Diagram: methodA takes an input and produces an output]

Since methodA will be executed redundantly by every thread in the warp on the same input, its logic must be deterministic to guarantee correctness in the SISD phase (see the sketch below)
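A hedged illustration of this requirement (hypothetical code, not from the paper): code replicated across the lanes of a warp must produce the same result on every lane, so per-thread state and atomics have to stay out of the SISD phase.

// Safe in the SISD phase: purely a function of the shared input,
// so every lane of the warp computes the same value.
__device__ int method_a(const int *row, int v) {
    return row[v + 1] - row[v];        // deterministic: degree of node v
}

// NOT safe in the SISD phase: each lane would receive a different counter
// value, so the lanes of the warp would silently disagree.
__device__ int method_a_bad(int *counter) {
    return atomicAdd(counter, 1);      // non-deterministic across lanes
}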

Warp-Centric Programming Method

Advantages:

No path divergence
Increased memory coalescing
Makes it possible to take advantage of shared memory

Warp-Centric Programming Method

Warp-Centric Programming Method

Traditional approach: thread-to-data mapping

[Diagram: thread i is mapped directly to element Level[i], for i = 0 .. n]

Warp-Centric Programming Method

Warp-centric approach: warp-to-chunk-of-data mapping

[Diagram: each warp k is mapped to a chunk of the Level[] array, and the threads of that warp cooperate on the chunk]

Warp-Centric Programming Method

Warp-centric approach: warp-to-chunk of data

Coarse-grained mapping
Each chunk is mapped to a warp (number of chunks ≥ number of warps)
Chunk size and warp size are independent
Chunk size is limited by the size of shared memory (see the sketch below)
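A minimal sketch of the warp-to-chunk mapping with the chunk staged in shared memory, assuming one warp per block and a hypothetical CHUNK_SZ (the chunk size is bounded by the shared memory available per block):

#define WARP_SZ  32
#define CHUNK_SZ 64   // assumed chunk size; must fit in per-block shared memory

// Launched with blockDim.x == WARP_SZ (one warp per block) for brevity.
__global__ void process_chunks(const int *level, int n) {
    __shared__ int chunk[CHUNK_SZ];   // the warp's chunk, staged in shared memory
    int lane     = threadIdx.x;       // 0 .. WARP_SZ-1
    int chunk_id = blockIdx.x;        // coarse-grained: one chunk per warp
    int base     = chunk_id * CHUNK_SZ;

    // Cooperative, coalesced copy of the chunk from global to shared memory.
    for (int i = lane; i < CHUNK_SZ && base + i < n; i += WARP_SZ)
        chunk[i] = level[base + i];
    __syncwarp();                     // make the copy visible to the whole warp

    // ... the warp then processes chunk[] instead of touching global memory ...
}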

Warp-Centric Programming Method

Disadvantages:

If the native SIMT width of the user application is small, the underlying hardware will be underutilized (# threads assigned to data < # physical cores)

Warp-Centric Programming Method

Disadvantages:

The ratio of the SIMT-phase duration to the SISD-phase duration imposes an Amdahl's Law limit on performance
Amdahl's Law: overall speedup = 1 / ((1 - p) + p / s), where p is the fraction of execution spent in the (parallel) SIMT phase and s is the speedup of that phase

Improvement: reduce the fraction of time spent in the SISD phase (addressed by the virtual-warp technique on the next slide; a worked example of the bound follows)
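A worked instance of this bound with illustrative numbers (not measurements from the paper), written out in LaTeX:

% Suppose a fraction p = 0.8 of kernel time is spent in the SIMT phase and the
% warp-centric scheme speeds that phase up by s = 8x. Then
\[
S = \frac{1}{(1 - p) + \frac{p}{s}}
  = \frac{1}{0.2 + \frac{0.8}{8}}
  = \frac{1}{0.3}
  \approx 3.3\times
\]
% and no matter how large s becomes, S stays below 1 / (1 - p) = 5x because of
% the serial SISD phase.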

Warp-Centric Programming Method

Addressing these issues:

Partition each physical warp into K smaller virtual warps

Virtual warp size = physical warp size / K

Increases parallelism within the SISD phase
Improves ALU utilization by O(K)
Drawback: may re-introduce path divergence among the virtual warps sharing a physical warp; this is the trade-off between path divergence and ALU utilization (see the sketch below)
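A minimal sketch of virtual warps, assuming a physical warp size of 32, K = 4 virtual warps per physical warp, and the same hypothetical CSR arrays as before (a sketch, not the paper's code):

#define WARP_SZ  32
#define K        4                      // assumed number of virtual warps per physical warp
#define VWARP_SZ (WARP_SZ / K)          // virtual warp size = physical warp size / K

__global__ void bfs_virtual_warp(const int *row, const int *col,
                                 int *level, int n, int curr, bool *changed) {
    int tid      = blockIdx.x * blockDim.x + threadIdx.x;
    int vwarp_id = tid / VWARP_SZ;      // which virtual warp this thread belongs to
    int vlane    = tid % VWARP_SZ;      // lane within the virtual warp

    // SISD phase, now replicated only across the VWARP_SZ lanes of a virtual warp.
    int v = vwarp_id;
    if (v >= n || level[v] != curr) return;

    // SIMT phase: the virtual warp's lanes cooperate on v's neighbor list.
    for (int e = row[v] + vlane; e < row[v + 1]; e += VWARP_SZ) {
        int w = col[e];
        if (level[w] == -1) {
            level[w] = curr + 1;
            *changed = true;
        }
    }
}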

Other Techniques

Deferring Outliers:

Define a degree threshold
Defer processing any node whose degree exceeds the threshold
Process the deferred outliers in a separate kernel (see the sketch below)
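A hedged sketch of outlier deferral; the threshold value, queue layout, and names are assumptions rather than the paper's exact scheme:

#define DEGREE_THRESHOLD 256   // assumed cutoff for "outlier" (high-degree) nodes

// Pass 1: ordinary nodes are expanded in place; outliers are queued for later.
__global__ void bfs_defer_outliers(const int *row, const int *col, int *level,
                                   int n, int curr, bool *changed,
                                   int *outlier_queue, int *outlier_count) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= n || level[v] != curr) return;

    int degree = row[v + 1] - row[v];
    if (degree > DEGREE_THRESHOLD) {
        // Too expensive for a single thread: enqueue and handle in a second
        // kernel, where a whole warp (or block) expands each outlier cooperatively.
        outlier_queue[atomicAdd(outlier_count, 1)] = v;
        return;
    }
    for (int e = row[v]; e < row[v + 1]; e++) {
        int w = col[e];
        if (level[w] == -1) { level[w] = curr + 1; *changed = true; }
    }
}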

Dynamic Workload Distribution:

The virtual warp-centric method does not prevent work imbalance among the warps within a block
Each warp fetches a chunk of work from a shared work queue
Trade-off between static and dynamic work distribution (see the sketch below):

Static work distribution suffers from work imbalance
Dynamic work distribution imposes fetch overhead
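A minimal sketch of dynamic work fetching through an atomic counter (queue layout, chunk size, and names are assumptions):

#define WARP_SZ  32
#define CHUNK_SZ 16   // assumed number of work items grabbed per fetch

// Each warp repeatedly grabs the next chunk of node ids from a shared queue.
__global__ void dynamic_warp_work(const int *work_queue, int work_count,
                                  int *next_chunk /* global counter, initialized to 0 */) {
    int lane = threadIdx.x % WARP_SZ;

    while (true) {
        int base = 0;
        if (lane == 0)                               // one lane fetches the next chunk
            base = atomicAdd(next_chunk, CHUNK_SZ);
        base = __shfl_sync(0xffffffff, base, 0);     // broadcast to the whole warp
        if (base >= work_count) break;               // queue exhausted

        // The lanes then process the chunk's items cooperatively.
        for (int i = base + lane; i < base + CHUNK_SZ && i < work_count; i += WARP_SZ) {
            int v = work_queue[i];
            // ... expand node v here ...
        }
    }
}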

Experimental Results

Input Graphs:

RMAT: a scale-free graph that follows a power-law degree distribution, like many real-world graphs; average vertex degree of 12
RANDOM: a uniformly distributed graph created by randomly connecting m pairs of nodes out of n total nodes; average vertex degree of 12
LiveJournal: a real-world graph with a very irregular structure
Patent: a relatively regular graph with a smaller average degree

Experimental Results

Experimental Results

Study of Architectural Effects

Advantages of GPU Architecture


Enables massively parallel execution
Uses a large number of warps to hide memory latency
Uses GDDR3 memory, which has higher bandwidth and lower latency than FB-DIMM-based CPU main memory

Study of Architectural Effects

Effect of bandwidth utilization and latency hiding in GPUs

Conclusion:

Graph algorithms are bound by memory bandwidth

Last But Not Least

"A supercomputer is a device for turning compute-bound problems into I/O-bound problems." (Ken Batcher)
