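type(dt) :: x
...
! copy the derived type itself first, then its allocatable member
!$acc enter data copyin(x)
!$acc enter data copyin(x%xm)
...
! release in reverse order: member first, then the parent
!$acc exit data copyout(x%xm)
!$acc exit data delete(x)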
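The fragment above can be fleshed out into a compilable sketch. The definition of type dt is not shown on the slide, so the one below is an assumption modeled on the mdt type that follows; the module name dt_example and the function double_and_sum are hypothetical names used only for illustration. It also assumes a compiler that supports derived types with allocatable members in data clauses.

```fortran
module dt_example
  implicit none
  ! Assumed definition: the slide uses type(dt) without showing it.
  type dt
    integer :: n
    real, dimension(:), allocatable :: xm
  end type
contains
  ! Hypothetical helper: doubles n ones on the device and returns their sum.
  function double_and_sum(n) result(total)
    integer, intent(in) :: n
    real :: total
    type(dt) :: x
    integer :: i

    x%n = n
    allocate(x%xm(n))
    x%xm = 1.0

    ! Parent first, then the allocatable member.
    !$acc enter data copyin(x)
    !$acc enter data copyin(x%xm)

    !$acc parallel loop present(x)
    do i = 1, n
      x%xm(i) = 2.0 * x%xm(i)
    end do

    ! Member first, then the parent.
    !$acc exit data copyout(x%xm)
    !$acc exit data delete(x)

    total = sum(x%xm)
    deallocate(x%xm)
  end function
end module
```

Called from a driver program, double_and_sum(8) should return 16.0. Without OpenACC enabled the directives are plain comments and the loop runs on the host with the same result.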
type mdt
  integer :: n
  real, dimension(:), allocatable :: xm
end type
...
! x is an array of type(mdt): copy the array, then each element's member
!$acc enter data copyin(x)
do i = 1, n
  !$acc enter data copyin(x(i)%xm)
enddo
...
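A fuller sketch of the array-of-derived-types pattern follows. The module name mdt_example, the function fill_and_sum, and the sizes n and m are hypothetical; the exit data sequence is also an assumption, since the slide elides it, and simply mirrors the enter data order in reverse.

```fortran
module mdt_example
  implicit none
  type mdt
    integer :: n
    real, dimension(:), allocatable :: xm
  end type
contains
  ! Hypothetical helper: n elements, each with an m-long member set to its index.
  function fill_and_sum(n, m) result(total)
    integer, intent(in) :: n, m
    real :: total
    type(mdt), allocatable :: x(:)
    integer :: i, j

    allocate(x(n))
    do i = 1, n
      x(i)%n = m
      allocate(x(i)%xm(m))
      x(i)%xm = 0.0
    end do

    ! Copy the array of types, then each element's allocatable member.
    !$acc enter data copyin(x)
    do i = 1, n
      !$acc enter data copyin(x(i)%xm)
    end do

    !$acc parallel loop present(x)
    do i = 1, n
      !$acc loop
      do j = 1, m
        x(i)%xm(j) = real(i)
      end do
    end do

    ! Assumed teardown: members first, then the parent array.
    do i = 1, n
      !$acc exit data copyout(x(i)%xm)
    end do
    !$acc exit data delete(x)

    total = 0.0
    do i = 1, n
      total = total + sum(x(i)%xm)
    end do
    deallocate(x)   ! also frees the allocatable members
  end function
end module
```

With these assumptions, fill_and_sum(3, 4) should return 24.0, that is 4 * (1 + 2 + 3).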
Managed Memory
Compile and link with -ta=tesla:managed
Allocate statements will allocate in CUDA Unified Memory
Advantages
- Most data clauses can be skipped, and in fact are ignored
- If locality works, most data stays on the GPU
- Data transfers use fast pinned data transfers
- Good for initial porting
- Derived type allocatable members automatically work
Managed Memory
Disadvantages
- All managed memory is moved to the GPU for each kernel launch
- No prefetch, no asynchronous data movement
- Only works for dynamically allocated memory
  - Local variables, module variables, and static symbols are not managed
- Limited to memory size of the GPU
- Allocate and Deallocate are expensive
- Kepler only
- Only one device
- Your program can segfault(!) if the host code accesses managed data while the GPU is busy
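As a rough illustration of the managed-memory style, the sketch below uses allocatable arrays and a compute region with no data clauses at all. The module name managed_example and the function axpy_sum are hypothetical. It assumes the code is compiled and linked with -ta=tesla:managed as described above; without that option the compiler is expected to fall back to implicit data movement for the arrays, which gives the same result.

```fortran
module managed_example
  implicit none
contains
  ! Hypothetical helper: computes a = a + 2*b with no explicit data clauses.
  function axpy_sum(n) result(total)
    integer, intent(in) :: n
    real :: total
    real, dimension(:), allocatable :: a, b
    integer :: i

    ! Under managed memory these allocations land in CUDA Unified Memory.
    allocate(a(n), b(n))
    a = 1.0
    b = 2.0

    ! No copyin/copyout: the runtime migrates the managed data.
    !$acc parallel loop
    do i = 1, n
      a(i) = a(i) + 2.0 * b(i)
    end do

    ! The region above is synchronous, so the host reads a only after
    ! the GPU has finished.
    total = sum(a)
    deallocate(a, b)
  end function
end module
```

Here axpy_sum(10) should return 50.0. Note that the host touches the arrays only after the synchronous region completes, which avoids the segfault hazard listed among the disadvantages.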
Conditional GPU code
if clause on acc parallel / acc kernels
acc_on_device(acc_device_...)
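A minimal sketch of the if clause follows; it does not cover acc_on_device. The module name conditional_example, the function scale_sum, and the threshold argument are hypothetical. The idea is that the region is offloaded only when the condition is true and otherwise runs on the host.

```fortran
module conditional_example
  implicit none
contains
  ! Hypothetical helper: triples n ones, offloading only for large n.
  function scale_sum(n, threshold) result(total)
    integer, intent(in) :: n, threshold
    real :: total
    real, dimension(:), allocatable :: a
    integer :: i

    allocate(a(n))
    a = 1.0

    ! When n < threshold the condition is false and the loop runs on the host.
    !$acc parallel loop copy(a) if(n >= threshold)
    do i = 1, n
      a(i) = 3.0 * a(i)
    end do

    total = sum(a)
    deallocate(a)
  end function
end module
```

Either way the result is the same: scale_sum(4, 1000) should return 12.0 on the host path, and a call with n at or above the threshold should return 3.0 * n from the device path.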
[Figure: two multicore CPU sockets linked by HT/QPI, each with per-core caches ($), a shared cache, and high-capacity memory]
Modern HPC Node
[Figure: x86 CPU (per-core caches, shared cache, high-capacity memory) connected over PCIe 3 to a GPU accelerator (per-core caches, shared cache, high-bandwidth memory)]
Latency- vs Throughput-Optimized Cores
CPU (latency-optimized core, LOC)           Accelerator, GPU (throughput-optimized core, TOC)
Fast clock (2.5-3.5 GHz)                    Slow clock (0.8-1.2 GHz)
More work per clock                         More work per clock
- deep pipelining                           - shallow pipelining
- 3-5 wide multiscalar instruction issue    - 1-2 wide multiscalar instruction issue
- 4-16 wide SIMD instructions               - 16-64 wide SIMD instructions
- 4-24 cores                                - 24-72 cores
Fewer stalls                                Fewer stalls
- Large 10-24 MB cache                      - Small 0.25-2 MB cache
- Complex branch prediction                 - Little branch prediction
- Out-of-order execution                    - In-order execution
- 2-4 wide multithreading                   - 15-32 wide multithreading
Modern HPC Node
[Figure: x86 CPU (per-core caches, shared cache, high-capacity memory) connected over PCIe 3 to a Xeon Phi (per-core caches, shared cache, high-bandwidth memory)]
Modern HPC Node
[Figure: APU with per-core caches, two shared caches, high-capacity memory, and high-bandwidth memory]
Modern HPC Node
[Figure: node diagram labeled APU, NVLink, shared caches, high-capacity memory, and high-bandwidth memory]
Modern HPC Node
[Figure: ARM CPU (per-core caches, shared cache, high-capacity memory) connected over PCIe 3 to a Tesla accelerator (per-core caches, shared cache, high-bandwidth memory)]
Modern HPC Node
[Figure: ARM CPU (per-core caches, shared cache, high-capacity memory) connected over NVLink to a Tesla accelerator (per-core caches, shared cache, high-bandwidth memory)]
Performance Portability
The same program runs, and runs well, across multiple targets
Performance Portability
program      seq (s)    gpu (s)   speedup   multicore (s)   speedup
clvrleaf        2698     161.73      16.7           511.1       5.2
md             13463     115.43     116.0          400.03      33.6
minighost       1062     146.13       7.3          319.93       3.3
olbm             449     305.97       1.4           95.75       4.7
ostencil       13835      60.13     230.0         2276.57       6.0
swim             537      83.98       6.4          121.20       4.4
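The speedup columns appear to be the sequential time divided by the time on the given target; that reading is an assumption, since the slide does not define the column. A small sketch of the arithmetic, with the hypothetical names speedup_example and speedup:

```fortran
module speedup_example
  implicit none
contains
  ! Assumed definition: sequential time over parallel time, both in seconds.
  pure function speedup(t_seq, t_par) result(s)
    real, intent(in) :: t_seq, t_par
    real :: s
    s = t_seq / t_par
  end function
end module
```

For the clvrleaf row, speedup(2698.0, 161.73) evaluates to about 16.68, which the table reports as 16.7. The printed values look rounded or truncated, so small differences from this formula are to be expected in other rows.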