
OpenACC for Fortran

PGI Compilers for Heterogeneous Supercomputing


Advanced GPU Programming
Data management API routines
Multiple devices
Atomic operations
Derived types
Managed memory
Conditional GPU code
Multicore as a target
Interoperability with OpenMP
Interoperability with CUDA C and CUDA Libraries
Interoperability with CUDA Fortran
Data Management API
acc_copyin(a(:))            ==  !$acc enter data copyin(a(:))
acc_create(b(:))            ==  !$acc enter data create(b(:))
acc_copyout(a(:))           ==  !$acc exit data copyout(a(:))
acc_delete(b(:))            ==  !$acc exit data delete(b(:))
acc_is_present(a(2:n-1))        (returns whether the data is already present on the device)
acc_update_host(a(2:n-1))   ==  !$acc update host(a(2:n-1))
acc_update_device(b(2:n))   ==  !$acc update device(b(2:n))
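For example, a minimal sketch (array names and sizes are illustrative) pairing the API routines with the directives they mirror:

use openacc
integer, parameter :: n = 100000
real, allocatable :: a(:), b(:)
integer :: i
allocate(a(n), b(n)); a = 1.0
call acc_copyin(a)            ! same effect as !$acc enter data copyin(a)
call acc_create(b)            ! same effect as !$acc enter data create(b)
!$acc parallel loop default(present)
do i = 1, n
  b(i) = 2.0*a(i)
enddo
call acc_update_host(b)       ! same effect as !$acc update host(b)
call acc_delete(b)            ! same effect as !$acc exit data delete(b)
call acc_copyout(a)           ! same effect as !$acc exit data copyout(a)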
Multiple Devices
Environment Variable ACC_DEVICE_NUM
API routine acc_set_device_num
- call acc_set_device_num( 1, acc_device_nvidia )
OpenMP, based on thread number
- nd = acc_get_num_devices( acc_device_nvidia )
- ign = mod(omp_get_thread_num(),nd)
- call acc_set_device_num( ign, acc_device_nvidia )
MPI, based on rank
- nd = acc_get_num_devices( acc_device_nvidia )
- call mpi_comm_rank( mpi_comm_world, irank, ierror )
- ign = mod(irank,nd)
- call acc_set_device_num( ign, acc_device_nvidia )
Multiple devices with OpenMP
nd = acc_get_num_devices( acc_device_nvidia )
!$omp parallel private(ign)
ign = mod(omp_get_thread_num(),nd)
call acc_set_device_num( ign, acc_device_nvidia )
!$omp end parallel
...
!$acc data copy(a(:,:))
!$omp parallel do
do j = 1, n
!$acc parallel loop
do i = 1, n
a(i,j) = ...
enddo
enddo
!$acc end data
What could go wrong?
Multiple devices with OpenMP
...
!$omp parallel
!$acc data copy( a(:,:) )
!$omp do
do j = 1, n
!$acc parallel loop present(a)
do i = 1, n
a(i,j) = ...
enddo
enddo
!$acc end data
!$omp end parallel
What could go wrong?
Multiple devices with one thread
nd = acc_get_num_devices( acc_device_nvidia )
nchunk = (n+nd-1)/nd
do ign = 0, nd-1
call acc_set_device_num( ign, acc_device_nvidia )
jlow = ign*nchunk + 1
jhigh = min(n, (ign+1)*nchunk)
!$acc enter data copyin(a(:,jlow:jhigh)) async
!$acc parallel loop async
do j = jlow, jhigh
do i = 1, n
a(i,j) = ...
enddo
enddo
!$acc exit data copyout(a(:,jlow:jhigh)) async
enddo
!$acc wait
What could go wrong?
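One likely answer: the final !$acc wait only waits on the current device (the last one selected), so work queued on the other devices may still be running. A hedged sketch of one possible fix is to wait on each device in turn:

do ign = 0, nd-1
  call acc_set_device_num( ign, acc_device_nvidia )
  !$acc wait
enddo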
Multiple devices with MPI
No sharing between ranks, even on same GPU
Can run out of memory (no virtual memory on GPU)
Atomic Operations
OpenACC atomic construct, like OpenMP atomic construct
- some constructs will generate hardware atomic operations
!$acc atomic update
x = x + a(i)
!$acc atomic update
y = min(y,b(i))
!$acc atomic capture
ix = ix + 1
ime = ix
!$acc end atomic
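As a small hedged illustration (hist, bin, and nbins are made-up names), atomics let many iterations safely update the same locations, here accumulating a histogram:

integer :: i, bin(n), hist(nbins)
hist = 0
!$acc parallel loop copyin(bin) copy(hist)
do i = 1, n
  !$acc atomic update
  hist(bin(i)) = hist(bin(i)) + 1
enddo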
Fortran Derived Types
Arrays of derived type work just like arrays
Derived types with fixed-size array members should just work
Derived type with allocatable array members
- Deep copy not implemented (or defined)
- Workaround for PGI
type mdt
integer :: n
real, dimension(:), allocatable :: xm
end type

type(mdt) :: x
...
!$acc enter data copyin(x)
!$acc enter data copyin(x%xm)
....
!$acc exit data copyout(x%xm)
!$acc exit data delete(x)
type mdt
integer :: n
real, dimension(:), allocatable :: xm
end type

type(mdt), allocatable :: x(:)

...
!$acc enter data copyin(x)
do i = 1, n
!$acc enter data copyin(x(i)%xm)
enddo
....
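A hedged sketch of the matching teardown for the array case, mirroring the scalar example above (members first, then the parent):

do i = 1, n
  !$acc exit data copyout(x(i)%xm)
enddo
!$acc exit data delete(x)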
Managed Memory
Compile and link with -ta=tesla:managed
Allocate statements will allocate in CUDA Unified Memory
Advantages
- Most data clauses can be skipped, and in fact are ignored
- If locality works, most data stays on the GPU
- Data transfers use fast pinned data transfers
- Good for initial porting
- Derived type allocatable members automatically work
Managed Memory
Disadvantages
- All managed memory is moved to the GPU for each kernel launch
- No prefetch, no asynchronous data movement
- Only works for dynamically allocated memory
- local variables, module variables, static symbols are not managed
- Limited to memory size of the GPU
- Allocate and Deallocate are expensive
- Kepler only
- Only one device
- Your program can segfault(!) if the host code accesses managed data while the GPU is busy
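A minimal hedged sketch of what a managed-memory port looks like (names are illustrative): the allocatables live in CUDA Unified Memory, so no data clauses are needed:

integer, parameter :: n = 1000000
real, allocatable :: a(:), b(:)
integer :: i
allocate(a(n), b(n))          ! allocated in CUDA Unified Memory when built with -ta=tesla:managed
a = 1.0
!$acc parallel loop           ! no data clauses required
do i = 1, n
  b(i) = 2.0*a(i)
enddo
print *, b(n)                 ! host access after the kernel completes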
Conditional GPU code
if clause on acc parallel / acc kernels
acc_on_device(acc_device_...)

subroutine host_or_device( a, ongpu )
use openacc
real, dimension(:) :: a
logical :: ongpu
integer :: i
!$acc parallel loop if(ongpu) default(present)
do i = 1, ubound(a,1)
if( acc_on_device( acc_device_nvidia ) )then
a(i) = devfoo( a(i) )
else
a(i) = hostfoo( a(i) )
endif
enddo
end subroutine
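A hedged usage sketch (a, devfoo, and hostfoo as in the sketch above): the same source runs on either target depending on the flag:

!$acc enter data copyin(a)
call host_or_device( a, .true. )     ! compute region runs on the GPU
!$acc exit data copyout(a)
call host_or_device( a, .false. )    ! if(ongpu) is false, so it runs on the host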
Compile for GPU and Host
-ta=tesla,host
- compiles each compute region for Tesla and sequential host code
ACC_DEVICE_TYPE
- nvidia or host
acc_set_device_type( acc_device_nvidia | acc_device_host )
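A hedged sketch of selecting the target at run time (want_gpu is an illustrative logical flag):

use openacc
logical :: want_gpu
want_gpu = .true.              ! illustrative: could come from an input file or the environment
if( want_gpu )then
  call acc_set_device_type( acc_device_nvidia )
else
  call acc_set_device_type( acc_device_host )
endif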
Compile for Multicore
-ta=multicore
- compiles each compute region for parallel multicore host execution
- -ta=tesla,multicore will work in 2016
Currently being beta tested
- only one outer parallel loop is run in parallel
- no tuning for multicore execution (yet)
- no data movement (data clauses ignored)
Useful for initial code development
Useful for multi-target code deployment
Interoperability with OpenMP
-acc -mp to enable OpenACC and OpenMP
Threads can share a GPU
Shared data on host will be shared on the GPU as well
Data regions can overlap for shared data
- data created / copied in at entry to first data region
- data copied out / deleted at exit from last data region, even if different thread
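A hedged sketch (do_work is a hypothetical worker routine) of overlapping structured data regions on shared data:

!$omp parallel shared(a)
!$acc data copy(a)       ! created/copied in by the first thread to arrive
call do_work( a, omp_get_thread_num() )
!$acc end data           ! copied out/deleted by the last thread to leave
!$omp end parallel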
Interoperability with OpenMP 4
No existing implementation supports both OpenMP 4 and OpenACC (Cray, perhaps)
Data management in the two models is coherent
- copy == map(inout), copyin == map(in), copyout == map(out), create == map(alloc)
- OpenACC defines two copies kept coherent by program
- OpenMP defines mapping a single copy from host to device and back
Parallelism management is very different
- OpenMP has teams, threads, SIMD lanes
- OpenACC has gangs, workers, vector lanes
- OpenMP is strictly prescriptive, parallel loop is a loop that runs in parallel
- OpenACC is more descriptive, parallel loop is a real parallel loop
- Same runtime should be able to handle both in a single program
Interoperability with CUDA
The device kernels are CUDA kernels, the data is CUDA data
Data interoperability
- Calling OpenACC C with data from CUDA C
- Calling OpenACC Fortran with data from CUDA C
- Calling OpenACC Fortran with data from CUDA Fortran
- Calling CUDA C with OpenACC data
- Calling CUDA Fortran with OpenACC data
Compute interoperability
- Calling CUDA C device routines from OpenACC
- Calling CUDA Fortran device routines from OpenACC
OpenACC data in CUDA C
#pragma acc data copyin(a[0:n]) copy(x[0:n])
{
...
#pragma acc host_data use_device(a)
{
cuda_routine( a );
}
...
#pragma acc parallel loop
for( j = 0; j < n; ++j ) a[j] = ...
}
CUDA C data in OpenACC
float *a;
cudaMalloc( &a, sizeof(float)*n );
...
openacc_routine( a, n );
...
void openacc_routine( float* a, int n ){
...
#pragma acc parallel loop deviceptr(a)
for( j = 0; j < n; ++j ) a[j] = ...
}
CUDA C data in OpenACC
float *a;
cudaMalloc( &a, sizeof(float)*n );
...
openacc_routine_( a, &n );
...
subroutine openacc_routine( a, n )
real a(*)
integer :: n
!$acc parallel loop deviceptr(a)
do j = 1, n
a(j) = ...
enddo
end subroutine
CUDA Fortran data in OpenACC
real, allocatable, device :: a(:)
allocate(a(n))
...
call openacc_routine( a, n )
...
subroutine openacc_routine( a, n )
real, device :: a(*)
integer :: n
!$acc parallel loop
do j = 1, n
a(j) = ...
enddo
end subroutine
OpenACC data in CUDA Fortran
real, allocatable :: a(:)
allocate(a(n))
!$acc data copyin(a)
...
call cuf_routine( a, n )
...
!$acc end data
...
subroutine cuf_routine( a, n )
real, device :: a(*)
integer :: n
!$cuf kernel do <<<*,64>>>
do i = 1, n
....
OpenACC data in CUDA Fortran
real, allocatable :: a(:)
allocate(a(n))
!$acc data copyin(a)
...
call cuf_kernel<<<n/64,64>>>(a)
...
!$acc end data
...
attributes(global) subroutine cuf_kernel( a )
real, device :: a(*)
....
CUDA device routines in OpenACC
interface
subroutine cudadev( a, i, x ) bind(c)
real a(*)
real, value :: x
integer, value :: i
!$acc routine seq
end subroutine
end interface
...
!$acc parallel loop gang vector present(a)
do i = 1, n
call cudadev( a, i, x )
enddo
CUDA device routines in OpenACC
__device__ void cudadev( float* a, int i, float x ){
a[i] *= x;
}
CUDA device routines in OpenACC
module mm
contains
attributes(device) subroutine cudadev( a, i, x )
real a(*)
real, value :: x
integer, value :: i
a(i) = x*a(i)
end subroutine
end module
use mm
!$acc parallel loop gang vector present(a)
do i = 1, n
call cudadev( a, i, x )
enddo
CUDA Fortran and OpenACC
Data with the device attribute can be used in OpenACC regions
Data transfers with the pinned attribute will be faster
OpenACC compute regions may call CUDA library routines
OpenACC compute regions may call user device procedures
OpenACC data may be passed to arguments with the device attribute
CUDA Libraries
-Mcudalib=cublas|cufft|curand|cusparse
CUBLAS
- use cublas or use cublasxt or use openacc_cublas
CUFFT
- use cufft
- cufftSetStream(plan,acc_get_cuda_stream(acc_async_sync))
CURAND
- use curand or use openacc_curand
THRUST
- interface blocks, acc_get_cuda_stream(acc_async_sync)
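A hedged sketch (assuming the cublas module from -Mcudalib=cublas and its legacy-BLAS-style cublasSaxpy interface) of handing OpenACC data to cuBLAS via host_data:

program acc_saxpy
use cublas
integer, parameter :: n = 4096
real :: x(n), y(n)
x = 1.0; y = 2.0
!$acc data copyin(x) copy(y)
!$acc host_data use_device(x, y)
call cublasSaxpy(n, 2.0, x, 1, y, 1)   ! y = 2.0*x + y, computed on the device
!$acc end host_data
!$acc end data
print *, y(1)
end program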
OpenACC and OpenMP 4
OpenACC                               OpenMP
Focused on accelerated computing      General-purpose parallelism
More agile                            More measured
Performance portability               Performance portability a challenge
Descriptive                           Prescriptive
Parallel loops                        Loops that run across threads
Extensive interoperability            Limited interoperability
More mature for accelerators          More mature for multi-core
Modern HPC Node
[Diagram: two X86 CPUs, each with per-core caches, a shared cache, and high-capacity memory, connected by HT/QPI]
Modern HPC Node
[Diagram: X86 CPU (shared cache, high-capacity memory) attached over PCIe 3 to a GPU accelerator (shared cache, high-bandwidth memory)]
Latency- vs Throughput-Optimized Cores
                       CPU (latency-optimized, LOC)     Accelerator/GPU (throughput-optimized, TOC)
Clock                  fast (2.5-3.5 GHz)               slow (0.8-1.2 GHz)
More work per clock
  pipelining           deep                             shallow
  multiscalar issue    3-5 wide                         1-2 wide
  SIMD width           4-16 wide                        16-64 wide
  cores                4-24                             24-72
Fewer stalls
  cache                large (10-24 MB)                 small (0.25-2 MB)
  branch prediction    complex                          little
  execution            out-of-order                     in-order
  multithreading       2-4 wide                         15-32 wide
Modern HPC Node
[Diagram: X86 CPU (shared cache, high-capacity memory) attached over PCIe 3 to a Xeon Phi (shared cache, high-bandwidth memory)]
Modern HPC Node
[Diagram: APU with CPU and GPU cores sharing a cache and high-capacity memory]
Modern HPC Node
[Diagram: Knights Landing with many cores, a shared cache, high-capacity memory, and high-bandwidth memory]
Modern HPC Node
[Diagram: APU with CPU and GPU cores sharing a cache, high-bandwidth memory, and high-capacity memory]
Modern HPC Node
[Diagram: POWER CPU (shared cache, high-capacity memory) attached over NVLink to a Tesla accelerator (shared cache, high-bandwidth memory)]
Modern HPC Node
[Diagram: ARM CPU (shared cache, high-capacity memory) attached over PCIe 3 to a Tesla accelerator (shared cache, high-bandwidth memory)]
Modern HPC Node
[Diagram: ARM CPU (shared cache, high-capacity memory) attached over NVLink to a Tesla accelerator (shared cache, high-bandwidth memory)]
Performance Portability
The same program runs and runs well across multiple targets
Performance Portability
program      seq(s)   gpu(s)   speedup   multicore(s)   speedup
clvrleaf       2698   161.73      16.7          511.1       5.2
md            13463   115.43     116.0         400.03      33.6
minighost      1062   146.13       7.3         319.93       3.3
olbm            449   305.97       1.4          95.75       4.7
ostencil      13835    60.13     230.0        2276.57       6.0
swim            537    83.98       6.4         121.20       4.4

These are SPECAccel benchmark estimates, measured on a dual-processor Intel Haswell (32 cores) with an NVIDIA K80 GPU.
OpenACC Course – Starts Oct 1st
A free online course:
- 4 classes
- 4 office hours
- experienced instructors
- OpenACC Toolkit
- hands-on labs
- GPU access
Register at https://developer.nvidia.com/openacc_course
Performance Portable Programming
Challenges and Opportunities
- high core count devices
- large system memories, smaller high-bandwidth memories
OpenACC is demonstrating performance portability
Data management: as important as parallelism
- data location as well as data layout
Parallelism: Expose, Express, Exploit
- performance, not parallelism
https://www.pgroup.com/userforum
https://developer.nvidia.com/openacc
https://developer.nvidia.com/openacc_course
openacc@nvidia.com
OpenACC 1, 2, 2.5, 3, ...
OpenACC 1.0: data region, compute region, update, async
OpenACC 2.0: +routine, +atomics, +enter data/exit data
OpenACC 2.5: +default(present), -present_or_, +profile interface
OpenACC 3.0 (planned): deep copy, shared memory options
Backup slides
!$acc data copyin(a(:,:), v(:)) copy(x(:))
!$acc parallel
!$acc loop gang
do j = 1, n
sum = 0.0
!$acc loop vector reduction(+:sum)
do i = 1, n
sum = sum + a(i,j) * v(i)
enddo
x(j) = sum
enddo
!$acc end parallel
!$acc end data
!$acc data copyin(a(:,:), v(:)) copy(x(:))
call matvec( a, v, x, n )
!$acc end data
...
subroutine matvec( m, v, r, n )
real :: m(:,:), v(:), r(:)
!$acc parallel present(m,v,r)
!$acc loop gang
do j = 1, n
sum = 0.0
!$acc loop vector reduction(+:sum)
do i = 1, n
sum = sum + m(i,j) * v(i)
enddo
r(j) = sum
enddo
!$acc end parallel
end subroutine
!$acc data copyin(a(:,:), v(:)) copy(x(:))
call matvec( a, v, x, n )
!$acc end data
...
subroutine matvec( m, v, r, n )
real :: m(:,:), v(:), r(:)
!$acc parallel default(present)
!$acc loop gang
do j = 1, n
sum = 0.0
!$acc loop vector reduction(+:sum)
do i = 1, n
sum = sum + m(i,j) * v(i)
enddo
r(j) = sum
enddo
!$acc end parallel
end subroutine
call init( v, n )
call fill( a, n )
!$acc data copy( x )
do iter = 1, niter
call matvec( a, v, x, n )
call interp( b, x, n )
!$acc update host( x )
write(...) x
call exch( x )
!$acc update device( x )
enddo
!$acc end data
...

subroutine init( v, n )
real, allocatable :: v(:)
allocate(v(n))
v(1) = 0
do i = 2, n
v(i) = ....
enddo
!$acc enter data copyin(v)
end subroutine
