
Gangneung-Wonju National University

Youngtae Kim

Agenda
1 Background
1.1 The future of high performance computing
1.2 GP-GPU
1.3 CUDA Fortran

2 Implementation
2.1 Implementation of the WRF Fortran program
2.2 Execution profile of WRF physics
2.3 Implementation of parallel programs

3 Performance
3.1 Performance comparison
3.2 Performance of WRF

4 Conclusions

1 Background
1.1 The future of High Performance Computing
H. Meuer, Scientific Computing World: June/July 2009

"A thousand-fold performance increase over an 11-year time period."
  1986: Gigaflops
  1997: Teraflops
  2008: Petaflops
  2019: Exaflops
"For the near future, we expect that the hardware architecture will be a combination of specialized CPU and GPU type cores."

1 Background
GP-GPU performance

[Figure] FLOPS and memory bandwidth for the CPU and GP-GPU

*FLOPS: Floating-Point Operations per Second

1 Background
GP-GPU Acceleration of WRF WSM5

1 Background
1.2 GP-GPU (General-Purpose Graphics Processing Unit)
  Originally designed for graphics processing
  A grid of multiprocessors
  Connected to the host via PCI
  Thread block (computes in parallel)
  Grid (the data domain)

1 Background
Caller: call function<<<dimGrid, dimBlock>>>(...)
Callee: i = blockDim%x*(blockIdx%x-1) + threadIdx%x
        j = blockDim%y*(blockIdx%y-1) + threadIdx%y
[Figure] A 2x3 grid of 3x3 thread blocks: each thread's global indices follow the formulas above, ranging from (3*0+1, 3*0+1) to (3*1+3, 3*2+3), with the blocks labelled (1,1) through (2,3).
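To make the caller/callee pattern above concrete, here is a minimal, self-contained CUDA Fortran sketch matching the figure's 2x3 grid of 3x3 thread blocks; the kernel name, array, and sizes are illustrative, not taken from the WRF code.

module index_demo
contains
  ! Each thread computes its global (i,j) from its block and thread indices,
  ! exactly as in the Caller/Callee formulas above.
  attributes(global) subroutine fill_ij(a, ni, nj)
    implicit none
    real :: a(:,:)
    integer, value :: ni, nj
    integer :: i, j
    i = blockDim%x*(blockIdx%x-1) + threadIdx%x
    j = blockDim%y*(blockIdx%y-1) + threadIdx%y
    if (i <= ni .and. j <= nj) a(i,j) = 10*i + j
  end subroutine fill_ij
end module index_demo

program test_index
  use cudafor
  use index_demo
  implicit none
  integer, parameter :: ni = 6, nj = 9        ! 2x3 blocks of 3x3 threads
  real               :: a(ni,nj)
  real, device       :: a_d(ni,nj)
  type(dim3)         :: dimGrid, dimBlock

  dimBlock = dim3(3, 3, 1)
  dimGrid  = dim3(ni/3, nj/3, 1)
  call fill_ij<<<dimGrid, dimBlock>>>(a_d, ni, nj)
  a = a_d                                     ! copy the result back to the host
  print *, a(1,1), a(ni,nj)                   ! expected: 11.0 and 69.0
end program test_index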

1 Background
1.3 CUDA Fortran (PGI Fortran version 10)
Developed by The Portland Group (PGI) and Nvidia (December 2009)
Supports CUDA
Uses Fortran 90 (95/2003) syntax
Some limitations of CUDA Fortran:
  Automatic arrays and module variables are not supported
  COMMON and EQUIVALENCE are not supported

*CUDA (Compute Unified Device Architecture): Nvidia's GP-GPU programming interface
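As a hedged illustration of the first limitation (the workaround below is a common pattern, not something prescribed by the slides): scratch storage that would be an automatic array in CPU Fortran is instead allocated by the host as a device array and passed to the kernel explicitly. All names here are invented for the example.

module scratch_demo
contains
  ! Device code cannot declare automatic arrays, so the scratch array "work"
  ! is provided by the host as an explicit argument.
  attributes(global) subroutine scale_with_scratch(x, work, n)
    implicit none
    real :: x(:), work(:)
    integer, value :: n
    integer :: i
    i = blockDim%x*(blockIdx%x-1) + threadIdx%x
    if (i <= n) then
      work(i) = 2.0*x(i)
      x(i)    = work(i) + 1.0
    end if
  end subroutine scale_with_scratch
end module scratch_demo

program test_scratch
  use cudafor
  use scratch_demo
  implicit none
  integer, parameter :: n = 256
  real         :: x(n)
  real, device :: x_d(n), work_d(n)     ! scratch space lives in device memory

  x   = 1.0
  x_d = x
  call scale_with_scratch<<<(n+63)/64, 64>>>(x_d, work_d, n)
  x = x_d
  print *, 'expected 3.0, got', x(1)
end program test_scratch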

2 Implementation
2.1 Implementation of the WRF (v3.4) Fortran program
Physics routines that run on GP-GPUs:
  Microphysics: WSM6 and WDM6
  Boundary-layer physics: YSUPBL
  Radiation physics: RRTMG_LW, RRTMG_SW
  Surface-layer physics: SFCLAY

2 Implementation
2.2 Execution profile of WRF physics routines
[Pie chart] WRF execution profile by physics routine: RRTMG_LW, RRTMG_SW, WDM6, YSUPBL, and others; the shares shown include 22.5%, 21.1%, 14.4%, and 2.1%.

2 Implementation
2.3 Implementation of parallel programs
2.3.1 Running environment
Modification of configure.wrf (the environment set-up file)
  Compatible with the original WRF program
  ARCH_LOCAL = -DRUN_ON_GPU
  (the GP-GPU code is compiled only if -DRUN_ON_GPU is defined)
Create a directory for the CUDA code only: cuda
  GP-GPU source codes
  Its own (exclusive) Makefile

2 Implementation
2.3.2 Structure of the GP-GPU program
(Original code)
  Initialize
  Time steps:
    Physics routine:
      do j = ...
        call 2d routine(.., j, ..)
      enddo

(GPU code)
  Initialize: dynamic allocation of GP-GPU variables
  Time steps:
    Physics routine:
      Copy CPU variables to the GPU
      call 3d routine (GPU)
      Copy GPU variables back to the CPU
  Finalize: deallocation of GPU variables
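A minimal, self-contained sketch of this restructuring (all names and sizes are illustrative, not taken from WRF): the CPU driver calls a 2-D routine once per j inside each time step, while the GPU driver copies data in, launches a single 3-D kernel over the whole (i, j) domain, and copies the result back.

module driver_demo
contains
  ! CPU-style 2-D routine: works on one j-slice at a time.
  subroutine phys2d(t, ni, nk, j)
    implicit none
    integer, intent(in) :: ni, nk, j
    real, intent(inout) :: t(:,:,:)
    integer :: i, k
    do k = 1, nk
      do i = 1, ni
        t(i,k,j) = t(i,k,j) + 1.0
      end do
    end do
  end subroutine phys2d

  ! GPU-style 3-D kernel: one thread per (i,j) column.
  attributes(global) subroutine phys3d(t, ni, nk, nj)
    implicit none
    real :: t(:,:,:)
    integer, value :: ni, nk, nj
    integer :: i, j, k
    i = blockDim%x*(blockIdx%x-1) + threadIdx%x
    j = blockDim%y*(blockIdx%y-1) + threadIdx%y
    if (i <= ni .and. j <= nj) then
      do k = 1, nk
        t(i,k,j) = t(i,k,j) + 1.0
      end do
    end if
  end subroutine phys3d
end module driver_demo

program compare_drivers
  use cudafor
  use driver_demo
  implicit none
  integer, parameter :: ni = 32, nk = 16, nj = 24, nsteps = 3
  real         :: t_cpu(ni,nk,nj), t_gpu(ni,nk,nj)
  real, device :: t_d(ni,nk,nj)
  type(dim3)   :: dimGrid, dimBlock
  integer      :: j, step

  t_cpu = 0.0;  t_gpu = 0.0
  dimBlock = dim3(16, 8, 1)
  dimGrid  = dim3((ni+15)/16, (nj+7)/8, 1)

  do step = 1, nsteps
    ! Original structure: horizontal loop on the CPU.
    do j = 1, nj
      call phys2d(t_cpu, ni, nk, j)
    end do

    ! GPU structure: copy in, one kernel over the whole (i,j) domain, copy out.
    t_d = t_gpu
    call phys3d<<<dimGrid, dimBlock>>>(t_d, ni, nk, nj)
    t_gpu = t_d
  end do

  print *, 'max difference between CPU and GPU results:', maxval(abs(t_cpu - t_gpu))
end program compare_drivers

Because the copies sit inside the time loop, the transfer cost recurs every step, which is the CPU-GPU data-transfer bottleneck noted in the conclusions.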

2 Implementation
2.3.3 Initialize & Finalize
phys/module_physics_init.F
#ifndef RUN_ON_GPU
CALL rrtmg_lwinit( )
#else
CALL rrtmg_lwinit_gpu()
#endif

main/module_wrf_top.F
#ifdef RUN_ON_GPU
call rrtmg_lwfinalize_gpu()
#endif

cuda/module_ra_rrtmg_lw_gpu.F

Initialization of constants and allocation of GPU device variables:
  subroutine rrtmg_lwinit_gpu(...)
    call rrtmg_lw_ini(cp)
    allocate(p8w_d(dime,djme), stat=istate)

Deallocation of GPU device variables:
  subroutine rrtmg_lwfinalize_gpu(...)
    deallocate(p8w_d)
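For reference, a minimal compilable sketch of how such an init/finalize pair could be organized. Only the routine names, rrtmg_lw_ini(cp), and p8w_d(dime,djme) come from the slide; the module name, argument list, module-level placement of p8w_d, and error handling are assumptions.

module module_ra_rrtmg_lw_gpu_sketch
  implicit none
  ! Device-resident copy of the corresponding CPU array; allocated once at
  ! start-up and freed at shutdown, as the allocate/deallocate pair suggests.
  real, allocatable, device :: p8w_d(:,:)
contains
  subroutine rrtmg_lwinit_gpu(cp, dime, djme)
    real,    intent(in) :: cp
    integer, intent(in) :: dime, djme
    integer :: istate
    ! call rrtmg_lw_ini(cp)   ! constants initialization from the CPU module (not defined in this sketch)
    allocate(p8w_d(dime, djme), stat=istate)
    if (istate /= 0) stop 'rrtmg_lwinit_gpu: device allocation failed'
  end subroutine rrtmg_lwinit_gpu

  subroutine rrtmg_lwfinalize_gpu()
    if (allocated(p8w_d)) deallocate(p8w_d)
  end subroutine rrtmg_lwfinalize_gpu
end module module_ra_rrtmg_lw_gpu_sketch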

2 Implementation
2.3.4 Calling GPU Functions
phys/module_radiation_driver.F
#ifndef RUN_ON_GPU
USE module_ra_rrtmg_lw, only: rrtmg_lwrad
#else
USE module_ra_rrtmg_lw_gpu, only: rrtmg_lwrad_gpu
#endif
...
#ifndef RUN_ON_GPU
CALL RRTMG_LWRAD(...)
#else
CALL RRTMG_LWRAD_GPU(...)
#endif


2 Implementation
2.3.5 Translation into GPU code
Use a 3-dimensional domain
Remove the horizontal (i, j) loops from the GPU (global) kernel
SUBROUTINE wdm62D(...)

  do k = kts, kte
    do i = its, ite
      cpm(i,k) = cpmcal(q(i,k))
      xl(i,k)  = xlcal(t(i,k))
    enddo
  enddo
  do k = kts, kte
    do i = its, ite
      delz_tmp(i,k) = delz(i,k)
      den_tmp(i,k)  = den(i,k)
    enddo
  enddo

END SUBROUTINE wdm62D

attributes(global) subroutine wdm6_gpu_kernel(...)

  i = (blockIdx%x-1)*blockDim%x + threadIdx%x
  j = (blockIdx%y-1)*blockDim%y + threadIdx%y
  if (((i.ge.its).and.(i.le.ite)).and. &
      ((j.ge.jts).and.(j.le.jte))) then
    do k = kts, kte
      cpm(i,k,j) = cpmcal(q(i,k,j))
      xl(i,k,j)  = xlcal(t(i,k,j))
    enddo
    do k = kts, kte
      delz_tmp(i,k,j) = delz(i,k,j)
      den_tmp(i,k,j)  = den(i,k,j)
    enddo
  endif
end subroutine wdm6_gpu_kernel

2 Implementation
2.3.6 Memory allocation of arrays and copying of CPU data
subroutine rrtmg_lwrad(rthratenlw, emiss, ...)
  ...
  rthratenlw(i,k,j) = ...
  emiss(i,k,j) = ...

real, allocatable, device :: rthratenlw_d(:,:,:), emiss_d(:,:), ...

rthratenlw_d = rthratenlw
emiss_d = emiss
call rrtmg_lwrad_gpu_kernel<<<dimGrid, dimBlock>>>(rthratenlw_d, emiss_d, ...)

attributes(global) subroutine rrtmg_lwrad_gpu_kernel(rthratenlw, emiss, ...)
  ...
  rthratenlw(i,k,j) = ...
  emiss(i,k,j) = ...

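The assignments above rely on CUDA Fortran's rule that assignment between a host array and a device array performs the corresponding memory copy. A minimal round-trip sketch (the array name follows the slide; the sizes and initial value are invented):

program transfer_demo
  use cudafor
  implicit none
  integer, parameter :: ni = 64, nj = 64
  real                      :: emiss(ni,nj), back(ni,nj)
  real, allocatable, device :: emiss_d(:,:)

  allocate(emiss_d(ni,nj))
  emiss   = 0.9
  emiss_d = emiss        ! host -> device copy (implicit cudaMemcpy)
  back    = emiss_d      ! device -> host copy of the (unchanged) data
  print *, 'max difference after the round trip:', maxval(abs(back - emiss))
  deallocate(emiss_d)
end program transfer_demo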

3 Performance
3.1 Performance comparison
System specification used for the performance measurements:
  CPU: Intel Xeon E5405 (2.0 GHz)
  GPU: Tesla C1060 (1.3 GHz)
    Global memory: 4 GB
    Multiprocessors: 30
    Cores: 240
    Registers per block: 16,384
    Max. threads per block: 512

3 Performance
Performance of WRF physics routines
[Bar chart] CPU vs. GPU execution times for WSM6, WDM6, YSUPBL, RRTMG_LW, RRTMG_SW, and SFCLAY (vertical axis 0-4000).

3 Performance
Performance comparison of CUDA C and CUDA Fortran
[Bar chart] Execution times in microseconds (0-300,000) for WSM5 and WSM6 (legend: CPU vs. GPU).

3 Performance
3.2 Performance of WRF
[Bar chart] Total WRF execution time, CPU vs. GPU (vertical axis 0-18,000).

4 Conclusions
Pros
  GP-GPUs can be used as efficient hardware accelerators.
  GP-GPUs are cheap and energy efficient.
Cons
  Communication between the CPU and the GPU is slow.
    Data transfer between the CPU and the GP-GPU is a bottleneck.
    Overlapping communication with computation is necessary.
  Translation into GP-GPU code is not trivial.
    Parameter-passing methods and local resources are limited.
    CUDA Fortran needs to be improved.
