Youngtae Kim
Agenda
1 Background
1.1 The future of high performance computing
1.2 GP-GPU
1.3 CUDA Fortran
2 Implementation
2.1 Implementation of the Fortran program of WRF
2.2 Execution profile of WRF physics
2.3 Implementation of parallel programs
3 Performance
3.1 Performance comparison
3.2 Performance of WRF
4 Conclusions
1 Background
1.1 The future of High Performance Computing
[Figure: projected HPC performance trends. Source: H. Meuer, Scientific Computing World, June/July 2009]
[Figure: GP-GPU performance]
[Figure: GP-GPU acceleration of WRF WSM5]
1.2 GP-GPU (General-Purpose Graphics Processing Unit)
Originally designed for graphics processing
A grid of multiprocessors
Connected to the CPU via the PCI bus
Thread blocks compute in parallel
1 Background
Caller: call kernel<<<dimGrid, dimBlock>>>(...)
Callee: i = blockDim%x*(blockIdx%x-1) + threadIdx%x
        j = blockDim%y*(blockIdx%y-1) + threadIdx%y
[Figure: a 2×3 grid of thread blocks, each containing 3×3 threads. Thread (threadIdx%x, threadIdx%y) of block (blockIdx%x, blockIdx%y) computes the global indices (3*(blockIdx%x-1)+threadIdx%x, 3*(blockIdx%y-1)+threadIdx%y), so the six blocks cover a 6×9 domain in parallel]
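The global-index formula above can be checked with a small host-side simulation. The following Python sketch (illustrative only, not WRF or CUDA code) enumerates the (i, j) pair that each thread of a 2×3 grid of 3×3 blocks would compute, using the same 1-based arithmetic as the CUDA Fortran callee.

```python
# Simulate the CUDA Fortran global-index computation:
#   i = blockDim%x*(blockIdx%x-1) + threadIdx%x   (all indices 1-based)
#   j = blockDim%y*(blockIdx%y-1) + threadIdx%y
def global_indices(block_dim, grid_dim):
    """Enumerate the (i, j) pair each thread computes."""
    pairs = set()
    for bx in range(1, grid_dim[0] + 1):
        for by in range(1, grid_dim[1] + 1):
            for tx in range(1, block_dim[0] + 1):
                for ty in range(1, block_dim[1] + 1):
                    i = block_dim[0] * (bx - 1) + tx
                    j = block_dim[1] * (by - 1) + ty
                    pairs.add((i, j))
    return pairs

# A 2x3 grid of 3x3 blocks covers a 6x9 domain, one thread per point
pairs = global_indices((3, 3), (2, 3))
assert pairs == {(i, j) for i in range(1, 7) for j in range(1, 10)}
```

Because the mapping is a bijection onto the domain, each grid point is computed by exactly one thread, with no loop over i or j in the kernel.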
1 Background
1.3 CUDA Fortran (PGI Fortran version 10)
Developed by The Portland Group (PGI) and NVIDIA (December 2009)
Supports CUDA
Uses Fortran 90 (95/2003) syntax
Some limitations of CUDA Fortran
2 Implementation
2.1 Implementation of the Fortran program of WRF (v3.4)
Physics routines that run on GP-GPUs:
Microphysics: WSM6 and WDM6
Boundary-layer physics: YSUPBL
Radiation physics: RRTMG_LW, RRTMG_SW
Surface-layer physics: SFCLAY
2 Implementation
2.2 Execution profile of WRF physics routines
[Pie chart: WRF execution profile — shares of total runtime for RRTMG_LW, RRTMG_SW, WDM6, YSUPBL, and other routines; labeled values include 22.5%, 21.1%, 14.4%, and 2.1%]
2 Implementation
2.3 Implementation of parallel programs
2.3.1 Running environment
Modification of configure.wrf (the environment set-up file)
2 Implementation
2.3.2 Structure of the GP-GPU program
(Original code)
Initialize
Time steps:
  Physics routine:
    do j=..
      call 2d routine(..,j,..)
    enddo

(GPU code)
Initialize: dynamic allocation of GP-GPU variables
Time steps:
  Physics routine: launch the GPU kernel over the whole 3-D domain
2 Implementation
2.3.3 Initialize & Finalize
phys/module_physics_init.F:
#ifndef RUN_ON_GPU
   CALL rrtmg_lwinit(...)
#else
   CALL rrtmg_lwinit_gpu(...)
#endif

main/module_wrf_top.F:
#ifdef RUN_ON_GPU
   call rrtmg_lwfinalize_gpu()
#endif

cuda/module_ra_rrtmg_lw_gpu.F:
subroutine rrtmg_lwinit_gpu(...)
   ! initialization of constants
   call rrtmg_lw_ini(cp)
   ! allocation of GPU device variables
   allocate(p8w_d(dime,djme), stat=istate)
   ...
subroutine rrtmg_lwfinalize_gpu()
   deallocate(p8w_d)
   ...
2 Implementation
2.3.4 Calling GPU Functions
phys/module_radiation_driver.F
#ifndef RUN_ON_GPU
   USE module_ra_rrtmg_lw, only: rrtmg_lwrad
#else
   USE module_ra_rrtmg_lw_gpu, only: rrtmg_lwrad_gpu
#endif
...
#ifndef RUN_ON_GPU
   CALL RRTMG_LWRAD(...)
#else
   CALL RRTMG_LWRAD_GPU(...)
#endif
2 Implementation
2.3.5 Translation into GPU code
Use a 3-dimensional domain
Remove the horizontal loops (the i- and j-loops) from the GPU (global) function
(Original code)
SUBROUTINE wdm62D(...)
  do k = kts, kte
    do i = its, ite
      cpm(i,k) = cpmcal(q(i,k))
      xl(i,k)  = xlcal(t(i,k))
    enddo
  enddo
  do k = kts, kte
    do i = its, ite
      delz_tmp(i,k) = delz(i,k)
      den_tmp(i,k)  = den(i,k)
    enddo
  enddo

(GPU code)
attributes(global) subroutine wdm6_gpu_kernel(...)
  i = blockDim%x*(blockIdx%x-1) + threadIdx%x
  j = blockDim%y*(blockIdx%y-1) + threadIdx%y
  if (i <= ite .and. j <= jte) then
    do k = kts, kte
      cpm(i,k,j) = cpmcal(q(i,k,j))
      xl(i,k,j)  = xlcal(t(i,k,j))
    enddo
    do k = kts, kte
      delz_tmp(i,k,j) = delz(i,k,j)
      den_tmp(i,k,j)  = den(i,k,j)
    enddo
  endif
end subroutine wdm6_gpu_kernel
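The equivalence of the two structures — an i-loop over columns versus one fixed (i, j) per GPU thread — can be demonstrated with a small simulation. This Python sketch is illustrative: `cpmcal` here is a stand-in formula, not WRF's actual function, and the "threads" are ordinary loop iterations.

```python
# Contrast the CPU loop structure with the per-thread GPU structure.
# In the CPU version the routine loops over i; in the GPU version each
# thread handles one fixed (i, j) column. Arrays are simulated as dicts.

def cpmcal(q):
    # stand-in for WRF's cpmcal (illustrative formula, not the real one)
    return 1004.0 * (1.0 + 0.8 * q)

def cpu_version(q, its, ite, kts, kte):
    cpm = {}
    for k in range(kts, kte + 1):
        for i in range(its, ite + 1):       # horizontal loop kept on CPU
            cpm[(i, k)] = cpmcal(q[(i, k)])
    return cpm

def gpu_thread(q3, cpm3, i, j, kts, kte):
    # body of one GPU thread: the i-loop is gone, i and j are fixed
    for k in range(kts, kte + 1):
        cpm3[(i, k, j)] = cpmcal(q3[(i, k, j)])

q3 = {(i, k, 1): 0.01 * i + 0.001 * k for i in (1, 2) for k in (1, 2, 3)}
cpm3 = {}
for i in (1, 2):              # in CUDA these iterations run as parallel threads
    gpu_thread(q3, cpm3, i, 1, 1, 3)

# the restructured version computes exactly the same values
q2 = {(i, k): q3[(i, k, 1)] for i in (1, 2) for k in (1, 2, 3)}
assert cpm3 == {(i, k, 1): v for (i, k), v in cpu_version(q2, 1, 2, 1, 3).items()}
```

Because each column's computation is independent, dropping the horizontal loops and assigning one column per thread changes only the execution order, not the results.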
2 Implementation
2.3.6 Memory allocation of arrays and copying of CPU data
subroutine rrtmg_lwrad(rthratenlw, emiss, ...)
   rthratenlw(i,k,j) = ...
   emiss(i,k,j) = ...
   ...
   rthratenlw(i,k,j) = ...
   emiss(i,k,j) = ...
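The driver-side pattern — allocate device arrays, copy host data in, run the kernel, copy results out — can be sketched as follows. This is a pure-Python simulation: `FakeDevice` and the `emiss_d` name are illustrative stand-ins (the `_d` suffix mirrors the device-array naming used in 2.3.3), and no real GPU is involved.

```python
# Sketch of the allocate / copy-in / compute / copy-out pattern that a
# GPU-accelerated physics driver follows (simulated; no real GPU).

class FakeDevice:
    """Stands in for GPU global memory: each copy models a PCI transfer."""
    def __init__(self):
        self.mem = {}
    def alloc(self, name, size):
        self.mem[name] = [0.0] * size      # like allocate(emiss_d(...))
    def copy_in(self, name, host):
        self.mem[name] = list(host)        # host -> device transfer
    def copy_out(self, name):
        return list(self.mem[name])        # device -> host transfer

dev = FakeDevice()
host_emiss = [0.9, 0.95, 1.0]

dev.alloc("emiss_d", len(host_emiss))
dev.copy_in("emiss_d", host_emiss)
dev.mem["emiss_d"] = [2.0 * x for x in dev.mem["emiss_d"]]   # the "kernel"
host_emiss = dev.copy_out("emiss_d")
assert host_emiss == [1.8, 1.9, 2.0]
```

Every copy_in/copy_out crosses the PCI bus, which is why the conclusions flag CPU-GPU communication as the main cost of this design.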
3 Performance
3.1 Performance comparison
System specification used for performance checking
CPU
GPU:
  Global memory: 4 GB
  #multiprocessors: 30
  #cores: 240
  Registers/block: 16384
  Max #threads/block: 512
3 Performance
Performance of WRF physics routines
[Bar chart: runtimes of the WRF physics routines WSM6, WDM6, YSUPBL, RRTMG_LW, RRTMG_SW, and SFCLAY on CPU vs. GPU; y-axis 0–4000]
3 Performance
Performance comparison of CUDA C and CUDA Fortran
[Bar chart: runtimes in microseconds for WSM5 and WSM6 (CPU vs. GPU series, CUDA C vs. CUDA Fortran); y-axis 0–300,000 μs]
3 Performance
3.2 Performance of WRF
[Bar chart: total WRF runtime, CPU vs. GPU; y-axis 0–18000]
4 Conclusions
Pros
GP-GPUs can be used as efficient hardware accelerators.
GP-GPUs are cheap and energy efficient.
Cons
Communication between CPUs and GPUs is slow.