
Gangneung-Wonju National University

Youngtae Kim

Agenda
1 Background
1.1 The future of high performance computing
1.2 GP-GPU
1.3 CUDA Fortran

2 Implementation
2.1 Implementation of the WRF Fortran program
2.2 Execution profile of WRF physics
2.3 Implementation of parallel programs

3 Performance
3.1 Performance comparison
3.2 Performance of WRF

4 Conclusions

1 Background
1.1 The future of High Performance Computing
H. Meuer, Scientific Computing World: June/July 2009

"A thousand-fold performance increase over an 11-year time period."
  1986: Gigaflops
  1997: Teraflops
  2008: Petaflops
  2019: Exaflops
"For the near future, we expect that the hardware architecture will be a combination of specialized CPU and GPU type cores."

1 Background
GP-GPU performance

[Figure] FLOPS and memory bandwidth for the CPU and GP-GPU

*FLOPS: Floating-Point Operations per Second

1 Background
GP-GPU Acceleration of WRF WSM5

1 Background
1.2 GP-GPU (General-Purpose Graphics Processing Unit)
  Originally designed for graphics processing
  A grid of multiprocessors
  Connected to the host via PCI
  Thread block (computes in parallel)
  Grid (the data domain)

1 Background
Caller: call function<<<dimGrid, dimBlock>>>(...)
Callee: i = blockDim%x*(blockIdx%x-1) + threadIdx%x
        j = blockDim%y*(blockIdx%y-1) + threadIdx%y
[Figure] A 2x3 grid of 3x3 thread blocks: each thread's global indices follow the formulas above, ranging from (3*0+1, 3*0+1) to (3*1+3, 3*2+3), with the blocks labelled (1,1) through (2,3).
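To make the caller/callee pattern above concrete, here is a minimal, self-contained CUDA Fortran sketch matching the figure's 2x3 grid of 3x3 thread blocks; the kernel name, array, and sizes are illustrative, not taken from the WRF code.

module index_demo
contains
  ! Each thread computes its global (i,j) from its block and thread indices,
  ! exactly as in the Caller/Callee formulas above.
  attributes(global) subroutine fill_ij(a, ni, nj)
    implicit none
    real :: a(:,:)
    integer, value :: ni, nj
    integer :: i, j
    i = blockDim%x*(blockIdx%x-1) + threadIdx%x
    j = blockDim%y*(blockIdx%y-1) + threadIdx%y
    if (i <= ni .and. j <= nj) a(i,j) = 10*i + j
  end subroutine fill_ij
end module index_demo

program test_index
  use cudafor
  use index_demo
  implicit none
  integer, parameter :: ni = 6, nj = 9        ! 2x3 blocks of 3x3 threads
  real               :: a(ni,nj)
  real, device       :: a_d(ni,nj)
  type(dim3)         :: dimGrid, dimBlock

  dimBlock = dim3(3, 3, 1)
  dimGrid  = dim3(ni/3, nj/3, 1)
  call fill_ij<<<dimGrid, dimBlock>>>(a_d, ni, nj)
  a = a_d                                     ! copy the result back to the host
  print *, a(1,1), a(ni,nj)                   ! expected: 11.0 and 69.0
end program test_index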

1 Background
1.3 CUDA Fortran (PGI Fortran version 10)
Developed by The Portland Group (PGI) and Nvidia (December 2009)
Supports CUDA
Uses Fortran 90 (95/2003) syntax
Some limitations of CUDA Fortran:
  Automatic arrays and module variables are not supported
  COMMON and EQUIVALENCE are not supported

*CUDA (Compute Unified Device Architecture): Nvidia's GP-GPU programming interface
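As a hedged illustration of the first limitation (the workaround below is a common pattern, not something prescribed by the slides): scratch storage that would be an automatic array in CPU Fortran is instead allocated by the host as a device array and passed to the kernel explicitly. All names here are invented for the example.

module scratch_demo
contains
  ! Device code cannot declare automatic arrays, so the scratch array "work"
  ! is provided by the host as an explicit argument.
  attributes(global) subroutine scale_with_scratch(x, work, n)
    implicit none
    real :: x(:), work(:)
    integer, value :: n
    integer :: i
    i = blockDim%x*(blockIdx%x-1) + threadIdx%x
    if (i <= n) then
      work(i) = 2.0*x(i)
      x(i)    = work(i) + 1.0
    end if
  end subroutine scale_with_scratch
end module scratch_demo

program test_scratch
  use cudafor
  use scratch_demo
  implicit none
  integer, parameter :: n = 256
  real         :: x(n)
  real, device :: x_d(n), work_d(n)     ! scratch space lives in device memory

  x   = 1.0
  x_d = x
  call scale_with_scratch<<<(n+63)/64, 64>>>(x_d, work_d, n)
  x = x_d
  print *, 'expected 3.0, got', x(1)
end program test_scratch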

2 Implementation
2.1 Implementation of the WRF (v3.4) Fortran program
Physics routines that run on GP-GPUs:
  Microphysics: WSM6 and WDM6
  Boundary-layer physics: YSUPBL
  Radiation physics: RRTMG_LW, RRTMG_SW
  Surface-layer physics: SFCLAY

2 Implementation
2.2 Execution profile of WRF physics routines
[Pie chart] WRF execution profile by physics routine: RRTMG_LW, RRTMG_SW, WDM6, YSUPBL, and others; the shares shown include 22.5%, 21.1%, 14.4%, and 2.1%.

2 Implementation
2.3 Implementation of parallel programs
2.3.1 Running environment
Modification of configure.wrf (the environment set-up file)
  Compatible with the original WRF program
  ARCH_LOCAL = -DRUN_ON_GPU
  (the GP-GPU code is compiled only if -DRUN_ON_GPU is defined)
Create a directory for the CUDA code only: cuda
  GP-GPU source codes
  Its own (exclusive) Makefile

2 Implementation
2.3.2 Structure of the GP-GPU program
(Original code)
  Initialize
  Time steps:
    Physics routine:
      do j = ...
        call 2d routine(.., j, ..)
      enddo

(GPU code)
  Initialize: dynamic allocation of GP-GPU variables
  Time steps:
    Physics routine:
      Copy CPU variables to the GPU
      call 3d routine (GPU)
      Copy GPU variables back to the CPU
  Finalize: deallocation of GPU variables
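A minimal, self-contained sketch of this restructuring (all names and sizes are illustrative, not taken from WRF): the CPU driver calls a 2-D routine once per j inside each time step, while the GPU driver copies data in, launches a single 3-D kernel over the whole (i, j) domain, and copies the result back.

module driver_demo
contains
  ! CPU-style 2-D routine: works on one j-slice at a time.
  subroutine phys2d(t, ni, nk, j)
    implicit none
    integer, intent(in) :: ni, nk, j
    real, intent(inout) :: t(:,:,:)
    integer :: i, k
    do k = 1, nk
      do i = 1, ni
        t(i,k,j) = t(i,k,j) + 1.0
      end do
    end do
  end subroutine phys2d

  ! GPU-style 3-D kernel: one thread per (i,j) column.
  attributes(global) subroutine phys3d(t, ni, nk, nj)
    implicit none
    real :: t(:,:,:)
    integer, value :: ni, nk, nj
    integer :: i, j, k
    i = blockDim%x*(blockIdx%x-1) + threadIdx%x
    j = blockDim%y*(blockIdx%y-1) + threadIdx%y
    if (i <= ni .and. j <= nj) then
      do k = 1, nk
        t(i,k,j) = t(i,k,j) + 1.0
      end do
    end if
  end subroutine phys3d
end module driver_demo

program compare_drivers
  use cudafor
  use driver_demo
  implicit none
  integer, parameter :: ni = 32, nk = 16, nj = 24, nsteps = 3
  real         :: t_cpu(ni,nk,nj), t_gpu(ni,nk,nj)
  real, device :: t_d(ni,nk,nj)
  type(dim3)   :: dimGrid, dimBlock
  integer      :: j, step

  t_cpu = 0.0;  t_gpu = 0.0
  dimBlock = dim3(16, 8, 1)
  dimGrid  = dim3((ni+15)/16, (nj+7)/8, 1)

  do step = 1, nsteps
    ! Original structure: horizontal loop on the CPU.
    do j = 1, nj
      call phys2d(t_cpu, ni, nk, j)
    end do

    ! GPU structure: copy in, one kernel over the whole (i,j) domain, copy out.
    t_d = t_gpu
    call phys3d<<<dimGrid, dimBlock>>>(t_d, ni, nk, nj)
    t_gpu = t_d
  end do

  print *, 'max difference between CPU and GPU results:', maxval(abs(t_cpu - t_gpu))
end program compare_drivers

Because the copies sit inside the time loop, the transfer cost recurs every step, which is the CPU-GPU data-transfer bottleneck noted in the conclusions.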

2 Implementation
2.3.3 Initialize & Finalize
phys/module_physics_init.F
#ifndef RUN_ON_GPU
CALL rrtmg_lwinit( )
#else
CALL rrtmg_lwinit_gpu()
#endif

main/module_wrf_top.F
#ifdef RUN_ON_GPU
call rrtmg_lwfinalize_gpu()
#endif

cuda/module_ra_rrtmg_lw_gpu.F

Initialization of constants and allocation of GPU device variables:
  subroutine rrtmg_lwinit_gpu(...)
    call rrtmg_lw_ini(cp)
    allocate(p8w_d(dime,djme), stat=istate)

Deallocation of GPU device variables:
  subroutine rrtmg_lwfinalize_gpu(...)
    deallocate(p8w_d)
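For reference, a minimal compilable sketch of how such an init/finalize pair could be organized. Only the routine names, rrtmg_lw_ini(cp), and p8w_d(dime,djme) come from the slide; the module name, argument list, module-level placement of p8w_d, and error handling are assumptions.

module module_ra_rrtmg_lw_gpu_sketch
  implicit none
  ! Device-resident copy of the corresponding CPU array; allocated once at
  ! start-up and freed at shutdown, as the allocate/deallocate pair suggests.
  real, allocatable, device :: p8w_d(:,:)
contains
  subroutine rrtmg_lwinit_gpu(cp, dime, djme)
    real,    intent(in) :: cp
    integer, intent(in) :: dime, djme
    integer :: istate
    ! call rrtmg_lw_ini(cp)   ! constants initialization from the CPU module (not defined in this sketch)
    allocate(p8w_d(dime, djme), stat=istate)
    if (istate /= 0) stop 'rrtmg_lwinit_gpu: device allocation failed'
  end subroutine rrtmg_lwinit_gpu

  subroutine rrtmg_lwfinalize_gpu()
    if (allocated(p8w_d)) deallocate(p8w_d)
  end subroutine rrtmg_lwfinalize_gpu
end module module_ra_rrtmg_lw_gpu_sketch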

2 Implementation
2.3.4 Calling GPU Functions
phys/module_radiation_driver.F
#ifndef RUN_ON_GPU
USE module_ra_rrtmg_lw, only: rrtmg_lwrad
#else
USE module_ra_rrtmg_lw_gpu, only: rrtmg_lwrad_gpu
#endif
...
#ifndef RUN_ON_GPU
CALL RRTMG_LWRAD(...)
#else
CALL RRTMG_LWRAD_GPU(...)
#endif


2 Implementation
2.3.5 Translation into GPU code
Use a 3-dimensional domain
Remove the horizontal (i, j) loops from the GPU (global) kernel
SUBROUTINE wdm62D(...)

  do k = kts, kte
    do i = its, ite
      cpm(i,k) = cpmcal(q(i,k))
      xl(i,k)  = xlcal(t(i,k))
    enddo
  enddo
  do k = kts, kte
    do i = its, ite
      delz_tmp(i,k) = delz(i,k)
      den_tmp(i,k)  = den(i,k)
    enddo
  enddo

END SUBROUTINE wdm62D

attributes(global) subroutine wdm6_gpu_kernel(...)

  i = (blockIdx%x-1)*blockDim%x + threadIdx%x
  j = (blockIdx%y-1)*blockDim%y + threadIdx%y
  if (((i.ge.its).and.(i.le.ite)).and. &
      ((j.ge.jts).and.(j.le.jte))) then
    do k = kts, kte
      cpm(i,k,j) = cpmcal(q(i,k,j))
      xl(i,k,j)  = xlcal(t(i,k,j))
    enddo
    do k = kts, kte
      delz_tmp(i,k,j) = delz(i,k,j)
      den_tmp(i,k,j)  = den(i,k,j)
    enddo
  endif
end subroutine wdm6_gpu_kernel

2 Implementation
2.3.6 Memory allocation of arrays and copying of CPU data
subroutine rrtmg_lwrad(rthratenlw, emiss, ...)
  ...
  rthratenlw(i,k,j) = ...
  emiss(i,k,j) = ...

real, allocatable, device :: rthratenlw_d(:,:,:), emiss_d(:,:), ...

rthratenlw_d = rthratenlw
emiss_d = emiss
call rrtmg_lwrad_gpu_kernel<<<dimGrid, dimBlock>>>(rthratenlw_d, emiss_d, ...)

attributes(global) subroutine rrtmg_lwrad_gpu_kernel(rthratenlw, emiss, ...)
  ...
  rthratenlw(i,k,j) = ...
  emiss(i,k,j) = ...

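The assignments above rely on CUDA Fortran's rule that assignment between a host array and a device array performs the corresponding memory copy. A minimal round-trip sketch (the array name follows the slide; the sizes and initial value are invented):

program transfer_demo
  use cudafor
  implicit none
  integer, parameter :: ni = 64, nj = 64
  real                      :: emiss(ni,nj), back(ni,nj)
  real, allocatable, device :: emiss_d(:,:)

  allocate(emiss_d(ni,nj))
  emiss   = 0.9
  emiss_d = emiss        ! host -> device copy (implicit cudaMemcpy)
  back    = emiss_d      ! device -> host copy of the (unchanged) data
  print *, 'max difference after the round trip:', maxval(abs(back - emiss))
  deallocate(emiss_d)
end program transfer_demo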

3 Performance
3.1 Performance comparison
System specification used for the performance measurements:
  CPU: Intel Xeon E5405 (2.0 GHz)
  GPU: Tesla C1060 (1.3 GHz)
    Global memory: 4 GB
    Multiprocessors: 30
    Cores: 240
    Registers per block: 16,384
    Max. threads per block: 512

3 Performance
Performance of WRF physics routines
[Bar chart] CPU vs. GPU execution times for WSM6, WDM6, YSUPBL, RRTMG_LW, RRTMG_SW, and SFCLAY (vertical axis 0-4000).

3 Performance
Performance comparison of CUDA C and CUDA Fortran
[Bar chart] Execution times in microseconds (0-300,000) for WSM5 and WSM6 (legend: CPU vs. GPU).

3 Performance
3.2 Performance of WRF
[Bar chart] Total WRF execution time, CPU vs. GPU (vertical axis 0-18,000).

4 Conclusions
Pros
  GP-GPUs can be used as efficient hardware accelerators.
  GP-GPUs are cheap and energy efficient.
Cons
  Communication between the CPU and the GPU is slow.
    Data transfer between the CPU and the GP-GPU is a bottleneck.
    Overlapping communication with computation is necessary.
  Translation into GP-GPU code is not trivial.
    Parameter-passing methods and local resources are limited.
    CUDA Fortran needs to be improved.
