
Porting Marine Ecosystem Model Spin-Up Using Transport Matrices to GPUs

E. Siewertsen, J. Piwonski, T. Slawig


CAU - Christian-Albrechts-Universität zu Kiel

19 March 2013

E. Siewertsen, J. Piwonski, T. Slawig (CAU)

GPU accelerated PETSc

19 March 2013


Outline

Motivation
Algorithm
Operations
Software
Porting to GPU
    Biogeochemical model (Fortran)
    Model driver (C)
Results
Remarks


Motivation
Global carbon cycle, CO2 uptake of the world's oceans
Parameter estimation in biogeochemical models

[Figure: world map of surface phosphate; x-axis: Longitude [degrees], -180 to 180; y-axis: Latitude [degrees], -80 to 80; color scale: 0.6 to 2.2]

Simulated concentration of nutrients (phosphate, PO4) at the surface layer in mmol P/m^3. The longitudinal and latitudinal resolution is 1.0°.


Motivation (cont'd)

Ocean Biogeochemical Dynamics [Sarmiento and Gruber, 2006]
System of transport equations for biogeochemical tracers:

\[
\frac{\partial y_i}{\partial t}
  = \underbrace{\nabla \cdot (\kappa \nabla y_i)}_{\text{diffusion}}
  - \underbrace{\nabla \cdot (v \, y_i)}_{\text{advection}}
  + \underbrace{q_i(y, u)}_{\text{reaction}},
  \qquad i = 1, \ldots
\]

Climatological (annual periodic) forcing
Solution is a steady annual periodic state (equilibrium)
Involves integration over thousands of model years


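Motivation (cont'd)

Transport Matrix Method [Khatiwala et al., 2005]
Discretized and approximated problem:

\[
y_{j+1} = A_{imp,j} \, ( A_{exp,j} \, y_j + \Delta t \, q_j(y_j, u) ), \qquad j = 0, \ldots
\]

where A_{imp,j} and A_{exp,j} are the transport matrices.

Monthly averaged matrices provided
Interpolation needed:

\[ A_{imp,j} = \alpha_j A_i[i_{\alpha,j}] + \beta_j A_i[i_{\beta,j}] \]
\[ A_{exp,j} = \alpha_j A_e[i_{\alpha,j}] + \beta_j A_e[i_{\beta,j}] \]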
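How the two interpolation weights and the two matrix indices could be obtained for a given time step is sketched below. This is a minimal sketch under stated assumptions, not the scheme used in the talk: the function name `interpolation_weights`, the names `alpha` and `beta` for the weights, the normalization of the model year to [0, 1), and the convention that each monthly matrix is valid at the middle of its month are all assumptions made for illustration.

```python
import math


def interpolation_weights(t, n_matrices=12):
    """Return (i_alpha, alpha, i_beta, beta) for a point in time t in [0, 1).

    Assumption: matrix k is the mean over the k-th of n_matrices equal
    intervals of the year and is treated as valid at the interval midpoint.
    """
    w = t * n_matrices + 0.5              # shift so that midpoints fall on integers
    beta = w - math.floor(w)              # weight of the later matrix
    alpha = 1.0 - beta                    # weight of the earlier matrix
    i_beta = int(math.floor(w)) % n_matrices
    i_alpha = (i_beta - 1) % n_matrices   # periodic wrap-around at the year boundary
    return i_alpha, alpha, i_beta, beta


# Time step j out of 2880 steps per model year (360 days, 3 h step).
steps_per_year = 2880
j = 360                                   # day 45, the middle of the second month
print(interpolation_weights(j / steps_per_year))
```

At the start of the year the sketch blends the last and the first matrix with equal weights, which is what makes the forcing periodic.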


Algorithm

Assuming: 1 year = 360 days and \Delta t = 3 h

y = y_0
repeat
    for j = 1, ..., 2880 do
        evaluate biogeochemical model: y_q = q_j(y, u)
        interpolate matrices to time step j
        perform explicit step: y = A_{exp,j} y
        perform implicit step: y = A_{imp,j} (y + \Delta t y_q)
    end for
until steady annual cycle is reached
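The spin-up loop can be tried out on a toy problem. The following is a minimal sketch, assuming small dense matrices stored as lists of lists, time-constant toy matrices in place of interpolated transport matrices, and a simple relaxation term in place of a biogeochemical model. The names `matvec` and `spin_up` and the stopping criterion (maximum change of the state over one model year below `tol`) are assumptions; the slides do not state which norm or tolerance is used.

```python
def matvec(a, x):
    """Dense matrix-vector product for a list-of-lists matrix."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def spin_up(y0, a_exp, a_imp, q, dt, steps_per_year, tol=1e-10, max_years=1000):
    """Integrate model years until the state at year end stops changing.

    a_exp(j) and a_imp(j) return the matrices for time step j,
    q(j, y) returns the source term for time step j and state y.
    """
    y = list(y0)
    for year in range(1, max_years + 1):
        y_start = list(y)
        for j in range(steps_per_year):
            yq = q(j, y)                          # evaluate the model on the old state
            y = matvec(a_exp(j), y)               # explicit step
            y = matvec(a_imp(j),                  # implicit step
                       [yi + dt * qi for yi, qi in zip(y, yq)])
        # Compare the same point of the annual cycle in consecutive years.
        if max(abs(a - b) for a, b in zip(y, y_start)) < tol:
            return y, year
    return y, max_years


# Toy setup (all values hypothetical): weak mixing between two boxes,
# identity as implicit matrix, relaxation of both tracers toward 1.
steps = 2880
mix = [[0.999, 0.001], [0.001, 0.999]]
identity = [[1.0, 0.0], [0.0, 1.0]]
y, years = spin_up(
    y0=[0.0, 0.5],
    a_exp=lambda j: mix,
    a_imp=lambda j: identity,
    q=lambda j, y: [1.0 - yi for yi in y],
    dt=1.0 / steps,
    steps_per_year=steps,
)
print(years, y)
```

Because the toy forcing does not vary over the year, the steady annual cycle degenerates to a fixed point here; with periodic matrices the same check compares the state at the end of one year with the state at the end of the previous one.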


Operations

Evaluate biogeochemical model:
    BGCStep();
    Including copying between different data alignments

Interpolate matrices:
    MatCopy(); MatScale(); MatAXPY();

Apply explicit and implicit step:
    MatMult();

These operations account for 99.4 % of the computational effort.
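How a copy, a scale, and an AXPY combine into one matrix interpolation can be sketched without PETSc. This is a minimal sketch, assuming dense list-of-lists matrices; the helpers `mat_copy`, `mat_scale`, `mat_axpy`, and `interpolate` are hypothetical stand-ins that only mirror the semantics of the PETSc routines of similar name (copy A into B, A = a A, Y = Y + a X) and are not PETSc calls.

```python
def mat_copy(a, b):
    """Overwrite the entries of b with those of a (in place)."""
    for row_b, row_a in zip(b, a):
        row_b[:] = row_a


def mat_scale(a, factor):
    """Scale all entries of a by factor (in place)."""
    for row in a:
        for k in range(len(row)):
            row[k] *= factor


def mat_axpy(y, factor, x):
    """Compute y = y + factor * x (in place)."""
    for row_y, row_x in zip(y, x):
        for k in range(len(row_y)):
            row_y[k] += factor * row_x[k]


def interpolate(work, alpha, a_alpha, beta, a_beta):
    """Store alpha * a_alpha + beta * a_beta in the work matrix."""
    mat_copy(a_alpha, work)          # work = a_alpha
    mat_scale(work, alpha)           # work = alpha * a_alpha
    mat_axpy(work, beta, a_beta)     # work = alpha * a_alpha + beta * a_beta
    return work


# Two hypothetical monthly matrices and a preallocated work matrix.
a_first = [[1.0, 0.0], [0.0, 1.0]]
a_second = [[0.0, 2.0], [2.0, 0.0]]
work = [[0.0, 0.0], [0.0, 0.0]]
interpolate(work, 0.25, a_first, 0.75, a_second)
print(work)
```

Working in a preallocated matrix keeps the monthly matrices untouched, so they can be reused in every time step of every model year.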


Porting to GPU

Software:
    Metos3D [Piwonski and Slawig, 2013], github.com/metos3d
    PETSc-based implementation in C [Balay et al., 1997]
    Uses PETSc matrix and vector operations
    Couples biogeochemical models implemented in Fortran

Objectives:
    Do not change the Fortran implementation at all
    Do not change the C implementation, if possible


Biogeochemical model

Fortran implementation:
    Obtained PGI CUDA Fortran compiler license
    Using wrapper file model.CUF with Fortran kernels
    Including original through: #include "model.F"
    Using macro to change subroutine to attributes(device) subroutine

Copying between data alignments:
    Thrust iterators
    Operator overloading


Model Driver

Objective: Do not change software implementation
Translates to: Change library beneath

PETSc-dev:
    GPU-enabled PETSc version
    MatMult() is already implemented [Minden et al., 2010]
    MatCopy(), MatScale(), MatAXPY() have to be added


PETSc (cont'd)

PETSc classes

Basic PETSc principles:
Object-oriented programming using the C language:
    Data encapsulation
    Polymorphism
    Inheritance


PETSc (cont'd)

Class Mat
Operations:

    struct _MatOps {
      ...
      PetscErrorCode (*mult)(Mat,Vec,Vec);
      PetscErrorCode (*copy)(Mat,Mat,...);
      PetscErrorCode (*scale)(Mat,PetscScalar);
      PetscErrorCode (*axpy)(Mat,PetscScalar,Mat,...);
      ...
    };

Adding own implementation:

    M->ops->scale = MatScale_SeqAIJCUSP;
    ...
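The effect of replacing one entry of the operations table can be illustrated independently of PETSc. The following is a language-neutral sketch in Python of the same dispatch pattern, not PETSc code: the classes `MatOps` and `Mat`, the functions `mat_scale_host` and `mat_scale_device`, and the `calls` log are all hypothetical, and the "device" variant only stands in for a GPU kernel.

```python
calls = []  # records which implementation ran, for illustration only


def mat_scale_host(m, factor):
    """Default implementation: scale the values on the host."""
    calls.append("host")
    m.values = [v * factor for v in m.values]


def mat_scale_device(m, factor):
    """Stand-in for a GPU implementation (no real device code here)."""
    calls.append("device")
    m.values = [v * factor for v in m.values]


class MatOps:
    """Table of operation slots, one callable per operation."""

    def __init__(self):
        self.scale = mat_scale_host


class Mat:
    """Matrix object that carries its data and its operations table."""

    def __init__(self, values):
        self.values = list(values)
        self.ops = MatOps()


def mat_scale(m, factor):
    """Public entry point: callers never name a concrete implementation."""
    m.ops.scale(m, factor)


m = Mat([1.0, 2.0, 3.0])
mat_scale(m, 2.0)                  # dispatches to the host version
m.ops.scale = mat_scale_device     # swap in the other implementation
mat_scale(m, 0.5)                  # same call, now dispatched elsewhere
print(calls, m.values)
```

The calling code is identical before and after the swap, which is why the model driver can stay unchanged when the library beneath it gains GPU implementations.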


Results
GPU: GeForce GTX 480
CPU: Intel Xeon E5520, running at 2.27 GHz
Biogeochemical model: N-DOP [Dutkiewicz et al., 2005]

             Min         Max         Avg         StdDev
CPU          621.43 s    626.79 s    622.14 s    0.540
GPU           28.17 s     28.20 s     28.18 s    0.003
CPU : GPU     22.06

Overall performance gain simulating one model year using the N-DOP model at a longitudinal and latitudinal resolution of 2.8125°.


Results (cont'd)

Performance gain per operation:

Routine     CPU         GPU        CPU : GPU
BGCStep     469.76 s    13.05 s    36.00
MatCopy      34.04 s     3.91 s     8.70
MatScale     23.33 s     1.99 s    11.70
MatAXPY      37.49 s     2.89 s    12.96
MatMult      58.19 s     5.87 s     9.92

Performance of MatMult in detail:
    CPU: 1.2 GFlop/s (13 % of 9.08 GFlop/s), 12 GB/s (56.8 % of 21.2 GB/s)
    GPU: 11.9 GFlop/s (7 % of 168 GFlop/s), 119.4 GB/s (67.4 % of 177 GB/s)
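The ratios and percentages above can be recomputed from the listed values. The sketch below does only that arithmetic; the dictionary names are hypothetical. Ratios recomputed from the rounded timings differ from the listed ones in the second decimal place, presumably because the listed ratios were derived from unrounded timings.

```python
# Timings in seconds as listed per routine: (CPU, GPU)
timings = {
    "BGCStep": (469.76, 13.05),
    "MatCopy": (34.04, 3.91),
    "MatScale": (23.33, 1.99),
    "MatAXPY": (37.49, 2.89),
    "MatMult": (58.19, 5.87),
}

# Speedup factor CPU : GPU for every routine.
speedup = {name: cpu / gpu for name, (cpu, gpu) in timings.items()}
for name, factor in speedup.items():
    print(f"{name:<9}{factor:6.2f}")

# MatMult in detail: (achieved rate, peak rate)
rates = {
    "CPU GFlop/s": (1.2, 9.08),
    "CPU GB/s": (12.0, 21.2),
    "GPU GFlop/s": (11.9, 168.0),
    "GPU GB/s": (119.4, 177.0),
}

# Percentage of the peak rate that was reached.
peak_fraction = {key: 100.0 * got / peak for key, (got, peak) in rates.items()}
for key, percent in peak_fraction.items():
    print(f"{key:<12}{percent:5.1f} %")
```

On both devices MatMult reaches more than half of the peak memory bandwidth but only a small share of the peak floating-point rate, which suggests that the sparse matrix-vector product is limited by memory bandwidth rather than by arithmetic throughput.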


Results (cont'd)

[Figure: two pie charts]

N-DOP model, CPU (626.07 s/a):
    BGCStep 75.0 %, MatMult 9.3 %, MatAXPY 6.0 %, MatCopy 5.4 %, MatScale 3.7 %, Other (remainder)

N-DOP model, GPU (28.34 s/a) [block size 160]:
    BGCStep 46.0 %, MatMult 20.7 %, MatCopy 13.8 %, MatAXPY 10.2 %, MatScale 7.0 %, Other 2.2 %

Distribution of computational time per operation during the simulation of one model year.

Results (cont'd)

[Figure: Simulation of one cycle (N-DOP, 2.8125°, 2880 time steps); x-axis: processor count; y-axis: time per cycle [s]; series: 2.10 GHz AMD Barcelona (rzcluster), 2.67 GHz Intel Westmere (rzcluster), 2.93 GHz Intel Gainestown (HLRN), GeForce GTX 480; marked values: 28 s on the time axis; 17, 28, and 56 on the processor axis]

Performance of GPU simulation compared to different CPU clusters.



Remarks

The PGI Fortran preprocessor did not accept:
    #define subroutine attributes(device) subroutine
    The C preprocessor was used instead

Only sequential PETSc operations were implemented
Ongoing work on a multi-GPU implementation


Thanks

Thanks for your attention!


Special thanks to Dr. Ulrich Knechtel

