
Juan A. Sillero, Guillem Borrell, Javier Jiménez
(Universidad Politécnica de Madrid)
and Robert D. Moser
(U. Texas Austin)
[Figure: turbulent/non-turbulent (T/NT) interface; axes x/δ99, y/δ99, z/δ99]
Hybrid OpenMP-MPI Turbulent Boundary Layer Code over 32k Cores
FUNDED BY: CICYT, ERC, INCITE, & UPM

Outline

Motivations
Numerical approach
Computational setup & domain decomposition
Node topology
Code scaling
IO performance
Conclusions
Motivations

Differences between internal and external flows:
  Internal: pipes and channels
  External: boundary layers
  Effect of large-scale intermittency on the turbulent structures

Energy consumption optimization:
  Skin friction is generated at the vehicle/boundary-layer interface

Separation of scales:
  Three-layer structure: inner, logarithmic and outer
  Achieved only at high Reynolds numbers

Important advantages of simulations over experiments


Motivations: underlying physics

[Figure: internal flows (duct and pipe sections, fully turbulent) vs. external flows (turbulent/non-turbulent interface)]

Skin friction (drag) accounts for about 5% of world energy consumption.
Numerical Approach

Incompressible Navier-Stokes equations for the velocities (u, v, w) and the pressure p, plus boundary conditions

Staggered grid

Time integration: semi-implicit three-step Runge-Kutta (RK-3)
  Non-linear terms
  Linear viscous terms
  Linear pressure-gradient terms

Spatial discretization:
  Compact finite differences (X & Y) (generic stencil below)
  Pseudo-spectral (Z)
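For orientation, the interior stencil of the classical fourth-order compact (Padé) scheme for a first derivative on a uniform grid of spacing h is shown below; this is a generic illustration of the compact-finite-difference idea, not necessarily the exact operators of the code (see Simens et al. 2009 for those):

\frac{1}{4}\,f'_{i-1} + f'_{i} + \frac{1}{4}\,f'_{i+1}
  \;=\; \frac{3}{4h}\,\bigl(f_{i+1} - f_{i-1}\bigr)

Each derivative is obtained by solving a tridiagonal system along the corresponding grid line, which is where the tridiagonal LU solves discussed later come from.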
Numerical Approach
Simens et al., J. Comput. Phys. 228, 4218 (2009)
Jiménez et al., J. Fluid Mech. 657, 335 (2010)

Fractional step method (generic step sketched below)

Inlet conditions generated with the recycling scheme of Lund et al.

Linear systems solved using LU decomposition

Poisson equation for the pressure solved with a direct method

2nd-order time accuracy and 4th-order compact finite differences
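To make the fractional-step idea concrete, a generic single step (not the code's actual semi-implicit RK-3 sub-steps) can be written as follows, where R(u) collects the convective and viscous terms:

\begin{aligned}
\mathbf{u}^{*}     &= \mathbf{u}^{\,n} + \Delta t \, R(\mathbf{u}^{\,n})
   && \text{(predictor: momentum without pressure)} \\
\nabla^{2}\phi     &= \tfrac{1}{\Delta t}\,\nabla\cdot\mathbf{u}^{*}
   && \text{(Poisson equation: mass conservation)} \\
\mathbf{u}^{\,n+1} &= \mathbf{u}^{*} - \Delta t\,\nabla\phi
   && \text{(projection onto a divergence-free field)}
\end{aligned}

The Poisson step is the one solved with the direct method mentioned above.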



Computational setup & domain decomposition

[Figure: plane-to-plane decomposition between XY planes (63 MB) and ZY planes (11 MB), using 16 R*8 buffers]

Blue Gene/P: 4 PowerPC 450 cores per node, 2 GB RAM (DDR2)
INCITE project (ANL)
PRACE Tier-0 project (Jugene)

New parallelization strategy: plane-to-plane decomposition + hybrid OpenMP-MPI (skeleton sketch below)
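As an illustration of the hybrid strategy (MPI across nodes, 4 OpenMP threads inside each node), a minimal skeleton might look like the following; the compute phase is a placeholder, not the code's actual routines:

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* One MPI rank per node; OpenMP threads fill the node's 4 cores.
       MPI_THREAD_FUNNELED: only the master thread makes MPI calls. */
    int provided, rank, nranks;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    omp_set_num_threads(4);          /* 4 threads per Blue Gene/P node */

    /* Placeholder compute phase: each thread works on its own slice
       of the planes owned by this rank. */
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        /* ... thread-local work on this rank's planes ... */
        (void)tid;
    }

    /* The master thread then performs the collective communication phase. */
    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0) printf("ranks = %d\n", nranks);

    MPI_Finalize();
    return 0;
}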
Computational setup & domain decomposition

Global transposes:
  Change the memory layout
  Collective communications: MPI_ALLTOALLV (sketch after these bullets)
  Messages are single precision (R*4)
  About 40% of the total time (when using the torus network)
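A minimal sketch of such a transpose, assuming the data have already been repacked into per-destination single-precision (R*4) chunks; the count arrays are illustrative, not the code's own bookkeeping:

#include <mpi.h>
#include <stdlib.h>

/* Exchange plane chunks between all ranks of comm with a single collective. */
void transpose_planes(const float *sendbuf, float *recvbuf,
                      const int *sendcounts, const int *recvcounts,
                      MPI_Comm comm)
{
    int nranks;
    MPI_Comm_size(comm, &nranks);

    int *sdispl = malloc(nranks * sizeof(int));
    int *rdispl = malloc(nranks * sizeof(int));

    /* Displacements are running sums of the per-rank counts. */
    sdispl[0] = rdispl[0] = 0;
    for (int r = 1; r < nranks; ++r) {
        sdispl[r] = sdispl[r - 1] + sendcounts[r - 1];
        rdispl[r] = rdispl[r - 1] + recvcounts[r - 1];
    }

    /* Single-precision messages keep the communication volume down. */
    MPI_Alltoallv(sendbuf, sendcounts, sdispl, MPI_FLOAT,
                  recvbuf, recvcounts, rdispl, MPI_FLOAT, comm);

    free(sdispl);
    free(rdispl);
}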

4 OpenMP threads

Static scheduling:
  Through private indexes
  Maximises data locality
  Good load balance

Loop blocking in Y

Tridiagonal linear systems solved via LU (sketch below)

Tuned for Blue Gene/P
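A sketch of how a batch of independent tridiagonal systems can be solved with the LU (Thomas) sweep under an OpenMP static schedule; the array names, layout and per-thread workspace are assumptions for illustration, and the Y-blocking is omitted for brevity:

#include <omp.h>

/* Solve n systems of size ny: a,b,c are the tridiagonal bands, d the RHS
   (overwritten with the solution); storage is [system][point]. The caller
   provides one scratch row of length ny per thread in 'work'. */
void solve_tridiag_batch(int n, int ny,
                         const double *a, const double *b, const double *c,
                         double *d, double *work)
{
    #pragma omp parallel for schedule(static)   /* private loop index per thread */
    for (int s = 0; s < n; ++s) {
        const double *as = a + (long)s * ny;
        const double *bs = b + (long)s * ny;
        const double *cs = c + (long)s * ny;
        double       *ds = d + (long)s * ny;
        double       *cp = work + (long)omp_get_thread_num() * ny;

        /* Forward elimination (the LU sweep). */
        double beta = bs[0];
        ds[0] /= beta;
        cp[0] = cs[0] / beta;
        for (int j = 1; j < ny; ++j) {
            beta  = bs[j] - as[j] * cp[j - 1];
            cp[j] = cs[j] / beta;
            ds[j] = (ds[j] - as[j] * ds[j - 1]) / beta;
        }
        /* Back substitution. */
        for (int j = ny - 2; j >= 0; --j)
            ds[j] -= cp[j] * ds[j + 1];
    }
}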


Computational setup & domain decomposition
[Flowchart: the total available nodes (MPI_COMM_WORLD) are split into a first MPI group (BL1) and a second MPI group (BL2). Each group runs its own boundary layer: initialization, read field, compute Navier-Stokes RHS, time integration, Poisson equation (mass conservation). The groups agree on the minimum time step, send/receive the inlet plane with explicit synchronization, and finish with MPI_FINALIZE / end of program.]

Create 2 MPI groups (MPI_GROUP_INCL)

Groups are created from 2 lists of ranks

Split the global communicator into 2 local ones (sketch below)

Each group runs independently

Some global operations:
  Time step: MPI_ALLREDUCE
  Inlet conditions: SEND/RECEIVE
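A minimal sketch of the group/communicator setup and of the global time-step reduction, assuming the first n1 ranks form BL1 and the rest BL2; the split point and the dt values are placeholders:

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    int n1 = nranks / 16;                 /* e.g. 512 of 8192: illustrative */
    int in_bl1 = (rank < n1);

    /* Build the rank list of the group this rank belongs to. */
    int count = in_bl1 ? n1 : nranks - n1;
    int *ranks = malloc(count * sizeof(int));
    for (int i = 0; i < count; ++i)
        ranks[i] = in_bl1 ? i : n1 + i;

    MPI_Group world_group, my_group;
    MPI_Comm  my_comm;
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);
    MPI_Group_incl(world_group, count, ranks, &my_group);
    MPI_Comm_create(MPI_COMM_WORLD, my_group, &my_comm);  /* local communicator */

    /* Each group computes its own stable dt; the run advances with the
       global minimum so both boundary layers stay synchronized. */
    double dt_local = in_bl1 ? 1.0e-3 : 8.0e-4;   /* placeholder values */
    double dt_global;
    MPI_Allreduce(&dt_local, &dt_global, 1, MPI_DOUBLE, MPI_MIN, MPI_COMM_WORLD);

    MPI_Comm_free(&my_comm);
    MPI_Group_free(&my_group);
    MPI_Group_free(&world_group);
    free(ranks);
    MPI_Finalize();
    return 0;
}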


Node topology

How to map virtual processes onto physical processors?

[Figure: predefined vs. custom mapping; the custom mapping is twice as fast]

8192 nodes: BL1 = 512, BL2 = 7680

The 3D torus network is lost: Comm_BL1 ∪ Comm_BL2 = MPI_COMM_WORLD
Node topology: balance between communication and computation

Code scaling

[Figure, left panel "Across nodes (MPI)": time per message [s] vs. message size [bytes], for messages between about 2 kB and 7 MB; right panel "Within node (OpenMP)": time [s] vs. node occupation (millions of points per node). Annotations: 52% computation, 40% communication, 8% transposes; linear weak scaling.]
IO Performance

Checkpoint of the simulation: 0.5 TB (R*4)
  Every 3 hours (12-hour runs)
  Velocity {u,v,w} and pressure {p} fields (4 x 84 GB + 4 x 7.2 GB)
  Correlation files {u}

Different strategies for IO:
  Serial IO: discarded
  Parallel collective IO:
    POSIX calls
    SIONLIB library (Juelich)
    HDF5 (GPFS & PVFS2)

HDF5 tuning for Blue Gene/P (collective-write sketch below):
  GPFS & PVFS2 (cache OFF & ON, respectively)
  Cache OFF, write: 2 GB/s (5-15 minutes)
  Cache ON, write: 16 GB/s (25-60 seconds)
  Forcing the file system block size in GPFS: 16 GB/s
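A sketch of the parallel collective HDF5 pattern behind these figures (MPI-IO file driver plus a collective dataset transfer); the file name, dataset name and 1-D slab layout are placeholders:

#include <mpi.h>
#include <hdf5.h>

void write_field(MPI_Comm comm, const float *slab,
                 hsize_t nglobal, hsize_t nlocal, hsize_t offset)
{
    /* File access property list: route HDF5 through MPI-IO. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);
    hid_t file = H5Fcreate("field_u.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* One 1-D global dataset; each rank owns a contiguous slab of it. */
    hid_t filespace = H5Screate_simple(1, &nglobal, NULL);
    hid_t dset = H5Dcreate(file, "u", H5T_NATIVE_FLOAT, filespace,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    hid_t memspace = H5Screate_simple(1, &nlocal, NULL);
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &offset, NULL, &nlocal, NULL);

    /* Dataset transfer property list: collective write (the key tuning knob). */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

    H5Dwrite(dset, H5T_NATIVE_FLOAT, memspace, filespace, dxpl, slab);

    H5Pclose(dxpl);
    H5Sclose(memspace);
    H5Sclose(filespace);
    H5Dclose(dset);
    H5Fclose(file);
    H5Pclose(fapl);
}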

Conclusions

Turbulent boundary layer code ported to hybrid OpenMP-MPI

Memory optimized for Blue Gene/P: 0.5 GB/core

Excellent linear weak scaling up to 8k nodes

Large performance impact from using custom node topologies

Parallel collective IO (HDF5): read 22 GB/s, write 16 GB/s

[Figure: low-pressure isosurfaces at high Reynolds numbers]
