
Juan A. Sillero, Guillem Borrell, Javier Jiménez
(Universidad Politécnica de Madrid)
and Robert D. Moser
(U. Texas Austin)
[Figure: turbulent/non-turbulent (T/NT) interface; axes x/δ99, y/δ99, z/δ99]
Hybrid OpenMP-MPI Turbulent Boundary Layer Code over 32k Cores
FUNDED BY: CICYT, ERC, INCITE, & UPM

Outline

Motivations
Numerical approach
Computational setup & domain decomposition
Node topology
Code scaling
IO performance
Conclusions
Motivations

Differences between internal and external flows:
  Internal: pipes and channels
  External: boundary layers
  Effect of large-scale intermittency on the turbulent structures

Energy consumption optimization:
  Skin friction is generated at the vehicle/boundary-layer interface

Separation of scales:
  Three-layer structure: inner, logarithmic and outer
  Achieved only at high Reynolds numbers

Important advantages of simulations over experiments


Motivations: underlying physics

[Figure: internal flows (duct and pipe sections, fully turbulent) vs. external flows (turbulent/non-turbulent interface)]

Skin friction (drag) accounts for about 5% of world energy consumption.
Numerical Approach

Incompressible Navier-Stokes equations for the velocities (u, v, w) and the pressure p, plus boundary conditions

Staggered grid

Time integration: semi-implicit three-step Runge-Kutta (RK-3)
  Non-linear terms
  Linear viscous terms
  Linear pressure-gradient terms

Spatial discretization:
  Compact finite differences (X & Y) (generic stencil below)
  Pseudo-spectral (Z)
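For orientation, the interior stencil of the classical fourth-order compact (Padé) scheme for a first derivative on a uniform grid of spacing h is shown below; this is a generic illustration of the compact-finite-difference idea, not necessarily the exact operators of the code (see Simens et al. 2009 for those):

\frac{1}{4}\,f'_{i-1} + f'_{i} + \frac{1}{4}\,f'_{i+1}
  \;=\; \frac{3}{4h}\,\bigl(f_{i+1} - f_{i-1}\bigr)

Each derivative is obtained by solving a tridiagonal system along the corresponding grid line, which is where the tridiagonal LU solves discussed later come from.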
Numerical Approach
Simens et al., J. Comput. Phys. 228, 4218 (2009)
Jiménez et al., J. Fluid Mech. 657, 335 (2010)

Fractional step method (generic step sketched below)

Inlet conditions generated with the recycling scheme of Lund et al.

Linear systems solved using LU decomposition

Poisson equation for the pressure solved with a direct method

2nd-order time accuracy and 4th-order compact finite differences
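To make the fractional-step idea concrete, a generic single step (not the code's actual semi-implicit RK-3 sub-steps) can be written as follows, where R(u) collects the convective and viscous terms:

\begin{aligned}
\mathbf{u}^{*}     &= \mathbf{u}^{\,n} + \Delta t \, R(\mathbf{u}^{\,n})
   && \text{(predictor: momentum without pressure)} \\
\nabla^{2}\phi     &= \tfrac{1}{\Delta t}\,\nabla\cdot\mathbf{u}^{*}
   && \text{(Poisson equation: mass conservation)} \\
\mathbf{u}^{\,n+1} &= \mathbf{u}^{*} - \Delta t\,\nabla\phi
   && \text{(projection onto a divergence-free field)}
\end{aligned}

The Poisson step is the one solved with the direct method mentioned above.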



Computational setup & domain decomposition

[Figure: plane-to-plane decomposition between XY planes (63 MB) and ZY planes (11 MB), using 16 R*8 buffers]

Blue Gene/P: 4 PowerPC 450 cores per node, 2 GB RAM (DDR2)
INCITE project (ANL)
PRACE Tier-0 project (Jugene)

New parallelization strategy: plane-to-plane decomposition + hybrid OpenMP-MPI (skeleton sketch below)
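As an illustration of the hybrid strategy (MPI across nodes, 4 OpenMP threads inside each node), a minimal skeleton might look like the following; the compute phase is a placeholder, not the code's actual routines:

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* One MPI rank per node; OpenMP threads fill the node's 4 cores.
       MPI_THREAD_FUNNELED: only the master thread makes MPI calls. */
    int provided, rank, nranks;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    omp_set_num_threads(4);          /* 4 threads per Blue Gene/P node */

    /* Placeholder compute phase: each thread works on its own slice
       of the planes owned by this rank. */
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        /* ... thread-local work on this rank's planes ... */
        (void)tid;
    }

    /* The master thread then performs the collective communication phase. */
    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0) printf("ranks = %d\n", nranks);

    MPI_Finalize();
    return 0;
}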
Computational setup & domain decomposition

Global transposes:
  Change the memory layout
  Collective communications: MPI_ALLTOALLV (sketch after these bullets)
  Messages are single precision (R*4)
  About 40% of the total time (when using the torus network)
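A minimal sketch of such a transpose, assuming the data have already been repacked into per-destination single-precision (R*4) chunks; the count arrays are illustrative, not the code's own bookkeeping:

#include <mpi.h>
#include <stdlib.h>

/* Exchange plane chunks between all ranks of comm with a single collective. */
void transpose_planes(const float *sendbuf, float *recvbuf,
                      const int *sendcounts, const int *recvcounts,
                      MPI_Comm comm)
{
    int nranks;
    MPI_Comm_size(comm, &nranks);

    int *sdispl = malloc(nranks * sizeof(int));
    int *rdispl = malloc(nranks * sizeof(int));

    /* Displacements are running sums of the per-rank counts. */
    sdispl[0] = rdispl[0] = 0;
    for (int r = 1; r < nranks; ++r) {
        sdispl[r] = sdispl[r - 1] + sendcounts[r - 1];
        rdispl[r] = rdispl[r - 1] + recvcounts[r - 1];
    }

    /* Single-precision messages keep the communication volume down. */
    MPI_Alltoallv(sendbuf, sendcounts, sdispl, MPI_FLOAT,
                  recvbuf, recvcounts, rdispl, MPI_FLOAT, comm);

    free(sdispl);
    free(rdispl);
}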

4 OpenMP threads

Static scheduling:
  Through private indexes
  Maximises data locality
  Good load balance

Loop blocking in Y

Tridiagonal linear systems solved via LU (sketch below)

Tuned for Blue Gene/P
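A sketch of how a batch of independent tridiagonal systems can be solved with the LU (Thomas) sweep under an OpenMP static schedule; the array names, layout and per-thread workspace are assumptions for illustration, and the Y-blocking is omitted for brevity:

#include <omp.h>

/* Solve n systems of size ny: a,b,c are the tridiagonal bands, d the RHS
   (overwritten with the solution); storage is [system][point]. The caller
   provides one scratch row of length ny per thread in 'work'. */
void solve_tridiag_batch(int n, int ny,
                         const double *a, const double *b, const double *c,
                         double *d, double *work)
{
    #pragma omp parallel for schedule(static)   /* private loop index per thread */
    for (int s = 0; s < n; ++s) {
        const double *as = a + (long)s * ny;
        const double *bs = b + (long)s * ny;
        const double *cs = c + (long)s * ny;
        double       *ds = d + (long)s * ny;
        double       *cp = work + (long)omp_get_thread_num() * ny;

        /* Forward elimination (the LU sweep). */
        double beta = bs[0];
        ds[0] /= beta;
        cp[0] = cs[0] / beta;
        for (int j = 1; j < ny; ++j) {
            beta  = bs[j] - as[j] * cp[j - 1];
            cp[j] = cs[j] / beta;
            ds[j] = (ds[j] - as[j] * ds[j - 1]) / beta;
        }
        /* Back substitution. */
        for (int j = ny - 2; j >= 0; --j)
            ds[j] -= cp[j] * ds[j + 1];
    }
}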


Computational setup & domain decomposition
[Flowchart: the total available nodes (MPI_COMM_WORLD) are split into a first MPI group (BL1) and a second MPI group (BL2). Each group runs its own boundary layer: initialization, read field, compute Navier-Stokes RHS, time integration, Poisson equation (mass conservation). The groups agree on the minimum time step, send/receive the inlet plane with explicit synchronization, and finish with MPI_FINALIZE / end of program.]

Create 2 MPI groups (MPI_GROUP_INCL)

Groups are created from 2 lists of ranks

Split the global communicator into 2 local ones (sketch below)

Each group runs independently

Some global operations:
  Time step: MPI_ALLREDUCE
  Inlet conditions: SEND/RECEIVE
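A minimal sketch of the group/communicator setup and of the global time-step reduction, assuming the first n1 ranks form BL1 and the rest BL2; the split point and the dt values are placeholders:

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    int n1 = nranks / 16;                 /* e.g. 512 of 8192: illustrative */
    int in_bl1 = (rank < n1);

    /* Build the rank list of the group this rank belongs to. */
    int count = in_bl1 ? n1 : nranks - n1;
    int *ranks = malloc(count * sizeof(int));
    for (int i = 0; i < count; ++i)
        ranks[i] = in_bl1 ? i : n1 + i;

    MPI_Group world_group, my_group;
    MPI_Comm  my_comm;
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);
    MPI_Group_incl(world_group, count, ranks, &my_group);
    MPI_Comm_create(MPI_COMM_WORLD, my_group, &my_comm);  /* local communicator */

    /* Each group computes its own stable dt; the run advances with the
       global minimum so both boundary layers stay synchronized. */
    double dt_local = in_bl1 ? 1.0e-3 : 8.0e-4;   /* placeholder values */
    double dt_global;
    MPI_Allreduce(&dt_local, &dt_global, 1, MPI_DOUBLE, MPI_MIN, MPI_COMM_WORLD);

    MPI_Comm_free(&my_comm);
    MPI_Group_free(&my_group);
    MPI_Group_free(&world_group);
    free(ranks);
    MPI_Finalize();
    return 0;
}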


Node topology

How to map virtual processes onto physical processors?

[Figure: predefined vs. custom mapping; the custom mapping is twice as fast]

8192 nodes: BL1 = 512, BL2 = 7680

The 3D torus network is lost: Comm_BL1 ∪ Comm_BL2 = MPI_COMM_WORLD
Node topology: balance between communication and computation

Code scaling

[Figure, left panel "Across nodes (MPI)": time per message [s] vs. message size [bytes], for messages between about 2 kB and 7 MB; right panel "Within node (OpenMP)": time [s] vs. node occupation (millions of points per node). Annotations: 52% computation, 40% communication, 8% transposes; linear weak scaling.]
IO Performance

Checkpoint of the simulation: 0.5 TB (R*4)
  Every 3 hours (12-hour runs)
  Velocity {u,v,w} and pressure {p} fields (4 x 84 GB + 4 x 7.2 GB)
  Correlation files {u}

Different strategies for IO:
  Serial IO: discarded
  Parallel collective IO:
    POSIX calls
    SIONLIB library (Juelich)
    HDF5 (GPFS & PVFS2)

HDF5 tuning for Blue Gene/P (collective-write sketch below):
  GPFS & PVFS2 (cache OFF & ON, respectively)
  Cache OFF, write: 2 GB/s (5-15 minutes)
  Cache ON, write: 16 GB/s (25-60 seconds)
  Forcing the file system block size in GPFS: 16 GB/s
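A sketch of the parallel collective HDF5 pattern behind these figures (MPI-IO file driver plus a collective dataset transfer); the file name, dataset name and 1-D slab layout are placeholders:

#include <mpi.h>
#include <hdf5.h>

void write_field(MPI_Comm comm, const float *slab,
                 hsize_t nglobal, hsize_t nlocal, hsize_t offset)
{
    /* File access property list: route HDF5 through MPI-IO. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);
    hid_t file = H5Fcreate("field_u.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* One 1-D global dataset; each rank owns a contiguous slab of it. */
    hid_t filespace = H5Screate_simple(1, &nglobal, NULL);
    hid_t dset = H5Dcreate(file, "u", H5T_NATIVE_FLOAT, filespace,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    hid_t memspace = H5Screate_simple(1, &nlocal, NULL);
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &offset, NULL, &nlocal, NULL);

    /* Dataset transfer property list: collective write (the key tuning knob). */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

    H5Dwrite(dset, H5T_NATIVE_FLOAT, memspace, filespace, dxpl, slab);

    H5Pclose(dxpl);
    H5Sclose(memspace);
    H5Sclose(filespace);
    H5Dclose(dset);
    H5Fclose(file);
    H5Pclose(fapl);
}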

Conclusions

Turbulent boundary layer code ported to hybrid OpenMP-MPI

Memory optimized for Blue Gene/P: 0.5 GB/core

Excellent linear weak scaling up to 8k nodes

Large performance impact from using custom node topologies

Parallel collective IO (HDF5): read 22 GB/s, write 16 GB/s

[Figure: low-pressure isosurfaces at high Reynolds numbers]
