Vectorized Unstructured CFD Methods for GPU Computing
Gregory D. Howe¹
Georgia Institute of Technology, Atlanta, GA, 30332

¹ Master's Student, AIAA Student Member.
Vectorized formulations of common unstructured-grid CFD methods are presented. These are used to produce a Matlab-based unstructured CFD code that will run on traditional central processing units (CPUs) and on graphics processing units (GPUs). GPU-based computing systems are potentially the future of affordable high-performance scientific computing, while unstructured CFD has increasingly become the standard for aerodynamic design problems over the last twenty years. Multiple method components are examined here to establish their suitability for this type of vectorized formulation and computation. Example solutions are presented for classic geometry cases.
Nomenclature
A = flux Jacobian matrix
a = speed of sound
c = centroid (of a cell or face)
C_p = pressure coefficient = (p - p_∞) / (0.5 ρ_∞ V_∞²)
e_0 = total energy per unit volume
F = flux vector
h_0 = stagnation enthalpy
k = thermal conductivity
n = unit normal vector with components [n_x n_y n_z]
p = pressure
R = residual vector
Re_L = Reynolds number = ρ_∞ V_∞ L / μ_∞, where L is a reference length
r = displacement vector between two points
S = area (typically of a cell face)
U = conserved state vector = [ρ ρu ρv ρw ρe_0]^T
u,v,w = velocity x-, y-, and z-components
V = velocity vector = [u v w]^T
x = coordinate vector of a point in space (typically a grid node)
γ = ratio of specific heats (1.4 for air at moderate temperatures)
Γ_i = number of faces of cell i
φ(i) = function returning face indices corresponding to cell i
δ(j) = direction of normal vector n_j attached to face j
Γ_j = number of nodes of face j
κ(j) = function returning node indices corresponding to face j
Γ_k = number of cells which reference node k
χ(k) = function returning cell indices corresponding to cells referencing node k
ν(i) = function returning node indices corresponding to nodes defining cell i
ξ(i,k) = function returning face indices corresponding to faces referencing node k and cell i
ρ = density
μ = dynamic viscosity
Λ = flux Jacobian spectral radius
σ = Courant-Friedrichs-Lewy (CFL) number
Ψ = flux limiter function
Ω = cell volume

American Institute of Aeronautics and Astronautics


2
Subscripts
1,2, etc. = refers to a subfunction which produces that number of outputs
c = convective (as opposed to viscous) part
i = refers to cell-centered or centroidal values
j = refers to face-centered values
k = refers to nodal values
v = viscous (as opposed to convective) part
L = "left-hand" value across a boundary
R = "right-hand" value across a boundary
⊥ = perpendicular component, typically computed via a dot product with a normal vector
∞ = denotes freestream quantities
I. Introduction
It is becoming increasingly apparent with time that the future of scientific computing lies with Graphics Processing Unit (GPU) computing technologies. In October 2010, the National University of Defense Technology in China announced the completion of Tianhe-1A, a new world's fastest supercomputer, clocking an extraordinary 2.507 petaflops (2.507x10^15 floating-point operations per second). According to information from press releases, about 70% of this computer's processing power comes from over 7,000 NVIDIA Tesla GPUs. NVIDIA claims this architecture to be three times as power efficient and twice as space efficient as a CPU-only architecture with the same performance. Numerous computer manufacturers have begun selling individual workstations containing four of the latest generation of Tesla GPU cards (commonly branded as "personal supercomputers"). These tiny clusters can potentially run at around 5 teraflops - equivalent to the speed of the fastest supercomputer in the world as recently as 2000. Reproducing this performance with CPUs would require on the order of 100 of Intel's latest processors - far too many to fit in a single workstation.
The drawback to the nature of GPU computing is that utilizing all of this potential requires extremely well-
vectorized code. This is because GPUs consist of numerous small processing units that perform arithmetic
operations extremely quickly but cannot be effectively controlled independently. What this effectively means to the
programmer is that fast, efficient GPU code needs to contain as few loops and complex logical operations as
possible and perform arithmetic operations on whole arrays of data at once, instead of element-by-element.
Unstructured CFD codes have increasingly been the preference of aerodynamic designers over the last twenty
years due to the much-improved ease of grid creation over older structured methods. Complex 3-D geometries can
be orders of magnitude faster to discretize with unstructured grids. However, unstructured codes are less obviously
suited to the kind of vector processing done by GPUs. This is largely due to the amount of indexing and looping
over unpredictable numbers of elements that is typically involved. The current work is an effort to address this
problem by writing an unstructured code in a vector-efficient manner.
II. Vectorization and GPU Programming
GPUs are fundamentally vector processors. Modern GPUs consist of potentially hundreds of small, cheap
"stream processors." These were originally implemented in order to allow GPUs to process many independent
graphical elements in parallel as a way to speed up generation of complex graphical effects. Programming APIs like
NVIDIA's CUDA allow direct access to this processing power from within more general C programs in order to
very quickly perform identical operations on elements of huge arrays.
The fundamental principle of writing vectorized code is to minimize control flow operations wherever possible
so that every element of a large array is performing the same action. Control flow operations include anything
which requires the computer to make a decision about whether to jump to a different line of code or not. This
obviously includes looping commands ("for," "while," or "do" loops) but can also extend to many types of "if-then"
statements.
A. Use of Matlab as a Development Environment
Many CFD programmers (and indeed many C and Fortran programmers in general) would deride Matlab as
simply too slow a programming language to be useful for writing a CFD solver. This is not a wholly false statement.
Matlab is an interpreted (as opposed to compiled) language, essentially meaning that source code is converted to
machine code at execution time rather than beforehand. This does inherently slow the computation somewhat,
especially with complex control statements. However, Matlab does an extremely good job of using precompiled,
optimized libraries to do individual vectorized operations. This means that if programs are written in a completely

vectorized fashion, they can come close to the maximum level of performance obtainable with a lower-level
programming language. The control flow around a mathematical operation may be slower than in another language,
but the mathematical operation itself should be very nearly as fast as possible in any other setting. Matlab also has
advantages in terms of ease of programming that generally result from its status as a higher-level programming
language than a language like C or Fortran.
All of this combines to make Matlab the ideal development environment for vectorized algorithms. The
language treats all variables as arrays and does virtually all calculations as vector operations when instructed
properly. Even the fact that iterative scalar algorithms are slow in Matlab can be seen as an advantage. This means
that Matlab code optimized for runtime will be close to optimized for runtime in vectorized hardware setting such as
a GPU.
B. GPU Programming through Matlab
Other researchers in scientific computing have noticed the potential parallels between the vectorization of
Matlab and that of a GPU programming language. To this end, several different groups have been working on
Matlab interfaces to CUDA, NVIDIA's GPU programming language. These include the Parallel Computing Toolbox from Mathworks itself¹, Jacket by Accelereyes², a third-party company, and GPUmat³, an open-source project. In all
of these implementations, Matlab's object-oriented framework is used to create new data types that define "GPU
arrays." These arrays are stored on the GPU and feature overloaded mathematical functions so that standard
operations performed on these arrays are performed on the GPU. This makes the differences between programming
in standard Matlab and CUDA-accelerated Matlab nearly invisible in many cases. Well-vectorized Matlab code can
be translated to code running on the GPU in a small fraction of the amount of time it takes to alter C code to make
use of the CUDA libraries. (For reference, the procedure of translating existing Fortran code without creation of
some very complex libraries would really have to start with translating the code to C as CUDA is really only
designed to interface with a C environment).
The true beauty of this environment within Matlab is that it makes nearly all of the management of the GPU
itself and the transfer of data between GPU and system memory nearly transparent to the programmer. There are a
few programming oddities that work slightly differently on the GPU, but making a "rough draft" program that
utilizes the GPU takes almost no additional effort and even less knowledge about how the GPU works, provided that
the original Matlab code was well vectorized. Subtleties such as the few operations that would actually run faster on
a CPU take a bit longer to learn and optimize, but these are generally more minor performance improvements.
All of the GPU code that was developed for the project was developed in Jacket. Accelereyes' product is
considerably ahead of the other related projects in the amount of Matlab functionality that it supports. (Additionally,
Accelereyes offers a "student license" option that made obtaining a valid license for the product much more
affordable for a graduate student researcher.)
C. Vectorized Arithmetic and Functions
The key concept in a vectorized program is that there should be as few loops (for/do/while/etc) as possible. In a
program like a CFD solver, this ideally means that the only loop in the code should be the steps of the time
integration scheme. Some integration schemes, such as a Runge-Kutta scheme, will require addition of an extra loop
or two at this sort of large level to deal with subiterations. A key feature then is that all mathematical operations
need to operate on entire arrays at once without the need for looping.
In general terms, vectorized operators take in arrays of arbitrary size, perform some operation on each element, and then return an array of the same size. The simplest examples of this are single-input functions (e.g. sine, cosine, or the exponential function). These functions take in a single array and return an array of the same size. The other common example is an operator that takes in two or more arrays of identical size and uses the elements of these arrays in a one-to-one correspondence to produce an output array of the same size. This is the case with the arithmetic operators (+, -, *, /).
D. Vectorized Indexing and Logical Vectors
An important feature employed for the vectorization of a whole code such as a CFD solver is the idea of
vectorized indexing. This is the ability to take an array, index it by an integer array of potentially wildly different
size, and produce an output array of the size of the index array. For example, imagine a 10x1 vector Y. A second
vector Z with only the even-index terms can be obtained easily with the vector indexing expression Y([2 4 6 8 10]^T).
This type of indexing is the essence of how the complex connectivity of an unstructured grid can be dealt with in a
vectorized fashion.

This idea of vectorized indexing is familiar in the Matlab environment but may seem foreign to most C or
Fortran programmers. However, it is especially important in a GPU programming environment. With the way
CUDA is implemented on a GPU, many memory accesses can actually be "coalesced" into a much smaller number
of instructions to the device's memory controller. In order for this to occur, however, all of these memory access
commands must be issued at the same time. The benefits of coalescing memory accesses vary heavily with the
pattern involved, but it is an active area where NVIDIA is focusing improvements in the CUDA architecture.
A common way to prevent the use of "if-then" control statements is by vectorized indexing with logical vectors.
An array of "true" and "false" values is created with a vectorized Boolean expression. This array is then used to
index another vector (of the same size) to return only the values which correspond to "true" indices. For example,
rather than looping through all of the faces of an unstructured meshes looking for which ones have a far-field
boundary condition, a programmer can use a vectorized command to create a logical vector which can then be used
to index the overall vector of faces to produce the vector of just far-field faces.
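As a concrete illustration, the following Matlab sketch selects the far-field faces with a logical vector rather than a loop. The variable names (faceBC, FARFIELD, and the face-centered arrays n and S) are hypothetical placeholders, not identifiers from the actual solver.

    % faceBC  : nFaces-by-1 integer array of boundary-condition codes (0 = interior face)
    % FARFIELD: the (assumed) integer code used to tag far-field faces
    isFar    = (faceBC == FARFIELD);    % logical vector, one entry per face
    farFaces = find(isFar);             % integer indices of the far-field faces

    % Any face-centered array can now be gathered without a loop, for example:
    farNormals = n(farFaces, :);        % n is the nFaces-by-3 array of unit normals
    farAreas   = S(farFaces);           % S is the nFaces-by-1 array of face areas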
E. Less-Vectorized Functions
There are a few additional operations which can be performed in a more vector-friendly format even if they are
not precisely fully vectorized.
One example of this is arithmetic operations on two arrays of different sizes. This can be allowed if one of the
arrays has a size of one in all of the dimensions in which the two arrays differ in size. The smaller matrix in this size
is essentially "duplicated" along that dimension to perform the operation. The benefit over actual duplication is that
additional memory is not needed to store redundant information. For example, imagine the computation of face area
vectors from unit normal vectors and scalar face areas:

\vec{S}_j = S_j \, \mathbf{n}_j    (1)

The inputs S_j and n_j are arrays with the same number of rows (the total number of faces), but S_j has only one column, whereas n_j has three columns. Applying the distributive property of multiplication is perfectly natural here to produce the three-column output array of face area vectors. (This is accomplished within Matlab via the built-in "bsxfun" function).
Another example of vector-friendly functions is the class of operations which shrink one dimension of an array to unity. The most common examples of this are "sum" and "product" functions, but other instances are possible. In programming syntax, these functions all take in a dimension on which to operate in addition to their array input. For example, imagine the computation of the normal component of a velocity vector:

V_{\perp,j} = u n_x + v n_y + w n_z    (2)

If the velocity and normal vectors were stored as the n-by-3 arrays V_j and n_j, this computation could be done in Matlab with the command "sum(Vj .* nj, 2)". An element-by-element multiplication takes place between the two input arrays and then the product is summed along its second dimension, producing an n-by-1 array of normal velocity components. The sum is not a completely vectorized operation, but for large arrays it will run far faster on a vector processor than looping through the array.
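A minimal Matlab sketch of the two patterns just described (the implicit expansion of Eq. (1) and the dimension-collapsing sum of Eq. (2)); the array names Sj, nj, and Vj follow the notation above but are otherwise placeholders:

    % Sj : nFaces-by-1 scalar face areas,  nj : nFaces-by-3 unit normals
    Svec  = bsxfun(@times, Sj, nj);   % Eq. (1): face area vectors, nFaces-by-3

    % Vj : nFaces-by-3 face-centered velocity vectors
    Vperp = sum(Vj .* nj, 2);         % Eq. (2): normal velocity component, nFaces-by-1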
III. Unstructured Grid Geometry
In general, the computation of most grid-based information is not particularly relevant to the topic of fast
vectorized computation because it only has to be evaluated once at the beginning of a solver run. These types of data
include parameters like cell centroids and volumes and face areas and normal vectors. The items that are discussed
here are more unique to this vectorized formulation.
A. Terminology
Throughout this work, a standard set of terms will be used to refer to the geometrical elements of an unstructured
grid, whether it be two- or three-dimensional. This terminology is fairly standard for 3-D grids, but it is important to
clarify here for consistency in the 2-D case.
1. Cells
The term cell is always used to refer to the highest-dimensional geometrical construct in the grid. In 3-D this corresponds to polyhedra whereas in 2-D it corresponds to polygons. Cells are always said to have volume, even

if they are two-dimensional (in which case this "volume" is actually surface area). Throughout this work, cells are
referenced by the subscript index i. The variable Ω_i is used to refer to the volume of the cell i.
2. Faces
The term face is always used to refer to the components which join together to create a cell. In 3-D these are
polygons while in 2-D these are line segments. Faces are always shared by at most two cells (boundary faces are
referenced by only one cell). Faces are always said to have surface area, even if they are one-dimensional (in which
case this "surface area" is actually length). Faces also always have definable normal vectors to describe their
orientation in space. Throughout this work, faces are referenced by the subscript index j. This index varies from 1 to
the total number of faces in the entire grid (i.e. it is not redefined for each cell). The variable S_j is used to refer to the surface area of the face j.
3. Nodes
The term node is always used to refer to an (x, y, z) point in space. A group of nodes (in a particular order)
defines a face or cell. Nodes may be a part of any number of cells or faces. Throughout this work, nodes are
referenced by the index k. This index varies from 1 to the total number of nodes in the entire grid (i.e. it is not
redefined for each cell or face). The variable x_k is used to refer to the coordinates of the node k.
4. Edges
In the event that the line segments of a 3-D mesh must be referenced, they are referred to as edges. Care should
be taken, however, not to confuse them with the faces of a 2-D grid. While the two are geometrically the same, they
function quite differently within a flow solver. Edges would be relevant in a node-centered scheme, but as the
scheme described here is cell-centered, they are not explicitly needed.
B. Connectivity Functions
A series of functions is identified here to define the way that the different elements are constructed from each
other. All of these are shown as functions which take in an integer index and return a vector (potentially of variable
length) of integer indices. In coding practice, however, all of these "functions" are simple table lookups into
matrices of integers. In typical usage, the row number of the matrix is the input ("independent") variable and the
values on that row are the output ("dependent") variables. Appendix A shows a small sample grid and the example
connectivity matrices for that grid.

1. Cell-to-Node Connectivity: ν(i)
The function ν(i) defines the connectivity of cells to nodes. The outputs of ν(i) have variable, but well-defined, length. For example, ν(i) will always return three indices for a triangular cell, four indices for a quadrilateral or tetrahedral cell, etc. The cell indices i are typically arranged such that the length of ν(i) is monotonically increasing as i increases. That is, all triangular cells are referenced before quadrilateral cells in 2-D and all tetrahedral cells are referenced before pyramidal cells in 3-D. In the event of multiple cell geometries with the same number of nodes (such as a triangular prism and a pentagonal pyramid, each with six nodes), the cells with the smallest number of faces are referenced first (e.g. the prism with five faces before the pyramid with six). This allows the connectivity information to be broken up into multiple subfunctions without the need to store additional indexing information. For example, in 2-D:

\nu(i) = \begin{cases} \nu_3(i), & i \le N_T \\ \nu_4(i - N_T), & N_T < i \le N_T + N_Q \end{cases}    (3)

Where N_T and N_Q are the total numbers of triangular and quadrilateral elements, respectively. Thus for this grid topology (consisting of only triangular and quadrilateral cells), ν(i) is stored as two matrices: one N_T-by-3 and one N_Q-by-4.
This connectivity function is typically assumed to be the only one that is known beforehand, as it can be used to completely define a computational mesh from a cloud of nodes.

2. Face-to-Node Connectivity: κ(j)
The function κ(j) defines the connectivity of faces to nodes. In a 2-D grid, κ(j) has length 2 for all j. On a 3-D grid, there may be more variation, but even on a grid defined with tetrahedra, quadrilateral pyramids, triangular prisms, and hexahedra, κ(j) will only consist of two subfunctions: κ₃(j) and κ₄(j).

κ(j) is essentially created by the reorganization and sorting of ν(i). A large matrix is created (for each subfunction of κ(j)) by concatenating each face implicitly defined by ν(i), such that the rows of this matrix are the indices of the nodes of each face. Each of these rows is sorted in an order-preserving fashion (to avoid altering the topology of polygonal faces and turning a convex quadrilateral into a self-intersecting one) such that the faces that are referenced multiple times are guaranteed to appear identically each time. The rows of the matrix are then sorted (with a typical priority of first column, then second column, etc.). This places identical rows adjacent to one another so the duplicates can be removed.

3. Cell-to-Face Connectivity: φ(i) and φ⁻¹(j)
The function φ(i) defines the connectivity of cells to faces. The lengths of the outputs of φ(i) depend on the topology of the cells defined by ν(i). If the cell indices are properly ordered, this will also be a monotonically-increasing function that can also be deterministically broken up into subfunctions, thus eliminating the need for an additional mapping function. Note that the inverse function φ⁻¹(j) always has either one (for boundary faces) or two (for internal faces) outputs because a face always separates at most two cells.
Computing φ(i) is a somewhat complex process, but it is not overly time-consuming and need only be done once for each grid. First, the inverse function φ⁻¹(j) is created during the creation of κ(j). When the rows of κ(j) are sorted (before duplicate faces are removed), the change in the original indices is tracked. Each pair of identical rows of this intermediate κ(j) matrix came from the two cells that share the face j. Because not much memory is wasted by doing so, boundary faces are typically handled by setting the second entry equal to some non-indicial value such as 0 or "NaN" to preserve the rectangular shape of the array. The inversion of φ⁻¹(j) to produce φ(i) is performed by a sorting algorithm similar to the formation of κ(j). Each value of φ⁻¹(j) is the row number of a row of φ(i) containing the corresponding value of j. Optionally, φ(i) can be stored in "signed" form, where the index has a negative value if it came from the second column of φ⁻¹(j). This means that in addition to knowing which faces are connected to a cell, it is known whether the cell was the "right-hand" or "left-hand" cell adjacent to that face.
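A sketch of how this face extraction might look in Matlab for a purely triangular 2-D grid is shown below; cellNodes stands in for the ν₃(i) table, and the variable names are illustrative rather than identifiers from the actual solver.

    % cellNodes : nCells-by-3 node indices of each triangular cell (the nu_3 table)
    nCells   = size(cellNodes, 1);
    rawFaces = [cellNodes(:,[1 2]); cellNodes(:,[2 3]); cellNodes(:,[3 1])];
    rawCells = repmat((1:nCells)', 3, 1);          % owning cell of each raw face

    % For two-node faces a plain row sort is order-preserving enough; duplicate faces
    % then appear as identical rows and can be collapsed with 'unique'.
    [faceNodes, ~, faceOfRaw] = unique(sort(rawFaces, 2), 'rows');   % kappa(j) table

    % Gather the one or two cells referencing each unique face: phi^-1(j).
    nFaces    = size(faceNodes, 1);
    leftCell  = accumarray(faceOfRaw, rawCells, [nFaces 1], @min);
    rightCell = accumarray(faceOfRaw, rawCells, [nFaces 1], @max);
    rightCell(leftCell == rightCell) = NaN;        % boundary faces have no second cell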
C. Normal Vectors and Facing
Numerous equations in an unstructured formulation require the use of a face normal vector. Standard practice defines these normal vectors as pointing out of the cell. However, it is desired to decouple computations on faces from the choice of which adjacent cell is the "current" cell. This allows all face-based computations to be done once, with the resulting value then applied to both adjacent cells as appropriate. For this reason, a function δ(j) is defined having the same shape as φ⁻¹(j). δ(j) has a value of 1 when n_j points out of the corresponding cell and a value of -1 when n_j points into the corresponding cell. Thus, the product δ(j)n_j always points out of the cell. δ(j) should return a value of 0 or NaN in the same places as φ⁻¹(j) does.
One method of computing the function δ(j) is by taking the sign of the dot product of the normal vector and the vector between the cell and face centroids. (Note that this formulation is only strictly valid for convex cells):

\delta(j) = \frac{\left( \mathbf{c}_j - \mathbf{c}_{\varphi^{-1}(j)} \right) \cdot \mathbf{n}_j}{\left| \left( \mathbf{c}_j - \mathbf{c}_{\varphi^{-1}(j)} \right) \cdot \mathbf{n}_j \right|}    (4)

This function can also be reshaped, following the inversion of φ⁻¹(j) to φ(i), to create a cell-indexed version. This function returns values of 1 and -1 recording the facing of the normal vectors of all of the faces of cell i.
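Under the same assumptions (convex cells, with cell and face centroids stored as n-by-3 arrays), a vectorized evaluation of Eq. (4) might look like the following sketch; cFace, cCell, cellsOfFace, and n are illustrative names:

    % cFace : nFaces-by-3 face centroids,  cCell : nCells-by-3 cell centroids
    % cellsOfFace : nFaces-by-2 cell indices of each face (phi^-1), NaN for boundaries
    % n : nFaces-by-3 unit normal vectors
    delta_j = nan(size(cellsOfFace));              % facing function delta(j)
    for col = 1:2                                  % small fixed loop over the two columns
        valid = ~isnan(cellsOfFace(:,col));
        d = cFace(valid,:) - cCell(cellsOfFace(valid,col), :);
        delta_j(valid,col) = sign(sum(d .* n(valid,:), 2));
    end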
IV. Governing Equations
The following section describes the cell-centered scheme evaluated in this paper.
A. Finite-Volume Euler Equations
As with most unstructured codes, a finite-volume representation of the Euler equations is used. The general
formulation and form of the convective flux vector here is similar to that used by Frink⁴. These are given below in vector integral form over a volume Ω with surface ∂Ω:
\frac{\partial}{\partial t} \int_{\Omega} \mathbf{U} \, d\Omega + \oint_{\partial\Omega} \left[ \mathbf{F}_c(\mathbf{U}) - \mathbf{F}_v(\mathbf{U}) \right] dS = 0    (5)

Where the state vector is given as:

\mathbf{U} = \begin{bmatrix} \rho & \rho u & \rho v & \rho w & \rho e_0 \end{bmatrix}^T    (6)
The convective flux term is defined at a face as⁴:

\mathbf{F}_c(\mathbf{U}) \cdot \mathbf{n}_j = \left( \mathbf{V} \cdot \mathbf{n} \right)_j \begin{bmatrix} \rho \\ \rho u \\ \rho v \\ \rho w \\ \rho e_0 + p \end{bmatrix}_j + p_j \begin{bmatrix} 0 \\ n_x \\ n_y \\ n_z \\ 0 \end{bmatrix}_j    (7)
The viscous flux term is defined at a face as⁵:

\mathbf{F}_v(\mathbf{U})_j = \begin{bmatrix} 0 \\ n_x \tau_{xx} + n_y \tau_{xy} + n_z \tau_{xz} \\ n_x \tau_{xy} + n_y \tau_{yy} + n_z \tau_{yz} \\ n_x \tau_{xz} + n_y \tau_{yz} + n_z \tau_{zz} \\ n_x \Theta_x + n_y \Theta_y + n_z \Theta_z \end{bmatrix}    (8)
Where:

\tau_{xx} = \frac{\mu}{Re_L} \left( 2\frac{\partial u}{\partial x} - \frac{2}{3} \nabla \cdot \mathbf{V} \right), \quad \tau_{yy} = \frac{\mu}{Re_L} \left( 2\frac{\partial v}{\partial y} - \frac{2}{3} \nabla \cdot \mathbf{V} \right), \quad \tau_{zz} = \frac{\mu}{Re_L} \left( 2\frac{\partial w}{\partial z} - \frac{2}{3} \nabla \cdot \mathbf{V} \right)

\tau_{xy} = \tau_{yx} = \frac{\mu}{Re_L} \left( \frac{\partial u}{\partial y} + \frac{\partial v}{\partial x} \right), \quad \tau_{xz} = \tau_{zx} = \frac{\mu}{Re_L} \left( \frac{\partial u}{\partial z} + \frac{\partial w}{\partial x} \right), \quad \tau_{yz} = \tau_{zy} = \frac{\mu}{Re_L} \left( \frac{\partial v}{\partial z} + \frac{\partial w}{\partial y} \right)    (9)
The Θ terms are used to represent the work done by the combination of viscous stresses and heat conduction:

\Theta_x = u\tau_{xx} + v\tau_{xy} + w\tau_{xz} + \frac{k}{Re_L}\frac{\partial T}{\partial x}, \quad \Theta_y = u\tau_{xy} + v\tau_{yy} + w\tau_{yz} + \frac{k}{Re_L}\frac{\partial T}{\partial y}, \quad \Theta_z = u\tau_{xz} + v\tau_{yz} + w\tau_{zz} + \frac{k}{Re_L}\frac{\partial T}{\partial z}    (10)

Note that the unexpected factors of 1/Re_L are due to the nondimensionalization strategy, discussed later in this paper.
If an ideal gas is assumed, pressure, stagnation enthalpy, and local speed of sound can be calculated as:

p = (\gamma - 1)\left[ e_0 - \tfrac{1}{2}\rho\left( u^2 + v^2 + w^2 \right) \right]    (11)

h_0 = \frac{\gamma}{\gamma - 1}\frac{p}{\rho} + \tfrac{1}{2}\left( u^2 + v^2 + w^2 \right)    (12)

a = \sqrt{\frac{\gamma p}{\rho}}    (13)
Equation (5) is discretized by assuming a uniform value of U within a cell and uniform values of F over each of
the surfaces of the cell:
\frac{\partial \mathbf{U}_i}{\partial t} = -\frac{1}{\Omega_i} \sum_{j \in \varphi(i)} \left[ \mathbf{F}_{c,j} - \mathbf{F}_{v,j} \right] S_j = -\frac{1}{\Omega_i} \mathbf{R}_i    (14)

Where the indexing function φ(i) returns the face indices corresponding to the cell i. The fluxes across each face should be calculated in a face-by-face fashion rather than a cell-by-cell fashion in order to only compute each flux vector once. The summation of fluxes is generalized as a residual vector R_i for simplicity in later expressions. Note that this expression can be considered fully vectorized if the face flux vectors are reshaped with a vectorized index by φ(i) and then the sum over these faces is performed in a single operation, as sketched below. The presence of cells with differing numbers of faces will generally require a loop over the different topologies.
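A minimal Matlab sketch of this face-to-cell gather for a block of cells that all share one topology; faceFlux, faceArea, cellFaces (the φ(i) table for that topology), and the signed facing array deltaCell are placeholder names consistent with the notation above:

    % faceFlux : nFaces-by-5 array of (F_c - F_v) for every face, computed once per face
    % faceArea : nFaces-by-1 face areas
    % cellFaces: nCellsT-by-nf table phi(i) for one topology (e.g. nf = 3 for triangles)
    % deltaCell: nCellsT-by-nf facing values (+1/-1) so fluxes point out of each cell
    R = zeros(size(cellFaces,1), 5);             % residual vector R_i for this block
    for q = 1:5                                   % loop over the five conserved variables
        Fq = faceFlux(:,q) .* faceArea;           % flux times area, one value per face
        R(:,q) = sum(deltaCell .* Fq(cellFaces), 2);   % gather by phi(i), sum over faces
    end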
B. Convective Flux Discretization - Roe Flux Difference Splitting
The current convective flux discretization method implemented is Roe's approximate Riemann solver⁶. Again, the particulars of the equations used here are generally borrowed from Frink⁴. The convective flux across a cell face
j is expressed as:
\mathbf{F}_{c,j} = \frac{1}{2}\left[ \mathbf{F}(\mathbf{U}_L) + \mathbf{F}(\mathbf{U}_R) - \left| \tilde{\mathbf{A}} \right| \left( \mathbf{U}_R - \mathbf{U}_L \right) \right]_j    (15)

Where U_L and U_R represent the state vector on the "left" and "right" sides of the face j. Typically this corresponds to flux coming from the cells referred to by the first and second columns, respectively, of the connectivity function φ⁻¹(j). The matrix |Ã| is the flux Jacobian matrix computed with the following "Roe-averaged" quantities:

\tilde{\rho} = \sqrt{\rho_L \rho_R}    (16)

\tilde{u} = \frac{u_L + u_R \sqrt{\rho_R/\rho_L}}{1 + \sqrt{\rho_R/\rho_L}}    (17)

\tilde{v} = \frac{v_L + v_R \sqrt{\rho_R/\rho_L}}{1 + \sqrt{\rho_R/\rho_L}}    (18)

\tilde{w} = \frac{w_L + w_R \sqrt{\rho_R/\rho_L}}{1 + \sqrt{\rho_R/\rho_L}}    (19)

\tilde{h}_0 = \frac{h_{0,L} + h_{0,R} \sqrt{\rho_R/\rho_L}}{1 + \sqrt{\rho_R/\rho_L}}    (20)

\tilde{a}^2 = (\gamma - 1)\left[ \tilde{h}_0 - \tfrac{1}{2}\left( \tilde{u}^2 + \tilde{v}^2 + \tilde{w}^2 \right) \right]    (21)
Note that the computation of these various parameters is greatly accelerated by first computing the value of \sqrt{\rho_R/\rho_L}. Through use of diagonalizing matrices and eigenvalues, the "artificial dissipation" term introduced in
Equation 15 can be computed as:
\left| \tilde{\mathbf{A}} \right| \left( \mathbf{U}_R - \mathbf{U}_L \right) = \left| \Delta \mathbf{F}_1 \right| + \left| \Delta \mathbf{F}_4 \right| + \left| \Delta \mathbf{F}_5 \right|    (22)

where:

\left| \Delta \mathbf{F}_1 \right| = \left| \tilde{V}_\perp \right| \left\{ \left( \Delta\rho - \frac{\Delta p}{\tilde{a}^2} \right) \begin{bmatrix} 1 \\ \tilde{u} \\ \tilde{v} \\ \tilde{w} \\ \tfrac{1}{2}\left( \tilde{u}^2 + \tilde{v}^2 + \tilde{w}^2 \right) \end{bmatrix} + \tilde{\rho} \begin{bmatrix} 0 \\ \Delta u - n_x \Delta V_\perp \\ \Delta v - n_y \Delta V_\perp \\ \Delta w - n_z \Delta V_\perp \\ \tilde{u}\Delta u + \tilde{v}\Delta v + \tilde{w}\Delta w - \tilde{V}_\perp \Delta V_\perp \end{bmatrix} \right\}    (23)

\left| \Delta \mathbf{F}_{4,5} \right| = \left| \tilde{V}_\perp \pm \tilde{a} \right| \left( \frac{\Delta p \pm \tilde{\rho}\tilde{a}\Delta V_\perp}{2\tilde{a}^2} \right) \begin{bmatrix} 1 \\ \tilde{u} \pm n_x \tilde{a} \\ \tilde{v} \pm n_y \tilde{a} \\ \tilde{w} \pm n_z \tilde{a} \\ \tilde{h}_0 \pm \tilde{V}_\perp \tilde{a} \end{bmatrix}    (24)
Where Ṽ⊥ = ũ n_x + ṽ n_y + w̃ n_z and all Δ-values are differences across the cell face, computed as Δp = δ(j)[p_R - p_L], except ΔV⊥ = n_x Δu + n_y Δv + n_z Δw. The inclusion of the normal vector facing function δ(j) is necessary for consistency. Note that the entirety of this procedure is completely vectorized if values of U_R and U_L are given.
The above is sufficient to describe a first-order-accurate flux treatment where U_R and U_L are made equal to the cell-centered values on either side of the face.
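A hedged Matlab sketch of the Roe averages in Eqs. (16)-(21), operating on whole arrays of left/right face states at once; rhoL, rhoR, uL, uR, h0L, h0R, etc. are assumed to be nFaces-by-1 arrays gathered from the two columns of φ⁻¹(j):

    % Roe-averaged quantities, Eqs. (16)-(21)
    gamma = 1.4;
    rr    = sqrt(rhoR ./ rhoL);            % computed once and reused, as noted above
    rhoT  = sqrt(rhoL .* rhoR);
    uT    = (uL  + uR .* rr) ./ (1 + rr);
    vT    = (vL  + vR .* rr) ./ (1 + rr);
    wT    = (wL  + wR .* rr) ./ (1 + rr);
    h0T   = (h0L + h0R .* rr) ./ (1 + rr);
    aT    = sqrt((gamma - 1) .* (h0T - 0.5*(uT.^2 + vT.^2 + wT.^2)));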
C. Higher-Order State Reconstruction
A higher-order formulation for U
R
and U
L
can be obtained by assuming that the state variables vary linearly over
each cell (that is, their gradient is constant within the cell). This method is essentially a generalization to mixed-
element meshes of the method used by Frink and Pirzadeh⁷ to discretize triangular and tetrahedral elements only. A
piecewise-linear reconstruction method can create what amounts to a first-order Taylor series approximation to find
the values of the state vector at the face:
\mathbf{U}_j = \mathbf{U}_i + \Psi_i \left[ \nabla \mathbf{U}_i \cdot \mathbf{r}_{ij} \right], \quad \forall j \in \varphi(i)    (25)

Where Ψ_i describes some flux-limiter function and r_ij is the vector from the centroid of cell i to the centroid of face j. Note that this is set up here so that two different values will be calculated for each face: one based on the "left-hand" neighbor cell and one based on the "right-hand" neighbor. This is consistent with the formulation of flux-splitting methods that will use these two values to compute a flux vector for the face. The only difficulty lies with computing the value of ∇U. This is done here using the divergence theorem of vector calculus:



\int_{\Omega} \nabla \mathbf{U} \, d\Omega = \oint_{\partial\Omega} \mathbf{U} \, \mathbf{n} \, dS    (26)
If it is assumed that U varies linearly over Ω (i.e. ∇U is constant within the cell) and that the surface of the volume is described by flat faces:

\nabla \mathbf{U}_i \approx \frac{1}{\Omega_i} \sum_{j \in \varphi(i)} \mathbf{U}_j \, \mathbf{n}_j \, S_j    (27)
For line-segment or triangular faces, the face-centered state vectors U_j are calculated as the mean of the nodal state vectors U_k (for higher-order polygonal faces, an inverse-distance-weighting scheme based on the face centroid and the node locations is necessary):

\mathbf{U}_j = \frac{1}{\Gamma_j} \sum_{k \in \kappa(j)} \mathbf{U}_k    (28)
While the nodal values are in turn interpolated from the cell-centered values based on an inverse-distance-weighting scheme:

\mathbf{U}_k = \left( \sum_{i \in \chi(k)} \frac{\mathbf{U}_i}{r_{ik}} \right) \Bigg/ \left( \sum_{i \in \chi(k)} \frac{1}{r_{ik}} \right)    (29)

Where r_ik is the distance between the centroid of cell i and the node k. Notice that these weights can be precomputed using knowledge of the grid alone, making this interpolation process a simple multiply-and-sum. The gradient can then be computed:

\nabla \mathbf{U}_i = \frac{1}{\Omega_i} \sum_{j \in \varphi(i)} \left[ \frac{1}{\Gamma_j} \sum_{k \in \kappa(j)} \mathbf{U}_k \right] \mathbf{n}_j \, S_j    (30)

Notice that the above equation can be factored, essentially switching the order of the two summations. This allows the inner summation to depend only upon grid information and means that each U_k is only referenced once per cell:

\nabla \mathbf{U}_i = \frac{1}{\Omega_i} \sum_{k \in \nu(i)} \mathbf{U}_k \left[ \sum_{j \in \xi(i,k)} \frac{\mathbf{n}_j \, S_j}{\Gamma_j} \right]    (31)

Where the connectivity function ξ(i,k) returns the indices of the faces which are connected to node k and are part of cell i. Notice that the summation inside the brackets is completely independent of solution information and can be computed before the solver starts.
A three-step process develops: interpolation to nodal values with Eq. (29), computation of gradients with Eq. (31), and extrapolation to face values with Eq. (25).

It is also worthwhile to note that this procedure for generating node-centered data and cell-centered gradients is
not specific to state vectors and can be used for any arbitrary piece of information stored at the cell centroids of an
unstructured mesh.
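A hedged Matlab sketch of the nodal interpolation step of Eq. (29) using precomputed inverse-distance weights; nodeCells (a zero-padded χ(k) table), wInv (the 1/r_ik weights, zero in padded positions), and Ucell are assumed names:

    % Ucell    : nCells-by-5 cell-centered state vectors
    % nodeCells: nNodes-by-m table chi(k), zero-padded where a node touches fewer cells
    % wInv     : nNodes-by-m weights 1/r_ik, with zeros in the padded positions
    safeIdx = max(nodeCells, 1);                  % make padded zeros safely indexable
    wSum    = sum(wInv, 2);                       % denominator of Eq. (29)
    Unode   = zeros(size(nodeCells,1), 5);
    for q = 1:5                                   % loop over conserved variables only
        Uq = Ucell(:,q);
        Unode(:,q) = sum(wInv .* Uq(safeIdx), 2) ./ wSum;
    end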
D. Viscous Flux Discretization
Because the viscous flux terms are always elliptic, their evaluation is actually much simpler than that of the
convective fluxes. The state vectors of two adjacent cells can simply be averaged to find their value at the
connecting face. However, the viscous fluxes also require velocity and temperature gradients to be evaluated at the
cell faces. This can be accomplished by the simple averaging of the cell-centered gradients computed as in Eq. (31).
(Note that here the variable U is used to represent any generic flow variable):

\overline{\nabla U}_j = \frac{1}{2} \sum_{i \in \varphi^{-1}(j)} \nabla U_i    (32)
However, this approach can lead to aberrant behavior in some cases, so the following procedure is used, adapted from Blazek⁵. First, the derivative of the variable is calculated in the direction along the vector r_RL from one centroid to the other using a simple first-order finite difference, (∂U/∂l) ≈ (U_R - U_L)/|r_RL|. Then a modified average can be calculated:

\nabla U_j = \overline{\nabla U}_j - \left[ \overline{\nabla U}_j \cdot \frac{\mathbf{r}_{RL}}{\left| \mathbf{r}_{RL} \right|} - \frac{\partial U}{\partial l} \right] \frac{\mathbf{r}_{RL}}{\left| \mathbf{r}_{RL} \right|}    (33)
This process is repeated separately for each of the three velocity components and temperature. Once this
information is known, the viscous fluxes at the face centroids can easily be calculated.
E. Boundary Conditions
In the current implementation, all boundary conditions are enforced by setting the flux into the boundary cell
across the boundary face rather than by forcibly setting the state vector values after an iteration of time integration.
These fluxes are used directly in Eq. (14) without any additional artificial dissipation terms. These fluxes are set by
assuming that the value of the state vector is known accurately at the face (from the boundary conditions) and thus
the Roe FDS process is not involved with these fluxes.

1. Inviscid Solid Surface Boundary Condition
Inviscid surfaces are typified by the flow tangency boundary condition, namely that V·n = 0. Correspondingly, the convective flux across an inviscid surface can be directly obtained from Eq. (7) as:

\mathbf{F}_c(\mathbf{U})_j = \begin{bmatrix} 0 \\ p_w n_x \\ p_w n_y \\ p_w n_z \\ 0 \end{bmatrix}    (34)

Where the wall pressure p_w is obtained from the cell-centered value (note that this can be considered at least first-order accurate because of another solid-wall boundary condition, ∂p/∂n = 0).

2. Viscous No-Slip Wall Condition
The basic viscous wall condition is the "no-slip" condition:

u = v = w = 0    (37)

This actually leaves the boundary condition for the convective flux term unchanged, as it already contained no
velocity term and the pressure boundary condition is unchanged. The viscous stress terms remain unchanged, but
the velocity gradient components must be calculated assuming that the no-slip condition has been applied. This is
most easily performed by simply setting the nodal values of velocity to zero on all no-slip surfaces before the
gradient calculation is performed for the entire grid. Because there is no flow through the wall, the energy flux
vector simplifies to Θ = k∇T. If an adiabatic wall is desired, this temperature gradient can simply be set to zero (note that the actual usage of the Θ vector is its normal component, thus any gradient in the wall-tangent direction is
irrelevant). If an isothermal wall temperature is desired, the temperature at wall nodes should be set before gradient
calculation, just as with the velocity values.

3. Characteristic Far-Field Boundary Condition
This boundary condition for the far-field exterior of a computational grid is based on the concept of
characteristic variables that remain constant along particular characteristic lines. This particular formulation is
based upon that presented by Blazek⁵. "Information" (i.e. flow variables) is propagated into and out of the
computational domain based on the normal Mach number and direction of flow at the face. This condition variable
is computed using an outward-facing normal vector as:
M_{\perp,j} = \delta(j) \left[ u n_x + v n_y + w n_z \right]_j / a_j    (38)
This sets up four different cases where the state vector (or at least variables from which a state vector can be computed) at the face U_j is computed from the freestream state vector U_∞ and the boundary cell state vector U_i (for supersonic inflow, supersonic outflow, subsonic inflow, and subsonic outflow, respectively):

\mathbf{U}_j = \mathbf{U}_\infty \quad \text{when } M_{\perp,j} \le -1    (39)

\mathbf{U}_j = \mathbf{U}_i \quad \text{when } M_{\perp,j} \ge 1    (40)

\begin{bmatrix} p \\ \rho \\ u \\ v \\ w \end{bmatrix}_j = \begin{bmatrix} \tfrac{1}{2}\left[ p_\infty + p_i - \rho_i a_i \, \delta(j) \left( n_x (u_\infty - u_i) + n_y (v_\infty - v_i) + n_z (w_\infty - w_i) \right) \right] \\ \rho_\infty + (p_j - p_\infty)/a_i^2 \\ u_\infty - \delta(j) \, n_x (p_\infty - p_j)/(\rho_i a_i) \\ v_\infty - \delta(j) \, n_y (p_\infty - p_j)/(\rho_i a_i) \\ w_\infty - \delta(j) \, n_z (p_\infty - p_j)/(\rho_i a_i) \end{bmatrix} \quad \text{when } -1 < M_{\perp,j} < 0    (41)

\begin{bmatrix} p \\ \rho \\ u \\ v \\ w \end{bmatrix}_j = \begin{bmatrix} p_\infty \\ \rho_i + (p_j - p_i)/a_i^2 \\ u_i + \delta(j) \, n_x (p_i - p_j)/(\rho_i a_i) \\ v_i + \delta(j) \, n_y (p_i - p_j)/(\rho_i a_i) \\ w_i + \delta(j) \, n_z (p_i - p_j)/(\rho_i a_i) \end{bmatrix} \quad \text{when } 0 \le M_{\perp,j} < 1    (42)
All i-subscripted quantities here refer to the cell-centered value of that quantity in the boundary cell. Once the state vector at the face U_j is computed, the flux vectors F_c,j and F_v,j can be computed using Equations 7 and 8, respectively.
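Tying this back to the logical-vector indexing of Section II.D, the four Mach-number regimes can be handled without per-face branching; a hedged Matlab sketch with placeholder names Mperp, Uface, Uinf, Ucell, and bndCell follows:

    % Mperp : nBnd-by-1 normal Mach numbers of the far-field faces, Eq. (38)
    % Uinf  : 1-by-5 freestream state, Ucell(bndCell,:) : nBnd-by-5 boundary-cell states
    Uface = zeros(numel(Mperp), 5);
    supIn  = (Mperp <= -1);                    % supersonic inflow,  Eq. (39)
    supOut = (Mperp >=  1);                    % supersonic outflow, Eq. (40)
    subIn  = (Mperp > -1) & (Mperp < 0);       % subsonic inflow,    Eq. (41)
    subOut = (Mperp >= 0) & (Mperp < 1);       % subsonic outflow,   Eq. (42)
    Uface(supIn,:)  = repmat(Uinf, nnz(supIn), 1);
    Uface(supOut,:) = Ucell(bndCell(supOut), :);
    % the subIn and subOut rows are then filled with vectorized forms of Eqs. (41)-(42)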
F. Explicit Time Integration Schemes
It was intended that both implicit and explicit time integration schemes would be evaluated during the course of
this project. Though implicit schemes are much more common in modern CFD codes due to their superior
convergence properties, they are not nearly as vectorizable as explicit schemes. The explicit schemes presented
here are fully functional in the code.


1. Explicit Runge-Kutta Method
Explicit time integration is accomplished using a Runge-Kutta scheme on the residual vectors R_i from Eq. (14). This formulation is again borrowed from Frink⁴. Superscripts denote the time level n, the maximum number of Runge-Kutta subiterations or "stages" m, and the current Runge-Kutta subiteration in parentheses:

\mathbf{U}_i^{(0)} = \mathbf{U}_i^n
\mathbf{U}_i^{(1)} = \mathbf{U}_i^{(0)} - \alpha_1 \frac{\Delta t_i}{\Omega_i} \mathbf{R}_i^{(0)}
\quad\vdots
\mathbf{U}_i^{(m)} = \mathbf{U}_i^{(0)} - \alpha_m \frac{\Delta t_i}{\Omega_i} \mathbf{R}_i^{(m-1)}
\mathbf{U}_i^{n+1} = \mathbf{U}_i^{(m)}    (43)

Where

\alpha_k = \frac{1}{m - k + 1}    (44)
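A hedged Matlab sketch of the stage loop in Eq. (43); computeResidual is a placeholder for the residual evaluation of Eq. (14), and U, dt, vol are assumed cell-indexed arrays:

    % U : nCells-by-5 states, vol : nCells-by-1 cell volumes, dt : nCells-by-1 local steps
    m  = 3;                                  % number of Runge-Kutta stages
    U0 = U;                                  % U_i^(0)
    for stage = 1:m
        alpha = 1 / (m - stage + 1);         % Eq. (44)
        R = computeResidual(U);              % placeholder for Eq. (14) residual assembly
        U = U0 - alpha * bsxfun(@times, dt ./ vol, R);   % Eq. (43) stage update
    end                                      % after the loop, U holds U_i^(n+1)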
2. Implicit Residual Smoothing
An implicit residual smoothing method is employed as by Frink⁴ in order to allow for an increase in the
maximum allowable time step for explicit integration. The concept of the method is that residuals are filtered
through a Laplacian operator to smooth out discontinuities and enhance stability:

\bar{\mathbf{R}}_i = \mathbf{R}_i + \varepsilon \nabla^2 \bar{\mathbf{R}}_i    (45)
The Laplacian operator is approximated by a sum of the differences between the residuals of adjacent cells I and
the cell itself i:
\nabla^2 \bar{\mathbf{R}}_i = \sum_{I} \left( \bar{\mathbf{R}}_I - \bar{\mathbf{R}}_i \right)    (46)
These equations are combined and solved through a Jacobi iteration, where m represents the steps of this
iteration:

\bar{\mathbf{R}}_i^{(m)} = \frac{ \mathbf{R}_i + \varepsilon \sum_{I} \bar{\mathbf{R}}_I^{(m-1)} }{ 1 + \varepsilon \sum_{I} 1 }    (47)
The constant ε is set to control the diagonal dominance of the system. Blazek⁵ suggests that values between 0.5
and 0.8 are most useful. Notice that when run in a vectorized manner, this method will require two nested loops: an
outer loop over the different types of cell topologies (e.g. triangles and quadrilaterals) and an inner loop over the
number of neighbors in that particular cell type. This makes it a well-vectorized method that should not interfere
substantially with the fantastic vectorization of the explicit method.
Most sources suggest that two Jacobi iterations are sufficient, but in practice here it has been found that three or
four can be useful in some cases. Compared with the computational expense of calculating gradients, limiter
functions, or even the Roe differencing, additional iterations of the smoothing operation are extremely fast, due to
their good vectorization and low operation count.
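A hedged Matlab sketch of the Jacobi iteration in Eq. (47) for a single cell topology; cellNbrs is an assumed nCellsT-by-nf table of neighbor-cell indices (with a cell's own index substituted where a face lies on a boundary):

    % R : nCellsT-by-5 unsmoothed residuals, cellNbrs : nCellsT-by-nf neighbor indices
    eps_s = 0.5;                                  % smoothing coefficient epsilon
    nf    = size(cellNbrs, 2);
    Rbar  = R;                                    % initial guess, R_bar^(0) = R
    for it = 1:2                                  % two Jacobi iterations are typical
        for q = 1:5
            Rq = Rbar(:,q);
            nbrSum = sum(Rq(cellNbrs), 2);        % sum of neighbor residuals
            Rbar(:,q) = (R(:,q) + eps_s * nbrSum) / (1 + eps_s * nf);
        end
    end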


G. Implicit Time Integration Schemes
Implicit methods were created for this solver, but they appear to have some issues that prevent them from being
fully functional. This is discussed later, but the implementation of the methods is given here regardless. Implicit
methods are formulated by applying a backward-Euler time discretization to Eq. (14):
\frac{\Omega_i}{\Delta t} \Delta \mathbf{U}_i^n = -\beta \mathbf{R}_i^{n+1} - \left( 1 - \beta \right) \mathbf{R}_i^n    (48)
Where ΔU^n is the delta to the state vector of the next iteration:

\mathbf{U}^{n+1} = \mathbf{U}^n + \Delta \mathbf{U}^n    (49)
The residual vector at time step n+1 is estimated using a Taylor series expansion from time step n:

\mathbf{R}_i^{n+1} \approx \mathbf{R}_i^n + \left( \frac{\partial \mathbf{R}_i}{\partial \mathbf{U}_i} \right) \Delta \mathbf{U}_i^n    (50)
Combining Equations (48) and (50) yields the following implicit solution formulation:

\left[ \frac{\Omega_i}{\beta \Delta t} \mathbf{I} + \frac{\partial \mathbf{R}_i}{\partial \mathbf{U}_i} \right] \Delta \mathbf{U}_i^n = \mathbf{J} \, \Delta \mathbf{U}_i^n = -\frac{1}{\beta} \mathbf{R}_i^n    (51)
Where the quantity in brackets on the left is called the implicit operator J. The constant β is set to 1 for typical implicit methods. In general, the "explicit operator" R^n is still computed using a flux-splitting or flux-differencing scheme (such as Roe's), but a simpler scheme is sometimes used to split fluxes within the implicit operator. In general, the derivative in the implicit operator is divided into components for each face of the cell:

\frac{\partial \mathbf{R}_i}{\partial \mathbf{U}} = \sum_{j \in \varphi(i)} \left( \frac{\partial \mathbf{F}_{c,j}}{\partial \mathbf{U}} - \frac{\partial \mathbf{F}_{v,j}}{\partial \mathbf{U}} \right) S_j    (52)
Roe's Flux Difference Splitting can be applied here as well to expand the flux Jacobians. By assuming locally
constant Roe matrices, a reasonably accurate approximation to the product of the flux Jacobian and the state vector
step is obtained:
\frac{\partial \mathbf{R}_c}{\partial \mathbf{U}} \Delta \mathbf{U}^n \approx \frac{1}{2} \sum_{m=1}^{N_F} \left\{ \mathbf{A}_c(\mathbf{U}_{L,m}) \Delta \mathbf{U}_{L,m}^n + \mathbf{A}_c(\mathbf{U}_{R,m}) \Delta \mathbf{U}_{R,m}^n - \left| \mathbf{A}_{Roe,m} \right| \left( \Delta \mathbf{U}_{R,m}^n - \Delta \mathbf{U}_{L,m}^n \right) \right\} S_m    (53)
Where A_L is computed from the left-hand cell, A_R is computed from the right-hand cell, and A_Roe is computed from the Roe-averaged values between the two cells, all at iteration n-1. If it is assumed that the "current" cell is always the left-handed one, the following implicit scheme is obtained:
\left[ \frac{\Omega_i}{\Delta t} \mathbf{I} + \frac{1}{2} \sum_{j \in \varphi(i)} \left( \mathbf{A}_{c,L} + \left| \mathbf{A}_{Roe} \right| \right)_j S_j \right] \Delta \mathbf{U}_i^n + \frac{1}{2} \sum_{j \in \varphi(i)} \left( \mathbf{A}_{c,R} - \left| \mathbf{A}_{Roe} \right| \right)_j S_j \, \Delta \mathbf{U}_{R,j}^n = -\mathbf{R}_i^n    (54)
Notice that the coefficient of the first term (in brackets) will form the diagonal of the implicit matrix equation.
The individual components of the second term (without the state vector) are the off-diagonal terms. A sparse linear
system is created:


\mathbf{M} \, \Delta \mathbf{U}^n = -\mathbf{R}^n    (55)
Note that the term involving Δt is added to the diagonal of the implicit matrix. This serves to stabilize the system and decrease the condition number of the matrix. A small time step significantly increases the diagonal dominance of the matrix. This time step is still calculated via a CFL number as it is in the explicit case, but the limit on the maximum useful CFL number is set more by how ill-conditioned a matrix (how large a condition number) is acceptable than by the inherent stability of the numerical method, as is the case with the explicit solution.

1. Flux Jacobian Matrices
Direct computation of the flux Jacobian is sometimes (but not always) necessary in evaluating the derivative of
the residual with respect to the state vector. For these instances, the formulation of the convective flux Jacobian is
given here:

\mathbf{A}_c = \frac{\partial \left( \mathbf{F}_c \cdot \mathbf{n} \right)}{\partial \mathbf{U}} = \begin{bmatrix} 0 & n_x & n_y & n_z & 0 \\ n_x \phi - u V_\perp & V_\perp - a_3 n_x u & n_y u - a_2 n_x v & n_z u - a_2 n_x w & a_2 n_x \\ n_y \phi - v V_\perp & n_x v - a_2 n_y u & V_\perp - a_3 n_y v & n_z v - a_2 n_y w & a_2 n_y \\ n_z \phi - w V_\perp & n_x w - a_2 n_z u & n_y w - a_2 n_z v & V_\perp - a_3 n_z w & a_2 n_z \\ V_\perp \left( \phi - \tilde{a}_1 \right) & n_x \tilde{a}_1 - a_2 u V_\perp & n_y \tilde{a}_1 - a_2 v V_\perp & n_z \tilde{a}_1 - a_2 w V_\perp & \gamma V_\perp \end{bmatrix}    (56)

Where:

V_\perp = n_x u + n_y v + n_z w, \quad \phi = \tfrac{1}{2}\left( \gamma - 1 \right)\left( u^2 + v^2 + w^2 \right), \quad \tilde{a}_1 = \gamma \frac{e_0}{\rho} - \phi, \quad a_2 = \gamma - 1, \quad a_3 = \gamma - 2    (57)
Note that if all of the terms seen above are expanded, the flux Jacobian is a function only of the state vector and the normal vector, and it is an odd function with respect to the normal vector. That is, A_c(U,-n) = -A_c(U,n). This means that, just like the flux vectors, the flux Jacobian need only be computed once for each face, provided that it is based on symmetric state vector information. So, A_Roe only has to be computed once for each face. A_L and A_R must be calculated separately, but for the adjacent cell, they can be swapped and their signs changed.

2. Implicit Boundary Conditions
Boundary conditions are applied in the implicit equation solely in the diagonal terms of the implicit matrix. This
is possible because the boundary conditions here only depend on the cell at the boundary (and potentially some
constant information such as freestream conditions). This is done with an additional flux Jacobian matrix that is
inserted into the equation as another A_L that has no corresponding A_R or A_Roe.
For an inviscid wall, a special flux Jacobian matrix can be used that is derived from the inviscid wall flux (Eq.
34):
\mathbf{A}_{c,wall} = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 \\ \phi n_x & -a_2 u n_x & -a_2 v n_x & -a_2 w n_x & a_2 n_x \\ \phi n_y & -a_2 u n_y & -a_2 v n_y & -a_2 w n_y & a_2 n_y \\ \phi n_z & -a_2 u n_z & -a_2 v n_z & -a_2 w n_z & a_2 n_z \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix}    (58)
The constants here have the same values as described in Eq. 57.
For far-field boundaries, the standard form of the flux Jacobian from Eq. 56 is used, but with the boundary values plugged in instead. For supersonic inflow, the flux Jacobian matrix can be substituted with the null matrix, as there is no dependence on the internal cell state vector. For supersonic outflow, the Jacobian matrix from inside the cell can be used (A_L as usual). For subsonic inflow and outflow, Equations 41 and 42 are used to calculate the values of the primitive state vector on the face and then Eq. 56 is used to compute the flux Jacobian.

3. Gauss-Seidel Scheme
The Gauss-Seidel scheme is an iterative method to solve matrix equations. The scheme factors the implicit
operator into three parts: a strictly upper-triangular portion U, a strictly lower-triangular portion L, and a diagonal
portion D:
\mathbf{M} \, \Delta \mathbf{U}_i^n = \left( \mathbf{D} + \mathbf{L} + \mathbf{U} \right) \Delta \mathbf{U}_i^n = -\mathbf{R}_i^n    (59)

This equation can be solved with an iterative step:

\left( \mathbf{D} + \mathbf{L} \right) \Delta \mathbf{U}_i^{(m)} = -\mathbf{R}_i^n - \mathbf{U} \, \Delta \mathbf{U}_i^{(m-1)}    (60)
This method is designed to run for several subiterations during each iteration of the flow solver. In the method used presently, the first value of ΔU^(0) is assumed to be all zeros. In subsequent subiterations, the value found in the previous iteration is used. After a specified number of subiterations, the value of ΔU^(m) is used as ΔU^n to move onto the next solver iteration.
These equations are solved directly with sparse matrix methods. In many CFD applications, only one step is
taken in iterative methods before reevaluating the implicit matrix. It has been found, however, that taking a small
number of steps (two or three) can increase the stability of the method. This is generally not too crippling to the
runtime of a complete iteration, as the most time-consuming part of the operation is the assembly of the implicit
matrix. Solving these sparse triangular systems is actually a very fast operation.

4. Successive Over-Relaxation (SOR) Scheme
Successive Over-Relaxation (SOR) is an alternative iterative method to solve linear systems. Like the Gauss-
Seidel method, the matrix is factored into its U, L, and D components. In this method, however, a relaxation factor ω appears in the formulation of the method:
\left( \mathbf{D} + \omega \mathbf{L} \right) \Delta \mathbf{U}_i^{(m)} = \omega \left( -\mathbf{R}_i^n - \mathbf{U} \, \Delta \mathbf{U}_i^{(m-1)} \right) + \left( 1 - \omega \right) \mathbf{D} \, \Delta \mathbf{U}_i^{(m-1)}    (61)
Other than the slight alterations to the operative equation and the addition of the parameter ω, the method operates the same as the Gauss-Seidel.
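A hedged Matlab sketch of the Gauss-Seidel subiterations of Eq. (60) using sparse triangular factors of an already-assembled implicit matrix M; M, R, and nSweeps are placeholder names:

    % M : sparse (5*nCells)-by-(5*nCells) implicit matrix, R : (5*nCells)-by-1 residual
    D  = diag(diag(M));                  % diagonal part (stays sparse for sparse M)
    L  = tril(M, -1);                    % strictly lower triangle
    Up = triu(M,  1);                    % strictly upper triangle
    dU = zeros(size(R));                 % Delta U^(0) = 0
    for m = 1:nSweeps
        dU = (D + L) \ (-R - Up * dU);   % Eq. (60); sparse triangular solve is fast
    end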
H. Time Step Calculation
Local time-stepping is employed in the implicit and explicit schemes to increase convergence speed. Time steps
for each cell are calculated from equations given by Blazek⁵:

\Delta t_i = \sigma \frac{\Omega_i}{\Lambda_i^x + \Lambda_i^y + \Lambda_i^z}    (62)

Where σ is the CFL number and the Λ variables represent the spectral radii of the flux Jacobian in the x, y, and z directions:

\Lambda_i^x = \left( \left| u_i \right| + a_i \right) S_i^x, \quad \Lambda_i^y = \left( \left| v_i \right| + a_i \right) S_i^y, \quad \Lambda_i^z = \left( \left| w_i \right| + a_i \right) S_i^z    (63)

And the S variables represent the projected areas of the cell in the y-z, x-z, and x-y planes:

S_i^x = \frac{1}{2} \sum_{j \in \varphi(i)} \left| n_{x,j} \right| S_j, \quad S_i^y = \frac{1}{2} \sum_{j \in \varphi(i)} \left| n_{y,j} \right| S_j, \quad S_i^z = \frac{1}{2} \sum_{j \in \varphi(i)} \left| n_{z,j} \right| S_j    (64)

The maximum allowable CFL number for convergence depends heavily on which of the various methods
presented here are in use and upon several other options, such as the number of Runge-Kutta subiterations or
implicit residual smoothing iterations.
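A hedged Matlab sketch of the local time step of Eqs. (62)-(64) for one cell topology; cellFaces, n, S, u, v, w, a, vol, and cfl are placeholder arrays (the last six cell-centered):

    % cellFaces: nCellsT-by-nf table phi(i), n : nFaces-by-3 normals, S : nFaces-by-1 areas
    nxS = abs(n(:,1)) .* S;  nyS = abs(n(:,2)) .* S;  nzS = abs(n(:,3)) .* S;
    Sx  = 0.5 * sum(nxS(cellFaces), 2);          % Eq. (64), projected areas per cell
    Sy  = 0.5 * sum(nyS(cellFaces), 2);
    Sz  = 0.5 * sum(nzS(cellFaces), 2);
    Lx  = (abs(u) + a) .* Sx;                    % Eq. (63), spectral radii
    Ly  = (abs(v) + a) .* Sy;
    Lz  = (abs(w) + a) .* Sz;
    dt  = cfl * vol ./ (Lx + Ly + Lz);           % Eq. (62), local time step per cell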
I. Nondimensionalization of Variables
All of the flow variables are nondimensionalized in an effort to aid computational stability. This
nondimensionalization strategy is based upon that described by Tannehill et al.⁸ All variables seen elsewhere in this
document are replaced by nondimensional variables, denoted here (but nowhere else) with an asterisk:


x^* = \frac{x}{L}, \quad y^* = \frac{y}{L}, \quad z^* = \frac{z}{L}, \quad t^* = \frac{t V_\infty}{L}, \quad u^* = \frac{u}{V_\infty}, \quad v^* = \frac{v}{V_\infty}, \quad w^* = \frac{w}{V_\infty},
\rho^* = \frac{\rho}{\rho_\infty}, \quad p^* = \frac{p}{\rho_\infty V_\infty^2}, \quad T^* = \frac{T}{T_\infty}, \quad \mu^* = \frac{\mu}{\mu_\infty}    (65)
In the event that there is a zero freestream velocity, the freestream speed of sound may be used in its place. Note
that this nondimensionalization causes an interesting redefinition of a few common constants:

R^* = \frac{1}{\gamma M_\infty^2}, \quad T_0^* = \frac{T_0}{T_\infty}, \quad C^* = \frac{C}{T_\infty}, \quad \mu_0^* = \frac{\mu_0}{\mu_\infty}    (66)
Where R is the gas constant and C, T₀, and μ₀ are the coefficients in Sutherland's formula. With these constants
redefined, the ideal gas law and Sutherland's formula will continue to behave as expected to compute temperature
and dynamic viscosity:

p^* = \rho^* R^* T^*    (67)

\mu^* = \mu_0^* \left( \frac{T_0^* + C^*}{T^* + C^*} \right) \left( \frac{T^*}{T_0^*} \right)^{3/2}    (68)
The only other consequence of the nondimensionalization within the code is the factor of 1/Re that shows up in
the viscous flux terms.
V. Success of Solver and Quality of Results
A. Explicit Euler Results
When run in explicit Euler mode, the solver created here produces reasonably accurate results for subsonic and transonic cases. Figures 1 and 2 present pressure coefficient distributions for a NACA 0012 airfoil with several different flux limiting options alongside wind tunnel data⁹. All of these cases were run with a CFL number of 3.5 for 5000 iterations, with each iteration containing three Runge-Kutta stages. Three iterations were used in the implicit residual smoothing method with a coefficient of 0.5. Figure 1 presents a case at Mach 0.3 and 4.04° angle-of-attack. Figure 2 presents a case at Mach 0.703 and 4.03° angle-of-attack. The different flux limiter options shown include: fully 1st-order (Ψ_i = 0), fully 2nd-order (Ψ_i = 1), Venkat's limiter, and Barth and Jespersen's limiter.

The lower Mach number case seems fairly reasonable everywhere. The primary difference between the 1st- and 2nd-order cases is in the value of C_p at the suction peak. The 1st-order method over-predicts slightly (a "peakier" distribution) while the 2nd-order method under-predicts slightly (a "rounded" distribution). Venkat's limiter does a very nice job of splitting the difference and coming close to the right suction peak pressure (it over-predicts the value by approximately 3%). The Barth and Jespersen limiter does not substantially improve upon the 2nd-order case.
The transonic case is a slightly different story. The
1st- and 2nd-order cases differ drastically in their
placement of the shock and the sharpness of the shock.
The 1st-order case produces a very crisp, clean shock,
but it is misplaced by 5-7% chord. The 2nd-order case
"smears out" the shock to a considerable extent, but it
does seem to begin in approximately the right
location. These behaviors are fairly typical of
transonic Euler CFD solutions. Unfortunately, the
flux limiters do not seem to appreciably help the
problem. They are apparently both very "aggressive"
(i.e. they tend more towards 2nd-order than 1st-order).
Venkat's limiter does result in a strengthening of the
shock in comparison to the 2nd-order case, but it still
under-predicts the strength. Looking at the 1st- and
2nd-order cases, it would appear that there should be
some compromise between them that would be a very
good solution, but neither limiter finds it. It is
worthwhile to note that the difficulty in placing the
shock properly is likely largely grid-related, as no
adaptation is performed to capture the shock location.
Figures 3 and 4 show colored Mach number
contours around two of these cases. Both are fully 1st-order solutions of the NACA0012 at the two flight conditions
described above.



Figure 1 - C_p Distribution for NACA0012 at Mach 0.3.
Figure 2 - C_p Distribution for NACA0012 at Mach 0.7.
Figure 3 - Mach Contours - M∞ = 0.7.
Figure 4 - Mach Contours - M∞ = 0.3.

B. More Complex Modes of Operation
The viscous and implicit modes of the solver unfortunately still contain enough errors or bugs that they do not
produce good results. The viscous mode seems to run well and produce reasonably correct-looking results, but at
some point along its operation, it has a tendency towards abrupt and massive divergence. Strange oscillations
appear in the temperature boundary layer just before these divergences, but it is currently unknown whether these
are the cause or the result of the divergence.
The implicit mode of the solver has its own odd set of issues. The implicit matrix created is much too close to singular for accurate solution of the system, especially at high CFL numbers. At a CFL number of 1, Matlab estimates the condition number of the matrix to be in the vicinity of 10^6-10^7. This is at the upper limit of how ill-conditioned a matrix Matlab can solve directly, and the results of the solution should be regarded as highly questionable. The addition of a simple preconditioner (such as a Jacobi preconditioner) or the use of an iterative method (such as a Gauss-Seidel or Successive Over-Relaxation method) shrinks the condition number of the implicit matrix somewhat and allows use of CFL numbers of around 2, but nowhere near the enormous CFL numbers typical of successful implicit methods.
It is unknown what is causing these condition numbers to be so large, but it does not seem to be affecting the
solution quality overly much. The solutions produced by the implicit and explicit modes of the Euler solver seem to
be very comparable. However, since the explicit solver is actually more stable and requires slightly less
computation time per iteration, there is currently no real point in utilizing the implicit solver.
While the viscous and implicit modes are not as functional as they ought to be, they still do a very good job of
replicating the computations necessary to run these types of methods. Consequently, these modes of the solver can
still be used for evaluation of runtimes and comparison between CPU and GPU computations. They will necessitate
using small CFL numbers and relatively few iterations, but while 500 iterations at a CFL number of 0.5 may not
produce a converged solution, the runtime of this solution should be perfectly well representative of the time-per-
iteration required for a more converged solution.
VI. Timing Results
A. Hardware Used
The following hardware was used in all tests shown here. The release dates and prices of hardware are given in
order to inform comparisons.
Intel Core i7 920 Processor: Released Nov. 2008 at $285
6 GB of DDR3 RAM: Purchased Mar. 2009 at $100
NVIDIA GeForce GTX480: Released Mar. 2010 at $500 (1536 MB of on-board memory)
While the graphics card used was considerably more expensive than the CPU, at least some of the cost of RAM
should be added to the cost of the CPU. Building a large computing cluster with many CPUs will require the
purchase of a large amount of RAM, whereas the graphics card includes its own RAM. The GPU is also a newer architecture than the CPU. This temporal jump was deliberately allowed because of the large increase in scientific computing efficiency from the 2009-vintage GeForce 200 series to the 2010-vintage GeForce 400 series. (Prior to the 400 series, the cards were not specifically designed to perform double-precision calculations and thus their performance when doing so was substandard.) The upgrade in CPU performance over the same year was more modest.
Overall, it is probably fair based simply upon the
release date and price differences to expect the GPU to
be perhaps twice as fast as the CPU. This means that
for a difference in runtime to be significant in terms of
performance per price, the speedup should be
somewhat greater than 2x.
B. Explicit CPU vs. GPU Timing Results
Figure 5 shows the runtimes (in seconds per
iteration) of solutions of a NACA0012 airfoil using
both the CPU and GPU methods. In all cases, the
solver was running fully 2nd-order fluxes with no
limiter, only a single Runge-Kutta subiteration, and two implicit residual smoothing steps. These timing results are
plotted versus the number of cells in the grid, ranging from 20,000 to 600,000 cells. Figure 6 shows the same data,
but interpreted as a "speedup" multiplier (the CPU time per iteration divided by the GPU time per iteration). The
GPU execution ranges from slightly slower than CPU execution at the minimum grid size to approximately six
times faster at the maximum grid size.
Figure 5 - CPU and GPU Runtimes vs. Cell Count.
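For reference, a minimal, hypothetical timing harness of the sort used to produce these numbers is sketched below. The runs reported in this paper were made through Jacket, but an equivalent measurement with the Parallel Computing Toolbox's gpuArray would look similar; iterateOnce is a stand-in for a single explicit iteration of the solver and is assumed to operate on either CPU or GPU arrays.

    % Hypothetical timing harness (not the exact code behind Figures 5 and 6).
    nIter = 100;

    % --- CPU timing ---
    tic;
    for n = 1:nIter
        U = iterateOnce(U);          % one explicit iteration on the CPU
    end
    tCPU = toc / nIter;

    % --- GPU timing (Parallel Computing Toolbox shown for illustration;
    %     the runs reported here used Jacket instead) ---
    Ug = gpuArray(U);
    tic;
    for n = 1:nIter
        Ug = iterateOnce(Ug);        % the same iteration on GPU arrays
    end
    wait(gpuDevice);                 % ensure all queued GPU work has finished
    tGPU = toc / nIter;

    fprintf('Speedup (CPU / GPU): %.2f\n', tCPU / tGPU);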
Clearly the trends indicate that the GPU computation becomes more and more advantageous as the grid size
increases. There are a few explanations for this phenomenon. At very low grid sizes, some of the operations
performed by the solver will not even involve all of the
480 processor cores on the card. This effect is probably
minimal in the timing runs shown here, all of which
have many more cells than that. More likely, the dominant factor is the ratio of time spent interpreting instructions
to time spent executing them. The GPU is built to run repetitive, predictable operations on massive data sets, so the
more massive the data set, the better it outperforms the CPU. On smaller data sets, the commands change too
quickly for this strength to pay off. This causes the speedup numbers to follow a "diminishing returns" curve right
up to the point where the video card runs out of memory. The
case with 600,000 cells is presented here, but a case
with 700,000 cells would not run because of memory
limitations.
It is worthwhile to mention that in virtually all
scientific computing applications, there are tradeoffs to
be made between memory usage efficiency and computational efficiency. In this project, the decision was made to
always favor computational efficiency. This means that virtually any piece of data that could be computed only
once and stored was treated that way, from grid connectivity matrices to face normal vectors and cell areas. This
practice is beneficial to the runtime of the code, but imposes additional limitations on the maximum grid size that
can be used on a relatively memory-limited device such as the GTX 480. (Note that purpose-built scientific
computing cards such as NVIDIA's "Tesla" series can be purchased with much more available memory).
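As a concrete example of this compute-once-and-store approach, the following sketch precomputes 2-D face geometry during preprocessing. It assumes a face-to-node connectivity matrix faceNodes (nFaces-by-2) and node coordinate vectors x and y; the names are illustrative rather than the code's actual variables.

    % Precompute and store face geometry once, outside the time-stepping loop.
    x1 = x(faceNodes(:,1));   y1 = y(faceNodes(:,1));
    x2 = x(faceNodes(:,2));   y2 = y(faceNodes(:,2));

    dx = x2 - x1;
    dy = y2 - y1;

    faceLen      = sqrt(dx.^2 + dy.^2);               % face "area" (a length in 2-D)
    faceNormal   = [dy./faceLen, -dx./faceLen];       % unit normals, one row per face
    faceCentroid = 0.5 * [x1 + x2, y1 + y2];          % face midpoints

    % Storing these arrays trades memory for the avoidance of recomputation
    % inside every flux evaluation.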
C. Runtimes of Implicit Modes
Unfortunately, the implicit modes of operation could not
be tested on the GPU system because sparse matrix support
in Jacket is still in its infancy and does not include all of the
functionality used by the implicit solver. However, some
results are presented here in an attempt to show the potential
performance of the GPU on implicit operations.
Figure 7 compares the runtime per iteration of three
different time integration schemes: the three-stage Runge-
Kutta, a three-step Gauss-Seidel, and a three-step
Successive Over-Relaxation. Both of these implicit methods
use Matlab's sparse matrix functionality to build the implicit
matrix and do the requisite matrix operations to solve the
system. The Gauss-Seidel is consistently about 1.8 times
slower than the explicit method and the SOR is about 2.7
times slower. These numbers are relatively constant with
increasing number of cells in the mesh, up to the largest
tested value of just under 400,000 cells. If these implicit methods were converging properly and permitted the much
larger CFL numbers typical of implicit methods, their total solution time would be much lower than that of the
explicit method.
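A minimal sketch of this kind of sparse-matrix time integration step is shown below, assuming the implicit system A*dU = b has already been assembled; omega = 1 recovers Gauss-Seidel, while omega > 1 gives SOR. This is an illustration of the general approach, not the solver's actual routine.

    % A few Gauss-Seidel / SOR sweeps on a sparse system A*dU = b using
    % Matlab's sparse triangular solves. omega = 1 is plain Gauss-Seidel.
    omega  = 1.3;                        % illustrative relaxation factor
    nSweep = 3;

    D = spdiags(diag(A), 0, size(A,1), size(A,2));
    L = tril(A, -1);
    M = D/omega + L;                     % lower-triangular SOR iteration matrix

    dU = zeros(size(b));
    for s = 1:nSweep
        r  = b - A*dU;                   % residual of the linear system
        dU = dU + M \ r;                 % one sweep via a sparse triangular solve
    end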
D. Lessons Learned About GPU Operations
Aside from the numerical details available in runtimes, a number of trends have been observed over time. Some
of these are presented here.
Figure 6 - CPU-to-GPU Times Speedup vs. Cell Count.
Figure 7 - Comparison of Runtimes of Different Time Integration Schemes.
The GPU can be extraordinarily slow at many searching- and sorting-type operations. This tendency has
particularly presented itself in grid preprocessing routines, as the code that generates grid connectivity matrices
involves a lot of these sorting and searching operations. This code was converted to run on the GPU only to find
that its runtime went up by as much as two orders of magnitude in some cases. After this was realized, these
preprocessing operations were conducted solely on the CPU. For this scale of operations (two-dimensional grids
run on a single CPU or GPU), this is not too crippling. On the other hand, in a massively-parallel setting running
three-dimensional grids with millions of cells, this could be a problem: a set of GPUs running in parallel in a
cluster may have trouble generating these connectivity matrices without the aid of a set of CPUs that would
otherwise be nearly idle during the solution process. That said, these connectivity matrices only have to be
created once for a given grid and multiple cases (different Mach numbers, Reynolds numbers, angles-of-attack)
could be run in parallel on multiple sets of GPUs from connectivity matrices generated on a single set of CPUs.
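For context, the sketch below shows one common vectorized way of generating face connectivity on the CPU from a cell-to-node table, restricted here to purely triangular grids for brevity. The sort/unique calls are exactly the kind of searching and sorting work that performed poorly on the GPU; the variable names and the triangles-only restriction are illustrative, not the code's actual preprocessing routine.

    % Hypothetical CPU-side connectivity generation for a triangular grid.
    % cellNodes is an (nCells x 3) cell-to-node table.
    nCells = size(cellNodes, 1);

    % List all three edges of every cell as node pairs
    edges = [cellNodes(:, [1 2]);
             cellNodes(:, [2 3]);
             cellNodes(:, [3 1])];

    % Sorting each pair makes shared edges identical, so 'unique' collapses
    % duplicates into a single face list
    sortedEdges = sort(edges, 2);
    [faceNodes, ~, edgeToFace] = unique(sortedEdges, 'rows');

    % Face indices of each cell (edges were stacked cell-by-cell above)
    cellFaces = reshape(edgeToFace, nCells, 3);

    % Faces referenced by only one cell lie on the boundary
    isBoundaryFace = accumarray(edgeToFace, 1) == 1;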
Indexing operations are not nearly as crippling as might initially be suspected. When this project was started, it
was assumed that all of the mathematical operations run on the GPU would have massive speedups while the
indexing operations (for example, the operation of collecting all of the face flux values that must be summed to give
the net residual for a cell) would likely be slower than on the CPU. Instead, it was found that the indexing
operations actually saw nearly the same speedups as the mathematical ones. This observation is difficult to present
as numerical data, but it reflects the author's experience from sifting through code profiler results comparing CPU and
GPU computations. Perhaps this particular observation is overly influenced by the choice of Matlab as a computing
environment, but it was a welcome result nonetheless.
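The flux-gathering operation mentioned above is, in vectorized form, essentially a scatter-add. A minimal sketch is given below, assuming a face-to-cell table faceCells (nFaces-by-2, with 0 standing in for the placeholder on boundary faces) and a face flux array faceFlux (nFaces-by-4 in 2-D); the names and sign convention are illustrative.

    % Sum the face fluxes into cell residuals with indexed accumulation.
    nCells = max(faceCells(:));
    R      = zeros(nCells, 4);

    hasRight = faceCells(:,2) > 0;            % interior faces have two cells
    for v = 1:4                               % loop over conserved variables
        % Each face's flux is added to its "left" cell and subtracted from
        % its "right" cell (when that right cell exists).
        R(:,v) = accumarray(faceCells(:,1), faceFlux(:,v), [nCells 1]) ...
               - accumarray(faceCells(hasRight,2), faceFlux(hasRight,v), [nCells 1]);
    end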
A few even stranger tendencies have been noticed in GPU operation. One of these is the fact that division
operations on the GPU are considerably slower than multiplication operations. This is generally true about CPU
computing as well (and is frequently mentioned in computer science references concerned with algorithm
efficiency), but has largely been overcome by recent processor and compiler design. In particular, attempting to
divide by a constant will virtually always be changed during compilation into a multiplication operation. The
author's anecdotal experience from within Matlab is that division operations on the CPU might require on the order
of 1.25 times as much time as multiplication operations. On the GPU, this factor might go up to as much as four or
five times as long. The GPU operation may still be faster than the CPU operation, but the programmer is well-served
by attempting to reduce the number of division operations in their code.
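A typical way to act on this observation is to precompute the reciprocal of any divisor that is reused many times. The sketch below does this for the cell volumes; Omega, dt, R, and U are illustrative names for per-cell quantities of a single conserved variable.

    % One division per cell, performed once during preprocessing
    invOmega = 1 ./ Omega;

    % Inside the time-stepping loop the explicit update then needs only
    % multiplications
    U = U - (dt .* invOmega) .* R;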
Probably the most unexpected thing discovered while running timing cases was the impact of the computer's
other activities on the processing ability of the GPU. In particular, whenever the computer's screen saver would start
up, the time required per iteration on the GPU would increase dramatically, while the CPU runtimes would be
virtually unaffected. Then, when the computer's power saving settings would turn off the monitors altogether, the
runtimes would drop considerably again, to even lower than they were originally. None of this is particularly
shocking, but it did create some extremely odd timing plots before it was understood. The timing results here were
computed with the computer set to deactivate its displays completely whenever the user is inactive for even a short
time and the computer's mouse was unplugged for the duration of the runs to prevent an accidental "waking."
VII. Conclusions and Future Work
GPU-based computing has been shown to hold great promise, even within the realm of unstructured-grid codes,
where complex indexing operations are far more prevalent than in structured codes. This is an important result,
because the usefulness of GPUs in CFD would be severely diminished if one had to revert to the much more
tedious grid-generation process of structured codes in order to obtain the GPU's runtime efficiency.
The implicit mode of operation needs considerable additional work. It is currently only barely functional and not
yet genuinely useful. However, the author sees tremendous promise in the implicit mode. Sparse matrix
operations in Matlab are extremely efficient and there are many built-in methods for iteratively solving sparse
matrix systems. If CFL numbers of reasonable size could be obtained with the implicit mode, this host of solution
methods could be brought to bear to generate extremely rapid solutions.
One of the largest steps that can be taken to extend this work is the extension of the code to 3-D solutions. All of
the methodology was set up to be quickly and easily extensible to 3-D operations, but this has not yet been
attempted. The greatest speedups from CPU to GPU were seen on large grids with hundreds of thousands of cells.
These grids are vast overkill for a simple 2-D geometry such as the NACA0012, but such cell counts would be right
in line with what is necessary to obtain Euler results on three-dimensional geometries of mild complexity.
Appendix A. Example Grid Connectivity Functions
Figure 8 - Simple Example Grid
Figure 8 shows a simple mixed-element grid consisting of four triangular cells and two quadrilateral cells. The
large black numbers indicate cell indices (i), blue numbers with arrows indicate face indices (j), and red numbers
indicate node indices (k). The following matrices are the correct connectivity functions for this grid, as described in
Section III.B above. (Note that the NaNs indicate placeholder values, not any kind of error.)
The face-to-node connectivity (j), giving the two nodes of each face j = 1, ..., 14 (one row per face):

    (j) = [ 1  2
            1  4
            2  3
            2  5
            3  5
            3  6
            4  5
            4  7
            5  6
            5  8
            6  8
            6  9
            7  8
            8  9 ]

The cell-to-face connectivity (i), giving the faces of each cell i = 1, ..., 6 (one row per cell; triangles have three entries and quadrilaterals four):

    (i) = [  3   4   5
             6   5   9
             9  10  11
            12  14  11
             1   2   7   4
             8  13   7  10 ]

The face-to-cell connectivity (j), giving the cells on either side of each face j = 1, ..., 14 (NaN marks a boundary face with only one neighboring cell):

    (j) = [ 5  NaN
            5  NaN
            1  NaN
            1   5
            1   2
            2  NaN
            5   6
            6  NaN
            3   2
            3   6
            3   4
            4  NaN
            6  NaN
            4  NaN ]

The cell-to-node connectivity (i), giving the nodes of each cell i = 1, ..., 6 (one row per cell):

    (i) = [ 2  3  5
            3  6  5
            5  6  8
            6  9  8
            1  2  5  4
            5  8  7  4 ]                                                    (69)
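As an aside, one possible way to make such NaN placeholders usable in vectorized code (not necessarily the scheme used in this code) is to remap them to the index of a dummy ghost entry, so that indexed gathers remain valid for boundary faces. In the sketch below, faceCells is the face-to-cell table with NaN on boundary faces and rho is an illustrative cell-centered variable.

    % Remap NaN placeholders to a dummy ghost index so vectorized indexing
    % into cell-centered data stays valid for boundary faces as well.
    nCells        = size(rho, 1);
    fc            = faceCells;                 % (nFaces x 2), NaN on boundary faces
    fc(isnan(fc)) = nCells + 1;                % point the NaNs at a ghost entry

    rhoPadded = [rho; NaN];                    % append the ghost entry
    rhoL = rhoPadded(fc(:,1));                 % "left" state at every face
    rhoR = rhoPadded(fc(:,2));                 % "right" state; NaN on boundary faces,
                                               % where boundary conditions supply it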