
aeroCuda: The GPU-Optimized Immersed Solid Code

Samir Patel
Advisor: Dr. Cris Cecka
June 23, 2012

Abstract

Commercial fluid dynamics software is expensive and can be difficult to handle for transient problems involving moving objects. While open-source codes exist to handle such problems, the documentation and structure of such codes might be difficult to navigate for researchers not well-versed in computer science or students lacking a formal background in fluid dynamics. aeroCuda was developed to provide an efficient, accurate, and open-source method for testing fluid dynamics problems involving moving objects. The solution method for the Navier-Stokes equations was the Projection Method, and the effects of objects moving in fluid were implemented via Peskin's Immersed Boundary Method. The code was first developed in serial and then parallelized via CUDA and MPI to optimize its speed. It generates and rotates a full 2-d point cloud to simulate the object's shape, and also allows the user to implement full 2-d translational and rotational motion of the object. The results obtained for Reynolds numbers of 25 and 100 matched those obtained by Saiki and Biringen as well as Peskin and Lai; the expected physical phenomena are also confirmed.

Preface

This paper was submitted for the satisfaction of the thesis requirement for the Bachelor of Science in Engineering Sciences at Harvard College on April 2, 2012.

My interest in the field of CFD was piqued in high school, when I first studied the Speedo LZR Racer. Since then, I have come a long way in my understanding of CFD, both in its applications and theoretical underpinnings. However, none of this would have been possible without the support of many individuals who have supported me throughout my career as a student.

I would like to thank my parents and my sister for their continued support and trust in me. They have been monumental in getting me to where I am today. I love you, Satish, Sneh, and Swati Patel!

I would like to thank my advisor, Cris Cecka, for his support in helping me bring this project to life.

There are some individuals who have supported my work as a student at Harvard without whom I could not envision being where I am today. Special thanks to Professor Robert Wood and Dr. Hiroto Tanaka for allowing me the opportunity to work on their robotics projects and learn from their dedication to the subject, which helped develop my interests and skill as a researcher. Special thanks to Professor Anette Hosoi and Ms. Lisa Burton for allowing me to begin exploring CFD under their tutelage.

I would also like to thank those who influenced me in high school: Dr. Thom Morris, Mrs. Martha DeWeese, Mrs. Kemp Hoversten, Mr. Stephen Mikell, and Mr. Patrick Fisher. Their guidance allowed me to become the individual that I am today, and without their support I would not be where I am. In addition, I would like to thank the man who helped kindle my interest in mathematics, Mr. Farhad Azar.

I would also like to thank Assistant Professor Charbel Bou-Mosleh of the Notre Dame University of Lebanon, who over the course of one summer taught me to appreciate CFD and helped me craft my beginnings as a researcher in this area.

I would like to thank Professor Charles Peskin of NYU for his support of my project (and of course, for developing its theoretical basis).

I would like to thank Karl Helfrich of Woods Hole Institute and Mattheus Ueckermann of the Massachusetts Institute of Technology for helping me navigate the world of CFD.

This project is dedicated to the memory of my grandfathers, a mechanical engineer and a physicist.

Contents

1 Motivation
  1.1 Computational Fluid Dynamics
  1.2 Moving Mesh and a Translating Cylinder
  1.3 Governing Equations and Solutions
  1.4 Why Immersed Boundary
2 Immersed Boundary Method and Solution to the Navier-Stokes equations
  2.1 Modification of the Navier-Stokes
  2.2 Developing the Forcing Term
  2.3 Relationship between the Solid and Prescribed Points
3 Goal and Design Phase
  3.1 Goal of aeroCuda
  3.2 Reasons for Evaluation
  3.3 Platforms Evaluated
    3.3.1 Comsol
    3.3.2 Ansys
    3.3.3 openFoam
  3.4 Language for the Module
4 Working with openFoam
  4.1 Mesh Generation
  4.2 Solver Utility
  4.3 Building the Code for openFoam
  4.4 Issues with openFoam
5 Development of aeroCuda
  5.1 Influences
  5.2 Structural Overview
    5.2.1 Input
    5.2.2 Pre-Processing
    5.2.3 Solver-Loop
    5.2.4 Post-Processing
  5.3 Pre-Computation: Interior Point Generation and Rotation Capabilities
    5.3.1 Motivation behind Interior Point Generation
    5.3.2 Interpolating the Surface of the Geometry
    5.3.3 Developing the Cloaking Mechanism
    5.3.4 Developing the Delaunay Mechanism
    5.3.5 Comparing the Delaunay and Cloaking Mechanisms
    5.3.6 Implementing the Rotation Algorithm
  5.4 Developing the Solver in Serial Code
    5.4.1 Implementing the Projection Method: Steps 2 and 4
    5.4.2 Implementing the Projection Method: Step 3
    5.4.3 Implementing the Interpolation Step
    5.4.4 Implementing the Forcing Field
6 Code Refinements and Optimization
  6.1 The Variable-Spring Model
    6.1.1 Motivation
    6.1.2 Underlying Principle
    6.1.3 Algorithm
  6.2 Parallelization
    6.2.1 Evaluation of MPI
    6.2.2 Evaluation of CUDA
    6.2.3 Going with CUDA
  6.3 Implementing the CUDA-optimized Structure
    6.3.1 Implementing the Interpolation
    6.3.2 Implementing the Forcing
    6.3.3 Implementing the Intermediate Velocity and Final Velocity Calculations
7 Results Obtained with aeroCuda
  7.1 The Effect of Optimization
  7.2 Numerical Confirmations
  7.3 Expected Physical Phenomena and Further Validation
  7.4 A Closer Look at the Physical Response of the Immersed Solid
  7.5 Physical Location of the Immersed Solid Points
8 Test Case: Swimmer in Glide Position
  8.1 Overview
  8.2 Simulation Details
  8.3 Simulation Results
  8.4 Reynolds Number Transition
9 Conclusion
  9.1 Numerical Improvements to aeroCuda
  9.2 Technical Improvements to aeroCuda
  9.3 Capability Enhancements to aeroCuda
  9.4 Final Remarks
10 Finances
  10.1 Resources Used
  10.2 Upgrades
11 Appendix
  11.1 Solving the Immersed Solid-influenced Navier-Stokes Equations
    11.1.1 Step 1: Force Projection [7]
    11.1.2 Step 2: Calculating the Intermediate Velocity Field [12]
    11.1.3 Step 3: Calculating the Pressure Field [12]
    11.1.4 Step 4: Calculating the Final Velocity Fields [12]
    11.1.5 Step 5: Interpolation and Velocity [7]

1 Motivation

1.1 Computational Fluid Dynamics

The field of computational fluid dynamics (CFD) gradually arose as there was a demonstrated need to evaluate aerodynamic, mechanical, biological, and/or environmental systems, either for design or for the study of naturally-occurring phenomena like vortex shedding. However, owing to the complexity of solving the Navier-Stokes equations, the field of CFD grew to integrate three disciplines (computer science, applied mathematics, and fluid dynamics) in order to develop efficient and accurate solutions to the Navier-Stokes equations.

Most CFD simulations involve 3 main steps: pre-processing, simulation, and post-processing. In pre-processing, the problem at hand (e.g. a 2-d cylinder in a wind tunnel) is decomposed into either 2-d or 3-d geometry depending on the dimensionality of the problem. This decomposition involves breaking the domain into a contiguous sequence of triangles or other simple geometric shapes. For example, a 2-d cylinder would need to be partitioned into triangles before its solution could be developed. Moreover, this decomposition can take a long time if the geometries are complicated: the elemental partitions must not have any overlaps, jagged edges, or displaced elements. This process suffices only for a steady-state problem; for transient solutions, moving meshes might be implemented. In such simulations, a mesh with a time-dependent orientation would be developed, making the simulation very computationally expensive, given that the mesh would have to be updated to reflect the new orientation at each timestep. To reiterate, a moving mesh would be desired in the case of an object that is either changing shape or orientation as time progresses.

In the simulation step, depending on the magnitude of the Reynolds number (Re = ρud/μ) of the problem, different parameters and solution methods might need to be implemented to ensure stability of the solution. For example, in high Reynolds number problems where significant turbulence is expected, more sophisticated models might have to be applied to properly resolve the solutions. In other cases, the time-step and grid-size might have to be reduced to ensure accurate solutions. In the event that such reductions are implemented, the code must be as efficient as possible to ensure that excessive time isn't needed to achieve good solutions.

In the post-processing step, the flow-fields at different times are observed, along with the convergence of the force or another field variable to its steady-state level. In this case, for an object with prescribed motion that is either periodic or constant, steady-state refers to the situation in which the forces experienced are either periodic or constant. Being able to track the convergence of the forces allows us to know when the simulation can be terminated with sufficient results.

1.2 Moving Mesh and a Translating Cylinder

To illustrate the complexity of implementing a moving mesh simulation, the case of a translating cylinder is considered. For the algorithm, the r-method outlined by Tao Tang of the University of Maryland is observed. In this method, gridpoints are moved such that at each timestep, a high concentration of points is located where strong changes in the variable fields (such as pressure or velocity) are expected. To support the r-method, auxiliary functions, such as interpolation of field variables to reflect field values at translated nodes, need to be implemented as well.[10]

In the case of the translating cylinder, suppose that the cylinder moves with a timestep of Δt = 0.001 s at a velocity of u = 1 m/s, at a Reynolds number of 100. This means that 1000 iterations are needed for the cylinder to translate 1 meter. For the flow to develop properly, usually 6-10 meters are needed before the von Karman shedding phenomenon can be observed. Therefore, the nodes and variables are translated and interpolated, respectively, at least 6000 times to see the quantities develop. In addition, the mappings between the actual domain and a test domain needed for a finite element formulation must be taken into account as well. Depending on the clustering of nodes around the cylinder, the number of points that need to be interpolated and updated may range from tens to thousands, depending on the accuracy desired.[10]

The complexity of the equations at hand, as well as of the coding, would be a barrier to someone who is not well-versed in computer science and fluid mechanics. For a student just beginning to learn fluid mechanics, implementing a moving mesh simulation to observe the flow around a translating cylinder is an unrealistic task. Moreover, the initial mesh itself has to be generated, which may or may not be difficult depending on the complexity of the object. In conclusion, many steps have to be executed at each timestep. For problems like flapping wings, which depend on rapid optimization of a variety of parameters, the overall cost of running the simulations would be very high. The immersed boundary method offers a much less expensive approach, though at the cost of reduced accuracy (to be explained).

1.3 Governing Equations and Solutions

In CFD simulations, two primary equations are usually solved: the momentum and mass conservation equations; collectively they are known as the Navier-Stokes equations. In two dimensions, the primary quantities dealt with are the pressure p and the velocity fields, u and v. The quantity μ is the dynamic viscosity and ρ is the density. Let the 2-d velocity field be denoted as u = (u, v). The equations together are:

Momentum:  ∂u/∂t + (u · ∇)u = −(1/ρ)∇p + (μ/ρ)∇²u   (1)
Mass:      ∇ · u = 0                                  (2)

Together, these equations establish the condition of incompressible flow, where the fluid does not change density during the solution phase. This approximation is critical to the formulation of the Projection Method, the algorithm used to solve the equations in this project. The idea behind the projection method is that the velocity is propagated forward in time and then corrected to account for the incompressibility condition. The steps for solving the Navier-Stokes equations in this algorithm are [12]:

1. Solve for the intermediate velocity, u*.
2. Solve for the pressure using the divergence of the intermediate field, ∇²p^(n+1) = (ρ/Δt) ∇ · u*. Done using a Fast Fourier Transform (FFT).
3. Project the intermediate velocity to get the divergence-free final velocity, u^(n+1) = u* − (Δt/ρ) ∇p^(n+1).

To note, aeroCuda solves problems with periodic boundary conditions in both the X- and Y-directions. There is no inlet or outlet flow, but rather moving objects in a stationary fluid.

1.4 Why Immersed Boundary

Being able to modify the geometry and run simulations without having to recreate a mesh would be a great step forward in efficiency. Similarly, being able to change parameters and rerun simulations with fast runtimes would be very advantageous, especially for optimization. As an added benefit, if a decomposition of the domain is not needed to develop solutions, then simpler solution methods can be implemented with very high efficiency. The immersed boundary method developed by Peskin allows us to do exactly this. In Peskin's formulation, an extra forcing term is added to enforce the desired boundary conditions in the fluid simulation (i.e., the flow around the cylinder surface should match its prescribed velocity)[7]. Since the forcing terms coincide with gridpoints, a cartesian mesh can be used with simplified solver routines. This is one of the foundations of this design project. Such a routine can be implemented and optimized while retaining accuracy, making immersed boundary an attractive choice.

Figure 1: Point Decomposition

Figure 2: Mesh of a Similar Disk

For example, in Figure 1 the points within the boundary are marked as those that provide forcing throughout the simulation (they follow prescribed motion, as discussed later). In Figure 2, by contrast, the nodes that compose the mesh of the disk would provide the Dirichlet or Neumann boundary conditions, depending on the type of simulation being run. Immersed boundary does not require the regeneration of a mesh at every new timestep in the simulation, which would otherwise require a mesh similar to that in Figure 2 to be translated (including all of its nodes) at every timestep. Because immersed boundary avoids this task, a significant speedup in runtime is observed; this forms part of the motivation behind building a code to run immersed boundary simulations.

2 Immersed Boundary Method and Solution to the Navier-Stokes equations

Professor Charles Peskin of NYU was the originator of the immersed boundary method, a method of solving the Navier-Stokes equations in a complicated or smooth domain with a structured grid. Peskin originally developed the immersed boundary formulation to model the fluid flow in the heart; however, it has been widely adapted to many flow problems. His formulation is outlined in the following subsections. For this specific project, it has been coupled with the projection method outlined by Tryggvason. The full scope of the problem is now addressed.

2.1 Modification of the Navier-Stokes

In his formulation, Peskin modifies the momentum equation so that an extra forcing term, f, is included.[6] For the equations used, Tryggvason's ρ-normalized equation is adopted, where p is the ρ-normalized pressure term and ν = μ/ρ is the kinematic viscosity.[12]

Momentum:  ∂u/∂t + (u · ∇)u = −∇p + ν∇²u + f   (3)
Mass:      ∇ · u = 0                             (4)

The addition of the forcing term in the Navier-Stokes equation allows the fluid around a certain point (with prescribed velocity) to be forced such that the prescribed velocity is observed by the fluid field.[6] Ordinarily in the Navier-Stokes equations, either a no-slip or a slip boundary condition would be prescribed via a Dirichlet or Neumann boundary condition on the object in the flowfield. However, in a problem involving a moving boundary, the location of these conditions would be dependent on the orientation and location of the mesh. By using the forcing term, the boundary conditions are implicit in the formulation and do not need to have their locations respecified, as the locations of the aforementioned points provide that capability.

2.2 Developing the Forcing Term

In Peskin's original formulation of the immersed boundary method, the forcing term had a magnitude of κ|d²x_b/ds²|, where κ is the membrane force constant and the derivative represents the curvature of the membrane along the tangential direction s. In doing so, Peskin allows for fluid-structure interaction to take place (the fluid forcing the boundary as well as vice versa).[6] However, in the case of immersed solids, both boundary and interior points matter. Therefore, a modification of Peskin's implementation as a network of springs is applied. In this alternative formulation, Peskin simulates the object via springs; this method was used by Peskin and Lai to simulate the flow around a stationary cylinder with great accuracy. The force is then given by κ(x_p − x_b), where (x_p, y_p) are the points prescribed by the user and (x_b, y_b) are those that move with and force the flow field.[7]

While this setup is useful for simple geometries and motions, for higher Reynolds number flow problems a way of ensuring that immersed boundary points do not oscillate spuriously is required. The harmonic oscillator-forcing mechanism implemented by Saiki and Biringen in their study of the flow around a cylinder, f = κ(x_p − x_s) + η(v_p − v_s), is used.[9] It is very similar to the action of a damped harmonic oscillator and helps obtain convergence of the velocity while dissipating the energy exhibited by strongly-oscillating particles, as can be seen in the force plots discussed later. Peskin's forward Euler method is used to advance positions in the code, where x_b^(n+1) = x_b^n + u_b^n Δt. To introduce the forcing terms into the Navier-Stokes equations, Peskin uses the Dirac delta function to transfer each boundary point's force to an area of gridpoints via a stencil of coefficients. In addition, to get the velocity of the boundary points, Peskin interpolates from the surrounding fluid velocity points via the same delta function stencil.[7] Henceforth, the immersed boundary shall be referred to as an immersed solid.
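The delta-function stencil can be sketched as follows. Peskin's standard 4-point regularized kernel is assumed here; the helper names and periodic-grid layout are our own simplifications, not aeroCuda's actual routines.

```python
import math
import numpy as np

def phi(r):
    """Peskin's 4-point regularized delta kernel (dimensionless form).
    Nonzero only for |r| < 2, and its shifts sum to 1 over the integers."""
    r = abs(r)
    if r <= 1.0:
        return (3.0 - 2.0 * r + math.sqrt(1.0 + 4.0 * r - 4.0 * r * r)) / 8.0
    if r < 2.0:
        return (5.0 - 2.0 * r - math.sqrt(-7.0 + 12.0 * r - 4.0 * r * r)) / 8.0
    return 0.0

def interp_velocity(u, xb, yb, h):
    """Interpolate a periodic grid field to a solid point (xb, yb) via the
    tensor-product delta stencil over the 4x4 patch of nearby gridpoints,
    i.e. all points within two grid spacings."""
    n = u.shape[0]
    i0, j0 = int(math.floor(xb / h)), int(math.floor(yb / h))
    ub = 0.0
    for i in range(i0 - 1, i0 + 3):
        for j in range(j0 - 1, j0 + 3):
            ub += u[j % n, i % n] * phi(xb / h - i) * phi(yb / h - j)
    return ub
```

Force spreading uses the same stencil in reverse: each solid point's force is added to the same 4x4 patch with the same weights (divided by h² to approximate the delta function's area density).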

2.3 Relationship between the Solid and Prescribed Points

To reiterate, there are two sets of points: the solid points, (x_b, y_b), and the prescribed points, (x_p, y_p). There is a one-to-one correspondence between the solid and prescribed points; each solid point tracks the prescribed point as the latter moves based on the motion specified by the user. The solid point derives its velocity from that of the fluid points surrounding it. In Peskin's formulation, each solid point receives velocity from, and projects force to, all gridpoints within a radius of 2 gridspaces.[7] The velocity of a specific solid point is calculated by means of the Dirac delta function. This process is done twice: initially to calculate the damper force in the forcing equation, and finally to advance the solid points. The act of projecting the force begins by obtaining the velocity of each solid point and the distance between the solid point and its prescribed counterpart.

These are provided as inputs into the forcing equation, and a single value is obtained for each pair of solid and prescribed points. These forces are then transferred to the grid via the same Dirac delta function, except in this instance the value is spread to the surrounding fluid points to influence their motion. In sum:

Force Projection
1. Obtain solid point velocities from the surrounding fluid via the Dirac delta function.
2. Calculate forces via the forcing equation with solid point velocities, prescribed point velocities, and the distances between each solid point and its prescribed counterpart.
3. Spread the force to the fluid points surrounding each solid point via the Dirac delta function.

Point Update
1. Obtain solid point velocities from the surrounding fluid via the Dirac delta function.
2. Use forward Euler to advance the solid points by their respective interpolated velocities and the prescribed points by the specified functional velocity.
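The force and update steps can be sketched as a spring-damper computation followed by forward Euler. This is an illustration only: the constant names KAPPA and ETA are ours, their values are arbitrary, and the interpolated fluid velocities vb are assumed to have been obtained already via the delta-function stencil.

```python
import numpy as np

# Illustrative constants: spring and damper coefficients and the timestep.
# The thesis tunes such constants per problem; these values are arbitrary.
KAPPA, ETA, DT = 100.0, 1.0, 1e-3

def force_and_update(xb, vb, xp, vp):
    """One force-projection/point-update cycle for a batch of solid points.
    xb, vb: solid point positions and their interpolated fluid velocities.
    xp, vp: prescribed point positions and user-specified velocities."""
    # spring-damper forcing for each solid/prescribed pair
    f = KAPPA * (xp - xb) + ETA * (vp - vb)
    # forward Euler: solid points move with the interpolated fluid velocity,
    # prescribed points with the user-specified functional velocity
    xb_new = xb + vb * DT
    xp_new = xp + vp * DT
    return f, xb_new, xp_new
```

The returned forces would then be spread back to the grid with the same delta-function stencil used for interpolation.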

3 Goal and Design Phase

3.1 Goal of aeroCuda

The goal of developing aeroCuda is to design either an add-on component to existing CFD software or a standalone CFD code that is capable of handling immersed solid implementations for transient Navier-Stokes problems. Given that the scale and types of problems could vary extensively, certain targets for both user inputs and specifications were set. While the final design did not match all of these, it did satisfy the design expectations that were initially set. These are outlined in the following tables. The specifications were set to allow users to efficiently calculate solutions to problems involving rigid bodies. The efficiency comes from introducing parallelization into the code, whereby tasks are broken down amongst multiple processing units instead of one processor. CUDA was chosen over MPI for the bulk of the parallelization, as it allowed for massive parallelization of very basic arithmetic operations. Concerning the rigid bodies, such implementations were the initial goal; however, the concentration of points in important regions could decrease if the object expanded, leading

Table 1: Specifications

Specification              Initial                          Final
Dimensionality             3-d                              2-d
Parallelization            MPI/CUDA                         9:1 CUDA:MPI
Numerical Accuracy         > 4th-order                      1st- and 2nd-order
Object Discretization      User-specified                   Internally generated
Movement of Solid Points   Specify positions for all time   Prescribed motion
Object Type                Deformable                       Rigid

to forcing problems (discussed in later sections). In addition, for this project it was simpler to prescribe consistent motion for the entire body; prescribing motion for all internal points would result in a drastic loss of efficiency and introduce a very complicated structure in point-dependent functions. Lastly, 2 dimensions were chosen instead of 3, as 3-d grid sizing and execution would lead to memory problems and slowdowns in runtime. The 3-d case can be developed if necessary.

Table 2: Solver Input and Output

Input                    Output
Nodes/Connectivity       Full Variable Fields
Situational Parameters   Solver Timings
Point Locations          Problem Parameters
Functional Motion        Total Force

Part of the motivation behind this project was to place as much control as possible in the hands of the user. To this end, the user can input any 2-d surface and leave it to the software to generate the internal points. In addition, the motion can be prescribed through lambda functions, which are functions of variables that do not require formal declarations. Of importance to the user is the CFL condition, specifically making sure that enough time- and space-refinement is used to ensure convergence and accuracy of the solution. In terms of output, almost all calculated variables and analytics are outputted either at a certain frequency or every cycle. A more in-depth analysis of the software will be provided in upcoming sections.
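Lambda-prescribed motion of the kind described above might look like the following sketch. The exact input format aeroCuda expects is not reproduced here; the lambda names and the forward-Euler helper are illustrative.

```python
import math

# Hypothetical motion specification: velocities as lambdas of time t.
u_of_t = lambda t: 1.0                                # steady 1 m/s drift in x
v_of_t = lambda t: 0.1 * math.sin(2.0 * math.pi * t)  # gentle vertical flapping

def advance_prescribed(xp, yp, t, dt):
    """Forward-Euler update of a prescribed point by the lambda velocities."""
    return xp + u_of_t(t) * dt, yp + v_of_t(t) * dt
```

Because the lambdas are ordinary Python callables, swapping in a new motion profile requires no change to the solver itself.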

3.2 Reasons for Evaluation

The final structure of aeroCuda, as well as the decision to construct a CFD code from scratch, were both decided upon after evaluating and working with a number of existing CFD platforms. The initial stage of the project focused exclusively on identifying a platform on which to implement the immersed solid method and a coding language in which to develop the module. The target criterion for a platform was software whose solver routines could be directly interfaced with via external code.

3.3 Platforms Evaluated

3.3.1 Comsol

Comsol is a widely-used industrial solver utility. It has modules available for all disciplines of engineering, including a fluid-dynamics module. It interfaces directly with Matlab, which has an extensive library of tools that would provide good support to the user. However, Comsol would require a new solve at each timestep in order to update the new locations of points and forces. In addition, Comsol did not allow for specific quantities to be placed directly on the field, which complicated force and point placement.

3.3.2 Ansys

Ansys is the industry standard for fluid-dynamics problems. It comes with everything from strong CAD capability to mesh generation and a great CFD solver routine in Fluent. However, the user interface can be very complicated to operate, even for very basic test cases. The CAD interface allows for the construction of great geometries, but operating ICEM-CFD (the mesher) and Fluent requires a high level of understanding. The Ansys suite allows for the implementation of an immersed solid functionality, which allows motion to be prescribed to an object that won't deform, but it has no capability for a deformable object. While it was ultimately chosen to pursue a rigid object, Ansys did not appeal due to the difficulty of engaging Fluent and working with it directly (as is needed for the immersed solid implementation).

3.3.3 openFoam

openFoam is an open-source CFD library available as a set of C++ modules that can run many types of problems. The motivation behind using openFoam is that all of the solvers are already coded, so one can go straight to implementing the immersed solid method. Additionally, the solver allows for the output of fields every certain number of cycles, so a visual analysis of the field can take place. It gave control at a very low level, which meant implementing the immersed solid method would be considerably easier with this software than with Ansys or Comsol. Moreover, its open-source attributes assured that no copyright or license violations would be incurred in modifying the software.


3.4 Language for the Module

For developing the code infrastructure to support the project, Python was the language of choice. Beyond being object-oriented, Python is quite easy to code in and can easily interface with a host of packages, from visualization to parallelization. Among those useful for this project were:

Numpy: A Python math library that allows for the development and usage of arrays, with a host of functions that operate on these arrays. It also has an FFT package embedded within.

Matplotlib: A Python plotting library that can very quickly generate contours, vector fields, and other plots.

Pickle/H5: These two libraries allow for quick and efficient outputting of data. In the case of pickle, Python variables are output directly to a file. H5 allows for great data compression, and its files (known as cubes) are very quick to write to and read from.

Moreover, many of the underlying functions used in Python libraries have already been optimized using C and C++. These two languages were also considered for the project, but interfacing them with other packages and developing visualization would have been difficult.

4 Working with openFoam

4.1 Mesh Generation

To generate the mesh, the blockMesh utility is called from within openFoam from the central directory. The blockMeshDict file contains all of the mesh information, and the blockMesh utility will find the file and output the mesh. Of the files output by blockMesh, all but the boundary file are important for obtaining a discretization of the structured grid that will be worked with. The boundary file itself is where the boundary conditions that need to be applied will be specified.

4.2 Solver Utility

Once these files are produced, the solver utility testFoam is called to run the code. testFoam simply needs to be called without any command-line parameters, as it will read and output files so long as they follow the openFoam file structure. At each iteration, testFoam outputs a time file with that iteration's results, until the total runtime of the problem is completed.

4.3

Building the Code for openFoam


Figure 3: Structure of openFoam Immersed Solid Code

The structure of the openFoam immersed solid code that was developed is shown above. All of the modules were produced in Python. The program works as follows, as detailed in Figure 3:

1. Reading in the User's Object: The user provides a node file and a connectivity file. This allows for point placement on the grid.

2. Parsing the Mesh File: After the blockMesh utility is called to generate the desired cartesian mesh, a parser is run to read in the mesh data. This consists of four files: the nodes, faces, cells, and neighbors. A parser runs on each file and stores the data, which is output via the pickle module to a data folder. Each specific value is stored as a key in a dictionary, and the relevant info (node coordinates, connectivity) as the items under that dictionary key.

3. Placing the Points on the Grid: Once the object data is obtained, a series of modules is run to determine which faces are closest to the object's points. This is done by iterating through all of the faces and finding the one whose centroid has the lowest distance to the object point. Once the centroids, and consequently the cells, are identified, a boundary condition file is generated.

4. Developing the Boundary Condition File: Depending on what the user specified for motion, a file with all of the boundary conditions (patch, cell number, value) is output. The file lists all of the boundary conditions in sequential order with the relevant data, and can be parsed by the csv.reader function in Python.

5. Parsing the Input File: The user should have a file listing all of the boundary conditions that need to be specified in the program; this consists of a list of patches (face boundary conditions) carrying the relevant information. In addition, the file should also contain specifications of the mesh (grid size, spacing) and solver (time step, fluid specifications). These files are parsed by an input module, and the data is then turned into a boundary condition file.

6. Generating the Initial Field: Once the boundary conditions are read, the mesh data is used to generate the initial field, which represents the problem at the initial time of 0, with the boundary conditions reflected in the same manner.

7. Running the Solver Loop: The solver loop is executed for one iteration. At the next iteration, the same process takes place.
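Step 3 above, finding the mesh face whose centroid lies closest to each object point, can be sketched with numpy broadcasting in place of the face-by-face iteration described (array names are illustrative):

```python
import numpy as np

def nearest_faces(object_pts, centroids):
    """For each immersed-solid point, return the index of the face whose
    centroid is nearest.

    object_pts: (n, 2) array of solid-point coordinates.
    centroids:  (m, 2) array of face-centroid coordinates.
    Returns an (n,) array of face indices.
    """
    # Pairwise squared distances via broadcasting:
    # (n, 1, 2) - (1, m, 2) -> (n, m, 2), summed over the coordinate axis.
    d2 = ((object_pts[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)
```

Squared distances suffice here, since the minimizer is the same as for true distances and the square root is avoided.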

4.4

Issues with openFoam

The full immersed solid formulation was not implemented in openFoam, as it became apparent upon running a first iteration of the code that the software was not a suitable choice. Multiple issues arose:

1. openFoam had its own file structure formats, and consistently developing the input files was not only costly (at every iteration) but also prone to errors (if even one letter was off, or there was an errant space, the program would crash).

2. openFoam required a structured mesh with the above implementation. Because the forcing term would be implemented via patches (face boundary conditions), it soon became apparent that every face would have to be assigned a specific value; linking all of the faces together was not possible unless the mesh was structured in this way.

3. The patch method was very quick to generate continuity errors; while these might have been resolved, the amount of time available for this project made it infeasible to pursue this issue further.

4. The openFoam data fields were structured in the software's specific file format, and while openFoam has a graphical user interface known as ParaView for post-processing, it simply wasn't feasible to use this for all analyses, as a faster development loop was desired.

5

Development of aeroCuda

5.1

Influences

Having evaluated the issues with openFoam, it seemed the best decision was to develop a code from scratch that would be malleable, effective, and efficient for users of all backgrounds. The final structure of the code drew its influence from codes developed by Peskin and from openFoam. From Peskin's code, the structure of the solver routine as well as the force projection, interpolation, and advancement of the solid points were incorporated into the final version. From openFoam, the data input/output structure was adapted into the final version of the code. Given that these codes were robust, and in the case of openFoam, well-established, they would serve as good templates. A notable part of the final structure is the solution of the problem on an Nvidia GPU, which allows for very significant parallelization and provides immense speedups in the solution phase.

On his website, Peskin provided Matlab code that simulated the problem of an immersed elastic membrane forcing fluid via a tensile force (forcing proportional to |d²x/db²|).[8] The code served as a template for how an immersed solid software might be structured. Since it was written in Matlab, the code was translated to Python to get a feel for which Python functions and/or modules would play a critical role in the CFD package. Among those that were useful were Pylab, Numpy, and Scipy, which provided vectors for handling the data as well as arraywise operations. Of particular note were the pointer-referencing issues that arose in Python and not in Matlab. In Matlab, when a variable is set to take on the value of another variable, it receives the value by copying, not by direct memory reference. In Python, however, the data is transferred by direct memory reference unless a copy function is called, creating a duplicate of the value itself. Therefore, in certain cases involving function calls and variable storage, the code needed to be modified to ensure the original variable wasn't altered during the update process.

From the Python version of Peskin's code, there were a few important points that would figure in the development of an immersed solid code:

1. Peskin's implementation was for an immersed membrane, but this project's goal was to simulate immersed solid bodies. A key difference is that the fluid inside the body surface does not move if the delta stencils of the boundary points do not cover the full interior.

2. Peskin's code used for-loops and other runtime-expensive mechanisms, which led to high runtimes for large grids and/or a large number of immersed solid points.

3. Peskin's code provided a template where reconfiguration and adaptation (i.e., for array operations) could provide serious optimization. The areas presenting serious potential for optimization were the choice of solution algorithm and the parallelization of the code.

4. Spurious oscillations occurred within the code when different situations were implemented, e.g. a wider membrane radius.

With respect to the last item, a paper by Saiki and Biringen noted that spectrally-discretized flow solvers tend to produce spurious oscillations. In Peskin's code, Fourier transforms were widely used to solve the equations. This observation by Saiki and Biringen motivated the use of Tryggvason's formulation of the projection method.
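The referencing pitfall described above can be demonstrated directly:

```python
import numpy as np

# In Python, assignment binds a second name to the same object;
# mutating through either name changes the shared data.
a = np.zeros(3)
b = a            # b references the same array as a
b[0] = 5.0
assert a[0] == 5.0   # a was altered through b

# An explicit copy duplicates the data, as the update steps require.
c = np.zeros(3)
d = c.copy()     # independent duplicate
d[0] = 5.0
assert c[0] == 0.0   # c is untouched
```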

5.2

Structural Overview

5.2.1

Input

Figure 4: Structure of Input

For the input, there are 3 main components (outlined in Figure 4). First, nodes that define the surface of the object are needed. These will serve as a portion of the prescribed points (more might be needed, as explained in the following section). Second, the connectivities of these points are required as well, to guide appropriate distance-checking and interpolation between consecutive nodes. The nodes are input as an n x 2 array (x-coordinates in one column, y-coordinates in the other), and the connectivities are also n x 2, where the ith row holds the ids of the two points that the ith node attaches to. Lastly, the parameters of the solve need to be provided. These range from the physical constants to the grid spacing, as well as the specifications for the GPU (thread configuration per block, block configuration per grid).
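A hypothetical input for a square object illustrates the expected array shapes:

```python
import numpy as np

# Four surface nodes as an n x 2 array, and an n x 2 connectivity array
# whose ith row lists the ids of the two nodes that node i attaches to.
# (This square is an invented example, not a case from the thesis.)
nodes = np.array([[0.0, 0.0],
                  [1.0, 0.0],
                  [1.0, 1.0],
                  [0.0, 1.0]])
connectivity = np.array([[3, 1],   # node 0 joins nodes 3 and 1
                         [0, 2],   # node 1 joins nodes 0 and 2
                         [1, 3],
                         [2, 0]])
assert nodes.shape == connectivity.shape == (4, 2)
```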

5.2.2

Pre-Processing

Figure 5: Structure of Pre-Processing

The pre-processing phase is broken into multiple steps, as shown in Figure 5. First, the nodes and connectivities are checked for spacing (with a tolerance prescribed by the user). This alerts the user to problematic spacing. Second, the nodes and connectivities are taken into the Complete module: wherever the spacing between two connected nodes is greater than the actual grid spacing, enough points are generated between the two nodes via interpolation until the gap is sufficiently small. Once the surface is closed, points inside the bounding surface are generated. Regardless of whether the user wants to rotate the orientation of the object, the rotation module is run to retrieve the angles and radii (relative to the specified origin of motion) of the points. These are important if any angular velocity is prescribed. Lastly, each point is given a specific spring constant to keep it as close as possible to the prescribed point's position; the reason is that external points have fewer points to rely on for additional forcing and therefore need a higher spring constant.
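The gap-closing step of the Complete module might be sketched as follows (the function name and interface are assumptions, not aeroCuda's own):

```python
import numpy as np

def fill_gaps(p0, p1, h):
    """Return evenly spaced points strictly between nodes p0 and p1 so
    that no gap along the segment exceeds h.

    The endpoints are excluded, since they already exist as nodes; the
    returned array may be empty when the nodes are already close enough.
    """
    p0, p1 = np.asarray(p0, float), np.asarray(p1, float)
    dist = np.linalg.norm(p1 - p0)
    n_new = int(np.ceil(dist / h)) - 1   # points needed between the nodes
    if n_new <= 0:
        return np.empty((0, 2))
    # Parameters in (0, 1) for linear interpolation along the segment.
    t = np.arange(1, n_new + 1) / (n_new + 1)
    return p0 + t[:, None] * (p1 - p0)
```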


5.2.3

Solver-Loop

Figure 6: Structure of Solver Loop

At the conclusion of pre-processing, the solver loop is engaged. It repeats 6 steps that feed data in and out, all shown in Figure 6. In the first step, the velocities of the solid points are obtained via delta-stencil interpolation from the variable fields. Once obtained, the forces on each solid point are calculated via the forcing equation and projected to the grid. Next, the equations are solved via the projection method: the intermediate velocity, the pressure field, and the velocity correction. Lastly, the final velocities are obtained via interpolation, and both the prescribed and boundary solid points are translated by their respective velocities. Note that all calculations take place on the GPU, to optimize their runtimes.
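The cycle can be sketched as a skeleton in which every callable stands in for the corresponding aeroCuda routine (all names here are hypothetical; in aeroCuda each stage runs as a GPU kernel):

```python
def solver_loop(state, steps, interpolate, compute_forces, spread,
                intermediate_velocity, pressure_solve, correct, advance):
    """Skeleton of the solver cycle in Figure 6; the callables are
    placeholders, not aeroCuda's actual function names."""
    for _ in range(steps):
        u_solid = interpolate(state)                      # delta-stencil interpolation
        f_grid = spread(compute_forces(state, u_solid))   # forcing + projection to grid
        u_star = intermediate_velocity(state, f_grid)     # explicit intermediate velocity
        p = pressure_solve(u_star)                        # implicit pressure solve
        state = correct(state, u_star, p)                 # velocity correction
        state = advance(state)                            # translate the solid points
    return state
```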

5.2.4

Post-Processing

The post-processing takes place as the solver loop executes, as detailed in Figure 7. There are two types of outputs. Those of type Transient take place with each execution cycle. Those of type Frequency take place after a certain (user-specified) number of cycles executes. The values output as type Frequency tend to carry a lot of data and therefore should only be output after a large number of cycles; otherwise, a slowdown in runtime and massive memory consumption will take place. The idea of Frequency outputs was taken from openFoam, as it seemed the most logical way to view variables without incurring the aforementioned costs.

Figure 7: Structure of Post-Processing

5.3

Pre-Computation: Interior Point Generation and Rotation Capabilities

5.3.1

Motivation behind Interior Point Generation

In the immersed solid formulation, interior points need to be specified inside the 2-d or 3-d geometry to force the fluid internally, as suggested by Peskin in correspondence. To put this in perspective, if a solid circle is moving at a velocity u, it should not force the fluid on its outside only; the fluid in its interior should also be moving at the same velocity u. If interior points are not specified, then the velocity in the interior of the circle will not be u, as no force will be present to move the fluid at that velocity; this is the case with a moving membrane, which is not the focus of this project. To make the task easier for the user, the code requires only a 2-d surface to be passed in and develops the interior points afterwards.

5.3.2

Interpolating the Surface of the Geometry

Since the immersed solid relies on points forcing the fluid, it is important that points completely enclose the object at hand, to prevent fluid from penetrating the intended boundary. To handle this issue, an interpolation module is implemented to close gaps in the surface. It takes in a list of nodes and connectivities to generate the surface of the object. Once completed, each point has its connectivity checked to ensure that the distance between two nodes is less than a certain amount (for best results, this should be smaller than the grid spacing). If the distance between two nodes is too large, a linear interpolation scheme is implemented by traversing a vector between the nodes and placing a point every h units, where h is a tolerance defined by the user. In doing so, it is ensured that the object has no compromising gaps.

Figure 8: Cloaking from Different Directions

5.3.3

Developing the Cloaking Mechanism

Cloaking is a mechanism developed to help construct a point cloud that most closely resembles the object's geometry. The principle behind cloaking is to isolate all points that lie within a boundary by using normal vectors from all 4 sides. The nodes are mapped to locations on the grid via the prescribed spacings, with a magnitude of 1 (the gridpoints covered by nodes are initialized to 1, all others to 0). Cumulative sums are then executed from all 4 sides of the grid, using the Numpy.cumsum function. Therefore, any point which lies in the normal-vector direction from a boundary point will have a value greater than 0, due to the cumulative sum. At the end, the grids are examined for those points with a nonzero value in all four runs; these points form the point cloud that composes the object. The drawback to cloaking is that if the tolerance for cloaking is less than the interpolation spacing, there will be gaps in the solid, which may reduce the effectiveness of the mechanism. This process is detailed in Figure 8. The dark blue portions of the figures represent those points where the sum is 0; where there is color (ranging from blue to red), the value of the sum is greater than 0 (the closer to red, the higher the sum). All those points with nonzero values in all 4 cases are taken to form the body of the object.
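A minimal numpy sketch of the four cumulative sweeps, assuming the boundary has already been rasterized to a 0/1 mask on the grid:

```python
import numpy as np

def cloak(boundary_mask):
    """Identify gridpoints on or inside a closed boundary by cumulative
    sums from all four sides (the cloaking mechanism described above).

    boundary_mask: 2-d array, 1 on boundary gridpoints, 0 elsewhere.
    Returns a boolean array, True on and inside the boundary.
    """
    sweeps = [
        np.cumsum(boundary_mask, axis=0),                    # from the top
        np.cumsum(boundary_mask[::-1], axis=0)[::-1],        # from the bottom
        np.cumsum(boundary_mask, axis=1),                    # from the left
        np.cumsum(boundary_mask[:, ::-1], axis=1)[:, ::-1],  # from the right
    ]
    # Interior (and boundary) points are nonzero in all four sweeps.
    return np.all([s > 0 for s in sweeps], axis=0)
```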


5.3.4

Developing the Delaunay Mechanism

An alternative to the cloaking mechanism is the Delaunay triangulation method. While Delaunay triangulation was originally developed to help form meshes, it has been adapted here to develop interior points for an arbitrary 2-d geometry. As adapted and modified from the notes of Tautges, the algorithm is as follows [11]:

1. Identify an interior point (find the average (x, y) coordinate).

2. Initialize arrays to keep track of point ids, (x, y) locations, and which points have been checked for neighbors.

3. Starting with the central point, check whether any interior/boundary points exist in the up, down, left, and right directions within a radius r. If so, create a new point entry and log its (x, y) coordinates along with a checked status of 0 (empty).

4. Repeat the previous step until no new points have been added after a certain number of executions.

The benefit of using the Delaunay method is that the generated points can very quickly conform to the boundary of the object without distorting its actual surface just to fill the interior. In addition, the tolerance can be adjusted to help ensure that the boundary is matched quite nicely. While the implemented algorithm does not involve any adaptive point generation, such a capability could be added to the code and would allow for more robust results.

5.3.5

Comparing the Delaunay and Cloaking Mechanisms

Figure 9 depicts the effects of both mechanisms using different spacings on a NACA 6716 airfoil. At first glance, the cloaking mechanism appears to provide more than enough points for the interior, but some do not stay inside the shape, that is, they cross the boundary (though the violations are not too apparent). In the case of the Delaunay method, fewer points are provided, but they remain inside the boundary. While grid and point spacing certainly affect the outcome of the immersed boundary, conforming to the body of the object is important in CFD, regardless of the problem being solved. In the immersed boundary method, however, it is important that the points defining the boundary be supported by interior points. In essence, since neighboring points move in the same general direction at the same general speed, their force contributions are split across the surrounding boundary points. Therefore, the point generation mechanism must be able to place points very close to the boundary. Since the cloaking mechanism does this more efficiently, it is used to generate the point clouds for the following simulations.

Figure 9: Interior Point Generation using Both Mechanisms

5.3.6

Implementing the Rotation Algorithm

To allow the user to test different angles of attack or orientations, a rotation module was implemented to give the geometry a certain angular orientation. The general structure of the rotation algorithm is as follows:

1. Calculate where the central point of the geometry lies.

2. Shift the entire object to be centered over the origin.

3. Get the distance from the origin to all points.

4. Get the angles of all points relative to the origin by converting them to complex values and using the angle function in Python.

5. Add the desired theta to all of the angles.

6. Use the r(cos θ, sin θ) formulation to regenerate the points.

7. Shift them back to the original central point.
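The seven steps translate to numpy as follows (a sketch; aeroCuda's own implementation may differ in details such as the choice of rotation origin):

```python
import numpy as np

def rotate(points, theta):
    """Rotate an (n, 2) point cloud by theta radians about its centroid,
    following the seven steps above. numpy's angle() plays the role of
    "the angle function in Python"."""
    points = np.asarray(points, float)
    center = points.mean(axis=0)                 # 1. central point
    z = (points - center) @ np.array([1, 1j])    # 2. shift, stored as complex
    r, phi = np.abs(z), np.angle(z)              # 3-4. radii and angles
    phi += theta                                 # 5. add the desired theta
    rotated = np.column_stack((r * np.cos(phi),  # 6. r(cos, sin) regeneration
                               r * np.sin(phi)))
    return rotated + center                      # 7. shift back
```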


5.4

Developing the Solver In Serial Code

The projection method has 5 steps that need to be solved. The algorithm presented here summarizes the full solution procedure detailed in the Solver-Loop subsection of the Structural Overview:

1. An interpolation of velocities from the field and a projection of the calculated forces to the field [8]

2. An explicit solve for the intermediate velocity [12]

3. An implicit solve for the pressure field to correct the intermediate velocity [12]

4. An explicit solve for the final velocity via pressure correction [12]

5. An interpolation of velocities from the field and an update of the prescribed and solid locations [12]

5.4.1

Implementing the Projection Method: Steps 2 and 4

Steps 2 and 4 are the easiest to implement, since they are explicit and involve shifting operations. For the simulations of this project, it is important to note that periodic boundary conditions are enforced, so over a domain of size [0, L] x [0, L], the conditions x(0) = x(L) and y(0) = y(L) hold for all variables and their derivatives. It is important to make sure that the cells on the boundaries read their data from those on the opposite side whenever the applied operator requires a cell past the boundary. In Python, such an operation can be implemented via the Numpy.roll function, which shifts an array of n dimensions along a specific axis by a certain magnitude. The second step of the algorithm was therefore laid out as follows. Let u_n denote the field velocity (u_x, u_y) at step n, f the force field, u_s the intermediate velocity (u_s, v_s), Δx the x-spacing, Δy the y-spacing, ρ the density, Δt the timestep, and ν the viscosity.

Define the function partial-first(variable, spacing, magnitude, axis): (roll(variable, -1, axis) - roll(variable, 1, axis))/(2*spacing)

Define the function partial-second(variable, spacing, magnitude, axis): (roll(variable, -1, axis) - 2*variable + roll(variable, 1, axis))/pow(spacing, 2)

u_s = u_n + Δt*(-1*(partial-first(u_n, Δx, 1, 2)*u_x + partial-first(u_n, Δy, 0, 2)*u_y) + ν*(partial-second(u_n, Δx, 1, 2) + partial-second(u_n, Δy, 0, 2)) + f/ρ)

Likewise, for the fourth step of the algorithm, with u_{n+1} the field velocity at step n+1 and p the pressure:

u_{n+1} = u_s - Δt*partial-first(p, ***, 1, 2), where *** denotes the relevant axis spacing (x-axis: Δx, y-axis: Δy).

In utilizing the Numpy.roll function, two benefits are gained. First, because the Numpy functions are coded in C and operate array-wise, the cost of iterating through the array via looping is avoided. Second, the roll function implicitly accommodates periodic boundary conditions, helping to avoid conditional statements that would otherwise be needed to treat boundary and interior nodes differently.

5.4.2

Implementing the Projection Method: Step 3

In the description of the algorithm used, it was outlined that the FFT was used to solve the Poisson equation. However, this was only arrived at after considering the implementation of a matrix solution method. The Poisson equation, given as

∂²p/∂x² + ∂²p/∂y² = f,

takes the following form when decomposed via finite differences:

(p_{i,j+1} - 2p_{i,j} + p_{i,j-1})/(Δx)² + (p_{i+1,j} - 2p_{i,j} + p_{i-1,j})/(Δy)² = f_{i,j}.

A matrix method like BICGSTAB can be utilized to solve this equation. The coefficient matrix would have five bands, since there are five variables involved in each equation, as shown in Figure 10.

Figure 10: Coefficient Matrix Structure for Poisson Equation on 8 Node x 8 Node grid

From a computational perspective, this means that for every point on the computational grid, there are 5 values to be stored in the matrix. Since the smallest grid used is of size (512, 512), about 10 MB is allocated for the coefficient matrix. While a method like BICGSTAB can indeed work with a coefficient matrix of this size, it would require many iterations, in addition to ensuring that memory allocation is not a problem (creating the matrix outlined here resulted in a MemoryError being raised by Numpy). Since a speedy and accurate CFD solution is desired, and one that does not require massive numbers of cores to run, implementing a spectral solution to the Poisson equation is an efficient way of obtaining a good solution.
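A numpy sketch of the spectral solve on the periodic grid (aeroCuda runs this step through cuFFT; here numpy.fft stands in, and the eigenvalues used match the roll-based finite-difference Laplacian):

```python
import numpy as np

def poisson_fft(f, dx, dy):
    """Solve d2p/dx2 + d2p/dy2 = f with periodic boundaries via FFT.

    Each Fourier mode of f is divided by the eigenvalue of the discrete
    (central-difference) Laplacian. The zero mode is set to zero, which
    fixes the pressure's arbitrary additive constant.
    """
    ny, nx = f.shape
    kx = 2 * np.pi * np.fft.fftfreq(nx)   # wavenumbers per grid index
    ky = 2 * np.pi * np.fft.fftfreq(ny)
    # Eigenvalues of the 5-point Laplacian: (2 cos k - 2)/h^2 per direction.
    lam = ((2 * np.cos(kx[None, :]) - 2) / dx**2 +
           (2 * np.cos(ky[:, None]) - 2) / dy**2)
    fh = np.fft.fft2(f)
    fh[0, 0] = 0.0    # enforce a zero-mean solution
    lam[0, 0] = 1.0   # avoid division by zero at the zero mode
    return np.real(np.fft.ifft2(fh / lam))
```

Because the eigenvalues are those of the finite-difference operator rather than the exact (-k²) spectrum, the solver inverts exactly the same Laplacian that the roll-based stencils apply.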

5.4.3

Implementing the Interpolation Step

The delta-function stencil is a 4x4 stencil with uniform x and y values, which are multiplied together and then by the field values. In his code, Peskin conducted this interpolation in the following manner [8]:

1. Calculate the location of the point, the radius, and other necessary parameters.

2. Iterate through all of the points.

3. Multiply the stencil by the field values and take the total sum.

For a quantity of 1000 points, the time to execute such a loop would be very large. Therefore, it is important to vectorize these calculations and avoid looping, to produce quick iterations. To do this, the stencil should be examined: it is a combination of 16 coefficients multiplied by 16 corresponding values from the field. For each immersed solid point there are 4 unique delta values in the x-dimension and 4 unique delta values in the y-dimension. Therefore, for each immersed solid point, two 4 x 4 arrays are generated, one with the x-values uniform across the rows and one with the y-values uniform across the columns. Once obtained, the x-value arrays are stacked on top of each other and the y-value arrays aligned next to each other using the Numpy.column-stack and Numpy.row-stack functions, respectively. This gives two 4n x 4 matrices. To get the full delta values, another set of matrices, holding the x-values and y-values of the corresponding points, is generated. Now the Numpy.flatten function is used on the arrays to convert them all to 1-dimensional vectors (i.e., flatten(2 x 16 array) = 1 x 32 vector). Numpy lends motivation to this idea, as an array of values can be retrieved from a variable if a 1-dimensional array or list (multiple dimensions are not supported) is passed as the index. The delta values become relatively easy to work with, as the list of relevant field values is multiplied by the x- and y-delta vectors. The resulting values are then taken, and after using the Numpy.reshape function to convert them back to n x 16 matrices, the Numpy.sum function is executed across axis 1 (horizontal, or row-wise) to add up all of the values and return the relevant u and v velocities for each solid point.

5.4.4

Implementing the Forcing Field

The projection of the forcing field onto the grid is similar to the interpolation step, except that in this case values are passed to, instead of taken from, the grid. Assuming that the force has been calculated for all solid points, the force value at each solid point needs to be projected to the surrounding gridpoints using the delta function. In addition, this property is additive, meaning that other points in the vicinity might be affecting the same gridpoint, so the forces need to be added together. First, a force variable of the grid's size is initialized to 0 and converted to a 1-d vector via the Numpy.flatten command. The same delta-stencil and global-location arrays are built as in the interpolation step. However, instead of retrieving values from the grid, similarly-sized field-value arrays are created by repeating the force values in sets of 16; therefore, if the array is [1, 2, 3, ...], the new array will have the 0-15th indices corresponding to 1, the 16-31st indices corresponding to 2, and so forth. These are multiplied by the delta-matrices, yielding the stencil values. The global-location values are then used to initialize a defaultdict dictionary pointing to a list. A defaultdict is an object in Python that allows one to place values under certain keys based on an object type, like a list or a float. This suits the need well, since grouping by location is desired. A Python generator (which takes much less time than a for-loop, since it does not create the object in memory) iterates through the global-location vector and places the stencil values with their appropriate locations. The stencil values are then summed at each point using another defaultdict, this one initialized to a float (think of this as the reduce portion of map-reduce). Since the global locations are stored as the dictionary keys and the force magnitudes as their values, passing these to the force grid is an easy process. Both keys and values can be isolated as lists, and the relevant gridpoints can be augmented by passing the keys directly to the force grid (Force-grid[keys]) and adding the values (Force-grid[keys] += values). The reshape function is then used to reshape the force grid to the size of the domain.
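The grouping-and-summing reduction can also be expressed with numpy's add.at, which accumulates repeated indices directly; this is an alternative sketch, not the defaultdict implementation described above:

```python
import numpy as np

def spread_forces(flat_ids, stencil_vals, grid_shape):
    """Scatter-add delta-stencil contributions onto the force grid.

    flat_ids:     flattened gridpoint indices covered by the 4x4 stencils
                  of the solid points (16 entries per point, as above).
    stencil_vals: the matching delta-weighted force values.

    np.add.at accumulates duplicate indices, so overlapping stencils
    from neighboring solid points add together, as the method requires.
    """
    force = np.zeros(int(np.prod(grid_shape)))
    np.add.at(force, flat_ids, stencil_vals)
    return force.reshape(grid_shape)
```

Note that a plain `force[flat_ids] += stencil_vals` would silently drop duplicate contributions, which is exactly the case the defaultdict grouping (and add.at) exists to handle.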

6

Code Refinements and Optimization

6.1

The Variable-Spring Model

6.1.1

Motivation

In the immersed solid method, the outermost layer of solid points is responsible for breaking the flow as the object moves. Consequently, these points are also the ones that shift position the most (due to fluid forces) and are thereby most likely to begin a chain of displacement within the layers of surrounding points. The easy solution would be to raise all the spring constants to massive levels; however, this is not feasible, since the object would be destroyed by a massive spring force at the beginning of its motion. By raising the stiffness of those points in areas with fewer solid points, however, more force can be effected by those points, compensating for the compounding effect of having multiple points forcing the same gridpoint. Raising stiffness also ensures that the solid points will closely follow the prescribed points, with higher forces being the penalty for widening distances. Therefore, the variable-spring model is proposed.

6.1.2

Underlying Principle

In the variable-spring model, spring constants are inversely proportional to the number of surrounding points. The reason traces back to Peskin's delta function. Since it is a 4x4 stencil, neighboring points are more than likely to overlap on the same gridpoints; as a result, their forces compound, applying a much stronger spring force than an individual point alone. However, if a point is rather secluded in the geometry (on the surface, or at the pointed end of an airfoil), that point must have its spring constant raised to compensate for having fewer surrounding points while also having to deal with the boundary layer.


6.1.3

Algorithm

In the variable-spring model, the implemented algorithm is as follows:

1. Produce the distance vector from one point to all of the points in the object.

2. Run a logic statement to find those within a specified radius.

3. Sum up the logical 1 values to find the total.

4. Repeat the above for all points in the solid.

The user prescribes a slope to apply based on the number of surrounding points, along with an initial κ_o. Let Max denote the largest number of solid points within the specified radius of any solid point, and Surr the number of solid points within the specified radius of the specific solid point being dealt with. Let m denote a slope constant prescribed by the user to specify how much the spring constant should be raised for every point lying in the vicinity. Once the maximum number of surrounding points is identified, assign the spring constant:

κ_i = κ_o (1 + m(Max - Surr + 1))
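A vectorized sketch of the variable-spring assignment (argument names are illustrative):

```python
import numpy as np

def spring_constants(points, radius, kappa0, m):
    """Assign per-point spring constants by the variable-spring rule
    k_i = k0 * (1 + m * (Max - Surr + 1)).

    points: (n, 2) solid-point coordinates; radius: neighborhood radius;
    kappa0: baseline spring constant; m: user-prescribed slope.
    """
    points = np.asarray(points, float)
    # Pairwise squared distances; each point counts itself as a neighbor,
    # which cancels out in the (Max - Surr) difference.
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    surr = (d2 <= radius ** 2).sum(axis=1)
    return kappa0 * (1 + m * (surr.max() - surr + 1))
```

Points with the most neighbors receive the minimum stiffening, kappa0 * (1 + m), while more isolated points are stiffened proportionally more.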

6.2

Parallelization

While the algorithm itself is not optimized for speed, it is easily parallelizable, mainly because the operations employed in the solution involve basic arithmetic steps on data from multiple points. Since the algorithm has to be repeated for all points, the process can be executed by n processors if the points are split into n groups. Each processor then works on its group and returns its values. Things are made easier by the MPI scatter and gather functions, which allow the groups to be sent to their respective processors and then returned in the right order, respectively. Therefore, there is no issue with synchronization or the order of retrieval. The values are simply passed out, the function executed, and the outputs gathered and concatenated into a 1-d vector with length equal to the number of immersed solid points. The algorithm above was initially implemented in serial code. However, it took considerably long to run, even for the most basic cases. Therefore, the focus shifted to optimizing the code via parallel processing. To this end, 2 options existed: using MPI or Nvidia's CUDA GPU computing platform.


6.2.1

Evaluation of MPI

If MPI were used, the structure of the program would be as follows. For the interpolation scheme, the immersed solid points are scattered (broken up into n arrays for n processors to work on) among the different processors, and the velocities are gathered (the processors' computed values collected) back. For the force projection, the force grid (as an array of zeros) would be broadcast (the same copy sent) to all processors, and each processor would add its projections to the grid. The outputs would be gathered and summed to obtain the grid values. The calculation of the intermediate velocity could be implemented via a domain decomposition method with ghost-cell transfers. The most difficult step would be the Poisson equation, as this would require an MPI version of BICGSTAB to be implemented. The spectral Poisson solution would be a waste to implement via MPI, as the FFT is essentially a global operation. An implementation of an FFT algorithm might involve some sort of master-slave scheme, where one processor serves as the distributor of data to be processed. As the other processors execute jobs, the central processor retrieves data from the completed processors and provides new data to be processed. This continues until the full operation is complete. Once done, the central processor would have to transpose the matrix and then pass out new arrays to have the FFT run again. This would require a lot of code to implement, and it might not even offer a speedup. Given the goals of this project, this would detract from the malleability of the code while also preventing it from running faster.

6.2.2

Evaluation of CUDA

If CUDA was used, the structure would be as follows. CUDA grants control of an individual thread, of which there are millions on a gpu, enabling the grassroots control of each grid point value. Therefore, the code can be parallelized at a level which would not be possible on MPI (or would be possible, but would require a vast amount of resources and code). For the interpolation code, one point could be assigned to each node, whose job would entail computing the full stencil for that specic point and return the velocity, eliminating the need for vectorization. For the forcing implementation, the same processed would be used but the values would be stored in an array with a correspoinding global id array, so that a group of threads could run the reduction very eciently. The intermediate velocity calculation could also be run very quickly, as each thread needs to read the values from the cells surrounding it and execute two lines of operations. The Poisson equation can be solved using FFT libraries that exist with Python bindings to CUDA. The velocity correction


could be executed in a manner similar to the intermediate velocity calculation step. The source code required for CUDA (though more complicated) would be concise, and it would also help in another way. With CUDA, gpuarrays (pointer references to arrays) are allocated and left in device memory, avoiding the necessity of passing memory back and forth between the host and device (this can be avoided in MPI too, but it would take much longer to implement and be much more complicated than the CUDA code).

6.2.3 Going with CUDA

Having thought about both approaches, the CUDA implementation appeared to be more feasible. It would be cleaner and more effective, allowing a much lower-level approach than MPI. While it wouldn't allow functions like Numpy.roll to be used, it would provide a greater speedup by allowing thread-based approaches.

Figure 11: Technical Structure of CUDA [3]

The technical structure of an Nvidia GPU worked quite well with the solution method employed by aeroCuda to solve the Navier-Stokes equations. The structure is detailed in Figure 11. Each Nvidia GPU contains 3 levels of operation: the grid, the block, and the thread. The hierarchy, as shown in the relevant figure, functions as follows:

1. Thread: This is the lowest level of the hierarchy. It functions as a worker for executing the functions and can access local, shared, or global memory. Local memory can only be accessed by its own thread.

2. Block: This is a group of threads that function together. Blocks are important because shared memory can be accessed by all threads in a block; it is also quicker to read and write than global memory. A block can accommodate 32 × 32 threads.

3. Grid: This is a group of blocks that forms the basis of the computational grid. Only global memory exists at this level.

It is also important to recognize that the GPU is separate from the CPU or computing platform, so memory must be allocated on the GPU to hold the computed data. The PyCUDA package developed by Klöckner does exactly that and more. [5] Klöckner's gpuarray module allows for the creation of arrays on the GPU that have properties similar to Numpy arrays but also allow operations between arrays to be conducted on the GPU, providing a further speedup. The PyCUDA package allows CUDA to be engaged from a very high level while using functions optimized for the necessary operations. The projection method with the immersed solid formulation has 4 explicit steps and 1 implicit step. For the explicit steps, CUDA kernels (functions) can be written to execute them. The implicit step requires FFTs. Nvidia developed the cuFFT package to run FFTs using the CUDA programming structure; to use it from Python, the pyFFT package developed by Bogdan Opanchuk provides a binding with PyCUDA to pass gpuarray objects to cuFFT. The following sections describe the programming scheme. Worth noting is that CUDA takes an n-dimensional variable and decomposes it into a 1-d vector, with the indexing carried through at the block and thread level. Therefore, in the following outlines of the CUDA algorithm, all global variables/quantities (while they might be 2-dimensional) are actually 1-dimensional when transferred to the GPU.
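As a concrete sketch of this flattening, a row-major mapping between 2-d indices and the 1-d layout the GPU sees might look as follows (names and grid size are illustrative, not taken from aeroCuda itself):

```python
# Row-major flattening of a 2-d field into the 1-d layout used on the GPU.
# Names and grid size are illustrative only.

def flatten(i, j, ncols):
    """Map a 2-d index (i, j) to its 1-d position."""
    return i * ncols + j

def unflatten(gid, ncols):
    """Recover the 2-d index (i, j) from a 1-d position."""
    return gid // ncols, gid % ncols

# A 4 x 6 field stored as a flat list, as it would live in GPU global memory.
nrows, ncols = 4, 6
field = [10 * i + j for i in range(nrows) for j in range(ncols)]

# The flat index of point (2, 3) addresses the same value field[2][3] would.
assert field[flatten(2, 3, ncols)] == 23
assert unflatten(flatten(2, 3, ncols), ncols) == (2, 3)
```

The round trip is exact, which is what lets the kernels below treat "2-d" fields as flat vectors indexed by block and thread.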

6.3 Implementing the CUDA-optimized Structure

6.3.1 Implementing the Interpolation

In the case of interpolation, the vectorizing process is completely averted. Since n immersed solid points exist, n threads can carry out the interpolation scheme, one per point. The parameters for each point (x_r, y_r, r_x, r_y, etc.) are calculated in a similar fashion. However, for the stencils, each


point has a double for-loop that iterates through all of the possible indices. In each iteration, a new weight φ(r_x)φ(r_y) is calculated and multiplied by the relevant point, which the thread reads from the field variable (this is stored in global memory, since it must be available to all threads). Once the threads have completed, they write the interpolated values to an n-length vector.

6.3.2 Implementing the Forcing

Implementing the forcing is slightly more complicated than before. Multiple arrays are needed for this implementation. In one array, the global IDs of the force projections need to be stored (the mapping of global IDs is shown in Figure 12). For n points, this array has to contain 16n elements, to ensure that each projection is written to a different space. In addition, two arrays of the same 16n length have to be created to store the magnitudes of the corresponding forces. In the fourth and fifth arrays, the full force grids are assembled, storing the total forcing at each gridpoint (if it is actually forced; otherwise the value is just 0).

Figure 12: Thread-to-Point Mapping Diagram
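A minimal pure-Python sketch of this scatter-and-reduce pattern follows (serial loops stand in for the GPU threads, uniform placeholder weights of 1/16 stand in for the real delta-function stencil, and all names are illustrative):

```python
# Sketch of the forcing scatter: each solid point writes its 16 stencil
# contributions into private slots [16k, 16k+15] of a global-ID array and a
# value array, and a second pass reduces them onto the force grid. A uniform
# weight of 1/16 is a placeholder for the delta-function stencil.

def scatter_forces(points, forces, nx):
    gids, vals = [], []
    for (xr, yr), F in zip(points, forces):
        for di in range(-1, 3):           # 4 x 4 stencil around the reference
            for dj in range(-1, 3):
                gids.append((yr + di) * nx + (xr + dj))
                vals.append(F / 16.0)     # placeholder weight
    return gids, vals

def reduce_to_grid(gids, vals, ngrid):
    grid = [0.0] * ngrid                  # force grid, zeroed each iteration
    for g, v in zip(gids, vals):          # serial stand-in for the 16 threads
        grid[g] += v
    return grid

nx = 16
gids, vals = scatter_forces([(5, 5), (6, 5)], [1.0, 2.0], nx)
grid = reduce_to_grid(gids, vals, nx * nx)
# The reduction conserves the total force even when stencils overlap.
assert abs(sum(grid) - 3.0) < 1e-12
```

Because every point owns a private block of 16 slots, the scatter phase is race-free; only the reduction needs coordinated writes.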

In the first step, a double for-loop is engaged for each immersed solid point. For the i-th solid point, the loop writes to the [16i, 16i + 15] indices of the global ID and the corresponding force vectors. Therefore, all threads require 16 total iterations to get all the projected forces. The issue now becomes writing to the grid. In CUDA, a common issue is that of thread racing, whereby multiple threads try to write to the same global or shared memory location. If the accesses are not properly sequenced, multiple threads can write or read at the same time, resulting in wrong values being written or read. Therefore, all of the threads simply cannot write to the same location. Recall, however, that the stencil has 16 unique points; therefore, in the global ID vector, every run of 16 values is completely different, starting from

the beginning. Therefore, if 16 threads execute a for-loop over the n points, on iteration k the threads [0, 15] read the [16k, 16(k + 1)) indices of the corresponding force vector. They then take the [16k, 16(k + 1)) values of the global ID vector and augment the respective locations on the full forcing grid. The threads are synchronized using the __syncthreads() command to ensure that no thread begins the next for-loop iteration while it might interfere with the reads and/or writes of the threads still completing the previous one.

6.3.3 Implementing the Intermediate Velocity and Final Velocity Calculations

Since both steps involve explicit finite differencing, the task is fairly straightforward. Referring to the figure depicting the layout of the threads and the computational grid: so long as the number of gridpoints does not exceed the number of threads, every point will have a unique thread assigned to compute its value. Since the values are being stored in new (intermediate or final velocity) arrays, there is no issue with race conditions between threads. Therefore, the crux of the task at hand is to compute the proper indices of the points needed for computing the relevant center point's value. Since the dimensions of the grids and blocks are set by the user (in addition to the solver parameters), the index can be calculated either blockwise (as done in this code) or row/column-wise; it depends on the user's preference.
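A sketch of the blockwise index calculation a thread might perform (block and grid dimensions here are illustrative, not aeroCuda's actual launch configuration):

```python
# Blockwise global-index calculation, mirroring how a CUDA thread locates
# "its" gridpoint from its block and thread coordinates.

def global_index(bx, by, tx, ty, block_dim, grid_width):
    """Global (row, col) and flat id for thread (tx, ty) in block (bx, by)."""
    row = by * block_dim + ty
    col = bx * block_dim + tx
    return row, col, row * grid_width + col

# With 8x8 blocks on a 32-wide grid, thread (3, 2) of block (1, 2) owns:
row, col, gid = global_index(bx=1, by=2, tx=3, ty=2, block_dim=8, grid_width=32)
assert (row, col) == (18, 11)
assert gid == 18 * 32 + 11
```

Each thread can then read its four (or more) neighbors by offsetting the flat id by ±1 and ±grid_width.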

7 Results Obtained with aeroCuda

7.1 The Effect of Optimization

Loading the code onto the GPU removes a considerable portion of the runtime. The speedups are especially noticeable in the 1st, 2nd, and 4th steps, as shown in Table 3. In the 1st step, substituting the thread-based force projection for the vectorized projection appears to have provided the bulk of the speedup, since in the 5th step, where only the interpolation takes place, there is a much smaller speedup. The finite-differencing steps (2, 4) show a very high speedup as well, especially in the case of the 4th step. The discrepancy between these two steps might be the total number of global memory reads that must be made; since the 2nd step requires many more variables than the 4th step, it is possible that the variable reads form somewhat of a bottleneck. The actual time and speedup quantities for the simulations are listed in Table 3. For the serial code, the 1st and 2nd steps took the longest, while for the GPU the 1st and 3rd steps took the

longest. The issue behind this could be that for the serial code, the necessity of having multiple roll functions execute the partial derivatives resulted in a slowdown for the 2nd step. For the 1st step of the serial code, the forcing function was difficult to optimize beyond the vectorization that was done. For aeroCuda, the runtime for the 1st step was large as it involved the for-loop iteration necessary to place all the forces at their respective points on the grid. The Poisson equation, the 3rd step, took the second-longest to execute, yet still provided a good speedup over the 3rd-step runtime of the serial code. For an improvement to aeroCuda, a more robust algorithm for transferring forces to the grid would help shave some time off the 1st step.

Table 3: Simulation Speedup for Re 100 Case
Simulation  1st     2nd     3rd     4th     5th
Serial100   0.87s   0.67s   0.28s   0.33s   0.035s
GPU100      0.018s  0.008s  0.014s  0.004s  0.011s
Speedup     48.2    81.1    19.9    89.1    3.13

7.2 Numerical Confirmations

The results obtained are as expected in terms of magnitude, though they vary slightly from those obtained in other papers. The drag coefficients are shown in Table 4. Multiple sources are used to confirm the tests conducted in this paper. In particular, Henderson's work on studying the drag around a cylinder shows a graph of the drag coefficient as a function of Re [4]. All of the values fall within the expected regions according to that graph. For numerical confirmation, the experimental coefficient of drag obtained by Peskin and Lai is very closely matched by the Re 100 case [6]. In the Re 25 case, the coefficient of drag is on the higher end of the numerical studies presented by Saiki and Biringen, but is backed by other studies [9].

Table 4: Coefficient of Drag Mean and Standard Deviation
Simulation  Mean  Std   Previous Work
GPU 1000    1.53  0.35  1.5
GPU 100     1.4   0.19  1.44-1.54
GPU 25      2.24  1.07  1.54-2.26

The cylinders were run at the same conditions except velocity and timestep (details are given in Table 5). For the computational parameters, the spacings were Δx = 1/128 and Δy = 1/128, while the density was ρ = 1. The following table outlines the time-stepping, dynamic viscosity, and velocity parameters for the different simulations:

Table 5: Simulation Parameters
Simulation  u_x   Δt      ν
Re 1000     1     0.0001  0.0003
Re 100      1     0.001   0.003
Re 25       0.25  0.001   0.003

As outlined in the notes of Tryggvason, for the projection method implemented, the CFL-type condition was Δt < 2ν/|u|². [12] Given these constraints, the Re 25 and 100 cases can be executed at the same parameters. For the Re 1000 case, since the dynamic viscosity is 10× lower than in the Re 25 and 100 cases, the time step must be lowered significantly in order to get the best result. A timestep of 10⁻⁴ was satisfactory for obtaining a good coefficient of drag. In a conversation with Karl Helfrich, a CFD scientist at the Woods Hole Oceanographic Institution, two issues with the present code were noted. First, the method used to solve these problems is known as a direct numerical simulation; in effect, no modeling approximations are used, and refinement of the spacing (spatial and temporal) is relied upon to obtain solutions. Not only is this costly, but it cannot be used for all problems. Second, the projection method employed is purely first-order in time, and a more accurate advancement of the solution would most likely help in both stability and accuracy.
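Assuming the stability bound reads Δt < 2ν/|u|², the Table 5 timesteps can be checked against it directly:

```python
# Checking the simulation timesteps against the bound dt < 2*nu/|u|^2
# (parameter values taken from Table 5; the bound's form is assumed from
# the discussion above).

def dt_bound(nu, u_mag):
    return 2.0 * nu / u_mag**2

# Re 100: nu = 0.003, |u| = 1  ->  dt must be below 0.006; 0.001 qualifies.
assert 0.001 < dt_bound(0.003, 1.0)
# Re 1000: nu = 0.0003, |u| = 1  ->  bound is 0.0006; the 1e-4 step qualifies.
assert 0.0001 < dt_bound(0.0003, 1.0)
```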

7.3 Expected Physical Phenomena and Further Validation

In Figure 13, there is no vortex shedding occurring in the case of the cylinder at Re 25. This is because the phenomenon, known as von Kármán shedding, does not occur until about Re 50. The contours obtained here are also present in the paper by Saiki and Biringen, where the authors also simulate Re 25 flow around a cylinder. In the cases of the cylinders at Re 100 and Re 1000 (Figures 9 and 10, respectively), von Kármán shedding takes place. In particular, the vortices obtained in the Re 100 case take on a pointed shape similar to those in Peskin and Lai's paper at Re 150. The immersed solid method has the points traveling at the prescribed 1 m/s in both figures, as the velocity magnitude matches that specified by the color bar on the right-hand sides of both plots. The velocity magnitude was taken to be |u_mag| = (u² + v²)^(1/2).

7.4 A Closer Look at the Physical Response of the Immersed Solid

Looking at the drag forces in Figure 15, the oscillatory motion of the immersed solid points is evidenced by the nature of the drag forces. It should be noted that the forces listed here are in N/m, as the solution is 2-dimensional and not 3-dimensional. While this graph focuses on the converged portion of the drag force, initially one can expect to see a damped-harmonic-oscillator response from the system, where both the drag and lift forces gradually converge to their steady-state value(s). To measure the oscillations' magnitudes, the means and the standard deviations are computed to provide a better understanding of the steady-state behavior. In the case of the cylinder at Re 25, it can be assumed that the force is in the 0.02-0.025 range, while for the Re 100 and 1000 cases, the force is in the 0.2-0.24 range. The forces should not be very different for these last two cases, as their coefficients of drag are relatively close to each other.
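Assuming the standard 2-d definition Cd = 2F/(ρ u² D) and a cylinder diameter of D = 0.3 (an assumed value, not stated in this section), the quoted forces are consistent with the coefficients in Table 4:

```python
# Relating the per-unit-span drag force to the drag coefficient via
# Cd = 2*F / (rho * u^2 * D). The diameter D = 0.3 is an assumption made
# for illustration; rho = 1 and u = 1 follow the Re 100 case.

def drag_coefficient(force, rho, u, diameter):
    return 2.0 * force / (rho * u**2 * diameter)

# A mean drag force of about 0.21 N/m maps to Cd = 1.4, matching Table 4.
cd = drag_coefficient(0.21, rho=1.0, u=1.0, diameter=0.3)
assert abs(cd - 1.4) < 1e-9
```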

7.5 Physical Location of the Immersed Solid Points

Looking at Figure 16, the color bars indicate the magnitude of the displacement of solid points from their prescribed counterparts. In the diagrams, the point dispersion goes from best to worst as Re 25, 1000, and 100. This should be expected, as the Re 25 case faces the lowest velocity, while in the Re 1000 case a very high spring constant is used. In the case of the Re 100 cylinder, on the upper and lower surfaces that break the flow (right side of the cylinder), the points have shifted more than in the left half of the cylinder. While they might be moving along at the proper velocity, in moving out of position they might have slightly affected the expected drag value by applying forces to the fluid from their shifted positions. Since the coefficient of drag for the Re 100 case matched that achieved in other papers, the effect was negligible. Looking at all cases, no point was more than 1/2 of a grid spacing away from its intended position. This provides more confirmation that the flow was properly matched and that the object's structure held strong throughout the simulations, preventing a distortion of the flow around its surface.

8 Test Case: Swimmer in Glide Position

8.1 Overview

One of the underlying motivations behind developing this code was to apply it towards problems involving biological motion. In the study conducted by Von Loebbecke et al., the authors analyze the flow around a swimmer performing the dolphin kick in 3 dimensions. [1] The study is adapted to fit the present capabilities of this solver: the case of a swimmer in the glide position at constant velocity in 2 dimensions. The geometrical figure of the swimmer was obtained from the paper via an image-tracing mechanism in MATLAB. The outline was then provided to the aeroCuda code and the simulation parameters were specified.

8.2 Simulation Details

The swimmer outline obtained from the paper was scaled to about 1.7 m in length. The max width of the swimmer was 0.23 m. The outline consisted of 7998 points when it was taken from the paper. After being submitted to the point-cloud module with a spacing of 1/128, the final point cloud is shown in Figure 18. The grid size was set to 512 × 4096, with the spacing in both dimensions set to 1/128.0. The kinematic viscosity was set to 3×10⁻⁴ with the density set to ρ = 1000. The timestep was 10⁻⁴ s, well within the CFL condition range. The general spring constant was set to 5×10⁷, with a slope of 0.5 for the variable spring model. Figure 18 details the spring constants at all of the points around the body. Concerning the motion of the swimmer, the velocity was set to 1 m/s in the positive x-direction.

8.3 Simulation Results

The flow around the body of the swimmer is very similar to that around an airfoil, perhaps due to the streamlined nature of the body; the Reynolds number of the simulation was placed at 5666.7. Two major confirmations of the solution are presented, the plots for which can be found in Figures 19-21. First, the magnitude of shift visible in the immersed solid points is very minimal; the largest separation between a solid point and its prescribed point is less than one grid spacing. Second, the forces felt by the swimmer oscillate at steady levels; the drag force is concentrated around 60-70 while the lift force shows sinusoidal oscillations at steady periods. If the points have not shifted much, then the integrity of the body is intact and the flow around it can be deemed rather accurate. The force diagrams confirm this: had the points shifted dramatically, the forces would not stabilize around a mean value. Looking at the flow around the swimmer, the shedding of vortices is continuous at this stage, showing that steady state has been achieved.

8.4 Reynolds Number Transition

To further expand the analysis of the swimmer, the flow patterns at different Reynolds-number regimes are examined (found in Figure 22). These simulations are done by varying the kinematic viscosity of the problem at hand, the values of which are listed below. However, with higher-Reynolds-number problems, the spring constant κ as well as the timestep must be changed to accommodate the changing nature of the problem. Increasing κ ensures that points do not shift in the higher-Reynolds flows, while decreasing the timestep allows for much more accuracy to be obtained than at higher timesteps, in addition to satisfying the CFL condition.

Table 6: Swimmer at Different Reynolds Numbers
Flow    Reynolds Number  ν       Δt     κ
Low     566.7            3×10⁻³  10⁻³   10⁷
Medium  5666.7           3×10⁻⁴  10⁻⁴   5×10⁷
High    56666.7          3×10⁻⁵  10⁻⁵   10⁸
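The tabulated Reynolds numbers follow from Re = uL/ν, with u = 1 m/s and the 1.7 m body length given in the simulation details:

```python
# Reynolds numbers for the swimmer runs: Re = u * L / nu, with u = 1 m/s
# and body length L = 1.7 m.

def reynolds(u, length, nu):
    return u * length / nu

assert abs(reynolds(1.0, 1.7, 3e-3) - 566.7) < 0.1    # low case
assert abs(reynolds(1.0, 1.7, 3e-4) - 5666.7) < 0.1   # medium case
assert abs(reynolds(1.0, 1.7, 3e-5) - 56666.7) < 0.1  # high case
```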

At a low Reynolds number, the boundary layer should remain intact, which it does in the simulation run. At the medium Reynolds number, the shedding of vortices and a thinner boundary layer are to be expected, as the fluid is not as viscous. In the high-Reynolds-number case, the boundary layer separates and vortices are shed not just at the feet of the swimmer (as in the medium-Reynolds case) but also along the body. These diagrams were obtained at 5 s for the low and medium cases, and 4 s for the high case. Due to cluster compute-time limits, it was difficult to run long simulations, as holding a position longer than 4 hours resulted in dismissal from the cluster.

9 Conclusion

9.1 Numerical Improvements to aeroCuda

While the aeroCuda code provides good accuracy and efficiency, it can be optimized in a few critical areas to unlock its potential as a strong CFD code. The first of these is the numerical methods, including the algorithms behind the solutions and the governing principles of fluid dynamics. The first numerical improvement is updating the projection method implemented here to a more numerically accurate projection method. In particular, the algorithm here is first-order in time and second-order in space; however, in his notes Tryggvason presents a fourth-order Runge-Kutta method (in time) developed by Weinan E. [12] The difference between the methods should be noticeable at high Reynolds numbers, where the temporal discretization makes a difference. To improve the spatial accuracy, implementing higher-order finite-difference approximations would be a good start. A potential issue with higher-order numerical methods might be their stability; when the projection method was first implemented, higher-order approximations were tried for both first and second derivatives. However, implementing anything beyond second-order-accurate methods resulted in a deterioration of the solution. Of help might be expanding the delta-function stencil, which would also provide more accurate forcing of the

fluid around the immersed solid points. The delta function itself can also be improved in order of accuracy, which should help develop a more accurate forcing function.

9.2 Technical Improvements to aeroCuda

The numerical improvements above may prove difficult to code if the efficiency and optimization of aeroCuda's runtime is taken into consideration. For example, in order to increase numerical accuracy, more gridpoints would have to be worked with; this means that global memory reads would increase significantly. In the force-projection portion of step one, a large for-loop is executed by 16 threads to add forces to their respective points. These are just some areas which result in slowdowns to the code; if their execution times could be reduced by 30-40%, the overall runtime would improve considerably.

9.3 Capability Enhancements to aeroCuda

There are two areas which should be the next focus for aeroCuda development: expansion to 3-d and video processing. In the case of the expansion to 3-d, two issues manifest: memory transfer and thread execution. In 3-d, there will be a massive increase in the amount of memory consumed, simply because the domain is being extended to another dimension. Therefore, the transfer times of large chunks of memory (in the case of a cubic grid with 1000 gridpoints per dimension and 8-byte floats, 8 GB would be needed) would be very high. Moreover, for 1 billion threads to execute, there would be a noticeable increase in time; high quantities of executing threads do take more time. Perhaps a combination of MPI and CUDA could be used to execute the problem; however, data would have to be transferred to and from the GPU at every timestep. For video processing, there are probably multiple ways to implement this, the most direct being output of variables at every timestep. However, this would result in a massive and unrealistic memory requirement, especially with a large grid. If the GPU could be worked with at a deeper level, perhaps video processing could be made part of the solver process; if it slows the solver down, however, then perhaps it would not be wise to pursue this attribute.

9.4 Final Remarks

The immersed solid implementation developed in this paper proves to be reliable for the cases demonstrated here. It confirmed the observations made in other papers, and for situations at high Reynolds numbers, large force fluctuations were observed but the model itself appears to have worked well. This theory developed by Peskin can truly unlock the potential for efficient CFD

simulations of transient flow problems; it has been demonstrated in others' works just as it has in this one. With improvements to the numerical methods and technical specifications of aeroCuda, it is hoped that this code can be of great use to researchers and students alike. Computational fluid dynamics is a difficult field, but one which still holds many secrets to be unlocked. Hopefully, aeroCuda will help shed light on some of them in the future.

10 Finances

10.1 Resources Used

To develop aeroCuda, the Enthought Python distribution was used. This license is free for students and academic organizations but requires a fee for those in industry. The Nvidia GPU used for these computations was a Tesla C2070, which retails for 2111.85 dollars at SabrePC. This GPU was accessed through the Resonance cluster at the Harvard School of Engineering and Applied Sciences. Therefore, the total budget of the project was 0 dollars.

10.2 Upgrades

With an expansion to a 3-d case, the memory transfer times will drastically increase. As a result, GPUs that are capable of transferring larger amounts of data at very low runtime cost should be sought out. The author's knowledge of the GPU market is limited, but more technical users might be able to target an optimal GPU on which to run aeroCuda.

11 Appendix

11.1 Solving the Immersed Solid-influenced Navier-Stokes Equations

11.1.1 Step 1: Force Projection [7]

Let us begin by defining our prescribed points as (x_p, y_p). These points are given some analytical function (u_p(t), v_p(t)) by the user that defines their motion through space. We then define the Eulerian co-points as (x_b, y_b). These points retrieve their motion (u_b, v_b) from the velocities that are calculated by the grid. To obtain these velocities, Peskin's delta-function method is used. In the delta-function stencil, a reference point must be chosen for each boundary point. On the


stencil, the reference point is located at the [0, 0] location, with stencil indices in both the x- and y-directions defined over the range [-1, 2]. Assuming the spacing is the same along the x-axis and y-axis, the reference gridpoint's index is attained by rounding down:

x_r = ⌊x_b / Δx⌋,   y_r = ⌊y_b / Δy⌋

We then obtain the fractional displacements between (x_r, y_r) and (x_b, y_b):

r_x = x_b/Δx − x_r,   r_y = y_b/Δy − y_r

The delta-function stencil is 4 × 4 in size. Each position on it is a function of (r_x, r_y):

δ(x, y) = φ(r_x) φ(r_y)

The purpose of the delta function is to blend the value of the force to the surrounding gridpoints, or to derive the velocity at a certain point from the surrounding gridpoint velocities. This is integral to the formulation, as it allows information (such as force and velocity) to be transferred between the immersed solid points and the gridpoints. The phi function is taken to be:

φ(r + 1) = (3 − 2r − √(1 + 4r − 4r²)) / 8
φ(r)     = (3 − 2r + √(1 + 4r − 4r²)) / 8
φ(r − 1) = (1 + 2r + √(1 + 4r − 4r²)) / 8
φ(r − 2) = (1 + 2r − √(1 + 4r − 4r²)) / 8
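As a sanity check on these formulas, the four weights for any displacement r in [0, 1) should sum to 1 (a small sketch in Python):

```python
# The four delta-function weights for a displacement r in [0, 1), following
# the phi formulas above. Their defining property is that they sum to 1.
from math import sqrt

def phi_weights(r):
    s = sqrt(1.0 + 4.0 * r - 4.0 * r * r)
    w_m1 = (3.0 - 2.0 * r - s) / 8.0   # offset -1, i.e. phi(r + 1)
    w_0  = (3.0 - 2.0 * r + s) / 8.0   # offset  0, i.e. phi(r)
    w_p1 = (1.0 + 2.0 * r + s) / 8.0   # offset +1, i.e. phi(r - 1)
    w_p2 = (1.0 + 2.0 * r - s) / 8.0   # offset +2, i.e. phi(r - 2)
    return [w_m1, w_0, w_p1, w_p2]

for r in (0.0, 0.25, 0.5, 0.9):
    assert abs(sum(phi_weights(r)) - 1.0) < 1e-12
```

The square-root terms cancel pairwise, so total force and velocity are conserved by the transfer.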

The stencil is given in Table 7. Define xR = φ(r_x) and yR = φ(r_y). The top row of indices is j and the leftmost column of indices is i; each entry of the 4 × 4 stencil is the product of a y-factor (depending on i) and an x-factor (depending on j).

Table 7: Delta Stencil
Index  x-factor (columns j)   y-factor (rows i)
-1     (6 − 4r_x)/8 − xR      (6 − 4r_y)/8 − yR
 0     xR                     yR
 1     (4r_x − 2)/8 + xR      (4r_y − 2)/8 + yR
 2     0.5 − xR               0.5 − yR

For example, the (i, j) = (0, 0) entry is xR·yR, and the (i, j) = (2, 2) entry is (0.5 − yR)(0.5 − xR).

We then obtain the interpolated values u_b, v_b using the stencil coefficients and the u, v of the surrounding gridpoints. Denote the above stencil as the function s(i, j, r_x, r_y):

u_b = Σ_{i,j=−1}^{2} u(y_r + i, x_r + j) · s(i, j, r_x, r_y)
v_b = Σ_{i,j=−1}^{2} v(y_r + i, x_r + j) · s(i, j, r_x, r_y)
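A minimal sketch of this interpolation in pure Python; since the weights sum to 1 in each direction, interpolating a constant field must return that constant exactly:

```python
# Interpolating u_b at an immersed point from the surrounding 4x4 gridpoints,
# with the stencil s(i, j) built as the tensor product of x- and y-weights.
from math import sqrt, floor

def phi_weights(r):
    s = sqrt(1.0 + 4.0 * r - 4.0 * r * r)
    return [(3 - 2 * r - s) / 8, (3 - 2 * r + s) / 8,
            (1 + 2 * r + s) / 8, (1 + 2 * r - s) / 8]

def interpolate(u, xb, yb, h):
    xr, yr = floor(xb / h), floor(yb / h)     # reference gridpoint
    rx, ry = xb / h - xr, yb / h - yr         # fractional displacements
    wx, wy = phi_weights(rx), phi_weights(ry)
    ub = 0.0
    for i in range(-1, 3):                    # 4x4 delta-function stencil
        for j in range(-1, 3):
            ub += u[yr + i][xr + j] * wy[i + 1] * wx[j + 1]
    return ub

h = 1.0 / 8.0
u = [[2.5] * 16 for _ in range(16)]           # constant velocity field
assert abs(interpolate(u, 0.43, 0.61, h) - 2.5) < 1e-12
```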

Once we have the velocities, we can calculate the total force. In the harmonic-oscillator forcing developed by Saiki and Biringen, κ is the spring constant and β is the damping coefficient. The motion (u_p, v_p) is known to us, as we prescribed it. Therefore, we have for the k-th boundary point:

F_{x,k} = κ(x_{p,k} − x_{b,k}) − β(u_{b,k} − u_{p,k})
F_{y,k} = κ(y_{p,k} − y_{b,k}) − β(v_{b,k} − v_{p,k})

With the force per point calculated, we now need to project it to the surrounding gridpoints. This operation is done in a manner similar to the interpolation step. Instead of aggregating the values through a summation, though, the values are added to their respective locations on a grid. Therefore, let fx, fy represent the force-field terms in both dimensions in the modified Navier-Stokes equations. We initialize them to 0 at each iteration, and then do the following for the k-th boundary point, for all i, j = −1, …, 2:

fx(y_{r,k} + i, x_{r,k} + j) += F_{x,k} · s(i, j, r_x, r_y)
fy(y_{r,k} + i, x_{r,k} + j) += F_{y,k} · s(i, j, r_x, r_y)
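A matching sketch of the force spreading; because the 16 stencil weights sum to 1, the total force deposited on the grid equals the point force:

```python
# Spreading the force at one immersed point onto the 4x4 surrounding
# gridpoints with the delta-function weights defined earlier.
from math import sqrt, floor

def phi_weights(r):
    s = sqrt(1.0 + 4.0 * r - 4.0 * r * r)
    return [(3 - 2 * r - s) / 8, (3 - 2 * r + s) / 8,
            (1 + 2 * r + s) / 8, (1 + 2 * r - s) / 8]

def spread_force(fx, F, xb, yb, h):
    xr, yr = floor(xb / h), floor(yb / h)
    rx, ry = xb / h - xr, yb / h - yr
    wx, wy = phi_weights(rx), phi_weights(ry)
    for i in range(-1, 3):
        for j in range(-1, 3):
            fx[yr + i][xr + j] += F * wy[i + 1] * wx[j + 1]

h = 1.0 / 8.0
fx = [[0.0] * 16 for _ in range(16)]          # force grid, zeroed each step
spread_force(fx, 0.7, 0.43, 0.61, h)
assert abs(sum(map(sum, fx)) - 0.7) < 1e-12   # total force is conserved
```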

Now we move to solving the equations via the projection method.

11.1.2 Step 2: Calculating the Intermediate Velocity Field [12]

Previously, we established the force fields interacting with the main equations as fx, fy. Therefore, we now need to solve those equations. Let us define the primary field quantities:

u^n = <u^n, v^n> = primary velocity fields at time n
u* = <u*, v*> = intermediate velocity fields
u^{n+1} = <u^{n+1}, v^{n+1}> = final velocity fields
p = pressure field
f = <fx, fy> = forcing fields

The fully-modified Navier-Stokes equations are given by:

∂u^n/∂t + (u^n · ∇)u^n = −∇p + ν∇²u^n + f
∇ · u^n = 0

We begin by decomposing the equations via finite differences. The time derivative is represented through forward Euler, and all other derivatives are obtained through a second-order central-difference scheme. [12] The first and second derivatives, when evaluated via centered differencing, are given as:

∂q/∂x |_{i,j} = (q_{i,j+1} − q_{i,j−1}) / (2Δx)
∂²q/∂x² |_{i,j} = (q_{i,j+1} − 2q_{i,j} + q_{i,j−1}) / (Δx)²

Note that the same procedure follows for the y-axis derivatives, with a change in the axis of differencing and the magnitude of the spacing. Applying these operators to the modified momentum equation, we obtain the following breakdown of the terms:

Time derivative: (u^{n+1}_{i,j} − u^n_{i,j}) / Δt

Viscous derivative: ν [ (u^n_{i,j+1} − 2u^n_{i,j} + u^n_{i,j−1}) / (Δx)² + (u^n_{i+1,j} − 2u^n_{i,j} + u^n_{i−1,j}) / (Δy)² ]

One term remains: the convective derivative. With a basic centered-differencing scheme it would be given as:

u^n_{i,j} (u^n_{i,j+1} − u^n_{i,j−1}) / (2Δx) + v^n_{i,j} (u^n_{i+1,j} − u^n_{i−1,j}) / (2Δy)

The above equation also applies for the v-field. Mattheus Ueckermann of MIT was consulted about the oscillations observed with the code when using the above centered-difference scheme. His explanation of the issue was that in the advection equation, a centered scheme does not allow information in the direction of the flow to be transmitted properly. For example, if the flow is negative, we need the value of the flux between cells j and j+1, as opposed to cells j−1 and j+1, which do not necessarily average to the proper value. Therefore, the centered-differencing operation was adjusted to the following upwind scheme:

u^n_{i,j} (u^n_{i,j+1} − u^n_{i,j}) / Δx,   u^n_{i,j} < 0
u^n_{i,j} (u^n_{i,j} − u^n_{i,j−1}) / Δx,   u^n_{i,j} > 0


The above equations also apply for the v-field. The idea is to look upwind for positive advection and downwind for negative advection. However, because this is a first-order approximation, the accuracy is not very good. To improve upon this, the CFD-Wiki website was consulted for the QUICK (Quadratic Interpolation for Convective Kinematics) formulation. The idea behind this implementation is that instead of relying on two points to find the derivative, four are used. For positive advection, 2 upwind points and 1 downwind point are used; for negative advection, 2 downwind points and 1 upwind point. Applying the QUICK algorithm yields the following formula for the convective derivative [2]:

u^n_{i,j} (0.375u^n_{i,j+1} + 0.375u^n_{i,j} − 0.875u^n_{i,j−1} + 0.125u^n_{i,j−2}) / Δx,   u^n_{i,j} > 0
u^n_{i,j} (−0.125u^n_{i,j+2} + 0.875u^n_{i,j+1} − 0.375u^n_{i,j} − 0.375u^n_{i,j−1}) / Δx,   u^n_{i,j} < 0
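A sketch of the QUICK convective term with the sign switch; on a linear field the stencil is exact, which makes a convenient check:

```python
# QUICK convective derivative u * du/dx at point j of a 1-d row, with the
# upwind-biased stencil switching on the sign of u.

def quick_convective(u, j, dx):
    uij = u[j]
    if uij > 0:
        dudx = (0.375 * u[j + 1] + 0.375 * u[j] - 0.875 * u[j - 1]
                + 0.125 * u[j - 2]) / dx
    else:
        dudx = (-0.125 * u[j + 2] + 0.875 * u[j + 1] - 0.375 * u[j]
                - 0.375 * u[j - 1]) / dx
    return uij * dudx

# On a linear field u(x) = 2x the scheme is exact: du/dx = 2, so u*du/dx = 2u.
dx = 0.1
u = [2.0 * k * dx for k in range(10)]
assert abs(quick_convective(u, 5, dx) - 2.0 * u[5]) < 1e-12
u_neg = [-1.0 * k * dx for k in range(10)]    # negative flow: downwind branch
assert abs(quick_convective(u_neg, 5, dx) - (-1.0) * u_neg[5]) < 1e-12
```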

These QUICK formulas also apply for the v-field. By incorporating more points into the analysis, a more accurate and stable solution is obtained; therefore, the QUICK formulation was used for evaluating the convective term. In the projection method, an intermediate velocity is inserted into the time derivative to isolate the pressure term on the left-hand side of the equation. Therefore, two equations are developed:

u*_{i,j} = u^n_{i,j} + Δt (−Convective Derivative + Viscous Derivative + fx_{i,j})
(u^{n+1}_{i,j} − u*_{i,j}) / Δt = −∇p

The same equations exist for the v-velocity field. The first equation is purely explicit and can be solved by decomposition through finite differences. The second equation is implicit and will yield two more equations to be solved in the subsequent steps.

11.1.3 Step 3: Calculating the Pressure Field [12]

To solve the linking equation, un+1 u = p t

45

the continuity equation is introduced and used to generate a pressure eld that imposes the divergence-free condition. The gradient operator is applied to the linking equation: ( By the divergence condition, un+1 u )= 2t
2

un+1 = 0. Therefore, the following Poisson equation is obtained:


2

p=

u t

We now need to solve for the pressure, $p$. The right-hand side may be computed explicitly; call it $U$. Express $p$ and $U$ in terms of their Fourier transforms:

$$p(x, y) = \sum_{n,m} \hat{p}_{n,m}\, e^{i(n\xi + m\eta)} \qquad U(x, y) = \sum_{n,m} \hat{U}_{n,m}\, e^{i(n\xi + m\eta)}$$

where $\xi = \frac{2\pi x}{L}$ and $\eta = \frac{2\pi y}{L}$. Then, taking the second derivatives (with the domain length $L$ taken as 1),

$$\nabla^2 p = \sum_{n,m} -4\pi^2 (n^2 + m^2)\, \hat{p}_{n,m}\, e^{i(n\xi + m\eta)}$$

and equating Fourier modes with $\hat{U}_{n,m}$ yields

$$-4\pi^2 (n^2 + m^2)\, \hat{p}_{n,m} = \hat{U}_{n,m} \qquad \Longrightarrow \qquad \hat{p}_{n,m} = -\frac{\hat{U}_{n,m}}{4\pi^2 (n^2 + m^2)}$$
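Assembled, this spectral solve can be sketched with NumPy's FFT routines (a sketch assuming a square periodic grid on a unit-length domain; the zero mode, where the denominator vanishes, is pinned to zero to fix the arbitrary pressure constant):

```python
import numpy as np

def pressure_poisson_fft(U):
    """Solve laplacian(p) = U on a periodic N x N unit-square grid by
    dividing each Fourier mode by -4 pi^2 (n^2 + m^2)."""
    N = U.shape[0]
    Uhat = np.fft.fft2(U)
    k = np.fft.fftfreq(N, d=1.0 / N)           # integer wavenumbers n, m
    nn, mm = np.meshgrid(k, k, indexing="ij")
    denom = -4.0 * np.pi**2 * (nn**2 + mm**2)
    denom[0, 0] = 1.0                          # avoid division by zero
    phat = Uhat / denom
    phat[0, 0] = 0.0                           # fix the mean of p to zero
    return np.real(np.fft.ifft2(phat))
```

For example, if $p = \cos(2\pi x)$, then $U = \nabla^2 p = -4\pi^2 \cos(2\pi x)$, and the solver recovers $p$ up to its (zero) mean.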
The pressure $p$ is then given by the inverse Fourier transform of the right-hand side. Thus, we simply need to compute the 2-d FFT of $U$, divide by the matrix of corresponding coefficients, and compute the 2-d iFFT (inverse FFT) to obtain the matrix for $p$.

11.1.4 Step 4: Calculating the Final Velocity Fields [12]

Now that the pressure field $p$ and intermediate velocity field $u^*$ have been calculated, the final velocity, as given by Tryggvason, can be obtained [12]:

$$u^{n+1} = u^* - \Delta t\, \nabla p$$

To do this, the prior equation is decomposed via finite differencing as in Step 2:

$$u^{n+1}_{i,j} = u^*_{i,j} - \Delta t\, \frac{p_{i,j+1} - p_{i,j-1}}{2\Delta x}$$

$$v^{n+1}_{i,j} = v^*_{i,j} - \Delta t\, \frac{p_{i+1,j} - p_{i-1,j}}{2\Delta y}$$
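The correction step could be sketched as follows (a sketch; periodic boundaries via `np.roll` are an assumption, with axis 1 taken as the j/x index to match the stencils above):

```python
import numpy as np

def correct_velocity(u_star, v_star, p, dt, dx, dy):
    """Subtract dt * grad(p) from the intermediate fields, with the
    pressure gradient evaluated by centered differences."""
    dpdx = (np.roll(p, -1, axis=1) - np.roll(p, 1, axis=1)) / (2.0 * dx)
    dpdy = (np.roll(p, -1, axis=0) - np.roll(p, 1, axis=0)) / (2.0 * dy)
    return u_star - dt * dpdx, v_star - dt * dpdy
```

A uniform pressure field has zero gradient, so it leaves the velocities unchanged, as expected.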
11.1.5 Step 5: Interpolation and Velocity [7]

We follow the same interpolation procedure used in Step 1 to obtain the boundary velocities:

$$u_b = \sum_{i,j=1}^{2} u_{(y_r+i,\, x_r+j)}\, s(i, j, r_x, r_y) \qquad v_b = \sum_{i,j=1}^{2} v_{(y_r+i,\, x_r+j)}\, s(i, j, r_x, r_y)$$

Using the Peskin method of forward Euler, we update the positions of the solid and the prescribed points [8]:

$$x^{n+1}_b = x^n_b + u^n_b\, \Delta t \qquad y^{n+1}_b = y^n_b + v^n_b\, \Delta t$$

$$x^{n+1}_p = x^n_p + u^n_p\, \Delta t \qquad y^{n+1}_p = y^n_p + v^n_p\, \Delta t$$

We now have the new locations of the points and can proceed to the next iteration of our solution.
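Step 5 (interpolate the grid velocity to a boundary point, then advance its position with forward Euler) could be sketched as follows, taking the weight function $s(\cdot)$ to be the bilinear weights (an assumption made for brevity; Peskin's method properly uses a smoothed discrete delta function):

```python
import numpy as np

def interpolate_to_point(u, xb, yb, h):
    """Interpolate the grid field u (indexed u[y, x], spacing h) to the
    point (xb, yb) using the 2x2 stencil of surrounding nodes."""
    xr, yr = int(xb / h), int(yb / h)      # lower-left node indices
    rx, ry = xb / h - xr, yb / h - yr      # fractional offsets in [0, 1)
    return ((1 - ry) * (1 - rx) * u[yr, xr] +
            (1 - ry) * rx       * u[yr, xr + 1] +
            ry       * (1 - rx) * u[yr + 1, xr] +
            ry       * rx       * u[yr + 1, xr + 1])

def advect_point(xb, yb, ub, vb, dt):
    """Forward-Euler update of an immersed-boundary point's position."""
    return xb + ub * dt, yb + vb * dt
```

Since the four weights sum to one, a constant field interpolates exactly, and a field linear in x is recovered exactly at the query point.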

References
[1] Alfred von Loebbecke, Rajat Mittal, Russell Mark, and James Hahn. A computational method for analysis of underwater dolphin kick hydrodynamics in human swimming. Sports Biomechanics, 8(1):60–77, March 2009.

[2] CFD-Wiki. Linear schemes – structured grids.

[3] Nvidia Corporation. NVIDIA CUDA C Programming Guide, November 2011.

[4] Ronald Henderson. Details of the drag curve near the onset of vortex shedding.

[5] Andreas Klöckner, Nicolas Pinto, Yunsup Lee, Bryan C. Catanzaro, Paul Ivanov, and Ahmed Fasih. PyCUDA: GPU run-time code generation for high-performance computing. CoRR, abs/0911.3456, 2009.

[6] Ming-Chih Lai and Charles S. Peskin. An immersed boundary method with formal second-order accuracy and reduced numerical viscosity. Journal of Computational Physics, 160(2):705–719, 2000.

[7] Charles Peskin. The immersed boundary method in a simple special case.

[8] Charles Peskin. Tar file of MATLAB programs.

[9] E. M. Saiki and S. Biringen. Numerical simulation of a cylinder in uniform flow: Application of a virtual boundary method. Journal of Computational Physics, 123(2):450–465, 1996.

[10] Tao Tang. Moving mesh methods for computational fluid dynamics.

[11] Timothy Tautges. Mesh generation.

[12] Grétar Tryggvason. Solving the Navier–Stokes equations in primitive variables I, Spring 2010.

Figure 13: Vorticity Contours at Different Reynolds Numbers

Figure 14: Velocity Magnitude at Different Reynolds Numbers

Figure 15: Forces at Different Reynolds Numbers

Figure 16: Immersed Solid Point Dispersion at Different Reynolds Numbers

Figure 17: Discretization of the Swimmer

Figure 18: Variable Spring Model of the Swimmer

Figure 19: Point Shift of the Swimmer

Figure 20: Forces on the Swimmer

Figure 21: Flow Around the Swimmer at T = 25 s

Figure 22: Flow Transition Dependent on Reynolds Number
