FULL GPU Implementation of Lattice-Boltzmann Methods with Immersed Boundary Conditions for Fast Fluid Simulations

Lattice Boltzmann Method (LBM) has shown great potential in fluid simulations, but performance issues and difficulties to manage complex boundary conditions have hindered a wider application. The upcoming of Graphic Processing Units (GPU) Computing offered a possible solution for the performance issue, and methods like the Immersed Boundary (IB) algorithm proved to be a flexible solution to boundaries. Unfortunately, the implicit IB algorithm makes the LBM implementation in GPU a non-trivial task. This work presents a fully parallel GPU implementation of LBM in combination with IB. The fluid-boundary interaction is implemented via GPU kernels, using execution configurations and data structures specifically designed to accelerate each code execution. Simulations were validated against experimental and analytical data showing good agreement and improving the computational time. Substantial reductions of calculation rates were achieved, lowering down the required time to execute the same model in a CPU to about two magnitude orders.


INTRODUCTION
Fluid simulation is a computation-intensive task that can generally consume huge amounts of time.However, recent hardware architectures can lead to substantial increases in performance through many-core processors if appropriate schemes are applied.Many fluid simulation models, like Lattice Boltzmann Methods, can be greatly accelerated by the use of Graphic Processors Units (GPU).
These processors are optimized to execute single-instruction multiple-data operations, namely a simple kernel over each element of a large set simultaneously, operating as a coprocessor of the host CPU [1].Initially, developers could take advantage of the GPU power through the graphics pipeline by means of shading languages [2].Currently, there are parallel computing architectures for GPU programming using high-level languages [3].For example, NVIDIA and the Portland Group (PGI) have worked in cooperation to develop CUDA FORTRAN language [4].
In some cases the migration to GPU is a simple code translation, but in most cases a new algorithm must be implemented.In particular, a simple Lattice Boltzmann Method (LBM) fluid solver can be adapted to a GPU by translating some operations and replacing the main loops by multiple threads.Several researchers have shown that the combination of GPU and parallel LBM is an excellent tool for fast fluid simulations [5][6] [7].However, practical LBM applications require flexible management of boundary conditions and this cannot be achieved with basic LBM solver.
A more evolved method is needed to overcome boundary issues when the geometry of the problem turns complex, like the Immersed Boundary Method (IB).
An efficient iteration procedure for combining LBM with the Immersed Boundary method was presented showing good results for fluid-solid interactions while keeping a flexible implementation [8].Recent publications show promising results for the IB-LBM coupling on GPU but, in most implementations, only LBM is run on GPU hardware leaving the IB part for the CPU.Other solutions [9] use a single-step explicit IB algorithm [10] to couple with LBM on GPU code.
In the present work, the algorithm combination was refined and implemented in GPU, which considerably reduce simulation time.

Lattice Boltzmann Method
LBM is basically a mesoscopic kinetic model with a discrete internal velocity variable, whose average magnitudes obey some macroscopic field equations [11].The method represents the fluid by a set of particle populations that move between cells over a regular grid.Fluid behavior is achieved by operations between these populations on each cell locally.The most common model for 2D simulations is D2Q9 [12] that uses a square lattice with 9 velocity directions (Fig. 1).In the present work, this model is used for solving 2D Navier-Stokes equations [13].
The population particles in node x at time t having velocity eα, denoted by fα(x,t) follow the evolution functions where the equilibrium function fα (0) (x,t) determines the macroscopic equations that the automata simulates.Equations ( 1) and ( 2) are called the collision and streaming step respectively.In the called D2Q9 grid model the index α spans over nine discrete directions (see Fig. 1).
The equilibrium function to simulate the Navier-Stokes through the Bhatnagar-Gross-Krook (BGK) collision operator can be expressed as [11] (3) are the macroscopic density and mass flux respectively.The fluid viscosity can be controlled via the relaxation parameter τ, as where e is the lattice velocity, δx/δt.For problems with external forces, the following term is added to the right side of (2) [14], as (7) (8) where fbij is the force applied to each ij-cell.The external forcing term gα given by ( 7) and (8) has first order convergence [15], which limits the solution to problems with slow moving boundary or flexible boundary with small pressure gradient.For fast moving boundaries or flexible boundaries with large pressure gradients, a higher order method is needed.Chen and Doolen [11] proposed an alternative implicit second-order convergence scheme, replacing (7) by (9) Owing to the use of this equation, the final numeric scheme becomes implicit.

The Immersed Boundary Method
The method of Immersed Boundary (IB) was initially developed to deal with flexible boundaries in finite elements method.The boundary is represented by a set of massless particles coupled by elastic forces to space points and between themselves, which moves with the surrounding fluid.Conversely, the force generated by distortions of the boundary is transferred to the fluid [16].F(s,t) at the boundary point X(s,t).Thus, F(s,t) is determined by the configuration of X(s,t) and it is transferred into the force term g in ( 8), which determines the flow velocity and pressure throughout domain Ωf.
For a boundary immersed in a viscous fluid the IB is given by the following set of equations [17] (10) where u is the flow velocity, ρ the fluid density, p the pressure, υ the fluid viscosity, fb the external force, X the boundary coordinate, s the boundary fiber length, U the boundary speed, x the fluid flow coordinate, Sf the boundary force generation operator, and δ(r) the Dirac delta function.Equations ( 10) and ( 11) are the Navier-Stokes equations with external force fb, while ( 12) and ( 13) are the IB equations.Equation ( 14) and the right part of ( 12) represent the fluid-boundary interaction.The discretized version of ( 12) and ( 14) using a regularized discrete delta function δh are (15) and (16) where h=Δx=Δy is the fluid node spacing and Δsk is the boundary segment length.The delta function δh is an influence distribution given by [17] ( The force density F induced by the boundary over the fluid is determined by the position of the nodes, and can be written in general as (19) where kc is the tension stiffness, kγ the bending rigidity, kf the fastening stiffness and Z the target position of the boundary.The discretized equations of ( 16) can be expressed as (20) Two values of Fk corresponding to time in t-1 and t are needed in every step to calculate Fk in an implicit way.

GPU implementation of LBM-IB
Since the CPU implementation of LBM with IB was done using FORTRAN programing language, the PGI CUDA FORTRAN compiler [4] was used for the full GPU implementation.

GPU implementation of LBM-IB
The main program is a CPU code that calls the GPU subroutines, called kernels.The basic LBM calculations are applied to each cell of the grid whereas the IB calculations are applied to each point in the boundary [8].Many authors have suggested that the way to achieve best performances in GPU is by means of a single collide-stream loop [7], [18], [19], but this was studied in the case of LBM with standard boundary conditions.However, when LBM is combined with the IB method, there is a different situation.In effect, the IB coupling introduces an internal iteration that can be combined with the streaming step, but the collision step can be computed only once per time step [8].Following these execution structure, the proposed implementation of LBM-IB in CUDA pseudo-code is as follows:

Collide kernel
The collision kernel calculates the operator of the LBM scheme given by ( 1).The execution configuration used for the simulations is computed to maximize parallelism with a single thread per LBM cell.

Compute_IB_Points kernel
This kernel calculates the effect of the fluid cells on the IB.Each boundary point has a 5×5 matrix to store the weight of the force on the neighbor cells (18).The kernel uses a 3D matrix to store the boundary weight values for each boundary point.The matrix is called BW(dx,dy,k), where k is the boundary IB-node index and (dx, dy) is the cell position respect to the node.Each 2D sub-matrix has five cells in each direction, that is: [xn-2,yn-2 ; xn+2,yn+2] for boundary point contained in the [Xn, Yn] cell.
There is one GPU thread for each IB-point, which calculates the force exerted to the fluid by the node (15) and the new position of the node (16).The following pseudocode shows the kernel scheme.(9).The execution is performed with a single thread per cell.

Compute_Boundaries_And_Macroscopics kernel
This solution kernel runs one thread per cell.Each thread applies standard bounce-back boundary conditions [11] where necessary and calculates the macroscopic variables of the fluid, ρ and u.

Remarks
The main difference with other IB implementations designed for CPU like the presented in [8] is the convergence condition in the inner loop.Testing the mean error would imply an extra reduction [20], so the maximum error metric is used as a convergence condition.Also the BW(dx, dy, k) structure saves the time of computing the δ(x,y) of each IB point with its surrounding neighborhood each iteration multiple times.In recent GPU implementations, the IB part of the algorithm is processed in CPU [21] or a simplified explicit version is executed in GPU [22].

Memory usage
Because of the implicit definition of IB algorithms, it is necessary to have memory allocation for both actual time step and next time step.Variables must be minimized in GPU to have double memory space because of limited memory.In this case, the minimum variables are u, LBM forces and also IB Forces to check convergence.Other variables can be avoided to duplicate their memory space.

RESULTS
The described IB-LBM algorithm was tested in a GPU NVIDIA GeForce GTX 580 with 3GB DDR5 SDRAM hosted by an AMD PHENOM II X6 2.81GHZ CPU.Two flow cases were simulated to test the performance and accuracy of the scheme, namely, a Poiseuille flow and the vortex shedding behind a solid obstacle.

Poiseuille flow
The first case is a standard Poiseuille flow between two parallel plates subjected to constant pressure conditions at the inlet and the outlet.In fully developed flow, the velocity profile is parabolic and the shear stress acting on the fluid is exactly balanced by the pressure gradient (Fig. 3), which leads to an analytical solution (11).
Two straight immersed boundaries of 100 points each are located in a 110×61 lattice to represent the channel walls as shown in Fig. 4. The IB lines are separated 5.5 cells away from the grid limits, so they delimit a 50-cells wide channel between them.The initial particle density ρ0 is 1.0 and the density drop between the inlet and the exit of the channel Δρ is 10 -5 corresponding to Δρ/Δx = 1/3 10 -7 .The relaxation parameter τ is set in 1.5, corresponding to a cinematic viscosity ν = 1/3 in units of Δx 2 /Δt, giving a Reynolds number of 46.87.Fig. 5 shows the excellent agreement between the calculated and analytical velocity profile, with a total quadratic error of 3.3 10 -8 .The second test case corresponds to a cylinder located in the center of a channel.At the back of the obstacle, the fluid stream fails to stick to cylinder's wall, and the boundary layers separate from each side of the cylinder resulting in a Von Karman vortex train.A detailed explanation of this case can be found in [8][23].
The channel was represented by a 900×128 lattice, as shown in Fig. 6.The obstacle is a circle, 22-cells diameter, centered at position (64, 64) inside the channel.It is represented by an IB of 200 points, which ensures the no-leakage between points criterion [16].The channel pressure drop is ∆p = 5 10 -4 , the average particle density is ρ0 = 0.05 and τ = 0.65.Fig. 7 shows the field of the velocity module at t=5000Δt steps.The wake of vortexes can be easily recognized.The Strouhal number is St=fd/V where f is the vortex shedding frequency and V the fluid velocity.The obtained St is 0.165, having a good agreement with available literature [23].

Performance
In order to assess the performance of the present implementation, a comparison was made against an equivalent FORTRAN implementation running in a single thread on a CPU AMD Phenom II X6 at 2.81 GHz.The main algorithm and subroutines are similar to those in GPU code.The main difference is that in the CPU all the work is performed in a single thread fashion.So the subroutines are built as two nested loops for each spatial direction and one loop for the boundary points.Both implementations use single precision floating point representations.
The most popular metric to assess the performance of LBM codes is the number of lattice updates per second (LUPS) [19].Tables 1 and 2 compare the performances of each implementation for different platforms, scaling the problem to various domain grid sizes.The simulation corresponds to Poiseuille flow case presented in Section 3.1 (Table 1) and Vortex shedding case presented in Section 3.2 (Table 2).The number of boundary points used in each case to simulate the channel wall is proportional to the grid length to ensure the no-leakeage between points criterion [16] and it is shown in column 2 of Table 1.It can be seen that the performance achieved with GPU is between 2 and 140 times faster than with CPU.It can be seen that when domain size is increased, the global speed up does so too.In part because the numbers of IB points per cell turns lower, lowering down the computational cost proportion of IB in the whole process.The time consumption of each kernel was also assessed to identify the critical calculation steps.It should be noted that these are just estimations, since the kernels are linked to each other.Table 3 compares the resulting performances of each implementation discriminated by kernels.The Compute_LBM_Forces kernel consumes 90% of the computing time, being this the critical section of the process.Further optimization strategies can be applied, but probably they will entail resigning some flexibility.For example, sequential kernels with the same execution configuration could be combined like Stream_and_force and Compute_Boundaries_And_Macroscopics but loosing decoupling of code.

CONCLUSIONS
A GPU implementation of the Lattice Boltzmann Method for fluid flow combined with immersed boundaries capable of modelling complex and flexible boundary conditions was presented.The IB method represents objects as force fields acting on local neighborhoods around boundaries.On each time step of LBM, the algorithm executes an inner iteration to solve the fluid-boundary interaction implicitly.Data locality and execution scheme are different when standard boundary conditions are used.These are crucial aspects to GPU implementations, so the combination of IB and LBM becomes less trivial to parallelize.Two classical test case simulations were performed, Poiseuille flow and vortex shedding behind a cylindrical obstacle.The algorithm on GPU can give satisfactory results in terms of accuracy.Substantial reductions of the calculation rates were achieved, reducing up to 84 times the time required by a CPU to execute the same case.The GPU code reaches near 80 Mlups on Geforce GTX 580 desktop graphic board.

Figure 1 :
Figure 1: Space of discrete velocities in LBM-d2Q9, corresponding to the population functions fα.

Figure 2
shows a 2D example with a closed immersed boundary.The boundary and the fluid domain are denoted by Γb and Ωf, respectively.The state of the boundary is represented by X(s,t), a Lagrangian vector function of arc length s and time t, which returns the location of the boundary nodes on Γb.The influence action on the fluid is represented by a force density

Figure 3 :
Figure 3: Poiseuille flow in a rectangular channel.

3. 2 .
Vortex shedding FULL GPU Implementation of Lattice-Boltzmann Methods with Immersed Boundary Conditions for Fast Fluid

Figure 6 :
Figure 6: Von Karman vortex street scheme represented with IB.

Figure 7 :
Figure 7: Von Karman vortex street, contour map of velocity module at time step 5000.

Table 1 :
Performance comparison between GPU and CPU codes simulatingPoiseuille flow, obtained over 1000 iterations.

Table 2 :
Performance comparison between GPU and CPU codes simulatingVortex shedding, obtained over 1000 iterations.

Table 3 :
GPU individual kernel times obtained over 1.000 iterations simulating Poiseuille flow in a grid of 400 x 240 cells.