articles/2018-12-22_fun_with_compute_shaders_and_fluid_dynamics.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368

# Fun with compute shaders and fluid dynamics

As I previously alluded to, computational fluid dynamics is a current subject of interest of mine both academically[^0] and recreationally[^00]. Where on the academic side the focus obviously lies on theoretical strictness and simulations are only useful as far as their error can be judged and bounded, I very much like to take a more hand wavy approach during my freetime and just _fool around_. This works together nicely with my interest in GPU based computation which is to be the topic of this article.

[^0]: i.e. I've now been a student employee of the [Lattice Boltzmann Research Group](http://www.lbrg.kit.edu/) for two years where I contribute to the open source LBM framework [OpenLB](https://www.openlb.net/). Back in 2017 I was granted the opportunity to attend the [LBM Spring School](https://www.openlb.net/spring-school-2017) in Tunisia. In addition to that I am currently writing my bachelor's thesis on grid refinement in LBM using OpenLB.
[^00]: e.g. [boltzbub](https://tree.kummerlaender.eu/projects/boltzbub/), [compustream](https://code.kummerlaender.eu/compustream/), this article.

<video controls="" preload="metadata" loop="true" poster="https://static.kummerlaender.eu/media/boltzstern_1.jpg"><source src="https://static.kummerlaender.eu/media/boltzstern.mp4" type="video/mp4"/></video>

While visualizations such as the one above are nice to behold in a purely asthetic sense independent of any real word groundedness their implementation is at least inspired by models of our physical reality. The next section aims to give a overview of such models for fluid flows and at least sketch out the theoretical foundation of the specific model implemented on the GPU to generate all visualization we will see on this page.

## Levels of abstraction

The behaviour of weakly compressible fluid flows -- i.e. non-supersonic flows where the compressibility of the flowing fluid plays a small but _non-central_ role -- is commonly modelled by the weakly compressible Navier-Stokes equations which relate density $\rho$, pressure $p$, viscosity $\nu$ and speed $u$ to each other:
$$\begin{aligned} \partial_t  \rho + \nabla \cdot (\rho u) &= 0 \\ \partial_t u + (u \cdot \nabla) u &= -\frac{1}{\rho} \nabla p + 2\nu\nabla \cdot \left(\frac{1}{2} (\nabla u + (\nabla u)^\top)\right)\end{aligned}$$

As such the Navier-Stokes equations model a continuous fluid from a macroscopic perspective. That means that this model doesn't concern itself with the inner workings of the fluid -- e.g. what it is actually made of, how the specific molecules making up the fluid interact individually and so on -- but rather considers it as an abstract vector field. One other way to model fluid flows is to explicitly model the individual fluid molecules using classical physics. This microscopic approach closely reflects what actually happens in reality. From this perspective the _flow_ of the fluid is just an emergent property of the underlying individual physical interactions. Which approach one chooses for computational fluid dynamics depends on the question one wants to answer as well as the available computational ressources. A sufficienctly precise model of individual molecular interactions precisely models physical reality in arbitrary situations but is easily much more computationally intensive that a macroscopic approach using Navier-Stokes. In turn, solving such macroscoping equations can quickly become problematic in complex geometries with diverse boundary conditions. No model is perfect and no model is strictly better that any other model in all categories.

## Lattice Boltzmann in theory…

The approach I want to introduce for this article is neither macroscopic nor microscopic but situated between those two levels of abstraction -- it is a _mesoscopic_ approach to fluid dynamics. Such a model is given by the Boltzmann equations that can be used to describe fluids from a statistical perspective. As such the _Boltzmann-approach_ is to model neither the macroscopic behavior of a fluid nor the microscopic particle interactions but the probability of a certain mass of fluid particles $f$ moving inside of an external force field $F$ with a certain directed speed $\xi$ at a certain spatial location $x$ at a specific time $t$:
$$\left( \partial_t + \xi \cdot \partial_x + \frac{F}{\rho} \cdot \partial_\xi \right) f = \Omega(f) \left( = \partial_x f \cdot \frac{dx}{dt} + \partial_\xi f \cdot \frac{d\xi}{dt} + \partial_t f \right)$$

The total differential $\Omega(f)$ of this Boltzmann advection equation can be viewed as a collision operator that describes the local redistribution of particle densities caused by said particles colliding. As this equation by itself is still continuous in all variables we need to discretize it in order to use it on a finite computer. This basically means that we restrict all variable values to a discrete and finite set in addition to replacing difficult to solve parts with more approachable approximations. Implementations of such a discretized Boltzmann equation are commonly referred to as the Lattice Boltzmann Method.

As our goal is to display simple fluid flows on a distinctly two dimensional screen, a first sensible restiction is to limit space to two dimensions[^1]. As a side note: At first glance this might seem strange as no truly 2D fluids exist in our 3D environment. While this doesn't need to concern us for generating entertaining visuals there are in fact some real world situations where 2D fluid models can be reasonable solutions for 3D problems.

[^1]: Of course the Lattice Boltzmann Method works equally well in three dimensions.

The lattice in LBM hints at the further restriction of our 2D spatial coordinate $x$ to a discrete lattice of points. The canonical way to structure such a lattice is to use a cartesian grid.

Besides the spatial restriction to a two dimensional lattice a common step of discretizing the Boltzmann equation is to approximate the collision operator using an operator pioneered by Bhatnagar, Gross and Krook: $$\Omega(f) := -\frac{f-f^\text{eq}}{\tau}$$

This honorifically named BGK operator relaxes the current particle distribution $f$ towards its theoretical equilibrium distribution $f^\text{eq}$ at a rate $\tau$. The value of $\tau$ is one of the main control points for influencing the behaviour of the simulated fluid. e.g. its Reynolds number[^2] and viscosity are controlled using this parameter.
Combining this definition of $\Omega(f)$ and the Boltzmann equation without external forces yields the BGK approximation of said equation:
$$(\partial_t + \xi \cdot \nabla_x) f = -\frac{1}{\tau} (f(x,\xi,t) - f^\text{eq}(x,\xi,t))$$

[^2]: Dimensionless ratio of inertial compared to viscous forces. The Reynolds number is essential for linking the lattice-based simulation to physical models. LBM simulations tend to be harder to control the higher the Reynolds number - i.e. the more _liquid_ and thus turbulent the fluid becomes. For further details see e.g. Chapter 7 _Non-dimensionalisation and Choice of Simulation Parameters_ of the book linked right below.

To further discretize this we restrict the velocity $\xi$ not just to two dimensions but to a finite set of nine discrete unit velocities (<tt>D2Q9</tt> - 2 dimensions, 9 directions):
$$\newcommand{\V}[2]{\begin{pmatrix}#1\\#2\end{pmatrix}} \{\xi_i\}_{i=0}^8 = \left\{ \V{0}{0}, \V{-1}{\phantom{-}1}, \V{-1}{\phantom{-}0}, \V{-1}{-1}, \V{\phantom{-}0}{-1}, \V{\phantom{-}1}{-1}, \V{1}{0}, \V{1}{1}, \V{0}{1} \right\}$$

We also define the equilibrium $f^\text{eq}$ towards which all distributions in this model strive as the discrete equilibrium distribution by Maxwell and Boltzmann. This distribution $f_i^\text{eq}$ of the $i$-th discrete velocity $\xi_i$ is given for density $\rho \in \mathbb{R}_{\geq 0}$ and total velocity $u \in \mathbb{R}^2$ as well as fixed lattice weights $w_i$ and lattice speed of sound $c_s$:
$$f_i^\text{eq} = w_i \rho \left( 1 + \frac{u \cdot \xi_i}{c_s^2} + \frac{(u \cdot \xi_i)^2}{2c_s^4} - \frac{u \cdot u}{2c_s^2} \right)$$

The moments $\rho$ and $u$ at location $x$ are in turn dependent on the cumulated distributions:
$$\begin{aligned}\rho(x,t) &= \sum_{i=0}^{q-1} f_i(x,t) \\ \rho u(x,t) &= \sum_{i=0}^{q-1} \xi_i f_i(x,t)\end{aligned}$$

Verbosely determining the constant lattice weights and the lattice speed of sound would exceed the scope[^3] of this article.  Generally these constants are chosen depending of the used set of discrete velocities in such a way that the resulting collision operator preserves both momentum and mass. Furthermore the operator should be independent of rotations.
$$w_0 = \frac{4}{9}, \ w_{2,4,6,8} = \frac{1}{9}, \ w_{1,3,5,7} = \frac{1}{36}, \ c_s = \sqrt{1/3} $$

[^3]: If you want to know more about all the gritty details I can recommend [The Lattice Boltzmann Method: Principles and Practice](https://link.springer.com/book/10.1007/978-3-319-44649-3) by Krüger et al.

We have now fully discretized the BGK approximation of the Boltzmann equation. As the actual solution to this equation is still implicit in its definition we need to solve the following definite integral of time and space:
$$f_i(x+\xi_i, t+1) - f_i(x,t) = -\frac{1}{\tau} \int_0^1 (f_i(x+\xi_i s,t+s) - f_i^\text{eq}(x+\xi_i s, t+s)) ds$$

Since the exact integration of this expression is actually non-trivial it is once again only approximated. While there are various ways of going about that we can get away with using the common trapezoidial rule and the following shift of $f_i$ and $\tau$:
$$\begin{aligned}\overline{f_i} &= f_i + \frac{1}{2\tau}(f_i - f_i^\text{eq}) \\ \overline\tau &= \tau + \frac{1}{2}\end{aligned}$$

Thus we finally end up with a discrete LBM BGK equation that can be trivially performed -- i.e. there is is a explicit function for transforming the current state into its successor -- on any available finite computer:
$$\overline{f_i}(x+\xi_i,t+1) = \overline{f_i}(x,t) - \frac{1}{\overline\tau} (\overline{f_i}(x,t) - f_i^\text{eq}(x,t))$$

Note that on an infinite or periodic (e.g. toroidial) lattice this equation defines all distributions in every lattice cell. If we are confronted with more complex situations such as borders where the fluid is reflected or open boundaries where mass enters or leaves the simulation domain we need special boundary conditions to model the missing distributions. Boundary conditions are also one of the big subtopics in LBM theory as there isn't one condition to rule them all but a plethora of different boundary conditions with their own up and downsides.

## …and in practice

The ubiquitous way of applying the discrete LBM equation to a lattice is to separate it into a two step _Collide-and-Stream_ process:
$$\begin{aligned}f_i^\text{out}(x,t) &:= f_i(x,t) - \frac{1}{\tau}(f_i(x,t) - f_i^\text{eq}(x,t)) &&\text{(Collide)} \\ f_i(x+\xi_i,t+1) &:= f_i^\text{out}(x,t) &&\text{(Stream)}\end{aligned}$$

Closer inspection of this process reveals one of the advantages of LBM driven fluid dynamics: They positively beg for parallelization. While the collision step is embarrassingly parallel due to its fully cell-local nature even the stream step only communicates with the cell's direct neighbors.

One might note that the values of our actual distributions $f_i$ are -- contrary to the stated goal of the previous section -- still unrestricted, non-discrete and unbounded real numbers. Their discretization happens implicitly by choosing the floating point type used by our program. In the case of the following compute shaders all these values will be encoded as 4-byte single-precision floating point numbers as is standard for GPU code.

To implement a LBM using compute shaders we need to represent the lattice in the GPU's memory. Each lattice cell requires nine 4-byte floating point numbers to describe its distribution. This means that in 2D the lattice memory requirement by itself is fairly negligible as e.g. a lattice resolution of <tt>1024x1024</tt> fits within 36 MiB and thus takes up only a small fraction of the onboard memory provided by current GPUs. In fact GPU memory and processors are fast enough that we do not really have to concern ourselves with detailed optimizations[^4] if we only want to visualize a reasonably sized lattice with a reasonable count of lattice updates per second -- e.g. 50 updates per second on a <tt>256x256</tt> lattice do not require[^5] any thoughts on optimization whatsoever on the Nvidia K2200 employed by my workstation.

[^4]: e.g. laying out the memory to suit the GPU's cache structure, optimizing instruction sequence and so on
[^5]: i.e. the code runs without causing any mentionable GPU load as reported by the handy [nvtop](https://github.com/Syllo/nvtop) performance monitor

Despite all actual computation happening on the GPU we still need some CPU-based wrapper code to interact with the operating system, initialize memory, control the OpenGL state machine and so on. While I could not find any suitable non-gaming targeted C++ library to ease development of this code the scaffolding originally written[^6] for my vector field visualization [computicle](https://tree.kummerlaender.eu/projects/computicle/) was easily adapted to this new application.

[^6]: See [On NixOS, GPU programming and other assorted topics](/article/nixos_gpu_assorted_topics/).

To further simplify the implementation of our GLSL stream kernel we can use the abundant GPU memory to store two full states of the lattice. This allows for updating the cell populations of the upcoming collide operation without overwriting the current collision result which in turn means that the execution sequence of the stream kernel doesn't matter.

So all in all we require three memory regions: A collision buffer for performing the collide step, a streaming buffer as the streaming target and a fluid buffer to store velocity and pressure for visualization purposes. As an example we can take a look at how the underlying lattice buffer for collide and stream is [allocated](https://code.kummerlaender.eu/compustream/tree/src/buffer/vertex/lattice_cell_buffer.cc) on the GPU:

```cpp
LatticeCellBuffer::LatticeCellBuffer(GLuint nX, GLuint nY) {
	glGenVertexArrays(1, &_array);
	glGenBuffers(1, &_buffer);

	const std::vector<GLfloat> data(9*nX*nY, GLfloat{1./9.});

	glBindVertexArray(_array);
	glBindBuffer(GL_ARRAY_BUFFER, _buffer);
	glBufferData(
		GL_ARRAY_BUFFER,
		data.size() * sizeof(GLfloat),
		data.data(),
		GL_DYNAMIC_DRAW
	);

	glEnableVertexAttribArray(0);
	glVertexAttribPointer(0, 1, GL_FLOAT, GL_FALSE, 0, nullptr);
}
```

We can use the resulting `_buffer` address of type `GLuint` to bind the data array to corresponding binding points inside the compute shader. In our case these binding points are defined as follows:

```cpp
layout (local_size_x = 1, local_size_y = 1) in;

layout (std430, binding=1) buffer bufferCollide{ float collideCells[]; };
layout (std430, binding=2) buffer bufferStream{ float streamCells[]; };
layout (std430, binding=3) buffer bufferFluid{ float fluidCells[]; };

uniform uint nX;
uniform uint nY;
```

Calling compute shaders of this signature from the CPU is nicely abstracted by some computicle-derived[^7] wrapper classes such as [`ComputeShader`](https://code.kummerlaender.eu/compustream/tree/src/shader/wrap/compute_shader.cc):

[^7]: See [<tt>main.cc</tt>](https://code.kummerlaender.eu/compustream/tree/src/main.cc?id=8df8940bced82cad259bc8144aa3668871222d22#n100) as a starting point for diving into the code.

```cpp
// vector of buffer addresses to be bound
auto buffers = {
	lattice_a->getBuffer(),
	lattice_b->getBuffer(),
	fluid->getBuffer()
};

// bind buffers for the shaders to work on
collide_shader->workOn(buffers);
stream_shader->workOn(buffers);

// activate and trigger compute shaders
{
	auto guard = collide_shader->use();
	collide_shader->dispatch(nX, nY);
}
{
	auto guard = stream_shader->use();
	stream_shader->dispatch(nX, nY);
}
```

Lattice constants can be stored directly in the shader:

```cpp
const uint  q         = 9;
const float weight[q] = float[](
	1./36., 1./9., 1./36.,
	1./9. , 4./9., 1./9. ,
	1./36 , 1./9., 1./36.
);

const float tau   = 0.8;
const float omega = 1/tau;
```

Manual indexing to mime multidimensional arrays allows for flexible memory layouting while preserving reasonably easy access:

```cpp
uint indexOfDirection(int i, int j) {
	return 3*(j+1) + (i+1);
}

uint indexOfLatticeCell(uint x, uint y) {
	return q*nX*y + q*x;
}

/* [...] */

float w(int i, int j) {
	return weight[indexOfDirection(i,j)];
}

float get(uint x, uint y, int i, int j) {
	return collideCells[indexOfLatticeCell(x,y) + indexOfDirection(i,j)];
}
```

The discrete equilibrium distribution $f_i^\text{eq}$ is expressed as a single line of code when aided by some convenience functions such as `comp` for the dot product of discrete velocity $\xi_i$ and velocity moment $u$:

```cpp
float equilibrium(float d, vec2 u, int i, int j) {
	return w(i,j)
	     * d
	     * (1 + 3*comp(i,j,u) + 4.5*sq(comp(i,j,u)) - 1.5*sq(norm(u)));
}
```

Our actual collide kernel [<tt>collide.glsl</tt>](https://code.kummerlaender.eu/compustream/tree/src/shader/code/collide.glsl) is compactly expressed as a iteration over all discrete velocities and a direct codificaton of the collision formula:

```cpp
const uint x = gl_GlobalInvocationID.x;
const uint y = gl_GlobalInvocationID.y;

const float d = density(x,y);
const vec2  v = velocity(x,y,d);

setFluid(x,y,v,d);

for ( int i = -1; i <= 1; ++i ) {
	for ( int j = -1; j <= 1; ++j ) {
		set(
			x,y,i,j,
			get(x,y,i,j) + omega * (equilibrium(d,v,i,j) - get(x,y,i,j))
		);
	}
}
```

The streaming kernel [<tt>stream.glsl</tt>](https://code.kummerlaender.eu/compustream/tree/src/shader/code/stream.glsl) turns out to be equally compact even when a basic bounce back boundary condition is included. Such a condition simply reflects the populations that would be streamed outside the fluid domain to define the -- otherwise undefined -- populations pointing towards the fluid.

```cpp
const uint x = gl_GlobalInvocationID.x;
const uint y = gl_GlobalInvocationID.y;

if ( x != 0 && x != nX-1 && y != 0 && y != nY-1 ) {
	for ( int i = -1; i <= 1; ++i ) {
		for ( int j = -1; j <= 1; ++j ) {
			set(x+i,y+j,i,j, get(x,y,i,j));
		}
	}
} else {
	// rudimentary bounce back boundary handling
	for ( int i = -1; i <= 1; ++i ) {
		for ( int j = -1; j <= 1; ++j ) {
			if ( (x > 0 || i >= 0) && x+i <= nX-1
			  && (y > 0 || j >= 0) && y<