Diffstat (limited to 'articles')
-rw-r--r--  articles/2020-05-26_lambda_tuple_swallowing.md | 2
-rw-r--r--  articles/2021-09-26_noise_and_ray_marching.org | 276
-rw-r--r--  articles/2021-10-11_reproducible_development_environment_teensy.org | 195
-rw-r--r--  articles/2023-12-26_benefiting_from_deliberately_failing_linkage.org | 243
4 files changed, 715 insertions(+), 1 deletion(-)
diff --git a/articles/2020-05-26_lambda_tuple_swallowing.md b/articles/2020-05-26_lambda_tuple_swallowing.md
index f64ea52..b33c046 100644
--- a/articles/2020-05-26_lambda_tuple_swallowing.md
+++ b/articles/2020-05-26_lambda_tuple_swallowing.md
@@ -1,6 +1,6 @@
# Working with tuples using swallowing and generic lambdas
-Suppose you have some kind of list of types. Such a list by can by itself be [used](/article/using_scheme_as_a_metaphor_for_template_metaprogramming/) to perform any compile time computation one might come up with. So let us suppose that you additionally want to construct a tuple from something that is based on this list. i.e. you want to connect the compile time only type list to a run time object. In such a case you might run into new question such as: How do I call constructors for each of my tuple values? How do I offer access to the tuple values using only the type as a reference? How do I call a function for each value in the tuple while preserving the connection to the compile time list? If such questions are of interest to you, this article might possibly also be.
+Suppose you have some kind of list of types. Such a list can by itself be [used](/article/using_scheme_as_a_metaphor_for_template_metaprogramming/) to perform any compile time computation one might come up with. So let us suppose that you additionally want to construct a tuple from something that is based on this list. i.e. you want to connect the compile time only type list to a run time object. In such a case you might run into new questions such as: How do I call constructors for each of my tuple values? How do I offer access to the tuple values using only the type as a reference? How do I call a function for each value in the tuple while preserving the connection to the compile time list? If such questions are of interest to you, this article might possibly also be.
While the standard's tuple template is part of the C++ subset I use in basically all of my developments[^0] I recently had to revisit some of these questions while reworking OpenLB's core data structure using its [_meta descriptor_](/article/meta_descriptor/) concept. The starting point for this was a class template called `FieldArrayD` to store an array of instances of a single field in a SIMD vectorization friendly _structure of arrays_ layout. As a LBM lattice in practice stores not just one such field type but multiple of them (all declared in the central _descriptor_ structure) I then wanted a `MultiFieldArrayD` class template that does just that. i.e. a simple wrapper that accepts a list of fields as a variadic template parameter pack and instantiates a `FieldArrayD` for each of them. A sensible place for storing these instances is of course our trusty `std::tuple`:
diff --git a/articles/2021-09-26_noise_and_ray_marching.org b/articles/2021-09-26_noise_and_ray_marching.org
new file mode 100644
index 0000000..d3b231d
--- /dev/null
+++ b/articles/2021-09-26_noise_and_ray_marching.org
@@ -0,0 +1,276 @@
+* Noise and Ray Marching
+[[https://literatelb.org][LiterateLB's]] volumetric visualization functionality relies on a simple ray marching implementation
+to sample both the 3D textures produced by the simulation side of things and the signed distance
+functions that describe the obstacle geometry. While this produces surprisingly [[https://www.youtube.com/watch?v=n86GfhhL7sA][nice looking]]
+results in many cases, some artifacts of the visualization algorithm are visible depending on the
+viewport and sample values. Extending the ray marching code to utilize a noise function is
+one possibility of mitigating such issues that I want to explore in this article.
+
+While my [[https://www.youtube.com/watch?v=J2al5tV14M8][original foray]] into just in time visualization of Lattice Boltzmann based simulations
+was only an afterthought to [[https://tree.kummerlaender.eu/projects/symlbm_playground/][playing around]] with [[https://sympy.org][SymPy]] based code generation approaches, I have
+since put some work into a more fully fledged code. The resulting [[https://literatelb.org][LiterateLB]] code combines
+symbolic generation of optimized CUDA kernels and functionality for just in time fluid flow
+visualization into a single /literate/ [[http://code.kummerlaender.eu/LiterateLB/tree/lbm.org][document]].
+
+For all fortunate users of the [[https://nixos.org][Nix]] package manager, tangling and building this from the [[https://orgmode.org][Org]]
+document is as easy as executing the following commands on a CUDA-enabled NixOS host.
+
+#+BEGIN_SRC sh
+git clone https://code.kummerlaender.eu/LiterateLB
+nix-build
+./result/bin/nozzle
+#+END_SRC
+
+** Image Synthesis
+The basic ingredient for producing volumetric images from CFD simulation data is to compute
+some scalar field of samples \(s : \mathbb{R}^3 \to \mathbb{R}_0^+\). Each sample \(s(x)\) can be assigned a color
+\(c(x)\) by some convenient color palette mapping scalar values to a tuple of red, green and blue
+components.
+
+[[https://literatelb.org/tangle/asset/palette/4wave_ROTB.png]]
+
+The task of producing an image then consists of sampling the color field along a ray assigned
+to a pixel by e.g. a simple pinhole camera projection. For this purpose a simple discrete
+approximation of the volume rendering equation with constant step size \(\Delta x \in \mathbb{R}^+\) already
+produces surprisingly good pictures. Specifically
+$$C(r) = \sum_{i=0}^N c(i \Delta x) \mu (i \Delta x) \prod_{j=0}^{i-1} \left(1 - \mu(j\Delta x)\right)$$
+is the color along ray \(r\) of length \(N\Delta x\) with local absorption values \(\mu(x)\). This
+local absorption value may be chosen separately from the sampling function, adding an
+additional tweaking point.
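+
+The sum can be evaluated iteratively by tracking the accumulated transmittance, i.e. the running
+product of \(1 - \mu\). The following C++ fragment is only a minimal sketch of such a front-to-back
+accumulation; the color and absorption callables as well as all names are placeholders rather
+than LiterateLB's actual implementation:
+
+#+BEGIN_SRC cpp
+#include <array>
+#include <functional>
+
+using vec3 = std::array<float,3>;
+
+// placeholder callables for the palette-mapped color c(x) and the absorption mu(x)
+using ColorField      = std::function<vec3(float)>;
+using AbsorptionField = std::function<float(float)>;
+
+// front-to-back evaluation of C(r) = sum_i c(i dx) mu(i dx) prod_{j<i} (1 - mu(j dx))
+vec3 integrateRay(ColorField c, AbsorptionField mu, int N, float dx) {
+  vec3 color{0,0,0};
+  float transmittance = 1.0f; // running product of (1 - mu(j dx))
+  for (int i = 0; i < N; ++i) {
+    const float x = i * dx;
+    const float absorption = mu(x);
+    const vec3  sample = c(x);
+    for (int d = 0; d < 3; ++d) {
+      color[d] += sample[d] * absorption * transmittance;
+    }
+    transmittance *= 1.0f - absorption;
+  }
+  return color;
+}
+#+END_SRC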
+
+#+BEGIN_EXPORT html
+<video style="width:100%" src="https://literatelb.org/media/nozzle.webm" controls="controls">
+</video>
+#+END_EXPORT
+
+The basic approach may also be extended arbitrarily, e.g. it is only the inclusion of a couple
+of phase functions away from being able to [[https://tree.kummerlaender.eu/projects/firmament/][recover the color produced by light travelling through the participating media that is our atmosphere]].
+
+** The Problem
+There are many different possibilities for the choice of sampling function \(s(x)\) given the results of a
+fluid flow simulation. E.g. velocity and curl norms, the scalar product of ray direction and shear layer
+normals or vortex identifiers such as the Q criterion
+\[ Q = \|\Omega\|^2 - \|S\|^2 \]
+(commonly thresholded to \(Q > 0\) in order to recover isosurfaces) that contrasts the local
+vorticity and strain rate norms. The strain rate tensor \(S\) is easily
+recovered from the non-equilibrium populations \(f^\text{neq}\) of the simulation lattice — and is in
+fact already used for the turbulence model. Similarly, the vorticity \(\Omega = \nabla \times u\) can be
+computed from the velocity field using a finite difference stencil.
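+
+As an illustration of the sampling side, the following sketch evaluates the above formula for \(Q\)
+from a local velocity gradient \(J_{ij} = \partial u_i / \partial x_j\) as it could be obtained from such a finite difference
+stencil. It is not taken from LiterateLB or OpenLB, and prefactor conventions for \(Q\) vary in the
+literature:
+
+#+BEGIN_SRC cpp
+#include <array>
+
+using Matrix3 = std::array<std::array<float,3>,3>;
+
+// Q = |curl u|^2 - |S|^2 evaluated from the velocity gradient J[i][j] = du_i/dx_j
+float qCriterion(const Matrix3& J) {
+  // vorticity vector as the curl of the velocity field
+  const float wx = J[2][1] - J[1][2];
+  const float wy = J[0][2] - J[2][0];
+  const float wz = J[1][0] - J[0][1];
+  const float vorticityNormSq = wx*wx + wy*wy + wz*wz;
+  // squared Frobenius norm of the strain rate tensor S = (J + J^T)/2
+  float strainNormSq = 0.f;
+  for (int i = 0; i < 3; ++i) {
+    for (int j = 0; j < 3; ++j) {
+      const float s_ij = 0.5f * (J[i][j] + J[j][i]);
+      strainNormSq += s_ij * s_ij;
+    }
+  }
+  return vorticityNormSq - strainNormSq;
+}
+#+END_SRC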
+
+The problem w.r.t. rendering when thresholding sampling values to highlight structures in the flow
+becomes apparent in the following picture:
+
+#+BEGIN_EXPORT html
+<div class="flexcolumns">
+<div>
+<span>Q Criterion</span>
+<img src="https://static.kummerlaender.eu/media/q_criterion_default.png"/>
+</div>
+<div>
+<span>Curl Norm</span>
+<img src="https://static.kummerlaender.eu/media/curl_default.png"/>
+</div>
+</div>
+#+END_EXPORT
+
+While the exact same volume discretization was used for both visualizations, the slices are much
+less apparent for the curl norm samples due to the more gradual changes. In general the issue is
+most prominent for scalar fields with large gradients (specifically the sudden jumps that occur
+when restricting sampling to certain value ranges as is the case for the Q criterion).
+
+** Colors of Noise
+The reason for these artifacts is primarily the choice of start offsets w.r.t. the traversed volume
+in addition to the step width. While this tends to become less noticeable when decreasing said
+step width, doing so is not desirable from a performance perspective.
+
+What I settled on for LiterateLB's renderer is view-aligned slicing combined with random jittering to remove
+most visible artifacts. The choice of /randomness/ for jittering the ray origin is critical here as plain
+random numbers tend to produce a distracting static-like pattern. A common choice in practice is
+to use so called /blue noise/ instead. While both kinds of noise eliminate most slicing artifacts, the
+remaining patterns tend to be less noticeable for blue noise. Noise is called /blue/ if it contains only
+higher frequency components, which makes it harder for the pattern recognizer that we call a brain to
+find patterns where there should be none.
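+
+In terms of code, applying such a noise texture amounts to little more than shifting the start of
+each per-pixel ray by a noise-dependent fraction of the step width. The following sketch uses
+hypothetical names and is not taken from LiterateLB's actual renderer:
+
+#+BEGIN_SRC cpp
+// offset the first sample along the ray by a fraction of one step,
+// using a tileable N x N noise texture with values in [0,1]
+float jitteredRayOffset(const float* noise, int N,
+                        int pixelX, int pixelY, float stepSize) {
+  const float jitter = noise[(pixelY % N) * N + (pixelX % N)];
+  return jitter * stepSize;
+}
+#+END_SRC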
+
+The [[https://www.spiedigitallibrary.org/conference-proceedings-of-spie/1913/0000/Void-and-cluster-method-for-dither-array-generation/10.1117/12.152707.short?SSO=1][void-and-cluster algorithm]][fn:vac] provides a straightforward method for
+pre-computing tileable blue noise textures that can be reused during the actual visualization.
+Tileability is a desirable property for this as we would otherwise either need a noise texture
+large enough to cover the entire image or instead observe jumps at the boundaries between
+repetitions of the tiled texture.
+
+The first ingredient for /void-and-cluster/ is a =filteredPattern= function that applies a
+plain Gaussian filter with given $\sigma$ to a cyclic 2d array. Using cyclic wrapping during the
+application of this filter is what renders the generated texture tileable.
+
+#+BEGIN_SRC python
+import numpy as np
+from numpy.ma import masked_array
+from scipy.ndimage import gaussian_filter
+
+def filteredPattern(pattern, sigma):
+    return gaussian_filter(pattern.astype(float), sigma=sigma, mode='wrap', truncate=np.max(pattern.shape))
+#+END_SRC
+
+This function will be used to compute the locations of the largest void and tightest
+cluster in a binary pattern (i.e. a 2D array of 0s and 1s). In this context a /void/ describes
+an area with only zeros and a /cluster/ describes an area with only ones.
+
+#+BEGIN_SRC python
+def largestVoidIndex(pattern, sigma):
+    return np.argmin(masked_array(filteredPattern(pattern, sigma), mask=pattern))
+#+END_SRC
+
+These two functions work by considering the given binary pattern as a float array that is blurred by
+the Gaussian filter. The blurred pattern gives an implicit ordering of the /voidness/ of each pixel, the
+minimum of which we can determine by a simple search. It is important to exclude the initial binary
+pattern here as void-and-cluster depends on finding the largest areas where no pixel is set.
+
+#+BEGIN_SRC python
+def tightestClusterIndex(pattern, sigma):
+    return np.argmax(masked_array(filteredPattern(pattern, sigma), mask=np.logical_not(pattern)))
+#+END_SRC
+
+Computing the tightest cluster works in the same way with the exception of searching for the largest array
+element and masking by the inverted pattern.
+
+#+BEGIN_SRC python
+def initialPattern(shape, n_start, sigma):
+    initial_pattern = np.zeros(shape, dtype=bool)
+    initial_pattern.flat[0:n_start] = True
+    initial_pattern.flat = np.random.permutation(initial_pattern.flat)
+    cluster_idx, void_idx = -2, -1
+    while cluster_idx != void_idx:
+        cluster_idx = tightestClusterIndex(initial_pattern, sigma)
+        initial_pattern.flat[cluster_idx] = False
+        void_idx = largestVoidIndex(initial_pattern, sigma)
+        initial_pattern.flat[void_idx] = True
+    return initial_pattern
+#+END_SRC
+
+For the initial binary pattern we set =n_start= random locations to one and then repeatedly
+break up the largest void by setting its center to one. This is also done for the tightest cluster
+by setting its center to zero. We do this until the locations of the tightest cluster and largest
+void overlap.
+
+#+BEGIN_SRC python
+def blueNoise(shape, sigma):
+#+END_SRC
+
+The actual algorithm utilizes these three helper functions in four steps:
+1. Initial pattern generation
+   #+BEGIN_SRC python
+   n = np.prod(shape)
+   n_start = int(n / 10)
+
+   initial_pattern = initialPattern(shape, n_start, sigma)
+   noise = np.zeros(shape)
+   #+END_SRC
+2. Elimination of =n_start= tightest clusters
+   #+BEGIN_SRC python
+   pattern = np.copy(initial_pattern)
+   for rank in range(n_start,-1,-1):
+       cluster_idx = tightestClusterIndex(pattern, sigma)
+       pattern.flat[cluster_idx] = False
+       noise.flat[cluster_idx] = rank
+   #+END_SRC
+3. Elimination of =n/2-n_start= largest voids
+   #+BEGIN_SRC python
+   pattern = np.copy(initial_pattern)
+   for rank in range(n_start,int((n+1)/2)):
+       void_idx = largestVoidIndex(pattern, sigma)
+       pattern.flat[void_idx] = True
+       noise.flat[void_idx] = rank
+   #+END_SRC
+4. Elimination of =n-n/2= tightest clusters of the inverted pattern
+   #+BEGIN_SRC python
+   for rank in range(int((n+1)/2),n):
+       cluster_idx = tightestClusterIndex(np.logical_not(pattern), sigma)
+       pattern.flat[cluster_idx] = True
+       noise.flat[cluster_idx] = rank
+   #+END_SRC
+
+For each elimination the current =rank= is stored in the noise texture
+producing a 2D arrangement of the integers from 0 to =n-1=. As the last
+step the array is divided by =n-1= to yield a grayscale texture with values
+in $[0,1]$.
+
+#+BEGIN_SRC python
+return noise / (n-1)
+#+END_SRC
+
+In order to check whether this actually generated blue noise, we can take a
+look at the Fourier transformation for an exemplary \(100 \times 100\) texture:
+
+#+BEGIN_EXPORT html
+<div class="flexcolumns">
+<div>
+<span>Blue noise texture</span>
+<img src="https://static.kummerlaender.eu/media/blue_noise.png"/>
+</div>
+<div>
+<span>Fourier transformation</span>
+<img src="https://static.kummerlaender.eu/media/blue_noise_fourier.png"/>
+</div>
+</div>
+#+END_EXPORT
+
+One can see qualitatively that higher frequency components are significantly more
+prominent than lower ones. If we contrast this with white noise generated using uniformly
+distributed random numbers, no preference for any range of frequencies can be
+observed:
+
+#+BEGIN_EXPORT html
+<div class="flexcolumns">
+<div>
+<span>White noise texture</span>
+<img src="https://static.kummerlaender.eu/media/white_noise.png"/>
+</div>
+<div>
+<span>Fourier transformation</span>
+<img src="https://static.kummerlaender.eu/media/white_noise_fourier.png"/>
+</div>
+</div>
+#+END_EXPORT
+
+** Comparison
+Contrasting the original Q criterion visualization with one produced using blue noise jittering
+followed by a soft blurring shader, we can see that the slicing artifacts largely vanish.
+While the jittering is still visible upon closer inspection, the result is significantly more pleasing
+to the eye and arguably more faithful to the underlying scalar field.
+
+#+BEGIN_EXPORT html
+<div class="flexcolumns">
+<div>
+<span>Simple ray marching</span>
+<img src="https://static.kummerlaender.eu/media/q_criterion_default.png"/>
+</div>
+<div>
+<span>Ray marching with blue noise jittering</span>
+<img src="https://static.kummerlaender.eu/media/q_criterion_blue_noise.png"/>
+</div>
+</div>
+#+END_EXPORT
+
+While white noise also obscures the slices, its lower frequency components
+produce more obvious static in the resulting image compared to blue noise.
+As both kinds of noise are precomputed we can freely choose the kind of
+noise that will produce the best results for our sampling data.
+
+#+BEGIN_EXPORT html
+<div class="flexcolumns">
+<div>
+<span>Blue noise</span>
+<img src="https://static.kummerlaender.eu/media/q_criterion_blue_noise_close.png"/>
+</div>
+<div>
+<span>White noise</span>
+<img src="https://static.kummerlaender.eu/media/q_criterion_white_noise_close.png"/>
+</div>
+</div>
+#+END_EXPORT
+
+In practice where the noise is applied just-in-time during the visualization of
+a CFD simulation, all remaining artifacts tend to become invisible. This can
+be seen in the following video of the Q criterion evaluated for a simulated
+nozzle flow in LiterateLB:
+
+#+BEGIN_EXPORT html
+<video style="width:100%" src="https://static.kummerlaender.eu/media/nozzle_q_criterion.webm" controls="controls">
+</video>
+#+END_EXPORT
+
+[fn:vac] Ulichney, R. Void-and-cluster method for dither array generation. In Electronic Imaging (1993). DOI: [[https://www.spiedigitallibrary.org/conference-proceedings-of-spie/1913/0000/Void-and-cluster-method-for-dither-array-generation/10.1117/12.152707.short?SSO=1][10.1117/12.152707]].
diff --git a/articles/2021-10-11_reproducible_development_environment_teensy.org b/articles/2021-10-11_reproducible_development_environment_teensy.org
new file mode 100644
index 0000000..e0fd626
--- /dev/null
+++ b/articles/2021-10-11_reproducible_development_environment_teensy.org
@@ -0,0 +1,195 @@
+* Reproducible development environment for Teensy
+So for a change of scenery I recently started to mess around with microcontrollers again.
+Since the last time that I had any real contact with this area was probably around a decade ago --- programming an [[https://www.dlr.de/rm/en/desktopdefault.aspx/tabid-14006/#gallery/34068][ASURO]] robot --- I started basically from scratch.
+Driven by the goal of building and programming a fancy mechanical keyboard (as it seems to be the trendy thing to do) I chose the Arduino-compatible [[https://www.pjrc.com/store/teensy40.html][Teensy 4.0]]
+board. While I appreciate the rich and accessible software ecosystem for this platform, I don't really want to use some special IDE that, amongst other things[fn:0], applies
+weird non-standard preprocessing to my code. In this vein it would also be nice to use my accustomed [[https://nixos.org][Nix-based]] toolchain, which leads me to this article.
+
+Roughly following what [[https://rzetterberg.github.io/teensy-development-on-nixos.html][others did]] for Teensy 3.1 while adapting it to Teensy 4.0 and Nix flakes, it is simple to build and flash
+some basic C++ programs onto a USB-attached board. The adapted version of the Arduino library is available on [[https://github.com/PaulStoffregen/cores][Github]] and can
+be compiled into a shared library using the flags
+
+#+BEGIN_SRC make
+MCU = IMXRT1062
+MCU_DEF = ARDUINO_TEENSY40
+
+OPTIONS = -DF_CPU=600000000 -DUSB_SERIAL -DLAYOUT_US_ENGLISH
+OPTIONS += -D__$(MCU)__ -DARDUINO=10813 -DTEENSYDUINO=154 -D$(MCU_DEF)
+
+CPU_OPTIONS = -mcpu=cortex-m7 -mfloat-abi=hard -mfpu=fpv5-d16 -mthumb
+
+CPPFLAGS = -Wall -g -O2 $(CPU_OPTIONS) -MMD $(OPTIONS) -ffunction-sections -fdata-sections
+CXXFLAGS = -felide-constructors -fno-exceptions -fpermissive -fno-rtti -Wno-error=narrowing -I@TEENSY_INCLUDE@
+#+END_SRC
+
+included into a run-of-the-mill Makefile and relying on the =arm-none-eabi-gcc= compiler. Correspondingly, the
+derivation for the core library [[http://code.kummerlaender.eu/teensy-env/tree/core.nix?id=44c1837717f748b891df1a6c88a72ec3a51470ce][=core.nix=]] is straightforward. It clones a given version of the library repository,
+jumps to the =teensy4= directory, deletes the example =main.cpp= file to exclude it from the library and applies a Makefile
+adapted from the default one. For the result only headers, common flags and the linker script =IMXRT1062.ld=
+are exported.
+
+As existing Arduino /sketches/ commonly consist of a single C++ file (ignoring some non-standard stuff for later) most
+builds can be handled generically by a mapping of =*.cpp= files into flashable =*.hex= files. This is realized by the following
+function based on the =teensy-core= derivation and a [[http://code.kummerlaender.eu/teensy-env/tree/Makefile.default?id=44c1837717f748b891df1a6c88a72ec3a51470ce][default makefile]]:
+
+#+BEGIN_SRC nix
+build = name: source: pkgs.stdenv.mkDerivation rec {
+ inherit name;
+
+ src = source;
+
+ buildInputs = with pkgs; [
+ gcc-arm-embedded
+ teensy-core
+ ];
+
+ buildPhase = ''
+ export CC=arm-none-eabi-gcc
+ export CXX=arm-none-eabi-g++
+ export OBJCOPY=arm-none-eabi-objcopy
+ export SIZE=arm-none-eabi-size
+
+ cp ${./Makefile.default} Makefile
+ export TEENSY_PATH=${teensy-core}
+ make
+ '';
+
+ installPhase = ''
+ mkdir $out
+ cp *.hex $out/
+ '';
+};
+#+END_SRC
+
+The derivation yielded by =build "test" ./test= results in a =result= directory containing a =*.hex= file for each
+C++ file contained in the =test= directory. Adding a =loader= function to be used in convenient =nix flake run=
+commands
+
+#+BEGIN_SRC nix
+loader = name: path: pkgs.writeScript name ''
+ #!/bin/sh
+ ${pkgs.teensy-loader-cli}/bin/teensy-loader-cli --mcu=TEENSY40 -w ${path}
+'';
+#+END_SRC
+
+a reproducible build of the canonical /blink/ example[fn:1] is realized using:
+
+#+BEGIN_SRC sh
+nix flake clone git+https://code.kummerlaender.eu/teensy-env --dest .
+nix run .#flash-blink
+#+END_SRC
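+
+For reference, such a blink program needs only a few lines following the same =extern "C" int main=
+pattern as the servo example further below. The following is a sketch rather than the exact source
+contained in the repository, assuming the on-board LED on pin 13 as usual for the Teensy 4.0:
+
+#+BEGIN_SRC cpp
+#include <Arduino.h>
+
+extern "C" int main(void) {
+  pinMode(13, OUTPUT);      // on-board LED
+  while (true) {
+    digitalWrite(13, HIGH); // LED on
+    delay(500);
+    digitalWrite(13, LOW);  // LED off
+    delay(500);
+  }
+}
+#+END_SRC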
+
+Expanding on this, the =teensy-env= flake also provides convenient =image(With)= functions for building
+programs that depend on additional Arduino libraries such as for controlling servos. E.g. the build
+of a program =test.cpp= placed in a =src= folder
+
+#+BEGIN_SRC cpp
+#include <Arduino.h>
+#include <Servo.h>
+
+extern "C" int main(void) {
+ Servo servo;
+ // Servo connected to PWM-capable pin 1
+ servo.attach(1);
+ while (true) {
+ // Match potentiometer connected to analog pin 7
+ servo.write(map(analogRead(7), 0, 1023, 0, 180));
+ delay(20);
+ }
+}
+#+END_SRC
+
+is fully described by the flake:
+
+#+BEGIN_SRC nix
+{
+ description = "Servo Test";
+
+ inputs = {
+ teensy-env.url = git+https://code.kummerlaender.eu/teensy-env;
+ };
+
+ outputs = { self, teensy-env }: let
+ image = teensy-env.custom.imageWith
+ (with teensy-env.custom.teensy-extras; [ servo ]);
+
+ in {
+ defaultPackage.x86_64-linux = image.build "servotest" ./src;
+ };
+}
+#+END_SRC
+
+At first I expected the build of [[http://www.ulisp.com/][uLisp]][fn:2] to proceed equally smoothly as this implementation of Lisp
+for microcontrollers is provided as a single [[https://raw.githubusercontent.com/technoblogy/ulisp-arm/master/ulisp-arm.ino][=ulisp-arm.ino=]] file. However, the =*.ino= extension
+is not just for show here as beyond even the replacement of =main= by =loop= and =setup= --- which
+would be easy to fix --- it relies on further non-standard preprocessing offered by the
+Arduino toolchain. I quickly aborted my efforts towards patching in e.g. the forward-declarations
+which are automagically added during the build (is it really such a hurdle to at least declare stuff before
+referring to it… oh well) and instead followed a less pure approach using =arduino-cli= to access
+the actual Arduino preprocessor.
+
+#+BEGIN_SRC sh
+arduino-cli core install arduino:samd
+arduino-cli compile --fqbn arduino:samd:arduino_zero_native --preprocess ulisp-arm.ino > ulisp-arm.cpp
+#+END_SRC
+
+The problematic line w.r.t. reproducible builds in Nix is the installation of the =arduino:samd= toolchain
+which requires network access and wants to install stuff to home. Pulling in arbitrary stuff over the
+network is of course not something one wants to do in an isolated and hopefully reproducible build
+environment which is why this kind of stuff is heavily restricted in common Nix derivations. Luckily
+it is possible to misuse (?) a fixed-output derivation to describe the preprocessing of =ulisp-arm.ino=
+into a standard C++ =ulisp-arm.cpp= compilable using the GCC toolchain.
+
+The relevant file [[https://code.kummerlaender.eu/teensy-env/tree/ulisp.nix?id=44c1837717f748b891df1a6c88a72ec3a51470ce][=ulisp.nix=]] pulls in the uLisp source from Github and calls =arduino-cli= to install
+its toolchain to a temporary home folder followed by preprocessing the source into the derivation's
+output. The relevant lines for turning this into a fixed-output derivation are
+
+#+BEGIN_SRC nix
+outputHashMode = "flat";
+outputHashAlgo = "sha256";
+outputHash = "mutVLBFSpTXgUzu594zZ3akR/Z7e9n5SytU6WoQ6rKA=";
+#+END_SRC
+
+to declare the hash of the resulting file. After this point building and flashing uLisp using the =teensy-env=
+flake works the same as for any C++ program. The two additional /SPI/ and /Wire/ library dependencies are
+added easily using =imageWith=:
+
+#+BEGIN_SRC nix
+teensy-ulisp = let
+ ulisp-source = import ./ulisp.nix { inherit pkgs; };
+ ulisp-deps = with teensy-extras; [ spi wire ];
+in (imageWith ulisp-deps).build
+ "teensy-ulisp"
+ (pkgs.linkFarmFromDrvs "ulisp" [ ulisp-source ]);
+#+END_SRC
+
+So we are now able to build and flash uLisp onto a conveniently attached Teensy 4.0 board using only:
+
+#+BEGIN_SRC sh
+nix flake clone git+https://code.kummerlaender.eu/teensy-env --dest .
+nix run .#flash-ulisp
+#+END_SRC
+
+Finally, connecting via the serial terminal =screen /dev/ttyACM0 9600= we end up in a Lisp environment where we
+can play around with the microcontroller at our leisure without reflashing.
+
+#+BEGIN_SRC lisp
+59999> (* 21 2)
+42
+
+59999> (defun blink (&optional x)
+ (pinmode 13 t)
+ (digitalwrite 13 x)
+ (delay 1000)
+ (blink (not x)))
+
+59966> (blink)
+#+END_SRC
+
+As always, the code of everything discussed here is available via Git on [[https://code.kummerlaender.eu/teensy-env][code.kummerlaender.eu]].
+While I only focused on Teensy 4.0 it should be easy to adapt to other versions by changing the
+compiler flags using [[https://github.com/PaulStoffregen/cores][PaulStoffregen/cores]] as a reference.
+
+[fn:0] e.g. forcing me to patch my XMonad [[http://code.kummerlaender.eu/nixos_home/tree/gui/conf/xmonad.hs][config]] to even get a usable UI…
+[fn:1] Simply flashing the on-board LED periodically
+[fn:2] Interactive development using a Lisp REPL on a microcontroller, how much more can you really ask for?
diff --git a/articles/2023-12-26_benefiting_from_deliberately_failing_linkage.org b/articles/2023-12-26_benefiting_from_deliberately_failing_linkage.org
new file mode 100644
index 0000000..a67999d
--- /dev/null
+++ b/articles/2023-12-26_benefiting_from_deliberately_failing_linkage.org
@@ -0,0 +1,243 @@
+* Benefiting from deliberately failing linkage
+Realizing that I have not written anything here for two /years/, let's just start writing again[fn:-1]:
+Compilation times for template-heavy C++ codebases such as the one at [[https://openlb.net][the center of my daily life]] can be a real pain.
+This mostly got worse since I started to really get my hands dirty in its depths during the [[https://www.helmholtz-hirse.de/series/2022_12_01-seminar_9.html][extensive refactoring]] towards SIMD and GPU support[fn:0].
+The current sad high point in compilation times was reached when compiling the first GPU-enabled simulation cases: More than 100 seconds for a single compile on my not too shabby system.
+This article will detail how I significantly reduced this on the build system level while gaining useful features.
+
+#+BEGIN_SRC bash
+λ ~/p/c/o/e/t/nozzle3d (openlb-env-cuda-env) • time make
+make -C ../../.. core
+make[1]: Entering directory '/home/common/projects/contrib/openlb-master'
+make[1]: Nothing to be done for 'core'.
+make[1]: Leaving directory '/home/common/projects/contrib/openlb-master'
+nvcc -pthread --forward-unknown-to-host-compiler -x cu -O3 -std=c++17 --generate-code=arch=compute_75,code=[compute_75,sm_75] --extended-lambda --expt-relaxed-constexpr -rdc=true -DPLATFORM_CPU_SISD -DPLATFORM_GPU_CUDA -DDEFAULT_FLOATING_POINT_TYPE=float -I../../../src -c -o nozzle3d.o nozzle3d.cpp
+nvcc nozzle3d.o -o nozzle3d -lolbcore -lpthread -lz -ltinyxml -L/run/opengl-driver/lib -lcuda -lcudadevrt -lcudart -L../../../build/lib
+________________________________________________________
+Executed in 112.27 secs fish external
+ usr time 109.46 secs 149.00 micros 109.46 secs
+ sys time 2.42 secs 76.00 micros 2.42 secs
+#+END_SRC
+
+Even when considering that this compiles many dozens of individual CUDA kernels for multiple run-time selectable physical models and boundary conditions in addition to the simulation scaffold[fn:1] it still takes too long for comfortably iterating during development.
+Needless to say, things did not improve when I started working on heterogeneous execution and the single executable needed to also contain vectorized versions of all models for execution on CPUs in addition to MPI and OpenMP routines.
+Even worse, you really want to use Intel's C++ compilers when running CPU-based simulations on Intel-based clusters[fn:2] which plainly is not possible in such a /homogeneous/ compiler setup where everything has to pass through =nvcc=.
+
+#+BEGIN_SRC bash
+λ ~/p/c/o/e/t/nozzle3d (openlb-env-gcc-env) • time make
+g++ -O3 -Wall -march=native -mtune=native -std=c++17 -pthread -DPLATFORM_CPU_SISD -I../../../src -c -o nozzle3d.o nozzle3d.cpp
+g++ nozzle3d.o -o nozzle3d -lolbcore -lpthread -lz -ltinyxml -L../../../build/lib
+________________________________________________________
+Executed in 31.77 secs fish external
+ usr time 31.21 secs 0.00 micros 31.21 secs
+ sys time 0.55 secs 693.00 micros 0.55 secs
+#+END_SRC
+
+Comparing the GPU build to the previous CPU-only compilation time of around 32 seconds -- while nothing to write home about -- it was still clear that time would be best spent on separating out the CUDA side of things, both to mitigate its performance impact and to enable a /mixed/ compiler environment.
+
+[fn:-1] …and do my part in feeding the LLM training machine :-)
+[fn:0] Definitely a double edged sword: On the one side it enables concise DSL-like compositions of physical models while supporting automatic code optimization and efficient execution across heterogeneous hardware. On the other side my much younger, Pascal-fluent, self would not be happy with how cryptic and unmaintainable many of my listings can look to the outsider.
+In any case, OpenLB as a heavily templatized and meta-programmed C++ software library is a foundational design decision.
+[fn:1] Data structures, pre- and post-processing logic, IO routines, ...
+[fn:2] Commonly improving performance by quite a few percent
+
+** Requirements
+Firstly, any solution would need to fit within the existing plain Makefile based build system[fn:3] and should not complicate the established build workflow for our users[fn:4].
+Secondly, it should allow for defining completely different compilers and configuration flags for the CPU- and the GPU-side of the application.
+The initial driving force of speeding up GPU-targeted compilation would then be satisfied as a side effect due to the ability to recompile only the CPU-side of things as long as no new physical models are introduced. This restriction is useful in the present context as GPU kernels execute the computationally expensive part, i.e. the actual simulation, but generally do not change often during development of new simulation cases after the initial choice of physical model.
+
+[fn:3] Which was a deliberate design decision in order to minimize dependencies considering the minimal build complexity required by OpenLB as a plain CPU-only MPI code. While this could of course be reconsidered in the face of increased target complexity it was not the time to open that bottle.
+[fn:4] Mostly domain experts from process engineering, physics or mathematics without much experience in software engineering.
+
+** Approach
+Following the requirements, a basic approach is to split the application into two compilation units: One containing only the CPU-implementation consisting of the high level algorithmic structure, pre- and post-processing, communication logic, CPU-targeted simulation kernels and calls to the GPU code.
+The other containing only the GPU code consisting of CUDA kernels and their immediate wrappers called from the CPU-side of things -- i.e. only those parts that truly need to be compiled using NVIDIA's =nvcc=.
+Given two separated files =cpustuff.cpp= and =gpustuff.cu= it would be easy to compile them using separate configurations and then link them together into a single executable.
+The main implementation problem is how to generate two such separated compilation units that can be cleanly linked together, i.e. without duplicating symbols and similar hurdles.
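+
+Schematically the goal of this split looks as follows: the CPU-side compilation unit only sees a
+plain declaration of each GPU-side wrapper while the definition lives in the CUDA unit that is
+compiled by =nvcc= and linked in afterwards. The names below are purely illustrative and not
+OpenLB's actual interface:
+
+#+BEGIN_SRC cpp
+// gpustuff.h -- shared interface, included by both sides
+void launchCollisionKernel(float* cells, int n);
+
+// cpustuff.cpp -- compiled by the host compiler of choice
+#include "gpustuff.h"
+void collideStep(float* cells, int n) {
+  launchCollisionKernel(cells, n); // plain call across the compilation unit boundary
+}
+
+// gpustuff.cu -- compiled by nvcc into the separate library
+#include "gpustuff.h"
+void launchCollisionKernel(float* cells, int n) {
+  // ...actual CUDA kernel launch lives here...
+}
+#+END_SRC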
+
+** Implementation
+In days past the build system actually contained an option for such separated compilation: termed the /pre-compiled mode/ in OpenLB speak.
+This mode consisted of a somewhat rough and leaky separation between interface and implementation headers that was augmented by many hand-written C++ files containing explicit template instantiations of the aforementioned implementations for certain common arguments.
+These C++ files could then be compiled once into a shared library that was linked to the application unit compiled without access to the implementation headers.
+While this worked it was always a struggle to keep these files maintained.
+Additionally any benefit for the, at that time CPU-only, codebase was negligible and in the end not worth the effort any more, causing it to be dropped somewhere on the road to release 1.4.
+
+Nevertheless, the basic approach of compiling a shared library of explicit template instantiations is sound if we can find a way to automatically generate the instantiations per-case instead of manually maintaining them.
+A starting point for this is to take a closer look at the linker errors produced when compiling a simulation case including only the interface headers for the GPU code.
+These errors contain partial signatures of all relevant methods from plain function calls
+
+#+BEGIN_SRC bash
+λ ~/p/c/o/e/l/cavity3dBenchmark (openlb-env-gcc-openmpi-cuda-env) • mpic++ cavity3d.o -lpthread -lz -ltinyxml -L../../../build/lib -lolbcore
+cavity3d.cpp:(...): undefined reference to `olb::gpu::cuda::device::synchronize()'
+#+END_SRC
+
+to bulk and boundary collision operator constructions
+
+#+BEGIN_SRC bash
+cavity3d.cpp:(...): undefined reference to `olb::ConcreteBlockCollisionO<float, olb::descriptors::D3Q19<>, (olb::Platform)2, olb::dynamics::Tuple<float, olb::descriptors::D3Q19<>, olb::momenta::Tuple<olb::momenta::BulkDensity, olb::momenta::BulkMomentum, olb::momenta::BulkStress, olb::momenta::DefineToNEq>, olb::equilibria::SecondOrder, olb::collision::BGK, olb::dynamics::DefaultCombination> >::ConcreteBlockCollisionO()'
+#+END_SRC
+
+as well as core data structure accessors:
+
+#+BEGIN_SRC bash
+cavity3d.cpp:(.text._ZN3olb20ConcreteBlockLatticeIfNS_11descriptors5D3Q19IJEEELNS_8PlatformE2EE21getPopulationPointersEj[_ZN3olb20ConcreteBlockLatticeIfNS_11descriptors5D3Q19IJEEELNS_8PlatformE2EE21getPopulationPointersEj]+0x37): undefined reference to `olb::gpu::cuda::CyclicColumn<float>::operator[](unsigned long)'
+#+END_SRC
+
+These errors are easily turned into a sorted list of unique missing symbols using basic piping
+
+#+BEGIN_SRC makefile
+build/missing.txt: $(OBJ_FILES)
+ $(CXX) $^ $(LDFLAGS) -lolbcore 2>&1 \
+ | grep -oP ".*undefined reference to \`\K[^']+\)" \
+ | sort \
+ | uniq > $@
+#+END_SRC
+
+which only assumes that the locale is set to English and -- surprisingly -- works consistently across all relevant C++ compilers[fn:5], likely due to all of them using either the GNU Linker or a drop-in compatible alternative thereto.
+The resulting plain list of C++ method signatures hints at the reasonably structured and consistent template /language/ employed by OpenLB:
+
+#+BEGIN_SRC cpp
+olb::ConcreteBlockCollisionO<float, olb::descriptors::D3Q19<>, (olb::Platform)2, olb::CombinedRLBdynamics<float, olb::descriptors::D3Q19<>, olb::dynamics::Tuple<float, olb::descriptors::D3Q19<>, olb::momenta::Tuple<olb::momenta::BulkDensity, olb::momenta::BulkMomentum, olb::momenta::BulkStress, olb::momenta::DefineToNEq>, olb::equilibria::SecondOrder, olb::collision::BGK, olb::dynamics::DefaultCombination>, olb::momenta::Tuple<olb::momenta::VelocityBoundaryDensity<0, -1>, olb::momenta::FixedVelocityMomentumGeneric, olb::momenta::RegularizedBoundaryStress<0, -1>, olb::momenta::DefineSeparately> > >::ConcreteBlockCollisionO()
+olb::gpu::cuda::CyclicColumn<float>::operator[](unsigned long)
+olb::gpu::cuda::device::synchronize()
+// [...]
+#+END_SRC
+
+For example, local cell models -- /Dynamics/ in OpenLB speak -- are mostly implemented as tuples of momenta, equilibrium functions and collision operators[fn:6].
+All such relevant classes tend to follow a consistent structure in what methods with which arguments and return types they implement.
+We can use this domain knowledge of our codebase to transform the incomplete signatures in our new =missing.txt= into a full list of explicit template instantiations written in valid C++.
+
+#+BEGIN_SRC makefile
+build/olbcuda.cu: build/missing.txt
+# Generate includes of the case source
+# (replaceable by '#include <olb.h>' if no custom operators are implemented in the application)
+ echo -e '$(CPP_FILES:%=\n#include "../%")' > $@
+# Transform missing symbols into explicit template instantiations by:
+# - filtering for a set of known and automatically instantiable methods
+# - excluding destructors
+# - dropping resulting empty lines
+# - adding the explicit instantiation prefix (all supported methods are void, luckily)
+ cat build/missing.txt \
+ | grep '$(subst $() $(),\|,$(EXPLICIT_METHOD_INSTANTIATION))' \
+ | grep -wv '.*\~.*\|FieldTypeRegistry()' \
+ | xargs -0 -n1 | grep . \
+ | sed -e 's/.*/template void &;/' -e 's/void void/void/' >> $@
+# - filtering for a set of known and automatically instantiable classes
+# - dropping method cruft and wrapping into explicit class instantiation
+# - removing duplicates
+ cat build/missing.txt \
+ | grep '.*\($(subst $() $(),\|,$(EXPLICIT_CLASS_INSTANTIATION))\)<' \
+ | sed -e 's/\.*>::.*/>/' -e 's/.*/template class &;/' -e 's/class void/class/' \
+ | sort | uniq >> $@
+#+END_SRC
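+
+The generated =olbcuda.cu= thus consists of nothing more than include lines for the case sources
+followed by such explicit instantiation definitions. The C++ mechanism this relies on can be
+illustrated by a tiny standalone example (not OpenLB code): an explicit instantiation definition
+forces the compiler to emit the definitions for one concrete set of template arguments into the
+current translation unit, which is what allows the main application to link against a library
+built from it.
+
+#+BEGIN_SRC cpp
+// illustration of the mechanism used by the generated olbcuda.cu (not OpenLB code)
+template <typename T>
+struct Collision {
+  void apply(T* cell);
+};
+
+template <typename T>
+void Collision<T>::apply(T* cell) {
+  *cell += T{1};
+}
+
+// force emission of Collision<float>'s methods into this translation unit;
+// a unit compiled against only the declaration of Collision can now link to it
+template class Collision<float>;
+#+END_SRC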
+
+Note that this is only possible due to full knowledge of and control over the target codebase.
+In case this is not clear already: In no way do I recommend that this approach be followed in a more general context[fn:7].
+It was only the quickest and most maintainable approach to achieving the stated requirements given the particulars of OpenLB.
+
+As soon as the build system dumped the first =olbcuda.cu= file into the =build= directory I thought that all that remained was to compile this into a shared library and link it all together.
+However, the resulting shared library contained not only the explicitly instantiated symbols but also additional stuff that they required.
+This caused quite a few duplicate symbol errors when I tried to link the library and the main executable.
+While linking could still be forced by ignoring these errors, the resulting executable was not running properly.
+This is where I encountered something unfamiliar to me: linker version scripts.
+
+As for basically every question one encounters in the context of such fundamental software as GNU =ld=, first released alongside the other GNU Binutils in the 80s, a solution has long since been developed.
+For our particular problem the solution is /linker version scripts/.
+
+#+BEGIN_SRC
+LIBOLBCUDA { global:
+/* list of mangled symbols to globally expose [...] */
+_ZGVZN3olb9utilities14TypeIndexedMapIPNS_12AnyFieldTypeIfNS_11descriptors5D3Q19IJEEELNS_8PlatformE0EEENS_17FieldTypeRegistryIfS5_LS6_0EEEE9get_indexINS_18OperatorParametersINS_19CombinedRLBdynamicsIfS5_NS_8dynamics5TupleIfS5_NS_7momenta5TupleINSH_11BulkDensityENSH_12BulkMomentumENSH_10BulkStressENSH_11DefineToNEqEEENS_10equilibria11SecondOrderENS_9collision3BGKENSF_18DefaultCombinationEEENSI_INSH_18InnerEdgeDensity3DILi0ELi1ELi1EEENSH_28FixedVelocityMomentumGenericENSH_17InnerEdgeStress3DILi0ELi1ELi1EEENSH_16DefineSeparatelyEEEEEEEEEmvE5index;
+local: *;
+};
+#+END_SRC
+
+Such a file can be passed to the linker via the =--version-script= argument and can be used to control which symbols the shared library should expose.
+For our /mixed/ build mode the generation of this script is realized as an additional Makefile target:
+
+#+BEGIN_SRC makefile
+build/olbcuda.version: $(CUDA_OBJ_FILES)
+ echo 'LIBOLBCUDA { global: ' > $@
+# Declare exposed explicitly instantiated symbols to prevent duplicate definitions by:
+# - filtering for the set of automatically instantiated classes
+# - excluding CPU_SISD symbols (we only instantiate GPU_CUDA-related symbols)
+# - dropping the shared library location information
+# - postfixing by semicolons
+ nm $(CUDA_OBJ_FILES) \
+ | grep '$(subst $() $(),\|,$(EXPLICIT_CLASS_INSTANTIATION))\|cuda.*device\|checkPlatform' \
+ | grep -wv '.*sisd.*' \
+ | cut -c 20- \
+ | sed 's/$$/;/' >> $@
+ echo 'local: *; };' >> $@
+#+END_SRC
+
+Note that we do not need to manually mangle the symbols in our =olbcuda.cu= but can simply read them from the library's object file using the =nm= utility.
+The two instances of =grep= are again the point where knowledge of the code base is inserted[fn:8].
+
+At this point all that is left is to link it all together using some final build targets:
+
+#+BEGIN_SRC makefile
+libolbcuda.so: $(CUDA_OBJ_FILES) build/olbcuda.version
+ $(CUDA_CXX) $(CUDA_CXXFLAGS) -Xlinker --version-script=build/olbcuda.version -shared $(CUDA_OBJ_FILES) -o $@
+
+$(EXAMPLE): $(OBJ_FILES) libolbcuda.so
+ $(CXX) $(OBJ_FILES) -o $@ $(LDFLAGS) -L . -lolbcuda -lolbcore $(CUDA_LDFLAGS)
+#+END_SRC
+
+Here the shared library is compiled using the separately defined =CUDA_CXX= compiler and associated flags while the example case is compiled using =CXX=, realizing the required mixed compiler setup.
+For the final target we can now define a mode that only recompiles the main application while reusing the shared library:
+
+#+BEGIN_SRC makefile
+$(EXAMPLE)-no-cuda-recompile: $(OBJ_FILES)
+ $(CXX) $^ -o $(EXAMPLE) $(LDFLAGS) -L . -lolbcuda -lolbcore $(CUDA_LDFLAGS)
+
+.PHONY: no-cuda-recompile
+no-cuda-recompile: $(EXAMPLE)-no-cuda-recompile
+#+END_SRC
+
+While the initial compile of both the main CPU application and the GPU shared library still takes its time, any additional recompile using =make no-cuda-recompile= is sped up significantly.
+For example the following full compilation of a heterogeneous application with MPI, OpenMP, AVX-512 Vectorization on CPU and CUDA on GPU takes around 115 seconds:
+
+#+BEGIN_SRC bash
+λ ~/p/c/o/e/t/nozzle3d (openlb-env-gcc-openmpi-cuda-env) • time make
+mpic++ -O3 -Wall -march=native -mtune=native -std=c++17 -pthread -DPARALLEL_MODE_MPI -DPLATFORM_CPU_SISD -DPLATFORM_GPU_CUDA -DDEFAULT_FLOATING_POINT_TYPE=float -I../../../src -c -o nozzle3d.o nozzle3d.cpp
+mpic++ nozzle3d.o -lpthread -lz -ltinyxml -L../../../build/lib -lolbcore 2>&1 | grep -oP ".*undefined reference to \`\K[^']+\)" | sort | uniq > build/missing.txt
+nvcc -O3 -std=c++17 --generate-code=arch=compute_75,code=[compute_75,sm_75] --extended-lambda --expt-relaxed-constexpr -rdc=true -I../../../src -DPARALLEL_MODE_MPI -DPLATFORM_CPU_SISD -DPLATFORM_GPU_CUDA -Xcompiler -fPIC -c -o build/olbcuda.o build/olbcuda.cu
+nvcc -O3 -std=c++17 --generate-code=arch=compute_75,code=[comput