* Benefiting from deliberately failing linkage
Realizing that I have not written anything here for two /years/, let's just start writing again[fn:-1]:
Compilation times for template-heavy C++ codebases such as the one at [[https://openlb.net][the center of my daily life]] can be a real pain.
This only got worse once I really started to get my hands dirty in its depths during the [[https://www.helmholtz-hirse.de/series/2022_12_01-seminar_9.html][extensive refactoring]] towards SIMD and GPU support[fn:0].
The current sad high point in compilation times was reached when compiling the first GPU-enabled simulation cases: more than 100 seconds for a single compile on my not-too-shabby system.
This article will detail how I significantly reduced this on the build system level while gaining useful features.

#+BEGIN_SRC bash
λ ~/p/c/o/e/t/nozzle3d (openlb-env-cuda-env) • time make
make -C ../../.. core
make[1]: Entering directory '/home/common/projects/contrib/openlb-master'
make[1]: Nothing to be done for 'core'.
make[1]: Leaving directory '/home/common/projects/contrib/openlb-master'
nvcc -pthread --forward-unknown-to-host-compiler -x cu -O3 -std=c++17 --generate-code=arch=compute_75,code=[compute_75,sm_75] --extended-lambda --expt-relaxed-constexpr -rdc=true -DPLATFORM_CPU_SISD -DPLATFORM_GPU_CUDA  -DDEFAULT_FLOATING_POINT_TYPE=float -I../../../src -c -o nozzle3d.o nozzle3d.cpp
nvcc nozzle3d.o -o nozzle3d -lolbcore -lpthread -lz -ltinyxml -L/run/opengl-driver/lib -lcuda -lcudadevrt -lcudart -L../../../build/lib
________________________________________________________
Executed in  112.27 secs    fish           external
   usr time  109.46 secs  149.00 micros  109.46 secs
   sys time    2.42 secs   76.00 micros    2.42 secs
#+END_SRC

Even considering that this compiles many dozens of individual CUDA kernels for multiple run-time selectable physical models and boundary conditions in addition to the simulation scaffold[fn:1], it still takes too long for comfortable iteration during development.
Needless to say, things did not improve when I started working on heterogeneous execution and the single executable also needed to contain vectorized versions of all models for execution on CPUs, in addition to MPI and OpenMP routines.
Even worse, you really want to use Intel's C++ compilers when running CPU-based simulations on Intel-based clusters[fn:2], which plainly is not possible in such a /homogeneous/ compiler setup where everything has to pass through =nvcc=.

#+BEGIN_SRC bash
λ ~/p/c/o/e/t/nozzle3d (openlb-env-gcc-env) • time make
g++ -O3 -Wall -march=native -mtune=native -std=c++17 -pthread -DPLATFORM_CPU_SISD  -I../../../src -c -o nozzle3d.o nozzle3d.cpp
g++ nozzle3d.o -o nozzle3d -lolbcore -lpthread   -lz -ltinyxml     -L../../../build/lib
________________________________________________________
Executed in   31.77 secs    fish           external
   usr time   31.21 secs    0.00 micros   31.21 secs
   sys time    0.55 secs  693.00 micros    0.55 secs
#+END_SRC

Comparing the GPU build to the previous CPU-only compilation time of around 32 seconds -- itself nothing to write home about -- it was clear that time would be best spent on separating out the CUDA side of things, both to mitigate its impact on compilation times and to enable a /mixed/ compiler environment.

[fn:-1] …and do my part in feeding the LLM training machine :-)
[fn:0] Definitely a double-edged sword: On the one hand it enables concise DSL-like compositions of physical models while supporting automatic code optimization and efficient execution across heterogeneous hardware. On the other hand, my much younger, Pascal-fluent self would not be happy with how cryptic and unmaintainable many of my listings can look to the outsider.
In any case, OpenLB as a heavily templatized and meta-programmed C++ software library is a foundational design decision.
[fn:1] Data structures, pre- and post-processing logic, IO routines, ...
[fn:2] Commonly improving performance by quite a few percent

** Requirements
Firstly, any solution would need to exist within the established plain-Makefile-based build system[fn:3] and should not complicate the existing build workflow for our users[fn:4].
Secondly, it should allow for defining completely different compilers and configuration flags for the CPU- and the GPU-side of the application.
The initial driving force of speeding up GPU-targeted compilation would then be satisfied as a side effect: the CPU side of things can be recompiled on its own as long as no new physical models are introduced. This restriction is acceptable in the present context as GPU kernels execute the computationally expensive part, i.e. the actual simulation, but generally do not change often during the development of new simulation cases once the physical model has been chosen.

[fn:3] Which was a deliberate design decision in order to minimize dependencies, considering the minimal build complexity required by OpenLB as a plain CPU-only MPI code. While this could of course be reconsidered in the face of increased target complexity, it was not the time to open that bottle.
[fn:4] Mostly domain experts from process engineering, physics or mathematics without much experience in software engineering.

** Approach
Following the requirements, a basic approach is to split the application into two compilation units: one containing only the CPU implementation, consisting of the high-level algorithmic structure, pre- and post-processing, communication logic, CPU-targeted simulation kernels and calls to the GPU code.
The other contains only the GPU code, consisting of CUDA kernels and their immediate wrappers called from the CPU side of things -- i.e. only those parts that truly need to be compiled using NVIDIA's =nvcc=.
Given two separated files =cpustuff.cpp= and =gpustuff.cu= it would be easy to compile them using separate configurations and then link them together into a single executable.
The main implementation problem is how to generate two such separated compilation units that can be cleanly linked together, i.e. without duplicating symbols and similar hurdles.
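
Setting the symbol duplication issue aside for a moment, a minimal sketch of such a split -- using the hypothetical =cpustuff.cpp= and =gpustuff.cu= from above together with generic flags and omitted library search paths, not the actual OpenLB configuration -- could look like this:

#+BEGIN_SRC bash
# compile the CPU unit with an ordinary host compiler
g++  -O3 -std=c++17 -c -o cpustuff.o cpustuff.cpp
# compile the GPU unit with NVIDIA's compiler
nvcc -O3 -std=c++17 -c -o gpustuff.o gpustuff.cu
# link both objects into a single executable against the CUDA runtime
g++ cpustuff.o gpustuff.o -o application -lcudart
#+END_SRC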

** Implementation
In days past the build system actually contained an option for such separated compilation, termed the /pre-compiled mode/ in OpenLB speak.
This mode consisted of a somewhat rough and leaky separation between interface and implementation headers that was augmented by many hand-written C++ files containing explicit template instantiations of the aforementioned implementations for certain common arguments.
These C++ files could then be compiled once into a shared library that was linked to the application unit compiled without access to the implementation headers.
While this worked, it was always a struggle to keep these files maintained.
Additionally, any benefit for the -- at that time CPU-only -- codebase was negligible and in the end no longer worth the effort, causing the mode to be dropped somewhere on the road to release 1.4.
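
In essence, the pre-compiled mode boiled down to something like the following -- sketched here with hypothetical file and library names rather than the actual OpenLB layout:

#+BEGIN_SRC bash
# once: compile the hand-written explicit instantiation files into a shared library
g++ -O3 -std=c++17 -fPIC -shared instantiations/*.cpp -o libolbpre.so
# per case: compile against the interface headers only and link the library
g++ -O3 -std=c++17 -c -o case.o case.cpp
g++ case.o -o case -L. -lolbpre
#+END_SRC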

Nevertheless, the basic approach of compiling a shared library of explicit template instantiations is sound if we can find a way to automatically generate the instantiations per case instead of manually maintaining them.
A starting point for this is to take a closer look at the linker errors produced when compiling a simulation case including only the interface headers for the GPU code.
These errors contain partial signatures of all relevant methods, ranging from plain function calls

#+BEGIN_SRC bash
λ ~/p/c/o/e/l/cavity3dBenchmark (openlb-env-gcc-openmpi-cuda-env) • mpic++ cavity3d.o  -lpthread -lz -ltinyxml -L../../../build/lib -lolbcore
cavity3d.cpp:(...): undefined reference to `olb::gpu::cuda::device::synchronize()'
#+END_SRC

to bulk and boundary collision operator constructions

#+BEGIN_SRC bash
cavity3d.cpp:(...): undefined reference to `olb::ConcreteBlockCollisionO<float, olb::descriptors::D3Q19<>, (olb::Platform)2, olb::dynamics::Tuple<float, olb::descriptors::D3Q19<>, olb::momenta::Tuple<olb::momenta::BulkDensity, olb::momenta::BulkMomentum, olb::momenta::BulkStress, olb::momenta::DefineToNEq>, olb::equilibria::SecondOrder, olb::collision::BGK, olb::dynamics::DefaultCombination> >::ConcreteBlockCollisionO()'
#+END_SRC

as well as core data structure accessors:

#+BEGIN_SRC bash
cavity3d.cpp:(.text._ZN3olb20ConcreteBlockLatticeIfNS_11descriptors5D3Q19IJEEELNS_8PlatformE2EE21getPopulationPointersEj[_ZN3olb20ConcreteBlockLatticeIfNS_11descriptors5D3Q19IJEEELNS_8PlatformE2EE21getPopulationPointersEj]+0x37): undefined reference to `olb::gpu::cuda::CyclicColumn<float>::operator[](unsigned long)'
#+END_SRC

These errors are easily turned into a sorted list of unique missing symbols using basic piping

#+BEGIN_SRC makefile
build/missing.txt: $(OBJ_FILES)
	$(CXX) $^ $(LDFLAGS) -lolbcore 2>&1 \
	| grep -oP ".*undefined reference to \`\K[^']+\)" \
	| sort \
	| uniq > $@
#+END_SRC

which only assumes that the locale is set to English and -- surprisingly -- works consistently across all relevant C++ compilers[fn:5], likely due to all of them using either the GNU linker or a drop-in compatible alternative thereto.
The resulting plain list of C++ method signatures hints at the reasonably structured and consistent template /language/ employed by OpenLB:

#+BEGIN_SRC cpp
olb::ConcreteBlockCollisionO<float, olb::descriptors::D3Q19<>, (olb::Platform)2, olb::CombinedRLBdynamics<float, olb::descriptors::D3Q19<>, olb::dynamics::Tuple<float, olb::descriptors::D3Q19<>, olb::momenta::Tuple<olb::momenta::BulkDensity, olb::momenta::BulkMomentum, olb::momenta::BulkStress, olb::momenta::DefineToNEq>, olb::equilibria::SecondOrder, olb::collision::BGK, olb::dynamics::DefaultCombination>, olb::momenta::Tuple<olb::momenta::VelocityBoundaryDensity<0, -1>, olb::momenta::FixedVelocityMomentumGeneric, olb::momenta::RegularizedBoundaryStress<0, -1>, olb::momenta::DefineSeparately> > >::ConcreteBlockCollisionO()
olb::gpu::cuda::CyclicColumn<float>::operator[](unsigned long)
olb::gpu::cuda::device::synchronize()
// [...]
#+END_SRC

For example, local cell models -- /Dynamics/ in OpenLB speak -- are mostly implemented as tuples of momenta, equilibrium functions and collision operators[fn:6].
All relevant classes tend to follow a consistent structure with respect to which methods they implement and with which arguments and return types.
We can use this domain knowledge of our codebase to transform the incomplete signatures in our new =missing.txt= into a full list of explicit template instantiations written in valid C++.

#+BEGIN_SRC makefile
build/olbcuda.cu: build/missing.txt
# Generate includes of the case source
# (replaceable by '#include <olb.h>' if no custom operators are implemented in the application)
	echo -e '$(CPP_FILES:%=\n#include "../%")' > $@
# Transform missing symbols into explicit template instantiations by:
# - filtering for a set of known and automatically instantiable methods
# - excluding destructors
# - dropping resulting empty lines
# - adding the explicit instantiation prefix (all supported methods are void, luckily)
	cat build/missing.txt \
	| grep '$(subst $() $(),\|,$(EXPLICIT_METHOD_INSTANTIATION))' \
	| grep -wv '.*\~.*\|FieldTypeRegistry()' \
	| xargs -0 -n1 | grep . \
	| sed -e 's/.*/template void &;/' -e 's/void void/void/' >> $@
# - filtering for a set of known and automatically instantiable classes
# - dropping method cruft and wrapping into explicit class instantiation
# - removing duplicates
	cat build/missing.txt \
	| grep '.*\($(subst $() $(),\|,$(EXPLICIT_CLASS_INSTANTIATION))\)<' \
	| sed -e 's/\.*>::.*/>/' -e 's/.*/template class &;/' -e 's/class void/class/' \
	| sort | uniq >> $@
#+END_SRC
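
For illustration, the generated =olbcuda.cu= then starts with the case include followed by one explicit instantiation per missing symbol -- shortened and paraphrased here, the actual file simply contains the full signatures from =missing.txt= wrapped as shown by the =sed= commands above:

#+BEGIN_SRC bash
λ • head -n 3 build/olbcuda.cu
#include "../nozzle3d.cpp"
template void olb::gpu::cuda::device::synchronize();
template class olb::ConcreteBlockCollisionO<float, olb::descriptors::D3Q19<>, (olb::Platform)2, /* ... */>;
#+END_SRC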

Note that this is only possible due to full knowledge of and control over the target codebase.
In case this is not clear already: In no way do I recommend that this approach be followed in a more general context[fn:7].
It was only the quickest and most maintainable approach to achieving the stated requirements given the particulars of OpenLB.

As soon as the build system dumped the first =olbcuda.cu= file into the =build= directory, I thought that all that remained was to compile it into a shared library and link everything together.
However, the resulting shared library contained not only the explicitly instantiated symbols but also additional symbols that they require.
This caused quite a few duplicate symbol errors when I tried to link the library and the main executable.
While linking could still be forced by ignoring these errors, the resulting executable did not run properly.
This is where I encountered something previously unfamiliar to me.

As with basically every question one encounters in the context of software as fundamental as GNU =ld=, first released alongside the other GNU Binutils in the 1980s, a solution has long since been developed.
For our particular problem the solution is /linker version scripts/.

#+BEGIN_SRC
LIBOLBCUDA { global: 
/* list of mangled symbols to globally expose [...] */
_ZGVZN3olb9utilities14TypeIndexedMapIPNS_12AnyFieldTypeIfNS_11descriptors5D3Q19IJEEELNS_8PlatformE0EEENS_17FieldTypeRegistryIfS5_LS6_0EEEE9get_indexINS_18OperatorParametersINS_19CombinedRLBdynamicsIfS5_NS_8dynamics5TupleIfS5_NS_7momenta5TupleINSH_11BulkDensityENSH_12BulkMomentumENSH_10BulkStressENSH_11DefineToNEqEEENS_10equilibria11SecondOrderENS_9collision3BGKENSF_18DefaultCombinationEEENSI_INSH_18InnerEdgeDensity3DILi0ELi1ELi1EEENSH_28FixedVelocityMomentumGenericENSH_17InnerEdgeStress3DILi0ELi1ELi1EEENSH_16DefineSeparatelyEEEEEEEEEmvE5index;
local: *;
};
#+END_SRC

Such a file can be passed to the linker via the =--version-script= argument and can be used to control which symbols the shared library should expose.
For our /mixed/ build mode the generation of this script is realized as an additional Makefile target:

#+BEGIN_SRC makefile
build/olbcuda.version: $(CUDA_OBJ_FILES)
	echo 'LIBOLBCUDA { global: ' > $@
# Declare exposed explicitly instantiated symbols to prevent duplicate definitions by:
# - filtering for the set of automatically instantiated classes
# - excluding CPU_SISD symbols (we only instantiate GPU_CUDA-related symbols)
# - dropping the shared library location information
# - postfixing by semicolons
	nm $(CUDA_OBJ_FILES) \
	| grep '$(subst $() $(),\|,$(EXPLICIT_CLASS_INSTANTIATION))\|cuda.*device\|checkPlatform' \
	| grep -wv '.*sisd.*' \
	| cut -c 20- \
	| sed 's/$$/;/' >> $@
	echo 'local: *; };' >> $@
#+END_SRC

Note that we do not need to manually mangle the symbols in our =olbcuda.cu= but can simply read them from the library's object file using the =nm= utility.
The two instances of =grep= are again the point where knowledge of the code base is inserted[fn:8].
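
Though not part of the build itself, the effect of the version script is easy to sanity-check once the shared library has been linked (see the final targets below) by listing its dynamic symbol table:

#+BEGIN_SRC bash
# list the dynamically exported symbols of the shared library;
# everything hidden by the version script should be absent here
nm -D --defined-only libolbcuda.so | c++filt | less
#+END_SRC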

At this point all that is left is to link it all together using some final build targets:

#+BEGIN_SRC makefile
libolbcuda.so: $(CUDA_OBJ_FILES) build/olbcuda.version
	$(CUDA_CXX) $(CUDA_CXXFLAGS) -Xlinker --version-script=build/olbcuda.version -shared $(CUDA_OBJ_FILES) -o $@

$(EXAMPLE): $(OBJ_FILES) libolbcuda.so
	$(CXX) $(OBJ_FILES) -o $@ $(LDFLAGS) -L . -lolbcuda -lolbcore $(CUDA_LDFLAGS)
#+END_SRC

Here the shared library is compiled using the separately defined =CUDA_CXX= compiler and associated flags while the example case is compiled using =CXX=, realizing the required mixed compiler setup.
For the final target we can now define a mode that only recompiles the main application while reusing the shared library:

#+BEGIN_SRC makefile
$(EXAMPLE)-no-cuda-recompile: $(OBJ_FILES)
	$(CXX) $^ -o $(EXAMPLE) $(LDFLAGS) -L . -lolbcuda -lolbcore $(CUDA_LDFLAGS)

.PHONY: no-cuda-recompile
no-cuda-recompile: $(EXAMPLE)-no-cuda-recompile
#+END_SRC

While the initial compile still has to build both the main CPU application and the GPU shared library, any additional recompile using =make no-cuda-recompile= is sped up significantly.
For example, the following full compilation of a heterogeneous application combining MPI parallelization on CPU with CUDA on GPU takes around 115 seconds:

#+BEGIN_SRC bash
λ ~/p/c/o/e/t/nozzle3d (openlb-env-gcc-openmpi-cuda-env) • time make
mpic++ -O3 -Wall -march=native -mtune=native -std=c++17 -pthread -DPARALLEL_MODE_MPI  -DPLATFORM_CPU_SISD -DPLATFORM_GPU_CUDA  -DDEFAULT_FLOATING_POINT_TYPE=float -I../../../src -c -o nozzle3d.o nozzle3d.cpp
mpic++ nozzle3d.o  -lpthread -lz -ltinyxml -L../../../build/lib -lolbcore 2>&1 | grep -oP ".*undefined reference to \`\K[^']+\)" | sort | uniq > build/missing.txt
nvcc -O3 -std=c++17 --generate-code=arch=compute_75,code=[compute_75,sm_75] --extended-lambda --expt-relaxed-constexpr -rdc=true -I../../../src -DPARALLEL_MODE_MPI -DPLATFORM_CPU_SISD -DPLATFORM_GPU_CUDA -Xcompiler -fPIC -c -o build/olbcuda.o build/olbcuda.cu
nvcc -O3 -std=c++17 --generate-code=arch=compute_75,code=[compute_75,sm_75] --extended-lambda --expt-relaxed-constexpr -rdc=true -Xlinker --version-script=build/olbcuda.version -shared build/olbcuda.o -o libolbcuda.so
mpic++ nozzle3d.o -o nozzle3d  -lpthread -lz -ltinyxml -L../../../build/lib -L . -lolbcuda -lolbcore -L/run/opengl-driver/lib -lcuda -lcudadevrt -lcudart
________________________________________________________
Executed in  115.34 secs    fish           external
   usr time  112.68 secs  370.00 micros  112.68 secs
   sys time    2.68 secs  120.00 micros    2.68 secs
#+END_SRC

Meanwhile, any additional compilation using =make no-cuda-recompile= that does not introduce new physical models (which would require the instantiation of additional GPU kernels) takes /just/ 37 seconds:

#+BEGIN_SRC bash
λ ~/p/c/o/e/t/nozzle3d (openlb-env-gcc-openmpi-cuda-env) • time make no-cuda-recompile
mpic++ -O3 -Wall -march=native -mtune=native -std=c++17 -pthread -DPARALLEL_MODE_MPI  -DPLATFORM_CPU_SISD -DPLATFORM_GPU_CUDA  -DDEFAULT_FLOATING_POINT_TYPE=float -I../../../src -c -o nozzle3d.o nozzle3d.cpp
mpic++ nozzle3d.o -o nozzle3d  -lpthread -lz -ltinyxml -L../../../build/lib -L . -lolbcuda -lolbcore -L/run/opengl-driver/lib -lcuda -lcudadevrt -lcudart
________________________________________________________
Executed in   36.47 secs    fish           external
   usr time   35.71 secs    0.00 micros   35.71 secs
   sys time    0.75 secs  564.00 micros    0.75 secs
#+END_SRC

This speedup of roughly 3x for most compiles during iterative development alone is worth the effort of introducing this new mode.
Additionally, the logs already showcase /mixed compilation/: the CPU side of things is compiled using =mpic++= wrapping GNU C++ while the shared library is compiled using =nvcc=. This extends seamlessly to more complex setups combining MPI, OpenMP, AVX-512 vectorization on CPU and CUDA on GPU in a single application.

[fn:5] Which spans various versions of GCC, Clang, Intel C++ and NVIDIA =nvcc=
[fn:6] Momenta representing how to compute macroscopic quantities such as density and velocity, the equilibrium representing the /undisturbed/ distribution of said quantities in terms of population values, and the collision operator representing the specific function used to /relax/ the current population towards this equilibrium. For more details on LBM see e.g. my articles on [[/article/fun_with_compute_shaders_and_fluid_dynamics/][Fun with Compute Shaders and Fluid Dynamics]], a [[/article/year_of_lbm/][Year of LBM]]
or even my just-in-time visualized [[https://literatelb.org][literate implementation]].
[fn:7] However, implementing such an explicit instantiation generator that works for any C++ project could be an interesting project for… somebody.
[fn:8] Now that I write about it, this could probably be modified to automatically eliminate conflicts by only exposing the symbols that are missing from the main application.

** Conclusion
All in all, this approach turned out to be unexpectedly stable and portable across systems and compilers, from laptops to supercomputers.
While it certainly is not the most beautiful thing I ever implemented, to say the least, it is very workable in practice and noticeably eases day-to-day development.
In any case, the mixed compilation mode was included in [[https://www.openlb.net/news/openlb-release-1-6-available-for-download/][OpenLB release 1.6]] and has worked without a hitch since then.
It is also isolated to just a few optional Makefile targets and did not require any changes to the actual codebase -- meaning that it can quietly be dropped should a better solution for the requirements come along.

For the potentially empty set of people who have read this far, are interested in CFD simulations using LBM and did not run screaming from the rather /pragmatic/ build solution presented here:
If you want to spend a week learning about LBM theory and OpenLB practice from invited lecturers at the top of the field as well as my colleagues and me, our upcoming [[https://www.openlb.net/spring-school-2024/][Spring School]] may be of interest.
Having taken place for quite a few years now at locations as diverse as Berlin, Tunisia, Krakow and Greenwich, the 2024 edition will be hosted at the historical /Heidelberger Akademie der Wissenschaften/ in March. I'd be happy to meet you there!