From 85d3f49b5b400a1cdf9f85650d436d92c91e64d9 Mon Sep 17 00:00:00 2001
From: Adrian Kummerlaender
Date: Wed, 27 Dec 2023 10:51:52 +0100
Subject: Finalize new article on mixed compilation

---
 ...enefiting_from_deliberately_failing_linkage.org | 110 +++++++++++++++++++--
 1 file changed, 102 insertions(+), 8 deletions(-)

(limited to 'articles')

diff --git a/articles/2023-12-26_benefiting_from_deliberately_failing_linkage.org b/articles/2023-12-26_benefiting_from_deliberately_failing_linkage.org
index e7f3930..a67999d 100644
--- a/articles/2023-12-26_benefiting_from_deliberately_failing_linkage.org
+++ b/articles/2023-12-26_benefiting_from_deliberately_failing_linkage.org
@@ -1,5 +1,5 @@
 * Benefiting from deliberately failing linkage
-Realizing that I have not written anything here for two /years/ lets just start writing again:
+Realizing that I have not written anything here for two /years/, let's just start writing again[fn:-1]:
 Compilation times for template-heavy C++ codebases such as the one at [[https://openlb.net][the center of my daily life]] can be a real pain.
 This mostly got worse since I started to really get my hands dirty in its depths during the [[https://www.helmholtz-hirse.de/series/2022_12_01-seminar_9.html][extensive refactoring]] towards SIMD and GPU support[fn:0].
 The current sad high point in compilation times was reached when compiling the first GPU-enabled simulation cases: More than 100 seconds for a single compile on my not too shabby system.
@@ -35,6 +35,7 @@ Executed in 31.77 secs fish external
 Comparing the GPU build to the previous CPU-only compilation time of around 32 seconds -- while nothing to write home about -- it was still clear that time would be best spent on separating out the CUDA side of things, both to mitigate its performance impact and to enabled a /mixed/ compiler environment.
+[fn:-1] …and do my part in feeding the LLM training machine :-)
 [fn:0] Definitely a double edged sword: On the one side it enables concise DSL-like compositions of physical models while supporting automatic code optimization and efficient execution accross heterogeneous hardware. On the other side my much younger, Pascal-fluent, self would not be happy with how cryptic and unmaintainable many of my listings can look to the outsider. In any case, OpenLB as a heavily templatized and meta-programmed C++ software library is a foundational design decision.
 [fn:1] Data structures, pre- and post-processing logic, IO routines, ...
@@ -74,7 +75,6 @@ to bulk and boundary collision operator constructions
 #+BEGIN_SRC bash
 cavity3d.cpp:(...): undefined reference to `olb::ConcreteBlockCollisionO, (olb::Platform)2, olb::dynamics::Tuple, olb::momenta::Tuple, olb::equilibria::SecondOrder, olb::collision::BGK, olb::dynamics::DefaultCombination> >::ConcreteBlockCollisionO()'
-cavity3d.cpp:(...): undefined reference to `olb::ConcreteBlockCollisionO, (olb::Platform)2, olb::CombinedRLBdynamics, olb::dynamics::Tuple, olb::momenta::Tuple, olb::equilibria::SecondOrder, olb::collision::BGK, olb::dynamics::DefaultCombination>, olb::momenta::Tuple, olb::momenta::FixedVelocityMomentumGeneric, olb::momenta::InnerCornerStress3D<1, -1, 1>, olb::momenta::DefineSeparately> > >::ConcreteBlockCollisionO()'
 #+END_SRC
 as well as core data structure accessors:
@@ -93,12 +93,13 @@ build/missing.txt: $(OBJ_FILES)
 	| uniq > $@
 #+END_SRC
-which only assumes that the locale is set to english and -- surprisingly -- works consistently accross any relevant C++ compilers[fn:5], likely due to shared or very similar linkers.
-The resulting plain list of C++ method signatures showcases the reasonably structured and consistent template /language/ employed by OpenLB:
+which only assumes that the locale is set to English and -- surprisingly -- works consistently across all relevant C++ compilers[fn:5], likely due to all of them using either the GNU Linker or a drop-in compatible alternative thereto.
+The resulting plain list of C++ method signatures hints at the reasonably structured and consistent template /language/ employed by OpenLB:
 #+BEGIN_SRC cpp
 olb::ConcreteBlockCollisionO, (olb::Platform)2, olb::CombinedRLBdynamics, olb::dynamics::Tuple, olb::momenta::Tuple, olb::equilibria::SecondOrder, olb::collision::BGK, olb::dynamics::DefaultCombination>, olb::momenta::Tuple, olb::momenta::FixedVelocityMomentumGeneric, olb::momenta::RegularizedBoundaryStress<0, -1>, olb::momenta::DefineSeparately> > >::ConcreteBlockCollisionO()
-olb::ConcreteBlockCollisionO, (olb::Platform)2, olb::CombinedRLBdynamics, olb::dynamics::Tuple, olb::momenta::Tuple, olb::equilibria::SecondOrder, olb::collision::BGK, olb::dynamics::DefaultCombination>, olb::momenta::Tuple, olb::momenta::FixedVelocityMomentumGeneric, olb::momenta::RegularizedBoundaryStress<0, 1>, olb::momenta::DefineSeparately> > >::ConcreteBlockCollisionO()
+olb::gpu::cuda::CyclicColumn::operator[](unsigned long)
+olb::gpu::cuda::device::synchronize()
 // [...]
 #+END_SRC
@@ -138,12 +139,105 @@ As soon as the build system dumped the first =olbcuda.cu= file into the =build=
 However, the resulting shared library contained not only the explicitly instantiated symbols but also additional stuff that they required.
 This caused quite a few duplicate symbol errors when I tried to link the library and the main executable.
 While linking could still be forced by ignoring these errors, the resulting executable was not running properly.
-This is where I encountered something unfamiliar to me: Linker version scripts.
+This is where I encountered something unfamiliar to me: linker version scripts.
-[fn:5] Which spans various versions of GCC, Clang and Intel C++
+As with basically every problem one encounters in the context of software as fundamental as GNU =ld=, first released alongside the other GNU Binutils in the 80s, a solution has long since been developed.
+For our particular problem the solution is /linker version scripts/.
+
+#+BEGIN_SRC
+LIBOLBCUDA { global:
+/* list of mangled symbols to globally expose [...] */
+_ZGVZN3olb9utilities14TypeIndexedMapIPNS_12AnyFieldTypeIfNS_11descriptors5D3Q19IJEEELNS_8PlatformE0EEENS_17FieldTypeRegistryIfS5_LS6_0EEEE9get_indexINS_18OperatorParametersINS_19CombinedRLBdynamicsIfS5_NS_8dynamics5TupleIfS5_NS_7momenta5TupleINSH_11BulkDensityENSH_12BulkMomentumENSH_10BulkStressENSH_11DefineToNEqEEENS_10equilibria11SecondOrderENS_9collision3BGKENSF_18DefaultCombinationEEENSI_INSH_18InnerEdgeDensity3DILi0ELi1ELi1EEENSH_28FixedVelocityMomentumGenericENSH_17InnerEdgeStress3DILi0ELi1ELi1EEENSH_16DefineSeparatelyEEEEEEEEEmvE5index;
+local: *;
+};
+#+END_SRC
+
+Such a file can be passed to the linker via the =--version-script= argument and can be used to control which symbols the shared library should expose.
+For our /mixed/ build mode the generation of this script is realized as an additional Makefile target:
+
+#+BEGIN_SRC makefile
+build/olbcuda.version: $(CUDA_OBJ_FILES)
+	echo 'LIBOLBCUDA { global: ' > $@
+# Declare exposed explicitly instantiated symbols to prevent duplicate definitions by:
+# - filtering for the set of automatically instantiated classes
+# - excluding CPU_SISD symbols (we only instantiate GPU_CUDA-related symbols)
+# - dropping the shared library location information
+# - postfixing by semicolons
+	nm $(CUDA_OBJ_FILES) \
+	| grep '$(subst $() $(),\|,$(EXPLICIT_CLASS_INSTANTIATION))\|cuda.*device\|checkPlatform' \
+	| grep -wv '.*sisd.*' \
+	| cut -c 20- \
+	| sed 's/$$/;/' >> $@
+	echo 'local: *; };' >> $@
+#+END_SRC
+
+Note that we do not need to manually mangle the symbols in our =olbcuda.cu= but can simply read them from the library's object file using the =nm= utility.
+The two instances of =grep= are again the point where knowledge of the code base is inserted[fn:8].
+
+At this point all that is left is to link it all together using some final build targets:
+
+#+BEGIN_SRC makefile
+libolbcuda.so: $(CUDA_OBJ_FILES) build/olbcuda.version
+	$(CUDA_CXX) $(CUDA_CXXFLAGS) -Xlinker --version-script=build/olbcuda.version -shared $(CUDA_OBJ_FILES) -o $@
+
+$(EXAMPLE): $(OBJ_FILES) libolbcuda.so
+	$(CXX) $(OBJ_FILES) -o $@ $(LDFLAGS) -L . -lolbcuda -lolbcore $(CUDA_LDFLAGS)
+ #+END_SRC
+
+Here the shared library is compiled using the separately defined =CUDA_CXX= compiler and associated flags while the example case is compiled using =CXX=, realizing the required mixed compiler setup.
+For the final target we can now define a mode that only recompiles the main application while reusing the shared library:
+
+#+BEGIN_SRC makefile
+$(EXAMPLE)-no-cuda-recompile: $(OBJ_FILES)
+	$(CXX) $^ -o $(EXAMPLE) $(LDFLAGS) -L . -lolbcuda -lolbcore $(CUDA_LDFLAGS)
+
+.PHONY: no-cuda-recompile
+no-cuda-recompile: $(EXAMPLE)-no-cuda-recompile
+#+END_SRC
+
+While the initial compile of both the main CPU application and the GPU shared library still takes its time, any additional recompile using =make no-cuda-recompile= is sped up significantly.
+For example the following full compilation of a heterogeneous application with MPI, OpenMP, AVX-512 Vectorization on CPU and CUDA on GPU takes around 115 seconds:
+
+#+BEGIN_SRC bash
+λ ~/p/c/o/e/t/nozzle3d (openlb-env-gcc-openmpi-cuda-env) • time make
+mpic++ -O3 -Wall -march=native -mtune=native -std=c++17 -pthread -DPARALLEL_MODE_MPI -DPLATFORM_CPU_SISD -DPLATFORM_GPU_CUDA -DDEFAULT_FLOATING_POINT_TYPE=float -I../../../src -c -o nozzle3d.o nozzle3d.cpp
+mpic++ nozzle3d.o -lpthread -lz -ltinyxml -L../../../build/lib -lolbcore 2>&1 | grep -oP ".*undefined reference to \`\K[^']+\)" | sort | uniq > build/missing.txt
+nvcc -O3 -std=c++17 --generate-code=arch=compute_75,code=[compute_75,sm_75] --extended-lambda --expt-relaxed-constexpr -rdc=true -I../../../src -DPARALLEL_MODE_MPI -DPLATFORM_CPU_SISD -DPLATFORM_GPU_CUDA -Xcompiler -fPIC -c -o build/olbcuda.o build/olbcuda.cu
+nvcc -O3 -std=c++17 --generate-code=arch=compute_75,code=[compute_75,sm_75] --extended-lambda --expt-relaxed-constexpr -rdc=true -Xlinker --version-script=build/olbcuda.version -shared build/olbcuda.o -o libolbcuda.so
+mpic++ nozzle3d.o -o nozzle3d -lpthread -lz -ltinyxml -L../../../build/lib -L . -lolbcuda -lolbcore -L/run/opengl-driver/lib -lcuda -lcudadevrt -lcudart
+________________________________________________________
+Executed in  115.34 secs    fish           external
+   usr time  112.68 secs  370.00 micros  112.68 secs
+   sys time    2.68 secs  120.00 micros    2.68 secs
+#+END_SRC
+
+Meanwhile any additional compilation without introduction of new physical models (leading to the instantiation of additional GPU kernels) using =make no-cuda-recompile= takes /just/ 37 seconds:
+
+#+BEGIN_SRC bash
+λ ~/p/c/o/e/t/nozzle3d (openlb-env-gcc-openmpi-cuda-env) • time make no-cuda-recompile
+mpic++ -O3 -Wall -march=native -mtune=native -std=c++17 -pthread -DPARALLEL_MODE_MPI -DPLATFORM_CPU_SISD -DPLATFORM_GPU_CUDA -DDEFAULT_FLOATING_POINT_TYPE=float -I../../../src -c -o nozzle3d.o nozzle3d.cpp
+mpic++ nozzle3d.o -o nozzle3d -lpthread -lz -ltinyxml -L../../../build/lib -L . -lolbcuda -lolbcore -L/run/opengl-driver/lib -lcuda -lcudadevrt -lcudart
+________________________________________________________
+Executed in   36.47 secs    fish           external
+   usr time   35.71 secs    0.00 micros   35.71 secs
+   sys time    0.75 secs  564.00 micros    0.75 secs
+#+END_SRC
+
+This speedup by a factor of ~3 for most compiles during iterative development alone is worth the effort of introducing this new mode.
+Additionally, the logs already showcase /mixed compilation/ as the CPU side of things is compiled using =mpic++=, i.e. GNU C++, while the shared library is compiled using =nvcc=. This extends seamlessly to more complex setups combining MPI, OpenMP, AVX-512 vectorization on CPU and CUDA on GPU in a single application.
+
+[fn:5] Which spans various versions of GCC, Clang, Intel C++ and NVIDIA =nvcc=
 [fn:6] Momenta representing how to compute macroscopic quantities such as density and velocity, equilibrium representing the /undistrubed/ representation of said quantities in terms of population values and the collision operator representing the specific function used to /relax/ the current population towards this equilibrium.
 For more details on LBM see e.g. my articles on [[/article/fun_with_compute_shaders_and_fluid_dynamics/][Fun with Compute Shaders and Fluid Dynamics]], a [[/article/year_of_lbm/][Year of LBM]] or even my just-in-time visualized [[https://literatelb.org][literate implementation]].
 [fn:7] However, implementing such a explicit instantiation generator that works for any C++ project could be an interesting project for… somebody.
+[fn:8] Now that I write about it, this could probably be modified to automatically eliminate conflicts by only exposing the symbols that are missing from the main application
 ** Conclusion
-Surprisingly, this quick and dirty approach turned out to be unexpectedly stable and portable accross systems and compilers.
+All in all this approach turned out to be unexpectedly stable and portable across systems and compilers, from laptops to supercomputers.
+While it certainly is not the most beautiful thing I ever implemented, to say the least, it is very workable in practice and noticeably eases day to day development.
+In any case, the mixed compilation mode was included in [[https://www.openlb.net/news/openlb-release-1-6-available-for-download/][OpenLB release 1.6]] and has worked without a hitch since then.
+The mixed compilation mode is also isolated to just a few optional Makefile targets and did not require any changes to the actual codebase -- meaning that it can just quietly be dropped should a better solution for the requirements come along.
+
+For the potentially empty set of people that have read this far, are interested in CFD simulations using LBM and did not run screaming from the rather /pragmatic/ build solution presented here:
+If you want to spend a week learning about LBM theory and OpenLB practice from invited lecturers at the top of the field as well as my colleagues and me, our upcoming [[https://www.openlb.net/spring-school-2024/][Spring School]] may be of interest.
+After quite a few years at diverse locations such as Berlin, Tunisia, Krakow and Greenwich, the 2024 rendition will take place at the historic /Heidelberger Akademie der Wissenschaften/ in March. I'd be happy to meet you there!
--
cgit v1.2.3
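
The symbol extraction that appears in the build log of this patch (=grep -oP ... | sort | uniq=) can be tried in isolation. The following is a minimal sketch, assuming GNU grep with PCRE support (=-P=). The sample linker messages and the helper names =extract_missing= and =sample_log= are hypothetical stand-ins, not actual OpenLB output.

```shell
#!/usr/bin/env bash
# Sketch: turn GNU ld style "undefined reference" errors into a plain,
# de-duplicated list of missing symbol signatures.
# Assumption: GNU grep built with PCRE support (-P).

extract_missing() {
  # \K discards everything matched so far, so only the text between the
  # opening backtick and the closing parenthesis of the signature is printed.
  grep -oP ".*undefined reference to \`\K[^']+\)" | sort | uniq
}

# Hypothetical linker output, used only to exercise the pipeline.
sample_log() {
  cat <<'EOF'
demo.cpp:(.text+0x10): undefined reference to `demo::Solver<float>::step()'
demo.cpp:(.text+0x2a): undefined reference to `demo::Solver<float>::step()'
demo.cpp:(.text+0x44): undefined reference to `demo::sync()'
collect2: error: ld returned 1 exit status
EOF
}

# Prints each missing signature once; the collect2 line is ignored.
sample_log | extract_missing
```

In a real build the input would come from the deliberately failing link step with =2>&1=, as in the log above. The pattern relies on the English wording of the linker message, which is why the locale matters.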