symlbm_playground - Tinkering with LBM, OpenCL and SymPy-based code generation

Age	Commit message	Author
2019-07-25	Fix handling of outer boundary cells [standalone]	Adrian Kummerlaender
	As we can only use a multiplicative mask to distinguish between cell types, and streaming memory offsets are statically resolved, the buffers have to provide well-defined padding in both directions. Otherwise undefined data is accessed, which may distort results.
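The effect described above can be reproduced in a few lines. The following is only an illustrative sketch in Python, not the generated C++ code: the helper names `with_padding` and `stream_masked`, the 1D buffer and the mask values are all assumptions made for this example. It shows that a multiplicative mask cannot hide undefined padding, since `0.0 * nan` is still `nan`.

```python
import math

def with_padding(values, pad, fill):
    """Return a copy of values with `pad` cells of `fill` on both sides (hypothetical helper)."""
    return [fill] * pad + list(values) + [fill] * pad

def stream_masked(padded, mask, pad, offset):
    """Pull one population along a fixed offset without any branching.

    mask[i] is 1.0 for fluid cells and 0.0 for all other cells, so the cell
    type is only expressed by a multiplication, never by an if-statement.
    """
    return [mask[i] * padded[pad + i - offset] for i in range(len(mask))]

values = [1.0, 2.0, 3.0, 4.0]
mask = [0.0, 1.0, 1.0, 0.0]  # assumption: the outer cells are boundary cells

# Well-defined padding: masked boundary cells end up as exactly 0.0.
well_defined = stream_masked(with_padding(values, 1, 0.0), mask, 1, 1)

# NaN stands in for undefined memory: the mask does not remove it.
undefined_left = stream_masked(with_padding(values, 1, math.nan), mask, 1, 1)
undefined_right = stream_masked(with_padding(values, 1, math.nan), mask, 1, -1)
```

With an offset of `1` the read leaves the buffer on the left, with `-1` on the right, which is why the padding is needed in both directions.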
2019-07-25	Add GCC to environment	Adrian Kummerlaender

2019-07-25	Use D3Q19, fix MLUPS calculation	Adrian Kummerlaender
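For reference, MLUPS (million lattice updates per second) is the number of cell updates divided by the runtime in microseconds. A minimal sketch of that calculation follows; the function name `mlups` and the example numbers are made up for illustration and are not taken from the repository.

```python
def mlups(cells, steps, seconds):
    """Million lattice updates per second for `steps` updates of `cells` cells."""
    return cells * steps / (seconds * 1e6)

# Hypothetical run: a 100^3 lattice, 1000 time steps, 2.5 seconds of wall time.
example = mlups(cells=100 * 100 * 100, steps=1000, seconds=2.5)
```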

2019-07-25	Print MLUPS in standalone code	Adrian Kummerlaender

2019-07-23	Generate basic example in plain C++	Adrian Kummerlaender
	An attempt to produce a minimal LBM implementation to benchmark various memory and vectorization schemes on the CPU.
2019-07-18	Update README.md	Adrian Kummerlaender

2019-07-18	Add another GL interop example	Adrian Kummerlaender
	…just for fun
2019-07-10	Update slides for talk	Adrian Kummerlaender

2019-07-10	Add basic talk slides	Adrian Kummerlaender

2019-07-10	Add README.md	Adrian Kummerlaender

2019-07-08	Update benchmark plots	Adrian Kummerlaender

2019-07-06	Update benchmark plots	Adrian Kummerlaender

2019-07-06	Update benchmark scripts	Adrian Kummerlaender

2019-07-06	Add further non-CSE benchmark results @ P100	Adrian Kummerlaender

2019-07-04	Add further non-CSE benchmark results @ K2200	Adrian Kummerlaender

2019-07-04	Update benchmark plots	Adrian Kummerlaender

2019-07-04	Update benchmark results of LDC @ Tesla P100	Adrian Kummerlaender

2019-07-02	Determine discrete velocities of D2Q9 and D3Q27	Adrian Kummerlaender

2019-07-02	Determine lattice speed of sound	Adrian Kummerlaender

2019-07-02	Determine weights using Gauss-Hermite quadrature	Adrian Kummerlaender
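As a worked example of the idea named in the three commits of this date, the sketch below derives the D2Q9 weights and the lattice speed of sound from the three-point Gauss-Hermite rule. It uses exact rational arithmetic from the Python standard library rather than SymPy, so it should be read as an illustration of the technique under that assumption, not as the repository's implementation. All names are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Three-point Gauss-Hermite rule for the weight exp(-x^2/2)/sqrt(2*pi):
# the nodes are the roots of He_3(x) = x^3 - 3x, i.e. x = -sqrt(3), 0, +sqrt(3).
# Each node is stored as (lattice velocity, x^2) to keep the arithmetic exact.
nodes = [(-1, 3), (0, 0), (1, 3)]

def weight_1d(x_squared):
    """Normalized weight w = 3! / (3^2 * He_2(x)^2) with He_2(x) = x^2 - 1."""
    he2 = x_squared - 1
    return Fraction(6, 9 * he2 * he2)

w1 = {c: weight_1d(x2) for c, x2 in nodes}

# Rescaling the nonzero nodes to the lattice velocities +-1 means
# x = c / c_s, hence c_s^2 = 1 / x^2 = 1/3.
cs2 = Fraction(1, 3)

# D2Q9 velocities are the tensor product of the 1D velocities,
# and the weights are the products of the 1D weights.
velocities = list(product((-1, 0, 1), repeat=2))
weights = {(cx, cy): w1[cx] * w1[cy] for cx, cy in velocities}
```

This yields the familiar values 4/9 for the rest velocity, 1/9 for the axis-aligned and 1/36 for the diagonal velocities.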

2019-07-01	Expand LDC benchmark scripts	Adrian Kummerlaender

2019-06-30	Move OpenCL buffers into Memory class	Adrian Kummerlaender

2019-06-29	Implement layout and memory padding	Adrian Kummerlaender
	There are at least two distinct areas where padding can be beneficial on a GPU:
	1. Padding the global thread sizes to support specific thread layouts, e.g. (32,1) layouts require the global lattice width to be a multiple of 32.
	2. Padding the memory layout at the lowest level to align memory accesses, i.e. some GPUs read memory in 128 byte chunks and as such it is beneficial if the operations are aligned accordingly.
	For lattice and thread layout sizes that are powers of two these two padding areas are equivalent. However, when one operates on e.g. a (300,300) lattice using a (30,1) layout, padding to 128 bytes yields a performance improvement of about 10 MLUPS on a K2200.
	Note that I am getting quite unsatisfied with how the Lattice class and its surroundings continue to accumulate parameters. The naming distinction between Geometry, Grid, Memory and Lattice is also not very intuitive.
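The two kinds of padding can be sketched with two small helpers. These are hypothetical functions written for this example, not the interface of the Memory or Lattice classes, and the 4 byte (single precision) value size is an assumption.

```python
def pad_to_multiple(size, multiple):
    """Round `size` up to the next multiple of `multiple`."""
    return -(-size // multiple) * multiple

def padded_row_length(cells, bytes_per_value=4, alignment=128):
    """Row length in values such that every row starts on an `alignment` byte boundary."""
    values_per_chunk = alignment // bytes_per_value
    return pad_to_multiple(cells, values_per_chunk)

# 1. Thread layout padding: a (32,1) layout needs a width that is a multiple of 32,
#    while a (30,1) layout already divides a width of 300.
width_for_32 = pad_to_multiple(300, 32)
width_for_30 = pad_to_multiple(300, 30)

# 2. Memory padding: 300 single precision values occupy 1200 bytes,
#    which is rounded up to 1280 bytes, i.e. 320 values per row.
row_length = padded_row_length(300)
```

For the (300,300) lattice with a (30,1) layout only the second kind of padding changes anything, which matches the case described in the commit message.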
2019-06-28	Move some common benchmark plots into helper functions	Adrian Kummerlaender

2019-06-27	Add some benchmark plots	Adrian Kummerlaender

2019-06-25	Adapt benchmark results format to be importable	Adrian Kummerlaender

2019-06-25	Fix LDC 3D x-z-plane plot	Adrian Kummerlaender

2019-06-25	Add raw data of Tesla P100 benchmarks	Adrian Kummerlaender

2019-06-24	Add basic benchmark scripts, K2200 results	Adrian Kummerlaender

2019-06-22	Add interactive 2D LDC notebook, fix material initialization	Adrian Kummerlaender

2019-06-22	Add platform, precision and thread layout parameters	Adrian Kummerlaender

2019-06-22	Extract parameters in GL interop example	Adrian Kummerlaender

2019-06-21	Gather interop moments in a more generic manner	Adrian Kummerlaender
	i.e. return unshifted moments in an implicitly ordered float4 array. Cell positions are reconstructed by a vertex shader analogously to how it is done in compustream.
2019-06-20	Prototype OpenGL interoperation	Adrian Kummerlaender

2019-06-20	Move back assignment	Adrian Kummerlaender

2019-06-18	Expand square expressions	Adrian Kummerlaender
	Yields another ~5-10 MLUPS in the simple D2Q9 example. Now averaging at ~840 MLUPS for D2Q9 and ~400 MLUPS for D3Q19 on a K2200.
2019-06-17	Extract population offset	Adrian Kummerlaender

2019-06-17	Add function for exporting moments as VTK files	Adrian Kummerlaender

2019-06-16	Add PyEVTK to environment	Adrian Kummerlaender

2019-06-16	Replace some explicit dimension branching	Adrian Kummerlaender

2019-06-16	Select thread layout depending on the descriptor's characteristics	Adrian Kummerlaender

2019-06-16	Declutter gid and offset calculation	Adrian Kummerlaender

2019-06-16	Add D3Q27 descriptor	Adrian Kummerlaender

2019-06-15	Split descriptors and symbolic formulation	Adrian Kummerlaender

2019-06-15	Add support for generating a D3Q19 kernel	Adrian Kummerlaender
	Note how this basically required no changes besides generalizing cell indexing and adding the symbolic formulation of a D3Q19 BGK collision step. Increasing the neighborhood communication from 9 to 19 cells leads to a significant performance "regression": The 3D kernel yields ~360 MLUPS compared to the 2D version's ~820 MLUPS.
2019-06-15	Start to record some benchmarks	Adrian Kummerlaender

2019-06-15	Consistently name population buffers	Adrian Kummerlaender

2019-06-14	Extract geometry information	Adrian Kummerlaender

2019-06-13	Further the separation between descriptor and lattice	Adrian Kummerlaender

2019-06-13	Tidy up symbolic kernel generation	Adrian Kummerlaender