Age  Commit message  Author
2019-06-16  Add D3Q27 descriptor  Adrian Kummerlaender
2019-06-15  Split descriptors and symbolic formulation  Adrian Kummerlaender
2019-06-15  Add support for generating a D3Q19 kernel  Adrian Kummerlaender
Note how this basically required no changes besides generalizing cell indexing and adding the symbolic formulation of a D3Q19 BGK collision step. Increasing the neighborhood communication from 9 to 19 cells leads to a significant performance "regression": The 3D kernel yields ~360 MLUPS compared to the 2D version's ~820 MLUPS.
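The two MLUPS (million lattice updates per second) figures above can be put side by side by estimating the memory traffic they imply. The following is only a back-of-the-envelope sketch: it assumes single precision populations and exactly one load plus one store per population and cell update, neither of which is stated in the log. The function names are hypothetical.

```python
# Rough estimate of the memory traffic implied by an MLUPS figure.
# Assumptions (not taken from the log): float32 populations and one
# read plus one write of every population per cell update.

BYTES_PER_VALUE = 4  # assumed single precision


def bytes_per_update(q: int) -> int:
    """Bytes moved per cell update for q populations (one load, one store each)."""
    return 2 * q * BYTES_PER_VALUE


def implied_bandwidth_gb_s(mlups: float, q: int) -> float:
    """Memory traffic in GB/s implied by a throughput given in MLUPS."""
    return mlups * 1e6 * bytes_per_update(q) / 1e9


d2q9 = implied_bandwidth_gb_s(820, 9)    # 2D kernel, 9 populations
d3q19 = implied_bandwidth_gb_s(360, 19)  # 3D kernel, 19 populations

print(f"D2Q9:  {d2q9:.1f} GB/s")
print(f"D3Q19: {d3q19:.1f} GB/s")
```

Under these assumptions both kernels move a similar amount of data per second, so the drop in MLUPS would mostly reflect the larger number of populations per cell rather than a less efficient kernel.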
2019-06-15  Start to record some benchmarks  Adrian Kummerlaender
2019-06-15  Consistently name population buffers  Adrian Kummerlaender
2019-06-14  Extract geometry information  Adrian Kummerlaender
2019-06-13  Further the separation between descriptor and lattice  Adrian Kummerlaender
2019-06-13  Tidy up symbolic kernel generation  Adrian Kummerlaender
2019-06-13  Add JupyterLab to environment  Adrian Kummerlaender
2019-06-13  Add kernel customization point for velocity boundaries  Adrian Kummerlaender
2019-06-12  Port LDC example to new structure  Adrian Kummerlaender
2019-06-12  Make it easier to exchange initial equilibration logic  Adrian Kummerlaender
2019-06-12  Restructuring  Adrian Kummerlaender
2019-06-12  Initialize material numbers using given geometry function  Adrian Kummerlaender
2019-06-12  Collect moments outside of the lattice class  Adrian Kummerlaender
2019-06-12  Move kernel template into separate file  Adrian Kummerlaender
2019-06-12  Allocate moments buffer only on device  Adrian Kummerlaender
2019-06-11  Restore wrongly deleted file from 75d0088  Adrian Kummerlaender
2019-06-11  Move equilibrization to kernel  Adrian Kummerlaender
2019-06-11  Move D2Q9 codegen into separate file  Adrian Kummerlaender
2019-06-11  Preshift population field pointer  Adrian Kummerlaender
Now averaging ~820 MLUPS again
2019-06-11  Statically resolve indices as far as possible  Adrian Kummerlaender
Interestingly this seems to lose up to 10 MLUPS at first glance. On the other hand such a small difference could also be a temporary load issue.
2019-06-11  Move index calculation to compile time  Adrian Kummerlaender
2019-06-11  Templatize assignment loops  Adrian Kummerlaender
2019-06-11  Start to use codegen for actual kernel generation  Adrian Kummerlaender
2019-06-11  Remove initial vector field example  Adrian Kummerlaender
2019-06-11  Test generation of D3Q19 kernel code in notebook  Adrian Kummerlaender
2019-06-11  Count operations  Adrian Kummerlaender
2019-06-11  Restructure codegen notebook  Adrian Kummerlaender
2019-06-10  Improve plot generation  Adrian Kummerlaender
* Only update moment field when it is actually needed
* => ~825 MLUPS
* Defer plot generation until the actual simulation is done
2019-06-10  Reduce thread block size  Adrian Kummerlaender
=> ~780 MLUPS
2019-06-10  Improve plot output  Adrian Kummerlaender
2019-06-10  Add fixed velocity boundaries to generated LBM kernel  Adrian Kummerlaender
Interestingly this increased performance to ~750 MLUPS compared to ~665 MLUPS.
2019-06-09  First test of partially generated LBM kernel  Adrian Kummerlaender
A kernel extracted from `lbn_codegen.ipynb` yields ~665 MLUPS compared to the ~600 MLUPS produced by a manually optimized kernel. Note that this new kernel currently doesn't handle boundary conditions (but dropping in a density condition doesn't impact performance).
2019-06-09  Start tracking codegen notebook  Adrian Kummerlaender
2019-06-09  Test lid driven cavity  Adrian Kummerlaender
Notice that the indexing order of numpy arrays follows matrix conventions.
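To illustrate the remark on matrix conventions: a default (C-ordered) NumPy array is indexed as `a[row, column]`, so a 2D field of shape `(ny, nx)` is addressed as `a[y, x]`, not `a[x, y]`. The sketch below mimics this with nested lists to stay dependency-free; the lattice size and the helper `flat_index` are hypothetical.

```python
# Matrix convention: the first index selects the row (y), the second the column (x).
n_rows, n_cols = 2, 3  # hypothetical lattice: height 2, width 3

# Each entry stores its own (row, col) position so lookups are easy to check.
grid = [[(row, col) for col in range(n_cols)] for row in range(n_rows)]


def flat_index(row: int, col: int, n_cols: int) -> int:
    """Row-major offset of (row, col) in a flattened buffer."""
    return row * n_cols + col


# Flatten row by row, the way a C-ordered array is laid out in memory.
flat = [cell for row in grid for cell in row]

x, y = 2, 1
print(grid[y][x])                      # the cell at column x, row y
print(flat[flat_index(y, x, n_cols)])  # same cell via the flat buffer
```

Swapping the two indices (`grid[x][y]`) would address a different cell or fail outright on a non-square lattice, which is the typical symptom of mixing up the conventions.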
2019-06-09  Fix relaxation time  Adrian Kummerlaender
2019-06-09  Fix boundaries  Adrian Kummerlaender
2019-06-09  Add periodic performance reporting  Adrian Kummerlaender
2019-06-08  Performance optimizations  Adrian Kummerlaender
Starting point: ~200 MLUPS on an NVIDIA K2200

Changes that did not noticeably impact performance:
* Memory layout AOS vs. SOA (weird, probably highly platform dependent)
* Propagate on read
* Tagging pointers as read / write only
* Manual code inlining

Changes that made things worse:
* Bad thread block sizes

The actual issue:
* Hidden double precision computations

=> Code now yields ~600 MLUPS
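The log does not say where the hidden double precision computations came from. One common source in C-like kernel languages is a floating-point literal without an `f` suffix (`1.0` instead of `1.0f`), which has type `double` by default and can promote the surrounding expression. Assuming that is the kind of issue meant here, the following sketch scans kernel source for such literals. The regular expression, the function name and the sample kernel are illustrative only.

```python
import re

# Matches literals such as 1.0, .5 or 2.5e-3 that are NOT followed by a
# suffix like 'f'. Literals without a decimal point (e.g. 1e-3) are not covered.
UNSUFFIXED_FLOAT = re.compile(
    r"(?<![\w.])(?:\d+\.\d*|\.\d+)(?:[eE][+-]?\d+)?(?![\w.])"
)


def find_double_literals(source: str) -> list:
    """Return (line number, literal) pairs for unsuffixed float literals."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for match in UNSUFFIXED_FLOAT.finditer(line):
            hits.append((line_no, match.group()))
    return hits


# Hypothetical kernel snippet, not taken from the repository.
kernel = """
__kernel void scale(__global float* f) {
    const unsigned int i = get_global_id(0);
    f[i] = f[i] * 1.0/3.0f + 0.5f;
    f[i] += 2.5e-3;
}
"""

hits = find_double_literals(kernel)
for line_no, literal in hits:
    print(f"line {line_no}: {literal}")
```

The scan is purely textual, so it also reports literals inside comments. It is meant as a quick check before inspecting the compiler's verbose output.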
2019-06-04  Update notebook  Adrian Kummerlaender
2019-06-04  Check whether hand-unrolling makes a difference  Adrian Kummerlaender
…it doesn't in this case.
2019-06-04  Enable verbose OpenCL output  Adrian Kummerlaender
2019-05-31  Try out various OpenCL work group sizes using a Jupyter notebook  Adrian Kummerlaender
This is actually quite nice for this kind of experimentation!
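A notebook experiment of this kind first needs a list of work group sizes to try. The sketch below enumerates 2D candidates, assuming each local size must evenly divide the lattice extent in its dimension and must not exceed a device limit on work items per group. The limit of 256 is a placeholder, the function name is hypothetical, and the actual timing loop is left out.

```python
def candidate_work_group_sizes(nx: int, ny: int, max_work_items: int = 256) -> list:
    """List (lx, ly) pairs that divide (nx, ny) and stay within max_work_items."""
    sizes = []
    for lx in range(1, nx + 1):
        if nx % lx != 0:
            continue  # local size must divide the global size in x
        for ly in range(1, ny + 1):
            if ny % ly == 0 and lx * ly <= max_work_items:
                sizes.append((lx, ly))
    return sizes


# Hypothetical 64x64 lattice
candidates = candidate_work_group_sizes(64, 64)
print(len(candidates), "candidates, e.g.", candidates[:5])
```

Each candidate would then be passed as the local size when enqueuing the kernel, with the resulting MLUPS recorded per configuration.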
2019-05-30  Collapse SOA into single array  Adrian Kummerlaender
Weirdly, the performance gains expected from better coalescence of memory accesses are not achieved.
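The coalescing argument can be made concrete by comparing the offsets of the two layouts in a single flat array. This sketch assumes one population value per (cell, direction) pair and that consecutive work items handle consecutive cells. The function names are hypothetical and not taken from the repository.

```python
def index_aos(cell: int, pop: int, q: int) -> int:
    """Array of structures: all q populations of one cell are stored together."""
    return cell * q + pop


def index_soa(cell: int, pop: int, n_cells: int) -> int:
    """Structure of arrays: one contiguous block of n_cells values per population."""
    return pop * n_cells + cell


q, n_cells = 9, 1024  # D2Q9 on a hypothetical lattice with 1024 cells

# Distance in memory between the same population of two neighboring cells:
stride_aos = index_aos(1, 3, q) - index_aos(0, 3, q)
stride_soa = index_soa(1, 3, n_cells) - index_soa(0, 3, n_cells)

print("AOS stride:", stride_aos)
print("SOA stride:", stride_soa)
```

With SOA, neighboring work items touch adjacent addresses when they read the same population, which is the access pattern that usually coalesces. Whether this pays off evidently depends on the platform, as the commit message observes.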
2019-05-29  Move to structure of arrays  Adrian Kummerlaender
2019-05-29  Add Jupyter to nix-shell  Adrian Kummerlaender
2019-05-28  Add const qualifiers for pointers  Adrian Kummerlaender
2019-05-28  Pull streaming for local writes  Adrian Kummerlaender
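In a pull scheme each cell gathers its incoming populations from the neighboring cells and writes only to its own location, so all writes are local. The following is a minimal sketch on a small D2Q9 lattice, not the repository's kernel. It assumes periodic boundaries, and the velocity ordering and function name are chosen for illustration only.

```python
# D2Q9 discrete velocities (ordering chosen for this sketch only)
VELOCITIES = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1),
              (1, 1), (-1, 1), (-1, -1), (1, -1)]


def stream_pull(f_old, nx: int, ny: int):
    """Return a new field in which cell (x, y) pulls population i from (x - cx, y - cy).

    f_old is indexed as f_old[i][y][x]; boundaries are periodic (an assumption).
    """
    f_new = [[[0.0] * nx for _ in range(ny)] for _ in VELOCITIES]
    for i, (cx, cy) in enumerate(VELOCITIES):
        for y in range(ny):
            for x in range(nx):
                # Read from the upstream neighbor, write to the local cell.
                f_new[i][y][x] = f_old[i][(y - cy) % ny][(x - cx) % nx]
    return f_new


nx, ny = 4, 3
f = [[[0.0] * nx for _ in range(ny)] for _ in VELOCITIES]
f[1][0][0] = 1.0           # population moving in +x, placed at (x=0, y=0)
f = stream_pull(f, nx, ny)
print(f[1][0][1])          # the value has moved one cell in +x
```

The opposite "push" variant would read locally and scatter writes to the neighbors. Pulling trades scattered writes for scattered reads.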
2019-05-28  Remove branch to enable vectorization on Intel  Adrian Kummerlaender
Twice the MLUPS!