path: root/implosion.py
Age  Commit message  Author
2019-06-29  Implement layout and memory padding  Adrian Kummerlaender
There are at least two distinct areas where padding can be beneficial on a GPU:

1. Padding the global thread sizes to support specific thread layouts, e.g. (32,1) layouts require the global lattice width to be a multiple of 32.
2. Padding the memory layout at the lowest level to align memory accesses, i.e. some GPUs read memory in 128-byte chunks, so it is beneficial if memory operations are aligned accordingly.

For lattice and thread layout sizes that are powers of two, these two padding areas are equivalent. However, when operating on e.g. a (300,300) lattice using a (30,1) layout, padding to 128 bytes yields a performance improvement of about 10 MLUPS on a K2200.

Note that I am getting quite unsatisfied with how the Lattice class and its surroundings continue to accumulate parameters. The naming distinction between Geometry, Grid, Memory and Lattice is also not very intuitive.
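The two kinds of padding described above can be sketched as follows. This is a minimal illustration and not the code from this commit: the helper names `pad_to_multiple` and `padded_row_width` are hypothetical, and 4-byte (single precision) values are assumed.

```python
def pad_to_multiple(n, multiple):
    """Round n up to the next multiple of `multiple`."""
    return -(-n // multiple) * multiple


def padded_row_width(width, bytes_per_value=4, alignment_bytes=128):
    """Row width (in values) so that every row starts on an aligned address.

    Assumes 4-byte values and 128-byte memory transactions by default.
    """
    values_per_chunk = alignment_bytes // bytes_per_value
    return pad_to_multiple(width, values_per_chunk)


# 1. Thread layout padding: a (32,1) layout needs a global width divisible by 32,
#    while a (30,1) layout already fits a width of 300.
print(pad_to_multiple(300, 32))  # 320
print(pad_to_multiple(300, 30))  # 300

# 2. Memory padding: 128 bytes / 4 bytes per value = 32 values per chunk.
print(padded_row_width(300))     # 320
```

For a width of 256 and a (32,1) layout both helpers return 256, which matches the observation that the two paddings coincide for power-of-two sizes.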
2019-06-22  Add interactive 2D LDC notebook, fix material initialization  Adrian Kummerlaender
2019-06-16  Replace some explicit dimension branching  Adrian Kummerlaender
2019-06-15  Split descriptors and symbolic formulation  Adrian Kummerlaender
2019-06-15  Add support for generating a D3Q19 kernel  Adrian Kummerlaender
Note how this basically required no changes besides generalizing cell indexing and adding the symbolic formulation of a D3Q19 BGK collision step. Increasing the neighborhood communication from 9 to 19 cells leads to a significant performance "regression": The 3D kernel yields ~ 360 MLUPS compared to the 2D version's ~ 820 MLUPS.
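To illustrate what dimension-independent cell indexing and the step from 9 to 19 neighbors can look like, here is a small sketch. It is not taken from the repository: the function names are hypothetical, the first coordinate is assumed to vary fastest in memory, and the velocity sets are built from the standard definitions of D2Q9 and D3Q19.

```python
from itertools import product


def cell_index(position, size):
    """Map a 2D or 3D cell position to a linear index (first coordinate fastest)."""
    index, stride = 0, 1
    for x, n in zip(position, size):
        index += x * stride
        stride *= n
    return index


def velocity_set(dimension):
    """Discrete velocities with components in {-1, 0, 1} and at most two non-zero entries.

    This yields D2Q9 for dimension 2 and D3Q19 for dimension 3.
    """
    return [c for c in product((-1, 0, 1), repeat=dimension)
            if sum(abs(ci) for ci in c) <= 2]


print(len(velocity_set(2)))               # 9
print(len(velocity_set(3)))               # 19
print(cell_index((1, 2), (4, 4)))         # 9
print(cell_index((1, 2, 3), (4, 4, 4)))   # 57
```

The same `cell_index` works for both dimensions because it only iterates over whatever coordinates it is given, so no explicit 2D/3D branching is needed.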
2019-06-15  Consistently name population buffers  Adrian Kummerlaender
2019-06-14  Extract geometry information  Adrian Kummerlaender
2019-06-13  Further the separation between descriptor and lattice  Adrian Kummerlaender
2019-06-13  Tidy up symbolic kernel generation  Adrian Kummerlaender
2019-06-13  Add kernel customization point for velocity boundaries  Adrian Kummerlaender
2019-06-12  Make it easier to exchange initial equilibration logic  Adrian Kummerlaender
2019-06-12  Restructuring  Adrian Kummerlaender
2019-06-11  Restore wrongly deleted file from 75d0088  Adrian Kummerlaender
2019-06-11  Remove initial vector field example  Adrian Kummerlaender
2019-06-09  Fix relaxation time  Adrian Kummerlaender
2019-06-09  Fix boundaries  Adrian Kummerlaender
2019-06-09  Add periodic performance reporting  Adrian Kummerlaender
2019-06-08  Performance optimizations  Adrian Kummerlaender
Starting point: ~200 MLUPS on an NVIDIA K2200

Changes that did not noticeably impact performance:
* Memory layout AOS vs. SOA (weird, probably highly platform dependent)
* Propagate on read
* Tagging pointers as read / write only
* Manual code inlining

Changes that made things worse:
* Bad thread block sizes

The actual issue:
* Hidden double precision computations

=> Code now yields ~600 MLUPS
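For reference, MLUPS (million lattice updates per second) figures like the ones above follow from the cell count, the number of time steps and the elapsed wall-clock time. The helper below is a hypothetical sketch rather than the reporting code of this repository, and the lattice size and timing in the usage line are made-up illustrative values.

```python
def mlups(cell_count, steps, seconds):
    """Million lattice (cell) updates per second."""
    return cell_count * steps / (seconds * 1e6)


# Illustrative values only: a 512x512 lattice, 100 steps, 0.5 s elapsed.
print(mlups(512 * 512, 100, 0.5))
```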
2019-06-04  Check whether hand-unrolling makes a difference  Adrian Kummerlaender
…it doesn't in this case.
2019-05-31  Try out various OpenCL work group sizes using a Jupyter notebook  Adrian Kummerlaender
This is actually quite nice for this kind of experimentation!
2019-05-30  Collapse SOA into single array  Adrian Kummerlaender
Weirdly, the expected performance gains due to better coalescing of memory accesses are not achieved.
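The coalescing argument behind this change can be made concrete by comparing index computations. The sketch below is an assumption about the general technique, not the indexing used in this commit: it assumes one work-item per cell, 9 populations per cell (D2Q9), and hypothetical function names.

```python
def aos_index(cell, q, q_count):
    """Array of structures: all populations of one cell are stored contiguously."""
    return cell * q_count + q


def soa_index(cell, q, cell_count):
    """Structure of arrays in a single buffer: one contiguous block per population."""
    return q * cell_count + cell


q_count, cell_count = 9, 300 * 300

# Distance in memory between the same population of two neighboring cells,
# i.e. between the accesses of two neighboring work-items:
print(aos_index(1, 0, q_count) - aos_index(0, 0, q_count))        # 9
print(soa_index(1, 0, cell_count) - soa_index(0, 0, cell_count))  # 1
```

With the SOA layout neighboring work-items touch neighboring addresses, which is why better coalescing would normally be expected.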
2019-05-29  Move to structure of arrays  Adrian Kummerlaender
2019-05-28  Add const qualifiers for pointers  Adrian Kummerlaender
2019-05-28  Pull streaming for local writes  Adrian Kummerlaender
2019-05-28  Remove branch to enable vectorization on Intel  Adrian Kummerlaender
Twice the MLUPS!
2019-05-27  Add material numbers  Adrian Kummerlaender
2019-05-27  Print some performance statistics  Adrian Kummerlaender
2019-05-26  Add basic D2Q9 LBM  Adrian Kummerlaender
Ported the basic compustream structure