Age | Commit message | Author | |
---|---|---|---|
2019-06-29 | Implement layout and memory padding | Adrian Kummerlaender | |
There are at least two distinct areas where padding can be beneficial on a GPU: 1. Padding the global thread sizes to support specific thread layouts, e.g. (32,1) layouts require the global lattice width to be a multiple of 32. 2. Padding the memory layout at the lowest level to align memory accesses, i.e. some GPUs read memory in 128-byte chunks, so it is beneficial if accesses are aligned accordingly. For lattice and thread layout sizes that are powers of two, these two padding areas are equivalent. However, when one operates on e.g. a (300,300) lattice using a (30,1) layout, padding to 128 bytes yields a performance improvement of about 10 MLUPS on a K2200. Note that I am getting quite unsatisfied with how the Lattice class and its surroundings continue to accumulate parameters. The naming distinction between Geometry, Grid, Memory and Lattice is also not very intuitive. | |||
2019-06-22 | Add interactive 2D LDC notebook, fix material initialization | Adrian Kummerlaender | |
2019-06-16 | Replace some explicit dimension branching | Adrian Kummerlaender | |
2019-06-15 | Split descriptors and symbolic formulation | Adrian Kummerlaender | |
2019-06-15 | Add support for generating a D3Q19 kernel | Adrian Kummerlaender | |
Note how this basically required no changes besides generalizing cell indexing and adding the symbolic formulation of a D3Q19 BGK collision step. Increasing the neighborhood communication from 9 to 19 cells leads to a significant performance "regression": The 3D kernel yields ~ 360 MLUPS compared to the 2D version's ~ 820 MLUPS. | |||
2019-06-15 | Consistently name population buffers | Adrian Kummerlaender | |
2019-06-14 | Extract geometry information | Adrian Kummerlaender | |
2019-06-13 | Further the separation between descriptor and lattice | Adrian Kummerlaender | |
2019-06-13 | Tidy up symbolic kernel generation | Adrian Kummerlaender | |
2019-06-13 | Add kernel customization point for velocity boundaries | Adrian Kummerlaender | |
2019-06-12 | Make it easier to exchange initial equilibration logic | Adrian Kummerlaender | |
2019-06-12 | Restructuring | Adrian Kummerlaender | |
2019-06-11 | Restore wrongly deleted file from 75d0088 | Adrian Kummerlaender | |
2019-06-11 | Remove initial vector field example | Adrian Kummerlaender | |
2019-06-09 | Fix relaxation time | Adrian Kummerlaender | |
2019-06-09 | Fix boundaries | Adrian Kummerlaender | |
2019-06-09 | Add periodic performance reporting | Adrian Kummerlaender | |
2019-06-08 | Performance optimizations | Adrian Kummerlaender | |
Starting point: ~200 MLUPS on an NVIDIA K2200. Changes that did not noticeably impact performance: memory layout AOS vs. SOA (weird, probably highly platform dependent); propagate on read; tagging pointers as read / write only; manual code inlining. Changes that made things worse: bad thread block sizes. The actual issue: hidden double precision computations. => Code now yields ~600 MLUPS | |||
2019-06-04 | Check whether hand-unrolling makes a difference | Adrian Kummerlaender | |
…it doesn't in this case. | |||
2019-05-31 | Try out various OpenCL work group sizes using a Jupyter notebook | Adrian Kummerlaender | |
This is actually quite nice for this kind of experimentation! | |||
2019-05-30 | Collapse SOA into single array | Adrian Kummerlaender | |
Weirdly, the expected performance gains due to better coalescing of memory accesses are not achieved. | |||
2019-05-29 | Move to structure of arrays | Adrian Kummerlaender | |
2019-05-28 | Add const qualifiers for pointers | Adrian Kummerlaender | |
2019-05-28 | Pull streaming for local writes | Adrian Kummerlaender | |
2019-05-28 | Remove branch to enable vectorization on Intel | Adrian Kummerlaender | |
Twice the MLUPS! | |||
2019-05-27 | Add material numbers | Adrian Kummerlaender | |
2019-05-27 | Print some performance statistics | Adrian Kummerlaender | |
2019-05-26 | Add basic D2Q9 LBM | Adrian Kummerlaender | |
Ported the basic compustream structure |
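
The two kinds of padding described in the 2019-06-29 commit can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the repository's implementation: the helper names `pad_to_multiple`, `padded_global_size` and `padded_row_length` are hypothetical, and the sketch assumes 4-byte (single precision) values and a 128-byte read granularity.

```python
def pad_to_multiple(n, multiple):
    """Round n up to the nearest multiple of `multiple`."""
    return ((n + multiple - 1) // multiple) * multiple


def padded_global_size(lattice_size, layout):
    """Thread padding: each global dimension becomes a multiple of the
    corresponding thread layout dimension (hypothetical helper)."""
    return tuple(pad_to_multiple(n, l) for n, l in zip(lattice_size, layout))


def padded_row_length(width, alignment_bytes=128, value_bytes=4):
    """Memory padding: the row length in values is rounded up so that every
    row starts on an `alignment_bytes` boundary.
    Assumes 4-byte values and 128-byte chunks by default."""
    assert alignment_bytes % value_bytes == 0
    return pad_to_multiple(width, alignment_bytes // value_bytes)


# A (30,1) layout already divides a (300,300) lattice, so thread padding
# changes nothing, while memory padding still widens each row.
threads_30 = padded_global_size((300, 300), (30, 1))
threads_32 = padded_global_size((300, 300), (32, 1))
row_300 = padded_row_length(300)

# For power-of-two sizes both kinds of padding agree and are no-ops.
threads_256 = padded_global_size((256, 256), (32, 1))
row_256 = padded_row_length(256)
```

With these assumptions, a 300-wide row of 4-byte values is stored with a stride of 320 values (1280 bytes), which is the case where the commit message reports a gain from memory padding even though the (30,1) thread layout needs no padding of its own.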