|
Using cell lists as parameters for multiple non-branching kernels seems
to reduce performance by ~50 MLUPS (for single precision D2Q9).
This might be alleviated by padding the cell lists to enable thread
layout control or by improved kernel dispatching.
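
The padding idea can be sketched on the host side as follows. This is only an illustration, not the program's actual code: the function name `pad_cell_list`, the work-group size of 4, and the `dummy_cell` index are all assumptions. It also assumes the lattice reserves a scratch cell that may be updated redundantly, so the kernel needs no extra bounds check.

```python
def pad_cell_list(cells, group_size, dummy_cell):
    """Return a copy of `cells` whose length is a multiple of `group_size`.

    Padding entries point at `dummy_cell`, a hypothetical scratch cell
    that the kernel may update redundantly, so no branch is needed to
    skip them.
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    padded = list(cells)
    remainder = len(padded) % group_size
    if remainder:
        # Fill up to the next multiple of the work-group size.
        padded.extend([dummy_cell] * (group_size - remainder))
    return padded


# Five bulk cells padded for an assumed work-group size of 4.
bulk_cells = pad_cell_list([3, 4, 5, 9, 10], group_size=4, dummy_cell=0)
```

With every list length a multiple of the work-group size, the local work size can be set explicitly on dispatch instead of being left to the OpenCL runtime.
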
On the upside, this OpenCL program runs not only on GPUs but is also vectorized on Intel
CPUs, yielding about 180 MLUPS in single precision and, anticlimactically, 85 MLUPS in
double precision on an i7-4790K.
However, both values compare well to the performance of established CPU LBM codes.
|