Poor performance at the particle level calls for a coarser parallelisation model at the grid level. As the grid owns the particles through particle association pointers, and since the traversal triggers the particle comparisons, we choose to parallelise the grid traversal itself (grid-based parallelism). The actual particle position updates (Algorithm 13) are performed in parallel during the touchVertexFirstTime event. Vertices are visited concurrently, and so are the updates of the velocities and configurations of the corresponding particles. Concurrent contact detection routines are invoked in the touchVertexLastTime and enterCell events (Algorithm 13).
CHAPTER 8. MANYCORE CONCURRENCY
The maintenance of the particle-to-grid association exhibits a lower concurrency level because of the on-the-fly particle-to-vertex reassignments that are driven by the position updates. A particle is allowed to move at most within the set of vertices connected to one cell. We realise these moves while the algorithm traverses the grid in parallel. Particle reassignments modify the records at the associated vertices, so for thread safety we may not run the particle-grid maintenance on two vertex-connected cells concurrently. Instead, we run through the cells of each tree level in a red-black Gauß-Seidel fashion. Thread safety of the multiscale traversal is ensured by colouring every second cell to demarcate multilevel cell inter-dependencies. The colouring along every coordinate axis ensures that particle reassignments do not induce read-write race conditions at our vertices. Along with the particle position updates, the collision checks per vertex and per vertex pair (Algorithm 13) are executed in parallel, and the collision points are kept in a safeguarded shared-memory container.
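The colouring argument can be illustrated with a minimal sketch. It is written in Python with a thread pool as a stand-in and is not the implementation discussed in this chapter. It assumes a regular two-dimensional grid on a single level, integer vertex coordinates, and a nearest-vertex reassignment rule; the names `colour_of`, `vertices_of`, `reassign` and `coloured_sweep` are hypothetical. Cells whose indices share the same parity along every axis have no vertex in common, so each colour class can be processed concurrently without two tasks touching the same vertex record.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def colour_of(cell):
    """Colour of a cell: the parity of its index along every coordinate axis."""
    return tuple(c % 2 for c in cell)


def vertices_of(cell):
    """The 2^d vertices adjacent to a cell of a regular grid (assumed layout)."""
    return {tuple(c + o for c, o in zip(cell, offsets))
            for offsets in product((0, 1), repeat=len(cell))}


def reassign(cell, vertex_particles, position):
    """Move every particle held by this cell's vertices to the nearest of those vertices."""
    verts = sorted(vertices_of(cell))
    held = [(v, p) for v in verts for p in list(vertex_particles[v])]
    for v, p in held:
        nearest = min(verts, key=lambda w: sum((a - b) ** 2
                                               for a, b in zip(w, position[p])))
        if nearest != v:
            vertex_particles[v].remove(p)
            vertex_particles[nearest].append(p)


def coloured_sweep(n, vertex_particles, position, workers=4):
    """One sweep over an n x n cell grid, one colour class after the other."""
    cells = list(product(range(n), repeat=2))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for colour in product((0, 1), repeat=2):
            batch = [c for c in cells if colour_of(c) == colour]
            # Cells of one colour share no vertex, so they may run concurrently.
            list(pool.map(lambda c: reassign(c, vertex_particles, position), batch))
```

With two colours per axis the sketch needs 2^d sequential phases per level, which is the price paid for avoiding locks on the vertex records.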
Similarly, the evaluation of the adaptivity criteria requires additional synchronisation and colouring. Grid coarsening phases rely on data movement restrictions to ensure the consistency of the grid morphology over time. When a cell is coarsened, the properties of no two child vertices (associated particles, refinement control parameters) may be lifted into their parent concurrently, as this would trigger undefined behaviour. These events are safeguarded with atomic lock operations. Nevertheless, these metadata operations are negligible in terms of computational cost. During grid morphology changes, we continue to traverse the grid in serial mode until the grid geometry becomes stationary. Although we run the lifts and drops of particles through the grid levels, the updates of virtual particle links, the initialisation of data structures and the allocation of memory in parallel, many of these operations contain synchronisation constraints through operating system calls. It is thus convenient to wait until part of the grid geometry becomes stationary and to skip the parallel treatment of the affected grid regions for one grid sweep. This results in a pipelined parallel DEM-grid traversal that is thread-safe.
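The locking of the lift operation can be sketched as follows. This is a Python stand-in under stated assumptions, not the code used here: the `Vertex` class, the `lift` and `coarsen` functions and the one-lock-per-parent layout are hypothetical, and only the associated particles are lifted.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Vertex:
    """Hypothetical vertex record holding its associated particles and a lock."""

    def __init__(self, particles=()):
        self.particles = list(particles)
        self.lock = threading.Lock()


def lift(child, parent):
    """Lift a child's particles into the parent; the parent lock serialises lifts."""
    with parent.lock:
        parent.particles.extend(child.particles)
        child.particles.clear()


def coarsen(children, parent, workers=4):
    """Lift all children concurrently; no two lifts modify the parent at once."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(lambda child: lift(child, parent), children))
```

Each child is handled by exactly one task, so only the parent record needs protection in this sketch.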
The parallel DEM-grid traversal promises a coarser level of parallelism based upon the grid discretisation, but this is often unnecessary when the majority of grid vertices are unoccupied due to refinement or particle clustering. A better solution is to utilise multiple layers of parallelism. This promises better computational granularity, as the computational work and geometry are often not equally distributed in the domain. For this we rely on a task-based realisation which follows a producer-consumer model. With Intel’s Threading Building Blocks (TBB) [68, 73], unlike traditional OpenMP-based [14] algorithms, we do not assign stationary compute resources to particular algorithmic steps. Instead, the parallelisation model produces and consumes jobs as tasks. We base our parallel formulation on the outermost grid traversal routines, which launch tasks using the Peano framework [91]. In our task-based models, all our proposed shared memory parallelisation layers are combined. Although the three levels are conceptually different from each other, the computational efficiency of one level might depend on the others. With the layers combined, the shared memory output (i.e. the storage of contact points) has to be protected by global semaphores. The shared memory lock frequency depends on the physical geometry configuration and dynamics at hand. Unique contact points are identified infrequently compared with the total number of triangle memory accesses. As such, our total synchronisation penalty is negligible during a traversal.
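Why the synchronisation penalty stays small can be seen in a minimal sketch. It is a Python stand-in and not the TBB-based realisation; the `ContactStore` class, the `compare` and `detect_all` functions, the use of plain points instead of triangles, and the distance threshold `eps` are all assumptions made for illustration. The primitive comparisons run without any lock, and the global lock is taken only when a contact point is actually stored.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations
from math import dist


class ContactStore:
    """Shared contact container guarded by one global lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self.points = set()
        self.lock_count = 0  # how often the global lock was taken

    def add(self, point):
        with self._lock:
            self.lock_count += 1
            self.points.add(point)


def compare(pa, pb, eps, store):
    """Compare all primitives of two particles; lock only when a contact is found."""
    checks = 0
    for a in pa:
        for b in pb:
            checks += 1
            if dist(a, b) < eps:
                store.add(tuple((x + y) / 2 for x, y in zip(a, b)))
    return checks


def detect_all(particles, eps, workers=4):
    """Run all pairwise comparisons concurrently; return the store and check count."""
    store = ContactStore()
    pairs = list(combinations(particles, 2))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        checks = sum(pool.map(lambda pq: compare(pq[0], pq[1], eps, store), pairs))
    return store, checks
```

Comparing `checks` with `store.lock_count` after a run gives a rough measure of how rarely the lock is contended for a given geometry.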
The parallel traversal runs through every vertex and cell to evaluate all local refinement criteria and to identify collision candidates. At this point the algorithm does not trigger an actual particle-to-particle comparison; instead, a task producer wraps each pairwise collision candidate into a task. The tasks are then launched and the grid traversal continues immediately. Such an approach relies on task stealing [73] to keep all cores that are not used by the grid traversal busy. A scheduling subsystem consumes the actual particle-to-particle comparison tasks, executes them, and eventually stores the output contact points. The cores of the machine act as task consumers. At the end of a grid traversal we employ one global synchronisation point: the traversing core waits until all of the launched tasks have completed.
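The producer-consumer pattern described above can be sketched in a few lines. The sketch uses Python's `ThreadPoolExecutor` as a stand-in scheduler; it feeds workers from a shared queue and does not implement the task stealing of TBB. The function `traverse` and its callback parameters `candidates_in` and `narrow_phase` are hypothetical names, and results are collected from the task handles rather than written to a shared container.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def traverse(cells, candidates_in, narrow_phase, workers=4):
    """Traverse the cells, launch one task per collision candidate, then synchronise."""
    futures = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for cell in cells:  # producer: the grid traversal
            for pair in candidates_in(cell):
                # Launch the comparison and continue the traversal immediately.
                futures.append(pool.submit(narrow_phase, pair))
        # Single global synchronisation point at the end of the traversal.
        wait(futures)
    # Keep only the candidates for which the comparison reported a contact.
    return [f.result() for f in futures if f.result() is not None]
```

A call passes the cell container, a function that yields the candidate pairs of a cell, and a function that returns a contact or `None` for a pair.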
The contact detection tasks that are produced during the parallel grid traversal are marked with an execution priority. The traversal itself is assigned the highest priority, but the grid traversal is often computationally empty due to the underlying geometry. Our algorithm can therefore intermix the traversal with contact detection tasks to keep the machine busy. The contact detection tasks are launched as lower-priority background tasks and are executed during the traversal in no particular order. A high number of background tasks per core would indicate task over-subscription and a stacking of tasks at the end of the traversal, whereas an under-subscription would indicate lower task consumption. The number of background tasks launched by the producer at a time can be specified by the user; however, we stick to a number of background tasks equal to the number of hardware cores, which is the ideal setting for many applications [10].
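The effect of the two priority classes can be illustrated with a small scheduler sketch. It is a Python stand-in built on `queue.PriorityQueue`, not the TBB scheduler; the `PriorityScheduler` class and the constants `TRAVERSAL` and `BACKGROUND` are hypothetical, and the sketch assumes that all tasks are queued before the consumers start. The default number of consumer threads follows the number of hardware cores reported by the operating system.

```python
import itertools
import os
import queue
import threading

TRAVERSAL, BACKGROUND = 0, 1  # a lower value is served first


class PriorityScheduler:
    """Consumers always pick the pending task with the highest priority."""

    def __init__(self):
        self._queue = queue.PriorityQueue()
        self._seq = itertools.count()  # tie-breaker, so callables are never compared

    def submit(self, priority, task):
        self._queue.put((priority, next(self._seq), task))

    def run(self, workers=None):
        """Drain the queue with `workers` consumer threads (default: core count)."""
        workers = workers or os.cpu_count() or 1

        def consume():
            while True:
                try:
                    _, _, task = self._queue.get_nowait()
                except queue.Empty:
                    return
                task()

        threads = [threading.Thread(target=consume) for _ in range(workers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
```

With a single consumer the traversal tasks run strictly before the background tasks; with several consumers, background tasks fill the cores that the traversal leaves idle.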
A producer-consumer task-based parallelisation model allows us to utilise all levels of parallelism. We treat computational work as independent task units that may be intermixed and consumed by any thread. This scheme allows the algorithm to execute both the efficient DEM collision phases and the traversal itself in parallel, following a task-stealing paradigm.