Dynamic task scheduling and memory consumption

3.2 StarPU-based multifrontal method

3.2.2 Dynamic task scheduling and memory consumption

The level of concurrency as well as the memory consumption of the multifrontal method based on StarPU depend on the path followed in the DAG and the scheduling policy used. As explained, the numerical tasks (i.e., panel reductions, updates and assemblies) are submitted by activation tasks which are, therefore, extremely important because they potentially increase the concurrency level and enhance the scalability. It must be noted, however, that executing an activation task too early is worthless because the numerical tasks it submits cannot be executed as they depend on many other, previously submitted tasks. On the other hand, the activation tasks in charge of allocating fronts increase the memory consumption. For this reason, during the dynamic construction of the DAG we need an efficient scheduling policy to have an amount of concurrency suited to the resources and to save memory from unnecessary allocations.

Maximizing concurrency while limiting memory consumption is achieved by assigning priority values to each type of task as follows:

• activate: this task increases the memory consumption by allocating frontal matrix data structures. We therefore assign it a negative priority, which prevents the scheduler from allocating new fronts when there is already enough parallelism to feed all resources and thus limits memory consumption;

• assemble: these tasks are critical for concurrency because all numerical tasks of a front depend on them. They are given priority 3, the highest after that of the deactivate operation;

• geqrt: panel operations lie on the critical path of the DAG corresponding to the dense QR factorization when using a 1D block-column partitioning. They are assigned priority 2, which is higher than that of the non-critical update tasks;

• gemqrt: update operations are given priority 1, which is the lowest for numerical tasks;

• deactivate: the deactivation task is responsible for deallocating the data structure of a frontal matrix and thus decreases the memory consumption. We therefore give this task the highest priority, which is 4.
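The priority scheme listed above can be summarized in a few lines of Python. This is only an illustrative sketch, not the qrm_starpu code: the table name `PRIORITY`, the helper `pick_next` and the value -1 chosen for activate are assumptions (the text only states that the activation priority is negative).

```python
# Priorities taken from the list above. The value used for "activate" is an
# assumption: the text only says that it is negative.
PRIORITY = {
    "deactivate": 4,  # frees a front, so it should run as early as possible
    "assemble": 3,    # unlocks the numerical tasks of a front
    "geqrt": 2,       # panel operation, on the critical path
    "gemqrt": 1,      # update operation, lowest among numerical tasks
    "activate": -1,   # allocates a front, chosen only when nothing else is ready
}


def pick_next(ready_kinds):
    """Return the kind of ready task with the highest priority (hypothetical helper)."""
    return max(ready_kinds, key=PRIORITY.__getitem__)


print(pick_next(["gemqrt", "activate", "geqrt"]))  # geqrt
print(pick_next(["activate"]))                     # activate
```

With this ordering an activation is selected only when it is the sole kind of task available, which matches the intent of delaying front allocation while numerical work remains.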

Similarly to the strategy employed in qr_mumps, activating a new front is only considered when no other task can be executed by a worker. As such, we exploit node-level parallelism as much as possible in preference to tree-level parallelism, and thus benefit from better data locality and lower memory consumption. In addition, the native scheduler in qr_mumps can only handle two levels of task priority, whereas our implementation supports an arbitrary number of priority levels.

The order of execution of activation tasks impacts the memory consumption of the factorization because each activation allocates the memory of a front and allows the deallocation of the contribution blocks of its children nodes, as explained in [60]. In order to minimize the memory footprint of the factorization, front activation should follow as closely as possible a postorder traversal of the elimination tree, which allows us to minimize this memory consumption in the sequential case. To use memory efficiently it is therefore important to prioritize the activation tasks according to this postorder traversal. This can be done by modifying the priority associated with activation tasks as follows: for a node numbered i in a postorder traversal, we give the corresponding activation task a priority equal to −i. As shown in the experimental results below, this yields a conservative memory behaviour for the scheduling strategy compared to qr_mumps. This ordering of the elimination tree minimizes the number of active nodes during the factorization and thus maximizes the exploitation of node-level parallelism. However, for parallel execution, there is no guarantee on the actual order of execution of activation tasks and thus no strict control over the memory usage of the factorization.
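The rule "priority −i for the node numbered i in postorder" can be illustrated as follows. This is a minimal sketch under stated assumptions: the tree, the node names and the function `postorder_numbering` are hypothetical, and the order of the children of each node is simply taken as given.

```python
def postorder_numbering(children, root):
    """Number the nodes of a tree 1..n in postorder (children before their parent)."""
    order = {}
    stack = [(root, False)]
    while stack:
        node, expanded = stack.pop()
        if expanded:
            # All children have been numbered: number the node itself.
            order[node] = len(order) + 1
        else:
            stack.append((node, True))
            # Push children in reverse so that the first child is visited first.
            for child in reversed(children.get(node, [])):
                stack.append((child, False))
    return order


# Hypothetical elimination tree: "r" has children "a" and "b";
# "a" has two leaves, "c" and "d".
children = {"r": ["a", "b"], "a": ["c", "d"]}
index = postorder_numbering(children, "r")

# Activation priority of node i is -i, as described in the text.
activation_priority = {node: -i for node, i in index.items()}

print(index)  # {'c': 1, 'd': 2, 'a': 3, 'b': 4, 'r': 5}
```

If the activations of "d" and "b" are both ready, "d" has the higher priority (−2 against −4), so the subtree rooted at "a" tends to be completed before a new branch of the tree is opened.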

Although StarPU comes with some predefined scheduling policies, none of them supports arbitrary priorities. Therefore we choose to implement our own scheduler using the dedicated API (see Section 1.4.2.1). The implementation of our dynamic scheduler, illustrated in Figure 3.4, is based on a central sorted queue where tasks are ordered according to their priority, and can be described by the following two routines:

• the push routine inserts a ready task into the central queue, keeping the list sorted according to the task priorities. Upon termination of a task, the worker that has completed it checks the status of all the tasks which depend on it and, if any of these has become ready for execution, invokes the push method on it. This dependency check plus the push method may be seen as the equivalent of the fill_queues routine, although it is much cheaper because the search space for ready tasks is much more restricted;

• the pop routine retrieves the highest priority task from the central queue. It is called by workers when they become idle and is equivalent to the pick_task routine from qr_mumps.

This scheduler is dynamic, generic and capable of taking into account task priorities. Moreover, it is compatible with every kind of worker supported by the runtime system, including CPU and GPU workers, although we do not recommend its use on such architectures, where smarter scheduling policies are necessary to achieve acceptable performance (see Chapter 6). Unlike the qr_mumps scheduler, it does not exploit data locality in a

[Figure 3.4 diagram: the runtime system, comprising the runtime core, a scheduler holding ready tasks (t) ordered by task priority, and the workers CPU0, CPU1 and CPU2.]

Figure 3.4: Sorted central queue scheduler in qrm_starpu.

aware scheduling proposed therein (see Section 1.7.4.1) only provided mild improvements and that an interleaved memory allocation policy does a much better job of reducing the penalty for distant memory accesses on a limited-size NUMA machine. A much better method for dealing with the complexity of large NUMA machines is based on the concept of contexts, which was developed and evaluated by Hugo et al. [71,70] and which we briefly describe in Section 7.3. Another drawback of this scheduler lies in the fact that all the ready tasks are stored in a single queue: when the number of ready tasks and of working threads is high, this may lead to costly contention on the locks that prevent concurrent accesses to this data structure. In Section 4.4 we will describe the design and implementation of a novel scheduler that aims at overcoming these shortcomings; nevertheless, we believe that the present scheduler provides the necessary features to conduct a fair comparison with the original qr_mumps code.
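The push and pop routines described in this section, together with the single lock whose contention is mentioned above, can be sketched in plain Python. This is not the StarPU scheduling API and not the qrm_starpu implementation: the classes `Task` and `SortedCentralQueue` and the functions `complete` and `run` are hypothetical names, a binary heap is used in place of a sorted list, the priorities of the two activation tasks are example values, and the demonstration uses a single worker loop (the dependency counters are not protected against concurrent updates).

```python
import heapq
import itertools
import threading


class Task:
    def __init__(self, name, priority, deps=()):
        self.name = name
        self.priority = priority
        self.remaining = len(deps)  # number of unfinished predecessors
        self.successors = []
        for dep in deps:
            dep.successors.append(self)


class SortedCentralQueue:
    """Single priority-ordered queue shared by all workers."""

    def __init__(self):
        self._heap = []
        self._lock = threading.Lock()      # one lock: a possible contention point
        self._counter = itertools.count()  # FIFO tie-breaker for equal priorities

    def push(self, task):
        with self._lock:
            # heapq is a min-heap, so the priority is negated.
            heapq.heappush(self._heap, (-task.priority, next(self._counter), task))

    def pop(self):
        with self._lock:
            if not self._heap:
                return None
            return heapq.heappop(self._heap)[2]


def complete(task, queue):
    """Dependency check done by the worker that finished `task`."""
    for succ in task.successors:
        succ.remaining -= 1
        if succ.remaining == 0:
            queue.push(succ)


def run(tasks):
    """Single-worker loop: pop the highest priority task until the queue is empty."""
    queue = SortedCentralQueue()
    for task in tasks:
        if task.remaining == 0:
            queue.push(task)
    executed = []
    while True:
        task = queue.pop()
        if task is None:
            break
        executed.append(task.name)
        complete(task, queue)
    return executed


activate = Task("activate", -1)
assemble = Task("assemble", 3, deps=[activate])
geqrt = Task("geqrt", 2, deps=[assemble])
gemqrt = Task("gemqrt", 1, deps=[geqrt])
deactivate = Task("deactivate", 4, deps=[geqrt, gemqrt])
activate_next = Task("activate_next", -2)  # activation of another front

order = run([activate, assemble, geqrt, gemqrt, deactivate, activate_next])
print(order)
```

In this run the second front is activated only after all the tasks of the first front have been executed, because every numerical task has a higher priority than any activation. Every push and pop goes through the same lock, which is where contention would appear with many workers and many ready tasks.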

In document Task-based multifrontal QR solver for heterogeneous architectures (Page 68-70)