Conclusions - Running stream-like programs on heterogeneous multi-core systems

This chapter introduced a new partitioning heuristic for stream programs. Unlike previous work, it takes account of the partition’s effect on software pipelining and buffer allocation. The algorithm controls the length of the pipeline using the convexity constraint. Unlike previous work, it has a flexible mechanism to take account of the compiler’s ability to fuse kernels.

This chapter also introduced a queue sizing algorithm, which allocates the memory in local stores to streams. Unlike several previous algorithms, it supports streaming programs with non-constant multiplicities and variable computation times. It also achieves higher performance and lower latency than previous algorithms.

Chapter 4

Run-time Decisions

The previous chapter introduced two new algorithms for the stream compiler. The stream compiler transforms the source code into an executable, performing optimisations using the information known before the program starts running. This chapter focuses instead on the decisions taken at run time. In particular, this chapter is concerned with dynamic scheduling of stream programs.

A program can be scheduled either statically, by the compiler, or dynamically, by the run- time system. Chapter 3 was concerned with static scheduling. If the program is scheduled statically, each task becomes a thread containing a loop, each iteration of which executes its kernels’ work functions in sequence. Tasks communicate using communications primitives, which they call as and when they need them. This implies a relatively simple run-time system, similar to acolib. The main advantages of static scheduling are low overhead and low memory use.

If the program is scheduled dynamically, each task is broken up into self-contained, non- blocking tasks. The dynamic scheduler decides in which order to run these tasks, constrained by dependencies between them: a task cannot start if it needs an output from a task that hasn’t yet finished. Tasks consume input when they start and produce output when they complete, rather than communicating as required using communications primitives.

Dynamic scheduling is the only choice when the program’s behaviour is unpredictable. It also supports stream graphs that are not known at compile time, because kernels and streams are introduced dynamically. As illustrated in Figure 4.1, a dynamic scheduler adds overhead, but it can be better overall when the partitioned stream program is unbalanced. The discussion in Section 4.2.2 gives an example.

A dynamically scheduled program still benefits from unrolling and partitioning transformations, since they coarsen the granularity, and bigger tasks have smaller total overhead. In addition, unrolling enables vectorisation, and task fusion enables data reuse and poly- hedral transformations. A partitioning transformation for a dynamic scheduler, unlike the algorithm in Chapter 3, does not need to map tasks to processors.

Section 4.1 defines the dynamic scheduling problem. Section 4.2 describes the previously known scheduling algorithms, and gives a worst case example for each. Section 4.3 develops an adaptive scheduler for stream-like applications. There are two variants of this scheduler, the apriority stream scheduler requires a one-dimensional stream program, and it needs

4. RUN-TIME DECISIONS

Balanced program Unbalanced program

Static schedule

Proc 1

Proc 2

Proc 3

0us 100.0us 200.0us 300.0us 400.0us 500.0us 600.0us Proc 1

Proc 2

Proc 3

0us 100.0us 200.0us 300.0us 400.0us 500.0us 600.0us

Dynamic schedule

Proc 1

Proc 2

Proc 3

0us 100.0us 200.0us 300.0us 400.0us 500.0us 600.0us Proc 1

Proc 2

Proc 3

0us 100.0us 200.0us 300.0us 400.0us 500.0us 600.0us

⇒ static schedule is shorter ⇒ dynamic schedule is shorter

Figure 4.1: Example traces with static and dynamic scheduling (black is scheduling overhead and shades distinguish kernels)

annotations from the compiler. The gpriority general-purpose scheduler is more general, and it does not need any additional information. Section 4.4 is the experimental evaluation.

4.1 The dynamic scheduling problem

4.1.1 Interface to the dynamic scheduler

The stream program starts executing on one processor, in a sequential master thread. When- ever the master thread reaches a function marked as a task, the function does not execute right away. Instead, an instance of that task is created, and added to the partially-generated dependency graph (PDG). The PDG holds tasks until they have finished executing. Each vertex is a task, and each directed edge represents a dependency between tasks. When the predecessors of a task, if any, have all finished, the task becomes ready. Ready tasks are held in a data structure known as the ready queue, until they are issued, in an order determined by the dynamic scheduler. Figure 4.2 illustrates this process, showing the dynamic scheduler in context.

The dynamic scheduler supports the three operations labelled in Figure 4.2: void create-task(Task task-id, Task-set preds)

Task issue-task(Worker worker) void complete-task(Task task-id)

The create-task operation adds a task to the partial dependency graph (PDG). Its ar- guments identify the task, and the set of tasks whose outputs it consumes. The issue-task operation is called by a worker when it is idle; it waits for and returns the next ready task to issue. The complete-task operation is called by a worker when it finishes executing its task; it

4.1 The dynamic scheduling problem Sequential thread(s) Partially-generated dependency graph (PDG) Ready queue Worker 1 Worker 2 Worker 3 ... Workers Scheduler create-task complete-task issue-task

Figure 4.2: Dynamic scheduler in context

notifies the scheduler that the outputs from that task are now ready. A complete scheduler would provide some additional operations, to support initialisation, barriers, and so on, but they are not relevant here. This interface is similar to the interface between Nanos++ [Nana] and its scheduler plugin.

This construction has some direct implications. First, tasks are non-preemptive, so once issued, they run to completion. Second, the scheduler is non-clairvoyant, meaning that task execution times are not known until they complete. Third, the dependency graph is built by a sequential thread, and is not known all at once.

4.1.2 Throttle policy

Another aspect of scheduling is the throttle policy. The program may create a large number of tasks in the course of its execution, so the main thread should be stalled if the PDG contains too many tasks. On Nanos++, this is done by inlining the next task, effectively making the rest of the master thread a new task dependent on the newly created one. This is a reasonable way to prune a recursive computation such as quicksort, which would otherwise create a large number of very small tasks. On CellSs, the main program on the PPE simply stalls, and the processor does not become an extra worker, since all the workers are SPEs.

The decision to stall task creation is taken by the throttle policy. In Nanos++, the default throttle policy is based on the number of ready tasks per processor, inlining tasks if the number of ready tasks is greater than some threshold.

For one-dimensional stream programs, the number of ready tasks is usually confined to a relatively narrow region, as shown in the histograms in Figure 4.3. These histograms show the distribution of the number of ready tasks, sampled at a constant rate. If the threshold is greater than ten, it is never exceeded, and the size of the PDG will grow without limit. If it is less than ten, the main thread will stall too frequently. A better throttle policy is one based on the number of tasks in the PDG. The main thread should be stalled when the PDG contains a certain number of iterations’ worth of tasks.

4. RUN-TIME DECISIONS

0 2 4 6 8 10

Number of ready tasks

Frequency 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0 2 4 6 8 10

Number of ready tasks

Frequency 0.00 0.02 0.04 0.06 0.08 0 2 4 6 8 10

Number of ready tasks

Frequency 0.00 0.05 0.10 0.15 0.20

(a) 1 processor (b) 6 processors (c) 16 processors

Figure 4.3: Number of ready tasks, excluding those running: channelvocoder, using oldest- first

4.1.3 Objective function: comparing schedulers

The precise goal of the scheduler depends on the context. An offline stream program, such as video transcoding or a database query, should be finished as soon as possible. In standard

terminology, the scheduler’s objective would be to minimise the makespan, written Cmax,

the time that the last task finishes.

Other stream programs, such as video playback, operate in real time. This suggests that an additional goal should be to minimise latency, so that an output frame isn’t delayed by the processing of too many tasks from future frames.

An additional constraint on the scheduler is memory footprint. If the memory footprint is too large, either performance will suffer through paging to disk, or the program will fail due to insufficient memory.

In document Running stream-like programs on heterogeneous multi-core systems (Page 85-89)