Chapter 7 Heterogeneous Parallel Execution Model
7.3 GMAC Design and Implementation
7.4.4 Context Migration
The cost of context migration between accelerators is a key factor in developing load-balancing policies for multi-threaded applications and multi-programmed systems. This section measures the cost of context migration and establishes a baseline for future accelerator scheduling research. A synthetic benchmark is used to evaluate the cost of context migration. This benchmark forces an accelerator context migration by changing the thread's accelerator affinity through a GMAC call. Before setting the accelerator affinity, the benchmark allocates shared data structures of a configurable size.
Figure 7.7: Context Migration (migration time in milliseconds, logarithmic axis, against migration size in KiB, for the C870 and GTX285)
Figure 7.7 shows the context migration time for different application dataset sizes. Context migration is an expensive operation, taking about 4 milliseconds on the GTX285 for very small datasets and up to 1 second for datasets of 1GB. The context migration time has two main components: context creation and data transfer. The context creation time, which does not depend on the dataset size, is about 4 milliseconds on the GTX285 (see Figure 7.6) if no previous context exists on the target accelerator. However, the context creation time becomes negligible compared with the data transfer time if the target accelerator already has active contexts. These experimental results show that context migration policies must consider the state of the target accelerator (i.e., initialized vs. idle).
Context migration time becomes dominated by the data transfer cost as the context dataset grows. Figure 7.7 shows that the context migration cost depends linearly on the data transfer size for datasets larger than 2MB. Such a simple dependency could easily be integrated into the context migration policy to estimate the cost of migrating different contexts. This migration time estimate might be used together with history-based algorithms [JVG+09] to trigger context migrations.
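The two-component cost described above (a fixed context-creation term plus a transfer term that grows linearly with the dataset) can be sketched as a small estimator. This is a minimal illustration, not part of GMAC: the class name, its fields, and the 1 ms/MB slope are hypothetical, and the creation cost is simply assumed to be zero when the target accelerator already holds a context. Only the 4 ms creation time is taken from the GTX285 measurement quoted in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MigrationCostModel:
    """Linear estimate: fixed context-creation cost plus per-MB transfer cost."""

    creation_ms: float         # paid only when the target has no context yet
    transfer_ms_per_mb: float  # slope of the linear region (to be fitted)

    def estimate_ms(self, dataset_mb: float, target_has_context: bool) -> float:
        # Assumption: creation cost is treated as zero (negligible) when the
        # target accelerator already has an active context.
        creation = 0.0 if target_has_context else self.creation_ms
        return creation + self.transfer_ms_per_mb * dataset_mb


# Illustrative parameters only: 4 ms matches the GTX285 creation time quoted
# in the text; the 1 ms/MB slope is a placeholder, not a measured value.
model = MigrationCostModel(creation_ms=4.0, transfer_ms_per_mb=1.0)
cold = model.estimate_ms(dataset_mb=16.0, target_has_context=False)
warm = model.estimate_ms(dataset_mb=16.0, target_has_context=True)
```

A scheduler could fit the slope once per accelerator from measurements such as those in Figure 7.7 and then query the estimator before each migration decision.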
that context migration might only be triggered in situations of high load imbalance. For instance, the cost of migrating PNS is two orders of magnitude higher than the accelerator execution time. Among the Parboil benchmarks, context migration would represent little overhead in the case of TPACF. This benchmark has a dataset of 4.77MB (i.e., a migration time of about 20ms on the GTX285) and an average accelerator execution time of 940ms per accelerator invocation. Unfortunately, history-based algorithms are of little help in this case because TPACF performs a single accelerator call. These experimental results highlight the importance of algorithms that select the accelerator assignment at context initialization time.
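The TPACF figures quoted above make the comparison concrete: a 20ms migration against a 940ms accelerator invocation is an overhead of roughly 2%. The sketch below expresses that ratio as a function. The function names and the 5% threshold are hypothetical choices for illustration, not values from the measurements.

```python
def migration_overhead(migration_ms: float, execution_ms: float) -> float:
    """Migration cost expressed as a fraction of accelerator execution time."""
    if execution_ms <= 0:
        raise ValueError("execution_ms must be positive")
    return migration_ms / execution_ms


def is_migration_cheap(migration_ms: float, execution_ms: float,
                       threshold: float = 0.05) -> bool:
    # The 5% threshold is an arbitrary illustrative cut-off.
    return migration_overhead(migration_ms, execution_ms) <= threshold


# TPACF on the GTX285, using the figures quoted in the text.
tpacf = migration_overhead(migration_ms=20.0, execution_ms=940.0)
```

A benchmark whose migration cost is two orders of magnitude above its execution time, as reported for PNS, would yield a ratio near 100 and fail any reasonable threshold.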
7.5 Summary
This chapter has introduced the HPE model for heterogeneous parallel systems. HPE integrates accelerators into the execution thread abstraction provided by most operating systems. In HPE, execution threads are extended with execution modes, which define the hardware resources accessible by the execution thread. All execution threads belonging to the same user process share a common CPU execution mode, which is active when the application is running on the CPU. Moreover, execution threads also own one execution mode per kind of accelerator present in the system. In HPE, applications call accelerators by switching their active execution mode from the CPU to the accelerator being invoked. The HPE model is fully compatible with existing applications, keeping the sequential programming model that programmers are used to. Furthermore, the HPE model provides backwards compatibility by allowing the emulation of accelerators in software whenever they are not present, such as in legacy systems.
Execution modes provide application programmers with a programming model with synchronous accelerator calls. In this programming model, the application execution flow can only be running on one processor (i.e., CPU or accelerator) at a time. HPE allows parallel CPU-accelerator execution by spawning new user threads, in the same way that parallel execution on multi-core CPUs is accomplished in current operating systems. Using separate execution threads to allow concurrent CPU-accelerator execution enables fine-grained synchronization between the code executed on the CPU and the code executed on the accelerator, which is not possible in existing programming models where accelerators are called asynchronously.
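The threading pattern summarized above can be sketched with ordinary user threads. This is only a structural illustration under stated assumptions: `accelerator_call` is a hypothetical pure-Python stand-in for a synchronous accelerator invocation, not a GMAC function, and Python's interpreter lock means the two computations are not truly simultaneous here.

```python
import threading


def accelerator_call(values):
    """Hypothetical stand-in for a synchronous accelerator invocation."""
    return [v * v for v in values]


def cpu_work(values):
    """CPU-side work that proceeds while the other thread is in the call."""
    return sum(values)


def run_concurrently(values):
    results = {}

    def accelerator_thread():
        # The call blocks this thread only, as a synchronous call would.
        results["accelerator"] = accelerator_call(values)

    worker = threading.Thread(target=accelerator_thread)
    worker.start()
    results["cpu"] = cpu_work(values)  # main thread keeps running on the CPU
    worker.join()                      # ordinary thread synchronization
    return results
```

Calling `run_concurrently([1, 2, 3])` collects both results once the worker thread has been joined. Any finer-grained coordination would use the same primitives (locks, condition variables) that threads already use on multi-core CPUs.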
Two implementation approaches for the HPE model in GMAC have been presented. There is a trade-off between system performance and memory isolation in these implementations due to the current lack of memory protection in accelerators. This chapter has outlined the modifications to the accelerator hardware and the accelerator virtual memory structures required to support memory protection while remaining compatible with existing hardware. Experimental results have shown that the HPE model produces little overhead.
7.6 Significance
The HPE model integrates accelerators into the execution model implemented by most contemporary operating systems, providing full backwards compatibility with existing applications and systems. The HPE model, by providing application programmers with an execution model they are familiar with, improves the programmability of heterogeneous parallel systems and eases the adoption of accelerators in general-purpose applications.
The HPE model introduces the concept of execution mode to define the different processors (i.e., CPUs and accelerators) on which execution threads might be executed. Execution modes allow the implementation of an execution model where accelerators are called by requesting an execution mode switch from the OS. Execution mode switches are analogous to privilege-level switches, currently used to perform system calls. Execution modes also provide a simple mechanism for the operating system to run applications on systems without accelerators: on an execution mode switch, the OS emulates in software the accelerator being invoked.
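The mode-switch mechanism with software emulation can be sketched as follows. This is a toy model under stated assumptions, not the GMAC or OS implementation: `ExecutionMode`, `ExecutionThread`, and `call_accelerator` are hypothetical names, and the "hardware" kernel is an ordinary Python function.

```python
from enum import Enum


class ExecutionMode(Enum):
    CPU = "cpu"
    ACCELERATOR = "accelerator"


class ExecutionThread:
    """Toy execution thread that tracks its active execution mode."""

    def __init__(self, accelerator_present):
        self.accelerator_present = accelerator_present
        self.mode = ExecutionMode.CPU
        self.trace = []  # records how each call was served

    def call_accelerator(self, hardware_kernel, software_kernel, *args):
        # Switch to the accelerator mode for the duration of the call.
        self.mode = ExecutionMode.ACCELERATOR
        try:
            if self.accelerator_present:
                self.trace.append("hardware")
                return hardware_kernel(*args)
            # No accelerator present: emulate the call in software.
            self.trace.append("emulated")
            return software_kernel(*args)
        finally:
            # Return to the CPU mode, even if the kernel raised.
            self.mode = ExecutionMode.CPU


def scale(values, factor):
    return [v * factor for v in values]


# A legacy system without an accelerator falls back to emulation.
legacy = ExecutionThread(accelerator_present=False)
result = legacy.call_accelerator(scale, scale, [1, 2, 3], 2)
```

The caller sees the same synchronous interface in both cases, which is the property that lets unmodified applications run on systems without accelerators.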
The HPE model implementation in GMAC has illustrated the need for memory protection mechanisms to be implemented by accelerators. The necessary hardware support outlined in this chapter would allow an efficient implementation of the HPE model while being fully compatible with current applications.