The OpenCL Runtime - The Multi2Sim Simulation Framework. A CPU-GPU Model for Heterogeneous Computing

Multi2Sim’s OpenCL execution framework has been restructured and rewritten in version 4.1, to represent a more realistic organization based on runtime libraries and device drivers. This approach is in contrast to earlier versions, where the runtime only transformed OpenCL API calls into ABI system calls with the same arguments and behavior.

Table 5.2: Supported runtime libraries and redirections from vendor-specific to Multi2Sim runtimes.

| Runtime name and version | Vendor runtime | Redirected M2S runtime | Description |
|---|---|---|---|
| OpenCL 1.1 | libOpenCL.so | libm2s-opencl.so | Support for the OpenCL API, used for the emulation of the AMD Southern Islands GPU. |
| CUDA 4.0 | libcuda.so, libcudart.so | libm2s-cuda.so | Support for the NVIDIA CUDA Runtime/Driver API, used for the emulation of an NVIDIA Fermi GPU. |
| OpenGL 4.2 | libGL.so | libm2s-opengl.so | Support for the OpenGL API, with base and extension functions, used for the emulation of graphic applications. |
| GLUT 2.6.0 | libGLUT.so | libm2s-glut.so | This library allows the user to create and manage windows containing OpenGL contexts, and also read the mouse, keyboard and joystick functions. |
| GLU 8.0.4 | libGLU.so | libm2s-glu.so | This library consists of a number of wrapper functions that invoke base OpenGL API functions. They provide higher-level drawing routines from the more primitive routines that OpenGL provides. |
| GLEW 1.6.0 | libGLEW.so | libm2s-glew.so | This library handles initialization of OpenGL extensions. |

Moving the OpenCL execution burden into the runtime (user code) allows the new library to execute kernels on the CPU. This would have been difficult within the Multi2Sim code, which has no easy way to initiate interaction with, or schedule work onto, guest threads. It also allows for non-blocking OpenCL calls, which would have been difficult to implement using system calls, since a system call suspends the guest thread that initiated it.

Currently, the runtime exposes two OpenCL devices to the application linked with it, returned with a call to clGetDeviceIDs: the x86 and the Southern Islands device. If the OpenCL host program runs on Multi2Sim, all devices are accessible; if it runs natively on the real machine, only the x86 device is visible.

Execution Outside of Multi2Sim

A novel feature introduced in Multi2Sim 4.1 is the support for native execution of the OpenCL runtime. An application can be linked statically or dynamically with this runtime, and then executed directly from a shell, without running it on top of Multi2Sim. When run natively, only the x86 device is accessible from the application. The reason is that the execution of x86 device kernels is completely implemented in the runtime, including the OpenCL object management, kernel binary loading, work-group scheduling, and data management.

To determine which devices to make visible, the runtime detects whether it is running on Multi2Sim or on real hardware right at the initialization of the OpenCL framework, when the application invokes clGetPlatformIDs. This is done by attempting to communicate with the Multi2Sim driver through the dedicated system call interface. When running natively, the real OS receives this event and reports an invalid system call code; when running on Multi2Sim, the driver successfully acknowledges its presence. Based on this outcome, the runtime decides whether to instantiate structures of complementary devices, other than the x86 one.
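The detection step described above can be sketched as follows. This is a minimal illustration in Python rather than the runtime's own C code. The `probe` callable is a hypothetical stand-in for the dedicated system call, and the assumption that a native OS rejects it with `ENOSYS` is ours, not something the text specifies beyond "an invalid system call code".

```python
import errno


def driver_present(probe):
    """Return True if the (hypothetical) driver probe is acknowledged.

    `probe` stands in for the dedicated system call. Assumption: when the
    program runs natively, the OS rejects the call and the failure surfaces
    as OSError with errno ENOSYS.
    """
    try:
        probe()
    except OSError as err:
        if err.errno == errno.ENOSYS:
            return False
        raise  # any other failure is unexpected and should not be hidden
    return True


def visible_devices(probe):
    """Build the device list once, at platform initialization."""
    devices = ["x86"]  # the x86 device is implemented entirely in the runtime
    if driver_present(probe):
        devices.append("southern-islands")  # needs the simulator's driver
    return devices


def native_probe():
    # Simulates the real OS reporting an invalid system call code.
    raise OSError(errno.ENOSYS, "invalid system call")


def simulated_probe():
    # Simulates the Multi2Sim driver acknowledging its presence.
    return 0
```

With these stand-ins, `visible_devices(native_probe)` yields only the x86 device, while `visible_devices(simulated_probe)` yields both.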

User Options

The runtime has some configurable options, passed by the user through environment variables. The reason to use environment variables, as opposed to Multi2Sim command-line options, is that the runtime runs as guest code, which cannot access Multi2Sim’s memory (at least not without the overhead of an explicit communication mechanism). More importantly, the runtime can also run natively, in which case there is no Multi2Sim command line at all. Environment variables, in contrast, are accessible to the runtime both when it runs natively and when it runs on Multi2Sim.

The following options are supported:

• M2S_OPENCL_DEBUG. When set to 1, this variable forces the OpenCL runtime to dump debug information for every API call invoked by the host program. Each call includes the function name and the value for its arguments. Information is also provided on state updates for events, tasks, and command queues.

• M2S_OPENCL_BINARY. This variable contains the path of the OpenCL kernel binary that the runtime should load if the application attempts to run function clCreateProgramWithSource. An application is expected to use clCreateProgramWithBinary when linked with Multi2Sim’s OpenCL runtime. However, if the application’s source code is not available, and the original code creates OpenCL programs from their source, the alternative is compiling the kernel source on a machine with the AMD tools using m2c --amd, importing the generated binary, and setting environment variable M2S_OPENCL_BINARY to point to it.
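The two options above could be read along these lines. This is only a sketch in Python, not the runtime's C implementation. The function names and the dictionary layout are hypothetical, and raising an error when M2S_OPENCL_BINARY is unset is an assumption, since the text does not say what the runtime does in that case.

```python
import os


def read_runtime_options(environ=None):
    """Collect the runtime options from environment variables.

    `environ` defaults to the process environment; passing a plain dict
    makes the function easy to exercise in isolation.
    """
    env = os.environ if environ is None else environ
    return {
        # Debug dumps are enabled only when the variable is exactly "1".
        "debug": env.get("M2S_OPENCL_DEBUG") == "1",
        # Path to a precompiled kernel binary, or None if not provided.
        "kernel_binary": env.get("M2S_OPENCL_BINARY"),
    }


def binary_for_source_program(options):
    """Pick the kernel binary to load when a program is created from source.

    Assumption: with no binary configured there is nothing to load, so the
    sketch reports an error instead of guessing a path.
    """
    path = options["kernel_binary"]
    if path is None:
        raise RuntimeError("M2S_OPENCL_BINARY is not set")
    return path
```

Because both functions only look at environment variables, the same logic applies whether the host program runs natively or as guest code on the simulator.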

As an example, let us run benchmark FFT from the APP SDK 2.5 benchmark suite for x86, available on the website. This benchmark is linked statically with Multi2Sim’s OpenCL runtime. By passing option --device cpu to the benchmark, we can force it to query only the available CPU devices when running clGetDeviceIDs. Option --load FFT_Kernels.bin tells the benchmark to use this kernel binary.

Since we are using the x86 device, we can run the benchmark natively without invoking the simulator. The following command line activates the debug information in the runtime:

$ M2S_OPENCL_DEBUG=1 ./FFT --load FFT_Kernels.bin --device cpu
[libm2s-opencl] call 'clGetPlatformIDs'

The OpenCL host program can use command queues to schedule non-blocking tasks, such as host-device memory transfers or kernel executions. These operations may take a significant amount of time, but they do not necessarily block the host thread that scheduled them. To allow the application to run non-blocking tasks, the runtime associates one software thread (pthread) with each OpenCL command queue object. While a command queue’s thread is blocked executing a task, the host program’s main thread continues to run, possibly queuing other tasks in parallel.
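The one-thread-per-queue arrangement can be illustrated with a small Python sketch. It uses Python's `threading` and `queue` modules in place of pthreads, and the `CommandQueue` class with its `enqueue` and `finish` methods is a hypothetical simplification, not the runtime's actual data structure.

```python
import queue
import threading


class CommandQueue:
    """A command queue served by one dedicated worker thread."""

    def __init__(self):
        self._tasks = queue.Queue()
        # Daemon thread: it must not keep the process alive at exit.
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def enqueue(self, task, *args):
        """Schedule a task and return immediately (non-blocking call)."""
        self._tasks.put((task, args))

    def finish(self):
        """Block the caller until every queued task has completed."""
        self._tasks.join()

    def _run(self):
        # Only this worker thread blocks while a task executes; tasks of
        # one queue run in the order in which they were enqueued.
        while True:
            task, args = self._tasks.get()
            try:
                task(*args)
            finally:
                self._tasks.task_done()
```

A host thread can call `enqueue` several times in a row and keep working. It blocks only when it calls `finish`, which plays a role loosely comparable to `clFinish`.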

The runtime also supports OpenCL events. When an application wants to synchronize tasks across command queues, it can bind them to OpenCL event objects. A task can have an event object associated with it and, at the same time, depend on the completion of a set of other events. When the runtime completes the execution of a task, it automatically wakes up any dependent tasks in all command queues, and execution proceeds.
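Event-based synchronization across command queues can be sketched as follows. This is an illustrative Python model, not the runtime's code. `threading.Event` stands in for an OpenCL event object, the two threads stand in for the threads of two command queues, and `run_task` and `demo` are hypothetical names.

```python
import threading


def run_task(action, wait_list=(), event=None):
    """Run one task the way a command queue thread might.

    The task first waits for every event it depends on, then executes,
    and finally signals its own event so that dependent tasks wake up.
    """
    for dependency in wait_list:
        dependency.wait()
    action()
    if event is not None:
        event.set()


def demo():
    """A kernel in queue B depends on a transfer enqueued in queue A."""
    log = []
    transfer_done = threading.Event()

    queue_b = threading.Thread(
        target=run_task,
        args=(lambda: log.append("kernel"),),
        kwargs={"wait_list": [transfer_done]},
    )
    queue_a = threading.Thread(
        target=run_task,
        args=(lambda: log.append("transfer"),),
        kwargs={"event": transfer_done},
    )

    # Start the dependent task first: it still runs second, because it
    # blocks on the event until the transfer has completed.
    queue_b.start()
    queue_a.start()
    queue_a.join()
    queue_b.join()
    return log
```

Calling `demo()` returns the tasks in dependency order, regardless of which thread was started first.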

Supporting Multiple Devices

The implementation of OpenCL objects in the runtime can be classified in two categories:

• The runtime front-end contains an implementation of the most significant OpenCL API calls, as well as all generic OpenCL objects (platform, context, command queue, event, etc.).

• The runtime back-ends contain implementations for those OpenCL objects with device-specific information, namely the OpenCL program, kernel, and device objects. There is one runtime back-end per device exposed to the application. In devices other than x86, the runtime back-end is in charge of communicating with the device driver using ABI system calls. The x86 back-end runs completely at the runtime back-end level, and without any interaction with a device driver, since the host and device architectures are the same.

The runtime is designed internally with a clear interface between the front-end and the back-ends, based on sets of call-back functions in program, kernel, and device objects. Just as a portable OpenCL application does not alter its source code when targeting different devices (other than the call that selects the device itself), the functions triggered in the runtime’s front-end are the same for every device. Only when a device-specific action is required does the front-end invoke a standard device-specific function in a back-end to do the job. For example, a buffer being transferred from host to device involves a device-specific call that copies the buffer content into the appropriate memory space.
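The call-back interface between front-end and back-ends can be pictured with the Python sketch below. All names (`Backend`, `mem_write`, `enqueue_write_buffer`, the recording back-end) are hypothetical and only illustrate the dispatch pattern; the real runtime defines its call-back sets in C.

```python
class Backend:
    """Set of call-backs that a device back-end registers (hypothetical)."""

    def __init__(self, name, mem_write):
        self.name = name
        self.mem_write = mem_write  # device-specific host-to-device copy


def x86_mem_write(device_mem, offset, data):
    # Host and device share an architecture, so a plain copy is enough.
    device_mem[offset:offset + len(data)] = data


def recording_mem_write(device_mem, offset, data):
    # Stand-in for a back-end that would forward the request to a device
    # driver; here it only records the request it would issue.
    device_mem.append((offset, bytes(data)))


X86 = Backend("x86", x86_mem_write)
OTHER = Backend("other-device", recording_mem_write)


def enqueue_write_buffer(backend, device_mem, offset, data):
    """Front-end entry point: identical code path for every device.

    The front-end treats `device_mem` as an opaque object and delegates
    the actual copy to the back-end that owns it.
    """
    if offset < 0:
        raise ValueError("offset must be non-negative")
    backend.mem_write(device_mem, offset, data)
```

The front-end function never inspects `device_mem`: for the x86 back-end it happens to be a `bytearray`, for the recording back-end a list, which mirrors the idea of opaque, device-dependent objects.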

With this division of responsibilities, the runtime front-end never deals with device-specific details; it only maintains the association of opaque, device-dependent objects with the back-ends that created them. Similarly, the runtime back-ends never have to deal with object lifetime management, command queues, event-based synchronization, or other devices. This modular organization makes it convenient to extend the runtime with additional device back-ends.

Source: The Multi2Sim Simulation Framework. A CPU-GPU Model for Heterogeneous Computing (For Multi2Sim v. 4.2), pages 60-63.