Stream Processing on GPUs Using Distributed Multimedia Middleware


Michael Repplinger^1,2 and Philipp Slusallek^1,2

^1 Computer Graphics Lab, Saarland University, Saarbrücken, Germany

^2 German Research Center for Artificial Intelligence (DFKI), Agents & Simulated Reality, Saarbrücken, Germany

email: [email protected], [email protected]

Abstract. Available GPUs provide increasingly more processing power, especially for multimedia and digital signal processing. Despite the tremendous progress in hardware and thus processing power, there are and always will be applications whose computationally intensive processing algorithms require multiple GPUs, either running inside the same machine or distributed in the network.

Existing solutions for developing applications for GPUs still require a lot of hand-optimization when using multiple GPUs inside the same machine and in general provide no support for using remote GPUs distributed in the network. In this paper we address this problem and show that an open distributed multimedia middleware, like the Network-Integrated Multimedia Middleware (NMM), is able (1) to seamlessly integrate processing components using GPUs while completely hiding GPU-specific issues from the application developer, (2) to transparently combine processing components using GPUs or CPUs, and (3) to transparently use local and remote GPUs for distributed processing.

1 Introduction

Since GPUs are especially designed for stream processing and are freely programmable, they are well suited for multimedia or digital signal processing. Available many-core technologies like Nvidia’s Compute Unified Device Architecture (CUDA) [1] on top of GPUs simplify the development of highly parallel algorithms running on a single GPU.

However, there are still many obstacles and problems when programming applications for GPUs. In general, a GPU can only execute algorithms for processing data, which we call kernels, while the corresponding control logic still has to be executed on the CPU. The main problem for a software developer is that a kernel runs in a different address space than the application itself. To exchange data between the application and the kernel or between different GPUs within the same machine, specialized communication mechanisms (e.g., DMA data transfer), memory areas, and special scheduling strategies have to be used. This seriously complicates integrating GPUs into applications as well as combining algorithms for multimedia or digital signal processing that run on CPUs and GPUs or have to be distributed in the network.


Available open distributed multimedia middleware solutions such as the Network-Integrated Multimedia Middleware (NMM) [2] provide a general abstraction for stream processing using a flow graph based approach. This allows an application to specify stream processing by combining different processing elements into a flow graph, each representing a single algorithm or processing step. A unified messaging system allows sending large data blocks, called buffers, as well as events. This enables the use of these middleware solutions for data-driven as well as event-driven stream processing. Furthermore, they consider the network an integral part of their architecture and allow transparent use and control of local and remote components. The important aspect in this context is that an open distributed multimedia middleware like NMM strongly supports the specialization of all its components to different technologies while still providing a unified architecture for application development.

Thus, the general idea presented in this paper is to treat GPUs and CPUs within a single machine in the same way as a distributed system. We will show that using an open distributed multimedia middleware for stream processing allows (1) to seamlessly integrate processing components using GPUs while completely hiding GPU specific issues from the application developer, (2) to transparently combine processing components using GPUs or CPUs, and (3) to transparently use local and remote GPUs for distributed processing.

This paper is structured as follows: in Section 2 we describe the essential components of NMM for stream processing on GPUs. Section 3 discusses related work and Section 4 presents the integration of CUDA into NMM. Section 5 shows how to use multiple local or distributed GPUs for processing. Section 6 presents performance measurements showing that multimedia applications using CUDA can even improve their performance when using our approach. Section 7 concludes this paper and highlights future work.

2 Open Distributed Multimedia Middleware

Fig. 1. Technology specific aspects are either hidden within nodes or edges of a distributed flow graph.

Distributed Flow Graph: To hide technology specific issues as well as used networking protocols from the application, NMM uses the concept of a distributed flow graph, providing a strict separation of media processing and media transmission as can be seen in Figure 1. The nodes of a distributed flow graph represent processing units and hide all aspects of the underlying technology used for data processing. Edges represent connections between two nodes and hide all specific aspects of data transmission within a transport strategy (e.g., pointer forwarding for local and TCP for network connections). Thus, data streams can flow from distributed source to sink nodes, being processed by each node in-between.

The concept of a distributed flow graph is essential to (1) seamlessly integrate processing components using GPUs and hide GPU-specific issues from the application developer. Corresponding CUDA kernels can then be integrated into nodes. In the following, nodes using the CPU for media processing are called CPU nodes while nodes using the GPU are called GPU nodes. A GPU node runs in the address space of the CPU but configures and controls kernel functions running on a GPU.

Since kernels of a GPU node run in a different address space, data received from main memory has to be copied to the GPU using DMA transfer before it can be processed. An open distributed multimedia middleware hides all aspects of data transmission in the edges of a flow graph and allows these technology-specific aspects to be integrated as a transport strategy. In the case of NMM, a connection service automatically chooses suitable transport strategies and thus allows (2) transparent combination of processing components using GPUs or CPUs in the same machine.

Fig. 2. A data stream can consist of buffers B and events E. A parallel binding allows to use different transport strategies to send these messages while the strict message order is preserved.

Unified Messaging System: In general, an open distributed multimedia middleware is able to support both data-driven and event-driven stream processing. NMM supports both kinds of messages through its unified messaging system. Buffers represent large data blocks, e.g., a video frame, that are efficiently managed by a buffer manager. A buffer manager is responsible for allocating specific memory blocks and can be shared between multiple components of NMM. For example, CUDA needs memory allocated as page-locked memory on the CPU side to use DMA communication for transporting media data to the GPU. Realizing a corresponding buffer manager makes this kind of memory usable within NMM. Since page-locked memory is also a limited resource, this buffer manager provides an interface to the application for specifying the maximum amount of page-locked memory to be used. Moreover, sharing buffer managers, e.g., for page-locked memory, between CPU and GPU nodes enables efficient communication between these nodes because a CPU node automatically uses page-locked memory that can be directly copied to the GPU.
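The idea of a shareable buffer manager with an application-defined limit can be illustrated with a minimal Python sketch. The class and method names (`PinnedBufferManagerSketch`, `request`, `release`) are hypothetical and are not part of NMM, and a plain `bytearray` stands in for page-locked memory so that the sketch runs without a GPU.

```python
class BufferLimitExceeded(Exception):
    """Raised when a request would exceed the configured memory limit."""


class PinnedBufferManagerSketch:
    """Hypothetical stand-in for a manager of page-locked host memory."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes   # upper bound set by the application
        self.allocated_bytes = 0     # total bytes allocated so far
        self._free = {}              # size -> list of recycled buffers

    def request(self, size):
        """Return a buffer of `size` bytes, reusing a released one if possible."""
        pool = self._free.get(size)
        if pool:
            return pool.pop()
        if self.allocated_bytes + size > self.max_bytes:
            raise BufferLimitExceeded(f"limit of {self.max_bytes} bytes reached")
        self.allocated_bytes += size
        # Assumption: a bytearray stands in for page-locked memory here.
        return bytearray(size)

    def release(self, buf):
        """Return a buffer to the pool so that later requests can reuse it."""
        self._free.setdefault(len(buf), []).append(buf)
```

Because one manager instance can be handed to several nodes, a CPU node that requests its output buffers from this manager would produce data that is already located in memory suitable for the subsequent copy to the GPU.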

Events carry control information, can include arbitrary typed data, and are directly mapped to a method invocation of a node, as long as the node supports this event. NMM allows the combination of events and buffers within the same data stream while a strict message order is preserved. Here, the important aspect is that NMM itself is completely independent of the information sent between nodes but provides an extensible framework for sending messages.

Open Communication Framework: An open communication framework is required to transport messages between nodes of a flow graph correctly. Since a GPU can only process buffers, events have to be sent to and executed within the corresponding GPU node that controls the algorithm running on a GPU. This means that different transport strategies have to be used to send messages of a data stream to a GPU node, i.e., DMA transfer for media data and pointer forwarding for control events. For this purpose, NMM uses the concept of parallel binding [3], as shown in Figure 2. This makes it possible to use different transport strategies for buffers and events, while the original message order is preserved.
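The ordering guarantee of a parallel binding can be sketched as follows. This is only an illustration under the assumption that messages are tagged with sequence numbers on the sending side; the paper does not state how NMM preserves the order internally, and the names `ParallelBindingSketch`, `tag`, and `arrive` are hypothetical.

```python
import heapq
from collections import namedtuple

# A message is either a buffer ("B") or an event ("E") and carries a sequence number.
Message = namedtuple("Message", ["seq", "kind", "payload"])


class ParallelBindingSketch:
    """Restores the original message order after transport over separate strategies."""

    def __init__(self):
        self._send_seq = 0   # next sequence number on the sending side
        self._recv_seq = 0   # next sequence number to deliver on the receiving side
        self._pending = []   # min-heap of messages that arrived too early

    def tag(self, kind, payload):
        """Sending side: stamp a message before it is handed to its strategy."""
        msg = Message(self._send_seq, kind, payload)
        self._send_seq += 1
        return msg

    def arrive(self, msg):
        """Receiving side: return all messages that can now be delivered in order."""
        heapq.heappush(self._pending, msg)
        ready = []
        while self._pending and self._pending[0].seq == self._recv_seq:
            ready.append(heapq.heappop(self._pending))
            self._recv_seq += 1
        return ready
```

In this sketch an event that overtakes a buffer, for example because pointer forwarding is faster than a DMA transfer, is held back until the buffer has arrived.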

Moreover, NMM supports a pipeline of transport strategies, which is required to use remote GPU nodes. Since a GPU only has access to the main memory of the collocated CPU and is not able to send buffers directly to a network interface, a buffer has to be copied to main memory first. In a second step, the buffer can be sent to a remote system using standard network protocols like TCP. NMM supports setting up a pipeline of different transport strategies within a parallel binding in order to reuse existing transport strategies. The connection service of NMM automatically chooses these transport strategies for transmitting buffers between GPU and CPU nodes and enables (3) transparent use of local and remote GPUs for distributed processing.
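How a connection service might derive such a pipeline from the location of two nodes can be sketched in a few lines of Python. The selection rules below are an assumption derived from the description in this paper and not NMM's actual implementation; `Node`, `choose_buffer_strategies`, and the name `TCPStrategy` are hypothetical, while `LocalStrategy`, `CPUToGPUStrategy`, and `GPUToCPUStrategy` are the strategy names used in Section 4.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    host: str    # machine the node runs on
    device: str  # "cpu" or "gpu"


def choose_buffer_strategies(src, dst):
    """Return the pipeline of transport strategies for buffers from src to dst."""
    remote = src.host != dst.host
    pipeline = []
    # A GPU cannot send to a network interface or to a CPU node directly.
    if src.device == "gpu" and (remote or dst.device == "cpu"):
        pipeline.append("GPUToCPUStrategy")
    if remote:
        pipeline.append("TCPStrategy")  # hypothetical name for a network strategy
    if dst.device == "gpu" and (remote or src.device == "cpu"):
        pipeline.append("CPUToGPUStrategy")
    # Nodes in the same address space (or on the same GPU) only forward pointers.
    return pipeline or ["LocalStrategy"]
```

For a GPU node on one machine that feeds a GPU node on another machine, the sketch yields the three-stage pipeline GPU to CPU, network, CPU to GPU, which reuses the strategies that already exist for the local case.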

3 Related Work

Due to the wide acceptance of CUDA, there already exist some CUDA specific extensions. GpuCV [4] and OpenVIDIA [5] are computer vision libraries that completely hide the underlying GPU architecture. Since they act as black box solutions it is difficult to combine them with existing multimedia applications that use the CPU. However, the CUDA kernels of such libraries can be reused in the presented approach. In [6] CUDA was integrated in a grid computing framework which is built on top of the DataCutter middleware. Since the DataCutter middleware focuses on processing a large number of completely independent tasks, it is not suitable for multimedia processing.


Most existing frameworks for distributed stream processing that also support GPUs have a special focus on computer graphics. For example, WireGL [7], Chromium [8], and Equalizer [9] support the distribution of workload to GPUs distributed in the network. However, all these frameworks are limited to OpenGL-based applications and cannot be used for general processing on GPUs.

Available distributed multimedia middleware solutions like NMM [2], NIST II [10] and Infopipe [11] are especially designed for distributed media processing, but only NMM and NIST II support the concept of a distributed flow graph. However, the concepts of parallel binding and pipelined parallel binding are only supported by NMM and not by NIST II.

4 Integration of CUDA

Fig. 3. CUDA is integrated into NMM using a three layer approach. All layers can be accessed by the application, but only the distributed flow graph is seen by default.

We use a three layer approach for integrating CUDA into NMM as can be seen in Figure 3. All three layers can be accessed from the application, but only the first layer, which includes the distributed flow graph, can be seen by default.

Here, all CUDA kernels are encapsulated into specific GPU nodes, so that they can be used within a distributed flow graph for distributed stream processing.

Since processing of a specific buffer requires that all following operations are executed on the same GPU, our integration of CUDA ensures that all following GPU nodes use the same GPU. Therefore, GPU nodes are interconnected using the LocalStrategy, which simply forwards pointers to buffers.

However, the concept of parallel binding is required for connecting CPU and GPU nodes. Here, incoming events are still forwarded to a LocalStrategy because a GPU node processes events in the same address space as a CPU node. Incoming buffers are sent to a CPUToGPUStrategy or GPUToCPUStrategy to copy media data from main memory to GPU memory or vice versa using CUDA’s asynchronous DMA transfer. NMM also provides a connection service that is extended to automatically choose these transport strategies for transmitting buffers between GPU and CPU nodes. This shows that the approach of a distributed flow graph (1) hides all specific aspects of CUDA and GPUs from the application.

The second layer enables efficient memory management. Page-locked memory that can be directly copied to a GPU is allocated and managed by a CPUBufferManager, while a GPUBufferManager allocates and manages memory on a GPU. Since a CPUToGPUStrategy requires page-locked memory to avoid unnecessary copy operations, it forwards a CPUBufferManager to all predecessor nodes. This is enabled by the concept of shareable buffer managers, described in Section 2. As can be seen in Figure 3, this is done as soon as a connection between a CPU and GPU node is established. Again, this shows the benefit of using a distributed multimedia middleware for multimedia processing on a GPU. GPU nodes can be combined with existing CPU nodes in an efficient way, i.e., without unnecessary memory copies, and without changing the implementation of existing nodes.

Fig. 4. This figure shows buffer processing using a combination of CPU and GPU nodes.

The lowest layer is responsible for managing and scheduling the available GPUs within a system. Since we directly use the driver API, different GPUs can in general be accessed from the same application thread by pushing the corresponding CUDA context. However, the current implementation of CUDA’s context management does not support asynchronous copy operations [12]. Thus, the CUDAManager maintains an individual thread for each GPU, called GPU thread, for accessing a GPU. If a component executes a CUDA operation within one of its methods (e.g., executes a kernel), it requests the CUDAManager to invoke this method by using a specific GPU thread. Executing a method through the CUDAManager blocks until the CUDAManager has executed the corresponding method. This completely hides the use of multiple GPU threads, as well as of different application threads for accessing a GPU, from the application logic.
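The one-thread-per-GPU scheme with blocking invocation can be sketched in Python as follows. This is a minimal illustration and not the actual CUDAManager: the class `CUDAManagerSketch` and its methods are hypothetical, and a single-worker `ThreadPoolExecutor` stands in for each GPU thread so that no CUDA installation is needed.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class CUDAManagerSketch:
    """Runs every operation for a given GPU on that GPU's dedicated thread."""

    def __init__(self, num_gpus):
        # One single-worker executor per GPU plays the role of a GPU thread.
        self._gpu_threads = [
            ThreadPoolExecutor(max_workers=1, thread_name_prefix=f"gpu{i}")
            for i in range(num_gpus)
        ]

    def execute(self, gpu_id, method, *args):
        """Invoke `method` on the GPU thread and block until it has finished."""
        return self._gpu_threads[gpu_id].submit(method, *args).result()

    def shutdown(self):
        for executor in self._gpu_threads:
            executor.shutdown(wait=True)


def current_thread_name():
    """Helper used to show on which thread an operation actually runs."""
    return threading.current_thread().name
```

A caller from any application thread simply writes `manager.execute(0, some_method)`; the operation runs on the thread that belongs to GPU 0, and the call returns only after it has completed.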

However, since page-locked memory is already bound to a specific GPU, the CUDAManager instantiates a CPUBufferManager and GPUBufferManager for each GPU. To propagate scheduling information between different nodes, each buffer stores information about the GPU thread that has to be used for processing. Moreover, CUDA operations are executed asynchronously in so called CUDA streams. Therefore, each buffer provides its own CUDA stream where all components along the flow graph queue their CUDA specific operations asynchronously. Since all buffers used by asynchronous operations of a single CUDA stream can only be released if the CUDA stream is synchronized, each CUDA stream is encapsulated into an NMM-CUDA stream that also stores involved resources and releases them if the CUDA stream is synchronized.

Figure 4 shows all steps for processing a buffer using CPU and GPU nodes:

1. When CPUToGPUStrategy receives a buffer B_CPU, it requests a suitable buffer B_GPU for the same GPU. Then it initiates an asynchronous copy operation. Before forwarding B_GPU to GPU node A, it adds B_CPU to the NMM-CUDA stream because this buffer can only be released if the CUDA stream is synchronized.

2. GPU node A initiates the asynchronous execution of its kernel and forwards B_GPU to GPU node B. Since both GPU nodes execute their kernel on the same GPU, the connecting transport strategy uses simple pointer forwarding to transmit B_GPU to GPU node B.

3. GPU node B also initiates the asynchronous execution of its kernel and forwards B_GPU.

4. When GPUToCPUStrategy receives B_GPU, it requests a new B'_CPU and initiates an asynchronous memory copy from GPU to CPU.

5. Finally, the transport strategy synchronizes the CUDA stream to ensure that all operations on the GPU have been finished, before forwarding B'_CPU to CPU node B, and releases all resources stored in the NMM-CUDA stream.
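The five steps above can be traced with a small simulation. It is a sketch under strong simplifications: Python lists stand in for host and device memory, queued callables stand in for asynchronous CUDA operations, and all names (`NMMCudaStreamSketch`, `process_buffer`, `brighten`, `invert`) are hypothetical. Only B_CPU is stored in the stream, as described in step 1.

```python
class NMMCudaStreamSketch:
    """Queues operations and keeps resources alive until synchronization."""

    def __init__(self):
        self._operations = []  # operations queued "asynchronously"
        self._resources = []   # buffers that must not be released yet

    def enqueue(self, operation):
        self._operations.append(operation)

    def hold(self, resource):
        self._resources.append(resource)

    def synchronize(self):
        """Run all queued operations in order, then release held resources."""
        for operation in self._operations:
            operation()
        self._operations.clear()
        released, self._resources = self._resources, []
        return released


def brighten(buf):
    """Example kernel of GPU node A: add 10 to every value, clamped to 255."""
    for i, value in enumerate(buf):
        buf[i] = min(255, value + 10)


def invert(buf):
    """Example kernel of GPU node B: invert every value."""
    for i, value in enumerate(buf):
        buf[i] = 255 - value


def process_buffer(b_cpu, kernel_a, kernel_b):
    stream = NMMCudaStreamSketch()
    # Step 1: request B_GPU, queue the host-to-device copy, keep B_CPU alive.
    b_gpu = [0] * len(b_cpu)
    def copy_to_gpu():
        b_gpu[:] = b_cpu
    stream.enqueue(copy_to_gpu)
    stream.hold(b_cpu)
    # Steps 2 and 3: both GPU nodes queue their kernel on the same buffer.
    stream.enqueue(lambda: kernel_a(b_gpu))
    stream.enqueue(lambda: kernel_b(b_gpu))
    # Step 4: request B'_CPU and queue the device-to-host copy.
    b_cpu_out = [0] * len(b_gpu)
    def copy_to_cpu():
        b_cpu_out[:] = b_gpu
    stream.enqueue(copy_to_cpu)
    # Step 5: synchronize, then forward B'_CPU and release stored resources.
    released = stream.synchronize()
    return b_cpu_out, released
```

Note that nothing is computed before `synchronize` is called; this mirrors the fact that each component along the flow graph only queues its work and forwards the buffer immediately.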

5 Parallel Processing on Multiple GPUs

5.1 Automatic Parallelization

Automatic parallelization is only provided for GPUs inside a single machine.

Here, the most important influence on scheduling is that page-locked memory is bound to a specific GPU. This means that all following processing steps are bound to a specific GPU, but GPU nodes are not explicitly bound to a specific GPU. So if multiple GPUs are available within a single system, the processing of next media buffers could be automatically initialized on different GPUs.

However, this is only possible if a kernel is stateless and does not depend on information about already processed buffers, e.g., a kernel that changes the resolution of each incoming video buffer. In contrast to this, a stateful kernel, e.g., for encoding or decoding video, stores state information on a GPU and cannot automatically be distributed to multiple GPUs.

When using a stateful kernel inside a GPU node, the corresponding transport strategy of type CPUToGPUStrategy uses a CPUBufferManager of a specific GPU to limit media processing to a single GPU. But if a stateless kernel inside a GPU node is used, the corresponding CPUToGPUStrategy forwards a CompositeBufferManager to all preceding nodes. The CompositeBufferManager includes the CPUBufferManagers of all GPUs, and when a buffer is requested it asks the CUDAManager which GPU should be used and returns a page-locked buffer for the corresponding GPU. So far we have implemented a simple round-robin mechanism inside the CUDAManager that distributes buffers to one GPU after the other.
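The interplay between a composite buffer manager and a round-robin policy can be sketched as follows. The classes are hypothetical simplifications: `RoundRobinSchedulerSketch` stands in for the scheduling decision that the paper places inside the CUDAManager, and a dictionary with a `gpu_id` field stands in for a page-locked buffer bound to a GPU.

```python
import itertools


class CPUBufferManagerSketch:
    """Hands out buffers that are bound to one specific GPU."""

    def __init__(self, gpu_id):
        self.gpu_id = gpu_id

    def request(self, size):
        # Assumption: a dict with a bytearray stands in for page-locked memory.
        return {"gpu_id": self.gpu_id, "data": bytearray(size)}


class RoundRobinSchedulerSketch:
    """Selects one GPU after the other."""

    def __init__(self, num_gpus):
        self._cycle = itertools.cycle(range(num_gpus))

    def next_gpu(self):
        return next(self._cycle)


class CompositeBufferManagerSketch:
    """Delegates each request to the buffer manager of the scheduled GPU."""

    def __init__(self, managers, scheduler):
        self._managers = managers
        self._scheduler = scheduler

    def request(self, size):
        gpu_id = self._scheduler.next_gpu()
        return self._managers[gpu_id].request(size)
```

Since the buffer itself records the GPU it is bound to, every later processing step for that buffer is implicitly tied to the same GPU, while the next buffer may be assigned to a different one.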

5.2 Explicit Parallelization

To explicitly distribute workload to multiple local or distributed GPUs for stateless kernels, we provide a set of nodes that support explicit parallelization for the application developer. The general idea can be seen in Figure 5. A manager node is used to distribute workload to a set of local or distributed GPU nodes. Since the kind of data that has to be distributed to all succeeding GPU nodes strongly depends on the application, the manager node provides only a scheduling algorithm and distributes incoming data from its predecessor node.

Fig. 5. The manager node distributes workload to connected GPU nodes. After data have been processed by all successive GPU nodes, they are sent back to an assembly node that recreates correct order of processed data.

First, the manager node sends data to each connected GPU node, one after the other. As soon as a GPU node connected to the manager node has finished processing its data, it informs the manager node, by sending a control event, to send new data for processing. The manager node in turn sends the next available data to this GPU node. This simple scheduling approach leads to an efficient dynamic load balancing between the GPU nodes, because GPU nodes that finish processing their data earlier receive new processing tasks earlier as well. This approach automatically accounts for differences in processing time that can be caused by using different graphics boards.
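The scheduling behaviour described above can be reproduced with a small discrete-event simulation. It is a sketch only: processing times are fixed per GPU node, the control event is modelled as the moment a node's finish time is reached, and the function names `simulate_manager` and `assemble` are hypothetical.

```python
import heapq


def simulate_manager(num_items, seconds_per_item):
    """Distribute items to GPU nodes; seconds_per_item[i] is node i's cost.

    Returns (assignments, finished): which node processed each item and the
    order in which processed items reach the assembly node.
    """
    events = []        # min-heap of (finish_time, node, item)
    assignments = {}   # item -> node
    next_item = 0
    # First, every connected GPU node receives one item.
    for node, cost in enumerate(seconds_per_item):
        if next_item < num_items:
            heapq.heappush(events, (cost, node, next_item))
            assignments[next_item] = node
            next_item += 1
    finished = []
    while events:
        now, node, item = heapq.heappop(events)  # node reports "done"
        finished.append(item)
        if next_item < num_items:
            # The node that just finished gets the next available item.
            heapq.heappush(events, (now + seconds_per_item[node], node, next_item))
            assignments[next_item] = node
            next_item += 1
    return assignments, finished


def assemble(finished):
    """Assembly node: emit items in their original order as they become available."""
    pending, expected, ordered = set(), 0, []
    for item in finished:
        pending.add(item)
        while expected in pending:
            pending.remove(expected)
            ordered.append(expected)
            expected += 1
    return ordered
```

With two nodes that need 2 and 5 time units per item, the faster node ends up processing four of six items without the manager knowing anything about the relative speed of the boards.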

6 Performance Measurements

For all performance measurements we use two PCs, PC1 and PC2, connected through a 1 Gigabit/sec full-duplex Ethernet connection, each with an Intel Core2 Duo 3.33 GHz E8600 processor, 4 GB RAM (DDR3 1300 MHz), running 64 Bit Linux (kernel 2.6.28) and CUDA Toolkit 2.0. PC1 includes two and PC2 one Nvidia GeForce 9600 GT graphics boards, each with 512 MB RAM. In order to measure the overhead of the presented approach, we compare a reference program, which copies data from CPU to GPU, executes a kernel, and finally copies data back to main memory using a single application thread, with a corresponding flow graph that consists of two CPU nodes and one GPU node in between using the same kernel. We compare the reference implementation and the corresponding NMM flow graph based on their throughput in buffers per second. Since NMM inherently uses multiple application threads, which is not possible in the reference application without using a framework like NMM, these measurements also include the influence of using multiple application threads.

| Buffer size | Reference, PC1 [Buf/s] | NMM: 1 GPU, PC1 [Buf/s] | NMM: 2 GPUs, PC1 [Buf/s] | NMM: 3 GPUs, PC1 + PC2 [Buf/s] |
|---|---|---|---|---|
| 50 KB | 1314 | 1354 (103 %) | 2152 (163.8 %) | 2790 (212.32 %) |
| 500 KB | 200 | 235 (117 %) | 463 (231.5 %) | 680 (340.0 %) |
| 1000 KB | 103 | 117 (113.6 %) | 234 (227.2 %) | 342 (332.7 %) |
| 2000 KB | 52 | 59 (113.4 %) | 118 (226.9 %) | 168 (323.1 %) |
| 3000 KB | 34 | 39 (114.7 %) | 79 (232.3 %) | 105 (308.8 %) |

Table 1. Performance of the NMM-CUDA integration versus a single threaded reference implementation. Throughput is measured in buffers per second [Buf/s] for a stateless kernel using 1, 2 and 3 GPUs.

For all measurements, we used a stateless kernel that adjusts the brightness of incoming video frames. The resulting throughput with different buffer sizes can be seen in Table 1. The achieved throughput of our integration is up to 16.7% higher compared to the reference application. These measurements show that there is no overhead when using NMM as distributed multimedia middleware together with our CUDA integration, even for purely locally running applications. Moreover, the presented approach inherently uses multiple application threads for accessing the GPU, which leads to a better exploitation of the used GPU. Adding a second GPU can double the buffer throughput for larger buffer sizes, if a stateless kernel is used.

Moreover, adding a single remote GPU shows that the overall performance can be increased up to a factor of three. When using both PCs for processing we use the manager node described in Section 5.2, which distributes workload to two GPU nodes, one running on PC1 and one on PC2. Since PC1 provides two GPUs and we use a stateless kernel, both graphics boards of PC1 are used. The assembly node runs on PC1 and receives the results from PC2. However, for large amounts of data the network turns out to be the bottleneck. In our benchmark we already send up to 800 MBit/sec in both directions, so that a fourth GPU in a remote PC cannot be fully utilized. In this case faster network technologies, e.g., Infiniband, which provides up to 20 GBit/sec, have to be used.
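The order of magnitude of the network load can be checked against Table 1 with a rough estimate. The calculation below rests on two assumptions that are not stated in the paper: the share of the remote GPU is taken to be the difference between the three-GPU and the two-GPU throughput, and 1 KB is taken to be 1000 bytes. The function name `remote_bandwidth_mbit` is hypothetical.

```python
def remote_bandwidth_mbit(buffer_kb, buf_per_s_3gpus, buf_per_s_2gpus):
    """Estimate the per-direction network load in MBit/s for the remote GPU."""
    # Assumption: the remote GPU contributes the throughput gained by adding it.
    remote_buf_per_s = buf_per_s_3gpus - buf_per_s_2gpus
    bytes_per_s = remote_buf_per_s * buffer_kb * 1000  # assumption: 1 KB = 1000 B
    return bytes_per_s * 8 / 1e6


# Throughput values taken from Table 1: size in KB -> (3 GPUs, 2 GPUs) in Buf/s.
TABLE_1 = {
    500: (680, 463),
    1000: (342, 234),
    2000: (168, 118),
    3000: (105, 79),
}

ESTIMATES = {
    size: remote_bandwidth_mbit(size, three, two)
    for size, (three, two) in TABLE_1.items()
}
```

Under these assumptions the 2000 KB row yields 800 MBit/s in each direction, which is in line with the figure quoted above and close to the limit of a 1 Gigabit/sec link.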

7 Conclusion and Future Work

In this paper we demonstrated that a distributed multimedia middleware like NMM is able (1) to seamlessly integrate processing components using GPUs while completely hiding GPU specific issues from the application developer, (2) to transparently combine processing components using GPUs or CPUs, and (3) to transparently use local and remote GPUs for distributed processing. From our point of view, a distributed multimedia middleware such as NMM is essential to fully exploit the processing power of today’s GPUs, while still offering a suitable abstraction for developers. Thus, future work will mainly focus on integrating emerging many-core technologies to conclude on which functionality should additionally be provided by a distributed multimedia middleware.

Acknowledgements

We would like to thank Martin Beyer for his valuable work on supporting the integration of CUDA into NMM.

References

1. NVIDIA: CUDA Programming Guide 2.0. (2008)

2. M. Lohse, F. Winter, M. Repplinger, and P. Slusallek: Network-Integrated Multimedia Middleware (NMM). In: MM ’08: Proceedings of the 16th ACM international conference on Multimedia. (2008) 1081–1084

3. M. Repplinger, F. Winter, M. Lohse, and P. Slusallek: Parallel Bindings in Distributed Multimedia Systems. In: Proceedings of the 25th IEEE International Conference on Distributed Computing Systems Workshops (ICDCS 2005), IEEE Computer Society (2005) 714–720

4. Y. Allusse, P. Horain, A. Agarwal, and C. Saipriyadarshan: GpuCV: An Open-Source GPU-accelerated Framework for Image Processing and Computer Vision. In: MM ’08: Proceedings of the 16th ACM international conference on Multimedia, New York, NY, USA, ACM (2008) 1089–1092

5. J. Fung and S. Mann: OpenVIDIA: parallel GPU computer vision. In: MULTIMEDIA ’05: Proceedings of the 13th annual ACM international conference on Multimedia, New York, NY, USA, ACM (2005) 849–852

6. D.R. Hartley et al.: Biomedical Image Analysis on a Cooperative Cluster of GPUs and Multicores. In: ICS ’08: Proceedings of the 22nd annual international conference on Supercomputing, New York, NY, USA, ACM (2008) 15–25

7. G. Humphreys et al.: WireGL: a scalable graphics system for clusters. In: SIGGRAPH ’01: Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques. (2001) 129–140

8. G. Humphreys et al.: Chromium: a stream-processing framework for interactive rendering on clusters. In: SIGGRAPH ’02: Proceedings of the 29th Annual Conference on Computer Graphics and Interactive Techniques. (2002) 693–702

9. S. Eilemann and R. Pajarola: The Equalizer parallel rendering framework. Technical Report IFI 2007.06, Department of Informatics, University of Zürich (2007)

10. A. Fillinger et al.: The NIST Data Flow System II: A Standardized Interface for Distributed Multimedia Applications. In: IEEE International Symposium on a World of Wireless; Mobile and MultiMedia Networks (WoWMoM), IEEE (2008)

11. A. P. Black et al.: Infopipes: An Abstraction for Multimedia Streaming. Multimedia Syst. 8 (2002) 406–419

12. NVIDIA: CUDA Programming and Development. NVidia Forum (2009) http://forums.nvidia.com/index.php?showtopic=81300&hl=cuMemcpyHtoDAsync.
