In order to distribute the workload and take advantage of using a GPU cluster, we can take advantage of a distributed streaming framework. Such a framework allows us to, in a more easy fashion, execute computation in several nodes, while also providing us
with fault tolerance. Frameworks like these usually work by specifying a topology of nodes, with directed edges that specify the data flow. Each node executes some type of computation and passes the data forward according to the specified topology. This would allow us to specify the nodes as explained in section1.3.1to perform each task, as well as the data dependencies between nodes.
2.5.1 Apache Storm
Storm [10] is an open source distributed real-time computation system. Storm consumes streams of data and processes them in a specified way, partitioning the stream between different stages, specified in a topology. It is also scalable and fault-tolerant. This makes it very useful for our purposes, due to the fact that we intend to use a GPU cluster (thus taking advantage of Storm’s scalability) and that we intend to process streams of data.
A storm cluster contains two types of nodes: master and workers. The master node, called theNimbus, is responsible for distributing work around the cluster (between the
worker nodes) and monitoring for failures. The worker nodes listen for work assigned to them by theNimbus node and start or stop processes according to that.
Storms works by specifying a topology, which is a graph of computation. A topology is composed of several nodes spread across several machines. Each node in the topology contains logic for a worker to execute and the edges between nodes specify dependencies and data flow. The worker nodes process a subset of the topology.
Another abstraction in Storm is the stream. Storm provides primitives to handle and process a stream, throughsprouts and bolts. Sprouts are a source of streams, while bolts
consume input streams and process them. We can specify a graph of bolts and sprouts (which are nodes) with the edges between them signifying stream dependencies. This is illustrated in Figure2.12.
Figure 2.12:Bolts and Sprouts Topology
The Sprouts (on the left of the image) are the source of streams, and pass them to the three bolts in the middle of the image. Once these execute the necessary processing, they
2 . 5 . D I S T R I B U T E D S T R E A M P R O C E S S I N G
pass the streams on to the final bolt.
Storm is simple to use, easy to configure and its abstractions allow us to easily imple- ment a distributed stream processing system. Since Storm works with any programming language, there are no concerns about its compatibility with our system. Also, its stream processing capabilities enable us to easily process streams of images in any way we desire.
2.5.2 Intel Threading Building Blocks
While Storm allows us to take advantage of a cluster of machines horizontally, Intel Threading Building Blocks (TBB) [6], allows us to take advantage of each of the machines vertically. It enables a programmer to easily write C++ parallel programs that take ad- vantage of modern multicore processors. It is widely used and tested, meaning it is very reliable. It includes several algorithms, locks and atomic operations that allow a pro- grammer to abstract from low-level details, worry less about the intricacies of parallel programming and focus more on the design of the solution.
We can take advantage of TBB in order to accelerate our solution in each machine that we use, taking advantage of as much processing power as we can. In modern computers, no part of the processor should be left idle, and TBB allows us to make sure that never happens.
2.5.2.1 TBB Flow Graph
TBB Flow Graph is a TBB interface that enables a programmer to easily write powerful dependency graph and data flow algorithms. It expresses computation in terms of di- rected graphs, with dependencies between them. Data flows in the direction of the graph edges, and it is possible to have several graph nodes executing in parallel. There are also different types of nodes to help with the construction of the flow graph, such as buffer nodes, split and join nodes, queue nodes, and several others.
Figure 2.13:Serial Execution
Imagine a program composed by four components A, B, C and D. Component A does some initialization while components B and C perform independent calculations on the initial data generated by A. Component D, that depends on the data calculated by B and C, finally computes the final result. A serial execution of this algorithm can be seen in Figure2.13. As we can see, the components execute sequentially and take a total of 20100ms to execute. However, components B and C only depend on data coming from A, meaning they can execute in parallel. Using TBB flow graph, we could design
something like shown in Figure2.14. This way, after the computation is executed on A, components B and C execute in parallel and send their results to the final component D, which performs the last computations. Through the use of TBB we can reduce the total execution time to 15100ms. This basic examples illustrates the usefulness of TBB flow graph and shows how we can take advantage of it for our computations because, as shown in Chapter1, our program can be expressed in terms of a flow graph. Flow graph also has another very useful feature, it allows each node to be parallelized to a certain degree,e.g.,
we could (depending on concurrency constraints) make each independent node execute in several processor cores simultaneously, speeding up the process even further.
Figure 2.14:Parallel Execution