2.4 Libraries Used by our Implementation

2.4.2 FastFlow

FastFlow [62] is an open-source programming framework for structured parallel programming, targeting shared-memory multi-core platforms and supporting the exploitation of GPU accelerators. Its efficiency stems from the optimized implementation of the base communication mechanisms and from its layered design (cf. Fig. 2.3), based on C++ templates. FastFlow provides a set of algorithmic skeletons addressing both stream parallelism (e.g., farm and pipeline) and data parallelism (e.g., map, stencil, reduce), along with their arbitrary nesting and composition [25]. Map, reduce, and stencil patterns can be run on multi-cores or can be offloaded onto GPUs. In the latter case, the user code can include GPU-specific code (i.e., CUDA or OpenCL kernels).

FIGURE 2.3: Layered FastFlow design. Layers, from top to bottom: Parallel applications (efficient and portable); High-level patterns (parallel_for, parallel_forReduce, …); Core patterns (pipeline, farm, feedback); Building blocks (queues, ff_node, ...); CUDA, OpenCL, TCP/IP, IB/OFED; Multicore and many-core platforms, Clusters of multicore + many-core.

³ For the sake of simplicity, we consider the basic formulation of dangling-freeness, whereas more refined definitions (that still hold for C++ smart pointers) take into account, for instance, the type of the referenced value.

For instance, leveraging the farm skeleton, FastFlow exposes a ParallelFor pattern [61], where chunks of loop iterations are streamed to be executed by the farm workers. Just like TBB, FastFlow’s parallel_for pattern uses C++11 lambda expressions as a concise way to create function objects: lambdas can “capture” the state of non-local variables, by value or by reference, and allow functions to be syntactically defined where and when needed.
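The interplay between loop chunking and lambda capture can be illustrated with a minimal sketch in standard C++. This is not FastFlow's API: chunked_for and scale_in_place are hypothetical names, the workers are plain std::thread objects rather than farm workers, and the chunk size and worker count are arbitrary assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for a parallel-for pattern (NOT FastFlow's API):
// the range [first, last) is cut into chunks of `chunk` iterations, and
// worker w executes chunks w, w + nworkers, w + 2 * nworkers, ...
inline void chunked_for(long first, long last, long chunk, unsigned nworkers,
                        const std::function<void(long)>& body) {
    assert(chunk > 0 && nworkers > 0);
    std::vector<std::thread> workers;
    for (unsigned w = 0; w < nworkers; ++w) {
        // Loop bounds are captured by value, the loop body by reference.
        workers.emplace_back([=, &body] {
            for (long lo = first + static_cast<long>(w) * chunk; lo < last;
                 lo += static_cast<long>(nworkers) * chunk) {
                const long hi = std::min(lo + chunk, last);
                for (long i = lo; i < hi; ++i) body(i);
            }
        });
    }
    for (auto& t : workers) t.join();
}

// Scales every element of `v` by `factor` using the sketch above.
inline void scale_in_place(std::vector<double>& v, double factor) {
    // `factor` is captured by value (the lambda owns a private copy);
    // `v` is captured by reference (workers write into the caller's
    // vector, each iteration touching a distinct element, so the
    // writes do not race). Chunk size 4 and 3 workers are arbitrary.
    chunked_for(0, static_cast<long>(v.size()), 4, 3,
                [factor, &v](long i) { v[static_cast<std::size_t>(i)] *= factor; });
}
```

Calling scale_in_place on a ten-element vector with factor 2.0 doubles each element in place; the loop body is defined exactly where it is used, which is the convenience the lambda-based interface is meant to provide.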

From the performance viewpoint, one distinguishing feature at the core of FastFlow is that it supports lock-free (fence-free) Multiple Producer Multiple Consumer (MPMC) queues [20], thus providing low-overhead, high-bandwidth multi-party communications on multi-core architectures for any streaming network, including cyclic graphs of threads. The key intuition underlying FastFlow is to provide the programmer with fast lock-free Multiple Producer Single Consumer (MPSC) queues and Single Producer Multiple Consumer (SPMC) queues, which can be used in pipeline to build MPMC queues, to support fast streaming networks.

Traditionally, MPMC queues are built as passive entities: threads concurrently synchronize (according to some protocol) to access data; these synchronizations are usually supported by one or more atomic operations (e.g., Compare-And-Swap) that behave as memory fences. FastFlow's design follows a different approach: to avoid any memory fence, the synchronizations among queue readers or writers are arbitrated by an active entity (e.g., a thread). We call these entities Emitter (E) or Collector (C) according to their role; they actually read an item from one or more lock-free Single Producer Single Consumer (SPSC) queues and write onto one or more lock-free SPSC queues. This requires a memory (pointer) copy but no atomic operations.

The advantage of this solution, in terms of performance, comes from the higher speed of the copy operation compared with the memory fence; this advantage is further increased by avoiding the cache invalidation triggered by fences. This behavior also depends on the size and the memory layout of the copied data. The former point is addressed by using data pointers instead of data, ensuring that the data is not concurrently written: in many cases this can be derived from the semantics of the skeleton that has been implemented using MPMC queues (for example, this is guaranteed in a stateless farm, as well as in many other cases).
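The arbitration idea can be sketched in portable C++ as follows. This is an illustration under stated assumptions, not FastFlow's code: Spsc, Collector, and step are hypothetical names, and the queue is only scaffolding. The slots are std::atomic pointers accessed with plain acquire loads and release stores, which keeps the sketch well defined in standard C++; no read-modify-write instruction such as Compare-And-Swap is used anywhere.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

// Scaffolding (hypothetical): bounded SPSC queue of pointers.
// A slot holding nullptr is empty.
template <typename T, std::size_t N>
class Spsc {
public:
    bool push(T* p) {  // producer side; returns false when full
        assert(p != nullptr);
        if (slots_[w_].load(std::memory_order_acquire) != nullptr) return false;
        slots_[w_].store(p, std::memory_order_release);
        w_ = (w_ + 1) % N;
        return true;
    }
    T* pop() {  // consumer side; returns nullptr when empty
        T* p = slots_[r_].load(std::memory_order_acquire);
        if (p == nullptr) return nullptr;
        slots_[r_].store(nullptr, std::memory_order_release);
        r_ = (r_ + 1) % N;
        return p;
    }
private:
    std::array<std::atomic<T*>, N> slots_{};  // value-initialised: all empty
    std::size_t w_ = 0, r_ = 0;
};

// Collector role: K producers each own one SPSC input queue; the
// collector polls them round-robin and forwards pointers to a single
// SPSC output queue. Only pointers are copied, never the data.
template <typename T, std::size_t N, std::size_t K>
class Collector {
public:
    Collector(std::array<Spsc<T, N>, K>& in, Spsc<T, N>& out)
        : in_(in), out_(out) {}

    // One arbitration step, meant to be called in a loop by the
    // collector's own thread. Returns true if an item was forwarded.
    bool step() {
        for (std::size_t tried = 0; tried < K && held_ == nullptr; ++tried) {
            held_ = in_[next_].pop();
            next_ = (next_ + 1) % K;
        }
        if (held_ == nullptr || !out_.push(held_)) return false;
        held_ = nullptr;
        return true;
    }

private:
    std::array<Spsc<T, N>, K>& in_;
    Spsc<T, N>& out_;
    T* held_ = nullptr;     // item read but not yet delivered
    std::size_t next_ = 0;  // input queue to poll next
};
```

Because every queue has exactly one writer and one reader, the producers never synchronize with each other: the collector serializes their items. An Emitter would be the mirror image, reading from one input queue and writing to several output queues.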

Shared-memory FastFlow

The FastFlow implementation for shared-memory platforms provides two basic abstractions:

• Process-component, i.e., an active control flow entity, implemented by means of POSIX threads;⁴

• 1-1 channel, i.e., a communication channel between two components, realized with wait-free SPSC queues [16].

The 1-1 channel is “state of the art” in its class, in terms of both latency and bandwidth. For instance, the SPSC queue exhibits a latency down to 10 nanoseconds per message on a standard Intel Xeon @2.0GHz [16].

Dolz et al. [72] tested the correctness of the FastFlow SPSC queue in the presence of benign data races over a set of µ-benchmarks and real applications on a dual-socket Intel Xeon CPU E5-2695 platform.
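A 1-1 channel of this kind can be sketched as a bounded ring of pointer slots in which an empty slot is marked by a null pointer, so that producer and consumer never share an index. The class below is a hypothetical illustration in standard C++ (SpscQueue, push, and pop are assumed names), not the implementation described in [16]; it uses std::atomic slots with acquire/release ordering to remain well defined, and each operation completes in a bounded number of steps.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Bounded single-producer/single-consumer queue of pointers.
// A slot holding nullptr is empty; any other value is a pending item.
template <typename T>
class SpscQueue {
public:
    explicit SpscQueue(std::size_t capacity) : slots_(capacity) {
        assert(capacity > 0);
        for (auto& s : slots_) s.store(nullptr, std::memory_order_relaxed);
    }

    // Producer side only. Returns false when the queue is full.
    bool push(T* item) {
        assert(item != nullptr);  // nullptr is reserved as the "empty" mark
        std::atomic<T*>& slot = slots_[write_];
        if (slot.load(std::memory_order_acquire) != nullptr) return false;
        slot.store(item, std::memory_order_release);  // publish the item
        write_ = (write_ + 1) % slots_.size();
        return true;
    }

    // Consumer side only. Returns nullptr when the queue is empty.
    T* pop() {
        std::atomic<T*>& slot = slots_[read_];
        T* item = slot.load(std::memory_order_acquire);
        if (item == nullptr) return nullptr;
        slot.store(nullptr, std::memory_order_release);  // free the slot
        read_ = (read_ + 1) % slots_.size();
        return item;
    }

private:
    std::vector<std::atomic<T*>> slots_;
    std::size_t write_ = 0;  // touched by the producer only
    std::size_t read_ = 0;   // touched by the consumer only
};
```

In a two-thread setting, one thread would call push in a loop (retrying on false) while the other calls pop (retrying on nullptr). A production-quality queue would additionally keep the two indices on separate cache lines to avoid false sharing, which this sketch omits.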

The FastFlow design is layered (see Fig. 2.3). On top of the mentioned basic abstractions, the bottom layer (Building blocks in Fig. 2.3) provides the following entities:

• FastFlow node, i.e., the basic unit of parallelism, typically identified with a node in a streaming network. Such a node is used to encapsulate sequential portions of code implementing functions (i.e., process-components), as well as higher-level parallel patterns, such as pipelines and farms. From the API viewpoint, a FastFlow node is an object of the ff_node class;

• Collective channel, i.e., a communication channel among two or more ff_nodes, of arbitrary type (e.g., SPSC, MPMC).

The second layer (Core patterns in Fig. 2.3) provides basic streaming patterns (i.e., farm and pipeline) and some common variants (e.g., ordering farm).

On top of core patterns, High-level patterns are provided to target different types of parallelism. For instance, parallel_for and map allow data parallelism to be expressed in a manner similar to other popular frameworks, such as OpenMP and TBB.

⁴ Porting to C++ threads is under investigation.

Pattern     Description
unicast     Sends the input data to the (unique) connected peer (unidirectional point-to-point communication)
broadcast   Sends the input data to all connected peers
scatter     Sends different parts of the input data, typically partitions, to all connected peers
onDemand    Sends the input data to one of the connected peers, chosen at runtime on the basis of the actual workload
fromAll     (aka all-gather) Receives different parts of the data from all connected peers, combining them in a single data item
fromAny     Receives one data item from one of the connected peers

TABLE 2.1: Communication patterns among ff_dnodes.
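The semantics of some of these patterns can be made concrete with a purely local model, in which each peer is represented by an in-memory inbox. The sketch below is hypothetical and unrelated to the actual ff_dnode interface or to ZeroMQ: Message, Inbox, and the four function names are assumptions, and the number of queued messages is used as a stand-in for "actual workload".

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Message = std::vector<int>;
using Inbox = std::vector<Message>;  // messages queued at one peer

// broadcast: every peer receives a copy of the whole message.
inline void broadcast(const Message& m, std::vector<Inbox>& peers) {
    for (Inbox& p : peers) p.push_back(m);
}

// scatter: the message is split into contiguous partitions, one per
// peer; the first (size % peers) partitions get one extra element.
inline void scatter(const Message& m, std::vector<Inbox>& peers) {
    const std::size_t k = peers.size();
    if (k == 0) return;
    const std::size_t base = m.size() / k, extra = m.size() % k;
    auto it = m.begin();
    for (std::size_t p = 0; p < k; ++p) {
        const auto len = static_cast<std::ptrdiff_t>(base + (p < extra ? 1 : 0));
        peers[p].emplace_back(it, it + len);
        it += len;
    }
}

// onDemand: the message goes to the least loaded peer (here: the one
// with the fewest queued messages). Returns the chosen peer's index.
inline std::size_t on_demand(const Message& m, std::vector<Inbox>& peers) {
    assert(!peers.empty());
    auto least = std::min_element(
        peers.begin(), peers.end(),
        [](const Inbox& a, const Inbox& b) { return a.size() < b.size(); });
    least->push_back(m);
    return static_cast<std::size_t>(least - peers.begin());
}

// fromAll: the parts received from all peers are combined, in peer
// order, into a single data item.
inline Message from_all(const std::vector<Message>& parts) {
    Message whole;
    for (const Message& part : parts) {
        whole.insert(whole.end(), part.begin(), part.end());
    }
    return whole;
}
```

In this model, applying from_all to the partitions produced by scatter reconstructs the original message, which mirrors the scatter/all-gather pairing suggested by the table.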

Distributed FastFlow

An experimental extension, targeting distributed systems, has been implemented on top of the ZeroMQ library [130]. Briefly, ZeroMQ is an LGPL open-source communication library providing the user with a socket layer that carries whole messages across various transports: inter-thread communications, inter-process communications, TCP/IP, and multicast sockets. ZeroMQ offers an asynchronous communication model, allowing the quick construction of complex asynchronous message-passing networks with reasonable performance.

A ff_dnode (distributed ff_node) provides an external channel that can support various patterns of communication. The set of communication patterns allows messages to be exchanged among a set of distributed nodes, using well-known predefined patterns. The semantics of each communication pattern currently implemented are summarized in Table 2.1. Graphs of ff_nodes can be connected by way of ff_dnodes, thus providing a homogeneous abstraction for programming both multi-core and distributed platforms.
