Dataflow Computing Model

The dataflow computing model represents a radical alternative to the von-Neumann computing model. It offers many opportunities for parallel processing because it has neither a program counter nor a global updatable memory, the two characteristics of the von-Neumann model that inhibit parallelism. Due to these properties, it is extensively used as a concurrency model in software and as a high-level design model for hardware.

The principles of dataflow originated with Karp and Miller [32], who proposed a graph-theoretic model for the description and analysis of parallel computations. Shortly after, in the early 1970s, the first dataflow models were developed by Dennis [33] and Kahn [34]. Dennis originally applied the dataflow idea to computer architecture design, while Kahn used it in a theoretical context for modeling concurrent software. Based on these models, dataflow has been used and developed in many areas of computing research, such as digital signal processing, reconfigurable computing, high-level logic design, graphics processing, and data warehousing. It is also relevant in many software architectures, including database engine designs and concurrent computing frameworks.

The dataflow model is self-scheduled, since instruction sequencing is constrained only by data dependencies. Moreover, the model is asynchronous, because program execution is driven only by the availability of the operands at the inputs to the functional units. Specifically, the firing rule states that an instruction is enabled as soon as its corresponding operands are present, and executed when hardware resources are available. If several instructions become fireable at the same time, they can be executed in parallel. This simple principle provides the potential for massively parallel execution at the instruction level. Thus, dataflow architectures implicitly manage complex tasks such as processor load balancing, synchronization, and accesses to common resources.
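The firing rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a model of any particular machine: the `Node` class, the `run` function, and the dictionary used as a token store are names invented for this example, and tokens are not consumed when read.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Node:
    """Hypothetical instruction node (illustrative name, not from the source)."""
    name: str
    op: Callable[..., float]
    inputs: Tuple[str, ...]  # arcs that carry the operands
    output: str              # arc that receives the result


def run(nodes: List[Node], tokens: Dict[str, float]):
    """Fire nodes until none is enabled; return final tokens and firing waves."""
    tokens = dict(tokens)
    pending = list(nodes)
    waves: List[List[str]] = []
    while pending:
        # Firing rule: a node is enabled once every operand is present.
        enabled = [n for n in pending if all(a in tokens for a in n.inputs)]
        if not enabled:
            break  # the remaining nodes wait for operands that never arrive
        # All nodes enabled in the same wave could execute in parallel.
        results = {n.output: n.op(*(tokens[a] for a in n.inputs)) for n in enabled}
        tokens.update(results)
        fired = {n.name for n in enabled}
        waves.append([n.name for n in enabled])
        pending = [n for n in pending if n.name not in fired]
    return tokens, waves


# Expression a = (a + b) * (a - 10), with the nodes listed in arbitrary order.
graph = [
    Node("mul", lambda x, y: x * y, ("t1", "t2"), "out"),
    Node("add", lambda x, y: x + y, ("a", "b"), "t1"),
    Node("sub", lambda x, y: x - y, ("a", "ten"), "t2"),
]
final_tokens, waves = run(graph, {"a": 12, "b": 3, "ten": 10})
```

In this sketch the order in which the nodes are listed does not matter: `add` and `sub` fire together in the first wave because their operands are already present, and `mul` fires only after both results exist.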

Figure 2.1 illustrates the execution of a simple loop using both the von-Neumann and dataflow models. As the figure shows, a dataflow program is represented as a directed graph, called a dataflow graph (DFG). It consists of named nodes and arcs that represent instructions and data dependencies among instructions, respectively [35, 36]. Data values propagate along the arcs in the form of packets, called tokens. Two important characteristics of dataflow graphs are functionality and composability. Functionality means that the evaluation of a graph is equivalent to the evaluation of a mathematical function on the same input values. Composability implies that graphs can be combined to form new graphs [37].

[Figure 2.1: Computing a loop using (a) the von-Neumann model, (b) a dataflow model. The loop is: do { a = (a+b)×(a-10); } while (a<N);. Panel (a) shows the instruction sequence driven by the program counter (PC): load r0, (a); load r1, (b); loop1: add r2, r1, r0; sub r0, r0, 10; mul r0, r0, r2; cmp r0, N; jl loop1; store (a), r0. Panel (b) shows the dataflow graph with add, sub, mul, <N, merge, and select (T/F) nodes and a matching stage. Legend: node = actor = instruction; arc = data path; operand = (tagged) token.]

A DFG can be created at different computing stages. For instance, it can be created for a specific algorithm when designing special-purpose architectures (common for signal processing circuits). However, most dataflow-based systems convert high-level code into a DFG at compile time, decode time, or even during execution time, depending on the architecture organization. Unlike control flow programs, binaries compiled for a dataflow machine explicitly contain the data dependency information.
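To make the conversion from sequential code to a DFG concrete, the following Python sketch derives the data dependency arcs from the loop body of the instruction sequence in Figure 2.1(a). It is an illustration only: the tuple encoding, the `build_dfg` helper, and the `flags` destination assumed for `cmp` are not taken from the source, and real compilers perform a considerably more involved analysis.

```python
from typing import Dict, List, Tuple

Instr = Tuple[str, str, Tuple[str, ...]]  # (destination, opcode, sources)

# Loop body of Figure 2.1(a). Immediates and the bound N are plain strings.
# The "flags" destination of cmp is an assumption made for this sketch.
LOOP_BODY: List[Instr] = [
    ("r2", "add", ("r1", "r0")),
    ("r0", "sub", ("r0", "10")),
    ("r0", "mul", ("r0", "r2")),
    ("flags", "cmp", ("r0", "N")),
]


def build_dfg(instrs: List[Instr]) -> List[Tuple[int, int, str]]:
    """Return arcs (producer index, consumer index, register) for true dependencies."""
    last_writer: Dict[str, int] = {}
    arcs = []
    for i, (dest, _opcode, sources) in enumerate(instrs):
        for src in sources:
            if src in last_writer:  # value produced earlier in the body
                arcs.append((last_writer[src], i, src))
        last_writer[dest] = i       # later readers depend on this instruction
    return arcs


arcs = build_dfg(LOOP_BODY)
for producer, consumer, reg in arcs:
    print(f"{LOOP_BODY[producer][1]} -> {LOOP_BODY[consumer][1]} via {reg}")
```

No arc connects `add` and `sub` in the result, so under the dataflow model the two may fire in either order or in parallel; the reuse of `r0` in the sequential listing is a naming constraint rather than a data dependency.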

2.2.1 Dataflow Architectures

In practice, implementations of the dataflow model can be classified as either static (single-token-per-arc) or dynamic (multiple-tagged-token-per-arc) architectures. The static approach allows at most one token to reside on any arc [38]. This is accomplished by extending the basic firing rule as follows: a node is enabled as soon as tokens are present on its input arcs and there is no token on any of its output arcs [39]. In order to implement the restriction of having at most one token per arc, and to guard against non-determinacy, extra reverse arcs carry acknowledge signals from consuming to producing nodes [39].
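The extended firing rule of the static model can be sketched as follows. This is a simplified illustration, not the design of any of the cited machines: the `StaticNode` class and the `is_enabled` and `fire` helpers are hypothetical names, and emptying an arc stands in for the acknowledge signal sent on a reverse arc.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

Arcs = Dict[str, Optional[float]]  # None means the arc holds no token


@dataclass
class StaticNode:
    """Hypothetical node of a static dataflow graph (one token per arc)."""
    name: str
    op: Callable[..., float]
    inputs: Tuple[str, ...]
    output: str


def is_enabled(node: StaticNode, arcs: Arcs) -> bool:
    """Static firing rule: all input arcs full and the output arc empty."""
    inputs_full = all(arcs[a] is not None for a in node.inputs)
    output_empty = arcs[node.output] is None
    return inputs_full and output_empty


def fire(node: StaticNode, arcs: Arcs) -> None:
    """Consume the input tokens and place the result on the output arc."""
    operands = [arcs[a] for a in node.inputs]
    for a in node.inputs:
        arcs[a] = None  # emptying the arc stands in for the acknowledge signal
    arcs[node.output] = node.op(*operands)


mul = StaticNode("mul", lambda x, y: x * y, ("a", "b"), "p")
arcs: Arcs = {"a": 3.0, "b": 4.0, "p": 99.0}  # p still holds an unconsumed result
blocked = is_enabled(mul, arcs)  # output arc occupied, so the node must wait
arcs["p"] = None                 # the consumer removes the old token
ready = is_enabled(mul, arcs)
fire(mul, arcs)
```

The node is blocked while its previous result is still on the output arc, which is exactly what prevents a second iteration from overtaking the first in the static model.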

Figure 2.2-a shows an example of a static dataflow graph for computing a loop that is executed N times sequentially (note that the graph for controlling the iteration of the loop is not illustrated in this figure). The implementation of the static dataflow model is simple, but since the graph is static, every operation can be instantiated only once, and thus loop iterations and subprogram invocations cannot proceed in parallel. Despite this drawback, some machines were designed based on this model, including the MIT Dataflow Architecture [38, 40], DDM1 [41], LAU [42], and HDFM [43].

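[Figure 2.2: DFG of a loop in (a) the static and (b) the dynamic dataflow model. The loop is: for (i = 1 to N) S_i = (a_i × b_i) + (c_i × d_i). Panel (a) shows a single graph instance (two × nodes feeding a + node, inputs a, b, c, d, output S) that is executed N times, with operand arcs and reverse acknowledge-signal arcs. Panel (b) shows N concurrent instances of the graph, itr.1 to itr.N, operating on a_1, b_1, c_1, d_1 through a_N, b_N, c_N, d_N and producing S_1 through S_N.]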

The dynamic dataflow model tries to overcome some of the deficiencies of static dataflow by supporting the execution of multiple instances of the same instruction template, thereby supporting parallel invocations of loop iterations and subprograms. Figure 2.2-b shows the concurrent execution of different iterations of the loop. This is achieved by assigning a tag to each data token representing the dynamic instance of the target instruction (e.g., a1, a2, ...). Thus, an instruction is fired as soon as tokens with identical tags are present at each of its input arcs. This enabling rule also eliminates the need for acknowledge signals, increases parallelism, and reduces token traffic. Notable examples of this model are the Manchester Dataflow [44], the MIT Tagged-Token [45], DDDP [46], and PIM-D [47].
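The tag-matching rule can be illustrated with a small Python sketch. The `MatchingStore` class and its `insert` method are hypothetical names chosen for this example, and a Python dictionary stands in for the associative memory that a real tagged-token machine would use.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


class MatchingStore:
    """Hypothetical waiting-matching store keyed by (instruction, tag)."""

    def __init__(self, arity: int) -> None:
        self.arity = arity
        self.waiting: Dict[Tuple[str, int], Dict[int, float]] = defaultdict(dict)

    def insert(self, node: str, tag: int, port: int, value: float) -> Optional[List[float]]:
        """Store a token; return the operands once all ports of this instance are filled."""
        slot = self.waiting[(node, tag)]
        slot[port] = value
        if len(slot) == self.arity:
            del self.waiting[(node, tag)]  # the instance fires and leaves the store
            return [slot[p] for p in range(self.arity)]
        return None


store = MatchingStore(arity=2)
# Tokens for two iterations of a multiply node arrive interleaved: (tag, port, value).
arrivals = [(1, 0, 2.0), (2, 0, 5.0), (2, 1, 3.0), (1, 1, 4.0)]
fired = []
for tag, port, value in arrivals:
    operands = store.insert("mul", tag, port, value)
    if operands is not None:
        fired.append((tag, operands[0] * operands[1]))
```

In this run the instance tagged 2 fires before the instance tagged 1, because its second operand arrives first; tokens with different tags never match each other, so no acknowledge signals are needed to keep iterations apart.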

The dynamic dataflow model can execute out of order, bypassing any token whose complex execution would otherwise delay the remaining computation. Another noteworthy benefit of this model is that little care is required to ensure that tokens remain in order.

The main disadvantage of the dynamic model is the extra overhead required to match tags on tokens. In order to reduce the execution time overhead of matching tokens, dynamic dataflow machines require expensive associative memory implementations [44]. One notable attempt to eliminate the overheads associated with the token store is the Explicit Token Store (ETS) [48, 49]. The idea is to allocate a separate memory frame for every active loop iteration and subprogram invocation. Since frame slots are accessed using offsets relative to a frame pointer, the associative search is eliminated. To make this concept practical, the number of concurrently active loop iterations must be controlled. Hence, the constraint of k-bounded loops was proposed [50], which bounds the number of concurrently active loop iterations. The Monsoon architecture [51] is the main example of this model.
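The frame-based matching idea can be sketched in Python as follows. This is a loose illustration of the principle described above, not the Monsoon design: the `FrameMemory` class, its method names, and the use of `None` as a cleared presence bit are assumptions made for the example, and operands are returned in arrival order.

```python
from typing import List, Optional, Tuple


class FrameMemory:
    """Hypothetical explicit token store with at most k active frames."""

    def __init__(self, frame_size: int, k: int) -> None:
        self.frame_size = frame_size
        self.slots: List[Optional[float]] = [None] * (frame_size * k)  # None = empty slot
        self.free_frames = list(range(k))

    def allocate(self) -> int:
        """Return a frame pointer; raises IndexError if k frames are already active."""
        return self.free_frames.pop(0) * self.frame_size

    def release(self, fp: int) -> None:
        self.free_frames.append(fp // self.frame_size)

    def arrive(self, fp: int, offset: int, value: float) -> Optional[Tuple[float, float]]:
        """Match a token by direct addressing: frame pointer plus instruction offset."""
        addr = fp + offset
        partner = self.slots[addr]
        if partner is None:
            self.slots[addr] = value  # first operand waits in its slot
            return None
        self.slots[addr] = None       # second operand: clear the slot and fire
        return (partner, value)


mem = FrameMemory(frame_size=4, k=2)
fp1, fp2 = mem.allocate(), mem.allocate()  # two concurrently active iterations
first = mem.arrive(fp1, 0, 2.0)            # waits
second = mem.arrive(fp2, 0, 5.0)           # waits in a different frame
pair2 = mem.arrive(fp2, 0, 3.0)            # completes iteration 2
pair1 = mem.arrive(fp1, 0, 4.0)            # completes iteration 1
```

Each match is a single indexed access rather than a search, and a third call to `allocate` fails until a frame is released, which mirrors the role of the k-bound on active iterations.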

2.2.2 Limitations of Dataflow Models

The dataflow model has the potential to be an elegant execution paradigm with the ability to exploit the inherent parallelism available in applications. However, implementations of the model have failed to deliver the promised performance due to inherent inefficiencies and limitations. One reason for this is that static dataflow is unable to effectively uncover large amounts of parallelism in typical programs. Dynamic dataflow architectures, in turn, are limited by the prohibitive costs linked to associative tag lookups in terms of latency, silicon area, and power consumption.

Another significant problem is that dataflow architectures are notoriously difficult to program because they rely on specialized dataflow and functional languages. Moreover, these languages have no notion of explicit computation state, which limits the ability to manage data structures (e.g., arrays). To overcome these limitations, some dataflow systems include specialized storage mechanisms, such as the I-structure [52], which preserve the single-assignment property. Nevertheless, these storage structures are far from generic and their dynamic management complicates the design.
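The single-assignment behaviour of an I-structure can be sketched as a write-once array whose early reads are deferred until the element is written. The `IStructure` class and its callback-based `read` are assumptions made for this illustration and are not taken from [52].

```python
from typing import Callable, List, Optional


class IStructure:
    """Hypothetical write-once array with deferred reads."""

    def __init__(self, size: int) -> None:
        self.values: List[Optional[float]] = [None] * size
        self.full = [False] * size  # presence bit per element
        self.deferred: List[List[Callable[[float], None]]] = [[] for _ in range(size)]

    def read(self, i: int, consumer: Callable[[float], None]) -> None:
        """Deliver element i now if present, otherwise defer until it is written."""
        if self.full[i]:
            consumer(self.values[i])
        else:
            self.deferred[i].append(consumer)

    def write(self, i: int, value: float) -> None:
        """Write element i exactly once and satisfy all deferred reads."""
        if self.full[i]:
            raise RuntimeError("single-assignment violation")
        self.values[i] = value
        self.full[i] = True
        for consumer in self.deferred[i]:
            consumer(value)
        self.deferred[i].clear()


received: List[float] = []
cells = IStructure(2)
cells.read(0, received.append)   # element not yet written: the read is deferred
cells.write(0, 42.0)             # the deferred read is satisfied here
cells.read(0, received.append)   # element present: delivered immediately
```

Because a read never observes an unwritten or overwritten element, the result does not depend on whether the producer or the consumer runs first; the bookkeeping for deferred reads hints at why dynamic management of such structures complicates a design.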

In contrast, imperative languages such as C, C++, or Java explicitly manage machine state through load/store operations. This modus operandi decouples the data storage from its producers and consumers, thereby concealing the flow of data and making it virtually impossible to generate effective (large) dataflow graphs. Furthermore, the memory semantics of C and C++ support arithmetic operations on memory pointers, which result in memory aliasing, where different semantic names may refer to the same memory location. Memory aliasing cannot be resolved statically, thus further obfuscating the flow of data between producers and consumers. Consequently, dataflow architectures do not effectively support imperative languages.
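The effect of aliasing on data dependencies can be shown with a very small example. The sketch uses Python, where two names can refer to the same list, as a stand-in for C pointers; the function name `store_then_load` is hypothetical, and no pointer arithmetic is involved.

```python
from typing import List


def store_then_load(p: List[int], q: List[int]) -> int:
    """Store through p, then load through q."""
    p[0] = 1         # store
    return q[0] + 1  # load: depends on the store only if p and q alias


x, y = [10], [20]
independent = store_then_load(x, y)  # p and q refer to different locations
dependent = store_then_load(x, x)    # p and q refer to the same location
```

Whether an arc exists from the store to the load depends on the arguments supplied at run time, so a dataflow graph built ahead of time must either assume the dependency conservatively or risk violating it.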

In summary, the dataflow model is effective in uncovering parallelism, due to the explicit expression of parallelism among dataflow paths and the decentralized execution model that obviates the need for a program counter to control instruction execution. Despite these advantages, programmability issues limit the usefulness of dataflow machines. Moreover, the lack of a total order on instruction execution makes it difficult to enforce the memory ordering that imperative languages require. For further details, we refer the reader to more extensive literature on the subject [53, 54, 55].

In document Hardware design of task superscalar architecture (Page 42-46)

Dataow Computing Model