A graph-oriented task manager for small
multiprocessor systems.
Xavier Verians (1), Jean-Didier Legat (1), Jean-Jacques Quisquater (1) and Benoit Macq (2)
(1) Microelectronics Laboratory, Université Catholique de Louvain
(2) Telecommunications Laboratory, Université Catholique de Louvain
place du Levant, 3; 1348 Louvain-la-Neuve; Belgium. Phone: +32-10-47.80.62. Fax: +32-10-47.25.98.
e-mail: [email protected]
Abstract
This paper presents a new task manager aimed at general-purpose small-scale multiprocessors. It is designed to exploit complex parallelism structures in general applications. The manager uses an explicit parallelism encoding based on a topological description of the task dependence graph. A dedicated task manager decodes the parallelism information and builds a structured representation of the dependence graph in a queue bank. The queue bank design makes it possible to efficiently extract the tasks ready to be executed and to implement synchronization between consecutive tasks. Simulations performed on SPLASH and image processing benchmarks validate the parallelism exploitation. Results show that, for complex parallel structures, performance can be significantly improved.
Submitted to Europar 99, Topic 9: "Parallel Computer Architecture".
Because of the number of concepts presented in this paper, we do not wish it to be published in the short research paper format.
The theoretical aspects of this work will be published at the Fourth International Conference of the ACPC (ACPC'99). That paper mainly presents the underlying task manager algorithm from the task graph point of view. The current paper is essentially oriented towards the task manager implementation. In contrast to the ACPC paper, it details the queue bank structure, the parallelism description used and the software/hardware implementation issue.
I. Introduction
General-purpose computers are becoming more efficient thanks to an increased exploitation of parallelism. Current processors aimed at personal computers make intensive use of local instruction-level parallelism. Parallelism extraction for general-purpose applications has mainly been studied for one-chip architectures such as VLIW, superscalar or multithreaded architectures. However, new systems are beginning to exploit medium- and coarse-grained parallelism. These systems are based on small-scale multiprocessors or one-chip "super-processors" able to execute several tasks simultaneously. This is especially true in small commercial computers, where several processors are beginning to be used on the same board to boost performance. For example, four Pentium Pro processors can now be interconnected without any external glue logic [1]. The reason is that it is simpler to design a small processor and to replicate it on the same chip or board than to develop a huge and complex uniprocessor with similar performance.
However, these small-scale multiprocessors - i.e. with fewer than 32 processing units (PUs) - require an efficient parallelism extraction at the task level to achieve good performance. Currently, mainly software methods are used to exploit parallelism. These techniques, inspired by those used in larger systems, perform well with large threads or large data sets. However, general-purpose computers are small-scale systems used to execute general-purpose or multimedia applications on relatively small data sets. These applications present a reduced and more complex parallelism that cannot easily or efficiently be described with a classical syntax. This paper addresses this issue by proposing a task manager that can handle parallelism from complex applications at a small software or hardware cost. As the parallelism management is centralized, it is dedicated to small-scale systems. We also limit our study to shared-memory multiprocessors.
This paper is organized as follows. The next section presents some definitions from graph theory. The main features of our task manager are presented in the third section. Section IV details the task manager structure and how it handles tasks. The underlying management algorithm is first described on simple directed acyclic graphs. Afterwards, it is extended to support data-dependent programs and to exploit low- or high-level parallel structures efficiently. The software/hardware implementation issue is presented in section V. The system has been simulated to validate our parallelism management. Results are provided in section VI. They show that our parallelism manager provides results similar to existing systems on simple parallel applications. On complex applications, however, the proposed parallelism exploitation improves performance significantly.
II. Definitions
Medium- and coarse-grained parallelism can be represented with a task graph [2]. The program is divided into several tasks connected by dependence links. A task is a set of instructions that will be executed on a single processing unit. Tasks can communicate and interact with each other through the shared memory. Their size and their number of dependences are not limited. They can comprise jumps, function calls and conditional branches, provided all jumps reference an address included in the same task. Some tasks can begin in one function and end in another.
A task graph [2] is a directed graph consisting of a set of vertices (tasks) V and a set of arcs (dependences) on the vertices, A. If the arc (ti, tj) ∈ A, then task Ti must complete execution before task Tj can start execution. We say that task Ti is a preceding task of Tj. Arcs (ti, tj) are also called incoming arcs of Tj. The notation used is the following: capitals designate tasks and lower-case letters designate arcs. The distance between two tasks d(Ti, Tj) is defined as the minimum number of arcs to follow to go from a task Tk ∈ V to both Ti and Tj. The task graph defines a task ordering. This ordering is partial, as parallel tasks do not need to be executed one before another. A task Ti is eligible for execution only if all tasks Tj such that (tj, ti) ∈ A have been executed.
At run-time, the task graph is unfolded to produce a directed acyclic graph (DAG), also called dynamic task graph. The task graph analysis is performed at run-time by the processor. However, due to its limited resources, the processor will only see a part of this graph, called the decoded slice of the dynamic task graph. The decoded slice encloses all the tasks already decoded by the system, but waiting for their execution to be launched or completed.
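The eligibility rule of this section can be illustrated by a minimal Python sketch. The function name `eligible_tasks`, the arc list and the small diamond-shaped graph are assumptions made for illustration only; they are not taken from the task manager itself.

```python
from collections import defaultdict

def eligible_tasks(arcs, completed):
    """Return the tasks whose preceding tasks have all been executed.

    arcs: iterable of (ti, tj) pairs, meaning Ti must complete before Tj starts.
    completed: set of names of the tasks already executed.
    """
    preds = defaultdict(set)   # task -> set of its preceding tasks
    tasks = set()
    for ti, tj in arcs:
        preds[tj].add(ti)
        tasks.update((ti, tj))
    # A task is eligible if it is not done and all its predecessors are done.
    return sorted(t for t in tasks
                  if t not in completed and preds[t] <= completed)

# Hypothetical diamond graph: A precedes B and C, which both precede D.
arcs = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
print(eligible_tasks(arcs, set()))        # ['A']
print(eligible_tasks(arcs, {"A"}))        # ['B', 'C']
print(eligible_tasks(arcs, {"A", "B"}))   # ['C']
```

In this example B and C are parallel tasks: the partial ordering does not constrain which of them is executed first.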
III. Task manager characteristics
The task manager has three main jobs. It first decodes the task program and internally builds an acyclic representation of the static dependence graph. Its second job is to transform the static graph into a dynamic one by executing data-dependent control commands. Finally, it has to select eligible tasks, send them to the PUs and remove completed tasks. Our proposed task manager has the following characteristics. It is:
• general: General-purpose applications and emerging multimedia applications have a complex structure, with parallelism available at the control-flow level. The main capability of a good parallelism management system is its ability to handle various kinds of parallelism structures. As nearly all parallelism patterns can be represented by directed acyclic graphs, we use a parallelism description at the assembly level directly based on a topological description of the task flow graph. This avoids limiting the pattern diversity through the use of specific parallel directives. An example of a topological task graph description is provided in figure 1. The syntax is simple: each line describes a task T and its incoming arcs (tj, t). As t is the same for all the incoming
• dynamic: The manager decodes the task graph at run-time. It supports data-dependent control structures. Depending on the data processed, it can modify the graph to change the program behavior. This is achieved by directly inserting control instructions into the task graph.
• structured: The task manager stores the task graph in a structured fashion by using a queue bank. This simplifies the graph management operations and increases their efficiency. Moreover, intelligent task scheduling policies based on a local analysis of the task graph can be supported.
• architecture-independent: The goal of the task manager is to generate parallel tasks and send them to the processing units. No assumption is made on the system architecture, except that a shared memory is used and that a centralized task management can be implemented efficiently. This limits our manager to small-scale systems. In particular, the parallelism description assumes that the number of available processing units is unknown at compile-time.
• separation of the parallelism from the program code: Usually, parallelism commands (fork, join, spawn, create,…) are directly inserted into the task code [3][4][5][6]. Tasks are created in an order related to the task execution order. If tasks are control-dependent, the decoding can hardly be performed in advance and the PUs lose time handling the dependence links between tasks. The time lost is generally reduced by avoiding the joining of tasks created by different PUs, consequently discarding mesh-like parallelism patterns. The separation of the parallelism description, called the task program, from the task computation code allows our manager to overcome this problem. It relieves the PUs of task creation and manipulation. The parallelism available can then be decoded and processed in parallel with task execution. Another advantage is that the parallelism information is no longer spread throughout the program code. The task manager can then easily have a large view of the program structure and the available parallelism.
IV. Task manager structure
The parallelism manager stores the decoded slice of the dynamic task graph in a structure able to handle it efficiently. The structure can be implemented in software or hardware. This point will be discussed in the next section. We first describe the task manager by considering only directed acyclic task graphs. The manager is then extended to support data-dependent graphs. The task manager structure is close to the graph representations used in static scheduling methods. However, the proposed structure stores the graph in a well-structured pattern, which simplifies the graph management operations.
Fig. 1: a) task dependence graph, b) corresponding topological description called the task program.
4.1. DAG management
The task manager core is a queue bank. Each queue stores a piece of the decoded slice of the dynamic task graph. The parallelism available at a given time is the number of parallel paths in the graph at the current execution stage. To preserve the parallel structure of the task graph, parallel paths are stored in different queues. As a path consists of a chain of consecutive tasks, each queue holds sequentially chained tasks. If the task graph is drawn as a DAG oriented from top to bottom (fig. 2), the queue bank slices the decoded graph into vertical task chains. The advantage of this structure is clear: the maximum parallelism available is equal to the number of non-empty queues.
Moreover, such a structure already orders tasks according to the partial execution order. As each task in a queue depends on the previous task of the same queue, and as executed tasks are removed from the queues, eligible tasks can only be found at the queue heads. This second property speeds up the search for a new eligible task. However, to be able to easily order tasks in the queues, they have to be decoded in the same partial order. This imposes a constraint on the task program: tasks have to be ordered in the topological description such that a sequential decoding of the task program generates tasks in the partial execution order.
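The ordering constraint on the task program can be checked with a few lines of code. The sketch below is only an illustration for the acyclic case: the representation of a task program as a list of (task, preceding tasks) pairs and the name `is_decodable_in_order` are assumptions, not the actual task program syntax.

```python
def is_decodable_in_order(task_program):
    """Check that each task is listed after all the tasks it depends on.

    task_program: list of (task, preds) pairs in program order, where preds
    lists the preceding tasks (the sources of the incoming arcs).
    """
    seen = set()
    for task, preds in task_program:
        if any(p not in seen for p in preds):
            return False   # a dependence points to a task not decoded yet
        seen.add(task)
    return True

# Hypothetical task programs for a diamond-shaped graph.
good = [("A", []), ("B", ["A"]), ("C", ["A"]), ("D", ["B", "C"])]
bad = [("B", ["A"]), ("A", []), ("C", ["A"]), ("D", ["B", "C"])]
print(is_decodable_in_order(good))   # True
print(is_decodable_in_order(bad))    # False
```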
The three basic operations on a task graph are the addition of a task, the search of an eligible task and the removal of an executed task. All these operations are executed at run-time.
• task addition: To add a new decoded task T, the manager searches for a queue qk such that tail(qk) = Tj, with (tj, t) ∈ A. If no queue satisfies the condition, qk will point to an empty queue. Then all the incoming arcs (ti, t), except (tj, t), are added to the queue - actually, only ti is used to identify the dependence - followed by the task identifier T. If no empty queue is available, the decoding waits for the execution to proceed further until a queue is released. This procedure ensures that a task is added to a queue only if it depends on the last task of this queue. The search operation is quite simple: it only needs to access the last element of each queue. The operation complexity is O(nqueues) fetches and comparisons (where nqueues is the number of active queues) in a sequential system and O(1) in a parallel implementation. An example of the queue bank filling steps is provided in figure 3. It uses the task program presented in figure 1.
• eligible task selection: The search for an eligible task to execute is straightforward. If a task in a queue depends on other tasks of the task window, it is preceded by its incoming arcs. Only eligible tasks are not preceded by arcs or tasks, and they are thus stored at the queue heads. Finding an eligible task only requires testing whether the queue head contains an arc or a task identifier. When selected, the task is tagged "in execution". The operation only requires nqueues tests in a sequential implementation and a single test in a parallel implementation. The test simply consists of checking the bit that tells whether the stored identifier is a task or a dependence.
• task removal: When a task T completes, the task T and all the arcs (t, tj) are removed from the queues. It is the most expensive operation. However, the critical operation is the removal of a dependence that prevents a new task from becoming eligible. These dependences are found at the queue heads. Thus the delay between a task completion and the launch of a dependent task is determined by the time needed to remove the arcs associated with T from the queue heads. This operation can be performed with a maximum cost of O(nqueues) or O(1) in a sequential or parallel implementation respectively. For each queue, it simply consists of comparing the head element with the task identifier. If they match, the head element is popped from the queue. Removing the arcs from the queue bodies takes longer. As these operations are not critical, the overhead can be hidden by the task execution.
Fig. 3: Task management process related to the task program of fig. 1. For clarity, the queue filling is shown separately from the task execution. Actually, these two processes are performed concurrently. "last" denotes the arc associated with the tail element. a,b) queue filling. A step corresponds to the decoding of a line in the task program. c) queue status after the execution of A. This task and the associated arc a are removed. d) queue status after the execution of B and
An example of the task selection and removal processes is provided in figure 3. Actually, the task addition, selection and removal operations are all performed during execution, tasks being added at the queue tails while others are removed from the queue heads.
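The three operations can be summarized by a small sequential model. The following Python sketch is a simplified illustration under stated assumptions, not the actual implementation: the class `QueueBank`, the ("arc", name) / ("task", name) entry encoding and the rule that dependences on already completed tasks are dropped at decoding are choices made for the example. Queues are scanned sequentially, which corresponds to the O(nqueues) software case.

```python
from collections import deque

class QueueBank:
    """Toy queue bank: each queue holds ("arc", name) and ("task", name) entries."""

    def __init__(self, n_queues):
        self.queues = [deque() for _ in range(n_queues)]
        self.live = set()      # decoded tasks that have not completed yet
        self.running = set()   # tasks tagged "in execution"

    def add(self, task, preds):
        """Decode one task; return False if the decoder must wait for a free queue."""
        # Assumption: dependences on already completed tasks are dropped.
        preds = [p for p in preds if p in self.live]
        # Look for a queue whose tail is one of the preceding tasks.
        target = next((q for q in self.queues if q and q[-1][1] in preds), None)
        chained = target[-1][1] if target is not None else None
        if target is None:
            target = next((q for q in self.queues if not q), None)
        if target is None:
            return False
        for p in preds:
            if p != chained:
                target.append(("arc", p))   # remaining incoming arcs
        target.append(("task", task))
        self.live.add(task)
        return True

    def select(self):
        """Return an eligible task (a task identifier at a queue head), or None."""
        for q in self.queues:
            if q and q[0][0] == "task" and q[0][1] not in self.running:
                self.running.add(q[0][1])
                return q[0][1]
        return None

    def remove(self, task):
        """Drop a completed task and every arc that refers to it."""
        self.running.discard(task)
        self.live.discard(task)
        for q in self.queues:
            kept = [entry for entry in q if entry[1] != task]
            q.clear()
            q.extend(kept)

# Hypothetical diamond graph: A precedes B and C, which both precede D.
bank = QueueBank(n_queues=3)
for task, preds in [("A", []), ("B", ["A"]), ("C", ["A"]), ("D", ["B", "C"])]:
    bank.add(task, preds)
print(bank.select())                  # A
bank.remove("A")
print(bank.select(), bank.select())   # B C
```

In this run, D is chained behind B in the first queue and is preceded by an arc entry for C, so it only reaches the queue head once both B and C have been removed.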
4.2. Data-dependent graphs
Usually, the program structure depends on the data processed. This is particularly the case in emerging multimedia applications, where the kernel applied to an object is determined by some data properties. In this case, the task graph contains conditional branches that will be resolved during execution. Two approaches are possible: either the manager waits for the branch to be resolved before resuming the graph decoding and the queue bank filling, or the conditional branches are decoded and stored unresolved. While the first approach is simpler, it suffers from the delay between the branch resolution and the time at which a new eligible task reaches the queue head and can be scheduled to a processing unit. For this reason, we have chosen the second approach, as the queue structure makes it possible to resolve branches at run-time.
To insert branches into the DAG, they are simply considered as tasks depending on other tasks. When a branch is decoded, it is statically predicted to determine in advance which side will be taken. The branch management policy is close to those used in classical superscalar processors [7], except that we are working with tasks instead of instructions. The decoding continues on the selected side, while the address pointed to by the other branch side is stored in a small memory. Tasks following the branch are written into a new empty queue. A tag identifying the branch is added to each queue storing tasks that depend on this branch. When a branch becomes eligible, it is resolved. If the prediction was right, the dependent queues are untagged. Otherwise, the tagged queues are discarded and the decoding resumes on the other branch side. In this case, a delay appears between the branch resolution and the scheduling of the next eligible task.
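The tagging mechanism can be sketched as follows. This is a deliberately reduced model made for illustration: the class `SpeculativeQueues`, its method names and the string used as the address of the other branch side are assumptions, and the queues are plain task lists without dependence arcs.

```python
class SpeculativeQueues:
    """Toy model of queues tagged with the unresolved branches they depend on."""

    def __init__(self):
        self.queues = {}    # queue id -> list of task names
        self.tags = {}      # queue id -> set of unresolved branch ids
        self.pending = {}   # branch id -> (predicted_taken, other_side_address)

    def predict(self, branch, predicted_taken, other_side_address):
        """Record the static prediction and remember the non-selected side."""
        self.pending[branch] = (predicted_taken, other_side_address)

    def add_queue(self, qid, tasks, depends_on=()):
        """Store tasks in a queue, tagged with the branches they depend on."""
        self.queues[qid] = list(tasks)
        self.tags[qid] = set(depends_on)

    def resolve(self, branch, taken):
        """Resolve a branch; return the address to resume decoding at, or None."""
        predicted, other_side_address = self.pending.pop(branch)
        if predicted == taken:
            for tag_set in self.tags.values():
                tag_set.discard(branch)        # prediction right: untag
            return None
        for qid in [q for q, t in self.tags.items() if branch in t]:
            del self.queues[qid]               # prediction wrong: discard
            del self.tags[qid]
        return other_side_address

spec = SpeculativeQueues()
spec.predict("br1", predicted_taken=True, other_side_address="else_part")
spec.add_queue("q0", ["A"])
spec.add_queue("q1", ["B", "C"], depends_on=["br1"])
print(spec.resolve("br1", taken=False))   # else_part
print(list(spec.queues))                  # ['q0']
```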
Backward branches make it possible to implement loops in the task graph. The task graph can then no longer be considered as acyclic. The sequential decoding of the task graph unrolls it to produce the DAG describing the actual execution. Several instances of the same task can exist at the same time in the system. To distinguish those task instances, tasks and dependences are renamed at the decoding step. As the tasks in the task program are ordered, the arcs (ti, tj) where Ti has not been renamed are ignored. This mechanism transforms the conservative and static dependence graph into a dynamic graph where only the dependences valid in the current context are taken into account.
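The renaming step can be illustrated on a small loop. In the sketch below, the function `unroll`, the instance naming scheme "A#0" and the fixed iteration count are assumptions made for the example; in the manager itself the unrolling is driven by the backward branch resolved at run-time.

```python
def unroll(task_program, iterations):
    """Unroll a cyclic task program into a DAG by renaming task instances.

    task_program: list of (task, preds) pairs in program order.
    Returns a list of (instance, renamed_preds) pairs. Arcs whose source
    task has not been renamed yet are ignored.
    """
    latest = {}   # static task name -> name of its most recent instance
    counts = {}   # static task name -> number of instances created so far
    dag = []
    for _ in range(iterations):
        for task, preds in task_program:
            n = counts.get(task, 0)
            counts[task] = n + 1
            instance = f"{task}#{n}"
            # Keep only the dependences whose source already has an instance.
            renamed = [latest[p] for p in preds if p in latest]
            dag.append((instance, renamed))
            latest[task] = instance
    return dag

# Hypothetical loop body: B depends on A, and A depends on B through the
# backward arc from the previous iteration.
loop = [("A", ["B"]), ("B", ["A"])]
for instance, preds in unroll(loop, 2):
    print(instance, preds)
# A#0 []
# B#0 ['A#0']
# A#1 ['B#0']
# B#1 ['A#1']
```

The arc from B to A is ignored for the first instance of A, since no instance of B exists yet; from the second iteration on, it links each instance of A to the B instance of the previous iteration.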
4.3. Parallelism management optimizations
The described task manager can handle all the parallelism patterns that can be drawn as a task graph. The parallelism exploited can have a wide granularity range, depending on the task size. However, it is interesting to extend its functionalities to optimize the parallelism exploitation.
Applications are often based on small, easily parallelizable kernels (filtering, initialization loops...). These kernels can be described by a single task executed many times. We call this a low-level loop. The queue-based system would require writing a new task instance and a testing command for each iteration. To avoid this expensive solution, the task attached to the low-level loop is written once in the queue bank. When selected for execution, it is sent to a dedicated resource, the task server. The server generates several task instances, schedules them to the PUs and holds some loop state bits to detect the end of the loop. The queue bank considers a low-level loop as a single task, which sharply reduces the management overheads.
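A task server for a low-level loop can be modeled in a few lines. The sketch below is an illustration under assumptions: the class `TaskServer`, its two counters standing for the loop state bits, and the task name "filter_row" are all hypothetical.

```python
class TaskServer:
    """Toy task server: generates the instances of one low-level loop task."""

    def __init__(self, task, n_iterations):
        self.task = task
        self.n_iterations = n_iterations
        self.next_index = 0   # loop state: next iteration to hand out
        self.completed = 0    # loop state: iterations already finished

    def next_instance(self):
        """Return the next (task, iteration) instance for an idle PU, or None."""
        if self.next_index >= self.n_iterations:
            return None
        instance = (self.task, self.next_index)
        self.next_index += 1
        return instance

    def complete(self):
        """Record one finished instance; return True when the loop has ended."""
        self.completed += 1
        return self.completed == self.n_iterations

server = TaskServer("filter_row", n_iterations=3)
print(server.next_instance())   # ('filter_row', 0)
print(server.next_instance())   # ('filter_row', 1)
print(server.complete())        # False
```

Only when `complete` returns True would the single loop task be removed from the queue bank, so the bank never sees the individual iterations.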
The parallelism exploitation based on the dependence graph is well suited to exploiting complex parallelism patterns among close tasks, close in the sense of the distance defined above. While this parallelism representation is valid for all granularities, the sequential decoding of the topological description is less efficient for coarse-grained parallelism. Indeed, it often happens that multiple processes or large program parts can be executed in parallel. If each process is described by a large number of tasks, a sequential decoding cannot provide a large enough view of the global parallelism structure. A simultaneous decoding of different task graph parts is necessary. High-level commands, such as classical fork/join commands [5][6], are added to the task program. The key difference with other systems is that these commands are only used to allow a simultaneous decoding and not necessarily an unconditional simultaneous execution. They generate several task pointers that decode several task program parts simultaneously. This does not limit the allowed parallelism patterns, as it is only used to add new properties to the general medium-level description. As tasks are written into queues by analyzing the dependence information, the task pointers can use the same queue bank. At least one queue is statically allocated to each active task pointer to avoid deadlocks. An example of a task graph with a conditional branch and a parallel command is provided in figure 4.
The second common high-level parallel structure is the loop. For the same reason, a doall command can be added to the task graph to decode several iteration subgraphs in parallel. The doall loop only signifies that several loop iterations can be decoded in parallel (it would be better called decodeall). The dependence mechanism can however impose some dependences between task instances of consecutive iterations. The doall command thus implements doall, doacross, forall, … loops. This command allows a speculative execution of loop iterations to compute loop indexes. Indeed, in a large number of loops, iterations depend on each other only through the loop index computations. A speculative execution of these computations makes it possible to launch the execution of several iterations in parallel. This loop management requires a small memory to hold some loop state bits. Such high-level loops can be overlapped to exploit parallelism among instances of several loop levels.
V. Implementation
The parallelism manager is aimed at small-scale multiprocessor systems. The management can be distributed throughout the system. In this case, the manager is implemented by a program handling data structures in memory. This solution is the simplest, as it does not require any modification of the multiprocessor system. When a PU is idle, it executes a part of the task manager program to generate a new eligible task to execute [8]. This approach suffers from several drawbacks. First, the task management is not always executed in parallel with task execution. When a large amount of parallelism is available, the task manager program competes for execution with the useful task computations. Moreover, as each PU can execute the task manager, data have to move from one PU cache to another. This approach performs well if the task manager makes few memory accesses compared with the useful task memory accesses. As tasks can be small in our system, and as the search operations require several accesses to the queue content, the additional network load will reduce the system performance. We are currently programming the simulator to demonstrate this point.
The chosen approach is to place the task manager in a single resource with a local task memory. It is justified because the communications between the task manager and the PUs are reduced. The task manager can be a software program interleaved with task execution on a PU. However, the task manager mainly requires additions and memory accesses. Interrupting a PU to execute the task manager wastes the complex PU resources and might cause some load imbalance with other PUs. An additional mechanism would have to be implemented to ensure that the task manager program is executed quickly enough to generate eligible tasks when needed, but not so often that it overloads the PU. All these reasons have led us to choose a third solution, which consists of adding a small resource to handle the parallelism. This resource implementing the task manager can be considered as a simplified PU with a local memory, optimized for task management operations.
Fig. 4: left: data-dependent task graph with parallel command, right: corresponding task program. $tmp is a shared register written by a PU. Remark: the high-level fork/join is not necessary here, as the left and right parts of the task program are small. It is used for the purpose of the example.
The design of the queue bank structure makes most of the operations easy to implement. The critical operations are the synchronization of two consecutive tasks and the execution of control commands that can modify the graph structure. If enough parallelism is available, these operations can be hidden by the execution of other simultaneous tasks. However, in most cases, the overhead appears at least partially.
Task addition, selection and removal require a search in all the active queues. If they are executed sequentially, the operation latency will reduce the performance. However, such operations can be performed in parallel on all the queues. For this reason, we have adopted a hardware implementation of the queue bank and the task servers. The remaining parts of the task manager will be implemented with a small RISC core. It will reduce the critical operation latency, while limiting the manager implementation cost. The point-to-point synchronization operation or the selection of eligible tasks will require only a single simultaneous operation on all the active queues.
VI. Simulations
Simulations have been performed on several classes of applications first to validate the proposed parallelism exploitation and secondly to verify the possibility to implement complex parallelism patterns. The ease of programming is also taken into account. A third class of simulations is planned to discuss the software/hardware implementation issue. This step is still in progress and will not be presented here.
The simulation system comprises an experimental compiler and a multiprocessor simulator. The compiler is based on Stanford's SUIF system [9] extended with Harvard's MACHINE library [10]. Several passes have been added to extract the parallelism, generate the task program and manage parallel stack frames. Currently, the tasks are defined by the programmer, by inserting compilation directives in the program code. The multiprocessor simulator simulates a bus-based multiprocessor with simple RISC processors and a shared memory.
The first set of simulations is intended to validate our parallelism representation and its capability to extract a large amount of parallelism from complex programs. For this reason, the simulated system has been simplified to remove the overheads incurred by a particular memory or PU architecture. Otherwise, it would be difficult to determine whether the measured performance is due to good multiprocessor optimizations or to an efficient parallelism exploitation. The simulated system therefore has ideal PUs processing one instruction per cycle, and an ideal network and memory system. To be able to compare our results with a current multiprocessor system, we use benchmarks from the SPLASH [11][12] suite. Our results are compared with those provided in [11][12], also computed on an ideal system. For speedup measurement, we use the normalized speedup, defined as the ratio of the execution time of an equivalent sequential program executed on a single PU to the measured execution time. The relative speedup is defined with respect to the execution time of the parallel program on a single PU.
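As a small illustration of the two speedup measures, the sketch below takes both as a reference time divided by the measured parallel execution time, which is the usual convention. The function names and the cycle counts are illustrative assumptions, not measured results.

```python
def normalized_speedup(t_sequential, t_parallel):
    """Speedup relative to an equivalent sequential program on a single PU."""
    return t_sequential / t_parallel

def relative_speedup(t_parallel_one_pu, t_parallel):
    """Speedup relative to the parallel program executed on a single PU."""
    return t_parallel_one_pu / t_parallel

# Hypothetical cycle counts: the parallel program on one PU is slightly
# slower than the sequential program because of the parallelism overhead.
t_seq, t_par_1, t_par_8 = 1000, 1040, 130
print(relative_speedup(t_par_1, t_par_8))           # 8.0
print(round(normalized_speedup(t_seq, t_par_8), 2)) # 7.69
```

The normalized speedup is the more demanding of the two, since it also charges the parallel program for any instruction overhead it adds over the sequential version.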
Simulation results on SPLASH and SPLASH-2 benchmarks are provided in fig. 5a. Water, MP3D and FFT are simple algorithms with a parallelism based on several simple parallel loops. In this case, results are similar to the ideal SPLASH results, as classical multiprocessors are well suited to executing large parallel loops. Performance for the LU algorithm is far better, despite the smaller data set used (and thus less parallelism available). The performance improvement is explained by a deeper exploitation of the available parallelism. Indeed, while SPLASH only exploits the parallelism inside each iteration, our more general parallelism representation makes it possible to exploit the partial parallelism available among several iterations. Moreover, the task manager structure limits the parallelism management overheads, allowing smaller tasks to be processed. Indeed, figure 5b shows the results of our parallelism exploitation (ML=multi-level) compared with the SPLASH parallelism exploitation (SL=single-level) on small data sets (64x64 matrix elements clustered by 4 (1024 blocks), 16 (256 blocks) or 64 (64 blocks)). In this case, the performance gain rises up to 12% for 32 to 64 PUs as soon as there is enough parallelism available. It is achieved by keeping the executed instruction overhead below 1% up to 16 PUs and 3% for 64 PUs, except in the case of very fine parallelism extraction (blocks of 4 elements) (fig. 6).
Simulations have also been performed on other applications from the multimedia and image processing fields. We have simulated an MPEG-2 decoder. The decoder program is sequential and directly derived from the MPEG-2 standard [13]. It has been parallelized by introducing parallelization directives and by rewriting minor code parts. An optimized parallel program would certainly produce better results. The image segmentation algorithm [14] is an algorithm with a lot of critical parts. The zone growth operation presents a parallelism of only 2.6. Moreover, the available parallelism highly depends on the image homogeneity. Results are shown in figure 7. The MPEG-2 decoder shows good performance up to 24 PUs. For larger systems, the flat curve shows that all the parallelism has been extracted. This is due to the fact that the bitstream decoding is a sequential operation and cannot be parallelized. Concerning the segmentation algorithm, the partial overlap of multiple zone growths hides the critical parts. The handling of data-dependent parallelism makes it possible to achieve a good parallelism extraction.
The execution of complex programs first shows that our parallelism management is efficient at describing and exploiting complex structures. Moreover, this capability simplifies the programmer's job, as the sequential program does not have to be thoroughly modified to comply with some limiting parallel syntax. Indeed, for the SPLASH benchmarks we used the original version with slight modifications. Concerning the image processing benchmarks, we have taken a sequential program and inserted parallel commands. No modifications have been made to ensure that parallel tasks will be close together in the program. This is not important as the task manager rearranges the tasks in the queue bank.
[Figure 5 plots: normalized speedup (0 to 64) versus number of PUs (0 to 64). Panel a) curves: ideal, Water, MP3D, FFT, LU. Panel b) curves: ideal; SL and ML for 1024, 256 and 64 blocks.]
Fig. 5: a) SPLASH benchmark performances: WATER (64 mol.), LU (256x256 matrix), MP3D (3000 mol.), FFT (2^14 points); b) LU performances with 64x64 matrix. SL stands for single-level and ML for multi-level parallelism exploitation.
VII. Conclusion
We have presented a new parallelism manager for multiprocessor systems. The goal of this manager is to extract and exploit the complex parallelism patterns occurring in general applications. This goal has been achieved by using a graph-based parallelism representation and a dedicated task manager. The parallelism representation is separated from the program code to avoid spreading the parallelism information throughout the program code. The parallelism description is derived from a topological description of the dependence graph. A dynamic parallelism management is performed thanks to the use of control commands inserted in the task program. Several other parallel commands have been added to optimize the manager performance for simple loops and high-level parallel structures. The potential of the parallelism representation is exploited by a task manager based on a queue bank. A structured part of the task graph is stored in the bank. The proposed graph representation makes it possible to execute the graph manipulation operations at a limited cost. We have validated the system on several benchmarks. The manager's ability to handle various kinds of parallelism has been demonstrated. A more detailed parallelism exploitation improves throughput significantly for complex applications.
This paper has demonstrated the potential of this approach. Coupled with other parallelism improvement techniques, it will achieve an efficient parallelism exploitation. Currently, no particular load balancing or data locality improvement techniques have been implemented. However, the queue bank stores a structured description of the dependence graph. Techniques based on a local graph analysis could be implemented in the manager to improve the results. Data locality is a more critical problem than load balancing. If the manager can generate enough parallel tasks, the load imbalance can be hidden by the execution of other tasks. The problem of data locality is that it decreases as the parallelism extraction increases. To overcome this problem, a data locality policy should be added to the task manager, coupled with a compiler that maximizes the local data reuse.
Fig. 6: Instruction overheads for the LU benchmarks and different parallelism extractions.
Fig. 7: MPEG-2 decoder and image segmentation performances.
Future work will be oriented toward the actual implementation of the task manager. We have presented in this paper the global architecture and the hardware/software options, but several implementation alternatives still need to be studied and validated to have a cheap and efficient task manager. Concerning the compiler, the programmer has to introduce commands into the program code to help it to partition the program into tasks. We will later develop passes to perform this task automatically on a sequential program.
References
[1] David E. Culler, Jaswinder Pal Singh, Anoop Gupta, "Parallel Computer Architecture, a Hardware/Software Approach", Morgan Kaufmann Publishers Inc, San Francisco, California, 1998
[2] Johnson, T., Davis, T., Hadfield, S.: "A Concurrent Dynamic Task Graph", Parallel Computing, Vol 22, No 2, February 1996, pp 327-333
[3] Chandra, R., Gupta, A., Hennessy, J.L.: "Integrating Concurrency and Data Abstraction in the COOL Parallel Programming Language", IEEE Computer, February 1994
[4] Blumofe, R.D., Joerg, C.J., Kuszmaul, B.C.: "Cilk: An Efficient Multithreaded Runtime System", Proc. of the Fifth ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming, July 1995
[5] Kai Hwang, Zhiwei Xu, "Scalable Parallel Computing", McGraw-Hill, 1998
[6] Nikhil, R.S., Papadopoulos, G.M., Arvind: "*T: A Multithreaded Massively Parallel Architecture", Int. Symp. on Computer Architecture, 1992, pp 156-167
[7] Mike Johnson, "Superscalar Microprocessor Design," PTR Prentice Hall, 1991
[8] Babak Hamidzadeh, David J. Lilja, "Dynamic Scheduling Strategies for Shared-Memory Multiprocessors," Proc. of the 16th Int. Conf. on Distributed Computing Systems, Hong Kong, 1996, pp 208-215
[9] Stanford Compiler Group, "The SUIF Library", The SUIF compiler Documentation Set, Stanford University, 1994
[10] M.D. Smith, "The SUIF MACHINE Library", Harvard University, 1997
[11] Singh, J.P., Weber W-D., Gupta, A.: "SPLASH: Stanford Parallel Applications for Shared Memory", Computer Architecture News, 20(1), July 1994, pp 5-44
[12] Woo, S.C., Ohara, M., Torrie, E., Singh, J.P., Gupta, A.: "The SPLASH-2 Programs: Characterization and Methodological Considerations", Proc. of the 22nd Ann. Symp. on Computer Architecture, June 1995, pp 24-36
[13] ISO/IEC 13818-2: "Information Technology – Generic coding of moving pictures and associated audio information – Part 2: Video", 1995
[14] Gilmont, T., Verians, X., Legat J-D., Veraart, Cl.: "Resolution Reduction by Growth of Zones for Visual Prosthesis", Proc. of the 1996 Int. Conf. on Image Processing, Switzerland, September 1996, pp 299-302