Chapter 6: Parallelism Implementation in FPGA-based ES
6.3 Methodologies of Processing in Parallel
6.3.1 Computation Decomposition at High Level
Grama et al (Grama et al 2003) also describe a method for analysing the parallelism of a computation at a high level, known as decomposition techniques. These techniques provide a good starting point for parallelism analysis in many real computation problems: by applying one technique, or a combination of several, the operations in a complex problem can be analysed and decomposed into parts that can be executed in parallel. Although decomposition techniques do not always yield the best parallel solution to a problem, they offer a feasible way of turning an ordinary problem into a parallel one.
The basic idea of this method is to split a given problem into computation sub-steps that can be executed concurrently and independently, and to define a task-dependency graph over them. This graph shows which sub-steps of the problem can be executed concurrently.
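As a minimal sketch of how a task-dependency graph exposes concurrency, the following Python fragment groups tasks into levels so that all tasks in one level could run at the same time. The task names and the example graph are hypothetical and are not taken from the project.

```python
def concurrency_levels(deps):
    """Group tasks into levels; tasks in the same level do not depend on each other.

    deps maps each task name to the set of tasks it must wait for.
    """
    remaining = {task: set(pre) for task, pre in deps.items()}
    levels = []
    while remaining:
        # A task is ready once all of its prerequisites have been scheduled.
        ready = sorted(t for t, pre in remaining.items() if not pre)
        if not ready:
            raise ValueError("cycle or unknown prerequisite in the graph")
        levels.append(ready)
        for t in ready:
            del remaining[t]
        for pre in remaining.values():
            pre.difference_update(ready)
    return levels


# Hypothetical graph: t1..t4 are independent, t5 and t6 combine pairs, t7 combines both.
graph = {
    "t1": set(), "t2": set(), "t3": set(), "t4": set(),
    "t5": {"t1", "t2"}, "t6": {"t3", "t4"}, "t7": {"t5", "t6"},
}
levels = concurrency_levels(graph)
```

The number of tasks in the widest level gives a rough upper bound on how many sub-steps can be in progress at once.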
6.3.1.1 Four Decomposition Techniques
There are four types of decomposition technique: data decomposition, recursive decomposition, exploratory decomposition, and speculative decomposition.
The data and recursive decomposition techniques can be applied to general problems, whereas the speculative and exploratory decomposition techniques are suited to specific classes of problems.
6.3.1.2 Data Decomposition
Data decomposition is used to parallelise algorithms that operate on a large amount of data. It takes place in two steps: data partitioning and computation task division. In the first step, the data are partitioned according to their independence; in the second, the resulting partition is used to determine which operations should be performed on each data portion. The operations on each data portion are relatively fixed and similar. Data decomposition can produce several solutions to a given problem, so these have to be evaluated for performance and efficiency before one is chosen.
Data decomposition can be applied to the input, output, or intermediate data of a given problem. It can also conform to the owner-computes rule, which means that each owner has its own task and data, and all the computation operations of its task involve only its own data. The rule has a different meaning for input and for output data decomposition. In input data decomposition, the owner of a portion of input data is the owner of the task that performs all the operations on this portion of data and produces the results. In output data decomposition, the owner of a portion of output data is the owner of the task that performs all the operations needed to produce this portion of output data.
Intuitively, it is most natural to start with output data decomposition, because each part of the output of the problem can often be computed independently of the other parts. Where possible, the operations that yield one part of the output should be independent of those used to produce the other parts. The computation task division then follows naturally.
For example, it is well known that matrix operations on large matrixes can be reformulated as matrix operations on small block matrixes. This formulation can significantly decrease the cost of computing and the storage requirement. Another benefit is that it lends itself to parallelism decomposition: operations on a large matrix are transformed into operations on several small block matrixes, which can be processed in parallel.
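The block formulation can be illustrated with a small Python sketch of output data decomposition under the owner-computes rule: each task owns one block of the result matrix and performs all the operations needed to produce it. The function names and the block size are assumptions made for illustration, and Python threads merely stand in for parallel hardware units.

```python
from concurrent.futures import ThreadPoolExecutor


def matmul_block(a, b, rows, cols):
    """Compute the output block C[rows, cols] of C = A * B."""
    inner = len(b)  # columns of A must equal rows of B
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in cols]
            for i in rows]


def blocked_matmul(a, b, block=2):
    n, m = len(a), len(b[0])
    c = [[0] * m for _ in range(n)]
    # One task per output block; the owner of a block computes all of it.
    tiles = [(range(i, min(i + block, n)), range(j, min(j + block, m)))
             for i in range(0, n, block) for j in range(0, m, block)]
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda t: matmul_block(a, b, *t), tiles)
        # Copy each finished block into its place in the result matrix.
        for (rows, cols), tile in zip(tiles, results):
            for r, i in enumerate(rows):
                for s, j in enumerate(cols):
                    c[i][j] = tile[r][s]
    return c
```

Because no two tasks write to the same output block, the tasks need no synchronisation with one another.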
A restriction of output data decomposition is that it works well only if each output of the problem can be computed as a function of the input. Where this is not the case, it is natural to turn to input data decomposition. If the input data of a problem can be divided into independent groups, and the operations on each group can be performed in parallel, the problem can be addressed with input data decomposition. The distinguishing feature of this type of problem is that the operations a task performs on one independent group of input data are isolated from those on the other groups; that is, each task is independent of the other tasks with respect to its input data. Sorting numbers or letters is such a case. Figure 6.2 shows output and input data decompositions.
Figure 6.2 Output and Input Data Decompositions
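The sorting case mentioned above can be sketched in Python as an input data decomposition: the input is partitioned into independent groups, each group is sorted by its own task, and the sorted groups are merged at the end. The function names and the number of partitions are illustrative assumptions, and threads are used only to show the task structure.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def partition(data, parts):
    """Split data into at most `parts` contiguous, independent groups."""
    size = -(-len(data) // parts)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def parallel_sort(data, parts=4):
    if not data:
        return []
    groups = partition(list(data), parts)
    with ThreadPoolExecutor(max_workers=parts) as pool:
        # Each task sorts only the group of input data that it owns.
        sorted_groups = list(pool.map(sorted, groups))
    # The final result depends on the results of all the tasks.
    return list(heapq.merge(*sorted_groups))
```

The merge step is serial in this sketch; it is the part of the problem that the input partition does not make independent.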
Often, the problems to be solved are not ideally suited to output or input data decomposition. For these problems, the intermediate results may be used for parallelism decomposition. Intermediate data decomposition suits problems that are processed in multiple stages and that need intermediate data to produce the final output. The intermediate data can be treated as the output data of an intermediate stage or as the input data of the subsequent stage. Intermediate data decomposition can thus be seen as the output or input data decomposition of a sub-problem of the original one, and it focuses on finding data independence in this sub-problem and carrying out data decomposition there.
In reality, problems do not obviously belong to any single type of output, input or intermediate data decomposition, and a solution has to be sought by combining them. In some cases, more concurrency can be gained by applying input data decomposition after output data decomposition. In other cases, a serial algorithm may not explicitly fit any type of data decomposition, but reorganising the solution structure of the serial algorithm may open up an opportunity for intermediate data decomposition. Intermediate data decomposition requires more exploration and may produce higher parallelism than output or input data decomposition.
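One way to picture intermediate data decomposition is to reorganise matrix multiplication into two stages. In the Python sketch below, stage 1 computes the intermediate matrixes D_k[i][j] = A[i][k] * B[k][j], which are mutually independent, and stage 2 sums them into the output. The choice of matrix multiplication and all the function names are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_product(a, b, k):
    """Stage 1: intermediate matrix D_k[i][j] = A[i][k] * B[k][j]."""
    return [[row[k] * b_kj for b_kj in b[k]] for row in a]


def matmul_via_intermediate(a, b):
    cols = len(b[0])
    with ThreadPoolExecutor() as pool:
        # Stage 1: one task per k; the D_k do not depend on each other.
        partials = list(pool.map(lambda k: partial_product(a, b, k),
                                 range(len(b))))

        # Stage 2: each output row is an independent sum over the D_k.
        def sum_row(i):
            return [sum(d[i][j] for d in partials) for j in range(cols)]

        return list(pool.map(sum_row, range(len(a))))
```

Stage 1 is a decomposition of the intermediate data, while stage 2 is an output data decomposition of the sub-problem that consumes them.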
6.3.1.3 Recursive Decomposition
In recursive decomposition, a problem is first divided into several independent sub-problems, meaning that there is no data dependency between the sub-problems. Each sub-problem is then divided into smaller sub-problems in the same way. At each division level, the results of all the sub-problems are combined and passed to the level above, until the final result is formed at the top level. The problem can thus be treated as a divide-and-conquer problem. Along with the problem, the solution algorithm can be split into sub-sections at different levels as well, and at any division level all the sub-sections can be executed concurrently. Figure 6.3 shows a task-dependency graph for three division levels.
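The divide-and-combine structure can be sketched in Python with a simple summation. The problem is halved `depth` times (two splits give three division levels), the smallest sub-problems are solved concurrently, and the results are combined level by level. Summation is chosen only as an illustrative problem, and the function names are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def split(problem):
    """Divide a problem into two independent halves."""
    mid = len(problem) // 2
    return problem[:mid], problem[mid:]


def recursive_sum(values, depth=2):
    # Divide: every level splits each sub-problem in the same way.
    subproblems = [list(values)]
    for _ in range(depth):
        subproblems = [half for sub in subproblems for half in split(sub)]
    with ThreadPoolExecutor() as pool:
        # Solve all sub-problems at the lowest level concurrently.
        results = list(pool.map(sum, subproblems))
        # Combine: pair up results and pass them to the level above.
        while len(results) > 1:
            pairs = [(results[i], results[i + 1])
                     for i in range(0, len(results), 2)]
            results = list(pool.map(lambda p: p[0] + p[1], pairs))
    return results[0]
```

Within each level the tasks are independent, but a level cannot start combining until the level below it has finished.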
Figure 6.3 Task-Dependency Graph for Three Division Levels
6.3.1.4 Exploratory Decomposition
Exploratory decomposition applies to problems that search a space for solutions. The search process shrinks the space in which the solutions are contained. The original search space can therefore be divided into smaller regions, which are searched concurrently until the solutions are found.
In some ways, exploratory decomposition is similar to the data decomposition of number or letter sorting problems, since both work on partitioned sets. There are differences between them, however. Data decomposition has to cover the entire set of acceptable values, and the final solution of the problem relies on the results of all the tasks. Exploratory decomposition can terminate without waiting for all the tasks to finish once all the solutions of the problem have been found. Consequently, the way in which the original search space is partitioned can seriously affect the parallel performance of exploratory decomposition. With poor partitioning, the parallel computation may be no faster than the corresponding serial algorithm.
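The early termination that distinguishes exploratory decomposition can be sketched in Python as follows. The search space is partitioned into regions, each region is searched by its own task, and a shared flag lets the remaining tasks stop as soon as one of them succeeds. The function names, the single-solution assumption, and the use of threads are all illustrative choices.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed


def search_region(region, is_solution, found):
    for candidate in region:
        if found.is_set():  # another task has already found the solution
            return None
        if is_solution(candidate):
            found.set()
            return candidate
    return None


def exploratory_search(space, is_solution, parts=4):
    """Search a partitioned space; assumes at most one solution is wanted."""
    size = max(1, -(-len(space) // parts))  # ceiling division, at least 1
    regions = [space[i:i + size] for i in range(0, len(space), size)]
    found = threading.Event()
    with ThreadPoolExecutor(max_workers=parts) as pool:
        futures = [pool.submit(search_region, r, is_solution, found)
                   for r in regions]
        for future in as_completed(futures):
            result = future.result()
            if result is not None:
                return result
    return None
```

How much work is saved depends on where the solution falls within its region, which is why the partitioning has such a strong influence on performance.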
6.3.1.5 Speculative Decomposition
Speculative decomposition can be applied to problems in which one task has different branches of operations, depending on the pattern of its input. The input may be the result computed by the preceding task. This dependence means that the task cannot start until the result of the preceding task is available. Its behaviour is like that of a conditional branch statement in a high-level language.
One way to solve this kind of problem is simply to compute all the possible branches of the task, obtaining all the possible results, without waiting for the preceding task to yield the ‘condition’. The task can then be performed concurrently with other tasks, which speeds up the processing of the whole problem. Once the ‘condition’ is produced, the result of the correct branch is passed to the next task and the results of the other branches are discarded. The work spent on the incorrect branches is wasted. Speculative decomposition therefore offers a compromise: only the most likely branches are computed in advance, which covers most situations. If one of the branches that was not computed turns out to be the correct one, its computation has to be carried out afterwards.
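The compromise described above can be sketched in Python: the task that produces the condition and the most likely branches are started together, and a branch that was not computed in advance is made up only if the condition selects it. The function and parameter names are hypothetical, and threads stand in for parallel execution units.

```python
from concurrent.futures import ThreadPoolExecutor


def speculate(condition_task, branches, likely):
    """Run condition_task alongside the likely branches and return the
    result of the branch that the condition finally selects."""
    with ThreadPoolExecutor() as pool:
        condition = pool.submit(condition_task)
        # Start only the branches judged most likely to be needed.
        started = {key: pool.submit(branches[key]) for key in likely}
        chosen = condition.result()
        if chosen in started:
            # Correct speculation: the result is ready or nearly ready.
            return started[chosen].result()
        # Misprediction: the skipped branch has to be computed now.
        return branches[chosen]()


# Hypothetical task with two branches selected by the sign of x.
x = -9
branches = {"pos": lambda: x * 2, "neg": lambda: -x}
result_hit = speculate(lambda: "neg" if x < 0 else "pos", branches, ["pos", "neg"])
result_miss = speculate(lambda: "neg" if x < 0 else "pos", branches, ["pos"])
```

Both calls return the same value; they differ only in whether the selected branch was already computed when the condition arrived.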
On the basis of statistics, the overall performance of speculative decomposition should be no lower than that of its serial counterpart. If a problem contains several stages of speculative decomposition, the speed-up can be considerably greater.
If we compare speculative decomposition with exploratory decomposition in detail, some differences can be discovered.
In speculative decomposition, more work is always done than in the corresponding serial algorithm, because parallelism is increased by performing in advance some tasks whose results will not be used. Speculative decomposition is nevertheless at least as fast as the corresponding serial algorithm. In exploratory decomposition, since the division of the search regions is not unique, it cannot be determined in advance how much speed-up the parallelism will bring, or how much more or less work it will involve than the corresponding serial algorithm.
In speculative decomposition, the set of possible outputs is known: it consists of the results for all the possible conditions. What is not known in advance is which condition the input will satisfy. In exploratory decomposition, the set of possible inputs is known: it is the search space. What is not known in advance is the output, i.e. which portion of the search space will yield the correct solution to the search problem.
6.3.1.6 Mixing Decomposition
The above decomposition techniques can be combined in an application. Often, the computation for a problem has multiple stages, and different stages may have features that match different decomposition techniques. It is then necessary to apply different types of decomposition in different stages; hybrid decompositions are acceptable.
This project is system-level research involving a hierarchical architecture that covers a wide variety of problems, so mixed decomposition is adopted.
Since there are many matrix operations in the Mesa-OpenGL for FPGA-based ESs, input data decomposition is used to divide a large matrix into small block matrixes. When an object surface is complicated and has to be represented with a large mesh or grid, its vertex matrix can be large. Data decomposition of large matrixes can decrease the cost of computation (especially multiplication) and of storage.
As discussed in Section 5.5.1, for surface editing with the PAMA, the edited surface is the top-level problem, which can be solved with recursive decomposition. The surface is first divided into small patches, as shown in Figure 5.5. Each patch is then divided into two triangles, as shown in Figure 5.6. Each triangle can be processed independently because its data are independent of those of the others. Thus, the triangles can be processed in parallel.
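The surface-to-patch-to-triangle division can be pictured with a short Python sketch. It is not the PAMA implementation: the patch representation (four corner vertices), the choice of diagonal, and the per-triangle operation (an unnormalised face normal) are assumptions made for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor


def patch_to_triangles(patch):
    """Split a quadrilateral patch (v0, v1, v2, v3) along the v0-v2 diagonal."""
    v0, v1, v2, v3 = patch
    return [(v0, v1, v2), (v0, v2, v3)]


def triangle_normal(tri):
    """Unnormalised face normal: cross product of two edge vectors."""
    (ax, ay, az), (bx, by, bz), (cx, cy, cz) = tri
    ux, uy, uz = bx - ax, by - ay, bz - az
    vx, vy, vz = cx - ax, cy - ay, cz - az
    return (uy * vz - uz * vy, uz * vx - ux * vz, ux * vy - uy * vx)


def process_surface(patches):
    # Second division level: every patch yields two independent triangles.
    triangles = [t for p in patches for t in patch_to_triangles(p)]
    with ThreadPoolExecutor() as pool:
        # Each triangle uses only its own vertex data, so the tasks are independent.
        return list(pool.map(triangle_normal, triangles))


# Hypothetical surface consisting of a single unit-square patch in the xy-plane.
unit_patch = ((0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0))
normals = process_surface([unit_patch])
```

In an FPGA setting the same structure would correspond to one processing unit per triangle rather than one thread per triangle.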