Chapter 6: Parallelism Implementation in FPGA-based ES

6.3 Methodologies of Processing in Parallel

6.3.2 Parallelism Mapping at High Level

Decomposition techniques identify the concurrency in a problem and decompose it into tasks that can be executed in parallel. After decomposition, the resulting tasks can be mapped onto processes that are executed by the operators available in a specific parallelism system. The mapping involves programming the tasks of a parallel algorithm, or re-programming a serial algorithm, in the parallelism programming style designated for that system. Besides conforming to the designated programming style, the mapping can draw on the fundamental parallelism information that the decomposition provides. The behaviour of the tasks and the interactions between them offer guidance for the mapping: the task behaviour explains how an operator processes its portion of the data, whereas the interaction behaviour indicates how the operator communicates with others.

A parallelism mapping plot is expected to execute faster than its serial counterpart. Four features of the tasks significantly affect the parallelism mapping plots.

6.3.2.1 Task Running Time

known, the inherent serial section that cannot be replaced by parallel tasks places the tightest constraint on a parallelism mapping plot. The ideal case for parallelism mapping is one in which all the parallel tasks take the same amount of time to complete, that is, all the tasks are uniform. In that case the parallel algorithm will be faster than its serial counterpart.
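The constraint imposed by the inherent serial section is commonly quantified with Amdahl's law. The text does not name that law, so the following is only a minimal sketch under that assumption; the function name `amdahl_speedup` and the example serial fraction are illustrative and are not taken from the thesis.

```python
def amdahl_speedup(serial_fraction: float, operators: int) -> float:
    """Upper bound on speedup when `serial_fraction` of the work cannot be parallelised.

    Assumption: the remaining work splits into uniform tasks, one per operator,
    with no communication overhead.
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must lie in [0, 1]")
    if operators < 1:
        raise ValueError("operators must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / operators)


if __name__ == "__main__":
    # With 10% serial work, the speedup approaches but never reaches 10.
    for n in (2, 4, 8, 1000):
        print(n, round(amdahl_speedup(0.1, n), 2))
```

Under these assumptions, adding operators yields diminishing returns once the serial section dominates the running time.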

For example, there are several matrix left multiplications in Mesa-OpenGL for the FPGA-based ES, as shown in Section 5.4.2.7 and Figure 5.4. In these matrix multiplication problems, the tasks produced by the parallelism decomposition can be uniform because the sizes of the matrices are known in advance.

Since the complexity of parallelism problems varies, parallelism decomposition cannot divide every problem into uniform tasks that complete in the same amount of time. The search problem, for instance, is non-uniform. Some tasks, like the inherent serial section, may take more time than others, and such tasks have a particularly strong influence on the performance of the parallelism mapping plot.
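The influence of one long task can be illustrated with a small calculation. The sketch below assumes, purely for illustration, that every task runs on its own operator and that there is no communication or scheduling overhead, so the parallel completion time equals the longest task. The function name and the task times are hypothetical.

```python
def parallel_speedup(task_times):
    """Speedup of running each task on its own operator versus running them serially.

    Assumption: no communication or scheduling overhead, so the parallel
    completion time equals the longest single task.
    """
    if not task_times:
        raise ValueError("at least one task is required")
    serial_time = sum(task_times)
    parallel_time = max(task_times)
    return serial_time / parallel_time


uniform = [4, 4, 4, 4]        # four tasks of equal length
non_uniform = [10, 2, 2, 2]   # same total work, one dominant task

if __name__ == "__main__":
    print(parallel_speedup(uniform))       # all operators finish together
    print(parallel_speedup(non_uniform))   # three operators wait for the slow one
```

Both task sets contain the same total work, yet the non-uniform set gains far less from four operators because the longest task determines when the whole step completes.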

Similarly, the recursive decomposition of the edited surface, given in Section 6.3.1.6, cannot produce uniform tasks, since patches with different shapes are divided into pairs of different triangles. A triangle can differ in shape and area from the others, which results in a different set of line segments when each triangle is scanned line by line, as shown in Figure 5.7.

6.3.2.2 Knowledge of Task Running Time

If the task running time in a parallelism mapping plot can be estimated, it is effectively known before the task is executed. This information is useful because it enables an assessment of how much faster a parallel algorithm can be than its serial counterpart. For example, in a matrix multiplication problem the running time can be derived from the execution time of a small sub-matrix block multiplication and of the scalar algebraic operations. This evaluation helps to assess whether a parallelism mapping plot is good enough and whether there is room for optimisation.
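As a rough illustration of such an estimate, the sketch below counts the scalar operations in a naive matrix product and multiplies them by per-operation costs. The cost parameters `t_mul` and `t_add` are hypothetical placeholders, not measurements from this work, and the function names are chosen only for the example.

```python
def matmul_operation_count(rows_a, cols_a, cols_b):
    """Scalar multiplications and additions in a naive
    (rows_a x cols_a) by (cols_a x cols_b) matrix product."""
    multiplications = rows_a * cols_a * cols_b
    additions = rows_a * (cols_a - 1) * cols_b
    return multiplications, additions


def estimated_time(rows_a, cols_a, cols_b, t_mul, t_add):
    """Estimated serial running time for hypothetical per-operation costs."""
    muls, adds = matmul_operation_count(rows_a, cols_a, cols_b)
    return muls * t_mul + adds * t_add


if __name__ == "__main__":
    # A 4x4 matrix times a 4x1 vector: 16 multiplications and 12 additions.
    print(matmul_operation_count(4, 4, 1))
```

Because the operation count depends only on the matrix dimensions, the estimate is available before any task is executed.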

In some cases, it cannot be known in advance how long a task takes to execute. For example, with exploratory decomposition of a search problem we cannot be sure how many steps must be taken to find the final solution, because this depends on the choice made at each stage. Such uncontrolled factors make the task running time unobtainable, so a mapping plot for this kind of problem cannot be judged well in advance.

In this research, the running time of the left multiplication of a 4×1 matrix by a 4×4 matrix can be evaluated in advance. The 4×1 matrix represents the coordinates of a vertex, and the 4×4 matrix can be one of the transformation matrices shown in Figure 5.4. On the other hand, it cannot be known in advance how long it will take to draw a surface, because the triangles into which the surface is divided can have various shapes and areas.

6.3.2.3 Known Task and Generated Task

There are two types of task. One is a task that is known before the algorithm starts execution, called a known task; the other is a task generated during the execution of the algorithm, referred to as a generated task.

Known tasks can usually be obtained by data decomposition. Even though recursive decomposition can generate many sub-tasks by using a task-dependency graph, these sub-tasks are known before the parallel algorithm executes, so they are known tasks as well.

Generated tasks may be produced by decomposition while the parallel algorithm is executing. At the high level, decomposition techniques may not lead to an explicit, detailed task-dependency graph that includes the actual tasks. Recursive decomposition may yield an array of a designated size, but this array can involve different numbers of operators with different capabilities on different parallelism hardware platforms, because a task-dependency graph may not carry detailed information about the hardware platform.

In any case, a generated task takes a state as its input or as its generation condition. The task then expands over a predefined number of stages, as in a search problem, and dynamically generates more tasks that perform the same computation on each of the resulting states until the solution is found and the algorithm terminates. Whether generated tasks are produced depends on the input of the algorithm at execution time.
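The behaviour of generated tasks can be sketched with a small search in which only the initial task is known beforehand and every further task is created at run time. The successor rule used here (x + 1 and x * 2) and the function name are toy assumptions chosen only to make the example concrete; they do not come from the thesis.

```python
from collections import deque


def search_with_generated_tasks(start, target, max_depth):
    """Breadth-first search in which each processed state generates new tasks.

    Returns (found, tasks_executed). The successor rule is a toy assumption.
    """
    queue = deque([(start, 0)])   # the only task known before execution
    seen = {start}
    tasks_executed = 0
    while queue:
        state, depth = queue.popleft()
        tasks_executed += 1
        if state == target:
            return True, tasks_executed
        if depth == max_depth:
            continue              # predefined number of stages reached
        for successor in (state + 1, state * 2):
            if successor not in seen:
                seen.add(successor)
                queue.append((successor, depth + 1))  # task generated at run time
    return False, tasks_executed


if __name__ == "__main__":
    print(search_with_generated_tasks(1, 2, 10))
    print(search_with_generated_tasks(1, 5, 10))
```

The number of tasks executed differs between the two calls even though the algorithm is identical, which illustrates that the set of generated tasks depends on the input.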

In this research, surface drawing is a problem of recursive decomposition. As seen in Section 6.3.1.6, it includes three levels of tasks: the surface at the top, the patches at the middle level, and the triangles at the lowest level. These tasks are known tasks. Underlying this recursive decomposition, however, are unknown sub-tasks. We cannot know how many lines should be scanned in each triangle before the surface is evaluated, and the number may change dynamically when the surface is edited through user interaction.
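To see why the number of scan lines is only available once the vertices are evaluated, consider the following sketch. It assumes integer screen coordinates and one scan line per integer y between the lowest and highest vertex, inclusive; the actual rasterisation rule used in this work is not stated here, so both the rule and the sample triangles are hypothetical.

```python
def scan_line_count(triangle):
    """Number of horizontal scan lines covering a triangle.

    `triangle` is a sequence of three (x, y) integer vertices.
    Assumption: one scan line per integer y from the lowest to the
    highest vertex, inclusive.
    """
    ys = [y for _, y in triangle]
    return max(ys) - min(ys) + 1


patch_triangles = [
    [(0, 0), (10, 0), (0, 3)],   # wide, flat triangle
    [(0, 0), (2, 0), (1, 40)],   # narrow, tall triangle
]
lines_per_triangle = [scan_line_count(t) for t in patch_triangles]

if __name__ == "__main__":
    print(lines_per_triangle)
```

Moving a single vertex changes the count, so the scan-line sub-tasks cannot be enumerated before the edited surface has been evaluated.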

6.3.2.4 Related Data

In parallelism problems, a large amount of data usually needs to be processed. Accessing and moving a large amount of data can be costly in terms of CPU time and memory space, and the overheads become even worse when communication between operators or I/O operations are required. Therefore, the data related to a task, and the size of those data, can also affect a parallelism mapping plot.

of the next task is the output of the last one. Whether or not the related data of a task are available can determine if the task can be performed in parallel with other tasks.

Related data sizes vary, and the size of the input data can differ from that of the output. For example, the size of the data in a frame buffer is determined by the resolution and the colour format. When the frame buffer is processed as a stream, the input of a task is just one or two pixels, but the output can be a whole frame buffer. When it is processed by copying a whole image, the input of a task can be the whole image and its output can be the pixels in a square area of the frame buffer. For a search problem, the input of a task can be the set of search scopes, and the output may be just one number in the set.

In this research, there are various data sizes in the graphics pipeline of OpenGL, as shown in Figure 5.1. In the geometric pipeline, the basic elements processed are the vertices of a surface, which are 4×1 matrices of four coordinates. The coordinates are fixed-point numbers with 11 bits for the fractional part. In the fragment pipeline, a frame buffer of 800 × 480 pixels is stored. Each pixel in the frame buffer is a 32-bit integer in the RGBA format, with eight bits for each of the four components red, green, blue and alpha. When the surface is finally displayed on the device screen, the LCD accepts only a stream of eight bits for each of the colour elements red, green and blue. These various data sizes restrict the system parallelism, which results in an ordered relationship between the geometric and fragment pipelines, and between the steps in each pipeline. Thus, in this project, the parallelism decomposition cannot be mapped onto the SIMD or MIMD architecture, but it can be mapped effectively and efficiently onto the pipelined parallelism discussed in Section 6.2.2. That is why pipelined parallelism is adopted in the system of this project.
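The data sizes quoted above can be made concrete with a short sketch. The resolution, the 11 fractional bits and the eight bits per colour component come from the text. The byte order used when packing a pixel, the rounding rule and all function names are assumptions made only for this illustration, and the total word width of a coordinate is not stated in the text and is therefore not modelled.

```python
FRACTION_BITS = 11          # fractional bits of a vertex coordinate (from the text)
WIDTH, HEIGHT = 800, 480    # frame buffer resolution (from the text)
BYTES_PER_PIXEL = 4         # 32-bit RGBA, eight bits per component


def to_fixed(value):
    """Convert a real number to fixed point with FRACTION_BITS fractional bits."""
    return int(round(value * (1 << FRACTION_BITS)))


def pack_rgba(red, green, blue, alpha):
    """Pack four 8-bit components into one 32-bit integer.

    The R-G-B-A ordering from the most significant byte is an assumption.
    """
    for component in (red, green, blue, alpha):
        if not 0 <= component <= 255:
            raise ValueError("components must fit in eight bits")
    return (red << 24) | (green << 16) | (blue << 8) | alpha


def lcd_stream(pixel):
    """Split a packed pixel into the three 8-bit values sent to the LCD.

    Alpha is dropped because the LCD accepts only red, green and blue.
    """
    return (pixel >> 24) & 0xFF, (pixel >> 16) & 0xFF, (pixel >> 8) & 0xFF


frame_buffer_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL

if __name__ == "__main__":
    print(frame_buffer_bytes)                   # size of one full frame in bytes
    print(to_fixed(1.0))                        # 1.0 with 11 fractional bits
    print(lcd_stream(pack_rgba(1, 2, 3, 4)))    # alpha removed for display
```

Each stage consumes and produces data of a different width, which is consistent with the ordered, stage-by-stage relationship that motivates the pipelined mapping.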

In document A novel parallel algorithm for surface editing and its FPGA implementation (Page 148-151)