We now discuss the novelty of the model relative to the existing literature, and the benefits the proposed approach offers.
Our approach is closely based on the method of filling individual cells of the scoring grid in parallel, using a different thread to fill each cell as its dependencies become satisfied. This in itself is not entirely novel; the works of Krusche and Tiskin [54], Kloetzli et al. [53], and to a lesser extent Boyer et al. [14] have similarities to our implementation, and the literature review in Sec. 2.4.3 covers these in more detail. However, as discussed at length throughout this thesis, none of these parallel models allow other problems to be parallelised quickly and easily. The above papers, for example, are all concerned solely with solving the longest common subsequence problem. The early work of Galil and Park [30], which discusses the feasibility of a more generic diagonal approach to solving dynamic programming algorithms, is also a clear influence on our model, as it considers dynamic programming from a problem-agnostic point of view. Furthermore, from an implementation point of view, few of the previous works we identified consider factors such as optimising CUDA parameters, or take advantage of modern CUDA features such as streams.
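The diagonal filling order described above can be illustrated with a minimal sequential sketch. It is not the implementation used in this thesis: the names `wavefront_fill` and `lcs_length` are hypothetical, the code runs on the CPU, and the inner loop merely stands in for the threads that would fill one anti-diagonal concurrently. The problem-specific rule is passed in as a function, which is one simple way of keeping the traversal problem agnostic.

```python
from typing import Callable, List

Grid = List[List[int]]


def wavefront_fill(rows: int, cols: int,
                   cell: Callable[[Grid, int, int], int]) -> Grid:
    """Fill a (rows+1) x (cols+1) grid one anti-diagonal at a time.

    Row 0 and column 0 are boundary cells fixed at 0. `cell` is the
    problem-specific rule (hypothetical interface, for illustration only).
    """
    grid = [[0] * (cols + 1) for _ in range(rows + 1)]
    for d in range(2, rows + cols + 1):      # d = i + j indexes the anti-diagonal
        i_lo = max(1, d - cols)
        i_hi = min(rows, d - 1)
        # Cells on the same anti-diagonal do not depend on each other,
        # so on a GPU each iteration of this loop could be a separate thread.
        for i in range(i_lo, i_hi + 1):
            j = d - i
            grid[i][j] = cell(grid, i, j)
    return grid


def lcs_length(a: str, b: str) -> int:
    """Longest common subsequence length via the wavefront traversal."""
    def cell(g: Grid, i: int, j: int) -> int:
        if a[i - 1] == b[j - 1]:
            return g[i - 1][j - 1] + 1       # depends on diagonal d - 2
        return max(g[i - 1][j], g[i][j - 1])  # depends on diagonal d - 1
    return wavefront_fill(len(a), len(b), cell)[len(a)][len(b)]
```

For example, `lcs_length("AGGTAB", "GXTXAYB")` evaluates to 4. Swapping in a different `cell` function changes the problem without touching the traversal, provided the new rule only reads cells from earlier anti-diagonals.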
Therefore, a clear and significant contribution of this work is the demonstration that it is possible to build a pseudo-generic framework that allows many different dynamic programming problems to be implemented in parallel. None of the existing literature we reviewed allowed for the practical implementation of different problems; each work either demonstrated an implementation for a single problem, or discussed the theoretical implications for multiple problems with no practical backing. We then go a step further and implement the model on the GPU architecture, demonstrating that our model is suitable for use in a massively parallel environment. Of the literature reviewed, considerably less is available regarding implementation on GPU architectures, with some notable examples from after 2000 [13, 75, 99]. Therefore, in the literature we reviewed, not only are there no examples of CPU-based models that solve multiple problems, there are also no examples of GPU-based models that do so, and this is the niche our research occupies.
Our model also employs novel memory management, which we believe makes a twofold contribution. Firstly, and most significantly, it allows problems larger than the available memory to be computed through the use of memory rotation; secondly, it demonstrates a highly efficient implementation that takes full advantage of multiple GPU streams and asynchronous operations. The base memory structure initially stemmed from the work of Kloetzli et al. [53], but has been substantially improved with additional implementation optimisations such as memory padding, and extended to enable memory rotation.
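The idea behind memory rotation can be sketched on the CPU under a simplifying assumption: that each cell depends only on the two preceding anti-diagonals, as is the case for the longest common subsequence problem. The function name `lcs_length_rotating` and the three-buffer layout are illustrative assumptions, not the scheme used in the model, which additionally involves GPU streams and asynchronous transfers that are not shown here.

```python
def lcs_length_rotating(a: str, b: str) -> int:
    """LCS length keeping only three anti-diagonals in memory.

    Each buffer is indexed by the row i of a cell; the column is j = d - i.
    """
    rows, cols = len(a), len(b)
    older = [0] * (rows + 1)   # anti-diagonal d - 2
    prev = [0] * (rows + 1)    # anti-diagonal d - 1
    cur = [0] * (rows + 1)     # anti-diagonal d, currently being filled
    for d in range(rows + cols + 1):
        for i in range(max(0, d - cols), min(rows, d) + 1):
            j = d - i
            if i == 0 or j == 0:
                cur[i] = 0                          # boundary cell
            elif a[i - 1] == b[j - 1]:
                cur[i] = older[i - 1] + 1           # cell (i-1, j-1)
            else:
                cur[i] = max(prev[i - 1], prev[i])  # cells (i-1, j) and (i, j-1)
        # Rotate: the oldest buffer is recycled as the next diagonal to fill.
        older, prev, cur = prev, cur, older
    # After the final rotation the last diagonal filled is held in `prev`.
    return prev[rows]
```

Here the working set is three buffers of length `rows + 1` rather than the full `(rows + 1) x (cols + 1)` grid, which illustrates why recycling storage in this way lets the grid exceed the memory that is available at any one time.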
In terms of the finer details of the implementation of our model, Wu et al. [95] also provided inspiration for our work by demonstrating how the size of the wavefront can be adjusted based on the amount of work that needs to be done, in an effort to improve parallel efficiency. This proved to be a critical factor for our implementation, considering the large impact divergence can have on GPU code. We also drew on the work of Berger & Galea [10], who discuss how thread grouping and altering GPU parameters based on the problem at hand can improve the efficiency and performance of the algorithm. Whilst our model does not adopt the concept of thread grouping, we place a strong emphasis on allowing the user to change parameters based on the problem being computed, and include an element of pre-processing in which we calculate optimal values for block sizes and kernel sizes.
Finally, our contribution of allowing the user to interact with the model through a file format, in an API-style mechanism, appears to be unique in the literature, although this is arguably more an implementation and engineering contribution than a purely scientific one. However, it is through this mechanism that we are able to claim our model is generic; without the ability to define different problems and their associated parameters, the rest of the model would not be valid.
The complexity of our proposed model cannot be stated at this point for comparison with the existing literature, as it depends on the problem the user inputs, as well as the amount of memory they opt to maintain. The complexity for individual problems is therefore considered in the following implementation section.
Summary
This chapter presented a detailed description of the parallel model that this thesis proposes, together with an insight into the design process and the reasoning behind our design choices. The novelty and contributions of our model were also presented in comparison to the literature currently available. Performance results were shown to justify some of our design choices, demonstrating how different components of our model, such as memory transfer and block structure, perform in isolation. The next chapter describes the implementation of the introduced test problems on the GPU, through the use of our parallel model.
Application of the Model
Overview
In this chapter we discuss how the introduced problems are solved using the proposed model. We consider different dynamic programming definitions, which require different dependencies to be maintained for different problems, and how these are represented in the model. We also consider implementation optimisations that can be made for the individual problems. Later in the chapter we discuss the adaptations required to solve more complex dynamic programming problems. Finally, we seek to define the broad classes of problems that are unsuitable for the model.
Fig. 4.1 Dependencies of the wavefront when solving the longest common subsequence problem