This chapter has presented the performance results of two applications: an ocean model and a parallel sorting application. Each of these applications has different characteristics, which is partly why they were chosen to test Tupleware against the aims of performance and scalability stated in Section 4.1.
The findings of this performance evaluation were pleasing for the sorting application, which displayed a high level of speedup on up to the maximum of sixteen nodes and distributed the processing workload evenly amongst all participating nodes in the system. The performance of this application also clearly illustrated the effect of varying the granularity of each processing task, with the larger threshold size exhibiting a higher degree of speedup than the smaller threshold sizes. This was due to the time each process spent on network communications remaining relatively constant, while the processing performed per process decreased as more nodes were added to the system. This result is typical of an application of this kind, and we can conclude that in this case the Tupleware system has met its aim of providing a scalable platform upon which to develop this style of medium-grained application.
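The relationship between granularity and speedup described above can be illustrated with a toy model. The following is only a sketch under stated assumptions: the function name, the assumption that computation divides evenly across nodes, the assumption that a single node incurs no network cost, and all of the numbers are hypothetical and are not measurements from the evaluation.

```python
def modelled_speedup(work_seconds, comm_seconds, nodes):
    """Speedup under a toy model (an assumption of this sketch, not measured data).

    Computation is assumed to divide evenly across nodes, while the time each
    process spends on network communication stays constant.
    """
    if nodes == 1:
        return 1.0  # assume a single node performs no network communication
    parallel_time = work_seconds / nodes + comm_seconds
    return work_seconds / parallel_time


# Hypothetical figures: same total work, different fixed communication cost
# per process, standing in for coarser and finer task granularity.
coarse = [modelled_speedup(160.0, 2.0, n) for n in (1, 2, 4, 8, 16)]
fine = [modelled_speedup(160.0, 10.0, n) for n in (1, 2, 4, 8, 16)]

for n, c, f in zip((1, 2, 4, 8, 16), coarse, fine):
    print(f"{n:2d} nodes: coarse {c:5.2f}x, fine {f:5.2f}x")
```

In this model the fixed communication term becomes a larger share of each process's time as nodes are added, so the configuration with the higher communication cost falls further below linear speedup.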
For the ocean model, the results show that the overall speedup gain was limited, and that as the number of nodes increased, the time each node spent performing network communication limited the continued scalability of the application. However, we also found that an increase in problem size, in this case the size of the grid, did not place a disproportionate load on any process, and so there remains scope for the grid size to be increased further on a cluster whose nodes have more than 1GB of main memory. Memory usage was the limiting factor with regard to the grid size on the cluster used for the testing detailed in this chapter, as the largest grid used (2400x2400) began to access virtual memory during execution, which dramatically slows performance.
Overall, we have shown that Tupleware can provide performance gains for distributed parallel applications, and that it can scale in terms of both the number of nodes and the problem size. While the performance of the tightly-coupled ocean model is not optimal, this is a common problem with tuple space-based systems, and the performance of Tupleware in this instance is comparatively good.
Chapter 7
Conclusions & Further Work
This chapter summarises the research presented in this thesis, and discusses its success in achieving its stated aims. Finally, some directions for possible future work are outlined.
7.1 Conclusions
This thesis has presented Tupleware, a distributed tuple space aimed at array-based parallel applications. We have detailed the design and implementation of the system with the two broad aims of producing a viable platform for the execution of distributed applications, and of doing this in such a way that the level of complexity remains low from the perspective of the application programmer.
The approach to evaluating these aims was to implement two applications on the system: an ocean model and a parallel sorting application. The contrasting characteristics of these applications allowed Tupleware’s behaviour to be thoroughly analysed.
7.1.1 Scalability & Performance
The viability of the system relates to its ability to exhibit scalability in terms of the size of the system and in terms of the application problem size, and its ability to provide the application with increased performance. These aims were stated in Section 4.1, and evaluated in Chapter 6. The outcome of the performance testing conducted is summarised below.
For medium- and coarse-grained applications, embodied by the sorting application, Tupleware displayed a high level of scalability and performance gains. We found that the granularity of the application had a significant effect on the level of speedup achieved, with the finer-grained instance of the application achieving less speedup than the two more coarse-grained instances. However, achieving consistent speedup on up to the maximum of sixteen nodes demonstrates the viability of the system for this type of application.
For the tightly-coupled ocean modelling application, the system delivered a performance gain by distributing the application over nodes in the cluster; however, the level of speedup achieved was lower than for the sorting application. There were two main reasons for this: the frequency of communication was much higher for the ocean model, and the time spent performing this communication was greatly inflated by the time-stepped nature of the application’s execution. The fact that each node spent significant time waiting for required values to become available from nodes processing neighbouring panels greatly diminished the performance gains experienced.
Overall, the performance results were pleasing. Tupleware provided obvious and significant performance gains and scalability to the sorting application. The class of applications that includes the ocean model has much higher communication requirements, making it difficult to achieve the same level of speedup in a distributed computing environment, due to the high cost of latency. Given these factors, we believe that achieving even a modest level of speedup is a pleasing result.
7.1.2 Ease of programmability
As discussed in Chapter 4, the application programming interface was designed to preserve the simplicity and semantics of the operations found in Linda. Tupleware contains equivalent operations to Linda with the exceptions of eval(), as Tupleware does not support spawning of new processes at runtime, and also of the additional bulk retrieval operations rdAll() and inAll(). The semantics of Tupleware’s operations are very similar to those found in Linda and JavaSpaces.
The other factor in the ease of programmability of the system is the way additional tuple spaces are integrated into the system. Tupleware is implemented in such a way that the distribution of tuple space is completely transparent to the programmer, who does not need to worry about the issues of data distribution or locality, or the existence and location of remote instances of tuple space. It is left to the underlying runtime system to handle these issues, which it does by implementing an efficient algorithm for the retrieval of tuples from remote nodes. This algorithm uses the success or failure of previous requests in order to calculate the probability of a given instance of the tuple space being able to successfully fulfil future requests, and allows us to minimise the amount of network communication carried out by the system.
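The general idea of ranking remote tuple space instances by their request history can be sketched as follows. This is an illustrative sketch only and is not Tupleware's actual implementation: the class and method names are hypothetical, and the smoothed success ratio used as the probability estimate is an assumption of this example.

```python
from dataclasses import dataclass


@dataclass
class RequestHistory:
    """Counts of past requests sent to one remote tuple space instance."""
    successes: int = 0
    attempts: int = 0

    def success_estimate(self):
        # Smoothed success ratio (an assumption of this sketch), so that an
        # instance with no history scores 0.5 rather than dividing by zero.
        return (self.successes + 1) / (self.attempts + 2)


class NodeRanker:
    """Hypothetical helper that orders remote instances by estimated success."""

    def __init__(self, node_ids):
        self.history = {node_id: RequestHistory() for node_id in node_ids}

    def record(self, node_id, found):
        """Record whether a request to node_id returned a matching tuple."""
        entry = self.history[node_id]
        entry.attempts += 1
        if found:
            entry.successes += 1

    def search_order(self):
        """Return node ids, most promising first (ties keep insertion order)."""
        return sorted(
            self.history,
            key=lambda node_id: self.history[node_id].success_estimate(),
            reverse=True,
        )


ranker = NodeRanker(["node-a", "node-b", "node-c"])
ranker.record("node-b", True)
ranker.record("node-b", True)
ranker.record("node-a", False)
order = ranker.search_order()
print(order)
```

Querying the instances in this order means that a node which has satisfied earlier requests is tried first, so fewer requests need to be sent to nodes that are unlikely to hold a matching tuple.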
7.1.3 Contribution of the research
The contribution of this research is that the system produced successfully manages to transparently integrate multiple tuple spaces into a distributed tuple space without any added complexity for the programmer. Despite the distribution of the space, it is able to provide a scalable platform for distributed parallel array-based applications with a broad range of characteristics, and provide a significant level of performance gain.
The search algorithm is a large part of this contribution, as it provides a mechanism for the tuple space to adapt dynamically to the communication characteristics of the application at runtime, allowing naturally forming groups of nodes to optimise their communications. With the exception of SwarmLinda (discussed in Chapter 3), this is not an approach that has been widely researched in the field.
A further contribution of this research is that a complete, concrete implementation of the proposed techniques has been built and shown to be effective. This has allowed us to test the system in a real-world environment, which in turn helps to identify further enhancements which may be worth pursuing in the future.