Chapter 5 Testing adaptive SLA management on the White Rose Grid
7.4 Future Work
The following ideas represent future work to refine or extend the research presented in this thesis.
• The SLA Management Architecture could be extended to work with an ensemble of applications where many instances need to be run and covered by a single SLA. For each application instance within the ensemble an instance of the SLA Engine could be used to monitor and adapt to performance degradations – functionality which is currently supported. However, a new component would be needed, a master SLA Engine, responsible for coordinating all application instances within the ensemble. This component would take on a metascheduling role with the power to suspend and restart instances to achieve the optimum performance for the SLA rather than a specific application instance. Metascheduling decisions could be implemented in a rule base controlled by the adaptive controller. For example, an instance which is executing with a predicted finish time well before the deadline may be suspended and replaced with an instance which is behind schedule or one which may not have started yet. This may occur if the number of application instances is large and the Resource Broker is unable to reserve enough resources for the entire job set to start simultaneously. This approach is counter-intuitive at the instance level but may optimise the performance of the SLA.
In addition to these changes, the SLA specification proposed in Section 3.8 would need to be altered to include elements which could describe the number of instances within the ensemble. The SLA would need finer grained warning, migration and violation elements in order to describe which instances the warnings, migrations and violations applied to.
The SLA Management Architecture would still rely on an initial prediction for the length of time each application instance would take to execute. The deadline offered within the SLA would be a function of these individual times and other factors such as availability of resources before the start and average queuing times for instances not scheduled at the start.
The monitoring techniques demonstrated in Chapters 4 and 5 could be applied to monitor each application instance. However, the master SLA Engine would need information regarding SGE (or other LBQS) queue status in order to make metascheduling decisions. Most LBQSs provide this information, but not across domains within a Grid infrastructure. Therefore, the master SLA Engine would have to rely on a higher level monitoring service, such as MDS or a Resource Broker, which would have access to information from across the domains.
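The suspend-and-replace rule described in this item can be illustrated with a minimal sketch. The `Instance` record, the `choose_swap` function and the `margin` parameter are hypothetical names introduced here for illustration; they are not part of the SLA Management Architecture. The sketch also assumes that a predicted finish time is available for instances which have not yet started.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Instance:
    """Hypothetical view of one application instance in the ensemble."""
    name: str
    state: str               # "running", "waiting" or "suspended"
    predicted_finish: float  # predicted finish time, in seconds from now


def choose_swap(instances: List[Instance], deadline: float,
                margin: float) -> Optional[Tuple[str, str]]:
    """Return (instance to suspend, instance to start), or None.

    A running instance may be suspended when it is predicted to finish
    at least `margin` seconds before the SLA deadline. The instance
    started in its place is the non-running one with the latest
    predicted finish time.
    """
    ahead = [i for i in instances
             if i.state == "running"
             and deadline - i.predicted_finish >= margin]
    behind = [i for i in instances if i.state != "running"]
    if not ahead or not behind:
        return None
    donor = min(ahead, key=lambda i: i.predicted_finish)      # most slack
    receiver = max(behind, key=lambda i: i.predicted_finish)  # least slack
    # Swap only if the receiver is more at risk than the donor.
    if receiver.predicted_finish <= donor.predicted_finish:
        return None
    return donor.name, receiver.name
```

A rule base held by the adaptive controller could evaluate such a rule each time the monitoring data is refreshed. In practice the decision would also need to account for the cost of suspending and restarting an instance.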
• Currently, it is assumed that the resource onto which the migration occurs will offer the application greater performance. Use of the DAME XTO application during scenario 3 in Chapter 5 demonstrated the importance of assessing resource performance before migration. One method which would achieve this is to assess the performance gain offered by the new resource prior to migration. Comparing the predicted remaining execution time on the current resource with the predicted remaining execution time on the new resource plus the overhead due to migration would provide a measure of performance gain. The overheads due to migration would include potential variation in checkpoint transfer time and NBQS queuing time on the new resource. A simple method of estimating the rescheduling gain is shown in Equation 5; it is used by Vadhiyar and Dongarra in [107] to decide whether application adaptation should proceed.
\[
\text{gain} = \frac{T_{\text{current remaining}} - \left(T_{\text{migration overhead}} + T_{\text{new remaining}}\right)}{T_{\text{current remaining}}}
\]
Equation 5 Calculating the potential gain of migration
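As a worked illustration, the gain calculation can be sketched in a few lines. The sketch assumes that Equation 5 expresses the gain as the time saved by migrating, relative to the remaining execution time on the current resource. The function names and the `threshold` parameter are hypothetical and are not taken from [107].

```python
def migration_gain(t_current_remaining: float,
                   t_migration_overhead: float,
                   t_new_remaining: float) -> float:
    """Relative gain of migrating; all times in the same unit.

    A positive value means migration is predicted to finish sooner,
    a negative value means it is predicted to finish later.
    """
    if t_current_remaining <= 0:
        raise ValueError("remaining time on the current resource "
                         "must be positive")
    time_after_migration = t_migration_overhead + t_new_remaining
    return (t_current_remaining - time_after_migration) / t_current_remaining


def should_migrate(t_current_remaining: float,
                   t_migration_overhead: float,
                   t_new_remaining: float,
                   threshold: float = 0.0) -> bool:
    """Migrate only if the predicted gain exceeds `threshold`.

    `threshold` is an assumed tuning parameter; a value above zero
    guards against migrating on the strength of a marginal prediction.
    """
    gain = migration_gain(t_current_remaining, t_migration_overhead,
                          t_new_remaining)
    return gain > threshold
```

For example, with 1000 s remaining on the current resource, a 100 s migration overhead and 600 s remaining on the new resource, the predicted gain is (1000 − 700) / 1000 = 0.3.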
• An extension to the SLA Management Architecture may place greater emphasis on predicting overheads due to file transfer or queuing. A solution may employ the learning based initial prediction technique described in Section 5.2.3 to predict delays using historical observations. Sampling the average bandwidth between the current resource and the new resource, and the average job queuing time from the NBQS on the new resource, would provide metrics which could indicate possible overheads.
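One simple way to turn such samples into an overhead estimate is sketched below. It uses a plain average of the historical observations, which is an assumption made for illustration and is not the learning based technique of Section 5.2.3. The function name and its parameters are hypothetical.

```python
from statistics import mean
from typing import Sequence


def estimate_migration_overhead(checkpoint_bytes: float,
                                bandwidth_samples: Sequence[float],
                                queue_time_samples: Sequence[float]) -> float:
    """Estimate the migration overhead in seconds.

    bandwidth_samples: observed transfer rates between the current and
        the new resource, in bytes per second.
    queue_time_samples: observed queuing times on the new resource,
        in seconds.
    """
    if not bandwidth_samples or not queue_time_samples:
        raise ValueError("at least one sample of each metric is required")
    avg_bandwidth = mean(bandwidth_samples)
    if avg_bandwidth <= 0:
        raise ValueError("average bandwidth must be positive")
    transfer_time = checkpoint_bytes / avg_bandwidth
    return transfer_time + mean(queue_time_samples)
```

The result could be used as the migration overhead term when estimating the gain of a migration. A weighted or windowed average may be preferable where bandwidth and queue lengths change quickly.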
• An extension to the adaptive controller would enable fuzzy rather than rule based control. Fuzzy control is useful when the control domain is continuous. In the experiments in Chapters 4 and 5 the control domain is not continuous, because the ability of the SLA Manager to affect the performance of the Grid application is determined by the resource on which the Grid application is executing.
If Grid resources were to support a priority based execution policy, the SLA Manager could manipulate the priority given to the Grid application, enabling it to receive a greater amount of CPU time compared to the competing applications. Nice values are a way of controlling process priority on POSIX operating systems. An experiment might manipulate this value in order to affect the amount of CPU processing received by the Grid application. If monitoring suggests the application will meet the timing guarantee, the application could be given a lower priority, which would decrease the amount of CPU time received. Equally, if performance is low and the predicted remaining execution time is higher than the schedule, the application could be given a higher priority which would increase the amount of CPU time received by the application.
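The priority adjustment described above might be sketched as follows. The function names, the fixed `step` size and the clamping range of -20 to 19 (the range typically used on Linux) are assumptions made for illustration. `os.setpriority` is available in Python on Unix systems only, and lowering a nice value (raising priority) normally requires elevated privileges.

```python
import os


def target_nice(predicted_remaining: float, scheduled_remaining: float,
                current_nice: int, step: int = 5) -> int:
    """Return a new nice value; a lower nice value means higher priority."""
    if predicted_remaining > scheduled_remaining:
        new_nice = current_nice - step   # behind schedule: raise priority
    elif predicted_remaining < scheduled_remaining:
        new_nice = current_nice + step   # ahead of schedule: lower priority
    else:
        new_nice = current_nice          # on schedule: leave unchanged
    # Clamp to the assumed valid range of nice values.
    return max(-20, min(19, new_nice))


def apply_nice(pid: int, nice_value: int) -> None:
    """Set the nice value of the process `pid` (Unix only)."""
    os.setpriority(os.PRIO_PROCESS, pid, nice_value)
```

Because the nice value can be varied in small steps, this control variable is closer to a continuous domain than the choice of resource, which is the setting in which fuzzy control was suggested to be useful.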
• Extending the technique to a broader range of applications would enhance its capability. The current solution has been tested using a CPU intensive application and rule based adaptation. It should also be tested with other application types (for example, interactive or MPI applications) to determine whether it is suitable in other fields.
• The architecture may benefit from a resource acquisition protocol which implements an economic model to control resource usage. An implementation based on GESA [71] could be applied to allow resource usage to be linked with cost.