ARRIVE-Framework - Adaptive Resource Relocation in Virtualized Heterogeneous Clusters

VMM. The total migration time is shorter for non-live migration than for live migration, but non-live migration incurs a penalty in the form of increased wall-clock time for HPC applications. Live migration incurs a smaller wall-clock penalty, but that penalty is directly related to the memory dirtying and communication rates of the migrating VM. This is due to the CPU-intensiveness of Xen’s inter-VM communication and migration routine, which results in CPU starvation for the VM. We have demonstrated a simple optimization for the default live migration feature of the Xen VMM. By reducing the number of iterations in the pre-copy phase to a bare minimum, we can reduce the migration time stretch of HPC applications by 50%. In cases where the application is highly store and communication intensive, the optimized live migration routine can perform over 200% better than the default live migration.
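The effect of capping the pre-copy iterations can be illustrated with a toy model. The sketch below is not Xen's migration routine and does not reproduce the measurements reported here; the function name, the parameters, and the stop threshold are all hypothetical, and the model ignores the CPU starvation effect described above. It only shows how the number of pre-copy rounds trades total migration time against the final stop-and-copy pause.

```python
def precopy_migration(mem_pages, bandwidth, dirty_rate, max_rounds, stop_threshold=50):
    """Toy model of pre-copy live migration (illustrative only, not Xen's code).

    mem_pages:      VM memory size in pages
    bandwidth:      pages transferred per second
    dirty_rate:     pages dirtied per second while the VM keeps running
    max_rounds:     cap on the number of iterative pre-copy rounds
    stop_threshold: hypothetical page count below which pre-copy stops early
    Returns (total_seconds, downtime_seconds, rounds_used).
    """
    to_send = mem_pages
    total = 0.0
    rounds = 0
    while rounds < max_rounds and to_send > stop_threshold:
        round_time = to_send / bandwidth
        total += round_time
        rounds += 1
        # Pages dirtied during this round must be resent in the next one;
        # the dirty set can never exceed the VM's memory size.
        to_send = min(mem_pages, dirty_rate * round_time)
    # Final stop-and-copy: the VM is paused while the remaining pages are sent.
    downtime = to_send / bandwidth
    total += downtime
    return total, downtime, rounds


if __name__ == "__main__":
    for cap in (1, 30):
        print(cap, precopy_migration(1000, 100, 50, max_rounds=cap))
```

With these made-up numbers, a single pre-copy round finishes sooner overall but pauses the VM for longer, while the uncapped run spends more total time resending dirtied pages. If the dirty rate approaches the bandwidth, the model never converges and simply runs until the round cap is reached.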

8.2 ARRIVE-Framework

We believe that this research contributes towards an understanding of how system-level measurements can be used to characterize applications and estimate their execution times in heterogeneous compute farms.

We have carried out an extensive literature review of the existing application performance estimation models. We have found that these models are based on linear CPU frequency models, require off-line profiling or source code modification, or combine these approaches.

We have shown that performance models based on CPU frequency alone are less accurate in HC environments, particularly if the HC has CPUs with comparable instruction sets but different cycle penalties for floating point operations and/or L1/L2 cache misses. Leveraging the micro-architectural characteristics exhibited by different applications, as captured in hardware performance counter data, can greatly improve the prediction accuracy. Our experiments show that the prediction accuracy of the application runtime estimate can be improved by a factor of 1.65 in certain cases. For a subset of the NAS Parallel Benchmarks (NPB), the average prediction accuracy of our method is 84.6%, compared to an average prediction accuracy of 66.5% for the linear CPU frequency model.
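To make the contrast concrete, the following sketch compares a linear CPU frequency estimate with a simple counter-informed estimate. It is an assumption-laden illustration, not the prediction model developed in this thesis: the `Cpu` and `CounterProfile` types, the additive cycle model, and all penalty values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cpu:
    freq_hz: float
    base_cpi: float         # cycles per instruction without stalls (hypothetical)
    fp_penalty: float       # extra cycles per floating point operation (hypothetical)
    l2_miss_penalty: float  # extra cycles per L2 cache miss (hypothetical)


@dataclass
class CounterProfile:
    instructions: float
    fp_ops: float
    l2_misses: float


def linear_frequency_estimate(t_source, source, target):
    """Scale a measured runtime by the ratio of clock frequencies only."""
    return t_source * source.freq_hz / target.freq_hz


def counter_estimate(profile, cpu):
    """Estimate runtime from counter data and per-CPU cycle penalties."""
    cycles = (profile.instructions * cpu.base_cpi
              + profile.fp_ops * cpu.fp_penalty
              + profile.l2_misses * cpu.l2_miss_penalty)
    return cycles / cpu.freq_hz


if __name__ == "__main__":
    # Two CPUs with the same clock but different (made-up) cycle penalties.
    cpu_a = Cpu(2e9, 1.0, 2.0, 100.0)
    cpu_b = Cpu(2e9, 1.0, 4.0, 200.0)
    profile = CounterProfile(instructions=1e9, fp_ops=1e8, l2_misses=1e6)

    t_a = counter_estimate(profile, cpu_a)
    print("linear estimate on B: ", linear_frequency_estimate(t_a, cpu_a, cpu_b))
    print("counter estimate on B:", counter_estimate(profile, cpu_b))
```

Because both CPUs run at the same frequency, the linear model predicts identical runtimes on A and B, whereas the counter-informed estimate differs as soon as the floating point or cache miss penalties differ.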

We have also shown that source code changes and off-line profiling are not necessary if highly accurate performance prediction is not required. This is especially true for dynamic process placement heuristics, where accuracy sufficient to reach an effective migration decision is adequate. In essence, fine-grained off-line profiling is not necessary for dynamic placement heuristics.

The problem of optimal placement of jobs in a compute farm is NP-complete. In this thesis, we have presented a new approach to tackle this issue. Instead of a global optimization heuristic, we have employed a local optimization heuristic. The framework carries out a local optimization between two sub-clusters, and relocates a job only if the migration saves time. We do not consider the other sub-clusters in the compute farm for the decision.
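The pairwise decision rule described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the idea of passing pre-computed remaining-time estimates and a single migration cost are hypothetical simplifications, not the framework's actual interface.

```python
def time_saved(remaining_current, remaining_candidate, migration_cost):
    """Seconds saved by relocating a job to the candidate sub-cluster.

    remaining_current:   estimated remaining runtime if the job stays put
    remaining_candidate: estimated remaining runtime on the candidate sub-cluster
    migration_cost:      estimated overhead of migrating the job's VMs
    """
    return remaining_current - (remaining_candidate + migration_cost)


def should_relocate(remaining_current, remaining_candidate, migration_cost):
    """Relocate only if the move is expected to save time overall."""
    return time_saved(remaining_current, remaining_candidate, migration_cost) > 0


if __name__ == "__main__":
    # Long-running job: the faster sub-cluster easily repays the migration cost.
    print(should_relocate(3600, 2400, 60))
    # Nearly finished job: the migration cost exceeds the expected gain.
    print(should_relocate(300, 280, 60))
```

Only the current and the candidate sub-cluster enter the comparison, which keeps each decision cheap at the price of possibly missing a better placement elsewhere in the farm.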

Unlike previous methods based on process migration, the methodology adopted in this research is based on VM migration. Each VM in our framework hosts a maximum of one parallel process. Unlike process migration, VM migration has already found its place in mainstream operating systems, even though process migration has been researched for years. We have shown that the VM migration methodology is a robust and very practical solution.

To determine the average throughput improvement, we have tried to simulate real-world workloads. The job queues were generated from Dror Feitelson’s workload archives. We used the NAS Parallel Benchmarks (NPB) to represent the applications conforming to these workloads. The duration of each experiment was between 3 and 4 hours. Our experimental setup consisted of 10 machines spread across four distinct hardware configurations and micro-architectures. The CPU frequency of the machines lay between 2 and 3 GHz. To introduce heterogeneity, two machines were fitted with 100 Mbps Ethernet cards and the rest had gigabit Ethernet cards. This enabled us to build a 32-VM heterogeneous compute farm with each VM having exactly one CPU.

Our experiments show that we are able to improve the throughput of the heterogeneous compute farm by 25%, and that the time saved by the HC with our framework is over 30% compared with the scheduling solution based on the EASY backfill algorithm. The overheads of our framework (wall-clock time stretch in applications) are less than 3%. We have seen that a single effective migration decision outweighs the overhead cost for the entire cluster for a given time frame.

To make migration decisions that significantly improve the throughput, the framework requires applications with a long execution time relative to the migration time. We make use of user time estimates, which are a hard requirement of the backfill algorithm, to predict whether an application will run for a sufficient time in the future. However, it is difficult to provide an accurate time estimate for an application, especially in an HC. The user time estimates can be off from the actual wall-clock time by a factor of two to three due to variations in CPU and communication infrastructure. Similarly, users are known (intentionally or otherwise) to provide incorrect time estimates.
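The role of the user time estimate in this check can be illustrated with a short sketch. The function names and the `min_ratio` threshold are hypothetical and chosen only for illustration; the text does not state what ratio of remaining runtime to migration time the framework requires.

```python
def expected_remaining(user_estimate, elapsed):
    """Remaining runtime implied by the user's estimate, never negative."""
    return max(0.0, user_estimate - elapsed)


def long_enough_to_migrate(user_estimate, elapsed, migration_time, min_ratio=10.0):
    """Check that the job is expected to run long enough to amortize a migration.

    min_ratio is a hypothetical tuning knob: the job must be expected to run
    for at least min_ratio times the migration time.
    """
    return expected_remaining(user_estimate, elapsed) >= min_ratio * migration_time


if __name__ == "__main__":
    # A job that really needs 900 s, checked after 600 s, with a 60 s migration.
    print(long_enough_to_migrate(900, 600, 60))    # accurate user estimate
    print(long_enough_to_migrate(2700, 600, 60))   # estimate inflated threefold
```

With an accurate estimate the job is judged too short to be worth migrating, while a threefold overestimate makes the same job look like a good candidate. This is the kind of sensitivity to user estimates that the paragraph above describes.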

We have tried to partially remove this dependence by using the user time estimate as a guide only. The profile estimates are calculated over a fixed period
