
compared to EP.

5.5 Summary

We have presented a detailed study of the migration facility of the Xen VMM, specifically targeting HPC applications. The effects of live and non-live migration on HPC application wall clock times were analyzed, and the relationship of the migration routine with memory modification, communication intensity and CPU contention between the guest VMs and the host VMM was presented. We show that migration is a CPU-intensive operation: the Dom-0 CPUs of both the source and the destination go to 100% utilization while a VM is migrating. When a CPU is spare for Dom-0, the wall clock time stretch introduced by migration is smaller than on a fully loaded VMM. The high utilization of the Dom-0 CPU suggests that migrating all VMs from a single VMM in parallel is not a good option. The cost of migration (the wall clock time stretch) depends on the memory intensity of the application rather than its memory footprint, whereas the total migration time is proportional to the memory size of the VM. For example, a VM with 4 GB of RAM takes longer to migrate, but if its application performs few memory-intensive operations, the wall clock time stretch introduced will be very low compared to that of a 512 MB VM running a memory-intensive application. The wall clock time stretch that migration adds to an application's runtime is therefore hard to predict accurately unless the application's memory operations are profiled. Since migration is also CPU intensive, both the total migration time and the wall clock time stretch depend on the underlying hardware of the VMM. For the given hardware and the set of NAS benchmarks, however, we observed that the wall clock time stretch (the cost of migration) lies between 8 and 15 seconds per VM migration.

We also show that reducing the number of iterations in the pre-copy phase helps keep migration costs low for HPC applications. Our optimization reduces the total number of memory pages transferred during migration by up to 500%, and results show an average improvement of 50% over the default Xen migration routine on traditional gigabit Ethernet infrastructure. All subsequent experiments in the thesis use this optimization unless otherwise specified.
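The effect of the pre-copy iteration limit on the number of pages transferred can be illustrated with a toy model. The sketch below is not Xen's migration routine: the function name, the writable-working-set cap, the stop threshold and all numeric values are assumptions chosen only to show why a memory-intensive guest, which dirties pages faster than the link can send them, gains little from additional pre-copy rounds.

```python
def simulate_precopy(total_pages, wws_pages, dirty_rate, bandwidth,
                     max_rounds, stop_threshold=50):
    """Toy pre-copy model (hypothetical, not Xen's implementation).

    total_pages    -- pages in the VM's memory image
    wws_pages      -- assumed cap on distinct pages the guest keeps rewriting
    dirty_rate     -- pages dirtied per second by the guest
    bandwidth      -- pages the link can transfer per second
    max_rounds     -- limit on pre-copy iterations
    stop_threshold -- stop iterating once this few pages remain dirty

    Returns (pages_sent, rounds_run, stop_and_copy_pages).
    """
    to_send = total_pages      # the first round sends the whole image
    pages_sent = 0
    rounds = 0
    while rounds < max_rounds and to_send > stop_threshold:
        pages_sent += to_send
        round_time = to_send / bandwidth
        # Pages dirtied while this round was in flight, capped by the
        # writable working set; these must be re-sent in the next round.
        to_send = min(wws_pages, int(dirty_rate * round_time))
        rounds += 1
    # Stop-and-copy: the VM is paused and the remaining dirty pages are sent.
    pages_sent += to_send
    return pages_sent, rounds, to_send


if __name__ == "__main__":
    # Assumed values: 512 MB image of 4 KiB pages, roughly gigabit-class link.
    image, wws, link = 131072, 60000, 30000
    for limit in (30, 2):
        sent, rounds, final = simulate_precopy(image, wws, 40000, link, limit)
        print(f"max_rounds={limit}: sent={sent} rounds={rounds} final={final}")
```

In this model the guest dirties its whole working set in every round, so each extra round resends the same pages without shrinking the final stop-and-copy transfer. With a low dirty rate the loop converges after a few rounds and the limit is never reached.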

Part IV

Design, Implementation and Evaluation

Chapter 6

Design and Implementation of a Resource Relocation Framework

In this chapter we present the design and implementation details of our resource remapping framework, called ARRIVE-F (Adaptive Resource Relocation In Virtualized Environments-Framework). The framework exploits the heterogeneity in a compute farm to improve its throughput.

The framework carries out lightweight online profiling of the CPU, communication and memory subsystems of all active jobs in the compute farm. From this, it constructs a performance model to predict the execution time of each job on every distinct sub-cluster in the compute farm. Based on the predicted execution times, the framework relocates compute jobs to the best-suited hardware platforms so that the overall throughput of the compute farm increases. We use the live migration feature of virtual machine monitors to migrate a job from one sub-cluster to another.

The implementation details of the model, the performance prediction accuracy and the migration decision model are also discussed.
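The relocation step outlined above can be sketched as a simple selection over predicted execution times. This is an illustrative sketch only, not the ARRIVE-F decision model: the function name `plan_relocations`, the data layout, and the flat per-VM migration penalty are assumptions, and sub-cluster capacity and queueing are ignored. The default penalty of 15 seconds is the upper end of the per-VM migration cost reported in Section 5.5.

```python
def plan_relocations(predicted_s, current, vms_per_job, cost_per_vm_s=15.0):
    """Pick, for each job, a sub-cluster that beats staying put.

    predicted_s   -- {job: {sub_cluster: predicted execution time in seconds}}
    current       -- {job: sub_cluster the job is running on now}
    vms_per_job   -- {job: number of VMs that would have to be migrated}
    cost_per_vm_s -- assumed wall clock penalty per migrated VM (seconds)

    Returns {job: target_sub_cluster} for the jobs worth moving.
    Hypothetical sketch: capacity limits are not modeled.
    """
    plan = {}
    for job, times in predicted_s.items():
        here = current[job]
        penalty = cost_per_vm_s * vms_per_job[job]
        # Cost of running elsewhere includes the migration penalty.
        elsewhere = {sc: t + penalty for sc, t in times.items() if sc != here}
        if not elsewhere:
            continue
        target = min(elsewhere, key=elsewhere.get)
        # Move only if the job is strictly better off after paying the penalty.
        if elsewhere[target] < times[here]:
            plan[job] = target
    return plan


if __name__ == "__main__":
    # Hypothetical predictions for two jobs on sub-clusters "A" and "B".
    predicted = {"cg": {"A": 600.0, "B": 400.0},
                 "ep": {"A": 300.0, "B": 290.0}}
    current = {"cg": "A", "ep": "A"}
    vms = {"cg": 4, "ep": 4}
    print(plan_relocations(predicted, current, vms))
```

In this example only the first job is moved: its predicted saving of 200 seconds outweighs the assumed 60-second migration penalty, while the second job's 10-second saving does not.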

6.1 Motivation

As discussed in Part II, parallel applications are required to distribute computations unevenly to account for the varied speed and architecture of processors in a heterogeneous compute farm [49, 85, 71]. Load balancing such computations requires considerable effort from the programmer [85]. This also makes job scheduling decisions a difficult and tedious task.

Figure 6.1: Pictorial view of a heterogeneous compute farm. (a) Heterogeneous compute cluster. (b) Heterogeneous compute farm with homogeneous sub-clusters.

Performance modeling techniques are often used to tackle the issue of heterogeneity [71, 85]: the application characteristics are encapsulated in a set of formulas to form a performance model. The performance models are then mapped to different architectures to determine the performance [30]. Fine-grained performance modeling is capable of reasonably accurate prediction, but the associated cost of profiling can be very high in terms of the wall clock time of the job [71, 85].

We deal with this issue of heterogeneity by breaking the heterogeneous compute cluster into a number of homogeneous sub-clusters, as shown in Figure 6.1. The runtime characteristics of the applications are determined using a combination of hardware performance monitoring units (PMUs) and the MPI profiling interface (PMPI). This enables us to predict the performance of running MPI applications on all the other sub-clusters present in our heterogeneous compute farm. All this is done without changing the application source or binary (provided the MPI library is dynamically linked) and without a fine-grained off-line profiling and analysis phase. We then determine the best-suited sub-cluster for the compute job and migrate the job to this sub-cluster in order to improve the overall throughput of the compute farm and the average waiting time.