Understanding the Implications of Virtual Machine Management on Processor Microarchitecture Design


Xiufeng Sui
Advanced Computer Systems Laboratory, Institute of Computing Technology, CAS
Beijing, China
[email protected]

Tao Sun
Computer Science and Technology Department, University of Science and Technology of China
Hefei, China
[email protected]

Tao Li
Dept. of Electrical and Computer Engineering, University of Florida
[email protected]

Lixin Zhang
Advanced Computer Systems Laboratory, Institute of Computing Technology, CAS
Beijing, China
[email protected]

Abstract—Cloud computing has demonstrated tremendous capability in a wide spectrum of online services. Virtualization provides an efficient solution to the utilization of modern multi-core processor systems while affording significant flexibility. The growing popularity of virtualized datacenters motivates deeper understanding of the interactions between virtual machine management and the micro-architecture behaviors of the privileged domain. We argue that these behaviors must be factored into the design of processor microarchitecture in virtualized datacenters.

In this work, we use performance counters on modern servers to study the micro-architectural execution characteristics of the privileged domain while performing various VM management operations. Our study shows that today’s state-of-the-art processor still has room for further optimizations when executing virtualized cloud workloads, particularly in the organization of last level caches and on-chip cache coherence protocol. Specifically, our analysis shows that: shared caches could be partitioned to eliminate interference between the privileged domain and guest domains; the cache coherence protocol could support a high degree of data sharing of the privileged domain; and cache capacity or CPU utilization occupied by the privileged domain could be effectively managed when performing management workflows to achieve high system throughput.

Keywords—Virtualization; Datacenter management; Cloud computing; CloudSuite; Cache

1. INTRODUCTION

Cloud computing has emerged as a dominant computing paradigm to provide scalable online services. Typical server services, such as web search, social network, massive data analysis, and media streaming are all hosted in large datacenters. The emerging cloud workloads [1] represent computing and communication demands in these services, which differ from desktop, parallel, and traditional server workloads.

Virtualization is the cornerstone for cloud computing [2]. The base platform in today’s datacenter consists of physical hosts that run hypervisors, and workloads will run within virtual machines on these platforms. Since the hypervisor encapsulates different applications into each separate guest virtual machine (VM), a cloud provider can leverage VM consolidation and migration to achieve excellent resource utilization, high availability, and efficient power saving in large data centers. The hypervisor is not alone in its task of administering the guest domains on the system. A special privileged domain (domain0 in Xen’s terminology [3]) serves as an administrative interface to the hypervisor. It has special privileges, such as being able to start new domains, access physical I/O resources, and interact with other virtual machines running on the system.

Meanwhile, virtualization has the potential to dramatically increase the flexibility of deployments for general-purpose workloads. From the system management perspective, the virtualized environment enables a number of new workflows in datacenters [4]. These workflows involve operations on the virtual machines provided by the privileged domain of the hypervisor, including powering a VM on or off, cloning a VM, and VM live migration (i.e., moving VMs from one physical host to another with minimal service downtime). These management activities could prompt the privileged domain to demand much more storage and networking resources than traditional user applications do.

Ideally, an application running inside a VM should achieve the same performance as if it owned the physical host, that is, independent of the co-located VMs that share the same physical resources, including the privileged domain. The growing popularity of virtual machine environments has motivated deeper investigation of the performance implications of virtualization [5] at the architecture and microarchitecture levels. Nevertheless, the characteristics of the privileged domain when executing emerging cloud applications, especially when performing management workflows, have not been well studied so far. Note that the privileged domain serves all VMs for accessing I/O devices, so it usually exhibits different micro-architectural behavior (e.g., cache miss rate, degree of data sharing) than the guest domains. Therefore, processor architects should take these differences into account to further improve performance. Moreover, when management operations are performed, the resource competition between the privileged and guest domains will significantly impact both the performance of applications running in the guest domains and the efficiency of the management activities.

In this paper, we employ processor performance counters to provide a comprehensive micro-architectural characterization of virtualized systems running cloud workloads, with a special focus on the runtime behavior of the privileged domain. We believe our study will be helpful to explore the design space for scalable and efficient cloud server processors. In summary, we have the following observations:

• The privileged domain suffers from high last level cache miss rates. It is important to partition the shared LLC judiciously among the guest domains and the privileged domain by taking utility information of each component into account.

• The degree of data sharing of the privileged domain is high. Mechanisms to track the sharers in a single VM, which occupies only a small portion of the entire on-chip space, can significantly reduce the space overhead of the coherence protocol. However, we cannot simply eliminate all coherence requests across VM domains, and should support such data sharing efficiently.

• CPU resources needed by the privileged domain to perform live migration are limited. It is necessary to constrain the percentage of CPU occupied by the privileged domain, for both guaranteeing the performance of migration and avoiding slowing down applications running on the same host.

• Memory level parallelism of the privileged domain during a VM boot storm is high. We can limit the cache capacity occupied by the privileged domain and assign the additional space to guest VMs running memory-bound applications to achieve high system throughput.

The rest of the paper is organized as follows. Section 2 describes virtualized datacenters and dominant cloud applications in more detail. Section 3 provides a detailed description of our experimental methodology. Section 4 presents our evaluation results. Finally, we conclude this work in Section 5.

2. BACKGROUND

2.1 Xen hypervisor

Virtualization provides an illusion of multiple machines on a physical platform. A software layer called the hypervisor manages physical resources and isolates virtual machines from each other. Xen [3] is one of the most popular hypervisors that allows multiple virtual machines to run on top of a single server. The hypervisor is not alone in its task of administering the guest domains on the system. A special privileged domain called domain0 serves as the administrative interface to Xen. In a Xen-based system, domain0 is the first domain started by the Xen hypervisor on boot. It has special privileges, like being able to start new domains, access physical I/O resources, and interact with the other virtual machines running on the system. For example, it is responsible for running all of the device drivers for the hardware. For hardware that is made available to other domains (such as network interfaces and disks), it runs the backend driver, which multiplexes and forwards the requests from the frontend driver in each guest domain (called domainU) to hardware. In addition, most of the management operations in virtualized datacenters are also controlled by domain0. Figure 1 depicts a virtual machine host with four virtual machines. On the left, the virtual machine host’s domain0 is shown running the SUSE Linux operating system.

Figure 1 The Xen virtual machine environment

2.2 Management operations in virtualized data centers

In a virtualized data center, applications are run in multiple independent VMs on top of a single physical machine. It is easy for administrators to maintain physical and virtual hosts due to the flexibility of virtualization, and therefore many critical management operations are frequently performed. The management operations that we focus on are as follows: (1) VM power-on and power-off. (2) VM reset or soft-reset, which is equivalent to hitting the reset switch on a physical host. (3) Automated live migration, a load balancing or a power saving technique performed in an automated manner by moving VMs between hosts while keeping the VMs live. Live migration requires shared storage (e.g. NFS storage) between the source and destination hosts. (4) VM clone creates a replica of a powered-off VM. This is useful when duplicating a configuration: for example, when a new employee joins, the standard desktop VM image can be quickly deployed to the employee’s computer.

Recently, Soundararajan et al. [4] profiled the ongoing management workload activities from 17 data centers using VMware’s virtualization software, and analyzed their characteristics in detail. Since the data centers vary in size, the frequency with which these operations occur also differs from one data center to another. For each operation, they chose a single site and computed the average number of operations per day by taking the total number of operations performed and dividing by the number of days the management server was running. They also provided the maximum number of times the operation was performed on a given day to illustrate the burstiness of these operations. Table 1 lists all the operations mentioned earlier. The data illustrates that in some environments, VMs are powered on an average of 90 times per day (and sometimes more than 1500 times per day), and VMs are migrated over 50 times per day. Such activity is not found in non-virtualized data centers, as migrating applications that run directly on physical servers is much more involved than migrating VMs. In addition, physical hosts are rarely powered on or off except in case of failure or periodic maintenance. In fact, all these operations are controlled by domain0 and the hypervisor, and consume a large amount of system resources (such as CPU, memory, and network bandwidth). We will focus on the micro-architectural behavior of domain0 when performing such management operations.

Table 1 Typical management operations in virtualized data centers [4]

| Operations | Average Number per Day at Various Sites | Peak per Day at Various Sites |
|---|---|---|
| VM Power On | 90 | 1576 |
| VM Live Migration | 51 | 3156 |
| VM Power Off | 35 | 1535 |
| VM Reset | 4.6 | 176 |
| VM Clone | 6.0 | 44 |

2.3 Dominant cloud applications

M. Ferdman et al. [1] presented an overview of the applications that are most commonly found in today’s clouds. Overall, all cloud applications exhibit similar characteristics. For instance, they operate on large data sets that are split across a large number of nodes, typically into memory-resident shards. They serve large numbers of completely independent requests that do not share any state. They have application software designed specifically for the cloud infrastructure where unreliable nodes may come and go and where inter-node connectivity is used only for high-level task management and coordination. Table 2 lists all the CloudSuite applications that we used in our study.

Table 2 Typical cloud applications [1]

| CloudSuite Applications | Descriptions |
|---|---|
| Data Serving | NoSQL data stores can provide fast and scalable storage with varying and rapidly evolving storage schema, and have been explicitly used to serve as the backing store for large-scale web applications. |
| Data Analysis | The map-reduce paradigm enables automated analytical processing accessible to large-scale human-generated information. It farms out requests to a cluster of nodes that first perform filtering and transformation of the data (map) and then aggregates the results (reduce). |
| Media Streaming | Streaming services use large server clusters to gradually packetize and transmit media files ranging from megabytes to gigabytes in size, pre-encoded in various formats and bit-rates to suit a wide client base, which is enabled by the availability of high-bandwidth connections to home and mobile devices. |
| Software Testing | Software testing is a large-scale simulation task which is adapted to a worker-queue model with centralized load balancing, as cloud computing can temporarily offer dynamic and heterogeneous resources that are loosely connected over an IP network. |
| Web Search | The largest data center applications are used to provide access to public internet indexes (e.g., Google, Bing, Baidu). Multi-terabyte indexes are split into shards, with each index serving node (ISN) responsible for processing requests to its own shard. A frontend node sends index search requests to all ISNs in parallel, collects and sorts the responses, and sends a formatted reply to the requesting client. |

3. EXPERIMENTAL SETUP

The machines under test are two Intel Xeon 5620 servers, each with two 2.40GHz processors and 32GB of RAM. Each processor includes four out-of-order processor cores, 8 threads with Hyper-Threading technology [6], and a three-level cache hierarchy: L1 and L2 caches are private to each core, while the LLC (L3) is shared among all cores. Xen 4.1.2 is used as the virtual machine monitor. Each cloud workload runs on a virtual machine which has one vCPU, 2GB of RAM, and Debian 6.0 with 2.6.32 Linux kernel. In order to achieve optimum single core performance, Hyper-Threading is disabled in our experiment. Table 3 summarizes our experimental setup.

Table 3 The experimental workload setup

| CloudSuite Applications | Setup |
|---|---|
| Data Serving | We benchmark one VM running Cassandra 0.7.3 storage system with a 30GB YCSB dataset that exceeds the VM’s RAM capacity. Server load is generated using a YCSB 0.1.3 client that sends requests following a Zipfian distribution with an equal number of reads and updates. |
| Data Analysis | We benchmark one VM of a Hadoop 0.20.2 cluster running the algorithm WordCount on a 4GB set of Wikipedia pages. Each core runs one map and one reduce job. |
| Media Streaming | We benchmark a VM running Darwin Streaming Server 6.0.3 to serve 50GB of videos encoded in several bit-rates ranging between 42Kbps and 60Kbps and use the Faban workload driver to simulate the clients. We limit our setup to low bit-rate streams to shift stress away from network I/O. |
| Software Testing | We benchmark one VM running parallel symbolic execution to search for programming bugs in an application binary. We use Cloud9 to analyze the command-line printf utility from the GNU CoreUtils 6.10. |
| Web Search | We benchmark one VM as one index serving node (ISN) of the distributed version of Nutch 1.1/Lucene 3.0.1 with an index size of 2GB and data segment size of 23GB of content crawled from the public internet. We simulate the clients using the Faban workload driver. The clients are configured to achieve maximum search request rate while ensuring that 90% of all search queries complete in under 0.5 seconds. |

To analyze the behavior of cloud applications in virtualized data centers precisely, code-level profiling tools are necessary. OProfile [7] is a system-wide profiler for Linux systems, capable of profiling all running code at low overhead. It is based on statistical profiling: it samples the currently executing code each time a fixed number of hardware events has occurred, and it uses a large set of such samples to approximate the real hardware event distribution. Xenoprof [5] extends Xen and OProfile to profile across multiple domains, covering code in user processes, the kernel, and the hypervisor.

Specifically, we concentrate on two aspects when analyzing the data generated by Xenoprof:

• The micro-architectural characteristics of the cloud applications once they are moved to a virtualized environment on a given hardware platform. Here we use only one VM, configured as described in Table 3, since this eliminates cache interference between multiple VMs and reflects the inherent characteristics of each virtualized cloud benchmark; this methodology is consistent with a prior study on virtualization [8].

• The execution behavior of the privileged domain when performing various VM management operations (Sections 4.3, 4.4 and 4.5). Here we use one benchmarking VM and multiple background VM instances, each of which executes a simple compute task.

We monitor more than eight hardware counters for our analysis, including CPU_CLK_UNHALTED, INST_RETIRED, LLC_REFS, LLC_MISSES, DTLB_MISSES, L1I, and L2_LINES_IN.
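The metrics reported in Section 4 (CPI, LLC MPKI, LLC miss rate) are ratios of these counters. The following is a minimal sketch of how such metrics can be derived, not the exact post-processing used in this study. It assumes the Xenoprof sample counts have already been scaled to event counts; the `CounterSample` type, the function names, and the numbers in `example` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """Event counts for one component (e.g. domain0) over a profiling window."""
    cpu_clk_unhalted: int
    inst_retired: int
    llc_refs: int
    llc_misses: int


def cpi(s: CounterSample) -> float:
    """Cycles per instruction."""
    return s.cpu_clk_unhalted / s.inst_retired


def llc_mpki(s: CounterSample) -> float:
    """Last level cache misses per thousand retired instructions."""
    return 1000.0 * s.llc_misses / s.inst_retired


def llc_miss_rate(s: CounterSample) -> float:
    """Fraction of LLC references that miss."""
    return s.llc_misses / s.llc_refs


# Hypothetical counts, not measured data.
example = CounterSample(
    cpu_clk_unhalted=2_000_000,
    inst_retired=1_000_000,
    llc_refs=50_000,
    llc_misses=5_000,
)
```

Computing the same ratios separately for guest user, guest kernel, domain0, and hypervisor samples gives the per-component view used in the rest of the analysis.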

4. CHARACTERIZATION AND ANALYSIS

4.1 The effect of hypervisor

Coherence transactions by a hypervisor must be broadcast to all the caches in a system, as the hypervisor can be invoked from any VM and its memory data may reside in any cache within the system. Domain0 in Xen is a privileged VM, which handles I/O for guest VMs. The hypervisor forwards I/O requests from guest VMs to domain0, and domain0 actually accesses I/O devices. Since domain0 serves all VMs, without explicitly pinning it to a specific core, it tends to migrate to different cores frequently. Although it is possible to reduce the set of cores that domain0 can use, we allow domain0 to be scheduled to any core for performance purposes. Figure 2 presents the decomposition of last level cache misses by sharing types. In this figure, each bar represents a single VM running the specified workload, and LLC misses are decomposed into guest user, guest kernel, domain0 and hypervisor. For some of the cloud applications, e.g. data analysis (8.7%) and software testing (2.9%), less than 10% of LLC misses are from the hypervisor and domain0. In contrast, for media streaming, 85.9% of LLC misses are caused by domain0 and the hypervisor. For web search, 36.6% of LLC misses are due to the execution of domain0 and the hypervisor. Therefore, we can see that guest user, domain0, and hypervisor usually have different LLC memory intensity [9].
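A decomposition of this kind can be computed directly from per-component miss samples. The sketch below is illustrative only: the component names mirror the categories above, while the function names and the sample counts are hypothetical and are not the measured values behind Figure 2.

```python
def decompose_misses(samples: dict[str, int]) -> dict[str, float]:
    """Convert per-component LLC miss samples into percentage shares."""
    total = sum(samples.values())
    if total == 0:
        raise ValueError("no miss samples recorded")
    return {name: 100.0 * count / total for name, count in samples.items()}


def privileged_share(shares: dict[str, float]) -> float:
    """Combined share of misses caused by domain0 and the hypervisor."""
    return shares.get("domain0", 0.0) + shares.get("hypervisor", 0.0)


# Hypothetical sample counts, not measured data.
hypothetical_samples = {
    "guest_user": 500,
    "guest_kernel": 100,
    "domain0": 300,
    "hypervisor": 100,
}
shares = decompose_misses(hypothetical_samples)
```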

Figure 3 presents the L1 instruction cache miss rates on cloud workloads. We find that the instruction working sets of many cloud workloads, when running in a para-virtualized environment, considerably exceed the IL1 cache capacity, resulting in high IL1 cache miss rates.

Figure 2 Last level cache miss decompositions: misses by hypervisor (Xen), domain0, guest kernel, and guest VMs

Each cloud workload actively uses domain0 and hypervisor due to the frequent communication via the network. IL1 cache miss rates of domain0 and hypervisor are rather high on most workloads, accounting for 30 to 80 percent of the overall IL1 cache misses, indicating a large instruction working set. However, there are no such penalties for non-virtualized applications.

Figure 3 IL1 cache miss decompositions: misses by hypervisor (Xen), domain0, guest kernel, and guest VMs

Implications The performance of a virtualized system is largely determined by the performance of the privileged domain [5] and therefore cache optimization techniques that only focus on guest domains are less effective. A guest application with excellent cache behavior does little to speed up system performance if domain0 manifests poor cache behavior. Therefore, when partitioning the shared cache among guest user, domain0 and hypervisor, the goal should be speeding up the critical component by taking utility information into account.

Since each VM uses a subset of the cores in a server, the data accessed by each VM is located in the private levels of the cache hierarchy of those cores where it executes. The main drawback of such a multi-level cache hierarchy is that it duplicates the VM private data in every core that accesses it. However, the data of every VM is spread across the entire chip in the shared last level cache. Hence, a private cache hierarchy miss results in a request to an arbitrary position of the chip, instead of only to the cores in which that VM is running, thus increasing the latency to resolve the miss and creating interference between VMs. As the number of cores and the virtualization density increase, this effect becomes more prominent. An intermediate approach is to use caches that are private to the VM, but shared between the cores of a VM. This prevents the duplication of VM private data but incurs duplication of VM shared data. Therefore, new mechanisms to bring the data closer to the VMs and yet avoid duplication of VM shared data are necessary. This reduction in distance shortens the paths traversed to resolve misses, therefore improving performance and reducing on-chip interconnection utilization.
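To make the utility-aware partitioning goal stated in these implications concrete, the sketch below hands out the ways of a shared cache one at a time to whichever component would save the most misses by receiving one more way. This is a simplified greedy illustration in the spirit of utility-based cache partitioning, not a mechanism proposed or evaluated in this paper. The function name, the way count, and the miss curves are all hypothetical, and the greedy choice is only guaranteed to be optimal when each curve has diminishing returns.

```python
def partition_ways(miss_curves: dict[str, list[int]], total_ways: int) -> dict[str, int]:
    """Greedy way allocation by marginal utility.

    miss_curves[name][w] is the number of misses `name` incurs when it owns
    w ways, for w = 0..total_ways.
    """
    alloc = {name: 0 for name in miss_curves}

    def gain(name: str) -> int:
        # Misses saved if `name` receives one additional way.
        w = alloc[name]
        return miss_curves[name][w] - miss_curves[name][w + 1]

    for _ in range(total_ways):
        best = max(alloc, key=gain)
        alloc[best] += 1
    return alloc


# Hypothetical miss curves for a 4-way shared cache, not measured data.
curves = {
    "domain0": [100, 60, 30, 20, 15],
    "guest": [80, 68, 65, 62, 60],
    "hypervisor": [20, 15, 14, 14, 14],
}
allocation = partition_ways(curves, total_ways=4)
```

With these illustrative curves, the component whose misses fall fastest with extra capacity receives most of the ways, which is the behavior the text argues for when domain0 is the critical component.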

4.2 The data sharing under virtualization

We investigate data sharing behavior of the on-chip L2 cache lines for cloud applications under virtualization, as shown in Figure 4. We break down each VM into application, guest kernel, domain0, and hypervisor components to provide insight into the source of data sharing.

In general, we observe limited data sharing across cloud applications. Some Java-based applications (e.g. Data Serving and Web Search) exhibit a high degree of sharing from the use of a concurrent garbage collector, artificially inducing application-level communication. However, with regard to the guest kernel, domain0 and hypervisor components, almost all of the shared L2 cache lines are dominated by the network subsystem.

Figure 4 The number of L2 cache lines in exclusive and shared state

Implications In virtualized environments, the heavy data sharing exhibited by domain0 when executing most cloud applications indicates that the wide and low-latency interconnects used in today’s processors are necessary to provide the data needed by domain0 efficiently. Conversely, cloud workloads see no benefit from fine-grained coherence and high core-to-core communication bandwidth in non-virtualized systems [1]. In fact, a major problem for current cache coherence proposals is that they do not scale well with the number of cores. For instance, the storage overhead that directory-based protocols add to the cache grows linearly with the number of cores. As another example, in addition to their traffic requirements, token-based protocols [10] show scalability issues due to the hardware tables needed by the mechanism to perform persistent requests. In the case of server consolidation, each VM is restricted to a subset of cores, so it is a waste of space to keep coherence information about sharers for the entire chip. Therefore, mechanisms to track the sharers in a single VM can significantly reduce the space overhead of the coherence protocols. In [11], Marty proposed to impose a two-level virtual (or logical) coherence hierarchy on a physically flat CMP that harmonizes with VM assignment. Intel SCC (Single-chip Cloud Computer) [12] has two levels of cache, and there is no hardware cache coherence support among cores in order to simplify the design, reduce power consumption and encourage the exploration of on-chip distributed memory software models.

4.3 The effect of hypervisor scheduler

The default scheduler of Xen is a credit-based scheduler [13], which is a proportional fair share scheduler with global load balancing on multi-core systems. The credit scheduler allocates a time slice to each vCPU, called credit, for each scheduling period. vCPUs consume the assigned credits as they run. To ensure fairness, the scheduler always picks a vCPU that has remaining credits ahead of those that have run out of credits. Once a vCPU is picked, it can run for a time slice (e.g. 30ms). A vCPU can be blocked when it is no longer runnable, even if it has not used up the assigned credits.
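The selection rule described above can be sketched in a few lines. This is a deliberately simplified model, not Xen's implementation: it omits run-queue rotation, the boost priority, and periodic credit replenishment. The `VCpu` type, the function names, and the credit values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VCpu:
    name: str
    credits: int
    runnable: bool = True


def pick_next(vcpus: list[VCpu]) -> Optional[VCpu]:
    """Pick a runnable vCPU, preferring one that still has credits left."""
    runnable = [v for v in vcpus if v.runnable]
    if not runnable:
        return None
    with_credits = [v for v in runnable if v.credits > 0]
    return with_credits[0] if with_credits else runnable[0]


def run_slice(vcpu: VCpu, cost: int) -> None:
    """Charge a vCPU for the time it has just run."""
    vcpu.credits -= cost


# Hypothetical state: vm1 is out of credits, vm3 is blocked.
vcpus = [VCpu("vm1", 0), VCpu("vm2", 60), VCpu("vm3", 30, runnable=False)]
first = pick_next(vcpus)    # vm2: the only runnable vCPU with credits
run_slice(first, 60)
second = pick_next(vcpus)   # no credits left anywhere, falls back to vm1
```

The blocked vCPU (`vm3`) is never selected even though it has credits, which matches the observation that a vCPU can block before using up its allocation.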

On multi-core platforms, the credit scheduler dynamically relocates waiting vCPUs to idle cores for load balancing purposes. When a physical CPU becomes idle, or has no boosted or under-priority tasks in its own run queue, it checks its peer CPUs’ queues to see if there is a strictly higher priority task. If so, it steals that higher priority task. This default scheduling policy does not consider the cost of relocation. With this policy, all vCPUs aggressively migrate across physical cores to keep the cores as busy as possible.

An alternative way of scheduling that avoids relocation is to pin vCPUs to physical cores. However, such a restriction on scheduling may result in under-utilization of cores. Figure 5 illustrates the system throughput with different scheduling policies to show the effect of restricting the physical cores a VM can use. The no relocation policy pins virtual CPUs to physical cores with a one-to-one mapping. The full relocation policy does not restrict relocation, in order to maximize the throughput of the system. We perform experiments on two systems: one is an under-committed system and the other is an over-committed system. The hardware system has eight physical cores. Eight VMs (with one vCPU per VM) run on the under-committed system, and sixteen VMs run on the over-committed system.

(a) Under-committed: 8 VMs (1 vCPU per VM)

(b) Over-committed: 16 VMs (1 vCPU per VM)

Figure 5 The effect of Xen scheduler: under-committed vs. over-committed systems

Figure 5(a) shows the throughput and L2 cache miss rate when vCPUs are under-committed. In the under-committed system, for most of the cloud workloads and the VM kernels, pinning vCPUs to physical cores (no relocation) results in better performance than the full relocation policy by improving caching efficiency. However, as shown in Figure 5(b), in the over-committed system, although it is not immediately obvious, allowing relocation provides better performance than pinning vCPUs to physical cores. In the over-committed system, improving the utilization of cores becomes critical, as multiple VMs compete for the cores. In effect, the scheduler trades off core efficiency against cache utilization. In the under-committed case, VM relocation cannot improve core efficiency but reduces cache utilization, which causes a performance penalty. In the over-committed case, however, performance benefits from relocation due to the improvement in core efficiency.

Implications The arrangement of VMs running in the server cannot be known beforehand. The same server might be under-committed or over-committed. Additionally, the transitions between the two different scenarios can occur at any time. To avoid wasting resources, dynamic reconfiguration of the server should be allowed. Resources should be dynamically reallocated in order to match the current distribution of VMs running in the server. For example, if VMs are given private caches and a change of arrangement of VMs takes place, the private caches should match the new arrangement of VMs. On the other hand, mechanisms to reduce the coherence protocol overhead may negatively affect the ability to rearrange resources. Therefore, a trade-off or new mechanisms are needed to design the best possible coherence protocol that achieves both objectives.

4.4 VM live migration

In data centers, the typical scenarios of live migration are summarized as follows:

Load balancing. VMs are migrated from heavily loaded hosts to lightly loaded ones to achieve optimal resource utilization, maximize workload throughput, and avoid overload.

Online maintenance. To free one physical machine for upgrade or maintenance, all the VMs are migrated away without disconnecting clients. As a result, system reliability and availability are improved.

Power management. If many physical machines are lightly loaded, their VMs can be consolidated onto fewer machines. Then the idle physical machines can be powered off to save power.

The pre-copy technique [14] is one of the most prevalent live migration algorithms and is designed to minimize VM service downtime. It iteratively copies memory pages to the destination node while keeping the VM service available. When the applications’ writable working set becomes small enough, the VM is suspended and only its CPU state and the pages dirtied in the last iteration are sent to the destination. In the pre-copy phase, although the VM service is still available, significant service degradation could occur since the migration daemon continually consumes network bandwidth to transfer the dirty pages in each iteration.
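The iterative structure of pre-copy can be illustrated with a small simulation. This is a minimal sketch under strong assumptions, namely a constant page-dirtying rate and a constant transfer bandwidth, and it is not the algorithm as implemented in Xen. The function name, its parameters, and the example values are hypothetical.

```python
def precopy_rounds(total_pages: int, dirty_rate: int, bandwidth: int,
                   stop_threshold: int, max_rounds: int = 30):
    """Simulate pre-copy rounds.

    dirty_rate and bandwidth are in pages per second. Returns the number of
    pages sent in each pre-copy round and the number of pages left for the
    final stop-and-copy phase, during which the VM is suspended.
    """
    to_send = total_pages  # the first round copies all memory
    sent_per_round = []
    for _ in range(max_rounds):
        if to_send <= stop_threshold:
            break  # writable working set is small enough to suspend the VM
        sent_per_round.append(to_send)
        # Pages dirtied while this round was being transferred must be
        # resent: to_send / bandwidth seconds at dirty_rate pages per second.
        to_send = min(total_pages, to_send * dirty_rate // bandwidth)
    return sent_per_round, to_send


# Hypothetical parameters, not measured data.
rounds, remaining = precopy_rounds(
    total_pages=100_000, dirty_rate=1_000, bandwidth=10_000, stop_threshold=50
)
```

In this model each round shrinks by the ratio of dirtying rate to bandwidth, so migration converges quickly only when the network can outpace the guest's writes. When it cannot, the loop runs until `max_rounds` and a large final copy remains.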

Live migration can consume substantial CPU cycles and network bandwidth. Hence, the resources available to perform VM migration will affect the performance of the migration and consequently the performance of the migrated applications. Meanwhile, the migrated VM will compete for resources with the execution of other VMs. However, existing studies on live migration performance are typically based on the assumption that there are sufficient resources on the source and destination hosts, which is often not the case for highly consolidated systems. Therefore, it is important to understand the performance of VM migration under different levels of resource availability. With such knowledge, the hypervisor in a virtualized data center can take it into consideration when allocating resources and migrating VMs across the system according to the applications and system optimization goals.

To this end, we conduct a series of experiments on the Xen hypervisor to investigate the micro-architectural behavior of the migrated VM by varying the resources available to the migration. More specifically, the throughput of a VM running each of the cloud benchmarks is measured with different amounts of CPU allocated to domain0, which processes the migration. The CPU usage of domain0 on the hosts is controlled by setting the CPU cap parameter of Xen’s credit CPU scheduler [3].

We first study the impact of domain0’s CPU allocation on the throughput of the migrated VM during migration. The migrated VM runs cloud benchmarks. Figure 6 shows that as the CPU allocated to domain0 on the source host increases from 10% to 30%, the CPI of domain0 and the hypervisor drops dramatically for all three cloud benchmarks. However, after domain0’s CPU allocation exceeds 30%, the CPI stays at the same level. By monitoring domain0’s actual CPU usage during the entire migration, we find that when domain0 is assigned more than 30% of the CPU, it consumes at most 30%.

The results also show that the CPI of guest user and guest kernel is almost identical when the CPU utilization of domain0 changes from 10% to 100%. This observation can be explained by the following two factors. First, CPU usages by domain0 and domainU are well isolated without much interference. Second, the process of VM live migration handled by domain0 and hypervisor has little impact on the cloud applications running on the migrated VM.

Figure 6 Migration time for VMs with different CPU allocations to Domain0 on the source host

Implications In order to achieve a desired migration performance, the CPU resources of the source hosts need to be carefully managed, as they may be under contention and can affect the migration performance. Excessive CPU resource occupied by domain0 does not help further speed up live migration. It is necessary to constrain the percentage of CPU occupied by the privileged domain, for both guaranteeing the performance of migration and avoiding slowing down applications running on the same host.

4.5 VM boot storm

One of the increasingly common uses of virtualization is “Virtual Desktop Infrastructure” (VDI), in which virtual desktops are hosted on physical machines in a data center and are accessed remotely through thin clients. VDI deployments are susceptible to so-called “boot-storms” in which all VMs are turned on at the start of the day as employees come to work [4]. As a result, bursts of hundreds of power operations to VMs can occur daily.

In Figure 7, we show the normalized CPI, LLC MPKI and number of DTLB misses for all five cloud benchmarks. We randomly power on 16 VMs during the execution of each benchmark. All the data in Figure 7 are normalized to the normal execution of each benchmark without any management operations. As can be seen, VM boot storms have little effect on most of the cloud benchmarks when the number of powered-on VMs is relatively small. Media streaming is the exception: the transmission of media files ranging from megabytes to gigabytes in size is handled by the NIC backend driver in domain0. When domain0 is busy handling a large amount of disk traffic to power on 16 VMs, there are not enough memory and CPU resources left to process the transmission of media files. As a result, the CPI of the media streaming benchmark increases by nearly 10%.
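The normalization used for Figure 7 can be sketched as a simple ratio of each metric during the boot storm to the same metric in the undisturbed run. The helper name and the numeric inputs below are hypothetical placeholders chosen only to show the arithmetic; they are not measurements from this study.

```python
from typing import Dict


def normalize(during_storm: Dict[str, float], baseline: Dict[str, float]) -> Dict[str, float]:
    """Divide each metric by its value in the run without management operations."""
    return {name: during_storm[name] / baseline[name] for name in baseline}


# Placeholder values (NOT measured data): a 10% CPI increase shows up as 1.10.
baseline = {"cpi": 2.0, "llc_mpki": 4.0, "dtlb_misses": 1000.0}
during_storm = {"cpi": 2.2, "llc_mpki": 4.0, "dtlb_misses": 1000.0}

for name, ratio in normalize(during_storm, baseline).items():
    print(f"{name}: {ratio:.2f}")
```

A value of 1.00 means the boot storm had no visible effect on that metric.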

Figure 7 Normalized CPI, LLC MPKI and DTLB misses of cloud workloads during boot storms

Figure 8 shows how the CPI of domain0 varies over time when web search runs in a dedicated VM and 8 VMs are powered on simultaneously at 415s, 23s, 245s, 372s, and 338s respectively. We notice that the CPI of domain0 decreases as soon as the power-on operations are triggered.

Figure 8 The variation of domain0 CPI over time

The most interesting behavior in Figure 7 is that of domain0. Both the LLC MPKI and the number of DTLB misses of domain0 increase significantly on all five cloud benchmarks, yet its throughput keeps increasing. When booting a VM, data is read from disk until the VM is able to run an OS and complete the boot sequence. In our setups, about 200 MB of data is read from disk before a VM can boot. Therefore, when 16 VMs are powered on at once, the server must handle an additional 16 × 200 MB = 3.2 GB of disk traffic. These concurrent disk reads exhibit plenty of memory level parallelism (MLP), so the latency of LLC and DTLB misses can be effectively hidden.
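One way to see why more misses need not mean a higher CPI is a common first-order model in which overlapping misses share their latency: the memory stall component of CPI is roughly the misses per instruction times the miss latency, divided by the average number of misses in flight. The sketch below uses that simplified model; the function name and all numeric inputs are illustrative assumptions, not values measured in this study.

```python
def memory_stall_cpi(llc_mpki: float, miss_latency_cycles: float, mlp: float) -> float:
    """First-order estimate of stall cycles per instruction due to LLC misses.

    Assumption: misses overlap perfectly in groups of size `mlp`, so each
    miss exposes only latency / mlp cycles of stall.
    """
    if mlp < 1:
        raise ValueError("MLP must be at least 1")
    return llc_mpki / 1000.0 * miss_latency_cycles / mlp


# Illustrative inputs only: doubling the miss rate while quadrupling MLP
# still halves the stall component of CPI.
serial = memory_stall_cpi(llc_mpki=5.0, miss_latency_cycles=200.0, mlp=1.0)
overlapped = memory_stall_cpi(llc_mpki=10.0, miss_latency_cycles=200.0, mlp=4.0)
print(f"serial misses:     {serial:.2f} stall cycles/instruction")
print(f"overlapped misses: {overlapped:.2f} stall cycles/instruction")
```

The model ignores queuing and bandwidth limits, which is why it stops holding once the memory system saturates.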

We investigate the utility of increasing the number of powered-on VMs for data analysis and web search in Figure 9.

We plot the CPI as a function of the number of powered-on VMs. We find that the throughput of the benchmarks themselves and of their guest kernels is relatively insensitive to the number of powered-on VMs. However, the CPI of domain0 and the hypervisor is very sensitive to the number of powered-on VMs. As long as the memory and CPU are not saturated, domain0 can continue to exploit enough MLP to improve its throughput. However, we believe that booting a very large number of VMs (e.g. 500 VMs) at once will stress the memory system on each host and saturate the CPU [4]. As a result, the performance of domain0 will eventually decline.

Figure 9 CPI sensitivity to the number of powered on VMs

Furthermore, we perform the same experiments for VM clone with boot storms. The results are quite similar to those presented in Figure 7. This is because VM clone is a disk read/write intensive operation with high levels of MLP. For example, to clone 20 VMs of 16 GB each, the system must read up to 20 × 16 GB = 320 GB of data and then write up to 320 GB of data.
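The disk-traffic estimates used in this section are simple products, and the sketch below reproduces them for both the boot storm and the clone case. The helper names are hypothetical; the 200 MB per boot and 16 GB per VM figures come from the text, and decimal units (1 GB = 1000 MB) are assumed so that the results match the figures quoted above.

```python
from typing import Tuple


def boot_storm_read_gb(num_vms: int, mb_per_boot: float = 200.0) -> float:
    """Extra disk reads (in GB) when num_vms boot at once."""
    return num_vms * mb_per_boot / 1000.0


def clone_traffic_gb(num_vms: int, vm_size_gb: float) -> Tuple[float, float]:
    """Upper bound on (read, write) disk traffic in GB for cloning num_vms VMs."""
    volume = num_vms * vm_size_gb
    return volume, volume


print(f"boot storm, 16 VMs: {boot_storm_read_gb(16):.1f} GB read")
read_gb, write_gb = clone_traffic_gb(20, 16.0)
print(f"clone, 20 x 16 GB:  {read_gb:.0f} GB read + {write_gb:.0f} GB written")
```

These are upper bounds on raw volume; they say nothing about how much of the traffic is served from caches.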

Implications VM power on and VM clone are both common management operations with rather high MLP, so they can tolerate a high LLC miss rate. We can therefore limit the cache capacity occupied by domain0 through software cache partitioning based on page coloring [15] or through hardware partitioning techniques [16, 17], and then assign the freed cache space to guest VMs running memory-bound applications to achieve high system throughput.
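To make the page-coloring idea concrete, the sketch below computes the color of a physical page in a physically indexed, set-associative cache and reserves a small share of the colors for domain0. It is a simplified illustration: the cache geometry, the 12.5% share, and all function names are assumptions, and real last-level caches that hash addresses across slices do not follow this simple modulo mapping.

```python
from typing import Tuple


def num_page_colors(cache_bytes: int, ways: int, page_bytes: int = 4096) -> int:
    """Number of distinct page colors: bytes per cache way divided by page size."""
    return cache_bytes // (ways * page_bytes)


def page_color(phys_addr: int, n_colors: int, page_bytes: int = 4096) -> int:
    """Color of the page containing phys_addr (simple modulo indexing assumed)."""
    return (phys_addr // page_bytes) % n_colors


def split_colors(n_colors: int, dom0_fraction: float) -> Tuple[range, range]:
    """Give domain0 the first colors and leave the remainder to guest VMs."""
    dom0_count = max(1, int(n_colors * dom0_fraction))
    return range(0, dom0_count), range(dom0_count, n_colors)


# Hypothetical geometry: 8 MB, 16-way LLC with 4 KB pages.
colors = num_page_colors(8 * 2**20, ways=16)
dom0_colors, guest_colors = split_colors(colors, dom0_fraction=0.125)
print(f"{colors} colors: domain0 gets {len(dom0_colors)}, guests get {len(guest_colors)}")
print("color of page at 0x1000:", page_color(0x1000, colors))
```

A page allocator that hands domain0 only pages whose color falls in its range would confine it to the corresponding fraction of cache sets.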

5. CONCLUSIONS

Cloud computing has emerged as a dominant computing paradigm to provide scalable online services. Virtualization allows a more efficient use of hardware through server consolidation and also enables a variety of new workflows that can reduce total cost of maintenance and increase the flexibility in the datacenter. The growing popularity of virtualized datacenters motivates deeper investigation of the impact of virtual machine management on overall system performance.

In this work, we use performance counters to perform a comprehensive micro-architectural study of the execution characteristics of the privileged domain while performing various VM management operations. Our study demonstrates that there is still significant room for optimization in cache hierarchy design to improve cloud server performance and efficiency. Specifically, our analysis shows that shared caches should be partitioned to eliminate interference between the privileged domain and guest domains, the cache coherence protocol should support a high degree of data sharing of the privileged domain, and the cache capacity or CPU utilization occupied by the privileged domain should be limited when performing management workflows to achieve high system throughput. We believe that a better understanding of these microarchitecture implications will benefit processor design practice in the era of virtualization and cloud computing.

ACKNOWLEDGEMENTS

We would like to thank Kevin Lim (HP Labs) and Tor Aamodt (University of British Columbia / Stanford University) and all anonymous reviewers for their insightful comments and suggestions. This work was supported by the National Natural Science Foundation of China (No. 61202062).

REFERENCES

[1] M. Ferdman, A. Adileh, O. Kocberber, S. Volos, M. Alisafaee, D. Jevdjic, C. Kaynak, A. D. Popescu, A. Ailamaki, and B. Falsafi, Clearing the Clouds: A Study of Emerging Scale-out Workloads on Modern Hardware, In Proceedings of International Conference on Architectural Support for Programming Languages and Operating Systems, 2012.

[2] R. Iyer, R. Illikkal, L. Zhao, D. Newell, J. Moses, Virtual Platform Architectures for Resource Metering in Datacenters, In ACM SIGMETRICS Performance Evaluation Review, Volume 37, Issue 2 (Sept. 2009), pages: 89-90.

[3] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield, Xen and the Art of Virtualization, In Proceedings of the Symposium on Operating Systems Principles, 2003.

[4] V. Soundararajan and J. Anderson, The Impact of Management Operations on the Virtualized Datacenter, In Proceedings of International Symposium on Computer Architecture, 2010.

[5] A. Menon, J. R. Santos, Y. Turner, G. J. Janakiraman, and W. Zwaenepoel, Diagnosing Performance Overheads in the Xen Virtual Machine Environment, In Proceedings of International Conference on Virtual Execution Environments, 2005.

[6] S. Eggers, J. Emer, H. Levy, J. Lo, R. Stamm, D. Tullsen, Simultaneous Multithreading: A Platform for Next-Generation Processors, IEEE Micro, vol.17 no.5, p.12-19, Sept. 1997.

[7] Oprofile. http://oprofile.sourceforge.net.

[8] A. Gordon, N. Amit, N. Har'El, M. Yehuda, A. Landau, A. Schuster, D. Tsafrir, ELI: Bare-Metal Performance for I/O Virtualization, In Proceedings of International Conference on Architectural Support for Programming Languages and Operating Systems, 2012.

[9] A. Jaleel, H. Najaf-abadi, S. Subramaniam, C. Steely Jr., J. Emer, CRUISE: Cache Replacement and Utility-aware Scheduling, In Proceedings of International Conference on Architectural Support for Programming Languages and Operating Systems, 2012.

[10] M. Martin, M. Hill, and D. Wood, Token Coherence: Decoupling Performance and Correctness, In Proceedings of International Symposium on Computer Architecture, 2003.

[11] M. Marty and M. Hill, Virtual Hierarchies to Support Server Consolidation, In Proceedings of International Symposium on Computer Architecture, 2007.

[12] C. Clauss, S. Lankes, P. Reble, and T. Bemmerl, Evaluation and Improvements of Programming Models for the Intel SCC Many-core Processor, In Proceedings of International Conference on High Performance Computing & Simulation, 2011.

[13] D. Kim, H. Kim, and J. Huh, Virtual Snooping: Filtering Snoops in Virtualized Multi-cores, In Proceedings of International Symposium on Microarchitecture, 2010.

[14] C. Clark, K. Fraser, S. Hand, J.G. Hansen, E. Jul, C. Limpach, I. Pratt, and A. Warfield, Live Migration of Virtual Machines, In Proceedings of International Symposium on Networked Systems Design and Implementation, 2005.

[15] J. Lin, Q. Lu, X. Ding, Z. Zhang, and P. Sadayappan, Gaining Insights into Multicore Cache Partitioning: Bridging the Gap between Simulation and Real Systems, In Proceedings of International Symposium on High Performance Computer Architecture, 2008.

[16] M. Qureshi and Y. Patt, Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches, In Proceedings of International Symposium on Microarchitecture, 2006.

[17] J. Chang and G. Sohi, Cooperative Cache Partitioning for Chip Multiprocessors, In Proceedings of International Conference on Supercomputing, 2007.