Controlling Dynamic Guests
in a Virtual Computing Utility
Jeff Chase Ionut Constandache Azbayer Demberel Laura Grit Varun Marupadi Matt Sayler Aydan Yumerefendi
Department of Computer Science Duke University
{chase,ionut,asic,grit,varun,sayler,aydan}@cs.duke.edu
1. INTRODUCTION
Virtual Machine (VM) technology is rapidly emerging as a new foundation for server resource sharing and application service delivery. Virtual machines are instantiated from images containing initial file system state and software for the machine. VM images are a convenient vehicle for delivering virtual appliances with prepackaged software components for a specific purpose or application.
An image may be instantiated in a guest VM on server hardware at a server hosting provider. VCL systems are an important example, but the provider could also be a data center managed by enterprise virtual utility software (e.g., VMware or XenSource), a virtual grid site such as Globus Workspaces [Keahey et al. 2006], or an open computing utility such as Amazon’s EC2.
This paper summarizes some recent work on Orca, an open-source architecture
for virtual computing utilities that host dynamic guest environments on a shared
server substrate. The hosted environments (guests) on an Orca substrate may
range from single-user personal computing systems (e.g., a virtual desktop service) to distributed applications, experimental network environments, or programming platforms for distributed computing, such as “personal compute clouds”. New guests (such as cloud computing environments or multi-tier Web services) may
be deployed in an Orca system as plug-in extensions, without modifications to
the Orca software itself. We designed Orca to be “guest-neutral”: it exports
protocols and programmatic interfaces with a clean separation of guest-specific logic from a common underlying virtual computing platform.
The software for a guest is packaged as one or more virtual appliances, which are instantiated on demand in VMs and configured or “contextualized” to perform their intended functions within the guest application or service. Each guest is governed by an external programmatic guest controller that uses Orca protocols to acquire, configure, and release server resources at resource providers (virtual computing utility sites). The controller instantiates and configures the guest and monitors its behavior, making adjustments according to its policy. Dynamic guests may add or remove VMs or other shared resources on the fly as their needs change.
We focus on two issues for dynamic guests with pluggable controllers:
This research was supported by an IBM Faculty Award and by the National Science Foundation through awards CNS-0720829, CNS-0509408, and ANI 03-30658
—Secure control of portable images. We outline a general protocol for authenticated control of the guest by its controller. Our approach seeks to maximize the portability of images, and in particular to avoid requiring any context-specific state on the image, such as security keys or passwords that are specific to the owner or hosting site. Portable images encourage the practice of “package once, run anywhere”, which promotes image sharing and a healthy open market for virtual appliances and utility services.
—Fair resource sharing. Guests compete with one another for resources, and must be isolated. For example, a guest that experiences an unexpected spike in load should not interfere with another’s ability to meet a minimal assured performance level, even if they are coresident on the same physical servers. The Orca software supports pluggable arbitration and scheduling policies in brokering intermediaries. One goal is to combine effective fair-sharing policies with support for guest-neutral on-demand allocation and advance reservations.
2. OVERVIEW
[Figure 1 (diagram): an owner's guest controller, a utility site authority with an image service and an appliance provider, and a host server running a control domain (dom0) and a guest VM (domU), connected by the request channel, site control channel, boot channel, and guest control channel. Labeled steps: L1, the controller begins to provision resources for a guest application environment; L2, request to lease VMs; L3, stage and validate the selected image; L4, initiate install/boot of the selected image; L5, set up the guest VM (domU); L6, guest VM install/boot; L7, lease grant response; L8, join the new VM to the guest environment; L9, configure and contextualize. Other annotations: prepare and certify portable virtual appliance images; initialize the host VM hypervisor and control domain (dom0); guest commands are transmitted over the secure, authenticated guest control channel.]
Fig. 1. Secure contextualization of a node instantiated from a portable image in Orca.
Orca is based on a foundational abstraction of resource leasing. A resource lease is a contract between a resource provider (e.g., a hosting site) and a resource consumer (guest). The contract grants the consumer access to some resource for a specified period of time. Orca resource consumers use open leasing protocols and programmatic interfaces to negotiate contracts and to acquire and coordinate the underlying resources, optionally assisted by brokering intermediaries. Orca project software includes the Shirako leasing core [Irwin et al. 2006], the Automat [Yumerefendi et al. 2007] control portal and related components, a new implementation of the SHARP framework [Fu et al. 2003] for accountable lease contracts and brokering, Cluster-on-Demand (COD [Chase et al. 2003]), a back-end resource manager for shared clusters, and driver modules to interface the system to various virtualization technologies (e.g., Xen) and guest environments (e.g., cluster/grid middleware). It also includes plug-in policy modules for automated resource management and adaptation, which is a primary focus of our ongoing research and development.
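To make the lease abstraction concrete, the following minimal Python sketch models a lease as a record binding a provider, a consumer, a quantity of resource, and a term. The class and field names (Lease, provider, consumer, units, start, end) are hypothetical and chosen only for illustration; they are not taken from the Shirako or Orca interfaces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lease:
    """Hypothetical lease record; all field names are illustrative assumptions."""
    provider: str   # e.g., a hosting site
    consumer: str   # the guest that holds the contract
    units: int      # number of resource units (e.g., VMs) granted
    start: float    # start of the term, in seconds
    end: float      # end of the term, in seconds

    def active(self, now: float) -> bool:
        # The contract grants access only for the specified period of time.
        return self.start <= now < self.end


lease = Lease("site-a", "hadoop-guest", units=4, start=0.0, end=3600.0)
print(lease.active(10.0), lease.active(3600.0))
```

The record is immutable here on the assumption that any change to a contract would be expressed as a new lease rather than an in-place update.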
Figure 1 illustrates the process of instantiating a node at a utility site.
—The guest controller uses a negotiation protocol to lease server resources and instantiate a node from a selected image. Requests are subject to arbitration when resources are constrained (Section 4).
—The utility’s management software, an authority server representing the site, retrieves an image, e.g., from a third-party appliance provider. The site selects a host machine from its inventory and commands a control domain (denoted dom0) on the target host to instantiate a guest VM (domU) from the image. The control domain passes startup parameters through the boot channel to establish basic network connectivity.
—An agent preinstalled on the image starts during the boot sequence and allows the owner/controller to establish a secure guest control channel to issue commands to the guest node. Section 3 discusses the agent and binding protocol.
—The controller interacts with the node through the channel to perform any required contextualization operations, e.g., updating the guest software, assigning roles, exchanging identities and startup parameters among components of a complex guest, installing additional packages, etc.
One goal of our work is to provide a common foundation for contextualization in virtual computing utilities, governing the interactions among the image, utility, and guest controller. A standard for these interactions would promote sharing of portable images and the development of third-party virtual appliance providers offering a selection of prepackaged, endorsed images that meet a wide range of needs. Standards for interoperability of images are also essential for development of open hosting utilities that can run images from third-party image providers.
Researchers have proposed various approaches for general contextualization services and proxies [Bradshaw et al. 2007; Sapuntzakis et al. 2003] that can be used once a secure guest control channel is established. Contextualization standards are an important area of development within the industry.
2.1 Examples: Virtual Computing with Orca
Virtual desktops. Consider the simplest example of a VM instantiated at a utility
site for direct use by the owner as a virtual desktop. The guest controller is a Web site that processes form requests from users, and instantiates built-to-order virtual desktops for specified appliance images on demand. In this case, what is desired is a simple form of contextualization to transfer keys and enable the owner to log in securely over a connection using transport-layer security (e.g., TLS/SSL/SSH) with mutual authentication for the owner and the guest node (VM). Ideally the login does not require a password.
Hadoop clouds. Now consider a dynamic guest scenario, such as dynamic instantiation of a personal Hadoop cloud from the shared server pool. Hadoop is an open-source middleware system for data-intensive cluster computing using the MapReduce programming model. A recent IBM/Google initiative broadens the notion of “cloud computing” to Hadoop systems, in which middleware manages the distribution of subtasks across the nodes in the cloud, much as Amazon’s Elastic Compute Cloud assigns guest VMs to a shared server pool. We are experimenting with a Hadoop guest package for Orca that combines both cloud models: it instantiates multiple Hadoop clouds from a common server pool shared with other guests. The Hadoop clouds may be sized according to load and resource availability.
In this case, a Hadoop cloud is a guest with an external manager (guest controller) acting on behalf of the cloud owner. The system contextualizes each new VM to join it into the cloud, assign it a specific role, and establish its relationships with other nodes that it serves, depends on, or interacts with. For example, Hadoop provides a distributed file system (HDFS) led by a designated metadata server (NameNode), and a job execution service with a designated JobTracker master. Other nodes are assigned the role of storage sites for HDFS (DataNodes) and/or execution sites for subtasks of a job (TaskTrackers). Each worker node (DataNode or TaskTracker) must register with its corresponding master (NameNode or JobTracker) before the master can assign it data chunks to store or tasks to execute. Worker nodes may be added on the fly to speed execution, or even withdrawn for use by another guest without disrupting the application (within certain limits).
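The role assignment described above can be sketched as a small planning function that a guest controller might run before contextualizing its nodes. This is an illustrative sketch only: the function name assign_roles, the plan layout, and the choice to place both masters on the first node are assumptions made for brevity, and no Hadoop API is used.

```python
def assign_roles(node_names):
    """Map each node to Hadoop roles; the first node hosts both masters (assumption)."""
    if not node_names:
        raise ValueError("a cloud needs at least one node")
    master, workers = node_names[0], node_names[1:]
    plan = {master: {"roles": ["NameNode", "JobTracker"], "registers_with": None}}
    for name in workers:
        # Workers store HDFS data and run subtasks; each must register with
        # the master before it can be assigned data chunks or tasks.
        plan[name] = {"roles": ["DataNode", "TaskTracker"], "registers_with": master}
    return plan


plan = assign_roles(["vm0", "vm1", "vm2"])
for node, entry in plan.items():
    print(node, entry)
```

Adding a worker on the fly then amounts to extending the plan with one more worker entry and contextualizing only the new VM.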
Workbench automation. Advanced controllers may guide the execution of the guest based on high-level objectives, intermediate results, and resource availability. In recent work we built an automated benchmarking controller that plans and executes experiments to meet a high-level benchmarking objective [Shivam et al. 2008], such as mapping the impact of a set of workload and configuration parameters on the peak throughput of a network file service. The controller plans experiments to maximize the yield of new information at low cost based on prior results. For each experiment it leases resources to instantiate a test system and a workload generator to obtain performance measures. This project provides a context for us to study control policies that balance accuracy, time-to-result, and overall cost.
3. SECURE CONTROL OF PORTABLE IMAGES
This section outlines a protocol to establish an authenticated guest control channel, which is the essential toehold for contextualization to adapt the node to its local environment, join it to an application service, and/or monitor and control its execution. The controller can also instantiate a local contextualization proxy to act on its behalf. Proxies can improve scalability, reduce network load and latency, and permit use of private IP addresses within the site.
Security of the guest control channel is crucial. If the channel is not secure, then an attacker could gain control of a newly instantiated VM and impersonate or deny access to the rightful owner. A more sophisticated attacker might interpose a man-in-the-middle between the guest controller and its nodes in order to spy on or tamper with the guest application.
A channel is authenticated if each party is justifiably convinced of the identity of the other. Without loss of generality, we use asymmetric key pairs to represent identity. Authentication becomes possible when each party knows the public key of the other party. In our approach, the utility site authority brokers a secure exchange of keys between the guest controller (or proxy) and each guest node, suitable to set up an authenticated connection with transport-layer security (e.g., TLS/SSL) as the guest control channel. Each virtual appliance image must include a preinstalled piece of software to run the guest node’s side of the secure binding protocol. We refer to this element as the KeyMaster agent. The KeyMaster and controller conduct a one-round binding protocol for mutual authentication and key exchange, seeded by secure tokens passed from the utility boot authority. Since it is included as a “front door” on every system, the KeyMaster must be generic and convincingly secure. The reference KeyMaster is implemented as a small Python script that is minimalist, portable, and verifiable, easing technical barriers to adoption by virtual computing utilities and image providers.
[Figure 2 (diagram): the owner's guest controller, the utility site authority, and a host server with dom0 and domU, connected by the request channel, site control channel, boot channel, and guest control channel. The hash SHA1(O+) travels from the owner through the site and dom0 to the node; the node generates its keys (N-, N+) and SHA1(N+) travels back to the owner. The owner and node then exchange O+ and N+, each checks that the received key matches its SHA1 hash, and they negotiate an SSL/TLS session key for the guest control channel. Annotations: each owner or guest controller possesses an asymmetric keypair with public key O+ and private key O-; a mutually trusted third party (such as a broker or PKI certifying authority) endorses the public key O+ to the utility site authority; a generic keymaster server is installed on every portable appliance image and executes a standard key exchange protocol with the owner, based on tokens passed via the integrity-protected boot channel.]
Fig. 2. A variant of the secure binding protocol in which the site authority brokers an exchange of compact public key hashes.
Figure 2 outlines one variant of the protocol we propose. It assumes only the integrity of the channels: all messages are sent in the clear and contain no secrets, only hashes of public keys. However, we assume that the control domain and the site authenticate their communications, e.g., using keys installed by the utility. Also, the owner and the site have established a trust relation prior to the request. Because the tokens are compact public key hashes, they can also be pushed through a capacity-restricted boot channel. Note that we assume that the boot channel is full duplex (e.g., XenStore). If the control domain and the node can communicate only through the boot command line, then the control domain might generate the node key pair and push it at boot time. Newer versions of the Linux kernel accept a boot command line of up to 2048 characters. This is sufficient for an RSA key pair (2 x 128 bytes) and the hash value of the owner public key (20 bytes). If the boot channel is narrow and half duplex, then the only means to authenticate the node is to pass it a shared secret from the control domain. In this case the control domain must deliver the secret to the site and then to the owner over encrypted connections.
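The hash-checking step of this variant can be sketched with the Python standard library. The sketch below is an illustration under stated assumptions, not the reference KeyMaster: random byte strings stand in for the encoded public keys O+ and N+, the helper names key_hash and key_matches are hypothetical, and the subsequent TLS/SSL negotiation is omitted.

```python
import hashlib
import hmac
import os


def key_hash(public_key: bytes) -> bytes:
    # SHA-1 digest of an encoded public key, as in Figure 2 (20 bytes).
    return hashlib.sha1(public_key).digest()


def key_matches(public_key: bytes, expected_hash: bytes) -> bool:
    # Compare the presented key against the hash brokered by the site.
    return hmac.compare_digest(key_hash(public_key), expected_hash)


# Stand-ins for the encoded public keys O+ and N+ (random bytes, not real keys).
owner_pub = os.urandom(128)
node_pub = os.urandom(128)

# Tokens relayed by the site authority over the integrity-protected channels.
token_for_node = key_hash(owner_pub)    # SHA1(O+), pushed through the boot channel
token_for_owner = key_hash(node_pub)    # SHA1(N+), relayed back to the owner

# On connection, each side presents its public key in the clear and the peer
# checks it against the token it received through the site.
node_accepts_owner = key_matches(owner_pub, token_for_node)
owner_accepts_node = key_matches(node_pub, token_for_owner)
attacker_rejected = not key_matches(os.urandom(128), token_for_node)
print(node_accepts_owner, owner_accepts_node, attacker_rejected)
```

Because the tokens are only hashes of public keys, nothing in this exchange is secret; the sketch relies, as the protocol does, on the integrity of the channels that carry the tokens.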
The standard binding protocol preserves the portability of images: in particular, it avoids the need to modify an image for a particular site or owner (e.g., to pre-install keys or tokens onto the image), which could interfere with image sharing or endorsement of standard images by image providers. The protocol development was guided by discussions with rPath, which sells rBuilder appliance preparation tools to software vendors, and operates a free website called rBuilder online offering preconfigured virtual appliances built from open-source software. The health of this emerging industry depends on common formats and protocols for virtual appliances and the metadata to describe and instantiate them in virtual computing utilities.
4. FAIR SHARING FOR DYNAMIC GUESTS
If the utility is overprovisioned or underused, then it may approve any reasonable request for resources by a guest controller. But if resources are constrained, it must arbitrate and schedule requests. This must be done in a way that provides assurances of predictable performance and isolation to guests, and shares the resources fairly according to policies configured by operators.
Fair-share (proportional-share) scheduling algorithms have been used to provide differentiated service quality in CPU scheduling and network packet scheduling and, more recently, for virtual machines. We consider how to apply fair sharing for resource control in Orca virtual computing utilities. Since each lease request may obtain multiple virtual machines as a group, we refer to the leased contexts as virtual clusters. For example, a virtual cluster might host a personal Hadoop cloud or a job execution system for a particular user subgroup. The role of the fair-share scheduler is to arbitrate flows of requests for virtual clusters originating from a set of users or groups. The policy objectives are to serve requests while assuring each active user or group a minimum share of the resources according to its assigned weight, and to share any surplus resources in proportion to the weights. Fair-share algorithms are work-conserving: they never leave a resource unit idle when it can be used to serve a pending request. In essence, work-conserving proportional share has a “use it or lose it” property: a flow relinquishes its right to any portion of its share that it does not request to consume.
To illustrate, consider a university cluster shared by multiple research groups, each of which contributes servers to the cluster. An example of such a system is Duke’s Shared Cluster Resource, or DSCR. Suppose the policy is to allocate resources to each group in proportion to its contribution. If some group is not using its share of the servers, they become available for use by other groups. This policy uses shared resources efficiently while providing incentives for groups and departments to contribute resources to the shared cluster.
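The shared-cluster policy can be illustrated with a small weighted max-min ("water-filling") computation in Python. This is a generic textbook technique used here only to show the intended outcome for a single instant; it is not the scheduling algorithm developed in this paper, and the function name fair_shares and the group names are hypothetical.

```python
def fair_shares(capacity, weights, demands):
    """Weighted max-min allocation of `capacity` units among groups."""
    alloc = {g: 0.0 for g in weights}
    active = {g for g in weights if demands.get(g, 0) > 0}
    remaining = float(capacity)
    while active and remaining > 1e-9:
        total_w = sum(weights[g] for g in active)
        # Offer each active group its weighted share of what is left.
        offers = {g: remaining * weights[g] / total_w for g in active}
        satisfied = {g for g in active if demands[g] - alloc[g] <= offers[g] + 1e-9}
        if not satisfied:
            # Every group can absorb its full offer: hand it out and stop.
            for g in active:
                alloc[g] += offers[g]
            break
        # Satisfied groups take only what they asked for; the surplus is
        # redistributed among the remaining groups in the next round.
        for g in satisfied:
            remaining -= demands[g] - alloc[g]
            alloc[g] = float(demands[g])
        active -= satisfied
    return alloc


# Three groups contributed 50, 30, and 20 servers; group A currently needs only 10.
shares = fair_shares(100, {"A": 50, "B": 30, "C": 20}, {"A": 10, "B": 100, "C": 100})
print(shares)
```

In this example group A uses 10 of its 50 servers, and the 40 it leaves idle are split between B and C in the ratio 30:20, giving them 54 and 36 servers respectively.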
In a cluster setting it is also desirable to allow requests for multiple concurrent units, such as a virtual cluster or a parallel job. Some production batch scheduling systems (e.g., LSF, PBS, and Maui) incorporate fair share scheduling algorithms for clusters. We began our work after a close look at Maui’s policy and found that its objective is to be tunable to obtain a range of policies that approximate weighted fair share over various intervals [Jackson et al. 2001]. In comparison, our objective is to obtain high fidelity to the weights with clearly defined behavior and fairness bounds while incorporating general support for important features of virtual computing systems, such as advance reservations, priority scheduling for requests originating from the same group, canceled or vacated leases, dynamic resizing of virtual clusters, and dynamic changes to the server pool.
[Figure 3 (diagram): client requests arrive as per-flow queues (flows f, g, h with weights $\phi_f$, $\phi_g$, $\phi_h$); each request $p_f^j$ is a lease request for a virtual cluster with multiple attributes, including $r_f^j$ and $d_f^j$, and carries a start tag $S(p_f^j)$ and a finish tag $F(p_f^j)$. A scheduler that maintains virtual time $v(t)$ dispatches requests onto a cluster with D elements (e.g., virtual servers).]
Fig. 3. Overview of the system model for Winks.
We are experimenting with a new advanced scheduling algorithm called Winks (see footnote 1). Our approach integrates a fair queuing algorithm, calendar scheduling, advance reservations, and a proportional share allocation policy. We chose to build on extensions of Start-time Fair Queuing (SFQ) [Goyal et al. 1997] to arrays of host resource units (e.g., fixed-size virtual machine slivers) [Jin et al. 2004]. Figure 3 depicts the system model for Winks. An administrative entity first groups the resource consumers and assigns a configurable weight $\phi_f$ to each group. A flow $f$ is the sequence of timestamped requests $p_f^0 \ldots p_f^n$ originating from some group. Each request for a flow is a lease request for a virtual cluster with various parameters (e.g., number of resources $r_f^j$ and duration $d_f^j$). When a request arrives at time $t$, the scheduler tags it with a start tag $S(p_f^j)$ and finish tag $F(p_f^j)$. The tags represent when a request should start and complete relative to a system notion of virtual time $v(t)$. In SFQ algorithms $v(t)$ is given by the start tag of the last request dispatched at time $t$. The scheduler assigns tags as follows:
$$S(p_f^j) = \max\{v(t),\ F(p_f^{j-1})\}, \quad j \ge 1 \qquad (1)$$
$$F(p_f^j) = S(p_f^j) + \frac{r_f^j \times d_f^j}{\phi_f}, \quad j \ge 1 \qquad (2)$$
This extends SFQ to incorporate the request width into the calculation of the finish tag. The progress of a flow $f$ is given by the start tag of the request at the head of the flow’s queue. The fairness of a schedule at any point in time can be determined by comparing the progress of competing flows: equivalent values imply that the flows have received fair resource allotments, relative to their assigned weights.
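The tag assignment in equations (1) and (2) can be sketched in a few lines of Python. The class and function names below (Flow, tag_request) are hypothetical, and the sketch assumes that the finish tag preceding a flow's first request is zero; it covers only tagging, not dispatch or the maintenance of virtual time.

```python
class Flow:
    def __init__(self, weight):
        self.weight = weight       # the flow's assigned weight
        self.last_finish = 0.0     # finish tag of the previous request (assumed 0 initially)


def tag_request(flow, v_now, width, duration):
    """Return (start_tag, finish_tag) for a request of `width` units for `duration`."""
    start = max(v_now, flow.last_finish)                   # equation (1)
    finish = start + width * duration / flow.weight        # equation (2)
    flow.last_finish = finish
    return start, finish


f, g = Flow(weight=2.0), Flow(weight=1.0)
# Both flows ask for 4 units for 10 time units while virtual time is 0.
f_first = tag_request(f, 0.0, width=4, duration=10)
g_first = tag_request(g, 0.0, width=4, duration=10)
f_second = tag_request(f, 0.0, width=4, duration=10)
print(f_first, g_first, f_second)
```

With a weight of 2, flow f advances its finish tag by 20 per request while flow g advances by 40, so f's second request finishes at tag 40, the same tag as g's first. Over that span of virtual time f is entitled to twice as much work as g, matching the weights.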
Winks extends SFQ in several ways to enable fair sharing for virtual computing:
—Wide Requests. Consumers may issue requests for multiple concurrent resource units to be allocated together: these requests are wide requests.
—Flow Reordering. Winks reorders requests in a flow while preserving the fairness and efficiency properties of SFQ. A flow may become stalled if there are insufficient resources to schedule the request at the front of its queue (e.g., because it is too wide). Winks reorders the requests if there is another request from the same flow that can be dispatched in place of the stalled request. Fairness-preserving flow reordering also supports priority scheduling within each flow.
Footnote 1: The name is derived from Weighted Wide Window Calendar Scheduler.
—Calendar Scheduling. We incorporate a calendar into Winks with a scheduling window to constrain the scheduling horizon. The calendar structure enables Winks to satisfy future requests (advance reservations) with a specified start time that is in the window. To maintain fairness, Winks limits resources reserved by each flow to its share of the window. Larger window sizes give the scheduler more flexibility, while smaller window sizes increase fidelity.
—Backfill Scheduling. Flow reordering allows Winks to enable stalled flows to make progress; however, it does not prevent the blocked request from starving. Winks uses the calendar to fit wide requests ahead in the schedule and “backfills” the schedule with smaller requests. Backfill is a common technique in parallel job schedulers, and we extend it to a virtual computing system.
—Dynamic Resizing. Winks preserves fairness in the presence of dynamic resizing of the shared resource pool.
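The selection step behind flow reordering can be sketched as follows. This is a deliberately simplified illustration and not the Winks implementation: it shows only how a narrower request from the same flow might be chosen when the head request is too wide for the free units, and it omits the tag bookkeeping, calendar, and backfill logic that preserve fairness. The function name pick_request and the (name, width) request tuples are hypothetical.

```python
def pick_request(queue, free_units):
    """Remove and return the first request in `queue` that fits in `free_units`.

    `queue` is a list of (name, width) pairs in arrival order. Returns None
    when no request in the flow fits, i.e., the flow remains stalled.
    """
    for i, (name, width) in enumerate(queue):
        if width <= free_units:
            del queue[i]
            return name, width
    return None


flow_queue = [("wide", 8), ("small", 2), ("medium", 4)]
chosen = pick_request(flow_queue, free_units=3)   # head needs 8 units, only 3 are free
stalled = pick_request(flow_queue, free_units=1)  # nothing fits in a single unit
print(chosen, stalled, flow_queue)
```

The skipped wide request stays at the head of the queue, which is why reordering alone cannot prevent it from starving and a mechanism such as calendar-based backfill is needed.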
To our knowledge, Winks is the first weighted fair queuing algorithm with the ability to schedule parallel (wide) requests into the future, with an integrated calendar to accommodate backfill and advance reservations. We have implemented Winks as a policy plugin for an Orca lease manager (Shirako [Irwin et al. 2006]). Experimental results show that Winks enforces weights effectively for wide requests and prevents flows from starving.
REFERENCES
Bradshaw, R., Desai, N., Freeman, T., and Keahey, K. 2007. A Scalable Approach to Deploying and Managing Appliances. In Proceedings of the TeraGrid Conference.
Chase, J. S., Irwin, D. E., Grit, L. E., Moore, J. D., and Sprenkle, S. E. 2003. Dynamic Virtual Clusters in a Grid Site Manager. In Proceedings of the Twelfth International Symposium on High Performance Distributed Computing (HPDC).
Fu, Y., Chase, J., Chun, B., Schwab, S., and Vahdat, A. 2003. SHARP: An Architecture for Secure Resource Peering. In Proceedings of the 19th ACM Symposium on Operating System Principles.
Goyal, P., Vin, H. M., and Chen, H. 1997. Start-Time Fair Queueing: A Scheduling Algorithm for Integrated Services Packet Switching Networks. IEEE/ACM Transactions on Networking 5, 5 (October), 690–704.
Irwin, D., Chase, J. S., Grit, L., Yumerefendi, A., Becker, D., and Yocum, K. G. 2006. Sharing Networked Resources with Brokered Leases. In Proceedings of the USENIX Technical Conference.
Jackson, D. B., Snell, Q., and Clement, M. J. 2001. Core Algorithms of the Maui Scheduler. In Proceedings of JSSPP.
Jin, W., Chase, J. S., and Kaur, J. 2004. Interposed Proportional Sharing for a Storage Service Utility. In Proceedings of SIGMETRICS.
Keahey, K., Foster, I., Freeman, T., and Zhang, X. 2006. Virtual Workspaces: Achieving Quality of Service and Quality of Life in the Grid. Scientific Programming Journal 0,0.
Sapuntzakis, C., Brumley, D., Chandra, R., Zeldovich, N., Chow, J., Lam, M. S., and Rosenblum, M. 2003. Virtual Appliances for Deploying and Maintaining Software. In Proceedings of the 17th Large Installation Systems Administration Conference (LISA). 181–194.
Shivam, P., Marupadi, V., Chase, J., Subramaniam, T., and Babu, S. 2008. Cutting Corners: Workbench Automation for Server Benchmarking. In Proceedings of the USENIX Technical Conference.
Yumerefendi, A., Shivam, P., Irwin, D., Gunda, P., Grit, L., Demberel, A., Chase, J., and Babu, S. 2007. Towards an Autonomic Computing Testbed. In Workshop on Hot Topics in Autonomic Computing (HotAC).