RCL: Software Prototype
D3.2.1
Scheduled delivery 30.06.2014 Actual delivery 30.06.2014
Version 1.0
Responsible Partner IBM
Dissemination Level
PU Public
Revision History
Date Editor Status Version Changes
18.05.2014 Ronen Kat Draft 0.1 Outline
25.05.2014 Yossi Kuperman Draft 0.2 Added section 3
05.06.2014 Ronen Kat Draft 0.3 Merge IBM and RedHat updates
15.06.2014 Ronen Kat Draft 0.4 Added introduction and finishing
22.06.2014 Ronen Kat Draft 0.5 Input from internal reviewers
23.06.2014 Dave Gilbert Draft 0.6 Address comments from reviewers on Red Hat sections
30.06.2014 Ronen Kat Final 1.0 Finishing
Contributors
Ronen Kat – IBM, Yossi Kuperman – IBM
Dave Gilbert – Red Hat, Andrea Arcangeli – Red Hat
Internal Reviewers
Vasileios Anagnostopoulos, Luis Tomás
Copyright
This report is © by IBM and other members of the ORBIT Consortium 2013-2016. Its duplication is allowed only in the integral form for anyone's personal use and for the purposes of research or education.
Acknowledgements
The research leading to these results has received funding from the EC Seventh Framework Programme FP7/2007-2013 under grant agreement n° 609828.
Glossary of Acronyms
Acronym Definition
D Deliverable
DoW Description of Work
EC European Commission
PM Project Manager
PO Project Officer
WP Work Package
MTU Maximum Transmission Unit
Table of Contents
1. Executive Summary
2. Introduction
3. I/O Consolidation
4. Memory Consolidation and Externalization
5. Cloud Management
List of Figures
Figure 1: Split I/O deployment
Figure 2: Memory externalization nested test-bed
List of Tables
Table 1: I/O consolidation: components internal names
Table 2: I/O consolidation: status of components
Table 3: I/O consolidation: status of API
Table 4: Memory consolidation and externalization: status of components
1. Executive Summary
This document summarizes the prototype development work done as part of WP3. For this reporting interval, the first nine months of the project, we report on Task 3.1 (T3.1) and Task 3.2 (T3.2); Task 3.3 (T3.3) has not yet started.
Development of the I/O consolidation layer (T3.1) and of the memory consolidation and externalization layer (T3.2) is well under way, and development completeness is well beyond the 25% required for this deliverable by the quality plan in D1.2.1 [1].
The next prototype deliverable, D3.2.2, is scheduled for June 2015 (month 21 of the project).
2. Introduction
This first software prototype delivery realizes the design and specification outlined in deliverable D3.1.1 [2].
Chapter 3 describes the work on and status of the I/O consolidation layer, Chapter 4 covers the memory consolidation and externalization layer, and Chapter 5 covers cloud management.
2.1. Progress toward feature and API completeness
Development is well under way: feature completeness is close to (or even above) 50% of the planned features. API development is about 25% complete for the I/O consolidation layer and above 50% for the memory consolidation and externalization layer.
Work on cloud management has not yet started; it is scheduled to start in October 2014 per the project plan.
3. I/O Consolidation
Our objective is to externalize I/O resources and consolidate them all in a single dedicated appliance. We do so by detaching the back-end logic responsible for handling I/O from the hypervisor software stack and moving it from the compute servers that host the VMs to a remote server dedicated to I/O virtualization.
The prototype comprises the components described in deliverable D3.1.1. Their internal names are listed in Table 1 and their status in Table 2.
3.1. Status of prototype
The prototype implementation is capable of creating block and network virtual devices. The exposed virtual devices are partially functional and not yet ready to be deployed.
It is possible to communicate with the load generator over a virtual network device, and to read or write a block of data on a remote block device that resides in the I/O hypervisor's memory.
We deployed the Split I/O prototype on three machines in our lab (depicted in Figure 1 from right to left): the load generator, the I/O hypervisor (back-end), and the host that runs the VM (front-end). Each machine is an IBM System x3550 M4 equipped with two 8-core Intel Xeon E5-2660 CPUs running at 2.2 GHz, 56 GB of memory, and an Intel X520 dual-port 10 Gbps SR-IOV NIC.
The prototype is implemented for the Linux 3.9 kernel as a set of new kernel modules. Each module corresponds to a component described in deliverable D3.1.1 (see Table 1).
Modules 3.1.1, 3.1.21, 3.1.22, and 3.1.23 are installed on the I/O hypervisor machine, and modules 3.1.1, 3.1.11, 3.1.12, and 3.1.13 are installed on the VM.
Component 3.1.1 (Ethernet transport) is used by both the front-end and the back-end, which is why it is installed on both the VM and the I/O hypervisor. Its main purpose is to transport data between the two ends. It does so efficiently by operating directly at layer 2 (the MAC layer), avoiding higher layers such as TCP/IP, which incur high overhead. Block I/O requests have arbitrary sizes, whereas the size of a network packet sent over the wire is bounded by the Maximum Transmission Unit (MTU); component 3.1.1 must therefore fragment requests before sending them to the other end for processing. The fragment size is determined by the MTU.
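The fragmentation step can be illustrated with a minimal Python sketch. This is not the vrio_eth.ko implementation (which is a kernel module); the header layout, the HEADER_SIZE value, the MTU of 1500 bytes and all names below are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List

# Assumed values for illustration only; the real frame layout is not specified here.
MTU = 1500          # bytes of payload available per Ethernet frame (standard Ethernet)
HEADER_SIZE = 12    # hypothetical per-fragment header (request id, index, count)


@dataclass(frozen=True)
class Fragment:
    request_id: int  # identifies the block I/O request this piece belongs to
    index: int       # position of this piece within the request
    count: int       # total number of pieces in the request
    data: bytes


def fragment(request_id: int, payload: bytes, mtu: int = MTU) -> List[Fragment]:
    """Split one request into pieces that each fit in a single frame."""
    chunk = mtu - HEADER_SIZE
    if chunk <= 0:
        raise ValueError("MTU too small for the fragment header")
    # Ceiling division; an empty payload still yields one (empty) fragment.
    count = max(1, -(-len(payload) // chunk))
    return [
        Fragment(request_id, i, count, payload[i * chunk:(i + 1) * chunk])
        for i in range(count)
    ]


def reassemble(fragments: List[Fragment]) -> bytes:
    """Rebuild the request on the receiving end, tolerating reordering."""
    if not fragments:
        raise ValueError("no fragments")
    count = fragments[0].count
    by_index: Dict[int, bytes] = {f.index: f.data for f in fragments}
    if set(by_index) != set(range(count)):
        raise ValueError("missing fragments")
    return b"".join(by_index[i] for i in range(count))
```

With these assumed sizes, a 4096-byte block request is split into three fragments, and the receiver can rebuild it even if the fragments arrive out of order.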
To instantiate a virtual device, we created a special user-space utility that drives the generic back-end kernel module (module 3.1.21). The utility is invoked on the I/O hypervisor with the following details: the device type (block or net) and the backing device (e.g. a local block device). This in turn instructs the VM to expose a virtual device.
Below is a mapping between our internal module names and the components described in D3.1.1.
Component                                Module Name       Installed On
3.1.1 - Split I/O Ethernet Transport     vrio_eth.ko       VM + I/O hypervisor
3.1.11 - Split I/O Generic Front-End     vrio_generic.ko   VM
3.1.12 - Split I/O Block Front-End       vrio_gblk.ko      VM
3.1.13 - Split I/O Net Front-End         vrio_gnet.ko      VM
3.1.21 - Split I/O Generic Back-End      vrio_generic.ko   I/O hypervisor
3.1.22 - Split I/O Block Back-End        vrio_hblk.ko      I/O hypervisor
3.1.23 - Split I/O Net Back-End          vrio_hnet.ko      I/O hypervisor
3.1.31 - Split I/O management module     vrio.py           I/O hypervisor
Table 1: I/O consolidation: components internal names
Notes:
1. vRIO (virtual Remote I/O) is the internal development name for Split I/O.
2. Module vrio_generic.ko is used both for the generic front-end and the generic back-end.
3.2. External interactions
Management of the Split I/O hypervisor is provided through the Split I/O management module, which is written in Python (vrio.py, component 3.1.31). Its development status is given in Table 2.
The management library will be used by the cloud management layer as part of T3.3.
3.3. Completion status of components
Table 2 shows the development status of the Split I/O modules included in this prototype.
Component                                Status and progress
3.1.1 - Split I/O Ethernet Transport     70% completed
3.1.11 - Split I/O Generic Front-End     70% completed
3.1.12 - Split I/O Block Front-End       40% completed
3.1.13 - Split I/O Net Front-End         50% completed
3.1.21 - Split I/O Generic Back-End      70% completed
3.1.22 - Split I/O Block Back-End        40% completed
3.1.23 - Split I/O Net Back-End          50% completed
3.1.31 - Split I/O management module     Not started
Table 2: I/O consolidation: status of components
Table 3 shows the development status of the Split I/O API functions.
API function                 Status and progress
RESET_GUEST_DEVICES()        Not started
CREATE_BLOCK_DEVICE()        80% completed
REMOVE_BLOCK_DEVICE()        Not started
CREATE_NETWORK_DEVICE()      80% completed
REMOVE_NETWORK_DEVICE()      Not started
Table 3: I/O consolidation: status of API
4. Memory Consolidation and Externalization
Our objective is to provide a mechanism that allows the hypervisor to retrieve memory pages from a remote system for use in a VM. This mechanism is being used as part of a post-copy migration implementation, in which a migrated VM starts running on the destination host before all of its memory has been copied over.
4.1. Status of prototype
The prototype can perform small post-copy migrations, with page requests making the full round trip in limited situations, but it is incomplete and not yet stable. It consists of modifications to both QEMU and the Linux kernel (currently v3.13). The components are as described in deliverable D3.1.1.
The current deployment is a testing environment built on a nested hypervisor, which allows all components to be tested easily on one machine, as shown in Figure 2. The two QEMU instances are the source (left) and the destination, which holds a partial copy of the guest's memory.
The Linux kernel running in the L1 guest includes the 'Linux kernel mm subsystem enhancements' (component 3.2.1). These allow the destination QEMU instance to mark an area of memory as 'userfault' (i.e. external), which causes QEMU to be notified when the guest accesses a page in that area. At a later point, QEMU uses another kernel modification to atomically (and efficiently) move the page into place to satisfy the earlier request.
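The behaviour described above can be modelled with a toy Python sketch. It is a user-space simulation only, not the kernel interface: the class name, method names and the 4096-byte page size are assumptions for illustration, and the fault is resolved synchronously here, whereas in the prototype the page is moved into place at a later point in time.

```python
from typing import Callable, Dict, Optional

PAGE_SIZE = 4096  # assumed page size for this sketch


class UserfaultRegion:
    """Toy model of guest memory in which some pages are marked external."""

    def __init__(self, num_pages: int) -> None:
        # All pages start present and zero-filled.
        self._pages: Dict[int, bytes] = {
            i: bytes(PAGE_SIZE) for i in range(num_pages)
        }
        self._handler: Optional[Callable[[int], None]] = None

    def set_fault_handler(self, handler: Callable[[int], None]) -> None:
        """Register the callback that plays the role of the QEMU notification."""
        self._handler = handler

    def mark_external(self, first: int, count: int) -> None:
        """Mark a range of pages as 'userfault': their content lives elsewhere."""
        for i in range(first, first + count):
            self._pages.pop(i, None)

    def place_page(self, index: int, data: bytes) -> None:
        """Install a whole page in one step (stands in for the atomic move)."""
        if len(data) != PAGE_SIZE:
            raise ValueError("a page must be placed in full")
        self._pages[index] = data

    def read(self, index: int) -> bytes:
        """Guest access: an external page triggers the handler first."""
        if index not in self._pages:
            if self._handler is None:
                raise RuntimeError("fault on external page with no handler")
            self._handler(index)
            if index not in self._pages:
                raise RuntimeError("fault handler did not place the page")
        return self._pages[index]
```

In this model, only the first access to an external page reaches the handler; once the page has been placed, later accesses are served locally.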
The QEMU running in the L1 guest contains the modifications from components 3.2.2 and 3.2.3.
Component 3.2.2, the 'Remote memory front end', routes page requests and page data between the kernel on the destination machine and the network towards the source machine. Requests from the kernel are recorded in a page ownership data structure and sent as requests along a 'return path' to the source VM.
Component 3.2.3, the 'Remote memory handler', satisfies page requests from the destination machine and routes control messages to the Remote memory front end; these include the messages that initiate the transition to post-copy.
The existing QEMU migration protocol has been modified to include commands that initiate post-copy mode and to provide a bidirectional transport that allows the destination to request pages. The mode is enabled using an extra migration-capability flag.
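The interaction between the two QEMU components can be sketched as follows. This is a simplified Python model, not the QEMU code: the page states, the class and method names, and the use of an in-memory queue as the 'return path' are all hypothetical, and the assumption that the ownership structure suppresses duplicate requests is ours.

```python
from collections import deque
from enum import Enum
from typing import Deque, Dict, List, Tuple


class PageState(Enum):
    MISSING = "missing"      # still on the source, not yet asked for
    REQUESTED = "requested"  # request sent along the return path
    RECEIVED = "received"    # page data has arrived at the destination


class RemoteMemoryHandler:
    """Source side: satisfies page requests coming from the destination."""

    def __init__(self, memory: Dict[int, bytes]) -> None:
        self._memory = memory

    def serve(self, return_path: Deque[int]) -> List[Tuple[int, bytes]]:
        replies = []
        while return_path:
            page = return_path.popleft()
            replies.append((page, self._memory[page]))
        return replies


class RemoteMemoryFrontEnd:
    """Destination side: records kernel faults and requests pages."""

    def __init__(self, num_pages: int, return_path: Deque[int]) -> None:
        self.ownership = {p: PageState.MISSING for p in range(num_pages)}
        self.pages: Dict[int, bytes] = {}
        self._return_path = return_path

    def on_fault(self, page: int) -> None:
        # Only the first fault on a page produces a request (assumed behaviour).
        if self.ownership[page] is PageState.MISSING:
            self.ownership[page] = PageState.REQUESTED
            self._return_path.append(page)

    def on_page_data(self, page: int, data: bytes) -> None:
        self.pages[page] = data
        self.ownership[page] = PageState.RECEIVED
```

A round trip in this model is: the front end records a fault and queues a request, the handler drains the queue and returns the page contents, and the front end marks each page as received.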
4.2. External interactions
None yet.
4.3. Completion status of components
The status reflects a prototype that is starting to work: the basics are in place, but the code needs to be filled out, made more robust, and tidied up before submission to the upstream projects.
Component                                        Status and progress
3.2.1 Linux kernel mm subsystem enhancements     60% completed
3.2.2 Remote memory front end                    50% completed
3.2.3 Remote memory handler                      50% completed
Table 4: Memory consolidation and externalization: status of components
Figure 2: Memory externalization nested test-bed. Diagram labels: Host system (running Linux with unmodified KVM); Host (unmodified) QEMU instance; L1 guest (running modified Linux kernel); L1 modified QEMU instance (1); L1 modified QEMU instance (2); L2 guest (unmodified).
4.4. Additional notes
The next step is to finish the functionality and stabilise the current version so that arbitrary guests can be transferred.
5. Cloud Management
Cloud management is part of Task 3.3, which is scheduled to start at M13 (October 2014). Therefore, no cloud management components are included in this deliverable.
6. References
[1]. ORBIT document - D1.2.1 - Quality Plan, February 12, 2014
[2]. ORBIT document - D3.1.1 - RCL Design And Open Specification