RCL: Design and Open Specification
D3.1.1
Document Information
Scheduled delivery 31.03.2014
Actual delivery 31.03.2014
Version 1.0
Responsible Partner IBM
Dissemination Level PU (Public)
Revision History
Date        Editor          Status            Version  Changes
27.02.2014  Ronen Kat       Draft             0.1      Skeleton
05.03.2014  Ronen Kat       Draft             0.2      Added base text
05.03.2014  Yossi Kuperman  Draft             0.3      Added section 3 (I/O consolidation)
09.03.2014  Yossi Kuperman  Draft             0.4      Added text in section 3 (I/O consolidation)
07.03.2014  Dave Gilbert    Draft             0.4RH    Added section 4 (Memory Externalization)
10.03.2014  Ronen Kat       Draft             0.5      Added section 5. Edits to section 3.
16.03.2014  Ronen Kat       Ready for review  0.6      Merge and finishing
16.03.2014  Ronen Kat       Ready for review  0.7      Merge and finishing
26.03.2014  Ronen Kat       Final             1.0      Address reviewers' comments
Contributors
Ronen Kat (IBM), Joel Nider (IBM), Yossi Kuperman (IBM), Andrea Arcangeli (Red Hat), David
Gilbert (Red Hat)
Internal Reviewers
Glossary of Acronyms
Acronym  Definition
Dx.x.x   Deliverable x.x.x
DoW      Description of Work
I/O      Input / Output in relation to data transfer
EC       European Commission
KVM      A virtualization infrastructure for the Linux kernel which turns it into a hypervisor
Mxx      Month (xx) number from the start of the project
PM       Project Manager
PO       Project Officer
QEMU     Emulator and virtualization machine that allows you to run a complete operating system on top of another
RCL      Resource Consolidation Layer
TBD      To Be Defined
VM       Virtual Machine
VMM      Virtual Machine Manager
Table of Contents
1. Executive Summary
2. Introduction
2.1. Document Focus
2.2. Next Steps
3. I/O Consolidation
3.1. Components
3.2. Interfaces
3.3. Modules
4. Memory Consolidation and Externalization
4.1. Components
4.2. Interfaces
4.3. Modules
5. Cloud Management
List of Figures
Figure 1: ORBIT architecture – focus on resource consolidation layer
Figure 2: I/O consolidation modules and interfaces
Figure 3: Memory externalization components and interfaces
1. Executive Summary
This first delivery of the Resource Consolidation Layer (RCL) Design and Open Specification
document focuses on the high-level architecture of the “Resource Externalization and
Consolidation as a Facilitator for Fault Tolerance” work package (WP). It describes the
components of I/O consolidation, memory consolidation and externalization, and cloud
management, together with the interactions between them.
The design and open specification will be extended in future versions of this document,
D3.1.2 and D3.1.3, as the project progresses.
2. Introduction
The RCL addresses the end-to-end implementation of virtual resource consolidation through
enhancements to the Virtual Machine Manager (VMM) and to the Virtual Machine (VM)
paravirtualized device drivers, in conjunction with modern hardware supporting x86
full virtualization such as Intel-VT and AMD-V. The RCL enables a VM to use not only local
resources but also resources residing on remote physical hosts within the data center,
and makes them appear to the VM as local resources.
I/O consolidation is the I/O “middleware” that allows network I/O resources and block
I/O resources to be accessed and shared by multiple VMs, and enables resources to be
“switched” between VMs and physical hosts.
The memory consolidation concept is based on “externalization” of memory, allowing
VM memory to physically reside not only on the local physical machine but also on a remote
physical machine.
Figure 1: ORBIT architecture – focus on resource consolidation layer.
2.1. Document Focus
This first design and open specification document presents the high-level design of the
resource consolidation layer, including the components and interfaces of I/O consolidation
and memory externalisation.
2.2. Next Steps
In M9 (June 2014), a first software prototype implementation (D3.2.1) will realize the design
and open specification in this document. It will be followed in M12 (September 2014) by a
scientific report (D3.3.1) describing the outcome of the design and implementation.
The design and open specification will then be expanded with additional details in
deliverable D3.1.2 in M18 (March 2015), which will be the next update of this document.
3. I/O Consolidation
Figure 2 shows the task's components and their dependencies on internal and external
components. Each of the following components is implemented as part of Linux kernel
version 3.13.
Figure 2: I/O consolidation modules and interfaces.
Note component 3.3.TBD, which is part of Task 3.3, scheduled to start in M13
(October 2014). The notation TBD reflects that this component has not yet been designed
and may in fact be composed of multiple components. The details of component 3.3.TBD
will be included in the next update of this document.
3.1. Components
The guest operating system (front-end) runs as a VM on top of KVM [1] and QEMU and has
components 3.1.12, 3.1.13, 3.1.11 and 3.1.1 installed, whereas the I/O consolidation hypervisor
(back-end) runs on bare metal with the same base kernel and has modules 3.1.22, 3.1.23, 3.1.21
and 3.1.1 installed. Module 3.1.1 is installed on both the front-end and the back-end.
The following shows for each component (in Figure 2) the component number, name, description
and role in ORBIT, the success indication for the component, and the interfaces of the component.
COMPONENT 3.1.1
NAME Split I/O Ethernet Transport
High Level Description / ROLE
This component is responsible for implementing the Ethernet transport layer to enable efficient and scalable communication between the Split I/O front-end and Split I/O back-end components
Success indicator Less than 20% overhead compared to traditional para-virtual I/O
Interfaces EXPOSED (internal and external) 3.1.1.1
INTERFACES CONSUMED (internal and external) Only Linux kernel interfaces will be used, none from other components within the project.
COMPONENT 3.1.11
NAME Split I/O Generic Front-End
High Level Description / ROLE
This component is responsible for providing common Split I/O data services to both the network and block front-ends. It is also responsible for instantiating the virtual block and virtual network devices of the VMs.
Success indicator Virtual block and network devices consolidated in a centralized and remote I/O hypervisor
Interfaces EXPOSED (internal and external) 3.1.11.1 3.1.11.2
INTERFACES CONSUMED (internal and external) 3.1.1.1
COMPONENT 3.1.21
NAME Split I/O Generic Back-End
High Level Description / ROLE
This component is responsible for providing common Split I/O data services to both the network and block back-ends. It is also responsible for exposing a control interface to the cloud management layer that can be used to link the VMs with the corresponding I/O hypervisor and to manage the virtual network and virtual block devices.
BUILDING BLOCK Linux/KVM
Success indicator Virtual block and network devices consolidated in a centralized and remote I/O hypervisor
Interfaces EXPOSED (internal and external) 3.1.21.1 3.1.21.2
INTERFACES CONSUMED (internal and external) 3.1.1.1
COMPONENT 3.1.12
NAME Split I/O Block Front-End
High Level Description / ROLE
This component is responsible for exposing virtual block devices to the guest operating system and sending all the read/write requests to the block back-end running in the I/O hypervisor.
Success indicator Virtual block devices consolidated in a centralized and remote I/O hypervisor
Interfaces EXPOSED (internal and external) None
INTERFACES CONSUMED (internal and external) 3.1.11.1, internal Linux interfaces
COMPONENT 3.1.13
NAME Split I/O Net Front-End
High Level Description / ROLE
This component is responsible for exposing virtual network devices to the guest operating system and sending/receiving all network frames to/from the network back-end running in the I/O hypervisor.
BUILDING BLOCK Linux/KVM
Success indicator Virtual network devices consolidated in a centralized and remote I/O hypervisor
Interfaces EXPOSED (internal and external) None
INTERFACES CONSUMED (internal and external) 3.1.11.1
COMPONENT 3.1.22
NAME Split I/O Block Back-End
High Level Description / ROLE
This component is responsible for processing the block read and block write requests sent by the block front-end running within each VM. It maps the (remote) virtual block device exposed to each VM to a (local) block device.
BUILDING BLOCK Linux/KVM
Success indicator Virtual block devices consolidated in a centralized and remote I/O hypervisor
Interfaces EXPOSED (internal and external) None
INTERFACES CONSUMED (internal and external) 3.1.21.1, internal Linux interfaces
COMPONENT 3.1.23
NAME Split I/O Net Back-End
High Level Description / ROLE
This component is responsible for receiving/sending virtual network L2 frames from/to the virtual network devices of each VM. It bridges the (remote) virtual network devices exposed to each VM with a (local) tap/macvtap interface. The tap interface can be connected with any virtual network (e.g. OVS/Linux bridge) and the macvtap interface can be connected to any physical NIC.
BUILDING BLOCK Linux/KVM
Success indicator Virtual network devices consolidated in a centralized and remote I/O hypervisor
Interfaces EXPOSED (internal and external) None
INTERFACES CONSUMED (internal and external) 3.1.21.1
3.2. Interfaces
The following interfaces are part of Task 3.1 (Consolidation of Virtualized I/O). The interfaces that are
external to Task 3.1 include a concrete and detailed API specification, while the APIs internal to
Task 3.1 will be detailed in the next update of the document.
The following shows for each interface (in Figure 2) the interface number, name, description and role
in ORBIT, and the relevant components which use the interface.
interface 3.1.1.1
NAME Split I/O Transport
High Level Description / ROLE
Abstracts the concrete transport (Ethernet) used for the communication, so that new transports can easily be plugged in (e.g. InfiniBand, which will not be implemented as part of this project).
consumed by components (internal and external) 3.1.11 3.1.21
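The role of interface 3.1.1.1 as a pluggable abstraction can be pictured with a small sketch. This is only an illustration under stated assumptions: the class names `SplitIOTransport` and `LoopbackTransport`, the `send`/`receive` methods and the `TRANSPORTS` registry are hypothetical and do not appear in the specification, and the real component is a Linux kernel module rather than Python code.

```python
from abc import ABC, abstractmethod
from collections import deque
from typing import Deque, Dict, Tuple, Type


class SplitIOTransport(ABC):
    """Hypothetical shape of interface 3.1.1.1 as seen by the generic front-end and back-end."""

    @abstractmethod
    def send(self, peer: str, payload: bytes) -> None:
        """Deliver one message to the peer identified by `peer`."""

    @abstractmethod
    def receive(self) -> Tuple[str, bytes]:
        """Return the next (sender, payload) pair."""


class LoopbackTransport(SplitIOTransport):
    """In-memory stand-in for the Ethernet transport (component 3.1.1), for illustration only."""

    def __init__(self, local_address: str) -> None:
        self.local_address = local_address
        self._queue: Deque[Tuple[str, bytes]] = deque()

    def send(self, peer: str, payload: bytes) -> None:
        # A real transport would put a frame on the wire towards `peer`;
        # this stand-in loops every message back to its own receive queue.
        self._queue.append((self.local_address, payload))

    def receive(self) -> Tuple[str, bytes]:
        return self._queue.popleft()


# Hypothetical registry: a new transport is plugged in by adding an entry here,
# without changing the code that calls send() and receive().
TRANSPORTS: Dict[str, Type[SplitIOTransport]] = {"loopback": LoopbackTransport}


def open_transport(name: str, local_address: str) -> SplitIOTransport:
    """Instantiate the transport registered under `name`."""
    return TRANSPORTS[name](local_address)


transport = open_transport("loopback", "02:00:00:00:00:01")
transport.send("02:00:00:00:00:02", b"virtio request")
sender, payload = transport.receive()
```

The point of the sketch is that consumers depend only on the abstract class, so an additional transport would be a new subclass plus one registry entry.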
interface 3.1.11.1
NAME Split I/O Front-end Protocol
High Level Description / ROLE
Defines the generic I/O communication front-end services (based on virtio protocol) for all types of virtual I/O devices (block and net)
consumed by components (internal and external) 3.1.12 3.1.13
interface 3.1.21.1
NAME Split I/O Back-end Protocol
High Level Description / ROLE
Defines the generic I/O communication back-end services (based on virtio protocol) for all types of virtual I/O back-ends (block and net)
consumed by components (internal and external) 3.1.22 3.1.23
interface 3.1.11.2
NAME Split I/O Internal Control
High Level Description / ROLE
Defines the control operations (e.g. specify I/O hypervisor, create virtual network device, create virtual block device) that can be used by the back-ends to configure and manage the front-ends.
This is a logical interface that will be used remotely (RPC over the Split I/O transport).
The following interface is shared between Task 3.1 and Task 3.3 (Cloud Management Integration).
The following shows the interface number, name, description and role in ORBIT, the relevant
components which use the interface, and the list of APIs in the interface.
interface 3.1.21.2
NAME Split I/O External Control
High Level Description / ROLE
Defines the control operations (e.g. specify I/O hypervisor, specify virtual network devices and block devices, connect a virtual network device with a virtual switch, set the backing store for a virtual block device) that will be used by the cloud management component to configure and manage the system
consumed by components (internal and external) 3.3.TBD (described in section 5)
API
RESET_GUEST_DEVICES()
Reset a guest to a “clean” state, removing all virtual devices
Parameters:
• IO Guest's management address (e.g. VF's MAC address)
CREATE_BLOCK_DEVICE()
Creates a virtual block device on the specified guest
Parameters:
• IO Guest's management address (e.g. VF's MAC address)
• Guest's device name (e.g. /dev/vrda)
• IO Hypervisor device name (e.g. /dev/sdb)
• QoS – rate-limit (optional)
REMOVE_BLOCK_DEVICE()
Parameters:
• IO Guest's management address (e.g. VF's MAC address)
• Guest's device name (e.g. /dev/vrda)
CREATE_NETWORK_DEVICE()
Creates a virtual network device on the specified guest
Parameters:
• IO Guest's management address (e.g. VF's MAC address)
• Guest's device name (e.g. eth7)
• IO Hypervisor device name (e.g. eth5)
• QoS – rate-limit (optional)
REMOVE_NETWORK_DEVICE()
Parameters:
• IO Guest's management address (e.g. VF's MAC address)
• Guest's device name (e.g. eth7)
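The five operations of interface 3.1.21.2 can be modelled as a small in-memory table keyed by the guest's management address. The sketch below is an assumption-laden illustration: the class `ExternalControl`, the record `VirtualDevice`, the Python parameter names, the error behaviour and the integer type of the rate limit are all hypothetical, since the specification lists only the operation names and their parameters. It performs no real I/O.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class VirtualDevice:
    """One virtual device exposed to a guest (hypothetical record layout)."""
    kind: str                          # "block" or "net"
    guest_name: str                    # e.g. "/dev/vrda" or "eth7"
    hypervisor_name: str               # e.g. "/dev/sdb" or "eth5"
    rate_limit: Optional[int] = None   # optional QoS rate limit; units are not specified


@dataclass
class ExternalControl:
    """Toy model of interface 3.1.21.2: guest address -> guest device name -> device."""
    guests: Dict[str, Dict[str, VirtualDevice]] = field(default_factory=dict)

    def reset_guest_devices(self, guest_addr: str) -> None:
        # RESET_GUEST_DEVICES(): return the guest to a clean state with no virtual devices.
        self.guests[guest_addr] = {}

    def create_block_device(self, guest_addr, guest_name, hypervisor_name, rate_limit=None):
        self._create("block", guest_addr, guest_name, hypervisor_name, rate_limit)

    def remove_block_device(self, guest_addr, guest_name):
        self._remove("block", guest_addr, guest_name)

    def create_network_device(self, guest_addr, guest_name, hypervisor_name, rate_limit=None):
        self._create("net", guest_addr, guest_name, hypervisor_name, rate_limit)

    def remove_network_device(self, guest_addr, guest_name):
        self._remove("net", guest_addr, guest_name)

    def _create(self, kind, guest_addr, guest_name, hypervisor_name, rate_limit):
        devices = self.guests.setdefault(guest_addr, {})
        if guest_name in devices:
            raise ValueError(f"{guest_name} already exists on guest {guest_addr}")
        devices[guest_name] = VirtualDevice(kind, guest_name, hypervisor_name, rate_limit)

    def _remove(self, kind, guest_addr, guest_name):
        devices = self.guests.get(guest_addr, {})
        device = devices.get(guest_name)
        if device is None or device.kind != kind:
            raise KeyError(f"no {kind} device {guest_name} on guest {guest_addr}")
        del devices[guest_name]


# Usage with the example values from the parameter lists; the MAC address is made up.
control = ExternalControl()
guest = "52:54:00:12:34:56"
control.create_block_device(guest, "/dev/vrda", "/dev/sdb")
control.create_network_device(guest, "eth7", "eth5", rate_limit=1000)
control.remove_block_device(guest, "/dev/vrda")
```

Keying every call on the guest's management address mirrors the parameter lists above, where that address is always the first parameter.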
4. Memory Consolidation and Externalization
Figure 3 shows the Task 3.2 components and the dependencies with internal and external
components. Component 4.1.6 (Post copy Manager) is part of work package 4 (external to
this report), and is not described here.
Figure 3: Memory externalization components and interfaces.
[Figure 3 labels: Post copy Manager (4.1.6); Linux kernel mm subsystem enhancements (3.2.1); Remote memory front end (3.2.2); Remote memory handler (3.2.3); Userspace page fault control (3.2.1.1); Remote memory protocol (3.2.2.1); Remote memory src API (3.2.3.1); Remote memory dest API (3.2.2.2)]
4.1. Components
The following shows for each component (in Figure 3) the component number, name,
description and role in ORBIT, the success indication for the component, and the interfaces
of the component.
COMPONENT 3.2.1
NAME Linux kernel mm subsystem enhancements
High Level Description / ROLE New mechanism to handle page faults in userspace
Success indicator Running guest with remote memory (as part of post-copy)
Interfaces EXPOSED (internal and external) 3.2.1.1
INTERFACES CONSUMED (internal and external) Internal kernel interfaces
COMPONENT 3.2.2
NAME Remote memory front end
High Level Description / ROLE
Routes page requests/data between the kernel on the destination machine and the network towards the source machine
Success indicator Running guest with remote memory (as part of post-copy)
Interfaces EXPOSED (internal and external) 3.2.2.1
INTERFACES CONSUMED (internal and external) 3.2.1.1
COMPONENT 3.2.3
NAME Remote memory handler
High Level Description / ROLE
Satisfies page requests from the source machine, routes control messages to the Remote memory front end
Success indicator Running guest with remote memory (as part of post-copy)
4.2. Interfaces
The following shows for each interface (in Figure 3) the interface number, name, description
and role in ORBIT, the relevant components which use the interface, and the list of APIs in
the interface.
interface 3.2.1.1
NAME Userspace page fault control
High Level Description / ROLE
Provides an efficient way for user-space to register for faults on a region of virtual memory, and a way for it to respond to those faults
consumed by components (internal and external) 3.2.2, but it must also be designed to be generally useful to other users
API
madvise(start,length, MADV_[NO]USERFAULT)
[Un]mark a region of anonymous virtual memory so that accessing pages that have not been allocated causes a 'user fault', which is notified on a userfaultfd and pauses the faulting thread
int userfaultfd(int flags) (TBD)
Open a file descriptor to receive user fault notifications and to unpause faulted threads
remap_anon_pages(dest,src,len)
Move the mapping of a page of anonymous memory to a new virtual memory location
read(userfaultfd
Receive an address from the kernel of a page that needs to be filled by userspace
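The intended interplay of the four calls (mark a region, fault on an unallocated page, read the faulting address, remap a filled page into place) can be walked through with a pure user-space simulation. This is not the kernel interface: the class `SimulatedAddressSpace`, its method names, the use of a queue in place of the userfaultfd, the 4096-byte page size and the assumption that all region addresses are page-aligned are hypothetical modelling choices.

```python
from collections import deque

PAGE_SIZE = 4096  # assumed page size for the simulation


class SimulatedAddressSpace:
    """Toy model of one process's anonymous memory; addresses passed to madvise are page-aligned."""

    def __init__(self):
        self.pages = {}             # page address -> page contents (allocated pages only)
        self.userfault = set()      # page addresses marked as in MADV_USERFAULT
        self.fault_queue = deque()  # stands in for the userfaultfd

    def madvise_userfault(self, start, length, enable=True):
        # Models madvise(start, length, MADV_[NO]USERFAULT).
        for addr in range(start, start + length, PAGE_SIZE):
            if enable:
                self.userfault.add(addr)
            else:
                self.userfault.discard(addr)

    def access(self, addr):
        """Return the page contents, or None where a real thread would pause on a user fault."""
        page = addr - addr % PAGE_SIZE
        if page in self.pages:
            return self.pages[page]
        if page in self.userfault:
            self.fault_queue.append(page)  # notify the fault instead of allocating
            return None
        self.pages[page] = bytes(PAGE_SIZE)  # ordinary anonymous memory is zero-filled
        return self.pages[page]

    def read_fault(self):
        """Models read() on the userfaultfd: next faulting page address, or None if idle."""
        return self.fault_queue.popleft() if self.fault_queue else None

    def remap_anon_pages(self, dest, src, length):
        # Models remap_anon_pages(dest, src, len): move the mappings, do not copy them.
        for offset in range(0, length, PAGE_SIZE):
            self.pages[dest + offset] = self.pages.pop(src + offset)


space = SimulatedAddressSpace()
space.madvise_userfault(0x10000, 2 * PAGE_SIZE)         # guest RAM region served remotely
first_try = space.access(0x10010)                       # would pause the faulting thread
faulting_page = space.read_fault()                      # handler learns which page is needed
space.pages[0x900000] = b"x" * PAGE_SIZE                # page data arrives into a staging page
space.remap_anon_pages(faulting_page, 0x900000, PAGE_SIZE)
second_try = space.access(0x10010)                      # now succeeds with the remote data
```

In the simulation the retry is explicit; in the interface described above the paused thread would instead be resumed through the userfaultfd.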
interface 3.2.2.1
NAME Remote memory protocol
High Level Description / ROLE
Provides a mechanism for passing pages on demand between virtual machines, and to control the process
consumed by components (internal and external) 3.2.3 and interacts with the existing migration protocol
Protocol entries
These entries correspond to packets over a modified migration protocol
rp.REQPAGES(start,len) (dest->src)
A request for a region of guest physical memory of the given length. The source must ensure that at least these pages are provided
PAGE(address,data) (src->dest)
A page of guest physical memory (modification of semantics of existing migration message)
command(code,data) (src->dest)
Provides a synchronisation and control mechanism for subcommands; the full set of uses is TBD, but some uses are given below
command(SENSITISE_RAM) (src->dest)
Causes the destination to switch into userfault mode
command(OPENRP) (src->dest)
Asks the destination to open a 'return path' for sending page requests over
command(REQACK(id)) (src->dest)
Request acknowledgement from destination (dest)
command(DISCARD, (start, range)[]) (src->dest)
Invalidate a previously transmitted page, requiring the destination to send a REQPAGES to recover it
rp.ACK(id)
Response to a REQACK
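The REQPAGES/PAGE exchange can be sketched as follows. The message classes, the function `serve_reqpages`, the dictionary used as guest memory and the 4096-byte page size are hypothetical; the specification does not define a wire format, and the real messages travel over a modified QEMU migration stream. The sketch only illustrates the stated rule that the source must provide at least the requested pages, here by rounding the request out to page boundaries.

```python
from dataclasses import dataclass
from typing import Dict, Iterator

PAGE_SIZE = 4096  # assumed page size


@dataclass(frozen=True)
class ReqPages:
    """rp.REQPAGES(start, len), sent from destination to source over the return path."""
    start: int
    length: int


@dataclass(frozen=True)
class Page:
    """PAGE(address, data), sent from source to destination."""
    address: int
    data: bytes


def serve_reqpages(guest_memory: Dict[int, bytes], request: ReqPages) -> Iterator[Page]:
    """Yield one PAGE message for every page that overlaps the requested region.

    `guest_memory` maps page-aligned guest physical addresses to page contents;
    pages missing from the mapping are treated as zero-filled.
    """
    first_page = request.start - request.start % PAGE_SIZE
    end = request.start + request.length
    for address in range(first_page, end, PAGE_SIZE):
        yield Page(address, guest_memory.get(address, bytes(PAGE_SIZE)))


# Source-side memory with two populated pages, and an unaligned request spanning both.
source_memory = {0: b"a" * PAGE_SIZE, PAGE_SIZE: b"b" * PAGE_SIZE}
replies = list(serve_reqpages(source_memory, ReqPages(start=100, length=PAGE_SIZE)))
```

A request that is not page-aligned therefore produces whole pages on both sides of the region, which satisfies the "at least these pages" requirement.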
interface 3.2.2.2
NAME Remote memory destination API
High Level Description / ROLE
Provides the page destination with control over the link to, and control over, the source system
consumed by components (internal and external) 4.1.6 (Post copy manager – instance on the destination side)
API
These are internal interfaces within QEMU and are subject to change
interface 3.2.3.1
NAME Remote memory source API
High Level Description / ROLE
Provides the page source with control over the establishment of the link to, and control over, the partner system
consumed by components (internal and external) 4.1.6 (Post copy manager – instance on the source side)
API
These are internal interfaces within QEMU and are subject to change
qemu_savevm_command_send(command,data,len)
Send a command to the page destination system; corresponds to command() message in 3.2.2.1; matching APIs are provided for each command
qemu_file_get_return_path
Retrieve handle to receive messages from the page destination; a consumer (such as the postcopy manager)
sentmap[] API TBD
An ownership table identifying whether a page has been sent to the destination
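Since the sentmap[] API is still TBD, the following is only one possible shape for such an ownership table: a bitmap with one bit per guest page. The class `SentMap` and its method names are hypothetical and are not QEMU interfaces; clearing a bit is shown as the natural counterpart of the DISCARD command in interface 3.2.2.1, which is an assumption about how the two would interact.

```python
class SentMap:
    """One bit per guest page: set when the page has been sent to the destination."""

    def __init__(self, num_pages: int) -> None:
        self.num_pages = num_pages
        self._bits = bytearray((num_pages + 7) // 8)

    def _check(self, page_index: int) -> None:
        if not 0 <= page_index < self.num_pages:
            raise IndexError(f"page index {page_index} out of range")

    def mark_sent(self, page_index: int) -> None:
        """Record that the page has been transmitted to the destination."""
        self._check(page_index)
        self._bits[page_index // 8] |= 1 << (page_index % 8)

    def discard(self, page_index: int) -> None:
        """Clear the bit, e.g. after invalidating a previously transmitted page."""
        self._check(page_index)
        self._bits[page_index // 8] &= ~(1 << (page_index % 8)) & 0xFF

    def is_sent(self, page_index: int) -> bool:
        self._check(page_index)
        return bool(self._bits[page_index // 8] & (1 << (page_index % 8)))


sentmap = SentMap(num_pages=20)
sentmap.mark_sent(3)
sentmap.mark_sent(17)
sentmap.discard(3)
```

A bitmap keeps the table compact for large guests; a structure that also records requested-but-not-yet-sent pages would need more than one bit per page.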
4.3. Modules
Following is a list of modules that will be created as part of the memory externalization.
3.2.1/remap Linux kernel facility for remapping anonymous memory
3.2.1/userfault-madvise Linux kernel facility for marking an area of memory as userfault
3.2.1/userfaultfd Linux kernel mechanism for communicating faults to userspace and allowing userspace to continue execution of a previously blocked thread
3.2.2/fault-wire QEMU userspace code to accept faults from 3.2.1/userfaultfd and organise them for transmission over the network (for 3.2.2.1)
3.2.2/wire-map QEMU userspace code to accept incoming pages, use 3.2.1/remap to remap them and then allow the thread to continue with 3.2.1/userfaultfd
3.2.2/command-handler QEMU protocol management code to accept commands from the source and process them
3.2.2/destination-page-ownership QEMU page management code to keep track of incoming and requested pages
3.2.2/init QEMU code to initialise the kernel userfault code
3.2.3/return-path QEMU protocol management code to provide a return path from destination host to source and format messages carried by it
3.2.3/response-handler QEMU protocol management code to accept responses from the destination and process them