RCL: Design and Open Specification



D3.1.1


Document Information

Scheduled delivery: 31.03.2014
Actual delivery: 31.03.2014
Version: 1.0
Responsible Partner: IBM
Dissemination Level: PU (Public)

Revision History

Date | Editor | Status | Version | Changes
27.02.2014 | Ronen Kat | Draft | 0.1 | Skeleton
05.03.2014 | Ronen Kat | Draft | 0.2 | Added base text
05.03.2014 | Yossi Kuperman | Draft | 0.3 | Added section 3 (I/O consolidation)
09.03.2014 | Yossi Kuperman | Draft | 0.4 | Added text in section 3 (I/O consolidation)
07.03.2014 | Dave Gilbert | Draft | 0.4RH | Added section 4 (Memory Externalization)
10.03.2014 | Ronen Kat | Draft | 0.5 | Added section 5. Edits to section 3.
16.03.2014 | Ronen Kat | Ready for review | 0.6 | Merge and finishing
16.03.2014 | Ronen Kat | Ready for review | 0.7 | Merge and finishing
26.03.2014 | Ronen Kat | Final | 1.0 | Address reviewers' comments

Contributors

Ronen Kat (IBM), Joel Nider (IBM), Yossi Kuperman (IBM), Andrea Arcangeli (Red Hat), David Gilbert (Red Hat)

Internal Reviewers


Glossary of Acronyms

Acronym | Definition
Dx.x.x | Deliverable x.x.x
DoW | Description of Work
I/O | Input / Output in relation to data transfer
EC | European Commission
KVM | A virtualization infrastructure for the Linux kernel which turns it into a hypervisor
Mxx | Month (xx) number from the start of the project
PM | Project Manager
PO | Project Officer
QEMU | Emulator and virtualization machine that allows you to run a complete operating system on top of another
RCL | Resource Consolidation Layer
TBD | To Be Defined
VM | Virtual Machine
VMM | Virtual Machine Manager


Table of Contents

1. Executive Summary
2. Introduction
2.1. Document Focus
2.2. Next Steps
3. I/O Consolidation
3.1. Components
3.2. Interfaces
3.3. Modules
4. Memory Consolidation and Externalization
4.1. Components
4.2. Interfaces
4.3. Modules
5. Cloud Management


List of Figures

Figure 1: ORBIT architecture – focus on resource consolidation layer
Figure 2: I/O consolidation modules and interfaces
Figure 3: Memory externalization components and interfaces


1. Executive Summary

This first delivery of the Resource Consolidation Layer (RCL) Design and Open Specification document focuses on the high-level architecture of the “Resource Externalization and Consolidation as a Facilitator for Fault Tolerance” work package (WP). It shows the components of I/O consolidation, memory consolidation and externalization, and cloud management, together with the interactions between them.

The design and open specification will be extended in future versions, D3.1.2 and D3.1.3, as the project progresses.


2. Introduction

The RCL addresses the end-to-end implementation of virtual resource consolidation through enhancements to the Virtual Machine Manager (VMM) and the Virtual Machine (VM) paravirtualized device drivers, in conjunction with modern hardware that supports x86 full virtualization, such as Intel-VT and AMD-V. The RCL enables a VM to use not only local resources but also remote resources residing on other physical hosts within the data center, and makes them appear to the VM as local resources.

I/O consolidation is the I/O “middleware” that allows network I/O resources and block I/O resources to be accessed and shared by multiple VMs, and enables resources to be “switched” between VMs and physical hosts.

The memory consolidation concept is based on “externalization” of memory, allowing VM memory to physically reside not only on the local physical machine but also on a remote physical machine.

Figure 1: ORBIT architecture – focus on resource consolidation layer.

2.1. Document Focus

This first design and open specification document presents the high-level design of the resource consolidation layer, including the I/O consolidation and memory externalisation components and interfaces.


2.2. Next Steps

In M9 (June 2014), a first software prototype implementation (D3.2.1) will realize the design and open specification in this document. It will be followed in M12 (September 2014) by a scientific report (D3.3.1) describing the outcome of the design and implementation.

The design and open specification will then be expanded with additional details in deliverable D3.1.2 in M18 (March 2015), which will be the next update of this document.


3. I/O Consolidation

Figure 2 shows the components of the I/O consolidation task and their dependencies on internal and external components. Each of these components is implemented as part of Linux kernel version 3.13.

Figure 2: I/O consolidation modules and interfaces.

Note that component 3.3.TBD is part of Task 3.3, which is scheduled to start in M13 (October 2014). The notation TBD reflects that this component has not yet been designed and may in fact be composed of multiple components. The details of component 3.3.TBD will be included in the next update of this document.


3.1. Components

The guest operating system (front-end) runs as a VM on top of KVM [1] and QEMU and has components 3.1.12, 3.1.13, 3.1.11 and 3.1.1 installed. The I/O consolidation hypervisor (back-end) runs on bare metal with the same base kernel and has modules 3.1.22, 3.1.23, 3.1.21 and 3.1.1 installed. Module 3.1.1 is installed on both the front-end and the back-end.

For each component in Figure 2, the following entries give the component number, name, description and role in ORBIT, the success indicator, and the interfaces of the component.

COMPONENT 3.1.1
NAME: Split I/O Ethernet Transport
HIGH LEVEL DESCRIPTION / ROLE: This component is responsible for implementing the Ethernet transport layer to enable efficient and scalable communication between the Split I/O front-end and Split I/O back-end components.
SUCCESS INDICATOR: Less than 20% overhead compared to traditional para-virtual I/O
INTERFACES EXPOSED (internal and external): 3.1.1.1
INTERFACES CONSUMED (internal and external): Only Linux kernel interfaces will be used, none from other components within the project.

COMPONENT 3.1.11
NAME: Split I/O Generic Front-End
HIGH LEVEL DESCRIPTION / ROLE: This component is responsible for providing common Split I/O data services to both the network and block front-ends. It is also responsible for instantiating the virtual block and virtual network devices of the VMs.
SUCCESS INDICATOR: Virtual block and network devices consolidated in a centralized and remote I/O hypervisor
INTERFACES EXPOSED (internal and external): 3.1.11.1, 3.1.11.2
INTERFACES CONSUMED (internal and external)


COMPONENT 3.1.21
NAME: Split I/O Generic Back-End
HIGH LEVEL DESCRIPTION / ROLE: This component is responsible for providing common Split I/O data services to both the network and block back-ends. It is also responsible for exposing a control interface to the cloud management layer that can be used to link the VMs with the corresponding I/O hypervisor and to manage the virtual network and virtual block devices.
BUILDING BLOCK: Linux/KVM
SUCCESS INDICATOR: Virtual block and network devices consolidated in a centralized and remote I/O hypervisor
INTERFACES EXPOSED (internal and external): 3.1.21.1, 3.1.21.2
INTERFACES CONSUMED (internal and external): 3.1.1.1

COMPONENT 3.1.12
NAME: Split I/O Block Front-End
HIGH LEVEL DESCRIPTION / ROLE: This component is responsible for exposing virtual block devices to the guest operating system and sending all the read/write requests to the block back-end running in the I/O hypervisor.
SUCCESS INDICATOR: Virtual block devices consolidated in a centralized and remote I/O hypervisor
INTERFACES EXPOSED (internal and external): None
INTERFACES CONSUMED (internal and external): 3.1.11.1, internal Linux interfaces

COMPONENT 3.1.13
NAME: Split I/O Net Front-End
HIGH LEVEL DESCRIPTION / ROLE: This component is responsible for exposing virtual network devices to the guest operating system and sending/receiving all network frames to/from the network back-end running in the I/O hypervisor.
BUILDING BLOCK: Linux/KVM
SUCCESS INDICATOR: Virtual network devices consolidated in a centralized and remote I/O hypervisor
INTERFACES EXPOSED (internal and external): None
INTERFACES CONSUMED (internal and external): 3.1.11.1


COMPONENT 3.1.22
NAME: Split I/O Block Back-End
HIGH LEVEL DESCRIPTION / ROLE: This component is responsible for processing the block read and block write requests sent by the block front-end running within each VM. It maps the (remote) virtual block device exposed to each VM to a (local) block device.
BUILDING BLOCK: Linux/KVM
SUCCESS INDICATOR: Virtual block devices consolidated in a centralized and remote I/O hypervisor
INTERFACES EXPOSED (internal and external): None
INTERFACES CONSUMED (internal and external): 3.1.21.1, internal Linux interfaces

COMPONENT 3.1.23
NAME: Split I/O Net Back-End
HIGH LEVEL DESCRIPTION / ROLE: This component is responsible for receiving/sending virtual network L2 frames from/to the virtual network devices of each VM. It bridges the (remote) virtual network devices exposed to each VM with a (local) tap/macvtap interface. The tap interface can be connected to any virtual network (e.g. OVS/Linux bridge) and the macvtap interface can be connected to any physical NIC.
BUILDING BLOCK: Linux/KVM
SUCCESS INDICATOR: Virtual network devices consolidated in a centralized and remote I/O hypervisor
INTERFACES EXPOSED (internal and external): None
INTERFACES CONSUMED (internal and external): 3.1.21.1


3.2. Interfaces

The following interfaces are part of Task 3.1 (Consolidation of Virtualized I/O). The interfaces that are external to Task 3.1 include a concrete and detailed API specification, while the APIs internal to Task 3.1 will be detailed in the next update of the document.

For each interface in Figure 2, the following entries give the interface number, name, description and role in ORBIT, and the components that use the interface.

INTERFACE 3.1.1.1
NAME: Split I/O Transport
HIGH LEVEL DESCRIPTION / ROLE: Abstracts the concrete transport (Ethernet) used for the communication. This makes it easy to plug in new transports (e.g. InfiniBand, which will not be implemented as part of this project).
CONSUMED BY COMPONENTS (internal and external): 3.1.11, 3.1.21

INTERFACE 3.1.11.1
NAME: Split I/O Front-end Protocol
HIGH LEVEL DESCRIPTION / ROLE: Defines the generic I/O communication front-end services (based on the virtio protocol) for all types of virtual I/O devices (block and net).
CONSUMED BY COMPONENTS (internal and external): 3.1.12, 3.1.13

INTERFACE 3.1.21.1
NAME: Split I/O Back-end Protocol
HIGH LEVEL DESCRIPTION / ROLE: Defines the generic I/O communication back-end services (based on the virtio protocol) for all types of virtual I/O back-ends (block and net).
CONSUMED BY COMPONENTS (internal and external): 3.1.22, 3.1.23

INTERFACE 3.1.11.2
NAME: Split I/O Internal Control
HIGH LEVEL DESCRIPTION / ROLE: Defines the control operations (e.g. specify I/O hypervisor, create virtual network device, create virtual block device) that can be used by the back-ends to configure and manage the front-ends. This is a logical interface that will be used remotely (RPC over the Split I/O transport).
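The exposed/consumed relationships listed for the I/O consolidation components and interfaces can be written down as a small dependency graph. The sketch below is illustrative only: the dictionaries copy the identifiers from the entries above, while the function names and the idea of deriving a providers-first ordering for each side are assumptions of this example, not part of the specification.

```python
from graphlib import TopologicalSorter

# Interface -> component that exposes it (taken from the component entries).
EXPOSED_BY = {
    "3.1.1.1": "3.1.1",
    "3.1.11.1": "3.1.11",
    "3.1.11.2": "3.1.11",
    "3.1.21.1": "3.1.21",
    "3.1.21.2": "3.1.21",
}

# Interface -> components that consume it (taken from the interface entries).
CONSUMED_BY = {
    "3.1.1.1": ["3.1.11", "3.1.21"],
    "3.1.11.1": ["3.1.12", "3.1.13"],
    "3.1.21.1": ["3.1.22", "3.1.23"],
}

# Components installed on each side, as listed in section 3.1.
FRONT_END = ["3.1.1", "3.1.11", "3.1.12", "3.1.13"]
BACK_END = ["3.1.1", "3.1.21", "3.1.22", "3.1.23"]


def dependencies(components):
    """Map each component to the components whose interfaces it consumes."""
    deps = {component: set() for component in components}
    for interface, consumers in CONSUMED_BY.items():
        provider = EXPOSED_BY[interface]
        for consumer in consumers:
            # Only keep edges where both ends are installed on this side.
            if consumer in deps and provider in deps:
                deps[consumer].add(provider)
    return deps


def providers_first(components):
    """Return one ordering in which every provider precedes its consumers."""
    return list(TopologicalSorter(dependencies(components)).static_order())
```

On either side the transport (3.1.1) comes first, followed by the generic front-end or back-end, followed by the block and net components, which matches the layering the entries describe.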


and Task 3.3 (Cloud Management Integration).

The following shows the interface number, name, description and role in ORBIT, the relevant components which use the interface, and the list of APIs in the interface.

INTERFACE 3.1.21.2
NAME: Split I/O External Control
HIGH LEVEL DESCRIPTION / ROLE: Defines the control operations (e.g. specify I/O hypervisor, specify virtual network devices and block devices, connect a virtual network device with a virtual switch, set the backing store for a virtual block device) that will be used by the cloud management component to configure and manage the system.
CONSUMED BY COMPONENTS (internal and external): 3.3.TBD (described in section 5)
API:

RESET_GUEST_DEVICES()
Resets a guest to a “clean” state, removing all virtual devices.
Parameters:
• IO Guest's management address (e.g. VF's MAC address)

CREATE_BLOCK_DEVICE()
Creates a virtual block device on the specified guest.
Parameters:
• IO Guest's management address (e.g. VF's MAC address)
• Guest's device name (e.g. /dev/vrda)
• IO Hypervisor device name (e.g. /dev/sdb)
• QoS – rate-limit (optional)

REMOVE_BLOCK_DEVICE()
Parameters:
• IO Guest's management address (e.g. VF's MAC address)
• Guest's device name (e.g. /dev/vrda)

CREATE_NETWORK_DEVICE()
Creates a virtual network device on the specified guest.
Parameters:
• IO Guest's management address (e.g. VF's MAC address)
• Guest's device name (e.g. eth7)
• IO Hypervisor device name (e.g. eth5)
• QoS – rate-limit (optional)

REMOVE_NETWORK_DEVICE()
Parameters:
• IO Guest's management address (e.g. VF's MAC address)
• Guest's device name (e.g. eth7)
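To make the shape of these five operations concrete, the following is a minimal in-memory sketch of how a management component might model them. It is not the project's implementation: the class and method names, the error behaviour, and the integer type of the rate limit are all assumptions made for illustration; only the operation set and the parameters come from the list above.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class VirtualDevice:
    kind: str                         # "block" or "network"
    guest_name: str                   # e.g. /dev/vrda or eth7
    hypervisor_name: str              # e.g. /dev/sdb or eth5
    rate_limit: Optional[int] = None  # optional QoS limit; unit is an assumption


@dataclass
class ExternalControl:
    """Hypothetical stand-in for interface 3.1.21.2, keeping state in memory."""
    # guest management address -> {guest device name -> device}
    guests: Dict[str, Dict[str, VirtualDevice]] = field(default_factory=dict)

    def reset_guest_devices(self, guest_addr):
        """RESET_GUEST_DEVICES: drop every virtual device of the guest."""
        self.guests[guest_addr] = {}

    def create_block_device(self, guest_addr, guest_name, hv_name, rate_limit=None):
        """CREATE_BLOCK_DEVICE."""
        self._create("block", guest_addr, guest_name, hv_name, rate_limit)

    def remove_block_device(self, guest_addr, guest_name):
        """REMOVE_BLOCK_DEVICE."""
        self._remove("block", guest_addr, guest_name)

    def create_network_device(self, guest_addr, guest_name, hv_name, rate_limit=None):
        """CREATE_NETWORK_DEVICE."""
        self._create("network", guest_addr, guest_name, hv_name, rate_limit)

    def remove_network_device(self, guest_addr, guest_name):
        """REMOVE_NETWORK_DEVICE."""
        self._remove("network", guest_addr, guest_name)

    def _create(self, kind, guest_addr, guest_name, hv_name, rate_limit):
        devices = self.guests.setdefault(guest_addr, {})
        if guest_name in devices:
            raise ValueError(f"{guest_name} already exists on {guest_addr}")
        devices[guest_name] = VirtualDevice(kind, guest_name, hv_name, rate_limit)

    def _remove(self, kind, guest_addr, guest_name):
        devices = self.guests.get(guest_addr, {})
        device = devices.get(guest_name)
        if device is None or device.kind != kind:
            raise KeyError(f"no {kind} device {guest_name} on {guest_addr}")
        del devices[guest_name]
```

Keying the state by the guest's management address mirrors the fact that every operation takes that address as its first parameter.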


4. Memory Consolidation and Externalization

Figure 3 shows the Task 3.2 components and their dependencies on internal and external components. Component 4.1.6 (Post copy Manager) is part of work package 4, which is outside the scope of this report, and is not described here.

Figure 3: Memory externalization components and interfaces.

[Figure 3 labels: Post copy Manager (4.1.6); Linux kernel mm subsystem enhancements (3.2.1); Remote memory front end (3.2.2); Remote memory handler (3.2.3); Userspace page fault control (3.2.1.1); Remote memory protocol (3.2.2.1); Remote memory src API (3.2.3.1); Remote memory dest API (3.2.2.2)]


4.1. Components

For each component in Figure 3, the following entries give the component number, name, description and role in ORBIT, the success indicator, and the interfaces of the component.

COMPONENT 3.2.1
NAME: Linux kernel mm subsystem enhancements
HIGH LEVEL DESCRIPTION / ROLE: New mechanism to handle page faults in userspace
SUCCESS INDICATOR: Running guest with remote memory (as part of post-copy)
INTERFACES EXPOSED (internal and external): 3.2.1.1
INTERFACES CONSUMED (internal and external): Internal kernel interfaces

COMPONENT 3.2.2
NAME: Remote memory front end
HIGH LEVEL DESCRIPTION / ROLE: Routes page requests/data between the kernel on the destination machine and the network towards the source machine
SUCCESS INDICATOR: Running guest with remote memory (as part of post-copy)
INTERFACES EXPOSED (internal and external): 3.2.2.1
INTERFACES CONSUMED (internal and external): 3.2.1.1

COMPONENT 3.2.3
NAME: Remote memory handler
HIGH LEVEL DESCRIPTION / ROLE: Satisfies page requests from the source machine, routes control messages to the Remote memory front end
SUCCESS INDICATOR: Running guest with remote memory (as part of post-copy)


4.2. Interfaces

For each interface in Figure 3, the following entries give the interface number, name, description and role in ORBIT, the components that use the interface, and the list of APIs in the interface.

INTERFACE 3.2.1.1
NAME: Userspace page fault control
HIGH LEVEL DESCRIPTION / ROLE: Provides an efficient way for user-space to register for faults on a region of virtual memory, and a way for it to respond to those faults.
CONSUMED BY COMPONENTS (internal and external): 3.2.2, but the interface must also be designed to be generally useful to other users
API:

madvise(start,length, MADV_[NO]USERFAULT)
    [Un]mark a region of anonymous virtual memory such that pages that have not been allocated cause a 'user fault', which is notified on a userfaultfd and causes the faulting thread to pause.

int userfaultfd(int flags) (TBD)
    Open a file descriptor to receive user fault notifications and to unpause faulted threads.

remap_anon_pages(dest,src,len)
    Move the mapping of a page of anonymous memory to a new virtual memory location.

read(userfaultfd
    Receive an address from the kernel of a page that needs to be filled by userspace.
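The fault, notify, fill and resume cycle that this interface describes can be illustrated with a toy user-space simulation. The sketch below does not call the kernel API: `UserfaultRegion`, `fault_handler` and the 4096-byte page size are hypothetical names and values chosen for illustration, a `queue.Queue` stands in for the userfaultfd, and `fill` stands in for the remap-and-unpause step.

```python
import queue
import threading

PAGE_SIZE = 4096  # assumed page size for the simulation


class UserfaultRegion:
    """Toy model of a region whose unallocated pages raise 'user faults'."""

    def __init__(self):
        self.pages = {}              # page address -> bytes
        self.faults = queue.Queue()  # stands in for the userfaultfd
        self.waiters = {}            # page address -> Event for paused readers
        self.lock = threading.Lock()

    def read(self, addr):
        """Return the page containing addr, pausing until it has been filled."""
        page = addr - addr % PAGE_SIZE
        with self.lock:
            if page in self.pages:
                return self.pages[page]
            event = self.waiters.get(page)
            if event is None:
                # First fault on this page: notify the handler exactly once.
                event = self.waiters[page] = threading.Event()
                self.faults.put(page)
        event.wait()  # the faulting thread pauses here
        with self.lock:
            return self.pages[page]

    def fill(self, page, data):
        """Install a page and resume every thread paused on it."""
        with self.lock:
            self.pages[page] = data
            event = self.waiters.pop(page, None)
        if event is not None:
            event.set()


def fault_handler(region, backing_store):
    """Serve fault notifications from backing_store until None is queued."""
    while True:
        page = region.faults.get()
        if page is None:
            return
        region.fill(page, backing_store[page])
```

A reader thread that touches a missing page blocks inside `read` until the handler thread has called `fill`, which mirrors how a faulting thread stays paused until userspace has supplied the page.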


INTERFACE 3.2.2.1
NAME: Remote memory protocol
HIGH LEVEL DESCRIPTION / ROLE: Provides a mechanism for passing pages on demand between virtual machines, and for controlling the process.
CONSUMED BY COMPONENTS (internal and external): 3.2.3, and interacts with the existing migration protocol
PROTOCOL ENTRIES: These entries correspond to packets over a modified migration protocol.

rp.REQPAGES(start,len) (dest->src)
    A request for a region of guest physical memory of the given length. The source must ensure that at least these pages are provided.

PAGE(address,data) (src->dest)
    A page of guest physical memory (modification of semantics of existing migration message).

command(code,data) (src->dest)
    Provides a synchronisation and control mechanism for subcommands; the uses are TBD, but some are given below.

command(SENSITISE_RAM) (src->dest)
    Causes the destination to switch into userfault mode.

command(OPENRP) (src->dest)
    Asks the destination to open a 'return path' for sending page requests over.

command(REQACK(id)) (src->dest)
    Requests an acknowledgement from the destination.

command(DISCARD, (start, range)[]) (src->dest)
    Invalidates a previously transmitted page, requiring the destination to send a REQPAGES to recover it.

rp.ACK(id)
    Response to a REQACK.
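The exchange defined by these entries can be traced with a small in-process model. This is a sketch under stated assumptions, not the QEMU wire format: messages are plain tuples, the `Source` and `Destination` classes and their methods are hypothetical, addresses and lengths are taken to be in bytes with a 4096-byte page, and each DISCARD entry is read as a (start, length) pair.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

PAGE = 4096  # assumed page size


@dataclass
class Source:
    memory: Dict[int, bytes]                         # page address -> contents
    sentmap: Set[int] = field(default_factory=set)   # pages already sent

    def send_page(self, address) -> Tuple:
        """Build a PAGE message and record the page as sent."""
        self.sentmap.add(address)
        return ("PAGE", address, self.memory[address])

    def discard(self, ranges) -> Tuple:
        """Build command(DISCARD, ...) and forget that the pages were sent."""
        for start, length in ranges:
            for address in range(start, start + length, PAGE):
                self.sentmap.discard(address)
        return ("COMMAND", "DISCARD", ranges)

    def handle(self, msg) -> List[Tuple]:
        """Process one return-path message from the destination."""
        if msg[0] == "REQPAGES":
            _, start, length = msg
            return [self.send_page(a) for a in range(start, start + length, PAGE)]
        raise ValueError(f"unexpected message {msg!r}")


@dataclass
class Destination:
    memory: Dict[int, bytes] = field(default_factory=dict)
    userfault_mode: bool = False
    return_path_open: bool = False

    def fault(self, address) -> Tuple:
        """Build rp.REQPAGES for the page containing a faulting address."""
        return ("REQPAGES", address - address % PAGE, PAGE)

    def handle(self, msg) -> List[Tuple]:
        """Process one src->dest message; return replies for the return path."""
        if msg[0] == "PAGE":
            _, address, data = msg
            self.memory[address] = data
            return []
        if msg[0] == "COMMAND":
            code = msg[1]
            if code == "SENSITISE_RAM":
                self.userfault_mode = True
                return []
            if code == "OPENRP":
                self.return_path_open = True
                return []
            if code == "REQACK":
                return [("ACK", msg[2])]
            if code == "DISCARD":
                for start, length in msg[2]:
                    for address in range(start, start + length, PAGE):
                        self.memory.pop(address, None)
                return []
        raise ValueError(f"unexpected message {msg!r}")
```

After a DISCARD, the page is absent on the destination again, so the next access to it has to go through `fault` and a fresh REQPAGES, as the entry for DISCARD requires.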

INTERFACE 3.2.2.2
NAME: Remote memory destination API
HIGH LEVEL DESCRIPTION / ROLE: Provides the page destination with control over the link to, and control over, the source system.
CONSUMED BY COMPONENTS (internal and external): 4.1.6 (Post copy manager – instance on the destination side)
API: These are internal interfaces within QEMU and are subject to change.


INTERFACE 3.2.3.1
NAME: Remote memory source API
HIGH LEVEL DESCRIPTION / ROLE: Provides the page source with control over the establishment of the link to, and control over, the partner system.
CONSUMED BY COMPONENTS (internal and external): 4.1.6 (Post copy manager – instance on the source side)
API: These are internal interfaces within QEMU and are subject to change.

qemu_savevm_command_send(command,data,len)
    Send a command to the page destination system; corresponds to the command() message in 3.2.2.1; matching APIs are provided for each command.

qemu_file_get_return_path
    Retrieve handle to receive messages from the page destination; a consumer (such as the postcopy manager)

sentmap[] API TBD
    An ownership table identifying whether a page has been sent to the destination.


4.3. Modules

The following modules will be created as part of the memory externalization.

3.2.1/remap: Linux kernel facility for remapping anonymous memory
3.2.1/userfault-madvise: Linux kernel facility for marking an area of memory as userfault
3.2.1/userfaultfd: Linux kernel mechanism for communicating faults to userspace and allowing userspace to continue execution of a previously blocked thread
3.2.2/fault-wire: QEMU userspace code to accept faults from 3.2.1/userfaultfd and organise them for transmission over the network (for 3.2.2.1)
3.2.2/wire-map: QEMU userspace code to accept incoming pages, use 3.2.1/remap to remap them and then allow the thread to continue with 3.2.1/userfaultfd
3.2.2/command-handler: QEMU protocol management code to accept commands from the source and process them
3.2.2/destination-page-ownership: QEMU page management code to keep track of incoming and requested pages
3.2.2/init: QEMU code to initialise the kernel userfault code
3.2.3/return-path: QEMU protocol management code to provide a return path from destination host to source and format messages carried by it
3.2.3/response-handler: QEMU protocol management code to accept responses from the destination and process them
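As an illustration of what keeping track of incoming and requested pages (module 3.2.2/destination-page-ownership) could involve, the sketch below models each page as being in one of three states. The state names, the `PageOwnership` class and the rule that a page is requested at most once while outstanding are assumptions of this example; the specification itself leaves the design open.

```python
from enum import Enum


class PageState(Enum):
    MISSING = "missing"      # not present and not yet requested
    REQUESTED = "requested"  # request sent to the source, page not yet arrived
    RECEIVED = "received"    # page present on the destination


class PageOwnership:
    """Hypothetical destination-side table of per-page states."""

    def __init__(self):
        self._state = {}  # page address -> PageState; absent means MISSING

    def state(self, page):
        return self._state.get(page, PageState.MISSING)

    def should_request(self, page):
        """Return True only for the first fault on a missing page."""
        if self.state(page) is PageState.MISSING:
            self._state[page] = PageState.REQUESTED
            return True
        return False

    def received(self, page):
        """Record an incoming page, whether it was requested or not."""
        self._state[page] = PageState.RECEIVED

    def discarded(self, page):
        """Return a page to MISSING so that it can be requested again."""
        self._state.pop(page, None)
```

Returning True only once per outstanding page keeps repeated faults on the same address from generating duplicate requests to the source.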


5. Cloud Management

Cloud management is part of Task 3.3, which is scheduled to start in M13 (October 2014) and will implement component 3.3.TBD, a placeholder for the cloud management implementation. The OpenStack components that are to be enhanced are emphasized (as circles) in the figure below (taken from [3]), and include:

• Compute component – Nova
• Block storage component – Cinder
• Networking component – Neutron
• Dashboard (management) component – Horizon


6. References

[1] KVM – Kernel Based Virtual Machine. http://www.linux-kvm.org
[2] QEMU – Open source processor emulator. http://wiki.qemu.org
[3] OpenStack Cloud Administrator Guide – Havana.
