Long-Distance Migration Rev. 2.0
Table of Contents
Legal Notice
Executive Summary
Purpose
Taxonomy
Assumptions
What Does “Long Distance” Mean?
Long-Distance Migration Usage Scenarios
General Migration Classes
Representative Migration Usage Scenarios
Units of Long-Distance Migration
Network Concept for Migration Considerations
General Migration Workflow
Schema for Usage Scenarios
Usage Scenario Group 1 - Relocate
Usage Scenario Group 2 - Extend
Usage Scenario Group 3 - Sustain
Key Performance Indicators
Service Tiers
RFP Requirements
Solution Providers
Service Providers
Summary of Industry Actions Required
Appendix 1: Example Use Cases
Use Case 1 – Data Affinity
Use Case 2 – Follow-the-Sun
Use Case 3 – Disaster Recovery
Use Case 4 – Follow-the-Moon
Use Case 5 – Cloud Bursting
Use Case 6 – Data Center Downtime/Closure
Appendix 2: Suggested Future Enhancements
Contributors
Axel Knut Bethkenhagen, BMW
Mustan Bharmal, T-Systems International GMBH
Sudip Chahal, Intel IT
Alan Clarke, SUSE
Ravi A. Giri, Intel IT
Bernd Henning, Fujitsu
Eric Kristoff, ODCA Infra Workgroup
Tobias Kunze, Red Hat
Ben MP Li, Deutsche Bank
Geoff Poskitt, Fujitsu
Peter Pruijssers, Atos
Erik Rudin, Science Logic
Avi Shvartz, Bank Leumi
Ryan Skipp, Deutsche Telekom
Catherine Spence, Intel IT
Mick Symonds, Atos
Arivou Tandabany, Telstra
Hans van de Koppel, Capgemini
Stephanie Woolson, Lockheed Martin
Legal Notice
© 2012-2013 Open Data Center Alliance, Inc. ALL RIGHTS RESERVED.
This “Long-Distance Migration” document is proprietary to the Open Data Center Alliance (the “Alliance”) and/or its successors and assigns.
NOTICE TO USERS WHO ARE NOT OPEN DATA CENTER ALLIANCE PARTICIPANTS: Non-Alliance Participants are only granted the right to review, and make reference to or cite this document. Any such references or citations to this document must give the Alliance full attribution and must acknowledge the Alliance’s copyright in this document. The proper copyright notice is as follows: “© 2012-2013 Open Data Center Alliance, Inc. ALL RIGHTS RESERVED.” Such users are not permitted to revise, alter, modify, make any derivatives of, or otherwise amend this document in any way without the prior express written permission of the Alliance.
NOTICE TO USERS WHO ARE OPEN DATA CENTER ALLIANCE PARTICIPANTS: Use of this document by Alliance Participants is subject to the Alliance’s bylaws and its other policies and procedures.
NOTICE TO USERS GENERALLY: Users of this document should not reference any initial or recommended methodology, metric, requirements, criteria, or other content that may be contained in this document or in any other document distributed by the Alliance (“Initial Models”) in any way that implies the user and/or its products or services are in compliance with, or have undergone any testing or certification to demonstrate compliance with, any of these Initial Models.
The contents of this document are intended for informational purposes only. Any proposals, recommendations or other content contained in this document, including, without limitation, the scope or content of any methodology, metric, requirements, or other criteria disclosed in this document (collectively, “Criteria”), does not constitute an endorsement or recommendation by Alliance of such Criteria and does not mean that the Alliance will in the future develop any certification or compliance or testing programs to verify any future implementation or compliance with any of the Criteria.
LEGAL DISCLAIMER: THIS DOCUMENT AND THE INFORMATION CONTAINED HEREIN IS PROVIDED ON AN “AS IS” BASIS. TO THE MAXIMUM EXTENT PERMITTED BY APPLICABLE LAW, THE ALLIANCE (ALONG WITH THE CONTRIBUTORS TO THIS DOCUMENT) HEREBY DISCLAIM ALL REPRESENTATIONS, WARRANTIES AND/OR COVENANTS, EITHER EXPRESS OR IMPLIED, STATUTORY OR AT COMMON LAW, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE, VALIDITY, AND/OR NONINFRINGEMENT. THE INFORMATION CONTAINED IN THIS DOCUMENT IS FOR INFORMATIONAL PURPOSES ONLY AND THE ALLIANCE MAKES NO REPRESENTATIONS, WARRANTIES AND/OR COVENANTS AS TO THE RESULTS THAT MAY BE OBTAINED FROM THE USE OF, OR RELIANCE ON, ANY INFORMATION SET FORTH IN THIS DOCUMENT, OR AS TO THE ACCURACY OR RELIABILITY OF SUCH INFORMATION.
EXCEPT AS OTHERWISE EXPRESSLY SET FORTH HEREIN, NOTHING CONTAINED IN THIS DOCUMENT SHALL BE DEEMED AS GRANTING YOU ANY KIND OF LICENSE IN THE DOCUMENT, OR ANY OF ITS CONTENTS, EITHER EXPRESSLY OR IMPLIEDLY, OR TO ANY INTELLECTUAL PROPERTY OWNED OR CONTROLLED BY THE ALLIANCE, INCLUDING, WITHOUT LIMITATION, ANY TRADEMARKS OF THE ALLIANCE.
TRADEMARKS: OPEN DATA CENTER ALLIANCE℠, ODCA℠, and the OPEN DATA CENTER ALLIANCE logo® are trade names, trademarks, and/or service marks (collectively “Marks”) owned by Open Data Center Alliance, Inc. and all rights are reserved therein. Unauthorized use is strictly prohibited. This document does not grant any user of this document any rights to use any of the ODCA’s Marks. All other service marks, trademarks and trade names referenced herein are those of their respective owners.
OPEN DATA CENTER ALLIANCE USAGE Model:
Long-Distance Migration Rev. 2.0
Executive Summary
Historically, workload migration over any distance has been an immensely complex, time-consuming, and disruptive activity. With the adoption of virtualization at the core of modern infrastructure services, the complexity of the underlying infrastructure may have increased, but it is finally possible to eliminate time and disruption from migration activities, although that potential is currently constrained to relatively short distances and low latencies.
The drivers for long-distance migration capabilities are really no different from those that led to the current state of technology, which can be considered in terms of three foundational classes: migration, extension, and sustaining. This document presents practical usage models based on these classes, along with success and failure scenarios for each. Finally, a discussion of service provider requirements and an industry call to action is presented.
This version of the usage model extends the prior version of this document in the following ways:
• The name change to “Long-Distance Migration” includes the whole system and not just the “workload.”
• The reference base of platform as a service (PaaS) and compute infrastructure as a service (CIaaS) now includes storage and network.
• Usage scenarios have been updated to include data migration activities.
• The focus is still on a “single system” migration for this iteration as opposed to a complete computing landscape.
• The scope has expanded to include “any cloud provider,” public or private.
This document serves a variety of audiences. Business decision makers looking for specific solutions and enterprise IT groups involved in planning, operations, and procurement will find this document useful. Solution providers and technology vendors will benefit from its content to better understand customer needs and to tailor service and product offerings. Standards organizations will find the information helpful in defining end-user relevant and open standards.
Purpose
There are several motivations for long-distance migration. Business drivers typically fall into one of several categories.1
• Business continuity. Migrate or replicate systems to establish an orderly evacuation of a data center that has experienced or anticipates experiencing failures, security breaches, or other disruptions. This encompasses both disaster recovery and disaster avoidance. Seasonal migrations in high-risk areas would also be included here.
• Resource locality. In some cases it is impossible, impractical, or too costly to bring data or other resources close to the compute environment. Therefore, the application and services move closer to the needed resources. This move could be driven by data volume, device requirements, power costs, data residency requirements, or other legal/regulatory restrictions. “Data affinity” refers to moving computing to the data location because the volume of data is too large to move to where the compute is taking place. “Follow-the-moon” migrations take power and cooling costs into consideration to move the system to where costs are lowest.
• Follow-the-sun. Move computing close to users based on geographic locations, time zone considerations, and other factors. This is performed on a proactive and scheduled basis. Follow-the-sun is a special variant of resource locality where the users are the immovable “resource” around which the system revolves. “Lazy-follow-the-sun” refers to a follow-the-sun model that has no hard cut-off time for users accessing the source system, so an overlap exists during the migration process.
• Dynamic scaling. Also referred to as “data center expansion” or “cloud bursting.” Dynamic scaling is the need to dynamically and elastically acquire and dispose of capacity. We include both real-time (on-demand) scaling and proactively scheduled scaling. The benefit is that the clients (cloud subscribers) should no longer have to build for peak demand.
• Data center migration. Migrate to a new facility because the cloud subscriber is temporarily or permanently changing facilities. For example, this migration could be driven by a data center consolidation program or by the termination of a relationship with a given cloud provider.
Note: The compute and cloud marketplaces are currently out of the scope of this usage model.
1 Travostino, F., et al., “Seamless Live Migration of Virtual Machines Over the MAN/WAN.” (2005). http://dl.acm.org/citation.cfm?id=1160266
Taxonomy
Table 1 lists the standard terms and definitions used in this document.
Table 1. Terms and definitions.
Actor Description
Cloud Provider An organization providing compute, storage, and network services (such as infrastructure as a service, platform as a service, or software as a service) and charging cloud subscribers based on their actual proportional usage (pay per use) of the allocated resources. A (public) cloud provider provides services over the Internet. A cloud subscriber could also be its own cloud provider.
Cloud providers can be considered to be in two groups in the context of system movement.
• Source cloud provider - The site or data center where the cloud application or system is originally located
• Target cloud provider - The site or data center where the cloud application or system is to be moved or copied
Cloud Service Broker An entity that manages the use, performance, and delivery of cloud services and negotiates relationships between cloud providers and cloud subscribers. A cloud service broker has no cloud services of its own.
Cloud Standards Body An entity responsible for setting and maintaining the cloud orchestration standards considered in this usage model. These standards include the following:
• Global standards bodies - Typically technology orientated
• Country standards bodies - Related to compliance and local rules
• Industry sector standards bodies - Pertinent to a specific sector, such as medical
• Company standards bodies - Local to an enterprise, such as security
• Open standards bodies - Driven by open source bodies or communities with significant momentum
Cloud Subscriber A person or organization that has been authenticated to a cloud and maintains a business relationship with a cloud, for use of the resources provided thereby according to the defined terms of the agreement.
Data Center Operator A person or organization operating a generic IT or enterprise data center for its own services and/or as a service for third parties.
Term Description
Interconnectability The parallel process in which two coexisting environments can communicate and interact.
Interoperability Refers to how cloud-based applications and services coexist and interact with each other throughout various data center environments. The two key aspects are interconnectability and portability.
Long Distance For the purposes of this usage model, “long distance” is defined as greater than 20 kilometers of conductor between disparate data centers (cloud provider sites). Inter-site latency is assumed to be 10 milliseconds or more.
Migration - At Rest At-rest migration allows the transfer of a fully-stopped virtual machine instance or a machine image from one provider site to another site, or between disparate providers. It also may include migrating applications, services, and their contents from one site (or provider) to another. By definition, only one site may accept connections for the subject workload at a time. This is also referred to as “cold” migration.2
Migration - Live Live migration approximates continuous operation of a set of processes even as they are being sent to another physical location. This may also be referred to as “hot” migration.3 The user of the workload is not impacted and experiences no service downtime.
Portability The serial process of moving a system from one cloud environment to another.
Traffic Trombone The network traffic between disparate data centers caused by separated workload components during a live migration.4
2 National Institute of Standards and Technology (NIST), “NIST Cloud Computing Reference Architecture.” (2011). http://collaborate.nist.gov/twiki-cloud-computing/pub/CloudComputing/ReferenceArchitectureTaxonomy/NIST_SP_500-292_-_090611.pdf
3 Travostino, F., et al., “Seamless Live Migration of Virtual Machines Over the MAN/WAN.” (2005). http://dl.acm.org/citation.cfm?id=1160266
4 “VMware ‘vFabric’ and the Potential Impact on Data Centre Network Design – The ‘Network Trombone’.” http://etherealmind.com/vmware-vfabric-data-centre-network-design
Figure 1 illustrates the key actors that participate in a long-distance migration.
[Figure: the Cloud Subscriber’s system moves from the Source Cloud Provider (Source Cloud System) to the Target Cloud Provider (Target Cloud System); service movement distance > 20 km.]
Figure 1. The key actors participating in a long-distance migration.
Assumptions
The following assumptions apply to the usage scenarios contained herein:
• Where cloud provider and cloud subscriber are separate legal organizations, there is a properly executed service agreement similar to the Commercial Services Framework from the Open Data Center Alliance (ODCA).
• Cloud providers implement the “ODCA Service Catalog” Usage Model.5
• Compliance with “ODCA VM Interoperability” Usage Model,6 including import and export of virtual machine (VM) packages per the Open Virtualization Format (OVF).
• For extension and sustaining foundational class “migrations,” we assume the applications built on cloud provider services follow the practices of the ODCA white paper “Developing Cloud-Capable Applications”.7 Note: Migration class migrations do not depend on this assumption.
• Service level agreements (SLA), operating level agreements, and relevant controls are specified between the cloud provider and the cloud subscriber. These agreements and controls may include the following:
– Type of migration capabilities required
– At-rest migration versus live migration
– Availability
– Security root of trust
– Consistency of management
– Security and compliance
– Carbon measurement
– Geographic hosting requirements
– System owner, and roles and responsibilities of relevant parties
– Roll-back criteria
– Duration of roll-back availability
• All clocks are synchronized to Coordinated Universal Time (UTC).
• Cloud systems can be stopped for at-rest migration.
• A cloud system can be described in a “manifest.” The manifest is simply a list and dependency mapping of all relevant components and requirements. (A manifest implementation format is beyond the scope of this usage model).
• Unless specified otherwise, all migrations occur between disparate data centers of one or more cloud providers.
• Migrations are initiated based on the cloud subscriber’s intent, either explicitly by the subscriber or automatically, based on a subscriber-defined policy.
• Unless specified otherwise, connectivity between source and target data centers is over the public Internet.
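The manifest assumption above (a list and dependency mapping of components) can be illustrated with a minimal sketch. The usage model deliberately leaves the manifest format out of scope, so the dictionary layout and the component names below are purely hypothetical; the sketch only shows how a dependency mapping could yield an order in which components are brought up at the target site.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical manifest: each component maps to the set of components
# it depends on. Names are illustrative, not taken from the usage model.
manifest = {
    "database": set(),
    "message-queue": set(),
    "app-server": {"database", "message-queue"},
    "load-balancer": {"app-server"},
}


def start_order(dependencies):
    """Return components ordered so that every dependency comes first."""
    return list(TopologicalSorter(dependencies).static_order())


order = start_order(manifest)
print(order)
```

A real manifest would also carry the requirements mentioned in the assumption (for example, capacity or compliance constraints); only the dependency ordering is sketched here.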
5 www.opendatacenteralliance.org/library
6 www.opendatacenteralliance.org/library
7 www.opendatacenteralliance.org/library
What Does “Long Distance” Mean?
A key factor in IT landscape design, and especially transition, is whether the source and target locations are “close enough” that they can reliably communicate synchronously, and thus be treated as one location. The determining factor is the latency (time taken) for any I/Os or messages between the locations. A figure of less than 10 msec, one way, is often seen as a threshold, but it depends on the application and usage.
How that latency arises is a function of the laws of physics (that is, the speed of light in a fiber), engineering (how switches and protocol converters work), and configuration (how they are set up, at both ends of the link). Very often, the geographical distance is used as a “shorthand” to encapsulate this constraint, although it is only one of the factors.
The term “long distance” in this document is used to indicate that the environments involved cannot be treated as if they were in one “virtual data center”: the distance is such that the latency prevents their transparent synchronous use. Thus, special arrangements have to be made to transfer them, especially if they are to remain in use during the transfer.
There is not one fixed definition of what this distance actually is: a “lowest common denominator” value is usually required, to be sure that the most sensitive application can still function in a heterogeneous environment. It seems to be generally accepted that the practical limit for most purposes is around a 40-kilometer (km) cable length, which equates to around a 25-km geographical distance. Note that there is not one, hard distance limitation: It depends very much on the usage and behavior of the application system(s) being used.
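To make the physics component of latency concrete, the following sketch estimates one-way propagation delay over a fiber of a given cable length. It assumes a refractive index of about 1.468 (a typical value for single-mode fiber, not a figure from this document) and ignores switches, protocol converters, and configuration, which the text above identifies as additional contributors.

```python
C_VACUUM_KM_PER_S = 299_792.458  # speed of light in vacuum, km/s


def one_way_propagation_ms(cable_km, refractive_index=1.468):
    """Estimate one-way propagation delay in milliseconds.

    refractive_index is an assumed typical value for single-mode fiber;
    equipment and protocol delays are NOT included.
    """
    speed_in_fiber = C_VACUUM_KM_PER_S / refractive_index  # km/s
    return cable_km / speed_in_fiber * 1000.0


for km in (20, 40, 1000):
    print(f"{km:>5} km cable: {one_way_propagation_ms(km):.3f} ms one way")
```

Under these assumptions, propagation accounts for roughly 5 microseconds per kilometer of cable, which suggests why distance alone is only a shorthand: at the cable lengths discussed here, equipment and configuration can contribute a significant share of the observed latency.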
Factors to be considered in the configuration of clouds in multiple data centers include:
• Ownership and configuration. If both locations are owned or managed by the same organization—the user or a cloud provider—the connections can be transparent and optimized; otherwise other elements such as firewalls may be required, which introduce further latency.
• Synchronous or asynchronous connections. Very high bandwidth links can be made available, but physical limitations remain: latency bounds how quickly data can be transferred, and link capacity bounds the volume of data that can be moved. For short distances, the connection can be used synchronously, so that the center in which any one system resides is treated as irrelevant for these purposes. Beyond that distance, the delays may influence the design of the system landscape and asynchronous connections are required, with the limitations on use that entails (for example, the potential for lost data).
• Storage. The prime consideration regarding performance, continuity, and disaster recovery is typically the data storage instead of the processing capacity.
The use of this distance facility also extends to the need for backup and archiving. Various configurations can be constructed, using a combination of clouds, and single and twin centers, with a possible further long-distance facility for disaster recovery purposes. The current trend for business-critical systems seems to be toward synchronously connected twin centers for high availability, with a third center at a greater distance for backup, archiving, and disaster recovery. Data can be synchronously replicated within one site, between two sites, and backed up asynchronously to a third, all without host involvement.
Ideally, storage is architected as a service in its own right; this can strongly influence the use and deployment of such data centers. Information Lifecycle Management (ILM) storage tiers determine which facilities are used for which data and how many copies are held where.
There are a number of options regarding the technologies used to connect the centers together, with consequent levels of feasible usage. Dark fibers leave most options open as to usage; alternatively, especially between different organizations, a higher-level protocol service may be deployed. Various connection technologies can be used to provide synchronous connection up to a distance of around a 40-km cable length; the geographical distance will be less, because the cables cannot always follow straight lines and especially because dual diverse routes are required, so that one event cannot sever all connections. Beyond that distance, different laser technologies and signal repeaters may be needed, and the feasibility really does depend on the nature of the application(s) involved and whether they are susceptible to transaction time lengthening due to the extra latency that is inevitable.
To use such facilities across and between cloud providers and/or data centers, consideration also has to be given to network (IP) and storage (SAN) addressing schemes. They have to be unified if the systems are to be deployed and used transparently.
Long-Distance Migration Usage Scenarios
General Migration Classes
There are three generic classes from which specific migration usage scenarios can be built. These classes can help address specific business usage scenarios not yet encountered, and they provide a way to generalize implementations. These generic migration classes are as follows:
• Relocate. Implies the cloud application or system can be at only one place at a time. It can be thought of as “moving” the workload.
• Extend. Extends the cloud application or system into new places. It can be thought of as “growing” the system.
• Sustain. Sustains the cloud application or system. It can be thought of as “preserving” or “keeping” the system.
Relocating allows a cloud application or system to move from one location to another. Pure migration implies a serial sequence: being operational at only one place at a time. It does include geographic diversity and potential fault detection, but it does not include elements of load distribution or horizontal scalability. Relocation can address disaster recovery sustainability with planning, pre-placement, and failure detection mechanisms.
Extending a cloud application or system extends a function into additional geographic and/or network topological locations: The computing runs in both locations in parallel. It includes elements of geographic diversity, horizontal scalability, and a load distribution mechanism (that is, a global load balancer). Increased capacity, throughput, and performance are common goals of workload extension. “Extend” could also be used to gradually vacate unneeded sites as well.
Sustaining a cloud application or system provides for continuity of operation during and after an event. Elements of sustainability include geographic diversity, horizontal scalability (support for multiple instances), and a load distribution mechanism (that is, a global load balancer). Increased availability is the goal. Sustaining is analogous to the Internet routing around damage and can address disaster avoidance requirements.
There may be some commonality or overlap between the extending and the sustaining migration classes.
Representative Migration Usage Scenarios
In the context of the migration classes, there are many situations that require the movement of systems or selected components of systems among locations. Based on the migration class and the potential level of impact of the migration to the user (a possibly prolonged service interruption on the one extreme and no service interruption on the other extreme), most of the possible usage scenarios fall into groups. These groups are not exclusive and overlap may exist. The diagram in Figure 2 shows these groups.
[Figure: usage scenarios plotted by migration class (Relocate, Extend, Sustain) against impact to the cloud subscriber’s service availability, ranging from service downtime to enable service migration to continuous availability of service during service migration. The scenarios are clustered into Usage Scenario Groups 1, 2, and 3. Scenarios shown: data center closure, data affinity, asynchronous disaster recovery, data center consolidation, follow the sun, follow the moon, cloud bursting, lazy follow the sun, lazy follow the moon, franchise, create system copy, service scaling, synchronous disaster recovery, data center migration, continuous business availability.]
Figure 2. Migration scenarios.
This document does not address each of the possible individual usage scenarios, but instead it provides a schema and addresses the three obvious groups of usage scenarios in the context of that schema. The movement of systems in these scenarios could be within a corporation or between external cloud services, but essentially the movement is greater than 20 km (that is, more than 20 km between the source and target locations).
Each of the business drivers, explained in the Purpose section, can be addressed by at least one of the migration classes. Table 2 lists which migration class could be used to address each business driver. Where a given migration class can be applied to a business driver, a sample usage scenario is indicated. Empty cells indicate where a migration may not be applicable.
Table 2. Examples of migration usage scenarios mapped to migration classes.
Migration Class
Motivation Relocate Extend Sustain
Business Continuity Disaster recovery Disaster avoidance
Resource Locality Data affinity Follow-the-moon
Follow-the-Sun Follow-the-sun Lazy-follow-the-sun
Dynamic Scaling Cloud bursting
Data Center Migration Data center closure
Specific examples for each of the situations in Table 2 are addressed in Appendix 1: Example Use Cases. There is some overlap in which more than one migration class can meet a given business driver. Depending on the specific business requirements of a given situation, there may be more than one solution.
System migration is a fundamental requirement in many different kinds of business situations. It is impossible to enumerate each potential such usage scenario. The six use cases described in Appendix 1: Example Use Cases are representative of commonly observed business requirements. There are, of course, other variations. Cloud subscribers can articulate these in a common manner using the workflow steps described in the General Migration Workflow section below.
All of the use cases are applicable to private, community, public, and hybrid clouds.
Units of Long-Distance Migration
Migration of work across clouds relies on the encapsulation of movable elements, which can be described in layers. Each layer is an independent set of items such as application processes, data, configuration information, and state. For IaaS, the abstraction includes VMs. PaaS contains higher-order abstractions of the OS and middleware layers, which can include items such as database, object storage, and message queuing.
Four distinct layers of movable elements have been identified for this usage model as shown in Table 3.
Table 3. A layered model for a migrating system.
Layer | Scope | Relevant Standards
3 - Migration Protocol | Procedures, processes, methodologies, and checklists, including Start/Halt/Ack/Stop/Failure/Finish | None
2 - Business; Non-Functional | Appliance metadata, disaster recovery, high availability, advanced security configurations, compliance requirements, automated scaling parameters, carbon footprint requirements | Topology and Orchestration Specification for Cloud Applications (TOSCA)
1 - Application and System Management | Software binaries, platform-as-a-service application stack, cloud broker service or appliance, monitoring, logging, auditing, licensing, support | Cloud Infrastructure Management Interface (CIMI); Service Provisioning Markup Language (SPML); System for Cross Domain Identity Management (SCIM)
0 - Base Runtime | Virtual appliance including system configurations, additional metadata, and session-related data | Open Virtualization Format (OVF)
At the lowest layer, the base runtime centers on the virtual appliance. A virtual appliance is a preinstalled, preconfigured OS and software stack encompassing one or more VMs. Each VM is an independently installable runtime entity comprising an OS, applications, and other application-specific data, as well as a specification of the virtual hardware that is required by the VM. Metadata beyond what is contained in the virtual appliance usually includes items such as the following:
• Network. Load balancers (yes/no), connections to other systems inside and outside the cloud.
• Security. VLAN definitions, authorization and authentication data, firewall rules, and encryption keys.
• Compute. Whether memory overcommit is permitted and use of hardware features such as CPU attributes or flash memory.
• Data transfer. Direct file transfer/replication, or database copy. (Data transfer/replication is not always part of the OVF format).
Note: The data transfer or copy itself usually is not part of the metadata.
In addition, as part of the base runtime, active session data may also be migrated. Examples include session cookies that reside outside the compute container and HTTP cache gateways.
The next layer up from the base runtime is the application and system management layer. This layer includes the software framework that may be expressed as software binaries or as a PaaS application stack. Software binaries include executable files, data and configurations that are moved via an installation script as opposed to an appliance. This approach utilizes a scripted methodology that includes selected data copy at the application layer.
Another way to transfer software applications is through a cloud broker service or appliance. Cloud broker appliances are intermediary systems that incorporate an agent in both the source and target clouds. A brokering function handles the orchestration and effects the migration. Examples of cloud broker appliances include transfer systems from companies such as CA and VMware.
The application and system management layer includes all runtime information needed to migrate into the target cloud provider’s management systems. This includes capabilities such as monitoring, auditing, logging, error handling, and help-desk functions. There is a technical integration component, which may include service name and parameters. There is also a data integration component, which provides the context for “transparent migration” with respect to seamless service levels (SLAs). For instance, the migration includes all data needed to ensure that application errors are handled in the same way in the target cloud, with the same SLA. This may involve setting parameters when the service is selected from the service catalog in the target cloud or by explicitly setting SLAs.
The business and non-functional requirements layer includes all information related to a higher order of system operation and quality of service. Appliance metadata defines how data elements are migrated, especially how master data is managed in terms of who owns the master copy of the data as it moves among clouds. These data elements must be synchronized during high availability and disaster recovery scenarios and incorporate conflict resolution strategies. In addition, it must be possible to determine the original piece of data and audit changes.
Other elements in the business and non-functional requirements layer are as follows:
• High-availability settings for active/active, active/passive, and global load balancing
• Resource utilization limits for metering
• Definition/topology of other systems that must be running prior to starting a new session
The OASIS “Topology and Orchestration Specification for Cloud Applications” (TOSCA) promotes the concept of a “service template” that specifies the topological structure and orchestration characteristics to provision services.8
The topmost layer in the model is the migration protocol layer. It specifies the content and protocol to establish connections between the source and target cloud providers, which enables the service lifecycle. This provides a configuration to transact the following:
• Lifecycle stages. Start migration, suspend (halt), resume, roll back, stop (success/failure), end of migration.
• Data associated with enacting the lifecycle. Error conditions, timeout, success messages.
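The lifecycle stages above can be sketched as a small state machine. The stage names come from the list; the permitted transitions between them are an assumption made for illustration, because the text names the stages but does not define their ordering rules.

```python
from enum import Enum, auto


class Stage(Enum):
    START = auto()
    SUSPEND = auto()
    RESUME = auto()
    ROLL_BACK = auto()
    STOP_SUCCESS = auto()
    STOP_FAILURE = auto()
    END = auto()


# Assumed transition rules (hypothetical, not taken from the source document).
ALLOWED = {
    Stage.START: {Stage.SUSPEND, Stage.ROLL_BACK, Stage.STOP_SUCCESS, Stage.STOP_FAILURE},
    Stage.SUSPEND: {Stage.RESUME, Stage.ROLL_BACK},
    Stage.RESUME: {Stage.SUSPEND, Stage.ROLL_BACK, Stage.STOP_SUCCESS, Stage.STOP_FAILURE},
    Stage.ROLL_BACK: {Stage.STOP_FAILURE},
    Stage.STOP_SUCCESS: {Stage.END},
    Stage.STOP_FAILURE: {Stage.END},
    Stage.END: set(),
}


def is_valid_sequence(stages):
    """Check that a sequence begins with START and only uses allowed transitions."""
    if not stages or stages[0] is not Stage.START:
        return False
    return all(nxt in ALLOWED[cur] for cur, nxt in zip(stages, stages[1:]))


happy_path = [Stage.START, Stage.SUSPEND, Stage.RESUME, Stage.STOP_SUCCESS, Stage.END]
print(is_valid_sequence(happy_path))
```

Both providers would need to agree on such a table for the protocol to be interoperable; the one shown here is only one plausible choice.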
8 OASIS “Topology and Orchestration Specification for Cloud Applications Version 1.0.” 18 March 2013. OASIS Committee Specification 01. http://docs.oasis-open.org/tosca/TOSCA/v1.0/cs01/TOSCA-v1.0-cs01.html.
Network Concept for Migration Considerations
There are a number of key network aspects that must be considered to enable successful long-distance migration.
• Enable bulk data transfer from source to target
• System interaction with non-migrated workloads
• User access
• Management and administration of the migrated system
When considering a system to be migrated, the network is one of the most common areas in which problems may occur. In order to effectively identify the network requirements of a system for migration, it is helpful to consider the network from different viewpoints.
Table 3 describes some of these important viewpoints.
Table 3. Different network viewpoints.
Consideration Explanation
Bulk Data Transfer from Source to Target
Calculate the required network bandwidth and duration for the data transfer phase of the migration (live or offline).
Determine whether there is sufficient bandwidth to complete the data migration within the given time, considering the defined data volume and the transfer mode (synchronous or, more usually over long distances, asynchronous).
System Integration with Non-Migrated Workloads
Define the network links required back to other non-migrated systems, and the inter-system dependencies and access requirements.
Map the system requirements within the business transaction chain, any latency impacts on those transactions, and key dependencies on system and data resources. Consider where a WAN outage can result in overall business transactions ceasing although both source and target systems are running normally and available at their respective locations.
User Access
Define the service access schema, addressing, and load balancing. Also identify network and service access control, directory services, routing and alternate routes, and intrusion detection and prevention mechanisms. Consider encryption criteria and traffic auditing between the source and target locations.
Identify what addressing schema will be applicable (especially if alternate cloud provider IP addresses are used), and how the DNS, DHCP, routing, load balancing, authentication, and access configurations are to be defined. Then review how auditing and controls will be implemented.
Management and Administration of the Migrated System
Define the service administration responsibilities according to the selected scenario and how the responsible parties will access the systems to be administered.
This is especially important when administration responsibility is shared across sites for selected elements. Create clearly defined boundaries of responsibility and clear audit points, and consider what access, authentication, and credential controls will be applied. Examples include VPNs, directory services, intrusion detection and intrusion prevention management, multi-factor authentication, and encryption.
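The bulk data transfer row of Table 3 calls for calculating the bandwidth and duration of the transfer phase. A minimal sketch of that arithmetic follows. The function names, the decimal units (1 GB = 10^9 bytes, 1 Mbps = 10^6 bits per second), and the default 80% link efficiency are assumptions for illustration, not values from this usage model.

```python
def transfer_hours(data_gb, link_mbps, efficiency=0.8):
    """Estimate the hours needed to move data_gb gigabytes over a link_mbps link.

    efficiency is the assumed fraction of nominal bandwidth that is usable
    after protocol overhead and contention.
    """
    if data_gb < 0 or link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("invalid data volume, bandwidth, or efficiency")
    total_bits = data_gb * 8 * 1000**3           # decimal gigabytes to bits
    usable_bits_per_second = link_mbps * 1000**2 * efficiency
    return total_bits / usable_bits_per_second / 3600


def fits_window(data_gb, link_mbps, window_hours, efficiency=0.8):
    """Return True if the transfer is expected to finish within window_hours."""
    return transfer_hours(data_gb, link_mbps, efficiency) <= window_hours


# Example: 10 TB over a 1 Gbps link at 80% efficiency.
hours = transfer_hours(10_000, 1_000)
print(f"{hours:.1f} hours; fits a 24-hour window: {fits_window(10_000, 1_000, 24)}")
```

In this example the transfer needs about 27.8 hours, so it does not fit a 24-hour window. An estimate of this kind helps decide early whether an offline or asynchronous approach is required.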
General Migration Workflow
The relocate, extend, and sustain classes each have distinct workflow steps, while also having many steps in common. Variations on workflow implementation can be used to address niche cloud subscriber requirements not considered herein. Table 4 defines some additional terms.
Table 4. Additional terminology.
Term Definition
Source The site or data center where the cloud application or system is originally located.
Target The site or data center where the cloud application or system is to be moved or copied.
Control Shorthand for which site or data center is, at that moment, the primary location for the subject system. This could be the source, the target, or a load balancing facility.
A general workflow framework is illustrated in Table 5 and specific usage scenario examples follow in Figure 3. Here “workflow” refers to the migration process itself. One way to think of the general workflow is as a set of migration building blocks that can be assembled for a given business usage scenario. Proper implementation of the migration workflow is dependent upon good orchestration solutions, which are preferably standardized among cloud providers.
Note: In this context, the word “maybe” means that the migration step depends on specific business requirements.
Table 5. A general workflow framework.
The Relocate, Extend, and Sustain columns indicate whether the workflow step is applicable to that migration class.
Step  Migration Step                                           Relocate  Extend  Sustain
1     Prepare source for migration.                            Yes       Yes     Yes
2     Marshal resources and set up environment at the target.  Yes       Yes     Yes
3     Provision processes at the target.                       Yes       Yes     Yes
4     Provision data at the target.                            Yes       Maybe   Yes
5     Copy state to the target.                                Maybe     No      Yes
6     Confirm successful process and data provisioning.        Yes       Yes     Yes
7     Assign control to the target.                            Yes       No      Maybe
8     Destroy or disengage the source.                         Yes       Maybe   Maybe
9     Accept new connections at the target.                    Yes       Yes     Yes
10    Balance connections between the source and the target.   No        Yes     Maybe
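Table 5 can be treated as a set of building blocks from which the steps for a given class are selected. The sketch below encodes the table as data and is offered only as an illustration; the names WORKFLOW and steps_for are hypothetical.

```python
# Table 5 as data: (step, description, relocate, extend, sustain).
WORKFLOW = [
    (1, "Prepare source for migration.", "yes", "yes", "yes"),
    (2, "Marshal resources and set up environment at the target.", "yes", "yes", "yes"),
    (3, "Provision processes at the target.", "yes", "yes", "yes"),
    (4, "Provision data at the target.", "yes", "maybe", "yes"),
    (5, "Copy state to the target.", "maybe", "no", "yes"),
    (6, "Confirm successful process and data provisioning.", "yes", "yes", "yes"),
    (7, "Assign control to the target.", "yes", "no", "maybe"),
    (8, "Destroy or disengage the source.", "yes", "maybe", "maybe"),
    (9, "Accept new connections at the target.", "yes", "yes", "yes"),
    (10, "Balance connections between the source and the target.", "no", "yes", "maybe"),
]

# Column index of each migration class within a WORKFLOW row.
CLASS_COLUMN = {"relocate": 2, "extend": 3, "sustain": 4}


def steps_for(migration_class, include_maybe=False):
    """Return the step numbers that apply to a migration class.

    "Maybe" steps depend on business requirements, so they are included
    only when the caller asks for them.
    """
    column = CLASS_COLUMN[migration_class]
    wanted = {"yes", "maybe"} if include_maybe else {"yes"}
    return [row[0] for row in WORKFLOW if row[column] in wanted]


print(steps_for("extend"))
print(steps_for("extend", include_maybe=True))
```

Keeping the "maybe" steps as an explicit option mirrors the note above: whether they run is a business decision, not a property of the migration class.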
Figure 3 generically illustrates the high-level workflow for a system migration over distance. It is important to note that one of the first steps to review is the business requirement that must be met—select the applicable usage scenario that will drive the migration. Based on the usage scenario, it is then possible to select the relevant system elements and scope of the data or system to be migrated and the methodology to be applied, along with the potential service impacts to the service consumers.
Depending on the usage scenario, the process could be iterative or one-off (for example, follow-the-sun versus disaster recovery). Also associated with the usage scenario is the service and operational responsibility, shown at the top of Figure 3. This is the point at which service and operational responsibility is either transferred or becomes shared among the sites and providers.
[Figure 3: swimlane flowchart running from “Start Migration Process” to “End Migration Process” across three lanes: Source Cloud Provider (SCP), Cloud Subscriber (CS), and Target Cloud Provider (TCP). A state bar shows VM Running State, VM Stopped State, and VM Running State, with the labels “SCP responsible for VM operation” and “TCP responsible for VM operation.”
Process boxes: Create, publish, and maintain Service Catalog; Determine VM operational requirements in normal operation and for elastic/burst capacity capability requirements; Find TCP and access Service Catalog; Match VM operational requirements to target CSP’s Service Catalog; VM resource reservation request; Reserve VM resources (tier-based); Change VM state to stopped and clone VM; Convert cloned VM to match target platform requirements; Move converted cloned VM to target TCP’s platform; Create new VM; Match current VM configuration and operational requirements; Deploy new VM onto TCP platform; Migrate VM data to new VM; Execute VM Acceptance into Service and Operation Process; Create VM inventory and configuration (running and stopped states); Resource charging by TCP to CS begins; Original VM and all data deleted; Confirmation of VM deletion received; Resources released, cloud subscriber billed for resources consumed, and so on.
Decision boxes: Can TCP support VM’s operational requirements? Migrate or build new VM? Is migrated VM operational?
Annotations:
• A clone of the original VM is made (for migration purposes) so that, if the migration process fails, the original VM is still present to continue normal operations.
• Responsibility for conversion of the source VM is with the TCP (not the SCP).
• Criteria must be defined to prevent a continuous loop back if conversion fails, so that VM operation returns to the SCP.]
Figure 3. The high-level workflow for a system migration over distance.
Data lifecycle management is integral to workflow planning, especially for usage scenarios that require regular system movement to support constant service movement (for example, follow-the-sun or follow-the-moon). This is particularly relevant as new data and records are created and added to the whole moving system or parts thereof (creating a data federation) at each successive service location throughout the cycle (for example, three locations in 24 hours, each stage being approximately eight hours long).
The normal data lifecycle stages for a data record are represented in Figure 4.
[Figure 4: data lifecycle diagram. Lifecycle stages: Create, Store, Use, Share, Archive, Destroy. Additional labels: Initiate, Transfer, Migrate (*), Archive (Media/Online), Cloud Subscriber, Cloud Provider.]
Figure 4. The data lifecycle stages for a data record.
Figure 5 explains in more detail the Migrate (*) stage identified in Figure 4. As multiple locations (and/or providers) come into scope the complexity increases, and the responsibility divides among more parties. The overall responsibility and ownership of data resides with the subscriber but operational responsibility lies with the various providers, and for a federated dataset, this must be well understood, mapped, and accounted for. The migrated system therefore gains more and more complexity, depending on the usage scenario.
A straight migrate or sustain requirement for purposes of disaster recovery remains relatively straightforward. A service extension, burst, or partial migration that splits system elements and associated data (especially if done repeatedly, as in the case of (lazy) follow-the-moon or follow-the-sun) must be carefully managed in order to retain overall system integrity, original master data, and overall system recoverability. The diagram shown in Figure 5 represents this consideration, with the data stages from the previous diagram integrated into a migration scenario.
For these migration scenarios, the associated system metadata and management/controls should carefully consider how to direct and retain control of the system and its elements.
[Figure 5: diagram of source and target cloud providers linked by a migration protocol; the source-target pairing appears twice. Source cloud provider labels: Prepare, Migrate. Target cloud provider labels: Migrate, Initiate, Integrate, Create, Store. Additional label: Use.]
Figure 5. The migrate stage.
Schema for Usage Scenarios
A number of concepts are important to consider when defining applicable usage scenarios against organizations’ business requirements. Each usage scenario has its own complexities, and each complexity has to be recognized and understood, and accommodated into the orchestration of the migration. Table 6 lists some of the most important aspects of each complexity. The column titles at the top of the table map to the impact to the user, according to the illustration in Figure 2. The row titles on the left side of the table map to important aspects to consider against each use case. The three usage scenarios described later can be considered as overlays on this table, selecting relevant characteristics per scenario from the table.
Table 6. Important aspects of usage scenario complexities.
Columns (impact to the user): Service Disruption; Partial Service Disruption or Transaction Reroute; No Service Disruption.
Data Movement
• Service Disruption: Complete data migration
• Partial Service Disruption or Transaction Reroute: Selected data movement
• No Service Disruption: Data duplication
System Resource Movement According to a Layered Model
• Service Disruption: Complete systems resource movement from source to target location
• Partial Service Disruption or Transaction Reroute: Movement or duplication of selected system resources from source to target locations
• No Service Disruption: Complete duplication of system resources between source and target locations
Transaction Movement
• Service Disruption: Complete movement to target site
• Partial Service Disruption or Transaction Reroute: New transactions route to target site
• No Service Disruption: Transactions are routed to available sites according to a defined algorithm
Global Workload Balancer Role
• Service Disruption: No need
• Partial Service Disruption or Transaction Reroute: Workload allocated based on a defined algorithm
• No Service Disruption: Full duplication and balancing of workload between source and target locations
License Migration
• Service Disruption: License migration
• Partial Service Disruption or Transaction Reroute: Add a temporary license during migration. For a short period of time, the software is licensed at both locations simultaneously. The license at the source location is released after the transfer is completed.
• No Service Disruption: Global license
Degree of Service Impact
• Service Disruption: Service downtime during migration
• Partial Service Disruption or Transaction Reroute: Service available from selected locations during migration
• No Service Disruption: Constant service availability and performance during migration
Degree of User Impact
• Service Disruption: Accepted service outages
• Partial Service Disruption or Transaction Reroute: Defined access and performance limitations
• No Service Disruption: Constant access and performance
Service Levels
• Service Disruption: Accepted service outages
• Partial Service Disruption or Transaction Reroute: Defined access and performance limitations
• No Service Disruption: Constant access and performance
System Capacity
• Service Disruption: Sufficient capacity for the defined system
• Partial Service Disruption or Transaction Reroute: Capacity for the source system at the source site and a limited capacity duplicate system at the target site
• No Service Disruption: Duplicate capacity spread across both source and target sites
Compliance
• Service Disruption: Applicable conversion of compliance requirements from source to target locations
• Partial Service Disruption or Transaction Reroute: Compliance with both source and target location requirements
• No Service Disruption: Compliance with both source and target location requirements
Environmental Impact Including Carbon Footprint
• Service Disruption: Reporting of selected metrics converted from source to target; for example, carbon footprint or power usage effectiveness (PUE)
• Partial Service Disruption or Transaction Reroute: Reporting of all required metrics between source and target; for example, carbon footprint or PUE
• No Service Disruption: Reporting of all required metrics between source and target; for example, carbon footprint or PUE
Note: “No Service Disruption” does not necessarily mean that an active session (that is, the end user is using the system when a migration starts) will not be terminated during the migration. In that case, the end user may need to “log on” to the application in the new site.
Usage Scenario Group 1 - Relocate
The example for usage scenario group 1 is motivated by a requirement to move services and operations between locations, with defined user impact, extending to defined service non-availability. Business drivers could range from the cost of the source site being significantly higher than the cost of the target site to data center consolidation. This situation could also apply to cases where other resources are required, such as specific devices or low-cost power.
Below are the proposed high-level steps for using the relocate class to implement migration.
1. Construct a manifest of all system components and requirements at the source site. This could be done by either the cloud subscriber or cloud provider, depending upon their contract and SLA. This manifest will be used in the orchestration of the remaining steps.
2. Identify suitable target locations, according to business requirements and the system requirements identified in the previous step.
3. The cloud provider deploys all required software applications and system components to the target site.
4. The cloud provider replicates the selected system components and metadata to the target site.
5. The cloud provider executes an offline migration of the system.
6. Once the migration is confirmed to be complete and correct, traffic will be redirected to the alternate (target) site.
7. All quality-of-service (QoS), performance, security, and availability characteristics at the target site must be at parity with the source, unless otherwise agreed upon by the cloud subscriber.
8. All system resources are freed up and returned to the available pool (or decommissioned), and data is deleted at the source site, according to business requirements.
Steps 3 through 7 should be treated as an atomic event. That is, they should be guaranteed either to occur completely or have no effect.
Migration failure during any of those steps should cause a full rollback and continuation to operate with the in-place resources.
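The all-or-nothing requirement can be sketched as a sequence of actions, each paired with a compensating action that is run in reverse order on failure. This is a minimal illustration under stated assumptions: the step names are hypothetical, and a failed step is assumed to leave no partial effects of its own behind.

```python
def run_atomically(steps):
    """Run (do, undo) pairs in order; on any failure, undo completed steps in reverse.

    Returns True if every step succeeded, False if a rollback was performed.
    """
    undo_stack = []
    for do, undo in steps:
        try:
            do()
        except Exception:
            for completed_undo in reversed(undo_stack):
                completed_undo()
            return False
        undo_stack.append(undo)
    return True


log = []


def make_step(name, fail=False):
    """Build a hypothetical step that records what it does in the log."""
    def do():
        if fail:
            raise RuntimeError(f"{name} failed")
        log.append(f"do {name}")

    def undo():
        log.append(f"undo {name}")

    return do, undo


succeeded = run_atomically([
    make_step("deploy components"),
    make_step("replicate metadata"),
    make_step("offline migration", fail=True),
])
print(succeeded, log)
```

After the failed third step, the log shows the first two steps undone in reverse order, which leaves the source operating with its in-place resources.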
Success scenario 1
The system is moved to the target within the acceptable defined service interruptions to system execution.
Failure condition 1
The migration was interrupted or not possible. The migration did not complete successfully, and therefore the cloud application or system is unable to enter a running state at the target. A rollback to the source site and a returned error code are expected, to allow for a retry or a failure event notice.
Failure condition 2
The migrated system cannot be started and does not work as intended. The migration completed successfully. However, the cloud application or system is unable to enter a running state because of various failure conditions. A rollback to the source site and a returned error code are expected, to allow for a retry or a failure event notice.
Failure condition 3
The migrated system does not perform as expected. The migration completed successfully, and the system entered a running state at the target site. However, behavior and/or user experience is incorrect because of various failure conditions. The cloud subscriber raises an incident to the cloud provider.
Failure condition 4
The cloud provider cannot meet the usage scenario goals without introducing unacceptable risks such as network broadcast storms or network limitations, loss of data, or significant unplanned outages. The cloud subscriber or provider raises an incident to the other parties, as appropriate.
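Failure conditions 1 through 3 differ in how far the migration progressed, so an orchestrator could distinguish them with three checks. The sketch below is an illustration only: the names are hypothetical, the success case ignores the service interruption limits, and failure condition 4 is omitted because it reflects a risk judgment rather than an observable migration result.

```python
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success scenario 1"
    FAILURE_1 = "failure condition 1"  # migration did not complete
    FAILURE_2 = "failure condition 2"  # completed, but cannot enter a running state
    FAILURE_3 = "failure condition 3"  # running, but behavior is incorrect


# Expected response for each outcome, paraphrased from the text above.
RESPONSE = {
    Outcome.SUCCESS: "none",
    Outcome.FAILURE_1: "roll back to source and return error code",
    Outcome.FAILURE_2: "roll back to source and return error code",
    Outcome.FAILURE_3: "subscriber raises incident with provider",
}


def classify(migration_completed, running_at_target, behaves_as_expected):
    """Map three observed facts about a migration to an outcome."""
    if not migration_completed:
        return Outcome.FAILURE_1
    if not running_at_target:
        return Outcome.FAILURE_2
    if not behaves_as_expected:
        return Outcome.FAILURE_3
    return Outcome.SUCCESS


outcome = classify(migration_completed=True, running_at_target=False,
                   behaves_as_expected=False)
print(outcome.value, "->", RESPONSE[outcome])
```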
Usage Scenario Group 2 - Extend
The example for usage scenario group 2 is motivated by a requirement to move selected compute operations to alternate identified locations, with limited user transaction and service impacts. This could also apply to cases where other resources are required, such as specific devices or low-cost power.
Below are the proposed high-level steps for using the extend migration class.
1. Construct a manifest of all system components and requirements at the source site. This could be done by either the cloud subscriber or cloud provider, depending upon their contract and SLA. This manifest is used in the orchestration of the remaining steps.
2. Identify suitable target locations, according to business requirements and the system requirements identified in the previous step.
3. The cloud provider deploys all required software applications, metadata, and system components to the target site.
4. The cloud provider replicates storage to the target site.
5. The cloud provider executes a migration of the designated subset of system components.
6. Once the migration is confirmed complete and correct, the cloud provider adds the target location to the load-balanced group, with rules reflecting the cloud subscriber’s business priorities for the source/target split.
7. All QoS, performance, security, and availability characteristics at the target must be at parity with the source, and be maintained at the source, unless otherwise agreed upon by the cloud subscriber.
8. Repeat steps 3 through 6 according to cloud subscriber requirements as often as required until defined system elements have been migrated from the source to the target (and back again if required for a follow-the-sun or follow-the-moon scenario). This could occur over an extended period of time, and master data management must be taken into account.
Steps 3 through 7 should be treated as an atomic event. That is, they should be guaranteed either to occur completely or have no effect.
Migration failure during any of those steps should cause a full rollback and continuation to operate with the in-place resources.
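Step 6 adds the target to the load-balanced group with rules for the source/target split. One simple way to express such a rule is a fixed share of sessions sent to the target. The following sketch assumes a hash-based assignment so that a given session always lands on the same site; the function name and the use of SHA-256 are illustrative choices, not part of this usage model.

```python
import hashlib


def route(session_id, target_share):
    """Return "target" for roughly target_share of sessions, otherwise "source".

    The session identifier is hashed to a stable value in [0, 1), so the same
    session is always routed to the same site for a given share.
    """
    if not 0.0 <= target_share <= 1.0:
        raise ValueError("target_share must be between 0 and 1")
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    position = int.from_bytes(digest[:8], "big") / 2**64
    return "target" if position < target_share else "source"


# Example: send about 30% of sessions to the newly added target site.
sample = [route(f"session-{i}", 0.3) for i in range(10_000)]
print(sample.count("target") / len(sample))
```

Raising the share in stages gives the cloud subscriber a way to shift load gradually while the parity requirements in step 7 are verified.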
Success scenario 1
The system (or defined part thereof) is successfully added to the target without disruption to the state at the source. The target is successfully added to the load-balance group.
Failure condition 1
The addition was interrupted or not possible. The addition did not complete successfully, and therefore the system is unable to enter a running state at the target. A rollback to the source site and a returned error code are expected, to allow for a retry or a failure event notice.
Failure condition 2
The added system subset cannot be started and does not work as intended. The addition completed successfully. However, the system subset is unable to enter a running state because of various failure conditions. A rollback and a returned error code are expected, to allow for a retry or a failure event notice.
Failure condition 3
The added system subset does not perform as expected. The addition completed successfully, and the system subset entered a running state at the target site. However, system subset behavior and/or user experience is incorrect because of various failure conditions. The cloud subscriber raises an incident to the cloud provider.
Failure condition 4
The cloud provider cannot meet the usage scenario goals without introducing unacceptable risks, such as network broadcast storms or network loops, or service and user access impacts. The cloud subscriber or provider raises an incident to the other parties, as appropriate.
Usage Scenario Group 3 - Sustain
The example for usage scenario group 3 is motivated by a requirement to move compute operations seamlessly to alternate identified locations, without affecting user transactions and service availability. This could also apply to cases where other resources are required for service extension or expansion.
The usage scenario is in the sustain migration class and necessitates live migration, as specified by the following steps.
1. Construct a manifest of all system components and requirements at the source site. This could be done by either the cloud subscriber or cloud provider, depending upon their contract and SLA. This manifest is used in the orchestration of the remaining steps.
2. The cloud provider deploys all required software applications and system components to the target site. Ideally some or all of this could be pre-staged before the actual service migration, but this may not be necessary.
3. The cloud provider duplicates the identified system and metadata to the target site and adds it to the load-balanced group.
4. Once storage migration is complete and correct, the cloud provider will execute a service extension event on the service load-balancers and system configuration.
5. Once the live-service migration is confirmed complete and correct, the cloud provider will redirect traffic to the target site according to load-balancer algorithms and defined business requirements.
6. All QoS, performance, security, and availability characteristics at the target must be at parity with the source, unless otherwise agreed upon by the cloud subscriber.
7. At some point a decision must be made as to whether the active system transactions should move back to the source site, according to defined business requirements.
Steps 3 through 7 above should be treated as an atomic event. Migration failure during any of those steps should cause a full roll back to the source site.
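The parity requirement in step 6 can be checked mechanically once both sites report the same metrics. The sketch below is one possible approach under stated assumptions: the metric names and values are hypothetical, and the caller must say which metrics are better when higher, because that direction is not defined in this usage model.

```python
def parity_violations(source, target, higher_is_better, tolerance=0.0):
    """Return the names of metrics for which the target falls short of the source.

    tolerance is the accepted relative shortfall (for example, 0.1 for 10%).
    A metric missing from the target counts as a violation.
    """
    violations = []
    for name, source_value in source.items():
        if name not in target:
            violations.append(name)
            continue
        target_value = target[name]
        if name in higher_is_better:
            ok = target_value >= source_value * (1 - tolerance)
        else:
            ok = target_value <= source_value * (1 + tolerance)
        if not ok:
            violations.append(name)
    return violations


# Hypothetical measurements taken at each site.
source_metrics = {"availability_pct": 99.9, "latency_ms": 40.0}
target_metrics = {"availability_pct": 99.95, "latency_ms": 55.0}

print(parity_violations(source_metrics, target_metrics,
                        higher_is_better={"availability_pct"}, tolerance=0.1))
```

Here the target's latency exceeds the source's by more than the 10% tolerance, so it is reported. Such a result would require either remediation or explicit agreement from the cloud subscriber.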
Success scenario 1
The system is duplicated from the source to the target without interruption to service execution and without disruption to the state. Control and traffic are successfully set to the target.
Failure condition 1
The migration was interrupted or not possible. The migration did not complete successfully, and therefore the system is unable to enter a running state at the target. A rollback to the source site and a returned error code are expected, to allow for a retry or a failure event notice.
Failure condition 2
The migrated system cannot be started and does not work as intended. The migration completed successfully. However, the system is unable to enter a running state due to various failure conditions. A rollback to the source site and a returned error code are expected, to allow for a retry or a failure event notice.
Failure condition 3
The migrated system does not perform as expected. The migration completed successfully, and the system entered a running state at the target site. However, system behavior and/or user experience is incorrect due to various failure conditions. The cloud subscriber raises an incident to roll back to the source site.
Failure condition 4
The cloud provider cannot meet the usage scenario goals without introducing unacceptable risks, such as network broadcast storms or network loops, or service and user access impacts. The cloud subscriber or provider raises an incident to the other parties, as appropriate.
Key Performance Indicators
This section contains the key performance indicators (KPIs) that are essential for a successful long-distance migration.
“KPI” (or “performance indicator”) is an industry term for a type of performance measurement. A common way to choose KPIs is to apply a management framework such as a balanced scorecard to consolidate a number of SLA perspectives and metrics into an overall indicator.
Sub-categories include the following:
• Quantitative indicators that can be presented as a number
• Practical indicators that interface with existing company processes
• Directional indicators specifying whether or not an organization is getting better
• Actionable indicators that are sufficiently within an organization’s control to effect change
• Financial indicators used in performance measurements and when looking at an operating index
KPIs can be visually represented in multiple ways, including dials, thermometers, and slider bars, as shown in Figure 6.
KPI principles include the following:
• Should define a specific measure title
• The parameters of the measure constitute the aggregated SLA
• Has a high-water and a low-water mark (the aggregated SLA is set against one of these)
• Can have multiple dimensions, some of which are shared
• Has a service consumer view, to gauge quality of service
• Has a service provider view, to manage overall services
• Shared view on some items
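The principles above can be sketched as a small data structure with a title and two water marks. This is an illustration only: it assumes a KPI where higher values are better, treats the low-water mark as the minimum acceptable level and the high-water mark as the target, and uses hypothetical threshold values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Kpi:
    title: str
    low_water: float   # assumed meaning: minimum acceptable level
    high_water: float  # assumed meaning: target achievement

    def status(self, actual):
        """Classify an actual achievement against the two water marks."""
        if actual < self.low_water:
            return "below minimum"
        if actual < self.high_water:
            return "acceptable"
        return "target met"


# Hypothetical availability KPI; the thresholds are not taken from any SLA.
availability = Kpi("Service is available as expected", low_water=99.0, high_water=99.9)
print(availability.status(99.5))
```

The same structure serves both the service consumer view and the service provider view, with each party supplying its own measured values.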
For the purposes of long-distance migration, the KPIs should align with any definitions provided by the “Compute Infrastructure as a Service”9 and “ODCA Master Usage Model: Commercial Framework”10 usage models. Table 7 describes some of the important KPIs recommended as a base for long-distance migration.
[Figure 6: visual indicator for the KPI “Service is available as expected” (may include system, network, and storage uptimes, and incident response time). The scale marks the low-water mark (minimum acceptable level), the high-water mark (target achievement), the committed aggregated SLA, and the actual achievement.]
Figure 6. Visual representation of key performance indicators.
9 www.opendatacenteralliance.org/library
10 www.opendatacenteralliance.org/library
The examples given in Table 7 suggest measures but do not specify the high- or low-water marks, because they are specific to the contract and SLA. The same measures should be used regardless of service tier to give the subscriber a way to compare the cost and benefit of the different service tiers.
Table 7. Measures in key performance indicator calculations.
Attribute: Effectiveness
Description of Measure: Migration success rate, calculated as the number of defect-free migrations completed divided by the total number of migrations undertaken.
Usage Scenario: Relocate, extend
Attribute: Cost
Description of Measure: Average cost of long-distance migrations. Advise on the cost of the migration and provide cost data for return-on-investment calculations.
Usage Scenario: Relocate, extend
Attribute: Availability
Description of Measure: Measure availability of contracted services. Provide percentage of time that services are accessible.
Usage Scenario: Relocate, extend
Attribute: Data Integrity
Description of Measure: Percentage of data lost due to migration, with a target of 0% data loss.
Usage Scenario: Relocate, extend
Attribute: Performance
Description of Measure: The percentage change in system performance metrics post-migration, used to understand and quantify the impact of the long-distance migration on areas such as tpmC, IOPS, and network latency.
Usage Scenario: Sustain
Attribute: Performance
Description of Measure: The percentage change in resource utilization post-migration, used to measure efficiencies gained or lost because of the long-distance migration.
Usage Scenario: Sustain
Attribute: Performance
Description of Measure: The percentage change in transaction processing performance post-migration (for example, tpmC or other metrics) because of newly introduced latencies or improvements, such as network or different processing capabilities.
Usage Scenario: Sustain
Attribute: Security
Description of Measure: The percentage change in the security risk profile post-migration, indicating the improvement or degradation of the security risk profile.
Usage Scenario: Sustain*
Attribute: Capacity
Description of Measure: Measure of system capacity post-migration. Aggregate measure accounting for storage/resources, CPU, memory, and overall scalability indicator.
Usage Scenario: Sustain
* May also be relevant in relocate and extend.
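Several measures in Table 7 reduce to simple ratios. The following sketch shows the arithmetic for three of them. The function names are hypothetical, and counting records to measure data loss is an assumption, since the table does not say how data loss is quantified.

```python
def success_rate(defect_free, total):
    """Effectiveness: defect-free migrations divided by total migrations."""
    if total <= 0 or not 0 <= defect_free <= total:
        raise ValueError("invalid migration counts")
    return defect_free / total


def data_loss_pct(records_at_source, records_at_target):
    """Data integrity: percentage of records missing at the target (goal: 0%)."""
    if records_at_source <= 0:
        raise ValueError("source must contain at least one record")
    lost = max(records_at_source - records_at_target, 0)
    return lost / records_at_source * 100


def pct_change(before, after):
    """Performance, security, and utilization: percentage change post-migration."""
    if before == 0:
        raise ValueError("baseline must be non-zero")
    return (after - before) / before * 100


# Hypothetical figures for illustration only.
print(success_rate(47, 50))
print(data_loss_pct(1_000_000, 999_990))
print(pct_change(1200.0, 1080.0))
```

Whether a negative percentage change is good or bad depends on the metric: a drop in latency is an improvement, while a drop in transactions per minute is a degradation.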
Service Tiers
The features listed in Table 8 are generally applicable to both IaaS and PaaS, and are derived from the ODCA “Standard Units of Measure.”11 For this usage model, not every feature in a given table column must be supported as a group. In practice, a given service provider solution will combine different service levels for different elements. For example, Gold security features may be combined with Bronze performance features.
It is the cloud provider’s responsibility to differentiate service tiers within a given data center or service offering.
11 See www.opendatacenteralliance.org/library