Dive Into VM Live Migration
OpenStack Liberty Summit 2015
Vancouver
Michał Dulko Michał Jastrzębski Paweł Koniszewski
o Imminent host failure
o Maintenance mode
o Optimal resource placement
o Cooling issues
o Storage problems
o Networking problems
o Your datacenter was struck by a flood
Imminent Host Failure
o Firmware upgrades
o Hardware upgrades
o Kernel upgrades
o Reduce costs
o Move VMs closer to their storage to lessen network
latency
o Stack more VMs on hosts to save power
o Increase resiliency
o Noisy neighbour separation
o Spread VMs across more hosts
o Live
o Consistent
o Transparent
o Minimal service disruption
Non-live migration (cold migration)
o nova migrate <server>
True live migration (shared storage or volume-based)
o nova live-migration <server> [<host>]
Block live migration
o nova live-migration --block-migrate <server> [<host>]
Migration type                     Local storage   Volumes   Shared storage
Block LM                           ✓               ✗         ✗
True LM                            ✗               ✓         ✓
Block LM with read-only devices    ✗               ✗         ✗
True LM with read-only devices     ✗               ✗         ✓
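The support matrix above can be written down as a small lookup, which is handy when deciding which command to issue for a given instance. This is only a sketch transcribed from the table; the names SUPPORT and can_live_migrate are hypothetical and are not part of nova.

```python
# Support matrix transcribed from the table above.
# Key: (migration type, read-only devices attached)
# Value: set of storage backends for which the migration works.
SUPPORT = {
    ("block", False): {"local"},
    ("true", False): {"volumes", "shared"},
    ("block", True): set(),
    ("true", True): {"shared"},
}


def can_live_migrate(kind, storage, read_only_devices=False):
    """Return True if the table marks this combination as supported.

    kind is "block" or "true"; storage is "local", "volumes" or "shared".
    """
    return storage in SUPPORT[(kind, read_only_devices)]


if __name__ == "__main__":
    print(can_live_migrate("block", "local"))          # block LM on local storage
    print(can_live_migrate("true", "volumes", True))   # true LM, read-only device
```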
o Pre-Migration
o Reservation
o Iterative pre-copy
o Stop and copy
o Commitment
Active VM on physical host A, host B selected by scheduler or preselected.
Pre-migration
[Diagram: VM A active on compute node A; compute node B has no VM yet]
Confirm availability of resources on host B; reserve a new VM.
Reservation
[Diagram: VM A active on compute node A; VM A reserved on compute node B]
Memory is transferred from A to B and next dirtied pages are iteratively copied.
Iterative pre-copy
[Diagram: VM A active on compute node A; VM A paused on compute node B]
Suspend VM and copy remaining pages and CPU state.
Stop and copy
[Diagram: VM A paused on both compute node A and compute node B]
Host B becomes primary host for VM A.
Commitment
[Diagram: compute node A and compute node B, with a single remaining VM A]
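The iterative pre-copy and stop-and-copy phases can be illustrated with a toy model. The sketch below assumes a constant transfer bandwidth and a constant page dirtying rate, which real workloads do not have; the function name and all parameters are hypothetical. Each round resends what was dirtied during the previous round, and the loop stops once the remaining data fits into the allowed downtime.

```python
def simulate_precopy(ram_mib, bandwidth_mib_s, dirty_rate_mib_s,
                     max_downtime_s, max_rounds=30):
    """Toy model of iterative pre-copy.

    Returns (rounds, converged, remaining_mib), where rounds is the number
    of pre-copy rounds performed before stop-and-copy became possible.
    """
    remaining = float(ram_mib)  # the first round has to send all memory
    rounds = 0
    while rounds < max_rounds:
        # Stop and copy: the rest can be sent within the downtime budget.
        if remaining / bandwidth_mib_s <= max_downtime_s:
            return rounds, True, remaining
        send_time = remaining / bandwidth_mib_s
        # Pages dirtied while this round was in flight must be sent again.
        remaining = min(float(ram_mib), dirty_rate_mib_s * send_time)
        rounds += 1
    converged = remaining / bandwidth_mib_s <= max_downtime_s
    return rounds, converged, remaining


if __name__ == "__main__":
    # 16 GiB guest, 1000 MiB/s link, 100 MiB/s dirtying, 1 s allowed downtime
    print(simulate_precopy(16384, 1000, 100, 1.0))
    # Dirtying faster than the link can transfer: never converges
    print(simulate_precopy(16384, 1000, 1200, 1.0))
```

In this model a guest that dirties memory faster than the link can carry it never reaches stop-and-copy, which is the situation the tuning options later in the deck try to address.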
o OpenStack does not allow triggering any operations
on VM during LM
o VMs with intensive memory workload are hard to
migrate
o LM generates heavy load on network
o Migrations between compute nodes with different CPUs
o Memory oversubscription
o OpenStack disallows any operation on a VM during an ongoing LM
o You can use virsh to interact with the migration instead
o Information about ongoing LM
virsh domjobinfo <domain>
Diagnosis
Time elapsed        1918595 ms
Data processed      410.137 GiB
Data remaining      4.600 GiB
Data total          16.008 GiB
Constant pages      144658
Normal pages        107307605
Normal data         409.346 GiB
Expected downtime   1023 ms
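Output like the above can be checked mechanically. The sketch below parses the name/value/unit lines and computes how many times the guest memory has effectively been sent, using the figures from the slide. The function names are hypothetical, the parser tolerates an optional colon after the field name, and the ratio assumes both values use the same unit.

```python
import re

# Values copied from the domjobinfo output shown above.
SAMPLE = """\
Time elapsed        1918595 ms
Data processed      410.137 GiB
Data remaining      4.600 GiB
Data total          16.008 GiB
Expected downtime   1023 ms
"""

LINE_RE = re.compile(r"\s*([A-Za-z ]+?):?\s+([\d.]+)\s*(\S*)\s*$")


def parse_domjobinfo(text):
    """Map each 'Name value unit' line to a (float value, unit) tuple."""
    info = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            info[match.group(1)] = (float(match.group(2)), match.group(3))
    return info


def passes_over_memory(info):
    """Ratio of data processed to total guest memory (same unit assumed)."""
    return info["Data processed"][0] / info["Data total"][0]


if __name__ == "__main__":
    job = parse_domjobinfo(SAMPLE)
    print(round(passes_over_memory(job), 1))
```

For the figures on the slide the ratio is above 25, meaning the guest memory has been resent many times without the migration converging.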
o Cancel on-going LM
virsh domjobabort <domain>
o Pause VM during LM
virsh suspend <domain>
o QEMU
virsh qemu-monitor-command --hmp <domain> migrate_set_downtime <time (sec)>
o libvirt
virsh migrate-setmaxdowntime <domain> <time (ms)>
o nova.conf setting
live_migration_flag += VIR_MIGRATE_AUTO_CONVERGE
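The "+=" notation above is slide shorthand: in nova.conf, live_migration_flag holds a single comma-separated list of libvirt flags, so a flag is enabled by appending it to that list. A minimal sketch of editing such a value follows. The helper names are hypothetical and the base value is only an example, so check the value in your own nova.conf.

```python
def _split(flags):
    """Split a comma-separated flag string into a clean list."""
    return [f.strip() for f in flags.split(",") if f.strip()]


def add_flag(flags, flag):
    """Return the flag string with `flag` appended if it is not present."""
    items = _split(flags)
    if flag not in items:
        items.append(flag)
    return ", ".join(items)


def remove_flag(flags, flag):
    """Return the flag string with every occurrence of `flag` removed."""
    return ", ".join(f for f in _split(flags) if f != flag)


if __name__ == "__main__":
    # Example base value (an assumption for illustration only).
    base = "VIR_MIGRATE_UNDEFINE_SOURCE, VIR_MIGRATE_PEER2PEER, VIR_MIGRATE_LIVE"
    print("live_migration_flag = " + add_flag(base, "VIR_MIGRATE_AUTO_CONVERGE"))
```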
o nova.conf setting
live_migration_flag += VIR_MIGRATE_TUNNELLED
Tunneled Migration
[Diagram: hypervisor and libvirt on each host, tunnelled migration]
o nova.conf setting
live_migration_flag -= VIR_MIGRATE_TUNNELLED
[Diagram: hypervisor and libvirt on each host, non-tunnelled migration]
o libvirt
virsh migrate-setspeed <domain> <speed (MiB/s)>
o nova.conf settings
live_migration_bandwidth = <speed (MiB/s)>
o nova.conf settings
live_migration_flag += VIR_MIGRATE_COMPRESSED
XBZRLE Compression
[Diagram: the source host computes a delta between the updated page and its copy in the sent-page cache and compresses it; the destination host applies the delta to the received page to rebuild the updated page]
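XBZRLE stands for XOR-based zero run-length encoding. The idea can be sketched in a few lines: XOR the updated page with the cached copy that was already sent, then encode the mostly-zero result as runs. This is a simplified illustration of the technique and not QEMU's actual wire format; all function names are hypothetical.

```python
from typing import List, Tuple

Encoded = List[Tuple[int, bytes]]  # (length of zero run, literal bytes after it)


def xor_delta(old: bytes, new: bytes) -> bytes:
    """Byte-wise XOR of two equally sized pages; unchanged bytes become 0."""
    assert len(old) == len(new)
    return bytes(a ^ b for a, b in zip(old, new))


def encode(old: bytes, new: bytes) -> Encoded:
    """Source side: delta against the cached page, zero runs collapsed."""
    delta = xor_delta(old, new)
    out, i, n = [], 0, len(delta)
    while i < n:
        start = i
        while i < n and delta[i] == 0:
            i += 1
        zeros = i - start
        start = i
        while i < n and delta[i] != 0:
            i += 1
        out.append((zeros, delta[start:i]))
    return out


def apply_delta(old: bytes, encoded: Encoded) -> bytes:
    """Destination side: rebuild the updated page from the received one."""
    page = bytearray(old)
    pos = 0
    for zeros, literal in encoded:
        pos += zeros                 # skip bytes that did not change
        for value in literal:
            page[pos] ^= value       # XOR restores the new byte
            pos += 1
    return bytes(page)


if __name__ == "__main__":
    cached = bytes(4096)
    updated = bytearray(cached)
    updated[10] = 0xFF
    updated[100:103] = b"abc"
    enc = encode(cached, bytes(updated))
    print(len(enc), apply_delta(cached, enc) == bytes(updated))
```

A page with only a few changed bytes shrinks to a handful of short runs, which is why this helps most for workloads that keep rewriting the same pages with small changes.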
o nova.conf
o live_migration_uri = qemu+tcp://%s/system
LM On Dedicated Network
[Diagram: compute node A and compute node B connected by the management network; VM A]
o nova.conf
o live_migration_uri = qemu+tcp://%s-lm/system
o Set up your DNS to resolve hostnames with -lm suffix to IPs in your
dedicated network.
LM On Dedicated Network
[Diagram: compute node A and compute node B connected by both the management network and a dedicated LM network; VM A is active]
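The effect of the -lm suffix can be seen by expanding the URI template by hand: the %s placeholder is replaced with the destination host name, so the suffix yields a different name that DNS can map to the dedicated network. A small sketch follows; the host name compute-b and the function name are hypothetical.

```python
def migration_uri(template, dest_host):
    """Fill the destination host name into a live_migration_uri template."""
    return template % dest_host


if __name__ == "__main__":
    # Default style: traffic goes to whatever the plain host name resolves to.
    print(migration_uri("qemu+tcp://%s/system", "compute-b"))
    # Dedicated network: "compute-b-lm" must resolve to an IP on the LM network.
    print(migration_uri("qemu+tcp://%s-lm/system", "compute-b"))
```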
o CPU instruction set of source node needs to be a
subset of CPU instruction set of destination node
Different CPUs Between Compute Nodes
[Diagram: compute node A (AVX, SSE2, MMX) and compute node B (AVX, MMX); live migration passes in one direction and fails in the other]
o This requirement can be bypassed by explicitly setting the VM CPU
model in nova.conf:
o cpu_mode = custom
o virt_type = kvm or virt_type = qemu
o And then you can set cpu_model
o List of supported named CPUs is in
libvirt/cpu_map.xml
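The subset rule for CPU instruction sets stated earlier can be expressed directly with sets. The sketch below is illustrative only: the feature names are the ones from the slide, and the function names are hypothetical rather than anything nova or libvirt exposes.

```python
def can_migrate(src_features, dst_features):
    """True if every CPU feature of the source also exists on the destination."""
    return set(src_features) <= set(dst_features)


def missing_features(src_features, dst_features):
    """Features the guest may rely on that the destination lacks."""
    return sorted(set(src_features) - set(dst_features))


if __name__ == "__main__":
    rich = {"avx", "sse2", "mmx"}
    poor = {"avx", "mmx"}
    print(can_migrate(poor, rich))        # source is a subset of destination
    print(can_migrate(rich, poor))        # destination lacks a feature
    print(missing_features(rich, poor))   # which feature is missing
```

Pinning every guest to a named cpu_model that all compute nodes support makes the source side of this check identical across the cluster.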
o LM to a specific host does not use memory oversubscription
o ram_allocation_ratio
Memory Oversubscription
[Diagram: compute node A has 2 GB RAM and reports RAM = available - reserved; nova-conductor sees 2 GB, while nova-scheduler with ram_allocation_ratio = 2.0 sees 4 GB]
o Skip it by
o reserved_host_memory_mb=-2048
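The arithmetic behind this workaround can be checked with the formula from the slide, reported RAM = available - reserved, together with the scheduler multiplying the reported value by ram_allocation_ratio. The sketch below is a simplified model and not nova's code; the function names are hypothetical, and a reservation of 0 MB is assumed for the first case.

```python
def reported_ram_mb(available_mb, reserved_host_memory_mb):
    """RAM the compute node reports: available minus reserved."""
    return available_mb - reserved_host_memory_mb


def scheduler_capacity_mb(reported_mb, ram_allocation_ratio):
    """RAM the scheduler is willing to hand out after oversubscription."""
    return int(reported_mb * ram_allocation_ratio)


if __name__ == "__main__":
    # Oversubscription via the ratio: only the scheduler sees the extra RAM.
    reported = reported_ram_mb(2048, 0)
    print(reported, scheduler_capacity_mb(reported, 2.0))
    # Workaround: negative reservation, ratio 1.0, both views agree.
    reported = reported_ram_mb(2048, -2048)
    print(reported, scheduler_capacity_mb(reported, 1.0))
```

With a negative reservation the inflated figure is already part of the reported RAM, so a check that ignores ram_allocation_ratio still sees the oversubscribed capacity.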
Memory Oversubscription
[Diagram: compute node A has 2 GB RAM and reports RAM = available - reserved; with ram_allocation_ratio = 1.0 both nova-conductor and nova-scheduler see 4 GB]
o Everything can be sniffed!
o Migrated machines can contain sensitive data
o Legal issues with unencrypted data transfer
o Hypervisor native encryption
o QEMU doesn’t support it
o libvirt tunneled transport
o live_migration_uri = qemu+ssh://%s/system
o live_migration_flag += VIR_MIGRATE_TUNNELLED
o Uses only one core
o IPSec tunnel between hosts
[Chart: transfer rate in GBps (axis 0 to 3) for QEMU+SSH vs. QEMU+TCP, measured on Intel(R) Xeon(R) CPU E5-2690 v2 and Intel(R) Xeon(R) CPU E5-2660 v3]
o Compress every page sent during LM
o zlib used for compression
o Configurable:
o Number of threads
o Compression ratio
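Per-page zlib compression with a configurable thread count can be sketched with the Python standard library. This only illustrates the idea; the hypervisor implements it natively, and the function name, the 4096-byte page size, and the mapping of the "compression ratio" knob to a zlib level are assumptions.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

PAGE_SIZE = 4096  # assumed page size for the example


def compress_pages(pages, level=1, threads=4):
    """Compress every page independently using a pool of worker threads.

    level is the zlib level (1 = fastest, 9 = smallest output).
    Results are returned in the same order as the input pages.
    """
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(lambda page: zlib.compress(page, level), pages))


if __name__ == "__main__":
    pages = [bytes([i]) * PAGE_SIZE for i in range(8)]
    packed = compress_pages(pages, level=1, threads=4)
    restored = [zlib.decompress(blob) for blob in packed]
    print(restored == pages, sum(len(b) for b in packed))
```

A higher level trades CPU time on the source host for less data on the wire, so the useful setting depends on whether the network or the CPU is the bottleneck.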
o Move workload immediately to destination host
Post-copy Live Migration
[Diagram: VM A paused on compute node A; VM A active on compute node B]
o Cheap solution to finish live migration in a finite time
o VM needs to be rebooted in case of failure
o Heavy performance impact
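The behaviour described for post-copy can be modelled as demand paging across hosts: the VM runs on the destination right away, and any page it touches that has not arrived yet must first be fetched from the source. The toy class below is a sketch of that idea under simplifying assumptions (no background transfer, no failure handling); the class and attribute names are hypothetical.

```python
class PostCopyDestination:
    """Toy model of a destination host during post-copy migration."""

    def __init__(self, source_pages):
        self._source = source_pages  # pages still held only by the source host
        self._local = {}             # pages already present on the destination
        self.remote_faults = 0       # accesses that had to wait for the network

    def read(self, page_no):
        """Return a page, pulling it from the source on first access."""
        if page_no not in self._local:
            self.remote_faults += 1
            self._local[page_no] = self._source[page_no]
        return self._local[page_no]


if __name__ == "__main__":
    source = {n: bytes([n]) * 16 for n in range(4)}
    vm = PostCopyDestination(source)
    for page in (0, 1, 0, 3, 1):
        vm.read(page)
    print(vm.remote_faults)  # first touches of pages 0, 1 and 3
```

Every remote fault stalls the guest for a network round trip, which is where the performance impact comes from, and because the memory is split across two hosts, losing either one during the migration leaves no complete copy of the VM.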
o Track memory transfer progress
o Detect possible problems and take actions
o Pause VM
o Abort LM
o See progress
o Change configuration on the fly:
o Maximum tolerable VM downtime
o Transfer bandwidth
Your voice matters!
o Mailing lists:
o Win The Enterprise group:
o [email protected] (IRC: pkoniszewski)
o [email protected] (IRC: inc0)
Q&A (& disclaimers)
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system
manufacturer or retailer or learn more at intel.com.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are
measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other
information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. © 2015 Intel Corporation.