[openstack-dev] Update on live migration priority
Murray, Paul (HP Cloud)
pmurray at hpe.com
Fri Feb 12 16:15:46 UTC 2016
The objective for the live migration priority is to improve the stability of migrations based on operator experience. The high level approach is to do the following:
1. Improve CI
2. Improve documentation
3. Improve manageability of migrations
4. Fix bugs
In this cycle we targeted a few immediately implementable features that would help, specifically giving operators commands to manage migrations (inspect progress, force completion, and cancel) and improving security (split networks and removal of ssh-based resize/migration, aka storage pools).
Most of these are on track to be completed in this cycle with the exception of storage pools work which is being deferred. Further details follow.
Expand CI coverage - in progress
There is a job in the experimental queue called gate-tempest-dsvm-multinode-live-migration. This will become the job that performs live migration tests; any live migration tests in other jobs will be removed. At present the job has been configured to cover different storage configurations including cinder, NFS and ceph. Tests are now being added to the job. Patches are currently up for live migration of instances with swap and instances with ephemeral disks.
Please trigger the experimental queue if your patches touch migrations in some way so we can check the stability of the jobs. Once stable and with sufficient tests we will promote the job from the experimental queue so that it always runs.
Improve API docs - done
Some changes were made to the API guide for moving servers, including better descriptions for the server actions migrate, live migrate, shelve, resize and evacuate ( http://developer.openstack.org/api-guide/compute/server_concepts.html#server-actions ) and a section that describes reasons for moving VMs with common use cases outlined ( http://developer.openstack.org/api-guide/compute/server_concepts.html#moving-servers )
Block live migration with attached volumes - done
The selective block device migration API in libvirt 1.2.17 is used to allow block migration when volumes are attached. A follow-on patch to allow read-only drives to be copied in block migration has not been completed. This patch is required to allow iso9660 format config drives to be migrated. Without it only vfat config drives can be migrated. There is still some thought going into that - see: https://review.openstack.org/#/c/234659
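As a rough illustration of the selection step described above, the sketch below picks which disks a block migration would copy: local, writable disks are included, while attached volumes and read-only drives are skipped. The `Disk` type, its fields and the `disks_to_copy` helper are hypothetical and are not nova's or libvirt's data model.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Disk:
    """Hypothetical description of a guest disk (not nova's real model)."""
    target_dev: str          # device name inside the guest, e.g. "vda"
    readonly: bool = False   # True for read-only drives such as an ISO image


def disks_to_copy(disks: List[Disk], volume_devs: Set[str]) -> List[str]:
    """Return the device names that a block migration should copy."""
    selected = []
    for disk in disks:
        if disk.target_dev in volume_devs:
            # Attached volume: its storage is reachable from the target host.
            continue
        if disk.readonly:
            # Read-only drives are not copied yet (see the review linked above).
            continue
        selected.append(disk.target_dev)
    return selected
```

With a root disk `vda`, an attached volume `vdb` and a read-only config drive `hdd`, only `vda` would be selected for copying.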
Force complete - requires python-novaclient change
Force-complete forces a live migration to complete by pausing the VM and restarting it when it has completed migration. This is intended as a brute force way to make a VM complete its migration when it is taking too long. In the future auto-converge and post-copy will be looked at. These became available in qemu 2.5.
Force complete is done in nova but still requires a change to python-novaclient to implement the CLI.
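Until the CLI lands, a force-complete request against the REST API could be assembled roughly as below. This is a sketch only: the URL layout, the `force_complete` action body and the microversion value are assumptions based on the published compute API reference rather than on this message, and `build_force_complete_request` is a hypothetical helper that only builds the request without sending it.

```python
import json
from typing import Dict, Tuple


def build_force_complete_request(
    compute_url: str, server_id: str, migration_id: int, token: str
) -> Tuple[str, str, Dict[str, str], str]:
    """Return (method, url, headers, body) for a force-complete call."""
    url = "{}/servers/{}/migrations/{}/action".format(
        compute_url.rstrip("/"), server_id, migration_id)
    headers = {
        "X-Auth-Token": token,
        "Content-Type": "application/json",
        # Assumed microversion that exposes the force_complete action.
        "X-OpenStack-Nova-API-Version": "2.22",
    }
    # The action takes no arguments, so the value is JSON null.
    body = json.dumps({"force_complete": None})
    return "POST", url, headers, body
```

The returned tuple can be handed to any HTTP client; keeping the builder free of network calls makes it easy to check in isolation.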
Cancel - in progress
Cancel stops a live migration, leaving the VM on the source host with the migration status left as "cancelled". This is in progress and follows the pattern of force-complete. Unfortunately this needs to be bundled up into one patch to avoid multiple API bumps.
Patches for review:
Progress reporting - in progress (no pun intended)
Progress reporting introduces migrations as a sub-resource of servers and adds progress data to the migration record. There was some debate at the mid-cycle and on the mailing list about how to record this transient data. It is a waste to keep writing it to the database, but because it is generated at the compute manager and examined at the API, it was felt that writing it to the database is necessary to fit the existing architecture. The conclusion was that writing to the database every 5 seconds would not cause a significant overhead. Alternatives could be pursued later if necessary. For discussion see this ML thread: http://lists.openstack.org/pipermail/openstack-dev/2016-February/085662.html and the IRC meeting transcript here: http://eavesdrop.openstack.org/meetings/nova_live_migration/2016/nova_live_migration.2016-02-09-14.01.log.html
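The 5 second write interval amounts to throttling: progress is sampled as often as the monitor likes, but persisted at most once per interval. A minimal sketch of that idea follows; `ThrottledProgressWriter` and its `save` callback are hypothetical names for illustration and do not reflect the code under review.

```python
import time
from typing import Callable, Optional


class ThrottledProgressWriter:
    """Persist progress at most once per `interval` seconds."""

    def __init__(self, save: Callable[[dict], None], interval: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._save = save              # e.g. a function that writes to the DB
        self._interval = interval
        self._clock = clock            # injectable so the class can be tested
        self._last_write: Optional[float] = None

    def update(self, progress: dict) -> bool:
        """Save `progress` if the interval has elapsed; return True if saved."""
        now = self._clock()
        if (self._last_write is not None
                and now - self._last_write < self._interval):
            return False               # too soon, drop this sample
        self._save(progress)
        self._last_write = now
        return True
```

Samples that arrive inside the interval are simply dropped, which is acceptable here because each sample supersedes the previous one.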
Patches for review:
Split networking - done
Split networking adds a configuration parameter, live_migration_inbound_addr, that specifies the IP address or host name to be used as the target for migration traffic. This allows migration traffic to be isolated on a separate network from other management traffic, providing an opportunity to isolate service levels for the two networks and to improve security by moving unencrypted migration traffic to an isolated network.
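As an illustration of how the parameter might be used, the sketch below reads it from a sample configuration fragment and falls back to the host name when it is unset. The `[libvirt]` section name, the example address and the `migration_target` helper are assumptions for this sketch, not taken from this message.

```python
import configparser

SAMPLE_NOVA_CONF = """
[libvirt]
# Address on the dedicated migration network (example value).
live_migration_inbound_addr = 192.0.2.10
"""


def migration_target(conf_text: str, hostname: str) -> str:
    """Pick the address that incoming migration traffic should be sent to."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    addr = parser.get("libvirt", "live_migration_inbound_addr", fallback=None)
    # Fall back to the host name when no dedicated address is configured.
    return addr if addr else hostname
```

With the sample fragment, `migration_target(SAMPLE_NOVA_CONF, "compute-2")` selects the dedicated address; with the option absent it returns the host name unchanged.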
Resize/cold migrate using storage pools - deferred
The objective here was to change the libvirt implementation of migrate and resize to use libvirt storage pools instead of scp/rsync over ssh with passwordless keys. Storage pools are supported in all versions of libvirt supported by nova, so it was thought that by changing the implementation it would be possible to drop the ssh-based code. However, two flaws in this approach arose: the recently added ploop storage device does not work with storage pools in libvirt, and the libvirt data copy implementation is very inefficient and so slower than scp or rsync.
The guys at Parallels kindly agreed to implement storage pools support for ploop in libvirt and this work is already making progress. Work was also started in libvirt to improve the copy performance. These features will be available in a future release, so we will need to maintain old ssh-based migration for libvirt as well as refactor and implement the storage pools based alternative.
Work has started on refactoring the libvirt driver code but the following blueprints will be deferred beyond Mitaka:
Deprecate migration flags - done
There are a lot of migration flags used with libvirt that are either redundant or can be inferred from the deployed configuration. These are being deprecated and will be removed in the next cycle.
Feel free to respond with corrections or additions.
Technical Lead, HPE Cloud
Hewlett Packard Enterprise
+44 117 316 2527