[openstack-dev] [nova] live migration in Mitaka
Paul Carlton
paul.carlton2 at hpe.com
Mon Sep 21 12:50:32 UTC 2015
Daniel,
Thanks.
We will need to do some work to recreate the instance performance and
disk I/O issues and investigate further.
My original message did not go out to the mailing list due to a
subscription issue, so I am including it here.
I'm just starting work on Nova upstream, having been focused on live
migration orchestration in our large Public Cloud environment. We were
trying to use live migration to do rolling reboots of compute nodes in
order to apply software patches that required node or virtual machine
restarts. For this sort of activity to work at large scale, the
orchestration needs to be highly automated and integrate with the
operations monitoring and issue tracking systems. It also needs the
mechanism used to move instances to be highly robust.
However, the most significant impediment we encountered was customer
complaints about the performance of instances during migration. We did
a little bit of work to identify the cause of this and concluded that
the main issue was disk I/O contention. I wonder if this is something
you or others have encountered? I'd be interested in any ideas for
managing the rate of the migration processing to prevent it from
adversely impacting the customer's application performance. I
appreciate that if we throttle the migration processing it will take
longer and may not be able to keep up with the rate of disk/memory
change in the instance.
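For illustration, one knob we have started experimenting with is
libvirt's migration bandwidth cap. A minimal sketch using the libvirt
python bindings (this is outside Nova, and the connection URI and
domain name are made-up examples):

    import libvirt

    # Cap the migration stream's bandwidth so it leaves headroom for
    # the guest's own disk and network I/O; the cap can be raised
    # again off-peak if the migration falls behind.
    conn = libvirt.open('qemu:///system')         # example hypervisor URI
    dom = conn.lookupByName('instance-00000042')  # hypothetical domain name
    dom.migrateSetMaxSpeed(100)                   # bandwidth cap in MiB/s
    conn.close()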
Could you point me at somewhere I can get details of the tunable
settings relating to cutover downtime, please? I'm assuming that these
are libvirt/qemu settings? I'd like to play with them in our test
environment to see if we can simulate busy instances and determine what
works. I'd also be happy to do some work to expose these in nova so the
cloud operator can tweak them if necessary.
I understand that you have added some functionality to the nova compute
manager to collect data on migration progress and emit this to the log
file. I'd like to propose that we extend this to emit notification
messages containing progress information, so a cloud operator's
orchestration can consume these events and use them to monitor the
progress of individual migrations. This information could be used to
generate alerts or tickets so that support staff can intervene. The
smarts in qemu to help it make progress are very welcome and necessary,
but in my experience the cloud operator needs to be able to manage
these. If it is necessary to slow down or even pause a customer's
instance to complete the migration, the cloud operator may need to gain
customer consent before proceeding.
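To make the proposal concrete, a consumer of such notifications might
look something like the sketch below, using oslo.messaging. The
'compute.instance.live_migration.progress' event type and its payload
keys are hypothetical -- they are the thing being proposed, not
anything Nova emits today:

    from oslo_config import cfg
    import oslo_messaging

    class MigrationProgressEndpoint(object):
        """Handles INFO-priority notifications from compute nodes."""
        def info(self, ctxt, publisher_id, event_type, payload, metadata):
            # Hypothetical event type and payload keys (the proposal).
            if event_type == 'compute.instance.live_migration.progress':
                # An orchestrator could raise alerts/tickets here when
                # progress stalls, instead of just printing.
                print('%s: %s%% complete' % (payload.get('instance_id'),
                                             payload.get('progress')))

    transport = oslo_messaging.get_notification_transport(
        cfg.CONF, url='rabbit://guest:guest@localhost:5672/')
    targets = [oslo_messaging.Target(topic='notifications')]
    listener = oslo_messaging.get_notification_listener(
        transport, targets, [MigrationProgressEndpoint()])
    listener.start()
    listener.wait()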
I am also considering submitting a proposal to build on the current spec
for monitoring and cancelling migrations to make the migration status
information available to users (based on policy setting) and include an
estimated time to complete information in the response. I appreciate
that this would only be an 'estimate' but it may give the user some idea
of how long they will need to wait until they can perform operations on
their instance that are not currently permitted during migration. To
cater for the scenario where a customer urgently needs to perform an
inhibited operation (like attach or detach a volume) then I would
propose that we allow for a user to cancel the migration of their own
instances. This would be enabled for authorized users based on granting
them a specific role.
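As a strawman for the API shape (the endpoint and the role check are
hypothetical -- they are the proposal itself, not an existing API):

    import requests

    def cancel_migration(compute_url, token, server_id, migration_id):
        """DELETE the in-progress migration resource to abort it.

        Would only succeed if the caller's token carries the
        (hypothetical) migration-cancel role permitted by policy.
        """
        resp = requests.delete(
            '%s/servers/%s/migrations/%s' % (compute_url, server_id,
                                             migration_id),
            headers={'X-Auth-Token': token})
        resp.raise_for_status()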
More thoughts Monday!
-----Original Message-----
From: Daniel P. Berrange [mailto:berrange at redhat.com]
Sent: 21 September 2015 09:56
To: Carlton, Paul (Cloud Services)
Cc: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] [nova] live migration in Mitaka
On Fri, Sep 18, 2015 at 05:47:31PM +0000, Carlton, Paul (Cloud Services)
wrote:
> However the most significant impediment we encountered was customer
> complaints about performance of instances during migration. We did a
> little bit of work to identify the cause of this and concluded that
> the main issue was disk I/O contention. I wonder if this is
> something you or others have encountered? I'd be interested in any
> ideas for managing the rate of the migration processing to prevent it
> from adversely impacting the customer application performance. I
> appreciate that if we throttle the migration processing it will take
> longer and may not be able to keep up with the rate of disk/memory change in
> the instance.
I would not expect live migration to have an impact on disk I/O, unless
your storage is network based and using the same network as the
migration data. While migration is taking place you'll see a small
impact on the guest compute performance, due to page table dirty bitmap
tracking, but that shouldn't appear directly as a disk I/O problem. There
is no throttling of guest I/O at all during migration.
> Could you point me at somewhere I can get details of the tunable
> settings relating to cutover downtime, please? I'm assuming that
> these are libvirt/qemu settings? I'd like to play with them in our
> test environment to see if we can simulate busy instances and
> determine what works. I'd also be happy to do some work to expose
> these in nova so the cloud operator can tweak them if necessary.
It is already exposed as 'live_migration_downtime', along with
'live_migration_downtime_steps' and 'live_migration_downtime_delay'.
Again, it shouldn't have any impact on guest performance while live
migration is taking place. It only comes into effect when checking
whether the guest is ready to switch to the new host.
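They live in the [libvirt] section of nova.conf; roughly, with the
current defaults (check your release's config reference for the exact
values and minimums):

    [libvirt]
    # maximum permitted downtime at cutover, in milliseconds
    live_migration_downtime = 500
    # number of incremental steps used to reach that maximum
    live_migration_downtime_steps = 10
    # delay between steps, in seconds, scaled by guest RAM size in GB
    live_migration_downtime_delay = 75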
> I understand that you have added some functionality to the nova
> compute manager to collect data on migration progress and emit this to the
> log file.
> I'd like to propose that we extend this to emit notification messages
> containing progress information so a cloud operator's orchestration
> can consume these events and use them to monitor progress of
> individual migrations. This information could be used to generate
> alerts or tickets so that support staff can intervene. The smarts in
> qemu to help it make progress are very welcome and necessary but in my
> experience the cloud operator needs to be able to manage these and if
> it is necessary to slow down or even pause a customer's instance to
> complete the migration the cloud operator may need to gain customer consent
> before proceeding.
We already update the Nova instance object's 'progress' value with the
info on the migration progress. IIRC, this is visible via
'nova show <instance>' or something like that.
Regards,
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|