[openstack-dev] [nova] live migration in Mitaka
Paul Carlton
paul.carlton2 at hpe.com
Mon Sep 21 12:50:32 UTC 2015
Daniel,
Thanks.
We will need to do some work to recreate the instance performance and
disk I/O issues and investigate further.
My original message did not go out to the mailing list due to a
subscription issue, so I am including it here.
I'm just starting work on Nova upstream, having been focused on live
migration orchestration in our large Public Cloud environment. We were
trying to use live migration to do rolling reboots of compute nodes in
order to apply software patches that required node or virtual machine
restarts. For this sort of activity to work at large scale, the
orchestration needs to be highly automated and integrate with the
operations monitoring and issue tracking systems. It also needs the
mechanism used to move instances to be highly robust.
However, the most significant impediment we encountered was customer
complaints about the performance of instances during migration. We did
a little bit of work to identify the cause of this and concluded that
the main issue was disk I/O contention. I wonder if this is something
you or others have encountered? I'd be interested in any ideas for
managing the rate of the migration processing to prevent it from
adversely impacting the customer's application performance. I
appreciate that if we throttle the migration processing it will take
longer and may not be able to keep up with the rate of disk/memory
change in the instance.
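For illustration, one knob we have started experimenting with is
libvirt's migration bandwidth cap. A minimal sketch using the libvirt
python bindings (this is outside Nova, and the connection URI and
domain name are made-up examples):

    import libvirt

    # Cap the migration stream's bandwidth so it leaves headroom for
    # the guest's own disk and network I/O; the cap can be raised
    # again off-peak if the migration falls behind.
    conn = libvirt.open('qemu:///system')         # example hypervisor URI
    dom = conn.lookupByName('instance-00000042')  # hypothetical domain name
    dom.migrateSetMaxSpeed(100)                   # bandwidth cap in MiB/s
    conn.close()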
Could you point me at somewhere I can get details of the tunable
settings relating to cutover downtime, please? I'm assuming that these
are libvirt/qemu settings? I'd like to play with them in our test
environment to see if we can simulate busy instances and determine what
works. I'd also be happy to do some work to expose these in nova so the
cloud operator can tweak them if necessary.
I understand that you have added some functionality to the nova compute
manager to collect data on migration progress and emit this to the log
file. I'd like to propose that we extend this to emit notification
messages containing progress information, so a cloud operator's
orchestration can consume these events and use them to monitor the
progress of individual migrations. This information could be used to
generate alerts or tickets so that support staff can intervene. The
smarts in qemu to help it make progress are very welcome and necessary,
but in my experience the cloud operator needs to be able to manage
these. If it is necessary to slow down or even pause a customer's
instance to complete the migration, the cloud operator may need to gain
customer consent before proceeding.
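To make the proposal concrete, a consumer of such notifications might
look something like the sketch below, using oslo.messaging. The
'compute.instance.live_migration.progress' event type and its payload
keys are hypothetical -- they are the thing being proposed, not
anything Nova emits today:

    from oslo_config import cfg
    import oslo_messaging

    class MigrationProgressEndpoint(object):
        """Handles INFO-priority notifications from compute nodes."""
        def info(self, ctxt, publisher_id, event_type, payload, metadata):
            # Hypothetical event type and payload keys (the proposal).
            if event_type == 'compute.instance.live_migration.progress':
                # An orchestrator could raise alerts/tickets here when
                # progress stalls, instead of just printing.
                print('%s: %s%% complete' % (payload.get('instance_id'),
                                             payload.get('progress')))

    transport = oslo_messaging.get_notification_transport(
        cfg.CONF, url='rabbit://guest:guest@localhost:5672/')
    targets = [oslo_messaging.Target(topic='notifications')]
    listener = oslo_messaging.get_notification_listener(
        transport, targets, [MigrationProgressEndpoint()])
    listener.start()
    listener.wait()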
I am also considering submitting a proposal to build on the current spec
for monitoring and cancelling migrations to make the migration status
information available to users (based on policy setting) and include an
estimated time to complete information in the response. I appreciate
that this would only be an 'estimate' but it may give the user some idea
of how long they will need to wait until they can perform operations on
their instance that are not currently permitted during migration. To
cater for the scenario where a customer urgently needs to perform an
inhibited operation (like attach or detach a volume) then I would
propose that we allow for a user to cancel the migration of their own
instances. This would be enabled for authorized users based on granting
them a specific role.
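As a strawman for the API shape (the endpoint and the role check are
hypothetical -- they are the proposal itself, not an existing API):

    import requests

    def cancel_migration(compute_url, token, server_id, migration_id):
        """DELETE the in-progress migration resource to abort it.

        Would only succeed if the caller's token carries the
        (hypothetical) migration-cancel role permitted by policy.
        """
        resp = requests.delete(
            '%s/servers/%s/migrations/%s' % (compute_url, server_id,
                                             migration_id),
            headers={'X-Auth-Token': token})
        resp.raise_for_status()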
More thoughts Monday!
-----Original Message-----
From: Daniel P. Berrange [mailto:berrange at redhat.com]
Sent: 21 September 2015 09:56
To: Carlton, Paul (Cloud Services)
Cc: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] [nova] live migration in Mitaka
On Fri, Sep 18, 2015 at 05:47:31PM +0000, Carlton, Paul (Cloud Services)
wrote:
> However the most significant impediment we encountered was customer
> complaints about performance of instances during migration. We did a
> little bit of work to identify the cause of this and concluded that
> the main issue was disk I/O contention. I wonder if this is
> something you or others have encountered? I'd be interested in any
> ideas for managing the rate of the migration processing to prevent it
> from adversely impacting the customer application performance. I
> appreciate that if we throttle the migration processing it will take
> longer and may not be able to keep up with the rate of disk/memory change in
> the instance.
I would not expect live migration to have an impact on disk I/O, unless
your storage is network based and using the same network as the
migration data. While migration is taking place you'll see a small
impact on the guest compute performance, due to page table dirty bitmap
tracking, but that shouldn't appear directly as a disk I/O problem. There
is no throttling of guest I/O at all during migration.
> Could you point me at somewhere I can get details of the tunable
> settings relating to cutover downtime, please? I'm assuming that
> these are libvirt/qemu settings? I'd like to play with them in our
> test environment to see if we can simulate busy instances and
> determine what works. I'd also be happy to do some work to expose
> these in nova so the cloud operator can tweak them if necessary.
It is already exposed as 'live_migration_downtime', along with
'live_migration_downtime_steps' and 'live_migration_downtime_delay'.
Again, it shouldn't have any impact on guest performance while live
migration is taking place. It only comes into effect when checking
whether the guest is ready to switch to the new host.
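They live in the [libvirt] section of nova.conf; roughly, with the
current defaults (check your release's config reference for the exact
values and minimums):

    [libvirt]
    # maximum permitted downtime at cutover, in milliseconds
    live_migration_downtime = 500
    # number of incremental steps used to reach that maximum
    live_migration_downtime_steps = 10
    # delay between steps, in seconds, scaled by guest RAM size in GB
    live_migration_downtime_delay = 75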
> I understand that you have added some functionality to the nova
> compute manager to collect data on migration progress and emit this to the
> log file.
> I'd like to propose that we extend this to emit notification messages
> containing progress information so a cloud operator's orchestration
> can consume these events and use them to monitor progress of
> individual migrations. This information could be used to generate
> alerts or tickets so that support staff can intervene. The smarts in
> qemu to help it make progress are very welcome and necessary but in my
> experience the cloud operator needs to be able to manage these and if
> it is necessary to slow down or even pause a customer's instance to
> complete the migration the cloud operator may need to gain customer consent
> before proceeding.
We already update the Nova instance object's 'progress' value with the
info on the migration progress. IIRC, this is visible via
'nova show <instance>' or something like that.
Regards,
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|