[Openstack-operators] About Live Migration and Best Practices

David Medberry openstack at medberry.net
Fri May 27 20:10:14 UTC 2016


In general with live migration (L-M) on a libvirt/kvm based env, I would
cc: daniel barrange (barrange at redhat.com), and I've done that since I
didn't see him cc'd specifically.

Generally speaking, we have taken to doing a single VM at a time. If we
have high confidence, we'll go 5 at a time (but never more than that),
and we use ansible with the "-f 5" flag to handle this for us (rough
sketch below).
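
A minimal sketch of that ad-hoc ansible pattern; the ./uuids inventory
file and the plain "nova live-migration" call are assumptions, not our
exact playbook:

  # ./uuids holds one instance UUID per line; each UUID is treated as a
  # "host", -c local runs everything on the CLI node, and -f 5 caps the
  # number of concurrent migrations
  ansible all -i ./uuids -c local -f 5 -m shell \
    -a "nova live-migration {{ inventory_hostname }}"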

In later versions (Liberty, Mitaka) I believe that OpenStack Nova
inherently handles this better.

Due to the issues you have seen (and that we also saw in the Icehouse,
Juno, and Kilo releases) we do NOT use "nova host-evacuate-live", but I
haven't tried it since our Liberty upgrade. With the appropriate
--max-servers 5 flag, it may work just fine now. I'll report back after I
give that a whirl in our test environments.
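
For the record, the invocation I intend to test looks roughly like this
(the compute host name is illustrative):

  # live-migrate everything off compute-01, at most 5 in flight at once
  nova host-evacuate-live --max-servers 5 compute-01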

As far as the "STATE" goes, there are many states you can end up in when
one fails (or more technically when it doesn't complete.) We end up with
ghost VMs and misplaced VMs. Ghost VMs are when there are really TWO VMs
out there (on both the source and destination node) which can easily leak
to disk corruption if both ever go active. Misplaced VMs occur when reality
(ps aux | grep [q]emu shows it on a node) and the nova database disagree
where the node is located.
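
A quick way to spot a misplaced VM is to compare the two views
(OS-EXT-SRV-ATTR:host is the standard admin-visible field, and the qemu
process carries the instance UUID on its command line):

  nova show $UUID | grep OS-EXT-SRV-ATTR:host        # where nova thinks it is
  ssh $COMPUTE "ps aux | grep [q]emu | grep $UUID"   # where it actually runs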

Cleaning up in either case usually involves doing a virsh destroy of the
ghost(s) and then a nova reboot --hard.
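
Roughly like this; the instance-000001a4 libvirt domain name is just
illustrative:

  # on the node that should NOT have the guest running:
  virsh destroy instance-000001a4
  # then reassert the known-good state through the API:
  nova reboot --hard $UUID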

* Note: We also use the term "invisible VMs" when nova list --all-tenants
doesn't show the VM, but that is usually just the paging/marker logic
stopping when it gets to 1000 non-deleted VMs.
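
The workaround we usually use there is to lift the cap when listing (with
the nova CLI of that era, --limit -1 makes the client page through
everything via the marker):

  nova list --all-tenants --limit -1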

ONE MORE THING:
If you are using ephemeral rbd volumes and you migrate from Kilo to
Liberty AND HAVE CONFIGDRIVE FORCED ON, you will likely need a patched
version of Nova or need to manually create rbd-based configdrives.
Everything will work fine until you stop and then start an instance. It
will run fine with /var/lib/nova/instances/$UUID_disk.config until such
time as it is stopped; when it gets started again, it assumes
rbd://instances/$UUID_disk.config exists and will typically fail to start
in that case.
Ref: https://bugs.launchpad.net/nova/mitaka/+bug/1582684
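
A hedged sketch of the manual workaround; the "instances" pool name
matches the rbd:// path above, the source path is wherever your flat
configdrive actually lives, and the instance should be stopped first:

  rbd -p instances import /var/lib/nova/instances/$UUID_disk.config ${UUID}_disk.config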


On Fri, May 27, 2016 at 12:19 PM, David Bayle <dbayle.mon at globo.tech> wrote:

> Greetings,
>
> First, thanks a lot for all the information provided regarding
> OpenStack, and thanks for your huge work on this topic (live
> migration).
>
> We are operating an OpenStack setup running Kilo + Ceph +
> (ceph_patch_for_snapshot).
>
> We are still playing with live migration on Kilo and we had some questions
> about it:
>
> - first, when we ask for a live migration from compute1 to compute2,
> does it take the exact same amount of RAM from compute1 and reserve it
> on compute2? or is there any small overhead?
> - and for the second question, does the NOSTATE power state in
> OpenStack reveal that the migration state of KVM has been lost (the kvm
> power state, for example), or does it reveal that there was an issue
> copying the RAM of the instance from one compute to the other?
>
> We faced some issues while trying host-live-evacuate, or even when we
> live migrate more than 4 to 5 instances at the same time: most of the
> time the live migration breaks and the VMs get NOSTATE for their power
> state in OpenStack, which is very disturbing because the only way we
> know to solve this is to restart the instance
> (we could also edit the mysql database as proposed on the community IRC
> channel).
> Live migrating the instances one by one gives no issue; more than that
> can result in live migration failures and NOSTATE in the OpenStack
> power status.
>
> Is there anything that we are doing wrong? We've seen
> host-live-evacuate work once or twice, but with around 15 VMs on a
> compute the behavior is totally different
> (and it doesn't seem we are maxing out any resources, though the
> network could be, as we are using a 1Gb/s management network).
> Here is an example of the issue faced with a host-live-evacuate; we get
> this on the source compute node:
>
> 2016-05-26 16:49:54.080 3963 WARNING nova.virt.libvirt.driver [-] [instance: 98b793a1-61fb-45c6-95b7-6c2bca10d6de] Error monitoring migration: internal error: received hangup / error event on socket
> 2016-05-26 16:49:54.080 3963 TRACE nova.virt.libvirt.driver [instance: 98b793a1-61fb-45c6-95b7-6c2bca10d6de] Traceback (most recent call last):
> 2016-05-26 16:49:54.080 3963 TRACE nova.virt.libvirt.driver [instance: 98b793a1-61fb-45c6-95b7-6c2bca10d6de]   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 5689, in _live_migration
> 2016-05-26 16:49:54.080 3963 TRACE nova.virt.libvirt.driver [instance: 98b793a1-61fb-45c6-95b7-6c2bca10d6de]     dom, finish_event)
> 2016-05-26 16:49:54.080 3963 TRACE nova.virt.libvirt.driver [instance: 98b793a1-61fb-45c6-95b7-6c2bca10d6de]   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 5521, in _live_migration_monitor
> 2016-05-26 16:49:54.080 3963 TRACE nova.virt.libvirt.driver [instance: 98b793a1-61fb-45c6-95b7-6c2bca10d6de]     info = host.DomainJobInfo.for_domain(dom)
> 2016-05-26 16:49:54.080 3963 TRACE nova.virt.libvirt.driver [instance: 98b793a1-61fb-45c6-95b7-6c2bca10d6de]   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/host.py", line 157, in for_domain
> 2016-05-26 16:49:54.080 3963 TRACE nova.virt.libvirt.driver [instance: 98b793a1-61fb-45c6-95b7-6c2bca10d6de]     stats = dom.jobStats()
> 2016-05-26 16:49:54.080 3963 TRACE nova.virt.libvirt.driver [instance: 98b793a1-61fb-45c6-95b7-6c2bca10d6de]   File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 183, in doit
> 2016-05-26 16:49:54.080 3963 TRACE nova.virt.libvirt.driver [instance: 98b793a1-61fb-45c6-95b7-6c2bca10d6de]     result = proxy_call(self._autowrap, f, *args, **kwargs)
> 2016-05-26 16:49:54.080 3963 TRACE nova.virt.libvirt.driver [instance: 98b793a1-61fb-45c6-95b7-6c2bca10d6de]   File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 141, in proxy_call
> 2016-05-26 16:49:54.080 3963 TRACE nova.virt.libvirt.driver [instance: 98b793a1-61fb-45c6-95b7-6c2bca10d6de]     rv = execute(f, *args, **kwargs)
> 2016-05-26 16:49:54.080 3963 TRACE nova.virt.libvirt.driver [instance: 98b793a1-61fb-45c6-95b7-6c2bca10d6de]   File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 122, in execute
> 2016-05-26 16:49:54.080 3963 TRACE nova.virt.libvirt.driver [instance: 98b793a1-61fb-45c6-95b7-6c2bca10d6de]     six.reraise(c, e, tb)
> 2016-05-26 16:49:54.080 3963 TRACE nova.virt.libvirt.driver [instance: 98b793a1-61fb-45c6-95b7-6c2bca10d6de]   File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 80, in tworker
> 2016-05-26 16:49:54.080 3963 TRACE nova.virt.libvirt.driver [instance: 98b793a1-61fb-45c6-95b7-6c2bca10d6de]     rv = meth(*args, **kwargs)
> 2016-05-26 16:49:54.080 3963 TRACE nova.virt.libvirt.driver [instance: 98b793a1-61fb-45c6-95b7-6c2bca10d6de]   File "/usr/lib/python2.7/dist-packages/libvirt.py", line 1133, in jobStats
> 2016-05-26 16:49:54.080 3963 TRACE nova.virt.libvirt.driver [instance: 98b793a1-61fb-45c6-95b7-6c2bca10d6de]     if ret is None: raise libvirtError('virDomainGetJobStats() failed', dom=self)
> 2016-05-26 16:49:54.080 3963 TRACE nova.virt.libvirt.driver [instance: 98b793a1-61fb-45c6-95b7-6c2bca10d6de] libvirtError: internal error: received hangup / error event on socket
>
> The first instance was successful, but then all the others crashed and
> went NOSTATE. Again, thank you for your help.
> Best regards, David.
>
> --
> David Bayle
> System Administrator
> GloboTech Communications
> Phone: 1-514-907-0050
> Toll Free: 1-(888)-GTCOMM1
> Fax: 1-(514)-907-0750
> support at globo.tech
> http://www.globo.tech
>
>
> _______________________________________________
> OpenStack-operators mailing list
> OpenStack-operators at lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>
>