[Openstack-operators] live migration fails and libvirtd(?) lockup after trusty > xenial/mitaka > newton upgrade

Vladimir Prokofev v at prokofev.me
Tue Dec 20 14:24:19 UTC 2016


Downgraded the following packages on compute nodes:

qemu/xenial-updates,xenial-security 1:2.5+dfsg-5ubuntu10.6 amd64
[upgradable from: 1:2.5+dfsg-5ubuntu10.5]
qemu-block-extra/xenial-updates,xenial-security 1:2.5+dfsg-5ubuntu10.6
amd64 [upgradable from: 1:2.5+dfsg-5ubuntu10.5]
qemu-kvm/xenial-updates,xenial-security 1:2.5+dfsg-5ubuntu10.6 amd64
[upgradable from: 1:2.5+dfsg-5ubuntu10.5]
qemu-system/xenial-updates,xenial-security 1:2.5+dfsg-5ubuntu10.6 amd64
[upgradable from: 1:2.5+dfsg-5ubuntu10.5]
qemu-system-common/xenial-updates,xenial-security 1:2.5+dfsg-5ubuntu10.6
amd64 [upgradable from: 1:2.5+dfsg-5ubuntu10.5]
qemu-system-misc/xenial-updates,xenial-security 1:2.5+dfsg-5ubuntu10.6
amd64 [upgradable from: 1:2.5+dfsg-5ubuntu10.5]
qemu-system-x86/xenial-updates,xenial-security 1:2.5+dfsg-5ubuntu10.6 amd64
[upgradable from: 1:2.5+dfsg-5ubuntu10.5]
qemu-user/xenial-updates,xenial-security 1:2.5+dfsg-5ubuntu10.6 amd64
[upgradable from: 1:2.5+dfsg-5ubuntu10.5]
qemu-utils/xenial-updates,xenial-security 1:2.5+dfsg-5ubuntu10.6 amd64
[upgradable from: 1:2.5+dfsg-5ubuntu10.5]

Restarted libvirtd and nova-compute.
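
For anyone repeating this: something along these lines should do it, with
the .debs from Chris's link in the current directory (filenames are
illustrative; the hold keeps apt from pulling 10.6 straight back in):

$ sudo dpkg -i ./qemu*10.5_amd64.deb
$ sudo apt-mark hold qemu qemu-kvm qemu-utils qemu-system-x86
$ sudo systemctl restart libvirt-bin nova-compute   # or libvirtd, depending on packaging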
Performed 16 live migrations of different instances between compute nodes -
0 failures.
Since it previously took at most 5-6 attempts to trigger the problem, and
usually only 2-3, I think I can safely state that this issue is resolved.
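
A crude way to re-test, as a sketch (instance UUID and target host are
placeholders; in practice I migrated a mix of different instances):

for i in $(seq 16); do
    nova live-migration <instance-uuid> <target-host>
    sleep 60    # give each migration time to settle before checking
    nova show <instance-uuid> | grep -E 'status|:host'
done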

Sean, thank you for pointing me to exactly the right solution.
Chris, thank you for providing me with a link to packages.

2016-12-20 15:59 GMT+03:00 Chris Sarginson <csargiso at gmail.com>:

> Hi Vladimir,
>
> The packages are available on launchpad here:
> https://launchpad.net/ubuntu/+source/qemu/1:2.5+dfsg-5ubuntu10.5/+build/10938755
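>
> The individual amd64 .debs are linked from that build page, so fetching
> and installing them should just be a matter of (exact filenames as listed
> on the page):
>
> $ wget <the .deb links from the page above>
> $ sudo dpkg -i ./*.deb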
>
> On Tue, 20 Dec 2016 at 12:49 Vladimir Prokofev <v at prokofev.me> wrote:
>
>> Checked with:
>> compute1:~$ dpkg-query -W qemu-system-x86
>> qemu-system-x86 1:2.5+dfsg-5ubuntu10.6
>> compute1:~$ qemu-system-x86_64 --version
>> QEMU emulator version 2.5.0 (Debian 1:2.5+dfsg-5ubuntu10.6), Copyright
>> (c) 2003-2008 Fabrice Bellard
>>
>> So I guess you may be right; now I have to confirm this.
>> Did you build your packages from source? 1:2.5+dfsg-5ubuntu10.5 is no
>> longer available in the official repository, and I can't find it anywhere
>> except as source code.
>>
>>
>> 2016-12-20 12:17 GMT+03:00 Sean Redmond <sean.redmond1 at gmail.com>:
>>
>> What version of qemu are you running? I was hit by this bug[1], which
>> seems to produce the same faults you are reporting, and I had to
>> downgrade the qemu packages to version '1:2.5+dfsg-5ubuntu10.5'.
>>
>> [1] https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1647389
>>
>> On Tue, Dec 20, 2016 at 12:10 AM, Vladimir Prokofev <v at prokofev.me>
>> wrote:
>>
>> Hello Ops.
>>
>> I want to pick your brains on a live migration issue, because I'm kind of
>> stuck at the moment.
>>
>> I'm running a small HA Ubuntu OpenStack cloud: 3 controllers (VMs on
>> Ubuntu/KVM) behind an haproxy LB, with corosync/pacemaker managing the
>> VIP; a Galera DB; 3 compute nodes (KVM hypervisor); 3 network nodes (VMs
>> on Ubuntu/KVM); and a CEPH cluster (4 OSD nodes, 3 MON nodes as VMs on
>> Ubuntu/KVM).
>> Nova, Cinder, and Glance use CEPH as a backend. Neutron uses linux-bridge.
>>
>> We were running Ubuntu Trusty 14.04/Mitaka and decided to upgrade to
>> Ubuntu Xenial 16.04/Newton. The issue appeared sometime after we upgraded
>> all Mitaka packages to their latest versions in preparation for the Xenial
>> upgrade, and it persisted after the move to Xenial/Newton. It's hard to
>> tell for sure, as we didn't use live migration for a while: it worked fine
>> under Trusty/Mitaka and was already broken under Xenial/latest-Mitaka.
>>
>>
>> Live migration fails with a 20-30% chance. The instance pauses on the
>> source node, but is not resumed on the target node. The target node
>> destroys the instance, assuming the LM failed, while on the source node
>> the instance stays paused. On the source node no new messages appear in
>> nova-compute.log, and commands such as "virsh list" produce no output and
>> never even exit. nova-compute can be restarted, but after the normal
>> startup messages it once again writes no new log entries. Nova still
>> considers the compute node up and running.
>> In the dashboard the instance now shows as residing on the new host, in
>> SHUTOFF state.
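>>
>> For what it's worth, the hang can be confirmed without blocking a shell,
>> e.g. (a sketch; debug symbols make the backtrace far more readable):
>>
>> compute1:~$ timeout 10 virsh list; echo rc=$?    # rc=124 means virsh hung
>> compute1:~$ sudo gdb -p $(pidof libvirtd) -batch \
>>     -ex "thread apply all bt" > /tmp/libvirtd-backtrace.txt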
>>
>> If I restart libvirtd on the target or source node, the whole system
>> "unfreezes". The source host unpauses the instance and it is live again.
>> But it now resides on the source node, while the database thinks it's on
>> the target node and shut down. Warning messages appear on the source host:
>>
>> 2016-12-20 01:13:37.095 20025 WARNING nova.compute.manager
>> [req-b3879539-989f-4075-9ef0-d23ef8868102 - - - - -] While synchronizing
>> instance power states, found 3 instances in the database and 4 instances on
>> the hypervisor.
>>
>> Currently I stop/destroy/undefine the instance on the source node and
>> launch it again via standard OpenStack means (rough command sketch below),
>> but this leads to an instance reboot.
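>>
>> Roughly, on the source node (virsh accepts the instance UUID in place of
>> the domain name), followed by a start through the API, since the DB
>> already has the instance on the target node and shut off:
>>
>> compute1:~$ virsh destroy cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53
>> compute1:~$ virsh undefine cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53
>> $ nova start cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53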
>>
>>
>> The last messages in nova-compute.log on the source node after the LM starts:
>> 2016-12-20 00:09:35.961 16834 INFO nova.virt.libvirt.migration
>> [req-51801b5e-b77a-4d76-ad87-176326ac910e 84498aa7e26443c4908d973f3e86d530
>> ecee1197f46e453dba25669554226ce5 - - -] [instance:
>> cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] Increasing downtime to 1251 ms
>> after 0 sec elapsed time
>> 2016-12-20 00:09:36.127 16834 INFO nova.virt.libvirt.driver
>> [req-51801b5e-b77a-4d76-ad87-176326ac910e 84498aa7e26443c4908d973f3e86d530
>> ecee1197f46e453dba25669554226ce5 - - -] [instance:
>> cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] Migration running for 0 secs,
>> memory 100% remaining; (bytes processed=0, remaining=0, total=0)
>> 2016-12-20 00:09:36.894 16834 INFO nova.compute.manager
>> [req-3a2a828f-8b76-4a65-b49b-ea8d232a3de5 - - - - -] [instance:
>> cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] VM Paused (Lifecycle Event)
>> 2016-12-20 00:09:37.046 16834 INFO nova.compute.manager
>> [req-3a2a828f-8b76-4a65-b49b-ea8d232a3de5 - - - - -] [instance:
>> cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] During sync_power_state the
>> instance has a pending task (migrating). Skip.
>> 2016-12-20 00:09:37.300 16834 INFO nova.virt.libvirt.driver
>> [req-51801b5e-b77a-4d76-ad87-176326ac910e 84498aa7e26443c4908d973f3e86d530
>> ecee1197f46e453dba25669554226ce5 - - -] [instance:
>> cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] Migration operation has completed
>> 2016-12-20 00:09:37.301 16834 INFO nova.compute.manager
>> [req-51801b5e-b77a-4d76-ad87-176326ac910e 84498aa7e26443c4908d973f3e86d530
>> ecee1197f46e453dba25669554226ce5 - - -] [instance:
>> cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] _post_live_migration() is started..
>> 2016-12-20 00:09:39.718 16834 INFO os_vif [req-51801b5e-b77a-4d76-ad87-176326ac910e 84498aa7e26443c4908d973f3e86d530 ecee1197f46e453dba25669554226ce5 - - -] Successfully unplugged vif VIFBridge(active=False,address=fa:16:3e:e3:c4:c7,bridge_name='brq0b7e75ee-b6',has_traffic_filtering=True,id=2a141727-77a7-4b43-b4cd-21fb415a6a55,network=Network(0b7e75ee-b691-4851-9b29-08c3c62a96e0),plugin='linux_bridge',port_profile=<?>,preserve_on_delete=False,vif_name='tap2a141727-77')
>> 2016-12-20 00:09:39.777 16834 INFO nova.virt.libvirt.driver
>> [req-51801b5e-b77a-4d76-ad87-176326ac910e 84498aa7e26443c4908d973f3e86d530
>> ecee1197f46e453dba25669554226ce5 - - -] [instance:
>> cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] Deleting instance files
>> /var/lib/nova/instances/cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53_del
>> 2016-12-20 00:09:39.778 16834 INFO nova.virt.libvirt.driver
>> [req-51801b5e-b77a-4d76-ad87-176326ac910e 84498aa7e26443c4908d973f3e86d530
>> ecee1197f46e453dba25669554226ce5 - - -] [instance:
>> cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] Deletion of
>> /var/lib/nova/instances/cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53_del complete
>> 2016-12-20 00:10:33.644 16834 INFO nova.compute.resource_tracker
>> [req-00c4c403-d039-4347-b7ad-3098834a891c - - - - -] Auditing locally
>> available compute resources for node compute1.cloud1.local
>>
>> No new messages appear until libvirtd is restarted.
>>
>>
>> Messages during LM in nova-compute.log on target node:
>> 2016-12-20 00:09:32.234 29438 INFO nova.virt.libvirt.driver
>> [req-51801b5e-b77a-4d76-ad87-176326ac910e 84498aa7e26443c4908d973f3e86d530
>> ecee1197f46e453dba25669554226ce5 - - -] Instance launched has CPU info:
>> {"vendor": "Intel", "model": "Nehalem", "arch": "x86_64", "features":
>> ["pge", "clflush", "sep", "syscall", "vme", "dtes64", "msr", "vmx", "xtpr",
>> "cmov", "ssse3", "est", "pat", "monitor", "smx", "pbe", "lm", "tsc", "nx",
>> "fxsr", "tm", "sse4.1", "pae", "sse4.2", "acpi", "mmx", "cx8", "mce", "de",
>> "rdtscp", "ht", "dca", "lahf_lm", "pdcm", "mca", "pdpe1gb", "apic", "sse",
>> "pse", "ds", "invtsc", "pni", "tm2", "sse2", "ss", "ds_cpl", "pcid", "fpu",
>> "cx16", "pse36", "mtrr", "popcnt"], "topology": {"cores": 6, "cells": 2,
>> "threads": 2, "sockets": 1}}
>> 2016-12-20 00:09:34.476 29438 INFO os_vif [req-51801b5e-b77a-4d76-ad87-176326ac910e 84498aa7e26443c4908d973f3e86d530 ecee1197f46e453dba25669554226ce5 - - -] Successfully plugged vif VIFBridge(active=True,address=fa:16:3e:e3:c4:c7,bridge_name='brq0b7e75ee-b6',has_traffic_filtering=True,id=2a141727-77a7-4b43-b4cd-21fb415a6a55,network=Network(0b7e75ee-b691-4851-9b29-08c3c62a96e0),plugin='linux_bridge',port_profile=<?>,preserve_on_delete=False,vif_name='tap2a141727-77')
>> 2016-12-20 00:09:36.711 29438 INFO nova.compute.manager
>> [req-75abcda4-701a-4b84-8f28-c11f6ad6f933 - - - - -] [instance:
>> cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] VM Started (Lifecycle Event)
>> 2016-12-20 00:09:39.777 29438 INFO nova.compute.manager
>> [req-51801b5e-b77a-4d76-ad87-176326ac910e 84498aa7e26443c4908d973f3e86d530
>> ecee1197f46e453dba25669554226ce5 - - -] [instance:
>> cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] Post operation of migration started
>> 2016-12-20 00:09:40.381 29438 INFO nova.network.neutronv2.api
>> [req-51801b5e-b77a-4d76-ad87-176326ac910e 84498aa7e26443c4908d973f3e86d530
>> ecee1197f46e453dba25669554226ce5 - - -] [instance:
>> cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] Updating port
>> 2a141727-77a7-4b43-b4cd-21fb415a6a55 with attributes {'binding:profile':
>> {}, 'binding:host_id': 'compute3'}
>> 2016-12-20 00:09:52.589 29438 INFO nova.compute.manager [-] [instance:
>> cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] VM Stopped (Lifecycle Event)
>> 2016-12-20 00:09:52.747 29438 WARNING nova.compute.manager
>> [req-e945fe0d-dea0-47ee-8b87-b88991b9796e - - - - -] [instance:
>> cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] Instance shutdown by itself.
>> Calling the stop API. Current vm_state: active, current task_state: None,
>> original DB power_state: 4, current VM power_state: 4
>> 2016-12-20 00:09:52.964 29438 INFO nova.compute.manager
>> [req-e945fe0d-dea0-47ee-8b87-b88991b9796e - - - - -] [instance:
>> cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] Instance is already powered off in
>> the hypervisor when stop is called.
>> 2016-12-20 00:09:53.039 29438 INFO nova.virt.libvirt.driver
>> [req-e945fe0d-dea0-47ee-8b87-b88991b9796e - - - - -] [instance:
>> cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] Instance already shutdown.
>> 2016-12-20 00:09:53.043 29438 INFO nova.virt.libvirt.driver [-]
>> [instance: cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] Instance destroyed
>> successfully.
>> 2016-12-20 00:09:56.696 29438 INFO nova.compute.resource_tracker
>> [req-53e5b7a1-aba9-4b26-a3fb-b9b293bf2f17 - - - - -] Auditing locally
>> available compute resources for node compute3.cloud1.local
>> 2016-12-20 00:09:57.295 29438 INFO nova.compute.resource_tracker
>> [req-53e5b7a1-aba9-4b26-a3fb-b9b293bf2f17 - - - - -] Total usable vcpus:
>> 24, total allocated vcpus: 39
>> 2016-12-20 00:09:57.295 29438 INFO nova.compute.resource_tracker
>> [req-53e5b7a1-aba9-4b26-a3fb-b9b293bf2f17 - - - - -] Final resource
>> view: name=compute3.cloud1.local phys_ram=145060MB used_ram=94720MB
>> phys_disk=14710GB used_disk=45GB total_vcpus=24 used_vcpus=39 pci_stats=[]
>>
>> The target node then proceeds with normal operations.
>>
>> Nothing interesting appears in the libvirtd log or syslog on any compute
>> node. I've tried enabling debug logging for libvirt, but the output is
>> unreadable to me; all I can figure out is that libvirtd is still running
>> and other instances are unaffected.
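>>
>> For reference, debug logging was turned up roughly like this in
>> /etc/libvirt/libvirtd.conf (then libvirtd restarted); the filter list is
>> just my guess at the relevant subsystems:
>>
>> log_level = 1
>> log_filters = "1:qemu 1:libvirt 1:util"
>> log_outputs = "1:file:/var/log/libvirt/libvirtd-debug.log"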
>>
>>
>> So, my questions:
>> - Is there any workaround to sync the state of the DB and the compute
>> node? I'd like at least to avoid the instance reboot, and for that I need
>> some way to tell the DB that the instance is on the source node and
>> running, without service interruption. Is it OK to update the database
>> directly via SQL? (Rough sketch below.)
>> - How can I further troubleshoot this issue? As I said, I'm kind of stuck
>> and have no idea where to look next. If the issue appeared 100% of the
>> time it would be a lot easier, but there's no pattern I can figure out: it
>> happens randomly, with heavily loaded and idle instances alike.
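>>
>> For the direct-SQL route, what I have in mind is something like the
>> following against the nova database (untested, hence the question;
>> power_state 1 is RUNNING in nova's enum):
>>
>> $ mysql nova -e "UPDATE instances \
>>       SET host='compute1', node='compute1.cloud1.local', \
>>           vm_state='active', power_state=1, task_state=NULL \
>>       WHERE uuid='cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53';"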
>>
>> Any suggestions appreciated; I need a push in the right direction.
>> I will provide any logs and/or config files on request.
>>