[Openstack-operators] live migration fails and libvirtd(?) lockup after trusty > xenial/mitaka > newton upgrade

Sean Redmond sean.redmond1 at gmail.com
Tue Dec 20 09:17:10 UTC 2016


What version of qemu are you running? I was hit by this bug[1], which
produced the same symptoms you are describing, and had to downgrade the
qemu packages to version '1:2.5+dfsg-5ubuntu10.5'

[1] https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1647389
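
If you go that route, a rough sketch of the downgrade on Xenial (the
package names here are assumed from the stock Ubuntu qemu packaging -
check 'dpkg -l | grep qemu' on your compute nodes for the exact set you
have installed):

    apt-get install qemu-system-x86=1:2.5+dfsg-5ubuntu10.5 \
        qemu-system-common=1:2.5+dfsg-5ubuntu10.5 \
        qemu-utils=1:2.5+dfsg-5ubuntu10.5
    apt-mark hold qemu-system-x86 qemu-system-common qemu-utils

The hold keeps the next 'apt-get upgrade' from pulling the broken
version back in.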

On Tue, Dec 20, 2016 at 12:10 AM, Vladimir Prokofev <v at prokofev.me> wrote:

> Hello Ops.
>
> I want to pick your brains on a live migration issue, because I'm
> somewhat stuck at the moment.
>
> I'm running a small HA Ubuntu OpenStack cloud: 3 controllers (VMs on
> Ubuntu/KVM) behind an HAProxy LB, corosync/pacemaker managing the VIP, a
> Galera DB, 3 compute nodes (KVM hypervisor), 3 network nodes (VMs on
> Ubuntu/KVM), and a Ceph cluster (4 OSD nodes, 3 MON nodes (VMs on
> Ubuntu/KVM)).
> Nova, cinder, and glance use Ceph as a backend. Neutron uses linux-bridge.
>
> We were running Ubuntu Trusty 14.04/Mitaka and decided to upgrade to
> Ubuntu Xenial 16.04/Newton. The issue appeared sometime after we upgraded
> all the Mitaka packages to their latest versions (prior to the Xenial
> upgrade) and persisted after the upgrade to Xenial/Newton. It's hard to
> tell exactly when, as we didn't use LM for a while: it worked fine under
> Trusty/Mitaka and was already broken under Xenial/latest-Mitaka.
>
>
> Live migration fails roughly 20-30% of the time. The instance pauses on
> the source node but is never resumed on the target node. The target node
> destroys the instance, assuming the LM failed, while on the source node
> the instance stays paused. On the source node no new messages appear in
> nova-compute.log, and commands such as "virsh list" produce no output and
> never even exit. nova-compute can be restarted, but after the normal
> startup messages it once again writes no new log entries. Nova still
> considers the compute node up and running.
> In the dashboard the instance now shows as residing on the new host, in
> shutoff state.
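>
> A quick way to confirm that libvirtd is wedged without leaving a hung
> shell behind (assuming GNU coreutils' timeout is available):
>
>     timeout 10 virsh list --all || echo "virsh hung - libvirtd likely deadlocked"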
>
> If I restart libvirtd on the target or source node, the whole system
> "unfreezes". The source host unpauses the instance, and it is live again.
> But it now resides on the source node, while the database thinks it's on
> the target node and shut down. Warning messages like this one appear on
> the source host:
>
> 2016-12-20 01:13:37.095 20025 WARNING nova.compute.manager
> [req-b3879539-989f-4075-9ef0-d23ef8868102 - - - - -] While synchronizing
> instance power states, found 3 instances in the database and 4 instances on
> the hypervisor.
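>
> The restart that unfreezes everything is simply the following (service
> name as packaged on Ubuntu 16.04 assumed; other builds call the unit
> libvirtd):
>
>     systemctl restart libvirt-bin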
>
> Currently I stop/destroy/undefine the instance on the source node and
> launch it again via standard OpenStack means (sketched below), but this
> leads to an instance reboot.
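>
> Concretely, the cleanup looks roughly like this (the libvirt domain name
> instance-0000abcd is a placeholder - "virsh dominfo <name>" shows which
> instance UUID a domain maps to; this is destructive, so double-check the
> domain first):
>
>     virsh destroy instance-0000abcd     # force off the stale paused domain
>     virsh undefine instance-0000abcd    # drop its libvirt definition
>     nova start cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53   # boot it where the DB thinks it lives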
>
>
> Last messages in nova-compute.log on source node after LM start:
> 2016-12-20 00:09:35.961 16834 INFO nova.virt.libvirt.migration
> [req-51801b5e-b77a-4d76-ad87-176326ac910e 84498aa7e26443c4908d973f3e86d530
> ecee1197f46e453dba25669554226ce5 - - -] [instance:
> cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] Increasing downtime to 1251 ms
> after 0 sec elapsed time
> 2016-12-20 00:09:36.127 16834 INFO nova.virt.libvirt.driver
> [req-51801b5e-b77a-4d76-ad87-176326ac910e 84498aa7e26443c4908d973f3e86d530
> ecee1197f46e453dba25669554226ce5 - - -] [instance:
> cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] Migration running for 0 secs,
> memory 100% remaining; (bytes processed=0, remaining=0, total=0)
> 2016-12-20 00:09:36.894 16834 INFO nova.compute.manager
> [req-3a2a828f-8b76-4a65-b49b-ea8d232a3de5 - - - - -] [instance:
> cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] VM Paused (Lifecycle Event)
> 2016-12-20 00:09:37.046 16834 INFO nova.compute.manager
> [req-3a2a828f-8b76-4a65-b49b-ea8d232a3de5 - - - - -] [instance:
> cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] During sync_power_state the
> instance has a pending task (migrating). Skip.
> 2016-12-20 00:09:37.300 16834 INFO nova.virt.libvirt.driver
> [req-51801b5e-b77a-4d76-ad87-176326ac910e 84498aa7e26443c4908d973f3e86d530
> ecee1197f46e453dba25669554226ce5 - - -] [instance:
> cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] Migration operation has completed
> 2016-12-20 00:09:37.301 16834 INFO nova.compute.manager
> [req-51801b5e-b77a-4d76-ad87-176326ac910e 84498aa7e26443c4908d973f3e86d530
> ecee1197f46e453dba25669554226ce5 - - -] [instance:
> cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] _post_live_migration() is started..
> 2016-12-20 00:09:39.718 16834 INFO os_vif [req-51801b5e-b77a-4d76-ad87-176326ac910e 84498aa7e26443c4908d973f3e86d530 ecee1197f46e453dba25669554226ce5 - - -] Successfully unplugged vif VIFBridge(active=False,address=fa:16:3e:e3:c4:c7,bridge_name='brq0b7e75ee-b6',has_traffic_filtering=True,id=2a141727-77a7-4b43-b4cd-21fb415a6a55,network=Network(0b7e75ee-b691-4851-9b29-08c3c62a96e0),plugin='linux_bridge',port_profile=<?>,preserve_on_delete=False,vif_name='tap2a141727-77')
> 2016-12-20 00:09:39.777 16834 INFO nova.virt.libvirt.driver
> [req-51801b5e-b77a-4d76-ad87-176326ac910e 84498aa7e26443c4908d973f3e86d530
> ecee1197f46e453dba25669554226ce5 - - -] [instance:
> cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] Deleting instance files
> /var/lib/nova/instances/cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53_del
> 2016-12-20 00:09:39.778 16834 INFO nova.virt.libvirt.driver [req-51801b5e-b77a-4d76-ad87-176326ac910e 84498aa7e26443c4908d973f3e86d530 ecee1197f46e453dba25669554226ce5 - - -] [instance: cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] Deletion of /var/lib/nova/instances/cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53_del complete
> 2016-12-20 00:10:33.644 16834 INFO nova.compute.resource_tracker
> [req-00c4c403-d039-4347-b7ad-3098834a891c - - - - -] Auditing locally
> available compute resources for node compute1.cloud1.local
>
> No new messages will appear until libvirtd restart.
>
>
> Messages during LM in nova-compute.log on target node:
> 2016-12-20 00:09:32.234 29438 INFO nova.virt.libvirt.driver
> [req-51801b5e-b77a-4d76-ad87-176326ac910e 84498aa7e26443c4908d973f3e86d530
> ecee1197f46e453dba25669554226ce5 - - -] Instance launched has CPU info:
> {"vendor": "Intel", "model": "Nehalem", "arch": "x86_64", "features":
> ["pge", "clflush", "sep", "syscall", "vme", "dtes64", "msr", "vmx", "xtpr",
> "cmov", "ssse3", "est", "pat", "monitor", "smx", "pbe", "lm", "tsc", "nx",
> "fxsr", "tm", "sse4.1", "pae", "sse4.2", "acpi", "mmx", "cx8", "mce", "de",
> "rdtscp", "ht", "dca", "lahf_lm", "pdcm", "mca", "pdpe1gb", "apic", "sse",
> "pse", "ds", "invtsc", "pni", "tm2", "sse2", "ss", "ds_cpl", "pcid", "fpu",
> "cx16", "pse36", "mtrr", "popcnt"], "topology": {"cores": 6, "cells": 2,
> "threads": 2, "sockets": 1}}
> 2016-12-20 00:09:34.476 29438 INFO os_vif [req-51801b5e-b77a-4d76-ad87-176326ac910e 84498aa7e26443c4908d973f3e86d530 ecee1197f46e453dba25669554226ce5 - - -] Successfully plugged vif VIFBridge(active=True,address=fa:16:3e:e3:c4:c7,bridge_name='brq0b7e75ee-b6',has_traffic_filtering=True,id=2a141727-77a7-4b43-b4cd-21fb415a6a55,network=Network(0b7e75ee-b691-4851-9b29-08c3c62a96e0),plugin='linux_bridge',port_profile=<?>,preserve_on_delete=False,vif_name='tap2a141727-77')
> 2016-12-20 00:09:36.711 29438 INFO nova.compute.manager
> [req-75abcda4-701a-4b84-8f28-c11f6ad6f933 - - - - -] [instance:
> cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] VM Started (Lifecycle Event)
> 2016-12-20 00:09:39.777 29438 INFO nova.compute.manager
> [req-51801b5e-b77a-4d76-ad87-176326ac910e 84498aa7e26443c4908d973f3e86d530
> ecee1197f46e453dba25669554226ce5 - - -] [instance:
> cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] Post operation of migration started
> 2016-12-20 00:09:40.381 29438 INFO nova.network.neutronv2.api
> [req-51801b5e-b77a-4d76-ad87-176326ac910e 84498aa7e26443c4908d973f3e86d530
> ecee1197f46e453dba25669554226ce5 - - -] [instance:
> cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] Updating port
> 2a141727-77a7-4b43-b4cd-21fb415a6a55 with attributes {'binding:profile':
> {}, 'binding:host_id': 'compute3'}
> 2016-12-20 00:09:52.589 29438 INFO nova.compute.manager [-] [instance:
> cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] VM Stopped (Lifecycle Event)
> 2016-12-20 00:09:52.747 29438 WARNING nova.compute.manager
> [req-e945fe0d-dea0-47ee-8b87-b88991b9796e - - - - -] [instance:
> cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] Instance shutdown by itself.
> Calling the stop API. Current vm_state: active, current task_state: None,
> original DB power_state: 4, current VM power_state: 4
> 2016-12-20 00:09:52.964 29438 INFO nova.compute.manager
> [req-e945fe0d-dea0-47ee-8b87-b88991b9796e - - - - -] [instance:
> cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] Instance is already powered off in
> the hypervisor when stop is called.
> 2016-12-20 00:09:53.039 29438 INFO nova.virt.libvirt.driver
> [req-e945fe0d-dea0-47ee-8b87-b88991b9796e - - - - -] [instance:
> cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] Instance already shutdown.
> 2016-12-20 00:09:53.043 29438 INFO nova.virt.libvirt.driver [-] [instance:
> cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] Instance destroyed successfully.
> 2016-12-20 00:09:56.696 29438 INFO nova.compute.resource_tracker
> [req-53e5b7a1-aba9-4b26-a3fb-b9b293bf2f17 - - - - -] Auditing locally
> available compute resources for node compute3.cloud1.local
> 2016-12-20 00:09:57.295 29438 INFO nova.compute.resource_tracker
> [req-53e5b7a1-aba9-4b26-a3fb-b9b293bf2f17 - - - - -] Total usable vcpus:
> 24, total allocated vcpus: 39
> 2016-12-20 00:09:57.295 29438 INFO nova.compute.resource_tracker
> [req-53e5b7a1-aba9-4b26-a3fb-b9b293bf2f17 - - - - -] Final resource view:
> name=compute3.cloud1.local phys_ram=145060MB used_ram=94720MB
> phys_disk=14710GB used_disk=45GB total_vcpus=24 used_vcpus=39 pci_stats=[]
>
> The target node then proceeds with normal operations.
>
> Nothing interesting appears in the libvirtd log or syslog on any compute
> node. I've tried enabling debug logging for libvirt, but the output is
> unreadable to me - all I can figure out is that libvirt is still running
> and other instances are unaffected. (The settings used are sketched
> below.)
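>
> For reference, debug logging was enabled with settings along these lines
> in /etc/libvirt/libvirtd.conf, followed by a daemon restart (the filter
> list is a guess at what's relevant here - see the libvirt logging docs):
>
>     log_level = 1
>     log_filters = "1:qemu 1:libvirt 1:util"
>     log_outputs = "1:file:/var/log/libvirt/libvirtd-debug.log"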
>
>
> So, my questions:
> - Is there any workaround to sync the state of the DB and the compute
> node? I'd like at least to avoid an instance reboot, and for that I need
> some way to tell the DB that the instance is on the source node and
> running, without service interruption. Is it OK to update the database
> directly via SQL? (A sketch follows after these questions.)
> - How can I troubleshoot this further? As I said, I'm kinda stuck and
> have no idea where to look next. If the issue occurred 100% of the time
> it would be a lot easier, but there's no pattern I can figure out - it
> happens randomly, with heavily loaded and idle instances alike.
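>
> For the first question, a minimal sketch of the direct-DB approach,
> assuming the Newton nova schema and that the instance really is running
> on compute1 (back up the database first; power_state 1 means RUNNING):
>
>     UPDATE nova.instances
>        SET host = 'compute1',
>            node = 'compute1.cloud1.local',
>            vm_state = 'active',
>            power_state = 1,
>            task_state = NULL
>      WHERE uuid = 'cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53';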
>
> Any suggestions appreciated. I need a push in the right direction.
> I will provide any logs and/or config files on request.