[Openstack-operators] live migration fails and libvirtd(?) lockup after trusty > xenial/mitaka > newton upgrade

Vladimir Prokofev v at prokofev.me
Tue Dec 20 00:10:59 UTC 2016


Hello Ops.

I want to pick your brains on a live migration issue, because I'm kinda
stuck at the moment.

I'm running a small HA Ubuntu OpenStack cloud: 3 controllers (VMs on
Ubuntu/KVM) behind an HAProxy LB, with corosync/pacemaker managing the VIP,
a Galera DB, 3 compute nodes (KVM hypervisor), 3 network nodes (VMs on
Ubuntu/KVM), and a Ceph cluster (4 OSD nodes, 3 MON nodes (VMs on
Ubuntu/KVM)).
Nova, Cinder, and Glance use Ceph as the backend. Neutron uses linux-bridge.
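
For context, the Ceph-related bits of the [libvirt] section in nova.conf on
the compute nodes look roughly like this (pool, user, and ceph.conf path are
placeholders here, not necessarily our exact values):

[libvirt]
images_type = rbd
images_rbd_pool = vms
images_rbd_ceph_conf = /etc/ceph/ceph.conf
rbd_user = cinder
rbd_secret_uuid = <libvirt secret UUID>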

We were running Ubuntu Trusty 14.04/Mitaka and decided to upgrade to
Ubuntu Xenial 16.04/Newton. The issue appeared sometime after upgrading all
Mitaka packages to the latest version prior to the Xenial upgrade, and it
stayed even after the upgrade to Xenial/Newton. It's hard to tell for sure,
as we didn't use LM for a while; it worked fine under Trusty/Mitaka and was
broken under Xenial/latest-Mitaka.


Live migration fails with a chance of 20-30%. The instance pauses on the
source node, but is not resumed on the target node. The target node
destroys the instance, assuming LM failed, and on the source node the
instance stays paused. On the source node no new messages appear in
nova-compute.log, and commands such as "virsh list" don't produce any
output, or even exit for that matter. nova-compute can be restarted, but
after the normal startup messages it once again produces no new log
entries. Nova still considers the compute node up and running.
In the dashboard you can see that the instance now resides on the new host
and is in shutoff state.
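
A quick way to confirm the hang on the source node, instead of waiting on a
blocked virsh, is something like this (the timeout value is arbitrary; the
UUID is the instance from the logs below):

timeout 10 virsh list --all || echo "virsh/libvirtd appears hung"
systemctl status libvirtd
# the qemu process of the paused instance is still there
pgrep -af qemu-system | grep cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53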

If I restart libvirtd on the target or source node, the whole system
"unfreezes". The source host unpauses the instance, and it is live again.
But it now resides on the source node, while the database thinks it's on
the target node and shut down. Warning messages then appear on the source
host:

2016-12-20 01:13:37.095 20025 WARNING nova.compute.manager
[req-b3879539-989f-4075-9ef0-d23ef8868102 - - - - -] While synchronizing
instance power states, found 3 instances in the database and 4 instances on
the hypervisor.

Currently I'm stopping/destroying/undefining the instance on the source
node (roughly as sketched below) and launching it again via standard
OpenStack means, but this leads to an instance reboot.
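
Roughly, that cleanup looks like this on the source node after the libvirtd
restart (the domain is addressed by the instance UUID; Nova then starts the
instance wherever the DB thinks it lives, which is why it ends up as a
reboot):

virsh destroy cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53    # stop the leftover domain
virsh undefine cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53   # drop its definition
openstack server start cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53   # or "nova start"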


Last messages in nova-compute.log on source node after LM start:
2016-12-20 00:09:35.961 16834 INFO nova.virt.libvirt.migration
[req-51801b5e-b77a-4d76-ad87-176326ac910e 84498aa7e26443c4908d973f3e86d530
ecee1197f46e453dba25669554226ce5 - - -] [instance:
cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] Increasing downtime to 1251 ms after
0 sec elapsed time
2016-12-20 00:09:36.127 16834 INFO nova.virt.libvirt.driver
[req-51801b5e-b77a-4d76-ad87-176326ac910e 84498aa7e26443c4908d973f3e86d530
ecee1197f46e453dba25669554226ce5 - - -] [instance:
cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] Migration running for 0 secs, memory
100% remaining; (bytes processed=0, remaining=0, total=0)
2016-12-20 00:09:36.894 16834 INFO nova.compute.manager
[req-3a2a828f-8b76-4a65-b49b-ea8d232a3de5 - - - - -] [instance:
cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] VM Paused (Lifecycle Event)
2016-12-20 00:09:37.046 16834 INFO nova.compute.manager
[req-3a2a828f-8b76-4a65-b49b-ea8d232a3de5 - - - - -] [instance:
cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] During sync_power_state the instance
has a pending task (migrating). Skip.
2016-12-20 00:09:37.300 16834 INFO nova.virt.libvirt.driver
[req-51801b5e-b77a-4d76-ad87-176326ac910e 84498aa7e26443c4908d973f3e86d530
ecee1197f46e453dba25669554226ce5 - - -] [instance:
cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] Migration operation has completed
2016-12-20 00:09:37.301 16834 INFO nova.compute.manager
[req-51801b5e-b77a-4d76-ad87-176326ac910e 84498aa7e26443c4908d973f3e86d530
ecee1197f46e453dba25669554226ce5 - - -] [instance:
cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] _post_live_migration() is started..
2016-12-20 00:09:39.718 16834 INFO os_vif
[req-51801b5e-b77a-4d76-ad87-176326ac910e 84498aa7e26443c4908d973f3e86d530
ecee1197f46e453dba25669554226ce5 - - -] Successfully unplugged vif
VIFBridge(active=False,address=fa:16:3e:e3:c4:c7,bridge_name='brq0b7e75ee-b6',has_traffic_filtering=True,id=2a141727-77a7-4b43-b4cd-21fb415a6a55,network=Network(0b7e75ee-b691-4851-9b29-08c3c62a96e0),plugin='linux_bridge',port_profile=<?>,preserve_on_delete=False,vif_name='tap2a141727-77')
2016-12-20 00:09:39.777 16834 INFO nova.virt.libvirt.driver
[req-51801b5e-b77a-4d76-ad87-176326ac910e 84498aa7e26443c4908d973f3e86d530
ecee1197f46e453dba25669554226ce5 - - -] [instance:
cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] Deleting instance files
/var/lib/nova/instances/cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53_del
2016-12-20 00:09:39.778 16834 INFO nova.virt.libvirt.driver
[req-51801b5e-b77a-4d76-ad87-176326ac910e 84498aa7e26443c4908d973f3e86d530
ecee1197f46e453dba25669554226ce5 - - -] [instance:
cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] Deletion of
/var/lib/nova/instances/cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53_del complete
2016-12-20 00:10:33.644 16834 INFO nova.compute.resource_tracker
[req-00c4c403-d039-4347-b7ad-3098834a891c - - - - -] Auditing locally
available compute resources for node compute1.cloud1.local

No new messages will appear until libvirtd restart.


Messages during LM in nova-compute.log on target node:
2016-12-20 00:09:32.234 29438 INFO nova.virt.libvirt.driver
[req-51801b5e-b77a-4d76-ad87-176326ac910e 84498aa7e26443c4908d973f3e86d530
ecee1197f46e453dba25669554226ce5 - - -] Instance launched has CPU info:
{"vendor": "Intel", "model": "Nehalem", "arch": "x86_64", "features":
["pge", "clflush", "sep", "syscall", "vme", "dtes64", "msr", "vmx", "xtpr",
"cmov", "ssse3", "est", "pat", "monitor", "smx", "pbe", "lm", "tsc", "nx",
"fxsr", "tm", "sse4.1", "pae", "sse4.2", "acpi", "mmx", "cx8", "mce", "de",
"rdtscp", "ht", "dca", "lahf_lm", "pdcm", "mca", "pdpe1gb", "apic", "sse",
"pse", "ds", "invtsc", "pni", "tm2", "sse2", "ss", "ds_cpl", "pcid", "fpu",
"cx16", "pse36", "mtrr", "popcnt"], "topology": {"cores": 6, "cells": 2,
"threads": 2, "sockets": 1}}
2016-12-20 00:09:34.476 29438 INFO os_vif
[req-51801b5e-b77a-4d76-ad87-176326ac910e 84498aa7e26443c4908d973f3e86d530
ecee1197f46e453dba25669554226ce5 - - -] Successfully plugged vif
VIFBridge(active=True,address=fa:16:3e:e3:c4:c7,bridge_name='brq0b7e75ee-b6',has_traffic_filtering=True,id=2a141727-77a7-4b43-b4cd-21fb415a6a55,network=Network(0b7e75ee-b691-4851-9b29-08c3c62a96e0),plugin='linux_bridge',port_profile=<?>,preserve_on_delete=False,vif_name='tap2a141727-77')
2016-12-20 00:09:36.711 29438 INFO nova.compute.manager
[req-75abcda4-701a-4b84-8f28-c11f6ad6f933 - - - - -] [instance:
cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] VM Started (Lifecycle Event)
2016-12-20 00:09:39.777 29438 INFO nova.compute.manager
[req-51801b5e-b77a-4d76-ad87-176326ac910e 84498aa7e26443c4908d973f3e86d530
ecee1197f46e453dba25669554226ce5 - - -] [instance:
cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] Post operation of migration started
2016-12-20 00:09:40.381 29438 INFO nova.network.neutronv2.api
[req-51801b5e-b77a-4d76-ad87-176326ac910e 84498aa7e26443c4908d973f3e86d530
ecee1197f46e453dba25669554226ce5 - - -] [instance:
cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] Updating port
2a141727-77a7-4b43-b4cd-21fb415a6a55 with attributes {'binding:profile':
{}, 'binding:host_id': 'compute3'}
2016-12-20 00:09:52.589 29438 INFO nova.compute.manager [-] [instance:
cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] VM Stopped (Lifecycle Event)
2016-12-20 00:09:52.747 29438 WARNING nova.compute.manager
[req-e945fe0d-dea0-47ee-8b87-b88991b9796e - - - - -] [instance:
cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] Instance shutdown by itself. Calling
the stop API. Current vm_state: active, current task_state: None, original
DB power_state: 4, current VM power_state: 4
2016-12-20 00:09:52.964 29438 INFO nova.compute.manager
[req-e945fe0d-dea0-47ee-8b87-b88991b9796e - - - - -] [instance:
cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] Instance is already powered off in
the hypervisor when stop is called.
2016-12-20 00:09:53.039 29438 INFO nova.virt.libvirt.driver
[req-e945fe0d-dea0-47ee-8b87-b88991b9796e - - - - -] [instance:
cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] Instance already shutdown.
2016-12-20 00:09:53.043 29438 INFO nova.virt.libvirt.driver [-] [instance:
cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] Instance destroyed successfully.
2016-12-20 00:09:56.696 29438 INFO nova.compute.resource_tracker
[req-53e5b7a1-aba9-4b26-a3fb-b9b293bf2f17 - - - - -] Auditing locally
available compute resources for node compute3.cloud1.local
2016-12-20 00:09:57.295 29438 INFO nova.compute.resource_tracker
[req-53e5b7a1-aba9-4b26-a3fb-b9b293bf2f17 - - - - -] Total usable vcpus:
24, total allocated vcpus: 39
2016-12-20 00:09:57.295 29438 INFO nova.compute.resource_tracker
[req-53e5b7a1-aba9-4b26-a3fb-b9b293bf2f17 - - - - -] Final resource view:
name=compute3.cloud1.local phys_ram=145060MB used_ram=94720MB
phys_disk=14710GB used_disk=45GB total_vcpus=24 used_vcpus=39 pci_stats=[]

The target node then proceeds with normal operations.

Nothing interesting appears in the libvirtd log or syslog on any compute
node. I've tried enabling debug logging for libvirt, but the output is
unreadable to me - all I can figure out is that libvirtd is still running
and other instances are unaffected.
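
One thing I can try on the next occurrence is narrowing the libvirtd
logging and grabbing a thread backtrace of the hung daemon, roughly like
this (the log file path is arbitrary; the backtrace needs gdb and is only
really useful with libvirt debug symbols installed):

# in /etc/libvirt/libvirtd.conf, then restart libvirtd
log_level = 1
log_filters = "1:libvirt 1:qemu 1:rpc"
log_outputs = "1:file:/var/log/libvirt/libvirtd-debug.log"

# on the next hang, dump all libvirtd thread backtraces
gdb -p $(pidof libvirtd) -batch -ex "thread apply all bt" > /tmp/libvirtd-threads.txt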


So, my questions:
- Is there any workaround to sync the state of the DB and the compute node?
I'd like to at least avoid the instance reboot, and for that I need some
way to tell the DB that the instance is on the source node and running,
without service interruption. Is it OK to update the database directly via
SQL (roughly the kind of update sketched after this list)?
- How can I further troubleshoot this issue? As I said, I'm kinda stuck and
have no idea where to look next. If the issue appeared 100% of the time it
would be a lot easier, but there's no pattern I can figure out - it happens
randomly, with heavily loaded and idle instances alike.
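
For reference, this is roughly the kind of direct update I have in mind
against the nova database (I haven't run it; the column names come from a
quick look at the instances table, the host/node values are what I believe
the source node's should be, and I'd take a DB backup first; connection
options for our Galera VIP are omitted):

mysql nova -e "
  UPDATE instances
     SET host = 'compute1', node = 'compute1.cloud1.local',
         vm_state = 'active', task_state = NULL, power_state = 1
   WHERE uuid = 'cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53';"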

Any suggestions appreciated - I need a boot in the right direction.
I will provide any logs and/or config files on request.