[Openstack-operators] libvirt freezing when loading Nova instance nwfilters

Edmund Rhudy (BLOOMBERG/ 120 PARK) erhudy at bloomberg.net
Thu Feb 23 17:12:34 UTC 2017


Thanks for confirming I'm not fully insane. We only have one cluster left to upgrade now (naturally the oldest, biggest and most dangerous one). Hopefully it doesn't repeat there, but if it does, you've given me a few more things to look at.

From: joe at topjian.net 
Subject: Re: [Openstack-operators] libvirt freezing when loading Nova instance nwfilters

We ran into the "virsh nwfilter-list hanging indefinitely" thing back in early January. I spent hours and I almost went insane trying to figure it out. We weren't upgrading nodes, though, it just sort of happened.

I have no idea if the following was the correct way of handling this, but this ultimately got nova-compute back up and running:

I ran:

$ ss -ax

on the hypervisor and saw that some monitor sockets had a non-zero Recv-Q. On the processes that owned those sockets, I ran:

$ strace -p <pid>

and saw no activity. By comparison, strace on processes whose sockets had a zero Recv-Q showed activity. At that point, I figured my only options were a full hypervisor reboot or killing the instances that showed no activity. Since a full reboot would kill those instances anyway, I ran "virsh destroy" on them. Once they were destroyed, nova-compute was able to start cleanly.
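
For anyone repeating the check above, a rough sketch of the filtering step follows. It assumes the usual "ss -ax" column order (Netid, State, Recv-Q, Send-Q, ...), and the function name nonzero_recvq is made up for illustration; it is not part of any tool.

```shell
# Print rows of `ss -ax`-style output whose Recv-Q column is non-zero.
# Assumption: Recv-Q is the third whitespace-separated column
# (Netid State Recv-Q Send-Q Local-Address:Port Peer-Address:Port).
nonzero_recvq() {
    # Skip the header line and any row whose third field is not a number.
    awk 'NR > 1 && $3 ~ /^[0-9]+$/ && $3 > 0'
}
```

Something like "ss -axp | nonzero_recvq" (run as root, so that -p can show the owning process) should then list only the backed-up sockets. Which of those are QEMU monitor sockets still has to be judged from the socket path.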

We had this happen on 3 hypervisors. Each one had between 1 and 3 of these types of instances, so not a lot at all. Once they were destroyed, nova-compute began working again on all 3.

We later had a user report that he noticed some problems with his instance (not one of the ones destroyed) and thought it might have to do with the leap second. No idea if that's true, but the timing kind of works out.

Hope that helps,
Joe


On Wed, Feb 22, 2017 at 8:33 AM, Edmund Rhudy (BLOOMBERG/ 120 PARK) <erhudy at bloomberg.net> wrote:

I recently witnessed a strange issue with libvirt when upgrading one of our clusters from Kilo to Liberty. I'm not really looking for a specific diagnosis here because of the large number of confounding factors and the relative ease of remediating it, but I'm interested to hear if anyone else has witnessed this particular problem.

For background, we had a number of Kilo-based clusters, all running Ubuntu 14.04.4 with OpenStack installed from the Ubuntu cloud archive. The upgrade process to Liberty involved upgrading the OpenStack components and their dependencies (including libvirt), then afterward upgrading all remaining packages via dist-upgrade (and staging a kernel upgrade from 3.13 to 4.4, to take effect on the next reboot). Seven clusters had already been upgraded successfully using this strategy.

One cluster, however, decided to get a bit weird. After the upgrade, nova-compute on 4 hypervisors refused to come up properly and showed as enabled/down in nova service-list. Upon further investigation, nova-compute was starting up but was getting jammed on loading nwfilters. When I ran "virsh nwfilter-list", the command stalled indefinitely. Killing nova-compute and restarting the libvirt-bin service allowed the command to work again, but it did not list any of the nova-instance-instance-* nwfilters. Once nova-compute was started again, it tried to load the instance-specific filters and libvirt would wedge. I spent a while tinkering with the affected systems but could not find any way of correcting the issue other than rebooting the hypervisor, after which everything was fine.
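
A stalled "virsh nwfilter-list" can be detected without tying up a terminal by wrapping the command in a time limit. The sketch below is only one way to do that; the helper name check_hang is hypothetical, and it relies on coreutils "timeout" exiting with status 124 when the limit expires.

```shell
# Run a command with a time limit and report whether it hung.
# Usage: check_hang SECONDS COMMAND [ARGS...]
check_hang() {
    limit=$1
    shift
    # Discard the command's output; only the exit status matters here.
    timeout "$limit" "$@" >/dev/null 2>&1
    rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "hung"            # timeout expired before the command returned
    else
        echo "returned $rc"    # command finished on its own with this status
    fi
}
```

For example, "check_hang 10 virsh nwfilter-list" would print "hung" on a wedged libvirt. The 10-second limit is an arbitrary choice, not a value taken from the thread.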

Has anyone ever seen anything like this? libvirt was upgraded from 1.2.12 to 1.2.16. Hundreds of hypervisors had already received this exact same upgrade without showing this problem, and I have no idea how I could reproduce it. I'm interested to hear if anyone else has ever run into this and if they figured out what the root cause was, though I've already braced myself for tumbleweeds.