[Openstack-operators] mitaka/xenial libvirt issues

Tobias Urdin tobias.urdin at crystone.com
Mon Nov 27 09:10:45 UTC 2017


Hello,

That bug seems to assume tunnelled migrations; the live_migration_flag option is removed in later releases but is still present in Mitaka.

Do you have the VIR_MIGRATE_TUNNELLED flag set for [libvirt]live_migration_flag in nova.conf?


It might be a long shot, but I've removed VIR_MIGRATE_TUNNELLED in our clouds.
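For reference, a rough sketch of what that looks like in nova.conf on the compute nodes (the flag list here is illustrative; start from whatever your current value is, drop VIR_MIGRATE_TUNNELLED, and restart nova-compute):

[libvirt]
# example only; the previous value would have had VIR_MIGRATE_TUNNELLED appended
live_migration_flag = VIR_MIGRATE_UNDEFINE_SOURCE,VIR_MIGRATE_PEER2PEER,VIR_MIGRATE_LIVE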

Best regards

On 11/26/2017 01:01 PM, Sean Redmond wrote:
Hi,

I think it may be related to this:

https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1647389

Thanks

On Thu, Nov 23, 2017 at 6:20 PM, Joe Topjian <joe at topjian.net> wrote:
OK, thanks. We'll definitely look at downgrading in a test environment.

To add some further info to this problem, here are some log entries. When an instance fails to snapshot or fails to migrate, we see:

libvirtd[27939]: Cannot start job (modify, none) for domain instance-00004fe4; current job is (modify, none) owned by (27942 remoteDispatchDomainBlockJobAbort, 0 <null>) for (69116s, 0s)

libvirtd[27939]: Cannot start job (none, migration out) for domain instance-00004fe4; current job is (modify, none) owned by (27942 remoteDispatchDomainBlockJobAbort, 0 <null>) for (69361s, 0s)
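If it helps, something along these lines should show what libvirt thinks is still holding the job lock (domain name taken from the log lines above; the disk target is an assumption, use domblklist to find the real one):

virsh domjobinfo instance-00004fe4
virsh domblklist instance-00004fe4
virsh blockjob instance-00004fe4 vda --info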


The one piece of this that I'm currently fixated on is the length of time it takes libvirt to start. I'm not sure if it's causing the above, though. When starting libvirt through systemd, it takes much longer to process the iptables and ebtables rules than if we start libvirtd on the command-line directly.

virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t nat -L libvirt-J-vnet5'
virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t nat -L libvirt-P-vnet5'
virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t nat -F libvirt-J-vnet5'
virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t nat -X libvirt-J-vnet5'
virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t nat -F libvirt-P-vnet5'
virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t nat -X libvirt-P-vnet5'

We're talking about a difference between 5 minutes and 5 seconds depending on how libvirt is started. This doesn't seem normal to me.

In general, is anyone aware of systemd imposing restrictions of some kind on processes that spawn subprocesses, or something along those lines? I've tried comparing cgroups and the various limits within systemd between my shell session and the libvirt-bin.service session and can't find anything immediately noticeable. Maybe it's apparmor?
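For anyone who wants to make the same comparison, this is roughly the kind of thing I mean (service name assumed to be libvirt-bin.service on 16.04):

# limits systemd applies to the service, vs. the cgroup of a root shell
systemctl show libvirt-bin.service -p Slice -p TasksMax -p LimitNPROC -p LimitNOFILE
cat /proc/$(pidof libvirtd)/cgroup
cat /proc/$$/cgroup

# is an apparmor profile confining libvirtd?
aa-status | grep -i libvirt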

Thanks,
Joe

On Thu, Nov 23, 2017 at 11:03 AM, Chris Sarginson <csargiso at gmail.com> wrote:
I think we may have pinned libvirt-bin as well (1.3.1), but I can't guarantee that, sorry - I would suggest it's worth trying to pin both initially.

Chris

On Thu, 23 Nov 2017 at 17:42 Joe Topjian <joe at topjian.net> wrote:
Hi Chris,

Thanks - we will definitely look into this. To confirm: did you downgrade libvirt as well, or was it only qemu?

Thanks,
Joe

On Thu, Nov 23, 2017 at 9:16 AM, Chris Sarginson <csargiso at gmail.com<mailto:csargiso at gmail.com>> wrote:
We hit the same issue a while back (I suspect), which we seemed to resolve by pinning QEMU and related packages at the following version (you might need to hunt down the debs manually):

1:2.5+dfsg-5ubuntu10.5

I'm certain there's a launchpad bug for Ubuntu qemu regarding this, but don't have it to hand.
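For what it's worth, a rough sketch of how such a pin could be expressed in apt preferences (the glob and priority are illustrative; check which qemu packages you actually have installed):

# /etc/apt/preferences.d/qemu-pin
Package: qemu-*
Pin: version 1:2.5+dfsg-5ubuntu10.5
Pin-Priority: 1001

Downgrading in place would then be something like "apt-get install qemu-system-x86=1:2.5+dfsg-5ubuntu10.5" for each affected package.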

Hope this helps,
Chris

On Thu, 23 Nov 2017 at 15:33 Joe Topjian <joe at topjian.net> wrote:
Hi all,

We're seeing some strange libvirt issues in an Ubuntu 16.04 environment. It's running Mitaka, but I don't think this is a problem with OpenStack itself.

We're in the process of upgrading this environment from Ubuntu 14.04 with the Mitaka cloud archive to 16.04. Instances are being live migrated (NFS share) to a new 16.04 compute node (fresh install), so there's a change between libvirt versions (1.2.2 to 1.3.1). The problem we're seeing is only happening on the 16.04/1.3.1 nodes.

We're getting occasional reports of instances that cannot be snapshotted. Upon investigation, the snapshot process quits early with a libvirt/qemu lock timeout error. We then find that the instance's XML file has disappeared from /etc/libvirt/qemu, and we have to restart libvirt and hard-reboot the instance to get things back to a normal state. Trying to live-migrate the instance to another node causes the same thing to happen.
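For context, the recovery looks roughly like this on the affected compute node (instance name and UUID are placeholders):

ls /etc/libvirt/qemu/instance-0000xxxx.xml   # gone: No such file or directory
systemctl restart libvirt-bin
nova reboot --hard <instance-uuid>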

However, at some random time, either the snapshot or the migration will work without error. I haven't been able to reproduce this issue on my own and haven't been able to figure out the root cause by inspecting instances reported to me.

One thing that has stood out is the length of time it takes for libvirt to start. If I run "/etc/init.d/libvirt-bin start", it takes at least 5 minutes before a simple "virsh list" will work; until then, the command just hangs. If I increase libvirt's logging level, I can see that during this period libvirt is working through iptables and ebtables rules (it looks like it's shelling out to those commands).
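For anyone wanting to see the same thing, this is roughly what I mean by increasing the log level (values illustrative; set in /etc/libvirt/libvirtd.conf and restart the daemon):

# /etc/libvirt/libvirtd.conf
log_level = 1
log_outputs = "1:file:/var/log/libvirt/libvirtd.log"

With that in place, the iptables/ebtables activity shows up as virFirewallApplyRule debug lines.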

But if I run "libvirtd -l" straight on the command line, all of this completes within 5 seconds (including all of the shelling out).

My initial thought is that systemd is doing some type of throttling between the system and user slice, but I've tried comparing slice attributes and, probably due to my lack of understanding of systemd, can't find anything to prove this.

Is anyone else running into this problem? Does anyone know what might be the cause?

Thanks,
Joe
_______________________________________________
OpenStack-operators mailing list
OpenStack-operators at lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators