[Openstack-operators] mitaka/xenial libvirt issues

Sean Redmond sean.redmond1 at gmail.com
Sun Nov 26 11:58:10 UTC 2017


Hi,

I think it may be related to this:

https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1647389

Thanks

On Thu, Nov 23, 2017 at 6:20 PM, Joe Topjian <joe at topjian.net> wrote:

> OK, thanks. We'll definitely look at downgrading in a test environment.
>
> To add some further info on this problem, here are some log entries. When
> an instance fails to snapshot or fails to migrate, we see:
>
> libvirtd[27939]: Cannot start job (modify, none) for domain
> instance-00004fe4; current job is (modify, none) owned by (27942
> remoteDispatchDomainBlockJobAbort, 0 <null>) for (69116s, 0s)
>
> libvirtd[27939]: Cannot start job (none, migration out) for domain
> instance-00004fe4; current job is (modify, none) owned by (27942
> remoteDispatchDomainBlockJobAbort, 0 <null>) for (69361s, 0s)
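>
> In case it's useful to anyone following along, the job holding the lock can
> at least be inspected manually with something like the following (I'm
> assuming the disk target is vda - check with domblklist first):
>
> virsh domblklist instance-00004fe4
> virsh blockjob instance-00004fe4 vda --info
> virsh blockjob instance-00004fe4 vda --abort
>
> The --abort is only an attempt to clear the stuck block job, not a fix, but
> it at least shows whether a job is still registered against the disk.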
>
>
> The one piece of this that I'm currently fixated on is the length of time
> it takes libvirt to start. I'm not sure if it's causing the above, though.
> When starting libvirt through systemd, it takes much longer to process the
> iptables and ebtables rules than if we start libvirtd on the command-line
> directly.
>
> virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t
> nat -L libvirt-J-vnet5'
> virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t
> nat -L libvirt-P-vnet5'
> virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t
> nat -F libvirt-J-vnet5'
> virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t
> nat -X libvirt-J-vnet5'
> virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t
> nat -F libvirt-P-vnet5'
> virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t
> nat -X libvirt-P-vnet5'
>
> We're talking about a difference between 5 minutes and 5 seconds depending
> on how libvirt is started. This doesn't seem normal to me.
>
> In general, is anyone aware of systemd imposing restrictions of some kind
> on processes that create subprocesses, or something along those lines? I've
> tried comparing cgroups and the various limits within systemd between my
> shell session and the libvirt-bin.service session and can't find anything
> immediately noticeable. Maybe it's apparmor?
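>
> For reference, the comparison I've been attempting amounts to roughly the
> following (property names are from systemd 229 on xenial, so they may need
> adjusting elsewhere):
>
> systemctl show libvirt-bin.service -p Slice -p TasksMax -p LimitNPROC -p LimitNOFILE
> cat /proc/$(pidof libvirtd)/cgroup     # compare against /proc/$$/cgroup in a shell
> aa-status | grep -i libvirt            # any libvirt profiles in enforce mode?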
>
> Thanks,
> Joe
>
> On Thu, Nov 23, 2017 at 11:03 AM, Chris Sarginson <csargiso at gmail.com>
> wrote:
>
>> I think we may have pinned libvirt-bin as well (at 1.3.1), but I can't
>> guarantee that, sorry - I would suggest it's worth trying to pin both
>> initially.
>>
>> Chris
>>
>> On Thu, 23 Nov 2017 at 17:42 Joe Topjian <joe at topjian.net> wrote:
>>
>>> Hi Chris,
>>>
>>> Thanks - we will definitely look into this. To confirm: did you downgrade
>>> libvirt as well, or was it only qemu?
>>>
>>> Thanks,
>>> Joe
>>>
>>> On Thu, Nov 23, 2017 at 9:16 AM, Chris Sarginson <csargiso at gmail.com>
>>> wrote:
>>>
>>>> I suspect we hit the same issue a while back; we seemed to resolve it by
>>>> pinning QEMU and related packages to the following version (you might need
>>>> to hunt down the debs manually):
>>>>
>>>> 1:2.5+dfsg-5ubuntu10.5
>>>>
>>>> I'm certain there's a launchpad bug for Ubuntu qemu regarding this, but
>>>> don't have it to hand.
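>>>>
>>>> If it helps, the pin on xenial would look roughly like this in a file
>>>> under /etc/apt/preferences.d/ (the qemu* wildcard is just to catch the
>>>> related binary packages - adjust if you prefer to list them explicitly):
>>>>
>>>> Package: qemu*
>>>> Pin: version 1:2.5+dfsg-5ubuntu10.5
>>>> Pin-Priority: 1001
>>>>
>>>> followed by an explicit downgrade of whatever is installed, e.g.
>>>> "apt-get install qemu-system-x86=1:2.5+dfsg-5ubuntu10.5".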
>>>>
>>>> Hope this helps,
>>>> Chris
>>>>
>>>> On Thu, 23 Nov 2017 at 15:33 Joe Topjian <joe at topjian.net> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> We're seeing some strange libvirt issues in an Ubuntu 16.04
>>>>> environment. It's running Mitaka, but I don't think this is a problem with
>>>>> OpenStack itself.
>>>>>
>>>>> We're in the process of upgrading this environment from Ubuntu 14.04
>>>>> with the Mitaka cloud archive to 16.04. Instances are being live-migrated
>>>>> (over an NFS share) to a new 16.04 compute node (fresh install), so there's
>>>>> a change in libvirt version (1.2.2 to 1.3.1). The problem we're seeing is
>>>>> only happening on the 16.04/1.3.1 nodes.
>>>>>
>>>>> We're getting occasional reports of instances that cannot be snapshotted.
>>>>> Upon investigation, the snapshot process quits early with a
>>>>> libvirt/qemu lock timeout error. We then see that the instance's xml file
>>>>> has disappeared from /etc/libvirt/qemu and must restart libvirt and
>>>>> hard-reboot the instance to get things back to a normal state. Trying to
>>>>> live-migrate the instance to another node causes the same thing to happen.
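>>>>>
>>>>> Concretely, the recovery amounts to roughly the following (the uuid is a
>>>>> placeholder):
>>>>>
>>>>> systemctl restart libvirt-bin
>>>>> nova reboot --hard <instance uuid>
>>>>>
>>>>> The hard reboot re-creates the domain, which brings the XML back under
>>>>> /etc/libvirt/qemu.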
>>>>>
>>>>> However, at some random time, either the snapshot or the migration
>>>>> will work without error. I haven't been able to reproduce this issue on my
>>>>> own and haven't been able to figure out the root cause by inspecting
>>>>> instances reported to me.
>>>>>
>>>>> One thing that has stood out is the length of time it takes for
>>>>> libvirt to start. If I run "/etc/init.d/libvirt-bin start", it takes at
>>>>> least 5 minutes before a simple "virsh list" will work; until then, the
>>>>> command just hangs. If I increase libvirt's logging level, I can see that
>>>>> during this period libvirt is working through iptables and ebtables rules
>>>>> (it looks like it's shelling out to those commands).
>>>>>
>>>>> But if I run "libvirtd -l" straight on the command line, all of this
>>>>> completes within 5 seconds (including all of the shelling out).
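>>>>>
>>>>> In case anyone wants to reproduce the comparison, something like the
>>>>> following should show it; the two settings in /etc/libvirt/libvirtd.conf
>>>>> are only there to make the individual iptables/ebtables calls visible:
>>>>>
>>>>> log_level = 1
>>>>> log_outputs = "1:file:/var/log/libvirt/libvirtd.log"
>>>>>
>>>>> # via the init script / systemd
>>>>> /etc/init.d/libvirt-bin start
>>>>> time virsh list        # hangs for ~5 minutes before returning
>>>>>
>>>>> # started directly
>>>>> /etc/init.d/libvirt-bin stop
>>>>> libvirtd -l -d
>>>>> time virsh list        # returns within a few seconds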
>>>>>
>>>>> My initial thought is that systemd is doing some type of throttling
>>>>> between the system and user slice, but I've tried comparing slice
>>>>> attributes and, probably due to my lack of understanding of systemd, can't
>>>>> find anything to prove this.
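>>>>>
>>>>> For reference, the slice comparison I've tried so far is roughly
>>>>> (property names are from systemd 229, so adjust as needed):
>>>>>
>>>>> systemctl show system.slice -p CPUShares -p BlockIOWeight -p TasksMax
>>>>> systemctl show user.slice -p CPUShares -p BlockIOWeight -p TasksMax
>>>>> systemd-cgls        # confirm which slice/cgroup libvirtd actually lands in
>>>>>
>>>>> Nothing in there has jumped out at me, hence the questions below.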
>>>>>
>>>>> Is anyone else running into this problem? Does anyone know what might
>>>>> be the cause?
>>>>>
>>>>> Thanks,
>>>>> Joe
>>>>> _______________________________________________
>>>>> OpenStack-operators mailing list
>>>>> OpenStack-operators at lists.openstack.org
>>>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>>>>>
>>>>
>>>
>
