[Openstack-operators] mitaka/xenial libvirt issues

Joe Topjian joe at topjian.net
Mon Nov 27 21:13:36 UTC 2017


We think we've pinned the qemu errors down to a mismatched group ID on a
handful of compute nodes.
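
For anyone else chasing the same thing: comparing the group IDs across the
compute nodes should show the mismatch fairly quickly, e.g. something along
the lines of

  getent group libvirt-qemu kvm
  ls -ln /var/lib/nova/instances

on each node (assuming a fairly default Ubuntu setup where qemu runs as
libvirt-qemu and the shared instance store lives under /var/lib/nova/instances).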

The slow systemd/libvirt startup is still unsolved, but at the moment it
does not appear to be the cause of the qemu errors.

On Mon, Nov 27, 2017 at 8:04 AM, Joe Topjian <joe at topjian.net> wrote:

> Hi all,
>
> To my knowledge, we don't use tunneled migrations. This issue is also
> happening with snapshots, so it's not restricted to just migrations.
>
> I haven't yet tried the apparmor patches that George mentioned. I plan on
> applying them once I get another report of a problematic instance.
>
> Thank you for the suggestions, though :)
> Joe
>
> On Mon, Nov 27, 2017 at 2:10 AM, Tobias Urdin <tobias.urdin at crystone.com>
> wrote:
>
>> Hello,
>>
>> That bug seems to assume tunnelled migrations; the live_migration_flag
>> option was removed in later releases but is still there in Mitaka.
>>
>> Do you have the VIR_MIGRATE_TUNNELLED flag set for
>> [libvirt]live_migration_flag in nova.conf?
>>
>>
>> Might be a long shot, but I've removed VIR_MIGRATE_TUNNELLED in our clouds
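>>
>> For reference, the setting I mean looks roughly like this in nova.conf
>> (the exact flag list will vary per deployment; VIR_MIGRATE_TUNNELLED was
>> part of the default list in these releases, if I remember correctly):
>>
>> [libvirt]
>> live_migration_flag = VIR_MIGRATE_UNDEFINE_SOURCE,VIR_MIGRATE_PEER2PEER,VIR_MIGRATE_LIVE
>>
>> i.e. the same list as before, just with VIR_MIGRATE_TUNNELLED dropped.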
>>
>> Best regards
>>
>> On 11/26/2017 01:01 PM, Sean Redmond wrote:
>>
>> Hi,
>>
>> I think it may be related to this:
>>
>> https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1647389
>>
>> Thanks
>>
>> On Thu, Nov 23, 2017 at 6:20 PM, Joe Topjian <joe at topjian.net> wrote:
>>
>>> OK, thanks. We'll definitely look at downgrading in a test environment.
>>>
>>> To add some further info to this problem, here are some log entries.
>>> When an instance fails to snapshot or fails to migrate, we see:
>>>
>>> libvirtd[27939]: Cannot start job (modify, none) for domain
>>> instance-00004fe4; current job is (modify, none) owned by (27942
>>> remoteDispatchDomainBlockJobAbort, 0 <null>) for (69116s, 0s)
>>>
>>> libvirtd[27939]: Cannot start job (none, migration out) for domain
>>> instance-00004fe4; current job is (modify, none) owned by (27942
>>> remoteDispatchDomainBlockJobAbort, 0 <null>) for (69361s, 0s)
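>>>
>>> For anyone else who hits this: a quick way to confirm that a stuck block
>>> job is what is holding the domain's job lock is something like
>>>
>>> virsh domjobinfo instance-00004fe4
>>> virsh blockjob instance-00004fe4 vda --info
>>>
>>> (vda is just an assumption here - substitute the instance's actual disk
>>> target.)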
>>>
>>>
>>> The one piece of this that I'm currently fixated on is the length of
>>> time it takes libvirt to start. I'm not sure if it's causing the above,
>>> though. When starting libvirt through systemd, it takes much longer to
>>> process the iptables and ebtables rules than if we start libvirtd on the
>>> command-line directly.
>>>
>>> virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t
>>> nat -L libvirt-J-vnet5'
>>> virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t
>>> nat -L libvirt-P-vnet5'
>>> virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t
>>> nat -F libvirt-J-vnet5'
>>> virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t
>>> nat -X libvirt-J-vnet5'
>>> virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t
>>> nat -F libvirt-P-vnet5'
>>> virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t
>>> nat -X libvirt-P-vnet5'
>>>
>>> We're talking about a difference between 5 minutes and 5 seconds
>>> depending on whether libvirt was started through systemd or directly on
>>> the command line. This doesn't seem normal to me.
>>>
>>> In general, is anyone aware of systemd performing restrictions of some
>>> kind on processes which create subprocesses? Or something like that? I've
>>> tried comparing cgroups and the various limits within systemd between my
>>> shell session and the libvirt-bin.service session and can't find anything
>>> immediately noticeable. Maybe it's apparmor?
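>>>
>>> (The sort of comparison I mean is roughly:
>>>
>>> systemctl show libvirt-bin.service | grep -i -e limit -e slice -e cpu
>>> cat /proc/<pid of libvirtd>/cgroup
>>> ulimit -a
>>> aa-status | grep -i libvirt
>>>
>>> once for the systemd-started libvirtd and once for one started from my
>>> shell.)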
>>>
>>> Thanks,
>>> Joe
>>>
>>> On Thu, Nov 23, 2017 at 11:03 AM, Chris Sarginson <csargiso at gmail.com>
>>> wrote:
>>>
>>>> I think we may have pinned libvirt-bin as well (1.3.1), but I can't
>>>> guarantee that, sorry - I would suggest it's worth trying to pin both
>>>> initially.
>>>>
>>>> Chris
>>>>
>>>> On Thu, 23 Nov 2017 at 17:42 Joe Topjian <joe at topjian.net> wrote:
>>>>
>>>>> Hi Chris,
>>>>>
>>>>> Thanks - we will definitely look into this. To confirm: did you also
>>>>> downgrade libvirt as well or was it all qemu?
>>>>>
>>>>> Thanks,
>>>>> Joe
>>>>>
>>>>> On Thu, Nov 23, 2017 at 9:16 AM, Chris Sarginson <csargiso at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> We hit the same issue a while back (I suspect), which we seemed to
>>>>>> resolve by pinning QEMU and related packages at the following version (you
>>>>>> might need to hunt down the debs manually):
>>>>>>
>>>>>> 1:2.5+dfsg-5ubuntu10.5
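>>>>>>
>>>>>> Something along these lines in /etc/apt/preferences.d/qemu should hold
>>>>>> it there (adjust the package glob to match what you actually have
>>>>>> installed):
>>>>>>
>>>>>> Package: qemu*
>>>>>> Pin: version 1:2.5+dfsg-5ubuntu10.5
>>>>>> Pin-Priority: 1001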
>>>>>>
>>>>>> I'm certain there's a launchpad bug for Ubuntu qemu regarding this,
>>>>>> but don't have it to hand.
>>>>>>
>>>>>> Hope this helps,
>>>>>> Chris
>>>>>>
>>>>>> On Thu, 23 Nov 2017 at 15:33 Joe Topjian <joe at topjian.net> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> We're seeing some strange libvirt issues in an Ubuntu 16.04
>>>>>>> environment. It's running Mitaka, but I don't think this is a problem with
>>>>>>> OpenStack itself.
>>>>>>>
>>>>>>> We're in the process of upgrading this environment from Ubuntu 14.04
>>>>>>> with the Mitaka cloud archive to 16.04. Instances are being live migrated
>>>>>>> (NFS share) to a new 16.04 compute node (fresh install), so there's a
>>>>>>> change between libvirt versions (1.2.2 to 1.3.1). The problem we're seeing
>>>>>>> is only happening on the 16.04/1.3.1 nodes.
>>>>>>>
>>>>>>> We're getting occasional reports of instances not able to be
>>>>>>> snapshotted. Upon investigation, the snapshot process quits early with a
>>>>>>> libvirt/qemu lock timeout error. We then see that the instance's xml file
>>>>>>> has disappeared from /etc/libvirt/qemu and must restart libvirt and
>>>>>>> hard-reboot the instance to get things back to a normal state. Trying to
>>>>>>> live-migrate the instance to another node causes the same thing to happen.
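>>>>>>>
>>>>>>> (The recovery is roughly: restart libvirt on the compute node and then
>>>>>>> hard-reboot the instance, i.e. something like
>>>>>>>
>>>>>>> service libvirt-bin restart
>>>>>>> nova reboot --hard <instance uuid>
>>>>>>>
>>>>>>> after which the XML reappears and the domain comes back.)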
>>>>>>>
>>>>>>> However, at some random time, either the snapshot or the migration
>>>>>>> will work without error. I haven't been able to reproduce this issue on my
>>>>>>> own and haven't been able to figure out the root cause by inspecting
>>>>>>> instances reported to me.
>>>>>>>
>>>>>>> One thing that has stood out is the length of time it takes for
>>>>>>> libvirt to start. If I run "/etc/init.d/libvirt-bin start", it takes at
>>>>>>> least 5 minutes before a simple "virsh list" will work; until then the
>>>>>>> command just hangs. If I increase libvirt's logging level, I can see that
>>>>>>> during this period of time, libvirt is working on iptables and ebtables
>>>>>>> (looks like it's shelling out commands).
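>>>>>>>
>>>>>>> (For anyone wanting to reproduce: the debug logging can be turned up in
>>>>>>> /etc/libvirt/libvirtd.conf with something like
>>>>>>>
>>>>>>> log_level = 1
>>>>>>> log_outputs = "1:file:/var/log/libvirt/libvirtd.log"
>>>>>>>
>>>>>>> which is where the ebtables/iptables rule processing shows up.)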
>>>>>>>
>>>>>>> But if I run "libvirtd -l" straight on the command line, all of this
>>>>>>> completes within 5 seconds (including all of the shelling out).
>>>>>>>
>>>>>>> My initial thought is that systemd is doing some type of throttling
>>>>>>> between the system and user slice, but I've tried comparing slice
>>>>>>> attributes and, probably due to my lack of understanding of systemd, can't
>>>>>>> find anything to prove this.
>>>>>>>
>>>>>>> Is anyone else running into this problem? Does anyone know what
>>>>>>> might be the cause?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Joe
>>>>>>>
>>>>>>
>>>>>
>>>
>>>
>>
>>
>>
>>
>