[Openstack-operators] mitaka/xenial libvirt issues

Joe Topjian joe at topjian.net
Mon Nov 27 15:04:12 UTC 2017


Hi all,

To my knowledge, we don't use tunneled migrations. This issue is also
happening with snapshots, so it's not restricted to just migrations.

I haven't yet tried the apparmor patches that George mentioned. I plan on
applying them once I get another report of a problematic instance.

Thank you for the suggestions, though :)
Joe

On Mon, Nov 27, 2017 at 2:10 AM, Tobias Urdin <tobias.urdin at crystone.com>
wrote:

> Hello,
>
> That bug seems to assume tunnelled migrations; the live_migration_flag
> option was removed in later versions but is still there in Mitaka.
>
> Do you have the VIR_MIGRATE_TUNNELLED flag set for
> [libvirt]live_migration_flag in nova.conf?
>
>
> Might be a long shot, but I've removed VIR_MIGRATE_TUNNELLED in our clouds.
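>
> (For reference, the Mitaka-era default is something like the following;
> the exact flag list will vary by deployment:)
>
>     [libvirt]
>     live_migration_flag = VIR_MIGRATE_UNDEFINE_SOURCE,VIR_MIGRATE_PEER2PEER,VIR_MIGRATE_LIVE,VIR_MIGRATE_TUNNELLED
>
> Removing VIR_MIGRATE_TUNNELLED from that list makes libvirt use its
> native migration transport instead of tunnelling the stream through
> libvirtd.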
>
> Best regards
>
> On 11/26/2017 01:01 PM, Sean Redmond wrote:
>
> Hi,
>
> I think it may be related to this:
>
> https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1647389
>
> Thanks
>
> On Thu, Nov 23, 2017 at 6:20 PM, Joe Topjian <joe at topjian.net> wrote:
>
>> OK, thanks. We'll definitely look at downgrading in a test environment.
>>
>> To add some further information about this problem, here are some log
>> entries. When an instance fails to snapshot or fails to migrate, we see:
>>
>> libvirtd[27939]: Cannot start job (modify, none) for domain
>> instance-00004fe4; current job is (modify, none) owned by (27942
>> remoteDispatchDomainBlockJobAbort, 0 <null>) for (69116s, 0s)
>>
>> libvirtd[27939]: Cannot start job (none, migration out) for domain
>> instance-00004fe4; current job is (modify, none) owned by (27942
>> remoteDispatchDomainBlockJobAbort, 0 <null>) for (69361s, 0s)
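>>
>> (An aside for anyone hitting the same thing: these two commands should
>> show whether that block job is still registered against the domain -
>> "vda" here is just an example device name:)
>>
>>     virsh domjobinfo instance-00004fe4
>>     virsh blockjob instance-00004fe4 vda --info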
>>
>>
>> The one piece of this that I'm currently fixated on is the length of time
>> it takes libvirt to start. I'm not sure if it's causing the above, though.
>> When starting libvirt through systemd, it takes much longer to process the
>> iptables and ebtables rules than if we start libvirtd on the command-line
>> directly.
>>
>> virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t
>> nat -L libvirt-J-vnet5'
>> virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t
>> nat -L libvirt-P-vnet5'
>> virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t
>> nat -F libvirt-J-vnet5'
>> virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t
>> nat -X libvirt-J-vnet5'
>> virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t
>> nat -F libvirt-P-vnet5'
>> virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t
>> nat -X libvirt-P-vnet5'
>>
>> We're talking about a difference between 5 minutes and 5 seconds
>> depending on how libvirt was started. This doesn't seem normal to me.
>>
>> In general, is anyone aware of systemd imposing restrictions of some
>> kind on processes that create subprocesses? Or something like that? I've
>> tried comparing cgroups and the various limits within systemd between my
>> shell session and the libvirt-bin.service session and can't find anything
>> immediately noticeable. Maybe it's apparmor?
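>>
>> (What I compared, roughly - all standard systemd/proc/apparmor tooling,
>> though exact property names may differ by systemd version:)
>>
>>     # limits and accounting applied to the service
>>     systemctl show libvirt-bin.service -p TasksMax -p LimitNPROC -p CPUQuota
>>     # cgroup membership of the daemon vs. an interactive shell
>>     cat /proc/$(pidof libvirtd)/cgroup
>>     cat /proc/$$/cgroup
>>     # apparmor status of the libvirt profiles
>>     aa-status | grep -i libvirt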
>>
>> Thanks,
>> Joe
>>
>> On Thu, Nov 23, 2017 at 11:03 AM, Chris Sarginson <csargiso at gmail.com>
>> wrote:
>>
>>> I think we may have pinned libvirt-bin as well (1.3.1), but I can't
>>> guarantee that, sorry - I would suggest it's worth trying to pin both
>>> initially.
>>>
>>> Chris
>>>
>>> On Thu, 23 Nov 2017 at 17:42 Joe Topjian <joe at topjian.net> wrote:
>>>
>>>> Hi Chris,
>>>>
>>>> Thanks - we will definitely look into this. To confirm: did you
>>>> downgrade libvirt as well, or was it only qemu?
>>>>
>>>> Thanks,
>>>> Joe
>>>>
>>>> On Thu, Nov 23, 2017 at 9:16 AM, Chris Sarginson <csargiso at gmail.com>
>>>> wrote:
>>>>
>>>>> We hit the same issue a while back (I suspect), which we seemed to
>>>>> resolve by pinning QEMU and related packages at the following version (you
>>>>> might need to hunt down the debs manually):
>>>>>
>>>>> 1:2.5+dfsg-5ubuntu10.5
>>>>>
>>>>> I'm certain there's a launchpad bug for Ubuntu qemu regarding this,
>>>>> but don't have it to hand.
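>>>>>
>>>>> (A pin of that sort can be expressed in apt preferences - a sketch;
>>>>> the package glob may need widening for your setup:)
>>>>>
>>>>>     # /etc/apt/preferences.d/qemu-pin
>>>>>     Package: qemu-*
>>>>>     Pin: version 1:2.5+dfsg-5ubuntu10.5
>>>>>     Pin-Priority: 1001
>>>>>
>>>>> followed by downgrading in place, e.g. "apt-get install
>>>>> qemu-system-x86=1:2.5+dfsg-5ubuntu10.5".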
>>>>>
>>>>> Hope this helps,
>>>>> Chris
>>>>>
>>>>> On Thu, 23 Nov 2017 at 15:33 Joe Topjian <joe at topjian.net> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> We're seeing some strange libvirt issues in an Ubuntu 16.04
>>>>>> environment. It's running Mitaka, but I don't think this is a problem with
>>>>>> OpenStack itself.
>>>>>>
>>>>>> We're in the process of upgrading this environment from Ubuntu 14.04
>>>>>> with the Mitaka cloud archive to 16.04. Instances are being live migrated
>>>>>> (NFS share) to a new 16.04 compute node (fresh install), so there's a
>>>>>> change between libvirt versions (1.2.2 to 1.3.1). The problem we're seeing
>>>>>> is only happening on the 16.04/1.3.1 nodes.
>>>>>>
>>>>>> We're getting occasional reports of instances that cannot be
>>>>>> snapshotted. Upon investigation, the snapshot process quits early with a
>>>>>> libvirt/qemu lock timeout error. We then see that the instance's XML file
>>>>>> has disappeared from /etc/libvirt/qemu, and we must restart libvirt and
>>>>>> hard-reboot the instance to get things back to a normal state. Trying to
>>>>>> live-migrate the instance to another node causes the same thing to happen.
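>>>>>>
>>>>>> (Concretely, recovery is along the lines of restarting the daemon on
>>>>>> the compute node and hard-rebooting the instance through nova:)
>>>>>>
>>>>>>     systemctl restart libvirt-bin
>>>>>>     nova reboot --hard <instance-uuid>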
>>>>>>
>>>>>> However, at some random time, either the snapshot or the migration
>>>>>> will work without error. I haven't been able to reproduce this issue on my
>>>>>> own and haven't been able to figure out the root cause by inspecting
>>>>>> instances reported to me.
>>>>>>
>>>>>> One thing that has stood out is the length of time it takes for
>>>>>> libvirt to start. If I run "/etc/init.d/libvirt-bin start", it takes at
>>>>>> least 5 minutes before a simple "virsh list" will work. The command will
>>>>>> hang otherwise. If I increase libvirt's logging level, I can see that
>>>>>> during this period of time, libvirt is working on iptables and ebtables
>>>>>> (looks like it's shelling out commands).
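>>>>>>
>>>>>> (For anyone who wants to see the same output, the relevant knobs in
>>>>>> /etc/libvirt/libvirtd.conf are something like the following, plus a
>>>>>> daemon restart:)
>>>>>>
>>>>>>     log_level = 1
>>>>>>     log_outputs = "1:file:/var/log/libvirt/libvirtd.log"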
>>>>>>
>>>>>> But if I run "libvirtd -l" straight on the command line, all of this
>>>>>> completes within 5 seconds (including all of the shelling out).
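>>>>>>
>>>>>> (i.e., roughly this comparison - start the daemon one way or the
>>>>>> other, then time how long until "virsh list" responds:)
>>>>>>
>>>>>>     time ( /etc/init.d/libvirt-bin start && virsh list )   # ~5 minutes via systemd
>>>>>>     /etc/init.d/libvirt-bin stop
>>>>>>     time ( libvirtd -l & virsh list )                      # ~5 seconds direct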
>>>>>>
>>>>>> My initial thought is that systemd is doing some type of throttling
>>>>>> between the system and user slice, but I've tried comparing slice
>>>>>> attributes and, probably due to my lack of understanding of systemd, can't
>>>>>> find anything to prove this.
>>>>>>
>>>>>> Is anyone else running into this problem? Does anyone know what might
>>>>>> be the cause?
>>>>>>
>>>>>> Thanks,
>>>>>> Joe