[openstack-dev] [TripleO] Tis the season...for a cloud reboot

Derek Higgins derekh at redhat.com
Tue Dec 19 22:45:49 UTC 2017


On 19 December 2017 at 22:23, Brian Haley <haleyb.dev at gmail.com> wrote:

> On 12/19/2017 04:00 PM, Ben Nemec wrote:
>
>>
>>
>> On 12/19/2017 02:43 PM, Brian Haley wrote:
>>
>>> On 12/19/2017 11:53 AM, Ben Nemec wrote:
>>>
>>>> The reboot is done (mostly...see below).
>>>>
>>>> On 12/18/2017 05:11 PM, Joe Talerico wrote:
>>>>
>>>>> Ben - Can you provide some links to the ovs port exhaustion issue for
>>>>> some background?
>>>>>
>>>>
>>>> I don't know if we ever had a bug opened, but there's some discussion
>>>> of it in http://lists.openstack.org/pipermail/openstack-dev/2016-December/109182.html
>>>> I've also copied Derek since I believe he was the one who found it
>>>> originally.
>>>>
>>>> The gist is that after about 3 months of tripleo-ci running in this
>>>> cloud we start to hit errors creating instances because of problems
>>>> creating OVS ports on the compute nodes.  Sometimes we see a huge number of
>>>> ports in general, other times we see a lot of ports that look like this:
>>>>
>>>> Port "qvod2cade14-7c"
>>>>              tag: 4095
>>>>              Interface "qvod2cade14-7c"
>>>>
>>>> Notably they all have a tag of 4095, which seems suspicious to me.  I
>>>> don't know whether it's actually an issue though.
>>>>
>>>
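For what it's worth, a quick way to gauge how many of these suspicious ports have piled up is to scan `ovs-vsctl show` output for ports stuck on tag 4095. A rough sketch (the sample output and helper name below are mine, not from this cloud):

```python
import re

# Hypothetical, trimmed sample in the shape of `ovs-vsctl show` output.
SAMPLE = '''\
        Port "qvod2cade14-7c"
            tag: 4095
            Interface "qvod2cade14-7c"
        Port "qvoaabbccdd-11"
            tag: 101
            Interface "qvoaabbccdd-11"
'''

def dead_ports(show_output):
    """Return names of ports stuck on VLAN tag 4095 (neutron's 'dead' tag)."""
    dead, current = [], None
    for line in show_output.splitlines():
        m = re.match(r'\s*Port "([^"]+)"', line)
        if m:
            # Remember the port name until we see its tag line.
            current = m.group(1)
        elif current and re.match(r'\s*tag: 4095\b', line):
            dead.append(current)
            current = None
    return dead

print(dead_ports(SAMPLE))  # -> ['qvod2cade14-7c']
```

On a real compute node you would feed it the actual `ovs-vsctl show` output instead of the sample string.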
>>> Tag 4095 is for "dead" OVS ports; it's an unused VLAN tag that the
>>> agent assigns to them.
>>>
>>> The 'qvo' prefix here shows it's half of the veth pair that os-vif
>>> created when it plugged in the VM (the other half is 'qvb'); the pair
>>> exists so that neutron can apply iptables rules.  It's part of the
>>> "old" way of doing security groups with the
>>> OVSHybridIptablesFirewallDriver, and can eventually go away once the
>>> OVSFirewallDriver can be used everywhere (requires a newer OVS and
>>> agent).
>>>
>>> I wonder if you can run the ovs_cleanup utility to clean some of these
>>> up?
>>>
>>
>> As in neutron-ovs-cleanup?  Doesn't that wipe out everything, including
>> any ports that are still in use?  Or is there a different tool I'm not
>> aware of that can do more targeted cleanup?
>>
>
> Crap, I thought there was an option to just clean up these dead devices.
> I should have read the code: it's either neutron ports (the default) or
> all ports.  Maybe that should be an option.
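Until such an option exists, the targeted version could be scripted outside neutron: delete only the ports carrying the dead tag and leave in-use ports alone. A hypothetical sketch (the bridge name, port list, and helper name are assumptions; it prints the commands for review rather than running them):

```python
def cleanup_commands(dead_port_names, bridge="br-int"):
    """Build targeted ovs-vsctl del-port commands for the dead ports only."""
    return ["ovs-vsctl del-port %s %s" % (bridge, name)
            for name in dead_port_names]

# Print rather than execute, so the list can be sanity-checked first.
for cmd in cleanup_commands(["qvod2cade14-7c"]):
    print(cmd)
```

The port list would come from whatever inventory of tag-4095 ports you trust, e.g. parsed `ovs-vsctl show` output.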


IIRC, neutron-ovs-cleanup was being run following the reboot as part of
an ExecStartPre= on one of the neutron services; that is essentially what
removed the ports for us.
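For reference, the hook described above would look roughly like this in a systemd unit; the paths and the unit it hangs off vary by distro and release, so treat this as illustrative only, not copied from rh1:

```ini
# Illustrative drop-in for a neutron agent unit (hypothetical paths).
[Service]
ExecStartPre=/usr/bin/neutron-ovs-cleanup --config-file /etc/neutron/neutron.conf
```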



>
>
> -Brian
>
>
>> Oh, also worth noting that I don't think we have os-vif in this cloud
>> because it's so old.  There's no os-vif package installed anyway.
>>
>>
>>> -Brian
>>>
>>> I've had some offline discussions about getting someone on this cloud to
>>>> debug the problem.  Originally we decided not to pursue it since it's not
>>>> hard to work around and we didn't want to disrupt the environment by trying
>>>> to move to later OpenStack code (we're still back on Mitaka), but it was
>>>> pointed out to me this time around that from a downstream perspective we
>>>> have users on older code as well and it may be worth debugging to make sure
>>>> they don't hit similar problems.
>>>>
>>>> To that end, I've left one compute node un-rebooted for debugging
>>>> purposes.  The downstream discussion is ongoing, but I'll update here if we
>>>> find anything.
>>>>
>>>>
>>>>> Thanks,
>>>>> Joe
>>>>>
>>>>> On Mon, Dec 18, 2017 at 10:43 AM, Ben Nemec <openstack at nemebean.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> It's that magical time again.  You know the one, when we reboot rh1
>>>>>> to avoid
>>>>>> OVS port exhaustion. :-)
>>>>>>
>>>>>> If all goes well you won't even notice that this is happening, but
>>>>>> there is the possibility that a few jobs will fail while the
>>>>>> te-broker host is rebooted, so I wanted to let everyone know.  If
>>>>>> you notice that anything else hosted in rh1 (tripleo.org,
>>>>>> zuul-status, etc.) is down, let me know.  I have been known to
>>>>>> forget to restart services after the reboot.
>>>>>>
>>>>>> I'll send a followup when I'm done.
>>>>>>
>>>>>> -Ben
>>>>>>
>>>>>> __________________________________________________________________________
>>>>>>
>>>>>> OpenStack Development Mailing List (not for usage questions)
>>>>>> Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
>>>>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>>>>>