[openstack-dev] [TripleO] Tis the season...for a cloud reboot
Joe Talerico
jtaleric at redhat.com
Wed Dec 20 15:25:46 UTC 2017
On Wed, Dec 20, 2017 at 9:08 AM, Ben Nemec <openstack at nemebean.com> wrote:
>
>
> On 12/19/2017 05:34 PM, Joe Talerico wrote:
>>
>> On Tue, Dec 19, 2017 at 5:45 PM, Derek Higgins <derekh at redhat.com> wrote:
>>>
>>> On 19 December 2017 at 22:23, Brian Haley <haleyb.dev at gmail.com> wrote:
>>>>
>>>>
>>>> On 12/19/2017 04:00 PM, Ben Nemec wrote:
>>>>>
>>>>> On 12/19/2017 02:43 PM, Brian Haley wrote:
>>>>>>
>>>>>>
>>>>>> On 12/19/2017 11:53 AM, Ben Nemec wrote:
>>>>>>>
>>>>>>>
>>>>>>> The reboot is done (mostly...see below).
>>>>>>>
>>>>>>> On 12/18/2017 05:11 PM, Joe Talerico wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> Ben - Can you provide some links to the OVS port exhaustion issue
>>>>>>>> for some background?
>>>>>>>
>>>>>>> I don't know if we ever had a bug opened, but there's some discussion
>>>>>>> of it in
>>>>>>>
>>>>>>> http://lists.openstack.org/pipermail/openstack-dev/2016-December/109182.html
>>>>>>> I've also copied Derek since I believe he was the one who found it
>>>>>>> originally.
>>>>>>>
>>>>>>> The gist is that after about 3 months of tripleo-ci running in this
>>>>>>> cloud, we start to hit errors creating instances because of problems
>>>>>>> creating OVS ports on the compute nodes. Sometimes we see a huge
>>>>>>> number of ports in general; other times we see a lot of ports that
>>>>>>> look like this:
>>>>>>>
>>>>>>>     Port "qvod2cade14-7c"
>>>>>>>         tag: 4095
>>>>>>>         Interface "qvod2cade14-7c"
>>>>>>>
>>>>>>> Notably they all have a tag of 4095, which seems suspicious to me. I
>>>>>>> don't know whether it's actually an issue though.
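>>>>>>>
>>>>>>> If you want to poke at it, something like this should enumerate the
>>>>>>> suspicious ports (just a sketch, assuming a new enough ovs-vsctl):
>>>>>>>
>>>>>>>   # names of all ports stuck on the "dead" tag
>>>>>>>   ovs-vsctl --columns=name find Port tag=4095
>>>>>>>   # total port count on the integration bridge
>>>>>>>   ovs-vsctl list-ports br-int | wc -l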
>>>>>>
>>>>>> Tag 4095 is for "dead" OVS ports; it's an unused VLAN tag reserved by
>>>>>> the agent.
>>>>>>
>>>>>> The 'qvo' prefix here shows it's part of the veth pair that os-vif
>>>>>> created when it plugged in the VM (the other half is 'qvb'); the pair
>>>>>> exists so that neutron can apply iptables rules. It's part of the
>>>>>> "old" way to do security groups with the
>>>>>> OVSHybridIptablesFirewallDriver, and can eventually go away once the
>>>>>> OVSFirewallDriver can be used everywhere (which requires a newer OVS
>>>>>> and agent).
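>>>>>>
>>>>>> For reference, the driver is selected in openvswitch_agent.ini; a
>>>>>> minimal sketch of the two choices (the native one needs a new enough
>>>>>> OVS/kernel):
>>>>>>
>>>>>>   [securitygroup]
>>>>>>   # old hybrid driver, creates the qvb/qvo veth pairs
>>>>>>   firewall_driver = iptables_hybrid
>>>>>>   # native OVS firewall, no veth pairs needed
>>>>>>   #firewall_driver = openvswitch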
>>>>>>
>>>>>> I wonder if you can run the ovs_cleanup utility to clean some of these
>>>>>> up?
>>>>>
>>>>> As in neutron-ovs-cleanup? Doesn't that wipe out everything, including
>>>>> any ports that are still in use? Or is there a different tool I'm not
>>>>> aware of that can do more targeted cleanup?
>>>>
>>>> Crap, I thought there was an option to clean up just these dead
>>>> devices; I should have read the code. It's either neutron ports (the
>>>> default) or all ports. Maybe that should be an option.
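>>>>
>>>> For what it's worth, the only knob that exists today widens the blast
>>>> radius rather than narrowing it (from memory, so treat as a sketch):
>>>>
>>>>   # default: removes the neutron-created ports
>>>>   neutron-ovs-cleanup --config-file /etc/neutron/neutron.conf
>>>>   # removes every OVS port on the node, neutron's or not
>>>>   neutron-ovs-cleanup --config-file /etc/neutron/neutron.conf --ovs_all_ports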
>>>
>>> IIRC neutron-ovs-cleanup was run following the reboot as part of an
>>> ExecStartPre= on one of the neutron services; this is essentially what
>>> removed the ports for us.
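>>>
>>> (To verify, something like this should show it, assuming the usual RDO
>>> unit layout:
>>>
>>>   systemctl cat neutron-openvswitch-agent.service | grep -i execstart
>>> )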
>>>
>>>
>>
>> There are actually unit files for the cleanup utilities (netns|ovs|lb);
>> the ovs-cleanup one is below[1].
>>
>> Maybe this can be run to mitigate the need for a reboot?
>
>
> That's what Brian suggested too, but running it with instances on the node
> will cause an outage because it cleans up everything, including in-use
> ports. The reason a reboot works is basically that it causes this unit to
> run when the node comes back up, since it's a dependency of the other
> services. So we could possibly use this to skip the complete reboot, but
> the reboot itself isn't the time-consuming part of the process; the slow
> part is waiting for all the instances to cycle off so we don't cause
> spurious failures when we wipe the OVS ports. Actually rebooting the nodes
> takes about five minutes (and it's only that long because of an old
> TripleO bug).
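>
> (For the curious, draining a node looks roughly like this, with compute-0
> as a placeholder hostname:
>
>   # stop scheduling new instances onto the node
>   nova service-disable compute-0 nova-compute
>   # then wait for the CI instances already there to finish and go away
>   watch 'nova list --all-tenants --host compute-0 | grep -c ACTIVE'
>
> and that wait is the slow part.)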
Ack. There are options you can pass to the cleanup so it doesn't nuke
everything. I wonder if the answer is a combination of ovs-cleanup +
restarting the ovs-agent, something like the sketch below? Anyway, it
doesn't seem that big of a problem then. /me gets off his uptime soapbox
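
Something like this is what I had in mind (completely untested, and per
your point above the in-use ports may still get hit):

  # stop the agent, sweep the stale ports, start the agent again
  sudo systemctl stop neutron-openvswitch-agent
  sudo neutron-ovs-cleanup --config-file /etc/neutron/neutron.conf
  sudo systemctl start neutron-openvswitch-agent
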
Joe
>
>
>>
>> [1]
>> [Unit]
>> Description=OpenStack Neutron Open vSwitch Cleanup Utility
>> After=syslog.target network.target openvswitch.service
>> Before=neutron-openvswitch-agent.service neutron-dhcp-agent.service
>> neutron-l3-agent.service openstack-nova-compute.service
>>
>> [Service]
>> Type=oneshot
>> User=neutron
>> ExecStart=/usr/bin/neutron-ovs-cleanup \
>>   --config-file /usr/share/neutron/neutron-dist.conf \
>>   --config-file /etc/neutron/neutron.conf \
>>   --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini \
>>   --config-dir /etc/neutron/conf.d/common \
>>   --config-dir /etc/neutron/conf.d/neutron-ovs-cleanup \
>>   --log-file /var/log/neutron/ovs-cleanup.log
>> ExecStop=/usr/bin/neutron-ovs-cleanup \
>>   --config-file /usr/share/neutron/neutron-dist.conf \
>>   --config-file /etc/neutron/neutron.conf \
>>   --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini \
>>   --config-dir /etc/neutron/conf.d/common \
>>   --config-dir /etc/neutron/conf.d/neutron-ovs-cleanup \
>>   --log-file /var/log/neutron/ovs-cleanup.log
>> PrivateTmp=true
>> RemainAfterExit=yes
>>
>> [Install]
>> WantedBy=multi-user.target
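>>
>> (A oneshot unit like this only needs enabling once and then runs at every
>> boot, e.g.:
>>
>>   sudo systemctl enable neutron-ovs-cleanup
>> )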
>>>>
>>>> -Brian
>>>>
>>>>
>>>>> Oh, also worth noting that I don't think we have os-vif in this cloud
>>>>> because it's so old. There's no os-vif package installed anyway.
>>>>>
>>>>>>
>>>>>> -Brian
>>>>>>
>>>>>>> I've had some offline discussions about getting someone on this cloud
>>>>>>> to debug the problem. Originally we decided not to pursue it, since
>>>>>>> it's not hard to work around and we didn't want to disrupt the
>>>>>>> environment by trying to move to later OpenStack code (we're still
>>>>>>> back on Mitaka). But it was pointed out to me this time around that,
>>>>>>> from a downstream perspective, we have users on older code as well,
>>>>>>> and it may be worth debugging to make sure they don't hit similar
>>>>>>> problems.
>>>>>>>
>>>>>>> To that end, I've left one compute node un-rebooted for debugging
>>>>>>> purposes. The downstream discussion is ongoing, but I'll update here
>>>>>>> if we find anything.
>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Joe
>>>>>>>>
>>>>>>>> On Mon, Dec 18, 2017 at 10:43 AM, Ben Nemec <openstack at nemebean.com>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> It's that magical time again. You know the one, when we reboot rh1
>>>>>>>>> to avoid OVS port exhaustion. :-)
>>>>>>>>>
>>>>>>>>> If all goes well you won't even notice that this is happening, but
>>>>>>>>> there is the possibility that a few jobs will fail while the
>>>>>>>>> te-broker host is rebooted, so I wanted to let everyone know. If you
>>>>>>>>> notice that anything else hosted in rh1 (tripleo.org, zuul-status,
>>>>>>>>> etc.) is down, let me know. I have been known to forget to restart
>>>>>>>>> services after the reboot.
>>>>>>>>>
>>>>>>>>> I'll send a followup when I'm done.
>>>>>>>>>
>>>>>>>>> -Ben