[neutron][nova] [kolla] vif plugged timeout

Jan Vondra jan.vondra at ultimum.io
Wed Nov 24 13:30:00 UTC 2021


Hi guys,
I've been further investigating Michal's (OP) issue, since he is on
holiday, and I've found that the issue is not really the plugging of the
VIF itself but the cleanup after previous port bindings.

We are creating 6 servers with 2-4 VIFs each using a heat template [0]. We
were hitting some problems with placement, so the stack sometimes failed to
create and we had to delete it and recreate it.
If we recreate it right after the deletion, the vif plugging timeout
occurs. If we wait some time (approx. 10 minutes), the stack is created
successfully.
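
For reference, the reproduction loop is essentially the following (a rough
sketch using openstacksdk; the cloud name, stack name and local template
file are placeholders, the real template is the one in [0]):

import openstack

# "mycloud" is a placeholder clouds.yaml entry; stack.yaml stands in for
# the heat template from [0].
conn = openstack.connect(cloud="mycloud")

conn.create_stack("vif-test", template_file="stack.yaml", wait=True)
conn.delete_stack("vif-test", wait=True)

# Recreating right after the delete hits the vif plugging timeout;
# sleeping ~10 minutes here (time.sleep(600)) makes the recreate succeed.
conn.create_stack("vif-test", template_file="stack.yaml", wait=True)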

This leads me to believe that there is some issue with deferring the
removal of security groups from unbound ports (somewhere around this part
of the code [1]) and that it somehow affects the creation of new ports.
However, I am unable to find any lock that could cause this behaviour.
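
To make the suspicion concrete, below is a toy model of the behaviour I am
looking for (pure illustration, not neutron code; every name in it is made
up): deferred cleanup of stale port state holding a lock that the
processing of new ports also needs, so a burst of deletions right before a
recreate stalls the new ports until the backlog drains.

import threading
import time

# Hypothetical shared lock standing in for whatever serializes the agent loop.
PROCESS_LOCK = threading.Lock()
stale_ports = ["port-%d" % i for i in range(50)]  # leftovers from the deleted stack


def deferred_cleanup():
    # Remove security group state for unbound ports (slow, e.g. conntrack entries).
    with PROCESS_LOCK:
        while stale_ports:
            stale_ports.pop()
            time.sleep(0.1)  # pretend each removal costs a fraction of a second


def process_new_port(port_id):
    # Wire up a freshly bound port; blocks until the cleanup releases the lock.
    start = time.time()
    with PROCESS_LOCK:
        pass  # real work would go here
    print("%s waited %.1f s before it could be processed"
          % (port_id, time.time() - start))


threading.Thread(target=deferred_cleanup).start()
time.sleep(0.5)                 # "recreate the stack right after the deletion"
process_new_port("new-port-1")  # observes a multi-second wait, i.e. the symptom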

The only evidence I have is that, in the stack recreation scenario, I have
measured the process_network_ports [2] function call taking up to
650 s (it varies from 5 s to 651 s in our environment).
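
In case anyone wants to reproduce the measurement, a timing wrapper along
these lines is enough (only a sketch; where you apply the decorator and how
you log the result is up to you):

import functools
import time

from oslo_log import log as logging

LOG = logging.getLogger(__name__)


def log_duration(func):
    # Log how long each call to the wrapped function takes.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            return func(*args, **kwargs)
        finally:
            LOG.info("%s took %.1f s", func.__name__,
                     time.monotonic() - start)
    return wrapper

# Applied in ovs_neutron_agent.py as a decorator:
#
#     @log_duration
#     def process_network_ports(...):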

Any idea what could be causing this?

[0] https://pastebin.com/infvj4ai
[1]
https://github.com/openstack/neutron/blob/master/neutron/agent/firewall.py#L133
[2]
https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L2079

Jan Vondra
http://ultimum.io


On Wed, 24 Nov 2021 at 11:08, Bogdan Dobrelya <bdobreli at redhat.com>
wrote:

> On 11/24/21 1:21 AM, Tony Liu wrote:
> > I hit the same problem, from time to time, not consistently. I am using
> > OVN.
> > Typically, it takes no more than a few seconds for neutron to confirm
> > that the port is up. The default timeout in my setup is 600 s. Even when
> > the port shows up in both the OVN SB and NB databases, nova-compute
> > still didn't get confirmation from neutron. Either neutron didn't pick
> > it up or the message was lost and didn't get to nova-compute.
> > Hoping someone could share more thoughts.
>
> That may also be a super-set of the revert-resize with OVS hybrid plug
> issue described in [0]. Even though the problems described in this thread
> may have nothing to do with that particular case, they do look related to
> the external events framework.
>
> Issues like that make me think about some improvements to it.
>
> [tl;dr] bring back up the idea of buffering events with a ttl
>
> Like a new deferred RPC calls feature, maybe? That would execute a call
> after some trigger, e.g. send unplug and forget. That would make
> debugging harder, but it would cover the cases where an external service
> "forgot" to notify Nova when it was done (an event was lost, and the like).
>
> Adding a queue to store events that Nova did not have a receive handler
> set for might help as well. And have a TTL set on it, or more advanced
> reaping logic, for example based on tombstone events invalidating the
> queue contents by causal conditions. That would eliminate the flaky
> expectations set around starting to wait for receiving events vs. sending
> unexpected or belated events. Why flaky? Because in an async distributed
> system there is no "before" nor "after", so a service external to Nova is
> unlikely to conform to any time-frame-based contract for
> send-notify/wait-receive/real-completion-fact. And the fact that Nova
> can't tell what the network backend is (because [1] was not fully
> implemented) does not make things simpler.
>
> As Sean noted in a private IRC conversation, with OVN the current
> implementation is not capable of fulfilling the contract that
> network-vif-plugged events are only sent after the interface is fully
> configured. So it sends events at bind time, once it has updated the
> logical port in the OVN DB but before the real configuration has happened.
> I believe that deferred RPC calls and/or queued events might improve on
> such "cheating" by making the real post-completion processing a thing for
> any backend?
>
> [0] https://bugs.launchpad.net/nova/+bug/1952003
>
> [1]
> https://specs.openstack.org/openstack/neutron-specs/specs/train/port-binding-extended-information.html
>
> >
> > Thanks!
> > Tony
> > ________________________________________
> > From: Laurent Dumont <laurentfdumont at gmail.com>
> > Sent: November 22, 2021 02:05 PM
> > To: Michal Arbet
> > Cc: openstack-discuss
> > Subject: Re: [neutron][nova] [kolla] vif plugged timeout
> >
> > How high did you have to raise it? If it does appear after X amount of
> > time, then the VIF plug is not lost?
> >
> > On Sat, Nov 20, 2021 at 7:23 AM Michal Arbet
> > <michal.arbet at ultimum.io> wrote:
> > + If I raise vif_plugged_timeout (hope I remember it correctly) in nova
> > to some high number, the problem disappears... but it's only a workaround.
> >
> > On Sat, 20 Nov 2021 at 12:05, Michal Arbet
> > <michal.arbet at ultimum.io> wrote:
> > Hi,
> >
> > Has anyone seen the issue which I am currently facing?
> >
> > When launching a heat stack (but it's the same if I launch several
> > instances), vif plugging times out and I don't know why; sometimes it is
> > OK, sometimes it is failing.
> >
> > Sometimes neutron reports vif plugged in < 10 sec (test env), sometimes
> > it's 100 seconds and more; it seems there is some race condition but I
> > can't find where the problem is. But in the end every instance is
> > spawned OK (the retry mechanism worked).
> >
> > Another finding is that it has something to do with security groups; if
> > the noop driver is used, everything works fine.
> >
> > The firewall security setup is openvswitch.
> >
> > The test env is Wallaby.
> >
> > I will attach some logs when I am near a PC.
> >
> > Thank you,
> > Michal Arbet (Kevko)
> >
> >
> >
> >
> >
> >
>
>
> --
> Best regards,
> Bogdan Dobrelya,
> Irc #bogdando
>
>
>