[neutron][nova] [kolla] vif plugged timeout

Michal Arbet michal.arbet at ultimum.io
Mon Nov 29 16:14:03 UTC 2021


Hello,

Have you already considered what Jan Vondra sent to this discussion? I am
just making sure that it was read.

Thanks,
Michal Arbet
OpenStack Engineer

Ultimum Technologies a.s.
Na Poříčí 1047/26, 11000 Praha 1
Czech Republic

+420 604 228 897
michal.arbet at ultimum.io
https://ultimum.io

LinkedIn <https://www.linkedin.com/company/ultimum-technologies> | Twitter
<https://twitter.com/ultimumtech> | Facebook
<https://www.facebook.com/ultimumtechnologies/timeline>


On Wed, Nov 24, 2021 at 14:30, Jan Vondra <jan.vondra at ultimum.io>
wrote:

> Hi guys,
> I've been further investigating Michal's (OP) issue, since he is on
> holiday, and I've found out that the issue is not really with plugging the
> VIF but with the cleanup after the previous port bindings.
>
> We are creating 6 servers with 2-4 VIFs each using a heat template [0]. We
> were hitting some problems with placement, so the stack sometimes failed to
> create and we had to delete the stack and recreate it.
> If we recreate it right after the deletion, the VIF plugging timeout
> occurs. If we wait some time (approx. 10 minutes), the stack is created
> successfully.
>
> This leads me to believe that there is some issue with deferring the
> removal of security groups from unbound ports (somewhere around this part
> of the code [1]) and that it somehow affects the creation of new ports.
> However, I am unable to find any lock that could cause this behaviour.
>
> The only evidence I have is that, in the stack recreation scenario, I have
> measured that the process_network_ports [2] call can take up to 650 s (it
> varies from 5 s to 651 s in our environment); see the timing sketch below
> the links.
>
> Any idea what could be causing this?
>
> [0] https://pastebin.com/infvj4ai
> [1]
> https://github.com/openstack/neutron/blob/master/neutron/agent/firewall.py#L133
> [2]
> https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L2079
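>
> For reference, a minimal sketch of the kind of timing wrapper that can be
> dropped into a local patch to get such numbers (the helper is hypothetical,
> it is not an upstream neutron utility):
>
>     import functools
>     import logging
>     import time
>
>     LOG = logging.getLogger(__name__)
>
>     def log_duration(func):
>         """Log how long a single call to the wrapped function takes."""
>         @functools.wraps(func)
>         def wrapper(*args, **kwargs):
>             start = time.monotonic()
>             try:
>                 return func(*args, **kwargs)
>             finally:
>                 LOG.info("%s took %.1f s", func.__name__,
>                          time.monotonic() - start)
>         return wrapper
>
>     # e.g. applied to OVSNeutronAgent.process_network_ports in a local
>     # patch; grepping the agent log then shows the per-iteration duration.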
>
> Jan Vondra
> http://ultimum.io
>
>
> On Wed, Nov 24, 2021 at 11:08, Bogdan Dobrelya <bdobreli at redhat.com>
> wrote:
>
>> On 11/24/21 1:21 AM, Tony Liu wrote:
>> > I hit the same problem from time to time, not consistently. I am using
>> > OVN. Typically, it takes no more than a few seconds for neutron to
>> > confirm the port is up. The default timeout in my setup is 600 s. Even
>> > though the ports show up in both the OVN SB and NB databases,
>> > nova-compute still didn't get the confirmation from neutron. Either
>> > neutron didn't pick it up, or the message was lost and didn't get to
>> > nova-compute.
>> > Hoping someone could share more thoughts.
>>
>> That may also be a super-set of the revert-resize-with-OVS-hybrid-plug
>> issue described in [0]. Even though the problems described in this topic
>> may have nothing to do with that particular case, they do look related to
>> the external events framework.
>>
>> Issues like that make me think about some improvements to it.
>>
>> [tl;dr] bring back up the idea of buffering events with a TTL
>>
>> Like a new deferred RPC calls feature, maybe? That would execute a call
>> after some trigger, like "send unplug and forget". That would make
>> debugging harder, but it would cover the cases where an external service
>> "forgot" to notify Nova when it was done (an event was lost, and the like).
>>
>> Adding a queue to store events that Nova did not have a receive handler
>> set for might help as well. And have a TTL set on it, or more advanced
>> reaping logic, for example based on tombstone events invalidating the
>> queue contents by causal conditions. That would eliminate the flaky
>> expectations set around starting to wait for receiving events vs. sending
>> unexpected or belated events. Why flaky? Because in an async distributed
>> system there is no "before" nor "after", so a service external to Nova is
>> unlikely to conform to any time-frame-based contract for
>> send-notify/wait-receive/real-completion-fact. And the fact that Nova
>> can't tell what the network backend is (because [1] was not fully
>> implemented) does not make things simpler.
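>>
>> To illustrate the buffering idea, a minimal sketch of such a TTL'd event
>> queue (the names and reaping policy here are made up; this is not existing
>> Nova code):
>>
>>     import time
>>     from collections import deque
>>
>>     class EventBuffer(object):
>>         """Hold early/unexpected events so a waiter that registers
>>         later can still consume them, until a TTL expires."""
>>
>>         def __init__(self, ttl=300):
>>             self.ttl = ttl
>>             self._events = deque()
>>
>>         def add(self, event):
>>             self._reap()
>>             self._events.append((time.monotonic(), event))
>>
>>         def pop_matching(self, predicate):
>>             """Return and remove the first buffered event matching
>>             predicate, or None if there is no such event."""
>>             self._reap()
>>             for i, (_, event) in enumerate(self._events):
>>                 if predicate(event):
>>                     del self._events[i]
>>                     return event
>>             return None
>>
>>         def _reap(self):
>>             # Drop events older than the TTL.
>>             now = time.monotonic()
>>             while self._events and now - self._events[0][0] > self.ttl:
>>                 self._events.popleft()
>>
>> A tombstone event could then simply drop the matching buffered entries
>> outright instead of waiting for the TTL.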
>>
>> As Sean noted in a private IRC conversation, with OVN the current
>> implementation is not capable of fulfilling the contract that
>> network-vif-plugged events are only sent after the interface is fully
>> configured. So it sends the events at bind time, once it has updated the
>> logical port in the OVN DB, but before the real configuration has
>> happened. I believe that deferred RPC calls and/or queued events might
>> improve such "cheating" by making real post-completion processing a thing
>> for any backend?
>>
>> [0] https://bugs.launchpad.net/nova/+bug/1952003
>>
>> [1]
>>
>> https://specs.openstack.org/openstack/neutron-specs/specs/train/port-binding-extended-information.html
>>
>> >
>> > Thanks!
>> > Tony
>> > ________________________________________
>> > From: Laurent Dumont <laurentfdumont at gmail.com>
>> > Sent: November 22, 2021 02:05 PM
>> > To: Michal Arbet
>> > Cc: openstack-discuss
>> > Subject: Re: [neutron][nova] [kolla] vif plugged timeout
>> >
>> > How high did you have to raise it? If it does appear after X amount of
>> > time, then the VIF plug event is not lost?
>> >
>> > On Sat, Nov 20, 2021 at 7:23 AM Michal Arbet <michal.arbet at ultimum.io>
>> > wrote:
>> > + If I raise vif_plugging_timeout (I hope I remember the name correctly)
>> > in nova to some high number, the problem disappears... but it's only a
>> > workaround.
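>> >
>> > (For the record, the nova option is vif_plugging_timeout in [DEFAULT],
>> > together with vif_plugging_is_fatal; a rough sketch of the workaround in
>> > nova.conf, with the values picked only as an example:)
>> >
>> >     [DEFAULT]
>> >     # default is 300 s; raising it only hides the slow plugging
>> >     vif_plugging_timeout = 900
>> >     # optionally don't fail the boot if the event never arrives
>> >     vif_plugging_is_fatal = false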
>> >
>> > On Sat, Nov 20, 2021 at 12:05, Michal Arbet <michal.arbet at ultimum.io>
>> > wrote:
>> > Hi,
>> >
>> > Has anyone seen the issue I am currently facing?
>> >
>> > When launching a heat stack (but it's the same if I launch several
>> > instances), VIF plugging times out and I don't know why; sometimes it is
>> > OK, sometimes it fails.
>> >
>> > Sometimes neutron reports the VIF plugged in < 10 s (test env),
>> > sometimes it takes 100 seconds or more; it seems there is some race
>> > condition, but I can't find out where the problem is. In the end every
>> > instance is spawned OK (the retry mechanism worked).
>> >
>> > Another finding is that it has something to do with security groups: if
>> > the noop firewall driver is used, everything works fine.
>> >
>> > The firewall driver setup is openvswitch.
>> >
>> > The test env is Wallaby.
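>> >
>> > (For context, the driver is selected in the [securitygroup] section of
>> > the OVS agent configuration, roughly like this; switching it to noop is
>> > what makes the problem disappear:)
>> >
>> >     [securitygroup]
>> >     firewall_driver = openvswitch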
>> >
>> > I will attach some logs when I am near a PC.
>> >
>> > Thank you,
>> > Michal Arbet (Kevko)
>> >
>>
>> --
>> Best regards,
>> Bogdan Dobrelya,
>> Irc #bogdando
>>
>>
>>