[neutron][nova] [kolla] vif plugged timeout
Sean Mooney
smooney at redhat.com
Wed Nov 24 14:53:21 UTC 2021
On Wed, 2021-11-24 at 11:05 +0100, Bogdan Dobrelya wrote:
> On 11/24/21 1:21 AM, Tony Liu wrote:
> > I hit the same problem, from time to time, not consistently. I am using OVN.
> > Typically, it takes no more than a few seconds for neutron to confirm the port is up.
> > The default timeout in my setup is 600s. Even when the ports show up in both OVN SB
> > and NB, nova-compute still doesn't get confirmation from neutron. Either neutron
> > didn't pick it up or the message was lost and didn't get to nova-compute.
> > Hoping someone could share more thoughts.
>
> That also may be a super-set of the revert-resize with OVS hybrid plug
> issue described in [0]. Even though the problems described in this topic
> may have nothing to do with that particular case, they do look related to
> the external events framework.
>
> Issues like that make me think about some improvements to it.
>
> [tl;dr] bring back up the idea of buffering events with a ttl
>
> Like a new deferred RPC calls feature maybe? That would execute a call
> after some trigger, like send unplug and forget. That would make
> debugging harder, but cover the cases when an external service "forgot"
> (an event was lost and the like cases) to notify Nova when it is done.
>
> Adding a queue to store events that Nova did not have a receive handler
> set for might help as well. And have a TTL set on it, or a more advanced
> reaping logic, for example based on tombstone events invalidating the
> queue contents by causal conditions. That would eliminate flaky
> expectations set around starting to wait for receiving events vs sending
> unexpected or belated events. Why flaky? Because in an async distributed
> system there is no "before" nor "after", so a service external to Nova
> is unlikely to conform to any time-frame based contract for
> send-notify/wait-receive/real-completion-fact. And the fact that Nova
> can't tell what the network backend is (because [1] was not fully
> implemented) does not make things simpler.
I honestly don't think this is a viable option. We have discussed it several times
in nova in the past and keep coming to the same conclusion:
either the events should be sent and waited for at the right times or they lose their value.
Buffering the events masks bad behavior in non-compliant network backends, and it potentially
exposes tenants and operators to security issues by breaking multi-tenancy
https://bugs.launchpad.net/neutron/+bug/1734320
or network connectivity
https://bugs.launchpad.net/nova/+bug/1815989.
Neutron sometimes sends the events earlier than we expect, and sometimes it sends multiple
network-vif-plugged events for effectively the same operation. We recently "fixed" the fact
that the dhcp agent would send a network-vif-plugged event during live migration because
it was already configured and the port was fully plugged on the source node, while we were
waiting for the event from the destination node's l2 agent.
https://review.opendev.org/c/openstack/neutron/+/766277
However, that fix is config driven and nova cannot detect how that is set...
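
To make the ambiguity concrete, here is a rough sketch, not nova's actual code.
It assumes a waiter that matches only on the event name and the port id. The
`sender` key, the `first_match` helper and the port id "p1" are all hypothetical
and exist purely for illustration; the point is that the waiter has nothing
like `sender` to go on.

```python
# Illustrative only: "sender" is a hypothetical annotation for readability.
EVENT_NAME = "network-vif-plugged"


def first_match(events, name, port_id):
    """Return the first event that matches on name and port id alone."""
    for event in events:
        if event["name"] == name and event["port_id"] == port_id:
            return event
    return None


# Assumed order of arrival during a live migration of port "p1".
arrivals = [
    {"name": EVENT_NAME, "port_id": "p1",
     "sender": "dhcp-agent (port already plugged on source)"},
    {"name": EVENT_NAME, "port_id": "p1",
     "sender": "l2-agent (destination)"},
]

# The wait is satisfied by whichever matching event arrives first.
satisfied_by = first_match(arrivals, EVENT_NAME, "p1")
print(satisfied_by["sender"])
```

In this sketch the wait is satisfied by the first entry, before the destination
l2 agent has reported anything.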
I disagree that in a distributed system like nova there is no before or after.
We had a contract with neutron that several neutron ml2 plugins or out-of-tree core plugins
did not comply with.
When we add a vm interface to a network backend we require neutron to notify us in a timely
manner that the backend has processed the port and it's now safe to proceed.
Several backends chose to violate that contract, including ovn, and as a result we have to
try and make these broken backends work in nova when in fact
we should not support them at all.
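
The ordering that contract implies can be sketched roughly as follows. This is
a minimal illustration, not nova's implementation: `EventRegistry`,
`plug_and_wait` and the `plug` callable are hypothetical names, and the sketch
assumes that an event arriving with no registered waiter is simply dropped
(the no-buffering behavior argued for above).

```python
import threading


class EventRegistry:
    """Hypothetical registry of expected external events, keyed by (name, tag)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = {}

    def prepare(self, name, tag):
        """Register interest in an event before triggering the operation."""
        waiter = threading.Event()
        with self._lock:
            self._pending[(name, tag)] = waiter
        return waiter

    def discard(self, name, tag):
        """Forget a waiter, e.g. after a timeout."""
        with self._lock:
            self._pending.pop((name, tag), None)

    def deliver(self, name, tag):
        """Wake the waiter for this event; return False if nobody was waiting."""
        with self._lock:
            waiter = self._pending.pop((name, tag), None)
        if waiter is None:
            return False  # assumption: unexpected events are dropped, not buffered
        waiter.set()
        return True


def plug_and_wait(registry, port_id, plug, timeout):
    """Register the waiter first, then plug, then wait for the backend's event."""
    waiter = registry.prepare("network-vif-plugged", port_id)
    plug()  # hypothetical callable that hands the interface to the backend
    arrived = waiter.wait(timeout)
    if not arrived:
        registry.discard("network-vif-plugged", port_id)
    return arrived
```

The waiter is registered before the plug so a fast backend cannot win the race,
and the wait only means something if the backend sends the event after it has
actually finished wiring up the port.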
The odl community went to great effort to implement a websocket callback mechanism so that
odl can notify neutron when it has configured the port on the ovs bridge, and networking-odl
then incorporated that into their ml2 driver.
https://opendev.org/openstack/networking-odl/src/branch/master/networking_odl/ml2/port_status_update.py#L92-L95
All of the in-tree plugins before ovn was merged in tree also implemented this protocol
correctly, sending the event when the port provisioning on the network backend was complete.
ovn, however, still sets the l2 provisioning as complete when the port status is set to up.
https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/ovn/mech_driver/mech_driver.py#L1031-L1066
That gets called when the logical switch port is set to up.
https://github.com/openstack/neutron/blob/4e339776d90cf211396da5f95e29af65332dac61/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py#L421-L438
But that does not fully address the issue, since move operations like live migration are not
properly supported.
https://review.opendev.org/c/openstack/neutron-specs/+/799198/6/specs/xena/ovn-auxiliary-port-bridge-live-migration.rst#120
should help, although I'm slightly dismayed to see that they will be using a new
`port.vif_details` backend field to identify it as ovn instead of the previously agreed on
bound_drivers field.
https://specs.openstack.org/openstack/neutron-specs/specs/train/port-binding-extended-information.html
If every ml2 driver and core plugin set this new backend field, it would be more or less the
same as the bound_drivers field. However, I fear this will just get implemented for ovn since
it's part of the ovn-specific spec, which will just create more tech debt, so I'm reluctant
to suggest nova will use this info until it's done properly for all backends.
>
> As Sean noted in a private irc conversation, with OVN the current
> implementation is not capable of fulfilling the contract that
> network-vif-plugged events are only sent after the interface is fully
> configured. So it sends events at bind time once it has updated the
> logical port in the ovn db but before real configuration has happened. I
> believe that deferred RPC calls and/or queued events might improve such
> a "cheating" by making the real post-completion processing a thing for
> any backend?
>
> [0] https://bugs.launchpad.net/nova/+bug/1952003
>
> [1]
> https://specs.openstack.org/openstack/neutron-specs/specs/train/port-binding-extended-information.html
>
> >
> > Thanks!
> > Tony
> > ________________________________________
> > From: Laurent Dumont <laurentfdumont at gmail.com>
> > Sent: November 22, 2021 02:05 PM
> > To: Michal Arbet
> > Cc: openstack-discuss
> > Subject: Re: [neutron][nova] [kolla] vif plugged timeout
> >
> > How high did you have to raise it? If it does appear after X amount of time, then the VIF plug is not lost?
> >
> > On Sat, Nov 20, 2021 at 7:23 AM Michal Arbet <michal.arbet at ultimum.io> wrote:
> > + if I raise vif_plugged_timeout (hope I remember it correctly) in nova to some high number, the problem disappears... But it's only a workaround.
> >
> > On Sat, Nov 20, 2021, 12:05 Michal Arbet <michal.arbet at ultimum.io> wrote:
> > Hi,
> >
> > Has anyone seen the issue I am currently facing?
> >
> > When launching a heat stack (it's the same if I launch several instances), vif plugging times out and I don't know why. Sometimes it is OK, sometimes it fails.
> >
> > Sometimes neutron reports vif plugged in < 10 sec (test env), sometimes it takes 100 seconds or more. It seems there is some race condition, but I can't find out where the problem is. In the end every instance is spawned OK (the retry mechanism worked).
> >
> > Another finding is that it has something to do with security groups: if the noop driver is used, everything works fine.
> >
> > Firewall security setup is openvswitch .
> >
> > Test env is wallaby.
> >
> > I will attach some logs when I am near a PC.
> >
> > Thank you,
> > Michal Arbet (Kevko)
> >
> >
> >
> >
> >
> >
>
>