[neutron][nova] [kolla] vif plugged timeout
Hi,
Has anyone seen the issue I am currently facing? When launching a heat stack (but it's the same if I launch several instances), "vif plugged in" times out and I don't know why; sometimes it is OK, sometimes it fails. Sometimes neutron reports vif plugged in under 10 seconds (test env), sometimes it takes 100 seconds and more. It seems there is some race condition, but I can't find out where the problem is. In the end every instance is spawned OK (the retry mechanism worked).
Another finding is that it has something to do with the security group: if the noop driver is used, everything works fine. The firewall security setup is openvswitch. The test env is Wallaby. I will attach some logs when I am near a PC.
Thank you, Michal Arbet (Kevko)
+ If I raise vif_plugged_timeout (hope I remember the name correctly) in nova to some high number, the problem disappears... but it's only a workaround.
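The option being referred to is most likely nova's vif_plugging_timeout (the name above is quoted from memory). A minimal nova.conf sketch of the workaround on the compute nodes, with illustrative values:

    [DEFAULT]
    # How long nova-compute waits for Neutron's network-vif-plugged event
    # before giving up (default is 300 seconds).
    vif_plugging_timeout = 600
    # If true (the default), hitting the timeout fails the instance build;
    # setting it to false hides the problem instead of extending the wait.
    vif_plugging_is_fatal = true

As noted, this only papers over the underlying race rather than fixing it.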
Hi, It seems it's the same issue as this bug on launchpad: https://bugs.launchpad.net/nova/+bug/1944779 Thanks, Kevko
How high did you have to raise it? If it does appear after X amount of time, then the VIF plug is not lost?
I hit the same problem, from time to time, not consistently. I am using OVN. Typically it takes no more than a few seconds for neutron to confirm the port is up. The default timeout in my setup is 600 s. Even though the port shows up in both the OVN SB and NB databases, nova-compute still doesn't get confirmation from neutron. Either neutron didn't pick it up or the message was lost and didn't get to nova-compute. Hoping someone could share more thoughts.
Thanks!
Tony
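For anyone trying to narrow this down on ML2/OVN, a few read-only checks show where the port state got stuck; the port ID below is a placeholder and column availability may vary slightly by OVN release:

    # What Neutron thinks of the port
    openstack port show <port-id> -c status -c binding_vif_type

    # Logical port state in the OVN northbound DB
    # (the logical switch port name is the Neutron port UUID)
    ovn-nbctl --columns=name,up list Logical_Switch_Port <port-id>

    # Whether the port is bound to a chassis in the southbound DB
    ovn-sbctl find Port_Binding logical_port=<port-id>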
On 11/24/21 1:21 AM, Tony Liu wrote:
I hit the same problem, from time to time, not consistently. I am using OVN. Typically, it takes no more than a few seconds for neutron to confirm the port is up. The default timeout in my setup is 600s. Even the ports shows up in both OVN SB and NB, nova-compute still didn't get confirmation from neutron. Either neutron didn't pick it up or the message was lost and didn't get to nova-compute. Hoping someone could share more thoughts.
That also may be a super-set of the revert-resize with OVS hybrid plug issue described in [0]. Even though the problems described in this topic may have nothing to do with that particular case, it does look related to the external events framework.

Issues like that make me think about some improvements to it.

[tl;dr] bring back up the idea of buffering events with a TTL

Like a new deferred RPC calls feature maybe? That would execute a call after some trigger, like "send unplug and forget". That would make debugging harder, but it would cover the cases where an external service "forgot" (an event was lost, and similar cases) to notify Nova when it is done.

Adding a queue to store events that Nova did not have a receive handler set for might help as well. It could have a TTL, or more advanced reaping logic, for example based on tombstone events invalidating the queue contents by causal conditions. That would eliminate the flaky expectations set around starting to wait for receiving events vs. sending unexpected or belated events. Why flaky? Because in an async distributed system there is no "before" nor "after", so a service external to Nova will be unlikely to conform to any time-frame-based contract of send-notify / wait-receive / real-completion-fact. And the fact that Nova can't tell what the network backend is (because [1] was not fully implemented) does not make things simpler.

As Sean noted in a private IRC conversation, with OVN the current implementation is not capable of fulfilling the contract that network-vif-plugged events are only sent after the interface is fully configured. It sends the events at bind time, once it has updated the logical port in the OVN DB, but before the real configuration has happened. I believe that deferred RPC calls and/or queued events might improve such "cheating" by making the real post-completion processing a thing for any backend?

[0] https://bugs.launchpad.net/nova/+bug/1952003
[1] https://specs.openstack.org/openstack/neutron-specs/specs/train/port-binding...
-- Best regards, Bogdan Dobrelya, Irc #bogdando
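A rough sketch of the buffering-with-TTL idea above, purely illustrative (nothing like this exists in Nova today): events that arrive before anyone is waiting for them are parked instead of dropped, and stale entries are reaped after the TTL expires.

    import time
    from collections import defaultdict

    class EventBuffer:
        """Illustrative buffer for external events that arrived early."""

        def __init__(self, ttl=300):
            self.ttl = ttl
            self._events = defaultdict(list)  # (name, tag) -> [arrival timestamps]

        def record(self, name, tag):
            # Called when an event arrives and no waiter is registered for it.
            self._events[(name, tag)].append(time.monotonic())

        def pop_if_fresh(self, name, tag):
            # Called when a waiter registers; returns True if a non-expired
            # event was already buffered, consuming it.
            self._reap()
            stamps = self._events.get((name, tag))
            if stamps:
                stamps.pop(0)
                if not stamps:
                    del self._events[(name, tag)]
                return True
            return False

        def _reap(self):
            cutoff = time.monotonic() - self.ttl
            for key in list(self._events):
                self._events[key] = [t for t in self._events[key] if t >= cutoff]
                if not self._events[key]:
                    del self._events[key]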
Hi guys,

I've been further investigating Michal's (OP) issue, since he is on holiday, and I've found out that the issue is not really plugging the VIF but the cleanup after previous port bindings.

We are creating 6 servers with 2-4 VIFs using a heat template [0]. We were hitting some problems with placement, so the stack sometimes failed to create and we had to delete it and recreate it. If we recreate it right after the deletion, the VIF plugging timeout occurs. If we wait some time (approx. 10 minutes), the stack is created successfully.

This leads me to believe that there is some issue with deferring the removal of security groups from unbound ports (somewhere around this part of the code [1]) and that it somehow affects the creation of new ports. However, I am unable to find any lock that could cause this behaviour. The only proof I have is that after the stack recreation scenario I have measured that the process_network_ports [2] function call could take up to 650 s (it varies from 5 s to 651 s in our environment).

Any idea what could be causing this?

[0] https://pastebin.com/infvj4ai
[1] https://github.com/openstack/neutron/blob/master/neutron/agent/firewall.py#L...
[2] https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers...

Jan Vondra
http://ultimum.io
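For what it's worth, one low-effort way to obtain timings like the 5 s to 651 s spread mentioned above, without a profiler, is to wrap the suspect call; this is only an illustration, not how the measurement above was actually taken:

    import functools
    import logging
    import time

    LOG = logging.getLogger(__name__)

    def log_duration(func):
        """Log how long each call takes, to spot outliers such as a slow
        process_network_ports iteration."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return func(*args, **kwargs)
            finally:
                LOG.debug("%s took %.1f s", func.__name__,
                          time.monotonic() - start)
        return wrapper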
Hello,

Have you already considered what Jan Vondra sent to this discussion? I am just making sure that it was read.

Thanks,

Michal Arbet
Openstack Engineer

Ultimum Technologies a.s.
Na Poříčí 1047/26, 11000 Praha 1
Czech Republic

+420 604 228 897
michal.arbet@ultimum.io
https://ultimum.io
LinkedIn <https://www.linkedin.com/company/ultimum-technologies> | Twitter <https://twitter.com/ultimumtech> | Facebook <https://www.facebook.com/ultimumtechnologies/timeline>
On Wed, 2021-11-24 at 11:05 +0100, Bogdan Dobrelya wrote:
On 11/24/21 1:21 AM, Tony Liu wrote:
I hit the same problem, from time to time, not consistently. I am using OVN. Typically, it takes no more than a few seconds for neutron to confirm the port is up. The default timeout in my setup is 600s. Even the ports shows up in both OVN SB and NB, nova-compute still didn't get confirmation from neutron. Either neutron didn't pick it up or the message was lost and didn't get to nova-compute. Hoping someone could share more thoughts.
That also may be a super-set of the revert-resize with OVS hybrid plug issue described in [0]. Even though the problems described in the topic may have nothing to that particular case, but does look related to the external events framework.
Issues like that make me thinking about some improvements to it.
[tl;dr] bring back up the idea of buffering events with a ttl
Like a new deferred RPC calls feature maybe? That would execute a call after some trigger, like send unplug and forget. That would make debugging harder, but cover the cases when an external service "forgot" (an event was lost and the like cases) to notify Nova when it is done.
Adding a queue to store events that Nova did not have a recieve handler set for might help as well. And have a TTL set on it, or a more advanced reaping logic, for example based on tombstone events invalidating the queue contents by causal conditions. That would eliminate flaky expectations set around starting to wait for receiving events vs sending unexpected or belated events. Why flaky? Because in an async distributed system there is no "before" nor "after", so an external to Nova service will unlikely conform to any time-frame based contract for send-notify/wait-receive/real-completion-fact. And the fact that Nova can't tell what the network backend is (because [1] was not fully implemented) does not make things simpler.
I honestly don't think this is a viable option. We have discussed it several times in nova in the past and keep coming to the same conclusion: either the events should be sent and waited for at the right times, or they lose their value. Buffering the events masks bad behaviour in non-compliant network backends, and it potentially exposes tenants and operators to security issues by breaking multi-tenancy https://bugs.launchpad.net/neutron/+bug/1734320 or network connectivity https://bugs.launchpad.net/nova/+bug/1815989.

Neutron sometimes sends the events earlier than we expect, and sometimes it sends multiple network-vif-plugged events for effectively the same operation. We recently "fixed" the fact that the DHCP agent would send a network-vif-plugged event during live migration, because the port was already configured and fully plugged on the source node while we were waiting for the event from the destination node's L2 agent: https://review.opendev.org/c/openstack/neutron/+/766277 However, that fix is config driven and nova cannot detect how it is set...

I disagree that in a distributed system like nova there is no before or after. We had a contract with neutron that several neutron ML2 plugins or out-of-tree core plugins did not comply with. When we add a VM interface to a network backend, we require neutron to notify us in a timely manner that the backend has processed the port and it is now safe to proceed. Several backends chose to violate that contract, including OVN, and as a result we have to try and make these broken backends work in nova when in fact we should not support them at all. The ODL community went to great effort to implement a websocket callback mechanism so that ODL could notify neutron when it had configured the port on the OVS bridge, and networking-odl then incorporated that into their ML2 driver: https://opendev.org/openstack/networking-odl/src/branch/master/networking_od... All of the in-tree plugins before OVN was merged in tree also implemented this protocol correctly, sending the event when the port provisioning on the network backend was complete. OVN, however, still marks the L2 provisioning as complete when the port status is set to up https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers... which gets called when the logical switch port is set to up https://github.com/openstack/neutron/blob/4e339776d90cf211396da5f95e29af6533...

But that does not fully address the issue, since move operations like live migration are not properly supported. https://review.opendev.org/c/openstack/neutron-specs/+/799198/6/specs/xena/o... should help, although I'm slightly dismayed to see that they will be using a new `port.vif_details` backend field to identify it as OVN instead of the previously agreed-on bound_drivers field https://specs.openstack.org/openstack/neutron-specs/specs/train/port-binding... If every ML2 driver and core plugin set this new backend field, it would be more or less the same as the bound_drivers field; however, I fear this will just get implemented for OVN since it's part of the OVN-specific spec, which will just create more tech debt, so I'm reluctant to suggest nova will use this info until it is done properly for all backends.
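As an aside for operators following along: since both ML2/OVS and ML2/OVN bind ports with vif_type "ovs", the backend (and the hybrid-plug behaviour discussed here) can only be guessed from the binding details, for example (port ID is a placeholder, admin credentials required):

    openstack port show <port-id> -c binding_vif_type -c binding_vif_details -c binding_profile
    # ML2/OVS with the iptables_hybrid firewall driver typically reports
    # "ovs_hybrid_plug": true in binding_vif_details; the openvswitch and
    # noop drivers, and ML2/OVN, report it as false or omit it.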
Adding a queue to store events that Nova did not have a recieve handler set for might help as well. And have a TTL set on it, or a more advanced reaping logic, for example based on tombstone events invalidating the queue contents by causal conditions. That would eliminate flaky expectations set around starting to wait for receiving events vs sending unexpected or belated events. Why flaky? Because in an async distributed system there is no "before" nor "after", so an external to Nova service will unlikely conform to any time-frame based contract for send-notify/wait-receive/real-completion-fact. And the fact that Nova can't tell what the network backend is (because [1] was not fully implemented) does not make things simpler.
i honestly dont think this is a viable option we have discussed it several times in nova in the past and keep coming to the same conclution either the event shoudl be sent and waited for at that right times or they loose there value.
Yep, agree with Sean here. I definitely don't think that making the system less deterministic is going to make things more reliable. This needs to be a contract between two services that we can depend on. Nova and Neutron both serve the purpose of abstracting the lower-level elements into a service. Making the different technologies they support behave similarly is the reason they exist. --Dan
On Wed, 2021-11-24 at 00:21 +0000, Tony Liu wrote:
I hit the same problem, from time to time, not consistently. I am using OVN. Typically, it takes no more than a few seconds for neutron to confirm the port is up. The default timeout in my setup is 600s. Even the ports shows up in both OVN SB and NB, nova-compute still didn't get confirmation from neutron. Either neutron didn't pick it up or the message was lost and didn't get to nova-compute. Hoping someone could share more thoughts.
There are some known bugs in this area. Basically every neutron backend behaves slightly differently with regard to how and when it sends the network-vif-plugged event, and this depends on many factors and changes from release to release.

For example, I'm pretty sure that in the past ML2/OVS used to send network-vif-plugged events for ports that are administratively disabled; since nova/os-vif still plugs those into the OVS bridge, we would expect them to be sent. However, that apparently has changed at some point, leading to https://bugs.launchpad.net/nova/+bug/1951623

ML2/OVN never sends network-vif-plugged events when the port is plugged; it cheats and sends them when the port is bound, and the exact rules for that have also changed over the last few releases. Nova has no way to discover this behaviour from neutron and we have to do our best to guess based on some attributes of the port. For example, as noted below, the firewall driver used with ML2/OVS makes a difference. If you use iptables_hybrid we use the hybrid-plug mechanism; that means the VM tap device is added to a linux bridge which is then connected to OVS with a veth pair. For move operations like live migration, the linux bridge and veth pair are created on the destination in pre-live-migration and nova waits for the event. Since we can't detect which security group driver is used from the port, we have to guess based on whether hybrid_plug=true in the port binding profile: for iptables, hybrid_plug is true; for the noop and openvswitch security group drivers, hybrid_plug is set to false.

https://review.opendev.org/c/openstack/nova/+/767368 attempted to account for the fact that network-vif-plugged would not be sent in the latter case in pre-live-migration, since at the time the VM interface was only plugged into OVS by libvirt during the migration.

https://review.opendev.org/c/openstack/nova/+/767368/1/nova/network/model.py...

    def get_live_migration_plug_time_events(self):
        """Returns a list of external events for any VIFs that have
        "plug-time" events during live migration.
        """
        return [('network-vif-plugged', vif['id'])
                for vif in self if vif.has_live_migration_plug_time_event]

https://review.opendev.org/c/openstack/nova/+/767368/1/nova/network/model.py...

    def has_live_migration_plug_time_event(self):
        """Returns whether this VIF's network-vif-plugged external event
        will be sent by Neutron at "plugtime" - in other words, as soon
        as neutron completes configuring the network backend.
        """
        return self.is_hybrid_plug_enabled()

What that code does is skip waiting for the network-vif-plugged event during live migration for all interfaces where hybrid_plug is false, which includes ML2/OVS with the noop or openvswitch security group driver, and ML2/OVN since it never sends them at the correct time. It turns out that to fix https://bugs.launchpad.net/nova/+bug/1951623 we should also be skipping waiting when the admin state on the port is disabled, by adding "and vif['active'] == 'active'" to the list comprehension. The code should also have additional knowledge of the network backend to make the right decisions; however, the bound_drivers field introduced by https://specs.openstack.org/openstack/neutron-specs/specs/train/port-binding... was never actually implemented in neutron, so neutron does not currently tell nova whether it is ML2/OVS, ML2/OVN or ML2/ODL. All of the above have vif_type OVS, so we can't re-enable waiting for the network-vif-plugged event when hybrid_plug is false and ML2/OVS is used, since while that would be correct for ML2/OVS it would break ML2/OVN; we are forced to support the least capable network backend in any situation.

Until this is fixed in nova and neutron, it's unlikely you will be able to address this in kolla in a meaningful way. Every time we skip waiting for a network-vif-plugged event in nova, when ideally there would be one as part of a move operation, we introduce a race between the VM starting on the destination host and the network backend completing its configuration. So simply setting [DEFAULT]/vif_plugging_is_fatal=False or [compute]/live_migration_wait_for_vif_plug=false risks the VM not having networking when configured.

https://docs.openstack.org/nova/latest/configuration/config.html#DEFAULT.vif...
https://docs.openstack.org/nova/latest/configuration/config.html#compute.liv...

These do provide ways for operators to work around some bugs, as does the recently added [workarounds]/wait_for_vif_plugged_event_during_hard_reboot option https://docs.openstack.org/nova/latest/configuration/config.html#workarounds... However, this should not be complexity that the operator has to understand and configure via kolla. We should fix the contract between nova and neutron, including requiring out-of-tree network vendors like Cisco ACI or other core plugins to actually conform to the interface, but after 5 years of trying to get this fixed it's still not done, and we just have to play the whack-a-mole game every time someone reports another edge case.

In this specific case I don't know why you are not getting the event, but for ML2/OVS both the L2 agent and the DHCP agent need to notify the neutron server that provisioning is complete, and apparently the port also now needs to be in admin state active/up before the network-vif-plugged event is sent. In the case where it fails, I would check the DHCP agent log, the L2 agent log and the neutron server log and try to see whether one or both of the L2/DHCP agents failed to provision the port. I would guess it's the DHCP agent, given it works on the retry to the next host.

regards
sean
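For completeness, the knobs Sean refers to look roughly like the following in nova.conf; values are illustrative, and as he warns, loosening them trades a visible timeout for a silent race where the guest may start before networking is ready:

    [DEFAULT]
    # Continue the build even if the network-vif-plugged event never arrives.
    vif_plugging_is_fatal = false
    vif_plugging_timeout = 300

    [compute]
    # Do not block live migration waiting for the event.
    live_migration_wait_for_vif_plug = false

    [workarounds]
    # Recent releases can optionally wait for events on hard reboot as well, e.g.:
    # wait_for_vif_plugged_event_during_hard_reboot = network-vif-plugged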
Hi,

Basically, in the ML2/OVS case there are two likely reasons why a port isn't provisioned quickly:
- neutron-ovs-agent is somehow slow provisioning it, or
- neutron-dhcp-agent is slow provisioning that port.

To check which of those is really happening, you can enable debug logs on your neutron-server and look for log lines like "Port xxx provisioning completed by entity L2/DHCP" (or something similar, I don't remember the exact wording now).

If it works much faster with the noop firewall driver, then it is more likely to be on the neutron-ovs-agent's side. In that case, a couple of things to check:
- are you using l2population (it's required with DVR, for example)?
- are you using security groups with rules that reference "remote_group_id" (like the default SG of each tenant does)? If so, can you try to remove such rules from your SG and use rules with CIDRs instead? We know that SGs using remote_group_id don't scale well, and if you have many ports using the same SG it can slow down neutron-ovs-agent a lot.
- do you have any other errors in the neutron-ovs-agent logs, like RPC communication errors or something else? Such errors trigger a full sync of all ports on the node, so it may sometimes take a long time to get to actually provisioning your new port.
- what exact version of Neutron are you using there?
--
Slawek Kaplonski
Principal Software Engineer
Red Hat
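Acting on Slawek's suggestions might look roughly like this; the exact log wording and file path differ between releases and deployments, and the rule below is only an example of replacing a remote_group_id reference with a CIDR:

    # With debug logging enabled on neutron-server, see which entity (L2 vs DHCP)
    # is slow to report the port as provisioned:
    grep -i "provisioning complete" /var/log/neutron/server.log | grep <port-id>

    # Example security group rule using a CIDR instead of a remote group:
    openstack security group rule create --ingress --protocol tcp --dst-port 22 \
        --remote-ip 10.0.0.0/24 <security-group>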
Hi,

You can find logs from controller0 and compute0 in the attachment (the other controllers and computes were turned off for this test).

Thank you,

Michal Arbet
Openstack Engineer

Ultimum Technologies a.s.
Na Poříčí 1047/26, 11000 Praha 1
Czech Republic

+420 604 228 897
michal.arbet@ultimum.io
https://ultimum.io
LinkedIn <https://www.linkedin.com/company/ultimum-technologies> | Twitter <https://twitter.com/ultimumtech> | Facebook <https://www.facebook.com/ultimumtechnologies/timeline>
participants (8)
- Bogdan Dobrelya
- Dan Smith
- Jan Vondra
- Laurent Dumont
- Michal Arbet
- Sean Mooney
- Slawek Kaplonski
- Tony Liu