[qa][tempest] Waiting for interface status == ACTIVE before checking status
In the networking-ovn project, we hit this bug *very* often: https://bugs.launchpad.net/tempest/+bug/1728600. You can see the logstash here, where it has failed 330 times in the last week: http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A...

The bug has been around since 2017, and there are earlier reports of it than that. The bug happens in some projects outside of networking-ovn as well.

At the core of the issue is that _get_server_port_id_and_ip4 loops through server ports to return the ones that are ACTIVE, but there is a race where a port can become temporarily inactive if the ml2 driver continually monitors the actual port status. In the case we hit, os-vif started recreating the OVS port during an operation, so we would detect the status of the port as down and change the status, and then when the port was recreated we would set the port status back to up. If the check happens while the port is down, the test fails.

There have been comments that the port status shouldn't flip w/o any user request that would cause it, but that would mean that a plugin/driver would have to ignore the actual status of a port, and that seems wrong. External things can affect what state a port is in.

https://review.openstack.org/#/c/449695/7/tempest/scenario/manager.py adds a wait mechanism to the port status check so that momentary flips of port status will not cause the test to inadvertently fail. The patch currently has 10 +1s. We really need to get this fixed.

Thanks!
Terry
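For illustration, a minimal sketch of the kind of wait being proposed. The helper name, client interface, and timeouts here are assumptions rather than the actual tempest change, though tempest-style ports clients do expose show_port:

    import time

    def wait_for_port_status(ports_client, port_id, status='ACTIVE',
                             timeout=60, interval=2):
        # Poll the port until it reports the desired status, tolerating
        # momentary flips (e.g. os-vif recreating the OVS port) instead
        # of failing on the first non-ACTIVE read.
        deadline = time.time() + timeout
        while time.time() < deadline:
            port = ports_client.show_port(port_id)['port']
            if port['status'] == status:
                return port
            time.sleep(interval)
        raise TimeoutError('port %s did not reach %s within %s seconds'
                           % (port_id, status, timeout))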
---- On Thu, 24 Jan 2019 07:09:26 +0900 Terry Wilson <twilson@redhat.com> wrote ----
In the networking-ovn project, we hit this bug *very* often: https://bugs.launchpad.net/tempest/+bug/1728600. You can see the logstash here where it has failed 330 times in the last week: http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A...
The bug has been around since 2017, and there are earlier reports of it than that. The bug happens in some projects outside of networking-ovn as well.
At the core of the issue is that _get_server_port_id_and_ip4 loops through server ports to return the ones that are ACTIVE, but there is a race where a port can become temporarily inactive if the ml2 driver continually monitors the actual port status. In the case we hit, os-vif started recreating the OVS port during an operation, so we would detect the status of the port as down and change the status, and then when the port was recreated we would set the port status back to up. If the check happens while the port is down, the test fails.
But is this by design, or a bug that an ACTIVE port on an ACTIVE VM can flip to down? Waiting for an already active and bound port to become active again after we got the ACTIVE server is not the right thing to test. As Sean also pointed out in the patch, we should go for the approach of "making sure all attached interfaces on the server are active and the server is sshable before the server can be used in a test" [1]. This is something we agreed on at the Denver PTG for afazekas's proposal [2]. If we look at it from the user perspective, a user can have an ACTIVE VM with an active port which can flip to down in the middle of using that port. That seems like a bug to me.

[1] - https://review.openstack.org/#/c/600046
[2] - https://etherpad.openstack.org/p/handling-of-interface-attach-detach-hotplug...

I have also commented on the patch; sorry for delaying the review on that.

-gmann
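As a rough illustration of the "sshable before use" idea, here is a plain TCP reachability poll. This is not tempest's actual validation helper (tempest does a real SSH login); the function name and timeouts are assumptions:

    import socket
    import time

    def wait_for_ssh_reachable(host, port=22, timeout=120, interval=5):
        # Consider the server usable only once something is accepting
        # TCP connections on the SSH port; raise if the deadline passes.
        deadline = time.time() + timeout
        while time.time() < deadline:
            try:
                with socket.create_connection((host, port), timeout=interval):
                    return
            except OSError:
                time.sleep(interval)
        raise TimeoutError('%s:%d never became reachable' % (host, port))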
There have been comments that the port status shouldn't flip w/o any user request that would cause it, but that would mean that a plugin/driver would have to ignore the actual status of a port and that seems wrong. External things can affect what state a port is in.
https://review.openstack.org/#/c/449695/7/tempest/scenario/manager.py adds a wait mechanism to the port status check so that momentary flips of port status will not cause the test to inadvertently fail. The patch currently has 10 +1s. We really need to get this fixed.
Thanks! Terry
On Fri, 2019-01-25 at 11:27 -0500, Jay Pipes wrote:
On 01/24/2019 08:27 PM, Ghanshyam Mann wrote:
If we look at it from the user perspective, a user can have an ACTIVE VM with an active port which can flip to down in the middle of using that port. That seems like a bug to me.
Agreed, Ghanshyam.

So, as this bug states (https://bugs.launchpad.net/neutron/+bug/1672629), if admin-state-up is False then the nova port status should be down even if the VM is active.
It may also be true that if the data-plane-status extension is used (https://specs.openstack.org/openstack/neutron-specs/specs/pike/port-data-pla...) the port status might change to down when the data plane status is marked as down, but I'm not sure about that. They are meant to be independent, but it's a little confusing.
-jay
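For reference, flipping admin_state_up is the one user-visible action being discussed that should drive a bound port's status down. A hedged sketch with openstacksdk; the cloud name and port name are placeholders:

    # 'devstack' and 'my-port' are placeholders; the expected DOWN
    # status is what the bug report argues *should* happen, not
    # necessarily what every driver does today.
    import openstack

    conn = openstack.connect(cloud='devstack')
    port = conn.network.find_port('my-port')
    conn.network.update_port(port, admin_state_up=False)
    print(conn.network.get_port(port.id).status)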
On Thu, Jan 24, 2019 at 7:34 PM Ghanshyam Mann <gmann@ghanshyammann.com> wrote:
As Sean also pointed out in the patch, we should go for the approach of "making sure all attached interfaces on the server are active and the server is sshable before the server can be used in a test" [1]. This is something we agreed on at the Denver PTG for afazekas's proposal [2].
If we look at it from the user perspective, a user can have an ACTIVE VM with an active port which can flip to down in the middle of using that port. That seems like a bug to me.
To me, this ignores real-world situations where a port status *can* change w/o user interaction. It seems weird to ignore a status change if it is detected. In the case that we hit, it was a change to os-vif where it was recreating a port. But it could just as easily be some vendor-specific "that port just died" kind of thing. Why not update the status of the port if you know it has changed? Also, the patch itself (outside the ironic case) just adds a window for the status to bounce.
On 01/25/2019 12:04 PM, Terry Wilson wrote:
On Thu, Jan 24, 2019 at 7:34 PM Ghanshyam Mann <gmann@ghanshyammann.com> wrote:
As Sean also pointed out in the patch, we should go for the approach of "making sure all attached interfaces on the server are active and the server is sshable before the server can be used in a test" [1]. This is something we agreed on at the Denver PTG for afazekas's proposal [2].
If we look at it from the user perspective, a user can have an ACTIVE VM with an active port which can flip to down in the middle of using that port. That seems like a bug to me.
To me, this ignores real-world situations where a port status *can* change w/o user interaction.
How is this ignoring that scenario?
It seems weird to ignore a status change if it is detected. In the case that we hit, it was a change to os-vif where it was recreating a port.
Which was a bug, right?
But it could just as easily be some vendor-specific "that port just died" kind of thing.
In which case, the test waiting for SSH to be available would timeout because connectivity would be broken anyway, no?
Why not update the status of the port if you know it has changed?
Sorry, I don't see where anyone is suggesting not changing the status of the port if some non-bug real scenario changes the status of the port?
Also, the patch itself (outside the ironic case) just adds a window for the status to bounce.
Unless I'm mistaken, the patch is simply changing the condition that the tempest test uses to identify broken VM connectivity. It will use the SSH connectivity test instead of looking at the port status test. The SSH test was determined to be a more stable test of VM network connectivity than relying on the Neutron port status indicator which can be a little flaky. Or am I missing something? -jay
On 01/25/2019 12:04 PM, Terry Wilson wrote:
On Thu, Jan 24, 2019 at 7:34 PM Ghanshyam Mann <gmann@ghanshyammann.com> wrote:
As Sean also pointed out in the patch, we should go for the approach of "making sure all attached interfaces on the server are active and the server is sshable before the server can be used in a test" [1]. This is something we agreed on at the Denver PTG for afazekas's proposal [2].
If we look at it from the user perspective, a user can have an ACTIVE VM with an active port which can flip to down in the middle of using that port. That seems like a bug to me.
To me, this ignores real-world situations where a port status *can* change w/o user interaction.
How is this ignoring that scenario?
On Fri, 2019-01-25 at 12:26 -0500, Jay Pipes wrote:

The only case I know of for certain would be if the admin state is down, which should not prevent the VM from booting, but neutron should not allow network connectivity in this case.
It seems weird to ignore a status change if it is detected. In the case that we hit, it was a change to os-vif where it was recreating a port.
Which was a bug, right?
Yes, kind of. We could have fixed it by merging the nova change I had or by reverting the os-vif change. I reverted the os-vif change, as the nova change was hitting a different bug in neutron. But only one entity, os-vif or the hypervisor, should have been creating the port on OVS, so it was a bug when both were.
But it could just as easily be some vendor-specific "that port just died" kind of thing.
In which case, the test waiting for SSH to be available would timeout because connectivity would be broken anyway, no?
If it did not recover, yes, it would.
Why not update the status of the port if you know it has changed?
Sorry, I don't see where anyone is suggesting not changing the status of the port if some non-bug real scenario changes the status of the port?
Also, the patch itself (outside the ironic case) just adds a window for the status to bounce.
Unless I'm mistaken, the patch is simply changing the condition that the tempest test uses to identify broken VM connectivity. It will use the SSH connectivity test instead of looking at the port status test.
The SSH test was determined to be a more stable test of VM network connectivity than relying on the Neutron port status indicator which can be a little flaky.
SSH is more reliable for hotplug, as we need to wait for the guest OS to process the hotplug event. Waiting for the VM to be pingable or sshable is more reliable in that specific case. The port status being active simply means that the port is currently configured by neutron; that gives you no knowledge of whether the guest has processed the hotplug event. In general, I'm not sure if SSH connectivity would be more reliable, but if that is what the test requires in order to work, it is better to explicitly validate it than to use the port status as a proxy.
Or am I missing something?
It's a valid question. I think port status and VM connectivity are two different things. If you are writing an API test, then port status should be sufficient. If you need to connect to the VM in any way, it becomes a scenario test, in which case waiting for sshable or pingable might be more suitable. Not sure if I answered your question, however.
-jay
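That distinction suggests a two-stage readiness check for scenario tests. A hedged sketch, reusing the wait_for_port_status and wait_for_ssh_reachable sketches from earlier in the thread (names and signatures are assumptions):

    def wait_until_server_usable(ports_client, server_ip, port_ids):
        # Stage 1: control-plane view -- neutron reports every attached
        # port as ACTIVE (sufficient for an API test).
        for port_id in port_ids:
            wait_for_port_status(ports_client, port_id)
        # Stage 2: data-plane view -- the guest has processed the hotplug
        # event and answers on the SSH port (needed for a scenario test).
        wait_for_ssh_reachable(server_ip)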
---- On Sat, 26 Jan 2019 03:17:44 +0900 Sean Mooney <smooney@redhat.com> wrote ----
On 01/25/2019 12:04 PM, Terry Wilson wrote:
On Thu, Jan 24, 2019 at 7:34 PM Ghanshyam Mann <gmann@ghanshyammann.com> wrote:
As Sean also pointed out in the patch, we should go for the approach of "making sure all attached interfaces on the server are active and the server is sshable before the server can be used in a test" [1]. This is something we agreed on at the Denver PTG for afazekas's proposal [2].
If we look at it from the user perspective, a user can have an ACTIVE VM with an active port which can flip to down in the middle of using that port. That seems like a bug to me.
To me, this ignores real-world situations where a port status *can* change w/o user interaction.
How is this ignoring that scenario?
On Fri, 2019-01-25 at 12:26 -0500, Jay Pipes wrote:

The only case I know of for certain would be if the admin state is down, which should not prevent the VM from booting, but neutron should not allow network connectivity in this case.
Can it happen in the middle of connectivity as well? I mean, when the VM is active and SSHable, can a down admin state cause the port to become down?
It seems weird to ignore a status change if it is detected. In the case that we hit, it was a change to os-vif where it was recreating a port.
Which was a bug, right?
Yes, kind of. We could have fixed it by merging the nova change I had or by reverting the os-vif change. I reverted the os-vif change, as the nova change was hitting a different bug in neutron. But only one entity, os-vif or the hypervisor, should have been creating the port on OVS, so it was a bug when both were.
But it could just as easily be some vendor-specific "that port just died" kind of thing.
In which case, the test waiting for SSH to be available would timeout because connectivity would be broken anyway, no?
If it did not recover, yes, it would.
Why not update the status of the port if you know it has changed?
Sorry, I don't see where anyone is suggesting not changing the status of the port if some non-bug real scenario changes the status of the port?
Also, the patch itself (outside the ironic case) just adds a window for the status to bounce.
Unless I'm mistaken, the patch is simply changing the condition that the tempest test uses to identify broken VM connectivity. It will use the SSH connectivity test instead of looking at the port status test.
The SSH test was determined to be a more stable test of VM network connectivity than relying on the Neutron port status indicator which can be a little flaky.
SSH is more reliable for hotplug, as we need to wait for the guest OS to process the hotplug event. Waiting for the VM to be pingable or sshable is more reliable in that specific case. The port status being active simply means that the port is currently configured by neutron; that gives you no knowledge of whether the guest has processed the hotplug event.
+1. I agree on the hotplug event case, and yes, a Tempest test should make the test VM usable for the test only after sshable/pingable succeeds. afazekas updated a few tests for that, and it is a reasonable thing to do.
In general, I'm not sure if SSH connectivity would be more reliable, but if that is what the test requires in order to work, it is better to explicitly validate it than to use the port status as a proxy.
Or am I missing something?
It's a valid question. I think port status and VM connectivity are two different things.
If you are writing an API test, then port status should be sufficient. If you need to connect to the VM in any way, it becomes a scenario test, in which case waiting for sshable or pingable might be more suitable.
Yeah, scenario tests expect end-to-end connectivity, internal/external to tenants. Tempest API tests rarely do the SSH verification.

-gmann
Not sure if I answered your question, however.
-jay
participants (4)
- Ghanshyam Mann
- Jay Pipes
- Sean Mooney
- Terry Wilson