---- On Thu, 24 Jan 2019 07:09:26 +0900 Terry Wilson <twilson@redhat.com> wrote ----
In the networking-ovn project, we hit this bug *very* often: https://bugs.launchpad.net/tempest/+bug/1728600. You can see the logstash here where it has failed 330 times in the last week: http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A...
The bug has been around since 2017, and there are even earlier reports of it. It also happens in some projects outside of networking-ovn.
At the core of the issue is that _get_server_port_id_and_ip4 loops through server ports to return ones that are ACTIVE, but there is a race where a port could become temporarily inactive if the ml2 driver continually monitors the actual port status. In the case we hit, os-vif started recreating the ovs port during an operation, so we would detect the status of the port as down and change the status, and then when the port is recreated we set the port status back to up. If the check happens while the port is down, the test fails.
But is it by design, or a bug, that an Active port on an Active VM can flip to down? Waiting for an already active and bound port to become active again after we got the Active server is not the right thing to test. As Sean also pointed out on the patch, we should go for the approach of "making sure all attached interface to server is active, server is sshable before server can be used in test" [1]. This is something we agreed on at the Denver PTG for afazekas' proposal [2]. From the user's perspective, a user can have an Active VM with an active port which can flip to down while that port is in use. This seems like a bug to me.
[1] - https://review.openstack.org/#/c/600046
[2] - https://etherpad.openstack.org/p/handling-of-interface-attach-detach-hotplug...
I have also commented on the patch; sorry for the delay in reviewing it.
-gmann
There have been comments that the port status shouldn't flip w/o any user request that would cause it, but that would mean that a plugin/driver would have to ignore the actual status of a port and that seems wrong. External things can affect what state a port is in.
https://review.openstack.org/#/c/449695/7/tempest/scenario/manager.py adds a wait mechanism to the port status check so that momentary flips of port status do not cause the test to fail spuriously. The patch currently has 10 +1s. We really need to get this fixed.
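To make the idea concrete, here is a minimal Python sketch of that kind of wait. It is not the code from the patch and does not use the real tempest client: wait_for_active_ports, list_ports and make_flapping_lister are hypothetical names made up for illustration, and ports are assumed to be plain dicts with a 'status' key.

```python
import time


def wait_for_active_ports(list_ports, timeout=60.0, interval=1.0,
                          clock=time.monotonic, sleep=time.sleep):
    """Poll until at least one port is ACTIVE, or raise on timeout.

    list_ports is a hypothetical zero-argument callable standing in for
    whatever lists the server's ports; it is assumed to return a list of
    dicts that each carry a 'status' key.
    """
    deadline = clock() + timeout
    while True:
        active = [p for p in list_ports() if p.get("status") == "ACTIVE"]
        if active:
            return active
        if clock() >= deadline:
            raise TimeoutError(
                "no ACTIVE port found within %s seconds" % timeout)
        sleep(interval)


def make_flapping_lister(statuses):
    """Build a fake list_ports that reports each status in turn.

    Once the sequence is exhausted, the last status is repeated. This
    only simulates a port that is briefly DOWN while it is recreated.
    """
    remaining = iter(statuses)
    last = [None]

    def list_ports():
        last[0] = next(remaining, last[0])
        return [{"id": "port-1", "status": last[0]}]

    return list_ports


if __name__ == "__main__":
    # The port is seen as DOWN twice, then comes back as ACTIVE.
    lister = make_flapping_lister(["DOWN", "DOWN", "ACTIVE"])
    print(wait_for_active_ports(lister, timeout=5, interval=0.1))
```

A single check against the first DOWN reading would fail here; the polling version only fails if the port stays down for the whole timeout, which is the behaviour the test needs when the status flip is momentary.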
Thanks! Terry