Open Stack

Thu Dec 19 03:21:58 UTC 2013

Hi,

Yair and I were discussing a change that I initiated and was 
incorporated into the test_network_basic_ops test. It was intended as a 
configuration control point for floating IP address assignments before 
actually testing connectivity. The question we were discussing was
whether this check is a valid pass/fail criterion for tests like
test_network_basic_ops.

The initial motivation for the change was that test_network_basic_ops 
had a less than 50/50 chance of passing in my local environment for 
whatever reason. After looking at the test, it seemed ridiculous that it 
should be failing. More often than not, the data available in the logs
all pointed to everything being set up correctly, yet the ping test for
connectivity was timing out. From the logs it wasn't clear whether the
test was failing because neutron did not do the right thing, did not do
it fast enough, or because something else was happening. Of course, if I
paused the test for a short bit between setup and the checks to manually
verify everything, the checks always passed. So it's a timing issue,
right?

Two things: adding more timeout to a check is as appealing to me as
gargling glass, AND I was less "annoyed" that the test was failing than
I was that it wasn't clear from reading the logs what had gone wrong. I
tried to find an additional intermediate control point that would
"split" failure modes into two categories: neutron is too slow in
setting things up, and neutron failed to set things up correctly.
Granted, this still adds timeout to the test, but the hope was to find a
control point based on "settling" so that if it passed, there is a good
chance that a failure in the next check means neutron actually screwed
up what it was trying to do.
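
To make the "split" idea concrete, here is a rough sketch of a two-stage
check. It is not the tempest code; wait_for, classify, and the two
predicates are hypothetical names used only for illustration, and the
clock and sleep parameters exist just so the loop can be exercised
without real waiting.

```python
import time


def wait_for(predicate, timeout, interval,
             clock=time.monotonic, sleep=time.sleep):
    """Poll predicate() until it returns True or `timeout` seconds pass."""
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def classify(settled, reachable, settle_timeout, ping_timeout, interval,
             clock=time.monotonic, sleep=time.sleep):
    """Run the settling check first, then the connectivity check.

    `settled` and `reachable` are hypothetical zero-argument callables
    supplied by the test (for example, "floating IP is visible" and
    "ping succeeds").
    """
    if not wait_for(settled, settle_timeout, interval, clock, sleep):
        # Setup was too slow or never completed; connectivity not tried.
        return "not_settled"
    if not wait_for(reachable, ping_timeout, interval, clock, sleep):
        # Setup looked complete, so the failure is more likely a real bug.
        return "settled_but_unreachable"
    return "ok"
```

The point of the ordering is only that a failure in the second stage is
easier to interpret once the first stage has passed; it does not prove
the backend is finished.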

Waiting until a query to nova returns the floating IP information
seemed a relatively reasonable, if imperfect, "settling" criterion
before attempting to connect to the VM. Testing whether the floating IP
assignment gets to the nova instance details is a valid test and,
AFAICT, missing from the current tests. However, Yair has the reasonable
point that connectivity is often available long before the floating IP
appears in the nova results and that it could be considered invalid to
use non-network-specific criteria as pass/fail for this test.
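
For reference, the kind of check being discussed looks roughly like the
sketch below. It assumes the server details expose an "addresses" map
whose entries carry an "OS-EXT-IPS:type" attribute, as the nova extended
IPs extension does; the function name and the sample data are made up
for illustration and are not taken from the test itself.

```python
def floating_ips(server):
    """Return the floating addresses listed in a server details dict."""
    found = []
    for entries in server.get("addresses", {}).values():
        for entry in entries:
            # Assumption: floating addresses are tagged with this attribute.
            if entry.get("OS-EXT-IPS:type") == "floating":
                found.append(entry["addr"])
    return found


# Hypothetical server details, shaped like a nova API response.
sample_server = {
    "addresses": {
        "private": [
            {"addr": "10.0.0.2", "version": 4, "OS-EXT-IPS:type": "fixed"},
            {"addr": "172.24.4.3", "version": 4,
             "OS-EXT-IPS:type": "floating"},
        ]
    }
}

print("172.24.4.3" in floating_ips(sample_server))
```

A settling check would poll the server details until the expected
address shows up in this list, or give up after a timeout.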

In general, the validity of checking for the presence of a floating IP 
in the server details is a matter of interpretation. I think it is a 
given that it must be tested somewhere and that if it causes a test to 
fail then it is as valid a failure as a ping failing. Certainly I have
seen scenarios where an IP appears but doesn't actually work, and others
where the IP doesn't appear (ever, not just after a really long while)
but magically works. Both are bugs. Which is more appropriate for tests
like test_network_basic_ops?

Separately, the polling interval for the checks in the gate should be
tuned. They borrow polling configuration from elsewhere, and I can see
that is ill-advised. The check currently polls at an interval of one
second, and if the intent is to wait for the entire system to settle
down before proceeding, then polling nova that often is too much. It
simply increases the load while we are waiting for an already loaded
system to catch up. For example, in the course of a three-minute
timeout, the floating IP check polled nova for server details 180
times.
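
The arithmetic behind that number, and what a longer interval would buy,
can be sketched as follows. The function is a hypothetical model of a
fixed-interval poll loop that never succeeds, not the actual test code,
and the 5 and 10 second intervals are only examples.

```python
def count_polls(timeout, interval):
    """Count queries issued by a fixed-interval loop that never succeeds."""
    elapsed = 0
    polls = 0
    while elapsed < timeout:
        polls += 1           # one request for server details
        elapsed += interval  # then wait for the polling interval
    return polls


for interval in (1, 5, 10):
    print(interval, count_polls(180, interval))
# 1 180
# 5 36
# 10 18
```

This ignores the time each request takes, so real counts would be
somewhat lower, but it shows how directly the interval drives the load.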

All this aside, it is granted that checking for the floating IP in the
nova instance details is imperfect in itself. Nothing assures that the
presence of that information indicates that the networking backend has
finished its work.

Comments, suggestions, queries, foam bricks?

Cheers,

Brent

[openstack-dev] [neutron][qa] test_network_basic_ops and the "FloatingIPChecker" control point