[openstack-dev] [neutron][qa] test_network_basic_ops and the "FloatingIPChecker" control point
yfried at redhat.com
Thu Dec 19 08:31:38 UTC 2013
I ran into this issue while trying to incorporate this check into the cross_tenant_connectivity scenario:
launching 2 VMs in different tenants
What I saw is that in the gate it fails about half the time (the original test passes without issues), and ONLY on the 2nd VM (the first FLIP propagates fine).
I don't see this in:
1. my local RHOS-Havana setup
2. the cross_tenant_connectivity scenario without the control point (test passes without issues)
3. test_network_basic_ops runs in the gate
So here's my somewhat less experienced opinion:
1. this happens due to stress (more than a single FLIP/VM)
2. (as Brent said) the interval between polls is too short
3. The FLIP is usually reachable long before it shows up in the nova DB (also in my manual experience), so blocking the test until it reaches the nova DB doesn't make sense to me. If we could do this in a different thread, then maybe, but using a pass/fail criterion to test for a timing issue seems wrong, especially since, as I understand it, the issue is not IF it reaches the nova DB, only WHEN.
I would like to, at least, move this check from its place as a blocker to later in the test. Before this is done, I would like to know if anyone else has seen the same problems Brent describes prior to this patch being merged.
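A minimal sketch of the "different thread" idea, assuming two hypothetical callables that are not part of Tempest: `can_ping()` returns True when the VM answers on its floating IP, and `wait_for_floater_in_details()` polls the nova server details and returns True once the floating IP is listed. The nova-side check runs in the background, so it no longer blocks the connectivity check, but it can still fail the test.

```python
from concurrent.futures import ThreadPoolExecutor


def check_connectivity_then_details(can_ping, wait_for_floater_in_details):
    """Run the nova-side check in the background and assert on it last.

    Both arguments are hypothetical callables supplied by the test.
    """
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Start polling the server details without blocking the ping check.
        details_future = pool.submit(wait_for_floater_in_details)

        # The connectivity check is no longer gated on the nova DB.
        assert can_ping(), "VM is not reachable through its floating IP"

        # Only now wait for the nova-side result.
        assert details_future.result(), (
            "floating IP never appeared in the server details")
```

With this ordering, a ping failure and a missing entry in the server details show up as two distinct assertion messages.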
Regarding Jay's scenario suggestion, I think this should not be a part of network_basic_ops, but rather a separate stress scenario creating multiple VMs and testing for FLIP associations and propagation time.
(Also added my comments inline)
----- Original Message -----
From: "Jay Pipes" <jaypipes at gmail.com>
To: openstack-dev at lists.openstack.org
Sent: Thursday, December 19, 2013 5:54:29 AM
Subject: Re: [openstack-dev] [neutron][qa] test_network_basic_ops and the "FloatingIPChecker" control point
On 12/18/2013 10:21 PM, Brent Eagles wrote:
> Yair and I were discussing a change that I initiated and was
> incorporated into the test_network_basic_ops test. It was intended as a
> configuration control point for floating IP address assignments before
> actually testing connectivity. The question we were discussing was
> whether this check was a valid pass/fail criteria for tests like
> The initial motivation for the change was that test_network_basic_ops
> had a less than 50/50 chance of passing in my local environment for
> whatever reason. After looking at the test, it seemed ridiculous that it
> should be failing. The problem is that more often than not the data that
> was available in the logs all pointed to it being set up correctly but
> the ping test for connectivity was timing out. From the logs it wasn't
> clear that the test was failing because neutron did not do the right
> thing, did not do it fast enough, or is something else happening? Of
> course if I paused the test for a short bit between setup and the checks
> to manually verify everything the checks always passed. So it's a timing
> issue right?
Did anyone else experience this issue, locally or in the gate?
> Two things: adding more timeout to a check is as appealing to me as
> gargling glass AND I was less "annoyed" that the test was failing as I
> was that it wasn't clear from reading logs what had gone wrong. I tried
> to find an additional intermediate control point that would "split"
> failure modes into two categories: neutron is too slow in setting things
> up and neutron failed to set things up correctly. Granted it still is
> adding timeout to the test, but if I could find a control point based on
> "settling" so that if it passed, then there is a good chance that if the
> next check failed it was because neutron actually screwed up what it was
> trying to do.
> Waiting until the query on the nova for the floating IP information
> seemed a relatively reasonable, if imperfect, "settling" criteria before
> attempting to connect to the VM. Testing to see if the floating IP
> assignment gets to the nova instance details is a valid test and,
> AFAICT, missing from the current tests. However, Yair has the reasonable
> point that connectivity is often available long before the floating IP
> appears in the nova results and that it could be considered invalid to
> use non-network specific criteria as pass/fail for this test.
But, Tempest is all about functional integration testing. Using a call
to Nova's server details to determine whether a dependent call to
Neutron succeeded (setting up the floating IP) is exactly what I think
Tempest is all about. It's validating that the integration between Nova
and Neutron is working as expected.
So, I actually think the assertion on the floating IP address appearing
(after some timeout/timeout-backoff) is entirely appropriate.
Blocking the connectivity check until the DB is updated doesn't make sense to me, since we know the FLIP is reachable well before the nova DB is updated (this is seen in manual testing too, not just at automation speeds).
> In general, the validity of checking for the presence of a floating IP
> in the server details is a matter of interpretation. I think it is a
> given that it must be tested somewhere and that if it causes a test to
> fail then it is as valid a failure than a ping failing. Certainly I have
> seen scenarios where an IP appears, but doesn't actually work and others
> where the IP doesn't appear (ever, not just in really long while) but
> magically works. Both are bugs. Which is more appropriate to tests like
I believe both assertions should be part of the test cases, but since
the latter condition (good ping connectivity, but no floater ever
appears attached to the instance) necessarily depends on the first
failure (floating IP does not appear in the server details after a
timeout), then perhaps one way to handle this would be to do this:
a) create server instance
b) assign floating ip
c) query server details looking for floater in a timeout-backoff loop
  c1) floater does appear
    c1-a) assert ping connectivity
  c2) floater does not appear
    c2-a) check ping connectivity. if ping connectivity succeeds, use a
    call to testtools.TestCase.addDetail() to provide some "interesting"
    c2-b) raise assertion that floater did not appear in the server details
IMO this should be a different stress scenario and not part of network_basic_ops.
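For reference, steps c) through c2-b) above could be sketched roughly as follows. This is only an illustration under stated assumptions: `floater_in_server_details`, `can_ping` and `add_detail` are hypothetical callables supplied by the test, and the backoff parameters are made up. In a real testtools test case, `add_detail` would presumably wrap `addDetail()` with a content object.

```python
import time


def wait_for(predicate, timeout, initial_interval=1.0, factor=2.0,
             max_interval=30.0, sleep=time.sleep, clock=time.monotonic):
    """Poll predicate() with exponential backoff until true or timed out."""
    deadline = clock() + timeout
    interval = initial_interval
    while True:
        if predicate():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        # Never sleep past the deadline; grow the interval up to the cap.
        sleep(min(interval, remaining))
        interval = min(interval * factor, max_interval)


def check_floating_ip(floater_in_server_details, can_ping, add_detail,
                      timeout):
    """Steps c) to c2-b); all three callables are hypothetical."""
    if wait_for(floater_in_server_details, timeout):
        # c1-a) floater appeared: connectivity must work as well.
        assert can_ping(), "floating IP is listed but not reachable"
        return
    # c2-a) floater never appeared: record whether ping works anyway.
    if can_ping():
        add_detail("floating-ip",
                   "ping succeeded, but the floating IP never appeared "
                   "in the server details")
    # c2-b) fail in either case.
    raise AssertionError(
        "floating IP did not appear in the server details")
```

The `sleep` and `clock` parameters exist only so the loop can be exercised with a fake clock instead of waiting in real time.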
> Currently, the polling interval for the checks in the gate should be
> tuned. They are borrowing other polling configuration and I can see it
> is ill-advised. It is currently polling at an interval of a second and
> if the intent is to wait for the entire system to settle down before
> proceeding then polling nova that quickly is too often. It simply
> increases the load while we are waiting to adapt to a loaded system. For
> example in the course of a three minute timeout, the floating IP check
> polled nova for server details 180 times.
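To put a number on the load, here is a small sketch that counts how many polls fit into a timeout when the condition never becomes true. The fixed one-second case reproduces the 180 polls quoted above; the backoff parameters (doubling, capped at 30 seconds) are assumptions chosen only for comparison, not values taken from Tempest.

```python
def poll_count(timeout, initial_interval, factor=1.0, max_interval=None):
    """Count polls issued within `timeout` seconds, first poll at t=0."""
    polls = 0
    elapsed = 0.0
    interval = initial_interval
    while elapsed < timeout:
        polls += 1
        elapsed += interval
        interval *= factor
        if max_interval is not None:
            interval = min(interval, max_interval)
    return polls


# Fixed 1-second interval over a three-minute timeout.
print(poll_count(180, 1))                             # 180
# Doubling interval capped at 30 seconds (assumed values).
print(poll_count(180, 1, factor=2, max_interval=30))  # 10
```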
> All this aside it is granted that checking for the floating IP in the
> nova instance details is imperfect in itself. There is nothing that
> assures that the presence of that information indicates that the
> networking backend is done its work.
> Comments, suggestions, queries, foam bricks?
> OpenStack-dev mailing list
> OpenStack-dev at lists.openstack.org