[openstack-dev] [neutron][qa] test_network_basic_ops and the "FloatingIPChecker" control point
Sean Dague
sean at dague.net
Thu Dec 19 10:39:28 UTC 2013
On 12/18/2013 10:54 PM, Jay Pipes wrote:
> On 12/18/2013 10:21 PM, Brent Eagles wrote:
>> Hi,
>>
>> Yair and I were discussing a change that I initiated and was
>> incorporated into the test_network_basic_ops test. It was intended as a
>> configuration control point for floating IP address assignments before
>> actually testing connectivity. The question we were discussing was
>> whether this check was a valid pass/fail criterion for tests like
>> test_network_basic_ops.
>>
>> The initial motivation for the change was that test_network_basic_ops
>> had a less than 50/50 chance of passing in my local environment for
>> whatever reason. After looking at the test, it seemed ridiculous that it
>> should be failing. The problem is that more often than not the data that
>> was available in the logs all pointed to it being set up correctly but
>> the ping test for connectivity was timing out. From the logs it wasn't
>> clear whether the test was failing because neutron did not do the right
>> thing, did not do it fast enough, or something else was happening. Of
>> course, if I paused the test for a short bit between setup and the checks
>> to manually verify everything, the checks always passed. So it's a timing
>> issue, right?
>>
>> Two things: adding more timeout to a check is as appealing to me as
>> gargling glass AND I was less "annoyed" that the test was failing than I
>> was that it wasn't clear from reading logs what had gone wrong. I tried
>> to find an additional intermediate control point that would "split"
>> failure modes into two categories: neutron is too slow in setting things
>> up and neutron failed to set things up correctly. Granted, it still
>> adds timeout to the test, but I wanted a control point based on
>> "settling" so that, if it passed, there is a good chance that a failure
>> of the next check means neutron actually screwed up what it was
>> trying to do.
>>
>> Waiting until the query on nova returned the floating IP information
>> seemed a relatively reasonable, if imperfect, "settling" criterion before
>> attempting to connect to the VM. Testing to see if the floating IP
>> assignment gets to the nova instance details is a valid test and,
>> AFAICT, missing from the current tests. However, Yair has the reasonable
>> point that connectivity is often available long before the floating IP
>> appears in the nova results and that it could be considered invalid to
>> use non-network specific criteria as pass/fail for this test.
>
> But, Tempest is all about functional integration testing. Using a call
> to Nova's server details to determine whether a dependent call to
> Neutron succeeded (setting up the floating IP) is exactly what I think
> Tempest is all about. It's validating that the integration between Nova
> and Neutron is working as expected.
>
> So, I actually think the assertion on the floating IP address appearing
> (after some timeout/timeout-backoff) is entirely appropriate.
>
>> In general, the validity of checking for the presence of a floating IP
>> in the server details is a matter of interpretation. I think it is a
>> given that it must be tested somewhere and that if it causes a test to
>> fail then it is as valid a failure as a ping failing. Certainly I have
>> seen scenarios where an IP appears, but doesn't actually work and others
>> where the IP doesn't appear (ever, not just in a really long while) but
>> magically works. Both are bugs. Which is more appropriate to tests like
>> test_network_basic_ops?
>
> I believe both assertions should be part of the test cases, but since
> the latter condition (good ping connectivity, but no floater ever
> appears attached to the instance) necessarily depends on the first
> failure (floating IP does not appear in the server details after a
> timeout), then perhaps one way to handle this would be to do this:
>
> a) create server instance
> b) assign floating ip
> c) query server details looking for floater in a timeout-backoff loop
>    c1) floater does appear
>        c1-a) assert ping connectivity
>    c2) floater does not appear
>        c2-a) check ping connectivity. if ping connectivity succeeds, use a
>              call to testtools.TestCase.addDetail() to provide some
>              "interesting" feedback
>        c2-b) raise assertion that floater did not appear in the server details
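
A minimal sketch of the a) to c2-b) flow above, to make the branching concrete. It is not the Tempest implementation: get_server_ips, ping and add_detail are hypothetical callables standing in for the nova server-details query, the connectivity check and a wrapper around testtools.TestCase.addDetail() (for example one that passes testtools.content.text_content(text)). The timeout and interval defaults are assumptions too.

```python
import time


def wait_for_floating_ip(get_server_ips, floating_ip, timeout=180.0,
                         initial_interval=1.0, sleep=time.sleep,
                         clock=time.monotonic):
    """Step c): poll the server details with a doubling interval.

    Returns True as soon as floating_ip is listed, False on timeout.
    """
    interval = initial_interval
    deadline = clock() + timeout
    while True:
        if floating_ip in get_server_ips():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        sleep(min(interval, remaining))
        interval *= 2


def check_floating_ip(get_server_ips, ping, floating_ip, add_detail,
                      **wait_kwargs):
    """Steps c1) and c2): hypothetical helper, not an existing Tempest API."""
    if wait_for_floating_ip(get_server_ips, floating_ip, **wait_kwargs):
        # c1-a) the floater is listed, so connectivity must work too.
        if not ping(floating_ip):
            raise AssertionError(
                "floating IP %s is listed but not reachable" % floating_ip)
        return
    # c2-a) the floater never appeared; record whether ping works anyway.
    if ping(floating_ip):
        add_detail("floating-ip-missing-but-reachable",
                   "ping to %s succeeded although the address never "
                   "appeared in the server details" % floating_ip)
    # c2-b) the missing floater is a failure in either case.
    raise AssertionError(
        "floating IP %s did not appear in the server details" % floating_ip)
```

Passing sleep and clock in as parameters is only there so the loop can be exercised without real waiting.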
>
>> Currently, the polling interval for the checks in the gate should be
>> tuned. They are borrowing other polling configuration and I can see it
>> is ill-advised. It is currently polling at an interval of a second and
>> if the intent is to wait for the entire system to settle down before
>> proceeding then polling nova that quickly is too often. It simply
>> increases the load while we are waiting to adapt to a loaded system. For
>> example in the course of a three minute timeout, the floating IP check
>> polled nova for server details 180 times.
>
> Agreed completely.
We should just add an exponential backoff to the waiting. That should
decrease load over time. I'd be +2 to such a patch.

That being said, I'm not sure why 1 request/sec is considered load
that would break the system. That doesn't seem like a completely
unreasonable load. If you look at the sysstat log in the gate runs where
things fail, you will be able to see the actual load at the point where
this doesn't work.
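
To put rough numbers on the backoff idea, here is a small sketch that counts how many polls fit into the three minute timeout mentioned above. The doubling factor, the 1 second starting interval and the optional 16 second cap are assumptions picked for illustration, not values from any existing patch.

```python
import itertools


def backoff_intervals(initial=1.0, factor=2.0, max_interval=None):
    """Yield sleep intervals that grow by `factor`, optionally capped."""
    interval = initial
    while True:
        yield interval
        interval *= factor
        if max_interval is not None:
            interval = min(interval, max_interval)


def count_polls(intervals, timeout):
    """Count polls issued before `timeout`, with the first poll at t=0."""
    elapsed = 0.0
    polls = 0
    for interval in intervals:
        if elapsed >= timeout:
            break
        polls += 1
        elapsed += interval
    return polls


# Fixed 1 second polling, as in the current check: 180 polls.
fixed = count_polls(itertools.repeat(1.0), 180)
# Uncapped doubling (polls at t=0, 1, 3, 7, ..., 127): 8 polls.
backoff = count_polls(backoff_intervals(), 180)
# Doubling capped at an assumed 16 seconds between polls.
capped = count_polls(backoff_intervals(max_interval=16.0), 180)
```

The cap is a trade-off: uncapped doubling issues the fewest requests, but late in the timeout it can take a minute or more to notice that the floating IP has appeared.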
-Sean
--
Sean Dague
Samsung Research America
sean at dague.net / sean.dague at samsung.com
http://dague.net