[openstack-dev] [neutron] [nova] non-deterministic gate failures due to unclosed eventlet Timeouts

Jay Pipes jaypipes at gmail.com
Mon Sep 8 21:25:22 UTC 2014


On 09/07/2014 10:43 AM, Matt Riedemann wrote:
> On 9/7/2014 8:39 AM, John Schwarz wrote:
>> Hi,
>>
>> Long story short: for future reference, if you initialize an eventlet
>> Timeout, make sure you cancel it (either by using it as a context manager
>> or by calling timeout.cancel()), and be extra careful when writing tests
>> that use eventlet Timeouts, because these timeouts are not implicitly
>> cancelled and will cause unexpected behaviour (see [1]) such as gate
>> failures. In our case this caused non-deterministic failures in the
>> dsvm-functional test suite.
>>
>>
>> Late last week, a bug was found ([2]) in which an eventlet Timeout
>> object was initialized but never cancelled. The instance stayed
>> registered inside eventlet's internals and triggered non-deterministic
>> "Timeout: 10 seconds" errors and failures in dsvm-functional tests.
>>
>> As mentioned earlier, initializing a new eventlet.timeout.Timeout
>> instance also registers it with the library's internal mechanisms,
>> and the reference remains there until it is explicitly removed (not
>> merely until execution leaves the function's scope, as some might
>> expect). Thus, the old code (which simply created an instance without
>> assigning it to a variable) left no way to cancel the timeout object.
>> The reference remains for the "life" of a worker, so it can (and did)
>> affect other tests and procedures using eventlet in the same process.
>> This could just as easily affect production-grade systems under very
>> high load.
>>
>> For future reference:
>>   1) If you run into a "Timeout: %d seconds" exception whose traceback
>> includes "hub.switch()" and "self.greenlet.switch()" calls, there might
>> be a latent Timeout somewhere in the code, and a search for all
>> eventlet.timeout.Timeout instances will probably produce the culprit.
>>
>>   2) The setup used to reproduce this error for debugging purposes was a
>> baremetal machine running a VM with devstack. On the baremetal machine I
>> ran about six "dd if=/dev/zero of=/dev/null" processes to simulate high
>> CPU load (the full command can be found at [3]), and in the VM I ran the
>> dsvm-functional suite. Simulating similar CPU load inside the VM alone
>> did not reproduce the failure.
>>
>> [1]
>> http://eventlet.net/doc/modules/timeout.html#eventlet.timeout.eventlet.timeout.Timeout.Timeout.cancel
>>
>> [2] https://review.openstack.org/#/c/119001/
>> [3]
>> http://stackoverflow.com/questions/2925606/how-to-create-a-cpu-spike-with-a-bash-command
>>
>>
>>
>> --
>> John Schwarz,
>> Software Engineer, Red Hat.
>>
>
> Thanks, that might be what's causing this timeout/gate failure in the
> nova unit tests. [1]
>
> [1] https://bugs.launchpad.net/nova/+bug/1357578

Indeed, there are a couple of places where eventlet.timeout.Timeout() seems
to be used in the test suite without a context manager or an explicit
call to cancel():

tests/virt/libvirt/test_driver.py
8925:                raise eventlet.timeout.Timeout()

tests/virt/hyperv/test_vmops.py
196:        mock_with_timeout.side_effect = etimeout.Timeout()
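
For anyone who wants to repeat this kind of search, here is a rough sketch of an AST-based scan that might be a bit more precise than a plain grep. It is only an illustration, not something that exists in nova or neutron: the names find_unmanaged_timeouts, _is_timeout_call and SAMPLE are made up for this example, and the check is deliberately naive. It flags every call to something named Timeout that is not the context expression of a "with" statement, so it will also report instances that are cancelled by hand (or merely raised, as in the two hits above) and those still need a manual look.

```python
import ast


def _is_timeout_call(node):
    """True for calls like Timeout(...), etimeout.Timeout(...),
    or eventlet.timeout.Timeout(...). Matching is by name only."""
    if not isinstance(node, ast.Call):
        return False
    func = node.func
    if isinstance(func, ast.Attribute):
        name = func.attr
    else:
        name = getattr(func, "id", None)
    return name == "Timeout"


def find_unmanaged_timeouts(source):
    """Return sorted line numbers of Timeout(...) calls that are not
    used directly as the context expression of a `with` statement."""
    tree = ast.parse(source)

    # Collect the expressions that `with` statements manage.
    managed = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.With, ast.AsyncWith)):
            for item in node.items:
                managed.add(id(item.context_expr))

    return sorted(
        node.lineno
        for node in ast.walk(tree)
        if _is_timeout_call(node) and id(node) not in managed
    )


# Hypothetical input; it is only parsed, never executed.
SAMPLE = '''\
import eventlet

def leaky():
    eventlet.timeout.Timeout(10)      # never cancelled
    do_work()

def safe():
    with eventlet.timeout.Timeout(10):
        do_work()

def also_flagged():
    t = eventlet.timeout.Timeout(10)  # cancelled below, still reported
    try:
        do_work()
    finally:
        t.cancel()
'''

if __name__ == "__main__":
    print(find_unmanaged_timeouts(SAMPLE))
```

Run against SAMPLE this prints [4, 12]: the leaked instance in leaky() and the manually cancelled one in also_flagged(), while the "with" form in safe() is not reported. Erring on the side of over-reporting seemed the safer choice here, since a missed Timeout is the expensive case.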

Best,
-jay


