[openstack-dev] [neutron] non-deterministic gate failures due to unclosed eventlet Timeouts

John Schwarz jschwarz at redhat.com
Sun Sep 7 13:39:27 UTC 2014


Hi,

Long story short: for future reference, if you initialize an eventlet
Timeout, make sure you cancel it (either by using it as a context
manager or by calling timeout.cancel()), and be extra careful when
writing tests that use eventlet Timeouts, because these timeouts don't
implicitly expire and will cause unexpected behaviour (see [1]) such as
gate failures. In our case this caused non-deterministic failures in the
dsvm-functional test suite.


Late last week, a bug was found ([2]) in which an eventlet Timeout
object was initialized but never cancelled. The instance stayed
registered inside eventlet's inner workings and triggered
non-deterministic "Timeout: 10 seconds" errors and failures in
dsvm-functional tests.

As mentioned earlier, initializing a new eventlet.timeout.Timeout
instance also registers it with eventlet's internal mechanisms, and
the reference remains there until it is explicitly removed (and not
merely until execution leaves the function block, as some might have
expected). Thus, the old code (which simply created an instance without
assigning it to a variable) left no way to cancel the timeout object.
The reference remains throughout the "life" of a worker, so it can
(and did) affect other tests and procedures using eventlet in the
same process. This could just as easily affect production-grade
systems under very high load.

For future reference:
 1) If you run into a "Timeout: %d seconds" exception whose traceback
includes "hub.switch()" and "self.greenlet.switch()" calls, there might
be a latent Timeout somewhere in the code, and a search for all
eventlet.timeout.Timeout instances will probably produce the culprit.

 2) The setup used to reproduce this error for debugging purposes was a
bare-metal machine running a VM with devstack. On the bare-metal machine
I ran some 6 "dd if=/dev/zero of=/dev/null" processes to simulate high
CPU load (the full command can be found at [3]), and in the VM I ran the
dsvm-functional suite. Using only a VM with a similar high-CPU
simulation failed to reproduce the failure.
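
The search suggested in 1) could be partly automated. The following is a
rough sketch, not an existing tool: find_unmanaged_timeouts is a
hypothetical helper that uses Python's ast module to flag calls to
something named Timeout whose result is thrown away. It only catches
that one pattern (the one from the bug above) and will also match
unrelated classes that happen to be called Timeout.

```python
import ast


def find_unmanaged_timeouts(source):
    """Return line numbers of bare ``Timeout(...)`` expression statements."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        # An ast.Expr is a statement whose value is discarded,
        # e.g. a line consisting only of `Timeout(10)`.
        if isinstance(node, ast.Expr) and isinstance(node.value, ast.Call):
            func = node.value.func
            if isinstance(func, ast.Attribute):
                name = func.attr  # eventlet.timeout.Timeout(...)
            else:
                name = getattr(func, "id", None)  # Timeout(...)
            if name == "Timeout":
                hits.append(node.lineno)
    return sorted(hits)
```

Feeding it the contents of each .py file in a tree and printing the
returned line numbers gives a short list of candidates to review by
hand.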

[1]
http://eventlet.net/doc/modules/timeout.html#eventlet.timeout.eventlet.timeout.Timeout.Timeout.cancel
[2] https://review.openstack.org/#/c/119001/
[3]
http://stackoverflow.com/questions/2925606/how-to-create-a-cpu-spike-with-a-bash-command


--
John Schwarz,
Software Engineer, Red Hat.



