[openstack-dev] [neutron] [nova] non-deterministic gate failures due to unclosed eventlet Timeouts

Miguel Angel Ajo Pelayo mangelajo at redhat.com
Thu Sep 11 05:52:48 UTC 2014


Good catch John, and good work Angus! ;)

This will save a lot of headaches.

----- Original Message -----
> On Mon, 8 Sep 2014 05:25:22 PM Jay Pipes wrote:
> > On 09/07/2014 10:43 AM, Matt Riedemann wrote:
> > > On 9/7/2014 8:39 AM, John Schwarz wrote:
> > >> Hi,
> > >> 
> > >> Long story short: if you initialize an eventlet Timeout, make sure
> > >> you cancel it, either by using it as a context manager or by calling
> > >> timeout.cancel(). Be extra careful when writing tests that use
> > >> eventlet Timeouts: they are not cancelled implicitly, and a leftover
> > >> one causes unexpected behaviour (see [1]) such as gate failures. In
> > >> our case it caused non-deterministic failures in the dsvm-functional
> > >> test suite.
> > >> 
> > >> 
> > >> Late last week, a bug was found ([2]) in which an eventlet Timeout
> > >> object was initialized but never cancelled. The instance stayed
> > >> registered inside eventlet's internals and triggered
> > >> non-deterministic "Timeout: 10 seconds" errors and failures in the
> > >> dsvm-functional tests.
> > >> 
> > >> As mentioned above, initializing a new eventlet.timeout.Timeout
> > >> instance also registers it with eventlet's internal machinery, and
> > >> that reference remains until it is explicitly removed (not merely
> > >> until execution leaves the enclosing function, as some might
> > >> expect). The old code created an instance without assigning it to a
> > >> variable, which left no way to cancel the timeout. The reference
> > >> lives for the lifetime of the worker, so it can (and did) affect
> > >> other tests and procedures using eventlet in the same process.
> > >> This could just as easily affect production systems under very
> > >> high load.
> > >> 
> > >> For future reference:
> > >>   1) If you run into a "Timeout: %d seconds" exception whose traceback
> > >> includes "hub.switch()" and "self.greenlet.switch()" calls, there might
> > >> be a latent Timeout somewhere in the code, and a search for all
> > >> eventlet.timeout.Timeout instances will probably produce the culprit.
> > >> 
> > >>   2) The setup used to reproduce this error for debugging purposes is a
> > >> baremetal machine running a VM with devstack. On the baremetal machine I
> > >> ran about 6 "dd if=/dev/zero of=/dev/null" processes to simulate high CPU
> > >> load (the full command can be found at [3]), and in the VM I ran the
> > >> dsvm-functional suite. Using only a VM with a similar high-CPU simulation
> > >> does not reproduce the failure.
> > >> 
> > >> [1]
> > >> http://eventlet.net/doc/modules/timeout.html#eventlet.timeout.eventlet.timeout.Timeout.Timeout.cancel
> > >> 
> > >> [2] https://review.openstack.org/#/c/119001/
> > >> [3]
> > >> http://stackoverflow.com/questions/2925606/how-to-create-a-cpu-spike-with-a-bash-command
> > >> 
> > >> 
> > >> 
> > >> --
> > >> John Schwarz,
> > >> Software Engineer, Red Hat.
> > >> 
> > >> 
> > >> _______________________________________________
> > >> OpenStack-dev mailing list
> > >> OpenStack-dev at lists.openstack.org
> > >> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> > > 
> > > Thanks, that might be what's causing this timeout/gate failure in the
> > > nova unit tests. [1]
> > > 
> > > [1] https://bugs.launchpad.net/nova/+bug/1357578
> > 
> > Indeed, there are a couple of places where eventlet.timeout.Timeout()
> > seems to be used in the test suite without a context manager or an
> > explicit cancel() call:
> > 
> > tests/virt/libvirt/test_driver.py
> > 8925:                raise eventlet.timeout.Timeout()
> > 
> > tests/virt/hyperv/test_vmops.py
> > 196:        mock_with_timeout.side_effect = etimeout.Timeout()
> 
> If it's useful for anyone, I wrote a quick pylint test that will catch all
> the above cases of "misused" context managers.
> 
> (Indeed, it will currently trigger on the "raise Timeout()" case, which is
> probably too eager but can be disabled in the usual #pylint meta-comment way)
> 
> Here: https://review.openstack.org/#/c/120320/
> 
> --
>  - Gus
> 
> 
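
In the same spirit as the check Gus mentions above (but not his pylint plugin, which lives in the linked review), here is a rough stdlib-only sketch of how one might search a source file for suspicious Timeout calls. It is a heuristic under stated assumptions: it matches any call spelled `Timeout(...)` or `<something>.Timeout(...)`, treats assignment to a name as "probably cancelled later", and the helper names and SAMPLE text are hypothetical.

```python
import ast


def _is_timeout_call(node):
    """True for calls spelled Timeout(...) or <anything>.Timeout(...)."""
    if not isinstance(node, ast.Call):
        return False
    func = node.func
    if isinstance(func, ast.Attribute):
        return func.attr == "Timeout"
    return getattr(func, "id", None) == "Timeout"


def find_suspect_timeouts(source):
    """Return line numbers of Timeout(...) calls that are neither used as a
    `with` context expression nor assigned to a name (heuristic only)."""
    tree = ast.parse(source)
    allowed = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.With, ast.AsyncWith)):
            allowed.update(id(item.context_expr) for item in node.items)
        elif isinstance(node, ast.Assign):
            allowed.add(id(node.value))
    return sorted(
        node.lineno
        for node in ast.walk(tree)
        if _is_timeout_call(node) and id(node) not in allowed
    )


# Hypothetical input; it is only parsed, never executed.
SAMPLE = '''\
import eventlet
from eventlet import timeout as etimeout

def leaky():
    eventlet.timeout.Timeout(10)

def managed():
    with eventlet.timeout.Timeout(10):
        pass

def manual():
    t = etimeout.Timeout(10)
    try:
        pass
    finally:
        t.cancel()
'''

print(find_suspect_timeouts(SAMPLE))  # prints [5]
```

Like the pylint check, this also flags "raise Timeout()", so expect some false positives that need a manual look.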


