[openstack-dev] [neutron] [nova] non-deterministic gate failures due to unclosed eventlet Timeouts
Angus Lees
gus at inodes.org
Thu Sep 11 04:59:29 UTC 2014
On Mon, 8 Sep 2014 05:25:22 PM Jay Pipes wrote:
> On 09/07/2014 10:43 AM, Matt Riedemann wrote:
> > On 9/7/2014 8:39 AM, John Schwarz wrote:
> >> Hi,
> >>
> >> Long story short: for future reference, if you initialize an eventlet
> >> Timeout, make sure you cancel it (either with a context manager or by
> >> calling timeout.cancel()), and be extra careful when writing tests that
> >> use eventlet Timeouts, because these timeouts do not expire implicitly
> >> and will cause unexpected behaviour (see [1]), such as gate failures. In
> >> our case this caused non-deterministic failures in the dsvm-functional
> >> test suite.
> >>
> >>
> >> Late last week, a bug was found ([2]) in which an eventlet Timeout
> >> object was initialized but never cancelled. The instance stayed
> >> registered inside eventlet's internals and triggered non-deterministic
> >> "Timeout: 10 seconds" errors and failures in dsvm-functional tests.
> >>
> >> As mentioned earlier, initializing a new eventlet.timeout.Timeout
> >> instance also registers it with eventlet's internal mechanisms, and
> >> the reference remains there until it is explicitly removed (not merely
> >> until the enclosing function scope exits, as some might expect). Thus,
> >> the old code (which simply created an instance without assigning it to
> >> a variable) left no way to cancel the timeout object. This reference
> >> persists for the life of a worker, so it can (and did) affect other
> >> tests and procedures using eventlet in the same process. This could
> >> easily affect production systems under very high load as well.
> >>
> >> For future reference:
> >> 1) If you run into a "Timeout: %d seconds" exception whose traceback
> >>    includes "hub.switch()" and "self.greenlet.switch()" calls, there
> >>    might be a latent Timeout somewhere in the code, and a search for all
> >>    eventlet.timeout.Timeout instances will probably turn up the culprit.
> >>
> >> 2) The setup used to reproduce this error for debugging purposes is a
> >>    bare-metal machine running a VM with devstack. On the bare-metal
> >>    machine I ran about 6 "dd if=/dev/zero of=/dev/null" processes to
> >>    simulate high CPU load (the full command can be found at [3]), and in
> >>    the VM I ran the dsvm-functional suite. Using only a VM with a
> >>    similar simulated CPU load fails to reproduce the failure.
> >>
> >> [1] http://eventlet.net/doc/modules/timeout.html#eventlet.timeout.eventlet.timeout.Timeout.Timeout.cancel
> >> [2] https://review.openstack.org/#/c/119001/
> >> [3] http://stackoverflow.com/questions/2925606/how-to-create-a-cpu-spike-with-a-bash-command
> >>
> >>
> >>
> >> --
> >> John Schwarz,
> >> Software Engineer, Red Hat.
> >>
> >
> > Thanks, that might be what's causing this timeout/gate failure in the
> > nova unit tests. [1]
> >
> > [1] https://bugs.launchpad.net/nova/+bug/1357578
>
> Indeed, there are a couple of places where eventlet.timeout.Timeout()
> seems to be used in the test suite without a context manager or an
> explicit call to cancel():
>
> tests/virt/libvirt/test_driver.py
> 8925: raise eventlet.timeout.Timeout()
>
> tests/virt/hyperv/test_vmops.py
> 196: mock_with_timeout.side_effect = etimeout.Timeout()

If it's useful to anyone, I wrote a quick pylint check that catches all of the
above cases of "misused" context managers.
(It currently also triggers on the "raise Timeout()" case, which is probably
too eager, but that can be disabled with the usual "# pylint: disable=..."
comment.)

Here: https://review.openstack.org/#/c/120320/
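
To give a rough idea of what such a check looks for, here is a standalone
sketch using only the stdlib ast module. It is not the code in the review
above, and the names BareTimeoutFinder and find_bare_timeouts are invented
for this example. It flags any call to something named Timeout that is not
used directly as a "with" item, so it is just as over-eager about
"raise Timeout()" and about timeouts that are cancelled by hand.

```python
import ast


class BareTimeoutFinder(ast.NodeVisitor):
    """Collect line numbers of Timeout(...) calls not used as a with-item."""

    def __init__(self):
        self.flagged = []
        self._with_calls = set()

    def visit_With(self, node):
        # Remember expressions that appear directly as `with <expr>:` items.
        for item in node.items:
            self._with_calls.add(id(item.context_expr))
        self.generic_visit(node)

    def visit_Call(self, node):
        func = node.func
        # Match both `Timeout(...)` and `something.Timeout(...)`.
        if isinstance(func, ast.Attribute):
            name = func.attr
        else:
            name = getattr(func, "id", None)
        if name == "Timeout" and id(node) not in self._with_calls:
            self.flagged.append(node.lineno)
        self.generic_visit(node)


def find_bare_timeouts(source):
    """Return sorted line numbers of suspicious Timeout(...) calls."""
    finder = BareTimeoutFinder()
    finder.visit(ast.parse(source))
    return sorted(finder.flagged)
```

Run over the two snippets Jay quoted, this would report both lines, while a
plain "with Timeout(1):" block passes.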
--
- Gus