[openstack-dev] [Fuel][Nailgun] Random failures in unit tests
ikalnitsky at mirantis.com
Wed Mar 16 14:06:53 UTC 2016
As you might know, we have recently been encountering a lot of random
test failures on CI, and they are still there (likely with a somewhat
lower probability). These failures are not actually random: they are
caused by the so-called fake threads.
Fake threads, actually, aren't fake at all. They are native OS threads
designed to emulate Astute behaviour (i.e. catch an RPC call and
respond with an appropriate message). Since they are native threads and
we use SQLAlchemy's scoped_session, fake threads use a separate
database session and hence a separate transaction. That leads to the
following problems:
* Races. We don't know when threads are switched, and therefore we
don't know what's committed and what's not. Some Nailgun tests send
something via RPC (caught by fake threads) and immediately check
something. The issue is that we can't guarantee the fake thread has
already committed its result. That could be avoided by waiting for the
'ready' status of the created Nailgun task; however, it's better simply
not to use fake threads in that case and to call the appropriate
Nailgun receiver's method directly in the test.
* Deadlocks. It's incredibly hard to ensure the same order of database
locks in test + business code on the one hand and fake thread code on
the other. That's why we can (and do) encounter deadlocks on CI, when
a test case waits for a lock acquired by a fake thread, and the fake
thread waits for a lock acquired by the test case.
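To illustrate the race and the direct-call alternative, here is a
minimal sketch in plain Python. Task, receiver_deploy_resp and
via_fake_thread are hypothetical stand-ins, not Nailgun's real API, and
the Event only models an unknown thread-switch delay so that the
example is repeatable.

```python
import threading


class Task:
    """Hypothetical stand-in for a Nailgun task record."""

    def __init__(self):
        self.status = "running"


def receiver_deploy_resp(task, status):
    """Hypothetical receiver method: applies the emulated Astute response."""
    task.status = status


def via_fake_thread(task, release):
    """Respond from a separate native thread, as a fake thread would."""

    def worker():
        # Models the unknown moment at which the thread gets to run
        # and commit its result.
        release.wait()
        receiver_deploy_resp(task, "ready")

    thread = threading.Thread(target=worker)
    thread.start()
    return thread


# Racy variant: the test checks the task right after "sending" the call,
# before the other thread has produced its result.
racy_task = Task()
release = threading.Event()
thread = via_fake_thread(racy_task, release)
status_seen_immediately = racy_task.status  # result not there yet
release.set()
thread.join()
status_after_join = racy_task.status  # visible only after waiting

# Deterministic variant: no thread, call the receiver directly in the test.
direct_task = Task()
receiver_deploy_resp(direct_task, "ready")
```

In the threaded variant the test has to wait (here via join) before the
result is guaranteed to be visible. In the direct variant there is
nothing to wait for, and with a real database everything would also run
in the test's own session, so the lock-ordering problem with a second
transaction should not arise either.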
Fake threads have become a bottleneck for landing patches to master in
time, and we can't ignore it anymore. We have ~190 tests that use fake
threads, and fixing them all at once is a boring routine. So I kindly
ask Nailgun contributors to fix them as we face them. Let's file a bug
on each failure in CI, and quickly prepare a separate patch that
removes the fake thread from the failed test.
Thanks in advance,