On 4/7/20 11:49, melanie witt wrote: <snip>
* http://status.openstack.org/elastic-recheck/#1844929
This one is where a job fails and the following error is seen in one of the logs, usually screen-n-sch.txt: "Timed out waiting for response from cell 8acfb79b-2e40-4e1c-bc3d-d404dac6db90".
The TL;DR on this one is that there's no immediate clue why it's happening. This bug used to hit only occasionally, on "slow" nodes such as those from the OVH or INAP providers (OVH restricts disk iops [1]). Now it seems to be hitting much more often (still mostly on OVH nodes).
I've been looking at it for about a week now, using a DNM (do not merge) patch to add debug logging, look at dstat --disk-wait output, try mysqld my.cnf settings, etc.:
https://review.opendev.org/701478
So far, what I find is that when we get into the fail state, we get no rows back from the database server when we query for nova 'services' and 'compute_nodes' records, and we fail with the "Timed out waiting for response" error.
I haven't figured out why yet. The disk wait doesn't look high when this happens (or at any time during a run), so it doesn't seem to be related to disk I/O. I'm continuing to look into it.
I think I finally got to the bottom of this one. I found that during the grenade runs, after we restart nova-scheduler and while it's coming back up, requests are flowing in and the parent process is holding an oslo.db lock when the child process workers are forked. The child processes inherit the held locks, and database access through those oslo.db objects can never succeed. This results in the CellTimeout error -- it behaves as though the database never sends us results, but what really happened is that our database requests were never made, because they were stuck waiting on the inherited held locks [2].
Inheriting held standard library locks at fork is a known issue in python [3] that is currently being worked on [4].
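To make the failure mode concrete, here's a minimal standalone sketch (not nova or oslo.db code, the names are made up for illustration) of what happens to a lock that is held at fork time:

    # Demo: a lock held by a parent thread at fork() time is copied into
    # the child in the "held" state, and no thread in the child will ever
    # release it, so the child blocks (here, only until the timeout).
    import os
    import threading
    import time

    lock = threading.Lock()

    def hold_lock():
        with lock:
            time.sleep(5)      # the fork below happens while this is held

    threading.Thread(target=hold_lock, daemon=True).start()
    time.sleep(0.5)            # make sure the thread has acquired the lock

    pid = os.fork()
    if pid == 0:
        # Child: analogous to a forked worker trying to use the inherited
        # oslo.db transaction context -- the acquire can never succeed.
        print("child acquired lock:", lock.acquire(timeout=2))  # -> False
        os._exit(0)
    else:
        os.waitpid(pid, 0)

In the real failure there is no timeout on the child's side, which matches the symptom above: the database request is never actually made.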
Correction: this issue [5] is not considered a bug in python, and the recommended way of handling the problem is to use the os.register_at_fork() method (new in python 3.7) in oslo.db to reinitialize its lock [6]. For now, we still support python 3.6, so we have to handle it outside of oslo.db in some way (https://review.opendev.org/717662 is proposed). But we can also add use of os.register_at_fork() in oslo.db, for those running >= python 3.7, to handle the problem more generally.
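For reference, the >= python 3.7 approach would look something like this hedged sketch (the _facade_lock name is illustrative, not the actual oslo.db attribute):

    # Sketch of the os.register_at_fork() approach (python >= 3.7 only).
    # _facade_lock stands in for whatever lock oslo.db needs to reset.
    import os
    import threading

    _facade_lock = threading.Lock()

    def _reinit_lock_in_child():
        # Runs in the child right after fork(): throw away the copied lock,
        # which may have been inherited in the "held" state, and start fresh.
        global _facade_lock
        _facade_lock = threading.Lock()

    os.register_at_fork(after_in_child=_reinit_lock_in_child)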
In the meantime, though, I think we can handle this in nova by clearing our cell cache (which holds the oslo.db database transaction context manager objects) during service start(). This way, we get fresh oslo.db objects with unlocked locks when the child processes start up. I have proposed a patch for this here:
I would appreciate review on this patch ^ and have been rechecking it a bunch of times to get more samples. So far, the grenade-py3 and nova-grenade-multinode jobs have both passed 7 runs in a row on the change.
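For anyone skimming, the idea amounts to something like this simplified sketch (not the actual change; CELL_CACHE here stands in for nova's per-cell cache of oslo.db transaction context managers):

    # Simplified sketch of the nova-side workaround, not the real patch.
    # CELL_CACHE stands in for the module-level, per-cell cache of oslo.db
    # transaction context manager objects.
    CELL_CACHE = {}

    class Service(object):
        """Stand-in for a nova service whose workers are forked children."""

        def start(self):
            # Drop anything cached in the parent before the workers forked;
            # fresh oslo.db objects (with unlocked locks) are then built
            # lazily on first database access in this process.
            CELL_CACHE.clear()
            # ... rest of normal service startup ...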
Cheers, -melanie
[1] http://lists.openstack.org/pipermail/openstack-discuss/2019-November/010505....
[2] https://bugs.launchpad.net/nova/+bug/1844929/comments/28
[3] https://bugs.python.org/issue40089
[4] https://github.com/python/cpython/pull/19195
[5] https://bugs.python.org/issue6721
[6] https://bugs.python.org/issue40091#msg365959