[nova][gate] status of some gate bugs
melwittt at gmail.com
Wed Apr 8 05:40:47 UTC 2020
On 4/7/20 11:49, melanie witt wrote:
>>> * http://status.openstack.org/elastic-recheck/#1844929
>>> This one is where a job fails and the following error is seen one of
>>> the logs, usually screen-n-sch.txt: "Timed out waiting for response
>>> from cell 8acfb79b-2e40-4e1c-bc3d-d404dac6db90".
>>> The TL;DR on this one is there's no immediate clue why it's
>>> happening. This bug used to hit more occasionally on "slow" nodes
>>> like nodes from the OVH or INAP providers (and OVH restricts disk
>>> iops). Now, it seems like it's hitting much more often (still
>>> mostly on OVH nodes).
>>> I've been looking at it for about a week now and I've been using a
>>> DNM patch to add debug logging, look at dstat --disk-wait output, try
>>> mysqld my.cnf settings, etc:
>>> So far, what I find is that when we get into the fail state, we get
>>> no rows back from the database server when we query for nova
>>> 'services' and 'compute_nodes' records, and we fail with the "Timed
>>> out waiting for response" error.
>>> I haven't figured out why yet. The disk wait doesn't look high
>>> when this happens (or at any time during a run), so it doesn't seem
>>> to be related to disk IO. I'm continuing to look into it.
>> I think I finally got to the bottom of this one and found that during
>> the grenade runs after we restart nova-scheduler, while it's coming
>> back up, requests are flowing in and the parent process is holding an
>> oslo.db lock when the child process workers are forked. So, the child
>> processes inherit the held locks and database access through those
>> oslo.db objects can never succeed. This results in the CellTimeout
>> error -- it behaves as though the database never sends us results, but
>> what really happened was our database requests were never made because
>> they were stuck outside the inherited held locks.
>> Inheriting of held standard library locks at fork is a known issue in
>> python that is currently being worked on.
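To make the failure mode above concrete, here is a minimal, self-contained sketch (not nova or oslo.db code; POSIX-only since it uses os.fork()) showing how a child process inherits a lock that the parent held at fork time. The child's copy of the lock is marked held, but the only holder is a thread in the parent, so the child can never acquire it:

```python
import os
import threading

lock = threading.Lock()
lock.acquire()          # parent holds the lock at the moment of fork

r, w = os.pipe()
pid = os.fork()
if pid == 0:
    # Child process: its copy of the lock is still marked "held", but the
    # holder was a thread in the parent, so it will never be released here.
    # Without the timeout, this acquire() would block forever -- the same
    # shape as the CellTimeout failure described above.
    got = lock.acquire(timeout=1)
    os.write(w, b"1" if got else b"0")
    os._exit(0)

os.waitpid(pid, 0)
result = os.read(r, 1)
print("child could acquire inherited lock:", result == b"1")
lock.release()
```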
Correction: this issue is not considered a bug in python and the
recommended way of handling the problem is to use the
os.register_at_fork() method (new in python 3.7) in oslo.db to
reinitialize its lock.
For now, we still support python 3.6, so we have to handle it outside of
oslo.db in some way (https://review.opendev.org/717662 is proposed). But we
can also add use of os.register_at_fork() in oslo.db for those who are
running >= python 3.7 to handle the problem more generally.
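As an illustration of the os.register_at_fork() approach, here is a minimal sketch (the SessionManager class and method names are hypothetical stand-ins, not the actual oslo.db API) that swaps in a fresh, unheld lock in the child immediately after fork():

```python
import os
import threading

class SessionManager:
    """Toy stand-in for an oslo.db-style object that guards shared
    state with a lock (hypothetical class, for illustration only)."""

    def __init__(self):
        self._lock = threading.Lock()

    def reinit_after_fork(self):
        # Replace the (possibly held) inherited lock with a fresh one,
        # so child workers never wait on a lock no one can release.
        self._lock = threading.Lock()

manager = SessionManager()

# New in python 3.7: run this callback in the child right after fork().
os.register_at_fork(after_in_child=manager.reinit_after_fork)
```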
>> I think we can handle this in nova though in the meantime by clearing
>> our cell cache that holds oslo.db database transaction context manager
>> objects during service start(). This way, we get fresh oslo.db objects
>> with unlocked locks when child processes start up. I have proposed a
>> patch for this here:
> I would appreciate review on this patch ^
>> and have been rechecking it a bunch of times to get more samples. So
>> far, the grenade-py3 and nova-grenade-multinode jobs have both passed
>> 7 runs in a row on the change.
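The shape of the nova-side workaround can be sketched as follows (the class and function names here are hypothetical stand-ins for illustration, not the actual nova code in the proposed patch): a module-level cache of per-cell database context objects is cleared on service start, so each forked worker lazily rebuilds fresh objects with unheld locks instead of reusing ones inherited from the parent:

```python
import threading

class DatabaseContext:
    """Toy per-cell DB context that guards access with a lock."""
    def __init__(self):
        self.lock = threading.Lock()

# Module-level cache of per-cell contexts, populated lazily.
CELL_CACHE = {}

def get_context(cell_uuid):
    # Create and cache one context per cell on first use.
    if cell_uuid not in CELL_CACHE:
        CELL_CACHE[cell_uuid] = DatabaseContext()
    return CELL_CACHE[cell_uuid]

def service_start():
    # On start of a forked worker, drop any inherited contexts so the
    # child builds fresh ones whose locks are not already held.
    CELL_CACHE.clear()
```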
>>  https://bugs.launchpad.net/nova/+bug/1844929/comments/28
>>  https://bugs.python.org/issue40089
>>  https://github.com/python/cpython/pull/19195