Open Stack

Wed Apr 8 05:40:47 UTC 2020

On 4/7/20 11:49, melanie witt wrote:

<snip>

>>> * http://status.openstack.org/elastic-recheck/#1844929
>>>
>>> This one is where a job fails and the following error is seen one of 
>>> the logs, usually screen-n-sch.txt: "Timed out waiting for response 
>>> from cell 8acfb79b-2e40-4e1c-bc3d-d404dac6db90".
>>>
>>> The TL;DR on this one is there's no immediate clue why it's 
>>> happening. This bug used to hit more occasionally on "slow" nodes 
>>> like nodes from the OVH or INAP providers (and OVH restricts disk 
>>> iops [1]). Now, it seems like it's hitting much more often (still 
>>> mostly on OVH nodes).
>>>
>>> I've been looking at it for about a week now and I've been using a 
>>> DNM patch to add debug logging, look at dstat --disk-wait output, try 
>>> mysqld my.cnf settings, etc:
>>>
>>> https://review.opendev.org/701478
>>>
>>> So far, what I find is that when we get into the fail state, we get 
>>> no rows back from the database server when we query for nova 
>>> 'services' and 'compute_nodes' records, and we fail with the "Timed 
>>> out waiting for response" error.
>>>
>>> Haven't figured out why yet, so far. The disk wait doesn't look high 
>>> when this happens (or at any time during a run) so it's not seeming 
>>> like it's related to disk IO. I'm continuing to look into it.
>>
>> I think I finally got to the bottom of this one and found that during 
>> the grenade runs after we restart nova-scheduler, while it's coming 
>> back up, requests are flowing in and the parent process is holding an 
>> oslo.db lock when the child process workers are forked. So, the child 
>> processes inherit the held locks and database access through those 
>> oslo.db objects can never succeed. This results in the CellTimeout 
>> error -- it behaves as though the database never sends us results, but 
>> what really happened was our database requests were never made because 
>> they were stuck outside the inherited held locks [2].
>>
>> Inheriting of held standard library locks at fork is a known issue in 
>> python [3] that is currently being worked on [4].

Correction: this issue [5] is not considered a bug in python and the 
recommended way of handling the problem is to use the 
os.register_at_fork() method (new in python 3.7) in oslo.db to 
reinitialize its lock [6].

For now, we still support python 3.6, so we have to handle it outside of 
oslo.db some way (https://review.opendev.org/717662 is proposed). But we 
can also add use of os.register_at_fork() in oslo.db for those who are 
running >= python 3.7 to handle the problem more generally.

[5] https://bugs.python.org/issue6721
[6] https://bugs.python.org/issue40091#msg365959

>> I think we can handle this in nova though in the meantime by clearing 
>> our cell cache that holds oslo.db database transaction context manager 
>> objects during service start(). This way, we get fresh oslo.db objects 
>> with unlocked locks when child processes start up. I have proposed a 
>> patch for this here:
>>
>> https://review.opendev.org/717662
> 
> I would appreciate review on this patch ^
> 
>> and have been rechecking it a bunch of times to get more samples. So 
>> far, the grenade-py3 and nova-grenade-multinode jobs have both passed 
>> 7 runs in a row on the change.
>>
>> Cheers,
>> -melanie
>>
>> [2] https://bugs.launchpad.net/nova/+bug/1844929/comments/28
>> [3] https://bugs.python.org/issue40089
>> [4] https://github.com/python/cpython/pull/19195
>>
>>> [1] 
>>> http://lists.openstack.org/pipermail/openstack-discuss/2019-November/010505.html 
>>>
>>
> 

Open Stack

[nova][gate] status of some gate bugs

OpenStack

Community

Documentation

Branding & Legal