[nova][gate] status of some gate bugs

melanie witt melwittt at gmail.com
Tue Apr 7 18:49:13 UTC 2020


Hello again,

I realized I didn't mention that I had written updates inline as well, 
so they may have been missed because of it. Re-sending in case that 
happened; please see a few updates and a couple of requests for review 
inline.

On 4/6/20 17:16, melanie witt wrote:
> Hey all,
> 
> I have a few updates since last time.

<snip>

> On 3/18/20 22:50, melanie witt wrote:
>> * http://status.openstack.org/elastic-recheck/#1813789
>>
>> This one is where the nova-live-migration job fails a server evacuate 
>> test with: "Timeout waiting for [('network-vif-plugged', 
>> 'e3d3db3f-bce4-4889-b161-4b73648f79be')] for instance with vm_state 
>> error and task_state rebuild_spawning.: eventlet.timeout.Timeout: 300 
>> seconds" in the screen-n-cpu.txt log.
>>
>> lyarwood has a WIP patch here:
>>
>> https://review.opendev.org/713674
> 
> This patch has merged since last time and the fail rate is down to 1 
> fail in the last 10 days (note this is likely off a bit because of the 
> indexing snafu from earlier).
> 
>> and sean-k-mooney has a WIP patch here:
>>
>> https://review.opendev.org/713342
> 
> But it is not completely gone yet (I just had a patch fail on it a few 
> minutes ago), so we'll need to work further on this patch (the relevant 
> IRC conversation is linked in a review comment).
> 
>> * https://launchpad.net/bugs/1867380
>>
>> This one is where the nova-live-migration or nova-grenade-multinode 
>> jobs fail due to n-cpu restarting slowly after being reconfigured for 
>> ceph. The server fails to build because the test begins before 
>> nova-compute has fully come up, and we see this error: "Instance 
>> spawn was interrupted before instance_claim, setting instance to ERROR 
>> state {{(pid=3783) _error_out_instances_whose_build_was_interrupted" 
>> in the screen-n-cpu.txt log.
>>
>> lyarwood has an approved patch here that we've been rechecking the 
>> heck out of, but it has yet to merge:
>>
>> https://review.opendev.org/713035
> 
> This patch has merged and I noticed a big improvement in the gate fail 
> rate since then (we did not have an e-r fingerprint for this one -- 
> I'll try to add one so we can see when/if it crops back up).
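
For context on the shape of the fix: the guard needed for this class of 
failure is basically "wait until nova-compute reports itself up again 
before letting tests run". A rough sketch of that poll in Python against 
the openstack CLI, for illustration only (the real change lives in the 
job scripts):

    import subprocess
    import time

    def wait_for_compute_up(timeout=300):
        """Poll until every nova-compute service reports state 'up'."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            out = subprocess.check_output(
                ['openstack', 'compute', 'service', 'list',
                 '--service', 'nova-compute', '-f', 'value', '-c', 'State'],
                universal_newlines=True)
            states = out.split()
            if states and all(s == 'up' for s in states):
                return
            time.sleep(5)
        raise RuntimeError('nova-compute did not come up in %ds' % timeout)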
> 
>> * https://launchpad.net/bugs/1844568
>>
>> This one is where a job fails with: "Body: b'{"conflictingRequest": 
>> {"code": 409, "message": "Multiple possible networks found, use a 
>> Network ID to be more specific."}}'"
>>
>> gmann has a patch proposed to fix some of these here:
>>
>> https://review.opendev.org/711049
> 
> This patch has also merged and the fail rate is down to 5 fails in the 
> last 10 days.
> 
>> There might be more test classes that need create_default_network = True.
> 
> One of my patches hit another instance of this bug and I've proposed a 
> patch for that one specifically:
> 
> https://review.opendev.org/716809

I would appreciate review on this patch ^
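
For anyone unfamiliar with the fix pattern for this bug, it comes down to 
setting that flag on the affected test classes. A made-up class purely to 
show where the attribute goes (the real patches touch existing test 
classes, and I may not have the exact base class right):

    from tempest.api.compute import base

    class ServersExampleTest(base.BaseV2ComputeTest):
        """Hypothetical test class, for illustration only."""

        # When the tenant ends up with more than one network, creating a
        # server without an explicit network ID fails with the 409
        # "Multiple possible networks found" above. Asking the base class
        # to set up a default network avoids that.
        create_default_network = True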

>> * http://status.openstack.org/elastic-recheck/#1844929
>>
>> This one is where a job fails and the following error is seen in one 
>> of the logs, usually screen-n-sch.txt: "Timed out waiting for response 
>> from cell 8acfb79b-2e40-4e1c-bc3d-d404dac6db90".
>>
>> The TL;DR on this one is that there's no immediate clue why it's 
>> happening. This bug used to hit only occasionally, mostly on "slow" 
>> nodes like those from the OVH or INAP providers (OVH restricts disk 
>> iops [1]). Now it seems to be hitting much more often (still mostly on 
>> OVH nodes).
>>
>> I've been looking at it for about a week now and I've been using a DNM 
>> patch to add debug logging, look at dstat --disk-wait output, try 
>> mysqld my.cnf settings, etc:
>>
>> https://review.opendev.org/701478
>>
>> So far, what I've found is that when we get into the fail state, we 
>> get no rows back from the database server when we query for nova 
>> 'services' and 'compute_nodes' records, and we fail with the "Timed 
>> out waiting for response" error.
>>
>> I haven't figured out why yet. The disk wait doesn't look high when 
>> this happens (or at any time during a run), so it doesn't seem to be 
>> related to disk IO. I'm continuing to look into it.
> 
> I think I finally got to the bottom of this one. During the grenade 
> runs, after we restart nova-scheduler and while it's coming back up, 
> requests are flowing in and the parent process is holding an oslo.db 
> lock at the moment the child worker processes are forked. The child 
> processes inherit the held locks, and database access through those 
> oslo.db objects can never succeed. This results in the CellTimeout 
> error -- it behaves as though the database never sends us results, but 
> what really happened is that our database requests were never made 
> because they were blocked on the inherited held locks [2].
> 
> Inheriting held standard library locks across fork is a known issue in 
> Python [3] that is currently being worked on [4].
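
For anyone who wants to see this failure mode in isolation, here is a 
tiny standalone sketch (POSIX only, nothing nova- or oslo.db-specific) of 
what happens when a process forks while another thread holds a lock:

    import os
    import threading
    import time

    # Stands in for a lock inside an oslo.db transaction context manager.
    lock = threading.Lock()

    def hold_lock_for_a_while():
        with lock:
            time.sleep(2)  # the parent is mid-request, holding the lock

    t = threading.Thread(target=hold_lock_for_a_while)
    t.start()
    time.sleep(0.5)  # make sure the lock is held at fork time

    pid = os.fork()
    if pid == 0:
        # Child: the lock was copied in the *held* state and the thread
        # that would release it doesn't exist here, so it can never be
        # released. Anything that needs it just hangs -- the same shape
        # as the cell timeout above.
        print('child acquired lock:', lock.acquire(timeout=5))  # False
        os._exit(0)
    else:
        t.join()
        os.waitpid(pid, 0)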
> 
> I think we can handle this in nova in the meantime, though, by clearing 
> the cell cache that holds oslo.db database transaction context manager 
> objects during service start(). That way, we get fresh oslo.db objects 
> with unlocked locks when the child processes start up. I have proposed 
> a patch for this here:
> 
> https://review.opendev.org/717662

I would appreciate review on this patch ^
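
To give a sense of the shape of the fix (heavily simplified, names 
approximate -- see the review above for the actual change): clear the 
cached per-cell transaction context managers in start(), so freshly 
started workers construct new oslo.db objects instead of reusing copies 
whose internal locks may have been inherited in a held state:

    # Simplified sketch only, not the actual nova patch.

    # Module-level cache mapping cell uuid -> oslo.db transaction context
    # manager (stands in for nova's cell cache).
    CELL_CACHE = {}

    class Service(object):
        def start(self):
            # Reset the cache on (re)start. A forked worker running this
            # builds fresh database context managers rather than using
            # copies whose locks were inherited, possibly held, from the
            # parent process.
            CELL_CACHE.clear()
            # ... normal service startup continues here ...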

> and have been rechecking it a bunch of times to get more samples. So 
> far, the grenade-py3 and nova-grenade-multinode jobs have both passed 7 
> runs in a row on the change.
> 
> Cheers,
> -melanie
> 
> [2] https://bugs.launchpad.net/nova/+bug/1844929/comments/28
> [3] https://bugs.python.org/issue40089
> [4] https://github.com/python/cpython/pull/19195
> 
>> [1] 
>> http://lists.openstack.org/pipermail/openstack-discuss/2019-November/010505.html 
>>
> 
