[nova][gate] status of some gate bugs

melanie witt melwittt at gmail.com
Tue Apr 7 00:16:03 UTC 2020


Hey all,

I have a few updates since last time.

This morning we noticed http://status.openstack.org/elastic-recheck was 
behind on elastic search indexing by 90+ hours. I mentioned it in 
#openstack-infra and clarkb checked and saw that a worker was stuck on 
226 MB of nova-api logs. I checked and found it was due to the default 
policy deprecation messages being logged 10K+ times per run.

I was able to get help from gmann on the best way to stop the bleeding 
and he proposed a patch that we've since merged:

https://review.opendev.org/717802

After this merged, clarkb cleared the elastic search indexing backlog as 
it would never be able to catch up.
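
For reference, the kind of suppression involved looks roughly like the 
following. This is a sketch, not the content of the patch above -- it 
assumes oslo.policy's Enforcer exposes a suppress_deprecation_warnings 
toggle (part of what gmann is working on, see below), so check the 
review for the real change:

    from oslo_config import cfg
    from oslo_policy import policy

    CONF = cfg.CONF
    _ENFORCER = policy.Enforcer(CONF)
    # Stop emitting a deprecation warning on every check of a deprecated
    # rule; with thousands of checks per tempest run this swamps the logs.
    _ENFORCER.suppress_deprecation_warnings = True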

gmann is working on some oslo.policy changes to better handle policy 
deprecation logging that we hope to be able to consume soon. Worst case, 
we'll leave the policy deprecation logging disabled for now, as the nova 
code is OR'ing new policy defaults with old policy defaults and will 
continue to do so for 1-2 cycles. So we have time to sort out the 
logging issue in the event 
that we can't get a solution via FFE in oslo.policy.
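
For background on the OR'ing: while a rule carries a deprecated rule, 
oslo.policy enforces "new check_str OR old check_str", so operators 
still on the old defaults keep working. A rough sketch of the 
registration pattern (the rule name and check strings here are made up, 
and the exact keyword arguments can vary a bit between oslo.policy 
releases):

    from oslo_policy import policy

    # Hypothetical rule, just to show the pattern.
    deprecated_rule = policy.DeprecatedRule(
        name='os_compute_api:example',
        check_str='rule:admin_or_owner',
    )

    rules = [
        policy.DocumentedRuleDefault(
            name='os_compute_api:example',
            check_str='rule:system_admin_or_owner',
            description='An example policy.',
            operations=[{'method': 'GET', 'path': '/example'}],
            deprecated_rule=deprecated_rule,
            deprecated_reason='Defaults are moving to new personas.',
            deprecated_since='21.0.0',
        ),
    ]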

On 3/18/20 22:50, melanie witt wrote:
> * http://status.openstack.org/elastic-recheck/#1813789
> 
> This one is where the nova-live-migration job fails a server evacuate 
> test with: "Timeout waiting for [('network-vif-plugged', 
> 'e3d3db3f-bce4-4889-b161-4b73648f79be')] for instance with vm_state 
> error and task_state rebuild_spawning.: eventlet.timeout.Timeout: 300 
> seconds" in the screen-n-cpu.txt log.
> 
> lyarwood has a WIP patch here:
> 
> https://review.opendev.org/713674

This patch has merged since last time and the fail rate is down to 1 
fail in the last 10 days (note this is likely off a bit because of the 
indexing snafu from earlier).

> and sean-k-mooney has a WIP patch here:
> 
> https://review.opendev.org/713342

But it is not completely gone yet (I just had a patch fail on it a few 
minutes ago), so we'll need to work further on this patch (the relevant 
IRC conversation is linked in a review comment).

> * https://launchpad.net/bugs/1867380
> 
> This one is where the nova-live-migration or nova-grenade-multinode job 
> fail due to n-cpu restarting slowly after being reconfigured for ceph. 
> The server will fail to build and it's because the test begins before 
> nova-compute has fully come up and we see this error: "Instance spawn 
> was interrupted before instance_claim, setting instance to ERROR state 
> {{(pid=3783) _error_out_instances_whose_build_was_interrupted" in the 
> screen-n-cpu.txt log.
> 
> lyarwood has a patch approved here that we've been rechecking the heck 
> out of that has yet to merge:
> 
> https://review.opendev.org/713035

This patch has merged and I noticed a big improvement in fail rate in 
the gate since then (we did not have an e-r fingerprint for this one -- 
I'll try to add one so we can see when/if it crops back up).

> * https://launchpad.net/bugs/1844568
> 
> This one is where a job fails with: "Body: b'{"conflictingRequest": 
> {"code": 409, "message": "Multiple possible networks found, use a 
> Network ID to be more specific."}}'"
> 
> gmann has a patch proposed to fix some of these here:
> 
> https://review.opendev.org/711049

This patch also merged and its fail rate is down to 5 fails in the last 
10 days.

> There might be more test classes that need create_default_network = True.

One of my patches hit another instance of this bug and I've proposed a 
patch for that one specifically:

https://review.opendev.org/716809
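
For anyone who hits other instances of this: the fix is usually a 
one-line class attribute on the affected test class. A hypothetical 
example of the pattern (the class name is made up and the base class is 
just an example; see the reviews above for the real ones):

    from tempest.api.compute import base


    class ExampleServersTest(base.BaseV2ComputeTest):
        # Per the discussion above, opting the class in here avoids the
        # "Multiple possible networks found, use a Network ID to be more
        # specific." 409 when the tests create servers.
        create_default_network = True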

> * http://status.openstack.org/elastic-recheck/#1844929
> 
> This one is where a job fails and the following error is seen in one of the 
> logs, usually screen-n-sch.txt: "Timed out waiting for response from 
> cell 8acfb79b-2e40-4e1c-bc3d-d404dac6db90".
> 
> The TL;DR on this one is there's no immediate clue why it's happening. 
> This bug used to hit more occasionally on "slow" nodes like nodes from 
> the OVH or INAP providers (and OVH restricts disk iops [1]). Now, it 
> seems like it's hitting much more often (still mostly on OVH nodes).
> 
> I've been looking at it for about a week now and I've been using a DNM 
> patch to add debug logging, look at dstat --disk-wait output, try mysqld 
> my.cnf settings, etc:
> 
> https://review.opendev.org/701478
> 
> So far, what I find is that when we get into the fail state, we get no 
> rows back from the database server when we query for nova 'services' and 
> 'compute_nodes' records, and we fail with the "Timed out waiting for 
> response" error.
> 
> Haven't figured out why yet, so far. The disk wait doesn't look high 
> when this happens (or at any time during a run) so it's not seeming like 
> it's related to disk IO. I'm continuing to look into it.

I think I finally got to the bottom of this one and found that during 
the grenade runs, after we restart nova-scheduler and while it's coming 
back up, requests are flowing in and the parent process is holding an 
oslo.db lock at the moment the child worker processes are forked. So, 
the child processes inherit the held locks, and database access through 
those oslo.db objects can never succeed. This results in the CellTimeout 
error -- it behaves as though the database never sends us results, but 
what really happened was that our database requests were never made, 
because they were stuck waiting on the inherited, already-held locks [2].
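
To make the failure mode concrete, here is a tiny self-contained 
reproduction of the general problem (nothing nova-specific, just the 
standard library; Linux only because of os.fork):

    import os
    import sys
    import threading
    import time

    lock = threading.Lock()

    def hold_lock_for_a_while():
        # A parent thread is holding the lock at the moment we fork.
        with lock:
            time.sleep(5)

    threading.Thread(target=hold_lock_for_a_while).start()
    time.sleep(0.5)  # make sure the thread has acquired the lock

    pid = os.fork()
    if pid == 0:
        # Child: only the forking thread exists here, so nothing in this
        # process will ever release the copied, already-held lock.
        print('child acquired lock:', lock.acquire(timeout=3))  # False
        sys.exit(0)

    os.waitpid(pid, 0)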

Inheriting held standard library locks at fork is a known issue in 
Python [3] that is currently being worked on [4].

In the meantime, though, I think we can handle this in nova by clearing 
the cell cache that holds the oslo.db database transaction context 
manager objects during service start(). This way, we get fresh oslo.db 
objects with unlocked locks when child processes start up. I have 
proposed a patch for this here:

https://review.opendev.org/717662

and have been rechecking it a bunch of times to get more samples. So 
far, the grenade-py3 and nova-grenade-multinode jobs have both passed 7 
runs in a row on the change.
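
The gist of the fix is small. Very roughly (this is a paraphrase with 
stand-in names, not the actual nova diff -- see the review above for 
that):

    # Stand-in for nova's cell -> database/MQ connection object cache.
    CELL_CACHE = {}


    class Service(object):
        """Minimal stand-in for a nova service."""

        def start(self):
            # Drop any cached oslo.db transaction context managers before
            # (re)starting, so freshly forked child workers build new
            # database objects instead of inheriting ones whose internal
            # locks were copied in a held state.
            CELL_CACHE.clear()
            # ... normal service startup continues here ...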

Cheers,
-melanie

[2] https://bugs.launchpad.net/nova/+bug/1844929/comments/28
[3] https://bugs.python.org/issue40089
[4] https://github.com/python/cpython/pull/19195

> [1] 
> http://lists.openstack.org/pipermail/openstack-discuss/2019-November/010505.html 
> 



