[nova][gate] status of some gate bugs

melanie witt melwittt at gmail.com
Thu Mar 19 05:50:18 UTC 2020


Hey all,

We've been having a tough time lately in the gate hitting various bugs 
while our patches go through CI. I just wanted to mention a few of them 
that I've seen often in my gerrit notifications and give a brief status 
on fix efforts.

* http://status.openstack.org/elastic-recheck/#1813789

This one is where the nova-live-migration job fails a server evacuate 
test with: "Timeout waiting for [('network-vif-plugged', 
'e3d3db3f-bce4-4889-b161-4b73648f79be')] for instance with vm_state 
error and task_state rebuild_spawning.: eventlet.timeout.Timeout: 300 
seconds" in the screen-n-cpu.txt log.

lyarwood has a WIP patch here:

https://review.opendev.org/713674

and sean-k-mooney has a WIP patch here:

https://review.opendev.org/713342

* https://launchpad.net/bugs/1867380

This one is where the nova-live-migration or nova-grenade-multinode 
job fails due to n-cpu restarting slowly after being reconfigured for 
ceph. The server fails to build because the test begins before 
nova-compute has fully come up, and we see this error: "Instance spawn 
was interrupted before instance_claim, setting instance to ERROR state 
{{(pid=3783) _error_out_instances_whose_build_was_interrupted" in the 
screen-n-cpu.txt log.

lyarwood has an approved patch here that we've been rechecking the 
heck out of, but it has yet to merge:

https://review.opendev.org/713035
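
To illustrate the race, the kind of wait that avoids it looks roughly 
like this (a rough sketch using openstacksdk, not what lyarwood's 
patch actually does; the 'devstack-admin' cloud name is an assumption 
about the local clouds.yaml):

    # Rough sketch only, not lyarwood's patch: poll until every
    # nova-compute service reports state 'up' before letting the tests
    # start, which is the race the job is currently losing while n-cpu
    # restarts after the ceph reconfigure.
    import time

    import openstack


    def wait_for_computes_up(conn, timeout=300, interval=5):
        """Block until all nova-compute services report state 'up'."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            computes = [svc for svc in conn.compute.services()
                        if svc.binary == 'nova-compute']
            if computes and all(svc.state == 'up' for svc in computes):
                return
            time.sleep(interval)
        raise RuntimeError('timed out waiting for nova-compute to come up')


    # Listing compute services requires admin credentials.
    wait_for_computes_up(openstack.connect(cloud='devstack-admin'))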

* https://launchpad.net/bugs/1844568

This one is where a job fails with: "Body: b'{"conflictingRequest": 
{"code": 409, "message": "Multiple possible networks found, use a 
Network ID to be more specific."}}'"

gmann has a patch proposed to fix some of these here:

https://review.opendev.org/711049

There might be more test classes that need create_default_network = True.
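
For reference, opting a test class in looks roughly like this (the 
class name and test are made up, and I'm assuming the base class wires 
create_default_network through the same way the existing compute tests 
do):

    # Illustration only -- the real changes are in gmann's patch above.
    from tempest.api.compute import base


    class ServersExampleTest(base.BaseV2ComputeTest):
        # Ask the base class to provision a default network for this
        # class's credentials so that creating a server without an
        # explicit network ID doesn't fail with "Multiple possible
        # networks found".
        create_default_network = True

        def test_create_server(self):
            # No network ID needed; the default network is used.
            self.create_test_server(wait_until='ACTIVE')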

* http://status.openstack.org/elastic-recheck/#1844929

This one is where a job fails and the following error is seen in one 
of the logs, usually screen-n-sch.txt: "Timed out waiting for response 
from cell 8acfb79b-2e40-4e1c-bc3d-d404dac6db90".

The TL;DR on this one is that there's no immediate clue why it's 
happening. This bug used to hit only occasionally, mostly on "slow" 
nodes like those from the OVH or INAP providers (OVH restricts disk 
iops [1]). Now it seems to be hitting much more often (still mostly on 
OVH nodes).

I've been looking at it for about a week now and I've been using a DNM 
patch to add debug logging, look at dstat --disk-wait output, try 
different mysqld my.cnf settings, and so on:

https://review.opendev.org/701478

So far, what I find is that when we get into the fail state, we get no 
rows back from the database server when we query for nova 'services' and 
'compute_nodes' records, and we fail with the "Timed out waiting for 
response" error.

I haven't figured out why yet. The disk wait doesn't look high when 
this happens (or at any other time during a run), so it doesn't seem 
to be related to disk IO. I'm continuing to look into it.

Cheers,
-melanie

[1] 
http://lists.openstack.org/pipermail/openstack-discuss/2019-November/010505.html


