[nova][gate] status of some gate bugs
melanie witt
melwittt at gmail.com
Thu Mar 19 05:50:18 UTC 2020
Hey all,
We've been having a tough time lately in the gate hitting various bugs
while our patches go through CI. I just wanted to mention a few of them
that I've seen often in my gerrit notifications and give a brief status
on fix efforts.
* http://status.openstack.org/elastic-recheck/#1813789
This one is where the nova-live-migration job fails a server evacuate
test with: "Timeout waiting for [('network-vif-plugged',
'e3d3db3f-bce4-4889-b161-4b73648f79be')] for instance with vm_state
error and task_state rebuild_spawning.: eventlet.timeout.Timeout: 300
seconds" in the screen-n-cpu.txt log.
lyarwood has a WIP patch here:
https://review.opendev.org/713674
and sean-k-mooney has a WIP patch here:
https://review.opendev.org/713342
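For background, nova-compute blocks on the network-vif-plugged external
event from neutron, bounded by the vif_plugging_timeout option (300
seconds by default, which matches the error above). Here's a minimal
sketch of that wait pattern, assuming a toy event queue rather than
nova's real external event API:

    import eventlet
    eventlet.monkey_patch()

    # Toy stand-in for nova-compute waiting on neutron external events.
    # wait_for_events and the queue are illustrative, not nova's API;
    # the deadline mirrors CONF.vif_plugging_timeout (default 300).
    def wait_for_events(event_queue, expected, deadline=300):
        received = set()
        with eventlet.timeout.Timeout(deadline):
            while received != expected:
                received.add(event_queue.get())  # blocks until an event arrives

    q = eventlet.queue.Queue()
    expected = {('network-vif-plugged', 'e3d3db3f-bce4-4889-b161-4b73648f79be')}
    try:
        wait_for_events(q, expected, deadline=1)  # nothing feeds q, so we time out
    except eventlet.timeout.Timeout:
        print('Timeout waiting for %s' % sorted(expected))

When the event never arrives (e.g. the vif plug is lost during
evacuate), the Timeout fires and the instance ends up in the error
state seen in the test failure.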
* https://launchpad.net/bugs/1867380
This one is where the nova-live-migration or nova-grenade-multinode job
fails due to n-cpu restarting slowly after being reconfigured for ceph.
The server fails to build because the test begins before nova-compute
has fully come up, and we see this error: "Instance spawn
was interrupted before instance_claim, setting instance to ERROR state
{{(pid=3783) _error_out_instances_whose_build_was_interrupted" in the
screen-n-cpu.txt log.
lyarwood has an approved patch here that we've been rechecking the heck
out of, but it has yet to merge:
https://review.opendev.org/713035
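Until that merges, the race itself is straightforward: the job
reconfigures n-cpu for ceph, restarts it, and the test starts building
a server before startup has finished. Here's a hedged sketch of the kind
of guard that would avoid this, polling the compute service state with
openstacksdk (the cloud name is an assumption, and service state is a
heartbeat, so it's only an approximation of "fully up"):

    import time
    import openstack

    conn = openstack.connect(cloud='devstack-admin')  # assumed cloud name

    def wait_for_compute_up(host, timeout=120, interval=5):
        # Poll os-services until nova-compute on this host reports up.
        deadline = time.time() + timeout
        while time.time() < deadline:
            for svc in conn.compute.services():
                if (svc.binary == 'nova-compute' and svc.host == host
                        and svc.state == 'up'):
                    return
            time.sleep(interval)
        raise RuntimeError('nova-compute on %s not up after %ss'
                           % (host, timeout))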
* https://launchpad.net/bugs/1844568
This one is where a job fails with: "Body: b'{"conflictingRequest":
{"code": 409, "message": "Multiple possible networks found, use a
Network ID to be more specific."}}'"
gmann has a patch proposed to fix some of these here:
https://review.opendev.org/711049
There might be more test classes that need create_default_network = True.
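The 409 itself is nova refusing to guess: when a project has more than
one network, a server create that doesn't name one is rejected, so
affected tests either need the default-network setup or an explicit
network in the boot request. A sketch of the explicit-network request
with openstacksdk (cloud, image, and flavor names are assumptions):

    import openstack

    conn = openstack.connect(cloud='devstack')  # assumed cloud name
    image = conn.compute.find_image('cirros-0.4.0-x86_64-disk')
    flavor = conn.compute.find_flavor('m1.tiny')
    net = conn.network.find_network('private')

    # Omitting 'networks' in a multi-network project is what produces
    # the 409 "Multiple possible networks found" above.
    server = conn.compute.create_server(
        name='demo',
        image_id=image.id,
        flavor_id=flavor.id,
        networks=[{'uuid': net.id}],  # explicit network ID avoids the 409
    )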
* http://status.openstack.org/elastic-recheck/#1844929
This one is where a job fails and the following error is seen in one of
the logs, usually screen-n-sch.txt: "Timed out waiting for response from
cell 8acfb79b-2e40-4e1c-bc3d-d404dac6db90".
The TL;DR on this one is there's no immediate clue why it's happening.
This bug used to hit only occasionally, on "slow" nodes like nodes from
the OVH or INAP providers (OVH restricts disk IOPS [1]). Now it seems
to be hitting much more often (still mostly on OVH nodes).
I've been looking at it for about a week now, using a DNM patch to add
debug logging, look at the dstat --disk-wait output, try mysqld my.cnf
settings, etc.:
https://review.opendev.org/701478
So far, what I've found is that when we get into the failed state, we
get no rows back from the database server when we query for nova
'services' and 'compute_nodes' records, and we fail with the "Timed out
waiting for response" error.
I haven't figured out why yet. The disk wait doesn't look high when
this happens (or at any other time during a run), so it doesn't seem to
be related to disk I/O. I'm continuing to look into it.
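For anyone digging in, that message comes from nova's scatter-gather of
the cell databases: each cell query runs in a greenthread and the
caller gives up on cells that haven't answered by the deadline. A
minimal sketch of the pattern, with toy names (nova's real code lives
in nova.context.scatter_gather_cells and records a did-not-respond
sentinel per cell):

    import eventlet
    eventlet.monkey_patch()

    DID_NOT_RESPOND = object()

    def scatter_gather(cells, query, timeout=60):
        # Run the query against every cell in parallel, then give up
        # on any cell that hasn't responded within the timeout.
        pool = eventlet.GreenPool()
        results = {}

        def run(cell):
            results[cell] = query(cell)

        for cell in cells:
            pool.spawn_n(run, cell)

        with eventlet.timeout.Timeout(timeout, False):  # False = don't raise
            pool.waitall()

        for cell in cells:
            if cell not in results:
                print('Timed out waiting for response from cell %s' % cell)
                results[cell] = DID_NOT_RESPOND
        return results

In the gate failure, the cell query (the SELECT on 'services' and
'compute_nodes') simply never returns rows before the deadline, which
is why the timeout fires without any obvious disk or CPU pressure.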
Cheers,
-melanie
[1]
http://lists.openstack.org/pipermail/openstack-discuss/2019-November/010505.html