On 18-03-20 22:50:18, melanie witt wrote:
Hey all,
We've been having a tough time lately in the gate hitting various bugs while our patches go through CI. I just wanted to mention a few of them that I've seen often in my gerrit notifications and give a brief status on fix efforts.
Many thanks for writing this up Mel! Comments below on issues I've been working on.
* http://status.openstack.org/elastic-recheck/#1813789
This one is where the nova-live-migration job fails a server evacuate test with: "Timeout waiting for [('network-vif-plugged', 'e3d3db3f-bce4-4889-b161-4b73648f79be')] for instance with vm_state error and task_state rebuild_spawning.: eventlet.timeout.Timeout: 300 seconds" in the screen-n-cpu.txt log.
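For anyone not familiar with where that traceback comes from: during the rebuild n-cpu waits for neutron to send a network-vif-plugged external event, and that wait sits inside an eventlet timeout (300 seconds by default), so if the event never arrives the Timeout above fires. A minimal standalone sketch of that pattern, purely for illustration (the helper and event plumbing below are made up, not nova's actual code):

# Illustrative sketch only: how an eventlet timeout around an external-event
# wait produces a "Timeout: 300 seconds" error like the one in screen-n-cpu.txt.
import eventlet
from eventlet import event


def wait_for_vif_plugged(ev, timeout=300):
    """Block until the 'network-vif-plugged' event is sent, or time out."""
    with eventlet.Timeout(timeout):
        return ev.wait()


if __name__ == '__main__':
    ev = event.Event()  # nobody ever calls ev.send(), mimicking the bug
    try:
        wait_for_vif_plugged(ev, timeout=1)  # 1s instead of 300s for the demo
    except eventlet.Timeout:
        print("Timeout waiting for [('network-vif-plugged', ...)]")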
lyarwood has a WIP patch here:
This looks like it actually fixes it by fencing the devstack@* services and the running guest domains on the subnode. In the gate now, thanks gibi, sean-k-mooney and stephenfin!
and sean-k-mooney has a WIP patch here:
Might be something we still want to look into under another bug?
* https://launchpad.net/bugs/1867380
This one is where the nova-live-migration or nova-grenade-multinode job fails due to n-cpu restarting slowly after being reconfigured for ceph. The server fails to build because the test begins before nova-compute has fully come up, and we see this error: "Instance spawn was interrupted before instance_claim, setting instance to ERROR state {{(pid=3783) _error_out_instances_whose_build_was_interrupted" in the screen-n-cpu.txt log.
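As an aside, this is the classic "start talking to a service before its restart has finished" race. Purely as an illustration of the kind of guard that avoids it (not necessarily what the patch below does), here's a rough sketch that polls the os-services API until nova-compute on the affected host reports up; the cloud name and host below are assumptions:

# Illustrative only: wait for the nova-compute service on a given host to
# report "up" before proceeding, instead of racing a restarting n-cpu.
# Cloud/connection details below are assumptions for the sketch.
import time

import openstack


def wait_for_compute_up(conn, host, timeout=120, interval=5):
    deadline = time.time() + timeout
    while time.time() < deadline:
        for svc in conn.compute.services():
            if (svc.binary == 'nova-compute' and svc.host == host
                    and svc.state == 'up'):
                return True
        time.sleep(interval)
    raise RuntimeError('nova-compute on %s never came up' % host)


if __name__ == '__main__':
    conn = openstack.connect(cloud='devstack-admin')  # assumed clouds.yaml entry
    wait_for_compute_up(conn, host='subnode')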
lyarwood has a patch approved here that we've been rechecking the heck out of, but it has yet to merge:
Merged on master and backported all the way back to stable/pike on the following topic: https://review.opendev.org/#/q/topic:bug/1867380+status:open
* https://launchpad.net/bugs/1844568
This one is where a job fails with: "Body: b'{"conflictingRequest": {"code": 409, "message": "Multiple possible networks found, use a Network ID to be more specific."}}'"
gmann has a patch proposed to fix some of these here:
https://review.opendev.org/711049
There might be more test classes that need create_default_network = True.
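For context, that 409 is nova refusing to pick a network when the tenant has more than one available, so a test either needs its base class to set up a single default network or the server create needs to name a network explicitly. A rough openstacksdk sketch of the explicit-network variant (the cloud entry, image, flavor and network names are placeholders):

# Illustrative only: avoid "Multiple possible networks found" by passing an
# explicit network ID when booting the server. IDs/names are placeholders.
import openstack

conn = openstack.connect(cloud='devstack')  # assumed clouds.yaml entry

net = conn.network.find_network('private')  # pick one network explicitly
server = conn.compute.create_server(
    name='example-server',
    image_id=conn.compute.find_image('cirros-0.5.1-x86_64-disk').id,
    flavor_id=conn.compute.find_flavor('m1.tiny').id,
    # without this, nova returns the 409 above when several networks exist
    networks=[{'uuid': net.id}],
)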
* http://status.openstack.org/elastic-recheck/#1844929
This one is where a job fails and the following error is seen in one of the logs, usually screen-n-sch.txt: "Timed out waiting for response from cell 8acfb79b-2e40-4e1c-bc3d-d404dac6db90".
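For background on where that message comes from: the query is fanned out to every cell database and each cell only gets a fixed time to answer; a cell that doesn't respond in time is logged with that warning and flagged as not having responded. A generic sketch of that scatter-gather-with-timeout pattern (plain stdlib, not nova's actual code; query_cell is a stand-in):

# Generic sketch of a "scatter-gather across cells with a timeout" pattern,
# to show how a slow cell database turns into "Timed out waiting for
# response from cell <uuid>". query_cell() is a stand-in, not nova code.
import concurrent.futures
import time

CELL_TIMEOUT = 60  # seconds to wait for each cell before giving up


def query_cell(cell_uuid):
    # Stand-in for "SELECT ... FROM services/compute_nodes" in that cell's DB.
    time.sleep(0.1)
    return ['fake-row-for-%s' % cell_uuid]


def scatter_gather(cell_uuids, timeout=CELL_TIMEOUT):
    results = {}
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = {uuid: pool.submit(query_cell, uuid) for uuid in cell_uuids}
        for uuid, fut in futures.items():
            try:
                results[uuid] = fut.result(timeout=timeout)
            except concurrent.futures.TimeoutError:
                print('Timed out waiting for response from cell %s' % uuid)
                results[uuid] = None  # treated as "cell did not respond"
    return results


if __name__ == '__main__':
    print(scatter_gather(['8acfb79b-2e40-4e1c-bc3d-d404dac6db90', 'cell0']))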
The TL;DR on this one is there's no immediate clue why it's happening. This bug used to hit only occasionally, mostly on "slow" nodes like those from the OVH or INAP providers (OVH restricts disk iops [1]). Now it seems to be hitting much more often (still mostly on OVH nodes).
I've been looking at it for about a week now and I've been using a DNM patch to add debug logging, look at dstat --disk-wait output, try mysqld my.cnf settings, etc:
https://review.opendev.org/701478
So far, what I find is that when we get into the fail state, we get no rows back from the database server when we query for nova 'services' and 'compute_nodes' records, and we fail with the "Timed out waiting for response" error.
Haven't figured out why yet. The disk wait doesn't look high when this happens (or at any time during a run), so it doesn't seem related to disk IO. I'm continuing to look into it.
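If anyone else wants to poke at this, one quick sanity check is to count the 'services' and 'compute_nodes' rows directly in the cell database while the scheduler is timing out, to confirm whether the rows are really there and it's only the query path that's stalling. A rough sketch with PyMySQL (host, credentials and database name are assumptions for a typical devstack node):

# Rough debugging sketch: count the nova 'services' and 'compute_nodes' rows
# directly in the cell DB while the scheduler is timing out. Connection
# details below are assumptions for a typical devstack box.
import pymysql

conn = pymysql.connect(host='127.0.0.1', user='root',
                       password='secretdatabase', database='nova_cell1')
try:
    with conn.cursor() as cur:
        for table in ('services', 'compute_nodes'):
            cur.execute('SELECT COUNT(*) FROM %s' % table)  # trusted names only
            print('%s rows: %s' % (table, cur.fetchone()[0]))
finally:
    conn.close()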
Cheers, -melanie
[1] http://lists.openstack.org/pipermail/openstack-discuss/2019-November/010505....
I've also hit the following a few times on master: test_update_delete_extra_route failing due to timeout when creating subnets:

https://bugs.launchpad.net/neutron/+bug/1867936

I'll try to write up a logstash query for this now and post a review for elastic-recheck.

Thanks again,

--
Lee Yarwood    A5D1 9385 88CB 7E5F BE64 6618 BCA6 6E33 F672 2D76