[nova][gate] status of some gate bugs

Lee Yarwood lyarwood at redhat.com
Thu Mar 19 14:27:31 UTC 2020


On 18-03-20 22:50:18, melanie witt wrote:
> Hey all,
> 
> We've been having a tough time lately in the gate hitting various bugs while
> our patches go through CI. I just wanted to mention a few of them that I've
> seen often in my gerrit notifications and give a brief status on fix
> efforts.

Many thanks for writing this up, Mel!

Comments below on issues I've been working on.
 
> * http://status.openstack.org/elastic-recheck/#1813789
> 
> This one is where the nova-live-migration job fails a server evacuate test
> with: "Timeout waiting for [('network-vif-plugged',
> 'e3d3db3f-bce4-4889-b161-4b73648f79be')] for instance with vm_state error
> and task_state rebuild_spawning.: eventlet.timeout.Timeout: 300 seconds" in
> the screen-n-cpu.txt log.
> 
> lyarwood has a WIP patch here:
> 
> https://review.opendev.org/713674

This looks like it actually fixes the issue by fencing the devstack@* services
and the running guest domains on the subnode. It's in the gate now, thanks
gibi, sean-k-mooney and stephenfin!
 
> and sean-k-mooney has a WIP patch here:
> 
> https://review.opendev.org/713342

Might be something we still want to look into under another bug?
 
> * https://launchpad.net/bugs/1867380
> 
> This one is where the nova-live-migration or nova-grenade-multinode job fails
> due to n-cpu restarting slowly after being reconfigured for ceph. The server
> fails to build because the test begins before nova-compute has
> fully come up and we see this error: "Instance spawn was interrupted before
> instance_claim, setting instance to ERROR state {{(pid=3783)
> _error_out_instances_whose_build_was_interrupted" in the screen-n-cpu.txt
> log.
> 
> lyarwood has a patch approved here that we've been rechecking the heck out
> of that has yet to merge:
> 
> https://review.opendev.org/713035

Merged on master and backported all the way back to stable/pike on the
following topic:

https://review.opendev.org/#/q/topic:bug/1867380+status:open
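
For anyone hitting the same class of race elsewhere, the general idea of
holding tests back until a service has come up can be sketched as below. This
is only an illustration, not the change in the review above: wait_until and
its probe argument are hypothetical names, and the real probe (for example a
check based on the output of `openstack compute service list`) is left to the
caller.

```python
import time
from typing import Callable


def wait_until(probe: Callable[[], bool],
               timeout: float = 60.0,
               interval: float = 1.0,
               clock: Callable[[], float] = time.monotonic,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll probe() until it returns True or `timeout` seconds elapse.

    `probe` is a hypothetical caller-supplied check, e.g. "does
    nova-compute report itself as up yet?". `clock` and `sleep` are
    injectable only so the helper can be exercised without real delays.
    """
    deadline = clock() + timeout
    while True:
        if probe():
            return True   # service came up, safe to start the tests
        if clock() >= deadline:
            return False  # gave up, caller should fail the job early
        sleep(interval)
```

A job would call this after restarting the service and bail out if it returns
False, rather than letting the first test trip over a half-started compute.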

> * https://launchpad.net/bugs/1844568
> 
> This one is where a job fails with: "Body: b'{"conflictingRequest": {"code":
> 409, "message": "Multiple possible networks found, use a Network ID to be
> more specific."}}'"
> 
> gmann has a patch proposed to fix some of these here:
> 
> https://review.opendev.org/711049
> 
> There might be more test classes that need create_default_network = True.
> 
> * http://status.openstack.org/elastic-recheck/#1844929
> 
> This one is where a job fails and the following error is seen in one of the
> logs, usually screen-n-sch.txt: "Timed out waiting for response from cell
> 8acfb79b-2e40-4e1c-bc3d-d404dac6db90".
> 
> The TL;DR on this one is there's no immediate clue why it's happening. This
> bug used to hit only occasionally, on "slow" nodes like those from the OVH or
> INAP providers (and OVH restricts disk iops [1]). Now, it seems like it's
> hitting much more often (still mostly on OVH nodes).
> 
> I've been looking at it for about a week now and I've been using a DNM patch
> to add debug logging, look at dstat --disk-wait output, try mysqld my.cnf
> settings, etc:
> 
> https://review.opendev.org/701478
> 
> So far, what I find is that when we get into the fail state, we get no rows
> back from the database server when we query for nova 'services' and
> 'compute_nodes' records, and we fail with the "Timed out waiting for
> response" error.
> 
> I haven't figured out why yet. The disk wait doesn't look high when this
> happens (or at any time during a run), so it doesn't seem to be related to
> disk IO. I'm continuing to look into it.
> 
> Cheers,
> -melanie
> 
> [1] http://lists.openstack.org/pipermail/openstack-discuss/2019-November/010505.html

I've also hit the following a few times on master:

test_update_delete_extra_route failing due to timeout when creating subnets
https://bugs.launchpad.net/neutron/+bug/1867936

I'll try to write up a logstash query for this now and post a review for
recheck.
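
As a rough idea of what such a query looks like, here is a small sketch that
builds the Lucene-style string. The `message` and `tags` field names are my
assumption of the usual fields in these queries, and build_query is a
hypothetical helper. I don't have the exact log message for bug 1867936 to
hand, so the usage below reuses the n-cpu error quoted earlier in this thread
for bug 1867380.

```python
def build_query(message: str, log_file: str) -> str:
    """Build a Lucene-style query matching `message` in `log_file`.

    Assumes the indexed fields are called `message` and `tags`; adjust
    if the logstash index uses different names.
    """
    # Escape backslashes first, then double quotes, so the phrase
    # stays a single quoted term.
    escaped = message.replace("\\", "\\\\").replace('"', '\\"')
    return f'message:"{escaped}" AND tags:"{log_file}"'


if __name__ == "__main__":
    print(build_query(
        "Instance spawn was interrupted before instance_claim, "
        "setting instance to ERROR state",
        "screen-n-cpu.txt",
    ))
```

The resulting string would then go under the query key of a per-bug YAML file
once it has been checked against real failures.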

Thanks again,

-- 
Lee Yarwood                 A5D1 9385 88CB 7E5F BE64  6618 BCA6 6E33 F672 2D76