[nova][gate] status of some gate bugs
Sean Mooney
smooney at redhat.com
Thu Mar 19 15:08:09 UTC 2020
On Thu, 2020-03-19 at 14:27 +0000, Lee Yarwood wrote:
> On 18-03-20 22:50:18, melanie witt wrote:
> > Hey all,
> >
> > We've been having a tough time lately in the gate hitting various bugs while
> > our patches go through CI. I just wanted to mention a few of them that I've
> > seen often in my gerrit notifications and give a brief status on fix
> > efforts.
>
> Many thanks for writing this up Mel!
>
> Comments below on issues I've been working on.
>
> > * http://status.openstack.org/elastic-recheck/#1813789
> >
> > This one is where the nova-live-migration job fails a server evacuate test
> > with: "Timeout waiting for [('network-vif-plugged',
> > 'e3d3db3f-bce4-4889-b161-4b73648f79be')] for instance with vm_state error
> > and task_state rebuild_spawning.: eventlet.timeout.Timeout: 300 seconds" in
> > the screen-n-cpu.txt log.
> >
> > lyarwood has a WIP patch here:
> >
> > https://review.opendev.org/713674
>
> This looks like it actually fixes it by fencing the devstack@* services
> and running guest domains on the subnode. In the gate now, thanks gibi,
> sean-k-mooney and stephenfin!
>
> > and sean-k-mooney has a WIP patch here:
> >
> > https://review.opendev.org/713342
>
> Might be something we still want to look into under another bug?
Ya, so I think we can file a separate bug for the
"intermittently fails with network-vif-plugged timeout exception for shelve" issue.
We have already fixed it for resize revert, but as part of that mriedemann noted
that it was broken for shelve
(https://bugs.launchpad.net/nova/+bug/1813789/comments/3)
and it's theoretically broken for evacuate too.
So what I can do is file the new bug and then track
https://bugs.launchpad.net/nova/+bug/1813789 as a related bug, rather than having the
fix above specifically close it, since I think your patch above fixes it sufficiently.
My [WIP] patch seems to have some other side effect that I was not expecting, so I need
to look into that more closely.
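
For context on where that 300 second timeout comes from: the compute manager
registers the set of network-vif-plugged events it expects and then blocks
until neutron delivers them via the external events API. Very roughly the
pattern looks like this (a simplified illustration, not the actual nova code;
the function and variable names are made up):

# simplified illustration of the "wait for network-vif-plugged" pattern,
# not the actual nova code: the caller registers the events it expects
# and then blocks, with a timeout, until neutron delivers them through
# the external events API.
import eventlet

VIF_PLUG_TIMEOUT = 300  # seconds, matching the timeout seen in the logs


def wait_for_vif_plugged_events(expected_events, event_queue):
    """Block until every expected (name, tag) event arrives or we time out.

    expected_events: e.g. [('network-vif-plugged', '<port uuid>')]
    event_queue: an eventlet queue fed by the external events handler
                 as neutron notifications arrive.
    """
    pending = set(expected_events)
    with eventlet.Timeout(VIF_PLUG_TIMEOUT):  # raises eventlet.timeout.Timeout
        while pending:
            pending.discard(event_queue.get())

# if neutron plugs the vif before the caller starts waiting (the race we
# hit for shelve/evacuate), the event never lands in the queue and the
# Timeout above is the only possible outcome.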
>
> > * https://launchpad.net/bugs/1867380
> >
> > This one is where the nova-live-migration or nova-grenade-multinode jobs fail
> > due to n-cpu restarting slowly after being reconfigured for ceph. The server
> > fails to build because the test begins before nova-compute has fully come up,
> > and we see this error: "Instance spawn was interrupted before
> > instance_claim, setting instance to ERROR state {{(pid=3783)
> > _error_out_instances_whose_build_was_interrupted" in the screen-n-cpu.txt
> > log.
> >
> > lyarwood has a patch approved here that we've been rechecking the heck out
> > of, but it has yet to merge:
> >
> > https://review.opendev.org/713035
>
> Merged on master and backported all the way back to stable/pike on the
> following topic:
>
> https://review.opendev.org/#/q/topic:bug/1867380+status:open
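
Just to illustrate the shape of that kind of fix for anyone following along:
the race is simply that tempest starts before the restarted nova-compute has
finished initialising, so the general remedy is to poll until the compute
services report up before running tests. Something like the sketch below
(illustrative only, using openstacksdk and a made-up clouds.yaml entry; it is
not what the patch above actually does):

# illustrative only: poll until the restarted nova-compute services
# report "up" before letting tempest start; this is not the content of
# the patch above, just the general shape of that kind of fix.
import time

import openstack  # openstacksdk


def wait_for_computes(expected_count, timeout=300, interval=5):
    conn = openstack.connect(cloud='devstack-admin')  # placeholder cloud name
    deadline = time.time() + timeout
    while time.time() < deadline:
        up = [svc for svc in conn.compute.services()
              if svc.binary == 'nova-compute' and svc.state == 'up']
        if len(up) >= expected_count:
            return
        time.sleep(interval)
    raise RuntimeError('nova-compute never reported up within %ds' % timeout)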
>
> > * https://launchpad.net/bugs/1844568
> >
> > This one is where a job fails with: "Body: b'{"conflictingRequest": {"code":
> > 409, "message": "Multiple possible networks found, use a Network ID to be
> > more specific."}}'"
> >
> > gmann has a patch proposed to fix some of these here:
> >
> > https://review.opendev.org/711049
> >
> > There might be more test classes that need create_default_network = True.
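
For reference, the tempest side of that is basically just opting the affected
test classes into the default network. Roughly like this (an illustrative
example with a made-up test class name, not the exact content of gmann's
patch):

# rough sketch of the kind of change the affected tempest test classes
# need (the class name here is made up): with create_default_network set,
# the compute base class creates a network up front, so servers are not
# booted with multiple candidate networks and the 409 goes away.
from tempest.api.compute import base


class ExampleServersTest(base.BaseV2ComputeTest):

    create_default_network = True

    def test_boot_server(self):
        # no longer trips "Multiple possible networks found"
        self.create_test_server(wait_until='ACTIVE')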
> >
> > * http://status.openstack.org/elastic-recheck/#1844929
> >
> > This one is where a job fails and the following error is seen in one of the
> > logs, usually screen-n-sch.txt: "Timed out waiting for response from cell
> > 8acfb79b-2e40-4e1c-bc3d-d404dac6db90".
> >
> > The TL;DR on this one is that there's no immediate clue why it's happening.
> > This bug used to hit only occasionally, on "slow" nodes like those from the
> > OVH or INAP providers (and OVH restricts disk iops [1]). Now it seems to be
> > hitting much more often (still mostly on OVH nodes).
> >
> > I've been looking at it for about a week now and I've been using a DNM patch
> > to add debug logging, look at dstat --disk-wait output, try mysqld my.cnf
> > settings, etc:
> >
> > https://review.opendev.org/701478
> >
> > So far, what I find is that when we get into the fail state, we get no rows
> > back from the database server when we query for nova 'services' and
> > 'compute_nodes' records, and we fail with the "Timed out waiting for
> > response" error.
> >
> > I haven't figured out why yet. The disk wait doesn't look high when this
> > happens (or at any time during a run), so it doesn't seem to be related to
> > disk IO. I'm continuing to look into it.
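
For anyone unfamiliar with where that message comes from: as I understand it,
the query is fanned out to every cell with a timeout, and a cell that does not
answer in time is reported with exactly that warning. A stripped-down sketch
of the pattern (illustrative only, not nova's actual implementation; the names
are made up):

# stripped-down sketch of the scatter-gather-with-timeout pattern that
# produces the "Timed out waiting for response from cell" warning; this
# is illustrative, not nova's actual implementation.
import logging

import eventlet
from eventlet import queue

LOG = logging.getLogger(__name__)
CELL_TIMEOUT = 60  # seconds to wait for each cell to answer

did_not_respond = object()  # sentinel for a cell that never answered


def scatter_gather_cells(cells, fn):
    results = {cell: did_not_respond for cell in cells}
    replies = queue.LightQueue()

    for cell in cells:
        # fn would be the DB query, e.g. listing services/compute_nodes
        eventlet.spawn(lambda c=cell: replies.put((c, fn(c))))

    with eventlet.Timeout(CELL_TIMEOUT, exception=False):
        for _ in cells:
            cell, result = replies.get()
            results[cell] = result

    for cell, result in results.items():
        if result is did_not_respond:
            LOG.warning('Timed out waiting for response from cell %s', cell)
    return results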
> >
> > Cheers,
> > -melanie
> >
> > [1] http://lists.openstack.org/pipermail/openstack-discuss/2019-November/010505.html
>
> I've also hit the following a few times on master:
>
> test_update_delete_extra_route failing due to timeout when creating subnets
> https://bugs.launchpad.net/neutron/+bug/1867936
>
> I'll try to write up a logstash query for this now and post a review for
> recheck.
>
> Thanks again,
>