[nova][gate] status of some gate bugs

Arnaud Morin arnaud.morin at gmail.com
Thu Mar 19 08:25:48 UTC 2020


Hey Melanie, all,

About OVH case (company I work for).
We are digging into the issue.

First thing: we no longer limit IOPS. I don't remember exactly when we
removed that limit, but it is not recent.

However, the hypervisors are quite old now, and our policy on these old
servers was to use some swap. We think the hosts may slow down when
overcommitting on RAM (and swapping to disk).
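
For reference, a quick way to check whether a host is really swapping under
load is to sample the kernel counters in /proc/vmstat; a rough sketch (plain
Python, nothing OVH-specific):

    import time

    def swap_counters():
        """Read cumulative pages-swapped-in/out counters since boot."""
        counters = {}
        with open("/proc/vmstat") as f:
            for line in f:
                key, value = line.split()
                if key in ("pswpin", "pswpout"):
                    counters[key] = int(value)
        return counters

    before = swap_counters()
    time.sleep(10)
    after = swap_counters()
    # non-zero deltas mean the host swapped pages during the interval
    print({key: after[key] - before[key] for key in after})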

We also know that we can get better latency by upgrading QEMU. We are
currently in the middle of testing a new QEMU version; I will push to
upgrade your hypervisors first, so we can see whether lower latency on the
QEMU side helps the gate.

Finally, we plan to replace the hardware and stop overcommitting RAM (and
swapping to disk). I don't have an ETA for that yet, but it will certainly
improve the IOPS.

I'll keep you posted.

Cheers,

-- 
Arnaud Morin

On 18.03.20 - 22:50, melanie witt wrote:
> Hey all,
> 
> We've been having a tough time lately in the gate hitting various bugs while
> our patches go through CI. I just wanted to mention a few of them that I've
> seen often in my gerrit notifications and give a brief status on fix
> efforts.
> 
> * http://status.openstack.org/elastic-recheck/#1813789
> 
> This one is where the nova-live-migration job fails a server evacuate test
> with: "Timeout waiting for [('network-vif-plugged',
> 'e3d3db3f-bce4-4889-b161-4b73648f79be')] for instance with vm_state error
> and task_state rebuild_spawning.: eventlet.timeout.Timeout: 300 seconds" in
> the screen-n-cpu.txt log.
> 
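> For context, the 300 seconds in that message matches nova's default
> vif_plugging_timeout. While debugging (this is not what the WIP patches
> below do), the behaviour can be tuned in nova.conf, along these lines:
> 
>     [DEFAULT]
>     # how long nova-compute waits for the network-vif-plugged event
>     # (the default is 300 seconds)
>     vif_plugging_timeout = 300
>     # if False, a missing event is logged and the boot continues instead
>     # of putting the instance into ERROR
>     vif_plugging_is_fatal = False
> 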
> lyarwood has a WIP patch here:
> 
> https://review.opendev.org/713674
> 
> and sean-k-mooney has a WIP patch here:
> 
> https://review.opendev.org/713342
> 
> * https://launchpad.net/bugs/1867380
> 
> This one is where the nova-live-migration or nova-grenade-multinode jobs
> fail due to n-cpu restarting slowly after being reconfigured for ceph. The
> server fails to build because the test begins before nova-compute has fully
> come up, and we see this error: "Instance spawn was interrupted before
> instance_claim, setting instance to ERROR state {{(pid=3783)
> _error_out_instances_whose_build_was_interrupted" in the screen-n-cpu.txt
> log.
> 
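> (Not the actual fix, but for illustration: the guard the job effectively
> needs is "wait until the restarted nova-compute reports up before starting
> tests". A rough openstacksdk sketch, assuming a clouds.yaml entry named
> "devstack-admin" -- a placeholder, not something from the job:)
> 
>     import time
> 
>     import openstack
> 
>     conn = openstack.connect(cloud="devstack-admin")
> 
>     def wait_for_compute_up(host, timeout=300, interval=5):
>         """Poll the compute services list until nova-compute on host is up."""
>         deadline = time.time() + timeout
>         while time.time() < deadline:
>             for svc in conn.compute.services():
>                 if (svc.binary == "nova-compute" and svc.host == host
>                         and svc.state == "up"):
>                     return
>             time.sleep(interval)
>         raise TimeoutError("nova-compute on %s never reported up" % host)
> 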
> lyarwood has an approved patch here that we've been rechecking the heck out
> of, but it has yet to merge:
> 
> https://review.opendev.org/713035
> 
> * https://launchpad.net/bugs/1844568
> 
> This one is where a job fails with: "Body: b'{"conflictingRequest": {"code":
> 409, "message": "Multiple possible networks found, use a Network ID to be
> more specific."}}'"
> 
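> (The workaround the error message points at is simply being explicit about
> the network when booting. A rough openstacksdk sketch with placeholder
> names, just to illustrate the API shape -- not what gmann's patch does:)
> 
>     import openstack
> 
>     conn = openstack.connect(cloud="devstack-admin")  # placeholder cloud name
> 
>     image = conn.compute.find_image("cirros-0.4.0-x86_64-disk")
>     flavor = conn.compute.find_flavor("m1.tiny")
>     network = conn.network.find_network("private")
> 
>     # Naming the network avoids the 409 "Multiple possible networks found"
>     # response when the tenant has more than one network available.
>     server = conn.compute.create_server(
>         name="explicit-network-example",
>         image_id=image.id,
>         flavor_id=flavor.id,
>         networks=[{"uuid": network.id}],
>     )
> 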
> gmann has a patch proposed to fix some of these here:
> 
> https://review.opendev.org/711049
> 
> There might be more test classes that need create_default_network = True.
> 
> * http://status.openstack.org/elastic-recheck/#1844929
> 
> This one is where a job fails and the following error is seen in one of the
> logs, usually screen-n-sch.txt: "Timed out waiting for response from cell
> 8acfb79b-2e40-4e1c-bc3d-d404dac6db90".
> 
> The TL;DR on this one is that there's no immediate clue why it's happening.
> This bug used to hit only occasionally, on "slow" nodes such as those from
> the OVH or INAP providers (OVH restricts disk iops [1]). Now it seems to be
> hitting much more often (still mostly on OVH nodes).
> 
> I've been looking at it for about a week now and I've been using a DNM patch
> to add debug logging, look at dstat --disk-wait output, try mysqld my.cnf
> settings, etc:
> 
> https://review.opendev.org/701478
> 
> So far, what I find is that when we get into the fail state, we get no rows
> back from the database server when we query for nova 'services' and
> 'compute_nodes' records, and we fail with the "Timed out waiting for
> response" error.
> 
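> (For anyone else poking at this on a held node: the records in question
> live in the cell database, so a manual sanity check looks roughly like the
> sketch below. The table and column names are nova's; the connection details
> and the "nova_cell1" schema name are devstack-style placeholders:)
> 
>     import pymysql
> 
>     conn = pymysql.connect(host="127.0.0.1", user="root",
>                            password="REPLACE_ME", database="nova_cell1")
>     with conn.cursor() as cur:
>         # these are the records we query for and get no rows back from
>         cur.execute("SELECT host, last_seen_up FROM services "
>                     "WHERE `binary` = 'nova-compute'")
>         print(cur.fetchall())
>         cur.execute("SELECT hypervisor_hostname, updated_at FROM compute_nodes")
>         print(cur.fetchall())
> 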
> I haven't figured out why yet. The disk wait doesn't look high when this
> happens (or at any other time during a run), so it doesn't seem to be
> related to disk IO. I'm continuing to look into it.
> 
> Cheers,
> -melanie
> 
> [1] http://lists.openstack.org/pipermail/openstack-discuss/2019-November/010505.html
> 


