<div dir="ltr">It looks to me like there are specific jobs on specific providers that are not functioning correctly. <div><br></div><div>I will pick on Fort Nebula for a minute.</div><div><br></div><div>tacker-functional-devstack-multinode just doesn't seem to work, but most of the other jobs that do something similar work OK. </div><div><br></div><div>You can see the load on Fort Nebula here, and looking at the data, I don't see any sign of it being overloaded or oversubscribed. </div><div><a href="https://grafana.fortnebula.com/d/9MMqh8HWk/openstack-utilization?orgId=2&refresh=30s&from=now-12h&to=now">https://grafana.fortnebula.com/d/9MMqh8HWk/openstack-utilization?orgId=2&refresh=30s&from=now-12h&to=now</a><br></div><div><br></div><div>Also, most jobs are I/O- or memory-bound, and Fort Nebula uses local NVMe for all of the OpenStack jobs. There isn't a reasonable way to make it any faster.</div><div><br></div><div>With that said, I would like to get to the bottom of it. It surely doesn't help anyone to have jobs failing for non-code-related reasons. </div><div><br></div><div>~/D</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sun, Sep 22, 2019 at 12:58 PM Mark Goddard <<a href="mailto:mark@stackhpc.com">mark@stackhpc.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="auto"><div><br><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sun, 22 Sep 2019, 16:39 Matt Riedemann, <<a href="mailto:mriedemos@gmail.com" target="_blank">mriedemos@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">I noticed this while looking at a grenade failure on an unrelated patch:<br>

<br>

<a href="https://bugs.launchpad.net/nova/+bug/1844929" rel="noreferrer noreferrer" target="_blank">https://bugs.launchpad.net/nova/+bug/1844929</a><br>

<br>

The details are in the bug but it looks like this showed up around Sept <br>

17 and hits mostly on FortNebula nodes but also OVH nodes. It's <br>

restricted to grenade jobs and while I don't see anything obvious in the <br>

rabbitmq logs (the only errors are about uwsgi [api] heartbeat issues), <br>

it's possible that these are slower infra nodes and we're just not <br>

waiting for something properly during the grenade upgrade. We also don't <br>

seem to have the MySQL logs published during the grenade jobs, which we <br>

need to fix (and recently did fix for devstack jobs [1] but grenade jobs <br>

are still using devstack-gate so log collection happens there).<br>

<br>

I didn't see any changes in nova, grenade or devstack since Sept 16 that <br>

look like they would be related to this so I'm guessing right now it's <br>

just a combination of performance on certain infra nodes (slower?) and <br>

something in grenade/nova not restarting properly or not waiting long <br>

enough for the upgrade to complete.<br></blockquote></div></div><div dir="auto"><br></div><div dir="auto">Julia recently fixed an issue in ironic caused by a low MTU on fortnebula. May or may not be related.</div><div dir="auto"><br></div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

[1] <br>

<a href="https://github.com/openstack/devstack/commit/f92c346131db2c89b930b1a23f8489419a2217dc" rel="noreferrer noreferrer" target="_blank">https://github.com/openstack/devstack/commit/f92c346131db2c89b930b1a23f8489419a2217dc</a><br>

<br>

-- <br>

<br>

Thanks,<br>

<br>

Matt<br>

<br>

</blockquote></div></div></div>

</blockquote></div>