[nova] New gate bug 1844929, timed out waiting for response from cell during scheduling
I noticed this while looking at a grenade failure on an unrelated patch:

https://bugs.launchpad.net/nova/+bug/1844929

The details are in the bug, but it looks like this showed up around Sept 17 and hits mostly on FortNebula nodes but also OVH nodes. It's restricted to grenade jobs, and while I don't see anything obvious in the rabbitmq logs (the only errors are about uwsgi [api] heartbeat issues), it's possible that these are slower infra nodes and we're just not waiting for something properly during the grenade upgrade. We also don't seem to have the mysql logs published during the grenade jobs, which we need to fix (and recently did fix for devstack jobs [1], but grenade jobs are still using devstack-gate so log collection happens there).

I didn't see any changes in nova, grenade or devstack since Sept 16 that look like they would be related to this, so my guess right now is that it's a combination of performance on certain (slower?) infra nodes and something in grenade/nova not restarting properly or not waiting long enough for the upgrade to complete.

[1] https://github.com/openstack/devstack/commit/f92c346131db2c89b930b1a23f84894...

--
Thanks,
Matt
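For context on the error in the subject line: during scheduling, nova fans queries out to every cell database and treats any cell that does not answer within a fixed timeout as having not responded, logging that it timed out waiting for a response from that cell. A rough, self-contained sketch of that scatter/gather-with-timeout pattern, purely for illustration (the function name, timeout value and sentinel below are made up, not nova's actual code):

# Illustration only -- not nova's implementation; names and values are made up.
import concurrent.futures

CELL_TIMEOUT = 60  # seconds, illustrative

def scatter_gather(cells, query_fn, timeout=CELL_TIMEOUT):
    """Run query_fn against every cell, tolerating slow or unreachable cells."""
    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(cells)) as pool:
        futures = {pool.submit(query_fn, cell): cell for cell in cells}
        done, not_done = concurrent.futures.wait(futures, timeout=timeout)
        for fut in done:
            results[futures[fut]] = fut.result()
        for fut in not_done:
            # A cell whose conductor/database is still coming back up after
            # the upgrade lands here and is reported as having timed out.
            results[futures[fut]] = 'did-not-respond'
    return results

If a cell's services haven't finished restarting after the grenade upgrade on a slow node, its query would land in the "did not respond" bucket, which is consistent with the symptom in the bug.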
On Sun, 22 Sep 2019, 16:39 Matt Riedemann, <mriedemos@gmail.com> wrote:
I noticed this while looking at a grenade failure on an unrelated patch:
https://bugs.launchpad.net/nova/+bug/1844929
Julia recently fixed an issue in ironic caused by a low MTU on fortnebula. May or may not be related.
It looks to me like there are specific jobs on specific providers that are not functioning correctly. I will pick on Fort Nebula for a minute: tacker-functional-devstack-multinode just doesn't seem to work, but most of the other jobs that do something similar work OK.

You can see the load on Fort Nebula here, and looking at the data I don't see any issues with it being overloaded or oversubscribed:

https://grafana.fortnebula.com/d/9MMqh8HWk/openstack-utilization?orgId=2&refresh=30s&from=now-12h&to=now

Also, most jobs are IO/memory bound and Fort Nebula uses local NVMe for all of the OpenStack jobs, so there isn't a reasonable way to make it any faster.

With that said, I would like to get to the bottom of it. It surely doesn't help anyone to have jobs failing for non-code-related reasons.

~/D
It would also be helpful to give the project a way to prefer certain infra providers for certain jobs. For the most part Fort Nebula is terrible at CPU-bound, long-running jobs... I wish I could make it better, but I cannot.

Is there a method we could come up with that would allow us to exploit certain traits of a certain provider? Maybe some additional metadata that says what each provider is best at doing? For example, highly IO-bound jobs work like gangbusters on FN because the underlying storage is very fast, but CPU-bound jobs do the direct opposite.

Thoughts?

~/DonnyD
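To make the metadata idea above concrete, here is a purely hypothetical sketch; the provider names, trait names and the preferred_providers helper are all invented for illustration and are not existing nodepool or Zuul features:

# Hypothetical sketch of the "provider traits" idea; nothing here exists in
# nodepool or Zuul -- the names and traits are made up for illustration.
PROVIDER_TRAITS = {
    'fortnebula': {'fast-io'},       # local NVMe storage, weaker at CPU-bound work
    'other-provider': {'fast-cpu'},  # placeholder entry
}

def preferred_providers(job_traits, providers=PROVIDER_TRAITS):
    """Return providers whose advertised traits cover everything the job needs."""
    return sorted(name for name, traits in providers.items()
                  if job_traits <= traits)

# An IO-bound job would be steered toward providers advertising 'fast-io':
print(preferred_providers({'fast-io'}))   # ['fortnebula']

Something along these lines, expressed as provider metadata that jobs could opt into, is the kind of hint the message above is asking about.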
On 9/22/2019 11:55 AM, Mark Goddard wrote:
Julia recently fixed an issue in ironic caused by a low MTU on fortnebula. May or may not be related.
Thanks, but it looks like that was specific to ironic jobs, and looking at logstash it's fixed:

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22dropped%20over-mtu%20packet%5C%22%20AND%20tags%3A%5C%22syslog.txt%5C%22&from=7d

--
Thanks,
Matt
On 9/22/2019 10:37 AM, Matt Riedemann wrote:
We also don't seem to have the mysql logs published during the grenade jobs which we need to fix (and recently did fix for devstack jobs [1] but grenade jobs are still using devstack-gate so log collection happens there).
The fix for mysql log collection in grenade jobs is here:

https://review.opendev.org/#/c/684042/

I'm just waiting on results to make sure that works before removing the -W.

--
Thanks,
Matt
participants (3)
- Donny Davis
- Mark Goddard
- Matt Riedemann