Hi,

On Thu, Oct 31, 2019 at 10:23:01AM -0500, Matt Riedemann wrote:
Things are great! Surprise! I just wanted to let everyone know. Later!

. . . . .

Now that you've been tricked, on Halloween no less, I'm here to tell you that things suck right now. This is your periodic digest of issues. Grab some fun-sized candy bars and read on.
I think right now we have three major issues.
1. http://status.openstack.org/elastic-recheck/index.html#1763070
This has resurfaced and I'm not sure why, nor do I think we ever had a great handle on what is causing it or how to work around it, so if anyone has new ideas please chip in.
I think that this is "just" some slowdown of the node the job is running on. I noticed it in some neutron jobs too and checked a few of them. It seems that a single API request is processed for a very long time. For example, in one of the fresh examples: https://13cf3dd11b8f009809dc-97cb3b32849366f5bed744685e46b266.ssl.cf5.rackcd... it was a request to nova that took a very long time:

Oct 31 16:55:08.632162 ubuntu-bionic-inap-mtl01-0012620879 devstack@n-api.service[7191]: INFO nova.api.openstack.requestlog [None req-275af2df-bd4e-4e64-b46e-6582e8de5148 tempest-ServerDiskConfigTestJSON-1598674508 tempest-ServerDiskConfigTestJSON-1598674508] 198.72.124.104 "POST /compute/v2.1/servers/d15d2033-b29b-44f7-b619-ed7ef83fe477/action" status: 500 len: 216 microversion: 2.1 time: 161.951140
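In case it helps anyone digging through those logs, here's a rough Python sketch of pulling the slow requests out of a downloaded n-api log, based on the requestlog line format above. To be clear, this is nothing that exists in the gate tooling, and the log file name and the 10 second threshold are just guesses for illustration:

    import re
    import sys

    # Matches the method/path and the "time: <seconds>" field that
    # nova.api.openstack.requestlog appends to each request line.
    REQUEST_RE = re.compile(
        r'"(?P<method>[A-Z]+) (?P<path>\S+)" status: (?P<status>\d+)'
        r'.* time: (?P<secs>\d+\.\d+)')

    THRESHOLD = 10.0  # seconds; arbitrary cut-off for "slow"

    def slow_requests(log_path):
        with open(log_path, errors='replace') as f:
            for line in f:
                m = REQUEST_RE.search(line)
                if m and float(m.group('secs')) > THRESHOLD:
                    yield (float(m.group('secs')), m.group('status'),
                           m.group('method'), m.group('path'))

    if __name__ == '__main__':
        # e.g. python slow_requests.py screen-n-api.txt (file name is a guess)
        for secs, status, method, path in sorted(slow_requests(sys.argv[1]),
                                                 reverse=True):
            print('%8.2fs  %s  %s %s' % (secs, status, method, path))

That would at least show whether it's one request stalling or the whole API slowing down.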
2. http://status.openstack.org/elastic-recheck/index.html#1844929
I've done some digging into this one and my notes are in the bug report. It mostly affects grenade jobs, but is not entirely restricted to them. It's also mostly on OVH and FortNebula nodes, but not exclusively.
From looking at the mysql logs in the grenade jobs, mysqld is (re)started three times: I think (1) for the initial package install, (2) for stacking devstack on the old side, and (3) for stacking devstack on the new side. After the last restart there are a lot of aborted connection messages in the mysql error log. It's around then that grenade is running post-upgrade smoke tests to create a server and the nova-scheduler times out communicating with the nova_cell1 database.
I have a few patches up to grenade/devstack [1] to try some things and get more mysql logs, but so far they aren't really helpful. We need someone with more mysql debugging experience to help here, maybe zzzeek or mordred?
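Not a fix, but for anyone else poking at this, here's a rough sketch of the kind of thing I've been trying to line up by hand: count the aborted connection messages in the mysql error log per minute and per database, so they can be matched against the restarts and the smoke test window. The error log path and the exact message format vary between MySQL/MariaDB versions, so treat both as assumptions:

    import collections
    import re
    import sys

    # Typical MariaDB/MySQL error log line (format differs between versions):
    #   2019-10-31 16:55:08 ... [Warning] Aborted connection 123 to db:
    #   'nova_cell1' user: 'root' host: 'localhost' (Got an error ...)
    ABORTED_RE = re.compile(
        r'(?P<minute>\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}).*Aborted connection'
        r".*db: '(?P<db>[^']*)'")

    def aborted_per_minute(log_path):
        counts = collections.Counter()
        with open(log_path, errors='replace') as f:
            for line in f:
                m = ABORTED_RE.search(line)
                if m:
                    counts[(m.group('minute'), m.group('db'))] += 1
        return counts

    if __name__ == '__main__':
        # e.g. python aborted_conns.py mysql_error.txt (path is a guess)
        for (minute, db), n in sorted(aborted_per_minute(sys.argv[1]).items()):
            print('%s  %-12s %d' % (minute, db, n))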
3. CirrOS guest SSH issues
There are several (some might be duplicates):
http://status.openstack.org/elastic-recheck/index.html#1848078
I think this one is the same as what we reported in https://bugs.launchpad.net/neutron/+bug/1850557. Basically we noticed issues with DHCP after resize/migration/shelve of an instance, but I didn't have time to investigate it yet.
http://status.openstack.org/elastic-recheck/index.html#1808010 (most hits)
http://status.openstack.org/elastic-recheck/index.html#1463631
http://status.openstack.org/elastic-recheck/index.html#1849857
http://status.openstack.org/elastic-recheck/index.html#1737039
http://status.openstack.org/elastic-recheck/index.html#1840355
http://status.openstack.org/elastic-recheck/index.html#1843610
A few notes here.
a) We're still using CirrOS 0.4.0 since Stein:
https://review.opendev.org/#/c/521825/
That image was published nearly 2 years ago, and there are no newer versions on the CirrOS download site, so we can't try a newer image to see if that fixes things.
b) Some of the issues above are related to running out of disk in the guest. I'm not sure what is causing that, but I have posted a devstack patch that is related:
https://review.opendev.org/#/c/690991
tl;dr: before Stein the tempest flavors we used had disk=0, so nova would fit the root disk to the size of the image. Since Stein the tempest flavors specify a root disk size (1GiB for the CirrOS images). My patch pads an extra 1GiB onto the root disk of the tempest flavors. One side effect is that the volumes tempest creates will go from 1GiB to 2GiB, which could be a problem if a lot of tempest volume tests run at the same time, though we do have a 24GB volume group in gate runs so I think we're OK for now. I'm not sure my patch would help, but it's an idea.
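Just as a back-of-the-envelope check on that side effect, using only the numbers above (how many volumes actually exist at once depends on what tempest happens to run in parallel, which I haven't measured):

    # Rough headroom check for the flavor padding idea; only the 24GB volume
    # group size and the 1GiB -> 2GiB volume size change come from the
    # discussion above, everything else is hand-waving.
    VG_SIZE_GIB = 24      # volume group size in gate runs
    OLD_VOLUME_GIB = 1    # tempest volume size today
    NEW_VOLUME_GIB = 2    # after padding the flavor root disk by 1GiB

    print('max concurrent 1GiB volumes:', VG_SIZE_GIB // OLD_VOLUME_GIB)  # 24
    print('max concurrent 2GiB volumes:', VG_SIZE_GIB // NEW_VOLUME_GIB)  # 12

So the padding roughly halves how many volumes can exist at once, which is why I think we're still OK for now but it's worth keeping in mind.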
As for the other key generation and DHCP lease failures, I don't know what to do about those. We need more eyes on these issues to generate some ideas, or to see if we're doing something wrong in our tests, e.g. are we generating too much data for the config drive? Not using config drive in some cases? Is the metadata API server too slow (note we cache the metadata since [2])? Other ideas on injecting logs for debugging?
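On the "metadata API server is too slow" theory, one way to rule it in or out might be to time the same fetches the guest's init does (CirrOS just uses wget against the EC2-style metadata service). Below is a minimal sketch of that in Python; the 169.254.169.254 endpoint is what guests normally hit, but the exact paths and the 10 second timeout are assumptions for illustration, and it would have to run from inside a guest (or the right network namespace) to be meaningful:

    import time
    import urllib.request

    # EC2-style metadata paths the guest init effectively polls; the exact
    # API version/paths here are assumptions for illustration.
    METADATA = 'http://169.254.169.254/latest/meta-data/'
    PATHS = ['instance-id', 'public-keys/0/openssh-key']

    for path in PATHS:
        start = time.time()
        try:
            with urllib.request.urlopen(METADATA + path, timeout=10) as resp:
                body = resp.read()
            print('%-28s %6.2fs  %d bytes' % (path, time.time() - start,
                                              len(body)))
        except Exception as exc:
            # A slow or failing fetch here would line up with the SSH key
            # failures we're seeing.
            print('%-28s %6.2fs  FAILED: %s' % (path, time.time() - start, exc))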
[1] https://review.opendev.org/#/q/topic:bug/1844929+status:open
[2] https://review.opendev.org/#/q/I9082be077b59acd3a39910fa64e29147cb5c2dd7
--
Thanks,
Matt
--
Slawek Kaplonski
Senior software engineer
Red Hat