Hi, On Tue, Dec 03, 2019 at 01:53:04PM +0000, Sean Mooney wrote:
On Thu, Oct 31, 2019 at 3:29 PM Matt Riedemann <mriedemos@gmail.com> wrote:
Things are great! Surprise! I just wanted to let everyone know. Later! . . . . . Now that you've been tricked, on Halloween no less, I'm here to tell you that things suck right now. This is your periodic digest of issues. Grab some fun-sized candy bars and read on.
I think right now we have three major issues.
1. http://status.openstack.org/elastic-recheck/index.html#1763070
This has resurfaced and I'm not sure why, nor do I think we ever had a great handle on what is causing this or how to work around it so if anyone has new ideas please chip in.
The only theory I have on that, in the absence of any other indication, is that we could be running out of entropy in the VM, which can lead to HTTPS connections failing. We might want to consider enabling the virtio random number generator in the gate VMs: https://github.com/openstack/glance/blob/master/etc/metadefs/compute-libvirt... This needs to be enabled in the flavors too, but low entropy can cause SSL connections to fail: https://major.io/2007/07/01/check-available-entropy-in-linux/
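If we want to try that, a rough sketch of the image side using openstacksdk is below; the cloud and image names are just placeholders, not what the gate jobs actually use. The flavor side would still need hw_rng:allowed=True set on the tempest flavors, e.g. "openstack flavor set --property hw_rng:allowed=True <flavor>".

    # Untested sketch: tag the guest image so nova attaches a virtio-rng
    # device to instances booted from it. The "devstack" cloud entry and
    # the image name below are placeholders.
    import openstack

    conn = openstack.connect(cloud='devstack')
    image = conn.image.find_image('cirros-0.4.0-x86_64-disk')
    conn.image.update_image(image, hw_rng_model='virtio')
    # The flavors also need the extra spec hw_rng:allowed=True
    # (and optionally hw_rng:rate_bytes / hw_rng:rate_period).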
You may be right that lack of entropy is the culprit in some issues. E.g. here: https://zuul.opendev.org/t/openstack/build/edcf837457a741abb752693723319b15/... it took 120 seconds to initialize the random number generator on the guest VM, and because of that the network wasn't configured properly and the test failed.
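For what it's worth, checking whether a guest is actually starved is cheap; this is just the /proc read from the article above wrapped in Python (on CirrOS itself you'd do the same with cat):

    # Read the kernel's available-entropy estimate; values in the low tens
    # at boot point to starvation, while a healthy guest (or one with a
    # virtio-rng device) should show hundreds of bits or more.
    def available_entropy_bits():
        with open('/proc/sys/kernel/random/entropy_avail') as f:
            return int(f.read().strip())

    print('available entropy: %d bits' % available_entropy_bits())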
2. http://status.openstack.org/elastic-recheck/index.html#1844929
I've done some digging into this one and my notes are in the bug report. It mostly affects grenade jobs, but it isn't restricted to them. It also hits mostly OVH and FortNebula nodes, but not exclusively.
From looking at the mysql logs in the grenade jobs, mysqld is (re)started three times. I think those are (1) the initial package install, (2) stacking devstack on the old side, and (3) stacking devstack on the new side. After the last restart, there are a lot of aborted connection messages in the mysql error log. It's around then that grenade is running the post-upgrade smoke tests to create a server, and the nova-scheduler times out communicating with the nova_cell1 database.
I have a few patches up to grenade/devstack [1] to try some things and get more mysql logs, but so far they aren't really helpful. We need someone with more mysql debugging experience to help here, maybe zzzeek or mordred?
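In case it helps whoever digs in, here is a sketch of the kind of post-upgrade snapshot I'd like captured: the aborted-connection counters plus the timeout/connection-limit settings. It uses PyMySQL, and the host and credentials are placeholders for whatever devstack configures in the job.

    # Untested sketch of a mysql health dump to run after the grenade smoke
    # tests; host/user/password are placeholders for the job's real values.
    import pymysql

    conn = pymysql.connect(host='127.0.0.1', user='root', password='CHANGE_ME')
    try:
        with conn.cursor() as cur:
            # Aborted_clients / Aborted_connects are the counters behind the
            # "aborted connection" noise in the error log.
            cur.execute("SHOW GLOBAL STATUS LIKE 'Aborted%'")
            for name, value in cur.fetchall():
                print(name, value)
            # Settings that commonly explain dropped connections under load.
            cur.execute("SHOW GLOBAL VARIABLES WHERE Variable_name IN "
                        "('max_connections', 'wait_timeout', 'interactive_timeout')")
            for name, value in cur.fetchall():
                print(name, value)
    finally:
        conn.close()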
3. CirrOS guest SSH issues
There are several (some might be duplicates):
http://status.openstack.org/elastic-recheck/index.html#1848078
http://status.openstack.org/elastic-recheck/index.html#1808010 (most hits)
http://status.openstack.org/elastic-recheck/index.html#1463631
http://status.openstack.org/elastic-recheck/index.html#1849857
http://status.openstack.org/elastic-recheck/index.html#1737039
http://status.openstack.org/elastic-recheck/index.html#1840355
http://status.openstack.org/elastic-recheck/index.html#1843610
A few notes here.
a) We're still using CirrOS 0.4.0 since Stein:
https://review.opendev.org/#/c/521825/
And that image was published nearly 2 years ago and there are no newer versions on the CirrOS download site so we can't try a newer image to see if that fixes things.
b) Some of the issues above are related to running out of disk in the guest. I'm not sure what is causing that, but I have posted a devstack patch that is related:
https://review.opendev.org/#/c/690991
tl;dr: before Stein the tempest flavors we used had disk=0, so nova would fit the root disk to the size of the image. Since Stein the tempest flavors specify a root disk size (1GiB for the CirrOS images). My patch pads an extra 1GiB onto the root disk in the tempest flavors. One side effect is that the volumes tempest creates will go from 1GiB to 2GiB, which could be a problem if a lot of tempest volume tests run at the same time, though we do have a 24GB volume group in gate runs, so I think we're OK for now. I'm not sure my patch would help, but it's an idea.
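To make the before/after concrete, here is the sizing behavior I mean as a plain-Python paraphrase (illustrative only, not nova's actual code):

    # Illustration of the root disk sizing described above, not nova's code.
    def root_disk_gb(flavor_disk_gb, image_size_gb):
        if flavor_disk_gb == 0:
            # Pre-Stein tempest flavors: disk=0 means nova fits the root
            # disk to the size of the image.
            return image_size_gb
        # Since Stein the flavor's root disk size is used as-is
        # (1GiB on the tempest flavors used with the CirrOS images).
        return flavor_disk_gb

    # With the devstack patch padding an extra 1GiB onto the flavors, the
    # guests (and the volumes tempest creates from them) end up at 2GiB.
    print(root_disk_gb(1 + 1, image_size_gb=1))  # -> 2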
As for the other key generation and DHCP lease failures, I don't know what to do about those.

So the ssh key generation issue may also be down to entropy. I have not looked at those specific failures, but I did note in some failed tests in the past that we printed the kernel entropy in the guest and it was around 36 or some other very low number (it should be in the hundreds). If we have low entropy, key generation will take a long time. https://wiki.debian.org/BoottimeEntropyStarvation

We need more eyes on these issues to generate some ideas or see if we're doing something wrong in our tests, e.g. generating too much data for the config drive? Not using config drive in some cases? Metadata API server too slow (note we cache the metadata since [2])? Other ideas on injecting logs for debug?
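On the "metadata API server too slow" question, one low-tech data point we could collect from inside a test guest is simply how long the metadata fetch takes; a sketch is below (CirrOS itself only has curl, so treat this as an illustration of the request it makes):

    # Time a metadata fetch the way the guest's init would; 169.254.169.254
    # is the link-local metadata address and this path is nova's JSON
    # metadata document.
    import time

    import requests

    URL = 'http://169.254.169.254/openstack/latest/meta_data.json'

    start = time.monotonic()
    resp = requests.get(URL, timeout=30)
    elapsed = time.monotonic() - start
    print('status=%s bytes=%d elapsed=%.2fs'
          % (resp.status_code, len(resp.content), elapsed))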
[1] https://review.opendev.org/#/q/topic:bug/1844929+status:open
[2] https://review.opendev.org/#/q/I9082be077b59acd3a39910fa64e29147cb5c2dd7
--
Thanks,
Matt
--
Slawek Kaplonski
Senior software engineer
Red Hat