State of the Gate

Slawek Kaplonski skaplons at redhat.com
Tue Dec 3 15:12:02 UTC 2019


Hi,

On Tue, Dec 03, 2019 at 01:53:04PM +0000, Sean Mooney wrote:
> On Thu, Oct 31, 2019 at 3:29 PM Matt Riedemann <mriedemos at gmail.com> wrote:
> >
> > Things are great! Surprise! I just wanted to let everyone know. Later!
> > .
> > .
> > .
> > .
> > .
> > Now that you've been tricked, on Halloween no less, I'm here to tell you
> > that things suck right now. This is your periodic digest of issues. Grab
> > some fun-sized candy bars and read on.
> >
> > I think right now we have three major issues.
> >
> > 1. http://status.openstack.org/elastic-recheck/index.html#1763070
> >
> > This has resurfaced and I'm not sure why, nor do I think we ever had a
> > great handle on what is causing it or how to work around it, so if
> > anyone has new ideas please chip in.
> The only theory I have on that, in the absence of any other indication, is
> that we could be running out of entropy in the VM, which can lead to HTTPS
> connections failing. We might want to consider enabling the virtio random
> number generator in the gate VMs:
> https://github.com/openstack/glance/blob/master/etc/metadefs/compute-libvirt-image.json#L54-L59
> This needs to be enabled in the flavors too, but low entropy can cause
> SSL connections to fail:
> https://major.io/2007/07/01/check-available-entropy-in-linux/

You may be right that lack of entropy is the culprit in some of these issues.
E.g. here:
https://zuul.opendev.org/t/openstack/build/edcf837457a741abb752693723319b15/log/controller/logs/tempest_log.txt.gz#14652
it took 120 seconds to initialize the random number generator on the guest VM,
and because of that the network wasn't configured properly and the test failed.
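
For reference, something as small as the sketch below, run from a test or
collected in the post-run logs, would make starved guests easy to spot (the
procfs path is the standard one; the 1000-bit warning threshold is just my
assumption, not an official value):

#!/usr/bin/env python3
# Quick entropy check for a guest or gate node. Purely illustrative; the
# 1000-bit warning threshold is an assumption, not an official recommendation.

ENTROPY_PATH = '/proc/sys/kernel/random/entropy_avail'
LOW_WATERMARK = 1000


def check_entropy(path=ENTROPY_PATH, low=LOW_WATERMARK):
    with open(path) as f:
        available = int(f.read().strip())
    if available < low:
        print('WARNING: only %d bits of entropy available' % available)
    else:
        print('entropy looks OK: %d bits available' % available)
    return available


if __name__ == '__main__':
    check_entropy()

That would at least make it easy to correlate slow boots with a starved pool.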

> >
> > 2. http://status.openstack.org/elastic-recheck/index.html#1844929
> >
> > I've done some digging into this one and my notes are in the bug report.
> > It mostly affects grenade jobs, but is not entirely restricted to them.
> > It's also mostly on OVH and FortNebula nodes, though not exclusively.
> >
> > From looking at the mysql logs in the grenade jobs, mysqld is
> > (re)started three times. I think (1) for the initial package install, (2)
> > for stacking devstack on the old side, and (3) for stacking devstack on
> > the new side. After the last restart, there are a lot of aborted
> > connection messages in the mysql error log. It's around then that
> > grenade is running post-upgrade smoke tests to create a server and the
> > nova-scheduler times out communicating with the nova_cell1 database.
> >
> > I have a few patches up to grenade/devstack [1] to try some things and
> > get more mysql logs, but so far they aren't really helpful. We need
> > someone with some more mysql debugging experience to help here, maybe
> > zzzeek or mordred?
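
I'm not a mysql expert either, but one cheap thing we could do is correlate
the aborted connections with the smoke-test window. A rough sketch of a
helper for that (the log path and the timestamp format are assumptions based
on typical mysqld error logs, so treat it as illustrative only):

import re
from collections import Counter

# Hypothetical helper: count "Aborted connection" messages per minute in the
# mysqld error log, to see whether they cluster around the post-upgrade smoke
# tests. The log path and ISO-style timestamp prefix are assumptions.
LOG_PATH = '/var/log/mysql/error.log'
MINUTE = re.compile(r'^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2})')


def aborted_per_minute(path=LOG_PATH):
    counts = Counter()
    with open(path) as f:
        for line in f:
            if 'Aborted connection' in line:
                match = MINUTE.match(line)
                if match:
                    counts[match.group(1)] += 1
    return counts


if __name__ == '__main__':
    for minute, count in sorted(aborted_per_minute().items()):
        print('%s  %d aborted connections' % (minute, count))
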
> >
> > 3. CirrOS guest SSH issues
> >
> > There are several (some might be duplicates):
> >
> > http://status.openstack.org/elastic-recheck/index.html#1848078
> > http://status.openstack.org/elastic-recheck/index.html#1808010 (most hits)
> > http://status.openstack.org/elastic-recheck/index.html#1463631
> > http://status.openstack.org/elastic-recheck/index.html#1849857
> > http://status.openstack.org/elastic-recheck/index.html#1737039
> > http://status.openstack.org/elastic-recheck/index.html#1840355
> > http://status.openstack.org/elastic-recheck/index.html#1843610
> >
> > A few notes here.
> >
> > a) We've been using CirrOS 0.4.0 since Stein:
> >
> > https://review.opendev.org/#/c/521825/
> >
> > That image was published nearly 2 years ago, and there are no newer
> > versions on the CirrOS download site, so we can't try a newer image to
> > see if that fixes things.
> >
> > b) Some of the issues above are related to running out of disk in the
> > guest. I'm not sure what is causing that, but I have posted a devstack
> > patch that is related:
> >
> > https://review.opendev.org/#/c/690991
> >
> > tl;dr: before Stein the tempest flavors we used had disk=0, so nova would
> > fit the root disk to the size of the image. Since Stein the tempest
> > flavors specify a root disk size (1GiB for the CirrOS images). My patch
> > pads an extra 1GiB onto the root disk in the tempest flavors. One side
> > effect of that is that the volumes tempest creates will go from 1GiB to
> > 2GiB, which could be a problem if a lot of tempest volume tests run at
> > the same time, though we do have a volume group size of 24GB in gate
> > runs, so I think we're OK for now. I'm not sure my patch would help, but
> > it's an idea.
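
Just to put rough numbers on the volume group concern (taking the figures
from your mail, and mixing GB and GiB loosely the way they already are there):

# Back-of-the-envelope check of the volume group headroom with the padded
# flavors: 24GB VG, volumes growing from 1GiB to 2GiB.
VG_SIZE = 24
OLD_VOLUME = 1
NEW_VOLUME = 2

print('before padding: up to %d concurrent volumes' % (VG_SIZE // OLD_VOLUME))
print('after padding:  up to %d concurrent volumes' % (VG_SIZE // NEW_VOLUME))
# before padding: up to 24 concurrent volumes
# after padding:  up to 12 concurrent volumes

So as long as tempest doesn't need more than about a dozen volumes at the
same time, the VG should hold, which matches your "OK for now".
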
> >
> > As for the other key generation and DHCP lease failures, I don't know
> > what to do about those. We need more eyes on these issues to generate
> 
> So the SSH key generation issue may also be down to entropy. I have not
> looked at those specific failures, but I did note in some failed tests in
> the past that we printed the kernel entropy in the guest and it was
> something like 36 or some other very low number (it should be in the
> hundreds). If we have low entropy, key generation will take a long time.
> https://wiki.debian.org/BoottimeEntropyStarvation
> 
> 
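
That would match what I'd expect: with only ~36 bits in the pool, anything
that waits on the kernel pool is going to stall, and key generation at boot
is that kind of consumer. A tiny demonstration one could run inside the guest
(illustrative only; on the kernels our guests use, a read from /dev/random
blocks until the pool refills, while /dev/urandom returns immediately):

import time


def timed_read(path, nbytes=16):
    # Time how long it takes to read a few bytes from the given device.
    start = time.time()
    with open(path, 'rb') as f:
        f.read(nbytes)
    return time.time() - start


if __name__ == '__main__':
    for dev in ('/dev/urandom', '/dev/random'):
        print('%s: %.2f seconds for 16 bytes' % (dev, timed_read(dev)))
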
> > some ideas or see if we're doing something wrong in our tests, e.g.
> > generating too much data for the config drive? Not using config drive in
> > some cases? Metadata API server is too slow (note we cache the metadata
> > since [2])? Other ideas on injecting logs for debug?
> >
> > [1] https://review.opendev.org/#/q/topic:bug/1844929+status:open
> > [2] https://review.opendev.org/#/q/I9082be077b59acd3a39910fa64e29147cb5c2dd7
> >
> > --
> >
> > Thanks,
> >
> > Matt
> >
> 
> 

-- 
Slawek Kaplonski
Senior software engineer
Red Hat



