Open Stack

Thu Oct 31 21:15:35 UTC 2019

Hi,

On Thu, Oct 31, 2019 at 10:23:01AM -0500, Matt Riedemann wrote:
> Things are great! Surprise! I just wanted to let everyone know. Later!
> .
> .
> .
> .
> .
> Now that you've been tricked, on Halloween no less, I'm here to tell you
> that things suck right now. This is your periodic digest of issues. Grab
> some fun-sized candy bars and read on.
> 
> I think right now we have three major issues.
> 
> 1. http://status.openstack.org/elastic-recheck/index.html#1763070
> 
> This has resurfaced and I'm not sure why, nor do I think we ever had a great
> handle on what is causing this or how to work around it so if anyone has new
> ideas please chip in.

I think this is "just" a slowdown of the node the job is running on. I
noticed it in some neutron jobs too and checked a few of them. It seems that a
single API request takes a very long time to process. For example, in one recent case:
https://13cf3dd11b8f009809dc-97cb3b32849366f5bed744685e46b266.ssl.cf5.rackcdn.com/692206/3/check/tempest-integrated-compute/35ecb4a/job-output.txt
it was a request to nova that took a very long time:

Oct 31 16:55:08.632162 ubuntu-bionic-inap-mtl01-0012620879 devstack@n-api.service[7191]: INFO nova.api.openstack.requestlog [None req-275af2df-bd4e-4e64-b46e-6582e8de5148 tempest-ServerDiskConfigTestJSON-1598674508 tempest-ServerDiskConfigTestJSON-1598674508] 198.72.124.104 "POST /compute/v2.1/servers/d15d2033-b29b-44f7-b619-ed7ef83fe477/action" status: 500 len: 216 microversion: 2.1 time: 161.951140
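
To find similar requests in other job logs, something like the following could help. It is only a minimal sketch: the regular expression is written against the single requestlog line quoted above, and the function name and the 60 second threshold are hypothetical choices, not anything taken from nova or the gate tooling.

```python
import re

# Matches the tail of a nova.api.openstack.requestlog line like the one above:
#   "POST /compute/..." status: 500 len: 216 microversion: 2.1 time: 161.951140
# Assumption: other requestlog lines follow the same layout.
REQUEST_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+)" '
    r'status: (?P<status>\d+) .*? time: (?P<seconds>\d+\.\d+)'
)


def slow_requests(lines, threshold=60.0):
    """Yield (seconds, status, method, path) for requests slower than threshold.

    The threshold default is a hypothetical value chosen for illustration.
    """
    for line in lines:
        match = REQUEST_RE.search(line)
        if match is None:
            continue
        seconds = float(match.group("seconds"))
        if seconds > threshold:
            yield (
                seconds,
                int(match.group("status")),
                match.group("method"),
                match.group("path"),
            )


# The request quoted above, without the journal prefix.
sample = (
    'INFO nova.api.openstack.requestlog [None '
    'req-275af2df-bd4e-4e64-b46e-6582e8de5148 '
    'tempest-ServerDiskConfigTestJSON-1598674508 '
    'tempest-ServerDiskConfigTestJSON-1598674508] 198.72.124.104 '
    '"POST /compute/v2.1/servers/d15d2033-b29b-44f7-b619-ed7ef83fe477/action" '
    'status: 500 len: 216 microversion: 2.1 time: 161.951140'
)
results = list(slow_requests([sample]))
```

Running it over the lines of a job's n-api log instead of the single sample would list every request that exceeded the threshold, which makes it easier to see whether the slow calls cluster around one point in the run.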

> 
> 2. http://status.openstack.org/elastic-recheck/index.html#1844929
> 
> I've done some digging into this one and my notes are in the bug report. It
> mostly affects grenade jobs but is not entirely restricted to grenade jobs.
> It's also mostly on OVH and FortNebula nodes but not totally.
> 
> From looking at the mysql logs in the grenade jobs, mysqld is (re)started
> three times. I think (1) for initial package install, (2) for stacking
> devstack on the old side, and (3) for stacking devstack on the new side.
> After the last restart, there are a lot of aborted connection messages in
> the mysql error log. It's around then that grenade is running post-upgrade
> smoke tests to create a server and the nova-scheduler times out
> communicating with the nova_cell1 database.
> 
> I have a few patches up to grenade/devstack [1] to try some things and get
> more mysql logs but so far they aren't really helpful. We need someone with
> some more mysql debugging experience to help here, maybe zzzeek or mordred?
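
One way to see which database the aborted connections belong to would be to count them per database. The sketch below is only an illustration: it assumes the usual "Aborted connection N to db: '...' user: '...'" wording of the mysqld error log, the function name is hypothetical, and the sample lines are made up for the example rather than copied from a gate job.

```python
import re
from collections import Counter

# Assumption: aborted connections are logged in the common mysqld form
#   Aborted connection <id> to db: '<db>' user: '<user>' host: '<host>' (<reason>)
ABORTED_RE = re.compile(
    r"Aborted connection \d+ to db: '(?P<db>[^']*)' user: '(?P<user>[^']*)'"
)


def count_aborted_by_db(lines):
    """Return a Counter mapping database name to aborted-connection count."""
    counts = Counter()
    for line in lines:
        match = ABORTED_RE.search(line)
        if match is not None:
            counts[match.group("db")] += 1
    return counts


# Illustrative lines only; these are NOT taken from a real job log.
illustrative_log = [
    "[Note] Aborted connection 41 to db: 'nova_cell1' user: 'root' "
    "host: 'localhost' (Got an error reading communication packets)",
    "[Note] Aborted connection 42 to db: 'nova_cell1' user: 'root' "
    "host: 'localhost' (Got an error reading communication packets)",
    "[Note] Aborted connection 43 to db: 'nova_api' user: 'root' "
    "host: 'localhost' (Got timeout reading communication packets)",
    "[Note] some unrelated message",
]
counts = count_aborted_by_db(illustrative_log)
```

If the counts on a failing run are dominated by nova_cell1, that would at least match the nova-scheduler timeouts described above.
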
> 
> 3. CirrOS guest SSH issues
> 
> There are several (some might be duplicates):
> 
> http://status.openstack.org/elastic-recheck/index.html#1848078

I think this one is the same issue we reported in
https://bugs.launchpad.net/neutron/+bug/1850557

Basically, we noticed DHCP issues after resize/migration/shelve of an instance,
but I haven't had time to investigate yet.

> http://status.openstack.org/elastic-recheck/index.html#1808010 (most hits)
> http://status.openstack.org/elastic-recheck/index.html#1463631
> http://status.openstack.org/elastic-recheck/index.html#1849857
> http://status.openstack.org/elastic-recheck/index.html#1737039
> http://status.openstack.org/elastic-recheck/index.html#1840355
> http://status.openstack.org/elastic-recheck/index.html#1843610
> 
> A few notes here.
> 
> a) We're still using CirrOS 0.4.0 since Stein:
> 
> https://review.opendev.org/#/c/521825/
> 
> And that image was published nearly 2 years ago and there are no newer
> versions on the CirrOS download site so we can't try a newer image to see if
> that fixes things.
> 
> b) Some of the issues above are related to running out of disk in the guest.
> I'm not sure what is causing that, but I have posted a devstack patch that
> is related:
> 
> https://review.opendev.org/#/c/690991
> 
> tl;dr before Stein the tempest flavors we used had disk=0 so nova would fit
> the root disk to the size of the image. Since Stein the tempest flavors
> specify root disk size (1GiB for the CirrOS images). My patch pads an extra
> 1GiB to the root disk on the tempest flavors. One side effect of that is the
> volumes tempest creates will go from 1GiB to 2GiB which could be a problem
> if a lot of tempest volume tests run at the same time, though we do have a
> volume group size of 24GB in gate runs so I think we're OK for now. I'm not
> sure my patch would help, but it's an idea.
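
The volume headroom mentioned above can be checked with a bit of arithmetic. This is a rough sketch: it treats the 24GB volume group as 24 GiB and ignores LVM metadata overhead, and the function name is hypothetical.

```python
def max_concurrent_volumes(volume_group_gib, volume_size_gib):
    """Return how many volumes of the given size fit in the volume group.

    Assumption: no LVM overhead and no other consumers of the volume group.
    """
    return volume_group_gib // volume_size_gib


VOLUME_GROUP_GIB = 24  # gate volume group size, treated as GiB

# Before the patch tempest volumes are 1GiB, after it they are 2GiB.
before_patch = max_concurrent_volumes(VOLUME_GROUP_GIB, 1)
after_patch = max_concurrent_volumes(VOLUME_GROUP_GIB, 2)
```

Under those assumptions the number of volumes that can exist at the same time drops from 24 to 12, so the patch only becomes a problem if more than 12 tempest volumes are alive at once.
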
> 
> As for the other key generation and dhcp lease failures, I don't know what
> to do about those. We need more eyes on these issues to generate some ideas
> or see if we're doing something wrong in our tests, e.g. generating too much
> data for the config drive? Not using config drive in some cases? Metadata
> API server is too slow (note we cache the metadata since [2])? Other ideas
> on injecting logs for debug?
> 
> [1] https://review.opendev.org/#/q/topic:bug/1844929+status:open
> [2] https://review.opendev.org/#/q/I9082be077b59acd3a39910fa64e29147cb5c2dd7
> 
> -- 
> 
> Thanks,
> 
> Matt
> 

-- 
Slawek Kaplonski
Senior software engineer
Red Hat


State of the Gate
