On 12/6/2018 5:16 PM, Clark Boylan wrote:
> I was asked to write another one of these in the Nova meeting today so here goes.
Thanks Clark, this is really helpful.
> One thing to keep in mind is that while the test nodes are slower than we'd like, they have also exposed some situations where our software is less efficient than we'd like. At least one bug, https://bugs.launchpad.net/nova/+bug/1807219, has been identified through this. I would encourage people debugging these slow tests to check whether they expose a deficiency in our software that can be fixed.
That was split off from this:

https://bugs.launchpad.net/nova/+bug/1807044

But yeah, there are a couple of issues Dan and I are digging into.

Another thing I noticed in one of these nova-api start timeout failures in ovh-bhs1 was that uwsgi seems to just stall for 26 seconds here:

http://logs.openstack.org/01/619701/5/gate/tempest-slow/2bb461b/controller/l...

I pushed a patch to enable uwsgi debug logging:

https://review.openstack.org/#/c/623265/

But of course I didn't (1) get a recreate or (2) see any additional debug logging from uwsgi. If someone else knows how to enable that, please let me know.
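For anyone hunting for similar stalls in other service logs, a minimal sketch of a gap finder follows. It assumes every log line starts with a timestamp like "2018-12-06 17:16:01.123456"; that layout, the function name find_stalls, and the 10 second default threshold are illustrative assumptions, not something taken from the job logs above.

```python
from datetime import datetime

# Assumed timestamp layout at the start of each line, for example
# "2018-12-06 17:16:01.123456 some message". Real service logs may differ.
TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"
TS_LENGTH = 26  # length of the assumed timestamp prefix


def find_stalls(lines, threshold_seconds=10.0):
    """Return (gap_seconds, previous_line, next_line) for each gap above the threshold."""
    stalls = []
    prev_ts = None
    prev_line = None
    for line in lines:
        try:
            ts = datetime.strptime(line[:TS_LENGTH], TS_FORMAT)
        except ValueError:
            # Skip lines without a leading timestamp (tracebacks, wrapped output).
            continue
        if prev_ts is not None:
            gap = (ts - prev_ts).total_seconds()
            if gap > threshold_seconds:
                stalls.append((gap, prev_line, line))
        prev_ts, prev_line = ts, line
    return stalls
```

Feeding it the lines of a downloaded log file, for example find_stalls(open(path)), lists each pair of consecutive timestamped lines separated by more than the threshold, which narrows down where to look.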
> These are the big issues that affect large numbers of projects (or even all of them), but there are still many project-specific problems floating around as well. Unfortunately I haven't had much time to help dig into those recently (see broader issues above), but I think it would be helpful if projects can do some of that digging themselves. Also, a friendly reminder that we try to provide in-cloud region mirrors and caches for commonly used resources like distro packages, pypi packages, dockerhub images, and so on. If your jobs aren't using these and you find they fail occasionally due to the Internet being flaky, we'll be happy to help you update the jobs to use the in-region resources instead.
I'm not sure if this query is valid anymore:

http://status.openstack.org/elastic-recheck/#1783405

If it is, then we still have some tempest tests that aren't marked as slow but are contributing to job timeouts outside the tempest-slow job. I know the last time this came up, the QA team had a report of the slowest non-slow tests. Can we get another one of those now?

Another thing: are there particular voting jobs that have a failure rate over 50% and are resetting the gate? If there are, we should consider making them non-voting while project teams work on fixing the issues. I've had approved patches for days now that take 13+ hours just to fail, which is pretty unsustainable.
> We'll keep pushing to fix the broader issues and are more than happy to help debug failures you hit within your projects as well.
>
> Hopefully this was helpful despite its length.
Again, thank you, Clark, for taking the time to write up this summary. It's extremely useful.

--

Thanks,

Matt