[infra] Update on test throughput and Zuul backlogs

6 Dec 2018

      Hello everyone,

I was asked to write another one of these in the Nova meeting today so here goes.

TripleO has done a good job of reducing resource consumption and now represents about 42% of the total resource usage for the last month down from over 50% when we first started tracking this info. Generating the report requires access to Zuul's scheduler logs so I've pasted a copy at http://paste.openstack.org/show/736797/. There is a change, https://review.openstack.org/#/c/616306/, to report this data via statsd which will allow anyone to generate it off of our graphite server once deployed.

Another piece of exciting (good) news is that we've changed the way the Zuul resource allocation scheme prioritizes requests. In the check pipeline a change's relative priority is based on how many changes for that project are already in check and in the gate pipeline it is relative to the number of changes in the shared gate queue. What this means is that less active projects shouldn't need to wait as long for their changes to be tested, but more active projects like tripleo-heat-templates, nova, and neutron may see other changes being tested ahead of their changes. More details on this thread, http://lists.openstack.org/pipermail/openstack-discuss/2018-December/000482.....

One side effect of this change is that our Zuul is now running more jobs per hour than in the past (because it can more quickly churn through changes for "cheap" projects). Unfortunately, this has increased the memory demands on the zuul-executors and we've found that we are using far more swap than we'd like. We'll be applying https://review.openstack.org/#/c/623245/ to reduce the amount of memory required by each job during the course of its run which we hope helps. We've also added one new executor with plans to add a second if this change doesn't help.

All that said flaky tests are still an issue. One set of problems seems related to slower than expected/before test nodes in the BHS1 region. We've been debugging these with OVH (thank you amorin!) and think we've managed to make some improvements though so far the problems persist. Current theory is that we are acting as our own noisy neighbors starving the hypervisors of disk IO throughput. In order to test that we've halved the total number of resources we'll use there. More details at https://etherpad.openstack.org/p/bhs1-test-node-slowness including a list of e-r bugs that may be tied to this issue.

One thing to keep in mind is that while the test nodes are slower than we'd like, they have also exposed some situations where our software is less efficient than we'd like. At least one bug, https://bugs.launchpad.net/nova/+bug/1807219, has been identified through this. I would encourage people debugging these slow tests to look to see if this exposes a deficiency in our software that can be fixed.

CentOS 7.6 released this last Monday. Fallout from that has included needing to update ansible playbooks that ensure the latest version of a centos distro package without setting become: yes. Previously the package was installed at the latest version on our images which ansible could verify without root privileges. Additionally golang is no longer a valid package on the base OS as it was on 7.5 (side note, this doesn't feel incredibly stable for users if anyone from rhel is listening). If your jobs depend on golang on centos and were getting that from the distro packages on 7.5 you'll need to find somewhere else to get golang now.

With the distro updates come broken nested virt. Unfortunately nested virt continues to be a back and forth of working today, not working tomorrow. It seem that our test node kernels play a big impact on that then a few days later the various clouds apply new hypervisor kernel updates and things work again. If your jobs attempt to use nested virt and you've seen odd behavior from them (like reboots) recently this may be the cause.

These are the big issues that affect large numbers of projects (or even all of them), but there are still many project specific problems floating around as well. Unfortunately I haven't had much time to help dig into those recently (see broader issues above), but I think it would be helpful if projects can do some of that digging themselves. Also, a friendly reminder that we try to provide in cloud region mirrors and caches for commonly used resources like distro packages, pypi packages, dockerhub images, and so on. If your jobs aren't using these and you find they fail occasionally due to the Internet being flaky we'll be happy to help you update the jobs to use the in region resources instead.

We'll keep pushing to fix the broader issues and are more than happy to help debug failures you hit within your projects as well.

Hopefully this was helpful despite its length.

Clark

Clark Boylan

Matt Riedemann

Ghanshyam Mann

melanie witt

melanie witt

Tristan Cacqueray

Matt Riedemann

Ghanshyam Mann

tags

participants (5)