[infra] Update on test throughput and Zuul backlogs
Hello everyone, I was asked to write another one of these in the Nova meeting today so here goes.

TripleO has done a good job of reducing resource consumption and now represents about 42% of the total resource usage for the last month, down from over 50% when we first started tracking this info. Generating the report requires access to Zuul's scheduler logs, so I've pasted a copy at http://paste.openstack.org/show/736797/. There is a change, https://review.openstack.org/#/c/616306/, to report this data via statsd, which will allow anyone to generate it off of our graphite server once deployed (a rough sketch of the general statsd pattern is included below).

Another piece of exciting (good) news is that we've changed the way the Zuul resource allocation scheme prioritizes requests. In the check pipeline a change's relative priority is based on how many changes for that project are already in check, and in the gate pipeline it is relative to the number of changes in the shared gate queue. What this means is that less active projects shouldn't need to wait as long for their changes to be tested, but more active projects like tripleo-heat-templates, nova, and neutron may see other changes being tested ahead of their changes. More details in this thread: http://lists.openstack.org/pipermail/openstack-discuss/2018-December/000482.....

One side effect of this change is that our Zuul is now running more jobs per hour than in the past (because it can more quickly churn through changes for "cheap" projects). Unfortunately, this has increased the memory demands on the zuul-executors and we've found that we are using far more swap than we'd like. We'll be applying https://review.openstack.org/#/c/623245/ to reduce the amount of memory required by each job during the course of its run, which we hope will help. We've also added one new executor, with plans to add a second if this change doesn't help.

All that said flaky tests are still an issue. One set of problems seems related to slower than expected/before test nodes in the BHS1 region. We've been debugging these with OVH (thank you amorin!) and think we've managed to make some improvements though so far the problems persist. Current theory is that we are acting as our own noisy neighbors starving the hypervisors of disk IO throughput. In order to test that we've halved the total number of resources we'll use there. More details at https://etherpad.openstack.org/p/bhs1-test-node-slowness including a list of e-r bugs that may be tied to this issue.

One thing to keep in mind is that while the test nodes are slower than we'd like, they have also exposed some situations where our software is less efficient than we'd like. At least one bug, https://bugs.launchpad.net/nova/+bug/1807219, has been identified through this. I would encourage people debugging these slow tests to look to see if this exposes a deficiency in our software that can be fixed.

CentOS 7.6 was released this last Monday. Fallout from that has included needing to update ansible playbooks that ensure the latest version of a CentOS distro package without setting become: yes. Previously the package was installed at the latest version on our images, which ansible could verify without root privileges. Additionally golang is no longer a valid package on the base OS as it was on 7.5 (side note: this doesn't feel incredibly stable for users, if anyone from RHEL is listening). If your jobs depend on golang on CentOS and were getting it from the distro packages on 7.5, you'll need to find somewhere else to get golang now.
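(For context on the statsd reporting mentioned above: the following is not the code in the linked review, just a rough sketch of the usual statsd-to-graphite pattern, with made-up metric names and numbers.)

    # Rough sketch only: push per-project resource usage to statsd as gauges
    # so it can be graphed from graphite. Metric names and values here are
    # hypothetical, not what the actual Zuul change emits.
    import statsd

    client = statsd.StatsClient("graphite.example.org", 8125, prefix="zuul.resources")

    usage_by_project = {
        "tripleo": 4200,  # e.g. node-hours over the reporting window (made up)
        "nova": 1100,
        "neutron": 900,
    }

    for project, node_hours in usage_by_project.items():
        # A gauge records the latest value; graphite keeps the history over time.
        client.gauge(f"{project}.node_hours", node_hours)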
With the distro updates comes broken nested virt. Unfortunately, nested virt continues to be a back and forth of working today, not working tomorrow. It seems that our test node kernels have a big impact on that; then a few days later the various clouds apply new hypervisor kernel updates and things work again. If your jobs attempt to use nested virt and you've seen odd behavior from them (like reboots) recently, this may be the cause.

These are the big issues that affect large numbers of projects (or even all of them), but there are still many project specific problems floating around as well. Unfortunately I haven't had much time to help dig into those recently (see broader issues above), but I think it would be helpful if projects can do some of that digging themselves.

Also, a friendly reminder that we try to provide in cloud region mirrors and caches for commonly used resources like distro packages, pypi packages, dockerhub images, and so on. If your jobs aren't using these and you find they fail occasionally due to the Internet being flaky we'll be happy to help you update the jobs to use the in region resources instead.

We'll keep pushing to fix the broader issues and are more than happy to help debug failures you hit within your projects as well.

Hopefully this was helpful despite its length.

Clark
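(A quick aside on the nested virt point above: if you want to check from inside a job whether the test node actually exposes virtualization extensions, something like this minimal sketch works; the paths are the standard Linux locations and the script itself is hypothetical, not an infra-provided tool.)

    # Sketch: check whether a test node exposes hardware virt and whether
    # nested KVM is enabled.
    from pathlib import Path

    def cpu_has_virt_extensions():
        # 'vmx' (Intel) or 'svm' (AMD) in the CPU flags means the vCPU
        # exposes hardware virtualization to the guest OS.
        for line in Path("/proc/cpuinfo").read_text().splitlines():
            if line.startswith("flags"):
                flags = line.split(":", 1)[1].split()
                return "vmx" in flags or "svm" in flags
        return False

    def nested_kvm_enabled():
        # These files only exist once the kvm_intel/kvm_amd module is loaded.
        for mod in ("kvm_intel", "kvm_amd"):
            param = Path(f"/sys/module/{mod}/parameters/nested")
            if param.exists():
                return param.read_text().strip() in ("Y", "y", "1")
        return False

    if __name__ == "__main__":
        print("cpu exposes virt extensions:", cpu_has_virt_extensions())
        print("nested kvm enabled:", nested_kvm_enabled())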
On 12/6/2018 5:16 PM, Clark Boylan wrote:
I was asked to write another one of these in the Nova meeting today so here goes.
Thanks Clark, this is really helpful.
One thing to keep in mind is that while the test nodes are slower than we'd like, they have also exposed some situations where our software is less efficient than we'd like. At least one bug, https://bugs.launchpad.net/nova/+bug/1807219, has been identified through this. I would encourage people debugging these slow tests to look to see if this exposes a deficiency in our software that can be fixed.
That was split off from this: https://bugs.launchpad.net/nova/+bug/1807044 But yeah a couple of issues Dan and I are digging into. Another thing I noticed in one of these nova-api start timeout failures in ovh-bhs1 was that uwsgi seems to just stall for 26 seconds here: http://logs.openstack.org/01/619701/5/gate/tempest-slow/2bb461b/controller/l... I pushed a patch to enable uwsgi debug logging: https://review.openstack.org/#/c/623265/ But of course I didn't (1) get a recreate or (2) seem to see any additional debug logging from uwsgi. If someone else knows how to enable that please let me know.
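(Not an answer on the uwsgi logging question, but for spotting stalls like that 26 second one, a throwaway script that flags big gaps between consecutive log timestamps saves a lot of scrolling. A sketch, assuming 'YYYY-MM-DD HH:MM:SS.mmm' line prefixes; adjust the parsing for the log format you're looking at.)

    # Sketch: print log lines that follow an unusually large time gap.
    import sys
    from datetime import datetime

    THRESHOLD = 10.0  # seconds
    FMT = "%Y-%m-%d %H:%M:%S.%f"

    prev_ts = prev_line = None
    with open(sys.argv[1], errors="replace") as f:
        for line in f:
            try:
                ts = datetime.strptime(line[:23], FMT)
            except ValueError:
                continue  # not a timestamped line
            if prev_ts is not None and (ts - prev_ts).total_seconds() > THRESHOLD:
                print(f"{(ts - prev_ts).total_seconds():.1f}s gap before: {line.rstrip()}")
                print(f"  previous line was: {prev_line.rstrip()}")
            prev_ts, prev_line = ts, line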
These are the big issues that affect large numbers of projects (or even all of them), but there are still many project specific problems floating around as well. Unfortunately I haven't had much time to help dig into those recently (see broader issues above), but I think it would be helpful if projects can do some of that digging themselves. Also, a friendly reminder that we try to provide in cloud region mirrors and caches for commonly used resources like distro packages, pypi packages, dockerhub images, and so on. If your jobs aren't using these and you find they fail occasionally due to the Internet being flaky we'll be happy to help you update the jobs to use the in region resources instead.
I'm not sure if this query is valid anymore: http://status.openstack.org/elastic-recheck/#1783405 If it is, then we still have some tempest tests that aren't marked as slow but are contributing to job timeouts outside the tempest-slow job. I know the last time this came up, the QA team had a report of the slowest non-slow tests - can we get another one of those now? Another thing is, are there particular voting jobs that have a failure rate over 50% and are resetting the gate? If so, we should consider making them non-voting while project teams work on fixing the issues. Because I've had approved patches for days now taking 13+ hours just to fail, which is pretty unsustainable.
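(On the failure-rate question: once you have build results from whatever source is handy, e.g. the Zuul builds dashboard, the arithmetic is trivial to script. A sketch with made-up records; the record format here is hypothetical.)

    # Sketch: per-job failure rates from a list of build records, flagging
    # anything over 50%. The records below are made up for illustration.
    from collections import Counter

    builds = [
        {"job": "tempest-full", "result": "FAILURE"},
        {"job": "tempest-full", "result": "SUCCESS"},
        {"job": "tempest-slow", "result": "SUCCESS"},
        {"job": "tempest-full", "result": "FAILURE"},
    ]

    totals, failures = Counter(), Counter()
    for build in builds:
        totals[build["job"]] += 1
        if build["result"] != "SUCCESS":
            failures[build["job"]] += 1

    for job in sorted(totals):
        rate = failures[job] / totals[job]
        flag = "  <-- candidate for non-voting?" if rate > 0.5 else ""
        print(f"{job}: {failures[job]}/{totals[job]} failed ({rate:.0%}){flag}")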
We'll keep pushing to fix the broader issues and are more than happy to help debug failures you hit within your projects as well.
Hopefully this was helpful despite its length.
Again, thank you Clark for taking the time to write up this summary - it's extremely useful. -- Thanks, Matt
---- On Fri, 07 Dec 2018 08:50:30 +0900 Matt Riedemann <mriedemos@gmail.com> wrote ----
On 12/6/2018 5:16 PM, Clark Boylan wrote:
I was asked to write another one of these in the Nova meeting today so here goes.
Thanks Clark, this is really helpful.
One thing to keep in mind is that while the test nodes are slower than we'd like, they have also exposed some situations where our software is less efficient than we'd like. At least one bug, https://bugs.launchpad.net/nova/+bug/1807219, has been identified through this. I would encourage people debugging these slow tests to look to see if this exposes a deficiency in our software that can be fixed.
That was split off from this:
https://bugs.launchpad.net/nova/+bug/1807044
But yeah a couple of issues Dan and I are digging into.
Another thing I noticed in one of these nova-api start timeout failures in ovh-bhs1 was that uwsgi seems to just stall for 26 seconds here:
http://logs.openstack.org/01/619701/5/gate/tempest-slow/2bb461b/controller/l...
I pushed a patch to enable uwsgi debug logging:
https://review.openstack.org/#/c/623265/
But of course I didn't (1) get a recreate or (2) seem to see any additional debug logging from uwsgi. If someone else knows how to enable that please let me know.
These are the big issues that affect large numbers of projects (or even all of them), but there are still many project specific problems floating around as well. Unfortunately I haven't had much time to help dig into those recently (see broader issues above), but I think it would be helpful if projects can do some of that digging themselves. Also, a friendly reminder that we try to provide in cloud region mirrors and caches for commonly used resources like distro packages, pypi packages, dockerhub images, and so on. If your jobs aren't using these and you find they fail occasionally due to the Internet being flaky we'll be happy to help you update the jobs to use the in region resources instead.
I'm not sure if this query is valid anymore:
http://status.openstack.org/elastic-recheck/#1783405
If it is, then we still have some tempest tests that aren't marked as slow but are contributing to job timeouts outside the tempest-slow job. I know the last time this came up, the QA team had a report of the slowest non-slow tests - can we get another one of those now?
This still seems to be a valid query: 7 fails in the last 24 hours and 302 fails in the last 10 days. I did some more categorization of this query by build_name and found the failures break down roughly as:
- tempest-full or tempest-full-py3 - ~50%
- tempest-all - 2%
- tempest-slow - 2%
- the rest spread across all the other jobs
I proposed modifying the query to exclude the tempest-all and tempest-slow jobs, which run all the slow tests anyway: https://review.openstack.org/#/c/623949/
While doing another round of marking slow tests, I will check whether we can identify more tests that are consistently slow. -gmann
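(For anyone who hasn't done it before, tagging a test as slow in Tempest is just the attr decorator from tempest.lib; a minimal sketch with a hypothetical test case and placeholder id:)

    # Sketch: how a tempest test gets tagged as slow. The class, test, and
    # idempotent id below are placeholders, not real tempest tests.
    from tempest.lib import decorators
    from tempest import test


    class HypotheticalScenarioTest(test.BaseTestCase):

        @decorators.attr(type='slow')
        @decorators.idempotent_id('11111111-2222-3333-4444-555555555555')
        def test_expensive_workflow(self):
            # Long-running scenario; the 'slow' attr keeps it out of the
            # regular tempest-full runs and in tempest-slow instead.
            pass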
On Thu, 06 Dec 2018 15:16:01 -0800, Clark Boylan wrote: [snip]
One thing to keep in mind is that while the test nodes are slower than we'd like, they have also exposed some situations where our software is less efficient than we'd like. At least one bug, https://bugs.launchpad.net/nova/+bug/1807219, has been identified through this. I would encourage people debugging these slow tests to look to see if this exposes a deficiency in our software that can be fixed.
[snip]
These are the big issues that affect large numbers of projects (or even all of them), but there are still many project specific problems floating around as well. Unfortunately I haven't had much time to help dig into those recently (see broader issues above), but I think it would be helpful if projects can do some of that digging themselves.
[snip] FYI for interested people, we are working on some nova-specific problems in the following patches/series: https://review.openstack.org/623282 https://review.openstack.org/623246 https://review.openstack.org/623265
We'll keep pushing to fix the broader issues and are more than happy to help debug failures you hit within your projects as well.
Thanks for the excellent write-up. It's a nice window into what's going on in the gate and the work the infra team is doing, and it lets us know how we can help. Best, -melanie
Bah, didn't see Matt's reply by the time I hit send. Apologies for the [less detailed] replication. -melanie
On December 6, 2018 11:16 pm, Clark Boylan wrote:
Additionally golang is no longer a valid package on the base OS as it was on 7.5.
According to the release notes, golang is now shipped as part of the SCL (Software Collections). See this how-to for the install instructions: http://www.karan.org/blog/2018/12/06/using-go-toolset-on-centos-linux-7-x86_... Regards, -Tristan
On 12/6/2018 5:16 PM, Clark Boylan wrote:
All that said flaky tests are still an issue. One set of problems seems related to slower than expected/before test nodes in the BHS1 region. We've been debugging these with OVH (thank you amorin!) and think we've managed to make some improvements though so far the problems persist. Current theory is that we are acting as our own noisy neighbors starving the hypervisors of disk IO throughput. In order to test that we've halved the total number of resources we'll use there. More details at https://etherpad.openstack.org/p/bhs1-test-node-slowness including a list of e-r bugs that may be tied to this issue.
One thing to keep in mind is that while the test nodes are slower than we'd like, they have also exposed some situations where our software is less efficient than we'd like. At least one bug, https://bugs.launchpad.net/nova/+bug/1807219, has been identified through this. I would encourage people debugging these slow tests to look to see if this exposes a deficiency in our software that can be fixed.
Here are a couple of fixes for recently fingerprinted gate bugs: https://review.openstack.org/#/c/623669/ https://review.openstack.org/#/c/623597/ Those are in grenade and devstack respectively so we'll need some QA cores. -- Thanks, Matt
---- On Sun, 09 Dec 2018 03:28:47 +0900 Matt Riedemann <mriedemos@gmail.com> wrote ----
Here are a couple of fixes for recently fingerprinted gate bugs:
https://review.openstack.org/#/c/623669/
https://review.openstack.org/#/c/623597/
Those are in grenade and devstack respectively so we'll need some QA cores.
Done. The grenade one is merged and the devstack one is in the queue. -gmann
participants (5)
- Clark Boylan
- Ghanshyam Mann
- Matt Riedemann
- melanie witt
- Tristan Cacqueray