[openstack-dev] Zuul Queue backlogs and resource usage
Clark Boylan
cboylan at sapwetik.org
Tue Oct 30 16:03:37 UTC 2018
Hello everyone,
A little while back I sent an email explaining how the gate queues work and how fixing bugs helps us test and merge more code. All of this is still true and we should keep pushing to improve our testing to avoid gate resets.
Last week we migrated Zuul and Nodepool to a new Zookeeper cluster. In the process we had to restart Zuul, which brought in a new logging feature that exposes node resource usage by job. Using this data I've been able to generate a report on where our node demand is going. This change [0] produces this report [1].
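For anyone who wants to experiment with this kind of data, here is a minimal sketch of the aggregation idea. It is not the script from [0]. It assumes the usage data has already been parsed into (project, node count, duration in seconds) tuples; that record shape, the function name and the sample values are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def summarize_node_usage(records):
    """Aggregate node-seconds per project.

    `records` is an iterable of (project, node_count, duration_seconds)
    tuples. This shape is an assumption for illustration, not the actual
    Zuul log format.

    Returns a list of (project, node_seconds, share_percent) tuples,
    sorted with the heaviest consumer first.
    """
    totals = defaultdict(float)
    for project, node_count, duration_seconds in records:
        # A job holding N nodes for T seconds consumes N * T node-seconds.
        totals[project] += node_count * duration_seconds

    grand_total = sum(totals.values())
    report = []
    for project, node_seconds in sorted(
        totals.items(), key=lambda item: item[1], reverse=True
    ):
        share = 100.0 * node_seconds / grand_total if grand_total else 0.0
        report.append((project, node_seconds, share))
    return report


if __name__ == "__main__":
    # Made-up sample data, not taken from the real report.
    sample = [
        ("project-a", 3, 3600),
        ("project-b", 1, 1800),
        ("project-a", 1, 1800),
    ]
    for project, node_seconds, share in summarize_node_usage(sample):
        print(f"{project}: {node_seconds / 3600:.1f} node-hours ({share:.1f}%)")
```

Weighting each job by node count times duration, rather than counting jobs, is what lets multinode and long-running jobs show up in proportion to the capacity they actually hold.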
As with optimizing software, we want to identify which changes will have the biggest impact and to be able to measure whether those changes actually helped once we have made them. Hopefully this information is a start at doing that. Currently we can only look back to the point Zuul was restarted, but we have a thirty day log rotation for this service and should be able to look at a month's worth of data going forward.
Looking at the data you might notice that TripleO is using many more node resources than our other projects. They are aware of this and have a plan [2] to reduce their resource consumption. We'll likely be using this report generator to check the progress of this plan over time.
Also related to the long queue backlogs is this proposal [3] to change how Zuul prioritizes resource allocations to try to be more fair.
[0] https://review.openstack.org/#/c/613674/
[1] http://paste.openstack.org/show/733644/
[2] http://lists.openstack.org/pipermail/openstack-dev/2018-October/135396.html
[3] http://lists.zuul-ci.org/pipermail/zuul-discuss/2018-October/000575.html
If you find any of this interesting and would like to help, feel free to reach out to me or the infra team.
Thank you,
Clark