Long, Slow Zuul Queues and Why They Happen
Hello,

We've been fielding a fair number of questions and suggestions around Zuul's long change (and job) queues over the last week or so. As a result I tried to put together a quick FAQ-type document [0] on how we schedule jobs, why we schedule that way, and how we can improve the long queues.

Hoping that gives us all a better understanding of why we are in the current situation and ideas on how we can help improve things.

[0] https://docs.openstack.org/infra/manual/testing.html#why-are-jobs-for-change...

Thanks,
Clark
On 9/13/2019 2:03 PM, Clark Boylan wrote:
> We've been fielding a fair number of questions and suggestions around Zuul's long change (and job) queues over the last week or so. As a result I tried to put together a quick FAQ-type document [0] on how we schedule jobs, why we schedule that way, and how we can improve the long queues.
>
> Hoping that gives us all a better understanding of why we are in the current situation and ideas on how we can help improve things.
>
> [0] https://docs.openstack.org/infra/manual/testing.html#why-are-jobs-for-change...
Thanks for writing this up Clark.

As for the current status of the gate, several nova devs have been closely monitoring the gate since we have had three fairly lengthy series of feature changes approved since yesterday. We're trying to shepherd those through, but we're seeing failures and trying to react to them.

Two issues of note this week:

1. http://status.openstack.org/elastic-recheck/index.html#1843615

I had pushed a fix for that one earlier in the week, but there was a bug in my fix which Takashi has fixed:

https://review.opendev.org/#/c/682025/

That was promoted to the gate earlier today but failed on...

2. http://status.openstack.org/elastic-recheck/index.html#1813147

We have a couple of patches up for that now which might get promoted once we are reasonably sure those are going to pass check (promoting to the gate means skipping check, which is risky because if the change fails in the gate we have to re-queue the gate, as the doc above explains).

As far as overall failure classifications go, we're in pretty good shape in elastic-recheck:

http://status.openstack.org/elastic-recheck/data/integrated_gate.html

Meaning for the most part we know what's failing; we just need to fix the bugs. One that continues to dog us (and by "us" I mean OpenStack, not just nova) is this one:

http://status.openstack.org/elastic-recheck/gate.html#1686542

The QA team's work to split the big tempest full jobs apart into service-oriented jobs like tempest-integrated-compute should have helped here, but we're still seeing lots of jobs time out, which likely means some really slow tests are running in too many jobs and need investigation. It could also be devstack setup that is taking a long time, like the OSC usage Clark identified a while back:

http://lists.openstack.org/pipermail/openstack-discuss/2019-July/008071.html

If you have questions about how elastic-recheck works or how to help investigate some of these failures, e.g. with logstash.openstack.org, please reach out to me (mriedem), clarkb and/or gmann in #openstack-qa.

--
Thanks,
Matt
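To put a rough number on the re-queue risk Matt mentions above, here is a minimal sketch. This is not Zuul's code; it is a back-of-the-envelope Python model assuming every change behind a failure restarts all of its jobs, with made-up queue sizes and per-change costs.

# Illustrative sketch only (NOT Zuul's implementation): in a dependent gate
# pipeline every change is tested on top of the changes ahead of it, so when
# one change fails, the changes behind it must restart their jobs against a
# new speculative state.

def average_reset_waste(queue_length, node_hours_per_change):
    """Average node hours discarded when one change in the queue fails.

    Assumes every change behind the failing one restarts all of its jobs and
    that the failure is equally likely at any queue position. Both inputs are
    rough, made-up numbers for illustration.
    """
    wasted = 0.0
    for fail_index in range(queue_length):
        # Changes behind the failing one throw away their in-flight jobs.
        behind = queue_length - fail_index - 1
        wasted += behind * node_hours_per_change
    return wasted / queue_length


# e.g. a 20-change integrated gate queue at ~10 node hours per change
print(average_reset_waste(20, 10.0))  # roughly 95 node hours per reset with these guesses

With guesses like these, even a single gate failure can cost tens of node hours, which is why skipping check before a promote is treated as risky.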
*These are only observations, so please keep in mind I am only trying to get to the bottom of efficiency with our limited resources.* Please feel free to correct my understanding.

We have some core projects which many other projects depend on: Nova, Glance, Keystone, Neutron, Cinder, etc. In the CI it's equal access for any project. If feature A in a non-core project depends on feature B in a core project, why is feature B not prioritized?

Can we solve this issue by breaking apart the current equal-access structure into something more granular?

I understand that improving job efficiencies will likely result in more, smaller jobs, but will that actually solve the issue at the gate come this time in the cycle... every release? (As I am sure it comes up every time.) More smaller jobs will result in more jobs: if the job time is cut in half but the number of jobs is doubled, we will probably still have the same issue.

We have limited resources, and without more providers coming online I fear this issue is only going to get worse as time goes on if we do nothing.

~/DonnyD
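To make the job-count arithmetic in the previous message concrete, here is a minimal Python sketch with made-up numbers. It simply shows that splitting jobs relieves queue pressure only if the total node time consumed per change actually drops.

# Illustrative arithmetic only (made-up numbers): splitting one long job into
# several shorter ones does not free up capacity unless the *total* node time
# per change shrinks, because the test nodes are busy either way.

def total_node_minutes(num_jobs, minutes_per_job):
    return num_jobs * minutes_per_job

before = total_node_minutes(num_jobs=10, minutes_per_job=120)  # 1200 node-minutes per change
after = total_node_minutes(num_jobs=20, minutes_per_job=60)    # still 1200 node-minutes per change

print(before, after)  # same demand on the node pool, just sliced differently

# Splitting only pays off if it lets us drop redundant work, e.g. running
# compute-focused tempest tests only on compute-related changes.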
On Mon, Sep 23, 2019, at 8:03 AM, Donny Davis wrote:
> *These are only observations, so please keep in mind I am only trying to get to the bottom of efficiency with our limited resources.* Please feel free to correct my understanding.
>
> We have some core projects which many other projects depend on: Nova, Glance, Keystone, Neutron, Cinder, etc. In the CI it's equal access for any project. If feature A in a non-core project depends on feature B in a core project, why is feature B not prioritized?
The priority queuing happens per "gate queue". The integrated gate (nova, cinder, keystone, etc.) has one queue, TripleO has another, OSA has one, and so on. We do this so that important work can happen across disparate efforts.

What this means is that if Nova and the rest of the integrated gate have a set of priority changes, they should stop approving other changes while they work to merge those priority items. I have suggested that OpenStack needs an "air traffic controller" to help coordinate these efforts, particularly around feature freeze time (I suggested it to both the QA team and release team). Any queue could use one if they wanted to.

All that to say you can do this today, but it requires humans to work together, communicate what their goals are, and then give the CI system the correct information to act on these changes in the desired manner.
> Can we solve this issue by breaking apart the current equal-access structure into something more granular?
>
> I understand that improving job efficiencies will likely result in more, smaller jobs, but will that actually solve the issue at the gate come this time in the cycle... every release? (As I am sure it comes up every time.) More smaller jobs will result in more jobs: if the job time is cut in half but the number of jobs is doubled, we will probably still have the same issue.
>
> We have limited resources, and without more providers coming online I fear this issue is only going to get worse as time goes on if we do nothing.
>
> ~/DonnyD
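A minimal Python model of the per-queue structure Clark describes above may help. This is an illustrative sketch of the behaviour only, not Zuul's actual data structures or API; the queue names and change strings are just examples.

# Sketch only, not Zuul's data model: each shared gate queue is an ordered
# list of approved changes, and promotion reorders changes *within* that
# queue; it cannot move a change ahead of work in another project's queue.

from collections import deque

gate_queues = {
    "integrated": deque(),        # nova, cinder, keystone, glance, neutron, ...
    "tripleo": deque(),
    "openstack-ansible": deque(),
}

def approve(queue_name, change):
    # Approved changes enter their project's shared gate queue in order.
    gate_queues[queue_name].append(change)

def promote(queue_name, change):
    # Promotion only reorders one shared queue; the other queues keep
    # testing independently and are unaffected.
    queue = gate_queues[queue_name]
    queue.remove(change)
    queue.appendleft(change)

approve("integrated", "cinder: unrelated fix")
approve("integrated", "nova: feature series part 1")
approve("tripleo", "tripleo: some change")

promote("integrated", "nova: feature series part 1")
print(list(gate_queues["integrated"]))  # nova change now at the head
print(list(gate_queues["tripleo"]))     # untouched

The "air traffic controller" role in this model is about deciding what gets approved into (or promoted within) the shared queue, not about changing the scheduler itself.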
In a different thread I had another possible suggestion; it's probably more appropriate for this one. [1]

It would also be helpful to give the project a way to prefer certain infra providers for certain jobs. For the most part FortNebula is terrible at CPU-bound, long-running jobs... I wish I could make it better, but I cannot. Is there a method we could come up with that would allow us to exploit certain traits of a certain provider? Maybe some additional metadata that says what a certain provider is best at doing?

For example, highly IO-bound jobs work like gangbusters on FN because the underlying storage is very fast, but CPU-bound jobs do the direct opposite.

Thoughts?

~/DonnyD

1. http://lists.openstack.org/pipermail/openstack-discuss/2019-September/009592...
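Since the provider-trait idea above is only a proposal, the sketch below is purely hypothetical: the trait names, provider entries, and pick_provider helper are invented for illustration and do not correspond to any existing Nodepool or Zuul feature.

# Hypothetical sketch of the idea: every name here (the trait strings, the
# provider list, the matching function) is invented for illustration and is
# NOT an existing Nodepool/Zuul feature.

PROVIDER_TRAITS = {
    "fortnebula": {"fast-io"},      # fast local storage, weaker CPU (per the thread)
    "other-cloud": {"fast-cpu"},    # the reverse trade-off
}

def pick_provider(job_prefers):
    """Return providers whose advertised traits overlap what a job prefers."""
    matches = [name for name, traits in PROVIDER_TRAITS.items()
               if job_prefers & traits]
    # Fall back to any provider so a missing trait never starves a job.
    return matches or list(PROVIDER_TRAITS)

print(pick_provider({"fast-io"}))   # ['fortnebula']
print(pick_provider({"fast-cpu"}))  # ['other-cloud']
print(pick_provider({"gpu"}))       # ['fortnebula', 'other-cloud']

The fallback matters: a trait preference like this would have to stay a soft hint, or jobs preferring a scarce trait would simply wait longer.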