Open Stack

Wed Nov 20 20:44:52 UTC 2013

Joe Gordon has been doing great work tracking test failures and how
often they affect us. Since the Havana release the failure rate has
increased dramatically, negatively affecting the gate and forcing it to
run in a near worst case scenario. That is, changes are being tested in
parallel, but the head of the queue is more often than not running into
a failed job, forcing all changes behind it to be retested, and so on.
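
As a rough illustration of that cascade, here is a minimal Python
sketch. It is a simplified model and not Zuul's actual implementation:
the function name and the list of pass/fail outcomes are hypothetical,
and it assumes each change has a fixed outcome and that a failing change
is ejected while everything behind it is retested.

```python
def count_test_runs(outcomes):
    """Count test runs needed to drain a gate queue (simplified model).

    outcomes is a hypothetical list of booleans, one per queued change,
    in queue order: True means the change passes, False means it fails.
    """
    queue = list(outcomes)
    runs = 0
    while queue:
        # Every change currently in the queue is tested in parallel.
        runs += len(queue)
        if False in queue:
            first_failure = queue.index(False)
            # Changes ahead of the failure merge, the failing change is
            # ejected, and everything behind it has to be retested.
            queue = queue[first_failure + 1:]
        else:
            # No failures: the whole queue merges.
            queue = []
    return runs


if __name__ == "__main__":
    print(count_test_runs([True, True, True, True]))
    print(count_test_runs([False, True, True, True]))
    print(count_test_runs([False, False, False, False]))
```

In this model a clean queue of n changes costs n test runs, while a
queue where every head fails costs n + (n - 1) + ... + 1 runs, which is
the near worst case described above.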

This led to a gate queue 130 deep with the head of the queue 18 hours
behind its approval. We have identified fixes for some of the worst
current bugs and, in order to get them in, have restarted Zuul,
effectively cancelling the gate queue, and have queued these changes up
at the front of the queue. Once these changes are in and we are happy
with the bug fixing results, we will requeue the changes that were in
the queue when it got cancelled.

How do we avoid this in the future? Step one is that reviewers who are
approving changes (or reverifying them) should keep an eye on the gate
queue. If it is struggling, adding more changes to that queue probably
won't help. Instead we should focus on identifying the bugs, submitting
changes to elastic-recheck to track these bugs, and working towards
fixing the bugs. Everyone is affected by persistent gate failures, so
we need to work together to fix them.

Thank you for your patience,

Clark

[openstack-dev] The recent gate performance and how it affects you
