[openstack-dev] The recent gate performance and how it affects you

Matt Riedemann mriedem at linux.vnet.ibm.com
Wed Nov 20 21:46:08 UTC 2013



On Wednesday, November 20, 2013 2:44:52 PM, Clark Boylan wrote:
> Joe Gordon has been doing great work tracking test failures and how
> often they affect us. Since the Havana release the failure rate has
> increased dramatically, negatively affecting the gate and forcing it to
> run in a near-worst-case scenario. That is, changes are being tested in
> parallel, but the head of the queue is more often than not running into
> a failed job, forcing all changes behind it to be retested, and so on.
>
> This led to a gate queue 130 deep with the head of the queue 18 hours
> behind its approval. We have identified fixes for some of the worst
> current bugs and, in order to get them in, have restarted Zuul,
> effectively cancelling the gate queue, and have queued these changes up
> at the front of the queue. Once these changes are in and we are happy
> with the bug-fixing results, we will requeue the changes that were in
> the queue when it got cancelled.
>
> How do we avoid this in the future? Step one is that reviewers who are
> approving changes (or reverifying them) should keep an eye on the gate
> queue. If it is struggling, adding more changes to that queue probably
> won't help. Instead we should focus on identifying the bugs, submitting
> changes to elastic-recheck to track these bugs, and working towards
> fixing the bugs. Everyone is affected by persistent gate failures, so
> we need to work together to fix them.
>
> Thank you for your patience,
>
> Clark

Let me also say that I think it's really helpful that Joe has been 
sending out recaps to the mailing list about the top offenders so 
people can help pitch in on investigating and fixing those (like we saw 
with the Neutron team's response to Joe's recent post about the top 
gate failures).

People get heads-down in their own projects and what they are working
on, and it's hard to keep up with what's going on in the infra channel
(or the nova channel, for that matter), so sending out a recap that
everyone can see on the mailing list helps reset where things stand and
focus what might otherwise be various isolated investigations (as we saw
happen this week).

--

Thanks,

Matt Riedemann



