[openstack-dev] [infra] Intermittent network problems allowed to sneak past the gate?

Sean Dague sean at dague.net
Tue May 6 15:17:39 UTC 2014


On 05/06/2014 10:52 AM, Derek Higgins wrote:
> Hi,
> 
>     I've been working on a check job that uses devstack-gate to run
> nova with the docker driver. While doing this I noticed that
> sometimes, during the nova boot for an instance, the node loses network
> connectivity (obviously a problem that needs to be worked on).
> What's interesting is zuul's behavior when this occurs in the check queue:
> the job simply got restarted, and this kept happening until the job passed.
> 
> A legitimately failed job :
>   https://jenkins05.openstack.org/job/check-nova-docker-dsvm-f20/2/
> 
> http://logs.openstack.org/14/91514/5/check/check-nova-docker-dsvm-f20/d5c1ebf/console.html
> 
> Retry (also failed)      :
>   https://jenkins07.openstack.org/job/check-nova-docker-dsvm-f20/3/
> 
> http://logs.openstack.org/14/91514/5/check/check-nova-docker-dsvm-f20/d5f26ed/console.html
> 
> Retried again (passed)   :
>   https://jenkins01.openstack.org/job/check-nova-docker-dsvm-f20/3/
> 
> http://logs.openstack.org/14/91514/5/check/check-nova-docker-dsvm-f20/2ebfa88/console.html
> 
> And success gets reported back to gerrit
> https://review.openstack.org/#/c/91514/
> Patch Set 5: Verified+1
>     check-nova-docker-dsvm-f20 SUCCESS in 17m 27s (non-voting)
> 
> 
> Wouldn't this behavior allow commits that cause intermittent network
> problems to more easily sneak past the gating infrastructure?
> 
> 
> I'm guessing that the retry is being triggered in
> zuul/launcher/gearman.py : onBuildCompleted()
> 
> because onDisconnect calls onBuildCompleted with no results param
> 
> Any thoughts?

There is some automatic retry facility in zuul right now to deal with a
set of issues which are considered recoverable and typically the fault
of the infrastructure provider.
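
To make the concern concrete, here is a minimal sketch of the retry pattern
being discussed. It is not zuul's actual code: the `Build` class and
`run_with_retries` function are hypothetical, and the only behavior taken
from this thread is that a build which completes with no result (for
example after a worker disconnect) gets relaunched instead of reported.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Build:
    """Hypothetical stand-in for a launched job; not a zuul class."""
    job_name: str
    attempts: int = 0
    result: Optional[str] = None


def run_with_retries(
    build: Build, outcomes: List[Optional[str]]
) -> Tuple[Optional[str], List[Optional[str]]]:
    """Replay a list of per-attempt outcomes against a build.

    An outcome of None models a worker that disconnected before it
    could report a result; such attempts are relaunched. The first
    non-None outcome is what gets reported.
    """
    seen: List[Optional[str]] = []
    for outcome in outcomes:
        build.attempts += 1
        seen.append(outcome)
        if outcome is None:
            # No result came back: treat as recoverable and relaunch.
            continue
        build.result = outcome
        break
    return build.result, seen


# Two lost attempts followed by a pass, as in the report above.
build = Build("check-nova-docker-dsvm-f20")
reported, seen = run_with_retries(build, [None, None, "SUCCESS"])
```

In this sketch only the final `SUCCESS` is reported, while the two lost
attempts are visible only in `seen`, which is the gap the question is about.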

There might be a way to slip something through. However, all failures in
the gate do tend to get eyes on them, and I've yet to see this kind of
issue slip through, so it's something to keep an eye out for. I would be
curious to see if we can mine out these issues in elastic recheck. The
failed results are still reported to logstash from what I can see, so we
can track them.
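
As a rough illustration of what such mining could look like, the sketch
below groups build records by change, patch set, and job, and flags any
group where earlier non-passing builds were followed by a final pass. The
record fields (`change`, `patchset`, `job`, `status`) and the `FAILURE`
status are assumptions made for the example, not the real logstash schema
or an elastic-recheck query, and the records are assumed to arrive in
chronological order.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, int, str]


def find_masked_retries(records: Iterable[dict]) -> Dict[Key, List[str]]:
    """Return runs where a final SUCCESS followed non-passing builds.

    Each record is a hypothetical dict with the keys 'change',
    'patchset', 'job' and 'status', listed oldest first.
    """
    runs: Dict[Key, List[str]] = defaultdict(list)
    for rec in records:
        key = (rec["change"], rec["patchset"], rec["job"])
        runs[key].append(rec["status"])

    return {
        key: statuses
        for key, statuses in runs.items()
        if len(statuses) > 1
        and statuses[-1] == "SUCCESS"
        and any(s != "SUCCESS" for s in statuses[:-1])
    }


# Illustrative records only; statuses of the retried builds are assumed.
sample = [
    {"change": "91514", "patchset": 5,
     "job": "check-nova-docker-dsvm-f20", "status": "FAILURE"},
    {"change": "91514", "patchset": 5,
     "job": "check-nova-docker-dsvm-f20", "status": "FAILURE"},
    {"change": "91514", "patchset": 5,
     "job": "check-nova-docker-dsvm-f20", "status": "SUCCESS"},
]
masked = find_masked_retries(sample)
```

Anything that shows up in `masked` is a candidate for a closer look, since
the result reported to gerrit would not reflect the earlier attempts.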

	-Sean

-- 
Sean Dague
http://dague.net
