[openstack-dev] [infra] Intermittent network problems allowed to sneak past the gate?

Derek Higgins derekh at redhat.com
Tue May 6 16:08:47 UTC 2014


On 06/05/14 16:17, Sean Dague wrote:
> On 05/06/2014 10:52 AM, Derek Higgins wrote:
>> Hi,
>>
>>     I've been working on a check job that uses devstack-gate jobs to run
>> nova with the docker driver. While doing this I noticed that
>> sometimes, during the nova boot for an instance, the node loses network
>> connectivity (obviously a problem that needs to be worked on).
>> What's interesting is zuul's behavior when this occurs in the check queue:
>> the job simply got restarted, and this kept happening until the job passed.
>>
>> A legitimately failed job :
>>   https://jenkins05.openstack.org/job/check-nova-docker-dsvm-f20/2/
>>
>> http://logs.openstack.org/14/91514/5/check/check-nova-docker-dsvm-f20/d5c1ebf/console.html
>>
>> Retry (also failed)      :
>>   https://jenkins07.openstack.org/job/check-nova-docker-dsvm-f20/3/
>>
>> http://logs.openstack.org/14/91514/5/check/check-nova-docker-dsvm-f20/d5f26ed/console.html
>>
>> Retried again (passed)   :
>>   https://jenkins01.openstack.org/job/check-nova-docker-dsvm-f20/3/
>>
>> http://logs.openstack.org/14/91514/5/check/check-nova-docker-dsvm-f20/2ebfa88/console.html
>>
>> And success gets reported back to gerrit
>> https://review.openstack.org/#/c/91514/
>> Patch Set 5: Verified+1
>>     check-nova-docker-dsvm-f20 SUCCESS in 17m 27s (non-voting)
>>
>>
>> Wouldn't this behavior allow commits that cause intermittent network
>> problems to more easily sneak past the gating infrastructure?
>>
>>
>> I'm guessing that the retry is being triggered in
>> zuul/launcher/gearman.py : onBuildCompleted()
>>
>> because onDisconnect calls onBuildCompleted with no results param
>>
>> Any thoughts?
> 
> There is some automatic retry facility in zuul right now to deal with a
> set of issues which are considered recoverable and typically the fault
> of the infrastructure provider.
> 
> There might be a way to slip something through, however, all failures in
> the gate do tend to get eyes on them, and I've yet to see this kind of
> issue slip through. So something to keep an eye out for. Would be
Hasn't this problem already slipped through (although it's in the check
queue, not the gate)? The change can now be merged, and the retries were only
noticed because I was watching the zuul status page while the jobs were running.
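
To make the concern concrete, here is a minimal sketch of the behavior I think
I'm seeing. It is not zuul's actual code: the function name and the use of
None for "worker lost, no result" are assumptions made purely for
illustration.

```python
# Hypothetical model of a launcher that relaunches a build whenever the
# worker disconnects without reporting a result. Not zuul's real code.

def run_with_retries(attempts):
    """Return (reported_result, builds_run) for a sequence of attempt outcomes.

    Each outcome is 'SUCCESS', 'FAILURE', or None (worker lost, no result).
    """
    builds_run = 0
    for outcome in attempts:
        builds_run += 1
        if outcome is None:
            # No result came back: treated as an infrastructure problem,
            # so the build is relaunched and the lost run is never reported.
            continue
        return outcome, builds_run
    # Every attempt was lost, so there is nothing to report.
    return None, builds_run


# Two runs lose connectivity, the third passes; only the pass is reported.
reported, runs = run_with_retries([None, None, "SUCCESS"])
print(reported, runs)
```

If the change under test is what breaks the network, every lost run looks like
an infrastructure fault, and the reviewer only ever sees the run that passed.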

> curious to see if we can mine out these issues in elastic recheck. The
> failed results are still reported to logstash from what I can see, so we
> can track them.
I'll see if I can find any similar occurrences in other jobs and report
back.
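
A rough sketch of the kind of search this would involve, assuming the build
results can be exported as simple records in chronological order. The record
fields and the function name are hypothetical, not a logstash or
elastic-recheck API.

```python
from collections import defaultdict


def find_masked_retries(builds):
    """Return (change, patchset, job) keys whose final build passed after
    at least one earlier build of the same patch set did not.

    `builds` is assumed to be in chronological order.
    """
    results_by_key = defaultdict(list)
    for build in builds:
        key = (build["change"], build["patchset"], build["job"])
        results_by_key[key].append(build["result"])

    masked = []
    for key, results in results_by_key.items():
        earlier, final = results[:-1], results[-1]
        if final == "SUCCESS" and any(r != "SUCCESS" for r in earlier):
            masked.append(key)
    return sorted(masked)


# Illustrative records: the first three mirror the runs described in this
# thread; the last one is made up to show a job that passed first time.
builds = [
    {"change": 91514, "patchset": 5, "job": "check-nova-docker-dsvm-f20", "result": "FAILURE"},
    {"change": 91514, "patchset": 5, "job": "check-nova-docker-dsvm-f20", "result": "FAILURE"},
    {"change": 91514, "patchset": 5, "job": "check-nova-docker-dsvm-f20", "result": "SUCCESS"},
    {"change": 11111, "patchset": 1, "job": "hypothetical-job", "result": "SUCCESS"},
]
print(find_masked_retries(builds))
```

Only patch sets with more than one build of the same job are flagged, so a job
that passes on its first run is ignored.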

> 
> 	-Sean
> 
> 
> 



