[OpenStack-Infra] Gate Issues

Paul Belanger pabelanger at redhat.com
Fri Dec 8 13:58:58 UTC 2017


On Fri, Dec 08, 2017 at 08:38:24PM +1100, Ian Wienand wrote:
> Hello,
> 
> Just to save people reverse-engineering IRC logs...
> 
> At ~04:00 UTC, frickler called out that things had been sitting in the
> gate for ~17 hours.
> 
> Upon investigation, one of the stuck jobs was a
> legacy-tempest-dsvm-neutron-full job
> (bba5d98bb7b14b99afb539a75ee86a80) as part of
> https://review.openstack.org/475955
> 
> Checking the zuul logs, I saw it had been sent to ze04:
> 
>   2017-12-07 15:06:20,962 DEBUG zuul.Pipeline.openstack.gate: Build <Build bba5d98bb7b14b99afb539a75ee86a80 of   legacy-tempest-dsvm-neutron-full on <Worker ze04.openstack.org>> started
> 
> However, zuul-executor was not running on ze04.  I believe there were
> issues with this host yesterday.  Both "/etc/init.d/zuul-executor start"
> and "service zuul-executor start" reported OK, but didn't actually
> start the daemon.  Rather than debug it, I just set
> _SYSTEMCTL_SKIP_REDIRECT=1 and that got it going.  We should look into
> that; I've noticed similar behaviour with zuul-scheduler too.
> 
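A side note on that init-script behaviour: presumably the LSB init hooks are
redirecting "service zuul-executor start" to systemd, and the redirect reports
success without the daemon actually coming up; _SYSTEMCTL_SKIP_REDIRECT=1
bypasses the redirect and runs the script directly.  A small post-start sanity
check along these lines would catch it (the process pattern below is an
assumption, not the exact command line):

  # Rough sketch: verify the daemon is really running after "start" claims OK.
  import subprocess
  import sys

  def daemon_running(pattern="zuul-executor"):
      # pgrep exits 0 if at least one process matches the pattern.
      result = subprocess.run(["pgrep", "-f", pattern],
                              stdout=subprocess.DEVNULL)
      return result.returncode == 0

  if not daemon_running():
      sys.exit("zuul-executor reported started, but no process is running")
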
> At this point, the evidence suggested zuul was waiting for jobs that
> would never return.  Thus I saved the queues, restarted zuul-scheduler
> and re-queued.
> 
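For reference, the save/re-enqueue dance is basically: dump status.json before
the restart, then feed every change back in with the zuul client.  A rough
sketch of that step is below; the status URL, the JSON field names and the
enqueue options are my assumptions rather than the exact tooling we run:

  # Rough sketch only: print "zuul enqueue" commands for everything in a
  # pipeline.  Verify field names against a real status.json and the options
  # against "zuul enqueue --help" before relying on it.
  import json
  import urllib.request

  STATUS_URL = "http://zuul.openstack.org/status.json"   # assumed location

  def dump_pipeline(pipeline_name="gate"):
      with urllib.request.urlopen(STATUS_URL) as resp:
          status = json.loads(resp.read().decode("utf8"))
      for pipeline in status["pipelines"]:
          if pipeline["name"] != pipeline_name:
              continue
          for queue in pipeline["change_queues"]:
              for head in queue["heads"]:
                  for item in head:
                      print("zuul enqueue --tenant openstack --trigger gerrit "
                            "--pipeline %s --project %s --change %s" % (
                                pipeline_name, item["project"], item["id"]))

  if __name__ == "__main__":
      dump_pipeline("gate")
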
> Soon after, frickler noticed another problem: releasenotes jobs were now
> failing with "could not import extension openstackdocstheme" [1].  We
> suspect [2] as the cause.
> 
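For context, that error is Sphinx failing to import a module listed in the
docs' conf.py "extensions" list, so it is easy to reproduce locally by trying
the imports in the same virtualenv the job uses (the extension names below are
just an example, not the exact releasenotes configuration):

  # Illustration only: Sphinx imports each name in conf.py's "extensions"
  # list, and a missing or broken package surfaces as
  # "could not import extension <name>".
  import importlib

  extensions = ["openstackdocstheme", "reno.sphinxext"]   # example values

  for name in extensions:
      try:
          importlib.import_module(name)
      except ImportError as exc:
          print("could not import extension %s: %s" % (name, exc))
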
> However, the gate did not become healthy.  Upon further investigation,
> the executors are very frequently failing jobs with
> 
>  2017-12-08 06:41:10,412 ERROR zuul.AnsibleJob: [build: 11062f1cca144052afb733813cdb16d8] Exception while executing job
>  Traceback (most recent call last):
>    File "/usr/local/lib/python3.5/dist-packages/zuul/executor/server.py", line 588, in execute
>      str(self.job.unique))
>    File "/usr/local/lib/python3.5/dist-packages/zuul/executor/server.py", line 702, in _execute
>    File "/usr/local/lib/python3.5/dist-packages/zuul/executor/server.py", line 1157, in prepareAnsibleFiles
>    File "/usr/local/lib/python3.5/dist-packages/zuul/executor/server.py", line 500, in make_inventory_dict
>      for name in node['name']:
>  TypeError: unhashable type: 'list'
> 
> This is leading to the very high rate of "retry_limit" failures.
> 
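That TypeError is the classic symptom of a field changing from a string to a
list and the value then being used as a dict key (or other hashable).  A
stripped-down illustration with made-up field names, not Zuul's actual
inventory code:

  # Illustration only: if a node's 'name' changes from a string to a list,
  # building a dict keyed on it fails with "unhashable type: 'list'".
  def make_inventory(nodes):
      hosts = {}
      for node in nodes:
          hosts[node["name"]] = {"ansible_host": node["interface_ip"]}
      return hosts

  make_inventory([{"name": "node1", "interface_ip": "10.0.0.1"}])    # fine
  make_inventory([{"name": ["node1"], "interface_ip": "10.0.0.1"}])  # TypeError
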
> We suspect change [3], as it made some changes in the node area.  I did
> not want to revert it via a force-merge, and I unfortunately don't have
> time to do something like apply it manually on the host and babysit it
> (I did not have time for a short email, so I sent a long one instead :)
> 
> At this point, I sent the alert to warn people that the gate is
> unstable, which is roughly where things stand now.
> 
> Good luck,
> 
> -i
> 
> [1] http://logs.openstack.org/95/526595/1/check/build-openstack-releasenotes/f38ccb4/job-output.txt.gz
> [2] https://review.openstack.org/525688
> [3] https://review.openstack.org/521324
> 
Digging into some of the issues this morning, I believe that citycloud-sto2 has
been wedged for some time; I see ready / locked nodes sitting there for 2+ days.
We also have a few ready / locked nodes in rax-iad, which I think are related to
the unhashable-list error from this morning.
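
For anyone else poking at this, a quick way to spot the stale ones is to walk
nodepool's node records in ZooKeeper and flag anything that has been "ready"
for more than a couple of days.  Sketch below; the znode path, the field names
and the ZK host are my assumptions about the layout, so double-check against a
live node before trusting it:

  # Rough sketch: list nodes that have sat in the "ready" state for > 2 days.
  import json
  import time
  from kazoo.client import KazooClient

  zk = KazooClient(hosts="zk01.openstack.org:2181")  # hypothetical ZK host
  zk.start()
  now = time.time()
  for node_id in zk.get_children("/nodepool/nodes"):
      data, _ = zk.get("/nodepool/nodes/%s" % node_id)
      if not data:
          continue
      node = json.loads(data.decode("utf8"))
      age_hours = (now - node.get("state_time", now)) / 3600
      if node.get("state") == "ready" and age_hours > 48:
          print(node_id, node.get("provider"), "%.0fh" % age_hours)
  zk.stop()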

As I understand it, the only way to release these nodes is to stop the
scheduler; is that correct? If so, I'd like to request that we add some sort of
CLI --force option to delete them, or some other command, if that makes sense.

I'll hold off on a restart until jeblair or shrews has a moment to look at logs.

Paul


