[OpenStack-Infra] Gate Issues

David Shrewsbury shrewsbury.dave at gmail.com
Fri Dec 8 15:16:34 UTC 2017


The locked nodes were due to something Clark found earlier this week, and
hopefully fixed with:

https://review.openstack.org/526234

Short story is that the request handlers (which hold the locks on the
nodes) were never allowed to continue processing, because an exception
was being thrown that short-circuited the process.

There was *something* causing our node requests to disappear (zuul
restarts?). The node request locks are removed after 8 hours if the
request no longer exists, and that removal is what triggered the
exception handled in the review above. Why the request handlers were
still holding on to requests 8 hours after their removal
is a bit of a mystery. Maybe some weirdness with citycloud.
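
For anyone curious about the failure mode, I won't try to reproduce the
actual change here, but the general shape is defensive handling around the
handler poll so that one dead request can't wedge everything else. A rough
sketch only; the NoNodeError and the handler names below are my own
stand-ins, not nodepool's real code:

    from kazoo.exceptions import NoNodeError

    def poll_request_handlers(handlers):
        """Poll every active request handler; a single broken request
        must not short-circuit the rest.  (Hypothetical names, not
        nodepool's real API.)"""
        still_active = []
        for handler in handlers:
            try:
                done = handler.poll()
            except NoNodeError:
                # The request znode (and eventually its lock) disappeared
                # out from under us, e.g. reaped by the 8-hour cleanup.
                # Release our node locks and drop the handler instead of
                # letting the exception stop the whole loop.
                handler.unlockNodeSet()
                continue
            if not done:
                still_active.append(handler)
        return still_active

The point is just that a handler whose request has vanished should clean up
and finish rather than sit on its node locks forever.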

-Dave

On Fri, Dec 8, 2017 at 8:58 AM, Paul Belanger <pabelanger at redhat.com> wrote:

> On Fri, Dec 08, 2017 at 08:38:24PM +1100, Ian Wienand wrote:
> > Hello,
> >
> > Just to save people reverse-engineering IRC logs...
> >
> > At ~04:00 UTC, frickler called out that things had been sitting in the
> > gate for ~17 hours.
> >
> > Upon investigation, one of the stuck jobs was a
> > legacy-tempest-dsvm-neutron-full job
> > (bba5d98bb7b14b99afb539a75ee86a80) as part of
> > https://review.openstack.org/475955
> >
> > Checking the zuul logs, I could see it had sent that job to ze04:
> >
> >   2017-12-07 15:06:20,962 DEBUG zuul.Pipeline.openstack.gate: Build <Build bba5d98bb7b14b99afb539a75ee86a80 of legacy-tempest-dsvm-neutron-full on <Worker ze04.openstack.org>> started
> >
> > However, zuul-executor was not running on ze04.  I believe there were
> > issues with this host yesterday.  "/etc/init.d/zuul-executor start" and
> > "service zuul-executor start" both reported OK, but didn't actually
> > start the daemon.  Rather than debug it, I just used
> > _SYSTEMCTL_SKIP_REDIRECT=1 and that got it going.  We should look into
> > that; I've noticed similar behaviour with zuul-scheduler too.
> >
> > At this point, the evidence suggested zuul was waiting for jobs that
> > would never return.  Thus I saved the queues, restarted zuul-scheduler
> > and re-queued.
> >
> > Soon after, frickler again noticed that releasenotes jobs were now
> > failing with "could not import extension openstackdocstheme" [1].  We
> > suspect [2].
> >
> > However, the gate did not become healthy.  Upon further investigation,
> > the executors are very frequently failing jobs with:
> >
> >  2017-12-08 06:41:10,412 ERROR zuul.AnsibleJob: [build: 11062f1cca144052afb733813cdb16d8] Exception while executing job
> >  Traceback (most recent call last):
> >    File "/usr/local/lib/python3.5/dist-packages/zuul/executor/server.py", line 588, in execute
> >      str(self.job.unique))
> >    File "/usr/local/lib/python3.5/dist-packages/zuul/executor/server.py", line 702, in _execute
> >    File "/usr/local/lib/python3.5/dist-packages/zuul/executor/server.py", line 1157, in prepareAnsibleFiles
> >    File "/usr/local/lib/python3.5/dist-packages/zuul/executor/server.py", line 500, in make_inventory_dict
> >      for name in node['name']:
> >  TypeError: unhashable type: 'list'
> >
> > This is leading to the very high "retry_limit" failures.
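
To make that TypeError a bit more concrete: "unhashable type: 'list'" is what
you get when a node's 'name' has become a list but something downstream still
uses it as a single hashable value (e.g. a dict key). A tiny illustration of
the failure mode only (not zuul's actual make_inventory_dict() code; the
values below are made up):

    hosts = {}
    node = {'name': ['primary', 'secondary']}  # pretend 'name' is now a list

    try:
        hosts[node['name']] = {'ansible_host': '198.51.100.10'}  # old single-name assumption
    except TypeError as e:
        print(e)  # unhashable type: 'list'

    # The source line in the traceback suggests iterating the list instead:
    for name in node['name']:
        hosts[name] = {'ansible_host': '198.51.100.10'}
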
> >
> > We suspect change [3], as it made some changes in the node area.  I
> > did not want to revert it via a force-merge, and I unfortunately don't
> > have time to do something like applying a revert manually on the host
> > and babysitting it.  (I did not have time for a short email, so I sent
> > a long one instead :)
> >
> > At this point, I sent the alert to warn people that the gate is
> > unstable, which is about where things stand now.
> >
> > Good luck,
> >
> > -i
> >
> > [1] http://logs.openstack.org/95/526595/1/check/build-openstack-releasenotes/f38ccb4/job-output.txt.gz
> > [2] https://review.openstack.org/525688
> > [3] https://review.openstack.org/521324
> >
> Digging into some of the issues this morning, I believe that
> citycloud-sto2 has been wedged for some time. I see ready / locked nodes
> sitting for 2+ days.  We also have a few ready / locked nodes in
> rax-iad, which I think are related to the unhashable list error from
> this morning.
>
> As I understand it, the only way to release these nodes is to stop the
> scheduler, is that correct? If so, I'd like to request we add some sort
> of CLI --force option to delete, or some other command, if it makes
> sense.
>
> I'll hold off on a restart until jeblair or shrews has a moment to look at
> logs.
>
> Paul
>
>



-- 
David Shrewsbury (Shrews)