[OpenStack-Infra] Gate Issues
Ian Wienand
iwienand at redhat.com
Fri Dec 8 09:38:24 UTC 2017
Hello,
Just to save people reverse-engineering IRC logs...
At ~04:00 UTC frickler called out that things had been sitting in the
gate for ~17 hours.
Upon investigation, one of the stuck jobs was a
legacy-tempest-dsvm-neutron-full job
(bba5d98bb7b14b99afb539a75ee86a80) as part of
https://review.openstack.org/475955
According to the zuul logs, that build had been sent to ze04:
2017-12-07 15:06:20,962 DEBUG zuul.Pipeline.openstack.gate: Build <Build bba5d98bb7b14b99afb539a75ee86a80 of legacy-tempest-dsvm-neutron-full on <Worker ze04.openstack.org>> started
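For anyone repeating this check, a line like the one above can be matched with a regular expression to pull out the build, job and executor. This is only a minimal sketch that assumes the format of the single log line quoted here; the name parse_build_started is hypothetical and not part of zuul.

```python
import re

# Pattern written against the one log line quoted above; other zuul
# log lines may be formatted differently.
BUILD_STARTED = re.compile(
    r"Build <Build (?P<uuid>[0-9a-f]{32}) of (?P<job>\S+) "
    r"on <Worker (?P<worker>[^>]+)>> started"
)


def parse_build_started(line):
    """Return (uuid, job, worker) for a 'Build ... started' line, else None."""
    match = BUILD_STARTED.search(line)
    if match is None:
        return None
    return match.group("uuid"), match.group("job"), match.group("worker")
```

Feeding it the scheduler debug log line by line and filtering on the worker name would list the builds handed to a given executor.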
However, zuul-executor was not running on ze04. I believe there were
issues with this host yesterday. Both "/etc/init.d/zuul-executor start"
and "service zuul-executor start" reported OK, but didn't actually
start the daemon. Rather than debug, I just set
_SYSTEMCTL_SKIP_REDIRECT=1 and that got it going. We should look into
that; I've noticed similar things with zuul-scheduler too.
At this point, the evidence suggested zuul was waiting for jobs that
would never return. Thus I saved the queues, restarted zuul-scheduler
and re-queued.
Soon after, frickler noticed that releasenotes jobs were now
failing with "could not import extension openstackdocstheme" [1]. We
suspect change [2].
However, the gate did not become healthy. Upon further investigation,
the executors are very frequently failing jobs with:
2017-12-08 06:41:10,412 ERROR zuul.AnsibleJob: [build: 11062f1cca144052afb733813cdb16d8] Exception while executing job
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/zuul/executor/server.py", line 588, in execute
    str(self.job.unique))
  File "/usr/local/lib/python3.5/dist-packages/zuul/executor/server.py", line 702, in _execute
  File "/usr/local/lib/python3.5/dist-packages/zuul/executor/server.py", line 1157, in prepareAnsibleFiles
  File "/usr/local/lib/python3.5/dist-packages/zuul/executor/server.py", line 500, in make_inventory_dict
    for name in node['name']:
TypeError: unhashable type: 'list'
This is leading to a very high number of "retry_limit" failures.
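For reference, this class of error is what Python raises when a list is used as a dictionary key. The sketch below is not zuul's code: the two functions are hypothetical stand-ins, and it assumes the inventory is keyed by node name and that the name arrives as a list rather than a string.

```python
def make_inventory(nodes):
    """Hypothetical stand-in: map each node name to its host vars."""
    hosts = {}
    for node in nodes:
        # Raises TypeError if node["name"] is a list: lists are unhashable
        # and so cannot be used as dict keys.
        hosts[node["name"]] = node["host_vars"]
    return hosts


def make_inventory_tolerant(nodes):
    """Accept a name given either as a string or as a list of strings."""
    hosts = {}
    for node in nodes:
        names = node["name"]
        if isinstance(names, str):
            names = [names]
        for name in names:
            hosts[name] = node["host_vars"]
    return hosts
```

The second variant only illustrates one way code can cope with both shapes of data; whether that is the right fix here depends on what change [3] actually intended.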
We suspect change [3], as it made some changes in the node area. I
did not want to revert it via a force-merge, and unfortunately I don't
have time to do something like apply it manually on the host and
babysit it. (I did not have time for a short email, so I sent a long
one instead :)
At this point, I sent the alert to warn people the gate is unstable,
which is about the latest state.
Good luck,
-i
[1] http://logs.openstack.org/95/526595/1/check/build-openstack-releasenotes/f38ccb4/job-output.txt.gz
[2] https://review.openstack.org/525688
[3] https://review.openstack.org/521324