[tripleo][ci] jobs in retry_limit or skipped
Greetings,

Just FYI.. I believe we hit a bump in the road in upstream infra (not sure yet). It appears to be global and not isolated to tripleo or centos based jobs.

I have a tripleo bug to track it: https://bugs.launchpad.net/tripleo/+bug/1922148

See #opendev for details; it looks like infra is very busy working on fixing the issues atm.

http://eavesdrop.openstack.org/irclogs/%23opendev/%23opendev.2021-03-31.log....
http://eavesdrop.openstack.org/irclogs/%23opendev/latest.log.html
On Thu, Apr 1, 2021 at 7:02 AM Wesley Hayutin <whayutin@redhat.com> wrote:
Greetings,
Just FYI.. I believe we hit a bump in the road in upstream infra (not sure yet). It appears to be global and not isolated to tripleo or centos based jobs.
I have a tripleo bug to track it. https://bugs.launchpad.net/tripleo/+bug/1922148
See #opendev for details; it looks like infra is very busy working on fixing the issues atm.
http://eavesdrop.openstack.org/irclogs/%23opendev/%23opendev.2021-03-31.log.... http://eavesdrop.openstack.org/irclogs/%23opendev/latest.log.html
Zuul got restarted and jobs have started working fine now. If there are no jobs running against your patches, please recheck them slowly, as rechecking everything at once might flood the gates.

Thanks,
Chandan Kumar
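(For context: in OpenDev's Zuul configuration, leaving a review comment containing "recheck" asks Zuul to re-run the check jobs for that change. The snippet below is only a minimal sketch of spacing rechecks out via the Gerrit REST API, in the spirit of the advice above; the username, HTTP password, change numbers, and sleep interval are placeholder assumptions, not values from this thread.)

```python
# Minimal sketch: post a "recheck" comment on a Gerrit change via the
# REST "Set Review" endpoint, pausing between changes so the check
# pipeline is not flooded. Credentials and change numbers are placeholders.
import time
import requests

GERRIT = "https://review.opendev.org"
AUTH = ("my-username", "my-http-password")  # Gerrit HTTP password (placeholder)

def recheck(change_number: int) -> None:
    """Leave a bare 'recheck' review comment on the current revision."""
    url = f"{GERRIT}/a/changes/{change_number}/revisions/current/review"
    resp = requests.post(url, json={"message": "recheck"}, auth=AUTH)
    resp.raise_for_status()

# Space rechecks out rather than firing them all at once.
for change in (782001, 782002):  # placeholder change numbers
    recheck(change)
    time.sleep(300)
```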
On 2021-04-01 13:53:46 +0530 (+0530), Chandan Kumar wrote:
On Thu, Apr 1, 2021 at 7:02 AM Wesley Hayutin <whayutin@redhat.com> wrote:
Greetings,
Just FYI.. I believe we hit a bump in the road in upstream infra (not sure yet). It appears to be global and not isolated to tripleo or centos based jobs.
I have a tripleo bug to track it. https://bugs.launchpad.net/tripleo/+bug/1922148
See #opendev for details; it looks like infra is very busy working on fixing the issues atm.
http://eavesdrop.openstack.org/irclogs/%23opendev/%23opendev.2021-03-31.log.... http://eavesdrop.openstack.org/irclogs/%23opendev/latest.log.html
Zuul got restarted and jobs have started working fine now. If there are no jobs running against your patches, please recheck them slowly, as rechecking everything at once might flood the gates.
It's a complex situation with a few problems intermingled. First, the tripleo-ansible-centos-8-molecule-tripleo-modules job seemed to have a bug of its own causing frequent disconnects of the job node, leading to retries. Also, a recent change in Zuul seems to have introduced a semi-slow memory leak which, when the scheduler comes under memory pressure, causes Zookeeper disconnects that trigger mass build retries. Further, because the source of the memory leak has been really tough to nail down, live debugging directly in the running process has been applied, and this slows the scheduler by orders of magnitude while engaged, triggering similar Zookeeper disconnects as well.
-- 
Jeremy Stanley
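(The thread does not say which tools were used for that live debugging, but the standard-library tracemalloc module illustrates the general technique of hunting a leak inside a running Python process, and also why it is so costly: tracking every allocation with deep tracebacks slows the traced process considerably. Everything below, including the frame depth and sleep interval, is an illustrative assumption rather than what the infra team actually ran.)

```python
# Illustrative sketch only, not the infra team's tooling: compare two
# tracemalloc snapshots taken a while apart to see which call sites keep
# accumulating memory in a long-running Python process.
import time
import tracemalloc

tracemalloc.start(25)  # keep 25 frames per allocation; detailed but expensive

baseline = tracemalloc.take_snapshot()
time.sleep(600)        # let the suspected leak grow for a while
current = tracemalloc.take_snapshot()

# Rank call sites by how much their allocated memory grew since the baseline.
for stat in current.compare_to(baseline, "traceback")[:10]:
    print(stat)
    for line in stat.traceback.format():
        print("   ", line)
```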
participants (3)
- Chandan Kumar
- Jeremy Stanley
- Wesley Hayutin