[tripleo] Gate blocker NODE_FAILURES when running tripleo-ci-centos-9-scenario010-standalone

Ronelle Landy rlandy at redhat.com
Tue Sep 20 16:01:09 UTC 2022


On Tue, Sep 20, 2022 at 10:39 AM Clark Boylan <cboylan at sapwetik.org> wrote:

> On Tue, Sep 20, 2022, at 5:01 AM, Amol Kahat wrote:
> > Hello All,
> >
> > Description of problem:
> > NODE_FAILURE when running tripleo-ci-centos-9-scenario010-standalone job.
> >
> > We have been seeing this failure [1] since 09/17. No logs are available,
> > so it's hard to say what the root cause of this issue is.
>
> NODE_FAILURE indicates that Nodepool could not boot any nodes to fulfill
> the nodeset requested by your job. The reason there are no job logs is that
> this occurs before any Zuul jobs can run. See below for the information
> that is available though.
>
> >
> > This job uses the nodeset single-centos-9-node-nested-virt, so the
> > assumption is that the problem lies with the nested-virt nodeset.
>
> Fungi ended up debugging and correcting [2] this issue, but I was able to
> get a good idea of what might be happening using just a phone and no
> special access. This label is provided by four cloud providers [3][4][5][6],
> only two of which currently have positive max-servers values [7][8]. We can
> check the general health of those providers using Grafana [9]. This shows
> the providers idled for some reason. Finding the specific cause of that
> idling did require extra privileges, and fungi pasted that info for us [10].
>
> While I agree it is more difficult to say what the root cause is, there is
> still plenty of information to narrow the problem down and determine what
> might be going on. Ideally we would publish the Nodepool launcher logs too,
> then we wouldn't need special access to retrieve the traceback in the
> paste. Unfortunately, there has been a long-standing concern that we might
> leak cloud credentials if openstacksdk or Nodepool logging do something we
> don't expect. There is also a Zuul spec to merge Nodepool functionality
> into Zuul proper [11] which should allow us to report better errors when
> NODE_FAILURE occurs.
>
> >
> > [1]
> > https://zuul.openstack.org/builds?job_name=tripleo-ci-centos-9-scenario010-standalone+&skip=0
>
> [2] https://review.opendev.org/c/openstack/project-config/+/858523
> [3]
> https://opendev.org/openstack/project-config/src/branch/master/nodepool/nl02.opendev.org.yaml#L210
> [4]
> https://opendev.org/openstack/project-config/src/branch/master/nodepool/nl03.opendev.org.yaml#L404
> [5]
> https://opendev.org/openstack/project-config/src/branch/master/nodepool/nl04.opendev.org.yaml#L174
> [6]
> https://opendev.org/openstack/project-config/src/branch/master/nodepool/nl04.opendev.org.yaml#L195
> [7]
> https://opendev.org/openstack/project-config/src/branch/master/nodepool/nl04.opendev.org.yaml#L82
> [8]
> https://opendev.org/openstack/project-config/src/branch/master/nodepool/nl04.opendev.org.yaml#L194
> [9]
> https://grafana.opendev.org/d/2b4dba9e25/nodepool-ovh?orgId=1&from=now-5d&to=now
> [10] https://paste.opendev.org/show/816812/
> [11]
> https://zuul-ci.org/docs/zuul/latest/developer/specs/nodepool-in-zuul.html



Thanks for the fix and all of the above info.
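
For anyone repeating the provider check Clark describes (which providers
offer the label, and which of those have a positive max-servers value), here
is a minimal sketch. It assumes the launcher config has already been loaded
into a dict (for example with a YAML parser) and follows the
providers -> pools -> labels layout of the linked nl0X files. The provider
names, the label name, and the helper functions are all hypothetical, and
treating an unset max-servers as "no limit" is my assumption.

```python
from typing import Any, Optional

# Hypothetical label name, for illustration only.
LABEL = "example-nested-virt-label"

# Hypothetical, already-parsed launcher config using the
# providers -> pools -> labels layout.
SAMPLE_CONFIG: dict[str, Any] = {
    "providers": [
        {"name": "cloud-a", "pools": [
            {"name": "main", "max-servers": 0,
             "labels": [{"name": LABEL}]}]},
        {"name": "cloud-b", "pools": [
            {"name": "main", "max-servers": 40,
             "labels": [{"name": LABEL}, {"name": "other-label"}]}]},
        {"name": "cloud-c", "pools": [
            {"name": "main", "max-servers": 10,
             "labels": [{"name": "other-label"}]}]},
    ]
}


def pools_offering_label(
    config: dict[str, Any], label: str
) -> list[tuple[str, str, Optional[int]]]:
    """Return (provider, pool, max_servers) for each pool listing `label`."""
    matches = []
    for provider in config.get("providers", []):
        for pool in provider.get("pools", []):
            names = {entry["name"] for entry in pool.get("labels", [])}
            if label in names:
                # None means max-servers was not set on this pool.
                matches.append(
                    (provider["name"], pool.get("name", ""),
                     pool.get("max-servers"))
                )
    return matches


def usable_pools(
    config: dict[str, Any], label: str
) -> list[tuple[str, str, Optional[int]]]:
    """Keep only pools that are allowed to boot at least one server."""
    return [
        match for match in pools_offering_label(config, label)
        # Assumption: an unset max-servers means "no limit".
        if match[2] is None or match[2] > 0
    ]


if __name__ == "__main__":
    for provider, pool, limit in usable_pools(SAMPLE_CONFIG, LABEL):
        print(f"{provider}/{pool}: max-servers={limit}")
```

If only a few pools survive this filter, the next step is the one described
above: check the health of those providers in Grafana.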

>
>
> >
> > Thanks,
> > --
> > *Amol Kahat*
> > Software Engineer
> > *Red Hat India Pvt. Ltd. Pune, India.*
> > akahat at redhat.com
> > B764 E6F8 F4C1 A1AF 816C  6840 FDD3 BA6C 832D 7715
>
>