[ironic] Ironic tempest jobs hitting retry_limit failures
Hello Ironic,

We've noticed that your tempest jobs have been hitting retry_limit failures recently. What this means is that we attempted to run the job three times, but each time the job failed due to "network" problems and Zuul eventually gave up.

On further investigation I found that this is happening because the ironic tempest jobs are filling the root disk on Rackspace nodes (which have a smaller root / plus an ephemeral drive mounted at /opt) with libvirt qcow2 images. A full root disk seems to cause Ansible to fail because it needs to write to /tmp, and Zuul reports that failure as a "network" error. I've written up my investigation in a bug for you [0]. It would be great if you could take a look, as we are effectively spinning our wheels for about nine hours every time this happens.

I did hold the node I used to investigate. If you'd like to dig in yourselves, just ask the infra team for access to nodepool node ubuntu-bionic-rax-ord-0011007873.

Finally, to help debug these issues in the future I've started adding a cleanup-run playbook [1] which should give us network and disk info (and can be expanded if necessary) for every job when it is done running, even if the disk is full. A rough sketch of what such a playbook might collect is included below.

[0] https://storyboard.openstack.org/#!/story/2006520
[1] https://review.opendev.org/#/c/681100/

Clark
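
For anyone curious what that sort of cleanup playbook might gather, here is a minimal sketch. This is not the actual change in [1]; the task names, commands, and host targeting are just illustrative assumptions.

    # Illustrative cleanup-run style playbook: record disk and network
    # state at the end of a job. Output goes to the job log via debug,
    # so nothing new is written to the (possibly full) node disk.
    - hosts: all
      tasks:
        - name: Capture disk usage
          command: df -h
          register: disk_usage
          ignore_errors: true

        - name: Capture network interface state
          command: ip addr
          register: net_state
          ignore_errors: true

        - name: Log the collected output
          debug:
            msg: "{{ item.stdout_lines | default([]) }}"
          loop:
            - "{{ disk_usage }}"
            - "{{ net_state }}"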