[ironic] Ironic tempest jobs hitting retry_limit failures
Hello Ironic,

We've noticed that your tempest jobs have been hitting retry_limit failures recently. What this means is that we attempted to run the job three times, but each time the job failed due to "network" problems and Zuul eventually gave up.

On further investigation I found that this is happening because the ironic tempest jobs are filling the root disk on Rackspace nodes (which have a smaller root / plus an ephemeral drive mounted at /opt) with libvirt qcow2 images. A full root disk seems to cause Ansible to fail because it needs to write to /tmp, and Zuul reports that failure as a "network" error. I've written up my investigation in a bug for you [0]. It would be great if you could take a look, as we are effectively spinning our wheels for about nine hours every time this happens.

I did hold the node I used to investigate. If you'd like to dig in yourselves, just ask the infra team for access to nodepool node ubuntu-bionic-rax-ord-0011007873.

Finally, to help debug these issues in the future I've started adding a cleanup-run playbook [1] which should give us network and disk info (and can be expanded if necessary) for every job when it is done running, even if the disk is full. A rough sketch of what such a playbook might collect is included below.

[0] https://storyboard.openstack.org/#!/story/2006520
[1] https://review.opendev.org/#/c/681100/

Clark
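
For anyone curious what that sort of cleanup playbook might gather, here is a minimal sketch. This is not the actual change in [1]; the task names, commands, and host targeting are just illustrative assumptions.

    # Illustrative cleanup-run style playbook: record disk and network
    # state at the end of a job. Output goes to the job log via debug,
    # so nothing new is written to the (possibly full) node disk.
    - hosts: all
      tasks:
        - name: Capture disk usage
          command: df -h
          register: disk_usage
          ignore_errors: true

        - name: Capture network interface state
          command: ip addr
          register: net_state
          ignore_errors: true

        - name: Log the collected output
          debug:
            msg: "{{ item.stdout_lines | default([]) }}"
          loop:
            - "{{ disk_usage }}"
            - "{{ net_state }}"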