Hi folks,

We have a 300 strong ironic cluster which has been relatively stable for the most part.
Occasionally we get the IPA timeout errors.




2019-10-20 23:33:37.994
2019-10-20 23:33:37.421 265272 ERROR nova.virt.ironic.driver [req-71933c39-2ac4-497b-8cdc-9578889f42fe cda0725e2a68445db857f56c49f77faf 53a4a4738d9f49e99fe580c1bd51f950 - default default] Error deploying instance c51368f6-b9d8-4c3c-bd63-d1f84ee26bfb on baremetal node 83dc67d4-36be-4004-8847-afd9b969de0e.: InstanceDeployFailure: Failed to provision instance c51368f6-b9d8-4c3c-bd63-d1f84ee26bfb: Timeout reached while waiting for callback for node 83dc67d4-36be-4004-8847-afd9b969de0e
Here's my troubleshooting so far.

DHCP: All good.

Oct 22 09:49:58 sc-control03 dnsmasq-dhcp[965699]: 3076961402 DHCPDISCOVER(ns-601d0738-39) 6c:b3:11:4f:87:ac
Oct 22 09:49:58 sc-control03 dnsmasq-dhcp[965699]: 3076961402 DHCPOFFER(ns-601d0738-39) 10.33.23.94 6c:b3:11:4f:87:ac
Oct 22 09:50:02 sc-control03 dnsmasq-dhcp[965699]: 3076961402 DHCPREQUEST(ns-601d0738-39) 10.33.23.94 6c:b3:11:4f:87:ac
Oct 22 09:50:02 sc-control03 dnsmasq-dhcp[965699]: 3076961402 DHCPACK(ns-601d0738-39) 10.33.23.94 6c:b3:11:4f:87:ac host-10-33-23-94
Oct 22 10:19:47 sc-control03 dnsmasq-dhcp[965699]: 0 DHCPRELEASE(ns-601d0738-39) 10.33.23.94 6c:b3:11:4f:87:ac
Oct 22 10:21:23 sc-control03 dnsmasq-dhcp[965699]: 2188330678 DHCPDISCOVER(ns-601d0738-39) 6c:b3:11:4f:87:ac
Oct 22 10:21:23 sc-control03 dnsmasq-dhcp[965699]: 2188330678 DHCPOFFER(ns-601d0738-39) 10.33.23.187 6c:b3:11:4f:87:ac
Oct 22 10:21:26 sc-control03 dnsmasq-dhcp[965699]: 2188330678 DHCPREQUEST(ns-601d0738-39) 10.33.23.187 6c:b3:11:4f:87:ac
Oct 22 10:21:26 sc-control03 dnsmasq-dhcp[965699]: 2188330678 DHCPACK(ns-601d0738-39) 10.33.23.187 6c:b3:11:4f:87:ac host-10-33-23-
Oct 22 10:22:46 sc-control03 dnsmasq-dhcp[965699]: 3549042952 DHCPDISCOVER(ns-601d0738-39) 6c:b3:11:4f:87:ac

Ironic conductor logs: (shows ironic state proceeding to Power ON, ironic state: deploying.  Then 30 minutes later timeout since IPA failed to check-in

2019-10-22 09:49:03.746 267066 DEBUG ironic.drivers.modules.deploy_utils [req-fabea5a4-da6c-48be-834a-49763c23b0e9 1ded19959aa94e53b6487d7c4f8d3371 acf8cd411e5e4751a61d1ed54e8e874d - default default] Deploy boot mode is uefi for d465fa17-a2ea-43ff-a3a7-033cbb497774. get_boot_mode_for_deploy /usr/lib/python2.7/site-packages/ironic/drivers/modules/deploy_utils.py:798

2019-10-22 09:49:03.747 267066 DEBUG ironic.common.pxe_utils [req-fabea5a4-da6c-48be-834a-49763c23b0e9 1ded19959aa94e53b6487d7c4f8d3371 acf8cd411e5e4751a61d1ed54e8e874d - default default] Building PXE config for node d465fa17-a2ea-43ff-a3a7-033cbb497774 create_pxe_config /usr/lib/python2.7/site-packages/ironic/common/pxe_utils.py:222

2019-10-22 09:49:03.748 267066 DEBUG ironic.drivers.modules.deploy_utils [req-fabea5a4-da6c-48be-834a-49763c23b0e9 1ded19959aa94e53b6487d7c4f8d3371 acf8cd411e5e4751a61d1ed54e8e874d - default default] Deploy boot mode is uefi for d465fa17-a2ea-43ff-a3a7-033cbb497774. get_boot_mode_for_deploy /usr/lib/python2.7/site-packages/ironic/drivers/modules/deploy_utils.py:798

2019-10-22 09:49:04.516 267066 DEBUG ironic.drivers.modules.pxe [req-fabea5a4-da6c-48be-834a-49763c23b0e9 
1ded19959aa94e53b6487d7c4f8d3371 acf8cd411e5e4751a61d1ed54e8e874d - default default] Fetching necessary kernel and ramdisk for node d465fa17-a2ea-43ff-a3a7-033cbb497774 _cache_ramdisk_kernel /usr/lib/python2.7/site-packages/ironic/drivers/modules/pxe.py:382

2019-10-22 09:49:17.761 267066 INFO ironic.conductor.utils [req-fabea5a4-da6c-48be-834a-49763c23b0e9 1ded19959aa94e53b6487d7c4f8d3371 acf8cd411e5e4751a61d1ed54e8e874d - default default] Successfully set node d465fa17-a2ea-43ff-a3a7-033cbb497774 power state to power on by rebooting.

===== 30 minutes later timeout.  IPA failed to check-in

2019-10-22 10:19:37.448 267066 INFO ironic.conductor.task_manager [req-762d8acf-0a55-4d8d-99d8-93020a14547c - - - - -] Node d465fa17-a2ea-43ff-a3a7-033cbb497774 moved to provision state "deploy failed" from state "wait call-back"; target provision state is "active"

2019-10-22 10:19:37.471 267066 DEBUG ironic.common.pxe_utils [req-762d8acf-0a55-4d8d-99d8-93020a14547c - - - - -] Cleaning up PXE config for node d465fa17-a2ea-43ff-a3a7-033cbb497774 clean_up_pxe_config /usr/lib/python2.7/site-packages/ironic/common/pxe_utils.py:288



TFTP logs: shows TFTP client timed out (weird).  Any pointers here?
tftpd shows ramdisk_deployed completed.  Then, it reports that the client timed out.





Oct 22 09:50:15 sc-ironic04 in.tftpd[318194]: RRQ from 10.33.23.94 filename d465fa17-a2ea-43ff-a3a7-033cbb497774/deploy_ramdisk remapped to /tftpboot/d465fa17-a2ea-43ff-a3a7-033cbb497774/deploy_ramdisk

Oct 22 09:50:44 sc-ironic04 in.tftpd[318194]: Client 10.33.23.94 finished d465fa17-a2ea-43ff-a3a7-033cbb497774/deploy_ramdisk
Oct 22 09:50:44 sc-ironic04 in.tftpd[318194]: Client 10.33.23.94 timed out



Oct 22 09:50:03 sc-ironic04 in.tftpd[318148]: remap: input: bootx64.efi
Oct 22 09:50:03 sc-ironic04 in.tftpd[318148]: remap: rule 3: rewrite: /tftpboot/bootx64.efi
Oct 22 09:50:03 sc-ironic04 in.tftpd[318148]: remap: rule 3: exit
Oct 22 09:50:03 sc-ironic04 in.tftpd[318148]: RRQ from 10.33.23.94 filename bootx64.efi remapped to /tftpboot/bootx64.efi
Oct 22 09:50:04 sc-ironic04 in.tftpd[318148]: Error code 8: User aborted the transfer
Oct 22 09:50:04 sc-ironic04 in.tftpd[318150]: remap: input: bootx64.efi
Oct 22 09:50:04 sc-ironic04 in.tftpd[318150]: remap: rule 3: rewrite: /tftpboot/bootx64.efi
Oct 22 09:50:04 sc-ironic04 in.tftpd[318150]: remap: rule 3: exit
Oct 22 09:50:04 sc-ironic04 in.tftpd[318150]: RRQ from 10.33.23.94 filename bootx64.efi remapped to /tftpboot/bootx64.efi
Oct 22 09:50:04 sc-ironic04 in.tftpd[318150]: Client 10.33.23.94 finished bootx64.efi
Oct 22 09:50:04 sc-ironic04 in.tftpd[318151]: remap: input: grubx64.efi
Oct 22 09:50:04 sc-ironic04 in.tftpd[318151]: remap: rule 3: rewrite: /tftpboot/grubx64.efi
Oct 22 09:50:04 sc-ironic04 in.tftpd[318151]: remap: rule 3: exit
Oct 22 09:50:04 sc-ironic04 in.tftpd[318151]: RRQ from 10.33.23.94 filename grubx64.efi remapped to /tftpboot/grubx64.efi
Oct 22 09:50:05 sc-ironic04 in.tftpd[318151]: Client 10.33.23.94 finished grubx64.efi
Oct 22 09:50:05 sc-ironic04 in.tftpd[318152]: remap: input: /grub.cfg-01-6c-b3-11-4f-87-ac
Oct 22 09:50:05 sc-ironic04 in.tftpd[318152]: remap: rule 2: rewrite: /tftpboot//grub.cfg-01-6c-b3-11-4f-87-ac
Oct 22 09:50:05 sc-ironic04 in.tftpd[318152]: remap: rule 2: exit
Oct 22 09:50:05 sc-ironic04 in.tftpd[318152]: RRQ from 10.33.23.94 filename /grub.cfg-01-6c-b3-11-4f-87-ac remapped to /tftpboot//grub.cfg-01-6c-b3-11-4f-87-ac
Oct 22 09:50:05 sc-ironic04 in.tftpd[318152]: Client 10.33.23.94 File not found /tftpboot//grub.cfg-01-6c-b3-11-4f-87-ac
Oct 22 09:50:05 sc-ironic04 in.tftpd[318152]: sending NAK (1, File not found) to 10.33.23.94
Oct 22 09:50:05 sc-ironic04 in.tftpd[318153]: remap: input: /grub.cfg-0A21175E
Oct 22 09:50:05 sc-ironic04 in.tftpd[318153]: remap: rule 2: rewrite: /tftpboot//grub.cfg-0A21175E
Oct 22 09:50:05 sc-ironic04 in.tftpd[318153]: remap: rule 2: exit
Oct 22 09:50:05 sc-ironic04 in.tftpd[318153]: RRQ from 10.33.23.94 filename /grub.cfg-0A21175E remapped to /tftpboot//grub.cfg-0A21175E
Oct 22 09:50:05 sc-ironic04 in.tftpd[318153]: Client 10.33.23.94 File not found /tftpboot//grub.cfg-0A21175E
Oct 22 09:50:05 sc-ironic04 in.tftpd[318153]: sending NAK (1, File not found) to 10.33.23.94
Oct 22 09:50:05 sc-ironic04 in.tftpd[318154]: remap: input: /grub.cfg-0A21175
Oct 22 09:50:05 sc-ironic04 in.tftpd[318154]: remap: rule 2: rewrite: /tftpboot//grub.cfg-0A21175
Oct 22 09:50:05 sc-ironic04 in.tftpd[318154]: remap: rule 2: exit
Oct 22 09:50:05 sc-ironic04 in.tftpd[318154]: RRQ from 10.33.23.94 filename /grub.cfg-0A21175 remapped to /tftpboot//grub.cfg-0A21175
Oct 22 09:50:05 sc-ironic04 in.tftpd[318154]: Client 10.33.23.94 File not found /tftpboot//grub.cfg-0A21175
Oct 22 09:50:05 sc-ironic04 in.tftpd[318154]: sending NAK (1, File not found) to 10.33.23.94
Oct 22 09:50:05 sc-ironic04 in.tftpd[318155]: remap: input: /grub.cfg-0A2117
Oct 22 09:50:05 sc-ironic04 in.tftpd[318155]: remap: rule 2: rewrite: /tftpboot//grub.cfg-0A2117
Oct 22 09:50:05 sc-ironic04 in.tftpd[318155]: remap: rule 2: exit
Oct 22 09:50:05 sc-ironic04 in.tftpd[318155]: RRQ from 10.33.23.94 filename /grub.cfg-0A2117 remapped to /tftpboot//grub.cfg-0A2117
Oct 22 09:50:05 sc-ironic04 in.tftpd[318155]: Client 10.33.23.94 File not found /tftpboot//grub.cfg-0A2117
Oct 22 09:50:05 sc-ironic04 in.tftpd[318155]: sending NAK (1, File not found) to 10.33.23.94
Oct 22 09:50:05 sc-ironic04 in.tftpd[318156]: remap: input: /grub.cfg-0A211
Oct 22 09:50:05 sc-ironic04 in.tftpd[318156]: remap: rule 2: rewrite: /tftpboot//grub.cfg-0A211
Oct 22 09:50:05 sc-ironic04 in.tftpd[318156]: remap: rule 2: exit
Oct 22 09:50:05 sc-ironic04 in.tftpd[318156]: RRQ from 10.33.23.94 filename /grub.cfg-0A211 remapped to /tftpboot//grub.cfg-0A211
Oct 22 09:50:05 sc-ironic04 in.tftpd[318156]: Client 10.33.23.94 File not found /tftpboot//grub.cfg-0A211
Oct 22 09:50:05 sc-ironic04 in.tftpd[318156]: sending NAK (1, File not found) to 10.33.23.94
Oct 22 09:50:05 sc-ironic04 in.tftpd[318157]: remap: input: /grub.cfg-0A21
Oct 22 09:50:05 sc-ironic04 in.tftpd[318157]: remap: rule 2: rewrite: /tftpboot//grub.cfg-0A21
Oct 22 09:50:05 sc-ironic04 in.tftpd[318157]: remap: rule 2: exit
Oct 22 09:50:05 sc-ironic04 in.tftpd[318157]: RRQ from 10.33.23.94 filename /grub.cfg-0A21 remapped to /tftpboot//grub.cfg-0A21
Oct 22 09:50:05 sc-ironic04 in.tftpd[318157]: Client 10.33.23.94 File not found /tftpboot//grub.cfg-0A21
Oct 22 09:50:05 sc-ironic04 in.tftpd[318157]: sending NAK (1, File not found) to 10.33.23.94
Oct 22 09:50:05 sc-ironic04 in.tftpd[318158]: remap: input: /grub.cfg-0A2
Oct 22 09:50:05 sc-ironic04 in.tftpd[318158]: remap: rule 2: rewrite: /tftpboot//grub.cfg-0A2
Oct 22 09:50:05 sc-ironic04 in.tftpd[318158]: remap: rule 2: exit
Oct 22 09:50:05 sc-ironic04 in.tftpd[318178]: RRQ from 10.33.23.94 filename /EFI/centos/grub.cfg remapped to /tftpboot//EFI/centos/grub.cfg
Oct 22 09:50:05 sc-ironic04 in.tftpd[318178]: Client 10.33.23.94 finished /EFI/centos/grub.cfg
Oct 22 09:50:05 sc-ironic04 in.tftpd[318179]: remap: input: /EFI/centos/grub.cfg
Oct 22 09:50:05 sc-ironic04 in.tftpd[318179]: remap: rule 2: rewrite: /tftpboot//EFI/centos/grub.cfg
Oct 22 09:50:05 sc-ironic04 in.tftpd[318179]: remap: rule 2: exit
Oct 22 09:50:05 sc-ironic04 in.tftpd[318179]: RRQ from 10.33.23.94 filename /EFI/centos/grub.cfg remapped to /tftpboot//EFI/centos/grub.cfg
Oct 22 09:50:05 sc-ironic04 in.tftpd[318179]: Client 10.33.23.94 finished /EFI/centos/grub.cfg
Oct 22 09:50:07 sc-ironic04 ironic-conductor: 2019-10-22 09:50:07.298 267066 DEBUG futurist.periodics [-] Submitting periodic callback 'ironic.conductor.manager.ConductorManager._sync_local_state' _process_scheduled /usr/lib/python2.7/site-packages/futurist/periodics.py:639
Oct 22 09:50:07 sc-ironic04 ironic-conductor: 2019-10-22 09:50:07.298 267066 INFO ironic.conductor.manager [-] Executing sync_local_state
Oct 22 09:50:07 sc-ironic04 ironic-conductor: 2019-10-22 09:50:07.305 267066 DEBUG ironic.common.hash_ring [-] Rebuilding cached hash rings ring /usr/lib/python2.7/site-packages/ironic/common/hash_ring.py:51
Oct 22 09:50:07 sc-ironic04 ironic-conductor: 2019-10-22 09:50:07.466 267066 DEBUG ironic.common.hash_ring [-] Finished rebuilding hash rings, available drivers are ipmi, fake, snmp, fake_pxe, pxe_snmp, agent_ipmitool, pxe_ipmitool ring /usr/lib/python2.7/site-packages/ironic/common/hash_ring.py:56
Oct 22 09:50:10 sc-ironic04 in.tftpd[318184]: remap: input: /tftpboot/10.33.23.94.conf
Oct 22 09:50:10 sc-ironic04 in.tftpd[318184]: remap: rule 0: rewrite: /tftpboot/10.33.23.94.conf
Oct 22 09:50:10 sc-ironic04 in.tftpd[318184]: remap: rule 0: exit
Oct 22 09:50:10 sc-ironic04 in.tftpd[318184]: RRQ from 10.33.23.94 filename /tftpboot/10.33.23.94.conf
Oct 22 09:50:10 sc-ironic04 in.tftpd[318184]: Client 10.33.23.94 finished /tftpboot/10.33.23.94.conf
Oct 22 09:50:10 sc-ironic04 in.tftpd[318185]: remap: input: /tftpboot/10.33.23.94.conf
Oct 22 09:50:10 sc-ironic04 in.tftpd[318185]: remap: rule 0: rewrite: /tftpboot/10.33.23.94.conf
Oct 22 09:50:10 sc-ironic04 in.tftpd[318185]: remap: rule 0: exit
Oct 22 09:50:10 sc-ironic04 in.tftpd[318185]: RRQ from 10.33.23.94 filename /tftpboot/10.33.23.94.conf
Oct 22 09:50:10 sc-ironic04 in.tftpd[318185]: Client 10.33.23.94 finished /tftpboot/10.33.23.94.conf
Oct 22 09:50:10 sc-ironic04 in.tftpd[318186]: remap: input: /tftpboot/10.33.23.94.conf
Oct 22 09:50:10 sc-ironic04 in.tftpd[318186]: remap: rule 0: rewrite: /tftpboot/10.33.23.94.conf
Oct 22 09:50:10 sc-ironic04 in.tftpd[318186]: remap: rule 0: exit
Oct 22 09:50:10 sc-ironic04 in.tftpd[318186]: RRQ from 10.33.23.94 filename /tftpboot/10.33.23.94.conf
Oct 22 09:50:10 sc-ironic04 in.tftpd[318186]: Client 10.33.23.94 finished /tftpboot/10.33.23.94.conf
Oct 22 09:50:15 sc-ironic04 in.tftpd[318192]: remap: input: d465fa17-a2ea-43ff-a3a7-033cbb497774/deploy_kernel
Oct 22 09:50:15 sc-ironic04 in.tftpd[318192]: remap: rule 3: rewrite: /tftpboot/d465fa17-a2ea-43ff-a3a7-033cbb497774/deploy_kernel
Oct 22 09:50:15 sc-ironic04 in.tftpd[318192]: remap: rule 3: exit
Oct 22 09:50:15 sc-ironic04 in.tftpd[318192]: RRQ from 10.33.23.94 filename d465fa17-a2ea-43ff-a3a7-033cbb497774/deploy_kernel remapped to /tftpboot/d465fa17-a2ea-43ff-a3a7-033cbb497774/deploy_kernel
Oct 22 09:50:15 sc-ironic04 ironic-conductor: 2019-10-22 09:50:15.898 267066 DEBUG futurist.periodics [-] Submitting periodic callback 'ironic.conductor.manager.ConductorManager._check_rescuewait_timeouts' _process_scheduled /usr/lib/python2.7/site-packages/futurist/periodics.py:639
Oct 22 09:50:15 sc-ironic04 in.tftpd[318192]: Client 10.33.23.94 finished d465fa17-a2ea-43ff-a3a7-033cbb497774/deploy_kernel
Oct 22 09:50:15 sc-ironic04 in.tftpd[318194]: remap: input: d465fa17-a2ea-43ff-a3a7-033cbb497774/deploy_ramdisk
Oct 22 09:50:15 sc-ironic04 in.tftpd[318194]: remap: rule 3: rewrite: /tftpboot/d465fa17-a2ea-43ff-a3a7-033cbb497774/deploy_ramdisk
Oct 22 09:50:15 sc-ironic04 in.tftpd[318194]: remap: rule 3: exit
Oct 22 09:50:15 sc-ironic04 in.tftpd[318194]: RRQ from 10.33.23.94 filename d465fa17-a2ea-43ff-a3a7-033cbb497774/deploy_ramdisk remapped to /tftpboot/d465fa17-a2ea-43ff-a3a7-033cbb497774/deploy_ramdisk



Oct 22 09:50:44 sc-ironic04 in.tftpd[318194]: Client 10.33.23.94 finished d465fa17-a2ea-43ff-a3a7-033cbb497774/deploy_ramdisk
Oct 22 09:50:44 sc-ironic04 in.tftpd[318194]: Client 10.33.23.94 timed out



This has me stumped here.  This exact failure seems to be happening 3 to 4 times a week on different nodes.
Any pointers appreciated.

thanks,
Fred.