[ironic]: Timeout reached while waiting for callback for node
Hi folks, We have a 300 strong ironic cluster which has been relatively stable for the most part.Occasionally we get the IPA timeout errors. | | 2019-10-20 23:33:37.994 | 2019-10-20 23:33:37.421 265272 ERROR nova.virt.ironic.driver [req-71933c39-2ac4-497b-8cdc-9578889f42fe cda0725e2a68445db857f56c49f77faf 53a4a4738d9f49e99fe580c1bd51f950 - default default] Error deploying instance c51368f6-b9d8-4c3c-bd63-d1f84ee26bfb on baremetal node 83dc67d4-36be-4004-8847-afd9b969de0e.: InstanceDeployFailure: Failed to provision instance c51368f6-b9d8-4c3c-bd63-d1f84ee26bfb: Timeout reached while waiting for callback for node 83dc67d4-36be-4004-8847-afd9b969de0e | Here's my troubleshooting so far. DHCP: All good. Oct 22 09:49:58 sc-control03 dnsmasq-dhcp[965699]: 3076961402 DHCPDISCOVER(ns-601d0738-39) 6c:b3:11:4f:87:acOct 22 09:49:58 sc-control03 dnsmasq-dhcp[965699]: 3076961402 DHCPOFFER(ns-601d0738-39) 10.33.23.94 6c:b3:11:4f:87:acOct 22 09:50:02 sc-control03 dnsmasq-dhcp[965699]: 3076961402 DHCPREQUEST(ns-601d0738-39) 10.33.23.94 6c:b3:11:4f:87:acOct 22 09:50:02 sc-control03 dnsmasq-dhcp[965699]: 3076961402 DHCPACK(ns-601d0738-39) 10.33.23.94 6c:b3:11:4f:87:ac host-10-33-23-94Oct 22 10:19:47 sc-control03 dnsmasq-dhcp[965699]: 0 DHCPRELEASE(ns-601d0738-39) 10.33.23.94 6c:b3:11:4f:87:acOct 22 10:21:23 sc-control03 dnsmasq-dhcp[965699]: 2188330678 DHCPDISCOVER(ns-601d0738-39) 6c:b3:11:4f:87:acOct 22 10:21:23 sc-control03 dnsmasq-dhcp[965699]: 2188330678 DHCPOFFER(ns-601d0738-39) 10.33.23.187 6c:b3:11:4f:87:acOct 22 10:21:26 sc-control03 dnsmasq-dhcp[965699]: 2188330678 DHCPREQUEST(ns-601d0738-39) 10.33.23.187 6c:b3:11:4f:87:acOct 22 10:21:26 sc-control03 dnsmasq-dhcp[965699]: 2188330678 DHCPACK(ns-601d0738-39) 10.33.23.187 6c:b3:11:4f:87:ac host-10-33-23-Oct 22 10:22:46 sc-control03 dnsmasq-dhcp[965699]: 3549042952 DHCPDISCOVER(ns-601d0738-39) 6c:b3:11:4f:87:ac Ironic conductor logs: (shows ironic state proceeding to Power ON, ironic state: deploying. Then 30 minutes later timeout since IPA failed to check-in 2019-10-22 09:49:03.746 267066 DEBUG ironic.drivers.modules.deploy_utils [req-fabea5a4-da6c-48be-834a-49763c23b0e9 1ded19959aa94e53b6487d7c4f8d3371 acf8cd411e5e4751a61d1ed54e8e874d - default default] Deploy boot mode is uefi for d465fa17-a2ea-43ff-a3a7-033cbb497774. get_boot_mode_for_deploy /usr/lib/python2.7/site-packages/ironic/drivers/modules/deploy_utils.py:798 2019-10-22 09:49:03.747 267066 DEBUG ironic.common.pxe_utils [req-fabea5a4-da6c-48be-834a-49763c23b0e9 1ded19959aa94e53b6487d7c4f8d3371 acf8cd411e5e4751a61d1ed54e8e874d - default default] Building PXE config for node d465fa17-a2ea-43ff-a3a7-033cbb497774 create_pxe_config /usr/lib/python2.7/site-packages/ironic/common/pxe_utils.py:222 2019-10-22 09:49:03.748 267066 DEBUG ironic.drivers.modules.deploy_utils [req-fabea5a4-da6c-48be-834a-49763c23b0e9 1ded19959aa94e53b6487d7c4f8d3371 acf8cd411e5e4751a61d1ed54e8e874d - default default] Deploy boot mode is uefi for d465fa17-a2ea-43ff-a3a7-033cbb497774. get_boot_mode_for_deploy /usr/lib/python2.7/site-packages/ironic/drivers/modules/deploy_utils.py:798 2019-10-22 09:49:04.516 267066 DEBUG ironic.drivers.modules.pxe [req-fabea5a4-da6c-48be-834a-49763c23b0e9 1ded19959aa94e53b6487d7c4f8d3371 acf8cd411e5e4751a61d1ed54e8e874d - default default] Fetching necessary kernel and ramdisk for node d465fa17-a2ea-43ff-a3a7-033cbb497774 _cache_ramdisk_kernel /usr/lib/python2.7/site-packages/ironic/drivers/modules/pxe.py:382 2019-10-22 09:49:17.761 267066 INFO ironic.conductor.utils [req-fabea5a4-da6c-48be-834a-49763c23b0e9 1ded19959aa94e53b6487d7c4f8d3371 acf8cd411e5e4751a61d1ed54e8e874d - default default] Successfully set node d465fa17-a2ea-43ff-a3a7-033cbb497774 power state to power on by rebooting. ===== 30 minutes later timeout. IPA failed to check-in 2019-10-22 10:19:37.448 267066 INFO ironic.conductor.task_manager [req-762d8acf-0a55-4d8d-99d8-93020a14547c - - - - -] Node d465fa17-a2ea-43ff-a3a7-033cbb497774 moved to provision state "deploy failed" from state "wait call-back"; target provision state is "active" 2019-10-22 10:19:37.471 267066 DEBUG ironic.common.pxe_utils [req-762d8acf-0a55-4d8d-99d8-93020a14547c - - - - -] Cleaning up PXE config for node d465fa17-a2ea-43ff-a3a7-033cbb497774 clean_up_pxe_config /usr/lib/python2.7/site-packages/ironic/common/pxe_utils.py:288 TFTP logs: shows TFTP client timed out (weird). Any pointers here?tftpd shows ramdisk_deployed completed. Then, it reports that the client timed out. Oct 22 09:50:15 sc-ironic04 in.tftpd[318194]: RRQ from 10.33.23.94 filename d465fa17-a2ea-43ff-a3a7-033cbb497774/deploy_ramdisk remapped to /tftpboot/d465fa17-a2ea-43ff-a3a7-033cbb497774/deploy_ramdisk Oct 22 09:50:44 sc-ironic04 in.tftpd[318194]: Client 10.33.23.94 finished d465fa17-a2ea-43ff-a3a7-033cbb497774/deploy_ramdiskOct 22 09:50:44 sc-ironic04 in.tftpd[318194]: Client 10.33.23.94 timed out Oct 22 09:50:03 sc-ironic04 in.tftpd[318148]: remap: input: bootx64.efiOct 22 09:50:03 sc-ironic04 in.tftpd[318148]: remap: rule 3: rewrite: /tftpboot/bootx64.efiOct 22 09:50:03 sc-ironic04 in.tftpd[318148]: remap: rule 3: exitOct 22 09:50:03 sc-ironic04 in.tftpd[318148]: RRQ from 10.33.23.94 filename bootx64.efi remapped to /tftpboot/bootx64.efiOct 22 09:50:04 sc-ironic04 in.tftpd[318148]: Error code 8: User aborted the transferOct 22 09:50:04 sc-ironic04 in.tftpd[318150]: remap: input: bootx64.efiOct 22 09:50:04 sc-ironic04 in.tftpd[318150]: remap: rule 3: rewrite: /tftpboot/bootx64.efiOct 22 09:50:04 sc-ironic04 in.tftpd[318150]: remap: rule 3: exitOct 22 09:50:04 sc-ironic04 in.tftpd[318150]: RRQ from 10.33.23.94 filename bootx64.efi remapped to /tftpboot/bootx64.efiOct 22 09:50:04 sc-ironic04 in.tftpd[318150]: Client 10.33.23.94 finished bootx64.efiOct 22 09:50:04 sc-ironic04 in.tftpd[318151]: remap: input: grubx64.efiOct 22 09:50:04 sc-ironic04 in.tftpd[318151]: remap: rule 3: rewrite: /tftpboot/grubx64.efiOct 22 09:50:04 sc-ironic04 in.tftpd[318151]: remap: rule 3: exitOct 22 09:50:04 sc-ironic04 in.tftpd[318151]: RRQ from 10.33.23.94 filename grubx64.efi remapped to /tftpboot/grubx64.efiOct 22 09:50:05 sc-ironic04 in.tftpd[318151]: Client 10.33.23.94 finished grubx64.efiOct 22 09:50:05 sc-ironic04 in.tftpd[318152]: remap: input: /grub.cfg-01-6c-b3-11-4f-87-acOct 22 09:50:05 sc-ironic04 in.tftpd[318152]: remap: rule 2: rewrite: /tftpboot//grub.cfg-01-6c-b3-11-4f-87-acOct 22 09:50:05 sc-ironic04 in.tftpd[318152]: remap: rule 2: exitOct 22 09:50:05 sc-ironic04 in.tftpd[318152]: RRQ from 10.33.23.94 filename /grub.cfg-01-6c-b3-11-4f-87-ac remapped to /tftpboot//grub.cfg-01-6c-b3-11-4f-87-acOct 22 09:50:05 sc-ironic04 in.tftpd[318152]: Client 10.33.23.94 File not found /tftpboot//grub.cfg-01-6c-b3-11-4f-87-acOct 22 09:50:05 sc-ironic04 in.tftpd[318152]: sending NAK (1, File not found) to 10.33.23.94Oct 22 09:50:05 sc-ironic04 in.tftpd[318153]: remap: input: /grub.cfg-0A21175EOct 22 09:50:05 sc-ironic04 in.tftpd[318153]: remap: rule 2: rewrite: /tftpboot//grub.cfg-0A21175EOct 22 09:50:05 sc-ironic04 in.tftpd[318153]: remap: rule 2: exitOct 22 09:50:05 sc-ironic04 in.tftpd[318153]: RRQ from 10.33.23.94 filename /grub.cfg-0A21175E remapped to /tftpboot//grub.cfg-0A21175EOct 22 09:50:05 sc-ironic04 in.tftpd[318153]: Client 10.33.23.94 File not found /tftpboot//grub.cfg-0A21175EOct 22 09:50:05 sc-ironic04 in.tftpd[318153]: sending NAK (1, File not found) to 10.33.23.94Oct 22 09:50:05 sc-ironic04 in.tftpd[318154]: remap: input: /grub.cfg-0A21175Oct 22 09:50:05 sc-ironic04 in.tftpd[318154]: remap: rule 2: rewrite: /tftpboot//grub.cfg-0A21175Oct 22 09:50:05 sc-ironic04 in.tftpd[318154]: remap: rule 2: exitOct 22 09:50:05 sc-ironic04 in.tftpd[318154]: RRQ from 10.33.23.94 filename /grub.cfg-0A21175 remapped to /tftpboot//grub.cfg-0A21175Oct 22 09:50:05 sc-ironic04 in.tftpd[318154]: Client 10.33.23.94 File not found /tftpboot//grub.cfg-0A21175Oct 22 09:50:05 sc-ironic04 in.tftpd[318154]: sending NAK (1, File not found) to 10.33.23.94Oct 22 09:50:05 sc-ironic04 in.tftpd[318155]: remap: input: /grub.cfg-0A2117Oct 22 09:50:05 sc-ironic04 in.tftpd[318155]: remap: rule 2: rewrite: /tftpboot//grub.cfg-0A2117Oct 22 09:50:05 sc-ironic04 in.tftpd[318155]: remap: rule 2: exitOct 22 09:50:05 sc-ironic04 in.tftpd[318155]: RRQ from 10.33.23.94 filename /grub.cfg-0A2117 remapped to /tftpboot//grub.cfg-0A2117Oct 22 09:50:05 sc-ironic04 in.tftpd[318155]: Client 10.33.23.94 File not found /tftpboot//grub.cfg-0A2117Oct 22 09:50:05 sc-ironic04 in.tftpd[318155]: sending NAK (1, File not found) to 10.33.23.94Oct 22 09:50:05 sc-ironic04 in.tftpd[318156]: remap: input: /grub.cfg-0A211Oct 22 09:50:05 sc-ironic04 in.tftpd[318156]: remap: rule 2: rewrite: /tftpboot//grub.cfg-0A211Oct 22 09:50:05 sc-ironic04 in.tftpd[318156]: remap: rule 2: exitOct 22 09:50:05 sc-ironic04 in.tftpd[318156]: RRQ from 10.33.23.94 filename /grub.cfg-0A211 remapped to /tftpboot//grub.cfg-0A211Oct 22 09:50:05 sc-ironic04 in.tftpd[318156]: Client 10.33.23.94 File not found /tftpboot//grub.cfg-0A211Oct 22 09:50:05 sc-ironic04 in.tftpd[318156]: sending NAK (1, File not found) to 10.33.23.94Oct 22 09:50:05 sc-ironic04 in.tftpd[318157]: remap: input: /grub.cfg-0A21Oct 22 09:50:05 sc-ironic04 in.tftpd[318157]: remap: rule 2: rewrite: /tftpboot//grub.cfg-0A21Oct 22 09:50:05 sc-ironic04 in.tftpd[318157]: remap: rule 2: exitOct 22 09:50:05 sc-ironic04 in.tftpd[318157]: RRQ from 10.33.23.94 filename /grub.cfg-0A21 remapped to /tftpboot//grub.cfg-0A21Oct 22 09:50:05 sc-ironic04 in.tftpd[318157]: Client 10.33.23.94 File not found /tftpboot//grub.cfg-0A21Oct 22 09:50:05 sc-ironic04 in.tftpd[318157]: sending NAK (1, File not found) to 10.33.23.94Oct 22 09:50:05 sc-ironic04 in.tftpd[318158]: remap: input: /grub.cfg-0A2Oct 22 09:50:05 sc-ironic04 in.tftpd[318158]: remap: rule 2: rewrite: /tftpboot//grub.cfg-0A2Oct 22 09:50:05 sc-ironic04 in.tftpd[318158]: remap: rule 2: exitOct 22 09:50:05 sc-ironic04 in.tftpd[318178]: RRQ from 10.33.23.94 filename /EFI/centos/grub.cfg remapped to /tftpboot//EFI/centos/grub.cfgOct 22 09:50:05 sc-ironic04 in.tftpd[318178]: Client 10.33.23.94 finished /EFI/centos/grub.cfgOct 22 09:50:05 sc-ironic04 in.tftpd[318179]: remap: input: /EFI/centos/grub.cfgOct 22 09:50:05 sc-ironic04 in.tftpd[318179]: remap: rule 2: rewrite: /tftpboot//EFI/centos/grub.cfgOct 22 09:50:05 sc-ironic04 in.tftpd[318179]: remap: rule 2: exitOct 22 09:50:05 sc-ironic04 in.tftpd[318179]: RRQ from 10.33.23.94 filename /EFI/centos/grub.cfg remapped to /tftpboot//EFI/centos/grub.cfgOct 22 09:50:05 sc-ironic04 in.tftpd[318179]: Client 10.33.23.94 finished /EFI/centos/grub.cfgOct 22 09:50:07 sc-ironic04 ironic-conductor: 2019-10-22 09:50:07.298 267066 DEBUG futurist.periodics [-] Submitting periodic callback 'ironic.conductor.manager.ConductorManager._sync_local_state' _process_scheduled /usr/lib/python2.7/site-packages/futurist/periodics.py:639Oct 22 09:50:07 sc-ironic04 ironic-conductor: 2019-10-22 09:50:07.298 267066 INFO ironic.conductor.manager [-] Executing sync_local_stateOct 22 09:50:07 sc-ironic04 ironic-conductor: 2019-10-22 09:50:07.305 267066 DEBUG ironic.common.hash_ring [-] Rebuilding cached hash rings ring /usr/lib/python2.7/site-packages/ironic/common/hash_ring.py:51Oct 22 09:50:07 sc-ironic04 ironic-conductor: 2019-10-22 09:50:07.466 267066 DEBUG ironic.common.hash_ring [-] Finished rebuilding hash rings, available drivers are ipmi, fake, snmp, fake_pxe, pxe_snmp, agent_ipmitool, pxe_ipmitool ring /usr/lib/python2.7/site-packages/ironic/common/hash_ring.py:56Oct 22 09:50:10 sc-ironic04 in.tftpd[318184]: remap: input: /tftpboot/10.33.23.94.confOct 22 09:50:10 sc-ironic04 in.tftpd[318184]: remap: rule 0: rewrite: /tftpboot/10.33.23.94.confOct 22 09:50:10 sc-ironic04 in.tftpd[318184]: remap: rule 0: exitOct 22 09:50:10 sc-ironic04 in.tftpd[318184]: RRQ from 10.33.23.94 filename /tftpboot/10.33.23.94.confOct 22 09:50:10 sc-ironic04 in.tftpd[318184]: Client 10.33.23.94 finished /tftpboot/10.33.23.94.confOct 22 09:50:10 sc-ironic04 in.tftpd[318185]: remap: input: /tftpboot/10.33.23.94.confOct 22 09:50:10 sc-ironic04 in.tftpd[318185]: remap: rule 0: rewrite: /tftpboot/10.33.23.94.confOct 22 09:50:10 sc-ironic04 in.tftpd[318185]: remap: rule 0: exitOct 22 09:50:10 sc-ironic04 in.tftpd[318185]: RRQ from 10.33.23.94 filename /tftpboot/10.33.23.94.confOct 22 09:50:10 sc-ironic04 in.tftpd[318185]: Client 10.33.23.94 finished /tftpboot/10.33.23.94.confOct 22 09:50:10 sc-ironic04 in.tftpd[318186]: remap: input: /tftpboot/10.33.23.94.confOct 22 09:50:10 sc-ironic04 in.tftpd[318186]: remap: rule 0: rewrite: /tftpboot/10.33.23.94.confOct 22 09:50:10 sc-ironic04 in.tftpd[318186]: remap: rule 0: exitOct 22 09:50:10 sc-ironic04 in.tftpd[318186]: RRQ from 10.33.23.94 filename /tftpboot/10.33.23.94.confOct 22 09:50:10 sc-ironic04 in.tftpd[318186]: Client 10.33.23.94 finished /tftpboot/10.33.23.94.confOct 22 09:50:15 sc-ironic04 in.tftpd[318192]: remap: input: d465fa17-a2ea-43ff-a3a7-033cbb497774/deploy_kernelOct 22 09:50:15 sc-ironic04 in.tftpd[318192]: remap: rule 3: rewrite: /tftpboot/d465fa17-a2ea-43ff-a3a7-033cbb497774/deploy_kernelOct 22 09:50:15 sc-ironic04 in.tftpd[318192]: remap: rule 3: exitOct 22 09:50:15 sc-ironic04 in.tftpd[318192]: RRQ from 10.33.23.94 filename d465fa17-a2ea-43ff-a3a7-033cbb497774/deploy_kernel remapped to /tftpboot/d465fa17-a2ea-43ff-a3a7-033cbb497774/deploy_kernelOct 22 09:50:15 sc-ironic04 ironic-conductor: 2019-10-22 09:50:15.898 267066 DEBUG futurist.periodics [-] Submitting periodic callback 'ironic.conductor.manager.ConductorManager._check_rescuewait_timeouts' _process_scheduled /usr/lib/python2.7/site-packages/futurist/periodics.py:639Oct 22 09:50:15 sc-ironic04 in.tftpd[318192]: Client 10.33.23.94 finished d465fa17-a2ea-43ff-a3a7-033cbb497774/deploy_kernelOct 22 09:50:15 sc-ironic04 in.tftpd[318194]: remap: input: d465fa17-a2ea-43ff-a3a7-033cbb497774/deploy_ramdiskOct 22 09:50:15 sc-ironic04 in.tftpd[318194]: remap: rule 3: rewrite: /tftpboot/d465fa17-a2ea-43ff-a3a7-033cbb497774/deploy_ramdiskOct 22 09:50:15 sc-ironic04 in.tftpd[318194]: remap: rule 3: exitOct 22 09:50:15 sc-ironic04 in.tftpd[318194]: RRQ from 10.33.23.94 filename d465fa17-a2ea-43ff-a3a7-033cbb497774/deploy_ramdisk remapped to /tftpboot/d465fa17-a2ea-43ff-a3a7-033cbb497774/deploy_ramdisk Oct 22 09:50:44 sc-ironic04 in.tftpd[318194]: Client 10.33.23.94 finished d465fa17-a2ea-43ff-a3a7-033cbb497774/deploy_ramdiskOct 22 09:50:44 sc-ironic04 in.tftpd[318194]: Client 10.33.23.94 timed out This has me stumped here. This exact failure seems to be happening 3 to 4 times a week on different nodes.Any pointers appreciated. thanks,Fred.
Greetings Fred! Reply in-line. On Tue, Oct 22, 2019 at 12:47 PM fsbiz@yahoo.com <fsbiz@yahoo.com> wrote: [trim]
TFTP logs: shows TFTP client timed out (weird). Any pointers here?
Sadly this is one of those things that comes with using TFTP. Issues like this is why the community tends to recommend using ipxe.efi to chainload as you can perform transport over TCP as opposed to UDP where in something might happen mid-transport.
tftpd shows ramdisk_deployed completed. Then, it reports that the client timed out.
Grub does tend to be very abrupt and not wrap up very final actions. I suspect it may just never be sending the ack back and the transfer may be completing. I'm afraid this is one of those things you really need to see on the console what is going on. My guess would be that your deploy_ramdisk lost a packet in transfer or that it was corrupted in transport. It would be interesting to know if the network card stack is performing checksum validation, but for IPv4 it is optional. [trim]
This has me stumped here. This exact failure seems to be happening 3 to 4 times a week on different nodes. Any pointers appreciated.
thanks, Fred.
participants (2)
-
fsbiz@yahoo.com
-
Julia Kreger