[ironic]: Timeout reached while waiting for callback for node
juliaashleykreger at gmail.com
Tue Oct 29 14:29:04 UTC 2019
It is great to hear that you've been able to correlate it.
We've written some things regarding scaling, but the key really
depends on your architecture and how you're utilizing the workload.
Since you mentioned a spine-leaf architecture, physical locality of
conductors will matter as well as having as much efficiency as
possible. I believe CERN is running 4-5 conductors to manage 3000+
physical machines. Naturally you'll need to scale as appropriate to
your deployment pattern. If much of your fleet is being redeployed
often, you may wish to consider having more conductors to match that
1) Use the ``direct`` deploy interface. This moves the act of
unpacking the image files and streaming them to disk to the end node.
This generally requires an HTTP(S) download endpoint offered by the
conductor OR via Swift. Ironic-Python-Agent downloads the file, and
unpacks it in memory and directly streams it to disk. With the
``iscsi`` interface, you can end up in situations, depending on image
composition and settings being passed to dd, where part of your deploy
process is trying to write zeros over the wire in blocks to the remote
disk. Naturally this needlessly consumes IO Bandwidth.
2) Once you're using the ``direct`` deploy_interface, consider using
caching. While we don't use it in CI, ironic does have the capability
to pass configuration for caching proxy servers. This is set on a
per-node basis, and helps if you have any proxy/caching servers
deployed on your spine or in your leaves close to the physical nodes.
Some timers are also present to enable ironic to re-use swift URLs if
you're deploying the same image to multiple servers concurrently.
Swift tempurl usage does reduce the gain from using a caching proxy,
though, so it is something to consider in your architecture and IO
pattern.
3) Consider using ``conductor_groups``. If it would help, you can
localize conductors to specific pools of machines. This
may be useful if you have pools with different security requirements,
or if you have multiple spines and can dedicate some conductors per
4) Turn off periodic driver tasks for drivers you're not using. Power
sync and sensor data collection are two periodic workers that consume
resources when they run, and the periodic tasks of other drivers still
consume a worker slot and query the database to see if there is work
to be done. You may also want to increase the number of permitted
Power sync can be a huge issue on older versions. I believe Stein is
where we improved the parallelism of the power sync workers in Ironic,
and Train now has power state callbacks to nova, which will greatly
reduce the ironic-api and nova-compute processor overhead.
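
As a rough illustration of the per-node settings from points 1-3, here
is a minimal Python sketch that only builds the matching
``openstack baremetal node set`` commands as strings. The node name,
proxy URL, and group name are hypothetical placeholders. It assumes
``direct`` is already listed in ``enabled_deploy_interfaces`` on your
conductors and that your release supports conductor groups, so treat
it as a starting point rather than a recipe.

```python
import shlex


def node_set_commands(node, conductor_group=None, http_proxy=None,
                      https_proxy=None, no_proxy=None):
    """Build `openstack baremetal node set` commands for one node.

    Nothing is executed here; the commands are returned as strings so
    they can be reviewed before being run by hand or from a script.
    """
    base = ["openstack", "baremetal", "node", "set", node]

    # Point 1: switch the node to the ``direct`` deploy interface.
    commands = [base + ["--deploy-interface", "direct"]]

    # Point 2: per-node caching proxy settings live in driver_info.
    proxy_info = {
        "image_http_proxy": http_proxy,
        "image_https_proxy": https_proxy,
        "image_no_proxy": no_proxy,
    }
    driver_args = []
    for key, value in proxy_info.items():
        if value:
            driver_args += ["--driver-info", f"{key}={value}"]
    if driver_args:
        commands.append(base + driver_args)

    # Point 3: pin the node to a conductor group (e.g. one per spine).
    if conductor_group:
        commands.append(base + ["--conductor-group", conductor_group])

    return [shlex.join(cmd) for cmd in commands]


if __name__ == "__main__":
    # Hypothetical node, group, and proxy values for illustration only.
    for line in node_set_commands(
            "node-leaf1-001",
            conductor_group="leaf1",
            http_proxy="http://proxy.leaf1.example.com:3128"):
        print(line)
```

The conductors serving a group would also need a matching
``conductor_group`` value in the ``[conductor]`` section of their
ironic.conf for the node to be picked up.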
Hope this helps!
On Mon, Oct 28, 2019 at 3:26 PM fsbiz at yahoo.com <fsbiz at yahoo.com> wrote:
> Thanks Julia.
> In addition to what you mentioned this particular issue seems to have cropped up when we added 100 more baremetal nodes.
> I've also narrowed down the issue (TFTP timeouts) to when 3-4 baremetal nodes are in "deploy" state and downloading the OS via iSCSI. Each iSCSI transfer takes about 6 Gbps, and thus with four transfers we are over the 20 Gbps capacity of our leaf-spine links. We are slowly migrating to iPXE so it should help.
> That being said, is there a document on large scale ironic design architectures?
> We are looking into a DC design (primarily for baremetals) for up to 2500 nodes.
> On Wednesday, October 23, 2019, 03:19:41 PM PDT, Julia Kreger <juliaashleykreger at gmail.com> wrote:
> Greetings Fred!
> Reply in-line.
> On Tue, Oct 22, 2019 at 12:47 PM fsbiz at yahoo.com <fsbiz at yahoo.com> wrote:
> TFTP logs: shows TFTP client timed out (weird). Any pointers here?
> Sadly this is one of those things that comes with using TFTP. Issues like this are why the community tends to recommend using ipxe.efi to chainload, as you can perform transport over TCP as opposed to UDP, where something might happen mid-transport.
> tftpd shows ramdisk_deployed completed. Then, it reports that the client timed out.
> Grub does tend to be very abrupt and not wrap up its final actions. I suspect it may just never be sending the ack back even though the transfer may be completing. I'm afraid this is one of those things where you really need to see on the console what is going on. My guess would be that your deploy_ramdisk lost a packet in transfer or that it was corrupted in transport. It would be interesting to know if the network card stack is performing checksum validation, but for IPv4 it is optional.
> This has me stumped here. This exact failure seems to be happening 3 to 4 times a week on different nodes.
> Any pointers appreciated.