Thanks Julia.
In addition to what you mentioned, this particular issue seems to have cropped up when we added 100 more baremetal nodes.
I've also narrowed the issue (TFTP timeouts) down to when 3-4 baremetal nodes are in the "deploy" state and downloading the OS via iSCSI. Each iSCSI transfer takes about 6 Gbps, so with four transfers we exceed the 20 Gbps capacity of our leaf-spine links. We are slowly migrating to iPXE, so that should help.
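For reference, here is the back-of-the-envelope arithmetic as a quick sketch. The function names are just mine for illustration, and it assumes all transfers share the same leaf-spine path with no other traffic on it, which is a simplification:

```python
# Rough estimate only. The 6 Gbps and 20 Gbps figures are the ones quoted
# above; everything else here is a simplifying assumption.
PER_TRANSFER_GBPS = 6     # approximate rate of one iSCSI image transfer
LINK_CAPACITY_GBPS = 20   # leaf-spine capacity


def link_load(transfers, per_transfer_gbps=PER_TRANSFER_GBPS):
    """Aggregate bandwidth used by a number of simultaneous transfers."""
    return transfers * per_transfer_gbps


def max_concurrent_deploys(capacity_gbps=LINK_CAPACITY_GBPS,
                           per_transfer_gbps=PER_TRANSFER_GBPS):
    """Largest number of simultaneous transfers that fits within capacity."""
    return capacity_gbps // per_transfer_gbps


if __name__ == "__main__":
    for n in (3, 4):
        load = link_load(n)
        status = "over capacity" if load > LINK_CAPACITY_GBPS else "fits"
        print(f"{n} transfers: {load} Gbps ({status})")
    print("max concurrent deploys:", max_concurrent_deploys())
```

So three deploys fit (18 Gbps) and the fourth pushes us to 24 Gbps, which matches where the timeouts start.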
That being said, is there a document on large-scale ironic design architectures?
We are looking into a DC design (primarily for baremetal) for up to 2500 nodes.
Thanks,
Fred