Re: [ironic]: Timeout reached while waiting for callback for node
Thanks Julia. In addition to what you mentioned, this particular issue seems to have cropped up when we added 100 more baremetal nodes.

I've also narrowed the issue (TFTP timeouts) down to periods when 3-4 baremetal nodes are in the "deploy" state and downloading the OS via iSCSI. Each iSCSI transfer takes about 6 Gbps, so with four transfers we are over the 20 Gbps capacity of our leaf-spine links. We are slowly migrating to iPXE, so that should help.

That being said, is there a document on large-scale Ironic design architectures? We are looking into a DC design (primarily for baremetals) for up to 2500 nodes.

thanks,
Fred

On Wednesday, October 23, 2019, 03:19:41 PM PDT, Julia Kreger <juliaashleykreger@gmail.com> wrote:

Greetings Fred!

Reply in-line.

On Tue, Oct 22, 2019 at 12:47 PM fsbiz@yahoo.com <fsbiz@yahoo.com> wrote:

[trim]

TFTP logs: shows the TFTP client timed out (weird). Any pointers here?

Sadly, this is one of those things that comes with using TFTP. Issues like this are why the community tends to recommend chainloading ipxe.efi, since you can then perform the transfer over TCP rather than UDP, where something might happen mid-transport.

tftpd shows ramdisk_deployed completed. Then it reports that the client timed out.

Grub does tend to be very abrupt and not wrap up its final actions. I suspect it may simply never send the final ack back while the transfer actually completes. I'm afraid this is one of those things where you really need to watch the console to see what is going on. My guess would be that your deploy_ramdisk lost a packet in transfer, or that it was corrupted in transport. It would be interesting to know whether the network card stack is performing checksum validation, but for IPv4 it is optional.

[trim]

This has me stumped. This exact failure seems to be happening 3 to 4 times a week on different nodes. Any pointers appreciated.

thanks,
Fred.
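Since the migration from plain PXE/TFTP to iPXE comes up here, the following is a minimal, hedged sketch of the knobs typically involved in Ironic of that era. The option names are the standard ones, but the values, the host/port, and the choice between the older global toggle and the newer per-node boot interface are assumptions to verify against your release:

    # ironic.conf -- sketch only, verify against your release's documentation
    [pxe]
    # older releases: global iPXE toggle
    ipxe_enabled = True

    [deploy]
    # iPXE pulls the kernel/ramdisk over HTTP instead of TFTP
    http_url = http://<conductor-ip>:8080   # placeholder endpoint
    http_root = /httpboot

    [DEFAULT]
    # newer releases: iPXE as a dedicated boot interface
    enabled_boot_interfaces = ipxe,pxe

    # per node, when using the dedicated boot interface:
    # openstack baremetal node set <node-uuid> --boot-interface ipxe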
That is great news to hear that you've been able to correlate it. We've written some things regarding scaling, but the key really depends on your architecture and how you're utilizing the workload. Since you mentioned a spine-leaf architecture, physical locality of conductors will matter, as will having as much efficiency as possible. I believe CERN is running 4-5 conductors to manage ?3000+? physical machines. Naturally you'll need to scale as appropriate to your deployment pattern. If much of your fleet is being redeployed often, you may wish to consider having more conductors to match that overall load.

1) Use the ``direct`` deploy interface. This moves the act of unpacking the image files and streaming them to disk onto the end node. It generally requires an HTTP(S) download endpoint offered by the conductor, or via Swift. Ironic-Python-Agent downloads the file, unpacks it in memory, and streams it directly to disk. With the ``iscsi`` interface you can end up in situations, depending on the image composition and the settings passed to dd, where part of your deploy process is writing zeros over the wire in blocks to the remote disk. Naturally this needlessly consumes IO bandwidth.

2) Once you're using the ``direct`` deploy interface, consider using caching. While we don't use it in CI, ironic does have the capability to pass configuration for caching proxy servers. This is set on a per-node basis and is useful if you have proxy/caching servers deployed on your spine or in your leafs, close to the physical nodes. Some timers are also present to let ironic re-use Swift URLs if you're deploying the same image to multiple servers concurrently. Swift tempurl usage does reduce the gain from a caching proxy, but it is something to consider in your architecture and IO pattern. https://docs.openstack.org/ironic/latest/admin/drivers/ipa.html#using-proxie...

3) Consider using ``conductor_groups``. If it would help, you can localize conductors to specific pools of machines. This may be useful if you have pools with different security requirements, or if you have multiple spines and can dedicate some conductors per spine. https://docs.openstack.org/ironic/latest/admin/conductor-groups.html

4) Turn off periodic driver tasks for drivers you're not using. Power sync and sensor data collection are two periodic workers that consume resources when they run, and the periodic tasks of other drivers still consume a worker slot and query the database to see if there is work to be done. You may also want to increase the number of permitted workers.

Power sync can be a huge issue on older versions. I believe Stein is where we improved the parallelism of the power sync workers in Ironic, and Train now has the power state callback with nova, which will greatly reduce the ironic-api and nova-compute processor overhead.

Hope this helps!

-Julia
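To make suggestions 1-3 above a bit more concrete, here is a minimal sketch of the corresponding configuration, assuming roughly Stein/Train-era option names; the group name, proxy URL, and node UUIDs are placeholders, and the per-node proxy fields are the driver_info settings described in the IPA proxy documentation linked above:

    # ironic.conf -- sketch only
    [DEFAULT]
    enabled_deploy_interfaces = direct,iscsi

    [conductor]
    # pin this conductor to a named group, e.g. one group per spine or pool
    conductor_group = spine1                # placeholder group name

    # per-node settings (UUIDs and URLs are placeholders):
    # openstack baremetal node set <node-uuid> --deploy-interface direct
    # openstack baremetal node set <node-uuid> --conductor-group spine1
    # openstack baremetal node set <node-uuid> \
    #   --driver-info image_http_proxy=http://proxy.leaf1.example.com:3128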
Hi Fred,

To confirm what Julia said: we currently have ~3700 physical nodes in Ironic, managed by 3 controllers (16GB VMs running httpd, conductor, and inspector). We recently moved to larger nodes for these controllers due to the "thundering image" problem Julia was mentioning: when we deployed ~100 nodes in parallel, the conductors were running out of memory. We have yet to see whether that change has the desired effect, though: we will add another 1000 nodes or so over the coming weeks.

Like you, this is all with the iscsi deploy interface. We didn't set things up with 'direct' initially as we didn't have a Swift endpoint, but if this problem persists we will look into it, as 'direct' will clearly scale better.

The recently added parallelism in Ironic's power sync sped up the sync loop significantly: while the loops used to run into each other, the conductors can now check each of their 1000+ servers in under 60 seconds.

Cheers,
Arne
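For reference, the power-sync parallelism and periodic-task load that Julia and Arne mention are governed by a handful of [conductor] options; the values below are illustrative assumptions rather than recommendations, and the exact set of options depends on your release:

    # ironic.conf -- illustrative values only
    [conductor]
    sync_power_state_interval = 60    # seconds between power sync loops
    sync_power_state_workers = 8      # parallel power sync workers (Stein onwards)
    send_sensor_data = False          # skip sensor collection if you do not use it
    workers_pool_size = 300           # size of the conductor's worker pool

    [DEFAULT]
    # only load the hardware types you actually use, so unused drivers'
    # periodic tasks do not occupy worker slots
    enabled_hardware_types = ipmi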
Thanks Arne and Julia for the great suggestions on scaling ironic nodes.

We are currently trying to root-cause an issue (it has occurred twice) where a large number of nodes (but not all of them) suddenly migrate from one ironic conductor (IC) to another. E.g. 69 nodes moved from sc-ironic04 and sc-ironic05 to sc-ironic06 between 21:07 and 21:10 on Nov. 23rd:

[root@sc-ironic06 nova]# grep "moving from" /var/log/nova/nova-compute.log-20191124
2019-11-23 21:07:46.606 210241 INFO nova.compute.resource_tracker [req-96baf341-0ecb-4dec-a204-32c2f77f3f64 - - - - -] ComputeNode 1cb9ef2e-aa7d-4e25-8878-14669a3ead7a moving from sc-ironic05.nvc.nvidia.com to sc-ironic06.nvc.nvidia.com
2019-11-23 21:08:17.518 210241 INFO nova.compute.resource_tracker [req-96baf341-0ecb-4dec-a204-32c2f77f3f64 - - - - -] ComputeNode 56e58642-12ac-4455-bc95-2a328198f845 moving from sc-ironic04.nvc.nvidia.com to sc-ironic06.nvc.nvidia.com
2019-11-23 21:08:35.843 210241 INFO nova.compute.resource_tracker [req-96baf341-0ecb-4dec-a204-32c2f77f3f64 - - - - -] ComputeNode e0b9b94c-2ea3-4324-a85f-645d572e370b moving from sc-ironic05.nvc.nvidia.com to sc-ironic06.nvc.nvidia.com
2019-11-23 21:08:42.264 210241 INFO nova.compute.resource_tracker [req-96baf341-0ecb-4dec-a204-32c2f77f3f64 - - - - -] ComputeNode 1c7d461c-2de7-4d9a-beff-dcb490c7b2e4 moving from sc-ironic04.nvc.nvidia.com to sc-ironic06.nvc.nvidia.com
2019-11-23 21:08:43.819 210241 INFO nova.compute.resource_tracker [req-96baf341-0ecb-4dec-a204-32c2f77f3f64 - - - - -] ComputeNode 73ed8bd4-23c2-46bc-b748-e6f5ab6fa932 moving from sc-ironic05.nvc.nvidia.com to sc-ironic06.nvc.nvidia.com
2019-11-23 21:08:45.651 210241 INFO nova.compute.resource_tracker [req-96baf341-0ecb-4dec-a204-32c2f77f3f64 - - - - -] ComputeNode 51da1570-5666-4a21-a46f-4b7510d28415 moving from sc-ironic05.nvc.nvidia.com to sc-ironic06.nvc.nvidia.com
2019-11-23 21:08:46.905 210241 INFO nova.compute.resource_tracker [req-96baf341-0ecb-4dec-a204-32c2f77f3f64 - - - - -] ComputeNode 38b41797-4b97-405b-bbd5-fccc61d237c3 moving from sc-ironic04.nvc.nvidia.com to sc-ironic06.nvc.nvidia.com
2019-11-23 21:08:49.065 210241 INFO nova.compute.resource_tracker [req-96baf341-0ecb-4dec-a204-32c2f77f3f64 - - - - -] ComputeNode c5c89749-a11c-4eb8-b159-e8d47ecfcbb9 moving from sc-ironic04.nvc.nvidia.com to sc-ironic06.nvc.nvidia.com

Restarting the nova-compute and ironic-conductor services on the IC seems to have fixed the issue, but we are still in the root cause analysis phase and seem to have hit a wall narrowing this down. Any suggestions are welcome.

Thanks,
Fred.
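One possible angle for the root-cause analysis, offered as a hedged suggestion rather than a confirmed diagnosis: the nova ironic driver distributes nodes across nova-compute services with a hash ring, so a burst of "ComputeNode ... moving from" messages usually indicates the ring was rebuilt because one of the compute services (or the conductor behind it) was briefly considered down or unreachable. Some quick checks around the 21:07-21:10 window (the conductor listing needs a Stein-or-newer API and client, and the log path is a typical default that may differ in your deployment):

    # were any of the nova-compute (ironic) services flapping or marked down?
    openstack compute service list --service nova-compute

    # are all ironic conductors alive and heartbeating? (API >= 1.49)
    openstack baremetal conductor list

    # correlate with conductor-side errors in the same window
    grep -iE "error|timeout" /var/log/ironic/ironic-conductor.log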
participants (3)
- Arne Wiebalck
- fsbiz@yahoo.com
- Julia Kreger