Hello Team,
I hope you all are good.
I am using openstack ironic deployment and have some issues and some observations. These are:
Issue: At the time of "openstack server create" for launching baremetal node, I came across the following multiple observations:
- Sometimes when I launch baremetal node on openstack, after one time pxe booting, the baremetal node goes down again and then comes up and goes into second time booting and gets stuck there in "Probing" state ( Seen on node's console) BUT according to openstack horizon, it is up and running and according to "openstack baremetal node show", it is in "Active" state.
- And sometimes when i launch baremetal node on openstack, after one time pxe booting, the baremetal node goes down again and then comes up, the "spawning" state on openstack horizon goes into ERROR.
Error seen in "nova-compute-ironic-0" container is :
"ERROR nova.compute.manager [instance: edd447c6-12ac-49ba-b0bc-f419aff4892a] nova.exception.InstanceDeployFailure: Failed to provision instance edd447c6-12ac-49ba-b0bc-f419aff4892a: Timeout reached while waiting for callback for node 75210cc4-ad98-442d-ace1-89ce69467580"
- The baremetal node always takes near about 2 hours to be in "available" state from "cleaning" and "clean-wait". Is it correct behaviour ?
Please guide me how to resolve this.
Regards
Akshay
Hi Akshay,
On 09.10.20 06:58, Akshay 346 wrote:
> Hello Team,
> I hope you all are good.
> I am using openstack ironic deployment and have some issues and some observations. These are:
> Issue: At the time of "openstack server create" for launching baremetal node, I came across the following multiple observations:
> - Sometimes when I launch baremetal node on openstack, after one time pxe booting, the baremetal node goes down again and then comes up and goes into second time booting and gets stuck there in "Probing" state ( Seen on node's console) BUT according to openstack horizon, it is up and running and according to "openstack baremetal node show", it is in "Active" state.
Right: in order to deploy a node, Ironic will boot the node via PXE into a ramdisk (with the Ironic Python Agent) to download and install the user image. Once this is done, it boots the node from the just installed disk. These are the two boot events you see. At the moment when Ironic boots the node the second time, Ironic is done with the deployment. At this stage the node moves to active, which means there is now a user instance on this node. Whether or not the node is able to boot from this image does not affect this state.
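The distinction above (an "active" provision state only says the deployment finished, not that the installed OS booted) can be checked from the node record. The following is a minimal sketch, assuming the node record was exported with `openstack baremetal node show <node> -f json`; the helper name `summarize_node` and the sample record are hypothetical and only for illustration.

```python
import json


def summarize_node(node_json: str) -> str:
    """Summarize what Ironic knows about a node from its JSON record."""
    node = json.loads(node_json)
    provision = node.get("provision_state")
    power = node.get("power_state")
    error = node.get("last_error")

    if provision == "active":
        # Ironic considers the deployment done; it does not know whether
        # the user image actually boots, so the console still has to be checked.
        verdict = "deployment finished; check the console to confirm the OS booted"
    elif provision in ("deploy failed", "clean failed"):
        verdict = f"failed: {error or 'no last_error recorded'}"
    else:
        verdict = "not active"
    return f"provision_state={provision} power_state={power}: {verdict}"


# Hypothetical sample record, reduced to the three fields used here.
sample = '{"provision_state": "active", "power_state": "power on", "last_error": null}'
print(summarize_node(sample))
```

A node stuck at "Probing" on its console would still produce the first verdict, which matches what Horizon and "openstack baremetal node show" report.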
> - And sometimes when i launch baremetal node on openstack, after one time pxe booting, the baremetal node goes down again and then comes up, the "spawning" state on openstack horizon goes into ERROR.
> Error seen in "nova-compute-ironic-0" container is :
> "ERROR nova.compute.manager [instance: edd447c6-12ac-49ba-b0bc-f419aff4892a] nova.exception.InstanceDeployFailure: Failed to provision instance edd447c6-12ac-49ba-b0bc-f419aff4892a: Timeout reached while waiting for callback for node 75210cc4-ad98-442d-ace1-89ce69467580"
In this case, something went wrong during the deployment. The Ironic deploy logs will give some hint about the cause. The specific error you quote looks like Ironic timed out waiting for the node to call back. When the deployment fails, Ironic may try to clean the node and this is the second boot you see.
> - The baremetal node always takes near about 2 hours to be in "available" state from "cleaning" and "clean-wait". Is it correct behaviour ?
That depends on how you configured cleaning, but if Ironic, for instance, needs to erase all disks, cleaning can take a while. If you have added your keys to the IPA image, you can log into the node while it is cleaning and actually check what it is doing.
HTH, Arne
-- Arne Wiebalck CERN IT
On Fri, Oct 9, 2020 at 6:09 AM Arne Wiebalck <arne.wiebalck@cern.ch> wrote:
> Hi Akshay,
> On 09.10.20 06:58, Akshay 346 wrote:
>> Hello Team,
>> I hope you all are good.
>> I am using openstack ironic deployment and have some issues and some observations. These are:
>> Issue: At the time of "openstack server create" for launching baremetal node, I came across the following multiple observations:
>> - Sometimes when I launch baremetal node on openstack, after one time pxe booting, the baremetal node goes down again and then comes up and goes into second time booting and gets stuck there in "Probing" state ( Seen on node's console) BUT according to openstack horizon, it is up and running and according to "openstack baremetal node show", it is in "Active" state.
> Right: in order to deploy a node, Ironic will boot the node via PXE into a ramdisk (with the Ironic Python Agent) to download and install the user image. Once this is done, it boots the node from the just installed disk. These are the two boot events you see.
> At the moment when Ironic boots the node the second time, Ironic is done with the deployment. At this stage the node moves to active, which means there is now a user instance on this node. Whether or not the node is able to boot from this image does not affect this state.
>> - And sometimes when i launch baremetal node on openstack, after one time pxe booting, the baremetal node goes down again and then comes up, the "spawning" state on openstack horizon goes into ERROR.
>> Error seen in "nova-compute-ironic-0" container is :
>> "ERROR nova.compute.manager [instance: edd447c6-12ac-49ba-b0bc-f419aff4892a] nova.exception.InstanceDeployFailure: Failed to provision instance edd447c6-12ac-49ba-b0bc-f419aff4892a: Timeout reached while waiting for callback for node 75210cc4-ad98-442d-ace1-89ce69467580"
One thing worth noting is that callback timeout failures are typically a result of the physical networking or some process involving the physical infrastructure. A good first step is to watch the physical machine's console, if you can, and see whether it network boots. The next step is to make sure it is actually able to perform its lookup and heartbeat operations against the ironic API. Routing or firewall issues between your provisioning network and your API endpoints can cause deployments to fail like this.
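A rough way to test the routing and firewall part of this is a plain TCP connection attempt, run from a host on the provisioning network (or from the ramdisk itself, if you can log in). This is only a sketch: the helper name `can_reach` is hypothetical, the address in the usage comment is a placeholder, and port 6385 is assumed to be where your Ironic API listens. A successful connection shows the path is open; it does not prove that lookup and heartbeat requests succeed.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable networks, and timeouts.
        return False


# Usage (placeholder address; substitute your own Ironic API endpoint):
#   can_reach("192.0.2.10", 6385)
```

If this returns False from the provisioning network but True from the control plane, the problem is likely in routing or filtering between the two.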
> In this case, something went wrong during the deployment. The Ironic deploy logs will give some hint about the cause. The specific error you quote looks like Ironic timed out waiting for the node to call back. When the deployment fails, Ironic may try to clean the node and this is the second boot you see.
>> - The baremetal node always takes near about 2 hours to be in "available" state from "cleaning" and "clean-wait". Is it correct behaviour ?
> That depends on how you configured cleaning, but if Ironic, for instance, needs to erase all disks, cleaning can take a while. If you have added your keys to the IPA image, you can log into the node while it is cleaning and actually check what it is doing.
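To get a feeling for why erasing disks can take hours, here is a back-of-the-envelope sketch. It assumes cleaning overwrites the whole disk sequentially; the disk size and write throughput below are illustrative assumptions, not measurements from this deployment, and the function name is hypothetical.

```python
def estimate_erase_hours(disk_bytes: int, write_bytes_per_second: float, passes: int = 1) -> float:
    """Estimate the hours needed to overwrite a disk `passes` times."""
    if write_bytes_per_second <= 0:
        raise ValueError("write throughput must be positive")
    seconds = disk_bytes * passes / write_bytes_per_second
    return seconds / 3600


# Illustrative numbers only: a 1 TB disk written at a sustained 140 MB/s.
hours = estimate_erase_hours(10**12, 140 * 10**6)
print(f"{hours:.1f} h")
```

With several disks erased one after another, or with more than one overwrite pass, the total grows accordingly.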
> HTH, Arne
> -- Arne Wiebalck CERN IT
participants (3)
- Akshay 346
- Arne Wiebalck
- Julia Kreger