[ironic] How to move nodes from a 'clean failed' state into 'Available'
Hello Team,

I had a situation where my undercloud node had a problem with its disk and became disconnected from the overcloud. I couldn't restore the undercloud controller and ended up re-installing it (running 'openstack undercloud install'). The installation ended successfully, but now I'm in a situation where cleanup of the deployed overcloud nodes fails:

(undercloud) [stack@interop010 ~]$ openstack baremetal node list
+--------------------------------------+------------+---------------+-------------+--------------------+-------------+
| UUID                                 | Name       | Instance UUID | Power State | Provisioning State | Maintenance |
+--------------------------------------+------------+---------------+-------------+--------------------+-------------+
| 97b9a603-f64f-47c1-9fb4-6c68a5b38ff6 | interop025 | None          | power on    | clean failed       | True        |
| 4b02703a-f765-4ebb-85ed-75e88b4cbea5 | interop026 | None          | power on    | clean failed       | True        |
+--------------------------------------+------------+---------------+-------------+--------------------+-------------+

I've tried to move a node to the available state but cannot:

(undercloud) [stack@interop010 ~]$ openstack baremetal node provide 97b9a603-f64f-47c1-9fb4-6c68a5b38ff6
The requested action "provide" can not be performed on node "97b9a603-f64f-47c1-9fb4-6c68a5b38ff6" while it is in state "clean failed". (HTTP 400)

My question is: how do I make the nodes available again? The deployment of the overcloud fails with:
ERROR due to "Message: No valid host was found. , Code: 500"

Thanks,
Igal
Hello all,

While troubleshooting this, another observation: when I put a node into the provide state with 'openstack baremetal node provide 97b9a603-f64f-47c1-9fb4-6c68a5b38ff6', it starts the cleaning process and the node boots into PXE, but the undercloud ignores it. When I tap the port I see that the requests reach its interface:

(undercloud) [stack@interop010 ~]$ sudo tcpdump -i br-ctlplane
10:43:10.600421 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from a0:36:9f:95:dd:e2 (oui Unknown), length 548

But at the same time dnsmasq ignores it:

(undercloud) [stack@interop010 ~]$ sudo tail -f /var/log/containers/ironic-inspector/dnsmasq.log
Mar 24 10:39:43 dnsmasq-dhcp[7]: DHCPDISCOVER(br-ctlplane) 6c:ae:8b:69:ee:80 ignored
Mar 24 10:40:36 dnsmasq-dhcp[7]: DHCPDISCOVER(br-ctlplane) a0:36:9f:95:dd:e2 ignored
Mar 24 10:40:39 dnsmasq-dhcp[7]: DHCPDISCOVER(br-ctlplane) a0:36:9f:95:dd:e2 ignored
Mar 24 10:40:48 dnsmasq-dhcp[7]: DHCPDISCOVER(br-ctlplane) 6c:ae:8b:69:ee:80 ignored
Mar 24 10:41:52 dnsmasq-dhcp[7]: DHCPDISCOVER(br-ctlplane) 6c:ae:8b:69:ee:80 ignored
Mar 24 10:42:57 dnsmasq-dhcp[7]: DHCPDISCOVER(br-ctlplane) 6c:ae:8b:69:ee:80 ignored
Mar 24 10:43:06 dnsmasq-dhcp[7]: DHCPDISCOVER(br-ctlplane) a0:36:9f:95:dd:e2 ignored
Mar 24 10:43:10 dnsmasq-dhcp[7]: DHCPDISCOVER(br-ctlplane) a0:36:9f:95:dd:e2 ignored
Mar 24 10:43:14 dnsmasq-dhcp[7]: DHCPDISCOVER(br-ctlplane) a0:36:9f:95:dd:e2 ignored

Why is that? What is needed for the cleaning to start?

Thanks,
Igal
On 24 Mar 2021, at 0:09, Igal Katzir <ikatzir@infinidat.com> wrote:
Versions and overall configuration might help, *but* often these issues are just a typo in a MAC address or the wrong port. Can you verify that the MAC address you're seeing DHCP requests from matches what is recorded for the node in the `openstack baremetal port list` output?

On Wed, Mar 24, 2021 at 8:18 AM Igal Katzir <ikatzir@infinidat.com> wrote:
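As a quick way to do that check (a sketch; the log path is the one from Igal's earlier output, and the filenames are placeholders):

```shell
# MAC addresses Ironic has registered across all nodes.
openstack baremetal port list -f value -c Address | sort -u > registered-macs.txt

# MAC addresses that dnsmasq reported as ignored, pulled from the inspector log.
sudo grep -oE 'DHCPDISCOVER\(br-ctlplane\) [0-9a-f:]{17}' \
    /var/log/containers/ironic-inspector/dnsmasq.log \
    | awk '{print $2}' | sort -u > ignored-macs.txt

# Any MAC present only in the second file points at a typo or a missing port.
comm -13 registered-macs.txt ignored-macs.txt
```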
Hello Julia,

Thanks for your response. I am using Red Hat OpenStack Platform 16.1, which runs on RHEL 8.2. All are physical servers: one undercloud director, and an overcloud consisting of two nodes (this is for certification purposes).

It is unlikely to be a MAC address mismatch (I wish...) since I've already deployed these nodes several times using the same nodes.json. Just for reference, here is the output:

(undercloud) [stack@interop010 ~]$ openstack baremetal port list
+--------------------------------------+-------------------+
| UUID                                 | Address           |
+--------------------------------------+-------------------+
| 2d404695-f236-4d32-8b65-5ca1fa6b756a | a0:36:9f:95:dd:e2 |
| 32669178-0408-4ff1-b4b4-df65fc7643c9 | 6c:ae:8b:69:ee:80 |
+--------------------------------------+-------------------+

Everything was working well until I 'lost' the undercloud node, while the overcloud stayed working. I might need to delete these nodes and run introspection again.

Igal

On Wed, Mar 24, 2021 at 7:31 PM Julia Kreger <juliaashleykreger@gmail.com> wrote:
--
Regards,
Igal Katzir
Cell +972-54-5597086
Interoperability Team
INFINIDAT
A node in CLEAN FAILED must be moved to MANAGEABLE state before it can be told to "provide" (which eventually puts it back in AVAILABLE). Try this: `openstack baremetal node manage UUID`, then run the command with "provide" as you did before.

The available states and their transitions are documented here: https://docs.openstack.org/ironic/latest/contributor/states.html

I'll note that if cleaning failed, it's possible the node is misconfigured in such a way that will cause all deployments and cleanings to fail (e.g., if you're using Ironic with Nova, and you attempt to provision a machine and it errors during deploy, Nova will by default attempt to clean that node, which may be why you see it end up in clean failed). So I strongly suggest you look at the last_error field on the node and attempt to determine why the failure happened before retrying.

Good luck!
-Jay Faulkner

On Wed, Mar 24, 2021 at 8:20 AM Igal Katzir <ikatzir@infinidat.com> wrote:
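As a concrete command sequence (a sketch, using the first node's UUID from the earlier `node list` output; note that output also showed Maintenance set to True, which would additionally need clearing before the node can be scheduled):

```shell
NODE=97b9a603-f64f-47c1-9fb4-6c68a5b38ff6

# First, look at why cleaning failed before retrying anything.
openstack baremetal node show "$NODE" -f value -c last_error

# The node list showed Maintenance=True; clear it.
openstack baremetal node maintenance unset "$NODE"

# clean failed -> manageable -> (cleaning runs) -> available
openstack baremetal node manage "$NODE"
openstack baremetal node provide "$NODE"
```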
Thanks Jay,

It gets into the 'clean failed' state because it fails to boot into PXE mode. I don't understand why the DHCP server does not respond to the client's requests; it's as if it remembers that the same client already received an IP in the past. Is there a way to clear the dnsmasq database of reservations?

Igal

On Wed, Mar 24, 2021 at 5:26 PM Jay Faulkner <jay.faulkner@verizonmedia.com> wrote:
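A few places one could look for that state (a sketch; the paths and container name below are assumptions for a containerized RHOSP 16 undercloud and may differ on your system):

```shell
# dnsmasq's lease database for the inspector DHCP instance (path is an assumption).
sudo cat /var/lib/ironic-inspector/dnsmasq/dnsmasq.leases

# Per-MAC host entries; a line ending in ",ignore" tells dnsmasq to
# deliberately not answer that MAC (path is an assumption).
sudo grep -r . /var/lib/ironic-inspector/dhcp-hostsdir/

# Restarting the inspector's dnsmasq container rebuilds this state
# (container name is an assumption).
sudo podman restart ironic_inspector_dnsmasq
```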
Hello Forum,

Just for the record, the problem was resolved by restarting all the ironic containers; I believe that restarting the undercloud node entirely would also have fixed it. After the ironic containers started fresh, PXE worked well, and after running 'openstack overcloud node introspect --all-manageable --provide' it shows:

+--------------------------------------+------------+---------------+-------------+--------------------+-------------+
| UUID                                 | Name       | Instance UUID | Power State | Provisioning State | Maintenance |
+--------------------------------------+------------+---------------+-------------+--------------------+-------------+
| 588bc3f6-dc14-4a07-8e38-202540d046f8 | interop025 | None          | power off   | available          | False       |
| dceab84b-1d99-49b5-8f79-c589c0884269 | interop026 | None          | power off   | available          | False       |
+--------------------------------------+------------+---------------+-------------+--------------------+-------------+

I am now ready for the deployment of the overcloud.

Thanks,
Igal

On Thu, Mar 25, 2021 at 12:48 AM Igal Katzir <ikatzir@infinidat.com> wrote:
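For reference, restarting all the ironic containers on a containerized undercloud can be done along these lines (a sketch; filtering on the name "ironic" is an assumption about how the containers are named):

```shell
# List the ironic-related containers running on the undercloud.
sudo podman ps --format '{{.Names}}' --filter name=ironic

# Restart them all in one go.
sudo podman restart $(sudo podman ps -q --filter name=ironic)
```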
Out of curiosity, is this a very new version of dnsmasq, or an older version? I ask because there have been some fixes and regressions related to dnsmasq updating its configuration and responding to machines appropriately. A version number might be helpful, just to enable those of us who are curious to go double-check things at a minimum.

On Wed, Mar 31, 2021 at 1:28 AM Igal Katzir <ikatzir@infinidat.com> wrote:
Hi Julia,

How can I easily tell the ironic version? This is an RHOSP 16.1 installation, so it's pretty much new.

Igal

On Wed, 31 Mar 2021 at 21:25, Julia Kreger <juliaashleykreger@gmail.com> wrote:
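One way to check the package versions from inside the containers (a sketch; the container names are assumptions for an RHOSP 16 undercloud):

```shell
# Ironic version, as packaged inside the conductor container.
sudo podman exec ironic_conductor rpm -q openstack-ironic-conductor

# dnsmasq version, as packaged inside the inspector's dnsmasq container.
sudo podman exec ironic_inspector_dnsmasq dnsmasq --version
```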
In that case, file a case with Red Hat support and provide them an sosreport. Basically, you shouldn't have to reboot or restart dnsmasq to get things to wake up. It is not about the version of ironic but about the version of dnsmasq; if there is an issue, their support org needs that visibility so we can track it and get it remedied, because in that case it is likely a downstream issue rather than an upstream one.

On Wed, Mar 31, 2021 at 12:24 PM Igal Katzir <ikatzir@infinidat.com> wrote:
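Generating that report on the undercloud might look like this (a sketch; on RHEL 8 the tool ships as sosreport, and the flag is optional):

```shell
# Collect system logs, container logs, and configuration into a support archive.
sudo sosreport --all-logs
```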
Hi Julia, How can I easily tell the ironic version? This is an RHOSP 16.1 installation, so it's pretty much new. Igal
On Wed, Mar 31, 2021, 21:25, Julia Kreger <juliaashleykreger@gmail.com> wrote:
Out of curiosity, is this a very new version of dnsmasq? or an older version? I ask because there have been some fixes and regressions related to dnsmasq updating its configuration and responding to machines appropriately. A version might be helpful, just to enable those of us who are curious to go double check things at a minimum.
On Wed, Mar 31, 2021 at 1:28 AM Igal Katzir <ikatzir@infinidat.com> wrote:
Hello Forum, Just for the record, the problem was resolved by restarting all the ironic containers; I believe that restarting the undercloud node entirely would also have fixed it. After the ironic containers started fresh, PXE worked well, and after running 'openstack overcloud node introspect --all-manageable --provide' it shows:
+--------------------------------------+------------+---------------+-------------+--------------------+-------------+
| UUID                                 | Name       | Instance UUID | Power State | Provisioning State | Maintenance |
+--------------------------------------+------------+---------------+-------------+--------------------+-------------+
| 588bc3f6-dc14-4a07-8e38-202540d046f8 | interop025 | None          | power off   | available          | False       |
| dceab84b-1d99-49b5-8f79-c589c0884269 | interop026 | None          | power off   | available          | False       |
+--------------------------------------+------------+---------------+-------------+--------------------+-------------+
I am now ready for deployment of the overcloud. Thanks, Igal
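For reference, the restart that resolved this can be sketched as shell commands, assuming an RHOSP 16.x undercloud where the services run under podman (the container names below are the usual TripleO ones and are an assumption; check `sudo podman ps` for the actual names on your system):

```shell
# List the ironic-related containers actually present on the undercloud.
sudo podman ps --format '{{.Names}}' | grep -i ironic

# Restart them; adjust this list to match the output above.
sudo podman restart ironic_api ironic_conductor ironic_pxe_tftp \
    ironic_pxe_http ironic_inspector ironic_inspector_dnsmasq
```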
On Thu, Mar 25, 2021 at 12:48 AM Igal Katzir <ikatzir@infinidat.com> wrote:
Thanks Jay, It gets into the 'clean failed' state because it fails to PXE boot. I don't understand why the DHCP server does not respond to the client's requests; it's as if it remembers that the same client already received an IP in the past. Is there a way to clear the dnsmasq database of reservations? Igal
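One place worth checking (an assumption about the cause, not confirmed in this thread): ironic-inspector's 'dnsmasq' PXE filter maintains a per-MAC host file directory for dnsmasq, and an entry of the form `<mac>,ignore` there produces exactly the `DHCPDISCOVER ... ignored` log lines quoted above. The path below is the usual TripleO default and may differ on your system:

```shell
# Show the per-MAC dhcp host files the inspector PXE filter maintains;
# a '<mac>,ignore' line means dnsmasq will ignore DHCP from that MAC.
sudo grep -r . /var/lib/ironic-inspector/dhcp-hostsdir/
```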
On Wed, Mar 24, 2021 at 5:26 PM Jay Faulkner <jay.faulkner@verizonmedia.com> wrote:
A node in CLEAN FAILED must be moved to MANAGEABLE state before it can be told to "provide" (which eventually puts it back in AVAILABLE).
Try this: `openstack baremetal node manage UUID`, then run the command with "provide" as you did before.
The available states and their transitions are documented here: https://docs.openstack.org/ironic/latest/contributor/states.html
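The sequence described above can be sketched as shell commands (the UUID is taken from the node list earlier in the thread; since those nodes also show Maintenance=True, maintenance likely has to be cleared before cleaning can run):

```shell
NODE=97b9a603-f64f-47c1-9fb4-6c68a5b38ff6

# Maintenance mode blocks cleaning; the node list shows
# Maintenance=True, so clear it first.
openstack baremetal node maintenance unset "$NODE"

# 'manage' is the only way out of the 'clean failed' state;
# it moves the node to 'manageable'.
openstack baremetal node manage "$NODE"

# 'provide' runs automated cleaning and, on success, lands
# the node in 'available'.
openstack baremetal node provide "$NODE"

# Check where the node ended up.
openstack baremetal node show "$NODE" -f value -c provision_state
```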
I'll note that if cleaning failed, it's possible the node is misconfigured in such a way that will cause all deployments and cleanings to fail (e.g.; if you're using Ironic with Nova, and you attempt to provision a machine and it errors during deploy; Nova will by default attempt to clean that node, which may be why you see it end up in clean failed). So I strongly suggest you look at the last_error field on the node and attempt to determine why the failure happened before retrying.
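A quick way to read that field, using the standard python-openstackclient output filters:

```shell
# Print only the last_error field of the node, to see why the
# previous cleaning attempt failed before retrying.
openstack baremetal node show 97b9a603-f64f-47c1-9fb4-6c68a5b38ff6 \
    -f value -c last_error
```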
Good luck!
-Jay Faulkner
On Wed, Mar 24, 2021 at 8:20 AM Igal Katzir <ikatzir@infinidat.com> wrote:
Hello Team,
I had a situation where my undercloud node had a problem with its disk and got disconnected from the overcloud. I couldn't restore the undercloud controller and ended up re-installing it (running 'openstack undercloud install'). The installation ended successfully, but now I'm in a situation where cleanup of the overcloud-deployed nodes fails:
(undercloud) [stack@interop010 ~]$ openstack baremetal node list
+--------------------------------------+------------+---------------+-------------+--------------------+-------------+
| UUID                                 | Name       | Instance UUID | Power State | Provisioning State | Maintenance |
+--------------------------------------+------------+---------------+-------------+--------------------+-------------+
| 97b9a603-f64f-47c1-9fb4-6c68a5b38ff6 | interop025 | None          | power on    | clean failed       | True        |
| 4b02703a-f765-4ebb-85ed-75e88b4cbea5 | interop026 | None          | power on    | clean failed       | True        |
+--------------------------------------+------------+---------------+-------------+--------------------+-------------+
I've tried to move the node to the available state but cannot:
(undercloud) [stack@interop010 ~]$ openstack baremetal node provide 97b9a603-f64f-47c1-9fb4-6c68a5b38ff6
The requested action "provide" can not be performed on node "97b9a603-f64f-47c1-9fb4-6c68a5b38ff6" while it is in state "clean failed". (HTTP 400)
My question is: how do I make the nodes available again? The deployment of the overcloud fails with: ERROR due to "Message: No valid host was found. , Code: 500"
Thanks, Igal
-- Regards, Igal Katzir Cell +972-54-5597086 Interoperability Team INFINIDAT
participants (3)
- Igal Katzir
- Jay Faulkner
- Julia Kreger