[ironic] How to move nodes from a 'clean failed' state into 'Available'
Hello Team,

I had a situation where my undercloud node had a problem with its disk and became disconnected from the overcloud. I couldn't restore the undercloud controller and ended up re-installing it (running 'openstack undercloud install'). The installation ended successfully, but now I'm in a situation where cleanup of the deployed overcloud nodes fails:

(undercloud) [stack@interop010 ~]$ openstack baremetal node list
+--------------------------------------+------------+---------------+-------------+--------------------+-------------+
| UUID                                 | Name       | Instance UUID | Power State | Provisioning State | Maintenance |
+--------------------------------------+------------+---------------+-------------+--------------------+-------------+
| 97b9a603-f64f-47c1-9fb4-6c68a5b38ff6 | interop025 | None          | power on    | clean failed       | True        |
| 4b02703a-f765-4ebb-85ed-75e88b4cbea5 | interop026 | None          | power on    | clean failed       | True        |
+--------------------------------------+------------+---------------+-------------+--------------------+-------------+

I've tried to move a node to the available state but cannot:

(undercloud) [stack@interop010 ~]$ openstack baremetal node provide 97b9a603-f64f-47c1-9fb4-6c68a5b38ff6
The requested action "provide" can not be performed on node "97b9a603-f64f-47c1-9fb4-6c68a5b38ff6" while it is in state "clean failed". (HTTP 400)

My question is: how do I make the nodes available again? The deployment of the overcloud fails with:
ERROR due to "Message: No valid host was found. , Code: 500"

Thanks,
Igal
Hello all,

While troubleshooting this, another observation: when I put a node into the provide state with 'openstack baremetal node provide 97b9a603-f64f-47c1-9fb4-6c68a5b38ff6', it starts the cleaning process and the node boots into PXE, but the undercloud ignores it. When I tap the port I see that the requests reach its interface:

(undercloud) [stack@interop010 ~]$ sudo tcpdump -i br-ctlplane
10:43:10.600421 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from a0:36:9f:95:dd:e2 (oui Unknown), length 548

But at the same time dnsmasq ignores it:

(undercloud) [stack@interop010 ~]$ sudo tail -f /var/log/containers/ironic-inspector/dnsmasq.log
Mar 24 10:39:43 dnsmasq-dhcp[7]: DHCPDISCOVER(br-ctlplane) 6c:ae:8b:69:ee:80 ignored
Mar 24 10:40:36 dnsmasq-dhcp[7]: DHCPDISCOVER(br-ctlplane) a0:36:9f:95:dd:e2 ignored
Mar 24 10:40:39 dnsmasq-dhcp[7]: DHCPDISCOVER(br-ctlplane) a0:36:9f:95:dd:e2 ignored
Mar 24 10:40:48 dnsmasq-dhcp[7]: DHCPDISCOVER(br-ctlplane) 6c:ae:8b:69:ee:80 ignored
Mar 24 10:41:52 dnsmasq-dhcp[7]: DHCPDISCOVER(br-ctlplane) 6c:ae:8b:69:ee:80 ignored
Mar 24 10:42:57 dnsmasq-dhcp[7]: DHCPDISCOVER(br-ctlplane) 6c:ae:8b:69:ee:80 ignored
Mar 24 10:43:06 dnsmasq-dhcp[7]: DHCPDISCOVER(br-ctlplane) a0:36:9f:95:dd:e2 ignored
Mar 24 10:43:10 dnsmasq-dhcp[7]: DHCPDISCOVER(br-ctlplane) a0:36:9f:95:dd:e2 ignored
Mar 24 10:43:14 dnsmasq-dhcp[7]: DHCPDISCOVER(br-ctlplane) a0:36:9f:95:dd:e2 ignored

Why is that? What is needed for the cleaning to start?

Thanks,
Igal
On 24 Mar 2021, at 0:09, Igal Katzir <ikatzir@infinidat.com> wrote:
Versions and overall configuration might help, *but* often these issues are just a typo in a MAC address or the wrong port. Can you verify that the MAC address you're seeing DHCP requests from matches what is recorded for the node in the `openstack baremetal port list` output?

On Wed, Mar 24, 2021 at 8:18 AM Igal Katzir <ikatzir@infinidat.com> wrote:
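As a quick way to do that check (a sketch; the log path is the one from Igal's earlier output, and the filenames are placeholders):

```shell
# MAC addresses Ironic has registered across all nodes.
openstack baremetal port list -f value -c Address | sort -u > registered-macs.txt

# MAC addresses that dnsmasq reported as ignored, pulled from the inspector log.
sudo grep -oE 'DHCPDISCOVER\(br-ctlplane\) [0-9a-f:]{17}' \
    /var/log/containers/ironic-inspector/dnsmasq.log \
    | awk '{print $2}' | sort -u > ignored-macs.txt

# Any MAC present only in the second file points at a typo or a missing port.
comm -13 registered-macs.txt ignored-macs.txt
```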
Hello Julia,

Thanks for your response. I am using Red Hat OpenStack Platform 16.1, which runs on RHEL 8.2. All are physical servers: one undercloud director, and an overcloud consisting of two nodes (this is for certification purposes).

It is unlikely to be a MAC address mismatch (I wish...) since I've already deployed these nodes several times using the same nodes.json. Just for reference, here is the output:

(undercloud) [stack@interop010 ~]$ openstack baremetal port list
+--------------------------------------+-------------------+
| UUID                                 | Address           |
+--------------------------------------+-------------------+
| 2d404695-f236-4d32-8b65-5ca1fa6b756a | a0:36:9f:95:dd:e2 |
| 32669178-0408-4ff1-b4b4-df65fc7643c9 | 6c:ae:8b:69:ee:80 |
+--------------------------------------+-------------------+

Everything was working well until I 'lost' the undercloud node, while the overcloud stayed working. I might need to delete these nodes and run introspection again.

Igal

On Wed, Mar 24, 2021 at 7:31 PM Julia Kreger <juliaashleykreger@gmail.com> wrote:
--
Regards,
Igal Katzir
Cell +972-54-5597086
Interoperability Team
INFINIDAT
A node in CLEAN FAILED must be moved to MANAGEABLE state before it can be told to "provide" (which eventually puts it back in AVAILABLE). Try this: `openstack baremetal node manage UUID`, then run the command with "provide" as you did before.

The available states and their transitions are documented here: https://docs.openstack.org/ironic/latest/contributor/states.html

I'll note that if cleaning failed, it's possible the node is misconfigured in such a way that will cause all deployments and cleanings to fail (e.g., if you're using Ironic with Nova, and you attempt to provision a machine and it errors during deploy, Nova will by default attempt to clean that node, which may be why you see it end up in clean failed). So I strongly suggest you look at the last_error field on the node and attempt to determine why the failure happened before retrying.

Good luck!
-Jay Faulkner

On Wed, Mar 24, 2021 at 8:20 AM Igal Katzir <ikatzir@infinidat.com> wrote:
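As a concrete command sequence (a sketch, using the first node's UUID from the earlier `node list` output; note that output also showed Maintenance set to True, which would additionally need clearing before the node can be scheduled):

```shell
NODE=97b9a603-f64f-47c1-9fb4-6c68a5b38ff6

# First, look at why cleaning failed before retrying anything.
openstack baremetal node show "$NODE" -f value -c last_error

# The node list showed Maintenance=True; clear it.
openstack baremetal node maintenance unset "$NODE"

# clean failed -> manageable -> (cleaning runs) -> available
openstack baremetal node manage "$NODE"
openstack baremetal node provide "$NODE"
```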
Thanks Jay,

It gets into the 'clean failed' state because it fails to boot into PXE mode. I don't understand why the DHCP server does not respond to the client's requests; it's as if it remembers that the same client already received an IP in the past. Is there a way to clear the dnsmasq database of reservations?

Igal

On Wed, Mar 24, 2021 at 5:26 PM Jay Faulkner <jay.faulkner@verizonmedia.com> wrote:
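A few places one could look for that state (a sketch; the paths and container name below are assumptions for a containerized RHOSP 16 undercloud and may differ on your system):

```shell
# dnsmasq's lease database for the inspector DHCP instance (path is an assumption).
sudo cat /var/lib/ironic-inspector/dnsmasq/dnsmasq.leases

# Per-MAC host entries; a line ending in ",ignore" tells dnsmasq to
# deliberately not answer that MAC (path is an assumption).
sudo grep -r . /var/lib/ironic-inspector/dhcp-hostsdir/

# Restarting the inspector's dnsmasq container rebuilds this state
# (container name is an assumption).
sudo podman restart ironic_inspector_dnsmasq
```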
Hello Forum,

Just for the record, the problem was resolved by restarting all the ironic containers; I believe that restarting the undercloud node entirely would also have fixed it. After the ironic containers started fresh, PXE worked well, and after running 'openstack overcloud node introspect --all-manageable --provide' it shows:

+--------------------------------------+------------+---------------+-------------+--------------------+-------------+
| UUID                                 | Name       | Instance UUID | Power State | Provisioning State | Maintenance |
+--------------------------------------+------------+---------------+-------------+--------------------+-------------+
| 588bc3f6-dc14-4a07-8e38-202540d046f8 | interop025 | None          | power off   | available          | False       |
| dceab84b-1d99-49b5-8f79-c589c0884269 | interop026 | None          | power off   | available          | False       |
+--------------------------------------+------------+---------------+-------------+--------------------+-------------+

I am now ready for the deployment of the overcloud.

Thanks,
Igal

On Thu, Mar 25, 2021 at 12:48 AM Igal Katzir <ikatzir@infinidat.com> wrote:
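For reference, restarting all the ironic containers on a containerized undercloud can be done along these lines (a sketch; filtering on the name "ironic" is an assumption about how the containers are named):

```shell
# List the ironic-related containers running on the undercloud.
sudo podman ps --format '{{.Names}}' --filter name=ironic

# Restart them all in one go.
sudo podman restart $(sudo podman ps -q --filter name=ironic)
```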
Out of curiosity, is this a very new version of dnsmasq, or an older version? I ask because there have been some fixes and regressions related to dnsmasq updating its configuration and responding to machines appropriately. A version number might be helpful, just to enable those of us who are curious to go double-check things at a minimum.

On Wed, Mar 31, 2021 at 1:28 AM Igal Katzir <ikatzir@infinidat.com> wrote:
Hi Julia,

How can I easily tell the ironic version? This is an RHOSP 16.1 installation, so it's pretty much new.

Igal

On Wed, 31 Mar 2021 at 21:25, Julia Kreger <juliaashleykreger@gmail.com> wrote:
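One way to check the package versions from inside the containers (a sketch; the container names are assumptions for an RHOSP 16 undercloud):

```shell
# Ironic version, as packaged inside the conductor container.
sudo podman exec ironic_conductor rpm -q openstack-ironic-conductor

# dnsmasq version, as packaged inside the inspector's dnsmasq container.
sudo podman exec ironic_inspector_dnsmasq dnsmasq --version
```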
In that case, file a case with Red Hat support and provide them an sosreport. Basically, you shouldn't have to reboot or restart dnsmasq to get things to wake up. It is not about the version of ironic but about the version of dnsmasq; if there is an issue, their support org needs that visibility so we can track it and get it remedied, because in that case it is likely a downstream issue rather than an upstream one.

On Wed, Mar 31, 2021 at 12:24 PM Igal Katzir <ikatzir@infinidat.com> wrote:
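Generating that report on the undercloud might look like this (a sketch; on RHEL 8 the tool ships as sosreport, and the flag is optional):

```shell
# Collect system logs, container logs, and configuration into a support archive.
sudo sosreport --all-logs
```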
Hi Julia, How can I easily tell the ironic version? This is an RHOSP 16.1 installation, so it's pretty much new. Igal
On Wed, Mar 31, 2021, 21:25, Julia Kreger <juliaashleykreger@gmail.com> wrote:
Out of curiosity, is this a very new version of dnsmasq? or an older version? I ask because there have been some fixes and regressions related to dnsmasq updating its configuration and responding to machines appropriately. A version might be helpful, just to enable those of us who are curious to go double check things at a minimum.
On Wed, Mar 31, 2021 at 1:28 AM Igal Katzir <ikatzir@infinidat.com> wrote:
Hello Forum, Just for the record, the problem was resolved by restarting all the ironic containers; I believe that restarting the undercloud node entirely would also have fixed it. After the ironic containers started fresh, PXE worked well, and after running 'openstack overcloud node introspect --all-manageable --provide' it shows:
+--------------------------------------+------------+---------------+-------------+--------------------+-------------+
| UUID                                 | Name       | Instance UUID | Power State | Provisioning State | Maintenance |
+--------------------------------------+------------+---------------+-------------+--------------------+-------------+
| 588bc3f6-dc14-4a07-8e38-202540d046f8 | interop025 | None          | power off   | available          | False       |
| dceab84b-1d99-49b5-8f79-c589c0884269 | interop026 | None          | power off   | available          | False       |
+--------------------------------------+------------+---------------+-------------+--------------------+-------------+
I am now ready for deployment of the overcloud. Thanks, Igal
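For reference, the restart that resolved this can be sketched as shell commands, assuming an RHOSP 16.x undercloud where the services run under podman (the container names below are the usual TripleO ones and are an assumption; check `sudo podman ps` for the actual names on your system):

```shell
# List the ironic-related containers actually present on the undercloud.
sudo podman ps --format '{{.Names}}' | grep -i ironic

# Restart them; adjust this list to match the output above.
sudo podman restart ironic_api ironic_conductor ironic_pxe_tftp \
    ironic_pxe_http ironic_inspector ironic_inspector_dnsmasq
```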
On Thu, Mar 25, 2021 at 12:48 AM Igal Katzir <ikatzir@infinidat.com> wrote:
Thanks Jay, It gets into the 'clean failed' state because it fails to PXE boot. I don't understand why the DHCP server does not respond to the client's requests; it's as if it remembers that the same client already received an IP in the past. Is there a way to clear the dnsmasq database of reservations? Igal
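One place worth checking (an assumption about the cause, not confirmed in this thread): ironic-inspector's 'dnsmasq' PXE filter maintains a per-MAC host file directory for dnsmasq, and an entry of the form `<mac>,ignore` there produces exactly the `DHCPDISCOVER ... ignored` log lines quoted above. The path below is the usual TripleO default and may differ on your system:

```shell
# Show the per-MAC dhcp host files the inspector PXE filter maintains;
# a '<mac>,ignore' line means dnsmasq will ignore DHCP from that MAC.
sudo grep -r . /var/lib/ironic-inspector/dhcp-hostsdir/
```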
On Wed, Mar 24, 2021 at 5:26 PM Jay Faulkner <jay.faulkner@verizonmedia.com> wrote:
A node in CLEAN FAILED must be moved to MANAGEABLE state before it can be told to "provide" (which eventually puts it back in AVAILABLE).
Try this: `openstack baremetal node manage UUID`, then run the command with "provide" as you did before.
The available states and their transitions are documented here: https://docs.openstack.org/ironic/latest/contributor/states.html
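The sequence described above can be sketched as shell commands (the UUID is taken from the node list earlier in the thread; since those nodes also show Maintenance=True, maintenance likely has to be cleared before cleaning can run):

```shell
NODE=97b9a603-f64f-47c1-9fb4-6c68a5b38ff6

# Maintenance mode blocks cleaning; the node list shows
# Maintenance=True, so clear it first.
openstack baremetal node maintenance unset "$NODE"

# 'manage' is the only way out of the 'clean failed' state;
# it moves the node to 'manageable'.
openstack baremetal node manage "$NODE"

# 'provide' runs automated cleaning and, on success, lands
# the node in 'available'.
openstack baremetal node provide "$NODE"

# Check where the node ended up.
openstack baremetal node show "$NODE" -f value -c provision_state
```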
I'll note that if cleaning failed, it's possible the node is misconfigured in such a way that will cause all deployments and cleanings to fail (e.g.; if you're using Ironic with Nova, and you attempt to provision a machine and it errors during deploy; Nova will by default attempt to clean that node, which may be why you see it end up in clean failed). So I strongly suggest you look at the last_error field on the node and attempt to determine why the failure happened before retrying.
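A quick way to read that field, using the standard python-openstackclient output filters:

```shell
# Print only the last_error field of the node, to see why the
# previous cleaning attempt failed before retrying.
openstack baremetal node show 97b9a603-f64f-47c1-9fb4-6c68a5b38ff6 \
    -f value -c last_error
```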
Good luck!
-Jay Faulkner
On Wed, Mar 24, 2021 at 8:20 AM Igal Katzir <ikatzir@infinidat.com> wrote:
Hello Team,
I had a situation where my undercloud node had a problem with its disk and got disconnected from the overcloud. I couldn't restore the undercloud controller and ended up re-installing it (running 'openstack undercloud install'). The installation ended successfully, but now I'm in a situation where cleanup of the overcloud-deployed nodes fails:
(undercloud) [stack@interop010 ~]$ openstack baremetal node list
+--------------------------------------+------------+---------------+-------------+--------------------+-------------+
| UUID                                 | Name       | Instance UUID | Power State | Provisioning State | Maintenance |
+--------------------------------------+------------+---------------+-------------+--------------------+-------------+
| 97b9a603-f64f-47c1-9fb4-6c68a5b38ff6 | interop025 | None          | power on    | clean failed       | True        |
| 4b02703a-f765-4ebb-85ed-75e88b4cbea5 | interop026 | None          | power on    | clean failed       | True        |
+--------------------------------------+------------+---------------+-------------+--------------------+-------------+
I've tried to move the node to the available state but cannot:
(undercloud) [stack@interop010 ~]$ openstack baremetal node provide 97b9a603-f64f-47c1-9fb4-6c68a5b38ff6
The requested action "provide" can not be performed on node "97b9a603-f64f-47c1-9fb4-6c68a5b38ff6" while it is in state "clean failed". (HTTP 400)
My question is: how do I make the nodes available again? The deployment of the overcloud fails with: ERROR due to "Message: No valid host was found. , Code: 500"
Thanks, Igal
-- Regards, Igal Katzir Cell +972-54-5597086 Interoperability Team INFINIDAT
participants (3)
- Igal Katzir
- Jay Faulkner
- Julia Kreger