[ironic] Recovering IPMI-type baremetal nodes in 'error' state
Hello all,

I have a handful of baremetal nodes enrolled in Ironic that use the IPMI hardware type, whose motherboards were recently replaced in a hardware recall by the vendor.

After the replacement, the BMC IPMI-over-LAN feature was accidentally left disabled on the nodes, and subsequent attempts to control them with Ironic have put these nodes into the ERROR provisioning state.

The IPMI-over-LAN feature on the boards has since been re-enabled, but is there now any easy way to get the BM nodes back out of that ERROR state without first deleting and re-enrolling them?

--
*******************
Paul Browne
Research Computing Platforms
University Information Services
Roger Needham Building
JJ Thompson Avenue
University of Cambridge
Cambridge
United Kingdom
E-Mail: pfb29@cam.ac.uk
Tel: 0044-1223-746548
*******************
Greetings Paul,

Obviously, deleting and re-enrolling would be an action of last resort. The only way I can think of that the machines could have reached the ERROR provision state is if they were somehow requested to be un-provisioned.

The state machine diagram[0] refers to the provision state verb as "deleted", but the corresponding command in the command line tool is undeploy[1].

[0]: https://docs.openstack.org/ironic/latest/_images/states.svg
[1]: https://docs.openstack.org/python-ironicclient/latest/cli/osc/v1/index.html#...

On Wed, Sep 23, 2020 at 4:58 PM Paul Browne <pfb29@cam.ac.uk> wrote:
[...]
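As a rough illustration of the path into ERROR described in the reply above, the relevant corner of the state machine can be written as a small lookup table. This is only a sketch based on a reading of the diagram at [0]: the names TRANSITIONS, next_state and events_from are invented for illustration and are not Ironic's own code, the "error" event name is an assumption, and most states are left out.

```python
# Hypothetical, heavily trimmed model of a few provision-state transitions.
# Assumption: the entries below reflect a reading of the state diagram [0];
# they are not copied from Ironic's source and most states are omitted.
TRANSITIONS = {
    ("active", "deleted"): "deleting",   # "deleted" is the verb behind undeploy
    ("deleting", "error"): "error",      # assumed event name: tear-down failed
    ("error", "deleted"): "deleting",    # tear-down can be requested again
    ("error", "rebuild"): "deploying",   # or the deployment can be rebuilt
}


def next_state(state, event):
    """Return the state reached from `state` via `event`, or raise ValueError."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition for {event!r} from {state!r}") from None


def events_from(state):
    """List the events this trimmed model allows from `state`."""
    return sorted(event for (src, event) in TRANSITIONS if src == state)


if __name__ == "__main__":
    # Replay the suspected history: an undeploy request whose tear-down failed.
    state = "active"
    for event in ("deleted", "error"):
        state = next_state(state, event)
    print(state)
    print(events_from(state))
```

In this trimmed model the only way into the error state is a failed tear-down, which is the path suggested above; the diagram itself remains the authoritative reference for what is allowed from there.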
Well, somehow I accidentally clicked send! \o/

If you can confirm that the provision_state is ERROR, and identify how the machines got there, that would be helpful. If the nodes are otherwise still in working order in the database, you may need to actually edit the database, because we offer no explicit means to force-override the state, mainly to help prevent issues much like this one. I suspect you may be encountering issues if the node is marked in maintenance state. If the power state is None, maintenance is also set automatically. Newer versions of ironic _do_ periodically check nodes and reset that state, but again it is something to check, and if there are continued connectivity issues to the BMC then that may not be happening.

So, to recap:

1) Verify the node's provision_state is ERROR. If ERROR is coming from Nova, that is a different situation.
2) Ensure the node is not set in maintenance mode[3].
3) You may also need to ensure that the node's ipmi_address/ipmi_username/ipmi_password are correct and match what can be accessed on the motherboard.

Additionally, you may also want to externally verify that you can actually query the IPMI BMCs. If this started down this path because power management was lost due to the BMC, bear in mind that some BMCs can have some weirdness around IP networking, so it is always good to check manually using ipmitool.

One last thing: is target_provision_state set for these nodes?

[3]: https://docs.openstack.org/python-ironicclient/latest/cli/osc/v1/index.html#...

On Wed, Sep 23, 2020 at 9:20 PM Julia Kreger <juliaashleykreger@gmail.com> wrote:
[...]
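The recap above could be scripted roughly as follows. This is a minimal sketch, not a tested procedure: the function names are made up, the node dictionary is assumed to have the shape printed by `openstack baremetal node show ... -f json`, the lanplus interface is an assumption about the BMCs, and the ipmitool password is assumed to be supplied through the IPMI_PASSWORD environment variable (the `-E` option) rather than on the command line.

```python
from typing import Any, Dict, List


def show_node_cmd(node: str) -> List[str]:
    """argv that fetches the fields mentioned in the recap as JSON."""
    return [
        "openstack", "baremetal", "node", "show", node, "-f", "json",
        "--fields", "provision_state", "target_provision_state",
        "maintenance", "power_state", "last_error", "driver_info",
    ]


def ipmi_power_status_cmd(driver_info: Dict[str, Any]) -> List[str]:
    """argv for a manual BMC check, outside of Ironic, using ipmitool."""
    return [
        "ipmitool", "-I", "lanplus",        # assumption: BMC speaks IPMI v2.0
        "-H", driver_info["ipmi_address"],
        "-U", driver_info["ipmi_username"],
        "-E",                               # password read from IPMI_PASSWORD
        "power", "status",
    ]


def recap_findings(node: Dict[str, Any]) -> List[str]:
    """Return human-readable notes for anything the recap says to look at."""
    findings = []
    if node.get("provision_state") != "error":
        findings.append("provision_state is not 'error'; ERROR may come from Nova")
    if node.get("maintenance"):
        findings.append("node is in maintenance mode")
    if node.get("power_state") is None:
        findings.append("power_state is None")
    info = node.get("driver_info") or {}
    for key in ("ipmi_address", "ipmi_username", "ipmi_password"):
        if not info.get(key):
            findings.append(f"driver_info is missing {key}")
    if node.get("target_provision_state"):
        findings.append(
            f"target_provision_state is set: {node['target_provision_state']}")
    return findings


if __name__ == "__main__":
    # Made-up example node; 192.0.2.10 is a documentation-only address.
    example = {
        "provision_state": "error", "target_provision_state": None,
        "maintenance": True, "power_state": None,
        "driver_info": {"ipmi_address": "192.0.2.10", "ipmi_username": "admin",
                        "ipmi_password": "******"},
    }
    print(" ".join(show_node_cmd("node-0")))
    print(" ".join(ipmi_power_status_cmd(example["driver_info"])))
    for note in recap_findings(example):
        print("-", note)
```

The command builders only assemble argument lists, so they can be reviewed before anything is run against a live deployment (for example via subprocess.run). An empty list from recap_findings only means that none of the recap's checks flagged anything; it does not by itself move the node out of ERROR.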
Hi Julia,

Thanks very much for the detailed answer and pointers. I've done some digging along the lines you suggested; the results are here <https://pastebin.com/E14FVuF0>.

Digging into the Ironic DB, I do see that the last_error field for all 3 is "Failed to tear down. Error: IPMI call failed: power status." That makes sense, as IPMI-over-LAN was accidentally disabled on those nodes, so these calls would fail.

I think the order of operations was that the failing calls were part of an instance teardown and node cleaning.

IPMI-over-LAN has been fixed, so manual ipmitool power status calls now succeed, and the correct credentials are in the nodes' ipmi_* driver_info fields. The nodes are also out of maintenance mode.

We're running Train Ironic, as an extra piece of info, in case that's relevant to how Ironic may periodically check and/or correct node states.

Perhaps the next thing to try might be a manual DB edit of the provision_state of these 3 nodes back to 'available'.

On Thu, 24 Sep 2020 at 05:29, Julia Kreger <juliaashleykreger@gmail.com> wrote:
[...]
participants (2)
- Julia Kreger
- Paul Browne