[openstack-dev] [ironic] Tooling for recovering nodes

Devananda van der Veen devananda.vdv at gmail.com
Tue May 31 19:26:12 UTC 2016

On 05/31/2016 01:35 AM, Dmitry Tantsur wrote:
> On 05/31/2016 10:25 AM, Tan, Lin wrote:
>> Hi,
>> Recently, I am working on a spec[1] in order to recover nodes which get stuck
>> in deploying state, so I really expect some feedback from you guys.
>> Ironic nodes can be stuck in
>> deploying/deploywait/cleaning/cleanwait/inspecting/deleting if the node is
>> reserved by a dead conductor (the exclusive lock was not released).
>> Any further requests will be denied by ironic because it thinks the node
>> resource is under control of another conductor.
>> To be more clear, let's narrow the scope and focus on the deploying state
>> first. Currently, people do have several choices to clear the reserved lock:
>> 1. restart the dead conductor
>> 2. wait up to 2 or 3 minutes and _check_deploying_status() will clear the lock.
>> 3. The operator touches the DB to manually recover these nodes.
>> Option two looks very promising, but it has some weaknesses:
>> 2.1 It won't work if the dead conductor was renamed or deleted.
>> 2.2 It won't work if the node's specific driver was not enabled on live
>> conductors.
>> 2.3 It won't work if the node is in maintenance. (only a corner case).
> We can and should fix all three cases.

2.1 and 2.2 appear to be a bug in the behavior of _check_deploying_status().

The method claims to do exactly what you suggest in 2.1 and 2.2 -- it gathers a
list of Nodes reserved by *any* offline conductor and tries to release the lock.
However, it will always fail to update them, because objects.Node.release()
raises a NodeLocked exception when called on a Node locked by a different conductor.

Here's the relevant code path:

1259     def _check_deploying_status(self, context):
1269         offline_conductors = self.dbapi.get_offline_conductors()
1273         node_iter = self.iter_nodes(
1274             fields=['id', 'reservation'],
1275             filters={'provision_state': states.DEPLOYING,
1276                      'maintenance': False,
1277                      'reserved_by_any_of': offline_conductors})
1281         for node_uuid, driver, node_id, conductor_hostname in node_iter:
1285             try:
1286                 objects.Node.release(context, conductor_hostname, node_id)
1292             except exception.NodeLocked:
1293                 LOG.warning(...)
1297                 continue
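To make the failure mode concrete, here is a minimal sketch modeling the behavior described above. The class and method names (Node, release, force_release) are simplified stand-ins for illustration, not ironic's actual objects or signatures:

```python
# Illustrative model only; not ironic's real code.

class NodeLocked(Exception):
    pass


class Node:
    def __init__(self, uuid, reservation):
        self.uuid = uuid
        self.reservation = reservation  # hostname of the lock-holding conductor

    def release(self, hostname):
        # Mirrors the described behavior: release() refuses to clear a
        # reservation held by a different conductor.
        if self.reservation != hostname:
            raise NodeLocked(self.uuid)
        self.reservation = None

    def force_release(self):
        # Hypothetical recovery path: clear the reservation regardless of
        # the holder, for nodes reserved by known-offline conductors.
        self.reservation = None


node = Node('uuid-1', reservation='dead-conductor')
try:
    node.release('live-conductor')  # a different conductor attempts release
    released = True
except NodeLocked:
    released = False  # the failure mode described above

node.force_release()  # the reservation is now cleared
```

In other words, as long as the release path enforces "only the lock holder may release," the periodic sweep can never free locks held by dead conductors; some explicit override for offline holders is needed.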

As far as 2.3, I think we should change the query string at the start of this
method so that it includes nodes in maintenance mode. I think it's both safe and
reasonable (and, frankly, what an operator will expect) that a node which is in
maintenance mode, and in DEPLOYING state, whose conductor is offline, should
have that reservation cleared and be set to DEPLOYFAILED state.
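As a sketch of that query change, assuming the filter keys shown in the excerpt above (this is illustrative, not a tested patch):

```python
# Illustrative stand-ins for values defined elsewhere in the conductor.
DEPLOYING = 'deploying'
offline_conductors = ['dead-conductor']

# Current filter (per the excerpt) skips nodes in maintenance mode:
current_filters = {'provision_state': DEPLOYING,
                   'maintenance': False,
                   'reserved_by_any_of': offline_conductors}

# Suggested change: drop the 'maintenance' key so nodes in maintenance
# mode with a stale reservation are also swept up, released, and moved
# to DEPLOYFAILED.
suggested_filters = {k: v for k, v in current_filters.items()
                     if k != 'maintenance'}
```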


>> We should definitely improve option 2, but there could be more issues
>> I don't know about in more complicated environments.
>> So my question is: do we still need a new command to recover these nodes more
>> easily, without touching the DB directly, like this PoC [2]:
>>   ironic-noderecover --node_uuids=UUID1,UUID2 --config-file=/etc/ironic/ironic.conf
> I'm -1 to anything silently removing the lock until I see a clear use case which
> is impossible to improve within Ironic itself. Such a utility may and will be abused.
> I'm fine with anything that does not forcibly remove the lock by default.
>> Best Regards,
>> Tan
>> [1] https://review.openstack.org/#/c/319812
>> [2] https://review.openstack.org/#/c/311273/
