[openstack-dev] [ironic] Tooling for recovering nodes

Tan, Lin lin.tan at intel.com
Wed Jun 1 03:21:49 UTC 2016


Thanks Devananda for your suggestions. I opened a new bug for it.

I am asking because this is a task from the Newton summit: create a new command "for getting nodes out of stuck *ing states"
https://etherpad.openstack.org/p/ironic-newton-summit-ops
We already have an RFE bug for this [1].

But as Dmitry said, there is a big risk in removing a node's lock and marking it as deploy failed. On the other hand, if the tool does not remove the lock, users still cannot manipulate the node resource. So I want to involve more people in the discussion of the spec [2].

Considering ironic already has _check_deploying_status() to recover the deploying state, should I focus on improving it, or is there still a need to create a new command?

B.R

Tan

[1] https://bugs.launchpad.net/ironic/+bug/1580931
[2] https://review.openstack.org/#/c/319812
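
For context, the manual workaround that the spec is trying to replace is a direct database update that drops the stale reservation and fails the deploy. Below is a rough sketch of what that amounts to, using plain SQLAlchemy against the ironic DB -- the connection URL and exact state strings are illustrative, not taken from the PoC:

# Sketch only: clear a stale reservation and fail the deploy for one stuck node.
# Assumes the default ironic schema (nodes.reservation, nodes.provision_state).
from sqlalchemy import create_engine, text

engine = create_engine('mysql+pymysql://ironic:secret@dbhost/ironic')  # illustrative URL

stuck_uuid = 'UUID1'  # node stuck in deploying, reservation held by a dead conductor

with engine.begin() as conn:
    conn.execute(
        text("UPDATE nodes "
             "SET reservation = NULL, provision_state = :failed "
             "WHERE uuid = :uuid AND provision_state = :deploying"),
        {'failed': 'deploy failed', 'deploying': 'deploying', 'uuid': stuck_uuid})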
-----Original Message-----
From: Devananda van der Veen [mailto:devananda.vdv at gmail.com] 
Sent: Wednesday, June 1, 2016 3:26 AM
To: openstack-dev at lists.openstack.org
Subject: Re: [openstack-dev] [ironic] Tooling for recovering nodes

On 05/31/2016 01:35 AM, Dmitry Tantsur wrote:
> On 05/31/2016 10:25 AM, Tan, Lin wrote:
>> Hi,
>>
>> Recently I have been working on a spec [1] to recover nodes that
>> get stuck in the deploying state, so I would really appreciate your feedback.
>>
>> Ironic nodes can be stuck in
>> deploying/deploywait/cleaning/cleanwait/inspecting/deleting if the 
>> node is reserved by a dead conductor (the exclusive lock was not released).
>> Any further requests will be denied by ironic because it thinks the 
>> node resource is under control of another conductor.
>>
>> To be more clear, let's narrow the scope and focus on the deploying
>> state first. Currently, people have several options to clear the reserved lock:
>> 1. restart the dead conductor
>> 2. wait up to 2 or 3 minutes and _check_deploying_status() will clear the lock.
>> 3. The operator touches the DB to manually recover these nodes.
>>
>> Option two looks very promising, but there are some weaknesses:
>> 2.1 It won't work if the dead conductor was renamed or deleted.
>> 2.2 It won't work if the node's specific driver was not enabled on 
>> live conductors.
>> 2.3 It won't work if the node is in maintenance. (only a corner case).
> 
> We can and should fix all three cases.

2.1 and 2.2 appear to stem from a bug in the behavior of _check_deploying_status().

The method is supposed to handle exactly the cases you describe in 2.1 and 2.2 -- it gathers a list of Nodes reserved by *any* offline conductor and tries to release the lock.
However, it will always fail to update them, because objects.Node.release() raises a NodeLocked exception when called on a Node locked by a different conductor.

Here's the relevant code path:

ironic/conductor/manager.py:
1259     def _check_deploying_status(self, context):
...
1269         offline_conductors = self.dbapi.get_offline_conductors()
...
1273         node_iter = self.iter_nodes(
1274             fields=['id', 'reservation'],
1275             filters={'provision_state': states.DEPLOYING,
1276                      'maintenance': False,
1277                      'reserved_by_any_of': offline_conductors})
...
1281         for node_uuid, driver, node_id, conductor_hostname in node_iter:
...
1285             try:
1286                 objects.Node.release(context, conductor_hostname, node_id)
...
1292             except exception.NodeLocked:
1293                 LOG.warning(...)
1297                 continue


As for 2.3, I think we should change the query filters at the start of this method so that they include nodes in maintenance mode. I think it's both safe and reasonable (and, frankly, what an operator will expect) that a node which is in maintenance mode and in DEPLOYING state, whose conductor is offline, should have its reservation cleared and be set to the DEPLOYFAILED state.
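
Concretely, the fix for 2.3 could be as small as dropping the maintenance filter from that query. A sketch against the excerpt above (not a tested patch):

# sketch: also pick up nodes in maintenance mode whose DEPLOYING reservation
# is held by an offline conductor, by no longer filtering on maintenance=False
node_iter = self.iter_nodes(
    fields=['id', 'reservation'],
    filters={'provision_state': states.DEPLOYING,
             'reserved_by_any_of': offline_conductors})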

--devananda

>>
>> Definitely we should improve option 2, but there could be
>> more issues I don't know about in a more complicated environment.
>> So my question is: do we still need a new command to recover these
>> nodes more easily without accessing the DB, like this PoC [2]:
>>   ironic-noderecover --node_uuids=UUID1,UUID2 
>> --config-file=/etc/ironic/ironic.conf
> 
> I'm -1 to anything silently removing the lock until I see a clear use
> case that is impossible to address within Ironic itself. Such a utility may and will be abused.
> 
> I'm fine with anything that does not forcibly remove the lock by default.
> 
>>
>> Best Regards,
>>
>> Tan
>>
>>
>> [1] https://review.openstack.org/#/c/319812
>> [2] https://review.openstack.org/#/c/311273/
>>



