[openstack-dev] [ironic] Tooling for recovering nodes

Jay Faulkner jay at jvf.cc
Wed Jun 1 23:44:43 UTC 2016

Some comments inline.

On 5/31/16 12:26 PM, Devananda van der Veen wrote:
> On 05/31/2016 01:35 AM, Dmitry Tantsur wrote:
>> On 05/31/2016 10:25 AM, Tan, Lin wrote:
>>> Hi,
>>> Recently, I have been working on a spec [1] to recover nodes which get stuck
>>> in the deploying state, so I would really appreciate some feedback from you all.
>>> Ironic nodes can get stuck in
>>> deploying/deploywait/cleaning/cleanwait/inspecting/deleting if the node is
>>> reserved by a dead conductor (i.e. the exclusive lock was never released).
>>> Any further requests will be denied by Ironic because it thinks the node
>>> resource is under the control of another conductor.
>>> To be clearer, let's narrow the scope and focus on the deploying state
>>> first. Currently, operators have several options for clearing the reserved lock:
>>> 1. restart the dead conductor
>>> 2. wait 2 to 3 minutes until _check_deploying_status() clears the lock.
>>> 3. The operator touches the DB to manually recover these nodes.
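
For reference, option 3 usually amounts to something like the sketch
below. This is illustrative only -- the table and column names match
Ironic's schema as I understand it, the connection URL is made up, and
you should verify everything against your own deployment before running
anything like it:

    # Illustrative only: clear a stale reservation by hand (option 3).
    # Connection URL is a placeholder; verify table/column names first.
    from sqlalchemy import create_engine, text

    engine = create_engine('mysql+pymysql://ironic:secret@db/ironic')
    with engine.begin() as conn:
        conn.execute(
            text("UPDATE nodes SET reservation = NULL WHERE uuid = :u"),
            {'u': 'UUID1'})

Hand-editing the DB like this is easy to get wrong, which is exactly why
a supported recovery path matters.
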
>>> Option two looks very promising but there are some weaknesses:
>>> 2.1 It won't work if the dead conductor was renamed or deleted.
>>> 2.2 It won't work if the node's specific driver was not enabled on live
>>> conductors.
>>> 2.3 It won't work if the node is in maintenance. (only a corner case).
>> We can and should fix all three cases.
> 2.1 and 2.2 appear to be a bug in the behavior of _check_deploying_status().
> The method claims to do exactly what you suggest in 2.1 and 2.2 -- it gathers a
> list of Nodes reserved by *any* offline conductor and tries to release the lock.
> However, it will always fail to update them, because objects.Node.release()
> raises a NodeLocked exception when called on a Node locked by a different conductor.
> Here's the relevant code path:
> ironic/conductor/manager.py:
> 1259     def _check_deploying_status(self, context):
> ...
> 1269         offline_conductors = self.dbapi.get_offline_conductors()
> ...
> 1273         node_iter = self.iter_nodes(
> 1274             fields=['id', 'reservation'],
> 1275             filters={'provision_state': states.DEPLOYING,
> 1276                      'maintenance': False,
> 1277                      'reserved_by_any_of': offline_conductors})
> ...
> 1281         for node_uuid, driver, node_id, conductor_hostname in node_iter:
> 1285             try:
> 1286                 objects.Node.release(context, conductor_hostname, node_id)
> ...
> 1292             except exception.NodeLocked:
> 1293                 LOG.warning(...)
> 1297                 continue
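
To make the failure mode concrete: the release is essentially a
compare-and-swap on the node's reservation column, so a caller can only
clear a lock whose holder matches the tag it passes in. A standalone
sketch of those semantics (not Ironic's actual code; all names here are
illustrative):

    # Illustration of compare-and-swap lock release semantics; this is
    # not Ironic's implementation.
    class NodeLocked(Exception):
        pass

    # reservation = hostname of the conductor holding the lock
    nodes = {42: {'reservation': 'dead-conductor'}}

    def release(tag, node_id):
        """Clear the lock only if `tag` matches the current holder."""
        node = nodes[node_id]
        if node['reservation'] != tag:
            raise NodeLocked(node_id)
        node['reservation'] = None

    try:
        release('live-conductor', 42)
    except NodeLocked:
        print('still stuck: lock is held by dead-conductor')

That mismatch is what sends every attempt into the NodeLocked handler at
line 1292 above, so the node stays reserved.
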
> As far as 2.3, I think we should change the query string at the start of this
> method so that it includes nodes in maintenance mode. I think it's both safe and
> reasonable (and, frankly, what an operator will expect) that a node which is in
> maintenance mode, and in DEPLOYING state, whose conductor is offline, should
> have that reservation cleared and be set to DEPLOYFAILED state.

This is an excellent idea -- and I'm going to extend it further. If I 
have any nodes in a *ING state, and they are put into maintenance, it 
should force a failure. This is potentially a more API-friendly way of 
cleaning up nodes in bad states -- an operator would put the node into 
maintenance, and once it enters the *FAIL state, troubleshoot why it 
failed, take it out of maintenance, and return it to production.
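
A rough sketch of what I have in mind -- entirely hypothetical; the real
hook point in the conductor and the exact state names would need
checking:

    # Hypothetical: when maintenance is set on a node in any *ING state,
    # push it to the matching *FAIL state. Devananda's suggestion above
    # is the DEPLOYING case of this, plus dropping the
    # 'maintenance': False filter from the periodic task's query.
    ING_TO_FAIL = {
        'deploying': 'deploy failed',
        'cleaning': 'clean failed',
        'inspecting': 'inspect failed',
    }

    def on_maintenance_set(node):
        if node.provision_state in ING_TO_FAIL:
            node.last_error = ('Forced to a failed state because '
                               'maintenance was set during %s'
                               % node.provision_state)
            node.provision_state = ING_TO_FAIL[node.provision_state]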

As an operator I obviously still want an "override command", but I 
really think this could handle a large percentage of the use cases that 
made me want one in the first place.

> --devananda
>>> We should definitely improve option 2, but there could be more issues
>>> I don't know about in more complicated environments.
>>> So my question is: do we still need a new command to recover these nodes
>>> more easily without touching the DB, like this PoC [2]:
>>>    ironic-noderecover --node_uuids=UUID1,UUID2
>>> --config-file=/etc/ironic/ironic.conf
>> I'm -1 to anything silently removing the lock until I see a clear use case
>> that is impossible to address within Ironic itself. Such a utility can and will be abused.
>> I'm fine with anything that does not forcibly remove the lock by default.
I agree such a utility could be abused. I don't think that's a good 
argument against writing it for operators, though. Any utility we write 
that could or would modify a lock should not do so by default, and 
should warn before doing so -- but there are cases where getting a lock 
cleared is both desirable and necessary.

A good example of this is an ironic-conductor failing while a node is 
locked, then being brought back up with a different hostname. Today, 
there's no way to get that lock off the node again.

Even if you force operators to replace a failed conductor with one with 
an identical hostname, any nodes locked while that replacement was 
taking place would remain locked.
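
For those cases, something along these lines is what I'd want -- a
hypothetical sketch with made-up flag names, not the PoC's actual
interface:

    # Hypothetical "warn, don't clear by default" interface for a
    # recovery tool; flag names are illustrative only.
    import argparse
    import sys

    parser = argparse.ArgumentParser(prog='ironic-noderecover')
    parser.add_argument('--node_uuids', required=True,
                        help='comma-separated node UUIDs to recover')
    parser.add_argument('--force-clear-lock', action='store_true',
                        help='actually clear reservations '
                             '(default: report only)')
    args = parser.parse_args()

    for uuid in args.node_uuids.split(','):
        if not args.force_clear_lock:
            print('%s: lock would be cleared; re-run with '
                  '--force-clear-lock to proceed' % uuid)
            continue
        sys.stderr.write('WARNING: forcibly clearing lock on %s\n' % uuid)
        # ... actually clear the reservation here ...

That way the dangerous path requires an explicit, visible decision from
the operator.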

Jay Faulkner
>>> Best Regards,
>>> Tan
>>> [1] https://review.openstack.org/#/c/319812
>>> [2] https://review.openstack.org/#/c/311273/
