[openstack-dev] [ironic] Tooling for recovering nodes

Jay Faulkner jay at jvf.cc
Wed Jun 1 23:36:24 UTC 2016


Hey Tan, some comments inline.


On 5/31/16 1:25 AM, Tan, Lin wrote:
> Hi,
>
> I have recently been working on a spec [1] to recover nodes that get stuck in the deploying state, and I would really appreciate your feedback.
>
> Ironic nodes can be stuck in deploying/deploywait/cleaning/cleanwait/inspecting/deleting if the node is reserved by a dead conductor (the exclusive lock was not released).
> Any further requests will be denied by ironic because it thinks the node resource is under control of another conductor.
>
> To be clearer, let's narrow the scope and focus on the deploying state first. Currently, operators have several ways to clear the reserved lock:
> 1. Restart the dead conductor.
> 2. Wait up to 2 or 3 minutes, after which _check_deploying_states() will clear the lock.
> 3. Touch the DB to recover these nodes manually.
I actually like option #3 being optionally integrated into a tool to 
clear nodes stuck in an *ing state. If specified, the tool would clear the 
lock on the node as it moved it from DEPLOYING -> DEPLOYFAILED. Obviously, 
for cleaning this could be dangerous, and should be documented as such -- 
imagine clearing a lock mid-firmware flash and then having a power action 
taken that bricks the node.

Given this is tooling intended to handle many cases, I think it's better 
to give the operator the choice to take more dramatic action if they wish.
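
To make the idea concrete, here is a rough sketch of the behaviour I have in 
mind. Everything in it is hypothetical: the Node record, the state strings, 
recover_nodes() and the allow_cleaning switch are illustrative names only, 
not ironic's actual objects or API, and it works on in-memory records rather 
than the real DB.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Set

# Hypothetical mapping from a stuck "*ing" state to its failure state.
FAILED_STATE = {
    "deploying": "deploy failed",
    "cleaning": "clean failed",
}


@dataclass
class Node:
    """Illustrative stand-in for a node record; not ironic's Node object."""
    uuid: str
    provision_state: str
    reservation: Optional[str] = None  # conductor currently holding the lock


def recover_nodes(nodes: Dict[str, Node],
                  uuids: Iterable[str],
                  dead_conductors: Set[str],
                  allow_cleaning: bool = False) -> List[str]:
    """Release locks held by dead conductors and fail the stuck nodes.

    Nodes in "cleaning" are skipped unless the operator explicitly opts in,
    because interrupting cleaning (e.g. a firmware flash) can brick a node.
    Returns the UUIDs that were actually recovered.
    """
    recovered = []
    for uuid in uuids:
        node = nodes[uuid]
        # Only touch nodes whose lock belongs to a conductor known to be dead.
        if node.reservation not in dead_conductors:
            continue
        if node.provision_state == "cleaning" and not allow_cleaning:
            continue
        target = FAILED_STATE.get(node.provision_state)
        if target is None:
            continue  # not a state this sketch knows how to recover
        node.provision_state = target
        node.reservation = None  # release the exclusive lock
        recovered.append(uuid)
    return recovered


nodes = {
    "n1": Node("n1", "deploying", reservation="dead-host"),
    "n2": Node("n2", "cleaning", reservation="dead-host"),
    "n3": Node("n3", "deploying", reservation="live-host"),
}
first_pass = recover_nodes(nodes, ["n1", "n2", "n3"], {"dead-host"})
```

With the default settings only the deploying node held by the dead conductor 
is touched; the cleaning node is left alone until the operator passes 
allow_cleaning=True, which is the "more dramatic action" opt-in.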


Thanks,
Jay Faulkner
> Option two looks very promising, but it has some weaknesses:
> 2.1 It won't work if the dead conductor was renamed or deleted.
> 2.2 It won't work if the node's specific driver was not enabled on live conductors.
> 2.3 It won't work if the node is in maintenance (only a corner case).
>
> We should definitely improve option 2, but there could be more issues that I am not aware of in a more complicated environment.
> So my question is: do we still need a new command to recover these nodes more easily without accessing the DB, like this PoC [2]:
>    ironic-noderecover --node_uuids=UUID1,UUID2  --config-file=/etc/ironic/ironic.conf
>
> Best Regards,
>
> Tan
>
>
> [1] https://review.openstack.org/#/c/319812
> [2] https://review.openstack.org/#/c/311273/