[Edge-computing] [ironic][ops] Taking ironic nodes out of production

Julia Kreger juliaashleykreger at gmail.com
Tue May 21 17:33:10 UTC 2019


On Tue, May 21, 2019 at 5:55 AM <Arkady.Kanevsky at dell.com> wrote:
>
> Let's dig deeper into requirements.
> I see three distinct use cases:
> 1. Put the node into maintenance mode, say to upgrade FW/BIOS or for any other life-cycle event. It stays in the ironic cluster but is no longer in use by the rest of OpenStack, e.g. Nova.
> 2. Put the node into a "fail" state. That is, remove it from usage and from the Ironic cluster. What cleanup the operator would like to or can do depends on the failure. Depending on the node type it may need to be "replaced".

Or troubleshot by a human, and the node could be returned to a
non-failure state. I think the only way we as developers could really
support that is to allow for hook scripts to be called upon
entering/exiting such a state. That being said, at least from what Beth
was saying at the PTG, this seems to be one of the most important
states.
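
To make the hook idea a bit more concrete, here is a purely hypothetical
sketch of what an operator-supplied hook runner could look like. Nothing
like this exists in ironic today; the state name, config layout and
script paths below are all invented for illustration:

    # Hypothetical sketch only -- ironic has no such mechanism today.
    import subprocess

    # Operator-defined mapping of "out of service" states to hook scripts.
    HOOKS = {
        'quarantined': {
            'enter': '/etc/ironic/hooks/quarantine-enter.sh',
            'exit': '/etc/ironic/hooks/quarantine-exit.sh',
        },
    }

    def run_hook(state, event, node_uuid):
        """Run the operator's hook for a state transition, if configured."""
        script = HOOKS.get(state, {}).get(event)
        if script:
            # Hand over the node UUID so the script can open a ticket,
            # page a human, kick off diagnostics, and so on.
            subprocess.run([script, node_uuid], check=False)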

> 3. Put the node into "available" for other usage. What cleanup the operator wants to do will need to be defined. This is a very similar step to the one used for Bare Metal as a Service when a node is reassigned back into the available pool. Depending on the next usage of the node it may stay in the Ironic cluster or may be removed from it. Once removed it can be "retired" or used for any other purpose.

Do you mean "unprovision" a node and move it through cleaning? I'm not
sure I understand what you're trying to get across. There is a case
where a node would have been moved to a "failed" state and could be
"unprovisioned". If we reach the point where we are able to
unprovision, it seems like we might be able to re-deploy, so maybe the
option is to automatically move the node to a state which is kind of
like a bucket for broken nodes?
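
For context, the "unprovision and move through cleaning" path I'm
thinking of is what you get today by undeploying the node, roughly like
this with openstacksdk (method names from memory, and the cloud/node
names are just examples, so double-check before relying on it):

    import openstack

    conn = openstack.connect(cloud='mycloud')        # example cloud name
    node = conn.baremetal.get_node('broken-node-0')  # example node name

    # Tearing down the deployment sends the node through automated
    # cleaning; today it then lands back in 'available', which is exactly
    # what we would want to avoid for a quarantined/retired node.
    conn.baremetal.set_node_provision_state(node, 'deleted')
    conn.baremetal.wait_for_nodes_provision_state([node], 'available')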

>
> Thanks,
> Arkady
>
> -----Original Message-----
> From: Christopher Price <christopher.price at est.tech>
> Sent: Tuesday, May 21, 2019 3:26 AM
> To: Bogdan Dobrelya; openstack-discuss at lists.openstack.org; edge-computing at lists.openstack.org
> Subject: Re: [Edge-computing] [ironic][ops] Taking ironic nodes out of production
>
>
> I would add that something as simple as an operator policy could/should be able to remove hardware from an operational domain. It does not specifically need to be a fault or retirement; it may be as simple as repurposing to a different operational domain. From an OpenStack perspective this should not require any special handling different from "retirement"; it's just that there may be time constraints implied in a policy change that could potentially be ignored in a "retirement" scenario.
>
> Further, at least in my imagination, one might be reallocating hardware from one Ironic domain to another, which may have implications for how we best bring a new node online. (Or not, I'm no expert.) </ end dubious thought stream>
>
> / Chris
>
> On 2019-05-21, 09:16, "Bogdan Dobrelya" <bdobreli at redhat.com> wrote:
>
>     [CC'ed edge-computing at lists.openstack.org]
>
>     On 20.05.2019 18:33, Arne Wiebalck wrote:
>     > Dear all,
>     >
>     > One of the discussions at the PTG in Denver raised the need for
>     > a mechanism to take ironic nodes out of production (a task for
>     > which the currently available 'maintenance' flag does not seem
>     > appropriate [1]).
>     >
>     > The use case there is an unhealthy physical node in state 'active',
>     > i.e. associated with an instance. The request is then to enable an
>     > admin to mark such a node as 'faulty' or 'in quarantine' with the
>     > aim of not returning the node to the pool of available nodes once
>     > the hosted instance is deleted.
>     >
>     > A very similar use case which came up independently is node
>     > retirement: it should be possible to mark nodes ('active' or not)
>     > as being 'up for retirement' to prepare the eventual removal from
>     > ironic. As in the example above, ('active') nodes marked this way
>     > should not become eligible for instance scheduling again, but
>     > automatic cleaning, for instance, should still be possible.
>     >
>     > In an effort to cover these use cases by a more general
>     > "quarantine/retirement" feature:
>     >
>     > - are there additional use cases which could profit from such a
>     >    "take a node out of service" mechanism?
>
>     There are security-related examples described in the Edge Security
>     Challenges whitepaper [0] drafted by the k8s IoT SIG [1], such as in
>     chapter 2, "Trusting hardware", whereby "GPS coordinate changes can
>     be used to force a shutdown of an edge node". So a node may be taken
>     out of service as an indicator of a particular condition of the edge
>     hardware.
>
>     [0]
>     https://docs.google.com/document/d/1iSIk8ERcheehk0aRG92dfOvW5NjkdedN8F7mSUTr-r0/edit#heading=h.xf8mdv7zexgq
>     [1] https://github.com/kubernetes/community/tree/master/wg-iot-edge
>
>     >
>     > - would these use cases put additional constraints on what the
>     >    feature should look like (e.g.: "should not prevent cleaning")
>     >
>     > - are there other characteristics such a feature should have
>     >    (e.g.: "finding these nodes should be supported by the cli")
>     >
>     > Let me know if you have any thoughts on this.
>     >
>     > Cheers,
>     >   Arne
>     >
>     >
>     > [1] https://etherpad.openstack.org/p/DEN-train-ironic-ptg, l. 360
>     >
>
>
>     --
>     Best regards,
>     Bogdan Dobrelya,
>     Irc #bogdando
>
>
>
> _______________________________________________
> Edge-computing mailing list
> Edge-computing at lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/edge-computing


