[Edge-computing] [ironic][ops] Taking ironic nodes out of production
Christopher Price
christopher.price at est.tech
Tue May 21 08:26:25 UTC 2019
I would add that something as simple as an operator policy could (or should) be able to remove hardware from an operational domain. It need not be a fault or a retirement; it may be as simple as repurposing the hardware for a different operational domain. From an OpenStack perspective this should not require any handling different from "retirement"; it is just worth knowing that a policy change may imply time constraints that could potentially be ignored in a "retirement" scenario.
Further, at least in my imagination, one might be reallocating hardware from one Ironic domain to another, which may have implications for how we best bring a new node online (or not, I'm no expert). </end dubious thought stream>
/ Chris
On 2019-05-21, 09:16, "Bogdan Dobrelya" <bdobreli at redhat.com> wrote:
[CC'ed edge-computing at lists.openstack.org]
On 20.05.2019 18:33, Arne Wiebalck wrote:
> Dear all,
>
> One of the discussions at the PTG in Denver raised the need for
> a mechanism to take ironic nodes out of production (a task for
> which the currently available 'maintenance' flag does not seem
> appropriate [1]).
>
> The use case there is an unhealthy physical node in state 'active',
> i.e. associated with an instance. The request is then to enable an
> admin to mark such a node as 'faulty' or 'in quarantine' with the
> aim of not returning the node to the pool of available nodes once
> the hosted instance is deleted.
>
> A very similar use case which came up independently is node
> retirement: it should be possible to mark nodes ('active' or not)
> as being 'up for retirement' to prepare the eventual removal from
> ironic. As in the example above, ('active') nodes marked this way
> should not become eligible for instance scheduling again, but
> automatic cleaning, for instance, should still be possible.
>
> In an effort to cover these use cases by a more general
> "quarantine/retirement" feature:
>
> - are there additional use cases which could profit from such a
> "take a node out of service" mechanism?
There are security-related examples described in the Edge Security
Challenges whitepaper [0] drafted by the k8s IoT SIG [1], such as in
chapter 2, "Trusting hardware", where "GPS coordinate changes can be used
to force a shutdown of an edge node". So a node may be taken out of
service as an indicator of a particular condition of the edge hardware.
[0] https://docs.google.com/document/d/1iSIk8ERcheehk0aRG92dfOvW5NjkdedN8F7mSUTr-r0/edit#heading=h.xf8mdv7zexgq
[1] https://github.com/kubernetes/community/tree/master/wg-iot-edge
>
> - would these use cases put additional constraints on what the
> feature should look like (e.g.: "should not prevent cleaning")?
>
> - are there other characteristics such a feature should have
> (e.g.: "finding these nodes should be supported by the CLI")?
>
> Let me know if you have any thoughts on this.
>
> Cheers,
> Arne
>
>
> [1] https://etherpad.openstack.org/p/DEN-train-ironic-ptg, l. 360
>
--
Best regards,
Bogdan Dobrelya,
Irc #bogdando
_______________________________________________
Edge-computing mailing list
Edge-computing at lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/edge-computing
More information about the openstack-discuss
mailing list