Inline response
-----Original Message-----
From: Julia Kreger <juliaashleykreger@gmail.com>
Sent: Tuesday, May 21, 2019 12:33 PM
To: Kanevsky, Arkady
Cc: Christopher Price; Bogdan Dobrelya; openstack-discuss; edge-computing@lists.openstack.org
Subject: Re: [Edge-computing] [ironic][ops] Taking ironic nodes out of production
[EXTERNAL EMAIL]
On Tue, May 21, 2019 at 5:55 AM Arkady.Kanevsky@dell.com wrote:
Let's dig deeper into requirements. I see three distinct use cases:
- Put the node into maintenance mode, say, to upgrade FW/BIOS or for any other life-cycle event. It stays in the Ironic cluster but is no longer in use by the rest of OpenStack, e.g. Nova.
- Put the node into a "fail" state. That is, remove it from usage and from the Ironic cluster. What cleanup the operator would like to (or can) do depends on the failure. Depending on the node type, it may need to be "replaced".
Or troubleshot by a human, and possibly returned to a non-failure state. I think the only way we as developers could realistically support that is to allow hook scripts to be called upon entering/exiting such a state. That being said, at least from what Beth was saying at the PTG, this seems to be one of the most important states.
- Put the node into "available" for other usage. What cleanup the operator wants to do will need to be defined. This is very similar to the step used for Bare Metal as a Service when a node is reassigned back into the available pool. Depending on the next usage of the node, it may stay in the Ironic cluster or be removed from it. Once removed, it can be "retired" or used for any other purpose.
Do you mean "unprovision" a node and move it through cleaning? I'm not sure I understand what you're trying to get across. There is a case where a node would have been moved to a "failed" state and could be "unprovisioned". If we reach the point where we are able to unprovision, it seems like we might be able to re-deploy, so maybe the option is to automatically move the node to a state which acts as a bucket for broken nodes?
AK: Before a node is removed from Ironic, some level of cleanup is expected, especially if the node is to be reused, as Chris stated. I assume that cleanup will be done by Ironic. What you do with the node after it is outside of Ironic is out of scope.
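The hook-script idea raised above could look something like the following minimal sketch. To be clear, Ironic ships no such mechanism today; every name here (the class, the "failed"/"manageable" labels, the hook signatures) is hypothetical and only illustrates calling operator-supplied hooks on entering/exiting a quarantine-like state:

```python
# Hypothetical sketch of per-state enter/exit hooks for a "failed"
# (quarantine) state. Not an Ironic API; illustration only.
from collections import defaultdict

class NodeStateHooks:
    def __init__(self):
        self._enter = defaultdict(list)  # state -> [hook(node, state)]
        self._exit = defaultdict(list)

    def on_enter(self, state, hook):
        self._enter[state].append(hook)

    def on_exit(self, state, hook):
        self._exit[state].append(hook)

    def transition(self, node, old_state, new_state):
        # Run operator hooks when leaving the old state and entering the new one.
        for hook in self._exit[old_state]:
            hook(node, old_state)
        for hook in self._enter[new_state]:
            hook(node, new_state)
        node["provision_state"] = new_state

events = []
hooks = NodeStateHooks()
hooks.on_enter("failed", lambda n, s: events.append(("notify-ops", n["uuid"])))
hooks.on_exit("failed", lambda n, s: events.append(("close-ticket", n["uuid"])))

node = {"uuid": "1234", "provision_state": "active"}
hooks.transition(node, "active", "failed")      # operator marks node faulty
hooks.transition(node, "failed", "manageable")  # human fixed it; return to service
```

The point is only that enter/exit callbacks give operators a place to attach ticketing, alerting, or cleanup without Ironic having to know about any of it.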
Thanks, Arkady
-----Original Message-----
From: Christopher Price <christopher.price@est.tech>
Sent: Tuesday, May 21, 2019 3:26 AM
To: Bogdan Dobrelya; openstack-discuss@lists.openstack.org; edge-computing@lists.openstack.org
Subject: Re: [Edge-computing] [ironic][ops] Taking ironic nodes out of production
[EXTERNAL EMAIL]
I would add that something as simple as an operator policy could/should be able to remove hardware from an operational domain. It does not specifically need to be a fault or a retirement; it may be as simple as repurposing the hardware to a different operational domain. From an OpenStack perspective this should not require any handling distinct from "retirement"; it's just that a policy change may imply time constraints which could potentially be ignored in a retirement scenario.
Further, at least in my imagination, one might be reallocating hardware from one Ironic domain to another, which may have implications for how we best bring a new node online. (Or not, I'm no expert.) </ end dubious thought stream>
/ Chris
On 2019-05-21, 09:16, "Bogdan Dobrelya" bdobreli@redhat.com wrote:
[CC'ed edge-computing@lists.openstack.org]

On 20.05.2019 18:33, Arne Wiebalck wrote:
> Dear all,
>
> One of the discussions at the PTG in Denver raised the need for
> a mechanism to take ironic nodes out of production (a task for
> which the currently available 'maintenance' flag does not seem
> appropriate [1]).
>
> The use case there is an unhealthy physical node in state 'active',
> i.e. associated with an instance. The request is then to enable an
> admin to mark such a node as 'faulty' or 'in quarantine' with the
> aim of not returning the node to the pool of available nodes once
> the hosted instance is deleted.
>
> A very similar use case which came up independently is node
> retirement: it should be possible to mark nodes ('active' or not)
> as being 'up for retirement' to prepare the eventual removal from
> ironic. As in the example above, ('active') nodes marked this way
> should not become eligible for instance scheduling again, but
> automatic cleaning, for instance, should still be possible.
>
> In an effort to cover these use cases by a more general
> "quarantine/retirement" feature:
>
> - are there additional use cases which could profit from such a
>   "take a node out of service" mechanism?

There are security-related examples described in the Edge Security Challenges whitepaper [0] drafted by the k8s IoT SIG [1], like in chapter 2, "Trusting hardware", whereby "GPS coordinate changes can be used to force a shutdown of an edge node". So a node may be taken out of service as an indicator of a particular condition of the edge hardware.

[0] https://docs.google.com/document/d/1iSIk8ERcheehk0aRG92dfOvW5NjkdedN8F7mSUTr-r0/edit#heading=h.xf8mdv7zexgq
[1]
https://github.com/kubernetes/community/tree/master/wg-iot-edge
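The GPS example above amounts to a simple out-of-service predicate. A minimal sketch, assuming made-up field names and an invented drift threshold (nothing here corresponds to a real Ironic or k8s field):

```python
# Sketch: flag an edge node for quarantine when its reported GPS position
# drifts beyond a threshold from where it was provisioned. The record layout
# ("provisioned_at"/"reported_at") and the 1 km threshold are invented.
import math

def distance_km(a, b):
    # Haversine great-circle distance between (lat, lon) pairs in degrees.
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def should_quarantine(node, max_drift_km=1.0):
    return distance_km(node["provisioned_at"], node["reported_at"]) > max_drift_km

node_ok = {"provisioned_at": (48.21, 16.37), "reported_at": (48.21, 16.37)}
node_moved = {"provisioned_at": (48.21, 16.37), "reported_at": (48.30, 16.37)}
```

Here `should_quarantine(node_moved)` trips (roughly 10 km of drift), while the unmoved node does not; the actual response (shutdown, quarantine state, alert) would be operator policy.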
> - would these use cases put additional constraints on how the
>   feature should look like (e.g.: "should not prevent cleaning")
>
> - are there other characteristics such a feature should have
>   (e.g.: "finding these nodes should be supported by the cli")
>
> Let me know if you have any thoughts on this.
>
> Cheers,
>  Arne
>
> [1] https://etherpad.openstack.org/p/DEN-train-ironic-ptg, l. 360

--
Best regards,
Bogdan Dobrelya,
Irc #bogdando

_______________________________________________
Edge-computing mailing list
Edge-computing@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/edge-computing
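On Arne's "finding these nodes should be supported by the cli" point: the existing CLI already filters on the maintenance flag, and a retirement/quarantine flag could be filtered the same way. A sketch of that filtering logic over plain node records (the "retired" field and the filter function are hypothetical; only the maintenance flag exists in Ironic today):

```python
# Sketch of the filtering a hypothetical "--retired" CLI flag could expose,
# analogous to the existing maintenance filter. The "retired" field and
# filter_nodes() are illustrations, not Ironic code.

def filter_nodes(nodes, retired=None, maintenance=None):
    result = nodes
    if retired is not None:
        result = [n for n in result if n.get("retired", False) == retired]
    if maintenance is not None:
        result = [n for n in result if n.get("maintenance", False) == maintenance]
    return result

nodes = [
    {"uuid": "a", "retired": True, "maintenance": False},
    {"uuid": "b", "retired": False, "maintenance": True},
    {"uuid": "c", "retired": False, "maintenance": False},
]
```

With records like these, filtering on `retired=True` returns only node "a", and filtering on `maintenance=True` returns only node "b", mirroring how an operator would list each class of out-of-service node.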