Julia,

For #3 I was trying to cover the case where Ironic is used to manage servers for multiple different platform clusters, for example two different OpenStack clusters that share a single Ironic, or one OpenStack and one Kubernetes cluster with a shared Ironic between them. This use case supports taking a node from one platform cluster, cleaning it up, and allocating it to another platform cluster.

Thanks,
Arkady

-----Original Message-----
From: Julia Kreger <juliaashleykreger@gmail.com>
Sent: Tuesday, May 21, 2019 12:33 PM
To: Kanevsky, Arkady
Cc: Christopher Price; Bogdan Dobrelya; openstack-discuss; edge-computing@lists.openstack.org
Subject: Re: [Edge-computing] [ironic][ops] Taking ironic nodes out of production

On Tue, May 21, 2019 at 5:55 AM <Arkady.Kanevsky@dell.com> wrote:
Let's dig deeper into the requirements. I see three distinct use cases:

1. Put the node into maintenance mode, say to upgrade FW/BIOS or for any other life-cycle event. It stays in the Ironic cluster but is no longer in use by the rest of OpenStack, e.g. Nova.

2. Put the node into a "failed" state. That is, remove it from usage and remove it from the Ironic cluster. What cleanup the operator would like to do, or can do, depends on the failure. Depending on the node type, it may need to be "replaced".
Or it could be troubleshot by a human and returned to a non-failure state. I think the only way we as developers could broadly support that is to allow hook scripts to be called upon entering/exiting such a state. That being said, at least from what Beth was saying at the PTG, this seems to be one of the most important states.
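To illustrate the hook idea (purely hypothetical, nothing like this exists in Ironic today), a conductor-side dispatcher could run operator-supplied scripts on state entry/exit, along these lines:

# Hypothetical sketch of the hook-script idea: run operator-supplied
# scripts when a node enters or leaves a "failed"-like state. None of
# this is existing Ironic code; names and paths are made up.
import os
import subprocess

HOOK_DIR = '/etc/ironic/hooks'  # assumed location for operator scripts

def run_state_hooks(node_uuid, state, event):
    """Run the hook for 'enter' or 'exit' of the given state, if present."""
    hook = os.path.join(HOOK_DIR, '%s-%s' % (state, event))  # e.g. failed-enter
    if os.access(hook, os.X_OK):
        # Pass context via the environment so hook scripts stay simple.
        env = dict(os.environ, IRONIC_NODE=node_uuid, IRONIC_STATE=state)
        subprocess.run([hook], env=env, check=False, timeout=60)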
3. Put the node into an "available" state for other usage. What cleanup the operator wants to do will need to be defined. This is a very similar step to the one used for Bare Metal as a Service when a node is reassigned back into the available pool. Depending on its next usage, the node may stay in the Ironic cluster or may be removed from it. Once removed, it can be "retired" or used for any other purpose.
Do you mean "unprovision" a node and move it through cleaning? I'm not sure I understand what you're trying to get across. There is a case where a node would have been moved to a "failed" state and could be "unprovisioned". If we reach the point where we are able to unprovision, it seems like we might be able to re-deploy, so maybe the option is to automatically move the node to a state which acts as a kind of bucket for broken nodes?
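For what it's worth, the hand-off flow from #3 can mostly be expressed with today's provision-state API. A rough sketch, assuming a recent openstacksdk; the cloud and node names are made up, and the 'owner' field needs a fairly new Bare Metal API microversion:

# Rough sketch of "take a node from one platform cluster, clean it,
# and hand it to another" using today's provision states.
import openstack

conn = openstack.connect(cloud='shared-ironic')  # assumed clouds.yaml entry
node = conn.baremetal.get_node('node-42')        # hypothetical node name

# Tear down the instance; with automated cleaning enabled the node
# passes through cleaning and lands in 'available' again.
conn.baremetal.set_node_provision_state(node, 'deleted')
conn.baremetal.wait_for_nodes_provision_state([node], 'available')

# Record the new consumer, e.g. the Kubernetes cluster's management
# plane (the 'owner' field requires a recent API microversion).
conn.baremetal.update_node(node, owner='k8s-cluster-1')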
Thanks, Arkady
-----Original Message----- From: Christopher Price <christopher.price@est.tech> Sent: Tuesday, May 21, 2019 3:26 AM To: Bogdan Dobrelya; openstack-discuss@lists.openstack.org; edge-computing@lists.openstack.org Subject: Re: [Edge-computing] [ironic][ops] Taking ironic nodes out of production
I would add that something as simple as an operator policy could/should be able to remove hardware from an operational domain. It does not specifically need to be a fault or retirement; it may be as simple as repurposing to a different operational domain. From an OpenStack perspective this should not require any special handling beyond "retirement"; it's just that a policy change may imply time constraints which could potentially be ignored in a "retirement" scenario.
Further, at least in my imagination, one might be reallocating hardware from one Ironic domain to another, which may have implications for how we best bring a new node online. (Or not, I'm no expert.) </ end dubious thought stream>
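On that last point, bringing a node online in a different Ironic domain is plain enrollment. A minimal sketch, assuming a recent openstacksdk and the ipmi driver; the names, BMC address, and credentials are all made up:

# Minimal enrollment sketch for bringing a node online in another
# Ironic domain. Driver choice, address, and credentials are all
# assumptions for illustration.
import openstack

conn = openstack.connect(cloud='other-ironic-domain')  # assumed cloud entry

node = conn.baremetal.create_node(
    name='node-42',                        # hypothetical node name
    driver='ipmi',
    driver_info={
        'ipmi_address': '10.0.0.42',       # made-up BMC address
        'ipmi_username': 'admin',
        'ipmi_password': 'secret',
    })

# Walk the node through enroll -> manageable -> available; cleaning
# runs along the way if automated cleaning is enabled.
conn.baremetal.set_node_provision_state(node, 'manage')
conn.baremetal.wait_for_nodes_provision_state([node], 'manageable')
conn.baremetal.set_node_provision_state(node, 'provide')
conn.baremetal.wait_for_nodes_provision_state([node], 'available')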
/ Chris
On 2019-05-21, 09:16, "Bogdan Dobrelya" <bdobreli@redhat.com> wrote:
[CC'ed edge-computing@lists.openstack.org]
On 20.05.2019 18:33, Arne Wiebalck wrote:
> Dear all,
>
> One of the discussions at the PTG in Denver raised the need for
> a mechanism to take ironic nodes out of production (a task for
> which the currently available 'maintenance' flag does not seem
> appropriate [1]).
>
> The use case there is an unhealthy physical node in state 'active',
> i.e. associated with an instance. The request is then to enable an
> admin to mark such a node as 'faulty' or 'in quarantine' with the
> aim of not returning the node to the pool of available nodes once
> the hosted instance is deleted.
>
> A very similar use case which came up independently is node
> retirement: it should be possible to mark nodes ('active' or not)
> as being 'up for retirement' to prepare the eventual removal from
> ironic. As in the example above, ('active') nodes marked this way
> should not become eligible for instance scheduling again, but
> automatic cleaning, for instance, should still be possible.
>
> In an effort to cover these use cases by a more general
> "quarantine/retirement" feature:
>
> - are there additional use cases which could profit from such a
>   "take a node out of service" mechanism?
There are security-related examples described in the Edge Security Challenges whitepaper [0] drafted by the k8s IoT SIG [1], such as in chapter 2, "Trusting hardware", whereby "GPS coordinate changes can be used to force a shutdown of an edge node". So a node may be taken out of service as an indicator of a particular condition of the edge hardware.
[0] https://docs.google.com/document/d/1iSIk8ERcheehk0aRG92dfOvW5NjkdedN8F7mSUTr...
[1] https://github.com/kubernetes/community/tree/master/wg-iot-edge
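To make that concrete: with today's API, the closest tool for such a condition-driven removal is probably the maintenance flag. A minimal sketch, assuming a recent openstacksdk; gps_moved() stands in for a site-specific check and is not an existing Ironic API:

# Sketch only: reacting to an edge-security condition by taking the
# node out of service via the maintenance flag. gps_moved() is a
# hypothetical site-specific check.
import openstack

conn = openstack.connect(cloud='edge-site-1')  # assumed clouds.yaml entry

for node in conn.baremetal.nodes(details=True):
    if gps_moved(node):  # hypothetical condition check
        conn.baremetal.set_node_maintenance(
            node, reason='GPS coordinates changed: possible tampering')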
> - would these use cases put additional constraints on how the
>   feature should look like (e.g.: "should not prevent cleaning")
>
> - are there other characteristics such a feature should have
>   (e.g.: "finding these nodes should be supported by the cli")
>
> Let me know if you have any thoughts on this.
>
> Cheers,
>  Arne
>
>
> [1] https://etherpad.openstack.org/p/DEN-train-ironic-ptg, l. 360
-- Best regards, Bogdan Dobrelya, Irc #bogdando
_______________________________________________ Edge-computing mailing list Edge-computing@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/edge-computing