[openstack-dev] [tripleo] ironic automated cleaning by default?

Dmitry Tantsur dtantsur at redhat.com
Fri Apr 27 13:40:11 UTC 2018


Hi Tim,

On 04/26/2018 07:16 PM, Tim Bell wrote:
> My worry with changing the default is that it would be like adding the following in /etc/environment,
> 
> alias ls=' rm -rf / --no-preserve-root'
> 
> i.e. an operation which was previously read-only now becomes irreversible.

Well, deleting instances has never been read-only :) The real problem is that 
Heat can delete instances during a seemingly innocent operation. And I do agree 
that we cannot just ignore this problem.

> 
> We also have current use cases with Ironic where we are moving machines between projects by 'disowning' them to the spare pool and then reclaiming them (by UUID) into new projects with the same state.

I'd be curious to hear how exactly that works. Does it happen at the Nova level 
or at the Ironic level?

> 
> However, other operators may feel differently which is why I suggest asking what people feel about changing the default.
> 
> In any case, changes in default behaviour need to be highly visible.
> 
> Tim
> 
> -----Original Message-----
> From: "Arkady.Kanevsky at dell.com" <Arkady.Kanevsky at dell.com>
> Reply-To: "OpenStack Development Mailing List (not for usage questions)" <openstack-dev at lists.openstack.org>
> Date: Thursday, 26 April 2018 at 18:48
> To: "openstack-dev at lists.openstack.org" <openstack-dev at lists.openstack.org>
> Subject: Re: [openstack-dev] [tripleo] ironic automated cleaning by default?
> 
>      +1.
>      It would be good to also identify the use cases.
>      I'm surprised that a node would be cleaned up automatically.
>      I would expect that to be a deliberate request from the administrator,
>      or perhaps from the user when they "return" a node to the free pool after baremetal usage.
>      Thanks,
>      Arkady
>      
>      -----Original Message-----
>      From: Tim Bell [mailto:Tim.Bell at cern.ch]
>      Sent: Thursday, April 26, 2018 11:17 AM
>      To: OpenStack Development Mailing List (not for usage questions)
>      Subject: Re: [openstack-dev] [tripleo] ironic automated cleaning by default?
>      
>      How about asking the operators at the summit Forum or asking on openstack-operators to see what the users think?
>      
>      Tim
>      
>      -----Original Message-----
>      From: Ben Nemec <openstack at nemebean.com>
>      Reply-To: "OpenStack Development Mailing List (not for usage questions)" <openstack-dev at lists.openstack.org>
>      Date: Thursday, 26 April 2018 at 17:39
>      To: "OpenStack Development Mailing List (not for usage questions)" <openstack-dev at lists.openstack.org>, Dmitry Tantsur <dtantsur at redhat.com>
>      Subject: Re: [openstack-dev] [tripleo] ironic automated cleaning by default?
>      
>          
>          
>          On 04/26/2018 09:24 AM, Dmitry Tantsur wrote:
>          > Answering to both James and Ben inline.
>          >
>          > On 04/25/2018 05:47 PM, Ben Nemec wrote:
>          >>
>          >>
>          >> On 04/25/2018 10:28 AM, James Slagle wrote:
>          >>> On Wed, Apr 25, 2018 at 10:55 AM, Dmitry Tantsur
>          >>> <dtantsur at redhat.com> wrote:
>          >>>> On 04/25/2018 04:26 PM, James Slagle wrote:
>          >>>>>
>          >>>>> On Wed, Apr 25, 2018 at 9:14 AM, Dmitry Tantsur <dtantsur at redhat.com>
>          >>>>> wrote:
>          >>>>>>
>          >>>>>> Hi all,
>          >>>>>>
>          >>>>>> I'd like to restart the conversation on enabling automated node
>          >>>>>> cleaning by default for the undercloud. This process wipes partition
>          >>>>>> tables (and optionally all data) from overcloud nodes each time they
>          >>>>>> move to the "available" state (i.e. on initial enrollment and after
>          >>>>>> each tear-down).
>          >>>>>>
>          >>>>>> We have had it disabled for a few reasons:
>          >>>>>> - it was not possible to skip the time-consuming wiping of data from
>          >>>>>> disks
>          >>>>>> - the way our workflows used to work required going between the
>          >>>>>> manageable and available states several times
>          >>>>>>
>          >>>>>> However, having cleaning disabled has several issues:
>          >>>>>> - a configdrive left from a previous deployment may confuse
>          >>>>>> cloud-init
>          >>>>>> - a bootable partition left from a previous deployment may take
>          >>>>>> precedence in some BIOSes
>          >>>>>> - a UEFI boot partition left from a previous deployment is likely to
>          >>>>>> confuse the UEFI firmware
>          >>>>>> - apparently Ceph does not work correctly without cleaning (I'll defer
>          >>>>>> to the storage team to comment)
>          >>>>>>
>          >>>>>> For these reasons we don't recommend having cleaning disabled, and I
>          >>>>>> propose
>          >>>>>> to re-enable it.
>          >>>>>>
>          >>>>>> It has the following drawbacks:
>          >>>>>> - The default workflow will require another node boot, thus becoming
>          >>>>>> several
>          >>>>>> minutes longer (incl. the CI)
>          >>>>>> - It will no longer be possible to easily restore a deleted overcloud
>          >>>>>> node.
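
For reference, re-enabling it boils down to a couple of configuration options. A 
rough, untested sketch (option names from memory, so please check them against the 
current docs; on a TripleO undercloud the same switch is exposed as clean_nodes in 
undercloud.conf):

    # /etc/ironic/ironic.conf
    [conductor]
    # wipe nodes every time they move to the "available" state
    automated_clean = true

    [deploy]
    # keep cleaning fast: erase only partition tables / metadata,
    # not the full disk contents
    erase_devices_priority = 0
    erase_devices_metadata_priority = 10

The extra node boot is still needed, but with the metadata-only erase the wipe 
itself is a matter of seconds.
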
>          >>>>>
>          >>>>>
>          >>>>> I'm trending towards -1, for exactly the reasons you list as
>          >>>>> drawbacks. There has been no shortage of users who have ended up with
>          >>>>> accidentally deleted overclouds. These are usually caused by user error
>          >>>>> or unintended/unpredictable Heat operations. Until we either have a way
>          >>>>> to guarantee that Heat will never delete a node, or Heat is entirely
>          >>>>> out of the picture for Ironic provisioning, I'd prefer that we didn't
>          >>>>> enable automated cleaning by default.
>          >>>>>
>          >>>>> I believe we had done something with policy.json at one time to
>          >>>>> prevent node delete, but I don't recall if that protected against both
>          >>>>> user-initiated actions and Heat actions. And even that was not enabled
>          >>>>> by default.
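
If someone wants to dig that approach up again: I think it amounts to tightening 
the delete rule in nova's policy.json on the undercloud. A purely hypothetical, 
untested sketch, where "!" means "never allowed":

    {
        "os_compute_api:servers:delete": "!"
    }

That would block all server deletions, including the ones Heat issues during a 
stack update or delete, so it is a rather blunt instrument.
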
>          >>>>>
>          >>>>> IMO, we need to keep "safe" defaults, even if it means documenting
>          >>>>> that you should clean manually to prevent the issues you point out
>          >>>>> above. The alternative is to have no way to recover deleted nodes by
>          >>>>> default.
>          >>>>
>          >>>>
>          >>>> Well, it's not clear what "safe" means here: protecting people who
>          >>>> explicitly delete their stacks, or protecting people who don't realize
>          >>>> that a previous deployment may screw up their new one in a subtle way.
>          >>>
>          >>> The latter you can recover from, the former you can't if automated
>          >>> cleaning is true.
>          >
>          > Nor can we recover from 'rm -rf / --no-preserve-root', but it's not a
>          > reason to disable the 'rm' command :)
>          >
>          >>>
>          >>> It's not just about people who explicitly delete their stacks (whether
>          >>> intentional or not). There could be user error (non-explicit) or
>          >>> side-effects triggered by Heat that could cause nodes to get deleted.
>          >
>          > If we have problems with Heat, we should fix Heat or stop using it. What
>          > you're saying is essentially "we prevent ironic from doing the right
>          > thing because we're using a tool that can invoke 'rm -rf /' at a wrong
>          > moment."
>          >
>          >>>
>          >>> You couldn't recover from those scenarios if automated cleaning were
>          >>> true. Whereas you could always fix a deployment error by opting in to
>          >>> do an automated clean. Does Ironic keep track of whether a node has
>          >>> been previously cleaned? Could we add a validation to check whether any
>          >>> nodes that might be used in the deployment were not previously cleaned?
>          >
>          > It may be possible to figure out if a node was ever cleaned. But then
>          > we'll force operators to invoke cleaning manually, right? It will work,
>          > but that's another step in the default workflow. Are you okay with it?
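
For completeness, the manual step would look roughly like this (a sketch only; the 
exact CLI syntax depends on the python-ironicclient version, and 
erase_devices_metadata wipes just the partition tables, so it is quick):

    # move the node to "manageable", run a metadata-only clean,
    # then make it schedulable again
    openstack baremetal node manage <node>
    openstack baremetal node clean <node> \
        --clean-steps '[{"interface": "deploy", "step": "erase_devices_metadata"}]'
    openstack baremetal node provide <node>
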
>          >
>          >>
>          >> Is there a way to only do cleaning right before a node is deployed?
>          >> If you're about to write a new image to the disk then any data there
>          >> is forfeit anyway. Since the concern is old data on the disk messing
>          >> up subsequent deploys, it doesn't really matter whether you clean it
>          >> right after it's deleted or right before it's deployed, but the latter
>          >> leaves the data intact for longer in case a mistake was made.
>          >>
>          >> If that's not possible then consider this an RFE. :-)
>          >
>          > It's a good idea, but it may cause problems with rebuilding instances.
>          > Rebuild is essentially a re-deploy of the OS; users may not expect the
>          > whole disk to be wiped.
>          >
>          > Also it's unclear whether we want to write additional features to work
>          > around disabled cleaning.
>          
>          No matter how good the tooling gets, user error will always be a thing.
>          Someone will scale down the wrong node or something similar.  I think
>          there's value in allowing recovery from mistakes.  We all make them. :-)
>          
> 
> 



