[openstack-dev] [tripleo] ironic automated cleaning by default?

Tim Bell Tim.Bell at cern.ch
Thu Apr 26 16:16:38 UTC 2018


How about asking the operators at the summit Forum or asking on openstack-operators to see what the users think?

Tim

-----Original Message-----
From: Ben Nemec <openstack at nemebean.com>
Reply-To: "OpenStack Development Mailing List (not for usage questions)" <openstack-dev at lists.openstack.org>
Date: Thursday, 26 April 2018 at 17:39
To: "OpenStack Development Mailing List (not for usage questions)" <openstack-dev at lists.openstack.org>, Dmitry Tantsur <dtantsur at redhat.com>
Subject: Re: [openstack-dev] [tripleo] ironic automated cleaning by default?

    
    
    On 04/26/2018 09:24 AM, Dmitry Tantsur wrote:
    > Answering to both James and Ben inline.
    > 
    > On 04/25/2018 05:47 PM, Ben Nemec wrote:
    >>
    >>
    >> On 04/25/2018 10:28 AM, James Slagle wrote:
    >>> On Wed, Apr 25, 2018 at 10:55 AM, Dmitry Tantsur 
    >>> <dtantsur at redhat.com> wrote:
    >>>> On 04/25/2018 04:26 PM, James Slagle wrote:
    >>>>>
    >>>>> On Wed, Apr 25, 2018 at 9:14 AM, Dmitry Tantsur <dtantsur at redhat.com>
    >>>>> wrote:
    >>>>>>
    >>>>>> Hi all,
    >>>>>>
    >>>>>> I'd like to restart the conversation on enabling node automated
    >>>>>> cleaning by default for the undercloud. This process wipes the
    >>>>>> partitioning tables (and optionally all the data) from overcloud
    >>>>>> nodes each time they move to the "available" state (i.e. on initial
    >>>>>> enrollment and after each tear down).
    >>>>>>
    >>>>>> We have had it disabled for a few reasons:
    >>>>>> - it was not possible to skip the time-consuming wiping of data from
    >>>>>> disks
    >>>>>> - the way our workflows used to work required going between the
    >>>>>> "manageable" and "available" states several times (see the sketch
    >>>>>> below)
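    >>>>>>
    >>>>>> For context, that transition looks roughly like this with the
    >>>>>> baremetal CLI (a minimal sketch; the node UUID is a placeholder):
    >>>>>>
    >>>>>>   # move the node to "manageable" for out-of-band operations
    >>>>>>   openstack baremetal node manage <node-uuid>
    >>>>>>   # move it back to "available"; with automated cleaning enabled,
    >>>>>>   # this is the point where cleaning runs
    >>>>>>   openstack baremetal node provide <node-uuid>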
    >>>>>>
    >>>>>> However, having cleaning disabled has several issues:
    >>>>>> - a configdrive left from a previous deployment may confuse
    >>>>>> cloud-init
    >>>>>> - a bootable partition left from a previous deployment may take
    >>>>>> precedence in some BIOSes
    >>>>>> - a UEFI boot partition left from a previous deployment is likely to
    >>>>>> confuse the UEFI firmware
    >>>>>> - apparently Ceph does not work correctly without cleaning (I'll
    >>>>>> defer to the storage team to comment)
    >>>>>>
    >>>>>> For these reasons we don't recommend having cleaning disabled, and I
    >>>>>> propose
    >>>>>> to re-enable it.
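    >>>>>>
    >>>>>> Concretely, the toggle in question is roughly this (a sketch; I'm
    >>>>>> assuming the usual option names, so please check your release
    >>>>>> before relying on them):
    >>>>>>
    >>>>>>   # ironic.conf on the undercloud
    >>>>>>   [conductor]
    >>>>>>   automated_clean = True
    >>>>>>
    >>>>>>   # or the corresponding knob in undercloud.conf for TripleO
    >>>>>>   [DEFAULT]
    >>>>>>   clean_nodes = true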
    >>>>>>
    >>>>>> It has the following drawbacks:
    >>>>>> - The default workflow will require another node boot, thus becoming
    >>>>>> several
    >>>>>> minutes longer (incl. the CI)
    >>>>>> - It will no longer be possible to easily restore a deleted overcloud
    >>>>>> node.
    >>>>>
    >>>>>
    >>>>> I'm trending towards -1, for these exact reasons you list as
    >>>>> drawbacks. There has been no shortage of occurrences of users who have
    >>>>> ended up with accidentally deleted overclouds. These are usually
    >>>>> caused by user error or unintended/unpredictable Heat operations.
    >>>>> Until we have a way to guarantee that Heat will never delete a node,
    >>>>> or Heat is entirely out of the picture for Ironic provisioning, I'd
    >>>>> prefer that we didn't enable automated cleaning by default.
    >>>>>
    >>>>> I believe we had done something with policy.json at one time to
    >>>>> prevent node delete, but I don't recall if that protected against both
    >>>>> user-initiated actions and Heat actions. And even that was not enabled
    >>>>> by default.
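    >>>>>
    >>>>> (Something along these lines in nova's policy.json on the undercloud
    >>>>> would at least block the delete call at the API level; this is only a
    >>>>> sketch, and I'm assuming the standard policy target name:
    >>>>>
    >>>>>   {
    >>>>>       "os_compute_api:servers:delete": "!"
    >>>>>   }
    >>>>>
    >>>>> Whether that's exactly what we did back then, I don't recall.)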
    >>>>>
    >>>>> IMO, we need to keep "safe" defaults. Even if it means manually
    >>>>> documenting that you should clean to prevent the issues you point out
    >>>>> above. The alternative is to have no way to recover deleted nodes by
    >>>>> default.
    >>>>
    >>>>
    >>>> Well, it's not clear what "safe" means here: protecting people who
    >>>> explicitly delete their stacks, or protecting people who don't realize
    >>>> that a previous deployment may screw up their new one in a subtle way.
    >>>
    >>> The latter you can recover from; the former you can't if automated
    >>> cleaning is enabled.
    > 
    > Nor can we recover from 'rm -rf / --no-preserve-root', but it's not a 
    > reason to disable the 'rm' command :)
    > 
    >>>
    >>> It's not just about people who explicitly delete their stacks (whether
    >>> intentional or not). There could be user error (non-explicit) or
    >>> side-effects triggered by Heat that could cause nodes to get deleted.
    > 
    > If we have problems with Heat, we should fix Heat or stop using it. What 
    > you're saying is essentially "we prevent ironic from doing the right 
    > thing because we're using a tool that can invoke 'rm -rf /' at the wrong
    > moment."
    > 
    >>>
    >>> You couldn't recover from those scenarios if automated cleaning were
    >>> enabled, whereas you could always fix a deployment error by opting in
    >>> to an automated clean. Does Ironic keep track of whether a node has
    >>> been previously cleaned? Could we add a validation to check whether
    >>> any nodes being used in the deployment were not previously cleaned?
    > 
    > It may be possible to figure out if a node was ever cleaned. But then
    > we'll force operators to invoke cleaning manually, right? It will work,
    > but that's another step in the default workflow. Are you okay with it?
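    >
    > For reference, the manual invocation would look roughly like this (a
    > sketch; the node first has to be moved to the "manageable" state, and
    > the step shown is the metadata-only erase):
    >
    >   openstack baremetal node clean <node-uuid> \
    >       --clean-steps '[{"interface": "deploy", "step": "erase_devices_metadata"}]'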
    > 
    >>
    >> Is there a way to only do cleaning right before a node is deployed?  
    >> If you're about to write a new image to the disk then any data there 
    >> is forfeit anyway. Since the concern is old data on the disk messing 
    >> up subsequent deploys, it doesn't really matter whether you clean it 
    >> right after it's deleted or right before it's deployed, but the latter 
    >> leaves the data intact for longer in case a mistake was made.
    >>
    >> If that's not possible then consider this an RFE. :-)
    > 
    > It's a good idea, but it may cause problems with rebuilding instances. 
    > Rebuild is essentially a re-deploy of the OS; users may not expect the
    > whole disk to be wiped.
    > 
    > Also it's unclear whether we want to write additional features to work 
    > around disabled cleaning.
    
    No matter how good the tooling gets, user error will always be a thing. 
    Someone will scale down the wrong node or something similar. I think 
    there's value in allowing recovery from mistakes. We all make them. :-)
    


