[openstack-dev] [tripleo] ironic automated cleaning by default?

Ben Nemec openstack at nemebean.com
Thu Apr 26 15:37:58 UTC 2018



On 04/26/2018 09:24 AM, Dmitry Tantsur wrote:
> Answering to both James and Ben inline.
> 
> On 04/25/2018 05:47 PM, Ben Nemec wrote:
>>
>>
>> On 04/25/2018 10:28 AM, James Slagle wrote:
>>> On Wed, Apr 25, 2018 at 10:55 AM, Dmitry Tantsur 
>>> <dtantsur at redhat.com> wrote:
>>>> On 04/25/2018 04:26 PM, James Slagle wrote:
>>>>>
>>>>> On Wed, Apr 25, 2018 at 9:14 AM, Dmitry Tantsur <dtantsur at redhat.com>
>>>>> wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I'd like to restart the conversation on enabling automated node
>>>>>> cleaning by default for the undercloud. This process wipes
>>>>>> partition tables (and optionally all data) from overcloud nodes
>>>>>> each time they move to the "available" state (i.e. on initial
>>>>>> enrollment and after each tear down).
>>>>>>
>>>>>> We have had it disabled for a few reasons:
>>>>>> - it was not possible to skip the time-consuming wiping of data
>>>>>> from disks
>>>>>> - the way our workflows used to work required moving between the
>>>>>> manageable and available states several times
>>>>>>
>>>>>> However, having cleaning disabled has several issues:
>>>>>> - a configdrive left from a previous deployment may confuse 
>>>>>> cloud-init
>>>>>> - a bootable partition left from a previous deployment may take
>>>>>> precedence
>>>>>> in some BIOS
>>>>>> - a UEFI boot partition left from a previous deployment is likely to
>>>>>> confuse UEFI firmware
>>>>>> - apparently Ceph does not work correctly without cleaning (I'll 
>>>>>> defer to
>>>>>> the storage team to comment)
>>>>>>
>>>>>> For these reasons we don't recommend having cleaning disabled, and I
>>>>>> propose
>>>>>> to re-enable it.
>>>>>>
>>>>>> It has the following drawbacks:
>>>>>> - The default workflow will require another node boot, thus
>>>>>> becoming several minutes longer (including in CI)
>>>>>> - It will no longer be possible to easily restore a deleted overcloud
>>>>>> node.
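(For reference, the toggle under discussion is Ironic's automated_clean option. A minimal ironic.conf sketch follows, with values illustrative rather than recommended; it also shows how the slow full-disk wipe can be skipped while keeping the fast metadata wipe, which addresses the first reason cleaning was originally disabled:)

```ini
# ironic.conf sketch: enable automated cleaning, but restrict it to the
# fast partition-table/metadata wipe (values illustrative only).
[conductor]
automated_clean = True

[deploy]
# Priority 0 disables the slow full-disk erase step.
erase_devices_priority = 0
# Keep the quick metadata/partition-table wipe enabled.
erase_devices_metadata_priority = 10
```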
>>>>>
>>>>>
>>>>> I'm trending towards -1, for the exact reasons you list as
>>>>> drawbacks. There has been no shortage of users who have
>>>>> ended up with accidentally deleted overclouds. These are usually
>>>>> caused by user error or unintended/unpredictable Heat operations.
>>>>> Until we have a way to guarantee that Heat will never delete a node,
>>>>> or Heat is entirely out of the picture for Ironic provisioning, then
>>>>> I'd prefer that we didn't enable automated cleaning by default.
>>>>>
>>>>> I believe we had done something with policy.json at one time to
>>>>> prevent node delete, but I don't recall if that protected against
>>>>> both user-initiated actions and Heat actions. And even that was not
>>>>> enabled by default.
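(For reference, the policy.json approach would look roughly like this. "baremetal:node:delete" is Ironic's policy target for node deletion, and "!" is oslo.policy's never-match rule; whether this also blocks Heat-driven deletes depends on the credentials Heat uses:)

```json
{
    "baremetal:node:delete": "!"
}
```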
>>>>>
>>>>> IMO, we need to keep "safe" defaults. Even if it means manually
>>>>> documenting that you should clean to prevent the issues you point out
>>>>> above. The alternative is to have no way to recover deleted nodes by
>>>>> default.
>>>>
>>>>
>>>> Well, it's not clear what "safe" means here: protecting people who 
>>>> explicitly delete their stacks, or protecting people who don't 
>>>> realize that a previous deployment may screw up their new one in a 
>>>> subtle way.
>>>
>>> The latter you can recover from, the former you can't if automated
>>> cleaning is true.
> 
> Nor can we recover from 'rm -rf / --no-preserve-root', but it's not a 
> reason to disable the 'rm' command :)
> 
>>>
>>> It's not just about people who explicitly delete their stacks (whether
>>> intentional or not). There could be user error (non-explicit) or
>>> side-effects triggered by Heat that could cause nodes to get deleted.
> 
> If we have problems with Heat, we should fix Heat or stop using it. What 
> you're saying is essentially "we prevent ironic from doing the right 
> thing because we're using a tool that can invoke 'rm -rf /' at a wrong 
> moment."
> 
>>>
>>> You couldn't recover from those scenarios if automated cleaning were
>>> true. Whereas you could always fix a deployment error by opting in to
>>> do an automated clean. Does Ironic keep track of whether a node has
>>> been previously cleaned? Could we add a validation to check whether
>>> any nodes that were not previously cleaned might be used in the
>>> deployment?
> 
> It may be possible to figure out whether a node was ever cleaned. But 
> then we'll force operators to invoke cleaning manually, right? It will 
> work, but that's another step in the default workflow. Are you okay 
> with it?
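(A validation along these lines could be sketched as below. Using driver_internal_info's clean-step bookkeeping as the marker is an assumption on my part: Ironic does not expose a dedicated "was cleaned" flag, so a real implementation would need a more reliable signal:)

```python
def possibly_uncleaned(nodes):
    """Flag available nodes that show no record of a prior clean.

    Each item is a dict shaped like an Ironic node record.  Treating
    the presence of 'clean_steps' in driver_internal_info as evidence
    of cleaning is hypothetical: Ironic keeps clean-step bookkeeping
    there during cleaning, but does not guarantee a persistent marker
    afterwards.
    """
    flagged = []
    for node in nodes:
        info = node.get("driver_internal_info") or {}
        if node.get("provision_state") == "available" and "clean_steps" not in info:
            flagged.append(node.get("name"))
    return flagged
```

Such a check could run as a pre-deployment validation and merely warn, leaving the decision to the operator.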
> 
>>
>> Is there a way to only do cleaning right before a node is deployed?  
>> If you're about to write a new image to the disk then any data there 
>> is forfeit anyway. Since the concern is old data on the disk messing 
>> up subsequent deploys, it doesn't really matter whether you clean it 
>> right after it's deleted or right before it's deployed, but the latter 
>> leaves the data intact for longer in case a mistake was made.
>>
>> If that's not possible then consider this an RFE. :-)
> 
> It's a good idea, but it may cause problems with rebuilding instances. 
> Rebuild is essentially a re-deploy of the OS; users may not expect the 
> whole disk to be wiped.
> 
> Also it's unclear whether we want to write additional features to work 
> around disabled cleaning.

No matter how good the tooling gets, user error will always be a thing. 
Someone will scale down the wrong node or something similar.  I think 
there's value in allowing recovery from mistakes.  We all make them. :-)


