[openstack-dev] [TripleO] a need to assert user ownership in preserved state

Chris Jones cmsj at tenshu.net
Tue Oct 7 11:35:33 UTC 2014


Hi

> On 6 Oct 2014, at 17:41, Clint Byrum <clint at fewbar.com> wrote:
> We have to be _extremely_ careful in how we manage this. I actually think
> it has potential to really blow up in our faces.

Yes, anything we do here has the potential to be extremely ruinous for operators, but the reality is that any existing TripleO deployment is already at pretty severe risk of blowing up, because its UIDs/GIDs can change whenever operators update it.

> We need to give people
> a way to move forward without us merging a patch, and at the same time
> we need to make sure we provide a consistent set of UIDs for anything
> people may want to deploy with diskimage-builder.

IMO the only desirable option *has* to be that we statically define UIDs and GIDs in the elements, because that approach:
 1: Requires no data fragments to be kept safe and fed to subsequent build processes
 2: Doesn't do anything dynamic on first boot that could take hours/days
 3: Can be thoroughly audited at build time to ensure correctness
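To make point 3 a bit more concrete, here is a rough sketch of what a build-time audit could look like. It assumes the built image's /etc/passwd is available as text, and the STATIC_UIDS table (names and numbers alike) is invented for illustration; it is not taken from any real element.

```python
# Hypothetical static ID table; the names and numbers are illustrative only.
STATIC_UIDS = {"nova": 9001, "glance": 9002, "mysql": 9003}


def parse_passwd(text):
    """Return {username: uid} from passwd-format text."""
    users = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) < 4:
            continue  # not a well-formed passwd entry
        users[fields[0]] = int(fields[2])
    return users


def audit_uids(passwd_text, expected):
    """Return a list of problems; an empty list means the image passes."""
    actual = parse_passwd(passwd_text)
    problems = []
    for name, uid in sorted(expected.items()):
        if name not in actual:
            problems.append(f"{name}: missing from image")
        elif actual[name] != uid:
            problems.append(f"{name}: uid {actual[name]}, expected {uid}")
    # Any other account sitting on a reserved UID is also an error.
    reserved = {uid: name for name, uid in expected.items()}
    for name, uid in sorted(actual.items()):
        if uid in reserved and reserved[uid] != name:
            problems.append(f"{name}: uses uid {uid} reserved for {reserved[uid]}")
    return problems
```

A build could fail whenever audit_uids() returns a non-empty list. The same check would apply to /etc/group with a matching GID table.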

As you rightly point out though, any existing deployments will definitely be disrupted by this, but as I said above, all we'd be doing there is moving the needle from "possible/probable" to "definite".

Since the only leftovers we have from their previous image builds are the images themselves, we could add the ability for a DIB run to extract IDs from a previous image. That couldn't be required as a default build option, though, so existing deployments would still be at risk if their operators don't notice the feature.

We could create a script that would spider an existing cloud and extract its ID mappings, to produce a fragment to feed into future builds, but again we're relying on operators to know that they need to do this.
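For what it's worth, the extraction half of either idea is small. The sketch below assumes a previous image (or a node's root filesystem) is already mounted at some path, and it emits a JSON fragment whose layout ("users"/"groups" keys) is purely my invention; DIB has no such input format today as far as this mail is concerned.

```python
import json
import os


def read_ids(path):
    """Map name -> numeric ID from a passwd- or group-format file.

    Both formats keep the name in field 0 and the numeric ID in field 2.
    """
    ids = {}
    with open(path) as handle:
        for line in handle:
            fields = line.strip().split(":")
            if len(fields) >= 3 and fields[2].isdigit():
                ids[fields[0]] = int(fields[2])
    return ids


def extract_fragment(image_root, names=None):
    """Return a JSON fragment of the user and group IDs found under image_root.

    names, if given, limits the output to those account/group names.
    """
    users = read_ids(os.path.join(image_root, "etc", "passwd"))
    groups = read_ids(os.path.join(image_root, "etc", "group"))
    if names is not None:
        users = {n: i for n, i in users.items() if n in names}
        groups = {n: i for n, i in groups.items() if n in names}
    return json.dumps({"users": users, "groups": groups},
                      indent=2, sort_keys=True)
```

The hard part is not this code, it's getting every operator to run it and keep the output safe, which is exactly the objection above.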

Instead, I agree with Greg's view that this is our fault and we should fix it. We didn't think of this sooner, and as a result, our users are at risk. If we don't entirely fix this ourselves, we will be both expecting them to become aware of this issue and expecting them to do additional work to mitigate it.

To that end, I think we should audit all of our elements for use of /mnt/state/ and use the specific knowledge we have of the software they relate to, to build one-time ID migration scripts, which would:
 1: Execute before any related services start
 2: Compare the now-static ID mappings against known files in /mnt/state
 3: chown/chgrp any files/directories that need migrating
 4: Store a flag file in /mnt/state indicating that this process doesn't need to run again
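Steps 2 to 4 above could be sketched roughly as follows. The EXPECTED_OWNERS table, the flag file name and the function names are all hypothetical; a real version would be written per element, using what we know about where each service keeps its data. Step 1 (running before the services start) is a matter of where the script is hooked in and isn't shown.

```python
import grp
import os
import pwd

FLAG_NAME = ".ownership-migrated"  # hypothetical flag file name

# Hypothetical table: path under the state root -> (user, group) that should own it.
EXPECTED_OWNERS = {
    "var/lib/mysql": ("mysql", "mysql"),
    "var/lib/glance": ("glance", "glance"),
}


def lookup_ids(user, group):
    """Resolve names to the (now static) IDs baked into the running image."""
    return pwd.getpwnam(user).pw_uid, grp.getgrnam(group).gr_gid


def tree(top):
    """Yield top and everything beneath it."""
    yield top
    for dirpath, dirnames, filenames in os.walk(top):
        for name in dirnames + filenames:
            yield os.path.join(dirpath, name)


def migrate_ownership(state_root, owners, resolve=lookup_ids, chown=os.lchown):
    """Return the number of paths re-owned, or None if the flag file exists."""
    flag = os.path.join(state_root, FLAG_NAME)
    if os.path.exists(flag):
        return None  # step 4: already migrated on an earlier boot
    changed = 0
    for relative, (user, group) in sorted(owners.items()):
        top = os.path.join(state_root, relative)
        if not os.path.lexists(top):
            continue  # this service has no state on this node
        uid, gid = resolve(user, group)
        for path in tree(top):
            st = os.lstat(path)
            # Step 2: compare current ownership against the static mapping.
            if (st.st_uid, st.st_gid) != (uid, gid):
                chown(path, uid, gid)  # step 3: chown/chgrp in one call
                changed += 1
    with open(flag, "w") as handle:
        handle.write("migrated\n")
    return changed
```

It would be called as migrate_ownership("/mnt/state", EXPECTED_OWNERS). The flag is only written once the whole walk finishes, so an interrupted run simply starts over on the next boot and skips whatever is already correct.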

It does mean they have a potentially painfully long update process once, but the result will be a completely stable, static arrangement that will not require them to preserve precious build fragments for the rest of time. Nor does it require some odd run-time remapping, or any additional mechanisms to centralise user management (e.g. LDAP. Please, no LDAP!).

I think that tying ourselves and our operators into knots because we're afraid of the hit of a one-time data migration is crazy.

AFAICS, the only risk left at that point is elements that other people are maintaining. If we consider that to be a sufficient risk, we can still build the mechanism for injecting ID values from a previous build (essentially just seeding the static values that we'd be setting anyway) and apologise to the users who need that, or who don't discover its existence and break their clouds.

Cheers,

Chris

