[openstack-dev] [cinder] [nova] os-brick privsep failures and an upgrade strategy?
Walter A. Boring IV
walter.boring at hpe.com
Tue Jun 14 17:31:03 UTC 2016
I just put up a WIP patch in os-brick that tests to see if os-privsep is
configured with the helper_command. If it's not, then os-brick falls back to
executing commands with the root_helper and run_as_root kwargs passed in.
If you can check this out, that would be helpful. If this is the route
we want to go, then I'll add unit tests, take it out of WIP, and try to
get it in.
So, if nova.conf and cinder.conf aren't updated with the privsep_osbrick
section providing the helper_command, then os_brick will assume local
execution with the configured root_helper passed in.
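To make the fallback concrete, here is a rough sketch of the decision only. It is not the code in the WIP patch: choose_exec_mode is a hypothetical name, and it reads the config with Python's configparser rather than oslo.config, purely for illustration.

```python
import configparser


def choose_exec_mode(conf_text, root_helper):
    """Pick privsep or the rootwrap fallback from a service config.

    Hypothetical helper for illustration only; os-brick itself reads
    its options through oslo.config, not configparser.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    # A missing section or a missing option both yield None here.
    helper = parser.get("privsep_osbrick", "helper_command", fallback=None)
    if helper and helper.strip():
        # The admin has added the new section: use privsep.
        return ("privsep", helper.strip())
    # N-1 (mitaka) config: keep using root_helper / run_as_root.
    return ("rootwrap", root_helper)


mitaka_conf = "[DEFAULT]\nrootwrap_config = /etc/nova/rootwrap.conf\n"
print(choose_exec_mode(mitaka_conf, "sudo nova-rootwrap /etc/nova/rootwrap.conf"))
```

With an unmodified mitaka-style config this selects the rootwrap path, which is the behaviour the grenade upgrade jobs would exercise.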
This should be backwards compatible (grenade upgrade tests). But we
need admins to add that section to their nova.conf and cinder.conf files.
The other downside to this is that if we have to keep this code in place,
then we effectively still have to keep the rootwrap filters in place
and up to date. *sadness*
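For reference, the section in question would look roughly like the fragment below in nova.conf. The paths are the usual defaults and are an assumption here; they may differ per deployment. cinder.conf would be analogous, with cinder-rootwrap and the /etc/cinder paths.

```ini
# Assumed default paths; adjust for the deployment.
[privsep_osbrick]
helper_command = sudo nova-rootwrap /etc/nova/rootwrap.conf privsep-helper --config-file /etc/nova/nova.conf
```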
On 06/14/2016 04:49 AM, Sean Dague wrote:
> os-brick 1.4 was released over the weekend, and was the first os-brick
> to include privsep. We got a really odd failure rate in the
> grenade-multinode jobs (1/3 - 1/2) afterwards, and it was far from
> obvious why. Hemma looks to have figured it out (this is a summary of
> what I've seen on IRC to pull it all together).
> Remembering the following -
> https://github.com/openstack-dev/grenade#theory-of-upgrade and
> - New code must work with N-1 configs. So this is `master` running with
> `mitaka` configuration.
> privsep requires a sudo rule or rootwrap rule (to get to sudo) to allow
> the privsep daemon to be spawned for volume actions.
> During gate testing we have a blanket sudoer rule for the stack user
> during the run of grenade.sh. It has to do system level modifications
> broadly to perform the upgrade. This sudoer rule is deleted at the end
> of the grenade.sh run before Tempest tests are run, so that Tempest
> tests don't accidentally require root privs on their target environment.
> Grenade *also* makes sure that some resources live across the upgrade
> boundary. This includes a boot from volume guest, which is torn down
> before testing starts. And this is where things get interesting.
> This means there is a volume teardown needed before grenade ends. But
> there is only one. In single node grenade this happens about 30 seconds
> from the end of the script, triggers the privsep daemon start, and then
> we're done. And the 50_stack_sh sudoers file is removed. In multinode,
> *if* the boot from volume server is on the upgrade node, then the same
> thing happens. *However*, if it instead ended up on the subnode, which
> is not upgraded, then the volume teardown is on the old node. No
> os-brick calls are made on the upgraded node before grenade finishes.
> The 50_stack_sh sudoers file is removed, as expected.
> And now all volume tests on those nodes fail.
> Which is what should happen. The point is that in production no one is
> going to put a blanket sudoers rule like that in place. It's just we
> needed it for this activity, and the userid on the services being the
> same as the shell user (which is not root) let this fallback rule be used.
> The crux of the problem is that os-brick 1.4 and privsep can't be used
> without a config file change during the upgrade. Which violates our
> policy, because it breaks rolling upgrades.
> So... we have a few options:
> 1) make an exception here with release notes, because it's the only way
> to move forward.
> 2) have some way for os-brick to use either mode for a transition period
> (depending on whether privsep is configured to work)
> 3) Something else.... ?
> https://bugs.launchpad.net/os-brick/+bug/1592043 is the bug we've got on
> this. We should probably sort out the path forward here on the ML as
> there are a bunch of folks in a bunch of different time zones that have
> important perspectives here.