[openstack-dev] [grenade] future direction on partial upgrade support

Sean Dague sean at dague.net
Wed Jun 24 18:03:23 UTC 2015


On 06/24/2015 01:41 PM, Russell Bryant wrote:
> On 06/24/2015 01:31 PM, Joe Gordon wrote:
>>
>>
>> On Tue, Jun 16, 2015 at 9:58 AM, Sean Dague <sean at dague.net
>> <mailto:sean at dague.net>> wrote:
>>
>>     Back when Nova first wanted to test partial upgrade, we did a bunch of
>>     slightly odd conditionals inside of grenade and devstack to make it so
>>     that if you were very careful, you could just not stop some of the old
>>     services on a single node, upgrade everything else, and as long as the
>>     old services didn't stop, they'd be running cached code in memory, and
>>     it would look a bit like a 2 node worker not upgraded model. It worked,
>>     but it was weird.
>>
>>     There has been some interest by the Nova team to expand what's not being
>>     touched, as well as the Neutron team to add partial upgrade testing
>>     support. Both are great initiatives, but I think going about it the old
>>     way is going to add a lot of complexity in weird places, and not be as
>>     good of a test as we really want.
>>
>>     Nodepool now supports allocating multiple nodes. We have a multinode job
>>     in Nova regularly testing live migration using this.
>>
>>     If we slice this problem differently, I think we get a better
>>     architecture, a much easier way to add new configs, and a much more
>>     realistic end test.
>>
>>     Conceptually, use devstack-gate multinode support to set up 2 nodes, an
>>     all in one, and a worker. Let grenade upgrade the all in one, leave the
>>     worker alone.
>>
>>     I think the only complexity here is the fact that grenade.sh implicitly
>>     drives stack.sh. Which means one of:
>>
>>     1) devstack-gate could build the worker first, then run grenade.sh
>>
>>     2) we make it so grenade.sh can execute in parts more easily, so it can
>>     hand off to something else that runs stack.sh for it.
>>
>>     3) we make grenade understand the subnode for partial upgrade, so it
>>     will run the stack phase on the subnode itself (given credentials).
>>
>>     This kind of approach means deciding which services you don't want to
>>     upgrade doesn't require devstack changes, it's just a change of the
>>     services on the worker.
>>
>>     We need a volunteer for taking this on, but I think all the follow on
>>     partial upgrade support will be much much easier to do after we have
>>     this kind of mechanism in place.
>>
>>
>> I think this is a great approach for the future of partial upgrade
>> support in grenade. I would like to point out that step 0 here is to get
>> tempest passing consistently in multinode.
>>
>> Currently the neutron job is failing consistently, and nova-network
>> fails roughly 10% of the time due
>> to https://bugs.launchpad.net/nova/+bug/1462305
>> and https://bugs.launchpad.net/nova/+bug/1445569
> 
> If multi-node isn't reliable more generally yet, do you think the
> simpler implementation of partial-upgrade testing could proceed?  I've
> already done all of the patches to do it for Neutron.  That way we could
> quickly get something in place to help block regressions and work on the
> longer-term multinode refactoring without as much time pressure.

The thing is, these partial service bits are sneakier than one realizes
over time. There have been all kinds of edge conditions that crept up on
the n-cpu one that are really subtle because code is running in memory
on stale versions of dependencies which are no longer on disk. And the
number of people that have this model in their head is basically down to
a SPOF.
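
For anyone who doesn't have that model in their head, here is a minimal
sketch of the effect in plain Python. It is not grenade or devstack code;
the module name and the function are made up for illustration. A process
that imported a module before the "upgrade" keeps running the old code,
even after the file on disk is replaced or removed:

```python
import importlib
import os
import sys
import tempfile

# Hypothetical module name, used only for this illustration.
MODULE_NAME = "demo_dependency"


def stale_code_demo():
    """Return the version seen before upgrade, after upgrade, after removal."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, MODULE_NAME + ".py")

        # "Old" release of the dependency on disk.
        with open(path, "w") as f:
            f.write("VERSION = 'old'\n")

        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            # The long-running service imports the dependency once.
            dep = importlib.import_module(MODULE_NAME)
            before = dep.VERSION

            # The upgrade rewrites the file on disk.
            with open(path, "w") as f:
                f.write("VERSION = 'new'\n")

            # A repeated import is served from sys.modules, so the
            # process still sees the old code.
            after_upgrade = importlib.import_module(MODULE_NAME).VERSION

            # Even removing the file does not affect the loaded module.
            os.remove(path)
            after_removal = dep.VERSION
        finally:
            sys.path.remove(tmp)
            sys.modules.pop(MODULE_NAME, None)

    return before, after_upgrade, after_removal


if __name__ == "__main__":
    print(stale_code_demo())
```

The subtle failures come from the other direction: anything the old
process has not imported yet gets loaded lazily from whatever is on disk
now, so old and new code end up mixed in one process.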

The fact that neutron-grenade is at a 40% fail rate right now (and has
been for over a week) is not preventing anyone from just rechecking to
get past it. So I don't think it's safe to assume that additional failing
grenade tests will keep folks from landing bugs.
Making the whole path more complicated for other people to debug is an
explosion waiting to happen.
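
To put rough numbers on the recheck point, here is a back-of-the-envelope
sketch. It assumes each run fails independently at the 40% rate quoted
above, which is an assumption (real gate failures are often correlated),
and the function names are made up for illustration:

```python
def pass_within(fail_rate: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent runs passes."""
    return 1.0 - fail_rate ** attempts


def expected_attempts(fail_rate: float) -> float:
    """Mean number of runs until the first pass (geometric distribution)."""
    return 1.0 / (1.0 - fail_rate)


if __name__ == "__main__":
    for attempts in (1, 2, 3):
        print(attempts, round(pass_within(0.4, attempts), 3))
    print(round(expected_attempts(0.4), 2))
```

Under that assumption a change gets through within three tries about 94%
of the time, and needs fewer than two runs on average, so a flaky job
slows people down far more than it blocks anything.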

So I do want to take a hard line on doing this right, because the debt
here is higher than you might think. The partial code was always very
conceptually fragile, and sometimes fails in really funny ways, because
old is not isolated from new in the way you would expect.

I -1ed the n-net partial upgrade changes for the same reason.

	-Sean

-- 
Sean Dague
http://dague.net


