[openstack-dev] [grenade] future direction on partial upgrade support
Joe Gordon
joe.gordon0 at gmail.com
Fri Jun 26 17:15:40 UTC 2015
On Wed, Jun 24, 2015 at 11:44 AM, Joe Gordon <joe.gordon0 at gmail.com> wrote:
>
>
> On Wed, Jun 24, 2015 at 11:03 AM, Sean Dague <sean at dague.net> wrote:
>
>> On 06/24/2015 01:41 PM, Russell Bryant wrote:
>> > On 06/24/2015 01:31 PM, Joe Gordon wrote:
>> >>
>> >>
>> >> On Tue, Jun 16, 2015 at 9:58 AM, Sean Dague <sean at dague.net> wrote:
>> >>
>> >> Back when Nova first wanted to test partial upgrade, we did a bunch of
>> >> slightly odd conditionals inside of grenade and devstack to make it so
>> >> that if you were very careful, you could just not stop some of the old
>> >> services on a single node, upgrade everything else, and as long as the
>> >> old services didn't stop, they'd be running cached code in memory, and
>> >> it would look a bit like a 2 node worker not upgraded model. It worked,
>> >> but it was weird.
>> >>
>> >> There has been some interest by the Nova team to expand what's not being
>> >> touched, as well as the Neutron team to add partial upgrade testing
>> >> support. Both are great initiatives, but I think going about it the old
>> >> way is going to add a lot of complexity in weird places, and not be as
>> >> good of a test as we really want.
>> >>
>> >> Nodepool now supports allocating multiple nodes. We have a multinode job
>> >> in Nova regularly testing live migration using this.
>> >>
>> >> If we slice this problem differently, I think we get a better
>> >> architecture, a much easier way to add new configs, and a much more
>> >> realistic end test.
>> >>
>> >> Conceptually, use devstack-gate multinode support to set up 2 nodes, an
>> >> all in one, and a worker. Let grenade upgrade the all in one, leave the
>> >> worker alone.
>> >>
>> >> I think the only complexity here is the fact that grenade.sh implicitly
>> >> drives stack.sh. Which means one of:
>> >>
>> >> 1) devstack-gate could build the worker first, then run grenade.sh
>> >>
>> >> 2) we make it so grenade.sh can execute in parts more easily, so it can
>> >> hand something else running stack.sh for it.
>> >>
>> >> 3) we make grenade understand the subnode for partial upgrade, so it
>> >> will run the stack phase on the subnode itself (given credentials).
>> >>
>> >> This kind of approach means deciding which services you don't want to
>> >> upgrade doesn't require devstack changes, it's just a change of the
>> >> services on the worker.
>> >>
>> >> We need a volunteer for taking this on, but I think all the follow on
>> >> partial upgrade support will be much much easier to do after we have
>> >> this kind of mechanism in place.
>> >>
>> >>
>> >> I think this is a great approach for the future of partial upgrade
>> >> support in grenade. I would like to point out that step 0 here is to get
>> >> tempest passing consistently in multinode.
>> >>
>> >> Currently the neutron job is failing consistently, and nova-network
>> >> fails roughly 10% of the time due
>> >> to https://bugs.launchpad.net/nova/+bug/1462305
>> >> and https://bugs.launchpad.net/nova/+bug/1445569
>> >
>> > If multi-node isn't reliable more generally yet, do you think the
>> > simpler implementation of partial-upgrade testing could proceed? I've
>> > already done all of the patches to do it for Neutron. That way we could
>> > quickly get something in place to help block regressions and work on the
>> > longer-term multinode refactoring without as much time pressure.
>>
>> The thing is, these partial service bits are sneakier than one realizes
>> over time. There have been all kinds of edge conditions that crept up on
>> the n-cpu one that are really subtle because code is running in memory
>> on stale versions of dependencies which are no longer on disk. And the
>> number of people that have this model in their head is basically down to
>> a SPOF.
>>
>
> I agree. As the author of the current multinode job, I can say it is
> definitely an ugly hack (but one that has worked surprisingly well until now).
>
>
>>
>> The fact that neutron-grenade is at a 40% fail rate right now (and has
>> been for over a week) is not preventing anyone from just rechecking to
>> get past it. So I think assuming additional failing grenade tests are
>> going to keep folks from landing bugs is probably not a good assumption.
>> Making the whole path more complicated for other people to debug is an
>> explosion waiting to happen.
>>
>> So I do want to take a hard line on doing this right, because the debt
>> here is higher than you might think. The partial code was always very
>> conceptually fragile, and fails in really funny ways sometimes, because
>> of the fact that old is not isolated from new in a way that would be
>> expected.
>>
>
> Assuming the smoke jobs work, I don't think making grenade do multinode
> should take very long. In which case we get a much more realistic upgrade
> situation.
>
>
Good news, it looks like both smoke jobs are working (ignoring failures
from https://review.openstack.org/#/c/195748/).
>
>> I -1ed the n-net partial upgrade changes for the same reason.
>>
>> -Sean
>>
>> --
>> Sean Dague
>> http://dague.net
>>
>
>