[openstack-dev] [nova] Order of n-api (placement) and n-sch upgrades for Ocata
Eoghan Glynn
eglynn at redhat.com
Thu Jan 19 17:59:48 UTC 2017
> >> Sylvain and I were talking about how he's going to work placement
> >> microversion requests into his filter scheduler patch [1]. He needs to
> >> make requests to the placement API with microversion 1.4 [2] or later for
> >> resource provider filtering on specific resource classes like VCPU and
> >> MEMORY_MB.
> >>
> >> The question was what happens if microversion 1.4 isn't available in the
> >> placement API, i.e. the nova-scheduler is running Ocata code now but the
> >> placement service is running Newton still.
> >>
> >> Our rolling upgrades doc [3] says:
> >>
> >> "It is safest to start nova-conductor first and nova-api last."
> >>
> >> But since placement is bundled with n-api that would cause issues since
> >> n-sch now depends on the n-api code.
> >>
> >> If you package the placement service separately from the nova-api service
> >> then this is probably not an issue. You can still roll out n-api last and
> >> restart it last (for control services), and just make sure that placement
> >> is upgraded before nova-scheduler (we need to be clear about that in [3]).
> >>
> >> But do we have any other issues if they are not packaged separately? Is it
> >> possible to install the new code, but still only restart the placement
> >> service before nova-api? I believe it is, but want to ask this out loud.
> >>
> >
> > Forgive me as I haven't looked really in depth, but if the api and
> > placement api are both collocated in the same apache instance this is
> > not necessarily the simplest thing to achieve. While, yes it could be
> > achieved it will require more manual intervention of custom upgrade
> > scripts. To me this is not a good idea. My personal preference (now
> > having dealt with multiple N->O nova related acrobatics) is that these
> > types of requirements not be made. We've already run into these
> > assumptions for new installs as well specifically in this newer code.
> > Why can't we turn all the services on and they properly enter a wait
> > state until such conditions are satisfied?
>
> Simply put, because it adds a bunch of conditional, temporary code to
> the Nova codebase as a replacement for well-documented upgrade steps.
>
> Can we do it? Yes. Is it kind of a pain in the ass? Yeah, mostly because
> of the testing requirements.
>
> But meh, I can whip up an amendment to Sylvain's patch that would add
> the self-healing/fallback to legacy behaviour if this is what the
> operator community insists on.

I think Alex is suggesting something different than falling back to the
legacy behaviour. The Ocata scheduler would still roll forward to basing
its node selection decisions on data provided by the placement API, but
would be tolerant of the 3 different transient cases that are problematic:

 1. placement API momentarily not running yet
 2. placement API already running, but still on the Newton microversion
 3. placement API already running Ocata code, but not yet warmed up

IIUC Alex is suggesting that the nova services themselves are tolerant
of those transient conditions during the upgrade, rather than requiring
multiple upgrade toolings to independently force the new ordering
constraint.
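
To make that concrete, here is a rough sketch of what scheduler-side
tolerance could look like. It is not nova code: get_providers_tolerantly,
PlacementNotReady and the injected fetch callable are all hypothetical
names, and the status codes are assumptions (400 is what Jay mentions for
an unknown GET parameter, 406 is my guess for an unsupported microversion,
and 503 is the warm-up signal suggested for case #3).

```python
import time

# Hypothetical sketch, not nova code. The status codes are assumptions:
# 400/406 stand in for "placement is still on Newton" (case 2) and
# 503 for "placement is up but not warmed up yet" (case 3).
RETRYABLE_STATUSES = {400, 406, 503}


class PlacementNotReady(Exception):
    """Raised when placement never became usable within the retry budget."""


def get_providers_tolerantly(fetch, attempts=5, delay=1.0, sleep=time.sleep):
    """Call fetch() until it yields a 200 response or attempts run out.

    fetch is any callable returning a (status_code, body) tuple; it may
    raise ConnectionError when placement is not running at all (case 1).
    """
    last_reason = "no attempts made"
    for attempt in range(attempts):
        try:
            status, body = fetch()
        except ConnectionError:
            last_reason = "placement not reachable"
        else:
            if status == 200:
                return body
            if status not in RETRYABLE_STATUSES:
                # Anything else is treated as a real error, not a transient.
                raise PlacementNotReady("unexpected status %d" % status)
            last_reason = "placement returned %d" % status
        # Wait before the next try, but not after the final one.
        if attempt < attempts - 1:
            sleep(delay)
    raise PlacementNotReady(last_reason)
```

The point of injecting fetch and sleep is only to keep the sketch testable
without a running placement service; a real implementation would need to
decide how long the scheduler may block before failing a request.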
On my superficial understanding, case #3 would require a freshly
deployed ocata placement (i.e. when upgraded from a placement-less
newton deployment) to detect that it's being run for the first time
(i.e. no providers reported yet) and return, say, 503s to the scheduler
queries until enough time has passed for all computes to have reported
in their inventories & allocations.
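
As a sketch of the placement-side half of case #3, something like the
following could gate scheduler queries during warm-up. Again this is purely
illustrative: WarmupGate, warmup_seconds and the injected clock are
hypothetical, and "enough time" is reduced to a fixed grace period, which
is an assumption rather than anything placement does today.

```python
import time


class WarmupGate:
    """Hypothetical warm-up gate for a freshly deployed placement service.

    If the first query finds no resource providers, treat this as a first
    run and answer 503 until warmup_seconds have elapsed, giving computes
    time to report their inventories and allocations.
    """

    def __init__(self, warmup_seconds, clock=time.monotonic):
        self._warmup_seconds = warmup_seconds
        self._clock = clock
        self._first_run_seen_at = None  # set once an empty service is seen

    def status(self, provider_count):
        """Return the HTTP status to use for a scheduler query."""
        now = self._clock()
        if self._first_run_seen_at is None:
            if provider_count > 0:
                # Providers already exist, so this is not a first run.
                return 200
            self._first_run_seen_at = now
        if now - self._first_run_seen_at < self._warmup_seconds:
            return 503
        return 200
```

Note the gate latches on the first empty observation, so a single early
compute reporting in does not end the warm-up prematurely.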
Cheers,
Eoghan
> I think Matt generally has been in the "push forward" camp because we're
> tired of delaying improvements to Nova because of some terror that we
> may cause some deployer somewhere to restart their controller services
> in a particular order in order to minimize any downtime of the control
> plane.
>
> For the distributed compute nodes, I totally understand the need to
> tolerate long rolling upgrade windows. For controller nodes/services,
> what we're talking about here is adding code into Nova scheduler to deal
> with what in 99% of cases will be something that isn't even noticed
> because the upgrade tooling will be restarting all these nodes at almost
> the same time and the momentary failures that might be logged on the
> scheduler (400s returned from the placement API due to using an unknown
> parameter in a GET request) will only exist for a second or two as the
> upgrade completes.
>
> So, yeah, a lot of work and testing for very little real-world benefit,
> which is why a number of us just want to move forward...
>
> Best,
> -jay