Stephen Finucane <sfinucan@redhat.com> wrote on Monday, June 17, 2019 at 8:47 PM:
On Mon, 2019-06-17 at 17:47 +0800, Alex Xu wrote:
> Sean Mooney <smooney@redhat.com> wrote on Monday, June 17, 2019 at 5:19 PM:
> > On Mon, 2019-06-17 at 16:45 +0800, Alex Xu wrote:
> > > I think we should have a recommended upgrade flow. If we give the
> > > operator the flexibility to use many combinations of values for
> > > vcpu_pin_set, dedicated_cpu_set and shared_cpu_set, then we run
> > > into the trouble described in this email and also have to do all
> > > the checks this email introduced.
> >
> > We modified the spec intentionally to make upgrading simple.
> > I don't believe the concerns raised in the initial 2 emails are
> > valid if we follow what was detailed in the spec.
> > We did take some steps to restrict what values you can set.
> > For example, dedicated_cpu_set cannot be set if vcpu_pin_set is set.
> > Technically, I believe we relaxed that to say we would ignore
> > vcpu_pin_set in that case, but originally I was pushing for it to
> > be a hard error.
> >
> > > I'm thinking that the pre-request filter (which translates
> > > cpu_policy=dedicated into a PCPU request) should be enabled after
> > > all the nodes are upgraded to the Train release. Before that, all
> > > cpu_policy=dedicated instances still use VCPU.
> >
> > It should be enabled after all nodes are upgraded, but not
> > necessarily before all compute nodes are updated to use
> > dedicated_cpu_set.
>
> If we enable the pre-request filter in the middle of the upgrade, we
> will hit the problem Bhagyashri described. As I understand it,
> reporting PCPU and VCPU at the same time doesn't resolve that concern.
>
> For example, we have 100 nodes for dedicated hosts in the cluster.
>
> The operator begins to upgrade the cluster. The control plane is
> upgraded first, and the pre-request filter is enabled.
> For a rolling upgrade, the operator begins by upgrading 10 nodes. Then
> only those 10 nodes report PCPU and VCPU at the same time.
> But any new request with the dedicated CPU policy begins to request
> PCPU, so all of those new instances can only go to those 10 nodes.
> Also, if existing instances are resized, evacuated, or
> shelved/unshelved, they go to those 10 nodes as well. That makes me
> nervous about the capacity at that time.

The exact same issue can happen the other way around. As an operator
slowly starts upgrading, by setting the necessary configuration
options, the compute nodes will reduce the VCPU inventory they report
and start reporting PCPU inventory. Using the above example, if we
upgraded 90 of the 100 compute nodes and didn't enable the prefilter,
we would only be able to schedule to one of the remaining 10 nodes.
This doesn't seem any better.
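
To make the arithmetic in the two scenarios concrete, here is a minimal
Python sketch. It is only an illustration, not nova code: the function
name is hypothetical, and it assumes the simplified model from the
paragraph above, where an upgraded and reconfigured node offers pinned
capacity only as PCPU and a non-upgraded node offers it only as VCPU.

```python
def schedulable_nodes(total_nodes, upgraded_nodes, prefilter_enabled):
    """Count the nodes that can accept a new pinned instance.

    Assumption (simplified model, not nova's real behavior): upgraded
    nodes expose pinned capacity as PCPU only, non-upgraded nodes as
    VCPU only.
    """
    if prefilter_enabled:
        # Pinned requests ask for PCPU, so only upgraded nodes match.
        return upgraded_nodes
    # Pinned requests still ask for VCPU, so only old nodes match.
    return total_nodes - upgraded_nodes


# Prefilter on, 10 of 100 nodes upgraded.
early = schedulable_nodes(100, 10, prefilter_enabled=True)
# Prefilter off, 90 of 100 nodes upgraded.
late = schedulable_nodes(100, 90, prefilter_enabled=False)
print(early, late)
```

Under that assumption both cases leave only 10 candidate nodes, which
is why neither ordering avoids the capacity squeeze on its own.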

At some point we're going to need to make a clean break from pinned
instances consuming VCPU resources to them using PCPU resources. When
that happens is up to us. I figured it was easiest to do this as soon
as the controllers were updated because I had assumed compute nodes
would be updated pretty soon after the controllers and therefore there
would only be a short window where instances would start requesting
PCPU resources but there wouldn't be any available. Maybe that doesn't
make sense though. If not, I guess we need to make this configurable.

I propose that as soon as compute nodes are upgraded then they will all
start reporting PCPU inventory, as noted in the spec. However, the
prefilter will initially be disabled and we will not reshape existing
inventories. This means pinned instances will continue consuming VCPU
resources as before but that is not an issue since this is the behavior
we currently have. Once the operator is happy that all of the compute
nodes have been upgraded, or at least enough that they care about, we
will then need some way for us to switch on the prefilter and reshape
existing instances. Perhaps this would require manual configuration
changes, validated by an upgrade check, or perhaps we could add a
workaround config option?

In any case, at some point we need to have a switch from "use VCPUs for
pinned instances" to "use PCPUs for pinned instances".
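
As a rough sketch of what such a switch could look like, the Python
below gates the translation on a single boolean. The function and the
use_pcpu_for_pinned flag are hypothetical names made up for this
example; they are not an existing nova option or API. The only real
identifier is the hw:cpu_policy flavor extra spec.

```python
def requested_resources(extra_specs, vcpus, use_pcpu_for_pinned):
    """Return the resource classes a flavor would request.

    use_pcpu_for_pinned is a hypothetical operator-controlled switch:
    False keeps the pre-upgrade behavior (pinned instances use VCPU),
    True turns the translation to PCPU on.
    """
    pinned = extra_specs.get("hw:cpu_policy") == "dedicated"
    if pinned and use_pcpu_for_pinned:
        return {"PCPU": vcpus}
    return {"VCPU": vcpus}


pinned_flavor = {"hw:cpu_policy": "dedicated"}
before = requested_resources(pinned_flavor, 4, use_pcpu_for_pinned=False)
after = requested_resources(pinned_flavor, 4, use_pcpu_for_pinned=True)
print(before, after)
```

Unpinned flavors request VCPU regardless of the switch, so flipping it
only affects pinned instances.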

Fully agreed, we are talking about the same thing. These are the upgrade
steps I wrote below. Either the spec doesn't describe those steps
clearly, or I missed something.


Stephen

> > > Trying to imagine the upgrade as below:
> > >
> > > 1. Rolling upgrade the compute node.
> > > 2. The upgraded compute node begins to report both VCPU and PCPU,
> > > but does not reshape the existing inventories.
> > >      The upgraded node either still uses the vcpu_pin_set config or
> > > has vcpu_pin_set unset. In both cases it reports VCPU and PCPU
> > > at the same time, and requests with cpu_policy=dedicated still
> > > use VCPU.
> > > Then it works the same as in the Stein release, and existing
> > > instances can be shelved/unshelved, migrated and evacuated.
> >
> > +1
> >
> > > 3. Disable new requests and instance operations on the hosts for
> > > dedicated instances. (Is this kind of breaking our live upgrade?
> > > I thought this would be a short interruption for the control
> > > plane, if that is available.)
> >
> > I'm not sure why we need to do this, unless you are thinking this
> > will be done by a CLI, e.g. nova-manage?
>
> Existing instances still consume VCPU inventory. Since PCPU and VCPU
> are reported at the same time, the resources are effectively
> counted twice. If we begin to consume PCPU as well, we will end up
> over-consuming the host's resources.
>
> Yes, the disabling is done by CLI, probably by disabling the
> service.
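
The over-consumption concern quoted above can be shown with a small
Python sketch. The Host class is a toy model invented for this example,
not placement's data model; it assumes a host that reports every core
as both VCPU and PCPU, with no allocation ratio, and checks each
resource class independently.

```python
class Host:
    """Toy host that reports each core as both VCPU and PCPU."""

    def __init__(self, cores):
        self.cores = cores
        # Assumption: the same physical cores are reported twice.
        self.inventory = {"VCPU": cores, "PCPU": cores}
        self.usage = {"VCPU": 0, "PCPU": 0}

    def claim(self, resource_class, amount):
        """Claim capacity, checking only the given resource class."""
        used = self.usage[resource_class]
        if used + amount > self.inventory[resource_class]:
            return False
        self.usage[resource_class] = used + amount
        return True

    def overcommitted(self):
        """True if pinned usage exceeds the physical core count."""
        return sum(self.usage.values()) > self.cores


host = Host(cores=8)
old_ok = host.claim("VCPU", 6)  # existing pinned instances, pre-reshape
new_ok = host.claim("PCPU", 6)  # new pinned requests after the switch
print(old_ok, new_ok, host.overcommitted())
```

Both claims succeed because each class still has room, yet 12 pinned
CPUs now sit on 8 cores. That is the gap the reshape of existing
allocations has to close before new PCPU requests are admitted.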

> > > 4. Reshape the inventories for existing instances on all the
> > > hosts.
> >
> > Should this not happen when the agent starts up?
> >
> > > 5. Re-enable new requests and operations for instances, and also
> > > enable the pre-request filter.
> > > 6. Operator copies the value of vcpu_pin_set to
> > > dedicated_cpu_set.
> >
> > vcpu_pin_set is not the set of CPUs used for pinning. The operators
> > should set dedicated_cpu_set and shared_cpu_set appropriately at
> > this point, but in general they probably won't just copy it, as on
> > hosts that used vcpu_pin_set but were not used for pinned instances
> > the value will be copied to shared_cpu_set.
>
> Yes, I should say this upgrade flow is for the hosts running dedicated
> instances. Hosts that only run floating instances don't have
> these problems.

> > > If vcpu_pin_set isn't set, the value of dedicated_cpu_set should
> > > be all the CPU IDs, excluding those in shared_cpu_set if it is
> > > set.
> > >
> > > Two rules here:
> > > 1. The operator isn't allowed to set dedicated_cpu_set to a value
> > > different from vcpu_pin_set when any instance is running on the
> > > host.
> > > 2. The operator isn't allowed to change the value of
> > > dedicated_cpu_set and shared_cpu_set when any instance is running
> > > on the host.
> >
> > Neither of these rules can be enforced. One of the requirements that
> > Dan Smith had for edge computing is that we need to support upgrades
> > with instances in place.
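
For reference, the derivation proposed earlier in the quoted text for
hosts running dedicated instances (copy vcpu_pin_set, or fall back to
all CPU IDs minus shared_cpu_set) could be sketched in Python as
follows. The helper names are hypothetical, and the parser is a
simplified assumption that only handles comma-separated IDs and
ranges such as "0-3,8"; it is not nova's own parsing code.

```python
def parse_cpu_spec(spec):
    """Parse a simplified CPU list such as "0-3,8" into a set of ints."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-")
            cpus.update(range(int(start), int(end) + 1))
        else:
            cpus.add(int(part))
    return cpus


def proposed_dedicated_cpus(host_cpus, vcpu_pin_set=None, shared_cpu_set=None):
    """Apply the proposed rule for hosts running dedicated instances."""
    if vcpu_pin_set:
        # Step 6 of the proposed flow: copy vcpu_pin_set as-is.
        return parse_cpu_spec(vcpu_pin_set)
    # Otherwise: every host CPU except those in shared_cpu_set.
    shared = parse_cpu_spec(shared_cpu_set) if shared_cpu_set else set()
    return set(host_cpus) - shared


copied = proposed_dedicated_cpus(range(8), vcpu_pin_set="4-7")
derived = proposed_dedicated_cpus(range(8), shared_cpu_set="0-1")
print(sorted(copied), sorted(derived))
```

As noted in the reply above, this only makes sense for hosts that were
actually used for pinned instances; on other hosts the vcpu_pin_set
value belongs in shared_cpu_set instead.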