More upgrade issues with PCPUs - input wanted

Stephen Finucane sfinucan at redhat.com
Fri Aug 16 09:58:50 UTC 2019


On Fri, 2019-08-16 at 12:09 +0800, Alex Xu wrote:
> Stephen Finucane <sfinucan at redhat.com> wrote on Thu, Aug 15, 2019 at 8:25 PM:
> > tl;dr: Is breaking booting of pinned instances on Stein compute nodes
> > in a Train deployment an acceptable thing to do, and if not, how do we
> > best handle the VCPU->PCPU migration in Train?
> > 
> > I've been working through the cpu-resources spec [1] and have run into
> > a tricky issue I'd like some input on. In short, this spec means that
> > pinned instances (i.e. 'hw:cpu_policy=dedicated') will now start
> > consuming a new resource type, PCPU, instead of VCPU. Many things need
> > to change to make this happen but the key changes are:
> > 
> >    1. The scheduler needs to start modifying requests for pinned
> >       instances to request PCPU resources instead of VCPU resources
> >       (a rough sketch follows this list)
> >    2. The libvirt driver needs to start reporting PCPU resources
> >    3. The libvirt driver needs to do a reshape, moving all existing
> >       allocations of VCPUs to PCPUs, if the instance holding that
> >       allocation is pinned
> > 
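> > To make (1) concrete, this is roughly the shape of the translation
> > involved; a minimal sketch, assuming a plain dict of resource amounts
> > rather than the real RequestSpec plumbing:
> > 
> >     # Sketch: translate a pinned instance's VCPU request into a PCPU
> >     # request. Names are illustrative, not the actual patch.
> >     def translate_pinned_request(extra_specs, resources):
> >         """Map the VCPU amount to PCPU for pinned flavors.
> > 
> >         :param extra_specs: flavor extra specs, e.g.
> >             {'hw:cpu_policy': 'dedicated'}
> >         :param resources: resource class -> amount, e.g. {'VCPU': 4}
> >         """
> >         if extra_specs.get('hw:cpu_policy') == 'dedicated':
> >             resources['PCPU'] = resources.pop('VCPU')
> >         return resources
> > 
> >     # {'VCPU': 4} becomes {'PCPU': 4} for a pinned flavor
> > 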
> > The first two of these steps present issues for which we have
> > solutions, but the solutions we've chosen are now resulting in this
> > new issue.
> > 
> >  * For (1), the translation of VCPU to PCPU in the scheduler means
> >    compute nodes must now report PCPU in order for a pinned instance
> >    to land on that host. Since controllers are upgraded before compute
> >    nodes and all compute nodes aren't necessarily upgraded in one go
> >    (particularly for edge or other large or multi-cell deployments),
> >    this can mean there will be a period of time where there are very
> >    few or no hosts available on which to schedule pinned instances.
> > 
> >  * For (2), we're hampered by the fact that there is no clear way to
> >    determine if a host is used for pinned instances or not. Because of
> >    this, we can't determine if a host should be reporting PCPU or VCPU
> >    inventory.
> > 
> > The solution we have for the issue with (1) is to add a workaround
> > option that would disable this translation, allowing operators time to
> > upgrade all their compute nodes to report PCPU resources before
> > anything starts using them. For (2), we've decided to temporarily
> > (i.e. for one release or until configuration is updated) report both,
> > in the expectation that everyone using pinned instances has followed
> > the long-standing advice to separate hosts intended for pinned
> > instances from those intended for unpinned instances using host
> > aggregates (e.g. even if we started reporting PCPUs on a host, nothing
> > would consume that due to 'pinned=False' aggregate metadata or
> > similar). These actually benefit each other, since if instances are
> > still consuming VCPUs then the hosts need to continue reporting VCPUs.
> > However, both interfere with our ability to do the reshape.
> > 
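> > As a rough sketch of the "report both" transition, assuming the usual
> > update_provider_tree() flow and the standard placement inventory
> > fields:
> > 
> >     # Sketch: during the transition, report the host's CPUs as both
> >     # VCPU and PCPU inventory. Nothing should consume the unused side
> >     # if hosts are correctly partitioned via aggregates.
> >     def get_transition_cpu_inventory(total_cpus):
> >         base = {
> >             'total': total_cpus,
> >             'min_unit': 1,
> >             'max_unit': total_cpus,
> >             'step_size': 1,
> >             'reserved': 0,
> >         }
> >         return {
> >             'VCPU': dict(base, allocation_ratio=16.0),
> >             # Pinned CPUs can't be overcommitted
> >             'PCPU': dict(base, allocation_ratio=1.0),
> >         }
> > 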
> > Normally, a reshape is a one-time thing. The way we'd planned to
> > determine if a reshape was necessary was to check if PCPU inventory
> > was registered against the host and, if not, whether there were any
> > pinned instances on the host. If PCPU inventory was not available and
> > there were pinned instances, we would update the allocations for these
> > instances so that they would be consuming PCPUs instead of VCPUs and
> > then update the inventory. This is problematic though, because our
> > solution for the issue with (1) means pinned instances can continue to
> > request VCPU resources, which in turn means we could end up with some
> > pinned instances on a host consuming PCPU and others consuming VCPU.
> > That obviously can't happen, so we need to change tack slightly. The
> > two obvious solutions would be to either (a) remove the workaround
> > option so the scheduler would immediately start requesting PCPUs and
> > just advise operators to upgrade their hosts for pinned instances asap
> > or (b) add a different option, defaulting to True, that would apply to
> > both the scheduler and compute nodes and prevent not only translation
> > of flavors in the scheduler but also the reporting of PCPUs and
> > reshaping of allocations until disabled.
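> > 
> > For reference, the planned heuristic and the reshape itself amount to
> > something like the following; a heavily simplified sketch (the real
> > code would build a single reshape request against placement):
> > 
> >     # Sketch: decide whether a reshape is needed, then move pinned
> >     # instances' VCPU allocations to PCPU.
> >     def needs_reshape(inventory, pinned_instances):
> >         # Planned heuristic: no PCPU inventory yet, but pinned
> >         # instances exist. Dual reporting breaks this, since PCPU
> >         # inventory appears before any allocations have moved.
> >         return 'PCPU' not in inventory and bool(pinned_instances)
> > 
> >     def reshape_pinned_allocations(allocations, pinned_instances):
> >         """:param allocations: consumer (instance) UUID -> dict of
> >             resource class -> amount, e.g. {'VCPU': 4, 'MEMORY_MB': 2048}
> >         :param pinned_instances: UUIDs of instances with
> >             hw:cpu_policy=dedicated
> >         """
> >         for uuid, resources in allocations.items():
> >             if uuid in pinned_instances and 'VCPU' in resources:
> >                 resources['PCPU'] = resources.pop('VCPU')
> >         return allocations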
> > 
> 
> The steps I'm thinking of are:
> 
> 1. Upgrade the control plane; disable requesting PCPU, so requests
> still go to VCPU.
> 2. Rolling upgrade of the compute nodes; compute nodes begin to report
> both PCPU and VCPU, but requests are still counted against VCPU.
> 3. Enable the PCPU request; new requests now request PCPU.
>        At this point, some instances are using VCPU and some instances
> are using PCPU on the same node, and the sum of the VCPU + PCPU
> inventory is double the available CPU resources (see the worked
> example after this list). The NUMATopologyFilter is responsible for
> stopping over-consumption of the total number of CPUs.
> 4. Rolling update of the compute nodes' configuration to use
> cpu_dedicated_set, which triggers the reshape of existing VCPU
> consumption to PCPU consumption.
>      New requests go to PCPU from step 3, so there are no more VCPU
> requests at this point. Rolling upgrade of the nodes gets rid of the
> existing VCPU consumption.
> 5. Done.
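> 
> To make the step 3 state concrete (numbers illustrative):
> 
>     total_cpus = 32                    # physical cores on the host
>     inventory = {'VCPU': total_cpus, 'PCPU': total_cpus}
>     # Placement now sees 64 "CPUs" for 32 real cores, so the host-side
>     # check in the NUMATopologyFilter (usage numbers made up here) has
>     # to enforce:
>     used_vcpu, used_pcpu = 20, 12      # hypothetical current usage
>     assert used_vcpu + used_pcpu <= total_cpus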

This had been my initial plan. The issue is that by reporting both PCPU
and VCPU in (2), our compute node's resource provider will now have
PCPU inventory available (though it won't be used). This is problematic
since "does this resource provider have PCPU inventory" is one of the
questions I need to ask to determine if I should do a reshape. If I
can't rely on this heuristic, I need to start querying for allocation
information (so I can ask "does this resource provider have PCPU
*allocations*") every time I start a compute node. I'm guessing this is
expensive, since we don't do it by default.
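
For the curious, the check I'd be forced into is something like this;
a sketch only, assuming a keystoneauth session against the placement
service:

    # Sketch: ask placement whether any consumer on this provider holds
    # PCPU allocations. The response includes *every* allocation against
    # the provider, which is why doing this on each startup isn't free.
    def has_pcpu_allocations(session, rp_uuid):
        resp = session.get(
            '/resource_providers/%s/allocations' % rp_uuid,
            endpoint_filter={'service_type': 'placement'})
        allocations = resp.json()['allocations']
        return any('PCPU' in alloc['resources']
                   for alloc in allocations.values())
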
Stephen

> > I'm currently leaning towards (a) because it's a *lot* simpler, far
> > more robust (IMO) and lets us finish this effort in a single cycle,
> > but I imagine this could make upgrades very painful for operators if
> > they can't fast-track their compute node upgrades. (b) is more complex
> > and would have some constraints, chief among them being that the
> > option would have to be disabled at some point post-release and would
> > have to be disabled on the scheduler first (to prevent the mishmash of
> > VCPU and PCPU resource allocations noted above). It also means this
> > becomes a three-cycle effort at minimum, since this new option will
> > default to True in Train, before defaulting to False and being
> > deprecated in U and finally being removed in V. As such, I'd like some
> > input, particularly from operators using pinned instances in larger
> > deployments. What are your thoughts, and are there any potential
> > solutions that I'm missing here?
> > 
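> > For what it's worth, (b) would amount to a new option along these
> > lines (the name and group are entirely provisional):
> > 
> >     from oslo_config import cfg
> > 
> >     # Provisional: defaults to True in Train, False in U, removed in V.
> >     opts = [
> >         cfg.BoolOpt(
> >             'disable_pcpu_handling',
> >             default=True,
> >             help='When set, the scheduler will not translate pinned '
> >                  'instances\' VCPU requests to PCPU, and computes will '
> >                  'neither report PCPU inventory nor reshape existing '
> >                  'allocations. Disable on the scheduler first.'),
> >     ]
> >     cfg.CONF.register_opts(opts, group='workarounds')
> > 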
> > Cheers,
> > Stephen
> > 
> > [1] https://specs.openstack.org/openstack/nova-specs/specs/train/approved/cpu-resources.html