More upgrade issues with PCPUs - input wanted

15 Aug 2019

tl;dr: Is it acceptable to break booting of pinned instances on Stein
compute nodes in a Train deployment? If not, how do we best handle the
VCPU->PCPU migration in Train?

I've been working through the cpu-resources spec [1] and have run into
a tricky issue I'd like some input on. In short, this spec means that
pinned instances (i.e. 'hw:cpu_policy=dedicated') will now start
consuming a new resource type, PCPU, instead of VCPU. Many things need
to change to make this happen but the key changes are:

   1. The scheduler needs to start modifying requests for pinned instances
      to request PCPU resources instead of VCPU resources
   2. The libvirt driver needs to start reporting PCPU resources
   3. The libvirt driver needs to do a reshape, moving all existing
      allocations of VCPUs to PCPUs, if the instance holding that
      allocation is pinned
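
To make change (1) concrete, here is a minimal sketch of the kind of
translation involved. It is not nova's actual code: the function name
and the use of plain dicts for flavor extra specs and requested
resources are assumptions made purely for illustration.

```python
def translate_pinned_request(extra_specs, resources):
    """Return a copy of `resources` with VCPU swapped for PCPU if pinned.

    Hypothetical helper: `extra_specs` and `resources` are plain dicts
    standing in for a flavor's extra specs and the requested resources.
    """
    translated = dict(resources)
    pinned = extra_specs.get("hw:cpu_policy") == "dedicated"
    if pinned and "VCPU" in translated:
        # Pinned instances ask for dedicated CPUs rather than shared ones.
        translated["PCPU"] = translated.pop("VCPU")
    return translated
```

An unpinned flavor passes through unchanged, so only requests carrying
'hw:cpu_policy=dedicated' would ever ask for PCPU.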

Each of the first two steps presents an issue for which we have a
solution, but the solutions we've chosen are now causing a new issue
with the third.

 * For (1), the translation of VCPU to PCPU in the scheduler means
   compute nodes must now report PCPU in order for a pinned instance to
   land on that host. Since controllers are upgraded before compute
   nodes and all compute nodes aren't necessarily upgraded in one go
   (particularly for edge or other large or multi-cell deployments),
   this can mean there will be a period of time where there are very
   few or no hosts available on which to schedule pinned instances.

 * For (2), we're hampered by the fact that there is no clear way to
   determine if a host is used for pinned instances or not. Because of
   this, we can't determine if a host should be reporting PCPU or VCPU
   inventory.

The solution we have for the issues with (1) is to add a workaround
option that would disable this translation, allowing operators time to
upgrade all their compute nodes to report PCPU resources before
anything starts using them. For (2), we've decided to temporarily (i.e.
for one release or until configuration is updated) report both, in the
expectation that everyone using pinned instances has followed the long-
standing advice to separate hosts intended for pinned instances from
those intended for unpinned instances using host aggregates (so even
if we started reporting PCPUs on a host, nothing would consume them due
to 'pinned=False' aggregate metadata or similar). These two solutions
actually complement each other: if instances are still consuming VCPUs,
the hosts need to continue reporting VCPUs. However, both interfere
with our ability to do the reshape.
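
A toy model of a partially upgraded deployment shows why the two
mitigations fit together. This is only a sketch under assumptions: the
host names, the inventory numbers and the 'hosts_satisfying' helper are
all hypothetical, and real scheduling involves far more than a simple
inventory comparison.

```python
def hosts_satisfying(request, inventories):
    """Return the hosts whose inventory covers every requested resource."""
    return sorted(
        host
        for host, inventory in inventories.items()
        if all(inventory.get(rc, 0) >= amount
               for rc, amount in request.items())
    )


# Hypothetical deployment midway through the upgrade.
inventories = {
    "stein-compute": {"VCPU": 8},             # not upgraded, VCPU only
    "train-compute": {"VCPU": 8, "PCPU": 8},  # upgraded, reports both
}

# Translation enabled: a pinned instance asks for PCPU.
with_translation = hosts_satisfying({"PCPU": 4}, inventories)

# Workaround enabled: the same pinned instance still asks for VCPU.
with_workaround = hosts_satisfying({"VCPU": 4}, inventories)
```

With translation on, only the upgraded host is a candidate. With the
workaround on, both hosts are candidates, but only because the upgraded
host continues to report VCPU alongside PCPU.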

Normally, a reshape is a one-time thing. The way we'd planned to
determine if a reshape was necessary was to check if PCPU inventory was
registered against the host and, if not, whether there were any pinned
instances on the host. If PCPU inventory was not available and there
were pinned instances, we would update the allocations for these
instances so that they would be consuming PCPUs instead of VCPUs and
then update the inventory. This is problematic though, because our
solution for the issue with (1) means pinned instances can continue to
request VCPU resources, which in turn means we could end up with some
pinned instances on a host consuming PCPU and others consuming VCPU.
That obviously can't happen, so we need to change tack slightly. The
two obvious solutions would be to either (a) remove the workaround
option so the scheduler would immediately start requesting PCPUs and
just advise operators to upgrade their hosts for pinned instances asap
or (b) add a different option, defaulting to True, that would apply to
both the scheduler and compute nodes and prevent not only translation
of flavors in the scheduler but also the reporting of PCPUs and reshaping
of allocations until disabled.
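
The failure mode described above can be reproduced with a small
simulation. Again, this is a sketch rather than the real
implementation: instances and inventory are modelled as plain dicts,
and the function names and the 'pcpu_total' parameter are hypothetical.

```python
def needs_reshape(inventory, instances):
    """Planned check: no PCPU inventory yet, but pinned instances exist."""
    return "PCPU" not in inventory and any(i["pinned"] for i in instances)


def reshape(inventory, instances, pcpu_total):
    """Move pinned instances' VCPU allocations to PCPU, then add inventory."""
    for instance in instances:
        allocations = instance["allocations"]
        if instance["pinned"] and "VCPU" in allocations:
            allocations["PCPU"] = allocations.pop("VCPU")
    inventory["PCPU"] = pcpu_total


def pinned_resource_classes(instances):
    """CPU resource classes consumed by pinned instances on the host."""
    return {
        rc
        for instance in instances if instance["pinned"]
        for rc in instance["allocations"] if rc in ("VCPU", "PCPU")
    }


inventory = {"VCPU": 8}
instances = [{"pinned": True, "allocations": {"VCPU": 2}}]

# First start after upgrade: the one-time reshape runs.
if needs_reshape(inventory, instances):
    reshape(inventory, instances, pcpu_total=8)

# Workaround still enabled: a new pinned instance arrives requesting VCPU.
instances.append({"pinned": True, "allocations": {"VCPU": 2}})

reshape_again = needs_reshape(inventory, instances)
consumed = pinned_resource_classes(instances)
```

Because PCPU inventory now exists, the check never fires again, and the
host is left with pinned instances consuming both resource classes.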

I'm currently leaning towards (a) because it's a *lot* simpler, far
more robust (IMO) and lets us finish this effort in a single cycle, but
I imagine this could make upgrades very painful for operators if they
can't fast track their compute node upgrades. (b) is more complex and
would have some constraints, chief among them being that the option
would have to be disabled at some point post-release and would have to
be disabled on the scheduler first (to prevent the mishmash of VCPU and
PCPU resource allocations described above). It also means this becomes a three
cycle effort at minimum, since this new option will default to True in
Train, before defaulting to False and being deprecated in U and finally
being removed in V. As such, I'd like some input, particularly from
operators using pinned instances in larger deployments. What are your
thoughts, and are there any potential solutions that I'm missing here?

Cheers,
Stephen

[1] https://specs.openstack.org/openstack/nova-specs/specs/train/approved/cpu-re...

Stephen Finucane