<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">Stephen Finucane <<a href="mailto:sfinucan@redhat.com">sfinucan@redhat.com</a>> wrote on Thu, Aug 15, 2019 at 8:25 PM:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">tl;dr: Is breaking booting of pinned instances on Stein compute nodes<br>

in a Train deployment an acceptable thing to do, and if not, how do we<br>

best handle the VCPU->PCPU migration in Train?<br>

<br>

I've been working through the cpu-resources spec [1] and have run into<br>

a tricky issue I'd like some input on. In short, this spec means that<br>

pinned instances (i.e. 'hw:cpu_policy=dedicated') will now start<br>

consuming a new resource type, PCPU, instead of VCPU. Many things need<br>

to change to make this happen but the key changes are:<br>

<br>

   1. The scheduler needs to start modifying requests for pinned instances<br>

      to request PCPU resources instead of VCPU resources<br>

   2. The libvirt driver needs to start reporting PCPU resources<br>

   3. The libvirt driver needs to do a reshape, moving all existing<br>

      allocations of VCPUs to PCPUs, if the instance holding that<br>

      allocation is pinned<br>

<br>
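A minimal sketch of the translation described in (1), assuming a hypothetical helper<br>
named requested_cpu_resources and a plain dict of flavor extra specs. This is an<br>
illustration only, not Nova's actual scheduler code:<br>
<pre>
```python
def requested_cpu_resources(vcpus, extra_specs, translate_pinned=True):
    """Return the CPU resource class a flavor would request.

    Hypothetical helper for illustration; not part of Nova.
    'translate_pinned' stands in for the scheduler-side translation
    (and for a workaround option that would turn it off).
    """
    pinned = extra_specs.get("hw:cpu_policy") == "dedicated"
    if pinned and translate_pinned:
        # Pinned instances consume PCPU once translation is active.
        return {"PCPU": vcpus}
    # Unpinned instances, or pinned ones with translation disabled,
    # keep requesting VCPU.
    return {"VCPU": vcpus}
```
</pre>
With translation on, a four-CPU pinned flavor would request {"PCPU": 4}, which only<br>
hosts already reporting PCPU inventory could satisfy.<br>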

Each of the first two of these steps presents an issue for which we have a<br>

solution, but the solutions we've chosen are now resulting in this new<br>

issue.<br>

<br>

 * For (1), the translation of VCPU to PCPU in the scheduler means<br>

   compute nodes must now report PCPU in order for a pinned instance to<br>

   land on that host. Since controllers are upgraded before compute<br>

   nodes and all compute nodes aren't necessarily upgraded in one go<br>

   (particularly for edge or other large or multi-cell deployments),<br>

   this can mean there will be a period of time where there are very<br>

   few or no hosts available on which to schedule pinned instances.<br>

<br>

 * For (2), we're hampered by the fact that there is no clear way to<br>

   determine if a host is used for pinned instances or not. Because of<br>

   this, we can't determine if a host should be reporting PCPU or VCPU<br>

   inventory.<br>

<br>

The solution we have for the issues with (1) is to add a workaround<br>

option that would disable this translation, allowing operators time to<br>

upgrade all their compute nodes to report PCPU resources before<br>

anything starts using them. For (2), we've decided to temporarily (i.e.<br>

for one release or until configuration is updated) report both, in the<br>

expectation that everyone using pinned instances has followed the long-<br>

standing advice to separate hosts intended for pinned instances from<br>

those intended for unpinned instances using host aggregates (e.g. even<br>

if we started reporting PCPUs on a host, nothing would consume that due<br>

to 'pinned=False' aggregate metadata or similar). These actually<br>

benefit each other, since if instances are still consuming VCPUs then<br>

the hosts need to continue reporting VCPUs. However, both interfere<br>

with our ability to do the reshape.<br>

<br>

Normally, a reshape is a one time thing. The way we'd planned to<br>

determine if a reshape was necessary was to check if PCPU inventory was<br>

registered against the host and, if not, whether there were any pinned<br>

instances on the host. If PCPU inventory was not available and there<br>

were pinned instances, we would update the allocations for these<br>

instances so that they would be consuming PCPUs instead of VCPUs and<br>

then update the inventory. This is problematic though, because our<br>

solution for the issue with (1) means pinned instances can continue to<br>

request VCPU resources, which in turn means we could end up with some<br>

pinned instances on a host consuming PCPU and others consuming VCPU.<br>

That obviously can't happen, so we need to change tacks slightly. The<br>

two obvious solutions would be to either (a) remove the workaround<br>

option so the scheduler would immediately start requesting PCPUs and<br>

just advise operators to upgrade their hosts for pinned instances asap<br>

or (b) add a different option, defaulting to True, that would apply to<br>

both the scheduler and compute nodes and prevent not only translation<br>

of flavors in the scheduler but also the reporting of PCPUs and reshaping<br>

of allocations until disabled.<br>
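<br>
The originally planned reshape check could be sketched roughly as follows. The<br>
function names, the dict-based inventory and the allocation layout are all<br>
assumptions made for illustration; this is not the libvirt driver's real code:<br>
<pre>
```python
def needs_reshape(inventory, instances):
    """Planned check: no PCPU inventory yet, but pinned instances exist.

    'inventory' maps resource class names to totals; 'instances' is a list
    of dicts with a 'cpu_policy' key. Both shapes are hypothetical.
    """
    has_pcpu = "PCPU" in inventory
    has_pinned = any(i.get("cpu_policy") == "dedicated" for i in instances)
    return not has_pcpu and has_pinned


def reshape_allocations(allocations, pinned_uuids):
    """Return a copy of 'allocations' with VCPU moved to PCPU for pinned instances."""
    reshaped = {}
    for uuid, resources in allocations.items():
        resources = dict(resources)  # do not mutate the caller's data
        if uuid in pinned_uuids and "VCPU" in resources:
            resources["PCPU"] = resources.get("PCPU", 0) + resources.pop("VCPU")
        reshaped[uuid] = resources
    return reshaped
```
</pre>
Under this check, a host that already reports PCPU is never reshaped again, so any<br>
pinned instance that later lands there while still requesting VCPU would keep its<br>
VCPU allocation. That is the mixed state described above.<br>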

<br></blockquote><div><br></div><div>The steps I have in mind are:</div><div><br></div><div>1. Upgrade the control plane with PCPU requests disabled, so pinned instances still request VCPU.</div><div>2. Do a rolling upgrade of the compute nodes. The compute nodes begin to report both PCPU and VCPU, but requests are still made against VCPU.</div><div>3. Enable PCPU requests, so that new requests ask for PCPU.</div><div>       At this point, some instances on a node are consuming VCPU and others are consuming PCPU. The combined VCPU + PCPU inventory will be double the CPU resources actually available, so the NUMATopology filter is responsible for preventing over-consumption of the total number of CPUs.</div><div>4. Roll out a configuration update on the compute nodes to use cpu_dedicated_set, which triggers the reshape of existing VCPU allocations into PCPU allocations.</div><div>     New requests have been going to PCPU since step 3, so there are no new VCPU requests at this point. Rolling through the nodes gets rid of the existing VCPU allocations.</div><div>5. Done.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

I'm currently leaning towards (a) because it's a *lot* simpler, far<br>

more robust (IMO) and lets us finish this effort in a single cycle, but<br>

I imagine this could make upgrades very painful for operators if they<br>

can't fast track their compute node upgrades. (b) is more complex and<br>

would have some constraints, chief among them being that the option<br>

would have to be disabled at some point post-release and would have to<br>

be disabled on the scheduler first (to prevent the mishmash of VCPU and<br>

PCPU resource allocations described above). It also means this becomes a three<br>

cycle effort at minimum, since this new option will default to True in<br>

Train, before defaulting to False and being deprecated in U and finally<br>

being removed in V. As such, I'd like some input, particularly from<br>

operators using pinned instances in larger deployments. What are your<br>

thoughts, and are there any potential solutions that I'm missing here?<br>

<br>

Cheers,<br>

Stephen<br>

<br>

[1] <a href="https://specs.openstack.org/openstack/nova-specs/specs/train/approved/cpu-resources.html" rel="noreferrer" target="_blank">https://specs.openstack.org/openstack/nova-specs/specs/train/approved/cpu-resources.html</a><br>

<br>

<br>

</blockquote></div></div>