<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">Stephen Finucane &lt;<a href="mailto:sfinucan@redhat.com">sfinucan@redhat.com</a>&gt; wrote on Thu, Aug 15, 2019 at 8:25 PM:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">tl;dr: Is breaking booting of pinned instances on Stein compute nodes<br>
in a Train deployment an acceptable thing to do, and if not, how do we<br>
best handle the VCPU->PCPU migration in Train?<br>
<br>
I've been working through the cpu-resources spec [1] and have run into<br>
a tricky issue I'd like some input on. In short, this spec means that<br>
pinned instances (i.e. 'hw:cpu_policy=dedicated') will now start<br>
consuming a new resources type, PCPU, instead of VCPU. Many things need<br>
to change to make this happen but the key changes are:<br>
<br>
1. The scheduler needs to start modifying requests for pinned instances<br>
to request PCPU resources instead of VCPU resources<br>
2. The libvirt driver needs to start reporting PCPU resources<br>
3. The libvirt driver needs to do a reshape, moving all existing<br>
allocations of VCPUs to PCPUs, if the instance holding that<br>
allocation is pinned<br>
<br>
The first two of these steps each present an issue for which we have a<br>
solution, but the solutions we've chosen are now resulting in a new<br>
issue.<br>
<br>
* For (1), the translation of VCPU to PCPU in the scheduler means<br>
compute nodes must now report PCPU in order for a pinned instance to<br>
land on that host. Since controllers are upgraded before compute<br>
nodes and all compute nodes aren't necessarily upgraded in one go<br>
(particularly for edge or other large or multi-cell deployments),<br>
this can mean there will be a period of time where there are very<br>
few or no hosts available on which to schedule pinned instances.<br>
<br>
* For (2), we're hampered by the fact that there is no clear way to<br>
determine if a host is used for pinned instances or not. Because of<br>
this, we can't determine if a host should be reporting PCPU or VCPU<br>
inventory.<br>
<br>
The solution we have for the issues with (1) is to add a workaround<br>
option that would disable this translation, allowing operators time to<br>
upgrade all their compute nodes to report PCPU resources before<br>
anything starts using them. For (2), we've decided to temporarily (i.e.<br>
for one release or until configuration is updated) report both, in the<br>
expectation that everyone using pinned instances has followed the long-<br>
standing advice to separate hosts intended for pinned instances from<br>
those intended for unpinned instances using host aggregates (e.g. even<br>
if we started reporting PCPUs on a host, nothing would consume that due<br>
to 'pinned=False' aggregate metadata or similar). These actually<br>
benefit each other, since if instances are still consuming VCPUs then<br>
the hosts need to continue reporting VCPUs. However, both interfere<br>
with our ability to do the reshape.<br>
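<br>
A minimal sketch of the translation in (1) and of the workaround that<br>
disables it. This is not actual nova code: the function name and the<br>
'translate_pinned' flag are hypothetical stand-ins for the scheduler<br>
logic and the workaround option.<br>
<pre>
```python
def translate_cpu_request(vcpus, extra_specs, translate_pinned=True):
    """Return the CPU resources to request from placement for a flavor.

    translate_pinned is a hypothetical stand-in for the workaround option:
    when False, pinned instances keep requesting VCPU as they did in Stein.
    """
    pinned = extra_specs.get("hw:cpu_policy") == "dedicated"
    if pinned and translate_pinned:
        # Pinned instances consume the new PCPU resource class.
        return {"PCPU": vcpus}
    # Unpinned instances, or translation disabled by the workaround.
    return {"VCPU": vcpus}
```
</pre>
With translation enabled, a pinned flavor can only land on hosts that<br>
already report PCPU. With it disabled, hosts must keep reporting VCPU<br>
for those instances.<br>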
<br>
Normally, a reshape is a one time thing. The way we'd planned to<br>
determine if a reshape was necessary was to check if PCPU inventory was<br>
registered against the host and, if not, whether there were any pinned<br>
instances on the host. If PCPU inventory was not available and there<br>
were pinned instances, we would update the allocations for these<br>
instances so that they would be consuming PCPUs instead of VCPUs and<br>
then update the inventory. This is problematic though, because our<br>
solution for the issue with (1) means pinned instances can continue to<br>
request VCPU resources, which in turn means we could end up with some<br>
pinned instances on a host consuming PCPU and others consuming VCPU.<br>
That obviously can't happen, so we need to change tack slightly. The<br>
two obvious solutions would be to either (a) remove the workaround<br>
option so the scheduler would immediately start requesting PCPUs and<br>
just advise operators to upgrade their hosts for pinned instances asap<br>
or (b) add a different option, defaulting to True, that would apply to<br>
both the scheduler and compute nodes and prevent not only translation<br>
of flavors in the scheduler but also the reporting of PCPUs and reshaping<br>
of allocations until disabled.<br>
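<br>
To make the planned reshape check concrete, here is a rough sketch<br>
using plain dicts. The data shapes and function names are assumptions<br>
for illustration only and do not reflect the real driver or placement<br>
interfaces.<br>
<pre>
```python
def needs_reshape(inventory, instances):
    """One-time check: no PCPU inventory yet, but pinned instances exist."""
    return "PCPU" not in inventory and any(i["pinned"] for i in instances)


def reshape_allocations(instances, allocations):
    """Return new allocations with pinned instances' VCPUs moved to PCPUs."""
    pinned = {i["uuid"] for i in instances if i["pinned"]}
    reshaped = {}
    for uuid, resources in allocations.items():
        resources = dict(resources)  # copy so the input is left untouched
        if uuid in pinned and "VCPU" in resources:
            resources["PCPU"] = resources.pop("VCPU")
        reshaped[uuid] = resources
    return reshaped


# Hypothetical host with one pinned and one unpinned instance.
instances = [{"uuid": "a", "pinned": True}, {"uuid": "b", "pinned": False}]
allocations = {"a": {"VCPU": 4}, "b": {"VCPU": 2}}
inventory = {"VCPU": 16}

if needs_reshape(inventory, instances):
    allocations = reshape_allocations(instances, allocations)
    inventory["PCPU"] = 16  # host now reports PCPU as well

# A pinned instance booted later with translation disabled still gets VCPU,
# and the check above never fires again because PCPU inventory now exists.
instances.append({"uuid": "c", "pinned": True})
allocations["c"] = {"VCPU": 2}
```
</pre>
After this runs, instance "a" holds PCPU while instance "c" holds VCPU<br>
on the same host, which is the mixed state described above.<br>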
<br></blockquote><div><br></div><div>The steps I have in mind are:</div><div><br></div><div>1. Upgrade the control plane with PCPU requests disabled, so the scheduler still requests VCPU.</div><div>2. Do a rolling upgrade of the compute nodes. The compute nodes begin to report both PCPU and VCPU, but requests still go to VCPU.</div><div>3. Enable PCPU requests, so that new requests ask for PCPU.</div><div>    At this point, some instances are using VCPU and some are using PCPU on the same node, and the VCPU and PCPU inventory together add up to double the available CPU resources. The NUMATopologyFilter is responsible for preventing over-consumption of the total number of CPUs.</div><div>4. Do a rolling update of the compute nodes' configuration to use cpu_dedicated_set, which triggers the reshape of existing VCPU allocations to PCPU allocations.</div><div>    New requests have been going to PCPU since step 3, so there are no more VCPU requests at this point. Rolling through the nodes gets rid of the existing VCPU allocations.</div><div>5. Done.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
I'm currently leaning towards (a) because it's a *lot* simpler, far<br>
more robust (IMO) and lets us finish this effort in a single cycle, but<br>
I imagine this could make upgrades very painful for operators if they<br>
can't fast track their compute node upgrades. (b) is more complex and<br>
would have some constraints, chief among them being that the option<br>
would have to be disabled at some point post-release and would have to<br>
be disabled on the scheduler first (to prevent the mishmash of VCPU and<br>
PCPU resource allocations described above). It also means this becomes a three<br>
cycle effort at minimum, since this new option will default to True in<br>
Train, before defaulting to False and being deprecated in U and finally<br>
being removed in V. As such, I'd like some input, particularly from<br>
operators using pinned instances in larger deployments. What are your<br>
thoughts, and are there any potential solutions that I'm missing here?<br>
<br>
Cheers,<br>
Stephen<br>
<br>
[1] <a href="https://specs.openstack.org/openstack/nova-specs/specs/train/approved/cpu-resources.html" rel="noreferrer" target="_blank">https://specs.openstack.org/openstack/nova-specs/specs/train/approved/cpu-resources.html</a><br>
<br>
<br>
</blockquote></div></div>
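<div dir="ltr"><div><br></div><div>To illustrate the double counting in step 3, here is a small sketch. It assumes a host where every core is reported as both VCPU and PCPU, an allocation ratio of 1.0, and a simple host-level check standing in for what the NUMATopologyFilter would enforce. The function names are made up for the example.</div>
<pre>
```python
def reported_inventory(host_cpus):
    """Transitional state: the same physical cores are reported twice."""
    return {"VCPU": host_cpus, "PCPU": host_cpus}


def host_has_room(host_cpus, allocations, requested):
    """Host-side guard: total CPUs in use, of either class, must fit."""
    used = sum(sum(res.values()) for res in allocations.values())
    return host_cpus >= used + requested


# Hypothetical 8-core host with one instance on VCPU and one on PCPU.
inventory = reported_inventory(8)
allocations = {"a": {"VCPU": 4}, "b": {"PCPU": 4}}

# Inventory alone suggests 4 VCPU and 4 PCPU are still free ...
free_pcpu = inventory["PCPU"] - allocations["b"]["PCPU"]
# ... but every physical core is already taken.
can_boot = host_has_room(8, allocations, 1)
```
</pre>
<div>Here the inventory still shows free PCPU, yet the host-level check rejects the request, which is the role the filter has to play until the reshape in step 4 is complete.</div></div>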