<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Thu, May 31, 2018 at 7:09 PM, Dan Smith <span dir="ltr"><<a href="mailto:dms@danplanet.com" target="_blank">dms@danplanet.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">> My feeling is that we should not attempt to "migrate" any allocations<br>
> or inventories between root or child providers within a compute node,<br>
> period.<br>
<br>
</span>While I agree this is the simplest approach, it does put a lot of<br>
responsibility on the operators to do work to sidestep this issue, which<br>
might not even apply to them (and knowing if it does might be<br>
difficult).<br>
<span class=""><br></span></blockquote><div><br></div><div>Shit, I missed the point why we were discussing about migrations. When you upgrade, you wanna move your workloads for upgrading your kernel and the likes. Gotcha.<br></div><div>But, I assume that's not something mandatory for a single upgrade (say Queens>Rocky). In that case, you just wanna upgrade your compute without moving your instances. Or you notified your users about a maintenance and you know you have a minimal maintenance period for breaking them.<br></div><div>In both cases, adding more steps for upgrading seems a tricky and dangerous path for those operators who are afraid of making a mistake.<br> <br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">
> The virt drivers should simply error out of update_provider_tree() if<br>
> there are ANY existing VMs on the host AND the virt driver wishes to<br>
> begin tracking resources with nested providers.<br>
><br>
> The upgrade operation should look like this:<br>
><br>
> 1) Upgrade placement<br>
> 2) Upgrade nova-scheduler<br>
> 3) start loop on compute nodes. for each compute node:<br>
> 3a) disable nova-compute service on node (to take it out of scheduling)<br>
> 3b) evacuate all existing VMs off of node<br>
<br>
</span>You mean s/evacuate/cold migrate/ of course... :)<br>
<span class=""><br>
> 3c) upgrade compute node (on restart, the compute node will see no<br>
> VMs running on the node and will construct the provider tree inside<br>
> update_provider_tree() with an appropriate set of child providers<br>
> and inventories on those child providers)<br>
> 3d) enable nova-compute service on node<br>
><br>
> Which is virtually identical to the "normal" upgrade process whenever<br>
> there are significant changes to the compute node -- such as upgrading<br>
> libvirt or the kernel.<br>
<br>
</span>Not necessarily. It's totally legit (and I expect quite common) to just<br>
reboot the host to take kernel changes, bringing back all the instances<br>
that were there when it resumes. The "normal" case of moving things<br>
around slide-puzzle-style applies to live migration (which isn't an<br>
option here). I think people that can take downtime for the instances<br>
would rather not have to move things around for no reason if the<br>
instance has to get shut off anyway.<br>
<span class=""><br></span></blockquote><div><br></div><div>Yeah exactly that. Accepting a downtime is fair, to the price to not have a long list of operations to do during that downtime period.<br><br> <br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">
> Nested resource tracking is another such significant change and should<br>
> be dealt with in a similar way, IMHO.<br>
<br>
</span>This basically says that for anyone to move to rocky, they will have to<br>
cold migrate every single instance in order to do that upgrade right? I<br>
mean, anyone with two socket machines or SRIOV NICs would end up with at<br>
least one level of nesting, correct? Forcing everyone to move everything<br>
to do an upgrade seems like a non-starter to me.<br>
<br></blockquote><div><br></div><div>For the moment, we aren't providing NUMA topologies with nested RPs but once we do that, yeah, that would imply the above, which sounds harsh to hear from an operator perspective.<br><br> <br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
We also need to consider the case where people would be FFU'ing past<br>
rocky (i.e. never running rocky computes). We've previously said that<br>
we'd provide a way to push any needed transitions with everything<br>
offline to facilitate that case, so I think we need to implement that<br>
method anyway.<br>
<br>
I kinda think we need to either:<br>
<br>
1. Make everything perform the pivot on compute node start (which can be<br>
re-used by a CLI tool for the offline case)<br></blockquote><div><br></div><div>That's another alternative I haven't explored yet. Thanks for the idea. We already reconcile the world when we restart the compute service by checking whether mediated devices exist, so that could be a good option actually.<br><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
2. Make everything default to non-nested inventory at first, and provide<br>
a way to migrate a compute node and its instances one at a time (in<br>
place) to roll through.<br>
<br></blockquote><div><br></div><div> We could say that Rocky isn't supporting multiple vGPU types until you make the necessary DB migration that will create child RPs and the likes. That's yet another approach.<br><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
We can also document "or do the cold-migration slide puzzle thing" as an<br>
alternative for people that feel that's more reasonable.<br>
<br>
I just think that forcing people to take down their data plane to work<br>
around our own data model is kinda evil and something we should be<br>
avoiding at this level of project maturity. What we're really saying is<br>
"we know how to translate A into B, but we require you to move many GBs<br>
of data over the network and take some downtime because it's easier for<br>
*us* than making it seamless."<br>
<span class="HOEnZb"><font color="#888888"><br>
--Dan<br>
</font></span><div class="HOEnZb"><div class="h5"><br>
______________________________<wbr>______________________________<wbr>______________<br>
OpenStack Development Mailing List (not for usage questions)<br>
Unsubscribe: <a href="http://OpenStack-dev-request@lists.openstack.org?subject:unsubscribe" rel="noreferrer" target="_blank">OpenStack-dev-request@lists.<wbr>openstack.org?subject:<wbr>unsubscribe</a><br>
<a href="http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev" rel="noreferrer" target="_blank">http://lists.openstack.org/<wbr>cgi-bin/mailman/listinfo/<wbr>openstack-dev</a><br>
</div></div></blockquote></div><br></div></div>