Open Stack

Fri Jun 1 17:22:18 UTC 2018

Dan, you are leaving out the parts of my response where I am agreeing 
with you and saying that your "Option #2" is probably the thing we 
should go with.

-jay

On 06/01/2018 12:22 PM, Dan Smith wrote:
>> So, you're saying the normal process is to try upgrading the Linux
>> kernel and associated low-level libs, wait the requisite amount of
>> time that takes (can be a long time) and just hope that everything
>> comes back OK? That doesn't sound like any upgrade I've ever seen.
> 
> I'm saying I think it's a process practiced by some to install the new
> kernel and libs and then reboot to activate, yeah.
> 
>> No, sorry if I wasn't clear. They can live-migrate the instances off
>> of the to-be-upgraded compute host. They would only need to
>> cold-migrate instances that use the aforementioned non-movable
>> resources.
> 
> I don't think it's reasonable to force people to have to move every
> instance in their cloud (live or otherwise) in order to upgrade. That
> means that people who currently do their upgrades in-place in one step
> now have to do their upgrade in N steps, for N compute nodes. That
> doesn't seem reasonable to me.
> 
>> If we are going to go through the hassle of writing a bunch of
>> transformation code in order to keep operator action as low as
>> possible, I would prefer to consolidate all of this code into the
>> nova-manage (or nova-status) tool and put some sort of
>> attribute/marker on each compute node record to indicate whether a
>> "heal" operation has occurred for that compute node.
> 
> We need to know details of each compute node in order to do that. We
> could make the tool external and something they run per-compute node,
> but that still makes it N steps, even if the N steps are lighter
> weight.
> 
>> Someone (maybe Gibi?) on this thread had mentioned having the virt
>> driver (in update_provider_tree) do the whole set reserved = total
>> thing when first attempting to create the child providers. That would
>> work to prevent the scheduler from attempting to place workloads on
>> those child providers, but we would still need some marker on the
>> compute node to indicate to the nova-manage heal_nested_providers (or
>> whatever) command that the compute node has had its provider tree
>> validated/healed, right?
> 
> So that means you restart your cloud and it's basically locked up until
> you perform the N steps to unlock N nodes? That also seems like it's not
> going to make us very popular on the playground :)
> 
> I need to go read Eric's tome on how to handle the communication of
> things from virt to compute so that this translation can be done. I'm
> not saying I have the answer, I'm just saying that making this the
> problem of the operators doesn't seem like a solution to me, and that we
> should figure out how we're going to do this before we go down the
> rabbit hole.
> 
> --Dan
> 
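
For readers following the "set reserved = total" idea quoted above, here is a
minimal sketch of why it would keep the scheduler away from a freshly created
child provider. It assumes placement's usual capacity rule,
(total - reserved) * allocation_ratio. The Inventory class and the
lock_inventory and can_fit helpers are hypothetical names made up for
illustration, not nova or placement APIs.

```python
from dataclasses import dataclass


@dataclass
class Inventory:
    """Hypothetical stand-in for one inventory record on a resource provider."""
    total: int
    reserved: int = 0
    allocation_ratio: float = 1.0

    @property
    def capacity(self) -> int:
        # Assumed capacity rule: (total - reserved) * allocation_ratio.
        return int((self.total - self.reserved) * self.allocation_ratio)


def lock_inventory(inv: Inventory) -> Inventory:
    """Return a copy with reserved == total, so no capacity is schedulable."""
    return Inventory(total=inv.total, reserved=inv.total,
                     allocation_ratio=inv.allocation_ratio)


def can_fit(inv: Inventory, used: int, requested: int) -> bool:
    """True if `requested` more units fit on top of `used` within capacity."""
    return used + requested <= inv.capacity


# A child provider exposing 4 units of some resource (illustrative numbers).
child = Inventory(total=4)
locked = lock_inventory(child)

print(child.capacity, can_fit(child, used=0, requested=1))
print(locked.capacity, can_fit(locked, used=0, requested=1))
```

Under these assumptions the locked inventory has zero capacity, so nothing new
can land on it until some later step lowers reserved again. That later step is
the per-node "unlock" that Dan objects to above.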

[openstack-dev] [nova] [placement] Upgrade concerns with nested Resource Providers
