[openstack-dev] [nova] [placement] Upgrade concerns with nested Resource Providers

Dan Smith dms at danplanet.com
Fri Jun 1 16:22:05 UTC 2018


> So, you're saying the normal process is to try upgrading the Linux
> kernel and associated low-level libs, wait the requisite amount of
> time that takes (can be a long time) and just hope that everything
> comes back OK? That doesn't sound like any upgrade I've ever seen.

I'm saying I think it's a process practiced by some to install the new
kernel and libs and then reboot to activate, yeah.

> No, sorry if I wasn't clear. They can live-migrate the instances off
> of the to-be-upgraded compute host. They would only need to
> cold-migrate instances that use the aforementioned non-movable
> resources.

I don't think it's reasonable to force people to move every instance
in their cloud (live or otherwise) in order to upgrade. That means that
people who currently do their upgrades in place in one step now have to
do their upgrade in N steps, for N compute nodes. That doesn't seem
reasonable to me.

> If we are going to go through the hassle of writing a bunch of
> transformation code in order to keep operator action as low as
> possible, I would prefer to consolidate all of this code into the
> nova-manage (or nova-status) tool and put some sort of
> attribute/marker on each compute node record to indicate whether a
> "heal" operation has occurred for that compute node.

We need to know details of each compute node in order to do that. We
could make the tool external and something they run per-compute node,
but that still makes it N steps, even if the N steps are lighter
weight.

> Someone (maybe Gibi?) on this thread had mentioned having the virt
> driver (in update_provider_tree) do the whole set reserved = total
> thing when first attempting to create the child providers. That would
> work to prevent the scheduler from attempting to place workloads on
> those child providers, but we would still need some marker on the
> compute node to indicate to the nova-manage heal_nested_providers (or
> whatever) command that the compute node has had its provider tree
> validated/healed, right?

So that means you restart your cloud and it's basically locked up until
you perform the N steps to unlock N nodes? That also seems like it's not
going to make us very popular on the playground :)
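
To make that concern concrete, here is a minimal Python sketch of the
"reserved = total" idea from the quoted text. It is not nova or
placement code: Inventory, ComputeNode, lock_children, heal, and the
"healed" marker are all hypothetical names chosen for illustration. It
assumes usable capacity is (total - reserved) * allocation_ratio, which
is my understanding of how placement computes it.

```python
# Hypothetical sketch only; none of these names exist in nova or placement.
from dataclasses import dataclass


@dataclass
class Inventory:
    total: int
    reserved: int = 0
    allocation_ratio: float = 1.0

    def capacity(self) -> int:
        # Assumed capacity rule: (total - reserved) * allocation_ratio.
        return int((self.total - self.reserved) * self.allocation_ratio)


@dataclass
class ComputeNode:
    name: str
    child_inventories: dict  # resource class name -> Inventory
    healed: bool = False     # hypothetical "provider tree healed" marker


def lock_children(node: ComputeNode) -> None:
    # What the virt driver would do on first creating child providers:
    # reserve everything so nothing new can be placed there.
    for inv in node.child_inventories.values():
        inv.reserved = inv.total


def heal(node: ComputeNode, original_reserved: dict) -> None:
    # What the per-node operator step would do: restore the real
    # reserved values and set the marker.
    for rc, inv in node.child_inventories.items():
        inv.reserved = original_reserved.get(rc, 0)
    node.healed = True


def schedulable(node: ComputeNode, rc: str, amount: int) -> bool:
    inv = node.child_inventories.get(rc)
    return inv is not None and inv.capacity() >= amount


# Three compute nodes, each with one child inventory of 4 VGPU.
nodes = [ComputeNode(f"compute{i}", {"VGPU": Inventory(total=4)})
         for i in range(3)]

for n in nodes:
    lock_children(n)
locked = [schedulable(n, "VGPU", 1) for n in nodes]

heal(nodes[0], {})  # one operator step unlocks exactly one node
after = [schedulable(n, "VGPU", 1) for n in nodes]
```

In this sketch nothing can land on any child provider right after the
restart, and each heal call unlocks only one node, so N nodes still
need N operator steps.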

I need to go read Eric's tome on how to handle the communication of
things from virt to compute so that this translation can be done. I'm
not saying I have the answer, I'm just saying that making this the
problem of the operators doesn't seem like a solution to me, and that we
should figure out how we're going to do this before we go down the
rabbit hole.

--Dan


