[openstack-dev] [nova] [placement] Upgrade concerns with nested Resource Providers

Jay Pipes jaypipes at gmail.com
Fri Jun 1 18:12:23 UTC 2018


On 05/31/2018 02:26 PM, Eric Fried wrote:
>> 1. Make everything perform the pivot on compute node start (which can be
>>     re-used by a CLI tool for the offline case)
>> 2. Make everything default to non-nested inventory at first, and provide
>>     a way to migrate a compute node and its instances one at a time (in
>>     place) to roll through.
> 
> I agree that it sure would be nice to do ^ rather than requiring the
> "slide puzzle" thing.
> 
> But how would this be accomplished, in light of the current "separation
> of responsibilities" drawn at the virt driver interface, whereby the
> virt driver isn't supposed to talk to placement directly, or know
> anything about allocations?
FWIW, I don't have a problem with the virt driver "knowing about 
allocations". What I have a problem with is the virt driver *claiming 
resources for an instance*.

That's what the whole placement-claims-resources effort was all about, 
and I'm not interested in stepping back to the days of long, racy claim 
operations by having the compute nodes be responsible for claiming 
resources.

That said, once the consumer generation microversion lands [1], it 
should be possible to *safely* modify a consumer's (instance's) 
allocation set and move its allocation records from one provider to 
another.

[1] https://review.openstack.org/#/c/565604/
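
To make that concrete: once [1] is in, moving a consumer's allocations 
becomes a read-modify-write guarded by the consumer generation. Rough 
sketch below (not nova code, just plain requests against placement; the 
endpoint, token and exact microversion are placeholders, and error 
handling is hand-waved):

import requests

PLACEMENT = 'http://placement.example.com'      # placeholder endpoint
HEADERS = {
    'X-Auth-Token': 'REDACTED',                  # placeholder token
    'OpenStack-API-Version': 'placement 1.2X',   # whatever [1] lands as
}

def move_allocation(consumer_uuid, from_rp, to_rp, resource_class):
    url = '%s/allocations/%s' % (PLACEMENT, consumer_uuid)

    # Read the current allocations along with the consumer generation.
    body = requests.get(url, headers=HEADERS).json()
    allocs = body['allocations']

    # Shift the given resource class from provider B to provider C.
    amount = allocs[from_rp]['resources'].pop(resource_class)
    if not allocs[from_rp]['resources']:
        del allocs[from_rp]
    allocs.setdefault(to_rp, {'resources': {}})
    allocs[to_rp]['resources'][resource_class] = amount

    # Write back, echoing the generation we read. If someone else wrote
    # this consumer's allocations in the meantime, placement returns a
    # conflict and we can retry instead of silently clobbering them.
    payload = {
        'allocations': allocs,
        'consumer_generation': body['consumer_generation'],
        'project_id': body['project_id'],
        'user_id': body['user_id'],
    }
    return requests.put(url, json=payload, headers=HEADERS)

That is safe for a single consumer; it does not, by itself, solve the 
"shrink provider B's inventory at the same time" problem discussed 
below.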

> Here's a first pass:
> 
> The virt driver, via the return value from update_provider_tree, tells
> the resource tracker that "inventory of resource class A on provider B
> have moved to provider C" for all applicable AxBxC.  E.g.
> 
> [ { 'from_resource_provider': <cn_rp_uuid>,
>      'moved_resources': { 'VGPU': 4 },
>      'to_resource_provider': <gpu_rp1_uuid>
>    },
>    { 'from_resource_provider': <cn_rp_uuid>,
>      'moved_resources': { 'VGPU': 4 },
>      'to_resource_provider': <gpu_rp2_uuid>
>    },
>    { 'from_resource_provider': <cn_rp_uuid>,
>      'moved_resources': {
>          'SRIOV_NET_VF': 2,
>          'NET_BANDWIDTH_EGRESS_KILOBITS_PER_SECOND': 1000,
>          'NET_BANDWIDTH_INGRESS_KILOBITS_PER_SECOND': 1000,
>      },
>      'to_resource_provider': <gpu_rp2_uuid>
>    }
> ]
> 
> As today, the resource tracker takes the updated provider tree and
> invokes [1] the report client method update_from_provider_tree [2] to
> flush the changes to placement.  But now update_from_provider_tree also
> accepts the return value from update_provider_tree and, for each "move":
> 
> - Creates provider C (as described in the provider_tree) if it doesn't
> already exist.
> - Creates/updates provider C's inventory as described in the
> provider_tree (without yet updating provider B's inventory).  This ought
> to create the inventory of resource class A on provider C.

Unfortunately, right here you'll introduce a race condition. As soon as 
this operation completes, the scheduler is free to place new instances 
on provider C, consuming the very inventory you intend to hand to the 
existing instance that is currently consuming from provider B.

> - Discovers allocations of rc A on rp B and POSTs to move them to rp C*.

For each consumer of resources on rp B, right?

> - Updates provider B's inventory.

Again, this is problematic because the scheduler will have already begun 
to place new instances on B's inventory, which could very well result in 
incorrect resource accounting on the node.

We basically need one giant new REST API call that accepts the list of 
"move instructions" and performs them all in a single transaction. :(

> (*There's a hole here: if we're splitting a glommed-together inventory
> across multiple new child providers, as the VGPUs in the example, we
> don't know which allocations to put where.  The virt driver should know
> which instances own which specific inventory units, and would be able to
> report that info within the data structure.  That's getting kinda close
> to the virt driver mucking with allocations, but maybe it fits well
> enough into this model to be acceptable?)

Well, it's not really the virt driver *itself* mucking with the 
allocations. It's more that the virt driver is telling something *else* 
the move instructions that it feels are needed...
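
For example (every name below is made up, just to show the shape of the 
thing), the driver could report the moves broken down per consumer and 
leave the actual allocation writes to the resource tracker / report 
client:

# Hypothetical per-consumer variant of the structure above; the driver
# only states facts it already knows (which instance owns which units)
# and never talks to placement itself.
moves = [
    {
        'from_resource_provider': '<cn_rp_uuid>',
        'to_resource_provider': '<gpu_rp1_uuid>',
        'moved_allocations': {
            '<instance_a_uuid>': {'VGPU': 2},
            '<instance_b_uuid>': {'VGPU': 2},
        },
    },
    {
        'from_resource_provider': '<cn_rp_uuid>',
        'to_resource_provider': '<gpu_rp2_uuid>',
        'moved_allocations': {
            '<instance_c_uuid>': {'VGPU': 4},
        },
    },
]

The report client remains the only thing talking to placement; the 
driver is just handing it a more detailed set of move instructions.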

> Note that the return value from update_provider_tree is optional, and
> only used when the virt driver is indicating a "move" of this ilk.  If
> it's None/[] then the RT/update_from_provider_tree flow is the same as
> it is today.
> 
> If we can do it this way, we don't need a migration tool.  In fact, we
> don't even need to restrict provider tree "reshaping" to release
> boundaries.  As long as the virt driver understands its own data model
> migrations and reports them properly via update_provider_tree, it can
> shuffle its tree around whenever it wants.

Due to the many race conditions we would have in trying to fudge 
inventory amounts (the reserved/total thing) and allocation movement 
for more than one consumer at a time, I'm pretty sure the only safe 
thing to do is have a single new HTTP endpoint that would take this 
list of move operations and perform them atomically (on the placement 
server side, of course).

Here's a strawman for what that HTTP endpoint might look like:

https://etherpad.openstack.org/p/placement-migrate-operations
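
To seed the discussion, the request body I have in mind is something 
along these lines (everything here is hypothetical: the endpoint name, 
the field names and the exact inventory/allocation schema are all up 
for grabs on the etherpad):

# Hypothetical single-request "reshape": replace the inventories of the
# affected providers *and* rewrite the affected allocations, all or
# nothing, in one placement-side transaction.
payload = {
    'inventories': {
        '<cn_rp_uuid>': {
            'resource_provider_generation': 42,
            # VGPU inventory moves off the root provider; VCPU,
            # MEMORY_MB, etc. elided for brevity.
            'inventories': {},
        },
        '<gpu_rp1_uuid>': {
            'resource_provider_generation': None,   # new child provider
            'inventories': {'VGPU': {'total': 4}},
        },
        '<gpu_rp2_uuid>': {
            'resource_provider_generation': None,
            'inventories': {'VGPU': {'total': 4}},
        },
    },
    'allocations': {
        '<instance_a_uuid>': {
            'allocations': {
                '<gpu_rp1_uuid>': {'resources': {'VGPU': 2}},
            },
            'consumer_generation': 1,
            'project_id': '<project_uuid>',
            'user_id': '<user_uuid>',
        },
        # ...one entry per consumer being moved...
    },
}
# requests.post('%s/resource_provider_moves' % PLACEMENT,
#               json=payload, headers=HEADERS)   # endpoint name made up

Placement could then check every provider and consumer generation and 
apply the whole reshape, or none of it, inside a single transaction.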

Feel free to mark up and destroy.

Best,
-jay

> Thoughts?
> 
> -efried
> 
> [1]
> https://github.com/openstack/nova/blob/8753c9a38667f984d385b4783c3c2fc34d7e8e1b/nova/compute/resource_tracker.py#L890
> [2]
> https://github.com/openstack/nova/blob/8753c9a38667f984d385b4783c3c2fc34d7e8e1b/nova/scheduler/client/report.py#L1341
> 


