[openstack-dev] [nova] [placement] Upgrade concerns with nested Resource Providers
Eric Fried
openstack at fried.cc
Mon Jun 4 23:00:48 UTC 2018
There has been much discussion. We've gotten to the point of an initial
proposal and are ready for more (hopefully smaller, hopefully
conclusive) discussion.
To that end, there will be a HANGOUT tomorrow (TUESDAY, JUNE 5TH) at
1500 UTC. Be in #openstack-placement to get the link to join.
The strawpeople outlined below and discussed in the referenced etherpad
have been consolidated/distilled into a new etherpad [1] around which
the hangout discussion will be centered.
[1] https://etherpad.openstack.org/p/placement-making-the-(up)grade
Thanks,
efried
On 06/01/2018 01:12 PM, Jay Pipes wrote:
> On 05/31/2018 02:26 PM, Eric Fried wrote:
>>> 1. Make everything perform the pivot on compute node start (which can be
>>> re-used by a CLI tool for the offline case)
>>> 2. Make everything default to non-nested inventory at first, and provide
>>> a way to migrate a compute node and its instances one at a time (in
>>> place) to roll through.
>>
>> I agree that it sure would be nice to do ^ rather than requiring the
>> "slide puzzle" thing.
>>
>> But how would this be accomplished, in light of the current "separation
>> of responsibilities" drawn at the virt driver interface, whereby the
>> virt driver isn't supposed to talk to placement directly, or know
>> anything about allocations?
> FWIW, I don't have a problem with the virt driver "knowing about
> allocations". What I have a problem with is the virt driver *claiming
> resources for an instance*.
>
> That's what the whole placement-claims-resources thing was all about,
> and I'm not interested in stepping back to the days of long racy claim
> operations by having the compute nodes be responsible for claiming
> resources.
>
> That said, once the consumer generation microversion lands [1], it
> should be possible to *safely* modify an allocation set for a consumer
> (instance) and move allocation records for an instance from one provider
> to another.
>
> [1] https://review.openstack.org/#/c/565604/
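(For illustration: if that proposal lands roughly as it is in review,
re-pointing an instance's VGPU allocation from the compute node provider to a
child provider could be a single generation-guarded PUT to
/allocations/{instance_uuid}. The key names below are assumed from the review;
resource classes and amounts are made up.)

    new_allocations = {
        'allocations': {
            # Resources that stay on the compute node provider.
            '<cn_rp_uuid>': {'resources': {'VCPU': 2, 'MEMORY_MB': 2048}},
            # The VGPU allocation, re-pointed at the child provider.
            '<gpu_rp1_uuid>': {'resources': {'VGPU': 1}},
        },
        # Must match placement's stored generation for this consumer;
        # otherwise the PUT fails and the caller retries. That conflict
        # check is what makes the move safe against concurrent writers.
        'consumer_generation': 5,
        'project_id': '<project_uuid>',
        'user_id': '<user_uuid>',
    }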
>
>> Here's a first pass:
>>
>> The virt driver, via the return value from update_provider_tree, tells
>> the resource tracker that "inventory of resource class A on provider B
>> has moved to provider C" for all applicable AxBxC. E.g.
>>
>>   [ { 'from_resource_provider': <cn_rp_uuid>,
>>       'moved_resources': {'VGPU': 4},
>>       'to_resource_provider': <gpu_rp1_uuid>
>>     },
>>     { 'from_resource_provider': <cn_rp_uuid>,
>>       'moved_resources': {'VGPU': 4},
>>       'to_resource_provider': <gpu_rp2_uuid>
>>     },
>>     { 'from_resource_provider': <cn_rp_uuid>,
>>       'moved_resources': {
>>           'SRIOV_NET_VF': 2,
>>           'NET_BANDWIDTH_EGRESS_KILOBITS_PER_SECOND': 1000,
>>           'NET_BANDWIDTH_INGRESS_KILOBITS_PER_SECOND': 1000,
>>       },
>>       'to_resource_provider': <nic_rp_uuid>
>>     }
>>   ]
>>
>> As today, the resource tracker takes the updated provider tree and
>> invokes [1] the report client method update_from_provider_tree [2] to
>> flush the changes to placement. But now update_from_provider_tree also
>> accepts the return value from update_provider_tree and, for each "move":
>>
>> - Creates provider C (as described in the provider_tree) if it doesn't
>> already exist.
>> - Creates/updates provider C's inventory as described in the
>> provider_tree (without yet updating provider B's inventory). This ought
>> to create the inventory of resource class A on provider C.
>
> Unfortunately, right here you'll introduce a race condition. As soon as
> this operation completes, the scheduler will have the ability to throw
> new instances on provider C and consume the inventory from it that you
> intend to give to the existing instance that is consuming from provider B.
>
>> - Discovers allocations of rc A on rp B and POSTs to move them to rp C*.
>
> For each consumer of resources on rp B, right?
>
>> - Updates provider B's inventory.
>
> Again, this is problematic because the scheduler will have already begun
> to place new instances on B's inventory, which could very well result in
> incorrect resource accounting on the node.
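Writing the per-move flow above out as a rough sketch makes that window
obvious. None of these method names are the real report client API; they are
hypothetical and only mirror the steps listed above.

    def apply_moves(reportclient, provider_tree, moves):
        # Hypothetical helper: applies each "move" returned by
        # update_provider_tree as a sequence of separate placement calls.
        for move in moves:
            src = move['from_resource_provider']
            dst = move['to_resource_provider']

            # Ensure provider C exists and carries the new inventory.
            reportclient.ensure_provider(dst, provider_tree)
            reportclient.set_inventory(dst, provider_tree.data(dst).inventory)

            # From here until the allocations are moved, the scheduler can
            # hand dst's fresh inventory to brand-new instances.

            # Move existing allocations of the affected classes from B to C.
            for consumer_uuid in reportclient.get_allocations_for_provider(src):
                reportclient.move_allocations(
                    consumer_uuid, src, dst, move['moved_resources'])

            # Only now shrink provider B's inventory.
            reportclient.set_inventory(src, provider_tree.data(src).inventory)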
>
> We basically need to have one giant new REST API call that accepts the
> list of "move instructions" and performs all of the instructions in a
> single transaction. :(
>
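To make that concrete, here is one purely illustrative shape for such a
request body, expressed as a Python dict. None of these key names come from an
actual proposal; the point is just "replacement inventories plus re-seated
allocations, applied in one transaction".

    reshape_request = {
        # Replacement inventories, keyed by provider UUID.  An empty dict
        # means the provider no longer has inventory of the moved classes.
        'inventories': {
            '<cn_rp_uuid>': {},
            '<gpu_rp1_uuid>': {'VGPU': {'total': 4}},
            '<gpu_rp2_uuid>': {'VGPU': {'total': 4}},
        },
        # Re-seated allocations, keyed by consumer (instance) UUID.
        'allocations': {
            '<instance_uuid>': {
                'allocations': {
                    '<gpu_rp1_uuid>': {'resources': {'VGPU': 1}},
                },
                'consumer_generation': 1,
            },
        },
    }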
>> (*There's a hole here: if we're splitting a glommed-together inventory
>> across multiple new child providers, as with the VGPUs in the example, we
>> don't know which allocations to put where. The virt driver should know
>> which instances own which specific inventory units, and would be able to
>> report that info within the data structure. That's getting kinda close
>> to the virt driver mucking with allocations, but maybe it fits well
>> enough into this model to be acceptable?)
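One illustrative way to plug that hole (the 'moved_allocations' key is
invented here, not part of any proposal) would be to let the virt driver
attach a per-instance breakdown to each move:

    move_with_owners = {
        'from_resource_provider': '<cn_rp_uuid>',
        'to_resource_provider': '<gpu_rp1_uuid>',
        'moved_resources': {'VGPU': 4},
        # Hypothetical: which consumer owns how many of the moved units, so
        # the resource tracker knows which allocations land on which child.
        'moved_allocations': {
            '<instance_a_uuid>': {'VGPU': 1},
            '<instance_b_uuid>': {'VGPU': 3},
        },
    }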
>
> Well, it's not really the virt driver *itself* mucking with the
> allocations. It's more that the virt driver is telling something *else*
> the move instructions that it feels are needed...
>
>> Note that the return value from update_provider_tree is optional, and
>> only used when the virt driver is indicating a "move" of this ilk. If
>> it's None/[] then the RT/update_from_provider_tree flow is the same as
>> it is today.
>>
>> If we can do it this way, we don't need a migration tool. In fact, we
>> don't even need to restrict provider tree "reshaping" to release
>> boundaries. As long as the virt driver understands its own data model
>> migrations and reports them properly via update_provider_tree, it can
>> shuffle its tree around whenever it wants.
>
> Due to the many race conditions we would have in trying to fudge
> inventory amounts (the reserved/total thing) and allocation movement for
> >1 consumer at a time, I'm pretty sure the only safe thing to do is have
> a single new HTTP endpoint that would take this list of move operations
> and perform them atomically (on the placement server side of course).
>
> Here's a strawman for what that HTTP endpoint might look like:
>
> https://etherpad.openstack.org/p/placement-migrate-operations
>
> feel free to markup and destroy.
>
> Best,
> -jay
>
>> Thoughts?
>>
>> -efried
>>
>> [1]
>> https://github.com/openstack/nova/blob/8753c9a38667f984d385b4783c3c2fc34d7e8e1b/nova/compute/resource_tracker.py#L890
>>
>> [2]
>> https://github.com/openstack/nova/blob/8753c9a38667f984d385b4783c3c2fc34d7e8e1b/nova/scheduler/client/report.py#L1341
>>
>>