[openstack-dev] [nova] [placement] Upgrade concerns with nested Resource Providers

Eric Fried openstack at fried.cc
Fri Jun 8 16:09:03 UTC 2018


There is now a blueprint [1] and draft spec [2].  Reviews welcomed.

[1] https://blueprints.launchpad.net/nova/+spec/reshape-provider-tree
[2] https://review.openstack.org/#/c/572583/

On 06/04/2018 06:00 PM, Eric Fried wrote:
> There has been much discussion.  We've converged on an initial proposal
> and are ready for more (hopefully smaller, hopefully conclusive)
> discussion.
> 
> To that end, there will be a HANGOUT tomorrow (TUESDAY, JUNE 5TH) at
> 1500 UTC.  Be in #openstack-placement to get the link to join.
> 
> The strawpeople outlined below and discussed in the referenced etherpad
> have been consolidated/distilled into a new etherpad [1] around which
> the hangout discussion will be centered.
> 
> [1] https://etherpad.openstack.org/p/placement-making-the-(up)grade
> 
> Thanks,
> efried
> 
> On 06/01/2018 01:12 PM, Jay Pipes wrote:
>> On 05/31/2018 02:26 PM, Eric Fried wrote:
>>>> 1. Make everything perform the pivot on compute node start (which can be
>>>>     re-used by a CLI tool for the offline case)
>>>> 2. Make everything default to non-nested inventory at first, and provide
>>>>     a way to migrate a compute node and its instances one at a time (in
>>>>     place) to roll through.
>>>
>>> I agree that it sure would be nice to do ^ rather than requiring the
>>> "slide puzzle" thing.
>>>
>>> But how would this be accomplished, in light of the current "separation
>>> of responsibilities" drawn at the virt driver interface, whereby the
>>> virt driver isn't supposed to talk to placement directly, or know
>>> anything about allocations?
>> FWIW, I don't have a problem with the virt driver "knowing about
>> allocations". What I have a problem with is the virt driver *claiming
>> resources for an instance*.
>>
>> That's what the whole placement-claims-resources effort was all about,
>> and I'm not interested in stepping back to the days of long, racy claim
>> operations by having the compute nodes be responsible for claiming
>> resources.
>>
>> That said, once the consumer generation microversion lands [1], it
>> should be possible to *safely* modify an allocation set for a consumer
>> (instance) and move its allocation records from one provider to
>> another.
>>
>> [1] https://review.openstack.org/#/c/565604/
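>>
>> To illustrate, a consumer-level allocation rewrite under that model
>> might look roughly like this.  This is a sketch only: the field names
>> and the microversion follow the in-flight proposal and could change
>> before it merges.
>>
>>   import requests
>>
>>   # Move the VGPU piece of one instance's allocations from the compute
>>   # node root provider to a child provider, guarded by the consumer
>>   # generation so a concurrent writer gets a 409 instead of a clobber.
>>   payload = {
>>       'allocations': {
>>           '<cn_rp_uuid>': {'resources': {'VCPU': 2, 'MEMORY_MB': 4096}},
>>           '<gpu_rp1_uuid>': {'resources': {'VGPU': 1}},
>>       },
>>       'project_id': '<project_id>',
>>       'user_id': '<user_id>',
>>       'consumer_generation': 3,  # as read just before this update
>>   }
>>   requests.put(
>>       'http://placement.example/allocations/<instance_uuid>',
>>       json=payload,
>>       headers={'X-Auth-Token': '<token>',
>>                'OpenStack-API-Version': 'placement 1.2x'})  # version TBD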
>>
>>> Here's a first pass:
>>>
>>> The virt driver, via the return value from update_provider_tree, tells
>>> the resource tracker that "inventory of resource class A on provider B
>>> has moved to provider C" for all applicable AxBxC.  E.g.
>>>
>>> [ { 'from_resource_provider': <cn_rp_uuid>,
>>>     'moved_resources': {'VGPU': 4},
>>>     'to_resource_provider': <gpu_rp1_uuid>
>>>   },
>>>   { 'from_resource_provider': <cn_rp_uuid>,
>>>     'moved_resources': {'VGPU': 4},
>>>     'to_resource_provider': <gpu_rp2_uuid>
>>>   },
>>>   { 'from_resource_provider': <cn_rp_uuid>,
>>>     'moved_resources': {
>>>         'SRIOV_NET_VF': 2,
>>>         'NET_BANDWIDTH_EGRESS_KILOBITS_PER_SECOND': 1000,
>>>         'NET_BANDWIDTH_INGRESS_KILOBITS_PER_SECOND': 1000,
>>>     },
>>>     'to_resource_provider': <sriov_rp_uuid>
>>>   }
>>> ]
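>>>
>>> To make that concrete, here's a rough sketch of a driver producing such
>>> a return value.  The pGPU-discovery helper is hypothetical, the
>>> provider_tree calls are from memory, and today update_provider_tree
>>> returns nothing; treat this as illustration only.
>>>
>>>   def update_provider_tree(self, provider_tree, nodename):
>>>       cn_uuid = provider_tree.data(nodename).uuid
>>>       moves = []
>>>       for pgpu in self._enumerate_pgpus():  # hypothetical helper
>>>           child = '%s_%s' % (nodename, pgpu.name)
>>>           if not provider_tree.exists(child):
>>>               provider_tree.new_child(child, nodename)
>>>           provider_tree.update_inventory(
>>>               child, {'VGPU': {'total': pgpu.vgpu_capacity}})
>>>           # ...and drop VGPU from the root provider's own inventory.
>>>           moves.append({
>>>               'from_resource_provider': cn_uuid,
>>>               'moved_resources': {'VGPU': pgpu.vgpu_capacity},
>>>               'to_resource_provider': provider_tree.data(child).uuid,
>>>           })
>>>       return moves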
>>>
>>> As it does today, the resource tracker takes the updated provider tree
>>> and invokes [1] the report client method update_from_provider_tree [2]
>>> to flush the changes to placement.  But now update_from_provider_tree
>>> also accepts the return value from update_provider_tree and, for each
>>> "move":
>>>
>>> - Creates provider C (as described in the provider_tree) if it doesn't
>>> already exist.
>>> - Creates/updates provider C's inventory as described in the
>>> provider_tree (without yet updating provider B's inventory).  This ought
>>> to create the inventory of resource class A on provider C.
>>
>> Unfortunately, right here you'll introduce a race condition. As soon as
>> this operation completes, the scheduler will have the ability to throw
>> new instances on provider C and consume the inventory from it that you
>> intend to give to the existing instance that is consuming from provider B.
>>
>>> - Discovers allocations of rc A on rp B and POSTs to move them to rp C*.
>>
>> For each consumer of resources on rp B, right?
>>
>>> - Updates provider B's inventory.
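>>>
>>> In pseudo-Python, those last couple of steps per "move" might look
>>> something like this (every method name on `report` below is made up
>>> for illustration; it is not the real report client API):
>>>
>>>   def apply_move(report, context, move, new_src_inventory):
>>>       # Provider C and its inventory have already been flushed to
>>>       # placement by the normal provider-tree sync above.
>>>       src = move['from_resource_provider']
>>>       dest = move['to_resource_provider']
>>>       # For each consumer holding allocations against B, rewrite the
>>>       # moved resource classes to point at C instead.
>>>       for consumer_uuid, alloc in report.get_allocations_for_provider(
>>>               context, src).items():
>>>           moved = {rc: alloc['resources'].pop(rc)
>>>                    for rc in move['moved_resources']
>>>                    if rc in alloc['resources']}
>>>           if not moved:
>>>               continue
>>>           new_allocs = {dest: {'resources': moved}}
>>>           if alloc['resources']:
>>>               # whatever isn't being moved stays on B
>>>               new_allocs[src] = alloc
>>>           report.put_allocations_for_consumer(
>>>               context, consumer_uuid, new_allocs)
>>>       # Only now shrink B's inventory to what the provider tree says.
>>>       report.set_inventory_for_provider(context, src, new_src_inventory)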
>>
>> Again, this is problematic because the scheduler will have already begun
>> to place new instances on B's inventory, which could very well result in
>> incorrect resource accounting on the node.
>>
>> We basically need to have one giant new REST API call that accepts the
>> list of "move instructions" and performs all of the instructions in a
>> single transaction. :(
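>>
>> Something along these lines, purely as a strawman (the URL and payload
>> shape are exactly what needs to be hashed out):
>>
>>   # One request hands placement every affected provider's new inventory
>>   # plus every affected consumer's new allocations, to be applied
>>   # atomically or rejected wholesale.
>>   move_request = {
>>       'inventories': {
>>           '<cn_rp_uuid>': {'VCPU': {'total': 16},
>>                            'MEMORY_MB': {'total': 65536}},
>>           '<gpu_rp1_uuid>': {'VGPU': {'total': 4}},
>>           '<gpu_rp2_uuid>': {'VGPU': {'total': 4}},
>>       },
>>       'allocations': {
>>           '<instance_uuid>': {
>>               '<cn_rp_uuid>': {'VCPU': 2, 'MEMORY_MB': 4096},
>>               '<gpu_rp1_uuid>': {'VGPU': 1},
>>           },
>>       },
>>   }
>>   # e.g. POST /resource_provider_moves  (hypothetical path), which the
>>   # placement server applies inside a single database transaction.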
>>
>>> (*There's a hole here: if we're splitting a glommed-together inventory
>>> across multiple new child providers, as with the VGPUs in the example,
>>> we don't know which allocations to put where.  The virt driver should know
>>> which instances own which specific inventory units, and would be able to
>>> report that info within the data structure.  That's getting kinda close
>>> to the virt driver mucking with allocations, but maybe it fits well
>>> enough into this model to be acceptable?)
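>>>
>>> For example, each entry in the structure above could optionally carry a
>>> per-consumer breakdown, something like (sketch only):
>>>
>>>   { 'from_resource_provider': <cn_rp_uuid>,
>>>     'moved_resources': {'VGPU': 4},
>>>     'to_resource_provider': <gpu_rp1_uuid>,
>>>     # optional: which consumer owns which of the moved units
>>>     'consumers': {
>>>         '<instance1_uuid>': {'VGPU': 1},
>>>         '<instance2_uuid>': {'VGPU': 3},
>>>     },
>>>   }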
>>
>> Well, it's not really the virt driver *itself* mucking with the
>> allocations. It's more that the virt driver is telling something *else*
>> the move instructions that it feels are needed...
>>
>>> Note that the return value from update_provider_tree is optional, and
>>> only used when the virt driver is indicating a "move" of this ilk.  If
>>> it's None/[] then the RT/update_from_provider_tree flow is the same as
>>> it is today.
>>>
>>> If we can do it this way, we don't need a migration tool.  In fact, we
>>> don't even need to restrict provider tree "reshaping" to release
>>> boundaries.  As long as the virt driver understands its own data model
>>> migrations and reports them properly via update_provider_tree, it can
>>> shuffle its tree around whenever it wants.
>>
>> Due to the many race conditions we would have in trying to fudge
>> inventory amounts (the reserved/total thing) and allocation movement
>> for >1 consumer at a time, I'm pretty sure the only safe thing to do
>> is have a single new HTTP endpoint that would take this list of move
>> operations and perform them atomically (on the placement server side,
>> of course).
>>
>> Here's a strawman for what that HTTP endpoint might look like:
>>
>> https://etherpad.openstack.org/p/placement-migrate-operations
>>
>> Feel free to mark up and destroy.
>>
>> Best,
>> -jay
>>
>>> Thoughts?
>>>
>>> -efried
>>>
>>> [1]
>>> https://github.com/openstack/nova/blob/8753c9a38667f984d385b4783c3c2fc34d7e8e1b/nova/compute/resource_tracker.py#L890
>>>
>>> [2]
>>> https://github.com/openstack/nova/blob/8753c9a38667f984d385b4783c3c2fc34d7e8e1b/nova/scheduler/client/report.py#L1341
>>>
>>>
>>
> 


