[openstack-dev] [nova] heads up to users of Aggregate[Core|Ram|Disk]Filter: behavior change in >= Ocata
Mathieu Gagné
mgagne at calavera.ca
Fri Jan 19 00:24:53 UTC 2018
On Thu, Jan 18, 2018 at 5:19 PM, Jay Pipes <jaypipes at gmail.com> wrote:
> On 01/18/2018 03:54 PM, Mathieu Gagné wrote:
>>
>> Hi,
>>
>> On Tue, Jan 16, 2018 at 4:24 PM, melanie witt <melwittt at gmail.com> wrote:
>>>
>>> Hello Stackers,
>>>
>>> This is a heads up to any of you using the AggregateCoreFilter,
>>> AggregateRamFilter, and/or AggregateDiskFilter in the filter scheduler.
>>> These filters have effectively allowed operators to set overcommit ratios
>>> per aggregate rather than per compute node in <= Newton.
>>>
>>> Beginning in Ocata, there is a behavior change where aggregate-based
>>> overcommit ratios will no longer be honored during scheduling. Instead,
>>> overcommit values must be set on a per compute node basis in nova.conf.
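>>>
>>> For example, a compute node that used to inherit its ratios from aggregate
>>> metadata would instead carry something like this in its own nova.conf
>>> (the values here are only illustrative):
>>>
>>>   [DEFAULT]
>>>   cpu_allocation_ratio = 4.0
>>>   ram_allocation_ratio = 1.0
>>>   disk_allocation_ratio = 1.0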
>>>
>>> Details: as of Ocata, instead of considering all compute nodes at the
>>> start
>>> of scheduler filtering, an optimization has been added to query resource
>>> capacity from placement and prune the compute node list with the result
>>> *before* any filters are applied. Placement tracks resource capacity and
>>> usage and does *not* track aggregate metadata [1]. Because of this,
>>> placement cannot consider aggregate-based overcommit and will exclude
>>> compute nodes that do not have capacity based on per compute node
>>> overcommit.
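>>>
>>> (Roughly, the capacity check placement performs per resource class is:
>>> used + requested <= (total - reserved) * allocation_ratio, using the per
>>> compute node allocation_ratio it knows about, so nodes that only "fit"
>>> thanks to an aggregate-level ratio will be pruned.)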
>>>
>>> How to prepare: if you have been relying on per aggregate overcommit, during
>>> your upgrade to Ocata, you must change to using per compute node overcommit
>>> ratios in order for your scheduling behavior to stay consistent. Otherwise,
>>> you may notice increased NoValidHost scheduling failures as the
>>> aggregate-based overcommit is no longer being considered. You can safely
>>> remove the AggregateCoreFilter, AggregateRamFilter, and AggregateDiskFilter
>>> from your enabled_filters and you do not need to replace them with any other
>>> core/ram/disk filters. The placement query takes care of the core/ram/disk
>>> filtering instead, so CoreFilter, RamFilter, and DiskFilter are redundant.
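>>>
>>> In other words, after the upgrade a filter list along these lines is
>>> sufficient on the scheduler side (illustrative only, keep whatever other
>>> filters you rely on):
>>>
>>>   [filter_scheduler]
>>>   enabled_filters = RetryFilter,AvailabilityZoneFilter,ComputeFilter,ImagePropertiesFilter,AggregateInstanceExtraSpecsFilter,ServerGroupAntiAffinityFilter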
>>>
>>> Thanks,
>>> -melanie
>>>
>>> [1] Placement has been a clean slate for resource management, and prior to
>>> placement, there were conflicts between the different methods for setting
>>> overcommit ratios that were never addressed, such as, "which value to take
>>> if a compute node has overcommit set AND the aggregate has it set? Which
>>> takes precedence?" And, "if a compute node is in more than one aggregate,
>>> which overcommit value should be taken?" So, the ambiguities were not
>>> something that was desirable to bring forward into placement.
>>
>>
>> So we are a user of this feature and I do have some questions/concerns.
>>
>> We use this feature to segregate capacity/hosts based on CPU
>> allocation ratio using aggregates.
>> This is because we have different offers/flavors based on those
>> allocation ratios. This is part of our business model.
>> Flavor extra_specs are used to schedule instances on appropriate hosts
>> using the AggregateInstanceExtraSpecsFilter.
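>>
>> Roughly, our setup looks like this (simplified, names made up):
>>
>>   openstack aggregate create cpu-ratio-4
>>   openstack aggregate set --property cpu_allocation_ratio=4.0 \
>>       --property cpu_overcommit=4 cpu-ratio-4
>>   openstack aggregate add host cpu-ratio-4 compute-001
>>   openstack flavor set \
>>       --property aggregate_instance_extra_specs:cpu_overcommit=4 a1.large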
>
>
> The AggregateInstanceExtraSpecsFilter will continue to work, but this filter
> is run *after* the placement service would have already eliminated compute
> node records due to placement considering the allocation ratio set for the
> compute node provider's inventory records.
Ok. Does it mean I will have to use something else to properly filter
compute nodes based on flavor?
Is there a way for a compute node to expose some arbitrary
feature/spec instead and still use flavor extra_specs to filter?
(I still have to read on placement API)
I don't mind migrating out of aggregates but I need to find a way to
make it "self service" through the API with granular control like
aggregates used to offer.
We won't be giving our technicians access to our configuration
management system, let alone direct access to the database.
I see that you are suggesting using the placement API below, see my
comments below.
>> Our setup has a configuration management system and we use aggregates
>> exclusively when it comes to allocation ratio.
>
>
> Yes, that's going to be a problem. You will need to use your configuration
> management system to write the nova.conf xxx_allocation_ratio configuration
> option values appropriately for each compute node.
Yes, that's my understanding, and it is a concern for us.
>> We do not rely on cpu_allocation_ratio config in nova-scheduler or
>> nova-compute.
>> One of the reasons is we do not wish to have to
>> update/package/redeploy our configuration management system just to
>> add one or multiple compute nodes to an aggregate/capacity pool.
>
>
> Yes, I understand.
>
>> This means anyone (likely an operator or other provisioning
>> technician) can perform this action without having to touch or even
>> know about our configuration management system.
>> We can also transfer capacity from one aggregate to another if there
>> is a need, again, using aggregate memberships.
>
>
> Aggregates don't have "capacity". Aggregates are not capacity pools. Only
> compute nodes provide resources for guests to consume.
Aggregates have been a very useful construct for us so far. You might
not agree with our concept of "capacity pools", but so far that's what
we have, and it has been working very well for years.
Our monitoring/operations are entirely based on this concept. You list
the aggregate members, do some computation and cross-reference with
hypervisor stats, and you have a capacity monitoring system going.
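Roughly, the building blocks are just something like this (very
simplified, names made up):

  openstack aggregate show a1-pool -c hosts -f json
  openstack hypervisor show compute-001 -c vcpus -c vcpus_used -f json

and a script that sums this up per aggregate and compares it against the
ratio that aggregate is supposed to run at.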
>> (we do "evacuate" the node if there are instances on it)
>> Our capacity monitoring is based on aggregate memberships and this
>> offers an easy overview of the current capacity.
>
>
> By "based on aggregate membership", I believe you are referring to a system
> where you have all compute nodes in a particular aggregate only schedule
> instances with a particular flavor "A" and so you manage "capacity" by
> saying things like "aggregate X can fit 10 more instances of flavor A in
> it"?
>
> Do I understand you correctly?
Yes, more or less. We do group compute nodes based on flavor "series".
(we have A1 and B1 series)
>
>> Note that a host can be in one and only one aggregate in our setup.
>
>
> In *your* setup. And that's the only reason this works for you. You'd get
> totally unpredictable behaviour if your compute nodes were in multiple
> aggregates.
Yes. It worked very well for us so far. I do agree that it's not
perfect and that you technically can end up with unpredictable
behaviour if a host is part of multiple aggregates. That's why we
avoid doing it.
>> What's the migration path for us?
>>
>> My understanding is that we will now be forced to have people rely on
>> our configuration management system (which they don't have access to)
>> to perform simple tasks we used to be able to do through the API.
>> I find this unfortunate and I would like to be offered an alternative
>> solution as the current proposed solution is not acceptable for us.
>> We are losing "agility" in our operational tasks.
>
>
> I see a possible path forward:
>
> We add a new CONF option called "disable_allocation_ratio_autoset". This new
> CONF option would disable the behaviour of the nova-compute service in
> automatically setting the allocation ratio of its inventory records for
> VCPU, MEMORY_MB and DISK_GB resources.
>
> This would allow you to set compute node allocation ratios in batches.
>
> At first, it might be manual... executing something like this against the
> API database:
>
> UPDATE inventories
> INNER JOIN resource_providers
>   ON inventories.resource_provider_id = resource_providers.id
>   AND inventories.resource_class_id = $RESOURCE_CLASS_ID
> INNER JOIN resource_provider_aggregates
>   ON resource_providers.id = resource_provider_aggregates.resource_provider_id
> INNER JOIN placement_aggregates
>   ON resource_provider_aggregates.aggregate_id = placement_aggregates.id
>   AND placement_aggregates.uuid = $AGGREGATE_UUID
> SET inventories.allocation_ratio = $NEW_VALUE;
>
> We could follow up with a little CLI tool that would do the above for you on
> the command line... something like this:
>
> nova-manage db set_aggregate_placement_allocation_ratio \
>     --aggregate_uuid=$AGG_UUID --resource_class=VCPU --ratio 16.0
>
> Of course, you could always call the Placement REST API to override the
> allocation ratio for particular providers:
>
> DATA='{"resource_provider_generation": X, "allocation_ratio": $RATIO}'
> curl -XPUT -H "Content-Type: application/json" -H$AUTH_TOKEN -d$DATA \
> https://$PLACEMENT/resource_providers/$RP_UUID/inventories/VCPU
>
> and you could loop through all the resource providers listed under a
> particular aggregate, which you can find using something like this:
>
> curl "https://$PLACEMENT/resource_providers?member_of=$AGG_UUID"
>
> Anyway, there's multiple ways to set the allocation ratios in batches, as
> you can tell.
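>
> Stitched together, a rough, untested sketch of that loop (assuming jq, an
> admin token in $TOKEN, and noting that the PUT replaces the whole inventory
> record, so total has to be carried over and any non-default reserved value
> would need the same treatment; member_of also needs a placement
> microversion >= 1.3):
>
> MV="OpenStack-API-Version: placement 1.3"
> for RP in $(curl -s -H "X-Auth-Token: $TOKEN" -H "$MV" \
>     "https://$PLACEMENT/resource_providers?member_of=$AGG_UUID" \
>     | jq -r '.resource_providers[].uuid'); do
>   INV=$(curl -s -H "X-Auth-Token: $TOKEN" \
>       "https://$PLACEMENT/resource_providers/$RP/inventories")
>   GEN=$(echo "$INV" | jq -r '.resource_provider_generation')
>   TOTAL=$(echo "$INV" | jq -r '.inventories.VCPU.total')
>   curl -s -XPUT -H "X-Auth-Token: $TOKEN" -H "Content-Type: application/json" \
>       -d "{\"resource_provider_generation\": $GEN, \"total\": $TOTAL, \"allocation_ratio\": $RATIO}" \
>       "https://$PLACEMENT/resource_providers/$RP/inventories/VCPU"
> done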
>
> I think the key is somehow disabling the behaviour of the nova-compute
> service of overriding the allocation ratio of compute nodes with the value
> of the nova.conf options.
>
> Thoughts?
So far, a couple challenges/issues:
We used to have fine-grained control over the calls a user could make to
the Nova API:
* os_compute_api:os-aggregates:add_host
* os_compute_api:os-aggregates:remove_host
This means we could make it so our technicians could *ONLY* manage
this aspect of our cloud.
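(In policy.json terms, that was something like the following, with a
dedicated role; the role name here is just an example:)

  "os_compute_api:os-aggregates:add_host": "role:capacity_admin",
  "os_compute_api:os-aggregates:remove_host": "role:capacity_admin"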
With the placement API, it's all or nothing. (and we found some weeks
ago that it's hardcoded to the "admin" role)
And you now have to craft your own curl calls; there is no more UI in
Horizon. (let me know if I missed something regarding the ACL)
I will read about the placement API and see with my coworkers how we
could adapt our systems/tools to use it instead. (assuming
disable_allocation_ratio_autoset will be implemented)
But ACL is a big concern for us if we go down that path.
While I agree there are very technical/raw solutions to the issue
(like the ones you suggested), please understand that from our side,
this is still a major regression in the usability of OpenStack from an
operator point of view.
And it's unfortunate that I feel I now have to play catch up and
explain my concerns about a "fait accompli" that wasn't well
communicated to the operators and wasn't clearly mentioned in the
release notes.
I would have appreciated an email to the ops list explaining the
proposed change and if anyone has concerns/comments about it. I don't
often reply but I feel like I would have this time as this is a major
change for us.
Thanks for your time and suggestions,
--
Mathieu