[openstack-dev] [nova] heads up to users of Aggregate[Core|Ram|Disk]Filter: behavior change in >= Ocata
Jay Pipes
jaypipes at gmail.com
Thu Jan 18 22:19:55 UTC 2018
On 01/18/2018 03:54 PM, Mathieu Gagné wrote:
> Hi,
>
> On Tue, Jan 16, 2018 at 4:24 PM, melanie witt <melwittt at gmail.com> wrote:
>> Hello Stackers,
>>
>> This is a heads up to any of you using the AggregateCoreFilter,
>> AggregateRamFilter, and/or AggregateDiskFilter in the filter scheduler.
>> These filters have effectively allowed operators to set overcommit ratios
>> per aggregate rather than per compute node in <= Newton.
>>
>> Beginning in Ocata, there is a behavior change where aggregate-based
>> overcommit ratios will no longer be honored during scheduling. Instead,
>> overcommit values must be set on a per compute node basis in nova.conf.
>>
>> Details: as of Ocata, instead of considering all compute nodes at the start
>> of scheduler filtering, an optimization has been added to query resource
>> capacity from placement and prune the compute node list with the result
>> *before* any filters are applied. Placement tracks resource capacity and
>> usage and does *not* track aggregate metadata [1]. Because of this,
>> placement cannot consider aggregate-based overcommit and will exclude
>> compute nodes that do not have capacity based on per compute node
>> overcommit.
>>
>> How to prepare: if you have been relying on per aggregate overcommit, during
>> your upgrade to Ocata, you must change to using per compute node overcommit
>> ratios in order for your scheduling behavior to stay consistent. Otherwise,
>> you may notice increased NoValidHost scheduling failures as the
>> aggregate-based overcommit is no longer being considered. You can safely
>> remove the AggregateCoreFilter, AggregateRamFilter, and AggregateDiskFilter
>> from your enabled_filters and you do not need to replace them with any other
>> core/ram/disk filters. The placement query takes care of the core/ram/disk
>> filtering instead, so CoreFilter, RamFilter, and DiskFilter are redundant.
>>
>> Thanks,
>> -melanie
>>
>> [1] Placement has been a clean slate for resource management and prior to
>> placement, there were conflicts between the different methods for setting
>> overcommit ratios that were never addressed, such as, "which value to take
>> if a compute node has overcommit set AND the aggregate has it set? Which
>> takes precedence?" And, "if a compute node is in more than one aggregate,
>> which overcommit value should be taken?" So, the ambiguities were not
>> something that was desirable to bring forward into placement.
>
> So we are a user of this feature and I do have some questions/concerns.
>
> We use this feature to segregate capacity/hosts based on CPU
> allocation ratio using aggregates.
> This is because we have different offers/flavors based on those
> allocation ratios. This is part of our business model.
> Flavor extra_specs are used to schedule instances on appropriate
> hosts using the AggregateInstanceExtraSpecsFilter.
The AggregateInstanceExtraSpecsFilter will continue to work, but this
filter runs *after* the placement service has already eliminated
compute nodes based on the allocation ratios set in each compute node
provider's inventory records.
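To make the ordering concrete: for a 2 VCPU / 4GB RAM / 40GB disk
flavor, the scheduler's pre-filter query to placement is conceptually
something like this (values are illustrative; the resources query
parameter needs placement microversion 1.4):

curl -H "X-Auth-Token: $AUTH_TOKEN" \
     -H "OpenStack-API-Version: placement 1.4" \
     "https://$PLACEMENT/resource_providers?resources=VCPU:2,MEMORY_MB:4096,DISK_GB:40"

A compute node with 8 physical cores and cpu_allocation_ratio = 16.0
reports a VCPU capacity of 128 to placement, and it is that per-node
figure (not any aggregate metadata) that determines whether the node
survives this query and is ever seen by your filters.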
> Our setup has a configuration management system and we use aggregates
> exclusively when it comes to allocation ratio.
Yes, that's going to be a problem. You will need to use your
configuration management system to write appropriate values for the
cpu_allocation_ratio, ram_allocation_ratio and disk_allocation_ratio
configuration options in nova.conf on each compute node.
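For example, a compute node that used to live in your "16:1 CPU"
aggregate would need something like this in its nova.conf (the ratios
shown are illustrative):

[DEFAULT]
# per-node overcommit ratios, replacing the per-aggregate metadata
cpu_allocation_ratio = 16.0
ram_allocation_ratio = 1.5
disk_allocation_ratio = 1.0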
> We do not rely on cpu_allocation_ratio config in nova-scheduler or nova-compute.
> One of the reasons is we do not wish to have to
> update/package/redeploy our configuration management system just to
> add one or multiple compute nodes to an aggregate/capacity pool.
Yes, I understand.
> This means anyone (likely an operator or other provisioning
> technician) can perform this action without having to touch or even
> know about our configuration management system.
> We can also transfer capacity from one aggregate to another if there
> is a need, again, using aggregate memberships.
Aggregates don't have "capacity". Aggregates are not capacity pools.
Only compute nodes provide resources for guests to consume.
> (we do "evacuate" the
> node if there are instances on it)
> Our capacity monitoring is based on aggregate memberships and this
> offers an easy overview of the current capacity.
By "based on aggregate membership", I believe you are referring to a
system where you have all compute nodes in a particular aggregate only
schedule instances with a particular flavor "A" and so you manage
"capacity" by saying things like "aggregate X can fit 10 more instances
of flavor A in it"?
Do I understand you correctly?
> Note that a host can
> be in one and only one aggregate in our setup.
In *your* setup. And that's the only reason this works for you. You'd
get totally unpredictable behaviour if your compute nodes were in
multiple aggregates.
> What's the migration path for us?
>
> My understanding is that we will now be forced to have people rely on
> our configuration management system (which they don't have access to)
> to perform simple tasks we used to be able to do through the API.
> I find this unfortunate and I would like to be offered an alternative
> solution as the current proposed solution is not acceptable for us.
> We are losing "agility" in our operational tasks.
I see a possible path forward:
We add a new CONF option called "disable_allocation_ratio_autoset". This
new CONF option would disable the behaviour of the nova-compute service
in automatically setting the allocation ratio of its inventory records
for VCPU, MEMORY_MB and DISK_GB resources.
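On each compute node that might look like this in nova.conf (to be
clear, this option does not exist today; the name and placement are
just part of the proposal):

[DEFAULT]
# proposed, hypothetical option: keep nova-compute from overwriting
# placement inventory allocation ratios that were set out-of-band
disable_allocation_ratio_autoset = True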
This would allow you to set compute node allocation ratios in batches.
At first, it might be manual... executing something like this against
the API database:
-- assumes the nova_api DB schema, where aggregates known to placement
-- live in the placement_aggregates table
-- $RESOURCE_CLASS_ID for the standard classes: 0 = VCPU,
-- 1 = MEMORY_MB, 2 = DISK_GB
UPDATE inventories
INNER JOIN resource_providers
  ON inventories.resource_provider_id = resource_providers.id
  AND inventories.resource_class_id = $RESOURCE_CLASS_ID
INNER JOIN resource_provider_aggregates
  ON resource_providers.id =
     resource_provider_aggregates.resource_provider_id
INNER JOIN placement_aggregates
  ON resource_provider_aggregates.aggregate_id = placement_aggregates.id
  AND placement_aggregates.uuid = $AGGREGATE_UUID
SET inventories.allocation_ratio = $NEW_VALUE;
We could follow up with a little CLI tool that would do the above for
you on the command line... something like this:
nova-manage db set_aggregate_placement_allocation_ratio \
    --aggregate_uuid=$AGG_UUID --resource_class=VCPU --ratio=16.0
Of course, you could always call the Placement REST API to override the
allocation ratio for particular providers:
DATA='{"resource_provider_generation": X, "allocation_ratio": $RATIO}'
curl -XPUT -H "Content-Type: application/json" -H$AUTH_TOKEN -d$DATA \
https://$PLACEMENT/resource_providers/$RP_UUID/inventories/VCPU
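Note that the PUT replaces the entire inventory record for that
resource class, so the payload needs the current total along with the
provider generation; you would GET those first, something like:

curl -s -H "X-Auth-Token: $AUTH_TOKEN" \
     https://$PLACEMENT/resource_providers/$RP_UUID/inventories/VCPU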
and you could loop through all the resource providers listed under a
particular aggregate, which you can find using something like this (the
member_of query parameter requires placement microversion 1.3):

curl -H "X-Auth-Token: $AUTH_TOKEN" \
     -H "OpenStack-API-Version: placement 1.3" \
     "https://$PLACEMENT/resource_providers?member_of=$AGG_UUID"
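Putting the two together, a rough batch update could be a shell loop
like this sketch (assumes jq is available; error handling and the
generation/total handling from above are omitted):

for RP_UUID in $(curl -s -H "X-Auth-Token: $AUTH_TOKEN" \
      -H "OpenStack-API-Version: placement 1.3" \
      "https://$PLACEMENT/resource_providers?member_of=$AGG_UUID" \
      | jq -r '.resource_providers[].uuid'); do
    # GET the VCPU inventory, adjust allocation_ratio, PUT it back
    echo "updating VCPU allocation_ratio for provider $RP_UUID"
done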
Anyway, there are multiple ways to set the allocation ratios in
batches, as you can tell.
I think the key is somehow disabling the nova-compute service's
behaviour of overriding the allocation ratios of compute nodes with the
values of the nova.conf options.
Thoughts?
-jay