[openstack-dev] [nova] heads up to users of Aggregate[Core|Ram|Disk]Filter: behavior change in >= Ocata

Jay Pipes jaypipes at gmail.com
Thu Jan 18 22:19:55 UTC 2018


On 01/18/2018 03:54 PM, Mathieu Gagné wrote:
> Hi,
> 
> On Tue, Jan 16, 2018 at 4:24 PM, melanie witt <melwittt at gmail.com> wrote:
>> Hello Stackers,
>>
>> This is a heads up to any of you using the AggregateCoreFilter,
>> AggregateRamFilter, and/or AggregateDiskFilter in the filter scheduler.
>> These filters have effectively allowed operators to set overcommit ratios
>> per aggregate rather than per compute node in <= Newton.
>>
>> Beginning in Ocata, there is a behavior change where aggregate-based
>> overcommit ratios will no longer be honored during scheduling. Instead,
>> overcommit values must be set on a per compute node basis in nova.conf.
>>
>> Details: as of Ocata, instead of considering all compute nodes at the start
>> of scheduler filtering, an optimization has been added to query resource
>> capacity from placement and prune the compute node list with the result
>> *before* any filters are applied. Placement tracks resource capacity and
>> usage and does *not* track aggregate metadata [1]. Because of this,
>> placement cannot consider aggregate-based overcommit and will exclude
>> compute nodes that do not have capacity based on per compute node
>> overcommit.
>>
>> How to prepare: if you have been relying on per aggregate overcommit, during
>> your upgrade to Ocata, you must change to using per compute node overcommit
>> ratios in order for your scheduling behavior to stay consistent. Otherwise,
>> you may notice increased NoValidHost scheduling failures as the
>> aggregate-based overcommit is no longer being considered. You can safely
>> remove the AggregateCoreFilter, AggregateRamFilter, and AggregateDiskFilter
>> from your enabled_filters and you do not need to replace them with any other
>> core/ram/disk filters. The placement query takes care of the core/ram/disk
>> filtering instead, so CoreFilter, RamFilter, and DiskFilter are redundant.
>>
>> Thanks,
>> -melanie
>>
>> [1] Placement has been a clean slate for resource management and prior to
>> placement, there were conflicts between the different methods for setting
>> overcommit ratios that were never addressed, such as, "which value to take
>> if a compute node has overcommit set AND the aggregate has it set? Which
>> takes precedence?" And, "if a compute node is in more than one aggregate,
>> which overcommit value should be taken?" So, the ambiguities were not
>> something that was desirable to bring forward into placement.
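
For reference, the filter change melanie describes boils down to something 
like this on the nodes running nova-scheduler (the filters left in the 
list below are only an example; keep whatever else you already use):

  [filter_scheduler]
  enabled_filters = ComputeFilter,AvailabilityZoneFilter,ImagePropertiesFilter,ServerGroupAntiAffinityFilter,ServerGroupAffinityFilter
  # AggregateCoreFilter, AggregateRamFilter, AggregateDiskFilter and the
  # plain CoreFilter/RamFilter/DiskFilter are intentionally left out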
> 
> So we are a user of this feature and I do have some questions/concerns.
> 
> We use this feature to segregate capacity/hosts based on CPU
> allocation ratio using aggregates.
> This is because we have different offers/flavors based on those
> allocation ratios. This is part of our business model.
> Flavor extra_specs are used to schedule instances on appropriate hosts
> using the AggregateInstanceExtraSpecsFilter.

The AggregateInstanceExtraSpecsFilter will continue to work, but this 
filter is run *after* the placement service would have already 
eliminated compute node records due to placement considering the 
allocation ratio set for the compute node provider's inventory records.
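
For concreteness, the kind of aggregate/flavor wiring Mathieu describes 
usually looks something like this (the "ratio_tier" property name and the 
aggregate and flavor names below are made up):

  # tag the aggregate, then match it from the flavor's extra_specs
  openstack aggregate set --property ratio_tier=4x agg-cpu-4x
  openstack flavor set \
    --property aggregate_instance_extra_specs:ratio_tier=4x flavor-cpu-4x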

> Our setup has a configuration management system and we use aggregates
> exclusively when it comes to allocation ratio.

Yes, that's going to be a problem. You will need to use your 
configuration management system to write the cpu_allocation_ratio, 
ram_allocation_ratio and disk_allocation_ratio option values in nova.conf 
appropriately for each compute node.
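
For example, the rendered nova.conf on each compute node in a given 
capacity pool would carry something like this (the 4.0 here is only an 
example value; each pool would get its own):

  [DEFAULT]
  cpu_allocation_ratio = 4.0
  ram_allocation_ratio = 1.0
  disk_allocation_ratio = 1.0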

> We do not rely on cpu_allocation_ratio config in nova-scheduler or nova-compute.
> One of the reasons is we do not wish to have to
> update/package/redeploy our configuration management system just to
> add one or multiple compute nodes to an aggregate/capacity pool.

Yes, I understand.

> This means anyone (likely an operator or other provisioning
> technician) can perform this action without having to touch or even
> know about our configuration management system.
> We can also transfer capacity from one aggregate to another if there
> is a need, again, using aggregate memberships.

Aggregates don't have "capacity". Aggregates are not capacity pools. 
Only compute nodes provide resources for guests to consume.

> (we do "evacuate" the
> node if there are instances on it)
> Our capacity monitoring is based on aggregate memberships and this
> offers an easy overview of the current capacity.

By "based on aggregate membership", I believe you are referring to a 
system where you have all compute nodes in a particular aggregate only 
schedule instances with a particular flavor "A" and so you manage 
"capacity" by saying things like "aggregate X can fit 10 more instances 
of flavor A in it"?

Do I understand you correctly?

> Note that a host can
> be in one and only one aggregate in our setup.

In *your* setup. And that's the only reason this works for you. You'd 
get totally unpredictable behaviour if your compute nodes were in 
multiple aggregates.

> What's the migration path for us?
> 
> My understanding is that we will now be forced to have people rely on
> our configuration management system (which they don't have access to)
> to perform simple tasks we used to be able to do through the API.
> I find this unfortunate and I would like to be offered an alternative
> solution as the current proposed solution is not acceptable for us.
> We are losing "agility" in our operational tasks.

I see a possible path forward:

We add a new CONF option called "disable_allocation_ratio_autoset". This 
new CONF option would disable the behaviour of the nova-compute service 
in automatically setting the allocation ratio of its inventory records 
for VCPU, MEMORY_MB and DISK_GB resources.

This would allow you to set compute node allocation ratios in batches.

At first, it might be manual... executing something like this against 
the API database:

  UPDATE inventories
  INNER JOIN resource_providers
  ON inventories.resource_provider_id = resource_providers.id
  AND inventories.resource_class_id = $RESOURCE_CLASS_ID
  INNER JOIN resource_provider_aggregates
  ON resource_providers.id = resource_provider_aggregates.resource_provider_id
  INNER JOIN placement_aggregates
  ON resource_provider_aggregates.aggregate_id = placement_aggregates.id
  AND placement_aggregates.uuid = $AGGREGATE_UUID
  SET inventories.allocation_ratio = $NEW_VALUE;

We could follow up with a little CLI tool that would do the above for 
you on the command line... something like this:

  nova-manage db set_aggregate_placement_allocation_ratio \
    --aggregate_uuid=$AGG_UUID --resource_class=VCPU --ratio 16.0

Of course, you could always call the Placement REST API to override the 
allocation ratio for particular providers. Note that the PUT replaces the 
whole inventory record, so the current provider generation and total need 
to be sent along as well:

  DATA="{\"resource_provider_generation\": $GENERATION,
         \"total\": $TOTAL, \"allocation_ratio\": $RATIO}"
  curl -X PUT -H "Content-Type: application/json" \
       -H "X-Auth-Token: $AUTH_TOKEN" -d "$DATA" \
       https://$PLACEMENT/resource_providers/$RP_UUID/inventories/VCPU

and you could loop through all the resource providers listed under a 
particular aggregate, which you can find using something like this:

  curl -H "X-Auth-Token: $AUTH_TOKEN" -H "OpenStack-API-Version: placement 1.3" \
     "https://$PLACEMENT/resource_providers?member_of=in:$AGG_UUID"
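
A rough way to tie the two calls together (assuming jq is available and 
the $PLACEMENT, $AUTH_TOKEN and $AGG_UUID variables from above) would be a 
loop like:

  for RP_UUID in $(curl -s -H "X-Auth-Token: $AUTH_TOKEN" \
        -H "OpenStack-API-Version: placement 1.3" \
        "https://$PLACEMENT/resource_providers?member_of=in:$AGG_UUID" \
        | jq -r '.resource_providers[].uuid'); do
    # PUT the new allocation_ratio for each provider, as shown above
    echo "would update allocation_ratio on provider $RP_UUID"
  done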

Anyway, there are multiple ways to set the allocation ratios in batches, 
as you can tell.

I think the key is somehow disabling the nova-compute service's behaviour 
of overriding the allocation ratio of compute node inventory with the 
values of the nova.conf options.

Thoughts?
-jay


