[nova] Wondering about dynamic overcommit scheduling
Hello, I am exploring the potential for resource consolidation from the initial placement perspective, specifically within the filter scheduler. From my research, I understand that OpenStack Watcher can address consolidation in a dynamic manner (monitoring resource usage and using live migration), while consolidation in nova seems to rely on a static overcommitment ratio, without accounting for real resource usage. Has dynamic overcommitment been considered for the initial placement phase? If not, I would like to propose implementing such a feature and would greatly appreciate any feedback or guidance from the community on this idea. Specifically, a custom filter could be interfaced with a monitoring stack to deploy servers as long as certain thresholds are not exceeded. Thank you for your time and input. Best regards, Pierre
Hey, You can do more dynamic distribution during initial scheduling through a metrics weigher. This question was already raised and very well explained by Sean some time ago in this thread: https://lists.openstack.org/archives/list/openstack-discuss@lists.openstack.... On Fri, 7 Feb 2025, 22:54 Jacquet, Pierre, <Pierre.Jacquet@etsmtl.ca> wrote:
Hello,
I am exploring the potential for resource consolidation from the initial placement perspective, specifically within the filter scheduler. From my research, I understand that OpenStack Watcher can address consolidation in a dynamic manner (monitoring resource usage and using live migration), while consolidation in nova seems to rely on a static overcommitment ratio, without accounting for real resource usage. Has dynamic overcommitment been considered for the initial placement phase? If not, I would like to propose implementing such a feature and would greatly appreciate any feedback or guidance from the community on this idea. Specifically, a custom filter could be interfaced with a monitoring stack to deploy servers as long as certain thresholds are not exceeded.
Thank you for your time and input. Best regards,
Pierre
On 07/02/2025 21:53, Jacquet, Pierre wrote:
Hello,
I am exploring the potential for resource consolidation from the initial placement perspective, specifically within the filter scheduler. From my research, I understand that OpenStack Watcher can address consolidation in a dynamic manner (monitoring resource usage and using live migration), while consolidation in nova seems to rely on a static overcommitment ratio, without accounting for real resource usage.
That is partly correct. Nova provides a static resource allocation model: we discover the available capacity of a host and report it to the placement service along with the over-allocation ratios configured by the admin. Placement answers the question "which hosts have enough capacity to fit this request" and, to a lesser degree, which hosts also support abstract capabilities called "traits". Next, nova filters those candidate hosts using scheduler filters, which either enforce policy (e.g. tenant isolation) or non-placement capacity requirements like NUMA affinity. Finally, nova weighs each host to select where to place the instance.
One thing to keep in mind: a non-admin user can never make an API call that affects the resources of another user as a side effect. I.e. if user/project A boots a VM on host 1, and later user/project B makes an API call (say, booting a new VM), that call cannot have the side effect of live migrating user A's VM. That would violate multi-tenancy. Watcher can move instances for better utilization because it only accepts requests from the cloud admin; as such, watcher is allowed to perform actions on workloads belonging to any project. Nova cannot live migrate an instance just because it would improve resource utilization unless requested to do so by an admin. Live migration breaks the SLA for a guest and can cause data loss or downtime even when there is no error or exception.
Under the default policy, normal users are not allowed to live migrate their own instances either, as that is potentially a security risk:
- if they were allowed to specify a host, they could try to attack other VMs on the specified host, directly or indirectly via noisy-neighbor effects.
- even if they cannot specify a host, they could abuse live migration to degrade the performance of the cloud, either as a side effect of the data-transfer requirements or by preventing other legitimate live migrations from happening.
So live migration is admin-only because it is not something that should be exposed to a normal user. We are considering adding a new RBAC role called "manager" which will be able to live migrate instances within its own project but will not be able to specify a host. The manager role would not be assigned to normal end users by default, but could be granted at the admin's discretion to more trusted users that should not have global admin.
Has dynamic overcommitment been considered for the initial placement phase?
Yes, in fact we already have one weigher that does that, but in general it is not a direction we have continued to pursue. The "MetricsWeigher" https://github.com/openstack/nova/blob/master/nova/scheduler/weights/metrics... uses CPU metrics reported by the compute nodes via the cpu_monitor https://github.com/openstack/nova/blob/master/nova/compute/monitors/cpu/virt... This allows the nova scheduler to weigh hosts based on "near" real-time metrics when doing initial placement. Most commonly this is used to weigh hosts based on the CPU load average or I/O wait, to avoid overloading heavily contended hosts.
Gathering workload or host metrics is currently explicitly out of scope for the nova project: https://docs.openstack.org/nova/latest/contributor/policies.html#metrics-gat... So as it stands, expanding the current metrics weigher is not in line with the project direction. The main reasons for this are the extra load it creates on the RPC bus and the implications for multi-cell deployments.
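To make the idea concrete, here is a minimal, standalone sketch of how a metrics-based weigher scores hosts. This is not nova's actual MetricsWeigher code; the HostState class, metric names, and (metric, ratio) settings below are illustrative stand-ins for the real objects nova passes around. A negative ratio means "higher metric value = worse host", which is how you would penalize CPU load or I/O wait.

```python
from dataclasses import dataclass, field


@dataclass
class HostState:
    """Stand-in for a scheduler host record carrying reported metrics."""
    name: str
    # metric name -> most recently reported value from the compute node
    metrics: dict = field(default_factory=dict)


class MetricsWeigher:
    """Score hosts by a weighted sum of their reported metrics."""

    def __init__(self, settings):
        # settings: list of (metric_name, ratio); negative ratio penalizes
        # hosts where that metric is high (e.g. cpu load, io wait).
        self.settings = settings

    def weigh_host(self, host: HostState) -> float:
        return sum(host.metrics.get(name, 0.0) * ratio
                   for name, ratio in self.settings)


hosts = [
    HostState("node1", {"cpu.percent": 0.90, "cpu.iowait.percent": 0.20}),
    HostState("node2", {"cpu.percent": 0.10, "cpu.iowait.percent": 0.01}),
]
weigher = MetricsWeigher([("cpu.percent", -1.0), ("cpu.iowait.percent", -1.0)])
# The host with the highest weight (least contended) wins.
best = max(hosts, key=weigher.weigh_host)
```

The key property is that the weigher only ranks hosts that already passed the filters; it never rejects a host outright, so a loaded host can still be chosen if nothing better is available.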
If not, I would like to propose implementing such a feature and would greatly appreciate any feedback or guidance from the community on this idea. Specifically, a custom filter could be interfaced with a monitoring stack to deploy servers as long as some thresholds are not exceeded
We have discussed telemetry-aware scheduling many times; it comes up every year or so. Nova removed the ability to have pluggable scheduler implementations, so the only scheduler extensions currently possible are custom filters or custom weighers. Upstream scheduler filters are not allowed to make REST calls, arbitrary DB queries, or RPC calls, so they are not allowed to consult external monitoring stacks. That is not to say people have not done this: for example, the ONAP and OpenMano communities explored replacing the nova scheduler logic with a single filter that asked an external scheduler.
If I were to implement this today, I would either do it as a weigher, not a filter, or do it as a post-weigher step. The problem with filters is that they run per host, so it is very inefficient to implement this there. Weighers are less inefficient, as they only process a very small subset of hosts compared to a filter, but the interface we have for weighers still involves calculating a weight on a per-host basis. Ideally you would be able to batch the call so that instead of making one REST call per candidate host you could make one REST call per scheduler request.
When I last played with this, I toyed with two different approaches. One was to use the external monitoring system to label the compute nodes with CUSTOM_* traits denoting which nodes were experiencing load (CPU, RAM, disk, network), similar to Kubernetes *-pressure labels, and then have a weigher that would do soft anti-affinity based on those labels. Access to those traits would be provided by passing the provider summaries for each host to the weigher via the host state object; the scheduler has that info, but it is not currently available to the filters/weighers. This approach could be done without any additional REST/DB calls, so it should scale well.
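A hedged sketch of that first "pressure traits" approach: assume an external monitoring system tags overloaded nodes with CUSTOM_*_PRESSURE traits (analogous to Kubernetes node pressure conditions), and a weigher applies a soft anti-affinity penalty per trait carried. The trait names and `pressure_weight` function are hypothetical, not existing nova or placement APIs.

```python
# Hypothetical trait names an external monitor might apply to loaded nodes.
PRESSURE_TRAITS = {
    "CUSTOM_CPU_PRESSURE",
    "CUSTOM_MEMORY_PRESSURE",
    "CUSTOM_DISK_PRESSURE",
    "CUSTOM_NETWORK_PRESSURE",
}


def pressure_weight(host_traits: set, penalty: float = 1.0) -> float:
    """Soft anti-affinity: subtract a penalty per pressure trait the host
    currently carries. No external call is needed at weigh time, since the
    traits were set out-of-band by the monitoring system."""
    return -penalty * len(host_traits & PRESSURE_TRAITS)


# Candidate hosts and the traits the monitor has applied to each.
candidates = {
    "node1": {"CUSTOM_CPU_PRESSURE", "CUSTOM_MEMORY_PRESSURE"},
    "node2": {"CUSTOM_DISK_PRESSURE"},
    "node3": set(),
}
# Rank best-first: unpressured hosts sort ahead of pressured ones.
ranked = sorted(candidates,
                key=lambda h: pressure_weight(candidates[h]),
                reverse=True)
```

Because the penalty is a weight rather than a filter, pressured hosts remain eligible; they are merely deprioritized, which matches the "soft" anti-affinity intent described above.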
The other approach was to have a weigher that queried Prometheus to calculate a weight for each host by generating a synthetic metric combining a set of host metrics like CPU load, I/O wait, etc. The problem with this is scalability, as it means one call per candidate host if done as a weigher. To make this practical we would likely need to refactor how we do weighing to allow batching when the weigher supports it.
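The batching refactor described above can be sketched as follows. Instead of one monitoring call per candidate host, the weigher makes a single query per scheduler request and scores every candidate from that one response. `query_prometheus` here is a hypothetical stand-in returning canned data; a real implementation would issue one PromQL query against the Prometheus HTTP API with a label filter covering all candidates.

```python
def query_prometheus(hosts):
    """Stand-in for ONE batched PromQL query returning a synthetic load
    metric (e.g. a weighted mix of cpu load and io wait) per host."""
    canned = {"node1": 0.85, "node2": 0.15, "node3": 0.40}
    return {h: canned.get(h, 0.0) for h in hosts}


def weigh_batch(hosts):
    """One external call for the whole scheduler request; lower synthetic
    load yields a higher weight."""
    load = query_prometheus(hosts)
    return {h: 1.0 - load[h] for h in hosts}


weights = weigh_batch(["node1", "node2", "node3"])
best = max(weights, key=weights.get)
```

The design point is that the external call count is O(1) per scheduler request rather than O(candidates), which is what would make a telemetry-backed weigher viable at scale.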
Thank you for your time and input. Best regards,
Pierre
participants (3)
- Dmitriy Rabotyagov
- Jacquet, Pierre
- Sean Mooney