Hi all,

Last week's PTG had a lot of great discussions, covering many different aspects of the Watcher project.
If you want more details about a topic, you can find the PTG etherpad with all the notes here: https://etherpad.opendev.org/p/watcher-2026.1-ptg
Here is a summary of the discussions that we had:

Future of datasource backends and untested integrations
Discussion: The discussion revolved around the removal timeline for Monasca, the status of Gnocchi, and the deprecation of the Prometheus datasource in favor of Aetos. There was also a proposal to deprecate MAAS support due to lack of testing and documentation, and to re-evaluate the experimental integrations in 2026.2.
Agreed:
  • AI(jgilaber): Monasca datasource will be removed this cycle.
  • More research is needed on current Gnocchi usage, and there will be no change in its support this cycle. Suggestion: include a question about Gnocchi in the next OpenStack user survey.
  • AI(jgilaber): Prometheus will be deprecated in 2026.1 in favor of Aetos. Suggestion: include documentation on how to upgrade between datasources.
  • AI(dviroel): MAAS will be marked as deprecated this cycle, since it has Eventlet-dependent code that is planned to be removed in the near future. We will send another email to the ML to call for maintainers.
  • AI(sean-k-mooney): Update Watcher documentation about service integrations that are now deprecated.

OpenStack SDK
Discussion: The discussion focused on adopting the OpenStack SDK so that the Python bindings can eventually be deprecated and removed. The goal for watcher is to replace the current usage of each project's client with the SDK (where supported). The goal for python-watcherclient is to only provide the openstackclient plugin, replacing the Python bindings by adopting the SDK (when available). For watcher-dashboard, the goal is to use the SDK exclusively for API interactions (also when available). A minimal sketch of the direction follows the agreed items below.
Agreed:
  • Start with the watcher integrations. The goal is to move one service client (e.g. Nova) to the SDK this cycle.
  • Do not freeze python bindings until SDK support is in place.
  • A single spec can address the overall plan.
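
To make the direction concrete, here is a minimal sketch of what replacing a dedicated Nova client with the SDK could look like. This is only an illustration: the connection setup follows standard openstacksdk usage, and the function name is hypothetical, not Watcher code.

    import openstack

    def list_servers():
        # Standard openstacksdk connection, configured via
        # clouds.yaml or OS_* environment variables.
        conn = openstack.connect(cloud='envvars')
        # The compute proxy replaces a dedicated novaclient instance.
        for server in conn.compute.servers(all_projects=True):
            print(server.id, server.name, server.status)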

Code modernization, dependencies and dead code removal
Discussion: The discussion covered code modernization using pyupgrade and potentially ruff for cleanup. It also addressed the removal of dead code, such as API routes that always raised exceptions and commented-out code in production files. The removal of various client dependencies (Neutron, Glance, Monasca) was also proposed.
Agreed:
  • Apply the same pre-commit and ruff linting checks to the tempest plugin and watcher client.
  • Defer decisions on broader type annotations until necessary, possibly starting with interfaces.
  • Explore reducing the number of dependencies, such as the multiple timezone libraries currently in use.

Applier's Workflow Execution and Its Interface/Contract
Discussion: The workflow for the Applier was identified as poorly documented, leading to questions during reviews of new actions. The need for a default Action interface was discussed.
Agreed:
  • The assessment of a default Action interface will depend on the chosen path for the rollback and aborting topics.
  • AI(dviroel): The current interface needs to be documented regardless (see the sketch below).
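
For reference, the contract under discussion roughly corresponds to the methods of the base action class; the sketch below is a simplified illustration of that interface, not the exact upstream code:

    import abc

    class BaseAction(abc.ABC):
        """Simplified sketch of the contract an Applier action implements."""

        @abc.abstractmethod
        def pre_condition(self):
            """Validate that the action can be applied (resources exist)."""

        @abc.abstractmethod
        def execute(self):
            """Apply the change to the cluster."""

        @abc.abstractmethod
        def revert(self):
            """Undo the change; today this is neither tested nor called."""

        @abc.abstractmethod
        def post_condition(self):
            """Verify the expected end state after execution."""

        def abort(self):
            """Optional hook: only actions supporting cancellation override it."""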

Applier: Aborting running tasks
Discussion: The current implementation, in which the Applier spawns a new green thread for each action and kills the thread for actions that support abort(), was discussed.
Agreed:
  • Stop spawning/killing threads on every action. This code can be refactored right after we merge the eventlet changes in the applier (so we don't mix the proposals).
  • Improve the execute() method in actions to check the resource status and abort the processing/looping when an action is cancelled/aborted (see the sketch below).
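
As an illustration of the second point, a hedged sketch of a cooperative-cancellation loop; the class and flag names are hypothetical, not Watcher's actual code:

    import threading
    import time

    class CancellationAwareAction:
        """Hypothetical action that polls a cancel flag instead of
        relying on the Applier killing its thread."""

        def __init__(self, poll_interval=1.0):
            self.cancel_requested = threading.Event()
            self.poll_interval = poll_interval

        def execute(self):
            while not self._resource_ready():
                if self.cancel_requested.is_set():
                    # Cooperative abort: stop looping and report failure
                    # instead of being killed mid-operation.
                    return False
                time.sleep(self.poll_interval)
            return True

        def _resource_ready(self):
            # Placeholder for a real status check, e.g. polling Nova
            # until a migration reaches a final state.
            return False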

Applier: Rollback of Action Plans
Discussion: The current lack of a working rollback mechanism and the fact that the revert() method from Actions is not being tested or called were highlighted. The future of the rollback option was debated. New rollback mechanisms, such as a user-triggered "rollback" action for failed action plans, were also considered.
Agreed:
  • Auto-revert does not work and should be removed in the future.
  • AI(dviroel): Current behavior should be treated as a bug, and the documentation and associated configuration options should be updated accordingly.
  • A new spec should be proposed for a new action plan revert workflow.

CI Testing and Coverage
Discussion: The naming and refactoring of the watcher CI jobs were discussed. It was emphasized that every voting job in check should also run in the gate, on stable branches, and in the other watcher projects. Job renames and consolidation were proposed, along with enabling tempest scenario jobs for stable branches and creating a new grenade job for upgrade testing.
Agreed:
  • AI(dviroel): Job renames/consolidation: watcher-functional can be merged into other tempest jobs, watcher-tempest-actuator can be merged into strategies job, and ipv6 job should be integrated into other existing tempest jobs. Gate updates will occur after renaming and merging jobs. Backporting changes from master to stable branches is expected.
  • AI(dviroel): In watcher-tempest-plugin: replace tempest-functional with a tempest job that runs scenario tests for stable branch validation.
  • AI(chandankumar): Add a new grenade job to test upgrades between SLURP releases, and include more testing in existing jobs.
  • AI(sean-k-mooney): Propose a watcher job to run against OpenStack requirements project.

Improving testing coverage for strategies by doing functional testing
Discussion: A specification proposal and a detailed implementation plan for improving testing coverage for strategies through functional testing were presented.
Agreed:
  • A phased approach will be taken (a phase-1 sketch follows this list):
    • 1st phase, AI(amoralej): API-only GETs/POSTs.
    • 2nd phase: add the decision-engine + Nova + the Prometheus datasource.
    • 3rd phase: add the Applier.
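
To make phase 1 concrete, an API-only functional test could look roughly like the sketch below; the endpoint path follows the Watcher API reference, while the base URL, port and token handling are placeholders:

    import requests

    BASE_URL = "http://127.0.0.1:9322"      # placeholder watcher-api endpoint
    HEADERS = {"X-Auth-Token": "placeholder-token"}

    def test_list_audits():
        # Phase 1 exercises only the API layer; no decision-engine,
        # datasource or applier is involved.
        resp = requests.get(f"{BASE_URL}/v1/audits", headers=HEADERS)
        assert resp.status_code == 200
        assert "audits" in resp.json()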

Rally Testing in watcher
Discussion: The current status of Rally testing in Watcher was reviewed, noting that the rally-task-watcher job runs but does not live in the Watcher repository. Missing functionality was identified, such as the inability to pass the audit template scope and parameters, the lack of auto-triggering, and no support for event audits, continuous audits, action plans, and actions.
Agreed:
  • AI(chandankumar): Move the rally-openstack watcher plugin code to the watcher repo.
  • AI(chandankumar): Add a periodic job to run the rally jobs.
  • Revisit this topic in the future as we progress on scaling the watcher CI.

Watcher-dashboard improvements
Discussion: Improvements to the Watcher dashboard were discussed, including adding automatic page refresh for audit/action plan status, a start button on the action plan details page, and options for bulk archiving. Dashboard testing using the Django test framework and Playwright was also covered.
Agreed:
  • Create wishlist bugs or blueprints to track dashboard improvements.
  • AI(chandankumar): Check with the TC regarding the usage of pytest to improve the wording, and proceed with the integration tests implementation.
  • A spec will be required for the audit and action plan bulk archive feature.

The future of datamodel list API
Discussion: The utility of the datamodel list API, beyond the tempest tests, was questioned. The possibility of freezing the API with its existing content, to avoid microversion bumps for new instance/node updates, was raised. The addition of new data models (storage and baremetal) was also discussed.
Agreed:
  • Do NOT extend the API to support additional models.
  • Defer the removal of datamodel list (compute model) to future discussions, but for now, do not extend it further, even if new fields are added to model elements.
  • AI(dviroel): Add a test to avoid new API changes in datamodel list (see the sketch below).
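
A test along these lines could pin the current field set so that any accidental extension fails in CI; the fields below are illustrative only, not the real model schema:

    import unittest

    class TestDataModelListIsFrozen(unittest.TestCase):
        # Illustrative only: the real test would pin the exact fields
        # the datamodel list API returns today.
        FROZEN_FIELDS = frozenset({"uuid", "name", "state"})

        def test_no_new_fields_exposed(self):
            element = {"uuid": "abc", "name": "node-0", "state": "up"}
            self.assertEqual(self.FROZEN_FIELDS, frozenset(element))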

Eventlet Removal
Discussion: Changes made in the Flamingo cycle regarding Eventlet removal were reviewed. The need for a collector sync timeout in threading mode and the refactoring of the action plan cancel workflow were discussed.
Agreed:
  • AI(dviroel): Collector timeout topic: REST API calls should continue to have their own timeouts, and an event trigger can assist in stopping the overall sync process (see the sketch after this list). Adding a new config option for the collector timeout should be considered.
  • Applier: stop killing threads when an action plan is cancelled. We can keep the current behavior for eventlet, but it will be a noop in threading mode (minimal impact on the eventlet removal changes).
  • AI(dviroel): Mark MAAS as deprecated after the PTG. We will send an email to the ML to call for maintainers once again. (Same AI as in the 'Future of datasource backends' topic.)
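
For the collector timeout, one way to combine per-call timeouts with an overall stop signal is a shared event that both a timer and an external trigger can set; a minimal sketch, with hypothetical helper names:

    import threading

    stop_sync = threading.Event()

    def sync_collector(resources, overall_timeout=600):
        # A timer sets the same event an external trigger would, so the
        # loop can bail out between REST calls (each of which keeps its
        # own client-side timeout).
        timer = threading.Timer(overall_timeout, stop_sync.set)
        timer.start()
        try:
            for resource in resources:
                if stop_sync.is_set():
                    break  # stop the overall sync without killing threads
                fetch_resource(resource)  # hypothetical per-call fetch
        finally:
            timer.cancel()

    def fetch_resource(resource):
        # Placeholder for a REST call with its own timeout.
        pass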

Future of the Noisy Neighbor Goal/Strategy
Discussion: The deprecation of cache monitoring metrics in the kernel and Nova, which formed the basis of the current noisy neighbor strategy, was discussed. The need to replace this strategy and identify new metrics for contention (e.g., CPU steal, CPU pressure, IOWait) and noisy neighbors (e.g., CPU usage from low-priority instances) was explored. Instance priority mechanisms were also considered.
Agreed:
  • Remove the current noisy neighbor strategy in 2026.2+ so that the deprecation ships in a SLURP release.
  • AI(dviroel): A proof of concept for CPU steal/IOWait/other metrics is a nice-to-have, to replace the current LLC monitoring metrics (see the sketch after this list).
  • Use instance metadata for the PoC, but consider different solutions for identifying/classifying workload priorities, like tiering based on flavor extra spec information.
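
As a starting point for the PoC, CPU contention is already exposed by the kernel's PSI interface (Linux 4.20+, when PSI is enabled); a minimal sketch of reading it:

    def read_cpu_pressure(path="/proc/pressure/cpu"):
        # PSI lines look like:
        #   some avg10=0.12 avg60=0.08 avg300=0.05 total=123456
        metrics = {}
        with open(path) as f:
            for line in f:
                kind, *fields = line.split()
                metrics[kind] = dict(
                    (key, float(value)) for key, value in
                    (field.split("=") for field in fields))
        return metrics

    # metrics["some"]["avg60"] is the share of the last 60s in which
    # at least one task stalled waiting for CPU.
    print(read_cpu_pressure())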

Scaling Watcher
Discussion: The limitations of running a single instance of the Decision Engine and the Applier were discussed, along with ideas for horizontal scalability. Scalability concerns were raised for the different watcher resources: audits, actions and action plans. Some of the open issues discussed were:
  • Centralized datamodel: it is currently managed independently by each decision-engine. Moving the model from in-memory to memcached or the database was discussed.
  • ONESHOT audits do not fail over in a multi-decision-engine deployment.
  • When an applier dies, the action plan remains ONGOING and is only cancelled upon the applier's restart.
The main issue was identified as continuous audits being associated with a specific decision-engine. Moving to an event-driven model with stateless decision engines and appliers was proposed, where the data models would reside in the database or a shared data store.
Agreed:
  • For now, instrument the decision-engine to measure the size of the model and the time taken to process notifications (see the sketch after this list).
  • Keep current failover behavior for ONESHOT audits.
  • We could have a service monitor in the applier to reschedule PENDING action plans and cancel ONGOING ones, providing the proper status message.
  • There should be a way to limit and set the concurrency of actions in audits. In this cycle we will propose a mechanism for setting them at the system or audit level that covers all the strategies.
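
For the instrumentation item, a minimal sketch of timing notification processing; the wrapper and its integration point are hypothetical:

    import logging
    import sys
    import time

    LOG = logging.getLogger(__name__)

    def timed(handler, notification):
        # Hypothetical wrapper: measure how long applying one
        # notification to the in-memory model takes.
        start = time.monotonic()
        handler(notification)
        LOG.info("notification processed in %.3fs", time.monotonic() - start)

    def model_size_bytes(model):
        # Rough estimate only; a real implementation would walk the
        # model elements rather than use a flat sys.getsizeof().
        return sys.getsizeof(model)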

Stacking Strategies
Discussion: The possibility of stacking strategies was brainstormed in this topic. Ideas included the sequential execution of multiple strategies, where a mutable cluster data model would be shared among them. This could result in a list of linked action plans, or in merging all actions into a single action plan to avoid unnecessary steps. This discussion may be revisited soon, with a more detailed use case.
Agreed:
  • Get back to this topic when we have more detailed use cases. Propose a spec to highlight the need for this feature in watcher.

Please let me know if I missed something.
Thanks!

Assisted-By: Gemini

--
Douglas Viroel - dviroel