Hi all,

Thank you to everyone who joined the Watcher PTG sessions last week. We had productive discussions across three days covering technical debt, new features, scalability improvements, and testing priorities. The Watcher PTG etherpad used during the sessions can be found here[1]. Below is a summary of the key topics and decisions.

Eventlet Removal (dviroel)
==========================
We reviewed progress on the community-wide eventlet removal effort. Native threading mode with environment variable support was added in Flamingo, and all Watcher components now support native thread mode. We merged the patch making native threading the default for 2026.2.

For remaining tasks, we agreed that new scaling tests will provide updated performance results now that threading is the default. We discussed the need for performance testing but determined that the rally job may not be ready for this cycle. Instead, we plan to leverage emulators or fake drivers for performance testing.

*Agreement*: Eventlet code removal will happen early in the 2027.1 release cycle (around November 2026), including removal of eventlet CI jobs and adding checks for eventlet imports.

Telemetry and Datasources
=========================
a) Prometheus and Aetos Support (dviroel)

We had a cross-team session with the Telemetry team to discuss datasource status. Aetos support was added in Flamingo, and Prometheus was deprecated in Gazpacho. We discussed the timing of Prometheus code removal and agreed to defer it to align with other projects like Cloudkitty and Aodh that still support both datasources. We identified that it would be good to add a lightweight Aetos job to devstack-plugin-prometheus changes and also to test Application Credentials with Aetos in devstack jobs.

*AI*: Work on an Aetos job to run in devstack-plugin-prometheus changes.
*AI*: Test Application Credentials with Aetos in devstack jobs.

b) New Metrics for Watcher (dviroel)

We discussed new metrics that can be consumed by existing strategies and enable implementation of new strategies. Key metrics identified include:
- Network throughput metrics (Ceilometer currently provides packet counts)
- Disk latency (can be calculated from Ceilometer's read/write request counts and times)
- CPU steal time (vcpu.delay from libvirt domstats) for identifying contention in noisy neighbor scenarios. This metric is not available in Ceilometer today.
- Host-level CPU/memory/IO pressure (PSI metrics via node_exporter)

We confirmed that new metrics can be added to Ceilometer without proposing a spec. The Telemetry team can help identify the effort required once we raise specific requirements.

*Agreement*: Provide more details in a blueprint and present it at the weekly IRC meetings.

Cluster Data Model Extensions (amoralej/dviroel)
================================================
We discussed the addition of new attributes to the Cluster Data Model (CDM). Recent additions include pinned_az and flavor extra specs. With the openstacksdk migration complete for nova integration, we can now retrieve scheduler_hints and image_properties from the Nova API.

New attributes under consideration for servers include:
- cpu_info (to identify different architectures)
- OS-EXT-STS:task_state, OS-EXT-STS:vm_state, instance created_at (needed for zone_migration strategy to remove Nova dependency).

For compute nodes, attributes like hypervisor_type, hypervisor_version, and CPU architecture would help identify compatible destination nodes for migrations.

We reached important agreements on the datamodel list API:

*AI*: Update Watcher documentation to reflect that the datamodel list API is frozen.
*AI*: Define and document the deprecation process for CDM attributes in the Contributors Guide, as removing attributes may affect third-party strategies.

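
As a rough illustration of how the compute node attributes mentioned above could help pick compatible destinations, here is a minimal sketch. The `ComputeNode` class, the `cpu_arch` field name, and the compatibility rule (same hypervisor type, same architecture, destination hypervisor not older than the source) are assumptions made for the example; they are not Watcher's data model or an agreed policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComputeNode:
    # Hypothetical stand-in for a CDM compute node, not Watcher's real class.
    hostname: str
    hypervisor_type: str
    hypervisor_version: int
    cpu_arch: str


def compatible_destinations(source, candidates):
    """Return candidates that look usable as migration targets for `source`.

    Assumed rule, for illustration only: same hypervisor type, same CPU
    architecture, and a hypervisor version not older than the source's.
    """
    return [
        node for node in candidates
        if node.hostname != source.hostname
        and node.hypervisor_type == source.hypervisor_type
        and node.cpu_arch == source.cpu_arch
        and node.hypervisor_version >= source.hypervisor_version
    ]


source = ComputeNode("cmp-1", "QEMU", 8002000, "x86_64")
candidates = [
    ComputeNode("cmp-2", "QEMU", 8002000, "x86_64"),
    ComputeNode("cmp-3", "QEMU", 6002000, "x86_64"),   # older hypervisor
    ComputeNode("cmp-4", "QEMU", 9000000, "aarch64"),  # different architecture
]
destination_names = [n.hostname for n in compatible_destinations(source, candidates)]
```

With the sample data, only `cmp-2` passes all three checks. A real implementation would depend on which attributes end up in the CDM and on the rules the team agrees on.
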
Filters for Watcher Strategies (dviroel)
========================================
We discussed how to leverage CDM attributes to provide better solutions through filters that assist strategies in selecting valid destination nodes. Filters would apply to static information (capabilities, node and instance attributes, scheduler hints, server groups, availability zones) but not to metrics, which each strategy evaluates independently.

The discussion covered whether to implement filters as a plugin-based architecture using stevedore, though we noted that the plugin architecture is already available for strategies themselves. Multiple filters could be chained, and selection could be controlled via configuration options.

*Agreement*: Further investigation and refinement of the destination filtering functionality for strategies. A PoC is desired and a spec proposal is needed.

Watcher Dashboard Improvements (chadankumar)
============================================
a) Django 5 Migration

We reviewed the Django 5 migration status. The horizon-tox-python3-django52 job is passing but doesn't test watcher-dashboard due to missing test coverage. We agreed that we need Playwright tests running with both Django 5.2 and older Django versions.

b) User Flows and Playwright Testing

We identified key user flows to test with Playwright integration testing:
- Creating dummy audits, viewing action plans and actions
- Skipping actions and starting action plans
- Audit scope workflow testing with custom parameters
- Archiving action plans, audits, and actions
- Continuous audit flows
- Event-based audit flows (pending tempest plugin work)

*AI*: Review the architecture and patterns proposed for watcher-dashboard.
*AI*: Review watcher-dashboard patches to allow new Playwright tests to be implemented.

CI Improvements (chadankumar)
=============================
a) Grenade Testing

We agreed to move the watcher-grenade job to multinode configuration to properly run scenario tests, as all scenario tests require two nodes. The multinode grenade job only upgrades the controller node.

b) Testing Coverage Gaps

We identified high-priority features missing testing coverage:
- **Audit scope** (marked as must-have, very important feature) - HIGH PRIORITY
- **Event-type audits** - HIGH PRIORITY
- **Baremetal integration** (Ironic) - LOW PRIORITY

For Ironic integration, we need to check the feasibility of getting a CI job simulating Ironic nodes using a fake driver or vbmc. Given the lack of documentation, testing, and use cases, we agreed to deprecate the Ironic integration in this release with removal deferred to future releases.

*AI*: Propose new watcher-tempest-tests for audit scope.
*AI*: Propose new watcher-tempest-tests for event-type audits.
*AI*: Check how feasible it is to get a CI job simulating Ironic nodes using a fake driver or vbmc.

c) Simulator Improvements

We discussed Alfredo's simulator tool, which is under development [2], and agreed it can be incorporated into the project. We identified the need to improve it by generating topology and metric files instead of using large static files, which are harder to review and maintain.

*Agreement*: Propose improvements to the simulator as discussed at the CI improvements session.

Datasources, Integrations, and Strategies Status Review (dviroel)
==================================================================
a) Datasources Review

- **Monasca**: Removed in Gazpacho
- **Prometheus**: Deprecated in Gazpacho; code removal deferred (see Telemetry section)
- **Grafana**: Missing testing and documentation; we agreed to deprecate it in 2026.2

*AI*: Propose Grafana deprecation patch and send email to the mailing list regarding Grafana deprecation.

b) Integration Status

- **MAAS**: Deprecated in Gazpacho; agreed to remove code in Hibiscus as it has eventlet code requiring refactoring
- **Ironic**: Missing documentation and testing; deprecate in this release (see CI Improvements section)
- **Glance and Neutron**: Removed in Gazpacho; documentation needs updates

*AI*: Propose patch for MAAS code removal.
*AI*: Propose Grafana deprecation patch and send email to ML calling out for maintainers.

c) Strategy Removals

We discussed the removal of the deprecated noisy_neighbor strategy and goal. We agreed that we need a "soft-delete" approach where deprecated strategies are blocked from new audits but existing audits using them can still be archived during upgrades. This prevents issues with the database migration.

*AI*: Propose a plan for removing a strategy and a goal from Watcher. Document it in the Contributors Guide.

d) Strategies Missing Tempest Coverage

- **saving_energy**: Depends on node power state change action and Ironic integration
- **storage_capacity**: Tempest test is being skipped; has tech debt with direct Cinder API calls (https://bugs.launchpad.net/watcher/+bug/2142219)
- **uniform_airflow**: Requires host_airflow, host_inlet_temp, host_power metrics
- **outlet_temperature_control**: Requires host_outlet_temp metric

*AI*: Understand what is missing to get storage_capacity test running in CI and propose a fix.
*AI*: Work on storage_capacity tech debt: https://bugs.launchpad.net/watcher/+bug/2142219
*AI*: Investigate whether airflow, temperature, and host power metrics are available in currently supported datasources/exporters.

OpenstackSDK Migration (jgilaber)
=================================
The Nova helper migration is complete. We plan to tackle Keystone, Placement, Ironic, and Cinder using the same pattern applied to Nova helper. The first patch for each helper will remove unused methods. The openstacksdk fix for retry configuration[3] was merged and released in openstacksdk 4.11.0.
We will revisit the Ironic integration migration given the deprecation discussion.

*Agreement*: jgilaber will continue proposing the migration patches during this release. Patches will focus on one service at a time.

Scalability Topics (amoralej)
=============================
a) Limiting Action Plan Size Consistently

We discussed defining a mechanism to limit action plan size consistently across strategies. We agreed on a two-level approach:
- System-wide config settings provide hard limits with default values
- Audit parameters allow users to customize limits (e.g., number of hosts evacuated, total VM migrations)

Different strategies have different applicable limits:
- Consolidation strategies: number of hosts evacuated, total VM migrations
- Workload stabilization: total VM migrations
- Zone migration: total VM migrations, VM migrations per host, total volumes, volumes per backend

Parameter names should be consistent and common across strategies, possibly grouped by resource type. One alternative is to use Audit Scope to reduce the size of the solution and, consequently, the size of the Action Plan, but it may not solve the problem for all use cases. In the zone_migration strategy, existing parameters would be translated to new ones to maintain compatibility.

*Agreement*: Design the implementation and provide documentation or a spec for the team to review.

b) Parallelization of Action Execution

We discussed providing a way to define parallelization of actions in action plan execution for all strategies.

Currently, parallelization is defined at two levels:
- In the Applier: limited by config value max_workers
- In the audit execution: set by the planner (varies by strategy)

We discussed criteria to optimize action plan order common to all strategies:
- Parallelization level per action type
- Interleaving migrations based on source_node when planning migrations
- Pipeline of distinct groups which must be optimized

We agreed that system configuration should define this initially, not exposing it to users. The implementation should consider whether a common planner can be used or if per-strategy planners are needed.

The team also mentioned the possibility of replacing the current implementation of one threadpool per Action Plan with a single shared threadpool, to make it easier for users to configure limits. Another improvement mentioned for the Applier is dispatching Actions instead of Action Plans among the running Appliers.

*Agreement*: Design the implementation and provide documentation or a spec for the team to review.

Migration Action Prioritization and Filtering (amoralej)
========================================================
We briefly discussed implementing a mechanism to prioritize, filter, and order migration actions in action plans that could be reused across audits.

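
One ordering criterion raised in the scalability discussion is interleaving migrations by source node. The sketch below shows one possible way to do that with a simple round-robin; the `(instance, source_node)` tuple shape and the function name are assumptions for the example and do not reflect Watcher's action format or planner API.

```python
from collections import defaultdict
from itertools import chain, zip_longest


def interleave_by_source(migrations):
    """Order migrations round-robin across source nodes.

    `migrations` is a list of (instance, source_node) tuples. This shape is
    an assumption for the example, not Watcher's action format.
    """
    by_source = defaultdict(list)
    for instance, source_node in migrations:
        by_source[source_node].append((instance, source_node))

    # Take one migration per source node on each round; pad shorter queues
    # with a sentinel that is dropped afterwards.
    missing = object()
    rounds = zip_longest(*by_source.values(), fillvalue=missing)
    return [m for m in chain.from_iterable(rounds) if m is not missing]


planned = [
    ("vm-a", "cmp-1"), ("vm-b", "cmp-1"), ("vm-c", "cmp-1"),
    ("vm-d", "cmp-2"), ("vm-e", "cmp-3"),
]
ordered = interleave_by_source(planned)
```

With this input, consecutive migrations come from different source nodes for as long as more than one node still has pending instances, so a parallel executor would be less likely to load a single host with several migrations at once.
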
Use cases include:
- Not migrating instances marked with specific metadata, tags, or hints
- Prioritizing migrations based on instance characteristics (CPU count, memory size, metrics)
- Applying ordering criteria to optimize execution time (e.g., round-robin distribution based on source compute node)

We identified two points where this feature can be applied:
- **Strategy execution**: Where filtering must be done to avoid breaking efficacy indicators
- **Strategy planner**: Where ordering may be applied for parallelization

Examples of filters:
- Do not migrate VMs with certain metadata in Nova (e.g., "gold" or "untouchable")
- Do not migrate VMs on compute nodes marked in Placement

*Agreement*: These mechanisms were partially discussed in other topics, so we agreed to revisit this in future discussions.

Audit Pipeline (previously Strategy Stacking) (dviroel)
=======================================================
We wrapped up discussion on the audit pipeline feature design, covering the PipelineHandler, Audit Pipeline resource, mutable CDM, metric caching process, and planner optimizations. We briefly covered some outputs from a proof of concept made with an audit pipeline of 2 strategies[4].

Key agreements:
- Use of nullable strategy_id field to enable dropping it later.
- Minimum 2 stages per pipeline, maximum configurable via config option.
- For CONTINUOUS audit pipelines, changing some specific parameters will be supported initially; changing stages themselves may not be supported yet. The spec needs to document mutable fields clearly.

*AI*: Update audit pipeline spec[5] to reflect recent comments and agreements from the PTG session.

Workload Profiles and Tiers (dviroel)
=====================================
We discussed methods to identify workloads with different priorities that can be used by different/existing strategies. Two main approaches were identified:

1. **Flavor extra-specs**: Using keys like `optimize:priority` or `optimize:tier` (gold/silver/bronze)
   - Can be used by strategies to select low-priority VMs first
   - Can be used in audit scope to filter instances by priority/tier
   - Should be aligned with Nova's usage of extra-specs to avoid overloading their meaning
2. **Instance metadata**: Using Watcher custom keys like `watcher-priority` metadata
   - Approach used in the Noisy Neighbor strategy, but less likely to be adopted by new strategies.

The information would be available in the Compute Cluster Data Model for possible identified use cases:
- Audit scope attributes to exclude/include instances based on profiles or priorities
- New SLA enforcement strategy with metrics enforced based on performance profile. Today this is the main use case for this feature.

*Agreement*: We agreed that any full proposal should include a real use case and be applicable in existing strategies or propose a new strategy/goal. Watcher can support multiple ways of annotating instances with overriding rules for flexibility.

Preemptable Instances (winiciusallan)
=====================================
We discussed a new feature for preemptable instances that would allow Watcher to preempt/shelve instances to achieve specific goals. Use cases include:
- Public clouds offering instances at lower prices using spare capacity
- Private clouds prioritizing workloads such as HPC or AI jobs

The high-level implementation concept includes:
- Flavor/server property (e.g., `lifecycle:preemptable=true|false`) to identify preemptable instances at creation time
- New Goal and Strategy for preemptable instances
- New action(s) for delete or shelve operations
- Threshold definitions for metrics to be considered
- Policies for deletion vs shelving behavior

This builds on previous community discussions and implementations like CERN's Aardvark (triggered by NoValidHost) and NeCTAR-RC's similar service.

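
To make the first item of the concept more concrete, here is a minimal sketch of how preemptable instances might be identified from flavor extra specs. The `lifecycle:preemptable` key is the example from the discussion; the function name, the plain-dict input, and treating a missing key as non-preemptable are assumptions for illustration, since the actual design is still to be proposed in a spec.

```python
def is_preemptable(flavor_extra_specs):
    """Return True when the flavor marks the instance as preemptable.

    The `lifecycle:preemptable` key is the example from the PTG notes;
    treating a missing key as non-preemptable is an assumption.
    """
    value = flavor_extra_specs.get("lifecycle:preemptable", "false")
    return str(value).strip().lower() == "true"


# Hypothetical instances mapped to their flavor extra specs.
instances = {
    "vm-batch": {"lifecycle:preemptable": "true"},
    "vm-db": {"lifecycle:preemptable": "false"},
    "vm-web": {},
}
preemptable = sorted(
    name for name, specs in instances.items() if is_preemptable(specs)
)
```

Here only `vm-batch` would be a candidate for a delete or shelve action; the choice between the two operations would come from the policies mentioned above.
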
*AI*: Work on a use case and propose a solution using Watcher, including a spec with the feature design.

Gazpacho Retrospective
======================
We conducted a retrospective for the Gazpacho release:

**What Worked Well:**
- Weekly IRC meetings held consistently throughout the release
- 2 new core reviewers added
- Review feedback within a week
- No untriaged bugs in backlog with improved triaging process and tagging
- Refactored common code in watcher-dashboard and watcher-tempest-plugin
- Good testing and analysis on scaling issues
- Better dashboard testing tooling (focus on Playwright)
- Better deadline management with less hurried merges
- Documentation improvements
- Adopted ruff for code modernization

**What Needs Improvement:**
- Stable branch maintenance and release cadence
- Missing tempest coverage: strategies, audit scope, integrations, event-type audits
- Functional tests

*AI*: Include checkpoints in the current release so the team can review the status of stable backports, and release-liaisons can report on latest stable releases.

Thanks again to everyone who participated!

Douglas Viroel (dviroel), assisted by Claude Sonnet 4.5

[1] https://etherpad.opendev.org/p/watcher-2026.2-ptg
[2] https://review.opendev.org/c/openstack/watcher/+/980257
[3] https://bugs.launchpad.net/openstacksdk/+bug/2142571
[4] https://etherpad.opendev.org/p/watcher-audit-pipeline
[5] https://review.opendev.org/c/openstack/watcher-specs/+/969840

--
Douglas Viroel - (irc: dviroel)