Hey,

Have a comment on one AI from the list.

> AI: (jgilaber) Mark Monasca and Grafana as deprecated, unless someone steps up to maintain them, which should include a minimal CI job running.

On the OpenStack-Ansible side, we were planning to eventually revive support for the Watcher role in the project.
We usually test a deployment by spawning an all-in-one environment with drivers and executing a couple of tempest scenarios to ensure basic functionality of the service.

With that, having a native OpenStack telemetry datastore is very beneficial for that goal, as we already maintain the means for spawning a telemetry stack, while a requirement for Prometheus would be unfortunate, for us at least.

While I was writing that, I partially realized that testing Watcher on all-in-one is pretty much impossible as well...

But at the very least, I can propose looking into adding an OSA job with Gnocchi as NV (non-voting) to the project, to show the state of the deployment with this driver.


On Wed, 16 Apr 2025, 21:53 Douglas Viroel, <viroel@gmail.com> wrote:
Hello everyone,

Last week's PTG had very interesting topics. Thank you to all who joined.
The Watcher PTG etherpad with all notes is available here: https://etherpad.opendev.org/p/apr2025-ptg-watcher
Here is a summary of the discussions that we had, including the great cross-project sessions with the Telemetry, Horizon and Nova teams:

Tech Debt (chandankumar/sean-k-mooney)
=================================
a) Croniter
  • Project is being abandoned as per https://pypi.org/project/croniter/#disclaimer
  • Watcher uses croniter to calculate a new schedule time to run a (continuous) audit. It is also used to validate cron-like syntax.
  • Agreed: replace croniter with APScheduler's cron methods.
  • AI: (chandankumar) Fix in master branch and backport to 2025.1
b) Support status of Watcher Datasources
  • Only Gnocchi and Prometheus have CI job running tempest tests (with scenario tests)
  • Monasca is inactive since 2024.1
  • AI: (jgilaber) Mark Monasca and Grafana as deprecated, unless someone steps up to maintain them, which should include a minimal CI job running.
  • AI: (dviroel) Document a support matrix between Strategies and Datasources, which ones are production ready or experimental, and testing coverage.
c) Eventlet Removal
  • Team is going to look at how eventlet is used in Watcher and start a PoC of its removal.
  • Chandan Kumar and dviroel volunteered to help in this effort.
  • Planned for 2026.1 cycle.
Workflow/API Improvements (amoralej)
==============================
a) Actions states
  • Currently, Actions move from PENDING to SUCCEEDED or FAILED, but these states do not cover some important scenarios
  • If an Action's pre_conditions fail, the action is set to FAILED, but in some scenarios it could just be SKIPPED and the workflow could continue.
  • Proposal: New SKIPPED state for actions. E.g.: in a Nova Migration Action, if the instance doesn't exist on the source host, the action can be skipped instead of failing.
  • Proposal: Users could also manually skip specific actions from an action plan.
  • A skip_reason field could also be added to document the reason behind the skip: user's request, pre-condition check, etc.
  • AI: (amoralej) Create a spec to describe the proposed changes.
b) Meaning of SUCCEEDED state in Action Plan
  • Currently means that all actions are triggered, even if all of them fail, which can be confusing for users.
  • Docs mention that SUCCEEDED state means that all actions have been successfully executed.
  • AI: (amoralej) Document the current behavior as a bug (Priority High)
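To make the SKIPPED proposal from item a) concrete, here is a minimal sketch. It is not Watcher code: the `Action` class, `run_action` and the `skip_reason` values are hypothetical names for illustration, and the sketch treats every failed pre-condition as skippable, which is a simplification of "some scenarios".

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class ActionState(Enum):
    PENDING = "PENDING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"
    SKIPPED = "SKIPPED"  # proposed state, assumed here for illustration


@dataclass
class Action:
    # Hypothetical stand-in for a Watcher action, not the real class.
    name: str
    pre_condition: Callable[[], bool]  # False means the action no longer applies
    execute: Callable[[], None]
    state: ActionState = ActionState.PENDING
    skip_reason: Optional[str] = None  # proposed field


def run_action(action: Action, skip_requested: bool = False) -> ActionState:
    """Run one action, skipping it instead of failing when it does not apply."""
    if skip_requested:
        action.state = ActionState.SKIPPED
        action.skip_reason = "user request"
    elif not action.pre_condition():
        action.state = ActionState.SKIPPED
        action.skip_reason = "pre-condition check"
    else:
        try:
            action.execute()
            action.state = ActionState.SUCCEEDED
        except Exception:
            action.state = ActionState.FAILED
    return action.state


# A migration whose instance is no longer on the source host is skipped,
# and the workflow continues with the next action.
migrate = Action("migrate", pre_condition=lambda: False, execute=lambda: None)
resize = Action("resize", pre_condition=lambda: True, execute=lambda: None)
states = [run_action(a) for a in (migrate, resize)]
```

Which pre-condition failures should map to SKIPPED rather than FAILED is exactly what the spec would need to define.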
Watcher-Dashboard: Priorities for next release (amoralej)
===========================================
a) Add integration/functional tests
  • Project is missing integration/functional tests and a CI job running against changes in the repo
  • No general conclusion and we will follow up with Horizon team
  • AI: (chandankumar/rlandy) sync with Horizon team about testing the plugin with horizon.
  • AI: (chandankumar/rlandy) devstack job running on new changes for watcher-dashboard repo.
b) Add parameters to Audits
  • It is missing on the watcher-dashboard side. Without it, it is not possible to define some important parameters.
  • Should be addressed by a blueprint
  • Contributors to this feature: chandankumar
Watcher cluster model collector improvement ideas (dviroel)
=============================================
  • Brainstorm ideas to improve watcher collector process, since we still see a lot of issues due to outdated models when running audits
  • Both scheduled model update and event-based updates are enabled in CI today
  • The current state of event-based updates from Nova notifications is unknown. Code needs to be reviewed and improvements/fixes can be proposed
  • Proposal: Refresh the model before running an audit. A rate limit should be considered to avoid too many refreshes.
  • AI: (dviroel) new spec for cluster model refresh, based on audit trigger
  • AI: (dviroel) investigate the processing of nova events in Watcher
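The "refresh before audit, with a rate limit" idea could look roughly like the sketch below. This only illustrates the rate-limiting part: `RateLimitedRefresher`, `refresh_if_stale` and the 60-second interval are hypothetical, and the real collector API is not modelled.

```python
import time
from typing import Callable, Optional


class RateLimitedRefresher:
    """Refresh a model on demand, but at most once per min_interval seconds."""

    def __init__(self, refresh: Callable[[], None], min_interval: float,
                 clock: Callable[[], float] = time.monotonic):
        self._refresh = refresh
        self._min_interval = min_interval
        self._clock = clock
        self._last_refresh: Optional[float] = None  # nothing refreshed yet

    def refresh_if_stale(self) -> bool:
        """Return True if a refresh was performed, False if it was skipped."""
        now = self._clock()
        if (self._last_refresh is not None
                and now - self._last_refresh < self._min_interval):
            return False  # refreshed recently enough, reuse the current model
        self._refresh()
        self._last_refresh = now
        return True


# Usage with a fake clock so the behaviour is deterministic.
fake_now = [0.0]
calls = []
refresher = RateLimitedRefresher(lambda: calls.append(fake_now[0]),
                                 min_interval=60.0,
                                 clock=lambda: fake_now[0])

results = []
for t in (0.0, 10.0, 59.0, 60.0, 130.0):  # audits triggered at these times
    fake_now[0] = t
    results.append(refresher.refresh_if_stale())
```

Here audits at t=10 and t=59 reuse the model built at t=0, while the others trigger a refresh.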
Watcher and Nova's visible constraints (dviroel)
====================================
  • Currently, Watcher can propose solutions that include server migrations that violate some Nova constraints like: scheduler_hints, server_groups, pinned_az, etc.
  • In the Epoxy release, Nova's API was improved to also show scheduler_hints and image_properties, allowing external services, like Watcher, to query and use this information when calculating new solutions.
  • Proposal: Extend compute instance model to include new properties, which can be retrieved via novaclient. Update strategies to filter invalid migration destinations based on these new properties.
  • AI: (dviroel) Propose a spec to better document the proposal. No API changes are expected here.
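As a rough illustration of filtering invalid migration destinations, here is a sketch under stated assumptions. The `Host` and `Instance` classes and their attributes (`pinned_az`, `anti_affinity_peers`) are hypothetical and do not reflect Watcher's actual compute model; only two constraint types are covered.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Host:
    # Hypothetical, simplified compute node.
    name: str
    az: str
    instances: Set[str] = field(default_factory=set)  # instance UUIDs on the host


@dataclass
class Instance:
    # Hypothetical model attributes derived from Nova's visible constraints.
    uuid: str
    pinned_az: Optional[str] = None
    anti_affinity_peers: Set[str] = field(default_factory=set)  # same server group


def valid_destinations(instance: Instance, hosts: List[Host]) -> List[str]:
    """Return the names of hosts that do not violate the instance's constraints."""
    result = []
    for host in hosts:
        if instance.pinned_az is not None and host.az != instance.pinned_az:
            continue  # would move the instance out of its pinned AZ
        if host.instances & instance.anti_affinity_peers:
            continue  # an anti-affinity peer already runs on this host
        result.append(host.name)
    return result


hosts = [
    Host("h1", "az1", {"b"}),
    Host("h2", "az1"),
    Host("h3", "az2"),
]
inst = Instance("a", pinned_az="az1", anti_affinity_peers={"b"})
candidates = valid_destinations(inst, hosts)
```

In this example only h2 remains: h1 hosts an anti-affinity peer and h3 is in a different AZ.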
Replacement for noisy neighbor policy (jgilaber)
====================================
  • The existing noisy neighbor strategy is based on L3 Cache metrics, which are not available anymore, since support for them was dropped from the kernel and from Nova.
  • In order to keep this strategy, new metrics need to be considered: cpu_steal? io_wait? cache_misses?
  • AI: (jgilaber) Mark the strategy as deprecated during this cycle
  • AI: (TBD) Identify new metrics to be used
  • AI: (TBD) Work on a replacement for the current strategy

Host Maintenance strategy new use case (jeno8)
=====================================
  • New use case for Host Maintenance strategy: instances with ephemeral disks should not be migrated at all.
  • Spec proposed: https://review.opendev.org/c/openstack/watcher-specs/+/943873
    • New action to stop instances when both live/cold migration are disabled by the user
  • AI: (All) Review the spec and continue with discussion there.
Missing Contributor Docs (sean-k-mooney)
================================
Retrospective
==========
  • The DPL approach seems to be working for Watcher
  • New core members added: sean-k-mooney, dviroel, marios and chandankumar
    • We plan to add more cores in the next cycle, based on reviews and engagement.
    • We plan to remove members who have not been active in the last 2 cycles (starting at 2026.1)
  • A new datasource was added: Prometheus
  • Prometheus job now also runs scenario tests, along with Gnocchi.
  • We triaged all old bugs from launchpad
  • Needs improvement:
    • current team is still learning about details in the code; much of the historical knowledge was lost with the previous maintainers
    • core team still needs to grow
    • we need to focus on creating stable releases

Cross-project session with Horizon team
===============================
  • Combined session with the Telemetry and Horizon teams, focused on how to provide a tenant and an admin dashboard to visualize metrics.
  • Watcher team presented some ideas of new panels for both admin and tenants, and sean-k-mooney raised a discussion about frameworks that can be used to implement them
  • Use-cases that were discussed:
    • a) Admin would benefit from a visualization of the infrastructure utilization (real usage metrics), so they can identify bottlenecks and plan optimization
    • b) A tenant would like to view their workload performance, checking real usage of cpu/ram/disk of instances, to properly adjust their resource allocation.
    • c) An admin user of watcher service would like to visualize metrics generated by watcher strategies like standard deviation of host metrics.
  • sean-k-mooney presented an initial PoC of what a Hypervisor Metrics dashboard would look like.
  • Proposal for next steps:
    • start a new horizon plugin as an official deliverable of telemetry project
    • still unclear which framework to use for building charts
    • dashboard will integrate with Prometheus, as metric store
    • it is expected that only short term metrics will be supported (7 days)
    • python-observability-client will be used to query Prometheus

Cross-project session with Nova team
=============================
  • sean-k-mooney led topics on how to evolve Nova to better assist other services, like Watcher, in taking actions on instances. The team agreed on a proposal to use the existing metadata API to annotate an instance's supported lifecycle operations. This information is very useful for improving the algorithms of Watcher's strategies. Some examples of instance metadata could be:
    • lifecycle:cold-migratable=true|false
    • ha:maintenance-strategy:in_place|power_off|migrate
  • It was discussed that Nova could infer which operations are valid or not, based on information like: virt driver, flavor, image properties, etc. This feature was initially named 'instance capabilities' and will require a spec for further discussions.
  • Another topic of interest, also raised by Sean, was about adding new standard traits to resource providers, like PRESSURE_CPU and PRESSURE_DISK. These traits can be used to weight hosts when placing new VMs. Watcher and the libvirt driver could work on annotating them, but the team generally agreed that the libvirt driver is preferred here.
  • More info at Nova PTG etherpad [0] and sean's summary blog [1]
[0] https://etherpad.opendev.org/p/r.bf5f1185e201e31ed8c3adeb45e3cf6d
[1] https://www.seanmooney.info/blog/2025.2-ptg/#watcher-topics
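To show how a strategy might consume the metadata annotations discussed in the Nova session, here is a minimal sketch. It assumes the metadata arrives as a plain key/value mapping with the keys `lifecycle:cold-migratable` and `ha:maintenance-strategy`. The `InstanceHints` class, the `parse_hints` helper and the defaults for missing keys are hypothetical, since the final format is still subject to a spec.

```python
from dataclasses import dataclass
from typing import Mapping

VALID_STRATEGIES = {"in_place", "power_off", "migrate"}


@dataclass(frozen=True)
class InstanceHints:
    # Hypothetical container for the parsed annotations.
    cold_migratable: bool
    maintenance_strategy: str


def parse_hints(metadata: Mapping[str, str]) -> InstanceHints:
    """Parse lifecycle hints from instance metadata.

    Assumption: a missing key means "no restriction", i.e. the instance is
    cold-migratable and can be migrated for maintenance.
    """
    raw_cold = metadata.get("lifecycle:cold-migratable", "true").strip().lower()
    if raw_cold not in ("true", "false"):
        raise ValueError(f"invalid lifecycle:cold-migratable value: {raw_cold!r}")

    strategy = metadata.get("ha:maintenance-strategy", "migrate").strip()
    if strategy not in VALID_STRATEGIES:
        raise ValueError(f"invalid ha:maintenance-strategy value: {strategy!r}")

    return InstanceHints(cold_migratable=(raw_cold == "true"),
                         maintenance_strategy=strategy)


annotated = parse_hints({
    "lifecycle:cold-migratable": "false",
    "ha:maintenance-strategy": "power_off",
})
unannotated = parse_hints({})
```

Rejecting unknown values early keeps a typo in user-supplied metadata from silently changing what a strategy does with the instance.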


Please let me know if I missed something.
Thanks!

--
Douglas Viroel - dviroel