On 16/04/2025 21:04, Dmitriy Rabotyagov wrote:
Hey,
Have a comment on one AI from the list.
AI: (jgilaber) Mark Monasca and Grafana as deprecated, unless someone steps up to maintain them, which should include a minimal CI job running.
So, as it happens, in OpenStack-Ansible we were planning to revive Watcher role support in the project. The way we usually test a deployment is by spawning an all-in-one environment with drivers and executing a couple of tempest scenarios to ensure basic functionality of the service.
With that in mind, having a native OpenStack telemetry datastore is very beneficial for this goal, as we already maintain the means for spawning a telemetry stack, while a hard requirement on Prometheus would be unfortunate, for us at least.
While I was writing that, I partially realized that testing Watcher on an all-in-one is pretty much impossible as well...
You can certainly test some of Watcher with an all-in-one deployment, i.e. the APIs, and you can use the dummy test strategies. But yes, in general, like Nova, you need at least 2 nodes to be able to test it properly, ideally 3, so that if you're doing a live migration there is actually a choice of host. In general, however, Watcher, like Heat, just asks Nova to actually move the VMs. Sure, it will ask Nova to move an instance to a specific host, but fundamentally, if you have tested live migration with Nova via tempest separately, there is no reason to expect it would not work for a live migration triggered by Watcher or Heat or anything else that just calls Nova's API. So you could still get some valuable testing in an all-in-one, but ideally there would be at least 2 compute hosts.
But at the very least, I can propose looking into adding an OSA job with Gnocchi as non-voting (NV) to the project, to show the state of the deployment with this driver.
Well, Gnocchi is also not a native OpenStack telemetry datastore; it left our community to pursue its own goals and is now a third-party datastore just like Grafana or Prometheus. Monasca is currently marked as inactive (https://review.opendev.org/c/openstack/governance/+/897520) and is in the process of being retired, but it also has no testing on the Watcher side, so the combination of the two is why we are deprecating it going forward. If both of those things change, I'm happy to see the support continue. Gnocchi has testing, but we are not actively working on extending its functionality going forward; as long as it continues to work, I see no reason to change its support status.

Watcher has quite a lot of untested integrations, which is unfortunate. We are planning to build out a feature/test/support matrix in the docs this cycle. For example, Watcher can integrate with both Ironic and Canonical's MAAS component to do some level of host power management. None of that is tested, and we are likely going to mark those integrations as experimental and reflect on whether we can continue to support them going forward. Watcher also has the ability to do Cinder storage pool balancing, which is, I think, also untested right now. One of the things we hope to do is extend the existing testing in our current jobs to cover gaps like that where it is practical to do so. But creating a devstack plugin to deploy MAAS with fake infrastructure is likely a lot more than we can do with our existing contributors, so expect that integration to go to experimental, then deprecated, and finally be removed if no one turns up to support it. Ironic is in the same boat; however, there are devstack jobs with fake Ironic nodes, so I could see a path to us having an Ironic job down the line. It's just not high on our current priority list to address the support status or testing of this right now.
Eventlet removal and other tech debt/community goals are definitely higher, but I hope the new support/testing matrix will at least help folks make informed decisions about which features to use and which backends are recommended going forward.
On Wed, 16 Apr 2025, 21:53 Douglas Viroel, <viroel@gmail.com> wrote:
Hello everyone,
Last week's PTG had very interesting topics. Thank you to all who joined. The Watcher PTG etherpad with all notes is available here: https://etherpad.opendev.org/p/apr2025-ptg-watcher Here is a summary of the discussions that we had, including the great cross-project sessions with the Telemetry, Horizon and Nova teams:
Tech Debt (chandankumar/sean-k-mooney)
=================================

a) Croniter

* The project is being abandoned, as per https://pypi.org/project/croniter/#disclaimer
* Watcher uses croniter to calculate the next scheduled time to run an audit (continuous). It is also used to validate cron-like syntax.
* Agreed: replace croniter with apscheduler's cron methods.
* *AI*: (chandankumar) Fix in master branch and backport to 2025.1
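As a rough illustration of what any croniter replacement has to provide, the sketch below computes the next run time for a simple 5-field cron expression using only the standard library. This is a hypothetical stand-in, not Watcher's actual code: it supports only `*` and comma lists, and uses Python's Monday-based weekday rather than cron's Sunday-based one. In practice the plan is to rely on apscheduler's cron methods, which handle ranges, steps and real cron semantics.

```python
from datetime import datetime, timedelta

def _match(field: str, value: int) -> bool:
    # '*' matches anything; otherwise accept a plain number or comma list.
    if field == "*":
        return True
    return value in {int(part) for part in field.split(",")}

def next_run(expr: str, after: datetime) -> datetime:
    """Return the first minute strictly after `after` matching a 5-field
    cron expression: minute hour day-of-month month day-of-week."""
    minute, hour, dom, month, dow = expr.split()
    t = after.replace(second=0, microsecond=0) + timedelta(minutes=1)
    for _ in range(366 * 24 * 60):  # scan at most ~one year of minutes
        if (_match(minute, t.minute) and _match(hour, t.hour)
                and _match(dom, t.day) and _match(month, t.month)
                and _match(dow, t.weekday())):  # note: Mon=0 simplification
            return t
        t += timedelta(minutes=1)
    raise ValueError("no matching time within a year: %s" % expr)
```

For example, `next_run("30 2 * * *", datetime(2025, 4, 16, 21, 0))` yields 02:30 on the following day, which is the kind of "next schedule time" calculation the continuous audit handler needs.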
b) Support status of Watcher Datasources
* Only Gnocchi and Prometheus have a CI job running tempest tests (with scenario tests)
* Monasca has been inactive since 2024.1
* *AI*: (jgilaber) Mark Monasca and Grafana as deprecated, unless someone steps up to maintain them, which should include a minimal CI job running.
* *AI*: (dviroel) Document a support matrix between Strategies and Datasources, indicating which ones are production-ready or experimental, and their testing coverage.
c) Eventlet Removal
* The team is going to look at how eventlet is used in Watcher and start a PoC of its removal.
* Chandan Kumar and dviroel volunteered to help in this effort.
* Planned for the 2026.1 cycle.
Workflow/API Improvements (amoralej)
==============================

a) Action states

* Currently an Action goes from PENDING to SUCCEEDED or FAILED, but these states do not cover some important scenarios.
* If an Action's pre_conditions check fails, the action is set to FAILED, but in some scenarios it could simply be SKIPPED, letting the workflow continue.
* Proposal: a new SKIPPED state for actions. E.g. in a Nova migration action, if the instance doesn't exist on the source host, it can be skipped instead of failed.
* Proposal: users could also manually skip specific actions from an action plan.
* A skip_reason field could also be added to document the reason behind the skip: user's request, pre-condition check, etc.
* *AI*: (amoralej) Create a spec to describe the proposed changes.
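A minimal sketch of how the proposed state model could look. The `SKIPPED` state and `skip_reason` field follow the proposal above, but the executor logic here is an illustrative assumption, not Watcher's actual code:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional

class ActionState(Enum):
    PENDING = "PENDING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"
    SKIPPED = "SKIPPED"  # proposed new state

@dataclass
class Action:
    name: str
    pre_condition: Callable[[], bool]  # e.g. "instance exists on source host"
    skippable: bool = False            # could also be set by user request
    state: ActionState = ActionState.PENDING
    skip_reason: Optional[str] = None  # proposed field: why it was skipped

def execute_plan(actions):
    """Run actions in order; a failed pre-condition skips the action
    (when allowed) and the workflow continues after a skip."""
    for action in actions:
        if not action.pre_condition():
            if action.skippable:
                action.state = ActionState.SKIPPED
                action.skip_reason = "pre-condition check failed"
                continue  # workflow keeps going past a skipped action
            action.state = ActionState.FAILED
            break
        action.state = ActionState.SUCCEEDED
```

In the Nova migration example, the missing-instance pre-condition would mark that one action SKIPPED with a recorded reason, while the rest of the action plan still runs.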
b) Meaning of SUCCEEDED state in Action Plan
* Currently it means that all actions were triggered, even if all of them failed, which can be confusing for users.
* Docs mention that the SUCCEEDED state means that all actions have been successfully executed.
* *AI*: (amoralej) Report the current behavior as a bug (Priority: High)
  o done: https://bugs.launchpad.net/watcher/+bug/2106407
Watcher-Dashboard: Priorities for next release (amoralej)
===========================================

a) Add integration/functional tests

* The project is missing integration/functional tests and a CI job running against changes in the repo.
* No general conclusion; we will follow up with the Horizon team.
* *AI*: (chandankumar/rlandy) Sync with the Horizon team about testing the plugin with Horizon.
* *AI*: (chandankumar/rlandy) Add a devstack job running on new changes to the watcher-dashboard repo.
b) Add parameters to Audits
* Audit parameters are missing on the watcher-dashboard side. Without them, it is not possible to define some important parameters.
* Should be addressed by a blueprint.
* Contributors to this feature: chandankumar
Watcher cluster model collector improvement ideas (dviroel)
=============================================

* Brainstormed ideas to improve the Watcher collector process, since we still see a lot of issues due to outdated models when running audits.
* Both scheduled model updates and event-based updates are enabled in CI today.
* The current state of event-based updates from Nova notifications is unknown. The code needs to be reviewed and improvements/fixes can be proposed.
  o e.g. https://bugs.launchpad.net/watcher/+bug/2104220/comments/3 - we need to check whether we are processing the right notifications or if it is a bug in Nova.
* Proposal: refresh the model before running an audit. A rate limit should be considered to avoid too many refreshes.
* *AI*: (dviroel) New spec for cluster model refresh, based on audit trigger.
* *AI*: (dviroel) Investigate the processing of Nova events in Watcher.
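The rate-limited refresh-before-audit idea could be sketched like this. The class name, `min_interval` default, and injectable clock are assumptions for illustration only, not the spec's actual design:

```python
import time

class RateLimitedRefresher:
    """Refresh the cluster model before an audit runs, but at most once
    per `min_interval` seconds, to avoid refreshing too often."""

    def __init__(self, refresh_fn, min_interval=60.0, clock=time.monotonic):
        self._refresh_fn = refresh_fn    # e.g. the collector's rebuild call
        self._min_interval = min_interval
        self._clock = clock              # injectable for testing
        self._last_refresh = None

    def maybe_refresh(self):
        """Called at audit start; returns True if a refresh happened."""
        now = self._clock()
        if (self._last_refresh is None
                or now - self._last_refresh >= self._min_interval):
            self._refresh_fn()
            self._last_refresh = now
            return True
        return False  # model considered fresh enough; skip the refresh
```

An audit trigger would call `maybe_refresh()` first, so back-to-back audits reuse the recently refreshed model instead of rebuilding it each time.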
Watcher and Nova's visible constraints (dviroel)
====================================

* Currently, Watcher can propose solutions that include server migrations violating Nova constraints like scheduler_hints, server_groups, pinned_az, etc.
* In the Epoxy release, Nova's API was improved to also show scheduler_hints and image_properties, allowing external services like Watcher to query and use this information when calculating new solutions.
  o https://docs.openstack.org/releasenotes/nova/2025.1.html#new-features
* Proposal: extend the compute instance model to include the new properties, which can be retrieved via novaclient. Update strategies to filter out invalid migration destinations based on these new properties.
* *AI*: (dviroel) Propose a spec to better document the proposal. No API changes are expected here.
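As an example of the kind of filtering a strategy could do once the instance model carries these properties, here is a sketch of excluding destinations that would violate an anti-affinity server group. The data shapes are hypothetical stand-ins; the real values would come from the Nova API fields mentioned above:

```python
def valid_destinations(instance, candidate_hosts, server_groups):
    """Drop candidate hosts that already hold another member of the
    instance's anti-affinity server group.

    instance: dict with 'id' and optional 'group' (server group name)
    candidate_hosts: iterable of host names
    server_groups: {name: {'policy': str,
                           'member_hosts': {instance_id: host}}}
    """
    group_name = instance.get("group")
    if group_name is None:
        return list(candidate_hosts)  # no group constraint to honor
    group = server_groups[group_name]
    if group["policy"] != "anti-affinity":
        return list(candidate_hosts)  # only anti-affinity handled here
    occupied = {host for member, host in group["member_hosts"].items()
                if member != instance["id"]}
    return [h for h in candidate_hosts if h not in occupied]
```

A migration strategy would run its normal host scoring only over the surviving candidates, so it never proposes a move that Nova's scheduler would reject anyway.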
Replacement for noisy neighbor policy (jgilaber)
====================================

* The existing noisy neighbor strategy is based on L3 cache metrics, which are no longer available, since support for them was dropped from the kernel and from Nova.
* In order to keep this strategy, new metrics need to be considered: cpu_steal? io_wait? cache_misses?
* *AI*: (jgilaber) Mark the strategy as deprecated during this cycle
* *AI*: (TBD) Identify new metrics to be used
* *AI*: (TBD) Work on a replacement for the current strategy
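If cpu_steal were the metric chosen, the core of a replacement could look like this simple sketch. The threshold, metric shape, and averaging are assumptions for illustration, not a decided design:

```python
def noisy_hosts(host_metrics, steal_threshold=10.0):
    """Flag hosts whose average per-instance CPU-steal percentage
    exceeds a threshold, as a rough noisy-neighbor signal.

    host_metrics: {host: {instance_id: cpu_steal_pct}}
    Returns flagged hosts sorted worst-first by average steal.
    """
    scores = {}
    for host, instances in host_metrics.items():
        if not instances:
            continue  # no guests, nothing to measure
        avg = sum(instances.values()) / len(instances)
        if avg > steal_threshold:
            scores[host] = avg
    return sorted(scores, key=scores.get, reverse=True)
```

A strategy built on this would then pick migration candidates from the flagged hosts, much as the L3-cache-based version did with its own contention signal.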
Host Maintenance strategy new use case (jeno8)
=====================================

* New use case for the Host Maintenance strategy: instances with ephemeral disks should not be migrated at all.
* Spec proposed: https://review.opendev.org/c/openstack/watcher-specs/+/943873
  o New action to stop instances when both live/cold migration are disabled by the user.
* *AI*: (All) Review the spec and continue the discussion there.
Missing Contributor Docs (sean-k-mooney)
================================

* Doc missing: scope of the project, e.g.: https://docs.openstack.org/nova/latest/contributor/project-scope.html
* *AI*: (rlandy) Create a scope-of-the-project doc for Watcher
* Doc missing: PTL guide, e.g.: https://docs.openstack.org/nova/latest/contributor/ptl-guide.html
* *AI*: (TBD) Create a PTL guide for the Watcher project
* Document: when to create a spec vs. blueprint vs. bug
* *AI*: (TBD) Create a doc section describing the process based on what is being modified in the code.
Retrospective
==========

* The DPL approach seems to be working for Watcher.
* New core members added: sean-k-mooney, dviroel, marios and chandankumar
  o We plan to add more cores in the next cycle, based on reviews and engagement.
  o We plan to remove members not active in the last 2 cycles (starting at 2026.1).
* A new datasource was added: Prometheus
* The Prometheus job now also runs scenario tests, along with Gnocchi.
* We triaged all old bugs from Launchpad.
* Needs improvement:
  o The current team is still learning about details in the code; much of the historical knowledge was lost with the previous maintainers.
  o The core team still needs to grow.
  o We need to focus on creating stable releases.
Cross-project session with Horizon team
===============================

* Combined session with the Telemetry and Horizon teams, focused on how to provide tenant and admin dashboards to visualize metrics.
* The Watcher team presented some ideas for new panels for both admins and tenants, and sean-k-mooney raised a discussion about frameworks that can be used to implement them.
* Use cases that were discussed:
  o a) Admins would benefit from a visualization of infrastructure utilization (real usage metrics), so they can identify bottlenecks and plan optimization.
  o b) A tenant would like to view their workload performance, checking real usage of CPU/RAM/disk of instances, to properly adjust their resource allocation.
  o c) An admin user of the Watcher service would like to visualize metrics generated by Watcher strategies, such as the standard deviation of host metrics.
* sean-k-mooney presented an initial PoC of how a Hypervisor Metrics dashboard could look.
* Proposal for next steps:
  o Start a new Horizon plugin as an official deliverable of the Telemetry project.
  o It is still unclear which framework to use for building charts.
  o The dashboard will integrate with Prometheus as the metric store.
  o It is expected that only short-term metrics will be supported (7 days).
  o python-observability-client will be used to query Prometheus.
Cross-project session with Nova team
=============================

* sean-k-mooney led topics on how to evolve Nova to better assist other services, like Watcher, in taking actions on instances. The team agreed on a proposal to use the existing metadata API to annotate an instance's supported lifecycle operations. This information is very useful for improving Watcher's strategy algorithms. Some examples of instance metadata could be:
  o lifecycle:cold-migratable=true|false
  o ha:maintenance-strategy:in_place|power_off|migrate
* It was discussed that Nova could infer which operations are valid or not based on information like the virt driver, flavor, image properties, etc. This feature was initially named 'instance capabilities' and will require a spec for further discussion.
* Another topic of interest, also raised by Sean, was adding new standard traits to resource providers, like PRESSURE_CPU and PRESSURE_DISK. These traits can be used to weight hosts when placing new VMs. Watcher and the libvirt driver could both annotate them, but the team generally agreed that the libvirt driver is preferred here.
* More info in the Nova PTG etherpad [0] and Sean's summary blog [1].
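A Watcher strategy could consume such metadata roughly as follows. The key names mirror the examples above, while the helper name and the default-to-allowed behavior are illustrative assumptions, not an agreed design:

```python
def allowed_lifecycle_ops(instance_metadata):
    """Extract lifecycle hints from instance metadata annotations of the
    form discussed above, e.g. 'lifecycle:cold-migratable' = 'true'/'false'.
    Operations without an annotation default to allowed (an assumption)."""
    ops = {"cold-migratable": True, "live-migratable": True}
    for key, value in instance_metadata.items():
        if key.startswith("lifecycle:"):
            op = key.split(":", 1)[1]
            ops[op] = value.lower() == "true"
    return ops
```

A migration strategy would then drop, say, cold migration actions for any instance whose metadata marks that operation as false, instead of proposing a move Nova would be unable or unwilling to perform.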
[0] https://etherpad.opendev.org/p/r.bf5f1185e201e31ed8c3adeb45e3cf6d
[1] https://www.seanmooney.info/blog/2025.2-ptg/#watcher-topics
Please let me know if I missed something. Thanks!
-- Douglas Viroel - dviroel