On 16/04/2025 21:04, Dmitriy Rabotyagov wrote:
Hey,
Have a comment on one AI from the list.
AI: (jgilaber) Mark Monasca and Grafana as deprecated, unless someone steps up to maintain them, which should include a minimal CI job running.
So, as it happens, in OpenStack-Ansible we were planning to revive Watcher role support in the project. The way we usually test a deployment is by spawning an all-in-one environment with drivers and executing a couple of tempest scenarios to ensure basic functionality of the service.
With that in mind, having a native OpenStack telemetry datastore is very beneficial for this goal, as we already maintain the means for spawning a telemetry stack, while a hard requirement on Prometheus would be unfortunate, for us at least.
While I was writing that, I partially realized that testing Watcher on an all-in-one is pretty much impossible as well...
You can certainly test some of Watcher with an all-in-one deployment, i.e. the APIs, and you can use the dummy test strategies. But yes, in general, like Nova, you need at least 2 nodes to be able to test it properly, ideally 3, so that if you're doing a live migration there is actually a choice of host. In general, however, Watcher, like Heat, just asks Nova to actually move the VMs. Sure, it will ask Nova to move an instance to a specific host, but fundamentally, if you have tested live migration with Nova via tempest separately, there is no reason to expect it would not work for a live migration triggered by Watcher or Heat or anything else that just calls Nova's API. So you could still get some valuable testing in an all-in-one, but ideally there would be at least 2 compute hosts.
But at the very least, I can propose looking into adding an OSA job with Gnocchi as non-voting (NV) to the project, to show the state of the deployment with this driver.
Well, Gnocchi is also not a native OpenStack telemetry datastore; it left our community to pursue its own goals and is now a third-party datastore just like Grafana or Prometheus. Monasca is currently marked as inactive (https://review.opendev.org/c/openstack/governance/+/897520) and is in the process of being retired, but it also has no testing on the Watcher side, so the combination of the two is why we are deprecating it going forward. If both of those things change, I'm happy to see the support continue. Gnocchi has testing, but we are not actively working on extending its functionality going forward; as long as it continues to work, I see no reason to change its support status.

Watcher has quite a lot of untested integrations, which is unfortunate. We are planning to build out a feature/test/support matrix in the docs this cycle. For example, Watcher can integrate with both Ironic and Canonical's MAAS component to do some level of host power management. None of that is tested, and we are likely going to mark those integrations as experimental and reflect on whether we can continue to support them going forward. Watcher also has the ability to do Cinder storage pool balancing, which is, I think, also untested right now. One of the things we hope to do is extend the existing testing in our current jobs to cover gaps like that where it is practical to do so. But creating a devstack plugin to deploy MAAS with fake infrastructure is likely a lot more than we can do with our existing contributors, so expect that integration to go to experimental, then deprecated, and finally be removed if no one turns up to support it. Ironic is in the same boat; however, there are devstack jobs with fake Ironic nodes, so I could see a path to us having an Ironic job down the line. It's just not high on our current priority list to address the support status or testing of this right now.
Eventlet removal and other tech debt/community goals are definitely higher, but I hope the new support/testing matrix will at least help folks make informed decisions about which features to use and which backends are recommended going forward.
On Wed, 16 Apr 2025, 21:53 Douglas Viroel, <viroel@gmail.com> wrote:
Hello everyone,
Last week's PTG had very interesting topics. Thank you to all who joined. The Watcher PTG etherpad with all notes is available here: https://etherpad.opendev.org/p/apr2025-ptg-watcher Here is a summary of the discussions that we had, including the great cross-project sessions with the Telemetry, Horizon and Nova teams:
Tech Debt (chandankumar/sean-k-mooney)
=================================

a) Croniter

* The project is being abandoned, as per https://pypi.org/project/croniter/#disclaimer
* Watcher uses croniter to calculate the next scheduled time to run an audit (continuous). It is also used to validate cron-like syntax.
* Agreed: replace croniter with apscheduler's cron methods.
* *AI*: (chandankumar) Fix in master branch and backport to 2025.1
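As a rough illustration of what any croniter replacement has to provide, the sketch below computes the next run time for a simple 5-field cron expression using only the standard library. This is a hypothetical stand-in, not Watcher's actual code: it supports only `*` and comma lists, and uses Python's Monday-based weekday rather than cron's Sunday-based one. In practice the plan is to rely on apscheduler's cron methods, which handle ranges, steps and real cron semantics.

```python
from datetime import datetime, timedelta

def _match(field: str, value: int) -> bool:
    # '*' matches anything; otherwise accept a plain number or comma list.
    if field == "*":
        return True
    return value in {int(part) for part in field.split(",")}

def next_run(expr: str, after: datetime) -> datetime:
    """Return the first minute strictly after `after` matching a 5-field
    cron expression: minute hour day-of-month month day-of-week."""
    minute, hour, dom, month, dow = expr.split()
    t = after.replace(second=0, microsecond=0) + timedelta(minutes=1)
    for _ in range(366 * 24 * 60):  # scan at most ~one year of minutes
        if (_match(minute, t.minute) and _match(hour, t.hour)
                and _match(dom, t.day) and _match(month, t.month)
                and _match(dow, t.weekday())):  # note: Mon=0 simplification
            return t
        t += timedelta(minutes=1)
    raise ValueError("no matching time within a year: %s" % expr)
```

For example, `next_run("30 2 * * *", datetime(2025, 4, 16, 21, 0))` yields 02:30 on the following day, which is the kind of "next schedule time" calculation the continuous audit handler needs.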
b) Support status of Watcher Datasources
* Only Gnocchi and Prometheus have a CI job running tempest tests (with scenario tests)
* Monasca has been inactive since 2024.1
* *AI*: (jgilaber) Mark Monasca and Grafana as deprecated, unless someone steps up to maintain them, which should include a minimal CI job running.
* *AI*: (dviroel) Document a support matrix between Strategies and Datasources, indicating which ones are production-ready or experimental, and their testing coverage.
c) Eventlet Removal
* The team is going to look at how eventlet is used in Watcher and start a PoC of its removal.
* Chandan Kumar and dviroel volunteered to help in this effort.
* Planned for the 2026.1 cycle.
Workflow/API Improvements (amoralej)
==============================

a) Action states

* Currently an Action goes from PENDING to SUCCEEDED or FAILED, but these states do not cover some important scenarios.
* If an Action's pre_conditions check fails, the action is set to FAILED, but in some scenarios it could simply be SKIPPED, letting the workflow continue.
* Proposal: a new SKIPPED state for actions. E.g. in a Nova migration action, if the instance doesn't exist on the source host, it can be skipped instead of failed.
* Proposal: users could also manually skip specific actions from an action plan.
* A skip_reason field could also be added to document the reason behind the skip: user's request, pre-condition check, etc.
* *AI*: (amoralej) Create a spec to describe the proposed changes.
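A minimal sketch of how the proposed state model could look. The `SKIPPED` state and `skip_reason` field follow the proposal above, but the executor logic here is an illustrative assumption, not Watcher's actual code:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional

class ActionState(Enum):
    PENDING = "PENDING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"
    SKIPPED = "SKIPPED"  # proposed new state

@dataclass
class Action:
    name: str
    pre_condition: Callable[[], bool]  # e.g. "instance exists on source host"
    skippable: bool = False            # could also be set by user request
    state: ActionState = ActionState.PENDING
    skip_reason: Optional[str] = None  # proposed field: why it was skipped

def execute_plan(actions):
    """Run actions in order; a failed pre-condition skips the action
    (when allowed) and the workflow continues after a skip."""
    for action in actions:
        if not action.pre_condition():
            if action.skippable:
                action.state = ActionState.SKIPPED
                action.skip_reason = "pre-condition check failed"
                continue  # workflow keeps going past a skipped action
            action.state = ActionState.FAILED
            break
        action.state = ActionState.SUCCEEDED
```

In the Nova migration example, the missing-instance pre-condition would mark that one action SKIPPED with a recorded reason, while the rest of the action plan still runs.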
b) Meaning of SUCCEEDED state in Action Plan
* Currently it means that all actions were triggered, even if all of them failed, which can be confusing for users.
* Docs mention that the SUCCEEDED state means that all actions have been successfully executed.
* *AI*: (amoralej) Report the current behavior as a bug (Priority: High)
  o done: https://bugs.launchpad.net/watcher/+bug/2106407
Watcher-Dashboard: Priorities for next release (amoralej)
===========================================

a) Add integration/functional tests

* The project is missing integration/functional tests and a CI job running against changes in the repo.
* No general conclusion; we will follow up with the Horizon team.
* *AI*: (chandankumar/rlandy) Sync with the Horizon team about testing the plugin with Horizon.
* *AI*: (chandankumar/rlandy) Add a devstack job running on new changes to the watcher-dashboard repo.
b) Add parameters to Audits
* Audit parameters are missing on the watcher-dashboard side. Without them, it is not possible to define some important parameters.
* Should be addressed by a blueprint.
* Contributors to this feature: chandankumar
Watcher cluster model collector improvement ideas (dviroel)
=============================================

* Brainstormed ideas to improve the Watcher collector process, since we still see a lot of issues due to outdated models when running audits.
* Both scheduled model updates and event-based updates are enabled in CI today.
* The current state of event-based updates from Nova notifications is unknown. The code needs to be reviewed and improvements/fixes can be proposed.
  o e.g. https://bugs.launchpad.net/watcher/+bug/2104220/comments/3 - we need to check whether we are processing the right notifications or if it is a bug in Nova.
* Proposal: refresh the model before running an audit. A rate limit should be considered to avoid too many refreshes.
* *AI*: (dviroel) New spec for cluster model refresh, based on audit trigger.
* *AI*: (dviroel) Investigate the processing of Nova events in Watcher.
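The rate-limited refresh-before-audit idea could be sketched like this. The class name, `min_interval` default, and injectable clock are assumptions for illustration only, not the spec's actual design:

```python
import time

class RateLimitedRefresher:
    """Refresh the cluster model before an audit runs, but at most once
    per `min_interval` seconds, to avoid refreshing too often."""

    def __init__(self, refresh_fn, min_interval=60.0, clock=time.monotonic):
        self._refresh_fn = refresh_fn    # e.g. the collector's rebuild call
        self._min_interval = min_interval
        self._clock = clock              # injectable for testing
        self._last_refresh = None

    def maybe_refresh(self):
        """Called at audit start; returns True if a refresh happened."""
        now = self._clock()
        if (self._last_refresh is None
                or now - self._last_refresh >= self._min_interval):
            self._refresh_fn()
            self._last_refresh = now
            return True
        return False  # model considered fresh enough; skip the refresh
```

An audit trigger would call `maybe_refresh()` first, so back-to-back audits reuse the recently refreshed model instead of rebuilding it each time.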
Watcher and Nova's visible constraints (dviroel)
====================================

* Currently, Watcher can propose solutions that include server migrations violating Nova constraints like scheduler_hints, server_groups, pinned_az, etc.
* In the Epoxy release, Nova's API was improved to also show scheduler_hints and image_properties, allowing external services like Watcher to query and use this information when calculating new solutions.
  o https://docs.openstack.org/releasenotes/nova/2025.1.html#new-features
* Proposal: extend the compute instance model to include the new properties, which can be retrieved via novaclient. Update strategies to filter out invalid migration destinations based on these new properties.
* *AI*: (dviroel) Propose a spec to better document the proposal. No API changes are expected here.
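As an example of the kind of filtering a strategy could do once the instance model carries these properties, here is a sketch of excluding destinations that would violate an anti-affinity server group. The data shapes are hypothetical stand-ins; the real values would come from the Nova API fields mentioned above:

```python
def valid_destinations(instance, candidate_hosts, server_groups):
    """Drop candidate hosts that already hold another member of the
    instance's anti-affinity server group.

    instance: dict with 'id' and optional 'group' (server group name)
    candidate_hosts: iterable of host names
    server_groups: {name: {'policy': str,
                           'member_hosts': {instance_id: host}}}
    """
    group_name = instance.get("group")
    if group_name is None:
        return list(candidate_hosts)  # no group constraint to honor
    group = server_groups[group_name]
    if group["policy"] != "anti-affinity":
        return list(candidate_hosts)  # only anti-affinity handled here
    occupied = {host for member, host in group["member_hosts"].items()
                if member != instance["id"]}
    return [h for h in candidate_hosts if h not in occupied]
```

A migration strategy would run its normal host scoring only over the surviving candidates, so it never proposes a move that Nova's scheduler would reject anyway.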
Replacement for noisy neighbor policy (jgilaber)
====================================

* The existing noisy neighbor strategy is based on L3 cache metrics, which are no longer available, since support for them was dropped from the kernel and from Nova.
* In order to keep this strategy, new metrics need to be considered: cpu_steal? io_wait? cache_misses?
* *AI*: (jgilaber) Mark the strategy as deprecated during this cycle
* *AI*: (TBD) Identify new metrics to be used
* *AI*: (TBD) Work on a replacement for the current strategy
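If cpu_steal were the metric chosen, the core of a replacement could look like this simple sketch. The threshold, metric shape, and averaging are assumptions for illustration, not a decided design:

```python
def noisy_hosts(host_metrics, steal_threshold=10.0):
    """Flag hosts whose average per-instance CPU-steal percentage
    exceeds a threshold, as a rough noisy-neighbor signal.

    host_metrics: {host: {instance_id: cpu_steal_pct}}
    Returns flagged hosts sorted worst-first by average steal.
    """
    scores = {}
    for host, instances in host_metrics.items():
        if not instances:
            continue  # no guests, nothing to measure
        avg = sum(instances.values()) / len(instances)
        if avg > steal_threshold:
            scores[host] = avg
    return sorted(scores, key=scores.get, reverse=True)
```

A strategy built on this would then pick migration candidates from the flagged hosts, much as the L3-cache-based version did with its own contention signal.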
Host Maintenance strategy new use case (jeno8)
=====================================

* New use case for the Host Maintenance strategy: instances with ephemeral disks should not be migrated at all.
* Spec proposed: https://review.opendev.org/c/openstack/watcher-specs/+/943873
  o New action to stop instances when both live/cold migration are disabled by the user.
* *AI*: (All) Review the spec and continue the discussion there.
Missing Contributor Docs (sean-k-mooney)
================================

* Doc missing: scope of the project, e.g.: https://docs.openstack.org/nova/latest/contributor/project-scope.html
* *AI*: (rlandy) Create a scope-of-the-project doc for Watcher
* Doc missing: PTL guide, e.g.: https://docs.openstack.org/nova/latest/contributor/ptl-guide.html
* *AI*: (TBD) Create a PTL guide for the Watcher project
* Document: when to create a spec vs. blueprint vs. bug
* *AI*: (TBD) Create a doc section describing the process based on what is being modified in the code.
Retrospective
==========

* The DPL approach seems to be working for Watcher.
* New core members added: sean-k-mooney, dviroel, marios and chandankumar
  o We plan to add more cores in the next cycle, based on reviews and engagement.
  o We plan to remove members not active in the last 2 cycles (starting at 2026.1).
* A new datasource was added: Prometheus
* The Prometheus job now also runs scenario tests, along with Gnocchi.
* We triaged all old bugs from Launchpad.
* Needs improvement:
  o The current team is still learning about details in the code; much of the historical knowledge was lost with the previous maintainers.
  o The core team still needs to grow.
  o We need to focus on creating stable releases.
Cross-project session with Horizon team
===============================

* Combined session with the Telemetry and Horizon teams, focused on how to provide tenant and admin dashboards to visualize metrics.
* The Watcher team presented some ideas for new panels for both admins and tenants, and sean-k-mooney raised a discussion about frameworks that can be used to implement them.
* Use cases that were discussed:
  o a) Admins would benefit from a visualization of infrastructure utilization (real usage metrics), so they can identify bottlenecks and plan optimization.
  o b) A tenant would like to view their workload performance, checking real usage of CPU/RAM/disk of instances, to properly adjust their resource allocation.
  o c) An admin user of the Watcher service would like to visualize metrics generated by Watcher strategies, such as the standard deviation of host metrics.
* sean-k-mooney presented an initial PoC of how a Hypervisor Metrics dashboard could look.
* Proposal for next steps:
  o Start a new Horizon plugin as an official deliverable of the Telemetry project.
  o It is still unclear which framework to use for building charts.
  o The dashboard will integrate with Prometheus as the metric store.
  o It is expected that only short-term metrics will be supported (7 days).
  o python-observability-client will be used to query Prometheus.
Cross-project session with Nova team
=============================

* sean-k-mooney led topics on how to evolve Nova to better assist other services, like Watcher, in taking actions on instances. The team agreed on a proposal to use the existing metadata API to annotate an instance's supported lifecycle operations. This information is very useful for improving Watcher's strategy algorithms. Some examples of instance metadata could be:
  o lifecycle:cold-migratable=true|false
  o ha:maintenance-strategy:in_place|power_off|migrate
* It was discussed that Nova could infer which operations are valid or not based on information like the virt driver, flavor, image properties, etc. This feature was initially named 'instance capabilities' and will require a spec for further discussion.
* Another topic of interest, also raised by Sean, was adding new standard traits to resource providers, like PRESSURE_CPU and PRESSURE_DISK. These traits can be used to weight hosts when placing new VMs. Watcher and the libvirt driver could both annotate them, but the team generally agreed that the libvirt driver is preferred here.
* More info in the Nova PTG etherpad [0] and Sean's summary blog [1].
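A Watcher strategy could consume such metadata roughly as follows. The key names mirror the examples above, while the helper name and the default-to-allowed behavior are illustrative assumptions, not an agreed design:

```python
def allowed_lifecycle_ops(instance_metadata):
    """Extract lifecycle hints from instance metadata annotations of the
    form discussed above, e.g. 'lifecycle:cold-migratable' = 'true'/'false'.
    Operations without an annotation default to allowed (an assumption)."""
    ops = {"cold-migratable": True, "live-migratable": True}
    for key, value in instance_metadata.items():
        if key.startswith("lifecycle:"):
            op = key.split(":", 1)[1]
            ops[op] = value.lower() == "true"
    return ops
```

A migration strategy would then drop, say, cold migration actions for any instance whose metadata marks that operation as false, instead of proposing a move Nova would be unable or unwilling to perform.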
[0] https://etherpad.opendev.org/p/r.bf5f1185e201e31ed8c3adeb45e3cf6d
[1] https://www.seanmooney.info/blog/2025.2-ptg/#watcher-topics
Please let me know if I missed something. Thanks!
-- Douglas Viroel - dviroel