We are gleeful to announce the release of:

watcher 14.1.0

This release is part of the epoxy release series.

The source is available from:

    https://opendev.org/openstack/watcher

Download the package from:

    https://tarballs.openstack.org/watcher/

Please report issues through:

    https://bugs.launchpad.net/watcher/+bugs

For more details, please see below.

14.1.0
^^^^^^

New Features
************

* A new module, "watcher.wsgi", has been added as a place to gather
  WSGI "application" objects. This is intended to ease deployment by
  providing a consistent location for these objects.

  For example, if using uWSGI then instead of:

      [uwsgi]
      wsgi-file = /bin/watcher-api-wsgi

  you can now use:

      [uwsgi]
      module = watcher.wsgi.api:application

  This also simplifies deployment with other WSGI servers, such as
  gunicorn, that expect module paths.

Deprecation Notes
*****************

* The watcher-api-wsgi console script is deprecated and will be
  removed in a future release. This artifact is generated by a
  setuptools extension provided by PBR, which is also deprecated. Due
  to changes in Python packaging, this custom extension is planned to
  be removed from all OpenStack projects in a future PBR release, in
  favor of module-based WSGI application entry points.

Security Issues
***************

* Watcher no longer forges requests on behalf of a tenant when
  swapping volumes.

  Prior to this release, Watcher had two implementations for moving a
  volume: it could use Cinder's volume migrate API, or its own
  internal implementation that directly called Nova's volume
  attachment update API. The former is safe and is the recommended way
  to move volumes between Cinder storage backends; the internal
  implementation was insecure, fragile due to a lack of error
  handling, and capable of deleting user data.

  Insecure: the internal volume migration operation created a new
  Keystone user with a weak name and password and added it to the
  tenant's project with the admin role.
  It then used that user to forge requests on behalf of the tenant,
  with admin rights, to swap the volume. If the applier was restarted
  during the execution of this operation, that user would never be
  cleaned up.

  Fragile: the error handling was minimal. The swap volume API is
  asynchronous, so Watcher has to poll for completion, and there was
  no support for resuming the operation if it was interrupted or the
  timeout was exceeded.

  Data loss: while the internal polling logic returned success or
  failure, Watcher did not check the result; once the function
  returned, it unconditionally deleted the source volume. For larger
  volumes this could result in irretrievable data loss.

  Finally, if a volume was swapped using the internal workflow, it
  left the Nova instance in an out-of-sync state. If the VM was live
  migrated after the volume swap completed successfully, but prior to
  a hard reboot, the migration would either fail or succeed and break
  tenant isolation.

  See https://bugs.launchpad.net/nova/+bug/2112187 for details.

Bug Fixes
*********

* When using the prometheus datasource and more than one target has
  the same value for the "fqdn_label", the driver used the wrong
  "instance" label to query for host metrics. The "instance" label is
  no longer used in the queries; instead, queries use the
  "fqdn_label", which identifies all the metrics for a specific
  compute node. See Bug 2103451
  (https://bugs.launchpad.net/watcher/+bug/2103451) for more info.

* Previously, when users attempted to create a new audit without
  providing a name and a goal or an audit template, the API returned
  error 500 and an incorrect error message was displayed. Now, Watcher
  displays a helpful message and returns HTTP error 400. For more info
  see: https://bugs.launchpad.net/watcher/+bug/2110947

* All code related to creating a Keystone user and granting roles has
  been removed. The internal swap volume implementation has been
  removed and replaced by Cinder's volume migrate API.
  Note that as part of this change, Watcher will no longer attempt
  volume migrations or retypes if the instance is in the *Verify
  Resize* task state. This resolves several issues related to volume
  migration in the zone migration and storage capacity balance
  strategies. While efforts have been made to maintain backward
  compatibility, these changes are required to address a security
  weakness in Watcher's prior approach. See
  https://bugs.launchpad.net/nova/+bug/2112187 for more context.

* When running an audit with the *workload_stabilization* strategy and
  the *instance_ram_usage* metric in a deployment with the prometheus
  datasource, the host RAM usage metric was reported in the wrong
  unit, which led to an incorrect standard deviation and incorrect
  action plans due to the application of the wrong scale factor in the
  algorithm. The host RAM usage metric is now properly reported in KiB
  when using a prometheus datasource, and the *workload_stabilization*
  strategy calculates the standard deviation correctly. For more
  details: https://launchpad.net/bugs/2113776

* The host maintenance strategy should migrate servers to the backup
  node if one is specified, or otherwise rely on the Nova scheduler.
  Instead, it was re-enabling hosts that had been disabled with the
  watcher_disabled reason and migrating servers to those nodes, which
  could impact customer workloads: compute nodes were disabled for a
  reason. The host maintenance strategy is now fixed to migrate
  servers only to the backup node, or to rely on the Nova scheduler if
  no backup node is provided.

* Previously, if an action failed in an action plan, the state of the
  action plan was reported as SUCCEEDED once execution had finished,
  regardless of the outcome. Watcher will now reflect the actual state
  of all the actions in the plan after the execution has finished: if
  any action has status FAILED, it will set the state of the action
  plan to FAILED. This is the expected behavior according to the
  Watcher documentation.
  For more info see: https://bugs.launchpad.net/watcher/+bug/2106407

* Bug #2110538 (https://bugs.launchpad.net/watcher/+bug/2110538):
  Corrected the HTTP error code returned when Watcher users try to
  create audits with invalid parameters. The API now correctly returns
  a 400 Bad Request error.

Changes in watcher 14.0.0..14.1.0
---------------------------------

ffec800f use cinder migrate for swap volume
defd3953 Configure watcher tempest's microversion in devstack
ba417b38 Fix audit creation with no name and no goal or audit_template
38622442 Set actionplan state to FAILED if any action has failed
c7fde924 Add unit test to check action plan state when a nested action fails
e5b5ff5d Return HTTP code 400 when creating an audit with wrong parameters
fb85b27a Use KiB as unit for host_ram_usage when using prometheus datasource
53872f9a Aggregate by label when querying instance cpu usage in prometheus
c0ebb8dd Drop code from Host maintenance strategy migrating instance to disabled hosts
1d7f1636 Added unit test to validate audit creation with no goal and no name
c6ceaacf Add a unit test to check the error when creating an audit with wrong parameters
f4bfb105 [host_maintenance] Pass des hostname in add_action solution
8a99d4c5 Add support for pyproject.toml and wsgi module paths
ce9f0b4c Skip real-data tests in non-real-data jobs
e385ece6 Aggregate by fqdn label instead instance in host cpu metrics
c6505ad0 Query by fqdn_label instead of instance for host metrics
64f70b94 Drop sg_core prometheus related vars
68c9ce65 Update TOX_CONSTRAINTS_FILE for stable/2025.1
5fa09265 Update .gitreview for stable/2025.1

Diffstat (except docs and test files)
-------------------------------------

.gitreview                                         |   1 +
.zuul.yaml                                         |  17 +--
devstack/lib/watcher                               |  96 ++++---------
devstack/plugin.sh                                 |   3 +
pyproject.toml                                     |   3 +
.../add-wsgi-module-support-597f479e31979270.yaml  |  30 +++++
...ries-with-multiple-target-0e65d20711d1abe2.yaml |   8 ++
releasenotes/notes/bug-2110947.yaml                |  10 ++
.../notes/bug-2112187-763bae283e0b736d.yaml        |  47 +++++++
.../notes/bug-2113776-4bd314fb46623fbc.yaml        |  14 +++
...trategy-on-disabled-hosts-24084a22d4c8f914.yaml |  10 ++
...ion-plan-state-on-failure-69e498d902ada5c5.yaml |  13 ++
...ror-400-on-bad-parameters-bb964e4f5cadc15c.yaml |   7 ++
setup.cfg                                          |   2 +-
tox.ini                                            |  13 +-
watcher/api/controllers/v1/audit.py                |  19 +--
watcher/applier/action_plan/default.py             |  22 +++-
watcher/applier/actions/volume_migration.py        |  98 ++++-----------
watcher/common/keystone_helper.py                  |  34 -----
watcher/common/utils.py                            |   8 +-
watcher/decision_engine/datasources/base.py        |   2 +-
watcher/decision_engine/datasources/prometheus.py  | 135 ++++++++++----------
.../strategy/strategies/host_maintenance.py        |  26 +---
.../strategy/strategies/zone_migration.py          |  55 ++++----
.../action_plan/test_default_action_handler.py     |  27 ++++
.../datasources/test_prometheus_helper.py          | 140 +++++++++++++++------
.../strategy/strategies/test_host_maintenance.py   |  13 +-
.../strategies/test_workload_stabilization.py      |  62 ++++++++-
.../strategy/strategies/test_zone_migration.py     |   5 +-
watcher/wsgi/__init__.py                           |   0
watcher/wsgi/api.py                                |  18 +++

34 files changed, 652 insertions(+), 422 deletions(-)
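As a minimal illustration of the action-plan state rule described in
the bug fixes above: after execution finishes, the plan is FAILED if
any action has status FAILED, otherwise SUCCEEDED. This is a hedged
sketch only; the function name is hypothetical and does not reflect
Watcher's actual internal classes or API.

```python
def final_action_plan_state(action_states):
    """Hypothetical sketch of the state rule fixed in bug 2106407.

    Given the final states of all actions in a plan, the plan state
    is FAILED if any action FAILED, and SUCCEEDED otherwise.
    """
    return "FAILED" if "FAILED" in action_states else "SUCCEEDED"
```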