openstack-discuss search results for query "#eventlet-removal"
openstack-discuss@lists.openstack.org - 149 messages
Re: [watcher] 2025.2 Flamingo PTG summary
by Sean Mooney
On 16/04/2025 21:04, Dmitriy Rabotyagov wrote:
>
> Hey,
>
> Have a comment on one AI from the list.
>
> > AI: (jgilaber) Mark Monasca and Grafana as deprecated, unless
> someone steps up to maintain them, which should include a minimal CI
> job running.
>
> So eventually, on OpenStack-Ansible we were planning to revive the
> Watcher role support to the project.
> How we usually test deployment, is by spawning an all-in-one
> environment with drivers and executing a couple of tempest scenarios
> to ensure basic functionality of the service.
>
> With that, having a native OpenStack telemetry datastore is very
> beneficial for such goal, as we already do maintain means for spawning
> telemetry stack. While a requirement for Prometheus will be
> unfortunate for us at least.
>
> While I was writing that, I partially realized that testing Watcher on
> all-in-one is pretty much impossible as well...
>
You can certainly test some of Watcher with an all-in-one deployment,
i.e. the APIs, and you can use the dummy test strategies.
But yes, in general, like Nova, you need at least 2 nodes to be able to test
it properly, ideally 3,
so that if you are doing a live migration there is actually a choice of host.
In general, however, Watcher, like Heat, just asks Nova to actually move the VMs.
Sure, it will ask Nova to move it to a specific host, but fundamentally, if
you have tested live migration with Nova via Tempest separately, there is no
reason to expect it would not work for a live migration triggered by Watcher,
Heat, or anything else that just calls Nova's API.
So you could still get some valuable testing in an all-in-one, but ideally
there would be at least 2 compute hosts.
> But at the very least, I can propose looking into adding an OSA job
> with Gnocchi as NV to the project, to show the state of the deployment
> with this driver.
>
Well, Gnocchi is also not a native OpenStack telemetry datastore; it left
our community to pursue its own goals and is now a third-party datastore,
just like Grafana or Prometheus.
Monasca is currently marked as inactive
https://review.opendev.org/c/openstack/governance/+/897520 and is in the
process of being retired.
But it also has no testing on the Watcher side, so the combination of the
two is why we are deprecating it going forward.
If both change, I'm happy to see the support continue.
Gnocchi has testing, but we are not actively working on extending its
functionality going forward.
As long as it continues to work, I see no reason to change its support
status.
Watcher has quite a lot of untested integrations, which is unfortunate.
We are planning to build out a feature/test/support matrix in the docs
this cycle.
For example, Watcher can integrate with both Ironic and the Canonical MAAS
component to do some level of host power management. None of that is tested,
and we are likely going to mark them as experimental and reflect on whether
we can continue to support them going forward.
It also has the ability to do Cinder storage pool balancing, which I think
is also untested right now.
One of the things we hope to do is extend the existing testing in our
current jobs to cover gaps like that, where it is practical to do so.
But creating a DevStack plugin to deploy MAAS with fake infrastructure is
likely a lot more than we can do with our existing contributors, so expect
that to go to experimental, then deprecated, and finally be removed if no
one turns up to support it.
Ironic is in the same boat; however, there are DevStack jobs with fake
Ironic nodes, so I could see a path to us having an Ironic job down the
line. It is just not high on our current priority list to address the
support status or testing of this right now.
Eventlet removal and other tech debt/community goals are definitely higher,
but I hope the new support/testing matrix will at least help folks make
informed decisions on which features to use and which backends are
recommended going forward.
>
> On Wed, 16 Apr 2025, 21:53 Douglas Viroel, <viroel(a)gmail.com> wrote:
>
> Hello everyone,
>
> Last week's PTG had very interesting topics. Thank you all that
> joined.
> The Watcher PTG etherpad with all notes is available here:
> https://etherpad.opendev.org/p/apr2025-ptg-watcher
> Here is a summary of the discussions that we had, including the
> great cross-project sessions with Telemetry, Horizon and Nova team:
>
> Tech Debt (chandankumar/sean-k-mooney)
> =================================
> a) Croniter
>
> * Project is being abandoned as per
> https://pypi.org/project/croniter/#disclaimer
> * Watcher uses croniter to calculate a new schedule time to run
> an audit (continuous). It is also used to validate cron like
> syntax
> * Agreed: replace croniter with apscheduler's cron methods.
> * *AI*: (chandankumar) Fix in master branch and backport to 2025.1
>
> b) Support status of Watcher Datasources
>
> * Only Gnocchi and Prometheus have CI job running tempest tests
> (with scenario tests)
> * Monasca is inactive since 2024.1
> * *AI*: (jgilaber) Mark Monasca and Grafana as deprecated,
> unless someone steps up to maintain them, which should include
> a minimal CI job running.
> * *AI*: (dviroel) Document a support matrix between Strategies
> and Datasources, which ones are production ready or
> experimental, and testing coverage.
>
> c) Eventlet Removal
>
> * Team is going to look at how the eventlet is used in Watcher
> and start a PoC of its removal.
> * Chandan Kumar and dviroel volunteer to help in this effort.
> * Planned for 2026.1 cycle.
>
> Workflow/API Improvements (amoralej)
> ==============================
> a) Actions states
>
> * Currently, Actions update from Pending to Succeeded or Failed,
> but these states do not cover some important scenarios
> * If an Action's pre_conditions fails, the action is set to
> FAILED, but for some scenarios, it could be just SKIPPED and
> continue the workflow.
> * Proposal: New SKIPPED state for action. E.g: In a Nova
> Migration Action, if the instance doesn't exist in the source
> host, it can be skipped instead of fail.
> * Proposal: Users could also manually skip specific actions from
> an action plan.
> * A skip_reason field could also be added to document the reason
> behind the skip: user's request, pre-condition check, etc.
> * *AI*: (amoralej) Create a spec to describe the proposed changes.
>
> b) Meaning of SUCCEEDED state in Action Plan
>
> * Currently means that all actions are triggered, even if all of
> them fail, which can be confusing for users.
> * Docs mention that SUCCEEDED state means that all actions have
> been successfully executed.
> * *AI*: (amoralej) Document the current behavior as a bug
> (Priority High)
> o done: https://bugs.launchpad.net/watcher/+bug/2106407
>
> Watcher-Dashboard: Priorities to next release (amoralej)
> ===========================================
> a) Add integration/functional tests
>
> * Project is missing integration/functional tests and a CI job
> running against changes in the repo
> * No general conclusion and we will follow up with Horizon team
> * *AI*: (chandankumar/rlandy) sync with Horizon team about
> testing the plugin with horizon.
> * *AI*: (chandankumar/rlandy) devstack job running on new
> changes for watcher-dashboard repo.
>
> b) Add parameters to Audits
>
> * It is missing on the watcher-dashboard side. Without it, it is
> not possible to define some important parameters.
> * Should be addressed by a blueprint
> * Contributors to this feature: chandankumar
>
> Watcher cluster model collector improvement ideas (dviroel)
> =============================================
>
> * Brainstorm ideas to improve watcher collector process, since
> we still see a lot of issues due to outdated models when
> running audits
> * Both scheduled model update and event-based updates are
> enabled in CI today
> * The current state of event-based updates from Nova
> notifications is unknown. Code needs to be reviewed and
> improvements/fixes can be proposed
> o e.g:
> https://bugs.launchpad.net/watcher/+bug/2104220/comments/3
> - We need to check if we are processing the right
> notifications or if it is a bug in Nova
> * Proposal: Refresh the model before running an audit. A rate
> limit should be considered to avoid too many refreshes.
> * *AI*: (dviroel) new spec for cluster model refresh, based on
> audit trigger
> * *AI:* (dviroel) investigate the processing of nova events in
> Watcher
>
> Watcher and Nova's visible constraints (dviroel)
> ====================================
>
> * Currently, Watcher can propose solutions that include server
> migrations that violate some Nova constraints like:
> scheduler_hints, server_groups, pinned_az, etc.
> * In Epoxy release, Nova's API was improved to also show
> scheduler_hints and image_properties, allowing external
> services, like watcher, to query and use this information when
> calculating new solutions.
> o https://docs.openstack.org/releasenotes/nova/2025.1.html#new-features
> * Proposal: Extend compute instance model to include new
> properties, which can be retrieved via novaclient. Update
> strategies to filter invalid migration destinations based on
> these new properties.
> * *AI*: (dviroel) Propose a spec to better document the
> proposal. No API changes are expected here.
>
> Replacement for noisy neighbor policy (jgilaber)
> ====================================
>
> * The existing noisy neighbor strategy is based on L3 cache
> metrics, which are not available anymore, since the support for
> them was dropped from the kernel and from Nova.
> * In order to keep this strategy, new metrics need to be
> considered: cpu_steal? io_wait? cache_misses?
> * *AI*: (jgilaber) Mark the strategy as deprecated during this cycle
> * *AI*: (TBD) Identify new metrics to be used
> * *AI*: (TBD) Work on a replacement for the current strategy
>
>
> Host Maintenance strategy new use case (jeno8)
> =====================================
>
> * New use case for Host Maintenance strategy: instance with
> ephemeral disks should not be migrated at all.
> * Spec proposed:
> https://review.opendev.org/c/openstack/watcher-specs/+/943873
> o New action to stop instances when both live/cold migration
> are disabled by the user
> * *AI*: (All) Review the spec and continue with discussion there.
>
> Missing Contributor Docs (sean-k-mooney)
> ================================
>
> * Doc missing: Scope of the project, e.g:
> https://docs.openstack.org/nova/latest/contributor/project-scope.html
> * *AI*: (rlandy) Create a scope of the project doc for Watcher
> * Doc missing: PTL Guide, e.g:
> https://docs.openstack.org/nova/latest/contributor/ptl-guide.html
> * *AI*: (TBD) Create a PTL Guide for Watcher project
> * Document: When to create a spec vs blueprint vs bug
> * *AI*: (TBD) Create a doc section to describe the process based
> on what is being modified in the code.
>
> Retrospective
> ==========
>
> * The DPL approach seems to be working for Watcher
> * New core members added: sean-k-mooney, dviroel, marios and
> chandankumar
> o We plan to add more cores in the next cycle, based on
> reviews and engagement.
> o We plan to remove members not active in the last 2 cycles
> (starting at 2026.1)
> * A new datasource was added: Prometheus
> * Prometheus job now also runs scenario tests, along with Gnocchi.
> * We triaged all old bugs from launchpad
> * Needs improvement:
> o current team is still learning about details in the code,
> much of the historical knowledge was lost with the
> previous maintainers
> o core team still needs to grow
> o we need to focus on creating stable releases
>
>
> Cross-project session with Horizon team
> ===============================
>
> * Combined session with Telemetry and Horizon team, focused on
> how to provide a tenant and an admin dashboard to visualize
> metrics.
> * Watcher team presented some ideas of new panels for both admin
> and tenants, and sean-k-mooney raised a discussion about
> frameworks that can be used to implement them
> * Use-cases that were discussed:
> o a) Admin would benefit from a visualization of the
> infrastructure utilization (real usage metrics), so they
> can identify bottlenecks and plan optimization
> o b) A tenant would like to view their workload performance,
> checking real usage of cpu/ram/disk of instances, to
> properly adjust their resource allocation.
> o c) An admin user of watcher service would like to
> visualize metrics generated by watcher strategies like
> standard deviation of host metrics.
> * sean-k-mooney presented an initial PoC of what a Hypervisor
> Metrics dashboard would look like.
> * Proposal for next steps:
> o start a new horizon plugin as an official deliverable of
> telemetry project
> o still unclear which framework to use for building charts
> o dashboard will integrate with Prometheus, as metric store
> o it is expected that only short term metrics will be
> supported (7 days)
> o python-observability-client will be used to query Prometheus
>
>
> Cross-project session with Nova team
> =============================
>
> * sean-k-mooney led topics on how to evolve Nova to better
> assist other services, like Watcher, to take actions on
> instances. The team agreed on a proposal of using the existing
> metadata API to annotate instance's supported lifecycle
> operations. This information is very useful to improve
> Watcher's strategies' algorithms. Some examples of instance
> metadata could be:
> o lifecycle:cold-migratable=true|false
> o ha:maintenance-strategy:in_place|power_off|migrate
> * It was discussed that Nova could infer which operations are
> valid or not, based on information like: virt driver, flavor,
> image properties, etc. This feature was initially named
> 'instance capabilities' and will require a spec for further
> discussions.
> * Another topic of interest, also raised by Sean, was about
> adding new standard traits to resource providers, like
> PRESSURE_CPU and PRESSURE_DISK. These traits can be used to
> weight hosts when placing new VMs. Watcher and the libvirt
> driver could work on annotating them, but the team generally
> agreed that the libvirt driver is preferred here.
> * More info at Nova PTG etherpad [0] and sean's summary blog [1]
>
> [0] https://etherpad.opendev.org/p/r.bf5f1185e201e31ed8c3adeb45e3cf6d
> [1] https://www.seanmooney.info/blog/2025.2-ptg/#watcher-topics
>
>
> Please let me know if I missed something.
> Thanks!
>
> --
> Douglas Viroel - dviroel
>
3 months, 3 weeks
[nova][ptg] 2025.2 Flamingo PTG summary
by Rene Ribaud
Hello everyone,
Last week was the PTG—thank you to those who joined! I hope you enjoyed it.
I haven’t gathered exact attendance stats, but it seemed that most sessions
had at least around 15 participants, with some peaks during the cross-team
discussions.
If you’d like to take a closer look, here’s the link to the PTG etherpad:
https://etherpad.opendev.org/p/r.bf5f1185e201e31ed8c3adeb45e3cf6d
We had a pretty full agenda for Nova, so here’s a summary I’ve tried to
keep as short as possible.
#### 2025.1 Epoxy Retrospective ####
17 specs were accepted, and 12 implemented — an excellent ratio. This
represents a clear improvement over previous cycles.
Virtiofs was successfully merged, unblocking other work and boosting
contributor motivation.
✅ We agreed to maintain regular status updates via the etherpad and follow
up during Nova meetings.
API Microversions & Tempest Coverage: several microversions were merged
with good structure.
However, some schema changes were not reflected in Tempest, causing
downstream blockers.
Also, the updates covered by the microversions were not propagated into the
SDK and OpenStack client.
✅ Ensure client-side features (e.g., server show) are also published and
tracked.
✅ Keep microversions isolated and document Tempest implications clearly in
specs.
✅ Raise awareness of the tempest-with-latest-microversion job during Nova
meetings.
✅ Monitor OpenAPI efforts in Nova, which may allow offloading schema checks
from Tempest in the future.
Eventlet Removal: progress is behind schedule, especially compared to other
projects like Neutron.
✅ Flag this as a priority area for upcoming cycles.
Review Process & Tracking: spec review days were difficult to coordinate,
and the status etherpad was often outdated.
✅ Encourage active contributors to support occasional contributors during
review days.
✅ Commit to keeping the etherpad current throughout the cycle.
#### 2025.2 Flamingo Planning ####
Timeline:
Soft spec freeze (no new specs): June 1st
Hard spec freeze (M2): July 3rd
Feature Freeze (FF): August 28th
Final release: late September / early October
✅ We agreed to officially adopt June 1st as the soft freeze date, based on
the successful approach in Epoxy.
✅ A spec review day will be scheduled around mid-June; it will be announced
early to ensure participation.
✅ Uggla will update the schedule document with the proposed milestones.
#### Upstream Bug Triage ####
We acknowledged that active bug triage has slowed down, resulting in a
backlog increase (~150 untriaged bugs).
There is a consensus that triage remains important to maintain a clear
picture of the actual bug landscape.
✅ Trial a new approach: review some untriaged bugs at the end of Nova team
meetings.
✅ Process the list by age (starting with the newest or most-voted first).
#### Closing Old Bugs ####
A proposal was made to bulk-close bugs older than 2 years, with a
respectful and explanatory message, aiming to reduce backlog and improve
visibility.
However, multiple voices expressed strong reservations.
✅Take no action for now. Focus efforts on triaging new bugs first.
✅ If we successfully reduce the number of untriaged new bugs, we can
consider scrubbing the bug backlog and possibly closing some of the older
ones.
#### Preparation for Python 3.13 ####
While Python 3.13 is not mandatory for 2025.2, early compatibility work was
discussed due to known issues (e.g., eventlet is broken on 3.13, as
observed on Ubuntu 25.04).
Ubuntu 24.04 and CentOS Stream 10 will stay on 3.12 for their supported
lifespans.
A non-voting unit test job for Python 3.13 (openstack-tox-py313) has
already been added and is currently passing.
Introducing a functional job for 3.13 could be a good next step, if
resources allow.
✅ Gibi will track this as part of the broader eventlet removal work.
#### Confidential Computing Feature Planning ####
AMD SEV is already supported in Nova.
SEV-ES is implemented in libvirt and work is ongoing in Nova.
SEV-SNP is now supported in libvirt (v10.5.0). Work in Nova has not started
yet.
✅ Pay closer attention to SEV-ES reviews to help move this forward.
✅ Tkajinam will write a new spec for SEV-SNP.
Intel TDX
Kernel support is nearly ready (expected in 6.15).
Libvirt patches exist, but feature is not yet upstreamed or widely released.
✅ No action agreed yet, as this remains exploratory.
Arm CCA
No hardware is available yet; earliest expected in April 2027 (Fujitsu
Monaka).
Support in libvirt, QEMU, and Linux kernel is still under development.
✅ The use case is reasonable, but too early to proceed — we should wait
until libvirt and QEMU support is mature.
✅ It would be beneficial to wait for at least one Linux distribution to
officially support Arm CCA, allowing real-world testing.
✅ Attestation support for Arm is seen as external to Nova, with only minor
flags possibly needed in the guest.
#### RDT / MPAM Feature Discussion ####
RDT (Intel PQoS) and MPAM (Arm equivalent) aim to mitigate “noisy neighbor”
issues by allocating cache/memory bandwidth to VMs.
Development has stalled since 2019, primarily due to:
- Lower priority for contributors
- Lack of customer demand
- Infrastructure complexity (NUMA modeling, placement limitations)
✅ r-taketn to reopen and revise the original spec, showing a clear diff to
the previous version.
✅ Ensure that abstractions are generic, not tied to proprietary technology,
using libvirt + resource classes/traits may provide enough flexibility.
#### vTPM Live Migration ####
A spec for vTPM live migration was approved in Epoxy:
https://specs.openstack.org/openstack/nova-specs/specs/2025.1/approved/vtpm…
To
support live-migratable vTPM-enabled instances, Barbican secrets used for
vTPM need to be owned by Nova, rather than the end user.
This shift in ownership allows Nova to access the secret during live
migration operations.
Opt-in is handled via image property or flavor extra spec, meaning user
consent is explicitly required.
Current Proposal to enable this workflow:
- Castellan should allow per-call configuration for sending the service
token (rather than relying on a global all-or-nothing setting).
Proposal: https://review.opendev.org/c/openstack/castellan/+/942015
- If the Nova service token is present, Barbican should set the secret
owner to Nova.
Proposal: https://review.opendev.org/c/openstack/barbican/+/942016
This workflow ensures Nova can read/delete the secret during lifecycle
operations like migration, without involving the user.
A question was raised around possible co-ownership between Nova and the end
user (e.g., both having access to the secret). While this may be
interesting longer-term, current implementation assumes a single owner.
✅ User and host modes are as described in the spec.
For deployment mode, Nova will:
- Authenticate to Barbican as itself (using a service token).
- Own the vTPM secret it creates — it will be able to create, read, and
delete it.
- The user will not see or control the secret, including deletion.
- The secret will be visible to other members of the Nova service project
by default, but this could be restricted in future via Barbican ACLs to
limit visibility to Nova only.
#### Cloud Hypervisor Integration ####
There is an ongoing effort to integrate Cloud Hypervisor into Nova via the
Libvirt driver:
Spec: https://review.opendev.org/c/openstack/nova-specs/+/945549
The current PoC requires only minor changes to work with Libvirt, and the
team is ready to present the proposal at the PTG.
✅ We’re happy with the direction the spec is taking. Below are some key
highlights regarding the spec.
✅ Clarify platform support (e.g., is libvirt compiled with cloud hypervisor
support by default? Is it available in distros?).
✅ Provide a plan for runtime attach of multiple NICs and volumes.
✅ Mark as experimental if cloud hypervisor is not yet in upstream distro
packages.
✅ Ensure that the following features are expected to work and covered in
the spec: resize, migrate, rebuild, evacuate, snapshot.
✅ Justify raw-only image support, and outline the path to qcow2
compatibility.
#### vGPU (mdev) and PCI SR-IOV Topics ####
1. Live-migratable flag handling (physical_network tag)
Bug: https://bugs.launchpad.net/nova/+bug/2102161
✅ We agreed that the current behavior is correct and consistent with the
intention:
If live_migratable = false → fallback to hotplug during live migration.
If live_migratable = true on both source and destination → prefer
transparent live migration.
✅ Investigate how Neutron might participate by requesting live-migratable
ports.
2. Preemptive live migration failure for non-migratable PCI devices
Nova currently checks for migratability during scheduling and conductor
phases. There’s a proposal to move these checks earlier, possibly to the
API level.
Bug: https://bugs.launchpad.net/nova/+bug/2103631
✅ Confirm with gmann whether a microversion is needed — likely not, as
return codes are already supported (202 → 400/409).
✅ Uggla may submit a small spec to formalize this change.
✅ Split the work into two steps:
- Fix existing bug (can be backported).
- Incrementally move other validations earlier in the flow.
3. PCI SR-IOV: Unify the Live Migration Code Path
There’s agreement on the need to reduce technical debt by refactoring the
current dual-code-path approach into a unified model for PCI live migration.
✅ A dedicated spec is needed to clarify and unify PCI claiming and
allocation.
✅ This refactor should address PCI claiming and allocation, potentially
deprecating or replacing move_claim in favor of more robust DB-backed logic.
✅ This effort is directly related to point 1 (migratability awareness) and
will help ensure consistent resource management across the codebase.
#### SPICE VDI – Next Steps ####
There is an ongoing effort to enhance libvirt domain XML configuration for
desktop virtualization use cases (e.g. SPICE with USB and sound
controllers). Some patches were proposed but not merged in time for Epoxy.
Mikal raised the question of whether a new spec would be required in
Flamingo, which would be the third iteration of this work.
The team also raised concern about the complexity of adding traits (e.g.
os-traits) for relatively simple additions, due to the multi-step process
involved (traits patch, release, requirements update, etc.).
✅ Proceed with a specless blueprint.
✅ Plan to pull os-traits and os-resource-classes logic into Placement, to
simplify the integration process and reduce friction. Link the required
Placement version in Nova documentation accordingly. This is a strategic
direction, even if some traits might still be shared with Neutron/Cinder.
#### Virtiofs Client Support ####
The virtiofs server-side support was merged in Epoxy, but SDK and
client-side support did not make it in time. The proposal is to merge both
patches early in Flamingo and then backport to Epoxy.
✅ No concern with microversion usage here.
✅The ordering of microversion support patches across Nova, SDKs, and
clients will be handled by respective owners.
✅ Uggla to track that each new microversion in Nova has a corresponding
patch in SDK/client layers.
✅ Not directly related to virtiofs, but the new reset-state confirmation
prompt in the client was noted and welcomed.
#### One-Time-Use (OTU) Devices ####
OTU devices are designed to be consumed once and then unreserved.
There is a need to provide practical guidance on handling these cleanly,
especially in notification-driven environments.
Additionally, there's an important patch related to Placement behavior on
over-capacity nodes:
https://review.opendev.org/c/openstack/placement/+/945465
Placement currently blocks new allocations on over-capacity nodes — even if
the new allocation reduces usage. This breaks migration from overloaded
hosts. The proposed fix allows allocations that do not worsen usage (or that
improve it).
Note: A similar OTU device handling strategy has been successfully used in
Ironic.
✅ Provide an example script or tool for external OTU device cleanup, based
on notifications.
✅ Agreement on the proposed Placement fix — it is operator-friendly and
resolves real issues in migration workflows.
✅ We likely need to dig deeper into implementation and tooling for broader
OTU support.
#### Glance cross-project session ####
Please look at the Glance summary.
#### Secure RBAC – Finalization Plan ####
Tobias raised concerns about incomplete secure RBAC support in Nova,
particularly around default roles and policy behavior. Much of the
groundwork has been done, but a number of patches still require review and
finalization.
✅ Gmann will continue working on the outstanding patches during the
Flamingo cycle. The objective is to complete secure RBAC support in Nova as
part of this cycle.
#### Image Properties Handling – DB Schema & API Response ####
The issue arises from discrepancies between image property metadata stored
by Nova and what is received from Glance. Nova’s DB schema enforces a
255-character limit on metadata keys and values, which can lead to silent
truncation or hard failures (e.g., when prefixing keys like image_ pushes
the total length over 255).
Nova stopped supporting custom image properties nearly a decade ago, when
the system moved to structured objects (ImageMetaProps via OVO).
Glance still allows some custom metadata, which may be passed through to
Nova.
This led to invalid or non-standard keys (e.g.,
owner_specified.openstack.sha256) being stored or exposed, even though they
are not part of Nova’s supported set.
Consensus emerged that we are facing two issues:
- Nova's API may expose more metadata than it should (from Glance).
- Nova stores non-standard or overly long keys/values, resulting in silent
truncation or hard DB errors.
✅ Nova should stop storing non-standard image properties altogether.
✅ A cleanup plan should be created to remove existing unused or invalid
metadata from Nova's database post-upgrade.
✅ During instance.save(), Nova should identify and delete unused image_*
keys from the system metadata table.
✅ We must be cautious to preserve snapshot-related keys that are valid but
not part of the base ImageMetaProps.
✅ These changes are considered bugfixes and can proceed without a new spec.
#### Eventlet removal ####
Please read the excellent blog post series from Gibi here:
https://gibizer.github.io/posts/Eventlet-Removal-Flamingo-PTG/
#### Enhanced Granularity and Live Application of QoS ####
This was the first cross-team Neutron/Cinder/Nova topic.
Bloomberg folks presented early ideas around making QoS settings more
granular and mutable, and potentially applicable to existing ports or VMs,
not just at creation time.
Nova does not operate on multiple instances at once, which conflicts with
some proposed behaviors (e.g., live update of QoS on a network/project
level).
QoS is currently exposed via flavors in Nova, and is only supported on the
frontend for the Libvirt driver.
QoS mutability is non-trivial, with implications for scheduling, resource
modeling, and placement interactions.
The scope is broad and would require cross-project collaboration (Neutron,
Cinder, Placement).
Use cases and notes from Bloomberg:
https://etherpad.opendev.org/p/OpenStack_QoS_Feature_Enhancement_Discussion
✅ Flavor-based modeling for QoS remains the Nova approach.
✅ Nova should not apply policies across many instances simultaneously.
✅ A spec will be required, especially if new APIs or behavior modifications
for existing VMs are introduced. The spec should provide concrete use case
examples and API design proposals, including expected behavior during
lifecycle operations (resize, rebuild, shelve, etc.).
✅ Max bandwidth adjustments may be possible (as they don’t require
reservations), but broader mutability is more complex.
✅ Neutron and Cinder raised no objections regarding Bloomberg’s use cases
and proposals. However, please look at Neutron and Cinder's respective
summaries.
#### Moving TAP Device Creation from Libvirt to os-vif ####
This change proposes moving the creation of TAP devices from the Libvirt
driver into os-vif, making it more consistent and decoupled. However, it
introduces upgrade and timing considerations, especially regarding Neutron
and OVN behavior.
Bug: https://bugs.launchpad.net/nova/+bug/2073254
Patch: https://review.opendev.org/c/openstack/nova/+/942786
✅ Neutron team is open to adjusting the timing of the "port ready" event,
which could eliminate the need for Nova-side hacks.
✅ Sean will proceed with the patch and verify behavior through CI.
#### Instance Annotations, Labels & K8s-Like Semantics ####
Sean proposed introducing a mechanism similar to Kubernetes annotations and
labels in Nova, to:
- Express user intent regarding instance behavior (e.g., "should this
instance be migrated?")
- Convey lifecycle preferences to external tools like Watcher and Masakari
- Expose capabilities or constraints of an instance (e.g., "cannot be
shelved because it has a vTPM")
Proposed Examples of Instance Annotations:
lifecycle:live-migratable=true|false
ha:role=primary|secondary
These would be:
- Set by users (or operators)
- Optionally inherited from flavors (but conflicts would raise 400 Bad
Request)
- Expressed intent only — not enforcement of policy
In addition, labels generated by Nova could reflect actual capabilities,
like:
lifecycle:live-migratable=false if an instance has a PCI device
lifecycle:shelvable=false if it uses vTPM
✅ Define a new API to expose capabilities of instances (e.g., “can this
instance be live-migrated?”)
Values will be derived by Nova based on configuration/hardware and exposed
via nova server show.
✅ Sean will create a spec.
✅ Looking at user-defined labels, we eventually considered defining a
second API for them to express scheduling/HA preferences.
However, we decided the current preferred approach is to start with the
metadata API and evolve to a first-class model.
We may need admin-only metadata (e.g., for HA tooling like Masakari) this
has been discussed in Admin-Only Instance Metadata / Annotations later
point.
✅ Sean will also create a spec for this.
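As a purely illustrative sketch (the lifecycle:* keys above are proposals
from the session, not an implemented Nova contract), such intent could be
recorded today through the existing metadata API, e.g.:
    openstack server set --property lifecycle:live-migratable=false my-server
External tools such as Watcher or Masakari could then read that key back
from the server metadata when deciding whether to move the instance.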
#### External Traits and Node Pressure Metrics ####
Sean also proposed allowing external systems (e.g., Watcher, telemetry
agents) to annotate compute nodes with traits such as memory/cpu/io
pressure, based on /proc/pressure.
Examples:
CUSTOM_MEM_PRESSURE=high
EXTERNAL_IO_PRESSURE=moderate
✅ Support a COMPUTE_MEM_PRESSURE-like trait, populated from sysfs as static
info (not dynamic).
✅ A weigher could use these traits to influence placement. A default traits
list could be configured (e.g., prefer/avoid hosts with certain pressures
or hardware features). This approach could evolve into a generic “preferred
traits” weigher, similar to Kubernetes taints/tolerations.
✅ Sean will create a dedicated spec for this feature.
✅ Sbauza volunteered to help, especially as the work aligns with weigher
logic from the previous cycle.
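For context, here is a minimal Python sketch of how an external agent might
derive a coarse pressure level from PSI data; it assumes a PSI-enabled
kernel and the standard /proc/pressure line format, and the thresholds and
level names are made up for illustration. The resulting level could then be
reported as a CUSTOM_* trait on the compute node's resource provider, for
example via the osc-placement "openstack resource provider trait set"
command (note that command replaces the full trait list, so existing traits
must be re-specified).

def pressure_level(resource="memory", high=40.0, moderate=10.0):
    # Read the "some" line, e.g. "some avg10=0.00 avg60=0.00 avg300=0.00 total=0"
    with open(f"/proc/pressure/{resource}") as f:
        fields = f.readline().split()
    values = dict(kv.split("=") for kv in fields[1:])
    avg10 = float(values["avg10"])
    if avg10 >= high:
        return "high"      # e.g. report CUSTOM_MEM_PRESSURE_HIGH
    if avg10 >= moderate:
        return "moderate"  # e.g. report CUSTOM_MEM_PRESSURE_MODERATE
    return "low"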
#### OpenAPI Schema Integration ####
Stephen highlighted that most of the heavy lifting for OpenAPI support is
now complete, and the work is down to pure response schema definitions.
This effort spans over three cycles now, and it would be valuable to
finalize it early in Flamingo.
✅ We'll formalize this work with a blueprint.
✅ The goal is to make early progress in Flamingo, ideally with a dedicated
review day.
✅ Stephen is happy to join synchronous review sessions and will coordinate
pings for progress.
✅ Masahito volunteered to help with the remaining work.
#### OpenStack SDK & Client Workflows ####
Stephen raised a few concerns regarding timing mismatches between SDK/OSC
freezes and microversion patch merges in Nova.
Some microversion support landed too late to be integrated in the SDK
before the Epoxy freeze.
Patches were sometimes missed due to lack of "depends-on" links or broken
initial submissions.
✅ Uggla will follow up and finalize these patches early in the Flamingo
cycle.
#### Upstream Testing for PCI Passthrough and mdev Devices ####
With IGB support merged in Epoxy, and vIOMMU enabled in some Vexxhost
workers (thanks to dansmith), the opportunity exists to expand PCI testing
upstream in Tempest.
This would also benefit testing of one-time-use (OTU) devices.
Finalizing mtty testing is a priority, as it helps ensure device support is
consistent and regressions (like bug #2098892) are caught early.
✅ Bauzas will lead on wrapping up mtty testing.
✅ Gibi will coordinate with cloud providers to assess Epoxy support and
revisit this topic during the next PTG if needed.
#### CPU Power Management – Expected Behavior ####
Melanie raised questions about inconsistencies between design and
implementation in Nova’s CPU power management logic. In particular:
- CPUs were being offlined too aggressively, sometimes during reboot or
migration operations.
- This contradicts the intent that only unassigned or deallocated cores
should be powered off.
There was confusion between two approaches:
- Aggressive power-down of unused CPUs during all idle states (stop,
shelve, etc.)
- Conservative behavior, powering off cores only when the VM is deleted or
migrated away
Consensus favored the aggressive-but-safe model:
- Power down cores only when not used, e.g., VM is stopped or migrated.
- Be cautious not to power off cores prematurely (e.g., during reboot or
verify-resize).
✅ Do not rush to power off CPU cores at compute startup or mid-operation.
✅ Revisit the implementation so the resource tracker runs first, and
determines actual core assignments before making decisions.
#### Live Migration with Encrypted Volumes (Barbican Integration) ####
HJ-KIM raised the point that Nova does not currently support live migration
of instances using encrypted Cinder volumes managed by Barbican. This is a
critical blocker in environments with strict compliance requirements.
✅ This is a parallel issue to vTPM support. We will learn from the vTPM
implementation and consider applying similar concepts.
✅ A future solution may involve adjusting how ownership is managed, or
providing scoped access via ACLs.
✅ Further discussion/spec work will be needed once an implementation
direction is clearer.
#### Manila–Nova Cross-Team Integration ####
The initial Manila–Nova integration is now merged — thanks to everyone
involved!
The next step is to:
- Add automated testing (currently manual tests only).
- Start with a few basic positive and negative test scenarios (create,
attach, write, delete; snapshot and restore; rule visibility; restricted
deletion; etc.).
Additionally, longer-term features and improvements are being considered;
please look at the etherpad.
✅ We will work on tempest tests.
✅ We will continue enhancing Nova–Manila integration during Flamingo (F)
and beyond.
✅ Uggla will submit a spec as needed to land memfd support.
#### Provider Traits Management via provider.yaml ####
📌 Spec: https://review.opendev.org/c/openstack/nova-specs/+/937587
Problem: Traits defined in provider.yaml are added to Placement but never
removed if deleted from the file.
✅ Implement a mechanism where Nova copies the applied file to
/var/lib/nova/applied_provider.yaml, and diffs it with the active one on
restart.
This would allow traits (and possibly other config) to be safely
removed.
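A rough sketch of the diffing step, assuming the applied-copy path above and
a simplified provider.yaml layout (providers with identification and
traits.additional); this is only an illustration of the idea, not the
proposed implementation:

import yaml

def traits_by_provider(path):
    with open(path) as f:
        doc = yaml.safe_load(f) or {}
    result = {}
    for prov in doc.get("providers", []):
        ident = prov.get("identification", {})
        key = ident.get("uuid") or ident.get("name")
        result[key] = set(prov.get("traits", {}).get("additional", []))
    return result

def removed_traits(applied_path, active_path):
    # Traits present in the previously applied file but missing from the
    # active file should be removed from Placement on restart.
    applied = traits_by_provider(applied_path)
    active = traits_by_provider(active_path)
    return {rp: applied[rp] - active.get(rp, set()) for rp in applied}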
#### Admin-Only Instance Metadata / Annotations ####
📌 Spec: https://review.opendev.org/c/openstack/nova-specs/+/939190
Issue: Current instance metadata is user-owned, and shouldn't be used by
admins.
Proposal: Introduce admin-only annotations (or metadata with ownership
tracking), allowing operators to set system-visible metadata without
violating user intent.
✅ Introduce a created_by field (similar to locked_by) to track who created
metadata: user vs admin.
Consider an admin: prefix namespace for admin-controlled keys (applied to
annotations or metadata).
Implementation requires a DB change and a nova-spec.
Note: This aligns well with broader annotation work already discussed in
this cycle.
#### delete_on_terminate for Ports (Server Create / Network Attach APIs)
####
📌 Related discussion:
https://review.opendev.org/c/openstack/nova-specs/+/936990
Background: This was discussed in past PTGs. Currently, delete_on_terminate
can't be updated dynamically across instance lifetime.
✅ A spec with a working PoC will help clarify the desired behavior and
unblock the discussion.
Long-term solution may require storing this flag in Neutron as a port
property (rather than Nova-specific DB).
#### Graceful Shutdown of Nova Compute Services ####
📌 Spec: https://review.opendev.org/c/openstack/nova-specs/+/937185
Challenge: Need a mechanism to drain compute nodes gracefully before
shutdown, without interrupting active workloads or migrations.
Graceful shutdown is tricky in the presence of live migrations.
Ideas include:
- Temporary “maintenance mode” (block write requests).
- Group-level compute draining.
✅ The topic is important but not urgent — bandwidth is limited.
Note: Eventlet removal may simplify implementing this.
✅ Please report concrete bugs so we understand the blockers.
✅ A nova-spec with PoC would help drive the conversation.
#### Libvirt/QEMU Attributes via Flavor Extra Specs ####
Target: Advanced tuning of I/O performance via iothreads and virtqueue
mapping, based on:
https://developers.redhat.com/articles/2024/09/05/scaling-virtio-blk-disk-i…
✅ Introduce new flavor extra specs such as:
- hw:io_threads=4
- hw:blk_multiqueue=2
These can be added to both flavor and image properties.
✅ A nova-spec should be written to document naming and semantics.
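Once the spec lands, usage would presumably follow the normal extra-spec
workflow; the property names below are the proposed ones and may still
change:
    openstack flavor set --property hw:io_threads=4 \
        --property hw:blk_multiqueue=2 io-tuned-flavor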
#### Dynamic Modification of libvirt Domain XML (Hook Proposal) ####
oVirt allows for plugins to alter the libvirt domain XML just before
instance launch (via VDSM hooks).
Nova does not offer a mechanism to intercept or modify the domain XML, and
the design explicitly avoids this.
The desired use case involves injecting configuration that libvirt cannot
currently represent, for example, enabling multiuser SPICE consoles.
✅ This proposal is explicitly rejected.
✅ Nova will not support hook points for permuting libvirt XML.
✅ Operators may use out-of-band libvirt/qemu hooks at their own risk, but
should not expect upstream support or stability guarantees.
#### Revisiting the "No More API Proxies" Rule ####
Masahito proposed allowing users to filter instances via API based on
related service data, such as network_id.
✅ The "no API proxy" rule remains, but with pragmatic exceptions:
- Filtering is acceptable if the data exists in Nova’s DB (e.g., network
ID, image ID).
- No cross-service REST calls allowed (e.g., Neutron QoS types still out of
scope).
- Filtering by network_id in nova list is reasonable and can proceed.
✅ Masahito will provide a spec.
#### OVN Migration & Port Setup Timing ####
📌 Context: https://bugs.launchpad.net/nova/+bug/2073254
In OVN-based deployments, Neutron signals the network-plugged event too
early, before the port is fully set up. This causes issues in live
migration, especially under load.
✅ Nova already supports waiting on the network-plugged event. OVN in Ubuntu
Noble should behave properly.
A proposal to improve timing in Neutron was discussed (Neutron to wait for
port claim in southbound DB).
Nova might support this via a Neutron port hint that triggers tap interface
creation earlier during migration (pre-live-migration).
✅ Next step: open an RFE bug in Neutron. If accepted, a Nova spec may be
needed.
#### Blocking API Threads During Volume Attachments ####
📌 Context: https://bugs.launchpad.net/nova/+bug/1930406
Volume attachment RPC calls block API workers in uWSGI, leading to
starvation when multiple attachments are made in parallel.
✅ Volume/interface attachments should become async, reducing API lock
contention.
Fix is non-trivial and will require a microversion.
In the meantime, operators may tune uWSGI workers/threads or serialize
attachment calls.
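As an interim illustration only (the values are arbitrary and
deployment-specific), the worker/thread sizing can be adjusted in the
nova-api uWSGI ini, e.g.:

[uwsgi]
# illustrative sizing only; more threads per worker reduces the chance that
# all of them end up blocked on long-running attachment RPC calls
processes = 4
threads = 8
enable-threads = true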
#### Inventory Update Failure – DISK_GB Bug ####
📌 Bug: https://bugs.launchpad.net/nova/+bug/2093869
When local storage becomes temporarily unavailable (e.g., Ceph down), Nova
sends total=0 for DISK_GB, which Placement rejects if allocations exist.
✅ The real fix is to restore the storage backend.
Nova should improve error handling/logging, but should not shut down the
compute service.
#### Security Group Name Conflict Bug ####
📌 Bug: https://bugs.launchpad.net/nova/+bug/2105896
When multiple security groups share the same name (via Neutron RBAC),
instance builds can fail due to incorrect duplicate detection logic.
✅ The issue was fixed in:
https://review.opendev.org/c/openstack/nova/+/946079
✅ Fix will be reviewed and backported to Epoxy.
If you've read this far — thank you! 🙏
If you spot any mistakes or missing points, please don't hesitate to let me
know.
Best regards.
René.
3 months, 4 weeks
[manila] 2025.2 Flamingo PTG summary
by Carlos Silva
Hello Zorillas and interested stackers,
Last week's PTG had plenty of topics and good takeaways.
In case you would like to watch any of the discussions, please take a look
at the videos in the OpenStack Manila Youtube channel [0].
The PTG etherpad has all of the notes we took [9]. Here is a summary of the
discussions grouped by each topic:
Retrospective
==========
Highlights
-------------
Mid cycle alongside feature proposal freeze provided a good opportunity for
us to have collaborative review sessions and move faster on reviews.
Two bugsquashes had a good impact on the bug backlog and the bug trend was
more positive this cycle, despite the numbers growing due to the
low-hanging fruit we started reporting.
Internships with City University of Seattle, Valencia College and North
Dakota State University - they are definitely helping with progress on
manila-ui and OpenAPI. We will continue the effort.
We would like to speed up reviews and improve our metrics [1] on how long
changes are open before being merged. Review dashboards can help and we can
work with our reviewers to have a more disciplined approach on reviews.
Broken third party CI systems currently mean that we have little testing.
We need to rely on the authors or their peers to test and ensure that a
feature is working. We will look into documenting CI setup procedures and
gathering thoughts from maintainers.
New API features should be tested as early as possible to ensure they won't
break any workflows. Our contributor documentation will be updated with
extra guidelines.
AIs:
(carloss) Encourage Bug Czar candidates and bring this up more often in the
manila weekly meetings
(carloss) Encourage spec authors to schedule a meeting to discuss the spec
to speed up the review process.
(carloss) include iCal with event announcements (bugsquash / mid cycle)
(gouthamr) Creating a review dashboard
(carloss) Record "expert seminars" on FAQs: it would be great to have some
videos documenting how-tos in OpenStack and help people to unblock
themselves when they are hitting common openstack-developer issues:
https://etherpad.opendev.org/p/manila-howcasts
(carloss) communicate a deadline for the manila CLI -> OSC documentation
changes. The work with our interns should go until FPF. It needs to be done
before the client release, when we are planning to drop the manilaclient
support. ashrodri offered help to get it completed after we come to the FPF
deadline.
(carloss) We should update these docs and mention that first party driver
implementations should be done for features and be more strict about the
testing requirements.
All things CephFS [2]
================
Deprecation of standalone NFS-Ganesha
-------------------------------------------------------
We added a warning in Dalmatian and deferred plans to deprecate based on
community feedback. Our plan is to remove it in the 2026.1 release. There
is a suggested update procedure; please reach out in case there are
questions.
AI: (carloss) send a reminder email in this cycle to incentivize people to
move to clustered NFS
Supporting NFSv3 for Windows workloads
--------------------------------------------------------
manila-tempest-plugin now supports multiple NFS protocol versions in one of
the scenario tests. As soon as we get the build, we will update the CephFS
NFS job to run tests for NFSv3 as well.
Testing and stabilization
--------------------------------
Bumped Ceph version in the CI jobs to Reef in Antelope, Bobcat, Caracal,
Dalmatian. We are starting to test with Ceph Squid; we intend to test with
Squid on "master" and "stable/2025.1" (epoxy) branches.
A couple of Ceph and NFS-Ganesha issues are impacting us at the moment [4]
[5] [6] and we managed to find workarounds for some.
We needed to stop testing with the ingress daemon for now, and we will get
back to testing it as soon as the fix is out.
Manage unmanage of shares and snapshots
-----------------------------------------------------------
The feature is merged and working, and we are going to backfill tempest
test patches.
AI: (carloss) will propose a new job variant to allow testing this feature.
Plans for 2025.2 Flamingo
-----------------------------------
Investigate support for SMB/CIFS
Ceph-NFS QoS: we will follow the implementation of this feature in NFS
Ganesha and start discussing and drafting the Manila implementation when
the code is merged in Ganesha upstream.
Out of place restores and backup enhancements [7]
========================================
CERN is pursuing a backup backend with their C-Back tool. Currently Manila
backups can be restored back to the same share; there are some problems
with such an approach when the source share backend is down and with
preventing browse-by-restore behavior.
Zachary Goggins (za) proposed a specification, and plans to work on it
during the Flamingo Cycle. The share backups feature also needs some
enhancements like a get progress and get restore progress actions. Zach
plans to make it part of the implementation.
We agreed that a backup resource should have a new "state" attribute,
instead of only relying on the status in order to have well defined backup
states.
AI: (za) update the out of place restore spec.
Tech debt
=======
Container driver failures
--------------------------------
The container driver tempest tests are perma-failing right now. We seem to
have a problem with RBAC and pre-provisioned tempest credentials.
AIs:
(carloss) Report a tempest bug to track the issues;
(gouthamr) will propose a change to switch back to using dynamic
credentials in our testing.
DockerHub rate limits
-----------------------------
We are only building an image in manila-image-elements. It's more pulls
than pushes. Pushes happen very rarely. The kolla team has moved away from
DockerHub as well.
Zach offered help in case we need another approach for registry. CERN has
its own tool.
AI: we will look into moving to quay.io
"manila" CLI removal
----------------------------
We added the deprecation warning 6 releases ago and we should proceed with
the removal. We will need an additional push to update all of our
documentation examples and move to keystoneauth.
We need more functional test coverage and we should have a hackathon just
as we did some years ago.
AI: carloss will schedule a hackathon for enabling more tests and send the
removal email to openstack-discuss. We are targeting the removal to 2025.2
Flamingo.
CI and testing
------------------
ZFSOnLinux job left on jammy: We created a bug for it and we can use it for
tracking.
IPv6 testing: The BGP software we were using (quagga) is now deprecated and
everything was migrated to FRR. We will need help to fix it, as
unfortunately things didn't have a 1:1 translation between the libraries.
If someone has experience on this, it would be nice to collaborate to get
this fixed.
API
----
We are going to stop testing the v1 API and stop deploying it on DevStack
test jobs. We'll update the install guide as well that we've stopped
supporting it. It was deprecated in 2015 ("Liberty" release). That's a good
code cleanup opportunity.
V2 is an extension of v1 with microversions.
If we stop supporting it, who is affected? Mostly people that have
automations using it.
What's the impact on manila-tempest-plugin? We have v1 and v2 tests. We
have a lot of coverage for v2. If you don't have the v1 API in the cloud,
the tests refuse to run. We will need to fix it.
AIs:
Work on the removal patches during the 2025.2 Flamingo release;
(carloss) will send an announcement email to the ML, including operators
tag.
Manila UI
-------------
We have been making progress in the Manila UI feature gap. Currently
working on manage/unmanage share servers, manage share with dhss=true,
filtering user messages on date, updating quotas table.
The share limits view broke some time ago; the code lives in Horizon.
We hit some issues using Horizon's tox "runserver" environment; apparently
more people ran into the same issue. We will talk to other impacted parties
and check how to overcome this issue.
AI: (carloss) will reach out to the horizon team and ask how we can
re-introduce Manila limits to the overview tab.
Enable share encryption at-rest (back-end) with secret refs stored on
Barbican/Castellan. [8]
=====================================================================
We merged a specification some time ago with an implementation
architecture. That spec contemplated both Share encryption and Share server
encryption.
NetApp is now planning to work only on share server encryption. Encryption
can be disabled per share, but shares exported via a share server cannot
have a separate encryption key on ONTAP.
We reached an agreement that when a new share creation is triggered, if
there isn't a share server matching the provided key, a new share server
will need to be spawned. We also agreed that we should allow using names
for the secret reference for better user experience.
2025.2 Flamingo is the target release.
AIs: (kpdev/Sai) The spec will be updated and only the DHSS=True scenario
will be documented; The manila team will review the spec as soon as it is
proposed
Replication Improvements
====================
Back when we implemented replication, we didn't account for specific
configurations that the storage backends can have, for example whether the
backend could support zero RPO technologies or not.
Zero RPO is an important feature that allows data to be written
simultaneously between the share and its replicas.
We agreed that the way we should send the information to the backend is
through a backend specific share type extra spec. Administrators will be
able to define it in the share type and the backend will pick it up.
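For illustration only (the extra-spec name below is hypothetical; the real
one will be defined by the vendor and the spec), an administrator would set
something like:
    manila type-key zero-rpo-type set vendor_prefix:zero_rpo_enabled=True
and the backend driver would read that extra spec when creating replicas.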
Operator concerns / questions
=======================
Where to put parameters that change behaviour only of one protocol (NFS in
this case)? We agreed that we should have a write-once type of metadata and
not allow the metadata to be updated afterwards. A configuration option can
be introduced for this where the operator can determine what metadata will
not be updated.
AI: carthaca will propose a lite-spec for this
Lustre FS Support for HPC Use Cases in OpenStack
Is there any possibility for OpenStack to officially integrate or support
parallel file systems like Lustre, either through Manila or other
components? We've heard in the past as a request from the scientific-sig
group. Building a driver should be straightforward and it does not
necessarily need to be in-tree, and it would be easier to maintain. This is
a very good use case. This discussion will continue with the scientific-sig
group.
Replica / Snapshot Retention / Expiration Policy
While replicas in Manila are designed to be continuously in sync with the
active share, certain use cases — such as disaster recovery (DR) replicas
or manually created replicas that are no longer needed — could benefit from
lifecycle management.
Replicas are continuously synced with the source share, so the assumption
is that if they're "unused", they're still there for some reason. We had a
spec a while ago about automating snapshots (creation and deletion) on a
schedule.
It would be preferable that an external automation tool is used to achieve
such behavior. Maybe openstack/mistral can be a good approach (Support for
manila snapshots already exists on Mistral)
Affinity/Anti-affinity spec updates
=========================
This feature allows users to create share groups with affinity policies,
which determine the affinity relationship between shares within the group.
There was an open question about locking strategies. We came to an
agreement that we can use tooz, the database, or oslo.
AI: (chuanm) will update the spec.
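If tooz is chosen, the coordination pattern would roughly look like the
sketch below; the backend URL, member id and lock name are illustrative
only, not what the spec will define:

from tooz import coordination

coord = coordination.get_coordinator(
    "etcd3+http://127.0.0.1:2379", b"manila-scheduler-1")
coord.start()
# Serialize placement decisions for a given share group across schedulers.
with coord.get_lock(b"manila-share-group-affinity-<group-id>"):
    pass  # evaluate affinity/anti-affinity constraints and pick a host
coord.stop()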
Force deleting subnets
=================
This is a feature that follows the ability to add multiple subnets to a
share server. We should also be able to remove them. This spec is under
review.
We agreed that we should also implement the "check" mechanism before
deleting the subnet.
AIs: (sylvanld) will update the spec
Eventlet removal
=============
We need to remove WSGI uses and use oslo.service's new threading-based
backend instead for the ProcessLauncher and periodic tasks. Neutron is
doing some work
around periodic tasks and we can benefit from their learning.
AI: Work on this in Flamingo, aiming for completion in 2026.1 cycle.
Manila/Nova Cross-project session: VirtioFS
=================================
VirtioFS implementation is now complete and we are looking at the next
steps. We currently don't have CI testing the feature and the Manila team
is planning to work on it during the 2025.2 Flamingo release.
The nova team intends to drive the remaining SDK and OSC patches to
completion during the 2025.2 Flamingo release.
We also discussed some possible enhancements: memfd support, online attach
and detach and live migration. These will take some time and the Nova team
will work on such features gradually.
AIs: (carloss) will share the test scenarios with the Nova team and ask for
reviews and the Manila team will work on the implementation of the tests.
(rribaud) will work on the remaining SDK patch and work on memfd support.
[0]
https://www.youtube.com/watch?v=MLXkBRhViS0&list=PLnpzT0InFrqADxXi_dtPqfWLt…
[1]
https://openstack.biterg.io/app/dashboards#/view/Gerrit-Backlog?_g=(filters…:'Gerrit%20Backlog%20panel%20by%20Bitergia.
',filters:!(('$state':(store:appState),meta:(alias:'Changesets%20Only',disabled:!f,index:gerrit,key:type,negate:!f,params:(query:changeset),type:phrase),query:(match:(type:(query:changeset,type:phrase)))),('$state':(store:appState),meta:(alias:Bots,disabled:!f,index:gerrit,key:author_bot,negate:!t,params:(query:!t),type:phrase),query:(match:(author_bot:(query:!t,type:phrase)))),('$state':(store:appState),meta:(alias:!n,disabled:!f,index:gerrit,key:project,negate:!f,params:(query:manila),type:phrase),query:(match_phrase:(project:manila)))),fullScreenMode:!f,options:(darkTheme:!f,useMargins:!t),query:(language:lucene,query:(query_string:(analyze_wildcard:!t,default_field:'*',query:'*',time_zone:Europe%2FMadrid))),timeRestore:!f,title:'Gerrit%20Backlog',viewMode:view)
[2] https://etherpad.opendev.org/p/flamingo-ptg-manila-cephfs
[3] https://bugs.launchpad.net/manila/+bug/2049538
[4] https://github.com/nfs-ganesha/nfs-ganesha/issues/1227
[5] https://tracker.ceph.com/issues/69214
[6] https://tracker.ceph.com/issues/67323
[7] https://review.opendev.org/c/openstack/manila-specs/+/942694
[8] https://etherpad.opendev.org/p/share-encryption-with-barbican-secret-ref
[9] https://etherpad.opendev.org/p/flamingo-ptg-manila
Thank you to everyone who participated in the PTG!
Best regards,
carloss
3 months, 3 weeks
[nova][ptg] 2025.1 Epoxy PTG summary
by Sylvain Bauza
(resending the email as the previous one was blocked due to an attached
etherpad backup text file larger than the max size)
Hey all,
First, thanks for having joined us if you were in the vPTG. We had 15-20
people every day for our nova sessions, I was definitely happy to see new
folks :-)
If you want to see our PTG etherpad, please look at
https://etherpad.opendev.org/p/r.4f297ee4698e02c16c4007f7ee76b7c1 instead
of the main nova etherpad, as I don't want to risk the etherpad ending up
with wrong edits or having some paragraphs removed.
As I say every cycle, just take a coffee (or a tea) now as the summary will
be large.
### Dalmatian retrospective and Epoxy planning ###
6 of 15 approved blueprints were eventually implemented. We also merged
more than 31 bugfixes during Dalmatian.
We agreed to announce on the IRC channel when we hold meetings for
discussing some feature series (like the one we did every week for the
manila/virtiofs series) and to send out public invitations. We could do
this again this cycle for other features, we'll see.
We will also try to have a periodic integration-compute job that pulls OSC
and SDK from master.
Our Epoxy deadlines will be: two spec review days (R-16, R-2), a soft spec
approval freeze by R-16 and then hard spec approval freeze by R-12. That
means that contributors really need to provide their specs before
mid-December. Bauzas (me) will add these deadlines into the Epoxy schedule:
https://releases.openstack.org/epoxy/schedule.html
### vTPM live migration ###
We agreed on the fact that a vTPM live-migration feature is a priority for
Epoxy given Windows 11.
artom will create a spec proposing an image metadata property saying 'do I
want to share my secret with the nova service user?' and also providing a
new `nova-manage image_property set migratable_something` command so
operators could migrate the existing instances for getting the Barbican
secrets, if the operators really want that.
### Unified limits wrap-up ###
We already have two changes needing to be merged before we can modify the
default quota driver (in order to default to unified limits). We agreed
on reviewing both patches (one for treating unset limits as unlimited, the
other about adding a nova-manage command for automatically creating nova
limits), but we also discussed a later patch that would eventually
say which nova resources need to be set (so we *have to* enforce
them anyway). melwitt agreed to work on that later patch.
### per-process health checks ###
We already had one series and we discussed it again. Gibi agreed to take
it over and he will re-propose the existing spec as it is. We also
discussed the first checks we would have, like RPC failures and DB
connection issues; we'll review those when they are in Gerrit.
### sustainable computing (a.k.a. power mgmt) ###
When someone (I won't say who [1]) implemented power management in
Antelope, this was nice but we eventually found a long list of bugs that we
fixed. Since we don't really want to reproduce that experience, we had a
kind of post-mortem where we eventually agreed on two things that could
avoid reproducing that problem: a weekly periodic job will run the whitebox
tempest plugin [2], with nova-compute restarts also covered by a whitebox
tempest plugin.
Nobody has committed to those two actions yet, but we hope to identify
someone soon.
As a side note, gibi mentioned RAPL MSR support [3], notifying us that we
would have to support that in a later release (as the libvirt
implementation is not merged yet)
### nvidia's vGPU vfio-pci variant driver support ###
Long story short, since the Linux kernel removed a feature in release
5.18 (IOMMU backend support for vfio-mdev), this impacted the nvidia driver,
which now detects that and creates vfio-pci devices instead of
vfio-mdev devices (as vGPUs). This has a dramatic impact on Nova as we
relied on the vfio-mdev framework for abstracting virtual GPUs. By the next
release, Nova will need to inventory the GPUs by instead looking at SR-IOV
virtual functions which are specific to the nvidia driver (we call them
vfio-pci variant driver resources).
The nova PTG session focused on the required efforts to do so. We agreed on
the fact it will require operators to propose different flavors for vGPU
where they would require distinct resource classes (all but VGPU).
Fortunately, we'll reuse existing device_spec PCI config options [4] where
the operator would define custom resource classes which would match the PCI
addresses of the nvidia-generated virtual functions (don't freak out, we'll
also write documentation). We'll create another device type (something like
type-VF-migratable) for describing such specific nvidia VFs.
Accordingly, the generated domain XML will correctly write the device
description (amending the "managed=no" flag for that device).
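For illustration only, and with the caveat that the PCI address, the custom
resource class name and the exact device_spec keys below are assumptions
rather than a settled design, the operator-facing side could look something
like:

    # Hypothetical illustration only: expose the nvidia variant-driver VFs
    # via a custom resource class and request it from a flavor. The PCI
    # address, resource class name and device_spec keys are assumptions.
    import json

    device_spec = {
        "address": "0000:3b:00.*",                  # VFs created by the nvidia driver
        "resource_class": "CUSTOM_NVIDIA_VGPU_VF",  # operator-defined, instead of VGPU
        "managed": "no",                            # nova should not bind/unbind the VF
    }

    # What the operator would put in nova.conf under [pci]:
    print("device_spec = " + json.dumps(device_spec))

    # And a matching flavor extra spec requesting that class:
    flavor_extra_specs = {"resources:CUSTOM_NVIDIA_VGPU_VF": "1"}
    print(flavor_extra_specs)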
There will be an upgrade impact: existing instances will need to be resized
to that new flavor (or instances will need to be shelved, updated for
changing the embedded flavor and unshelved).
In order to be on par with existing vGPU features, we'll also need to
implement vfio-pci live-migration by detecting the VF type on the existing
SRIOV live-migration.
Since that effort is quite large, bauzas will assemble a subteam of
interested parties that would help him implement all of those bits in the
short timeframe that is one upstream cycle.
### Graceful shutdowns ###
A common pitfall reported by tobias-urdin is stopping nova-compute
services. In general, before stopping the service, we should be sure that
all RPC calls are done, which means we would no longer accept RPC calls
after asking nova-compute to stop and would just wait for the current calls
to finish before stopping the service. For that, we need to create a
backlog spec for discussing that design and we would also need to modify
oslo.service to unsubscribe from the RPC topics. Unfortunately, this cycle
we won't have any contributor working on it, but gibi could try to at least
document this.
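Purely to illustrate the drain-then-stop idea (this is not nova or
oslo.service code, just the shape of the behaviour we would want):

    # Generic drain-then-stop sketch: stop accepting new RPC work, wait for
    # in-flight handlers to finish, then let the service exit.
    import threading


    class GracefulWorker:
        def __init__(self):
            self._accepting = True
            self._in_flight = 0
            self._idle = threading.Condition()

        def handle_rpc(self, func, *args):
            with self._idle:
                if not self._accepting:
                    raise RuntimeError("service is shutting down")
                self._in_flight += 1
            try:
                return func(*args)
            finally:
                with self._idle:
                    self._in_flight -= 1
                    self._idle.notify_all()

        def stop(self, timeout=60):
            # Step 1: unsubscribe from the RPC topics / stop accepting calls.
            with self._idle:
                self._accepting = False
                # Step 2: wait for calls already running to finish.
                self._idle.wait_for(lambda: self._in_flight == 0,
                                    timeout=timeout)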
### horizon-nova x-p session ###
We mostly discussed the Horizon feature gaps [5]. The first priority would
be for Horizon to use the OpenStackSDK instead of novaclient, and then to
support all of the new Nova API microversions. Unfortunately, we are not
sure that we could have Horizon contributors that could fix those, but if
you're a contributor and you want to help make Horizon better, maybe you
could do this? If so, please ping me.
### Ironic-nova x-p session ###
We didn't really have topics for this x-p session. We just quickly
discussed some points, like Graphical Console support. Nothing really worth
noting, maybe just that it would be nice if we could have a readonly
graphical console. We were just happy to say that the ironic driver now
works better thanks to some features that were merged in the last cycles.
Kudos to those who did them.
### HPC/AI optimized hypervisor "slices" ###
A large topic to explain, I'll try to keep it short. Basically, how Nova
slices the NUMA affinity between guests is nice but hard for HPC use cases
where sometimes you need a better way to slice the NUMA-dependent devices
depending on the various PCI topologies. Eventually, we agreed on a POC
that johnthetubaguy could work on, trying to implement a specific virt
driver that would do something different from the existing NUMA affinities.
### Cinder-nova x-p session ###
Multiple topics were discussed there. First, abishop wanted to enhance
cinder's retyping of in-use boot volumes, which means that the Nova
os-attachments API needs to get a new parameter. We said that he needs to
create a new spec and we agreed on the fact that the cinder contributors
need to discuss with QEMU folks to learn about the qemu writes.
We also discussed a new nova spec which is about adding burst length
support to Cinder QoS [6]. We said that both teams (nova and cinder) need
to review this spec.
About residues left behind when detaching a volume, we also agreed on the
fact this is not a security flaw and the fact that os-brick should delete
them, not nova (even if nova needs to ask os-brick to look at that, either
by a periodic run or when attaching/detaching). whoami-rajat will provide a
spec for it.
### Python 3.13 support ###
We discussed a specific issue for py3.13, the fact that the crypt module is
no longer in the stdlib for py3.13, which impacts nova due to some usage in
the nova.virt.disk.api module for passing an admin password for file
injection. Given file injection is deprecated, we have three possibilities:
either removing admin password file injection (or even file injection as a
whole), adding the new separate crypt package in upper-constraints, or
using the oslo_utils.secretutils module. bauzas (me) will send an email to
openstack-discuss asking operators whether they are OK with removing file
injection or just admin password injection, and then we'll see the
direction. bauzas or sean-k-mooney will also try to have py3.13 non-voting
jobs for unit tests/functional tests.
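To make the problem concrete, here is a small sketch of the guarded-import
situation; the helper name and salt below are illustrative, and which
replacement (separate package or oslo_utils) gets picked is deliberately
left open:

    # Sketch of the issue: the stdlib crypt module used for admin-password
    # file injection is gone in Python 3.13 (PEP 594). hash_admin_password
    # and the salt below are illustrative only.
    import sys

    try:
        import crypt  # removed from the stdlib in Python 3.13
    except ImportError:
        crypt = None


    def hash_admin_password(password, salt="$6$rounds=4096$examplesalt"):
        if crypt is None:
            raise RuntimeError(
                "crypt is unavailable on Python %s; admin-password file "
                "injection needs a replacement or removal"
                % sys.version.split()[0])
        return crypt.crypt(password, salt)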
### Eventlet removal steps in Nova ###
I won't explain why we need to remove eventlet, you already know, right?
We rather discussed the details in our nova components, including
nova-api, nova-compute and other nova services. We agreed on removing
direct eventlet imports where possible, moving nova entrypoints that don't
use eventlet to separate modules that don't monkeypatch the stdlib, looking
at what we can do with all our scatter_gather methods (which asynchronously
call the cells DBs) in order to use threads instead, and checking whether
those calls are blocking on the DB (and not on the MQ side). Gibi will
shepherd that effort and provide an audit of the eventlet usage in order to
avoid any unexpected but unfortunate late discoveries.
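To illustrate the scatter_gather point (this is not nova's real
implementation, just the general shape of fanning the blocking per-cell DB
calls out over native threads with a timeout sentinel):

    # Illustrative only, not nova's real scatter_gather implementation: fan
    # the blocking per-cell DB calls out over native threads and record a
    # sentinel for cells that do not answer in time.
    import concurrent.futures

    CELL_TIMEOUT_SENTINEL = object()


    def scatter_gather(cells, query, timeout=60):
        """Run query(cell) in one thread per cell and collect the results."""
        results = {}
        pool = concurrent.futures.ThreadPoolExecutor(
            max_workers=max(len(cells), 1))
        try:
            futures = {pool.submit(query, cell): cell for cell in cells}
            done, not_done = concurrent.futures.wait(futures, timeout=timeout)
            for future in done:
                cell = futures[future]
                try:
                    results[cell] = future.result()
                except Exception as exc:  # real code would log per-cell errors
                    results[cell] = exc
            for future in not_done:
                results[futures[future]] = CELL_TIMEOUT_SENTINEL
        finally:
            pool.shutdown(wait=False, cancel_futures=True)
        return results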
### Libvirt image backend refactor ###
If you like spaghetti, you should pay attention to the libvirt image
backend code. Lots of assumptions and conditionals make any change to that
module hard to write and hard to review, leading to error-prone
situations like the ones we had when fixing some recent CVEs.
We all agreed on the quite urgent necessity to refactor that code and
melwitt proposed a multi-stage effort about that. We agreed on the proposal
for the first two steps with some comments, leading to future revisions of
the proposal's patches. The crucial bits with the refactor are about test
coverage.
### IOThreads tuning for libvirt instances ###
An old spec was already proposed for defining iothreads to guests. We
agreed on reviving that spec, where a config option would define either no
iothread or one iothread per instance (with the potential for a later option
value to be "one iothread per disk"). Depending on whether
emulator_thread_policy
is provided in the flavor/image, we would set the iothread on that policy
or we would put the iothread floating over the shared CPU set. If no shared
CPUs are configured but the operator wants iothreads, nova-compute would
refuse to start. lajoskatona will work on such an implementation that will
be designed in a blueprint that doesn't require a spec.
### OpenAPI schemas progress ###
Nothing specific to say here, bauzas and gmann will review the series this
cycle.
That's it. I'm gone, I'm dead [7] (a cyclist metaphor) but I eventually
skimmed the very large nova etherpad. Of course, there is a 99% chance that
I wrote some notes incorrectly, so please correct me if I'm wrong; I won't
feel offended, just tired.
Thanks all (and I hope your coffee or tea was good)
-Sylvain
[1] https://geek-and-poke.com/geekandpoke/2013/11/24/simply-explained
[2] https://opendev.org/openstack/whitebox-tempest-plugin
[3] https://www.qemu.org/docs/master/specs/rapl-msr.html
[4]
https://docs.openstack.org/nova/latest/configuration/config.html#pci.device…
[5] https://etherpad.opendev.org/p/horizon-feature-gap#L69
[6] https://review.opendev.org/c/openstack/nova-specs/+/932653
[7] https://www.youtube.com/watch?v=HILcYXf8yqc
9 months, 2 weeks
Re: [watcher] 2025.2 Flamingo PTG summary
by Sean Mooney
On 17/04/2025 13:17, Dmitriy Rabotyagov wrote:
>> well gnocchi is also not a native OpenStack telemetry datastore, it left
>> our community to pursue its own goals and is now a third party datastore
>> just like Grafana or Prometheus.
> Yeah, well, true. Is still somehow treated as the "default" thing with
> Telemetry, likely due to existing integration with Keystone and
> multi-tenancy support. And beyond it - all other options become
> opinionated too fast - ie, some do OpenTelemetry, some do Zabbix,
> VictoriaMetrics, etc. As pretty much from what I got as well, is that
> still relies on Ceilometer metrics?
> And then Prometheus is obviously not the best storage for them, as it
> requires to have pushgatgeway, and afaik prometheus maintainers are
> strictly against "push" concept to it and treat it as conceptually
> wrong (on contrary to OpenTelemetry).
i don't know the details but i know there is work planned for native
support of a Prometheus scrape endpoint in ceilometer
so while you currently need to use sg-core to provide that integration there
is a plan to remove the need for sg-core going forward.
https://etherpad.opendev.org/p/r.72ac6a7268e4b9d854f75715adede80c#L28
i don't see a spec proposed yet but there is an older one from 2 years ago
https://review.opendev.org/c/openstack/telemetry-specs/+/845485/4/specs/zed…
there is also a plan to provide keystone integration and multi-tenancy
https://etherpad.opendev.org/p/r.72ac6a7268e4b9d854f75715adede80c#L84
> So the metric timestamp issue is
> to remain unaddressed.
> So that's why I'd see leaving Gnocchi as "base" implementation might
> be valuable (and very handy for us, as we don't need to implement a
> prometheus job specifically for Watcher).
watcher, aodh, and cloudkitty i believe all have some level of support for
Prometheus but they can also use other backends. i'm not sure what level
of enablement they have in osa.
>
>> but for example watcher can integrate with both ironic an canonical maas
> component
>> to do some level of host power management.
> That sounds really interesting... We do maintain infrastructure using
> MAAS and playing with such integration will be extremely interesting.
> I hope I will be able to get some time for this though...
the current maas integration has 3 problems: 1) a lack of testing, 2) a
lack of documentation
and 3) it somehow managed to introduce asyncio in a project that uses
eventlet, in
a release of eventlet that did not support asyncio,
so i'm very nervous that it is broken or will break in the future.
this is the entirety of the support:
https://review.opendev.org/c/openstack/watcher/+/898790
there are no docs and no spec...
so this should definitely be considered "experimental" at best today.
>
> чт, 17 апр. 2025 г. в 13:52, Sean Mooney <smooney(a)redhat.com>:
>>
>> On 16/04/2025 21:04, Dmitriy Rabotyagov wrote:
>>> Hey,
>>>
>>> Have a comment on one AI from the list.
>>>
>>>> AI: (jgilaber) Mark Monasca and Grafana as deprecated, unless
>>> someone steps up to maintain them, which should include a minimal CI
>>> job running.
>>>
>>> So eventually, on OpenStack-Ansible we were planning to revive the
>>> Watcher role support to the project.
>>> How we usually test deployment, is by spawning an all-in-one
>>> environment with drivers and executing a couple of tempest scenarios
>>> to ensure basic functionality of the service.
>>>
>>> With that, having a native OpenStack telemetry datastore is very
>>> beneficial for such goal, as we already do maintain means for spawning
>>> telemetry stack. While a requirement for Prometheus will be
>>> unfortunate for us at least.
>>>
>>> While I was writing that, I partially realized that testing Watcher on
>>> all-in-one is pretty much impossible as well...
>>>
>> you can certenly test some fo watcher with an all in one deployment
>>
>> i.e. the apis and you can use the dummy test stragies.
>>
>> but ya in general like nova you need at least 2 nodes to be able to test
>> it properly ideally 3
>>
>> so that if your doing a live migration there is actully a choice of host.
>>
>> in general however watcher like heat just asks nova to actully move the vms.
>>
>> sure it will ask nova to move it to a specific host but fundementaly if
>> you have
>>
>> tested live migration with nova via tempest seperatly there is no reason
>> to expcect
>>
>> it would not work for live migratoin tirggred by watcher or heat or
>> anything else that
>>
>> jsut calls novas api.
>>
>> so you could still get some valual testing in an all in one but ideally
>> there woudl be at least 2 comptue hosts.
>>
>>
>>> But at the very least, I can propose looking into adding an OSA job
>>> with Gnocchi as NV to the project, to show the state of the deployment
>>> with this driver.
>>>
>> well gnocchi is also not a native OpenStack telemetry datastore, it left
>> our community to pursue its own goals and is now a third party datastore
>>
>> just like Grafana or Prometheus.
>>
>> monasca is currently marked as inactive
>> https://review.opendev.org/c/openstack/governance/+/897520 and is in the
>> process of being retired.
>>
>> but it also has no testing on the watcher side to the combination of the
>> two is why we are deprecating it going forward.
>>
>> if both change im happy to see the support continue.
>>
>> Gnocchi has testing but we are not actively working on extending its
>> functionality going forward.
>>
>> as long as it continues to work i see no reason to change its support
>> status.
>>
>> watcher has quite a lot of untested integrations which is unfortunate
>>
>> we are planning to build out a feature/test/support matrix in the docs
>> this cycle
>>
>> but for example watcher can integrate with both ironic an canonical maas
>> component
>>
>> to do some level of host power management. none of that is tested and we
>> are likely going
>>
>> to mark them as experimental and reflect on if we can continue to
>> support them or not going forward.
>>
>> it also has the ability to do cinder storage pool balancing which is i
>> think also untested write now.
>>
>> one of the things we hope to do is extend the exsitign testing in our
>> current jobs to cover gaps like
>>
>> that where it is practical to do so. but creating a devstack plugin to
>> deploy maas with fake infrastructure
>>
>> is likely alot more then we can do with our existing contributors so
>> expect that to go to experimental then
>>
>> deprecated and finally it will be removed if no one turns up to support it.
>>
>> ironic is in the same boat however there are devstack jobs with fake
>> ironic nodes so i
>>
>> could see a path to use having an ironic job down the line. its just not
>> high on our current priority
>>
>> list to adress the support status or testing of this currently.
>>
>> eventlet removal and other techdebt/community goals are defintly higher
>> but i hop the new supprot/testing
>>
>> matrix will at least help folks make informed descions or what feature
>> to use and what backend are
>>
>> recommended going forward.
>>
>>> On Wed, 16 Apr 2025, 21:53 Douglas Viroel, <viroel(a)gmail.com> wrote:
>>>
>>> Hello everyone,
>>>
>>> Last week's PTG had very interesting topics. Thank you all that
>>> joined.
>>> The Watcher PTG etherpad with all notes is available here:
>>> https://etherpad.opendev.org/p/apr2025-ptg-watcher
>>> Here is a summary of the discussions that we had, including the
>>> great cross-project sessions with Telemetry, Horizon and Nova team:
>>>
>>> Tech Debt (chandankumar/sean-k-mooney)
>>> =================================
>>> a) Croniter
>>>
>>> * Project is being abandoned as per
>>> https://pypi.org/project/croniter/#disclaimer
>>> * Watcher uses croniter to calculate a new schedule time to run
>>> an audit (continuous). It is also used to validate cron like
>>> syntax
>>> * Agreed: replace croniter with appscheduler's cron methods.
>>> * *AI*: (chandankumar) Fix in master branch and backport to 2025.1
>>>
>>> b) Support status of Watcher Datasources
>>>
>>> * Only Gnocchi and Prometheus have CI job running tempest tests
>>> (with scenario tests)
>>> * Monaska is inactive since 2024.1
>>> * *AI*: (jgilaber) Mark Monasca and Grafana as deprecated,
>>> unless someone steps up to maintain them, which should include
>>> a minimal CI job running.
>>> * *AI*: (dviroel) Document a support matrix between Strategies
>>> and Datasources, which ones are production ready or
>>> experimental, and testing coverage.
>>>
>>> c) Eventlet Removal
>>>
>>> * Team is going to look at how the eventlet is used in Watcher
>>> and start a PoC of its removal.
>>> * Chandan Kumar and dviroel volunteer to help in this effort.
>>> * Planned for 2026.1 cycle.
>>>
>>> Workflow/API Improvements (amoralej)
>>> ==============================
>>> a) Actions states
>>>
>>> * Currently Actions updates from Pending to Succeeded or Failed,
>>> but these do not cover some important scenarios
>>> * If an Action's pre_conditions fails, the action is set to
>>> FAILED, but for some scenarios, it could be just SKIPPED and
>>> continue the workflow.
>>> * Proposal: New SKIPPED state for action. E.g: In a Nova
>>> Migration Action, if the instance doesn't exist in the source
>>> host, it can be skipped instead of fail.
>>> * Proposal: Users could also manually skip specific actions from
>>> an action plan.
>>> * A skip_reason field could also be added to document the reason
>>> behind the skip: user's request, pre-condition check, etc.
>>> * *AI*: (amoralej) Create a spec to describe the proposed changes.
>>>
>>> b) Meaning of SUCCEEDED state in Action Plan
>>>
>>> * Currently means that all actions are triggered, even if all of
>>> them fail, which can be confusing for users.
>>> * Docs mention that SUCCEEDED state means that all actions have
>>> been successfully executed.
>>> * *AI*: (amoralej) Document the current behavior as a bug
>>> (Priority High)
>>> o done: https://bugs.launchpad.net/watcher/+bug/2106407
>>>
>>> Watcher-Dashboard: Priorities to next release (amoralej)
>>> ===========================================
>>> a) Add integration/functional tests
>>>
>>> * Project is missing integration/functional tests and a CI job
>>> running against changes in the repo
>>> * No general conclusion and we will follow up with Horizon team
>>> * *AI*: (chandankumar/rlandy) sync with Horizon team about
>>> testing the plugin with horizon.
>>> * *AI*: (chandankumar/rlandy) devstack job running on new
>>> changes for watcher-dashboard repo.
>>>
>>> b) Add parameters to Audits
>>>
>>> * It is missing on the watcher-dashboard side. Without it, it is
>>> not possible to define some important parameters.
>>> * Should be addressed by a blueprint
>>> * Contributors to this feature: chandankumar
>>>
>>> Watcher cluster model collector improvement ideas (dviroel)
>>> =============================================
>>>
>>> * Brainstorm ideas to improve watcher collector process, since
>>> we still see a lot of issues due to outdated models when
>>> running audits
>>> * Both scheduled model update and event-based updates are
>>> enabled in CI today
>>> * It is unknown the current state of event-based updates from
>>> Nova notification. Code needs to be reviewed and
>>> improvements/fixes can be proposed
>>> o e.g:
>>> https://bugs.launchpad.net/watcher/+bug/2104220/comments/3
>>> - We need to check if we are processing the right
>>> notifications of if is a bug on Nova
>>> * Proposal: Refresh the model before running an audit. A rate
>>> limit should be considered to avoid too many refreshments.
>>> * *AI*: (dviroel) new spec for cluster model refresh, based on
>>> audit trigger
>>> * *AI:* (dviroel) investigate the processing of nova events in
>>> Watcher
>>>
>>> Watcher and Nova's visible constraints (dviroel)
>>> ====================================
>>>
>>> * Currently, Watcher can propose solutions that include server
>>> migrations that violate some Nova constraints like:
>>> scheduler_hints, server_groups, pinned_az, etc.
>>> * In Epoxy release, Nova's API was improved to also show
>>> scheduler_hints and image_properties, allowing external
>>> services, like watcher, to query and use this information when
>>> calculating new solutions.
>>> o https://docs.openstack.org/releasenotes/nova/2025.1.html#new-features
>>> * Proposal: Extend compute instance model to include new
>>> properties, which can be retrieved via novaclient. Update
>>> strategies to filter invalid migration destinations based on
>>> these new properties.
>>> * *AI*: (dviroel) Propose a spec to better document the
>>> proposal. No API changes are expected here.
>>>
>>> Replacement for noisy neighbor policy (jgilaber)
>>> ====================================
>>>
>>> * The existing noisy neighbor strategy is based on L3 Cache
>>> metrics, which is not available anymore, since the support for
>>> it was dropped from the kernel and from Nova.
>>> * In order to keep this strategy, new metrics need to be
>>> considered: cpu_steal? io_wait? cache_misses?
>>> * *AI*: (jgilaber) Mark the strategy as deprecated during this cycle
>>> * *AI*: (TBD) Identify new metrics to be used
>>> * *AI*: (TBD) Work on a replacement for the current strategy
>>>
>>>
>>> Host Maintenance strategy new use case (jeno8)
>>> =====================================
>>>
>>> * New use case for Host Maintenance strategy: instance with
>>> ephemeral disks should not be migrated at all.
>>> * Spec proposed:
>>> https://review.opendev.org/c/openstack/watcher-specs/+/943873
>>> o New action to stop instances when both live/cold migration
>>> are disabled by the user
>>> * *AI*: (All) Review the spec and continue with discussion there.
>>>
>>> Missing Contributor Docs (sean-k-mooney)
>>> ================================
>>>
>>> * Doc missing: Scope of the project, e.g:
>>> https://docs.openstack.org/nova/latest/contributor/project-scope.html
>>> * *AI*: (rlandy) Create a scope of the project doc for Watcher
>>> * Doc missing: PTL Guide, e.g:
>>> https://docs.openstack.org/nova/latest/contributor/ptl-guide.html
>>> * *AI*: (TBD) Create a PTL Guide for Watcher project
>>> * Document: When to create a spec vs blueprint vs bug
>>> * *AI*: (TBD) Create a doc section to describe the process based
>>> on what is being modified in the code.
>>>
>>> Retrospective
>>> ==========
>>>
>>> * The DPL approach seems to be working for Watcher
>>> * New core members added: sean-k-mooney, dviroel, marios and
>>> chandankumar
>>> o We plan to add more cores in the next cycle, based on
>>> reviews and engagement.
>>> o We plan to remove not active members in the 2 last cycles
>>> (starting at 2026.1)
>>> * A new datasource was added: Prometheus
>>> * Prometheus job now also runs scenario tests, along with Gnocchi.
>>> * We triaged all old bugs from launchpad
>>> * Needs improvement:
>>> o current team is still learning about details in the code,
>>> much of the historical knowledge was lost with the
>>> previous maintainers
>>> o core team still needs to grow
>>> o we need to focus on creating stable releases
>>>
>>>
>>> Cross-project session with Horizon team
>>> ===============================
>>>
>>> * Combined session with Telemetry and Horizon team, focused on
>>> how to provide a tenant and an admin dashboard to visualize
>>> metrics.
>>> * Watcher team presented some ideas of new panels for both admin
>>> and tenants, and sean-k-mooney raised a discussion about
>>> frameworks that can be used to implement them
>>> * Use-cases that were discussed:
>>> o a) Admin would benefit from a visualization of the
>>> infrastructure utilization (real usage metrics), so they
>>> can identify bottlenecks and plan optimization
>>> o b) A tenant would like to view their workload performance,
>>> checking real usage of cpu/ram/disk of instances, to
>>> proper adjust their resources allocation.
>>> o c) An admin user of watcher service would like to
>>> visualize metrics generated by watcher strategies like
>>> standard deviation of host metrics.
>>> * sean-k-mooney presented an initial PoC on how a Hypervisor
>>> Metrics dashboard would look like.
>>> * Proposal for next steps:
>>> o start a new horizon plugin as an official deliverable of
>>> telemetry project
>>> o still unclear which framework to use for building charts
>>> o dashboard will integrate with Prometheus, as metric store
>>> o it is expected that only short term metrics will be
>>> supported (7 days)
>>> o python-observability-client will be used to query Prometheus
>>>
>>>
>>> Cross-project session with Nova team
>>> =============================
>>>
>>> * sean-k-mooney led topics on how to evolve Nova to better
>>> assist other services, like Watcher, to take actions on
>>> instances. The team agreed on a proposal of using the existing
>>> metadata API to annotate instance's supported lifecycle
>>> operations. This information is very useful to improve
>>> Watcher's strategy's algorithms. Some example of instance's
>>> metadata could be:
>>> o lifecycle:cold-migratable=true|false
>>> o ha:maintenance-strategy:in_place|power_off|migrate
>>> * It was discussed that Nova could infer which operations are
>>> valid or not, based on information like: virt driver, flavor,
>>> image properties, etc. This feature was initially named
>>> 'instance capabilities' and will require a spec for further
>>> discussions.
>>> * Another topic of interest, also raised by Sean, was about
>>> adding new standard traits to resource providers, like
>>> PRESSURE_CPU and PRESSURE_DISK. These traits can be used to
>>> weight hosts when placing new VMs. Watcher and the libvirt
>>> driver could work on annotating them, but the team generally
>>> agreed that the libvirt driver is preferred here.
>>> * More info at Nova PTG etherpad [0] and sean's summary blog [1]
>>>
>>> [0] https://etherpad.opendev.org/p/r.bf5f1185e201e31ed8c3adeb45e3cf6d
>>> [1] https://www.seanmooney.info/blog/2025.2-ptg/#watcher-topics
>>>
>>>
>>> Please let me know if I missed something.
>>> Thanks!
>>>
>>> --
>>> Douglas Viroel - dviroel
>>>
3 months, 3 weeks
[tc][all] OpenStack Technical Committee Weekly Summary and Meeting Agenda (2025.2/R-18)
by Goutham Pacha Ravi
Hello Stackers,
We're 18 weeks away from the release date for OpenStack 2025.2
"Flamingo" [1]. Next week is the deadline for "cycle-trailing" [2]
projects to tag their 2025.1 "Epoxy" deliverables. Elsewhere, service
project teams are busy wrapping up design specifications for features
expected to be implemented in this release cycle. A call to action
regarding the cross-community goal on "eventlet removal" was made to
this mailing list [3]. Please join the #openstack-eventlet-removal
channel on OFTC and participate in the effort.
Several OpenStack governance changes are currently underway. A major
proposal among them is a transition [4] from the Contributor License
Agreement (CLA) [5] to the Developer Certificate of Origin [6]. This
change will affect every OpenStack contributor. The OpenStack
Technical Committee is working with the OpenInfra Foundation and the
OpenDev Infrastructure teams to enforce DCO compliance starting
2025-07-01. Please take some time to consider its implications and
provide your opinions on the TC resolution [4]. Project maintainers
are not expected to reject patches over DCO compliance today. If you
spot a "Signed-off-by" in a commit message, there's a good chance
reviewers have simply looked past it, as it wasn't required so far.
It's a good time to review what may be necessary and be prepared for
the upcoming change [7].
=== Weekly Meeting ===
The weekly IRC meeting of the OpenStack Technical Committee occurred
on 2025-05-20 [8]. An action item regarding relinquishing the
"quantum" name on PyPI was discussed. The resolution in this regard
was acknowledged by the requester and merged shortly after. The
OpenDev infra administrators deleted OpenStack artifacts and handed
over the project namespace. The majority of the meeting later focused
on the transition from CLA (Contributor License Agreement) to DCO
(Developer Certificate of Origin). This move is part of a broader
transition into the Linux Foundation, with an effective date of June
1, 2025. The TC needed to reconfirm its desire to move to DCO,
preferably within the next two weeks, as the previous resolution on
this topic was from 2014. A new resolution confirming the board's
recommendation was deemed helpful for community feedback. We discussed
many aspects of this transition—a key concern being the smoothness of
the transition for contributors. While the technical implementation
(Gerrit enforcing Signed-Off-By in commit messages and turning off CLA
enforcement) is relatively simple, the human and organizational impact
is not trivial. The short timeline for the switchover was a major
point of contention, as downstream organizations may need to re-engage
legal teams and update internal contribution policies. The possibility
of having multiple CLAs active in Gerrit (allowing existing
contributors to continue under the old CLA while new contributors use
a new CLA for the new entity) was raised as a potential solution to
mitigate the immediate impact of the short deadline. However, mixing
CLA and DCO enforcement was generally seen as undesirable and hard to
implement. Post-meeting, the resolution was proposed [4], and the
timeline for implementation has been pushed out by a month to allow
the community time to prepare and react accordingly. Please expect
more communication regarding this in the next few days.
The next meeting of the OpenStack TC is on 2025-05-27 at 1700 UTC.
This meeting will be held over IRC on the #openstack-tc channel on
OFTC. Please find the agenda and other details on the meeting's wiki
page [9]. I hope you'll be able to join us there!
=== Governance Proposals ===
==== Merged ====
- [resolution] Relinquish "quantum" project on PyPI |
https://review.opendev.org/c/openstack/governance/+/949783
==== Open for Review ====
- Require declaration of affiliation from TC Candidates |
https://review.opendev.org/c/openstack/governance/+/949432
- [resolution] Replace CLA with DCO for all contributions |
https://review.opendev.org/c/openstack/governance/+/950463
- Clarify actions when no elections are required |
https://review.opendev.org/c/openstack/governance/+/949431
- Fix outdated info on the tc-guide |
https://review.opendev.org/c/openstack/governance/+/950446
=== Upcoming Events ===
- 2025-06-03: 15 ans d'OpenStack - OpenInfra UG, Paris:
https://www.meetup.com/openstack-france/events/307492285
- 2025-06-05: OpenStack 15 ans! - OpenInfra UG, Rennes:
https://www.meetup.com/openstack-rennes/events/306903998
- 2025-06-28: OpenInfra+Cloud Native Day, Vietnam:
https://www.vietopeninfra.org/void2025
Thank you very much for reading!
On behalf of the OpenStack TC,
Goutham Pacha Ravi (gouthamr)
OpenStack TC Chair
[1] 2025.2 "Flamingo" Release Schedule:
https://releases.openstack.org/flamingo/schedule.html
[2] "cycle-trailing":
https://releases.openstack.org/reference/release_models.html#cycle-trailing
[3] "eventlet-removal" status:
https://lists.openstack.org/archives/list/openstack-discuss@lists.openstack…
[4] TC resolution to replace CLA with DCO for all contributions:
https://review.opendev.org/c/openstack/governance/+/950463
[5] OpenStack CLA:
https://docs.openstack.org/contributors/common/setup-gerrit.html#individual…
[6] Developer Certificate of Origin: https://developercertificate.org/
[7] DCO documentation draft:
https://review.opendev.org/c/openstack/contributor-guide/+/950839
[8] TC Meeting IRC Log 2025-05-20:
https://meetings.opendev.org/meetings/tc/2025/tc.2025-05-20-17.00.log.html
[9] TC Meeting Agenda, 2025-05-27:
https://wiki.openstack.org/wiki/Meetings/TechnicalCommittee#Next_Meeting
2 months, 2 weeks
[manila] 2025.1 Epoxy PTG summary
by Carlos Silva
Hello everyone! Thank you for the great participation at the PTG last week.
We've had great discussions and a good turnout. The recordings for the
sessions are available on YouTube [0]. If you would like to check on the
notes, please take a look at the PTG etherpad [1].
*2024.2 Dalmatian Retrospective*
==========================
- New core reviewers in the manila group were impactful in reviews; we
should continue actively working on maintaining/growing the core reviewer
team.
- We had the mid-cycle and managed to combine it with our well-known
collaborative review sessions, around feature proposal freeze. This had a
good impact on raising awareness on the changes being proposed, as well as
prioritizing the reviews.
- Great contributions ranging from new third party drivers to successful
internships on manila-ui, bandit and the ongoing OpenAPI internships.
*Action items:*
- Carlos (carloss) will work with the manila team to help people gain
context on the bug czar role and work with the team to rotate it.
- Vida Haririan (vhari) will jot down the details of the Bug Czar role
- Follow the discussions on teams joining the VMT and get Manila included
too.
- Spread the word on the removal of the manila client and switch to
OpenStackClient
*Share backup enhancements*
=========================
- Out of place restore isn't supported currently. We have agreed that this
is a good use case and that a design specification should be proposed to
document this.
- DataManager / BackupDriver - forcing the backup process to go through the
DataManager service is supported through a config option, but Manila is
currently not honoring it. We agreed that this is an issue in the code, and
we will review the proposed change [2] to make the data manager honor this
config.
- DataManager to allow for a backup driver to provide reports on API call
progress: Currently, the data manager fetches the progress of a backup
using a generic get progress call, but it is failing with the generic
backup driver. We suggested that this should be fixed in the base driver.
- Context for Backup API calls: currently, only objects representing a
Share and Backup are passed to the backup driver. The request context
should also be forwarded in these calls. The backup driver interface can be
changed for this, but we should be mindful of out of tree drivers that
could break.
*Action items:*
- Zach Goggins (zachgoggins) will look into:
- Proposing a spec for the share backup out of place restore.
- Updating the backup driver interface and adding context to the methods
that need it.
- Updating the backup driver interface and adding the abstract
methods/capabilities that will help with the `get_restore_progress` and
`get_backup_progress` methods.
- The manila team will provide feedback on [2]
*All things CephFS*
===============
*Updates from previous cycles*
------------------------------------------
*State of the Standalone NFS Ganesha protocol helper:*
- We added a deprecation warning at the end of the previous SLURP
release, and we are planning to complete the removal during the
2025.1/Epoxy release. There were no objections to this so far at the PTG.
When this is removed, CephFS-via-NFS will only work with cephadm deployed
ceph-nfs clusters.
*Testing and stabilization:*
- devstack-plugin-ceph has been refactored to deploy a standalone
NFS-Ganesha service with a ceph orch deployed cluster. We also dropped
support for package-based and container-based installation of ceph. cephadm
is used to deploy/orchestrate ceph.
- Bumped Ceph version to Reef in Antelope, Bobcat, Caracal, Dalmatian,
as well as started testing with Squid.
- There are some failures on stable branches jobs which are being
triaged and fixed.
*Manage/unmanage:*
- Implementation completed in Dalmatian and the documentation has been
updated. We are currently working to enable the tests on CI permanently, as
well as doing some small refactors to the CI jobs.
*Ensure shares:*
   - Merged in Dalmatian but testing is still challenging, as running the
tests means that the service would temporarily have a different status and
shares within the backend would have their status changed, which is harmful
for test concurrency.
*Preferred export locations and export location metadata:*
- The core feature merged, but we are still working to get the newly
implemented tests passing and merged.
*Plans for 2025.1/Epoxy*
--------------------------------
- NFSv3 + testing: we are looking into enabling NFSv3 support as soon as
the patch is merged in Ceph. We agreed that we should enable the tests
within manila-tempest-plugin and make any necessary changes to the tests
structure, so we can ensure that we are testing some scenarios with both
NFSv3 and NFSv4.
- We will start to investigate support for SMB/CIFS shares and look at
the necessary changes for setting up devstack and testing.
*Action items:*
- Carlos (carloss) will write an email to the openstack-discuss mailing
list announcing the removal of the deprecated ganesha helper
- Carlos will pursue the manage/unmanage testing patches to have tests
enabled in the CephFS jobs during Epoxy.
- Carlos will look into approaches to test ensure shares APIs.
- Ashley (ashrod98) will continue working on the export location metadata
tempest changes and drive them to completion.
- The manila team will look into updating manila-tempest-plugin tests and
enabling NFSv3 tests in the Ceph NFS jobs
- Goutham (gouthamr) will be submitting a prototype of the SMB/CIFS
integration
*Tech Debt*
========
*Eventlet removal*
----------------------
Our main concerns:
- Performance should not be degraded with the default configuration when we
switch.
- Synchronous calls should not take a big hit or become asynchronous.
- Impact to the SSH Pool (used by many drivers) should be minimal.
*Action items for 2025.1 Epoxy:*
- Tackle the low-hanging-fruit changes.
- Participating in the pop-up team discussions.
- Removing the affected console scripts in Manila.
- Working on performance tests to understand the impact on the SSH pool
that is used by some drivers (see the sketch after this list).
- Look into enhancing our rally/browbeat test coverage.
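As a rough starting point for the SSH-pool performance question above (the
fake_ssh_command below is a stand-in for a driver call through the SSH pool,
and the numbers are arbitrary):

    # Rough timing harness: run N fake "SSH" commands concurrently on native
    # threads and measure wall-clock time; compare against the current
    # eventlet-based behaviour with the same workload.
    import concurrent.futures
    import time


    def fake_ssh_command(i):
        time.sleep(0.2)  # simulate network + command latency
        return "ok %d" % i


    def run(concurrency=16, calls=200):
        start = time.monotonic()
        with concurrent.futures.ThreadPoolExecutor(
                max_workers=concurrency) as pool:
            results = list(pool.map(fake_ssh_command, range(calls)))
        elapsed = time.monotonic() - start
        print("%d calls with %d threads took %.2fs"
              % (len(results), concurrency, elapsed))


    if __name__ == "__main__":
        run()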
*CI and testing images*
-------------------------------
We started working on the migration of the CI to Ubuntu 24.04 in all of the
manila repositories (manila-image-elements, python-manilaclient, manila-ui,
manila, manila-specs).
Currently, the Ceph job is broken [3].
*Action items:*
- We should clean up our CI job variants, as they have a lot of
workarounds and we can start moving away from them.
*Stable branches*
----------------------
We currently have 5 "unmaintained" branches, so we should be looking at
sunsetting them.
*Action items:*
- Carlos (carloss) will start the conversation for the transition of some
of these branches in the openstack-discuss mailing list.
*Allowing force delete of a share network subnet*
=======================================
We currently can add subnets (which translates to adding new network
interfaces) to a share server but we can't remove them. This is a proposal
to add this removal feature and being able to detach a network interface of
a share server.
We agreed that:
- This is a good use case and something that can be enhanced.
- The enhancement should add a force-delete API.
- We should not allow the last subnet to be deleted, otherwise the shares
won't have an export path.
- A bug should be filed for a tangential issue: the NetApp driver is
using "neutron_net_id" (and possibly "neutron_subnet_id") to name resources
on the backend: ipspaces, broadcast domains, and possibly concurrency
control / locks.
*Action Items:*
- sylvanld will look into proposing a spec to document this behavior
*NetApp Driver: Enforcing lifs limit per HA pair*
=====================================
- The NetApp ONTAP storage has a limit of network interfaces per node in a
HA pair. In case the sum of allocated network interfaces in the two nodes
of the HA pair is bigger than the limit of the single node, then the
failover operation is compromised and will fail.
- NetApp maintainers would like to fix this issue, and we agreed that:
- The fix should be as dynamic as possible, not relying on users/admin
input or configuration.
- The ONTAP driver must look up all of the interfaces already created
and allow/deny the request in case it would compromise the failover.
 - The NetApp ONTAP driver should keep an updated capability with the max
supported number of network interfaces, and possibly the number of network
interfaces allocated at the moment.
*NetApp Driver: Implement Certificate based authentication*
================================================
- The NetApp ONTAP driver currently handles only user/password
authentication, but in an environment where the password should change
quarterly, this means updating the local.conf at least every three months.
This enhancement proposes also adding the possibility of certificate-based
authentication.
- We agreed that this is something that is going to be important for
operators and will allow them to add their certificates with a longer
expiration date, avoiding the disruptions caused by needing to update the
user/password.
*Manage Share affinity relationships by annotation/label*
=============================================
Currently the manila scheduler uses affinity/anti-affinity hints and we
base ourselves on share IDs. The idea now would be to have the affinity
hints be based on an affinity policy, as is possible with Nova.
We considered the proposed approaches, and agreed that:
- If we are adding new policies, they should end up becoming a new
resource/entity within the manila database
- If there is a way to reuse the share groups mechanism, we should
prioritize it
*Action items:*
- Chuan (chuanm) will propose a design spec to document this new behavior.
*Share encryption*
==============
This feature is currently waiting for more reviews and testing on gerrit.
In the Dalmatian release mid-cycle we talked about the importance of
testing this feature against a first party driver, to ensure that the APIs
and integration with Barbican and Castellan work.
We agreed that:
- We should do some research on how to do this testing with the generic
driver (which uses Cinder and Nova)
- The testing will focus on the APIs and behavior of this feature, not the
encryption of the shares.
*Action items:*
- gouthamr will help with some research on how to test this with the
generic driver
- The manila team will discuss this again in the upcoming manila weekly
meetings.
[0]
https://www.youtube.com/watch?v=8UxrjEr6yik&list=PLnpzT0InFrqDHGfSDPhiGtSeX…
[1] https://etherpad.opendev.org/p/epoxy-ptg-manila
[2] https://review.opendev.org/c/openstack/manila/+/907983
[3] https://www.spinics.net/lists/ceph-users/msg83201.html
9 months, 1 week
Re: [watcher] 2025.2 Flamingo PTG summary
by Dmitriy Rabotyagov
Hey,
Have a comment on one AI from the list.
> AI: (jgilaber) Mark Monasca and Grafana as deprecated, unless someone
steps up to maintain them, which should include a minimal CI job running.
So eventually, on OpenStack-Ansible we were planning to revive the Watcher
role support to the project.
How we usually test deployment, is by spawning an all-in-one environment
with drivers and executing a couple of tempest scenarios to ensure basic
functionality of the service.
With that, having a native OpenStack telemetry datastore is very beneficial
for such goal, as we already do maintain means for spawning telemetry
stack. While a requirement for Prometheus will be unfortunate for us at
least.
While I was writing that, I partially realized that testing Watcher on
all-in-one is pretty much impossible as well...
But at the very least, I can propose looking into adding an OSA job with
Gnocchi as NV to the project, to show the state of the deployment with this
driver.
On Wed, 16 Apr 2025, 21:53 Douglas Viroel, <viroel(a)gmail.com> wrote:
> Hello everyone,
>
> Last week's PTG had very interesting topics. Thank you all that joined.
> The Watcher PTG etherpad with all notes is available here:
> https://etherpad.opendev.org/p/apr2025-ptg-watcher
> Here is a summary of the discussions that we had, including the great
> cross-project sessions with Telemetry, Horizon and Nova team:
>
> Tech Debt (chandankumar/sean-k-mooney)
> =================================
> a) Croniter
>
> - Project is being abandoned as per
> https://pypi.org/project/croniter/#disclaimer
> - Watcher uses croniter to calculate a new schedule time to run an
> audit (continuous). It is also used to validate cron like syntax
> - Agreed: replace croniter with appscheduler's cron methods.
> - *AI*: (chandankumar) Fix in master branch and backport to 2025.1
>
> b) Support status of Watcher Datasources
>
> - Only Gnocchi and Prometheus have CI job running tempest tests (with
> scenario tests)
> - Monaska is inactive since 2024.1
> - *AI*: (jgilaber) Mark Monasca and Grafana as deprecated, unless
> someone steps up to maintain them, which should include a minimal CI job
> running.
> - *AI*: (dviroel) Document a support matrix between Strategies and
> Datasources, which ones are production ready or experimental, and testing
> coverage.
>
> c) Eventlet Removal
>
> - Team is going to look at how the eventlet is used in Watcher and
> start a PoC of its removal.
> - Chandan Kumar and dviroel volunteer to help in this effort.
> - Planned for 2026.1 cycle.
>
> Workflow/API Improvements (amoralej)
> ==============================
> a) Actions states
>
> - Currently Actions updates from Pending to Succeeded or Failed, but
> these do not cover some important scenarios
> - If an Action's pre_conditions fails, the action is set to FAILED,
> but for some scenarios, it could be just SKIPPED and continue the workflow.
> - Proposal: New SKIPPED state for action. E.g: In a Nova Migration
> Action, if the instance doesn't exist in the source host, it can be skipped
> instead of fail.
> - Proposal: Users could also manually skip specific actions from an
> action plan.
> - A skip_reason field could also be added to document the reason
> behind the skip: user's request, pre-condition check, etc.
> - *AI*: (amoralej) Create a spec to describe the proposed changes.
>
> b) Meaning of SUCCEEDED state in Action Plan
>
> - Currently means that all actions are triggered, even if all of them
> fail, which can be confusing for users.
> - Docs mention that SUCCEEDED state means that all actions have been
> successfully executed.
> - *AI*: (amoralej) Document the current behavior as a bug (Priority
> High)
> - done: https://bugs.launchpad.net/watcher/+bug/2106407
>
> Watcher-Dashboard: Priorities to next release (amoralej)
> ===========================================
> a) Add integration/functional tests
>
> - Project is missing integration/functional tests and a CI job running
> against changes in the repo
> - No general conclusion and we will follow up with Horizon team
> - *AI*: (chandankumar/rlandy) sync with Horizon team about testing the
> plugin with horizon.
> - *AI*: (chandankumar/rlandy) devstack job running on new changes for
> watcher-dashboard repo.
>
> b) Add parameters to Audits
>
> - It is missing on the watcher-dashboard side. Without it, it is not
> possible to define some important parameters.
> - Should be addressed by a blueprint
> - Contributors to this feature: chandankumar
>
> Watcher cluster model collector improvement ideas (dviroel)
> =============================================
>
> - Brainstorm ideas to improve watcher collector process, since we
> still see a lot of issues due to outdated models when running audits
> - Both scheduled model update and event-based updates are enabled in
> CI today
> - It is unknown the current state of event-based updates from Nova
> notification. Code needs to be reviewed and improvements/fixes can be
> proposed
> - e.g: https://bugs.launchpad.net/watcher/+bug/2104220/comments/3 -
> We need to check if we are processing the right notifications of if is a
> bug on Nova
> - Proposal: Refresh the model before running an audit. A rate limit
> should be considered to avoid too many refreshments.
> - *AI*: (dviroel) new spec for cluster model refresh, based on audit
> trigger
> - *AI:* (dviroel) investigate the processing of nova events in Watcher
>
> Watcher and Nova's visible constraints (dviroel)
> ====================================
>
> - Currently, Watcher can propose solutions that include server
> migrations that violate some Nova constraints like: scheduler_hints,
> server_groups, pinned_az, etc.
> - In Epoxy release, Nova's API was improved to also show
> scheduler_hints and image_properties, allowing external services, like
> watcher, to query and use this information when calculating new solutions.
> -
> https://docs.openstack.org/releasenotes/nova/2025.1.html#new-features
> - Proposal: Extend compute instance model to include new properties,
> which can be retrieved via novaclient. Update strategies to filter invalid
> migration destinations based on these new properties.
> - *AI*: (dviroel) Propose a spec to better document the proposal. No
> API changes are expected here.
>
> Replacement for noisy neighbor policy (jgilaber)
> ====================================
>
> - The existing noisy neighbor strategy is based on L3 Cache metrics,
> which is not available anymore, since the support for it was dropped from
> the kernel and from Nova.
> - In order to keep this strategy, new metrics need to be considered:
> cpu_steal? io_wait? cache_misses?
> - *AI*: (jgilaber) Mark the strategy as deprecated during this cycle
> - *AI*: (TBD) Identify new metrics to be used
> - *AI*: (TBD) Work on a replacement for the current strategy
>
>
> Host Maintenance strategy new use case (jeno8)
> =====================================
>
> - New use case for Host Maintenance strategy: instance with ephemeral
> disks should not be migrated at all.
> - Spec proposed:
> https://review.opendev.org/c/openstack/watcher-specs/+/943873
> - New action to stop instances when both live/cold migration are
> disabled by the user
> - *AI*: (All) Review the spec and continue with discussion there.
>
> Missing Contributor Docs (sean-k-mooney)
> ================================
>
> - Doc missing: Scope of the project, e.g:
> https://docs.openstack.org/nova/latest/contributor/project-scope.html
> - *AI*: (rlandy) Create a scope of the project doc for Watcher
> - Doc missing: PTL Guide, e.g:
> https://docs.openstack.org/nova/latest/contributor/ptl-guide.html
> - *AI*: (TBD) Create a PTL Guide for Watcher project
> - Document: When to create a spec vs blueprint vs bug
> - *AI*: (TBD) Create a doc section to describe the process based on
> what is being modified in the code.
>
> Retrospective
> ==========
>
> - The DPL approach seems to be working for Watcher
> - New core members added: sean-k-mooney, dviroel, marios and
> chandankumar
> - We plan to add more cores in the next cycle, based on reviews and
> engagement.
> - We plan to remove not active members in the 2 last cycles
> (starting at 2026.1)
> - A new datasource was added: Prometheus
> - Prometheus job now also runs scenario tests, along with Gnocchi.
> - We triaged all old bugs from launchpad
> - Needs improvement:
> - current team is still learning about details in the code, much of
> the historical knowledge was lost with the previous maintainers
> - core team still needs to grow
> - we need to focus on creating stable releases
>
>
> Cross-project session with Horizon team
> ===============================
>
> - Combined session with Telemetry and Horizon team, focused on how to
> provide a tenant and an admin dashboard to visualize metrics.
> - Watcher team presented some ideas of new panels for both admin and
> tenants, and sean-k-mooney raised a discussion about frameworks that can be
> used to implement them
> - Use-cases that were discussed:
> - a) Admin would benefit from a visualization of the infrastructure
> utilization (real usage metrics), so they can identify bottlenecks and plan
> optimization
> - b) A tenant would like to view their workload performance,
> checking real usage of cpu/ram/disk of instances, to proper adjust their
> resources allocation.
> - c) An admin user of watcher service would like to visualize
> metrics generated by watcher strategies like standard deviation of host
> metrics.
> - sean-k-mooney presented an initial PoC on how a Hypervisor Metrics
> dashboard would look like.
> - Proposal for next steps:
> - start a new horizon plugin as an official deliverable of
> telemetry project
> - still unclear which framework to use for building charts
> - dashboard will integrate with Prometheus, as metric store
> - it is expected that only short term metrics will be supported (7
> days)
> - python-observability-client will be used to query Prometheus
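>
> For context, the kind of query the plugin will end up issuing is a plain
> PromQL call against Prometheus. python-observability-client wraps this for
> us; as a minimal stand-in, the underlying HTTP API call looks roughly like
> this (the endpoint and metric name are examples, not decisions from the
> session):
>
>     import requests
>
>     PROMETHEUS_URL = "http://prometheus.example.org:9090"  # assumed
>
>     # Average guest CPU metric over the 7-day short-term retention window.
>     query = 'avg_over_time(ceilometer_cpu{resource="INSTANCE_UUID"}[7d])'
>
>     resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query",
>                         params={"query": query}, timeout=10)
>     resp.raise_for_status()
>     for result in resp.json()["data"]["result"]:
>         print(result["metric"], result["value"])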
>
>
> Cross-project session with Nova team
> =============================
>
> - sean-k-mooney led topics on how to evolve Nova to better assist
> other services, like Watcher, in taking actions on instances. The team agreed
> on a proposal to use the existing metadata API to annotate an instance's
> supported lifecycle operations. This information is very useful for improving
> Watcher's strategy algorithms. Some examples of instance metadata could be
> (a rough sketch of setting them follows after the references below):
> - lifecycle:cold-migratable=true|false
> - ha:maintenance-strategy:in_place|power_off|migrate
> - It was discussed that Nova could infer which operations are valid or
> not, based on information like: virt driver, flavor, image properties, etc.
> This feature was initially named 'instance capabilities' and will require a
> spec for further discussions.
> - Another topic of interest, also raised by Sean, was about adding new
> standard traits to resource providers, like PRESSURE_CPU and PRESSURE_DISK.
> These traits can be used to weight hosts when placing new VMs. Watcher and
> the libvirt driver could work on annotating them, but the team generally
> agreed that the libvirt driver is preferred here.
> - More info at Nova PTG etherpad [0] and sean's summary blog [1]
>
> [0] https://etherpad.opendev.org/p/r.bf5f1185e201e31ed8c3adeb45e3cf6d
> [1] https://www.seanmooney.info/blog/2025.2-ptg/#watcher-topics
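>
> To make the metadata idea concrete, a rough sketch of annotating an
> instance with the proposed keys via novaclient is shown below. The keys are
> only the proposals from the session (Nova does not act on them today), and
> the credentials are placeholders:
>
>     from keystoneauth1 import loading, session
>     from novaclient import client as nova_client
>
>     loader = loading.get_plugin_loader("password")
>     auth = loader.load_from_options(
>         auth_url="http://controller:5000/v3", username="admin",
>         password="secret", project_name="admin",
>         user_domain_name="Default", project_domain_name="Default")
>     nova = nova_client.Client("2.1", session=session.Session(auth=auth))
>
>     server = nova.servers.get("INSTANCE_UUID")
>     # Proposed annotation keys; Watcher strategies would read these back
>     # to decide which lifecycle operations are allowed for the instance.
>     nova.servers.set_meta(server, {
>         "lifecycle:cold-migratable": "false",
>         "ha:maintenance-strategy": "power_off",
>     })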
>
>
> Please let me know if I missed something.
> Thanks!
>
> --
> Douglas Viroel - dviroel
>
3 months, 3 weeks
Re: [watcher] 2025.2 Flamingo PTG summary
by Dmitriy Rabotyagov
> well gnocchi is also not a native OpenStack telemetry datastore, it left
> our community to pursue its own goals and is now a third party datastore
> just like Grafana or Prometheus.
Yeah, well, true. It is still somehow treated as the "default" thing for
Telemetry, likely due to its existing integration with Keystone and
multi-tenancy support. And beyond that, all other options become
opinionated too fast - i.e., some do OpenTelemetry, some do Zabbix,
VictoriaMetrics, etc. Also, from what I gathered, it still
relies on Ceilometer metrics?
And then Prometheus is obviously not the best storage for them, as it
requires a Pushgateway, and afaik the Prometheus maintainers are
strictly against the "push" concept and treat it as conceptually
wrong (in contrast to OpenTelemetry). So the metric timestamp issue is
likely to remain unaddressed.
So that's why I'd say keeping Gnocchi as the "base" implementation might
be valuable (and very handy for us, as we wouldn't need to implement a
Prometheus job specifically for Watcher).
> but for example watcher can integrate with both ironic an canonical maas
> component to do some level of host power management.
That sounds really interesting... We do maintain infrastructure using
MAAS, and playing with such an integration would be extremely interesting.
I hope I will be able to find some time for this, though...
Thu, 17 Apr 2025 at 13:52, Sean Mooney <smooney(a)redhat.com>:
>
>
> On 16/04/2025 21:04, Dmitriy Rabotyagov wrote:
> >
> > Hey,
> >
> > Have a comment on one AI from the list.
> >
> > > AI: (jgilaber) Mark Monasca and Grafana as deprecated, unless
> > someone steps up to maintain them, which should include a minimal CI
> > job running.
> >
> > So eventually, on OpenStack-Ansible we were planning to revive the
> > Watcher role support to the project.
> > How we usually test deployment, is by spawning an all-in-one
> > environment with drivers and executing a couple of tempest scenarios
> > to ensure basic functionality of the service.
> >
> > With that, having a native OpenStack telemetry datastore is very
> > beneficial for such goal, as we already do maintain means for spawning
> > telemetry stack. While a requirement for Prometheus will be
> > unfortunate for us at least.
> >
> > While I was writing that, I partially realized that testing Watcher on
> > all-in-one is pretty much impossible as well...
> >
> you can certenly test some fo watcher with an all in one deployment
>
> i.e. the apis and you can use the dummy test stragies.
>
> but ya in general like nova you need at least 2 nodes to be able to test
> it properly ideally 3
>
> so that if your doing a live migration there is actully a choice of host.
>
> in general however watcher like heat just asks nova to actully move the vms.
>
> sure it will ask nova to move it to a specific host but fundementaly if
> you have
>
> tested live migration with nova via tempest seperatly there is no reason
> to expcect
>
> it would not work for live migratoin tirggred by watcher or heat or
> anything else that
>
> jsut calls novas api.
>
> so you could still get some valual testing in an all in one but ideally
> there woudl be at least 2 comptue hosts.
>
>
> > But at the very least, I can propose looking into adding an OSA job
> > with Gnocchi as NV to the project, to show the state of the deployment
> > with this driver.
> >
> well gnocchi is also not a native OpenStack telemetry datastore, it left
> our community to pursue its own goals and is now a third party datastore
>
> just like Grafana or Prometheus.
>
> monasca is currently marked as inactive
> https://review.opendev.org/c/openstack/governance/+/897520 and is in the
> process of being retired.
>
> but it also has no testing on the watcher side to the combination of the
> two is why we are deprecating it going forward.
>
> if both change im happy to see the support continue.
>
> Gnocchi has testing but we are not actively working on extending its
> functionality going forward.
>
> as long as it continues to work i see no reason to change its support
> status.
>
> watcher has quite a lot of untested integrations which is unfortunate
>
> we are planning to build out a feature/test/support matrix in the docs
> this cycle
>
> but for example watcher can integrate with both ironic an canonical maas
> component
>
> to do some level of host power management. none of that is tested and we
> are likely going
>
> to mark them as experimental and reflect on if we can continue to
> support them or not going forward.
>
> it also has the ability to do cinder storage pool balancing which is i
> think also untested write now.
>
> one of the things we hope to do is extend the exsitign testing in our
> current jobs to cover gaps like
>
> that where it is practical to do so. but creating a devstack plugin to
> deploy maas with fake infrastructure
>
> is likely alot more then we can do with our existing contributors so
> expect that to go to experimental then
>
> deprecated and finally it will be removed if no one turns up to support it.
>
> ironic is in the same boat; however, there are devstack jobs with fake
> ironic nodes, so I could see a path to us having an ironic job down the
> line. It's just not high on our current priority list to address the
> support status or testing of this right now.
>
> eventlet removal and other tech-debt/community goals are definitely
> higher, but I hope the new support/testing matrix will at least help
> folks make informed decisions on what features to use and what backends
> are recommended going forward.
>
> >
> > On Wed, 16 Apr 2025, 21:53 Douglas Viroel, <viroel(a)gmail.com> wrote:
> >
> > Hello everyone,
> >
> > Last week's PTG had very interesting topics. Thank you all that
> > joined.
> > The Watcher PTG etherpad with all notes is available here:
> > https://etherpad.opendev.org/p/apr2025-ptg-watcher
> > Here is a summary of the discussions that we had, including the
> > great cross-project sessions with Telemetry, Horizon and Nova team:
> >
> > Tech Debt (chandankumar/sean-k-mooney)
> > =================================
> > a) Croniter
> >
> > * Project is being abandoned as per
> > https://pypi.org/project/croniter/#disclaimer
> > * Watcher uses croniter to calculate a new schedule time to run
> > an audit (continuous). It is also used to validate cron like
> > syntax
> > * Agreed: replace croniter with appscheduler's cron methods.
> > * *AI*: (chandankumar) Fix in master branch and backport to 2025.1
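> >
> > A minimal sketch of the agreed direction, assuming APScheduler 3.x and
> > its documented CronTrigger API (the cron expression is just an example):
> >
> >     from datetime import datetime, timezone
> >     from apscheduler.triggers.cron import CronTrigger
> >
> >     # Validate a cron expression and compute the next run time, which
> >     # is roughly what Watcher uses croniter for in continuous audits.
> >     trigger = CronTrigger.from_crontab("*/30 * * * *")
> >     next_run = trigger.get_next_fire_time(None, datetime.now(timezone.utc))
> >     print(next_run)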
> >
> > b) Support status of Watcher Datasources
> >
> > * Only Gnocchi and Prometheus have CI job running tempest tests
> > (with scenario tests)
> > * Monasca is inactive since 2024.1
> > * *AI*: (jgilaber) Mark Monasca and Grafana as deprecated,
> > unless someone steps up to maintain them, which should include
> > a minimal CI job running.
> > * *AI*: (dviroel) Document a support matrix between Strategies
> > and Datasources, which ones are production ready or
> > experimental, and testing coverage.
> >
> > c) Eventlet Removal
> >
> > * Team is going to look at how the eventlet is used in Watcher
> > and start a PoC of its removal.
> > * Chandan Kumar and dviroel volunteer to help in this effort.
> > * Planned for 2026.1 cycle.
> >
> > Workflow/API Improvements (amoralej)
> > ==============================
> > a) Actions states
> >
> > * Currently, Actions move from Pending to Succeeded or Failed,
> > but these states do not cover some important scenarios
> > * If an Action's pre_conditions check fails, the action is set to
> > FAILED, but in some scenarios it could just be SKIPPED and the
> > workflow could continue.
> > * Proposal: New SKIPPED state for action. E.g: In a Nova
> > Migration Action, if the instance doesn't exist in the source
> > host, it can be skipped instead of fail.
> > * Proposal: Users could also manually skip specific actions from
> > an action plan.
> > * A skip_reason field could also be added to document the reason
> > behind the skip: user's request, pre-condition check, etc.
> > * *AI*: (amoralej) Create a spec to describe the proposed changes.
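> >
> > Purely to illustrate the proposal (hypothetical names, not existing
> > Watcher code), the applier-side handling could look roughly like this:
> >
> >     from enum import Enum
> >
> >     class ActionState(Enum):
> >         PENDING = "PENDING"
> >         SUCCEEDED = "SUCCEEDED"
> >         FAILED = "FAILED"
> >         SKIPPED = "SKIPPED"  # proposed new state
> >
> >     def run_action(action):
> >         ok, reason = action.check_pre_conditions()  # hypothetical helper
> >         if not ok and action.skippable:
> >             action.state = ActionState.SKIPPED
> >             action.skip_reason = reason  # proposed field
> >             return  # the workflow continues with the next action
> >         action.execute()
> >         action.state = ActionState.SUCCEEDED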
> >
> > b) Meaning of SUCCEEDED state in Action Plan
> >
> > * Currently it means that all actions were triggered, even if all of
> > them failed, which can be confusing for users.
> > * Docs mention that SUCCEEDED state means that all actions have
> > been successfully executed.
> > * *AI*: (amoralej) Document the current behavior as a bug
> > (Priority High)
> > o done: https://bugs.launchpad.net/watcher/+bug/2106407
> >
> > Watcher-Dashboard: Priorities to next release (amoralej)
> > ===========================================
> > a) Add integration/functional tests
> >
> > * Project is missing integration/functional tests and a CI job
> > running against changes in the repo
> > * No general conclusion and we will follow up with Horizon team
> > * *AI*: (chandankumar/rlandy) sync with Horizon team about
> > testing the plugin with horizon.
> > * *AI*: (chandankumar/rlandy) devstack job running on new
> > changes for watcher-dashboard repo.
> >
> > b) Add parameters to Audits
> >
> > * It is missing on the watcher-dashboard side. Without it, it is
> > not possible to define some important parameters.
> > * Should be addressed by a blueprint
> > * Contributors to this feature: chandankumar
> >
> > Watcher cluster model collector improvement ideas (dviroel)
> > =============================================
> >
> > * Brainstorm ideas to improve watcher collector process, since
> > we still see a lot of issues due to outdated models when
> > running audits
> > * Both scheduled model update and event-based updates are
> > enabled in CI today
> > * The current state of event-based updates from Nova
> > notifications is unknown. Code needs to be reviewed and
> > improvements/fixes can be proposed
> > o e.g:
> > https://bugs.launchpad.net/watcher/+bug/2104220/comments/3
> > - We need to check if we are processing the right
> > notifications or if it is a bug in Nova
> > * Proposal: Refresh the model before running an audit. A rate
> > limit should be considered to avoid too many refreshments.
> > * *AI*: (dviroel) new spec for cluster model refresh, based on
> > audit trigger
> > * *AI:* (dviroel) investigate the processing of nova events in
> > Watcher
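> >
> > A very small sketch of the rate-limit idea behind that proposal (the
> > interval value and the collector method name are assumptions):
> >
> >     import time
> >
> >     REFRESH_MIN_INTERVAL = 300  # seconds, assumed value
> >     _last_refresh = 0.0
> >
> >     def refresh_model_if_stale(collector):
> >         """Refresh the cluster model before an audit, at most once
> >         per interval."""
> >         global _last_refresh
> >         now = time.monotonic()
> >         if now - _last_refresh >= REFRESH_MIN_INTERVAL:
> >             collector.synchronize()  # assumed collector entry point
> >             _last_refresh = now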
> >
> > Watcher and Nova's visible constraints (dviroel)
> > ====================================
> >
> > * Currently, Watcher can propose solutions that include server
> > migrations that violate some Nova constraints like:
> > scheduler_hints, server_groups, pinned_az, etc.
> > * In Epoxy release, Nova's API was improved to also show
> > scheduler_hints and image_properties, allowing external
> > services, like watcher, to query and use this information when
> > calculating new solutions.
> > o https://docs.openstack.org/releasenotes/nova/2025.1.html#new-features
> > * Proposal: Extend compute instance model to include new
> > properties, which can be retrieved via novaclient. Update
> > strategies to filter invalid migration destinations based on
> > these new properties.
> > * *AI*: (dviroel) Propose a spec to better document the
> > proposal. No API changes are expected here.
> >
> > Replacement for noisy neighbor policy (jgilaber)
> > ====================================
> >
> > * The existing noisy neighbor strategy is based on L3 Cache
> > metrics, which are no longer available, since support for
> > them was dropped from the kernel and from Nova.
> > * In order to keep this strategy, new metrics need to be
> > considered: cpu_steal? io_wait? cache_misses?
> > * *AI*: (jgilaber) Mark the strategy as deprecated during this cycle
> > * *AI*: (TBD) Identify new metrics to be used
> > * *AI*: (TBD) Work on a replacement for the current strategy
> >
> >
> > Host Maintenance strategy new use case (jeno8)
> > =====================================
> >
> > * New use case for the Host Maintenance strategy: instances with
> > ephemeral disks should not be migrated at all.
> > * Spec proposed:
> > https://review.opendev.org/c/openstack/watcher-specs/+/943873
> > o New action to stop instances when both live/cold migration
> > are disabled by the user
> > * *AI*: (All) Review the spec and continue with discussion there.
> >
> > Missing Contributor Docs (sean-k-mooney)
> > ================================
> >
> > * Doc missing: Scope of the project, e.g:
> > https://docs.openstack.org/nova/latest/contributor/project-scope.html
> > * *AI*: (rlandy) Create a scope of the project doc for Watcher
> > * Doc missing: PTL Guide, e.g:
> > https://docs.openstack.org/nova/latest/contributor/ptl-guide.html
> > * *AI*: (TBD) Create a PTL Guide for Watcher project
> > * Document: When to create a spec vs blueprint vs bug
> > * *AI*: (TBD) Create a doc section to describe the process based
> > on what is being modified in the code.
> >
> > Retrospective
> > ==========
> >
> > * The DPL approach seems to be working for Watcher
> > * New core members added: sean-k-mooney, dviroel, marios and
> > chandankumar
> > o We plan to add more cores in the next cycle, based on
> > reviews and engagement.
> > o We plan to remove members who have not been active in the
> > last 2 cycles (starting at 2026.1)
> > * A new datasource was added: Prometheus
> > * Prometheus job now also runs scenario tests, along with Gnocchi.
> > * We triaged all old bugs from launchpad
> > * Needs improvement:
> > o the current team is still learning the details of the code;
> > much of the historical knowledge was lost with the
> > previous maintainers
> > o core team still needs to grow
> > o we need to focus on creating stable releases
> >
> >
> > Cross-project session with Horizon team
> > ===============================
> >
> > * Combined session with Telemetry and Horizon team, focused on
> > how to provide a tenant and an admin dashboard to visualize
> > metrics.
> > * Watcher team presented some ideas of new panels for both admin
> > and tenants, and sean-k-mooney raised a discussion about
> > frameworks that can be used to implement them
> > * Use-cases that were discussed:
> > o a) Admin would benefit from a visualization of the
> > infrastructure utilization (real usage metrics), so they
> > can identify bottlenecks and plan optimization
> > o b) A tenant would like to view their workload performance,
> > checking real usage of cpu/ram/disk of instances, to
> > properly adjust their resource allocation.
> > o c) An admin user of watcher service would like to
> > visualize metrics generated by watcher strategies like
> > standard deviation of host metrics.
> > * sean-k-mooney presented an initial PoC of what a Hypervisor
> > Metrics dashboard could look like.
> > * Proposal for next steps:
> > o start a new horizon plugin as an official deliverable of
> > telemetry project
> > o still unclear which framework to use for building charts
> > o dashboard will integrate with Prometheus, as metric store
> > o it is expected that only short term metrics will be
> > supported (7 days)
> > o python-observability-client will be used to query Prometheus
> >
> >
> > Cross-project session with Nova team
> > =============================
> >
> > * sean-k-mooney led topics on how to evolve Nova to better
> > assist other services, like Watcher, in taking actions on
> > instances. The team agreed on a proposal to use the existing
> > metadata API to annotate an instance's supported lifecycle
> > operations. This information is very useful for improving
> > Watcher's strategy algorithms. Some examples of instance
> > metadata could be:
> > o lifecycle:cold-migratable=true|false
> > o ha:maintenance-strategy:in_place|power_off|migrate
> > * It was discussed that Nova could infer which operations are
> > valid or not, based on information like: virt driver, flavor,
> > image properties, etc. This feature was initially named
> > 'instance capabilities' and will require a spec for further
> > discussions.
> > * Another topic of interest, also raised by Sean, was about
> > adding new standard traits to resource providers, like
> > PRESSURE_CPU and PRESSURE_DISK. These traits can be used to
> > weight hosts when placing new VMs. Watcher and the libvirt
> > driver could work on annotating them, but the team generally
> > agreed that the libvirt driver is preferred here.
> > * More info at Nova PTG etherpad [0] and sean's summary blog [1]
> >
> > [0] https://etherpad.opendev.org/p/r.bf5f1185e201e31ed8c3adeb45e3cf6d
> > [1] https://www.seanmooney.info/blog/2025.2-ptg/#watcher-topics
> >
> >
> > Please let me know if I missed something.
> > Thanks!
> >
> > --
> > Douglas Viroel - dviroel
> >
>
3 months, 3 weeks
[tc][all] OpenStack Technical Committee Weekly Summary and Meeting Agenda (2025.1/R-25)
by Goutham Pacha Ravi
Hello Stackers,
This week, we begin our 26-week endeavor towards the next SLURP
release, 2025.1 ("Epoxy") [1]. OpenStack Project Teams will meet
virtually at the Project Teams Gathering (PTG) in two weeks, starting
on 2024-10-21 [2]. The OpenStack TC plans to host cross project
meetings during the following time slots:
- 2024-10-21 (Monday): 1400 UTC - 1700 UTC
- 2024-10-25 (Friday): 1500 UTC - 1700 UTC
You'll find the proposed topics on the PTG Etherpad [3]; please add
your IRC nickname if you'd like to attend or be notified when
discussions begin.
Last week, a few community leads presented at OpenInfra Live,
recapping the 2024.2 release [4]. I encourage you to watch the
presentation and follow the themes each team is pursuing in the
"Epoxy" release cycle. I'm excited to share that the organizers of the
upcoming OpenInfra Days North America (Oct 15-16) have made it a
hybrid event. Please register if you plan to attend virtually [5].
=== Weekly Meeting ===
The last weekly meeting of the OpenStack Technical Committee was held
simultaneously on IRC [6] and video [7]. We discussed meeting times,
and the current time (Tuesdays at 1800 UTC) was retained due to a lack
of consensus on better alternatives. Sylvain Bauza (bauzas)
volunteered to be an Election Official for the 2025.2 elections, which
will be announced around February 2025. We also discussed "leaderless"
projects for the 2025.1 release and appointed leaders for the
OpenStack Mistral, OpenStack Watcher, and OpenStack Swift projects.
Additionally, we created a TC tracker for the 2025.1 release cycle [8]
to monitor the progress of community goals and other governance
initiatives.
The next OpenStack Technical Committee meeting is today (2024-10-08)
at 1800 UTC on the #openstack-tc IRC channel on OFTC. You can find the
agenda on the weekly meeting wiki page [9]. I hope you can join us!
Below is a list of governance changes that have merged in the past
week and those still pending community review.
=== Governance Proposals ===
==== Merged ====
- Appoint Tim Burke as PTL for Swift |
https://review.opendev.org/c/openstack/governance/+/928881
==== Open for Review ====
- Mark kuryr-kubernetes and kuryr-tempest-plugin inactive |
https://review.opendev.org/c/openstack/governance/+/929698
- Add Axel Vanzaghi as PTL for Mistral |
https://review.opendev.org/c/openstack/governance/+/927962
- Propose the eventlet-removal community goal |
https://review.opendev.org/c/openstack/governance/+/931254
=== Upcoming Events ===
- 2024-10-08: OpenInfra Monthly Board Meeting: https://board.openinfra.dev/
- 2024-10-15: OpenInfra Days NA, Indianapolis:
https://ittraining.iu.edu/explore-topics/titles/oid-iu/
- 2024-10-21: OpenInfra Project Teams Gathering: https://openinfra.dev/ptg/
Thank you for reading!
On behalf of the OpenStack TC,
Goutham Pacha Ravi (gouthamr)
OpenStack TC Chair
[1] 2025.1 "Epoxy" Release Schedule:
https://releases.openstack.org/epoxy/schedule.html
[2] "Epoxy" PTG Schedule: https://ptg.opendev.org/ptg.html
[3] Technical Committee PTG Etherpad:
https://etherpad.opendev.org/p/oct2024-ptg-os-tc
[4] "Introducing OpenStack Dalmatian 2024.2": https://youtu.be/6igJNIJ9yFE
[5] OpenInfra Days NA:
https://ittraining.iu.edu/explore-topics/titles/oid-iu/index.html#register
[6] TC Meeting IRC Log, 2024-10-01:
https://meetings.opendev.org/meetings/tc/2024/tc.2024-10-01-18.00.log.html
[7] TC Meeting Video Recording, 2024-10-01: https://youtu.be/6RXE1LfEv7w
[8] 2025.1 TC Tracker: https://etherpad.opendev.org/p/tc-2025.1-tracker
[9] TC Meeting Agenda, 2024-10-08:
https://wiki.openstack.org/wiki/Meetings/TechnicalCommittee#Next_Meeting
10 months