openstack-discuss search results for query "#eventlet-removal"
openstack-discuss@lists.openstack.org - 186 messages
Re: [nova] multiple pci types with same address
by Sean Mooney
On 17/06/2025 09:30, Arnaud Morin wrote:
> Hello nova team,
>
> Quick question regarding support for multiple types (see [1]) with the same address:
> {vendor_id: "10de", product_id: "233b", "address": "00000000:03:00.0", "resource_class": "CUSTOM_H200_A"}
> {vendor_id: "10de", product_id: "233b", "address": "00000000:04:00.0", "resource_class": "CUSTOM_H200_A"}
> {vendor_id: "10de", product_id: "233b", "address": "00000000:44:00.0", "resource_class": "CUSTOM_H200_A"}
> {vendor_id: "10de", product_id: "233b", "address": "00000000:45:00.0", "resource_class": "CUSTOM_H200_A"}
> {vendor_id: "10de", product_id: "233b", "address": "00000000:83:00.0", "resource_class": "CUSTOM_H200_B"}
> {vendor_id: "10de", product_id: "233b", "address": "00000000:84:00.0", "resource_class": "CUSTOM_H200_B"}
> {vendor_id: "10de", product_id: "233b", "address": "00000000:C3:00.0", "resource_class": "CUSTOM_H200_B"}
> {vendor_id: "10de", product_id: "233b", "address": "00000000:C4:00.0", "resource_class": "CUSTOM_H200_B"}
>
> This works fine, I was able to define multiple aliases:
> {name: "h200a", device_type:"type-PF", resource_class: "CUSTOM_H200_A"}
> {name: "h200b", device_type:"type-PF", resource_class: "CUSTOM_H200_B"}
>
> I did that to create two blocks of 4 resources (A or B).
>
> But now, I need to create a flavor to boot instances with these devices.
> I want to have only one flavor that can use either A or B:
> For now I created a flavor with:
> pci_passthrough:alias='h200a:4'
>
> And was forced to create a second flavor with h200b:4.
>
> Is there any way to achieve a single flavor with both:
> Something like this?
> pci_passthrough:alias='h200a:4|h200b:4'
>
> I can't figure that out for now, is it possible?
No, it's not, and for reasons related to how placement works it is also not
easy to implement in the future:
we would need a new one-of type query in the allocation candidates API.
The feature you really want is
https://specs.openstack.org/openstack/nova-specs/specs/2024.1/approved/pci-…
It was proposed but never implemented.
With that spec you can carve up PCI devices into groups in the PCI device_spec
and then have a single resource class to select any one of the groups.
So in your case you would express
```
{vendor_id: "10de", product_id: "233b", "address": "00000000:03:00.0", "resource_class": "CUSTOM_H200_A"}
{vendor_id: "10de", product_id: "233b", "address": "00000000:04:00.0", "resource_class": "CUSTOM_H200_A"}
{vendor_id: "10de", product_id: "233b", "address": "00000000:44:00.0", "resource_class": "CUSTOM_H200_A"}
{vendor_id: "10de", product_id: "233b", "address": "00000000:45:00.0", "resource_class": "CUSTOM_H200_A"}
{vendor_id: "10de", product_id: "233b", "address": "00000000:83:00.0", "resource_class": "CUSTOM_H200_B"}
{vendor_id: "10de", product_id: "233b", "address": "00000000:84:00.0", "resource_class": "CUSTOM_H200_B"}
{vendor_id: "10de", product_id: "233b", "address": "00000000:C3:00.0", "resource_class": "CUSTOM_H200_B"}
{vendor_id: "10de", product_id: "233b", "address": "00000000:C4:00.0", "resource_class": "CUSTOM_H200_B"}
```
as
```
{vendor_id: "10de", product_id: "233b", "address": "00000000:03:00.0", "group_name": "A", "group_type": "H200"}
{vendor_id: "10de", product_id: "233b", "address": "00000000:04:00.0", "group_name": "A", "group_type": "H200"}
{vendor_id: "10de", product_id: "233b", "address": "00000000:44:00.0", "group_name": "A", "group_type": "H200"}
{vendor_id: "10de", product_id: "233b", "address": "00000000:45:00.0", "group_name": "A", "group_type": "H200"}
{vendor_id: "10de", product_id: "233b", "address": "00000000:83:00.0", "group_name": "B", "group_type": "H200"}
{vendor_id: "10de", product_id: "233b", "address": "00000000:84:00.0", "group_name": "B", "group_type": "H200"}
{vendor_id: "10de", product_id: "233b", "address": "00000000:C3:00.0", "group_name": "B", "group_type": "H200"}
{vendor_id: "10de", product_id: "233b", "address": "00000000:C4:00.0", "group_name": "B", "group_type": "H200"}
alias = {"name":"h200", resource_class:"CUSTOM_PCI_GROUP_H200"}
```
And in the flavor you would request pci_passthrough:alias='h200:1'; PCI groups allocate a full group to the request.
It's a very useful proposal that we have discussed on and off for the better part of a decade, and we finally got as
far as writing it down for 2024.1, but then the implementation never got started.
I know some Red Hat customers have also asked about this type of grouping functionality, so at least internally this
comes up from time to time. I strongly suspect this will eventually get implemented, but there have been higher priorities
for the Nova team, like eventlet removal or GPU live migration. If someone proposes it again, I know at least John Garbutt
was keen to see this added for some HPC use cases, and I suspect that with the current AI boom, passing blocks of H200 GPUs
to a workload is becoming more common, not less.
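In the meantime, the two-flavor approach you already have is the practical workaround. Here is a minimal sketch of it
with the openstacksdk, purely for illustration: the flavor names and sizes are made up, and the extra-spec helper
(create_flavor_extra_specs) should be checked against the SDK version you use.
```python
import openstack

# Credentials come from clouds.yaml; the cloud name here is an assumption.
conn = openstack.connect(cloud="mycloud")

# One flavor per GPU block, since a single alias request cannot yet express
# "either group A or group B".
for block, alias in (("a", "h200a"), ("b", "h200b")):
    flavor = conn.compute.create_flavor(
        name=f"h200.block-{block}", ram=65536, vcpus=16, disk=100)
    # Request all four devices of the matching resource class.
    conn.compute.create_flavor_extra_specs(
        flavor, {"pci_passthrough:alias": f"{alias}:4"})
```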
>
> [1] https://docs.openstack.org/nova/latest/admin/pci-passthrough.html#support-f…
>
6 months, 2 weeks
Re: [tc][all][security] Supporting Post-Quantum Cryptography in OpenStack code (all projects)
by Julia Kreger
Overall, I agree with a lot of the sentiment and the statements thus
far. One key aspect I think we need to ensure is that we don't silo
discussions to a specific topic or for example ssh keys. In other
words, ultimately the *complete* scope of work remains undefined, and
projects *should* identify potential areas of work that they can see,
while the overall larger ecosystem moves forward and the overall exact
needs are clarified further. The wider process should be to revise a
vision and guidelines as time moves forward.
For example, it could boil down to a very simple question for projects:
Does your project do *anything* in relation to keys, encryption, or
interaction of any encrypted data either at rest or while in transit?
If yes, are there actions to take?
And then a dialogue needs to occur, or at least a framework for
understanding whether there *really* needs to be an investment of time or
resources in that specific instance of usage, or whether it is just a
matter of permitting a longer value. Ultimately that is a case-by-case
assessment which needs to be performed.
A good first step for any working group would be to create a chart
which could be used as a reference for those individual discussions.
-Julia
On Mon, Oct 27, 2025 at 5:35 AM Sean Mooney <smooney(a)redhat.com> wrote:
>
> My personal take on this is that it may be a valid future community goal, but
> it's probably premature.
>
> In general, most OpenStack projects try not to be in the business of
> cryptography if we can at all avoid it.
>
> Our dependencies may have crypto features like SSH or SSL for our REST APIs
> or similar, but the OpenStack code base
> in general tries not to implement any crypto logic itself, i.e. we try to
> delegate to python-cryptography or similar well-maintained
> modules.
>
> For example, just looking at the Nova section, you mentioned that Nova can
> generate SSH keys:
>
> https://wiki.openstack.org/wiki/Post_quantum_openstack#Nova_.28Compute.29
>
> However, we deprecated and removed that capability in the Zed release in
> microversion 2.92:
>
> https://docs.openstack.org/nova/latest/reference/api-microversion-history.h…
>
> We did that specifically because we did not want to support SSH key
> generation in Nova going forward and defined that to be out of scope for
> our project,
>
> so that is a non-issue in a PQC world, because we have decided as a
> project not to extend or support that API going forward.
>
> The underlying code has not been removed, as we don't do that in Nova when it's
> reasonable to keep the code, but it should never be used anymore.
>
> We only support uploading a pre-generated public key now.
>
>
> The other two cases might be valid:
>
> ```
>
> Supports validation of Glance image signatures and certificate trust
> when booting signed images. (link)
>
> Metadata path protection with Neutron uses HMAC over Instance-ID to
> prevent spoofing (shared secret). (link)
>
> ```
>
> Although for the metadata case, that shared secret is intended to be
> passed over an HTTPS connection, so if the SSL encryption for
> the connection supports post-quantum encryption then the HMAC does not
> really need to, but we can likely change that algorithm when
> python-cryptography supports something
> in the future that is a suitable replacement.
>
> The Glance image verification would need Glance to support something else
> instead, but if they come up with an updated approach Nova could adapt.
>
> None of the above seems particularly urgent, and it likely doesn't need to be
> addressed in 2026.1 or even 2027.1, but if someone wanted to write a spec for
> the Nova/Glance changes and wanted to work on it, it could be reviewed
> via the normal upstream process without needing to make it a community
> goal or have
> it driven by the TC. You could, for example, create a popup team or SIG,
> kind of like the eventlet removal work, to drive this instead.
>
> Speaking of which, I think the eventlet removal work is going to take
> precedence for most teams, as that is more urgent in terms
> of real-world impact.
>
> regards
>
> sean.
>
> On 27/10/2025 05:14, Goutham Pacha Ravi wrote:
> > On Fri, Oct 24, 2025 at 1:19 PM Jean-Philippe Jung <jjung(a)redhat.com> wrote:
> >> Hi,
> >>
> >> I am seeking help from the TC to raise the urgency of this work across all OpenStack projects and to help me lead an effort to reduce the number of cryptographic modules used in OpenStack (my personal opinion is that there should be no more than five).
> >>
> >> Doing this may involve work in each OpenStack Project team; and I can help organize this effort. I'm seeking the following from the TC and/or project teams:
> >> Portions of this work will be isolated to specific repositories managed by a project team, while others will involve "cross-project" synchronization.
> >> What vehicles can we use to have a "call-to-action" for project teams to get someone to look into their specific projects? How can we go about community wide collaboration?
> >>
> >> I've created a document [1] that I assembled from AI analysis of part of the OpenStack code. It gives an overall view of the problem we face.
> >>
> > Thank you for starting this discussion, JP. I've added a topic to the
> > TC's PTG for 1600 UTC on Friday, 31st Oct 2025. I hope you'll be able
> > to share your findings there briefly and invite opinions in-sync. Our
> > vehicle for driving cross project work has been via the Community
> > Goals framework: https://governance.openstack.org/tc/goals/
> > If we have one or more objectives, this can be proposed as one, and
> > will require a "goal champion" - someone that'll help us gather
> > requirements, and coordinate efforts to complete the goal. Some goals
> > in the past have spawned new groups - either as Pop Up Teams or SIGs
> > (https://governance.openstack.org/tc/reference/comparison-of-official-group-…)
> >
>
2 months
[watcher] 2025.2 Flamingo PTG summary
by Douglas Viroel
Hello everyone,
Last week's PTG had very interesting topics. Thank you all that joined.
The Watcher PTG etherpad with all notes is available here:
https://etherpad.opendev.org/p/apr2025-ptg-watcher
Here is a summary of the discussions that we had, including the great
cross-project sessions with Telemetry, Horizon and Nova team:
Tech Debt (chandankumar/sean-k-mooney)
=================================
a) Croniter
- Project is being abandoned as per
https://pypi.org/project/croniter/#disclaimer
- Watcher uses croniter to calculate a new schedule time to run an audit
(continuous). It is also used to validate cron-like syntax
- Agreed: replace croniter with apscheduler's cron methods (see the sketch below).
- *AI*: (chandankumar) Fix in master branch and backport to 2025.1
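As a rough sketch of the agreed direction (illustrative only; how this is
wired into Watcher's continuous audit handling is left to the actual patch),
APScheduler's CronTrigger covers both the validation and the next-run-time
calculation currently done with croniter:
```python
from datetime import datetime

from apscheduler.triggers.cron import CronTrigger


def next_audit_run(cron_expression, last_run=None):
    # from_crontab() raises ValueError on invalid syntax, which covers the
    # validation use case; get_next_fire_time() returns when the continuous
    # audit should run next.
    trigger = CronTrigger.from_crontab(cron_expression)
    return trigger.get_next_fire_time(last_run, datetime.now().astimezone())


# Example: an audit scheduled every 30 minutes.
print(next_audit_run("*/30 * * * *"))
```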
b) Support status of Watcher Datasources
- Only Gnocchi and Prometheus have a CI job running tempest tests (with
scenario tests)
- Monasca has been inactive since 2024.1
- *AI*: (jgilaber) Mark Monasca and Grafana as deprecated, unless
someone steps up to maintain them, which should include a minimal CI job
running.
- *AI*: (dviroel) Document a support matrix between Strategies and
Datasources, which ones are production ready or experimental, and testing
coverage.
c) Eventlet Removal
- Team is going to look at how eventlet is used in Watcher and start
a PoC of its removal.
- Chandan Kumar and dviroel volunteer to help in this effort.
- Planned for 2026.1 cycle.
Workflow/API Improvements (amoralej)
==============================
a) Actions states
- Currently, Actions update from Pending to Succeeded or Failed, but
these states do not cover some important scenarios
- If an Action's pre_conditions fails, the action is set to FAILED, but
for some scenarios, it could be just SKIPPED and continue the workflow.
- Proposal: New SKIPPED state for action. E.g.: in a Nova Migration
Action, if the instance doesn't exist on the source host, it can be skipped
instead of failing (see the sketch below).
- Proposal: Users could also manually skip specific actions from an
action plan.
- A skip_reason field could also be added to document the reason behind
the skip: user's request, pre-condition check, etc.
- *AI*: (amoralej) Create a spec to describe the proposed changes.
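A minimal sketch of the proposed semantics (all names are hypothetical; the
real design will come from the spec):
```python
import enum


class ActionState(enum.Enum):
    PENDING = "PENDING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"
    SKIPPED = "SKIPPED"  # proposed new state


def run_action(action):
    # Hypothetical workflow step: a failed pre-condition no longer has to
    # fail the action; it can be marked SKIPPED with a reason and the
    # action plan continues.
    if not action.pre_conditions_ok():
        action.state = ActionState.SKIPPED
        action.skip_reason = "pre-condition check failed"
        return
    try:
        action.execute()
        action.state = ActionState.SUCCEEDED
    except Exception as exc:
        action.state = ActionState.FAILED
        action.fail_reason = str(exc)
```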
b) Meaning of SUCCEEDED state in Action Plan
- Currently means that all actions are triggered, even if all of them
fail, which can be confusing for users.
- Docs mention that SUCCEEDED state means that all actions have been
successfully executed.
- *AI*: (amoralej) Document the current behavior as a bug (Priority High)
- done: https://bugs.launchpad.net/watcher/+bug/2106407
Watcher-Dashboard: Priorities to next release (amoralej)
===========================================
a) Add integration/functional tests
- Project is missing integration/functional tests and a CI job running
against changes in the repo
- No general conclusion and we will follow up with Horizon team
- *AI*: (chandankumar/rlandy) sync with Horizon team about testing the
plugin with horizon.
- *AI*: (chandankumar/rlandy) devstack job running on new changes for
watcher-dashboard repo.
b) Add parameters to Audits
- It is missing on the watcher-dashboard side. Without it, it is not
possible to define some important parameters.
- Should be addressed by a blueprint
- Contributors to this feature: chandankumar
Watcher cluster model collector improvement ideas (dviroel)
=============================================
- Brainstorm ideas to improve the Watcher collector process, since we still
see a lot of issues due to outdated models when running audits
- Both scheduled model update and event-based updates are enabled in CI
today
- The current state of event-based updates from Nova notifications is
unknown. Code needs to be reviewed and improvements/fixes can be
proposed
- e.g.: https://bugs.launchpad.net/watcher/+bug/2104220/comments/3 -
We need to check if we are processing the right notifications or if it is a
bug in Nova
- Proposal: Refresh the model before running an audit. A rate limit
should be considered to avoid too many refreshes (see the sketch below).
- *AI*: (dviroel) new spec for cluster model refresh, based on audit
trigger
- *AI:* (dviroel) investigate the processing of nova events in Watcher
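A sketch of the rate-limited refresh idea (hypothetical helper names, not
Watcher's actual collector API):
```python
import time


class RateLimitedModelRefresher:
    """Refresh the cluster data model before an audit, at most once per window."""

    def __init__(self, collector, min_interval_seconds=300):
        self._collector = collector
        self._min_interval = min_interval_seconds
        self._last_refresh = 0.0

    def refresh_if_stale(self):
        now = time.monotonic()
        if now - self._last_refresh >= self._min_interval:
            # synchronize() stands in for whatever the collector exposes to
            # rebuild the model from Nova.
            self._collector.synchronize()
            self._last_refresh = now


# Usage (hypothetical): call refresher.refresh_if_stale() right before an
# audit is executed.
```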
Watcher and Nova's visible constraints (dviroel)
====================================
- Currently, Watcher can propose solutions that include server
migrations that violate some Nova constraints like: scheduler_hints,
server_groups, pinned_az, etc.
- In the Epoxy release, Nova's API was improved to also show scheduler_hints
and image_properties, allowing external services, like Watcher, to query
and use this information when calculating new solutions.
-
https://docs.openstack.org/releasenotes/nova/2025.1.html#new-features
- Proposal: Extend compute instance model to include new properties,
which can be retrieved via novaclient. Update strategies to filter invalid
migration destinations based on these new properties.
- *AI*: (dviroel) Propose a spec to better document the proposal. No API
changes are expected here.
Replacement for noisy neighbor policy (jgilaber)
====================================
- The existing noisy neighbor strategy is based on L3 cache metrics,
which are not available anymore, since support for them was dropped from
the kernel and from Nova.
- In order to keep this strategy, new metrics need to be considered:
cpu_steal? io_wait? cache_misses?
- *AI*: (jgilaber) Mark the strategy as deprecated during this cycle
- *AI*: (TBD) Identify new metrics to be used
- *AI*: (TBD) Work on a replacement for the current strategy
Host Maintenance strategy new use case (jeno8)
=====================================
- New use case for Host Maintenance strategy: instances with ephemeral
disks should not be migrated at all.
- Spec proposed:
https://review.opendev.org/c/openstack/watcher-specs/+/943873
- New action to stop instances when both live/cold migration are
disabled by the user
- *AI*: (All) Review the spec and continue with discussion there.
Missing Contributor Docs (sean-k-mooney)
================================
- Doc missing: Scope of the project, e.g:
https://docs.openstack.org/nova/latest/contributor/project-scope.html
- *AI*: (rlandy) Create a scope of the project doc for Watcher
- Doc missing: PTL Guide, e.g:
https://docs.openstack.org/nova/latest/contributor/ptl-guide.html
- *AI*: (TBD) Create a PTL Guide for Watcher project
- Document: When to create a spec vs blueprint vs bug
- *AI*: (TBD) Create a doc section to describe the process based on what
is being modified in the code.
Retrospective
==========
- The DPL approach seems to be working for Watcher
- New core members added: sean-k-mooney, dviroel, marios and chandankumar
- We plan to add more cores in the next cycle, based on reviews and
engagement.
- We plan to remove members who have not been active in the last 2 cycles
(starting at 2026.1)
- A new datasource was added: Prometheus
- Prometheus job now also runs scenario tests, along with Gnocchi.
- We triaged all old bugs from launchpad
- Needs improvement:
- current team is still learning about details in the code, much of
the historical knowledge was lost with the previous maintainers
- core team still needs to grow
- we need to focus on creating stable releases
Cross-project session with Horizon team
===============================
- Combined session with Telemetry and Horizon team, focused on how to
provide a tenant and an admin dashboard to visualize metrics.
- Watcher team presented some ideas of new panels for both admin and
tenants, and sean-k-mooney raised a discussion about frameworks that can be
used to implement them
- Use-cases that were discussed:
- a) Admin would benefit from a visualization of the infrastructure
utilization (real usage metrics), so they can identify
bottlenecks and plan
optimization
- b) A tenant would like to view their workload performance, checking
real usage of cpu/ram/disk of instances, to proper adjust their resources
allocation.
- c) An admin user of watcher service would like to visualize metrics
generated by watcher strategies like standard deviation of host metrics.
- sean-k-mooney presented an initial PoC of what a Hypervisor Metrics
dashboard could look like.
- Proposal for next steps:
- start a new horizon plugin as an official deliverable of telemetry
project
- still unclear which framework to use for building charts
- dashboard will integrate with Prometheus, as metric store
- it is expected that only short term metrics will be supported (7
days)
- python-observability-client will be used to query Prometheus
Cross-project session with Nova team
=============================
- sean-k-mooney led topics on how to evolve Nova to better assist other
services, like Watcher, to take actions on instances. The team agreed on a
proposal of using the existing metadata API to annotate an instance's
supported lifecycle operations. This information is very useful to improve
Watcher's strategy algorithms. Some examples of instance metadata could
be:
- lifecycle:cold-migratable=true|false
- ha:maintenance-strategy:in_place|power_off|migrate
- It was discussed that Nova could infer which operations are valid or
not, based on information like: virt driver, flavor, image properties, etc.
This feature was initially named 'instance capabilities' and will require a
spec for further discussions.
- Another topic of interest, also raised by Sean, was about adding new
standard traits to resource providers, like PRESSURE_CPU and PRESSURE_DISK.
These traits can be used to weight hosts when placing new VMs. Watcher and
the libvirt driver could work on annotating them, but the team generally
agreed that the libvirt driver is preferred here.
- More info at Nova PTG etherpad [0] and sean's summary blog [1]
[0] https://etherpad.opendev.org/p/r.bf5f1185e201e31ed8c3adeb45e3cf6d
[1] https://www.seanmooney.info/blog/2025.2-ptg/#watcher-topics
Please let me know if I missed something.
Thanks!
--
Douglas Viroel - dviroel
8 months, 2 weeks
[tc][all] OpenStack Technical Committee Weekly Summary and Meeting Agenda (2025.1/R-11)
by Goutham Pacha Ravi
Hello Stackers,
Time flies! We're 11 weeks away from the 2025.1 "Epoxy" release day
[1]. OpenStack project teams must be working on their deliverables
according to the schedule shared by Előd Illés (elodilles) from the
release team [2] this week.
In the past week, the OpenStack Technical Committee (TC) worked with
election officials to adjust the dates of the elections preceding the
2025.2 ("Flamingo" [3]) release cycle. This change was made in
response to a revision in the TC's charter that now allows more time
for polling. Election officials will share the updated schedule on
this list soon. Additionally, the TC approved the retirement of
openstack-ansible roles for former projects: Murano (Application
Catalog), Senlin (Clustering Service), and Sahara (Data Processing).
The OpenInfra Board's Individual Member Director elections are
currently underway [4]. If you are an OpenInfra Foundation member,
please check your email for a ballot and participate. The election
will conclude on Friday, 2025-01-17.
=== Weekly Meeting ===
The OpenStack TC resumed its regular weekly meeting schedule last week
following a brief hiatus due to the year-end holidays. Last week's
meeting was held on 2025-01-07 at 1800 UTC, simultaneously on Zoom and
IRC. Please find the meeting minutes on eavesdrop [5] and a recording
on YouTube [6].
The meeting began with a discussion about the pending proposal [7] to
delete the "unmaintained/victoria" branch on OpenStack git
repositories. There is interest from at least one contributing
organization in keeping this branch open for specific repositories.
The deadline to merge this change is 2025-01-31. We aim to identify
the specific repositories and the responsible maintainers by then.
I'll keep you updated in future emails.
The TC then discussed initiating the election cycle for the 2025.2
release. We recognize that long holidays (such as the upcoming Chinese
New Year) could impact nominations from interested candidates. We hope
to encourage nominations as early as possible since even a couple of
extra weeks could make a difference in coordinating for such a
geographically diverse community.
The TC also reviewed the status of the community goal to migrate test
jobs to Python 3.12 (and "Ubuntu 24.04 / Noble Numbat" where
applicable). Ghanshyam Mann (gmann), the goal champion, shared that
test jobs in three projects (Heat, Skyline, and
devstack-plugin-container) need attention from their respective
project teams [8].
We also expressed our gratitude to the OpenDev Infrastructure team for
keeping the systems running smoothly during the holidays.
The next OpenStack Technical Committee meeting is today, 2025-01-14,
at 1800 UTC. This meeting will be held over IRC in OFTC's
#openstack-tc channel. Please find the agenda on the meeting wiki [9].
I hope you can join us. Remember, any community member can propose
meeting topics—just mention your IRC nick so the meeting chair can
call upon you.
=== Governance Proposals ===
==== Merged ====
- Allow more than 2 weeks for elections |
https://review.opendev.org/c/openstack/governance/+/937741
- Put whitebox-tempest-plugin under release management |
https://review.opendev.org/c/openstack/governance/+/938401
- Retire Murano/Senlin/Sahara OpenStack-Ansible roles |
https://review.opendev.org/c/openstack/governance/+/935677
==== Open for Review ====
- Rework the eventlet-removal goal proposal |
https://review.opendev.org/c/openstack/governance/+/931254
- Add ansible-role-httpd repo to OSA-owned projects |
https://review.opendev.org/c/openstack/governance/+/935694
- Retire Freezer DR | https://review.opendev.org/c/openstack/governance/+/938183
- Retire qdrouterd role |
https://review.opendev.org/c/openstack/governance/+/938193
- Remove Freezer from inactive state |
https://review.opendev.org/c/openstack/governance/+/938938
- Propose to select the eventlet-removal community goal |
https://review.opendev.org/c/openstack/governance/+/934936
- Resolve to adhere to non-biased language |
https://review.opendev.org/c/openstack/governance/+/934907
=== How to Contact the TC ===
You can reach the TC in several ways:
- Email: Send an email with the tag [tc] on this mailing list.
- Ping us using the 'tc-members' keyword on the #openstack-tc IRC
channel on OFTC.
- Join us at our weekly meeting: The Technical Committee meets every
week on Tuesdays at 1800 UTC [9].
=== Upcoming Events ===
- 2025-01-17: 2025 OpenInfra Board Individual Member Director Elections conclude
- 2025-02-01: FOSDEM 2025 (https://fosdem.org/2025/) OpenStack's 15th
Birthday Celebration
- 2025-02-28: 2025.1 ("Epoxy") Feature Freeze and release milestone 3 [1]
- 2025-03-06: SCALE 2025 + OpenInfra Days NA
(https://www.socallinuxexpo.org/scale/22x)
Thank you very much for reading!
On behalf of the OpenStack TC,
Goutham Pacha Ravi (gouthamr)
OpenStack TC Chair
[1] 2025.1 "Epoxy" Release Schedule:
https://releases.openstack.org/epoxy/schedule.html
[2] Release countdown for week R-11:
https://lists.openstack.org/archives/list/openstack-discuss@lists.openstack…
[3] OpenStack 2025.2 'F' Release Naming Poll:
https://lists.openstack.org/archives/list/openstack-discuss@lists.openstack…
[4] OpenInfra Foundation Board Elections:
https://openinfra.dev/election/2025-individual-director-election
[5] TC Meeting IRC Log 2025-01-07:
https://meetings.opendev.org/meetings/tc/2025/tc.2025-01-07-18.00.log.html
[6] TC Meeting Video Recording, 2025-01-07: https://youtu.be/-Nxul8_ykto
[7] Transition unmaintained/victoria to EOL:
https://review.opendev.org/c/openstack/releases/+/937515
[8] Projects failing the "migrate-to-noble" goal:
https://etherpad.opendev.org/p/migrate-to-noble#L172
[9] TC Meeting Agenda, 2025-01-14:
https://wiki.openstack.org/wiki/Meetings/TechnicalCommittee#Next_Meeting
11 months, 2 weeks
Re: [watcher] 2025.2 Flamingo PTG summary
by Sean Mooney
On 16/04/2025 21:04, Dmitriy Rabotyagov wrote:
>
> Hey,
>
> Have a comment on one AI from the list.
>
> > AI: (jgilaber) Mark Monasca and Grafana as deprecated, unless
> someone steps up to maintain them, which should include a minimal CI
> job running.
>
> So eventually, on OpenStack-Ansible we were planning to revive the
> Watcher role support to the project.
> How we usually test deployment, is by spawning an all-in-one
> environment with drivers and executing a couple of tempest scenarios
> to ensure basic functionality of the service.
>
> With that, having a native OpenStack telemetry datastore is very
> beneficial for such a goal, as we already maintain the means for spawning a
> telemetry stack, while a requirement for Prometheus would be
> unfortunate for us at least.
>
> While I was writing that, I partially realized that testing Watcher on
> all-in-one is pretty much impossible as well...
>
You can certainly test some of Watcher with an all-in-one deployment,
i.e. the APIs, and you can use the dummy test strategies.
But yes, in general, like Nova, you need at least 2 nodes to be able to test
it properly, ideally 3,
so that if you're doing a live migration there is actually a choice of host.
In general, however, Watcher, like Heat, just asks Nova to actually move the VMs.
Sure, it will ask Nova to move it to a specific host, but fundamentally, if
you have
tested live migration with Nova via Tempest separately, there is no reason
to expect
it would not work for a live migration triggered by Watcher or Heat or
anything else that
just calls Nova's API.
So you could still get some valuable testing in an all-in-one, but ideally
there would be at least 2 compute hosts.
> But at the very least, I can propose looking into adding an OSA job
> with Gnocchi as NV to the project, to show the state of the deployment
> with this driver.
>
Well, Gnocchi is also not a native OpenStack telemetry datastore; it left
our community to pursue its own goals and is now a third-party datastore,
just like Grafana or Prometheus.
Monasca is currently marked as inactive
(https://review.opendev.org/c/openstack/governance/+/897520) and is in the
process of being retired,
but it also has no testing on the Watcher side, so the combination of the
two is why we are deprecating it going forward.
If both change, I'm happy to see the support continue.
Gnocchi has testing, but we are not actively working on extending its
functionality going forward.
As long as it continues to work, I see no reason to change its support
status.
Watcher has quite a lot of untested integrations, which is unfortunate.
We are planning to build out a feature/test/support matrix in the docs
this cycle,
but, for example, Watcher can integrate with both Ironic and the Canonical
MAAS component
to do some level of host power management. None of that is tested, and we
are likely going
to mark them as experimental and reflect on whether we can continue to
support them or not going forward.
It also has the ability to do Cinder storage pool balancing, which is, I
think, also untested right now.
One of the things we hope to do is extend the existing testing in our
current jobs to cover gaps like
that where it is practical to do so. But creating a devstack plugin to
deploy MAAS with fake infrastructure
is likely a lot more than we can do with our existing contributors, so
expect that to go to experimental, then
deprecated, and finally be removed if no one turns up to support it.
Ironic is in the same boat; however, there are devstack jobs with fake
ironic nodes, so I
could see a path to us having an ironic job down the line. It's just not
high on our current priority
list to address the support status or testing of this currently.
Eventlet removal and other tech debt/community goals are definitely higher,
but I hope the new support/testing
matrix will at least help folks make informed decisions on what features
to use and what backends are
recommended going forward.
8 months, 2 weeks
[nova][ptg] 2025.2 Flamingo PTG summary
by Rene Ribaud
Hello everyone,
Last week was the PTG—thank you to those who joined! I hope you enjoyed it.
I haven’t gathered exact attendance stats, but it seemed that most sessions
had at least around 15 participants, with some peaks during the cross-team
discussions.
If you’d like to take a closer look, here’s the link to the PTG etherpad:
https://etherpad.opendev.org/p/r.bf5f1185e201e31ed8c3adeb45e3cf6d
We had a pretty full agenda for Nova, so here’s a summary I’ve tried to
keep as short as possible.
#### 2025.1 Epoxy Retrospective ####
17 specs were accepted, and 12 implemented — an excellent ratio. This
represents a clear improvement over previous cycles.
Virtiofs was successfully merged, unblocking other work and boosting
contributor motivation.
✅ We agreed to maintain regular status updates via the etherpad and follow
up during Nova meetings.
API Microversions & Tempest Coverage: several microversions were merged
with good structure.
However, some schema changes were not reflected in Tempest, causing
downstream blockers.
Also, the updates covered by the microversions were not propagated into the
SDK and OpenStack client.
✅ Ensure client-side features (e.g., server show) are also published and
tracked.
✅ Keep microversions isolated and document Tempest implications clearly in
specs.
✅ Raise awareness of the tempest-with-latest-microversion job during Nova
meetings.
✅ Monitor OpenAPI efforts in Nova, which may allow offloading schema checks
from Tempest in the future.
Eventlet Removal: progress is behind schedule, especially compared to other
projects like Neutron.
✅ Flag this as a priority area for upcoming cycles.
Review Process & Tracking: spec review days were difficult to coordinate,
and the status etherpad was often outdated.
✅ Encourage active contributors to support occasional contributors during
review days.
✅ Commit to keeping the etherpad current throughout the cycle.
#### 2025.2 Flamingo Planning ####
Timeline:
Soft spec freeze (no new specs): June 1st
Hard spec freeze (M2): July 3rd
Feature Freeze (FF): August 28th
Final release: late September / early October
✅ We agreed to officially adopt June 1st as the soft freeze date, based on
the successful approach in Epoxy.
✅ A spec review day will be scheduled around mid-June; these will be
scheduled and announced early to ensure participation.
✅ Uggla will update the schedule document with the proposed milestones.
#### Upstream Bug Triage ####
We acknowledged that active bug triage has slowed down, resulting in a
backlog increase (~150 untriaged bugs).
There is a consensus that triage remains important to maintain a clear
picture of the actual bug landscape.
✅ Trial a new approach: review some untriaged bugs at the end of Nova team
meetings.
✅ Process the list by age (starting with the newest or most-voted first).
#### Closing Old Bugs ####
A proposal was made to bulk-close bugs older than 2 years, with a
respectful and explanatory message, aiming to reduce backlog and improve
visibility.
However, multiple voices expressed strong reservations.
✅Take no action for now. Focus efforts on triaging new bugs first.
✅ If we successfully reduce the number of untriaged new bugs, we can
consider scrubbing the bug backlog and possibly closing some of the older
ones.
#### Preparation for Python 3.13 ####
While Python 3.13 is not mandatory for 2025.2, early compatibility work was
discussed due to known issues (e.g., eventlet is broken on 3.13, as
observed on Ubuntu 25.04)
Ubuntu 24.04 and CentOS Stream 10 will stay on 3.12 for their supported
lifespans.
A non-voting unit test job for Python 3.13 (openstack-tox-py313) has
already been added and is currently passing.
Introducing a functional job for 3.13 could be a good next step, if
resources allow.
✅ Gibi will track this as part of the broader eventlet removal work.
#### Confidential Computing Feature Planning ####
AMD SEV is already supported in Nova.
SEV-ES is implemented in libvirt and work is ongoing in Nova.
SEV-SNP is now supported in libvirt (v10.5.0). Work in Nova has not started
yet.
✅ Pay closer attention to SEV-ES reviews to help move this forward.
✅ Tkajinam will write a new spec for SEV-SNP.
Intel TDX
Kernel support is nearly ready (expected in 6.15).
Libvirt patches exist, but feature is not yet upstreamed or widely released.
✅ No action agreed yet, as this remains exploratory.
Arm CCA
No hardware is available yet; earliest expected in April 2027 (Fujitsu
Monaka).
Support in libvirt, QEMU, and Linux kernel is still under development.
✅ The use case is reasonable, but too early to proceed — we should wait
until libvirt and QEMU support is mature.
✅ It would be beneficial to wait for at least one Linux distribution to
officially support Arm CCA, allowing real-world testing.
✅ Attestation support for Arm is seen as external to Nova, with only minor
flags possibly needed in the guest.
#### RDT / MPAM Feature Discussion ####
RDT (Intel PQoS) and MPAM (Arm equivalent) aim to mitigate “noisy neighbor”
issues by allocating cache/memory bandwidth to VMs.
Development has stalled since 2019, primarily due to:
- Lower priority for contributors
- Lack of customer demand
- Infrastructure complexity (NUMA modeling, placement limitations)
✅ r-taketn to reopen and revise the original spec, showing a clear diff to
the previous version.
✅ Ensure that abstractions are generic, not tied to proprietary technology,
using libvirt + resource classes/traits may provide enough flexibility.
#### vTPM Live Migration ####
A spec for vTPM live migration was approved in Epoxy:
https://specs.openstack.org/openstack/nova-specs/specs/2025.1/approved/vtpm…
To support live-migratable vTPM-enabled instances, Barbican secrets used for
vTPM need to be owned by Nova, rather than the end user.
This shift in ownership allows Nova to access the secret during live
migration operations.
Opt-in is handled via image property or flavor extra spec, meaning user
consent is explicitly required.
Current Proposal to enable this workflow:
- Castellan should allow per-call configuration for sending the service
token (rather than relying on a global all-or-nothing setting).
Proposal: https://review.opendev.org/c/openstack/castellan/+/942015
- If the Nova service token is present, Barbican should set the secret
owner to Nova.
Proposal: https://review.opendev.org/c/openstack/barbican/+/942016
This workflow ensures Nova can read/delete the secret during lifecycle
operations like migration, without involving the user.
A question was raised around possible co-ownership between Nova and the end
user (e.g., both having access to the secret). While this may be
interesting longer-term, current implementation assumes a single owner.
✅ User and host modes are as described in the spec.
For deployment mode, Nova will:
- Authenticate to Barbican as itself (using a service token).
- Own the vTPM secret it creates — it will be able to create, read, and
delete it.
- The user will not see or control the secret, including deletion.
- The secret will be visible to other members of the Nova service project
by default, but this could be restricted in future via Barbican ACLs to
limit visibility to Nova only.
#### Cloud Hypervisor Integration ####
There is an ongoing effort to integrate Cloud Hypervisor into Nova via the
Libvirt driver:
Spec: https://review.opendev.org/c/openstack/nova-specs/+/945549
The current PoC requires only minor changes to work with Libvirt, and the
team is ready to present the proposal at the PTG.
✅ We’re happy with the direction the spec is taking. Below are some key
highlights regarding the spec.
✅ Clarify platform support (e.g., is libvirt compiled with cloud hypervisor
support by default? Is it available in distros?).
✅ Provide a plan for runtime attach of multiple NICs and volumes.
✅ Mark as experimental if cloud hypervisor is not yet in upstream distro
packages.
✅ Ensure that the following features are expected to work and covered in
the spec: resize, migrate, rebuild, evacuate, snapshot.
✅ Justify raw-only image support, and outline the path to qcow2
compatibility.
#### vGPU (mdev) and PCI SR-IOV Topics ####
1. Live-migratable flag handling (physical_network tag)
Bug: https://bugs.launchpad.net/nova/+bug/2102161
✅ We agreed that the current behavior is correct and consistent with the
intention:
If live_migratable = false → fallback to hotplug during live migration.
If live_migratable = true on both source and destination → prefer
transparent live migration.
✅ Investigate how Neutron might participate by requesting live-migratable
ports.
2. Preemptive live migration failure for non-migratable PCI devices
Nova currently checks for migratability during scheduling and conductor
phases. There’s a proposal to move these checks earlier, possibly to the
API level.
Bug: https://bugs.launchpad.net/nova/+bug/2103631
✅ Confirm with gmann whether a microversion is needed — likely not, as
return codes are already supported (202 → 400/409).
✅ Uggla may submit a small spec to formalize this change.
✅ Split the work into two steps:
- Fix existing bug (can be backported).
- Incrementally move other validations earlier in the flow.
3. PCI SR-IOV: Unify the Live Migration Code Path
There’s agreement on the need to reduce technical debt by refactoring the
current dual-code-path approach into a unified model for PCI live migration.
✅ A dedicated spec is needed to clarify and unify PCI claiming and
allocation.
✅ This refactor should address PCI claiming and allocation, potentially
deprecating or replacing move_claim in favor of more robust DB-backed logic.
✅ This effort is directly related to point 1 (migratability awareness) and
will help ensure consistent resource management across the codebase.
#### SPICE VDI – Next Steps ####
There is an ongoing effort to enhance libvirt domain XML configuration for
desktop virtualization use cases (e.g. SPICE with USB and sound
controllers). Some patches were proposed but not merged in time for Epoxy.
Mikal raised the question of whether a new spec would be required in
Flamingo, which would be the third iteration of this work.
The team also raised concern about the complexity of adding traits (e.g.
os-traits) for relatively simple additions, due to the multi-step process
involved (traits patch, release, requirements update, etc.).
✅ Proceed with a specless blueprint.
✅ Plan to pull os-traits and os-resource-classes logic into Placement, to
simplify the integration process and reduce friction. Link the required
Placement version in Nova documentation accordingly. This is a strategic
direction, even if some traits might still be shared with Neutron/Cinder.
#### Virtiofs Client Support ####
The virtiofs server-side support was merged in Epoxy, but SDK and
client-side support did not make it in time. The proposal is to merge both
patches early in Flamingo and then backport to Epoxy.
✅ No concern with microversion usage here.
✅The ordering of microversion support patches across Nova, SDKs, and
clients will be handled by respective owners.
✅ Uggla to track that each new microversion in Nova has a corresponding
patch in SDK/client layers.
✅ Not directly related to virtiofs, but the new reset-state confirmation
prompt in the client was noted and welcomed.
#### One-Time-Use (OTU) Devices ####
OTU devices are designed to be consumed once and then unreserved.
There is a need to provide practical guidance on handling these cleanly,
especially in notification-driven environments.
Additionally, there's an important patch related to Placement behavior on
over-capacity nodes:
https://review.opendev.org/c/openstack/placement/+/945465
Placement currently blocks new allocations on over-capacity nodes — even if
the new allocation reduces usage. This breaks migration from overloaded
hosts. The proposed fix allows allocations that do not worsen usage (i.e.,
that keep it the same or improve it).
Note: A similar OTU device handling strategy has been successfully used in
Ironic.
✅ Provide an example script or tool for external OTU device cleanup, based
on notifications (a sketch follows below).
✅ Agreement on the proposed Placement fix — it is operator-friendly and
resolves real issues in migration workflows.
✅ We likely need to dig deeper into implementation and tooling for broader
OTU support.
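For reference, a minimal sketch of what such an external cleanup tool could
look like, using an oslo.messaging notification listener; the topic, the
event filter, and the placeholder unreserve step are assumptions that would
need to match the actual OTU setup in a deployment:
```python
import sys

from oslo_config import cfg
import oslo_messaging


class OTUCleanupEndpoint(object):
    """React to instance deletion and unreserve the one-time-use device."""

    def info(self, ctxt, publisher_id, event_type, payload, metadata):
        if event_type != "instance.delete.end":
            return
        # Placeholder: look up the OTU resource provider used by this
        # instance and clear its reserved inventory via the Placement API
        # (for example with osc-placement or a direct REST call).
        print("instance deleted, OTU cleanup needed: %s" % payload)


def main():
    conf = cfg.CONF
    # transport_url and friends are expected to come from a config file.
    conf(sys.argv[1:], project="otu-cleanup")
    transport = oslo_messaging.get_notification_transport(conf)
    targets = [oslo_messaging.Target(topic="versioned_notifications")]
    listener = oslo_messaging.get_notification_listener(
        transport, targets, [OTUCleanupEndpoint()], executor="threading")
    listener.start()
    listener.wait()


if __name__ == "__main__":
    main()
```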
#### Glance cross-project session ####
Please look at the Glance summary.
#### Secure RBAC – Finalization Plan ####
Tobias raised concerns about incomplete secure RBAC support in Nova,
particularly around default roles and policy behavior. Much of the
groundwork has been done, but a number of patches still require review and
finalization.
✅ Gmann will continue working on the outstanding patches during the
Flamingo cycle. The objective is to complete secure RBAC support in Nova as
part of this cycle.
#### Image Properties Handling – DB Schema & API Response ####
The issue arises from discrepancies between image property metadata stored
by Nova and what is received from Glance. Nova’s DB schema enforces a
255-character limit on metadata keys and values, which can lead to silent
truncation or hard failures (e.g., when prefixing keys like image_ pushes
the total length over 255).
Nova stopped supporting custom image properties nearly a decade ago, when
the system moved to structured objects (ImageMetaProps via OVO).
Glance still allows some custom metadata, which may be passed through to
Nova.
This led to invalid or non-standard keys (e.g.,
owner_specified.openstack.sha256) being stored or exposed, even though they
are not part of Nova’s supported set.
Consensus emerged that we are facing two issues:
- Nova's API may expose more metadata than it should (from Glance).
- Nova stores non-standard or overly long keys/values, resulting in silent
truncation or hard DB errors.
✅ Nova should stop storing non-standard image properties altogether.
✅ A cleanup plan should be created to remove existing unused or invalid
metadata from Nova's database post-upgrade.
✅ During instance.save(), Nova should identify and delete unused image_*
keys from the system metadata table (illustrated below).
✅ We must be cautious to preserve snapshot-related keys that are valid but
not part of the base ImageMetaProps.
✅ These changes are considered bugfixes and can proceed without a new spec.
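A rough illustration of the cleanup idea (hypothetical names only; the real
change will live in Nova's instance save path and must preserve the valid
snapshot-related keys mentioned above):
```python
def prune_image_system_metadata(system_metadata, known_props, keys_to_keep):
    # known_props stands in for the properties defined on ImageMetaProps;
    # keys_to_keep stands in for the snapshot-related keys that must survive.
    pruned = {}
    for key, value in system_metadata.items():
        if not key.startswith("image_"):
            pruned[key] = value
            continue
        prop = key[len("image_"):]
        if prop in known_props or key in keys_to_keep:
            pruned[key] = value
        # else: drop the non-standard image property instead of truncating it
    return pruned
```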
#### Eventlet removal ####
Please read the excellent blog post series from Gibi here:
https://gibizer.github.io/posts/Eventlet-Removal-Flamingo-PTG/
#### Enhanced Granularity and Live Application of QoS ####
This was cross team Neutron/Cinder/Nova first topic.
Bloomberg folks presented early ideas around making QoS settings more
granular and mutable, and potentially applicable to existing ports or VMs,
not just at creation time.
Nova does not operate on multiple instances at once, which conflicts with
some proposed behaviors (e.g., live update of QoS on a network/project
level).
QoS is currently exposed via flavors in Nova, and is only supported on the
frontend for the Libvirt driver.
QoS mutability is non-trivial, with implications for scheduling, resource
modeling, and placement interactions.
The scope is broad and would require cross-project collaboration (Neutron,
Cinder, Placement).
Use cases and notes from Bloomberg:
https://etherpad.opendev.org/p/OpenStack_QoS_Feature_Enhancement_Discussion
✅ Use flavor-based modeling for QoS remains the Nova approach.
✅ Nova should not apply policies across many instances simultaneously.
✅ A spec will be required, especially if new APIs or behavior modifications
for existing VMs are introduced. The spec should provide concrete use case
examples and API design proposals, including expected behavior during
lifecycle operations (resize, rebuild, shelve, etc.).
✅ Max bandwidth adjustments may be possible (as they don’t require
reservations), but broader mutability is more complex.
✅ Neutron and Cinder raised no objections regarding Bloomberg’s use cases
and proposals. However, please look at Neutron and Cinder's respective
summaries.
#### Moving TAP Device Creation from Libvirt to os-vif ####
This change proposes moving the creation of TAP devices from the Libvirt
driver into os-vif, making it more consistent and decoupled. However, it
introduces upgrade and timing considerations, especially regarding Neutron
and OVN behavior.
Bug: https://bugs.launchpad.net/nova/+bug/2073254
Patch: https://review.opendev.org/c/openstack/nova/+/942786
✅ Neutron team is open to adjusting the timing of the "port ready" event,
which could eliminate the need for Nova-side hacks.
✅ Sean will proceed with the patch and verify behavior through CI.
#### Instance Annotations, Labels & K8s-Like Semantics ####
Sean proposed introducing a mechanism similar to Kubernetes annotations and
labels in Nova, to:
- Express user intent regarding instance behavior (e.g., "should this
instance be migrated?")
- Convey lifecycle preferences to external tools like Watcher and Masakari
- Expose capabilities or constraints of an instance (e.g., "cannot be
shelved because it has a vTPM")
Proposed Examples of Instance Annotations:
lifecycle:live-migratable=true|false
ha:role=primary|secondary
These would be:
- Set by users (or operators)
- Optionally inherited from flavors (but conflicts would raise 400 Bad
Request)
- Expressed intent only — not enforcement of policy
In addition, labels generated by Nova could reflect actual capabilities,
like:
lifecycle:live-migratable=false if an instance has a PCI device
lifecycle:shelvable=false if it uses vTPM
✅ Define a new API to expose capabilities of instances (e.g., “can this
instance be live-migrated?”)
Values will be derived by Nova based on configuration/hardware and exposed
via nova server show.
✅ Sean will create a spec.
✅ Looking at user-defined labels, we eventually considered defining a
second API for them to express scheduling/HA preferences.
However, we decided the preferred approach for now is to start with the
metadata API and evolve to a first-class model later (see the sketch below).
We may need admin-only metadata (e.g., for HA tooling like Masakari); this
is discussed in the Admin-Only Instance Metadata / Annotations topic later
in this summary.
✅ Sean will also create a spec for this.
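As a rough sketch of that interim approach (the key names below are just the
proposals from this session, not an implemented Nova contract, and
"mycloud"/"my-instance" are placeholders):
```
# Sketch only: express intent via plain server metadata until a first-class
# annotation/label model exists. Nothing here is validated or enforced by Nova.
import openstack

conn = openstack.connect(cloud="mycloud")         # assumed clouds.yaml entry
server = conn.compute.find_server("my-instance")  # placeholder server name

conn.compute.set_server_metadata(
    server,
    **{"lifecycle:live-migratable": "false", "ha:role": "primary"},
)
```
External tooling (Watcher, Masakari, etc.) could read such keys as hints;
enforcement would only come later with the first-class model.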
#### External Traits and Node Pressure Metrics ####
Sean also proposed allowing external systems (e.g., Watcher, telemetry
agents) to annotate compute nodes with traits such as memory/cpu/io
pressure, based on /proc/pressure.
Examples:
CUSTOM_MEM_PRESSURE=high
EXTERNAL_IO_PRESSURE=moderate
✅ Support a COMPUTE_MEM_PRESSURE-like trait, populated from sysfs as static
info (not dynamic).
✅ A weigher could use these traits to influence placement. A default traits
list could be configured (e.g., prefer/avoid hosts with certain pressures
or hardware features). This approach could evolve into a generic “preferred
traits” weigher, similar to Kubernetes taints/tolerations. A rough sketch of
deriving such traits is shown below.
✅ Sean will create a dedicated spec for this feature.
✅ Sbauza volunteered to help, especially as the work aligns with weigher
logic from the previous cycle.
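A rough sketch of how such a trait could be derived from the kernel's
pressure-stall interface (the thresholds and CUSTOM_* trait names are
illustrative only; the real naming and cadence would be defined in the spec):
```
# Sketch only: map /proc/pressure/<resource> averages to a coarse trait name.
def read_psi_some_avg60(resource):
    """Return the 60s 'some' pressure average for cpu, memory or io."""
    with open(f"/proc/pressure/{resource}") as f:
        for line in f:
            if line.startswith("some"):
                fields = dict(kv.split("=") for kv in line.split()[1:])
                return float(fields["avg60"])
    return 0.0

def pressure_trait(resource):
    avg = read_psi_some_avg60(resource)
    if avg >= 40.0:     # arbitrary threshold for illustration
        return f"CUSTOM_{resource.upper()}_PRESSURE_HIGH"
    if avg >= 10.0:
        return f"CUSTOM_{resource.upper()}_PRESSURE_MODERATE"
    return None

print(pressure_trait("memory"))
```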
#### OpenAPI Schema Integration ####
Stephen highlighted that most of the heavy lifting for OpenAPI support is
now complete, and the work is down to pure response schema definitions.
This effort now spans three cycles, and it would be valuable to
finalize it early in Flamingo.
✅ We'll formalize this work with a blueprint.
✅ The goal is to make early progress in Flamingo, ideally with a dedicated
review day.
✅ Stephen is happy to join synchronous review sessions and will coordinate
pings for progress.
✅ Masahito volunteered to help with the remaining work.
#### OpenStack SDK & Client Workflows ####
Stephen raised a few concerns regarding timing mismatches between SDK/OSC
freezes and microversion patch merges in Nova.
Some microversion support landed too late to be integrated in the SDK
before the Epoxy freeze.
Patches were sometimes missed due to lack of "depends-on" links or broken
initial submissions.
✅ Uggla will follow up and finalize these patches early in the Flamingo
cycle.
#### Upstream Testing for PCI Passthrough and mdev Devices ####
With IGB support merged in Epoxy, and vIOMMU enabled in some Vexxhost
workers (thanks to dansmith), the opportunity exists to expand PCI testing
upstream in Tempest.
This would also benefit testing of one-time-use (OTU) devices.
Finalizing mtty testing is a priority, as it helps ensure device support is
consistent and regressions (like bug #2098892) are caught early.
✅ Bauzas will lead on wrapping up mtty testing.
✅ Gibi will coordinate with cloud providers to assess Epoxy support and
revisit this topic during the next PTG if needed.
#### CPU Power Management – Expected Behavior ####
Melanie raised questions about inconsistencies between design and
implementation in Nova’s CPU power management logic. In particular:
- CPUs were being offlined too aggressively, sometimes during reboot or
migration operations.
- This contradicts the intent that only unassigned or deallocated cores
should be powered off.
There was confusion between two approaches:
- Aggressive power-down of unused CPUs during all idle states (stop,
shelve, etc.)
- Conservative behavior, powering off cores only when the VM is deleted or
migrated away
Consensus favored the aggressive-but-safe model:
- Power down cores only when not used, e.g., VM is stopped or migrated.
- Be cautious not to power off cores prematurely (e.g., during reboot or
verify-resize).
✅ Do not rush to power off CPU cores at compute startup or mid-operation.
✅ Revisit the implementation so the resource tracker runs first, and
determines actual core assignments before making decisions.
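A minimal sketch of that ordering (assignments first, power decisions
second), using the standard Linux CPU hotplug interface; the function names
are hypothetical and this is not Nova's actual implementation:
```
# Sketch only: offline exactly the dedicated cores that are not assigned to
# any instance, and (re)online the ones that are.
def set_core_online(core_id, online):
    with open(f"/sys/devices/system/cpu/cpu{core_id}/online", "w") as f:
        f.write("1" if online else "0")

def reconcile_core_power(dedicated_set, assigned_cores):
    """dedicated_set: cores managed for power (from config).
    assigned_cores: cores currently pinned to instances (resource tracker)."""
    for core in sorted(dedicated_set):
        set_core_online(core, online=(core in assigned_cores))

# e.g. reconcile_core_power(dedicated_set={4, 5, 6, 7}, assigned_cores={4, 5})
```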
#### Live Migration with Encrypted Volumes (Barbican Integration) ####
HJ-KIM raised the point that Nova does not currently support live migration
of instances using encrypted Cinder volumes managed by Barbican. This is a
critical blocker in environments with strict compliance requirements.
✅ This is a parallel issue to vTPM support. We will learn from the vTPM
implementation and consider applying similar concepts.
✅ A future solution may involve adjusting how ownership is managed, or
providing scoped access via ACLs.
✅ Further discussion/spec work will be needed once an implementation
direction is clearer.
#### Manila–Nova Cross-Team Integration ####
The initial Manila–Nova integration is now merged — thanks to everyone
involved!
The next step is to:
- Add automated testing (currently manual tests only).
- Start with a few basic positive and negative test scenarios (create,
attach, write, delete; snapshot and restore; rule visibility; restricted
deletion; etc.).
Additionally, longer-term features and improvements are being considered;
please look at the etherpad.
✅ We will work on tempest tests.
✅ We will continue enhancing Nova–Manila integration during Flamingo (F)
and beyond.
✅ Uggla will submit a spec as needed to land memfd support.
#### Provider Traits Management via provider.yaml ####
📌 Spec: https://review.opendev.org/c/openstack/nova-specs/+/937587
Problem: Traits defined in provider.yaml are added to Placement but never
removed if deleted from the file.
✅ Implement a mechanism where Nova copies the applied file to
/var/lib/nova/applied_provider.yaml, and diffs it with the active one on
restart.
This would allow traits (and possibly other config) to be safely
removed.
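A minimal sketch of the diff step, assuming traits are listed per provider in
the config file; the "active" path and the parsing below are illustrative
only:
```
# Sketch only: compare the previously applied provider config with the current
# one and work out which traits should be removed from Placement.
import yaml

def traits_by_provider(path):
    with open(path) as f:
        data = yaml.safe_load(f) or {}
    result = {}
    for provider in data.get("providers", []):
        ident = str(provider.get("identification"))
        # assumes custom traits live under "traits: additional:" per provider
        traits = set(provider.get("traits", {}).get("additional", []))
        result[ident] = traits
    return result

applied = traits_by_provider("/var/lib/nova/applied_provider.yaml")
current = traits_by_provider("/etc/nova/provider_config/provider.yaml")

for ident, old_traits in applied.items():
    to_remove = old_traits - current.get(ident, set())
    if to_remove:
        print(f"{ident}: remove {sorted(to_remove)} from Placement")
```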
#### Admin-Only Instance Metadata / Annotations ####
📌 Spec: https://review.opendev.org/c/openstack/nova-specs/+/939190
Issue: Current instance metadata is user-owned, and shouldn't be used by
admins.
Proposal: Introduce admin-only annotations (or metadata with ownership
tracking), allowing operators to set system-visible metadata without
violating user intent.
✅ Introduce a created_by field (similar to locked_by) to track who created
metadata: user vs admin.
Consider an admin: prefix namespace for admin-controlled keys (applied to
annotations or metadata).
Implementation requires a DB change and a nova-spec.
Note: This aligns well with broader annotation work already discussed in
this cycle.
#### delete_on_terminate for Ports (Server Create / Network Attach APIs)
####
📌 Related discussion:
https://review.opendev.org/c/openstack/nova-specs/+/936990
Background: This was discussed in past PTGs. Currently, delete_on_terminate
can't be updated dynamically across instance lifetime.
✅ A spec with a working PoC will help clarify the desired behavior and
unblock the discussion.
Long-term solution may require storing this flag in Neutron as a port
property (rather than Nova-specific DB).
#### Graceful Shutdown of Nova Compute Services ####
📌 Spec: https://review.opendev.org/c/openstack/nova-specs/+/937185
Challenge: Need a mechanism to drain compute nodes gracefully before
shutdown, without interrupting active workloads or migrations.
Graceful shutdown is tricky in the presence of live migrations.
Ideas include:
- Temporary “maintenance mode” (block write requests).
- Group-level compute draining.
✅ The topic is important but not urgent — bandwidth is limited.
Note: Eventlet removal may simplify implementing this.
✅ Please report concrete bugs so we understand the blockers.
✅ A nova-spec with PoC would help drive the conversation.
#### Libvirt/QEMU Attributes via Flavor Extra Specs ####
Target: Advanced tuning of I/O performance via iothreads and virtqueue
mapping, based on:
https://developers.redhat.com/articles/2024/09/05/scaling-virtio-blk-disk-i…
✅ Introduce new flavor extra specs such as:
- hw:io_threads=4
- hw:blk_multiqueue=2
These can be added to both flavor and image properties.
✅ A nova-spec should be written to document naming and semantics.
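If the spec lands, setting these would look like any other flavor extra spec;
a hedged sketch via the SDK (the two property names are only the session's
proposals and are not recognized by Nova today, and the cloud/flavor names
are placeholders):
```
# Hypothetical sketch: the hw:io_threads / hw:blk_multiqueue names are only
# proposals from this session; nothing validates or honours them yet.
import openstack

conn = openstack.connect(cloud="mycloud")      # assumed clouds.yaml entry
flavor = conn.compute.find_flavor("io-tuned")  # assumed existing flavor

# assumes the SDK's flavor extra-spec helper
conn.compute.create_flavor_extra_specs(
    flavor, {"hw:io_threads": "4", "hw:blk_multiqueue": "2"})
```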
#### Dynamic Modification of libvirt Domain XML (Hook Proposal) ####
oVirt allows plugins to alter the libvirt domain XML just before
instance launch (via VDSM hooks).
Nova does not offer a mechanism to intercept or modify the domain XML, and
the design explicitly avoids this.
The desired use case involves injecting configuration that libvirt cannot
currently represent, for example, enabling multiuser SPICE consoles.
✅ This proposal is explicitly rejected.
✅ Nova will not support hook points for modifying the libvirt domain XML.
✅ Operators may use out-of-band libvirt/qemu hooks at their own risk, but
should not expect upstream support or stability guarantees.
#### Revisiting the "No More API Proxies" Rule ####
Masahito proposed allowing users to filter instances via API based on
related service data, such as network_id.
✅ The "no API proxy" rule remains, but with pragmatic exceptions:
- Filtering is acceptable if the data exists in Nova’s DB (e.g., network
ID, image ID).
- No cross-service REST calls allowed (e.g., Neutron QoS types still out of
scope).
- Filtering by network_id in nova list is reasonable and can proceed.
✅ Masahito will provide a spec.
#### OVN Migration & Port Setup Timing ####
📌 Context: https://bugs.launchpad.net/nova/+bug/2073254
In OVN-based deployments, Neutron signals the network-plugged event too
early, before the port is fully set up. This causes issues in live
migration, especially under load.
✅ Nova already supports waiting on the network-plugged event. OVN in Ubuntu
Noble should behave properly.
A proposal to improve timing in Neutron was discussed (Neutron to wait for
port claim in southbound DB).
Nova might support this via a Neutron port hint that triggers tap interface
creation earlier during migration (pre-live-migration).
✅ Next step: open an RFE bug in Neutron. If accepted, a Nova spec may be
needed.
#### Blocking API Threads During Volume Attachments ####
📌 Context: https://bugs.launchpad.net/nova/+bug/1930406
Volume attachment RPC calls block API workers in uWSGI, leading to
starvation when multiple attachments are made in parallel.
✅ Volume/interface attachments should become async, reducing API lock
contention.
Fix is non-trivial and will require a microversion.
In the meantime, operators may tune uWSGI workers/threads or serialize
attachment calls.
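Pending that fix, one form of the interim workaround mentioned above is to
throttle attach requests client-side; a sketch, where attach_volume() stands
in for whatever client call is actually used:
```
# Sketch only: cap concurrent attach calls so a burst of attachments cannot
# tie up all API workers. attach_volume() is a placeholder, not a real API.
import threading
from concurrent.futures import ThreadPoolExecutor

ATTACH_SLOTS = threading.Semaphore(2)  # at most two attach requests in flight

def attach_volume(server_id, volume_id):
    raise NotImplementedError("placeholder for the real attach call")

def attach_with_throttle(server_id, volume_id):
    with ATTACH_SLOTS:
        return attach_volume(server_id, volume_id)

with ThreadPoolExecutor(max_workers=8) as pool:
    for vol in ("vol-1", "vol-2", "vol-3"):
        pool.submit(attach_with_throttle, "server-1", vol)
```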
#### Inventory Update Failure – DISK_GB Bug ####
📌 Bug: https://bugs.launchpad.net/nova/+bug/2093869
When local storage becomes temporarily unavailable (e.g., Ceph down), Nova
sends total=0 for DISK_GB, which Placement rejects if allocations exist.
✅ The real fix is to restore the storage backend.
Nova should improve error handling/logging, but should not shut down the
compute service.
#### Security Group Name Conflict Bug ####
📌 Bug: https://bugs.launchpad.net/nova/+bug/2105896
When multiple security groups share the same name (via Neutron RBAC),
instance builds can fail due to incorrect duplicate detection logic.
✅ The issue was fixed in:
https://review.opendev.org/c/openstack/nova/+/946079
✅ Fix will be reviewed and backported to Epoxy.
If you've read this far — thank you! 🙏
If you spot any mistakes or missing points, please don't hesitate to let me
know.
Best regards.
René.
8 months, 2 weeks
[manila] 2025.2 Flamingo PTG summary
by Carlos Silva
Hello Zorillas and interested stackers,
Last week's PTG had plenty of topics and good takeaways.
In case you would like to watch any of the discussions, please take a look
at the videos in the OpenStack Manila Youtube channel [0].
The PTG etherpad has all of the notes we took [9]. Here is a summary of the
discussions grouped by each topic:
Retrospective
==========
Highlights
-------------
The mid-cycle alongside the feature proposal freeze provided a good
opportunity for us to have collaborative review sessions and move faster on
reviews.
Two bug squashes had a good impact on the bug backlog, and the bug trend was
more positive this cycle, despite the numbers growing due to the
low-hanging fruit we started reporting.
Internships with City University of Seattle, Valencia College and North
Dakota State University are definitely helping with progress on
manila-ui and OpenAPI. We will continue the effort.
We would like to speed up reviews and improve our metrics [1] on how long
changes are open before being merged. Review dashboards can help and we can
work with our reviewers to have a more disciplined approach on reviews.
Broken third-party CI systems currently mean that we have little testing.
We need to rely on the authors or their peers to test and ensure that a
feature is working. We will look into documenting CI setup procedures and
gather thoughts from maintainers.
New API features should be tested as early as possible to ensure they won't
break any workflows. Our contributor documentation will be updated with
extra guidelines.
AIs:
(carloss) Encourage Bug Czar candidates and bring this up more often in the
manila weekly meetings
(carloss) Encourage spec authors to schedule a meeting to discuss the spec
to speed up the review process.
(carloss) include iCal with event announcements (bugsquash / mid cycle)
(gouthamr) Creating a review dashboard
(carloss) Record "expert seminars" on FAQs: it would be great to have some
videos documenting how-tos in OpenStack and help people to unblock
themselves when they are hitting common openstack-developer issues:
https://etherpad.opendev.org/p/manila-howcasts
(carloss) communicate a deadline for the manila CLI -> OSC documentation
changes. The work with our interns should go until FPF. It needs to be done
before the client release, when we are planning to drop the manilaclient
support. ashrodri offered to help get it completed once we reach the FPF
deadline.
(carloss) We should update these docs to mention that first-party driver
implementations should be done for new features, and be stricter about the
testing requirements.
All things CephFS [2]
================
Deprecation of standalone NFS-Ganesha
-------------------------------------------------------
We added a warning in Dalmatian, deferred plans to deprecate based on
community feedback. Our plan is to remove it in the 2026.1 release. There
is a suggested update procedure, please reach out in case there are
questions.
AI: (carloss) send a reminder email in this cycle to incentivize people to
move to clustered NFS
Supporting NFSv3 for Windows workloads
--------------------------------------------------------
manila-tempest-plugin now supports multiple NFS protocol versions in one of
the scenario tests. As soon as we get the build, we will update the CephFS
NFS job to run tests for NFSv3 as well.
Testing and stabilization
--------------------------------
Bumped Ceph version in the CI jobs to Reef in Antelope, Bobcat, Caracal,
Dalmatian. We are starting to test with Ceph Squid; we intend to test with
Squid on "master" and "stable/2025.1" (epoxy) branches.
A couple of Ceph and NFS-Ganesha issues are impacting us at the moment [4]
[5] [6] and we managed to find workarounds for some of them.
We had to stop testing with the ingress daemon for now and we will get
back to testing as soon as the fix is out.
Manage unmanage of shares and snapshots
-----------------------------------------------------------
The feature is merged and working, and we are going to backfill tempest
test patches.
AI: (carloss) will propose a new job variant to allow testing this feature.
Plans for 2025.2 Flamingo
-----------------------------------
Investigate support for SMB/CIFS
Ceph-NFS QoS: we will follow the implementation of this feature in NFS
Ganesha and start discussing and drafting the Manila implementation when
the code is merged in Ganesha upstream.
Out of place restores and backup enhancements [7]
========================================
CERN is pursuing a backup backend with their C-Back tool. Currently Manila
backups can be restored back to the same share; there are some problems
with that approach when the source share backend is down, and with
preventing browse-by-restore behavior.
Zachary Goggins (za) proposed a specification and plans to work on it
during the Flamingo cycle. The share backups feature also needs some
enhancements, like get-progress and get-restore-progress actions. Zach
plans to make them part of the implementation.
We agreed that a backup resource should have a new "state" attribute,
instead of only relying on the status in order to have well defined backup
states.
AI: (za) update the out of place restore spec.
Tech debt
=======
Container driver failures
--------------------------------
The container driver tempest tests are perma-failing right now. We seem to
have a problem with RBAC and pre-provisioned tempest credentials.
AIs:
(carloss) Report a tempest bug to track the issues;
(gouthamr) will propose a change to switch back to using dynamic
credentials in our testing.
DockerHub rate limits
-----------------------------
We are only building an image in manila-image-elements. It's more pulls
than pushes. Pushes happen very rarely. The kolla team has moved away from
DockerHub as well.
Zach offered help in case we need another approach for registry. CERN has
its own tool.
AI: we will look into moving to quay.io
"manila" CLI removal
----------------------------
We added the deprecation warning 6 releases ago and we should proceed with
the removal. We will need an additional push to update all of our
documentation examples and move to keystoneauth.
We need more functional test coverage and we should have a hackathon just
as we did some years ago.
AI: carloss will schedule a hackathon for enabling more tests and send the
removal email to openstack-discuss. We are targeting the removal to 2025.2
Flamingo.
CI and testing
------------------
ZFSOnLinux job left on jammy: We created a bug for it and we can use it for
tracking.
IPv6 testing: The BGP software we were using (quagga) is now deprecated and
everything was migrated to FRR. We will need help to fix it as,
unfortunately, there wasn't a 1:1 translation between the libraries.
If someone has experience on this, it would be nice to collaborate to get
this fixed.
API
----
We are going to stop testing the v1 API and stop deploying it on DevStack
test jobs. We'll also update the install guide to note that we've stopped
supporting it. It was deprecated in 2015 ("Liberty" release). That's a good
code cleanup opportunity.
V2 is an extension of v1 with microversions.
If we stop supporting it, who is affected? Mostly people that have
automations using it.
What's the impact on manila-tempest-plugin? We have v1 and v2 tests. We
have a lot of coverage for v2. If you don't have the v1 API in the cloud,
the tests refuse to run. We will need to fix it.
AIs:
Work on the removal patches during the 2025.2 Flamingo release;
(carloss) will send an announcement email to the ML, including operators
tag.
Manila UI
-------------
We have been making progress in the Manila UI feature gap. Currently
working on manage/unmanage share servers, manage share with dhss=true,
filtering user messages on date, updating quotas table.
The share limits view broke some time ago; the code lives in Horizon.
We hit some issues using horizon's tox "runserver" environment; apparently
more people ran into the same issue. We will talk to other impacted parties
and check how to overcome this issue.
AI: (carloss) will reach out to the horizon team and ask how we can
re-introduce Manila limits to the overview tab.
Enable share encryption at-rest (back-end) with secret refs stored on
Barbican/Castellan. [8]
=====================================================================
We merged a specification some time ago with an implementation
architecture. That spec contemplated both Share encryption and Share server
encryption.
NetApp is now planning to work only on share server encryption. Encryption
can be disabled per share, but shares exported via a share server cannot
have a separate encryption key on ONTAP.
We reached an agreement that when a new share creation is triggered, if
there isn't a share server matching the provided key, a new share server
will need to be spawned. We also agreed that we should allow using names
for the secret reference for better user experience.
2025.2 Flamingo is the target release.
AIs: (kpdev/Sai) The spec will be updated and only the DHSS=True scenario
will be documented; The manila team will review the spec as soon as it is
proposed
Replication Improvements
====================
Back when we implemented replication, we didn't account for specific
configurations that the storage backends can have, for example whether the
backend could support zero RPO technologies or not.
Zero RPO is an important feature that allows data to be written
simultaneously between the share and its replicas.
We agreed that the way we should send the information to the backend is
through a backend specific share type extra spec. Administrators will be
able to define it in the share type and the backend will pick it up.
Operator concerns / questions
=======================
Where should we put parameters that change the behaviour of only one
protocol (NFS in this case)? We agreed that we should have write-once
metadata and not allow it to be updated afterwards. A configuration option
can be introduced for this, where the operator can determine which metadata
cannot be updated.
AI: carthaca will propose a lite-spec for this
Lustre FS Support for HPC Use Cases in OpenStack
Is there any possibility for OpenStack to officially integrate or support
parallel file systems like Lustre, either through Manila or other
components? We've heard this request in the past from the scientific-sig
group. Building a driver should be straightforward; it does not
necessarily need to be in-tree, which would make it easier to maintain. This
is a very good use case. This discussion will continue with the
scientific-sig group.
Replica / Snapshot Retention / Expiration Policy
While replicas in Manila are designed to be continuously in sync with the
active share, certain use cases — such as disaster recovery (DR) replicas
or manually created replicas that are no longer needed — could benefit from
lifecycle management.
Replicas are continuously synced with the source share, so the assumption is
that if they look "unused", they are still there for a reason. We had a spec
a while ago about automating snapshots (creation and deletion) on a schedule.
It would be preferable for an external automation tool to be used to achieve
such behavior. Maybe openstack/mistral can be a good approach (support for
manila snapshots already exists in Mistral).
Affinity/Anti-affinity spec updates
=========================
This feature allows users to create share groups with affinity policies,
which determine the affinity relationship between shares within the group.
There was an open question about locking strategies. We agreed that we can
use tooz, the database, or oslo.
AI: (chuanm) will update the spec.
Force deleting subnets
=================
This is a feature that follows the ability to add multiple subnets to a
share server. We should also be able to remove them. This spec is under
review.
We agreed that we should also implement the "check" mechanism before
deleting the subnet.
AIs: (sylvanld) will update the spec
Eventlet removal
=============
We need to remove eventlet-based WSGI uses and use oslo.service's new
threading-based backend instead for the ProcessLauncher and periodic tasks.
Neutron is doing some work around periodic tasks and we can benefit from
their learning.
AI: Work on this in Flamingo, aiming for completion in the 2026.1 cycle.
Manila/Nova Cross-project session: VirtioFS
=================================
VirtioFS implementation is now complete and we are looking at the next
steps. We currently don't have CI testing the feature and the Manila team
is planning to work on it during the 2025.2 Flamingo release.
The nova team intends to drive the remaining SDK and OSC patches to
completion during the 2025.2 Flamingo release.
We also discussed some possible enhancements: memfd support, online attach
and detach, and live migration. These will take some time and the Nova team
will work on such features gradually.
AIs: (carloss) will share the test scenarios with the Nova team and ask for
reviews and the Manila team will work on the implementation of the tests.
(rribaud) will work on the remaining SDK patch and on memfd support.
[0]
https://www.youtube.com/watch?v=MLXkBRhViS0&list=PLnpzT0InFrqADxXi_dtPqfWLt…
[1]
https://openstack.biterg.io/app/dashboards#/view/Gerrit-Backlog?_g=(filters…:'Gerrit%20Backlog%20panel%20by%20Bitergia.
',filters:!(('$state':(store:appState),meta:(alias:'Changesets%20Only',disabled:!f,index:gerrit,key:type,negate:!f,params:(query:changeset),type:phrase),query:(match:(type:(query:changeset,type:phrase)))),('$state':(store:appState),meta:(alias:Bots,disabled:!f,index:gerrit,key:author_bot,negate:!t,params:(query:!t),type:phrase),query:(match:(author_bot:(query:!t,type:phrase)))),('$state':(store:appState),meta:(alias:!n,disabled:!f,index:gerrit,key:project,negate:!f,params:(query:manila),type:phrase),query:(match_phrase:(project:manila)))),fullScreenMode:!f,options:(darkTheme:!f,useMargins:!t),query:(language:lucene,query:(query_string:(analyze_wildcard:!t,default_field:'*',query:'*',time_zone:Europe%2FMadrid))),timeRestore:!f,title:'Gerrit%20Backlog',viewMode:view)
[2] https://etherpad.opendev.org/p/flamingo-ptg-manila-cephfs
[3] https://bugs.launchpad.net/manila/+bug/2049538
[4] https://github.com/nfs-ganesha/nfs-ganesha/issues/1227
[5] https://tracker.ceph.com/issues/69214
[6] https://tracker.ceph.com/issues/67323
[7] https://review.opendev.org/c/openstack/manila-specs/+/942694
[8] https://etherpad.opendev.org/p/share-encryption-with-barbican-secret-ref
[9] https://etherpad.opendev.org/p/flamingo-ptg-manila
Thank you everyone that participated on the PTG!
Best regards,
carloss
8 months, 2 weeks
Re: [nova] Image Encryption patch
by Sean Mooney
On 14/08/2025 15:09, Dan Smith wrote:
>>> One of the things that is not supported in your series is direct booting
>> of an encrypted image.
> I could be wrong, but I think this is just a simplistic read of the first addition in the patch. AFAIK, the direct-boot abort is already in the tree, and they are just adding an additional check for the new key id parameter to mirror the same (existing) behavior. That is, of course, fine.
yes, it is just an extension of that, but you should be able to use it for
the "boot from volume from image" workflow, no?
we had a very long conversation about local image encryption and why you
were not ok with breaking the workflow of creating a vm,
modifying it to customize it and then creating a snapshot and booting
additional vms.
if the snapshot you are taking is of the boot volume and you don't
support that workflow as well, then we have a conflict between
the requirements for both features.
if something is taking a snapshot of a data volume and uploading it as
an image, that is different, as that data volume is presumably not marked as
bootable anyway in cinder.
blocking boots using the encrypted image for local storage is totally valid,
as we have not implemented that in nova yet, but i would expect the BFV case
to work.
>
>> In April 2024 we had a cross project session with Nova and Glance at the PTG [4]!
>> There was a big discussion around the encryption format initiated by Dan Smith (Nova). He proposed to move away from GPG and use LUKS instead because this would streamline it with existing functionality and formats that are already available in Nova and Cinder.
>> Due to this proposal from Nova, we agreed to discard our existing patchsets [5] and rewrite our image encryption feature with new patchsets [6] with LUKS as the encryption format, as suggested by Dan Smith (Nova).
>> We also talked specifically about the cryptographic key differentiation (hexlify vs. non-hexlify) which materialized in the os-brick change that you mentioned.
> Yep, this and the rest of your history summary matches my recollection as well.
that is all fine and it more or less aligns with my recollection of this
too, however it misses the point that feature proposals, be they tracked by a
blueprint or a spec, are only accepted for a given release
and need to be explicitly proposed again for the next cycle if they are
not complete. so even if it was accepted as a specless blueprint or an
actual spec in 2024 on the nova side, it would still
need one for this cycle. approval for dalmatian 2024.2 expired at the
start of the 2025.1 cycle.
>
> I know I've been on the hook to review this stuff and just keep getting pulled in different directions on more important stuff. My apologies, but there are some pretty important things up for review right now (like eventlet removal). Your patch to use brick for the passphrase extraction seems like a fine thing to merge at this point, especially because the earlier we merge it the better from the compatibility point of view. I'll try to make time today to look at it in detail.
>
> --Dan
>
4 months, 2 weeks
[nova][ptg] 2025.1 Epoxy PTG summary
by Sylvain Bauza
(resending the email as the previous one was blocked due to an attached
etherpad backup text file larger than the max size)
Hey all,
First, thanks for having joined us if you were in the vPTG. We had 15-20
people every day for our nova sessions, I was definitely happy to see new
folks :-)
If you want to see our PTG etherpad, please look at
https://etherpad.opendev.org/p/r.4f297ee4698e02c16c4007f7ee76b7c1 instead
of the main nova etherpad, as I don't want the etherpad to end up with a
wrong translation or with some paragraphs removed.
As I say every cycle, just take a coffee (or a tea) now as the summary will
be large.
### Dalmatian retrospective and Epoxy planning ###
6 of 15 approved blueprints were eventually implemented. We also merged
more than 31 bugfixes during Dalmatian.
We agreed to announce on the IRC channel when we have meetings for
discussing some feature series (like the one we did every week for the
manila/virtiofs series) and to provide public invitations. We could do
this again this cycle for other features; we'll see.
We will also try to have a periodic integration-compute job that pulls OSC
and SDK from master.
Our Epoxy deadlines will be: two spec review days (R-16, R-2), a soft spec
approval freeze by R-16 and then a hard spec approval freeze by R-12. That
means that contributors really need to provide their specs before
mid-December. Bauzas (me) will add these deadlines to the Epoxy schedule:
https://releases.openstack.org/epoxy/schedule.html
### vTPM live migration ###
We agreed on the fact that a vTPM live-migration feature is a priority for
Epoxy given Windows 11.
artom will create a spec proposing an image metadata property saying 'do I
want to share my secret with the nova service user?' and also providing a
new `nova-manage image_property set migratable_something` command so
operators could update the existing instances for getting the Barbican
secrets, if the operators really want that.
### Unified limits wrap-up ###
We already have two changes that need to be merged before we can modify the
default quota driver (in order to default to unified limits). We agreed
to review both patches (one for treating unset limits as unlimited, the
other about adding a nova-manage command for automatically creating nova
limits), but we also discussed a later patch that would eventually
say which nova resources need to be set (so we *have to* enforce
them anyway). melwitt agreed to work on that later patch.
### per-process health checks ###
We already had one series and we discussed it again. Gibi agreed to take
it over and he will re-propose the existing spec as it is. We also
discussed the first checks we would have, like RPC failures and DB
connection issues; we'll review those when they are in Gerrit.
### sustainable computing (a.k.a. power mgmt) ###
When someone (I won't say who [1]) implemented power management in
Antelope, this was nice, but we eventually found a long list of bugs that we
fixed. Since we don't really want to reproduce that experience, we had a
kind of post-mortem where we eventually agreed on two things that could
avoid reproducing that problem: a weekly periodic job will run the whitebox
tempest plugin [2],
with nova-compute restarts also covered by a whitebox tempest test.
Nobody has committed to those two actions yet, but we hope to identify
someone soon.
As a side note, gibi mentioned RAPL MSR support [3], notifying us that we
would have to support that in a later release (as the libvirt
implementation is not merged yet)
### nvidia's vGPU vfio-pci variant driver support ###
Long story short, the linux kernel removed a feature in release
5.18 (IOMMU backend support for vfio-mdev), and this impacted the nvidia
driver, which now detects that and creates vfio-pci devices instead of
vfio-mdev devices (as vGPUs). This has a dramatic impact on Nova as we
relied on the vfio-mdev framework for abstracting virtual GPUs. By the next
release, Nova will need to inventory the GPUs by instead looking at SR-IOV
virtual functions which are specific to the nvidia driver (we call them
vfio-pci variant driver resources).
The nova PTG session focused on the required efforts to do so. We agreed on
the fact it will require operators to propose different flavors for vGPU
where they would require distinct resource classes (all but VGPU).
Fortunately, we'll reuse existing device_spec PCI config options [4] where
the operator would define custom resource classes which would match the PCI
addresses of the nvidia-generated virtual functions (don't freak out, we'll
also write documentation). We'll create another device type (something like
type-VF-migratable) for describing such specific nvidia VFs.
Accordingly the generated domain XML will correctly write the device
description (amending the "managed=no" flag for that device).
There will be an upgrade impact: existing instances will need to be resized
to that new flavor (or instances will need to be shelved, updated for
changing the embedded flavor and unshelved).
In order to be on par with existing vGPU features, we'll also need to
implement vfio-pci live-migration by detecting the VF type on the existing
SRIOV live-migration.
Since that effort is quite large, bauzas will set up a subteam of
interested parties to help him implement all of those bits in the
short timeframe of one upstream cycle.
### Graceful shutdowns ###
A common pitfall reported by tobias-urdin is when you want to stop
nova-compute services. In general, before stopping the service, we should
be sure that all RPC calls are done, which means we would no longer accept
RPC calls after asking nova-compute to stop and would just await the
current calls to finish before stopping the service. For that, we need to
create a backlog spec for discussing that design, and we would also need to
modify oslo.service to unsubscribe from the RPC topics. Unfortunately, this
cycle we won't have any contributor working on it, but gibi could try
to at least document this.
### horizon-nova x-p session ###
We mostly discussed the Horizon feature gaps [5]. The first priority would
be for Horizon to use the OpenStack SDK instead of novaclient, and then to
support all of the new Nova API microversions. Unfortunately, we are not
sure that we will have Horizon contributors who could fix those, but if
you're a contributor and you want to help make Horizon better, maybe you
could do this? If so, please ping me.
### Ironic-nova x-p session ###
We didn't really have topics for this x-p session. We just quickly
discussed some points, like graphical console support. Nothing really worth
noting, maybe just that it would be nice if we could have a readonly
graphical console. We were just happy to say that the ironic driver now
works better thanks to some features that were merged in the last cycles.
Kudos to those who did them.
### HPC/AI optimized hypervisor "slices" ###
A large topic to explain; I'll try to keep it short. Basically, how Nova
slices the NUMA affinity between guests is nice but hard for HPC use cases,
where sometimes you need finer control over how to slice the NUMA-dependent
devices depending on the various PCI topologies. Eventually, we agreed on
a POC that johnthetubaguy could work on by trying to implement a
specific virt driver that would do something different from the existing
NUMA affinities.
### Cinder-nova x-p session ###
Multiple topics were discussed there. First, abishop wanted to enhance
cinder's retyping of in-use boot volumes, which means that the Nova
os-attachments API needs to get a new parameter. We said that he needs to
create a new spec, and we agreed that the cinder contributors need to
discuss with QEMU folks to understand the qemu write behaviour.
We also discussed a new nova spec about adding burst length
support to Cinder QoS [6]. We said that both nova and cinder need to
review this spec.
About leftover residues when detaching a volume, we also agreed that
this is not a security flaw and that os-brick should delete them,
not nova (even if nova needs to ask os-brick to look at that, either by a
periodic run or when attaching/detaching). whoami-rajat will provide a spec
for it.
### Python 3.13 support ###
We discussed a specific issue for py3.13: the crypt module is no longer in
the stdlib for py3.13, which impacts nova due to some usage in the
nova.virt.disk.api module for passing an admin password for file injection.
Given file injection is deprecated, we have three possibilities: either
removing admin password file injection (or even file injection as a whole),
adding the new separate crypt package to upper-constraints, or using the
oslo_utils.secretutils module. bauzas (me) will send an email to
openstack-discuss asking operators whether they are OK with deprecating
file injection or just admin password injection, and then we'll see the
direction. bauzas or sean-k-mooney will also try to have py3.13 non-voting
jobs for unit tests/functional tests.
### Eventlet removal steps in Nova ###
I won't explain why we need to remove eventlet, you already know, right?
We rather discussed the details in our nova components, including
nova-api, nova-compute and other nova services. We agreed to remove
direct eventlet imports where possible, move nova entrypoints that don't
use eventlet to separate modules that don't monkeypatch the stdlib, look at
what we can do with all our scatter_gather methods (which asynchronously
call the cells DBs) so they use threads instead, and check whether those
calls block on the DB (and not on the MQ side). Gibi will shepherd that
effort and provide an audit of the eventlet usage in order to avoid any
unexpected but unfortunate late discoveries.
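For the scatter/gather piece, the thread-based shape under discussion is
roughly the standard pattern below (a sketch only, not the actual nova code;
query_cell_db() is a placeholder for the per-cell DB call):
```
# Sketch only: fan per-cell DB queries out to native threads instead of
# eventlet greenthreads, with a timeout so one slow cell cannot block all.
import concurrent.futures

CELL_TIMEOUT = 10  # seconds, illustrative only

def query_cell_db(cell):
    raise NotImplementedError("placeholder for the per-cell DB query")

def scatter_gather(cells):
    results = {}
    with concurrent.futures.ThreadPoolExecutor(
            max_workers=max(len(cells), 1)) as pool:
        futures = {pool.submit(query_cell_db, cell): cell for cell in cells}
        done, not_done = concurrent.futures.wait(futures, timeout=CELL_TIMEOUT)
        for fut in done:
            results[futures[fut]] = fut.result()
        for fut in not_done:
            # a real implementation would also have to deal with the
            # still-running query; here we just record the timeout
            results[futures[fut]] = "did-not-respond"
    return results
```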
### Libvirt image backend refactor ###
If you like spaghetti, you should pay attention to the libvirt image
backend code. Lots of assumptions and conditionals make any change to that
module hard to write and hard to review, leading to error-prone
situations like the ones we had when fixing some recent CVEs.
We all agreed on the quite urgent necessity to refactor that code, and
melwitt proposed a multi-stage effort for that. We agreed on the proposal
for the first two steps with some comments, leading to future revisions of
the proposal's patches. The crucial bit of the refactor is test
coverage.
### IOThreads tuning for libvirt instances ###
An old spec was already proposed for defining iothreads for guests. We
agreed to revive that spec, where a config option would define either no
iothread or one iothread per instance (with the potential for a later option
value of "one iothread per disk"). Depending on whether
emulator_thread_policy
is provided in the flavor/image, we would pin the iothread per that policy
or we would let the iothread float over the shared CPU set. If no shared
CPUs are configured but the operator wants iothreads, nova-compute would
refuse to start. lajoskatona will work on such an implementation, which will
be designed in a blueprint that doesn't require a spec.
### OpenAPI schemas progress ###
Nothing specific to say here, bauzas and gmann will review the series this
cycle.
That's it. I'm gone, I'm dead [7] (a cyclist metaphor) but I eventually
skimmed the very large nova etherpad. Of course, there is a 99% chance that
I wrote some notes incorrectly, so please correct me if I'm wrong; I won't
feel offended, just tired.
Thanks all (and I hope your coffee or tea was good)
-Sylvain
[1] https://geek-and-poke.com/geekandpoke/2013/11/24/simply-explained
[2] https://opendev.org/openstack/whitebox-tempest-plugin
[3] https://www.qemu.org/docs/master/specs/rapl-msr.html
[4]
https://docs.openstack.org/nova/latest/configuration/config.html#pci.device…
[5] https://etherpad.opendev.org/p/horizon-feature-gap#L69
[6] https://review.opendev.org/c/openstack/nova-specs/+/932653
[7] https://www.youtube.com/watch?v=HILcYXf8yqc
1 year, 2 months
Re: [watcher] 2025.2 Flamingo PTG summary
by Sean Mooney
On 17/04/2025 13:17, Dmitriy Rabotyagov wrote:
>> well gnocchi is also not a native OpenStack telemetry datastore, it left
>> our community to pursue its own goals and is now a third party datastore
>> just like Grafana or Prometheus.
> Yeah, well, true. Is still somehow treated as the "default" thing with
> Telemetry, likely due to existing integration with Keystone and
> multi-tenancy support. And beyond it - all other options become
> opinionated too fast - ie, some do OpenTelemetry, some do Zabbix,
> VictoriaMetrics, etc. As pretty much from what I got as well, is that
> still relies on Ceilometer metrics?
> And then Prometheus is obviously not the best storage for them, as it
> requires to have pushgatgeway, and afaik prometheus maintainers are
> strictly against "push" concept to it and treat it as conceptually
> wrong (on contrary to OpenTelemetry).
i don't know the details but i know there is work planned for native
support of a
Prometheus scrape endpoint in ceilometer,
so while you currently need to use sg-core to provide that integration there
is a plan to remove the need for sg-core going forward.
https://etherpad.opendev.org/p/r.72ac6a7268e4b9d854f75715adede80c#L28
i don't see a spec proposed yet but there is an older one from 2 years ago
https://review.opendev.org/c/openstack/telemetry-specs/+/845485/4/specs/zed…
there is also a plan to provide keystone integration and multi-tenancy
https://etherpad.opendev.org/p/r.72ac6a7268e4b9d854f75715adede80c#L84
> So the metric timestamp issue is
> to remain unaddressed.
> So that's why I'd see leaving Gnocchi as "base" implementation might
> be valuable (and very handy for us, as we don't need to implement a
> prometheus job specifically for Watcher).
watcher, aodh, and cloudkitty i believe all have some level of support for
Prometheus but they can also use other backends. i'm not sure what level
of enablement they have in osa.
>
>> but for example watcher can integrate with both ironic an canonical maas
> component
>> to do some level of host power management.
> That sounds really interesting... We do maintain infrastructure using
> MAAS and playing with such integration will be extremely interesting.
> I hope I will be able to get some time for this though...
the current maas integration has 3 problems: 1) a lack of testing, 2) a
lack of documentation,
and 3) it somehow managed to introduce asyncio in a project that uses
eventlet, in
a release of eventlet that did not support asyncio,
so i'm very nervous that it is broken or will break in the future.
this is the entirety of the support:
https://review.opendev.org/c/openstack/watcher/+/898790
there are no docs and no spec...
so this should definitely be considered "experimental" at best today.
>
> Thu, 17 Apr 2025 at 13:52, Sean Mooney <smooney(a)redhat.com>:
>>
>> On 16/04/2025 21:04, Dmitriy Rabotyagov wrote:
>>> Hey,
>>>
>>> Have a comment on one AI from the list.
>>>
>>>> AI: (jgilaber) Mark Monasca and Grafana as deprecated, unless
>>> someone steps up to maintain them, which should include a minimal CI
>>> job running.
>>>
>>> So eventually, on OpenStack-Ansible we were planning to revive the
>>> Watcher role support to the project.
>>> How we usually test deployment, is by spawning an all-in-one
>>> environment with drivers and executing a couple of tempest scenarios
>>> to ensure basic functionality of the service.
>>>
>>> With that, having a native OpenStack telemetry datastore is very
>>> beneficial for such goal, as we already do maintain means for spawning
>>> telemetry stack. While a requirement for Prometheus will be
>>> unfortunate for us at least.
>>>
>>> While I was writing that, I partially realized that testing Watcher on
>>> all-in-one is pretty much impossible as well...
>>>
>> you can certainly test some of watcher with an all-in-one deployment
>>
>> i.e. the apis, and you can use the dummy test strategies.
>>
>> but ya, in general, like nova, you need at least 2 nodes to be able to test
>> it properly, ideally 3
>>
>> so that if you're doing a live migration there is actually a choice of host.
>>
>> in general however watcher, like heat, just asks nova to actually move the vms.
>>
>> sure, it will ask nova to move it to a specific host, but fundamentally if
>> you have
>>
>> tested live migration with nova via tempest separately there is no reason
>> to expect
>>
>> it would not work for live migration triggered by watcher or heat or
>> anything else that
>>
>> just calls nova's api.
>>
>> so you could still get some valuable testing in an all-in-one but ideally
>> there would be at least 2 compute hosts.
>>
>>
>>> But at the very least, I can propose looking into adding an OSA job
>>> with Gnocchi as NV to the project, to show the state of the deployment
>>> with this driver.
>>>
>> well gnocchi is also not a native OpenStack telemetry datastore, it left
>> our community to pursue its own goals and is now a third party datastore
>>
>> just like Grafana or Prometheus.
>>
>> monasca is currently marked as inactive
>> https://review.opendev.org/c/openstack/governance/+/897520 and is in the
>> process of being retired.
>>
>> but it also has no testing on the watcher side, so the combination of the
>> two is why we are deprecating it going forward.
>>
>> if both change, i'm happy to see the support continue.
>>
>> Gnocchi has testing but we are not actively working on extending its
>> functionality going forward.
>>
>> as long as it continues to work i see no reason to change its support
>> status.
>>
>> watcher has quite a lot of untested integrations which is unfortunate
>>
>> we are planning to build out a feature/test/support matrix in the docs
>> this cycle
>>
>> but for example watcher can integrate with both ironic and the canonical
>> maas component
>>
>> to do some level of host power management. none of that is tested and we
>> are likely going
>>
>> to mark them as experimental and reflect on if we can continue to
>> support them or not going forward.
>>
>> it also has the ability to do cinder storage pool balancing which is i
>> think also untested right now.
>>
>> one of the things we hope to do is extend the existing testing in our
>> current jobs to cover gaps like
>>
>> that where it is practical to do so. but creating a devstack plugin to
>> deploy maas with fake infrastructure
>>
>> is likely a lot more than we can do with our existing contributors so
>> expect that to go to experimental then
>>
>> deprecated and finally it will be removed if no one turns up to support it.
>>
>> ironic is in the same boat however there are devstack jobs with fake
>> ironic nodes so i
>>
>> could see a path to us having an ironic job down the line. it's just not
>> high on our current priority
>>
>> list to address the support status or testing of this currently.
>>
>> eventlet removal and other techdebt/community goals are definitely higher
>> but i hope the new support/testing
>>
>> matrix will at least help folks make informed decisions on what features
>> to use and what backends are
>>
>> recommended going forward.
>>
>>> On Wed, 16 Apr 2025, 21:53 Douglas Viroel, <viroel(a)gmail.com> wrote:
>>>
>>> Hello everyone,
>>>
>>> Last week's PTG had very interesting topics. Thank you all that
>>> joined.
>>> The Watcher PTG etherpad with all notes is available here:
>>> https://etherpad.opendev.org/p/apr2025-ptg-watcher
>>> Here is a summary of the discussions that we had, including the
>>> great cross-project sessions with Telemetry, Horizon and Nova team:
>>>
>>> Tech Debt (chandankumar/sean-k-mooney)
>>> =================================
>>> a) Croniter
>>>
>>> * Project is being abandoned as per
>>> https://pypi.org/project/croniter/#disclaimer
>>> * Watcher uses croniter to calculate a new schedule time to run
>>> an audit (continuous). It is also used to validate cron like
>>> syntax
>>> * Agreed: replace croniter with appscheduler's cron methods.
>>> * *AI*: (chandankumar) Fix in master branch and backport to 2025.1
>>>
>>> b) Support status of Watcher Datasources
>>>
>>> * Only Gnocchi and Prometheus have CI job running tempest tests
>>> (with scenario tests)
>>> * Monaska is inactive since 2024.1
>>> * *AI*: (jgilaber) Mark Monasca and Grafana as deprecated,
>>> unless someone steps up to maintain them, which should include
>>> a minimal CI job running.
>>> * *AI*: (dviroel) Document a support matrix between Strategies
>>> and Datasources, which ones are production ready or
>>> experimental, and testing coverage.
>>>
>>> c) Eventlet Removal
>>>
>>> * Team is going to look at how the eventlet is used in Watcher
>>> and start a PoC of its removal.
>>> * Chandan Kumar and dviroel volunteer to help in this effort.
>>> * Planned for 2026.1 cycle.
>>>
>>> Workflow/API Improvements (amoralej)
>>> ==============================
>>> a) Actions states
>>>
>>> * Currently Actions updates from Pending to Succeeded or Failed,
>>> but these do not cover some important scenarios
>>> * If an Action's pre_conditions fails, the action is set to
>>> FAILED, but for some scenarios, it could be just SKIPPED and
>>> continue the workflow.
>>> * Proposal: New SKIPPED state for action. E.g: In a Nova
>>> Migration Action, if the instance doesn't exist in the source
>>> host, it can be skipped instead of fail.
>>> * Proposal: Users could also manually skip specific actions from
>>> an action plan.
>>> * A skip_reason field could also be added to document the reason
>>> behind the skip: user's request, pre-condition check, etc.
>>> * *AI*: (amoralej) Create a spec to describe the proposed changes.
>>>
>>> b) Meaning of SUCCEEDED state in Action Plan
>>>
>>> * Currently means that all actions are triggered, even if all of
>>> them fail, which can be confusing for users.
>>> * Docs mention that SUCCEEDED state means that all actions have
>>> been successfully executed.
>>> * *AI*: (amoralej) Document the current behavior as a bug
>>> (Priority High)
>>> o done: https://bugs.launchpad.net/watcher/+bug/2106407
>>>
>>> Watcher-Dashboard: Priorities to next release (amoralej)
>>> ===========================================
>>> a) Add integration/functional tests
>>>
>>> * Project is missing integration/functional tests and a CI job
>>> running against changes in the repo
>>> * No general conclusion and we will follow up with Horizon team
>>> * *AI*: (chandankumar/rlandy) sync with Horizon team about
>>> testing the plugin with horizon.
>>> * *AI*: (chandankumar/rlandy) devstack job running on new
>>> changes for watcher-dashboard repo.
>>>
>>> b) Add parameters to Audits
>>>
>>> * It is missing on the watcher-dashboard side. Without it, it is
>>> not possible to define some important parameters.
>>> * Should be addressed by a blueprint
>>> * Contributors to this feature: chandankumar
>>>
>>> Watcher cluster model collector improvement ideas (dviroel)
>>> =============================================
>>>
>>> * Brainstorm ideas to improve watcher collector process, since
>>> we still see a lot of issues due to outdated models when
>>> running audits
>>> * Both scheduled model update and event-based updates are
>>> enabled in CI today
>>> * It is unknown the current state of event-based updates from
>>> Nova notification. Code needs to be reviewed and
>>> improvements/fixes can be proposed
>>> o e.g:
>>> https://bugs.launchpad.net/watcher/+bug/2104220/comments/3
>>> - We need to check if we are processing the right
>>> notifications of if is a bug on Nova
>>> * Proposal: Refresh the model before running an audit. A rate
>>> limit should be considered to avoid too many refreshments.
>>> * *AI*: (dviroel) new spec for cluster model refresh, based on
>>> audit trigger
>>> * *AI:* (dviroel) investigate the processing of nova events in
>>> Watcher
>>>
>>> Watcher and Nova's visible constraints (dviroel)
>>> ====================================
>>>
>>> * Currently, Watcher can propose solutions that include server
>>> migrations that violate some Nova constraints like:
>>> scheduler_hints, server_groups, pinned_az, etc.
>>> * In Epoxy release, Nova's API was improved to also show
>>> scheduler_hints and image_properties, allowing external
>>> services, like watcher, to query and use this information when
>>> calculating new solutions.
>>> o https://docs.openstack.org/releasenotes/nova/2025.1.html#new-features
>>> * Proposal: Extend compute instance model to include new
>>> properties, which can be retrieved via novaclient. Update
>>> strategies to filter invalid migration destinations based on
>>> these new properties.
>>> * *AI*: (dviroel) Propose a spec to better document the
>>> proposal. No API changes are expected here.
>>>
>>> Replacement for noisy neighbor policy (jgilaber)
>>> ====================================
>>>
>>> * The existing noisy neighbor strategy is based on L3 Cache
>>> metrics, which is not available anymore, since the support for
>>> it was dropped from the kernel and from Nova.
>>> * In order to keep this strategy, new metrics need to be
>>> considered: cpu_steal? io_wait? cache_misses?
>>> * *AI*: (jgilaber) Mark the strategy as deprecated during this cycle
>>> * *AI*: (TBD) Identify new metrics to be used
>>> * *AI*: (TBD) Work on a replacement for the current strategy
>>>
>>>
>>> Host Maintenance strategy new use case (jeno8)
>>> =====================================
>>>
>>> * New use case for Host Maintenance strategy: instance with
>>> ephemeral disks should not be migrated at all.
>>> * Spec proposed:
>>> https://review.opendev.org/c/openstack/watcher-specs/+/943873
>>> o New action to stop instances when both live/cold migration
>>> are disabled by the user
>>> * *AI*: (All) Review the spec and continue with discussion there.
>>>
>>> Missing Contributor Docs (sean-k-mooney)
>>> ================================
>>>
>>>     * Doc missing: scope of the project, e.g.:
>>>       https://docs.openstack.org/nova/latest/contributor/project-scope.html
>>>     * *AI*: (rlandy) Create a project-scope doc for Watcher
>>>     * Doc missing: PTL guide, e.g.:
>>>       https://docs.openstack.org/nova/latest/contributor/ptl-guide.html
>>>     * *AI*: (TBD) Create a PTL guide for the Watcher project
>>>     * Document: when to create a spec vs. blueprint vs. bug
>>>     * *AI*: (TBD) Create a doc section describing the process based
>>>       on what is being modified in the code.
>>>
>>> Retrospective
>>> ==========
>>>
>>> * The DPL approach seems to be working for Watcher
>>> * New core members added: sean-k-mooney, dviroel, marios and
>>> chandankumar
>>> o We plan to add more cores in the next cycle, based on
>>> reviews and engagement.
>>>         o We plan to remove members who have not been active in the
>>>           last two cycles (starting with 2026.1)
>>> * A new datasource was added: Prometheus
>>> * Prometheus job now also runs scenario tests, along with Gnocchi.
>>>     * We triaged all old bugs from Launchpad
>>> * Needs improvement:
>>>         o the current team is still learning the details of the
>>>           code; much of the historical knowledge was lost with the
>>>           previous maintainers
>>>         o the core team still needs to grow
>>>         o we need to focus on creating stable releases
>>>
>>>
>>> Cross-project session with Horizon team
>>> ===============================
>>>
>>>     * Combined session with the Telemetry and Horizon teams, focused
>>>       on how to provide tenant and admin dashboards to visualize
>>>       metrics.
>>>     * The Watcher team presented some ideas for new panels for both
>>>       admins and tenants, and sean-k-mooney raised a discussion
>>>       about frameworks that could be used to implement them
>>> * Use-cases that were discussed:
>>>         o a) Admins would benefit from a visualization of the
>>>           infrastructure utilization (real usage metrics), so they
>>>           can identify bottlenecks and plan optimizations
>>>         o b) A tenant would like to view their workload performance,
>>>           checking the real CPU/RAM/disk usage of instances, to
>>>           properly adjust their resource allocation.
>>>         o c) An admin user of the Watcher service would like to
>>>           visualize metrics generated by Watcher strategies, such as
>>>           the standard deviation of host metrics.
>>>     * sean-k-mooney presented an initial PoC of what a Hypervisor
>>>       Metrics dashboard could look like.
>>> * Proposal for next steps:
>>>         o start a new horizon plugin as an official deliverable of
>>>           the telemetry project
>>>         o still unclear which framework to use for building charts
>>>         o the dashboard will integrate with Prometheus as the metric
>>>           store
>>>         o it is expected that only short-term metrics will be
>>>           supported (7 days)
>>> o python-observability-client will be used to query Prometheus
>>>
>>>
>>> Cross-project session with Nova team
>>> =============================
>>>
>>>     * sean-k-mooney led topics on how to evolve Nova to better
>>>       assist other services, like Watcher, in taking actions on
>>>       instances. The team agreed on a proposal to use the existing
>>>       metadata API to annotate an instance's supported lifecycle
>>>       operations. This information is very useful for improving
>>>       Watcher's strategy algorithms. Some examples of such instance
>>>       metadata could be (see the sketch after the links below):
>>> o lifecycle:cold-migratable=true|false
>>> o ha:maintenance-strategy:in_place|power_off|migrate
>>>     * It was discussed that Nova could infer which operations are
>>>       valid or not, based on information like the virt driver,
>>>       flavor, image properties, etc. This feature was initially
>>>       named 'instance capabilities' and will require a spec for
>>>       further discussion.
>>>     * Another topic of interest, also raised by Sean, was adding new
>>>       standard traits to resource providers, like PRESSURE_CPU and
>>>       PRESSURE_DISK. These traits can be used to weight hosts when
>>>       placing new VMs. Either Watcher or the libvirt driver could
>>>       annotate them, but the team generally agreed that the libvirt
>>>       driver is preferred here.
>>> * More info at Nova PTG etherpad [0] and sean's summary blog [1]
>>>
>>> [0] https://etherpad.opendev.org/p/r.bf5f1185e201e31ed8c3adeb45e3cf6d
>>> [1] https://www.seanmooney.info/blog/2025.2-ptg/#watcher-topics
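>>>
>>> A minimal sketch of how the lifecycle annotations above could be
>>> set and read through the existing server metadata API with
>>> python-novaclient; the keys come from the examples in this session,
>>> but nothing here is a settled design:
>>>
>>> ```
>>> def annotate_lifecycle(nova, server_id):
>>>     # Keys/values taken from the PTG examples above; purely illustrative.
>>>     nova.servers.set_meta(server_id, {
>>>         "lifecycle:cold-migratable": "false",
>>>         "ha:maintenance-strategy": "power_off",
>>>     })
>>>
>>>
>>> def is_cold_migratable(nova, server_id):
>>>     server = nova.servers.get(server_id)
>>>     # Metadata values are strings; assume migratable unless annotated.
>>>     return server.metadata.get("lifecycle:cold-migratable", "true") == "true"
>>> ```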
>>>
>>>
>>> Please let me know if I missed something.
>>> Thanks!
>>>
>>> --
>>> Douglas Viroel - dviroel
>>>