openstack-discuss search results for query "#eventlet-removal"
openstack-discuss@lists.openstack.org - 186 messages
Re: [nova] multiple pci types with same address
by Sean Mooney
On 17/06/2025 09:30, Arnaud Morin wrote:
> Hello nova team,
>
> Quick question regarding support for multiple types (see [1]) with the same address:
> {vendor_id: "10de", product_id: "233b", "address": "00000000:03:00.0", "resource_class": "CUSTOM_H200_A"}
> {vendor_id: "10de", product_id: "233b", "address": "00000000:04:00.0", "resource_class": "CUSTOM_H200_A"}
> {vendor_id: "10de", product_id: "233b", "address": "00000000:44:00.0", "resource_class": "CUSTOM_H200_A"}
> {vendor_id: "10de", product_id: "233b", "address": "00000000:45:00.0", "resource_class": "CUSTOM_H200_A"}
> {vendor_id: "10de", product_id: "233b", "address": "00000000:83:00.0", "resource_class": "CUSTOM_H200_B"}
> {vendor_id: "10de", product_id: "233b", "address": "00000000:84:00.0", "resource_class": "CUSTOM_H200_B"}
> {vendor_id: "10de", product_id: "233b", "address": "00000000:C3:00.0", "resource_class": "CUSTOM_H200_B"}
> {vendor_id: "10de", product_id: "233b", "address": "00000000:C4:00.0", "resource_class": "CUSTOM_H200_B"}
>
> This works fine, I was able to define multiple aliases:
> {name: "h200a", device_type:"type-PF", resource_class: "CUSTOM_H200_A"}
> {name: "h200b", device_type:"type-PF", resource_class: "CUSTOM_H200_B"}
>
> I did that to create two blocks of 4 resources (A or B).
>
> But now, I need to create a flavor to boot instances with these devices.
> I want to have only one flavor that can use either A or B:
> For now I created a flavor with:
> pci_passthrough:alias='h200a:4'
>
> And was forced to create a second flavor with h200b:4.
>
> Is there any way to achieve a single flavor with both:
> Something like this?
> pci_passthrough:alias='h200a:4|h200b:4'
>
> I can't figure that out for now, is it possible?
No, it's not, and for reasons related to how placement works it is also not
easy to implement in the future:
we would need a new one-of type query in the allocation candidates API.
The feature you really want is
https://specs.openstack.org/openstack/nova-specs/specs/2024.1/approved/pci-…
It was proposed but never implemented.
With that spec you can carve up PCI devices into groups in the PCI device_spec
and then have a single resource class to select any one of the groups.
So in your case you would express
```
{vendor_id: "10de", product_id: "233b", "address": "00000000:03:00.0", "resource_class": "CUSTOM_H200_A"}
{vendor_id: "10de", product_id: "233b", "address": "00000000:04:00.0", "resource_class": "CUSTOM_H200_A"}
{vendor_id: "10de", product_id: "233b", "address": "00000000:44:00.0", "resource_class": "CUSTOM_H200_A"}
{vendor_id: "10de", product_id: "233b", "address": "00000000:45:00.0", "resource_class": "CUSTOM_H200_A"}
{vendor_id: "10de", product_id: "233b", "address": "00000000:83:00.0", "resource_class": "CUSTOM_H200_B"}
{vendor_id: "10de", product_id: "233b", "address": "00000000:84:00.0", "resource_class": "CUSTOM_H200_B"}
{vendor_id: "10de", product_id: "233b", "address": "00000000:C3:00.0", "resource_class": "CUSTOM_H200_B"}
{vendor_id: "10de", product_id: "233b", "address": "00000000:C4:00.0", "resource_class": "CUSTOM_H200_B"}
```
as
```
{vendor_id: "10de", product_id: "233b", "address": "00000000:03:00.0", "group_name": "A", "group_type": "H200"}
{vendor_id: "10de", product_id: "233b", "address": "00000000:04:00.0", "group_name": "A", "group_type": "H200"}
{vendor_id: "10de", product_id: "233b", "address": "00000000:44:00.0", "group_name": "A", "group_type": "H200"}
{vendor_id: "10de", product_id: "233b", "address": "00000000:45:00.0", "group_name": "A", "group_type": "H200"}
{vendor_id: "10de", product_id: "233b", "address": "00000000:83:00.0", "group_name": "B", "group_type": "H200"}
{vendor_id: "10de", product_id: "233b", "address": "00000000:84:00.0", "group_name": "B", "group_type": "H200"}
{vendor_id: "10de", product_id: "233b", "address": "00000000:C3:00.0", "group_name": "B", "group_type": "H200"}
{vendor_id: "10de", product_id: "233b", "address": "00000000:C4:00.0", "group_name": "B", "group_type": "H200"}
alias = {"name":"h200", resource_class:"CUSTOM_PCI_GROUP_H200"}
```
And in the flavor you would request pci_passthrough:alias='h200:1'; PCI groups allocate a full group to the request.
It's a very useful proposal that we have discussed on and off for the better part of a decade, and we finally got as
far as writing it down for 2024.1, but then the implementation never got started.
I know some Red Hat customers have also asked about this type of grouping functionality, so at least internally this
comes up from time to time. I strongly suspect this will eventually get implemented, but there have been higher priorities
for the Nova team, like eventlet removal or GPU live migration. If someone proposes it again, I know at least John Garbutt
was keen to see this added for some HPC use cases, and I suspect that with the current AI boom, passing blocks of H200 GPUs
to a workload is becoming more common, not less.
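In the meantime, the two-flavor approach you already have is the practical workaround. Here is a minimal sketch of it
with the openstacksdk, purely for illustration: the flavor names and sizes are made up, and the extra-spec helper
(create_flavor_extra_specs) should be checked against the SDK version you use.
```python
import openstack

# Credentials come from clouds.yaml; the cloud name here is an assumption.
conn = openstack.connect(cloud="mycloud")

# One flavor per GPU block, since a single alias request cannot yet express
# "either group A or group B".
for block, alias in (("a", "h200a"), ("b", "h200b")):
    flavor = conn.compute.create_flavor(
        name=f"h200.block-{block}", ram=65536, vcpus=16, disk=100)
    # Request all four devices of the matching resource class.
    conn.compute.create_flavor_extra_specs(
        flavor, {"pci_passthrough:alias": f"{alias}:4"})
```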
>
> [1] https://docs.openstack.org/nova/latest/admin/pci-passthrough.html#support-f…
>
6 months, 2 weeks
Re: [tc][all][security] Supporting Post-Quantum Cryptography in OpenStack code (all projects)
by Julia Kreger
Overall, I agree with a lot of the sentiment and the statements thus
far. One key aspect I think we need to ensure is that we don't silo
discussions to a specific topic or for example ssh keys. In other
words, ultimately the *complete* scope of work remains undefined, and
projects *should* identify potential areas of work that they can see,
while the overall larger ecosystem moves forward and the overall exact
needs are clarified further. The wider process should be to revise a
vision and guidelines as time moves forward.
For example, it could boil down to a very simple question for projects:
Does your project do *anything* in relation to keys, encryption, or
interaction of any encrypted data either at rest or while in transit?
If yes, are there actions to take?
And then a dialogue needs to occur, or at least a framework for
understanding whether there *really* needs to be an investment of time or
resources in that specific instance of usage, or whether it is just a
matter of permitting a longer value. Ultimately that is a case-by-case
assessment which needs to be performed.
A good first step for any working group would be to create a chart
which could be used as a reference for those individual discussions.
-Julia
On Mon, Oct 27, 2025 at 5:35 AM Sean Mooney <smooney(a)redhat.com> wrote:
>
> My personal take on this is that it may be a valid future community goal, but
> it's probably premature.
>
> In general, most OpenStack projects try not to be in the business of
> cryptography if we can at all avoid it.
>
> Our dependencies may have crypto features like SSH or SSL for our REST APIs
> or similar, but the OpenStack code base
> in general tries not to implement any crypto logic itself, i.e. we try to
> delegate to python-cryptography or similar well-maintained
> modules.
>
> For example, just looking at the Nova section, you mentioned that Nova can
> generate SSH keys:
>
> https://wiki.openstack.org/wiki/Post_quantum_openstack#Nova_.28Compute.29
>
> However, we deprecated and removed that capability in the Zed release in
> microversion 2.92:
>
> https://docs.openstack.org/nova/latest/reference/api-microversion-history.h…
>
> We did that specifically because we did not want to support SSH key
> generation in Nova going forward and defined that to be out of scope for
> our project,
>
> so that is a non-issue in a PQC world, because we have decided as a
> project not to extend or support that API going forward.
>
> The underlying code has not been removed, as we don't do that in Nova when it's
> reasonable to keep the code, but it should never be used anymore.
>
> We only support uploading a pre-generated public key now.
>
>
> The other two cases might be valid:
>
> ```
>
> Supports validation of Glance image signatures and certificate trust
> when booting signed images. (link)
>
> Metadata path protection with Neutron uses HMAC over Instance-ID to
> prevent spoofing (shared secret). (link)
>
> ```
>
> Although for the metadata case, that shared secret is intended to be
> passed over an HTTPS connection, so if the SSL encryption for
> the connection supports post-quantum encryption then the HMAC does not
> really need to, but we can likely change that algorithm when
> python-cryptography supports something
> in the future that is a suitable replacement.
>
> The Glance image verification would need Glance to support something else
> instead, but if they come up with an updated approach Nova could adapt.
>
> None of the above seems particularly urgent, and it likely doesn't need to be
> addressed in 2026.1 or even 2027.1, but if someone wanted to write a spec for
> the Nova/Glance changes and wanted to work on it, it could be reviewed
> via the normal upstream process without needing to make it a community
> goal or have
> it driven by the TC. You could, for example, create a popup team or SIG,
> kind of like the eventlet removal work, to drive this instead.
>
> Speaking of which, I think the eventlet removal work is going to take
> precedence for most teams, as that is more urgent in terms
> of real-world impact.
>
> regards
>
> sean.
>
> On 27/10/2025 05:14, Goutham Pacha Ravi wrote:
> > On Fri, Oct 24, 2025 at 1:19 PM Jean-Philippe Jung <jjung(a)redhat.com> wrote:
> >> Hi,
> >>
> >> I am seeking help from the TC to raise the urgency of this work across all OpenStack projects and to help me lead an effort to reduce the number of cryptographic modules used in OpenStack (my personal opinion is that there should be no more than five).
> >>
> >> Doing this may involve work in each OpenStack Project team; and I can help organize this effort. I'm seeking the following from the TC and/or project teams:
> >> Portions of this work will be isolated to specific repositories managed by a project team, while others will involve "cross-project" synchronization.
> >> What vehicles can we use to have a "call-to-action" for project teams to get someone to look into their specific projects? How can we go about community wide collaboration?
> >>
> >> I've created a document [1] that I assembled from AI analysis of part of the OpenStack code. It gives an overall view of the problem we face.
> >>
> > Thank you for starting this discussion, JP. I've added a topic to the
> > TC's PTG for 1600 UTC on Friday, 31st Oct 2025. I hope you'll be able
> > to share your findings there briefly and invite opinions in-sync. Our
> > vehicle for driving cross project work has been via the Community
> > Goals framework: https://governance.openstack.org/tc/goals/
> > If we have one or more objectives, this can be proposed as one, and
> > will require a "goal champion" - someone that'll help us gather
> > requirements, and coordinate efforts to complete the goal. Some goals
> > in the past have spawned new groups - either as Pop Up Teams or SIGs
> > (https://governance.openstack.org/tc/reference/comparison-of-official-group-…)
> >
>
2 months
[watcher] 2025.2 Flamingo PTG summary
by Douglas Viroel
Hello everyone,
Last week's PTG had very interesting topics. Thank you all that joined.
The Watcher PTG etherpad with all notes is available here:
https://etherpad.opendev.org/p/apr2025-ptg-watcher
Here is a summary of the discussions that we had, including the great
cross-project sessions with Telemetry, Horizon and Nova team:
Tech Debt (chandankumar/sean-k-mooney)
=================================
a) Croniter
- Project is being abandoned as per
https://pypi.org/project/croniter/#disclaimer
- Watcher uses croniter to calculate a new schedule time to run an audit
(continuous). It is also used to validate cron-like syntax
- Agreed: replace croniter with apscheduler's cron methods (see the sketch below).
- *AI*: (chandankumar) Fix in master branch and backport to 2025.1
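As a rough sketch of the agreed direction (illustrative only; how this is
wired into Watcher's continuous audit handling is left to the actual patch),
APScheduler's CronTrigger covers both the validation and the next-run-time
calculation currently done with croniter:
```python
from datetime import datetime

from apscheduler.triggers.cron import CronTrigger


def next_audit_run(cron_expression, last_run=None):
    # from_crontab() raises ValueError on invalid syntax, which covers the
    # validation use case; get_next_fire_time() returns when the continuous
    # audit should run next.
    trigger = CronTrigger.from_crontab(cron_expression)
    return trigger.get_next_fire_time(last_run, datetime.now().astimezone())


# Example: an audit scheduled every 30 minutes.
print(next_audit_run("*/30 * * * *"))
```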
b) Support status of Watcher Datasources
- Only Gnocchi and Prometheus have a CI job running tempest tests (with
scenario tests)
- Monasca has been inactive since 2024.1
- *AI*: (jgilaber) Mark Monasca and Grafana as deprecated, unless
someone steps up to maintain them, which should include a minimal CI job
running.
- *AI*: (dviroel) Document a support matrix between Strategies and
Datasources, which ones are production ready or experimental, and testing
coverage.
c) Eventlet Removal
- Team is going to look at how eventlet is used in Watcher and start
a PoC of its removal.
- Chandan Kumar and dviroel volunteer to help in this effort.
- Planned for 2026.1 cycle.
Workflow/API Improvements (amoralej)
==============================
a) Actions states
- Currently, Actions update from Pending to Succeeded or Failed, but
these states do not cover some important scenarios
- If an Action's pre_conditions fails, the action is set to FAILED, but
for some scenarios, it could be just SKIPPED and continue the workflow.
- Proposal: New SKIPPED state for action. E.g.: in a Nova Migration
Action, if the instance doesn't exist on the source host, it can be skipped
instead of failing (see the sketch below).
- Proposal: Users could also manually skip specific actions from an
action plan.
- A skip_reason field could also be added to document the reason behind
the skip: user's request, pre-condition check, etc.
- *AI*: (amoralej) Create a spec to describe the proposed changes.
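A minimal sketch of the proposed semantics (all names are hypothetical; the
real design will come from the spec):
```python
import enum


class ActionState(enum.Enum):
    PENDING = "PENDING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"
    SKIPPED = "SKIPPED"  # proposed new state


def run_action(action):
    # Hypothetical workflow step: a failed pre-condition no longer has to
    # fail the action; it can be marked SKIPPED with a reason and the
    # action plan continues.
    if not action.pre_conditions_ok():
        action.state = ActionState.SKIPPED
        action.skip_reason = "pre-condition check failed"
        return
    try:
        action.execute()
        action.state = ActionState.SUCCEEDED
    except Exception as exc:
        action.state = ActionState.FAILED
        action.fail_reason = str(exc)
```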
b) Meaning of SUCCEEDED state in Action Plan
- Currently means that all actions are triggered, even if all of them
fail, which can be confusing for users.
- Docs mention that SUCCEEDED state means that all actions have been
successfully executed.
- *AI*: (amoralej) Document the current behavior as a bug (Priority High)
- done: https://bugs.launchpad.net/watcher/+bug/2106407
Watcher-Dashboard: Priorities to next release (amoralej)
===========================================
a) Add integration/functional tests
- Project is missing integration/functional tests and a CI job running
against changes in the repo
- No general conclusion and we will follow up with Horizon team
- *AI*: (chandankumar/rlandy) sync with Horizon team about testing the
plugin with horizon.
- *AI*: (chandankumar/rlandy) devstack job running on new changes for
watcher-dashboard repo.
b) Add parameters to Audits
- It is missing on the watcher-dashboard side. Without it, it is not
possible to define some important parameters.
- Should be addressed by a blueprint
- Contributors to this feature: chandankumar
Watcher cluster model collector improvement ideas (dviroel)
=============================================
- Brainstorm ideas to improve the Watcher collector process, since we still
see a lot of issues due to outdated models when running audits
- Both scheduled model update and event-based updates are enabled in CI
today
- The current state of event-based updates from Nova notifications is
unknown. Code needs to be reviewed and improvements/fixes can be
proposed
- e.g.: https://bugs.launchpad.net/watcher/+bug/2104220/comments/3 -
We need to check if we are processing the right notifications or if it is a
bug in Nova
- Proposal: Refresh the model before running an audit. A rate limit
should be considered to avoid too many refreshes (see the sketch below).
- *AI*: (dviroel) new spec for cluster model refresh, based on audit
trigger
- *AI:* (dviroel) investigate the processing of nova events in Watcher
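A sketch of the rate-limited refresh idea (hypothetical helper names, not
Watcher's actual collector API):
```python
import time


class RateLimitedModelRefresher:
    """Refresh the cluster data model before an audit, at most once per window."""

    def __init__(self, collector, min_interval_seconds=300):
        self._collector = collector
        self._min_interval = min_interval_seconds
        self._last_refresh = 0.0

    def refresh_if_stale(self):
        now = time.monotonic()
        if now - self._last_refresh >= self._min_interval:
            # synchronize() stands in for whatever the collector exposes to
            # rebuild the model from Nova.
            self._collector.synchronize()
            self._last_refresh = now


# Usage (hypothetical): call refresher.refresh_if_stale() right before an
# audit is executed.
```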
Watcher and Nova's visible constraints (dviroel)
====================================
- Currently, Watcher can propose solutions that include server
migrations that violate some Nova constraints like: scheduler_hints,
server_groups, pinned_az, etc.
- In the Epoxy release, Nova's API was improved to also show scheduler_hints
and image_properties, allowing external services, like Watcher, to query
and use this information when calculating new solutions.
-
https://docs.openstack.org/releasenotes/nova/2025.1.html#new-features
- Proposal: Extend compute instance model to include new properties,
which can be retrieved via novaclient. Update strategies to filter invalid
migration destinations based on these new properties.
- *AI*: (dviroel) Propose a spec to better document the proposal. No API
changes are expected here.
Replacement for noisy neighbor policy (jgilaber)
====================================
- The existing noisy neighbor strategy is based on L3 cache metrics,
which are not available anymore, since support for them was dropped from
the kernel and from Nova.
- In order to keep this strategy, new metrics need to be considered:
cpu_steal? io_wait? cache_misses?
- *AI*: (jgilaber) Mark the strategy as deprecated during this cycle
- *AI*: (TBD) Identify new metrics to be used
- *AI*: (TBD) Work on a replacement for the current strategy
Host Maintenance strategy new use case (jeno8)
=====================================
- New use case for Host Maintenance strategy: instances with ephemeral
disks should not be migrated at all.
- Spec proposed:
https://review.opendev.org/c/openstack/watcher-specs/+/943873
- New action to stop instances when both live/cold migration are
disabled by the user
- *AI*: (All) Review the spec and continue with discussion there.
Missing Contributor Docs (sean-k-mooney)
================================
- Doc missing: Scope of the project, e.g:
https://docs.openstack.org/nova/latest/contributor/project-scope.html
- *AI*: (rlandy) Create a scope of the project doc for Watcher
- Doc missing: PTL Guide, e.g:
https://docs.openstack.org/nova/latest/contributor/ptl-guide.html
- *AI*: (TBD) Create a PTL Guide for Watcher project
- Document: When to create a spec vs blueprint vs bug
- *AI*: (TBD) Create a doc section to describe the process based on what
is being modified in the code.
Retrospective
==========
- The DPL approach seems to be working for Watcher
- New core members added: sean-k-mooney, dviroel, marios and chandankumar
- We plan to add more cores in the next cycle, based on reviews and
engagement.
- We plan to remove members who have not been active in the last 2 cycles
(starting at 2026.1)
- A new datasource was added: Prometheus
- Prometheus job now also runs scenario tests, along with Gnocchi.
- We triaged all old bugs from launchpad
- Needs improvement:
- current team is still learning about details in the code, much of
the historical knowledge was lost with the previous maintainers
- core team still needs to grow
- we need to focus on creating stable releases
Cross-project session with Horizon team
===============================
- Combined session with Telemetry and Horizon team, focused on how to
provide a tenant and an admin dashboard to visualize metrics.
- Watcher team presented some ideas of new panels for both admin and
tenants, and sean-k-mooney raised a discussion about frameworks that can be
used to implement them
- Use-cases that were discussed:
- a) Admin would benefit from a visualization of the infrastructure
utilization (real usage metrics), so they can identify
bottlenecks and plan
optimization
- b) A tenant would like to view their workload performance, checking
real usage of cpu/ram/disk of instances, to proper adjust their resources
allocation.
- c) An admin user of watcher service would like to visualize metrics
generated by watcher strategies like standard deviation of host metrics.
- sean-k-mooney presented an initial PoC of what a Hypervisor Metrics
dashboard could look like.
- Proposal for next steps:
- start a new horizon plugin as an official deliverable of telemetry
project
- still unclear which framework to use for building charts
- dashboard will integrate with Prometheus, as metric store
- it is expected that only short term metrics will be supported (7
days)
- python-observability-client will be used to query Prometheus
Cross-project session with Nova team
=============================
- sean-k-mooney led topics on how to evolve Nova to better assist other
services, like Watcher, to take actions on instances. The team agreed on a
proposal of using the existing metadata API to annotate an instance's
supported lifecycle operations. This information is very useful to improve
Watcher's strategy algorithms. Some examples of instance metadata could
be:
- lifecycle:cold-migratable=true|false
- ha:maintenance-strategy:in_place|power_off|migrate
- It was discussed that Nova could infer which operations are valid or
not, based on information like: virt driver, flavor, image properties, etc.
This feature was initially named 'instance capabilities' and will require a
spec for further discussions.
- Another topic of interest, also raised by Sean, was about adding new
standard traits to resource providers, like PRESSURE_CPU and PRESSURE_DISK.
These traits can be used to weight hosts when placing new VMs. Watcher and
the libvirt driver could work on annotating them, but the team generally
agreed that the libvirt driver is preferred here.
- More info at Nova PTG etherpad [0] and sean's summary blog [1]
[0] https://etherpad.opendev.org/p/r.bf5f1185e201e31ed8c3adeb45e3cf6d
[1] https://www.seanmooney.info/blog/2025.2-ptg/#watcher-topics
Please let me know if I missed something.
Thanks!
--
Douglas Viroel - dviroel
8 months, 2 weeks
[tc][all] OpenStack Technical Committee Weekly Summary and Meeting Agenda (2025.1/R-11)
by Goutham Pacha Ravi
Hello Stackers,
Time flies! We're 11 weeks away from the 2025.1 "Epoxy" release day
[1]. OpenStack project teams must be working on their deliverables
according to the schedule shared by Előd Illés (elodilles) from the
release team [2] this week.
In the past week, the OpenStack Technical Committee (TC) worked with
election officials to adjust the dates of the elections preceding the
2025.2 ("Flamingo" [3]) release cycle. This change was made in
response to a revision in the TC's charter that now allows more time
for polling. Election officials will share the updated schedule on
this list soon. Additionally, the TC approved the retirement of
openstack-ansible roles for former projects: Murano (Application
Catalog), Senlin (Clustering Service), and Sahara (Data Processing).
The OpenInfra Board's Individual Member Director elections are
currently underway [4]. If you are an OpenInfra Foundation member,
please check your email for a ballot and participate. The election
will conclude on Friday, 2025-01-17.
=== Weekly Meeting ===
The OpenStack TC resumed its regular weekly meeting schedule last week
following a brief hiatus due to the year-end holidays. Last week's
meeting was held on 2025-01-07 at 1800 UTC, simultaneously on Zoom and
IRC. Please find the meeting minutes on eavesdrop [5] and a recording
on YouTube [6].
The meeting began with a discussion about the pending proposal [7] to
delete the "unmaintained/victoria" branch on OpenStack git
repositories. There is interest from at least one contributing
organization in keeping this branch open for specific repositories.
The deadline to merge this change is 2025-01-31. We aim to identify
the specific repositories and the responsible maintainers by then.
I'll keep you updated in future emails.
The TC then discussed initiating the election cycle for the 2025.2
release. We recognize that long holidays (such as the upcoming Chinese
New Year) could impact nominations from interested candidates. We hope
to encourage nominations as early as possible since even a couple of
extra weeks could make a difference in coordinating for such a
geographically diverse community.
The TC also reviewed the status of the community goal to migrate test
jobs to Python 3.12 (and "Ubuntu 24.04 / Noble Numbat" where
applicable). Ghanshyam Mann (gmann), the goal champion, shared that
test jobs in three projects (Heat, Skyline, and
devstack-plugin-container) need attention from their respective
project teams [8].
We also expressed our gratitude to the OpenDev Infrastructure team for
keeping the systems running smoothly during the holidays.
The next OpenStack Technical Committee meeting is today, 2025-01-14,
at 1800 UTC. This meeting will be held over IRC in OFTC's
#openstack-tc channel. Please find the agenda on the meeting wiki [9].
I hope you can join us. Remember, any community member can propose
meeting topics—just mention your IRC nick so the meeting chair can
call upon you.
=== Governance Proposals ===
==== Merged ====
- Allow more than 2 weeks for elections |
https://review.opendev.org/c/openstack/governance/+/937741
- Put whitebox-tempest-plugin under release management |
https://review.opendev.org/c/openstack/governance/+/938401
- Retire Murano/Senlin/Sahara OpenStack-Ansible roles |
https://review.opendev.org/c/openstack/governance/+/935677
==== Open for Review ====
- Rework the eventlet-removal goal proposal |
https://review.opendev.org/c/openstack/governance/+/931254
- Add ansible-role-httpd repo to OSA-owned projects |
https://review.opendev.org/c/openstack/governance/+/935694
- Retire Freezer DR | https://review.opendev.org/c/openstack/governance/+/938183
- Retire qdrouterd role |
https://review.opendev.org/c/openstack/governance/+/938193
- Remove Freezer from inactive state |
https://review.opendev.org/c/openstack/governance/+/938938
- Propose to select the eventlet-removal community goal |
https://review.opendev.org/c/openstack/governance/+/934936
- Resolve to adhere to non-biased language |
https://review.opendev.org/c/openstack/governance/+/934907
=== How to Contact the TC ===
You can reach the TC in several ways:
- Email: Send an email with the tag [tc] on this mailing list.
- Ping us using the 'tc-members' keyword on the #openstack-tc IRC
channel on OFTC.
- Join us at our weekly meeting: The Technical Committee meets every
week on Tuesdays at 1800 UTC [9].
=== Upcoming Events ===
- 2025-01-17: 2025 OpenInfra Board Individual Member Director Elections conclude
- 2025-02-01: FOSDEM 2025 (https://fosdem.org/2025/) OpenStack's 15th
Birthday Celebration
- 2025-02-28: 2025.1 ("Epoxy") Feature Freeze and release milestone 3 [1]
- 2025-03-06: SCALE 2025 + OpenInfra Days NA
(https://www.socallinuxexpo.org/scale/22x)
Thank you very much for reading!
On behalf of the OpenStack TC,
Goutham Pacha Ravi (gouthamr)
OpenStack TC Chair
[1] 2025.1 "Epoxy" Release Schedule:
https://releases.openstack.org/epoxy/schedule.html
[2] Release countdown for week R-11:
https://lists.openstack.org/archives/list/openstack-discuss@lists.openstack…
[3] OpenStack 2025.2 'F' Release Naming Poll:
https://lists.openstack.org/archives/list/openstack-discuss@lists.openstack…
[4] OpenInfra Foundation Board Elections:
https://openinfra.dev/election/2025-individual-director-election
[5] TC Meeting IRC Log 2025-01-07:
https://meetings.opendev.org/meetings/tc/2025/tc.2025-01-07-18.00.log.html
[6] TC Meeting Video Recording, 2025-01-07: https://youtu.be/-Nxul8_ykto
[7] Transition unmaintained/victoria to EOL:
https://review.opendev.org/c/openstack/releases/+/937515
[8] Projects failing the "migrate-to-noble" goal:
https://etherpad.opendev.org/p/migrate-to-noble#L172
[9] TC Meeting Agenda, 2025-01-14:
https://wiki.openstack.org/wiki/Meetings/TechnicalCommittee#Next_Meeting
11 months, 2 weeks
Re: [watcher] 2025.2 Flamingo PTG summary
by Sean Mooney
On 16/04/2025 21:04, Dmitriy Rabotyagov wrote:
>
> Hey,
>
> Have a comment on one AI from the list.
>
> > AI: (jgilaber) Mark Monasca and Grafana as deprecated, unless
> someone steps up to maintain them, which should include a minimal CI
> job running.
>
> So eventually, on OpenStack-Ansible we were planning to revive the
> Watcher role support to the project.
> How we usually test deployment, is by spawning an all-in-one
> environment with drivers and executing a couple of tempest scenarios
> to ensure basic functionality of the service.
>
> With that, having a native OpenStack telemetry datastore is very
> beneficial for such a goal, as we already maintain the means for spawning a
> telemetry stack, while a requirement for Prometheus would be
> unfortunate for us at least.
>
> While I was writing that, I partially realized that testing Watcher on
> all-in-one is pretty much impossible as well...
>
You can certainly test some of Watcher with an all-in-one deployment,
i.e. the APIs, and you can use the dummy test strategies.
But yes, in general, like Nova, you need at least 2 nodes to be able to test
it properly, ideally 3,
so that if you're doing a live migration there is actually a choice of host.
In general, however, Watcher, like Heat, just asks Nova to actually move the VMs.
Sure, it will ask Nova to move it to a specific host, but fundamentally, if
you have
tested live migration with Nova via Tempest separately, there is no reason
to expect
it would not work for a live migration triggered by Watcher or Heat or
anything else that
just calls Nova's API.
So you could still get some valuable testing in an all-in-one, but ideally
there would be at least 2 compute hosts.
> But at the very least, I can propose looking into adding an OSA job
> with Gnocchi as NV to the project, to show the state of the deployment
> with this driver.
>
Well, Gnocchi is also not a native OpenStack telemetry datastore; it left
our community to pursue its own goals and is now a third-party datastore,
just like Grafana or Prometheus.
Monasca is currently marked as inactive
(https://review.opendev.org/c/openstack/governance/+/897520) and is in the
process of being retired,
but it also has no testing on the Watcher side, so the combination of the
two is why we are deprecating it going forward.
If both change, I'm happy to see the support continue.
Gnocchi has testing, but we are not actively working on extending its
functionality going forward.
As long as it continues to work, I see no reason to change its support
status.
Watcher has quite a lot of untested integrations, which is unfortunate.
We are planning to build out a feature/test/support matrix in the docs
this cycle,
but, for example, Watcher can integrate with both Ironic and the Canonical
MAAS component
to do some level of host power management. None of that is tested, and we
are likely going
to mark them as experimental and reflect on whether we can continue to
support them or not going forward.
It also has the ability to do Cinder storage pool balancing, which is, I
think, also untested right now.
One of the things we hope to do is extend the existing testing in our
current jobs to cover gaps like
that where it is practical to do so. But creating a devstack plugin to
deploy MAAS with fake infrastructure
is likely a lot more than we can do with our existing contributors, so
expect that to go to experimental, then
deprecated, and finally be removed if no one turns up to support it.
Ironic is in the same boat; however, there are devstack jobs with fake
ironic nodes, so I
could see a path to us having an ironic job down the line. It's just not
high on our current priority
list to address the support status or testing of this currently.
Eventlet removal and other tech debt/community goals are definitely higher,
but I hope the new support/testing
matrix will at least help folks make informed decisions on what features
to use and what backends are
recommended going forward.
8 months, 2 weeks
[nova][ptg] 2025.2 Flamingo PTG summary
by Rene Ribaud
Hello everyone,
Last week was the PTG—thank you to those who joined! I hope you enjoyed it.
I haven’t gathered exact attendance stats, but it seemed that most sessions
had at least around 15 participants, with some peaks during the cross-team
discussions.
If you’d like to take a closer look, here’s the link to the PTG etherpad:
https://etherpad.opendev.org/p/r.bf5f1185e201e31ed8c3adeb45e3cf6d
We had a pretty full agenda for Nova, so here’s a summary I’ve tried to
keep as short as possible.
#### 2025.1 Epoxy Retrospective ####
17 specs were accepted, and 12 implemented — an excellent ratio. This
represents a clear improvement over previous cycles.
Virtiofs was successfully merged, unblocking other work and boosting
contributor motivation.
✅ We agreed to maintain regular status updates via the etherpad and follow
up during Nova meetings.
API Microversions & Tempest Coverage: several microversions were merged
with good structure.
However, some schema changes were not reflected in Tempest, causing
downstream blockers.
Also, the updates covered by the microversions were not propagated into the
SDK and OpenStack client.
✅ Ensure client-side features (e.g., server show) are also published and
tracked.
✅ Keep microversions isolated and document Tempest implications clearly in
specs.
✅ Raise awareness of the tempest-with-latest-microversion job during Nova
meetings.
✅ Monitor OpenAPI efforts in Nova, which may allow offloading schema checks
from Tempest in the future.
Eventlet Removal: progress is behind schedule, especially compared to other
projects like Neutron.
✅ Flag this as a priority area for upcoming cycles.
Review Process & Tracking: spec review days were difficult to coordinate,
and the status etherpad was often outdated.
✅ Encourage active contributors to support occasional contributors during
review days.
✅ Commit to keeping the etherpad current throughout the cycle.
#### 2025.2 Flamingo Planning ####
Timeline:
Soft spec freeze (no new specs): June 1st
Hard spec freeze (M2): July 3rd
Feature Freeze (FF): August 28th
Final release: late September / early October
✅ We agreed to officially adopt June 1st as the soft freeze date, based on
the successful approach in Epoxy.
✅ A spec review day will be scheduled around mid-June; these will be
scheduled and announced early to ensure participation.
✅ Uggla will update the schedule document with the proposed milestones.
#### Upstream Bug Triage ####
We acknowledged that active bug triage has slowed down, resulting in a
backlog increase (~150 untriaged bugs).
There is a consensus that triage remains important to maintain a clear
picture of the actual bug landscape.
✅ Trial a new approach: review some untriaged bugs at the end of Nova team
meetings.
✅ Process the list by age (starting with the newest or most-voted first).
#### Closing Old Bugs ####
A proposal was made to bulk-close bugs older than 2 years, with a
respectful and explanatory message, aiming to reduce backlog and improve
visibility.
However, multiple voices expressed strong reservations.
✅Take no action for now. Focus efforts on triaging new bugs first.
✅ If we successfully reduce the number of untriaged new bugs, we can
consider scrubbing the bug backlog and possibly closing some of the older
ones.
#### Preparation for Python 3.13 ####
While Python 3.13 is not mandatory for 2025.2, early compatibility work was
discussed due to known issues (e.g., eventlet is broken on 3.13, as
observed on Ubuntu 25.04)
Ubuntu 24.04 and CentOS Stream 10 will stay on 3.12 for their supported
lifespans.
A non-voting unit test job for Python 3.13 (openstack-tox-py313) has
already been added and is currently passing.
Introducing a functional job for 3.13 could be a good next step, if
resources allow.
✅ Gibi will track this as part of the broader eventlet removal work.
#### Confidential Computing Feature Planning ####
AMD SEV is already supported in Nova.
SEV-ES is implemented in libvirt and work is ongoing in Nova.
SEV-SNP is now supported in libvirt (v10.5.0). Work in Nova has not started
yet.
✅ Pay closer attention to SEV-ES reviews to help move this forward.
✅ Tkajinam will write a new spec for SEV-SNP.
Intel TDX
Kernel support is nearly ready (expected in 6.15).
Libvirt patches exist, but feature is not yet upstreamed or widely released.
✅ No action agreed yet, as this remains exploratory.
Arm CCA
No hardware is available yet; earliest expected in April 2027 (Fujitsu
Monaka).
Support in libvirt, QEMU, and Linux kernel is still under development.
✅ The use case is reasonable, but too early to proceed — we should wait
until libvirt and QEMU support is mature.
✅ It would be beneficial to wait for at least one Linux distribution to
officially support Arm CCA, allowing real-world testing.
✅ Attestation support for Arm is seen as external to Nova, with only minor
flags possibly needed in the guest.
#### RDT / MPAM Feature Discussion ####
RDT (Intel PQoS) and MPAM (Arm equivalent) aim to mitigate “noisy neighbor”
issues by allocating cache/memory bandwidth to VMs.
Development has stalled since 2019, primarily due to:
- Lower priority for contributors
- Lack of customer demand
- Infrastructure complexity (NUMA modeling, placement limitations)
✅ r-taketn to reopen and revise the original spec, showing a clear diff to
the previous version.
✅ Ensure that abstractions are generic, not tied to proprietary technology,
using libvirt + resource classes/traits may provide enough flexibility.
#### vTPM Live Migration ####
A spec for vTPM live migration was approved in Epoxy:
https://specs.openstack.org/openstack/nova-specs/specs/2025.1/approved/vtpm…
To support live-migratable vTPM-enabled instances, Barbican secrets used for
vTPM need to be owned by Nova, rather than the end user.
This shift in ownership allows Nova to access the secret during live
migration operations.
Opt-in is handled via image property or flavor extra spec, meaning user
consent is explicitly required.
Current Proposal to enable this workflow:
- Castellan should allow per-call configuration for sending the service
token (rather than relying on a global all-or-nothing setting).
Proposal: https://review.opendev.org/c/openstack/castellan/+/942015
- If the Nova service token is present, Barbican should set the secret
owner to Nova.
Proposal: https://review.opendev.org/c/openstack/barbican/+/942016
This workflow ensures Nova can read/delete the secret during lifecycle
operations like migration, without involving the user.
A question was raised around possible co-ownership between Nova and the end
user (e.g., both having access to the secret). While this may be
interesting longer-term, current implementation assumes a single owner.
✅ User and host modes are as described in the spec.
For deployment mode, Nova will:
- Authenticate to Barbican as itself (using a service token).
- Own the vTPM secret it creates — it will be able to create, read, and
delete it.
- The user will not see or control the secret, including deletion.
- The secret will be visible to other members of the Nova service project
by default, but this could be restricted in future via Barbican ACLs to
limit visibility to Nova only.
#### Cloud Hypervisor Integration ####
There is an ongoing effort to integrate Cloud Hypervisor into Nova via the
Libvirt driver:
Spec: https://review.opendev.org/c/openstack/nova-specs/+/945549
The current PoC requires only minor changes to work with Libvirt, and the
team is ready to present the proposal at the PTG.
✅ We’re happy with the direction the spec is taking. Below are some key
highlights regarding the spec.
✅ Clarify platform support (e.g., is libvirt compiled with cloud hypervisor
support by default? Is it available in distros?).
✅ Provide a plan for runtime attach of multiple NICs and volumes.
✅ Mark as experimental if cloud hypervisor is not yet in upstream distro
packages.
✅ Ensure that the following features are expected to work and covered in
the spec: resize, migrate, rebuild, evacuate, snapshot.
✅ Justify raw-only image support, and outline the path to qcow2
compatibility.
#### vGPU (mdev) and PCI SR-IOV Topics ####
1. Live-migratable flag handling (physical_network tag)
Bug: https://bugs.launchpad.net/nova/+bug/2102161
✅ We agreed that the current behavior is correct and consistent with the
intention:
If live_migratable = false → fallback to hotplug during live migration.
If live_migratable = true on both source and destination → prefer
transparent live migration.
✅ Investigate how Neutron might participate by requesting live-migratable
ports.
2. Preemptive live migration failure for non-migratable PCI devices
Nova currently checks for migratability during scheduling and conductor
phases. There’s a proposal to move these checks earlier, possibly to the
API level.
Bug: https://bugs.launchpad.net/nova/+bug/2103631
✅ Confirm with gmann whether a microversion is needed — likely not, as
return codes are already supported (202 → 400/409).
✅ Uggla may submit a small spec to formalize this change.
✅ Split the work into two steps:
- Fix existing bug (can be backported).
- Incrementally move other validations earlier in the flow.
3. PCI SR-IOV: Unify the Live Migration Code Path
There’s agreement on the need to reduce technical debt by refactoring the
current dual-code-path approach into a unified model for PCI live migration.
✅ A dedicated spec is needed to clarify and unify PCI claiming and
allocation.
✅ This refactor should address PCI claiming and allocation, potentially
deprecating or replacing move_claim in favor of more robust DB-backed logic.
✅ This effort is directly related to point 1 (migratability awareness) and
will help ensure consistent resource management across the codebase.
#### SPICE VDI – Next Steps ####
There is an ongoing effort to enhance libvirt domain XML configuration for
desktop virtualization use cases (e.g. SPICE with USB and sound
controllers). Some patches were proposed but not merged in time for Epoxy.
Mikal raised the question of whether a new spec would be required in
Flamingo, which would be the third iteration of this work.
The team also raised concern about the complexity of adding traits (e.g.
os-traits) for relatively simple additions, due to the multi-step process
involved (traits patch, release, requirements update, etc.).
✅ Proceed with a specless blueprint.
✅ Plan to pull os-traits and os-resource-classes logic into Placement, to
simplify the integration process and reduce friction. Link the required
Placement version in Nova documentation accordingly. This is a strategic
direction, even if some traits might still be shared with Neutron/Cinder.
#### Virtiofs Client Support ####
The virtiofs server-side support was merged in Epoxy, but SDK and
client-side support did not make it in time. The proposal is to merge both
patches early in Flamingo and then backport to Epoxy.
✅ No concern with microversion usage here.
✅The ordering of microversion support patches across Nova, SDKs, and
clients will be handled by respective owners.
✅ Uggla to track that each new microversion in Nova has a corresponding
patch in SDK/client layers.
✅ Not directly related to virtiofs, but the new reset-state confirmation
prompt in the client was noted and welcomed.
#### One-Time-Use (OTU) Devices ####
OTU devices are designed to be consumed once and then unreserved.
There is a need to provide practical guidance on handling these cleanly,
especially in notification-driven environments.
Additionally, there's an important patch related to Placement behavior on
over-capacity nodes:
https://review.opendev.org/c/openstack/placement/+/945465
Placement currently blocks new allocations on over-capacity nodes — even if
the new allocation reduces usage. This breaks migration from overloaded
hosts. The proposed fix allows allocations that do not worsen usage (i.e.,
that keep it the same or improve it).
Note: A similar OTU device handling strategy has been successfully used in
Ironic.
✅ Provide an example script or tool for external OTU device cleanup, based
on notifications (a sketch follows below).
✅ Agreement on the proposed Placement fix — it is operator-friendly and
resolves real issues in migration workflows.
✅ We likely need to dig deeper into implementation and tooling for broader
OTU support.
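For reference, a minimal sketch of what such an external cleanup tool could
look like, using an oslo.messaging notification listener; the topic, the
event filter, and the placeholder unreserve step are assumptions that would
need to match the actual OTU setup in a deployment:
```python
import sys

from oslo_config import cfg
import oslo_messaging


class OTUCleanupEndpoint(object):
    """React to instance deletion and unreserve the one-time-use device."""

    def info(self, ctxt, publisher_id, event_type, payload, metadata):
        if event_type != "instance.delete.end":
            return
        # Placeholder: look up the OTU resource provider used by this
        # instance and clear its reserved inventory via the Placement API
        # (for example with osc-placement or a direct REST call).
        print("instance deleted, OTU cleanup needed: %s" % payload)


def main():
    conf = cfg.CONF
    # transport_url and friends are expected to come from a config file.
    conf(sys.argv[1:], project="otu-cleanup")
    transport = oslo_messaging.get_notification_transport(conf)
    targets = [oslo_messaging.Target(topic="versioned_notifications")]
    listener = oslo_messaging.get_notification_listener(
        transport, targets, [OTUCleanupEndpoint()], executor="threading")
    listener.start()
    listener.wait()


if __name__ == "__main__":
    main()
```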
#### Glance cross-project session ####
Please look at the Glance summary.
#### Secure RBAC – Finalization Plan ####
Tobias raised concerns about incomplete secure RBAC support in Nova,
particularly around default roles and policy behavior. Much of the
groundwork has been done, but a number of patches still require review and
finalization.
✅ Gmann will continue working on the outstanding patches during the
Flamingo cycle. The objective is to complete secure RBAC support in Nova as
part of this cycle.
#### Image Properties Handling – DB Schema & API Response ####
The issue arises from discrepancies between image property metadata stored
by Nova and what is received from Glance. Nova’s DB schema enforces a
255-character limit on metadata keys and values, which can lead to silent
truncation or hard failures (e.g., when prefixing keys like image_ pushes
the total length over 255).
Nova stopped supporting custom image properties nearly a decade ago, when
the system moved to structured objects (ImageMetaProps via OVO).
Glance still allows some custom metadata, which may be passed through to
Nova.
This led to invalid or non-standard keys (e.g.,
owner_specified.openstack.sha256) being stored or exposed, even though they
are not part of Nova’s supported set.
Consensus emerged that we are facing two issues:
- Nova's API may expose more metadata than it should (from Glance).
- Nova stores non-standard or overly long keys/values, resulting in silent
truncation or hard DB errors.
✅ Nova should stop storing non-standard image properties altogether.
✅ A cleanup plan should be created to remove existing unused or invalid
metadata from Nova's database post-upgrade.
✅ During instance.save(), Nova should identify and delete unused image_*
keys from the system metadata table (illustrated below).
✅ We must be cautious to preserve snapshot-related keys that are valid but
not part of the base ImageMetaProps.
✅ These changes are considered bugfixes and can proceed without a new spec.
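A rough illustration of the cleanup idea (hypothetical names only; the real
change will live in Nova's instance save path and must preserve the valid
snapshot-related keys mentioned above):
```python
def prune_image_system_metadata(system_metadata, known_props, keys_to_keep):
    # known_props stands in for the properties defined on ImageMetaProps;
    # keys_to_keep stands in for the snapshot-related keys that must survive.
    pruned = {}
    for key, value in system_metadata.items():
        if not key.startswith("image_"):
            pruned[key] = value
            continue
        prop = key[len("image_"):]
        if prop in known_props or key in keys_to_keep:
            pruned[key] = value
        # else: drop the non-standard image property instead of truncating it
    return pruned
```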
#### Eventlet removal ####
Please read the excellent blog post series from Gibi here:
https://gibizer.github.io/posts/Eventlet-Removal-Flamingo-PTG/
#### Enhanced Granularity and Live Application of QoS ####
This was cross team Neutron/Cinder/Nova first topic.
Bloomberg folks presented early ideas around making QoS settings more
granular and mutable, and potentially applicable to existing ports or VMs,
not just at creation time.
Nova does not operate on multiple instances at once, which conflicts with
some proposed behaviors (e.g., live update of QoS on a network/project
level).
QoS is currently exposed via flavors in Nova, and is only supported on the
frontend for the Libvirt driver.
QoS mutability is non-trivial, with implications for scheduling, resource
modeling, and placement interactions.
The scope is broad and would require cross-project collaboration (Neutron,
Cinder, Placement).
Use cases and notes from Bloomberg:
https://etherpad.opendev.org/p/OpenStack_QoS_Feature_Enhancement_Discussion
✅ Use flavor-based modeling for QoS remains the Nova approach.
✅ Nova should not apply policies across many instances simultaneously.
✅ A spec will be required, especially if new APIs or behavior modifications
for existing VMs are introduced. The spec should provide concrete use case
examples and API design proposals, including expected behavior during
lifecycle operations (resize, rebuild, shelve, etc.).
✅ Max bandwidth adjustments may be possible (as they don’t require
reservations), but broader mutability is more complex.
✅ Neutron and Cinder raised no objections regarding Bloomberg’s use cases
and proposals. However, please look at Neutron and Cinder's respective
summaries.
#### Moving TAP Device Creation from Libvirt to os-vif ####
This change proposes moving the creation of TAP devices from the Libvirt
driver into os-vif, making it more consistent and decoupled. However, it
introduces upgrade and timing considerations, especially regarding Neutron
and OVN behavior.
Bug: https://bugs.launchpad.net/nova/+bug/2073254
Patch: https://review.opendev.org/c/openstack/nova/+/942786
✅ Neutron team is open to adjusting the timing of the "port ready" event,
which could eliminate the need for Nova-side hacks.
✅ Sean will proceed with the patch and verify behavior through CI.
#### Instance Annotations, Labels & K8s-Like Semantics ####
Sean proposed introducing a mechanism similar to Kubernetes annotations and
labels in Nova, to:
- Express user intent regarding instance behavior (e.g., "should this
instance be migrated?")
- Convey lifecycle preferences to external tools like Watcher and Masakari
- Expose capabilities or constraints of an instance (e.g., "cannot be
shelved because it has a vTPM")
Proposed Examples of Instance Annotations:
lifecycle:live-migratable=true|false
ha:role=primary|secondary
These would be:
- Set by users (or operators)
- Optionally inherited from flavors (but conflicts would raise 400 Bad
Request)
- Expressed intent only — not enforcement of policy
In addition, labels generated by Nova could reflect actual capabilities,
like:
lifecycle:live-migratable=false if an instance has a PCI device
lifecycle:shelvable=false if it uses vTPM
✅ Define a new API to expose capabilities of instances (e.g., “can this
instance be live-migrated?”)
Values will be derived by Nova based on configuration/hardware and exposed
via nova server show.
✅ Sean will create a spec.
✅ Looking at user-defined labels, we eventually considered defining a
second API for them to express scheduling/HA preferences.
However, we decided the preferred approach for now is to start with the
metadata API and evolve to a first-class model later (see the sketch below).
We may need admin-only metadata (e.g., for HA tooling like Masakari); this
is discussed in the Admin-Only Instance Metadata / Annotations topic later
in this summary.
✅ Sean will also create a spec for this.
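As a rough sketch of that interim approach (the key names below are just the
proposals from this session, not an implemented Nova contract, and
"mycloud"/"my-instance" are placeholders):
```
# Sketch only: express intent via plain server metadata until a first-class
# annotation/label model exists. Nothing here is validated or enforced by Nova.
import openstack

conn = openstack.connect(cloud="mycloud")         # assumed clouds.yaml entry
server = conn.compute.find_server("my-instance")  # placeholder server name

conn.compute.set_server_metadata(
    server,
    **{"lifecycle:live-migratable": "false", "ha:role": "primary"},
)
```
External tooling (Watcher, Masakari, etc.) could read such keys as hints;
enforcement would only come later with the first-class model.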
#### External Traits and Node Pressure Metrics ####
Sean also proposed allowing external systems (e.g., Watcher, telemetry
agents) to annotate compute nodes with traits such as memory/cpu/io
pressure, based on /proc/pressure.
Examples:
CUSTOM_MEM_PRESSURE=high
EXTERNAL_IO_PRESSURE=moderate
✅ Support a COMPUTE_MEM_PRESSURE-like trait, populated from sysfs as static
info (not dynamic).
✅ A weigher could use these traits to influence placement. A default traits
list could be configured (e.g., prefer/avoid hosts with certain pressures
or hardware features). This approach could evolve into a generic “preferred
traits” weigher, similar to Kubernetes taints/tolerations. A rough sketch of
deriving such traits is shown below.
✅ Sean will create a dedicated spec for this feature.
✅ Sbauza volunteered to help, especially as the work aligns with weigher
logic from the previous cycle.
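A rough sketch of how such a trait could be derived from the kernel's
pressure-stall interface (the thresholds and CUSTOM_* trait names are
illustrative only; the real naming and cadence would be defined in the spec):
```
# Sketch only: map /proc/pressure/<resource> averages to a coarse trait name.
def read_psi_some_avg60(resource):
    """Return the 60s 'some' pressure average for cpu, memory or io."""
    with open(f"/proc/pressure/{resource}") as f:
        for line in f:
            if line.startswith("some"):
                fields = dict(kv.split("=") for kv in line.split()[1:])
                return float(fields["avg60"])
    return 0.0

def pressure_trait(resource):
    avg = read_psi_some_avg60(resource)
    if avg >= 40.0:     # arbitrary threshold for illustration
        return f"CUSTOM_{resource.upper()}_PRESSURE_HIGH"
    if avg >= 10.0:
        return f"CUSTOM_{resource.upper()}_PRESSURE_MODERATE"
    return None

print(pressure_trait("memory"))
```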
#### OpenAPI Schema Integration ####
Stephen highlighted that most of the heavy lifting for OpenAPI support is
now complete, and the work is down to pure response schema definitions.
This effort now spans three cycles, and it would be valuable to
finalize it early in Flamingo.
✅ We'll formalize this work with a blueprint.
✅ The goal is to make early progress in Flamingo, ideally with a dedicated
review day.
✅ Stephen is happy to join synchronous review sessions and will coordinate
pings for progress.
✅ Masahito volunteered to help with the remaining work.
#### OpenStack SDK & Client Workflows ####
Stephen raised a few concerns regarding timing mismatches between SDK/OSC
freezes and microversion patch merges in Nova.
Some microversion support landed too late to be integrated in the SDK
before the Epoxy freeze.
Patches were sometimes missed due to lack of "depends-on" links or broken
initial submissions.
✅ Uggla will follow up and finalize these patches early in the Flamingo
cycle.
#### Upstream Testing for PCI Passthrough and mdev Devices ####
With IGB support merged in Epoxy, and vIOMMU enabled in some Vexxhost
workers (thanks to dansmith), the opportunity exists to expand PCI testing
upstream in Tempest.
This would also benefit testing of one-time-use (OTU) devices.
Finalizing mtty testing is a priority, as it helps ensure device support is
consistent and regressions (like bug #2098892) are caught early.
✅ Bauzas will lead on wrapping up mtty testing.
✅ Gibi will coordinate with cloud providers to assess Epoxy support and
revisit this topic during the next PTG if needed.
#### CPU Power Management – Expected Behavior ####
Melanie raised questions about inconsistencies between design and
implementation in Nova’s CPU power management logic. In particular:
- CPUs were being offlined too aggressively, sometimes during reboot or
migration operations.
- This contradicts the intent that only unassigned or deallocated cores
should be powered off.
There was confusion between two approaches:
- Aggressive power-down of unused CPUs during all idle states (stop,
shelve, etc.)
- Conservative behavior, powering off cores only when the VM is deleted or
migrated away
Consensus favored the aggressive-but-safe model:
- Power down cores only when not used, e.g., VM is stopped or migrated.
- Be cautious not to power off cores prematurely (e.g., during reboot or
verify-resize).
✅ Do not rush to power off CPU cores at compute startup or mid-operation.
✅ Revisit the implementation so the resource tracker runs first, and
determines actual core assignments before making decisions.
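A minimal sketch of that ordering (assignments first, power decisions
second), using the standard Linux CPU hotplug interface; the function names
are hypothetical and this is not Nova's actual implementation:
```
# Sketch only: offline exactly the dedicated cores that are not assigned to
# any instance, and (re)online the ones that are.
def set_core_online(core_id, online):
    with open(f"/sys/devices/system/cpu/cpu{core_id}/online", "w") as f:
        f.write("1" if online else "0")

def reconcile_core_power(dedicated_set, assigned_cores):
    """dedicated_set: cores managed for power (from config).
    assigned_cores: cores currently pinned to instances (resource tracker)."""
    for core in sorted(dedicated_set):
        set_core_online(core, online=(core in assigned_cores))

# e.g. reconcile_core_power(dedicated_set={4, 5, 6, 7}, assigned_cores={4, 5})
```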
#### Live Migration with Encrypted Volumes (Barbican Integration) ####
HJ-KIM raised the point that Nova does not currently support live migration
of instances using encrypted Cinder volumes managed by Barbican. This is a
critical blocker in environments with strict compliance requirements.
✅ This is a parallel issue to vTPM support. We will learn from the vTPM
implementation and consider applying similar concepts.
✅ A future solution may involve adjusting how ownership is managed, or
providing scoped access via ACLs.
✅ Further discussion/spec work will be needed once an implementation
direction is clearer.
#### Manila–Nova Cross-Team Integration ####
The initial Manila–Nova integration is now merged — thanks to everyone
involved!
The next step is to:
- Add automated testing (currently manual tests only).
- Start with a few basic positive and negative test scenarios (create,
attach, write, delete; snapshot and restore; rule visibility; restricted
deletion; etc.).
Additionally, longer-term features and improvements are being considered;
please look at the etherpad.
✅ We will work on tempest tests.
✅ We will continue enhancing Nova–Manila integration during Flamingo (F)
and beyond.
✅ Uggla will submit a spec as needed to land memfd support.
#### Provider Traits Management via provider.yaml ####
📌 Spec: https://review.opendev.org/c/openstack/nova-specs/+/937587
Problem: Traits defined in provider.yaml are added to Placement but never
removed if deleted from the file.
✅ Implement a mechanism where Nova copies the applied file to
/var/lib/nova/applied_provider.yaml, and diffs it with the active one on
restart.
This would allow traits (and possibly other config) to be safely
removed.
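A minimal sketch of the diff step, assuming traits are listed per provider in
the config file; the "active" path and the parsing below are illustrative
only:
```
# Sketch only: compare the previously applied provider config with the current
# one and work out which traits should be removed from Placement.
import yaml

def traits_by_provider(path):
    with open(path) as f:
        data = yaml.safe_load(f) or {}
    result = {}
    for provider in data.get("providers", []):
        ident = str(provider.get("identification"))
        # assumes custom traits live under "traits: additional:" per provider
        traits = set(provider.get("traits", {}).get("additional", []))
        result[ident] = traits
    return result

applied = traits_by_provider("/var/lib/nova/applied_provider.yaml")
current = traits_by_provider("/etc/nova/provider_config/provider.yaml")

for ident, old_traits in applied.items():
    to_remove = old_traits - current.get(ident, set())
    if to_remove:
        print(f"{ident}: remove {sorted(to_remove)} from Placement")
```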
#### Admin-Only Instance Metadata / Annotations ####
📌 Spec: https://review.opendev.org/c/openstack/nova-specs/+/939190
Issue: Current instance metadata is user-owned, and shouldn't be used by
admins.
Proposal: Introduce admin-only annotations (or metadata with ownership
tracking), allowing operators to set system-visible metadata without
violating user intent.
✅ Introduce a created_by field (similar to locked_by) to track who created
metadata: user vs admin.
Consider an admin: prefix namespace for admin-controlled keys (applied to
annotations or metadata).
Implementation requires a DB change and a nova-spec.
Note: This aligns well with broader annotation work already discussed in
this cycle.
#### delete_on_terminate for Ports (Server Create / Network Attach APIs)
####
📌 Related discussion:
https://review.opendev.org/c/openstack/nova-specs/+/936990
Background: This was discussed in past PTGs. Currently, delete_on_terminate
can't be updated dynamically across instance lifetime.
✅ A spec with a working PoC will help clarify the desired behavior and
unblock the discussion.
Long-term solution may require storing this flag in Neutron as a port
property (rather than Nova-specific DB).
#### Graceful Shutdown of Nova Compute Services ####
📌 Spec: https://review.opendev.org/c/openstack/nova-specs/+/937185
Challenge: Need a mechanism to drain compute nodes gracefully before
shutdown, without interrupting active workloads or migrations.
Graceful shutdown is tricky in the presence of live migrations.
Ideas include:
- Temporary “maintenance mode” (block write requests).
- Group-level compute draining.
✅ The topic is important but not urgent — bandwidth is limited.
Note: Eventlet removal may simplify implementing this.
✅ Please report concrete bugs so we understand the blockers.
✅ A nova-spec with PoC would help drive the conversation.
#### Libvirt/QEMU Attributes via Flavor Extra Specs ####
Target: Advanced tuning of I/O performance via iothreads and virtqueue
mapping, based on:
https://developers.redhat.com/articles/2024/09/05/scaling-virtio-blk-disk-i…
✅ Introduce new flavor extra specs such as:
- hw:io_threads=4
- hw:blk_multiqueue=2
These can be added to both flavor and image properties.
✅ A nova-spec should be written to document naming and semantics.
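If the spec lands, setting these would look like any other flavor extra spec;
a hedged sketch via the SDK (the two property names are only the session's
proposals and are not recognized by Nova today, and the cloud/flavor names
are placeholders):
```
# Hypothetical sketch: the hw:io_threads / hw:blk_multiqueue names are only
# proposals from this session; nothing validates or honours them yet.
import openstack

conn = openstack.connect(cloud="mycloud")      # assumed clouds.yaml entry
flavor = conn.compute.find_flavor("io-tuned")  # assumed existing flavor

# assumes the SDK's flavor extra-spec helper
conn.compute.create_flavor_extra_specs(
    flavor, {"hw:io_threads": "4", "hw:blk_multiqueue": "2"})
```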
#### Dynamic Modification of libvirt Domain XML (Hook Proposal) ####
oVirt allows plugins to alter the libvirt domain XML just before
instance launch (via VDSM hooks).
Nova does not offer a mechanism to intercept or modify the domain XML, and
the design explicitly avoids this.
The desired use case involves injecting configuration that libvirt cannot
currently represent, for example, enabling multiuser SPICE consoles.
✅ This proposal is explicitly rejected.
✅ Nova will not support hook points for modifying the libvirt domain XML.
✅ Operators may use out-of-band libvirt/qemu hooks at their own risk, but
should not expect upstream support or stability guarantees.
#### Revisiting the "No More API Proxies" Rule ####
Masahito proposed allowing users to filter instances via API based on
related service data, such as network_id.
✅ The "no API proxy" rule remains, but with pragmatic exceptions:
- Filtering is acceptable if the data exists in Nova’s DB (e.g., network
ID, image ID).
- No cross-service REST calls allowed (e.g., Neutron QoS types still out of
scope).
- Filtering by network_id in nova list is reasonable and can proceed.
✅ Masahito will provide a spec.
#### OVN Migration & Port Setup Timing ####
📌 Context: https://bugs.launchpad.net/nova/+bug/2073254
In OVN-based deployments, Neutron signals the network-plugged event too
early, before the port is fully set up. This causes issues in live
migration, especially under load.
✅ Nova already supports waiting on the network-plugged event. OVN in Ubuntu
Noble should behave properly.
A proposal to improve timing in Neutron was discussed (Neutron to wait for
port claim in southbound DB).
Nova might support this via a Neutron port hint that triggers tap interface
creation earlier during migration (pre-live-migration).
✅ Next step: open an RFE bug in Neutron. If accepted, a Nova spec may be
needed.
#### Blocking API Threads During Volume Attachments ####
📌 Context: https://bugs.launchpad.net/nova/+bug/1930406
Volume attachment RPC calls block API workers in uWSGI, leading to
starvation when multiple attachments are made in parallel.
✅ Volume/interface attachments should become async, reducing API lock
contention.
Fix is non-trivial and will require a microversion.
In the meantime, operators may tune uWSGI workers/threads or serialize
attachment calls.
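Pending that fix, one form of the interim workaround mentioned above is to
throttle attach requests client-side; a sketch, where attach_volume() stands
in for whatever client call is actually used:
```
# Sketch only: cap concurrent attach calls so a burst of attachments cannot
# tie up all API workers. attach_volume() is a placeholder, not a real API.
import threading
from concurrent.futures import ThreadPoolExecutor

ATTACH_SLOTS = threading.Semaphore(2)  # at most two attach requests in flight

def attach_volume(server_id, volume_id):
    raise NotImplementedError("placeholder for the real attach call")

def attach_with_throttle(server_id, volume_id):
    with ATTACH_SLOTS:
        return attach_volume(server_id, volume_id)

with ThreadPoolExecutor(max_workers=8) as pool:
    for vol in ("vol-1", "vol-2", "vol-3"):
        pool.submit(attach_with_throttle, "server-1", vol)
```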
#### Inventory Update Failure – DISK_GB Bug ####
📌 Bug: https://bugs.launchpad.net/nova/+bug/2093869
When local storage becomes temporarily unavailable (e.g., Ceph down), Nova
sends total=0 for DISK_GB, which Placement rejects if allocations exist.
✅ The real fix is to restore the storage backend.
Nova should improve error handling/logging, but should not shut down the
compute service.
#### Security Group Name Conflict Bug ####
📌 Bug: https://bugs.launchpad.net/nova/+bug/2105896
When multiple security groups share the same name (via Neutron RBAC),
instance builds can fail due to incorrect duplicate detection logic.
✅ The issue was fixed in:
https://review.opendev.org/c/openstack/nova/+/946079
✅ Fix will be reviewed and backported to Epoxy.
If you've read this far — thank you! 🙏
If you spot any mistakes or missing points, please don't hesitate to let me
know.
Best regards.
René.
8 months, 2 weeks
[manila] 2025.2 Flamingo PTG summary
by Carlos Silva
Hello Zorillas and interested stackers,
Last week's PTG had plenty of topics and good takeaways.
In case you would like to watch any of the discussions, please take a look
at the videos in the OpenStack Manila Youtube channel [0].
The PTG etherpad has all of the notes we took [9]. Here is a summary of the
discussions grouped by each topic:
Retrospective
==========
Highlights
-------------
The mid-cycle alongside the feature proposal freeze provided a good
opportunity for us to have collaborative review sessions and move faster on
reviews.
Two bug squashes had a good impact on the bug backlog, and the bug trend was
more positive this cycle, despite the numbers growing due to the
low-hanging fruit we started reporting.
Internships with City University of Seattle, Valencia College and North
Dakota State University are definitely helping with progress on
manila-ui and OpenAPI. We will continue the effort.
We would like to speed up reviews and improve our metrics [1] on how long
changes are open before being merged. Review dashboards can help and we can
work with our reviewers to have a more disciplined approach on reviews.
Broken third-party CI systems currently mean that we have little testing.
We need to rely on the authors or their peers to test and ensure that a
feature is working. We will look into documenting CI setup procedures and
gather thoughts from maintainers.
New API features should be tested as early as possible to ensure they won't
break any workflows. Our contributor documentation will be updated with
extra guidelines.
AIs:
(carloss) Encourage Bug Czar candidates and bring this up more often in the
manila weekly meetings
(carloss) Encourage spec authors to schedule a meeting to discuss the spec
to speed up the review process.
(carloss) include iCal with event announcements (bugsquash / mid cycle)
(gouthamr) Creating a review dashboard
(carloss) Record "expert seminars" on FAQs: it would be great to have some
videos documenting how-tos in OpenStack and help people to unblock
themselves when they are hitting common openstack-developer issues:
https://etherpad.opendev.org/p/manila-howcasts
(carloss) communicate a deadline for the manila CLI -> OSC documentation
changes. The work with our interns should go until FPF. It needs to be done
before the client release, when we are planning to drop the manilaclient
support. ashrodri offered to help get it completed once we reach the FPF
deadline.
(carloss) We should update these docs to mention that first-party driver
implementations should be done for new features, and be stricter about the
testing requirements.
All things CephFS [2]
================
Deprecation of standalone NFS-Ganesha
-------------------------------------------------------
We added a warning in Dalmatian, deferred plans to deprecate based on
community feedback. Our plan is to remove it in the 2026.1 release. There
is a suggested update procedure, please reach out in case there are
questions.
AI: (carloss) send a reminder email in this cycle to incentivize people to
move to clustered NFS
Supporting NFSv3 for Windows workloads
--------------------------------------------------------
manila-tempest-plugin now supports multiple NFS protocol versions in one of
the scenario tests. As soon as we get the build, we will update the CephFS
NFS job to run tests for NFSv3 as well.
Testing and stabilization
--------------------------------
Bumped Ceph version in the CI jobs to Reef in Antelope, Bobcat, Caracal,
Dalmatian. We are starting to test with Ceph Squid; we intend to test with
Squid on "master" and "stable/2025.1" (epoxy) branches.
A couple of Ceph and NFS-Ganesha issues are impacting us at the moment [4]
[5] [6] and we managed to find workarounds for some of them.
We had to stop testing with the ingress daemon for now and we will get
back to testing as soon as the fix is out.
Manage unmanage of shares and snapshots
-----------------------------------------------------------
The feature is merged and working, and we are going to backfill tempest
test patches.
AI: (carloss) will propose a new job variant to allow testing this feature.
Plans for 2025.2 Flamingo
-----------------------------------
Investigate support for SMB/CIFS
Ceph-NFS QoS: we will follow the implementation of this feature in NFS
Ganesha and start discussing and drafting the Manila implementation when
the code is merged in Ganesha upstream.
Out of place restores and backup enhancements [7]
========================================
CERN is pursuing a backup backend with their C-Back tool. Currently Manila
backups can be restored back to the same share; there are some problems
with that approach when the source share backend is down, and with
preventing browse-by-restore behavior.
Zachary Goggins (za) proposed a specification and plans to work on it
during the Flamingo cycle. The share backups feature also needs some
enhancements, like get-progress and get-restore-progress actions. Zach
plans to make them part of the implementation.
We agreed that a backup resource should have a new "state" attribute,
instead of only relying on the status in order to have well defined backup
states.
AI: (za) update the out of place restore spec.
Tech debt
=======
Container driver failures
--------------------------------
The container driver tempest tests are perma-failing right now. We seem to
have a problem with RBAC and pre-provisioned tempest credentials.
AIs:
(carloss) Report a tempest bug to track the issues;
(gouthamr) will propose a change to switch back to using dynamic
credentials in our testing.
DockerHub rate limits
-----------------------------
We are only building an image in manila-image-elements. It's more pulls
than pushes. Pushes happen very rarely. The kolla team has moved away from
DockerHub as well.
Zach offered help in case we need another approach for registry. CERN has
its own tool.
AI: we will look into moving to quay.io
"manila" CLI removal
----------------------------
We added the deprecation warning 6 releases ago and we should proceed with
the removal. We will need an additional push to update all of our
documentation examples and move to keystoneauth.
We need more functional test coverage and we should have a hackathon just
as we did some years ago.
AI: carloss will schedule a hackathon for enabling more tests and send the
removal email to openstack-discuss. We are targeting the removal to 2025.2
Flamingo.
CI and testing
------------------
ZFSOnLinux job left on jammy: We created a bug for it and we can use it for
tracking.
IPv6 testing: The BGP software we were using (quagga) is now deprecated and
everything was migrated to FRR. We will need help to fix it as,
unfortunately, there wasn't a 1:1 translation between the libraries.
If someone has experience on this, it would be nice to collaborate to get
this fixed.
API
----
We are going to stop testing the v1 API and stop deploying it on DevStack
test jobs. We'll also update the install guide to note that we've stopped
supporting it. It was deprecated in 2015 ("Liberty" release). That's a good
code cleanup opportunity.
V2 is an extension of v1 with microversions.
If we stop supporting it, who is affected? Mostly people that have
automations using it.
What's the impact on manila-tempest-plugin? We have v1 and v2 tests. We
have a lot of coverage for v2. If you don't have the v1 API in the cloud,
the tests refuse to run. We will need to fix it.
AIs:
Work on the removal patches during the 2025.2 Flamingo release;
(carloss) will send an announcement email to the ML, including operators
tag.
Manila UI
-------------
We have been making progress in the Manila UI feature gap. Currently
working on manage/unmanage share servers, manage share with dhss=true,
filtering user messages on date, updating quotas table.
The share limits view broke some time ago; the code lives in Horizon.
We hit some issues using horizon's tox "runserver" environment; apparently
more people ran into the same issue. We will talk to other impacted parties
and check how to overcome this issue.
AI: (carloss) will reach out to the horizon team and ask how we can
re-introduce Manila limits to the overview tab.
Enable share encryption at-rest (back-end) with secret refs stored on
Barbican/Castellan. [8]
=====================================================================
We merged a specification some time ago with an implementation
architecture. That spec contemplated both Share encryption and Share server
encryption.
NetApp is now planning to work only on share server encryption. Encryption
can be disabled per share, but shares exported via a share server cannot
have a separate encryption key on ONTAP.
We reached an agreement that when a new share creation is triggered, if
there isn't a share server matching the provided key, a new share server
will need to be spawned. We also agreed that we should allow using names
for the secret reference for better user experience.
2025.2 Flamingo is the target release.
AIs: (kpdev/Sai) The spec will be updated and only the DHSS=True scenario
will be documented; The manila team will review the spec as soon as it is
proposed
Replication Improvements
====================
Back when we implemented replication, we didn't account for specific
configurations that the storage backends can have, for example whether the
backend could support zero RPO technologies or not.
Zero RPO is an important feature that allows data to be written
simultaneously between the share and its replicas.
We agreed that the way we should send the information to the backend is
through a backend specific share type extra spec. Administrators will be
able to define it in the share type and the backend will pick it up.
Operator concerns / questions
=======================
Where should we put parameters that change the behaviour of only one
protocol (NFS in this case)? We agreed that we should have write-once
metadata and not allow it to be updated afterwards. A configuration option
can be introduced for this, where the operator can determine which metadata
cannot be updated.
AI: carthaca will propose a lite-spec for this
Lustre FS Support for HPC Use Cases in OpenStack
Is there any possibility for OpenStack to officially integrate or support
parallel file systems like Lustre, either through Manila or other
components? We've heard this request in the past from the scientific-sig
group. Building a driver should be straightforward; it does not
necessarily need to be in-tree, which would make it easier to maintain. This
is a very good use case. This discussion will continue with the
scientific-sig group.
Replica / Snapshot Retention / Expiration Policy
While replicas in Manila are designed to be continuously in sync with the
active share, certain use cases — such as disaster recovery (DR) replicas
or manually created replicas that are no longer needed — could benefit from
lifecycle management.
Replicas are continuously synced with the source share, so the assumption is
that if they look "unused", they are still there for a reason. We had a spec
a while ago about automating snapshots (creation and deletion) on a schedule.
It would be preferable for an external automation tool to be used to achieve
such behavior. Maybe openstack/mistral can be a good approach (support for
manila snapshots already exists in Mistral).
Affinity/Anti-affinity spec updates
=========================
This feature allows users to create share groups with affinity policies,
which determine the affinity relationship between shares within the group.
There was an open question about locking strategies. We agreed that we can
use tooz, the database, or oslo.
AI: (chuanm) will update the spec.
Force deleting subnets
=================
This is a feature that follows the ability to add multiple subnets to a
share server. We should also be able to remove them. This spec is under
review.
We agreed that we should also implement the "check" mechanism before
deleting the subnet.
AIs: (sylvanld) will update the spec
Eventlet removal
=============
We need to remove eventlet-based WSGI uses and use oslo.service's new
threading-based backend instead for the ProcessLauncher and periodic tasks.
Neutron is doing some work around periodic tasks and we can benefit from
their learning.
AI: Work on this in Flamingo, aiming for completion in the 2026.1 cycle.
Manila/Nova Cross-project session: VirtioFS
=================================
VirtioFS implementation is now complete and we are looking at the next
steps. We currently don't have CI testing the feature and the Manila team
is planning to work on it during the 2025.2 Flamingo release.
The nova team intends to drive the remaining SDK and OSC patches to
completion during the 2025.2 Flamingo release.
We also discussed some possible enhancements: memfd support, online attach
and detach, and live migration. These will take some time and the Nova team
will work on such features gradually.
AIs: (carloss) will share the test scenarios with the Nova team and ask for
reviews and the Manila team will work on the implementation of the tests.
(rribaud) will work on the remaining SDK patch and on memfd support.
[0]
https://www.youtube.com/watch?v=MLXkBRhViS0&list=PLnpzT0InFrqADxXi_dtPqfWLt…
[1]
https://openstack.biterg.io/app/dashboards#/view/Gerrit-Backlog?_g=(filters…:'Gerrit%20Backlog%20panel%20by%20Bitergia.
',filters:!(('$state':(store:appState),meta:(alias:'Changesets%20Only',disabled:!f,index:gerrit,key:type,negate:!f,params:(query:changeset),type:phrase),query:(match:(type:(query:changeset,type:phrase)))),('$state':(store:appState),meta:(alias:Bots,disabled:!f,index:gerrit,key:author_bot,negate:!t,params:(query:!t),type:phrase),query:(match:(author_bot:(query:!t,type:phrase)))),('$state':(store:appState),meta:(alias:!n,disabled:!f,index:gerrit,key:project,negate:!f,params:(query:manila),type:phrase),query:(match_phrase:(project:manila)))),fullScreenMode:!f,options:(darkTheme:!f,useMargins:!t),query:(language:lucene,query:(query_string:(analyze_wildcard:!t,default_field:'*',query:'*',time_zone:Europe%2FMadrid))),timeRestore:!f,title:'Gerrit%20Backlog',viewMode:view)
[2] https://etherpad.opendev.org/p/flamingo-ptg-manila-cephfs
[3] https://bugs.launchpad.net/manila/+bug/2049538
[4] https://github.com/nfs-ganesha/nfs-ganesha/issues/1227
[5] https://tracker.ceph.com/issues/69214
[6] https://tracker.ceph.com/issues/67323
[7] https://review.opendev.org/c/openstack/manila-specs/+/942694
[8] https://etherpad.opendev.org/p/share-encryption-with-barbican-secret-ref
[9] https://etherpad.opendev.org/p/flamingo-ptg-manila
Thank you everyone that participated on the PTG!
Best regards,
carloss
8 months, 2 weeks
Re: [nova] Image Encryption patch
by Sean Mooney
On 14/08/2025 15:09, Dan Smith wrote:
>>> One of the things that is not supported in your series is direct booting
>> of an encrypted image.
> I could be wrong, but I think this is just a simplistic read of the first addition in the patch. AFAIK, the direct-boot abort is already in the tree, and they are just adding an additional check for the new key id parameter to mirror the same (existing) behavior. That is, of course, fine.
yes, it is just an extension of that, but you should be able to use it for
the "boot from volume from image" workflow, no?
we had a very long conversation about local image encryption and why you
were not ok with breaking the workflow of creating a vm,
modifying it to customize it and then creating a snapshot and booting
additional vms.
if the snapshot you are taking is of the boot volume and you don't
support that workflow as well, then we have a conflict between
the requirements for both features.
if something is taking a snapshot of a data volume and uploading it as
an image, that is different, as that data volume is presumably not marked as
bootable anyway in cinder.
blocking boots using the encrypted image for local storage is totally valid,
as we have not implemented that in nova yet, but i would expect the BFV case
to work.
>
>> In April 2024 we had a cross project session with Nova and Glance at the PTG [4]!
>> There was a big discussion around the encryption format initiated by Dan Smith (Nova). He proposed to move away from GPG and use LUKS instead because this would streamline it with existing functionality and formats that are already available in Nova and Cinder.
>> Due to this proposal from Nova, we agreed to discard our existing patchsets [5] and rewrite our image encryption feature with new patchsets [6] with LUKS as the encryption format, as suggested by Dan Smith (Nova).
>> We also talked specifically about the cryptographic key differentiation (hexlify vs. non-hexlify) which materialized in the os-brick change that you mentioned.
> Yep, this and the rest of your history summary matches my recollection as well.
that is all fine and it more or less aligns with my recollection of this
too, however it misses the point that feature proposals, be they tracked by a
blueprint or a spec, are only accepted for a given release
and need to be explicitly proposed again for the next cycle if they are
not complete. so even if it was accepted as a specless blueprint or an
actual spec in 2024 on the nova side, it would still
need one for this cycle. approval for dalmatian 2024.2 expired at the
start of the 2025.1 cycle.
>
> I know I've been on the hook to review this stuff and just keep getting pulled in different directions on more important stuff. My apologies, but there are some pretty important things up for review right now (like eventlet removal). Your patch to use brick for the passphrase extraction seems like a fine thing to merge at this point, especially because the earlier we merge it the better from the compatibility point of view. I'll try to make time today to look at it in detail.
>
> --Dan
>
4 months, 2 weeks
[nova][ptg] 2025.1 Epoxy PTG summary
by Sylvain Bauza
(resending the email as the previous one was blocked due to an attached
etherpad backup text file larger than the max size)
Hey all,
First, thanks for having joined us if you were in the vPTG. We had 15-20
people every day for our nova sessions, I was definitely happy to see new
folks :-)
If you want to see our PTG etherpad, please look at
https://etherpad.opendev.org/p/r.4f297ee4698e02c16c4007f7ee76b7c1 instead
of the main nova etherpad, as I don't want the etherpad to end up with a
wrong translation or with some paragraphs removed.
As I say every cycle, just take a coffee (or a tea) now as the summary will
be large.
### Dalmatian retrospective and Epoxy planning ###
6 of 15 approved blueprints were eventually implemented. We also merged
more than 31 bugfixes during Dalmatian.
We agreed to announce on the IRC channel when we have meetings for
discussing some feature series (like the one we did every week for the
manila/virtiofs series) and to provide public invitations. We could do
this again this cycle for other features; we'll see.
We will also try to have a periodic integration-compute job that pulls OSC
and SDK from master.
Our Epoxy deadlines will be: two spec review days (R-16, R-2), a soft spec
approval freeze by R-16 and then a hard spec approval freeze by R-12. That
means that contributors really need to provide their specs before
mid-December. Bauzas (me) will add these deadlines to the Epoxy schedule:
https://releases.openstack.org/epoxy/schedule.html
### vTPM live migration ###
We agreed on the fact that a vTPM live-migration feature is a priority for
Epoxy given Windows 11.
artom will create a spec proposing an image metadata property saying 'do I
want to share my secret with the nova service user?' and also providing a
new `nova-manage image_property set migratable_something` command so
operators could update the existing instances for getting the Barbican
secrets, if the operators really want that.
### Unified limits wrap-up ###
We already have two changes that need to be merged before we can modify the
default quota driver (in order to default to unified limits). We agreed
to review both patches (one for treating unset limits as unlimited, the
other about adding a nova-manage command for automatically creating nova
limits), but we also discussed a later patch that would eventually
say which nova resources need to be set (so we *have to* enforce
them anyway). melwitt agreed to work on that later patch.
### per-process health checks ###
We already had one series and we discussed it again. Gibi agreed to take
it over and he will re-propose the existing spec as it is. We also
discussed the first checks we would have, like RPC failures and DB
connection issues; we'll review those when they are in Gerrit.
### sustainable computing (a.k.a. power mgmt) ###
When someone (I won't say who [1]) implemented power management in
Antelope, this was nice, but we eventually found a long list of bugs that we
fixed. Since we don't really want to reproduce that experience, we had a
kind of post-mortem where we eventually agreed on two things that could
avoid reproducing that problem: a weekly periodic job will run the whitebox
tempest plugin [2],
with nova-compute restarts also covered by a whitebox tempest test.
Nobody has committed to those two actions yet, but we hope to identify
someone soon.
As a side note, gibi mentioned RAPL MSR support [3], notifying us that we
would have to support that in a later release (as the libvirt
implementation is not merged yet)
### nvidia's vGPU vfio-pci variant driver support ###
Long story short, the linux kernel removed a feature in release
5.18 (IOMMU backend support for vfio-mdev), and this impacted the nvidia
driver, which now detects that and creates vfio-pci devices instead of
vfio-mdev devices (as vGPUs). This has a dramatic impact on Nova as we
relied on the vfio-mdev framework for abstracting virtual GPUs. By the next
release, Nova will need to inventory the GPUs by instead looking at SR-IOV
virtual functions which are specific to the nvidia driver (we call them
vfio-pci variant driver resources).
The nova PTG session focused on the required efforts to do so. We agreed on
the fact it will require operators to propose different flavors for vGPU
where they would require distinct resource classes (all but VGPU).
Fortunately, we'll reuse existing device_spec PCI config options [4] where
the operator would define custom resource classes which would match the PCI
addresses of the nvidia-generated virtual functions (don't freak out, we'll
also write documentation). We'll create another device type (something like
type-VF-migratable) for describing such specific nvidia VFs.
Accordingly the generated domain XML will correctly write the device
description (amending the "managed=no" flag for that device).
There will be an upgrade impact: existing instances will need to be resized
to that new flavor (or instances will need to be shelved, updated for
changing the embedded flavor and unshelved).
In order to be on par with existing vGPU features, we'll also need to
implement vfio-pci live-migration by detecting the VF type on the existing
SRIOV live-migration.
Since that effort is quite large, bauzas will set up a subteam of
interested parties to help him implement all of those bits in the
short timeframe of one upstream cycle.
### Graceful shutdowns ###
A common pitfall reported by tobias-urdin is when you want to stop
nova-compute services. In general, before stopping the service, we should
be sure that all RPC calls are done, which means we would no longer accept
RPC calls after asking nova-compute to stop and would just await the
current calls to finish before stopping the service. For that, we need to
create a backlog spec for discussing that design, and we would also need to
modify oslo.service to unsubscribe from the RPC topics. Unfortunately, this
cycle we won't have any contributor working on it, but gibi could try
to at least document this.
### horizon-nova x-p session ###
We mostly discussed the Horizon feature gaps [5]. The first priority would
be for Horizon to use the OpenStack SDK instead of novaclient, and then to
support all of the new Nova API microversions. Unfortunately, we are not
sure that we will have Horizon contributors who could fix those, but if
you're a contributor and you want to help make Horizon better, maybe you
could do this? If so, please ping me.
### Ironic-nova x-p session ###
We didn't really have topics for this x-p session. We just quickly
discussed some points, like graphical console support. Nothing really worth
noting, maybe just that it would be nice if we could have a readonly
graphical console. We were just happy to say that the ironic driver now
works better thanks to some features that were merged in the last cycles.
Kudos to those who did them.
### HPC/AI optimized hypervisor "slices" ###
A large topic to explain; I'll try to keep it short. Basically, how Nova
slices the NUMA affinity between guests is nice but hard for HPC use cases,
where sometimes you need finer control over how to slice the NUMA-dependent
devices depending on the various PCI topologies. Eventually, we agreed on
a POC that johnthetubaguy could work on by trying to implement a
specific virt driver that would do something different from the existing
NUMA affinities.
### Cinder-nova x-p session ###
Multiple topics were discussed there. First, abishop wanted to enhance
cinder's retyping of in-use boot volumes, which means that the Nova
os-attachments API needs to get a new parameter. We said that he needs to
create a new spec, and we agreed that the cinder contributors need to
discuss with QEMU folks to understand the qemu write behaviour.
We also discussed a new nova spec about adding burst length
support to Cinder QoS [6]. We said that both nova and cinder need to
review this spec.
About leftover residues when detaching a volume, we also agreed that
this is not a security flaw and that os-brick should delete them,
not nova (even if nova needs to ask os-brick to look at that, either by a
periodic run or when attaching/detaching). whoami-rajat will provide a spec
for it.
### Python 3.13 support ###
We discussed a specific issue for py3.13: the crypt module is no longer in
the stdlib for py3.13, which impacts nova due to some usage in the
nova.virt.disk.api module for passing an admin password for file injection.
Given file injection is deprecated, we have three possibilities: either
removing admin password file injection (or even file injection as a whole),
adding the new separate crypt package to upper-constraints, or using the
oslo_utils.secretutils module. bauzas (me) will send an email to
openstack-discuss asking operators whether they are OK with deprecating
file injection or just admin password injection, and then we'll see the
direction. bauzas or sean-k-mooney will also try to have py3.13 non-voting
jobs for unit tests/functional tests.
### Eventlet removal steps in Nova ###
I won't explain why we need to remove eventlet, you already know, right?
We rather discussed the details in our nova components, including
nova-api, nova-compute and other nova services. We agreed to remove
direct eventlet imports where possible, move nova entrypoints that don't
use eventlet to separate modules that don't monkeypatch the stdlib, look at
what we can do with all our scatter_gather methods (which asynchronously
call the cells DBs) so they use threads instead, and check whether those
calls block on the DB (and not on the MQ side). Gibi will shepherd that
effort and provide an audit of the eventlet usage in order to avoid any
unexpected but unfortunate late discoveries.
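For the scatter/gather piece, the thread-based shape under discussion is
roughly the standard pattern below (a sketch only, not the actual nova code;
query_cell_db() is a placeholder for the per-cell DB call):
```
# Sketch only: fan per-cell DB queries out to native threads instead of
# eventlet greenthreads, with a timeout so one slow cell cannot block all.
import concurrent.futures

CELL_TIMEOUT = 10  # seconds, illustrative only

def query_cell_db(cell):
    raise NotImplementedError("placeholder for the per-cell DB query")

def scatter_gather(cells):
    results = {}
    with concurrent.futures.ThreadPoolExecutor(
            max_workers=max(len(cells), 1)) as pool:
        futures = {pool.submit(query_cell_db, cell): cell for cell in cells}
        done, not_done = concurrent.futures.wait(futures, timeout=CELL_TIMEOUT)
        for fut in done:
            results[futures[fut]] = fut.result()
        for fut in not_done:
            # a real implementation would also have to deal with the
            # still-running query; here we just record the timeout
            results[futures[fut]] = "did-not-respond"
    return results
```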
### Libvirt image backend refactor ###
If you like spaghetti, you should pay attention to the libvirt image
backend code. Lots of assumptions and conditionals make any change to that
module hard to write and hard to review, leading to error-prone
situations like the ones we had when fixing some recent CVEs.
We all agreed on the quite urgent necessity to refactor that code, and
melwitt proposed a multi-stage effort for that. We agreed on the proposal
for the first two steps with some comments, leading to future revisions of
the proposal's patches. The crucial bit of the refactor is test
coverage.
### IOThreads tuning for libvirt instances ###
An old spec was already proposed for defining iothreads for guests. We
agreed to revive that spec, where a config option would define either no
iothread or one iothread per instance (with the potential for a later option
value of "one iothread per disk"). Depending on whether
emulator_thread_policy
is provided in the flavor/image, we would pin the iothread per that policy
or we would let the iothread float over the shared CPU set. If no shared
CPUs are configured but the operator wants iothreads, nova-compute would
refuse to start. lajoskatona will work on such an implementation, which will
be designed in a blueprint that doesn't require a spec.
### OpenAPI schemas progress ###
Nothing specific to say here, bauzas and gmann will review the series this
cycle.
That's it. I'm gone, I'm dead [7] (a cyclist metaphor) but I eventually
skimmed the very large nova etherpad. Of course, there is a 99% chance that
I wrote some notes incorrectly, so please correct me if I'm wrong; I won't
feel offended, just tired.
Thanks all (and I hope your coffee or tea was good)
-Sylvain
[1] https://geek-and-poke.com/geekandpoke/2013/11/24/simply-explained
[2] https://opendev.org/openstack/whitebox-tempest-plugin
[3] https://www.qemu.org/docs/master/specs/rapl-msr.html
[4]
https://docs.openstack.org/nova/latest/configuration/config.html#pci.device…
[5] https://etherpad.opendev.org/p/horizon-feature-gap#L69
[6] https://review.opendev.org/c/openstack/nova-specs/+/932653
[7] https://www.youtube.com/watch?v=HILcYXf8yqc
1 year, 2 months
Re: [watcher] 2025.2 Flamingo PTG summary
by Sean Mooney
On 17/04/2025 13:17, Dmitriy Rabotyagov wrote:
>> well gnocchi is also not a native OpenStack telemetry datastore, it left
>> our community to pursue its own goals and is now a third party datastore
>> just like Grafana or Prometheus.
> Yeah, well, true. Is still somehow treated as the "default" thing with
> Telemetry, likely due to existing integration with Keystone and
> multi-tenancy support. And beyond it - all other options become
> opinionated too fast - ie, some do OpenTelemetry, some do Zabbix,
> VictoriaMetrics, etc. As pretty much from what I got as well, is that
> still relies on Ceilometer metrics?
> And then Prometheus is obviously not the best storage for them, as it
> requires to have pushgatgeway, and afaik prometheus maintainers are
> strictly against "push" concept to it and treat it as conceptually
> wrong (on contrary to OpenTelemetry).
i don't know the details but i know there is work planned for native
support of a
Prometheus scrape endpoint in ceilometer,
so while you currently need to use sg-core to provide that integration there
is a plan to remove the need for sg-core going forward.
https://etherpad.opendev.org/p/r.72ac6a7268e4b9d854f75715adede80c#L28
i don't see a spec proposed yet but there is an older one from 2 years ago
https://review.opendev.org/c/openstack/telemetry-specs/+/845485/4/specs/zed…
there is also a plan to provide keystone integration and multi-tenancy
https://etherpad.opendev.org/p/r.72ac6a7268e4b9d854f75715adede80c#L84
> So the metric timestamp issue is
> to remain unaddressed.
> So that's why I'd see leaving Gnocchi as "base" implementation might
> be valuable (and very handy for us, as we don't need to implement a
> prometheus job specifically for Watcher).
watcher, aodh, and cloudkitty i believe all have some level of support for
Prometheus but they can also use other backends. i'm not sure what level
of enablement they have in osa.
>
>> but for example watcher can integrate with both ironic an canonical maas
> component
>> to do some level of host power management.
> That sounds really interesting... We do maintain infrastructure using
> MAAS and playing with such integration will be extremely interesting.
> I hope I will be able to get some time for this though...
the current maas integration has 3 problems: 1) a lack of testing, 2) a
lack of documentation,
and 3) it somehow managed to introduce asyncio in a project that uses
eventlet, in
a release of eventlet that did not support asyncio,
so i'm very nervous that it is broken or will break in the future.
this is the entirety of the support:
https://review.opendev.org/c/openstack/watcher/+/898790
there are no docs and no spec...
so this should definitely be considered "experimental" at best today.
>
> Thu, 17 Apr 2025 at 13:52, Sean Mooney <smooney(a)redhat.com>:
>>
>> On 16/04/2025 21:04, Dmitriy Rabotyagov wrote:
>>> Hey,
>>>
>>> Have a comment on one AI from the list.
>>>
>>>> AI: (jgilaber) Mark Monasca and Grafana as deprecated, unless
>>> someone steps up to maintain them, which should include a minimal CI
>>> job running.
>>>
>>> So eventually, on OpenStack-Ansible we were planning to revive the
>>> Watcher role support to the project.
>>> How we usually test deployment, is by spawning an all-in-one
>>> environment with drivers and executing a couple of tempest scenarios
>>> to ensure basic functionality of the service.
>>>
>>> With that, having a native OpenStack telemetry datastore is very
>>> beneficial for such goal, as we already do maintain means for spawning
>>> telemetry stack. While a requirement for Prometheus will be
>>> unfortunate for us at least.
>>>
>>> While I was writing that, I partially realized that testing Watcher on
>>> all-in-one is pretty much impossible as well...
>>>
>> you can certainly test some of watcher with an all-in-one deployment
>>
>> i.e. the apis, and you can use the dummy test strategies.
>>
>> but ya, in general, like nova, you need at least 2 nodes to be able to test
>> it properly, ideally 3
>>
>> so that if you're doing a live migration there is actually a choice of host.
>>
>> in general however watcher, like heat, just asks nova to actually move the vms.
>>
>> sure, it will ask nova to move it to a specific host, but fundamentally if
>> you have
>>
>> tested live migration with nova via tempest separately there is no reason
>> to expect
>>
>> it would not work for live migration triggered by watcher or heat or
>> anything else that
>>
>> just calls nova's api.
>>
>> so you could still get some valuable testing in an all-in-one but ideally
>> there would be at least 2 compute hosts.
>>
>>
>>> But at the very least, I can propose looking into adding an OSA job
>>> with Gnocchi as NV to the project, to show the state of the deployment
>>> with this driver.
>>>
>> well gnocchi is also not a native OpenStack telemetry datastore, it left
>> our community to pursue its own goals and is now a third party datastore
>>
>> just like Grafana or Prometheus.
>>
>> monasca is currently marked as inactive
>> https://review.opendev.org/c/openstack/governance/+/897520 and is in the
>> process of being retired.
>>
>> but it also has no testing on the watcher side, so the combination of the
>> two is why we are deprecating it going forward.
>>
>> if both change, i'm happy to see the support continue.
>>
>> Gnocchi has testing but we are not actively working on extending its
>> functionality going forward.
>>
>> as long as it continues to work i see no reason to change its support
>> status.
>>
>> watcher has quite a lot of untested integrations which is unfortunate
>>
>> we are planning to build out a feature/test/support matrix in the docs
>> this cycle
>>
>> but for example watcher can integrate with both ironic and the canonical
>> maas component
>>
>> to do some level of host power management. none of that is tested and we
>> are likely going
>>
>> to mark them as experimental and reflect on if we can continue to
>> support them or not going forward.
>>
>> it also has the ability to do cinder storage pool balancing which is i
>> think also untested right now.
>>
>> one of the things we hope to do is extend the existing testing in our
>> current jobs to cover gaps like
>>
>> that where it is practical to do so. but creating a devstack plugin to
>> deploy maas with fake infrastructure
>>
>> is likely a lot more than we can do with our existing contributors so
>> expect that to go to experimental then
>>
>> deprecated and finally it will be removed if no one turns up to support it.
>>
>> ironic is in the same boat however there are devstack jobs with fake
>> ironic nodes so i
>>
>> could see a path to us having an ironic job down the line. it's just not
>> high on our current priority
>>
>> list to address the support status or testing of this currently.
>>
>> eventlet removal and other techdebt/community goals are definitely higher
>> but i hope the new support/testing
>>
>> matrix will at least help folks make informed decisions on what features
>> to use and what backends are
>>
>> recommended going forward.
>>
>>> On Wed, 16 Apr 2025, 21:53 Douglas Viroel, <viroel(a)gmail.com> wrote:
>>>
>>> Hello everyone,
>>>
>>> Last week's PTG had very interesting topics. Thank you all that
>>> joined.
>>> The Watcher PTG etherpad with all notes is available here:
>>> https://etherpad.opendev.org/p/apr2025-ptg-watcher
>>> Here is a summary of the discussions that we had, including the
>>> great cross-project sessions with Telemetry, Horizon and Nova team:
>>>
>>> Tech Debt (chandankumar/sean-k-mooney)
>>> =================================
>>> a) Croniter
>>>
>>> * Project is being abandoned as per
>>> https://pypi.org/project/croniter/#disclaimer
>>> * Watcher uses croniter to calculate a new schedule time to run
>>> an audit (continuous). It is also used to validate cron like
>>> syntax
>>> * Agreed: replace croniter with appscheduler's cron methods.
>>> * *AI*: (chandankumar) Fix in master branch and backport to 2025.1
>>>
>>> b) Support status of Watcher Datasources
>>>
>>> * Only Gnocchi and Prometheus have CI job running tempest tests
>>> (with scenario tests)
>>> * Monaska is inactive since 2024.1
>>> * *AI*: (jgilaber) Mark Monasca and Grafana as deprecated,
>>> unless someone steps up to maintain them, which should include
>>> a minimal CI job running.
>>> * *AI*: (dviroel) Document a support matrix between Strategies
>>> and Datasources, which ones are production ready or
>>> experimental, and testing coverage.
>>>
>>> c) Eventlet Removal
>>>
>>> * Team is going to look at how the eventlet is used in Watcher
>>> and start a PoC of its removal.
>>> * Chandan Kumar and dviroel volunteer to help in this effort.
>>> * Planned for 2026.1 cycle.
>>>
>>> Workflow/API Improvements (amoralej)
>>> ==============================
>>> a) Actions states
>>>
>>> * Currently Actions updates from Pending to Succeeded or Failed,
>>> but these do not cover some important scenarios
>>> * If an Action's pre_conditions fails, the action is set to
>>> FAILED, but for some scenarios, it could be just SKIPPED and
>>> continue the workflow.
>>> * Proposal: New SKIPPED state for action. E.g: In a Nova
>>> Migration Action, if the instance doesn't exist in the source
>>> host, it can be skipped instead of fail.
>>> * Proposal: Users could also manually skip specific actions from
>>> an action plan.
>>> * A skip_reason field could also be added to document the reason
>>> behind the skip: user's request, pre-condition check, etc.
>>> * *AI*: (amoralej) Create a spec to describe the proposed changes.
>>>
>>> b) Meaning of SUCCEEDED state in Action Plan
>>>
>>> * Currently means that all actions are triggered, even if all of
>>> them fail, which can be confusing for users.
>>> * Docs mention that SUCCEEDED state means that all actions have
>>> been successfully executed.
>>> * *AI*: (amoralej) Document the current behavior as a bug
>>> (Priority High)
>>> o done: https://bugs.launchpad.net/watcher/+bug/2106407
>>>
>>> Watcher-Dashboard: Priorities to next release (amoralej)
>>> ===========================================
>>> a) Add integration/functional tests
>>>
>>> * Project is missing integration/functional tests and a CI job
>>> running against changes in the repo
>>> * No general conclusion and we will follow up with Horizon team
>>> * *AI*: (chandankumar/rlandy) sync with Horizon team about
>>> testing the plugin with horizon.
>>> * *AI*: (chandankumar/rlandy) devstack job running on new
>>> changes for watcher-dashboard repo.
>>>
>>> b) Add parameters to Audits
>>>
>>> * It is missing on the watcher-dashboard side. Without it, it is
>>> not possible to define some important parameters.
>>> * Should be addressed by a blueprint
>>> * Contributors to this feature: chandankumar
>>>
>>> Watcher cluster model collector improvement ideas (dviroel)
>>> =============================================
>>>
>>> * Brainstorm ideas to improve watcher collector process, since
>>> we still see a lot of issues due to outdated models when
>>> running audits
>>> * Both scheduled model update and event-based updates are
>>> enabled in CI today
>>> * It is unknown the current state of event-based updates from
>>> Nova notification. Code needs to be reviewed and
>>> improvements/fixes can be proposed
>>> o e.g:
>>> https://bugs.launchpad.net/watcher/+bug/2104220/comments/3
>>> - We need to check if we are processing the right
>>> notifications of if is a bug on Nova
>>> * Proposal: Refresh the model before running an audit. A rate
>>> limit should be considered to avoid too many refreshments.
>>> * *AI*: (dviroel) new spec for cluster model refresh, based on
>>> audit trigger
>>> * *AI:* (dviroel) investigate the processing of nova events in
>>> Watcher
>>>
>>> Watcher and Nova's visible constraints (dviroel)
>>> ====================================
>>>
>>> * Currently, Watcher can propose solutions that include server
>>> migrations that violate some Nova constraints like:
>>> scheduler_hints, server_groups, pinned_az, etc.
>>> * In Epoxy release, Nova's API was improved to also show
>>> scheduler_hints and image_properties, allowing external
>>> services, like watcher, to query and use this information when
>>> calculating new solutions.
>>> o https://docs.openstack.org/releasenotes/nova/2025.1.html#new-features
>>> * Proposal: Extend compute instance model to include new
>>> properties, which can be retrieved via novaclient. Update
>>> strategies to filter invalid migration destinations based on
>>> these new properties.
>>> * *AI*: (dviroel) Propose a spec to better document the
>>> proposal. No API changes are expected here.
>>>
>>> Replacement for noisy neighbor policy (jgilaber)
>>> ====================================
>>>
>>> * The existing noisy neighbor strategy is based on L3 Cache
>>> metrics, which is not available anymore, since the support for
>>> it was dropped from the kernel and from Nova.
>>> * In order to keep this strategy, new metrics need to be
>>> considered: cpu_steal? io_wait? cache_misses?
>>> * *AI*: (jgilaber) Mark the strategy as deprecated during this cycle
>>> * *AI*: (TBD) Identify new metrics to be used
>>> * *AI*: (TBD) Work on a replacement for the current strategy
>>>
>>>
>>> Host Maintenance strategy new use case (jeno8)
>>> =====================================
>>>
>>> * New use case for Host Maintenance strategy: instance with
>>> ephemeral disks should not be migrated at all.
>>> * Spec proposed:
>>> https://review.opendev.org/c/openstack/watcher-specs/+/943873
>>> o New action to stop instances when both live/cold migration
>>> are disabled by the user
>>> * *AI*: (All) Review the spec and continue with discussion there.
>>>
>>> Missing Contributor Docs (sean-k-mooney)
>>> ================================
>>>
>>>     * Doc missing: scope of the project, e.g.:
>>>       https://docs.openstack.org/nova/latest/contributor/project-scope.html
>>>     * *AI*: (rlandy) Create a project-scope doc for Watcher
>>>     * Doc missing: PTL guide, e.g.:
>>>       https://docs.openstack.org/nova/latest/contributor/ptl-guide.html
>>>     * *AI*: (TBD) Create a PTL guide for the Watcher project
>>>     * Document: when to create a spec vs. blueprint vs. bug
>>>     * *AI*: (TBD) Create a doc section describing the process based
>>>       on what is being modified in the code.
>>>
>>> Retrospective
>>> ==========
>>>
>>> * The DPL approach seems to be working for Watcher
>>> * New core members added: sean-k-mooney, dviroel, marios and
>>> chandankumar
>>> o We plan to add more cores in the next cycle, based on
>>> reviews and engagement.
>>>         o We plan to remove members who have not been active in the
>>>           last two cycles (starting with 2026.1)
>>> * A new datasource was added: Prometheus
>>> * Prometheus job now also runs scenario tests, along with Gnocchi.
>>>     * We triaged all old bugs from Launchpad
>>> * Needs improvement:
>>>         o the current team is still learning the details of the
>>>           code; much of the historical knowledge was lost with the
>>>           previous maintainers
>>>         o the core team still needs to grow
>>>         o we need to focus on creating stable releases
>>>
>>>
>>> Cross-project session with Horizon team
>>> ===============================
>>>
>>>     * Combined session with the Telemetry and Horizon teams, focused
>>>       on how to provide tenant and admin dashboards to visualize
>>>       metrics.
>>>     * The Watcher team presented some ideas for new panels for both
>>>       admins and tenants, and sean-k-mooney raised a discussion
>>>       about frameworks that could be used to implement them
>>> * Use-cases that were discussed:
>>>         o a) Admins would benefit from a visualization of the
>>>           infrastructure utilization (real usage metrics), so they
>>>           can identify bottlenecks and plan optimizations
>>>         o b) A tenant would like to view their workload performance,
>>>           checking the real CPU/RAM/disk usage of instances, to
>>>           properly adjust their resource allocation.
>>>         o c) An admin user of the Watcher service would like to
>>>           visualize metrics generated by Watcher strategies, such as
>>>           the standard deviation of host metrics.
>>>     * sean-k-mooney presented an initial PoC of what a Hypervisor
>>>       Metrics dashboard could look like.
>>> * Proposal for next steps:
>>>         o start a new horizon plugin as an official deliverable of
>>>           the telemetry project
>>>         o still unclear which framework to use for building charts
>>>         o the dashboard will integrate with Prometheus as the metric
>>>           store
>>>         o it is expected that only short-term metrics will be
>>>           supported (7 days)
>>> o python-observability-client will be used to query Prometheus
>>>
>>>
>>> Cross-project session with Nova team
>>> =============================
>>>
>>>     * sean-k-mooney led topics on how to evolve Nova to better
>>>       assist other services, like Watcher, in taking actions on
>>>       instances. The team agreed on a proposal to use the existing
>>>       metadata API to annotate an instance's supported lifecycle
>>>       operations. This information is very useful for improving
>>>       Watcher's strategy algorithms. Some examples of such instance
>>>       metadata could be (see the sketch after the links below):
>>> o lifecycle:cold-migratable=true|false
>>> o ha:maintenance-strategy:in_place|power_off|migrate
>>>     * It was discussed that Nova could infer which operations are
>>>       valid or not, based on information like the virt driver,
>>>       flavor, image properties, etc. This feature was initially
>>>       named 'instance capabilities' and will require a spec for
>>>       further discussion.
>>>     * Another topic of interest, also raised by Sean, was adding new
>>>       standard traits to resource providers, like PRESSURE_CPU and
>>>       PRESSURE_DISK. These traits can be used to weight hosts when
>>>       placing new VMs. Either Watcher or the libvirt driver could
>>>       annotate them, but the team generally agreed that the libvirt
>>>       driver is preferred here.
>>> * More info at Nova PTG etherpad [0] and sean's summary blog [1]
>>>
>>> [0] https://etherpad.opendev.org/p/r.bf5f1185e201e31ed8c3adeb45e3cf6d
>>> [1] https://www.seanmooney.info/blog/2025.2-ptg/#watcher-topics
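>>>
>>> A minimal sketch of how the lifecycle annotations above could be
>>> set and read through the existing server metadata API with
>>> python-novaclient; the keys come from the examples in this session,
>>> but nothing here is a settled design:
>>>
>>> ```
>>> def annotate_lifecycle(nova, server_id):
>>>     # Keys/values taken from the PTG examples above; purely illustrative.
>>>     nova.servers.set_meta(server_id, {
>>>         "lifecycle:cold-migratable": "false",
>>>         "ha:maintenance-strategy": "power_off",
>>>     })
>>>
>>>
>>> def is_cold_migratable(nova, server_id):
>>>     server = nova.servers.get(server_id)
>>>     # Metadata values are strings; assume migratable unless annotated.
>>>     return server.metadata.get("lifecycle:cold-migratable", "true") == "true"
>>> ```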
>>>
>>>
>>> Please let me know if I missed something.
>>> Thanks!
>>>
>>> --
>>> Douglas Viroel - dviroel
>>>