openstack-discuss search results for query "#eventlet-removal"
openstack-discuss@lists.openstack.org - 149 messages
Re: [watcher] 2025.2 Flamingo PTG summary
by Sean Mooney
On 16/04/2025 21:04, Dmitriy Rabotyagov wrote:
>
> Hey,
>
> Have a comment on one AI from the list.
>
> > AI: (jgilaber) Mark Monasca and Grafana as deprecated, unless
> someone steps up to maintain them, which should include a minimal CI
> job running.
>
> So eventually, on OpenStack-Ansible we were planning to revive the
> Watcher role support to the project.
> How we usually test deployment, is by spawning an all-in-one
> environment with drivers and executing a couple of tempest scenarios
> to ensure basic functionality of the service.
>
> With that, having a native OpenStack telemetry datastore is very
> beneficial for such goal, as we already do maintain means for spawning
> telemetry stack. While a requirement for Prometheus will be
> unfortunate for us at least.
>
> While I was writing that, I partially realized that testing Watcher on
> all-in-one is pretty much impossible as well...
>
You can certainly test some of Watcher with an all-in-one deployment,
i.e. the APIs, and you can use the dummy test strategies.
But yes, in general, like Nova, you need at least 2 nodes to be able to test
it properly, ideally 3,
so that if you are doing a live migration there is actually a choice of host.
In general, however, Watcher, like Heat, just asks Nova to actually move the VMs.
Sure, it will ask Nova to move it to a specific host, but fundamentally, if
you have tested live migration with Nova via Tempest separately, there is no
reason to expect it would not work for a live migration triggered by Watcher,
Heat, or anything else that just calls Nova's API.
So you could still get some valuable testing in an all-in-one, but ideally
there would be at least 2 compute hosts.
> But at the very least, I can propose looking into adding an OSA job
> with Gnocchi as NV to the project, to show the state of the deployment
> with this driver.
>
Well, Gnocchi is also not a native OpenStack telemetry datastore; it left
our community to pursue its own goals and is now a third-party datastore,
just like Grafana or Prometheus.
Monasca is currently marked as inactive
https://review.opendev.org/c/openstack/governance/+/897520 and is in the
process of being retired.
But it also has no testing on the Watcher side, so the combination of the
two is why we are deprecating it going forward.
If both change, I'm happy to see the support continue.
Gnocchi has testing, but we are not actively working on extending its
functionality going forward.
As long as it continues to work, I see no reason to change its support
status.
Watcher has quite a lot of untested integrations, which is unfortunate.
We are planning to build out a feature/test/support matrix in the docs
this cycle.
For example, Watcher can integrate with both Ironic and the Canonical MAAS
component to do some level of host power management. None of that is tested,
and we are likely going to mark them as experimental and reflect on whether
we can continue to support them going forward.
It also has the ability to do Cinder storage pool balancing, which I think
is also untested right now.
One of the things we hope to do is extend the existing testing in our
current jobs to cover gaps like that, where it is practical to do so.
But creating a DevStack plugin to deploy MAAS with fake infrastructure is
likely a lot more than we can do with our existing contributors, so expect
that to go to experimental, then deprecated, and finally be removed if no
one turns up to support it.
Ironic is in the same boat; however, there are DevStack jobs with fake
Ironic nodes, so I could see a path to us having an Ironic job down the
line. It is just not high on our current priority list to address the
support status or testing of this right now.
Eventlet removal and other tech debt/community goals are definitely higher,
but I hope the new support/testing matrix will at least help folks make
informed decisions on which features to use and which backends are
recommended going forward.
>
> On Wed, 16 Apr 2025, 21:53 Douglas Viroel, <viroel(a)gmail.com> wrote:
>
> Hello everyone,
>
> Last week's PTG had very interesting topics. Thank you all that
> joined.
> The Watcher PTG etherpad with all notes is available here:
> https://etherpad.opendev.org/p/apr2025-ptg-watcher
> Here is a summary of the discussions that we had, including the
> great cross-project sessions with Telemetry, Horizon and Nova team:
>
> Tech Debt (chandankumar/sean-k-mooney)
> =================================
> a) Croniter
>
> * Project is being abandoned as per
> https://pypi.org/project/croniter/#disclaimer
> * Watcher uses croniter to calculate a new schedule time to run
> an audit (continuous). It is also used to validate cron like
> syntax
> * Agreed: replace croniter with apscheduler's cron methods.
> * *AI*: (chandankumar) Fix in master branch and backport to 2025.1
>
> b) Support status of Watcher Datasources
>
> * Only Gnocchi and Prometheus have CI job running tempest tests
> (with scenario tests)
> * Monasca is inactive since 2024.1
> * *AI*: (jgilaber) Mark Monasca and Grafana as deprecated,
> unless someone steps up to maintain them, which should include
> a minimal CI job running.
> * *AI*: (dviroel) Document a support matrix between Strategies
> and Datasources, which ones are production ready or
> experimental, and testing coverage.
>
> c) Eventlet Removal
>
> * Team is going to look at how the eventlet is used in Watcher
> and start a PoC of its removal.
> * Chandan Kumar and dviroel volunteer to help in this effort.
> * Planned for 2026.1 cycle.
>
> Workflow/API Improvements (amoralej)
> ==============================
> a) Actions states
>
> * Currently, Actions update from Pending to Succeeded or Failed,
> but these states do not cover some important scenarios
> * If an Action's pre_conditions fails, the action is set to
> FAILED, but for some scenarios, it could be just SKIPPED and
> continue the workflow.
> * Proposal: New SKIPPED state for action. E.g: In a Nova
> Migration Action, if the instance doesn't exist in the source
> host, it can be skipped instead of fail.
> * Proposal: Users could also manually skip specific actions from
> an action plan.
> * A skip_reason field could also be added to document the reason
> behind the skip: user's request, pre-condition check, etc.
> * *AI*: (amoralej) Create a spec to describe the proposed changes.
>
> b) Meaning of SUCCEEDED state in Action Plan
>
> * Currently means that all actions are triggered, even if all of
> them fail, which can be confusing for users.
> * Docs mention that SUCCEEDED state means that all actions have
> been successfully executed.
> * *AI*: (amoralej) Document the current behavior as a bug
> (Priority High)
> o done: https://bugs.launchpad.net/watcher/+bug/2106407
>
> Watcher-Dashboard: Priorities to next release (amoralej)
> ===========================================
> a) Add integration/functional tests
>
> * Project is missing integration/functional tests and a CI job
> running against changes in the repo
> * No general conclusion and we will follow up with Horizon team
> * *AI*: (chandankumar/rlandy) sync with Horizon team about
> testing the plugin with horizon.
> * *AI*: (chandankumar/rlandy) devstack job running on new
> changes for watcher-dashboard repo.
>
> b) Add parameters to Audits
>
> * It is missing on the watcher-dashboard side. Without it, it is
> not possible to define some important parameters.
> * Should be addressed by a blueprint
> * Contributors to this feature: chandankumar
>
> Watcher cluster model collector improvement ideas (dviroel)
> =============================================
>
> * Brainstorm ideas to improve watcher collector process, since
> we still see a lot of issues due to outdated models when
> running audits
> * Both scheduled model update and event-based updates are
> enabled in CI today
> * The current state of event-based updates from Nova
> notifications is unknown. Code needs to be reviewed and
> improvements/fixes can be proposed
> o e.g:
> https://bugs.launchpad.net/watcher/+bug/2104220/comments/3
> - We need to check if we are processing the right
> notifications or if it is a bug in Nova
> * Proposal: Refresh the model before running an audit. A rate
> limit should be considered to avoid too many refreshes.
> * *AI*: (dviroel) new spec for cluster model refresh, based on
> audit trigger
> * *AI:* (dviroel) investigate the processing of nova events in
> Watcher
>
> Watcher and Nova's visible constraints (dviroel)
> ====================================
>
> * Currently, Watcher can propose solutions that include server
> migrations that violate some Nova constraints like:
> scheduler_hints, server_groups, pinned_az, etc.
> * In Epoxy release, Nova's API was improved to also show
> scheduler_hints and image_properties, allowing external
> services, like watcher, to query and use this information when
> calculating new solutions.
> o https://docs.openstack.org/releasenotes/nova/2025.1.html#new-features
> * Proposal: Extend compute instance model to include new
> properties, which can be retrieved via novaclient. Update
> strategies to filter invalid migration destinations based on
> these new properties.
> * *AI*: (dviroel) Propose a spec to better document the
> proposal. No API changes are expected here.
>
> Replacement for noisy neighbor policy (jgilaber)
> ====================================
>
> * The existing noisy neighbor strategy is based on L3 cache
> metrics, which are not available anymore, since the support for
> them was dropped from the kernel and from Nova.
> * In order to keep this strategy, new metrics need to be
> considered: cpu_steal? io_wait? cache_misses?
> * *AI*: (jgilaber) Mark the strategy as deprecated during this cycle
> * *AI*: (TBD) Identify new metrics to be used
> * *AI*: (TBD) Work on a replacement for the current strategy
>
>
> Host Maintenance strategy new use case (jeno8)
> =====================================
>
> * New use case for Host Maintenance strategy: instance with
> ephemeral disks should not be migrated at all.
> * Spec proposed:
> https://review.opendev.org/c/openstack/watcher-specs/+/943873
> o New action to stop instances when both live/cold migration
> are disabled by the user
> * *AI*: (All) Review the spec and continue with discussion there.
>
> Missing Contributor Docs (sean-k-mooney)
> ================================
>
> * Doc missing: Scope of the project, e.g:
> https://docs.openstack.org/nova/latest/contributor/project-scope.html
> * *AI*: (rlandy) Create a scope of the project doc for Watcher
> * Doc missing: PTL Guide, e.g:
> https://docs.openstack.org/nova/latest/contributor/ptl-guide.html
> * *AI*: (TBD) Create a PTL Guide for Watcher project
> * Document: When to create a spec vs blueprint vs bug
> * *AI*: (TBD) Create a doc section to describe the process based
> on what is being modified in the code.
>
> Retrospective
> ==========
>
> * The DPL approach seems to be working for Watcher
> * New core members added: sean-k-mooney, dviroel, marios and
> chandankumar
> o We plan to add more cores in the next cycle, based on
> reviews and engagement.
> o We plan to remove members not active in the last 2 cycles
> (starting at 2026.1)
> * A new datasource was added: Prometheus
> * Prometheus job now also runs scenario tests, along with Gnocchi.
> * We triaged all old bugs from launchpad
> * Needs improvement:
> o current team is still learning about details in the code,
> much of the historical knowledge was lost with the
> previous maintainers
> o core team still needs to grow
> o we need to focus on creating stable releases
>
>
> Cross-project session with Horizon team
> ===============================
>
> * Combined session with Telemetry and Horizon team, focused on
> how to provide a tenant and an admin dashboard to visualize
> metrics.
> * Watcher team presented some ideas of new panels for both admin
> and tenants, and sean-k-mooney raised a discussion about
> frameworks that can be used to implement them
> * Use-cases that were discussed:
> o a) Admin would benefit from a visualization of the
> infrastructure utilization (real usage metrics), so they
> can identify bottlenecks and plan optimization
> o b) A tenant would like to view their workload performance,
> checking real usage of cpu/ram/disk of instances, to
> properly adjust their resource allocation.
> o c) An admin user of watcher service would like to
> visualize metrics generated by watcher strategies like
> standard deviation of host metrics.
> * sean-k-mooney presented an initial PoC of what a Hypervisor
> Metrics dashboard would look like.
> * Proposal for next steps:
> o start a new horizon plugin as an official deliverable of
> telemetry project
> o still unclear which framework to use for building charts
> o dashboard will integrate with Prometheus, as metric store
> o it is expected that only short term metrics will be
> supported (7 days)
> o python-observability-client will be used to query Prometheus
>
>
> Cross-project session with Nova team
> =============================
>
> * sean-k-mooney led topics on how to evolve Nova to better
> assist other services, like Watcher, to take actions on
> instances. The team agreed on a proposal of using the existing
> metadata API to annotate instance's supported lifecycle
> operations. This information is very useful to improve
> Watcher's strategies' algorithms. Some examples of instance
> metadata could be:
> o lifecycle:cold-migratable=true|false
> o ha:maintenance-strategy:in_place|power_off|migrate
> * It was discussed that Nova could infer which operations are
> valid or not, based on information like: virt driver, flavor,
> image properties, etc. This feature was initially named
> 'instance capabilities' and will require a spec for further
> discussions.
> * Another topic of interest, also raised by Sean, was about
> adding new standard traits to resource providers, like
> PRESSURE_CPU and PRESSURE_DISK. These traits can be used to
> weight hosts when placing new VMs. Watcher and the libvirt
> driver could work on annotating them, but the team generally
> agreed that the libvirt driver is preferred here.
> * More info at Nova PTG etherpad [0] and sean's summary blog [1]
>
> [0] https://etherpad.opendev.org/p/r.bf5f1185e201e31ed8c3adeb45e3cf6d
> [1] https://www.seanmooney.info/blog/2025.2-ptg/#watcher-topics
>
>
> Please let me know if I missed something.
> Thanks!
>
> --
> Douglas Viroel - dviroel
>
3 months, 3 weeks
[nova][ptg] 2025.2 Flamingo PTG summary
by Rene Ribaud
Hello everyone,
Last week was the PTG—thank you to those who joined! I hope you enjoyed it.
I haven’t gathered exact attendance stats, but it seemed that most sessions
had at least around 15 participants, with some peaks during the cross-team
discussions.
If you’d like to take a closer look, here’s the link to the PTG etherpad:
https://etherpad.opendev.org/p/r.bf5f1185e201e31ed8c3adeb45e3cf6d
We had a pretty full agenda for Nova, so here’s a summary I’ve tried to
keep as short as possible.
#### 2025.1 Epoxy Retrospective ####
17 specs were accepted, and 12 implemented — an excellent ratio. This
represents a clear improvement over previous cycles.
Virtiofs was successfully merged, unblocking other work and boosting
contributor motivation.
✅ We agreed to maintain regular status updates via the etherpad and follow
up during Nova meetings.
API Microversions & Tempest Coverage: several microversions were merged
with good structure.
However, some schema changes were not reflected in Tempest, causing
downstream blockers.
Also, the updates covered by the microversions were not propagated into the
SDK and OpenStack client.
✅ Ensure client-side features (e.g., server show) are also published and
tracked.
✅ Keep microversions isolated and document Tempest implications clearly in
specs.
✅ Raise awareness of the tempest-with-latest-microversion job during Nova
meetings.
✅ Monitor OpenAPI efforts in Nova, which may allow offloading schema checks
from Tempest in the future.
Eventlet Removal: progress is behind schedule, especially compared to other
projects like Neutron.
✅ Flag this as a priority area for upcoming cycles.
Review Process & Tracking: spec review days were difficult to coordinate,
and the status etherpad was often outdated.
✅ Encourage active contributors to support occasional contributors during
review days.
✅ Commit to keeping the etherpad current throughout the cycle.
#### 2025.2 Flamingo Planning ####
Timeline:
Soft spec freeze (no new specs): June 1st
Hard spec freeze (M2): July 3rd
Feature Freeze (FF): August 28th
Final release: late September / early October
✅ We agreed to officially adopt June 1st as the soft freeze date, based on
the successful approach in Epoxy.
✅ A spec review day will be scheduled around mid-June; it will be announced
early to ensure participation.
✅ Uggla will update the schedule document with the proposed milestones.
#### Upstream Bug Triage ####
We acknowledged that active bug triage has slowed down, resulting in a
backlog increase (~150 untriaged bugs).
There is a consensus that triage remains important to maintain a clear
picture of the actual bug landscape.
✅ Trial a new approach: review some untriaged bugs at the end of Nova team
meetings.
✅ Process the list by age (starting with the newest or most-voted first).
#### Closing Old Bugs ####
A proposal was made to bulk-close bugs older than 2 years, with a
respectful and explanatory message, aiming to reduce backlog and improve
visibility.
However, multiple voices expressed strong reservations.
✅Take no action for now. Focus efforts on triaging new bugs first.
✅ If we successfully reduce the number of untriaged new bugs, we can
consider scrubbing the bug backlog and possibly closing some of the older
ones.
#### Preparation for Python 3.13 ####
While Python 3.13 is not mandatory for 2025.2, early compatibility work was
discussed due to known issues (e.g., eventlet is broken on 3.13, as
observed on Ubuntu 25.04).
Ubuntu 24.04 and CentOS Stream 10 will stay on 3.12 for their supported
lifespans.
A non-voting unit test job for Python 3.13 (openstack-tox-py313) has
already been added and is currently passing.
Introducing a functional job for 3.13 could be a good next step, if
resources allow.
✅ Gibi will track this as part of the broader eventlet removal work.
#### Confidential Computing Feature Planning ####
AMD SEV is already supported in Nova.
SEV-ES is implemented in libvirt and work is ongoing in Nova.
SEV-SNP is now supported in libvirt (v10.5.0). Work in Nova has not started
yet.
✅ Pay closer attention to SEV-ES reviews to help move this forward.
✅ Tkajinam will write a new spec for SEV-SNP.
Intel TDX
Kernel support is nearly ready (expected in 6.15).
Libvirt patches exist, but feature is not yet upstreamed or widely released.
✅ No action agreed yet, as this remains exploratory.
Arm CCA
No hardware is available yet; earliest expected in April 2027 (Fujitsu
Monaka).
Support in libvirt, QEMU, and Linux kernel is still under development.
✅ The use case is reasonable, but too early to proceed — we should wait
until libvirt and QEMU support is mature.
✅ It would be beneficial to wait for at least one Linux distribution to
officially support Arm CCA, allowing real-world testing.
✅ Attestation support for Arm is seen as external to Nova, with only minor
flags possibly needed in the guest.
#### RDT / MPAM Feature Discussion ####
RDT (Intel PQoS) and MPAM (Arm equivalent) aim to mitigate “noisy neighbor”
issues by allocating cache/memory bandwidth to VMs.
Development has stalled since 2019, primarily due to:
- Lower priority for contributors
- Lack of customer demand
- Infrastructure complexity (NUMA modeling, placement limitations)
✅ r-taketn to reopen and revise the original spec, showing a clear diff to
the previous version.
✅ Ensure that abstractions are generic, not tied to proprietary technology,
using libvirt + resource classes/traits may provide enough flexibility.
#### vTPM Live Migration ####
A spec for vTPM live migration was approved in Epoxy:
https://specs.openstack.org/openstack/nova-specs/specs/2025.1/approved/vtpm…
To
support live-migratable vTPM-enabled instances, Barbican secrets used for
vTPM need to be owned by Nova, rather than the end user.
This shift in ownership allows Nova to access the secret during live
migration operations.
Opt-in is handled via image property or flavor extra spec, meaning user
consent is explicitly required.
Current Proposal to enable this workflow:
- Castellan should allow per-call configuration for sending the service
token (rather than relying on a global all-or-nothing setting).
Proposal: https://review.opendev.org/c/openstack/castellan/+/942015
- If the Nova service token is present, Barbican should set the secret
owner to Nova.
Proposal: https://review.opendev.org/c/openstack/barbican/+/942016
This workflow ensures Nova can read/delete the secret during lifecycle
operations like migration, without involving the user.
A question was raised around possible co-ownership between Nova and the end
user (e.g., both having access to the secret). While this may be
interesting longer-term, current implementation assumes a single owner.
✅ User and host modes are as described in the spec.
For deployment mode, Nova will:
- Authenticate to Barbican as itself (using a service token).
- Own the vTPM secret it creates — it will be able to create, read, and
delete it.
- The user will not see or control the secret, including deletion.
- The secret will be visible to other members of the Nova service project
by default, but this could be restricted in future via Barbican ACLs to
limit visibility to Nova only.
#### Cloud Hypervisor Integration ####
There is an ongoing effort to integrate Cloud Hypervisor into Nova via the
Libvirt driver:
Spec: https://review.opendev.org/c/openstack/nova-specs/+/945549
The current PoC requires only minor changes to work with Libvirt, and the
team is ready to present the proposal at the PTG.
✅ We’re happy with the direction the spec is taking. Below are some key
highlights regarding the spec.
✅ Clarify platform support (e.g., is libvirt compiled with cloud hypervisor
support by default? Is it available in distros?).
✅ Provide a plan for runtime attach of multiple NICs and volumes.
✅ Mark as experimental if cloud hypervisor is not yet in upstream distro
packages.
✅ Ensure that the following features are expected to work and covered in
the spec: resize, migrate, rebuild, evacuate, snapshot.
✅ Justify raw-only image support, and outline the path to qcow2
compatibility.
#### vGPU (mdev) and PCI SR-IOV Topics ####
1. Live-migratable flag handling (physical_network tag)
Bug: https://bugs.launchpad.net/nova/+bug/2102161
✅ We agreed that the current behavior is correct and consistent with the
intention:
If live_migratable = false → fallback to hotplug during live migration.
If live_migratable = true on both source and destination → prefer
transparent live migration.
✅ Investigate how Neutron might participate by requesting live-migratable
ports.
2. Preemptive live migration failure for non-migratable PCI devices
Nova currently checks for migratability during scheduling and conductor
phases. There’s a proposal to move these checks earlier, possibly to the
API level.
Bug: https://bugs.launchpad.net/nova/+bug/2103631
✅ Confirm with gmann whether a microversion is needed — likely not, as
return codes are already supported (202 → 400/409).
✅ Uggla may submit a small spec to formalize this change.
✅ Split the work into two steps:
- Fix existing bug (can be backported).
- Incrementally move other validations earlier in the flow.
3. PCI SR-IOV: Unify the Live Migration Code Path
There’s agreement on the need to reduce technical debt by refactoring the
current dual-code-path approach into a unified model for PCI live migration.
✅ A dedicated spec is needed to clarify and unify PCI claiming and
allocation.
✅ This refactor should address PCI claiming and allocation, potentially
deprecating or replacing move_claim in favor of more robust DB-backed logic.
✅ This effort is directly related to point 1 (migratability awareness) and
will help ensure consistent resource management across the codebase.
#### SPICE VDI – Next Steps ####
There is an ongoing effort to enhance libvirt domain XML configuration for
desktop virtualization use cases (e.g. SPICE with USB and sound
controllers). Some patches were proposed but not merged in time for Epoxy.
Mikal raised the question of whether a new spec would be required in
Flamingo, which would be the third iteration of this work.
The team also raised concern about the complexity of adding traits (e.g.
os-traits) for relatively simple additions, due to the multi-step process
involved (traits patch, release, requirements update, etc.).
✅ Proceed with a specless blueprint.
✅ Plan to pull os-traits and os-resource-classes logic into Placement, to
simplify the integration process and reduce friction. Link the required
Placement version in Nova documentation accordingly. This is a strategic
direction, even if some traits might still be shared with Neutron/Cinder.
#### Virtiofs Client Support ####
The virtiofs server-side support was merged in Epoxy, but SDK and
client-side support did not make it in time. The proposal is to merge both
patches early in Flamingo and then backport to Epoxy.
✅ No concern with microversion usage here.
✅The ordering of microversion support patches across Nova, SDKs, and
clients will be handled by respective owners.
✅ Uggla to track that each new microversion in Nova has a corresponding
patch in SDK/client layers.
✅ Not directly related to virtiofs, but the new reset-state confirmation
prompt in the client was noted and welcomed.
#### One-Time-Use (OTU) Devices ####
OTU devices are designed to be consumed once and then unreserved.
There is a need to provide practical guidance on handling these cleanly,
especially in notification-driven environments.
Additionally, there's an important patch related to Placement behavior on
over-capacity nodes:
https://review.opendev.org/c/openstack/placement/+/945465
Placement currently blocks new allocations on over-capacity nodes — even if
the new allocation reduces usage. This breaks migration from overloaded
hosts. The proposed fix allows allocations that do not worsen usage (or that
improve it).
Note: A similar OTU device handling strategy has been successfully used in
Ironic.
✅ Provide an example script or tool for external OTU device cleanup, based
on notifications.
✅ Agreement on the proposed Placement fix — it is operator-friendly and
resolves real issues in migration workflows.
✅ We likely need to dig deeper into implementation and tooling for broader
OTU support.
#### Glance cross-project session ####
Please look at the Glance summary.
#### Secure RBAC – Finalization Plan ####
Tobias raised concerns about incomplete secure RBAC support in Nova,
particularly around default roles and policy behavior. Much of the
groundwork has been done, but a number of patches still require review and
finalization.
✅ Gmann will continue working on the outstanding patches during the
Flamingo cycle. The objective is to complete secure RBAC support in Nova as
part of this cycle.
#### Image Properties Handling – DB Schema & API Response ####
The issue arises from discrepancies between image property metadata stored
by Nova and what is received from Glance. Nova’s DB schema enforces a
255-character limit on metadata keys and values, which can lead to silent
truncation or hard failures (e.g., when prefixing keys like image_ pushes
the total length over 255).
Nova stopped supporting custom image properties nearly a decade ago, when
the system moved to structured objects (ImageMetaProps via OVO).
Glance still allows some custom metadata, which may be passed through to
Nova.
This led to invalid or non-standard keys (e.g.,
owner_specified.openstack.sha256) being stored or exposed, even though they
are not part of Nova’s supported set.
Consensus emerged that we are facing two issues:
- Nova's API may expose more metadata than it should (from Glance).
- Nova stores non-standard or overly long keys/values, resulting in silent
truncation or hard DB errors.
✅ Nova should stop storing non-standard image properties altogether.
✅ A cleanup plan should be created to remove existing unused or invalid
metadata from Nova's database post-upgrade.
✅ During instance.save(), Nova should identify and delete unused image_*
keys from the system metadata table.
✅ We must be cautious to preserve snapshot-related keys that are valid but
not part of the base ImageMetaProps.
✅ These changes are considered bugfixes and can proceed without a new spec.
#### Eventlet removal ####
Please read the excellent blog post series from Gibi here:
https://gibizer.github.io/posts/Eventlet-Removal-Flamingo-PTG/
#### Enhanced Granularity and Live Application of QoS ####
This was the first cross-team Neutron/Cinder/Nova topic.
Bloomberg folks presented early ideas around making QoS settings more
granular and mutable, and potentially applicable to existing ports or VMs,
not just at creation time.
Nova does not operate on multiple instances at once, which conflicts with
some proposed behaviors (e.g., live update of QoS on a network/project
level).
QoS is currently exposed via flavors in Nova, and is only supported on the
frontend for the Libvirt driver.
QoS mutability is non-trivial, with implications for scheduling, resource
modeling, and placement interactions.
The scope is broad and would require cross-project collaboration (Neutron,
Cinder, Placement).
Use cases and notes from Bloomberg:
https://etherpad.opendev.org/p/OpenStack_QoS_Feature_Enhancement_Discussion
✅ Flavor-based modeling for QoS remains the Nova approach.
✅ Nova should not apply policies across many instances simultaneously.
✅ A spec will be required, especially if new APIs or behavior modifications
for existing VMs are introduced. The spec should provide concrete use case
examples and API design proposals, including expected behavior during
lifecycle operations (resize, rebuild, shelve, etc.).
✅ Max bandwidth adjustments may be possible (as they don’t require
reservations), but broader mutability is more complex.
✅ Neutron and Cinder raised no objections regarding Bloomberg’s use cases
and proposals. However, please look at Neutron and Cinder's respective
summaries.
#### Moving TAP Device Creation from Libvirt to os-vif ####
This change proposes moving the creation of TAP devices from the Libvirt
driver into os-vif, making it more consistent and decoupled. However, it
introduces upgrade and timing considerations, especially regarding Neutron
and OVN behavior.
Bug: https://bugs.launchpad.net/nova/+bug/2073254
Patch: https://review.opendev.org/c/openstack/nova/+/942786
✅ Neutron team is open to adjusting the timing of the "port ready" event,
which could eliminate the need for Nova-side hacks.
✅ Sean will proceed with the patch and verify behavior through CI.
#### Instance Annotations, Labels & K8s-Like Semantics ####
Sean proposed introducing a mechanism similar to Kubernetes annotations and
labels in Nova, to:
- Express user intent regarding instance behavior (e.g., "should this
instance be migrated?")
- Convey lifecycle preferences to external tools like Watcher and Masakari
- Expose capabilities or constraints of an instance (e.g., "cannot be
shelved because it has a vTPM")
Proposed Examples of Instance Annotations:
lifecycle:live-migratable=true|false
ha:role=primary|secondary
These would be:
- Set by users (or operators)
- Optionally inherited from flavors (but conflicts would raise 400 Bad
Request)
- Expressed intent only — not enforcement of policy
In addition, labels generated by Nova could reflect actual capabilities,
like:
lifecycle:live-migratable=false if an instance has a PCI device
lifecycle:shelvable=false if it uses vTPM
✅ Define a new API to expose capabilities of instances (e.g., “can this
instance be live-migrated?”)
Values will be derived by Nova based on configuration/hardware and exposed
via nova server show.
✅ Sean will create a spec.
✅ Looking at user-defined labels, we eventually considered defining a
second API for them to express scheduling/HA preferences.
However, we decided the current preferred approach is to start with the
metadata API and evolve to a first-class model.
We may need admin-only metadata (e.g., for HA tooling like Masakari) this
has been discussed in Admin-Only Instance Metadata / Annotations later
point.
✅ Sean will also create a spec for this.
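As a purely illustrative sketch (the lifecycle:* keys above are proposals
from the session, not an implemented Nova contract), such intent could be
recorded today through the existing metadata API, e.g.:
    openstack server set --property lifecycle:live-migratable=false my-server
External tools such as Watcher or Masakari could then read that key back
from the server metadata when deciding whether to move the instance.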
#### External Traits and Node Pressure Metrics ####
Sean also proposed allowing external systems (e.g., Watcher, telemetry
agents) to annotate compute nodes with traits such as memory/cpu/io
pressure, based on /proc/pressure.
Examples:
CUSTOM_MEM_PRESSURE=high
EXTERNAL_IO_PRESSURE=moderate
✅ Support a COMPUTE_MEM_PRESSURE-like trait, populated from sysfs as static
info (not dynamic).
✅ A weigher could use these traits to influence placement. A default traits
list could be configured (e.g., prefer/avoid hosts with certain pressures
or hardware features). This approach could evolve into a generic “preferred
traits” weigher, similar to Kubernetes taints/tolerations.
✅ Sean will create a dedicated spec for this feature.
✅ Sbauza volunteered to help, especially as the work aligns with weigher
logic from the previous cycle.
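For context, here is a minimal Python sketch of how an external agent might
derive a coarse pressure level from PSI data; it assumes a PSI-enabled
kernel and the standard /proc/pressure line format, and the thresholds and
level names are made up for illustration. The resulting level could then be
reported as a CUSTOM_* trait on the compute node's resource provider, for
example via the osc-placement "openstack resource provider trait set"
command (note that command replaces the full trait list, so existing traits
must be re-specified).

def pressure_level(resource="memory", high=40.0, moderate=10.0):
    # Read the "some" line, e.g. "some avg10=0.00 avg60=0.00 avg300=0.00 total=0"
    with open(f"/proc/pressure/{resource}") as f:
        fields = f.readline().split()
    values = dict(kv.split("=") for kv in fields[1:])
    avg10 = float(values["avg10"])
    if avg10 >= high:
        return "high"      # e.g. report CUSTOM_MEM_PRESSURE_HIGH
    if avg10 >= moderate:
        return "moderate"  # e.g. report CUSTOM_MEM_PRESSURE_MODERATE
    return "low"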
#### OpenAPI Schema Integration ####
Stephen highlighted that most of the heavy lifting for OpenAPI support is
now complete, and the work is down to pure response schema definitions.
This effort spans over three cycles now, and it would be valuable to
finalize it early in Flamingo.
✅ We'll formalize this work with a blueprint.
✅ The goal is to make early progress in Flamingo, ideally with a dedicated
review day.
✅ Stephen is happy to join synchronous review sessions and will coordinate
pings for progress.
✅ Masahito volunteered to help with the remaining work.
#### OpenStack SDK & Client Workflows ####
Stephen raised a few concerns regarding timing mismatches between SDK/OSC
freezes and microversion patch merges in Nova.
Some microversion support landed too late to be integrated in the SDK
before the Epoxy freeze.
Patches were sometimes missed due to lack of "depends-on" links or broken
initial submissions.
✅ Uggla will follow up and finalize these patches early in the Flamingo
cycle.
#### Upstream Testing for PCI Passthrough and mdev Devices ####
With IGB support merged in Epoxy, and vIOMMU enabled in some Vexxhost
workers (thanks to dansmith), the opportunity exists to expand PCI testing
upstream in Tempest.
This would also benefit testing of one-time-use (OTU) devices.
Finalizing mtty testing is a priority, as it helps ensure device support is
consistent and regressions (like bug #2098892) are caught early.
✅ Bauzas will lead on wrapping up mtty testing.
✅ Gibi will coordinate with cloud providers to assess Epoxy support and
revisit this topic during the next PTG if needed.
#### CPU Power Management – Expected Behavior ####
Melanie raised questions about inconsistencies between design and
implementation in Nova’s CPU power management logic. In particular:
- CPUs were being offlined too aggressively, sometimes during reboot or
migration operations.
- This contradicts the intent that only unassigned or deallocated cores
should be powered off.
There was confusion between two approaches:
- Aggressive power-down of unused CPUs during all idle states (stop,
shelve, etc.)
- Conservative behavior, powering off cores only when the VM is deleted or
migrated away
Consensus favored the aggressive-but-safe model:
- Power down cores only when not used, e.g., VM is stopped or migrated.
- Be cautious not to power off cores prematurely (e.g., during reboot or
verify-resize).
✅ Do not rush to power off CPU cores at compute startup or mid-operation.
✅ Revisit the implementation so the resource tracker runs first, and
determines actual core assignments before making decisions.
#### Live Migration with Encrypted Volumes (Barbican Integration) ####
HJ-KIM raised the point that Nova does not currently support live migration
of instances using encrypted Cinder volumes managed by Barbican. This is a
critical blocker in environments with strict compliance requirements.
✅ This is a parallel issue to vTPM support. We will learn from the vTPM
implementation and consider applying similar concepts.
✅ A future solution may involve adjusting how ownership is managed, or
providing scoped access via ACLs.
✅ Further discussion/spec work will be needed once an implementation
direction is clearer.
#### Manila–Nova Cross-Team Integration ####
The initial Manila–Nova integration is now merged — thanks to everyone
involved!
The next step is to:
- Add automated testing (currently manual tests only).
- Start with a few basic positive and negative test scenarios (create,
attach, write, delete; snapshot and restore; rule visibility; restricted
deletion; etc.).
Additionally, longer-term features and improvements are being considered;
please look at the etherpad.
✅ We will work on tempest tests.
✅ We will continue enhancing Nova–Manila integration during Flamingo (F)
and beyond.
✅ Uggla will submit a spec as needed to land memfd support.
#### Provider Traits Management via provider.yaml ####
📌 Spec: https://review.opendev.org/c/openstack/nova-specs/+/937587
Problem: Traits defined in provider.yaml are added to Placement but never
removed if deleted from the file.
✅ Implement a mechanism where Nova copies the applied file to
/var/lib/nova/applied_provider.yaml, and diffs it with the active one on
restart.
This would allow traits (and possibly other config) to be safely
removed.
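A rough sketch of the diffing step, assuming the applied-copy path above and
a simplified provider.yaml layout (providers with identification and
traits.additional); this is only an illustration of the idea, not the
proposed implementation:

import yaml

def traits_by_provider(path):
    with open(path) as f:
        doc = yaml.safe_load(f) or {}
    result = {}
    for prov in doc.get("providers", []):
        ident = prov.get("identification", {})
        key = ident.get("uuid") or ident.get("name")
        result[key] = set(prov.get("traits", {}).get("additional", []))
    return result

def removed_traits(applied_path, active_path):
    # Traits present in the previously applied file but missing from the
    # active file should be removed from Placement on restart.
    applied = traits_by_provider(applied_path)
    active = traits_by_provider(active_path)
    return {rp: applied[rp] - active.get(rp, set()) for rp in applied}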
#### Admin-Only Instance Metadata / Annotations ####
📌 Spec: https://review.opendev.org/c/openstack/nova-specs/+/939190
Issue: Current instance metadata is user-owned, and shouldn't be used by
admins.
Proposal: Introduce admin-only annotations (or metadata with ownership
tracking), allowing operators to set system-visible metadata without
violating user intent.
✅ Introduce a created_by field (similar to locked_by) to track who created
metadata: user vs admin.
Consider an admin: prefix namespace for admin-controlled keys (applied to
annotations or metadata).
Implementation requires a DB change and a nova-spec.
Note: This aligns well with broader annotation work already discussed in
this cycle.
#### delete_on_terminate for Ports (Server Create / Network Attach APIs)
####
📌 Related discussion:
https://review.opendev.org/c/openstack/nova-specs/+/936990
Background: This was discussed in past PTGs. Currently, delete_on_terminate
can't be updated dynamically across instance lifetime.
✅ A spec with a working PoC will help clarify the desired behavior and
unblock the discussion.
Long-term solution may require storing this flag in Neutron as a port
property (rather than Nova-specific DB).
#### Graceful Shutdown of Nova Compute Services ####
📌 Spec: https://review.opendev.org/c/openstack/nova-specs/+/937185
Challenge: Need a mechanism to drain compute nodes gracefully before
shutdown, without interrupting active workloads or migrations.
Graceful shutdown is tricky in the presence of live migrations.
Ideas include:
- Temporary “maintenance mode” (block write requests).
- Group-level compute draining.
✅ The topic is important but not urgent — bandwidth is limited.
Note: Eventlet removal may simplify implementing this.
✅ Please report concrete bugs so we understand the blockers.
✅ A nova-spec with PoC would help drive the conversation.
#### Libvirt/QEMU Attributes via Flavor Extra Specs ####
Target: Advanced tuning of I/O performance via iothreads and virtqueue
mapping, based on:
https://developers.redhat.com/articles/2024/09/05/scaling-virtio-blk-disk-i…
✅ Introduce new flavor extra specs such as:
- hw:io_threads=4
- hw:blk_multiqueue=2
These can be added to both flavor and image properties.
✅ A nova-spec should be written to document naming and semantics.
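Once the spec lands, usage would presumably follow the normal extra-spec
workflow; the property names below are the proposed ones and may still
change:
    openstack flavor set --property hw:io_threads=4 \
        --property hw:blk_multiqueue=2 io-tuned-flavor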
#### Dynamic Modification of libvirt Domain XML (Hook Proposal) ####
oVirt allows for plugins to alter the libvirt domain XML just before
instance launch (via VDSM hooks).
Nova does not offer a mechanism to intercept or modify the domain XML, and
the design explicitly avoids this.
The desired use case involves injecting configuration that libvirt cannot
currently represent, for example, enabling multiuser SPICE consoles.
✅ This proposal is explicitly rejected.
✅ Nova will not support hook points for permuting libvirt XML.
✅ Operators may use out-of-band libvirt/qemu hooks at their own risk, but
should not expect upstream support or stability guarantees.
#### Revisiting the "No More API Proxies" Rule ####
Masahito proposed allowing users to filter instances via API based on
related service data, such as network_id.
✅ The "no API proxy" rule remains, but with pragmatic exceptions:
- Filtering is acceptable if the data exists in Nova’s DB (e.g., network
ID, image ID).
- No cross-service REST calls allowed (e.g., Neutron QoS types still out of
scope).
- Filtering by network_id in nova list is reasonable and can proceed.
✅ Masahito will provide a spec.
#### OVN Migration & Port Setup Timing ####
📌 Context: https://bugs.launchpad.net/nova/+bug/2073254
In OVN-based deployments, Neutron signals the network-plugged event too
early, before the port is fully set up. This causes issues in live
migration, especially under load.
✅ Nova already supports waiting on the network-plugged event. OVN in Ubuntu
Noble should behave properly.
A proposal to improve timing in Neutron was discussed (Neutron to wait for
port claim in southbound DB).
Nova might support this via a Neutron port hint that triggers tap interface
creation earlier during migration (pre-live-migration).
✅ Next step: open an RFE bug in Neutron. If accepted, a Nova spec may be
needed.
#### Blocking API Threads During Volume Attachments ####
📌 Context: https://bugs.launchpad.net/nova/+bug/1930406
Volume attachment RPC calls block API workers in uWSGI, leading to
starvation when multiple attachments are made in parallel.
✅ Volume/interface attachments should become async, reducing API lock
contention.
Fix is non-trivial and will require a microversion.
In the meantime, operators may tune uWSGI workers/threads or serialize
attachment calls.
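As an interim illustration only (the values are arbitrary and
deployment-specific), the worker/thread sizing can be adjusted in the
nova-api uWSGI ini, e.g.:

[uwsgi]
# illustrative sizing only; more threads per worker reduces the chance that
# all of them end up blocked on long-running attachment RPC calls
processes = 4
threads = 8
enable-threads = true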
#### Inventory Update Failure – DISK_GB Bug ####
📌 Bug: https://bugs.launchpad.net/nova/+bug/2093869
When local storage becomes temporarily unavailable (e.g., Ceph down), Nova
sends total=0 for DISK_GB, which Placement rejects if allocations exist.
✅ The real fix is to restore the storage backend.
Nova should improve error handling/logging, but should not shut down the
compute service.
#### Security Group Name Conflict Bug ####
📌 Bug: https://bugs.launchpad.net/nova/+bug/2105896
When multiple security groups share the same name (via Neutron RBAC),
instance builds can fail due to incorrect duplicate detection logic.
✅ The issue was fixed in:
https://review.opendev.org/c/openstack/nova/+/946079
✅ Fix will be reviewed and backported to Epoxy.
If you've read this far — thank you! 🙏
If you spot any mistakes or missing points, please don't hesitate to let me
know.
Best regards.
René.
3 months, 4 weeks
[manila] 2025.2 Flamingo PTG summary
by Carlos Silva
Hello Zorillas and interested stackers,
Last week's PTG had plenty of topics and good takeaways.
In case you would like to watch any of the discussions, please take a look
at the videos in the OpenStack Manila Youtube channel [0].
The PTG etherpad has all of the notes we took [9]. Here is a summary of the
discussions grouped by each topic:
Retrospective
==========
Highlights
-------------
Mid cycle alongside feature proposal freeze provided a good opportunity for
us to have collaborative review sessions and move faster on reviews.
Two bugsquashes had a good impact on the bug backlog and the bug trend was
more positive this cycle, despite the numbers growing due to the
low-hanging fruit we started reporting.
Internships with City University of Seattle, Valencia College and North
Dakota State University - they are definitely helping with progress on
manila-ui and OpenAPI. We will continue the effort.
We would like to speed up reviews and improve our metrics [1] on how long
changes are open before being merged. Review dashboards can help and we can
work with our reviewers to have a more disciplined approach on reviews.
Broken third party CI systems currently mean that we have little testing.
We need to rely on the authors or their peers to test and ensure that a
feature is working. We will look into documenting CI setup procedures and
gathering thoughts from maintainers.
New API features should be tested as early as possible to ensure they won't
break any workflows. Our contributor documentation will be updated with
extra guidelines.
AIs:
(carloss) Encourage Bug Czar candidates and bring this up more often in the
manila weekly meetings
(carloss) Encourage spec authors to schedule a meeting to discuss the spec
to speed up the review process.
(carloss) include iCal with event announcements (bugsquash / mid cycle)
(gouthamr) Creating a review dashboard
(carloss) Record "expert seminars" on FAQs: it would be great to have some
videos documenting how-tos in OpenStack and help people to unblock
themselves when they are hitting common openstack-developer issues:
https://etherpad.opendev.org/p/manila-howcasts
(carloss) communicate a deadline for the manila CLI -> OSC documentation
changes. The work with our interns should go until FPF. It needs to be done
before the client release, when we are planning to drop the manilaclient
support. ashrodri offered help to get it completed after we come to the FPF
deadline.
(carloss) We should update these docs and mention that first party driver
implementations should be done for features and be more strict about the
testing requirements.
All things CephFS [2]
================
Deprecation of standalone NFS-Ganesha
-------------------------------------------------------
We added a warning in Dalmatian and deferred plans to deprecate based on
community feedback. Our plan is to remove it in the 2026.1 release. There
is a suggested update procedure; please reach out in case there are
questions.
AI: (carloss) send a reminder email in this cycle to incentivize people to
move to clustered NFS
Supporting NFSv3 for Windows workloads
--------------------------------------------------------
manila-tempest-plugin now supports multiple NFS protocol versions in one of
the scenario tests. As soon as we get the build, we will update the CephFS
NFS job to run tests for NFSv3 as well.
Testing and stabilization
--------------------------------
Bumped Ceph version in the CI jobs to Reef in Antelope, Bobcat, Caracal,
Dalmatian. We are starting to test with Ceph Squid; we intend to test with
Squid on "master" and "stable/2025.1" (epoxy) branches.
A couple of Ceph and NFS-Ganesha issues are impacting us at the moment [4]
[5] [6] and we managed to find workarounds for some.
We needed to stop testing with the ingress daemon for now, and we will get
back to testing it as soon as the fix is out.
Manage unmanage of shares and snapshots
-----------------------------------------------------------
The feature is merged and working, and we are going to backfill tempest
test patches.
AI: (carloss) will propose a new job variant to allow testing this feature.
Plans for 2025.2 Flamingo
-----------------------------------
Investigate support for SMB/CIFS
Ceph-NFS QoS: we will follow the implementation of this feature in NFS
Ganesha and start discussing and drafting the Manila implementation when
the code is merged in Ganesha upstream.
Out of place restores and backup enhancements [7]
========================================
CERN is pursuing a backup backend with their C-Back tool. Currently Manila
backups can be restored back to the same share; there are some problems
with such an approach when the source share backend is down and with
preventing browse-by-restore behavior.
Zachary Goggins (za) proposed a specification, and plans to work on it
during the Flamingo Cycle. The share backups feature also needs some
enhancements like a get progress and get restore progress actions. Zach
plans to make it part of the implementation.
We agreed that a backup resource should have a new "state" attribute,
instead of only relying on the status in order to have well defined backup
states.
AI: (za) update the out of place restore spec.
Tech debt
=======
Container driver failures
--------------------------------
The container driver tempest tests are perma-failing right now. We seem to
have a problem with RBAC and pre-provisioned tempest credentials.
AIs:
(carloss) Report a tempest bug to track the issues;
(gouthamr) will propose a change to switch back to using dynamic
credentials in our testing.
DockerHub rate limits
-----------------------------
We are only building an image in manila-image-elements. It's more pulls
than pushes. Pushes happen very rarely. The kolla team has moved away from
DockerHub as well.
Zach offered help in case we need another approach for registry. CERN has
its own tool.
AI: we will look into moving to quay.io
"manila" CLI removal
----------------------------
We added the deprecation warning 6 releases ago and we should proceed with
the removal. We will need an additional push to update all of our
documentation examples and move to keystoneauth.
We need more functional test coverage and we should have a hackathon just
as we did some years ago.
AI: carloss will schedule a hackathon for enabling more tests and send the
removal email to openstack-discuss. We are targeting the removal to 2025.2
Flamingo.
CI and testing
------------------
ZFSOnLinux job left on jammy: We created a bug for it and we can use it for
tracking.
IPv6 testing: The BGP software we were using (quagga) is now deprecated and
everything was migrated to FRR. We will need help to fix it, as
unfortunately things didn't have a 1:1 translation between the libraries.
If someone has experience on this, it would be nice to collaborate to get
this fixed.
API
----
We are going to stop testing the v1 API and stop deploying it on DevStack
test jobs. We'll update the install guide as well that we've stopped
supporting it. It was deprecated in 2015 ("Liberty" release). That's a good
code cleanup opportunity.
V2 is an extension of v1 with microversions.
If we stop supporting it, who is affected? Mostly people that have
automations using it.
What's the impact on manila-tempest-plugin? We have v1 and v2 tests. We
have a lot of coverage for v2. If you don't have the v1 API in the cloud,
the tests refuse to run. We will need to fix it.
AIs:
Work on the removal patches during the 2025.2 Flamingo release;
(carloss) will send an announcement email to the ML, including operators
tag.
Manila UI
-------------
We have been making progress in the Manila UI feature gap. Currently
working on manage/unmanage share servers, manage share with dhss=true,
filtering user messages on date, updating quotas table.
The share limits view broke some time ago; the code lives in Horizon.
We hit some issues using Horizon's tox "runserver" environment; apparently
more people ran into the same issue. We will talk to other impacted parties
and check how to overcome this issue.
AI: (carloss) will reach out to the horizon team and ask how we can
re-introduce Manila limits to the overview tab.
Enable share encryption at-rest (back-end) with secret refs stored on
Barbican/Castellan. [8]
=====================================================================
We merged a specification some time ago with an implementation
architecture. That spec contemplated both Share encryption and Share server
encryption.
NetApp is now planning to work only on share server encryption. Encryption
can be disabled per share, but shares exported via a share server cannot
have a separate encryption key on ONTAP.
We reached an agreement that when a new share creation is triggered, if
there isn't a share server matching the provided key, a new share server
will need to be spawned. We also agreed that we should allow using names
for the secret reference for better user experience.
2025.2 Flamingo is the target release.
AIs: (kpdev/Sai) The spec will be updated and only the DHSS=True scenario
will be documented; The manila team will review the spec as soon as it is
proposed
Replication Improvements
====================
Back when we implemented replication, we didn't account for specific
configurations that the storage backends can have, for example whether the
backend could support zero RPO technologies or not.
Zero RPO is an important feature that allows data to be written
simultaneously between the share and its replicas.
We agreed that the way we should send the information to the backend is
through a backend specific share type extra spec. Administrators will be
able to define it in the share type and the backend will pick it up.
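For illustration only (the extra-spec name below is hypothetical; the real
one will be defined by the vendor and the spec), an administrator would set
something like:
    manila type-key zero-rpo-type set vendor_prefix:zero_rpo_enabled=True
and the backend driver would read that extra spec when creating replicas.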
Operator concerns / questions
=======================
Where to put parameters that change behaviour only of one protocol (NFS in
this case)? We agreed that we should have a write-once type of metadata and
not allow the metadata to be updated afterwards. A configuration option can
be introduced for this where the operator can determine what metadata will
not be updated.
AI: carthaca will propose a lite-spec for this
Lustre FS Support for HPC Use Cases in OpenStack
Is there any possibility for OpenStack to officially integrate or support
parallel file systems like Lustre, either through Manila or other
components? We've heard in the past as a request from the scientific-sig
group. Building a driver should be straightforward and it does not
necessarily need to be in-tree, and it would be easier to maintain. This is
a very good use case. This discussion will continue with the scientific-sig
group.
Replica / Snapshot Retention / Expiration Policy
While replicas in Manila are designed to be continuously in sync with the
active share, certain use cases — such as disaster recovery (DR) replicas
or manually created replicas that are no longer needed — could benefit from
lifecycle management.
Replicas are continuously synced with the source share, so the assumption
is that if they're "unused", they're still there for some reason. We had a
spec a while ago about automating snapshots (creation and deletion) on a
schedule.
It would be preferable that an external automation tool is used to achieve
such behavior. Maybe openstack/mistral can be a good approach (Support for
manila snapshots already exists on Mistral)
Affinity/Anti-affinity spec updates
=========================
This feature allows users to create share groups with affinity policies,
which determine the affinity relationship between shares within the group.
There was an open question about locking strategies. We came to an
agreement that we can use tooz, the database, or oslo.
AI: (chuanm) will update the spec.
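If tooz is chosen, the coordination pattern would roughly look like the
sketch below; the backend URL, member id and lock name are illustrative
only, not what the spec will define:

from tooz import coordination

coord = coordination.get_coordinator(
    "etcd3+http://127.0.0.1:2379", b"manila-scheduler-1")
coord.start()
# Serialize placement decisions for a given share group across schedulers.
with coord.get_lock(b"manila-share-group-affinity-<group-id>"):
    pass  # evaluate affinity/anti-affinity constraints and pick a host
coord.stop()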
Force deleting subnets
=================
This is a feature that follows the ability to add multiple subnets to a
share server. We should also be able to remove them. This spec is under
review.
We agreed that we should also implement the "check" mechanism before
deleting the subnet.
AIs: (sylvanld) will update the spec
Eventlet removal
=============
We need to remove WSGI uses and use oslo.service's new threading-based
backend instead for the ProcessLauncher and periodic tasks. Neutron is
doing some work
around periodic tasks and we can benefit from their learning.
AI: Work on this in Flamingo, aiming for completion in 2026.1 cycle.
Manila/Nova Cross-project session: VirtioFS
=================================
VirtioFS implementation is now complete and we are looking at the next
steps. We currently don't have CI testing the feature and the Manila team
is planning to work on it during the 2025.2 Flamingo release.
The nova team intends to drive the remaining SDK and OSC patches to
completion during the 2025.2 Flamingo release.
We also discussed some possible enhancements: memfd support, online attach
and detach and live migration. These will take some time and the Nova team
will work on such features gradually.
AIs: (carloss) will share the test scenarios with the Nova team and ask for
reviews and the Manila team will work on the implementation of the tests.
(rribaud) will work on the remaining SDK patch and work on memfd support.
[0]
https://www.youtube.com/watch?v=MLXkBRhViS0&list=PLnpzT0InFrqADxXi_dtPqfWLt…
[1]
https://openstack.biterg.io/app/dashboards#/view/Gerrit-Backlog?_g=(filters…:'Gerrit%20Backlog%20panel%20by%20Bitergia.
',filters:!(('$state':(store:appState),meta:(alias:'Changesets%20Only',disabled:!f,index:gerrit,key:type,negate:!f,params:(query:changeset),type:phrase),query:(match:(type:(query:changeset,type:phrase)))),('$state':(store:appState),meta:(alias:Bots,disabled:!f,index:gerrit,key:author_bot,negate:!t,params:(query:!t),type:phrase),query:(match:(author_bot:(query:!t,type:phrase)))),('$state':(store:appState),meta:(alias:!n,disabled:!f,index:gerrit,key:project,negate:!f,params:(query:manila),type:phrase),query:(match_phrase:(project:manila)))),fullScreenMode:!f,options:(darkTheme:!f,useMargins:!t),query:(language:lucene,query:(query_string:(analyze_wildcard:!t,default_field:'*',query:'*',time_zone:Europe%2FMadrid))),timeRestore:!f,title:'Gerrit%20Backlog',viewMode:view)
[2] https://etherpad.opendev.org/p/flamingo-ptg-manila-cephfs
[3] https://bugs.launchpad.net/manila/+bug/2049538
[4] https://github.com/nfs-ganesha/nfs-ganesha/issues/1227
[5] https://tracker.ceph.com/issues/69214
[6] https://tracker.ceph.com/issues/67323
[7] https://review.opendev.org/c/openstack/manila-specs/+/942694
[8] https://etherpad.opendev.org/p/share-encryption-with-barbican-secret-ref
[9] https://etherpad.opendev.org/p/flamingo-ptg-manila
Thank you to everyone who participated in the PTG!
Best regards,
carloss
3 months, 3 weeks
[nova][ptg] 2025.1 Epoxy PTG summary
by Sylvain Bauza
(resending the email as the previous one was blocked due to an attached
etherpad backup text file larger than the max size)
Hey all,
First, thanks for having joined us if you were in the vPTG. We had 15-20
people every day for our nova sessions, I was definitely happy to see new
folks :-)
If you want to see our PTG etherpad, please look at
https://etherpad.opendev.org/p/r.4f297ee4698e02c16c4007f7ee76b7c1 instead
of the main nova etherpad, as I don't want to risk the etherpad ending up
with wrong edits or having some paragraphs removed.
As I say every cycle, just take a coffee (or a tea) now as the summary will
be large.
### Dalmatian retrospective and Epoxy planning ###
6 of 15 approved blueprints were eventually implemented. We also merged
more than 31 bugfixes during Dalmatian.
We agreed to announce on the IRC channel when we hold meetings for
discussing some feature series (like the one we did every week for the
manila/virtiofs series) and to send out public invitations. We could do
this again this cycle for other features, we'll see.
We will also try to have a periodic integration-compute job that pulls OSC
and SDK from master.
Our Epoxy deadlines will be: two spec review days (R-16, R-2), a soft spec
approval freeze by R-16 and then hard spec approval freeze by R-12. That
means that contributors really need to provide their specs before
mid-December. Bauzas (me) will add these deadlines into the Epoxy schedule:
https://releases.openstack.org/epoxy/schedule.html
### vTPM live migration ###
We agreed on the fact that a vTPM live-migration feature is a priority for
Epoxy given Windows 11.
artom will create a spec proposing an image metadata property saying 'do I
want to share my secret with the nova service user?' and also providing a
new `nova-manage image_property set migratable_something` command so
operators could migrate the existing instances for getting the Barbican
secrets, if the operators really want that.
### Unified limits wrap-up ###
We already have two changes needing to be merged before we can modify the
default quota driver (in order to default to unified limits). We agreed
on reviewing both patches (one for treating unset limits as unlimited, the
other about adding a nova-manage command for automatically creating nova
limits), but we also discussed a later patch that would eventually
say which nova resources need to be set (so we *have to* enforce
them anyway). melwitt agreed to work on that later patch.
### per-process health checks ###
We already had one series and we discussed it again. Gibi agreed to take
it over and he will re-propose the existing spec as it is. We also
discussed the first checks we would have, like RPC failures and DB
connection issues; we'll review those when they are in Gerrit.
### sustainable computing (a.k.a. power mgmt) ###
When someone (I won't say who [1]) implemented power management in
Antelope, this was nice but we eventually found a long list of bugs that we
fixed. Since we don't really want to reproduce that experience, we had a
kind of post-mortem where we eventually agreed on two things that could
avoid reproducing that problem: a weekly periodic job will run the whitebox
tempest plugin [2], with nova-compute restarts also covered by a whitebox
tempest plugin.
Nobody has committed to those two actions yet, but we hope to identify
someone soon.
As a side note, gibi mentioned RAPL MSR support [3], notifying us that we
would have to support that in a later release (as the libvirt
implementation is not merged yet)
### nvidia's vGPU vfio-pci variant driver support ###
Long story short, since the Linux kernel removed a feature in release
5.18 (IOMMU backend support for vfio-mdev), this impacted the nvidia driver,
which now detects that and creates vfio-pci devices instead of
vfio-mdev devices (as vGPUs). This has a dramatic impact on Nova as we
relied on the vfio-mdev framework for abstracting virtual GPUs. By the next
release, Nova will need to inventory the GPUs by instead looking at SR-IOV
virtual functions which are specific to the nvidia driver (we call them
vfio-pci variant driver resources).
The nova PTG session focused on the required efforts to do so. We agreed on
the fact it will require operators to propose different flavors for vGPU
where they would require distinct resource classes (all but VGPU).
Fortunately, we'll reuse existing device_spec PCI config options [4] where
the operator would define custom resource classes which would match the PCI
addresses of the nvidia-generated virtual functions (don't freak out, we'll
also write documentation). We'll create another device type (something like
type-VF-migratable) for describing such specific nvidia VFs.
Accordingly, the generated domain XML will correctly write the device
description (amending the "managed=no" flag for that device).
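For illustration only, and with the caveat that the PCI address, the custom
resource class name and the exact device_spec keys below are assumptions
rather than a settled design, the operator-facing side could look something
like:

    # Hypothetical illustration only: expose the nvidia variant-driver VFs
    # via a custom resource class and request it from a flavor. The PCI
    # address, resource class name and device_spec keys are assumptions.
    import json

    device_spec = {
        "address": "0000:3b:00.*",                  # VFs created by the nvidia driver
        "resource_class": "CUSTOM_NVIDIA_VGPU_VF",  # operator-defined, instead of VGPU
        "managed": "no",                            # nova should not bind/unbind the VF
    }

    # What the operator would put in nova.conf under [pci]:
    print("device_spec = " + json.dumps(device_spec))

    # And a matching flavor extra spec requesting that class:
    flavor_extra_specs = {"resources:CUSTOM_NVIDIA_VGPU_VF": "1"}
    print(flavor_extra_specs)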
There will be an upgrade impact: existing instances will need to be resized
to that new flavor (or instances will need to be shelved, updated for
changing the embedded flavor and unshelved).
In order to be on par with existing vGPU features, we'll also need to
implement vfio-pci live-migration by detecting the VF type on the existing
SRIOV live-migration.
Since that effort is quite large, bauzas will assemble a subteam of
interested parties that would help him implement all of those bits in the
short timeframe that is one upstream cycle.
### Graceful shutdowns ###
A common pitfall reported by tobias-urdin is stopping nova-compute
services. In general, before stopping the service, we should be sure that
all RPC calls are done, which means we would no longer accept RPC calls
after asking nova-compute to stop and would just wait for the current calls
to finish before stopping the service. For that, we need to create a
backlog spec for discussing that design and we would also need to modify
oslo.service to unsubscribe from the RPC topics. Unfortunately, this cycle
we won't have any contributor working on it, but gibi could try to at least
document this.
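Purely to illustrate the drain-then-stop idea (this is not nova or
oslo.service code, just the shape of the behaviour we would want):

    # Generic drain-then-stop sketch: stop accepting new RPC work, wait for
    # in-flight handlers to finish, then let the service exit.
    import threading


    class GracefulWorker:
        def __init__(self):
            self._accepting = True
            self._in_flight = 0
            self._idle = threading.Condition()

        def handle_rpc(self, func, *args):
            with self._idle:
                if not self._accepting:
                    raise RuntimeError("service is shutting down")
                self._in_flight += 1
            try:
                return func(*args)
            finally:
                with self._idle:
                    self._in_flight -= 1
                    self._idle.notify_all()

        def stop(self, timeout=60):
            # Step 1: unsubscribe from the RPC topics / stop accepting calls.
            with self._idle:
                self._accepting = False
                # Step 2: wait for calls already running to finish.
                self._idle.wait_for(lambda: self._in_flight == 0,
                                    timeout=timeout)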
### horizon-nova x-p session ###
We mostly discussed the Horizon feature gaps [5]. The first priority would
be for Horizon to use the OpenStackSDK instead of novaclient, and then to
support all of the new Nova API microversions. Unfortunately, we are not
sure that we could have Horizon contributors that could fix those, but if
you're a contributor and you want to help make Horizon better, maybe you
could do this? If so, please ping me.
### Ironic-nova x-p session ###
We didn't really have topics for this x-p session. We just quickly
discussed some points, like Graphical Console support. Nothing really worth
noting, maybe just that it would be nice if we could have a readonly
graphical console. We were just happy to say that the ironic driver now
works better thanks to some features that were merged in the last cycles.
Kudos to those who did them.
### HPC/AI optimized hypervisor "slices" ###
A large topic to explain, I'll try to keep it short. Basically, how Nova
slices the NUMA affinity between guests is nice but hard for HPC use cases
where sometimes you need a better way to slice the NUMA-dependent devices
depending on the various PCI topologies. Eventually, we agreed on a POC
that johnthetubaguy could work on, trying to implement a specific virt
driver that would do something different from the existing NUMA affinities.
### Cinder-nova x-p session ###
Multiple topics were discussed there. First, abishop wanted to enhance
cinder's retyping of in-use boot volumes, which means that the Nova
os-attachments API needs to get a new parameter. We said that he needs to
create a new spec and we agreed on the fact that the cinder contributors
need to discuss with QEMU folks to learn about the qemu writes.
We also discussed a new nova spec which is about adding burst length
support to Cinder QoS [6]. We said that both teams (nova and cinder) need
to review this spec.
About residues left behind when detaching a volume, we also agreed on the
fact this is not a security flaw and the fact that os-brick should delete
them, not nova (even if nova needs to ask os-brick to look at that, either
by a periodic run or when attaching/detaching). whoami-rajat will provide a
spec for it.
### Python 3.13 support ###
We discussed a specific issue for py3.13, the fact that the crypt module is
no longer in the stdlib for py3.13, which impacts nova due to some usage in
the nova.virt.disk.api module for passing an admin password for file
injection. Given file injection is deprecated, we have three possibilities:
either removing admin password file injection (or even file injection as a
whole), adding the new separate crypt package in upper-constraints, or
using the oslo_utils.secretutils module. bauzas (me) will send an email to
openstack-discuss asking operators whether they are OK with removing file
injection or just admin password injection, and then we'll see the
direction. bauzas or sean-k-mooney will also try to have py3.13 non-voting
jobs for unit tests/functional tests.
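To make the problem concrete, here is a small sketch of the guarded-import
situation; the helper name and salt below are illustrative, and which
replacement (separate package or oslo_utils) gets picked is deliberately
left open:

    # Sketch of the issue: the stdlib crypt module used for admin-password
    # file injection is gone in Python 3.13 (PEP 594). hash_admin_password
    # and the salt below are illustrative only.
    import sys

    try:
        import crypt  # removed from the stdlib in Python 3.13
    except ImportError:
        crypt = None


    def hash_admin_password(password, salt="$6$rounds=4096$examplesalt"):
        if crypt is None:
            raise RuntimeError(
                "crypt is unavailable on Python %s; admin-password file "
                "injection needs a replacement or removal"
                % sys.version.split()[0])
        return crypt.crypt(password, salt)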
### Eventlet removal steps in Nova ###
I won't explain why we need to remove eventlet, you already know, right?
We rather discussed the details in our nova components, including
nova-api, nova-compute and other nova services. We agreed on removing
direct eventlet imports where possible, moving nova entrypoints that don't
use eventlet to separate modules that don't monkeypatch the stdlib, looking
at what we can do with all our scatter_gather methods (which asynchronously
call the cells DBs) in order to use threads instead, and checking whether
those calls are blocking on the DB (and not on the MQ side). Gibi will
shepherd that effort and provide an audit of the eventlet usage in order to
avoid any unexpected but unfortunate late discoveries.
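To illustrate the scatter_gather point (this is not nova's real
implementation, just the general shape of fanning the blocking per-cell DB
calls out over native threads with a timeout sentinel):

    # Illustrative only, not nova's real scatter_gather implementation: fan
    # the blocking per-cell DB calls out over native threads and record a
    # sentinel for cells that do not answer in time.
    import concurrent.futures

    CELL_TIMEOUT_SENTINEL = object()


    def scatter_gather(cells, query, timeout=60):
        """Run query(cell) in one thread per cell and collect the results."""
        results = {}
        pool = concurrent.futures.ThreadPoolExecutor(
            max_workers=max(len(cells), 1))
        try:
            futures = {pool.submit(query, cell): cell for cell in cells}
            done, not_done = concurrent.futures.wait(futures, timeout=timeout)
            for future in done:
                cell = futures[future]
                try:
                    results[cell] = future.result()
                except Exception as exc:  # real code would log per-cell errors
                    results[cell] = exc
            for future in not_done:
                results[futures[future]] = CELL_TIMEOUT_SENTINEL
        finally:
            pool.shutdown(wait=False, cancel_futures=True)
        return results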
### Libvirt image backend refactor ###
If you like spaghetti, you should pay attention to the libvirt image
backend code. Lots of assumptions and conditionals make any change to that
module hard to write and hard to review, leading to error-prone
situations like the ones we had when fixing some recent CVEs.
We all agreed on the quite urgent necessity to refactor that code and
melwitt proposed a multi-stage effort about that. We agreed on the proposal
for the first two steps with some comments, leading to future revisions of
the proposal's patches. The crucial bits with the refactor are about test
coverage.
### IOThreads tuning for libvirt instances ###
An old spec was already proposed for defining iothreads to guests. We
agreed on reviving that spec, where a config option would define either no
iothread or one iothread per instance (with the potential for a later option
value to be "one iothread per disk"). Depending on whether
emulator_thread_policy
is provided in the flavor/image, we would set the iothread on that policy
or we would put the iothread floating over the shared CPU set. If no shared
CPUs are configured but the operator wants iothreads, nova-compute would
refuse to start. lajoskatona will work on such an implementation that will
be designed in a blueprint that doesn't require a spec.
### OpenAPI schemas progress ###
Nothing specific to say here, bauzas and gmann will review the series this
cycle.
That's it. I'm gone, I'm dead [7] (a cyclist metaphor) but I eventually
skimmed the very large nova etherpad. Of course, there is a 99% chance that
I wrote some notes incorrectly, so please correct me if I'm wrong; I won't
feel offended, just tired.
Thanks all (and I hope your coffee or tea was good)
-Sylvain
[1] https://geek-and-poke.com/geekandpoke/2013/11/24/simply-explained
[2] https://opendev.org/openstack/whitebox-tempest-plugin
[3] https://www.qemu.org/docs/master/specs/rapl-msr.html
[4]
https://docs.openstack.org/nova/latest/configuration/config.html#pci.device…
[5] https://etherpad.opendev.org/p/horizon-feature-gap#L69
[6] https://review.opendev.org/c/openstack/nova-specs/+/932653
[7] https://www.youtube.com/watch?v=HILcYXf8yqc
9 months, 2 weeks
Re: [watcher] 2025.2 Flamingo PTG summary
by Sean Mooney
On 17/04/2025 13:17, Dmitriy Rabotyagov wrote:
>> well gnocchi is also not a native OpenStack telemetry datastore, it left
>> our community to pursue its own goals and is now a third party datastore
>> just like Grafana or Prometheus.
> Yeah, well, true. Is still somehow treated as the "default" thing with
> Telemetry, likely due to existing integration with Keystone and
> multi-tenancy support. And beyond it - all other options become
> opinionated too fast - ie, some do OpenTelemetry, some do Zabbix,
> VictoriaMetrics, etc. As pretty much from what I got as well, is that
> still relies on Ceilometer metrics?
> And then Prometheus is obviously not the best storage for them, as it
> requires to have pushgatgeway, and afaik prometheus maintainers are
> strictly against "push" concept to it and treat it as conceptually
> wrong (on contrary to OpenTelemetry).
i don't know the details but i know there is work planned for native
support of a Prometheus scrape endpoint in ceilometer
so while you currently need to use sg-core to provide that integration there
is a plan to remove the need for sg-core going forward.
https://etherpad.opendev.org/p/r.72ac6a7268e4b9d854f75715adede80c#L28
i don't see a spec proposed yet but there is an older one from 2 years ago
https://review.opendev.org/c/openstack/telemetry-specs/+/845485/4/specs/zed…
there is also a plan to provide keystone integration and multi-tenancy
https://etherpad.opendev.org/p/r.72ac6a7268e4b9d854f75715adede80c#L84
> So the metric timestamp issue is
> to remain unaddressed.
> So that's why I'd see leaving Gnocchi as "base" implementation might
> be valuable (and very handy for us, as we don't need to implement a
> prometheus job specifically for Watcher).
watcher, aodh, and cloudkitty i believe all have some level of support for
Prometheus but they can also use other backends. i'm not sure what level
of enablement they have in osa.
>
>> but for example watcher can integrate with both ironic an canonical maas
> component
>> to do some level of host power management.
> That sounds really interesting... We do maintain infrastructure using
> MAAS and playing with such integration will be extremely interesting.
> I hope I will be able to get some time for this though...
the current maas integration has 3 problems: 1) a lack of testing, 2) a
lack of documentation
and 3) it somehow managed to introduce asyncio in a project that uses
eventlet, in
a release of eventlet that did not support asyncio,
so i'm very nervous that it is broken or will break in the future.
this is the entirety of the support:
https://review.opendev.org/c/openstack/watcher/+/898790
there are no docs and no spec...
so this should definitely be considered "experimental" at best today.
>
> чт, 17 апр. 2025 г. в 13:52, Sean Mooney <smooney(a)redhat.com>:
>>
>> On 16/04/2025 21:04, Dmitriy Rabotyagov wrote:
>>> Hey,
>>>
>>> Have a comment on one AI from the list.
>>>
>>>> AI: (jgilaber) Mark Monasca and Grafana as deprecated, unless
>>> someone steps up to maintain them, which should include a minimal CI
>>> job running.
>>>
>>> So eventually, on OpenStack-Ansible we were planning to revive the
>>> Watcher role support to the project.
>>> How we usually test deployment, is by spawning an all-in-one
>>> environment with drivers and executing a couple of tempest scenarios
>>> to ensure basic functionality of the service.
>>>
>>> With that, having a native OpenStack telemetry datastore is very
>>> beneficial for such goal, as we already do maintain means for spawning
>>> telemetry stack. While a requirement for Prometheus will be
>>> unfortunate for us at least.
>>>
>>> While I was writing that, I partially realized that testing Watcher on
>>> all-in-one is pretty much impossible as well...
>>>
>> you can certenly test some fo watcher with an all in one deployment
>>
>> i.e. the apis and you can use the dummy test stragies.
>>
>> but ya in general like nova you need at least 2 nodes to be able to test
>> it properly ideally 3
>>
>> so that if your doing a live migration there is actully a choice of host.
>>
>> in general however watcher like heat just asks nova to actully move the vms.
>>
>> sure it will ask nova to move it to a specific host but fundementaly if
>> you have
>>
>> tested live migration with nova via tempest seperatly there is no reason
>> to expcect
>>
>> it would not work for live migratoin tirggred by watcher or heat or
>> anything else that
>>
>> jsut calls novas api.
>>
>> so you could still get some valual testing in an all in one but ideally
>> there woudl be at least 2 comptue hosts.
>>
>>
>>> But at the very least, I can propose looking into adding an OSA job
>>> with Gnocchi as NV to the project, to show the state of the deployment
>>> with this driver.
>>>
>> well gnocchi is also not a native OpenStack telemetry datastore, it left
>> our community to pursue its own goals and is now a third party datastore
>>
>> just like Grafana or Prometheus.
>>
>> monasca is currently marked as inactive
>> https://review.opendev.org/c/openstack/governance/+/897520 and is in the
>> process of being retired.
>>
>> but it also has no testing on the watcher side to the combination of the
>> two is why we are deprecating it going forward.
>>
>> if both change im happy to see the support continue.
>>
>> Gnocchi has testing but we are not actively working on extending its
>> functionality going forward.
>>
>> as long as it continues to work i see no reason to change its support
>> status.
>>
>> watcher has quite a lot of untested integrations which is unfortunate
>>
>> we are planning to build out a feature/test/support matrix in the docs
>> this cycle
>>
>> but for example watcher can integrate with both ironic an canonical maas
>> component
>>
>> to do some level of host power management. none of that is tested and we
>> are likely going
>>
>> to mark them as experimental and reflect on if we can continue to
>> support them or not going forward.
>>
>> it also has the ability to do cinder storage pool balancing which is i
>> think also untested write now.
>>
>> one of the things we hope to do is extend the exsitign testing in our
>> current jobs to cover gaps like
>>
>> that where it is practical to do so. but creating a devstack plugin to
>> deploy maas with fake infrastructure
>>
>> is likely alot more then we can do with our existing contributors so
>> expect that to go to experimental then
>>
>> deprecated and finally it will be removed if no one turns up to support it.
>>
>> ironic is in the same boat however there are devstack jobs with fake
>> ironic nodes so i
>>
>> could see a path to use having an ironic job down the line. its just not
>> high on our current priority
>>
>> list to adress the support status or testing of this currently.
>>
>> eventlet removal and other techdebt/community goals are defintly higher
>> but i hop the new supprot/testing
>>
>> matrix will at least help folks make informed descions or what feature
>> to use and what backend are
>>
>> recommended going forward.
>>
>>> On Wed, 16 Apr 2025, 21:53 Douglas Viroel, <viroel(a)gmail.com> wrote:
>>>
>>> Hello everyone,
>>>
>>> Last week's PTG had very interesting topics. Thank you all that
>>> joined.
>>> The Watcher PTG etherpad with all notes is available here:
>>> https://etherpad.opendev.org/p/apr2025-ptg-watcher
>>> Here is a summary of the discussions that we had, including the
>>> great cross-project sessions with Telemetry, Horizon and Nova team:
>>>
>>> Tech Debt (chandankumar/sean-k-mooney)
>>> =================================
>>> a) Croniter
>>>
>>> * Project is being abandoned as per
>>> https://pypi.org/project/croniter/#disclaimer
>>> * Watcher uses croniter to calculate a new schedule time to run
>>> an audit (continuous). It is also used to validate cron like
>>> syntax
>>> * Agreed: replace croniter with appscheduler's cron methods.
>>> * *AI*: (chandankumar) Fix in master branch and backport to 2025.1
>>>
>>> b) Support status of Watcher Datasources
>>>
>>> * Only Gnocchi and Prometheus have CI job running tempest tests
>>> (with scenario tests)
>>> * Monaska is inactive since 2024.1
>>> * *AI*: (jgilaber) Mark Monasca and Grafana as deprecated,
>>> unless someone steps up to maintain them, which should include
>>> a minimal CI job running.
>>> * *AI*: (dviroel) Document a support matrix between Strategies
>>> and Datasources, which ones are production ready or
>>> experimental, and testing coverage.
>>>
>>> c) Eventlet Removal
>>>
>>> * Team is going to look at how the eventlet is used in Watcher
>>> and start a PoC of its removal.
>>> * Chandan Kumar and dviroel volunteer to help in this effort.
>>> * Planned for 2026.1 cycle.
>>>
>>> Workflow/API Improvements (amoralej)
>>> ==============================
>>> a) Actions states
>>>
>>> * Currently Actions updates from Pending to Succeeded or Failed,
>>> but these do not cover some important scenarios
>>> * If an Action's pre_conditions fails, the action is set to
>>> FAILED, but for some scenarios, it could be just SKIPPED and
>>> continue the workflow.
>>> * Proposal: New SKIPPED state for action. E.g: In a Nova
>>> Migration Action, if the instance doesn't exist in the source
>>> host, it can be skipped instead of fail.
>>> * Proposal: Users could also manually skip specific actions from
>>> an action plan.
>>> * A skip_reason field could also be added to document the reason
>>> behind the skip: user's request, pre-condition check, etc.
>>> * *AI*: (amoralej) Create a spec to describe the proposed changes.
>>>
>>> b) Meaning of SUCCEEDED state in Action Plan
>>>
>>> * Currently means that all actions are triggered, even if all of
>>> them fail, which can be confusing for users.
>>> * Docs mention that SUCCEEDED state means that all actions have
>>> been successfully executed.
>>> * *AI*: (amoralej) Document the current behavior as a bug
>>> (Priority High)
>>> o done: https://bugs.launchpad.net/watcher/+bug/2106407
>>>
>>> Watcher-Dashboard: Priorities to next release (amoralej)
>>> ===========================================
>>> a) Add integration/functional tests
>>>
>>> * Project is missing integration/functional tests and a CI job
>>> running against changes in the repo
>>> * No general conclusion and we will follow up with Horizon team
>>> * *AI*: (chandankumar/rlandy) sync with Horizon team about
>>> testing the plugin with horizon.
>>> * *AI*: (chandankumar/rlandy) devstack job running on new
>>> changes for watcher-dashboard repo.
>>>
>>> b) Add parameters to Audits
>>>
>>> * It is missing on the watcher-dashboard side. Without it, it is
>>> not possible to define some important parameters.
>>> * Should be addressed by a blueprint
>>> * Contributors to this feature: chandankumar
>>>
>>> Watcher cluster model collector improvement ideas (dviroel)
>>> =============================================
>>>
>>> * Brainstorm ideas to improve watcher collector process, since
>>> we still see a lot of issues due to outdated models when
>>> running audits
>>> * Both scheduled model update and event-based updates are
>>> enabled in CI today
>>> * It is unknown the current state of event-based updates from
>>> Nova notification. Code needs to be reviewed and
>>> improvements/fixes can be proposed
>>> o e.g:
>>> https://bugs.launchpad.net/watcher/+bug/2104220/comments/3
>>> - We need to check if we are processing the right
>>> notifications of if is a bug on Nova
>>> * Proposal: Refresh the model before running an audit. A rate
>>> limit should be considered to avoid too many refreshments.
>>> * *AI*: (dviroel) new spec for cluster model refresh, based on
>>> audit trigger
>>> * *AI:* (dviroel) investigate the processing of nova events in
>>> Watcher
>>>
>>> Watcher and Nova's visible constraints (dviroel)
>>> ====================================
>>>
>>> * Currently, Watcher can propose solutions that include server
>>> migrations that violate some Nova constraints like:
>>> scheduler_hints, server_groups, pinned_az, etc.
>>> * In Epoxy release, Nova's API was improved to also show
>>> scheduler_hints and image_properties, allowing external
>>> services, like watcher, to query and use this information when
>>> calculating new solutions.
>>> o https://docs.openstack.org/releasenotes/nova/2025.1.html#new-features
>>> * Proposal: Extend compute instance model to include new
>>> properties, which can be retrieved via novaclient. Update
>>> strategies to filter invalid migration destinations based on
>>> these new properties.
>>> * *AI*: (dviroel) Propose a spec to better document the
>>> proposal. No API changes are expected here.
>>>
>>> Replacement for noisy neighbor policy (jgilaber)
>>> ====================================
>>>
>>> * The existing noisy neighbor strategy is based on L3 Cache
>>> metrics, which is not available anymore, since the support for
>>> it was dropped from the kernel and from Nova.
>>> * In order to keep this strategy, new metrics need to be
>>> considered: cpu_steal? io_wait? cache_misses?
>>> * *AI*: (jgilaber) Mark the strategy as deprecated during this cycle
>>> * *AI*: (TBD) Identify new metrics to be used
>>> * *AI*: (TBD) Work on a replacement for the current strategy
>>>
>>>
>>> Host Maintenance strategy new use case (jeno8)
>>> =====================================
>>>
>>> * New use case for Host Maintenance strategy: instance with
>>> ephemeral disks should not be migrated at all.
>>> * Spec proposed:
>>> https://review.opendev.org/c/openstack/watcher-specs/+/943873
>>> o New action to stop instances when both live/cold migration
>>> are disabled by the user
>>> * *AI*: (All) Review the spec and continue with discussion there.
>>>
>>> Missing Contributor Docs (sean-k-mooney)
>>> ================================
>>>
>>> * Doc missing: Scope of the project, e.g:
>>> https://docs.openstack.org/nova/latest/contributor/project-scope.html
>>> * *AI*: (rlandy) Create a scope of the project doc for Watcher
>>> * Doc missing: PTL Guide, e.g:
>>> https://docs.openstack.org/nova/latest/contributor/ptl-guide.html
>>> * *AI*: (TBD) Create a PTL Guide for Watcher project
>>> * Document: When to create a spec vs blueprint vs bug
>>> * *AI*: (TBD) Create a doc section to describe the process based
>>> on what is being modified in the code.
>>>
>>> Retrospective
>>> ==========
>>>
>>> * The DPL approach seems to be working for Watcher
>>> * New core members added: sean-k-mooney, dviroel, marios and
>>> chandankumar
>>> o We plan to add more cores in the next cycle, based on
>>> reviews and engagement.
>>> o We plan to remove not active members in the 2 last cycles
>>> (starting at 2026.1)
>>> * A new datasource was added: Prometheus
>>> * Prometheus job now also runs scenario tests, along with Gnocchi.
>>> * We triaged all old bugs from launchpad
>>> * Needs improvement:
>>> o current team is still learning about details in the code,
>>> much of the historical knowledge was lost with the
>>> previous maintainers
>>> o core team still needs to grow
>>> o we need to focus on creating stable releases
>>>
>>>
>>> Cross-project session with Horizon team
>>> ===============================
>>>
>>> * Combined session with Telemetry and Horizon team, focused on
>>> how to provide a tenant and an admin dashboard to visualize
>>> metrics.
>>> * Watcher team presented some ideas of new panels for both admin
>>> and tenants, and sean-k-mooney raised a discussion about
>>> frameworks that can be used to implement them
>>> * Use-cases that were discussed:
>>> o a) Admin would benefit from a visualization of the
>>> infrastructure utilization (real usage metrics), so they
>>> can identify bottlenecks and plan optimization
>>> o b) A tenant would like to view their workload performance,
>>> checking real usage of cpu/ram/disk of instances, to
>>> proper adjust their resources allocation.
>>> o c) An admin user of watcher service would like to
>>> visualize metrics generated by watcher strategies like
>>> standard deviation of host metrics.
>>> * sean-k-mooney presented an initial PoC on how a Hypervisor
>>> Metrics dashboard would look like.
>>> * Proposal for next steps:
>>> o start a new horizon plugin as an official deliverable of
>>> telemetry project
>>> o still unclear which framework to use for building charts
>>> o dashboard will integrate with Prometheus, as metric store
>>> o it is expected that only short term metrics will be
>>> supported (7 days)
>>> o python-observability-client will be used to query Prometheus
>>>
>>>
>>> Cross-project session with Nova team
>>> =============================
>>>
>>> * sean-k-mooney led topics on how to evolve Nova to better
>>> assist other services, like Watcher, to take actions on
>>> instances. The team agreed on a proposal of using the existing
>>> metadata API to annotate instance's supported lifecycle
>>> operations. This information is very useful to improve
>>> Watcher's strategy's algorithms. Some example of instance's
>>> metadata could be:
>>> o lifecycle:cold-migratable=true|false
>>> o ha:maintenance-strategy:in_place|power_off|migrate
>>> * It was discussed that Nova could infer which operations are
>>> valid or not, based on information like: virt driver, flavor,
>>> image properties, etc. This feature was initially named
>>> 'instance capabilities' and will require a spec for further
>>> discussions.
>>> * Another topic of interest, also raised by Sean, was about
>>> adding new standard traits to resource providers, like
>>> PRESSURE_CPU and PRESSURE_DISK. These traits can be used to
>>> weight hosts when placing new VMs. Watcher and the libvirt
>>> driver could work on annotating them, but the team generally
>>> agreed that the libvirt driver is preferred here.
>>> * More info at Nova PTG etherpad [0] and sean's summary blog [1]
>>>
>>> [0] https://etherpad.opendev.org/p/r.bf5f1185e201e31ed8c3adeb45e3cf6d
>>> [1] https://www.seanmooney.info/blog/2025.2-ptg/#watcher-topics
>>>
>>>
>>> Please let me know if I missed something.
>>> Thanks!
>>>
>>> --
>>> Douglas Viroel - dviroel
>>>
3 months, 3 weeks
[tc][all] OpenStack Technical Committee Weekly Summary and Meeting Agenda (2025.2/R-18)
by Goutham Pacha Ravi
Hello Stackers,
We're 18 weeks away from the release date for OpenStack 2025.2
"Flamingo" [1]. Next week is the deadline for "cycle-trailing" [2]
projects to tag their 2025.1 "Epoxy" deliverables. Elsewhere, service
project teams are busy wrapping up design specifications for features
expected to be implemented in this release cycle. A call to action
regarding the cross-community goal on "eventlet removal" was made to
this mailing list [3]. Please join the #openstack-eventlet-removal
channel on OFTC and participate in the effort.
Several OpenStack governance changes are currently underway. A major
proposal among them is a transition [4] from the Contributor License
Agreement (CLA) [5] to the Developer Certificate of Origin [6]. This
change will affect every OpenStack contributor. The OpenStack
Technical Committee is working with the OpenInfra Foundation and the
OpenDev Infrastructure teams to enforce DCO compliance starting
2025-07-01. Please take some time to consider its implications and
provide your opinions on the TC resolution [4]. Project maintainers
are not expected to reject patches over DCO compliance today. If you
spot a "Signed-off-by" in a commit message, there's a good chance
reviewers have simply looked past it, as it wasn't required so far.
It's a good time to review what may be necessary and be prepared for
the upcoming change [7].
=== Weekly Meeting ===
The weekly IRC meeting of the OpenStack Technical Committee occurred
on 2025-05-20 [8]. An action item regarding relinquishing the
"quantum" name on PyPI was discussed. The resolution in this regard
was acknowledged by the requester and merged shortly after. The
OpenDev infra administrators deleted OpenStack artifacts and handed
over the project namespace. The majority of the meeting later focused
on the transition from CLA (Contributor License Agreement) to DCO
(Developer Certificate of Origin). This move is part of a broader
transition into the Linux Foundation, with an effective date of June
1, 2025. The TC needed to reconfirm its desire to move to DCO,
preferably within the next two weeks, as the previous resolution on
this topic was from 2014. A new resolution confirming the board's
recommendation was deemed helpful for community feedback. We discussed
many aspects of this transition—a key concern being the smoothness of
the transition for contributors. While the technical implementation
(Gerrit enforcing Signed-Off-By in commit messages and turning off CLA
enforcement) is relatively simple, the human and organizational impact
is not trivial. The short timeline for the switchover was a major
point of contention, as downstream organizations may need to re-engage
legal teams and update internal contribution policies. The possibility
of having multiple CLAs active in Gerrit (allowing existing
contributors to continue under the old CLA while new contributors use
a new CLA for the new entity) was raised as a potential solution to
mitigate the immediate impact of the short deadline. However, mixing
CLA and DCO enforcement was generally seen as undesirable and hard to
implement. Post-meeting, the resolution was proposed [4], and the
timeline for implementation has been pushed out by a month to allow
the community time to prepare and react accordingly. Please expect
more communication regarding this in the next few days.
The next meeting of the OpenStack TC is on 2025-05-27 at 1700 UTC.
This meeting will be held over IRC on the #openstack-tc channel on
OFTC. Please find the agenda and other details on the meeting's wiki
page [9]. I hope you'll be able to join us there!
=== Governance Proposals ===
==== Merged ====
- [resolution] Relinquish "quantum" project on PyPI |
https://review.opendev.org/c/openstack/governance/+/949783
==== Open for Review ====
- Require declaration of affiliation from TC Candidates |
https://review.opendev.org/c/openstack/governance/+/949432
- [resolution] Replace CLA with DCO for all contributions |
https://review.opendev.org/c/openstack/governance/+/950463
- Clarify actions when no elections are required |
https://review.opendev.org/c/openstack/governance/+/949431
- Fix outdated info on the tc-guide |
https://review.opendev.org/c/openstack/governance/+/950446
=== Upcoming Events ===
- 2025-06-03: 15 ans d'OpenStack - OpenInfra UG, Paris:
https://www.meetup.com/openstack-france/events/307492285
- 2025-06-05: OpenStack 15 ans! - OpenInfra UG, Rennes:
https://www.meetup.com/openstack-rennes/events/306903998
- 2025-06-28: OpenInfra+Cloud Native Day, Vietnam:
https://www.vietopeninfra.org/void2025
Thank you very much for reading!
On behalf of the OpenStack TC,
Goutham Pacha Ravi (gouthamr)
OpenStack TC Chair
[1] 2025.2 "Flamingo" Release Schedule:
https://releases.openstack.org/flamingo/schedule.html
[2] "cycle-trailing":
https://releases.openstack.org/reference/release_models.html#cycle-trailing
[3] "eventlet-removal" status:
https://lists.openstack.org/archives/list/openstack-discuss@lists.openstack…
[4] TC resolution to replace CLA with DCO for all contributions:
https://review.opendev.org/c/openstack/governance/+/950463
[5] OpenStack CLA:
https://docs.openstack.org/contributors/common/setup-gerrit.html#individual…
[6] Developer Certificate of Origin: https://developercertificate.org/
[7] DCO documentation draft:
https://review.opendev.org/c/openstack/contributor-guide/+/950839
[8] TC Meeting IRC Log 2025-05-20:
https://meetings.opendev.org/meetings/tc/2025/tc.2025-05-20-17.00.log.html
[9] TC Meeting Agenda, 2025-05-27:
https://wiki.openstack.org/wiki/Meetings/TechnicalCommittee#Next_Meeting
2 months, 2 weeks
[manila] 2025.1 Epoxy PTG summary
by Carlos Silva
Hello everyone! Thank you for the great participation at the PTG last week.
We've had great discussions and a good turnout. The recordings for the
sessions are available on YouTube [0]. If you would like to check on the
notes, please take a look at the PTG etherpad [1].
*2024.2 Dalmatian Retrospective*
==========================
- New core reviewers in the manila group were impactful in reviews; we
should continue actively working on maintaining/growing the core reviewer
team.
- We had the mid-cycle and managed to combine it with our well-known
collaborative review sessions, around feature proposal freeze. This had a
good impact on raising awareness on the changes being proposed, as well as
prioritizing the reviews.
- Great contributions ranging from new third party drivers to successful
internships on manila-ui, bandit and the ongoing OpenAPI internships.
*Action items:*
- Carlos (carloss) will work with the manila team to help people gain
context on the bug czar role and work with the team to rotate it.
- Vida Haririan (vhari) will jot down the details of the Bug Czar role
- Follow the discussions on teams joining the VMT and get Manila included
too.
- Spread the word on the removal of the manila client and switch to
OpenStackClient
*Share backup enhancements*
=========================
- Out of place restore isn't supported currently. We have agreed that this
is a good use case and that a design specification should be proposed to
document this.
- DataManager / BackupDriver - forcing the backup process to go through the
DataManager service is supported through a config option, but Manila is
currently not honoring it. We agreed that this is an issue in the code, and
we will review the proposed change [2] to make the data manager honor this
config.
- DataManager to allow for a backup driver to provide reports on API call
progress: Currently, the data manager fetches the progress of a backup
using a generic get progress call, but it is failing with the generic
backup driver. We suggested that this should be fixed in the base driver.
- Context for Backup API calls: currently, only objects representing a
Share and Backup are passed to the backup driver. The request context
should also be forwarded in these calls. The backup driver interface can be
changed for this, but we should be mindful of out of tree drivers that
could break.
*Action items:*
- Zach Goggins (zachgoggins) will look into:
- Proposing a spec for the share backup out of place restore.
- Updating the backup driver interface and adding context to the methods
that need it.
- Updating the backup driver interface and adding the abstract
methods/capabilities that will help with the `get_restore_progress` and
`get_backup_progress` methods.
- The manila team will provide feedback on [2]
*All things CephFS*
===============
*Updates from previous cycles*
------------------------------------------
*State of the Standalone NFS Ganesha protocol helper:*
- We added a deprecation warning at the end of the previous SLURP
release, and we are planning to complete the removal during the
2025.1/Epoxy release. There were no objections to this so far at the PTG.
When this is removed, CephFS-via-NFS will only work with cephadm deployed
ceph-nfs clusters.
*Testing and stabilization:*
- devstack-plugin-ceph has been refactored to deploy a standalone
NFS-Ganesha service with a ceph orch deployed cluster. We also dropped
support for package-based and container-based installation of ceph. cephadm
is used to deploy/orchestrate ceph.
- Bumped Ceph version to Reef in Antelope, Bobcat, Caracal, Dalmatian,
as well as started testing with Squid.
- There are some failures on stable branches jobs which are being
triaged and fixed.
*Manage/unmanage:*
- Implementation completed in Dalmatian and the documentation has been
updated. We are currently working to enable the tests on CI permanently, as
well as doing some small refactors to the CI jobs.
*Ensure shares:*
   - Merged in Dalmatian but testing is still challenging, as running the
tests means that the service would temporarily have a different status and
shares within the backend would have their status changed, which is harmful
for test concurrency.
*Preferred export locations and export location metadata:*
- The core feature merged, but we are still working to get the newly
implemented tests passing and merged.
*Plans for 2025.1/Epoxy*
--------------------------------
- NFSv3 + testing: we are looking into enabling NFSv3 support as soon as
the patch is merged in Ceph. We agreed that we should enable the tests
within manila-tempest-plugin and make any necessary changes to the tests
structure, so we can ensure that we are testing some scenarios with both
NFSv3 and NFSv4.
- We will start to investigate support for SMB/CIFS shares and look at
the necessary changes for setting up devstack and testing.
*Action items:*
- Carlos (carloss) will write an email to the openstack-discuss mailing
list announcing the removal of the deprecated ganesha helper
- Carlos will pursue the manage/unmanage testing patches to have tests
enabled in the CephFS jobs during Epoxy.
- Carlos will look into approaches to test ensure shares APIs.
- Ashley (ashrod98) will continue working on the export location metadata
tempest changes and drive them to completion.
- The manila team will look into updating manila-tempest-plugin tests and
enabling NFSv3 tests in the Ceph NFS jobs
- Goutham (gouthamr) will be submitting a prototype of the SMB/CIFS
integration
*Tech Debt*
========
*Eventlet removal*
----------------------
Our main concerns:
- Performance should not be degraded with the default configuration when we
switch.
- Synchronous calls should not take a big hit or become asynchronous.
- Impact to the SSH Pool (used by many drivers) should be minimal.
*Action items for 2025.1 Epoxy:*
- Tackle the low-hanging-fruit changes.
- Participating in the pop-up team discussions.
- Removing the affected console scripts in Manila.
- Working on performance tests to understand the impact on the SSH pool
that is used by some drivers (see the sketch after this list).
- Look into enhancing our rally/browbeat test coverage.
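As a rough starting point for the SSH-pool performance question above (the
fake_ssh_command below is a stand-in for a driver call through the SSH pool,
and the numbers are arbitrary):

    # Rough timing harness: run N fake "SSH" commands concurrently on native
    # threads and measure wall-clock time; compare against the current
    # eventlet-based behaviour with the same workload.
    import concurrent.futures
    import time


    def fake_ssh_command(i):
        time.sleep(0.2)  # simulate network + command latency
        return "ok %d" % i


    def run(concurrency=16, calls=200):
        start = time.monotonic()
        with concurrent.futures.ThreadPoolExecutor(
                max_workers=concurrency) as pool:
            results = list(pool.map(fake_ssh_command, range(calls)))
        elapsed = time.monotonic() - start
        print("%d calls with %d threads took %.2fs"
              % (len(results), concurrency, elapsed))


    if __name__ == "__main__":
        run()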
*CI and testing images*
-------------------------------
We started working on the migration of the CI to Ubuntu 24.04 in all of the
manila repositories (manila-image-elements, python-manilaclient, manila-ui,
manila, manila-specs).
Currently, the Ceph job is broken [3].
*Action items:*
- We should clean up our CI job variants, as they have a lot of
workarounds and we can start moving away from them.
*Stable branches*
----------------------
We currently have 5 "unmaintained" branches, so we should be looking at
sunsetting them.
*Action items:*
- Carlos (carloss) will start the conversation for the transition of some
of these branches in the openstack-discuss mailing list.
*Allowing force delete of a share network subnet*
=======================================
We currently can add subnets (which translates to adding new network
interfaces) to a share server but we can't remove them. This is a proposal
to add this removal feature and being able to detach a network interface of
a share server.
We agreed that:
- This is a good use case and something that can be enhanced.
- The enhancement should add a force-delete API.
- We should not allow the last subnet to be deleted, otherwise the shares
won't have an export path.
- A bug should be filed for a tangential issue: the NetApp driver is
using "neutron_net_id" (and possibly "neutron_subnet_id") to name resources
on the backend: ipspaces, broadcast domains, and possibly concurrency
control / locks.
*Action Items:*
- sylvanld will look into proposing a spec to document this behavior
*NetApp Driver: Enforcing lifs limit per HA pair*
=====================================
- The NetApp ONTAP storage has a limit of network interfaces per node in a
HA pair. In case the sum of allocated network interfaces in the two nodes
of the HA pair is bigger than the limit of the single node, then the
failover operation is compromised and will fail.
- NetApp maintainers would like to fix this issue, and we agreed that:
- The fix should be as dynamic as possible, not relying on users/admin
input or configuration.
- The ONTAP driver must look up all of the interfaces already created
and allow/deny the request in case it would compromise the failover.
 - The NetApp ONTAP driver should keep an updated capability with the max
supported number of network interfaces, and possibly the number of network
interfaces allocated at the moment.
*NetApp Driver: Implement Certificate based authentication*
================================================
- The NetApp ONTAP driver currently handles only user/password
authentication, but in an environment where the password should change
quarterly, this means updating the local.conf at least every three months.
This enhancement proposes also adding the possibility of certificate-based
authentication.
- We agreed that this is something that is going to be important for
operators and will allow them to add their certificates with a longer
expiration date, avoiding the disruptions caused by needing to update the
user/password.
*Manage Share affinity relationships by annotation/label*
=============================================
Currently the manila scheduler uses affinity/anti-affinity hints and we
base ourselves on share IDs. The idea now would be to have the affinity
hints be based on an affinity policy, as is possible with Nova.
We considered the proposed approaches, and agreed that:
- If we are adding new policies, they should end up becoming a new
resource/entity within the manila database
- If there is a way to reuse the share groups mechanism, we should
prioritize it
*Action items:*
- Chuan (chuanm) will propose a design spec to document this new behavior.
*Share encryption*
==============
This feature is currently waiting for more reviews and testing on gerrit.
In the Dalmatian release mid-cycle we talked about the importance of
testing this feature against a first party driver, to ensure that the APIs
and integration with Barbican and Castellan work.
We agreed that:
- We should do some research on how to do this testing with the generic
driver (which uses Cinder and Nova)
- The testing will focus on the APIs and behavior of this feature, not the
encryption of the shares.
*Action items:*
- gouthamr will help with some research on how to test this with the
generic driver
- The manila team will discuss this again in the upcoming manila weekly
meetings.
[0]
https://www.youtube.com/watch?v=8UxrjEr6yik&list=PLnpzT0InFrqDHGfSDPhiGtSeX…
[1] https://etherpad.opendev.org/p/epoxy-ptg-manila
[2] https://review.opendev.org/c/openstack/manila/+/907983
[3] https://www.spinics.net/lists/ceph-users/msg83201.html
9 months, 1 week
Re: [watcher] 2025.2 Flamingo PTG summary
by Dmitriy Rabotyagov
Hey,
Have a comment on one AI from the list.
> AI: (jgilaber) Mark Monasca and Grafana as deprecated, unless someone
steps up to maintain them, which should include a minimal CI job running.
So eventually, on OpenStack-Ansible we were planning to revive the Watcher
role support to the project.
How we usually test deployment, is by spawning an all-in-one environment
with drivers and executing a couple of tempest scenarios to ensure basic
functionality of the service.
With that, having a native OpenStack telemetry datastore is very beneficial
for such goal, as we already do maintain means for spawning telemetry
stack. While a requirement for Prometheus will be unfortunate for us at
least.
While I was writing that, I partially realized that testing Watcher on
all-in-one is pretty much impossible as well...
But at the very least, I can propose looking into adding an OSA job with
Gnocchi as NV to the project, to show the state of the deployment with this
driver.
On Wed, 16 Apr 2025, 21:53 Douglas Viroel, <viroel(a)gmail.com> wrote:
> Hello everyone,
>
> Last week's PTG had very interesting topics. Thank you all that joined.
> The Watcher PTG etherpad with all notes is available here:
> https://etherpad.opendev.org/p/apr2025-ptg-watcher
> Here is a summary of the discussions that we had, including the great
> cross-project sessions with Telemetry, Horizon and Nova team:
>
> Tech Debt (chandankumar/sean-k-mooney)
> =================================
> a) Croniter
>
> - Project is being abandoned as per
> https://pypi.org/project/croniter/#disclaimer
> - Watcher uses croniter to calculate a new schedule time to run an
> audit (continuous). It is also used to validate cron like syntax
> - Agreed: replace croniter with appscheduler's cron methods.
> - *AI*: (chandankumar) Fix in master branch and backport to 2025.1
>
> b) Support status of Watcher Datasources
>
> - Only Gnocchi and Prometheus have CI job running tempest tests (with
> scenario tests)
> - Monaska is inactive since 2024.1
> - *AI*: (jgilaber) Mark Monasca and Grafana as deprecated, unless
> someone steps up to maintain them, which should include a minimal CI job
> running.
> - *AI*: (dviroel) Document a support matrix between Strategies and
> Datasources, which ones are production ready or experimental, and testing
> coverage.
>
> c) Eventlet Removal
>
> - Team is going to look at how the eventlet is used in Watcher and
> start a PoC of its removal.
> - Chandan Kumar and dviroel volunteer to help in this effort.
> - Planned for 2026.1 cycle.
>
> Workflow/API Improvements (amoralej)
> ==============================
> a) Actions states
>
> - Currently Actions updates from Pending to Succeeded or Failed, but
> these do not cover some important scenarios
> - If an Action's pre_conditions fails, the action is set to FAILED,
> but for some scenarios, it could be just SKIPPED and continue the workflow.
> - Proposal: New SKIPPED state for action. E.g: In a Nova Migration
> Action, if the instance doesn't exist in the source host, it can be skipped
> instead of fail.
> - Proposal: Users could also manually skip specific actions from an
> action plan.
> - A skip_reason field could also be added to document the reason
> behind the skip: user's request, pre-condition check, etc.
> - *AI*: (amoralej) Create a spec to describe the proposed changes.
>
> b) Meaning of SUCCEEDED state in Action Plan
>
> - Currently means that all actions are triggered, even if all of them
> fail, which can be confusing for users.
> - Docs mention that SUCCEEDED state means that all actions have been
> successfully executed.
> - *AI*: (amoralej) Document the current behavior as a bug (Priority
> High)
> - done: https://bugs.launchpad.net/watcher/+bug/2106407
>
> Watcher-Dashboard: Priorities to next release (amoralej)
> ===========================================
> a) Add integration/functional tests
>
> - Project is missing integration/functional tests and a CI job running
> against changes in the repo
> - No general conclusion and we will follow up with Horizon team
> - *AI*: (chandankumar/rlandy) sync with Horizon team about testing the
> plugin with horizon.
> - *AI*: (chandankumar/rlandy) devstack job running on new changes for
> watcher-dashboard repo.
>
> b) Add parameters to Audits
>
> - It is missing on the watcher-dashboard side. Without it, it is not
> possible to define some important parameters.
> - Should be addressed by a blueprint
> - Contributors to this feature: chandankumar
>
> Watcher cluster model collector improvement ideas (dviroel)
> =============================================
>
> - Brainstorm ideas to improve watcher collector process, since we
> still see a lot of issues due to outdated models when running audits
> - Both scheduled model update and event-based updates are enabled in
> CI today
> - It is unknown the current state of event-based updates from Nova
> notification. Code needs to be reviewed and improvements/fixes can be
> proposed
> - e.g: https://bugs.launchpad.net/watcher/+bug/2104220/comments/3 -
> We need to check if we are processing the right notifications of if is a
> bug on Nova
> - Proposal: Refresh the model before running an audit. A rate limit
> should be considered to avoid too many refreshments.
> - *AI*: (dviroel) new spec for cluster model refresh, based on audit
> trigger
> - *AI:* (dviroel) investigate the processing of nova events in Watcher
>
> Watcher and Nova's visible constraints (dviroel)
> ====================================
>
> - Currently, Watcher can propose solutions that include server
> migrations that violate some Nova constraints like: scheduler_hints,
> server_groups, pinned_az, etc.
> - In Epoxy release, Nova's API was improved to also show
> scheduler_hints and image_properties, allowing external services, like
> watcher, to query and use this information when calculating new solutions.
> -
> https://docs.openstack.org/releasenotes/nova/2025.1.html#new-features
> - Proposal: Extend compute instance model to include new properties,
> which can be retrieved via novaclient. Update strategies to filter invalid
> migration destinations based on these new properties.
> - *AI*: (dviroel) Propose a spec to better document the proposal. No
> API changes are expected here.
>
> Replacement for noisy neighbor policy (jgilaber)
> ====================================
>
> - The existing noisy neighbor strategy is based on L3 Cache metrics,
> which is not available anymore, since the support for it was dropped from
> the kernel and from Nova.
> - In order to keep this strategy, new metrics need to be considered:
> cpu_steal? io_wait? cache_misses?
> - *AI*: (jgilaber) Mark the strategy as deprecated during this cycle
> - *AI*: (TBD) Identify new metrics to be used
> - *AI*: (TBD) Work on a replacement for the current strategy
>
>
> Host Maintenance strategy new use case (jeno8)
> =====================================
>
> - New use case for Host Maintenance strategy: instance with ephemeral
> disks should not be migrated at all.
> - Spec proposed:
> https://review.opendev.org/c/openstack/watcher-specs/+/943873
> - New action to stop instances when both live/cold migration are
> disabled by the user
> - *AI*: (All) Review the spec and continue with discussion there.
>
> Missing Contributor Docs (sean-k-mooney)
> ================================
>
> - Doc missing: Scope of the project, e.g:
> https://docs.openstack.org/nova/latest/contributor/project-scope.html
> - *AI*: (rlandy) Create a scope of the project doc for Watcher
> - Doc missing: PTL Guide, e.g:
> https://docs.openstack.org/nova/latest/contributor/ptl-guide.html
> - *AI*: (TBD) Create a PTL Guide for Watcher project
> - Document: When to create a spec vs blueprint vs bug
> - *AI*: (TBD) Create a doc section to describe the process based on
> what is being modified in the code.
>
> Retrospective
> ==========
>
> - The DPL approach seems to be working for Watcher
> - New core members added: sean-k-mooney, dviroel, marios and
> chandankumar
> - We plan to add more cores in the next cycle, based on reviews and
> engagement.
> - We plan to remove not active members in the 2 last cycles
> (starting at 2026.1)
> - A new datasource was added: Prometheus
> - Prometheus job now also runs scenario tests, along with Gnocchi.
> - We triaged all old bugs from launchpad
> - Needs improvement:
> - current team is still learning about details in the code, much of
> the historical knowledge was lost with the previous maintainers
> - core team still needs to grow
> - we need to focus on creating stable releases
>
>
> Cross-project session with Horizon team
> ===============================
>
> - Combined session with Telemetry and Horizon team, focused on how to
> provide a tenant and an admin dashboard to visualize metrics.
> - Watcher team presented some ideas of new panels for both admin and
> tenants, and sean-k-mooney raised a discussion about frameworks that can be
> used to implement them
> - Use-cases that were discussed:
> - a) Admin would benefit from a visualization of the infrastructure
> utilization (real usage metrics), so they can identify bottlenecks and plan
> optimization
> - b) A tenant would like to view their workload performance,
> checking real usage of cpu/ram/disk of instances, to proper adjust their
> resources allocation.
> - c) An admin user of watcher service would like to visualize
> metrics generated by watcher strategies like standard deviation of host
> metrics.
> - sean-k-mooney presented an initial PoC on how a Hypervisor Metrics
> dashboard would look like.
> - Proposal for next steps:
> - start a new horizon plugin as an official deliverable of
> telemetry project
> - still unclear which framework to use for building charts
> - dashboard will integrate with Prometheus, as metric store
> - it is expected that only short term metrics will be supported (7
> days)
> - python-observability-client will be used to query Prometheus
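>
> For context, the kind of query the plugin will end up issuing is a plain
> PromQL call against Prometheus. python-observability-client wraps this for
> us; as a minimal stand-in, the underlying HTTP API call looks roughly like
> this (the endpoint and metric name are examples, not decisions from the
> session):
>
>     import requests
>
>     PROMETHEUS_URL = "http://prometheus.example.org:9090"  # assumed
>
>     # Average guest CPU metric over the 7-day short-term retention window.
>     query = 'avg_over_time(ceilometer_cpu{resource="INSTANCE_UUID"}[7d])'
>
>     resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query",
>                         params={"query": query}, timeout=10)
>     resp.raise_for_status()
>     for result in resp.json()["data"]["result"]:
>         print(result["metric"], result["value"])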
>
>
> Cross-project session with Nova team
> =============================
>
> - sean-k-mooney led topics on how to evolve Nova to better assist
> other services, like Watcher, in taking actions on instances. The team agreed
> on a proposal to use the existing metadata API to annotate an instance's
> supported lifecycle operations. This information is very useful for improving
> Watcher's strategy algorithms. Some examples of instance metadata could be
> (a rough sketch of setting them follows after the references below):
> - lifecycle:cold-migratable=true|false
> - ha:maintenance-strategy:in_place|power_off|migrate
> - It was discussed that Nova could infer which operations are valid or
> not, based on information like: virt driver, flavor, image properties, etc.
> This feature was initially named 'instance capabilities' and will require a
> spec for further discussions.
> - Another topic of interest, also raised by Sean, was about adding new
> standard traits to resource providers, like PRESSURE_CPU and PRESSURE_DISK.
> These traits can be used to weight hosts when placing new VMs. Watcher and
> the libvirt driver could work on annotating them, but the team generally
> agreed that the libvirt driver is preferred here.
> - More info at Nova PTG etherpad [0] and sean's summary blog [1]
>
> [0] https://etherpad.opendev.org/p/r.bf5f1185e201e31ed8c3adeb45e3cf6d
> [1] https://www.seanmooney.info/blog/2025.2-ptg/#watcher-topics
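>
> To make the metadata idea concrete, a rough sketch of annotating an
> instance with the proposed keys via novaclient is shown below. The keys are
> only the proposals from the session (Nova does not act on them today), and
> the credentials are placeholders:
>
>     from keystoneauth1 import loading, session
>     from novaclient import client as nova_client
>
>     loader = loading.get_plugin_loader("password")
>     auth = loader.load_from_options(
>         auth_url="http://controller:5000/v3", username="admin",
>         password="secret", project_name="admin",
>         user_domain_name="Default", project_domain_name="Default")
>     nova = nova_client.Client("2.1", session=session.Session(auth=auth))
>
>     server = nova.servers.get("INSTANCE_UUID")
>     # Proposed annotation keys; Watcher strategies would read these back
>     # to decide which lifecycle operations are allowed for the instance.
>     nova.servers.set_meta(server, {
>         "lifecycle:cold-migratable": "false",
>         "ha:maintenance-strategy": "power_off",
>     })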
>
>
> Please let me know if I missed something.
> Thanks!
>
> --
> Douglas Viroel - dviroel
>
3 months, 3 weeks
Re: [watcher] 2025.2 Flamingo PTG summary
by Dmitriy Rabotyagov
> well gnocchi is also not a native OpenStack telemetry datastore, it left
> our community to pursue its own goals and is now a third party datastore
> just like Grafana or Prometheus.
Yeah, well, true. It is still somehow treated as the "default" thing for
Telemetry, likely due to its existing integration with Keystone and
multi-tenancy support. And beyond that, all other options become
opinionated too fast - i.e., some do OpenTelemetry, some do Zabbix,
VictoriaMetrics, etc. Also, from what I gathered, it still
relies on Ceilometer metrics?
And then Prometheus is obviously not the best storage for them, as it
requires a Pushgateway, and afaik the Prometheus maintainers are
strictly against the "push" concept and treat it as conceptually
wrong (in contrast to OpenTelemetry). So the metric timestamp issue is
likely to remain unaddressed.
So that's why I'd say keeping Gnocchi as the "base" implementation might
be valuable (and very handy for us, as we wouldn't need to implement a
Prometheus job specifically for Watcher).
> but for example watcher can integrate with both ironic an canonical maas
> component to do some level of host power management.
That sounds really interesting... We do maintain infrastructure using
MAAS, and playing with such an integration would be extremely interesting.
I hope I will be able to find some time for this, though...
Thu, 17 Apr 2025 at 13:52, Sean Mooney <smooney(a)redhat.com>:
>
>
> On 16/04/2025 21:04, Dmitriy Rabotyagov wrote:
> >
> > Hey,
> >
> > Have a comment on one AI from the list.
> >
> > > AI: (jgilaber) Mark Monasca and Grafana as deprecated, unless
> > someone steps up to maintain them, which should include a minimal CI
> > job running.
> >
> > So eventually, on OpenStack-Ansible we were planning to revive the
> > Watcher role support to the project.
> > How we usually test deployment, is by spawning an all-in-one
> > environment with drivers and executing a couple of tempest scenarios
> > to ensure basic functionality of the service.
> >
> > With that, having a native OpenStack telemetry datastore is very
> > beneficial for such goal, as we already do maintain means for spawning
> > telemetry stack. While a requirement for Prometheus will be
> > unfortunate for us at least.
> >
> > While I was writing that, I partially realized that testing Watcher on
> > all-in-one is pretty much impossible as well...
> >
> you can certenly test some fo watcher with an all in one deployment
>
> i.e. the apis and you can use the dummy test stragies.
>
> but ya in general like nova you need at least 2 nodes to be able to test
> it properly ideally 3
>
> so that if your doing a live migration there is actully a choice of host.
>
> in general however watcher like heat just asks nova to actully move the vms.
>
> sure it will ask nova to move it to a specific host but fundementaly if
> you have
>
> tested live migration with nova via tempest seperatly there is no reason
> to expcect
>
> it would not work for live migratoin tirggred by watcher or heat or
> anything else that
>
> jsut calls novas api.
>
> so you could still get some valual testing in an all in one but ideally
> there woudl be at least 2 comptue hosts.
>
>
> > But at the very least, I can propose looking into adding an OSA job
> > with Gnocchi as NV to the project, to show the state of the deployment
> > with this driver.
> >
> well gnocchi is also not a native OpenStack telemetry datastore, it left
> our community to pursue its own goals and is now a third party datastore
>
> just like Grafana or Prometheus.
>
> monasca is currently marked as inactive
> https://review.opendev.org/c/openstack/governance/+/897520 and is in the
> process of being retired.
>
> but it also has no testing on the watcher side to the combination of the
> two is why we are deprecating it going forward.
>
> if both change im happy to see the support continue.
>
> Gnocchi has testing but we are not actively working on extending its
> functionality going forward.
>
> as long as it continues to work i see no reason to change its support
> status.
>
> watcher has quite a lot of untested integrations which is unfortunate
>
> we are planning to build out a feature/test/support matrix in the docs
> this cycle
>
> but for example watcher can integrate with both ironic an canonical maas
> component
>
> to do some level of host power management. none of that is tested and we
> are likely going
>
> to mark them as experimental and reflect on if we can continue to
> support them or not going forward.
>
> it also has the ability to do cinder storage pool balancing which is i
> think also untested write now.
>
> one of the things we hope to do is extend the exsitign testing in our
> current jobs to cover gaps like
>
> that where it is practical to do so. but creating a devstack plugin to
> deploy maas with fake infrastructure
>
> is likely alot more then we can do with our existing contributors so
> expect that to go to experimental then
>
> deprecated and finally it will be removed if no one turns up to support it.
>
> ironic is in the same boat; however, there are devstack jobs with fake
> ironic nodes, so I could see a path to us having an ironic job down the
> line. It's just not high on our current priority list to address the
> support status or testing of this right now.
>
> eventlet removal and other tech-debt/community goals are definitely
> higher, but I hope the new support/testing matrix will at least help
> folks make informed decisions on what features to use and what backends
> are recommended going forward.
>
> >
> > On Wed, 16 Apr 2025, 21:53 Douglas Viroel, <viroel(a)gmail.com> wrote:
> >
> > Hello everyone,
> >
> > Last week's PTG had very interesting topics. Thank you all that
> > joined.
> > The Watcher PTG etherpad with all notes is available here:
> > https://etherpad.opendev.org/p/apr2025-ptg-watcher
> > Here is a summary of the discussions that we had, including the
> > great cross-project sessions with Telemetry, Horizon and Nova team:
> >
> > Tech Debt (chandankumar/sean-k-mooney)
> > =================================
> > a) Croniter
> >
> > * Project is being abandoned as per
> > https://pypi.org/project/croniter/#disclaimer
> > * Watcher uses croniter to calculate a new schedule time to run
> > an audit (continuous). It is also used to validate cron like
> > syntax
> > * Agreed: replace croniter with appscheduler's cron methods.
> > * *AI*: (chandankumar) Fix in master branch and backport to 2025.1
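> >
> > A minimal sketch of the agreed direction, assuming APScheduler 3.x and
> > its documented CronTrigger API (the cron expression is just an example):
> >
> >     from datetime import datetime, timezone
> >     from apscheduler.triggers.cron import CronTrigger
> >
> >     # Validate a cron expression and compute the next run time, which
> >     # is roughly what Watcher uses croniter for in continuous audits.
> >     trigger = CronTrigger.from_crontab("*/30 * * * *")
> >     next_run = trigger.get_next_fire_time(None, datetime.now(timezone.utc))
> >     print(next_run)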
> >
> > b) Support status of Watcher Datasources
> >
> > * Only Gnocchi and Prometheus have CI job running tempest tests
> > (with scenario tests)
> > * Monasca is inactive since 2024.1
> > * *AI*: (jgilaber) Mark Monasca and Grafana as deprecated,
> > unless someone steps up to maintain them, which should include
> > a minimal CI job running.
> > * *AI*: (dviroel) Document a support matrix between Strategies
> > and Datasources, which ones are production ready or
> > experimental, and testing coverage.
> >
> > c) Eventlet Removal
> >
> > * Team is going to look at how the eventlet is used in Watcher
> > and start a PoC of its removal.
> > * Chandan Kumar and dviroel volunteer to help in this effort.
> > * Planned for 2026.1 cycle.
> >
> > Workflow/API Improvements (amoralej)
> > ==============================
> > a) Actions states
> >
> > * Currently, Actions move from Pending to Succeeded or Failed,
> > but these states do not cover some important scenarios
> > * If an Action's pre_conditions check fails, the action is set to
> > FAILED, but in some scenarios it could just be SKIPPED and the
> > workflow could continue.
> > * Proposal: New SKIPPED state for action. E.g: In a Nova
> > Migration Action, if the instance doesn't exist in the source
> > host, it can be skipped instead of fail.
> > * Proposal: Users could also manually skip specific actions from
> > an action plan.
> > * A skip_reason field could also be added to document the reason
> > behind the skip: user's request, pre-condition check, etc.
> > * *AI*: (amoralej) Create a spec to describe the proposed changes.
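> >
> > Purely to illustrate the proposal (hypothetical names, not existing
> > Watcher code), the applier-side handling could look roughly like this:
> >
> >     from enum import Enum
> >
> >     class ActionState(Enum):
> >         PENDING = "PENDING"
> >         SUCCEEDED = "SUCCEEDED"
> >         FAILED = "FAILED"
> >         SKIPPED = "SKIPPED"  # proposed new state
> >
> >     def run_action(action):
> >         ok, reason = action.check_pre_conditions()  # hypothetical helper
> >         if not ok and action.skippable:
> >             action.state = ActionState.SKIPPED
> >             action.skip_reason = reason  # proposed field
> >             return  # the workflow continues with the next action
> >         action.execute()
> >         action.state = ActionState.SUCCEEDED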
> >
> > b) Meaning of SUCCEEDED state in Action Plan
> >
> > * Currently it means that all actions were triggered, even if all of
> > them failed, which can be confusing for users.
> > * Docs mention that SUCCEEDED state means that all actions have
> > been successfully executed.
> > * *AI*: (amoralej) Document the current behavior as a bug
> > (Priority High)
> > o done: https://bugs.launchpad.net/watcher/+bug/2106407
> >
> > Watcher-Dashboard: Priorities to next release (amoralej)
> > ===========================================
> > a) Add integration/functional tests
> >
> > * Project is missing integration/functional tests and a CI job
> > running against changes in the repo
> > * No general conclusion and we will follow up with Horizon team
> > * *AI*: (chandankumar/rlandy) sync with Horizon team about
> > testing the plugin with horizon.
> > * *AI*: (chandankumar/rlandy) devstack job running on new
> > changes for watcher-dashboard repo.
> >
> > b) Add parameters to Audits
> >
> > * It is missing on the watcher-dashboard side. Without it, it is
> > not possible to define some important parameters.
> > * Should be addressed by a blueprint
> > * Contributors to this feature: chandankumar
> >
> > Watcher cluster model collector improvement ideas (dviroel)
> > =============================================
> >
> > * Brainstorm ideas to improve watcher collector process, since
> > we still see a lot of issues due to outdated models when
> > running audits
> > * Both scheduled model update and event-based updates are
> > enabled in CI today
> > * The current state of event-based updates from Nova
> > notifications is unknown. Code needs to be reviewed and
> > improvements/fixes can be proposed
> > o e.g:
> > https://bugs.launchpad.net/watcher/+bug/2104220/comments/3
> > - We need to check if we are processing the right
> > notifications or if it is a bug in Nova
> > * Proposal: Refresh the model before running an audit. A rate
> > limit should be considered to avoid too many refreshments.
> > * *AI*: (dviroel) new spec for cluster model refresh, based on
> > audit trigger
> > * *AI:* (dviroel) investigate the processing of nova events in
> > Watcher
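> >
> > A very small sketch of the rate-limit idea behind that proposal (the
> > interval value and the collector method name are assumptions):
> >
> >     import time
> >
> >     REFRESH_MIN_INTERVAL = 300  # seconds, assumed value
> >     _last_refresh = 0.0
> >
> >     def refresh_model_if_stale(collector):
> >         """Refresh the cluster model before an audit, at most once
> >         per interval."""
> >         global _last_refresh
> >         now = time.monotonic()
> >         if now - _last_refresh >= REFRESH_MIN_INTERVAL:
> >             collector.synchronize()  # assumed collector entry point
> >             _last_refresh = now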
> >
> > Watcher and Nova's visible constraints (dviroel)
> > ====================================
> >
> > * Currently, Watcher can propose solutions that include server
> > migrations that violate some Nova constraints like:
> > scheduler_hints, server_groups, pinned_az, etc.
> > * In Epoxy release, Nova's API was improved to also show
> > scheduler_hints and image_properties, allowing external
> > services, like watcher, to query and use this information when
> > calculating new solutions.
> > o https://docs.openstack.org/releasenotes/nova/2025.1.html#new-features
> > * Proposal: Extend compute instance model to include new
> > properties, which can be retrieved via novaclient. Update
> > strategies to filter invalid migration destinations based on
> > these new properties.
> > * *AI*: (dviroel) Propose a spec to better document the
> > proposal. No API changes are expected here.
> >
> > Replacement for noisy neighbor policy (jgilaber)
> > ====================================
> >
> > * The existing noisy neighbor strategy is based on L3 Cache
> > metrics, which are no longer available, since support for
> > them was dropped from the kernel and from Nova.
> > * In order to keep this strategy, new metrics need to be
> > considered: cpu_steal? io_wait? cache_misses?
> > * *AI*: (jgilaber) Mark the strategy as deprecated during this cycle
> > * *AI*: (TBD) Identify new metrics to be used
> > * *AI*: (TBD) Work on a replacement for the current strategy
> >
> >
> > Host Maintenance strategy new use case (jeno8)
> > =====================================
> >
> > * New use case for the Host Maintenance strategy: instances with
> > ephemeral disks should not be migrated at all.
> > * Spec proposed:
> > https://review.opendev.org/c/openstack/watcher-specs/+/943873
> > o New action to stop instances when both live/cold migration
> > are disabled by the user
> > * *AI*: (All) Review the spec and continue with discussion there.
> >
> > Missing Contributor Docs (sean-k-mooney)
> > ================================
> >
> > * Doc missing: Scope of the project, e.g:
> > https://docs.openstack.org/nova/latest/contributor/project-scope.html
> > * *AI*: (rlandy) Create a scope of the project doc for Watcher
> > * Doc missing: PTL Guide, e.g:
> > https://docs.openstack.org/nova/latest/contributor/ptl-guide.html
> > * *AI*: (TBD) Create a PTL Guide for Watcher project
> > * Document: When to create a spec vs blueprint vs bug
> > * *AI*: (TBD) Create a doc section to describe the process based
> > on what is being modified in the code.
> >
> > Retrospective
> > ==========
> >
> > * The DPL approach seems to be working for Watcher
> > * New core members added: sean-k-mooney, dviroel, marios and
> > chandankumar
> > o We plan to add more cores in the next cycle, based on
> > reviews and engagement.
> > o We plan to remove members who have not been active in the
> > last 2 cycles (starting at 2026.1)
> > * A new datasource was added: Prometheus
> > * Prometheus job now also runs scenario tests, along with Gnocchi.
> > * We triaged all old bugs from launchpad
> > * Needs improvement:
> > o the current team is still learning the details of the code;
> > much of the historical knowledge was lost with the
> > previous maintainers
> > o core team still needs to grow
> > o we need to focus on creating stable releases
> >
> >
> > Cross-project session with Horizon team
> > ===============================
> >
> > * Combined session with Telemetry and Horizon team, focused on
> > how to provide a tenant and an admin dashboard to visualize
> > metrics.
> > * Watcher team presented some ideas of new panels for both admin
> > and tenants, and sean-k-mooney raised a discussion about
> > frameworks that can be used to implement them
> > * Use-cases that were discussed:
> > o a) Admin would benefit from a visualization of the
> > infrastructure utilization (real usage metrics), so they
> > can identify bottlenecks and plan optimization
> > o b) A tenant would like to view their workload performance,
> > checking real usage of cpu/ram/disk of instances, to
> > properly adjust their resource allocation.
> > o c) An admin user of watcher service would like to
> > visualize metrics generated by watcher strategies like
> > standard deviation of host metrics.
> > * sean-k-mooney presented an initial PoC of what a Hypervisor
> > Metrics dashboard could look like.
> > * Proposal for next steps:
> > o start a new horizon plugin as an official deliverable of
> > telemetry project
> > o still unclear which framework to use for building charts
> > o dashboard will integrate with Prometheus, as metric store
> > o it is expected that only short term metrics will be
> > supported (7 days)
> > o python-observability-client will be used to query Prometheus
> >
> >
> > Cross-project session with Nova team
> > =============================
> >
> > * sean-k-mooney led topics on how to evolve Nova to better
> > assist other services, like Watcher, in taking actions on
> > instances. The team agreed on a proposal to use the existing
> > metadata API to annotate an instance's supported lifecycle
> > operations. This information is very useful for improving
> > Watcher's strategy algorithms. Some examples of instance
> > metadata could be:
> > o lifecycle:cold-migratable=true|false
> > o ha:maintenance-strategy:in_place|power_off|migrate
> > * It was discussed that Nova could infer which operations are
> > valid or not, based on information like: virt driver, flavor,
> > image properties, etc. This feature was initially named
> > 'instance capabilities' and will require a spec for further
> > discussions.
> > * Another topic of interest, also raised by Sean, was about
> > adding new standard traits to resource providers, like
> > PRESSURE_CPU and PRESSURE_DISK. These traits can be used to
> > weight hosts when placing new VMs. Watcher and the libvirt
> > driver could work on annotating them, but the team generally
> > agreed that the libvirt driver is preferred here.
> > * More info at Nova PTG etherpad [0] and sean's summary blog [1]
> >
> > [0] https://etherpad.opendev.org/p/r.bf5f1185e201e31ed8c3adeb45e3cf6d
> > [1] https://www.seanmooney.info/blog/2025.2-ptg/#watcher-topics
> >
> >
> > Please let me know if I missed something.
> > Thanks!
> >
> > --
> > Douglas Viroel - dviroel
> >
>
3 months, 3 weeks
[tc][all] OpenStack Technical Committee Weekly Summary and Meeting Agenda (2025.1/R-25)
by Goutham Pacha Ravi
Hello Stackers,
This week, we begin our 26-week endeavor towards the next SLURP
release, 2025.1 ("Epoxy") [1]. OpenStack Project Teams will meet
virtually at the Project Teams Gathering (PTG) in two weeks, starting
on 2024-10-21 [2]. The OpenStack TC plans to host cross project
meetings during the following time slots:
- 2024-10-21 (Monday): 1400 UTC - 1700 UTC
- 2024-10-25 (Friday): 1500 UTC - 1700 UTC
You'll find the proposed topics on the PTG Etherpad [3]; please add
your IRC nickname if you'd like to attend or be notified when
discussions begin.
Last week, a few community leads presented at OpenInfra Live,
recapping the 2024.2 release [4]. I encourage you to watch the
presentation and follow the themes each team is pursuing in the
"Epoxy" release cycle. I'm excited to share that the organizers of the
upcoming OpenInfra Days North America (Oct 15-16) have made it a
hybrid event. Please register if you plan to attend virtually [5].
=== Weekly Meeting ===
The last weekly meeting of the OpenStack Technical Committee was held
simultaneously on IRC [6] and video [7]. We discussed meeting times,
and the current time (Tuesdays at 1800 UTC) was retained due to a lack
of consensus on better alternatives. Sylvain Bauza (bauzas)
volunteered to be an Election Official for the 2025.2 elections, which
will be announced around February 2025. We also discussed "leaderless"
projects for the 2025.1 release and appointed leaders for the
OpenStack Mistral, OpenStack Watcher, and OpenStack Swift projects.
Additionally, we created a TC tracker for the 2025.1 release cycle [8]
to monitor the progress of community goals and other governance
initiatives.
The next OpenStack Technical Committee meeting is today (2024-10-08)
at 1800 UTC on the #openstack-tc IRC channel on OFTC. You can find the
agenda on the weekly meeting wiki page [9]. I hope you can join us!
Below is a list of governance changes that have merged in the past
week and those still pending community review.
=== Governance Proposals ===
==== Merged ====
- Appoint Tim Burke as PTL for Swift |
https://review.opendev.org/c/openstack/governance/+/928881
==== Open for Review ====
- Mark kuryr-kubernetes and kuryr-tempest-plugin inactive |
https://review.opendev.org/c/openstack/governance/+/929698
- Add Axel Vanzaghi as PTL for Mistral |
https://review.opendev.org/c/openstack/governance/+/927962
- Propose the eventlet-removal community goal |
https://review.opendev.org/c/openstack/governance/+/931254
=== Upcoming Events ===
- 2024-10-08: OpenInfra Monthly Board Meeting: https://board.openinfra.dev/
- 2024-10-15: OpenInfra Days NA, Indianapolis:
https://ittraining.iu.edu/explore-topics/titles/oid-iu/
- 2024-10-21: OpenInfra Project Teams Gathering: https://openinfra.dev/ptg/
Thank you for reading!
On behalf of the OpenStack TC,
Goutham Pacha Ravi (gouthamr)
OpenStack TC Chair
[1] 2025.1 "Epoxy" Release Schedule:
https://releases.openstack.org/epoxy/schedule.html
[2] "Epoxy" PTG Schedule: https://ptg.opendev.org/ptg.html
[3] Technical Committee PTG Etherpad:
https://etherpad.opendev.org/p/oct2024-ptg-os-tc
[4] "Introducing OpenStack Dalmatian 2024.2": https://youtu.be/6igJNIJ9yFE
[5] OpenInfra Days NA:
https://ittraining.iu.edu/explore-topics/titles/oid-iu/index.html#register
[6] TC Meeting IRC Log, 2024-10-01:
https://meetings.opendev.org/meetings/tc/2024/tc.2024-10-01-18.00.log.html
[7] TC Meeting Video Recording, 2024-10-01: https://youtu.be/6RXE1LfEv7w
[8] 2025.1 TC Tracker: https://etherpad.opendev.org/p/tc-2025.1-tracker
[9] TC Meeting Agenda, 2024-10-08:
https://wiki.openstack.org/wiki/Meetings/TechnicalCommittee#Next_Meeting
10 months