Hello everyone,

Last week was the PTG, thank you to all who joined and contributed to the discussions! I hope you enjoyed it and found it productive.

Once again this cycle, we had a packed agenda with active participation throughout the sessions. For the Nova sessions (Tuesday to Friday), we had an average of 15 to 20 participants per session. Attendance was slightly higher during the cross-project discussions, with an average of 25 to 30 people joining.

You can revisit all notes and discussions on the Etherpad here:
https://etherpad.opendev.org/p/r.9bac75a686ab0941572757b205c9415c

Below is a summary of the topics and outcomes from our PTG discussions. I've tried to keep it as concise as possible, though we covered quite a lot during the week.

**** Tuesday topics ****

#### 2025.2 Flamingo Retrospective ####

During the Flamingo cycle, 14 blueprints were accepted and 9 were fully implemented, resulting in a 64% completion rate. This is slightly lower than the 71% achieved during the previous Epoxy cycle. In addition, at least 26 bug fixes were merged during the cycle.

Review & Process Feedback

Eventlet Transition Plan: the migration continues to work as expected and remains well-coordinated. The release process was smooth, with no major issues reported. Recognition to Kamil for his work on the conductor, and to Sean, Dan, and others for their consistent review efforts.
✅ Continue to prioritize the Eventlet removal as a key goal for the G cycle.

Branch Opening Policy: some uncertainty remains about what can be merged between RC1 and GA, when master is technically open but still sensitive for late merges. Concerns were raised about backports and release stability.
✅ Allow merges on a case-by-case basis once the new master branch opens.
✅ Bauzas to propose a patch updating the Nova contributor process <https://docs.openstack.org/nova/latest/contributor/process.html>

Bug Triage: the number of untriaged bugs remains high (around 180). Efforts were noted, but time allocation and rotation remain challenging. The end-of-meeting triage format was found ineffective for real-time work, though still useful for discussing specific cases.
✅ Consider reviving a rotating triage roster if enough volunteers join.
✅ Keep triage discussions focused during the weekly meeting rather than doing live triage.

Review Load: contributors noted increasing difficulty in finding time for reviews. Maintaining review velocity remains a cross-cycle challenge.
✅ Encourage feature-specific syncs or review sprints for high-priority areas.

Feature Momentum: weekly syncs and structured updates help large features maintain progress. The VMware CI topic was cited as useful, though too many topics per meeting could cause overload.
✅ Keep dedicated weekly or bi-weekly syncs for major features (e.g. Eventlet removal).
✅ Limit meeting updates to 2-3 features per session.

Review Tracking: the previous Etherpad tracker was not effective. However, having a central list remains important to avoid missing patches and to monitor review health.
✅ Discuss an external or simplified review tracker during the next meeting with more core reviewers.
✅ Keep the tracker short and up to date to maintain visibility.

#### 2026.1 Gazpacho Planning ####

Hard spec freeze: December 4th
Milestone 2 (M2): January 8th
Feature Freeze (FF): February 26th
Release: End of March / early April

The team discussed adjusting the timing of the spec freeze.
A soft spec freeze was proposed for November 20th, giving contributors three weeks after the PTG to submit specs. Two spec review days will be scheduled between the soft and hard freezes. Additionally, two implementation review days are planned: one around M2 for early landings, and another roughly two weeks before M3.
✅ Adopt the proposed plan, while staying flexible on a case-by-case basis.
✅ Soft spec freeze on Nov 20th, followed by two spec review days before Dec 4th.
✅ Two implementation review days: one near M2 and one two weeks before M3.
✅ A +2 on a spec implies that reviewers commit to reviewing the corresponding implementation.

#### Upstream meeting schedule ####

✅ Move the weekly upstream meeting to Monday at 16:00 UTC.
✅ Keep the slot fixed (no alternating or skipping weeks).
✅ Uggla to update the meeting schedule and cycle accordingly.

#### Upstream bug triage follow up ####

The team discussed how to better organize bug triage follow-up for the Gazpacho cycle. A proposal was made to create a small group of volunteers who would receive a few bugs to triage each week. Each volunteer would handle up to three bugs, distributed by Uggla. During the weekly upstream meeting, the team will check whether a dedicated Q&A session is needed to discuss specific bugs. If required, such a session will be scheduled every two weeks, ideally right after the upstream meeting. Otherwise, discussions will take place within the regular meeting.
✅ The team agreed to give this approach a try in the next cycle. The overall sentiment seemed favorable, though details will likely need adjustment as we go.

#### Eventlet removal ####

Significant progress has been made since Flamingo. In Flamingo, n-api, n-metadata, and n-sch could already be switched to native threading via an environment variable, and the nova-next job runs those services without Eventlet. n-cond was nearly ready but missed feature freeze due to a late futurist bug, which has since been fixed and released. A py312-threading tox target and a Zuul job already validate a large part of the unit tests without Eventlet. In early Gazpacho, native threading for the conductor has successfully landed.

Planned tasks for Gazpacho:
- Add native threading support for n-novncproxy and other proxies.
- Add native threading support for n-compute.
- Run all unit and functional tests in both threading and Eventlet modes, keeping small exclude lists where needed.
- Conduct scale testing using the fake driver, possibly extended to the libvirt driver.
- Switch n-api, n-metadata, and n-sch to native threading by default, keeping Eventlet as fallback.
- Introduce a nova-eventlet (legacy) CI job to ensure backward compatibility.
- Keep nova-next as the only job testing conductor, novncproxy, and compute with threading until the H cycle.
- Fix bugs reported after Eventlet removal.
- After Eventlet removal, migrate from PyMySQL to libmysqlclient for better DB performance.

✅ Kamil volunteered to handle the novncproxy migration.
✅ Start with the nova-compute migration first to build confidence.
✅ Switch the default to threading as early as possible in Gazpacho.
✅ Add a dedicated Eventlet CI job to continue testing legacy mode.
✅ Use the hybrid plug job to test Eventlet mode even for services switched to threading by default.
✅ Keep the work independent, but document the known issue with long-running RPC calls.
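As a side note for anyone experimenting locally, the switch described above essentially comes down to whether Eventlet monkey patching is applied at startup. A minimal, purely illustrative sketch (the environment variable name and helpers below are hypothetical, not Nova's actual implementation):

    # Illustrative sketch only -- the variable and function names are
    # hypothetical, not the actual Nova mechanism.
    import os
    from concurrent import futures


    def maybe_monkey_patch():
        # In legacy mode (opt-in here for the sake of the example), patch the
        # stdlib with Eventlet before anything else runs; otherwise use plain
        # native threads.
        if os.environ.get('EXAMPLE_USE_EVENTLET', '').lower() in ('1', 'true', 'yes'):
            import eventlet  # requires the eventlet package
            eventlet.monkey_patch()


    def get_executor(max_workers=16):
        # With native threading, concurrency is bounded by a real thread pool
        # instead of cheap greenlets, which is why long blocking calls matter.
        return futures.ThreadPoolExecutor(max_workers=max_workers)


    if __name__ == '__main__':
        maybe_monkey_patch()
        with get_executor() as pool:
            print(pool.submit(sum, range(10)).result())

The practical consequence shows up in the next topic: with a bounded native thread pool, a call that blocks for minutes ties up a scarce worker rather than a cheap greenlet.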
#### Long-Running Sync REST API Calls ####

The team discussed how long-running synchronous REST API actions (like volume or interface attach) can block nova-api threads once Eventlet is removed. With native threading, the limited number of threads makes blocking calls more costly, unlike Eventlet, which could spawn many lightweight greenlets. This work is closely related to, but independent of, the Eventlet removal effort. Dan started a spec on the topic, and it may be split into two (volume attach and interface attach). Detach operations are already async, and Manila shares follow a similar model.
✅ Introduce a new microversion to keep backward compatibility.
✅ Convert volume and interface attach to async operations under that microversion.
✅ Coordinate with Rajesh's "show finish_time" work ( https://review.opendev.org/q/topic:%22bp/show-instance-action-finish-time%22 ) to avoid race conditions.
✅ Dan and Johannes to update the spec.
✅ Further investigation is needed to determine whether WSGI or API tuning could help mitigate blocking.

#### vTPM Live Migration ####

The team reviewed how to handle TPM secret security policies during instance operations. Changing the assigned policy during resize is not supported, as it adds complexity and can lead to image/flavor conflicts. Rebuilds are already blocked for vTPM instances, so once a policy is set via resize, it remains locked in. Existing instances from previous releases are unaffected.
✅ Do not allow changing the TPM secret security policy after assignment.
✅ Remove the option to select the policy from the image, for simplicity.
✅ The default policy is "user", but compute nodes support all policies by default.
✅ Document in the spec and release notes that deployers must define flavors with hw:tpm_secret_security if they want to enable this.
✅ Mention that [libvirt]supported_tpm_secret_security = ['user', 'host', 'deployment'] can be adjusted by operators.
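To make the deployer-facing impact concrete, here is a minimal sketch of what enabling this could look like once the spec lands, using the extra spec and config option named above. The flavor name, credentials, and chosen policy are placeholders, and python-novaclient is just one convenient way to set flavor extra specs:

    # Sketch: tag a flavor so instances built from it request a vTPM secret
    # security policy. Assumes the spec lands as discussed and that the target
    # compute nodes list the policy in [libvirt]supported_tpm_secret_security.
    from keystoneauth1 import loading, session
    from novaclient import client as nova_client

    loader = loading.get_plugin_loader('password')
    auth = loader.load_from_options(
        auth_url='https://keystone.example.com/v3',   # placeholder
        username='admin', password='secret',
        project_name='admin', user_domain_id='default',
        project_domain_id='default')
    sess = session.Session(auth=auth)
    nova = nova_client.Client('2.1', session=sess)

    flavor = nova.flavors.find(name='vtpm-small')        # placeholder flavor
    flavor.set_keys({'hw:tpm_secret_security': 'host'})  # or 'user' / 'deployment'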
**** Wednesday topics ****

#### CPU Comparison During Live Migration ####

The team revisited the CPU compatibility check between source and destination hosts during live migration. If skip_cpu_compare_on_dest is enabled, some migrations may fail due to missing pre-checks; if disabled, the check can sometimes be too strict. The original goal is to avoid migrations to nodes running with QEMU emulation, which would cause serious performance issues.
✅ Include getDomainCapabilities in the Nova pre-check to improve accuracy.
✅ Move the configuration flag from the [workarounds] section to the [libvirt] section.
✅ Document potential unsafe edge cases so deployers can choose between the safer (Nova) or more permissive (libvirt-only) approach.

### Start cross session with Cinder ###

#### NFS / NetApp NFS Online Volume Extend ####

The team discussed adding online volume extend support for NFS and NetApp backends. Instance operations (like shutdown or migration) should be blocked during the extend process, and any new task_state must stay compatible with existing microversions.
✅ Avoid using the event API, as it's meant for fast operations.
✅ Introduce a dedicated API for this feature to manage versioning and states.
✅ Consider making the Cinder call synchronous for better control.
✅ Konrad will update the spec accordingly.

#### Hard-Coded io=native Setting for Cinder Backends ####

The team revisited the hard-coded io=native driver setting used for Cinder volume backends (NFS, FC, and iSCSI). This code dates back nearly ten years and has recently caused performance issues and guest hangs in some environments. The original reason for enforcing io=native is unclear.
✅ No one seems to know the reason for this original choice.
✅ Move the configuration to a standard libvirt configuration option.

### End cross session with Cinder ###

#### Preserve NVRAM After Reboots and Migrations ####

The team discussed the proposal to preserve NVRAM data across instance reboots and migrations. The feature is ready for review, pending available reviewer bandwidth.
✅ Request a definitive answer from libvirt, via a bug report, regarding file permission handling.

### Start cross session with Ironic/Cinder ###

A more complete summary of this joint session will be provided by the Ironic team, as they can better capture the discussions and decisions made. However, the Nova PTG notes still include a few relevant takeaways and references for those interested.

### End cross session with Ironic/Cinder ###

### Start cross session with Manila ###

The teams discussed testing and integration plans between Nova and Manila. Testing through Tempest was considered challenging due to configuration complexity and potential circular dependencies. Support for CephFS and NFS, hot attach, and live migration with newer OS versions was identified as a goal.
✅ Manila will draft initial UX and workflow details to share with Nova for review.
✅ Add a backlog spec in the Nova repository to track cross-team work.
✅ Manila contributors may help with open SDK and client patches as needed.
✅ Hot attach and live migration support are valuable goals but will need explicit demand and prioritization to move forward.

### End cross session with Manila ###

### Start cross session with Neutron ###

A more complete summary of this joint session will be provided by the Neutron team, as they can better capture the discussions and decisions made. However, the Nova PTG notes still include a few relevant takeaways and references for those interested.

### End cross session with Neutron ###

**** Thursday topics ****

#### Correct Firmware Detection for Stateless Firmware and AMD SEV/SEV-ES ####

The team discussed improving firmware detection for stateless firmware and AMD SEV/SEV-ES. Recent QEMU and libvirt versions now include firmware descriptor files that expose these capabilities, allowing libvirt to automatically select the right firmware without Nova duplicating that logic. The group agreed that leveraging libvirt's detection is the better long-term approach. To avoid regressions, Nova should continue to preserve existing firmware for running instances during hard reboots, while new instances will rely on libvirt's detection from the start.
✅ Use libvirt's firmware detection instead of maintaining custom logic in Nova.
✅ Preserve the existing firmware for current instances on hard reboot. Let libvirt select firmware for new ones.
✅ Takashi will write a spec describing the approach and detailing the upgrade path.
✅ He will also check UEFI/stateless firmware tests in CI and run local SEV/SEV-ES tests.

#### ROM-Type Firmware Support ####

The team continued the discussion on firmware selection, focusing on ROM-type firmware used by AMD SEV-SNP, Intel TDX, and future Arm CCA. Since recent libvirt versions can automatically select the appropriate firmware and handle SMM when needed, Nova no longer needs to manage this logic directly.
✅ Skip further work in Nova and rely on libvirt's firmware auto-selection mechanism.
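For background on what "letting libvirt decide" relies on, the data libvirt exposes can be inspected directly with getDomainCapabilities (the same call the live migration pre-check discussion above refers to). A rough sketch with libvirt-python, assuming a local qemu:///system connection; the exact XML layout can vary between libvirt versions:

    # Sketch: inspect what libvirt reports via getDomainCapabilities, e.g. the
    # firmware values it can auto-select and whether SEV is supported.
    # Assumes libvirt-python and a reachable qemu:///system URI.
    import xml.etree.ElementTree as ET
    import libvirt

    conn = libvirt.open('qemu:///system')
    caps_xml = conn.getDomainCapabilities(None, None, None, None, 0)
    root = ET.fromstring(caps_xml)

    firmwares = [v.text for v in root.findall("./os/enum[@name='firmware']/value")]
    print('auto-selectable firmware:', firmwares)   # e.g. ['bios', 'efi']

    sev = root.find('./features/sev')
    print('SEV supported:', sev is not None and sev.get('supported') == 'yes')

    conn.close()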
#### AMD SEV-SNP Support ####

The team discussed initial plans for SEV-SNP support, which depends on firmware and requires recent QEMU (≥9.2) and libvirt versions. Work may be postponed to the 2026.2 cycle due to these dependencies. For SEV-SNP, there is a need to provide hostData, an immutable and attested field (up to 32 bytes) that can be set by Nova or external tools. Using Nova metadata for this purpose was ruled out.
✅ Do not use Nova metadata to provide SEV-SNP host data.
✅ The instance POST API could be extended to allow providing this information.
✅ Takashi will write an initial spec, keeping the design generic for both SEV-SNP and TDX.

#### Generalize SEV / SEV-ES Code ####

The team reviewed the proposal to generalize the existing SEV/SEV-ES code in preparation for future Confidential Computing architectures such as Intel TDX and Arm CCA. The work focuses on refactoring internal Nova code to make memory encryption handling more generic, without changing any user-visible behavior or API.
✅ Treat this as a code refactoring only (no external impact).
✅ Ensure it does not conflict with Takashi's ongoing work.
✅ Implementation can proceed as a specless blueprint.
✅ No Tempest job is available. Manual validation is acceptable, ideally with screenshots or logs.
✅ Note: it's possible to run Tempest manually using a flavor requesting SEV/SEV-ES to verify correctness.

#### Arm CCA Support ####

The team discussed the roadmap for Arm CCA enablement in Nova. Upstream dependencies include the Linux kernel (Dec 2025), QEMU (Feb 2026), and libvirt (Mar 2026), with full availability expected around Ubuntu 26.04 (Apr 2026). As a result, development and testing in OpenStack are expected to start during the 2026.2 cycle.
✅ Features cannot merge until support is available in official distributions (not custom kernels).
✅ CentOS Stream 10 may serve as an early test platform if it gains CCA support before Ubuntu 26.04.
✅ Specs or PoCs can still be prepared in advance to ease future inclusion.
✅ Once libvirt support lands, the team will review the spec, targeting early in the H (2026.2) cycle.

#### Show finish_time in Instance Action Details ####

The proposal adds a finish_time field to the instance action show API, along with matching updates in the SDK and client. Some review comments still need to be addressed before re-proposing the patch for the G release.
✅ No objection to re-proposing the patch for Gazpacho (G).
✅ The existing spec remains valid.
✅ Gmaan will review previous comments to help move the implementation forward.

### Start cross session with Glance ###

A more complete summary of this joint session will be provided by the Glance team, as they can better capture the discussions and decisions made. However, the Nova PTG notes still include a few relevant takeaways and references for those interested.

### End cross session with Glance ###

#### Periodic Scheduler Update as Notification ####

The proposal suggests exposing Nova's periodic resource tracker updates as optional versioned notifications. This would help services like Watcher obtain a more complete and up-to-date view of compute resources, including NUMA topology and devices not yet modeled in Placement. While there are concerns about notification stability and data volume, the team agreed it could be useful for cross-service integrations.
✅ Support the idea in principle.
✅ Be mindful of the volume of data included in notifications.
✅ Sean will prepare a spec proposal detailing the approach.

#### Resource Provider Weigher and Preferred/Avoided Traits ####

The proposal introduces a resource provider weigher to influence scheduling decisions, with a longer-term goal of supporting preferred and avoided traits. This would allow external services, such as Watcher, to tag compute nodes (e.g., with CPU or memory pressure) and guide the scheduler to favor or avoid specific hosts.
✅ There are concerns about validating the weigher's behavior at scale, ensuring it works correctly with a large number of instances.
✅ Continue developing this approach.
✅ Improve the test framework, despite existing challenges.
✅ Add Monte Carlo–style functional tests to validate behavior in complex, non-symmetric scenarios.
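For readers less familiar with the scheduler plugin interface, the sketch below shows the general shape such a weigher could take using Nova's standard BaseHostWeigher hook. The trait names are placeholders, and how trait data would actually reach the weigher (here a hypothetical host_state attribute) is exactly what the spec needs to define:

    # Sketch of a host weigher built on Nova's weigher interface. The CUSTOM_*
    # trait names are placeholders for whatever the spec eventually defines;
    # a lower score pushes a host down the ranking.
    from nova.scheduler import weights


    class TraitPressureWeigher(weights.BaseHostWeigher):
        """Avoid hosts tagged as under pressure, favor lightly loaded ones."""

        AVOID = 'CUSTOM_MEMORY_PRESSURE'    # hypothetical trait set by e.g. Watcher
        PREFER = 'CUSTOM_LOW_UTILIZATION'   # hypothetical trait

        def _weigh_object(self, host_state, weight_properties):
            # How traits are surfaced to host_state is an open design question;
            # this attribute is illustrative only.
            traits = set(getattr(host_state, 'traits', []) or [])
            score = 0.0
            if self.AVOID in traits:
                score -= 1.0
            if self.PREFER in traits:
                score += 1.0
            return score

Operators would then enable it like any custom weigher, via [filter_scheduler]weight_classes.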
#### Proxying OVH's Monitoring-Aware Scheduling ####

The team discussed OVH's proposal to enhance scheduling decisions by integrating external monitoring data. The idea is to add a filter or weigher that delegates its logic to an external plugin, allowing operators to plug in custom metrics (e.g., Prometheus, InfluxDB) without modifying Nova directly. While the use case is relevant, existing out-of-tree filters and weighers already allow similar integrations. One of the main challenges identified was how to share and maintain such code consistently across the community.
✅ Current Nova interfaces likely already support this use case.
✅ Proposal to create a dedicated repository on OpenDev for sharing and maintaining custom filters and weighers.
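To illustrate the point that current interfaces likely already cover this, here is a rough sketch of an out-of-tree filter that defers to an external monitoring endpoint. The Prometheus URL, query, and threshold are placeholders, and a production version would need caching and better error handling:

    # Sketch of an out-of-tree scheduler filter that consults an external
    # monitoring system before accepting a host. URL, query and threshold are
    # placeholders.
    import requests

    from nova.scheduler import filters


    class ExternalMetricFilter(filters.BaseHostFilter):

        PROMETHEUS_URL = 'http://prometheus.example.com:9090/api/v1/query'  # placeholder
        MAX_CPU_RATIO = 0.85  # placeholder threshold

        def host_passes(self, host_state, spec_obj):
            query = ('instance:node_cpu_utilisation:rate5m{instance="%s"}'
                     % host_state.host)  # placeholder recording rule
            try:
                resp = requests.get(self.PROMETHEUS_URL,
                                    params={'query': query}, timeout=2)
                result = resp.json()['data']['result']
            except Exception:
                # Fail open if monitoring is unreachable.
                return True
            if not result:
                return True
            return float(result[0]['value'][1]) < self.MAX_CPU_RATIO

Such a filter is enabled through [filter_scheduler]enabled_filters like any other; the open question from the session is where code like this should live and be maintained.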
**** Friday topics ****

#### LY Corp – Graceful Shutdown Proposal ####

The team discussed LY Corp's proposal to implement graceful shutdown handling in Nova. The goal is to stop accepting new RPC requests while allowing in-progress operations, such as instance boots or live migrations, to complete safely. Different designs were compared, including stopping RPC listeners or using the upcoming oslo.messaging HTTP driver, which can handle clean shutdowns more reliably. The consensus was that the complete solution must involve oslo.messaging, allowing Nova services to dynamically subscribe to or unsubscribe from topics. This approach also implies an upgrade impact, requiring versioned changes to compute and scheduler topics.
✅ The current proposal needs further work, as live migration cases were not yet considered.
✅ Gmaan will continue the investigation and refine the approach based on oslo.messaging improvements.
✅ A revised spec will be submitted once ready.
✅ Gmaan confirmed commitment to follow up and lead this feature.

#### Offline VM Migration Using HTTP-Based File Server ####

The team discussed a proposal to enable offline VM migration using an HTTP-based file server. The preferred approach is to use WebDAV, which provides a standard protocol and existing server implementations. Nova itself will not host the web server, and access will be restricted to paths under /var/lib/nova.
✅ Use WebDAV as the preferred protocol.
✅ Avoid introducing non-standard protocols in Nova.
✅ Nova will not spawn or host the web server directly.
✅ Define a URL template in configuration for the remote filesystem driver (e.g., remotefs_url = https://%(host)s:9999/path/to/dav).
✅ Add DevStack support to test the feature in CI.

#### PCI and VGPU Scheduling Performance in Placement ####

The team revisited the performance issues with PCI and VGPU scheduling in Placement. Two sets of workarounds have already been implemented to reduce the number of valid and invalid candidates, but the core scalability problem of the GET /allocation_candidates query remains, especially in large, symmetric provider trees. A long-term constraint solver approach is under consideration, and the discussion will continue once more data is available from current deployments.
✅ Keep the current workarounds active and gather feedback over the next six months.
✅ Nova-next may run with these settings enabled to collect early results.
✅ Revisit the topic and defaulting decisions at the next PTG.
✅ Keep the issue open for urgent improvements if needed before the next cycle.
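For context on where the scaling pressure comes from, the request below is roughly the kind of query the scheduler ends up issuing; each granular resource group multiplies the candidate combinations Placement must compute on a symmetric provider tree. A sketch against the Placement HTTP API, with placeholder endpoint, token, and amounts:

    # Sketch: a granular GET /allocation_candidates request whose combinatorics
    # grow quickly on large, symmetric provider trees.
    # Endpoint, token, microversion and amounts are placeholders.
    import requests

    PLACEMENT = 'https://placement.example.com'   # placeholder
    HEADERS = {
        'X-Auth-Token': '<token>',                # placeholder
        'OpenStack-API-Version': 'placement 1.39',
    }

    params = {
        'resources': 'VCPU:8,MEMORY_MB:16384,DISK_GB:50',
        'resources_GPU': 'VGPU:1',        # one granular group per requested device
        'resources_NIC': 'SRIOV_NET_VF:2',
        'group_policy': 'none',           # required once more than one group is used
        'limit': 1000,
    }

    resp = requests.get(PLACEMENT + '/allocation_candidates',
                        params=params, headers=HEADERS, timeout=10)
    print(len(resp.json().get('allocation_requests', [])), 'candidates returned')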
#### Multithread and Multiqueue Support for Libvirt ####

The team discussed adding multithread and multiqueue support in Nova for libvirt-based instances. While QEMU already enables multiqueue for virtio-blk and virtio-scsi by default, the main focus is now on IOThread support and improving parallelism during I/O and live migrations.
✅ Enable IOThreads with one thread per device by default (handled as a specless blueprint).
✅ The virtio-blk and virtio-scsi specs are parked.
✅ Cancel the multiqueue work, since QEMU now manages it natively.
✅ The parallel migration blueprint is approved as specless.
✅ The earlier proposed spec will be abandoned in favor of this simplified path forward.

#### AZ-Aware Scheduling (Soft Anti-Affinity) ####

The team discussed improving availability zone–aware scheduling by introducing a soft anti-affinity policy for server groups.
✅ The overall direction is good, but the spec still needs more work.
✅ The team is broadly supportive of continuing in this direction.
✅ A spec will be proposed by Dmitriy.

#### NUMA-Aware Placement (CloudFerro Proposal) ####

The team discussed CloudFerro's proposal to enable NUMA-aware scheduling in Placement, based on the old Ussuri spec. The goal is to represent NUMA nodes as resource providers to allow better locality handling for CPU and memory resources. However, this raises potential performance concerns, especially when combined with PCI in Placement, as both increase candidate permutations.
✅ Ensure NUMA and PCI in Placement work together correctly before expanding scope.
✅ Do not move PCI devices under NUMA subtrees yet.
✅ Add functional tests to validate the combined behavior.
✅ Consider extending CI to run on hosts with real NUMA topology.
✅ Enable NUMA awareness in nova-next (single NUMA node for now).
✅ Investigate with the infra team the possibility of adding dual-NUMA test nodes.
✅ Improvements to Placement's group_policy remain out of scope for this work.
✅ The team is happy with the current direction and looks forward to future patches landing in master.

#### WSGI Script Removal ####

The team discussed removing the legacy wsgi_script support from Nova. No objections were raised, and no spec is required.
✅ Proceed with the removal; the team is aligned and ready to move forward.

#### Improving Cyborg Support ####

The team discussed reviving Cyborg integration efforts in Nova. Several aspects remain unfinished, including proper mdev GPU XML generation, support for additional device buses (block, NVMe, CXL, USB), and handling ownership overlap between Nova and Cyborg in Placement. The plan is to resume this work gradually, starting with the most mature components.
✅ Restart the Cyborg–Nova integration effort.
✅ Split the work into multiple specs, as each part is complex enough to warrant separate discussion.
✅ Focus first on mdev support, then extend to other buses (block, NVMe, etc.).
✅ Investigate Placement ownership to avoid inventory conflicts between Nova and Cyborg.

If you've read this far, thank you! 🙏
If you spot any mistakes or missing points, please don't hesitate to let me know.

Best regards,
René.