We're on a wet Monday here and I guess it's time for me to take up my pen again and write a PTG recap email for the current release, which is 2023.1 Antelope.

Yet again, I *beg* you to *not* use any translation tool when reading any etherpad, including the translation feature embedded in Google Chrome, or it would translate the whole etherpad content for *all* readers.
In order to prevent any accidental misuse, here is a readonly copy of the Nova etherpad we had for the week: https://etherpad.opendev.org/p/r.8ae4e0ef997aebfe626b2b272ff23f1b

Again, I'm human and while I can write bugs (actually I write a lot of them), I can also write wrong things in this email and I can misinterpret something precise. Apologies if so, and please don't refrain from correcting me by replying to this email.

Last but not least, I'm more than open to any questions or remarks that would come from the reading of this long email.

Good luck with this very long thread, you should maybe grab a coffee before starting to read it (and I promise to keep it as short as I can).

### Operator hours ###

After the success of the productive Nova meet-and-greet session we held in Berlin (packed room w/ lots of feedback), we were eager to again have a discussion with operators, this time virtually, which would hopefully lower the entry barrier by not requiring attendance at an in-person event. Consequently, we allocated three timeslots of one hour, two back-to-back on Tuesday and one on Wednesday (at different times to accommodate time differences between operators).
Unfortunately, I have to write here that we only had a very small attendance on Tuesday with only three operators joining (but providing great feedback, thanks btw!) and *none* on the 1-hour Wednesday slot.
As a result, I won't provide statistics about release and project usage here as numbers are too low to be representative, but here is what we discussed:
(again, a readonly etherpad is available here https://etherpad.opendev.org/p/r.aa8b12b385297b455138d35172698d48 for further context)

# Upcoming features and general discussions
- the possibility to mount Manila shares in Nova seems promising.
- as Nova doesn't support multiple physnets per Neutron network today, this prevents the use of routed networks in some cases.
- modifying our mod_wsgi usage to allow passing some arguments is requested.
- operators seem interested in having PCI devices tracked in Placement.

# Pain points
- getting inventories or used/available resources in Placement becomes costly as it requires N calls, with N being the number of Resource Providers (instead of a single HTTP call). The problem has been acknowledged by the team and some design discussion occurred.
- we should paginate on the flavors API and we need to fix the private > public flavor bug.
- mediated devices disappearing at reboot is a problem. This was discussed later during the contributor session, see below.
- routing metrics are hard to manage when you have a long list of multiple ports attached. We eventually agreed that the proposed fix could be an interesting feature for the Neutron team to develop.

That's it for the operator discussions. Let's now move on to what we discussed in the contributors PTG agenda:

### Cross-project discussions ###

Two projects held discussions with the Nova community this time:

# Ironic-Nova
The whole session was about the inconsistencies that happen when a nova-compute service goes down and the Ironic nodes get rebalanced.
- we agreed on the fact this wasn't an easy problem to solve
- there was a consensus that Ironic should support a sharding key so the Nova compute could use it instead of using hash rings.
- JayF and johnthetubaguy agreed on codifying this idea into a spec
- we'd like to hear feedback from operators about how they feel about the above migration (splitting their nodes into shards and providing sharding details to nova)
- in parallel to the above, documentation has to be amended in order to recommend that *existing deployments* set up an active/passive failover mode for Nova computes (instead of relying on hashring rebalances)

# Neutron-Nova
(I'll only briefly cover the details here, as they will also be covered by the Neutron PTG recap)
- while Nova is finishing up the tracking of PCI devices in Placement this cycle, we agreed on deferring the modeling of Neutron-based PCI requests until next cycle. This won't have any impact on existing features, which will continue to operate seamlessly.
- the way we define switchdev capabilities as of now is incorrect. We agreed on modifying Nova to allow it to report such capabilities to Neutron. This may be a specless blueprint or just a bugfix (which could then potentially be backported), to be determined later.
- MTUs are unfortunately immutable. If you change the MTU value in Neutron, you need to restart the instance (or reattach the port). We consequently agreed on documenting this caveat, which is due to libvirt/qemu not being able to change the MTU while the guest is running.
- we eventually agreed on letting --hostname (parameter of an instance creation request) represent a FQDN, through a new nova microversion. This value won't be sanitized by nova and will be passed directly to Neutron as the hostname value. We also agreed on the fact Nova WON'T ever touch the port domain value in metadata, as managing domain names is not in the project scope.

### Procedural discussions (skip it if you're not interacting with the Nova contributors) ###

# Zed retrospective
- We had a productive Zed release and we did a good job of reducing bugs and bug reports. Kudos to the team again.
- The microversions etherpad we had was nice but planning microversions usage is hard. We agreed on rather providing a document for contributors explaining how to write an API change that adds a new microversion and how to easily rebase this change if a merge conflict occurs (due to the microversion being taken by another patch that merged)
- We agreed on keeping an etherpad for tracking all API changes during milestone-3
- We agreed on filing blueprints for TC goals that impact Nova
- bauzas will again promote the use of the review-priority label in Gerrit during weekly meetings, so that cores can show their interest in reviewing a particular patch.

# Promoting backlog features to new contributors and enhancing mentoring
- Starting this cycle, we'll put our list of small and easily actionable blueprints into Launchpad bug reports that have a "Wishlist" status and both a 'low-hanging-fruit' and a 'rfe' tag. New contributors or people wanting to mentor a new upstream member are more than welcome to consume that list of 'rfe' bugs and identify the ones they're willing to work on. A detailed process will be documented to help newcomers join.
- We'll also move the pile of work we defer into 'backlog specs' when it requires further design (and is less actionable from a newcomer perspective)

# Other procedural and release discussions
- we'll stop using Storyboard for tracking Placement features and bugs and we'll pivot back to Launchpad for the 'placement' sub-project.
- we agreed on a *feature* review day, possibly 4 weeks before the feature freeze, in order to catch any late design issue we could have missed when reviewing the spec.
- we will EOL stein and older nova branches, with elodilles proposing the deletion patch. We agreed on discussing EOL'ing train at the next PTG.
- gibi will work on providing a static requirements file for nova that would use the minimum versions of our dependencies (non-transitively) and on modifying the unit and functional test jobs to rather use this capped requirements file for testing.
- we discussed the User Survey and we agreed on discussing any question we may want to add to the next survey during the next weekly meetings.
- the "2023.1 Antelope" naming is a bit confusing, but we agreed that we should continue to use either "2023.1" or "2023.1 Antelope" for naming our docs. We are also waiting for guidelines in order to consistently name our next stable branch (2023.1 possibly).

(That's the end of procedural and release discussions, please resume here if you skipped the above)

### Technical bits ###

# VMware driver status
- As no job runs have happened since April, we agreed on periodically logging in the nova-compute logs that the code isn't tested, so operators running it would know its upstream status.
- we'll update the support matrix documentation to reflect this 'UNTESTED' state

# TC goals this cycle
- for the FIPS goal, we agreed on the fact the current FIPS job (running on centos 9 stream) shouldn't be running on the gate and check pipelines. As the current job is busted (all periodic runs go into TIMEOUT state), we also want to modify the job timeout to allow 30 more minutes of run time (due to a reboot in the job definition)
- for the oslo.privsep goal, no effort is required from a nova perspective (all deliverables are already using privsep). sean-k-mooney can propose a 'rfe' bug (see the note above on new contributors) for modifying our privsep usage in nova (using different privsep contexts per caller)
- for the Ubuntu 22.04 goal, gmann already provided changes. We need to review them.
- for the RBAC goal, let's discuss it in a proper section just below

# Next steps on RBAC
- we need to test the new policy defaults in a fully integrated tempest testsuite (meaning with other projects), ideally before the first milestone.
- once we have checked that everything works as expected, we can flip the default (enabling the new policies) before milestone-2
- we'd like to start drafting the new service role usage in two new specs (one for nova, the other one for placement)

# Power management in Nova
A pile of work I'm proud we're going to start this cycle. This is about disabling/enabling cores on purpose, so that power savings occur.
- we agreed on supporting it directly in Nova (rather than designing and supporting an external tool, which would require drafting some heavy interface for what is an easy quick win). This would just be a config flag to turn on that would enable CPU cores on demand.
- a potential follow-up *may* be to use a different CPU governor depending on flavors or images but this won't be in scope for this release.

# Power monitoring in Nova
I don't really like this name, as monitoring isn't part of the Nova mission statement, so I'll clarify. Here, this is about providing an internal readonly interface for power monitoring tools running on guests that would be able to capture host consumption metrics. One example of such monitoring tools is Scaphandre, demonstrated during a keynote at the last OpenInfra Summit.
- we agreed on reusing the virtiofs support we're going to introduce in this release for the Manila share attachment usecase
- this share would be readonly and would be unique per instance (ie. we wouldn't be supporting multiple guest agents reading different directories)
- this share would be enabled through a configuration flag per compute, and would only be mounted per instance by a specific flavor or image extraspec/metadata.
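
Coming back to the power management item for a moment: on Linux, a CPU core can be taken offline or brought back by writing 0 or 1 to its `online` file under /sys/devices/system/cpu. The sketch below only illustrates that kernel interface. The helper names and the `sysfs_root` parameter are made up for the example, and it says nothing about how Nova will actually implement the feature or name its config flag.

```python
from pathlib import Path

# Default location of the per-CPU sysfs tree on Linux.
DEFAULT_SYSFS_ROOT = "/sys/devices/system/cpu"


def core_online_path(core_id: int, sysfs_root: str = DEFAULT_SYSFS_ROOT) -> Path:
    """Return the path of the 'online' control file for a given core.

    Note: some cores (often cpu0) expose no 'online' file because they
    cannot be hot-unplugged.
    """
    return Path(sysfs_root) / f"cpu{core_id}" / "online"


def is_core_online(core_id: int, sysfs_root: str = DEFAULT_SYSFS_ROOT) -> bool:
    """Read the current state of a core ('1' means online)."""
    return core_online_path(core_id, sysfs_root).read_text().strip() == "1"


def set_core_online(core_id: int, online: bool,
                    sysfs_root: str = DEFAULT_SYSFS_ROOT) -> None:
    """Bring a core online or take it offline (needs root on a real host)."""
    core_online_path(core_id, sysfs_root).write_text("1" if online else "0")
```

The `sysfs_root` argument is only there so the helpers can be pointed at a fake directory tree for testing instead of the real sysfs.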

# Database soft-deleted records and the archive/purge commands
- We don't want to deprecate DB soft-deleted records as some APIs continue to rely on those records.
- We'd rather add a new parameter to the purge command that will directly delete the 'soft-deleted' records from the main DBs and not only from the shadow tables (skipping the need to archive, if the operator wants to use this parameter)

# Nova-compute failing at reboot due to vGPU missing devices
Talking of bug #1900800, the problem is that mediated devices aren't persistent, so nova-compute isn't able to respawn the instances after a reboot if those are vGPU-flavored.
- we agreed on the fact that, like for any other device, Nova shouldn't try to create them and we should rather ask the operator to pre-create the mediated devices, exactly like we do for SR-IOV VFs.
- we'll accordingly deprecate the mdev creation in the libvirt driver (but we'll continue to support it) and we'll log a warning if Nova has to create an mdev.
- we'll change the nova-compute init script to raise a better exception explaining which mdev is missing
- we'll document a procedure explaining how to get existing mdev information and persist it with udev rules (for upgrade cases)

# Robustify our compute hostname changes
To be clear, as a first step, we will continue to *NOT* support nova-compute hostname changes but we'll better detect a hostname change to prevent later issues.
- persisting the compute UUID on disk seems a good candidate for a first step, so a spec is targeted for this cycle.
- next steps could be to robustify our relationships between instance, compute node and service object records but this design will be deferred to a later backlog spec.

# Move to OpenStackClient and SDK
OSC is already fully supported in Zed but it continues to rely on the novaclient python bindings for calling the Nova API.
- we agreed on modifying OSC to rather use openstacksdk (instead of novaclient) for communicating with the Nova APIs.
  Contributors welcome on this move.
- we agreed on no longer using project client libraries in nova services (eg. cinderclient used by nova-compute) and rather using openstacksdk directly. A 'rfe' bug per project client will be issued, anyone willing to work on it is welcome.
- we also agreed on continuing to support novaclient for a couple of releases, as operators or other consumers of this package could need substantial effort to move to the sdk.
- we agreed on changing the release model for novaclient to be independent so we can release anytime we need.

# Evacuate to target state
We understand the usecase (evacuating an instance shouldn't always turn the evacuated instance on)
- that said, we don't want to amend the API to pass a parameter, as this would carry tech debt indefinitely
- we prefer to propose a new microversion in which evacuate would eventually leave the instance stopped instead of starting it
- operators wanting to keep the original behaviour would need to negotiate an older microversion with the Nova API, as we don't intend to make spawning on evacuate an option chosen through the API.

(aaaaaaaaaaaaaaaand that's it for the PTG recap)

Kudos, you were brave, you reached that point. Hope your coffee was good and now you feel rejuvenated. Anyway, time for me now to rest my fingers and to enjoy some deserved time off.

As said, I'm all up for any questions or remarks that would come from the reading of this enormous thread.

-Sylvain