[nova][ptg] 2024.1 Caracal PTG summary
(sorry folks, I was working on other urgent matters, hence the late summary)

Again, as I say every cycle, that's a wrap! Thanks to all the contributors that were around, we were more than 10 every day, woohoo!

As a reminder (like every cycle ;-) ), please don't open the main etherpad link [1] if you use a Web browser with an automatic translation feature. If you do, it will translate all the phrases directly in the etherpad, as if you were modifying the etherpad yourself. Please also make sure you don't accidentally remove or modify lines.

In order to prevent any misuse, this etherpad is a read-only copy of the main one: https://etherpad.opendev.org/p/r.02f889c0423aa279a3fa8e136becad10

Anyway, let's stop discussing the etherpad, and lemme provide you the summary:
(oh and I could be wrong in explaining some stuff, no worries, just reply in here if so)
(please also grab a coffee or whatever else, you may need it)

### Operator hour ###
(read-only etherpad for this here: https://etherpad.opendev.org/p/r.06c38397f2cdd81380649312f032b22f)

So, we had around 4-5 operators for this operator hour, lovely. This is what we discussed:

# flavors explosion for public clouds
Basically, there are multiple usecases here: it is hard to discover flavor properties except by looking up the description, but it is also difficult to filter flavors based on some properties since they can differ across clouds.
What we agreed (as a start):
- Glance metadefs concept [2] could help to provide a catalog of flavor property definitions if operators were using it.
- Nova could provide a new API microversion for filtering flavors based on a request of standard traits and/or resources (those queried within minimum and/or maximum bars). We don't know yet who could be the owner of such a specification, contributors welcome.
- Another feature could be "Nova, tell me whether my flavor could have some candidates?"
  but again, we need hands on deck.

# Nova could get a better housekeeping system for orphaned resources #
(I'm literally paraphrasing the topic name)
OK, so, after discussing with the operator, we eventually found two use-cases:
- as an operator, I'd like nova to magically adopt a compute node coming from elsewhere (say a region or another nova deployment). Long story short (look up the etherpad for more details), the winning strategy that we could implement would be:
  #1 adopt a whole separate nova cell by using a specific nova-manage command that would create records in the target nova-api DB based on the records from the source API DB. The operator was happy with working on a POC and discussing with the nova community to make it a quick win.
  #2 live-migrate my instances between cells. Well, that one is more difficult to implement, so we could punt this until #1 is implemented.
- as an operator, I sometimes need to rebuild a compute OS. Well, on that one, we discussed it further in terms of "project cleanup". Again, long story short, we explained how to use the 'OSC project cleanup' [3] feature that'd do the job.

# I want to be able to detach my root volume
Well, that one wasn't really discussed, but the topic was filled in the etherpad. There is no real owner of this feature request, which absolutely requires a large community design discussion since resolving this isn't trivial (and we probably need to make some trade-offs in terms of guest support).
Anyway, if you read those lines and you do care about such a usecase, please understand that you need to dedicate resources and come to the community with them. I, as PTL, can help by providing some mentoring and onboarding if needed.

### Cross-project meetings ###
We had two cross-project sessions, one with Neutron and one with Cinder.

# Neutron cross-project session
We only discussed one topic, but a very important one: how would Neutron feel if Nova starts to require some optional Neutron API extensions?
:-)
Eventually, what we agreed with Neutron is:
- Nova will deprecate by Caracal the legacy codepath that doesn't use multiple port binding and provide a LOG.warning
- Nova will delete this ^ in the E* release
- Sean will provide a nova blueprint explaining which API extensions will be mandatory for Nova, and he will also file a neutron RFE bug report for that.
- Eventually, we'll provide release notes and a ML thread explaining all of this (we being Brian and me)

# Cinder cross-project sessions
We had two sessions, one on Wednesday, the other on Friday. Please look at both [4] and [5] for the summary of each of them.

### Bobcat retrospective and Caracal planning ###
17 blueprints were approved, 9 of them landed, one of them got reverted due to late issues discovered with OSC.
34 contributors this cycle, fewer than previously (mostly fewer one-off contributors).
The list of untriaged bugs exploded, but some of our awesome contributors did a very late scrubbing, so now we're down to 30, huzzah.

What we agreed at the PTG:
- we're gonna abandon the review-priority Gerrit label in favor of using an etherpad for both feature and bug cycle tracking (we were already doing etherpad tracking for approved blueprint implementation patches)
- that cycle tracking etherpad will provide a section for proposing patches, and another section only written by the Nova PTL containing a scrubbed list of changes ready to review.
  Process to be further documented later (when our PTL eventually has time to do it)
- PTL will review every approved spec to ensure that the testing requirements are reasonably defined

Please also take note of our approved Caracal schedule (I'll propose a release patch to add those deadlines into the main Caracal schedule page):
- Spec freeze at the second milestone
- Spec review days on Tuesdays of the weeks R-21 and R-17
- Implementation review days on Tuesdays of the weeks R-20 and R-12

### EOLing our unmaintained branches ###
- bauzas will propose a patch for EOLing Ussuri
- We'll defer until next cycle to see whether we drop Victoria and Wallaby

### SLURP cadence impact and curating Caracal ###
- We are bumping our libvirt minimums early this cycle to Ubuntu Jammy versions
- We will also update our next minimum versions later this cycle based on the versions shipped in the next Ubuntu 24.04 LTS release (once a beta build is available).
- we'll forward port our Bobcat deprecation notices in the Caracal release notes
- we end up merging all the required bits (mostly ORM changes) for preparing the SQLA 2.0 upgrade in a later release
- we'll remove vmwareapi and hyperv support in Caracal
- we agreed for the virt driver removals to not provide an API microversion on the specific API endpoints and rather return an HTTP 400

### python tooling we could use ###
- We agreed on using sphinx-lint, making the docs job that runs on code changes use it.
- We agreed on testing the codespell tool with the pep8 job and eventually either keeping it or removing it, depending on whether it finds many false positives
- We agreed on reviewing the pyupgrade patches as a trial, with no clear promises.

### Planning new features ###
- the Healthchecks spec is going to be reproposed this cycle, some implementation details about how to do RPC checks have to be identified
- we agreed on exposing the value of the pinned AZ in GET /servers/<uuid>/details
- we agreed on doing a kind of soft-[anti-]affinity for AZs using instance groups (either through new policies or some new concept of 'destination')
- PCI device affinity for the libvirt driver seems an effort to be resumed. Testing this is a concern but we'll discuss this in the spec
- the notion of grouping PCI devices (imagine two PCI devices for the same card, you'll get the idea) seems an interesting usecase, worth spec'ing. We could imagine the notion of a single placement unit being a 'pci group' with one PCI device being the main object to track. The neighbouring aspect seems hard to implement so we reserve it for later.
- Having console timeouts on token expiry seems a nice security feature, reviews welcome on the spec.
- VGPU efforts will be huge this cycle. More to be discussed below in a specific section

### Caracal cleanups, bugs and follow-ups ###
- we really want to change our default quota system to use unified limits soon, but there are some upgrade concerns with flavors that contain placement resource classes. They require us to do further work with oslo.limits or Keystone in order to not break operators that chose to use the unified limits driver. We could also add some checking in the nova-manage migration tool that would emit a warning if resource classes were used.
- our config with images_types_backend, use_cow_images and force_raw_images is confusing and error-prone. We agreed on fixing some migration bug in a backportable way, but also on deprecating use_cow_images and force_raw_images while providing meaningful config options instead.
- we're mixing IP addresses and URLs usage in our configs, we could be more consistent.
  What we agreed on is that a new migration_inbound_addr config sounds cool, defaulting to the my_ip config value but also accepting a FQDN or hostname

### The VGPU efforts ###
- we're actively working on providing upstream testing for mdevs, thanks to the mtty kernel samples framework. mtty live-migration is currently an ongoing patch, but someone proposed to backport this into a custom kernel built on purpose for our usage. Either way, mtty testing will be done as much as possible in nova-next but we could create a new periodic/experimental target for testing new features like live-migration.
- we plan to do mdev live-migration this cycle. There are lots of limitations that we need to document but this sounds doable. We could add some pre-live-migration check about mdev types and this would require some object change. We may also need to persist the mdev/allocation relationship but that will be discussed in the spec
- nvidia is going to drop vfio-mdev for their SR-IOV GPUs and use a specific vfio-pci variant driver. This is still very early and our Caracal efforts will consist in identifying the necessary nova adaptations. One is already identified (adding a managed flag) and we agreed on a solution.
- we're gonna change how nova creates the mediated devices by rather defining a libvirt nodedevice, which will also facilitate operator maintenance (they could just use mdevctl or define udev rules)
- we agreed on deprecating the config default that allows you to only specify mdev types without specifying the addresses.
- SR-IOV GPUs currently leak some unused VFs into Placement, leading to capacity issues [6]. We could address that problem by either making the device's explicit definition mandatory or adding a new config option that would specify the number of VFs to use per type.
- some other bugs were discussed with an operator who was present.
  We agreed on doing further testing to clearly identify problems, and some patches are already up for review (like the evacuate case or the multi-create limitation) but they also require some rebasing.

Anyway, I guess I'm done now. Thanks for having read this far and I hope your coffee (or tea) was good.

HTH and thanks,
-Sylvain (on behalf of the whole Nova community)

[1] https://etherpad.opendev.org/p/nova-caracal-pt g (please remove the empty char between 'pt' and 'g')
[2] https://docs.openstack.org/glance/latest/user/metadefs-concepts.html
[3] https://docs.openstack.org/python-openstackclient/latest/cli/command-objects...
[4] https://wiki.openstack.org/wiki/CinderCaracalPTGSummary#Cross_project_with_n...
[5] https://wiki.openstack.org/wiki/CinderCaracalPTGSummary#Nova_Cinder_cross_pr...
[6] https://bugs.launchpad.net/nova/+bug/2041519
participants (1)
-
Sylvain Bauza