On Mon, 2024-04-22 at 19:43 +0200, Sylvain Bauza wrote:
Hey folks, Sorry for being a bit late to provide this summary, but let me do it :-) So, we had between 10 and 20 contributors during our 4 days. Wow, it was nice! Thanks to all of you who were there, and I was happy to see you even if it was your first time joining.
As a reminder, please don't open the main etherpad link [1] if you use a Web browser with an automatic translation feature. If you do, it will translate all the phrases directly in the etherpad, as if you were modifying the etherpad yourself. Please also make sure you don't accidentally remove or modify lines. To prevent any misuse, please look at this read-only etherpad link instead: https://etherpad.opendev.org/p/r.3d37f484b24bb0415983f345582508f7
Now, let me go through the summary (first, grab a coffee if you want).
First, we only had one cross-project meeting, with the Neutron team. I won't provide the summary for this as it was only a neutron discussion (this was about multiple port bindings support with only some neutron backends). Please rather look at the neutron etherpad or wait for its summary.
### Caracal retrospective and Dalmatian planning ###
19 blueprints were approved, 10 of them landed. We also had more contributors (38) than last cycle. We now unfortunately have >60 untriaged bugs so we discussed that. We agreed that rather than triaging the bugs every week, we should propose a bug triage day on a periodic basis and welcome anyone who wants to help with it. For each bug report that is older than 5 years, we will also ask the reporter whether they can try to reproduce the issue with a newer/supported release. We also discussed how to make sure we could review the feature implementations. In the end, we agreed that we should ask spec owners to discuss with us about having feature liaisons for their specs, so we could have something like 'subteams' per feature implementation. I'll explain that separately in a new email in the next few weeks.
About the proposed Dalmatian planning, we'll have two spec review days (R-20 and R-13) but also a spec approval freeze day (a deadline, if you prefer) by R-11 (mid-July). We'll also have an implementation review day by R-10.
### Enhancing support for running nova services in k8s ###
Gibi agreed to provide a backlog spec explaining what we should modify in Nova. For graceful shutdown support, gibi will also create another spec so that oslo.messaging supports using only some RPC topics.
### Unified limits default behaviour with no registered limits ###
We discussed a specific situation you hit today with unified limits: you need to register limits for all the resource classes you use in flavors, because otherwise you won't have quotas for them. We then agreed that all the 'core' resource classes (meaning the non-custom ones) should be registered. melwitt will provide a document explaining it. For custom resource classes, we agreed on providing a relaxed mode (in the nova configuration) that accepts unregistered limits only for custom resource classes. That option would also have a strict mode that requires registering even the custom RCs.
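To make the relaxed/strict distinction concrete, here is a minimal sketch of the check described above. The function name, the mode names and the way limits are passed in are all hypothetical (nothing here is the actual nova option or code); the only thing taken as given is that custom resource classes carry the CUSTOM_ prefix.

```python
from enum import Enum


class UnregisteredLimitMode(Enum):
    # Hypothetical names for the two modes discussed at the PTG.
    RELAXED = "relaxed"
    STRICT = "strict"


def blocking_resource_classes(requested, registered, mode):
    """Return the requested resource classes that have no registered limit
    and therefore block the request under the given mode."""
    missing = [rc for rc in requested if rc not in registered]
    if mode is UnregisteredLimitMode.STRICT:
        # Strict: every resource class, custom or not, needs a registered limit.
        return sorted(missing)
    # Relaxed: only the 'core' (non-custom) resource classes need one.
    return sorted(rc for rc in missing if not rc.startswith("CUSTOM_"))
```

With a flavor requesting VCPU, MEMORY_MB and CUSTOM_GOLD while only VCPU has a registered limit, the relaxed mode would report MEMORY_MB alone, and the strict mode would report both MEMORY_MB and CUSTOM_GOLD.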
### Missing instance action events ###
We have a bug report about missing instance action events in our API [2]. We agreed on fixing this without requiring a new API microversion. For each of the missing events, ratailor will provide individual bugfixes and will also try to assert the events in Tempest scenarios.
### Ephemeral encryption feature ###
As we started to merge some changes for this feature in Caracal, we also found some concerns that needed to be discussed during this PTG. Melwitt will provide a new spec for a new API microversion that adds new request parameters for snapshotting an instance. If an instance that has an encrypted disk needs to be rescued, either the rescue disk will also be encrypted or it will use the same key as the rescue image. Resize won't be accepted yet, but melwitt will look into how it could be unencrypted. Finally, we also discussed the fact that compute services should check, when moving an instance, whether the target supports the encryption options.
### Nova Metadata optimisation ###
In case the metadata API is called quite frequently, there is some read amplification. In order to fix this, fwiesel will create a blueprint for lazy loading some returned parameters and telling which metadata paths are the most called. For the moment, we won't do any cache invalidation feature as we first need to understand whether the lazy-loading helps.
by the way i tried to comment on this on slack afterwards, but i don't believe we should add lazy loading. tl;dr cloud-init only retries the first query and assumes all other data is ready once the first query is complete. lazy loading would break that assumption and as a result break the cloud-init integration.
i would prefer if we provided a way for the guest to request the cache to be invalidated, i.e. by doing an HTTP DELETE to the root of the metadata api or whatever the correct http semantic request would be.
### Extending memory encryption support ###
For the moment, Nova only supports AMD-SEV for encrypting the instance memory. We discussed two other hardware technologies: AMD SEV-ES and SGX. For AMD SEV-ES (new AMD SEV generation), we agreed on documenting some not configuration support by BIOS, modifying the inventories by using nested resource providers with the same resource class, and maybe needing some reshape. Takashi will work on it for this cycle (live-migration won't be done this cycle). For SGX, we said we need a new blueprint and a spec. We are OK to create a new nested resource provider for the resource class, but we would like to know the current move operation limitations.
### Stateless firmware support ###
Even if the libvirt driver recreates a new firmware when we're moving/resizing an instance, we discussed whether we should support a specific stateless firmware. We agreed on accepting it, so takashi will work on it, but will also fix the related bug reports [3].
### Automatically detecting vTPM support ###
Long story short, we accepted [4] as a specless blueprint. takashi will work on the implementation this cycle.
### vTPM migrations ###
Even if Nova supports vTPM related instances, we don't yet support rebuild, rescue, shelve, live migration and evacuation. We agreed to fix live-migration/evacuate as a priority this cycle by persisting the vTPM memory somewhere. That said, we need to find a contributor, hopefully in the coming weeks. A bit related, we later discussed another live-migration topic that was related to our SSH usage, and we said in that topic that some kind of small object-store in nova (or elsewhere) could be useful for transferring instance-related local datablocks (like console logs, NVRAM or config drives), including vTPM memory state.
### Live migration options difference ###
So, you probably know that we have a lot of live migration configuration options now, at least the fact that the live_migration_uri option was deprecated in favor of live_migration_scheme and live_migration_inbound_addr. Unfortunately, those new options lack a way to provide some elements like user, port, path and query string. In the end, we agreed on deprecating live_migration_scheme this cycle while *un*deprecating live_migration_uri. That doesn't mean that we'll remove live_migration_scheme quickly, at least not until 2025.2 F (based on the SLURP policy), so distros shouldn't be afraid.
by the way we can signal deprecation separately from signalling intent to remove, so if we really want to we can defer adding the second flag to signal removal. as sylvain says we are not in a rush to remove live_migration_scheme but we do want to advertise that we no longer plan to remove live_migration_uri and that it is no longer deprecated.
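To illustrate the gap between the two options, here is a small sketch in plain Python. The two helper functions, the host name, the user, the port and the key path are made up for the example and are not nova code; the sketch only assumes that a full URI value may contain a "%s" placeholder that is replaced with the migration target.

```python
from urllib.parse import urlsplit


def uri_from_scheme(scheme: str, target: str) -> str:
    """Roughly what a scheme-only option can express: transport plus host."""
    return f"qemu+{scheme}://{target}/system"


def uri_from_template(template: str, target: str) -> str:
    """A full URI template, where '%s' is replaced with the target host."""
    return template % target


target = "compute-2.example.org"  # hypothetical migration target
simple = uri_from_scheme("ssh", target)
full = uri_from_template(
    "qemu+ssh://nova@%s:2222/system?keyfile=/etc/nova/migration_key",
    target,
)

# The template carries user, port and query string; the scheme form cannot.
parts = urlsplit(full)
print(parts.username, parts.port, parts.query)
print(urlsplit(simple).username, urlsplit(simple).port)
```

The first form has no place for the user, port, path or query string, which is the gap that motivated un-deprecating the full URI option.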
### Allow Userdata to be updated ###
Well, that one is a long overdue feature request. crohmann kindly updated the spec [5] for this cycle based on our feedback from the last reviews. We agreed on reviewing that spec, and a quick glance showed it was going in the right direction.
### Clean up old compute service versions ###
That one has been a long discussed topic, even if just tech debt related. The case is that we have old cruft of upgrade checks in our code that we could remove, but we were afraid of removing some unsupported yet useful 'upgrade envelope' between very old computes and new controllers. After a lot of back and forth, we eventually agreed on having some kind of upcall check from compute to conductor that would signal to this compute that it's too old and needs to self-disable, while having some 'escape valve' (quoting directly) in case it incorrectly disables things. Dansmith agreed to start scheming it.
### OpenAPI-compatible API reference documentation ###
Stephenfin briefed us on the cross-project effort aiming to have projects' api-refs automatically generated through some tooling. In order to achieve that, there are a few gaps in our current code that need to be fixed, mostly about adding some response schema validation. stephenfin already proposed a spec [6]. We agreed on reviewing it. We also wanted the implementation to be iterative per API resource, to leave the deprecated APIs untouched and to keep the existing api-ref until the new documentation is on par. Stephenfin agreed to update his spec based on that feedback.
### SPICE-direct console support ###
A new nova spec [7] was created this cycle by mikal in order to provide some hypervisor details to an external tool that could then pass a SPICE-direct console from a specific instance to a user. We discussed this feature and we said that the use case was reasonable. That said, we had some concerns about the technical modifications, but we'll discuss them in the spec review.
### Remote consoles API ###
When a user asks Nova to create a new remote console for an instance, the user doesn't tell which kind of console they want to get; only the operator can set which console kind to use per compute. We agreed on providing a new API microversion to let users say which kind of console they want (html5 or tcp-direct for example) with a new request parameter. Fwiesel agreed to create a new spec for it this cycle.
### Resurrected computes should doublecheck the RPC messages they get ###
Sorry for the terrible section name above, I tried to make it short. Basically, when a compute service disappears for some time and then returns, it can process any RPC request which was in the queue. If some instance was evacuated in between, then this compute will dumbly try to process what was called earlier (like a move operation). We agreed that this is a bug that needs to be fixed by adding some conditional in any RPC method targeted by a move operation. We don't know yet who could work on it, maybe tobias-urbin since he reported it.
this is actually missing context. it's not just about move operations, it's more like the following: if you have a server on host A and that host has a failure, when the operator evacuates it to host B there is a period of time where a user could issue api actions like stop that will be queued for host A. when the operator restores host A (i.e. replaces the power supply) and that service starts up, it may dequeue a message that was sent before the evacuate was done. nova needs to check, when we dequeue a message, that the instance is actively managed on this host and avoid processing requests for instances it no longer manages. in this case it should avoid doing an instance.save to record that the instance should be powered off.
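A toy sketch of the check described above, just to pin down the idea. The Instance class and the handle_stop function are stand-ins invented for this example, not nova objects or RPC methods; the only point is that the handler compares the instance's current host with its own before touching or saving anything.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    # Stand-in for an instance record; 'host' is the compute that manages it now.
    uuid: str
    host: str
    power_state: str = "running"
    save_calls: int = 0

    def save(self) -> None:
        # Stand-in for persisting the record; here we only count the calls.
        self.save_calls += 1


def handle_stop(instance: Instance, this_host: str) -> bool:
    """Process a queued 'stop' request only if this host still manages the instance.

    Returns True if the request was processed, False if it was dropped as stale.
    """
    if instance.host != this_host:
        # The instance was evacuated elsewhere while the message sat in the
        # queue: do nothing, and in particular do not save a new power state.
        return False
    instance.power_state = "shutdown"
    instance.save()
    return True
```

For an instance evacuated from host A to host B, a stale stop dequeued on host A would be dropped without any save, while the same request arriving on host B would be processed normally.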
### How can we get rid of eventlet in Nova? ###
I won't go into the details of what eventlet is, but please consider that all nova services are heavily dependent on that python library for concurrency reasons, mostly the compute service (besides the API and other services for the workers usage). Sean Mooney proposed a plan for using direct python threading (instead of asyncio) to replace our workers model and to create specific distinct threadpools for compute-related operations. We agreed on that approach, so we'll let sean create a blueprint (which will be approved as specless) and rebase his current series so we can review it.
specifically in the context of dalmatian there are 3 parts that we will consider.
first, nova deliverables like nova-status that don't actually need eventlet will be moved to a separate folder nova/bin that will not be monkey patched. nova binaries that require eventlet will remain in nova/cmd.
second, the nova api only uses eventlet for 1 use case: multi cell scatter gather. both i and melanie have patches to replace that with a futurist thread pool executor. that will allow nova api to run without eventlet again provided it runs as a wsgi service, i.e. the nova-api binary that is hosted by the eventlet webserver will continue to use eventlet but the nova api wsgi application will no longer use eventlet.
third, where we have the option to use eventlet.<module> or a standard library version we will prefer the standard library one, i.e. replace eventlet.queue.LightQueue with queue.SimpleQueue. as part of this we will also avoid having eventlet imported in random modules and consolidate our eventlet usage into a small number of files like nova.utils. this will allow us to localise our explicit usage of eventlet.
what we won't do is actively port large amounts of nova to non eventlet implementations. i will try to move some explicit usages of eventlet to use futurist with either the eventlet or threaded executor, but for dalmatian we will mainly look at the low hanging fruit. one example i plan to do a poc of is replacing our current usage of eventlet.tpool.proxy in the rbd utils with explicitly using a futurist executor. i will also capture this in the blueprint/spec.
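As a rough illustration of the scatter-gather part, here is a sketch using the standard library's concurrent.futures.ThreadPoolExecutor as a stand-in for a futurist thread pool executor (on the assumption that the two expose the same submit/result interface). The scatter_gather function, the cell names and the query callable are invented for this example and are not the actual nova patches.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def scatter_gather(cells, query):
    """Run query(cell) for every cell in parallel native threads.

    Returns a dict mapping each cell to its result, or to the exception
    raised while querying it, so one failing cell does not hide the others.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max(len(cells), 1)) as pool:
        # Scatter: one task per cell.
        futures = {pool.submit(query, cell): cell for cell in cells}
        # Gather: collect results as each cell answers.
        for future in as_completed(futures):
            cell = futures[future]
            try:
                results[cell] = future.result()
            except Exception as exc:
                results[cell] = exc
    return results
```

A real implementation would also need a timeout for cells that never answer, which is left out here to keep the sketch short.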
### EOF ###
That's a wrap, my fingers are now quite bloody (they're not used to writing that much). Sorry for the long summary, I just hope your coffee (or tea) was good.
-Sylvain (who could have written his last PTG summary email if planets align)
[1] https://etherpad.opendev.org/p/nova-dalmatian-pt g (please remove the empty char between 'pt' and 'g')
[2] https://bugs.launchpad.net/nova/+bug/2058928
[3] https://bugs.launchpad.net/nova/+bug/1785123 and https://bugs.launchpad.net/nova/+bug/1633447
[4] https://blueprints.launchpad.net/nova/+spec/libvirt-detect-vtpm-support
[5] https://review.opendev.org/c/openstack/nova-specs/+/863884
[6] https://review.opendev.org/c/openstack/nova-specs/+/909448
[7] https://review.opendev.org/c/openstack/nova-specs/+/915190