Re: [nova][ptg] 2024.2 Dalmatian PTG summary

22 Apr 2024

      On Mon, 2024-04-22 at 19:43 +0200, Sylvain Bauza wrote:
...
Hey folks,
Sorry for being a bit late to provide this summary, but let me do it :-)
So, we had between 10 and 20 contributors during our 4 days. Wow, it was
nice! Thanks to all of you who were there, and I was happy to see you even
if it was the first time you joined.
As a reminder, please don't open the main etherpad link [1] if you use a
Web browser with an automatic translation feature. If you do,
it will translate all the phrases directly in the etherpad as if you were
modifying the etherpad yourself. Please
also make sure you don't accidentally remove or modify lines.
To prevent any misuse, please use this read-only
etherpad link instead:
https://etherpad.opendev.org/p/r.3d37f484b24bb0415983f345582508f7
Now, let me go through the summary (first, grab a coffee if you want).
First, we only had one cross-project meeting, with the Neutron team. I
won't provide the summary for this as it was only a neutron discussion
(this was about multiple port bindings support with only some neutron
backends). Please rather look at the neutron etherpad or wait for its
summary.
### Caracal retrospective and Dalmatian planning ###
19 blueprints were approved, 10 of them landed. We also had more
contributors (38) than last cycle. We now unfortunately have >60 untriaged
bugs so we discussed that.
We then agreed that rather than triaging the bugs every week, we should
hold a bug triage day periodically and welcome anyone who
wants to help with it. We will also ask the reporter of each bug report
that is older than 5 years whether they can try to reproduce the issue with a
newer/supported release.
We also discussed how to make sure we could review the feature
implementations. In the end, we agreed that we should ask spec
owners to discuss feature liaisons for their specs with us, so we
could have something like 'subteams' per feature implementation.
I'll explain this separately in a new email in the next few weeks.
For the proposed Dalmatian planning, we'll have two spec review days
(R-20 and R-13) but also a spec approval freeze day (a deadline, if you
prefer) by R-11 (mid-July).
We'll also have an implementation review day by R-10.
### Enhancing support for running nova services in k8s ###
Gibi agreed to provide a backlog spec explaining what we should
modify in Nova. For graceful shutdown support, gibi will also create
another spec to make oslo.messaging support using only some RPC topics.
### Unified limits default behaviour with no registered limits ###
We discussed a specific case that exists today with unified limits: you
need to register limits for all the resource classes used by your
flavors, because otherwise you wouldn't have quotas for them.
We then agreed on the fact that all the 'core' resource classes (meaning
the non-custom ones) should be registered. melwitt will provide a document
explaining it. For custom resource classes, we agreed on providing a
relaxed mode (in the nova configuration) that accepts unregistered limits
only for custom resource classes. That option would also have a strict mode
that requires registering even the custom RCs.
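
To make the two modes a bit more concrete, here is a minimal sketch of one possible reading of the agreement. The mode names, the helper and the CUSTOM_FPGA class are hypothetical (no option was defined at the PTG); the only real convention used is that placement prefixes custom resource classes with CUSTOM_.

```python
# Hypothetical sketch: neither the mode names nor this helper exist in nova.
CUSTOM_PREFIX = "CUSTOM_"  # placement names custom resource classes this way


def classes_missing_required_limits(requested, registered, mode="relaxed"):
    """Return the resource classes that still need a registered limit.

    requested:  resource class names used by a flavor
    registered: resource class names that already have a registered limit
    mode:       "relaxed" tolerates unregistered custom classes,
                "strict" requires a registered limit for every class
    """
    if mode not in ("relaxed", "strict"):
        raise ValueError(f"unknown mode: {mode}")
    missing = [rc for rc in requested if rc not in registered]
    if mode == "strict":
        return missing
    # Relaxed: only core (non-custom) classes must be registered.
    return [rc for rc in missing if not rc.startswith(CUSTOM_PREFIX)]


flavor_classes = ["VCPU", "MEMORY_MB", "CUSTOM_FPGA"]
registered_limits = {"VCPU", "MEMORY_MB"}

relaxed = classes_missing_required_limits(flavor_classes, registered_limits)
strict = classes_missing_required_limits(
    flavor_classes, registered_limits, mode="strict")
```

In this sketch the relaxed mode reports nothing for the flavor above, while the strict mode reports the custom class as still needing a registered limit.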
### Missing instance action events ###
We have a bug report about missing instance action events in our API [2].
We agreed on fixing this without requiring a new API microversion. For each
of the missing events, ratailor will provide individual bugfixes and will
also try to assert the events in Tempest scenarios.
### Ephemeral encryption feature ###
As we started to merge some changes for this feature on Caracal, we also
found some concerns that needed to be discussed during this PTG. Melwitt
will provide a new spec for adding a new API microversion that will add new
request parameters for snapshotting an instance. If an instance that has an
encrypted disk needs to be rescued, either the rescue disk will also be
encrypted or it will use the same key as the rescue image. Resize won't be
supported yet, but melwitt will look into how it could be unencrypted.
Eventually, we also discussed the fact that compute services should check
when moving an instance whether the target supports the encryption options.
### Nova Metadata optimisation ###
In case the metadata API is called quite frequently, there is some read
amplification. In order to fix this, fwiesel will create a blueprint for
lazy loading some returned parameters and telling which metadata paths are
the most called. For the moment, we won't do any cache invalidation feature
as we first need to understand whether the lazy-loading helps.
By the way, I tried to comment on this on slack afterwards, but I don't believe we should
add lazy loading. tl;dr: cloud-init only retries the first query and assumes
all other data is ready once the first query is complete.
Lazy loading would break that assumption and, as a result, break the cloud-init
integration.
I would prefer that we provide a way for the guest to request that the cache be invalidated,
i.e. by doing an HTTP DELETE to the root of the metadata API or whatever the correct HTTP
request semantics would be.
...
### Extending memory encryption support ###
For the moment, Nova only supports AMD-SEV for encrypting the instance
memory. We discussed two other hardware technologies: AMD-SEV-ES and
SGX.
For AMD SEV-ES (new AMD SEV generation), we agreed on documenting some not
configuration support by BIOS, modifying the inventories by using nested
resource providers with the same resource class and maybe needing some
reshape. Takashi will work on it for this cycle (live migration won't be
done this cycle).
For SGX, we said we need a new blueprint and a spec. We are OK to create a
new nested resource provider for the resource class, but we would like to
know the current move operation limitations.
### Stateless firmware support ###
Even if the libvirt driver recreates a new firmware when we're
moving/resizing an instance, we discussed whether we should support a
specific stateless firmware. We agreed on accepting it, so takashi will
work on it, but will also fix the related bug reports [3].
### Automatically detecting vTPM support ###
Long story short, we accepted [4] as a specless blueprint. takashi will
work on the implementation this cycle.
### vTPM migrations ###
Even if Nova supports vTPM related instances, we don't yet support rebuild,
rescue, shelve, live migration and evacuation for them. We agreed on fixing
live-migration/evacuate this cycle as a priority by persisting the vTPM memory
somewhere. That said, we still need to find a contributor, hopefully in the coming weeks.
Somewhat related, we later discussed another live-migration topic that was
about our SSH usage, and we said in that topic that some kind of
small object-store in nova (or elsewhere) could be useful for transferring
instance-related local datablocks (like console logs, NVRAM or config
drives), including vTPM memory state.
### Live migration options difference ###
So, you probably know that we have a lot of live migration configuration
options now, at least since the live_migration_uri option was
deprecated in favor of live_migration_scheme and
live_migration_inbound_addr. Unfortunately, those new options lack a way
to provide some elements like user, port, path and query string.
In the end, we agreed on deprecating live_migration_scheme this cycle while
*un*deprecating live_migration_uri. That doesn't mean that we'll remove
live_migration_scheme quickly, at least not until 2025.2 F (based on the
SLURP policy), so distros shouldn't be afraid.
By the way, we can signal deprecation separately from signalling intent to remove,
so if we really want to we can defer adding the second flag to signal removal.
As Sylvain says, we are not in a rush to remove live_migration_scheme, but we
do want to advertise that we no longer plan to remove live_migration_uri
and that it is no longer deprecated.
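
To illustrate the gap discussed above, here is a small sketch using only the Python standard library. The URI value (user, host, port, key path) is made up, and uri_from_scheme_and_addr is a simplified, hypothetical stand-in rather than nova's actual code: it only shows that a scheme plus an address cannot carry the other URI elements.

```python
from urllib.parse import urlsplit

# Example value only: the user, host, port and key path are made up.
full_uri = ("qemu+ssh://nova@compute-2.example.org:2222/system"
            "?keyfile=/etc/nova/id_rsa")

parts = urlsplit(full_uri)
# A full URI can express every element mentioned above.
elements = {
    "scheme": parts.scheme,
    "user": parts.username,
    "host": parts.hostname,
    "port": parts.port,
    "path": parts.path,
    "query": parts.query,
}


def uri_from_scheme_and_addr(scheme, inbound_addr):
    """Hypothetical helper: build a URI from a transport and a target only."""
    return f"qemu+{scheme}://{inbound_addr}/system"


limited = urlsplit(uri_from_scheme_and_addr("ssh", "compute-2.example.org"))
# No user, port or query string can be expressed this way.
missing = [name for name, value in
           (("user", limited.username), ("port", limited.port),
            ("query", limited.query)) if not value]
```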
...
### Allow Userdata to be updated ###
Well, that one is a long overdue feature request. crohmann kindly updated
the spec [5] for this cycle based on our feedback from the last reviews. We
agreed on reviewing that spec, but a quick glance showed it was going
in the right direction.
### Clean up old compute service versions ###
That one has been a long discussed topic, even if just tech debt related.
The case is that we have old cruft of upgrade checks in our code that we
could remove, but we were afraid of removing some unsupported yet useful
'upgrade envelope' between very old computes and new controllers. After
lots of back and forth, we eventually agreed on having some kind of upcall
check from compute to conductor that would signal to this compute that it's
too old and needs to self-disable, while having some 'escape valve'
(quoting directly) in case it incorrectly disables things.
Dansmith agreed to start scheming it.
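
As a rough illustration only (the design was not settled at the PTG), the decision could look like the sketch below. The function name, the escape_valve flag and the version numbers are all hypothetical.

```python
# Hypothetical sketch; names and version numbers are illustrative only.
def should_self_disable(compute_version, oldest_supported, escape_valve=False):
    """Decide whether a compute should disable itself after the upcall.

    compute_version:  service version reported by the compute
    oldest_supported: oldest compute version the conductor still accepts
    escape_valve:     operator override for the case where the check
                      would disable a compute incorrectly
    """
    if escape_valve:
        return False
    return compute_version < oldest_supported


too_old = should_self_disable(compute_version=50, oldest_supported=61)
overridden = should_self_disable(50, 61, escape_valve=True)
recent = should_self_disable(66, 61)
```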
### OpenAPI-compatible API reference documentation ###
Stephenfin briefed us on the cross-project effort aiming to have
projects' api-refs automatically generated through some tooling. In order to
achieve that, there are a few gaps in our current code that need to be
fixed, mostly about adding some response schema validation. stephenfin
already proposed a spec [6]. We agreed on reviewing it. We also wanted the
implementation to be iterative per API resource, to leave the deprecated
APIs untouched and to keep the existing api-ref until the new documentation
is on par. Stephenfin agreed to update his spec based on that feedback.
### SPICE-direct console support ###
A new nova spec [7] was created this cycle by mikal in order to provide
some hypervisor details for some external tool that could then pass a
SPICE-direct console from a specific instance to some user.
We discussed this feature and we said that the usecase was reasonable. That
said, we had some concerns about the technical modifications but we'll
discuss them in the spec review.
### Remote consoles API ###
When a user asks Nova to create a new remote console for an instance, the
user can't say which kind of console they want to get; only the operator
can set which console kind to use per compute.
We agreed on providing a new API microversion that lets users choose which kind
of console they want (html5 or tcp-direct for example) through a new request
parameter. Fwiesel agreed to create a new spec for it this cycle.
### Resurrected computes should doublecheck the RPC messages they get ###
Sorry for the terrible section name above, I tried to make it short.
Basically, when a compute service disappears for some time and then
returns, it can process any RPC request which was in queue. If some
instance was evacuated in between, then this compute will dumbly try to
process what was called earlier (like a move operation). We agreed that
this is a bug that needs to be fixed by adding some conditional
to any RPC method targeted by a move operation. We don't know yet who could
work on it, maybe tobias-urbin since he reported it.
This is actually missing context.

It's not just about move operations, it's more like the following:

If you have a server on host A and that host has a failure,
when the operator evacuates it to host B there is a period of time
where a user could issue API actions like stop that will be queued for
host A.

When the operator restores host A (e.g. replaces the power supply)
and that service starts up, it may dequeue a message that was sent before
the evacuate was done.

Nova needs to check, when we dequeue a message, that the instance is actively
managed on this host and avoid processing requests for instances it no longer manages.
In this case it should avoid doing an instance.save to record that the instance should be powered off.
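
A minimal sketch of that ownership check, assuming a simplified stand-in for the instance record. The Instance class and handle_stop below are hypothetical and are not nova's objects or RPC methods.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """Minimal stand-in for an instance record; not nova's Instance object."""
    uuid: str
    host: str
    power_state: str = "running"


def handle_stop(instance, my_host):
    """Apply a queued stop request only if this host still owns the instance.

    Returns True when the request was applied, False when it was dropped
    as stale (for example because the instance was evacuated elsewhere).
    """
    if instance.host != my_host:
        # Stale message: do not change or save the instance state.
        return False
    instance.power_state = "shutdown"
    return True


# The instance was evacuated from host-a to host-b while host-a was down.
evacuated = Instance(uuid="1234", host="host-b")
applied_on_a = handle_stop(evacuated, my_host="host-a")
state_after_stale_message = evacuated.power_state
```

Here host-a drops the stale stop request and the instance record is left untouched.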
...
### How can we get rid of eventlet in Nova ? ###
I won't go into the details on what eventlet is but please consider that
all nova services are heavily dependent on that python library for
concurrency reasons, mostly the compute service (besides the API and other
services for the workers usage). Sean Mooney proposed a plan for using
direct python threading (instead of asyncio) in order to replace our
workers model and to create specific distinct threadpools for
compute-related operations. We agreed on that approach so we'll let sean
create a blueprint (which will be approved as specless) and rebase his
current series so we could review it.
Specifically, in the context of Dalmatian there are 3 parts that we will consider.

First, nova deliverables like nova-status that don't actually need eventlet will be
moved to a separate folder, nova/bin, that will not be monkey patched.
Nova binaries that require eventlet will remain in nova/cmd.

Second, the nova API only uses eventlet for one use case: multi-cell scatter-gather.
Both Melanie and I have patches to replace that with a futurist thread pool executor.
That will allow the nova API to run without eventlet again, provided it runs as a WSGI service,
i.e. the nova-api binary that is hosted by the eventlet web server will continue to use eventlet
but the nova API WSGI application will no longer use eventlet.
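
To show the general shape of a thread-pool based scatter-gather, here is a minimal sketch. It uses the standard library concurrent.futures executor rather than futurist so that it is self-contained, and the scatter_gather helper, the cell names and list_instances are hypothetical; this is not the code from the patches mentioned above.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def scatter_gather(cells, fn):
    """Run fn(cell) for every cell in native threads, keyed by cell.

    A cell whose call raises maps to the exception instead of a result,
    so one failing cell does not hide the answers from the others.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max(len(cells), 1)) as executor:
        futures = {executor.submit(fn, cell): cell for cell in cells}
        for future in as_completed(futures):
            cell = futures[future]
            try:
                results[cell] = future.result()
            except Exception as exc:
                results[cell] = exc
    return results


def list_instances(cell):
    # Made-up per-cell query; cell2 simulates an unreachable database.
    if cell == "cell2":
        raise ConnectionError("cell2 database unreachable")
    return [f"{cell}-vm1", f"{cell}-vm2"]


results = scatter_gather(["cell1", "cell2", "cell3"], list_instances)
```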

Third, where we have the option to use eventlet.<module> or a standard library version,
we will prefer the standard library one, i.e. replace eventlet.queue.LightQueue with queue.SimpleQueue.
As part of this we will also avoid having eventlet imported in random modules and consolidate
our eventlet usage into a small number of files like nova.utils.
This will allow us to localise our explicit usage of eventlet.
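
As a small illustration of the standard library side of that swap, queue.SimpleQueue offers put() and get() and works across native threads. The worker and the sentinel convention below are just an example, not code taken from nova.

```python
import queue
import threading

work = queue.SimpleQueue()  # stdlib, unbounded FIFO, usable across threads
results = []


def worker():
    while True:
        item = work.get()   # blocks until an item is available
        if item is None:    # sentinel value: stop the worker
            break
        results.append(item * 2)


thread = threading.Thread(target=worker)
thread.start()
for number in (1, 2, 3):
    work.put(number)
work.put(None)
thread.join()
```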

What we won't do is actively port large amounts of nova to non-eventlet implementations.
I will try to move some explicit usage of eventlet to use futurist with either the eventlet or threaded
executor, but for Dalmatian we will mainly look at the low-hanging fruit.

One example I plan to do a PoC of is replacing our current usage of
eventlet.tpool.proxy in the rbd utils with explicitly using a futurist executor.

I will also capture this in the blueprint/spec.
...
### EOF ###
That's a wrap now, my fingers are now quite bloody (they're not used to
writing that much). Sorry for the long summary, I just hope your coffee (or
tea) was good.
-Sylvain (who could have written his last PTG summary email if planets
align)
[1] https://etherpad.opendev.org/p/nova-dalmatian-pt g (please remove the
empty char between 'pt' and 'g')
[2] https://bugs.launchpad.net/nova/+bug/2058928
[3] https://bugs.launchpad.net/nova/+bug/1785123 and
https://bugs.launchpad.net/nova/+bug/1633447
[4] https://blueprints.launchpad.net/nova/+spec/libvirt-detect-vtpm-support
[5] https://review.opendev.org/c/openstack/nova-specs/+/863884
[6] https://review.opendev.org/c/openstack/nova-specs/+/909448
[7] https://review.opendev.org/c/openstack/nova-specs/+/915190

smooney＠redhat.com