[cinder][PTG] 2025.1 Epoxy PTG summary
Hello everyone, thanks for a very productive PTG for the Epoxy cycle. Below is a brief summary of the highlights. The raw etherpad notes are here [1], and both the summary and the transcribed etherpad notes are captured in our wiki here [2].

# Retrospective

We spent the first few minutes discussing the state of the Dalmatian release: what went well, what did not, and how we can improve in the Epoxy release cycle. The most notable point raised is that we have two fewer cores this cycle, which adds strain to an already sizable review queue. We will be looking to cultivate and elevate qualified contributors as opportunities arise.

# NetApp

We covered six different topics related to NetApp's feature development plans for the Epoxy cycle, in an effort to make sure we're all aligned. Below is a summary of the full set:

* Certificate-based authentication

  Right now only basic auth is implemented, which requires a username and password. There is a strong desire to add support for certificate-based auth. Discussion covered possible approaches, with a few helpful links to existing implementations that can be used as a reference.

* HA Active-Active support for the NVMe driver

  This will allow users to configure NVMe driver backends in clustered Cinder environments. We covered the failover process and which methods need to be implemented, along with the related use cases and testing scenarios. We do not have replication and failover coverage in our upstream gate, so special care is needed in developer-side testing and verification for now. Rally was mentioned as a possible testing tool.

* SVM scoping support for NVMe

  There is a desire to add SVM scoping support to the NVMe driver; at the moment the driver fails to initialize. Work is ongoing to add full support for this feature. Nothing concerning was raised, and we are looking forward to this for Epoxy.
* ASA r2 support

  ASA r2 systems use storage units and consistency groups to simplify storage management and data protection. The REST API is different, and some features are restricted. We'll see how this unfolds this cycle; good to have it on the radar as we go forward.

* Replication - SnapMirror Sync support

  Synchronous mirroring achieves a recovery point objective (RPO) of zero lost data by keeping a copy of important data available if a disaster happens on one of the two storage arrays. The copy is identical to production data at every moment because each write to the primary volume is also written to the secondary volume. The host does not receive an acknowledgment that the write was successful until the secondary volume has been updated with the changes made on the primary volume.

* MetroCluster support

  MetroCluster is a clustering solution with synchronous replication that delivers continuous availability, immediately duplicating all mission-critical data on a transaction-by-transaction basis. MetroCluster configurations enhance the built-in high availability and nondisruptive operations of NetApp hardware and ONTAP storage software, providing an additional layer of protection for the entire storage and host environment. A few changes beyond driver-specific updates were discussed; no blockers.

# Handling mypy Updates

We support mypy in our gate jobs, but having its version unpinned causes occasional breakages. We pin hacking and similar tools, and we agreed that mypy should be included. The caveat is that we need to remember to periodically unpin and update these version numbers to stay current.

# New volume driver: LVM+Clone

Jan Horstmann proposed a derivative of our LVM driver that adds support for using the device mapper clone target to transparently migrate volumes. This would allow an admin to leverage local storage for performance and still be able to live migrate instances across compute hosts.
A POC/demo is hopefully coming soon; no fundamental blockers were identified, and we are very interested to see how this comes together.

# Reviews for the backup service

It is very difficult to get reviews for cinder-backup related changes. Tobias asked if any cores can focus some time on cinder-backup specifically (and potential backports). We agreed that mentioning the backup review queue and review dashboard at the weekly meeting may help raise awareness.

# New Location API adoption

Rajat proposed a patch that adds the call to the `add_image_location` method, which triggers the new location API workflow that is more secure and robust than the old location workflow. This call is made when Glance is using Cinder as a backend and we want to perform an optimized "upload volume to image" operation. Additional patches are up that add support for testing this feature. We may still need some input from the Glance team, to be followed up on in our weekly meeting.

# Adding a `service` query parameter to the AZ API

Tobias proposed an API change: adding a `service` query parameter to the existing os-availability-zone API. The patch is thought to be complete and ready for review.

# Adding a backup summary API

Tobias proposed adding a backup summary API (in the same way that one is available for volumes). He also found a volume summary API bug while implementing this feature. The patch is ready for review.

# Horizon Feature Gap

We had the privilege of meeting with the Horizon team to discuss their efforts on closing existing feature gaps. Several specifics were discussed and clarity was achieved. Concerns were raised over possible incompatibility issues between the openstack client and the SDK. Further, Horizon updates microversions only when necessary, and value was identified in reviewing microversions to see if any additional functionality can be leveraged.
# Optimize RBD upload from Cinder to Glance

Currently the path to upload a Cinder volume to a Glance image is not optimized when Cinder and Glance are both using RBD. We can leverage RBD layering (copy-on-write cloning) to complete the operation in seconds irrespective of the volume size, since no actual copy is happening. Three features were implemented to support this:

1. RBD clone dependency deletion: we have landed changes in Glance and Cinder that handle RBD dependency chains, so dependencies from the volumes pool to the images pool shouldn't be an issue.
2. Service role: the attachment delete API already uses a service role + service token, and we can make similar changes for this feature.
3. New location APIs: this new Glance feature eliminates the security risk, as we will calculate the image hash even in this optimized path (in the background, so operation performance won't be affected).

Additional notes:

* Previously there were several blockers; all appear to be resolved now
* Significant performance increase
* Glance -> Cinder is already supported
* Falls back to the unoptimized codepath
* Related recent patches that improve the RBD cinder/glance path need to be considered/reviewed

Eric has committed to help with the reviews.

# State of volume replication

Some time was spent discussing the current state of replication. We do not have testing coverage in tempest, so the current state of any particular driver implementation is difficult to know. An idea was raised to poll vendors for use cases and testing results. There is a general lack of clarity across projects about this feature and about what our overall disaster recovery solution is.

# Image Encryption (Glance Cross-Project)

The goal of this effort is to standardize encrypted images between Glance, Cinder and Nova, making them consistent and accessible to users. Here is a recap:

1. We merged a patchset in os-brick that extends key handling to support more types of Barbican keys for image encryption, in addition to Cinder's existing implementation (which strictly uses binascii.hexlify() conversion).
2. A patchset for Glance has been written to add Barbican's Secret Consumer API for encrypted images and to handle the new standardized format.
3. A patchset for Cinder has been written, making it able to produce and use the new standardized format, additionally supporting qcow2+LUKS alongside the raw LUKS format, as well as non-hexlified keys from Barbican.

After the initial discussion with Glance, we agreed on *not* introducing any new container_format or disk_format and instead handling this with `os_encrypt_*` properties (e.g. os_encrypt_format=luks); this is how the patchsets are currently implemented. However, due to the recent CVEs around image format parsing and conversion, valid concerns were raised that this approach might lead to new vulnerabilities due to potential conflicts and ambiguities between container_format, disk_format and the encryption attributes. We need to agree on a new way of handling the classification of encrypted images in the metadata. We have at least two cases to consider:

1. Raw LUKS (as currently produced by Cinder from encrypted volumes)
2. qcow2+LUKS (as produced by Nova for ephemeral storage)

Both could also be produced by the user (see the gist link in the reference section).

# Volume type metadata progress

We received a progress update on the volume type metadata effort. There is consensus that the direction is sound. A new API endpoint, in addition to changes to the response of volume type show/create, will require a new microversion. The tempest tests still need to be adjusted. Patches need review.

# Status of OpenAPI effort

There is very little traction on the initial patches; help is needed getting reviews for the first set. These patches also clean up technical debt and fix several bugs, so they would be very good to have.
The next set of patches is more meaningful, but is being held until initial reviews are available.

# Removal of uWSGI scripts (stephenfin)

The patch to remove the uWSGI scripts has been up and needs review. It is a relatively small patch. Update: this has been reviewed and merged.

# Eventlet

A dedicated etherpad was created to help document the current status. Cinder is still in the investigatory phase. Not only do we use eventlet directly, but some drivers such as RBD use C libraries that spawn native threads, so we must take care in how we handle these cases. Our goal is to land only preparatory changes in Epoxy (because it is a SLURP release) and hold the big changes for F (which most people won't upgrade to, giving us more time to detect and fix issues).

[1] https://etherpad.opendev.org/p/epoxy-ptg-cinder
[2] https://wiki.openstack.org/wiki/CinderEpoxyPTGSummary

-- Jon