On 30/01/2026 12:26, Dmitriy Rabotyagov wrote:
Hey!
It's very nice to hear that somebody steps in to keep the project around.
Though I'm probably saying smth entirely stupid now, but I wonder what actual value Cyborg does have for managing acceleration devices?

Not at all, this has been a topic of debate for a long time. The non-technical motivation for Cyborg to exist is the need to delegate a massive problem space away from the Nova team to a dedicated group of maintainers. It follows the same historical pattern as the creation of Neutron to replace nova-network, Cinder for nova-volume, and Ironic for the old nova-baremetal driver. Nova is already a massive project, and offloading specialized domain knowledge is a way to reduce its scope.
Like, should not ideally Nova/Placement/Blazar be able to cover most of Cyborg's goals?

I've believed for a long time that device management on the compute node is a core capability Nova should provide. Having been involved in everything from PCI pass-through and NUMA awareness to vGPUs, Intel PMEM, and SmartNIC enabling, I've seen how much Nova can handle. In many ways, Nova's current capabilities actually exceed what Cyborg can do today. However, as a top-level service, there are specific use cases Cyborg can address that Nova simply cannot: partly due to architectural limits, and partly due to the philosophical boundaries we've set for Nova.
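To make the comparison concrete, this is roughly the kind of local device management Nova already does well today with plain PCI pass-through. Treat it as a sketch: the vendor/product IDs and alias name are illustrative, and on older releases device_spec was spelled passthrough_whitelist.

    # nova.conf on the compute node: expose matching devices to Nova's PCI tracker
    [pci]
    device_spec = { "vendor_id": "10de", "product_id": "1db4" }
    # the alias must also be defined on the controller nodes running nova-api
    alias = { "vendor_id": "10de", "product_id": "1db4", "device_type": "type-PCI", "name": "example-gpu" }

    # then request two of those devices through a flavor extra spec
    openstack flavor set gpu.large --property "pci_passthrough:alias"="example-gpu:2"

That flow works fine for local, stateless devices; the points below are about what it cannot cover.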
For full transparency, when the project was being incubated as "Nomad" (before it became Cyborg), I actually advocated for a different path. I believed accelerator management should be handled as a library with pluggable drivers, similar to how we use os-vif for networking or os-brick for storage. In that model, Nova would call the library to discover, allocate, and provision devices without needing an entirely separate service to deploy and manage. The goal was to avoid the operational overhead of another top-level service. We even started work on os-acc (https://opendev.org/openstack/os-acc) to be that library. However, given that Cyborg is now an established service with its own ecosystem, I don't think pivoting back to a library-only approach is productive at this point. We have to work with the architecture we have to solve the use cases Nova can't touch.

The first major hurdle is Nova's "no vendor-specific logic" rule. We focus on generic abstractions so that the API remains consistent, but this often forces us into awkward workarounds. Take NVIDIA vGPUs: a card might report 32 VFs even if the vGPU profile only supports 8 instances. In Nova, we have to pre-allocate those vGPUs at boot or manually declare a max_instances limit in the config just to keep the inventory from breaking (a rough sketch of that configuration is at the end of this reply). Because Cyborg is allowed to have vendor-specific drivers, it can actually "know" about these hardware quirks and handle them automatically without the operator needing to hard-code limitations.

This extends to high-speed interconnects like NVLink. If you have a DGX server with two groups of four GPUs interconnected, the only safe way to pass those to a VM is as a complete group. Today, Nova requires you to manually and statically group those in your configuration. We're looking at improving this with PCI pass-through groups in the future, but it's still a manual overhead. Cyborg, by contrast, can discover that topology dynamically and ensure the scheduler makes the right choice without human intervention.

We also have the issue of stateful device management. Nova treats PCI devices as "plumbing": we attach them and walk away. This is why we don't officially support NVMe pass-through upstream; we don't want to get into the business of customizing the provisioning or de-provisioning of the device. Cyborg is allowed to know that a generic PCI device is actually an SSD, which means it can handle the specific lifecycle requirements like secure erasure. It also provides a home for device buses Nova has explicitly pushed out of scope, such as USB, CXL, or direct storage management (SATA/SCSI/NVMe block devices).

Finally, there is the frontier of remote and disaggregated device management, where Nova simply has no answer. While rack-scale design has been a "future" concept for a while, it's becoming a reality. We see it with NVMe-oF, and with the evolution of PCI fabrics and CXL, commercially viable remote GPUs are, like nuclear fusion, perpetually on the horizon. Will it happen? I don't know, but given that NICs are starting to push 1.6 Tbps, we are a lot closer to it than we were a few years ago. Technologies like Liqid and their PCIe-over-fabric Matrix platform (https://www.liqid.com/) demonstrate that accelerators are no longer physically tethered to a specific motherboard, even if that requires a vendor-specific solution today. To be clear, I'm not saying this is something we have any plan to support with Cyborg; I'm just pointing to a technology that exists today that demonstrates the use case. Nova and Placement are excellent at tracking and scheduling local resources, but they aren't built to manage the lifecycle of hardware that exists across a fabric. Cyborg is the only place in OpenStack where we can realistically support these disaggregated accelerators in the medium to long term.
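For reference, the static vGPU workaround I mentioned above looks roughly like this on a compute node today. Again, this is only a sketch: the mdev type and device addresses are illustrative, and the exact option names (including max_instances) depend on your Nova release.

    [devices]
    enabled_mdev_types = nvidia-471

    [mdev_nvidia-471]
    device_addresses = 0000:84:00.0,0000:85:00.0
    # cap the inventory so Placement never advertises more vGPUs than the
    # selected profile can actually create on the card
    max_instances = 8

With a vendor-aware Cyborg driver, those limits could come from the driver itself rather than from operator configuration.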
And thus, was bringing in missing features from Cyborg to these services ever considered as an alternative?
Almost every other PTG :) While I personally wouldn't be against having a robust device management API directly in Nova, I'm definitely in the minority on that front. We've had this conversation almost every time a new hardware capability was proposed.

Over the years, we've extended the PCI tracker in Nova to do far more than intended. It has succeeded in enabling many use cases, but it has also led to fragmentation. When we added vGPU support, we didn't use the PCI tracker; we built a standalone solution. We did something similar for the now-defunct Intel PMEM support. The result is that Nova now effectively has three different systems for tracking devices, most of which are only integrated with the libvirt driver.

While the upgrade impact of removing the PCI tracker from Nova is too large to ever truly get rid of it, there is a lot of merit in the current trajectory: Nova focuses on the "presentation" layer, generating the domain XML and attaching the device, while the "gory details" of lifecycle and hardware-specific management live in Cyborg.

Looking ahead, we can create a clear breakdown of responsibilities. Nova should eventually freeze new device enumeration and discovery features, with perhaps the exception of finishing the PCI groups support for 2026.2. All future work for new device buses or local storage management would be decomposed: Nova handles the hypervisor interaction, while Cyborg handles the hardware management.

To make this work, Nova will eventually need a dedicated "devices" API, similar to the recently added Manila share attachment API. Nova will also need the ability to accept Cyborg device profiles as boot-time parameters and to support move operations, like live migration, for instances with Cyborg-managed devices. We need Nova to be capable of configuring hypervisors for USB or CXL buses and respecting flags like "managed", "hotpluggable", and "live-migratable", but Nova should not have to gain any awareness of how to actually enumerate a USB bus or ensure an NVMe drive is securely wiped upon release.
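For context, a minimal sketch of the flavor-based Cyborg integration that exists today looks roughly like this. The profile name, trait, and flavor are illustrative, and the JSON is an approximation of the Cyborg device profile format rather than a copy-paste recipe.

    # a Cyborg device profile describing what to allocate (and optionally program)
    {"name": "example-fpga", "groups": [{"resources:FPGA": "1", "trait:CUSTOM_FPGA_VENDOR_X": "required"}]}

    # Nova consumes it today via a flavor extra spec at boot time
    openstack flavor set accel.small --property "accel:device_profile"="example-fpga"
    openstack server create --flavor accel.small --image cirros cyborg-vm

A dedicated devices API and proper move-operation support would extend this beyond the flavor-only path we have now.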
Or does Cyborg do smth unique/unconventional which is not a fit for these service goals to incorporate?
Hopefully I have answered this above. It would technically be less work to just do this natively in Nova because it's more mature, but that maturity comes with a cost in development velocity, upgrade impact, and legacy design constraints.

P.S. I don't know enough about how Blazar works to know if it could play a role in device management, but I suspect not; or at least, I don't think the existence or usage of Cyborg would affect its role.
On Thu, Jan 29, 2026 at 22:47, Goutham Pacha Ravi <gouthampravi@gmail.com> wrote:
Hi Sean,

On Thu, Jan 29, 2026 at 10:30 AM Sean Mooney <smooney@redhat.com> wrote:
>
> Hi everyone,
>
> I'm not really sure how to start this conversation, so I'm just going to jump
> right to the point.
>
> I am writing to the Cyborg community and the Technical Committee to discuss the
> current state of the project and share my intent to help ramp up maintenance
> efforts for this cycle and beyond.
>
> I recently discussed this briefly on the #openstack-tc channel, but I wanted
> to bring it to the public list for broader visibility. Now that the new year
> has started, I am reaching out to express my intent to spend time contributing
> to the health and maintainability of Cyborg over the next few cycles.
>
> If I'm being entirely honest, it does not feel like it has been 18 months since
> I started down the same path with Watcher, but I have been asked to try and
> restart Cyborg development in a similar way as we did with Watcher.
Thank you for taking on this effort. With the recent changes in the way OpenStack is being put to use, Cyborg's relevance is growing, even if the need for constant project maintenance isn't apparent. Li Liu has kept the project going for the Gazpacho release, but the maintenance team unfortunately decayed due to an organic shift in focus.
The work you've done in reviving Watcher has been very well received. Speaking with my TC hat on, my concern arises when the core team is composed primarily of individuals from a single organization. While this is sometimes unavoidable, we have seen how projects like Cyborg can be left in jeopardy when an organization changes tack and moves on from contributing, despite having many users. I am very grateful to the companies and individuals that take on the arduous task of maintaining this software, and I want to actively encourage you to seek out diverse maintainers to ensure the project's long-term sustainability.
>
> Recently, I have been submitting patches and performing initial code reviews,
> but it is clear that review latency and accumulated technical debt have
> become significant bottlenecks. To help unblock the project and prepare for
> future development, I am volunteering to lead a focused cleanup effort.
>
> My primary goals for this cycle are:
>
> 1. Addressing neglected technical debt and stale patches:
>    I have already identified several critical areas including oslo.db and
>    oslo.service compatibility, microversion-parse naming, and the long-overdue
>    eventlet removal. I have also identified a backlog of bot-proposed patches
>    for release notes and .gitreview that need manual intervention.
+1, thank you. The unmerged bot patches are evidence that we lack maintenance.
>
>
> 2. Improving CI/CD stability and alignment:
>    While we have made progress moving failing jobs from Jammy to Noble and
>    adding Python 3.13 support, significant debt remains. For example, the
>    cyborg-tempest-plugin lacks stable branch jobs post-2024.2 while still
>    carrying EOL branch definitions. We also lack grenade/SLURP upgrade testing
>    which is vital for project health.
++

>
>
> 3. Managing release-related work and project metadata:
>    Cyborg needs active management for release note preludes, marketing
>    highlights, and RC/GA tagging. Furthermore, our Launchpad project requires
>    cleanup to ensure bugs and features are tracked against the correct series,
>    and team ownership needs to be aligned with current OpenStack standards.
>    Note: I have checked PyPI (https://pypi.org/project/openstack-cyborg/) and
>    it is correctly owned by openstackci from what I can tell.
Thank you for caring about these important details. This is a common labor for all project maintainers and another good indication of a project's health.
>
>
> We are close to the end of the 2026.1 cycle, so my immediate priority is fixing
> the critical gaps to ensure Cyborg's inclusion in the 2026.1 release, followed
> by a longer-term plan for maintenance in 2026.2 and new feature development.
>
> To execute this, I am requesting that the core team consider adding me to the
> cyborg-core group. I am also volunteering as Release and TACT-SIG liaison for
> the remainder of this cycle. For 2026.2, I propose adopting the Distributed
> Project Leadership (DPL) model to better distribute these responsibilities.
>
> One or two others who have been helping me revive Watcher over the last year
> will be joining me in this effort over the coming weeks. We hope to split the
> release, TACT, and security roles between us to ensure consistent coverage
> unless we get other volunteers :)
If Li Liu can make the core team adjustments, that'd be great. If not, the TC can help with seeding the core team as we've done with other projects in the recent past.
For 2026.2, the PTL elections are imminent. If you're still identifying liaisons, I would recommend nominating yourself as PTL for the upcoming release cycle. PTLs can and should have liaisons that handle different aspects of project maintenance too :) I understand, though, that you want to prevent a single point of failure. You could alternatively propose a DPL transition right away with the liaisons you do have, but please note that any PTL nominee during the election window will override this change. Please continue to coordinate with other new/existing maintainers and the TC as you're doing. We'll conclude elections by March 19, 2026 at the latest, and we'll resolve the project governance by then.
>
>
> Our overall goal is to restart the Nova-Cyborg integration work to improve the
> accelerator management UX (attach/detach, move operations, etc.). To reach
> that point, we must first pay down technical debt and rebuild an active core
> review team.
>
> I have prepared a high-level maintenance roadmap and task list here:
> https://etherpad.opendev.org/p/cyborg-maintance-2026.2
> which I intend to use to track this work.
>
> I look forward to hearing your thoughts.
> The first step should probably be to discuss this at the next TC
> meeting and to follow up with the existing core team for comment.
+1, added it to our agenda: https://wiki.openstack.org/wiki/Meetings/TechnicalCommittee#Next_Meeting
>
> If there are no objections,
> I will coordinate with the TC and Infra team regarding the necessary
> permission updates.
>
> Regards,
>
> Sean
>