On 30/01/2026 12:26, Dmitriy Rabotyagov wrote:
Hey!
It's very nice to hear that somebody steps in to keep the project around.
Though I'm probably saying smth entirely stupid now, but I wonder what actual value Cyborg does have for managing acceleration devices?

Not at all, this has been a topic of debate for a long time. The non-technical motivation for Cyborg to exist is the need to delegate a massive problem space away from the Nova team to a dedicated group of maintainers. It follows the same historical pattern as the creation of Neutron to replace nova-network, Cinder for nova-volume, and Ironic for the old nova-baremetal driver. Nova is already a massive project, and offloading specialized domain knowledge is a way to reduce its scope.
Like, should not ideally Nova/Placement/Blazar be able to cover most of Cyborg's goals?

I've believed for a long time that device management on the compute node is a core capability Nova should provide. Having been involved in everything from PCI pass-through and NUMA awareness to vGPUs, Intel PMEM, and SmartNIC enabling, I've seen how much Nova can handle. In many ways, Nova's current capabilities actually exceed what Cyborg can do today. However, as a top-level service, there are specific use cases Cyborg can address that Nova simply cannot: partly due to architectural limits, and partly due to the philosophical boundaries we've set for Nova.
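To make the comparison concrete, this is roughly the kind of local device management Nova already does well today with plain PCI pass-through. Treat it as a sketch: the vendor/product IDs and alias name are illustrative, and on older releases device_spec was spelled passthrough_whitelist.

    # nova.conf on the compute node: expose matching devices to Nova's PCI tracker
    [pci]
    device_spec = { "vendor_id": "10de", "product_id": "1db4" }
    # the alias must also be defined on the controller nodes running nova-api
    alias = { "vendor_id": "10de", "product_id": "1db4", "device_type": "type-PCI", "name": "example-gpu" }

    # then request two of those devices through a flavor extra spec
    openstack flavor set gpu.large --property "pci_passthrough:alias"="example-gpu:2"

That flow works fine for local, stateless devices; the points below are about what it cannot cover.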
For full transparency, when the project was being incubated as "Nomad" (before it became Cyborg), I actually advocated for a different path. I believed accelerator management should be handled as a library with pluggable drivers, similar to how we use os-vif for networking or os-brick for storage. In that model, Nova would call the library to discover, allocate, and provision devices without needing an entirely separate service to deploy and manage. The goal was to avoid the operational overhead of another top-level service. We even started work on os-acc (https://opendev.org/openstack/os-acc) to be that library. However, given that Cyborg is now an established service with its own ecosystem, I don't think pivoting back to a library-only approach is productive at this point. We have to work with the architecture we have to solve the use cases Nova can't touch.

The first major hurdle is Nova's "no vendor-specific logic" rule. We focus on generic abstractions so that the API remains consistent, but this often forces us into awkward workarounds. Take NVIDIA vGPUs: a card might report 32 VFs even if the vGPU profile only supports 8 instances. In Nova, we have to pre-allocate those vGPUs at boot or manually declare a max_instances limit in the config just to keep the inventory from breaking (a rough sketch of that configuration is at the end of this reply). Because Cyborg is allowed to have vendor-specific drivers, it can actually "know" about these hardware quirks and handle them automatically without the operator needing to hard-code limitations.

This extends to high-speed interconnects like NVLink. If you have a DGX server with two groups of four GPUs interconnected, the only safe way to pass those to a VM is as a complete group. Today, Nova requires you to manually and statically group those in your configuration. We're looking at improving this with PCI pass-through groups in the future, but it's still a manual overhead. Cyborg, by contrast, can discover that topology dynamically and ensure the scheduler makes the right choice without human intervention.

We also have the issue of stateful device management. Nova treats PCI devices as "plumbing": we attach them and walk away. This is why we don't officially support NVMe pass-through upstream; we don't want to get into the business of customizing the provisioning or de-provisioning of the device. Cyborg is allowed to know that a generic PCI device is actually an SSD, which means it can handle the specific lifecycle requirements like secure erasure. It also provides a home for device buses Nova has explicitly pushed out of scope, such as USB, CXL, or direct storage management (SATA/SCSI/NVMe block devices).

Finally, there is the frontier of remote and disaggregated device management, where Nova simply has no answer. While rack-scale design has been a "future" concept for a while, it's becoming a reality. We see it with NVMe-oF, and with the evolution of PCI fabrics and CXL, commercially viable remote GPUs are, like nuclear fusion, perpetually on the horizon. Will it happen? I don't know, but given that NICs are starting to push 1.6 Tbps, we are a lot closer to it than we were a few years ago. Technologies like Liqid and their PCIe-over-fabric Matrix platform (https://www.liqid.com/) demonstrate that accelerators are no longer physically tethered to a specific motherboard, even if that requires a vendor-specific solution today. To be clear, I'm not saying this is something we have any plan to support with Cyborg; I'm just pointing to a technology that exists today that demonstrates the use case. Nova and Placement are excellent at tracking and scheduling local resources, but they aren't built to manage the lifecycle of hardware that exists across a fabric. Cyborg is the only place in OpenStack where we can realistically support these disaggregated accelerators in the medium to long term.
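For reference, the static vGPU workaround I mentioned above looks roughly like this on a compute node today. Again, this is only a sketch: the mdev type and device addresses are illustrative, and the exact option names (including max_instances) depend on your Nova release.

    [devices]
    enabled_mdev_types = nvidia-471

    [mdev_nvidia-471]
    device_addresses = 0000:84:00.0,0000:85:00.0
    # cap the inventory so Placement never advertises more vGPUs than the
    # selected profile can actually create on the card
    max_instances = 8

With a vendor-aware Cyborg driver, those limits could come from the driver itself rather than from operator configuration.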
And thus, was bringing in missing features from Cyborg to these services ever considered as an alternative?
Almost every other PTG :) While I personally wouldn't be against having a robust device management API directly in Nova, I'm definitely in the minority on that front. We've had this conversation almost every time a new hardware capability was proposed.

Over the years, we've extended the PCI tracker in Nova to do far more than intended. It has succeeded in enabling many use cases, but it has also led to fragmentation. When we added vGPU support, we didn't use the PCI tracker; we built a standalone solution. We did something similar for the now-defunct Intel PMEM support. The result is that Nova now effectively has three different systems for tracking devices, most of which are only integrated with the libvirt driver.

While the upgrade impact of removing the PCI tracker from Nova is too large to ever truly get rid of it, there is a lot of merit in the current trajectory: Nova focuses on the "presentation" layer, generating the domain XML and attaching the device, while the "gory details" of lifecycle and hardware-specific management live in Cyborg.

Looking ahead, we can create a clear breakdown of responsibilities. Nova should eventually freeze new device enumeration and discovery features, with perhaps the exception of finishing the PCI groups support for 2026.2. All future work for new device buses or local storage management would be decomposed: Nova handles the hypervisor interaction, while Cyborg handles the hardware management.

To make this work, Nova will eventually need a dedicated "devices" API, similar to the recently added Manila share attachment API. Nova will also need the ability to accept Cyborg device profiles as boot-time parameters and to support move operations, like live migration, for instances with Cyborg-managed devices. We need Nova to be capable of configuring hypervisors for USB or CXL buses and respecting flags like "managed", "hotpluggable", and "live-migratable", but Nova should not have to gain any awareness of how to actually enumerate a USB bus or ensure an NVMe drive is securely wiped upon release.
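For context, a minimal sketch of the flavor-based Cyborg integration that exists today looks roughly like this. The profile name, trait, and flavor are illustrative, and the JSON is an approximation of the Cyborg device profile format rather than a copy-paste recipe.

    # a Cyborg device profile describing what to allocate (and optionally program)
    {"name": "example-fpga", "groups": [{"resources:FPGA": "1", "trait:CUSTOM_FPGA_VENDOR_X": "required"}]}

    # Nova consumes it today via a flavor extra spec at boot time
    openstack flavor set accel.small --property "accel:device_profile"="example-fpga"
    openstack server create --flavor accel.small --image cirros cyborg-vm

A dedicated devices API and proper move-operation support would extend this beyond the flavor-only path we have now.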
Or does Cyborg do smth unique/unconventional which is not a fit for these service goals to incorporate?
Hopefully I have answered this above. It would technically be less work to just do this natively in Nova because it's more mature, but that maturity comes with a cost in development velocity, upgrade impact, and legacy design constraints.

P.S. I don't know enough about how Blazar works to know if it could play a role in device management, but I suspect not; or at least, I don't think the existence or usage of Cyborg would affect its role.
On Thu, Jan 29, 2026 at 22:47, Goutham Pacha Ravi <gouthampravi@gmail.com> wrote:
Hi Sean,

On Thu, Jan 29, 2026 at 10:30 AM Sean Mooney <smooney@redhat.com> wrote:
>
> Hi everyone,
>
> I'm not really sure how to start this conversation, so I'm just going to jump
> right to the point.
>
> I am writing to the Cyborg community and the Technical Committee to discuss the
> current state of the project and share my intent to help ramp up maintenance
> efforts for this cycle and beyond.
>
> I recently discussed this briefly on the #openstack-tc channel, but I wanted
> to bring it to the public list for broader visibility. Now that the new year
> has started, I am reaching out to express my intent to spend time contributing
> to the health and maintainability of Cyborg over the next few cycles.
>
> If I'm being entirely honest, it does not feel like it has been 18 months since
> I started down the same path with Watcher, but I have been asked to try and
> restart Cyborg development in a similar way as we did with Watcher.
Thank you for taking on this effort. With the recent changes in the way OpenStack is being put to use, Cyborg's relevance is growing, even if the need for constant project maintenance isn't apparent. Li Liu has kept the project going for the Gazpacho release, but the maintenance team unfortunately decayed due to an organic shift in focus.
The work you've done in reviving Watcher has been very well received. Speaking with my TC hat on, my concern arises when the core team is composed primarily of individuals from a single organization. While this is sometimes unavoidable, we have seen how projects like Cyborg can be left in jeopardy when an organization changes tack and moves on from contributing, despite having many users. I am very grateful to the companies and individuals that take on the arduous task of maintaining this software, and I want to actively encourage you to seek out diverse maintainers to ensure the project's long-term sustainability.
>
> Recently, I have been submitting patches and performing initial code reviews,
> but it is clear that review latency and accumulated technical debt have
> become significant bottlenecks. To help unblock the project and prepare for
> future development, I am volunteering to lead a focused cleanup effort.
>
> My primary goals for this cycle are:
>
> 1. Addressing neglected technical debt and stale patches:
>    I have already identified several critical areas including oslo.db and
>    oslo.service compatibility, microversion-parse naming, and the long-overdue
>    eventlet removal. I have also identified a backlog of bot-proposed patches
>    for release notes and .gitreview that need manual intervention.
+1, thank you. The unmerged bot patches are evidence that we lack maintenance.
>
>
> 2. Improving CI/CD stability and alignment:
>    While we have made progress moving failing jobs from Jammy to Noble and
>    adding Python 3.13 support, significant debt remains. For example, the
>    cyborg-tempest-plugin lacks stable branch jobs post-2024.2 while still
>    carrying EOL branch definitions. We also lack grenade/SLURP upgrade testing
>    which is vital for project health.
++

>
>
> 3. Managing release-related work and project metadata:
>    Cyborg needs active management for release note preludes, marketing
>    highlights, and RC/GA tagging. Furthermore, our Launchpad project requires
>    cleanup to ensure bugs and features are tracked against the correct series,
>    and team ownership needs to be aligned with current OpenStack standards.
>    Note: I have checked PyPI (https://pypi.org/project/openstack-cyborg/) and
>    it is correctly owned by openstackci from what I can tell.
Thank you for caring about these important details. This is a common labor for all project maintainers and another good indication of a project's health.
>
>
> We are close to the end of the 2026.1 cycle, so my immediate priority is fixing
> the critical gaps to ensure Cyborg's inclusion in the 2026.1 release, followed
> by a longer-term plan for maintenance in 2026.2 and new feature development.
>
> To execute this, I am requesting that the core team consider adding me to the
> cyborg-core group. I am also volunteering as Release and TACT-SIG liaison for
> the remainder of this cycle. For 2026.2, I propose adopting the Distributed
> Project Leadership (DPL) model to better distribute these responsibilities.
>
> One or two others who have been helping me revive Watcher over the last year
> will be joining me in this effort over the coming weeks. We hope to split the
> release, TACT, and security roles between us to ensure consistent coverage
> unless we get other volunteers :)
If Li Liu can make the core team adjustments, that'd be great. If not, the TC can help with seeding the core team as we've done with other projects in the recent past.
For 2026.2, the PTL elections are imminent. If you're still identifying liaisons, I would recommend nominating yourself as PTL for the upcoming release cycle. PTLs can and should have liaisons that handle different aspects of project maintenance too :) I understand, though, that you want to prevent a single point of failure. You could alternatively propose a DPL transition right away with the liaisons you do have, but please note that any PTL nominee during the election window will override this change. Please continue to coordinate with other new/existing maintainers and the TC as you're doing. We'll conclude elections by March 19, 2026 at the latest, and we'll resolve the project governance by then.
>
>
> Our overall goal is to restart the Nova-Cyborg integration work to improve the
> accelerator management UX (attach/detach, move operations, etc.). To reach
> that point, we must first pay down technical debt and rebuild an active core
> review team.
>
> I have prepared a high-level maintenance roadmap and task list here:
> https://etherpad.opendev.org/p/cyborg-maintance-2026.2
> which I intend to use to track this work.
>
> I look forward to hearing your thoughts.
> The first step should probably be to discuss this at the next TC
> meeting and to follow up with the existing core team for comment.
+1, added it to our agenda: https://wiki.openstack.org/wiki/Meetings/TechnicalCommittee#Next_Meeting
>
> If there are no objections,
> I will coordinate with the TC and Infra team regarding the necessary
> permission updates.
>
> Regards,
>
> Sean
>