[Cyborg][Ironic][Nova][Neutron][TripleO][Cinder] accelerators management

Dan Smith dms at danplanet.com
Mon Jan 13 15:16:30 UTC 2020


> TL;DR
>
> * Agree with Arkady that firmware updates should follow the server
> vendors' guidelines, and can/should be done as part of the server
> configuration.

I'm worried there's a little bit of confusion about "which nova" and
"which ironic" in this case, especially since Arkady mentioned
tripleo. More on that below. However, I agree that if you're using
ironic to manage the nodes that form your actual (over)cloud, then having
ironic update firmware on your accelerator device in the same way that it
might update firmware on a regular NIC, GPU card, or anything else makes
sense.

But if you're talking about services all at the same level
(i.e. nova working with ironic to provide bare metal to tenants as well
as VMs), then *that* ironic is not going to be managing firmware on
accelerators that you're handing to your VM instances on the compute nodes.

>> Managing life cycle of devices is Ironic responsibility, 
>
> Disagree here.

Me too, but in a general sense. I would not agree with the assessment
that "Managing life cycle of devices is Ironic responsibility,"
specifically because of the wide scope of "devices" being more than just
physical machines. It's true that Ironic manages the lifecycle of physical
machines, which may be used in a tripleo type of environment to manage
the lifecycle of things like compute nodes.

I *think* you both agree with that clarification, because of the next
point, but I think it's important to avoid such statements that imply
"all devices."

> To the best of my knowledge, Ironic handles devices based on PCI
> IDs. Cyborg is designed to go deeper for discovering device
> features/properties and utilize Placement for scheduling based on
> these.

Why does this matter though? If you're talking about firmware for an
FPGA card, the PCI ID is what you need to know in order to apply the
correct firmware to it, independent of whatever application-level
bitstream is going to go in there, right?

>> One cannot and should not manage lifecycle of server components independently. 
>
> If what you meant to say is: ' do not update device firmware
> independently of other server components', agreed.

I'm not really sure what this original point from Arkady
means. Are (either of) you saying that if there's a CVE for the firmware
in some card that the firmware patch shouldn't be applied without taking
the box through a full lifecycle event or something? AFAIK, Ironic can't
just do this in isolation, which means that if you've got a compute node
managed by ironic in a tripleo type of environment, you're looking to
move workloads away from that node, destroy it, apply updates, and
re-create it before you can use it again. I guess I'd be surprised if
people are doing this every time Intel releases another microcode
update. Am I wrong about that?

Either way, I'm not sure how the firmware for accelerator cards is any
different from the firmware for other devices on the system. Maybe the
confusion is just that Cyborg does "programming" which seems similar to
"updating firmware"?

>> Nova and Neutron are getting info about all devices and their
>> capabilities from Ironic; that they use for scheduling
>
> Hmm, this seems overly broad to me: not every deployment includes
> Ironic, and getting PCI IDs is not enough for scheduling and
> management.

I also don't think it's correct. Nova does not get info about devices
from Ironic, and I kinda doubt Neutron does either. If Nova is using
ironic to provide bare metal to tenants, then...sure, but in the case
where nova is providing VMs with accelerator cards, Ironic is not involved.

>> Thus, I propose that Cyborg cover use cases 1 & 2, and Ironic would cover use case 3
>
> Use case 3 says "setup device for its use, like burning specific FW."
> With the definition of firmware above, I agree. Other aspects of
> lifecycle management, not covered by use cases 1 - 3, would come under
> Cyborg.
>
>> Thus, move all device Life-cycle code from Cyborg to Ironic
>
> To recap, there is more to device lifecycle than firmware update. I'd
> suggest the other aspects can remain in Cyborg.

Didn't you say that firmware programming (as defined here) is not
something that Cyborg currently does? Thus, nothing Cyborg currently
does should be moved to Ironic, AFAICT. If that is true, then I agree.

I guess my summary is: firmware updates for accelerators can and should
be handled the same as for other devices on the system, in whatever way
the operator currently does that. Programming an application-level
bitstream should not be confused with the former activity, and is fully
within the domain of Cyborg's responsibilities.

--Dan


