TL;DR
* Agree with Arkady that firmware updates should follow the server vendors' guidelines, and can/should be done as part of the server configuration.
I'm worried there's a bit of confusion about "which nova" and "which ironic" in this case, especially since Arkady mentioned tripleo. More on that below. That said, I agree that if you're using ironic to manage the nodes that form your actual (over)cloud, then having ironic update firmware on your accelerator device in the same way that it might update firmware on a regular NIC, GPU card, or anything else makes sense. However, if you're talking about services all at the same level (i.e. nova working with ironic to provide metal as a tenant as well as VMs), then *that* ironic is not going to be managing firmware on accelerators that you're handing to your VM instances on the compute nodes.
Managing life cycle of devices is Ironic responsibility,
Disagree here.
Me too, but in a general sense. I would not agree with the assessment that "Managing life cycle of devices is Ironic responsibility." Specifically, I object to the wide scope of "devices," which covers more than just physical machines. It's true that Ironic manages the lifecycle of physical machines, which may be used in a tripleo type of environment to manage the lifecycle of things like compute nodes. I *think* you both agree with that clarification, because of the next point, but I think it's important to avoid statements that imply "all devices."
To the best of my knowledge, Ironic handles devices based on PCI IDs. Cyborg is designed to go deeper for discovering device features/properties and utilize Placement for scheduling based on these.
Why does this matter, though? If you're talking about firmware for an FPGA card, the PCI ID is what you need to know in order to apply the correct firmware to it, independent of whatever application-level bitstream is going to go in there, right?
One cannot and should not manage lifecycle of server components independently.
If what you meant to say is 'do not update device firmware independently of other server components', agreed.
I'm not really sure what this original point from Arkady means. Are (either of) you saying that if there's a CVE for the firmware in some card, the firmware patch shouldn't be applied without taking the box through a full lifecycle event or something? AFAIK, Ironic can't just do this in isolation, which means that if you've got a compute node managed by ironic in a tripleo type of environment, you're looking to move workloads away from that node, destroy it, apply updates, and re-create it before you can use it again. I guess I'd be surprised if people are doing this every time Intel releases another microcode update. Am I wrong about that? Either way, I'm not sure how the firmware for accelerator cards is any different from the firmware for other devices on the system. Maybe the confusion is just that Cyborg does "programming," which seems similar to "updating firmware"?
Nova and Neutron are getting info about all devices and their capabilities from Ironic, which they use for scheduling
Hmm, this seems overly broad to me: not every deployment includes Ironic, and getting PCI IDs is not enough for scheduling and management.
I also don't think it's correct. Nova does not get info about devices from Ironic, and I kinda doubt Neutron does either. If Nova is using ironic to provide metal as tenants, then...sure, but in the case where nova is providing VMs with accelerator cards, Ironic is not involved.
Thus, I propose that Cyborg cover use cases 1 & 2, and Ironic would cover use case 3
Use case 3 says "setup device for its use, like burning specific FW." With the definition of firmware above, I agree. Other aspects of lifecycle management, not covered by use cases 1 - 3, would come under Cyborg.
Thus, move all device Life-cycle code from Cyborg to Ironic
To recap, there is more to device lifecycle than firmware update. I'd suggest the other aspects can remain in Cyborg.
Didn't you say that firmware programming (as defined here) is not something that Cyborg currently does? Thus, nothing Cyborg currently does should be moved to Ironic, AFAICT. If that is true, then I agree.

I guess my summary is: firmware updates for accelerators can and should be handled the same as for other devices on the system, in whatever way the operator currently does that. Programming an application-level bitstream should not be confused with the former activity, and is fully within the domain of Cyborg's responsibilities.

--Dan