[Cyborg][Ironic][Nova][Neutron][TripleO][Cinder] accelerators management

Nadathur, Sundar sundar.nadathur at intel.com
Mon Jan 13 18:16:20 UTC 2020


> From: Dan Smith <dms at danplanet.com>
> Sent: Monday, January 13, 2020 7:17 AM
> To: Nadathur, Sundar <sundar.nadathur at intel.com>
> Cc: Arkady.Kanevsky at dell.com; openstack-discuss at lists.openstack.org
> Subject: Re: [Cyborg][Ironic][Nova][Neutron][TripleO][Cinder] accelerators
> management
> 
> > TL;DR
> >
> > * Agree with Arkady that firmware updates should follow the server
> > vendors' guidelines, and can/should be done as part of the server
> > configuration.
> 
> I'm worried there's a little bit of confusion about "which nova" and "which
> ironic" in this case, especially since Arkady mentioned tripleo. More on that
> below. However, I agree that if you're using ironic to manage the nodes that
> form your actual (over)cloud, then having ironic update firmware on your
> accelerator device in the same way that it might update firmware on a regular
> NIC, GPU card, or anything else makes sense.
> 
> However, if you're talking about services all at the same level (i.e. nova
> working with ironic to provide metal as a tenant as well as
> VMs) then *that* ironic is not going to be managing firmware on accelerators
> that you're handing to your VM instances on the compute nodes.

This goes back to the distinction between firmware update and programming in my earlier post. In a Nova + Ironic + Cyborg environment, I'd expect Cyborg to do the programming. Firmware updates can be done by Ironic, by Ansible/Redfish/..., by some combination such as Ironic with the Redfish driver, or by whatever the operator chooses.

> > To the best of my knowledge, Ironic handles devices based on PCI IDs.
> > Cyborg is designed to go deeper for discovering device
> > features/properties and utilize Placement for scheduling based on
> > these.
> 
> What does this matter though? If you're talking about firmware for an FPGA
> card, that's what you need to know in order to apply the correct firmware to
> it, independent of whatever application-level bitstream is going to go in there
> right?

The device properties are needed for scheduling. Users often want a VM with an accelerator that has specific properties, e.g. one that implements a specific version of gzip or has 4 GB or more of device-local memory.
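
As a rough illustration of property-based selection (a sketch only, not Cyborg or Placement code; the Device record, the trait names and the memory field are all made up for this example):

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    # Hypothetical record; not Cyborg's or Placement's data model.
    name: str
    traits: set = field(default_factory=set)  # e.g. which function is implemented
    local_memory_gb: int = 0                  # device-local memory


def matching_devices(devices, required_traits, min_memory_gb):
    """Return devices that have every required trait and enough local memory."""
    return [
        d for d in devices
        if required_traits <= d.traits and d.local_memory_gb >= min_memory_gb
    ]


devices = [
    Device("fpga0", {"CUSTOM_FUNCTION_GZIP_V2"}, 8),
    Device("fpga1", {"CUSTOM_FUNCTION_GZIP_V1"}, 8),  # wrong gzip version
    Device("fpga2", {"CUSTOM_FUNCTION_GZIP_V2"}, 2),  # too little memory
]

selected = matching_devices(devices, {"CUSTOM_FUNCTION_GZIP_V2"}, 4)
print([d.name for d in selected])
```

All three devices could share the same PCI ID, which is why a PCI ID alone would not be enough to pick fpga0 here.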

Device properties are also needed for management of accelerator inventory: admins want to know how many FPGAs have a particular bitstream burnt into them, etc. 

Re. programming, sometimes we may need to determine what's in a device (beyond PCI ID) before programming it to ensure the image being programmed and the existing device contents are compatible.
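
A minimal sketch of such a pre-programming check, assuming (hypothetically) that the device reports what is currently loaded as a "region_type" string and that the image metadata declares a "required_region_type"; neither field name comes from Cyborg:

```python
def can_program(device_contents: dict, image_metadata: dict) -> bool:
    """Allow programming only if the image declares a requirement and
    the device's current contents satisfy it."""
    required = image_metadata.get("required_region_type")
    if required is None:
        # An image with no declared requirement is refused rather than guessed at.
        return False
    return device_contents.get("region_type") == required


device = {"pci_id": "8086:09c4", "region_type": "shell-a"}  # illustrative values
ok_image = {"name": "gzip-v2", "required_region_type": "shell-a"}
bad_image = {"name": "crypto-v1", "required_region_type": "shell-b"}

print(can_program(device, ok_image), can_program(device, bad_image))
```

The point is only that the decision depends on something read from the device beyond its PCI ID.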

> >> One cannot and should not manage lifecycle of server components
> >> independently.
> >
> > If what you meant to say is: ' do not update device firmware
> > independently of other server components', agreed.
> 
> I'm not really sure what this original point from Arkady really means. Are
> (either of) you saying that if there's a CVE for the firmware in some card that
> the firmware patch shouldn't be applied without taking the box through a full
> lifecycle event or something?

My paraphrase of Arkady's points:
a. Updating CPU firmware/microcode should be done as per the server/CPU vendor's rules (use their specific tools, or specific mechanisms like Redfish, with auditing, etc.).
b. Updating firmware for devices/accelerators should be done the same way.

By a "full lifecycle event", you presumably mean vacating the entire node. For device updates, that is not always needed: one could disconnect just the instances using that device. The server/device vendor rules must specify the 'lifecycle event' involved for a specific update.
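
To make the "disconnect just the instances using that device" case concrete, here is a small sketch. The attachment map is a made-up structure for illustration, not something Nova or Cyborg exposes in this form:

```python
def instances_to_disconnect(attachments, device):
    """attachments: instance name -> set of attached device names (hypothetical).
    Returns only the instances that use the device being updated."""
    return sorted(inst for inst, devs in attachments.items() if device in devs)


attachments = {
    "vm-a": {"fpga0"},
    "vm-b": {"fpga1"},
    "vm-c": {"fpga0", "fpga1"},
    "vm-d": set(),  # no accelerator attached
}

affected = instances_to_disconnect(attachments, "fpga0")
print(affected)  # ['vm-a', 'vm-c']
```

Here vm-b and vm-d stay untouched on the node, which is the difference from vacating the whole node.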

> AFAIK, Ironic can't just do this in isolation, which
> means that if you've got a compute node managed by ironic in a tripleo type
> of environment, you're looking to move workloads away from that node,
> destroy it, apply updates, and re-create it before you can use it again. I guess
> I'd be surprised if people are doing this every time intel releases another
> microcode update. Am I wrong about that?

Not making any official statements but, generally, if a microcode/firmware update requires a reboot, one would have to do that. The admin would declare a maintenance window and combine software/firmware/configuration updates in that window.

> Either way, I'm not sure how the firmware for accelerator cards is any
> different from the firmware for other devices on the system. 

Updates of other devices, like CPU or motherboard components, often require server reboots. Accelerator updates may or may not require them, depending on ... all kinds of things.
 
> Maybe the confusion is just that Cyborg does "programming" which seems similar to
> "updating firmware"?

Yes, indeed. That is why I went to some length on the distinction between the two.

> >> Nova and Neutron are getting info about all devices and their
> >> capabilities from Ironic; that they use for scheduling
> >
> > Hmm, this seems overly broad to me: not every deployment includes
> > Ironic, and getting PCI IDs is not enough for scheduling and
> > management.
> 
> I also don't think it's correct. Nova does not get info about devices from
> Ironic, and I kinda doubt Neutron does either. If Nova is using ironic to
> provide metal as tenants, then...sure, but in the case where nova is providing
> VMs with accelerator cards, Ironic is not involved.

+1 

> >> Thus, move all device Life-cycle code from Cyborg to Ironic
> >
> > To recap, there is more to device lifecycle than firmware update. I'd
> > suggest the other aspects can remain in Cyborg.
> 
> Didn't you say that firmware programming (as defined here) is not something
> that Cyborg currently does? Thus, nothing Cyborg currently does should be
> moved to Ironic, AFAICT. If that is true, then I agree.

Yes ^. 

> I guess my summary is: firmware updates for accelerators can and should be
> handled the same as for other devices on the system, in whatever way the
> operator currently does that. Programming an application-level bitstream
> should not be confused with the former activity, and is fully within the
> domain of Cyborg's responsibilities.

Agreed.

> --Dan

Regards,
Sundar


