From: Julia Kreger <juliaashleykreger@gmail.com>
Sent: Monday, January 6, 2020 1:33 PM
To: Arkady.Kanevsky@dell.com
Cc: Zhipeng Huang <zhipengh512@gmail.com>; openstack-discuss <openstack-discuss@lists.openstack.org>
Subject: Re: [Cyborg][Ironic][Nova][Neutron][TripleO][Cinder] accelerators management
Hi Julia, Lots of good points here.
Greetings Arkady,
I think your message makes a very good case and raises a point that I've been trying to type out for the past hour, only in different words.
We have multiple USER-driven interactions with a similar, if not identical, desired end result, where different paths can be taken. The use cases we perceive range from "As a user, I would like a VM with a configured accelerator" and "I would like any compute resource (VM or Baremetal), with a configured accelerator", to "As an administrator, I need to reallocate a baremetal node for this different use, so my user can leverage its accelerator once they know how and are ready to use it", and, as suggested, "I as a user want baremetal with k8s and configured accelerators."
And I suspect this diversity of use patterns is where things begin to become difficult. As such, I believe we have, in essence, a question of a support or compatibility matrix that definitely has gaps depending on "how" the "user" wants or needs to achieve their goals.
Yes, there are a wide variety of deployments and use cases. There may not be a single silver bullet solution for all of them. There may be different solutions, such as Ironic standalone, Ironic with Nova, and potentially some combination with Cyborg.
And, I think where this entire discussion _can_ go sideways is... (from what I understand) some of these devices need to be flashed by the application user with firmware on demand to meet the user's needs, which is where lifecycle and support interactions begin to become... conflicted.
We are probably using different definitions of the term 'firmware.' As I said in another response in this thread, if a device configuration exposes application-specific features or schedulable features, then the term 'firmware update' may not be applicable IMHO, since it is going to be done dynamically as workloads spin up and retire. This is especially so given Arkady's stipulation that firmware updates are done as part of server configuration and as per server vendor's guidelines.
Further complicating matters are the "Metal to Tenant" use cases, where the user requesting the machine is not an administrator but has some level of inherent administrative access to all Operating System accessible devices once their OS has booted. That makes me wonder: "What if the cloud administrators WANT to block the tenant's direct ability to write/flash firmware into accelerator/smartnic/etc?"
Yes, admins may want to do that. This can be done (partly) via RBAC, by having different roles for tenants who can use devices but not reprogram them, and for tenants who can program the device with application/scheduling-relevant features (but not firmware), etc.
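To make the role split above concrete, here is a minimal sketch in Python. The role names, the action names, and the is_allowed helper are all hypothetical; they are not taken from Cyborg, Nova, or Keystone, and a real deployment would express this in its policy files rather than in code like this.

```python
from enum import Enum, auto


class Action(Enum):
    USE_DEVICE = auto()       # run workloads on an already-configured device
    PROGRAM_DEVICE = auto()   # load application/scheduling-relevant features
    UPDATE_FIRMWARE = auto()  # flash firmware; kept for operators in this sketch


# Hypothetical role names, chosen only for illustration.
ROLE_PERMISSIONS = {
    "accel_user": {Action.USE_DEVICE},
    "accel_programmer": {Action.USE_DEVICE, Action.PROGRAM_DEVICE},
    "cloud_admin": {
        Action.USE_DEVICE,
        Action.PROGRAM_DEVICE,
        Action.UPDATE_FIRMWARE,
    },
}


def is_allowed(roles, action):
    """Return True if any of the caller's roles grants the action."""
    return any(action in ROLE_PERMISSIONS.get(role, set()) for role in roles)
```

With this mapping, a tenant holding only the hypothetical accel_user role can use a device but cannot reprogram it, and neither tenant role can update firmware.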
I suspect if cloud administrators want to block such hardware access, vendors will want to support such a capability.
Devices can and usually do offer separate mechanisms for reading from registers, writing to them, updating flash, etc., each with associated access permissions. A device vendor can go a step further by requiring specific Linux capabilities in their device driver, such as, say, CAP_IPC_LOCK for mmap access.
Blocking such access inherently forces some actions into hardware management/maintenance workflows, and may ultimately cause some of a support matrix's use cases to be unsupportable, again depending on what exactly the user is attempting to achieve.
Not sure if you are expressing a concern here. If the admin is using device features or RBAC to restrict access, then she is intentionally blocking some combinations in your support matrix, right? Users in such a deployment need to live with that.
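As a rough illustration of that point, the sketch below treats each use case as a set of capabilities it needs and removes the ones the admin has blocked. The use-case names, the capability names, and the mapping between them are assumptions made up for this example; they do not describe an actual support matrix for Ironic, Nova, or Cyborg.

```python
# Hypothetical use cases mapped to the capabilities each one is assumed to need.
USE_CASES = {
    "vm_with_configured_accelerator": {"use_device"},
    "baremetal_with_configured_accelerator": {"use_device"},
    "baremetal_with_tenant_flashed_device": {"use_device", "tenant_flash"},
}


def supported_use_cases(use_cases, blocked):
    """Return the names of use cases that need none of the blocked capabilities."""
    return [name for name, needs in use_cases.items() if not (needs & blocked)]


# An admin who blocks tenant flashing removes exactly the rows that depend on it.
remaining = supported_use_cases(USE_CASES, {"tenant_flash"})
```

The rows that disappear are the ones the admin chose to block, which is the intended outcome rather than a gap in the matrix.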
Is there any documentation at present that details the desired support and use cases? I think this would at least help my understanding, since everything that requires the power to be on would still need to be integrated within workflows for eventual tighter integration.
The Cyborg spec [1] addresses the Nova/VM-based use cases.
[1] https://opendev.org/openstack/cyborg-specs/src/branch/master/specs/train/app...
Also, has Cyborg drafted any plans or proposals for integration?
For Nova integration, we have a spec [2].
[2] https://review.opendev.org/#/c/684151/
-Julia
Regards, Sundar