[Cyborg][Ironic][Nova][Neutron][TripleO][Cinder] accelerators management

Julia Kreger juliaashleykreger at gmail.com
Mon Jan 6 21:32:57 UTC 2020


Greetings Arkady,

I think your message makes a very good case and raises a point that
I've been trying to type out for the past hour, only in different
words.

We have multiple USER-driven interactions with similar, if not
identical, desired end results, where different paths can be taken.
The use cases we perceive range from "As a user, I would like a VM
with a configured accelerator" and "I would like any compute resource
(VM or Baremetal) with a configured accelerator", to "As an
administrator, I need to reallocate a baremetal node for this
different use, so my user can leverage its accelerator once they know
how and are ready to use it", and, as suggested, "I as a user want
baremetal with k8s and configured accelerators."

And I suspect this diversity of use patterns is where things begin to
become difficult. As such, I believe we, in essence, have a question
of a support or compatibility matrix that definitely has gaps,
depending on "how" the "user" wants or needs to achieve their goals.

And, I think where this entire discussion _can_ go sideways is...
(from what I understand) some of these devices need to be flashed by
the application user with firmware on demand to meet the user's needs,
which is where lifecycle and support interactions begin to become...
conflicted.

Further complicating matters are the "Metal to Tenant" use cases, where
the user requesting the machine is not an administrator, but has some
level of inherent administrative access to all Operating System
accessible devices once their OS has booted. Which makes me wonder
"What if the cloud administrators WANT to block the tenant's direct
ability to write/flash firmware into accelerator/smartnic/etc?" I
suspect if cloud administrators want to block such hardware access,
vendors will want to support such a capability. Blocking such access
inherently forces some actions into hardware management/maintenance
workflows, and may ultimately cause some of a support matrix's use
cases to be unsupportable, again ultimately depending on what exactly
the user is attempting to achieve.
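
To make the matrix idea a bit more concrete, here is a minimal sketch
in Python. Everything in it is an assumption for illustration: the
use-case names are loosely taken from this thread, and the
needs_tenant_flash flag and the tenant_flash_blocked switch are
hypothetical, not anything Cyborg or Ironic defines today.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UseCase:
    name: str
    # Hypothetical flag: must the tenant flash firmware themselves?
    needs_tenant_flash: bool


# Illustrative entries only, loosely based on the use cases above.
USE_CASES = [
    UseCase("VM with a configured accelerator", False),
    UseCase("baremetal node reallocated by an administrator", False),
    UseCase("baremetal tenant flashing firmware on demand", True),
]


def support_matrix(use_cases, tenant_flash_blocked):
    """Map each use case name to True (supportable) or False (a gap)."""
    return {
        uc.name: not (uc.needs_tenant_flash and tenant_flash_blocked)
        for uc in use_cases
    }


if __name__ == "__main__":
    for blocked in (False, True):
        print(f"tenant flashing blocked: {blocked}")
        for name, ok in support_matrix(USE_CASES, blocked).items():
            print(f"  {'supported' if ok else 'GAP':9}  {name}")
```

With the hypothetical policy switched on, the only row that turns into
a gap is the one that depends on the tenant flashing firmware, which
is the kind of use case that would have to move into a hardware
management/maintenance workflow instead.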

Going back to the suggestions in the original email, they seem logical
to me in terms of the delineation and separation of responsibilities
as we present a cohesive solution to the users of our software.

Greetings Zhipeng,

Is there any documentation at present that details the desired support
and use cases?  I think this would at least help my understanding,
since everything that requires the power to be on would still need to
be integrated within workflows for eventual tighter integration.
Also, has Cyborg drafted any plans or proposals for integration?

-Julia

On Mon, Jan 6, 2020 at 9:14 AM <Arkady.Kanevsky at dell.com> wrote:
>
> Zhipeng,
>
> Thanks for the quick feedback.
>
> Where is the accelerating device running? I am aware of 3 possibilities: servers, storage, switches.
>
> In each of them, the device is managed as part of the server, storage box, or switch.
>
>
>
> The core of my message is the separation of device life cycle management in the “box” where the device is placed from the programming of the device as needed per application (VM, container).
>
>
>
> Thanks,
> Arkady
>
>
>
> From: Zhipeng Huang <zhipengh512 at gmail.com>
> Sent: Friday, January 3, 2020 7:53 PM
> To: Kanevsky, Arkady
> Cc: OpenStack Discuss
> Subject: Re: [Cyborg][Ironic][Nova][Neutron][TripleO][Cinder] accelerators management
>
>
>
> Hi Arkady,
>
>
>
> Thanks for your interest in the Cyborg project :) I would like to point out that when we initiated the project there were two specific use cases we wanted to cover: accelerators attached locally (via PCIe or another bus type) or remotely (via Ethernet or another fabric type).
>
>
>
> For the latter, it is clear that the life cycle is independent from the server (like a block device managed by Cinder). For the former, however, the life cycle does not depend on the server for all kinds of accelerators either. For example, we already have PCIe-based AI accelerator cards and Smart NICs that can be powered on/off while the server stays on the whole time.
>
>
>
> Therefore, for the reasons above, it is not a good idea to move all of the life cycle management into Ironic. Ironic integration is very important for the standalone usage of Cyborg for Kubernetes, Envoy (TLS acceleration), and the like.
>
>
>
> Hope this answers your question :)
>
>
>
> On Sat, Jan 4, 2020 at 5:23 AM <Arkady.Kanevsky at dell.com> wrote:
>
> Fellow Open Stackers,
>
> I have been thinking about how to handle SmartNICs, GPUs, and FPGAs across different projects within OpenStack, with Cyborg taking a leading role in it.
>
>
>
> Cyborg is an important project and addresses accelerator devices that are part of the server and potentially of switches and storage.
>
> It addresses 3 different use cases and sets of users, which are all grouped into a single project.
>
>
>
> 1. The application user needs to program a portion of the device under management, like a GPU or SmartNIC, for that app's usage. Having a common way to do it across different device families and different vendors is very important. It has to be done every time a VM that needs a device is deployed, so it is tied to VM scheduling.
> 2. The administrator needs to program the whole device for a specific usage. That covers the scenario where a device can only support a single tenant or a single use case. It is done once during OpenStack deployment, but the device may need reprogramming to configure it for a different usage. It may or may not require a reboot of the server.
> 3. The administrator needs to set up the device for use, like burning specific FW on it. This is typically done as part of a server life-cycle event.
>
>
>
> The first 2 cases cover the application life cycle of device usage.
>
> The last one covers the device life cycle independently of how the device is used.
>
>
>
> Managing the life cycle of devices is Ironic's responsibility. One cannot and should not manage the lifecycle of server components independently. Managing server devices outside server management violates customer service agreements with server vendors and breaks server support agreements.
>
> Nova and Neutron get info about all devices and their capabilities from Ironic, which they use for scheduling. We should avoid creating a new project for every new component of the server and modifying Nova and Neutron for each new device. (The same will also apply to Cinder and Manila if smart devices are used in their data/control path on a server.)
>
> Finally, we want Cyborg to be usable in a standalone capacity, say for Kubernetes.
>
>
>
> Thus, I propose that Cyborg cover use cases 1 & 2, and Ironic would cover use case 3.
>
> Thus, move all device Life-cycle code from Cyborg to Ironic.
>
> Concentrate Cyborg on fulfilling the first 2 use cases.
>
> Simplify the integration with Nova and Neutron for these accelerators by using the existing Ironic mechanism for it.
>
> Create idempotent calls for use case 1 so Nova and Neutron can use them as part of VM deployment to ensure that devices are programmed for the needs of the VM being scheduled.
>
> Create idempotent call(s) for use case 2 so TripleO can set up a device for single-accelerator usage of a node.
>
> [Propose similar model for CNI integration.]
>
>
>
> Let the discussion start!
>
>
>
> Thanks,
> Arkady
>
>
>
>
> --
>
> Zhipeng (Howard) Huang
>
>
>
> Principal Engineer
>
> OpenStack, Kubernetes, CNCF, LF Edge, ONNX, Kubeflow, OpenSDS, Open Service Broker API, OCP, Hyperledger, ETSI, SNIA, DMTF, W3C
>
>


