[openstack-dev] [nova] [cyborg] Race condition in the Cyborg/Nova flow
dms at danplanet.com
Thu Mar 29 18:04:26 UTC 2018
> ==> Fully dynamic: You can program one region with one function, and
> then still program a different region with a different function, etc.
Note that this is also the case if you don't have virtualized multi-slot
devices, e.g. a device that only has one region: consuming it consumes
the one and only inventory.
> ==> Single program: Once you program the card with a function, *all* its
> virtual slots are *only* capable of that function until the card is
> reprogrammed. And while any slot is in use, you can't reprogram. This
> is Sundar's FPGA use case. It is also Sylvain's VGPU use case.
> The "fully dynamic" case is straightforward (in the sense of being what
> placement was architected to handle).
> * Model the PF/region as a resource provider.
> * The RP has inventory of some generic resource class (e.g. "VGPU",
> "SRIOV_NET_VF", "FPGA_FUNCTION"). Allocations consume that inventory,
> plain and simple.
> * As a region gets programmed dynamically, it's acceptable for the thing
> doing the programming to set a trait indicating that that function is in
> play. (Sundar, this is the thing I originally said would get
> resistance; but we've agreed it's okay. No blood was shed :)
> * Requests *may* use preferred traits to help them land on a card that
> already has their function flashed on it. (Prerequisite: preferred
> traits, which can be implemented in placement. Candidates with the most
> preferred traits get sorted highest.)
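To make the "fully dynamic" flow above concrete, here is a rough sketch of
the preferred-trait sort. It is a toy in-memory model, not placement code:
the Provider class, rank_candidates() and the CUSTOM_FUNC_GZIP trait name
are all made up for illustration, and preferred traits are still only a
proposal at this point.

```python
from dataclasses import dataclass, field


@dataclass
class Provider:
    """Hypothetical stand-in for a resource provider (one PF/region)."""
    name: str
    total: int                      # inventory of a generic class, e.g. FPGA_FUNCTION
    used: int = 0                   # amount already consumed by allocations
    traits: set = field(default_factory=set)

    def free(self):
        return self.total - self.used


def rank_candidates(providers, amount, preferred_traits):
    """Keep providers with enough free inventory, most preferred traits first.

    Preferred traits never exclude a provider; they only affect ordering.
    sorted() is stable, so ties keep their original relative order.
    """
    candidates = [p for p in providers if p.free() >= amount]
    return sorted(
        candidates,
        key=lambda p: len(p.traits & preferred_traits),
        reverse=True,
    )


providers = [
    Provider("region_a", total=1),                                       # nothing flashed yet
    Provider("region_b", total=1, traits={"CUSTOM_FUNC_GZIP"}),          # already flashed
    Provider("region_c", total=1, used=1, traits={"CUSTOM_FUNC_GZIP"}),  # flashed but consumed
]
ranked = rank_candidates(providers, 1, {"CUSTOM_FUNC_GZIP"})
print([p.name for p in ranked])
```

The point is that region_a is still a valid candidate (it can be programmed
on demand); it just sorts after the region that already has the function.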
> The "single program" case needs to be handled more like what Alex
> describes below. TL;DR: We do *not* support dynamic programming,
> traiting, or inventorying at instance boot time - it all has to be done
> "up front".
> * The PFs can be initially modeled as "empty" resource providers. Or
> maybe not at all. Either way, *they can not be deployed* in this state.
> * An operator or admin (via a CLI, config file, agent like blazar or
> cyborg, etc.) preprograms the PF to have the specific desired
> * This may be cyborg/blazar pre-programming devices to maintain an
> available set of each function
> * This may be in response to a user requesting some function, which
> causes a new image to be laid down on a device so it will be available
> for scheduling
> * This may be a human doing it at cloud-build time
> * This results in the resource provider being (created and) set up with
> the inventory and traits appropriate to that function.
> * Now deploys can happen, using required traits representing the desired
...and it could be in response to something noticing that a recent nova
boot failed to find any candidates with a particular function, and then
provisioning that function so the boot can be retried. This is kind of
the "spot instances" approach -- that same workflow would work here as
well, although I expect most people would fit into the above cases.
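A rough sketch of the "single program" rules plus the retry idea, in case
it helps. Everything here is hypothetical (SingleProgramCard, allocate(),
boot_with_retry() and the trait names are invented for illustration); it is
a toy model of the workflow, not nova, placement, or cyborg code.

```python
class SingleProgramCard:
    """Toy model: one card whose slots all share a single programmed function."""

    def __init__(self, name, slots):
        self.name = name
        self.slots = slots
        self.function = None    # nothing programmed yet: not deployable
        self.in_use = 0

    def program(self, function):
        # While any slot is in use, the card can't be reprogrammed.
        if self.in_use:
            raise RuntimeError("cannot reprogram while a slot is in use")
        self.function = function

    def inventory(self):
        # An unprogrammed card exposes no inventory.
        return self.slots if self.function else 0

    def traits(self):
        return {self.function} if self.function else set()


def allocate(cards, required_trait):
    """Consume one slot on the first card that has the required trait and room."""
    for card in cards:
        if required_trait in card.traits() and card.in_use < card.inventory():
            card.in_use += 1
            return card
    return None


def boot_with_retry(cards, required_trait):
    """Try to allocate; on failure, program one idle card and retry once."""
    card = allocate(cards, required_trait)
    if card is not None:
        return card
    for idle in cards:
        if idle.in_use == 0 and idle.function != required_trait:
            idle.program(required_trait)
            break
    return allocate(cards, required_trait)


cards = [SingleProgramCard("card0", slots=2)]
print(allocate(cards, "CUSTOM_FUNC_GZIP"))              # no candidates yet
print(boot_with_retry(cards, "CUSTOM_FUNC_GZIP").name)  # programmed, then allocated
```

Note the programming step happens outside the allocation itself: the
retry only works because something provisions the card between attempts.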