[openstack-dev] [nova] [cyborg] Race condition in the Cyborg/Nova flow

Dan Smith dms at danplanet.com
Thu Mar 29 18:04:26 UTC 2018


> ==> Fully dynamic: You can program one region with one function, and
> then still program a different region with a different function, etc.

Note that this is also the case if you don't have virtualized multi-slot
devices. Like, if you had one that only has one region. Consuming it
consumes the one and only inventory.

> ==> Single program: Once you program the card with a function, *all* its
> virtual slots are *only* capable of that function until the card is
> reprogrammed.  And while any slot is in use, you can't reprogram.  This
> is Sundar's FPGA use case.  It is also Sylvain's VGPU use case.
>
> The "fully dynamic" case is straightforward (in the sense of being what
> placement was architected to handle).
> * Model the PF/region as a resource provider.
> * The RP has inventory of some generic resource class (e.g. "VGPU",
> "SRIOV_NET_VF", "FPGA_FUNCTION").  Allocations consume that inventory,
> plain and simple.
> * As a region gets programmed dynamically, it's acceptable for the thing
> doing the programming to set a trait indicating that that function is in
> play.  (Sundar, this is the thing I originally said would get
> resistance; but we've agreed it's okay.  No blood was shed :)
> * Requests *may* use preferred traits to help them land on a card that
> already has their function flashed on it. (Prerequisite: preferred
> traits, which can be implemented in placement.  Candidates with the most
> preferred traits get sorted highest.)

Yup.

> The "single program" case needs to be handled more like what Alex
> describes below.  TL;DR: We do *not* support dynamic programming,
> traiting, or inventorying at instance boot time - it all has to be done
> "up front".
> * The PFs can be initially modeled as "empty" resource providers.  Or
> maybe not at all.  Either way, *they can not be deployed* in this state.
> * An operator or admin (via a CLI, config file, agent like blazar or
> cyborg, etc.) preprograms the PF to have the specific desired
> function/configuration.
>   * This may be cyborg/blazar pre-programming devices to maintain an
> available set of each function
>   * This may be in response to a user requesting some function, which
> causes a new image to be laid down on a device so it will be available
> for scheduling
>   * This may be a human doing it at cloud-build time
> * This results in the resource provider being (created and) set up with
> the inventory and traits appropriate to that function.
> * Now deploys can happen, using required traits representing the desired
> function.

...and it could be in response to something noticing that a recent nova
boot failed to find any candidates with a particular function, which
provisions that thing so it can be retried. This is kindof the "spot
instances" approach -- that same workflow would work here as well,
although I expect most people would fit into the above cases.

--Dan



More information about the OpenStack-dev mailing list