<html>

  <head>

    <meta http-equiv="content-type" content="text/html; charset=utf-8">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    Hi all,<br>

       The Cyborg/Nova scheduling spec [1] details what traits will be

    applied to the resource providers that represent devices like GPUs.

    Some of the traits referred to vendor names. I got feedback that

    traits must not refer to products or specific models of devices. I

    agree. However, we need some reference to device types to enable

    matching the VM driver with the device.<br>

    <br>

    TL;DR: We need some reference to device types, but we don't need

    product names. I will update the spec [1] to clarify that. The rest

    of this email explains why we need device types in traits, and which

    traits we propose to include.<br>

    <br>

    In general, an accelerator device is operated by two pieces of

    software: a driver in the host kernel (which may discover and handle

    the PF for SR-IOV devices), and a driver/library in the guest (which

    may handle the assigned VF).<br>

    <br>

    The device assigned to the VM must match the driver/library packaged

    in the VM. For this, the request must explicitly state what category

    of devices it needs. For example, if the VM needs a GPU, it needs to

    say whether it needs an AMD GPU or an Nvidia GPU, since it may have

    the driver/libraries for that vendor alone. It may also need to

    state which version of CUDA is needed, if it is an Nvidia GPU. These

    aspects are necessarily vendor-specific.<br>

    <br>

    Further, one driver/library version may handle multiple devices.

    Since a new driver version may be backwards compatible, multiple

    driver versions may manage the same device. The development/release

    of the driver/library inside the VM should be independent of the

    kernel driver for that device.<br>

    <br>

    For FPGAs, there is an additional twist as the VM may need specific

    bitstream(s), and they match only specific device/region types. A

    bitstream built for one device type will not fit other device types

    from the same vendor, let alone devices from other vendors. IOW, the

    region type is specific not just to a vendor but to a device type

    within the vendor. So, it is essential to identify the device type.<br>

    <br>

    So, the proposed set of resource classes (RCs) and traits is as

    below. As we learn more about actual usage by operators, we may need

    to evolve this set.<br>

    <ul>

      <li>There is a resource class per device category e.g.

        CUSTOM_ACCELERATOR_GPU, CUSTOM_ACCELERATOR_FPGA.</li>

      <li>The resource provider that represents a device has the

        following traits:</li>

      <ul>

        <li>Vendor/Category trait, e.g. CUSTOM_GPU_AMD,

          CUSTOM_FPGA_XILINX.</li>

        <li>Device type trait, which is a refinement of the

          vendor/category trait, e.g. CUSTOM_FPGA_XILINX_VU9P.</li>

      </ul>

    </ul>

    <blockquote>

      <blockquote>NOTE: The device type is not a product or model, at

        least for FPGAs. Multiple products may use the same FPGA chip.<br>

        NOTE: The reason for having both the vendor/category trait and

        the device type trait is that a flavor may ask for either,

        depending on the granularity desired. IOW, if one driver can

        handle all devices from a vendor (*eye roll*), the flavor can

        ask for the vendor/category trait alone. If there are separate

        drivers for different device families from the same vendor, the

        flavor must specify the trait for the device family.<br>

        NOTE: The equivalent trait for GPUs may be like

        CUSTOM_GPU_NVIDIA_P90, but I'll let others decide if that is a

        product or not.<br>

      </blockquote>

    </blockquote>
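    To make the granularity point in the notes above concrete, here is a
    rough sketch in Python. It is not Nova, Placement or Cyborg code: the
    helper names (required_traits, provider_satisfies) and the sample
    flavors are made up for illustration, and it assumes the flavor
    expresses its needs with "resources:" and "trait:" extra specs. A
    provider is treated as a match when it carries every required trait.<br>
    <pre>
```python
# Hypothetical illustration only; this is not Nova, Placement or Cyborg code.

def required_traits(extra_specs):
    """Collect trait names from "trait:NAME" keys whose value is "required"."""
    prefix = "trait:"
    return {
        key[len(prefix):]
        for key, value in extra_specs.items()
        if key.startswith(prefix) and value == "required"
    }


def provider_satisfies(provider_traits, extra_specs):
    """Return True if the provider carries every trait the flavor requires."""
    return required_traits(extra_specs).issubset(provider_traits)


# One FPGA resource provider carries both the vendor/category trait and
# the finer-grained device type trait.
fpga_rp_traits = {"CUSTOM_FPGA_XILINX", "CUSTOM_FPGA_XILINX_VU9P"}

# Flavor whose guest driver handles every device from the vendor.
vendor_flavor = {
    "resources:CUSTOM_ACCELERATOR_FPGA": "1",
    "trait:CUSTOM_FPGA_XILINX": "required",
}

# Flavor whose guest driver targets a single device type.
device_flavor = {
    "resources:CUSTOM_ACCELERATOR_FPGA": "1",
    "trait:CUSTOM_FPGA_XILINX_VU9P": "required",
}

# Flavor built for a different vendor; it must not land on this provider.
other_flavor = {
    "resources:CUSTOM_ACCELERATOR_FPGA": "1",
    "trait:CUSTOM_FPGA_INTEL": "required",
}

for flavor in (vendor_flavor, device_flavor, other_flavor):
    print(provider_satisfies(fpga_rp_traits, flavor))
```
    </pre>
    The first two flavors match the provider and the third does not,
    which is why the provider needs to carry both traits.<br>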

    <ul>

      <ul>

        <li>For FPGAs, we have additional traits:</li>

        <ul>

          <li>Functionality trait: e.g. CUSTOM_FPGA_COMPUTE,

            CUSTOM_FPGA_NETWORK, CUSTOM_FPGA_STORAGE</li>

          <li>Region type ID, e.g.

            CUSTOM_FPGA_INTEL_REGION_&lt;uuid&gt;.</li>

          <li>Optionally, a function ID, indicating what function is

            currently programmed in the region RP, e.g.

            CUSTOM_FPGA_INTEL_FUNCTION_&lt;uuid&gt;. Some

            implementations may not provide it. The function trait may

            change on reprogramming, but reprogramming is not expected

            to be frequent.</li>

          <li>Possibly, CUSTOM_PROGRAMMABLE as a separate trait.<br>

          </li>

        </ul>

      </ul>

    </ul>
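    As a possible way to derive the region and function trait names
    listed above, here is a small Python sketch. It assumes custom trait
    names may contain only upper-case letters, digits and underscores
    after the CUSTOM_ prefix, with a 255-character limit, so the UUID is
    upper-cased and its hyphens are replaced. The helper fpga_trait, the
    normalization choice and the sample UUID are illustrative
    assumptions, not something the spec defines.<br>
    <pre>
```python
import re
import uuid

# Assumption: a custom trait name is "CUSTOM_" followed by upper-case
# letters, digits and underscores, and is at most 255 characters long.
TRAIT_RE = re.compile(r"^CUSTOM_[A-Z0-9_]+$")
MAX_TRAIT_LEN = 255


def fpga_trait(vendor, kind, type_id):
    """Build a hypothetical CUSTOM_FPGA_(vendor)_(kind)_(uuid) trait name."""
    # uuid.UUID() rejects malformed IDs; hyphens become underscores.
    normalized = str(uuid.UUID(str(type_id))).replace("-", "_").upper()
    name = "CUSTOM_FPGA_{}_{}_{}".format(vendor.upper(), kind.upper(), normalized)
    if len(name) > MAX_TRAIT_LEN or not TRAIT_RE.match(name):
        raise ValueError("not a valid custom trait name: " + name)
    return name


region_id = "9926ab6d-6c92-5a68-aabc-a7f84c2bb5a2"  # made-up example UUID
print(fpga_trait("intel", "region", region_id))
print(fpga_trait("intel", "function", region_id))
```
    </pre>
    If the function trait changes on reprogramming, the same helper could
    be used to compute the new name before the old trait is swapped out.<br>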

    [1] <a class="moz-txt-link-freetext" href="https://review.openstack.org/#/c/554717/">https://review.openstack.org/#/c/554717/</a><br>

    <br>

    Thanks.<br>

    <br>

    Regards,<br>

    Sundar<br>

  </body>

</html>