Re: [nova] Persistent memory resource tracking model

3 Jan 2019

      On 01/02/2019 11:08 PM, Alex Xu wrote:
...
Jay Pipes <jaypipes@gmail.com> wrote on Wed, Jan 2, 2019 at 10:48 PM:
On 12/21/2018 03:45 AM, Rui Zang wrote:
> It was advised in today's nova team meeting to bring this up by email.
>
> There has been some discussion on how to track persistent memory
> resources in placement on the spec review [1].
>
> Background: persistent memory (PMEM) needs to be partitioned into
> namespaces to be consumed by VMs. Due to fragmentation issues, the spec
> proposed to use fixed-size PMEM namespaces.
The spec proposed to use fixed-size namespaces whose size is controllable by
the deployer, not fixed-size-for-everyone :) Just want to make sure
we're being clear here.
> The spec's proposed way to represent PMEM namespaces is to use one
> Resource Provider (RP) for one PMEM namespace. A new standard Resource
> Class (RC), 'VPMEM_GB', is introduced to classify PMEM namespace RPs.
> For each PMEM namespace RP, the values for 'max_unit', 'min_unit',
> 'total' and 'step_size' are all set to the size of the PMEM namespace.
> In this way, it is guaranteed each RP will be consumed as a whole at one
> time.
>
> An alternative was brought up in the review. Different Custom Resource
> Classes (CUSTOM_PMEM_XXXGB) can be used to designate PMEM namespaces of
> different sizes. The size of the PMEM namespace is encoded in the name
> of the custom Resource Class. And multiple PMEM namespaces of the same
> size (say 128G) can be represented by one RP of the same
Not represented by "one RP of the same CUSTOM_PMEM_128G". There would be
only one resource provider: the compute node itself. It would have an
inventory of, say, 8 CUSTOM_PMEM_128G resources.
> CUSTOM_PMEM_128G. In this way, the RP could have 'max_unit' and 'total'
> set to the total number of PMEM namespaces of that size. And the
> values of 'min_unit' and 'step_size' could be set to 1.
No, the max_unit, min_unit, step_size and total would refer to the
number of *PMEM namespaces*, not the amount of GB of memory represented
by those namespaces.
Therefore, min_unit and step_size would be 1, max_unit would be the
total number of *namespaces* that could simultaneously be attached to a
single consumer (VM), and total would be 8 in our example where the
compute node had 8 of these pre-defined 128G PMEM namespaces.
> We believe both ways could work. We would like to have a community
> consensus on which one to use.
> Email replies and review comments on the spec [1] are both welcome.
Custom resource classes were invented for precisely this kind of use
case. The resource being represented is a namespace. The resource is not
"a Gibibyte of persistent memory".
The point of the initial design was to avoid encoding the `size` in the
resource class name. If that is OK with you (I remember people hate
encoding size and number into the trait name), then we will update the
design. Probably, based on the namespace configuration, nova will be
responsible for creating those custom RCs first. Sounds like it works.
A couple points...

1) I was/am opposed to putting the least-fine-grained size in a resource 
class name. For example, I would have preferred DISK_BYTE instead of 
DISK_GB. And MEMORY_BYTE instead of MEMORY_MB.

2) After reading the original Intel PMEM specification 
(http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf), it seems to me 
that what you are describing with a generic PMEM_GB (or PMEM_BYTE) 
resource class is more appropriate for the block mode translation system 
described in the PDF versus the PMEM namespace system described therein.

From a lay person's perspective, I see the difference between the two
as similar to the difference between describing the bytes that are in
block storage versus a filesystem that has been formatted, wiped,
cleaned, etc. on that block storage.

In Nova, the DISK_GB resource class describes the former: it's a bunch 
of blocks that are reserved in the underlying block storage for use by 
the virtual machine. The virtual machine manager then formats that bunch 
of blocks as needed and lays down a formatted image.

We don't have a resource class that represents "a filesystem" or "a 
partition" (yet). But the proposed PMEM namespaces in your spec 
definitely seem to be more like a "filesystem resource" than a "GB of 
block storage" resource.
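
To make the two inventory shapes discussed earlier in this thread concrete,
here is a rough Python sketch. The function names and the max_per_instance
parameter are made up for illustration and are not nova or placement code;
only the inventory field names (total, min_unit, max_unit, step_size) and
the resource class names come from the discussion above. Capping max_unit
at the total is an assumption on my part.

```python
from collections import Counter


def per_namespace_inventories(namespace_sizes_gb):
    """Spec as proposed: one resource provider per PMEM namespace.

    Every field equals the namespace size in GB, so each provider can
    only be consumed as a whole.
    """
    return [
        {"VPMEM_GB": {"total": size, "min_unit": size,
                      "max_unit": size, "step_size": size}}
        for size in namespace_sizes_gb
    ]


def compute_node_inventory(namespace_sizes_gb, max_per_instance):
    """Alternative: one provider (the compute node), one custom resource
    class per namespace size. The fields count namespaces, not GB.

    max_per_instance is a hypothetical deployer-chosen limit on how many
    namespaces of one size a single VM may attach.
    """
    counts = Counter(namespace_sizes_gb)
    return {
        f"CUSTOM_PMEM_{size}G": {
            "total": count,
            "min_unit": 1,
            "max_unit": min(count, max_per_instance),  # assumed cap
            "step_size": 1,
        }
        for size, count in counts.items()
    }


# The example from the thread: a compute node with 8 namespaces of 128G.
sizes = [128] * 8
model_a = per_namespace_inventories(sizes)
model_b = compute_node_inventory(sizes, max_per_instance=4)
```

With the first shape you end up with 8 providers to track; with the second
there is a single inventory record on the compute node whose total is 8.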

Best,
-jay