<div xmlns="http://www.w3.org/1999/xhtml">Hey Jay,</div><div xmlns="http://www.w3.org/1999/xhtml"> </div><div xmlns="http://www.w3.org/1999/xhtml">I replied to your comments on the spec but missed this email.</div><div xmlns="http://www.w3.org/1999/xhtml">Please see my replies inline.</div><div xmlns="http://www.w3.org/1999/xhtml"> </div><div xmlns="http://www.w3.org/1999/xhtml">Thanks,<br />Zang, Rui</div><div xmlns="http://www.w3.org/1999/xhtml"> </div><div xmlns="http://www.w3.org/1999/xhtml"> </div><div xmlns="http://www.w3.org/1999/xhtml"> </div><div xmlns="http://www.w3.org/1999/xhtml">03.01.2019, 21:31, "Jay Pipes" &lt;jaypipes@gmail.com&gt;:</div><blockquote xmlns="http://www.w3.org/1999/xhtml" type="cite"><p>On 01/02/2019 11:08 PM, Alex Xu wrote:</p><blockquote> Jay Pipes &lt;<a rel="noopener noreferrer" href="mailto:jaypipes@gmail.com">jaypipes@gmail.com</a>&gt; wrote on Wed, Jan 2, 2019<br /> at 10:48 PM:<br /><br />     On 12/21/2018 03:45 AM, Rui Zang wrote:<br />      > It was advised in today's nova team meeting to bring this up by<br />     email.<br />      ><br />      > There has been some discussion on how to track persistent memory<br />      > resources in placement on the spec review [1].<br />      ><br />      > Background: persistent memory (PMEM) needs to be partitioned into<br />      > namespaces to be consumed by VMs. Due to fragmentation issues,<br />     the spec<br />      > proposed to use fixed-size PMEM namespaces.<br /><br />     The spec proposed to use fixed-size namespaces that are controllable by<br />     the deployer, not fixed-size-for-everyone :) Just want to make sure<br />     we're being clear here.<br /><br />      > The spec's proposed way to represent PMEM namespaces is to use one<br />      > Resource Provider (RP) for one PMEM namespace. 
A new standard<br />     Resource<br />      > Class (RC) -- 'VPMEM_GB' is introduced to classify PMEM namespace<br />     RPs.<br />      > For each PMEM namespace RP, the values for 'max_unit', 'min_unit',<br />      > 'total' and 'step_size' are all set to the size of the PMEM<br />     namespace.<br />      > In this way, it is guaranteed each RP will be consumed as a whole<br />     at one<br />      > time.<br />       ><br />      > An alternative was brought up in the review. Different Custom<br />     Resource<br />      > Classes (CUSTOM_PMEM_XXXGB) can be used to designate PMEM<br />     namespaces of<br />      > different sizes. The size of the PMEM namespace is encoded in the<br />     name<br />      > of the custom Resource Class. And multiple PMEM namespaces of the<br />     same<br />      > size  (say 128G) can be represented by one RP of the same<br /><br />     Not represented by "one RP of the same CUSTOM_PMEM_128G". There<br />     would be<br />     only one resource provider: the compute node itself. It would have an<br />     inventory of, say, 8 CUSTOM_PMEM_128G resources.<br /><br />      > CUSTOM_PMEM_128G. In this way, the RP could have 'max_unit'  and<br />     'total'<br />      > as the total number of the PMEM namespaces of a certain size.<br />     And the<br />      > values of 'min_unit' and 'step_size' could be set to 1.<br /><br />     No, the max_unit, min_unit, step_size and total would refer to the<br />     number of *PMEM namespaces*, not the amount of GB of memory represented<br />     by those namespaces.<br /><br />     Therefore, min_unit and step_size would be 1, max_unit would be the<br />     total number of *namespaces* that could simultaneously be attached to a<br />     single consumer (VM), and total would be 8 in our example where the<br />     compute node had 8 of these pre-defined 128G PMEM namespaces.<br /><br />      > We believe both ways could work. 
We would like to have a community<br />      > consensus on which way to use.<br />      > Email replies and review comments to the spec [1] are both welcomed.<br /><br />     Custom resource classes were invented for precisely this kind of use<br />     case. The resource being represented is a namespace. The resource is<br />     not<br />     "a Gibibyte of persistent memory".<br /><br /><br /> The point of the initial design is to avoid encoding the `size` in the<br /> resource class name. If that is ok for you (I remember people hate<br /> encoding size and number into the trait name), then we will update the<br /> design. Probably based on the namespace configuration, nova will be<br /> responsible for creating those custom RCs first. Sounds like it works.</blockquote><p><br />A couple points...<br /><br />1) I was/am opposed to putting the least-fine-grained size in a resource<br />class name. For example, I would have preferred DISK_BYTE instead of<br />DISK_GB. And MEMORY_BYTE instead of MEMORY_MB.</p></blockquote><div xmlns="http://www.w3.org/1999/xhtml">I agree that the more precise the better as far as resource tracking is concerned.</div><div xmlns="http://www.w3.org/1999/xhtml">However, persistent memory usually comes in large capacities;</div><div xmlns="http://www.w3.org/1999/xhtml">terabytes are normal. The target applications are also expected to use</div><div xmlns="http://www.w3.org/1999/xhtml">persistent memory in that quantity. 
GB is a reasonable unit that keeps</div><div xmlns="http://www.w3.org/1999/xhtml">the numbers manageable.</div><div xmlns="http://www.w3.org/1999/xhtml"> </div><blockquote xmlns="http://www.w3.org/1999/xhtml" type="cite"><p><br />2) After reading the original Intel PMEM specification<br />(<a rel="noopener noreferrer" href="http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf">http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf</a>), it seems to me<br />that what you are describing with a generic PMEM_GB (or PMEM_BYTE)<br />resource class is more appropriate for the block mode translation system<br />described in the PDF versus the PMEM namespace system described therein.<br /><br /> From a lay person's perspective, I see the difference between the two<br />as similar to the difference between describing the bytes that are in<br />block storage versus a filesystem that has been formatted, wiped,<br />cleaned, etc on that block storage.</p></blockquote><div xmlns="http://www.w3.org/1999/xhtml">First, let's talk about "block mode" vs. "persistent memory mode".</div><div xmlns="http://www.w3.org/1999/xhtml">They are not layered on top of each other; they are counterparts. Each of them describes an access</div><div xmlns="http://www.w3.org/1999/xhtml">method to the underlying hardware. 
I quote some sections from</div><div xmlns="http://www.w3.org/1999/xhtml"><a href="https://www.kernel.org/doc/Documentation/nvdimm/nvdimm.txt">https://www.kernel.org/doc/Documentation/nvdimm/nvdimm.txt</a></div><div xmlns="http://www.w3.org/1999/xhtml">between the dashed lines below.</div><div xmlns="http://www.w3.org/1999/xhtml"> </div><div xmlns="http://www.w3.org/1999/xhtml">------------------------------8<-------------------------------------------------------------------</div><div xmlns="http://www.w3.org/1999/xhtml"><pre style="color:rgb(0,0,0);font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;text-align:start;text-transform:none;overflow-wrap:break-word;white-space:pre-wrap;">Why BLK?
--------

While PMEM provides direct byte-addressable CPU-load/store access to
NVDIMM storage, it does not provide the best system RAS (recovery,
availability, and serviceability) model.  An access to a corrupted
system-physical-address address causes a CPU exception while an access
to a corrupted address through an BLK-aperture causes that block window
to raise an error status in a register.  The latter is more aligned with
the standard error model that host-bus-adapter attached disks present.
Also, if an administrator ever wants to replace a memory it is easier to
service a system at DIMM module boundaries.  Compare this to PMEM where
data could be interleaved in an opaque hardware specific manner across
several DIMMs.

PMEM vs BLK

BLK-apertures solve these RAS problems, but their presence is also the
major contributing factor to the complexity of the ND subsystem.  They
complicate the implementation because PMEM and BLK alias in DPA space.
Any given DIMM's DPA-range may contribute to one or more
system-physical-address sets of interleaved DIMMs, *and* may also be
accessed in its entirety through its BLK-aperture.  Accessing a DPA
through a system-physical-address while simultaneously accessing the
same DPA through a BLK-aperture has undefined results.  For this reason,
DIMMs with this dual interface configuration include a DSM function to
store/retrieve a LABEL.  The LABEL effectively partitions the DPA-space
into exclusive system-physical-address and BLK-aperture accessible
regions.  For simplicity a DIMM is allowed a PMEM "region" per each
interleave set in which it is a member.  The remaining DPA space can be
carved into an arbitrary number of BLK devices with discontiguous
extents.
</pre></div><div xmlns="http://www.w3.org/1999/xhtml"><div>------------------------------8<-------------------------------------------------------------------</div></div><div xmlns="http://www.w3.org/1999/xhtml"> </div><div xmlns="http://www.w3.org/1999/xhtml">You can see that "block mode" does not provide "direct access", and thus not the best</div><div xmlns="http://www.w3.org/1999/xhtml">performance. That is the reason "persistent memory mode" is proposed in the spec.</div><div xmlns="http://www.w3.org/1999/xhtml"> </div><div xmlns="http://www.w3.org/1999/xhtml">However, people can still create a block device out of a "persistent memory mode"</div><div xmlns="http://www.w3.org/1999/xhtml">namespace. Furthermore, they can create a file system on top of that block device.</div><div xmlns="http://www.w3.org/1999/xhtml">Applications can map files from that file system into their address spaces,</div><div xmlns="http://www.w3.org/1999/xhtml">and if the file system is DAX (direct-access) capable, the application's access to</div><div xmlns="http://www.w3.org/1999/xhtml">the hardware is still direct, which means direct byte-addressable</div><div xmlns="http://www.w3.org/1999/xhtml">CPU-load/store access to NVDIMM storage.</div><div xmlns="http://www.w3.org/1999/xhtml">So far this sounds ideal, and one might ask: why not just track the DAX file system</div><div xmlns="http://www.w3.org/1999/xhtml">and let the VM instances map files from that file system?</div><div xmlns="http://www.w3.org/1999/xhtml">However, this usage model is reported to have severe issues with passed-through</div><div xmlns="http://www.w3.org/1999/xhtml">hardware. 
So the recommended model is still to map "persistent memory mode"</div><div xmlns="http://www.w3.org/1999/xhtml">namespaces into the applications' address space.</div><div xmlns="http://www.w3.org/1999/xhtml"> </div><div xmlns="http://www.w3.org/1999/xhtml"> </div><div xmlns="http://www.w3.org/1999/xhtml"> </div><div xmlns="http://www.w3.org/1999/xhtml"> </div><blockquote xmlns="http://www.w3.org/1999/xhtml" type="cite"><p>In Nova, the DISK_GB resource class describes the former: it's a bunch<br />of blocks that are reserved in the underlying block storage for use by<br />the virtual machine. The virtual machine manager then formats that bunch<br />of blocks as needed and lays down a formatted image.<br /><br />We don't have a resource class that represents "a filesystem" or "a<br />partition" (yet). But the proposed PMEM namespaces in your spec<br />definitely seem to be more like a "filesystem resource" than a "GB of<br />block storage" resource.<br /><br />Best,<br />-jay</p></blockquote>
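<div xmlns="http://www.w3.org/1999/xhtml"> </div><div xmlns="http://www.w3.org/1999/xhtml">To make the two inventory models from this thread easier to compare, here is a rough Python sketch. It only builds plain dictionaries and does not call any placement API. The function names, the max_per_vm parameter and the request check are assumptions made for illustration; the resource class names and the four inventory fields come from the discussion above.</div><div xmlns="http://www.w3.org/1999/xhtml"><pre>
```python
# Illustrative sketch only: plain dicts, no real placement client is used.
# Function names and the max_per_vm parameter are hypothetical.


def per_namespace_inventory(size_gb):
    """Spec's proposal: one resource provider per PMEM namespace.

    All four fields equal the namespace size, so the provider can only
    be consumed as a whole.
    """
    return {
        "VPMEM_GB": {
            "total": size_gb,
            "min_unit": size_gb,
            "max_unit": size_gb,
            "step_size": size_gb,
        }
    }


def custom_class_inventory(size_gb, count, max_per_vm):
    """Alternative: the compute node provider counts namespaces.

    The unit is "one namespace of size_gb", not a gigabyte.
    """
    return {
        f"CUSTOM_PMEM_{size_gb}G": {
            "total": count,
            "min_unit": 1,
            "max_unit": max_per_vm,  # assumed per-VM attachment limit
            "step_size": 1,
        }
    }


def is_valid_request(inventory, amount):
    """Check a requested amount against min_unit, max_unit and step_size."""
    in_bounds = amount in range(inventory["min_unit"], inventory["max_unit"] + 1)
    return in_bounds and amount % inventory["step_size"] == 0
```
</pre></div><div xmlns="http://www.w3.org/1999/xhtml">With the first model a request for 64 against a 128 GB namespace provider is rejected and only a request for exactly 128 passes. With the second model a VM asks for a number of namespaces, for example 2 of CUSTOM_PMEM_128G.</div>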