[nova] Persistent memory resource tracking model

Sean Mooney smooney at redhat.com
Mon Jan 7 10:05:54 UTC 2019

On Mon, 2019-01-07 at 16:23 +0800, rui zang wrote:
> Hey Jay,
> I replied to your comments to the spec however missed this email.
> Please see my replies in line.
> Thanks,
> Zang, Rui
> 03.01.2019, 21:31, "Jay Pipes" <jaypipes at gmail.com>:
> > On 01/02/2019 11:08 PM, Alex Xu wrote:
> > >  Jay Pipes <jaypipes at gmail.com <mailto:jaypipes at gmail.com>> wrote on
> > >  Wed, Jan 2, 2019 at 10:48 PM:
> > > 
> > >      On 12/21/2018 03:45 AM, Rui Zang wrote:
> > >       > It was advised in today's nova team meeting to bring this up by
> > >      email.
> > >       >
> > >       > There has been some discussion on how to track persistent memory
> > >       > resource in placement on the spec review [1].
> > >       >
> > >       > Background: persistent memory (PMEM) needs to be partitioned to
> > >       > namespaces to be consumed by VMs. Due to fragmentation issues,
> > >      the spec
> > >       > proposed to use fixed sized PMEM namespaces.
> > > 
> > >      The spec proposed to use fixed sized namespaces that is controllable by
> > >      the deployer, not fixed-size-for-everyone :) Just want to make sure
> > >      we're being clear here.
> > > 
> > >       > The spec proposed way to represent PMEM namespaces is to use one
> > >       > Resource Provider (RP) for one PMEM namespace. A new standard
> > >      Resource
> > >       > Class (RC) -- 'VPMEM_GB' is introduced to classify PMEM namespace
> > >      RPs.
> > >       > For each PMEM namespace RP, the values for 'max_unit', 'min_unit',
> > >       > 'total' and 'step_size' are all set to the size of the PMEM
> > >      namespace.
> > >       > In this way, it is guaranteed each RP will be consumed as a whole
> > >      at one
> > >       > time.
> > >        >
> > >       > An alternative was brought out in the review. Different Custom
> > >      Resource
> > >       > Classes ( CUSTOM_PMEM_XXXGB) can be used to designate PMEM
> > >      namespaces of
> > >       > different sizes. The size of the PMEM namespace is encoded in the
> > >      name
> > >       > of the custom Resource Class. And multiple PMEM namespaces of the
> > >      same
> > >       > size  (say 128G) can be represented by one RP of the same
> > > 
> > >      Not represented by "one RP of the same CUSTOM_PMEM_128G". There
> > >      would be
> > >      only one resource provider: the compute node itself. It would have an
> > >      inventory of, say, 8 CUSTOM_PMEM_128G resources.
> > > 
> > >       > CUSTOM_PMEM_128G. In this way, the RP could have 'max_unit'  and
> > >      'total'
> > >       > as the total number of the PMEM namespaces of the certain size.
> > >      And the
> > >       > values of 'min_unit' and 'step_size' could be set to 1.
> > > 
> > >      No, the max_unit, min_unit, step_size and total would refer to the
> > >      number of *PMEM namespaces*, not the amount of GB of memory represented
> > >      by those namespaces.
> > > 
> > >      Therefore, min_unit and step_size would be 1, max_unit would be the
> > >      total number of *namespaces* that could simultaneously be attached to a
> > >      single consumer (VM), and total would be 8 in our example where the
> > >      compute node had 8 of these pre-defined 128G PMEM namespaces.
> > > 
> > >       > We believe both ways could work. We would like to have a community
> > >       > consensus on which way to use.
> > >       > Email replies and review comments to the spec [1] are both welcome.
> > > 
> > >      Custom resource classes were invented for precisely this kind of use
> > >      case. The resource being represented is a namespace. The resource is
> > >      not
> > >      "a Gibibyte of persistent memory".
> > > 
> > > 
> > >  The point of the initial design is to avoid encoding the `size` in the
> > >  resource class name. If that is OK with you (I remember people hate to
> > >  encode size and number into the trait name), then we will update the
> > >  design. Probably, based on the namespace configuration, nova will be
> > >  responsible for creating those custom RCs first. Sounds workable.
> > 
> > A couple points...
> > 
> > 1) I was/am opposed to putting the least-fine-grained size in a resource
> > class name. For example, I would have preferred DISK_BYTE instead of
> > DISK_GB. And MEMORY_BYTE instead of MEMORY_MB.
> I agree the more precise the better as far as resource tracking is concerned.
> However, as for persistent memory, it usually comes out in large capacity --
> terabytes are normal. And the targeting applications are also expected to use
> persistent memory in that quantity. GB is a reasonable unit not to make
> the number too nasty.

so i'm honestly not that concerned with large numbers.
if we want to improve the user experience we can do what we do with hugepage memory:
we support passing a suffix, so we can say 2M or 1G.
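
as a rough illustration of the suffix idea, here is a minimal sketch of a parser that turns a size string into MiB. the function name `parse_size_mib` and the accepted suffixes are assumptions made for this example; this is not nova's actual hugepage parsing code.

```python
import re

# Hypothetical helper (not nova's parser): MiB value of each binary suffix.
_UNITS_MIB = {"M": 1, "G": 1024, "T": 1024 * 1024}


def parse_size_mib(value: str) -> int:
    """Convert strings such as '2M', '1G' or '512GB' into a MiB count."""
    match = re.fullmatch(
        r"(\d+)\s*([MGT])(?:i?B)?", value.strip(), re.IGNORECASE
    )
    if match is None:
        raise ValueError(f"unrecognised size: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS_MIB[unit.upper()]


print(parse_size_mib("2M"))     # 2
print(parse_size_mib("1G"))     # 1024
print(parse_size_mib("512GB"))  # 524288
```

with something like this the stored quantity can stay a plain integer while the user-facing value keeps a readable suffix.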

if you are concerned with capacity, it's a relatively simple exercise to show that if
we use a 64-bit int, or even 48 bits, we have plenty of headroom over where the technology is.

NVDIMMs are specced for a max capacity of 512GB per module.
if i recall correctly you can also only have 12 NVDIMMs with 4 RAM DIMMs per socket
acting as a cache, so that effectively limits you to 6TB per socket or 12TB per 1/2U
with standard density servers. modern x86 processors, i believe, still use a 48-bit
physical address space with the last 16 bits reserved for future use, meaning a host
can address a maximum of 2^48 bytes, or 256 TiB, of memory on such a system.
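
the arithmetic above can be checked with a few lines of python. this is only a sketch: the two-socket host is an assumption standing in for the "standard density" server, and the module sizes are treated as binary units.

```python
GIB = 2**30
TIB = 2**40

module_bytes = 512 * GIB          # max capacity per NVDIMM module
modules_per_socket = 12           # NVDIMMs per socket, as recalled above
sockets_per_host = 2              # assumption: dual-socket server

per_socket = module_bytes * modules_per_socket
per_host = per_socket * sockets_per_host
addressable = 2**48               # 48-bit physical address space
int64_max = 2**63 - 1             # largest signed 64-bit value

print(per_socket // TIB)          # 6
print(per_host // TIB)            # 12
print(addressable // TIB)         # 256
print(per_host.bit_length())      # 44

# A byte count for a fully populated host fits in 48 bits and in a
# signed 64-bit integer with a wide margin.
print(per_host < addressable < int64_max)  # True
```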

note persistent memory is system memory so it is base 2, not base 10, so when
we state 1GB we technically mean 1 GiB, or 2^30 bytes, not 10^9 bytes.

while it is unlikely we will ever need byte-level granularity in allocations
to guests, i'm not sure i buy the argument that this will only be used by applications
in large allocations in the 100GB or TB range.

i think i share jay's preference here for increasing the granularity and either tracking
the allocation in MiB or bytes. i do somewhat agree that bytes is likely too fine-grained,
hence my preference for mebibytes.
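
for reference, the two inventory shapes debated earlier in the thread can be written down as plain dicts. this is a hedged sketch: the field names are the ones quoted in the thread ('total', 'min_unit', 'max_unit', 'step_size'), the `fits` helper is hypothetical and only mirrors the constraints as described here, and max_unit=8 in the second record is an assumption.

```python
def fits(inventory: dict, amount: int) -> bool:
    """Hypothetical check of a requested amount against one inventory."""
    return (
        inventory["min_unit"] <= amount <= inventory["max_unit"]
        and amount % inventory["step_size"] == 0
        and amount <= inventory["total"]
    )


# Spec proposal: one resource provider per 128G namespace, counted in GB.
per_namespace_rp = {
    "resource_class": "VPMEM_GB",
    "total": 128, "min_unit": 128, "max_unit": 128, "step_size": 128,
}

# Alternative: the compute node holds a count of 128G namespaces.
per_size_class = {
    "resource_class": "CUSTOM_PMEM_128G",
    "total": 8, "min_unit": 1, "max_unit": 8, "step_size": 1,
}

print(fits(per_namespace_rp, 128))  # True: the namespace is taken whole
print(fits(per_namespace_rp, 64))   # False: partial use is rejected
print(fits(per_size_class, 3))      # True: three namespaces for one VM
```

in the first shape the unit of the amount matters (GB today, MiB or bytes if the granularity is increased); in the second the amount is always a number of namespaces.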

> > 2) After reading the original Intel PMEM specification
> > (http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf), it seems to me
> > that what you are describing with a generic PMEM_GB (or PMEM_BYTE)
> > resource class is more appropriate for the block mode translation system
> > described in the PDF versus the PMEM namespace system described therein.
> > 
> >  From a lay person's perspective, I see the difference between the two
> > as similar to the difference between describing the bytes that are in
> > block storage versus a filesystem that has been formatted, wiped,
> > cleaned, etc on that block storage.
> First let's talk about "block mode" vs. "persistent memory mode".
> They are not tiered up, they are counterparts. Each of them describes an access
> method to the underlying hardware. Quoting some sections from
> https://www.kernel.org/doc/Documentation/nvdimm/nvdimm.txt
> inside the dashed-line block.
> ------------------------------8<-------------------------------------------------------------------
> Why BLK?
> --------
> While PMEM provides direct byte-addressable CPU-load/store access to
> NVDIMM storage, it does not provide the best system RAS (recovery,
> availability, and serviceability) model.  An access to a corrupted
> system-physical-address address causes a CPU exception while an access
> to a corrupted address through an BLK-aperture causes that block window
> to raise an error status in a register.  The latter is more aligned with
> the standard error model that host-bus-adapter attached disks present.
> Also, if an administrator ever wants to replace a memory it is easier to
> service a system at DIMM module boundaries.  Compare this to PMEM where
> data could be interleaved in an opaque hardware specific manner across
> several DIMMs.
> BLK-apertures solve these RAS problems, but their presence is also the
> major contributing factor to the complexity of the ND subsystem.  They
> complicate the implementation because PMEM and BLK alias in DPA space.
> Any given DIMM's DPA-range may contribute to one or more
> system-physical-address sets of interleaved DIMMs, *and* may also be
> accessed in its entirety through its BLK-aperture.  Accessing a DPA
> through a system-physical-address while simultaneously accessing the
> same DPA through a BLK-aperture has undefined results.  For this reason,
> DIMMs with this dual interface configuration include a DSM function to
> store/retrieve a LABEL.  The LABEL effectively partitions the DPA-space
> into exclusive system-physical-address and BLK-aperture accessible
> regions.  For simplicity a DIMM is allowed a PMEM "region" per each
> interleave set in which it is a member.  The remaining DPA space can be
> carved into an arbitrary number of BLK devices with discontiguous
> extents.
> ------------------------------8<-------------------------------------------------------------------
> You can see that "block mode" does not provide "direct access", thus not the best  
> performance. That is the reason "persistent memory mode" is proposed in the spec.
the block mode will allow any existing application that is coded to work with a block device
to just use the NVDIMM storage as a faster form of solid state storage. direct mode requires
applications to be specifically coded to support it. from an openstack perspective we will
eventually want to support exposing the device both as a block device (e.g. via virtio-blk
or virtio-scsi devices if/when qemu supports that) and as a direct mode pmem device to the guest.
i understand why persistent memory mode is more appealing from a vendor perspective to lead with,
but practically speaking there are very few applications that actually support pmem to date,
and supporting app direct mode only seems like it would hurt adoption of this feature
more generally rather than encourage it.
> However, people can still create a block device out of a "persistent memory mode"
> namespace. And further more, create a file system on top of that block device.
> Applications can map files from that file system into their memory namespaces,
> and if the file system is DAX (direct-access) capable, the application's access to
> the hardware is still direct-access, which means direct byte-addressable
> CPU-load/store access to NVDIMM storage.
> This is perfect so far, as one can think of why not just track the DAX file system
> and let the VM instances map the files of the file system?
> However, this usage model is reported to have severe issues with hardware
> passed through. So the recommended model is still mapping namespaces
> of "persistent memory mode" into applications' address space.

intel's nvdimm technology works in 3 modes: app direct, block and system memory.

the direct and block modes were discussed at some length in the spec and this thread.
does libvirt support using an nvdimm's pmem namespaces in devdax mode to back a guest's memory
instead of system ram?

based on https://docs.pmem.io/getting-started-guide/creating-development-environments/virtualization/qemu
qemu does support such a configuration, and honestly having the capability to alter the guest memory
backing to run my vms with 100s of GB of ram would be as compelling as app direct mode, as
it would allow all my legacy applications to work without modification and would deliver effectively the same
performance. perhaps we should also consider a hw:mem_page_backing extra spec to complement the hw:mem_page_size
we already have for hugepages today. this would probably be a separate spec, but i would hope we don't make decisions
today that would block other usage models in the future.

> > In Nova, the DISK_GB resource class describes the former: it's a bunch
> > of blocks that are reserved in the underlying block storage for use by
> > the virtual machine. The virtual machine manager then formats that bunch
> > of blocks as needed and lays down a formatted image.
> > 
> > We don't have a resource class that represents "a filesystem" or "a
> > partition" (yet). But the proposed PMEM namespaces in your spec
> > definitely seem to be more like a "filesystem resource" than a "GB of
> > block storage" resource.
> > 
> > Best,
> > -jay
