[nova] Persistent memory resource tracking model
On 12/21/2018 03:45 AM, Rui Zang wrote:
It was advised in today's nova team meeting to bring this up by email.
There has been some discussion on the spec review [1] about how to track the persistent memory resource in placement.
Background: persistent memory (PMEM) needs to be partitioned into namespaces to be consumed by VMs. Due to fragmentation issues, the spec proposed to use fixed-size PMEM namespaces.
The spec proposed fixed-size namespaces whose size is controllable by the deployer, not fixed-size-for-everyone :) Just want to make sure we're being clear here.
The spec's proposed way to represent PMEM namespaces is to use one Resource Provider (RP) per PMEM namespace. A new standard Resource Class (RC) -- 'VPMEM_GB' -- is introduced to classify PMEM namespace RPs. For each PMEM namespace RP, the values for 'max_unit', 'min_unit', 'total' and 'step_size' are all set to the size of the PMEM namespace. In this way, it is guaranteed that each RP will be consumed as a whole at one time.
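To make this concrete, the inventory reported for a single 128G namespace RP under this proposal would look something like the following sketch (the field names follow the placement inventory API; the provider layout and numbers are just an example):

    # Sketch: one resource provider per PMEM namespace, standard RC 'VPMEM_GB'.
    # Every unit field equals the namespace size, so a namespace can only ever
    # be consumed whole.
    namespace_rp_inventory = {
        "VPMEM_GB": {
            "total": 128,
            "min_unit": 128,
            "max_unit": 128,
            "step_size": 128,
            "reserved": 0,
            "allocation_ratio": 1.0,
        }
    }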
An alternative was brought up in the review. Different custom Resource Classes (CUSTOM_PMEM_XXXGB) can be used to designate PMEM namespaces of different sizes. The size of the PMEM namespace is encoded in the name of the custom Resource Class. And multiple PMEM namespaces of the same size (say 128G) can be represented by one RP of the same
Not represented by "one RP of the same CUSTOM_PMEM_128G". There would be only one resource provider: the compute node itself. It would have an inventory of, say, 8 CUSTOM_PMEM_128G resources.
CUSTOM_PMEM_128G. In this way, the RP could have 'max_unit' and 'total' as the total number of the PMEM namespaces of that size. And the values of 'min_unit' and 'step_size' could be set to 1.
No, the max_unit, min_unit, step_size and total would refer to the number of *PMEM namespaces*, not the amount of GB of memory represented by those namespaces. Therefore, min_unit and step_size would be 1, max_unit would be the total number of *namespaces* that could simultaneously be attached to a single consumer (VM), and total would be 8 in our example where the compute node had 8 of these pre-defined 128G PMEM namespaces.
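To spell that out with a sketch (using the 8 x 128G example; field names per the placement inventory API, everything else illustrative):

    # Sketch: a single inventory on the compute node provider per namespace
    # size. The units count namespaces, not gigabytes.
    compute_node_inventory = {
        "CUSTOM_PMEM_128G": {
            "total": 8,         # eight pre-created 128G namespaces on this host
            "min_unit": 1,
            "max_unit": 8,      # max namespaces attachable to one consumer (VM)
            "step_size": 1,
            "reserved": 0,
            "allocation_ratio": 1.0,
        }
    }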
We believe both ways could work. We would like to reach community consensus on which one to use. Email replies and review comments on the spec [1] are both welcome.
Custom resource classes were invented for precisely this kind of use case. The resource being represented is a namespace. The resource is not "a Gibibyte of persistent memory". Best, -jay
Regards, Zang, Rui
Jay Pipes <jaypipes@gmail.com> wrote on Wednesday, January 2, 2019, at 10:48 PM:
On 12/21/2018 03:45 AM, Rui Zang wrote:
It was advised in today's nova team meeting to bring this up by email.
There has been some discussion on the spec review [1] about how to track the persistent memory resource in placement.
Background: persistent memory (PMEM) needs to be partitioned into namespaces to be consumed by VMs. Due to fragmentation issues, the spec proposed to use fixed-size PMEM namespaces.
The spec proposed fixed-size namespaces whose size is controllable by the deployer, not fixed-size-for-everyone :) Just want to make sure we're being clear here.
The spec's proposed way to represent PMEM namespaces is to use one Resource Provider (RP) per PMEM namespace. A new standard Resource Class (RC) -- 'VPMEM_GB' -- is introduced to classify PMEM namespace RPs. For each PMEM namespace RP, the values for 'max_unit', 'min_unit', 'total' and 'step_size' are all set to the size of the PMEM namespace. In this way, it is guaranteed that each RP will be consumed as a whole at one time.
An alternative was brought up in the review. Different custom Resource Classes (CUSTOM_PMEM_XXXGB) can be used to designate PMEM namespaces of different sizes. The size of the PMEM namespace is encoded in the name of the custom Resource Class. And multiple PMEM namespaces of the same size (say 128G) can be represented by one RP of the same
Not represented by "one RP of the same CUSTOM_PMEM_128G". There would be only one resource provider: the compute node itself. It would have an inventory of, say, 8 CUSTOM_PMEM_128G resources.
CUSTOM_PMEM_128G. In this way, the RP could have 'max_unit' and 'total' as the total number of the PMEM namespaces of that size. And the values of 'min_unit' and 'step_size' could be set to 1.
No, the max_unit, min_unit, step_size and total would refer to the number of *PMEM namespaces*, not the amount of GB of memory represented by those namespaces.
Therefore, min_unit and step_size would be 1, max_unit would be the total number of *namespaces* that could simultaneously be attached to a single consumer (VM), and total would be 8 in our example where the compute node had 8 of these pre-defined 128G PMEM namespaces.
We believe both ways could work. We would like to reach community consensus on which one to use. Email replies and review comments on the spec [1] are both welcome.
Custom resource classes were invented for precisely this kind of use case. The resource being represented is a namespace. The resource is not "a Gibibyte of persistent memory".
The point of the initial design was to avoid encoding the `size` in the resource class name. If that is OK with you (I remember people hate encoding sizes and numbers into trait names), then we will update the design. Probably, based on the namespace configuration, nova will be responsible for creating those custom RCs first. That sounds workable.
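For example, a rough sketch of that nova-side bootstrap could look like the following (nothing that exists in the code today; the endpoint and token handling are placeholders, and placement microversion 1.7+ makes the PUT idempotent):

    import requests

    # Sketch only: ensure the custom resource classes derived from the
    # configured namespace sizes exist in placement.
    PLACEMENT = "http://placement.example.com/placement"   # placeholder endpoint
    HEADERS = {
        "X-Auth-Token": "<token>",                          # placeholder token
        "OpenStack-API-Version": "placement 1.7",
    }

    def ensure_custom_rcs(namespace_sizes_gb):
        for size in sorted(set(namespace_sizes_gb)):
            name = "CUSTOM_PMEM_%dG" % size
            resp = requests.put("%s/resource_classes/%s" % (PLACEMENT, name),
                                headers=HEADERS)
            resp.raise_for_status()

    # ensure_custom_rcs([128, 128, 256]) would create CUSTOM_PMEM_128G and
    # CUSTOM_PMEM_256G if they do not already exist.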
Best, -jay
Regards, Zang, Rui
On 01/02/2019 11:08 PM, Alex Xu wrote:
Jay Pipes <jaypipes@gmail.com> wrote on Wednesday, January 2, 2019, at 10:48 PM:
On 12/21/2018 03:45 AM, Rui Zang wrote:
It was advised in today's nova team meeting to bring this up by email.
There has been some discussion on the spec review [1] about how to track the persistent memory resource in placement.
Background: persistent memory (PMEM) needs to be partitioned into namespaces to be consumed by VMs. Due to fragmentation issues, the spec proposed to use fixed-size PMEM namespaces.
The spec proposed fixed-size namespaces whose size is controllable by the deployer, not fixed-size-for-everyone :) Just want to make sure we're being clear here.
The spec's proposed way to represent PMEM namespaces is to use one Resource Provider (RP) per PMEM namespace. A new standard Resource Class (RC) -- 'VPMEM_GB' -- is introduced to classify PMEM namespace RPs. For each PMEM namespace RP, the values for 'max_unit', 'min_unit', 'total' and 'step_size' are all set to the size of the PMEM namespace. In this way, it is guaranteed that each RP will be consumed as a whole at one time.
An alternative was brought up in the review. Different custom Resource Classes (CUSTOM_PMEM_XXXGB) can be used to designate PMEM namespaces of different sizes. The size of the PMEM namespace is encoded in the name of the custom Resource Class. And multiple PMEM namespaces of the same size (say 128G) can be represented by one RP of the same
Not represented by "one RP of the same CUSTOM_PMEM_128G". There would be only one resource provider: the compute node itself. It would have an inventory of, say, 8 CUSTOM_PMEM_128G resources.
CUSTOM_PMEM_128G. In this way, the RP could have 'max_unit' and 'total' as the total number of the PMEM namespaces of that size. And the values of 'min_unit' and 'step_size' could be set to 1.
No, the max_unit, min_unit, step_size and total would refer to the number of *PMEM namespaces*, not the amount of GB of memory represented by those namespaces.
Therefore, min_unit and step_size would be 1, max_unit would be the total number of *namespaces* that could simultaneously be attached to a single consumer (VM), and total would be 8 in our example where the compute node had 8 of these pre-defined 128G PMEM namespaces.
We believe both ways could work. We would like to reach community consensus on which one to use. Email replies and review comments on the spec [1] are both welcome.
Custom resource classes were invented for precisely this kind of use case. The resource being represented is a namespace. The resource is not "a Gibibyte of persistent memory".
The point of the initial design was to avoid encoding the `size` in the resource class name. If that is OK with you (I remember people hate encoding sizes and numbers into trait names), then we will update the design. Probably, based on the namespace configuration, nova will be responsible for creating those custom RCs first. That sounds workable.
A couple points...

1) I was/am opposed to putting the least-fine-grained size in a resource class name. For example, I would have preferred DISK_BYTE instead of DISK_GB. And MEMORY_BYTE instead of MEMORY_MB.

2) After reading the original Intel PMEM specification (http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf), it seems to me that what you are describing with a generic PMEM_GB (or PMEM_BYTE) resource class is more appropriate for the block mode translation system described in the PDF versus the PMEM namespace system described therein.

From a lay person's perspective, I see the difference between the two as similar to the difference between describing the bytes that are in block storage versus a filesystem that has been formatted, wiped, cleaned, etc on that block storage.

In Nova, the DISK_GB resource class describes the former: it's a bunch of blocks that are reserved in the underlying block storage for use by the virtual machine. The virtual machine manager then formats that bunch of blocks as needed and lays down a formatted image.

We don't have a resource class that represents "a filesystem" or "a partition" (yet). But the proposed PMEM namespaces in your spec definitely seem to be more like a "filesystem resource" than a "GB of block storage" resource.

Best, -jay
On Mon, 2019-01-07 at 16:23 +0800, rui zang wrote:
Hey Jay,
I replied to your comments on the spec but missed this email. Please see my replies inline.
Thanks, Zang, Rui
03.01.2019, 21:31, "Jay Pipes" <jaypipes@gmail.com>:
On 01/02/2019 11:08 PM, Alex Xu wrote:
Jay Pipes <jaypipes@gmail.com> wrote on Wednesday, January 2, 2019, at 10:48 PM:
On 12/21/2018 03:45 AM, Rui Zang wrote:
It was advised in today's nova team meeting to bring this up by email.
There has been some discussion on the spec review [1] about how to track the persistent memory resource in placement.
Background: persistent memory (PMEM) needs to be partitioned into namespaces to be consumed by VMs. Due to fragmentation issues, the spec proposed to use fixed-size PMEM namespaces.
The spec proposed fixed-size namespaces whose size is controllable by the deployer, not fixed-size-for-everyone :) Just want to make sure we're being clear here.
The spec's proposed way to represent PMEM namespaces is to use one Resource Provider (RP) per PMEM namespace. A new standard Resource Class (RC) -- 'VPMEM_GB' -- is introduced to classify PMEM namespace RPs. For each PMEM namespace RP, the values for 'max_unit', 'min_unit', 'total' and 'step_size' are all set to the size of the PMEM namespace. In this way, it is guaranteed that each RP will be consumed as a whole at one time.
An alternative was brought up in the review. Different custom Resource Classes (CUSTOM_PMEM_XXXGB) can be used to designate PMEM namespaces of different sizes. The size of the PMEM namespace is encoded in the name of the custom Resource Class. And multiple PMEM namespaces of the same size (say 128G) can be represented by one RP of the same
Not represented by "one RP of the same CUSTOM_PMEM_128G". There would be only one resource provider: the compute node itself. It would have an inventory of, say, 8 CUSTOM_PMEM_128G resources.
CUSTOM_PMEM_128G. In this way, the RP could have 'max_unit' and 'total' as the total number of the PMEM namespaces of that size. And the values of 'min_unit' and 'step_size' could be set to 1.
No, the max_unit, min_unit, step_size and total would refer to the number of *PMEM namespaces*, not the amount of GB of memory represented by those namespaces.
Therefore, min_unit and step_size would be 1, max_unit would be the total number of *namespaces* that could simultaneously be attached to a single consumer (VM), and total would be 8 in our example where the compute node had 8 of these pre-defined 128G PMEM namespaces.
We believe both ways could work. We would like to reach community consensus on which one to use. Email replies and review comments on the spec [1] are both welcome.
Custom resource classes were invented for precisely this kind of use case. The resource being represented is a namespace. The resource is not "a Gibibyte of persistent memory".
The point of the initial design was to avoid encoding the `size` in the resource class name. If that is OK with you (I remember people hate encoding sizes and numbers into trait names), then we will update the design. Probably, based on the namespace configuration, nova will be responsible for creating those custom RCs first. That sounds workable.
A couple points...
1) I was/am opposed to putting the least-fine-grained size in a resource class name. For example, I would have preferred DISK_BYTE instead of DISK_GB. And MEMORY_BYTE instead of MEMORY_MB.
I agree that the more precise, the better, as far as resource tracking is concerned. However, persistent memory usually comes in large capacities -- terabytes are normal. And the targeted applications are also expected to use persistent memory in that quantity. GB is a reasonable unit that keeps the numbers from getting too unwieldy.
So I'm honestly not that concerned with large numbers. If we want to improve the user experience we can do what we do with hugepage memory: we support passing a suffix, so we can say 2M or 1G. If you are concerned with capacity, it's a relatively simple exercise to show that if we use a 64-bit int, or even 48 bits, we have plenty of headroom over where the technology is. NVDIMMs are specced for a max capacity of 512GB per module. If I recall correctly you can also only have 12 NVDIMMs with 4 RAM DIMMs per socket acting as a cache, so that effectively limits you to 6TB per socket, or 12TB per 1/2U with standard density servers. Modern x86 processors, I believe, still use a 48-bit physical address space with the last 16 bits reserved for future use, meaning a host can address a maximum of 2^48 or 256 TiB of memory on such a system. Note persistent memory is still memory, so it is base 2 not base 10; when we state 1GB we technically mean 1 GiB or 2^10 bytes, not 10^9 bytes.

While it's unlikely we will ever need byte-level granularity in allocations to guests, I'm not sure I buy the argument that this will only be used by applications in large allocations in the 100GB or TB range. I think I share Jay's preference here for increasing the granularity and either tracking the allocation in MiBs or bytes. I do somewhat agree that bytes is likely too fine-grained, hence my preference for mebibytes.
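To illustrate the suffix idea, a minimal sketch (nothing that exists in nova today, just the shape of it; base-2 units assumed throughout):

    # Sketch: turn a human-friendly size such as "2M" or "1G" into MiB,
    # mirroring the suffix style already used for hugepage sizes.
    UNIT_TO_MIB = {"M": 1, "G": 1024, "T": 1024 * 1024}

    def parse_size_mib(text):
        text = text.strip().upper().rstrip("B")   # accept "1G" as well as "1GB"
        if text and text[-1] in UNIT_TO_MIB:
            return int(text[:-1]) * UNIT_TO_MIB[text[-1]]
        return int(text)                          # bare number: already MiB

    assert parse_size_mib("2M") == 2
    assert parse_size_mib("1G") == 1024
    assert parse_size_mib("512G") == 512 * 1024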
2) After reading the original Intel PMEM specification (http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf), it seems to me that what you are describing with a generic PMEM_GB (or PMEM_BYTE) resource class is more appropriate for the block mode translation system described in the PDF versus the PMEM namespace system described therein.
From a lay person's perspective, I see the difference between the two as similar to the difference between describing the bytes that are in block storage versus a filesystem that has been formatted, wiped, cleaned, etc on that block storage.
First, let's talk about "block mode" vs. "persistent memory mode". They are not tiered; they are counterparts. Each describes an access method to the underlying hardware. I quote some sections from https://www.kernel.org/doc/Documentation/nvdimm/nvdimm.txt inside the dashed-line block below.
------------------------------8<-------------------------------------------------------------------
Why BLK?
--------
While PMEM provides direct byte-addressable CPU-load/store access to NVDIMM storage, it does not provide the best system RAS (recovery, availability, and serviceability) model. An access to a corrupted system-physical-address address causes a CPU exception while an access to a corrupted address through an BLK-aperture causes that block window to raise an error status in a register. The latter is more aligned with the standard error model that host-bus-adapter attached disks present. Also, if an administrator ever wants to replace a memory it is easier to service a system at DIMM module boundaries. Compare this to PMEM where data could be interleaved in an opaque hardware specific manner across several DIMMs.
PMEM vs BLK
BLK-apertures solve these RAS problems, but their presence is also the major contributing factor to the complexity of the ND subsystem. They complicate the implementation because PMEM and BLK alias in DPA space. Any given DIMM's DPA-range may contribute to one or more system-physical-address sets of interleaved DIMMs, *and* may also be accessed in its entirety through its BLK-aperture. Accessing a DPA through a system-physical-address while simultaneously accessing the same DPA through a BLK-aperture has undefined results. For this reason, DIMMs with this dual interface configuration include a DSM function to store/retrieve a LABEL. The LABEL effectively partitions the DPA-space into exclusive system-physical-address and BLK-aperture accessible regions. For simplicity a DIMM is allowed a PMEM "region" per each interleave set in which it is a member. The remaining DPA space can be carved into an arbitrary number of BLK devices with discontiguous extents.
------------------------------8<-------------------------------------------------------------------
You can see that "block mode" does not provide direct access, and thus not the best performance. That is why "persistent memory mode" is proposed in the spec.
Block mode will allow any existing application that is coded to work with a block device to just use the NVDIMM storage as a faster form of solid state storage. Direct mode requires applications to be specifically coded to support it. From an OpenStack perspective we will eventually want to support exposing the device both as a block device (e.g. via virtio-blk or virtio-scsi devices if/when qemu supports that) and as a direct mode pmem device to the guest. I understand why persistent memory mode is more appealing from a vendor perspective to lead with, but practically speaking there are very few applications that actually support pmem to date, and supporting app direct mode only seems like it would hurt adoption of this feature more generally rather than encourage it.
However, people can still create a block device out of a "persistent memory mode" namespace, and furthermore create a file system on top of that block device. Applications can map files from that file system into their address spaces, and if the file system is DAX (direct-access) capable, the application's access to the hardware is still direct access, which means direct byte-addressable CPU load/store access to NVDIMM storage. This sounds perfect so far, so one might ask: why not just track the DAX file system and let the VM instances map files from that file system? However, this usage model is reported to have severe issues when the hardware is passed through. So the recommended model is still mapping namespaces of "persistent memory mode" into applications' address spaces.
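For illustration, from the application's point of view that model is just an ordinary file mapping, something like this sketch (the mount point and file are made up; an ext4/xfs filesystem mounted with -o dax on the namespace is assumed):

    import mmap
    import os

    # Sketch: map a file that lives on a DAX-capable filesystem. Loads and
    # stores through 'buf' then go straight to the NVDIMM with no page cache.
    path = "/mnt/pmem/data.bin"          # assumed DAX mount point
    size = 64 * 1024 * 1024

    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    os.ftruncate(fd, size)
    buf = mmap.mmap(fd, size, mmap.MAP_SHARED,
                    mmap.PROT_READ | mmap.PROT_WRITE)
    buf[:11] = b"hello, pmem"
    buf.flush()                          # msync; real persistence on DAX has
                                         # extra subtleties (CPU cache flushes)
    buf.close()
    os.close(fd)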
Intel's NVDIMM technology works in 3 modes: app direct, block, and system memory. The direct and block modes were discussed at some length in the spec and this thread. Does libvirt support using an NVDIMM pmem namespace in devdax mode to back guest memory instead of system RAM? Based on https://docs.pmem.io/getting-started-guide/creating-development-environments... qemu does support such a configuration, and honestly having the capability to alter the guest memory backing to run my VMs with 100s of GB of RAM would be as compelling as app direct mode, as it would allow all my legacy applications to work without modification and would deliver effectively the same performance. Perhaps we should also consider a hw:mem_page_backing extra spec to complement the hw:mem_page_size we already have for hugepages today. This would probably be a separate spec, but I would hope we don't make decisions today that would block other usage models in the future.
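For reference, a qemu configuration along those lines (a virtual NVDIMM backed by a host devdax namespace) looks roughly like the following sketch, expressed here as an argument list (paths, sizes and ids are made up):

    # Sketch: back a virtual NVDIMM for the guest with a devdax namespace
    # on the host.
    qemu_nvdimm_args = [
        "-machine", "pc,nvdimm=on",
        "-m", "4G,slots=2,maxmem=36G",
        "-object", "memory-backend-file,id=mem1,share=on,"
                   "mem-path=/dev/dax0.0,size=32G,align=2M",
        "-device", "nvdimm,id=nvdimm1,memdev=mem1",
    ]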
In Nova, the DISK_GB resource class describes the former: it's a bunch of blocks that are reserved in the underlying block storage for use by the virtual machine. The virtual machine manager then formats that bunch of blocks as needed and lays down a formatted image.
We don't have a resource class that represents "a filesystem" or "a partition" (yet). But the proposed PMEM namespaces in your spec definitely seem to be more like a "filesystem resource" than a "GB of block storage" resource.
Best, -jay
On 2019-01-07 10:05:54 +0000 (+0000), Sean Mooney wrote: [...]
Note persistent memory is still memory, so it is base 2 not base 10; when we state 1GB we technically mean 1 GiB or 2^10 bytes, not 10^9 bytes [...]
Not to get pedantic, but a gibibyte is 2^30 bytes (2^10 is a kibibyte). I'm quite sure you (and most of the rest of us) know this, just pointing it out for the sake of clarity. -- Jeremy Stanley
On Mon, 2019-01-07 at 13:11 +0000, Jeremy Stanley wrote:
On 2019-01-07 10:05:54 +0000 (+0000), Sean Mooney wrote: [...]
Note persistent memory is still memory, so it is base 2 not base 10; when we state 1GB we technically mean 1 GiB or 2^10 bytes, not 10^9 bytes
[...]
Not to get pedantic, but a gibibyte is 2^30 bytes (2^10 is a kibibyte). I'm quite sure you (and most of the rest of us) know this, just pointing it out for the sake of clarity.
Yep, I spotted that when I was reading the mail back after I sent it :) I kind of wanted to fix it, but I assumed most would see it's a typo and didn't want to spam.
The main point I wanted to convey was that NVDIMM-P is being standardised by JEDEC and will be using their unit definitions rather than the IEC definitions typically used by block storage. Thanks for giving me the opportunity to clarify :)
On 01/07/2019 05:05 AM, Sean Mooney wrote:
I think I share Jay's preference here for increasing the granularity and either tracking the allocation in MiBs or bytes. I do somewhat agree that bytes is likely too fine-grained, hence my preference for mebibytes.
Actually, that's not at all my preference for PMEM :) My preference is to use custom resource classes like "CUSTOM_PMEM_NAMESPACE_1TB" because the resource is the namespace, not the bunch of blocks/bytes of storage. With regard to the whole "finest-grained unit" thing, I was just responding to Alex Xu's comment: "The point of the initial design was to avoid encoding the size in the resource class name. If that is OK with you (I remember people hate encoding sizes and numbers into trait names), then we will update the design. Probably, based on the namespace configuration, nova will be responsible for creating those custom RCs first. That sounds workable." Best, -jay
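To put the two options in flavor terms, a sketch (the resources: extra spec override syntax already exists in nova; the class names and amounts are just the examples from this thread):

    # Sketch: what a flavor would request under each proposal.
    extra_specs_namespace_rc = {
        "resources:CUSTOM_PMEM_NAMESPACE_1TB": "1",   # one whole 1TB namespace
    }
    extra_specs_size_rc = {
        "resources:VPMEM_GB": "1024",                 # 1024 GB of vPMEM
    }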
participants (5): Alex Xu, Jay Pipes, Jeremy Stanley, Rui Zang, Sean Mooney