volume local cache
Hi Gibi, Dan, Alex and dear Nova cores,

Nova Spec: https://review.opendev.org/#/c/689070/
Cinder Spec: https://review.opendev.org/#/c/684556/

Regarding the volume local cache spec: after several rounds of discussion in the Cinder weekly meeting, we continued the discussion at the Cinder virtual mid-cycle PTG yesterday, and the Cinder team is now close to reaching agreement. The major ideas are:

* Cinder sets a "cacheable" property in the volume type.
* Cinder guarantees that "cacheable" is set correctly, which means: it should not be set together with "multiattach", and should not be set for a backend that cannot be cached (NFS, RBD).
* os-brick prevents attaching a volume with a cache mode that is not safe in a cloud environment.
  * E.g. it prevents write-back cache mode, so that all operations such as live migration, snapshot, consistency group and volume backup keep working as usual (see the cache-mode guard sketch at the end of this message).
* Nova schedules a VM with a "cacheable" volume to a server with cache capability.
  * If no such server is available, it just goes ahead without cache, but does not fail the request.
  * Scheduling is based on the flavor (this needs more design work with Nova experts).

Could you please comment on the Cinder spec, if you have any feedback, before the Cinder team gets this hammered out? Thank you so much.

Below is the performance test I did personally, using an Optane SSD P4800X 750G as the cache. fio was configured with block size = 4k and iodepth = 1, which is typical for latency measurement (a reproduction sketch follows at the end of this message). The data may be slightly different in other environments, just FYI:

In the rand read test: [chart: latency comparison of Ceph baseline vs. Write-Through vs. Write-Back]
In the rand write test: [chart: same comparison for rand write]
In the 70% rand read + 30% rand write mix test: [chart: same comparison for the mixed workload]

"Ceph" in the charts above means the baseline test, as shown in path ① below. "Write-Through" and "Write-Back" mean open-cas cached, as shown in path ②.

[diagram: path ① goes directly to the Ceph backend (baseline); path ② goes through the open-cas cache device]

So even in write-through mode we still get a lot of performance gain.

Some possible concerns:

* In the real world, the cache hit rate may not be as high (95% and more).
  * It depends on how big the cache device is. But normally a fast SSD with x TB size, or a persistent memory device with 512G size, is big enough for the hot-data cache.
* What happens when the backend storage is attached over RDMA?
  * A pure RDMA network link latency would be as low as ~10 us. Adding the disk I/O in the storage system, the final latency would be dozens of microseconds. So a fast SSD (itself ~10 us) is not suitable to cache an RDMA volume, but we can use persistent memory (with latency of about hundreds of nanoseconds) to do the caching. I will do the measurement after Chinese New Year.
* It's a pain that Ceph cannot be supported.
  * The major concern with mounting a Ceph volume on the host OS is security, right? It is a performance/security tradeoff for operators. In private clouds, where OpenStack is probably mostly used, trusting the host OS may not be a big problem; anyway, backends other than Ceph already attach this way. I'm not going to change Ceph to mount on the host OS in this spec, but it would not be difficult for customers to switch to this way, just as they customize other things.

Regards
LiangFang
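[Editor's sketch] A minimal illustration of the kind of cache-mode guard described in the os-brick bullet above. The names (SAFE_CACHE_MODES, validate_cache_mode) are hypothetical and do not reflect the actual os-brick API; the sketch only shows the idea of rejecting unsafe modes such as write-back.

# Hypothetical guard; names are illustrative, not real os-brick code.
# Modes that keep the backend copy consistent, so live migration, snapshot,
# consistency groups and volume backup keep working as usual.
SAFE_CACHE_MODES = {"wt", "wa", "pt"}  # write-through, write-around, pass-through

def validate_cache_mode(cache_mode: str) -> str:
    """Reject cache modes (e.g. write-back) that could strand dirty data
    on the compute node's local cache device."""
    mode = cache_mode.lower()
    if mode not in SAFE_CACHE_MODES:
        raise ValueError(
            "cache mode '%s' is not safe in a cloud environment; "
            "allowed modes: %s" % (cache_mode, sorted(SAFE_CACHE_MODES)))
    return mode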
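[Editor's sketch] A rough reproduction of the fio latency test described above (bs=4k, iodepth=1, random read / random write / 70/30 mix). The device path, runtime, and ioengine are assumptions; the original job file is not included in the message.

import subprocess

def run_latency_test(device: str, workload: str) -> None:
    """workload is one of: randread, randwrite, randrw (70% read / 30% write)."""
    cmd = [
        "fio", "--name=latency-test",
        "--filename=%s" % device,
        "--rw=%s" % workload,
        "--bs=4k", "--iodepth=1",       # parameters quoted in the message
        "--direct=1", "--ioengine=libaio",
        "--runtime=60", "--time_based",
        "--group_reporting",
    ]
    if workload == "randrw":
        cmd.append("--rwmixread=70")    # 70% random read + 30% random write
    subprocess.run(cmd, check=True)

for wl in ("randread", "randwrite", "randrw"):
    run_latency_test("/dev/cas1-1", wl)  # example open-cas exported device path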
participants (1)
Fang, Liang A