[ceph-users] Suggestion to build ceph storage

Anthony D'Atri anthony.datri at gmail.com
Sun Jun 19 16:24:35 UTC 2022



>> 
>> 
>> Please do not CC the list.
> 
> It was my mistake, sorry about that. 

You hadn’t, but you just did (doh!)

Back in the old days one would solicit replies and post a summary afterward; that cut down on list volume.  ceph-users gets a LOT of traffic and I don’t like to flood people.

>>> 15 Total servers and each server has a 12x18TB HDD (spinning disk) . We
>>> understand SSD/NvME would be best fit but it's way out of budget.
>> 
>> NVMe SSDs can be surprisingly competitive when you consider IOPS/$, density, and the cost of the HBA you don’t need.
> 
> Yes totally

This TCO calculator is gold:

https://www.snia.org/forums/cmsi/programs/TCOcalc

> and I have only single slot to mount one NvME on motherboard. Let’s say I want to put single M.2 NvME then what size I would go for ? 

What kind of systems are these?  Are you sure that the M.2 port is NVMe, not just SATA?  Any option to add a BOSS card or rear cage for additional 2.5” SFF or M.2 SSDs?

Let’s work toward the sizing.

First, you mention CephFS today, and RGW (object) later.  This is important.

* For CephFS it is HIGHLY recommended to place the metadata pool on non-rotational media, ideally keeping 4 copies
* For RGW, unless your usage is really low and your objects mostly large (like 1GB+), you’ll likely want non-rotational storage for the index pool as well.
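
For concreteness, pinning those pools to flash is just a CRUSH rule plus a couple of pool settings.  A rough sketch, assuming the pools end up named cephfs_metadata and default.rgw.buckets.index (adjust to whatever your deployment actually creates):

    # rule that only selects OSDs whose device class is ssd; check
    # "ceph osd tree" to see what class your flash OSDs actually get
    ceph osd crush rule create-replicated flash-only default host ssd
    # point the CephFS metadata pool at it and bump replication to 4
    ceph osd pool set cephfs_metadata crush_rule flash-only
    ceph osd pool set cephfs_metadata size 4
    # later, same idea for the RGW index pool
    ceph osd pool set default.rgw.buckets.index crush_rule flash-only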

Assuming that you’ll be running a recent Ceph release, Pacific or Quincy, you have more flexibility in DB+WAL sizing than with previous releases, which were constrained by fixed RocksDB level sizes.  A hair too little, and your DB data spills over onto the slow device, and everybody has a bad day.  This can also happen during DB compaction.
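
If you do end up undersized, Ceph will at least tell you.  Roughly:

    ceph health detail    # look for a BLUEFS_SPILLOVER warning
    ceph osd df           # META/OMAP columns show the per-OSD metadata footprint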

RGW and CephFS tend to be heavier on metadata than RBD (Block storage), in part due to the file/volume sizes tending to be dramatically smaller, so one has a LOT more of them.  This influences how many DB levels you want to try to keep on the faster storage.
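
To make the level math concrete (rule-of-thumb numbers, not exact): with the old fixed sizing each RocksDB level is roughly 10x the previous one,

    L1 ~ 300MB, L2 ~ 3GB, L3 ~ 30GB, L4 ~ 300GB

which is why the commonly quoted “useful” DB sizes were just over ~30GB or ~300GB; anything in between mostly sat idle because the next level wouldn’t fit anyway.  Pacific/Quincy relax this, but the 10x-per-level intuition still helps when deciding how much metadata you can realistically keep on the fast device.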

18TB is a big spinner, and you may find that the SATA interface is a real bottleneck.  In one object storage deployment I’ve worked with, HDDs were capped at 8TB because of that bottleneck.  After failing to squeeze IOPS from a stone, the deployment finally dumped the spinners and went with QLC SSDs, with much success.
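
Back-of-the-envelope on why that hurts: an 18TB drive at a generous ~250MB/s sustained needs

    18,000,000 MB / 250 MB/s = 72,000 s ≈ 20 hours

just to be read or refilled end to end, and a real backfill/recovery is mostly random I/O, so expect considerably longer per drive.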

Without any more detail about your workload, I’ll guesstimate that you want to plan for 100GB per OSD on a system with 12x big spinners (yikes), so leaving additional space for potential future index and metadata pools, I’d say go with at least a 3.84TB SKU.
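
Rough arithmetic behind that guess:

    12 OSDs x 100GB DB/WAL   ≈ 1.2TB
    3.84TB - 1.2TB           ≈ 2.6TB left for CephFS metadata / future RGW index

plus whatever free space keeps the flash from running full.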

>>> 
>> 
>> Since you specify 12xHDD, you’re thinking an add-in PCI card?  Or do you have rear bays?
> 
> I have raid controller connected to all drivers in side box. 

Don’t use a RAID controller aka RoC HBA.  They add a lot of expense, they’re buggy as all get-out, finicky, need very specific monitoring, and add latency to your I/O.  If you go the route of wrapping every HDD in a single-drive R0 volume, your lifecycle will be that much more complex, and you’ll need to add cache RAM/flash on the HBA and a supercap for backup. Cha-ching.  Traditional battery BBUs are usually warranted for only a year.  If it fails, everything will be silently slower, and unless you monitor the BBU/supercap state, you won’t even know.

Add that into the TCO calculation.  The $$$ saved on a RAID HBA goes a certain way toward SSDs.

With QLC you can fit 1.4PB of raw space in 1RU; HDDs can’t come close.  RUs cost money, and when you run out, getting more, if that’s even possible, is expensive.

For CephFS you’ll do best with MDS nodes with high-frequency, low-core-count CPU SKUs.  OSD nodes make much better use of more cores and are less sensitive to high clock speeds.


>> 
>> 
>>> 2. Do I need to partition wal/db for each OSD or just a single
>>> partition can share for all OSDs?
>> 
>> Each OSD.
> 
> What size of partition I should create for each 18TB OSD? 

Without more detail on your workload, I’ll guesstimate 100GB for WAL+DB. Partition the rest of the space for CephFS and RGW metadata.
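
As a sketch of how that carve-up can look with ceph-volume (device names made up; with cephadm you’d express the same thing as an OSD service spec):

    # 12 HDD OSDs, each getting a ~100GB DB/WAL slice on the NVMe
    ceph-volume lvm batch /dev/sd[a-l] \
        --db-devices /dev/nvme0n1 \
        --block-db-size 100G

    # then use the space left on the NVMe for one or two small flash OSDs
    # to back the CephFS metadata / RGW index pools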

Oh, and be prepared that a 12:1 ratio plus other pools isn’t ideal; the NVMe device may become a bottleneck.

> 
>> 
>>> 3. Can I put the OS on the same disk where the wal/db is going to sit ?
>> 
>> Bad idea.
> 
> I can understand but I have redundancy of 3 copy just in case of failure. And OS disk doesn’t hammer correct?

It doesn’t, no. You could even PXE-boot your nodes, cf. croit.io

There is the question of blast radius, though.

>> 
>>> (This way i don't need to spend extra money for extra disk)
>> 
>> Boot drives are cheap.
> 
> I have no extra slot left that is why planning to share but I can explore more. 

Please let me know privately which chassis model this is.

>> 
>>> 
>>> Any suggestions you have for this kind of storage would be much
>>> appreciated.
>>> _______________________________________________
>>> ceph-users mailing list -- ceph-users at ceph.io
>>> To unsubscribe send an email to ceph-users-leave at ceph.io
>> 



