[Cinder][Ceph] Wrong volume usage reported in the Ceph cluster
Hello,

I'm experiencing an issue with the volume usage of the Ceph cluster, which is currently used in OpenStack for volumes. I am working with Ceph version Octopus (15.2.17).

When I run the ceph df command, I get this output:

--- RAW STORAGE ---
CLASS    SIZE     AVAIL   USED    RAW USED  %RAW USED
hdd      120 TiB  33 TiB  87 TiB  87 TiB    72.72
TOTAL    120 TiB  33 TiB  87 TiB  87 TiB    72.72

--- POOLS ---
POOL                   ID  PGS   STORED   OBJECTS  USED     %USED  MAX AVAIL
device_health_metrics  1   32    69 MiB   101      137 MiB  0      11 TiB
images                 2   256   1.3 TiB  166.90k  2.5 TiB  10.72  11 TiB
vms                    3   64    574 KiB  21       2.5 MiB  0      11 TiB
volumes                4   2048  41 TiB   3.94M    82 TiB   79.42  11 TiB
backups                5   1024  407 GiB  111.85k  818 GiB  3.63   11 TiB

And when I run the rbd du -p volumes command, I get this total:

NAME     PROVISIONED  USED
<TOTAL>  14 TiB       8.3 TiB

This "volumes" pool is currently set to replica 2, and mirroring is not enabled. I have checked for any locked or deleted snapshots but found none. I also ran a ceph osd pool deep-scrub volumes, but it didn't resolve the issue.

Has anyone encountered this problem before? Could someone provide assistance?
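(A quick way to enumerate any remaining snapshots across the whole pool, shown here only as a minimal sketch that assumes the pool is named volumes; snapshots appear in the long listing as image@snap entries:)

    # List every image in the pool together with its snapshots; snapshot rows contain "@"
    rbd ls -l -p volumes | grep '@'

    # Or walk the pool image by image
    for img in $(rbd -p volumes ls); do
        rbd snap ls volumes/"$img"
    done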
Hi,

did you look at the trash?

rbd -p volumes trash ls
Hello Eugen,

Yes, I ran the command but there is no output:

root@hc-node01:~# rbd -p volumes trash ls
root@hc-node01:~#
Did you bulk delete a lot of volumes? Are you sure there aren't any snapshots involved?

The math for the 'rbd du' output is plausible:

NAME     PROVISIONED  USED
<TOTAL>  14 TiB       8.3 TiB

3.94 million objects of 4 MiB each comes to roughly 14 or 15 TiB, so it doesn't seem like there are orphaned objects.

Can you share 'ceph osd df', 'ceph -s' and 'ceph pg ls-by-pool volumes' in a text file?

BTW, replica size 2 is a really bad choice; it has been discussed many times why it should only be considered in test clusters or if your data is not important.
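(As a rough cross-check of that estimate, a sketch only; the jq field names are an assumption about the Octopus 'rbd du' JSON layout, so inspect the JSON output first:)

    # Expected footprint if all ~3.94M objects were full 4 MiB chunks:
    # 3,940,000 * 4 MiB ~= 15 TiB
    echo "3940000 * 4 / 1024 / 1024" | bc -l

    # Compare against the totals rbd itself reports for the pool
    rbd du -p volumes --format json | jq '.total_provisioned_size, .total_used_size'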
Here are links to the files with the output of the requested commands:

https://drive.google.com/file/d/1GTXKlBDp6QTJWmNZxoycflqHeAPvadjy/view?usp=s...
https://drive.google.com/file/d/1TgtNW3btQSPrOI3vdNwLFpWGmyPYs0Ou/view?usp=s...
https://drive.google.com/file/d/1qG0Gro00rFEMvteUgG78sWMNZrakRL3_/view?usp=s...

Thank you very much, we will adjust to replica 3.
I don't have access to your Google Drive. Please use some accessible platform or make those files publicly available, or attach them as plain text files, although I'm not sure whether attachments are allowed here...
Most reasonable kinds of attachments (including plain text files) are allowed, but if they push the message size over 40 KB then the post will be held until a list moderator has a chance to look it over. I personally try to process the moderator holds for this list once a day unless I'm really busy with other things.

--
Jeremy Stanley
Hello Eugen,

Please try to access those links again, I've changed the permissions.
Okay, it's accessible now.

From a first glance (I won't have much time over the next days) the numbers seem to match: one PG has a size of around 22 GB, and having 2048 PGs results in roughly 42 TB, which matches your 'ceph df' output. I don't exactly recall whether Ceph Octopus displayed usage information differently than newer releases, but I think it should match in general, especially with rbd.

But there might be tombstones in the RocksDB of the OSDs, have you tried compaction? Offline compaction is usually better than online compaction, so you might want to stop the OSDs one by one and use the ceph-kvstore-tool for that. But I don't expect it to free up that much space.

You could also try to 'rbd sparsify' a couple of images and see if you get some space back.
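(For anyone who hasn't done an offline compaction before, a minimal sketch for a single OSD, assuming a package-based deployment with systemd units and the default data path /var/lib/ceph/osd/ceph-<id>; do one OSD at a time and let the cluster settle in between:)

    # Prevent rebalancing while OSDs are taken down one by one
    ceph osd set noout

    # Stop the OSD so its store is not in use, compact it offline, then start it again
    systemctl stop ceph-osd@0
    ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-0 compact
    systemctl start ceph-osd@0

    # Once every OSD has been compacted and the cluster is back to HEALTH_OK:
    ceph osd unset noout

    # 'rbd sparsify' is run per image, e.g. (replace with a real volume id):
    rbd sparsify volumes/volume-<id>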
Thank you very much for your answer.

I ran the rbd sparsify command on some volumes in the pool and it didn't really free much space. Since we haven't identified the problem, we'll take an alternative route: creating a new environment and migrating the OpenStack VMs and volumes to a new Ceph cluster on the latest version, to check whether the problem persists.
I just noticed something else from your pg-ls output: your average object size is bigger than expected. You have 22 GB per PG and around 1960 objects per PG, which is around 11 MB per object. In an rbd pool used by Cinder you usually have objects of 4 MB; that's the default chunk size rbd objects are split into (rbd_store_chunk_size).

How did you set up that Ceph cluster? Do you have some non-default configs? I would recommend reviewing your configuration, otherwise you could end up in the same situation when migrating your VMs to a new cluster. Is it a cephadm-managed cluster or package based?

For starters, you could share your ceph.conf from the control/compute nodes (one is sufficient if they're identical). What is rbd_store_chunk_size in cinder.conf?
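(To reproduce that average-object-size estimate from the cluster itself, a sketch; the jq path is an assumption about the Octopus 'ceph df' JSON layout, so verify it against 'ceph df --format json' first:)

    # 22 GiB per PG divided by ~1960 objects per PG ~= 11.5 MiB per object
    echo "22 * 1024 / 1960" | bc -l

    # Pool-wide average object size in MiB, straight from 'ceph df'
    ceph df --format json | \
        jq -r '.pools[] | select(.name == "volumes") | (.stats.stored / .stats.objects / 1048576)'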
Hello Eugen,

This cluster is package based. Below are the ceph.conf configuration and the output of ceph config dump and the pool settings:

CEPH.CONF:

[global]
fsid = ccdcc86e-c0e7-49ba-b2cc-f8c162a42f91
mon_initial_members = hc-node02, hc-node01, hc-node04
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx

rbd default features = 125
rbd mirroring replay delay = 300

mon pg warn max object skew = 20

#Choose reasonable numbers for number of replicas and placement groups.
osd pool default size = 2        # Write an object 2 times
osd pool default min size = 1    # Allow writing 1 copy in a degraded state
osd pool default pg num = 64
osd pool default pgp num = 64

#Choose a reasonable crush leaf type
#0 for a 1-node cluster.
#1 for a multi node cluster in a single rack
#2 for a multi node, multi chassis cluster with multiple hosts in a chassis
#3 for a multi node cluster with hosts across racks, etc.
osd crush chooseleaf type = 1

debug ms = 0
debug mds = 0
debug osd = 0
debug optracker = 0
debug auth = 0
debug asok = 0
debug bluestore = 0
debug bluefs = 0
debug bdev = 0
debug kstore = 0
debug rocksdb = 0
debug eventtrace = 0
debug default = 0
debug rados = 0
debug client = 0
debug perfcounter = 0
debug finisher = 0

[osd]
debug osd = 0/0
debug bluestore = 0/0
debug ms = 0/0
osd scrub begin_hour = 19
osd scrub end_hour = 6
osd scrub sleep = 0.1
bluestore cache size ssd = 8589934592

CEPH CONFIG DUMP:

WHO  MASK  LEVEL     OPTION                                 VALUE    RO
mon        advanced  auth_allow_insecure_global_id_reclaim  false
mgr        advanced  mgr/balancer/active                    true
mgr        advanced  mgr/balancer/mode                      upmap
mgr        advanced  mgr/dashboard/server_addr              0.0.0.0  *
mgr        advanced  mgr/dashboard/server_port              7000     *
mgr        advanced  mgr/dashboard/ssl                      true     *
mgr        advanced  mgr/restful/server_addr                0.0.0.0  *
mgr        advanced  mgr/restful/server_port                8003     *
mgr        advanced  mgr/telemetry/enabled                  true     *
mgr        advanced  mgr/telemetry/last_opt_revision        3        *

CEPH OSD POOL GET VOLUMES ALL:

size: 2
min_size: 1
pg_num: 2048
pgp_num: 2048
crush_rule: replicated_rule
hashpspool: true
nodelete: false
nopgchange: false
nosizechange: false
write_fadvise_dontneed: false
noscrub: false
nodeep-scrub: false
use_gmt_hitset: 1
fast_read: 0
pg_autoscale_mode: off

In fact rbd_store_chunk_size is not configured in Ceph, but it is in cinder.conf, so I'll add the Cinder output too:

[rbd-backend-hdd]
volume_driver = cinder.volume.drivers.rbd.RBDDriver
volume_backend_name = backend-hdd
rbd_pool = volumes
rbd_user = cinder
rbd_ceph_conf = /etc/ceph/ceph.conf
rbd_flatten_volume_from_snapshot = true
rbd_max_clone_depth = 5
rbd_store_chunk_size = 4
rados_connect_timeout = -1

I also noticed that there are rbd mirror settings in the configuration files that are currently disabled.
Can you show a couple of outputs for some of the volumes, maybe 3 or 4?

rbd info volumes/volume-{SOME_ID} | grep -E "size|order|snap"

I don't find anything suspicious yet...
Sure,

root@hc-node01:~# rbd info volumes/volume-f8949d7b-4de5-47c2-8ef8-1c83fe897449 | grep -E "size|order|snap"
        size 120 GiB in 15360 objects
        order 23 (8 MiB objects)
        snapshot_count: 0
        parent: images/8c5805a4-d117-45d1-92d5-bff5e7b176d7@snap

root@hc-node01:~# rbd info volumes/volume-9e61bb4f-a5d9-4407-9863-e786a05359ce | grep -E "size|order|snap"
        size 400 GiB in 102400 objects
        order 22 (4 MiB objects)
        snapshot_count: 0
Interesting, one volume has 8 MB chunks while the other one has 4 MB. How many control nodes do you have? Are their configs identical? I know that Glance does 8 MB chunks by default, but those are clearly volumes... I don't have an immediate explanation. Maybe these are volumes that were uploaded as images, and when new volumes are then created from those images, they "inherit" the chunk size? I'll have to think about it...
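(For context, the rbd "order" is the log2 of the object size, so order 22 means 4 MiB chunks and order 23 means 8 MiB chunks. A quick way to survey how the orders are distributed across the pool, as a sketch that assumes jq is available and the pool is named volumes:)

    # Count volumes per object order (22 = 4 MiB chunks, 23 = 8 MiB chunks)
    for img in $(rbd -p volumes ls); do
        rbd info --format json volumes/"$img" | jq -r '.order'
    done | sort | uniq -c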
There are 6 storage hosts with identical configurations. The OSDs are built on HDDs, with RocksDB and WAL on cache devices.
I see. I just checked some rbd images in our own cloud, and those volumes with 8 MB chunk sizes are volumes created from images, so that's unusual. I must admit, I haven't looked into this difference in years. ;-)

One other thing that comes to mind is bluestore_min_alloc_size_hdd; its default before some late Pacific version was 64k, which could cause quite high overhead if only small chunks were written. But I doubt that this is the case here, it just seems too high to be explained by the alloc size. I'll need to think a bit more about it.
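(If you want to check what an OSD currently has configured, a sketch via the admin socket; note that the allocation size that actually matters is baked in when the OSD is created, so OSDs created on an older release keep the old 64k value even if the runtime config shows something smaller:)

    # Ask a running OSD for its configured value
    ceph daemon osd.0 config get bluestore_min_alloc_size_hdd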
They are the same 6 hosts; we currently have a hyperconverged solution.
Hello Eugen,

The strangest thing is that the number of volumes does not increase and neither does provisioned/used; they stay at 14 TB provisioned and 9 TB used. I searched for any known problem in this version but didn't find one. Anyway, I'll keep looking to see if I can find something.
I added up the object counts of all volumes in the volumes pool (3,253,434) and then ran 'rados -p volumes ls | wc -l' (3,946,219), so apparently there are around 692,785 orphaned objects. I'm going to run a pg repair on all the PGs in this pool and check whether I get any positive result.

While that runs, I also decided to list the objects to see the average size of each one and how much space this would free up. I ran the command:

rados -p volumes ls | while read obj; do rados -p volumes stat $obj; done > size-obj.txt

and I started getting the following messages:

error stat-ing volumes/rbd_data.c7fb271f7be3b0.0000000000001c00: (2) No such file or directory
error stat-ing volumes/rbd_data.c7fb271f7be3b0.0000000000005362: (2) No such file or directory
error stat-ing volumes/rbd_data.c7fb271f7be3b0.0000000000002914: (2) No such file or directory
error stat-ing volumes/rbd_data.c7fb271f7be3b0.0000000000008869: (2) No such file or directory
error stat-ing volumes/rbd_data.c7fb271f7be3b0.0000000000008873: (2) No such file or directory
error stat-ing volumes/rbd_data.c7fb271f7be3b0.0000000000004b11: (2) No such file or directory

Are these the objects that are impacting growth? If I delete them manually, could this cause any inconsistency in the volumes?
That would have been my next suggestion: go through the pool, match all existing volumes to their rados objects, and find the orphans. You can identify the Cinder volume that matches a given rbd_data prefix like this:

for i in $(rbd -p volumes ls); do
    if [ $(rbd info --pretty-format --format json volumes/$i | jq -r '.block_name_prefix') = "rbd_data.{YOUR_PREFIX}" ]; then
        echo "Volume: $i"
    fi
done

And then check in Cinder whether the volume actually exists. With {YOUR_PREFIX} I mean the string between the two dots from your example: c7fb271f7be3b0
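(To avoid rerunning that loop once per prefix, a variant that works the other way around: collect the prefixes of all existing images once, then diff them against what rados actually stores. This is only a sketch, assuming jq is available and that the data objects follow the usual rbd_data.<prefix>.<offset> naming:)

    # Prefixes of every image rbd still knows about
    rbd -p volumes ls | while read img; do
        rbd info --format json volumes/"$img" | jq -r '.block_name_prefix'
    done | sort -u > known_prefixes.txt

    # Prefixes of the data objects that actually exist in the pool
    rados -p volumes ls | grep '^rbd_data\.' \
        | sed -E 's/^(rbd_data\.[^.]+)\..*/\1/' | sort -u > seen_prefixes.txt

    # Prefixes present in rados but not belonging to any existing image
    comm -13 known_prefixes.txt seen_prefixes.txt

Whatever comes out of such a comparison, objects whose prefix does belong to an existing image must not be deleted manually; that would corrupt the volume.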
participants (3)

- Eugen Block
- Jeremy Stanley
- lfsilva@binario.cloud