My only operational experience with part power increases is object rings. You’re in uncharted territory.
In your experience, can having >100 parts per device impact overall cluster health/performance?
It’s an order of magnitude thing: 10s is bad, 100s is good, 1000s is OK, 10Ks is bad again. The issue is mostly the time spent scanning the file system by concurrent processes. In my biggest clusters, even with sharded databases, we probably only have maybe 50-300K DBs in the whole cluster, spread across hundreds of NVMe drives. Millions of parts wouldn’t make sense.

Clay Gerrard

On Thu, Dec 21, 2023 at 2:01 PM Thiago De Moraes Teixeira <teixeira.thiago@luizalabs.com> wrote:
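A back-of-envelope sketch of the database-count math above (the numbers are illustrative assumptions, not measurements): with only a few hundred thousand account/container databases, millions of partitions means most partition directories are empty, yet the consistency daemons still have to walk all of them.

```python
# Illustrative only: average databases per partition for a given part
# power. 300K DBs is at the top of the range Clay mentions.
def dbs_per_partition(total_dbs, part_power):
    """Average databases landing in each of the 2**part_power partitions."""
    return total_dbs / 2 ** part_power

print(round(dbs_per_partition(300_000, 18), 2))  # -> 1.14
print(round(dbs_per_partition(300_000, 21), 2))  # -> 0.14 (mostly empty dirs)
```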
I think that makes sense for your use-case. Last time we did a part power increase we did *3* in a row. Each increase would disable replication while relinker/cleanup was running - then we'd let replication run for ~24 hours to address any inconsistencies, swap out failed drives etc. Then repeat the whole process again - it was pretty intensive for ops, but they got good at it and our users didn't notice.
This part power increase was on the object ring, right? What concerns me about this whole increase in your case is the ~24 hours needed to make sure everything is OK. My hope is that the account/container database files are small, on the kB/MB scale.
totals are good, also number of databases per disk and per partition (max, min, avg, std-dev)
Unfortunately I can't get this information yet, but it is a critical piece of information for this process.
you can probably stop once you have >100 parts per device. But I'm only "recommending" it because that's what's supported and what we've seen work in the past.
It's good to hear about your experience. I recently read in the SUSE documentation <https://documentation.suse.com/soc/9/html/suse-openstack-cloud-clm-all/modify-input-model.html#selecting-partition-power> that 100 parts per device is ideal, and now you've said the same. In your experience, can having >100 parts per device impact overall cluster health/performance?
On Thu, Dec 21, 2023 at 3:48 PM Clay Gerrard <clay.gerrard@gmail.com> wrote:
On Thu, Dec 21, 2023 at 11:28 AM Thiago De Moraes Teixeira <teixeira.thiago@luizalabs.com> wrote:
do that inside a maintenance window, so we don't have to worry about availability issues during the process.
More than that, I would encourage you to disable writes; ideally just turn off all your proxies. You should probably also stop the account/container consistency daemons - or teach them how to abort/skip rings marked for a part power increase, like the object services do:
https://github.com/NVIDIA/swift/blob/master/swift/obj/replicator.py#L911-L91...
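The guard the linked replicator code implements could be sketched roughly like this. The `Ring` stand-in below is an assumption for illustration only; the real class is `swift.common.ring.Ring`, which exposes `part_power` and `next_part_power` from the ring file.

```python
# Hedged sketch: daemons inspect the ring for a pending partition power
# increase and skip work while one is in flight.
class Ring:  # stand-in for swift.common.ring.Ring (assumed attributes)
    def __init__(self, part_power, next_part_power=None):
        self.part_power = part_power
        self.next_part_power = next_part_power

def part_power_increase_active(ring):
    """True while a part power increase is prepared but not finalized."""
    return bool(ring.next_part_power) and \
        ring.next_part_power != ring.part_power

print(part_power_increase_active(Ring(8, next_part_power=9)))  # -> True
print(part_power_increase_active(Ring(9)))                     # -> False
```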
your recommendation is to do this increase one part power at a time, like 8->9; 9->10; ... 19->20?
I mean, going from 8->9 already *doubles* your number of partitions - exponential growth goes pretty fast. So even if you only have 1 part per device at 8, that's 2 (9), 4 (10), 8 (11), 16 (12), 32 (13), 64 (14), 128 (15) - you can probably stop once you have >100 parts per device. But I'm only "recommending" it because that's what's supported and what we've seen work in the past.
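The doubling arithmetic above can be sketched directly: starting from 1 part per device at part power 8, each increase doubles the count until it crosses the ~100 target.

```python
# Each +1 to part power doubles partitions (and so parts per device,
# with the device count held fixed).
parts_per_device, power = 1, 8
while parts_per_device <= 100:
    power += 1
    parts_per_device *= 2
    print(f"part power {power}: {parts_per_device} parts/device")
# -> ends at part power 15 with 128 parts/device
```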
One idea I've just thought of, besides this adaptation to the swift ring builder, is to add an input argument that maps to the new part power value (still considering whether this makes sense...)
I think that makes sense for your use-case. Last time we did a part power increase we did *3* in a row. Each increase would disable replication while relinker/cleanup was running - then we'd let replication run for ~24 hours to address any inconsistencies, swap out failed drives etc. Then repeat the whole process again - it was pretty intensive for ops, but they got good at it and our users didn't notice.
Since account/container doesn't support writes to relinking databases - you're going to be taking the cluster offline. Depending on how your ring management is automated it might be better to do the full relink from part power 8->X in one ring push. Christian is the expert but I *think* however many times you want to split a part you *should* be able to assign those new parts to the device they're actually on.
If you're doing say 3 increases at once instead of part 1 splitting to 2, 3 - it'd split to [8, 9, 10, 11, 12, 13, 14, 15] - and you'd want to make sure all those parts are assigned to each of the devices currently holding a replica of 1 (you should double check my math: https://paste.openstack.org/show/b0TKpjHzXSs7z8xrSVtK/)
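The split math described above can be sketched as follows (double-check this against the paste and Swift's own ring utilities, as Clay suggests): after k part power increases, old partition p maps onto the contiguous block of new partitions [p * 2**k, (p + 1) * 2**k - 1].

```python
# Hedged sketch of how one old partition splits across k increases.
def split_partitions(old_part, increases):
    width = 2 ** increases
    return list(range(old_part * width, (old_part + 1) * width))

print(split_partitions(1, 1))  # -> [2, 3]
print(split_partitions(1, 3))  # -> [8, 9, 10, 11, 12, 13, 14, 15]
```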
The second step, after you get your ring figured out, is to traverse the account/container device's filesystem trees and for each database calculate its new partition assignment (each individual database in part 1 will only go into one of the new parts). Then you just move it into the right dir on the same device - since if you did your ring manipulation correctly, it should already belong on that device. You can probably use the existing object-relinker as inspiration, but it would require heavy modification to work with account and container databases:
https://github.com/NVIDIA/swift/blob/master/swift/cli/relinker.py
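A minimal sketch of the partition calculation, to illustrate why the move stays on the same device: Swift derives an item's partition from the top bits of an MD5 over its hashed path, so a database's new partition after an increase is fully determined by its old one. This is illustrative only, not the real relinker; the `HASH_PATH_*` values are placeholder assumptions - a real cluster must use its own `swift_hash_path_prefix`/`swift_hash_path_suffix` from swift.conf.

```python
import hashlib
import struct

HASH_PATH_PREFIX = b''          # assumption: your swift_hash_path_prefix
HASH_PATH_SUFFIX = b'changeme'  # assumption: your swift_hash_path_suffix

def get_part(part_power, account, container=None):
    """Simplified version of ring partition lookup for a/c paths."""
    path = '/' + '/'.join(p for p in (account, container) if p)
    key = HASH_PATH_PREFIX + path.encode('utf-8') + HASH_PATH_SUFFIX
    digest = hashlib.md5(key).digest()
    # top (part_power) bits of the first 4 digest bytes
    return struct.unpack('>I', digest[:4])[0] >> (32 - part_power)

old = get_part(8, 'AUTH_test', 'mycontainer')
new = get_part(11, 'AUTH_test', 'mycontainer')
# after 3 increases, the new part always falls in the old part's split range
assert old * 2 ** 3 <= new < (old + 1) * 2 ** 3
```

Because the new partition is just the old one with extra low-order hash bits appended, a correctly manipulated ring keeps every database on the device it already occupies, and relinking is a rename within the device.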
I'd encourage you to share any modifications on gerrit even if you're not interested in polishing them for contributions upstream; others could potentially use your work as a starting point. The benefit to you would be mostly in the attention of people that have gone through this process for object part power increase might be able to spot issues with your attempted approach (i.e. bugs) or contribute some tests that may help ensure your migration goes smoothly/quickly.
About our cluster, we are gathering information about the production environment, like exactly how many accounts/containers we have.
totals are good, also number of databases per disk and per partition (max, min, avg, std-dev)
I can come back later with useful information when I have it!
I appreciate it!
That'll be great fun! I appreciate you taking on such an exciting endeavor and sharing your experience with the community.