My only operational experience with part power increases is object rings. You’re in uncharted territory.
In your experience, can having >100 parts per device impact overall cluster health/performance?
It’s an order of magnitude thing: 10s is bad, 100s is good, 1000s is OK, 10Ks is bad again. The issue is mostly the time spent scanning the file system by concurrent processes. In my biggest clusters, even with sharded databases, we probably only have maybe 50-300K DBs in the whole cluster, spread across hundreds of NVMe drives. Millions of parts wouldn’t make sense.

Clay Gerrard

On Thu, Dec 21, 2023 at 2:01 PM Thiago De Moraes Teixeira <teixeira.thiago@luizalabs.com> wrote:
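A back-of-envelope sketch of the database-count math above (the numbers are illustrative assumptions, not measurements): with only a few hundred thousand account/container databases, millions of partitions means most partition directories are empty, yet the consistency daemons still have to walk all of them.

```python
# Illustrative only: average databases per partition for a given part
# power. 300K DBs is at the top of the range Clay mentions.
def dbs_per_partition(total_dbs, part_power):
    """Average databases landing in each of the 2**part_power partitions."""
    return total_dbs / 2 ** part_power

print(round(dbs_per_partition(300_000, 18), 2))  # -> 1.14
print(round(dbs_per_partition(300_000, 21), 2))  # -> 0.14 (mostly empty dirs)
```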
I think that makes sense for your use-case. Last time we did a part power increase we did *3* in a row. Each increase would disable replication while relinker/cleanup was running - then we'd let replication run for ~24 hours to address any inconsistencies, swap out failed drives etc. Then repeat the whole process again - it was pretty intensive for ops, but they got good at it and our users didn't notice.
This part power increase was on the object ring, right? What concerns me about this whole increase in your case is the ~24 hours needed to make sure everything is OK. My hope is that the account/container database files are small, on the kB/MB scale.
totals are good, also number of databases per disk and per partition (max, min, avg, std-dev)
Unfortunately I can't get this information yet, but it is a critical piece of information for this process.
you can probably stop once you have >100 parts per device. But I'm only "recommending" it because that's what's supported and what we've seen work in the past.
It's good to hear about your experience. I recently read in the SUSE documentation <https://documentation.suse.com/soc/9/html/suse-openstack-cloud-clm-all/modify-input-model.html#selecting-partition-power> that 100 parts per device is ideal, and now you've said the same. In your experience, can having >100 parts per device impact overall cluster health/performance?
On Thu, Dec 21, 2023 at 3:48 PM Clay Gerrard <clay.gerrard@gmail.com> wrote:
On Thu, Dec 21, 2023 at 11:28 AM Thiago De Moraes Teixeira <teixeira.thiago@luizalabs.com> wrote:
do that inside a maintenance window, so we don't have to worry about availability issues during the process.
More than that, I would encourage you to disable writes; ideally just turn off all your proxies. You should probably also stop the account/container consistency daemons - or teach them how to abort/skip rings marked for a part power increase, like the object services do:
https://github.com/NVIDIA/swift/blob/master/swift/obj/replicator.py#L911-L91...
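The guard the linked replicator code implements could be sketched roughly like this. The `Ring` stand-in below is an assumption for illustration only; the real class is `swift.common.ring.Ring`, which exposes `part_power` and `next_part_power` from the ring file.

```python
# Hedged sketch: daemons inspect the ring for a pending partition power
# increase and skip work while one is in flight.
class Ring:  # stand-in for swift.common.ring.Ring (assumed attributes)
    def __init__(self, part_power, next_part_power=None):
        self.part_power = part_power
        self.next_part_power = next_part_power

def part_power_increase_active(ring):
    """True while a part power increase is prepared but not finalized."""
    return bool(ring.next_part_power) and \
        ring.next_part_power != ring.part_power

print(part_power_increase_active(Ring(8, next_part_power=9)))  # -> True
print(part_power_increase_active(Ring(9)))                     # -> False
```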
your recommendation is to do this increase one part power at a time, like 8->9; 9->10; ... 19->20?
I mean, going from 8->9 already *doubles* your number of partitions - exponential growth goes pretty fast. So even if you only have 1 part per device at 8, that's 2 (9), 4 (10), 8 (11), 16 (12), 32 (13), 64 (14), 128 (15) - you can probably stop once you have >100 parts per device. But I'm only "recommending" it because that's what's supported and what we've seen work in the past.
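The doubling arithmetic above can be sketched directly: starting from 1 part per device at part power 8, each increase doubles the count until it crosses the ~100 target.

```python
# Each +1 to part power doubles partitions (and so parts per device,
# with the device count held fixed).
parts_per_device, power = 1, 8
while parts_per_device <= 100:
    power += 1
    parts_per_device *= 2
    print(f"part power {power}: {parts_per_device} parts/device")
# -> ends at part power 15 with 128 parts/device
```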
One idea I've just thought of, besides this adaptation to the swift ring builder, is to add an input argument that maps to the new part power value (still considering whether this makes sense...)
I think that makes sense for your use-case. Last time we did a part power increase we did *3* in a row. Each increase would disable replication while relinker/cleanup was running - then we'd let replication run for ~24 hours to address any inconsistencies, swap out failed drives etc. Then repeat the whole process again - it was pretty intensive for ops, but they got good at it and our users didn't notice.
Since account/container doesn't support writes to relinking databases - you're going to be taking the cluster offline. Depending on how your ring management is automated it might be better to do the full relink from part power 8->X in one ring push. Christian is the expert but I *think* however many times you want to split a part you *should* be able to assign those new parts to the device they're actually on.
If you're doing say 3 increases at once instead of part 1 splitting to 2, 3 - it'd split to [8, 9, 10, 11, 12, 13, 14, 15] - and you'd want to make sure all those parts are assigned to each of the devices currently holding a replica of 1 (you should double check my math: https://paste.openstack.org/show/b0TKpjHzXSs7z8xrSVtK/)
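The split math described above can be sketched as follows (double-check this against the paste and Swift's own ring utilities, as Clay suggests): after k part power increases, old partition p maps onto the contiguous block of new partitions [p * 2**k, (p + 1) * 2**k - 1].

```python
# Hedged sketch of how one old partition splits across k increases.
def split_partitions(old_part, increases):
    width = 2 ** increases
    return list(range(old_part * width, (old_part + 1) * width))

print(split_partitions(1, 1))  # -> [2, 3]
print(split_partitions(1, 3))  # -> [8, 9, 10, 11, 12, 13, 14, 15]
```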
The second step, after you get your ring figured out, is to traverse the account/container device's filesystem trees and for each database calculate its new partition assignment (each individual database in part 1 will only go into one of the new parts). Then you just move it into the right dir on the same device - since if you did your ring manipulation correctly, it should already belong on that device. You can probably use the existing object-relinker as inspiration, but it would require heavy modification to work with account and container databases:
https://github.com/NVIDIA/swift/blob/master/swift/cli/relinker.py
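A minimal sketch of the partition calculation, to illustrate why the move stays on the same device: Swift derives an item's partition from the top bits of an MD5 over its hashed path, so a database's new partition after an increase is fully determined by its old one. This is illustrative only, not the real relinker; the `HASH_PATH_*` values are placeholder assumptions - a real cluster must use its own `swift_hash_path_prefix`/`swift_hash_path_suffix` from swift.conf.

```python
import hashlib
import struct

HASH_PATH_PREFIX = b''          # assumption: your swift_hash_path_prefix
HASH_PATH_SUFFIX = b'changeme'  # assumption: your swift_hash_path_suffix

def get_part(part_power, account, container=None):
    """Simplified version of ring partition lookup for a/c paths."""
    path = '/' + '/'.join(p for p in (account, container) if p)
    key = HASH_PATH_PREFIX + path.encode('utf-8') + HASH_PATH_SUFFIX
    digest = hashlib.md5(key).digest()
    # top (part_power) bits of the first 4 digest bytes
    return struct.unpack('>I', digest[:4])[0] >> (32 - part_power)

old = get_part(8, 'AUTH_test', 'mycontainer')
new = get_part(11, 'AUTH_test', 'mycontainer')
# after 3 increases, the new part always falls in the old part's split range
assert old * 2 ** 3 <= new < (old + 1) * 2 ** 3
```

Because the new partition is just the old one with extra low-order hash bits appended, a correctly manipulated ring keeps every database on the device it already occupies, and relinking is a rename within the device.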
I'd encourage you to share any modifications on gerrit even if you're not interested in polishing them for contributions upstream; others could potentially use your work as a starting point. The benefit to you would be mostly in the attention of people that have gone through this process for object part power increase might be able to spot issues with your attempted approach (i.e. bugs) or contribute some tests that may help ensure your migration goes smoothly/quickly.
About our cluster, we are gathering information about the production environment, like exactly how many accounts/containers we have.
totals are good, also number of databases per disk and per partition (max, min, avg, std-dev)
I can come back later with useful information when I have it!
I appreciate it!
That'll be great fun! I appreciate you taking on such an exciting endeavor and sharing your experience with the community.