[swift] Increase part power on account/container ring
Morning,

Here we have a cluster with an accounts/containers ring with part power 8 and an objects ring with part power 20 (built later). Now we want to migrate the 'old' accounts/containers ring to a new one with a higher part power, like 20. Is there a better way to do this?

I'm doing some crazy tests with SAIO and multiple storage nodes, based on building a new ring with part power 20, swapping the old files (account/container.ring.gz) with the new ones, and letting the replicators do their job, moving the *.db files to their new home partitions. Until today these tests worked fine, but I don't know how to do it on a large scale, with some terabytes of data. Just wondering about some ways to do that :)

And before/after this change, is there any command in swift I can use to audit this entire process, ensuring that the cluster data remains the same?

Thanks!
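For context on what "new home partition" means here: a database's partition is just the top part_power bits of the MD5 hash of its account/container path. A minimal sketch of the mapping, modeled loosely on what swift.common.ring.Ring.get_nodes computes (the hash prefix/suffix values below are placeholders; a real cluster reads them from swift.conf):

    import struct
    from hashlib import md5

    # Placeholder values: a real cluster reads these from the
    # [swift-hash] section of /etc/swift/swift.conf.
    HASH_PATH_PREFIX = b''
    HASH_PATH_SUFFIX = b'changeme'

    def get_part(part_power, account, container=None):
        # The partition is the top `part_power` bits of the path hash.
        path = '/' + account + ('/' + container if container else '')
        digest = md5(HASH_PATH_PREFIX + path.encode() + HASH_PATH_SUFFIX).digest()
        return struct.unpack_from('>I', digest)[0] >> (32 - part_power)

    # The same container lands in a different (but predictable) partition
    # under the new part power, which is why swapping the ring makes the
    # replicators move every *.db file:
    print(get_part(8, 'AUTH_test', 'images'))   # old home partition
    print(get_part(20, 'AUTH_test', 'images'))  # new home partition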
On Wed, 20 Dec 2023 09:33:50 -0300 Thiago De Moraes Teixeira <teixeira.thiago@luizalabs.com> wrote:
... I'm doing some crazy tests with SAIO and multiple storage nodes, based on building a new ring with part power 20, swapping the old files (account/container.ring.gz) with the new ones, and letting the replicators do their job, moving the *.db files to their new home partitions.
I don't see a show-stopper if you do it while the cluster is not available to client requests, in a maintenance window. Normally Swift is intended to run with zero downtime for the lifetime of a cluster.

The observable problem is the window when your rings are switched over but the container DBs are not yet moved. The proxy cannot find them at the new place and gives a 404. The same is true for the updaters, I believe. You're risking losing track of container and account stats.

If you quiesce the cluster with respect to the updaters, expirers, and clients, then remaking the rings outright ought to become possible.

However, I never tried what you're doing. I suggest you engage the attention of people who have thought about all the issues with partition power changes - Christian Schwede, Clay Gerrard, maybe Alistair too. There may be something that we're not considering.

-- Pete
This is an interesting challenge. To my knowledge no one has ever done a part power increase on an account/container ring. There is native support for online part power increases on object data rings. It seems you're familiar with the general idea: https://docs.openstack.org/swift/latest/ring_partpower.html

I second Pete's suggestion for a maintenance window with the proxies disabled. The first object part power increases were also performed in offline mode, before the relinker-aware object server code was added.

The account and container databases are in theory a little easier than an object-layer part power increase, since the replication model is already per item instead of per partition. But I might recommend you consider a relink-based approach with a doubled-part-count ring to minimize downtime, instead of "just" swapping out the ring and waiting on replication.

The first step would be adapting the swift-ring-builder prepare_increase_partition_power command to work on account and container rings. The main advantage of a placement-aware ring part power increase is that when part 1 gets split into 2 and 3, the new parts will be assigned to the same device, making the relink/move operation much more IO efficient.

I'd love to review any more details you can share about your plan or your cluster. While most folks are probably going to be logging off for the holidays for the next couple of weeks, you can probably find some of us in IRC for more real-time Q&A.

Good luck!

Clay Gerrard
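The reason the split is so clean: the partition is the top P bits of a 32-bit hash prefix, so going from power P to P+1 just exposes one more bit, and every hash in old part p lands in new part 2p or 2p+1. A quick self-contained check of that arithmetic (not Swift code):

    # At part power 3, old part 1 covers hashes 0x20000000-0x3fffffff.
    for h in (0x20000000, 0x2fffffff, 0x3fffffff):
        old_part = h >> (32 - 3)
        new_part = h >> (32 - 4)
        assert old_part == 1 and new_part in (2 * old_part, 2 * old_part + 1)
        print(f'hash {h:#010x}: part {old_part} -> part {new_part}')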
... I second Pete's suggestion for a maintenance window with the proxies disabled. The first object part power increases were also performed in offline mode, before the relinker-aware object server code was added.
Yeah, our first idea is to do that inside a maintenance window, so we don't have to worry about availability issues during the process.
... The first step would be adapting the swift-ring-builder prepare_increase_partition_power command to work on account and container rings. The main advantage of a placement-aware ring part power increase is that when part 1 gets split into 2 and 3, the new parts will be assigned to the same device, making the relink/move operation much more IO efficient.
About this, it's helpful advice and a good starting point for a safe increase. After adapting the swift ring builder to handle account and container rings, is your recommendation to do this increase one part power at a time, like 8->9, 9->10, ... 19->20?

One idea I've just thought of, besides this adaptation to the swift ring builder, is to add an input argument to map the new part power value (thinking about whether this makes sense...).

About our cluster, we are gathering information about the production environment, like exactly how many accounts/containers we have, to consider our options for this change. I can come back later with useful information when I have it!

Appreciate!
On Thu, Dec 21, 2023 at 11:28 AM Thiago De Moraes Teixeira <teixeira.thiago@luizalabs.com> wrote:
do that inside a maintenance window, so we don't have to worry about availability issues during the process.
More than that, I would encourage you to disable writes; ideally just turn off all your proxies. You should probably also stop the account/container consistency daemons - or teach them how to abort/skip rings marked for a part power increase like the object services do: https://github.com/NVIDIA/swift/blob/master/swift/obj/replicator.py#L911-L91...
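The guard in the linked object replicator code is only a few lines; a sketch of the equivalent check for an account/container daemon might look like the following. This assumes the loaded Ring exposes next_part_power the way object rings do after prepare_increase_partition_power; replication_allowed is a hypothetical helper, not existing Swift code.

    def replication_allowed(ring, logger):
        # next_part_power is set on a ring by
        # `swift-ring-builder ... prepare_increase_partition_power` and
        # cleared when the increase is finished; while it is set, the
        # partition layout is in flux and the daemon should back off.
        if ring.next_part_power is not None:
            logger.warning('next_part_power is set in the ring; skipping '
                           'replication pass during part power increase')
            return False
        return True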
is your recommendation to do this increase one part power at a time, like 8->9, 9->10, ... 19->20?
I mean going from 8->9 already *doubles* your number of partitions. Exponential goes pretty fast. So even if you only have 1 part per device at power 8, you'd have 2 at 9, 4 at 10, 8 at 11, 16 at 12, 32 at 13, 64 at 14, 128 at 15 - you can probably stop once you have >100 parts per device. But I'm only "recommending" it because that's what's supported and what we've seen work in the past.
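The arithmetic behind that progression, with a hypothetical device count chosen so that part power 8 works out to exactly 1 part per device:

    replicas, devices = 3, 768  # hypothetical cluster size
    for power in range(8, 16):
        parts_per_device = (2 ** power) * replicas / devices
        print(f'power {power}: {parts_per_device:g} parts/device')
    # power 8 -> 1, 9 -> 2, 10 -> 4, ... 15 -> 128; past ~100 you can stop.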
One idea I've just thought of, besides this adaptation to the swift ring builder, is to add an input argument to map the new part power value (thinking about whether this makes sense...)
I think that makes sense for your use-case. Last time we did a part power increase we did *3* in a row. Each increase would disable replication while relinker/cleanup was running - then we'd let replication run for ~24 hours to address any inconsistencies, swap out failed drives, etc. Then repeat the whole process again - it was pretty intensive for ops, but they got good at it and our users didn't notice.

Since account/container doesn't support writes to relinking databases - you're going to be taking the cluster offline. Depending on how your ring management is automated, it might be better to do the full relink from part power 8->X in one ring push. Christian is the expert, but I *think* however many times you want to split a part you *should* be able to assign those new parts to the device they're actually on. If you're doing, say, 3 increases at once, then instead of part 1 splitting to 2, 3 - it'd split to [8, 9, 10, 11, 12, 13, 14, 15] - and you'd want to make sure all those parts are assigned to each of the devices currently holding a replica of 1 (you should double check my math: https://paste.openstack.org/show/b0TKpjHzXSs7z8xrSVtK/)

The second step, after you get your ring figured out, is to traverse the account/container devices' filesystem trees and for each database calculate what the new partition assignment is (each individual database in part 1 will only go into one of the new parts). Then you just move it into the right dir on the same device - since if you did your ring manipulation correctly it should already belong on that device. You can probably use the existing object-relinker as inspiration, but it would require heavy modification to work with account and container databases: https://github.com/NVIDIA/swift/blob/master/swift/cli/relinker.py

I'd encourage you to share any modifications on gerrit even if you're not interested in polishing them for contribution upstream; others could potentially use your work as a starting point. The benefit to you would mostly be the attention of people who have gone through this process for object part power increases; they might be able to spot issues with your attempted approach (i.e. bugs) or contribute some tests that may help ensure your migration goes smoothly/quickly.

About our cluster, we are gathering information about the production environment, like exactly how many accounts/containers we have

totals are good, also number of databases per disk and per partition (max, min, avg, std-dev)

I can come back later with useful information when I have it! Appreciate!
That'll be great fun! I appreciate you taking on such an exciting endeavor and sharing your experience with the community.
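To make the multi-increase split arithmetic from the message above concrete (a restatement of the idea, not the contents of Clay's paste link; split_parts is a hypothetical helper):

    def split_parts(old_part, n_increases):
        # Each unit of part power appends one bit to the partition, so
        # n increases map old part p to the contiguous new range
        # [p * 2**n, (p + 1) * 2**n).
        return list(range(old_part << n_increases,
                          (old_part + 1) << n_increases))

    assert split_parts(1, 1) == [2, 3]
    assert split_parts(1, 3) == [8, 9, 10, 11, 12, 13, 14, 15]
    # A placement-aware builder would assign all of parts 8-15 to the
    # devices currently holding replicas of part 1, so the later
    # relink/move never has to cross devices.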
I think that makes sense for your use-case. Last time we did a part power increase we did *3* in a row. Each increase would disable replication while relinker/cleanup was running - then we'd let replication run for ~24 hours to address any inconsistencies, swap out failed drives etc. Then repeat the whole process again - it was pretty intensive for ops, but they got good at it and our users didn't notice.
This part power increase was on the object ring, right? What concerns me about all this is, in your case, the ~24h needed to make sure everything is OK. My hope is that the account/container database files are small, on the kB/MB scale.
totals are good, also number of databases per disk and per partition (max, min, avg, std-dev)
Unfortunately I can't get this information yet, but it is a critical piece of information for this process.
you can probably stop once you have >100 parts per device. But I'm only "recommending" it because that's what's supported and what we've seen work in the past.
It's good to hear your experience. I read recently in the SUSE documentation <https://documentation.suse.com/soc/9/html/suse-openstack-cloud-clm-all/modify-input-model.html#selecting-partition-power> that ~100 parts per device is ideal, and now you say so too. In your experience, can having >100 parts per device impact overall cluster health/performance?
My only operational experience with part power increases is with object rings. You're in uncharted territory.
In your experience, can having >100 parts per device impact overall cluster health/performance?
It's an order of magnitude thing. 10s is bad, 100s is good, 1000s is OK, 10Ks is bad again. The issue is mostly the time the consistency processes spend scanning the file system. In my biggest clusters, even with sharded databases, we probably only have maybe 50-300K DBs in the whole cluster, spread across hundreds of NVMe drives. Millions of parts wouldn't make sense.

Clay Gerrard
Hey, sorry for the ghosting these last months. Just to share our solution for this problem:

Unfortunately we couldn't develop an "increase part power" for the account/container ring due to deadlines.

A cluster overview: we have 20 hosts holding accounts/containers/objects and 30 hosts with only an object-2 ring. We created new account/container rings with part power 20, built on top of the disks of those 30 hosts that only have objects. Then we walked over the accounts/containers directories on the 20 hosts, recalculated the path for each hash dir, and rsynced it to the 3 right places in the new ring, among the 30 new-ring hosts. After this, we just switched the swift-proxy over to the new account/container rings, looking for the new places on the 30 hosts. A further step is to clean up the old disks and add them to the new ring.

Thank you all for your comments and help with this!
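For readers following along, a rough sketch of the path-recalculation step described above, assuming the standard on-disk layout /srv/node/<device>/containers/<partition>/<suffix>/<hash>/<hash>.db and the new part power 20 ring deployed in /etc/swift. This is an illustration of the idea, not the actual migration script:

    import os
    import struct
    from binascii import unhexlify

    from swift.common.ring import Ring

    new_ring = Ring('/etc/swift', ring_name='container')  # the 2**20-part ring

    def new_locations(db_path):
        # Yield (ip, destination path) rsync targets for one container DB.
        hsh = os.path.basename(db_path)[:-len('.db')]
        # The partition is the top bits of the hash dir name (hex md5).
        part = struct.unpack_from('>I', unhexlify(hsh))[0] >> new_ring._part_shift
        suffix = hsh[-3:]
        for node in new_ring.get_part_nodes(part):
            dest = os.path.join('/srv/node', node['device'], 'containers',
                                str(part), suffix, hsh,
                                os.path.basename(db_path))
            yield node['ip'], dest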
Thanks for the update! Seems like a perfectly reasonable migration, and you executed it well. I'm glad to hear it worked out. We've now heard of at least one team that managed to pull off an account/container part power migration - KUDOS!
--
Clay Gerrard
210 788 9431