[Swift] Rebalancing EC question
Hello,

We were working on expanding one of our clusters (Ussuri on Ubuntu 18.04) and are wondering about the rebalance behavior of swift-ring-builder. When we run it in debug mode on a 15/4 EC ring, we see this message about "Unable to finish rebalance plan after 2 attempts" and 100% of partitions being reassigned.

    DEBUG: Placed 10899/2 onto dev r1z3-10.40.48.72/d10
    DEBUG: Placed 2183/3 onto dev r1z5-10.40.48.76/d11
    DEBUG: Placed 1607/1 onto dev r1z3-10.40.48.70/d28
    DEBUG: Assigned 32768 parts
    DEBUG: Gather start is 10278 (Last start was 25464)
    DEBUG: Unable to finish rebalance plan after 2 attempts
    Reassigned 32768 (100.00%) partitions. Balance is now 63.21. Dispersion is now 0.00
    -------------------------------------------------------------------------------
    NOTE: Balance of 63.21 indicates you should push this ring, wait at least 1 hours, and rebalance/repush.
    -------------------------------------------------------------------------------

Moving 100% seems scary. What does that mean in this situation? Is this message shown because one fragment from every partition is moved, and that is the most the builder can do per rebalance because the fragments are technically the same partition? When we compare the swift-ring-builder output (partitions per device) between rebalances, we can see some partitions move each time until we no longer see the push/wait/rebalance message. So it is not really moving 100% of partitions.

Reid
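(For reference, debug output like the above comes from running the rebalance with debug logging enabled, roughly as sketched below; the builder file name object-1.builder is only illustrative and will differ per deployment.)

    # rebalance the EC ring with verbose planning output
    swift-ring-builder object-1.builder rebalance --debug

    # afterwards, dump the builder summary to compare partitions per
    # device between rebalances
    swift-ring-builder object-1.builder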
Not scary!

Because you have a 15/4 EC policy, we say each partition has 19 "replicas". And since rebalance will only move at most one "replica" of any partition at each rebalance, up to 100% of your partitions may have at least one replica assignment move.

That means that after you push out this ring, 100% of your object GET requests will find at most one "replica" out of place. But that's ok! In a 15/4 you only need 15 EC fragments to respond successfully, and you have 18 total fragments that did NOT get reassigned.

It's unfortunate the language is a little ambiguous, but it is talking about the % of *partitions* that had a replica moved. Since each object resides in a single partition, the % of partitions affected most directly communicates the % of client objects affected by the rebalance. We do NOT display the % of *partition-replicas* moved because, while that number would be smaller, it could never be 100% thanks to the restriction that only one "replica" may move.

When doing a large topology change - particularly with EC - it may be the case that more than one replica of each part will need to move (imagine doubling your capacity into a second zone on an 8+4 ring), so it'll take a few cranks. Eventually you'll want to have moved 6 replicas of each part (6 in z1 and 6 in z2), but if we allowed you to move six replicas of 100% of your parts, you'd only have 6 of the 8 fragments required to service reads!

Protip: when you push out the new ring you can turn on handoffs_only mode for the reconstructor for a little while to get things rebalanced MUCH more quickly - just don't forget to turn it off!

(sending second time because I forgot to reply all to the list)
-- Clay Gerrard
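For anyone looking for the knob mentioned in the protip above: handoffs_only is set in the [object-reconstructor] section of object-server.conf on the object nodes. A minimal sketch, with all surrounding options omitted:

    [object-reconstructor]
    # temporarily only revert partitions sitting on handoff locations back
    # to their primaries, skipping the usual sweep of primary devices
    handoffs_only = True

Remember to set it back to false (the default) and restart the reconstructor once the rebalance settles, since handoffs_only skips the reconstructor's normal fragment-repair work.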
Thanks for that explanation. It is clear now how the rebalancing works for EC policies. We have adjusted our reconstruction workers to speed up the rebalance and it seems to have helped: it went from weeks to days.

Reid Guyett
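(The worker tuning described here is presumably done through the reconstructor options in the same [object-reconstructor] section; a sketch of the kind of change involved, where the values are only examples rather than recommendations:)

    [object-reconstructor]
    # run multiple reconstructor worker processes, each handling a
    # subset of the node's local devices
    reconstructor_workers = 4
    # parallel reconstruction jobs within each worker
    concurrency = 8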