[swift]: Swiftstack Zone/DC decommission
Hello Team,

We currently use SwiftStack Object Storage to meet our object storage requirements, and our cluster has a Primary Zone/Secondary Zone topology. Recently, we attempted to remove/disable the Secondary Zone. Upon doing so, we observed a concerning trend: storage utilization in the Primary Zone rapidly increased, reaching approximately 90%. This surge was particularly notable given our use of the 'tworeplicaperregion' policy.

In an attempt to address this, we brought the Secondary Zone nodes back online, expecting Swift to redistribute the data back to these nodes and relieve the storage strain on the Primary Zone. Despite reintegrating the Secondary Zone nodes into the cluster, we continue to see a persistent rise in storage utilization within the Primary Zone. Interestingly, two of the three nodes in the Primary Zone seem to bear the brunt of the data and tokens, while the third node is not receiving a proportional share of the data.

At this juncture, we are seeking guidance on potential configurations or adjustments - specific knobs - that could expedite the rebalance process. Our objective is to distribute the data efficiently across nodes, particularly in the Primary Zone, and optimize overall storage utilization in our SwiftStack Object Storage cluster. Any insights or recommendations on tuning parameters for a swifter rebalance would be highly appreciated.

Thanks,
Murali
(Accidentally replied directly instead of to the list.)

On Mon, Nov 13, 2023 at 10:43 AM Muralikrishna Gutha <mgutha@liveperson.com> wrote:
Recently, we attempted to remove/disable the Secondary Zone. However, upon doing so, we observed [..] the storage utilization in the Primary Zone rapidly increased
This seems reasonable - what would have been your expected behavior?
we brought the Secondary Zone nodes back online, expecting Swift to redistribute the data back to these nodes
You may be affected by min_part_hours - once a replica of a part moves, you can't move another replica of that same part in an immediately subsequent rebalance.
Interestingly, two out of the three nodes in the Primary Zone seem to bear the brunt of the data and tokens, while the third node is not receiving a proportional distribution of data.
It may be necessary to share more specifics from your cluster topology - the zone & server & device weights
At this juncture, we are seeking guidance on potential configurations or adjustments, such as specific knobs, that could expedite the rebalance process.
Perhaps you can reduce min_part_hours, but there is some risk of unavailability until replication catches up with the new ring placement.
Our objective is to efficiently distribute the data across nodes
That is typically the goal of the ring rebalance algorithm - balance is a primary driver - but there are some constraints, such as moving only one replica of a part per rebalance and maximizing dispersion.

min_part_hours is a property set on the builder; you can view it with the swift-ring-builder CLI command:

[root@cloud 1fe628ae-2c4b-4e29-a995-d62021a17bd8]# pwd
/opt/ss/builder_configs/1fe628ae-2c4b-4e29-a995-d62021a17bd8
[root@cloud 1fe628ae-2c4b-4e29-a995-d62021a17bd8]# swift-ring-builder object.builder | head
object.builder, build version 165, id 90011ec6baaf4973a317771d3da63976
1024 partitions, 3.000000 replicas, 1 regions, 3 zones, 32 devices, 10.42 balance, 11.65 dispersion
The minimum number of hours before a partition can be reassigned is 0 (0:00:00 remaining)
The overload factor is 10.00% (0.100000)
Ring file object.ring.gz not found, probably it hasn't been written yet
Devices:   id  region  zone      ip address:port  replication ip:port  name  weight  partitions  balance  flags  meta
           24       1     1  192.168.25.156:6000  192.168.28.175:6003   d24    4.19          87    -9.37
           25       1     1  192.168.25.156:6006  192.168.28.175:6003   d25    4.19          87    -9.37
           26       1     1  192.168.25.156:6007  192.168.28.175:6003   d26    4.19          87    -9.37
           27       1     1  192.168.25.156:6008  192.168.28.175:6003   d27    4.19          86   -10.42

Sometimes it's reasonable to set it to zero - especially if you're monitoring handoff parts during a rebalance and being careful about when you rebalance and push rings.

If you have a legacy SwiftStack support contact you can reach out for advice - we might be able to help!

--
Clay Gerrard
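A minimal sketch of the min_part_hours workflow discussed above, assuming direct shell access to the builder file (on a SwiftStack-managed cluster the controller normally owns ring generation, so manual changes to the builder may be overwritten by the next controller-driven push):

  # show the current builder state, including min_part_hours and the time remaining
  swift-ring-builder object.builder

  # allow any replica of a partition to be reassigned on the next rebalance
  swift-ring-builder object.builder set_min_part_hours 0

  # recompute placement and write out a new object.ring.gz
  swift-ring-builder object.builder rebalance
  swift-ring-builder object.builder validate

As noted above, this is only reasonable if you monitor handoff partitions while replication catches up, and are careful about when you rebalance and push rings.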
Clay,

Thanks for the update. I was unable to locate the object.builder file on the instance where we are running SwiftStack. We are currently on Swift package version 2.24.0.3-1.el7.

[root@hostname swift]# /opt/ss/bin/swift-ring-builder object.builder | head
Ring Builder file does not exist: object.builder
[root@hostname swift]# locate object.builder
[root@hostname swift]# pwd
/etc/swift
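One quick way to locate the builder files is to look on the SwiftStack controller host rather than on the Swift nodes themselves, assuming the /opt/ss/builder_configs layout shown earlier in the thread; something like:

  find /opt/ss/builder_configs -name '*.builder'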
Never mind - I see that from the controller node, not from the Swift nodes themselves.

[root@hostname builder_configs]# ls -ltr
total 0
drwxr-xr-x 3 ss-service ss-service 133 Nov 13 05:33 1972ccbf-a22c-4f92-994d-bc11ff1cf5a6
drwxr-xr-x 3 ss-service ss-service 110 Nov 13 16:23 ce40026c-9c3f-4079-a5f0-abee07053319
drwxr-xr-x 3 ss-service ss-service 110 Nov 13 16:23 9eefd792-8f9a-408b-9f2f-df935762ae89
drwxr-xr-x 3 ss-service ss-service 156 Nov 13 16:23 782f0361-9132-4231-b433-efdadd3938f3
[root@hostname builder_configs]# cd 1972ccbf-a22c-4f92-994d-bc11ff1cf5a6
[root@hostname 1972ccbf-a22c-4f92-994d-bc11ff1cf5a6]# swift-ring-builder object.builder | head
object.builder, build version 728, id c88aa44a0f1748eaa6e0cb9b9f23a7d4
65536 partitions, 2.000000 replicas, 1 regions, 1 zones, 25 devices, 2.02 balance, 0.00 dispersion
The minimum number of hours before a partition can be reassigned is 0 (0:00:00 remaining)
The overload factor is 10.00% (0.100000)
Ring file object.ring.gz not found, probably it hasn't been written yet
Devices:   id  region  zone     ip address:port  replication ip:port  name   weight  partitions  balance  flags  meta
          100       1     1  182.16.128.12:6000   182.16.128.12:6003  d100  1920.38       13029    -0.53
          101       1     1  182.16.128.12:6006   182.16.128.12:6003  d101  1920.38       13029    -0.53
          102       1     1  182.16.128.12:6007   182.16.128.12:6003  d102  1920.38       13028    -0.54
          103       1     1  182.16.128.12:6008   182.16.128.12:6003  d103  1920.38       13029    -0.53
[root@hostname 1972ccbf-a22c-4f92-994d-bc11ff1cf5a6]# cd ../ce40026c-9c3f-4079-a5f0-abee07053319/
[root@hostname ce40026c-9c3f-4079-a5f0-abee07053319]# swift-ring-builder object.builder | head
object.builder, build version 1308, id 4ba3e0fb83244f6e82c2140f0dd9c65f
65536 partitions, 2.000000 replicas, 2 regions, 2 zones, 48 devices, 10.05 balance, 3.29 dispersion
The minimum number of hours before a partition can be reassigned is 0 (0:00:00 remaining)
The overload factor is 10.00% (0.100000)
Ring file object.ring.gz not found, probably it hasn't been written yet
Devices:   id  region  zone    ip address:port  replication ip:port  name   weight  partitions  balance  flags  meta
          110       1     1   11.32.128.2:6000     11.32.128.2:6003  d110  1920.38        3325    -7.41
          111       1     1   11.32.128.2:6006     11.32.128.2:6003  d111  1920.38        3325    -7.41
          112       1     1   11.32.128.2:6007     11.32.128.2:6003  d112  1920.38        3325    -7.41
          129       1     1   11.32.128.2:6008     11.32.128.2:6003  d129   960.20        1664    -7.33
close failed in file object destructor:
sys.excepthook is missing
lost sys.stderr
[root@hostname ce40026c-9c3f-4079-a5f0-abee07053319]# cd ../9eefd792-8f9a-408b-9f2f-df935762ae89/
[root@hostname 9eefd792-8f9a-408b-9f2f-df935762ae89]# swift-ring-builder object.builder | head
object.builder, build version 356, id a6de7d241d624e30b3ddd00100e4f3e4
65536 partitions, 2.000000 replicas, 2 regions, 2 zones, 39 devices, 10.03 balance, 34.30 dispersion
The minimum number of hours before a partition can be reassigned is 0 (0:00:00 remaining)
The overload factor is 10.00% (0.100000)
Ring file object.ring.gz not found, probably it hasn't been written yet
Devices:   id  region  zone    ip address:port  replication ip:port  name   weight  partitions  balance  flags  meta
            0       1     1  11.160.72.12:6000    11.160.72.12:6003    d0  2000.40        3556    -1.90
            1       1     1  11.160.72.12:6006    11.160.72.12:6003    d1  2000.40        3555    -1.93
           10       1     1  11.160.72.12:6014    11.160.72.12:6003   d10  2000.40        3554    -1.95
            2       1     1  11.160.72.12:6007    11.160.72.12:6003    d2  2000.40        3556    -1.90
[root@hostname 9eefd792-8f9a-408b-9f2f-df935762ae89]# cd ../782f0361-9132-4231-b433-efdadd3938f3/
[root@hostname 782f0361-9132-4231-b433-efdadd3938f3]# swift-ring-builder object.builder | head
object.builder, build version 775, id 61b2166cb8684b2b80d6fc370cef37a6
65536 partitions, 2.000000 replicas, 2 regions, 2 zones, 95 devices, 38.89 balance, 0.00 dispersion
The minimum number of hours before a partition can be reassigned is 0 (0:00:00 remaining)
The overload factor is 10.00% (0.100000)
Ring file object.ring.gz not found, probably it hasn't been written yet
Devices:   id  region  zone     ip address:port  replication ip:port  name  weight  partitions  balance  flags  meta
          129       1     1  182.16.128.36:6000   182.16.128.36:6003  d129    1.00           2    38.89
          130       1     1  182.16.128.36:6006   182.16.128.36:6003  d130    1.00           2    38.89
          131       1     1  182.16.128.36:6007   182.16.128.36:6003  d131    1.00           2    38.89
          132       1     1  182.16.128.36:6008   182.16.128.36:6003  d132    1.00           2    38.89
I have observed an imbalance in the cluster based on the object.builder output. Is this possibly due to an ongoing rebalance process?

Additionally, I'd like to confirm my understanding of 'min_part_hours'. Is it accurate to say that this parameter represents the number of hours an object needs to reside on a disk before it's considered for movement to another device during a rebalance? According to the Swift documentation, the default value is 24 hours. I'm curious if this value is applied to data written specifically by the rebalance process or if it also includes data written by other Swift services.

object.builder, build version 775, id 61b2166cb8684b2b80d6fc370cef37a6
65536 partitions, 2.000000 replicas, 2 regions, 2 zones, 95 devices, 38.89 balance, 0.00 dispersion
The minimum number of hours before a partition can be reassigned is 0 (0:00:00 remaining)
The overload factor is 10.00% (0.100000)
Ring file object.ring.gz not found, probably it hasn't been written yet
Devices:   id  region  zone  ip address:port  replication ip:port  name  weight  partitions  balance  flags  meta
2 replicas in 2 regions is going to try VERY hard to put one replica of every partition on a server/device in each region. Unless you have EXACTLY the same number of GB/weight in each region, you're going to have some devices with more part-replicas assigned than other devices. The 10% overload you've configured is what allows this (trading balance for dispersion), and it usually makes sense to allow a little overload as long as the cluster isn't terribly full.

min_part_hours is not a data-plane concept; it only affects part-replica reassignment in the ring during rebalance (an offline, not runtime, process). Your ring has it set to 0, so each rebalance can reassign either replica of any partition if it would improve balance or dispersion.

A dispersion of zero is perfect, and a balance of 38 isn't necessarily that bad, but it might be if the cluster is mostly full. You can reduce overload to zero and rebalance to prioritize balance at the cost of dispersion (i.e. some partitions might have both replicas assigned to servers in the same region). The swift-ring-builder CLI tool has a dispersion subcommand that may help you inspect and understand the placement in your ring.

SwiftStack-managed clusters may require using the admin interface to set the ring's overload value in the management DB, as they won't necessarily pick up an overload you set on your builder via the swift-ring-builder CLI on the next ring push/rebalance.

On Tue, Nov 14, 2023 at 9:20 AM <mgutha@liveperson.com> wrote:
I have observed an imbalance in the cluster based on the object.builder output. Is this possibly due to an ongoing rebalance process?
Additionally, I'd like to confirm my understanding of 'min_part_hours.' Is it accurate to say that this parameter represents the number of hours an object needs to reside on a disk before it's considered for movement to another device during a rebalance? According to the Swift documentation, the default value is 24 hours. I'm curious if this value is applied to data written specifically by the rebalance process or if it also includes data written by other Swift services.
object.builder, build version 775, id 61b2166cb8684b2b80d6fc370cef37a6
65536 partitions, 2.000000 replicas, 2 regions, 2 zones, 95 devices, 38.89 balance, 0.00 dispersion
The minimum number of hours before a partition can be reassigned is 0 (0:00:00 remaining)
The overload factor is 10.00% (0.100000)
Ring file object.ring.gz not found, probably it hasn't been written yet
Devices:   id  region  zone  ip address:port  replication ip:port  name  weight  partitions  balance  flags  meta
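A minimal sketch of the overload/dispersion commands Clay mentions above, assuming direct access to the builder file (again, a SwiftStack-managed cluster may expect overload to be changed through the controller rather than on the builder directly):

  # report how well part-replicas are dispersed across regions/zones/servers
  swift-ring-builder object.builder dispersion --verbose

  # trade dispersion for balance: drop the overload from 10% to 0 and recompute placement
  swift-ring-builder object.builder set_overload 0
  swift-ring-builder object.builder rebalance

With overload at 0 the builder prioritizes balance, which, as noted above, can leave some partitions with both replicas in the same region.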
We have two regions: VA (r1z1) and CA (r2z2). When we attempted to remove the CA nodes from the cluster, we started to see VA storage shoot up due to the two-per-region replication policy. As I mentioned earlier, we have started the CA nodes back up and also added 3 more nodes (10 drives per node) to the existing VA cluster. What we are seeing is that the rebalance is not distributing data to the newly added hosts. va-newnode-0[1-3] are the new nodes that were added to the VA DC (r1z1) region. In addition, one of the old VA nodes, va-oldnode-02, does not have data evenly distributed; va-oldnode-02 had drive failures and we added the node back, but since then the rebalance has not yet completed.

object.builder, build version 775, id 61b2166cb8684b2b80d6fc370cef37a6
65536 partitions, 2.000000 replicas, 2 regions, 2 zones, 95 devices, 38.89 balance, 0.00 dispersion
The minimum number of hours before a partition can be reassigned is 0 (0:00:00 remaining)
The overload factor is 10.00% (0.100000)
Ring file object.ring.gz not found, probably it hasn't been written yet
Devices:   id  region  zone     ip address:port  replication ip:port  name   weight  partitions  balance  flags  meta
          129       1     1  va-newnode-01:6000     va-newip-01:6003  d129     1.00           2    38.89
          130       1     1  va-newnode-01:6006     va-newip-01:6003  d130     1.00           2    38.89
          131       1     1  va-newnode-01:6007     va-newip-01:6003  d131     1.00           2    38.89
          132       1     1  va-newnode-01:6008     va-newip-01:6003  d132     1.00           2    38.89
          133       1     1  va-newnode-01:6009     va-newip-01:6003  d133     1.00           2    38.89
          134       1     1  va-newnode-01:6010     va-newip-01:6003  d134     1.00           1   -30.56
          135       1     1  va-newnode-01:6011     va-newip-01:6003  d135     1.00           1   -30.56
          136       1     1  va-newnode-01:6012     va-newip-01:6003  d136     1.00           1   -30.56
          137       1     1  va-newnode-01:6013     va-newip-01:6003  d137     1.00           1   -30.56
          138       1     1  va-newnode-01:6014     va-newip-01:6003  d138     1.00           1   -30.56
          139       1     1  va-newnode-01:6015     va-newip-01:6003  d139     1.00           1   -30.56
          140       1     1  va-newnode-02:6000   va-newnode-02:6003  d140     1.00           2    38.89
          141       1     1  va-newnode-02:6006   va-newnode-02:6003  d141     1.00           2    38.89
          142       1     1  va-newnode-02:6007   va-newnode-02:6003  d142     1.00           2    38.89
          143       1     1  va-newnode-02:6008   va-newnode-02:6003  d143     1.00           2    38.89
          144       1     1  va-newnode-02:6009   va-newnode-02:6003  d144     1.00           2    38.89
          145       1     1  va-newnode-02:6010   va-newnode-02:6003  d145     1.00           1   -30.56
          146       1     1  va-newnode-02:6011   va-newnode-02:6003  d146     1.00           1   -30.56
          147       1     1  va-newnode-02:6012   va-newnode-02:6003  d147     1.00           1   -30.56
          148       1     1  va-newnode-02:6013   va-newnode-02:6003  d148     1.00           1   -30.56
          149       1     1  va-newnode-02:6014   va-newnode-02:6003  d149     1.00           1   -30.56
          150       1     1  va-newnode-02:6015   va-newnode-02:6003  d150     1.00           1   -30.56
          151       1     1  va-newnode-03:6000   va-newnode-03:6003  d151     1.00           2    38.89
          152       1     1  va-newnode-03:6006   va-newnode-03:6003  d152     1.00           2    38.89
          153       1     1  va-newnode-03:6007   va-newnode-03:6003  d153     1.00           2    38.89
          154       1     1  va-newnode-03:6008   va-newnode-03:6003  d154     1.00           2    38.89
          155       1     1  va-newnode-03:6009   va-newnode-03:6003  d155     1.00           2    38.89
          156       1     1  va-newnode-03:6010   va-newnode-03:6003  d156     1.00           1   -30.56
          157       1     1  va-newnode-03:6011   va-newnode-03:6003  d157     1.00           1   -30.56
          158       1     1  va-newnode-03:6012   va-newnode-03:6003  d158     1.00           1   -30.56
          159       1     1  va-newnode-03:6013   va-newnode-03:6003  d159     1.00           1   -30.56
          160       1     1  va-newnode-03:6014   va-newnode-03:6003  d160     1.00           1   -30.56
          161       1     1  va-newnode-03:6015   va-newnode-03:6003  d161     1.00           1   -30.56
           55       1     1  va-oldnode-01:6000   va-oldnode-01:6003   d55  2000.40        2789    -3.18
           56       1     1  va-oldnode-01:6006   va-oldnode-01:6003   d56  2000.40        2789    -3.18
           57       1     1  va-oldnode-01:6007   va-oldnode-01:6003   d57  2000.40        2789    -3.18
           58       1     1  va-oldnode-01:6008   va-oldnode-01:6003   d58  2000.40        2789    -3.18
           59       1     1  va-oldnode-01:6009   va-oldnode-01:6003   d59  2000.40        2789    -3.18
           60       1     1  va-oldnode-01:6010   va-oldnode-01:6003   d60  2000.40        2789    -3.18
           61       1     1  va-oldnode-01:6011   va-oldnode-01:6003   d61  2000.40        2789    -3.18
           62       1     1  va-oldnode-01:6012   va-oldnode-01:6003   d62  2000.40        2789    -3.18
           63       1     1  va-oldnode-01:6013   va-oldnode-01:6003   d63  2000.40        2789    -3.18
           64       1     1  va-oldnode-01:6014   va-oldnode-01:6003   d64  2000.40        2789    -3.18
           65       1     1  va-oldnode-01:6015   va-oldnode-01:6003   d65  2000.40        2788    -3.21
           44       1     1  va-oldnode-02:6000   va-oldnode-02:6003   d44   269.43         376    -3.09
           45       1     1  va-oldnode-02:6006   va-oldnode-02:6003   d45   269.43         376    -3.09
           46       1     1  va-oldnode-02:6007   va-oldnode-02:6003   d46   269.43         376    -3.09
           47       1     1  va-oldnode-02:6008   va-oldnode-02:6003   d47   269.43         376    -3.09
           48       1     1  va-oldnode-02:6009   va-oldnode-02:6003   d48   269.43         376    -3.09
           49       1     1  va-oldnode-02:6010   va-oldnode-02:6003   d49   269.43         376    -3.09
           50       1     1  va-oldnode-02:6011   va-oldnode-02:6003   d50   269.43         376    -3.09
           51       1     1  va-oldnode-02:6012   va-oldnode-02:6003   d51   269.43         375    -3.35
           52       1     1  va-oldnode-02:6013   va-oldnode-02:6003   d52   269.43         375    -3.35
           53       1     1  va-oldnode-02:6014   va-oldnode-02:6003   d53   269.43         375    -3.35
           54       1     1  va-oldnode-02:6015   va-oldnode-02:6003   d54   269.43         375    -3.35
            0       1     1  va-oldnode-03:6000   va-oldnode-03:6003    d0  2000.40        2789    -3.18
            1       1     1  va-oldnode-03:6006   va-oldnode-03:6003    d1  2000.40        2789    -3.18
           10       1     1  va-oldnode-03:6015   va-oldnode-03:6003   d10  2000.40        2788    -3.21
            2       1     1  va-oldnode-03:6007   va-oldnode-03:6003    d2  2000.40        2789    -3.18
            3       1     1  va-oldnode-03:6008   va-oldnode-03:6003    d3  2000.40        2789    -3.18
            4       1     1  va-oldnode-03:6009   va-oldnode-03:6003    d4  2000.40        2789    -3.18
            5       1     1  va-oldnode-03:6010   va-oldnode-03:6003    d5  2000.40        2789    -3.18
            6       1     1  va-oldnode-03:6011   va-oldnode-03:6003    d6  2000.40        2789    -3.18
            7       1     1  va-oldnode-03:6012   va-oldnode-03:6003    d7  2000.40        2789    -3.18
            8       1     1  va-oldnode-03:6013   va-oldnode-03:6003    d8  2000.40        2789    -3.18
            9       1     1  va-oldnode-03:6014   va-oldnode-03:6003    d9  2000.40        2789    -3.18
           11       2     2  ca-oldnode-01:6000   ca-oldnode-01:6003   d11  2000.40        2979     3.42
           12       2     2  ca-oldnode-01:6006   ca-oldnode-01:6003   d12  2000.40        2979     3.42
           13       2     2  ca-oldnode-01:6007   ca-oldnode-01:6003   d13  2000.40        2979     3.42
           15       2     2  ca-oldnode-01:6008   ca-oldnode-01:6003   d15  2000.40        2979     3.42
           16       2     2  ca-oldnode-01:6009   ca-oldnode-01:6003   d16  2000.40        2979     3.42
           17       2     2  ca-oldnode-01:6010   ca-oldnode-01:6003   d17  2000.40        2978     3.38
           18       2     2  ca-oldnode-01:6011   ca-oldnode-01:6003   d18  2000.40        2978     3.38
           19       2     2  ca-oldnode-01:6012   ca-oldnode-01:6003   d19  2000.40        2978     3.38
           20       2     2  ca-oldnode-01:6013   ca-oldnode-01:6003   d20  2000.40        2978     3.38
           21       2     2  ca-oldnode-01:6014   ca-oldnode-01:6003   d21  2000.40        2978     3.38
           77       2     2  ca-oldnode-01:6015   ca-oldnode-01:6003   d77  2000.40        2978     3.38
           33       2     2  ca-oldnode-02:6000   ca-oldnode-02:6003   d33  2000.40        2979     3.42
           34       2     2  ca-oldnode-02:6006   ca-oldnode-02:6003   d34  2000.40        2979     3.42
           35       2     2  ca-oldnode-02:6007   ca-oldnode-02:6003   d35  2000.40        2979     3.42
           36       2     2  ca-oldnode-02:6008   ca-oldnode-02:6003   d36  2000.40        2979     3.42
           37       2     2  ca-oldnode-02:6009   ca-oldnode-02:6003   d37  2000.40        2978     3.38
           38       2     2  ca-oldnode-02:6010   ca-oldnode-02:6003   d38  2000.40        2978     3.38
           39       2     2  ca-oldnode-02:6011   ca-oldnode-02:6003   d39  2000.40        2978     3.38
           40       2     2  ca-oldnode-02:6012   ca-oldnode-02:6003   d40  2000.40        2978     3.38
           41       2     2  ca-oldnode-02:6013   ca-oldnode-02:6003   d41  2000.40        2978     3.38
           42       2     2  ca-oldnode-02:6014   ca-oldnode-02:6003   d42  2000.40        2978     3.38
           43       2     2  ca-oldnode-02:6015   ca-oldnode-02:6003   d43  2000.40        2978     3.38
          162       2     2  ca-oldnode-03:6000   ca-oldnode-03:6003  d162     1.00           2    38.89
          163       2     2  ca-oldnode-03:6006   ca-oldnode-03:6003  d163     1.00           2    38.89
          164       2     2  ca-oldnode-03:6007   ca-oldnode-03:6003  d164     1.00           2    38.89
          165       2     2  ca-oldnode-03:6008   ca-oldnode-03:6003  d165     1.00           2    38.89
          166       2     2  ca-oldnode-03:6009   ca-oldnode-03:6003  d166     1.00           1   -30.56
          167       2     2  ca-oldnode-03:6010   ca-oldnode-03:6003  d167     1.00           1   -30.56
          168       2     2  ca-oldnode-03:6011   ca-oldnode-03:6003  d168     1.00           1   -30.56

Currently the rebalance is running and we see handoff partitions for all the policies.
Below is how our object policies are set up.

Object Policies (Replicated Policies):

Name                     Storage Policy Index  Part Power  Multi-Region  Replicas  Devices  Default  Deprecated
Standard-Replica         0                     16          No            2.0       95
two-per-region-replica   1                     16          No            4.0       95
two-replicas             2                     16          No            2.0       66
VirginiaOnly             3                     10          No            2.0       66
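Since each storage policy has its own ring, balance and dispersion need to be checked per policy. A small sketch of doing that from the controller's builder_configs directory, assuming the layout shown earlier in the thread (one UUID-named directory per ring, each containing its own object.builder):

  cd /opt/ss/builder_configs
  for d in */; do
      echo "== $d"
      # first two lines of the default output: builder id, then
      # partitions/replicas/regions/zones/devices/balance/dispersion
      swift-ring-builder "${d}object.builder" | head -2
  done

That makes it easy to see which policy's ring is carrying the 38.89 balance while the rebalance and replication catch up.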