<div dir="ltr"><div><div>Hi John,<br><br></div>Thanks for the explanation. Have a couple of more questions on this subject though.<br><br></div><div></div><div>1. "pretend_min_hours_passed" sounds like something that I could use. I'm okay if there is a chance of interruption in services to the user at this time, as long as it does not cause any data-loss or data-corruption.<br>
</div><div>2. It would be really useful if Swift could record pending rebalance operations somewhere and run them automatically later, once min_part_hours has elapsed.<br><br></div><div>Regards,<br></div><div>Shyam<br></div></div>
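In the meantime, a crude workaround along these lines might work as a sketch of point 2; the builder file path, node IPs, and the idea of driving it from cron are all assumptions, relying only on the fact that swift-ring-builder exits non-zero when it refuses a rebalance:

```shell
#!/bin/sh
# retry-rebalance.sh -- a hypothetical wrapper, run periodically (e.g. from cron).
# It retries the rebalance until min_part_hours has elapsed and the move is
# actually performed; the paths and IPs below are illustrative assumptions.
BUILDER=/etc/swift/object.builder

if swift-ring-builder "$BUILDER" rebalance; then
    # Rebalance succeeded; push the new ring out to the other nodes.
    for node in 10.3.0.212 10.3.0.222 10.3.0.232; do
        scp /etc/swift/object.ring.gz "$node":/etc/swift/
    done
else
    echo "rebalance deferred (min_part_hours not yet elapsed); will retry" >&2
fi
```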
<div class="gmail_extra"><br><br><div class="gmail_quote">On Thu, May 1, 2014 at 11:15 PM, John Dickinson <span dir="ltr"><<a href="mailto:me@not.mn" target="_blank">me@not.mn</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div class=""><br>
On May 1, 2014, at 10:32 AM, Shyam Prasad N <<a href="mailto:nspmangalore@gmail.com">nspmangalore@gmail.com</a>> wrote:<br>
<br>
> Hi Chuck,<br>
> Thanks for the reply.<br>
><br>
> The reason for this weight distribution seems to be the ring rebalance command. I've scripted the disk addition (and rebalance) process using a wrapper command. When I trigger a rebalance after each disk addition, only the first rebalance seems to take effect.<br>
><br>
> Is there any other way to adjust the weights other than a rebalance? Or is there a way to force a rebalance even when less than an hour (the min_part_hours value set at ring creation) has passed since the previous one?<br>
<br>
</div>Rebalancing only moves one replica at a time to ensure that your data remains available, even if you have a hardware failure while you are adding capacity. This is why it may take multiple rebalances to get everything evenly balanced.<br>
<br>
The min_part_hours setting (perhaps poorly named) should match how long a replication pass takes in your cluster. This follows from the point above: by ensuring that replication has completed before putting another partition "in flight", Swift can keep your data highly available.<br>
<br>
For completeness, to answer your question: there is an (intentionally) undocumented option in swift-ring-builder called "pretend_min_part_hours_passed", but it should ALMOST NEVER be used in a production cluster unless you really, really know what you are doing. Using that option will very likely cause service interruptions for your users. The better option is to set min_part_hours to match your replication pass time (with set_min_part_hours), and then wait for Swift to move things around.<br>
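As a sketch of the recommended path (the 3-hour value and the builder path are illustrative assumptions; measure your own replication pass time first):

```shell
# Match min_part_hours to the measured replication pass time
# (3 hours here is an example value, not a recommendation).
swift-ring-builder /etc/swift/object.builder set_min_part_hours 3

# Then rebalance; the builder will refuse to move further replicas of a
# partition until the min_part_hours window has passed.
swift-ring-builder /etc/swift/object.builder rebalance

# The discouraged escape hatch mentioned above -- avoid in production:
# swift-ring-builder /etc/swift/object.builder pretend_min_part_hours_passed
```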
<br>
Here's some more info on how and why to add capacity to a running Swift cluster: <a href="https://swiftstack.com/blog/2012/04/09/swift-capacity-management/" target="_blank">https://swiftstack.com/blog/2012/04/09/swift-capacity-management/</a><br>
<span class="HOEnZb"><font color="#888888"><br>
--John<br>
</font></span><div class="HOEnZb"><div class="h5"><br>
<br>
<br>
<br>
<br>
> On May 1, 2014 9:00 PM, "Chuck Thier" <<a href="mailto:cthier@gmail.com">cthier@gmail.com</a>> wrote:<br>
> Hi Shyam,<br>
><br>
> If I am reading your ring output correctly, it looks like only the devices on node .202 have a weight set, which is why all of your objects are going to that one node. You can update the weight of the other devices, rebalance, and things should get distributed correctly.<br>
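Concretely, that could look something like this (the device IDs, weight value, and builder path are illustrative; check them against your own ring's swift-ring-builder output):

```shell
# Inspect the current weights and partition assignment.
swift-ring-builder /etc/swift/object.builder

# Set a weight on each device that needs one (d3..d11 are example device IDs,
# addressed by the "d<id>" search-value syntax).
for dev in d3 d4 d5 d6 d7 d8 d9 d10 d11; do
    swift-ring-builder /etc/swift/object.builder set_weight "$dev" 1.0
done

# Rebalance, then distribute the resulting ring file to every node.
swift-ring-builder /etc/swift/object.builder rebalance
```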
><br>
> --<br>
> Chuck<br>
><br>
><br>
> On Thu, May 1, 2014 at 5:28 AM, Shyam Prasad N <<a href="mailto:nspmangalore@gmail.com">nspmangalore@gmail.com</a>> wrote:<br>
> Hi,<br>
><br>
> I created a swift cluster and configured the rings like this...<br>
><br>
> swift-ring-builder object.builder create 10 3 1<br>
><br>
> ubuntu-202:/etc/swift$ swift-ring-builder object.builder<br>
> object.builder, build version 12<br>
> 1024 partitions, 3.000000 replicas, 1 regions, 4 zones, 12 devices, 300.00 balance<br>
> The minimum number of hours before a partition can be reassigned is 1<br>
> Devices: id region zone ip address port replication ip replication port name weight partitions balance meta<br>
> 0 1 1 10.3.0.202 6010 10.3.0.202 6010 xvdb 1.00 1024 300.00<br>
> 1 1 1 10.3.0.202 6020 10.3.0.202 6020 xvdc 1.00 1024 300.00<br>
> 2 1 1 10.3.0.202 6030 10.3.0.202 6030 xvde 1.00 1024 300.00<br>
> 3 1 2 10.3.0.212 6010 10.3.0.212 6010 xvdb 1.00 0 -100.00<br>
> 4 1 2 10.3.0.212 6020 10.3.0.212 6020 xvdc 1.00 0 -100.00<br>
> 5 1 2 10.3.0.212 6030 10.3.0.212 6030 xvde 1.00 0 -100.00<br>
> 6 1 3 10.3.0.222 6010 10.3.0.222 6010 xvdb 1.00 0 -100.00<br>
> 7 1 3 10.3.0.222 6020 10.3.0.222 6020 xvdc 1.00 0 -100.00<br>
> 8 1 3 10.3.0.222 6030 10.3.0.222 6030 xvde 1.00 0 -100.00<br>
> 9 1 4 10.3.0.232 6010 10.3.0.232 6010 xvdb 1.00 0 -100.00<br>
> 10 1 4 10.3.0.232 6020 10.3.0.232 6020 xvdc 1.00 0 -100.00<br>
> 11 1 4 10.3.0.232 6030 10.3.0.232 6030 xvde 1.00 0 -100.00<br>
><br>
> Container and account rings have a similar configuration.<br>
> Once the rings were created and all the disks were added as above, I ran a rebalance on each ring. (I ran a rebalance after adding each of the nodes above.)<br>
> Then I immediately scp'd the rings to all the other nodes in the cluster.<br>
><br>
> I now observe that the objects are all going to 10.3.0.202. I don't see the objects being replicated to the other nodes. So much so that 202 is approaching 100% disk usage, while the other nodes are almost completely empty.<br>
> What am I doing wrong? Am I not supposed to run rebalance operation after addition of each disk/node?<br>
><br>
> Thanks in advance for the help.<br>
><br>
> --<br>
> -Shyam<br>
><br>
> _______________________________________________<br>
> OpenStack-dev mailing list<br>
> <a href="mailto:OpenStack-dev@lists.openstack.org">OpenStack-dev@lists.openstack.org</a><br>
> <a href="http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev" target="_blank">http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev</a><br>
><br>
><br>
><br>
><br>
<br>
</div></div><br>_______________________________________________<br>
OpenStack-dev mailing list<br>
<a href="mailto:OpenStack-dev@lists.openstack.org">OpenStack-dev@lists.openstack.org</a><br>
<a href="http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev" target="_blank">http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev</a><br>
<br></blockquote></div><br><br clear="all"><br>-- <br>-Shyam
</div>