[Openstack] Expanding Storage - Rebalance Extreeemely Slow (or Stalled?)

Emre Sokullu emre at groups-inc.com
Wed Oct 24 19:30:54 UTC 2012


Thanks Andi, that helps. It's true that my expectations were misplaced; I
was expecting all nodes to "rebalance" until they each stored the same amount
of data.

What's weird, though, is that folders are missing on the newly added
c0d4p1 device. Here's what I get:

root at storage3:/srv/node# ls c0d1p1/
accounts  async_pending  containers  objects  tmp

root at storage3:/srv/node# ls c0d4p1/
accounts  tmp

Is that normal?

And when I check /var/log/rsyncd.log for the moves between storage nodes,
I see a great many entries like the following, which again makes me wonder
whether something is wrong:

2012/10/24 19:22:56 [6514] rsync to
container/c0d4p1/tmp/e49cf526-1d53-4069-bbea-b74f6dbec5f1 from storage2
(192.168.1.4)
2012/10/24 19:22:56 [6514] receiving file list
2012/10/24 19:22:56 [6514] sent 54 bytes  received 17527 bytes  total size
17408
2012/10/24 21:22:56 [6516] connect from storage2 (192.168.1.4)
2012/10/24 19:22:56 [6516] rsync to
container/c0d4p1/tmp/4b8b0618-077b-48e2-a7a0-fb998fcf11bc from storage2
(192.168.1.4)
2012/10/24 19:22:56 [6516] receiving file list
2012/10/24 19:22:56 [6516] sent 54 bytes  received 26743 bytes  total size
26624
2012/10/24 21:22:56 [6518] connect from storage2 (192.168.1.4)
2012/10/24 19:22:56 [6518] rsync to
container/c0d4p1/tmp/53452ee6-c52c-4e3b-abe2-a31a2c8d65ba from storage2
(192.168.1.4)
2012/10/24 19:22:56 [6518] receiving file list
2012/10/24 19:22:57 [6518] sent 54 bytes  received 24695 bytes  total size
24576
2012/10/24 21:22:57 [6550] connect from storage2 (192.168.1.4)
2012/10/24 19:22:57 [6550] rsync to
container/c0d4p1/tmp/b858126d-3152-4d71-a0e8-eea115f69fc8 from storage2
(192.168.1.4)
2012/10/24 19:22:57 [6550] receiving file list
2012/10/24 19:22:57 [6550] sent 54 bytes  received 24695 bytes  total size
24576
2012/10/24 21:22:57 [6552] connect from storage2 (192.168.1.4)
2012/10/24 19:22:57 [6552] rsync to
container/c0d4p1/tmp/f3ce8205-84ac-4236-baea-3a3aef2da6ab from storage2
(192.168.1.4)
2012/10/24 19:22:57 [6552] receiving file list
2012/10/24 19:22:58 [6552] sent 54 bytes  received 25719 bytes  total size
25600
2012/10/24 21:22:58 [6554] connect from storage2 (192.168.1.4)
2012/10/24 19:22:58 [6554] rsync to
container/c0d4p1/tmp/91b4f046-eacb-4a1d-aed1-727d0c982742 from storage2
(192.168.1.4)
2012/10/24 19:22:58 [6554] receiving file list
2012/10/24 19:22:58 [6554] sent 54 bytes  received 18551 bytes  total size
18432
2012/10/24 21:22:58 [6556] connect from storage2 (192.168.1.4)
2012/10/24 19:22:58 [6556] rsync to
container/c0d4p1/tmp/94d223f9-b84d-4911-be6b-bb28f89b6647 from storage2
(192.168.1.4)
2012/10/24 19:22:58 [6556] receiving file list
2012/10/24 19:22:58 [6556] sent 54 bytes  received 24695 bytes  total size
24576




On Tue, Oct 23, 2012 at 11:17 AM, andi abes <andi.abes at gmail.com> wrote:

> On Tue, Oct 23, 2012 at 12:16 PM, Emre Sokullu <emre at groups-inc.com>
> wrote:
> > Folks,
> >
> > This is the 3rd day and I see no change, or only a very small (KB-level)
> > change, on the new disks.
> >
> > Could that be normal? Is there a long computation process that has to run
> > first before the newly added disks actually start filling up?
> >
> > Or should I just start from scratch with the "create" command this time?
> > The last time I did it, I didn't run the "swift-ring-builder create 20 3 1
> > .." command first but just started with "swift-ring-builder add ..." and
> > used the existing ring.gz files, thinking otherwise I could be
> > reformatting the whole stack. I'm not sure if that's the case.
> >
>
> That is correct - you don't want to recreate the rings, since that is
> likely to cause redundant partition movement.
>
> > Please advise. Thanks,
> >
>
> I think your expectations might be misplaced. The ring builder tries
> not to move partitions needlessly. In your cluster, you had 3 zones
> (and I'm assuming 3 replicas). Swift placed the partitions as
> efficiently as it could, spread across the 3 zones (servers). As
> things stand, there's no real reason for partitions to move across the
> servers. I'm guessing that the data growth you've seen is from new
> data, not from existing data being moved (though there are some calls to
> random in the code which might have produced some partition movement).
>
> If you truly want to move things around forcefully, you could (see the
> sketch below):
> * Decrease the weight of the old devices. They would then hold more
> partitions than their weight warrants, and partitions would be
> reassigned away from them.
> * Delete and re-add devices to the ring. This will cause all the
> partitions from the deleted devices to be spread across the new set of
> devices.
>
> After you perform your ring manipulation commands, execute the
> rebalance command and copy the ring files to all nodes.
> This is likely to cause *lots* of activity in your cluster... which
> seems to be the desired outcome. It's also likely to have a negative
> impact on service requests to the proxy, so it's something you probably
> want to be careful about.
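>
> A minimal sketch of that workflow (the device id, weight, and host name
> below are placeholders for illustration, not your actual layout):
>
> # lower the weight of an existing device (here device id 0) so that
> # partitions get reassigned away from it
> swift-ring-builder object.builder set_weight d0 50
> # recompute the assignment and write out the new object.ring.gz
> swift-ring-builder object.builder rebalance
> # repeat for account.builder and container.builder, then push the new
> # ring files to every proxy and storage node
> scp *.ring.gz storage1:/etc/swift/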
>
> If you leave things alone as they are, new data will be distributed on
> the new devices, and as old data gets deleted usage will rebalance
> over time.
>
>
> > --
> > Emre
> >
> > On Mon, Oct 22, 2012 at 12:09 PM, Emre Sokullu <emre at groups-inc.com>
> wrote:
> >>
> >> Hi Samuel,
> >>
> >> Thanks for quick reply.
> >>
> >> They're all 100. And here's the output of swift-ring-builder
> >>
> >> root at proxy1:/etc/swift# swift-ring-builder account.builder
> >> account.builder, build version 13
> >> 1048576 partitions, 3 replicas, 3 zones, 12 devices, 0.00 balance
> >> The minimum number of hours before a partition can be reassigned is 1
> >> Devices:    id  zone      ip address  port      name weight partitions balance meta
> >>              0     1     192.168.1.3  6002    c0d1p1 100.00     262144    0.00
> >>              1     1     192.168.1.3  6002    c0d2p1 100.00     262144    0.00
> >>              2     1     192.168.1.3  6002    c0d3p1 100.00     262144    0.00
> >>              3     2     192.168.1.4  6002    c0d1p1 100.00     262144    0.00
> >>              4     2     192.168.1.4  6002    c0d2p1 100.00     262144    0.00
> >>              5     2     192.168.1.4  6002    c0d3p1 100.00     262144    0.00
> >>              6     3     192.168.1.5  6002    c0d1p1 100.00     262144    0.00
> >>              7     3     192.168.1.5  6002    c0d2p1 100.00     262144    0.00
> >>              8     3     192.168.1.5  6002    c0d3p1 100.00     262144    0.00
> >>              9     1     192.168.1.3  6002    c0d4p1 100.00     262144    0.00
> >>             10     2     192.168.1.4  6002    c0d4p1 100.00     262144    0.00
> >>             11     3     192.168.1.5  6002    c0d4p1 100.00     262144    0.00
> >>
> >> On Mon, Oct 22, 2012 at 12:03 PM, Samuel Merritt <sam at swiftstack.com>
> >> wrote:
> >> > On 10/22/12 9:38 AM, Emre Sokullu wrote:
> >> >>
> >> >> Hi folks,
> >> >>
> >> >> At GROU.PS, we've been an OpenStack SWIFT user for more than 1.5
> >> >> years now. Currently, we hold about 18TB of data on 3 storage nodes.
> >> >> Since we hit 84% utilization, we recently decided to expand the
> >> >> storage with more disks.
> >> >>
> >> >> In order to do that, after creating a new c0d4p1 partition in each of
> >> >> the storage nodes, we ran the following commands on our proxy server:
> >> >>
> >> >> swift-ring-builder account.builder add z1-192.168.1.3:6002/c0d4p1 100
> >> >> swift-ring-builder container.builder add z1-192.168.1.3:6002/c0d4p1 100
> >> >> swift-ring-builder object.builder add z1-192.168.1.3:6002/c0d4p1 100
> >> >> swift-ring-builder account.builder add z2-192.168.1.4:6002/c0d4p1 100
> >> >> swift-ring-builder container.builder add z2-192.168.1.4:6002/c0d4p1 100
> >> >> swift-ring-builder object.builder add z2-192.168.1.4:6002/c0d4p1 100
> >> >> swift-ring-builder account.builder add z3-192.168.1.5:6002/c0d4p1 100
> >> >> swift-ring-builder container.builder add z3-192.168.1.5:6002/c0d4p1 100
> >> >> swift-ring-builder object.builder add z3-192.168.1.5:6002/c0d4p1 100
> >> >>
> >> >> [snip]
> >> >
> >> >>
> >> >> So right now, the problem is that the disk growth on each of the
> >> >> storage nodes seems to have stalled,
> >> >
> >> > So you've added 3 new devices to each ring and assigned a weight of
> >> > 100 to each one. What are the weights of the other devices in the
> >> > ring? If they're much larger than 100, then that will cause the new
> >> > devices to end up with a small fraction of the data you want on them.
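> >> >
> >> > (For a rough sense of the arithmetic: each device's share is roughly
> >> > its weight divided by the sum of all weights. With 1048576 partitions,
> >> > 3 replicas, and 12 devices of equal weight, that works out to
> >> > 3 * 1048576 / 12 = 262144 partitions per device.)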
> >> >
> >> > Running "swift-ring-builder <thing>.builder" will show you
> >> > information, including weights, of all the devices in the ring.
> >> >
> >> >
> >> >
> >> >> * Bonus question: why do we copy ring.gz files to storage nodes, and
> >> >> how critical are they? To me it's not clear how Swift can afford to
> >> >> wait (even if it's just a few seconds) for the .ring.gz files to reach
> >> >> the storage nodes after rebalancing, if those files are so critical.
> >> >
> >> >
> >> > The ring.gz files contain the mapping from Swift partitions to disks.
> >> > As you know, the proxy server uses it to determine which backends have
> >> > the data for a given request. The replicators also use the ring to
> >> > determine where data belongs so that they can ensure the right number
> >> > of replicas, etc.
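> >> >
> >> > You can see that mapping for any object with swift-get-nodes (the
> >> > account, container and object names here are just made-up examples):
> >> >
> >> > swift-get-nodes /etc/swift/object.ring.gz AUTH_myaccount mycontainer myobject
> >> >
> >> > It prints the partition and the devices on which the ring expects that
> >> > object's replicas to live.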
> >> >
> >> > When two storage nodes have different versions of a ring.gz file, you
> >> > can get replicator fights. They look like this:
> >> >
> >> > - node1's (old) ring says that the partition for a replica of
> >> > /cof/fee/cup belongs on node2's /dev/sdf.
> >> > - node2's (new) ring says that the same partition belongs on node1's
> >> > /dev/sdd.
> >> >
> >> > When the replicator on node1 runs, it will see that it has the
> >> > partition for /cof/fee/cup on its disk. It will then consult the ring,
> >> > push that partition's contents to node2, and then delete its local copy
> >> > (since node1's ring says that this data does not belong on node1).
> >> >
> >> > When the replicator on node2 runs, it will do the converse: push to
> >> > node1, then delete its local copy.
> >> >
> >> > If you leave the rings out of sync for a long time, then you'll end up
> >> > consuming disk and network IO ping-ponging a set of data around. If
> >> > they're out of sync for a few seconds, then it's not a big deal.
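> >> >
> >> > (A quick way to verify the nodes agree, with placeholder host names:
> >> >
> >> > for h in proxy1 storage1 storage2 storage3; do
> >> >     ssh $h "md5sum /etc/swift/*.ring.gz"
> >> > done
> >> >
> >> > If the md5sums match on every node, the rings are in sync.)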
> >> >
> >
>