[swift] EC data left in old location after rebalance

Reid Guyett rguyett at datto.com
Mon Jul 26 20:28:42 UTC 2021


> enabling handoffs_first for a reconstruction cycle or two
Does this mean to set handoffs_first/handoffs_only = true and run the
reconstructor twice with `-o`?
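
If so, this is roughly what I'd run on the node holding the stray
partition once the option is flipped, and then repeat for the second
cycle (just a sketch, reusing the flags from my manual run further down
the thread, and assuming the option belongs in the [object-reconstructor]
section of object-server.conf):

~$ swift-object-reconstructor /etc/swift/object-server.conf -d d28 -p 14242 -o -v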

Thanks!

Reid




On Sun, Jul 25, 2021 at 9:02 PM Matthew Oliver <matt at oliver.net.au> wrote:
>
> You could try enabling handoffs_first for a reconstruction cycle or two, as this will prioritise the partitions sitting on handoffs. But make sure you turn it off again afterwards, as it will stop normal reconstruction from happening.
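>
> Roughly along these lines (only a sketch; adjust for your own config layout, and note the option may be spelled handoffs_only on your release):
>
> # /etc/swift/object-server.conf (temporary)
> [object-reconstructor]
> handoffs_first = true
>
> # ...and once the handoffs have drained, remove the line (or set it back
> # to false) so normal reconstruction resumes.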
>
> I am working on some code to make better use of old-primary handoffs in the reconstructor, but that code hasn't landed yet, and I'm not sure when it will.
>
> Regards,
> Matt
>
> On Tue, Jul 20, 2021 at 11:54 PM Reid Guyett <rguyett at datto.com> wrote:
>>
>> Hello,
>>
>> We started using EC policies in a new cluster a few months ago and
>> added more capacity. During the rebalance (started June 30), it seems
>> that all the data was copied to the new locations but it didn't clean
>> up the old locations. This was identified through our handoff
>> monitoring.
>>
>> OS: Ubuntu 18.04
>> Swift: 2.17.1
>>
>> Example:
>> List of devices for partition 14242:
>>
>> ~$ swift-get-nodes /etc/swift/object-4.ring.gz -p 14242
>> ... removed ...
>> Server:Port Device      x.x.x.31:6031 d31
>> Server:Port Device      x.x.x.66:6030 d30
>> Server:Port Device      x.x.x.25:6029 d29
>> Server:Port Device      x.x.x.33:6027 d27
>> Server:Port Device      x.x.x.36:6020 d20
>> Server:Port Device      x.x.x.29:6018 d18
>> Server:Port Device      x.x.x.21:6033 d33
>> Server:Port Device      x.x.x.27:6025 d25
>> Server:Port Device      x.x.x.35:6022 d22
>> Server:Port Device      x.x.x.39:6031 d31
>> Server:Port Device      x.x.x.28:6032 d32
>> Server:Port Device      x.x.x.23:6021 d21
>> Server:Port Device      x.x.x.26:6022 d22
>> Server:Port Device      x.x.x.34:6023 d23
>> Server:Port Device      x.x.x.37:6019 d19
>> Server:Port Device      x.x.x.30:6017 d17
>> Server:Port Device      x.x.x.22:6027 d27
>> Server:Port Device      x.x.x.24:6031 d31
>> Server:Port Device      x.x.x.32:6032 d32
>>
>> The primary locations for the partition look to have the correct data on them:
>>
>> ~$ ssh root@x.x.x.31 "ls -lah ${DEVICE:-/srv/node*}/d31/objects-4/14242 | wc -l"
>> 664
>> ~$ ssh root@x.x.x.66 "ls -lah ${DEVICE:-/srv/node*}/d30/objects-4/14242 | wc -l"
>> 664
>> ~$ ssh root@x.x.x.25 "ls -lah ${DEVICE:-/srv/node*}/d29/objects-4/14242 | wc -l"
>> 664
>> ~$ ssh root@x.x.x.33 "ls -lah ${DEVICE:-/srv/node*}/d27/objects-4/14242 | wc -l"
>> 664
>> ~$ ssh root@x.x.x.36 "ls -lah ${DEVICE:-/srv/node*}/d20/objects-4/14242 | wc -l"
>> 664
>> ~$ ssh root@x.x.x.29 "ls -lah ${DEVICE:-/srv/node*}/d18/objects-4/14242 | wc -l"
>> 664
>> ~$ ssh root@x.x.x.21 "ls -lah ${DEVICE:-/srv/node*}/d33/objects-4/14242 | wc -l"
>> 664
>> ~$ ssh root@x.x.x.27 "ls -lah ${DEVICE:-/srv/node*}/d25/objects-4/14242 | wc -l"
>> 664
>> ~$ ssh root@x.x.x.35 "ls -lah ${DEVICE:-/srv/node*}/d22/objects-4/14242 | wc -l"
>> 664
>> ~$ ssh root@x.x.x.39 "ls -lah ${DEVICE:-/srv/node*}/d31/objects-4/14242 | wc -l"
>> 664
>> ~$ ssh root@x.x.x.28 "ls -lah ${DEVICE:-/srv/node*}/d32/objects-4/14242 | wc -l"
>> 664
>> ~$ ssh root@x.x.x.23 "ls -lah ${DEVICE:-/srv/node*}/d21/objects-4/14242 | wc -l"
>> 664
>> ~$ ssh root@x.x.x.26 "ls -lah ${DEVICE:-/srv/node*}/d22/objects-4/14242 | wc -l"
>> 664
>> ~$ ssh root@x.x.x.34 "ls -lah ${DEVICE:-/srv/node*}/d23/objects-4/14242 | wc -l"
>> 664
>> ~$ ssh root@x.x.x.37 "ls -lah ${DEVICE:-/srv/node*}/d19/objects-4/14242 | wc -l"
>> 664
>> ~$ ssh root@x.x.x.30 "ls -lah ${DEVICE:-/srv/node*}/d17/objects-4/14242 | wc -l"
>> 664
>> ~$ ssh root@x.x.x.22 "ls -lah ${DEVICE:-/srv/node*}/d27/objects-4/14242 | wc -l"
>> 664
>> ~$ ssh root@x.x.x.24 "ls -lah ${DEVICE:-/srv/node*}/d31/objects-4/14242 | wc -l"
>> 664
>> ~$ ssh root@x.x.x.32 "ls -lah ${DEVICE:-/srv/node*}/d32/objects-4/14242 | wc -l"
>> 664
>>
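>> (For what it's worth, the per-node checks above came from a small loop along
>> these lines, where primaries.txt is just the "host device" pairs copied out of
>> the swift-get-nodes output:)
>>
>> ~$ while read host dev; do ssh "root@${host}" "ls -lah ${DEVICE:-/srv/node*}/${dev}/objects-4/14242 | wc -l"; done < primaries.txt
>>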
>> On one of the nodes that does not belong to the list above, the
>> partition still exists, even though it should not be there after the
>> rebalance:
>>
>> x.x.x.20:~# ls /srv/node/d28/objects-4/14242 | wc -l
>> 627
>>
>> The reconstructor is throwing a lot of these unexpected response
>> errors in the logs. Manually running it from the node that should not
>> have the partition, I can reproduce the error. x.x.y.0/24 is the
>> replication network.
>>
>> x.x.x.20:~# swift-object-reconstructor /etc/swift/object-server.conf
>> -d d28 -p 14242 -o -v
>> object-reconstructor: x.x.y.42:6200/d30/14242 Unexpected response:
>> ":ERROR: 500 'ERROR: With :UPDATES: 36 failures to 0 successes'"
>>
>> It looks like some partition locations were cleaned up around
>> July 11. Our expectation was that the old partition locations would be
>> cleaned up gradually starting June 30, but we're not seeing that.
>> I was hoping for some ideas on what the problem may be (if any) and
>> how we can make sure the old partitions are cleaned up.
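>>
>> (The spot check we keep re-running to see whether cleanup has happened yet;
>> purely illustrative, using the stray node and device from the example above:)
>>
>> x.x.x.20:~# test -d /srv/node/d28/objects-4/14242 && echo still-present || echo cleaned-up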
>>
>> Thanks,
>> Reid
>>
>>



