You could try enabling handoffs_first for a reconstruction cycle or two, as this will prioritise the partitions sitting on handoffs. But make sure you turn it off again afterwards, because while it is enabled it stops normal reconstruction from happening.

I am working on some code to build better handling of old-primary handoffs into the reconstructor, but that code hasn't landed yet, and I'm not sure when it will.
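For reference, a minimal sketch of what I mean, assuming the reconstructor reads the usual /etc/swift/object-server.conf (the option lives in the reconstructor section):

    [object-reconstructor]
    # Work on partitions sitting in handoff locations first. Set this back to
    # False (the default) once the handoffs have drained, otherwise normal
    # reconstruction won't run.
    handoffs_first = True

Restart the reconstructor (e.g. swift-init object-reconstructor restart) after each change so it takes effect.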

Regards,
Matt

On Tue, Jul 20, 2021 at 11:54 PM Reid Guyett <rguyett@datto.com> wrote:

Hello,

We started using EC policies in a new cluster a few months ago and added more capacity. During the rebalance (started June 30), it seems that all the data was copied to the new locations but it didn't clean up the old locations. This was identified through our handoff monitoring.

OS: Ubuntu 18.04
Swift: 2.17.1

Example:
List of devices for partition

~$ swift-get-nodes /etc/swift/object-4.ring.gz -p 14242
... removed ...
Server:Port Device x.x.x.31:6031 d31
Server:Port Device x.x.x.66:6030 d30
Server:Port Device x.x.x.25:6029 d29
Server:Port Device x.x.x.33:6027 d27
Server:Port Device x.x.x.36:6020 d20
Server:Port Device x.x.x.29:6018 d18
Server:Port Device x.x.x.21:6033 d33
Server:Port Device x.x.x.27:6025 d25
Server:Port Device x.x.x.35:6022 d22
Server:Port Device x.x.x.39:6031 d31
Server:Port Device x.x.x.28:6032 d32
Server:Port Device x.x.x.23:6021 d21
Server:Port Device x.x.x.26:6022 d22
Server:Port Device x.x.x.34:6023 d23
Server:Port Device x.x.x.37:6019 d19
Server:Port Device x.x.x.30:6017 d17
Server:Port Device x.x.x.22:6027 d27
Server:Port Device x.x.x.24:6031 d31
Server:Port Device x.x.x.32:6032 d32

Partitions look to have the correct data on them:

~$ ssh root@x.x.x.31 "ls -lah ${DEVICE:-/srv/node*}/d31/objects-4/14242 | wc -l"
664
~$ ssh root@x.x.x.66 "ls -lah ${DEVICE:-/srv/node*}/d30/objects-4/14242 | wc -l"
664
~$ ssh root@x.x.x.25 "ls -lah ${DEVICE:-/srv/node*}/d29/objects-4/14242 | wc -l"
664
~$ ssh root@x.x.x.33 "ls -lah ${DEVICE:-/srv/node*}/d27/objects-4/14242 | wc -l"
664
~$ ssh root@x.x.x.36 "ls -lah ${DEVICE:-/srv/node*}/d20/objects-4/14242 | wc -l"
664
~$ ssh root@x.x.x.29 "ls -lah ${DEVICE:-/srv/node*}/d18/objects-4/14242 | wc -l"
664
~$ ssh root@x.x.x.21 "ls -lah ${DEVICE:-/srv/node*}/d33/objects-4/14242 | wc -l"
664
~$ ssh root@x.x.x.27 "ls -lah ${DEVICE:-/srv/node*}/d25/objects-4/14242 | wc -l"
664
~$ ssh root@x.x.x.35 "ls -lah ${DEVICE:-/srv/node*}/d22/objects-4/14242 | wc -l"
664
~$ ssh root@x.x.x.39 "ls -lah ${DEVICE:-/srv/node*}/d31/objects-4/14242 | wc -l"
664
~$ ssh root@x.x.x.28 "ls -lah ${DEVICE:-/srv/node*}/d32/objects-4/14242 | wc -l"
664
~$ ssh root@x.x.x.23 "ls -lah ${DEVICE:-/srv/node*}/d21/objects-4/14242 | wc -l"
664
~$ ssh root@x.x.x.26 "ls -lah ${DEVICE:-/srv/node*}/d22/objects-4/14242 | wc -l"
664
~$ ssh root@x.x.x.34 "ls -lah ${DEVICE:-/srv/node*}/d23/objects-4/14242 | wc -l"
664
~$ ssh root@x.x.x.37 "ls -lah ${DEVICE:-/srv/node*}/d19/objects-4/14242 | wc -l"
664
~$ ssh root@x.x.x.30 "ls -lah ${DEVICE:-/srv/node*}/d17/objects-4/14242 | wc -l"
664
~$ ssh root@x.x.x.22 "ls -lah ${DEVICE:-/srv/node*}/d27/objects-4/14242 | wc -l"
664
~$ ssh root@x.x.x.24 "ls -lah ${DEVICE:-/srv/node*}/d31/objects-4/14242 | wc -l"
664
~$ ssh root@x.x.x.32 "ls -lah ${DEVICE:-/srv/node*}/d32/objects-4/14242 | wc -l"
664
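
(Aside: each of those checks was run by hand; the same sweep can be scripted with something like the loop below, assuming the "ip:port device" pairs from the swift-get-nodes output above are saved to a hypothetical primaries.txt and the disks are mounted under /srv/node.)

    # count entries for partition 14242 on every primary in one pass
    while read hostport device; do
        host="${hostport%%:*}"
        ssh "root@${host}" "ls /srv/node/${device}/objects-4/14242 | wc -l"
    done < primaries.txt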

From one of the nodes that does not belong to the list above (this partition should not exist on this node after the rebalance):

x.x.x.20:~# ls /srv/node/d28/objects-4/14242 | wc -l
627

The reconstructor is throwing a lot of these unexpected response errors in the logs. Manually running it from the node that should not have the partition, I can reproduce the error. x.x.y.0/24 is the replication network.

x.x.x.20:~# swift-object-reconstructor /etc/swift/object-server.conf -d d28 -p 14242 -o -v
object-reconstructor: x.x.y.42:6200/d30/14242 Unexpected response: ":ERROR: 500 'ERROR: With :UPDATES: 36 failures to 0 successes'"

Some partition locations do look to have been cleaned up around July 11, but our expectation was that the old partition locations would be cleaned up gradually from June 30 onwards, and we're not seeing that. I was hoping for some ideas on what the problem may be (if any) and how we can make sure the old partitions are cleaned up.

Thanks,
Reid