Hi Gorka,

Thanks! I fixed the issue by adding the uxsock_timeout directive to the multipathd config (in the defaults section):

    uxsock_timeout 10000

because in the multipathd logs I saw this error:

    3624a93705842cfae35d7483200015fd8: map flushed
    cli cmd 'del map 3624a93705842cfae35d7483200015fd8' timeout reached after 4.858561 secs

Now large disk backups work fine.

2. This happens because, despite the timeout of the first attempt and its exit code 1, the multipath device was still disconnected, so the subsequent attempts failed with the "is not a multipath device" error, since the multipath device had already been removed.

On Tue, 14 Mar 2023 at 14:46, Gorka Eguileor <geguileo@redhat.com> wrote:
[Sending the email again as it seems it didn't reach the ML]
On 13/03, Gorka Eguileor wrote:
On 11/03, Rishat Azizov wrote:
Hi, Gorka,
Thanks. I see multiple "multipath -f" calls. Logs in attachments.
Hi,
There are multiple things going on here:
1. There is a bug in os-brick: disconnect_volume should not fail, since it is being called with force=True and ignore_errors=True.
The issue is that this call [1] is not wrapped in the ExceptionChainer context manager, and it should not even be a flush call; it should be a call to "multipathd remove map $map" instead.
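Roughly something like this (just an illustrative sketch, with a simplified stand-in for os-brick's ExceptionChainer rather than the real class):

    # Illustrative sketch only; the ExceptionChainer stand-in and function
    # names are simplified and are not the actual os-brick code.
    import subprocess

    class ExceptionChainer:
        """Minimal stand-in for os-brick's ExceptionChainer context manager:
        it records the exception instead of letting it propagate."""
        def __init__(self):
            self.exceptions = []

        def __enter__(self):
            return self

        def __exit__(self, exc_type, exc_val, exc_tb):
            if exc_type is not None:
                self.exceptions.append(exc_val)
            return True  # swallow the exception; the caller decides later

    def remove_multipath_map(map_name):
        chain = ExceptionChainer()
        with chain:
            # Ask multipathd to remove the map instead of flushing it with
            # "multipath -f", so disconnect_volume doesn't blow up on errors.
            subprocess.run(['multipathd', 'remove', 'map', map_name],
                           check=True)
        return not chain.exceptions  # True if the removal succeeded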
2. The way the multipath code is written [2][3], the error we see about "3624a93705842cfae35d7483200015fce is not a multipath device" can mean two different things: either it really is not a multipath device, or an error happened while checking it.
So we don't really know what happened without enabling more verbose multipathd log levels.
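For example (from memory, so please double check it against your multipath-tools version), raising the daemon verbosity in the defaults section of multipath.conf and restarting multipathd should show which of the two it actually was:

    defaults {
        # default verbosity is 2; raise it to get more detailed logging
        verbosity 3
    }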
3. The "multipath -f" call should not be failing in the first place, because the failure is happening on disconnecting the source volume, which has no data buffered to be written and therefore no reason to fail the flush (unless it's using a friendly name).
I don't know if this is what is happening, but it could be that the first flush fails with a timeout (maybe because there is an extend operation happening), while multipathd keeps trying to flush it in the background; when that eventually succeeds it removes the multipath device, which makes the following calls fail.
If that's the case, we would need to change the retry from automatic [4] to manual, and check between attempts whether the device has already been removed.
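Something along these lines (only a sketch to illustrate the idea, not the actual os-brick code or its retry parameters):

    # Illustrative only: retry the flush manually and stop as soon as the
    # map has disappeared, instead of retrying blindly.
    import os
    import subprocess
    import time

    def flush_multipath_device(map_name, attempts=3, interval=5):
        for _ in range(attempts):
            if not os.path.exists('/dev/mapper/' + map_name):
                # multipathd already removed the map (e.g. a background
                # flush finally succeeded), so there is nothing left to do.
                return
            result = subprocess.run(['multipath', '-f', map_name])
            if result.returncode == 0:
                return
            time.sleep(interval)
        raise RuntimeError('Could not flush multipath device %s' % map_name)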
The first issue is definitely a bug, the second one is something that could be changed in the deployment to try to get additional information on the failure, and the third one could be a bug as well.
I'll see if I can find someone who wants to work on the 1st and 3rd points.
Cheers, Gorka.
[1]: https://github.com/openstack/os-brick/blob/e15edf6c17449899ec8401c37482f7cb5...
[2]: https://github.com/opensvc/multipath-tools/blob/db4804bc7393f2482448bdd87013...
[3]: https://github.com/opensvc/multipath-tools/blob/db4804bc7393f2482448bdd87013...
[4]: https://github.com/openstack/os-brick/blob/e15edf6c17449899ec8401c37482f7cb5...
On Thu, 9 Mar 2023 at 15:55, Gorka Eguileor <geguileo@redhat.com> wrote:
On 06/03, Rishat Azizov wrote:
Hi,
It works with smaller volumes.
multipath.conf is attached to this email.
Cinder version - 18.2.0 Wallaby