Hi Gorka,

Thanks!
I fixed the issue by adding the uxsock_timeout directive to the multipathd config:
uxsock_timeout 10000
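
For reference, this is roughly where the directive would sit in /etc/multipath.conf (a sketch; I believe the value is in milliseconds, but please double-check against your multipath-tools version):

```
defaults {
    # Give multipathd's CLI socket more time to answer before
    # "timeout reached" (value in milliseconds).
    uxsock_timeout 10000
}
```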

I did this because I saw this error in the multipathd logs:
3624a93705842cfae35d7483200015fd8: map flushed
cli cmd 'del map 3624a93705842cfae35d7483200015fd8' timeout reached after 4.858561 secs

Now large disk backups work fine.

2. This happens because, despite the timeout of the first attempt and its exit code 1, the multipath device was actually disconnected, so the subsequent attempts failed with the error "is not a multipath device".


Tue, Mar 14, 2023 at 14:46, Gorka Eguileor <geguileo@redhat.com>:
[Sending the email again as it seems it didn't reach the ML]


On 13/03, Gorka Eguileor wrote:
> On 11/03, Rishat Azizov wrote:
> > Hi, Gorka,
> >
> > Thanks. I see multiple "multipath -f" calls. Logs in attachments.
> >



Hi,

There are multiple things going on here:

1. There is a bug in os-brick, because the disconnect_volume should not
   fail, since it is being called with force=True and
   ignore_errors=True.

   The issue is that this call [1] is not wrapped in the
   ExceptionChainer context manager. It should not even be a flush
   call; it should be a call to "multipathd remove map $map" instead.
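
   To illustrate the pattern, here is a self-contained sketch of how
   the chainer collects errors from each step instead of letting the
   first one abort the disconnect (this is a simplified stand-in, not
   os-brick's actual ExceptionChainer class):

```python
import contextlib


class ExceptionChainer(contextlib.AbstractContextManager):
    """Simplified stand-in for os-brick's ExceptionChainer: each
    'with chain:' block swallows its exception and records it, so
    later cleanup steps still run."""

    def __init__(self):
        self._exceptions = []

    def __exit__(self, exc_type, exc_val, exc_tb):
        if exc_val is not None:
            self._exceptions.append(exc_val)
        return True  # swallow; the caller inspects .exceptions later

    @property
    def exceptions(self):
        return list(self._exceptions)


chain = ExceptionChainer()

with chain:  # hypothetical "remove map" step that fails
    raise RuntimeError("multipathd remove map failed")

with chain:  # later cleanup steps still execute
    pass

# With ignore_errors=True the caller would log chain.exceptions
# instead of re-raising them.
```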

2. The way the multipath code is written [2][3], the error we see about
   "3624a93705842cfae35d7483200015fce is not a multipath device" can mean
   2 different things: either the device is not a multipath device, or an
   error happened while checking it.

   So we don't really know what happened without enabling more verbose
   multipathd log levels.
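
   For example, something along these lines in multipath.conf should
   raise the daemon's log verbosity (I believe the default level is 2;
   please verify the exact option semantics for your multipath-tools
   version):

```
defaults {
    # More verbose multipathd logging, to see the real reason
    # behind "is not a multipath device".
    verbosity 3
}
```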

3. The "multipath -f" call should not be failing in the first place,
   because the failure is happening on disconnecting the source volume,
   which has no data buffered to be written and therefore no reason to
   fail the flush (unless it's using a friendly name).

   It may be that the first flush fails with a timeout (maybe because
   an extend operation is in progress), but multipathd keeps trying to
   flush it in the background, and when it succeeds it removes the
   multipath device, which makes the following calls fail.

   If that's the case, we would need to change the retry from automatic
   [4] to manual, and check between attempts whether the device has
   already been removed.
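
   A manual retry along those lines could look something like this
   (a sketch, not os-brick code; flush and device_exists are
   hypothetical callables standing in for the real "multipath -f"
   invocation and the check that the device-mapper node is gone):

```python
import time


def flush_with_retry(flush, device_exists, attempts=3, interval=1.0):
    """Retry the flush manually, but before each attempt check whether
    a previous (timed out) attempt already removed the device in the
    background, in which case there is nothing left to do."""
    last_exc = None
    for _ in range(attempts):
        if not device_exists():
            return  # an earlier flush succeeded in the background
        try:
            flush()
            return
        except Exception as exc:
            last_exc = exc
            time.sleep(interval)
    # Only report failure if the device is genuinely still there.
    if device_exists() and last_exc is not None:
        raise last_exc
```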

The first issue is definitely a bug, the 2nd one is something that could
be changed in the deployment to try to get additional information on the
failure, and the 3rd one could be a bug.

I'll see if I can find someone who wants to work on the 1st and 3rd
points.

Cheers,
Gorka.

[1]: https://github.com/openstack/os-brick/blob/e15edf6c17449899ec8401c37482f7cb5de207d3/os_brick/initiator/connectors/iscsi.py#L952
[2]: https://github.com/opensvc/multipath-tools/blob/db4804bc7393f2482448bdd870132522e65dd98e/multipath/main.c#L1063-L1064
[3]: https://github.com/opensvc/multipath-tools/blob/db4804bc7393f2482448bdd870132522e65dd98e/libmultipath/devmapper.c#L867-L872
[4]: https://github.com/openstack/os-brick/blob/e15edf6c17449899ec8401c37482f7cb5de207d3/os_brick/initiator/linuxscsi.py#L384



> >
> > Thu, Mar 9, 2023 at 15:55, Gorka Eguileor <geguileo@redhat.com>:
> >
> > > On 06/03, Rishat Azizov wrote:
> > > > Hi,
> > > >
> > > > It works with smaller volumes.
> > > >
> > > > multipath.conf attached to this email.
> > > >
> > > > Cinder version - 18.2.0 Wallaby
> > >