On 16/03, Rishat Azizov wrote:
> Hi Gorka,
>
> Thanks!
> I fixed issue by adding to multipathd config uxsock_timeout directive:
> uxsock_timeout 10000
>
> Because in multipathd logs I saw this error:
> 3624a93705842cfae35d7483200015fd8: map flushed
> cli cmd 'del map 3624a93705842cfae35d7483200015fd8' timeout reached after
> 4.858561 secs
>
> Now large disk backups work fine.
>
> 2. This happens because, despite the first attempt timing out and
> returning exit code 1, the multipath device was actually removed, so
> the subsequent attempts failed with the error "is not a multipath
> device".
>
Hi,
That's a nice workaround until we fix it upstream!!
Thanks for confirming my suspicions were right. This is the 3rd thing I
mentioned could be happening: the flush call failed but it actually
removed the device.
We'll proceed to fix the flushing code in master.
Cheers,
Gorka.
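[Editor's note: for readers hitting the same timeout, the workaround above corresponds to a multipath.conf fragment along these lines; placing the directive in the defaults section is an assumption about this deployment's config layout, and the value is in milliseconds.]

```
defaults {
    # Raise the multipathd client socket timeout (milliseconds);
    # the log above shows the cli command timing out after ~4.8 s.
    uxsock_timeout 10000
}
```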
>
> Tue, 14 Mar 2023 at 14:46, Gorka Eguileor <geguileo@redhat.com>:
>
> > [Sending the email again as it seems it didn't reach the ML]
> >
> >
> > On 13/03, Gorka Eguileor wrote:
> > > On 11/03, Rishat Azizov wrote:
> > > > Hi, Gorka,
> > > >
> > > > Thanks. I see multiple "multipath -f" calls. Logs in attachments.
> > > >
> >
> >
> >
> > Hi,
> >
> > There are multiple things going on here:
> >
> > 1. There is a bug in os-brick, because the disconnect_volume should not
> > fail, since it is being called with force=True and
> > ignore_errors=True.
> >
> > The issue is that this call [1] is not wrapped in the
> > ExceptionChainer context manager, and it should not even be a flush
> > call; it should be a call to "multipathd remove map $map" instead.
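[Editor's note: a rough illustration of the ExceptionChainer pattern point 1 refers to. This is a simplified pure-Python sketch, not os-brick's actual class, and `disconnect_volume` below is a stand-in for the real connector method, not its actual signature.]

```python
from contextlib import contextmanager


class ExceptionChainer(Exception):
    """Simplified sketch of an os-brick-style exception chainer.

    Exceptions raised inside a context() block are collected instead of
    propagating, so cleanup steps keep running; callers check
    bool(chainer) at the end and decide whether to raise.
    """

    def __init__(self):
        self._exceptions = []

    def __bool__(self):
        return bool(self._exceptions)

    @contextmanager
    def context(self, catch_exception, msg):
        try:
            yield
        except Exception as exc:
            if not catch_exception:
                raise
            self._exceptions.append((msg, exc))


def disconnect_volume(force=True, ignore_errors=True):
    # With force=True every step is wrapped, so a failing flush cannot
    # abort the disconnect; remaining steps still run.
    exc = ExceptionChainer()
    with exc.context(force, 'flush failed'):
        raise RuntimeError('multipath -f timed out')  # simulated failure
    with exc.context(force, 'remove failed'):
        pass  # e.g. "multipathd remove map $map" would go here
    if exc and not ignore_errors:
        raise exc
    return bool(exc)  # True: errors happened but were swallowed
```

With force=True and ignore_errors=True the function returns normally even though the flush step failed, which is the behaviour point 1 says the real disconnect_volume should have.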
> >
> > 2. The way the multipath code is written [2][3], the error we see,
> > "3624a93705842cfae35d7483200015fce is not a multipath device", can
> > mean 2 different things: either it is not a multipath device, or an
> > error happened.
> >
> > So we don't really know what happened without enabling more verbose
> > multipathd log levels.
> >
> > 3. The "multipath -f" call should not be failing in the first place,
> > because the failure is happening on disconnecting the source volume,
> > which has no data buffered to be written and therefore no reason to
> > fail the flush (unless it's using a friendly name).
> >
> > I don't know if it could be happening that the first flush fails with
> > a timeout (maybe because there is an extend operation happening), but
> > multipathd keeps trying to flush it in the background and when it
> > succeeds it removes the multipath device, which makes following calls
> > fail.
> >
> > If that's the case we would need to change the retry from automatic
> > [4] to manual, checking between calls whether the device has already
> > been removed.
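[Editor's note: the manual retry described above could look something like the following. This is a hypothetical helper, not os-brick's actual API; `flush` and `device_exists` are injected callables so the idea is self-contained, whereas the real code would shell out to "multipath -f" and check sysfs/devmapper.]

```python
import time


def flush_with_manual_retry(flush, device_exists, attempts=3, interval=1):
    """Retry a multipath flush manually, treating device disappearance
    between attempts as success rather than as a new failure.
    """
    last_exc = None
    for attempt in range(attempts):
        try:
            flush()
            return True  # flush reported success
        except Exception as exc:
            last_exc = exc
            if not device_exists():
                # The device vanished between calls: multipathd finished
                # the flush in the background, so this is a success, not
                # an "is not a multipath device" failure.
                return True
            if attempt < attempts - 1:
                time.sleep(interval)
    raise last_exc
```

This avoids the failure mode described above, where a timed-out first flush succeeds asynchronously and every automatic retry then fails against a device that no longer exists.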
> >
> > The first issue is definitely a bug, the 2nd one is something that could
> > be changed in the deployment to try to get additional information on the
> > failure, and the 3rd one could be a bug.
> >
> > I'll see if I can find someone who wants to work on the 1st and 3rd
> > points.
> >
> > Cheers,
> > Gorka.
> >
> > [1]:
> > https://github.com/openstack/os-brick/blob/e15edf6c17449899ec8401c37482f7cb5de207d3/os_brick/initiator/connectors/iscsi.py#L952
> > [2]:
> > https://github.com/opensvc/multipath-tools/blob/db4804bc7393f2482448bdd870132522e65dd98e/multipath/main.c#L1063-L1064
> > [3]:
> > https://github.com/opensvc/multipath-tools/blob/db4804bc7393f2482448bdd870132522e65dd98e/libmultipath/devmapper.c#L867-L872
> > [4]:
> > https://github.com/openstack/os-brick/blob/e15edf6c17449899ec8401c37482f7cb5de207d3/os_brick/initiator/linuxscsi.py#L384
> >
> >
> >
> > > >
> > > > Thu, 9 Mar 2023 at 15:55, Gorka Eguileor <geguileo@redhat.com>:
> > > >
> > > > > On 06/03, Rishat Azizov wrote:
> > > > > > Hi,
> > > > > >
> > > > > > It works with smaller volumes.
> > > > > >
> > > > > > multipath.conf is attached to this email.
> > > > > >
> > > > > > Cinder version - 18.2.0 Wallaby
> > > > >
> >
> >