Ok, thanks. I am going to apply steps you suggested.
Ignazio

Il Gio 15 Ott 2020, 10:57 Gorka Eguileor <geguileo@redhat.com> ha scritto:
On 14/10, Ignazio Cassano wrote:
> Hello, thank you for the answer.
> I am using os-brick 2.3.8 but I got same issues on stein with os.brick 2.8
> For explain better the situation I send you the output of multipath -ll on
> a compute node:
> root@podvc-kvm01 ansible]# multipath -ll
> Oct 14 18:50:01 | sdbg: alua not supported
> Oct 14 18:50:01 | sdbe: alua not supported
> Oct 14 18:50:01 | sdbd: alua not supported
> Oct 14 18:50:01 | sdbf: alua not supported
> 360060160f0d049007ab7275f743d0286 dm-11 DGC     ,VRAID
> size=30G features='1 retain_attached_hw_handler' hwhandler='1 alua' wp=rw
> |-+- policy='round-robin 0' prio=0 status=enabled
> | |- 15:0:0:71  sdbg 67:160 failed faulty running
> | `- 12:0:0:71  sdbe 67:128 failed faulty running
> `-+- policy='round-robin 0' prio=0 status=enabled
>   |- 11:0:0:71  sdbd 67:112 failed faulty running
>   `- 13:0:0:71  sdbf 67:144 failed faulty running
> 360060160f0d049004cdb615f52343fdb dm-8 DGC     ,VRAID
> size=80G features='2 queue_if_no_path retain_attached_hw_handler'
> hwhandler='1 alua' wp=rw
> |-+- policy='round-robin 0' prio=50 status=active
> | |- 15:0:0:210 sdau 66:224 active ready running
> | `- 12:0:0:210 sdas 66:192 active ready running
> `-+- policy='round-robin 0' prio=10 status=enabled
>   |- 11:0:0:210 sdar 66:176 active ready running
>   `- 13:0:0:210 sdat 66:208 active ready running
> 360060160f0d0490034aa645fe52265eb dm-12 DGC     ,VRAID
> size=100G features='2 queue_if_no_path retain_attached_hw_handler'
> hwhandler='1 alua' wp=rw
> |-+- policy='round-robin 0' prio=50 status=active
> | |- 12:0:0:177 sdbi 67:192 active ready running
> | `- 15:0:0:177 sdbk 67:224 active ready running
> `-+- policy='round-robin 0' prio=10 status=enabled
>   |- 11:0:0:177 sdbh 67:176 active ready running
>   `- 13:0:0:177 sdbj 67:208 active ready running
> 360060160f0d04900159f225fd6126db9 dm-6 DGC     ,VRAID
> size=40G features='2 queue_if_no_path retain_attached_hw_handler'
> hwhandler='1 alua' wp=rw
> |-+- policy='round-robin 0' prio=50 status=active
> | |- 11:0:0:26  sdaf 65:240 active ready running
> | `- 13:0:0:26  sdah 66:16  active ready running
> `-+- policy='round-robin 0' prio=10 status=enabled
>   |- 12:0:0:26  sdag 66:0   active ready running
>   `- 15:0:0:26  sdai 66:32  active ready running
> Oct 14 18:50:01 | sdba: alua not supported
> Oct 14 18:50:01 | sdbc: alua not supported
> Oct 14 18:50:01 | sdaz: alua not supported
> Oct 14 18:50:01 | sdbb: alua not supported
> 360060160f0d049007eb7275f93937511 dm-10 DGC     ,VRAID
> size=40G features='1 retain_attached_hw_handler' hwhandler='1 alua' wp=rw
> |-+- policy='round-robin 0' prio=0 status=enabled
> | |- 12:0:0:242 sdba 67:64  failed faulty running
> | `- 15:0:0:242 sdbc 67:96  failed faulty running
> `-+- policy='round-robin 0' prio=0 status=enabled
>   |- 11:0:0:242 sdaz 67:48  failed faulty running
>   `- 13:0:0:242 sdbb 67:80  failed faulty running
> 360060160f0d049003a567c5fb72201e8 dm-7 DGC     ,VRAID
> size=40G features='2 queue_if_no_path retain_attached_hw_handler'
> hwhandler='1 alua' wp=rw
> |-+- policy='round-robin 0' prio=50 status=active
> | |- 12:0:0:57  sdbq 68:64  active ready running
> | `- 15:0:0:57  sdbs 68:96  active ready running
> `-+- policy='round-robin 0' prio=10 status=enabled
>   |- 11:0:0:57  sdbp 68:48  active ready running
>   `- 13:0:0:57  sdbr 68:80  active ready running
> 360060160f0d04900c120625f802ea1fa dm-9 DGC     ,VRAID
> size=25G features='2 queue_if_no_path retain_attached_hw_handler'
> hwhandler='1 alua' wp=rw
> |-+- policy='round-robin 0' prio=50 status=active
> | |- 11:0:0:234 sdav 66:240 active ready running
> | `- 13:0:0:234 sdax 67:16  active ready running
> `-+- policy='round-robin 0' prio=10 status=enabled
>   |- 15:0:0:234 sday 67:32  active ready running
>   `- 12:0:0:234 sdaw 67:0   active ready running
> 360060160f0d04900b8b0615fb14ef1bd dm-3 DGC     ,VRAID
> size=50G features='2 queue_if_no_path retain_attached_hw_handler'
> hwhandler='1 alua' wp=rw
> |-+- policy='round-robin 0' prio=50 status=active
> | |- 11:0:0:11  sdan 66:112 active ready running
> | `- 13:0:0:11  sdap 66:144 active ready running
> `-+- policy='round-robin 0' prio=10 status=enabled
>   |- 12:0:0:11  sdao 66:128 active ready running
>   `- 15:0:0:11  sdaq 66:160 active ready running
>
> The active running are related to running virtual machines.
> The faulty are related to virtual macnines migrated on other kvm nodes.
> Every volume has 4 path because iscsi on unity needs two different vlans,
> each one with 2 addresses.
> I think this issue can be related to os-brick because when I migrate a
> virtual machine from host A host B in the cova compute log on host A I read:
> 2020-10-13 10:31:02.769 118727 DEBUG os_brick.initiator.connectors.iscsi
> [req-771ede8c-6e1b-4f3f-ad4a-1f6ed820a55c 66adb965bef64eaaab2af93ade87e2ca
> 85cace94dcc7484c85ff9337eb1d0c4c - default default] *Disconnecting from: []*
>
> Ignazio

Hi,

That's definitely the right clue!!  Though I don't fully agree with this
being an os-brick issue just yet.  ;-)

Like I mentioned before, RCA is usually non-trivial, and explaining how
to debug these issues over email is close to impossible, but if this
were my system, and assuming you have tested normal attach/detach
procedure and is working fine, this is what I would do:

- Enable DEBUG logs on Nova compute node (I believe you already have)
- Attach a new device to an instance on that node with --debug to get
  the request id
- Get the connection information dictionary that os-brick receives on
  the call to connect_volume for that request, and the data that
  os-brick returns to Nova on that method call completion.
- Check if the returned data to Nova is a multipathed device or not (in
  'path'), and whether we have the wwn or not (in 'scsi_wwn').  It
  should be a multipath device, and then I would check its status in the
  multipath daemon.
- Now do the live migration (with --debug to get the request id) and see
  what information Nova passes in that request to os-brick's
  disconnect_volume.
  - Is it the same? Then it's likely an os-brick issue, and I can have a
    look at the logs if you put the logs for that os-brick detach
    process in a pastebin [1].
  - Is it different? Then it's either a Nova bug or a Cinder driver
    specific bug.
    - Is there a call from Nova to Cinder, in the migration request, for
      that same volume to initialize_connection passing the source host
      connector info (info from the host that is currently attached)?
      If there is a call, check if the returned data is different from
      the one we used to do the attach, if that's the case then it's a
      Nova and Cinder driver bug that was solved on the Nova side in
      17.0.10 [2].
    - If there's no call to Cinder's initialize_connection, the it's
      most likely a Nova bug. Try to find out if this connection info
      makes any sense for that host (LUN, target, etc.) or if this is
      the one from the destination volume.

I hope this somehow helps.

Cheers,
Gorka.


[1]: http://paste.openstack.org/
[2]: https://review.opendev.org/#/c/637827/

>
> Il giorno mer 14 ott 2020 alle ore 13:41 Gorka Eguileor <geguileo@redhat.com>
> ha scritto:
>
> > On 09/10, Ignazio Cassano wrote:
> > > Hello Stackers, I am using dell emc iscsi driver on my centos 7 queens
> > > openstack. It works and instances work as well but on compute nodes I
> > got a
> > > lot a faulty device reported by multipath il comand.
> > > I do know why this happens, probably attacching and detaching volumes and
> > > live migrating instances do not close something well.
> > > I read this can cause serious performances problems on compute nodes.
> > > Please, any workaround and/or patch is suggested ?
> > > Regards
> > > Ignazio
> >
> > Hi,
> >
> > There are many, many, many things that could be happening there, and
> > it's not usually trivial doing the RCA, so the following questions are
> > just me hoping this is something "easy" to find out.
> >
> > What os-brick version from Queens are you running?  Latest (2.3.9), or
> > maybe one older than 2.3.3?
> >
> > When you say you have faulty devices reported, are these faulty devices
> > alone in the multipath DM? Or do you have some faulty ones with some
> > that are ok?
> >
> > If there are some OK and some that aren't, are they consecutive devices?
> > (as in /dev/sda /dev/sdb etc).
> >
> > Cheers,
> > Gorka.
> >
> >