Re: [Openstack][cinder] dell unity iscsi faulty devices

28 Oct 2020


      Hello Gorka, I would like to know if with unity iscsi driver,  I must
configure iscsi initiator on both compute and controller nodes.
At this time I installed and cinfigured iscsi initiator only on compute
nodes and I got a lot of faulty devices when volumes ate detached.
Thanks
Ignazio


Il Mar 20 Ott 2020, 19:58 Gorka Eguileor <geguileo@redhat.com> ha scritto:
...
On 20/10, Ignazio Cassano wrote:
...
This is the entre log from when the migration started:
http://paste.openstack.org/show/799199/
Ignazio
Hi,
There are no os-brick calls in there. :-(
You should look for the call to connect_volume that should have
something like:
==> disconnect_volume: call "{'args':
(<os_brick.initiator.connectors.iscsi
And the second parameter to that call is a dictionary where you can see
the target_lun, target_luns, target_portals, target_portal, target_iqn,
target_iqns...  This will allow us to check if we are actually connected
to those targets-portals
The third parameter should contain two things that are relevant, the
scsi_wwn and the path.  You can check if the path exists and if that
path actually has that wwn using /lib/udev/scsi_id --page 0x83
--whitelisted $path
Those are the things I would check, because the only reason I can think
that os-brick is not disconnecting any volumes are that the connection
info is not right, or that the volume is no longer connected.
Cheers,
Gorka.
...
Il giorno mar 20 ott 2020 alle ore 11:23 Gorka Eguileor <
...
ha scritto:
...
On 20/10, Ignazio Cassano wrote:
...
Hello Gorka,this is what happens on nova compute with debug enabled,
when I
migrate an instance with iscsi volumes ( note Disconnecting from[]
should
be the issue):
Hi,
The disconnect from [] is the right clue, not necessarily the issue.
OS-Brick is saying that for the connection information that has been
passed in the "disconnect_volume" call (which is not present in the
emailed logs) there are no volumes present in the system.
You should check the connection info that Nova is passing to
disconnect_volume and confirm if that data is correct.  For example
checking if the path present in the connection info dictionary is the
same as the one in the instance's XML dump, or if the LUN from the
connection info dict is actually present in the system.
There are multiple reasons why Nova could be passing the wrong
connection info to os-brick.  The ones that come to mind are:
- There was a failed migration at some point, and Nova didn't rollback
  the connection info on the BDM table.
- Nova is calling multiple times initialize_connection on Cinder for
...
...
same host and the driver being used is not idempotent.
Cheers,
Gorka.
...
stderr= _run_iscsiadm_bare
/usr/lib/python2.7/site-packages/os_brick/initiator/connectors/iscsi.py:1122
...
...
2020-10-20 09:52:33.066 132171 DEBUG
os_brick.initiator.connectors.iscsi
[-] iscsi session list stdout=tcp: [10] 10.138.209.48:3260,9
iqn.1992-04.com.emc:cx.ckm00184400687.a3 (non-flash)
tcp: [11] 10.138.215.17:3260,8
iqn.1992-04.com.emc:cx.ckm00184400687.a2
(non-flash)
tcp: [12] 10.138.215.17:3260,8
iqn.1992-04.com.emc:cx.ckm00184400687.a2
(non-flash)
tcp: [13] 10.138.215.18:3260,7
iqn.1992-04.com.emc:cx.ckm00184400687.b2
(non-flash)
tcp: [14] 10.138.215.18:3260,7
iqn.1992-04.com.emc:cx.ckm00184400687.b2
(non-flash)
tcp: [15] 10.138.209.47:3260,6
iqn.1992-04.com.emc:cx.ckm00184400687.b3
(non-flash)
tcp: [16] 10.138.209.47:3260,6
iqn.1992-04.com.emc:cx.ckm00184400687.b3
(non-flash)
tcp: [9] 10.138.209.48:3260,9
iqn.1992-04.com.emc:cx.ckm00184400687.a3
(non-flash)
 stderr= _run_iscsi_session
/usr/lib/python2.7/site-packages/os_brick/initiator/connectors/iscsi.py:1111
...
...
2020-10-20 09:52:33.078 132171 DEBUG
os_brick.initiator.connectors.iscsi
[-] Resulting device map defaultdict(<function <lambda> at
0x7f4f1b1f7cf8>,
{(u'10.138.215.17:3260',
u'iqn.1992-04.com.emc:cx.ckm00184400687.a2'):
(set([]), set([u'sdg', u'sdi'])), (u'10.138.209.47:3260',
u'iqn.1992-04.com.emc:cx.ckm00184400687.b3'): (set([]), set([u'sdo',
u'sdq'])), (u'10.138.209.48:3260',
u'iqn.1992-04.com.emc:cx.ckm00184400687.a3'): (set([]), set([u'sdd',
u'sdb'])), (u'10.138.215.18:3260',
u'iqn.1992-04.com.emc:cx.ckm00184400687.b2'): (set([]), set([u'sdm',
u'sdk']))}) _get_connection_devices
/usr/lib/python2.7/site-packages/os_brick/initiator/connectors/iscsi.py:844
...
...
2020-10-20 09:52:33.078 132171 DEBUG
os_brick.initiator.connectors.iscsi
[-] Disconnecting from: [] _disconnect_connection
/usr/lib/python2.7/site-packages/os_brick/initiator/connectors/iscsi.py:1099
...
...
2020-10-20 09:52:33.079 132171 DEBUG oslo_concurrency.lockutils [-]
Lock
"connect_volume" released by
"os_brick.initiator.connectors.iscsi.disconnect_volume" :: held
1.058s
inner
/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:339
2020-10-20 09:52:33.079 132171 DEBUG
os_brick.initiator.connectors.iscsi
[-] <== disconnect_volume: return (1057ms) None trace_logging_wrapper
/usr/lib/python2.7/site-packages/os_brick/utils.py:170
2020-10-20 09:52:33.079 132171 DEBUG nova.virt.libvirt.volume.iscsi
[-]
[instance: 0c846f66-f194-40de-b31e-d53652570fa7] Disconnected iSCSI
Volume
disconnect_volume
/usr/lib/python2.7/site-packages/nova/virt/libvirt/volume/iscsi.py:78
2020-10-20 09:52:33.080 132171 DEBUG os_brick.utils [-] ==>
get_connector_properties: call u"{'execute': None, 'my_ip':
'10.138.208.178', 'enforce_multipath': True, 'host':
'podiscsivc-kvm02',
'root_helper': 'sudo nova-rootwrap /etc/nova/rootwrap.conf',
'multipath':
True}" trace_logging_wrapper
/usr/lib/python2.7/site-packages/os_brick/utils.py:146
2020-10-20 09:52:33.125 132171 DEBUG os_brick.initiator.linuxfc [-]
No
Fibre Channel support detected on system. get_fc_hbas
/usr/lib/python2.7/site-packages/os_brick/initiator/linuxfc.py:157
2020-10-20 09:52:33.126 132171 DEBUG os_brick.initiator.linuxfc [-]
No
Fibre Channel support detected on system. get_fc_hbas
/usr/lib/python2.7/site-packages/os_brick/initiator/linuxfc.py:157
2020-10-20 09:52:33.145 132171 DEBUG os_brick.utils [-] <==
get_connector_properties: return (61ms) {'initiator':
u'iqn.1994-05.com.redhat:fbfdc37eed4c', 'ip': u'10.138.208.178',
'system
uuid': u'4C4C4544-0051-4E10-8057-B6C04F425932', 'platform':
u'x86_64',
'host': u'podiscsivc-kvm02', 'do_local_attach': False, 'os_type':
u'linux2', 'multipath': True} trace_logging_wrapper
/usr/lib/python2.7/site-packages/os_brick/utils.py:170
Best regards
Ignazio
Il giorno gio 15 ott 2020 alle ore 10:57 Gorka Eguileor <
geguileo@redhat.com>
ha scritto:
...
On 14/10, Ignazio Cassano wrote:
...
Hello, thank you for the answer.
I am using os-brick 2.3.8 but I got same issues on stein with
os.brick
2.8
For explain better the situation I send you the output of
multipath
-ll
on
a compute node:
root@podvc-kvm01 ansible]# multipath -ll
Oct 14 18:50:01 | sdbg: alua not supported
Oct 14 18:50:01 | sdbe: alua not supported
Oct 14 18:50:01 | sdbd: alua not supported
Oct 14 18:50:01 | sdbf: alua not supported
360060160f0d049007ab7275f743d0286 dm-11 DGC     ,VRAID
size=30G features='1 retain_attached_hw_handler' hwhandler='1
alua'
wp=rw
|-+- policy='round-robin 0' prio=0 status=enabled
| |- 15:0:0:71  sdbg 67:160 failed faulty running
| `- 12:0:0:71  sdbe 67:128 failed faulty running
`-+- policy='round-robin 0' prio=0 status=enabled
  |- 11:0:0:71  sdbd 67:112 failed faulty running
  `- 13:0:0:71  sdbf 67:144 failed faulty running
360060160f0d049004cdb615f52343fdb dm-8 DGC     ,VRAID
size=80G features='2 queue_if_no_path retain_attached_hw_handler'
hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| |- 15:0:0:210 sdau 66:224 active ready running
| `- 12:0:0:210 sdas 66:192 active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  |- 11:0:0:210 sdar 66:176 active ready running
  `- 13:0:0:210 sdat 66:208 active ready running
360060160f0d0490034aa645fe52265eb dm-12 DGC     ,VRAID
size=100G features='2 queue_if_no_path
retain_attached_hw_handler'
hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| |- 12:0:0:177 sdbi 67:192 active ready running
| `- 15:0:0:177 sdbk 67:224 active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  |- 11:0:0:177 sdbh 67:176 active ready running
  `- 13:0:0:177 sdbj 67:208 active ready running
360060160f0d04900159f225fd6126db9 dm-6 DGC     ,VRAID
size=40G features='2 queue_if_no_path retain_attached_hw_handler'
hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| |- 11:0:0:26  sdaf 65:240 active ready running
| `- 13:0:0:26  sdah 66:16  active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  |- 12:0:0:26  sdag 66:0   active ready running
  `- 15:0:0:26  sdai 66:32  active ready running
Oct 14 18:50:01 | sdba: alua not supported
Oct 14 18:50:01 | sdbc: alua not supported
Oct 14 18:50:01 | sdaz: alua not supported
Oct 14 18:50:01 | sdbb: alua not supported
360060160f0d049007eb7275f93937511 dm-10 DGC     ,VRAID
size=40G features='1 retain_attached_hw_handler' hwhandler='1
alua'
wp=rw
|-+- policy='round-robin 0' prio=0 status=enabled
| |- 12:0:0:242 sdba 67:64  failed faulty running
| `- 15:0:0:242 sdbc 67:96  failed faulty running
`-+- policy='round-robin 0' prio=0 status=enabled
  |- 11:0:0:242 sdaz 67:48  failed faulty running
  `- 13:0:0:242 sdbb 67:80  failed faulty running
360060160f0d049003a567c5fb72201e8 dm-7 DGC     ,VRAID
size=40G features='2 queue_if_no_path retain_attached_hw_handler'
hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| |- 12:0:0:57  sdbq 68:64  active ready running
| `- 15:0:0:57  sdbs 68:96  active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  |- 11:0:0:57  sdbp 68:48  active ready running
  `- 13:0:0:57  sdbr 68:80  active ready running
360060160f0d04900c120625f802ea1fa dm-9 DGC     ,VRAID
size=25G features='2 queue_if_no_path retain_attached_hw_handler'
hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| |- 11:0:0:234 sdav 66:240 active ready running
| `- 13:0:0:234 sdax 67:16  active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  |- 15:0:0:234 sday 67:32  active ready running
  `- 12:0:0:234 sdaw 67:0   active ready running
360060160f0d04900b8b0615fb14ef1bd dm-3 DGC     ,VRAID
size=50G features='2 queue_if_no_path retain_attached_hw_handler'
hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| |- 11:0:0:11  sdan 66:112 active ready running
| `- 13:0:0:11  sdap 66:144 active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  |- 12:0:0:11  sdao 66:128 active ready running
  `- 15:0:0:11  sdaq 66:160 active ready running
The active running are related to running virtual machines.
The faulty are related to virtual macnines migrated on other kvm
nodes.
Every volume has 4 path because iscsi on unity needs two
different
vlans,
each one with 2 addresses.
I think this issue can be related to os-brick because when I
migrate
a
virtual machine from host A host B in the cova compute log on
host A
I
read:
2020-10-13 10:31:02.769 118727 DEBUG
os_brick.initiator.connectors.iscsi
[req-771ede8c-6e1b-4f3f-ad4a-1f6ed820a55c
66adb965bef64eaaab2af93ade87e2ca
85cace94dcc7484c85ff9337eb1d0c4c - default default]
*Disconnecting
from:
[]*
Ignazio
Hi,
That's definitely the right clue!!  Though I don't fully agree with
this
being an os-brick issue just yet.  ;-)
Like I mentioned before, RCA is usually non-trivial, and
explaining how
to debug these issues over email is close to impossible, but if
...
...
...
...
were my system, and assuming you have tested normal attach/detach
procedure and is working fine, this is what I would do:
- Enable DEBUG logs on Nova compute node (I believe you already
have)
- Attach a new device to an instance on that node with --debug to
get
  the request id
- Get the connection information dictionary that os-brick receives
on
  the call to connect_volume for that request, and the data that
  os-brick returns to Nova on that method call completion.
- Check if the returned data to Nova is a multipathed device or
not (in
  'path'), and whether we have the wwn or not (in 'scsi_wwn').  It
  should be a multipath device, and then I would check its status
in
the
  multipath daemon.
- Now do the live migration (with --debug to get the request id)
and
see
  what information Nova passes in that request to os-brick's
  disconnect_volume.
  - Is it the same? Then it's likely an os-brick issue, and I can
have
a
    look at the logs if you put the logs for that os-brick detach
    process in a pastebin [1].
  - Is it different? Then it's either a Nova bug or a Cinder driver
    specific bug.
    - Is there a call from Nova to Cinder, in the migration
request,
for
      that same volume to initialize_connection passing the source
host
      connector info (info from the host that is currently
attached)?
      If there is a call, check if the returned data is different
from
      the one we used to do the attach, if that's the case then
it's a
      Nova and Cinder driver bug that was solved on the Nova side
in
      17.0.10 [2].
    - If there's no call to Cinder's initialize_connection, the
it's
      most likely a Nova bug. Try to find out if this connection
info
      makes any sense for that host (LUN, target, etc.) or if this
is
      the one from the destination volume.
I hope this somehow helps.
Cheers,
Gorka.
[1]: http://paste.openstack.org/
[2]: https://review.opendev.org/#/c/637827/
...
Il giorno mer 14 ott 2020 alle ore 13:41 Gorka Eguileor <
geguileo@redhat.com>
...
ha scritto:
> On 09/10, Ignazio Cassano wrote:
> > Hello Stackers, I am using dell emc iscsi driver on my
centos 7
queens
> > openstack. It works and instances work as well but on compute
nodes I
> got a
> > lot a faulty device reported by multipath il comand.
> > I do know why this happens, probably attacching and detaching
volumes and
> > live migrating instances do not close something well.
> > I read this can cause serious performances problems on
compute
nodes.
> > Please, any workaround and/or patch is suggested ?
> > Regards
> > Ignazio
>
> Hi,
>
> There are many, many, many things that could be happening
geguileo@redhat.com>
the
this
there,
...
...
and
...
...
...
> it's not usually trivial doing the RCA, so the following
questions
are
> just me hoping this is something "easy" to find out.
>
> What os-brick version from Queens are you running?  Latest
(2.3.9), or
> maybe one older than 2.3.3?
>
> When you say you have faulty devices reported, are these faulty
devices
> alone in the multipath DM? Or do you have some faulty ones with
some
> that are ok?
>
> If there are some OK and some that aren't, are they consecutive
devices?
> (as in /dev/sda /dev/sdb etc).
>
> Cheers,
> Gorka.
>
>