[Openstack][cinder] dell unity iscsi faulty devices

Ignazio Cassano ignaziocassano at gmail.com
Tue Oct 20 09:38:07 UTC 2020


This is the entire log from when the migration started:

http://paste.openstack.org/show/799199/

Ignazio

On Tue, Oct 20, 2020 at 11:23 AM Gorka Eguileor <geguileo at redhat.com>
wrote:

> On 20/10, Ignazio Cassano wrote:
> > Hello Gorka, this is what happens on the nova compute node with debug
> > enabled when I migrate an instance with iscsi volumes (note:
> > "Disconnecting from: []" should be the issue):
> >
>
> Hi,
>
> The disconnect from [] is the right clue, not necessarily the issue.
>
> OS-Brick is saying that for the connection information that has been
> passed in the "disconnect_volume" call (which is not present in the
> emailed logs) there are no volumes present in the system.
>
> You should check the connection info that Nova is passing to
> disconnect_volume and confirm that the data is correct.  For example,
> check whether the path present in the connection info dictionary is the
> same as the one in the instance's XML dump, or whether the LUN from the
> connection info dict is actually present on the system.
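>
> For example, a quick check along those lines could look like this (a
> minimal sketch; the connection_info values here are made up for
> illustration and would come from the nova-compute debug log, and the
> /dev/disk/by-path name follows the usual udev convention for iSCSI
> LUNs):
>
>     import os
>
>     # connection_info as logged by nova-compute on the disconnect_volume
>     # call (hypothetical values)
>     conn_info = {
>         "target_portal": "10.138.209.48:3260",
>         "target_iqn": "iqn.1992-04.com.emc:cx.ckm00184400687.a3",
>         "target_lun": 71,
>     }
>
>     # udev creates /dev/disk/by-path/ip-<portal>-iscsi-<iqn>-lun-<lun>
>     # for every iSCSI LUN the host actually sees
>     path = ("/dev/disk/by-path/ip-%(target_portal)s-iscsi-"
>             "%(target_iqn)s-lun-%(target_lun)s" % conn_info)
>     print(path, "exists" if os.path.exists(path) else "MISSING")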
>
> There are multiple reasons why Nova could be passing the wrong
> connection info to os-brick.  The ones that come to mind are:
>
> - There was a failed migration at some point, and Nova didn't roll back
>   the connection info on the BDM table (a sketch for checking this is
>   below).
> - Nova is calling initialize_connection multiple times on Cinder for
>   the same host, and the driver being used is not idempotent.
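>
> For the first case, the connection info Nova has stored can be checked
> directly in its database (a minimal sketch; block_device_mapping and
> its connection_info column are the standard Nova schema, but the DB
> URL and instance uuid are placeholders you'd adjust):
>
>     from sqlalchemy import create_engine, text
>
>     # hypothetical DSN; point it at your Nova database
>     engine = create_engine("mysql+pymysql://nova:secret@controller/nova")
>     with engine.connect() as conn:
>         rows = conn.execute(
>             text("SELECT volume_id, connection_info "
>                  "FROM block_device_mapping "
>                  "WHERE instance_uuid = :uuid AND deleted = 0"),
>             {"uuid": "0c846f66-f194-40de-b31e-d53652570fa7"})
>         for volume_id, connection_info in rows:
>             print(volume_id, connection_info)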
>
> Cheers,
> Gorka.
>
> > stderr= _run_iscsiadm_bare
> > /usr/lib/python2.7/site-packages/os_brick/initiator/connectors/iscsi.py:1122
> > 2020-10-20 09:52:33.066 132171 DEBUG os_brick.initiator.connectors.iscsi
> > [-] iscsi session list stdout=tcp: [10] 10.138.209.48:3260,9
> > iqn.1992-04.com.emc:cx.ckm00184400687.a3 (non-flash)
> > tcp: [11] 10.138.215.17:3260,8 iqn.1992-04.com.emc:cx.ckm00184400687.a2
> > (non-flash)
> > tcp: [12] 10.138.215.17:3260,8 iqn.1992-04.com.emc:cx.ckm00184400687.a2
> > (non-flash)
> > tcp: [13] 10.138.215.18:3260,7 iqn.1992-04.com.emc:cx.ckm00184400687.b2
> > (non-flash)
> > tcp: [14] 10.138.215.18:3260,7 iqn.1992-04.com.emc:cx.ckm00184400687.b2
> > (non-flash)
> > tcp: [15] 10.138.209.47:3260,6 iqn.1992-04.com.emc:cx.ckm00184400687.b3
> > (non-flash)
> > tcp: [16] 10.138.209.47:3260,6 iqn.1992-04.com.emc:cx.ckm00184400687.b3
> > (non-flash)
> > tcp: [9] 10.138.209.48:3260,9 iqn.1992-04.com.emc:cx.ckm00184400687.a3
> > (non-flash)
> >  stderr= _run_iscsi_session
> > /usr/lib/python2.7/site-packages/os_brick/initiator/connectors/iscsi.py:1111
> > 2020-10-20 09:52:33.078 132171 DEBUG os_brick.initiator.connectors.iscsi
> > [-] Resulting device map defaultdict(<function <lambda> at 0x7f4f1b1f7cf8>,
> > {(u'10.138.215.17:3260', u'iqn.1992-04.com.emc:cx.ckm00184400687.a2'):
> > (set([]), set([u'sdg', u'sdi'])), (u'10.138.209.47:3260',
> > u'iqn.1992-04.com.emc:cx.ckm00184400687.b3'): (set([]), set([u'sdo',
> > u'sdq'])), (u'10.138.209.48:3260',
> > u'iqn.1992-04.com.emc:cx.ckm00184400687.a3'): (set([]), set([u'sdd',
> > u'sdb'])), (u'10.138.215.18:3260',
> > u'iqn.1992-04.com.emc:cx.ckm00184400687.b2'): (set([]), set([u'sdm',
> > u'sdk']))}) _get_connection_devices
> >
> > /usr/lib/python2.7/site-packages/os_brick/initiator/connectors/iscsi.py:844
> > 2020-10-20 09:52:33.078 132171 DEBUG os_brick.initiator.connectors.iscsi
> > [-] Disconnecting from: [] _disconnect_connection
> >
> > /usr/lib/python2.7/site-packages/os_brick/initiator/connectors/iscsi.py:1099
> > 2020-10-20 09:52:33.079 132171 DEBUG oslo_concurrency.lockutils [-] Lock
> > "connect_volume" released by
> > "os_brick.initiator.connectors.iscsi.disconnect_volume" :: held 1.058s
> > inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:339
> > 2020-10-20 09:52:33.079 132171 DEBUG os_brick.initiator.connectors.iscsi
> > [-] <== disconnect_volume: return (1057ms) None trace_logging_wrapper
> > /usr/lib/python2.7/site-packages/os_brick/utils.py:170
> > 2020-10-20 09:52:33.079 132171 DEBUG nova.virt.libvirt.volume.iscsi [-]
> > [instance: 0c846f66-f194-40de-b31e-d53652570fa7] Disconnected iSCSI Volume
> > disconnect_volume
> > /usr/lib/python2.7/site-packages/nova/virt/libvirt/volume/iscsi.py:78
> > 2020-10-20 09:52:33.080 132171 DEBUG os_brick.utils [-] ==>
> > get_connector_properties: call u"{'execute': None, 'my_ip':
> > '10.138.208.178', 'enforce_multipath': True, 'host': 'podiscsivc-kvm02',
> > 'root_helper': 'sudo nova-rootwrap /etc/nova/rootwrap.conf', 'multipath':
> > True}" trace_logging_wrapper
> > /usr/lib/python2.7/site-packages/os_brick/utils.py:146
> > 2020-10-20 09:52:33.125 132171 DEBUG os_brick.initiator.linuxfc [-] No
> > Fibre Channel support detected on system. get_fc_hbas
> > /usr/lib/python2.7/site-packages/os_brick/initiator/linuxfc.py:157
> > 2020-10-20 09:52:33.126 132171 DEBUG os_brick.initiator.linuxfc [-] No
> > Fibre Channel support detected on system. get_fc_hbas
> > /usr/lib/python2.7/site-packages/os_brick/initiator/linuxfc.py:157
> > 2020-10-20 09:52:33.145 132171 DEBUG os_brick.utils [-] <==
> > get_connector_properties: return (61ms) {'initiator':
> > u'iqn.1994-05.com.redhat:fbfdc37eed4c', 'ip': u'10.138.208.178', 'system
> > uuid': u'4C4C4544-0051-4E10-8057-B6C04F425932', 'platform': u'x86_64',
> > 'host': u'podiscsivc-kvm02', 'do_local_attach': False, 'os_type':
> > u'linux2', 'multipath': True} trace_logging_wrapper
> > /usr/lib/python2.7/site-packages/os_brick/utils.py:170
> >
> >
> > Best regards
> > Ignazio
> >
> > On Thu, Oct 15, 2020 at 10:57 AM Gorka Eguileor <geguileo at redhat.com>
> > wrote:
> >
> > > On 14/10, Ignazio Cassano wrote:
> > > > Hello, thank you for the answer.
> > > > I am using os-brick 2.3.8, but I got the same issues on Stein with
> > > > os-brick 2.8.
> > > > To explain the situation better, I am sending you the output of
> > > > multipath -ll on a compute node:
> > > > [root@podvc-kvm01 ansible]# multipath -ll
> > > > Oct 14 18:50:01 | sdbg: alua not supported
> > > > Oct 14 18:50:01 | sdbe: alua not supported
> > > > Oct 14 18:50:01 | sdbd: alua not supported
> > > > Oct 14 18:50:01 | sdbf: alua not supported
> > > > 360060160f0d049007ab7275f743d0286 dm-11 DGC     ,VRAID
> > > > size=30G features='1 retain_attached_hw_handler' hwhandler='1 alua'
> > > > wp=rw
> > > > |-+- policy='round-robin 0' prio=0 status=enabled
> > > > | |- 15:0:0:71  sdbg 67:160 failed faulty running
> > > > | `- 12:0:0:71  sdbe 67:128 failed faulty running
> > > > `-+- policy='round-robin 0' prio=0 status=enabled
> > > >   |- 11:0:0:71  sdbd 67:112 failed faulty running
> > > >   `- 13:0:0:71  sdbf 67:144 failed faulty running
> > > > 360060160f0d049004cdb615f52343fdb dm-8 DGC     ,VRAID
> > > > size=80G features='2 queue_if_no_path retain_attached_hw_handler'
> > > > hwhandler='1 alua' wp=rw
> > > > |-+- policy='round-robin 0' prio=50 status=active
> > > > | |- 15:0:0:210 sdau 66:224 active ready running
> > > > | `- 12:0:0:210 sdas 66:192 active ready running
> > > > `-+- policy='round-robin 0' prio=10 status=enabled
> > > >   |- 11:0:0:210 sdar 66:176 active ready running
> > > >   `- 13:0:0:210 sdat 66:208 active ready running
> > > > 360060160f0d0490034aa645fe52265eb dm-12 DGC     ,VRAID
> > > > size=100G features='2 queue_if_no_path retain_attached_hw_handler'
> > > > hwhandler='1 alua' wp=rw
> > > > |-+- policy='round-robin 0' prio=50 status=active
> > > > | |- 12:0:0:177 sdbi 67:192 active ready running
> > > > | `- 15:0:0:177 sdbk 67:224 active ready running
> > > > `-+- policy='round-robin 0' prio=10 status=enabled
> > > >   |- 11:0:0:177 sdbh 67:176 active ready running
> > > >   `- 13:0:0:177 sdbj 67:208 active ready running
> > > > 360060160f0d04900159f225fd6126db9 dm-6 DGC     ,VRAID
> > > > size=40G features='2 queue_if_no_path retain_attached_hw_handler'
> > > > hwhandler='1 alua' wp=rw
> > > > |-+- policy='round-robin 0' prio=50 status=active
> > > > | |- 11:0:0:26  sdaf 65:240 active ready running
> > > > | `- 13:0:0:26  sdah 66:16  active ready running
> > > > `-+- policy='round-robin 0' prio=10 status=enabled
> > > >   |- 12:0:0:26  sdag 66:0   active ready running
> > > >   `- 15:0:0:26  sdai 66:32  active ready running
> > > > Oct 14 18:50:01 | sdba: alua not supported
> > > > Oct 14 18:50:01 | sdbc: alua not supported
> > > > Oct 14 18:50:01 | sdaz: alua not supported
> > > > Oct 14 18:50:01 | sdbb: alua not supported
> > > > 360060160f0d049007eb7275f93937511 dm-10 DGC     ,VRAID
> > > > size=40G features='1 retain_attached_hw_handler' hwhandler='1 alua'
> > > > wp=rw
> > > > |-+- policy='round-robin 0' prio=0 status=enabled
> > > > | |- 12:0:0:242 sdba 67:64  failed faulty running
> > > > | `- 15:0:0:242 sdbc 67:96  failed faulty running
> > > > `-+- policy='round-robin 0' prio=0 status=enabled
> > > >   |- 11:0:0:242 sdaz 67:48  failed faulty running
> > > >   `- 13:0:0:242 sdbb 67:80  failed faulty running
> > > > 360060160f0d049003a567c5fb72201e8 dm-7 DGC     ,VRAID
> > > > size=40G features='2 queue_if_no_path retain_attached_hw_handler'
> > > > hwhandler='1 alua' wp=rw
> > > > |-+- policy='round-robin 0' prio=50 status=active
> > > > | |- 12:0:0:57  sdbq 68:64  active ready running
> > > > | `- 15:0:0:57  sdbs 68:96  active ready running
> > > > `-+- policy='round-robin 0' prio=10 status=enabled
> > > >   |- 11:0:0:57  sdbp 68:48  active ready running
> > > >   `- 13:0:0:57  sdbr 68:80  active ready running
> > > > 360060160f0d04900c120625f802ea1fa dm-9 DGC     ,VRAID
> > > > size=25G features='2 queue_if_no_path retain_attached_hw_handler'
> > > > hwhandler='1 alua' wp=rw
> > > > |-+- policy='round-robin 0' prio=50 status=active
> > > > | |- 11:0:0:234 sdav 66:240 active ready running
> > > > | `- 13:0:0:234 sdax 67:16  active ready running
> > > > `-+- policy='round-robin 0' prio=10 status=enabled
> > > >   |- 15:0:0:234 sday 67:32  active ready running
> > > >   `- 12:0:0:234 sdaw 67:0   active ready running
> > > > 360060160f0d04900b8b0615fb14ef1bd dm-3 DGC     ,VRAID
> > > > size=50G features='2 queue_if_no_path retain_attached_hw_handler'
> > > > hwhandler='1 alua' wp=rw
> > > > |-+- policy='round-robin 0' prio=50 status=active
> > > > | |- 11:0:0:11  sdan 66:112 active ready running
> > > > | `- 13:0:0:11  sdap 66:144 active ready running
> > > > `-+- policy='round-robin 0' prio=10 status=enabled
> > > >   |- 12:0:0:11  sdao 66:128 active ready running
> > > >   `- 15:0:0:11  sdaq 66:160 active ready running
> > > >
> > > > The active/ready paths are related to running virtual machines.
> > > > The faulty ones are related to virtual machines migrated to other
> > > > kvm nodes.
> > > > Every volume has 4 paths because iscsi on the Unity needs two
> > > > different vlans, each one with 2 addresses.
> > > > I think this issue can be related to os-brick, because when I
> > > > migrate a virtual machine from host A to host B, in the nova
> > > > compute log on host A I read (a sketch for spotting the all-faulty
> > > > maps follows the log excerpt below):
> > > > 2020-10-13 10:31:02.769 118727 DEBUG os_brick.initiator.connectors.iscsi
> > > > [req-771ede8c-6e1b-4f3f-ad4a-1f6ed820a55c 66adb965bef64eaaab2af93ade87e2ca
> > > > 85cace94dcc7484c85ff9337eb1d0c4c - default default] *Disconnecting from: []*
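> > > >
> > > > To spot the all-faulty maps I use something like this (a minimal
> > > > sketch that parses the multipath -ll output above; the first regex
> > > > matches the 33-character wwids and the second the H:C:T:L path
> > > > lines):
> > > >
> > > >     import re
> > > >     import subprocess
> > > >
> > > >     # list multipath maps whose paths are all "failed faulty"
> > > >     out = subprocess.check_output(["multipath", "-ll"]).decode()
> > > >     current, paths = None, {}
> > > >     for line in out.splitlines():
> > > >         m = re.match(r"^(3[0-9a-f]{32})\s", line)
> > > >         if m:
> > > >             current = m.group(1)
> > > >             paths[current] = []
> > > >         elif current and re.search(r"\d+:\d+:\d+:\d+\s+sd\w+", line):
> > > >             paths[current].append("failed faulty" in line)
> > > >     for wwid, flags in paths.items():
> > > >         if flags and all(flags):
> > > >             print("all paths faulty:", wwid)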
> > > >
> > > > Ignazio
> > >
> > > Hi,
> > >
> > > That's definitely the right clue!!  Though I don't fully agree with
> > > this being an os-brick issue just yet.  ;-)
> > >
> > > Like I mentioned before, RCA is usually non-trivial, and explaining
> > > how to debug these issues over email is close to impossible, but if
> > > this were my system, and assuming you have tested the normal
> > > attach/detach procedure and it is working fine, this is what I would
> > > do:
> > >
> > > - Enable DEBUG logs on the Nova compute node (I believe you already
> > >   have).
> > > - Attach a new device to an instance on that node with --debug to get
> > >   the request id.
> > > - Get the connection information dictionary that os-brick receives on
> > >   the call to connect_volume for that request, and the data that
> > >   os-brick returns to Nova on that method call completion.
> > > - Check if the returned data to Nova is a multipathed device or not (in
> > >   'path'), and whether we have the wwn or not (in 'scsi_wwn').  It
> > >   should be a multipath device, and then I would check its status in
> > >   the multipath daemon.
> > > - Now do the live migration (with --debug to get the request id) and
> > >   see what information Nova passes in that request to os-brick's
> > >   disconnect_volume (a sketch for extracting both is below, after
> > >   this list).
> > >   - Is it the same? Then it's likely an os-brick issue, and I can have
> > >     a look at the logs if you put the logs for that os-brick detach
> > >     process in a pastebin [1].
> > >   - Is it different? Then it's either a Nova bug or a Cinder driver
> > >     specific bug.
> > >     - Is there a call from Nova to Cinder, in the migration request,
> > >       for that same volume to initialize_connection, passing the
> > >       source host connector info (info from the host that is currently
> > >       attached)?  If there is a call, check if the returned data is
> > >       different from the one we used to do the attach; if that's the
> > >       case, then it's a Nova and Cinder driver bug that was solved on
> > >       the Nova side in 17.0.10 [2].
> > >     - If there's no call to Cinder's initialize_connection, then it's
> > >       most likely a Nova bug. Try to find out if this connection info
> > >       makes any sense for that host (LUN, target, etc.) or if this is
> > >       the one from the destination volume.
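> > >
> > > For the comparison, something like this can pull the relevant calls
> > > out of the nova-compute log (a minimal sketch; it only relies on the
> > > ==>/<== markers that os-brick's trace logging writes, as seen in your
> > > excerpt, and the request id is whatever --debug reported):
> > >
> > >     import re
> > >     import sys
> > >
> > >     # usage: extract_osbrick.py <nova-compute.log> [req-id]
> > >     # note: some os-brick trace lines log "[-]" instead of the
> > >     # request id, so the req-id filter is optional
> > >     logfile = sys.argv[1]
> > >     req_id = sys.argv[2] if len(sys.argv) > 2 else ""
> > >     marker = re.compile(r"(==>|<==) (connect_volume|disconnect_volume)")
> > >     with open(logfile) as f:
> > >         for line in f:
> > >             if marker.search(line) and req_id in line:
> > >                 print(line.rstrip())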
> > >
> > > I hope this somehow helps.
> > >
> > > Cheers,
> > > Gorka.
> > >
> > >
> > > [1]: http://paste.openstack.org/
> > > [2]: https://review.opendev.org/#/c/637827/
> > >
> > > >
> > > > On Wed, Oct 14, 2020 at 1:41 PM Gorka Eguileor
> > > > <geguileo at redhat.com> wrote:
> > > >
> > > > > On 09/10, Ignazio Cassano wrote:
> > > > > > Hello Stackers, I am using the dell emc iscsi driver on my
> > > > > > centos 7 queens openstack. It works, and instances work as well,
> > > > > > but on compute nodes I got a lot of faulty devices reported by
> > > > > > the multipath command.
> > > > > > I do not know exactly why this happens; probably attaching and
> > > > > > detaching volumes and live migrating instances do not clean
> > > > > > something up well.
> > > > > > I read this can cause serious performance problems on compute
> > > > > > nodes.
> > > > > > Please, is there any workaround and/or patch suggested?
> > > > > > Regards
> > > > > > Ignazio
> > > > >
> > > > > Hi,
> > > > >
> > > > > There are many, many, many things that could be happening there,
> > > > > and it's not usually trivial doing the RCA, so the following
> > > > > questions are just me hoping this is something "easy" to find out.
> > > > >
> > > > > What os-brick version from Queens are you running?  Latest
> > > > > (2.3.9), or maybe one older than 2.3.3?
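> > > > >
> > > > > (If you're not sure, something like this prints the installed
> > > > > version; a minimal sketch using setuptools metadata, which should
> > > > > work however os-brick was installed:)
> > > > >
> > > > >     import pkg_resources
> > > > >     print(pkg_resources.get_distribution("os-brick").version)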
> > > > >
> > > > > When you say you have faulty devices reported, are these faulty
> > > > > devices alone in the multipath DM? Or do you have some faulty ones
> > > > > with some that are ok?
> > > > >
> > > > > If there are some OK and some that aren't, are they consecutive
> > > > > devices? (as in /dev/sda /dev/sdb etc.)
> > > > >
> > > > > Cheers,
> > > > > Gorka.
> > > > >
> > > > >
> > >
> > >
>
>

