<div dir="ltr"><div>This is the entre log from when the migration started:</div><div><br></div><div><a href="http://paste.openstack.org/show/799199/">http://paste.openstack.org/show/799199/</a></div><div><br></div><div>Ignazio<br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">Il giorno mar 20 ott 2020 alle ore 11:23 Gorka Eguileor <<a href="mailto:geguileo@redhat.com">geguileo@redhat.com</a>> ha scritto:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On 20/10, Ignazio Cassano wrote:<br>
> Hello Gorka,this is what happens on nova compute with debug enabled, when I<br>
> migrate an instance with iscsi volumes ( note Disconnecting from[] should<br>
> be the issue):<br>
><br>
<br>
Hi,<br>
<br>
The disconnect from [] is the right clue, not necessarily the issue.<br>
<br>
OS-Brick is saying that for the connection information that has been<br>
passed in the "disconnect_volume" call (which is not present in the<br>
emailed logs) there are no volumes present in the system.<br>
<br>
You should check the connection info that Nova is passing to<br>
disconnect_volume and confirm if that data is correct. For example<br>
checking if the path present in the connection info dictionary is the<br>
same as the one in the instance's XML dump, or if the LUN from the<br>
connection info dict is actually present in the system.<br>
<br>
There are multiple reasons why Nova could be passing the wrong<br>
connection info to os-brick. The ones that come to mind are:<br>
<br>
- There was a failed migration at some point, and Nova didn't rollback<br>
the connection info on the BDM table.<br>
- Nova is calling multiple times initialize_connection on Cinder for the<br>
same host and the driver being used is not idempotent.<br>
<br>
Cheers,<br>
Gorka.<br>
<br>
> stderr= _run_iscsiadm_bare<br>
> /usr/lib/python2.7/site-packages/os_brick/initiator/connectors/iscsi.py:1122<br>
> 2020-10-20 09:52:33.066 132171 DEBUG os_brick.initiator.connectors.iscsi<br>
> [-] iscsi session list stdout=tcp: [10] <a href="http://10.138.209.48:3260" rel="noreferrer" target="_blank">10.138.209.48:3260</a>,9<br>
> iqn.1992-04.com.emc:cx.ckm00184400687.a3 (non-flash)<br>
> tcp: [11] <a href="http://10.138.215.17:3260" rel="noreferrer" target="_blank">10.138.215.17:3260</a>,8 iqn.1992-04.com.emc:cx.ckm00184400687.a2<br>
> (non-flash)<br>
> tcp: [12] <a href="http://10.138.215.17:3260" rel="noreferrer" target="_blank">10.138.215.17:3260</a>,8 iqn.1992-04.com.emc:cx.ckm00184400687.a2<br>
> (non-flash)<br>
> tcp: [13] <a href="http://10.138.215.18:3260" rel="noreferrer" target="_blank">10.138.215.18:3260</a>,7 iqn.1992-04.com.emc:cx.ckm00184400687.b2<br>
> (non-flash)<br>
> tcp: [14] <a href="http://10.138.215.18:3260" rel="noreferrer" target="_blank">10.138.215.18:3260</a>,7 iqn.1992-04.com.emc:cx.ckm00184400687.b2<br>
> (non-flash)<br>
> tcp: [15] <a href="http://10.138.209.47:3260" rel="noreferrer" target="_blank">10.138.209.47:3260</a>,6 iqn.1992-04.com.emc:cx.ckm00184400687.b3<br>
> (non-flash)<br>
> tcp: [16] <a href="http://10.138.209.47:3260" rel="noreferrer" target="_blank">10.138.209.47:3260</a>,6 iqn.1992-04.com.emc:cx.ckm00184400687.b3<br>
> (non-flash)<br>
> tcp: [9] <a href="http://10.138.209.48:3260" rel="noreferrer" target="_blank">10.138.209.48:3260</a>,9 iqn.1992-04.com.emc:cx.ckm00184400687.a3<br>
> (non-flash)<br>
> stderr= _run_iscsi_session<br>
> /usr/lib/python2.7/site-packages/os_brick/initiator/connectors/iscsi.py:1111<br>
> 2020-10-20 09:52:33.078 132171 DEBUG os_brick.initiator.connectors.iscsi<br>
> [-] Resulting device map defaultdict(<function <lambda> at 0x7f4f1b1f7cf8>,<br>
> {(u'<a href="http://10.138.215.17:3260" rel="noreferrer" target="_blank">10.138.215.17:3260</a>', u'iqn.1992-04.com.emc:cx.ckm00184400687.a2'):<br>
> (set([]), set([u'sdg', u'sdi'])), (u'<a href="http://10.138.209.47:3260" rel="noreferrer" target="_blank">10.138.209.47:3260</a>',<br>
> u'iqn.1992-04.com.emc:cx.ckm00184400687.b3'): (set([]), set([u'sdo',<br>
> u'sdq'])), (u'<a href="http://10.138.209.48:3260" rel="noreferrer" target="_blank">10.138.209.48:3260</a>',<br>
> u'iqn.1992-04.com.emc:cx.ckm00184400687.a3'): (set([]), set([u'sdd',<br>
> u'sdb'])), (u'<a href="http://10.138.215.18:3260" rel="noreferrer" target="_blank">10.138.215.18:3260</a>',<br>
> u'iqn.1992-04.com.emc:cx.ckm00184400687.b2'): (set([]), set([u'sdm',<br>
> u'sdk']))}) _get_connection_devices<br>
> /usr/lib/python2.7/site-packages/os_brick/initiator/connectors/iscsi.py:844<br>
> 2020-10-20 09:52:33.078 132171 DEBUG os_brick.initiator.connectors.iscsi<br>
> [-] Disconnecting from: [] _disconnect_connection<br>
> /usr/lib/python2.7/site-packages/os_brick/initiator/connectors/iscsi.py:1099<br>
> 2020-10-20 09:52:33.079 132171 DEBUG oslo_concurrency.lockutils [-] Lock<br>
> "connect_volume" released by<br>
> "os_brick.initiator.connectors.iscsi.disconnect_volume" :: held 1.058s<br>
> inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:339<br>
> 2020-10-20 09:52:33.079 132171 DEBUG os_brick.initiator.connectors.iscsi<br>
> [-] <== disconnect_volume: return (1057ms) None trace_logging_wrapper<br>
> /usr/lib/python2.7/site-packages/os_brick/utils.py:170<br>
> 2020-10-20 09:52:33.079 132171 DEBUG nova.virt.libvirt.volume.iscsi [-]<br>
> [instance: 0c846f66-f194-40de-b31e-d53652570fa7] Disconnected iSCSI Volume<br>
> disconnect_volume<br>
> /usr/lib/python2.7/site-packages/nova/virt/libvirt/volume/iscsi.py:78<br>
> 2020-10-20 09:52:33.080 132171 DEBUG os_brick.utils [-] ==><br>
> get_connector_properties: call u"{'execute': None, 'my_ip':<br>
> '10.138.208.178', 'enforce_multipath': True, 'host': 'podiscsivc-kvm02',<br>
> 'root_helper': 'sudo nova-rootwrap /etc/nova/rootwrap.conf', 'multipath':<br>
> True}" trace_logging_wrapper<br>
> /usr/lib/python2.7/site-packages/os_brick/utils.py:146<br>
> 2020-10-20 09:52:33.125 132171 DEBUG os_brick.initiator.linuxfc [-] No<br>
> Fibre Channel support detected on system. get_fc_hbas<br>
> /usr/lib/python2.7/site-packages/os_brick/initiator/linuxfc.py:157<br>
> 2020-10-20 09:52:33.126 132171 DEBUG os_brick.initiator.linuxfc [-] No<br>
> Fibre Channel support detected on system. get_fc_hbas<br>
> /usr/lib/python2.7/site-packages/os_brick/initiator/linuxfc.py:157<br>
> 2020-10-20 09:52:33.145 132171 DEBUG os_brick.utils [-] <==<br>
> get_connector_properties: return (61ms) {'initiator':<br>
> u'iqn.1994-05.com.redhat:fbfdc37eed4c', 'ip': u'10.138.208.178', 'system<br>
> uuid': u'4C4C4544-0051-4E10-8057-B6C04F425932', 'platform': u'x86_64',<br>
> 'host': u'podiscsivc-kvm02', 'do_local_attach': False, 'os_type':<br>
> u'linux2', 'multipath': True} trace_logging_wrapper<br>
> /usr/lib/python2.7/site-packages/os_brick/utils.py:170<br>
><br>
><br>
> Best regards<br>
> Ignazio<br>
><br>
> Il giorno gio 15 ott 2020 alle ore 10:57 Gorka Eguileor <<a href="mailto:geguileo@redhat.com" target="_blank">geguileo@redhat.com</a>><br>
> ha scritto:<br>
><br>
> > On 14/10, Ignazio Cassano wrote:<br>
> > > Hello, thank you for the answer.<br>
> > > I am using os-brick 2.3.8 but I got same issues on stein with os.brick<br>
> > 2.8<br>
> > > For explain better the situation I send you the output of multipath -ll<br>
> > on<br>
> > > a compute node:<br>
> > > root@podvc-kvm01 ansible]# multipath -ll<br>
> > > Oct 14 18:50:01 | sdbg: alua not supported<br>
> > > Oct 14 18:50:01 | sdbe: alua not supported<br>
> > > Oct 14 18:50:01 | sdbd: alua not supported<br>
> > > Oct 14 18:50:01 | sdbf: alua not supported<br>
> > > 360060160f0d049007ab7275f743d0286 dm-11 DGC ,VRAID<br>
> > > size=30G features='1 retain_attached_hw_handler' hwhandler='1 alua' wp=rw<br>
> > > |-+- policy='round-robin 0' prio=0 status=enabled<br>
> > > | |- 15:0:0:71 sdbg 67:160 failed faulty running<br>
> > > | `- 12:0:0:71 sdbe 67:128 failed faulty running<br>
> > > `-+- policy='round-robin 0' prio=0 status=enabled<br>
> > > |- 11:0:0:71 sdbd 67:112 failed faulty running<br>
> > > `- 13:0:0:71 sdbf 67:144 failed faulty running<br>
> > > 360060160f0d049004cdb615f52343fdb dm-8 DGC ,VRAID<br>
> > > size=80G features='2 queue_if_no_path retain_attached_hw_handler'<br>
> > > hwhandler='1 alua' wp=rw<br>
> > > |-+- policy='round-robin 0' prio=50 status=active<br>
> > > | |- 15:0:0:210 sdau 66:224 active ready running<br>
> > > | `- 12:0:0:210 sdas 66:192 active ready running<br>
> > > `-+- policy='round-robin 0' prio=10 status=enabled<br>
> > > |- 11:0:0:210 sdar 66:176 active ready running<br>
> > > `- 13:0:0:210 sdat 66:208 active ready running<br>
> > > 360060160f0d0490034aa645fe52265eb dm-12 DGC ,VRAID<br>
> > > size=100G features='2 queue_if_no_path retain_attached_hw_handler'<br>
> > > hwhandler='1 alua' wp=rw<br>
> > > |-+- policy='round-robin 0' prio=50 status=active<br>
> > > | |- 12:0:0:177 sdbi 67:192 active ready running<br>
> > > | `- 15:0:0:177 sdbk 67:224 active ready running<br>
> > > `-+- policy='round-robin 0' prio=10 status=enabled<br>
> > > |- 11:0:0:177 sdbh 67:176 active ready running<br>
> > > `- 13:0:0:177 sdbj 67:208 active ready running<br>
> > > 360060160f0d04900159f225fd6126db9 dm-6 DGC ,VRAID<br>
> > > size=40G features='2 queue_if_no_path retain_attached_hw_handler'<br>
> > > hwhandler='1 alua' wp=rw<br>
> > > |-+- policy='round-robin 0' prio=50 status=active<br>
> > > | |- 11:0:0:26 sdaf 65:240 active ready running<br>
> > > | `- 13:0:0:26 sdah 66:16 active ready running<br>
> > > `-+- policy='round-robin 0' prio=10 status=enabled<br>
> > > |- 12:0:0:26 sdag 66:0 active ready running<br>
> > > `- 15:0:0:26 sdai 66:32 active ready running<br>
> > > Oct 14 18:50:01 | sdba: alua not supported<br>
> > > Oct 14 18:50:01 | sdbc: alua not supported<br>
> > > Oct 14 18:50:01 | sdaz: alua not supported<br>
> > > Oct 14 18:50:01 | sdbb: alua not supported<br>
> > > 360060160f0d049007eb7275f93937511 dm-10 DGC ,VRAID<br>
> > > size=40G features='1 retain_attached_hw_handler' hwhandler='1 alua' wp=rw<br>
> > > |-+- policy='round-robin 0' prio=0 status=enabled<br>
> > > | |- 12:0:0:242 sdba 67:64 failed faulty running<br>
> > > | `- 15:0:0:242 sdbc 67:96 failed faulty running<br>
> > > `-+- policy='round-robin 0' prio=0 status=enabled<br>
> > > |- 11:0:0:242 sdaz 67:48 failed faulty running<br>
> > > `- 13:0:0:242 sdbb 67:80 failed faulty running<br>
> > > 360060160f0d049003a567c5fb72201e8 dm-7 DGC ,VRAID<br>
> > > size=40G features='2 queue_if_no_path retain_attached_hw_handler'<br>
> > > hwhandler='1 alua' wp=rw<br>
> > > |-+- policy='round-robin 0' prio=50 status=active<br>
> > > | |- 12:0:0:57 sdbq 68:64 active ready running<br>
> > > | `- 15:0:0:57 sdbs 68:96 active ready running<br>
> > > `-+- policy='round-robin 0' prio=10 status=enabled<br>
> > > |- 11:0:0:57 sdbp 68:48 active ready running<br>
> > > `- 13:0:0:57 sdbr 68:80 active ready running<br>
> > > 360060160f0d04900c120625f802ea1fa dm-9 DGC ,VRAID<br>
> > > size=25G features='2 queue_if_no_path retain_attached_hw_handler'<br>
> > > hwhandler='1 alua' wp=rw<br>
> > > |-+- policy='round-robin 0' prio=50 status=active<br>
> > > | |- 11:0:0:234 sdav 66:240 active ready running<br>
> > > | `- 13:0:0:234 sdax 67:16 active ready running<br>
> > > `-+- policy='round-robin 0' prio=10 status=enabled<br>
> > > |- 15:0:0:234 sday 67:32 active ready running<br>
> > > `- 12:0:0:234 sdaw 67:0 active ready running<br>
> > > 360060160f0d04900b8b0615fb14ef1bd dm-3 DGC ,VRAID<br>
> > > size=50G features='2 queue_if_no_path retain_attached_hw_handler'<br>
> > > hwhandler='1 alua' wp=rw<br>
> > > |-+- policy='round-robin 0' prio=50 status=active<br>
> > > | |- 11:0:0:11 sdan 66:112 active ready running<br>
> > > | `- 13:0:0:11 sdap 66:144 active ready running<br>
> > > `-+- policy='round-robin 0' prio=10 status=enabled<br>
> > > |- 12:0:0:11 sdao 66:128 active ready running<br>
> > > `- 15:0:0:11 sdaq 66:160 active ready running<br>
> > ><br>
> > > The active running are related to running virtual machines.<br>
> > > The faulty are related to virtual macnines migrated on other kvm nodes.<br>
> > > Every volume has 4 path because iscsi on unity needs two different vlans,<br>
> > > each one with 2 addresses.<br>
> > > I think this issue can be related to os-brick because when I migrate a<br>
> > > virtual machine from host A host B in the cova compute log on host A I<br>
> > read:<br>
> > > 2020-10-13 10:31:02.769 118727 DEBUG os_brick.initiator.connectors.iscsi<br>
> > > [req-771ede8c-6e1b-4f3f-ad4a-1f6ed820a55c<br>
> > 66adb965bef64eaaab2af93ade87e2ca<br>
> > > 85cace94dcc7484c85ff9337eb1d0c4c - default default] *Disconnecting from:<br>
> > []*<br>
> > ><br>
> > > Ignazio<br>
> ><br>
> > Hi,<br>
> ><br>
> > That's definitely the right clue!! Though I don't fully agree with this<br>
> > being an os-brick issue just yet. ;-)<br>
> ><br>
> > Like I mentioned before, RCA is usually non-trivial, and explaining how<br>
> > to debug these issues over email is close to impossible, but if this<br>
> > were my system, and assuming you have tested normal attach/detach<br>
> > procedure and is working fine, this is what I would do:<br>
> ><br>
> > - Enable DEBUG logs on Nova compute node (I believe you already have)<br>
> > - Attach a new device to an instance on that node with --debug to get<br>
> > the request id<br>
> > - Get the connection information dictionary that os-brick receives on<br>
> > the call to connect_volume for that request, and the data that<br>
> > os-brick returns to Nova on that method call completion.<br>
> > - Check if the returned data to Nova is a multipathed device or not (in<br>
> > 'path'), and whether we have the wwn or not (in 'scsi_wwn'). It<br>
> > should be a multipath device, and then I would check its status in the<br>
> > multipath daemon.<br>
> > - Now do the live migration (with --debug to get the request id) and see<br>
> > what information Nova passes in that request to os-brick's<br>
> > disconnect_volume.<br>
> > - Is it the same? Then it's likely an os-brick issue, and I can have a<br>
> > look at the logs if you put the logs for that os-brick detach<br>
> > process in a pastebin [1].<br>
> > - Is it different? Then it's either a Nova bug or a Cinder driver<br>
> > specific bug.<br>
> > - Is there a call from Nova to Cinder, in the migration request, for<br>
> > that same volume to initialize_connection passing the source host<br>
> > connector info (info from the host that is currently attached)?<br>
> > If there is a call, check if the returned data is different from<br>
> > the one we used to do the attach, if that's the case then it's a<br>
> > Nova and Cinder driver bug that was solved on the Nova side in<br>
> > 17.0.10 [2].<br>
> > - If there's no call to Cinder's initialize_connection, the it's<br>
> > most likely a Nova bug. Try to find out if this connection info<br>
> > makes any sense for that host (LUN, target, etc.) or if this is<br>
> > the one from the destination volume.<br>
> ><br>
> > I hope this somehow helps.<br>
> ><br>
> > Cheers,<br>
> > Gorka.<br>
> ><br>
> ><br>
> > [1]: <a href="http://paste.openstack.org/" rel="noreferrer" target="_blank">http://paste.openstack.org/</a><br>
> > [2]: <a href="https://review.opendev.org/#/c/637827/" rel="noreferrer" target="_blank">https://review.opendev.org/#/c/637827/</a><br>
> ><br>
> > ><br>
> > > Il giorno mer 14 ott 2020 alle ore 13:41 Gorka Eguileor <<br>
> > <a href="mailto:geguileo@redhat.com" target="_blank">geguileo@redhat.com</a>><br>
> > > ha scritto:<br>
> > ><br>
> > > > On 09/10, Ignazio Cassano wrote:<br>
> > > > > Hello Stackers, I am using dell emc iscsi driver on my centos 7<br>
> > queens<br>
> > > > > openstack. It works and instances work as well but on compute nodes I<br>
> > > > got a<br>
> > > > > lot a faulty device reported by multipath il comand.<br>
> > > > > I do know why this happens, probably attacching and detaching<br>
> > volumes and<br>
> > > > > live migrating instances do not close something well.<br>
> > > > > I read this can cause serious performances problems on compute nodes.<br>
> > > > > Please, any workaround and/or patch is suggested ?<br>
> > > > > Regards<br>
> > > > > Ignazio<br>
> > > ><br>
> > > > Hi,<br>
> > > ><br>
> > > > There are many, many, many things that could be happening there, and<br>
> > > > it's not usually trivial doing the RCA, so the following questions are<br>
> > > > just me hoping this is something "easy" to find out.<br>
> > > ><br>
> > > > What os-brick version from Queens are you running? Latest (2.3.9), or<br>
> > > > maybe one older than 2.3.3?<br>
> > > ><br>
> > > > When you say you have faulty devices reported, are these faulty devices<br>
> > > > alone in the multipath DM? Or do you have some faulty ones with some<br>
> > > > that are ok?<br>
> > > ><br>
> > > > If there are some OK and some that aren't, are they consecutive<br>
> > devices?<br>
> > > > (as in /dev/sda /dev/sdb etc).<br>
> > > ><br>
> > > > Cheers,<br>
> > > > Gorka.<br>
> > > ><br>
> > > ><br>
> ><br>
> ><br>
<br>
</blockquote></div>