Re: [Openstack][cinder] dell unity iscsi faulty devices
On 30/10, Ignazio Cassano wrote:
Please see the last email, where I upgraded to the latest OpenStack Nova on Queens:

[root@compute-0 nova]# rpm -qa|grep queens
centos-release-openstack-queens-1-2.el7.centos.noarch
[root@compute-0 nova]# rpm -qa|grep nova
openstack-nova-compute-17.0.13-1.el7.noarch
openstack-nova-common-17.0.13-1.el7.noarch
python-nova-17.0.13-1.el7.noarch
python2-novaclient-10.1.0-1.el7.noarch

I sent you the logs from the updated release. I am not skilled enough at reading detailed log output to diagnose this issue. Sorry.
Ignazio

Hi,

I missed that email and the attachment, sorry.

The logs you sent me were missing most of the connect_volume call, and only the end of the call was present, but I think it doesn't matter as I see what the problem is.

The problem is that some of the nodes and sessions are duplicated.

An example of a duplicated node:

tcp: [3] 10.102.189.156:3260,15 iqn.1992-04.com.emc:cx.ckm00200502005.b11 (non-flash)
tcp: [4] 10.102.189.156:3260,15 iqn.1992-04.com.emc:cx.ckm00200502005.b11 (non-flash)

An example of that node's duplicated session:

tcp: [3] 10.102.189.156:3260,15 iqn.1992-04.com.emc:cx.ckm00200502005.b11 (non-flash)
tcp: [4] 10.102.189.156:3260,15 iqn.1992-04.com.emc:cx.ckm00200502005.b11 (non-flash)

And os-brick is not prepared to handle that, because it is programmed to reuse the nodes and sessions.

So on disconnect it gets the first of each to look for the volumes provided by it. In the example of the duplicated node-session above, it sees that it provides /dev/sdd, but that is not one of the disks that belong to the multipath that we are disconnecting, so it gets ignored. The volume we are looking for is probably on the second session.

So from this point forward (where we have duplicated node-sessions) it will not work again.

I recommend you clean up that system so that you don't have duplicated nodes and sessions before trying to do a VM migration with a single volume attached.

If that works, then try to attach 2 volumes on instances on the same host and see if the nodes and sessions are duplicated.

Cheers,
Gorka.
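To make the duplicated-session check above easier to repeat, here is a minimal Python sketch that groups session ids by portal and IQN. It assumes the session list has the same shape as the lines quoted in this thread (as printed by iscsiadm -m session); the function name find_duplicate_sessions is made up for illustration and is not part of os-brick.

```python
from collections import defaultdict


def find_duplicate_sessions(session_output):
    """Group session ids by (portal, iqn); return only groups with more than one id."""
    sessions = defaultdict(list)
    for line in session_output.splitlines():
        fields = line.split()
        # Expected shape: "tcp: [3] 10.102.189.156:3260,15 iqn... (non-flash)"
        if len(fields) < 4 or not fields[1].startswith("["):
            continue
        session_id = fields[1].strip("[]")
        portal = fields[2].split(",")[0]  # drop the target portal group tag
        iqn = fields[3]
        sessions[(portal, iqn)].append(session_id)
    return {key: ids for key, ids in sessions.items() if len(ids) > 1}


# Sample taken from the session list quoted in this thread.
SAMPLE = """\
tcp: [3] 10.102.189.156:3260,15 iqn.1992-04.com.emc:cx.ckm00200502005.b11 (non-flash)
tcp: [4] 10.102.189.156:3260,15 iqn.1992-04.com.emc:cx.ckm00200502005.b11 (non-flash)
"""

if __name__ == "__main__":
    for (portal, iqn), ids in find_duplicate_sessions(SAMPLE).items():
        print(portal, iqn, "sessions:", ", ".join(ids))
```

On a real host the input would be the captured output of the session listing; any portal and IQN pair reported with more than one session id is a candidate for cleanup.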
On Fri, 30 Oct 2020 at 09:27, Gorka Eguileor <geguileo@redhat.com> wrote:
On 30/10, Ignazio Cassano wrote:
Hello, these are the versions we are using:

[root@podto2-kvm02 ansible]# rpm -qa|grep queens
centos-release-openstack-queens-1-2.el7.centos.noarch
[root@podto2-kvm02 ansible]# rpm -qa|grep nova
openstack-nova-common-17.0.11-1.el7.noarch
python2-novaclient-10.1.0-1.el7.noarch
openstack-nova-compute-17.0.11-1.el7.noarch
python-nova-17.0.11-1.el7.noarch
Hi,
That release has the Nova bug fix, so Nova should not be calling Cinder to do an initialize connection on the source on the post-migration step anymore.
I recommend comparing the connection info passed to connect_volume when the volume is attached on the source host with the connection info passed to disconnect_volume when it is disconnected on the source host in the post-migration step.
Cheers, Gorka.
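The comparison suggested above can be sketched as a small dictionary diff. This is only an illustration: the key names come from this thread (target_portals, target_iqns, target_luns), the values are hypothetical, and diff_connection_info is a made-up helper, not a Nova or os-brick function.

```python
MISSING = "<missing>"


def diff_connection_info(attach_info, detach_info):
    """Return {key: (value_at_attach, value_at_detach)} for keys that differ."""
    diffs = {}
    for key in sorted(set(attach_info) | set(detach_info)):
        before = attach_info.get(key, MISSING)
        after = detach_info.get(key, MISSING)
        if before != after:
            diffs[key] = (before, after)
    return diffs


# Hypothetical values, shaped like the keys named in the thread.
attach = {
    "target_portals": ["10.138.215.17:3260", "10.138.215.18:3260"],
    "target_iqns": ["iqn.1992-04.com.emc:cx.ckm00184400687.a2",
                    "iqn.1992-04.com.emc:cx.ckm00184400687.b2"],
    "target_luns": [71, 71],
}
detach = dict(attach, target_luns=[242, 242])

for key, (before, after) in diff_connection_info(attach, detach).items():
    print(f"{key}: attach={before!r} detach={after!r}")
```

If the two dictionaries copied from the DEBUG logs differ in portals, IQNs or LUNs, the disconnect call is looking for a different device than the one that was attached.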
On Thu, 29 Oct 2020 at 09:12, Gorka Eguileor <geguileo@redhat.com> wrote:
On 28/10, Ignazio Cassano wrote:
> Hello Gorka, I would like to know whether, with the Unity iSCSI driver, I must
> configure the iSCSI initiator on both compute and controller nodes.
> At this time I have installed and configured the iSCSI initiator only on compute
> nodes, and I got a lot of faulty devices when volumes are detached.
> Thanks
> Ignazio
>
Hi,
Both compute and controller nodes are in the data path: computes when instances use the volumes, and controllers when we create volumes from images, do generic volume migrations, create or restore backups, etc.

Unless your deployment isn't doing any of the Cinder operations that involve the data plane, you'll have to configure iSCSI on the controller as well.
Having said that, whether you configure the iSCSI initiator or not on the controller will have no effect on the paths used by the compute.
I've seen the iSCSI initiator misbehave when iscsid and iscsiadm come from different versions. I've seen this in containerized environments.

Faulty paths in multipathing are a tricky business, because there are different checkers, some generic (readsector0, tur, directio) and some vendor specific (emc_clariion, hp_sw, rdac), and each one behaves in a different way.
If you have a multipath device with faulty paths that you think should not be faulty, you should look into what's going on with those paths:

- Confirm that the device is still in the system under /dev/XYZ
- Confirm in your storage array's console/webconsole that the volume is still mapped on that target-portal to that host's iSCSI initiator name.
- Confirm you can read the faulty devices with dd on the host
- Confirm that the WWN of the device is the same in all the paths (using /lib/udev/scsi_id)
- Finally look into what checker multipath is using for your device (sometimes checkers have bugs).
Cheers, Gorka.
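The WWN check from the list above can be sketched in Python as follows. It assumes Python 3.7+ and that /lib/udev/scsi_id accepts the --page 0x83 --whitelisted options mentioned elsewhere in this thread; read_wwn and find_wwn_mismatches are illustrative names, and the device names in the usage note are only examples.

```python
import subprocess
from collections import Counter


def read_wwn(device_path):
    """Ask udev's scsi_id helper for the WWN of one path device (needs root)."""
    result = subprocess.run(
        ["/lib/udev/scsi_id", "--page", "0x83", "--whitelisted", device_path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def find_wwn_mismatches(wwn_by_device):
    """Return the devices whose WWN differs from the most common one."""
    if not wwn_by_device:
        return {}
    expected, _ = Counter(wwn_by_device.values()).most_common(1)[0]
    return {dev: wwn for dev, wwn in wwn_by_device.items() if wwn != expected}
```

A possible use on the host would be find_wwn_mismatches({dev: read_wwn(dev) for dev in ("/dev/sdbd", "/dev/sdbe", "/dev/sdbf", "/dev/sdbg")}); an empty result means all paths report the same WWN.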
>
> On Tue, 20 Oct 2020 at 19:58, Gorka Eguileor <geguileo@redhat.com> wrote:
>
> > On 20/10, Ignazio Cassano wrote:
> > > This is the entire log from when the migration started:
> > >
> > > http://paste.openstack.org/show/799199/
> > >
> > > Ignazio
> >
> > Hi,
> >
> > There are no os-brick calls in there. :-(
> >
> > You should look for the call to disconnect_volume, which should have
> > something like:
> >
> > ==> disconnect_volume: call "{'args':
> > (<os_brick.initiator.connectors.iscsi
> >
> > The second parameter to that call is a dictionary where you can see
> > the target_lun, target_luns, target_portals, target_portal, target_iqn,
> > target_iqns... This will allow us to check whether we are actually
> > connected to those target-portals.
> >
> > The third parameter should contain two things that are relevant: the
> > scsi_wwn and the path. You can check if the path exists and if that
> > path actually has that wwn using /lib/udev/scsi_id --page 0x83
> > --whitelisted $path
> >
> > Those are the things I would check, because the only reasons I can think
> > of for os-brick not disconnecting any volumes are that the connection
> > info is not right, or that the volume is no longer connected.
> >
> > Cheers,
> > Gorka.
> >
> > > On Tue, 20 Oct 2020 at 11:23, Gorka Eguileor <geguileo@redhat.com> wrote:
> > >
> > > > On 20/10, Ignazio Cassano wrote:
> > > > > Hello Gorka, this is what happens on nova compute with debug enabled
> > > > > when I migrate an instance with iscsi volumes (note: "Disconnecting
> > > > > from: []" should be the issue):
> > > > >
> > > >
> > > > Hi,
> > > >
> > > > The "Disconnecting from: []" is the right clue, not necessarily the
> > > > issue.
> > > >
> > > > OS-Brick is saying that, for the connection information that has been
> > > > passed in the "disconnect_volume" call (which is not present in the
> > > > emailed logs), there are no volumes present in the system.
> > > >
> > > > You should check the connection info that Nova is passing to
> > > > disconnect_volume and confirm if that data is correct. For example,
> > > > checking if the path present in the connection info dictionary is the
> > > > same as the one in the instance's XML dump, or if the LUN from the
> > > > connection info dict is actually present in the system.
> > > >
> > > > There are multiple reasons why Nova could be passing the wrong
> > > > connection info to os-brick. The ones that come to mind are:
> > > >
> > > > - There was a failed migration at some point, and Nova didn't roll back
> > > >   the connection info in the BDM table.
> > > > - Nova is calling initialize_connection on Cinder multiple times for the
> > > >   same host and the driver being used is not idempotent.
> > > >
> > > > Cheers,
> > > > Gorka.
> > > >
> > > > > stderr= _run_iscsiadm_bare
> > > > >
/usr/lib/python2.7/site-packages/os_brick/initiator/connectors/iscsi.py:1122
> > > > > 2020-10-20 09:52:33.066 132171 DEBUG > > os_brick.initiator.connectors.iscsi > > > > > [-] iscsi session list stdout=tcp: [10] 10.138.209.48:3260,9 > > > > > iqn.1992-04.com.emc:cx.ckm00184400687.a3 (non-flash) > > > > > tcp: [11] 10.138.215.17:3260,8 > > iqn.1992-04.com.emc:cx.ckm00184400687.a2 > > > > > (non-flash) > > > > > tcp: [12] 10.138.215.17:3260,8 > > iqn.1992-04.com.emc:cx.ckm00184400687.a2 > > > > > (non-flash) > > > > > tcp: [13] 10.138.215.18:3260,7 > > iqn.1992-04.com.emc:cx.ckm00184400687.b2 > > > > > (non-flash) > > > > > tcp: [14] 10.138.215.18:3260,7 > > iqn.1992-04.com.emc:cx.ckm00184400687.b2 > > > > > (non-flash) > > > > > tcp: [15] 10.138.209.47:3260,6 > > iqn.1992-04.com.emc:cx.ckm00184400687.b3 > > > > > (non-flash) > > > > > tcp: [16] 10.138.209.47:3260,6 > > iqn.1992-04.com.emc:cx.ckm00184400687.b3 > > > > > (non-flash) > > > > > tcp: [9] 10.138.209.48:3260,9 > > iqn.1992-04.com.emc:cx.ckm00184400687.a3 > > > > > (non-flash) > > > > > stderr= _run_iscsi_session > > > > > > > > > > >
/usr/lib/python2.7/site-packages/os_brick/initiator/connectors/iscsi.py:1111
> > > > > 2020-10-20 09:52:33.078 132171 DEBUG > > os_brick.initiator.connectors.iscsi > > > > > [-] Resulting device map defaultdict(<function <lambda> at > > > > 0x7f4f1b1f7cf8>, > > > > > {(u'10.138.215.17:3260', > > u'iqn.1992-04.com.emc:cx.ckm00184400687.a2'): > > > > > (set([]), set([u'sdg', u'sdi'])), (u'10.138.209.47:3260 ', > > > > > u'iqn.1992-04.com.emc:cx.ckm00184400687.b3'): (set([]), set([u'sdo', > > > > > u'sdq'])), (u'10.138.209.48:3260', > > > > > u'iqn.1992-04.com.emc:cx.ckm00184400687.a3'): (set([]), set([u'sdd', > > > > > u'sdb'])), (u'10.138.215.18:3260', > > > > > u'iqn.1992-04.com.emc:cx.ckm00184400687.b2'): (set([]), set([u'sdm', > > > > > u'sdk']))}) _get_connection_devices > > > > > > > > > > >
/usr/lib/python2.7/site-packages/os_brick/initiator/connectors/iscsi.py:844
> > > > > 2020-10-20 09:52:33.078 132171 DEBUG > > os_brick.initiator.connectors.iscsi > > > > > [-] Disconnecting from: [] _disconnect_connection > > > > > > > > > > >
/usr/lib/python2.7/site-packages/os_brick/initiator/connectors/iscsi.py:1099
> > > > > 2020-10-20 09:52:33.079 132171 DEBUG oslo_concurrency.lockutils [-] > > Lock > > > > > "connect_volume" released by > > > > > "os_brick.initiator.connectors.iscsi.disconnect_volume" :: held > > 1.058s > > > > > inner > > /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:339 > > > > > 2020-10-20 09:52:33.079 132171 DEBUG > > os_brick.initiator.connectors.iscsi > > > > > [-] <== disconnect_volume: return (1057ms) None trace_logging_wrapper > > > > > /usr/lib/python2.7/site-packages/os_brick/utils.py:170 > > > > > 2020-10-20 09:52:33.079 132171 DEBUG nova.virt.libvirt.volume.iscsi > > [-] > > > > > [instance: 0c846f66-f194-40de-b31e-d53652570fa7] Disconnected iSCSI > > > > Volume > > > > > disconnect_volume > > > > >
/usr/lib/python2.7/site-packages/nova/virt/libvirt/volume/iscsi.py:78
> > > > > 2020-10-20 09:52:33.080 132171 DEBUG os_brick.utils [-] ==> > > > > > get_connector_properties: call u"{'execute': None, 'my_ip': > > > > > '10.138.208.178', 'enforce_multipath': True, 'host': > > 'podiscsivc-kvm02', > > > > > 'root_helper': 'sudo nova-rootwrap /etc/nova/rootwrap.conf', > > 'multipath': > > > > > True}" trace_logging_wrapper > > > > > /usr/lib/python2.7/site-packages/os_brick/utils.py:146 > > > > > 2020-10-20 09:52:33.125 132171 DEBUG os_brick.initiator.linuxfc [-] > > No > > > > > Fibre Channel support detected on system. get_fc_hbas > > > > > /usr/lib/python2.7/site-packages/os_brick/initiator/linuxfc.py:157 > > > > > 2020-10-20 09:52:33.126 132171 DEBUG os_brick.initiator.linuxfc [-] > > No > > > > > Fibre Channel support detected on system. get_fc_hbas > > > > > /usr/lib/python2.7/site-packages/os_brick/initiator/linuxfc.py:157 > > > > > 2020-10-20 09:52:33.145 132171 DEBUG os_brick.utils [-] <== > > > > > get_connector_properties: return (61ms) {'initiator': > > > > > u'iqn.1994-05.com.redhat:fbfdc37eed4c', 'ip': u'10.138.208.178', > > 'system > > > > > uuid': u'4C4C4544-0051-4E10-8057-B6C04F425932', 'platform': > > u'x86_64', > > > > > 'host': u'podiscsivc-kvm02', 'do_local_attach': False, 'os_type': > > > > > u'linux2', 'multipath': True} trace_logging_wrapper > > > > > /usr/lib/python2.7/site-packages/os_brick/utils.py:170 > > > > > > > > > > > > > > > Best regards > > > > > Ignazio > > > > > > > > > > Il giorno gio 15 ott 2020 alle ore 10:57 Gorka Eguileor < > > > > geguileo@redhat.com> > > > > > ha scritto: > > > > > > > > > > > On 14/10, Ignazio Cassano wrote: > > > > > > > Hello, thank you for the answer. 
> > > > > > > I am using os-brick 2.3.8 but I got same issues on stein with > > > > os.brick > > > > > > 2.8 > > > > > > > For explain better the situation I send you the output of > > multipath > > > > -ll > > > > > > on > > > > > > > a compute node: > > > > > > > root@podvc-kvm01 ansible]# multipath -ll > > > > > > > Oct 14 18:50:01 | sdbg: alua not supported > > > > > > > Oct 14 18:50:01 | sdbe: alua not supported > > > > > > > Oct 14 18:50:01 | sdbd: alua not supported > > > > > > > Oct 14 18:50:01 | sdbf: alua not supported > > > > > > > 360060160f0d049007ab7275f743d0286 dm-11 DGC ,VRAID > > > > > > > size=30G features='1 retain_attached_hw_handler' hwhandler='1 > > alua' > > > > wp=rw > > > > > > > |-+- policy='round-robin 0' prio=0 status=enabled > > > > > > > | |- 15:0:0:71 sdbg 67:160 failed faulty running > > > > > > > | `- 12:0:0:71 sdbe 67:128 failed faulty running > > > > > > > `-+- policy='round-robin 0' prio=0 status=enabled > > > > > > > |- 11:0:0:71 sdbd 67:112 failed faulty running > > > > > > > `- 13:0:0:71 sdbf 67:144 failed faulty running > > > > > > > 360060160f0d049004cdb615f52343fdb dm-8 DGC ,VRAID > > > > > > > size=80G features='2 queue_if_no_path retain_attached_hw_handler' > > > > > > > hwhandler='1 alua' wp=rw > > > > > > > |-+- policy='round-robin 0' prio=50 status=active > > > > > > > | |- 15:0:0:210 sdau 66:224 active ready running > > > > > > > | `- 12:0:0:210 sdas 66:192 active ready running > > > > > > > `-+- policy='round-robin 0' prio=10 status=enabled > > > > > > > |- 11:0:0:210 sdar 66:176 active ready running > > > > > > > `- 13:0:0:210 sdat 66:208 active ready running > > > > > > > 360060160f0d0490034aa645fe52265eb dm-12 DGC ,VRAID > > > > > > > size=100G features='2 queue_if_no_path > > retain_attached_hw_handler' > > > > > > > hwhandler='1 alua' wp=rw > > > > > > > |-+- policy='round-robin 0' prio=50 status=active > > > > > > > | |- 12:0:0:177 sdbi 67:192 active ready running > > > > > > > | `- 15:0:0:177 sdbk 67:224 
active ready running > > > > > > > `-+- policy='round-robin 0' prio=10 status=enabled > > > > > > > |- 11:0:0:177 sdbh 67:176 active ready running > > > > > > > `- 13:0:0:177 sdbj 67:208 active ready running > > > > > > > 360060160f0d04900159f225fd6126db9 dm-6 DGC ,VRAID > > > > > > > size=40G features='2 queue_if_no_path retain_attached_hw_handler' > > > > > > > hwhandler='1 alua' wp=rw > > > > > > > |-+- policy='round-robin 0' prio=50 status=active > > > > > > > | |- 11:0:0:26 sdaf 65:240 active ready running > > > > > > > | `- 13:0:0:26 sdah 66:16 active ready running > > > > > > > `-+- policy='round-robin 0' prio=10 status=enabled > > > > > > > |- 12:0:0:26 sdag 66:0 active ready running > > > > > > > `- 15:0:0:26 sdai 66:32 active ready running > > > > > > > Oct 14 18:50:01 | sdba: alua not supported > > > > > > > Oct 14 18:50:01 | sdbc: alua not supported > > > > > > > Oct 14 18:50:01 | sdaz: alua not supported > > > > > > > Oct 14 18:50:01 | sdbb: alua not supported > > > > > > > 360060160f0d049007eb7275f93937511 dm-10 DGC ,VRAID > > > > > > > size=40G features='1 retain_attached_hw_handler' hwhandler='1 > > alua' > > > > wp=rw > > > > > > > |-+- policy='round-robin 0' prio=0 status=enabled > > > > > > > | |- 12:0:0:242 sdba 67:64 failed faulty running > > > > > > > | `- 15:0:0:242 sdbc 67:96 failed faulty running > > > > > > > `-+- policy='round-robin 0' prio=0 status=enabled > > > > > > > |- 11:0:0:242 sdaz 67:48 failed faulty running > > > > > > > `- 13:0:0:242 sdbb 67:80 failed faulty running > > > > > > > 360060160f0d049003a567c5fb72201e8 dm-7 DGC ,VRAID > > > > > > > size=40G features='2 queue_if_no_path retain_attached_hw_handler' > > > > > > > hwhandler='1 alua' wp=rw > > > > > > > |-+- policy='round-robin 0' prio=50 status=active > > > > > > > | |- 12:0:0:57 sdbq 68:64 active ready running > > > > > > > | `- 15:0:0:57 sdbs 68:96 active ready running > > > > > > > `-+- policy='round-robin 0' prio=10 status=enabled > > > > > > > |- 11:0:0:57 sdbp 
68:48 active ready running > > > > > > > `- 13:0:0:57 sdbr 68:80 active ready running > > > > > > > 360060160f0d04900c120625f802ea1fa dm-9 DGC ,VRAID > > > > > > > size=25G features='2 queue_if_no_path retain_attached_hw_handler' > > > > > > > hwhandler='1 alua' wp=rw > > > > > > > |-+- policy='round-robin 0' prio=50 status=active > > > > > > > | |- 11:0:0:234 sdav 66:240 active ready running > > > > > > > | `- 13:0:0:234 sdax 67:16 active ready running > > > > > > > `-+- policy='round-robin 0' prio=10 status=enabled > > > > > > > |- 15:0:0:234 sday 67:32 active ready running > > > > > > > `- 12:0:0:234 sdaw 67:0 active ready running > > > > > > > 360060160f0d04900b8b0615fb14ef1bd dm-3 DGC ,VRAID > > > > > > > size=50G features='2 queue_if_no_path retain_attached_hw_handler' > > > > > > > hwhandler='1 alua' wp=rw > > > > > > > |-+- policy='round-robin 0' prio=50 status=active > > > > > > > | |- 11:0:0:11 sdan 66:112 active ready running > > > > > > > | `- 13:0:0:11 sdap 66:144 active ready running > > > > > > > `-+- policy='round-robin 0' prio=10 status=enabled > > > > > > > |- 12:0:0:11 sdao 66:128 active ready running > > > > > > > `- 15:0:0:11 sdaq 66:160 active ready running > > > > > > > > > > > > > > The active running are related to running virtual machines. > > > > > > > The faulty are related to virtual macnines migrated on other kvm > > > > nodes. > > > > > > > Every volume has 4 path because iscsi on unity needs two > > different > > > > vlans, > > > > > > > each one with 2 addresses. 
> > > > > > > I think this issue may be related to os-brick, because when I migrate a
> > > > > > > virtual machine from host A to host B, in the nova compute log on host A I
> > > > > > > read:
> > > > > > > 2020-10-13 10:31:02.769 118727 DEBUG os_brick.initiator.connectors.iscsi
> > > > > > > [req-771ede8c-6e1b-4f3f-ad4a-1f6ed820a55c 66adb965bef64eaaab2af93ade87e2ca
> > > > > > > 85cace94dcc7484c85ff9337eb1d0c4c - default default] *Disconnecting from: []*
> > > > > > >
> > > > > > > Ignazio
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > That's definitely the right clue!! Though I don't fully agree with this
> > > > > > being an os-brick issue just yet. ;-)
> > > > > >
> > > > > > Like I mentioned before, RCA is usually non-trivial, and explaining how
> > > > > > to debug these issues over email is close to impossible, but if this
> > > > > > were my system, and assuming you have tested the normal attach/detach
> > > > > > procedure and it is working fine, this is what I would do:
> > > > > >
> > > > > > - Enable DEBUG logs on the Nova compute node (I believe you already have)
> > > > > > - Attach a new device to an instance on that node with --debug to get
> > > > > >   the request id
> > > > > > - Get the connection information dictionary that os-brick receives on
> > > > > >   the call to connect_volume for that request, and the data that
> > > > > >   os-brick returns to Nova on that method call completion.
> > > > > > - Check if the returned data to Nova is a multipathed device or not (in
> > > > > >   'path'), and whether we have the wwn or not (in 'scsi_wwn'). It
> > > > > >   should be a multipath device, and then I would check its status in the
> > > > > >   multipath daemon.
> > > > > > - Now do the live migration (with --debug to get the request id) and see
> > > > > >   what information Nova passes in that request to os-brick's
> > > > > >   disconnect_volume.
> > > > > >   - Is it the same? Then it's likely an os-brick issue, and I can have a
> > > > > >     look at the logs if you put the logs for that os-brick detach
> > > > > >     process in a pastebin [1].
> > > > > >   - Is it different? Then it's either a Nova bug or a Cinder driver
> > > > > >     specific bug.
> > > > > >     - Is there a call from Nova to Cinder, in the migration request, for
> > > > > >       that same volume to initialize_connection passing the source host
> > > > > >       connector info (info from the host that is currently attached)?
> > > > > >       If there is a call, check if the returned data is different from
> > > > > >       the one we used to do the attach; if that's the case then it's a
> > > > > >       Nova and Cinder driver bug that was solved on the Nova side in
> > > > > >       17.0.10 [2].
> > > > > >     - If there's no call to Cinder's initialize_connection, then it's
> > > > > >       most likely a Nova bug. Try to find out if this connection info
> > > > > >       makes any sense for that host (LUN, target, etc.) or if this is
> > > > > >       the one from the destination volume.
> > > > > >
> > > > > > I hope this somehow helps.
> > > > > >
> > > > > > Cheers,
> > > > > > Gorka.
> > > > > >
> > > > > > [1]: http://paste.openstack.org/
> > > > > > [2]: https://review.opendev.org/#/c/637827/
> > > > > > https://github.com/openstack/nova/commit/013f421bca4067bd430a9fac1e3b290cf13...
> > > > > >
> > > > > > > On Wed, 14 Oct 2020 at 13:41, Gorka Eguileor <geguileo@redhat.com> wrote:
> > > > > > >
> > > > > > > > On 09/10, Ignazio Cassano wrote:
> > > > > > > > > Hello Stackers, I am using the Dell EMC iSCSI driver on my CentOS 7
> > > > > > > > > Queens OpenStack. It works, and instances work as well, but on compute
> > > > > > > > > nodes I get a lot of faulty devices reported by the multipath -ll command.
> > > > > > > > > I do not know why this happens; probably attaching and detaching volumes
> > > > > > > > > and live migrating instances do not close something properly.
> > > > > > > > > I read this can cause serious performance problems on compute nodes.
> > > > > > > > > Please, is there any suggested workaround and/or patch?
> > > > > > > > > Regards
> > > > > > > > > Ignazio
> > > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > There are many, many, many things that could be happening there, and
> > > > > > > > it's not usually trivial doing the RCA, so the following questions are
> > > > > > > > just me hoping this is something "easy" to find out.
> > > > > > > >
> > > > > > > > What os-brick version from Queens are you running? Latest (2.3.9), or
> > > > > > > > maybe one older than 2.3.3?
> > > > > > > >
> > > > > > > > When you say you have faulty devices reported, are these faulty devices
> > > > > > > > alone in the multipath DM? Or do you have some faulty ones with some
> > > > > > > > that are ok?
> > > > > > > >
> > > > > > > > If there are some OK and some that aren't, are they consecutive devices?
> > > > > > > > (as in /dev/sda /dev/sdb etc).
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > Gorka.
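For the debugging procedure quoted above (find what os-brick receives in connect_volume and disconnect_volume for a given request), a minimal Python sketch for pulling those trace lines out of a nova-compute DEBUG log could look like this. It assumes the "==> method: call" and "<== method: return" trace format seen in the logs in this thread; os_brick_trace_lines is a made-up helper name, and the request id filter is optional because some of the quoted lines carry "[-]" instead of a request id.

```python
import re

# Matches the trace_logging_wrapper markers seen in the quoted logs.
TRACE_RE = re.compile(r"(==>|<==) (connect_volume|disconnect_volume): (call|return)")


def os_brick_trace_lines(log_lines, request_id=None):
    """Yield (direction, method, line) for os-brick connect/disconnect traces."""
    for line in log_lines:
        match = TRACE_RE.search(line)
        if not match:
            continue
        if request_id and request_id not in line:
            continue
        yield match.group(1), match.group(2), line.rstrip("\n")


# Two lines copied from the log excerpt in this thread.
SAMPLE_LOG = [
    "2020-10-20 09:52:33.079 132171 DEBUG os_brick.initiator.connectors.iscsi "
    "[-] <== disconnect_volume: return (1057ms) None trace_logging_wrapper "
    "/usr/lib/python2.7/site-packages/os_brick/utils.py:170",
    "2020-10-20 09:52:33.080 132171 DEBUG os_brick.utils [-] ==> "
    "get_connector_properties: call u\"{'execute': None}\" trace_logging_wrapper",
]

for direction, method, line in os_brick_trace_lines(SAMPLE_LOG):
    print(direction, method)
```

On a real system the input would be the open nova-compute log file, and the matching "call" line holds the connection info dictionary to compare between attach and detach.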
Many thanks. Is there a way to clean up without rebooting?

On Fri, 30 Oct 2020 at 11:52, Gorka Eguileor <geguileo@redhat.com> wrote:
On 30/10, Ignazio Cassano wrote:
Please, se the last email where I upgraded to the last openstack nova on queens. [root@compute-0 nova]# rpm -qa|grep queens centos-release-openstack-queens-1-2.el7.centos.noarch [root@compute-0 nova]# rpm -qa|grep nova openstack-nova-compute-17.0.13-1.el7.noarch openstack-nova-common-17.0.13-1.el7.noarch python-nova-17.0.13-1.el7.noarch python2-novaclient-10.1.0-1.el7.noarch
I sent you the logs on the update release. I am not so skilled for reading fine logs output about this issue. Sorry Ignazio
Hi,
I missed that email and the attachment, sorry.
The logs you sent me were missing most of the connect_volume call, and only the end of the call was present, but I think it doesn't matter as I see what the problem is.
The problem is that some of the nodes and sessions are duplicated.
An example of a duplicated node:
tcp: [3] 10.102.189.156:3260,15 iqn.1992-04.com.emc:cx.ckm00200502005.b11 (non-flash) tcp: [4] 10.102.189.156:3260,15 iqn.1992-04.com.emc:cx.ckm00200502005.b11 (non-flash)
An example of that node's duplicated session:
tcp: [3] 10.102.189.156:3260,15 iqn.1992-04.com.emc:cx.ckm00200502005.b11 (non-flash) tcp: [4] 10.102.189.156:3260,15 iqn.1992-04.com.emc:cx.ckm00200502005.b11 (non-flash)
And os-brick is not prepared to handle that, because it is programmed to reuse the nodes and sesssions.
So on disconnect it get's the first of each to look for the volumes provided by it. In the example of the duplicated node-session aboveit sees that it provides /dev/sdd, but that is not one of the disks that belong to the multipath that we are disconnecting, so it gets ignored. The volume we are looking for it's probably on the second session.
So from this point forward (where we have duplicated node-sessions) it will not work again.
I recommend you clean up that system so that you don't have duplicated nodes and sessions before trying to do a VM migration with a single volume attached.
If that works, then try to attach 2 volumes on instances on the same host and see if the nodes and sessions are duplicated.
Cheers, Gorka.
Il giorno ven 30 ott 2020 alle ore 09:27 Gorka Eguileor < geguileo@redhat.com> ha scritto:
On 30/10, Ignazio Cassano wrote:
Hello, these are versions we are using: [root@podto2-kvm02 ansible]# rpm -qa|grep queens centos-release-openstack-queens-1-2.el7.centos.noarch [root@podto2-kvm02 ansible]# rpm -qa|grep nova openstack-nova-common-17.0.11-1.el7.noarch python2-novaclient-10.1.0-1.el7.noarch openstack-nova-compute-17.0.11-1.el7.noarch python-nova-17.0.11-1.el7.noarch
Cheers,Gorka.
Hi,
That release has the Nova bug fix, so Nova should not be calling Cinder to do an initialize connection on the source on the post-migration step anymore.
I recommend comparing the connection info passed to connect_volume when the volume is attached on the source host and when it's disconnected on the post-migration step on the source host.
Cheers, Gorka.
Il giorno gio 29 ott 2020 alle ore 09:12 Gorka Eguileor <
geguileo@redhat.com>
ha scritto:
> On 28/10, Ignazio Cassano wrote: > > Hello Gorka, I would like to know if with unity iscsi
> > configure iscsi initiator on both compute and controller nodes. > > At this time I installed and cinfigured iscsi initiator only on compute > > nodes and I got a lot of faulty devices when volumes ate detached. > > Thanks > > Ignazio > > > > Hi, > > Both compute and controller nodes are in the data path. Computes when > instances use the volumes, and controllers when we create volume from > images, do generic volume migrations, create or restore backups, etc. > > Unless your deployment isn't doing any of the Cinder operations
driver, I must that
> involve the data plane, you'll have to configure iSCSI on the controller > as well. > > Having said that, whether you configure the iSCSI initiator or not on > the controller will have no effect on the paths used by the compute. > > I've seen the iSCSI initiator going crazy when the iscsid and
> iscsiadm are from different versions. I've seen this in containerized > environments. > > Faulty paths on multipathing is a tricky business, because
are
> different checkers, some generic (readsector0, tur, directio) and some > vendor specific (emc_clarrion, hp_wd, rdac), and each one behaves in a > different way. > > If you have a multipath device with faulty paths, that you
> not be faulty, you should look into what's going on with those
should paths:
> > - Confirm that the device is still in the system under /dev/XYZ > - Confirm in your storage array's console/webconsole that the volume is > still mapped on that target-portal to that host's iscsi initiator > name. > - Confirm you can read the faulty devices with dd on the host > - Confirm that the WWN of the device is the same in all the
> /lib/udev/scsi_id) > - Finally look into what checker is multipath using for your device > (sometimes checkers have bugs). > > Cheers, > Gorka. > > > > > > Il Mar 20 Ott 2020, 19:58 Gorka Eguileor < geguileo@redhat.com> ha > scritto: > > > > > On 20/10, Ignazio Cassano wrote: > > > > This is the entre log from when the migration started: > > > > > > > > http://paste.openstack.org/show/799199/ > > > > > > > > Ignazio > > > > > > Hi, > > > > > > There are no os-brick calls in there. :-( > > > > > > You should look for the call to connect_volume that should have > > > something like: > > > > > > ==> disconnect_volume: call "{'args': > > > (<os_brick.initiator.connectors.iscsi > > > > > > And the second parameter to that call is a dictionary where you can see > > > the target_lun, target_luns, target_portals, target_portal, target_iqn, > > > target_iqns... This will allow us to check if we are actually > connected > > > to those targets-portals > > > > > > The third parameter should contain two things that are relevant,
> > > scsi_wwn and the path. You can check if the path exists and if
> > > path actually has that wwn using /lib/udev/scsi_id --page 0x83 > > > --whitelisted $path > > > > > > Those are the things I would check, because the only reason I can
> > > that os-brick is not disconnecting any volumes are that the connection > > > info is not right, or that the volume is no longer connected. > > > > > > Cheers, > > > Gorka. > > > > > > > > > > > Il giorno mar 20 ott 2020 alle ore 11:23 Gorka Eguileor < > > > geguileo@redhat.com> > > > > ha scritto: > > > > > > > > > On 20/10, Ignazio Cassano wrote: > > > > > > Hello Gorka,this is what happens on nova compute with debug > enabled, > > > > > when I > > > > > > migrate an instance with iscsi volumes ( note Disconnecting > from[] > > > should > > > > > > be the issue): > > > > > > > > > > > > > > > > Hi, > > > > > > > > > > The disconnect from [] is the right clue, not necessarily
> issue. > > > > > > > > > > OS-Brick is saying that for the connection information
> been > > > > > passed in the "disconnect_volume" call (which is not
(using the that think the that has present in the
> > > > > emailed logs) there are no volumes present in the system. > > > > > > > > > > You should check the connection info that Nova is
to
> > > > > disconnect_volume and confirm if that data is correct. For example > > > > > checking if the path present in the connection info dictionary is > the > > > > > same as the one in the instance's XML dump, or if the LUN from the > > > > > connection info dict is actually present in the system. > > > > > > > > > > There are multiple reasons why Nova could be passing
https://github.com/openstack/nova/commit/013f421bca4067bd430a9fac1e3b290cf13... the there think paths passing the
wrong
> > > > > connection info to os-brick. The ones that come to mind are: > > > > > > > > > > - There was a failed migration at some point, and Nova didn't > rollback > > > > > the connection info on the BDM table. > > > > > - Nova is calling multiple times initialize_connection on Cinder > for > > > the > > > > > same host and the driver being used is not idempotent. > > > > > > > > > > Cheers, > > > > > Gorka. > > > > > > > > > > > stderr= _run_iscsiadm_bare > > > > > > > > > > > > > > >
/usr/lib/python2.7/site-packages/os_brick/initiator/connectors/iscsi.py:1122
> > > > > > 2020-10-20 09:52:33.066 132171 DEBUG > > > os_brick.initiator.connectors.iscsi > > > > > > [-] iscsi session list stdout=tcp: [10] 10.138.209.48:3260,9 > > > > > > iqn.1992-04.com.emc:cx.ckm00184400687.a3 (non-flash) > > > > > > tcp: [11] 10.138.215.17:3260,8 > > > iqn.1992-04.com.emc:cx.ckm00184400687.a2 > > > > > > (non-flash) > > > > > > tcp: [12] 10.138.215.17:3260,8 > > > iqn.1992-04.com.emc:cx.ckm00184400687.a2 > > > > > > (non-flash) > > > > > > tcp: [13] 10.138.215.18:3260,7 > > > iqn.1992-04.com.emc:cx.ckm00184400687.b2 > > > > > > (non-flash) > > > > > > tcp: [14] 10.138.215.18:3260,7 > > > iqn.1992-04.com.emc:cx.ckm00184400687.b2 > > > > > > (non-flash) > > > > > > tcp: [15] 10.138.209.47:3260,6 > > > iqn.1992-04.com.emc:cx.ckm00184400687.b3 > > > > > > (non-flash) > > > > > > tcp: [16] 10.138.209.47:3260,6 > > > iqn.1992-04.com.emc:cx.ckm00184400687.b3 > > > > > > (non-flash) > > > > > > tcp: [9] 10.138.209.48:3260,9 > > > iqn.1992-04.com.emc:cx.ckm00184400687.a3 > > > > > > (non-flash) > > > > > > stderr= _run_iscsi_session > > > > > > > > > > > > > > >
/usr/lib/python2.7/site-packages/os_brick/initiator/connectors/iscsi.py:1111
> > > > > > 2020-10-20 09:52:33.078 132171 DEBUG > > > os_brick.initiator.connectors.iscsi > > > > > > [-] Resulting device map defaultdict(<function <lambda> at > > > > > 0x7f4f1b1f7cf8>, > > > > > > {(u'10.138.215.17:3260', > > > u'iqn.1992-04.com.emc:cx.ckm00184400687.a2'): > > > > > > (set([]), set([u'sdg', u'sdi'])), (u' 10.138.209.47:3260 ', > > > > > > u'iqn.1992-04.com.emc:cx.ckm00184400687.b3'): (set([]), > set([u'sdo', > > > > > > u'sdq'])), (u'10.138.209.48:3260', > > > > > > u'iqn.1992-04.com.emc:cx.ckm00184400687.a3'): (set([]), > set([u'sdd', > > > > > > u'sdb'])), (u'10.138.215.18:3260', > > > > > > u'iqn.1992-04.com.emc:cx.ckm00184400687.b2'): (set([]), > set([u'sdm', > > > > > > u'sdk']))}) _get_connection_devices > > > > > > > > > > > > > > >
/usr/lib/python2.7/site-packages/os_brick/initiator/connectors/iscsi.py:844
> > > > > > 2020-10-20 09:52:33.078 132171 DEBUG > > > os_brick.initiator.connectors.iscsi > > > > > > [-] Disconnecting from: [] _disconnect_connection > > > > > > > > > > > > > > >
/usr/lib/python2.7/site-packages/os_brick/initiator/connectors/iscsi.py:1099
> > > > > > 2020-10-20 09:52:33.079 132171 DEBUG oslo_concurrency.lockutils > [-] > > > Lock > > > > > > "connect_volume" released by > > > > > > "os_brick.initiator.connectors.iscsi.disconnect_volume" :: held > > > 1.058s > > > > > > inner > > > /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:339 > > > > > > 2020-10-20 09:52:33.079 132171 DEBUG > > > os_brick.initiator.connectors.iscsi > > > > > > [-] <== disconnect_volume: return (1057ms) None > trace_logging_wrapper > > > > > > /usr/lib/python2.7/site-packages/os_brick/utils.py:170 > > > > > > 2020-10-20 09:52:33.079 132171 DEBUG > nova.virt.libvirt.volume.iscsi > > > [-] > > > > > > [instance: 0c846f66-f194-40de-b31e-d53652570fa7] Disconnected > iSCSI > > > > > Volume > > > > > > disconnect_volume > > > > > > > /usr/lib/python2.7/site-packages/nova/virt/libvirt/volume/iscsi.py:78 > > > > > > 2020-10-20 09:52:33.080 132171 DEBUG os_brick.utils [-] ==> > > > > > > get_connector_properties: call u"{'execute': None, 'my_ip': > > > > > > '10.138.208.178', 'enforce_multipath': True, 'host': > > > 'podiscsivc-kvm02', > > > > > > 'root_helper': 'sudo nova-rootwrap /etc/nova/rootwrap.conf', > > > 'multipath': > > > > > > True}" trace_logging_wrapper > > > > > > /usr/lib/python2.7/site-packages/os_brick/utils.py:146 > > > > > > 2020-10-20 09:52:33.125 132171 DEBUG os_brick.initiator.linuxfc > [-] > > > No > > > > > > Fibre Channel support detected on system. get_fc_hbas > > > > > > > /usr/lib/python2.7/site-packages/os_brick/initiator/linuxfc.py:157 > > > > > > 2020-10-20 09:52:33.126 132171 DEBUG os_brick.initiator.linuxfc > [-] > > > No > > > > > > Fibre Channel support detected on system. 
get_fc_hbas > > > > > > > /usr/lib/python2.7/site-packages/os_brick/initiator/linuxfc.py:157 > > > > > > 2020-10-20 09:52:33.145 132171 DEBUG os_brick.utils [-] <== > > > > > > get_connector_properties: return (61ms) {'initiator': > > > > > > u'iqn.1994-05.com.redhat:fbfdc37eed4c', 'ip': u'10.138.208.178', > > > 'system > > > > > > uuid': u'4C4C4544-0051-4E10-8057-B6C04F425932', 'platform': > > > u'x86_64', > > > > > > 'host': u'podiscsivc-kvm02', 'do_local_attach': False, 'os_type': > > > > > > u'linux2', 'multipath': True} trace_logging_wrapper > > > > > > /usr/lib/python2.7/site-packages/os_brick/utils.py:170 > > > > > > > > > > > > > > > > > > Best regards > > > > > > Ignazio > > > > > > > > > > > > Il giorno gio 15 ott 2020 alle ore 10:57 Gorka Eguileor < > > > > > geguileo@redhat.com> > > > > > > ha scritto: > > > > > > > > > > > > > On 14/10, Ignazio Cassano wrote: > > > > > > > > Hello, thank you for the answer. > > > > > > > > I am using os-brick 2.3.8 but I got same issues on stein with > > > > > os.brick > > > > > > > 2.8 > > > > > > > > For explain better the situation I send you the output of > > > multipath > > > > > -ll > > > > > > > on > > > > > > > > a compute node: > > > > > > > > root@podvc-kvm01 ansible]# multipath -ll > > > > > > > > Oct 14 18:50:01 | sdbg: alua not supported > > > > > > > > Oct 14 18:50:01 | sdbe: alua not supported > > > > > > > > Oct 14 18:50:01 | sdbd: alua not supported > > > > > > > > Oct 14 18:50:01 | sdbf: alua not supported > > > > > > > > 360060160f0d049007ab7275f743d0286 dm-11 DGC ,VRAID > > > > > > > > size=30G features='1 retain_attached_hw_handler' hwhandler='1 > > > alua' > > > > > wp=rw > > > > > > > > |-+- policy='round-robin 0' prio=0 status=enabled > > > > > > > > | |- 15:0:0:71 sdbg 67:160 failed faulty running > > > > > > > > | `- 12:0:0:71 sdbe 67:128 failed faulty running > > > > > > > > `-+- policy='round-robin 0' prio=0 status=enabled > > > > > > > > |- 11:0:0:71 sdbd 67:112 failed faulty running > > > 
> > > > > `- 13:0:0:71 sdbf 67:144 failed faulty running > > > > > > > > 360060160f0d049004cdb615f52343fdb dm-8 DGC ,VRAID > > > > > > > > size=80G features='2 queue_if_no_path > retain_attached_hw_handler' > > > > > > > > hwhandler='1 alua' wp=rw > > > > > > > > |-+- policy='round-robin 0' prio=50 status=active > > > > > > > > | |- 15:0:0:210 sdau 66:224 active ready running > > > > > > > > | `- 12:0:0:210 sdas 66:192 active ready running > > > > > > > > `-+- policy='round-robin 0' prio=10 status=enabled > > > > > > > > |- 11:0:0:210 sdar 66:176 active ready running > > > > > > > > `- 13:0:0:210 sdat 66:208 active ready running > > > > > > > > 360060160f0d0490034aa645fe52265eb dm-12 DGC ,VRAID > > > > > > > > size=100G features='2 queue_if_no_path > > > retain_attached_hw_handler' > > > > > > > > hwhandler='1 alua' wp=rw > > > > > > > > |-+- policy='round-robin 0' prio=50 status=active > > > > > > > > | |- 12:0:0:177 sdbi 67:192 active ready running > > > > > > > > | `- 15:0:0:177 sdbk 67:224 active ready running > > > > > > > > `-+- policy='round-robin 0' prio=10 status=enabled > > > > > > > > |- 11:0:0:177 sdbh 67:176 active ready running > > > > > > > > `- 13:0:0:177 sdbj 67:208 active ready running > > > > > > > > 360060160f0d04900159f225fd6126db9 dm-6 DGC ,VRAID > > > > > > > > size=40G features='2 queue_if_no_path > retain_attached_hw_handler' > > > > > > > > hwhandler='1 alua' wp=rw > > > > > > > > |-+- policy='round-robin 0' prio=50 status=active > > > > > > > > | |- 11:0:0:26 sdaf 65:240 active ready running > > > > > > > > | `- 13:0:0:26 sdah 66:16 active ready running > > > > > > > > `-+- policy='round-robin 0' prio=10 status=enabled > > > > > > > > |- 12:0:0:26 sdag 66:0 active ready running > > > > > > > > `- 15:0:0:26 sdai 66:32 active ready running > > > > > > > > Oct 14 18:50:01 | sdba: alua not supported > > > > > > > > Oct 14 18:50:01 | sdbc: alua not supported > > > > > > > > Oct 14 18:50:01 | sdaz: alua not supported > > > > > > > > Oct 14 
18:50:01 | sdbb: alua not supported > > > > > > > > 360060160f0d049007eb7275f93937511 dm-10 DGC ,VRAID > > > > > > > > size=40G features='1 retain_attached_hw_handler' hwhandler='1 > > > alua' > > > > > wp=rw > > > > > > > > |-+- policy='round-robin 0' prio=0 status=enabled > > > > > > > > | |- 12:0:0:242 sdba 67:64 failed faulty running > > > > > > > > | `- 15:0:0:242 sdbc 67:96 failed faulty running > > > > > > > > `-+- policy='round-robin 0' prio=0 status=enabled > > > > > > > > |- 11:0:0:242 sdaz 67:48 failed faulty running > > > > > > > > `- 13:0:0:242 sdbb 67:80 failed faulty running > > > > > > > > 360060160f0d049003a567c5fb72201e8 dm-7 DGC ,VRAID > > > > > > > > size=40G features='2 queue_if_no_path > retain_attached_hw_handler' > > > > > > > > hwhandler='1 alua' wp=rw > > > > > > > > |-+- policy='round-robin 0' prio=50 status=active > > > > > > > > | |- 12:0:0:57 sdbq 68:64 active ready running > > > > > > > > | `- 15:0:0:57 sdbs 68:96 active ready running > > > > > > > > `-+- policy='round-robin 0' prio=10 status=enabled > > > > > > > > |- 11:0:0:57 sdbp 68:48 active ready running > > > > > > > > `- 13:0:0:57 sdbr 68:80 active ready running > > > > > > > > 360060160f0d04900c120625f802ea1fa dm-9 DGC ,VRAID > > > > > > > > size=25G features='2 queue_if_no_path > retain_attached_hw_handler' > > > > > > > > hwhandler='1 alua' wp=rw > > > > > > > > |-+- policy='round-robin 0' prio=50 status=active > > > > > > > > | |- 11:0:0:234 sdav 66:240 active ready running > > > > > > > > | `- 13:0:0:234 sdax 67:16 active ready running > > > > > > > > `-+- policy='round-robin 0' prio=10 status=enabled > > > > > > > > |- 15:0:0:234 sday 67:32 active ready running > > > > > > > > `- 12:0:0:234 sdaw 67:0 active ready running > > > > > > > > 360060160f0d04900b8b0615fb14ef1bd dm-3 DGC ,VRAID > > > > > > > > size=50G features='2 queue_if_no_path > retain_attached_hw_handler' > > > > > > > > hwhandler='1 alua' wp=rw > > > > > > > > |-+- policy='round-robin 0' prio=50 
status=active > > > > > > > > | |- 11:0:0:11 sdan 66:112 active ready running > > > > > > > > | `- 13:0:0:11 sdap 66:144 active ready running > > > > > > > > `-+- policy='round-robin 0' prio=10 status=enabled > > > > > > > > |- 12:0:0:11 sdao 66:128 active ready running > > > > > > > > `- 15:0:0:11 sdaq 66:160 active ready running > > > > > > > > > > > > > > > > The active running are related to running virtual machines. > > > > > > > > The faulty are related to virtual macnines migrated on other > kvm > > > > > nodes. > > > > > > > > Every volume has 4 path because iscsi on unity needs two > > > different > > > > > vlans, > > > > > > > > each one with 2 addresses. > > > > > > > > I think this issue can be related to os-brick because when I > > > migrate > > > > > a > > > > > > > > virtual machine from host A host B in the cova compute log on > > > host A > > > > > I > > > > > > > read: > > > > > > > > 2020-10-13 10:31:02.769 118727 DEBUG > > > > > os_brick.initiator.connectors.iscsi > > > > > > > > [req-771ede8c-6e1b-4f3f-ad4a-1f6ed820a55c > > > > > > > 66adb965bef64eaaab2af93ade87e2ca > > > > > > > > 85cace94dcc7484c85ff9337eb1d0c4c - default default] > > > *Disconnecting > > > > > from: > > > > > > > []* > > > > > > > > > > > > > > > > Ignazio > > > > > > > > > > > > > > Hi, > > > > > > > > > > > > > > That's definitely the right clue!! Though I don't fully agree > with > > > > > this > > > > > > > being an os-brick issue just yet. 
;-)
> > > > > > >
> > > > > > > Like I mentioned before, RCA is usually non-trivial, and explaining how
> > > > > > > to debug these issues over email is close to impossible, but if this
> > > > > > > were my system, and assuming you have tested normal attach/detach
> > > > > > > procedure and is working fine, this is what I would do:
> > > > > > >
> > > > > > > - Enable DEBUG logs on Nova compute node (I believe you already have)
> > > > > > > - Attach a new device to an instance on that node with --debug to get
> > > > > > >   the request id
> > > > > > > - Get the connection information dictionary that os-brick receives on
> > > > > > >   the call to connect_volume for that request, and the data that
> > > > > > >   os-brick returns to Nova on that method call completion.
> > > > > > > - Check if the returned data to Nova is a multipathed device or not (in
> > > > > > >   'path'), and whether we have the wwn or not (in 'scsi_wwn'). It
> > > > > > >   should be a multipath device, and then I would check its status in the
> > > > > > >   multipath daemon.
> > > > > > > - Now do the live migration (with --debug to get the request id) and see
> > > > > > >   what information Nova passes in that request to os-brick's
> > > > > > >   disconnect_volume.
> > > > > > >   - Is it the same? Then it's likely an os-brick issue, and I can have a
> > > > > > >     look at the logs if you put the logs for that os-brick detach
> > > > > > >     process in a pastebin [1].
> > > > > > >   - Is it different? Then it's either a Nova bug or a Cinder driver
> > > > > > >     specific bug.
> > > > > > >     - Is there a call from Nova to Cinder, in the migration request, for
> > > > > > >       that same volume to initialize_connection passing the source host
> > > > > > >       connector info (info from the host that is currently attached)?
> > > > > > >       If there is a call, check if the returned data is different from
> > > > > > >       the one we used to do the attach, if that's the case then it's a
> > > > > > >       Nova and Cinder driver bug that was solved on the Nova side in
> > > > > > >       17.0.10 [2].
> > > > > > >     - If there's no call to Cinder's initialize_connection, then it's
> > > > > > >       most likely a Nova bug. Try to find out if this connection info
> > > > > > >       makes any sense for that host (LUN, target, etc.) or if this is
> > > > > > >       the one from the destination volume.
> > > > > > >
> > > > > > > I hope this somehow helps.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Gorka.
> > > > > > >
> > > > > > > [1]: http://paste.openstack.org/
> > > > > > > [2]: https://review.opendev.org/#/c/637827/
> > > > > > >
> > > > > > > > On Wed, 14 Oct 2020 at 13:41, Gorka Eguileor <geguileo@redhat.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > On 09/10, Ignazio Cassano wrote:
> > > > > > > > > > Hello Stackers, I am using dell emc iscsi driver on my centos 7 queens
> > > > > > > > > > openstack. It works and instances work as well but on compute nodes I
> > > > > > > > > > got a lot of faulty devices reported by the multipath -ll command.
> > > > > > > > > > I do know why this happens, probably attaching and detaching volumes and
> > > > > > > > > > live migrating instances do not close something well.
> > > > > > > > > > I read this can cause serious performance problems on compute nodes.
> > > > > > > > > > Please, any workaround and/or patch is suggested?
> > > > > > > > > > Regards
> > > > > > > > > > Ignazio
> > > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > There are many, many, many things that could be happening there, and
> > > > > > > > > it's not usually trivial doing the RCA, so the following questions are
> > > > > > > > > just me hoping this is something "easy" to find out.
> > > > > > > > >
> > > > > > > > > What os-brick version from Queens are you running? Latest (2.3.9), or
> > > > > > > > > maybe one older than 2.3.3?
> > > > > > > > >
> > > > > > > > > When you say you have faulty devices reported, are these faulty devices
> > > > > > > > > alone in the multipath DM? Or do you have some faulty ones with some
> > > > > > > > > that are ok?
> > > > > > > > >
> > > > > > > > > If there are some OK and some that aren't, are they consecutive devices?
> > > > > > > > > (as in /dev/sda /dev/sdb etc).
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > Gorka.
On 30/10, Ignazio Cassano wrote:
Many thanks. Is there a way to clean up without rebooting?
Hi,
You would first have to make sure the devices are no longer being used. In the case of the migrated ones they are certainly not being used, since they have been unmapped on the array.
Once you know that the volumes are not being used, you can flush the multipath, delete the devices, log out of the session and delete the nodes without rebooting.
Flushing:
- Flush and remove multipaths: multipath -f <multipath>
- Flush single path: blockdev --flushbufs <device>
Remove device:
- echo 1 > /sys/block/<dev-name>/device/delete
Log out of session:
- All sessions: iscsiadm -m session --logout
- Single session: iscsiadm -m session -T <target-iqn> -p <target-portal> --logout
Remove node:
- All nodes: iscsiadm -m node -o delete
- Single node: iscsiadm -m node -T <target-iqn> -p <target-portal> -o delete
Cheers,
Gorka.
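As a rough illustration of the order of those cleanup steps, here is a minimal Python sketch that only builds the command list and executes nothing. The function name `build_cleanup_commands` and the sample WWID, device names, IQN and portal are placeholders chosen for illustration, not values confirmed for any particular host. For the single-target logout the sketch uses iscsiadm's node mode (`-m node -T ... -p ... --logout`), which is an assumption on my part and differs from the session-mode form given above. Keep in mind that logging out of a session removes every LUN reached through it, so this is only appropriate once no in-use volume depends on that session.

```python
def build_cleanup_commands(multipath_wwid, devices, target_iqn, target_portal):
    """Return the cleanup commands in order; nothing is executed here."""
    commands = []

    # 1. Flush and remove the multipath map.
    commands.append(["multipath", "-f", multipath_wwid])

    # 2. Flush each individual path device.
    for dev in devices:
        commands.append(["blockdev", "--flushbufs", "/dev/" + dev])

    # 3. Ask the kernel to delete each SCSI device.
    for dev in devices:
        commands.append(
            ["sh", "-c", "echo 1 > /sys/block/%s/device/delete" % dev])

    # 4. Log out of the session, then delete the node record
    #    (node-mode form; an assumption, see the text above).
    node = ["iscsiadm", "-m", "node", "-T", target_iqn, "-p", target_portal]
    commands.append(node + ["--logout"])
    commands.append(node + ["-o", "delete"])
    return commands


# Placeholder values for illustration only.
example = build_cleanup_commands(
    "360060160f0d049007ab7275f743d0286",
    ["sdbg", "sdbe"],
    "iqn.1992-04.com.example:target0",
    "192.0.2.10:3260",
)
for cmd in example:
    print(" ".join(cmd))
```

Reviewing the printed list before running anything by hand is a cheap way to confirm that only the faulty paths of a migrated volume are touched.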
On Fri, 30 Oct 2020 at 11:52, Gorka Eguileor <geguileo@redhat.com> wrote:
On 30/10, Ignazio Cassano wrote:
Please see the last email, where I upgraded to the latest OpenStack Nova on Queens.
[root@compute-0 nova]# rpm -qa|grep queens
centos-release-openstack-queens-1-2.el7.centos.noarch
[root@compute-0 nova]# rpm -qa|grep nova
openstack-nova-compute-17.0.13-1.el7.noarch
openstack-nova-common-17.0.13-1.el7.noarch
python-nova-17.0.13-1.el7.noarch
python2-novaclient-10.1.0-1.el7.noarch
I sent you the logs from the updated release. I am not skilled enough at reading detailed log output to diagnose this issue.
Sorry,
Ignazio
Hi,
I missed that email and the attachment, sorry.
The logs you sent me were missing most of the connect_volume call, and only the end of the call was present, but I think it doesn't matter as I see what the problem is.
The problem is that some of the nodes and sessions are duplicated.
An example of a duplicated node:
tcp: [3] 10.102.189.156:3260,15 iqn.1992-04.com.emc:cx.ckm00200502005.b11 (non-flash) tcp: [4] 10.102.189.156:3260,15 iqn.1992-04.com.emc:cx.ckm00200502005.b11 (non-flash)
An example of that node's duplicated session:
tcp: [3] 10.102.189.156:3260,15 iqn.1992-04.com.emc:cx.ckm00200502005.b11 (non-flash) tcp: [4] 10.102.189.156:3260,15 iqn.1992-04.com.emc:cx.ckm00200502005.b11 (non-flash)
And os-brick is not prepared to handle that, because it is programmed to reuse the nodes and sessions.
So on disconnect it gets the first of each to look for the volumes provided by it. In the example of the duplicated node-session above it sees that it provides /dev/sdd, but that is not one of the disks that belong to the multipath that we are disconnecting, so it gets ignored. The volume we are looking for is probably on the second session.
So from this point forward (where we have duplicated node-sessions) it will not work again.
I recommend you clean up that system so that you don't have duplicated nodes and sessions before trying to do a VM migration with a single volume attached.
If that works, then try to attach 2 volumes on instances on the same host and see if the nodes and sessions are duplicated.
Cheers, Gorka.
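To check a host for the duplicated sessions described above, a minimal Python sketch along these lines could be used. It assumes the session listing has the `tcp: [<sid>] <portal>,<tpgt> <iqn> (non-flash)` shape shown in this thread; the function name `find_duplicate_sessions` and the regular expression are illustrative, not part of os-brick.

```python
import re
from collections import defaultdict

# Matches lines such as:
# tcp: [3] 10.102.189.156:3260,15 iqn.1992-04.com.emc:cx.ckm00200502005.b11 (non-flash)
SESSION_RE = re.compile(r"^tcp: \[(\d+)\] (\S+?),(\d+) (\S+)")


def find_duplicate_sessions(session_output):
    """Map (portal, iqn) to its session ids when more than one exists."""
    by_target = defaultdict(list)
    for line in session_output.splitlines():
        match = SESSION_RE.match(line.strip())
        if match:
            sid, portal, _tpgt, iqn = match.groups()
            by_target[(portal, iqn)].append(int(sid))
    return {key: sids for key, sids in by_target.items() if len(sids) > 1}


# The two session lines quoted in the message above.
SAMPLE = (
    "tcp: [3] 10.102.189.156:3260,15 "
    "iqn.1992-04.com.emc:cx.ckm00200502005.b11 (non-flash)\n"
    "tcp: [4] 10.102.189.156:3260,15 "
    "iqn.1992-04.com.emc:cx.ckm00200502005.b11 (non-flash)\n"
)

for (portal, iqn), sids in find_duplicate_sessions(SAMPLE).items():
    print("duplicated:", portal, iqn, "sessions", sids)
```

An empty result would suggest every portal and target pair has a single session, which is the state recommended above before retrying a migration.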
On Fri, 30 Oct 2020 at 09:27, Gorka Eguileor <geguileo@redhat.com> wrote:
On 30/10, Ignazio Cassano wrote:
Hello, these are the versions we are using:
[root@podto2-kvm02 ansible]# rpm -qa|grep queens
centos-release-openstack-queens-1-2.el7.centos.noarch
[root@podto2-kvm02 ansible]# rpm -qa|grep nova
openstack-nova-common-17.0.11-1.el7.noarch
python2-novaclient-10.1.0-1.el7.noarch
openstack-nova-compute-17.0.11-1.el7.noarch
python-nova-17.0.11-1.el7.noarch
Cheers, Gorka.
Hi,
That release has the Nova bug fix, so Nova should not be calling Cinder to do an initialize connection on the source on the post-migration step anymore.
I recommend comparing the connection info passed to connect_volume when the volume is attached on the source host and when it's disconnected on the post-migration step on the source host.
Cheers, Gorka.
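The comparison suggested above can be done by eye, but a small helper makes mismatches harder to miss. This is a minimal Python sketch assuming the two connection info dictionaries have been copied out of the DEBUG logs by hand; the function name `diff_connection_info` and the sample values are hypothetical, while the key names are the ones mentioned in this thread (target_portal, target_iqn, target_lun and their plural forms).

```python
# Keys named in this thread as the ones worth comparing.
KEYS = (
    "target_portal", "target_portals",
    "target_iqn", "target_iqns",
    "target_lun", "target_luns",
)


def diff_connection_info(attach_info, detach_info, keys=KEYS):
    """Return {key: (attach_value, detach_value)} for keys that differ."""
    diffs = {}
    for key in keys:
        attach_value = attach_info.get(key)
        detach_value = detach_info.get(key)
        if attach_value != detach_value:
            diffs[key] = (attach_value, detach_value)
    return diffs


# Hypothetical values for illustration only.
attach = {"target_iqn": "iqn.example:a", "target_portal": "192.0.2.10:3260",
          "target_lun": 26}
detach = {"target_iqn": "iqn.example:a", "target_portal": "192.0.2.10:3260",
          "target_lun": 71}

for key, (was, now) in diff_connection_info(attach, detach).items():
    print("%s: attached with %r, disconnected with %r" % (key, was, now))
```

If the diff is empty, the connection info is the same on both calls and the investigation can move to os-brick itself; a non-empty diff points back at Nova or the Cinder driver.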
> On Thu, 29 Oct 2020 at 09:12, Gorka Eguileor <geguileo@redhat.com>
> wrote:
>
> > On 28/10, Ignazio Cassano wrote:
> > > Hello Gorka, I would like to know if with unity iscsi driver, I must
> > > configure iscsi initiator on both compute and controller nodes.
> > > At this time I installed and configured iscsi initiator only on compute
> > > nodes and I got a lot of faulty devices when volumes are detached.
> > > Thanks
> > > Ignazio
> > >
> >
> > Hi,
> >
> > Both compute and controller nodes are in the data path. Computes when
> > instances use the volumes, and controllers when we create volume from
> > images, do generic volume migrations, create or restore backups, etc.
> >
> > Unless your deployment isn't doing any of the Cinder operations that
> > involve the data plane, you'll have to configure iSCSI on the controller
> > as well.
> >
> > Having said that, whether you configure the iSCSI initiator or not on
> > the controller will have no effect on the paths used by the compute.
> >
> > I've seen the iSCSI initiator going crazy when the iscsid and the
> > iscsiadm are from different versions. I've seen this in containerized
> > environments.
> >
> > Faulty paths on multipathing is a tricky business, because there are
> > different checkers, some generic (readsector0, tur, directio) and some
> > vendor specific (emc_clariion, hp_sw, rdac), and each one behaves in a
> > different way.
> >
> > If you have a multipath device with faulty paths, that you think should
> > not be faulty, you should look into what's going on with those paths:
> >
> > - Confirm that the device is still in the system under /dev/XYZ
> > - Confirm in your storage array's console/webconsole that the volume is
> >   still mapped on that target-portal to that host's iscsi initiator
> >   name.
> > - Confirm you can read the faulty devices with dd on the host
> > - Confirm that the WWN of the device is the same in all the paths (using
> >   /lib/udev/scsi_id)
> > - Finally look into what checker is multipath using for your device
> >   (sometimes checkers have bugs).
> >
> > Cheers,
> > Gorka.
> >
> > > On Tue, 20 Oct 2020, 19:58, Gorka Eguileor <geguileo@redhat.com> wrote:
> > >
> > > > On 20/10, Ignazio Cassano wrote:
> > > > > This is the entire log from when the migration started:
> > > > >
> > > > > http://paste.openstack.org/show/799199/
> > > > >
> > > > > Ignazio
> > > >
> > > > Hi,
> > > >
> > > > There are no os-brick calls in there. :-(
> > > >
> > > > You should look for the call to connect_volume that should have
> > > > something like:
> > > >
> > > >   ==> disconnect_volume: call "{'args': (<os_brick.initiator.connectors.iscsi
> > > >
> > > > And the second parameter to that call is a dictionary where you can see
> > > > the target_lun, target_luns, target_portals, target_portal, target_iqn,
> > > > target_iqns... This will allow us to check if we are actually connected
> > > > to those targets-portals
> > > >
> > > > The third parameter should contain two things that are relevant, the
> > > > scsi_wwn and the path. You can check if the path exists and if that
> > > > path actually has that wwn using /lib/udev/scsi_id --page 0x83
> > > > --whitelisted $path
> > > >
> > > > Those are the things I would check, because the only reasons I can think
> > > > of why os-brick is not disconnecting any volumes are that the connection
> > > > info is not right, or that the volume is no longer connected.
> > > >
> > > > Cheers,
> > > > Gorka.
> > > >
> > > > > On Tue, 20 Oct 2020 at 11:23, Gorka Eguileor <geguileo@redhat.com>
> > > > > wrote:
> > > > >
> > > > > > On 20/10, Ignazio Cassano wrote:
> > > > > > > Hello Gorka, this is what happens on nova compute with debug enabled,
> > > > > > > when I migrate an instance with iscsi volumes (note: Disconnecting
> > > > > > > from [] should be the issue):
> > > > > > >
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > The disconnect from [] is the right clue, not necessarily the issue.
> > > > > >
> > > > > > OS-Brick is saying that for the connection information that has been
> > > > > > passed in the "disconnect_volume" call (which is not present in the
> > > > > > emailed logs) there are no volumes present in the system.
> > > > > >
> > > > > > You should check the connection info that Nova is passing to
> > > > > > disconnect_volume and confirm if that data is correct. For example
> > > > > > checking if the path present in the connection info dictionary is the
> > > > > > same as the one in the instance's XML dump, or if the LUN from the
> > > > > > connection info dict is actually present in the system.
> > > > > >
> > > > > > There are multiple reasons why Nova could be passing the wrong
> > > > > > connection info to os-brick. The ones that come to mind are:
> > > > > >
> > > > > > - There was a failed migration at some point, and Nova didn't rollback
> > > > > >   the connection info on the BDM table.
> > > > > > - Nova is calling multiple times initialize_connection on Cinder for the
> > > > > >   same host and the driver being used is not idempotent.
> > > > > >
> > > > > > Cheers,
> > > > > > Gorka.
> > > > > >
> > > > > > > stderr= _run_iscsiadm_bare
/usr/lib/python2.7/site-packages/os_brick/initiator/connectors/iscsi.py:1122
> > > > > > > 2020-10-20 09:52:33.066 132171 DEBUG > > > > os_brick.initiator.connectors.iscsi > > > > > > > [-] iscsi session list stdout=tcp: [10] 10.138.209.48:3260,9 > > > > > > > iqn.1992-04.com.emc:cx.ckm00184400687.a3 (non-flash) > > > > > > > tcp: [11] 10.138.215.17:3260,8 > > > > iqn.1992-04.com.emc:cx.ckm00184400687.a2 > > > > > > > (non-flash) > > > > > > > tcp: [12] 10.138.215.17:3260,8 > > > > iqn.1992-04.com.emc:cx.ckm00184400687.a2 > > > > > > > (non-flash) > > > > > > > tcp: [13] 10.138.215.18:3260,7 > > > > iqn.1992-04.com.emc:cx.ckm00184400687.b2 > > > > > > > (non-flash) > > > > > > > tcp: [14] 10.138.215.18:3260,7 > > > > iqn.1992-04.com.emc:cx.ckm00184400687.b2 > > > > > > > (non-flash) > > > > > > > tcp: [15] 10.138.209.47:3260,6 > > > > iqn.1992-04.com.emc:cx.ckm00184400687.b3 > > > > > > > (non-flash) > > > > > > > tcp: [16] 10.138.209.47:3260,6 > > > > iqn.1992-04.com.emc:cx.ckm00184400687.b3 > > > > > > > (non-flash) > > > > > > > tcp: [9] 10.138.209.48:3260,9 > > > > iqn.1992-04.com.emc:cx.ckm00184400687.a3 > > > > > > > (non-flash) > > > > > > > stderr= _run_iscsi_session > > > > > > > > > > > > > > > > > > >
/usr/lib/python2.7/site-packages/os_brick/initiator/connectors/iscsi.py:1111
> > > > > > > 2020-10-20 09:52:33.078 132171 DEBUG > > > > os_brick.initiator.connectors.iscsi > > > > > > > [-] Resulting device map defaultdict(<function <lambda> at > > > > > > 0x7f4f1b1f7cf8>, > > > > > > > {(u'10.138.215.17:3260', > > > > u'iqn.1992-04.com.emc:cx.ckm00184400687.a2'): > > > > > > > (set([]), set([u'sdg', u'sdi'])), (u' 10.138.209.47:3260 ', > > > > > > > u'iqn.1992-04.com.emc:cx.ckm00184400687.b3'): (set([]), > > set([u'sdo', > > > > > > > u'sdq'])), (u'10.138.209.48:3260', > > > > > > > u'iqn.1992-04.com.emc:cx.ckm00184400687.a3'): (set([]), > > set([u'sdd', > > > > > > > u'sdb'])), (u'10.138.215.18:3260', > > > > > > > u'iqn.1992-04.com.emc:cx.ckm00184400687.b2'): (set([]), > > set([u'sdm', > > > > > > > u'sdk']))}) _get_connection_devices > > > > > > > > > > > > > > > > > > >
/usr/lib/python2.7/site-packages/os_brick/initiator/connectors/iscsi.py:844
> > > > > > > 2020-10-20 09:52:33.078 132171 DEBUG > > > > os_brick.initiator.connectors.iscsi > > > > > > > [-] Disconnecting from: [] _disconnect_connection > > > > > > > > > > > > > > > > > > >
/usr/lib/python2.7/site-packages/os_brick/initiator/connectors/iscsi.py:1099
> > > > > > > 2020-10-20 09:52:33.079 132171 DEBUG oslo_concurrency.lockutils > > [-] > > > > Lock > > > > > > > "connect_volume" released by > > > > > > > "os_brick.initiator.connectors.iscsi.disconnect_volume" :: held > > > > 1.058s > > > > > > > inner > > > > /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:339 > > > > > > > 2020-10-20 09:52:33.079 132171 DEBUG > > > > os_brick.initiator.connectors.iscsi > > > > > > > [-] <== disconnect_volume: return (1057ms) None > > trace_logging_wrapper > > > > > > > /usr/lib/python2.7/site-packages/os_brick/utils.py:170 > > > > > > > 2020-10-20 09:52:33.079 132171 DEBUG > > nova.virt.libvirt.volume.iscsi > > > > [-] > > > > > > > [instance: 0c846f66-f194-40de-b31e-d53652570fa7] Disconnected > > iSCSI > > > > > > Volume > > > > > > > disconnect_volume > > > > > > > > > /usr/lib/python2.7/site-packages/nova/virt/libvirt/volume/iscsi.py:78 > > > > > > > 2020-10-20 09:52:33.080 132171 DEBUG os_brick.utils [-] ==> > > > > > > > get_connector_properties: call u"{'execute': None, 'my_ip': > > > > > > > '10.138.208.178', 'enforce_multipath': True, 'host': > > > > 'podiscsivc-kvm02', > > > > > > > 'root_helper': 'sudo nova-rootwrap /etc/nova/rootwrap.conf', > > > > 'multipath': > > > > > > > True}" trace_logging_wrapper > > > > > > > /usr/lib/python2.7/site-packages/os_brick/utils.py:146 > > > > > > > 2020-10-20 09:52:33.125 132171 DEBUG os_brick.initiator.linuxfc > > [-] > > > > No > > > > > > > Fibre Channel support detected on system. get_fc_hbas > > > > > > > > > /usr/lib/python2.7/site-packages/os_brick/initiator/linuxfc.py:157 > > > > > > > 2020-10-20 09:52:33.126 132171 DEBUG os_brick.initiator.linuxfc > > [-] > > > > No > > > > > > > Fibre Channel support detected on system. 
get_fc_hbas /usr/lib/python2.7/site-packages/os_brick/initiator/linuxfc.py:157
> > > > > > > 2020-10-20 09:52:33.145 132171 DEBUG os_brick.utils [-] <== get_connector_properties: return (61ms) {'initiator': u'iqn.1994-05.com.redhat:fbfdc37eed4c', 'ip': u'10.138.208.178', 'system uuid': u'4C4C4544-0051-4E10-8057-B6C04F425932', 'platform': u'x86_64', 'host': u'podiscsivc-kvm02', 'do_local_attach': False, 'os_type': u'linux2', 'multipath': True} trace_logging_wrapper /usr/lib/python2.7/site-packages/os_brick/utils.py:170
> > > > > > >
> > > > > > > Best regards
> > > > > > > Ignazio
> > > > > > >
> > > > > > > On Thu, 15 Oct 2020 at 10:57, Gorka Eguileor <geguileo@redhat.com> wrote:
> > > > > > >
> > > > > > > > On 14/10, Ignazio Cassano wrote:
> > > > > > > > > Hello, thank you for the answer.
> > > > > > > > > I am using os-brick 2.3.8, but I got the same issues on Stein with os-brick 2.8.
> > > > > > > > > To explain the situation better, here is the output of multipath -ll on
> > > > > > > > > a compute node:
> > > > > > > > > [root@podvc-kvm01 ansible]# multipath -ll
> > > > > > > > > Oct 14 18:50:01 | sdbg: alua not supported
> > > > > > > > > Oct 14 18:50:01 | sdbe: alua not supported
> > > > > > > > > Oct 14 18:50:01 | sdbd: alua not supported
> > > > > > > > > Oct 14 18:50:01 | sdbf: alua not supported
> > > > > > > > > 360060160f0d049007ab7275f743d0286 dm-11 DGC ,VRAID
> > > > > > > > > size=30G features='1 retain_attached_hw_handler' hwhandler='1 alua' wp=rw
> > > > > > > > > |-+- policy='round-robin 0' prio=0 status=enabled
> > > > > > > > > | |- 15:0:0:71 sdbg 67:160 failed faulty running
> > > > > > > > > | `- 12:0:0:71 sdbe 67:128 failed faulty running
> > > > > > > > > `-+- policy='round-robin 0' prio=0 status=enabled
> > > > > > > > >   |- 11:0:0:71 sdbd 67:112 failed faulty running
> > > > > > > > >   `- 13:0:0:71 sdbf 67:144 failed faulty running
> > > > > > > > > 360060160f0d049004cdb615f52343fdb dm-8 DGC ,VRAID
> > > > > > > > > size=80G features='2 queue_if_no_path retain_attached_hw_handler' hwhandler='1 alua' wp=rw
> > > > > > > > > |-+- policy='round-robin 0' prio=50 status=active
> > > > > > > > > | |- 15:0:0:210 sdau 66:224 active ready running
> > > > > > > > > | `- 12:0:0:210 sdas 66:192 active ready running
> > > > > > > > > `-+- policy='round-robin 0' prio=10 status=enabled
> > > > > > > > >   |- 11:0:0:210 sdar 66:176 active ready running
> > > > > > > > >   `- 13:0:0:210 sdat 66:208 active ready running
> > > > > > > > > 360060160f0d0490034aa645fe52265eb dm-12 DGC ,VRAID
> > > > > > > > > size=100G features='2 queue_if_no_path retain_attached_hw_handler' hwhandler='1 alua' wp=rw
> > > > > > > > > |-+- policy='round-robin 0' prio=50 status=active
> > > > > > > > > | |- 12:0:0:177 sdbi 67:192 active ready running
> > > > > > > > > | `- 15:0:0:177 sdbk 67:224 active ready running
> > > > > > > > > `-+- policy='round-robin 0' prio=10 status=enabled
> > > > > > > > >   |- 11:0:0:177 sdbh 67:176 active ready running
> > > > > > > > >   `- 13:0:0:177 sdbj 67:208 active ready running
> > > > > > > > > 360060160f0d04900159f225fd6126db9 dm-6 DGC ,VRAID
> > > > > > > > > size=40G features='2 queue_if_no_path retain_attached_hw_handler' hwhandler='1 alua' wp=rw
> > > > > > > > > |-+- policy='round-robin 0' prio=50 status=active
> > > > > > > > > | |- 11:0:0:26 sdaf 65:240 active ready running
> > > > > > > > > | `- 13:0:0:26 sdah 66:16 active ready running
> > > > > > > > > `-+- policy='round-robin 0' prio=10 status=enabled
> > > > > > > > >   |- 12:0:0:26 sdag 66:0 active ready running
> > > > > > > > >   `- 15:0:0:26 sdai 66:32 active ready running
> > > > > > > > > Oct 14 18:50:01 | sdba: alua not supported
> > > > > > > > > Oct 14 18:50:01 | sdbc: alua not supported
> > > > > > > > > Oct 14 18:50:01 | sdaz: alua not supported
> > > > > > > > > Oct 14 18:50:01 | sdbb: alua not supported
> > > > > > > > > 360060160f0d049007eb7275f93937511 dm-10 DGC ,VRAID
> > > > > > > > > size=40G features='1 retain_attached_hw_handler' hwhandler='1 alua' wp=rw
> > > > > > > > > |-+- policy='round-robin 0' prio=0 status=enabled
> > > > > > > > > | |- 12:0:0:242 sdba 67:64 failed faulty running
> > > > > > > > > | `- 15:0:0:242 sdbc 67:96 failed faulty running
> > > > > > > > > `-+- policy='round-robin 0' prio=0 status=enabled
> > > > > > > > >   |- 11:0:0:242 sdaz 67:48 failed faulty running
> > > > > > > > >   `- 13:0:0:242 sdbb 67:80 failed faulty running
> > > > > > > > > 360060160f0d049003a567c5fb72201e8 dm-7 DGC ,VRAID
> > > > > > > > > size=40G features='2 queue_if_no_path retain_attached_hw_handler' hwhandler='1 alua' wp=rw
> > > > > > > > > |-+- policy='round-robin 0' prio=50 status=active
> > > > > > > > > | |- 12:0:0:57 sdbq 68:64 active ready running
> > > > > > > > > | `- 15:0:0:57 sdbs 68:96 active ready running
> > > > > > > > > `-+- policy='round-robin 0' prio=10 status=enabled
> > > > > > > > >   |- 11:0:0:57 sdbp 68:48 active ready running
> > > > > > > > >   `- 13:0:0:57 sdbr 68:80 active ready running
> > > > > > > > > 360060160f0d04900c120625f802ea1fa dm-9 DGC ,VRAID
> > > > > > > > > size=25G features='2 queue_if_no_path retain_attached_hw_handler' hwhandler='1 alua' wp=rw
> > > > > > > > > |-+- policy='round-robin 0' prio=50 status=active
> > > > > > > > > | |- 11:0:0:234 sdav 66:240 active ready running
> > > > > > > > > | `- 13:0:0:234 sdax 67:16 active ready running
> > > > > > > > > `-+- policy='round-robin 0' prio=10 status=enabled
> > > > > > > > >   |- 15:0:0:234 sday 67:32 active ready running
> > > > > > > > >   `- 12:0:0:234 sdaw 67:0 active ready running
> > > > > > > > > 360060160f0d04900b8b0615fb14ef1bd dm-3 DGC ,VRAID
> > > > > > > > > size=50G features='2 queue_if_no_path retain_attached_hw_handler' hwhandler='1 alua' wp=rw
> > > > > > > > > |-+- policy='round-robin 0' prio=50 status=active
> > > > > > > > > | |- 11:0:0:11 sdan 66:112 active ready running
> > > > > > > > > | `- 13:0:0:11 sdap 66:144 active ready running
> > > > > > > > > `-+- policy='round-robin 0' prio=10 status=enabled
> > > > > > > > >   |- 12:0:0:11 sdao 66:128 active ready running
> > > > > > > > >   `- 15:0:0:11 sdaq 66:160 active ready running
> > > > > > > > >
> > > > > > > > > The active, running paths belong to running virtual machines.
> > > > > > > > > The faulty ones belong to virtual machines that were migrated to other KVM nodes.
> > > > > > > > > Every volume has 4 paths because iSCSI on Unity needs two different VLANs,
> > > > > > > > > each one with 2 addresses.
> > > > > > > > > I think this issue may be related to os-brick, because when I migrate a
> > > > > > > > > virtual machine from host A to host B, in the nova-compute log on host A I read:
> > > > > > > > > 2020-10-13 10:31:02.769 118727 DEBUG os_brick.initiator.connectors.iscsi [req-771ede8c-6e1b-4f3f-ad4a-1f6ed820a55c 66adb965bef64eaaab2af93ade87e2ca 85cace94dcc7484c85ff9337eb1d0c4c - default default] *Disconnecting from: []*
> > > > > > > > >
> > > > > > > > > Ignazio
> > > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > That's definitely the right clue!! Though I don't fully agree with this
> > > > > > > > being an os-brick issue just yet. ;-)
> > > > > > > >
> > > > > > > > Like I mentioned before, RCA is usually non-trivial, and explaining how
> > > > > > > > to debug these issues over email is close to impossible, but if this
> > > > > > > > were my system, and assuming you have tested the normal attach/detach
> > > > > > > > procedure and it is working fine, this is what I would do:
> > > > > > > >
> > > > > > > > - Enable DEBUG logs on the Nova compute node (I believe you already have)
> > > > > > > > - Attach a new device to an instance on that node with --debug to get
> > > > > > > >   the request id
> > > > > > > > - Get the connection information dictionary that os-brick receives on
> > > > > > > >   the call to connect_volume for that request, and the data that
> > > > > > > >   os-brick returns to Nova on that method call completion.
> > > > > > > > - Check if the returned data to Nova is a multipathed device or not (in
> > > > > > > >   'path'), and whether we have the wwn or not (in 'scsi_wwn'). It
> > > > > > > >   should be a multipath device, and then I would check its status in the
> > > > > > > >   multipath daemon.
> > > > > > > > - Now do the live migration (with --debug to get the request id) and see
> > > > > > > >   what information Nova passes in that request to os-brick's
> > > > > > > >   disconnect_volume.
> > > > > > > >   - Is it the same? Then it's likely an os-brick issue, and I can have a
> > > > > > > >     look at the logs if you put the logs for that os-brick detach
> > > > > > > >     process in a pastebin [1].
> > > > > > > >   - Is it different? Then it's either a Nova bug or a Cinder driver
> > > > > > > >     specific bug.
> > > > > > > >     - Is there a call from Nova to Cinder, in the migration request, for
> > > > > > > >       that same volume to initialize_connection passing the source host
> > > > > > > >       connector info (info from the host that is currently attached)?
> > > > > > > >       If there is a call, check if the returned data is different from
> > > > > > > >       the one we used to do the attach; if that's the case then it's a
> > > > > > > >       Nova and Cinder driver bug that was solved on the Nova side in
> > > > > > > >       17.0.10 [2].
> > > > > > > >     - If there's no call to Cinder's initialize_connection, then it's
> > > > > > > >       most likely a Nova bug. Try to find out if this connection info
> > > > > > > >       makes any sense for that host (LUN, target, etc.) or if this is
> > > > > > > >       the one from the destination volume.
> > > > > > > >
> > > > > > > > I hope this somehow helps.
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > Gorka.
> > > > > > > >
> > > > > > > > [1]: http://paste.openstack.org/
> > > > > > > > [2]: https://review.opendev.org/#/c/637827/
> > > > > > > >
> > > > > > > > > On Wed, 14 Oct 2020 at 13:41, Gorka Eguileor <geguileo@redhat.com> wrote:
> > > > > > > > >
> > > > > > > > > > On 09/10, Ignazio Cassano wrote:
> > > > > > > > > > > Hello Stackers, I am using the Dell EMC iSCSI driver on my CentOS 7
> > > > > > > > > > > Queens OpenStack. It works and instances work as well, but on compute
> > > > > > > > > > > nodes I get a lot of faulty devices reported by the multipath -ll command.
> > > > > > > > > > > I do not know why this happens; probably attaching and detaching
> > > > > > > > > > > volumes and live migrating instances do not close something well.
> > > > > > > > > > > I read this can cause serious performance problems on compute nodes.
> > > > > > > > > > > Please, can you suggest any workaround and/or patch?
> > > > > > > > > > > Regards
> > > > > > > > > > > Ignazio
> > > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > There are many, many, many things that could be happening there, and
> > > > > > > > > > it's not usually trivial doing the RCA, so the following questions are
> > > > > > > > > > just me hoping this is something "easy" to find out.
> > > > > > > > > >
> > > > > > > > > > What os-brick version from Queens are you running? Latest (2.3.9), or
> > > > > > > > > > maybe one older than 2.3.3?
> > > > > > > > > >
> > > > > > > > > > When you say you have faulty devices reported, are these faulty devices
> > > > > > > > > > alone in the multipath DM? Or do you have some faulty ones with some
> > > > > > > > > > that are ok?
> > > > > > > > > >
> > > > > > > > > > If there are some OK and some that aren't, are they consecutive devices?
> > > > > > > > > > (as in /dev/sda /dev/sdb etc).
> > > > > > > > > >
> > > > > > > > > > Cheers,
> > > > > > > > > > Gorka.
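
Gorka's question above (are the faulty paths alone in a multipath map, or mixed with healthy ones?) can be answered mechanically from the multipath -ll output. What follows is only a minimal Python sketch, not part of os-brick or multipath-tools. The function name classify_maps, the two regular expressions and the SAMPLE text are illustrative assumptions, and the parser assumes the output layout quoted in this thread, where the WWID is the first token of each map header (that is, no user-friendly map names).

```python
import re
from collections import defaultdict

# Map header, e.g. "360060160f0d049007ab7275f743d0286 dm-11 DGC ,VRAID".
# Assumption: the WWID is the first token (no user-friendly alias in front).
MAP_RE = re.compile(r"^(\S+)\s+(dm-\d+)\s")

# Path line, e.g. "| |- 15:0:0:71 sdbg 67:160 failed faulty running".
# Groups: H:C:T:L, device name, DM state, path state, online state.
PATH_RE = re.compile(
    r"(\d+:\d+:\d+:\d+)\s+(sd[a-z]+)\s+\d+:\d+\s+(\w+)\s+(\w+)\s+(\w+)")


def classify_maps(multipath_ll_output):
    """Return {wwid: 'all-faulty' | 'mixed' | 'healthy'} for each map."""
    faulty_flags = defaultdict(list)
    current_wwid = None
    for line in multipath_ll_output.splitlines():
        header = MAP_RE.match(line)
        if header:
            current_wwid = header.group(1)
            continue
        path = PATH_RE.search(line)
        if path and current_wwid:
            faulty_flags[current_wwid].append(path.group(4) == "faulty")

    result = {}
    for wwid, flags in faulty_flags.items():
        if all(flags):
            result[wwid] = "all-faulty"
        elif any(flags):
            result[wwid] = "mixed"
        else:
            result[wwid] = "healthy"
    return result


# Two maps copied from the output quoted in this thread.
SAMPLE = """Oct 14 18:50:01 | sdbg: alua not supported
360060160f0d049007ab7275f743d0286 dm-11 DGC ,VRAID
size=30G features='1 retain_attached_hw_handler' hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=0 status=enabled
| |- 15:0:0:71 sdbg 67:160 failed faulty running
| `- 12:0:0:71 sdbe 67:128 failed faulty running
`-+- policy='round-robin 0' prio=0 status=enabled
  |- 11:0:0:71 sdbd 67:112 failed faulty running
  `- 13:0:0:71 sdbf 67:144 failed faulty running
360060160f0d049004cdb615f52343fdb dm-8 DGC ,VRAID
size=80G features='2 queue_if_no_path retain_attached_hw_handler' hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| |- 15:0:0:210 sdau 66:224 active ready running
| `- 12:0:0:210 sdas 66:192 active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  |- 11:0:0:210 sdar 66:176 active ready running
  `- 13:0:0:210 sdat 66:208 active ready running
"""

if __name__ == "__main__":
    for wwid, state in classify_maps(SAMPLE).items():
        print(wwid, state)
```

On a compute node the same function could be fed the text captured from running multipath -ll. Maps reported as all-faulty match the leftover devices Ignazio describes after a migration, while a mixed map would point to a different kind of problem (for example a single failed portal).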
participants (2)
- Gorka Eguileor
- Ignazio Cassano