[nova][os-brick] iSCSI multipath oddness during hard reboot

Tony Pearce tonyppe at gmail.com
Thu Oct 29 01:51:16 UTC 2020


Also I am sending this as an FYI because I learned this the hard way :)
"ISSUES WITH QUEUE_IF_NO_PATH FEATURE" [1]

[1]
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/dm_multipath/queueifnopath_issues
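
In case it helps anyone else, here is a minimal sketch of how queueing can
be turned off, assuming the stock device-mapper-multipath tooling and that
no vendor "devices" section overrides it (the map name below is just a
placeholder):

  # /etc/multipath/multipath.conf (excerpt)
  defaults {
      no_path_retry fail    # fail I/O instead of queueing when all paths are gone
  }

  # apply the new config
  sudo multipathd reconfigure

  # runtime override for a map that is already queueing
  sudo dmsetup message <map-name> 0 "fail_if_no_path"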


Tony Pearce



On Thu, 29 Oct 2020 at 09:44, Tony Pearce <tonyppe at gmail.com> wrote:

> On Wed, 28 Oct 2020 at 18:06, Lee Yarwood <lyarwood at redhat.com> wrote:
>
>> On 28-10-20 17:35:59, Tony Pearce wrote:
>> > Grant,
>> >
>> > As a guess, I suspect your "fail_if_no_path" setting might be the issue,
>> > but I am not sure of the inner workings or the mechanism at play during
>> > the reboot, or why it is getting stuck here for you. Your storage vendor
>> > may have documentation stating what the multipath (and iscsid) config
>> > should be on your host. Before changing any config, though, I recommend
>> > identifying the root cause. /var/log/messages could help.
>>
>> Did you mean queue_if_no_path?
>
>
> Yes indeed, my apologies. I no longer have "queue_if_no_path" in my config.
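>
> For what it's worth, a quick way to double-check that queueing is really
> gone (output format may vary a bit between multipath-tools versions) is
> the features line of each map:
>
>   sudo multipath -ll | grep -i features
>   # features='0'                   -> no queueing
>   # features='1 queue_if_no_path'  -> queueing still active on that map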
>
> Best regards,
>
> Tony Pearce
>
>
>
> On Wed, 28 Oct 2020 at 18:06, Lee Yarwood <lyarwood at redhat.com> wrote:
>
>> On 28-10-20 17:35:59, Tony Pearce wrote:
>> > Grant,
>> >
>> > As a guess, I suspect your "fail_if_no_path" setting might be the issue,
>> > but I am not sure of the inner workings or the mechanism at play during
>> > the reboot, or why it is getting stuck here for you. Your storage vendor
>> > may have documentation stating what the multipath (and iscsid) config
>> > should be on your host. Before changing any config, though, I recommend
>> > identifying the root cause. /var/log/messages could help.
>>
>> Did you mean queue_if_no_path?
>>
>> > Also if you go into the multipath CLI "multipathd -k" and issue "show
>> > config" you may see a "NETAPP" config there already. Depending on the
>> > IDs your storage may be matching that rather than the default config
>> > within multipath.conf FYI.
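>> >
>> > As a rough illustration (the one-shot "multipathd show config" form is
>> > an assumption here; on older versions use "multipathd -k" interactively
>> > instead), you can check which built-in entry your LUNs are matching:
>> >
>> >   sudo multipathd show config | grep -i -A 10 netapp
>> >   sudo multipath -ll   # shows vendor/product per map, which is what
>> >                        # the config entry is matched against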
>>
>> So Nova will ask os-brick to try to disconnect volumes during a hard
>> reboot of an instance and I suspect this is where things are getting
>> stuck in your env if you're using queue_if_no_path.
>>
>> Assuming you're using the libvirt virt driver has the underlying domain
>> for the instance been destroyed already?
>>
>> $ sudo virsh dominfo $instance_uuid
>>
>> If it has been then we might be able to cleanup the volume manually.
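>>
>> If so, the manual cleanup usually looks something like the following
>> sketch (the map/WWID, target IQN and portal are placeholders; only do
>> this once you're sure nothing on the host still uses the device):
>>
>>   sudo multipath -ll                        # identify the stale map and its WWID
>>   sudo multipath -f <wwid-or-map-name>      # flush the stale multipath map
>>   sudo iscsiadm -m session                  # list the iSCSI sessions and portals
>>   sudo iscsiadm -m node -T <target-iqn> -p <portal>:3260 --logout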
>>
>> Either way it might be useful to raise a bug for this against Nova and
>> os-brick so we can take a look at the attempt to hard reboot in more
>> detail.
>>
>> https://launchpad.net/nova/+filebug
>>
>> ^ Please use the template underneath the further information textbox once
>> you've provided a title, and if possible include the additional output
>> somewhere for review.
>>
>> $ openstack server event list $instance_uuid
>>
>> ^ This will provide a list of actions and their associated request-ids.
>> Using the request-id associated with the failing hard reboot, can you
>> then provide logs from the compute?
>>
>> $ zgrep -l $request-id /var/log/nova/*
>>
>> ^ Obviously this depends on how logging is enabled in your env but you
>> hopefully get the idea.
>>
>> > On Wed, 28 Oct 2020 at 15:56, Grant Morley <grant at civo.com> wrote:
>> >
>> > > Hi Tony,
>> > >
>> > > We are using NetApp SolidFire for our storage. Instances seem to be
>> > > in a normal "working" state before we try and reboot them.
>> > >
>> > > I haven't looked into `/usr/bin/rescan-scsi-bus.sh` but I will now so
>> > > thanks for that.
>> > >
>> > > We are using multipath but kept it on the defaults so it looks like
>> > > only 1 path is being used.
>> > >
>> > > I had a feeling it was down to a heavily loaded compute causing the
>> > > issue.
>> > >
>> > > The iscsi config is also the defaults that OpenStack-Ansible deployed.
>> > >
>> > > Thanks for your help.
>> > >
>> > > Grant
>> > > On 28/10/2020 02:25, Tony Pearce wrote:
>> > >
>> > > Hi Grant, what storage are you using here? Is the instance in an
>> > > apparently "working" state before you try and reboot it?
>> > >
>> > > Have you looked into `/usr/bin/rescan-scsi-bus.sh`? Please see this
>> > > reference link in the first instance: [1] "When ‘rescan-scsi-bus.sh -i’
>> > > is run, script execute as well a LIP_RESET (ISSUE_LIP) which may cause a
>> > > disruption in I/O on the server and even cause an outage in case of a
>> > > system running on heavy load."
>> > >
>> > > Are you using multipath? Some helpful commands:
>> > >
>> > > `tail -f /var/log/messages | grep multipath`
>> > >
>> > > `multipathd -k` = will go into the multipath CLI. Then while in the CLI:
>> > > show config
>> > > show paths
>> > >
>> > > If the CLI is accessible then you're likely using multipath, even if
>> > > with only 1 path. In that case the multipath.conf is taking effect even
>> > > if it's a default config.
>> > >
>> > > Config files relating to iscsi storage:
>> > > /etc/iscsi/iscsid.conf
>> > > /etc/multipath/multipath.conf
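>> > >
>> > > A couple of settings worth eyeballing in those files (parameter names
>> > > are from a stock open-iscsi / multipath-tools install; your vendor
>> > > docs may recommend different values):
>> > >
>> > >   grep -E 'replacement_timeout|node.startup' /etc/iscsi/iscsid.conf
>> > >   grep -E 'no_path_retry|queue_if_no_path' /etc/multipath/multipath.conf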
>> > >
>> > > [1]
>> > >
>> https://www.thegeekdiary.com/when-to-use-rescan-scsi-bus-sh-i-lip-flag-in-centos-rhel/
>> > >
>> > > Regards,
>> > >
>> > > Tony Pearce
>> > >
>> > >
>> > >
>> > > On Wed, 28 Oct 2020 at 03:39, Grant Morley <grant at civo.com> wrote:
>> > >
>> > >> Hi all,
>> > >>
>> > >> We are seeing some oddness on a couple of our compute hosts that seems
>> > >> to be related to iSCSI. On a couple of our hosts I am seeing this error
>> > >> in the nova compute logs:
>> > >>
>> > >> 2020-10-27 18:56:14.814 31490 WARNING os_brick.initiator.connectors.iscsi
>> > >> [req-8613ae69-1661-49cf-8bdc-6fec875d01ba - - - - -] Couldn't find iscsi
>> > >> sessions because iscsiadm err: iscsiadm: could not read session targetname: 5
>> > >> iscsiadm: could not find session info for session1707
>> > >>
>> > >> That seems to also stop any instance on the compute host from being
>> > >> able to reboot. Reboots seem to get accepted but the instance never
>> > >> completes and gets stuck in the reboot state:
>> > >>
>> > >> 2020-10-27 19:11:58.891 48612 INFO nova.compute.manager [-] [instance:
>> > >> c8079e85-4777-4615-9d5a-3d1151e11984] During sync_power_state the instance
>> > >> has a pending task (reboot_started_hard). Skip.
>> > >> 2020-10-27 19:11:58.891 48612 INFO nova.compute.manager [-] [instance:
>> > >> 31128f26-910d-411f-98e0-c95dd36f4f0f] During sync_power_state the instance
>> > >> has a pending task (reboot_started_hard). Skip.
>> > >>
>> > >> Does anyone know of a way to resolve this without rebooting the entire
>> > >> compute host? I can't see any other issues other than the fact there is
>> > >> this iSCSI error which in turn seems to stop nova from processing
>> > >> anything for any instance.
>> > >>
>> > >> Any advice would be much appreciated.
>> > >>
>> > >> Regards,
>>
>> --
>> Lee Yarwood                 A5D1 9385 88CB 7E5F BE64  6618 BCA6 6E33 F672 2D76
>>
>

