[nova][os-brick] iSCSI multipath oddness during hard reboot

Grant Morley grant at civo.com
Wed Oct 28 10:42:46 UTC 2020


Hi,

It appears we are using `queue_if_no_path` in our multipath config:

device {
         vendor "NETAPP"
         product "LUN.*"
         path_grouping_policy "group_by_prio"
         path_checker "tur"
         features "3 queue_if_no_path pg_init_retries 50"
         hardware_handler "0"
         prio "ontap"
         failback "immediate"
         rr_weight "uniform"
         rr_min_io 128
         flush_on_last_del "yes"
         dev_loss_tmo "infinity"
         user_friendly_names "no"
         retain_attached_hw_handler "yes"
         detect_prio "yes"

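For completeness, this is how I'd expect to confirm the setting is actually
active on the running maps (the map name below is just a placeholder):

$ sudo multipathd -k"show config" | grep -A 20 'vendor "NETAPP"'
$ sudo dmsetup table <mpath_device> | grep queue_if_no_path

If queue_if_no_path shows up in the dmsetup output, my understanding is that
I/O to that map will block indefinitely once all paths are down, which would
match the hung reboots.
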
We are using the libvirt virt driver, and this is the output of
`sudo virsh dominfo $instance_uuid`:

Id:             -
Name:           instance-00049282
UUID:           c8079e85-4777-4615-9d5a-3d1151e11984
OS Type:        hvm
State:          shut off
CPU(s):         1
Max memory:     1048576 KiB
Used memory:    1048576 KiB
Persistent:     yes
Autostart:      disable
Managed save:   no
Security model: apparmor
Security DOI:   0

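Since the domain is already shut off, I'm assuming the manual cleanup you
mention would be roughly along these lines (the map name, target IQN and
portal below are placeholders for whatever `multipath -ll` and
`iscsiadm -m session` report for this volume):

$ sudo multipath -ll
$ sudo multipathd -k"disablequeueing map <mpath_device>"  # stop queueing so the flush doesn't hang
$ sudo multipath -f <mpath_device>                        # flush the now-unused map
$ sudo iscsiadm -m session                                # find the session for this target
$ sudo iscsiadm -m node -T <target_iqn> -p <portal>:3260 --logout  # only if nothing else uses this session

Please shout if that is the wrong approach though, as I'd rather not make
things worse on an already heavily loaded compute host.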

We are using an old version of OpenStack (Queens) that we are in the
process of migrating away from, so I'm not sure it is worth raising a bug
against such an old version?

Happy to do so if it will help.

Regards,

On 28/10/2020 10:06, Lee Yarwood wrote:
> On 28-10-20 17:35:59, Tony Pearce wrote:
>> Grant,
>>
>> As a guess I am suspecting your "fail_if_no_path" might be the issue but I
>> am not sure on the inner workings or mechanism at play during the reboot or
>> why it's getting stuck here for you. Your storage vendor may have
>> documentation to state what the multipath (and iscsid) config should be
>> from your host. Before changing config though I recommend getting the root
>> cause realised.  /var/log/messages log could help.
> Did you mean queue_if_no_path?
>   
>> Also if you go into the multipath CLI "multipathd -k" and issue "show
>> config" you may see a "NETAPP" config there already. Depending on the IDs
>> your storage may be matching that rather than the default config within
>> multipath.conf FYI.
> So Nova will ask os-brick to try to disconnect volumes during a hard
> reboot of an instance and I suspect this is where things are getting
> stuck in your env if you're using queue_if_no_path.
>
> Assuming you're using the libvirt virt driver has the underlying domain
> for the instance been destroyed already?
>
> $ sudo virsh dominfo $instance_uuid
>
> If it has been then we might be able to cleanup the volume manually.
>
> Either way it might be useful to raise a bug for this against Nova and
> os-brick so we can take a look at the attempt to hard reboot in more
> detail.
>
> https://launchpad.net/nova/+filebug
>
> ^ Please use the template underneath the further information textbox once
> you've provided a title and if possible include the additional output
> somewhere for review.
>
> $ openstack server event list $instance_uuid
>
> ^ This will provide a list of actions and their associated request-ids.
> Using the request-id associated with the failing hard reboot can you
> then provide logs from the compute.
>
> $ zgrep -l $request-id /var/log/nova/*
>
> ^ Obviously this depends on how logging is enabled in your env but you
> hopefully get the idea.
>
>> On Wed, 28 Oct 2020 at 15:56, Grant Morley <grant at civo.com> wrote:
>>
>>> Hi Tony,
>>>
>>> We are using NetApp SolidFire for our storage. Instances seem to be in a
>>> normal "working" state before we try and reboot them.
>>>
>>> I haven't looked into `/usr/bin/rescan-scsi-bus.sh` but I will now so
>>> thanks for that.
>>>
>>> We are using multipath but kept it on the defaults so it looks like only 1
>>> path is being used.
>>>
>>> I had a feeling it was down to heavily loaded compute causing the issue.
>>>
>>> The config for iscsi is also the defaults from which openstack Ansible
>>> deployed.
>>>
>>> Thanks for your help.
>>>
>>> Grant
>>> On 28/10/2020 02:25, Tony Pearce wrote:
>>>
>>> Hi Grant, what storage are you using here? Is the instance in an
>>> apparently "working" state before you try and reboot it?
>>>
>>> Have you looked into `/usr/bin/rescan-scsi-bus.sh` ? Please see this
>>> reference link in the first instance: [1] "When ‘rescan-scsi-bus.sh -i’ is
>>> run, script execute as well a LIP_RESET (ISSUE_LIP) which may cause a
>>> disruption in I/O on the server and even cause an outage in case of a
>>> system running on heavy load."
>>>
>>> Are you using multipath? Some helpful commands:
>>>
>>> `tail -f /var/log/messages | grep multipath`
>>>
>>> `multipathd -k` = will go into the multipath CLI. Then while in the CLI:
>>> show config
>>> show paths
>>>
>>> If the cli is accessible then you're likely using multipath even if 1
>>> path. Then the multipath.conf is taking effect even if it's a default
>>> config.
>>>
>>> Config files relating to iscsi storage:
>>> /etc/iscsi/iscsid.conf
>>> /etc/multipath/multipath.conf
>>>
>>> [1]
>>> https://www.thegeekdiary.com/when-to-use-rescan-scsi-bus-sh-i-lip-flag-in-centos-rhel/
>>>
>>> Regards,
>>>
>>> Tony Pearce
>>>
>>>
>>>
>>> On Wed, 28 Oct 2020 at 03:39, Grant Morley <grant at civo.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> We are seeing some oddness on a couple of our compute hosts that seems to
>>>> be related to iSCSI. On a couple of our hosts I am seeing this error in the
>>>> nova compute logs:
>>>>
>>>> 2020-10-27 18:56:14.814 31490 WARNING os_brick.initiator.connectors.iscsi
>>>> [req-8613ae69-1661-49cf-8bdc-6fec875d01ba - - - - -] Couldn't find iscsi
>>>> sessions because iscsiadm err: iscsiadm: could not read session targetname:
>>>> 5
>>>> iscsiadm: could not find session info for session1707
>>>>
>>>> That seems to also stop any instance on the compute host from being able
>>>> to reboot.  Reboots seem to get accepted but the instance never completes
>>>> and gets stuck in the reboot state:
>>>>
>>>> 2020-10-27 19:11:58.891 48612 INFO nova.compute.manager [-] [instance:
>>>> c8079e85-4777-4615-9d5a-3d1151e11984] During sync_power_state the instance
>>>> has a pending task (reboot_started_hard). Skip.
>>>> 2020-10-27 19:11:58.891 48612 INFO nova.compute.manager [-] [instance:
>>>> 31128f26-910d-411f-98e0-c95dd36f4f0f] During sync_power_state the instance
>>>> has a pending task (reboot_started_hard). Skip.
>>>>
>>>> Does anyone know of a way to resolve this without rebooting the entire
>>>> compute host? I can't see any other issues other than the fact there is
>>>> this iSCSI error which in turn seems to stop nova from processing anything
>>>> for any instance.
>>>>
>>>> Any advice would be much appreciated.
>>>>
>>>> Regards,
-- 
Grant Morley
Cloud Engineer, Civo Ltd
Unit H-K, Gateway 1000, Whittle Way
Stevenage, Herts, SG1 2FP, UK
