Hi all,

We are seeing some oddness on a couple of our compute hosts that seems to be related to iSCSI. On a couple of our hosts I am seeing this error in the nova compute logs:

2020-10-27 18:56:14.814 31490 WARNING os_brick.initiator.connectors.iscsi [req-8613ae69-1661-49cf-8bdc-6fec875d01ba - - - - -] Couldn't find iscsi sessions because iscsiadm err: iscsiadm: could not read session targetname: 5
iscsiadm: could not find session info for session1707

That seems to also stop any instance on the compute host from being able to reboot. Reboots seem to get accepted but the instance never completes and gets stuck in the reboot state:

2020-10-27 19:11:58.891 48612 INFO nova.compute.manager [-] [instance: c8079e85-4777-4615-9d5a-3d1151e11984] During sync_power_state the instance has a pending task (reboot_started_hard). Skip.
2020-10-27 19:11:58.891 48612 INFO nova.compute.manager [-] [instance: 31128f26-910d-411f-98e0-c95dd36f4f0f] During sync_power_state the instance has a pending task (reboot_started_hard). Skip.

Does anyone know of a way to resolve this without rebooting the entire compute host? I can't see any other issues other than the fact there is this iSCSI error, which in turn seems to stop nova from processing anything for any instance.

Any advice would be much appreciated.

Regards,

--
Grant Morley
Cloud Engineer, Civo Ltd
Unit H-K, Gateway 1000, Whittle Way, Stevenage, Herts, SG1 2FP, UK
Visit us at www.civo.com | Signup for an account now: https://www.civo.com/signup
Hi Grant, what storage are you using here? Is the instance in an apparently "working" state before you try and reboot it?

Have you looked into `/usr/bin/rescan-scsi-bus.sh`? Please see this reference link in the first instance: [1] "When ‘rescan-scsi-bus.sh -i’ is run, script execute as well a LIP_RESET (ISSUE_LIP) which may cause a disruption in I/O on the server and even cause an outage in case of a system running on heavy load."

Are you using multipath? Some helpful commands:

`tail -f /var/log/messages | grep multipath`

`multipathd -k` = will drop you into the multipath CLI. Then, while in the CLI:
show config
show paths

If the CLI is accessible then you're likely using multipath, even if there is only 1 path. In that case multipath.conf is taking effect even if it's a default config.

Config files relating to iSCSI storage:
/etc/iscsi/iscsid.conf
/etc/multipath/multipath.conf

[1] https://www.thegeekdiary.com/when-to-use-rescan-scsi-bus-sh-i-lip-flag-in-ce...

Regards,

Tony Pearce
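P.S. For a quick look at the iSCSI session and multipath path state from the host, something like the following usually helps (a rough sketch assuming open-iscsi and multipath-tools are installed; on non-systemd hosts check /var/log/messages or /var/log/syslog instead of journalctl):

$ sudo iscsiadm -m session                # list the active iSCSI sessions
$ sudo iscsiadm -m session -P 3           # verbose view, including the SCSI devices attached to each session
$ sudo multipath -ll                      # show each multipath map and the state of its paths
$ sudo journalctl -u iscsid -u multipathd --since "1 hour ago"   # recent daemon logs on systemd hosts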
Hi Tony,

We are using NetApp SolidFire for our storage. Instances seem to be in a normal "working" state before we try and reboot them.

I haven't looked into `/usr/bin/rescan-scsi-bus.sh` but I will now, so thanks for that.

We are using multipath but kept it on the defaults, so it looks like only 1 path is being used.

I had a feeling it was down to heavily loaded compute causing the issue.

The config for iscsi is also the defaults that OpenStack-Ansible deployed.

Thanks for your help.

Grant
Grant,

As a guess I am suspecting your "fail_if_no_path" might be the issue, but I am not sure of the inner workings or the mechanism at play during the reboot, or why it's getting stuck here for you. Your storage vendor may have documentation stating what the multipath (and iscsid) config should be on your host. Before changing config, though, I recommend getting the root cause identified first; /var/log/messages could help.

Also, if you go into the multipath CLI ("multipathd -k") and issue "show config" you may see a "NETAPP" config there already. Depending on the IDs, your storage may be matching that rather than the default config within multipath.conf, FYI.

Regards,

Tony Pearce
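P.S. A quick way to dump the effective config and see whether the array is matching a built-in "NETAPP" device section might be something like this (just a sketch; on older multipath-tools you may need to run "show config" from inside `multipathd -k` instead):

$ sudo multipathd show config | grep -A 20 '"NETAPP"'   # built-in/overridden device sections matching NETAPP
$ sudo multipath -ll                                     # confirm which vendor/product each map actually reports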
Did you mean queue_if_no_path?
So Nova will ask os-brick to try to disconnect volumes during a hard reboot of an instance, and I suspect this is where things are getting stuck in your env if you're using queue_if_no_path.

Assuming you're using the libvirt virt driver, has the underlying domain for the instance been destroyed already?

$ sudo virsh dominfo $instance_uuid

If it has been then we might be able to clean up the volume manually.

Either way it might be useful to raise a bug for this against Nova and os-brick so we can take a look at the attempt to hard reboot in more detail.

https://launchpad.net/nova/+filebug

^ Please use the template underneath the further information textbox once you've provided a title, and if possible include the additional output somewhere for review.

$ openstack server event list $instance_uuid

^ This will provide a list of actions and their associated request-ids. Using the request-id associated with the failing hard reboot, can you then provide logs from the compute?

$ zgrep -l $request-id /var/log/nova/*

^ Obviously this depends on how logging is enabled in your env, but you hopefully get the idea.
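To make that concrete with the instance from your log above, it could look roughly like this (the request-id below is only a placeholder for whatever the event list actually returns):

$ openstack server event list c8079e85-4777-4615-9d5a-3d1151e11984
# note the req-... id of the failed reboot action, then search the compute logs for it:
$ sudo zgrep 'req-00000000-0000-0000-0000-000000000000' /var/log/nova/nova-compute.log*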
-- Lee Yarwood A5D1 9385 88CB 7E5F BE64 6618 BCA6 6E33 F672 2D76
Hi,

It appears we are using `queue_if_no_path`:

device {
    vendor                     "NETAPP"
    product                    "LUN.*"
    path_grouping_policy       "group_by_prio"
    path_checker               "tur"
    features                   "3 queue_if_no_path pg_init_retries 50"
    hardware_handler           "0"
    prio                       "ontap"
    failback                   "immediate"
    rr_weight                  "uniform"
    rr_min_io                  128
    flush_on_last_del          "yes"
    dev_loss_tmo               "infinity"
    user_friendly_names        "no"
    retain_attached_hw_handler "yes"
    detect_prio                "yes"
}

We are using the libvirt virt driver and this is the info from `sudo virsh dominfo $instance_uuid`:

Id:             -
Name:           instance-00049282
UUID:           c8079e85-4777-4615-9d5a-3d1151e11984
OS Type:        hvm
State:          shut off
CPU(s):         1
Max memory:     1048576 KiB
Used memory:    1048576 KiB
Persistent:     yes
Autostart:      disable
Managed save:   no
Security model: apparmor
Security DOI:   0

We are using an old version of OpenStack (Queens) that we are in the process of migrating away from. Not sure it is worth raising a bug on such an old version? Happy to do so if it will help.

Regards,
kk so the domain is still defined. You can try to undefine this manually and remove the associated multipath device from the host before asking Nova to attempt to power on (actually the same as a hard reboot behind the scenes). This should get your instance back up and running at least.
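As a rough sketch of that cleanup, using the names from your dominfo output (the WWID here is only a placeholder for whatever `multipath -ll` shows for the stuck volume, and only flush a map once you're sure nothing else on the host is using it):

$ sudo virsh undefine instance-00049282                        # remove the stale libvirt domain definition
$ sudo multipath -ll                                           # identify the map/WWID belonging to the stuck volume
$ sudo multipath -f 3600a098038304437415d4b6a596b4f76          # flush just that stale map (placeholder WWID)
$ openstack server start c8079e85-4777-4615-9d5a-3d1151e11984  # or hard reboot; Nova should rebuild the volume connection

If the instance is still stuck with a pending reboot task, it may also need something like `nova reset-state --active <uuid>` before Nova will accept the start/reboot request.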
There are still plenty of users on Queens so I'd be happy to take a look at the nova-compute logs during the initial attempt. Cheers, Lee
On Wed, 28 Oct 2020 at 18:06, Lee Yarwood <lyarwood@redhat.com> wrote:
Did you mean queue_if_no_path?
Yes indeed, my apologies. I no longer have "queue_if_no_path" in my config.

Best regards,

Tony Pearce
Also I am sending this as an FYI because I learned this the hard way :) "ISSUES WITH QUEUE_IF_NO_PATH FEATURE" [1]

[1] https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/htm...

Tony Pearce
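P.S. If a map is already wedged with all paths down and still queueing, the document above describes turning the queueing behaviour off at runtime; roughly (with "mpatha" as a placeholder map name):

$ sudo dmsetup message mpatha 0 "fail_if_no_path"   # stop queueing I/O on this map so outstanding I/O can complete or fail

Persistently, setting `no_path_retry fail` (or a bounded retry count) in the relevant device section of multipath.conf has a similar effect, though check what the storage vendor recommends before changing it.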
Hi Grant,

What os-brick version are you using? If you don't know, then what OpenStack release are you using?

Are you using containerized or non-containerized nova compute services?

Cheers,
Gorka.
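P.S. In case it helps answer that, the os-brick version is usually visible with pip from wherever nova-compute actually runs (the venv or container in an OpenStack-Ansible deployment), e.g.:

$ pip show os-brick            # or: pip freeze | grep os-brick
$ dpkg -l | grep os-brick      # if it came from distro packages instead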
participants (4)
- Gorka Eguileor
- Grant Morley
- Lee Yarwood
- Tony Pearce