[openstack-ansible][masakari][yoga] Masakari ignoring host failure
Dear All,

Although the Masakari service seems to be working fine, with a few warnings, in one of my clusters, it is still ignoring all host failures:

    Feb 10 10:34:04 compute7 masakari-introspectiveinstancemonitor[2379]: 2025-02-10 10:34:04.965 2379 WARNING masakarimonitors.introspectiveinstancemonitor.qemu_utils [-] Failed to connect socket to '/var/run/libvirt/libvirt-sock': Permission denied: libvirt.libvirtError: Failed to connect socket to '/var/run/libvirt/libvirt-sock': Permission denied

crm status shows no resources:

    root@compute7:~# crm status
    Cluster Summary:
      * Stack: corosync
      * Current DC: compute6 (version 2.0.3-4b1f869f0f) - partition with quorum
      * Last updated: Mon Feb 10 10:24:49 2025
      * Last change: Fri Feb 7 11:05:54 2025 by hacluster via crmd on compute7
      * 6 nodes configured
      * 0 resource instances configured

    Node List:
      * Online: [ compute2 compute3 compute4 compute5 compute6 compute7 ]

    Full List of Resources:
      * No resources

From the Horizon instance-ha notification:

    Notification UUID: 0e745a900479
    Source Host UUID:  21c0d6d8e9f1
    Type:              COMPUTE_HOST
    Status:            ignored
    Generated Time:    Feb. 10, 2025, 3:35 p.m.
    Created At:        Feb. 10, 2025, 3:35 p.m.
    Updated At:        Feb. 10, 2025, 3:35 p.m.
    Payload:           {'event': 'STOPPED', 'cluster_status': 'OFFLINE', 'host_status': 'UNKNOWN'}

Regards,
Danish Khan
In reply to Danish Khan <danish52.jmi@gmail.com>, Mon, 10 Feb 2025, 11:40:

Hey,

Sorry, I haven't had Masakari running in production for a while now, so I'm not sure about issues that could arise during the upgrade.

From what I see in the output, do you expect failover to happen based on the introspectiveinstancemonitor? Or is that due to the playbooks misconfiguring the Introspective Instance Monitor and not adding it to the group allowed to access the libvirt socket?

hostmonitor is not triggering a failure right now, as all nodes seem to be present in the cluster, which looks correct to me.

Regarding the notification: it was ignored. So my suggestion would be to check the masakari-engine logs to see why it ignored this event.
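The libvirt socket question above comes down to whether the monitor's service account has write permission on /var/run/libvirt/libvirt-sock, since connecting to a Unix socket needs write access to the socket file. A minimal Python sketch of that check follows. It only looks at the classic owner/group/other permission bits, so ACLs, AppArmor/SELinux and directory permissions are out of scope. The helper names `may_connect` and `check_user` are hypothetical, and the account the monitor runs as is not stated in this thread.

```python
import os
import pwd
import stat


def may_connect(path, uid, gids):
    """Return True if the permission bits on `path` give (uid, gids) write access.

    Follows the usual POSIX order: owner class first, then group, then other.
    """
    st = os.stat(path)
    if uid == 0:
        return True  # root bypasses the permission bits
    if uid == st.st_uid:
        return bool(st.st_mode & stat.S_IWUSR)
    if st.st_gid in gids:
        return bool(st.st_mode & stat.S_IWGRP)
    return bool(st.st_mode & stat.S_IWOTH)


def check_user(username, path="/var/run/libvirt/libvirt-sock"):
    """Look up `username` and test it against the socket path from the log."""
    user = pwd.getpwnam(username)
    # All group ids the user belongs to, including the primary group.
    gids = os.getgrouplist(username, user.pw_gid)
    return may_connect(path, user.pw_uid, gids)
```

Calling `check_user("<service user>")` on a compute node would return False in the situation the warning describes, for example when the account is missing from the group that owns the socket.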
In reply to Dmitriy Rabotyagov <noonedeadpunk@gmail.com>, Mon, Feb 10, 2025, 5:46 PM:

Hi Dmitriy,

I found the error below in the masakari-hostmonitor.service log:

    ERROR masakarimonitors.hostmonitor.host_handler.handle_host [-] Failed to get params of ipmi RA.

To fix this, I set disable_ipmi_check to True under [host] in masakarimonitors.conf on all compute nodes:

    #disable_ipmi_check = False
    disable_ipmi_check = True

My issue is fixed now and VMs are being evacuated.

Regards,
Danish Khan
In reply to Danish Khan <danish52.jmi@gmail.com>, Mon, 10 Feb 2025, 20:41:

I think you should then be able to make that change persistent in your deployment by defining the masakari_monitors_conf_overrides variable.
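One way to picture the override suggested above is as a mapping of section to option that gets merged into masakarimonitors.conf. The Python sketch below mimics such a merge with the standard configparser module. It is not how openstack-ansible applies masakari_monitors_conf_overrides internally, and the mapping shape (section name, then option name) is an assumption based on how other openstack-ansible overrides variables are usually written. `apply_overrides` is a hypothetical helper.

```python
import configparser
import io


def apply_overrides(conf_text, overrides):
    """Merge {section: {option: value}} into INI text and return the result.

    Note: configparser drops comments when writing, so this is only an
    illustration of the merge, not a safe in-place editor for real files.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names exactly as written
    parser.read_string(conf_text)
    for section, options in overrides.items():
        if section != "DEFAULT" and not parser.has_section(section):
            parser.add_section(section)
        for option, value in options.items():
            parser.set(section, option, str(value))
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


# The setting reported in this thread, expressed as an override mapping.
overrides = {"host": {"disable_ipmi_check": True}}
current = "[host]\n#disable_ipmi_check = False\n"
print(apply_overrides(current, overrides))
```

Keeping the setting in a deployment variable rather than editing the file on each compute node means it should survive the next playbook run, which would otherwise regenerate the file.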
participants (2)
- Danish Khan
- Dmitriy Rabotyagov