I think you should then be able to make the change persistent in your deployment by defining the masakari_monitors_conf_overrides variable.
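A minimal sketch of such an override, for the disable_ipmi_check setting under [host] mentioned in the message below. It assumes the role follows the usual OpenStack-Ansible config_template convention (top-level keys are config file sections) and that the override is placed in /etc/openstack_deploy/user_variables.yml; both are assumptions, not something confirmed in this thread:

```yaml
# /etc/openstack_deploy/user_variables.yml (assumed location)
# Assumption: top-level keys map to sections of masakarimonitors.conf.
masakari_monitors_conf_overrides:
  host:
    # Skip the IPMI resource-agent check in hostmonitor.
    disable_ipmi_check: true
```

Re-running the Masakari playbook should then render the option into masakarimonitors.conf on the compute nodes, so the manual edit is not lost on the next deployment run. Treat this as unverified against the current role.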
Hi Dmitriy,

I got the following error in the masakari-hostmonitor.service log:

ERROR masakarimonitors.hostmonitor.host_handler.handle_host [-] Failed to get params of ipmi RA.

To fix this, I set disable_ipmi_check to True in masakarimonitors.conf under [host] on all compute nodes:

#disable_ipmi_check = False
disable_ipmi_check = True

My issue is fixed and VMs are now being evacuated.

Regards,
Danish Khan

On Mon, Feb 10, 2025 at 5:46 PM Dmitriy Rabotyagov <noonedeadpunk@gmail.com> wrote:

Hey,
Sorry, I haven't had Masakari running in production for a while now,
so I'm not sure about issues which could arise during the upgrade.
From what I see in the output, do you expect failover to happen based on
the introspectiveinstancemonitor? Or is the concern that the playbooks
misconfigure the Introspective Instance Monitor and do not add it to the
group allowed to access the libvirt socket?
hostmonitor is not triggering failure right now, as all nodes seem to
be present in the cluster, which looks correct to me.
Regarding the notification: it was ignored. So my suggestion would be to
check the masakari-engine logs to see why this event was ignored.
On Mon, Feb 10, 2025 at 11:40, Danish Khan <danish52.jmi@gmail.com> wrote:
>
> Dear All,
>
> The Masakari service seems to be working fine in one of my clusters, apart from a few warnings, but it is still ignoring all host failures:
>
> Feb 10 10:34:04 compute7 masakari-introspectiveinstancemonitor[2379]: 2025-02-10 10:34:04.965 2379 WARNING masakarimonitors.introspectiveinstancemonitor.qemu_utils [-] Failed to connect socket to '/var/run/libvirt/libvirt-sock': Permission denied: libvirt.libvirtError: Failed to connect socket to '/var/run/libvirt/libvirt-sock': Permission denied
>
> crm status shows no resources:
> root@compute7:~# crm status
> Cluster Summary:
> * Stack: corosync
> * Current DC: compute6 (version 2.0.3-4b1f869f0f) - partition with quorum
> * Last updated: Mon Feb 10 10:24:49 2025
> * Last change: Fri Feb 7 11:05:54 2025 by hacluster via crmd on compute7
> * 6 nodes configured
> * 0 resource instances configured
>
> Node List:
> * Online: [ compute2 compute3 compute4 compute5 compute6 compute7 ]
>
> Full List of Resources:
> * No resources
>
> From the Horizon instance-ha notification:
> Notification UUID: 0e745a900479
> Source Host UUID: 21c0d6d8e9f1
> Type: COMPUTE_HOST
> Status: ignored
> Generated Time: Feb. 10, 2025, 3:35 p.m.
> Created At: Feb. 10, 2025, 3:35 p.m.
> Updated At: Feb. 10, 2025, 3:35 p.m.
> Payload: {'event': 'STOPPED', 'cluster_status': 'OFFLINE', 'host_status': 'UNKNOWN'}
>
> Regards,
> Danish Khan