New subject: Cluster fails when 2 controller nodes become down simultaneously | tripleo wallaby

1 Nov 2022

      I have configured a 3 node pcs cluster for openstack.
To test the HA, i issue the following commands:
iptables -A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT &&
iptables -A INPUT -p tcp -m state --state NEW -m tcp --dport 22 -j ACCEPT &&
iptables -A INPUT -p tcp -m state --state NEW -m tcp --dport 5016 -j ACCEPT
&&
iptables -A INPUT -p udp -m state --state NEW -m udp --dport 5016 -j ACCEPT
&&
iptables -A INPUT ! -i lo -j REJECT --reject-with icmp-host-prohibited &&
iptables -A OUTPUT -p tcp --sport 22 -j ACCEPT &&
iptables -A OUTPUT -p tcp --sport 5016 -j ACCEPT &&
iptables -A OUTPUT -p udp --sport 5016 -j ACCEPT &&
iptables -A OUTPUT ! -o lo -j REJECT --reject-with icmp-host-prohibited

When i issue iptables command on 1 node then it is fenced and forced to
reboot and cluster works fine.
But when i issue this on 2 of the controller nodes the resource bundles
fail and doesn't come back up.

[root@overcloud-controller-1 ~]# pcs status
Cluster name: tripleo_cluster
Cluster Summary:
  * Stack: corosync
  * Current DC: overcloud-controller-1 (version 2.1.2-4.el8-ada5c3b36e2) -
partition WITHOUT quorum
  * Last updated: Sat Oct 29 03:15:29 2022
  * Last change: Sat Oct 29 03:12:26 2022 by root via crm_resource on
overcloud-controller-1
  * 19 nodes configured
  * 68 resource instances configured

Node List:
  * Node overcloud-controller-0: UNCLEAN (offline)
  * Node overcloud-controller-2: UNCLEAN (offline)
  * Online: [ overcloud-controller-1 ]

Full List of Resources:
  * ip-172.25.201.91 (ocf::heartbeat:IPaddr2): Started
overcloud-controller-0 (UNCLEAN)
  * ip-172.25.201.150 (ocf::heartbeat:IPaddr2): Started
overcloud-controller-2 (UNCLEAN)
  * ip-172.25.201.206 (ocf::heartbeat:IPaddr2): Stopped
  * ip-172.25.201.250 (ocf::heartbeat:IPaddr2): Started
overcloud-controller-0 (UNCLEAN)
  * ip-172.25.202.50 (ocf::heartbeat:IPaddr2): Stopped
  * ip-172.25.202.90 (ocf::heartbeat:IPaddr2): Started
overcloud-controller-2 (UNCLEAN)
  * Container bundle set: haproxy-bundle [
172.25.201.68:8787/tripleomaster/openstack-haproxy:pcmklatest]:
    * haproxy-bundle-podman-0 (ocf::heartbeat:podman): Started
overcloud-controller-0 (UNCLEAN)
    * haproxy-bundle-podman-1 (ocf::heartbeat:podman): Stopped
    * haproxy-bundle-podman-2 (ocf::heartbeat:podman): Started
overcloud-controller-2 (UNCLEAN)
    * haproxy-bundle-podman-3 (ocf::heartbeat:podman): Stopped
  * Container bundle set: galera-bundle [
172.25.201.68:8787/tripleomaster/openstack-mariadb:pcmklatest]:
    * galera-bundle-0 (ocf::heartbeat:galera): Stopped
overcloud-controller-0 (UNCLEAN)
    * galera-bundle-1 (ocf::heartbeat:galera): Stopped
    * galera-bundle-2 (ocf::heartbeat:galera): Stopped
overcloud-controller-2 (UNCLEAN)
    * galera-bundle-3 (ocf::heartbeat:galera): Stopped
  * Container bundle set: redis-bundle [
172.25.201.68:8787/tripleomaster/openstack-redis:pcmklatest]:
    * redis-bundle-0 (ocf::heartbeat:redis): Stopped
    * redis-bundle-1 (ocf::heartbeat:redis): Stopped overcloud-controller-2
(UNCLEAN)
    * redis-bundle-2 (ocf::heartbeat:redis): Stopped overcloud-controller-0
(UNCLEAN)
    * redis-bundle-3 (ocf::heartbeat:redis): Stopped
  * Container bundle set: ovn-dbs-bundle [
172.25.201.68:8787/tripleomaster/openstack-ovn-northd:pcmklatest]:
    * ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Stopped
overcloud-controller-2 (UNCLEAN)
    * ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): Stopped
overcloud-controller-0 (UNCLEAN)
    * ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): Stopped
    * ovn-dbs-bundle-3 (ocf::ovn:ovndb-servers): Stopped
  * ip-172.25.201.208 (ocf::heartbeat:IPaddr2): Started
overcloud-controller-2 (UNCLEAN)
  * Container bundle: openstack-cinder-backup [
172.25.201.68:8787/tripleomaster/openstack-cinder-backup:pcmklatest]:
    * openstack-cinder-backup-podman-0 (ocf::heartbeat:podman): Started
overcloud-controller-0 (UNCLEAN)
  * Container bundle: openstack-cinder-volume [
172.25.201.68:8787/tripleomaster/openstack-cinder-volume:pcmklatest]:
    * openstack-cinder-volume-podman-0 (ocf::heartbeat:podman): Stopped
  * Container bundle set: rabbitmq-bundle [
172.25.201.68:8787/tripleomaster/openstack-rabbitmq:pcmklatest]:
    * rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Stopped
overcloud-controller-2 (UNCLEAN)
    * rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Stopped
overcloud-controller-0 (UNCLEAN)
    * rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Stopped
    * rabbitmq-bundle-3 (ocf::heartbeat:rabbitmq-cluster): Stopped
  * ip-172.25.204.250 (ocf::heartbeat:IPaddr2): Started
overcloud-controller-0 (UNCLEAN)
  * ceph-nfs (systemd:ceph-nfs@pacemaker): Started overcloud-controller-0
(UNCLEAN)
  * Container bundle: openstack-manila-share [
172.25.201.68:8787/tripleomaster/openstack-manila-share:pcmklatest]:
    * openstack-manila-share-podman-0 (ocf::heartbeat:podman): Started
overcloud-controller-0 (UNCLEAN)
  * stonith-fence_ipmilan-48d539a11820 (stonith:fence_ipmilan): Stopped
  * stonith-fence_ipmilan-48d539a1188c (stonith:fence_ipmilan): Started
overcloud-controller-2 (UNCLEAN)
  * stonith-fence_ipmilan-246e96349068 (stonith:fence_ipmilan): Started
overcloud-controller-2 (UNCLEAN)
  * stonith-fence_ipmilan-246e96348d30 (stonith:fence_ipmilan): Stopped

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

PCS requires more than half the nodes to be alive for the cluster to work.
To fix this step I issued a command:*pcs no-quorum-policy=ignore.*

And now the PCS cluster keeps on running even when there is no quorum.

Now the issue i have is the mariadb-bundle becomes slave and dosen't get
promoted to master.

Can you please suggest a proper workaround when more than half nodes go
down and my cloud will be still running.

With regards,

Swogat Pradhan

(no subject)

Swogat Pradhan

Swogat Pradhan

Harald Jensas

tags

participants (2)