I have configured a 3-node pcs cluster for OpenStack. To test the HA, I issue the following commands:

iptables -A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT && \
iptables -A INPUT -p tcp -m state --state NEW -m tcp --dport 22 -j ACCEPT && \
iptables -A INPUT -p tcp -m state --state NEW -m tcp --dport 5016 -j ACCEPT && \
iptables -A INPUT -p udp -m state --state NEW -m udp --dport 5016 -j ACCEPT && \
iptables -A INPUT ! -i lo -j REJECT --reject-with icmp-host-prohibited && \
iptables -A OUTPUT -p tcp --sport 22 -j ACCEPT && \
iptables -A OUTPUT -p tcp --sport 5016 -j ACCEPT && \
iptables -A OUTPUT -p udp --sport 5016 -j ACCEPT && \
iptables -A OUTPUT ! -o lo -j REJECT --reject-with icmp-host-prohibited

When I issue these iptables commands on one node, that node is fenced and forced to reboot, and the cluster keeps working fine. But when I issue them on two of the controller nodes, the resource bundles fail and do not come back up.

[root@overcloud-controller-1 ~]# pcs status
Cluster name: tripleo_cluster
Cluster Summary:
  * Stack: corosync
  * Current DC: overcloud-controller-1 (version 2.1.2-4.el8-ada5c3b36e2) - partition WITHOUT quorum
  * Last updated: Sat Oct 29 03:15:29 2022
  * Last change: Sat Oct 29 03:12:26 2022 by root via crm_resource on overcloud-controller-1
  * 19 nodes configured
  * 68 resource instances configured

Node List:
  * Node overcloud-controller-0: UNCLEAN (offline)
  * Node overcloud-controller-2: UNCLEAN (offline)
  * Online: [ overcloud-controller-1 ]

Full List of Resources:
  * ip-172.25.201.91 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0 (UNCLEAN)
  * ip-172.25.201.150 (ocf::heartbeat:IPaddr2): Started overcloud-controller-2 (UNCLEAN)
  * ip-172.25.201.206 (ocf::heartbeat:IPaddr2): Stopped
  * ip-172.25.201.250 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0 (UNCLEAN)
  * ip-172.25.202.50 (ocf::heartbeat:IPaddr2): Stopped
  * ip-172.25.202.90 (ocf::heartbeat:IPaddr2): Started overcloud-controller-2 (UNCLEAN)
  * Container bundle set: haproxy-bundle [172.25.201.68:8787/tripleomaster/openstack-haproxy:pcmklatest]:
    * haproxy-bundle-podman-0 (ocf::heartbeat:podman): Started overcloud-controller-0 (UNCLEAN)
    * haproxy-bundle-podman-1 (ocf::heartbeat:podman): Stopped
    * haproxy-bundle-podman-2 (ocf::heartbeat:podman): Started overcloud-controller-2 (UNCLEAN)
    * haproxy-bundle-podman-3 (ocf::heartbeat:podman): Stopped
  * Container bundle set: galera-bundle [172.25.201.68:8787/tripleomaster/openstack-mariadb:pcmklatest]:
    * galera-bundle-0 (ocf::heartbeat:galera): Stopped overcloud-controller-0 (UNCLEAN)
    * galera-bundle-1 (ocf::heartbeat:galera): Stopped
    * galera-bundle-2 (ocf::heartbeat:galera): Stopped overcloud-controller-2 (UNCLEAN)
    * galera-bundle-3 (ocf::heartbeat:galera): Stopped
  * Container bundle set: redis-bundle [172.25.201.68:8787/tripleomaster/openstack-redis:pcmklatest]:
    * redis-bundle-0 (ocf::heartbeat:redis): Stopped
    * redis-bundle-1 (ocf::heartbeat:redis): Stopped overcloud-controller-2 (UNCLEAN)
    * redis-bundle-2 (ocf::heartbeat:redis): Stopped overcloud-controller-0 (UNCLEAN)
    * redis-bundle-3 (ocf::heartbeat:redis): Stopped
  * Container bundle set: ovn-dbs-bundle [172.25.201.68:8787/tripleomaster/openstack-ovn-northd:pcmklatest]:
    * ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Stopped overcloud-controller-2 (UNCLEAN)
    * ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): Stopped overcloud-controller-0 (UNCLEAN)
    * ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): Stopped
    * ovn-dbs-bundle-3 (ocf::ovn:ovndb-servers): Stopped
  * ip-172.25.201.208 (ocf::heartbeat:IPaddr2): Started overcloud-controller-2 (UNCLEAN)
  * Container bundle: openstack-cinder-backup [172.25.201.68:8787/tripleomaster/openstack-cinder-backup:pcmklatest]:
    * openstack-cinder-backup-podman-0 (ocf::heartbeat:podman): Started overcloud-controller-0 (UNCLEAN)
  * Container bundle: openstack-cinder-volume [172.25.201.68:8787/tripleomaster/openstack-cinder-volume:pcmklatest]:
    * openstack-cinder-volume-podman-0 (ocf::heartbeat:podman): Stopped
  * Container bundle set: rabbitmq-bundle [172.25.201.68:8787/tripleomaster/openstack-rabbitmq:pcmklatest]:
    * rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Stopped overcloud-controller-2 (UNCLEAN)
    * rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Stopped overcloud-controller-0 (UNCLEAN)
    * rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Stopped
    * rabbitmq-bundle-3 (ocf::heartbeat:rabbitmq-cluster): Stopped
  * ip-172.25.204.250 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0 (UNCLEAN)
  * ceph-nfs (systemd:ceph-nfs@pacemaker): Started overcloud-controller-0 (UNCLEAN)
  * Container bundle: openstack-manila-share [172.25.201.68:8787/tripleomaster/openstack-manila-share:pcmklatest]:
    * openstack-manila-share-podman-0 (ocf::heartbeat:podman): Started overcloud-controller-0 (UNCLEAN)
  * stonith-fence_ipmilan-48d539a11820 (stonith:fence_ipmilan): Stopped
  * stonith-fence_ipmilan-48d539a1188c (stonith:fence_ipmilan): Started overcloud-controller-2 (UNCLEAN)
  * stonith-fence_ipmilan-246e96349068 (stonith:fence_ipmilan): Started overcloud-controller-2 (UNCLEAN)
  * stonith-fence_ipmilan-246e96348d30 (stonith:fence_ipmilan): Stopped

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

Pacemaker requires more than half of the nodes to be alive for the cluster to keep working. To get past this I issued the command:

pcs property set no-quorum-policy=ignore

Now the cluster keeps running even when there is no quorum. The issue I have now is that the mariadb (galera) bundle comes up as a slave and doesn't get promoted to master.

Can you please suggest a proper workaround so that my cloud keeps running even when more than half of the nodes go down?

With regards,
Swogat Pradhan
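P.S. For reference, a minimal sketch of how I reset a node after the test and verify the cluster state afterwards; the flush assumes the default chain policy is ACCEPT and that no other firewall rules on the node need to be preserved:

# Undo the isolation rules added for the test
iptables -F INPUT
iptables -F OUTPUT

# Check membership and quorum state from this node
corosync-quorumtool -s

# The property change mentioned above, plus a query to verify it took effect
pcs property set no-quorum-policy=ignore
crm_attribute -G -n no-quorum-policy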
Hi,
Updating the subject.

On Tue, Nov 1, 2022 at 12:26 PM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
> I have configured a 3-node pcs cluster for OpenStack. [...]
On 11/1/22 11:01, Swogat Pradhan wrote:
> Hi,
> Updating the subject.
>
> On Tue, Nov 1, 2022 at 12:26 PM Swogat Pradhan <swogatpradhan22@gmail.com> wrote:
> [...]
> When I issue the iptables commands on one node, that node is fenced and forced to reboot, and the cluster keeps working fine. But when I issue them on two of the controller nodes, the resource bundles fail and do not come back up.
This is expected behavior. In a cluster you need a majority quorum to be able to make the decision to fence a failing node and to keep services running on the nodes that hold that majority. When you disconnect two nodes from the cluster with firewall rules, none of the 3 nodes can talk to any other node, i.e. they are all isolated, with no knowledge of the status of their two peer cluster nodes. Each node can only assume that it is the one that has been isolated and that the two other nodes are still operational. To ensure data integrity, an isolated node must therefore stop its services immediately.

Imagine if all three nodes, isolated from each other, were still reachable from the load balancer: requests would keep coming in, and each node would continue to service them and write data. With each node servicing roughly a third of the requests, the result would be inconsistent data stores on all three nodes, a situation that would be practically impossible to recover from.

--
Harald
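P.S. To put numbers on it, here is a quick sketch of the votequorum arithmetic for a 3-node cluster (illustrative; run corosync-quorumtool on your own nodes for the actual values):

# quorum = floor(expected_votes / 2) + 1
# 3 controller nodes: expected_votes = 3, so quorum = 2 votes
# With two nodes firewalled off, every node sees only its own vote,
# and 1 < 2 means each partition is inquorate and stops its resources.
corosync-quorumtool -s    # an isolated node reports "Quorate: No"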