Cluster fails when 2 controller nodes become down simultaneously | tripleo wallaby

Swogat Pradhan swogatpradhan22 at gmail.com
Tue Nov 1 10:01:35 UTC 2022


Hi,
Updating the subject.

On Tue, Nov 1, 2022 at 12:26 PM Swogat Pradhan <swogatpradhan22 at gmail.com>
wrote:

> I have configured a three-node pcs cluster for openstack.
> To test HA, I run the following commands on a node (they drop all traffic
> except SSH and port 5016, simulating a node failure):
> iptables -A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT &&
> iptables -A INPUT -p tcp -m state --state NEW -m tcp --dport 22 -j ACCEPT
> &&
> iptables -A INPUT -p tcp -m state --state NEW -m tcp --dport 5016 -j
> ACCEPT &&
> iptables -A INPUT -p udp -m state --state NEW -m udp --dport 5016 -j
> ACCEPT &&
> iptables -A INPUT ! -i lo -j REJECT --reject-with icmp-host-prohibited &&
> iptables -A OUTPUT -p tcp --sport 22 -j ACCEPT &&
> iptables -A OUTPUT -p tcp --sport 5016 -j ACCEPT &&
> iptables -A OUTPUT -p udp --sport 5016 -j ACCEPT &&
> iptables -A OUTPUT ! -o lo -j REJECT --reject-with icmp-host-prohibited
>
> When I run these iptables commands on one node, that node is fenced and
> forced to reboot, and the cluster keeps working fine.
> But when I run them on two of the controller nodes at once, the resource
> bundles fail and don't come back up.
>
> [root@overcloud-controller-1 ~]# pcs status
> Cluster name: tripleo_cluster
> Cluster Summary:
>   * Stack: corosync
>   * Current DC: overcloud-controller-1 (version 2.1.2-4.el8-ada5c3b36e2) -
> partition WITHOUT quorum
>   * Last updated: Sat Oct 29 03:15:29 2022
>   * Last change: Sat Oct 29 03:12:26 2022 by root via crm_resource on
> overcloud-controller-1
>   * 19 nodes configured
>   * 68 resource instances configured
>
> Node List:
>   * Node overcloud-controller-0: UNCLEAN (offline)
>   * Node overcloud-controller-2: UNCLEAN (offline)
>   * Online: [ overcloud-controller-1 ]
>
> Full List of Resources:
>   * ip-172.25.201.91 (ocf::heartbeat:IPaddr2): Started
> overcloud-controller-0 (UNCLEAN)
>   * ip-172.25.201.150 (ocf::heartbeat:IPaddr2): Started
> overcloud-controller-2 (UNCLEAN)
>   * ip-172.25.201.206 (ocf::heartbeat:IPaddr2): Stopped
>   * ip-172.25.201.250 (ocf::heartbeat:IPaddr2): Started
> overcloud-controller-0 (UNCLEAN)
>   * ip-172.25.202.50 (ocf::heartbeat:IPaddr2): Stopped
>   * ip-172.25.202.90 (ocf::heartbeat:IPaddr2): Started
> overcloud-controller-2 (UNCLEAN)
>   * Container bundle set: haproxy-bundle [
> 172.25.201.68:8787/tripleomaster/openstack-haproxy:pcmklatest]:
>     * haproxy-bundle-podman-0 (ocf::heartbeat:podman): Started
> overcloud-controller-0 (UNCLEAN)
>     * haproxy-bundle-podman-1 (ocf::heartbeat:podman): Stopped
>     * haproxy-bundle-podman-2 (ocf::heartbeat:podman): Started
> overcloud-controller-2 (UNCLEAN)
>     * haproxy-bundle-podman-3 (ocf::heartbeat:podman): Stopped
>   * Container bundle set: galera-bundle [
> 172.25.201.68:8787/tripleomaster/openstack-mariadb:pcmklatest]:
>     * galera-bundle-0 (ocf::heartbeat:galera): Stopped
> overcloud-controller-0 (UNCLEAN)
>     * galera-bundle-1 (ocf::heartbeat:galera): Stopped
>     * galera-bundle-2 (ocf::heartbeat:galera): Stopped
> overcloud-controller-2 (UNCLEAN)
>     * galera-bundle-3 (ocf::heartbeat:galera): Stopped
>   * Container bundle set: redis-bundle [
> 172.25.201.68:8787/tripleomaster/openstack-redis:pcmklatest]:
>     * redis-bundle-0 (ocf::heartbeat:redis): Stopped
>     * redis-bundle-1 (ocf::heartbeat:redis): Stopped
> overcloud-controller-2 (UNCLEAN)
>     * redis-bundle-2 (ocf::heartbeat:redis): Stopped
> overcloud-controller-0 (UNCLEAN)
>     * redis-bundle-3 (ocf::heartbeat:redis): Stopped
>   * Container bundle set: ovn-dbs-bundle [
> 172.25.201.68:8787/tripleomaster/openstack-ovn-northd:pcmklatest]:
>     * ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Stopped
> overcloud-controller-2 (UNCLEAN)
>     * ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): Stopped
> overcloud-controller-0 (UNCLEAN)
>     * ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): Stopped
>     * ovn-dbs-bundle-3 (ocf::ovn:ovndb-servers): Stopped
>   * ip-172.25.201.208 (ocf::heartbeat:IPaddr2): Started
> overcloud-controller-2 (UNCLEAN)
>   * Container bundle: openstack-cinder-backup [
> 172.25.201.68:8787/tripleomaster/openstack-cinder-backup:pcmklatest]:
>     * openstack-cinder-backup-podman-0 (ocf::heartbeat:podman): Started
> overcloud-controller-0 (UNCLEAN)
>   * Container bundle: openstack-cinder-volume [
> 172.25.201.68:8787/tripleomaster/openstack-cinder-volume:pcmklatest]:
>     * openstack-cinder-volume-podman-0 (ocf::heartbeat:podman): Stopped
>   * Container bundle set: rabbitmq-bundle [
> 172.25.201.68:8787/tripleomaster/openstack-rabbitmq:pcmklatest]:
>     * rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Stopped
> overcloud-controller-2 (UNCLEAN)
>     * rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Stopped
> overcloud-controller-0 (UNCLEAN)
>     * rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Stopped
>     * rabbitmq-bundle-3 (ocf::heartbeat:rabbitmq-cluster): Stopped
>   * ip-172.25.204.250 (ocf::heartbeat:IPaddr2): Started
> overcloud-controller-0 (UNCLEAN)
>   * ceph-nfs (systemd:ceph-nfs@pacemaker): Started overcloud-controller-0
> (UNCLEAN)
>   * Container bundle: openstack-manila-share [
> 172.25.201.68:8787/tripleomaster/openstack-manila-share:pcmklatest]:
>     * openstack-manila-share-podman-0 (ocf::heartbeat:podman): Started
> overcloud-controller-0 (UNCLEAN)
>   * stonith-fence_ipmilan-48d539a11820 (stonith:fence_ipmilan): Stopped
>   * stonith-fence_ipmilan-48d539a1188c (stonith:fence_ipmilan): Started
> overcloud-controller-2 (UNCLEAN)
>   * stonith-fence_ipmilan-246e96349068 (stonith:fence_ipmilan): Started
> overcloud-controller-2 (UNCLEAN)
>   * stonith-fence_ipmilan-246e96348d30 (stonith:fence_ipmilan): Stopped
>
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled
>
> Pacemaker requires more than half the nodes to be alive (quorum) for the
> cluster to work. To get past this I set the cluster property:
> pcs property set no-quorum-policy=ignore
>
> And now the PCS cluster keeps on running even when there is no quorum.
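For context, corosync treats quorum as a strict majority of the voting nodes, so with three controllers, losing two drops below quorum, which is why pcs reports "partition WITHOUT quorum". A quick sanity check of the arithmetic (plain shell, not cluster state):

```shell
# Corosync quorum threshold: a strict majority of n voting nodes,
# i.e. quorum = n/2 + 1 (integer division).
quorum() { echo $(( $1 / 2 + 1 )); }

quorum 3   # prints 2: with 3 controllers, 2 must stay up
quorum 5   # prints 3: a 5-node cluster survives 2 failures
```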
>
> Now the issue I have is that the mariadb (galera) bundle stays in the
> slave role and doesn't get promoted to master.
>
> Can you please suggest a proper workaround so that the cloud keeps
> running even when more than half of the controller nodes go down?
>
>
> With regards,
>
> Swogat Pradhan
>

