Cluster fails when 2 controller nodes go down simultaneously | tripleo wallaby

Harald Jensas hjensas at redhat.com
Thu Nov 3 15:00:08 UTC 2022


On 11/1/22 11:01, Swogat Pradhan wrote:
> Hi,
> Updating the subject.
> 
> On Tue, Nov 1, 2022 at 12:26 PM Swogat Pradhan 
> <swogatpradhan22 at gmail.com <mailto:swogatpradhan22 at gmail.com>> wrote:
> 
>     I have configured a 3-node pcs cluster for OpenStack.
>     To test HA, I issue the following commands:
>     iptables -A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT &&
>     iptables -A INPUT -p tcp -m state --state NEW -m tcp --dport 22 -j
>     ACCEPT &&
>     iptables -A INPUT -p tcp -m state --state NEW -m tcp --dport 5016 -j
>     ACCEPT &&
>     iptables -A INPUT -p udp -m state --state NEW -m udp --dport 5016 -j
>     ACCEPT &&
>     iptables -A INPUT ! -i lo -j REJECT --reject-with
>     icmp-host-prohibited &&
>     iptables -A OUTPUT -p tcp --sport 22 -j ACCEPT &&
>     iptables -A OUTPUT -p tcp --sport 5016 -j ACCEPT &&
>     iptables -A OUTPUT -p udp --sport 5016 -j ACCEPT &&
>     iptables -A OUTPUT ! -o lo -j REJECT --reject-with icmp-host-prohibited
> 
>     When I issue the iptables commands on 1 node, that node is fenced
>     and forced to reboot, and the cluster keeps working fine.
>     But when I issue them on 2 of the controller nodes, the resource
>     bundles fail and don't come back up.
>

This is expected behavior.

In a cluster you need a majority quorum to be able to make the decision
to fence a failing node and to keep services running on the nodes that
hold that majority.
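
A quick way to see this from any node is to query the quorum state
(standard corosync/pcs commands; the exact output fields vary a bit
between versions):

   corosync-quorumtool -s   # "Quorate: Yes/No" plus expected/total votes
   pcs status               # node membership and resource bundle state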

When you disconnect two nodes from the cluster with firewall rules, none
of the 3 nodes can talk to any other node, i.e. they are all isolated,
with no knowledge of the status of their two peer cluster nodes.

Each node can only assume that it is the one that has been isolated and
that the two other nodes are still operational. To ensure data
integrity, any isolated node must stop its services immediately.
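
What a node does when it loses quorum is governed by Pacemaker's
no-quorum-policy cluster property (the Pacemaker default is "stop";
check what your deployment actually sets rather than assuming):

   pcs property show no-quorum-policy   # "pcs property config" on newer pcs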

Imagine all three nodes isolated from each other but still reachable
from the load balancer. Requests would come in, and each node would
continue to service them and write data. With each node servicing
roughly 1/3 of the requests, the result would be inconsistent data
stores on all three nodes, a situation that would be practically
impossible to recover from.
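
For completeness, to recover from this test state you have to remove
the blocking rules on both nodes and bring the cluster back yourself
(a sketch; only flush the chains like this if you have no other
iptables rules you need to keep):

   iptables -F INPUT && iptables -F OUTPUT
   pcs cluster start --all
   pcs resource cleanup      # clear the failed bundle/resource state

A gentler way to exercise failover one node at a time is
"pcs node standby <node>" followed by "pcs node unstandby <node>",
which moves resources off the node and back without touching the
network at all.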


--
Harald



