Re: [kolla] [train] haproxy and controller restart causes user impact

12 May 2023

Hi Albert,

How is your haproxy placement controlled, by something like pacemaker or
similar? I would always do a failover when I know about an interruption
in advance (maintenance window); that should speed things up for
clients. We have a pacemaker-controlled HA control plane, and it takes
longer for pacemaker to realize that a resource is gone if I just
reboot a server without failing over first. I have no benchmarks,
though. There's always a risk of losing a couple of requests during the
failover, but we haven't had any complaints so far; I believe most of
the components try to resend lost messages. In one of our customers'
clusters with many resources (they also use terraform) I haven't seen
issues during a regular maintenance window. When they had a DNS outage
a few months back it resulted in a mess and manual cleanup was
necessary, but the regular failovers seem to work just fine.
And I don't see rabbitmq issues after rebooting a server either;
usually the haproxy (and virtual IP) failover suffices to prevent
interruptions.
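
To illustrate what "failing over first" could look like on a
pacemaker-managed setup, here is a minimal sketch. It assumes the crm
shell (crmsh) is installed; on pcs-based systems the rough equivalents
would be "pcs node standby" and "pcs node unstandby". The node name
and the helper functions are hypothetical, not something from our
deployment.

```python
import subprocess
from typing import Callable, List, Optional, Sequence

Command = List[str]


def drain_commands(node: str) -> List[Command]:
    """Commands to run BEFORE rebooting a node (assumes crmsh)."""
    return [
        # Put the node in standby so pacemaker moves its resources
        # (e.g. haproxy and the virtual IP) to another node.
        ["crm", "node", "standby", node],
        # One-shot cluster status, to check where the resources ended up.
        ["crm_mon", "-1"],
    ]


def restore_commands(node: str) -> List[Command]:
    """Commands to run AFTER the node is back up (assumes crmsh)."""
    return [
        # Allow the node to host resources again.
        ["crm", "node", "online", node],
        ["crm_mon", "-1"],
    ]


def run_all(
    commands: Sequence[Command],
    runner: Optional[Callable[[Command], object]] = None,
) -> None:
    """Run each command in order; stop at the first failure."""
    if runner is None:
        # Default: execute for real and raise if a command fails.
        runner = lambda cmd: subprocess.run(cmd, check=True)
    for cmd in commands:
        runner(cmd)
```

On a cluster node one would call run_all(drain_commands("controller01")),
reboot the server, and then call run_all(restore_commands("controller01")).
Passing runner=print instead only prints the commands without touching
the cluster.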

Regards,
Eugen

Quoting Satish Patel <satish.txt@gmail.com>:
...
Are you running your stack on top of KVM virtual machines? How many
controller nodes do you have? It is mostly RabbitMQ that causes issues if
you restart controller nodes.
On Thu, May 11, 2023 at 8:34 AM Albert Braden <ozzzo@yahoo.com> wrote:
...
We have our haproxy and controller nodes on KVM hosts. When those KVM
hosts are restarted, customers who are building or deleting VMs see impact.
VMs may go into error status, fail to get DNS records, fail to delete, etc.
The obvious reason is that traffic being routed to the haproxy
on the restarting KVM host is lost. If we manually fail over haproxy before
restarting the KVM, will that be sufficient to stop traffic being lost, or
do we also need to do something with the controller?

Eugen Block