[kolla] [train] haproxy and controller restart causes user impact

Albert Braden ozzzo at yahoo.com
Tue May 16 17:29:04 UTC 2023


What's the recommended method for rebooting controllers? Do we need to use the "remove from cluster" and "add to cluster" procedures, or is there a better way?

https://docs.openstack.org/kolla-ansible/train/user/adding-and-removing-hosts.html
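A lighter-weight sequence might look something like this (just a sketch,
assuming a multinode inventory at /etc/kolla/multinode and a controller
named control01; I haven't verified the exact flags on Train):

    # stop the kolla containers on just the one controller
    kolla-ansible -i /etc/kolla/multinode stop --yes-i-really-really-mean-it --limit control01
    # reboot the host, then bring its services back
    kolla-ansible -i /etc/kolla/multinode deploy --limit control01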
On Friday, May 12, 2023, 03:04:26 PM EDT, Albert Braden <ozzzo at yahoo.com> wrote:
 
We use keepalived and exabgp to manage failover for haproxy. That works, but it takes a few minutes, and during those few minutes customers experience impact. We tell them not to build/delete VMs during patching, but they still do, and then complain about the failures.

We're planning to experiment with adding a "manual" haproxy failover to our patching automation, but I'm wondering whether anything on the controller needs to be failed over or disabled before rebooting the KVM host. I looked at the "remove from cluster" and "add to cluster" procedures, but that seems unnecessarily cumbersome for a reboot.
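The manual failover we have in mind would be roughly this (a sketch,
assuming keepalived holds the VIP and runs as the standard kolla
"keepalived" container; <internal_vip_address> is a placeholder):

    # on the controller about to be rebooted: release the VIP so a peer takes over
    docker stop keepalived
    # confirm the VIP is no longer held locally before rebooting
    ip addr show | grep -F <internal_vip_address>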
On Friday, May 12, 2023, 03:42:42 AM EDT, Eugen Block <eblock at nde.ag> wrote:
 
 Hi Albert,

how is your haproxy placement controlled, something like pacemaker or
similar? I would always do a failover when I'm aware of interruptions
(maintenance window); that should speed things up for clients. We have
a pacemaker-controlled HA control plane, and it takes more time for
pacemaker to realize that the resource is gone if I just reboot a
server without failing over first. I have no benchmarks, though.
There's always a risk of losing a couple of requests during the
failover, but we haven't had complaints yet; I believe most of the
components try to resend the lost messages. In one of our customers'
clusters with many resources (they also use terraform) I haven't seen
issues during a regular maintenance window. When they had a DNS outage
a few months back it resulted in a mess and manual cleanup was
necessary, but the regular failovers seem to work just fine.
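For example, before a maintenance window we put the node in standby so
pacemaker migrates the resources off it proactively (a sketch using pcs
and a made-up node name; older pcs releases use "pcs cluster standby"
instead):

    pcs node standby controller1    # resources move to the peers
    # ... patch and reboot the server ...
    pcs node unstandby controller1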
And I don't see rabbitmq issues either after rebooting a server;
usually the haproxy (and virtual IP) failover suffices to prevent
interruptions.
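To double-check rabbitmq after a reboot, something like this should
show all controllers back in the cluster (assuming a kolla-style
deployment where rabbitmq runs in a container named "rabbitmq"):

    docker exec rabbitmq rabbitmqctl cluster_status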

Regards,
Eugen

Zitat von Satish Patel <satish.txt at gmail.com>:

> Are you running your stack on top of KVM virtual machines? How many
> controller nodes do you have? Mostly it's RabbitMQ that causes issues
> when you restart controller nodes.
>
> On Thu, May 11, 2023 at 8:34 AM Albert Braden <ozzzo at yahoo.com> wrote:
>
>> We have our haproxy and controller nodes on KVM hosts. When those KVM
>> hosts are restarted, customers who are building or deleting VMs see impact.
>> VMs may go into error status, fail to get DNS records, fail to delete, etc.
>> The obvious reason is because traffic that is being routed to the haproxy
>> on the restarting KVM is lost. If we manually fail over haproxy before
>> restarting the KVM, will that be sufficient to stop traffic being lost, or
>> do we also need to do something with the controller?
>>
>>




    