[kolla] [train] haproxy and controller restart causes user impact

Albert Braden ozzzo at yahoo.com
Fri May 12 23:21:28 UTC 2023


 We reboot quarterly for patching.
     On Friday, May 12, 2023, 04:58:45 PM EDT, Satish Patel <satish.txt at gmail.com> wrote:  
 
 Don't expect zero issues when you reboot a controller; it won't be transparent to users. Your compute nodes and other services still hang on to old connections (rabbitmq/AMQP etc.), and it takes some time for that to settle.
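If you want to automate the waiting, something like this rough sketch is what I mean by letting things settle (it assumes the openstack CLI and admin credentials already exported as OS_* variables):

    #!/usr/bin/env python3
    # Sketch: after a controller reboot, wait until every nova service
    # reports "up" again before calling the maintenance done. Assumes
    # the openstack CLI and admin credentials (OS_* vars) are available.
    import json
    import subprocess
    import time

    def all_services_up():
        out = subprocess.run(
            ["openstack", "compute", "service", "list", "-f", "json"],
            capture_output=True, text=True, check=True)
        return all(svc["State"] == "up" for svc in json.loads(out.stdout))

    while not all_services_up():
        print("waiting for agents to reconnect to rabbitmq...")
        time.sleep(10)
    print("all nova services are back up")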
Curious: why are you running control plane services on KVM? And, second question, why do you need to reboot them frequently?
I have physical nodes for the control plane, and we see strange issues whenever we take one of the controllers down for maintenance.
On Fri, May 12, 2023 at 2:59 PM Albert Braden <ozzzo at yahoo.com> wrote:

 We use keepalived and exabgp to manage failover for haproxy. That works, but it takes a few minutes, and during those few minutes customers experience impact. We tell them not to build/delete VMs during patching, but they still do, and then complain about the failures.

We're planning to experiment with adding a "manual" haproxy failover to our patching automation, but I'm wondering if there is anything on the controller that needs to be failed over or disabled before rebooting the KVM host. I looked at the "remove from cluster" and "add to cluster" procedures, but those seem unnecessarily cumbersome for a reboot.
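Roughly what we have in mind for the "manual" failover step, as a sketch only (the VIP address is a placeholder, it assumes keepalived is what actually moves the VIP, and exabgp would need the equivalent handling):

    #!/usr/bin/env python3
    # Rough sketch of a pre-reboot failover step. ASSUMPTIONS: keepalived
    # holds the VIP on this node, and 192.0.2.10 is a placeholder VIP.
    import subprocess
    import time

    VIP = "192.0.2.10"  # placeholder; substitute the real haproxy VIP

    def vip_is_local():
        # True if the VIP is still configured on an interface here.
        out = subprocess.run(["ip", "-o", "addr"],
                             capture_output=True, text=True)
        return VIP in out.stdout

    # Stopping keepalived drops the VRRP adverts, so the backup node
    # should take over the VIP.
    subprocess.run(["systemctl", "stop", "keepalived"], check=True)

    for _ in range(30):
        if not vip_is_local():
            print("VIP released, safe to reboot this KVM host")
            break
        time.sleep(1)
    else:
        raise SystemExit("VIP still local after 30s, aborting the reboot")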
     On Friday, May 12, 2023, 03:42:42 AM EDT, Eugen Block <eblock at nde.ag> wrote:  
 
 Hi Albert,

how is your haproxy placement controlled, something like pacemaker or
similar? I would always do a failover when I'm aware of interruptions
(maintenance window); that should speed things up for clients. We have
a pacemaker-controlled HA control plane, and it takes more time for
pacemaker to notice that a resource is gone if I just reboot a server
without failing over first. I have no benchmarks, though. There's
always a risk of losing a couple of requests during the failover, but
we haven't had complaints yet; I believe most of the components try to
resend the lost messages. In one of our customers' clusters with many
resources (they also use terraform) I haven't seen issues during a
regular maintenance window. When they had a DNS outage a few months
back it resulted in a mess and manual cleanup was necessary, but the
regular failovers seem to work just fine.
And I don't see rabbitmq issues either after rebooting a server;
usually the haproxy (and virtual IP) failover suffices to prevent
interruptions.
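
For us the failover is essentially just putting the node in standby
before the reboot, roughly like this sketch (it assumes the pcs CLI
and a placeholder node name; older pcs versions use "pcs cluster
standby" instead of "pcs node standby"):

    #!/usr/bin/env python3
    # Sketch: drain a pacemaker node before a maintenance reboot, then
    # let resources move back afterwards. "controller1" is a placeholder.
    import subprocess

    NODE = "controller1"

    def run(*cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # Moves haproxy, the virtual IP etc. off the node gracefully.
    run("pcs", "node", "standby", NODE)

    # ... reboot and patch the node here ...

    # Allow resources back once the node has rejoined the cluster.
    run("pcs", "node", "unstandby", NODE)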

Regards,
Eugen

Zitat von Satish Patel <satish.txt at gmail.com>:

> Are you running your stack on top of KVM virtual machines? How many
> controller nodes do you have? Mostly it's rabbitmq that causes issues
> when you restart controller nodes.
>
> On Thu, May 11, 2023 at 8:34 AM Albert Braden <ozzzo at yahoo.com> wrote:
>
>> We have our haproxy and controller nodes on KVM hosts. When those KVM
>> hosts are restarted, customers who are building or deleting VMs see impact.
>> VMs may go into error status, fail to get DNS records, fail to delete, etc.
>> The obvious reason is that traffic being routed to the haproxy on the
>> restarting KVM host is lost. If we manually fail over haproxy before
>> restarting the KVM host, will that be sufficient to stop traffic from
>> being lost, or do we also need to do something with the controller?
>>
>>




  
  