[kolla] [train] haproxy and controller restart causes user impact
Albert Braden
ozzzo at yahoo.com
Tue May 16 17:29:04 UTC 2023
What's the recommended method for rebooting controllers? Do we need to use the "remove from cluster" and "add to cluster" procedures or is there a better way?
https://docs.openstack.org/kolla-ansible/train/user/adding-and-removing-hosts.html
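For reference, the linked procedure boils down to roughly the following (the inventory name and hostname are placeholders, and the exact subcommands/flags may differ between releases):

    # removing a controller: stop its services, then drop it from the inventory
    kolla-ansible -i multinode stop --yes-i-really-really-mean-it --limit controller1
    # ... edit the inventory to remove controller1, clean up leftover cluster
    # state (e.g. "rabbitmqctl forget_cluster_node" on a surviving node) ...

    # adding it back later
    kolla-ansible -i multinode bootstrap-servers --limit controller1
    kolla-ansible -i multinode pull --limit controller1
    kolla-ansible -i multinode deploy --limit controller1

That's a lot of steps for what is conceptually just a reboot.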
On Friday, May 12, 2023, 03:04:26 PM EDT, Albert Braden <ozzzo at yahoo.com> wrote:
We use keepalived and exabgp to manage failover for haproxy. That works, but it takes a few minutes, and during those few minutes customers experience impact. We tell them not to build/delete VMs during patching, but they still do, and then complain about the failures.
We're planning to experiment with adding a "manual" haproxy failover to our patching automation, but I'm wondering if there is anything on the controller that needs to be failed over or disabled before rebooting the KVM host. I looked at the "remove from cluster" and "add to cluster" procedures, but those seem unnecessarily cumbersome for rebooting the KVM host.
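As a concrete sketch of what we mean by a "manual" failover, assuming kolla's default container name for keepalived and an illustrative VIP of 10.0.0.250:

    # on the active controller, confirm it currently holds the VIP
    ip addr show | grep 10.0.0.250

    # stop keepalived so the VIP moves to a standby controller
    docker stop keepalived

    # ... reboot the KVM host, wait for it to come back ...

    # start keepalived again; the node rejoins as a backup
    docker start keepalived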
On Friday, May 12, 2023, 03:42:42 AM EDT, Eugen Block <eblock at nde.ag> wrote:
Hi Albert,
how is your haproxy placement controlled, something like pacemaker
or similar? I would always do a manual failover when I'm aware of an
interruption (maintenance window); that should speed things up for
clients. We have a pacemaker-controlled HA control plane, and it
takes more time for pacemaker to realize a resource is gone if I
just reboot a server without failing over first. I have no
benchmarks, though. There's always a risk of losing a couple of
requests during the failover, but we haven't had complaints yet; I
believe most of the components retry lost messages. In one of our
customers' clusters with many resources (they also use terraform) I
haven't seen issues during regular maintenance windows. When they
had a DNS outage a few months back it resulted in a mess that
required manual cleanup, but the regular failovers seem to work just
fine.

And I don't see rabbitmq issues after rebooting a server either;
usually the haproxy (and virtual IP) failover suffices to prevent
interruptions.
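For a planned reboot I do something like the following (crmsh
syntax; the node name is just an example, and pcs-based setups
would use "pcs node standby" instead):

    # drain the node so pacemaker moves its resources
    # (haproxy, virtual IP) to another controller
    crm node standby controller1

    # ... reboot the server, wait for it to come back ...

    # let it take resources again
    crm node online controller1

    # and verify that rabbitmq rejoined the cluster cleanly
    rabbitmqctl cluster_status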
Regards,
Eugen
Zitat von Satish Patel <satish.txt at gmail.com>:
> Are you running your stack on top of KVM virtual machines? How many
> controller nodes do you have? Mostly it's RabbitMQ that causes issues
> when you restart controller nodes.
>
> On Thu, May 11, 2023 at 8:34 AM Albert Braden <ozzzo at yahoo.com> wrote:
>
>> We have our haproxy and controller nodes on KVM hosts. When those KVM
>> hosts are restarted, customers who are building or deleting VMs see impact:
>> VMs may go into error status, fail to get DNS records, fail to delete, etc.
>> The obvious reason is that traffic being routed to the haproxy on the
>> restarting KVM host is lost. If we manually fail over haproxy before
>> restarting the KVM host, will that be sufficient to stop traffic from
>> being lost, or do we also need to do something with the controller?