My two cents: if you are still running Train (which is EOL), please upgrade to a newer or the latest release; you never know what bug is causing the issue.

On Fri, May 12, 2023 at 4:51 PM Satish Patel <satish.txt@gmail.com> wrote:
Don't expect zero issues when you reboot a controller. It won't be transparent to users. Your compute nodes and other services still hang on to old connections (rabbitmq/amqp) etc., and that takes some time to settle.
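One way to tell when things have settled is to check that the compute services and network agents have re-registered after the reboot. A minimal sketch with openstacksdk, assuming a clouds.yaml entry named "mycloud" (placeholder):

import openstack

# Connect using a clouds.yaml entry; "mycloud" is a placeholder name.
conn = openstack.connect(cloud="mycloud")

# Nova services should report state "up" once they have reconnected to rabbitmq.
for svc in conn.compute.services():
    print(f"{svc.binary:<20} {svc.host:<25} {svc.state}")

# Neutron agents should report alive again as well.
for agent in conn.network.agents():
    print(f"{agent.binary:<20} {agent.host:<25} {'alive' if agent.is_alive else 'down'}")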

Curious why you are running control plane services on KVM, and second question: why do you need to reboot them frequently?

I have physical nodes for the control plane, and we see strange issues whenever we take one of the controllers down for maintenance.

On Fri, May 12, 2023 at 2:59 PM Albert Braden <ozzzo@yahoo.com> wrote:
We use keepalived and exabgp to manage failover for haproxy. That works, but it takes a few minutes, and during those few minutes customers experience impact. We tell them not to build/delete VMs during patching, but they still do, and then complain about the failures.

We're planning to experiment with adding a "manual" haproxy failover to our patching automation, but I'm wondering if there is anything on the controller that needs to be failed over or disabled before rebooting the KVM. I looked at the "remove from cluster" and "add to cluster" procedures but that seems unnecessarily cumbersome for rebooting the KVM.
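In case it helps, the "manual" failover step can be as simple as stopping keepalived on the controller we're about to reboot and waiting for the VIP to leave that node before proceeding (with exabgp the equivalent would be withdrawing the route). A rough sketch of what that step in the patching automation might look like; the hostname and VIP below are placeholders:

import subprocess
import time

CONTROLLER = "ctrl1.example.com"  # placeholder controller hostname
VIP = "192.0.2.10"                # placeholder virtual IP

def vip_present(host: str) -> bool:
    """Return True while the virtual IP is still bound on the host."""
    out = subprocess.run(
        ["ssh", host, "ip", "-4", "addr", "show"],
        capture_output=True, text=True, check=True,
    ).stdout
    return VIP in out

# Stop keepalived so the VIP fails over to a peer, then wait until the
# address has actually left this node before rebooting the KVM host.
subprocess.run(["ssh", CONTROLLER, "systemctl", "stop", "keepalived"], check=True)
while vip_present(CONTROLLER):
    time.sleep(2)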
On Friday, May 12, 2023, 03:42:42 AM EDT, Eugen Block <eblock@nde.ag> wrote:


Hi Albert,

How is your haproxy placement controlled, something like pacemaker or similar? I would always do a failover when I'm aware of an interruption (maintenance window); that should speed things up for clients. We have a pacemaker-controlled HA control plane, and it takes more time for pacemaker to realize that a resource is gone if I just reboot a server without failing over first. I have no benchmarks, though. There's always a risk of losing a couple of requests during the failover, but we haven't had complaints yet; I believe most of the components try to resend the lost messages. In one of our customers' clusters with many resources (they also use terraform) I haven't seen issues during a regular maintenance window. When they had a DNS outage a few months back it resulted in a mess and manual cleanup was necessary, but the regular failovers seem to work just fine.
And I don't see rabbitmq issues after rebooting a server either; usually the haproxy (and virtual IP) failover suffices to prevent interruptions.
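Roughly, the pre-maintenance failover can be scripted like this; just a sketch assuming a pcs-managed pacemaker cluster (on older pcs releases the subcommand is "pcs cluster standby" instead of "pcs node standby", and the node name below is a placeholder):

import subprocess

NODE = "controller1"  # placeholder pacemaker node name

# Put the node into standby so pacemaker migrates haproxy and the virtual IP
# to another node before the reboot, instead of waiting for failure detection.
subprocess.run(["pcs", "node", "standby", NODE], check=True)

# ... reboot / maintenance happens here ...

# Take the node out of standby afterwards so it can host resources again.
subprocess.run(["pcs", "node", "unstandby", NODE], check=True)

Whether resources move back afterwards depends on your stickiness and constraint settings.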

Regards,
Eugen

Quoting Satish Patel <satish.txt@gmail.com>:

> Are you running your stack on top of KVM virtual machines? How many
> controller nodes do you have? Mostly it's RabbitMQ that causes issues if you
> restart controller nodes.
>
> On Thu, May 11, 2023 at 8:34 AM Albert Braden <ozzzo@yahoo.com> wrote:
>
>> We have our haproxy and controller nodes on KVM hosts. When those KVM
>> hosts are restarted, customers who are building or deleting VMs see impact.
>> VMs may go into error status, fail to get DNS records, fail to delete, etc.
>> The obvious reason is that traffic being routed to the haproxy
>> on the restarting KVM is lost. If we manually fail over haproxy before
>> restarting the KVM, will that be sufficient to stop traffic from being lost, or
>> do we also need to do something with the controller?
>>
>>