<div dir="ltr">My two cents, If you are still running Train (which is EOL) then please upgrade to the next or latest release, you never know what bug causing the issue. </div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, May 12, 2023 at 4:51 PM Satish Patel <<a href="mailto:satish.txt@gmail.com">satish.txt@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Don't expect zero issue when you reboot the controller. It won't be user transparent. your computer nodes and other services still hang on old connections (rabbitmq/amqp) etc and that takes some time to get settled. <div><br></div><div>Curious why are you running control plane service on KVM and second question why do you need them reboot frequently? </div><div><br></div><div>I have physical nodes for the control plane and we see strange issues whenever we shouldn't use one of the controllers for maintenance. </div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, May 12, 2023 at 2:59 PM Albert Braden <<a href="mailto:ozzzo@yahoo.com" target="_blank">ozzzo@yahoo.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div> We use keepalived and exabgp to manage failover for haproxy. That works but it takes a few minutes, and during those few minutes customers experience impact. We tell them to not build/delete VMs during patching, but they still do, and then complain about the failures.<br><br>We're planning to experiment with adding a "manual" haproxy failover to our patching automation, but I'm wondering if there is anything on the controller that needs to be failed over or disabled before rebooting the KVM. I looked at the "remove from cluster" and "add to cluster" procedures but that seems unnecessarily cumbersome for rebooting the KVM.<br> </div> <div style="margin:10px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"> <div style="font-family:"Helvetica Neue",Helvetica,Arial,sans-serif;font-size:13px;color:rgb(38,40,42)"> <div> On Friday, May 12, 2023, 03:42:42 AM EDT, Eugen Block <<a href="mailto:eblock@nde.ag" target="_blank">eblock@nde.ag</a>> wrote: </div> <div><br></div> <div><br></div> <div>Hi Albert,<br><br>how is your haproxy placement controlled, something like pacemaker or <br>similar? I would always do a failover when I'm aware of interruptions <br>(maintenance window), that should speed things up for clients. We have <br>a pacemaker controlled HA control plane, it takes more time until <br>pacemaker realizes that the resource is gone if I just rebooted a <br>server without failing over. I have no benchmarks though. There's <br>always a risk of losing a couple of requests during the failover but <br>we didn't have complaints yet, I believe most of the components try to <br>resend the lost messages. In one of our customer's cluster with many <br>resources (they also use terraform) I haven't seen issues during a <br>regular maintenance window. 
When they had a DNS outage a few months back, it resulted in a mess and manual cleanup was necessary, but the regular failovers seem to work just fine.<br>And I don't see RabbitMQ issues after rebooting a server either; usually the haproxy (and virtual IP) failover suffices to prevent interruptions.<br><br>Regards,<br>Eugen<br><br>Quoting Satish Patel <<a href="mailto:satish.txt@gmail.com" target="_blank">satish.txt@gmail.com</a>>:<br><br>> Are you running your stack on top of KVM virtual machines? How many<br>> controller nodes do you have? It is mostly RabbitMQ that causes issues if<br>> you restart controller nodes.<br>><br>> On Thu, May 11, 2023 at 8:34 AM Albert Braden <<a href="mailto:ozzzo@yahoo.com" target="_blank">ozzzo@yahoo.com</a>> wrote:<br>><br>>> We have our haproxy and controller nodes on KVM hosts. When those KVM<br>>> hosts are restarted, customers who are building or deleting VMs see impact.<br>>> VMs may go into error status, fail to get DNS records, fail to delete, etc.<br>>> The obvious reason is that traffic being routed to the haproxy on the<br>>> restarting KVM is lost. If we manually fail over haproxy before restarting<br>>> the KVM, will that be sufficient to stop traffic from being lost, or do we<br>>> also need to do something with the controller?<br><br></div> </div> </div></blockquote></div>
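<div><br></div><div>For anyone who wants a concrete starting point for the "manual" failover step discussed above, here is a minimal sketch. It assumes keepalived holds the API VIP on the node being rebooted and that haproxy has its runtime (stats) socket enabled at admin level; the socket path, backend names, and server name are placeholders, not values taken from this thread.</div><div><pre>
#!/usr/bin/env python3
"""Rough pre-reboot drain step for a combined haproxy/controller KVM guest.

Assumptions (adjust for your deployment; none of these come from the thread):
  - keepalived owns the API VIP on this node; stopping it moves the VIP to
    the standby haproxy.
  - haproxy.cfg contains something like "stats socket /var/run/haproxy.sock
    level admin" so the runtime API is reachable.
  - backend and server names below are placeholders.
"""
import socket
import subprocess
import time

HAPROXY_SOCKET = "/var/run/haproxy.sock"                 # hypothetical path
CONTROLLER = "ctl1"                                      # node being rebooted
BACKENDS = ["nova_api", "neutron_api", "keystone_api"]   # hypothetical names


def haproxy_cmd(command, sock_path=HAPROXY_SOCKET):
    """Send one command to the haproxy runtime API over its UNIX socket."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(sock_path)
        s.sendall(command.encode() + b"\n")
        return s.recv(65536).decode()


def prepare_for_reboot():
    # 1. Drain the controller that is about to go away: mark its servers as
    #    MAINT so no new requests are routed to it. If haproxy runs on every
    #    controller, run this against the surviving nodes' sockets as well,
    #    since the local haproxy disappears together with the host.
    for backend in BACKENDS:
        haproxy_cmd(f"set server {backend}/{CONTROLLER} state maint")

    # 2. Move the VIP off this node by stopping keepalived; the standby
    #    instance takes over and clients reconnect there.
    subprocess.run(["systemctl", "stop", "keepalived"], check=True)

    # 3. Give in-flight requests and reconnecting RabbitMQ clients a moment
    #    to settle before actually rebooting the KVM host.
    time.sleep(30)


if __name__ == "__main__":
    prepare_for_reboot()
</pre></div><div>The reverse of this (clearing the MAINT state and starting keepalived again) would run once the KVM host is back up and its services have rejoined.</div>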
</blockquote></div>
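<div><br></div><div>For a Pacemaker-managed control plane like the one Eugen describes, the pre-maintenance failover can be scripted along the same lines. This is only a rough sketch assuming the pcs and crm_resource tools are in use; "controller1" is a placeholder node name, not one from this thread.</div><div><pre>
#!/usr/bin/env python3
"""Rough sketch: fail a node over deliberately before maintenance, assuming
a Pacemaker cluster managed with pcs; "controller1" is a placeholder."""
import subprocess

NODE = "controller1"   # hypothetical name of the host about to be rebooted


def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)


def enter_maintenance():
    # Put the node into standby so Pacemaker migrates its resources
    # (VIP, haproxy, etc.) before the reboot, instead of reacting only
    # after the node disappears.
    run(["pcs", "node", "standby", NODE])
    # Wait until the cluster has finished moving resources.
    run(["crm_resource", "--wait"])


def leave_maintenance():
    # After the host is back, allow it to run resources again.
    run(["pcs", "node", "unstandby", NODE])


if __name__ == "__main__":
    enter_maintenance()
</pre></div>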
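<div><br></div><div>And since compute nodes and agents take a while to re-establish their RabbitMQ connections after a controller reboot, a patching job can wait for the control plane to settle before moving on to the next node. The checks below are rough heuristics, assuming rabbitmqctl and the openstack CLI are available where the job runs; none of the commands or timeouts come from this thread.</div><div><pre>
#!/usr/bin/env python3
"""Rough post-reboot settle check: poll until RabbitMQ answers and the
compute/network agents have re-registered over AMQP."""
import subprocess
import sys
import time


def ok(cmd, must_not_contain=""):
    """Run a command; succeed if it exits 0 and (optionally) its output does
    not contain an unwanted marker such as 'down' or 'xxx'."""
    try:
        out = subprocess.run(cmd, check=True, capture_output=True,
                             text=True).stdout
    except (subprocess.CalledProcessError, FileNotFoundError):
        return False
    return must_not_contain not in out.lower() if must_not_contain else True


def settled():
    # cluster_status succeeding means the local broker is up and reachable;
    # the CLI checks are crude string matches on the State/Alive columns.
    return (
        ok(["rabbitmqctl", "cluster_status"])
        and ok(["openstack", "compute", "service", "list"],
               must_not_contain="down")
        and ok(["openstack", "network", "agent", "list"],
               must_not_contain="xxx")
    )


if __name__ == "__main__":
    for _ in range(30):        # up to ~15 minutes
        if settled():
            print("control plane looks settled")
            sys.exit(0)
        time.sleep(30)
    sys.exit("control plane did not settle in time")
</pre></div>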