<div>                There must be a way to stop traffic from being sent to a controller, so that it can be rebooted in an orderly fashion. If that's not possible, then reducing the period of disruption with network settings would be my second choice.<br><br>Can someone from the kolla team give advice about this? What is the recommended method for rebooting a kolla-ansible controller in an orderly fashion? Do I need to use the "remove from cluster" and "add to cluster" procedures, or is there a better way?<br>            </div>            <div class="yahoo_quoted" style="margin:10px 0px 0px 0.8ex;border-left:1px solid #ccc;padding-left:1ex;">                        <div style="font-family:'Helvetica Neue', Helvetica, Arial, sans-serif;font-size:13px;color:#26282a;">                                <div>                    On Wednesday, May 17, 2023, 07:25:34 AM EDT, Eugen Block &lt;eblock@nde.ag&gt; wrote:                </div>                <div><br></div>                <div><br></div>                <div>Hi,<br>I found this reference [1]; it recommends lowering the kernel option  <br>tcp_retries2 to reduce the impact of a service interruption:<br><br># /etc/kolla/globals.yml<br>haproxy_host_ipv4_tcp_retries2: 6<br><br>Apparently this option was introduced in Victoria [2]; the release notes state:<br><br>> Added a new haproxy configuration variable,  <br>> haproxy_host_ipv4_tcp_retries2, which allows users to modify this  <br>> kernel option. This option sets maximum number of times a TCP packet  <br>> is retransmitted in established state before giving up. The default  <br>> kernel value is 15, which corresponds to a duration of approximately  <br>> between 13 to 30 minutes, depending on the retransmission timeout.  <br>> This variable can be used to mitigate an issue with stuck  <br>> connections in case of VIP failover, see bug 1917068 for details.<br><br>It reads like exactly what you're describing. If I remember correctly,  <br>you're still on Train? 
In that case you'll probably have to configure  <br>that setting manually (perhaps scripted); it is this value:  <br>/proc/sys/net/ipv4/tcp_retries2<br>The solution in [3] even talks about setting it to 3 for HA deployments.<br><br># sysctl -a | grep net.ipv4.tcp_retries2<br>net.ipv4.tcp_retries2 = 15<br><br>Regards,<br>Eugen<br><br>[1]  <br><a href="https://docs.openstack.org/kolla-ansible/latest/reference/high-availability/haproxy-guide.html" target="_blank">https://docs.openstack.org/kolla-ansible/latest/reference/high-availability/haproxy-guide.html</a><br>[2] <a href="https://docs.openstack.org/releasenotes/kolla-ansible/victoria.html" target="_blank">https://docs.openstack.org/releasenotes/kolla-ansible/victoria.html</a><br>[3] <a href="https://access.redhat.com/solutions/726753" target="_blank">https://access.redhat.com/solutions/726753</a><br><br>Quoting Albert Braden <<a ymailto="mailto:ozzzo@yahoo.com" href="mailto:ozzzo@yahoo.com">ozzzo@yahoo.com</a>>:<br><br>> Before we switched to durable queues we were seeing RMQ issues after  <br>> a restart. Now RMQ is fine after restart, but operations in progress  <br>> will fail. VMs will fail to build, or not get DNS records. Volumes  <br>> don't get attached or detached. It looks like haproxy is the issue  <br>> now; connections continue going to the down node. I think we can fix  <br>> that by failing over haproxy before rebooting.<br>><br>> The problem is, I'm not sure that haproxy is the only issue. All 3  <br>> controllers are doing stuff, and when I reboot one, whatever it is  <br>> doing is likely to fail. Is there an orderly way to stop work from  <br>> being done on a controller without ruining work that is already in  <br>> progress, besides removing it from the cluster? 
Would "kolla-ansible  <br>> stop" do it?<br>>      On Tuesday, May 16, 2023, 02:23:59 PM EDT, Eugen Block  <br>> <<a ymailto="mailto:eblock@nde.ag" href="mailto:eblock@nde.ag">eblock@nde.ag</a>> wrote:<br>><br>>  Hi Albert,<br>><br>> sorry, I'm swamped with different stuff right now. I just took a <br>> glance at the docs you mentioned, and actually removing hosts seems like <br>> far too much for something as simple as a controller restart; <br>> that should definitely not be necessary.<br>> I'm not familiar with kolla or exabgp, but can you describe what <br>> exactly takes that long to fail over? Maybe that could be improved? And <br>> can you narrow the failing requests down to a specific service (volumes, <br>> network ports, etc.) or do they all fail? Maybe rabbitmq should be <br>> considered after all; you could share your rabbitmq settings from the <br>> different openstack services and I will collect mine to compare. And <br>> then also the rabbitmq config (policies, vhosts, queues).<br>><br>> Regards,<br>> Eugen<br>><br>> Quoting Albert Braden <<a ymailto="mailto:ozzzo@yahoo.com" href="mailto:ozzzo@yahoo.com">ozzzo@yahoo.com</a>>:<br>><br>>> What's the recommended method for rebooting controllers? Do we need <br>>> to use the "remove from cluster" and "add to cluster" procedures or <br>>> is there a better way?<br>>><br>>> <a href="https://docs.openstack.org/kolla-ansible/train/user/adding-and-removing-hosts.html" target="_blank">https://docs.openstack.org/kolla-ansible/train/user/adding-and-removing-hosts.html</a><br>>>       On Friday, May 12, 2023, 03:04:26 PM EDT, Albert Braden <br>>> <<a ymailto="mailto:ozzzo@yahoo.com" href="mailto:ozzzo@yahoo.com">ozzzo@yahoo.com</a>> wrote:<br>>><br>>>   We use keepalived and exabgp to manage failover for haproxy. That <br>>> works but it takes a few minutes, and during those few minutes <br>>> customers experience impact. 
We tell them to not build/delete VMs <br>>> during patching, but they still do, and then complain about the <br>>> failures.<br>>><br>>> We're planning to experiment with adding a "manual" haproxy failover <br>>> to our patching automation, but I'm wondering if there is anything <br>>> on the controller that needs to be failed over or disabled before <br>>> rebooting the KVM. I looked at the "remove from cluster" and "add to <br>>> cluster" procedures but that seems unnecessarily cumbersome for <br>>> rebooting the KVM.<br>>>       On Friday, May 12, 2023, 03:42:42 AM EDT, Eugen Block <br>>> <<a ymailto="mailto:eblock@nde.ag" href="mailto:eblock@nde.ag">eblock@nde.ag</a>> wrote:<br>>><br>>>   Hi Albert,<br>>><br>>> how is your haproxy placement controlled, something like pacemaker or <br>>> similar? I would always do a failover when I'm aware of interruptions <br>>> (maintenance window); that should speed things up for clients. We have <br>>> a pacemaker-controlled HA control plane, and it takes more time until <br>>> pacemaker realizes that the resource is gone if I just reboot a <br>>> server without failing over. I have no benchmarks though. There's <br>>> always a risk of losing a couple of requests during the failover but <br>>> we haven't had complaints yet; I believe most of the components try to <br>>> resend the lost messages. In one of our customers' clusters with many <br>>> resources (they also use terraform) I haven't seen issues during a <br>>> regular maintenance window. 
When they had a DNS outage a few months <br>>> back it resulted in a mess, manual cleaning was necessary, but the <br>>> regular failovers seem to work just fine.<br>>> And I don't see rabbitmq issues either after rebooting a server; <br>>> usually the haproxy (and virtual IP) failover suffices to prevent <br>>> interruptions.<br>>><br>>> Regards,<br>>> Eugen<br>>><br>>> Quoting Satish Patel <<a ymailto="mailto:satish.txt@gmail.com" href="mailto:satish.txt@gmail.com">satish.txt@gmail.com</a>>:<br>>><br>>>> Are you running your stack on top of KVM virtual machines? How many<br>>>> controller nodes do you have? It is mostly RabbitMQ that causes issues if you restart<br>>>> controller nodes.<br>>>><br>>>> On Thu, May 11, 2023 at 8:34 AM Albert Braden <<a ymailto="mailto:ozzzo@yahoo.com" href="mailto:ozzzo@yahoo.com">ozzzo@yahoo.com</a>> wrote:<br>>>><br>>>>> We have our haproxy and controller nodes on KVM hosts. When those KVM<br>>>>> hosts are restarted, customers who are building or deleting VMs  <br>>>>> see impact.<br>>>>> VMs may go into error status, fail to get DNS records, fail to  <br>>>>> delete, etc.<br>>>>> The obvious reason is that traffic that is being routed to the haproxy<br>>>>> on the restarting KVM is lost. If we manually fail over haproxy before<br>>>>> restarting the KVM, will that be sufficient to stop traffic being lost, or<br>>>>> do we also need to do something with the controller?<br>>>>><br>>>>><br><br><br><br><br></div>            </div>                </div>
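<div>For a rough feel of what the tcp_retries2 values mentioned in the thread (15, 6 and 3) mean in time, here is a minimal sketch of the retransmission backoff arithmetic. It assumes the usual Linux bounds of 200 ms minimum and 120 s maximum retransmission timeout and a connection whose timeout starts at the minimum; the function name is made up for illustration. Real timeouts depend on the measured round-trip time, so treat the results as lower bounds rather than exact figures.</div>

```python
def tcp_retries2_timeout(retries, rto_min=0.2, rto_max=120.0):
    """Lower-bound seconds before an established TCP connection gives up.

    Assumption: the retransmission timeout starts at rto_min and doubles
    after every unanswered transmission, capped at rto_max.
    """
    total = 0.0
    rto = rto_min
    # One wait after the original transmission plus one per retransmission.
    for _ in range(retries + 1):
        total += min(rto, rto_max)
        rto *= 2
    return total


if __name__ == "__main__":
    for retries in (3, 6, 15):
        seconds = tcp_retries2_timeout(retries)
        print(f"tcp_retries2={retries}: about {seconds:.1f} s")
```

Under these assumptions the default of 15 works out to about 924.6 seconds (roughly 15 minutes), 6 to about 25 seconds, and 3 to about 3 seconds, which is why lowering the value shortens the time stuck connections linger after a VIP failover.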