Hi,
I found this reference [1]; it recommends lowering the tcp_retries2 kernel
option to reduce the impact of a service interruption:
# /etc/kolla/globals.yml
haproxy_host_ipv4_tcp_retries2: 6
Apparently, this option was introduced in Victoria [2]; the release note states:
> Added a new haproxy configuration variable,
> haproxy_host_ipv4_tcp_retries2, which allows users to modify this
> kernel option. This option sets maximum number of times a TCP packet
> is retransmitted in established state before giving up. The default
> kernel value is 15, which corresponds to a duration of approximately
> between 13 to 30 minutes, depending on the retransmission timeout.
> This variable can be used to mitigate an issue with stuck
> connections in case of VIP failover, see bug 1917068 for details.
It reads like exactly what you're describing. If I remember correctly,
you're still on Train? In that case you'll probably have to configure
that setting manually (scripted, maybe); it is this value:
/proc/sys/net/ipv4/tcp_retries2
The solution in [3] even talks about setting it to 3 for HA deployments.
# sysctl -a | grep net.ipv4.tcp_retries2
net.ipv4.tcp_retries2 = 15
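If it's not handled by your config management yet, something like this
should do it on each controller node (untested sketch; the value 6 is the
one from [1] and the sysctl.d file name is just an example):

# sysctl -w net.ipv4.tcp_retries2=6
# echo "net.ipv4.tcp_retries2 = 6" > /etc/sysctl.d/90-tcp-retries2.conf
# sysctl --system

The first command applies the value at runtime, the drop-in file keeps it
persistent across reboots, and 'sysctl --system' reloads all configured values.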
Regards,
Eugen
[1] https://docs.openstack.org/kolla-ansible/latest/reference/high-availability/haproxy-guide.html
[2] https://docs.openstack.org/releasenotes/kolla-ansible/victoria.html
[3] https://access.redhat.com/solutions/726753

Quoting Albert Braden <ozzzo@yahoo.com>:
> Before we switched to durable queues we were seeing RMQ issues after
> a restart. Now RMQ is fine after restart, but operations in progress
> will fail. VMs will fail to build, or not get DNS records. Volumes
> don't get attached or detached. It looks like haproxy is the issue
> now; connections continue going to the down node. I think we can fix
> that by failing over haproxy before rebooting.
>
> The problem is, I'm not sure that haproxy is the only issue. All 3
> controllers are doing stuff, and when I reboot one, whatever it is
> doing is likely to fail. Is there an orderly way to stop work from
> being done on a controller without ruining work that is already in
> progress, besides removing it from the cluster? Would "kolla-ansible
> stop" do it?
> On Tuesday, May 16, 2023, 02:23:59 PM EDT, Eugen Block
> <eblock@nde.ag> wrote:
>
> Hi Albert,
>
> sorry, I'm swamped with other stuff right now. I just took a
> glance at the docs you mentioned, and actually removing hosts seems
> like way too much for something as simple as a controller restart;
> that should definitely not be necessary.
> I'm not familiar with kolla or exabgp, but can you describe what
> exactly takes that long to fail over? Maybe that could be improved? And
> can you limit the failing requests to a specific service (volumes,
> network ports, etc.) or do they all fail? Maybe rabbitmq should be
> considered after all: you could share your rabbitmq settings from the
> different openstack services and I will collect mine to compare, and
> then also the rabbitmq config (policies, vhosts, queues).
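> In case it helps with comparing, something along these lines is what I
> would collect on both sides (just a sketch; with kolla the rabbitmqctl
> calls probably have to run inside the rabbitmq container and the config
> paths will differ):
>
> # rabbitmqctl list_policies
> # rabbitmqctl list_vhosts
> # rabbitmqctl list_queues name durable messages
> # grep -A 10 oslo_messaging_rabbit /etc/nova/nova.conf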
>
> Regards,
> Eugen
>
> Quoting Albert Braden <ozzzo@yahoo.com>:
>
>> What's the recommended method for rebooting controllers? Do we need
>> to use the "remove from cluster" and "add to cluster" procedures or
>> is there a better way?
>>
>>
>> https://docs.openstack.org/kolla-ansible/train/user/adding-and-removing-hosts.html
>>
>> On Friday, May 12, 2023, 03:04:26 PM EDT, Albert Braden
>> <ozzzo@yahoo.com> wrote:
>>
>> We use keepalived and exabgp to manage failover for haproxy. That
>> works but it takes a few minutes, and during those few minutes
>> customers experience impact. We tell them to not build/delete VMs
>> during patching, but they still do, and then complain about the
>> failures.
>>
>> We're planning to experiment with adding a "manual" haproxy failover
>> to our patching automation, but I'm wondering if there is anything
>> on the controller that needs to be failed over or disabled before
>> rebooting the KVM. I looked at the "remove from cluster" and "add to
>> cluster" procedures but that seems unnecessarily cumbersome for
>> rebooting the KVM.
>> On Friday, May 12, 2023, 03:42:42 AM EDT, Eugen Block
>> <eblock@nde.ag> wrote:
>>
>> Hi Albert,
>>
>> how is your haproxy placement controlled, with something like pacemaker or
>> similar? I would always do a failover when I'm aware of interruptions
>> (maintenance window); that should speed things up for clients. We have
>> a pacemaker-controlled HA control plane, and it takes more time for
>> pacemaker to realize that a resource is gone if I just reboot a
>> server without failing over first. I have no benchmarks, though. There's
>> always a risk of losing a couple of requests during the failover, but
>> we haven't had complaints yet; I believe most of the components try to
>> resend the lost messages. In one of our customers' clusters with many
>> resources (they also use terraform) I haven't seen issues during a
>> regular maintenance window. When they had a DNS outage a few months
>> back it resulted in a mess and manual cleanup was necessary, but the
>> regular failovers seem to work just fine.
>> And I don't see rabbitmq issues after rebooting a server either;
>> usually the haproxy (and virtual IP) failover suffices to prevent
>> interruptions.
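>> For us the manual failover is basically just putting the node in
>> standby before the reboot, roughly like this (sketch only, the node
>> name is made up; older pcs versions use 'pcs cluster standby' instead):
>>
>> # pcs node standby controller1
>> (reboot the node and wait for it to come back)
>> # pcs node unstandby controller1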
>>
>> Regards,
>> Eugen
>>
>> Quoting Satish Patel <satish.txt@gmail.com>:
>>
>>> Are you running your stack on top of KVM virtual machines? How many
>>> controller nodes do you have? It's mostly RabbitMQ that causes issues when
>>> you restart controller nodes.
>>>
>>> On Thu, May 11, 2023 at 8:34 AM Albert Braden <ozzzo@yahoo.com> wrote:
>>>
>>>> We have our haproxy and controller nodes on KVM hosts. When those KVM
>>>> hosts are restarted, customers who are building or deleting VMs
>>>> see impact.
>>>> VMs may go into error status, fail to get DNS records, fail to
>>>> delete, etc.
>>>> The obvious reason is that traffic being routed to the haproxy
>>>> on the restarting KVM is lost. If we manually fail over haproxy before
>>>> restarting the KVM, will that be sufficient to stop traffic being lost, or
>>>> do we also need to do something with the controller?
>>>>
>>>>