[kolla] [train] haproxy and controller restart causes user impact

Eugen Block eblock at nde.ag
Wed May 17 11:17:30 UTC 2023


Hi,
I found this reference [1]; it recommends lowering the kernel option
tcp_retries2 to reduce the impact of a service interruption:

# /etc/kolla/globals.yml
haproxy_host_ipv4_tcp_retries2: 6
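
If your release has this variable, something like this should roll it
out (a sketch, untested; the inventory path is an assumption and the
tag may differ between releases):

kolla-ansible -i /etc/kolla/multinode reconfigure --tags haproxy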

Apparently, this option was introduced in Victoria [2]; the release
note states:

> Added a new haproxy configuration variable,  
> haproxy_host_ipv4_tcp_retries2, which allows users to modify this  
> kernel option. This option sets maximum number of times a TCP packet  
> is retransmitted in established state before giving up. The default  
> kernel value is 15, which corresponds to a duration of approximately  
> between 13 to 30 minutes, depending on the retransmission timeout.  
> This variable can be used to mitigate an issue with stuck  
> connections in case of VIP failover, see bug 1917068 for details.
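
As a rough back-of-envelope (assuming the Linux minimum RTO of 200 ms
and the usual doubling per retransmission): with tcp_retries2=6 the
kernel gives up after about

0.2 + 0.4 + 0.8 + 1.6 + 3.2 + 6.4 + 12.8 = ~25 seconds

instead of the roughly 13 to 30 minutes with the default of 15. The
actual timing depends on the measured RTT, so treat this as an order
of magnitude only.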

It reads like exactly what you're describing. If I remember correctly,
you're still on Train? In that case you'll probably have to configure
the setting manually (scripted, maybe); it is this value:
/proc/sys/net/ipv4/tcp_retries2
The solution in [3] even suggests setting it to 3 for HA deployments.

# sysctl -a | grep net.ipv4.tcp_retries2
net.ipv4.tcp_retries2 = 15
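
A minimal, untested sketch for setting it manually (the value 6 is
just an example, see [1] and [3] for guidance):

# apply at runtime
sysctl -w net.ipv4.tcp_retries2=6
# persist across reboots
echo "net.ipv4.tcp_retries2 = 6" > /etc/sysctl.d/99-tcp-retries2.conf
sysctl --system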

Regards,
Eugen

[1] https://docs.openstack.org/kolla-ansible/latest/reference/high-availability/haproxy-guide.html
[2] https://docs.openstack.org/releasenotes/kolla-ansible/victoria.html
[3] https://access.redhat.com/solutions/726753

Quoting Albert Braden <ozzzo at yahoo.com>:

> Before we switched to durable queues we were seeing RMQ issues after  
> a restart. Now RMQ is fine after restart, but operations in progress  
> will fail. VMs will fail to build, or not get DNS records. Volumes  
> don't get attached or detached. It looks like haproxy is the issue  
> now; connections continue going to the down node. I think we can fix  
> that by failing over haproxy before rebooting.
>
> The problem is, I'm not sure that haproxy is the only issue. All 3  
> controllers are doing stuff, and when I reboot one, whatever it is  
> doing is likely to fail. Is there an orderly way to stop work from  
> being done on a controller without ruining work that is already in  
> progress, besides removing it from the cluster? Would "kolla-ansible  
> stop" do it?
> On Tuesday, May 16, 2023, 02:23:59 PM EDT, Eugen Block
> <eblock at nde.ag> wrote:
>
>  Hi Albert,
>
> sorry, I'm swamped with different stuff right now. I just took a
> glance at the docs you mentioned, and actually removing hosts seems
> like way too much for something as simple as a controller restart;
> that should definitely not be necessary.
> I'm not familiar with kolla or exabgp, but can you describe what
> exactly takes so long to fail over? Maybe that could be improved? And
> can you limit the failing requests to a specific service (volumes,
> network ports, etc.), or do they all fail? Maybe rabbitmq should be
> considered after all; you could share your rabbitmq settings from the
> different openstack services and I will collect mine to compare. And
> then also the rabbitmq config (policies, vhosts, queues).
>
> Regards,
> Eugen
>
> Quoting Albert Braden <ozzzo at yahoo.com>:
>
>> What's the recommended method for rebooting controllers? Do we need 
>> to use the "remove from cluster" and "add to cluster" procedures or 
>> is there a better way?
>>
>> https://docs.openstack.org/kolla-ansible/train/user/adding-and-removing-hosts.html
>> On Friday, May 12, 2023, 03:04:26 PM EDT, Albert Braden
>> <ozzzo at yahoo.com> wrote:
>>
>> We use keepalived and exabgp to manage failover for haproxy. That
>> works, but it takes a few minutes, and during those few minutes
>> customers experience impact. We tell them not to build/delete VMs
>> during patching, but they still do, and then complain about the
>> failures.
>>
>> We're planning to experiment with adding a "manual" haproxy failover
>> to our patching automation, but I'm wondering if there is anything
>> on the controller that needs to be failed over or disabled before
>> rebooting the KVM host. I looked at the "remove from cluster" and
>> "add to cluster" procedures, but that seems unnecessarily cumbersome
>> for rebooting the KVM host.
>> On Friday, May 12, 2023, 03:42:42 AM EDT, Eugen Block
>> <eblock at nde.ag> wrote:
>>
>> Hi Albert,
>>
>> how is your haproxy placement controlled, something like pacemaker
>> or similar? I would always do a failover when I'm aware of
>> interruptions (maintenance window); that should speed things up for
>> clients. We have a pacemaker-controlled HA control plane, and it
>> takes more time for pacemaker to realize that a resource is gone if
>> I just reboot a server without failing over first. I have no
>> benchmarks, though. There's always a risk of losing a couple of
>> requests during the failover, but we haven't had complaints yet; I
>> believe most of the components try to resend the lost messages. In
>> one of our customers' clusters with many resources (they also use
>> terraform) I haven't seen issues during a regular maintenance
>> window. When they had a DNS outage a few months back it resulted in
>> a mess and manual cleaning was necessary, but the regular failovers
>> seem to work just fine.
>> And I don't see rabbitmq issues either after rebooting a server;
>> usually the haproxy (and virtual IP) failover suffices to prevent
>> interruptions.
>>
>> Regards,
>> Eugen
>>
>> Quoting Satish Patel <satish.txt at gmail.com>:
>>
>>> Are you running your stack on top of KVM virtual machines? How many
>>> controller nodes do you have? Mostly it's RabbitMQ that causes
>>> issues if you restart controller nodes.
>>>
>>> On Thu, May 11, 2023 at 8:34 AM Albert Braden <ozzzo at yahoo.com> wrote:
>>>
>>>> We have our haproxy and controller nodes on KVM hosts. When those KVM
>>>> hosts are restarted, customers who are building or deleting VMs see
>>>> impact. VMs may go into error status, fail to get DNS records, fail
>>>> to delete, etc. The obvious reason is that traffic being routed to
>>>> the haproxy on the restarting KVM host is lost. If we manually fail
>>>> over haproxy before restarting the KVM host, will that be sufficient
>>>> to stop traffic being lost, or do we also need to do something with
>>>> the controller?
>>>>