[kolla] [train] haproxy and controller restart causes user impact

Satish Patel satish.txt at gmail.com
Fri May 12 20:52:47 UTC 2023


My two cents: if you are still running Train (which is EOL), please
upgrade to the next or latest release; you never know what bug is causing
the issue.

On Fri, May 12, 2023 at 4:51 PM Satish Patel <satish.txt at gmail.com> wrote:

> Don't expect zero issues when you reboot a controller. It won't be
> transparent to users: your compute nodes and other services still hang on
> to old connections (rabbitmq/amqp), and that takes some time to settle.
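>
> One way to confirm things have settled is to wait for the service
> heartbeats to come back after the reboot. A rough, untested sketch
> (assumes python-openstackclient is installed and the usual OS_*
> environment variables are set; purely illustrative):
>
> import json
> import subprocess
> import time
>
> # Poll until every nova service reports "up" again after the reboot.
> for _ in range(30):
>     out = subprocess.run(
>         ["openstack", "compute", "service", "list", "-f", "json"],
>         check=True, capture_output=True, text=True,
>     ).stdout
>     down = [s for s in json.loads(out) if s.get("State") != "up"]
>     if not down:
>         print("all compute services report up")
>         break
>     print(f"{len(down)} services still down, waiting...")
>     time.sleep(10)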
>
> Curious: why are you running the control plane services on KVM, and,
> second question, why do you need to reboot them frequently?
>
> I have physical nodes for the control plane, and we see strange issues
> whenever we take one of the controllers down for maintenance.
>
> On Fri, May 12, 2023 at 2:59 PM Albert Braden <ozzzo at yahoo.com> wrote:
>
>> We use keepalived and exabgp to manage failover for haproxy. That works,
>> but it takes a few minutes, and during those few minutes customers
>> experience impact. We tell them not to build/delete VMs during patching,
>> but they still do, and then complain about the failures.
>>
>> We're planning to experiment with adding a "manual" haproxy failover to
>> our patching automation, but I'm wondering if there is anything on the
>> controller that needs to be failed over or disabled before rebooting the
>> KVM. I looked at the "remove from cluster" and "add to cluster"
>> procedures, but those seem unnecessarily cumbersome for rebooting the KVM.
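>>
>> Roughly what we have in mind for the pre-reboot step, as an untested
>> sketch (the hostname, VIP, and port are placeholders, and it assumes
>> stopping keepalived on the node being patched makes the peer take over
>> the VIP):
>>
>> import socket
>> import subprocess
>> import time
>>
>> VIP = "192.0.2.10"      # placeholder API VIP
>> API_PORT = 5000         # placeholder port served behind haproxy
>> NODE = "ctl1"           # placeholder controller/KVM guest to patch
>>
>> def vip_answers(vip, port, timeout=2.0):
>>     """Return True if something accepts TCP connections on the VIP."""
>>     try:
>>         with socket.create_connection((vip, port), timeout=timeout):
>>             return True
>>     except OSError:
>>         return False
>>
>> # Stop keepalived on the node we are about to reboot so the VIP fails
>> # over to the surviving haproxy before we take the node down.
>> subprocess.run(["ssh", NODE, "sudo", "systemctl", "stop", "keepalived"],
>>                check=True)
>>
>> # Wait until the VIP answers again (now served by the peer).
>> for _ in range(60):
>>     if vip_answers(VIP, API_PORT):
>>         print("VIP is served by the peer; safe to reboot", NODE)
>>         break
>>     time.sleep(5)
>> else:
>>     raise SystemExit("VIP never came back; aborting the reboot")
>>
>> Re-enabling keepalived on the node after the reboot would be the
>> matching post-step.
>>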
>> On Friday, May 12, 2023, 03:42:42 AM EDT, Eugen Block <eblock at nde.ag>
>> wrote:
>>
>>
>> Hi Albert,
>>
>> How is your haproxy placement controlled, something like pacemaker or
>> similar? I would always do a failover when I'm aware of interruptions
>> (maintenance window); that should speed things up for clients. We have
>> a pacemaker-controlled HA control plane, and it takes more time for
>> pacemaker to realize that a resource is gone if I just reboot a server
>> without failing over first. I have no benchmarks, though. There's always
>> a risk of losing a couple of requests during the failover, but we haven't
>> had complaints yet; I believe most of the components try to resend the
>> lost messages. In one of our customers' clusters with many resources
>> (they also use Terraform) I haven't seen issues during regular
>> maintenance windows. When they had a DNS outage a few months back it
>> resulted in a mess and manual cleanup was necessary, but the regular
>> failovers seem to work just fine.
>> And I don't see rabbitmq issues after rebooting a server either; usually
>> the haproxy (and virtual IP) failover suffices to prevent interruptions.
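>>
>> For reference, the pre-maintenance failover boils down to draining the
>> node with pacemaker before the reboot; a minimal sketch, assuming a
>> pcs-managed cluster (the node name is a placeholder):
>>
>> import subprocess
>>
>> NODE = "controller1"  # placeholder for the node going into maintenance
>>
>> # Move the pacemaker-managed resources (haproxy, virtual IP, ...) off
>> # the node before rebooting it, instead of letting pacemaker discover
>> # the loss on its own.
>> subprocess.run(["pcs", "node", "standby", NODE], check=True)
>>
>> # ... reboot / patch the node here ...
>>
>> # Once the node is back, allow pacemaker to place resources on it again.
>> subprocess.run(["pcs", "node", "unstandby", NODE], check=True)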
>>
>> Regards,
>> Eugen
>>
>> Quoting Satish Patel <satish.txt at gmail.com>:
>>
>> > Are you running your stack on top of KVM virtual machines? How many
>> > controller nodes do you have? It's mostly RabbitMQ that causes issues
>> > if you restart controller nodes.
>> >
>> > On Thu, May 11, 2023 at 8:34 AM Albert Braden <ozzzo at yahoo.com> wrote:
>> >
>> >> We have our haproxy and controller nodes on KVM hosts. When those KVM
>> >> hosts are restarted, customers who are building or deleting VMs see
>> >> impact. VMs may go into error status, fail to get DNS records, fail to
>> >> delete, etc. The obvious reason is that traffic being routed to the
>> >> haproxy on the restarting KVM is lost. If we manually fail over haproxy
>> >> before restarting the KVM, will that be sufficient to stop traffic from
>> >> being lost, or do we also need to do something with the controller?
>> >>
>> >>