[kolla] [train] haproxy and controller restart causes user impact

Satish Patel satish.txt at gmail.com
Fri May 12 20:51:03 UTC 2023


Don't expect zero issues when you reboot a controller. It won't be
transparent to users: your compute nodes and other services still hang on
to old connections (RabbitMQ/AMQP etc.) and it takes some time for those to
settle.
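
If reconnects after a controller reboot are slow, shorter AMQP heartbeats
help services notice dead connections sooner. A rough sketch using
oslo.messaging's rabbit options (the values shown are the defaults, only a
starting point; tune them for your deployment) in nova.conf, neutron.conf
and so on:

    [oslo_messaging_rabbit]
    # consider a connection dead after 60s without a heartbeat
    heartbeat_timeout_threshold = 60
    # check the heartbeat twice within that window
    heartbeat_rate = 2
    # pause briefly before reconnecting to a surviving rabbit node
    kombu_reconnect_delay = 1.0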

Curious why you are running control-plane services on KVM, and a second
question: why do you need to reboot them frequently?

I have physical nodes for the control plane, and we see strange issues
whenever we take one of the controllers down for maintenance.

On Fri, May 12, 2023 at 2:59 PM Albert Braden <ozzzo at yahoo.com> wrote:

> We use keepalived and exabgp to manage failover for haproxy. That works,
> but it takes a few minutes, and during those few minutes customers
> experience impact. We tell them not to build/delete VMs during patching,
> but they still do, and then complain about the failures.
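>
> For reference, a minimal sketch of the keepalived side of this (the
> interface, VRID, VIP and check script are placeholders, not our exact
> config); advert_int and the check's interval/fall settings bound how
> fast the VIP moves:
>
>     vrrp_script chk_haproxy {
>         script "/usr/bin/killall -0 haproxy"   # succeeds while haproxy runs
>         interval 2    # run the check every 2s
>         fall 2        # mark the node faulty after 2 failures
>         weight -20    # drop priority so the peer takes over
>     }
>
>     vrrp_instance VI_API {
>         state MASTER
>         interface eth0
>         virtual_router_id 51
>         priority 150          # the peer runs with a lower priority
>         advert_int 1          # advertise every second
>         virtual_ipaddress {
>             192.0.2.10/24     # the API VIP
>         }
>         track_script {
>             chk_haproxy
>         }
>     }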
>
> We're planning to experiment with adding a "manual" haproxy failover to
> our patching automation, but I'm wondering if there is anything on the
> controller that needs to be failed over or disabled before rebooting the
> KVM. I looked at the "remove from cluster" and "add to cluster" procedures
> but that seems unnecessarily cumbersome for rebooting the KVM.
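>
> The "manual" failover we have in mind is roughly the following; the VIP
> and the checks are placeholders rather than our actual automation:
>
>     # on the active haproxy node, before rebooting its KVM host:
>     systemctl stop haproxy        # keepalived's check fails, the VIP moves
>     # confirm the VIP has left this node before proceeding
>     ip addr show | grep -q 192.0.2.10 && echo "VIP still here, waiting..."
>     # once the peer holds the VIP, it is safe to reboot the KVM host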
> On Friday, May 12, 2023, 03:42:42 AM EDT, Eugen Block <eblock at nde.ag>
> wrote:
>
>
> Hi Albert,
>
> how is your haproxy placement controlled, something like pacemaker or
> similar? I would always do a failover when I'm aware of an interruption
> (maintenance window); that should speed things up for clients. We have
> a pacemaker-controlled HA control plane, and it takes longer for
> pacemaker to notice that a resource is gone if I just reboot a server
> without failing over first. I have no benchmarks, though. There's
> always a risk of losing a couple of requests during the failover, but
> we haven't had complaints yet; I believe most of the components try to
> resend the lost messages. In one of our customers' clusters with many
> resources (they also use terraform) I haven't seen issues during a
> regular maintenance window. When they had a DNS outage a few months
> back it resulted in a mess and manual cleanup was necessary, but the
> regular failovers seem to work just fine.
> And I don't see rabbitmq issues either after rebooting a server;
> usually the haproxy (and virtual IP) failover suffices to prevent
> interruptions.
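>
> If the resources are pacemaker-managed, the failover itself is a single
> command. A sketch with made-up resource and node names (note that a
> move creates a location constraint you should clear afterwards):
>
>     # move the VIP (and anything colocated with it) off the node
>     pcs resource move vip-api controller2
>     # ... perform the maintenance / reboot ...
>     # remove the location constraint created by the move
>     pcs resource clear vip-api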
>
> Regards,
> Eugen
>
> Quoting Satish Patel <satish.txt at gmail.com>:
>
> > Are you running your stack on top of KVM virtual machines? How many
> > controller nodes do you have? It's mostly RabbitMQ that causes issues
> > if you restart controller nodes.
> >
> > On Thu, May 11, 2023 at 8:34 AM Albert Braden <ozzzo at yahoo.com> wrote:
> >
> >> We have our haproxy and controller nodes on KVM hosts. When those KVM
> >> hosts are restarted, customers who are building or deleting VMs see
> >> impact. VMs may go into error status, fail to get DNS records, fail
> >> to delete, etc. The obvious reason is that traffic being routed to
> >> the haproxy on the restarting KVM is lost. If we manually fail over
> >> haproxy before restarting the KVM, will that be sufficient to stop
> >> traffic from being lost, or do we also need to do something with the
> >> controller?