On 7/12/21 9:20 AM, Massimo Sgaravatto wrote:
Thanks a lot for your replies
I indeed forgot to say that I am using durable queues (i.e. I set amqp_durable_queues = true in the conf files of the OpenStack services).
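For reference, that option is set per OpenStack service in its oslo.messaging config. A minimal sketch of what the stanza looks like in e.g. nova.conf (section name as in recent oslo.messaging releases; in older releases the option also lived under [DEFAULT]):

```ini
[oslo_messaging_rabbit]
# Declare queues as durable so they survive a broker restart.
amqp_durable_queues = true
```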
I'll investigate the root cause of these network partitions further, but I implemented this rabbit cluster exactly to be able to handle such scenarios ... It looks like I could have a much more reliable system with a single rabbit instance ...
Moreover: is it normal/expected that it doesn't recover by itself?
There is a pacemaker OCF RA [0] that automatically recovers from network partitions, mostly by resetting the Mnesia DB of failed nodes that cannot rejoin. [0] https://www.rabbitmq.com/pacemaker.html#auto-pacemaker
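Worth noting for the "doesn't recover itself" question: RabbitMQ's built-in partition handling defaults to "ignore", which leaves a partitioned cluster broken until an operator intervenes. A sketch of the alternative modes in rabbitmq.conf (which mode fits depends on your topology; autoheal trades consistency for availability):

```ini
# rabbitmq.conf -- partition handling strategy.
# ignore (default): do nothing, wait for an operator.
# pause_minority:   nodes in the minority side pause themselves.
# autoheal:         the losing side restarts and rejoins automatically.
cluster_partition_handling = autoheal
```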
Thanks, Massimo
On Fri, Jul 9, 2021 at 4:21 PM Fabian Zimmermann <dev.faz@gmail.com <mailto:dev.faz@gmail.com>> wrote:
Hi,
Am Fr., 9. Juli 2021 um 16:04 Uhr schrieb Sean Mooney <smooney@redhat.com <mailto:smooney@redhat.com>>:
> at least from a nova perspective, if we send a cast (for example from the api)
> and it's lost, then we won't try to recover.
>
> in the case of an rpc call, the timeout will fire and we will fail whatever
> operation we were doing
well, it's a lot better to have consistent state with a limited number of failed requests than to have the whole cluster stuck; normally only a limited number of requests (if any!) are affected at all. So I personally prefer - fail fast and restore :)
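The cast-vs-call distinction above can be sketched with a toy in-process queue. This is NOT oslo.messaging (class and method names here are hypothetical); it only illustrates the two failure modes Sean describes: a lost cast is silently gone, while a call fails fast with a timeout instead of hanging.

```python
import queue
import threading

class RpcTimeout(Exception):
    """Raised when a call gets no reply in time (fail fast)."""

class MiniRpc:
    """Toy stand-in for an RPC bus; not a real messaging library."""

    def __init__(self):
        self._inbox = queue.Queue()

    def cast(self, msg):
        # Fire-and-forget: no reply is expected, so if the message is
        # lost in transit, nothing on the sender side ever retries.
        self._inbox.put(msg)

    def call(self, msg, timeout):
        # Request/response: wait for a reply, but only up to `timeout`
        # seconds -- then raise instead of blocking the caller forever.
        reply_q = queue.Queue()
        self._inbox.put((msg, reply_q))
        try:
            return reply_q.get(timeout=timeout)
        except queue.Empty:
            raise RpcTimeout(f"no reply to {msg!r} within {timeout}s")

    def serve_one(self):
        # Minimal server loop body: handle one call and reply.
        msg, reply_q = self._inbox.get()
        reply_q.put(f"handled {msg}")
```

With a server thread running, `call` returns normally; with no consumer on the other side, it raises `RpcTimeout` after the deadline rather than stalling the whole operation, which is the "fail fast and restore" behaviour Fabian prefers.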
Fabian
-- Best regards, Bogdan Dobrelya, Irc #bogdando