Thanks a lot for your replies.

I indeed forgot to say that I am using durable queues (i.e., I set amqp_durable_queues = true in the conf files of the OpenStack services).

I'll investigate the root cause of these network partitions further, but I implemented this rabbit cluster exactly to be able to manage such scenarios ... It looks like I could have a much more reliable system with a single rabbit instance ...

Moreover: is it normal/expected that it doesn't recover by itself?

Thanks, Massimo

On Fri, Jul 9, 2021 at 4:21 PM Fabian Zimmermann <dev.faz@gmail.com> wrote:
Hi,
On Fri, Jul 9, 2021 at 16:04, Sean Mooney <smooney@redhat.com> wrote:
At least from a nova perspective, if we send a cast, for example from the API, and it's lost, then we won't try to recover.
In the case of an RPC call, the timeout will fire and we will fail whatever operation we were doing.
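To make the cast/call difference above concrete, here is a minimal sketch in plain Python. It is a toy simulation, not oslo.messaging or nova code: the LossyBus class, its drop flag, and the message strings are all hypothetical names invented for illustration. It only shows that a lost cast goes unnoticed, while a lost call surfaces as a timeout.

```python
import queue


class LossyBus:
    """Toy stand-in for a message bus that may drop messages (hypothetical)."""

    def __init__(self, drop=False):
        self.drop = drop      # True simulates a broker that loses messages
        self.handled = []     # messages that actually reached the "server"

    def _deliver(self, msg, reply_q=None):
        if self.drop:
            return            # message silently lost, e.g. during a partition
        self.handled.append(msg)
        if reply_q is not None:
            reply_q.put("ok: " + msg)

    def cast(self, msg):
        # Fire-and-forget: no reply is expected, so a loss is never detected.
        self._deliver(msg)

    def call(self, msg, timeout=0.1):
        # Request/reply: wait for an answer and fail if none arrives in time.
        reply_q = queue.Queue()
        self._deliver(msg, reply_q)
        try:
            return reply_q.get(timeout=timeout)
        except queue.Empty:
            raise TimeoutError("no reply for " + repr(msg))
```

With LossyBus(drop=True), cast("start") returns normally even though nothing was handled, whereas call("start") raises TimeoutError, so the caller can fail the operation it was doing.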
Well, it's a lot better to have consistent state with a limited number of failed requests than to have a whole cluster stuck, and it normally affects only a limited number of requests (if any!). So I personally prefer - fail fast and restore :)
Fabian