Hello again,

Just a short update on the results of my tests.

I currently see two ways of running OpenStack with RabbitMQ:

1. Without durable queues and without replication: just one RabbitMQ process, which gets (somehow) restarted if it fails.
2. With durable queues and replication.
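
As a rough illustration of option 2, the sketch below collects the two settings involved as plain data: the oslo.messaging option `amqp_durable_queues` and a classic mirrored-queue policy with `ha-mode: all`. The policy name `ha-all` and the catch-all pattern `.*` are assumptions chosen for illustration, not what any particular deployment tool ships, and nothing here is applied to a broker.

```python
import json


def durable_replicated_settings(policy_name="ha-all", pattern=".*"):
    """Return the settings for 'durable queues + replication' as plain data."""
    # oslo.messaging side: set in the [oslo_messaging_rabbit] section
    # of each OpenStack service's configuration file.
    oslo_option = {
        "section": "oslo_messaging_rabbit",
        "key": "amqp_durable_queues",
        "value": "true",
    }
    # RabbitMQ side: definition of a classic mirrored-queue policy.
    # policy_name and pattern are illustrative assumptions.
    policy = {
        "name": policy_name,
        "pattern": pattern,
        "apply-to": "queues",
        "definition": {"ha-mode": "all"},
    }
    return oslo_option, policy


if __name__ == "__main__":
    option, policy = durable_replicated_settings()
    print(f'[{option["section"]}]')
    print(f'{option["key"]} = {option["value"]}')
    print(json.dumps(policy, indent=2))
```

Option 1 would be the opposite on both sides: `amqp_durable_queues` left at `false` and no mirroring policy at all.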

Any other combination of these settings leads, to a greater or lesser degree, to issues with:

* broken / non-working bindings
* broken queues

I think VEXXHOST is running (1) with their openstack-operator, for reasons.

I added [kolla] because kolla-ansible installs RabbitMQ with replication but without durable queues.

Could someone point me to the best way to get these findings into some official documentation?
I think a lot of installations out there will run into issues if a node fails under load.

 Fabian


On Thu, 13 Aug 2020 at 15:13, Fabian Zimmermann <dev.faz@gmail.com> wrote:
Hi,

I just did some short tests today in our test environment (without durable queues and without replication):

* started a Rally task to generate some load
* killed RabbitMQ with kill -9 on one node
* the Rally task immediately stopped and the cloud (mostly) stopped working

After some debugging I found (again) exchanges which had bindings to queues, but these bindings didn't forward any messages.
I wrote a small script to detect these broken bindings and will now check whether this is reproducible.
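
For readers who want to try something similar, here is a minimal sketch of one possible check. It is not the script mentioned above. It assumes the RabbitMQ management plugin is enabled and compares the lists returned by its `/api/bindings` and `/api/queues` endpoints. It only catches bindings whose destination queue no longer exists. A binding that exists alongside its queue but silently drops messages would need an active test, such as publishing a probe message. The helper names and sample data are hypothetical.

```python
import base64
import json
import urllib.request


def fetch(base_url, path, user, password):
    """GET a RabbitMQ management API endpoint and decode the JSON reply."""
    req = urllib.request.Request(f"{base_url}/api/{path}")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def dangling_bindings(bindings, queues):
    """Return bindings whose destination queue is missing from the queue list."""
    known = {(q["vhost"], q["name"]) for q in queues}
    return [
        b for b in bindings
        if b["destination_type"] == "queue"
        and (b["vhost"], b["destination"]) not in known
    ]


if __name__ == "__main__":
    # Hypothetical sample data in the shape the management API returns.
    sample_bindings = [
        {"vhost": "/", "source": "nova", "destination": "compute",
         "destination_type": "queue", "routing_key": "compute"},
        {"vhost": "/", "source": "nova", "destination": "compute.gone",
         "destination_type": "queue", "routing_key": "compute.gone"},
    ]
    sample_queues = [{"vhost": "/", "name": "compute"}]
    for b in dangling_bindings(sample_bindings, sample_queues):
        print(f'{b["vhost"]}: {b["source"]} -> {b["destination"]}')
```

Against a live broker, the two lists would come from `fetch(base_url, "bindings", user, password)` and `fetch(base_url, "queues", user, password)`, where `base_url` is the management endpoint, for example `http://localhost:15672`.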

Then I will try "durable queues" and "durable queues with replication" to see if this helps, even though I would expect
RabbitMQ to be able to handle this without these "hidden broken bindings".

This is just FYI.

 Fabian