[nova][neutron][oslo][ops] rabbit bindings issue
Hey all,

I would like to ask the community about a rabbit issue we hit from time to time. In our current architecture, we have a cluster of rabbits (3 nodes) for all our OpenStack services (mostly nova and neutron). When one node of this cluster is down, the cluster continues working (we use the pause_minority strategy). But sometimes the third server is not able to recover automatically and needs manual intervention. After this intervention, we restart the rabbitmq-server process, which is then able to join the cluster again.

At this point the cluster looks OK, everything is fine. BUT nothing works. Neutron and nova agents are not able to report back to the servers; they appear dead. The servers seem unable to consume messages, even though the exchanges, queues and bindings look good in rabbit.

What we see is that removing the bindings (using rabbitmqadmin delete binding or the web interface) and recreating them again (with the same routing key) brings the service back up and running. Doing this for all queues is really painful. Our next plan is to automate it, but has anyone in the community already seen this kind of issue?

Our bug looks like the one described in [1]. Someone recommends creating an Alternate Exchange. Has anyone already tried that?

FYI, we are running rabbit 3.8.2 (with OpenStack Stein). We had the same kind of issues with older versions of rabbit.

Thanks for your help.

[1] https://groups.google.com/forum/#!newtopic/rabbitmq-users/rabbitmq-users/zFh...

-- Arnaud Morin
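For illustration, a minimal sketch of what the delete-and-recreate workaround could look like against the RabbitMQ management HTTP API (the same API rabbitmqadmin talks to). Host, credentials, vhost and the exchange/queue pair are placeholders, not values from the deployment described above.

import requests

API = "http://rabbit-host:15672/api"   # management plugin endpoint (placeholder)
AUTH = ("guest", "guest")               # management credentials (placeholder)
VHOST = "%2F"                           # default vhost "/", URL-encoded

def recreate_bindings(exchange, queue):
    url = f"{API}/bindings/{VHOST}/e/{exchange}/q/{queue}"
    for b in requests.get(url, auth=AUTH).json():
        # delete the existing (possibly broken) binding ...
        requests.delete(f"{url}/{b['properties_key']}", auth=AUTH)
        # ... and recreate it with the same routing key
        requests.post(url, auth=AUTH,
                      json={"routing_key": b["routing_key"], "arguments": {}})

# example call with a made-up queue name
recreate_bindings("neutron", "q-agent-notifier-port-update")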
Hi,

we also have this issue. Our solution was (up to now) to delete the queues with a script or even reset the complete cluster. We just upgraded rabbitmq to the latest version - without luck.

Anyone else seeing this issue?

Fabian
We also see the issue. When it happens, stopping and restarting the rabbit cluster usually helps.

I thought the problem was caused by a wrong setting in the OpenStack services' conf files: I had missed these settings (which I am now going to add):

[oslo_messaging_rabbit]
rabbit_ha_queues = true
amqp_durable_queues = true

Cheers, Massimo
Hi,

don't know if durable queues help, but HA should be enabled by a rabbitmq policy, which (alone) doesn't seem to fix this (we have this active).

Fabian
If you can reproduce it with current versions, I would suggest filing an issue on https://github.com/rabbitmq/rabbitmq-server/issues/

The behavior you describe seems to match https://github.com/rabbitmq/rabbitmq-server/issues/1873 but the maintainers seem to think it's been fixed by a number of somewhat-related changes in 3.7.13, because nobody reported issues anymore :)
-- Thierry Carrez (ttx)
Hi,

just wrote some small scripts to reproduce our issue and sent a message to the rabbitmq-users list:

https://groups.google.com/d/msg/rabbitmq-users/eC8jc-YEt8s/s8K_0KnXDQAJ

Fabian
Hi,

I was just able to prove that "durable queues" seem to work around the issue. If I enable durable queues, I'm no longer able to reproduce my issue.

AFAIK durable queues have downsides - especially if a node fails and the queue is not (yet) synced. Does anyone have information about this?

Fabian
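For context, what "durable" means at the AMQP level, sketched with pika (illustrative only - in OpenStack services this is driven by the amqp_durable_queues option of oslo.messaging, not hand-written code; host and queue name are placeholders):

import pika

conn = pika.BlockingConnection(pika.ConnectionParameters("rabbit-host"))
ch = conn.channel()

# durable=True: the queue definition survives a broker restart
ch.queue_declare(queue="demo-durable", durable=True)

# delivery_mode=2: the message itself is also written to disk
ch.basic_publish(exchange="",
                 routing_key="demo-durable",
                 body=b"hello",
                 properties=pika.BasicProperties(delivery_mode=2))
conn.close()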
Hi,

just did some short tests today in our test environment (without durable queues and without replication):

* started a rally task to generate some load
* kill-9-ed rabbitmq on one node
* the rally task immediately stopped and the cloud (mostly) stopped working

After some debugging I found (again) exchanges which had bindings to queues, but these bindings didn't forward any msgs. I wrote a small script to detect these broken bindings and will now check if this is "reproducible".

Then I will try "durable queues" and "durable queues with replication" to see if this helps, even if I would expect rabbitmq to be able to handle this without these "hidden broken bindings".

This is just FYI.

Fabian
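One possible shape for such a probe (a rough sketch, not the script mentioned above): list all bindings via the management HTTP API and test-publish through each one with the mandatory flag, so the broker returns anything it cannot route. Test environments only - healthy bindings will receive the dummy message. Host, credentials and the probe payload are placeholders.

import pika
import requests

API, AUTH = "http://rabbit-host:15672/api", ("guest", "guest")

conn = pika.BlockingConnection(pika.ConnectionParameters("rabbit-host"))
ch = conn.channel()
ch.confirm_delivery()   # so unroutable returns surface as UnroutableError

for b in requests.get(f"{API}/bindings/%2F", auth=AUTH).json():
    if b["source"] == "" or b["destination_type"] != "queue":
        continue   # skip implicit default-exchange bindings
    try:
        ch.basic_publish(exchange=b["source"],
                         routing_key=b["routing_key"],
                         body=b"binding-probe",
                         mandatory=True)
    except pika.exceptions.UnroutableError:
        print(f"broken: {b['source']} -[{b['routing_key']}]-> {b['destination']}")

conn.close()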
Hello again,

just a short update about the results of my tests.

I currently see 2 ways of running openstack+rabbitmq:

1. without durable-queues and without replication - just one rabbitmq process which gets (somehow) restarted if it fails.
2. durable-queues and replication

Any other combination of these settings leads to more or less issues with

* broken / non-working bindings
* broken queues

I think vexxhost is running (1) with their openstack-operator - for reasons.

I added [kolla], because kolla-ansible is installing rabbitmq with replication but without durable-queues.

Could someone point me to the best way to document these findings in some official doc? I think a lot of installations out there will run into issues if - under load - a node fails.

Fabian
Fabian,

what do you mean by this?

> I think vexxhost is running (1) with their openstack-operator - for reasons.
Hi,

I read somewhere that vexxhost's kubernetes openstack-operator is running one rabbitmq container per service. Just the kubernetes self-healing is used as "ha" for rabbitmq.

That seems to match my finding: run rabbitmq standalone and use an external system to restart rabbitmq if required.

Fabian
On Fri, 2020-08-14 at 18:45 +0200, Fabian Zimmermann wrote:

That's the design that was originally planned for kolla-kubernetes: each service was to be deployed with its own rabbitmq server if it required one, and if it crashed it would just be recreated by k8s. It performs better than a cluster, and if you trust k8s or the external service enough to ensure it is recreated, it should be as effective a solution. You don't even need k8s to do that, but it seems to be a good fit if you're prepared to occasionally lose in-flight RPCs.

If you're not, then you can configure rabbit to persist all messages to disk and mount that on a shared file system like NFS or CephFS, so that when the rabbit instance is recreated the queue contents are preserved - assuming you can take the performance hit of writing all messages to disk, that is.
Hi Sean,

Sounds good, but running rabbitmq for each service is going to be a little overhead. Also, how do you scale the cluster? (Yes, we can use cells v2, but it's not something everyone likes to do because of the complexity.) If we think RabbitMQ is a growing pain, then why is the community not looking at alternative options (Kafka, etc.)?
Hi,

I already looked at oslo.messaging, but rabbitmq is the only stable driver :( Kafka is marked as experimental and (if the docs are correct) is only usable for notifications.

Would love to switch to an alternative.

Fabian
Hello,

Kind of off topic, but I've started doing some research to see if a KubeMQ driver could be added to oslo.messaging.

Best regards
On 8/16/20 3:48 AM, Tobias Urdin wrote:
You may want to take a look at https://docs.openstack.org/oslo.messaging/latest/contributor/supported-messa...

We've had bad luck with adding new drivers to oslo.messaging in the past, so we've tried to come up with a policy that gives them the best possible chance of being successful. It does set a rather high bar for integration though.

Also take a look at https://review.opendev.org/#/c/692784/ - a lot of the discussion there may be relevant to another new driver.
On Sat, 2020-08-15 at 20:13 -0400, Satish Patel wrote:
> Sounds good, but running rabbitmq for each service is going to be a little overhead. Also, how do you scale the cluster? (Yes, we can use cells v2, but it's not something everyone likes to do because of the complexity.)

My understanding is that when using rabbitmq, adding multiple rabbitmq servers in a cluster lowers throughput vs just one rabbitmq instance for any given exchange. That is because the contents of the queue need to be synchronised across the cluster. So if cinder, nova and neutron share a 3-node cluster and you compare that to the same services deployed with cinder, nova and neutron each having their own rabbitmq service, then the independent deployment will tend to outperform the clustered solution. I'm not really sure if that has changed - I know that how clustering is done has evolved over the years - but in the past clustering was the adversary of scaling.

> If we think RabbitMQ is a growing pain, then why is the community not looking at alternative options (Kafka, etc.)?

We have looked at alternatives several times; rabbitmq works well enough and scales well enough for most deployments. There are other AMQP implementations that scale better than rabbit - ActiveMQ and Qpid are both reported to scale better, but they perform worse out of the box and need to be carefully tuned. In the past ZeroMQ was supported, but people did not maintain it. Kafka I don't think is a good alternative, but NATS (https://nats.io/) might be.

For what it's worth, all nova deployments are cells v2 deployments with 1 cell from around Pike/Rocky, and it's really not that complex. cells_v1 was much more complex, but part of the redesign for cells_v2 was making sure there is only one code path. Adding a second cell just needs another cell DB and conductor to be deployed, assuming you started with a super conductor in the first place. The issue is that cells is only a nova feature - no other services have cells, so it does not help you with cinder or neutron. As such, cinder and neutron will likely be the services that hit scaling limits first. Adopting cells in other services is not necessarily the right approach either, but when we talk about scale we do need to keep in mind that cells is just for nova today.
Just to keep the list updated.

If you run with durable queues and replication, there is still a possibility that a short-living queue will *not* yet be replicated and a node failure will mark these queues as "unreachable". This wouldn't be a problem if openstack created a new queue, but I fear it would just try to reuse the existing one after reconnect.

So, after all, it seems the less buggy way would be to:

* use durable queues and replication for long-running queues/exchanges
* use non-durable queues without replication for short-lived (fanout, reply_) queues

This should allow the short-living ones to destroy themselves on node failure, and the long-living ones should be as available as possible.

Absolutely untested - so use with caution - but here is a possible policy regex:

^(?!amq\.)(?!reply_)(?!.*fanout).*

Fabian
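A rough sketch of how such a policy could be applied through the management HTTP API (rabbitmqctl set_policy works just as well). Only the pattern comes from the mail above; host, credentials, the policy name and the ha-sync-mode setting are assumptions to adapt to your setup.

import requests

API, AUTH = "http://rabbit-host:15672/api", ("guest", "guest")

policy = {
    "pattern": r"^(?!amq\.)(?!reply_)(?!.*fanout).*",              # from the mail above
    "definition": {"ha-mode": "all", "ha-sync-mode": "automatic"},  # assumed mirroring settings
    "apply-to": "queues",
    "priority": 0,
}
# PUT /api/policies/<vhost>/<name> creates or updates the policy
requests.put(f"{API}/policies/%2F/ha-long-lived", auth=AUTH, json=policy).raise_for_status()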
Hey Fabian,

I was thinking the same, and I found the "default" values from openstack-ansible:
https://github.com/openstack/openstack-ansible-rabbitmq_server/blob/fc27e735...

pattern: '^(?!(amq\.)|(.*_fanout_)|(reply_)).*'

which sets HA for everything except amq.*, *_fanout_* and reply_*.

So that would make sense?

-- Arnaud Morin
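A quick way to compare the two patterns is to run them over a few sample queue names (the names below are made up to resemble typical OpenStack queues; only the two regexes come from the mails above):

import re

patterns = {
    "mail above":        r"^(?!amq\.)(?!reply_)(?!.*fanout).*",
    "openstack-ansible": r"^(?!(amq\.)|(.*_fanout_)|(reply_)).*",
}
samples = ["notifications.info",
           "compute.node-1",
           "reply_0123456789abcdef",
           "q-agent-notifier-port-update_fanout_abc",
           "amq.gen-xyz"]

for name, pattern in patterns.items():
    mirrored = [q for q in samples if re.match(pattern, q)]
    print(name, "->", mirrored)
# both patterns mirror only the long-lived queues: notifications.info and compute.node-1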
Hi,

oh, that's great! So someone at openstack-ansible already detected this and just forgot to update docs.openstack.org ;)

I tested my regex and it seems to fix my issue (atm). I will run an openstack rally load test with the regex above to check what happens if I terminate a rabbitmq node while load is hitting the system.

Fabian
Hey Fabian,
I was thinking the same, and I found the "default" values from openstack-ansible: https://github.com/openstack/openstack-ansible-rabbitmq_server/blob/fc27e735...
pattern: '^(?!(amq\.)|(.*_fanout_)|(reply_)).*'
Which are setting HA for all except amq.* *_fanout_* reply_*
So that would make sense?
-- Arnaud Morin
On 17.08.20 - 16:03, Fabian Zimmermann wrote:
Just to keep the list updated.
If you run with durable_queues and replication, there is still a possibility, that a short living queue will *not* jet be replicated and a node failure will mark these queue as "unreachable". This wouldnt be a problem, if openstack would create a new queue, but i fear it would just try to reuse the existing after reconnect.
So, after all - it seems the less buggy way would be
* use durable-queue and replication for long-running queues/exchanges * use non-durable-queue without replication for short (fanout, reply_) queues
This should allow the short-living ones to destroy themselves on node failure, and the long-living ones should be able to be as available as possible.
Absolutely untested - so use with caution, but here is a possible policy-regex: ^(?!amq\.)(?!reply_)(?!.*fanout).*
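To sanity-check which queues a pattern like that would actually cover before applying it, something along these lines (again only a sketch, with the same assumptions about the management API endpoint and credentials) can list the matching and non-matching queues:

import re
import requests

RABBIT_API = "http://localhost:15672/api"   # assumption: management plugin enabled
AUTH = ("guest", "guest")                    # assumption: default credentials

# The proposed pattern: HA for everything except amq.*, reply_* and *fanout* queues.
PATTERN = re.compile(r"^(?!amq\.)(?!reply_)(?!.*fanout).*")

for q in requests.get(f"{RABBIT_API}/queues", auth=AUTH).json():
    tag = "HA" if PATTERN.match(q["name"]) else "no-HA"
    print(f"{tag:6} {q['vhost']:4} {q['name']}")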
Fabian
Am So., 16. Aug. 2020 um 15:37 Uhr schrieb Sean Mooney <smooney@redhat.com>:
On Sat, 2020-08-15 at 20:13 -0400, Satish Patel wrote:
Hi Sean,
Sounds good, but running rabbitmq for each service is going to add a little overhead too. Also, how do you scale the cluster? (Yes, we can use cellv2, but it's not something everyone likes to do because of the complexity.)
my understanding is that when using rabbitmq, adding multiple rabbitmq servers in a cluster lowers throughput vs just 1 rabbitmq instance for any given exchange. that is because the content of the queue needs to be synchronised across the cluster. so if cinder, nova and neutron share a 3 node cluster and you compare that to the same services deployed with cinder, nova and neutron each having their own rabbitmq service, then the independent deployment will tend to out perform the clustered solution. im not really sure if that has changed - i know that how clustering is done has evolved over the years - but in the past clustering was the adversary of scaling.
If we think rabbitMQ is a growing pain, then why is the community not looking for alternative options (kafka etc.)? we have looked at alternatives several times. rabbitmq works well enough and scales well enough for most deployments. there are other amqp implementations that scale better than rabbit; activemq and qpid are both reported to scale better, but they perform worse out of the box and need to be carefully tuned
in the past zeromq has been supported but people did not maintain it.
kafka i dont think is a good alternative but nats https://nats.io/ might be.
for what its worth all nova deployments are cellv2 deployments with 1 cell from around pike/rocky, and its really not that complex. cells_v1 was much more complex, but part of the redesign for cells_v2 was making sure there is only 1 code path. adding a second cell just needs another cell db and conductor to be deployed, assuming you started with a super conductor in the first place. the issue is cells is only a nova feature; no other service has cells, so it does not help you with cinder or neutron. as such, cinder and neutron will likely be the services that hit scaling limits first. adopting cells in other services is not necessarily the right approach either, but when we talk about scale we do need to keep in mind that cells is just for nova today.
On Fri, Aug 14, 2020 at 3:09 PM Sean Mooney <smooney@redhat.com> wrote:
On Fri, 2020-08-14 at 18:45 +0200, Fabian Zimmermann wrote:
Hi,
i read somewhere that vexxhosts kubernetes openstack-Operator is running one rabbitmq Container per Service. Just the kubernetes self healing is used as "ha" for rabbitmq.
That seems to match with my finding: run rabbitmq standalone and use an external system to restart rabbitmq if required.
thats the design that was originally planned for kolla-kubernetes
each service was to be deployed with its own rabbitmq server if it required one, and if it crashed it would just be recreated by k8s. it performs better than a cluster, and if you trust k8s or the external service enough to ensure it is recreated, it should be as effective a solution. you dont even need k8s to do that, but it seems to be a good fit if you are prepared to occasionally lose in-flight rpcs. if you are not, then you can configure rabbit to persist all messages to disk and mount that on a shared file system like nfs or cephfs, so that when the rabbit instance is recreated the queue contents are preserved. assuming you can take the performance hit of writing all messages to disk, that is.
Fabian
Sorry for the late reply Sean,
When you said cells is only a nova feature, what does that mean? Correct me if I am wrong here: "only nova" means I can deploy rabbitmq in cells to just handle nova-* services, but not neutron or any other services, right?
On Sun, Aug 16, 2020 at 9:37 AM Sean Mooney <smooney@redhat.com> wrote:
On Sat, 2020-08-15 at 20:13 -0400, Satish Patel wrote:
Hi Sean,
Sounds good, but running rabbitmq for each service is going to add a little overhead too. Also, how do you scale the cluster? (Yes, we can use cellv2, but it's not something everyone likes to do because of the complexity.)
my understanding is that when using rabbitmq, adding multiple rabbitmq servers in a cluster lowers throughput vs just 1 rabbitmq instance for any given exchange. that is because the content of the queue needs to be synchronised across the cluster. so if cinder, nova and neutron share a 3 node cluster and you compare that to the same services deployed with cinder, nova and neutron each having their own rabbitmq service, then the independent deployment will tend to out perform the clustered solution. im not really sure if that has changed - i know that how clustering is done has evolved over the years - but in the past clustering was the adversary of scaling.
If we think rabbitMQ is a growing pain, then why is the community not looking for alternative options (kafka etc.)? we have looked at alternatives several times. rabbitmq works well enough and scales well enough for most deployments. there are other amqp implementations that scale better than rabbit; activemq and qpid are both reported to scale better, but they perform worse out of the box and need to be carefully tuned
in the past zeromq has been supported but people did not maintain it.
kafka i dont think is a good alternative but nats https://nats.io/ might be.
for what its worth all nova deployments are cellv2 deployments with 1 cell from around pike/rocky, and its really not that complex. cells_v1 was much more complex, but part of the redesign for cells_v2 was making sure there is only 1 code path. adding a second cell just needs another cell db and conductor to be deployed, assuming you started with a super conductor in the first place. the issue is cells is only a nova feature; no other service has cells, so it does not help you with cinder or neutron. as such, cinder and neutron will likely be the services that hit scaling limits first. adopting cells in other services is not necessarily the right approach either, but when we talk about scale we do need to keep in mind that cells is just for nova today.
On Fri, Aug 14, 2020 at 3:09 PM Sean Mooney <smooney@redhat.com> wrote:
On Fri, 2020-08-14 at 18:45 +0200, Fabian Zimmermann wrote:
Hi,
i read somewhere that vexxhosts kubernetes openstack-Operator is running one rabbitmq Container per Service. Just the kubernetes self healing is used as "ha" for rabbitmq.
That seems to match with my finding: run rabbitmq standalone and use an external system to restart rabbitmq if required.
thats the design that was originally planned for kolla-kubernetes
each service was to be deployed with its own rabbitmq server if it required one, and if it crashed it would just be recreated by k8s. it performs better than a cluster, and if you trust k8s or the external service enough to ensure it is recreated, it should be as effective a solution. you dont even need k8s to do that, but it seems to be a good fit if you are prepared to occasionally lose in-flight rpcs. if you are not, then you can configure rabbit to persist all messages to disk and mount that on a shared file system like nfs or cephfs, so that when the rabbit instance is recreated the queue contents are preserved. assuming you can take the performance hit of writing all messages to disk, that is.
Fabian
Satish Patel <satish.txt@gmail.com> schrieb am Fr., 14. Aug. 2020, 16:59:
Fabian,
what do you mean?
> I think vexxhost is running (1) with their openstack-operator - for
reasons.
On Fri, Aug 14, 2020 at 7:28 AM Fabian Zimmermann <dev.faz@gmail.com> wrote:
Hello again,
just a short update about the results of my tests.
I currently see 2 ways of running openstack+rabbitmq
1. without durable-queues and without replication - just one
rabbitmq-process which gets (somehow) restarted if it fails.
2. durable-queues and replication
Any other combination of these settings leads to more or less issues with
* broken / non working bindings * broken queues
I think vexxhost is running (1) with their openstack-operator - for
reasons.
I added [kolla], because kolla-ansible is installing rabbitmq with
replication but without durable-queues.
May someone point me to the best way to document these findings to some
official doc?
I think a lot of installations out there will run into issues if - under
load - a node fails.
Fabian
Am Do., 13. Aug. 2020 um 15:13 Uhr schrieb Fabian Zimmermann <
dev.faz@gmail.com>:
> > Hi, > > just did some short tests today in our test-environment (without
durable queues and without replication):
> > * started a rally task to generate some load > * kill-9-ed rabbitmq on one node > * rally task immediately stopped and the cloud (mostly) stopped working > > after some debugging i found (again) exchanges which had bindings to
queues, but these bindings didnt forward any msgs.
> Wrote a small script to detect these broken bindings and will now check
if this is "reproducible"
> > then I will try "durable queues" and "durable queues with replication"
to see if this helps. Even if I would expect
> rabbitmq should be able to handle this without these "hidden broken
bindings"
> > This just FYI. > > Fabian
Hey all,
About the vexxhost strategy to use only one rabbit server and manage HA through rabbit. Do you plan to do the same for MariaDB/MySQL?
-- Arnaud Morin
On 14.08.20 - 18:45, Fabian Zimmermann wrote:
Hi,
i read somewhere that vexxhosts kubernetes openstack-Operator is running one rabbitmq Container per Service. Just the kubernetes self healing is used as "ha" for rabbitmq.
That seems to match with my finding: run rabbitmq standalone and use an external system to restart rabbitmq if required.
Fabian
Satish Patel <satish.txt@gmail.com> schrieb am Fr., 14. Aug. 2020, 16:59:
Fabian,
what do you mean?
I think vexxhost is running (1) with their openstack-operator - for reasons.
On Fri, Aug 14, 2020 at 7:28 AM Fabian Zimmermann <dev.faz@gmail.com> wrote:
Hello again,
just a short update about the results of my tests.
I currently see 2 ways of running openstack+rabbitmq
1. without durable-queues and without replication - just one
rabbitmq-process which gets (somehow) restarted if it fails.
2. durable-queues and replication
Any other combination of these settings leads to more or less issues with
* broken / non working bindings * broken queues
I think vexxhost is running (1) with their openstack-operator - for reasons.
I added [kolla], because kolla-ansible is installing rabbitmq with replication but without durable-queues.
May someone point me to the best way to document these findings to some official doc? I think a lot of installations out there will run into issues if - under load - a node fails.
Fabian
Am Do., 13. Aug. 2020 um 15:13 Uhr schrieb Fabian Zimmermann < dev.faz@gmail.com>:
Hi,
just did some short tests today in our test-environment (without
durable queues and without replication):
* started a rally task to generate some load * kill-9-ed rabbitmq on one node * rally task immediately stopped and the cloud (mostly) stopped working
after some debugging i found (again) exchanges which had bindings to
queues, but these bindings didnt forward any msgs.
Wrote a small script to detect these broken bindings and will now check if this is "reproducible"
then I will try "durable queues" and "durable queues with replication" to see if this helps. Even if I would expect rabbitmq should be able to handle this without these "hidden broken bindings"
This just FYI.
Fabian
Hi,
just another idea: Rabbitmq is able to count undelivered messages. We could use this information to detect the broken bindings (causing undeliverable messages).
Anyone already doing this? (A rough sketch of one way to probe for this is pasted at the end of this message.)
I currently don't have a way to reproduce the broken bindings, so I'm unable to prove the idea. Seems we have to wait for the issue to happen again - which - hopefully - never happens :)
Fabian
Arnaud Morin <arnaud.morin@gmail.com> schrieb am Di., 18. Aug. 2020, 14:07:
Hey all,
About the vexxhost strategy to use only one rabbit server and manage HA through rabbit. Do you plan to do the same for MariaDB/MySQL?
-- Arnaud Morin
On 14.08.20 - 18:45, Fabian Zimmermann wrote:
Hi,
i read somewhere that vexxhosts kubernetes openstack-Operator is running one rabbitmq Container per Service. Just the kubernetes self healing is used as "ha" for rabbitmq.
That seems to match with my finding: run rabbitmq standalone and use an external system to restart rabbitmq if required.
Fabian
Satish Patel <satish.txt@gmail.com> schrieb am Fr., 14. Aug. 2020, 16:59:
Fabian,
what do you mean?
I think vexxhost is running (1) with their openstack-operator - for reasons.
On Fri, Aug 14, 2020 at 7:28 AM Fabian Zimmermann <dev.faz@gmail.com> wrote:
Hello again,
just a short update about the results of my tests.
I currently see 2 ways of running openstack+rabbitmq
1. without durable-queues and without replication - just one
rabbitmq-process which gets (somehow) restarted if it fails.
2. durable-queues and replication
Any other combination of these settings leads to more or less issues with
* broken / non working bindings * broken queues
I think vexxhost is running (1) with their openstack-operator - for reasons.
I added [kolla], because kolla-ansible is installing rabbitmq with replication but without durable-queues.
May someone point me to the best way to document these findings to some official doc? I think a lot of installations out there will run into issues if - under load - a node fails.
Fabian
Am Do., 13. Aug. 2020 um 15:13 Uhr schrieb Fabian Zimmermann < dev.faz@gmail.com>:
Hi,
just did some short tests today in our test-environment (without
durable queues and without replication):
* started a rally task to generate some load * kill-9-ed rabbitmq on one node * rally task immediately stopped and the cloud (mostly) stopped
working
after some debugging i found (again) exchanges which had bindings to
queues, but these bindings didnt forward any msgs.
Wrote a small script to detect these broken bindings and will now check if this is "reproducible"
then I will try "durable queues" and "durable queues with replication" to see if this helps. Even if I would expect rabbitmq should be able to handle this without these "hidden broken bindings"
This just FYI.
Fabian
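Following up on the idea at the top of this message, here is a rough, untested sketch of how one could probe bindings, using the management API publish endpoint (it reports whether a message was routed to any queue). It assumes the management plugin on localhost:15672 with guest/guest on the default vhost, and note that consumers on healthy bindings will receive the probe payload, so this is only something for a test/dev cluster:

import requests

RABBIT_API = "http://localhost:15672/api"   # assumption: management plugin enabled
AUTH = ("guest", "guest")                    # assumption: default credentials
VHOST = "%2F"                                # default "/" vhost, URL-encoded

def bindings_by_exchange():
    """Group all bindings in the vhost by their source exchange."""
    out = {}
    for b in requests.get(f"{RABBIT_API}/bindings/{VHOST}", auth=AUTH).json():
        if b["source"]:                      # skip bindings from the default exchange
            out.setdefault(b["source"], []).append(b)
    return out

def probe(exchange, routing_key):
    """Publish a test message and return True if rabbit routed it anywhere."""
    body = {
        "properties": {},
        "routing_key": routing_key,
        "payload": "binding-probe, please ignore",
        "payload_encoding": "string",
    }
    r = requests.post(f"{RABBIT_API}/exchanges/{VHOST}/{exchange}/publish",
                      json=body, auth=AUTH)
    r.raise_for_status()
    return r.json().get("routed", False)

for exchange, binds in bindings_by_exchange().items():
    for b in binds:
        if not probe(exchange, b["routing_key"]):
            print(f"possibly broken binding: {exchange} -> "
                  f"{b['destination']} (key={b['routing_key']!r})")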
Hello, Are you doing that using alternate exchange ? I started configuring it in our env but not yet finished. Cheers, Le jeu. 20 août 2020 à 19:16, Fabian Zimmermann <dev.faz@gmail.com> a écrit :
Hi,
just another idea:
Rabbitmq is able to count undelivered messages. We could use this information to detect the broken bindings (causing undeliverable messages).
Anyone already doing this?
I currently don't have a way to reproduce the broken bindings, so I'm unable to proof the idea.
Seems we have to wait issue to happen again - what - hopefully - never happens :)
Fabian
Arnaud Morin <arnaud.morin@gmail.com> schrieb am Di., 18. Aug. 2020, 14:07:
Hey all,
About the vexxhost strategy to use only one rabbit server and manage HA through rabbit. Do you plan to do the same for MariaDB/MySQL?
-- Arnaud Morin
On 14.08.20 - 18:45, Fabian Zimmermann wrote:
Hi,
i read somewhere that vexxhosts kubernetes openstack-Operator is running one rabbitmq Container per Service. Just the kubernetes self healing is used as "ha" for rabbitmq.
That seems to match with my finding: run rabbitmq standalone and use an external system to restart rabbitmq if required.
Fabian
Satish Patel <satish.txt@gmail.com> schrieb am Fr., 14. Aug. 2020, 16:59:
Fabian,
what do you mean?
I think vexxhost is running (1) with their openstack-operator - for reasons.
On Fri, Aug 14, 2020 at 7:28 AM Fabian Zimmermann <dev.faz@gmail.com> wrote:
Hello again,
just a short update about the results of my tests.
I currently see 2 ways of running openstack+rabbitmq
1. without durable-queues and without replication - just one
rabbitmq-process which gets (somehow) restarted if it fails.
2. durable-queues and replication
Any other combination of these settings leads to more or less issues with
* broken / non working bindings * broken queues
I think vexxhost is running (1) with their openstack-operator - for reasons.
I added [kolla], because kolla-ansible is installing rabbitmq with replication but without durable-queues.
May someone point me to the best way to document these findings to some official doc? I think a lot of installations out there will run into issues if - under load - a node fails.
Fabian
Am Do., 13. Aug. 2020 um 15:13 Uhr schrieb Fabian Zimmermann < dev.faz@gmail.com>:
Hi,
just did some short tests today in our test-environment (without
durable queues and without replication):
* started a rally task to generate some load * kill-9-ed rabbitmq on one node * rally task immediately stopped and the cloud (mostly) stopped
working
after some debugging i found (again) exchanges which had bindings
to queues, but these bindings didnt forward any msgs.
Wrote a small script to detect these broken bindings and will now check if this is "reproducible"
then I will try "durable queues" and "durable queues with replication" to see if this helps. Even if I would expect rabbitmq should be able to handle this without these "hidden broken bindings"
This just FYI.
Fabian
Hi,
I don't understand what you mean with "alternate exchange"? I'm doing all my tests on my DEV-Env. It's a completely separated / dedicated (virtual) cluster.
I just enabled the feature and wrote a small script to read the metrics from the api.
I'm having some "dropped msg" in my cluster, just trying to figure out if they are "normal".
Fabian
Am Do., 20. Aug. 2020 um 21:28 Uhr schrieb Arnaud MORIN <arnaud.morin@gmail.com>:
Hello, Are you doing that using alternate exchange ? I started configuring it in our env but not yet finished.
Cheers,
Le jeu. 20 août 2020 à 19:16, Fabian Zimmermann <dev.faz@gmail.com> a écrit :
Hi,
just another idea:
Rabbitmq is able to count undelivered messages. We could use this information to detect the broken bindings (causing undeliverable messages).
Anyone already doing this?
I currently don't have a way to reproduce the broken bindings, so I'm unable to proof the idea.
Seems we have to wait issue to happen again - what - hopefully - never happens :)
Fabian
Arnaud Morin <arnaud.morin@gmail.com> schrieb am Di., 18. Aug. 2020, 14:07:
Hey all,
About the vexxhost strategy to use only one rabbit server and manage HA through rabbit. Do you plan to do the same for MariaDB/MySQL?
-- Arnaud Morin
On 14.08.20 - 18:45, Fabian Zimmermann wrote:
Hi,
i read somewhere that vexxhosts kubernetes openstack-Operator is running one rabbitmq Container per Service. Just the kubernetes self healing is used as "ha" for rabbitmq.
That seems to match with my finding: run rabbitmq standalone and use an external system to restart rabbitmq if required.
Fabian
Satish Patel <satish.txt@gmail.com> schrieb am Fr., 14. Aug. 2020, 16:59:
Fabian,
what do you mean?
> I think vexxhost is running (1) with their openstack-operator - for reasons.
On Fri, Aug 14, 2020 at 7:28 AM Fabian Zimmermann <dev.faz@gmail.com> wrote:
Hello again,
just a short update about the results of my tests.
I currently see 2 ways of running openstack+rabbitmq
1. without durable-queues and without replication - just one
rabbitmq-process which gets (somehow) restarted if it fails.
2. durable-queues and replication
Any other combination of these settings leads to more or less issues with
* broken / non working bindings * broken queues
I think vexxhost is running (1) with their openstack-operator - for reasons.
I added [kolla], because kolla-ansible is installing rabbitmq with replication but without durable-queues.
May someone point me to the best way to document these findings to some official doc? I think a lot of installations out there will run into issues if - under load - a node fails.
Fabian
Hey,
I am talking about that: https://www.rabbitmq.com/ae.html
Cheers,
-- Arnaud Morin
On 21.08.20 - 09:06, Fabian Zimmermann wrote:
Hi,
don't understand what you mean with "alternate exchange"? I'm doing all my tests on my DEV-Env? It's a completely separated / dedicated (virtual) cluster.
I just enabled the feature and wrote a small script to read the metrics from the api.
I'm having some "dropped msg" in my cluster, just trying to figure out if they are "normal".
Fabian
Am Do., 20. Aug. 2020 um 21:28 Uhr schrieb Arnaud MORIN <arnaud.morin@gmail.com>:
Hello, Are you doing that using alternate exchange ? I started configuring it in our env but not yet finished.
Cheers,
Le jeu. 20 août 2020 à 19:16, Fabian Zimmermann <dev.faz@gmail.com> a écrit :
Hi,
just another idea:
Rabbitmq is able to count undelivered messages. We could use this information to detect the broken bindings (causing undeliverable messages).
Anyone already doing this?
I currently don't have a way to reproduce the broken bindings, so I'm unable to proof the idea.
Seems we have to wait issue to happen again - what - hopefully - never happens :)
Fabian
Arnaud Morin <arnaud.morin@gmail.com> schrieb am Di., 18. Aug. 2020, 14:07:
Hey all,
About the vexxhost strategy to use only one rabbit server and manage HA through rabbit. Do you plan to do the same for MariaDB/MySQL?
-- Arnaud Morin
On 14.08.20 - 18:45, Fabian Zimmermann wrote:
Hi,
i read somewhere that vexxhosts kubernetes openstack-Operator is running one rabbitmq Container per Service. Just the kubernetes self healing is used as "ha" for rabbitmq.
That seems to match with my finding: run rabbitmq standalone and use an external system to restart rabbitmq if required.
Fabian
Satish Patel <satish.txt@gmail.com> schrieb am Fr., 14. Aug. 2020, 16:59:
Fabian,
what do you mean?
>> I think vexxhost is running (1) with their openstack-operator - for reasons.
Hi,
yeah, that's what I'm currently using.
I also tried to use the unroutable-counters, but these are only available for channels, which may not have any bindings, so there is no way to find the "root cause".
I created an AE "unroutable" and wrote a script to show me the msgs placed there (a sketch of a script along these lines is pasted at the end of this message). Currently I get:
--
20 Exchange: q-agent-notifier-network-delete_fanout, RoutingKey:
226 Exchange: q-agent-notifier-port-delete_fanout, RoutingKey:
88 Exchange: q-agent-notifier-port-update_fanout, RoutingKey:
388 Exchange: q-agent-notifier-security_group-update_fanout, RoutingKey:
--
I think I will start another thread to debug the reason for this, because it has nothing to do with "broken bindings".
Fabian
Am Fr., 21. Aug. 2020 um 10:13 Uhr schrieb Arnaud Morin <arnaud.morin@gmail.com>:
Hey, I am talking about that: https://www.rabbitmq.com/ae.html
Cheers,
-- Arnaud Morin
On 21.08.20 - 09:06, Fabian Zimmermann wrote:
Hi,
don't understand what you mean with "alternate exchange"? I'm doing all my tests on my DEV-Env? It's a completely separated / dedicated (virtual) cluster.
I just enabled the feature and wrote a small script to read the metrics from the api.
I'm having some "dropped msg" in my cluster, just trying to figure out if they are "normal".
Fabian
Am Do., 20. Aug. 2020 um 21:28 Uhr schrieb Arnaud MORIN <arnaud.morin@gmail.com>:
Hello, Are you doing that using alternate exchange ? I started configuring it in our env but not yet finished.
Cheers,
Le jeu. 20 août 2020 à 19:16, Fabian Zimmermann <dev.faz@gmail.com> a écrit :
Hi,
just another idea:
Rabbitmq is able to count undelivered messages. We could use this information to detect the broken bindings (causing undeliverable messages).
Anyone already doing this?
I currently don't have a way to reproduce the broken bindings, so I'm unable to proof the idea.
Seems we have to wait issue to happen again - what - hopefully - never happens :)
Fabian
Arnaud Morin <arnaud.morin@gmail.com> schrieb am Di., 18. Aug. 2020, 14:07:
Hey all,
About the vexxhost strategy to use only one rabbit server and manage HA through rabbit. Do you plan to do the same for MariaDB/MySQL?
-- Arnaud Morin
On 14.08.20 - 18:45, Fabian Zimmermann wrote:
Hi,
i read somewhere that vexxhosts kubernetes openstack-Operator is running one rabbitmq Container per Service. Just the kubernetes self healing is used as "ha" for rabbitmq.
That seems to match with my finding: run rabbitmq standalone and use an external system to restart rabbitmq if required.
Fabian
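For reference, a rough sketch of a script that produces that kind of per-exchange summary from the queue behind the alternate exchange. It assumes a queue literally named "unroutable" is bound to the AE, plus the usual management-plugin endpoint and credentials; the /get endpoint requeues what it reads, so the counts are only approximate:

from collections import Counter
import requests

RABBIT_API = "http://localhost:15672/api"   # assumption: management plugin enabled
AUTH = ("guest", "guest")                    # assumption: default credentials
QUEUE = "unroutable"                          # assumption: queue bound to the AE

# Peek at (and requeue) up to 1000 messages sitting in the alternate-exchange queue.
body = {"count": 1000, "ackmode": "ack_requeue_true", "encoding": "auto"}
msgs = requests.post(f"{RABBIT_API}/queues/%2F/{QUEUE}/get",
                     json=body, auth=AUTH).json()

counts = Counter((m["exchange"], m["routing_key"]) for m in msgs)
for (exchange, key), n in counts.most_common():
    print(f"{n} Exchange: {exchange}, RoutingKey: {key}")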
Hi,
just to keep you updated. It seems these "q-agent-notifier" exchanges are not used by every possible neutron driver/agent backend, so it seems to be fine to have unrouted msgs here.
I was (again) able to get some broken bindings in my dev cluster. The counters for "unrouted msg" are increased, but the msgs sent to these exchanges/bindings/queues are *NOT* placed in the alternate exchange.
That is quite bad: because of the above "normal" unrouted msgs, we cannot just use the counter as an "error indicator".
I think I will try to create a valid binding on the above exchanges, so these no longer increment the "unroutable" counter, and then use the counter as a monitoring target (a rough sketch of that is pasted at the end of this message).
Fabian
Am Fr., 21. Aug. 2020 um 10:28 Uhr schrieb Fabian Zimmermann <dev.faz@gmail.com>:
Hi,
yeah, that's what I'm currently using.
I also tried to use the unroutable-counters, but these are only available for channels, which may not have any bindings, so there is no way to find the "root cause"
I created an AE "unroutable" and wrote a script to show me the msgs placed here.. currently I get
--
20 Exchange: q-agent-notifier-network-delete_fanout, RoutingKey:
226 Exchange: q-agent-notifier-port-delete_fanout, RoutingKey:
88 Exchange: q-agent-notifier-port-update_fanout, RoutingKey:
388 Exchange: q-agent-notifier-security_group-update_fanout, RoutingKey:
--
I think I will start another thread to debug the reason for this, because it has nothing to do with "broken bindings".
Fabian
Am Fr., 21. Aug. 2020 um 10:13 Uhr schrieb Arnaud Morin <arnaud.morin@gmail.com>:
Hey, I am talking about that: https://www.rabbitmq.com/ae.html
Cheers,
-- Arnaud Morin
On 21.08.20 - 09:06, Fabian Zimmermann wrote:
Hi,
don't understand what you mean with "alternate exchange"? I'm doing all my tests on my DEV-Env? It's a completely separated / dedicated (virtual) cluster.
I just enabled the feature and wrote a small script to read the metrics from the api.
I'm having some "dropped msg" in my cluster, just trying to figure out if they are "normal".
Fabian
Am Do., 20. Aug. 2020 um 21:28 Uhr schrieb Arnaud MORIN <arnaud.morin@gmail.com>:
Hello, Are you doing that using alternate exchange ? I started configuring it in our env but not yet finished.
Cheers,
Le jeu. 20 août 2020 à 19:16, Fabian Zimmermann <dev.faz@gmail.com> a écrit :
Hi,
just another idea:
Rabbitmq is able to count undelivered messages. We could use this information to detect the broken bindings (causing undeliverable messages).
Anyone already doing this?
I currently don't have a way to reproduce the broken bindings, so I'm unable to proof the idea.
Seems we have to wait issue to happen again - what - hopefully - never happens :)
Fabian
Arnaud Morin <arnaud.morin@gmail.com> schrieb am Di., 18. Aug. 2020, 14:07:
Hey all,
About the vexxhost strategy to use only one rabbit server and manage HA through rabbit. Do you plan to do the same for MariaDB/MySQL?
-- Arnaud Morin
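And a sketch of the "valid binding" idea mentioned at the start of this message: declare a small throwaway queue and bind it to the affected q-agent-notifier-*_fanout exchanges, so publishes there always have at least one route and stop inflating the unroutable counter. Untested, with the same endpoint/credential assumptions; the exchange list is simply the one from the output earlier in the thread:

import requests

RABBIT_API = "http://localhost:15672/api"   # assumption: management plugin enabled
AUTH = ("guest", "guest")                    # assumption: default credentials
SINK_QUEUE = "unroutable-sink"               # hypothetical catch-all queue name

EXCHANGES = [                                # taken from the earlier output
    "q-agent-notifier-network-delete_fanout",
    "q-agent-notifier-port-delete_fanout",
    "q-agent-notifier-port-update_fanout",
    "q-agent-notifier-security_group-update_fanout",
]

# Declare the sink queue (non-durable, with a TTL and length cap so it cannot grow).
queue_def = {"durable": False, "auto_delete": False,
             "arguments": {"x-message-ttl": 60000, "x-max-length": 1000}}
requests.put(f"{RABBIT_API}/queues/%2F/{SINK_QUEUE}",
             json=queue_def, auth=AUTH).raise_for_status()

# Bind it to each fanout exchange; the routing key is ignored for fanout exchanges.
for ex in EXCHANGES:
    requests.post(f"{RABBIT_API}/bindings/%2F/e/{ex}/q/{SINK_QUEUE}",
                  json={"routing_key": "", "arguments": {}}, auth=AUTH
                  ).raise_for_status()
    print("bound", ex, "->", SINK_QUEUE)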
On Tue, Aug 18, 2020 at 8:11 AM Arnaud Morin <arnaud.morin@gmail.com> wrote:
Hey all,
About the vexxhost strategy to use only one rabbit server and manage HA through rabbit. Do you plan to do the same for MariaDB/MySQL?
We use a MySQL operator to deploy a good ol' master/slave replication cluster and point every service towards the master, for a few reasons:
1) We always pointed to a master Galera system anyway; multi-master was overcomplicated for no real advantage
2) The failover time vs the complexity of Galera (and how often we failover) favours #1
3) We use "orchestrator" by GitHub, which manages all the promotions etc. for us
-- Arnaud Morin
On 14.08.20 - 18:45, Fabian Zimmermann wrote:
Hi,
i read somewhere that vexxhosts kubernetes openstack-Operator is running one rabbitmq Container per Service. Just the kubernetes self healing is used as "ha" for rabbitmq.
That seems to match with my finding: run rabbitmq standalone and use an external system to restart rabbitmq if required.
Fabian
Satish Patel <satish.txt@gmail.com> schrieb am Fr., 14. Aug. 2020, 16:59:
Fabian,
what do you mean?
I think vexxhost is running (1) with their openstack-operator - for reasons.
On Fri, Aug 14, 2020 at 7:28 AM Fabian Zimmermann <dev.faz@gmail.com> wrote:
Hello again,
just a short update about the results of my tests.
I currently see 2 ways of running openstack+rabbitmq
1. without durable-queues and without replication - just one
rabbitmq-process which gets (somehow) restarted if it fails.
2. durable-queues and replication
Any other combination of these settings leads to more or less issues with
* broken / non working bindings * broken queues
I think vexxhost is running (1) with their openstack-operator - for reasons.
I added [kolla], because kolla-ansible is installing rabbitmq with replication but without durable-queues.
May someone point me to the best way to document these findings to some official doc? I think a lot of installations out there will run into issues if - under load - a node fails.
Fabian
Am Do., 13. Aug. 2020 um 15:13 Uhr schrieb Fabian Zimmermann < dev.faz@gmail.com>:
Hi,
just did some short tests today in our test-environment (without
durable queues and without replication):
* started a rally task to generate some load * kill-9-ed rabbitmq on one node * rally task immediately stopped and the cloud (mostly) stopped working
after some debugging i found (again) exchanges which had bindings to
queues, but these bindings didnt forward any msgs.
Wrote a small script to detect these broken bindings and will now check if this is "reproducible"
then I will try "durable queues" and "durable queues with replication" to see if this helps. Even if I would expect rabbitmq should be able to handle this without these "hidden broken bindings"
This just FYI.
Fabian
-- Mohammed Naser VEXXHOST, Inc.
Thanks for those tips, I will check both values asap.
About the complete reset of the cluster, this is also what we used to do, but it has some downsides (such as the need to restart all agents, services, etc.).
Cheers,
-- Arnaud Morin
On 08.08.20 - 15:06, Fabian Zimmermann wrote:
Hi,
dont know if durable queues help, but should be enabled by rabbitmq policy which (alone) doesnt seem to fix this (we have this active)
Fabian
Massimo Sgaravatto <massimo.sgaravatto@gmail.com> schrieb am Sa., 8. Aug. 2020, 09:36:
We also see the issue. When it happens stopping and restarting the rabbit cluster usually helps.
I thought the problem was because of a wrong setting in the openstack services conf files: I missed these settings (that I am now going to add):
[oslo_messaging_rabbit] rabbit_ha_queues = true amqp_durable_queues = true
Cheers, Massimo
On Sat, Aug 8, 2020 at 6:34 AM Fabian Zimmermann <dev.faz@gmail.com> wrote:
Hi,
we also have this issue.
Our solution was (up to now) to delete the queues with a script or even reset the complete cluster.
We just upgraded rabbitmq to the latest version - without luck.
Anyone else seeing this issue?
Fabian
Arnaud Morin <arnaud.morin@gmail.com> schrieb am Do., 6. Aug. 2020, 16:47:
Hey all,
I would like to ask the community about a rabbit issue we have from time to time.
In our current architecture, we have a cluster of rabbits (3 nodes) for all our OpenStack services (mostly nova and neutron).
When one node of this cluster is down, the cluster continue working (we use pause_minority strategy). But, sometimes, the third server is not able to recover automatically and need a manual intervention. After this intervention, we restart the rabbitmq-server process, which is then able to join the cluster back.
At this time, the cluster looks ok, everything is fine. BUT, nothing works. Neutron and nova agents are not able to report back to servers. They appear dead. Servers seems not being able to consume messages. The exchanges, queues, bindings seems good in rabbit.
What we see is that removing bindings (using rabbitmqadmin delete binding or the web interface) and recreate them again (using the same routing key) brings the service back up and running.
Doing this for all queues is really painful. Our next plan is to automate it, but is there anyone in the community already saw this kind of issues?
Our bug looks like the one described in [1]. Someone recommands to create an Alternate Exchange. Is there anyone already tried that?
FYI, we are running rabbit 3.8.2 (with OpenStack Stein). We had the same kind of issues using older version of rabbit.
Thanks for your help.
[1] https://groups.google.com/forum/#!newtopic/rabbitmq-users/rabbitmq-users/zFh...
-- Arnaud Morin
participants (9)
- Arnaud Morin
- Ben Nemec
- Fabian Zimmermann
- Massimo Sgaravatto
- Mohammed Naser
- Satish Patel
- Sean Mooney
- Thierry Carrez
- Tobias Urdin