[nova][neutron][oslo][ops][kolla] rabbit bindings issue

Fabian Zimmermann dev.faz at gmail.com
Fri Aug 21 11:29:15 UTC 2020


Hi,

just to keep you updated.

It seems these "q-agent-notifier" exchanges are not used by every
possible Neutron driver/agent backend, so it appears to be normal to
have some unrouted messages there.

I was (again) able to get some broken bindings in my dev cluster.
The counter for "unrouted msgs" increases, but the messages sent to
these exchanges/bindings/queues are *NOT* placed in the
alternate exchange.

That's quite bad: because of the "normal" unrouted messages described
above, we cannot simply use the counter as an error indicator.

I think I will try to create a valid binding on the exchanges above, so
they no longer increment the "unroutable" counter, and then use the
counter as a monitoring target.
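A minimal sketch of that idea, assuming the RabbitMQ management plugin is reachable (the queue name "benign-unrouted-sink" and the credentials are illustrative placeholders; the exchange names come from the unroutable-message output earlier in this thread):

```python
import base64
import json
import urllib.request
from collections import namedtuple

# Exchanges that legitimately produce unrouted messages (taken from the
# AE output in this thread); adjust to what your own script reports.
NOISY_EXCHANGES = [
    "q-agent-notifier-network-delete_fanout",
    "q-agent-notifier-port-delete_fanout",
    "q-agent-notifier-port-update_fanout",
    "q-agent-notifier-security_group-update_fanout",
]

Request = namedtuple("Request", ["method", "path", "body"])


def binding_requests(exchanges, queue="benign-unrouted-sink", vhost="%2F"):
    """Build the management-API calls that create one sink queue and bind
    it to every noisy exchange, so fanout messages always have a route and
    no longer bump the global "unroutable" counter."""
    reqs = [Request("PUT", f"/api/queues/{vhost}/{queue}",
                    json.dumps({"durable": False, "auto_delete": False}))]
    for ex in exchanges:
        reqs.append(Request("POST", f"/api/bindings/{vhost}/e/{ex}/q/{queue}",
                            json.dumps({"routing_key": ""})))
    return reqs


def apply_requests(reqs, host="localhost:15672", user="guest", pw="guest"):
    """Send the prepared requests to a broker (guest/guest is only the
    default of a stock dev install)."""
    auth = base64.b64encode(f"{user}:{pw}".encode()).decode()
    for r in reqs:
        req = urllib.request.Request(
            "http://" + host + r.path, data=r.body.encode(), method=r.method,
            headers={"Authorization": "Basic " + auth,
                     "Content-Type": "application/json"})
        urllib.request.urlopen(req)
```

With these bindings in place, any further increase of the unroutable counter should point at a real problem rather than at the known-benign fanout traffic.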

 Fabian


On Fri, Aug 21, 2020 at 10:28 AM Fabian Zimmermann
<dev.faz at gmail.com> wrote:
>
> Hi,
>
> yeah, that's what I'm currently using.
>
> I also tried to use the unroutable counters, but these are only
> available for channels, which may not have any bindings, so there is
> no way to find the root cause.
>
> I created an AE "unroutable" and wrote a script to show me the messages
> placed there. Currently I get:
>
> --
>      20 Exchange: q-agent-notifier-network-delete_fanout, RoutingKey:
>     226 Exchange: q-agent-notifier-port-delete_fanout, RoutingKey:
>      88 Exchange: q-agent-notifier-port-update_fanout, RoutingKey:
>     388 Exchange: q-agent-notifier-security_group-update_fanout, RoutingKey:
> --
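A script like the one described above could be sketched as follows (a hypothetical reconstruction, not the author's actual script; it peeks at the AE queue through the management API's documented `get` endpoint without consuming the messages):

```python
import json
from collections import Counter


def drain_requeue_body(count=1000):
    """Request body for POST /api/queues/<vhost>/unroutable/get that
    peeks at up to `count` messages and requeues them afterwards."""
    return json.dumps({"count": count, "ackmode": "ack_requeue_true",
                       "encoding": "auto"})


def summarize(messages):
    """Count drained AE messages by original exchange and routing key,
    producing lines like "     20 Exchange: <name>, RoutingKey: <key>"."""
    counts = Counter((m["exchange"], m.get("routing_key", ""))
                     for m in messages)
    return [f"{n:7d} Exchange: {exchange}, RoutingKey: {key}"
            for (exchange, key), n in sorted(counts.items())]
```

Each message returned by the `get` endpoint carries the `exchange` and `routing_key` it was originally published with, which is what makes this per-exchange breakdown possible.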
>
> I think I will start another thread to debug the reason for this,
> because it has nothing to do with "broken bindings".
>
>  Fabian
>
> On Fri, Aug 21, 2020 at 10:13 AM Arnaud Morin
> <arnaud.morin at gmail.com> wrote:
> >
> > Hey,
> > I am talking about that:
> > https://www.rabbitmq.com/ae.html
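For readers following along: an alternate exchange can be attached to a whole group of existing exchanges via a policy instead of per-exchange arguments. A sketch of the policy definition for PUT /api/policies/<vhost>/<name> (the AE name "unroutable" mirrors this thread; the pattern and policy name are assumptions):

```python
import json


def ae_policy_body(ae_name="unroutable", pattern="^q-agent-notifier-"):
    """Policy definition that gives every exchange matching `pattern`
    the exchange `ae_name` as alternate exchange, so messages those
    exchanges cannot route land there instead of being dropped."""
    return json.dumps({
        "pattern": pattern,
        "definition": {"alternate-exchange": ae_name},
        "apply-to": "exchanges",
        "priority": 0,
    })
```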
> >
> > Cheers,
> >
> > --
> > Arnaud Morin
> >
> > On 21.08.20 - 09:06, Fabian Zimmermann wrote:
> > > Hi,
> > >
> > > I don't understand what you mean by "alternate exchange". I'm doing
> > > all my tests in my dev environment; it's a completely separate,
> > > dedicated (virtual) cluster.
> > >
> > > I just enabled the feature and wrote a small script to read the
> > > metrics from the API.
> > >
> > > I'm seeing some dropped messages in my cluster and am just trying to
> > > figure out whether they are "normal".
> > >
> > >  Fabian
> > >
> > > On Thu, Aug 20, 2020 at 9:28 PM Arnaud MORIN
> > > <arnaud.morin at gmail.com> wrote:
> > > >
> > > > Hello,
> > > > Are you doing that using an alternate exchange?
> > > > I started configuring it in our env but haven't finished yet.
> > > >
> > > > Cheers,
> > > >
> > > > On Thu, Aug 20, 2020 at 7:16 PM Fabian Zimmermann <dev.faz at gmail.com> wrote:
> > > >>
> > > >> Hi,
> > > >>
> > > >> just another idea:
> > > >>
> > > >> RabbitMQ is able to count undelivered messages. We could use this information to detect broken bindings (which cause undeliverable messages).
> > > >>
> > > >> Anyone already doing this?
> > > >>
> > > >> I currently don't have a way to reproduce the broken bindings, so I'm unable to prove the idea.
> > > >>
> > > >> Seems we have to wait for the issue to happen again - which, hopefully, never happens :)
> > > >>
> > > >>  Fabian
> > > >>
> > > >> On Tue, Aug 18, 2020 at 2:07 PM Arnaud Morin <arnaud.morin at gmail.com> wrote:
> > > >>>
> > > >>> Hey all,
> > > >>>
> > > >>> About the vexxhost strategy of using only one rabbit server and managing HA
> > > >>> around rabbit:
> > > >>> Do you plan to do the same for MariaDB/MySQL?
> > > >>>
> > > >>> --
> > > >>> Arnaud Morin
> > > >>>
> > > >>> On 14.08.20 - 18:45, Fabian Zimmermann wrote:
> > > >>> > Hi,
> > > >>> >
> > > >>> > I read somewhere that vexxhost's Kubernetes openstack-operator is running
> > > >>> > one RabbitMQ container per service. Just the Kubernetes self-healing is
> > > >>> > used as "HA" for RabbitMQ.
> > > >>> >
> > > >>> > That seems to match my finding: run RabbitMQ standalone and use an
> > > >>> > external system to restart it if required.
> > > >>> >
> > > >>> >  Fabian
> > > >>> >
> > > >>> > On Fri, Aug 14, 2020 at 4:59 PM Satish Patel <satish.txt at gmail.com> wrote:
> > > >>> >
> > > >>> > > Fabian,
> > > >>> > >
> > > >>> > > what do you mean?
> > > >>> > >
> > > >>> > > >> I think vexxhost is running (1) with their openstack-operator - for
> > > >>> > > reasons.
> > > >>> > >
> > > >>> > > On Fri, Aug 14, 2020 at 7:28 AM Fabian Zimmermann <dev.faz at gmail.com>
> > > >>> > > wrote:
> > > >>> > > >
> > > >>> > > > Hello again,
> > > >>> > > >
> > > >>> > > > just a short update about the results of my tests.
> > > >>> > > >
> > > >>> > > > I currently see 2 ways of running OpenStack + RabbitMQ:
> > > >>> > > >
> > > >>> > > > 1. without durable queues and without replication - just one
> > > >>> > > rabbitmq process which gets (somehow) restarted if it fails.
> > > >>> > > > 2. with durable queues and replication
> > > >>> > > >
> > > >>> > > > Any other combination of these settings leads to more or less issues with:
> > > >>> > > >
> > > >>> > > > * broken / non-working bindings
> > > >>> > > > * broken queues
> > > >>> > > >
> > > >>> > > > I think vexxhost is running (1) with their openstack-operator - for
> > > >>> > > reasons.
> > > >>> > > >
> > > >>> > > > I added [kolla] because kolla-ansible installs rabbitmq with
> > > >>> > > replication but without durable queues.
> > > >>> > > >
> > > >>> > > > Could someone point me to the best way to get these findings into some
> > > >>> > > official doc?
> > > >>> > > > I think a lot of installations out there will run into issues if a node
> > > >>> > > fails under load.
> > > >>> > > >
> > > >>> > > >  Fabian
> > > >>> > > >
> > > >>> > > >
> > > >>> > > > On Thu, Aug 13, 2020 at 3:13 PM Fabian Zimmermann <
> > > >>> > > dev.faz at gmail.com> wrote:
> > > >>> > > >>
> > > >>> > > >> Hi,
> > > >>> > > >>
> > > >>> > > >> I just did some short tests today in our test environment (without
> > > >>> > > durable queues and without replication):
> > > >>> > > >>
> > > >>> > > >> * started a rally task to generate some load
> > > >>> > > >> * kill-9-ed rabbitmq on one node
> > > >>> > > >> * rally task immediately stopped and the cloud (mostly) stopped working
> > > >>> > > >>
> > > >>> > > >> After some debugging I found (again) exchanges which had bindings to
> > > >>> > > queues, but these bindings didn't forward any messages.
> > > >>> > > >> I wrote a small script to detect these broken bindings and will now check
> > > >>> > > whether this is reproducible.
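One way such a detection script could work (a sketch, not the author's actual script) is to publish a probe message to each exchange through the management API's publish endpoint, whose response reports whether the message was routed to any queue. Whether this catches the "hidden" broken bindings described here depends on where the routing actually fails, so treat it as a starting point:

```python
import json


def probe_body(routing_key=""):
    """Body for POST /api/exchanges/<vhost>/<exchange>/publish; the
    response contains {"routed": true|false}."""
    return json.dumps({"properties": {}, "routing_key": routing_key,
                       "payload": "binding-probe",
                       "payload_encoding": "string"})


def broken(bindings, routed_results):
    """Given [(exchange, routing_key)] probes and the corresponding
    "routed" flags from the API responses, return the bindings that
    silently dropped the probe, i.e. the broken-binding candidates."""
    return [b for b, routed in zip(bindings, routed_results) if not routed]
```

A stricter variant would publish the probe and then confirm it actually arrived in the bound queue, since a broken binding might still be reported as routed.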
> > > >>> > > >>
> > > >>> > > >> Then I will try "durable queues" and "durable queues with replication"
> > > >>> > > to see if that helps, even though I would expect
> > > >>> > > >> RabbitMQ to be able to handle this without these "hidden broken
> > > >>> > > bindings".
> > > >>> > > >>
> > > >>> > > >> This is just FYI.
> > > >>> > > >>
> > > >>> > > >>  Fabian
> > > >>> > >


