Cool, thanks for the summary. You seem to have this under control, so I might bravely run away. I definitely think these are issues that deserve a backport when the time comes.

Michael

On Tue, Feb 5, 2019 at 1:52 PM Harald Jensås <hjensas@redhat.com> wrote:
On Tue, 2019-02-05 at 09:54 +1100, Michael Still wrote:
> Hi,
>
> I’ve been chasing a bug in ironic’s neutron agent for the last few
> days and I think it’s time to ask for some advice.
>

I'm working on the same issue. (In fact there are two issues.)

> Specifically, I was asked to debug why a set of controllers was using
> so much RAM, and the answer was that rabbitmq had a queue called
> ironic-neutron-agent-heartbeat.info with 800,000 messages enqueued.
> This notification queue is used by ironic’s neutron agent to
> calculate the hash ring. I have been able to duplicate this issue in
> a stock kolla-ansible install with ironic turned on but no bare metal
> nodes enrolled in ironic. About 0.6 messages are queued per second.
>
> I added some debugging code (hence the thread yesterday about
> mangling the code kolla deploys), and I can see that the messages in
> the queue are being read by the ironic neutron agent and acked
> correctly. However, they are not removed from the queue.
>
> You can see your queue size while using kolla with this command:
>
> docker exec rabbitmq rabbitmqctl list_queues messages name
> messages_ready consumers  | sort -n | tail -1
>
> My stock install that’s been running for about 12 hours currently has
> 8,244 messages in that queue.
>
> Where I’m a bit stumped is that I had assumed the messages weren’t
> being acked correctly, which is not the case. Is there something
> obvious about notification queues, such as them being persistent,
> that I’ve missed in my general ignorance of the underlying
> implementation of notifications?
>

I opened an oslo.messaging bug[1] yesterday: when notifications are
used and every consumer consumes through one or more pools, the
messages on the base notification queue are never picked up. The
ironic-neutron-agent uses pools for all listeners in its hash-ring
member manager, so notifications are published to the
'ironic-neutron-agent-heartbeat.info' queue but never consumed.
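
For reference, a pooled notification listener in oslo.messaging looks
roughly like the sketch below (the endpoint class and pool name are
illustrative, not the agent's actual code). When every consumer passes
a pool, only the per-pool queues get consumers, while the plain
'ironic-neutron-agent-heartbeat.info' queue keeps accumulating a copy
of every notification:

    import uuid

    import oslo_messaging
    from oslo_config import cfg


    class HeartbeatEndpoint(object):
        # Called for notifications published at 'info' priority.
        def info(self, ctxt, publisher_id, event_type, payload, metadata):
            pass  # e.g. rebuild the hash ring from the reported member


    transport = oslo_messaging.get_notification_transport(cfg.CONF)
    targets = [oslo_messaging.Target(topic='ironic-neutron-agent-heartbeat')]

    # Every listener passes a pool, so only the pool queues are drained.
    pool = 'ironic-neutron-agent-' + str(uuid.uuid4())  # illustrative name
    pooled_listener = oslo_messaging.get_notification_listener(
        transport, targets, [HeartbeatEndpoint()], pool=pool)
    pooled_listener.start()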

The second issue: each instance of the agent uses its own pool to
ensure all agents are notified about the existence of peer agents. The
pools use a UUID that is generated at startup (and re-generated on
restart, stop/start etc.). In the case where
`[oslo_messaging_rabbit]/amqp_auto_delete = false` in the neutron
config, these UUID queues are not automatically removed. So after a
restart of the ironic-neutron-agent, the queue with the old UUID is
left in the message broker with no consumers, growing ...
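
For completeness, the option named above is a service-wide knob in
neutron's configuration; flipping it globally would look like the
snippet below, but it then applies to every queue the service
declares, which is why the plan further down is to scope it to just
these queues instead:

    [oslo_messaging_rabbit]
    amqp_auto_delete = true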


I intend to push patches to fix both issues. As a workaround (or the
permanent solution) I will create another listener that consumes the
notifications without a pool. This should fix the first issue.
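
As a minimal sketch of that fix, reusing the transport, targets and
endpoint from the earlier snippet, the extra listener simply omits the
pool argument so it drains the plain notification queue:

    # A listener without a pool consumes the plain
    # 'ironic-neutron-agent-heartbeat.info' queue that otherwise only grows.
    drain_listener = oslo_messaging.get_notification_listener(
        transport, targets, [HeartbeatEndpoint()])
    drain_listener.start()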

The second change will set amqp_auto_delete to 'true' for these
specific queues, regardless of the service-wide setting. What I'm
currently stuck on is that I need to change the control_exchange for
the transport. According to the oslo.messaging documentation it should
be possible to override the control_exchange in the transport_url[3].
The idea is to set amqp_auto_delete and an ironic-neutron-agent
specific exchange on the URL when setting up the transport for
notifications, but so far I believe the doc string on the
control_exchange option is wrong.
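
To make the intent concrete, here is a sketch of the target-level
variant of that idea; the exchange name is hypothetical, and whether
the same override really works from the transport_url, as the
control_exchange doc string suggests, is exactly the open question
above:

    # Hypothetical: scope the heartbeat notifications under their own
    # exchange so auto-delete can be applied to these queues only.
    target = oslo_messaging.Target(
        topic='ironic-neutron-agent-heartbeat',
        exchange='ironic-neutron-agent')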


NOTE: The second issue can be worked around by stopping and starting
rabbitmq as a dependency of the ironic-neutron-agent service. This
ensures that only queues for active agent UUIDs are present, and those
queues will be consumed.
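
For anyone scripting that workaround on a systemd-based install, one
purely illustrative way to express "restart rabbitmq whenever the
agent restarts" is a drop-in like the one below; the unit names are
assumptions, and container-based deployments (e.g. kolla) need a
different mechanism:

    # /etc/systemd/system/rabbitmq-server.service.d/ironic-neutron-agent.conf
    [Unit]
    PartOf=ironic-neutron-agent.service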


--
Harald Jensås


[1] https://bugs.launchpad.net/oslo.messaging/+bug/1814544
[2] https://storyboard.openstack.org/#!/story/2004933
[3]
https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/transport.py#L58-L62