I'm also interested in how we catch future instances of this. Is there something we can do in CI or in a runtime warning to let people know? I am sure there are plenty of ironic deployments out there consuming heaps more RAM than is required for this queue.

Michael

On Wed, Feb 6, 2019 at 8:41 AM Doug Hellmann <doug@doughellmann.com> wrote:
Ken Giusti <kgiusti@gmail.com> writes:
On 2/4/19, Harald Jensås <hjensas@redhat.com> wrote:
On Tue, 2019-02-05 at 09:54 +1100, Michael Still wrote:
Hi,
I’ve been chasing a bug in ironic’s neutron agent for the last few days and I think it’s time to ask for some advice.
I'm working on the same issue. (In fact there are two issues.)
Specifically, I was asked to debug why a set of controllers was using so much RAM, and the answer was that rabbitmq had a queue called ironic-neutron-agent-heartbeat.info with 800,000 messages enqueued. This notification queue is used by ironic’s neutron agent to calculate the hash ring. I have been able to duplicate this issue in a stock kolla-ansible install with ironic turned on but no bare metal nodes enrolled in ironic. About 0.6 messages are queued per second.
I added some debugging code (hence the thread yesterday about mangling the code kolla deploys), and I can see that the messages in the queue are being read by the ironic neutron agent and acked correctly. However, they are not removed from the queue.
You can see your queue size while using kolla with this command:
docker exec rabbitmq rabbitmqctl list_queues messages name messages_ready consumers | sort -n | tail -1
My stock install that’s been running for about 12 hours currently has 8,244 messages in that queue.
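As a rough illustration of how that listing could be checked automatically, here is a minimal Python sketch. It assumes the column order requested in the command above (messages, name, messages_ready, consumers) and whitespace-separated output. The function name, the threshold, and the sample listing are all made up for illustration; in particular, the consumer count shown for the heartbeat queue is an assumption, not captured output.

```python
def find_unconsumed_queues(listing, threshold=1000):
    """Return (name, messages) for queues holding messages with no consumers.

    `listing` is text in the column order: messages name messages_ready consumers.
    """
    suspects = []
    for line in listing.splitlines():
        fields = line.split()
        # Skip banners and header rows that are not four columns of data.
        if len(fields) != 4 or not fields[0].isdigit() or not fields[3].isdigit():
            continue
        messages, name, consumers = int(fields[0]), fields[1], int(fields[3])
        if consumers == 0 and messages >= threshold:
            suspects.append((name, messages))
    return suspects


# Hypothetical sample listing, not real output from a deployment.
sample = (
    "Listing queues\n"
    "0\tconductor\t0\t1\n"
    "8244\tironic-neutron-agent-heartbeat.info\t8244\t0\n"
)

for name, messages in find_unconsumed_queues(sample):
    print(f"WARNING: {name} holds {messages} messages and has no consumers")
```

A check like this could run from cron or a monitoring agent; the threshold would need tuning per deployment.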
Where I’m a bit stumped is that I had assumed the messages weren’t being acked correctly, which is not the case. Is there something obvious about notification queues, like them being persistent, that I’ve missed in my general ignorance of the underlying implementation of notifications?
I opened an oslo.messaging bug[1] yesterday. It occurs when using notifications and all consumers use one or more pools. The ironic-neutron-agent does use pools for all listeners in its hash-ring member manager. The result is that notifications are published to the 'ironic-neutron-agent-heartbeat.info' queue and they are never consumed.
This is an issue with the design of the notification pool feature.
The Notification service is designed so notification events can be sent even though there may currently be no consumers. It supports the ability for events to be queued until a consumer(s) is ready to process them. So when a notifier issues an event and there are no consumers subscribed, a queue must be provisioned to hold that event until consumers appear.
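The behaviour described above can be sketched with a toy model. This is not oslo.messaging or RabbitMQ code: the ToyExchange class and the pool queue name are hypothetical, and the model simply assumes that the notifier provisions a default queue named after the topic while each pooled listener reads from its own queue bound to the same routing key.

```python
from collections import defaultdict, deque


class ToyExchange:
    """Toy stand-in for a broker exchange; not oslo.messaging or RabbitMQ code."""

    def __init__(self):
        self.queues = {}                   # queue name -> deque of messages
        self.bindings = defaultdict(set)   # routing key -> bound queue names

    def declare_queue(self, name, routing_key):
        self.queues.setdefault(name, deque())
        self.bindings[routing_key].add(name)

    def publish(self, routing_key, message):
        # Every queue bound to the routing key receives its own copy.
        for name in self.bindings[routing_key]:
            self.queues[name].append(message)

    def consume_all(self, name):
        drained = list(self.queues[name])
        self.queues[name].clear()
        return drained


TOPIC = "ironic-neutron-agent-heartbeat.info"
POOL_QUEUE = "agent-pool-1"  # hypothetical pool queue name

exchange = ToyExchange()
# The pooled listener gets its own queue bound to the topic.
exchange.declare_queue(POOL_QUEUE, routing_key=TOPIC)
# The notifier provisions the default queue so events survive with no consumers.
exchange.declare_queue(TOPIC, routing_key=TOPIC)

for i in range(5):
    exchange.publish(TOPIC, {"heartbeat": i})

received = exchange.consume_all(POOL_QUEUE)  # the agent reads and acks these
backlog = len(exchange.queues[TOPIC])        # nobody reads the default queue
print(len(received), backlog)
```

In this model the pooled consumer sees every heartbeat, yet the default queue grows by one message per publish, which matches the symptom of messages being acked while the queue keeps filling.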
This has come up several times over the last few years, and it's always a surprise to whoever it has bitten. I wonder if we should change the default behavior to not create the consumer queue in the publisher?
-- Doug