Cool, thanks for the summary. You seem to have this under control, so I might bravely run away. I definitely think these are issues that deserve a backport when the time comes.

Michael

On Tue, Feb 5, 2019 at 1:52 PM Harald Jensås <hjensas@redhat.com> wrote:
On Tue, 2019-02-05 at 09:54 +1100, Michael Still wrote:
Hi,
I’ve been chasing a bug in ironic’s neutron agent for the last few days, and I think it’s time to ask for some advice.
I'm working on the same issue. (In fact there are two issues.)
Specifically, I was asked to debug why a set of controllers was using so much RAM, and the answer was that rabbitmq had a queue called ironic-neutron-agent-heartbeat.info with 800,000 messages enqueued. This notification queue is used by ironic’s neutron agent to calculate the hash ring. I have been able to duplicate this issue in a stock kolla-ansible install with ironic turned on but no bare metal nodes enrolled in ironic. About 0.6 messages are queued per second.
I added some debugging code (hence the thread yesterday about mangling the code kolla deploys), and I can see that the messages in the queue are being read by the ironic neutron agent and acked correctly. However, they are not removed from the queue.
When using kolla, you can check the size of that queue with this command:
docker exec rabbitmq rabbitmqctl list_queues messages name messages_ready consumers | sort -n | tail -1
My stock install that’s been running for about 12 hours currently has 8,244 messages in that queue.
Where I’m a bit stumped is that I had assumed the messages weren’t being acked correctly, which turns out not to be the case. Is there something obvious about notification queues, like them being persistent, that I’ve missed in my general ignorance of the underlying implementation of notifications?
I opened an oslo.messaging bug[1] yesterday. When notifications are in use and every consumer consumes via one or more pools, nothing drains the default queue. The ironic-neutron-agent does use pools for all listeners in its hash-ring member manager, and the result is that notifications are published to the 'ironic-neutron-agent-heartbeat.info' queue but never consumed from it.
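For reference, here is roughly what such a pooled listener looks like with oslo.messaging (a minimal sketch; the endpoint class, pool prefix and executor are my illustrations, not the agent's actual code):

    import uuid

    from oslo_config import cfg
    import oslo_messaging

    class HeartbeatEndpoint(object):
        # Invoked for notifications published at the 'info' priority.
        def info(self, ctxt, publisher_id, event_type, payload, metadata):
            # Recalculate the hash ring from the peer heartbeat here.
            return oslo_messaging.NotificationResult.HANDLED

    transport = oslo_messaging.get_notification_transport(cfg.CONF)
    targets = [oslo_messaging.Target(topic='ironic-neutron-agent-heartbeat')]
    # Every listener passes a pool, so each agent instance gets its own
    # queue and sees every heartbeat -- but nothing ever consumes the
    # default 'ironic-neutron-agent-heartbeat.info' queue, which grows.
    listener = oslo_messaging.get_notification_listener(
        transport, targets, [HeartbeatEndpoint()],
        pool='ironic-neutron-agent-' + str(uuid.uuid4()),
        executor='threading')
    listener.start()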
The second issue: each instance of the agent uses its own pool to ensure all agents are notified about the existence of peer agents. The pools use a UUID that is generated at startup (and re-generated on restart, stop/start etc.). When `[oslo_messaging_rabbit]/amqp_auto_delete = false` in the neutron config, these UUID queues are not automatically removed, so after a restart of the ironic-neutron-agent the queue with the old UUID is left in the message broker with no consumers, growing ...
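In a nutshell (the pool prefix here is again illustrative):

    import uuid

    # Generated once per process; a restart produces a fresh UUID, so the
    # broker is left holding the queue named after the old one. With
    # amqp_auto_delete = false that orphaned queue is never cleaned up,
    # keeps receiving heartbeats, and grows without bound.
    pool = 'ironic-neutron-agent-' + str(uuid.uuid4())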
I intend to push patches to fix both issues. As a workaround (or the permanent solution) I will create another listener that consumes the notifications without a pool. This should fix the first issue.
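Something along these lines (a sketch with made-up names; the topic matches the queue above):

    from oslo_config import cfg
    import oslo_messaging

    class DiscardEndpoint(object):
        # Ack and drop heartbeats: this consumer only exists so the
        # default queue has someone draining it.
        def info(self, ctxt, publisher_id, event_type, payload, metadata):
            return oslo_messaging.NotificationResult.HANDLED

    transport = oslo_messaging.get_notification_transport(cfg.CONF)
    targets = [oslo_messaging.Target(topic='ironic-neutron-agent-heartbeat')]
    # No pool argument: this listener consumes from the shared
    # 'ironic-neutron-agent-heartbeat.info' queue directly.
    listener = oslo_messaging.get_notification_listener(
        transport, targets, [DiscardEndpoint()], executor='threading')
    listener.start()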
The second change will set amqp_auto_delete for these specific queues to 'true' no matter what. What I'm currently stuck on is that I need to change the control_exchange for the transport. According to the oslo.messaging documentation it should be possible to override the control_exchange in the transport_url[3]. The idea is to set amqp_auto_delete and an ironic-neutron-agent specific exchange on the URL when setting up the transport for notifications, but so far I believe the doc string on the control_exchange option is wrong.
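Concretely, this is what the doc string implies should work (untested sketch; host, credentials and the exact query parameter names are placeholders, and whether the override actually takes effect is precisely what looks wrong):

    from oslo_config import cfg
    import oslo_messaging

    # Per the control_exchange doc string[3], driver options should be
    # overridable on the transport URL itself, so that only this
    # transport gets an agent-specific exchange and auto-delete queues,
    # leaving the rest of neutron's messaging config untouched.
    url = ('rabbit://guest:guest@controller:5672/'
           '?amqp_auto_delete=true&exchange=ironic-neutron-agent')
    transport = oslo_messaging.get_notification_transport(cfg.CONF, url=url)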
NOTE: The second issue can be worked around by stopping and starting rabbitmq as a dependency of the ironic-neutron-agent service. This ensures only queues for active agent UUIDs are present, and those queues will be consumed.
-- Harald Jensås
[1] https://bugs.launchpad.net/oslo.messaging/+bug/1814544
[2] https://storyboard.openstack.org/#!/story/2004933
[3] https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/trans...