[ironic] [oslo] ironic overloading notifications for internal messaging
Michael Still
mikal at stillhq.com
Tue Feb 5 02:56:38 UTC 2019
Cool thanks for the summary. You seem to have this under control so I might
bravely run away. I definitely think these are issues that deserve a
backport when the time comes.
Michael
On Tue, Feb 5, 2019 at 1:52 PM Harald Jensås <hjensas at redhat.com> wrote:
> On Tue, 2019-02-05 at 09:54 +1100, Michael Still wrote:
> > Hi,
> >
> > I’ve been chasing a bug in ironic’s neutron agent for the last few
> > days and I think it’s time to ask for some advice.
> >
>
> I'm working on the same issue. (In fact there are two issues.)
>
> > Specifically, I was asked to debug why a set of controllers was using
> > so much RAM, and the answer was that rabbitmq had a queue called
> > ironic-neutron-agent-heartbeat.info with 800,000 messages enqueued.
> > This notification queue is used by ironic’s neutron agent to
> > calculate the hash ring. I have been able to duplicate this issue in
> > a stock kolla-ansible install with ironic turned on but no bare metal
> > nodes enrolled in ironic. About 0.6 messages are queued per second.
> >
> > I added some debugging code (hence the thread yesterday about
> > mangling the code kolla deploys), and I can see that the messages in
> > the queue are being read by the ironic neutron agent and acked
> > correctly. However, they are not removed from the queue.
> >
> > You can see your queue size while using kolla with this command:
> >
> > docker exec rabbitmq rabbitmqctl list_queues messages name
> > messages_ready consumers | sort -n | tail -1
> >
> > My stock install that’s been running for about 12 hours currently has
> > 8,244 messages in that queue.
> >
> > Where I’m a bit stumped is that I had assumed the messages weren’t
> > being acked correctly, which is not the case. Is there something
> > obvious about notification queues, such as them being persistent, that
> > I’ve missed in my general ignorance of the underlying implementation
> > of notifications?
> >
>
> I opened an oslo.messaging bug[1] yesterday. When notifications are used
> and every consumer listens through one or more pools, nothing consumes
> the un-pooled queue. The ironic-neutron-agent uses pools for all
> listeners in its hash-ring member manager, and the result is that
> notifications are published to the 'ironic-neutron-agent-heartbeat.info'
> queue but are never consumed from it.
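>
> For context, the pooled listener pattern looks roughly like the sketch
> below (a sketch only: the endpoint class, executor and pool name are
> illustrative rather than the actual agent code, and the topic is
> inferred from the queue name). Because every agent passes a pool name,
> only the per-pool queues are drained, while the shared
> 'ironic-neutron-agent-heartbeat.info' queue keeps accumulating:
>
>     import oslo_messaging
>     from oslo_config import cfg
>
>     class HeartbeatEndpoint(object):
>         # Receives INFO-priority notifications on the heartbeat topic.
>         def info(self, ctxt, publisher_id, event_type, payload, metadata):
>             return oslo_messaging.NotificationResult.HANDLED
>
>     transport = oslo_messaging.get_notification_transport(cfg.CONF)
>     targets = [oslo_messaging.Target(topic='ironic-neutron-agent-heartbeat')]
>
>     # Every consumer uses a pool, so nothing reads the base '.info' queue.
>     pooled = oslo_messaging.get_notification_listener(
>         transport, targets, [HeartbeatEndpoint()],
>         executor='threading', pool='ironic-neutron-agent-pool')
>     pooled.start()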
>
> The second issue: each instance of the agent uses its own pool to
> ensure all agents are notified about the existence of peer agents. The
> pools use a UUID that is generated at startup (and re-generated on
> restart, stop/start etc). When
> `[oslo_messaging_rabbit]/amqp_auto_delete = false` is set in the neutron
> config, these UUID queues are not automatically removed. So after a
> restart of the ironic-neutron-agent, the queue with the old UUID is left
> in the message broker with no consumers, growing ...
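>
> You can spot these orphaned per-agent queues with a variation of the
> rabbitmqctl command above, for example by listing queues that hold
> messages but have no consumers (columns: name, consumers, messages):
>
>     docker exec rabbitmq rabbitmqctl list_queues name consumers messages \
>         | awk '$2 == 0 && $3 > 0'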
>
>
> I intend to push patches to fix both issues. As a workaround (or the
> permanent solution) I will create another listener that consumes the
> notifications without a pool. This should fix the first issue.
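>
> A minimal sketch of that workaround, continuing the example above (again
> with illustrative names, not the actual patch): one additional listener
> created without a pool argument consumes directly from the shared
> 'ironic-neutron-agent-heartbeat.info' queue, so it can no longer grow
> unbounded.
>
>     # No pool= here, so this listener drains the base notification queue.
>     drain = oslo_messaging.get_notification_listener(
>         transport, targets, [HeartbeatEndpoint()], executor='threading')
>     drain.start()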
>
> The second change will set amqp_auto_delete to 'true' for these specific
> queues regardless of the global setting. What I'm currently stuck on is
> that I need to change the control_exchange for the transport. According
> to the oslo.messaging documentation it should be possible to override
> the control_exchange in the transport_url[3]. The idea is to set
> amqp_auto_delete and an ironic-neutron-agent-specific exchange on the
> URL when setting up the transport for notifications, but so far I
> believe the doc string on the control_exchange option is wrong.
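>
> For comparison, oslo.messaging does allow a Target to carry its own
> exchange, which is a different mechanism than the control_exchange
> override via the transport_url described above and does not by itself
> address amqp_auto_delete. A minimal sketch (the exchange name here is
> an assumption):
>
>     # Scope the heartbeat targets to a dedicated exchange directly on
>     # the Target, rather than relying on control_exchange.
>     targets = [oslo_messaging.Target(
>         exchange='ironic-neutron-agent',
>         topic='ironic-neutron-agent-heartbeat')]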
>
>
> NOTE: The second issue can be worked around by stopping and starting
> rabbitmq as a dependency of the ironic-neutron-agent service. This
> ensures only queues for active agent UUIDs are present, and those
> queues will be consumed.
>
>
> --
> Harald Jensås
>
>
> [1] https://bugs.launchpad.net/oslo.messaging/+bug/1814544
> [2] https://storyboard.openstack.org/#!/story/2004933
> [3] https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/transport.py#L58-L62
>
>