[ironic] [oslo] ironic overloading notifications for internal messaging

Ken Giusti kgiusti at gmail.com
Tue Feb 5 16:43:09 UTC 2019


On 2/4/19, Harald Jensås <hjensas at redhat.com> wrote:
> On Tue, 2019-02-05 at 09:54 +1100, Michael Still wrote:
>> Hi,
>>
>> I’ve been chasing a bug in ironic’s neutron agent for the last few
>> days and I think it’s time to ask for some advice.
>>
>
> I'm working on the same issue. (In fact there are two issues.)
>
>> Specifically, I was asked to debug why a set of controllers was using
>> so much RAM, and the answer was that rabbitmq had a queue called
>> ironic-neutron-agent-heartbeat.info with 800,000 messages enqueued.
>> This notification queue is used by ironic’s neutron agent to
>> calculate the hash ring. I have been able to duplicate this issue in
>> a stock kolla-ansible install with ironic turned on but no bare metal
>> nodes enrolled in ironic. About 0.6 messages are queued per second.
>>
>> I added some debugging code (hence the thread yesterday about
>> mangling the code kolla deploys), and I can see that the messages in
>> the queue are being read by the ironic neutron agent and acked
>> correctly. However, they are not removed from the queue.
>>
>> You can see your queue size while using kolla with this command:
>>
>> docker exec rabbitmq rabbitmqctl list_queues messages name
>> messages_ready consumers  | sort -n | tail -1
>>
>> My stock install that’s been running for about 12 hours currently has
>> 8,244 messages in that queue.
>>
>> Where I’m a bit stumped is I had assumed that the messages weren’t
>> being acked correctly, which is not the case. Is there something
>> obvious about notification queues like them being persistent that
>> I’ve missed in my general ignorance of the underlying implementation
>> of notifications?
>>
>
> I opened an oslo.messaging bug[1] yesterday. The problem occurs when
> notifications are used and all consumers use one or more pools. The
> ironic-neutron-agent does use pools for all listeners in its
> hash-ring member manager. The result is that notifications are
> published to the 'ironic-neutron-agent-heartbeat.info' queue and are
> never consumed.
>

This is an issue with the design of the notification pool feature.

The Notification service is designed so notification events can be
sent even though there may currently be no consumers.  It supports
queuing events until a consumer is ready to process them.  So when a
notifier issues an event and there are no consumers subscribed, a
queue must be provisioned to hold that event until consumers appear.

For notification pools the pool identifier is supplied by the
notification listener when it subscribes.  The value of any pool id is
not known beforehand by the notifier, which is important because pool
ids can be dynamically created by the listeners.  And in many cases
pool ids are not even used.

So notifications are always published to a non-pooled queue.  If there
are pooled subscriptions we rely on the broker to do the fanout.
This means that the application should always have at least one
non-pooled listener for the topic, since any events that may be
published _before_ the listeners are established will be stored on a
non-pooled queue.

The documentation doesn't make that clear AFAICT - that needs to be fixed.
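As a rough sketch of that recommendation (the endpoint class, pool
name and topic below are illustrative only - this just assumes the
standard oslo.messaging notification listener API):

    import uuid

    from oslo_config import cfg
    import oslo_messaging

    class HeartbeatEndpoint(object):
        # Invoked for notifications published at the 'info' priority,
        # i.e. the ones that land on '<topic>.info' queues.
        def info(self, ctxt, publisher_id, event_type, payload, metadata):
            # ... update the hash ring from the heartbeat payload ...
            return oslo_messaging.NotificationResult.HANDLED

    transport = oslo_messaging.get_notification_transport(cfg.CONF)
    targets = [oslo_messaging.Target(topic='ironic-neutron-agent-heartbeat')]
    endpoints = [HeartbeatEndpoint()]

    # Pooled listener: every pool receives its own copy of each
    # notification, so each agent instance sees all heartbeats.
    pooled = oslo_messaging.get_notification_listener(
        transport, targets, endpoints, executor='threading',
        pool='ironic-neutron-agent-' + str(uuid.uuid4()))

    # Non-pooled listener: drains the default (non-pooled) queue the
    # notifier publishes to, so it cannot grow without bound.
    drain = oslo_messaging.get_notification_listener(
        transport, targets, endpoints, executor='threading')

    pooled.start()
    drain.start()

With something like the 'drain' listener running from startup, events
published before any pooled listener subscribes still get consumed.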

> The second issue: each instance of the agent uses its own pool to
> ensure all agents are notified about the existence of peer-agents. The
> pools use a uuid that is generated at startup (and re-generated on
> restart, stop/start etc). In the case where
> `[oslo_messaging_rabbit]/amqp_auto_delete = false` in neutron config
> these uuid queues are not automatically removed. So after a restart of
> the ironic-neutron-agent the queue with the old UUID is left in the
> message broker with no consumers, growing ...
>
>
> I intend to push patches to fix both issues. As a workaround (or the
> permanent solution) I will create another listener consuming the
> notifications without a pool. This should fix the first issue.
>
> The second change will set amqp_auto_delete for these specific queues
> to 'true' regardless. What I'm currently stuck on is that I need to
> change the control_exchange for the transport. According to the
> oslo.messaging documentation it should be possible to override the
> control_exchange in the transport_url[3]. The idea is to set
> amqp_auto_delete and an ironic-neutron-agent specific exchange on the
> url when setting up the transport for notifications, but so far I
> believe the doc string on the control_exchange option is wrong.
>

Yes, the doc string is wrong - you can override the default
control_exchange via the Target's exchange field:

https://git.openstack.org/cgit/openstack/oslo.messaging/tree/oslo_messaging/target.py#n40

At least that's the intent...

... however, the Notifier API does not take a Target; it takes a list
of topic _strings_:

https://git.openstack.org/cgit/openstack/oslo.messaging/tree/oslo_messaging/notify/notifier.py#n239

That seems wrong, especially since the notification Listener
subscribes to a list of Targets:

https://git.openstack.org/cgit/openstack/oslo.messaging/tree/oslo_messaging/notify/listener.py#n227

I've opened a bug for this and will provide a patch for review shortly:

https://bugs.launchpad.net/oslo.messaging/+bug/1814797
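
For illustration, the asymmetry looks roughly like this (the exchange
name and publisher_id are made up):

    from oslo_config import cfg
    import oslo_messaging

    transport = oslo_messaging.get_notification_transport(cfg.CONF)

    # Listener side: subscriptions are Targets, so an exchange can be
    # supplied per subscription.
    targets = [oslo_messaging.Target(
        topic='ironic-neutron-agent-heartbeat',
        exchange='ironic-neutron-agent')]

    # Notifier side: only plain topic strings are accepted - there is
    # no way to name an exchange here, so publishing falls back to the
    # global control_exchange option.
    notifier = oslo_messaging.Notifier(
        transport,
        publisher_id='ironic-neutron-agent.host-1',
        driver='messagingv2',
        topics=['ironic-neutron-agent-heartbeat'])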

>
> NOTE: The second issue can be worked around by stopping and starting
> rabbitmq as a dependency of the ironic-neutron-agent service. This
> ensures only queues for active agent UUIDs are present, and those
> queues will be consumed.
>
>
> --
> Harald Jensås
>
>
> [1] https://bugs.launchpad.net/oslo.messaging/+bug/1814544
> [2] https://storyboard.openstack.org/#!/story/2004933
> [3]
> https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/transport.py#L58-L62
>
>
>


-- 
Ken Giusti  (kgiusti at gmail.com)


