[openstack-dev] [Ceilometer][Oslo] Consuming Notifications in Batches
Gordon Sim
gsim at redhat.com
Fri Jan 3 11:43:06 UTC 2014
On 01/02/2014 10:46 PM, Herndon, John Luke wrote:
>
>
> On 1/2/14, 11:36 AM, "Gordon Sim" <gsim at redhat.com> wrote:
>
>> On 12/20/2013 09:26 PM, Herndon, John Luke wrote:
>>>
>>> On Dec 20, 2013, at 12:13 PM, Gordon Sim <gsim at redhat.com> wrote:
>>>
>>>> On 12/20/2013 05:27 PM, Herndon, John Luke wrote:
>>>>>
>>>>> Other protocols may support bulk consumption. My one concern with
>>>>> this approach is error handling. Currently the executors treat
>>>>> each notification individually. So let's say the broker hands
>>>>> over 100 messages at a time. When the client is done processing the
>>>>> messages, the broker needs to know if message 25 had an error or
>>>>> not. We would somehow need to communicate back to the broker
>>>>> which messages failed. I think this may take some refactoring of
>>>>> executors/dispatchers. What do you think?
>> [...]
>>>> (2) What would you want the broker to do with the failed messages?
>>>> What sort of things might fail? Is it related to the message
>>>> content itself? Or is it failures suspected to be of a temporal
>>>> nature?
>>>
>>> There will be situations where the message can't be parsed, and those
>>> messages can't just be thrown away. My current thought is that
>>> ceilometer could provide some sort of mechanism for sending messages
>>> that are invalid to an external data store (like a file, or a
>>> different topic on the amqp server) where a living, breathing human
>>> can look at them and try to parse out any meaningful information.
>>
>> Right, in those cases simply requeueing probably is not the right thing
>> and you really want it dead-lettered in some way. I guess the first
>> question is whether that is part of the notification systems function,
>> or if it is done by the application itself (e.g. by storing it or
>> republishing it). If it is the latter you may not need any explicit
>> negative acknowledgement.
>
> Exactly, I'm thinking this is something we'd build into ceilometer and not
> oslo, since ceilometer is where the event parsing knowledge lives. From an
> oslo point of view, the message would be 'acked'.
>
>>
>>> Other errors might be "database not available", in which case
>>> requeueing the message is probably the right way to go.
>>
>> That does mean however that the backlog of messages starts to grow on
>> the broker, so some scheme for dealing with this if the database outage
>> goes on for a bit is probably important. It also means that the messages
>> will keep being retried without any 'backoff' waiting for the database
>> to be restored which could increase the load.
>
> This is a problem we already have :(
Agreed, it is a property of reliable (i.e. acknowledged) transfer from
the broker, rather than batching. And of course, some degree of
buffering here is exactly what message queues are supposed to provide.
The point is simply to provide some way of configuring things so that
this can be bounded, or prevented from taking down the entire broker.
(And perhaps some way of alerting the unfortunate someone!)
> https://github.com/openstack/ceilometer/blob/master/ceilometer/notification.py#L156-L158
> Since notifications cannot be lost, overflow needs to be detected and the
> messages need to be saved. I'm thinking the database being down is a rare
> occurrence that will be worthy of waking someone up in the middle of the
> night. One possible solution: flip the collector into an emergency mode
> and save notifications to disk until the issue is resolved. Once the db is
> up and running, the collector inserts all of these saved messages (as one
> big batch!). Thoughts?
>
> I'm not sure I understand what you are saying about retrying without a
> backoff. Can you explain?
I mean that if the messages are explicitly requeued and the original
subscription is still active, they will be immediately redelivered and
will thus keep cycling from broker to client, back to broker, back to
client etc etc until the database is available again.
Pulling messages off continually like this without actually being able
to dequeue them may reduce the broker's effectiveness at e.g. paging out,
and in any event involves some unnecessary load on top of the expanding
queue.
It might be better, just as an example, to abort the connection to the
broker (implicitly requeueing all unacked messages), and only reconnect
when the database becomes available (and that can be tried after 1
second, then 2, then 4 etc up to some maximum retry interval).
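The reconnect-with-exponential-backoff idea above could be sketched roughly as follows. This is illustrative only, not oslo.messaging API: `connect` is a hypothetical callable that raises on failure, and `sleep` is injectable so the schedule is testable.

```python
import time


def reconnect_with_backoff(connect, max_interval=64, sleep=time.sleep):
    """Keep trying `connect` until it succeeds, doubling the wait
    between attempts (1s, 2s, 4s, ...) up to `max_interval` seconds.

    Note: this sketch retries forever; a real collector would also
    want a way to give up or alert an operator.
    """
    interval = 1
    while True:
        try:
            return connect()
        except Exception:
            sleep(interval)
            interval = min(interval * 2, max_interval)
```

Aborting the connection (so the broker implicitly requeues the unacked messages) and then calling something like this keeps the retry pressure off the broker entirely while the database is down.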
Or another alternative would be to leave the connection to the broker,
but by not requeueing or acking ensure that once the prefetch has been
reached, no further messages will be delivered. Then locally, on the
client, retry the processing for the prefetched messages until the
database is back again.
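That second alternative, retrying locally over the prefetched messages instead of requeueing, might look something like this sketch. The message object, `store` callable, and parameter names are all illustrative assumptions, not real oslo or ceilometer interfaces.

```python
import time


def drain_prefetched(messages, store, retry_delay=1, sleep=time.sleep):
    """Process already-prefetched messages locally, retrying on store
    failure rather than handing them back to the broker.

    Messages stay unacked while the store (e.g. the database) is down,
    so once the prefetch window fills, the broker stops delivering and
    the backlog queues up broker-side without any redelivery churn.
    """
    for msg in messages:
        while True:
            try:
                store(msg)      # e.g. a database insert
                msg.ack()       # dequeue from the broker only on success
                break
            except Exception:
                sleep(retry_delay)  # wait for the database to recover
```

The key property is that a transient failure never turns into broker traffic: the message simply sits on the client until `store` succeeds.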
The basic point I'm trying to make is that it seems to me there is
little value in simply handing the messages back to the broker for
immediate redelivery back to the client. It delays the retry certainly,
but at unnecessary expense.
More generally I wonder whether an explicit negative acknowledgement is
actually needed in the notify API at all. If it isn't, that may simplify
things for batching.
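To illustrate why an explicit negative acknowledgement might be unnecessary: if unparseable messages are diverted (as John suggests) and transient failures are retried locally, every message in a batch eventually gets a positive ack. A hedged sketch, with hypothetical `parse`, `record`, and `dead_letter` callables:

```python
def process_batch(batch, parse, record, dead_letter):
    """Consume a batch using only positive acknowledgement.

    Malformed payloads are diverted to `dead_letter` (a file, or a
    separate topic a human can inspect) and then acked, so nothing
    ever needs to be nacked back to the broker.
    """
    for msg in batch:
        try:
            record(parse(msg.body))
        except ValueError:          # can't be parsed: keep it for a human
            dead_letter(msg.body)
        msg.ack()                   # every message is acked either way
```

Under this model the notify API only ever needs "ack", which keeps batch acknowledgement simple (e.g. a single cumulative ack per batch).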