[openstack-dev] [Heat] Using Job Queues for timeout ops

Clint Byrum clint at fewbar.com
Tue Dec 2 00:05:28 UTC 2014


Excerpts from Zane Bitter's message of 2014-12-01 13:05:42 -0800:
> On 13/11/14 13:59, Clint Byrum wrote:
> > I'm not sure we have the same understanding of AMQP, so hopefully we can
> > clarify here. This stackoverflow answer echoes my understanding:
> >
> > http://stackoverflow.com/questions/17841843/rabbitmq-does-one-consumer-block-the-other-consumers-of-the-same-queue
> >
> > Not ack'ing just means they might get retransmitted if we never ack. It
> > doesn't block other consumers. And as the link above quotes from the
> > AMQP spec, when there are multiple consumers, FIFO is not guaranteed.
> > Other consumers get other messages.
> 
> Thanks, obviously my recollection of how AMQP works was coloured too 
> much by oslo.messaging.
> 
> > So just add the ability for a consumer to read, work, ack to
> > oslo.messaging, and this is mostly handled via AMQP. Of course that
> > also likely means no zeromq for Heat without accepting that messages
> > may be lost if workers die.
> >
> > Basically we need to add something that is not "RPC" but instead
> > "jobqueue" that mimics this:
> >
> > http://git.openstack.org/cgit/openstack/oslo.messaging/tree/oslo/messaging/rpc/dispatcher.py#n131
> >
> > I've always been suspicious of this bit of code, as it basically means
> > that if anything fails between that call, and the one below it, we have
> > lost contact, but as long as clients are written to re-send when there
> > is a lack of reply, there shouldn't be a problem. But, for a job queue,
> > there is no reply, and so the worker would dispatch, and then
> > acknowledge after the dispatched call had returned (including having
> > completed the step where new messages are added to the queue for any
> > newly-possible children).
> 
> I'm curious how people are deploying Rabbit at the moment. Are they 
> setting up multiple brokers and writing messages to disk before 
> accepting them? I assume yes on the former but no on the latter, since 
> there's no particular point in having e.g. 5 nines durability in the 
> queue when the overall system is as weak as your flakiest node.
> 

The pseudo-code should usually look like this:

msg = queue.read()
do_something_idempotent_with(msg.payload)
msg.ack()

The idea is to ack only after you've done _everything_ with the payload,
but not to freak out if somebody else already did _some_ of that work
before you; that's why the work has to be idempotent.
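
To make that concrete, here's a minimal sketch of the read/work/ack loop
using kombu (the library underneath oslo.messaging's rabbit driver). The
'heat-jobs' queue name and the do_something_idempotent_with() helper are
placeholders for illustration, not anything that exists today:

from kombu import Connection, Queue

jobs = Queue('heat-jobs', durable=True)

def do_something_idempotent_with(payload):
    # Placeholder for the real (idempotent) work, including enqueueing
    # any newly-possible child jobs.
    print('processing %r' % (payload,))

def handle(body, message):
    do_something_idempotent_with(body)
    message.ack()   # ack only after _everything_ is done

with Connection('amqp://guest:guest@localhost//') as conn:
    with conn.Consumer(jobs, callbacks=[handle], accept=['json']):
        while True:
            conn.drain_events()

The important part is that message.ack() only happens after the handler
returns, so a worker that dies mid-job just causes a redelivery.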

> OTOH if we were to add what you're proposing, then we would need folks 
> to deploy Rabbit that way (at least for Heat), since waiting for Acks on 
> receipt is insufficient to make messaging reliable if the broker can 
> easily outright lose the message.
> 

If you publish a message as persistent to a durable queue, RabbitMQ
writes it to disk. If the broker is clustered and the queue is mirrored
across it, the message is written into _many_ queue storages.
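
To spell that out, a rough sketch of the publishing side with kombu; the
queue name and payload are invented, and the mirroring policy shown in
the comment is a broker-side deployment setting, not something the
client asks for:

from kombu import Connection, Queue

jobs = Queue('heat-jobs', durable=True)   # queue survives a broker restart

with Connection('amqp://guest:guest@localhost//') as conn:
    producer = conn.Producer()
    producer.publish(
        {'stack_id': 'example', 'action': 'check_timeout'},
        routing_key='heat-jobs',
        declare=[jobs],
        delivery_mode=2,   # persistent: written to durable queue storage
    )

# Replication to many nodes is a policy on the broker, e.g.:
#   rabbitmqctl set_policy ha-jobs '^heat-jobs$' '{"ha-mode":"all"}'

Both halves matter: a persistent message in a non-durable queue, or a
transient message in a durable queue, is still lost on restart.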

Currently, if you deploy TripleO with 3 controllers, you get a clustered
RabbitMQ and sufficient durability for the pattern I cited. Users may
not be deploying this way, but they should be.

I'm sort of assuming Qpid's clustering works the same way. 0mq will
likely not work at all for this. Other options are feasible too, like a
simple Redis queue that you abuse as a job queue.
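
For the Redis variant, the usual trick for not losing a job when a
worker dies is the RPOPLPUSH "reliable queue" pattern. A rough sketch
with redis-py (3.x argument order), where the key names and the worker
helper are invented for illustration:

import redis

r = redis.StrictRedis()

def do_something_idempotent_with(payload):
    # Placeholder for the real (idempotent) work.
    print('processing %r' % (payload,))

def enqueue(job):
    r.lpush('heat:jobs', job)

def work_one():
    # Atomically move one job onto an "in flight" list; it is only
    # removed (the ack, effectively) after the work finishes, so jobs
    # held by a dead worker can be recovered from heat:jobs:inflight.
    job = r.brpoplpush('heat:jobs', 'heat:jobs:inflight', timeout=5)
    if job is None:
        return
    do_something_idempotent_with(job)
    r.lrem('heat:jobs:inflight', 1, job)   # remove exactly this job

A reaper that periodically re-queues stale entries from the inflight
list would be needed to fully match the AMQP redelivery behaviour.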

> I think all of the proposed approaches would benefit from this feature, 
> but I'm concerned about any increased burden on deployers too.

Right now they have the burden of supporting coarse timeouts, which seem
likely to fail often. That seems worse to me.


