[openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

Clint Byrum clint at fewbar.com
Fri Sep 19 13:07:44 UTC 2014


Excerpts from Eoghan Glynn's message of 2014-09-19 04:23:55 -0700:
> 
> > Hi All,
> > 
> > My understanding of Zaqar is that it's like SQS. SQS uses distributed queues,
> > which have a few unusual properties [0]:
> > 
> > Message Order
> > 
> > Amazon SQS makes a best effort to preserve order in messages, but due to the
> > distributed nature of the queue, we cannot guarantee you will receive
> > messages in the exact order you sent them. If your system requires that
> > order be preserved, we recommend you place sequencing information in each
> > message so you can reorder the messages upon receipt.
> > 
> > At-Least-Once Delivery
> > 
> > Amazon SQS stores copies of your messages on multiple servers for redundancy
> > and high availability. On rare occasions, one of the servers storing a copy
> > of a message might be unavailable when you receive or delete the message. If
> > that occurs, the copy of the message will not be deleted on that unavailable
> > server, and you might get that message copy again when you receive messages.
> > Because of this, you must design your application to be idempotent (i.e., it
> > must not be adversely affected if it processes the same message more than
> > once).
> > 
> > Message Sample
> > 
> > The behavior of retrieving messages from the queue depends on whether you
> > are using short (standard) polling, the default behavior, or long polling.
> > For more information about long polling, see Amazon SQS Long Polling.
> > 
> > With short polling, when you retrieve messages from the queue, Amazon SQS
> > samples a subset of the servers (based on a weighted random distribution)
> > and returns messages from just those servers. This means that a particular
> > receive request might not return all your messages. Or, if you have a small
> > number of messages in your queue (less than 1000), it means a particular
> > request might not return any of your messages, whereas a subsequent request
> > will. If you keep retrieving from your queues, Amazon SQS will sample all of
> > the servers, and you will receive all of your messages.
> > 
> > The following figure shows short polling behavior of messages being returned
> > after one of your system components makes a receive request. Amazon SQS
> > samples several of the servers (in gray) and returns the messages from those
> > servers (Message A, C, D, and B). Message E is not returned to this
> > particular request, but it would be returned to a subsequent request.
> > 
> > Presumably SQS has these properties because they make the system
> > scalable. If so, does Zaqar have the same properties (not just making
> > these same guarantees in the API, but actually having these properties in
> > the backends)? And if not, why not? I looked on the wiki [1] for
> > information on this, but couldn't find anything.
> 
> The premise of this thread is flawed, I think.
> 
> It seems to be predicated on a direct quote from the public
> documentation of a closed-source system justifying some
> assumptions about the internal architecture and design goals
> of that closed-source system.
> 
> It then proceeds to hold zaqar to account for not making
> the same choices as that closed-source system.
> 

I don't think we want Zaqar to make the same choices. OpenStack's
constraints are different from AWS's.

I want to highlight that our expectation is for the API to support
deploying at scale. SQS _clearly_ started from a point of extreme scale
for the deployer, and thus is a good example of an API that is limited
enough to scale like that.

What has always been the concern is that Zaqar would make it extremely
complicated and/or costly to get to that level.

> This puts the zaqar folks in a no-win situation, as it's hard
> to refute such arguments when they have no visibility over
> the innards of that closed-source system.
> 

Nobody expects to know the insides. But the outsides, the parts that
are public, are brilliant because they are _limited_, and yet they still
support many, many use cases.

> Sure, the assumption may well be correct that the designers
> of SQS made the choice to expose applications to out-of-order
> messages as this was the only practical way of achieving their
> scalability goals.
> 
> But since the code isn't on github and the design discussions
> aren't publicly archived, we have no way of validating that.
> 

We don't need to see the code. Not requiring ordering makes the whole
problem easier to reason about. You don't need explicit pools anymore.
Just throw messages wherever, and make sure that everywhere gets
polled at a reasonable enough frequency. This is the kind of thing
operations loves. No global state means no split brain to avoid, no
synchronization. Does it solve all problems? No. But it solves a single
one, REALLY well.
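
To make "throw messages wherever, poll everywhere" concrete, here is a
minimal Python sketch. Every name in it (ShardedQueue, send, receive,
drain, sample_size) is made up for illustration; it is a toy model and
not how Zaqar or SQS is actually implemented. The point is only that the
shards never coordinate with each other, and that repeated polling of a
random subset still finds every message eventually.

```python
import random
from collections import deque


class ShardedQueue:
    """Toy unordered queue: independent shards, no cross-shard state."""

    def __init__(self, num_shards=4, seed=None):
        self.shards = [deque() for _ in range(num_shards)]
        self.rng = random.Random(seed)

    def send(self, message):
        # Drop the message on any shard; no global ordering is recorded.
        self.rng.choice(self.shards).append(message)

    def receive(self, sample_size=2, max_messages=10):
        # Poll only a random subset of shards, so one call may miss messages.
        received = []
        for shard in self.rng.sample(self.shards, sample_size):
            while shard and len(received) < max_messages:
                received.append(shard.popleft())
        return received


def drain(queue, max_polls=1000):
    # Keep polling; over enough polls every shard gets sampled.
    # (Peeking at queue.shards to stop early is a shortcut for this demo;
    # a real client would simply keep polling.)
    seen = []
    for _ in range(max_polls):
        seen.extend(queue.receive())
        if not any(queue.shards):
            break
    return seen


queue = ShardedQueue(num_shards=4, seed=42)
for i in range(20):
    queue.send(f"msg-{i}")
messages = drain(queue)
```

All 20 messages come back, though generally not in the order they were
sent, and nothing in the sketch needs a lock or a shared counter.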

Frankly I don't understand why there would be this argument to hold on
to so many use cases and so much API surface area. Zaqar's life gets
easier without ordering guarantees or message browsing. And it still
retains _many_ of its potential users.
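
For users who do need ordering, the quoted SQS text already describes the
workaround: put sequencing information in each message and be idempotent
on receipt. A minimal sketch of such a consumer, assuming the producer
numbers messages from 0 (the function name and the (seq, body) tuple
format are hypothetical, not any real client API):

```python
def process_in_order(deliveries):
    """Deduplicate and reorder (seq, body) pairs that arrive at least once,
    in any order. Returns the bodies in sequence order."""
    seen = set()      # sequence numbers already accepted
    pending = {}      # accepted but not yet released, keyed by seq
    next_seq = 0      # next sequence number to release
    output = []
    for seq, body in deliveries:
        if seq in seen:
            continue  # duplicate delivery: ignore it
        seen.add(seq)
        pending[seq] = body
        # Release every message that is now contiguous with what came before.
        while next_seq in pending:
            output.append(pending.pop(next_seq))
            next_seq += 1
    return output


deliveries = [(1, "b"), (0, "a"), (1, "b"), (3, "d"), (2, "c"), (0, "a")]
ordered = process_in_order(deliveries)
```

The ordering cost lands on the clients that actually want it, rather than
on the service for everyone.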


