[openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues
flavio at redhat.com
Wed Sep 24 08:02:19 UTC 2014
On 09/24/2014 03:48 AM, Devananda van der Veen wrote:
> I've taken a bit of time out of this thread, and I'd like to jump back
> in now and attempt to summarize what I've learned and hopefully frame
> it in such a way that it helps us to answer the question Thierry posed.
I *loved* it! Thanks a lot for taking the time.
> On Fri, Sep 19, 2014 at 2:00 AM, Thierry Carrez <thierry at openstack.org> wrote:
>> The underlying question being... can Zaqar evolve to ultimately reach
>> the massive scale use case Joe, Clint and Devananda want it to reach, or
>> are those design choices so deeply rooted in the code and architecture
>> that Zaqar won't naturally mutate to support that use case.
> I also want to sincerely thank everyone who has been involved in this
> discussion, and helped to clarify the different viewpoints and
> uncertainties which have surrounded Zaqar lately. I hope that all of
> this helps provide the Zaqar team guidance on a path forward, as I do
> believe that a scalable cloud-based messaging service would greatly
> benefit the OpenStack ecosystem.
> Use cases
> So, I'd like to start from the perspective of a hypothetical user
> evaluating messaging services for the new application that I'm
> developing. What does my application need from a messaging service so
> that it can grow and become hugely popular with all the hipsters of
> the world? In other words, what might my architectural requirements be?
> (This is certainly not a complete list of features, and it's not meant
> to be -- it is a list of things that I *might* need from a messaging
> service. But feel free to point out any glaring omissions I may have
> made anyway :) )
> 1. Durability: I can't risk losing any messages
> Example: Using a queue to process votes. Every vote should count.
> 2. Single Delivery - each message must be processed *exactly* once
> Example: Using a queue to process votes. Every vote must be counted only once.
> 3. Low latency to interact with service
> Example: Single threaded application that can't wait on external calls
> 4. Low latency of a message through the system
> Example: Video streaming. Messages are very time-sensitive.
> 5. Aggregate throughput
> Example: Ad banner processing. Remember when sites could get
> slash-dotted? I need a queue resilient to truly massive spikes in traffic.
> 6. FIFO - When ordering matters
> Example: I can't "stop" a job that hasn't "start"ed yet.
> So, as a developer, I actually probably never need all of these in a
> single application -- but depending on what I'm doing, I'm going to
> need some of them. Hopefully, the examples above give some idea of
> what I have in mind for different sorts of applications I might
> develop which would require these guarantees from a messaging service.
> Why is this at all interesting or relevant? Because I think Zaqar and
> SQS are actually, in their current forms, trying to meet different
> sets of requirements. And, because I have not actually seen an
> application using a cloud which requires the things that Zaqar is
> guaranteeing - which doesn't mean they don't exist - it frames my past
> judgements about Zaqar in a much better way than simply "I have
> doubts". It explains _why_ I have those doubts.
> I'd now like to offer the following as a summary of this email thread
> and the available documentation on SQS and Zaqar, as far as which of
> the above requirements are satisfied by each service and why I believe
> that. If there are fallacies in here, please correct me.
> SQS
> Requirements it meets: 1, 5
> The SQS documentation states that it guarantees durability of messages
> (1) and handles "unlimited" throughput (5).
> It does not guarantee once-and-only-once delivery (2) and requires
> applications that care about this to de-duplicate on the receiving side.
> It also does not guarantee message order (6), making it unsuitable for
> certain uses.
> SQS is not an application-local service nor does it use a wire-level
> protocol, so from this I infer that (3) and (4) were not design goals.
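As a side note, consumer-side de-duplication for an at-least-once service like SQS can be sketched roughly like this (all names here are illustrative, not the SQS API; a real system would persist the seen-IDs set and expire old entries):

```python
# Hypothetical sketch: wrapping an application handler so that
# redelivered messages from an at-least-once queue are processed once.

def make_deduplicating_handler(process, seen=None):
    """Return a handler that skips messages whose ID was already handled.

    `process` is the application's real handler; `seen` tracks message
    IDs already processed (in-memory here; persistent in practice).
    """
    if seen is None:
        seen = set()

    def handle(message_id, body):
        if message_id in seen:
            return False  # duplicate delivery, skip it
        process(body)
        seen.add(message_id)
        return True

    return handle
```

With this, the vote-counting example from requirement (2) tolerates duplicate deliveries: each vote body is passed to `process` at most once per message ID.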
> Zaqar
> Requirements it meets: 1*, 2, 6
> Zaqar states that it aims to guarantee message durability (1) but does
> so by pushing the guarantee of durability into the storage layer.
> Thus, Zaqar will not be able to guarantee durability of messages when
> a storage pool fails, is misconfigured, or what have you. Therefore, I
> do not feel that message durability is a strong guarantee of Zaqar
> itself; in some configurations, this guarantee may be possible based
> on the underlying storage, but this capability will need to be exposed
> in such a way that users can make informed decisions about which Zaqar
> storage back-end (or "flavor") to use for their application based on
> whether or not they need durability.
I agree with the above but I would like to add a couple of things.
The first one is just a clarification on flavors. Flavors are not
required in order to use pools, whereas pools are required in order to
use flavors.
Operators can install Zaqar on top of fully-durable pools without
flavors. This means all queues will be distributed across the available
pools. I'm bringing this up because an operator can say: "This is a
fully-reliable service" and not give users a chance to shoot themselves
in the foot.
One more note about flavors. Flavors have a 'capabilities' field that is
meant to expose the storage capabilities of the pool group it's been
mapped to. Some examples of capabilities: durable, high-throughput, etc.
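Just to make the idea concrete, the flavor-to-pool capability match could look roughly like this (the data shapes below are illustrative assumptions, not Zaqar's actual schema):

```python
# Illustrative sketch only: how a flavor's 'capabilities' field could
# be used to select pools that advertise those capabilities.

POOLS = {
    "mongo-durable": {"capabilities": {"durable"}},
    "redis-fast": {"capabilities": {"high-throughput"}},
}

FLAVORS = {
    "gold": {"capabilities": {"durable"}},
    "fast": {"capabilities": {"high-throughput"}},
}

def pools_for_flavor(flavor_name):
    """Return the pools whose capabilities satisfy the flavor's needs."""
    required = FLAVORS[flavor_name]["capabilities"]
    return [name for name, pool in POOLS.items()
            if required <= pool["capabilities"]]
```

The point is that a user picking the "gold" flavor gets routed only to pools that claim durability, which is how the informed-decision story above could be exposed.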
The second thing I wanted to mention is about what the operator can or
cannot do. As much as Zaqar wants to guarantee durability, there's a lot
the operator can do to break it. I don't personally think there's a way
to fully guarantee this, not even by dropping the dependency on the storage.
I also believe this is true for every service.
SQS guarantees durability because they control the service. They know
how it's installed and well, obviously, it's been written to do that.
But, let's say we re-write Zaqar and implement message distribution
within it. We would still require the deployer (and even force the
deployer) to add *at least* 3 pools (Murphy doesn't like 2, not even 3
but well...). Even with that, we won't be able to guarantee 100%
durability because the deployer could have installed all storage nodes
on the same server.
I know the above is a pretty extreme example. What I want to get at is
that Zaqar can't guarantee 100% durability, not because the feature
depends on the storage but because - like in every other service in
OpenStack, I believe - this is a joint work with the operator and
there's a lot the operator can do.
I do believe Zaqar *has* to do everything possible to guarantee this and
to educate operators on the best way to deploy the service.
> Single delivery of messages (2) is provided for by the claim semantics
> in Zaqar's API. FIFO (6) ordering was an architectural choice made
> based on feedback from users.
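For readers less familiar with how claims give single processing, here is a minimal in-memory sketch of the claim → process → delete cycle, loosely modeled on Zaqar's claim semantics (this is not the real client; every name here is illustrative):

```python
import time

# Minimal sketch of claim semantics: a worker claims messages for a
# TTL, processes them, and deletes them. If the worker dies before
# deleting, the claim expires and the message becomes visible again.

class ClaimQueue:
    def __init__(self):
        self._messages = {}   # message id -> body
        self._claims = {}     # message id -> claim expiry timestamp
        self._next_id = 0

    def post(self, body):
        self._next_id += 1
        self._messages[self._next_id] = body
        return self._next_id

    def claim(self, ttl=30.0, now=None):
        """Claim every unclaimed (or expired-claim) message for `ttl` seconds."""
        now = time.monotonic() if now is None else now
        claimed = []
        for mid, body in self._messages.items():
            if self._claims.get(mid, 0) <= now:
                self._claims[mid] = now + ttl
                claimed.append((mid, body))
        return claimed

    def delete(self, mid):
        """Acknowledge a processed message so it is never delivered again."""
        self._messages.pop(mid, None)
        self._claims.pop(mid, None)
```

The single-delivery guarantee (2) comes from the delete-after-processing step: while a claim is live no other worker sees the message, and once deleted it is gone; only a crash between processing and delete causes a redelivery, which is exactly the window claims are designed to bound.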
> Aggregate throughput of a single queue (5) is not scalable beyond the
> reach of a single storage pool. This makes it possible for an
> application to outgrow Zaqar when its total throughput needs exceed
> the capacity of a single pool. This would also make it possible for
> one user to DOS other users who share the same storage pool (unless
> rate-limits are implemented, which would further indicate that (5) was
> not a design goal). Also, as with durability, pushing this problem
> down to the storage layer is not the same as _solving_ it.
Right, I'm sure there are some limits that may vary depending on the
storage backend and I'm all for fixing the above.
For the sake of good discussions and looking for more feedback from all
of you on this matter, let me share a thought. Instead of backing Zaqar
with MongoDB, let's assume we have a Swift store, which we all know has
effectively unlimited scaling capabilities: whenever your Swift cluster
reaches a limit, you can add more nodes to it, re-balance, etc. Wouldn't
a Swift driver make Zaqar's scaling story different? Meaning that
having a storage backend with unlimited scaling capabilities makes Zaqar an
unlimitedly scalable service. Assuming we have just 1 pool - just to
make the example easier - pointing to a swift cluster, we would have all
the queues distributed, replicated and stored in a storage cluster that
can scale unlimitedly.
Yes, that still depends on the storage, which means the concern of the
service relying on the storage backend still exists. But, it does solve
the problem by relying on a storage that does this already and has a
proven scaling story.
I don't think relying on the storage backend is a bad thing. One way or
another, all software relies on other software to provide its service
and guarantees. As far as Zaqar goes, I believe that as long as the
supported/blessed - gosh, I hate this term - drivers guarantee (1) and
(5), the project's vision holds true.
**NOTE:** I'm not saying the current mongodb driver can't do the above.
I'm just using a service we're probably more familiar with.
> Zaqar relies on a store-and-forward architecture, which is not
> amenable to low-latency message processing (4). Again, as with SQS, it
> is not a wire-level protocol, so I don't believe low-latency
> connectivity (3) was a design goal.
> It looks like Zaqar should be very well suited to "small to mid sized"
> clouds. At this scale, I believe its architecture will meet some
> use-cases that SQS does not, as well as all the ones it does. That's
> great for
> private clouds. And, as far as I can tell, the developer team is
> dedicated to making the project easy to use and administer at this
> scale. That's also great.
> However, I continue to believe that the current architecture of Zaqar
> is not going to handle the needs of public cloud providers who want to
> offer an alternative messaging service to SQS within their OpenStack
> clouds, primarily because the project is not itself directly
> addressing durability and throughput. It does not appear to be any
> more durable or more performant than the storage implementation
> underneath it, which, to a user who requires durability and
> throughput, makes it no better than those technologies.
> While we have several other projects today with well-known scaling
> limitations (*cough* Nova *cough*), the difference is that the current
> scaling limitations of Zaqar are a result of design principles of the
> service itself.
> As far as advice to the project moving forward, I would offer this:
> any design decision you make that limits either aggregate throughput
> or durability, when operating this service in a public cloud, at
> "unlimited" scale, is going to draw concern from potential operators
> and users, even if they're not (yet) at that scale. Because design
> decisions are much harder to fix later on than bugs.
As mentioned above, and in other emails, I think there's a lot of room for
improvement in Zaqar and the team is all for improving the project.
Thanks for the advice and thoughts.
Thanks for taking the time to summarize the thread, Devananda.