Open Stack

Wed Sep 24 01:48:57 UTC 2014

I've taken a bit of time out of this thread, and I'd like to jump back
in now and attempt to summarize what I've learned and hopefully frame
it in such a way that it helps us to answer the question Thierry
asked:

On Fri, Sep 19, 2014 at 2:00 AM, Thierry Carrez <thierry at openstack.org> wrote:
>
> The underlying question being... can Zaqar evolve to ultimately reach
> the massive scale use case Joe, Clint and Devananda want it to reach, or
> are those design choices so deeply rooted in the code and architecture
> that Zaqar won't naturally mutate to support that use case.

I also want to sincerely thank everyone who has been involved in this
discussion, and helped to clarify the different viewpoints and
uncertainties which have surrounded Zaqar lately. I hope that all of
this helps provide the Zaqar team guidance on a path forward, as I do
believe that a scalable cloud-based messaging service would greatly
benefit the OpenStack ecosystem.

Use cases
--------------

So, I'd like to start from the perspective of a hypothetical user
evaluating messaging services for the new application that I'm
developing. What does my application need from a messaging service so
that it can grow and become hugely popular with all the hipsters of
the world? In other words, what might my architectural requirements
be?

(This is certainly not a complete list of features, and it's not meant
to be -- it is a list of things that I *might* need from a messaging
service. But feel free to point out any glaring omissions I may have
made anyway :) )

1. Durability: I can't risk losing any messages
  Example: Using a queue to process votes. Every vote should count.

2. Single Delivery - each message must be processed *exactly* once
  Example: Using a queue to process votes. Every vote must be counted only once.

3. Low latency to interact with service
  Example: Single threaded application that can't wait on external calls

4. Low latency of a message through the system
  Example: Video streaming. Messages are very time-sensitive.

5. Aggregate throughput
  Example: Ad banner processing. Remember when sites could get
  slash-dotted? I need a queue resilient to truly massive spikes in
  traffic.

6. FIFO - When ordering matters
  Example: I can't "stop" a job that hasn't "start"ed yet.

So, as a developer, I actually probably never need all of these in a
single application -- but depending on what I'm doing, I'm going to
need some of them. Hopefully, the examples above give some idea of
what I have in mind for different sorts of applications I might
develop which would require these guarantees from a messaging service.

Why is this at all interesting or relevant? Because I think Zaqar and
SQS are actually, in their current forms, trying to meet different
sets of requirements. And, because I have not actually seen an
application using a cloud which requires the things that Zaqar is
guaranteeing - which doesn't mean they don't exist - it frames my past
judgements about Zaqar in a much better way than simply "I have
doubts". It explains _why_ I have those doubts.

I'd now like to offer the following as a summary of this email thread
and the available documentation on SQS and Zaqar, as far as which of
the above requirements are satisfied by each service and why I believe
that. If there are fallacies in here, please correct me.

SQS
------

Requirements it meets: 1, 5

The SQS documentation states that it guarantees durability of messages
(1) and handles "unlimited" throughput (5).

It does not guarantee once-and-only-once delivery (2) and requires
applications that care about this to de-duplicate on the receiving
side.
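
To illustrate what de-duplicating on the receiving side could look
like for the vote-counting example, here is a minimal Python sketch.
It assumes every message carries a unique ID. The names (VoteCounter,
handle, message_id) are hypothetical and are not part of the SQS API,
and a real consumer would need to keep the set of seen IDs in durable
storage rather than in memory.

```python
class VoteCounter:
    """Counts votes idempotently when the queue may deliver duplicates."""

    def __init__(self):
        self.seen_ids = set()   # IDs of messages already processed
        self.tally = {}         # candidate -> number of votes counted

    def handle(self, message_id, candidate):
        """Process one delivery; return True if counted, False if duplicate."""
        if message_id in self.seen_ids:
            return False        # already counted, ignore the redelivery
        self.seen_ids.add(message_id)
        self.tally[candidate] = self.tally.get(candidate, 0) + 1
        return True


counter = VoteCounter()
# Hypothetical delivery stream in which message "m1" arrives twice.
deliveries = [("m1", "alice"), ("m2", "bob"), ("m1", "alice")]
results = [counter.handle(mid, cand) for mid, cand in deliveries]
```

The cost of this approach is that the application, not the queue,
carries the burden of remembering what it has already processed.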

It also does not guarantee message order (6), making it unsuitable for
certain uses.

SQS is not an application-local service nor does it use a wire-level
protocol, so from this I infer that (3) and (4) were not design goals.

Zaqar
--------

Requirements it meets: 1*, 2, 6

Zaqar states that it aims to guarantee message durability (1) but does
so by pushing the guarantee of durability into the storage layer.
Thus, Zaqar will not be able to guarantee durability of messages when
a storage pool fails, is misconfigured, or what have you. Therefore I
do not feel that message durability is a strong guarantee of Zaqar
itself; in some configurations, this guarantee may be possible based
on the underlying storage, but this capability will need to be exposed
in such a way that users can make informed decisions about which Zaqar
storage back-end (or "flavor") to use for their application based on
whether or not they need durability.

Single delivery of messages (2) is provided for by the claim semantics
in Zaqar's API. FIFO (6) ordering was an architectural choice made
based on feedback from users.
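
As a rough picture of how claim semantics keep two workers from
receiving the same message, here is a toy in-memory sketch in Python.
It is not Zaqar's API or implementation: the class ClaimQueue, its
methods, and the explicit `now` argument are all assumptions made for
illustration. A claim hides a message from other workers for a limited
time, and deleting the message acknowledges it.

```python
import itertools


class ClaimQueue:
    """Toy FIFO queue with time-limited claims (illustrative only)."""

    def __init__(self):
        self._ids = itertools.count(1)
        # id -> [body, claim_expiry]; dicts preserve insertion (FIFO) order
        self._messages = {}

    def post(self, body):
        msg_id = next(self._ids)
        self._messages[msg_id] = [body, 0.0]
        return msg_id

    def claim(self, now, ttl, limit=1):
        """Claim up to `limit` unclaimed messages, oldest first."""
        claimed = []
        for msg_id, entry in self._messages.items():
            if len(claimed) == limit:
                break
            if entry[1] <= now:        # unclaimed, or an earlier claim expired
                entry[1] = now + ttl   # hide from other workers until expiry
                claimed.append((msg_id, entry[0]))
        return claimed

    def delete(self, msg_id):
        """Acknowledge a message so that it is never delivered again."""
        self._messages.pop(msg_id, None)


q = ClaimQueue()
q.post("start job")
q.post("stop job")
first = q.claim(now=0.0, ttl=30.0)    # worker A receives "start job"
second = q.claim(now=1.0, ttl=30.0)   # worker B receives "stop job", not a copy
q.delete(first[0][0])                 # worker A finishes and acknowledges
retry = q.claim(now=40.0, ttl=30.0)   # B never acknowledged; its claim expired
```

In this sketch a message becomes visible again only when a claim
expires without the message having been deleted, so the claim lifetime
has to be chosen to exceed the expected processing time.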

Aggregate throughput of a single queue (5) is not scalable beyond the
reach of a single storage pool. This makes it possible for an
application to outgrow Zaqar when its total throughput needs exceed
the capacity of a single pool. This would also make it possible for
one user to DoS other users who share the same storage pool (unless
rate-limits are implemented, which would further indicate that (5) was
not a design goal). Also, as with durability, pushing this problem
down to the storage layer is not the same as _solving_ it.

Zaqar relies on a store-and-forward architecture, which is not
amenable to low-latency message processing (4). Again, as with SQS, it
is not a wire-level protocol, so I don't believe low-latency
connectivity (3) was a design goal.
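
The two assessments above can be restated as sets, which makes the
overlap and the gaps easy to read off. This Python sketch only
re-encodes the numbered requirements and the "Requirements it meets"
lines from this email; the variable names are arbitrary, and Zaqar's
durability entry carries the caveat described above.

```python
REQUIREMENTS = {
    1: "Durability",
    2: "Single delivery",
    3: "Low latency to interact with service",
    4: "Low latency of a message through the system",
    5: "Aggregate throughput",
    6: "FIFO",
}

SQS = {1, 5}
ZAQAR = {1, 2, 6}   # 1 is qualified: it depends on the storage back-end


def names(numbers):
    """Map requirement numbers to their names, in numeric order."""
    return [REQUIREMENTS[n] for n in sorted(numbers)]


both = names(SQS & ZAQAR)                          # met by both services
only_sqs = names(SQS - ZAQAR)                      # met by SQS only
only_zaqar = names(ZAQAR - SQS)                    # met by Zaqar only
neither = names(set(REQUIREMENTS) - SQS - ZAQAR)   # met by neither
```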

Summary
-------------

It looks like Zaqar should be very well suited to "small to mid sized"
clouds. At this scale, I believe its architecture will meet all the
use-cases that SQS does, and some that SQS does not. That's great for
private clouds. And, as far as I can tell, the developer team is
dedicated to making the project easy to use and administer at this
scale. That's also great.

However, I continue to believe that the current architecture of Zaqar
is not going to handle the needs of public cloud providers who want to
offer an alternative messaging service to SQS within their OpenStack
clouds, primarily because the project is not itself directly
addressing durability and throughput. It does not appear to be any
more durable or more performant than the storage implementation
underneath it, which, to a user who requires durability and
throughput, makes it no better than those technologies.

While we have several other projects today with well-known scaling
limitations (*cough* Nova *cough*), the difference is that the current
scaling limitations of Zaqar are a result of design principles of the
project.

As far as advice to the project moving forward, I would offer this:
any design decision you make that limits either aggregate throughput
or durability, when operating this service in a public cloud, at
"unlimited" scale, is going to draw concern from potential operators
and users, even if they're not (yet) at that scale, because design
decisions are much harder to fix later on than bugs.

-Devananda

[openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues
