[openstack-dev] [Zaqar] Comments on the concerns arose during the TC meeting

Zane Bitter zbitter at redhat.com
Thu Sep 11 20:09:25 UTC 2014


On 04/09/14 08:14, Sean Dague wrote:
>
> I've been one of the consistent voices concerned about a hard
> requirement on adding NoSQL into the mix. So I'll explain that thinking
> a bit more.
>
> I feel like when the TC has made integration decisions previously, it
> has been about evaluating whether the project applying for integration
> met some specific criteria it was told about some time in the past. I
> think that's the wrong approach. It's a locally optimized approach that
> fails to ask the more interesting question.
>
> Is OpenStack better as a whole if this is a mandatory component of
> OpenStack? Better being defined as technically better (more features,
> fewer janky code workarounds, less unexpected behavior from the stack).
> Better from the sense of easier or harder to run an actual cloud by our
> Operators (taking into account what kinds of moving parts they are now
> expected to manage). Better from the sense of a better user experience
> in interacting with OpenStack as a whole. Better from a sense that the
> OpenStack release will experience fewer bugs, fewer unexpected
> cross-project interactions, and a greater overall feel of consistency
> so that the OpenStack API feels like one thing.
>
> https://dague.net/2014/08/26/openstack-as-layers/

I don't want to get off-topic here, but I want to state before this 
becomes the de facto starting point for a layering discussion that I 
don't accept this model at all. It is not based on any analysis 
whatsoever but appears to be entirely arbitrary - a collection of 
individual prejudices arranged visually.

On a hopefully more constructive note, I believe there are at least two 
analyses that _would_ produce interesting data here:

1) Examine the dependencies, both hard and optional, between projects 
and enumerate the things you lose when ignoring each optional one (a 
toy sketch of this follows below).
2) Analyse projects based on the type of user consuming the service - 
e.g. Nova is mostly used (directly or indirectly via e.g. Heat and/or 
Horizon) by actual, corporeal persons, while Zaqar is used by both 
persons (to set up queues) and services (which actually send and receive 
messages) - belonging to both OpenStack itself and to applications. I 
believe, BTW, that this analysis will uncover a lot of missing features 
in Keystone[1].
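
To make analysis (1) concrete, here's a toy sketch in Python. The
dependency data is invented purely for illustration - a real analysis
would mine the projects' actual requirements and configuration options:

    # Toy sketch of analysis (1): enumerate optional inter-project
    # dependencies and what you give up by ignoring each one. The data
    # below is invented for illustration, not a real dependency survey.
    OPTIONAL_DEPS = {
        ('nova', 'neutron'): 'software-defined networking',
        ('nova', 'cinder'): 'persistent block storage for servers',
        ('heat', 'ceilometer'): 'alarm-driven autoscaling',
        ('horizon', 'zaqar'): 'asynchronous notifications to users',
    }

    def features_lost(ignored_project):
        # What do consumers lose if this optional project is absent?
        return ['%s loses: %s' % (consumer, feature)
                for (consumer, dep), feature in OPTIONAL_DEPS.items()
                if dep == ignored_project]

    for project in ('neutron', 'zaqar'):
        print('%s -> %s' % (project, features_lost(project)))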

What you can _not_ produce is a linear model of the different types of 
clouds for different use cases, because different organisations have 
wildly differing needs.

> One of the interesting qualities of Layers 1 & 2 is they all follow an
> AMQP + RDBMS pattern (excepting Swift). You can have a very effective
> IaaS out of that stack. They are the things that you can provide pretty
> solid integration testing on (and if you look at where everything stood
> before the new TC mandates on testing / upgrade, that was basically what
> was getting integration tested). (Also note, I'll accept Barbican is
> probably in the wrong layer, and should be a Layer 2 service.)

Swift is the current exception here, but one could argue, and people 
have[2], that Swift is also the only project that actually conforms to 
our stated design tenets for OpenStack. I'd struggle to tell the Zaqar 
folks they've done the Wrong Thing... especially when abandoning the 
RDBMS driver was done largely at the direction of the TC, IIRC.

Speaking of Swift, I would really love to see it investigated as a 
potential storage backend for Zaqar. If it proves to have the right 
guarantees (and durability is the crucial one, so it sounds promising) 
then that has the potential to smooth over much of the deployment 
problem.
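
Purely to make the idea concrete, here's a very rough sketch of the
shape such a backend could take, using python-swiftclient. The class
and method names are mine, not Zaqar's actual storage driver interface,
and the naive object-per-message scheme below ignores claims and
concurrent consumers entirely - a thought experiment, not a design:

    # Toy queue on Swift: one object per message, relying on Swift's
    # replication for durability. All names here are invented.
    import json
    import time
    import uuid

    import swiftclient  # pip install python-swiftclient

    class SwiftQueueSketch(object):
        def __init__(self, auth_url, user, key, queue_name):
            self.conn = swiftclient.client.Connection(
                authurl=auth_url, user=user, key=key)
            self.container = 'zaqar_queue_%s' % queue_name
            self.conn.put_container(self.container)

        def post(self, body, ttl=300):
            # Container listings sort lexicographically, so a zero-padded
            # timestamp prefix gives rough FIFO ordering for free.
            name = '%017.6f-%s' % (time.time(), uuid.uuid4().hex)
            self.conn.put_object(
                self.container, name, json.dumps(body),
                headers={'X-Delete-After': str(ttl)})  # Swift-native TTL

        def pop(self):
            # No claim semantics: two consumers could race for the same
            # message. A real driver would need claims to avoid this.
            _, listing = self.conn.get_container(self.container, limit=1)
            if not listing:
                return None
            name = listing[0]['name']
            _, data = self.conn.get_object(self.container, name)
            self.conn.delete_object(self.container, name)
            return json.loads(data)

The interesting question is whether Swift's listing consistency and
durability guarantees hold up under queue-like access patterns - which
is exactly what such an investigation would need to establish.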

> While large shops can afford to have a dedicated team to figure out how
> to make mongo or redis HA, provide monitoring, have a DR plan for when a
> hurricane requires them to flip datacenters, that basically means
> OpenStack heads further down the path of "only for the big folks". I
> don't want OpenStack to be only for the big folks, I want OpenStack to
> be for all sized folks. I really do want to have all the local small
> colleges around here have OpenStack clouds, because it's something that
> people believe they can do and manage. I know the people that work in
> these places; they all come out to the LUG I run. We've talked about
> this. OpenStack is basically seen as too complex for them to use as it
> stands, and that pains me a ton.

This is a great point, and one that we definitely have to keep in mind.

It's worth noting that small organisations also get the most benefit. 
Rather than having to stand up a cluster of reliable message brokers - 
potentially one cluster per application (it's the large organisations 
that are more likely to need that kind of flexibility anyway) - they 
can have their IT department deploy e.g. a single Redis cluster and 
have messaging handled for every application in their cloud, with all 
the benefits of multitenancy.

Part of the move to the cloud is inevitably going to mean organisational 
changes in a lot of places, where the operations experts will 
increasingly focus on maintaining the cloud itself, rather than the 
applications running in it. We need to be wary of producing a product 
with a major impedance mismatch to the organisations that will use it, 
but we should also remember that we are not doing this in a vacuum. 
Change is coming whether anyone likes it or not; the big question is 
whether we'll get a foot in the door or everything will switch over to 
proprietary clouds.

Vish brought up an interesting idea at the TC meeting a couple of weeks 
back, of having "components" that could be deployed by users instead of 
_needing_ operators to do it (though on bigger clouds they likely 
would). To some extent this is already possible for things like Trove - 
for example, you can write a Heat template containing a Nova server 
running MySQL. On a small local cloud, you can pass an environment file 
that maps the OS::Trove::Instance resource type to this template, so 
that you get a MySQL server that you administer yourself. Then, when you 
move to a bigger cloud you launch the same template without the 
environment mapping and automatically get the managed Trove service with 
no changes. (Murano developers will be showing up shortly to tell you 
that they can make it even easier.) Unfortunately, this model doesn't 
work so well for something like Zaqar, which needs to scale at a very 
fine granularity. Maybe it could be done (Zaqar can run standalone, I 
believe) if you're willing to give up multitenancy and run one copy for 
a number of applications... but at that point it's easier to run it as 
part of the cloud. If we had a Docker driver in Nova - or, preferably, a 
Nova-like Container API - then I can imagine this concept having more 
legs. It would still be expensive in the messaging case because of the 
durability requirements though. Something to think about.
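
For the record, the Trove substitution trick above is concrete enough
to sketch. Something like the following ought to work via
python-heatclient - the template and file names are invented, and the
exact client signatures are from memory, but resource_registry is the
real Heat environment mechanism:

    # Sketch of mapping OS::Trove::Instance to a DIY template on a
    # small cloud. 'my_mysql_server.yaml' would define a plain Nova
    # server running MySQL with the same properties/attributes as the
    # Trove resource. File names and endpoints are placeholders.
    from heatclient import client as heat_client

    heat = heat_client.Client(
        '1', endpoint='http://heat.example.com:8004/v1/<tenant>',
        token='<keystone-token>')

    small_cloud_env = {
        'resource_registry': {
            'OS::Trove::Instance': 'my_mysql_server.yaml',
        }
    }

    with open('app_stack.yaml') as f:
        template = f.read()
    with open('my_mysql_server.yaml') as f:
        files = {'my_mysql_server.yaml': f.read()}

    # Small cloud: the mapping substitutes the DIY implementation.
    heat.stacks.create(stack_name='myapp', template=template,
                       environment=small_cloud_env, files=files)

    # Big cloud: same template, no environment - you get managed Trove.
    # heat.stacks.create(stack_name='myapp', template=template)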

> So I think Zaqar is good software, and a really useful part of our
> ecosystem, but this added step function burden of a 3rd class of support
> software that has to be maintained... seems like it takes us further
> away from OpenStack at a small scale. If we were thinking about Zaqar as
> a thing that we could replace oslo.messaging with, that becomes
> interesting in a different way, because instead of having 3 classes of
> support software we could remain at 2, just taking a sideways shift on
> one of them. But that's not actually the path we are on.
>
> So, honestly, I'll probably remain -1 on the final integration vote, not
> because Zaqar is bad, but because I'm feeling more firmly that for
> OpenStack to not leave the small deployers behind we need to redefine
> the tightly integrated piece of OpenStack to basically the Layer 1 & 2
> parts of my diagram, and consider the rest of the layers exciting parts
> of our ecosystem that more advanced users may choose to deploy to meet
> their needs. Smaller tent, big ecosystem, easier on ramp.

Let's assume for a moment that I agree with the 'small tent' concept and 
don't find it in any way appalling.

I would argue that Marconi belongs very close to the centre of even the 
smallest of pup tents. Just behind Nova, but well ahead of, say, 
Neutron. Let's not forget that SQS actually pre-dates(!) EC2 by two years.

I'll give you an example of one use case we have for it in Heat. 
Ceilometer generates alarms that trigger autoscaling events in Heat. We 
could easily have Ceilometer simply call a Heat API endpoint with some 
data, but that's actually extremely limiting for the user. What if the 
user wants a particular alarm to cause a scaling event on the second 
Tuesday after a full moon and BTW Nagios is going bat****? We have some 
basic signal conditioning in Heat, but we don't want to turn it into a 
Turing-complete programming language or anything, and we don't want the 
user to have to give up on using data from Ceilometer altogether as soon 
as things get complicated. The best solution available for now, and the 
one that was implemented, is to make it a webhook - the user can either 
pass the webhook URL supplied by Heat to Ceilometer, or they can use it 
themselves and pass their own webhook to Ceilometer to do their own 
conditioning in between.
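
To illustrate the workaround, the user-side conditioning shim described
above might look something like this. This is a hypothetical sketch -
the payload shape, URLs and condition are all invented - but note that
it only works if the shim's endpoint is reachable by Ceilometer, which
is precisely the problem discussed below:

    # Hypothetical user-run conditioning shim: Ceilometer POSTs its
    # alarm webhook here; if the user's arbitrary condition holds, we
    # forward the signal to the webhook URL that Heat supplied for the
    # scaling policy. Payloads, URLs and condition logic are invented.
    from flask import Flask, request
    import requests

    app = Flask(__name__)
    HEAT_WEBHOOK = 'https://heat.example.com/signal-url-from-heat'

    def my_condition(alarm):
        # Stand-in for "second Tuesday after a full moon" logic.
        return alarm.get('current') == 'alarm'

    @app.route('/alarm', methods=['POST'])
    def alarm():
        data = request.get_json(force=True)
        if my_condition(data):
            requests.post(HEAT_WEBHOOK)  # trigger the scaling event
        return '', 204

    if __name__ == '__main__':
        # Has to listen somewhere Ceilometer can reach it...
        app.run(host='0.0.0.0', port=8080)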

I submit that this is a horrible solution.

It's horrible because it can potentially turn Ceilometer into an engine 
for launching DoS attacks at arbitrary servers. (Operators themselves 
are actually the most vulnerable, though, because it comes from inside 
their control plane network. They have to be aware not to trust outgoing 
connections from the machine running Ceilometer.) It's horrible because 
it requires users to make the endpoint for the signal conditioner public 
(effectively outsourcing security from the operator, who need only 
implement SSL and Keystone, to the user, who is much more likely to get 
it wrong). It's horrible because this operation should have the 
semantics of a queue, but when it fails the choice is between losing the 
alarm or effectively reimplementing Zaqar inside Ceilometer.

All of those problems, and more, would be solved by using Zaqar instead, 
but we couldn't use it at the time because it didn't exist yet. And this 
use case is just the beginning! I (genuinely) lost count somewhere in 
the double digits of the number of new Heat features facing the same 
dilemma - features that users and developers are champing at the bit 
for. I announced elsewhere that we decided to just go ahead and 
implement them using Zaqar, because it's simply too much work trying to 
hold developers and their webhook hacks at bay while we wait for Zaqar 
to graduate.
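
For contrast, here's roughly what the queue-based version of that alarm
flow looks like, using python-zaqarclient. The v1 client API is
reconstructed from memory, so treat the exact signatures as
approximate; the queue name, payload and helper function are invented:

    # Sketch: alarms flow through a Zaqar queue instead of a webhook.
    # A producer (Ceilometer, or the user's own code) posts messages;
    # the consumer claims them, applies its own arbitrary conditioning,
    # and acknowledges by deleting. Nothing needs a public endpoint.
    from zaqarclient.queues.v1 import client

    cli = client.Client('http://zaqar.example.com:8888',
                        conf={'auth_opts': {'options': {
                            'os_auth_token': '<token>',
                            'os_project_id': '<project-id>'}}})

    alarms = cli.queue('heat-scaling-alarms')  # invented queue name

    # Producer side: the message is durable until acknowledged.
    alarms.post({'body': {'alarm': 'cpu_high', 'current': 'alarm'},
                 'ttl': 3600})

    def worth_scaling(alarm):
        # Stand-in for the user's arbitrary conditioning logic.
        return alarm.get('current') == 'alarm'

    # Consumer side: claiming gives at-least-once delivery; an
    # unprocessed claim simply expires and the message becomes
    # available again, instead of being lost like a failed webhook.
    for message in alarms.claim(ttl=300, grace=60):
        if worth_scaling(message.body):
            pass  # e.g. call the Heat API to trigger the scaling event
        message.delete()  # acknowledge: remove from the queue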

There's so much more than just Heat too. We have a "Dashboard" that 
isn't able to tell users about stuff that happens to their 
infrastructure for want of asynchronous notifications. That's nuts! This 
is a critical, fundamental part of a cloud that I would certainly 
arbitrarily place in layer 2, if not layer 1, of your diagram. Lots of 
core stuff needs to depend on it, and keeping it outside the tent makes 
about as much sense to me as keeping, say, Glance outside the tent.

Or, in other words, Zaqar will give us "more features, fewer janky code 
workarounds, [and] less unexpected behavior from the stack".

Finally, of course, this is without even considering the *main* use 
cases for Zaqar. As pointed out elsewhere in this thread, there is an 
entire class of applications - the most popular class of applications - 
where substantially every new one written will need something like this. 
So we have to balance the number of organisations that will be put off 
running their own OpenStack because operating it is too complicated (a 
problem not created by Zaqar, as you note above) against the number 
that won't even consider it for lack of demand, because all their users 
are comfortably locked into AWS, which has had this functionality for 
nigh on 10 years now.

> I realize that largely means Zaqar would be caught up in a definition
> discussion outside of its control, and that's kind of unfortunate, as
> Flavio and team have been doing a bang up job of late. But we need to
> stop considering "integration" as the end game of all interesting
> software in the OpenStack ecosystem, and I think it's better to have
> that conversation sooner rather than later.

I think it's clear that we need to have that conversation again, but in 
the specific case of Zaqar I believe it is such a fundamental building 
block that the question of how many building blocks are allowed inside 
the tent (sorry) is not relevant to the question at hand.

cheers,
Zane.

[1] 
http://lists.openstack.org/pipermail/openstack-dev/2014-August/043871.html
[2] http://blog.linux2go.dk/2013/08/30/openstack-design-tenets-part-2/


