[Openstack-operators] A Hypervisor supporting containers

Sean Dague sean at dague.net
Fri May 2 21:15:21 UTC 2014

On 05/02/2014 08:47 AM, Narayan Desai wrote:
> <This ended up being a bit of a primal scream -- I am writing this from
> a position of concern about the strategy that openstack is taking>
> tl;dr: openstack is starting to feel like a tv show called "when
> developers attack" 
> The openstack dev community is starting to feel more and more like
> software engineering fundamentalists. Testing is important. But frankly,
> there are a bunch of things that are as or more important. And matter
> much more to people that want to run, not just develop the software.
> We've seen features proposed for removal (not just in nova) because of
> lack of testing coverage. Features that have been integrated for years,
> that we've been using in production *for years without any problems*. 
> Getting new code integrated is a nightmare. Take a look at this:
> https://review.openstack.org/#/c/65113/
> for an example. You have a well established member of the openstack
> community, proposing code for a set of features that everyone wants
> (uniform integration of storage across glance, ephemeral storage, and
> cinder using ceph), and it gets blocked because devstack isn't up to
> snuff. Talk about cutting off your nose to spite your face. 

Hey Narayan :)

So, as I held the last -2 there, I feel a bit of a need to defend that.
Devstack is opinionated, and to this point has focused on OpenStack
technologies being the main bring up point.

Especially given the giant drive-by that the docker patches were last fall,
where docker support in devstack was broken about 50% of the time
because no one was maintaining it and it depended on code outside of the
distros. And upstream package details kept changing in ways that broke
it, a lot.

Deciding to have a mode that changes all the storage over to Ceph is
kind of a big deal. Especially given the rate of code integration into
devstack. If it isn't being tested, it rots, really fast. Then we (I)
get pinged privately in IRC and email about why it doesn't work. Every
other day. I am now regularly in the habit of waking up, closing 4
private IRC tabs from people I don't know without answering, before I
start my day.

The immune response is there for a reason.

At the same time I understand that people do want these things. So how
do we find a way to keep the upstream code something that's maintained
and working for people? Plenty of Fedora folks complain devstack is
always broken on Fedora, and it is, because nothing automatically checks
that code.

> Feedback from operators is regularly ignored in favor of clean (though
> clearly flawed) software architecture. There was a large discussion
> recently here about the fundamental flaws in the quota system, as
> currently designed. We chimed in, along with Tim Bell, Jay Pipes, and a
> few other people. It was one of the more detailed discussions that we've
> had here, and I thought did a good job of capturing issues. When these
> issues were brought up on IRC with nova devs, we got the response they
> couldn't be bothered to read the whole thread on operators, and several
> people continued to argue that we didn't need what we said we needed for
> quite some time. I'm not saying that operators should be deferred to in
> all areas here, but we do understand how the system works in practice
> and at scale quite a bit better than the developers.
> The feedback loops from users/ops continue to be broken. Tim's efforts
> on behalf of the user committee are important steps in the right
> direction, but the developer culture is openstack culture in a deep way.
> Operators continue to be on the outside.
> As another illustration of this, I was contacted a few months ago by
> developers interested in scheduling. Now, I have a lot of experience in
> scheduling, and have done research in the area for the last 10 years, so
> this is a good start. So, they are interested in breaking scheduling out
> to its own project. This may or may not be a good idea; taking that
> approach makes some things easier, like coordination of strategies, but
> comes at a higher coordination cost. Having worked through this
> transition with a different scheduler, i don't think this is a decision
> you make lightly. At any rate, they were looking for a person to push
> the effort forward, which would consist of 3-6 months of refactoring to
> get the code into a better state. This might sound basically reasonable,
> but any discussion of gap analysis was completely missing. The state of
> the scheduling (placement, actually, not scheduling really) is pretty
> underwhelming, and causes us operational problems all of the time, but
> that isn't on the radar. These guys had the best of intentions, but are
> operating with a different set of incentives and experiences that cause
> them to prioritize things in a way that unintentionally clashes with ops
> folks. I understand why this happens, but it is unclear how to fix it. 
> To be clear, I don't think that there is any bad intent here, but the
> differences in goals, experiences, and incentives means this problem
> isn't going to fix itself. Devs need to make sure they maintain code
> quality, and have a reasonable immune system to protect from bad code
> and ideas. We just need to make sure we don't develop the process
> equivalent of lupus.
> Case in point. In the absence of a budget, unit testing is better than
> not, but integration testing ends up being more important in my
> experience. The thing that trumps both of them is real experience in
> actual large scale systems. 

I agree, with a caveat. The real experience captures the state of
working today. Which is great. It doesn't, however, help us keep things
working tomorrow.

There are currently 391 patches up for review in Nova. Any one of those
is capable of breaking OpenStack for everyone. Human eyes are good, but
completely fallible. Human eyes + integration tests are much better.

> Problems there will never be adequately
> captured by either of those processes. Making huge investments in the
> first two venues as gating criteria while doing the third informally
> seems like an overemphasis of the wrong things to me.

It is notable that plenty of major features that people believe to work,
but that aren't gated, are kind of perpetually broken. Cells is a very good
example. The Nectar folks have been chasing regressions through all of
Icehouse. Long-standing cross-service race conditions finally got
exposed in repeatable ways through integration testing during this cycle.

Onboarding the integration testing for the hypervisors was an
eye-opening experience for most of the teams involved, as very serious
parts of their drivers *just didn't work*.

So there is a reason why there continues to be focus here, because when
we do integration testing we find really substantially broken things.

We should *definitely* also figure out how to get more large scale
experience injected back in. I think it's clear Summit is not that
venue, so the next question is where that venue might exist, whether it's
a physical place or a virtual one. The Linux Foundation addressed this
sort of issue around Linux with the End Users Summit as a completely
different kind of gathering event mostly to bridge these divides.

But maybe a real world event doesn't work well here. What about some
better format to get operator stories back into the hands of the
development community? I'd love nothing more than a regular voice/video
presentation by various operators discussing their installations and
major pain points, in a level of detail where we could start to figure
out parts / pieces that can be tackled in the near term (current cycle).


Sean Dague
