[openstack-dev] [stable] juno is fubar in the gate

Matthew Treinish mtreinish at kortar.org
Tue Feb 10 15:35:50 UTC 2015


On Tue, Feb 10, 2015 at 11:19:20AM +0100, Thierry Carrez wrote:
> Joe, Matt & Matthew:
> 
> I hear your frustration with broken stable branches. With my
> vulnerability management team member hat, responsible for landing
> patches there with a strict deadline, I can certainly relate with the
> frustration of having to dive in to unbork the branch in the first
> place, rather than concentrate on the work you initially planned on doing.
> 
> That said, wearing my stable team member hat, I think it's a bit unfair
> to say that things are worse than they were and call for dramatic
> action. The stable branch team put a structure in place to try to
> continuously fix the stable branches rather than reactively fix it when
> we need it to work. Those champions have been quite active[1] unbreaking
> it in the past months. I'd argue that the branch is broken much less
> often than it used to. That doesn't mean it's never broken, though, or
> that those people are magicians.

I don't agree at all, for 2 reasons. The first is that in every discussion we
had at 2 summits I raised the increased maint. burden of a longer support window
and was told that people were going to stand up so it wouldn't be an issue. I
have yet to see that happen. I have not seen anything to date that would
convince me that we are at all ready to be maintaining 3 stable branches at once.

The second is that while I've seen that etherpad, I still see a huge disconnect
here about what actually maintaining the branches requires. The issue which I'm
raising is about the gating infrastructure and how to ensure that things stay
working. There is a non-linear overhead involved with making sure any gating job
(on stable or master) stays working. People need to take ownership of jobs to
make sure they keep working.

> 
> One issue in the current situation is that the two groups (you and the
> stable maintainers) seem to work in parallel rather than collaborate.
> It's quite telling that the two groups maintained separate etherpads to
> keep track of the fixes that needed landing.

I don't actually view it that way. Just looking at the etherpad, it covers a
very small subset of the actual types of issues we're raising here.

For example, there was a week in late Nov. when 2 consecutive oslo project
releases broke the stable gates. After we unwound all of this and landed the
fixes in the branches, the next step was to make changes to ensure we didn't
allow breakages in the same way:

http://lists.openstack.org/pipermail/openstack-dev/2014-November/051206.html

This also happened at the same time as a new testtools stack release which
broke every branch (including master). Another example is all of the setuptools
stack churn from the famed Christmas releases. That was another critical
infrastructure piece that fell apart and was mostly handled by the infra team.
All of these things are getting fixed because they have to be, to make sure
development on master can continue, not because those with a vested interest in
the stable branches working for 15 months are working on them.

The other aspect here is the development effort to make things more stable in
this space, things like the effort to pin the requirements on stable branches
which Joe is spearheading. These efforts are critical to the long term success
of the stable branches, yet no one has stepped up to help with them.

I view this as a disconnect between what people think maintaining a stable
branch means and what it actually entails. Sure, the backporting of fixes to
intermittent failures is part of it. But the most effort is spent on making
sure the gating machinery stays well oiled and doesn't break down.

> 
> [1] https://etherpad.openstack.org/p/stable-tracker
> 
> Matthew Treinish wrote:
> > So I think it's time we called the icehouse branch and marked it EOL. We
> > originally conditioned the longer support window on extra people stepping
> > forward to keep things working. I believe this latest issue is just the latest
> > indication that this hasn't happened. Issue 1 listed above is being caused by
> > the icehouse branch during upgrades. The fact that a stable release was pushed
> > at the same time things were wedged on the juno branch is just the latest
> > evidence to me that things aren't being maintained as they should be. Looking at
> > the #openstack-qa irc log from today or the etherpad about trying to sort this
> > issue should be an indication that no one has stepped up to help with the
> > maintenance and it shows given the poor state of the branch.
> 
> I disagree with the assessment. People have stepped up. I think the
> stable branches are less often broken than they were, and stable branch
> champions (as their tracking etherpad shows) have made a difference.
> There just has been more issues as usual recently and they probably
> couldn't keep track. It's not a fun job to babysit stable branches,
> belittling the stable branch champions results is not the best way to
> encourage them to continue in this position. I agree that they could
> work more with the QA team when they get overwhelmed, and raise more red
> flags when they just can't keep up.

I actually don't see it that way. As one of the few people who has been doing
this stable debug stuff for some time, it's really the same story as always. The
pain points have just shifted. The difference now is that instead of everyone
panicking around stable release time that things don't work on the stable
branches, certain people are seeing the pain constantly, because we've moved to
a branchless model for things like tempest.

It's not necessarily about sitting around and babysitting, but about at least
starting to actually watch the jobs that run on the stable branch. The periodic
jobs don't give even close to a complete picture of the state of the world and
don't run frequently enough to catch everything. Part of the issue here is that
because I work on tempest, grenade, and devstack, I see these failures every
time they happen: they inevitably block development on one of those projects,
since the stable jobs are gating.

I don't mean to belittle anyone's efforts here. I personally know that I wouldn't
want or be able to do any of the traditional stable-maint backport work, and I
know it takes time to come up to speed on this work. But it doesn't change the
position we're in right now.

> 
> I also disagree with the proposed solution. We announced a support
> timeframe for Icehouse, our downstream users made plans around it, so we
> should stick to it as much as we can. If we dropped stable branch
> support every time a patch can't be landed there, there would just not
> be any stable branch.

It's not just this latest issue which has caused me to raise this (we have a
fix plan in progress, although EOL would make that moot). It's the same story
almost every other week at this point. The longer window was always just an
experiment, and my understanding was that if we deemed it untenable from a
maintenance POV we wouldn't do it. I strongly feel that we need to just say
this isn't working right now and EOL icehouse, especially before we enter a
period where we're maintaining 3 stable branches at once.

-Matt Treinish