[openstack-dev] [stable] juno is fubar in the gate

Matthew Treinish mtreinish at kortar.org
Tue Feb 10 17:47:56 UTC 2015


On Tue, Feb 10, 2015 at 11:50:28AM -0500, David Kranz wrote:
> On 02/10/2015 10:35 AM, Matthew Treinish wrote:
> >On Tue, Feb 10, 2015 at 11:19:20AM +0100, Thierry Carrez wrote:
> >>Joe, Matt & Matthew:
> >>
> >>I hear your frustration with broken stable branches. With my
> >>vulnerability management team member hat, responsible for landing
> >>patches there with a strict deadline, I can certainly relate with the
> >>frustration of having to dive in to unbork the branch in the first
> >>place, rather than concentrate on the work you initially planned on doing.
> >>
> >>That said, wearing my stable team member hat, I think it's a bit unfair
> >>to say that things are worse than they were and call for dramatic
> >>action. The stable branch team put a structure in place to try to
> >>continuously fix the stable branches rather than reactively fix it when
> >>we need it to work. Those champions have been quite active[1] unbreaking
> >>it in the past months. I'd argue that the branch is broken much less
> >>often than it used to. That doesn't mean it's never broken, though, or
> >>that those people are magicians.
> >I don't think that's unfair at all, for 2 reasons. The first is that in every
> >discussion we had at 2 summits I raised the increased maint. burden for a
> >longer support window and
> >was told that people were going to stand up so it wouldn't be an issue. I have
> >yet to see that happen. I have not seen anything to date that would convince
> >me that we are at all ready to be maintaining 3 stable branches at once.
> >
> >The second is that, while I've seen that etherpad, I still think there is a
> >huge disconnect here about what actually maintaining the branches requires. The
> >issue which I'm raising is about issues related to the gating infrastructure and
> >how to ensure that things stay working. There is a non-linear overhead involved
> >with making sure any gating job (on stable or master) stays working. People need
> >to take ownership of jobs to make sure they keep working.
> >
> >>One issue in the current situation is that the two groups (you and the
> >>stable maintainers) seem to work in parallel rather than collaborate.
> >>It's quite telling that the two groups maintained separate etherpads to
> >>keep track of the fixes that needed landing.
> >I don't actually view it that way. Just looking at the etherpad, it has a very
> >small subset of the actual types of issues we're raising here.
> >
> >For example, there was a week in late Nov. when 2 consecutive oslo project
> >releases broke the stable gates. After we unwound all of this and landed the
> >fixes in the branches, the next step was to make changes to ensure we didn't allow
> >breakages in the same way:
> >
> >http://lists.openstack.org/pipermail/openstack-dev/2014-November/051206.html
> >
> >This also happened at the same time as a new testtools stack release which
> >broke every branch (including master). Another example is all of the setuptools
> >stack churn from the famed Christmas releases. That was another critical
> >infrastructure piece that fell apart and was mostly handled by the infra team.
> >All of these things are getting fixed because they have to be, to make sure
> >development on master can continue, not because those with a vested interest in
> >the stable branches working for 15 months are working on them.
> >
> >The other aspect here is the development effort to make things more stable in this
> >space. Things like the effort to pin the requirements on stable branches which
> >Joe is spearheading. These are critical to the long-term success of the stable
> >branches, yet no one has stepped up to help with them.
> >
> >I view this as a disconnect between what people think maintaining a stable
> >branch means and what it actually entails. Sure, the backporting of fixes to
> >intermittent failures is part of it. But most of the effort is spent on making
> >sure the gating machinery stays well oiled and doesn't break down.
> >
> >>[1] https://etherpad.openstack.org/p/stable-tracker
> >>
> >>Matthew Treinish wrote:
> >>>So I think it's time we called it on the icehouse branch and marked it EOL. We
> >>>originally conditioned the longer support window on extra people stepping
> >>>forward to keep things working. I believe this latest issue is just the latest
> >>>indication that this hasn't happened. Issue 1 listed above is being caused by
> >>>the icehouse branch during upgrades. The fact that a stable release was pushed
> >>>at the same time things were wedged on the juno branch is just the latest
> >>>evidence to me that things aren't being maintained as they should be. Looking at
> >>>the #openstack-qa irc log from today or the etherpad about trying to sort this
> >>>issue should be an indication that no one has stepped up to help with the
> >>>maintenance, and the poor state of the branch shows it.
> >>I disagree with the assessment. People have stepped up. I think the
> >>stable branches are less often broken than they were, and stable branch
> >>champions (as their tracking etherpad shows) have made a difference.
> >>There have just been more issues than usual recently and they probably
> >>couldn't keep track. It's not a fun job to babysit stable branches, and
> >>belittling the stable branch champions' results is not the best way to
> >>encourage them to continue in this position. I agree that they could
> >>work more with the QA team when they get overwhelmed, and raise more red
> >>flags when they just can't keep up.
> >I actually don't see it that way. As one of the few people who have been doing
> >this stable debug stuff for some time, I'd say it's really the same story as
> >always. The pain points have just shifted. The difference now is that, instead
> >of everyone panicking around stable release time because things don't work on
> >the stable branches, certain people are seeing the pain constantly, because
> >we've moved to a branchless model for things like tempest.
> >
> >It's not necessarily about sitting around and babysitting, but at least about
> >starting to actually watch the jobs that run on the stable branch. The
> >periodic jobs don't
> >give even close to a complete picture of the state of the world and don't run
> >frequently enough to catch everything. Part of the issue here is that, because I
> >work on tempest, grenade, and devstack, I see these failures every time they
> >happen, since they'll inevitably block development on one of those projects
> >because the stable jobs are gating.
> >
> >I don't mean to belittle anyone's efforts here; I personally know that I wouldn't
> >want or be able to do any of the traditional stable-maint backport work, and I
> >know it takes time to come up to speed on this work. But, it doesn't change the
> >position we're in right now.
> >
> >>I also disagree with the proposed solution. We announced a support
> >>timeframe for Icehouse, our downstream users made plans around it, so we
> >>should stick to it as much as we can. If we dropped stable branch
> >>support every time a patch couldn't be landed there, there would just not
> >>be any stable branch.
> >It's not just this latest issue which has caused me to raise this (we have a
> >fix plan in progress, although EOL would make that moot). It's the same story
> >almost every other week at this point. The longer window was always just an
> >experiment, and I was of the understanding that if we deemed it untenable from a
> >maintenance POV we wouldn't continue it. I strongly feel that we need to just say
> >this isn't working right now and EOL it, especially before we enter a period where
> >we're maintaining 3 stable branches at once.
> >
> >-Matt Treinish
> Matt, I have hesitated to weigh in here, but though I agree with much of
> this, I also think stable branches are more important than you seem to think.
> Nomex suit on...
> 
> We should consider the possibility that branchless tempest may also be
> something where the true cost was not appreciated. When branchless tempest
> implied we needed to keep xml tests around in tempest, we threw them out
> anyway, which was reasonable. I would rather give up branchless tempest than
> the ability for real distributors/deployers/operators to collaborate on
> stable branches. If everything were pinned on stable branches and there were no
> branchless tempest, it would make things more tractable both for those
> interested in keeping stable working and for those just interested in trunk. I
> believe this would also be closer to what many real deployers actually do.

I think the discussion on branchless tempest is conflating separate issues here.
Branchless tempest just makes the test suite behave like the clients or a
library in this context. The issues associated with it from the stable branch
context are the same as what we're dealing with for the clients or libraries.
We've been fighting the same issues with client and library releases all cycle;
the most recent one just happens to be a tempest-lib release being consumed by
javelin.

I disagree with your assertion that having a branched tempest model would make
anything simpler. In the context of this discussion all that does is change how
the requirements are installed for one project; we'd still be fighting all these
issues with the clients or libraries when they release. All we'd be doing is
removing one project from the very long list of things that get installed
without a stable branch during a gate run, at the cost of significantly
decreasing the quality of compat testing between releases.

The other thing to remember is that when we had a branched tempest model, the
"stable" tempest branches often went mostly unmaintained, which made backporting
any kind of fix next to impossible. From that point of view the
branchless model is much better for stable-maint (which apevec alluded to
earlier) because it enables fixes in a timely manner without much headache.

The real fix here is venvs for the clients (and for branchless tempest, for the
same reason) to isolate the competing sets of requirements. We were already
doing this to an extent with tempest because of tox, but there were bugs around
how we were using it (which is what we were trying to fix when we hit the
latest set of issues).
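
Just to make the isolation idea concrete, the direction is roughly something
like the sketch below (purely illustrative, not what devstack or grenade
literally run today; the paths and package names are made up):

    # Give each client its own venv so its (latest-release) requirements
    # can't conflict with what the stable-branch servers have installed
    # system-wide.
    virtualenv /opt/stack/venvs/novaclient
    /opt/stack/venvs/novaclient/bin/pip install python-novaclient

    # Tempest already gets roughly this via tox, which builds its venv from
    # tempest's own requirements rather than the system site-packages:
    cd /opt/stack/tempest
    tox -e full --notest    # build the isolated venv without running the tests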

> 
> That would still leave grenade as an issue because (I think) we try to run
> management code and multiple releases of OpenStack on the same node. I
> presume no real deployer does that and that the discussion about venvs will
> address that issue.
> 
> Following this thread I can't help thinking that folks want to help keep
> stable working but it is just very complicated the way things are now, and
> the consequences of making a mistake are very high.
> 

Yes, this is part of my argument on this thread: there are a lot of moving pieces,
and maintaining all of this is a lot of additional work. I fully understand the need
for stable branches, but without additional people taking time to understand how
everything fits together here, we just can't do it for 15 months and have the same
gating infrastructure in place.

-Matt Treinish