[OpenStack-Infra] Status of check-tempest-dsvm-f20 job

Eoghan Glynn eglynn at redhat.com
Wed Jun 18 21:44:43 UTC 2014



> >> >>> If we were to use f20 more widely in the gate (not to entirely
> >> >>> supplant precise, more just to split the load more evenly) then
> >> >>> would the problem observed tend to naturally resolve itself?
> >> >>
> >> >> I would be happy to see that, having spent some time on the Fedora
> >> >> bring-up :) However I guess there is a chicken-egg problem with
> >> >> large-scale roll-out in that the platform isn't quite stable yet.
> >> >> We've hit some things that only really become apparent "in the wild";
> >> >> differences between Rackspace & HP images, issues running on Xen which
> >> >> we don't test much, the odd upstream bug requiring work-arounds [1],
> >> >> etc.
> >> >
> >> > Fair point.
> >> >
> >> >> It did seem that devstack changes were the best place to stabilize the
> >> >> job.  However, as is apparent, devstack changes often need to be
> >> >> pushed through quickly and that does not match well with a slightly
> >> >> unstable job.
> >> >>
> >> >> Having it experimental in devstack isn't much help in stabilizing.  If
> >> >> I trigger experimental builds for every devstack change it runs
> >> >> several other jobs too, so really I've just increased contention for
> >> >> limited resources by doing that.
> >> >
> >> > Very true also.
> >> >
> >> >> I say this *has* to be running for devstack eventually to stop the
> >> >> fairly frequent breakage of devstack on Fedora, which wastes a lot of
> >> >> people's time, often chasing the same bugs.
> >> >
> >> > I agree, if we're committed to Fedora being a first class citizen (as
> >> > per TC distro policy, IIUC) then it's crucial that Fedora-specific
> >> > breakages are exposed quickly in the gate, as opposed to being seen
> >> > by developers for the first time in the wild whenever they happen to
> >> > refresh their devstack.
> >> >
> >> >> But in the mean time, maybe suggestions for getting the Fedora job
> >> >> exposure somewhere else where it can brew and stabilize are a good
> >> >> idea.
> >> >
> >> > Well, I would suggest the ceilometer/py27 unit test job as a first
> >> > candidate for such exposure.
> >> >
> >> > The reason being that mongodb 2.4 is not available on precise, but
> >> > is on f20. As a result, the mongodb scenario tests are effectively
> >> > skipped in the ceilo/py27 units, which is clearly badness and needs
> >> > to be addressed.
> >> >
> >> > Obviously this lack of coverage will resolve itself quickly once
> >> > the Trusty switchover occurs, but it seems like we can short-circuit
> >> > that process by simply switching to f20 right now.
> >> >
> >> > I think the marconi jobs would be another good candidate, where
> >> > switching over to f20 now would add real value. The marconi tests
> >> > include some coverage against mongodb proper, but this is currently
> >> > disabled, as marconi requires mongodb version >= 2.2 (and precise
> >> > can only offer 2.0.4).
> >> >
> >> >> We could make a special queue just for f20 that only triggers that
> >> >> job, if others like that idea.
> >> >>
> >> >> Otherwise, ceilometer maybe?  I made some WIP patches [2,3] for this
> >> >> already.  I think it's close, just deciding what tempest tests to
> >> >> match for the job in [2].
> >> >
> >> > Thanks for that.
> >> >
> >> > So my feeling is that at least the following would make sense to base
> >> > on f20:
> >> >
> >> > 1. ceilometer/py27
> >> > 2. tempest variant with the ceilo DB configured as mongodb
> >> > 3. marconi/py27
> >> >
> >> > Then a random selection of other p27 jobs could potentially be added
> >> > over time to bring f20 usage up to approximately the same breadth
> >> > as precise.
> >> >
> >> > Cheers,
> >> > Eoghan
> >
> > Thanks for the rapid response Sean!
> >
> >> Unit test nodes use yet another image entirely. So this actually
> >> wouldn't make anything better, it would just also stall out ceilometer
> >> and marconi unit tests in the same scenario.
> >
> > A-ha, ok, thanks for the clarification on that.
> >
> > My thought was to make the zero-allocation-for-lightly-used-classes
> > problem go away by making f20 into a heavily-used node class, but
> > I guess just rebasing *only* the ceilometer & marconi py27 jobs onto
> > f20 would be insufficient to get f20 over the line in that regard?
> >
> > And if I understood correctly, rolling directly to a wider selection
> > of py27 jobs rebased on f20 would be premature in the infra group's
> > eyes, because f20 hasn't yet proven itself in the gate.
> >
> > So I guess I'd like to get a feel for where the bar is set on f20
> > being considered a valid candidate for a wider selection of gate
> > jobs?
> >
> > e.g. running as an experimental platform for X amount of time?
> >
> > From a purely parochial ceilometer perspective, I'd love to see the
> > py27 job running on f20 sooner rather than later, as the asymmetry between
> > the tests run in the gate and in individual developers' environments
> > is clearly badness.
> >

Thanks for the response Clark, much appreciated.

> As one of the infra folk tasked with making things like this happen I
> am a very big -1 on this, -1.9 :) The issue is that Fedora (and non-LTS
> Ubuntu releases) are supported for far less time than we support
> stable branches. So what do we do after the ~13 months of Fedora
> release support and ~9 months of Ubuntu release support? Do we port
> the tests forward for stable branches? We have discussed this and that
> is something we will not do. Instead we need releases with >=18 months
> of support so that we can continue to run the tests on the distro
> releases they were originally tested on. This is why we test on CentOS
> and Ubuntu LTS.

This is a very interesting point, and I'll readily admit that my
knowledge of this area starts from a fairly low base.

I've been working with an assumption that the TC distro policy:

 "OpenStack will target its development efforts to latest Ubuntu/Fedora, 
  but will not introduce any change that would make it impossible to run 
  on the latest Ubuntu LTS or latest RHEL." 

means that CI, as well as code-crafting efforts, should be focused on latest
Ubuntu/Fedora (as most folks in the community rightly view CI green-ness
as an integral part of the development workflow). 

But, by the same token, I'm realistic about the day-to-day pressures
that you Trojans of openstack qa/infra are under, and TBH I hate quoting
the highfalutin' policies handed down by the TC.

So I'd really like to understand the practical blockers here, and towards
that end I'll throw out a couple of dumb questions:

* would it be outrageous to have a different CI distro target for
  master versus stable/most-recent-release-minus-1?

* would it be less outrageous to have a different CI distro target
  for master *and* stable/most-recent-release-minus-1 versus 
  stable/most-recent-release-minus-{2|3|..}?

What I'm getting at is the notion that we could have both master and
stable/most-recent-release-minus-1 running on Fedora-latest, while
the bitrot jobs run against the corresponding LTS release.
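
To make that concrete, the crude mapping I have in mind is something like
the toy sketch below (the node labels and branch names are purely
illustrative, not lifted from the real zuul/nodepool configuration):

# Purely illustrative -- not real zuul or nodepool code.  The branch a
# change targets decides which image family its jobs run on: the two
# newest branches track Fedora-latest, while anything older (i.e. the
# bitrot jobs) stays on the LTS it originally shipped against.

BRANCH_TO_IMAGE = {
    'master': 'fedora-20',             # hypothetical label names
    'stable/icehouse': 'fedora-20',
    'stable/havana': 'ubuntu-precise',
}


def image_for(branch, default='ubuntu-precise'):
    """Return the image a job for `branch` would be scheduled on."""
    return BRANCH_TO_IMAGE.get(branch, default)

# e.g. image_for('master') == 'fedora-20'
#      image_for('stable/havana') == 'ubuntu-precise'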

Apologies in advance if I've horribly over-simplified the issues in
play here.

> The f20 tempest job is a little special and those involved have
> basically agreed that we will just stop testing f20 when f20 is no
> longer supported and not replace that job on stable branches. This is
> ok because we run enough tempest tests on other distros that dropping
> this test should be reasonably safe for the stable branches. This is
> not true of the python 2.7 unittests jobs if we switch them to Fedora.
> After 13 months what do we do? Also, there is probably minimal value
> in testing Fedora vs Ubuntu LTS vs CentOS with the unittests as they
> run in relatively isolated virtualenvs anyways.

In general, I'd agree with you WRT the low value of unittests running
against latest versus LTS, *except* in the case where those tests are
hobbled by the lack of some crucial dependency on LTS (e.g. mongodb>=2.4
for ceilo & marconi).
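
Just to spell out what I mean by "hobbled": the mongodb scenario tests are
effectively guarded by a version check along the lines of the sketch below
(names hypothetical, not the actual ceilometer test code), so on precise,
with mongodb 2.0.x, the whole class silently skips:

import subprocess
import unittest


def _mongod_version():
    """Best-effort probe of the locally installed mongod, or None."""
    try:
        out = subprocess.check_output(['mongod', '--version'])
    except (OSError, subprocess.CalledProcessError):
        return None
    # The first line usually looks like "db version v2.4.6".
    first = out.decode('utf-8').splitlines()[0]
    parts = first.rsplit('v', 1)[-1].split('.')[:2]
    return tuple(int(p) for p in parts)


class MongoDBScenarioTest(unittest.TestCase):
    """Hypothetical stand-in for the mongodb-backed scenario tests."""

    def setUp(self):
        super(MongoDBScenarioTest, self).setUp()
        version = _mongod_version()
        if version is None or version < (2, 4):
            # On precise (mongodb 2.0.x) this always fires, so the
            # scenario coverage quietly vanishes from the py27 job.
            self.skipTest('mongodb >= 2.4 required')

    def test_record_metering_data(self):
        pass  # the real scenario assertions would live here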

Cheers,
Eoghan
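
P.S. On Sean's point further down about a fairer allocation algorithm in
nodepool: the "simple randomization" I mooted earlier was nothing fancier
than the toy below. It's not a patch against the real allocator, just an
illustration of weighting the dice by outstanding demand so that a
lightly-used label like devstack-f20 can't be starved outright:

import random


def allocate(demand, free_slots):
    """Toy weighted-random allocator.

    `demand` maps node label -> outstanding requests; `free_slots` is how
    many nodes can be launched right now.  Each slot goes to a label with
    probability proportional to its remaining demand, so even a label with
    a single pending request keeps a non-zero chance on every pass.
    """
    grants = dict.fromkeys(demand, 0)
    remaining = dict(demand)
    for _ in range(free_slots):
        pending = [lbl for lbl, n in remaining.items() if n > 0]
        if not pending:
            break
        pick = random.uniform(0, sum(remaining[lbl] for lbl in pending))
        for label in pending:
            pick -= remaining[label]
            if pick <= 0:
                grants[label] += 1
                remaining[label] -= 1
                break
    return grants

# e.g. allocate({'devstack-precise': 50, 'devstack-f20': 1}, 10) gives
# devstack-f20 a real chance (roughly one in five here) of landing a slot
# rather than a guaranteed zero.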

 
> Though I am not a strong -2 on this subject, I think the current
> stance is sane given the resources available to us and for unittests
> there is very little reason to incur the overhead of doing it
> differently.
> >
> > I believe some of the marconi folks would also be enthused about
> > enabling mongodb-based testing in the short term.
> >
> >> I think the real issue is to come up with a fairer algorithm that
> >> prevents any node class from starving, even in the extreme case. And get
> >> that implemented and accepted in nodepool.
> >
> > That would indeed be great. Do you think the simple randomization
> > idea mooted earlier on this thread would suffice there, or am I over-
> > simplifying?
> >
> >> I do think devstack was the right starting point, because it fixes lots
> >> of issues we've had with us accidentally breaking fedora in devstack.
> >> We've yet to figure out how overall reliable fedora is going to be.
> >
> > If there's anything more that the Red Hatters in the community can
> > do to expedite the process of establishing the reliability of f20,
> > please do let us know.
> >
> > Thanks!
> > Eoghan
> >
> >>       -Sean
> >>
> >> --
> >> Sean Dague
> >> http://dague.net
> >>
> >>
> >
> 
> Clark
> 


