[openstack-dev] Gate proposal - drop Postgresql configurations in the gate

Mark McLoughlin markmc at redhat.com
Fri Jun 13 12:13:25 UTC 2014


On Fri, 2014-06-13 at 07:31 -0400, Sean Dague wrote:
> On 06/13/2014 02:36 AM, Mark McLoughlin wrote:
> > On Thu, 2014-06-12 at 22:10 -0400, Dan Prince wrote:
> >> On Thu, 2014-06-12 at 08:06 -0400, Sean Dague wrote:
> >>> We're definitely deep into capacity issues, so it's going to be time to
> >>> start making tougher decisions about things we decide aren't different
> >>> enough to bother testing on every commit.
> >>
> >> In order to save resources, why not combine some of the jobs in
> >> different ways? So, for example, instead of:
> >>
> >>  check-tempest-dsvm-full
> >>  check-tempest-dsvm-postgres-full
> >>
> >> Couldn't we just drop the postgres-full job and run one of the Neutron
> >> jobs w/ postgres instead? Or something similar; so long as at least one
> >> of the jobs which runs most of Tempest is using PostgreSQL, I think we'd
> >> be mostly fine. Not shooting for 100% coverage for everything with our
> >> limited resource pool is fine; let's just do the best we can.
> >>
> >> Ditto for gate jobs (not check).
> > 
> > I think that's what Clark was suggesting in:
> > 
> > https://etherpad.openstack.org/p/juno-test-maxtrices
> > 
> >>> Previously we've been testing Postgresql in the gate because it has a
> >>> stricter interpretation of SQL than MySQL. And when we didn't test
> >>> Postgresql it regressed. I know, I chased it for about 4 weeks in grizzly.
> >>>
> >>> However, Monty brought up a good point at Summit: MySQL has a strict
> >>> mode, which should actually enforce the same strictness.
> >>>
> >>> My proposal is that we land this change to devstack -
> >>> https://review.openstack.org/#/c/97442/ and backport it to past devstack
> >>> branches.
> >>>
> >>> Then we drop the pg jobs, as the differences between the 2 configs
> >>> should then be very minimal. All the *actual* failures we've seen
> >>> between the 2 were completely about this strict SQL mode interpretation.
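
To make the strict mode point concrete, here is a minimal sketch of the
idea in Python using SQLAlchemy; the connection URL and the per-connection
listener are illustrative only, not what the devstack patch actually does:

    from sqlalchemy import create_engine, event

    # Illustrative connection URL only.
    engine = create_engine("mysql+pymysql://user:password@localhost/nova")

    @event.listens_for(engine, "connect")
    def _enable_strict_mode(dbapi_conn, connection_record):
        # TRADITIONAL turns out-of-range values, bad dates and silently
        # truncated strings into hard errors rather than warnings, which
        # is the PostgreSQL-like strictness being discussed here.
        cursor = dbapi_conn.cursor()
        cursor.execute("SET SESSION sql_mode = 'TRADITIONAL'")
        cursor.close()

The same effect can be had server-wide by setting sql_mode under [mysqld]
in my.cnf, which I assume is closer to what the devstack change does.
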
> >>
> >>
> >> I suppose I would like to see us keep it in the mix. Running SmokeStack
> >> for almost 3 years, I found many an issue dealing w/ PostgreSQL. I ran it
> >> concurrently with many of the other jobs, and I too had limited resources
> >> (much less than what we have in infra today).
> >>
> >> Would MySQL strict SQL mode catch stuff like this (old bugs, but still
> >> valid for this topic I think):
> >>
> >>  https://bugs.launchpad.net/nova/+bug/948066
> >>
> >>  https://bugs.launchpad.net/nova/+bug/1003756
> >>
> >>
> >> Having support for and testing against at least 2 databases helps keep
> >> our SQL queries and migrations cleaner... and is generally a good
> >> practice given we have abstractions which are meant to support this sort
> >> of thing anyway (so by all means let us test them!).
> >>
> >> Also, having compacted the Nova migrations 3 times now, I found many
> >> issues by testing on multiple databases (MySQL and PostgreSQL). I'm
> >> quite certain our migrations would be worse off if we just tested
> >> against a single database.
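
To make the multi-backend point concrete, here is a minimal, hypothetical
sketch of the idea; the connection URLs and the apply_migrations() helper
are stand-ins, not the actual Nova or SmokeStack test setup:

    import sqlalchemy as sa

    # Placeholder URLs; the point is only that the same DDL is applied
    # to more than one backend.
    BACKEND_URLS = [
        "mysql+pymysql://user:password@localhost/testdb",
        "postgresql+psycopg2://user:password@localhost/testdb",
    ]

    def apply_migrations(engine):
        # Stand-in for whatever applies the real migration scripts.
        meta = sa.MetaData()
        sa.Table(
            "instances", meta,
            sa.Column("id", sa.Integer, primary_key=True),
            sa.Column("hostname", sa.String(255), nullable=False),
        )
        meta.create_all(engine)

    def test_migrations_on_all_backends():
        for url in BACKEND_URLS:
            engine = sa.create_engine(url)
            # Backend-specific SQL fails on one engine or the other here
            # instead of slipping through.
            apply_migrations(engine)
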
> > 
> > Certainly sounds like this testing is far beyond the "might one day be
> > useful" level Sean talks about.
> 
> The migration compaction is a good point. And I'm happy to see there
> were some bugs exposed as well.
> 
> Here is where I remain stuck....
> 
> We are now at a failure rate at which it takes 3 days (minimum) to land
> a fix that decreases our failure rate at all.
> 
> The way we are currently solving this is by effectively building a
> "manual zuul": smart humans working in coordination to do an end run
> around our system. We've merged 18 fixes this way so far -
> https://etherpad.openstack.org/p/gatetriage-june2014. Merging a
> fix this way is at least an order of magnitude more expensive in
> people time because of the analysis and coordination we need to go
> through to make sure these things are the right things to jump the queue.
> 
> That effort, over 8 days, has gotten us down to *only* a 24hr merge
> delay. And there are no more smoking guns. What's left is a ton of
> subtle things. I've got ~30 patches outstanding right now (a bunch are
> things to clarify what's going on in the build runs, especially in the
> failure scenarios). Every single one of them has been failed by Jenkins at
> least once. Almost every one was failed by a different unique issue.
> 
> So I'd say at best we're 25% of the way towards solving this. That being
> said, because of the deep queues, people are just recheck grinding (or
> hitting the jackpot and landing something through the gate that then
> fails a lot after landing). That leads to bugs like this:
> 
> https://bugs.launchpad.net/heat/+bug/1306029
> 
> Which was seen early in the patch - https://review.openstack.org/#/c/97569/
> 
> Then kind of destroyed us completely for a day -
> http://status.openstack.org/elastic-recheck/ (it's the top graph).
> 
> And, predictably, a week into a long gate queue everyone is now grumpy.
> The sniping between projects, and within projects, over assigning blame
> starts to spike at about day 4 of these events. Everyone assumes someone
> else is to blame for these things.
> 
> So there is real community impact when we get to these states.
> 
> ....
> 
> So, I'm kind of burnt out trying to figure out how to get us out of
> this. I do take it personally when we as a project can't merge code,
> as that's a terrible state to be in.
> 
> Pleading to get more people to dive in is mostly not helping.
> 
> So my only thinking at this point is that we prune back our test jobs
> to a small enough number of configurations that the fixed number of
> people actually trying to debug this actually can.
> 
> If there are other ideas, that's great.
> 
> But 'you aren't allowed to do less' isn't really sustainable. That just
> leads to people giving up on helping.

Totally understand, and agree with the severity of the situation.

Retreating is one thing, but let's not label the job as useless in the
process. We can disable the job because of capacity issues even if we
feel its coverage is as important as the day we added the job.

How about explicitly priority-ordering the jobs such that when we
retreat the lowest priority job is dropped first, but is also the first
one to be added back (assuming its pass rate is sufficiently high) when
we feel we have capacity again?

Debating the priority order of the jobs, brainstorming ways of mixing
configurations in the jobs to get the best coverage, etc. would then be
something we'd do with cool heads in calmer times rather than when the
gate is on fire.

Mark.



