[openstack-dev] Gate proposal - drop Postgresql configurations in the gate
sean at dague.net
Fri Jun 13 11:31:02 UTC 2014
On 06/13/2014 02:36 AM, Mark McLoughlin wrote:
> On Thu, 2014-06-12 at 22:10 -0400, Dan Prince wrote:
>> On Thu, 2014-06-12 at 08:06 -0400, Sean Dague wrote:
>>> We're definitely deep into capacity issues, so it's going to be time to
>>> start making tougher decisions about things we decide aren't different
>>> enough to bother testing on every commit.
>> In order to save resources, why not combine some of the jobs in different
>> ways? So for example, instead of:
>> Couldn't we just drop the postgres-full job and run one of the Neutron
>> jobs w/ postgres instead? Or something similar; so long as at least one
>> of the jobs which runs most of Tempest is using PostgreSQL, I think we'd
>> be mostly fine. Not shooting for 100% coverage for everything with our
>> limited resource pool is fine; let's just do the best we can.
>> Ditto for gate jobs (not check).
> I think that's what Clark was suggesting in:
>>> Previously we've been testing Postgresql in the gate because it has a
>>> stricter interpretation of SQL than MySQL. And when we didn't test
>>> Postgresql it regressed. I know, I chased it for about 4 weeks in Grizzly.
>>> However Monty brought up a good point at Summit, that MySQL has a strict
>>> mode. That should actually enforce the same strictness.
>>> My proposal is that we land this change to devstack -
>>> https://review.openstack.org/#/c/97442/ - and backport it to past devstack.
>>> Then we drop the pg jobs, as the differences between the 2 configs
>>> should then be very minimal. All the *actual* failures we've seen
>>> between the 2 were completely about this strict SQL mode interpretation.
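For those not familiar with it, the strict mode in question looks roughly like this (a sketch assuming the TRADITIONAL mode; the exact sql_mode value the devstack change sets may differ):

```sql
-- Sketch only: the actual devstack change may choose a different sql_mode.
-- With strict mode off, MySQL silently truncates the value below; with
-- TRADITIONAL strict mode it raises an error, as PostgreSQL would.
SET SESSION sql_mode = 'TRADITIONAL';
CREATE TABLE t (name VARCHAR(4));
INSERT INTO t VALUES ('too long');  -- fails with "Data too long for column"
```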
>> I suppose I would like to see us keep it in the mix. Running SmokeStack
>> for almost 3 years, I found many an issue dealing w/ PostgreSQL. I ran it
>> concurrently with many of the other jobs, and I too had limited resources
>> (much less than what we have in infra today).
>> Would MySQL strict SQL mode catch stuff like this (old bugs, but still
>> valid for this topic I think):
>> Having support for and testing against at least 2 databases helps keep
>> our SQL queries and migrations cleaner... and is generally a good
>> practice given we have abstractions which are meant to support this sort
>> of thing anyway (so by all means let us test them!).
>> Also, having compacted the Nova migrations 3 times now, I found many
>> issues by testing on multiple databases (MySQL and PostgreSQL). I'm
>> quite certain our migrations would be worse off if we just tested
>> against the single database.
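A minimal sketch of the practice Dan describes - running the same migration against more than one backend - using stdlib sqlite3 as a stand-in for the MySQL and PostgreSQL engines a real gate job would use (the migration steps and table names here are hypothetical):

```python
import sqlite3

# Hypothetical migration: each entry is a DDL statement applied in order.
MIGRATION = [
    "CREATE TABLE instances (id INTEGER PRIMARY KEY, name TEXT NOT NULL)",
    "ALTER TABLE instances ADD COLUMN host TEXT",
]

def run_migration(conn):
    """Apply every migration step; a strict backend would reject bad SQL here."""
    for statement in MIGRATION:
        conn.execute(statement)
    conn.commit()

def table_columns(conn, table):
    """Return the column names the migration produced (SQLite-specific PRAGMA)."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]

# In a real job this list would hold MySQL and PostgreSQL connections;
# here two in-memory SQLite databases stand in for them.
backends = [sqlite3.connect(":memory:"), sqlite3.connect(":memory:")]
for conn in backends:
    run_migration(conn)
    assert table_columns(conn, "instances") == ["id", "name", "host"]
```

The point of running every backend in the same loop is that a migration which relies on one engine's lenient SQL interpretation fails immediately instead of after release.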
> Certainly sounds like this testing is far beyond the "might one day be
> useful" level Sean talks about.
The migration compaction is a good point. And I'm happy to see there
were some bugs exposed as well.
Here is where I remain stuck:
We are now at a failure rate at which it takes 3 days (minimum) to land a
fix that decreases our failure rate at all.
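To make the math concrete: even a small spurious failure rate per job compounds across every job a change has to pass, so the chance a full gate run succeeds drops fast. A rough sketch with purely illustrative numbers (neither figure is a measured rate):

```python
# Illustrative numbers only: per-job flaky-failure probability and job count.
per_job_failure = 0.02   # 2% chance any single job fails spuriously
jobs_per_change = 10     # jobs a change must pass in one gate run

# Probability a single gate run passes every job.
pass_rate = (1 - per_job_failure) ** jobs_per_change

# Expected attempts before a change merges (geometric distribution).
expected_attempts = 1 / pass_rate

print(f"single-run pass rate: {pass_rate:.2%}")
print(f"expected gate attempts per change: {expected_attempts:.2f}")
```

With long queues, each extra attempt costs hours, which is how a few percent of flakiness turns into multi-day merge delays.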
The way we are currently working around this is by effectively building a
"manual Zuul": smart humans coordinating to do an end run around our
system. We've merged 18 fixes this way so far -
https://etherpad.openstack.org/p/gatetriage-june2014. Merging a
fix this way is at least an order of magnitude more expensive in people
time, because of the analysis and coordination we need to go through to
make sure these things are the right things to jump the queue.
That effort, over 8 days, has gotten us down to *only* a 24hr merge
delay. And there are no more smoking guns. What's left is a ton of
subtle things. I've got ~30 patches outstanding right now (a bunch are
things to clarify what's going on in the build runs, especially in the
failure scenarios). Every single one of them has been failed by Jenkins at
least once, and almost every one was failed by a different unique issue.
So I'd say at best we're 25% of the way towards solving this. That being
said, because of the deep queues, people are just recheck grinding (or
hitting the jackpot and landing something that then fails a lot
after merging). That leads to bugs like this:
Which was seen early in the patch - https://review.openstack.org/#/c/97569/
Then kind of destroyed us completely for a day -
http://status.openstack.org/elastic-recheck/ (it's the top graph).
And, predictably, a week into a long gate queue, everyone is now grumpy.
The sniping between projects, and within projects, over assigning blame
starts to spike at about day 4 of these events. Everyone assumes someone
else is to blame for these things.
So there is real community impact when we get to these states.
So I'm kind of burnt out trying to figure out how to get us out of
this, as I do take it personally when we as a project can't merge code;
that's a terrible state to be in.
Pleading for more people to dive in is mostly not helping.
So my only thinking at this point is that we prune back our test jobs to
a small enough number of configurations that the fixed number of people
actually trying to debug this actually can.
If there are other ideas, that's great.
But 'you aren't allowed to do less' isn't really sustainable. That just
leads to people giving up on helping.