Open Stack

Fri Jun 13 12:20:52 UTC 2014

On 06/13/2014 08:13 AM, Mark McLoughlin wrote:
> On Fri, 2014-06-13 at 07:31 -0400, Sean Dague wrote:
>> On 06/13/2014 02:36 AM, Mark McLoughlin wrote:
>>> On Thu, 2014-06-12 at 22:10 -0400, Dan Prince wrote:
>>>> On Thu, 2014-06-12 at 08:06 -0400, Sean Dague wrote:
>>>>> We're definitely deep into capacity issues, so it's going to be time to
>>>>> start making tougher decisions about things we decide aren't different
>>>>> enough to bother testing on every commit.
>>>>
>>>> In order to save resources why not combine some of the jobs in different
>>>> ways. So for example instead of:
>>>>
>>>>  check-tempest-dsvm-full
>>>>  check-tempest-dsvm-postgres-full
>>>>
>>>> Couldn't we just drop the postgres-full job and run one of the Neutron
>>>> jobs w/ postgres instead? Or something similar, so long as at least one
>>>> of the jobs which runs most of Tempest is using PostgreSQL I think we'd
>>>> be mostly fine. Not shooting for 100% coverage for everything with our
>>>> limited resource pool is fine, let's just do the best we can.
>>>>
>>>> Ditto for gate jobs (not check).
>>>
>>> I think that's what Clark was suggesting in:
>>>
>>> https://etherpad.openstack.org/p/juno-test-maxtrices
>>>
>>>>> Previously we've been testing Postgresql in the gate because it has a
>>>>> stricter interpretation of SQL than MySQL. And when we didn't test
>>>>> Postgresql it regressed. I know, I chased it for about 4 weeks in grizzly.
>>>>>
>>>>> However Monty brought up a good point at Summit, that MySQL has a strict
>>>>> mode. That should actually enforce the same strictness.
>>>>>
>>>>> My proposal is that we land this change to devstack -
>>>>> https://review.openstack.org/#/c/97442/ and backport it to past devstack
>>>>> branches.
>>>>>
>>>>> Then we drop the pg jobs, as the differences between the 2 configs
>>>>> should then be very minimal. All the *actual* failures we've seen
>>>>> between the 2 were completely about this strict SQL mode interpretation.
>>>>
>>>>
>>>> I suppose I would like to see us keep it in the mix. Running SmokeStack
>>>> for almost 3 years I found many an issue dealing w/ PostgreSQL. I ran it
>>>> concurrently with many of the other jobs and I too had limited resources
>>>> (much less than what we have in infra today).
>>>>
>>>> Would MySQL strict SQL mode catch stuff like this (old bugs, but still
>>>> valid for this topic I think):
>>>>
>>>>  https://bugs.launchpad.net/nova/+bug/948066
>>>>
>>>>  https://bugs.launchpad.net/nova/+bug/1003756
>>>>
>>>>
>>>> Having support for and testing against at least 2 databases helps keep
>>>> our SQL queries and migrations cleaner... and is generally a good
>>>> practice given we have abstractions which are meant to support this sort
>>>> of thing anyway (so by all means let us test them!).
>>>>
>>>> Also, having compacted the Nova migrations 3 times now I found many
>>>> issues by testing on multiple databases (MySQL and PostgreSQL). I'm
>>>> quite certain our migrations would be worse off if we just tested
>>>> against the single database.
>>>
>>> Certainly sounds like this testing is far beyond the "might one day be
>>> useful" level Sean talks about.
>>
>> The migration compaction is a good point. And I'm happy to see there
>> were some bugs exposed as well.
>>
>> Here is where I remain stuck....
>>
>> We are now at a failure rate in which it's 3 days (minimum) to land a
>> fix that decreases our failure rate at all.
>>
>> The way we are currently solving this is by effectively building "manual
>> zuul" and taking smart humans in coordination to end run around our
>> system. We've merged 18 fixes so far -
>> https://etherpad.openstack.org/p/gatetriage-june2014 this way. Merging a
>> fix this way is at least an order of magnitude more expensive on people
>> time because of the analysis and coordination we need to go through to
>> make sure these things are the right things to jump the queue.
>>
>> That effort, over 8 days, has gotten us down to *only* a 24hr merge
>> delay. And there are no more smoking guns. What's left is a ton of
>> subtle things. I've got ~ 30 patches outstanding right now (a bunch are
>> things to clarify what's going on in the build runs especially in the
>> fail scenarios). Every single one of them has been failed by Jenkins at
>> least once. Almost every one was failed by a different unique issue.
>>
>> So I'd say at best we're 25% of the way towards solving this. That being
>> said, because of the deep queues, people are just recheck grinding (or
>> hitting the jackpot and landing something through that then fails a lot
>> after landing). That leads to bugs like this:
>>
>> https://bugs.launchpad.net/heat/+bug/1306029
>>
>> Which was seen early in the patch - https://review.openstack.org/#/c/97569/
>>
>> Then kind of destroyed us completely for a day -
>> http://status.openstack.org/elastic-recheck/ (it's the top graph).
>>
>> And, predictably, a week into a long gate queue everyone is now grumpy.
>> The sniping between projects, and within projects in assigning blame
>> starts to spike at about day 4 of these events. Everyone assumes someone
>> else is to blame for these things.
>>
>> So there is real community impact when we get to these states.
>>
>> ....
>>
>> So, I'm kind of burnt out trying to figure out how to get us out of
>> this, as I do take it personally when we as a project can't merge code.
>> That's a terrible state to be in.
>>
>> Pleading to get more people to dive in is mostly not helping.
>>
>> So my only thinking at this point is we prune back our test jobs to the
>> point that they are a small enough number of configurations that the
>> fixed number of people actually trying to debug this, actually can.
>>
>> If there are other ideas, that's great.
>>
>> But 'you aren't allowed to do less' isn't really sustainable. That just
>> leads to people giving up on helping.
> 
> Totally understand, and agree with the severity of the situation.
> 
> Retreating is one thing, but let's not label the job as useless in the
> process. We can disable the job because of capacity issues even if we
> feel its coverage is as important as the day we added the job.
> 
> How about explicitly priority ordering the jobs such that when we
> retreat the lowest priority job is dropped first, but is also the first
> one to be added back (assuming its pass rate is sufficiently high) when
> we feel we have capacity again?
> 
> Debating the priority order of the jobs, brainstorming ways of mixing
> configurations in the jobs to get the best coverage, etc. would then be
> something we'd do with cool heads in calmer times rather than when the
> gate is on fire.

Yeah, doing that prioritization is on my todo list. It was about 7th
before we got into this state.

I really think it's going to be important to have sponsors as well.
Anything beyond the MySQL full job and the base grenade job needs a
sponsor. If sponsors fall behind on addressing / categorizing bugs, we
start degrading the jobs.

Because it's way too easy to set priorities for "someone else", and not
realize that there is no someone else.
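
For illustration only, the priority-ordered retreat discussed above could be
sketched roughly as follows. This is a minimal sketch under stated
assumptions, not anything that exists in zuul or infra: the `Job` class, the
`retreat` and `restore` helpers, the priority numbers, the pass rates, the
0.95 threshold and the grenade job name are all hypothetical. Treating the
MySQL full job and the base grenade job as undroppable "base" jobs follows
the sponsor idea above. Which job comes back first when several are disabled
is not spelled out in the thread; the sketch assumes the most important one.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    name: str
    priority: int        # hypothetical scale: 1 = most important
    pass_rate: float     # hypothetical recent pass rate in [0.0, 1.0]
    base: bool = False   # base jobs are never dropped
    enabled: bool = True


def retreat(jobs: List[Job]) -> Optional[Job]:
    """Disable the lowest-priority enabled non-base job, if there is one."""
    candidates = [j for j in jobs if j.enabled and not j.base]
    if not candidates:
        return None
    victim = max(candidates, key=lambda j: j.priority)
    victim.enabled = False
    return victim


def restore(jobs: List[Job], min_pass_rate: float = 0.95) -> Optional[Job]:
    """Re-enable the most important disabled job with a high enough pass rate."""
    candidates = [
        j for j in jobs if not j.enabled and j.pass_rate >= min_pass_rate
    ]
    if not candidates:
        return None
    job = min(candidates, key=lambda j: j.priority)
    job.enabled = True
    return job


# Illustrative data only; the numbers are made up.
jobs = [
    Job("check-tempest-dsvm-full", 1, 0.97, base=True),
    Job("check-grenade-dsvm", 2, 0.96, base=True),
    Job("check-tempest-dsvm-postgres-full", 3, 0.96),
]
```

With this data, one call to `retreat(jobs)` disables the postgres-full job,
and a later `restore(jobs)` brings it back only if its pass rate clears the
threshold.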

	-Sean

-- 
Sean Dague
http://dague.net


[openstack-dev] Gate proposal - drop Postgresql configurations in the gate
