Open Stack

Fri Jun 13 15:43:39 UTC 2014

On 06/13/2014 07:31 AM, Sean Dague wrote:
> On 06/13/2014 02:36 AM, Mark McLoughlin wrote:
>> On Thu, 2014-06-12 at 22:10 -0400, Dan Prince wrote:
>>> On Thu, 2014-06-12 at 08:06 -0400, Sean Dague wrote:
>>>> We're definitely deep into capacity issues, so it's going to be time to
>>>> start making tougher decisions about things we decide aren't different
>>>> enough to bother testing on every commit.
>>> In order to save resources why not combine some of the jobs in different
>>> ways. So for example instead of:
>>>
>>>   check-tempest-dsvm-full
>>>   check-tempest-dsvm-postgres-full
>>>
>>> Couldn't we just drop the postgres-full job and run one of the Neutron
>>> jobs w/ postgres instead? Or something similar, so long as at least one
>>> of the jobs which runs most of Tempest is using PostgreSQL I think we'd
>>> be mostly fine. Not shooting for 100% coverage for everything with our
>>> limited resource pool is fine; let's just do the best we can.
>>>
>>> Ditto for gate jobs (not check).
>> I think that's what Clark was suggesting in:
>>
>> https://etherpad.openstack.org/p/juno-test-maxtrices
>>
>>>> Previously we've been testing PostgreSQL in the gate because it has a
>>>> stricter interpretation of SQL than MySQL. And when we didn't test
>>>> PostgreSQL it regressed. I know, I chased it for about 4 weeks in Grizzly.
>>>>
>>>> However Monty brought up a good point at Summit, that MySQL has a strict
>>>> mode. That should actually enforce the same strictness.
>>>>
>>>> My proposal is that we land this change to devstack -
>>>> https://review.openstack.org/#/c/97442/ and backport it to past devstack
>>>> branches.
>>>>
>>>> Then we drop the pg jobs, as the differences between the 2 configs
>>>> should then be very minimal. All the *actual* failures we've seen
>>>> between the 2 were completely about this strict SQL mode interpretation.
>>>
>>> I suppose I would like to see us keep it in the mix. Running SmokeStack
>>> for almost 3 years I found many an issue dealing w/ PostgreSQL. I ran it
>>> concurrently with many of the other jobs and I too had limited resources
>>> (much less than what we have in infra today).
>>>
>>> Would MySQL strict SQL mode catch stuff like this (old bugs, but still
>>> valid for this topic I think):
>>>
>>>   https://bugs.launchpad.net/nova/+bug/948066
>>>
>>>   https://bugs.launchpad.net/nova/+bug/1003756
>>>
>>>
>>> Having support for and testing against at least 2 databases helps keep
>>> our SQL queries and migrations cleaner... and is generally a good
>>> practice given we have abstractions which are meant to support this sort
>>> of thing anyway (so by all means let us test them!).
>>>
>>> Also, having compacted the Nova migrations 3 times now I found many
>>> issues by testing on multiple databases (MySQL and PostgreSQL). I'm
>>> quite certain our migrations would be worse off if we just tested
>>> against the single database.
>> Certainly sounds like this testing is far beyond the "might one day be
>> useful" level Sean talks about.
> The migration compaction is a good point. And I'm happy to see there
> were some bugs exposed as well.
>
> Here is where I remain stuck....
>
> We are now at a failure rate in which it's 3 days (minimum) to land a
> fix that decreases our failure rate at all.
>
> The way we are currently solving this is by effectively building "manual
> zuul" and taking smart humans in coordination to end run around our
> system. We've merged 18 fixes so far -
> https://etherpad.openstack.org/p/gatetriage-june2014 this way. Merging a
> fix this way is at least an order of magnitude more expensive on people
> time because of the analysis and coordination we need to go through to
> make sure these things are the right things to jump the queue.
>
> That effort, over 8 days, has gotten us down to *only* a 24hr merge
> delay. And there are no more smoking guns. What's left is a ton of
> subtle things. I've got ~ 30 patches outstanding right now (a bunch are
> things to clarify what's going on in the build runs especially in the
> fail scenarios). Every single one of them has been failed by Jenkins at
> least once. Almost every one was failed by a different unique issue.
>
> So I'd say at best we're 25% of the way towards solving this. That being
> said, because of the deep queues, people are just recheck grinding (or
> hitting the jackpot and landing something through that then fails a lot
> after landing). That leads to bugs like this:
>
> https://bugs.launchpad.net/heat/+bug/1306029
>
> Which was seen early in the patch - https://review.openstack.org/#/c/97569/
>
> Then kind of destroyed us completely for a day -
> http://status.openstack.org/elastic-recheck/ (it's the top graph).
>
> And, predictably, a week into a long gate queue everyone is now grumpy.
> The sniping between projects, and within projects in assigning blame
> starts to spike at about day 4 of these events. Everyone assumes someone
> else is to blame for these things.
>
> So there is real community impact when we get to these states.
>
> ....
>
> So, I'm kind of burnt out trying to figure out how to get us out of
> this, as I do take it personally when we as a project can't merge code.
> That's a terrible state to be in.
>
> Pleading to get more people to dive in is mostly not helping.
>
> So my only thinking at this point is we prune back our test jobs to the
> point that they are a small enough number of configurations that the
> fixed number of people actually trying to debug this, actually can.
>
> If there are other ideas, that's great.
I have another idea, which many will not like, but here goes anyway.
First, I don't think we can thank Sean enough for his determination and
skill in trying to deal with this. The fact that he has hit the wall and
that others have not felt able to help indicates a real need to change
something. I do not believe our current methodology can scale; indeed,
it is already failing to.

IMO we are suffering from race bugs combined with a "branch" model that
focuses on integrating each patch into a code base of 1.7 million lines
(or so it is said) as quickly as possible. We do enough testing in the
gate that race bugs cause merges to fail often, but not enough testing
on each commit to keep race bugs out. If we actually tried to test
enough in the gate to keep races out, we would never merge anything, and
we would not have the resources to do it anyway. Worse, a new race bug
that slips through could have come from anywhere, which makes it very
difficult for people to help track it down.
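
To put rough numbers on that, here is a minimal sketch in Python. It
assumes each race bug fails a test run independently with a fixed
probability. The 1% rate and the count of 30 bugs are made-up
illustrative values, not measurements from the gate, and the function
names are hypothetical.

```python
from typing import Iterable


def pass_probability(fail_rate: float, runs: int) -> float:
    """Chance that a change carrying a race bug passes `runs` test runs.

    Assumes (for illustration only) that the race fails each run
    independently with probability `fail_rate`.
    """
    return (1.0 - fail_rate) ** runs


def gate_failure_rate(race_fail_rates: Iterable[float]) -> float:
    """Chance that one test run fails because of races already in the tree.

    Each entry is the assumed per-run failure probability of one race bug
    that has already merged; the patch under test is taken to be innocent.
    """
    p_all_ok = 1.0
    for rate in race_fail_rates:
        p_all_ok *= 1.0 - rate
    return 1.0 - p_all_ok
```

Under these assumptions, a change that introduces a 1% race passes two
runs about 98% of the time (pass_probability(0.01, 2)), so it usually
merges. Thirty such races already in the tree fail roughly a quarter of
all runs (gate_failure_rate([0.01] * 30)), whatever the patch under test
does.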

There is a different way to do this. We could adopt the same gating
methodology we have now, but apply it to each project on its own branch.
These project branches would be integrated into master at some regular
frequency, or when some new feature in project X is needed by project Y.
Projects would want to pull from the master branch often, but the push
process would be less frequent and run a much larger battery of tests
than we do now. Doing this would have the following advantages:

1. It would be much harder for a race bug to get in. Each commit would 
be tested many more times on its branch before being merged to master 
than at present, including tests specialized for that project. The 
qa/infra teams and others would continue to define acceptance at the 
master level.
2. If a race bug does get in, projects have at least some chance to 
avoid merging the bad code.
3. Each project can develop its own gating policy for its own branch 
tailored to the issues and tradeoffs it has. This includes focus on 
spending time running their own tests. We would no longer run a complete 
battery of nova tests on every commit to swift.
4. If a project branch gets into the situation we are now in:
      a) it does not impact the ability of other projects to merge code
      b) it is highly likely the bad code is actually in the project, so
         it is known who should help fix it
      c) those trying to fix it will be domain experts in the area that
         is failing
5. Distributing the gating load and policy to projects makes the whole 
system much more scalable as we add new projects.
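
As a rough illustration of advantages 3 and 5, the Python sketch below
picks test jobs based on the branch a change is landing on. All project,
branch, and job names are hypothetical placeholders, and this is not how
zuul is actually configured; it only shows the shape of the policy.

```python
from typing import Dict, List

# Hypothetical per-project jobs, chosen by each project for its own branch.
PROJECT_JOBS: Dict[str, List[str]] = {
    "nova": ["nova-unit", "nova-functional"],
    "swift": ["swift-unit", "swift-functional"],
}

# Hypothetical cross-project battery, defined at the master level.
INTEGRATION_JOBS: List[str] = ["integration-full", "integration-postgres", "upgrade"]


def jobs_for(project: str, target_branch: str) -> List[str]:
    """Return the jobs to run for a change to `project` on `target_branch`."""
    if target_branch == "master":
        # Integration into master: every project's jobs plus the full battery.
        every_project = [job for jobs in PROJECT_JOBS.values() for job in jobs]
        return every_project + INTEGRATION_JOBS
    # Ordinary commit on a project branch: only that project's own jobs.
    return list(PROJECT_JOBS[project])
```

With this policy, jobs_for("swift", "project/swift") runs only the two
swift jobs, so a swift commit no longer pays for the nova battery. The
expensive cross-project run happens only when a project branch is
integrated into master.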

Of course there are some drawbacks:

1. It will take longer, sometimes much longer, for any individual commit 
to make it to master. Of course if a super-serious issue made it to 
master and had to be fixed immediately it could be committed to master 
directly.
2. Branch management at the project level would be required. Projects 
would have to decide gating criteria, timing of pulls, and coordinate 
around integration to master with other projects.
3. There may be some technical limitations with git/gerrit/whatever that 
I don't understand but which would make this difficult.
4. It makes the whole thing more complicated from a process standpoint.

I have used this model in previous large software projects and it worked
quite well. It may also be somewhat similar to what the Linux kernel
does in some ways. This is not an actual proposal, since many details
would have to be worked out, but I believe adopting a methodology like
this would make a big dent in both the resource limitation issues and
the race to the bottom causing the current pain. And it would do so in a
way that stays scalable even as we grow to more projects than any one
person or team can have a handle on.

  -David
>
> But 'you aren't allowed to do less' isn't really sustainable. That just
> leads to people giving up on helping.
>
> 	-Sean
>
>

[openstack-dev] Rethink how we manage projects? (was Gate proposal - drop Postgresql configurations in the gate)