[openstack-dev] [tripleo] Contributing to TripleO is challenging

James Slagle james.slagle at gmail.com
Fri Mar 11 01:07:20 UTC 2016

On Fri, Mar 4, 2016 at 9:23 AM, Emilien Macchi <emilien at redhat.com> wrote:
> That's not the name of any Summit's talk, it's just an e-mail I wanted
> to write for a long time.
> It is an attempt to expose facts or things I've heard a lot; and bring
> constructive thoughts about why it's challenging to contribute in
> TripleO project.

Thanks for sharing your thoughts. I struggled a bit with responding to
be honest, as I think the points you call attention to often lead to
frustration on the side of both patch authors and reviewers.

I think these things are more anecdotal than fact-based, to be honest.
But, that doesn't mean that it's not a constructive conversation, so
thanks for calling out the issues.

At the core, we need a lot more investment in CI. We could use more
physical resources and more contributors. Or, we could direct some of
the capacity away from new development and put it towards CI
improvements instead. That might allow us to cover more features in
CI, which I think would have a direct impact on review velocity. There
are also some architectural changes proposed (split-stack) that will
allow us to scale CI more effectively than we have in the past.

> 1/ "I don't review this patch, we don't have CI coverage."

If a patch is obviously not covered by CI, it would go a long way if
the author indicated whether and how they manually tested the patch.

Often, I see a non-trivial patch that our CI does not cover. When I
try to manually test the patch, it doesn't work. Sometimes in very
obvious ways. Over time, this sort of pattern lowers confidence of
core reviewers to +2 and approve non-trivial patches that they
themselves haven't manually tested.

I know that manual testing doesn't scale.

However, right now, our CI doesn't scale to cover every possible
combination of features either.

So, if we want to keep moving forward at all, some things will have to
be manually tested. If patch authors added a comment such as "I tested
this with network isolation and it worked as expected, and it doesn't
break existing CI", that would go a long way towards giving people
the confidence to approve it.

> One thing I've noticed in TripleO is that very few people are involved
> in CI work.
> In my opinion, the CI system is more critical than any feature in a product.
> Developing software without tests is a bit like http://goo.gl/OlgFRc
> All people - especially cores - in the project should be involved in CI
> work. If you are a TripleO core and you don't contribute to CI, you might
> ask yourself why.

That's fair. Although it's also fair to ask the same of non-cores.

It's not just the job of core reviewers to get patches to pass CI. A
lot of times I get pinged to review patches with a sense of urgency
about them, yet they are sitting with failed CI. And it's not like
people are pinging me to help them understand and fix why the patch
has failed CI. They just want it reviewed/approved.

> 2/ "I don't review this patch, CI is broken."

Do you mean when CI is generally failing across the board?

I can honestly say that when CI is generally failing for whatever
reason (infrastructure issue, OpenStack regression, TripleO
regression), there is almost always 1 or 2 TripleO cores all over that
issue. Not everyone needs to be working on the issue at once. But, I
can honestly say I've never seen TripleO just completely red where at
least one person wasn't working on it almost exclusively.

However, if you're saying that people don't review a patch if that
specific patch has failed CI on it, then I think there is a lot of
shared responsibility there. It's not just on reviewers to see why
something has failed CI, or to try to get it to pass.

I'm less likely to review a patch if it has been sitting for several
days with a failed CI job on it. The author probably doesn't need it
landed that badly if that's the case. Often, there's not even a comment
indicating whether they looked at the failed job to see why it failed.
So, yes, I'm honestly less likely to review those patches, and maybe
that's not fair.

Just so I'm clear, I'm not saying that I ignore patches with failed
CI. I try and help new contributors or people I might not recognize
get their patches to pass CI. I recognize there is a steep learning
curve there.

But when I see patches out there from folks who are capable of at
least triaging the failure, and that hasn't been done, I'm certainly
guilty of de-prioritizing reviewing that patch. Maybe that's a bad
habit.
> Another thing I've noticed in TripleO is that when CI is broken, again,
> a very few people are actually working on fixing failures.

This sounds like you're talking about scenarios where all of TripleO CI
is failing. In which case, I really disagree with your assertion that
people aren't rapidly fixing those failures.

> My experience over the last few years has taught me to stop my daily
> work when CI is broken and fix it asap.
> 3/ "I don't review it, because this feature / code is not my area".
> My first thought is "Aren't we supposed to be engineers and learn new areas?"
> My second thought is that I think we have a problem with TripleO Heat
> Templates.
> THT, or TripleO Heat Templates, is 80% Puppet / Hiera code. If
> TripleO cores say "I'm not familiar with Puppet", we have a problem here,
> don't we?
> Maybe we should split this repository? Or revisit the list of people who
> can +2 patches on THT.

I'm sure we have some of this behavior. But I don't see it as an issue,
to be honest. People should review code they are comfortable with.

It would also be great if people spent time getting to know parts of
the codebase they aren't as familiar with. But not everyone has time
to make that a priority, or they might not choose to work that way,
and that's OK. We can't force people to step outside their comfort
zone as that's more of a personal decision.

I don't honestly think we've made poor technology choices that make it
difficult to review. Heat templates and puppet manifests are well
understood technologies in the deployment and configuration space.
Just like anything else, they take time to get familiar with.

As for revisiting who can +2 patches on THT, I think that is a
discussion that is always open. If you have any specifics in mind, I
would make those proposals; or, if you're more comfortable going that
route, first consult with the PTL and existing cores to get a sense of
what the consensus might be before making a proposal.

> 4/ Patches are stalled. Most of the time.
> Over the last 12 months, I've pushed a lot of patches in TripleO and one
> thing I've noticed is that if I don't ping people, my patch gets no
> review. And I have to rebase it every week, because the interface
> changed. I got +2, cool! Oh, merge conflict. Rebasing. Waiting for +2
> again... and so on.
> I personally spend 20% of my time reviewing code, every day.
> I wrote a blog post about how I do reviews, with Gertty:
> http://my1.fr/blog/reviewing-puppet-openstack-patches/
> I suggest TripleO folks spend more time on reviews, for a few reasons:
> * decreasing frustration from contributors
> * accelerating the development process
> * teaching new contributors to work on TripleO, and eventually scaling
> up the core team. It's a time investment, but worth it.

We definitely have more patches out there than the core team can keep up with.

> In the Puppet team, we have weekly triage sessions and they're pretty
> helpful.
> 5/ Most of the tests are run... manually.
> How many times have I heard "I've tested this patch locally, and it does
> not work, so -1".

I don't understand your point here. How would you expect someone to
review a patch once they've discovered it doesn't even work?

Having to manually test things isn't what most people want to spend
their time doing.

However, some reviewers choose to manually test patches that they know
are not covered by CI. I choose to do that since I know we don't have
enough capacity to CI everything and I want to make sure that
non-trivial patches do not break TripleO for everyone. I don't see a
problem with that.

If you want to decrease the likelihood your patch gets manually
tested, you could add CI coverage for it. If we don't have capacity to
add CI coverage for it, work on finding a different way, such as
adding a periodic job. If you don't want to do that, push a temporary
patch to tripleo-ci that depends on your patch, exercises the code
path, and shows it working.
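As a sketch of that last option: the Depends-On commit-message footer is
the standard Gerrit/Zuul mechanism for cross-repo dependencies, so a
throwaway tripleo-ci change could look something like this (the subject,
body, and Change-Id placeholder below are hypothetical):

```
DO NOT MERGE: exercise the new code path in CI

Temporary tripleo-ci change that enables the feature so the CI jobs
run through it. Abandon after the dependent patch merges.

Depends-On: I<change-id-of-your-patch>
```

The CI jobs triggered on the tripleo-ci change will then build with your
patch applied, and a green run on the review demonstrates the code path
works.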

Some people will still choose to manually test though, and I see
absolutely nothing wrong with that. And if it doesn't work as
advertised, I'd expect to see a -1 on the review. In the cases when I
have done this, I try and provide all of the details about how it
failed for me and what I think the problem is.

I think that's a fair and constructive review and it helps push it
forward in terms of getting it committed.

> The only test we do in current CI is a ping to an instance. Seriously?
> Most OpenStack CIs (Fuel included) run Tempest to test APIs and
> real scenarios. And we run a ping.
> That's similar to 1/ but I wanted to raise it too.

I'd love to see us do a tempest run, with at least the smoketests.
After redeploying the testenvs to have 5GB overcloud nodes, this might
be possible. We should prove this out before turning it on, probably
via some tripleo-ci patches, so that we know how much time it adds,
and if it's something we can live with.
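For reference, a smoketest-only run selects just the tests tagged as
smoke tests; a hedged sketch of the invocation (the exact command
depends on the tempest version in the testenv and assumes tempest is
already configured with the overcloud's keystone endpoint and
credentials):

```
# Hypothetical: run only tempest tests tagged "smoke" against the
# overcloud, from a configured tempest checkout/workspace.
tempest run --smoke
# or, on older tempest checkouts:
# tox -e smoke
```

Timing that run in a tripleo-ci patch first would tell us how much wall
time it adds per job.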

-- James Slagle
