[openstack-dev] [Fuel][Fuel-Library] Fuel CI issues
Dmitry Borodaenko
dborodaenko at mirantis.com
Mon Mar 7 07:33:08 UTC 2016
Aleksandra,
Very good point on separating the concerns about integration tests for
Fuel as a whole and verifying commits to a single component such as
fuel-library. In theory, it could support the right balance between
stable CI and up-to-date code, but only if we resolve the two remaining
problems: one small and technical and the other large and social.
You've already pointed out the first problem: update of fuel-library CI
environment is not yet fully automated, and so the environment is liable
to lag behind all involved components for days if not weeks.
This by itself is simple enough, if laborious, to work around (update it
manually every day, or after every successful BVT), but still leaves us
with the problem of motivation.
We've been discussing the CI duty for fuel-library integration with
puppet-openstack for more than a month [0], and it has
continuously failed to materialize. Within days of getting an action
item in that IRC meeting to arrange it, Andrew Maksimov responded
privately that nobody on his team has time for this. And we all know
what "I don't have time" actually means [1]. Two weeks later, we were
ready to launch the integration and the question of CI duty came up
again [2], with the same result.
[0] http://eavesdrop.openstack.org/meetings/fuel/2016/fuel.2016-02-04-16.02.log.html#l-66
[1] http://lifehacker.com/5892948/instead-of-saying-i-dont-have-time-say-its-not-a-priority
[2] http://eavesdrop.openstack.org/meetings/fuel/2016/fuel.2016-02-18-16.00.log.html#l-190
Here we are two more weeks later, the integration is on, and the first
reaction from fuel-library core reviewers is "we don't have time to deal
with this, turn it back off right now". And I'm not just summarizing
Vladimir's email: on Friday we had a long thread on an internal mailing
list with exactly this in the subject line (my apologies, but my disgust
at the fact that it was started behind closed doors drowns any qualms
about dragging it back into the open).
After we change Fuel CI to use fixed revisions of puppet-openstack
modules (the most recent ones to have passed BVT), the first thing that
will happen is that BVT on Fuel ISO will start failing again, while
fuel-library CI
will continue to work. Without the pressure of failing commit
verification CI, fuel-library developers will have even less incentive
to keep fuel-library up to date with puppet-openstack (not to mention
pro-actively reviewing puppet-openstack commits to catch potential
regressions before they happen), and very soon Fuel QA team will get fed
up with not having a stable ISO for the swarm test, and will demand that
we go back to using fixed puppet-openstack revisions for the ISO, too.
Both here and on the internal thread, many technical and organizational
concerns were raised, and I'll get to them in a bit, but a concern
without the will to resolve it is only an excuse; we won't get far if we
don't want to make it work.
So why don't fuel-library developers want to spend time on
puppet-openstack integration?
I see two dimensions to this problem. On one axis, there's the
cost/benefit balance: how much work does it take, and what do we gain
from doing it? On the other is the question of who benefits and who
carries the costs?
Without tracking HEAD of puppet-openstack in fuel-library, the primary
cost is carried by puppet-openstack developers who maintain the upstream
modules in the first place, and a small fraction of fuel-library
contributors (5+ out of 50+ [3][4]) who periodically have to spend a
significant amount of effort to bring fuel-library up to date with the
current state of puppet-openstack. Even though the conversion to
librarian has made the upstream sync simpler and safer, preparing the
update to Mitaka still took a full month of work for 5-7 people.
[3] http://stackalytics.com/?module=puppet%20openstack-group&company=mirantis&metric=commits
[4] http://stackalytics.com/?module=fuel-library&company=mirantis&metric=commits
Secondary costs are carried by Fuel Infra and QA teams who have to
support CI based on two OpenStack releases in parallel during that
month, fuel-library and puppet-openstack developers who have to deal
with a spike in code churn, all Fuel contributors who are blocked by
merge freeze during transition, and once again Fuel QA team who
occasionally get blocked by bugs that were fixed in upstream and not yet
pulled into fuel-library.
In short, under that model, most fuel-library developers don't have to
do much to gain the benefit of being up to date with upstream, such as
getting support for the next OpenStack release. The integration cost,
around 7-10 man-months per release, is carried mostly by other people.
Transition to full integration with upstream via tracking HEAD of
puppet-openstack in fuel-library dramatically alters this balance.
Massive upstream sync is gone, and so are the associated costs of
parallel CI, transition merge freeze, and missing upstream bugfixes. The
code churn is still there, but more evenly spread over time.
Instead, the primary cost becomes the CI duty that requires a
fuel-library developer to watch upstream commits for Fuel CI failures
and prevent those from impacting fuel-library. According to the same
internal thread, that's "over 50% of one developer's time every day", so
3-5 man-months per release, or roughly half of the cost of the periodic
sync.
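As a rough check of that arithmetic, here is a minimal sketch. It assumes
a six-month release cycle, and the function name is made up for
illustration; the 7 man-month figure is the lower end of the periodic
sync estimate given earlier.

```python
RELEASE_CYCLE_MONTHS = 6  # assumption: six-month release cycle


def duty_cost_man_months(fraction_of_time, cycle_months=RELEASE_CYCLE_MONTHS):
    """Man-months per release spent on CI duty by one developer on duty."""
    return fraction_of_time * cycle_months


# "over 50% of one developer's time" is the lower bound of the estimate
low = duty_cost_man_months(0.5)
# worst case: the duty takes all of one developer's time
high = duty_cost_man_months(1.0)

PERIODIC_SYNC_LOW = 7  # man-months per release, lower end of the sync estimate

print(low, high, high < PERIODIC_SYNC_LOW)
```

Even the worst case stays below the lower end of the periodic sync
estimate.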
The secondary cost is the risk of upstream commits causing regressions
that block the whole fuel-library team for several hours at a time. Is
this risk a good excuse to revert the change that reduces the cost of
supporting a new OpenStack release by half and reduces Fuel's lag behind
puppet-openstack by a month? Only if we can't mitigate it.
The problem is, most fuel-library developers don't stand to gain
anything from this change: they now have to participate in something
that was previously taken care of, however inefficiently, by other
people. And that is why, instead of constructive proposals about
mitigating the risk of regressions, we see demands to go back to the
time when they didn't need to bother.
As promised, moving on to specific concerns and questions.
On Tue, Mar 01, 2016 at 02:21:48PM +0300, Vladimir Kuklin wrote:
> Dmitry, could you please point me at the person who will be strictly
> responsible for creating this 'ketchup' commit? Do you know that this
> may take up the whole day (couple of hours to do RCA, couple of hours
> on writing and debugging and couple of hours for FUEL CI tests run)
> and block the entire Fuel project from having ANY code merged?
It's not reasonable to expect a single person, or even a small team, to
do this every day all year round. That's why we've been discussing CI
duty. Even if it takes all day every day, between 50+ fuel-library
developers that's just one week per person per year, not that much of a
burden.
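To make the "one week per person per year" estimate concrete, here is a
minimal sketch of a weekly round-robin rota. The developer names are
placeholders and the helper is hypothetical, not an existing Fuel tool.

```python
from itertools import cycle


def weekly_rotation(developers, weeks=52):
    """Assign one developer per week, round-robin."""
    rota = cycle(developers)
    return [next(rota) for _ in range(weeks)]


developers = [f"dev{i:02d}" for i in range(50)]  # placeholder names
schedule = weekly_rotation(developers)

# Average number of duty weeks per developer over one year
average_load = len(schedule) / len(developers)
print(schedule[:3], average_load)
```

With 50 developers and 52 weeks, the average comes out to just over one
week per person per year.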
And it doesn't have to block anyone from merging code to Fuel
repositories: there are many ways to mitigate that, like the ones that
Sergey and Aleksandra have proposed in this thread. We just need to
start discussing these ways instead of arguing about why we shouldn't
bother.
> I have always thought that buliding software is about verification
> being more important than 'trust'. There should not be any
> humanitarian stuff invloved - we are not in a relationship with
> Puppet-OpenStack folks,
I have explained above why motivation is the blocking issue here, and
not the technical concerns. Of course we are in a relationship with
Puppet OpenStack: both projects are part of OpenStack Big Tent, we have
the same six-month release cycle, and on the code level their modules
are so tightly coupled into fuel-library that we can't treat them as a
third-party library. The fact that we've started to pull them from
separate git repositories shouldn't have stopped us from treating them
as a part of our codebase. Like it or not, our relationship with them is
more "in the same boat" than it is a "zero-sum game".
> although I really admire their work very much.
lip service
n 1: an expression of agreement that is not supported by real
conviction [syn: {hypocrisy}, {lip service}]
> We should not follow sliding git references without being 100% sure
> that we have mutual gating of the code.
Setting up mutual gating is impossible without the mutual trust that you
have so easily dismissed. Sliding git references and the CI duty to
support them are all parts of establishing that mutual trust; it won't
just appear out of thin air and empty promises.
Even at the level of trust we already have, I'm sure puppet-openstack
core reviewers can agree to hold off merging a commit if a fuel-library
developer votes -1 with a comment like "Fuel CI failed for this one,
please give me a couple of hours to figure out why". A poor man's
substitute for mutual gating, but serviceable nonetheless.
> Moreover, having such git ref as a source in our Puppetfile will lead
> to the situation when we have UNREPRODUCIBLE build of Fuel project.
Easily mitigated with tooling, same as the undeservedly maligned removal
of version.yaml.
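One possible shape of such tooling, as a sketch only (this is not what
Fuel actually uses): at build time, record the commit each sliding ref
resolved to and write out a pinned copy of the Puppetfile. The function
name, the module URL, and the commit id below are illustrative
assumptions; the resolved commits could come from something like
`git ls-remote`.

```python
import re

# Matches the start of a librarian-style entry such as:
#   mod 'nova',
#     :git => 'https://.../puppet-nova.git',
#     :ref => 'master'
MOD_RE = re.compile(r"^mod\s+'(?P<name>[^']+)'")
REF_RE = re.compile(r"(:ref\s*=>\s*)'[^']*'")


def pin_puppetfile(text, resolved):
    """Replace each module's :ref with the commit recorded in `resolved`."""
    current = None
    out = []
    for line in text.splitlines():
        m = MOD_RE.match(line)
        if m:
            current = m.group("name")  # entries may span several lines
        if current in resolved:
            line = REF_RE.sub(
                lambda r: f"{r.group(1)}'{resolved[current]}'", line
            )
        out.append(line)
    return "\n".join(out) + "\n"


puppetfile = """mod 'nova',
  :git => 'https://github.com/openstack/puppet-nova.git',
  :ref => 'master'
"""
# "0123abc" is a placeholder commit id, not a real revision
pinned = pin_puppetfile(puppetfile, {"nova": "0123abc"})
print(pinned)
```

Storing the pinned file next to the ISO would let anyone rebuild from
exactly the same module revisions.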
On Fri, Mar 04, 2016 at 04:51:34PM +0300, Dmitry Pyzhov wrote:
> 1) It takes more than 50% time of a senior engineer;
As explained above, even at 100% time it's less than the time we've been
spending on periodic upstream syncs.
> 2) There is a lot of noise in tests results because of broken CI
> and/or broken Fuel master;
Can be fixed by Aleksandra's proposal.
> 3) There is a log of noise in tests results because of big number of
> WIP commits that nobody is going to merge;
Once we make Fuel CI votes visible (I see no reason to delay that any
longer), it's going to be trivial to filter out commits with WIP flag or
with a -1 from a voting gate job (why investigate Fuel CI failure if the
commit can't pass a beaker test).
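A minimal sketch of that filter, assuming each change has already been
reduced to a small record. The field names are hypothetical and do not
correspond to any particular Gerrit API.

```python
def worth_investigating(change):
    """Skip WIP changes and changes that already failed a voting gate job."""
    if change["wip"]:
        return False
    return all(vote >= 0 for vote in change["gate_votes"].values())


# Hypothetical change records: id, WIP flag, votes from voting gate jobs
changes = [
    {"id": "A", "wip": False, "gate_votes": {"beaker": 1}},
    {"id": "B", "wip": True, "gate_votes": {"beaker": 1}},
    {"id": "C", "wip": False, "gate_votes": {"beaker": -1}},
]
to_check = [c["id"] for c in changes if worth_investigating(c)]
print(to_check)
```

Only change A is left for the person on CI duty to look at.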
> 4) There is no quick way to understand if the test failure caused by
> commit or by other reasons;
Is this a duplicate of #2 or a general observation about how difficult
it is to investigate Fuel CI failures? If the latter, this problem is
not limited to puppet-openstack and is causing us pain in all our repos;
we should either fix it soon or give up on Fuel CI altogether.
> 5) There is no quick way to understand if the issue should be fixed in
> the commit or in Fuel;
Yes there is: simply pick the side where it's easier to fix.
> 6) Most important. Our monitoring doesn't protect us. Our master will
> be broken by upstream manifests again sooner or later. And nobody
> knows how much time it will take to fix it.
Our master gets broken by our own mistakes at least as often as by
upstream manifests; anything we can do to protect ourselves from that is
applicable to puppet-openstack just the same.
--
Dmitry Borodaenko