Open Stack

Thu Mar 10 21:09:44 UTC 2016

This seems to be the week people want to pile it on TripleO. Talking
about upstream is great but I suppose I'd rather debate major changes
after we branch Mitaka. :/

Anyway, might as well get into it now. Replies inline.

On Thu, 2016-03-10 at 17:32 +0000, Jeremy Stanley wrote:
> On 2016-03-10 09:50:03 -0500 (-0500), Emilien Macchi wrote:
> [...]
> > 
> > OpenStack Infra provides an easy way to plug CI systems and some
> > CIs (Nova, Neutron, Cinder, etc) already gate on some third party
> > systems. I was wondering if we would not be interested to
> > investigate this area and maybe ask to our third party drivers
> > contributors (Bigswitch, Nuage, Midonet, Cisco, Netapp, etc) to
> > run on their own hardware TripleO CI jobs running their specific
> > environment to enable the drivers. This CI would be plugged to
> > TripleO CI, and would provide awesome feedback.
> [...]
> 
> It's also worth broadening the discussion to reassess whether the
> existing TripleO CI should itself follow our third-party integration
> model instead of the current implementation relying on our main
> community Zuul/Nodepool/Jenkins servers. When this was first
> implemented, there was a promise of adding more regions for
> robustness and of being able to use the surplus resources maintained
> in the TripleO CI clouds to augment our generic CI workload. It's
> been years now and these things have not really come to pass; if
> anything, that system and its operators are still struggling to keep
> a single region up and operational and providing enough resources to
> handle the current TripleO test load.

Yeah. We actually lost a region of hardware this last year too.

I think there is a distinction between our cloud being up and trunk
being broken. We've had some trouble with both over the last couple of
years, but in general I think our CI cloud (which provides instances)
has been up 98%, maybe even 99%, of the time. To be honest, I've not
been tracking our actual uptime for bragging rights, but I think the
actual cloud (which is connected to nodepool) has good uptime.
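
For a sense of scale, here is a rough sketch of what those figures
would imply in downtime per year. The 98% and 99% values are the
untracked estimates above, not measurements, and the helper name
downtime_days_per_year is made up for illustration.

```python
# Rough arithmetic only: converts an estimated uptime percentage into
# the downtime it implies over a 365-day year. The percentages are
# estimates from the paragraph above, not measured values.

def downtime_days_per_year(uptime_percent, days_in_year=365):
    """Return the days of downtime implied by an uptime percentage."""
    return days_in_year * (100 - uptime_percent) / 100


for pct in (98, 99):
    days = downtime_days_per_year(pct)
    print(f"{pct}% uptime -> about {days:.2f} days down per year")
```

At 98% that works out to roughly a week of total downtime per year, and
at 99% to about half that.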

We have been dealing with a lot of trunk breakages, however. This
happens because we are not a gate, and it is related to the fact that
we have limited resources and a long job wall time. So taking a step
away from the common infrastructure pipelines, which do act as an
upstream gate, would likely only make this worse for us.

To be fair, the last outage you refer to occurred over the course of
days because we made a config change and only discovered the breakage
days later (because nodepool caches the keystone endpoints). We are
learning, and we do timebox our systems administration a bit more than
most pure administrators, but I think the general uptime of our cloud
has been good.

> 
> The majority of unplanned whole-provider outages we've experienced
> in Nodepool have been from the TripleO cloud going completely
> offline (sometimes for a week or more straight), by far the
> longest-running jobs we have are running there (which substantially
> hampers our ability to do things like gracefully restart our Jenkins
> masters without aborting running jobs), and ultimately the benefits
> to TripleO for this model are very minimal anyway (different
> pipelines means the jobs aren't effectively even voting, much less
> gating).

With regard to Jenkins restarts, I think it is understood that our job
times are long. How often do you find infra needs to restart Jenkins?
And regardless of that, what if we just said we didn't mind the
destructiveness of losing a few jobs now and then (until our job times
are under the line... say 1.5 hours or so)? To be clear, I'd be fine
with infra pulling the rug on running jobs if the root cause of the
problem is the long-running jobs in TripleO.

I think the "benefits are minimal" is a bit of an overstatement. The
initial vision for TripleO CI stands and I would still like to see
individual projects entertain the option to use us in their gates.
Perhaps the strongest community influences are within Heat, Ironic, and
Puppet. The ability to manage the interaction with Heat, Ironic, and
Puppet in the common infrastructure is a clear benefit and there are
members of these communities that I think would agree.

> 
> I'm not trying to slam the TripleO cloud operators, I think they're
> doing an amazing job given the limitations they're working under and
> much of their work has provided inspiration for our Infra-Cloud
> project too. They're helpful and responsive and a joy to collaborate
> with, but ultimately I think TripleO might actually realize more
> benefit from adding a Zuul/Nodepool/Jenkins of their own to this
> (we've massively streamlined the Puppet for maintaining these
> recently and have very thorough deployment and operational
> documentation) rather than dealing with the issues which arise from
> being half-integrated into one they don't control.

We've actually moved most of our daily management tasks for TripleO
into the tripleo-ci project so we don't have to bother infra with minor
config changes. So I don't think we are taking up a huge amount of
infra review time or causing a burden to you. We have very few 'system'
side changes for infra to deal with... and like I said above, I think
it would be reasonable to give infra a pass on restarting things due to
our long job times.

Anyway, if we go off and run our own Zuul/Nodepool/Jenkins, I do agree
it could work. But I think it is a step backwards for our integration
with some of the communities, and I'm not sure it costs that much for
us to stay where we are, especially given we are willing to make some
changes to make any of the pain points you mention more agreeable.

Dan

> 
> I've been meaning to bring that up for discussion for a while, just
> keep forgetting, but this thread seems like a good segue into the
> topic.

[openstack-dev] [tripleo] becoming third party CI (was: enabling third party CI)