[openstack-dev] [tripleo] tripleo upstream gate outage, was: -> gate jobs impacted RAX yum mirror

Wesley Hayutin whayutin at redhat.com
Mon May 14 18:00:05 UTC 2018

On Mon, May 14, 2018 at 12:37 PM Jeremy Stanley <fungi at yuggoth.org> wrote:

> On 2018-05-14 09:57:17 -0600 (-0600), Wesley Hayutin wrote:
> > On Mon, May 14, 2018 at 10:36 AM Jeremy Stanley <fungi at yuggoth.org> wrote:
> [...]
> > > Couldn't a significant burst of new packages cause the same
> > > symptoms even without it being tied to a minor version increase?
> >
> > Yes, certainly this could happen outside of a minor update of the
> > base OS.
>
> Thanks for confirming. So this is not specifically a CentOS minor
> version increase issue, it's just more likely to occur at minor
> version boundaries.

Correct, you got it

> > So the only thing out of our control is the package set on the
> > base nodepool image. If that suddenly gets updated with too many
> > packages, then we have to scramble to ensure the images and
> > containers are also updated.
>
> It's still unclear to me why the packages on the test instance image
> (i.e. the "container host") are related to the packages in the
> container guest images at all. That would seem to be the whole point
> of having containers?

You are right; just note that some services are not 100% containerized yet.
This doesn't happen overnight. It's a process and we're getting there.

> > If there is a breaking change in the nodepool image for example
> > [a], we have to react to and fix that as well.
> I would argue that one is a terrible workaround which happened to
> show its warts. We should fix DIB's pip-and-virtualenv element
> rather than continue to rely on side effects of pinning RPM versions.
> I've commented to that effect on https://launchpad.net/bugs/1770298
> just now.

OK, thanks.

> > > It sounds like a problem with how the jobs are designed
> > > and expectations around distros slowly trickling package updates
> > > into the series without occasional larger bursts of package deltas.
> > > I'd like to understand more about why you upgrade packages inside
> > > your externally-produced container images at job runtime at all,
> > > rather than relying on the package versions baked into them.
> >
> > We do that to ensure the gerrit review itself and its
> > dependencies are built via rpm and injected into the build. If we
> > did not do this the job would not be testing the change at all.
> > This is a result of being a package based deployment for better or
> > worse.
> [...]
> Now I'll risk jumping to proposing solutions, but have you
> considered building those particular packages in containers too?
> That way they're built against the same package versions as will be
> present in the other container images you're using rather than to
> the package versions on the host, right? Seems like it would
> completely sidestep the problem.

So, a little background. The containers and images used in TripleO are
rebuilt multiple times each day via periodic jobs; when they pass our
criteria they are pushed out and used upstream.
Each zuul change and its dependencies can potentially impact a few or all
of the containers in play. We cannot rebuild all the containers due to time
constraints in each job. We have been able to mount and yum update the
containers involved with the zuul change.
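
To make the "only update the containers involved" idea concrete, here is a
rough sketch. It is not the actual TripleO CI code; the function name, the
data layout and the package names are all hypothetical. It assumes we already
know which packages each container ships and which packages were rebuilt from
the change and its dependencies.

```python
# Hypothetical sketch: pick only the containers that ship a package rebuilt
# from the change under test, so every other container is left untouched.

def containers_to_update(installed, rebuilt):
    """Return sorted names of containers whose package set overlaps `rebuilt`.

    installed: dict mapping container name -> set of installed package names
    rebuilt:   set of package names built from the zuul change and its deps
    """
    return sorted(name for name, pkgs in installed.items() if pkgs & rebuilt)


if __name__ == "__main__":
    # Illustrative data only, not taken from a real job.
    installed = {
        "nova-compute": {"openstack-nova-compute", "python-novaclient"},
        "neutron-server": {"openstack-neutron", "python-neutronclient"},
        "heat-engine": {"openstack-heat-engine", "python-novaclient"},
    }
    rebuilt = {"python-novaclient"}
    print(containers_to_update(installed, rebuilt))
```

Only the containers returned here would need the mount and yum update step;
the rest keep the package versions they were built with.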

Latest patch to fine tune that process is here

> > An enhancement could be to stage the new images for say one week
> > or so. Do we need the CentOS updates immediately? Is there a
> > possible path that does not create a lot of work for infra, but
> > also provides some space for projects to prep for the consumption
> > of the updates?
> [...]
> Nodepool builds new images constantly, but at least daily. Part of
> this is to prevent the delta of available packages/indices and other
> files baked into those images from being more than a day or so stale
> at any given point in time. The older the image, the more packages
> (on average) jobs will need to download if they want to test with
> latest package versions and the more strain it will put on our
> mirrors and on our bandwidth quotas/donors' networks.

Sure that makes perfect sense.  We do the same with our containers and

> There's also a question of retention, if we're building images at
> least daily but keeping them around for 7 days (storage on the
> builders, tenant quotas for Glance in our providers) as well as the
> explosion of additional nodes we'd need since we pre-boot nodes with
> each of our images (and the idea as I understand it is that you
> would want jobs to be able to select between any of them). One
> option, I suppose, would be to switch to building images weekly
> instead of daily, but that only solves the storage and node count
> problem not the additional bandwidth and mirror load. And of course,
> nodepool would need to learn to be able to boot nodes from older
> versions of an image on record which is not a feature it has right
> now.

OK.. thanks for walking me through that.  It totally makes sense to be
concerned with updating the image to save time, bandwidth etc.
It would be interesting to see if we could come up with something to protect
projects from changes to the new images and maintain images with fresh

Project non-voting check jobs on the nodepool image creation job perhaps
could be the canary in the coal mine we are seeking. Maybe we could see if
that would be something that could be useful to both infra and to various
OpenStack projects?

> > Understood, I suspect this will become a more widespread issue as
> > more projects start to use containers (not sure).
> I'm still confused as to what makes this a container problem in the
> general sense, rather than just a problem (leaky abstraction) with
> how you've designed the job framework in which you're using them.
>
> > It's my understanding that there are some mechanisms in place to
> > pin packages in the CentOS nodepool image so there have been some
> > thoughts generally in the area of this issue.
> [...]
> If this is a reference back to bug 1770298, as mentioned already I
> think that's a mistake in diskimage-builder's stdlib which should be
> corrected, not a pattern we should propagate.

Cool, good to know and thank you!

> --
> Jeremy Stanley