[tripleo][ci] container pulls failing
Wesley Hayutin
whayutin at redhat.com
Wed Jul 29 13:13:24 UTC 2020
On Wed, Jul 29, 2020 at 2:25 AM Bogdan Dobrelya <bdobreli at redhat.com> wrote:
> On 7/28/20 6:09 PM, Wesley Hayutin wrote:
> >
> >
> > On Tue, Jul 28, 2020 at 7:24 AM Emilien Macchi <emilien at redhat.com> wrote:
> >
> >
> >
> > On Tue, Jul 28, 2020 at 9:20 AM Alex Schultz <aschultz at redhat.com> wrote:
> >
> > On Tue, Jul 28, 2020 at 7:13 AM Emilien Macchi <emilien at redhat.com> wrote:
> > >
> > >
> > >
> > > On Mon, Jul 27, 2020 at 5:27 PM Wesley Hayutin <whayutin at redhat.com> wrote:
> > >>
> > >> FYI...
> > >>
> > >> If you find your jobs are failing with an error similar to [1],
> > >> you have been rate limited by docker.io via the upstream mirror
> > >> system and have hit [2]. I've been discussing the issue w/
> > >> upstream infra, rdo-infra and a few CI engineers.
> > >>
> > >> There are a few ways to mitigate the issue; however, I don't see
> > >> any of the options being completed very quickly, so I'm asking
> > >> for your patience while this issue is socialized and resolved.
> > >>
> > >> For full transparency, we're considering the following options.
> > >>
> > >> 1. move off of docker.io to quay.io
> > >
> > >
> > > quay.io also has an API rate limit:
> > > https://docs.quay.io/issues/429.html
> > >
> > > Now I'm not sure how many requests per second one can do vs the
> > > other, but this would need to be checked with the quay team before
> > > changing anything.
> > > Also quay.io has had its big downtimes as well, so its SLA needs
> > > to be considered.
> > >
> > >> 2. local container builds for each job in master, possibly ussuri
> > >
> > >
> > > Not convinced.
> > > You can look at CI logs:
> > > - pulling / updating / pushing container images from docker.io to
> > >   the local registry takes ~10 min on standalone (OVH)
> > > - building containers from scratch with updated repos and pushing
> > >   them to the local registry takes ~29 min on standalone (OVH).
> > >
> > >>
> > >> 3. parent/child jobs upstream, where rpms and containers will be
> > >> built and hosted as artifacts for the child jobs
> > >
> > >
> > > Yes, we need to investigate that.
> > >
> > >>
> > >> 4. remove some portion of the upstream jobs to lower the impact
> > >> we have on 3rd party infrastructure.
> > >
> > >
> > > I'm not sure I understand this one; maybe you can give an example
> > > of what could be removed?
> >
> > We need to re-evaluate our use of scenarios (e.g. we have two
> > scenario010s, both non-voting). There's a reason we historically
> > didn't want to add more jobs: these types of resource constraints.
> > I think we've added new jobs recently and likely need to reduce what
> > we run. Additionally, we might want to look into reducing what we
> > run on stable branches as well.
> >
> >
> > Oh... removing jobs (I thought we would remove some steps of the jobs).
> > Yes, big +1. This should be a continuous goal when working on CI:
> > always evaluating what we need vs what we run now.
> >
> > We should look at:
> > 1) services deployed in scenarios that aren't worth testing (e.g.
> > deprecated or unused things) (and deprecate the unused things)
> > 2) jobs themselves (I don't have any example besides scenario010 but
> > I'm sure there are more).
> > --
> > Emilien Macchi
> >
> >
> > Thanks Alex, Emilien
> >
> > +1 to reviewing the catalog and adjusting things on an ongoing basis.
> >
> > All.. it looks like the issues with docker.io were more of a
> > flare-up than a change in docker.io policy or infrastructure [2].
> > The flare-up started on July 27 at 08:00 UTC and ended on July 27
> > at 17:00 UTC; see screenshots.
>
> The number of image prepare workers and their exponential fallback
> intervals should also be adjusted. I've analysed the log snippet [0]
> for the connection reset counts by workers versus the times the rate
> limiting was triggered. See the details in the reported bug [1].
>
> tl;dr -- for an example 5-second interval (03:55:31,379 - 03:55:36,110):
>
> Conn reset counts by worker PID:
>   resets  worker PID
>        3  58412
>        2  58413
>        3  58415
>        3  58417
>
> which seems like too many (workers * reconnects) and triggers rate
> limiting immediately.
>
> [0] https://13b475d7469ed7126ee9-28d4ad440f46f2186fe3f98464e57890.ssl.cf1.rackcdn.com/741228/6/check/tripleo-ci-centos-8-undercloud-containers/8e47836/logs/undercloud/var/log/tripleo-container-image-prepare.log
>
> [1] https://bugs.launchpad.net/tripleo/+bug/1889372
>
> --
> Best regards,
> Bogdan Dobrelya,
> Irc #bogdando
>
>
FYI...
The issue w/ "too many requests" is back. Expect delays and failures when
attempting to merge your patches upstream across all branches. The issue
is being tracked as critical.
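
To illustrate Bogdan's point above about worker counts and backoff
intervals, here is a minimal sketch (hypothetical Python, not the actual
tripleo-container-image-prepare code; the function name and retry
parameters are made up) of a pull loop that retries on HTTP 429 with
exponential backoff and full jitter, so that parallel workers don't pile
up (workers * reconnects) requests within a few seconds:

# Illustrative sketch only -- not the TripleO implementation.
# Registry auth (token fetch, etc.) is deliberately omitted.
import random
import time
import urllib.error
import urllib.request

def fetch_with_backoff(url, max_retries=5, base_delay=2.0, max_delay=60.0):
    """Fetch a URL, backing off exponentially (with jitter) on HTTP 429."""
    for attempt in range(max_retries):
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except urllib.error.HTTPError as exc:
            if exc.code != 429:
                # Not a rate-limit response; let the caller handle it.
                raise
            # Full jitter: sleep a random amount up to the exponential cap,
            # so concurrent workers don't retry in lock-step.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
    raise RuntimeError("still rate limited after %d retries: %s"
                       % (max_retries, url))

Compare that with the log snippet Bogdan analysed, where four workers each
hit 2-3 connection resets inside a single 5-second window.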