On Wed, Jul 29, 2020 at 2:25 AM Bogdan Dobrelya <bdobreli@redhat.com> wrote:
On 7/28/20 6:09 PM, Wesley Hayutin wrote:
>
>
> On Tue, Jul 28, 2020 at 7:24 AM Emilien Macchi <emilien@redhat.com> wrote:
>
>
>
>     On Tue, Jul 28, 2020 at 9:20 AM Alex Schultz <aschultz@redhat.com> wrote:
>
>         On Tue, Jul 28, 2020 at 7:13 AM Emilien Macchi <emilien@redhat.com> wrote:
>          >
>          >
>          >
>          > On Mon, Jul 27, 2020 at 5:27 PM Wesley Hayutin <whayutin@redhat.com> wrote:
>          >>
>          >> FYI...
>          >>
>          >> If you find your jobs are failing with an error similar to
>          >> [1], you have been rate limited by docker.io via the
>          >> upstream mirror system and have hit [2].  I've been
>          >> discussing the issue w/ upstream infra, rdo-infra and a few
>          >> CI engineers.
>          >>
>          >> There are a few ways to mitigate the issue; however, I don't
>          >> see any of the options being completed very quickly, so I'm
>          >> asking for your patience while this issue is socialized and
>          >> resolved.
>          >>
>          >> For full transparency we're considering the following options.
>          >>
>          >> 1. move off of docker.io to quay.io
>          >
>          >
>          > quay.io also has an API rate limit:
>          > https://docs.quay.io/issues/429.html
>          >
>          > I'm not sure how many requests per second one allows vs. the
>          > other, so this would need to be checked with the quay team
>          > before changing anything.
>          > Also, quay.io has had big downtimes as well, so the SLA needs
>          > to be considered.
>          >
>          >> 2. local container builds for each job in master, possibly
>         ussuri
>          >
>          >
>          > Not convinced.
>          > You can look at CI logs:
>          > - pulling / updating / pushing container images from
>          >   docker.io to local registry takes ~10 min on standalone (OVH)
>          > - building containers from scratch with updated repos and
>          >   pushing them to local registry takes ~29 min on standalone (OVH).
>          >
>          >>
>          >> 3. parent/child jobs upstream, where rpms and containers will
>          >> be built and hosted as artifacts for the child jobs
>          >
>          >
>          > Yes, we need to investigate that.
>          >
>          >>
>          >> 4. remove some portion of the upstream jobs to lower the
>         impact we have on 3rd party infrastructure.
>          >
>          >
>          > I'm not sure I understand this one, maybe you can give an
>         example of what could be removed?
>
>         We need to re-evaluate our use of scenarios (e.g. we have two
>         scenario010 jobs, both non-voting).  There's a reason we
>         historically didn't want to add more jobs: these types of
>         resource constraints.  I think we've added new jobs recently
>         and likely need to reduce what we run.  Additionally, we might
>         want to look into reducing what we run on stable branches as
>         well.
>
>
>     Oh... removing jobs (I thought we would remove some steps of the jobs).
>     Yes big +1, this should be a continuous goal when working on CI, and
>     always evaluating what we need vs what we run now.
>
>     We should look at:
>     1) services deployed in scenarios that aren't worth testing (e.g.
>     deprecated or unused things) (and deprecate the unused things)
>     2) jobs themselves (I don't have any example besides scenario010 but
>     I'm sure there are more).
>     --
>     Emilien Macchi
>
>
> Thanks Alex, Emilien
>
> +1 to reviewing the catalog and adjusting things on an ongoing basis.
>
> All.. it looks like the issues with docker.io were more of a flare-up
> than a change in docker.io policy or infrastructure [2].  The flare-up
> started on July 27 at 08:00 UTC and ended on July 27 at 17:00 UTC, see
> screenshots.

The number of image prepare workers and their exponential backoff
intervals should also be adjusted. I've analysed the log snippet [0],
comparing the connection reset counts per worker against the times the
rate limiting was triggered. See the details in the reported bug [1].
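
As a rough illustration of what adjusting the intervals could look like,
here is a minimal sketch of exponential backoff with full jitter, so that
parallel workers do not all reconnect at the same moment. The names
(backoff_delay, retry_schedule) and the base/cap values are hypothetical
and are not taken from the tripleo image prepare code.

```python
import random


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Full-jitter backoff: random delay in [0, min(cap, base * 2**attempt)]."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def retry_schedule(retries, base=1.0, cap=60.0, seed=None):
    """Return the sleep intervals (seconds) one worker would use per retry."""
    rng = random.Random(seed)  # per-worker generator; seed only for repeatability
    return [backoff_delay(n, base, cap, rng) for n in range(retries)]


if __name__ == "__main__":
    # Four hypothetical workers, each with its own randomized schedule.
    for worker in range(4):
        print(worker, [round(d, 2) for d in retry_schedule(5, seed=worker)])
```

Because each worker draws its own random delay, the retries spread out
over the interval instead of arriving as one burst.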

tl;dr -- for an example 5 sec interval 03:55:31,379 - 03:55:36,110:

Connection reset counts per worker PID (count, PID):
       3 58412
       2 58413
       3 58415
       3 58417

That is 11 connection resets from 4 workers in under 5 seconds, which
seems like too many (workers * reconnects) and triggers rate limiting
immediately.
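
For anyone who wants to repeat this kind of count on another log, a
minimal sketch follows. It assumes a line layout of
"<date> <HH:MM:SS,mmm> <pid> <level> <message>" and that reset lines
contain the text "Connection reset"; both are assumptions and should be
checked against the actual log [0]. The function name resets_by_pid is
hypothetical.

```python
import re
from collections import Counter

# Assumed layout (hypothetical): "<date> <HH:MM:SS,mmm> <pid> <level> <message>"
LINE_RE = re.compile(r"^\S+ (\d{2}:\d{2}:\d{2}),(\d{3}) (\d+) ")


def to_ms(hms, ms):
    """Convert 'HH:MM:SS' plus milliseconds to milliseconds since midnight."""
    h, m, s = (int(x) for x in hms.split(":"))
    return ((h * 60 + m) * 60 + s) * 1000 + int(ms)


def resets_by_pid(lines, start, end, needle="Connection reset"):
    """Count lines containing `needle` per PID, for start <= time <= end.

    `start` and `end` use the 'HH:MM:SS,mmm' form, e.g. '03:55:31,379'.
    """
    lo = to_ms(*start.split(","))
    hi = to_ms(*end.split(","))
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if not match or needle not in line:
            continue  # skip unparseable lines and lines that are not resets
        if lo <= to_ms(match.group(1), match.group(2)) <= hi:
            counts[int(match.group(3))] += 1
    return counts
```

Called with the lines of the log and the window "03:55:31,379" to
"03:55:36,110", it returns a Counter keyed by worker PID.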

[0]
https://13b475d7469ed7126ee9-28d4ad440f46f2186fe3f98464e57890.ssl.cf1.rackcdn.com/741228/6/check/tripleo-ci-centos-8-undercloud-containers/8e47836/logs/undercloud/var/log/tripleo-container-image-prepare.log

[1] https://bugs.launchpad.net/tripleo/+bug/1889372

--
Best regards,
Bogdan Dobrelya,
Irc #bogdando


FYI..

The issue w/ "too many requests" is back.  Expect delays and failures when attempting to merge your patches upstream across all branches.  It is being tracked as a critical issue.