[tripleo][ci] container pulls failing
Wesley Hayutin
whayutin at redhat.com
Wed Aug 5 16:23:46 UTC 2020
On Wed, Jul 29, 2020 at 4:48 PM Wesley Hayutin <whayutin at redhat.com> wrote:
>
>
> On Wed, Jul 29, 2020 at 4:33 PM Alex Schultz <aschultz at redhat.com> wrote:
>
>> On Wed, Jul 29, 2020 at 7:13 AM Wesley Hayutin <whayutin at redhat.com>
>> wrote:
>> >
>> >
>> >
>> > On Wed, Jul 29, 2020 at 2:25 AM Bogdan Dobrelya <bdobreli at redhat.com> wrote:
>> >>
>> >> On 7/28/20 6:09 PM, Wesley Hayutin wrote:
>> >> >
>> >> >
>> >> > On Tue, Jul 28, 2020 at 7:24 AM Emilien Macchi <emilien at redhat.com> wrote:
>> >> >
>> >> >
>> >> >
>> >> > On Tue, Jul 28, 2020 at 9:20 AM Alex Schultz <aschultz at redhat.com> wrote:
>> >> >
>> >> > On Tue, Jul 28, 2020 at 7:13 AM Emilien Macchi <emilien at redhat.com> wrote:
>> >> > >
>> >> > >
>> >> > >
>> >> > > On Mon, Jul 27, 2020 at 5:27 PM Wesley Hayutin <whayutin at redhat.com> wrote:
>> >> > >>
>> >> > >> FYI...
>> >> > >>
>> >> > >> If you find your jobs are failing with an error similar to [1],
>> >> > >> you have been rate limited by docker.io via the upstream mirror
>> >> > >> system and have hit [2]. I've been discussing the issue w/
>> >> > >> upstream infra, rdo-infra and a few CI engineers.
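For anyone curious what the failure mode looks like from the client side, here is a minimal, hypothetical sketch (not TripleO code; the mirror URL, image path and function name are made up for illustration) of detecting the 429 response and honoring Retry-After before retrying:

import time
import requests

# Hypothetical mirror/proxy URL and image path, for illustration only.
MIRROR = "https://mirror.example.org:8082/registry-1.docker.io"
MANIFEST_URL = MIRROR + "/v2/tripleomaster/centos-binary-base/manifests/current-tripleo"

def fetch_with_backoff(url, attempts=5):
    # Real registry requests also need an Accept header and a bearer
    # token; both are omitted here to keep the sketch short.
    for attempt in range(attempts):
        resp = requests.get(url)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp
        # 429 Too Many Requests: back off before retrying, preferring
        # the server-provided Retry-After value when it is present.
        delay = int(resp.headers.get("Retry-After", 2 ** attempt))
        time.sleep(delay)
    raise RuntimeError("still rate limited after %d attempts" % attempts)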
>> >> > >>
>> >> > >> There are a few ways to mitigate the issue, however I don't see
>> >> > >> any of the options being completed very quickly, so I'm asking
>> >> > >> for your patience while this issue is socialized and resolved.
>> >> > >>
>> >> > >> For full transparency we're considering the following options.
>> >> > >>
>> >> > >> 1. move off of docker.io to quay.io
>> >> > >
>> >> > >
>> >> > > quay.io also has an API rate limit:
>> >> > > https://docs.quay.io/issues/429.html
>> >> > >
>> >> > > Now I'm not sure about how many requests per second one can do
>> >> > > vs the other, but this would need to be checked with the quay
>> >> > > team before changing anything.
>> >> > > Also quay.io had its big downtimes as well, so the SLA needs to
>> >> > > be considered.
>> >> > >
>> >> > >> 2. local container builds for each job in master, possibly ussuri
>> >> > >
>> >> > >
>> >> > > Not convinced.
>> >> > > You can look at CI logs:
>> >> > > - pulling / updating / pushing container images from docker.io
>> >> > >   to local registry takes ~10 min on standalone (OVH)
>> >> > > - building containers from scratch with updated repos and pushing
>> >> > >   them to local registry takes ~29 min on standalone (OVH).
>> >> > >
>> >> > >>
>> >> > >> 3. parent/child jobs upstream where rpms and containers will be
>> >> > >>    built and hosted as artifacts for the child jobs
>> >> > >
>> >> > >
>> >> > > Yes, we need to investigate that.
>> >> > >
>> >> > >>
>> >> > >> 4. remove some portion of the upstream jobs to lower the impact
>> >> > >>    we have on 3rd party infrastructure.
>> >> > >
>> >> > >
>> >> > > I'm not sure I understand this one, maybe you can give an
>> >> > > example of what could be removed?
>> >> >
>> >> > We need to re-evaluate our use of scenarios (e.g. we have two
>> >> > scenario010s, both non-voting). There's a reason we historically
>> >> > didn't want to add more jobs because of these types of resource
>> >> > constraints. I think we've added new jobs recently and likely need
>> >> > to reduce what we run. Additionally we might want to look into
>> >> > reducing what we run on stable branches as well.
>> >> >
>> >> >
>> >> > Oh... removing jobs (I thought we would remove some steps of the
>> >> > jobs). Yes, big +1, this should be a continuous goal when working
>> >> > on CI, always evaluating what we need vs what we run now.
>> >> >
>> >> > We should look at:
>> >> > 1) services deployed in scenarios that aren't worth testing (e.g.
>> >> > deprecated or unused things) (and deprecate the unused things)
>> >> > 2) jobs themselves (I don't have any example besides scenario010
>> >> >    but I'm sure there are more).
>> >> > --
>> >> > Emilien Macchi
>> >> >
>> >> >
>> >> > Thanks Alex, Emilien
>> >> >
>> >> > +1 to reviewing the catalog and adjusting things on an ongoing basis.
>> >> >
>> >> > All.. it looks like the issues with docker.io were more of a flare
>> >> > up than a change in docker.io policy or infrastructure [2]. The
>> >> > flare up started on July 27 at 08:00 UTC and ended on July 27 at
>> >> > 17:00 UTC, see screenshots.
>> >>
>> >> The number of image prepare workers and their exponential fallback
>> >> intervals should also be adjusted. I've analysed the log snippet [0]
>> >> for the connection reset counts by workers versus the times the rate
>> >> limiting was triggered. See the details in the reported bug [1].
>> >>
>> >> tl;dr -- for an example 5 sec interval 03:55:31,379 - 03:55:36,110:
>> >>
>> >> Connection reset counts by worker PID (count  PID):
>> >>   3  58412
>> >>   2  58413
>> >>   3  58415
>> >>   3  58417
>> >>
>> >> which seems like too many (workers * reconnects) and triggers rate
>> >> limiting immediately.
>> >>
>> >> [0]
>> >> https://13b475d7469ed7126ee9-28d4ad440f46f2186fe3f98464e57890.ssl.cf1.rackcdn.com/741228/6/check/tripleo-ci-centos-8-undercloud-containers/8e47836/logs/undercloud/var/log/tripleo-container-image-prepare.log
>> >>
>> >> [1] https://bugs.launchpad.net/tripleo/+bug/1889372
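To illustrate the kind of adjustment being suggested here, a rough sketch (not the tripleo-common implementation; the worker counts, retry counts and function names are illustrative assumptions) of capping the number of parallel pull workers and adding jittered exponential backoff so the workers don't all reconnect within the same few seconds:

import random
import time
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 2   # illustrative: fewer parallel pullers, fewer simultaneous requests
MAX_RETRIES = 4

def pull_with_backoff(pull_fn, image):
    for attempt in range(MAX_RETRIES):
        try:
            return pull_fn(image)
        except ConnectionResetError:
            # Jitter keeps the workers from retrying in lock-step, which is
            # what makes (workers * reconnects) trip the rate limit.
            time.sleep((2 ** attempt) + random.uniform(0, 1))
    raise RuntimeError("giving up on %s after %d retries" % (image, MAX_RETRIES))

def prepare(images, pull_fn):
    # Throttle concurrency so a burst of retries can't multiply into a
    # flood of requests against the registry.
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        return list(pool.map(lambda img: pull_with_backoff(pull_fn, img), images))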
>> >>
>> >> --
>> >> Best regards,
>> >> Bogdan Dobrelya,
>> >> Irc #bogdando
>> >>
>> >
>> > FYI..
>> >
>> > The issue w/ "too many requests" is back. Expect delays and failures
>> in attempting to merge your patches upstream across all branches. The
>> issue is being tracked as a critical issue.
>>
>> Working with the infra folks, we have identified the authorization
>> header as causing issues when we're redirected from docker.io to
>> CloudFlare. I'll throw up a patch tomorrow to handle this case, which
>> should improve our usage of the cache. It needs some testing against
>> other registries to ensure that we don't break authenticated fetching
>> of resources.
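If it helps to picture the idea, here is a hedged sketch (not the actual patch; the function name and flow are made up for illustration) of dropping the Authorization header when the registry redirects a blob request to a different host such as the CDN, so the cache can actually be used:

from urllib.parse import urlparse
import requests

def get_blob(url, token):
    headers = {"Authorization": "Bearer %s" % token}
    resp = requests.get(url, headers=headers, allow_redirects=False)
    if resp.status_code in (301, 302, 303, 307, 308):
        redirect = resp.headers["Location"]
        if urlparse(redirect).netloc != urlparse(url).netloc:
            # Redirected to another host (e.g. the CDN): the bearer token
            # isn't needed there and forwarding it can defeat caching, so
            # re-issue the request without the Authorization header.
            return requests.get(redirect)
        return requests.get(redirect, headers=headers)
    return resp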
>>
> Thanks Alex!
>
FYI.. the container pull issue, "too many requests", has returned.
Alex has some fresh patches on it:
https://review.opendev.org/#/q/status:open+project:openstack/tripleo-common+topic:bug/1889122
Expect trouble in check and gate:
http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22429%20Client%20Error%3A%20Too%20Many%20Requests%20for%20url%3A%5C%22%20AND%20voting%3A1