[tripleo][ci] container pulls failing

Wesley Hayutin whayutin at redhat.com
Wed Aug 5 16:23:46 UTC 2020


On Wed, Jul 29, 2020 at 4:48 PM Wesley Hayutin <whayutin at redhat.com> wrote:

>
>
> On Wed, Jul 29, 2020 at 4:33 PM Alex Schultz <aschultz at redhat.com> wrote:
>
>> On Wed, Jul 29, 2020 at 7:13 AM Wesley Hayutin <whayutin at redhat.com>
>> wrote:
>> >
>> >
>> >
>> > On Wed, Jul 29, 2020 at 2:25 AM Bogdan Dobrelya <bdobreli at redhat.com> wrote:
>> >>
>> >> On 7/28/20 6:09 PM, Wesley Hayutin wrote:
>> >> >
>> >> >
>> >> > On Tue, Jul 28, 2020 at 7:24 AM Emilien Macchi <emilien at redhat.com> wrote:
>> >> >
>> >> >
>> >> >
>> >> >     On Tue, Jul 28, 2020 at 9:20 AM Alex Schultz <aschultz at redhat.com> wrote:
>> >> >
>> >> >         On Tue, Jul 28, 2020 at 7:13 AM Emilien Macchi <emilien at redhat.com> wrote:
>> >> >          >
>> >> >          >
>> >> >          >
>> >> >          > On Mon, Jul 27, 2020 at 5:27 PM Wesley Hayutin <whayutin at redhat.com> wrote:
>> >> >          >>
>> >> >          >> FYI...
>> >> >          >>
>> >> >          >> If you find your jobs are failing with an error similar to
>> >> >         [1], you have been rate limited by docker.io via the upstream
>> >> >         mirror system and have hit [2].  I've been discussing the issue
>> >> >         w/ upstream infra, rdo-infra and a few CI engineers.
>> >> >          >>
>> >> >          >> There are a few ways to mitigate the issue; however, I don't
>> >> >         see any of the options being completed very quickly, so I'm
>> >> >         asking for your patience while this issue is socialized and
>> >> >         resolved.
>> >> >          >>
>> >> >          >> For full transparency we're considering the following options.
>> >> >          >>
>> >> >          >> 1. move off of docker.io to quay.io
>> >> >          >
>> >> >          >
>> >> >          > quay.io also has an API rate limit:
>> >> >          > https://docs.quay.io/issues/429.html
>> >> >          >
>> >> >          > Now I'm not sure how many requests per second one can do on
>> >> >         one vs the other, but this would need to be checked with the
>> >> >         quay team before changing anything.
>> >> >          > Also, quay.io has had its big downtimes as well; the SLA
>> >> >         needs to be considered.
>> >> >          >
>> >> >          >> 2. local container builds for each job in master, possibly ussuri
>> >> >          >
>> >> >          >
>> >> >          > Not convinced.
>> >> >          > You can look at CI logs:
>> >> >          > - pulling / updating / pushing container images from
>> >> >         docker.io to the local registry takes ~10 min on standalone (OVH)
>> >> >          > - building containers from scratch with updated repos and
>> >> >         pushing them to the local registry takes ~29 min on standalone (OVH).
>> >> >          >
>> >> >          >>
>> >> >          >> 3. parent/child jobs upstream where rpms and containers will
>> >> >         be built and host artifacts for the child jobs
>> >> >          >
>> >> >          >
>> >> >          > Yes, we need to investigate that.
>> >> >          >
>> >> >          >>
>> >> >          >> 4. remove some portion of the upstream jobs to lower the
>> >> >         impact we have on 3rd party infrastructure.
>> >> >          >
>> >> >          >
>> >> >          > I'm not sure I understand this one, maybe you can give an
>> >> >         example of what could be removed?
>> >> >
>> >> >         We need to re-evaluate our use of scenarios (e.g. we have two
>> >> >         scenario010's, both non-voting).  There's a reason we
>> >> >         historically didn't want to add more jobs because of these
>> >> >         types of resource constraints.  I think we've added new jobs
>> >> >         recently and likely need to reduce what we run. Additionally,
>> >> >         we might want to look into reducing what we run on stable
>> >> >         branches as well.
>> >> >
>> >> >
>> >> >     Oh... removing jobs (I thought we would remove some steps of the
>> >> >     jobs). Yes, big +1, this should be a continuous goal when working
>> >> >     on CI, and always evaluating what we need vs what we run now.
>> >> >
>> >> >     We should look at:
>> >> >     1) services deployed in scenarios that aren't worth testing (e.g.
>> >> >     deprecated or unused things) (and deprecate the unused things)
>> >> >     2) jobs themselves (I don't have any example besides scenario010,
>> >> >     but I'm sure there are more).
>> >> >     --
>> >> >     Emilien Macchi
>> >> >
>> >> >
>> >> > Thanks Alex, Emilien
>> >> >
>> >> > +1 to reviewing the catalog and adjusting things on an ongoing basis.
>> >> >
>> >> > All.. it looks like the issues with docker.io were more of a flare-up
>> >> > than a change in docker.io policy or infrastructure [2].  The flare-up
>> >> > started on July 27 at 08:00 UTC and ended on July 27 at 17:00 UTC, see
>> >> > screenshots.
>> >>
>> >> The number of image prepare workers and their exponential fallback
>> >> intervals should also be adjusted. I've analysed the log snippet [0] for
>> >> the connection reset counts by workers versus the times the rate
>> >> limiting was triggered. See the details in the reported bug [1].
>> >>
>> >> tl;dr -- for an example 5 sec interval 03:55:31,379 - 03:55:36,110:
>> >>
>> >> Conn Reset Counts by a Worker PID:
>> >>        3 58412
>> >>        2 58413
>> >>        3 58415
>> >>        3 58417
>> >>
>> >> which seems like too many (workers * reconnects) and triggers rate
>> >> limiting immediately.
>> >>
>> >> [0]
>> >> https://13b475d7469ed7126ee9-28d4ad440f46f2186fe3f98464e57890.ssl.cf1.rackcdn.com/741228/6/check/tripleo-ci-centos-8-undercloud-containers/8e47836/logs/undercloud/var/log/tripleo-container-image-prepare.log
>> >>
>> >> [1] https://bugs.launchpad.net/tripleo/+bug/1889372
>> >>
>> >> --
>> >> Best regards,
>> >> Bogdan Dobrelya,
>> >> Irc #bogdando
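
To make the workers-times-reconnects arithmetic above concrete: the usual way
to keep a pool of pull workers from stampeding a throttled registry is a small
shared worker pool plus capped exponential backoff with jitter. The sketch
below is a minimal Python illustration of that pattern, not the actual
tripleo-container-image-prepare code; the names (fetch_with_backoff,
MAX_WORKERS, and so on) are made up for illustration.

import random
import time
from concurrent.futures import ThreadPoolExecutor

import requests

MAX_WORKERS = 2     # every extra worker multiplies the retry traffic
MAX_ATTEMPTS = 5
BASE_DELAY = 2.0    # seconds
MAX_DELAY = 60.0

def fetch_with_backoff(url, session=None):
    """Fetch a registry URL, backing off on 429s and connection resets."""
    session = session or requests.Session()
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            resp = session.get(url, timeout=30)
            if resp.status_code != 429:
                resp.raise_for_status()
                return resp
        except requests.exceptions.ConnectionError:
            pass  # reset by the mirror; treat it like a throttle
        if attempt == MAX_ATTEMPTS:
            raise RuntimeError("still rate limited after %d attempts: %s"
                               % (attempt, url))
        # Exponential backoff with full jitter so the workers do not
        # retry in lockstep and re-trigger the rate limit together.
        delay = min(MAX_DELAY, BASE_DELAY * 2 ** (attempt - 1))
        time.sleep(random.uniform(0, delay))

def prefetch(urls):
    """Pull a list of manifest/blob URLs with a small shared pool."""
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        return list(pool.map(fetch_with_backoff, urls))

With four or five workers each reconnecting a few times inside a five-second
window, as in the log excerpt, the registry sees one burst; with jittered
backoff the same retries spread out over minutes instead.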
>> >>
>> >
>> > FYI..
>> >
>> > The issue w/ "too many requests" is back.  Expect delays and failures in
>> > attempting to merge your patches upstream across all branches.  It is
>> > being tracked as a critical issue.
>>
>> Working with the infra folks, we have identified the authorization
>> header as causing issues when we're redirected from docker.io to
>> Cloudflare. I'll throw up a patch tomorrow to handle this case, which
>> should improve our usage of the cache.  It needs some testing against
>> other registries to ensure that we don't break authenticated fetching
>> of resources.
>>
>> Thanks Alex!
>
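
For context on the authorization-header point above: the general technique is
to follow the registry's redirect yourself and send the bearer token only to
the original registry host, since the location it redirects to typically
carries its own signed parameters and the extra header is what was reportedly
tripping up the cache. A rough requests-based sketch of that idea, not the
actual tripleo-common patch; get_blob and its arguments are illustrative
names.

from urllib.parse import urlparse

import requests

def get_blob(url, token, timeout=30):
    """GET a registry URL without forwarding the bearer token across hosts."""
    headers = {"Authorization": "Bearer %s" % token}
    resp = requests.get(url, headers=headers, timeout=timeout,
                        allow_redirects=False)
    while resp.is_redirect or resp.is_permanent_redirect:
        next_url = requests.compat.urljoin(resp.url, resp.headers["location"])
        # Keep the Authorization header only while we stay on the
        # original registry host; drop it once the redirect leaves it.
        same_host = urlparse(next_url).netloc == urlparse(url).netloc
        resp = requests.get(next_url,
                            headers=headers if same_host else {},
                            timeout=timeout, allow_redirects=False)
    resp.raise_for_status()
    return resp

(For what it's worth, a plain requests Session already strips Authorization
when it follows a redirect to a different host on its own; code that handles
redirects manually has to do the equivalent itself, which is roughly what the
sketch shows.)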


FYI.. the container pull issue, "too many requests", is back.
Alex has some fresh patches on it:
https://review.opendev.org/#/q/status:open+project:openstack/tripleo-common+topic:bug/1889122

Expect trouble in check and gate:
http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22429%20Client%20Error%3A%20Too%20Many%20Requests%20for%20url%3A%5C%22%20AND%20voting%3A1
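
If you want to check whether a particular node is being throttled by Docker
Hub itself rather than by the proxy layer, the registry reports the current
limit in its response headers. A small sketch, assuming Docker Hub's
documented ratelimitpreview/test repository and its ratelimit-* headers; the
HEAD request reportedly does not count against the limit.

import requests

AUTH_URL = ("https://auth.docker.io/token"
            "?service=registry.docker.io"
            "&scope=repository:ratelimitpreview/test:pull")
MANIFEST_URL = ("https://registry-1.docker.io/v2/"
                "ratelimitpreview/test/manifests/latest")

def check_docker_hub_limit():
    """Print the current anonymous pull limit and what is left of it."""
    token = requests.get(AUTH_URL, timeout=30).json()["token"]
    resp = requests.head(MANIFEST_URL, timeout=30,
                         headers={"Authorization": "Bearer %s" % token})
    # The values look like "100;w=21600", i.e. 100 pulls per 21600 s window.
    print("ratelimit-limit:    ", resp.headers.get("ratelimit-limit"))
    print("ratelimit-remaining:", resp.headers.get("ratelimit-remaining"))

if __name__ == "__main__":
    check_docker_hub_limit()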

