[tripleo][ci] container pulls failing

Luke Short ekultails at gmail.com
Wed Aug 19 13:15:06 UTC 2020


Hey folks,

All of the latest patches to address this have been merged, but we are
still seeing this error randomly in CI jobs that involve an Undercloud or
Standalone node. As far as I can tell, the error appears less often than
before, but it is still present, which makes merging new patches difficult.
I would be happy to help work towards other possible solutions; however, I
am unsure where to start from here. Any help would be greatly appreciated.

Sincerely,
    Luke Short

On Wed, Aug 5, 2020 at 12:26 PM Wesley Hayutin <whayutin at redhat.com> wrote:

>
>
> On Wed, Jul 29, 2020 at 4:48 PM Wesley Hayutin <whayutin at redhat.com>
> wrote:
>
>>
>>
>> On Wed, Jul 29, 2020 at 4:33 PM Alex Schultz <aschultz at redhat.com> wrote:
>>
>>> On Wed, Jul 29, 2020 at 7:13 AM Wesley Hayutin <whayutin at redhat.com>
>>> wrote:
>>> >
>>> >
>>> >
>>> > On Wed, Jul 29, 2020 at 2:25 AM Bogdan Dobrelya <bdobreli at redhat.com>
>>> wrote:
>>> >>
>>> >> On 7/28/20 6:09 PM, Wesley Hayutin wrote:
>>> >> >
>>> >> >
>>> >> > On Tue, Jul 28, 2020 at 7:24 AM Emilien Macchi <emilien at redhat.com
>>> >> > <mailto:emilien at redhat.com>> wrote:
>>> >> >
>>> >> >
>>> >> >
>>> >> >     On Tue, Jul 28, 2020 at 9:20 AM Alex Schultz <
>>> aschultz at redhat.com
>>> >> >     <mailto:aschultz at redhat.com>> wrote:
>>> >> >
>>> >> >         On Tue, Jul 28, 2020 at 7:13 AM Emilien Macchi
>>> >> >         <emilien at redhat.com <mailto:emilien at redhat.com>> wrote:
>>> >> >          >
>>> >> >          >
>>> >> >          >
>>> >> >          > On Mon, Jul 27, 2020 at 5:27 PM Wesley Hayutin
>>> >> >         <whayutin at redhat.com <mailto:whayutin at redhat.com>> wrote:
>>> >> >          >>
>>> >> >          >> FYI...
>>> >> >          >>
>>> >> >          >> If you find your jobs are failing with an error similar
>>> to
>>> >> >         [1], you have been rate limited by docker.io <
>>> http://docker.io>
>>> >> >         via the upstream mirror system and have hit [2].  I've been
>>> >> >         discussing the issue w/ upstream infra, rdo-infra and a few
>>> CI
>>> >> >         engineers.
>>> >> >          >>
>>> >> >          >> There are a few ways to mitigate the issue however I
>>> don't
>>> >> >         see any of the options being completed very quickly so I'm
>>> >> >         asking for your patience while this issue is socialized and
>>> >> >         resolved.
>>> >> >          >>
>>> >> >          >> For full transparency we're considering the following
>>> options.
>>> >> >          >>
>>> >> >          >> 1. move off of docker.io <http://docker.io> to quay.io
>>> >> >         <http://quay.io>
>>> >> >          >
>>> >> >          >
>>> >> >          > quay.io <http://quay.io> also has API rate limit:
>>> >> >          > https://docs.quay.io/issues/429.html
>>> >> >          >
>>> >> >          > Now I'm not sure about how many requests per second one
>>> can
>>> >> >         do vs the other but this would need to be checked with the
>>> quay
>>> >> >         team before changing anything.
>>> >> >          > Also quay.io <http://quay.io> had its big downtimes as
>>> well,
>>> >> >         SLA needs to be considered.
>>> >> >          >
>>> >> >          >> 2. local container builds for each job in master,
>>> possibly
>>> >> >         ussuri
>>> >> >          >
>>> >> >          >
>>> >> >          > Not convinced.
>>> >> >          > You can look at CI logs:
>>> >> >          > - pulling / updating / pushing container images from
>>> >> >         docker.io <http://docker.io> to local registry takes ~10
>>> min on
>>> >> >         standalone (OVH)
>>> >> >          > - building containers from scratch with updated repos and
>>> >> >         pushing them to local registry takes ~29 min on standalone
>>> (OVH).
>>> >> >          >
>>> >> >          >>
>>> >> >          >> 3. parent child jobs upstream where rpms and containers
>>> will
>>> >> >         be built and hosted as artifacts for the child jobs
>>> >> >          >
>>> >> >          >
>>> >> >          > Yes, we need to investigate that.
>>> >> >          >
>>> >> >          >>
>>> >> >          >> 4. remove some portion of the upstream jobs to lower the
>>> >> >         impact we have on 3rd party infrastructure.
>>> >> >          >
>>> >> >          >
>>> >> >          > I'm not sure I understand this one, maybe you can give an
>>> >> >         example of what could be removed?
>>> >> >
>>> >> >         We need to re-evaluate our use of scenarios (e.g. we have
>>> two
>>> >> >         scenario010's both are non-voting).  There's a reason we
>>> >> >         historically
>>> >> >         didn't want to add more jobs because of these types of
>>> resource
>>> >> >         constraints.  I think we've added new jobs recently and
>>> likely
>>> >> >         need to
>>> >> >         reduce what we run. Additionally we might want to look into
>>> reducing
>>> >> >         what we run on stable branches as well.
>>> >> >
>>> >> >
>>> >> >     Oh... removing jobs (I thought we would remove some steps of
>>> the jobs).
>>> >> >     Yes big +1, this should be a continuous goal when working on
>>> CI, and
>>> >> >     always evaluating what we need vs what we run now.
>>> >> >
>>> >> >     We should look at:
>>> >> >     1) services deployed in scenarios that aren't worth testing
>>> (e.g.
>>> >> >     deprecated or unused things) (and deprecate the unused things)
>>> >> >     2) jobs themselves (I don't have any example beside scenario010
>>> but
>>> >> >     I'm sure there are more).
>>> >> >     --
>>> >> >     Emilien Macchi
>>> >> >
>>> >> >
>>> >> > Thanks Alex, Emilien
>>> >> >
>>> >> > +1 to reviewing the catalog and adjusting things on an ongoing
>>> basis.
>>> >> >
>>> >> > All.. it looks like the issues with docker.io <http://docker.io>
>>> were
>>> >> > more of a flare up than a change in docker.io <http://docker.io>
>>> policy
>>> >> > or infrastructure [2].  The flare up started on July 27 08:00 UTC and
>>> >> > ended on July 27 17:00 UTC, see screenshots.
>>> >>
>>> >> The number of image prepare workers and their exponential backoff
>>> >> intervals should also be adjusted. I've analysed the log snippet [0]
>>> for
>>> >> the connection reset counts by workers versus the times the rate
>>> >> limiting was triggered. See the details in the reported bug [1].
>>> >>
>>> >> tl;dr -- for an example 5 sec interval 03:55:31,379 - 03:55:36,110:
>>> >>
>>> >> Conn Reset Counts by a Worker PID:
>>> >>        3 58412
>>> >>        2 58413
>>> >>        3 58415
>>> >>        3 58417
>>> >>
>>> >> which seems like too many (workers*reconnects) and triggers rate
>>> limiting
>>> >> immediately.
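
To illustrate the (workers*reconnects) concern above, here is a minimal Python sketch of capped exponential backoff with random jitter, so that workers which fail at the same moment do not all retry at the same moment. This is not the tripleo-common implementation; the function name `backoff_delays` and its `base` and `cap` parameters are hypothetical and chosen only for this example.

```python
import random


def backoff_delays(attempts, base=1.0, cap=60.0, rng=None):
    """Return one sleep interval (seconds) per retry attempt.

    Hypothetical helper: capped exponential backoff with full jitter.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        # The upper bound doubles on every attempt until it reaches the cap.
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: choose uniformly in [0, ceiling] so concurrent
        # workers spread their retries out instead of retrying in lockstep.
        delays.append(rng.uniform(0, ceiling))
    return delays


if __name__ == "__main__":
    # Each worker would compute its own independent schedule.
    for worker in range(4):
        schedule = backoff_delays(5, base=1.0, cap=30.0)
        print(worker, [round(d, 2) for d in schedule])
```

For scale, the counts quoted above add up to 3 + 2 + 3 + 3 = 11 reconnects from four workers in under five seconds, so either fewer workers or longer, randomized intervals would reduce the burst seen by the registry.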
>>> >>
>>> >> [0]
>>> >>
>>> https://13b475d7469ed7126ee9-28d4ad440f46f2186fe3f98464e57890.ssl.cf1.rackcdn.com/741228/6/check/tripleo-ci-centos-8-undercloud-containers/8e47836/logs/undercloud/var/log/tripleo-container-image-prepare.log
>>> >>
>>> >> [1] https://bugs.launchpad.net/tripleo/+bug/1889372
>>> >>
>>> >> --
>>> >> Best regards,
>>> >> Bogdan Dobrelya,
>>> >> Irc #bogdando
>>> >>
>>> >
>>> > FYI..
>>> >
>>> > The issue w/ "too many requests" is back.  Expect delays and failures
>>> in attempting to merge your patches upstream across all branches.   The
>>> issue is being tracked as a critical issue.
>>>
>>> Working with the infra folks and we have identified the authorization
>>> header as causing issues when we're redirected from docker.io to
>>> cloudflare. I'll throw up a patch tomorrow to handle this case which
>>> should improve our usage of the cache.  It needs some testing against
>>> other registries to ensure that we don't break authenticated fetching
>>> of resources.
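
One possible reading of the fix described above, sketched in Python: when a request is redirected to a different host, follow the redirect without the Authorization header, and keep the header when the redirect stays on the same host. This is only an assumption about the approach, not the actual patch; `headers_for_redirect` and the example hostnames are hypothetical.

```python
from urllib.parse import urlsplit


def headers_for_redirect(original_url, redirect_url, headers):
    """Return a copy of headers to use when following a redirect.

    Hypothetical helper: the Authorization header is dropped when the
    redirect target is on a different host than the original request.
    """
    original_host = urlsplit(original_url).hostname
    redirect_host = urlsplit(redirect_url).hostname
    if original_host == redirect_host:
        # Same host: authenticated fetching keeps working as before.
        return dict(headers)
    # Different host: header names are case-insensitive, so compare lowercased.
    return {k: v for k, v in headers.items() if k.lower() != "authorization"}


if __name__ == "__main__":
    sent = {"Authorization": "Bearer example-token", "Accept": "*/*"}
    print(headers_for_redirect(
        "https://registry.example.test/v2/repo/blobs/sha256:abc",
        "https://cdn.example.test/blobs/sha256/abc",
        sent,
    ))
```

As noted above, anything along these lines would still need testing against other registries to make sure authenticated fetches are not broken.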
>>>
>>> Thanks Alex!
>>
>
>
> FYI.. we have been revisited by the container pull issue, "too many
> requests".
> Alex has some fresh patches on it:
> https://review.opendev.org/#/q/status:open+project:openstack/tripleo-common+topic:bug/1889122
>
> expect trouble in check and gate:
>
> http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22429%20Client%20Error%3A%20Too%20Many%20Requests%20for%20url%3A%5C%22%20AND%20voting%3A1
>
>