[tripleo][ci] container pulls failing

Bogdan Dobrelya bdobreli at redhat.com
Wed Aug 19 14:34:34 UTC 2020


On 8/19/20 4:31 PM, Bogdan Dobrelya wrote:
> On 8/19/20 3:55 PM, Alex Schultz wrote:
>> On Wed, Aug 19, 2020 at 7:53 AM Bogdan Dobrelya <bdobreli at redhat.com> 
>> wrote:
>>>
>>> On 8/19/20 3:23 PM, Alex Schultz wrote:
>>>> On Wed, Aug 19, 2020 at 7:15 AM Luke Short <ekultails at gmail.com> wrote:
>>>>>
>>>>> Hey folks,
>>>>>
>>>>> All of the latest patches to address this have been merged in, but 
>>>>> we are still seeing this error randomly in CI jobs that involve an 
>>>>> Undercloud or Standalone node. As far as I can tell, the error is 
>>>>> appearing less often than before, but it is still present, making 
>>>>> merging new patches difficult. I would be happy to help work 
>>>>> towards other possible solutions; however, I am unsure where to start 
>>>>> from here. Any help would be greatly appreciated.
>>>>>
>>>>
>>>> I'm looking at this today but from what I can tell the problem is
>>>> likely caused by a reduced anonymous query quota from docker.io and
>>>> our usage of the upstream mirrors.  Because the mirrors essentially
>>>> funnel all requests through a single IP we're hitting limits faster
>>>> than if we didn't use the mirrors. Because of the authorization header,
>>>> the metadata queries don't get cached, yet they are still subject to
>>>> the rate limiting.  Additionally we're querying the
>>>> registry to determine which containers we need to update in CI because
>>>> we limit our updates to a certain set of containers as part of the CI
>>>> jobs.
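
For anyone who wants to confirm whether a node (or the mirror's shared IP)
is currently being throttled, the same metadata query path can be reproduced
by hand. A rough sketch only, not the tripleo-common code: the repository
name is just an example, and the ratelimit-* headers only show up if
docker.io includes quota information in the response.

    #!/usr/bin/env python3
    import requests

    REPO = "library/centos"   # example repository, any public image works
    TAG = "8"

    def docker_io_probe(repo, tag):
        # Anonymous pull token from Docker Hub's auth service.
        token = requests.get(
            "https://auth.docker.io/token",
            params={"service": "registry.docker.io",
                    "scope": "repository:%s:pull" % repo},
            timeout=30,
        ).json()["token"]
        # A manifest HEAD is the cheapest metadata query and still counts
        # against the anonymous quota.
        resp = requests.head(
            "https://registry-1.docker.io/v2/%s/manifests/%s" % (repo, tag),
            headers={"Authorization": "Bearer %s" % token,
                     "Accept": "application/vnd.docker.distribution.manifest.v2+json"},
            timeout=30,
        )
        print(resp.status_code)   # 429 means we are being throttled
        for name, value in resp.headers.items():
            if name.lower().startswith("ratelimit"):
                print(name, value)

    if __name__ == "__main__":
        docker_io_probe(REPO, TAG)
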
>>>>
>>>> So there are likely a few different steps forward on this and we can
>>>> do a few of these together.
>>>>
>>>> 1) stop using mirrors (not ideal but likely makes this go away).
>>>> Alternatively, switch the stable branches off the mirrors (they run
>>>> fewer jobs) and leave the mirrors configured on master only (or
>>>> vice versa).
>>>
>>> Also, the stable/(N-1) branch could use quay.io, while master keeps
>>> using docker.io (assuming containers for that N-1 release will be hosted
>>> there instead of on Docker Hub).
>>>
>>
>> quay has its own limits and likely will suffer from a similar problem.
> 
> Right. But lowering the total number of requests sent to each registry 
> could mean we get rate limited less often by either of the two.
> 
>>
>>>> 2) reduce the number of jobs
>>>> 3) stop querying the registry for the update filters (I'm looking into
>>>> this today) and use the information in tripleo-common first (rough idea
>>>> sketched below).
>>>> 4) build containers always instead of fetching from docker.io
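
On 3), the rough idea could be to resolve the CI update filter from the
image list shipped with tripleo-common instead of listing tags from the
registry. A sketch only; the YAML file name and keys below are illustrative,
not necessarily the exact tripleo-common layout:

    import fnmatch
    import yaml

    def containers_to_update(images_file, patterns):
        """Return image names matching any CI filter pattern, fully offline."""
        with open(images_file) as f:
            data = yaml.safe_load(f)
        names = [entry["imagename"] for entry in data.get("container_images", [])]
        return [name for name in names
                if any(fnmatch.fnmatch(name, pattern) for pattern in patterns)]

    # e.g. containers_to_update("container_images.yaml", ["*nova*", "*neutron*"])
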
> 
> There may be a middle-ground solution: build the containers only once for 
> each patchset executed in the TripleO Zuul pipelines. Transient images 
> that carry a TTL and self-expire, like [0], could be used for that purpose.
> 
> [0] 
> https://idbs-engineering.com/containers/2019/08/27/auto-expiry-quayio-tags.html 
> 
> 
> That would require dependent Zuul jobs to pass ansible variables to each 
> other based on their execution results. Can that be done?

...or even simpler than that, predictable names can be created for those 
transient images, like <namespace>/<tag>_<patchset>
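
A minimal sketch of that, assuming a quay.io namespace we would control and
using the auto-expiry label described in [0]; the namespace, env var names
and the 2d TTL are placeholders:

    import os
    import subprocess

    QUAY_NAMESPACE = "quay.io/tripleo-transient"   # placeholder namespace

    def transient_tag(image_name):
        # Same change + patchset -> same tag, so dependent jobs can
        # reconstruct it without any cross-job variable passing.  Env var
        # names are illustrative; the values come from zuul.change and
        # zuul.patchset in Zuul v3.
        return "%s/%s:%s_%s" % (QUAY_NAMESPACE, image_name,
                                os.environ["ZUUL_CHANGE"],
                                os.environ["ZUUL_PATCHSET"])

    def build_and_push(image_name, context_dir):
        tag = transient_tag(image_name)
        # quay.io expires the tag on its own when this label is present,
        # so no separate cleanup job is needed.
        subprocess.check_call(["podman", "build",
                               "--label", "quay.expires-after=2d",
                               "-t", tag, context_dir])
        subprocess.check_call(["podman", "push", tag])
        return tag
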

> 
> Pretty much like what we already have set up in TripleO, where tox jobs 
> are a dependency for the standalone/multinode jobs, but with an extra 
> step that prepares such a transient pack of container images (only to be 
> used for that patchset) and pushes it to a quay registry hosted elsewhere 
> by TripleO devops folks.
> 
> Then the jobs that have that dependency met can use those transient 
> images via an ansible variable passed to the jobs. Auto expiration 
> solves the space/lifecycle requirements for the cloud that will be 
> hosting that registry.
> 
>>>>
>>>> Thanks,
>>>> -Alex
>>>>
>>>>
>>>>
>>>>> Sincerely,
>>>>>       Luke Short
>>>>>
>>>>> On Wed, Aug 5, 2020 at 12:26 PM Wesley Hayutin 
>>>>> <whayutin at redhat.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Jul 29, 2020 at 4:48 PM Wesley Hayutin 
>>>>>> <whayutin at redhat.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Jul 29, 2020 at 4:33 PM Alex Schultz 
>>>>>>> <aschultz at redhat.com> wrote:
>>>>>>>>
>>>>>>>> On Wed, Jul 29, 2020 at 7:13 AM Wesley Hayutin 
>>>>>>>> <whayutin at redhat.com> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Jul 29, 2020 at 2:25 AM Bogdan Dobrelya 
>>>>>>>>> <bdobreli at redhat.com> wrote:
>>>>>>>>>>
>>>>>>>>>> On 7/28/20 6:09 PM, Wesley Hayutin wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jul 28, 2020 at 7:24 AM Emilien Macchi 
>>>>>>>>>>> <emilien at redhat.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>       On Tue, Jul 28, 2020 at 9:20 AM Alex Schultz 
>>>>>>>>>>>       <aschultz at redhat.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>           On Tue, Jul 28, 2020 at 7:13 AM Emilien Macchi
>>>>>>>>>>>           <emilien at redhat.com> wrote:
>>>>>>>>>>>            >
>>>>>>>>>>>            >
>>>>>>>>>>>            >
>>>>>>>>>>>            > On Mon, Jul 27, 2020 at 5:27 PM Wesley Hayutin
>>>>>>>>>>>           <whayutin at redhat.com> wrote:
>>>>>>>>>>>            >>
>>>>>>>>>>>            >> FYI...
>>>>>>>>>>>            >>
>>>>>>>>>>>            >> If you find your jobs are failing with an error 
>>>>>>>>>>> similar to
>>>>>>>>>>>           [1], you have been rate limited by docker.io
>>>>>>>>>>>           via the upstream mirror system and have hit [2].  
>>>>>>>>>>> I've been
>>>>>>>>>>>           discussing the issue w/ upstream infra, rdo-infra 
>>>>>>>>>>> and a few CI
>>>>>>>>>>>           engineers.
>>>>>>>>>>>            >>
>>>>>>>>>>>            >> There are a few ways to mitigate the issue 
>>>>>>>>>>> however I don't
>>>>>>>>>>>           see any of the options being completed very quickly 
>>>>>>>>>>> so I'm
>>>>>>>>>>>           asking for your patience while this issue is 
>>>>>>>>>>> socialized and
>>>>>>>>>>>           resolved.
>>>>>>>>>>>            >>
>>>>>>>>>>>            >> For full transparency we're considering the 
>>>>>>>>>>> following options.
>>>>>>>>>>>            >>
>>>>>>>>>>>            >> 1. move off of docker.io to quay.io
>>>>>>>>>>>            >
>>>>>>>>>>>            >
>>>>>>>>>>>            > quay.io also has an API rate limit:
>>>>>>>>>>>            > https://docs.quay.io/issues/429.html
>>>>>>>>>>>            >
>>>>>>>>>>>            > Now I'm not sure how many requests per second one
>>>>>>>>>>>           registry allows vs the other, but this would need to be checked 
>>>>>>>>>>> with the quay
>>>>>>>>>>>           team before changing anything.
>>>>>>>>>>>            > Also quay.io had its big downtimes as well, so the
>>>>>>>>>>>           SLA needs to be considered.
>>>>>>>>>>>            >
>>>>>>>>>>>            >> 2. local container builds for each job in 
>>>>>>>>>>> master, possibly
>>>>>>>>>>>           ussuri
>>>>>>>>>>>            >
>>>>>>>>>>>            >
>>>>>>>>>>>            > Not convinced.
>>>>>>>>>>>            > You can look at CI logs:
>>>>>>>>>>>            > - pulling / updating / pushing container images 
>>>>>>>>>>> from
>>>>>>>>>>>           docker.io to local registry 
>>>>>>>>>>> takes ~10 min on
>>>>>>>>>>>           standalone (OVH)
>>>>>>>>>>>            > - building containers from scratch with updated 
>>>>>>>>>>> repos and
>>>>>>>>>>>           pushing them to local registry takes ~29 min on 
>>>>>>>>>>> standalone (OVH).
>>>>>>>>>>>            >
>>>>>>>>>>>            >>
>>>>>>>>>>>            >> 3. parent/child jobs upstream, where the parent
>>>>>>>>>>>           builds rpms and containers and hosts the artifacts for
>>>>>>>>>>>           the child jobs
>>>>>>>>>>>            >
>>>>>>>>>>>            >
>>>>>>>>>>>            > Yes, we need to investigate that.
>>>>>>>>>>>            >
>>>>>>>>>>>            >>
>>>>>>>>>>>            >> 4. remove some portion of the upstream jobs to 
>>>>>>>>>>> lower the
>>>>>>>>>>>           impact we have on 3rd party infrastructure.
>>>>>>>>>>>            >
>>>>>>>>>>>            >
>>>>>>>>>>>            > I'm not sure I understand this one, maybe you 
>>>>>>>>>>> can give an
>>>>>>>>>>>           example of what could be removed?
>>>>>>>>>>>
>>>>>>>>>>>           We need to re-evaluate our use of scenarios (e.g. we
>>>>>>>>>>>           have two scenario010's, both non-voting).  There's a 
>>>>>>>>>>> reason we
>>>>>>>>>>>           historically
>>>>>>>>>>>           didn't want to add more jobs because of these types 
>>>>>>>>>>> of resource
>>>>>>>>>>>           constraints.  I think we've added new jobs recently 
>>>>>>>>>>> and likely
>>>>>>>>>>>           need to
>>>>>>>>>>>           reduce what we run. Additionally we might want to 
>>>>>>>>>>> look into reducing
>>>>>>>>>>>           what we run on stable branches as well.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>       Oh... removing jobs (I thought we would remove some 
>>>>>>>>>>> steps of the jobs).
>>>>>>>>>>>       Yes big +1, this should be a continuous goal when 
>>>>>>>>>>> working on CI, and
>>>>>>>>>>>       always evaluating what we need vs what we run now.
>>>>>>>>>>>
>>>>>>>>>>>       We should look at:
>>>>>>>>>>>       1) services deployed in scenarios that aren't worth 
>>>>>>>>>>> testing (e.g.
>>>>>>>>>>>       deprecated or unused things) (and deprecate the unused 
>>>>>>>>>>> things)
>>>>>>>>>>>       2) jobs themselves (I don't have any example beside 
>>>>>>>>>>> scenario010 but
>>>>>>>>>>>       I'm sure there are more).
>>>>>>>>>>>       --
>>>>>>>>>>>       Emilien Macchi
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks Alex, Emilien
>>>>>>>>>>>
>>>>>>>>>>> +1 to reviewing the catalog and adjusting things on an 
>>>>>>>>>>> ongoing basis.
>>>>>>>>>>>
>>>>>>>>>>> All.. it looks like the issues with docker.io were more of a
>>>>>>>>>>> flare-up than a change in docker.io policy or infrastructure [2].
>>>>>>>>>>> The flare-up started on July 27 at 08:00 UTC and ended on July 27
>>>>>>>>>>> at 17:00 UTC, see screenshots.
>>>>>>>>>>
>>>>>>>>>> The number of image prepare workers and their exponential fallback
>>>>>>>>>> intervals should also be adjusted. I've analysed the log 
>>>>>>>>>> snippet [0] for
>>>>>>>>>> the connection reset counts by workers versus the times the rate
>>>>>>>>>> limiting was triggered. See the details in the reported bug [1].
>>>>>>>>>>
>>>>>>>>>> tl;dr -- for an example 5 sec interval 03:55:31,379 - 
>>>>>>>>>> 03:55:36,110:
>>>>>>>>>>
>>>>>>>>>> Connection reset counts per worker PID (count  PID):
>>>>>>>>>>          3 58412
>>>>>>>>>>          2 58413
>>>>>>>>>>          3 58415
>>>>>>>>>>          3 58417
>>>>>>>>>>
>>>>>>>>>> which looks like too many (workers * reconnects) and triggers rate
>>>>>>>>>> limiting immediately.
>>>>>>>>>>
>>>>>>>>>> [0]
>>>>>>>>>> https://13b475d7469ed7126ee9-28d4ad440f46f2186fe3f98464e57890.ssl.cf1.rackcdn.com/741228/6/check/tripleo-ci-centos-8-undercloud-containers/8e47836/logs/undercloud/var/log/tripleo-container-image-prepare.log 
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> [1] https://bugs.launchpad.net/tripleo/+bug/1889372
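
The kind of fallback adjustment meant here could be an exponential backoff
with jitter, so that several workers reset at the same moment do not all
retry within the same second and trip the rate limit again. A generic
sketch, not the actual image prepare code:

    import random
    import time

    def fetch_with_backoff(fetch, max_attempts=5, base=2.0, cap=60.0):
        """Call fetch(); on connection errors wait base * 2**attempt seconds
        (randomized, capped at `cap`) before trying again."""
        for attempt in range(max_attempts):
            try:
                return fetch()
            except OSError:   # connection reset and friends
                if attempt == max_attempts - 1:
                    raise
                delay = min(cap, base * (2 ** attempt))
                time.sleep(delay * random.uniform(0.5, 1.5))
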
>>>>>>>>>>
>>>>>>>>>> -- 
>>>>>>>>>> Best regards,
>>>>>>>>>> Bogdan Dobrelya,
>>>>>>>>>> Irc #bogdando
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> FYI..
>>>>>>>>>
>>>>>>>>> The issue w/ "too many requests" is back.  Expect delays and 
>>>>>>>>> failures in attempting to merge your patches upstream across 
>>>>>>>>> all branches.   The issue is being tracked as critical.
>>>>>>>>
>>>>>>>> Working with the infra folks, we have identified the authorization
>>>>>>>> header as causing issues when we're redirected from docker.io to
>>>>>>>> cloudflare. I'll throw up a patch tomorrow to handle this case 
>>>>>>>> which
>>>>>>>> should improve our usage of the cache.  It needs some testing 
>>>>>>>> against
>>>>>>>> other registries to ensure that we don't break authenticated 
>>>>>>>> fetching
>>>>>>>> of resources.
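
Presumably that handling looks roughly like the sketch below (not the
actual tripleo-common patch): follow the registry's redirect manually and
keep the Authorization header only while the request stays on the same
registry host, so the blob request the mirror/CDN sees is anonymous and
therefore cacheable.

    from urllib.parse import urlparse
    import requests

    def fetch_blob(url, token, session=None):
        session = session or requests.Session()
        resp = session.get(url,
                           headers={"Authorization": "Bearer %s" % token},
                           allow_redirects=False, stream=True)
        if resp.status_code in (301, 302, 303, 307, 308):
            redirect = resp.headers["Location"]
            headers = {}
            if urlparse(redirect).netloc == urlparse(url).netloc:
                # Same registry host: keep auth so authenticated pulls work.
                headers["Authorization"] = "Bearer %s" % token
            resp = session.get(redirect, headers=headers, stream=True)
        resp.raise_for_status()
        return resp
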
>>>>>>>>
>>>>>>> Thanks Alex!
>>>>>>
>>>>>>
>>>>>>
>>>>>> FYI.. the container pull issue, "too many requests", has come back.
>>>>>> Alex has some fresh patches on it: 
>>>>>> https://review.opendev.org/#/q/status:open+project:openstack/tripleo-common+topic:bug/1889122 
>>>>>>
>>>>>>
>>>>>> expect trouble in check and gate:
>>>>>> http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22429%20Client%20Error%3A%20Too%20Many%20Requests%20for%20url%3A%5C%22%20AND%20voting%3A1 
>>>>>>
>>>>>>
>>>>
>>>
>>>
>>> -- 
>>> Best regards,
>>> Bogdan Dobrelya,
>>> Irc #bogdando
>>>
>>
> 
> 


-- 
Best regards,
Bogdan Dobrelya,
Irc #bogdando



